# RAPIDS cuDF

[Source](https://colab.research.google.com/github/ritchieng/deep-learning-wizard/blob/master/docs/machine_learning/gpu/rapids_cudf.ipynb)

My version (last updated 2021-02-10): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nonjosh/cudf-test/blob/main/Copy_of_rapids_cudf.ipynb)

## Environment Setup

### Check Version

#### Python Version

In [1]:
# Check Python Version
!python --version

Python 3.6.9


#### Ubuntu Version

In [2]:
# Check Ubuntu Version
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic


#### Check CUDA Version

In [3]:
# Check the CUDA toolkit version (nvcc does not report cuDNN) and where nvcc is installed
!nvcc -V && which nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda/bin/nvcc


#### Check GPU Version

In [4]:
# Check GPU
!nvidia-smi

Wed Feb 10 13:55:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

#### Check That You Have the Right GPU (T4)
Many thanks to the NVIDIA team for this snippet, which verifies the GPU type automatically.

In [5]:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)

if device_name != b'Tesla T4':
  raise Exception("""
    Unfortunately this instance does not have a T4 GPU.

    Please make sure you've configured Colab to request a GPU instance type.

    Sometimes Colab allocates a Tesla K80 instead of a T4.
    If you get a K80 GPU, try Runtime -> Reset all runtimes...
  """)
else:
  print('Woo! You got the right kind of GPU!')

Woo! You got the right kind of GPU!
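The comparison above assumes that `nvmlDeviceGetName` returns `bytes`. As an assumption worth verifying against your installed bindings, some newer pynvml releases return `str` instead, in which case a comparison against `b'Tesla T4'` would never match. A minimal sketch of a check that tolerates both; the helper names `normalize_device_name` and `is_t4` are hypothetical and not part of pynvml:

```python
def normalize_device_name(name):
    """Return the device name as str, whether the binding gave bytes or str."""
    return name.decode("utf-8") if isinstance(name, bytes) else str(name)


def is_t4(name):
    """True if the (bytes or str) device name mentions a T4."""
    return "T4" in normalize_device_name(name)


try:
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("T4 detected:", is_t4(pynvml.nvmlDeviceGetName(handle)))
except Exception as exc:  # pynvml missing, no driver, or no GPU attached
    print(f"GPU check skipped: {exc}")
```

Keeping the name handling in a small pure function means it can be exercised on a machine without a GPU.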


### Installation of cuDF/cuML
Many thanks to the NVIDIA team for this script, which sets up everything automatically.

In [6]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh s

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 190 (delta 8), reused 0 (delta 0), pack-reused 171
Receiving objects: 100% (190/190), 58.54 KiB | 14.63 MiB/s, done.
Resolving deltas: 100% (70/70), done.
PLEASE READ
********************************************************************************************************
Changes:
1. IMPORTANT CHANGES: RAPIDS on Colab will be pegged to 0.14 Stable until further notice.  This version of RAPIDS, while works, is outdated.  We have alternative solutions, https://app.blazingsql.com, to run the latest versions of RAPIDS
2. Default stable version is now 0.14.  Nightly will redirect to 0.14.
3. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.14, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <ve

In [7]:
import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

***********************************************************************
Let us check on those pyarrow and cffi versions...
***********************************************************************

You're don't have pyarrow.
unloaded cffi 1.14.4
loaded cffi 1.14.4


In [8]:
# set environment vars
import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

In [9]:
# !cp /usr/local/lib/libcudf.so .
# !cp /usr/local/lib/librmm.so .

In [10]:
# import shutil
# # copy .so files to current working dir
# for fn in ['libcudf.so', 'librmm.so']:
#   shutil.copy('/usr/local/lib/'+fn, os.getcwd())

In [11]:
!python --version

Python 3.6.12


## Critical Imports

In [12]:
# Critical imports
import nvstrings, nvcategory, cudf
import cuml
import os
import numpy as np
import pandas as pd

Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
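The warnings above say that the `NUMBAPRO_*` variables set in cell 8 are ignored. A minimal sketch of the alternative, assuming the toolkit root is `/usr/local/cuda` (inferred from the `which nvcc` output earlier) and relying on `CUDA_HOME`, which Numba's CUDA toolkit lookup documentation lists as one of the locations it checks. It would need to run before `cudf` or `numba` is imported:

```python
import os

# Assumption: CUDA toolkit root, inferred from /usr/local/cuda/bin/nvcc above
cuda_home = '/usr/local/cuda'

# Drop the deprecated variables so the warnings are not triggered
os.environ.pop('NUMBAPRO_NVVM', None)
os.environ.pop('NUMBAPRO_LIBDEVICE', None)

# Only set CUDA_HOME if the environment has not already defined it
os.environ.setdefault('CUDA_HOME', cuda_home)
print(os.environ['CUDA_HOME'])
```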
  
  


## Creating

### Create a Series of integers

In [13]:
gdf = cudf.Series([1, 2, 3, 4, 5, 6])
print(gdf)
print(type(gdf))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
<class 'cudf.core.series.Series'>


### Create a Series of floats

In [14]:
gdf = cudf.Series([1., 2., 3., 4., 5., 6.])
print(gdf)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64


### Create a Series of strings


In [15]:
gdf = cudf.Series(['a', 'b', 'c'])
print(gdf)

0    a
1    b
2    c
dtype: object


### Create a 3-Column DataFrame
- Consisting of dates, integers and floats

In [16]:
df = cudf.DataFrame()
df['key'] = [0, 1, 2, 3, 4]
df['val'] = [float(i + 10) for i in range(5)]  # insert column
df

   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0


In [17]:
# Import
import datetime as dt

gdf = cudf.DataFrame({
  # Create 10 business dates starting from 1 January 2019 via pandas
  'dates': pd.date_range('1/1/2019', periods=10, freq='B'),
  # Integers
  'integers': [i for i in range(10)],
  # Floats
  'floats': [float(i) for i in range(10)],
})

# Print dataframe
gdf

       dates  integers  floats
0 2019-01-01         0     0.0
1 2019-01-02         1     1.0
2 2019-01-03         2     2.0
3 2019-01-04         3     3.0
4 2019-01-07         4     4.0
5 2019-01-08         5     5.0
6 2019-01-09         6     6.0
7 2019-01-10         7     7.0
8 2019-01-11         8     8.0
9 2019-01-14         9     9.0
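The output above jumps from 2019-01-04 to 2019-01-07 because `freq='B'` generates business days only. A minimal sketch that confirms no weekend dates are produced and inspects the resulting column types; the `xdf` alias, the pandas fallback and the name `frame` are assumptions for illustration:

```python
import pandas as pd

try:
    import cudf as xdf  # GPU path, as used in this notebook
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in with the same constructor

# Same date range as above: 10 business days from 1 January 2019
dates = pd.date_range('1/1/2019', periods=10, freq='B')

frame = xdf.DataFrame({
    'dates': dates,
    'integers': list(range(10)),
    'floats': [float(i) for i in range(10)],
})

# dayofweek runs Monday=0 ... Sunday=6, so business days are all below 5
print(int(dates.dayofweek.max()))
print(frame.dtypes)
```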


### Create a 2-Column DataFrame
- Consisting of integers and strings

In [18]:
gdf = cudf.DataFrame({
    'integers': [1, 2, 3, 4],
    'string': ['a', 'b', 'c', 'd'],
})

print(gdf)

   integers string
0         1      a
1         2      b
2         3      c
3         4      d


### Create a 2-Column DataFrame with Pandas Bridge
- Consisting of integers and string category
- Convert every string column to type `category`; otherwise the filtering functions do not work intuitively (for now)

In [19]:
# Create pandas dataframe
pandas_df = pd.DataFrame({
    'integers': [1, 2, 3, 4], 
    'strings': ['a', 'b', 'c', 'd']
})

# Convert string column to category format
pandas_df['strings'] = pandas_df['strings'].astype('category')

# Bridge from pandas to cudf
gdf = cudf.DataFrame.from_pandas(pandas_df)

# Print dataframe
print(gdf)

   integers strings
0         1       a
1         2       b
2         3       c
3         4       d


## Viewing

### Printing Column Names

In [20]:
gdf.columns

Index(['integers', 'strings'], dtype='object')

### Viewing Top of DataFrame

In [21]:
num_of_rows_to_view = 2 
print(gdf.head(num_of_rows_to_view))

   integers strings
0         1       a
1         2       b


### Viewing Bottom of DataFrame

In [22]:
num_of_rows_to_view = 3 
print(gdf.tail(num_of_rows_to_view))

   integers strings
1         2       b
2         3       c
3         4       d


## Filtering

### Method 1: Query

#### Filtering Integers/Floats by Column Values
- This only works for floats and integers, not for strings

In [23]:
print(gdf.query('integers == 1'))

   integers strings
0         1       a


#### Filtering Strings by Column Values
- `query` only works for floats and integers, not for strings, so this raises an error!

In [24]:
try:
  gdf.query('strings == a')
except Exception as e:
  print('an error has occurred')

an error has occurred


### Method 2: Simple Columns

#### Filtering Strings by Column Values


In [25]:
# Filtering based on the string column
print(gdf[gdf.strings == 'b'])

   integers strings
1         2       b


#### Filtering Integers/Floats by Column Values

In [26]:
# Filtering based on the integer column
print(gdf[gdf.integers == 2])

   integers strings
1         2       b


### Method 2: Simple Rows

#### Filtering by Row Numbers

In [27]:
# Select rows 0 and 1 (the slice end, index 2, is excluded)
print(gdf[0:2])

   integers strings
0         1       a
1         2       b


### Method 3: loc[rows, columns]

In [28]:
# The syntax is loc[rows, columns], letting you choose rows and columns by label
# Unlike plain slicing, a loc slice includes its end label, so 0:2 returns the first 3 rows of the integers column
print(gdf.loc[0:2, ['integers']])

   integers
0         1
1         2
2         3
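The same slice `0:2` gives different row counts depending on the accessor: `loc` selects by label and includes the end, while `iloc` selects by position and excludes it, like ordinary Python slicing. A minimal sketch of the contrast; the `xdf` alias, the pandas fallback and the name `frame` are assumptions for illustration:

```python
try:
    import cudf as xdf  # GPU path, as used in this notebook
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in with the same API

frame = xdf.DataFrame({
    'integers': [1, 2, 3, 4],
    'strings': ['a', 'b', 'c', 'd'],
})

by_label = frame.loc[0:2, ['integers']]  # labels 0, 1 and 2
by_position = frame.iloc[0:2]            # positions 0 and 1 only

print(len(by_label), len(by_position))
```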
