<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/rapids-cuDF-cust-fun-02-.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setup:

1. Use pynvml to confirm Colab allocated you a Tesla T4 GPU.
2. Install most recent Miniconda release compatible with Google Colab's Python install  (3.6.7)
3. Install RAPIDS libraries
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions
5. Update env variables so Python can find and use RAPIDS artifacts

All of the above steps are automated in the next cell.

You should re-run this cell any time your instance re-starts.

In [1]:
!nvidia-smi

Wed Sep 11 05:30:01 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    28W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh

import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

--2019-09-11 05:30:18--  https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
Resolving github.com (github.com)... 52.74.223.119
Connecting to github.com (github.com)|52.74.223.119|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/rapidsai/notebooks-contrib/raw/master/utils/rapids-colab.sh [following]
--2019-09-11 05:30:19--  https://github.com/rapidsai/notebooks-contrib/raw/master/utils/rapids-colab.sh
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh [following]
--2019-09-11 05:30:19--  https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.

# RAPIDS Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

## Required Imports

In [0]:
import cudf
import pandas as pd
import numpy as np
import math
import os

# cuDF Series

---
Creating a cudf.Series

In [4]:
s = cudf.Series([1, 2, 3, None, 4])
print(s)

0       1
1       2
2       3
3    null
4       4
dtype: int64


Creating a cudf dataframe

In [5]:
df = cudf.DataFrame([('a', list(range(20))), ('b', list(reversed(range(20)))), ('c', list(range(30, 50)))])
df.head(5)

Unnamed: 0,a,b,c
0,0,19,30
1,1,18,31
2,2,17,32
3,3,16,33
4,4,15,34


Create a cudf dataframe from a pd.dataframe

In [6]:
pdf = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [0.0, 0.1, None, 0.2]})
print('Panda df:', pdf.info())

cdf = cudf.DataFrame.from_pandas(pdf)
cdf.head(2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
a    4 non-null int64
b    3 non-null float64
dtypes: float64(1), int64(1)
memory usage: 144.0 bytes
Panda df: None


Unnamed: 0,a,b
0,0,0.0
1,1,0.1


## cuDF Dataframe Basics

Object Creation

In [7]:
pdf = pd.DataFrame({'a': [0, 1, 2, 3, 4],'b': [0.1, 0.2, None, 0.3, 0.5]})
cdf = cudf.DataFrame.from_pandas(pdf)
print(cdf)

   a     b
0  0   0.1
1  1   0.2
2  2  null
3  3   0.3
4  4   0.5


Sorting values

In [8]:
pdf.sort_values(by='b')
cdf.sort_values(by='b')

Unnamed: 0,a,b
0,0,0.1
1,1,0.2
3,3,0.3
4,4,0.5
2,2,


Selection

In [9]:
pdf['a'].head(3)
cdf['a'].head(3)

0    0
1    1
2    2
Name: a, dtype: int64

Selection by label

In [10]:
cdf.loc[:1, ['a', 'b']]

Unnamed: 0,a,b
0,0,0.1
1,1,0.2


Selection by position

In [11]:
pdf.iloc[0]

a    0.0
b    0.1
Name: 0, dtype: float64

In [12]:
print(cdf.iloc[0])
print(cdf.iloc[1:3])

a    0.0
b    0.1
Name: 0, dtype: float64
   a     b
1  1   0.2
2  2  null


### Boolean Indexing

Selecting rows in a DataFrame or Series by boolean indexing.

In [13]:
print(cdf[cdf.a > 1.5])

   a     b
2  2  null
3  3   0.3
4  4   0.5


In [14]:
print(cdf.query("b >= 0.3 and b <= 0.5"))

   a    b
3  3  0.3
4  4  0.5


In [15]:
val = 2
cdf.query("a != @val")

Unnamed: 0,a,b
0,0,0.1
1,1,0.2
3,3,0.3
4,4,0.5


### Missing Data

In [16]:
print(cdf.fillna(0))

   a    b
0  0  0.1
1  1  0.2
2  2  0.0
3  3  0.3
4  4  0.5


### Operations

In [17]:
s = cudf.Series(np.arange(10)).astype(np.float32)
print(s.mean(), s.var())

4.5 9.166666666666668


In [0]:
def add_ten(num):
  return num + 10

# s.applymap(add_ten)

In [0]:
def complex_math_transform(num):
    return math.cos(num) * 3 / 9

# print(s.applymap(complex_math_transform))

### String Methods

In [20]:
s = cudf.Series(['A', 'B', 'C', 'AaBb', 'Baca', None, 'CABA', 'dog', 'cat'])
s.str.lower().head(3)

0    a
1    b
2    c
dtype: object

### Concat

In [21]:
s = cudf.Series([1, 2, 3, None, 5])
cudf.concat([s, s])

0       1
1       2
2       3
3    null
4       5
0       1
1       2
2       3
3    null
4       5
dtype: int64

### Append

In [22]:
s.append(s)

0       1
1       2
2       3
3    null
4       5
0       1
1       2
2       3
3    null
4       5
dtype: int64

### Grouping

Like pandas, cuDF supports the Split-Apply-Combine grouping paradigm.

In [23]:
cdf['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(cdf))]
cdf

Unnamed: 0,a,b,agg_col1
0,0,0.1,1
1,1,0.2,0
2,2,,1
3,3,0.3,0
4,4,0.5,1


In [24]:
cdf.groupby('agg_col1').sum()

Unnamed: 0_level_0,a,b
agg_col1,Unnamed: 1_level_1,Unnamed: 2_level_1
0,4,0.5
1,6,0.6


In [25]:
cdf.groupby('agg_col1').agg({'a':'max', 'b':'mean'})

Unnamed: 0_level_0,a,b
agg_col1,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3,0.25
1,4,0.3


### Time Series

In [26]:
date_df = cudf.DataFrame()
date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')
date_df['value'] = np.random.sample(len(date_df))
date_df.head()

Unnamed: 0,date,value
0,2018-11-20,0.109395
1,2018-11-21,0.840001
2,2018-11-22,0.442165
3,2018-11-23,0.839554
4,2018-11-24,0.548335


In [27]:
date_df['minute'] = date_df.date.dt.minute
date_df.head()

Unnamed: 0,date,value,minute
0,2018-11-20,0.109395,0
1,2018-11-21,0.840001,0
2,2018-11-22,0.442165,0
3,2018-11-23,0.839554,0
4,2018-11-24,0.548335,0


### Converting data representation

To Pandas

In [28]:
df.head().to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a    5 non-null int64
b    5 non-null int64
c    5 non-null int64
dtypes: int64(3)
memory usage: 200.0 bytes


To Numpy

In [29]:
df.as_matrix()[:3]

array([[ 0, 19, 30],
       [ 1, 18, 31],
       [ 2, 17, 32]])

Converting a cuDF series to a numpy ndarray

In [30]:
df['a'].to_array()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

### Getting data In/Out

Writting to a CSV file, by first sending data to a Pandas dataframe on the host

In [0]:
if not os.path.exists('example_output'):
  os.mkdir('example_output')
  
df.to_pandas().to_csv('example_output/foo.csv', index=False)

Reading from a csv file

In [32]:
df = cudf.read_csv('example_output/foo.csv')
df.head()

Unnamed: 0,a,b,c
0,0,19,30
1,1,18,31
2,2,17,32
3,3,16,33
4,4,15,34


### Performance

One of the primary reasons to use cuDF over pandas is performance. For some workflows, the GPU can be much faster than the CPU. 

In [0]:
a = np.random.rand(10000000)

In [0]:
pdf = pd.DataFrame()
cdf = cudf.DataFrame()

In [35]:
%time pdf['a'] = a
%time cdf['a'] = a

CPU times: user 469 ms, sys: 349 ms, total: 817 ms
Wall time: 837 ms
CPU times: user 20.1 ms, sys: 0 ns, total: 20.1 ms
Wall time: 20.2 ms


In [36]:
%%timeit
pdf['a'].sum()

10 loops, best of 3: 59.3 ms per loop


In [37]:
%%timeit
cdf['a'].sum()

1000 loops, best of 3: 524 µs per loop


In [38]:
%time pdf['a'].sum()
%time cdf['a'].sum()

CPU times: user 63.4 ms, sys: 4.04 ms, total: 67.5 ms
Wall time: 73.3 ms
CPU times: user 1.07 ms, sys: 11 µs, total: 1.08 ms
Wall time: 981 µs


5000387.719516846

### Use case: Sensor data analytics

To get a more realistic sense of how powerful cuDF and GPUs can be, let's imagine you had a fleet of sensors that collect data every millisecond. These sensors could be measuring preasure, temprerature, or something else entirely.

Let's image we want to analysize one day's worth of sensor data. We'll assign random values for teh sensor value to use for this example.

In [39]:
%%time

date_df = pd.DataFrame()
date_df['date'] = pd.date_range(start='2019-07-05', end='2019-07-06', freq='ms')
date_df['value'] = np.random.sample(len(date_df))

date_df['hour'] = date_df.date.dt.hour
date_df['minute'] = date_df.date.dt.minute

date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86400001 entries, 0 to 86400000
Data columns (total 4 columns):
date      datetime64[ns]
value     float64
hour      int64
minute    int64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 2.6 GB
CPU times: user 11.1 s, sys: 5.4 s, total: 16.5 s
Wall time: 16.5 s


In [40]:
date_df.head(5)

Unnamed: 0,date,value,hour,minute
0,2019-07-05 00:00:00.000,0.044373,0,0
1,2019-07-05 00:00:00.001,0.345961,0,0
2,2019-07-05 00:00:00.002,0.398337,0,0
3,2019-07-05 00:00:00.003,0.08697,0,0
4,2019-07-05 00:00:00.004,0.188798,0,0


In [41]:
%time results = date_df.groupby(['hour', 'minute']).agg({'value': 'max'})
results.head()

CPU times: user 2.78 s, sys: 13.5 ms, total: 2.8 s
Wall time: 2.82 s


Unnamed: 0_level_0,Unnamed: 1_level_0,value
hour,minute,Unnamed: 2_level_1
0,0,0.999991
0,1,0.999976
0,2,0.999995
0,3,0.999974
0,4,0.999985


In [42]:
%%time

cu_df = cudf.DataFrame()
cu_df['date'] = pd.date_range(start='2019-07-05', end='2019-07-06', freq='ms')
cu_df['value'] = np.random.sample(len(date_df))

cu_df['hour'] = cu_df.date.dt.hour
cu_df['minute'] = cu_df.date.dt.minute
cu_df['second'] = cu_df.date.dt.second

print(cu_df.shape)

(86400001, 5)
CPU times: user 1.23 s, sys: 21.5 ms, total: 1.25 s
Wall time: 1.26 s


In [43]:
%time results = cu_df.groupby(['hour', 'minute', 'second']).agg({'value': 'max'})
results.head()

CPU times: user 82.3 ms, sys: 62.4 ms, total: 145 ms
Wall time: 147 ms


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,value
hour,minute,second,Unnamed: 3_level_1
0,0,0,0.999488
0,0,1,0.999413
0,0,2,0.99925
0,0,3,0.999582
0,0,4,0.998248


#[cuDF](https://github.com/rapidsai/cudf)#

Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

In [0]:
tips_df = cudf.read_csv("https://github.com/plotly/datasets/raw/master/tips.csv")
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
1    21.729202
2    16.571919
3    15.215685
4    14.594901
5    14.149549
6    15.622920
Name: tip_percentage, dtype: float64


#[cuML](https://github.com/rapidsai/cuml)

This snippet does label and one-hot encoding of the tips dataset's categorical features and applies standard scaling to all columns. All operations run on the GPU.

In [0]:
import cuml

# label encode the categorical features of the tips dataset
for col in ['sex', 'smoker', 'day', 'time']:
  le = cuml.preprocessing.LabelEncoder()
  tips_df[col] = le.fit_transform(tips_df[col])

# day and time are non-binary categorical features, one-hot-encode them
tips_df = cudf.get_dummies(tips_df, columns=['day', 'time'])

# do standard scaling on all columns
for col in tips_df.columns:
  tips_df[col] = (tips_df[col] - tips_df[col].mean())/tips_df[col].std()

# inspect the results
tips_df.head().to_pandas()

Unnamed: 0,total_bill,tip,sex,smoker,size,tip_percentage,day_0,day_1,day_2,day_3,time_0,time_1
0,-0.314066,-1.436993,-1.340598,-0.783179,-0.598961,-1.659607,-0.289997,-0.742879,1.483734,-0.582463,0.620307,-0.620307
1,-1.061054,-0.967217,0.742879,-0.783179,0.452453,-0.004274,-0.289997,-0.742879,1.483734,-0.582463,0.620307,-0.620307
2,0.137497,0.36261,0.742879,-0.783179,0.452453,0.09472,-0.289997,-0.742879,1.483734,-0.582463,0.620307,-0.620307
3,0.437416,0.225291,0.742879,-0.783179,-0.598961,-0.344218,-0.289997,-0.742879,1.483734,-0.582463,0.620307,-0.620307
4,0.539635,0.442111,-1.340598,-0.783179,1.503867,-0.229154,-0.289997,-0.742879,1.483734,-0.582463,0.620307,-0.620307


## K-Nearest Neighbors
Lastly, create a K-Nearest Neighbors model and find the 5 most similar tippers.

In [0]:
# create a KNN model
knn = cuml.NearestNeighbors()
knn.fit(tips_df)

# find 5 nearest neighbors
k = 5
distances_df, indices_df = knn.kneighbors(tips_df, k)
indices_df.head().to_pandas()

Unnamed: 0,0,1,2,3,4
0,0,162,16,12,166
1,1,53,10,152,151
2,2,165,152,160,55
3,3,45,113,49,55
4,4,157,114,11,52


##Determining Feature Importance with XGBoost

Lastly, we can use [XGBoost](https://github.com/dmlc/xgboost)'s GPU accelerated decision trees to determine which features have the greatest impact on tip percentage.

In [0]:
import xgboost as xgb

params = {
  'n_gpus':       1,
  'tree_method':  'gpu_hist',
  'objective':    'reg:squarederror'
}

X_feature_names = ["total_bill", "sex", "smoker", "size",
                   "day_0",    "day_1",    "day_2", "day_3",
                   "time_0",    "time_1"]

X_train = tips_df[X_feature_names]
y_train = cudf.DataFrame({'y': tips_df['tip_percentage']})

# Convert to XGBoost's DMatrix format and train the model
dmatrix_train = xgb.DMatrix(X_train,
                            label=y_train,
                            feature_names=X_train.columns)

bst = xgb.train(params, dmatrix_train)

# See what data is the most important for predicting % tipped
xgb.plot_importance(bst)

TypeError: ignored

#[cuGraph](https://github.com/rapidsai/cugraph)

Like the cuDF snippet above, this code loads a CSV file from a URL, then dives into cugraph to computes the PageRank score for each vertex.  Those scores are then used as weight to compute the Weighted Jaccard Similarity which is used to find the most common nodes in the Epinions dataset.

In [0]:
import cugraph, cudf
import gzip, io, requests
from collections import OrderedDict

# download some data
url="https://snap.stanford.edu/data/soc-Epinions1.txt.gz"
content = gzip.decompress(requests.get(url).content).decode()

cols = ["src", "dst"]
dtypes = OrderedDict([ ("src", "int32"), ("dst", "int32")])
# read the CSV data from memory buffer
gdf = cudf.read_csv(io.StringIO(content), names=cols, delimiter='\t', dtype=list(dtypes.values()), skiprows=4)

# create a Graph 
G = cugraph.Graph()
G.add_edge_list(gdf["src"], gdf["dst"])

# Call Pagerank on the graph to get weights to use:
pr_df = cugraph.pagerank(G)

# find the max page rank value - there could be more than one with the max score
pr_max = pr_df['pagerank'].max()

pr_filtered = pr_df.query('pagerank >= @pr_max')
    
for i in range(len(pr_filtered)):
    print("PageRank: top vertex is " + str(pr_filtered['vertex'][i]) + 
        " with score of " + str(pr_filtered['pagerank'][i]))  

# Call weighted Jaccard using the Pagerank scores as weights:
# https://github.com/rapidsai/cugraph/issues/398
df = cugraph.jaccard_w(G, pr_df['pagerank'])

max_coeff = df['jaccard_coeff'].max()
j_gdf = df.query('jaccard_coeff >= @max_coeff')

for i in range(len(j_gdf)):
    print("Weighted Jaccard Similarity: Vertices " + str(j_gdf['source'][i]) + 
      " and " + str(j_gdf['destination'][i] ) + 
      " are most similar with score: " + str(df['jaccard_coeff'][i]))

PageRank: top vertex is 18 with score of 0.004534927
Weighted Jaccard Similarity: Vertices 22693 and 57123 are most similar with score: 0.26571962


# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-extended