# Introduction to Rapids

<img src="../nb_images/rapids.png" alt="Drawing" style="width: 600px;"/>


Rapids is a suite of data preparation and machine learning libraries designed to take maximum advantage of Nvidia GPUs.  The two libraries used in this lab are called cuDF and cuML, and they borrow much of their design and API semantics from the Pandas and Sklearn Python libraries.  Speedups of over 10x are not uncommon for many everyday tasks.

If you are familiar with Pandas and Sklearn, the code in this lab will look familiar.  Rapids is still under development, so it's not as full-featured as the Pandas and Sklearn libraries, but it is continually getting new functions.  

The following lab will walk you through how to use Rapids with a sample dataset.  **This lab will focus on the performance capabilities of Rapids by comparing it to Pandas and Sklearn equivalent operations.** It is not meant to be a machine learning tutorial. 


## A word on performance comparisons of RAPIDS vs Pandas

Pandas and Numpy are two of the most popular libraries for both data engineers and data scientists.  The libraries are very robust and performant, but one major drawback is that they are largely single-threaded.  When comparing RAPIDS vs Pandas/Numpy, you are seeing the benefit of parallelizing these types of tasks over potentially thousands of separate threads.  

## CuDF basics

Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and manipulating data.

cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

Definitions :
* GPU Dataframe : a dataframe from the RAPIDS cuDF library running on the GPU


### Helper functions

Execute the functions below; they are needed for later parts of the lab.  Note that the **pgdf** function is a convenience function that displays a GPU dataframe in a nice format in a Jupyter notebook.

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
import time
import timeit

from datetime import datetime
import math

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import glob
import os
import sys
sys.path.append('../utils/') 

#dask
import dask
from dask import dataframe as dd

# Rapids
import cudf
from cudf.dataframe import DataFrame as RapidsDataFrame

In [2]:
# [print gpu dataframe] helper function to print GPU dataframes 
def pgdf(gdf) :
    display(gdf.to_pandas())

In [3]:
def time_command(cmd,repeat=1) :
    # timeit returns the total time for `repeat` runs, so divide to get the average per run
    total_runtime = timeit.timeit(cmd, number=repeat)
    return float(total_runtime / repeat)

In [4]:
# Collects CPU and GPU runtimes per test and prints a comparison table.
# Results are stored per test name, e.g. gpu_results["describe"] = [runtime, ...]
# TODO : make display results look better ..
class COMPARE() :
    def __init__(self) :
        self.tests = []
        self.gpu_results = {}
        self.cpu_results = {}
        self.df_shape = (0,0)
        self.df_memory_gb = 0 

    def add_result(self, test_name, gpu_result, runtime) :
        if test_name not in self.tests :
            self.tests.append(test_name)
            self.gpu_results[test_name] = []
            self.cpu_results[test_name] = []
        
        if(gpu_result == "gpu") :
            self.gpu_results[test_name].append(runtime)
        else :
            self.cpu_results[test_name].append(runtime)
            
    def display_results(self) :
        print("Dataframe size : {} {} GB".format(self.df_shape, self.df_memory_gb))
        print("{:<20} {:<20} {:<20} {:<20}".format("test", "CPU(s)", "GPU(s)", "GPU Speedup"))
        for i in self.tests :
            cpu_mean = sum(self.cpu_results[i]) / (len(self.cpu_results[i])+0.00001)
            gpu_mean = sum(self.gpu_results[i]) / (len(self.gpu_results[i])+0.00001)
            su = cpu_mean / (gpu_mean + .00001)
            print("{:<20} {:<20.4f} {:<20.4f} {:<20.2f}".format(i, cpu_mean, gpu_mean, su ))

run_times = COMPARE()
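
The timing helpers above boil down to "average seconds per call" and "CPU mean divided by GPU mean". The following is a minimal CPU-only sketch of that arithmetic. The two workload functions (`slow_sum`, `fast_sum`) are hypothetical stand-ins invented for illustration, not part of the lab.

```python
import timeit
from statistics import mean

def time_command(cmd, repeat=1):
    """Return the average wall-clock seconds per call of `cmd`."""
    total_runtime = timeit.timeit(cmd, number=repeat)
    return total_runtime / repeat

# Hypothetical stand-ins for a slow (CPU) and a fast (GPU) version of one task.
def slow_sum():
    return sum(i * i for i in range(200_000))

def fast_sum():
    return sum(i * i for i in range(20_000))

# Record several measurements per implementation, as the COMPARE class does.
cpu_runs = [time_command(slow_sum, repeat=3) for _ in range(3)]
gpu_runs = [time_command(fast_sum, repeat=3) for _ in range(3)]

# Speedup is the ratio of the mean runtimes.
speedup = mean(cpu_runs) / mean(gpu_runs)
print("speedup: {:.1f}x".format(speedup))
```

Single measurements can be noisy, which is why averaging over a few repeats tends to give a more stable comparison.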


In [5]:
def pca_scree(pca_explained_variance, label) :
    # derive the component count from the input instead of relying on a global variable
    n_components = len(pca_explained_variance)
    # component_idx is my x axis variable (1 .. n_components)
    component_idx = list(range(1, n_components + 1))
    # plot the cumulative variance against the index of PCA
    cum_var = np.cumsum(pca_explained_variance)
    plt.plot(component_idx, cum_var)
    # plot the 95%, 75% and 50% thresholds, so we can read off count of principal components that matter
    plt.plot(component_idx, [.95]*n_components, '--')
    plt.plot(component_idx, [.75]*n_components, '--')
    plt.plot(component_idx, [.50]*n_components, '--')
    #turn on grid to make graph reading easier
    plt.grid(True)
    #plt.rcParams.update({'font.size': 24})
    plt.suptitle(label + ' PCA Variance Explained')
    plt.xlabel('Number of PCA Components', fontsize=18)
    plt.ylabel('Fraction of Variance \nExplained', fontsize=16)
    # one tick mark per component
    plt.xticks(component_idx)
    plt.show()


### Useful DataFrame attributes

When you create a GPU dataframe, a number of attributes and methods are available to help you understand its composition.  The detailed list is found in the Rapids [cuDF documentation](https://docs.rapids.ai/api/cudf/0.7/).

Below we will create a small cuDF dataframe and look at some of its attributes.  A few of these attributes come in handy when debugging:

* dtypes  :  Shows all the columns and associated data types 
* shape   :  Shows the shape (rows, columns) of the dataframe
* columns :  Shows the column names as an Index object


In [6]:
# Create a simple GPU dataframe
df = cudf.DataFrame()
df['column1'] = [0, 1, 2, 3, 4]
df['column2'] = [float(i + 10) for i in range(5)]  # insert column
df['column3'] = ["bbb","aaa","ccc","eee","Ddd"]  # insert column

In [7]:
#Print the dataframe
pgdf(df)

Unnamed: 0,column1,column2,column3
0,0,10.0,bbb
1,1,11.0,aaa
2,2,12.0,ccc
3,3,13.0,eee
4,4,14.0,Ddd


In [8]:
# Dataframe attributes
print("\nDataframe datatypes\n---------------------")
print(df.dtypes)
print("\nDataframe Shape\n---------------------")
print(df.shape)
print("\nDataframe dimensions\n---------------------")
print(df.ndim)
print("\nDataframe Column names\n---------------------")
print(df.columns)


Dataframe datatypes
---------------------
column1      int64
column2    float64
column3     object
dtype: object

Dataframe Shape
---------------------
(5, 3)

Dataframe dimensions
---------------------
2

Dataframe Column names
---------------------
Index(['column1', 'column2', 'column3'], dtype='object')


### Create a cuDF dataframe from Numpy/Pandas array
Rapids cuDF supports the conversion of numpy arrays and pandas dataframes to cuDF dataframes.  The examples below show how to do this for each type.

In [9]:
# Numpy array to cuDF
# Dataframe Operations : Create random large array
a = np.random.rand(100,100)
# from_records builds a GPU dataframe with one column per array column
df = cudf.DataFrame.from_records(a)
pgdf(df.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.643922,0.406495,0.768091,0.827456,0.375351,0.441299,0.357021,0.427143,0.953115,0.83692,0.425673,0.496715,0.413411,0.376688,0.620565,0.777828,0.009143,0.011081,0.121413,0.968677,0.841048,0.798765,0.407379,0.109702,0.294977,0.074615,0.613099,0.902759,0.326078,0.762535,0.963855,0.636146,0.273927,0.157968,0.878545,0.62097,0.908954,0.774619,0.873769,0.98542,0.729114,0.192622,0.944867,0.525507,0.185602,0.706909,0.085511,0.130098,0.415063,0.000856,0.988707,0.25335,0.118799,0.06825,0.155678,0.509284,0.598306,0.208182,0.204673,0.512484,0.77797,0.538131,0.864877,0.356318,0.778711,0.320857,0.144026,0.881326,0.37189,0.838978,0.141676,0.895684,0.896499,0.975226,0.124143,0.975431,0.720021,0.81186,0.533208,0.39537,0.466463,0.475385,0.931976,0.573965,0.851842,0.973617,0.113473,0.86336,0.523547,0.300478,0.924823,0.328691,0.452695,0.64754,0.474907,0.61195,0.39723,0.015335,0.18712,0.588101
1,0.459184,0.691137,0.109199,0.424105,0.477531,0.342865,0.599161,0.775097,0.16812,0.09557,0.472373,0.77134,0.740423,0.971063,0.2046,0.20007,0.354061,0.640974,0.766024,0.341449,0.7165,0.284101,0.497977,0.414689,0.968464,0.784113,0.429164,0.828664,0.142604,0.491521,0.561241,0.48569,0.262527,0.56133,0.308733,0.06913,0.330997,0.470865,0.12065,0.560311,0.844184,0.576078,0.751663,0.015523,0.100736,0.489148,0.982769,0.884616,0.477452,0.819828,0.474865,0.423696,0.374958,0.800076,0.139316,0.389716,0.345226,0.508374,0.75369,0.171217,0.557635,0.434306,0.051729,0.855949,0.890835,0.312814,0.285499,0.240663,0.601322,0.388028,0.383541,0.556913,0.156364,0.93073,0.363188,0.514229,0.679324,0.382753,0.300092,0.298982,0.915696,0.19007,0.065108,0.363754,0.03167,0.395073,0.004909,0.562352,0.083405,0.885717,0.531211,0.469907,0.02174,0.311523,0.342209,0.615499,0.022378,0.453659,0.763322,0.785673
2,0.614662,0.375019,0.429197,0.362987,0.697312,0.114749,0.898591,0.070021,0.875001,0.691731,0.963583,0.892182,0.581788,0.704416,0.632751,0.69314,0.173518,0.641309,0.229931,0.042655,0.055823,0.43157,0.017272,0.816746,0.235548,0.640845,0.71435,0.393939,0.736512,0.896421,0.838074,0.276378,0.197314,0.527279,0.288963,0.89403,0.395224,0.873747,0.620171,0.519485,0.888235,0.179672,0.287926,0.668181,0.41307,0.936173,0.915568,0.332372,0.113485,0.04127,0.636205,0.022553,0.79271,0.080358,0.769556,0.585719,0.692709,0.472287,0.788365,0.92901,0.045187,0.444184,0.73812,0.552693,0.374343,0.279351,0.654335,0.9013,0.090655,0.164813,0.915889,0.296486,0.174448,0.855637,0.359908,0.743609,0.804891,0.238405,0.614967,0.424452,0.678636,0.504339,0.792747,0.691772,0.63683,0.413405,0.931239,0.191181,0.420947,0.54244,0.415905,0.183959,0.009146,0.965081,0.460755,0.61823,0.974766,0.097237,0.196317,0.690642
3,0.121082,0.212826,0.240101,0.597862,0.244645,0.765512,0.761411,0.063018,0.384224,0.365297,0.004145,0.291207,0.548535,0.492832,0.190693,0.594236,0.278526,0.558755,0.415556,0.37882,0.748527,0.897922,0.597823,0.60687,0.219639,0.391225,0.984674,0.012118,0.657554,0.859413,0.297078,0.884288,0.33233,0.519523,0.95599,0.034365,0.88088,0.695227,0.090108,0.90525,0.622922,0.021629,0.855427,0.408925,0.692249,0.795905,0.076073,0.36004,0.744574,0.727704,0.849569,0.557878,0.143991,0.918917,0.762466,0.321734,0.728866,0.834429,0.582642,0.932724,0.846642,0.453784,0.43806,0.79622,0.94907,0.615866,0.30004,0.153547,0.253677,0.806384,0.442208,0.799623,0.425608,0.845719,0.168133,0.222194,0.958871,0.552785,0.810187,0.646865,0.625707,0.082189,0.963445,0.240594,0.405319,0.593407,0.553092,0.189795,0.489676,0.356104,0.189732,0.23437,0.734763,0.331095,0.795102,0.720697,0.202959,0.279589,0.463243,0.427213
4,0.405394,0.761809,0.918316,0.327007,0.744595,0.45315,0.710927,0.738697,0.375013,0.45136,0.836686,0.297655,0.780544,0.491029,0.485218,0.637508,0.551828,0.052865,0.313814,0.741621,0.048509,0.912344,0.051517,0.599111,0.677366,0.755416,0.65943,0.653337,0.0114,0.939193,0.429581,0.104602,0.720239,0.8645,0.606135,0.843505,0.616343,0.178322,0.913244,0.515471,0.237372,0.069047,0.449189,0.8672,0.555141,0.002557,0.684676,0.921205,0.525043,0.155892,0.310693,0.064383,0.682644,0.223449,0.37164,0.432937,0.287986,0.011582,0.06791,0.611299,0.204166,0.031661,0.231928,0.708137,0.489565,0.487148,0.078329,0.855558,0.267313,0.007178,0.315864,0.42726,0.716236,0.882677,0.192563,0.070186,0.51176,0.468384,0.984783,0.114578,0.638638,0.550163,0.986406,0.69995,0.659147,0.482383,0.074368,0.759969,0.929551,0.280776,0.632832,0.263216,0.467069,0.724657,0.972695,0.379075,0.633286,0.04787,0.947148,0.462618


In [10]:
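# Pandas dataframe to cuDF
pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
# Stored under a separate name so that `df` from the cell above is still available below
gdf_from_pandas = cudf.from_pandas(pdf)
pgdf(gdf_from_pandas)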

### Dataframe Operations : Slice  Example - grab 3 arbitrary columns

Sometimes you want to grab slices of dataframes.  Here you can just pass a list of column names to the GPU dataframe to return the columns you want.

In [16]:
pgdf(df[[0,1,5]] )

Unnamed: 0,0,1,5
0,0.643922,0.406495,0.441299
1,0.459184,0.691137,0.342865
2,0.614662,0.375019,0.114749
3,0.121082,0.212826,0.765512
4,0.405394,0.761809,0.45315
5,0.548662,0.903565,0.944342
6,0.816397,0.912059,0.51126
7,0.779689,0.476548,0.284231
8,0.938475,0.334942,0.97705
9,0.374091,0.550647,0.454439


### Optional Exercise : Create a random numpy array of size 1000 x 1000 and then convert it to a GPU dataframe.  Then select columns 444, 555, and 888 from the dataframe.



## CuML basics

CuML is a machine learning library implemented on the Nvidia GPU.  This allows you to use many of the most common machine learning algorithms without having to write CUDA code.  The list of algorithms is growing with each release, so it's worth taking a look at the cuML GitHub repo.  In general you can expect a 10x to 50x performance speedup when using the GPU-enabled algorithm, depending on the algorithm and the size of the dataset.  **Later in this lab we will see examples of PCA and linear regression.**

## Dask

<img src="../nb_images/dask.png" alt="Drawing" style="float :left; margin-right: 20px; width: 200px;" />


Dask is an extremely useful Python library that enables parallel execution of arbitrary Python programs, allowing you to make maximum use of system resources.  It is typically used with libraries that have single-threaded implementations, like pandas and numpy, but it's also very useful for running many parallel tasks when using Rapids.  We will have a code sample to demonstrate this at the end of the lab.


# Lab Use Case 

**The main goal of this lab is to focus on the performance differences of Rapids (GPU) vs Pandas/Sklearn (CPU) implementations.**  
<br>

<img src="../nb_images/lendingclub.png" alt="Drawing" style="float :left; margin-right: 20px; width: 300px;" />
<br>To do this we will use the publicly available Lending Club dataset. 
This data set is published by Lending Club and contains information regarding prospective loan applicants.  
<br><br><br><br>
**As we go through the lab, we will show the similarity in the syntax/usage of the library using this real world dataset and keep track of the runtimes in a comparison report.**



## Lending Club data and Lab Details


# Data Preparation using cuDF

Here we will load in the lending club dataset and perform some basic data preparation steps.  

Each section follows the same workflow:

- [ ] cpu example
- [ ] gpu example
- [ ] comparison of results
- [ ] logging of runtimes



## Load the Lending Club Data

Here we will load the data twice: once into a pandas dataframe **loan_pdf** and once into a Rapids dataframe **loan_rdf**.  

In [None]:
# import data
filename = "../dataprep_common/loan_project_df.parquet.gzip"
DATA_DOUBLE_FACTOR=3  # number of times the dataset is doubled in the next cell

# Pandas dataframe
loan_pdf = pd.read_parquet(filename)

# Rapids Dataframe
loan_rdf = cudf.read_parquet(filename)

In [None]:
# Scale up data : each pass doubles the row count, so the data grows by 2**DATA_DOUBLE_FACTOR
for i in range(DATA_DOUBLE_FACTOR) :
    loan_pdf = pd.concat([loan_pdf,loan_pdf],axis=0)
    loan_rdf = cudf.concat([loan_rdf,loan_rdf],axis=0)
    # Rebuild a unique 0..n-1 index; concatenation leaves duplicate index labels behind
    loan_rdf = loan_rdf.reset_index(drop=True)
    loan_pdf = loan_pdf.reset_index(drop=True)
    

In [None]:
# Dataframe attributes
print("Rapids")
print("\nDataframe datatypes\n---------------------")
print(loan_rdf.dtypes)
print("\nDataframe Shape (rows,cols)\n---------------------")
print(loan_rdf.shape)
print("\nDataframe dimensions\n---------------------")
print(loan_rdf.ndim)
print("\nDataframe Column names\n---------------------")
print(loan_rdf.columns)

# Dataframe attributes
print("\n\nPandas")
print("\nDataframe datatypes\n---------------------")
print(loan_pdf.dtypes)
print("\nDataframe Shape (rows,cols)\n---------------------")
print(loan_pdf.shape)
print("\nDataframe dimensions\n---------------------")
print(loan_pdf.ndim)
print("\nDataframe Column names\n---------------------")
print(loan_pdf.columns)
print("\nDataframe Memory Usage\n---------------------")
print(loan_pdf.memory_usage(index=True).sum())

run_times.df_shape = loan_pdf.shape
run_times.df_memory_gb = loan_pdf.memory_usage(index=True).sum() /10**9

In [None]:
pgdf(loan_rdf)

## Descriptive Statistics - Describe Performance comparison

The first comparison we will make uses the describe function.  Describe is useful because it summarizes the descriptive statistics of the dataset.  It calculates **count, mean, standard deviation, min, max, and quartile (including median) statistics** for all the numerical columns.  If you have a large dataframe it can take some time to calculate.  Let's see how Rapids performs with this dataset.
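
To make the output of describe concrete, here is a small pandas-only sketch on made-up data. The column names mimic the lab's dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 15000.0, 20000.0],
    "grade": ["A", "B", "A", "C"],  # non-numeric, skipped by describe() by default
})

# One row per statistic, one column per numeric input column.
summary = toy.describe()
print(summary)
```

Each statistic requires at least one full pass over a column, which is why describe gets expensive on wide, long dataframes.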

In [None]:
# CPU / pandas
loan_pdf.describe()

In [None]:
# GPU / Rapids
pgdf(loan_rdf.describe())

In [None]:
# Record results
def describe_gpu():
    loan_rdf.describe()

def describe_cpu():
    loan_pdf.describe()

#display(loan_rdf.describe().to_pandas())

run_times.add_result("describe", "gpu", time_command(describe_gpu))
run_times.add_result("describe", "cpu", time_command(describe_cpu))

run_times.display_results()

## One Hot Encoding (OHE) Performance Comparison

One hot encoding is a process that converts a categorical variable into a set of binary indicator columns, one per category, so that ML algorithms that expect numeric input can make use of it.

Currently, one hot encoding for Rapids requires the column that is to be encoded to be an integer or float, not a string.  You will need to create an integer column prior to using this!  You can use the hash_encode method to accomplish this, although you lose a little bit of readability.  This restriction is expected to be lifted in future versions of the software.
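
Hash values are opaque, which is where the loss of readability comes from. As an alternative sketch (not what this lab does), integer codes can be assigned on the CPU with pandas so that the mapping back to the labels stays visible. The data below is made up, and `grade_code` is a hypothetical column name.

```python
import pandas as pd

grades = pd.Series(["B", "A", "C", "A", "B"], name="grade")  # made-up values

# Map each category to a stable integer code; sort=True gives A=0, B=1, C=2.
codes, categories = pd.factorize(grades, sort=True)

# One indicator column per integer code, then rename back to the original labels.
ohe = pd.get_dummies(pd.Series(codes, name="grade_code")).astype("int8")
ohe.columns = list(categories)
print(ohe)
```

Because `categories` keeps the label for every code, no hard-coded list of hash values is needed to name the resulting columns.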

In [None]:
# CPU / pandas example
ohe_cpu_df = pd.get_dummies(loan_pdf['grade'])

In [None]:
# GPU / Rapids example
# cudf 0.9 cudf.reshape.general.get_dummies(df, prefix=None, prefix_sep='_', dummy_na=False, columns=None, cats={}, sparse=False, drop_first=False, dtype='int8')

# Needed 50 hash values to get uniqueness ... there is probably a better way, but for now let's move on
# print(loan_rdf.grade.hash_encode(stop=50).value_counts())

MAX_VAL=50
loan_rdf['grade_hash'] = loan_rdf['grade'].hash_encode(stop=MAX_VAL)
ohe_gpu_df = loan_rdf.one_hot_encoding(column='grade_hash', prefix='g', cats=[27,45,36,28,48,17,25])


In [None]:
# Compare the results
print("Pandas ...")
display(ohe_cpu_df.head(10))
#pgdf(ohe_gpu_df[ohe_gpu_df['grade']=='A'].head(20))
print("Rapids ...")
ohe_gpu_df = ohe_gpu_df.rename(columns={"g_27": "A","g_45": "B","g_36": "C","g_28": "D","g_48": "E","g_17": "F","g_25": "G"})
pgdf(ohe_gpu_df[['A','B','C','D','E','F','G']].head(10))



In [None]:
# Record the results

def ohe_cpu() :
    pd.get_dummies(loan_pdf['grade'])

def ohe_gpu() :
    MAX_VAL=50
    loan_rdf['grade_hash'] = loan_rdf['grade'].hash_encode(stop=MAX_VAL)
    ohe_gpu_df = loan_rdf.one_hot_encoding(column='grade_hash', prefix='g', cats=[27,45,36,28,48,17,25])

run_times.add_result("one_hot_encode", "cpu", time_command(ohe_cpu))
run_times.add_result("one_hot_encode", "gpu", time_command(ohe_gpu))
run_times.display_results()


## Filter with Date and Time ops - Performance comparison

Current datetime functionality is limited to filtering a data set for specific times.  Datetime does not yet support math operations.

Here we will find loan applicants whose earliest credit line was opened on or before January 1, 2010.
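
As a rough illustration of the filter, the sketch below runs it on a three-row pandas dataframe, once as a boolean mask and once with `query`. The rows are invented, and `member` is a hypothetical column added for readability.

```python
import datetime as dt
import pandas as pd

# Made-up rows; only the earliest_cr_line column name comes from the lab.
toy = pd.DataFrame({
    "member": ["a", "b", "c"],
    "earliest_cr_line": pd.to_datetime(["2005-06-01", "2010-01-01", "2014-03-15"]),
})
search_date = dt.datetime(2010, 1, 1)

# Boolean-mask form of the filter.
by_mask = toy[toy["earliest_cr_line"] <= search_date]

# query() form; the @ prefix refers to a Python variable in the calling scope.
by_query = toy.query("earliest_cr_line <= @search_date")

print(by_mask)
```

Note that `<=` keeps rows that fall exactly on the search date.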

In [None]:
import datetime as dt

search_date = dt.datetime.strptime('2010-01-01', '%Y-%m-%d')

In [None]:
# CPU / pandas
query_cpu=loan_pdf.query('earliest_cr_line <= @search_date')


In [None]:
# GPU / Rapids
query_gpu=loan_rdf.query('earliest_cr_line <= @search_date')


In [None]:
# compare results
display(query_cpu.head())
pgdf(query_gpu.head())

In [None]:
# Filter Record results
def filter_cpu():
    loan_pdf.query('earliest_cr_line <= @search_date')
    
def filter_gpu():
    loan_rdf.query('earliest_cr_line <= @search_date')
    
run_times.add_result("filter_dt", "cpu", time_command(filter_cpu,repeat=3))
run_times.add_result("filter_dt", "gpu", time_command(filter_gpu,repeat=3))
run_times.display_results()


## Sort by value

Sorting is a very expensive operation in data preparation, so it's useful to evaluate the performance of each method.  Here we select a column to sort by and then compare the results.

In [None]:
# CPU / pandas
sort_cpu=loan_pdf.sort_values(by='fico_range_high')

In [None]:
# GPU / Rapids
sort_gpu=loan_rdf.sort_values(by='fico_range_high')

In [None]:
# compare results
display(sort_cpu.head())
pgdf(sort_gpu.head())

In [None]:
# Sorting Record results
def sort_cpu():
    loan_pdf.sort_values(by='fico_range_high')
    
def sort_gpu():
    loan_rdf.sort_values(by='fico_range_high')
    
run_times.add_result("sorting", "cpu", time_command(sort_cpu))
run_times.add_result("sorting", "gpu", time_command(sort_gpu))
run_times.display_results()


## Histograms and Custom functions

Here we demonstrate how fast Rapids is at creating histogram bins.  We use the loan_amnt column with a custom function to create a loan_bins column.  Then we grab the value counts using both Pandas and Rapids to get a rough comparison of the speed of these types of operations.
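
The binning itself is simple arithmetic: truncate each amount down to a multiple of the bin width. The sketch below shows a vectorized pandas/numpy version on made-up values. It is an alternative formulation for illustration, not the row-by-row custom function that the lab times, and `BIN_WIDTH` is a name introduced here.

```python
import numpy as np
import pandas as pd

BIN_WIDTH = 5000
loan_amnt = pd.Series([1200.0, 5000.0, 7400.0, 9999.0, 12000.0])  # made-up values

# Truncate each amount down to a multiple of BIN_WIDTH.
# np.floor matches int() truncation only for non-negative values.
loan_bins = np.floor(loan_amnt / BIN_WIDTH) * BIN_WIDTH

# Count how many loans fall in each bin, ordered by bin value.
counts = loan_bins.value_counts().sort_index()
print(counts)
```

Using a vectorized expression on the CPU side would likely narrow the measured gap, so keep that in mind when interpreting the histogram timings.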


In [None]:
# custom function example : creates simple bins for loan_amnt histogram
def roundto(num):
    bin_width=5000
    # truncate down to the nearest multiple of bin_width
    a = int(num / bin_width)
    return float(a*bin_width) 


In [None]:
# CPU / pandas

loan_pdf['loan_bins'] = loan_pdf.loan_amnt.apply(roundto)
loan_pdf['loan_bins'].value_counts()


In [None]:
# GPU / rapids
loan_rdf['loan_bins'] = loan_rdf.loan_amnt.applymap(roundto)
print(loan_rdf['loan_bins'].value_counts())



In [None]:
# Record the results
def hist_cpu() :
    loan_pdf['loan_bins'] = loan_pdf.loan_amnt.apply(roundto)
    loan_pdf['loan_bins'].value_counts()

def hist_gpu() :
    loan_rdf['loan_bins'] = loan_rdf.loan_amnt.applymap(roundto)
    loan_rdf['loan_bins'].value_counts()

run_times.add_result("histogram_ops", "cpu", time_command(hist_cpu,repeat=1))
run_times.add_result("histogram_ops", "gpu", time_command(hist_gpu,repeat=1))
run_times.display_results()


## Groupby 

Here we perform some aggregation on the Lending Club data set to get some per-grade statistics.  For this exercise we will compare the speed of aggregating over Pandas dataframes and Rapids dataframes using the **groupby** function as shown in the [Rapids documentation](https://docs.rapids.ai/api/cudf/stable/).  Notice how the syntax is exactly the same!
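
Passing a dictionary of lists to `agg` produces two-level column names such as `("annual_inc", "mean")`. The pandas-only sketch below, on made-up data, shows that result and one possible way to flatten the names. The flattening step is an addition for illustration and is not part of the lab.

```python
import pandas as pd

# Made-up rows using the lab's column names.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "annual_inc": [50000.0, 70000.0, 40000.0, 60000.0, 80000.0],
    "loan_amnt": [10000.0, 20000.0, 5000.0, 15000.0, 25000.0],
})

stats = toy.groupby("grade", as_index=False).agg(
    {"annual_inc": ["count", "mean"], "loan_amnt": ["count", "mean"]}
)

# Flatten ("annual_inc", "mean") into "annual_inc_mean"; ("grade", "") becomes "grade".
stats.columns = ["_".join(c).rstrip("_") for c in stats.columns]
print(stats)
```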

In [None]:
# CPU / Pandas
# stats by grade
grade_stats_pdf = loan_pdf.groupby('grade', as_index=False).agg({"annual_inc": ["count","mean"], "loan_amnt": ["count","mean"]})

In [None]:
#GPU / Rapids
# stats by grade
grade_stats_rdf = loan_rdf.groupby('grade', as_index=False).agg({"annual_inc": ["count","mean"], "loan_amnt": ["count","mean"]})

In [None]:
# Grade summary statistics
display(grade_stats_pdf)
pgdf(grade_stats_rdf)

In [None]:
# Record the results

def groupby_cpu() :
    loan_pdf.groupby('grade', as_index=False).agg({"annual_inc": ["count","mean"], "loan_amnt": ["count","mean"]})

def groupby_gpu() :
    loan_rdf.groupby('grade', as_index=False).agg({"annual_inc": ["count","mean"], "loan_amnt": ["count","mean"]})

run_times.add_result("groupby_ops", "cpu", time_command(groupby_cpu))
run_times.add_result("groupby_ops", "gpu", time_command(groupby_gpu))
run_times.display_results()

## Join 

Joining two dataframes can be an extremely computationally expensive task.  Here we take the grade summary statistics computed in the groupby experiment above and join them back with our table using grade as the key.  It is common practice in machine learning to apply average values per group back to the individual rows.  This is a form of [mean encoding](https://towardsdatascience.com/why-you-should-try-mean-encoding-17057262cd0).
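
The idea of mean encoding can be sketched on a few made-up rows with pandas. This version uses `merge` instead of `set_index` plus `join`, purely to keep the example short. The column name `grade_mean_loan_amnt` is hypothetical.

```python
import pandas as pd

# Made-up rows using the lab's column names.
toy = pd.DataFrame({
    "grade": ["A", "B", "A", "B"],
    "loan_amnt": [10000.0, 5000.0, 20000.0, 15000.0],
})

# Per-grade mean, one row per grade.
grade_means = (
    toy.groupby("grade", as_index=False)["loan_amnt"].mean()
       .rename(columns={"loan_amnt": "grade_mean_loan_amnt"})
)

# A left join attaches each grade's mean to every row of that grade.
encoded = toy.merge(grade_means, on="grade", how="left")
print(encoded)
```

The left join keeps every original row, so the output has the same number of rows as the input with one extra column.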

In [None]:
# Pandas Join
loan_join_pdf = loan_pdf.set_index('grade').join(grade_stats_pdf.set_index('grade'),on="grade",how="left").reset_index()

In [None]:
#cuDF Join
loan_join_rdf = loan_rdf.set_index('grade').join(grade_stats_rdf.set_index('grade'),on="grade",how="left").reset_index()

In [None]:
# Record the results
def join_cpu() :
    loan_pdf.set_index('grade').join(grade_stats_pdf.set_index('grade'),on="grade",how="left").reset_index()
def join_gpu() :
    loan_rdf.set_index('grade').join(grade_stats_rdf.set_index('grade'),on="grade",how="left").reset_index()

run_times.add_result("join_ops", "cpu", time_command(join_cpu))
run_times.add_result("join_ops", "gpu", time_command(join_gpu))
run_times.display_results()

# Machine Learning

## PCA (cuML and sklearn) - Performance comparison

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-pca.png"  width="200" height="125" align="middle"/>

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

A simple way to think about PCA is that it helps compress the data in a lossy representation of the original dataset.

**Let's compare the performance of the Sklearn (CPU-based) implementation vs cuML!**
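
For intuition about what both libraries compute, here is a minimal numpy sketch of PCA via the singular value decomposition on synthetic data. It is one textbook formulation, shown under the assumption that the data only needs centering. It is not a description of how sklearn or cuML implement PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 columns; the second column is correlated with the first.
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2.0 * x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Center each column, then take the SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values give the variance captured by each principal component.
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Rows of Vt are the principal directions; keep the first two for a lossy 2-D view.
projected = Xc @ Vt[:2].T
print(explained_variance_ratio)
```

Because two of the three columns are nearly redundant, most of the variance ends up in the first component, which is the "lossy compression" view of PCA.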

In [None]:
# Helper function to normalize (z-score) each column of a GPU dataframe
def normalize_df(gdf) :
    for col in gdf.columns :
        # subtract the column mean and divide by the column standard deviation
        gdf[col] = (gdf[col] - gdf[col].mean()) / gdf[col].std()
    return gdf

### Prepare the data for PCA [not timed]
Here we do some initial data preparation to normalize the dataframe columns.  We aren't comparing the performance of this step; it's just to get us ready to do the comparison.

In [17]:
X_cols = list(loan_rdf.columns)
print("Analysis Continuing with {}".format(X_cols))
X_cols.remove('default')
X_cols.remove('grade')
X_cols.remove('grade_hash')
X_cols = [x for x in X_cols if loan_rdf[x].dtype == "float64" or loan_rdf[x].dtype == "int8"]
print("Analysis Continuing with {}".format(X_cols))
# All types must be same ....
for x in X_cols :
    loan_rdf[x] = loan_rdf[x].astype("float64")

#print(loan_rdf[X_cols].dtypes)
print("Normalizing dataframe prior to PCA")
loan_norm_rdf = normalize_df(loan_rdf[X_cols])
print("Copying dataframe to pandas")
loan_norm_pdf = loan_norm_rdf.to_pandas()



In [None]:
print("Normalized Dataframe")
print(loan_norm_rdf[X_cols].dtypes)
pgdf(loan_norm_rdf) #.describe()


## Principal Component Analysis (PCA) Performance 

We will compare the runtimes of PCA on CPU and then on GPU and also compare the results to make sure they are the same.  

Here we take the normalized frame we built above and copy to pandas.  The two dataframes we will be working with are 

* loan_norm_pdf : normalized pandas dataframe
* loan_norm_rdf : normalized GPU/RAPIDS dataframe

These two dataframes contain exactly the same data.

In [None]:
# PCA
# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA as PCA_gpu
from sklearn.decomposition import PCA as PCA_cpu
n_components=5


In [None]:
# RUN PCA ! : CPU / Sklearn implementation
pca_loan_cpu = PCA_cpu(n_components=n_components)
pca_loan_cpu.fit(loan_norm_pdf)


In [None]:
# RUN PCA ! : GPU / cuML implementation
pca_loan_gpu = PCA_gpu(n_components=n_components)
pca_loan_gpu.fit(loan_norm_rdf)


**Compare results** : For PCA we use a scree plot to compare the results.  Scree plots show how much variance in the dataset is explained by each additional principal component.  Below, run the cell, eyeball the graphs, and convince yourself they are the same.
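
Reading a component count off the scree plot can also be done numerically. The sketch below uses made-up explained-variance ratios, and `components_needed` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

# Made-up explained-variance ratios for five components.
ratios = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
cum_var = np.cumsum(ratios)

def components_needed(cum_var, threshold):
    """Smallest number of components whose cumulative variance reaches threshold."""
    # searchsorted returns the first position where cum_var >= threshold.
    return int(np.searchsorted(cum_var, threshold) + 1)

print(components_needed(cum_var, 0.95))
```

Comparing this count for the CPU and GPU results is a slightly stricter check than comparing the two plots by eye.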
    

In [None]:
# Compare results ...

pca_scree(pca_loan_cpu.explained_variance_ratio_, "CPU")
pca_scree(pca_loan_gpu.explained_variance_ratio_, "GPU")

In [None]:
# record PCA performance results
def pca_cpu() :    
    print("cpu pca")
    pca_loan_cpu = PCA_cpu(n_components=n_components)
    pca_loan_cpu.fit(loan_norm_pdf)


def pca_gpu() :
    pca_loan_gpu = PCA_gpu(n_components=n_components)
    pca_loan_gpu.fit(loan_norm_rdf)

    
#print(loan_norm_rdf.shape)    
run_times.add_result("pca", "gpu", time_command(pca_gpu, repeat=2))
run_times.add_result("pca", "cpu", time_command(pca_cpu, repeat=2))

run_times.display_results()

## Linear Regression (cuML / sklearn // snapML)

Linear regression is one of the most common algorithms applied to structured data.  It's useful when trying to make a prediction of a continuous variable.  For example, you could use linear regression to try to predict the total expected payment of a loan given historical data about default rates.  Let's try this below with our data set.  (Note: Lending Club doesn't explicitly provide this data in its data set, so we will use a fictitious total_payment column in our analysis.)
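
As a point of reference for what a linear regression fit returns, here is a minimal numpy sketch of ordinary least squares on synthetic data. The coefficients, noise level, and variable names are invented for illustration. The libraries used in the lab may solve the problem with a different algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data generated from a known linear model plus a little noise.
X = rng.normal(size=(500, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 1.5
y = X @ true_coef + true_intercept + 0.01 * rng.normal(size=500)

# Append a column of ones so the intercept is fitted as an extra coefficient.
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
coef, intercept = solution[:2], solution[2]

print(coef, intercept)
```

The recovered `coef` and `intercept` play the same role as the `coef_` and `intercept_` attributes compared between the CPU and GPU models.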

In [None]:
# Linear Regression : CPU / Sklearn
from sklearn.linear_model import LinearRegression as LRSKL
X = loan_norm_rdf.to_pandas()
y = loan_rdf['default'].to_pandas()    
lr_cpu = LRSKL(fit_intercept = True)  # normalize=False was the default; newer sklearn releases removed the parameter
res_cpu = lr_cpu.fit(X,y)

In [None]:
# Linear Regression : GPU / Rapids cuML example
from cuml.linear_model import LinearRegression as LRCUML
X = loan_norm_rdf
y2 = loan_rdf['default'].astype("float64")    
lr_gpu = LRCUML(fit_intercept = True, normalize = False) #, algorithm = "eig")
res_gpu = lr_gpu.fit(X,y2)


In [None]:
# Compare results
print("CPU coefficients:")
print(res_cpu.coef_)
print("CPU intercept:")
print(res_cpu.intercept_)

print("GPU coefficients:")
print(res_gpu.coef_)
print("GPU intercept:")
print(res_gpu.intercept_)


In [None]:
#Record Results 

# CPU 
def lr_cpu() :
    # normalize=False was the default; newer sklearn releases removed the parameter
    lr_cpu = LRSKL(fit_intercept = True)
    res = lr_cpu.fit(X,y)
    
X = loan_norm_rdf.to_pandas()
y = loan_rdf['default'].to_pandas()    
run_times.add_result("linear_reg", "cpu", time_command(lr_cpu, repeat=5))


# GPU
def lr_gpu() :
    lr_gpu = LRCUML(fit_intercept = True, normalize = False, algorithm = "eig")
    res = lr_gpu.fit(X,y)

X = loan_norm_rdf
y = loan_rdf['default'].astype("float64")    
run_times.add_result("linear_reg", "gpu", time_command(lr_gpu, repeat=5))



run_times.display_results()    



# Summary

In this lab we covered a number of common functions used by both data engineers and data scientists to manipulate dataframes and build machine learning models.  The RAPIDS implementation demonstrates how much time you can save by running many of these operations on the GPU.  As data set sizes grow, and the number of experiments required increases, this performance gain can be a real advantage for getting to the answers faster.  Let's recap your speedups here ...

In [19]:
run_times.display_results()    

Dataframe size : (0, 0) 0 GB
test                 CPU(s)               GPU(s)               GPU Speedup         


Note, you can play with the dataset size and rerun the notebook to see how that impacts your run results!  TL;DR the larger your dataframe the better the GPU speedups ...

## Credits

This notebook was built by  Dustin VanStee (vanstee@us.ibm.com) from IBM Worldwide Client Experience Centers.  Special thanks to Steve LaFalce for reviewing the content and suggesting edits.


# Other interesting applications

## Dask With RAPIDS

Rapids cuDF and cuML are designed to primarily run on a single GPU.  Using the distributed computing framework dask, we will demonstrate how you can take the above examples and run them at scale over many GPUs!

<< Demo Time! >>

## Lessons Learned

Lessons learned: data MUST be clean prior to calling describe functions.  Errors encountered:
- duplicate index caused error (this was due to concatenating dataframes)
- NaN causes KeyError messages.  Columns must be clean !
- Pandas takes care of these automatically ....
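
The two cleanup steps implied by the list above can be sketched with pandas on made-up data: renumber the index after concatenating, and count and fill missing values before computing statistics. The fill value of 0.0 is only a placeholder assumption. The right choice depends on the column.

```python
import numpy as np
import pandas as pd

part = pd.DataFrame({"loan_amnt": [1000.0, np.nan], "grade": ["A", "B"]})  # made-up rows

# Concatenating a frame with itself repeats the index labels 0, 1, 0, 1.
doubled = pd.concat([part, part], axis=0)
has_duplicate_index = doubled.index.has_duplicates

# ignore_index=True renumbers the rows 0..n-1 in one step.
clean = pd.concat([part, part], axis=0, ignore_index=True)

# Count missing values per column, then fill the numeric gaps with a placeholder.
null_counts = clean.isna().sum()
clean["loan_amnt"] = clean["loan_amnt"].fillna(0.0)
print(null_counts)
```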

In [None]:
# NaN report
def nan_report(df) :
    for c in df.columns :
        print("{} {}".format(c, df[c].null_count))

nan_report(loan_rdf)

In [None]:
## SnapML + Rapids

In [None]:
### SnapML

X = loan_norm_rdf.to_pandas()
y = loan_rdf['default'].to_pandas()

from pai4sk.linear_model import Ridge as LRSNAP
clf = LRSNAP(alpha=1.0)
clf.fit(X, y)


#Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
#      normalize=False, random_state=None, solver='auto', tol=0.001)