### Import Libraries

Built-in Python imports

In [None]:
import time


Additional CPU imports

In [None]:
import numpy as np;print('numpy Version:', np.__version__)
import pandas as pd;print('pandas Version:', pd.__version__)
import sklearn
## Visualization libraries
import ipyvolume as ipv
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D

Import Algorithms and Dataset libraries 

In [None]:

 
from sklearn import datasets
from sklearn.metrics import confusion_matrix, accuracy_score

Imports for GPU-accelerated datasets and algorithms

In [None]:
import cupy;print('cupy Version', cupy.__version__)
import cudf;print('cudf Version', cudf.__version__)


import rapids_lib_v8 as rl
''' NOTE: anytime changes are made to rapids_lib_v8.py you can either:
      1. refresh/reload via the code below, OR
      2. restart the kernel '''
import importlib; importlib.reload(rl)

### Data Generation

We will generate data shapes (coordinate lists) and hand them to the GPU. The GPU will then build random 3D blobs (via `cupy.random.normal`) around each coordinate point to create a much larger, noisier, and more realistic dataset.

Using this concept we offer the following dataset variations:
1. Helix - two entwined coils, inspired by DNA casing
2. Whirl - an increasingly unwinding Helix 
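The blob idea described above can be sketched on the CPU with NumPy, whose random API CuPy largely mirrors. This is only an illustration under stated assumptions: the function `make_blobs_around` and the helix backbone below are hypothetical and are not taken from `rapids_lib_v8`.

```python
import numpy as np

def make_blobs_around(coords, n_blob_points, sdev_scales, rng):
    """Scatter n_blob_points Gaussian samples around every (x, y, z) coordinate."""
    coords = np.asarray(coords, dtype=float)  # shape (n_coords, 3)
    # one independent Gaussian offset per blob point, with a per-axis standard deviation
    noise = rng.normal(loc=0.0, scale=sdev_scales,
                       size=(coords.shape[0], n_blob_points, 3))
    # broadcast each coordinate over its own blob, then flatten to (n_coords * n_blob_points, 3)
    return (coords[:, None, :] + noise).reshape(-1, 3)

# hypothetical backbone: 10 evenly spaced points along a single unit-radius coil
t = np.linspace(0.0, 4.0 * np.pi, 10)
helix_coords = np.stack([np.cos(t), np.sin(t), t / (4.0 * np.pi)], axis=1)

rng = np.random.default_rng(0)
points = make_blobs_around(helix_coords, n_blob_points=500,
                           sdev_scales=[0.01, 0.01, 0.01], rng=rng)
print(points.shape)
```

Each coordinate contributes one blob, so the dataset grows by a factor of `n_blob_points` while keeping the overall shape of the backbone.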

In [None]:
rl.plot_dataset_variants()

To generate a dataset with either of the two variants, take a look at the library function below, or use it as a starting point to create your own!

In [None]:
help(rl.gen_blob_coils)

This is the set of values we used for our experiment: 

In [None]:
nBlobPoints = 500
nCoordinates = 10
sdevScales = [ .01, .01, .01 ]
noiseScale = 1/5.
coilDensity = 8

In [None]:
data, labels, t_gen = rl.gen_blob_coils( coilType='helix', 
                                        shuffleFlag = False, 
                                         nBlobPoints = nBlobPoints,  
                                         nCoordinates = nCoordinates, 
                                         sdevScales = sdevScales, 
                                         noiseScale = noiseScale, 
                                         coilDensity = coilDensity )

The library is expected to return a cudf DataFrame object. Let's check.

In [None]:
print ('Type of dataset rl returned is', type(data))

### Split Training and Testing data
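The helper in this section holds out one contiguous fold as the test set and then swaps a few samples between the training and test sets. The core idea can be sketched on plain index arrays as follows. This is a simplified variant, not the helper itself: the name `holdout_fold_with_swap` is hypothetical, and it uses `np.random.default_rng` rather than the global NumPy seed.

```python
import numpy as np

def holdout_fold_with_swap(n_samples, n_folds=10, n_swap=20, seed=1):
    """Hold out one contiguous fold as test indices, then swap a few indices with the training set."""
    rng = np.random.default_rng(seed)
    fold_edges = np.linspace(0, n_samples, n_folds + 1).astype(int)
    test_fold = rng.integers(0, n_folds)
    lo, hi = fold_edges[test_fold], fold_edges[test_fold + 1]

    all_inds = np.arange(n_samples)
    test_inds = all_inds[lo:hi].copy()
    train_inds = np.concatenate([all_inds[:lo], all_inds[hi:]])

    # exchange n_swap randomly chosen positions between the two index sets
    n_swap = min(n_swap, test_inds.size, train_inds.size)
    train_pos = rng.permutation(train_inds.size)[:n_swap]
    test_pos = rng.permutation(test_inds.size)[:n_swap]
    train_buffer = train_inds[train_pos].copy()
    train_inds[train_pos] = test_inds[test_pos]
    test_inds[test_pos] = train_buffer
    return train_inds, test_inds

train_inds, test_inds = holdout_fold_with_swap(1000)
print(train_inds.size, test_inds.size)
```

With no swapping, the test fold is a region the model never sees during training; swapping a few samples gives the model a handful of examples from that region, which is why low swap counts demand more generalization.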

In [None]:
expLog = {}

In [None]:
## helper function to split the dataset 
def split_train_test_nfolds ( dataDF, labelsDF, nFolds = 10, seed = 1, nSamplesToSwap = 50 ):
    print('splitting data into training and test set')
    startTime = time.time()
    
    nSamplesPerFold = int(dataDF.shape[0] // nFolds)
    sampleRanges = np.arange(nFolds) * nSamplesPerFold
        
    np.random.seed(seed)
    foldStartInds = np.random.randint(0, nFolds-1, size = nFolds)
    foldEndInds = foldStartInds + 1 
    
    testFold = np.random.randint(0,nFolds-1)
    trainInds = None; testInds = None
    
    for iFold in range( nFolds ):
        lastFoldFlag = ( iFold == nFolds-1 )
        if lastFoldFlag: foldInds = np.arange(sampleRanges[iFold], dataDF.shape[0] )
        else: foldInds = np.arange(sampleRanges[iFold], sampleRanges[iFold+1])
        
        if iFold == testFold: testInds = foldInds
        else:
            if trainInds is None: trainInds = foldInds
            else: trainInds = np.concatenate([trainInds, foldInds])
                
    # swap subset of train and test samples [ low values require higher model generalization ]
    if nSamplesToSwap > 0:
        trainIndsToSwap = np.random.permutation(trainInds.shape[0])[0:nSamplesToSwap]
        testIndsToSwap = np.random.permutation(testInds.shape[0])[0:nSamplesToSwap]        
        trainBuffer = trainInds[trainIndsToSwap].copy()
        trainInds[trainIndsToSwap] = testInds[testIndsToSwap]
        testInds[testIndsToSwap] = trainBuffer
    
    # build final dataframes
    trainDF = dataDF.iloc[trainInds]
    testDF = dataDF.iloc[testInds]
    trainLabelsDF = labelsDF.iloc[trainInds]
    testLabelsDF = labelsDF.iloc[testInds]                
    
    return trainDF, trainLabelsDF, testDF, testLabelsDF, time.time() - startTime


### Rescale / Normalize the data
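Standard scaling subtracts each column's mean and divides by its standard deviation. The statistics are computed on the training data only and then reused for the test data, so no information from the test set leaks into preprocessing. A minimal NumPy sketch with made-up toy values (the arrays below are illustrative assumptions, not the notebook's data):

```python
import numpy as np

# toy data: the second column is constant to show why a small epsilon is added
train = np.array([[0.0, 10.0],
                  [2.0, 10.0],
                  [4.0, 10.0]])
test = np.array([[2.0, 10.0],
                 [6.0, 10.0]])

train_means = train.mean(axis=0)
train_sdevs = train.std(axis=0, ddof=1)  # sample standard deviation, the pandas default
eps = 1e-10                              # guards against division by zero

train_scaled = (train - train_means) / (train_sdevs + eps)
test_scaled = (test - train_means) / (train_sdevs + eps)  # reuse the training statistics
print(train_scaled)
print(test_scaled)
```

Without the epsilon, the constant column would produce `0 / 0` and fill the result with NaN values.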

In [None]:
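def scale_dataframe_inplace ( targetDF, trainMeans = None, trainSTDevs = None ):
    # create fresh dicts per call; mutable default arguments would persist between calls
    if trainMeans is None: trainMeans = {}
    if trainSTDevs is None: trainSTDevs = {}
    print('rescaling data')
    sT = time.time()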
    for iCol in targetDF.columns:
        
        # omit scaling label column
        if iCol == targetDF.columns[-1] == 'label': continue
            
        # compute means and standard deviations for each column [ should skip for test data ]
        if iCol not in trainMeans.keys() and iCol not in trainSTDevs.keys():            
            trainMeans[iCol] = targetDF[iCol].mean()
            trainSTDevs[iCol] = targetDF[iCol].std()
            
        # apply scaling to each column
        targetDF[iCol] = ( targetDF[iCol] - trainMeans[iCol] ) / ( trainSTDevs[iCol] + 1e-10 )
        
    return trainMeans, trainSTDevs, time.time() - sT

## GPU vs CPU work 

We will use the two helper functions above to split our dataset into training and testing sets and then normalize it.
Before we dive into the work, a note about variable naming:

We append the suffix **_pDF** (pandas DataFrame) to variable names to emphasize that the variable resides on the CPU. Likewise, we use **_cDF** (cudf DataFrame) for variables that use the GPU-accelerated libraries (cudf and cuml).

### CPU split & scale

To make our functions run on the CPU only, we will use pandas.

In [None]:
# split
trainData_pDF, trainLabels_pDF, testData_pDF, testLabels_pDF, t_split_CPU = split_train_test_nfolds ( data.to_pandas(), 
                                                                                                     labels.to_pandas(), 
                                                                                                     nSamplesToSwap = 20 )

# apply standard scaling
trainMeans_CPU, trainSTDevs_CPU, t_scaleTrain_CPU = scale_dataframe_inplace ( trainData_pDF )
_,_, t_scaleTest_CPU = scale_dataframe_inplace ( testData_pDF, trainMeans_CPU, trainSTDevs_CPU )    

expLog = rl.update_log( expLog, [['CPU_split_train_test', t_split_CPU],
                                 ['CPU_scale_train_data', t_scaleTrain_CPU], 
                                 ['CPU_scale_test_data', t_scaleTest_CPU]] )

Let's now take a look at our dataset, its shape, and its datatypes:


In [None]:
print(testData_pDF.head());
print (trainData_pDF.shape)
print('trainData_pDF: ', trainData_pDF.shape, type(trainData_pDF), 'trainLabels_pDF: ', trainLabels_pDF.shape, type(trainLabels_pDF))

### GPU split & scale

In [None]:
# split
trainData_cDF, trainLabels_cDF, testData_cDF, testLabels_cDF, t_split = split_train_test_nfolds ( data, labels, nSamplesToSwap = 20)

# apply standard scaling
trainMeans, trainSTDevs, t_scaleTrain = scale_dataframe_inplace ( trainData_cDF )
_,_, t_scaleTest = scale_dataframe_inplace ( testData_cDF, trainMeans, trainSTDevs )    

expLog = rl.update_log( expLog, [['GPU_split_train_test', t_split],
                                 ['GPU_scale_train_data', t_scaleTrain],
                                 ['GPU_scale_test_data', t_scaleTest]] ); 

In [None]:
print(trainData_cDF.head());
print (trainData_cDF.shape)
print('trainData_cDF: ', trainData_cDF.shape, type(trainData_cDF), 'trainLabels_cDF: ', trainLabels_cDF.shape, type(trainLabels_cDF))

In [None]:
rl.plot_train_test(trainData_cDF, trainLabels_cDF, testData_cDF, testLabels_cDF)

In [None]:
# %store trainData_cDF
# %store trainLabels_cDF
# %store testData_cDF 
%store testLabels_cDF

In [None]:
%store expLog
%store trainData_pDF
%store trainLabels_pDF 
%store testData_pDF
%store testLabels_pDF