# Data Handling in RAPIDS

## Installing RAPIDS

- Reminder: use only an NVIDIA T4, P4, or P100 GPU.

In [1]:
!nvidia-smi

Fri Mar 20 15:34:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh

import sys, os

# Insert site-packages just before dist-packages on the import path
dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 44, done.
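
The install cell calls `sys.path.index(...)`, which raises `ValueError` when the marker directory is not on the path. A minimal sketch of a more defensive variant follows; the helper name `insert_before` is hypothetical and not part of the RAPIDS install scripts.

```python
def insert_before(paths, marker, new_entry):
    """Return a copy of `paths` with `new_entry` placed just before `marker`.

    If `marker` is absent, `new_entry` is appended instead of raising ValueError.
    If `new_entry` is already present, the list is returned unchanged.
    """
    result = list(paths)
    if new_entry in result:
        return result
    try:
        idx = result.index(marker)
    except ValueError:
        idx = len(result)
    result.insert(idx, new_entry)
    return result


# Illustrative path list; the real sys.path on Colab will contain more entries.
example = insert_before(
    ['/usr/lib/python3.6', '/usr/local/lib/python3.6/dist-packages'],
    '/usr/local/lib/python3.6/dist-packages',
    '/usr/local/lib/python3.6/site-packages',
)
print(example)
```

In a notebook this could be applied as `sys.path = insert_before(sys.path, marker, new_entry)`, which is safe to re-run because the helper does not insert duplicates.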

# Data Analysis

In [0]:
import cudf
import numpy as np
import dask_cudf

In [0]:
# The file is semicolon-separated, so pass sep=';'
bank_df = cudf.read_csv('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/bank-full.csv', sep=';')



The dataset has the following columns:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur", "student","blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

Related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y: has the client subscribed to a term deposit? (binary: "yes","no")
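
The column list above can be turned into a small validation step. The sketch below uses only the standard library on a two-row sample copied from the `bank_df.head()` output in this notebook; the sample layout (unquoted header, `;` delimiter) and the names `NUMERIC`, `BINARY`, and `parse_rows` are assumptions for illustration, not part of cuDF.

```python
import csv
import io

# Two rows taken from the head() output, re-serialised with ';' as delimiter.
SAMPLE = (
    "age;job;marital;education;default;balance;housing;loan;contact;day;month;"
    "duration;campaign;pdays;previous;poutcome;y\n"
    "58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no\n"
    "44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no\n"
)

NUMERIC = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
BINARY = ["default", "housing", "loan", "y"]


def parse_rows(text):
    """Parse ';'-separated text, cast numeric columns, and check binary columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=";"):
        row = dict(raw)
        for col in NUMERIC:
            row[col] = int(row[col])
        for col in BINARY:
            if row[col] not in ("yes", "no"):
                raise ValueError(f"unexpected value {row[col]!r} in column {col!r}")
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
print(rows[0]["age"], rows[0]["balance"], rows[0]["y"])
```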


In [9]:
! nvidia-smi

Fri Mar 20 15:54:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    29W /  70W |    413MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [5]:
print("rows: ", bank_df.shape[0])
print("columns: ", bank_df.shape[1])

rows:  45211
columns:  17


In [0]:
bank_df.dtypes

In [0]:
bank_df.isnull().sum()

In [10]:
bank_df['y'].value_counts()

no     39922
yes     5289
Name: y, dtype: int32
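
The counts above show that the target is imbalanced. As a rough sketch, the class shares and the accuracy of a trivial "always predict no" baseline can be derived from these two numbers with plain Python; the variable names here are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {"no": 39922, "yes": 5289}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the majority class reaches this accuracy.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```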

# Benchmarking dask_cudf vs cudf

In [0]:
import time

In [15]:
start_time = time.time()
bank_df.describe()
end_time = time.time()
print("Time taken on GPU : %s" %(end_time - start_time))

Time taken on GPU : 0.11950492858886719


In [0]:
dcudf = dask_cudf.from_cudf(bank_df, npartitions=2)

In [17]:
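start_time = time.time()
# dask_cudf is lazy: without .compute(), describe() only builds the task graph,
# so this timing is not directly comparable to the cudf run above
dcudf.describe()
end_time = time.time()
print("Time taken on GPU : %s" %(end_time - start_time))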

Time taken on GPU : 0.48287105560302734
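
The two timing cells repeat the same start/stop pattern. A small helper, sketched below with the hypothetical name `timed`, keeps the measurement in one place and uses `time.perf_counter`, which is better suited to short intervals than `time.time`. The demo call times a plain Python function so the block runs without a GPU.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# CPU-only demonstration; in the notebook the callable would be a DataFrame method.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, time taken: {seconds:.6f}s")

# Possible use with the DataFrames above (not executed here):
#   _, t_cudf = timed(bank_df.describe)
#   _, t_dask = timed(lambda: dcudf.describe().compute())  # .compute() forces execution
```

Because dask_cudf evaluates lazily, wrapping the call in a lambda that ends with `.compute()` is what makes the two measurements cover the same work.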


# Exploring Data

In [18]:
bank_df.describe()

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [19]:
bank_df.groupby(['marital', 'y']).agg({'balance':'mean'})

| marital | y | balance |
|---|---|---|
| divorced | no | 1107.095747 |
| divorced | yes | 1707.96463 |
| married | no | 1370.746228 |
| married | yes | 1915.810163 |
| single | no | 1235.869921 |
| single | yes | 1674.875523 |


In [20]:
bank_df.groupby(['marital', 'y']).agg({'balance':'mean', 'y': 'count'})

| marital | y | balance | y (count) |
|---|---|---|---|
| divorced | no | 1107.095747 | 4585 |
| divorced | yes | 1707.96463 | 622 |
| married | no | 1370.746228 | 24459 |
| married | yes | 1915.810163 | 2755 |
| single | no | 1235.869921 | 10878 |
| single | yes | 1674.875523 | 1912 |
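
A quick sanity check on a grouped aggregation is to recombine the group means, weighted by the group counts, and compare against the overall mean reported by `describe()` (1362.272058 for `balance`). The sketch below does this in plain Python with the numbers copied from the output above.

```python
# (marital, y, mean balance, row count) copied from the groupby output.
groups = [
    ("divorced", "no", 1107.095747, 4585),
    ("divorced", "yes", 1707.96463, 622),
    ("married", "no", 1370.746228, 24459),
    ("married", "yes", 1915.810163, 2755),
    ("single", "no", 1235.869921, 10878),
    ("single", "yes", 1674.875523, 1912),
]

total_count = sum(n for _, _, _, n in groups)

# Count-weighted average of the group means equals the overall mean.
overall_mean = sum(mean * n for _, _, mean, n in groups) / total_count

print(total_count)
print(round(overall_mean, 3))
```

The counts also add up to the 45211 rows reported by `bank_df.shape`, so no rows were dropped by the grouping.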


In [0]:
loan_outcome = bank_df.groupby(['loan', 'y']).agg({'balance':'mean','y':'count'})

In [22]:
print(loan_outcome)

              balance      y
loan y                      
no   no   1413.228726  33162
     yes  1897.001041   4805
yes  no    766.481953   6760
     yes   883.642562    484
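
The counts in `loan_outcome` can be converted into a subscription rate per loan status, which is easier to compare than raw counts. This is a plain-Python sketch using the printed numbers; the dictionary layout is an illustrative choice.

```python
# Counts copied from the printed loan_outcome table: loan status -> outcome -> rows.
loan_counts = {
    "no": {"no": 33162, "yes": 4805},
    "yes": {"no": 6760, "yes": 484},
}

# Share of clients who subscribed (y == "yes") within each loan group.
rates = {
    loan: outcome["yes"] / (outcome["no"] + outcome["yes"])
    for loan, outcome in loan_counts.items()
}

for loan, rate in rates.items():
    print(f"personal loan = {loan}: {rate:.1%} subscribed")
```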


In [0]:
def convert_hour(duration):
    # duration is in seconds, so dividing by 60 yields minutes (despite the function name)
    return duration / 60

In [0]:
# Apply the function element-wise (recent cuDF releases provide Series.apply for this)
bank_df['duration_hour'] = bank_df['duration'].applymap(convert_hour)

In [26]:
bank_df.head()

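| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | duration_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no | 4.35 |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no | 2.516667 |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no | 1.266667 |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no | 1.533333 |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no | 3.3 |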
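
Since `duration` is recorded in seconds, the `duration_hour` values shown above (261 s → 4.35) are minutes, not hours. The sketch below makes both conversions explicit in plain Python; the function names are hypothetical and the durations are the five values from the `head()` output.

```python
def seconds_to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


# Durations (in seconds) of the first five rows shown above.
durations = [261, 151, 76, 92, 198]

minutes = [seconds_to_minutes(s) for s in durations]
hours = [seconds_to_hours(s) for s in durations]

for s, m, h in zip(durations, minutes, hours):
    print(f"{s} s = {m:.6f} min = {h:.6f} h")
```

On a cuDF or pandas column the same conversions can be written as vectorised arithmetic, for example `bank_df['duration'] / 3600`, without an element-wise function.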


In [31]:
bank_df.groupby('y').campaign.mean()

y
no     2.846350
yes    2.141047
Name: campaign, dtype: float64

In [0]:
bank_campaign_df = bank_df.query("campaign <= 8")

In [33]:
bank_df['education'].value_counts()

secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int32