# Black Friday Dask Implementation

We will train a machine learning model on the Black Friday dataset using the Dask APIs.

In [1]:
import pandas as pd
import dask.dataframe as daskdf

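**Reading the data into a pandas dataframe and into a Dask dataframe**

Notice that `daskdf.read_csv` returns much faster than `pd.read_csv`. The two timings are not directly comparable: Dask is lazy, so it only samples the start of the file to infer column types and builds a task graph. The actual parsing happens later, when a result is requested (for example with `.head()` or `.compute()`).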

In [2]:
%time temp = pd.read_csv('./data/BlackFriday_train.csv')

Wall time: 522 ms


In [3]:
%time data = daskdf.read_csv('./data/BlackFriday_train.csv')

Wall time: 25 ms


In [4]:
data.head()

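| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |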


**Performing some basic computations using the Dask API**

In [5]:
data.Gender.value_counts().compute()

M    414259
F    135809
Name: Gender, dtype: int64

In [6]:
data.groupby(data.Gender)['Purchase'].count().compute()

Gender
F    135809
M    414259
Name: Purchase, dtype: int64

In [7]:
data.groupby(data.Gender).Purchase.max().compute()

Gender
F    23959
M    23961
Name: Purchase, dtype: int64

In [8]:
# temp.groupby(temp.Gender)['Purchase'].agg('max')
# data.groupby(data.Gender)['Purchase'].agg('max').compute()
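
The commented-out lines above point at `agg`. A small sketch, on made-up toy rows rather than the real dataset, of how `agg` can return several statistics per group in one call. It is shown in pandas; the assumption is that the Dask version gives the same result once `.compute()` is appended.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the full dataset).
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "Purchase": [8370, 1422, 15200, 7969, 1057],
})

# One pass over the groups, two statistics per group.
summary = toy.groupby("Gender")["Purchase"].agg(["count", "max"])
print(summary)
```

This combines the separate `count()` and `max()` cells into a single grouped result with one column per statistic.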

In [9]:
data.isnull().sum().compute()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

In [10]:
data.head()

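| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |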


In [11]:
data = data.fillna(0)

In [12]:
data.isnull().sum().compute()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64

**Filtering the data**

Separating the features and the labels from the dataframe, then creating dummy (one-hot) variables for the categorical features.

In [13]:
categorical_variables = data[['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']]

target = data['Purchase']

In [14]:
train_data = daskdf.get_dummies(categorical_variables.categorize()).compute()

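Getting the underlying array from the dataframe. Because `train_data` was already materialized with `.compute()`, it is a pandas dataframe at this point, so `.values` returns a NumPy array rather than a Dask array.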

In [15]:
train_data = train_data.values

**Training the model using the Dask ML**

Dask-ML provides scalable implementations of some of the simpler scikit-learn machine learning algorithms.

In [16]:
# fit the model

from dask_ml.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_data, target)

LinearRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
         intercept_scaling=1.0, max_iter=100, multiclass='ovr', n_jobs=1,
         penalty='l2', random_state=None, solver='admm',
         solver_kwargs=None, tol=0.0001, verbose=0, warm_start=False)

**Making predictions for the test data**

In [17]:
test_data = daskdf.read_csv('./data/BlackFriday_test.csv')

In [18]:
test_data.head()

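| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000004 | P00128942 | M | 46-50 | 7 | B | 2 | 1 | 1 | 11.0 | NaN |
| 1 | 1000009 | P00113442 | M | 26-35 | 17 | C | 0 | 0 | 3 | 5.0 | NaN |
| 2 | 1000010 | P00288442 | F | 36-45 | 1 | B | 4+ | 1 | 5 | 14.0 | NaN |
| 3 | 1000010 | P00145342 | F | 36-45 | 1 | B | 4+ | 1 | 4 | 9.0 | NaN |
| 4 | 1000011 | P00053842 | F | 26-35 | 1 | C | 1 | 0 | 4 | 5.0 | 12.0 |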


In [19]:
test_data.isnull().sum().compute()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2             72344
Product_Category_3            162562
dtype: int64

In [20]:
test_data = test_data.fillna(0)

In [21]:
test_categorical = test_data[['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']]

In [22]:
test_categorical = daskdf.get_dummies(test_categorical.categorize())

In [23]:
testnew = test_categorical.values

In [24]:
predict = model.predict(testnew)
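
One thing to watch when dummies are built separately for the train and test sets: a category that is missing from one file produces no column there, so the two feature matrices can end up with different widths or column orders. A hedged sketch of one way to guard against this, using pandas `reindex` on made-up toy frames (the names `train_cat` and `test_cat` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy categorical frames; the test frame never contains city "A" or "C".
train_cat = pd.DataFrame({"City_Category": ["A", "B", "C"],
                          "Gender": ["F", "M", "M"]})
test_cat = pd.DataFrame({"City_Category": ["B", "B"],
                         "Gender": ["M", "F"]})

train_dummies = pd.get_dummies(train_cat)

# Force the test dummies onto the training columns, in the same order.
# Columns absent from the test data are created and filled with 0.
test_dummies = pd.get_dummies(test_cat).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(train_dummies.columns))
print(test_dummies.shape)  # (2, 5)
```

With both matrices sharing the same columns, each coefficient learned during training lines up with the same feature at prediction time.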

**Now let's use Grid Search with a Random Forest to find the best parameters**

In [25]:
from dask.distributed import Client
# sklearn.externals.joblib was removed in newer scikit-learn releases; import joblib directly
from joblib import parallel_backend

client = Client()  # Connect to a Dask Cluster

In [26]:
with parallel_backend('dask'):
    # Note: parallel_backend only affects joblib work that runs inside this block;
    # the definitions below do not launch any joblib tasks themselves.
    # Create the parameter grid based on the results of random search
    param_grid = {
        'bootstrap': [True],
        'max_depth': [8, 9],
        'max_features': [2, 3],
        'min_samples_leaf': [4, 5],
        'min_samples_split': [8, 10],
        'n_estimators': [100, 200]
    }
    # Create a base model
    from sklearn.ensemble import RandomForestRegressor
    rf = RandomForestRegressor()

In [27]:
# dask_searchcv has since been merged into Dask-ML as dask_ml.model_selection
import dask_searchcv as dcv

In [28]:
grid_search = dcv.GridSearchCV(estimator=rf, param_grid=param_grid, cv=3)
grid_search.fit(train_data, target)
grid_search.best_params_

{'bootstrap': True,
 'max_depth': 9,
 'max_features': 3,
 'min_samples_leaf': 4,
 'min_samples_split': 8,
 'n_estimators': 200}
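
To get a feel for how much work this grid search represents, the sketch below enumerates the candidate combinations from the same `param_grid` with `itertools.product` and multiplies by the number of folds. It is a back-of-the-envelope count, not part of the original run, and it ignores any final refit on the full data.

```python
from itertools import product

param_grid = {
    "bootstrap": [True],
    "max_depth": [8, 9],
    "max_features": [2, 3],
    "min_samples_leaf": [4, 5],
    "min_samples_split": [8, 10],
    "n_estimators": [100, 200],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 32 96
```

So the search fits 32 candidate forests three times each, which is the kind of embarrassingly parallel workload that benefits from a Dask cluster.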