# Big Data Challenge

In this challenge, you'll replicate one of the supervised learning projects that you did before in the program. This time, you'll use Dask instead of pandas and NumPy. Follow the instructions below:

The aim of this challenge is to get your hands dirty on writing code using Dask. This is why there is no minimum or maximum limit on the size of the dataset that you can use.

Use Dask counterparts to replicate all the data-cleaning and machine-learning parts of your supervised learning project. In other words, do the following:

*   Instead of pandas DataFrames, use Dask DataFrames whenever possible.
*   Instead of NumPy arrays, use Dask arrays whenever possible.
*   Use Dask to parallelize your model trainings.

Then, submit your work on Github and share the link below. Good luck!



In [1]:
!pip install dask-ml



In [2]:
!pip install --upgrade "dask[complete]"

Requirement already up-to-date: dask[complete] in /Users/jscott/opt/anaconda3/lib/python3.8/site-packages (2021.3.0)


In [3]:
!pip install aiohttp



In [4]:
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

0,1
Client  Scheduler: tcp://127.0.0.1:60038  Dashboard: http://127.0.0.1:60037/status,Cluster  Workers: 4  Cores: 8  Memory: 8.00 GB


# Capstone 2: Supervised learning
### Mushroom classification model
Dataset: Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981).


### 1. Go out and find a dataset of interest. It could be from one of the recommended resources or some other aggregation. Or it could be something that you scraped yourself. Just make sure that it has lots of variables, including an outcome of interest to you.

URL: https://archive.ics.uci.edu/ml/datasets/Mushroom

Dataset information:

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.

In [5]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
# Load the dataset
import dask.dataframe as dd

mushroom_df = dd.read_csv('/Users/jscott/Thinkful Data Science Projects/big data challenge/agaricus-lepiota.csv')
mushroom_df.head()

Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,...,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g


### 2. Explore the data. Get to know the data. Spend a lot of time going over its quirks. You should understand how it was gathered, what's in it, and what the variables look like.

In [7]:
# Data inspection & cleaning
mushroom_df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 23 entries, p to u
dtypes: object(23)

In [8]:
# Print unique values for each column and map to attribute info: https://archive.ics.uci.edu/ml/datasets/Mushroom
for column in mushroom_df.columns:
  print(f'{column}:', mushroom_df[column].unique(), '\n')

p: Dask Series Structure:
npartitions=1
    object
       ...
Name: p, dtype: object
Dask Name: unique-agg, 4 tasks 

x: Dask Series Structure:
npartitions=1
    object
       ...
Name: x, dtype: object
Dask Name: unique-agg, 4 tasks 

s: Dask Series Structure:
npartitions=1
    object
       ...
Name: s, dtype: object
Dask Name: unique-agg, 4 tasks 

n: Dask Series Structure:
npartitions=1
    object
       ...
Name: n, dtype: object
Dask Name: unique-agg, 4 tasks 

t: Dask Series Structure:
npartitions=1
    object
       ...
Name: t, dtype: object
Dask Name: unique-agg, 4 tasks 

p.1: Dask Series Structure:
npartitions=1
    object
       ...
Name: p.1, dtype: object
Dask Name: unique-agg, 4 tasks 

f: Dask Series Structure:
npartitions=1
    object
       ...
Name: f, dtype: object
Dask Name: unique-agg, 4 tasks 

c: Dask Series Structure:
npartitions=1
    object
       ...
Name: c, dtype: object
Dask Name: unique-agg, 4 tasks 

n.1: Dask Series Structure:
npartitions=1
    object

In [9]:
# Rename columns based on attribute info mapping
renamed_columns = {'p': 'poisonous',
                   'x': 'cap_shape',
                   's': 'cap_surface',
                   'n': 'cap_color',
                   't': 'bruises',
                   'p.1': 'odor',
                   'f': 'gill_attachment',
                   'c': 'gill_spacing',
                   'n.1': 'gill_size',
                   'k': 'gill_color',
                   'e': 'stalk_shape',
                   'e.1': 'stalk_root',
                   's.1': 'stalk_surface_above_ring',
                   's.2': 'stalk_surface_below_ring',
                   'w': 'stalk_color_above_ring',
                   'w.1': 'stalk_color_below_ring',
                   'p.2': 'veil_type',
                   'w.2': 'veil_color',
                   'o': 'ring_number',
                   'p.3': 'ring_type',
                   'k.1': 'spore_print_color',
                   's.3': 'population',
                   'u': 'habitat'}

mushroom_df = mushroom_df.rename(columns=renamed_columns)
mushroom_df.head()

Unnamed: 0,poisonous,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g


In [10]:
# Convert all columns to dummies
from dask_ml.preprocessing import DummyEncoder

de = DummyEncoder()
dummies = de.fit_transform(mushroom_df.categorize()).compute()
dummies_dask = dd.from_pandas(dummies, npartitions = 2)
dummies_dask.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 119 entries, poisonous_e to habitat_l
dtypes: uint8(119)

In [11]:
dummies_dask.compute()

Unnamed: 0,poisonous_e,poisonous_p,cap_shape_x,cap_shape_b,cap_shape_s,cap_shape_f,cap_shape_k,cap_shape_c,cap_surface_s,cap_surface_y,...,population_v,population_y,population_c,habitat_g,habitat_m,habitat_u,habitat_d,habitat_p,habitat_w,habitat_l
0,1,0,1,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,1,0,1,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,1,0,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8118,1,0,0,0,0,0,1,0,1,0,...,0,0,1,0,0,0,0,0,0,1
8119,1,0,1,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
8120,1,0,0,0,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,0,1
8121,0,1,0,0,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1


In [12]:
mushroom_df.categorize().info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 23 entries, poisonous to habitat
dtypes: category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(1), category(2), category(1), category(1), category(2), category(1), category(1), category(1)

In [13]:
# Correlation coefficient analysis between the features and target
columns = dummies_dask.columns
correlations = np.abs(dummies_dask[columns].iloc[:,1:].corr().loc[:,'poisonous_p'])
correlations.head(21)

poisonous_p      1.000000
cap_shape_x      0.027031
cap_shape_b      0.182548
cap_shape_s      0.060660
cap_shape_f      0.018629
cap_shape_k      0.163620
cap_shape_c      0.023012
cap_surface_s    0.095285
cap_surface_y    0.088791
cap_surface_f    0.195352
cap_surface_g    0.023012
cap_color_y      0.113072
cap_color_w      0.133644
cap_color_g      0.046391
cap_color_n      0.044574
cap_color_e      0.097181
cap_color_p      0.034722
cap_color_b      0.067567
cap_color_u      0.042851
cap_color_c      0.030903
cap_color_r      0.042851
Name: poisonous_p, dtype: float64

### 3. Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power, and experiment with both.

In [48]:
Y = dummies_dask.poisonous_p
X = dummies_dask.drop(columns=['poisonous_e', 'poisonous_p'])

from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X.compute(), Y.compute(), test_size=0.2)

In [49]:
# Potential classification models: 

# 1. Logistic regression
# 2. KNN classifier
# 3. Random forest model
# 4. Support vector classifier
# 5. Gradient boosting classifier

In [50]:
# 1. Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import joblib
import time

time_start = time.time()

lr_clf = LogisticRegression()
with joblib.parallel_backend('dask'):
    lr_clf.fit(X_train, y_train)

print('Time elapsed: {} seconds\n'.format(time.time()-time_start))

preds_train = lr_clf.predict(X_train.values)
preds_test = lr_clf.predict(X_test.values)

print("Logistic regression training score is: ", roc_auc_score(preds_train, y_train.values))
print("Logistic regression test score is: ", roc_auc_score(preds_test, y_test.values))

Time elapsed: 0.4138481616973877 seconds

Logistic regression training score is:  1.0
Logistic regression test score is:  1.0


In [51]:
# 2. KNN classifier
from sklearn.neighbors import KNeighborsClassifier

time_start = time.time()

knn_clf = KNeighborsClassifier()
with joblib.parallel_backend('dask'):
    knn_clf.fit(X_train, y_train)

print('Time elapsed: {} seconds\n'.format(time.time()-time_start))

preds_train = knn_clf.predict(X_train.values)
preds_test = knn_clf.predict(X_test.values)

print("KNN training score is: ", roc_auc_score(preds_train, y_train.values))
print("KNN test score is: ", roc_auc_score(preds_test, y_test.values))

Time elapsed: 0.061654090881347656 seconds

KNN training score is:  1.0
KNN test score is:  1.0


In [53]:
# 3. Random forest classifier
from sklearn.ensemble import RandomForestClassifier

time_start = time.time()

rf_clf = RandomForestClassifier()
with joblib.parallel_backend('dask'):
    rf_clf.fit(X_train, y_train)

print('Time elapsed: {} seconds\n'.format(time.time()-time_start))

preds_train = rf_clf.predict(X_train.values)
preds_test = rf_clf.predict(X_test.values)

print("Random forest training score is: ", roc_auc_score(preds_train, y_train.values))
print("Random forest test score is: ", roc_auc_score(preds_test, y_test.values))

Time elapsed: 0.48662614822387695 seconds

Random forest training score is:  1.0
Random forest test score is:  1.0


In [55]:
# 4. Support vector classifer
from sklearn.svm import SVC

time_start = time.time()

svm_clf = SVC(kernel = 'linear')
with joblib.parallel_backend('dask'):
    svm_clf.fit(X_train, y_train)

print('Time elapsed: {} seconds\n'.format(time.time()-time_start))

preds_train = svm_clf.predict(X_train.values)
preds_test = svm_clf.predict(X_test.values)

print("Random forest training score is: ", roc_auc_score(preds_train, y_train.values))
print("Random forest test score is: ", roc_auc_score(preds_test, y_test.values))

Time elapsed: 0.25940608978271484 seconds

Random forest training score is:  1.0
Random forest test score is:  1.0


In [57]:
# 5. Gradient boosting classifier
from sklearn import ensemble

time_start = time.time()

gb_clf = SVC(kernel = 'linear')
with joblib.parallel_backend('dask'):
    gb_clf.fit(X_train, y_train)

print('Time elapsed: {} seconds\n'.format(time.time()-time_start))

preds_train = gb_clf.predict(X_train.values)
preds_test = gb_clf.predict(X_test.values)

print("Random forest training score is: ", roc_auc_score(preds_train, y_train.values))
print("Random forest test score is: ", roc_auc_score(preds_test, y_test.values))

Time elapsed: 0.23425889015197754 seconds

Random forest training score is:  1.0
Random forest test score is:  1.0


Based on these results, the models all perform with an accuracy of 1.0 for training and test data.
Therefore we will base the model performance on their processing time:

The models are ranked below by processing time:

1. [0.062 s] KNN classifier
2. [0.234 s] Gradient boosting classifier
3. [0.259 s] Support vector classifier
4. [0.414 s] Logistic regression classifier
5. [0.487 s] Random forest classifier