# Machine Learning Nanodegree

## Capstone Project: Instacart Market Basket Analysis
### Which products will an Instacart consumer purchase again?

The dataset for this challenge is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release.

### The Road Ahead

We break the notebook into separate steps. Feel free to use the links below to navigate the notebook.

* [Step 0](#step0): Import Datasets
* [Step 1](#step1): Data Exploration
* [Step 2](#step2): Exploratory Visualizations
* [Step 3](#step3): Data Preprocessing
* [Step 4](#step4): Benchmarks
* [Step 5](#step5): Algorithm and Techniques
* [Step 6](#step6): Model Refinements
* [Step 7](#step7): Model Evaluation and Validation

---
<a id='step0'></a>
## Step 0: Import Datasets

In [0]:
### Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

%matplotlib inline

In [0]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# 2. Download each dataset file from Google Drive by its file ID and save it locally.
drive_files = {
    'order_products__train.csv': '1L05r3qtAhmzFfNazo8xfKkKbWRwQhODZ',
    'order_products__prior.csv': '14sI0oP8FXxfYd_0OmaAPfqPSrrRyvoIH',
    'orders.csv': '1dSIrVAQ5delsaDYbjdB7gzsxTCf792kf',
    'products.csv': '1F4sDO7oTimeDNrI2FcaEkx9WnF7_n1y5',
    'aisles.csv': '1b122CO2v4on8ixfE-8Cc6g4RHcARiTvn',
    'departments.csv': '1YNGRUqCG9slq8fPwjwAFb0WwuzBbJVE8',
}

for filename, file_id in drive_files.items():
    downloaded = drive.CreateFile({'id': file_id})
    downloaded.GetContentFile(filename)  # writes the file to the Colab working directory

In [3]:
### Import Instacart Data
order_products_train_df = pd.read_csv("order_products__train.csv")
order_products_prior_df = pd.read_csv("order_products__prior.csv")
orders_df = pd.read_csv("orders.csv")
products_df = pd.read_csv("products.csv")
aisles_df = pd.read_csv("aisles.csv")
departments_df = pd.read_csv("departments.csv")

print('Total no. of orders: {}'.format(orders_df.shape[0]))
print('Total no. of products: {}'.format(products_df.shape[0]))
print('Total no. of aisles: {}'.format(aisles_df.shape[0]))
print('Total no. of departments: {}'.format(departments_df.shape[0]))

Total no. of orders: 3421083
Total no. of products: 49688
Total no. of aisles: 134
Total no. of departments: 21


---
<a id='step1'></a>
## Step 1: Data Exploration

orders_df records which set (prior, train, or test) each order belongs to. We will predict reordered items only for the orders in the 'test' set.

In [0]:
orders_df.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [0]:
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [0]:
order_cnt = orders_df.groupby("eval_set").user_id.nunique().reset_index(name='total_user')
order_cnt


Unnamed: 0,eval_set,total_user
0,prior,206209
1,test,75000
2,train,131209
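
The counts above are consistent with every user having prior orders plus one final order that falls in either the train or the test set (131,209 + 75,000 = 206,209, the number of users with prior orders). A minimal sketch of how that could be checked, using a small toy frame with hypothetical values in place of orders_df:

```python
import pandas as pd

# Toy stand-in for orders_df (hypothetical values, same column names as the real file)
toy_orders = pd.DataFrame({
    "order_id":     [10, 11, 12, 20, 21, 22],
    "user_id":      [1, 1, 1, 2, 2, 2],
    "eval_set":     ["prior", "prior", "train", "prior", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2, 3],
})

# Each user's final order is the one with the highest order_number
last_orders = toy_orders.loc[toy_orders.groupby("user_id")["order_number"].idxmax()]

# Users whose final order is labelled 'train' or 'test'
train_users = set(last_orders.loc[last_orders.eval_set == "train", "user_id"])
test_users = set(last_orders.loc[last_orders.eval_set == "test", "user_id"])

# The two groups should not overlap and should together cover every user
print(train_users.isdisjoint(test_users))
print((train_users | test_users) == set(toy_orders.user_id))
```

Running the same three steps on the real orders_df would confirm or refute the assumption for the full dataset.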


These data frames (order_products_prior_df and order_products_train_df) specify which products were purchased in each order. order_products_prior_df contains the contents of previous orders for all customers.

'reordered' indicates that the customer has a previous order that contains the product. Some orders will have no reordered items, so we may predict an explicit 'None' value for those orders.
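
As a rough illustration of the explicit 'None' idea, the sketch below turns per-row predictions into one string per order. The column name predicted_reordered, the helper products_or_none, and the space-separated output are assumptions made for this example, not something defined by the dataset:

```python
import pandas as pd

# Hypothetical per-row predictions: 1 means "predicted to be reordered"
toy_pred = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2],
    "product_id": [49302, 11109, 10246, 33120, 9327],
    "predicted_reordered": [1, 1, 0, 0, 0],
})

def products_or_none(group):
    """Join the predicted product ids of one order, or return 'None' if there are none."""
    chosen = group.loc[group.predicted_reordered == 1, "product_id"]
    return " ".join(str(p) for p in chosen) if len(chosen) else "None"

# One entry per order; orders without any predicted reorder get the literal 'None'
basket = {order_id: products_or_none(g) for order_id, g in toy_pred.groupby("order_id")}
print(basket)
```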

In [0]:
order_products_prior_df.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,32434490.0,32434490.0,32434490.0,32434490.0
mean,1710749.0,25576.34,8.351076,0.5896975
std,987300.7,14096.69,7.126671,0.4918886
min,2.0,1.0,1.0,0.0
25%,855943.0,13530.0,3.0,0.0
50%,1711048.0,25256.0,6.0,1.0
75%,2565514.0,37935.0,11.0,1.0
max,3421083.0,49688.0,145.0,1.0


In [0]:
order_products_prior_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [0]:
order_products_train_df.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,1384617.0,1384617.0,1384617.0,1384617.0
mean,1706298.0,25556.24,8.758044,0.5985944
std,989732.6,14121.27,7.423936,0.4901829
min,1.0,1.0,1.0,0.0
25%,843370.0,13380.0,3.0,0.0
50%,1701880.0,25298.0,7.0,1.0
75%,2568023.0,37940.0,12.0,1.0
max,3421070.0,49688.0,80.0,1.0


In [0]:
order_products_train_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [0]:
products_df.describe()

Unnamed: 0,product_id,aisle_id,department_id
count,49688.0,49688.0,49688.0
mean,24844.5,67.769582,11.728687
std,14343.834425,38.316162,5.85041
min,1.0,1.0,1.0
25%,12422.75,35.0,7.0
50%,24844.5,69.0,13.0
75%,37266.25,100.0,17.0
max,49688.0,134.0,21.0


In [0]:
print(products_df.head())
products_df.tail()

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  
0             19  
1             13  
2              7  
3              1  
4             13  


Unnamed: 0,product_id,product_name,aisle_id,department_id
49683,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5
49684,49685,En Croute Roast Hazelnut Cranberry,42,1
49685,49686,Artisan Baguette,112,3
49686,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8
49687,49688,Fresh Foaming Cleanser,73,11


In [0]:
aisles_df.describe()

Unnamed: 0,aisle_id
count,134.0
mean,67.5
std,38.826537
min,1.0
25%,34.25
50%,67.5
75%,100.75
max,134.0


In [0]:
print(aisles_df.head())
print(aisles_df.tail())

   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation
     aisle_id                       aisle
129       130    hot cereal pancake mixes
130       131                   dry pasta
131       132                      beauty
132       133  muscles joints pain relief
133       134  specialty wines champagnes


In [0]:
departments_df.describe()

Unnamed: 0,department_id
count,21.0
mean,11.0
std,6.204837
min,1.0
25%,6.0
50%,11.0
75%,16.0
max,21.0


In [0]:
departments_df

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol
5,6,international
6,7,beverages
7,8,pets
8,9,dry goods pasta
9,10,bulk


Since the products, aisles and departments data frames are related to each other through ID keys, we can merge them in two nested steps into a single data frame (pad_df, for products, aisles and departments) to simplify later work:

In [4]:
# Attach department names first, then aisle names, keeping every product row
pad_df = (products_df
          .merge(departments_df, how='left', on='department_id')
          .merge(aisles_df, how='left', on='aisle_id'))
print(pad_df.describe())
pad_df.head()

         product_id      aisle_id  department_id
count  49688.000000  49688.000000   49688.000000
mean   24844.500000     67.769582      11.728687
std    14343.834425     38.316162       5.850410
min        1.000000      1.000000       1.000000
25%    12422.750000     35.000000       7.000000
50%    24844.500000     69.000000      13.000000
75%    37266.250000    100.000000      17.000000
max    49688.000000    134.000000      21.000000


Unnamed: 0,product_id,product_name,aisle_id,department_id,department,aisle
0,1,Chocolate Sandwich Cookies,61,19,snacks,cookies cakes
1,2,All-Seasons Salt,104,13,pantry,spices seasonings
2,3,Robust Golden Unsweetened Oolong Tea,94,7,beverages,tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,frozen,frozen meals
4,5,Green Chile Anytime Sauce,5,13,pantry,marinades meat preparation
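
A left merge like the one above only keeps one row per product if the keys in the lookup tables are unique. The sketch below shows one way to guard against that, using pandas' validate argument on toy tables with hypothetical rows:

```python
import pandas as pd

# Toy lookup tables (hypothetical rows, same key columns as the real files)
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["cookies", "salt", "tea"],
    "aisle_id": [61, 104, 94],
    "department_id": [19, 13, 7],
})
toy_departments = pd.DataFrame({"department_id": [7, 13, 19],
                                "department": ["beverages", "pantry", "snacks"]})
toy_aisles = pd.DataFrame({"aisle_id": [61, 94, 104],
                           "aisle": ["cookies cakes", "tea", "spices seasonings"]})

# validate='many_to_one' raises MergeError if a key is duplicated in the lookup table
toy_pad = (toy_products
           .merge(toy_departments, on="department_id", how="left", validate="many_to_one")
           .merge(toy_aisles, on="aisle_id", how="left", validate="many_to_one"))

# A left merge against unique keys should keep exactly one row per product
print(len(toy_pad) == len(toy_products))
# Number of products left without a department or aisle name
print(toy_pad[["department", "aisle"]].isna().sum().sum())
```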


<a id='step2'></a>
## Step 2: Exploratory Visualizations

In [0]:
# Assumed definition (inferred from the x-axis label): number of users per maximum order number
cnt_srs = orders_df.groupby("user_id")["order_number"].max().value_counts().sort_index()

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Maximum order number', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

---
<a id='step3'></a>
## Step 3: Data Preprocessing


### Pre-process the Data

We start by sorting the orders by user_id and, within each user, by order_number.

In [5]:
orders_df.sort_values(by=['user_id', 'order_number'], inplace=True)
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [6]:
# Replace NaN with the column mean (assigning back avoids in-place edits on a column view)
orders_df['days_since_prior_order'] = orders_df['days_since_prior_order'].fillna(orders_df['days_since_prior_order'].mean())
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,11.114836
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
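
The describe() output in Step 1 shows 3,421,083 - 3,214,874 = 206,209 missing values in days_since_prior_order, which matches the number of users, and the NaN in the output above sits on order_number 1. Mean imputation hides that signal, so one option is to record it in a flag first. A minimal sketch on toy data, where the column name is_first_order is an assumption for this example:

```python
import numpy as np
import pandas as pd

# Toy stand-in for orders_df (hypothetical values)
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# Record which rows were missing before imputing, so the information is not lost
toy_orders["is_first_order"] = toy_orders["days_since_prior_order"].isna().astype(int)

# Mean imputation as in the cell above (Series.mean skips NaN by default)
mean_gap = toy_orders["days_since_prior_order"].mean()
toy_orders["days_since_prior_order"] = toy_orders["days_since_prior_order"].fillna(mean_gap)
print(toy_orders)
```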


During exploration we discovered one department and one aisle labeled 'missing'. Let's look at them more closely.

In [7]:
print(departments_df[departments_df.department == 'missing'])
missing_prod_df = products_df[products_df.department_id == 21].reset_index()
print(len(missing_prod_df), "products are associated with department='missing'")

# Check for any aisles labeled 'missing'
print(aisles_df[aisles_df.aisle == 'missing'])

missing_prod_df.head()

    department_id department
20             21    missing
1258 products are associated with department='missing'
    aisle_id    aisle
99       100  missing


Unnamed: 0,index,product_id,product_name,aisle_id,department_id
0,37,38,Ultra Antibacterial Dish Liquid,100,21
1,71,72,Organic Honeycrisp Apples,100,21
2,109,110,Uncured Turkey Bologna,100,21
3,296,297,"Write Bros Ball Point Pens, Cap-Pen, Medium (1...",100,21
4,416,417,Classics Baby Binks Easter Chocolate Bunny,100,21


Since department and aisle are categorical variables, we keep 'missing' as a category of its own for these products rather than imputing a value, and store the affected products in a separate data frame, missing_prod_df.

Next we'll try to merge and consolidate different data sets into a single data structure to be ready for fitting a model.

In [9]:
orders_train_df = pd.merge(orders_df, order_products_train_df, on='order_id')
print(orders_df.describe())
orders_train_df.describe()

           order_id       user_id  order_number     order_dow  \
count  3.421083e+06  3.421083e+06  3.421083e+06  3.421083e+06   
mean   1.710542e+06  1.029782e+05  1.715486e+01  2.776219e+00   
std    9.875817e+05  5.953372e+04  1.773316e+01  2.046829e+00   
min    1.000000e+00  1.000000e+00  1.000000e+00  0.000000e+00   
25%    8.552715e+05  5.139400e+04  5.000000e+00  1.000000e+00   
50%    1.710542e+06  1.026890e+05  1.100000e+01  3.000000e+00   
75%    2.565812e+06  1.543850e+05  2.300000e+01  5.000000e+00   
max    3.421083e+06  2.062090e+05  1.000000e+02  6.000000e+00   

       order_hour_of_day  days_since_prior_order  
count       3.421083e+06            3.421083e+06  
mean        1.345202e+01            1.111484e+01  
std         4.226088e+00            8.924952e+00  
min         0.000000e+00            0.000000e+00  
25%         1.000000e+01            5.000000e+00  
50%         1.300000e+01            8.000000e+00  
75%         1.600000e+01            1.500000e+01  
max   

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
count,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0
mean,1706298.0,103112.8,17.09141,2.701392,13.57759,17.06613,25556.24,8.758044,0.5985944
std,989732.6,59487.15,16.61404,2.167646,4.238458,10.42642,14121.27,7.423936,0.4901829
min,1.0,1.0,4.0,0.0,0.0,0.0,1.0,1.0,0.0
25%,843370.0,51732.0,6.0,1.0,10.0,7.0,13380.0,3.0,0.0
50%,1701880.0,102933.0,11.0,3.0,14.0,15.0,25298.0,7.0,1.0
75%,2568023.0,154959.0,21.0,5.0,17.0,30.0,37940.0,12.0,1.0
max,3421070.0,206209.0,100.0,6.0,23.0,30.0,49688.0,80.0,1.0


In [10]:
orders_train_pad_df = pd.merge(orders_train_df, pad_df, on='product_id')
del orders_train_df
del order_products_train_df
orders_train_pad_df.describe()


Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id
count,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0,1384617.0
mean,1706298.0,103112.8,17.09141,2.701392,13.57759,17.06613,25556.24,8.758044,0.5985944,71.30423,9.839777
std,989732.6,59487.15,16.61404,2.167646,4.238458,10.42642,14121.27,7.423936,0.4901829,38.10409,6.29283
min,1.0,1.0,4.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0
25%,843370.0,51732.0,6.0,1.0,10.0,7.0,13380.0,3.0,0.0,31.0,4.0
50%,1701880.0,102933.0,11.0,3.0,14.0,15.0,25298.0,7.0,1.0,83.0,8.0
75%,2568023.0,154959.0,21.0,5.0,17.0,30.0,37940.0,12.0,1.0,107.0,16.0
max,3421070.0,206209.0,100.0,6.0,23.0,30.0,49688.0,80.0,1.0,134.0,21.0


In [0]:
order_prior_df = pd.merge(order_products_prior_df, orders_df, on='order_id', how='left')

In [0]:
orders_prior_pad_df = pd.merge(order_prior_df, pad_df, on='product_id')
#del order_prior_df

In [0]:
del order_prior_df
del order_products_prior_df
orders_prior_pad_df.describe()

In [0]:
dfFullInfo = pd.concat([orders_prior_pad_df, orders_train_pad_df])

del orders_prior_pad_df
del orders_train_pad_df

dfFullInfo.head()

In [0]:
dfFullInfo.sort_values(['user_id','order_number','eval_set'],inplace=True)

In [0]:
columns = ['user_id','order_number','order_id','product_id','product_name','reordered',
           'department_id','department','aisle_id','aisle','add_to_cart_order',
           'days_since_prior_order','order_dow','order_hour_of_day','eval_set']
dfFullInfo = dfFullInfo[columns]
dfFullInfo.head()

In [0]:
##### Look at distribution of order timing throughout the day #####
order_hour_of_day_range = np.amax(dfFullInfo['order_hour_of_day']) - np.amin(dfFullInfo['order_hour_of_day']) + 1

dfFullInfo.hist(column='order_hour_of_day',bins=2*order_hour_of_day_range-1,
                figsize=(12,7),color='blue', histtype='bar')

plt.xlabel('Hour of Day',fontsize=15)
plt.ylabel('Number of Orders',fontsize=15)

In [0]:
##### Look at distribution of order timing throughout the week #####
order_dow_range = np.amax(dfFullInfo['order_dow']) - np.amin(dfFullInfo['order_dow']) + 1
dfFullInfo.hist(column='order_dow',bins=2*order_dow_range-1,
                figsize=(12,7),color='blue', histtype='bar')

plt.xlabel('Day of Week',fontsize=15)
plt.ylabel('Number of Orders',fontsize=15)

In [0]:
##### Look at how many prior orders we have #####
order_number_range = np.amax(dfFullInfo['order_number']) - np.amin(dfFullInfo['order_number']) + 1
dfFullInfo[dfFullInfo['eval_set']=='prior'].groupby('order_number').agg(len).plot(figsize=(12,7),color='blue',
                                                                               legend=False,grid=True,
                                                                               xticks=range(0,order_number_range+1,10))

plt.xlabel('Order Number',fontsize=15)
plt.ylabel('Count',fontsize=15)

In [0]:
##### Look at relationship between reordering of a product and its position in the cart #####
add_to_cart_order_range = np.amax(dfFullInfo['add_to_cart_order']) - np.amin(dfFullInfo['add_to_cart_order']) + 1
dfFullInfo[dfFullInfo['reordered']==1].groupby('add_to_cart_order').agg(len).plot(figsize=(12,7),color='blue',
                                                                               legend=False,grid=True,
                                                                               xticks=range(0,add_to_cart_order_range+1,10))

plt.xlabel('Add to Cart Order',fontsize=15)
plt.ylabel('Count of Reordered',fontsize=15)

In [0]:
##### Look at distribution of reordered and non-reordered entries in dataset #####
dfFullInfo.hist(column='reordered', figsize=(12,7), histtype='bar', bins=[-0.1,0.1,0.9,1.1], grid=False)

plt.xlabel('Reordered',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.xticks([0,1])

In [0]:
##### Separate Features and Targets #####
dfFeatures = dfFullInfo[['user_id', 'order_number', 'days_since_prior_order', 'order_dow', 'order_hour_of_day', 'product_id', 'eval_set']]
dfTarget = dfFullInfo[['reordered', 'eval_set']]

In [0]:
##### Use prior orders to train models, and train orders for model testing and evaluation #####
##### eval_set is dropped while slicing; this returns new DataFrames and avoids SettingWithCopyWarning #####
Xtrain = dfFeatures.loc[dfFeatures.eval_set=='prior',:].drop(columns='eval_set')
ytrain = dfTarget.loc[dfTarget.eval_set=='prior',:].drop(columns='eval_set')
Xtest = dfFeatures.loc[dfFeatures.eval_set=='train',:].drop(columns='eval_set')
ytest = dfTarget.loc[dfTarget.eval_set=='train',:].drop(columns='eval_set')

In [0]:
print(Xtrain.shape,ytrain.shape)

In [0]:
##### Try Gaussian Naive Bayes (GNB) #####

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

GNB = GaussianNB()
GNB.fit(Xtrain.iloc[0:5000000,:], np.ravel(ytrain.iloc[0:5000000,:]))
Predictions = GNB.predict(Xtest)

score = f1_score(np.ravel(ytest),Predictions)
print("F1 Score: %.4f"%score)
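
For reference, the F1 score reported above is the harmonic mean of precision and recall for the positive class (reordered = 1). The sketch below works it out by hand on a few hypothetical labels and compares the result with scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = reordered, 0 = not reordered
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# Count the outcomes for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# The manual value should agree with scikit-learn's implementation
print(f1_manual, f1_score(y_true, y_pred))
```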

In [0]:
##### Try Support Vector Classifier (SVC) with GridSearch and 5-fold Crossvalidation #####
from sklearn.svm import SVC

# Use a smaller subset of the dataset for feasible runtimes (the score doesn't vary much beyond 8,000-10,000 entries)
x = 15000

SVC_model = SVC(random_state=47)

params = {'kernel': ['rbf'],
         'shrinking': [True, False]}

grid = GridSearchCV(SVC_model, params, verbose=1, cv=5)

print("Fitting Model")
grid.fit(Xtrain.iloc[0:x,:], np.ravel(ytrain.iloc[0:x,:]))
#SVC_model.fit(Xtrain.iloc[0:x,:], np.ravel(ytrain.iloc[0:x,:]))
print("Fitted, Predicting")
Predictions = grid.predict(Xtest.iloc[0:x,:])
#Predictions = SVC_model.predict(Xtest.iloc[0:x,:])
score = f1_score(ytest.iloc[0:x,:],Predictions)
print("F1 Score: %.4f"%score)
print(grid.best_params_)
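
Two details of the search above may be worth revisiting. GridSearchCV selects parameters using the estimator's default score (accuracy for SVC) unless scoring is set, while the cell reports F1. An RBF kernel is also sensitive to feature scale. The sketch below shows one possible variant on synthetic stand-in data. The pipeline, the C grid and the data are assumptions for illustration, not the configuration used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real features would come from Xtrain / ytrain
X, y = make_classification(n_samples=400, n_features=6, random_state=47)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=47)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=47))

# 'svc__C' addresses the C parameter of the SVC step; the grid values are illustrative
params = {"svc__C": [0.1, 1, 10]}

# scoring='f1' makes the search select on the same metric that is reported afterwards
grid_f1 = GridSearchCV(pipe, params, scoring="f1", cv=5)
grid_f1.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, grid_f1.predict(X_te))
print(grid_f1.best_params_)
print("F1 Score: %.4f" % test_f1)
```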

In [0]:
##### Try K-Nearest-Neighbors (KNN) with GridSearch and 5-fold Crossvalidation #####
from sklearn.neighbors import KNeighborsClassifier

params = {'n_neighbors': range(10,15),
         'weights': ['uniform','distance'],
         'p': range(1,4)}

neigh = KNeighborsClassifier()

grid = GridSearchCV(neigh, params, verbose=1, cv=5)
print("Fitting Model")
grid.fit(Xtrain.iloc[0:x,:], np.ravel(ytrain.iloc[0:x,:]))
print("Fitted, Predicting")
Predictions = grid.predict(Xtest.iloc[0:x,:])
score = f1_score(ytest.iloc[0:x,:],Predictions)
print("F1 Score: %.4f"%score)
print(grid.best_params_)
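
Because dfFullInfo was sorted by user_id earlier, slices such as iloc[0:x] contain the orders of only the first few users. A random subset may be more representative. The sketch below illustrates the difference on a toy frame with hypothetical values, drawing positions with NumPy so that the same positions could index both the features and the targets:

```python
import numpy as np
import pandas as pd

# Toy frame sorted by user_id, mimicking the sort applied to dfFullInfo (hypothetical values)
toy = pd.DataFrame({"user_id": np.repeat(np.arange(1, 101), 10),
                    "reordered": np.tile([0, 1], 500)})

n = 50
head_subset = toy.iloc[0:n]  # first n rows: only the first few users

# Draw n distinct row positions at random; reuse `pos` for features and targets alike
rng = np.random.default_rng(47)
pos = rng.choice(len(toy), size=n, replace=False)
random_subset = toy.iloc[pos]

print(head_subset.user_id.nunique())
print(random_subset.user_id.nunique())
```

Positional indexing is used here on purpose: a frame built with pd.concat can carry repeated index labels, so label-based selection could return more rows than intended.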

In [0]:
##### Sweep Subset Size and show progression of F1 Score and Mean Fit Time #####

SVC_model = []
Predictions = []
score = []
fit_times = []
xRange = range(1000,16000,1000)
params = {'kernel': ['rbf'],
          'shrinking': [True]}

for i, x in enumerate(xRange):
    SVC_model.append(GridSearchCV(SVC(random_state=47),params,cv=5,verbose=1))
    print("Fitting Model %d with %d samples"%(i,x))
    SVC_model[i].fit(Xtrain.iloc[0:x,:], np.ravel(ytrain.iloc[0:x,:]))
    # The grid holds a single parameter combination, so store its mean fit time as a scalar
    fit_times.append(SVC_model[i].cv_results_['mean_fit_time'][0])
    print("Fitted, Predicting")
    Predictions.append(SVC_model[i].predict(Xtest.iloc[0:x,:]))
    score.append(f1_score(np.ravel(ytest.iloc[0:x,:]),Predictions[i]))
    print("F1 Score: %.4f"%score[i])

In [0]:
##### Visualize F1 Score and Mean Fit Time Progression #####
fig, ax1 = plt.subplots(figsize=(12,7))
ax1.plot(xRange, score, 'b-')
ax1.set_xlabel('Number of Training Samples', fontsize = 15)
ax1.set_ylabel('F1 Score', color='b', fontsize = 15)
ax1.tick_params('y', colors='b')
ax1.grid()

ax2 = ax1.twinx()
ax2.plot(xRange, fit_times, 'r-')
ax2.set_ylabel('Mean Fit Time (s)', color='r', fontsize = 15)
ax2.tick_params('y', colors='r')
fig.tight_layout()

---
<a id='step4'></a>
## Step 4: Benchmarks

---
<a id='step5'></a>
## Step 5: Algorithm and Techniques

PCA?

---
<a id='step6'></a>
## Step 6: Model Refinements


---
<a id='step7'></a>
## Step 7: Model Evaluation and Validation
