# Welcome to Jupyter Notebook and Python

For many of you, this will be your first time "programming," at least in a traditional language, though you likely have used similar ideas and concepts in constructing Excel worksheets, particularly if you use VBA. While the idea of programming may be intimidating, modern tools and software have made it very accessible. For example, you may have had visions of strings of ones and zeros scrolling by Matrix fashion as you program, but the reality is that most of programming (particularly in data science) looks like constructing mathematical formulae using something approaching natural language.

We are going to be using the programming language **Python** for this exercise, the standard programming language in data science. We will also be using several auxiliary, industry-standard packages that are commonly used in the data science workflow.

The interface that we are using is a *Jupyter Notebook*. This is a piece of software that can run either locally on your own computer or on a server in the cloud, but either way you access it through your browser. It provides a user-friendly interface for working with Python as well as other programming languages commonly used in data science workflows.

A Jupyter Notebook consists of *cells*, which can contain either Python code or written text and images. What you are reading now is the latter kind of cell. You progress through a Jupyter Notebook by *running* the code cells, which just executes the code in that cell. The image below shows the difference between the two:

<img src="images/1 - Starting Jupyter Notebook.png" width="800"/>


When a cell has run, you will know it worked because there will now be a number next to the code cell:

<img src="images/4 - Number from Code Cell.png" width="800"/>


For this exercise, we will primarily be using some user interface controls that I have built to make things a little more accessible. You will need to run the code in order to get the controls to show up, and the decisions you make will directly correspond to the common decisions that a data scientist must make when analyzing a data set. This will allow you to progress with no knowledge of programming.

Instead of running all the cells one by one, this notebook has been set up for you to run the entire notebook and then just interact with the controls I have built. To do this, you will click on "Run" in the toolbar and then click on "Run all Cells". See picture below:

<img src="images/Run all cells.png" width="800"/>


You will know this worked because you will see numbers next to all of the code cells. From here on out, you shouldn't need to run any additional cells. Just go through the notebook and the interactive components should be obvious.

**If something seems broken or hung up with the notebook, you can use the option "Restart Kernel and Run All Cells..." displayed in the image above to completely reset the notebook (but you will have to start again).** One way in which this might happen is if you choose too many variables when training a model (the logistic regression will take a long time to finish if you have more than 1,000 columns; this will make sense shortly). If something is taking forever to run, "Restart Kernel and Run All Cells..." will fix the problem, but you will have to recreate your data set.

Good luck, data scientist!

# Import packages

At its core, Python is a relatively simple set of operations that define the *programming language*. You can use these *primitives* to build more complicated functionality. While this is powerful, it is also very complicated, so fortunately, we do not have to do this ourselves!

Instead, we will be using *open source* software (i.e. software that has been written and freely distributed by a mixture of volunteers and companies that use the software and contribute to its development) that adds additional functionality. This software, while it is free, is also industry standard. Much of a typical data science "software stack" used in large companies consists of open source software, much of which we will be using for this exercise.

This software is *imported*, i.e. made available for us to use, and this is what we are doing below.

In [2]:
import numpy as np 
import pandas as pd
import math
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
# %matplotlib inline 
import matplotlib.pyplot as plt
import time
import datetime
import itertools as it
import dill

import ipywidgets as widgets
from ipywidgets import interact, interactive, interactive_output, fixed, interact_manual, HBox, VBox, Layout
import seaborn as sns

# Load and reshape the data

The *sine qua non* of data science is, unsurprisingly, the data! However, in order to work with the data, we must first *load* it. Typically, we start with a *csv* (comma-separated values) file, which can be exported from an Excel sheet. You can also connect directly to databases and load data from them. For the purpose of this exercise, we will be using the "train.csv" file of training data for Carvana. Below we are going to be loading the data and reshaping it into a more usable form.

First, we will load in the data set.

In [3]:
df = pd.read_csv('train.csv', low_memory=False)

How big is our dataset? It is always good to have a handle on the size before attempting any analysis.

In [4]:
df.shape

(65684, 34)

The first number above is the number of rows, and the second number is the number of columns. This is both a long (many rows) and wide (many columns) dataset, though it is not extreme.

Before diving in, it's good to see what the data actually looks like. Below are the first 10 rows of the data.

In [5]:
with pd.option_context('display.max_columns', 200):
    display(df.head(10))

Unnamed: 0,RefId,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelTypeID,WheelType,VehOdo,Nationality,Size,TopThreeAmericanName,MMRAcquisitionAuctionAveragePrice,MMRAcquisitionAuctionCleanPrice,MMRAcquisitionRetailAveragePrice,MMRAcquisitonRetailCleanPrice,MMRCurrentAuctionAveragePrice,MMRCurrentAuctionCleanPrice,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost
0,2,0,12/7/2009,ADESA,2004,5,DODGE,1500 RAM PICKUP 2WD,ST,QUAD CAB 4.7L SLT,WHITE,AUTO,1.0,Alloy,93593,AMERICAN,LARGE TRUCK,CHRYSLER,6854.0,8383.0,10897.0,12572.0,7456.0,9222.0,11374.0,12791.0,,,19638,33619,FL,7600.0,0,1053
1,3,0,12/7/2009,ADESA,2005,4,DODGE,STRATUS V6,SXT,4D SEDAN SXT FFV,MAROON,AUTO,2.0,Covers,73807,AMERICAN,MEDIUM,CHRYSLER,3202.0,4760.0,6943.0,8457.0,4035.0,5557.0,7146.0,8702.0,,,19638,33619,FL,4900.0,0,1389
2,4,0,12/7/2009,ADESA,2004,5,DODGE,NEON,SXT,4D SEDAN,SILVER,AUTO,1.0,Alloy,65617,AMERICAN,COMPACT,CHRYSLER,1893.0,2675.0,4658.0,5690.0,1844.0,2646.0,4375.0,5518.0,,,19638,33619,FL,4100.0,0,630
3,5,0,12/7/2009,ADESA,2005,4,FORD,FOCUS,ZX3,2D COUPE ZX3,SILVER,MANUAL,2.0,Covers,69367,AMERICAN,COMPACT,FORD,3913.0,5054.0,7723.0,8707.0,3247.0,4384.0,6739.0,7911.0,,,19638,33619,FL,4000.0,0,1020
4,6,0,12/7/2009,ADESA,2004,5,MITSUBISHI,GALANT 4C,ES,4D SEDAN ES,WHITE,AUTO,2.0,Covers,81054,OTHER ASIAN,MEDIUM,OTHER,3901.0,4908.0,6706.0,8577.0,4709.0,5827.0,8149.0,9451.0,,,19638,33619,FL,5600.0,0,594
5,7,0,12/7/2009,ADESA,2004,5,KIA,SPECTRA,EX,4D SEDAN EX,BLACK,AUTO,2.0,Covers,65328,OTHER ASIAN,MEDIUM,OTHER,2966.0,4038.0,6240.0,8496.0,2980.0,4115.0,6230.0,8603.0,,,19638,33619,FL,4200.0,0,533
6,8,0,12/7/2009,ADESA,2005,4,FORD,TAURUS,SE,4D SEDAN SE,WHITE,AUTO,2.0,Covers,65805,AMERICAN,MEDIUM,FORD,3313.0,4342.0,6667.0,7707.0,3713.0,4578.0,6942.0,8242.0,,,19638,33619,FL,4500.0,0,825
7,9,0,12/7/2009,ADESA,2007,2,KIA,SPECTRA,EX,4D SEDAN EX,BLACK,AUTO,2.0,Covers,49921,OTHER ASIAN,MEDIUM,OTHER,6196.0,7274.0,9687.0,10624.0,6417.0,7371.0,9637.0,10778.0,,,21973,33619,FL,5600.0,0,482
8,11,0,12/14/2009,ADESA,2005,4,GMC,1500 SIERRA PICKUP 2,SLE,REG CAB 4.3L,SILVER,AUTO,1.0,Alloy,80080,AMERICAN,LARGE TRUCK,GM,5243.0,6627.0,8848.0,10458.0,5712.0,7552.0,9494.0,11663.0,,,5546,33619,FL,5500.0,0,1373
9,12,0,12/14/2009,ADESA,2001,8,FORD,F150 PICKUP 2WD V6,XL,REG CAB 4.2L XL,WHITE,MANUAL,1.0,Alloy,75419,AMERICAN,LARGE TRUCK,FORD,3168.0,4320.0,5826.0,6762.0,2871.0,3822.0,5734.0,6559.0,,,5546,33619,FL,5300.0,0,869


Note that there is a column for 'RefId' in the data set. This is just an internal column referencing the row. It is not necessary and will provide no predictive value, so we will get rid of it.

In [5]:
if 'RefId' in df:
    df = df.drop(columns='RefId')
with pd.option_context('display.max_columns', 200):
    display(df.head())

Unnamed: 0,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelTypeID,WheelType,VehOdo,Nationality,Size,TopThreeAmericanName,MMRAcquisitionAuctionAveragePrice,MMRAcquisitionAuctionCleanPrice,MMRAcquisitionRetailAveragePrice,MMRAcquisitonRetailCleanPrice,MMRCurrentAuctionAveragePrice,MMRCurrentAuctionCleanPrice,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost
0,0,12/7/2009,ADESA,2004,5,DODGE,1500 RAM PICKUP 2WD,ST,QUAD CAB 4.7L SLT,WHITE,AUTO,1.0,Alloy,93593,AMERICAN,LARGE TRUCK,CHRYSLER,6854.0,8383.0,10897.0,12572.0,7456.0,9222.0,11374.0,12791.0,,,19638,33619,FL,7600.0,0,1053
1,0,12/7/2009,ADESA,2005,4,DODGE,STRATUS V6,SXT,4D SEDAN SXT FFV,MAROON,AUTO,2.0,Covers,73807,AMERICAN,MEDIUM,CHRYSLER,3202.0,4760.0,6943.0,8457.0,4035.0,5557.0,7146.0,8702.0,,,19638,33619,FL,4900.0,0,1389
2,0,12/7/2009,ADESA,2004,5,DODGE,NEON,SXT,4D SEDAN,SILVER,AUTO,1.0,Alloy,65617,AMERICAN,COMPACT,CHRYSLER,1893.0,2675.0,4658.0,5690.0,1844.0,2646.0,4375.0,5518.0,,,19638,33619,FL,4100.0,0,630
3,0,12/7/2009,ADESA,2005,4,FORD,FOCUS,ZX3,2D COUPE ZX3,SILVER,MANUAL,2.0,Covers,69367,AMERICAN,COMPACT,FORD,3913.0,5054.0,7723.0,8707.0,3247.0,4384.0,6739.0,7911.0,,,19638,33619,FL,4000.0,0,1020
4,0,12/7/2009,ADESA,2004,5,MITSUBISHI,GALANT 4C,ES,4D SEDAN ES,WHITE,AUTO,2.0,Covers,81054,OTHER ASIAN,MEDIUM,OTHER,3901.0,4908.0,6706.0,8577.0,4709.0,5827.0,8149.0,9451.0,,,19638,33619,FL,5600.0,0,594


Let's also look at the last 5 rows.

In [6]:
with pd.option_context('display.max_columns', 200):
    display(df.tail())

Unnamed: 0,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelTypeID,WheelType,VehOdo,Nationality,Size,TopThreeAmericanName,MMRAcquisitionAuctionAveragePrice,MMRAcquisitionAuctionCleanPrice,MMRAcquisitionRetailAveragePrice,MMRAcquisitonRetailCleanPrice,MMRCurrentAuctionAveragePrice,MMRCurrentAuctionCleanPrice,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost
65679,0,12/2/2009,ADESA,2004,5,FORD,EXPLORER 2WD V6,XLS,4D SUV 4.0L FFV XLS,SILVER,AUTO,1.0,Alloy,82563,AMERICAN,MEDIUM SUV,FORD,4668.0,5714.0,5541.0,6671.0,6148.0,7521.0,9659.0,10944.0,,,18881,30212,GA,7000.0,0,1243
65680,0,12/2/2009,ADESA,2006,3,KIA,SORENTO 2WD,EX,4D SPORT UTILITY EX,GOLD,AUTO,1.0,Alloy,65399,OTHER ASIAN,MEDIUM SUV,OTHER,7843.0,9171.0,8970.0,10405.0,7652.0,9310.0,12148.0,14204.0,,,18111,30212,GA,7900.0,0,1508
65681,1,12/2/2009,ADESA,2001,8,MERCURY,SABLE,GS,4D SEDAN GS,BLACK,AUTO,1.0,Alloy,45234,AMERICAN,MEDIUM,FORD,1996.0,2993.0,2656.0,3732.0,2190.0,3055.0,4836.0,5937.0,,,18111,30212,GA,4200.0,0,993
65682,0,12/2/2009,ADESA,2007,2,CHEVROLET,MALIBU 4C,LS,4D SEDAN LS,SILVER,AUTO,,,71759,AMERICAN,MEDIUM,GM,6418.0,7325.0,7431.0,8411.0,6785.0,8132.0,10151.0,11652.0,,,18881,30212,GA,6200.0,0,1038
65683,0,12/2/2009,ADESA,2005,4,JEEP,GRAND CHEROKEE 2WD V,Lar,4D WAGON LAREDO,SILVER,AUTO,1.0,Alloy,88500,AMERICAN,MEDIUM SUV,CHRYSLER,8545.0,9959.0,9729.0,11256.0,8375.0,9802.0,11831.0,14402.0,,,18111,30212,GA,8200.0,0,1893


# Summarize the data

Before selecting variables to use for our model, we will stop and try to understand at a high level what our variables are. The below table provides a summary of all of our variables.

First, you can see that there really are two types of variables, continuous and categorical. Continuous features are those features that change continuously, like price or age of vehicle. These can take any value (even a negative number sometimes). Categorical features by contrast represent categories of items. An example of a categorical feature is the make of a car.

You generally treat these two kinds of variables very differently when building a model. Specifically, for categorical features, you can't give them directly to a model because the model expects numbers. Instead, you create a new feature that is a "dummy variable" for each realization of a category. For example, if we are thinking about the make of a car, one make might be Toyota. So, we would create a new variable called "Make_Toyota" and the value would be 1 for any car that was a Toyota and 0 for any car that was not a Toyota. This means that you really need to pay careful attention to which categorical variables you add to the model. Since there are 1041 unique values for the Model of a car, if you add in this variable, you will be adding 1041 variables to your model. This can add a lot of columns to your data set very quickly.
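To make the dummy-variable idea concrete, here is a minimal sketch on a made-up three-row table. The `toy` frame and its values are invented for illustration and are not part of the Carvana data, and it uses `pd.get_dummies` for brevity rather than the scikit-learn `OneHotEncoder` that this notebook uses later.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"Make": ["TOYOTA", "FORD", "TOYOTA"],
                    "VehOdo": [50000, 72000, 61000]})

# One new 0/1 column is created for each distinct value of "Make"
dummies = pd.get_dummies(toy, columns=["Make"], dtype=int)
print(dummies)

# The number of columns added equals the number of unique categories
print("Columns added for Make:", toy["Make"].nunique())
```

The single "Make" column is replaced by "Make_FORD" and "Make_TOYOTA", which is why a categorical variable with many unique values widens the data set so quickly.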

Look through the summary below to begin identifying which variables might make sense for your model.

In [7]:
def summarize_dataframe(df):
    missing_values = pd.concat([pd.DataFrame(df.columns, columns=['Variable Name']), 
                      pd.DataFrame(df.dtypes.values.reshape([-1,1]), columns=['Data Type']),
                      pd.DataFrame(df.isnull().sum().values, columns=['Missing Values']), 
                      pd.DataFrame([df[name].nunique() for name in df.columns], columns=['Unique Values'])], 
                     axis=1).set_index('Variable Name')
    return pd.concat([missing_values, df.describe(include='all').transpose()], axis=1).fillna("")

summarize_dataframe(df)

Unnamed: 0,Data Type,Missing Values,Unique Values,count,unique,top,freq,mean,std,min,25%,50%,75%,max
IsBadBuy,int64,0,2,65684.0,,,,0.122891,0.328315,0.0,0.0,0.0,0.0,1.0
PurchDate,object,0,517,65684.0,517.0,11/23/2010,349.0,,,,,,,
Auction,object,0,3,65684.0,3.0,MANHEIM,36940.0,,,,,,,
VehYear,int64,0,10,65684.0,,,,2005.344041,1.729519,2001.0,2004.0,2005.0,2007.0,2010.0
VehicleAge,int64,0,10,65684.0,,,,4.175842,1.710202,0.0,3.0,4.0,5.0,9.0
Make,object,0,32,65684.0,32.0,CHEVROLET,15504.0,,,,,,,
Model,object,0,1041,65684.0,1041.0,PT CRUISER,2094.0,,,,,,,
Trim,object,2143,134,63541.0,134.0,Bas,12615.0,,,,,,,
SubModel,object,8,846,65676.0,846.0,4D SEDAN,13737.0,,,,,,,
Color,object,8,16,65676.0,16.0,SILVER,13397.0,,,,,,,


# Engineer features and impute missing values

A big part of any data science project is getting the data ready to be analyzed by a model. Models often have very strict expectations about the format of data that they will use, and if that format is not met, the model will give an error. Additionally, often the data we have is not exactly in the form that we think will make modeling easiest, so often we *engineer* new features, i.e. create a new feature using existing features. This can often lead to much better performance, but knowing what features to engineer can sometimes be a challenge.

Additionally, anything done to the *training* data set will also have to be done to the *testing* and *validation* data sets, so if you change the data to build your model, you need to also change the data before you use your model.

First, we convert the PurchDate column (the date on which Carvana purchased the car) from an object (i.e., string) to a formal datetime object in Python. This will allow us to use it more easily as a date in our model.

In [8]:
if 'PurchDate' in df:
    df['PurchDate'] = pd.to_datetime(df['PurchDate'], format='%m/%d/%Y')

Instead of using the variable directly, we will engineer two new features based on the purchase date: month and year.

In [9]:
if 'PurchDate' in df:
    df['Month'] = df['PurchDate'].dt.month
    df['Year'] = df['PurchDate'].dt.year

Now we no longer need the PurchDate column and we will drop it from the data frame.

In [10]:
if 'PurchDate' in df:
    df = df.drop(columns='PurchDate')


Finally, we will go ahead and explicitly categorize which variables are continuous and which are categorical. This is not always obvious because some variables (WheelTypeID specifically) look like continuous variables in the summary above but are actually categorical.

In [11]:
continuous_features = ['VehOdo', 'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
                       'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
                       'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
                       'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice', 'VehBCost', 'WarrantyCost', 'VehYear', 'VehicleAge']
categorical_features = ['Auction', 'Make', 'Model', 'Trim', 'SubModel', 'Color', 
                        'Transmission', 'WheelTypeID', 'WheelType', 'Nationality', 
                        'Size', 'TopThreeAmericanName', 'PRIMEUNIT', 'AUCGUART', 
                        'BYRNO', 'VNZIP1', 'VNST', 'IsOnlineSale', 'Month', 'Year']

Models also expect that all of the variables in all of the rows have numbers in them. They generally can't handle missing data. However, our data has lots of missing values (refer to the table where we summarized the data). We could throw out all rows with missing values, but that would mean throwing out 62,584 rows (see PRIMEUNIT). We could also drop any columns with missing values, but that would be most columns. Instead, we *impute* the missing values, i.e. we fill them in with something reasonable. For continuous features, we fill in the value with the mean value of everything else in the column. With categorical features, we add a new category called "MISSING" that gets its own dummy variable.

However, if a variable is primarily missing (e.g. PRIMEUNIT), then the new imputed values are not likely to provide much information. It is probably not a good idea to include those variables in your model unless you have a compelling reason to do so.
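As a minimal sketch of the two imputation strategies just described, the following applies them to a made-up table with one gap in each column. The `toy` frame and its values are hypothetical; only `SimpleImputer` and its `strategy` and `fill_value` parameters come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({"VehOdo": [50000.0, np.nan, 70000.0],
                    "Color": ["SILVER", np.nan, "WHITE"]})

# Continuous column: fill the gap with the mean of the observed values
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
odo_filled = mean_imputer.fit_transform(toy[["VehOdo"]])

# Categorical column: fill the gap with a new "MISSING" category
const_imputer = SimpleImputer(missing_values=np.nan, strategy="constant",
                              fill_value="MISSING")
color_filled = const_imputer.fit_transform(toy[["Color"]].astype(object))

print(odo_filled.ravel())
print(color_filled.ravel())
```

The missing odometer reading becomes 60000 (the mean of 50000 and 70000), and the missing color becomes its own "MISSING" category.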

The following code creates a *transformer* that transforms our data using the above strategy.

In [12]:
def convert_to_str(df):
    return df[:].astype(str)

def convert_to_object(df):
    return df[:].astype(object)

numeric_transformer = Pipeline([('fill_NaN', SimpleImputer(missing_values=np.nan, strategy='mean'))])
categorical_transformer = Pipeline([('convert_objects', FunctionTransformer(convert_to_object, validate=False)),
                                    ('fill_Missing', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='MISSING')), 
                                    ('convert_floats', FunctionTransformer(convert_to_str, validate=False)),
                                    ('dummy_transform', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# Visualizing the data

Before we decide what variables to include in the model, it is useful to spend a little bit of time visualizing the data set to try to decide what to include when modeling.

In [13]:
plot_df = df.copy()
plot_df[categorical_features] = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='MISSING').fit_transform(df[categorical_features])

def plot_variables(v1, v2):
    if v1 in categorical_features + ['IsBadBuy'] and v2 in continuous_features + ['IsBadBuy']:
        print("You have selected " + str(v1) + " (a categorical feature) for the horizontal axis and " + str(v2) + " for the vertical axis.")
        print("The first plot plots the average value of " + str(v2) + " for each category in " + str(v1) + ".")
        g = sns.catplot(x=v1, y=v2,
                        saturation=.5, data=plot_df,
                        kind="bar", ci=None, aspect=5)
        g.set_xticklabels(rotation=45)
        print("The second plot plots the count of " + str(v1) + " in each category.")
        g = sns.catplot(x=v1, 
                        saturation=.5, data=plot_df,
                        kind="count", ci=None, aspect=5)
        g.set_xticklabels(rotation=45)
    elif v1 in continuous_features and v2 in categorical_features + ['IsBadBuy']:
        print("You have selected " + str(v1) + " (a continuous feature) for the horizontal axis and " + str(v2) + " (a categorical feature) for the vertical axis.")
        print("This plots the average value of " + str(v1) + " for each category in " + str(v2) + "\nand it shows the distribution of " + str(v1) + " within each category represented by the width of the plot at each point.")
        g = sns.catplot(x=v1, y=v2,
                        saturation=.5, data=plot_df, orient='h', 
                        kind="violin", ci=None, aspect=5, scale='count')
        g.set_xticklabels(rotation=45)
    elif v1 in categorical_features + ['IsBadBuy'] and v2 in categorical_features:
        print("You have selected " + str(v1) + " (a categorical feature) for the horizontal axis and " + str(v2) + " (a categorical feature) for the vertical axis.")
        print("This plots a heatmap showing the relationship between the two categories.")
        print("A bright color means those two categories are seen together more often.")
        fig, ax = plt.subplots(figsize=(10,10))
        temp = pd.crosstab(plot_df[v2], plot_df[v1])
        sns.heatmap(temp, cbar=True, ax=ax)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
        ax.set_yticklabels(ax.get_yticklabels(), rotation=45)
    elif v1 in continuous_features and v2 in continuous_features:
        print("You have selected " + str(v1) + " (a continuous feature) for the horizontal axis and " + str(v2) + " (a continuous feature) for the vertical axis.")
        print("This plots a scatterplot between the two variables, and it shows histograms of the variables individually on the outer axes.")
        sns.jointplot(x=v1, y=v2, data=plot_df)



plot_relationships = interactive(plot_variables, 
#                            {'manual' : True, 'manual_name' : 'Display plot'}, 
                           v1 = widgets.Dropdown(value='VNST',
                                                options=['IsBadBuy'] + categorical_features + continuous_features,
                                                description="Variable on the horizontal axis:", 
                                                style={'description_width': 'initial'},
                                                layout=Layout(width='500px'),
                                                disabled=False), 
                           v2 = widgets.Dropdown(value='IsBadBuy',
                                                options=['IsBadBuy'] + categorical_features + continuous_features,
                                                description="Variable on the vertical axis:", 
                                                style={'description_width': 'initial'},
                                                layout=Layout(width='500px'),
                                                disabled=False)
                                )
display(plot_relationships)


# Splitting into training and testing and selecting features

One of the most important concepts for data science is the idea of "out of sample" testing. I.e., you have to test your model on data that it was not trained on. The models used by data scientists are incredibly powerful, and if you are not careful, the model can "memorize" the data set instead of uncovering true, robust patterns. This leads to something that data scientists call **overfitting**. The only way to test for overfitting is to split your data into two data sets: a *training set* which you use to train your model and a *testing set* which you only use to evaluate your model.
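To illustrate why out-of-sample testing matters, here is a small sketch on synthetic data (not the Carvana data). The labels are pure coin flips, so there is nothing real to learn, and the choice of an unconstrained decision tree is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 5 random features and labels that are pure coin flips,
# so there is no true pattern for a model to uncover
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree keeps splitting until it has memorized the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```

The tree scores perfectly on the data it was trained on, yet does roughly as well as a coin flip on the held-out data. Only the testing set reveals that it memorized noise.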

However, this naturally means that you have less data to train your model on, and generally speaking, models do better the more data they have to train on. At the same time, the more testing data you have, the more confident you can be in the measured performance of your model. This sets up a natural tension between how much data you hold out for testing versus training. The slider below will let you choose the percentage of data to hold out for testing. Again, the fundamental tradeoff is this: the more data you hold out for testing, the worse your best model is likely to be, but the more likely you are to correctly identify the best of the (slightly worse) models. How will you choose?

When you evaluate your model, your final score will be much closer to the score on your testing set than to the score on your training set. However, it will not be exactly the same as your testing set score, because I have held out a third set of data that you do not have and that will be used for the final validation. You will not know your true score until we run the competition in class, but it should be similar to your testing set score unless you use a really small test set.

Additionally, we have to choose what features our model will actually use. While you can choose all of the features, you may find that the model "overfits." We have roughly 66,000 rows of data and there are 2,410 columns after adding all of the dummies for the categorical variables. By selecting fewer variables, our model may be more robust to new data. Moreover, the more variables you pick, the longer it will take to train the models, and the less time you will have to tune the models. Do you go with more variables and less tuning time or fewer variables and more tuning time?

The next piece of code will allow you to choose the size of your testing set and the variables on which to build your model.

In [14]:
def create_data_set(df, percent_test, **kwargs):
    print("You will hold out " + str(percent_test) + "% of the data for the test set.\nYou will have " + str(math.floor((1 - percent_test/100)*len(df))) 
          + " rows in your training set and " + str(math.ceil((percent_test/100)*len(df))) + " in your test set.")
    features_selected = {key for key, value in kwargs.items() if value}
    cont_features_selected = list(features_selected.intersection(continuous_features))
    cat_features_selected = list(features_selected.intersection(categorical_features))
    print("\nThe continuous features you have selected are " + str(cont_features_selected))
    print("The categorical features you have selected are " + str(cat_features_selected) + '\n')
    
    X = df.drop(columns='IsBadBuy')
    if not cat_features_selected and not cont_features_selected:
        print("You have to choose something! Choose some variables and create the data set.")
        raise Exception("No variables chosen!")
        return
    elif not cat_features_selected:
        preprocessor = ColumnTransformer(transformers = [('num', numeric_transformer, cont_features_selected)])
        preprocessor.fit(X)
        feature_names = cont_features_selected
        X = pd.DataFrame(preprocessor.transform(X), columns=feature_names)
    else:
        preprocessor = ColumnTransformer(transformers = [('num', numeric_transformer, cont_features_selected),
                                                         ('cat', categorical_transformer, cat_features_selected)])
        preprocessor.fit(X)
        feature_names=cont_features_selected + list(preprocessor.transformers_[1][1]['dummy_transform'].get_feature_names(cat_features_selected))
        X = pd.DataFrame(preprocessor.transform(X), 
                         columns=feature_names)
    
    
    y = df['IsBadBuy']
    print("The first ten rows of your data set now looks like:")
    display(X.head(10))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=percent_test/100, random_state=201)
    print("\nThe number of columns in your data set is: " + str(len(X.columns)))
    print("\nThe number of rows in your training set is: " + str(len(X_train)) + "\nThe number of rows in your testing set is: " + str(len(X_test)))
    
    return feature_names, preprocessor, X_train, X_test, y_train, y_test, cont_features_selected, cat_features_selected


feature_selection = interactive(create_data_set, 
                               {'manual' : True, 'manual_name' : 'Create Data Set'},
                                df=fixed(df[:]),
                                percent_test= widgets.IntSlider(min=1, 
                                                                 max=99, 
                                                                 step=1, 
                                                                 value=20,
                                                                 layout=Layout(width='500px'), 
                                                                 style={'description_width': 'initial'}, 
                                                                 description="Proportion of data for testing:"),
                                **{str(col):widgets.Checkbox(value=False, description=str(col), style = {'description_width': 'initial'}) for col in df.columns if col != 'IsBadBuy'}
                          )

controls = VBox([HBox([feature_selection.children[0]], layout = Layout(display='flex', flex_flow='column', align_items='center')), 
                 HBox(feature_selection.children[1:-2], layout = Layout(flex_flow='row wrap')), 
                 HBox([feature_selection.children[-2]], layout = Layout(display='flex', flex_flow='column', align_items='center')), 
                 HBox([feature_selection.children[-1]], layout = Layout(max_width='800px', display='inline-flex'))])

display(controls)


Note that you must select **at least one** checkbox above and then press the "Create Data Set" button before moving on. If you push the button without selecting at least one checkbox, you will get an error in the following code.

How many columns did we end up with?

That may not be equal to the number of checkboxes you selected. Why?

How many rows are in your testing set? The training set?

# Set up the scoring metric

Just as important as building the model is choosing the right evaluation criterion. The model that you choose will be dictated by the metric with which you measure it, so that metric should reflect the intended use of the model.

For the Carvana challenge, Kaggle uses a scaled version of the normalized Gini index. The normalized Gini index is 2\*AUC - 1, where AUC is the area under the receiver operating characteristic (ROC) curve (a plot of the true positive rate against the false positive rate for all possible thresholds). AUC is a very popular evaluation metric in machine learning challenges. A brief explanation of AUC can be found at https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it, but for the purposes of this challenge, the AUC can be thought of as a metric that balances the penalty for false positives (i.e. saying a car is a bad buy when it isn't) with the penalty for false negatives (i.e. saying a car is a good buy when it isn't).

The metric is fairly complicated, but the only really important thing you need to know is that a **higher score is better**. The Kaggle competition was won with a score of 0.26, so treat that as a practical upper bound on what you will be able to achieve.
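To make the relationship between AUC and the normalized Gini index concrete, here is a small worked sketch. The four labels and predicted probabilities are invented for illustration, and the Kaggle scaling factor is deliberately left out.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical example: true labels (1 = bad buy) and predicted probabilities
actual = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (bad buy, good buy) pairs in which the bad buy
# receives the higher predicted probability
auc = roc_auc_score(actual, predicted)
normalized_gini = 2 * auc - 1

print("AUC:", auc)
print("Normalized Gini:", normalized_gini)
```

Three of the four (bad buy, good buy) pairs are ranked correctly, so the AUC is 0.75 and the normalized Gini index is 0.5. A perfect ranking would give an AUC of 1 and a normalized Gini index of 1.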

In [15]:
def Kaggle_Gini_Index(predictions, realizations):
    # Scaled normalized Gini index: a constant factor times (2 * AUC - 1)
    this_score = 0.43639 * (2 * roc_auc_score(realizations, predictions) - 1)
    return this_score

Additionally, any model should be measured against a benchmark. Often, you will use an existing model to benchmark your new model. In this case, since we do not have an existing model, we will use the *naive* model. This model just guesses that the probability of any particular car being a bad buy is equal to the proportion of bad buy cars in the data set. Why is this a reasonable model?

In [16]:
y_train = df['IsBadBuy']

naive = np.repeat(np.mean(y_train), len(y_train))
print("The Gini index for the naive model is " + str(Kaggle_Gini_Index(naive, y_train)))

The Gini index for the naive model is 0.0


A higher Gini index is better. The naive model has a Gini index of 0, so any model that is above 0 is at least better than the naive model. That's a benchmark to beat (probably not the one that will win you a prize).
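One way to see why the naive model scores exactly 0: it gives every car the same probability, so it cannot rank any bad buy above any good buy, and the AUC falls back to 0.5. The sketch below checks this on made-up labels (not the Carvana data) using scikit-learn's `roc_auc_score`. The 12% bad-buy rate is an arbitrary assumption, and the scaling constant is copied from `Kaggle_Gini_Index` above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels for illustration: roughly 12% "bad buys"
rng = np.random.default_rng(0)
toy_labels = (rng.random(1000) < 0.12).astype(int)

# Naive model: every car gets the same predicted probability
constant_prediction = np.repeat(toy_labels.mean(), len(toy_labels))

naive_auc = roc_auc_score(toy_labels, constant_prediction)
naive_gini = 0.43639 * (2 * naive_auc - 1)
print("AUC:", naive_auc, "scaled Gini:", naive_gini)
```

The same result holds for any constant prediction, not just the observed proportion, because the metric only looks at how the cars are ranked relative to one another.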

# Regularization

A core concept in data science is *regularization*. The models used in data science are often very powerful, and if they are left unchecked they can memorize the data that they have seen. The problem with this is that any real data set has noise that is either a result of the data collection process or is just intrinsic to the system. If a model memorizes the data, then it ends up fitting the noise and not the true pattern. A common example is shown in the image below. The true data-generating process is the smooth curve in the middle plot with some noise added, but if we overfit, as in the plot on the right, we end up with a model that goes crazy trying to match every point exactly.

<img src="images/underfitting-overfitting.png" width="600"/>

The way to avoid overfitting is to *regularize* your model, i.e. set a parameter that restrains its ability to memorize the data set. Generally this either takes the form of a penalty for being "too wavy" or constrains the class of models.

However, if you constrain your class of models too much, you end up *underfitting*, as seen in the image on the left above. To figure out where exactly you've landed, you need to pay careful attention to the model performance on the testing set.
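The trade-off can be sketched with a toy curve-fitting problem. The example below is an illustration on synthetic data, not part of the Carvana workflow: it fits polynomials of increasing degree to noisy samples of a sine curve and compares the error on the points used for fitting with the error on fresh points. The polynomial degree plays the role of the regularization knob here, and the degrees, noise level, and sample sizes are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(201)

def noisy_sine(n):
    """Sample n points from a smooth curve plus noise."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = noisy_sine(20)
x_test, y_test = noisy_sine(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The training error can only go down as the degree increases, because each higher-degree family contains the lower-degree ones. Whether the test error goes down with it is the thing to watch.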

# Regularized Logistic Regression

A logistic regression is the cousin of the common linear regression that most of us are likely familiar with. Instead of fitting a straight line, the logistic regression fits a *sigmoid* function that relates the independent variables to the probability that something is true, in this case whether or not a car IsBadBuy. The below image demonstrates this for a single independent variable. Note that for larger values of *x*, there are more type 1 points, so the sigmoid goes up as *x* goes up.

<img src="images/Logistic_Regression.png" width="500"/>

The nice thing about a logistic regression is that the model is very easy to interpret. The size of a coefficient is related to the strength of the model's response to that variable. Specifically, because the variables are standardized before fitting, the coefficient indicates the change in the log-odds of something happening for a one standard deviation increase in that variable. So, a positive coefficient means that a car becomes more likely to be IsBadBuy as that variable goes up, and a negative coefficient means that it becomes less likely.
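As a rough sketch of how a coefficient turns into a probability, the snippet below evaluates the sigmoid by hand. In a logistic regression the coefficients act on the log-odds scale, and the sigmoid converts log-odds into a probability. The intercept and coefficient values are made up for illustration and are not taken from any fitted model; the variable is assumed to be standardized, as it is in the training code below.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Hypothetical fitted values for a single standardized variable
intercept = -2.0
coefficient = 0.5

for x in (-1, 0, 1, 2):  # x is measured in standard deviations from the mean
    probability = sigmoid(intercept + coefficient * x)
    print(f"x = {x:+d} sd -> P(IsBadBuy) = {probability:.3f}")

# Each one-standard-deviation increase multiplies the odds by exp(coefficient)
odds_ratio = odds(sigmoid(intercept + coefficient)) / odds(sigmoid(intercept))
print("odds ratio per standard deviation:", round(odds_ratio, 4))
```

The probability rises with *x* because the coefficient is positive, but not by a fixed amount per step. It is the odds that change by a constant factor, which is why the sign and relative size of a coefficient are easier to read than its exact effect on the probability.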

These models are easy to translate into clear business decisions. Find the variables that have large positive coefficients and do not buy cars that have high values of those variables. Instead buy cars that have high values of variables that have large negative coefficients.

In the regularized logistic regression model, the parameter C is the main tuning parameter. It regularizes the logistic regression by penalizing large coefficient values, so the model will only assign a large coefficient to a variable if it is really clear that a large coefficient is warranted. The larger the C, the less regularization; the smaller the C, the more regularization.
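The effect of C can be seen on a small synthetic problem. The sketch below does not use the Carvana data: it uses scikit-learn's `make_classification` to generate a data set in which only a few variables carry signal, fits the same kind of L1-penalized logistic regression as the training cell below for several values of C, and counts how many coefficients are non-zero. The data set size and the C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 variables, only 4 of which carry signal
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=201)
X = StandardScaler().fit_transform(X)

nonzero_by_C = {}
for C in (0.0001, 0.01, 1.0, 100.0):
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear', random_state=201)
    model.fit(X, y)
    nonzero_by_C[C] = int(np.count_nonzero(model.coef_))
    print(f"C = {C}: {nonzero_by_C[C]} of {X.shape[1]} coefficients are non-zero")
```

With a very small C the penalty dominates and few or no variables survive; as C grows, more coefficients are allowed to move away from zero.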

Fit a logistic regression below. If you have chosen a lot of variables, this could take up to a minute or two to run. Play around with C and look at the effect on the training and the testing set. Note that there are three more models after this one to evaluate, so budget training time accordingly.

In [17]:
def train_logit(C):
    print("Training the model, this could take a little bit of time...")
    X_train = feature_selection.result[2]
    X_test = feature_selection.result[3]
    y_train = feature_selection.result[4]
    y_test = feature_selection.result[5]
    cont_features_selected = feature_selection.result[6]
    cat_features_selected = feature_selection.result[7]
    scaler=StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    features_selected = feature_selection.result[0]
    preprocessor = Pipeline([('common_preprocessor', feature_selection.result[1]), ('scaler', scaler)])
    algorithm_starts = time.time()
    rlr = LogisticRegression(penalty='l1', C=C, random_state=201, solver='liblinear')
    rlr_train = rlr.fit(X_train, y_train)
    training_time = time.time() - algorithm_starts
    
    rlr_train_prob = pd.DataFrame(rlr_train.predict_proba(X_train))[1]
    rlr_test_prob = pd.DataFrame(rlr_train.predict_proba(X_test))[1]
    
    output_string = "For the Logistic Regression, the continuous features chosen were: " + str(cont_features_selected) + \
                    "\nThe categorical features chosen were: " + str(cat_features_selected) + \
                    "\nThe C parameter chosen was: " + str(C) + "\n"
    
    print("Training the model took " + "{:.3f}".format(training_time) + " seconds\n" 
          + "The regularization parameter (C) is " + str(C) + "\n"
          + "In the test set, the Gini index is " + "{:.5f}".format(Kaggle_Gini_Index(rlr_test_prob, y_test)) 
          + " and in the training set it is " + "{:.5f}".format(Kaggle_Gini_Index(rlr_train_prob, y_train)))
    feature_importances = pd.concat([pd.DataFrame(features_selected, columns=['Variable']), 
                                     pd.DataFrame(rlr_train.coef_[0], columns=['Importance'])], axis=1
                                   )
    feature_importances = feature_importances.reindex(feature_importances.Importance.abs().sort_values(ascending=False).index).head(10).reset_index(drop=True)
    print("\nWe can also look at the ten most important features according to the model \n(measured by the absolute size of the coefficient in the logistic regression):")
    display(feature_importances)
    
    return rlr_train, preprocessor, features_selected, cont_features_selected, cat_features_selected, output_string

rlr_training = interactive(train_logit, 
                           {'manual' : True, 'manual_name' : 'Train the logit model'}, 
                           C = widgets.BoundedFloatText(value=1, 
                                                        min=.00001,
                                                        max=2000, 
                                                        step=.1, 
                                                        description="Regularization Parameter (C) (between .00001 and 2000):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False))
display(rlr_training)

interactive(children=(BoundedFloatText(value=1.0, description='Regularization Parameter (C) (between .00001 an…

Does the logistic regression make sense? What actions would you take as Carvana based on the coefficients that you see?

# Decision Tree

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a car is American or not), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after all the tests along the path have been applied). The paths from root to leaf represent classification rules.

<img src="images/Simple_Decision_Tree.png" width="600"/>


A decision tree is commonly used in data science because it can capture non-linear relationships (relationships that can't be captured by a straight line or a logistic regression). It works by building what amounts to a flow chart: as it looks at a data point, it works its way down the flow chart until it gets to a "leaf" (a point where it can't go any further), and then it labels the point with whichever label is most common among the training points in that leaf.

A decision tree is tuned with *min_samples_split*. This parameter controls how many samples must be in a node before it can be split. If you set it to 2, then only 2 samples are required in a node for that node to be split. This means that every single leaf can correspond to a single sample, since the preceding node can split 2 samples into 1 and 1. How might this memorize the data and lead to overfitting? If you set it to 20, once a node has fewer than 20 samples in it, no more splits will occur at that node. The smaller you set min_samples_split, the deeper the fitted tree can grow.

The other parameter that we will tune is the *max_depth* of the tree. This is how many levels the tree can have (alternatively, how many times it can split along any given path in the tree). The deeper the tree, the more powerful the tree, but also the more likely you are to overfit. Why?
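A small sketch on synthetic data (not the Carvana set) shows what these two parameters do. The labels below are deliberately noisy, so a tree that fits the training set perfectly must be memorizing noise. The data-generating choices and the parameter values are arbitrary and only meant for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(201)

# Synthetic data: the label depends on the first variable, but 20% of labels are flipped
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
flipped = rng.random(1000) < 0.2
y[flipped] = 1 - y[flipped]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=201)

settings = {
    "unconstrained": dict(min_samples_split=2, max_depth=None),
    "regularized": dict(min_samples_split=50, max_depth=3),
}
fitted = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=201, **params).fit(X_train, y_train)
    fitted[name] = tree
    print(f"{name:>13}: depth {tree.get_depth()}, "
          f"train accuracy {tree.score(X_train, y_train):.3f}, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```

The unconstrained tree keeps splitting until every training point is classified correctly, flipped labels included. Comparing its training and test accuracy against the regularized tree is the same check you should make with the controls below.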

In [18]:
def train_dt(min_samples_split, max_depth):
    print("Training the model, this could take a little bit of time...")
    X_train = feature_selection.result[2]
    X_test = feature_selection.result[3]
    y_train = feature_selection.result[4]
    y_test = feature_selection.result[5]
    cont_features_selected = feature_selection.result[6]
    cat_features_selected = feature_selection.result[7]
    features_selected = feature_selection.result[0]
    preprocessor = feature_selection.result[1]
    algorithm_starts = time.time()
    dt_model = DecisionTreeClassifier(min_samples_split=min_samples_split, max_depth=max_depth, random_state=201)
    dt_train = dt_model.fit(X_train, y_train)
    training_time = time.time() - algorithm_starts
    
    dt_test_prob = pd.DataFrame(dt_train.predict_proba(X_test))[1]
    dt_train_prob = pd.DataFrame(dt_train.predict_proba(X_train))[1]
    
    output_string = "For the Decision Tree, the continuous features chosen were: " + str(cont_features_selected) + \
                    "\nThe categorical features chosen were: " + str(cat_features_selected) + \
                    "\nThe min_samples_split parameter chosen was: " + str(min_samples_split) + \
                    "\nThe max_depth parameter chosen was: " + str(max_depth) + "\n"
    
    print("Training the model took " + "{:.3f}".format(training_time) + " seconds\n" 
          + "In the test set, the Gini index is " + "{:.5f}".format(Kaggle_Gini_Index(dt_test_prob, y_test)) 
          + " and in the training set it is " + "{:.5f}".format(Kaggle_Gini_Index(dt_train_prob, y_train)))
    
    feature_importances = pd.concat([pd.DataFrame(features_selected, columns=['Variable']), 
                                     pd.DataFrame(dt_train.feature_importances_, columns=['Importance'])], axis=1
                                   ).sort_values(by=['Importance'], ascending=False).head(10).reset_index(drop=True)
    print("\nWe can also look at the ten most important features according to the model \n(feature importance measures the percent of variation that variable is responsible for):")
    display(feature_importances)
    
    
    print("We can also visualize the decision tree by plotting out the decision it makes to classify.")
    print("Below, the node to the left is the node that satisfies the criteria in the decision tree, the node to the right fails it.")
    print("Note: Depending on how deep your tree is, the plot may be difficult to read.")
    plt.figure(figsize=(20,20))
    
    plot_tree(dt_train, feature_names=features_selected, filled=True, fontsize=12, 
              class_names=['IsGoodBuy', 'IsBadBuy'], impurity=False, proportion=True, label='none')
    
    return dt_train, preprocessor, features_selected, cont_features_selected, cat_features_selected, output_string

dt_training = interactive(train_dt, 
                           {'manual' : True, 'manual_name' : 'Train Decision Tree'}, 
                           max_depth = widgets.BoundedIntText(value=6, 
                                                        min=1,
                                                        max=200, 
                                                        step=1, 
                                                        description="Depth of each tree (between 1 and 200):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False), 
                           min_samples_split = widgets.BoundedIntText(value=2, 
                                                        min=2,
                                                        max=60000, 
                                                        step=1, 
                                                        description="Regularization Parameter (min_samples_split) (between 2 and 60000):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False)
                          )
display(dt_training)

interactive(children=(BoundedIntText(value=2, description='Regularization Parameter (min_samples_split) (betwe…

Above we show both the feature importances and the decision tree (though the tree may be difficult to read, depending on the parameters selected). How would you use these results to avoid bad buy cars?

# Random Forest

A random forest extends the idea of a decision tree. Instead of having a single decision tree, it fits a bunch of decision trees (introducing randomness into the fitting of each tree) and then, in order to classify a point, it lets all the trees in the forest vote on the classification. This is, intuitively, trying to take advantage of the "wisdom of crowds" phenomenon.

<img src="images/Random_Forest.png" width="600"/>


A random forest is tuned much like a decision tree. However, you also have to choose how many trees you want to have in the forest. When you have time, think about whether or not you can overfit by using too many trees. Note that since each tree is just a Decision Tree, you can definitely overfit by overfitting the individual decision trees, but can you overfit by using too many Decision Trees?

Here though, the more trees you have, the longer it will take to train, so be careful.
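The "wisdom of crowds" intuition behind the voting can be sketched with a simple simulation. This is an idealized illustration, not how a random forest is implemented: it assumes each voter is right 60% of the time and that voters make their mistakes independently, which real trees trained on the same data do not. All of the numbers are made up.

```python
import random

random.seed(201)

def majority_is_correct(n_voters, p_correct):
    """Simulate one vote: True if more than half of the voters are right."""
    correct_votes = sum(random.random() < p_correct for _ in range(n_voters))
    return correct_votes > n_voters / 2

def accuracy(n_voters, p_correct=0.6, n_trials=2000):
    """Estimate how often the majority vote is right."""
    hits = sum(majority_is_correct(n_voters, p_correct) for _ in range(n_trials))
    return hits / n_trials

accuracy_by_size = {n: accuracy(n) for n in (1, 11, 101)}
for n, acc in accuracy_by_size.items():
    print(f"{n:3d} voters: majority is right about {acc:.0%} of the time")
```

Under these assumptions a large crowd of mediocre voters beats any single voter. The randomness injected into each tree is an attempt to make the trees' mistakes less alike, so that the forest gets closer to this idealized picture.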

In [19]:
def train_rf(min_samples_split, max_depth, n_estimators):
    print("Training the model, this could take a little bit of time...")
    X_train = feature_selection.result[2]
    X_test = feature_selection.result[3]
    y_train = feature_selection.result[4]
    y_test = feature_selection.result[5]
    cont_features_selected = feature_selection.result[6]
    cat_features_selected = feature_selection.result[7]
    features_selected = feature_selection.result[0]
    preprocessor = feature_selection.result[1]
    algorithm_starts = time.time()
    rf_model = RandomForestClassifier(min_samples_split=min_samples_split, max_depth=max_depth, n_jobs=2, n_estimators=n_estimators, random_state=201)
    rf_train = rf_model.fit(X_train, y_train)
    training_time = time.time() - algorithm_starts
    
    rf_test_prob = pd.DataFrame(rf_train.predict_proba(X_test))[1]
    rf_train_prob = pd.DataFrame(rf_train.predict_proba(X_train))[1]
    
    output_string = "For the Random Forest, the continuous features chosen were: " + str(cont_features_selected) + \
                    "\nThe categorical features chosen were: " + str(cat_features_selected) + \
                    "\nThe min_samples_split parameter chosen was: " + str(min_samples_split) + \
                    "\nThe max_depth parameter chosen was: " + str(max_depth) + \
                    "\nThe n_estimators parameter chosen was: " + str(n_estimators) + "\n"
    
    print("Training the model took " + "{:.3f}".format(training_time) + " seconds\n" 
          + "In the test set, the Gini index is " + "{:.5f}".format(Kaggle_Gini_Index(rf_test_prob, y_test)) 
          + " and in the training set it is " + "{:.5f}".format(Kaggle_Gini_Index(rf_train_prob, y_train)))
    
    feature_importances = pd.concat([pd.DataFrame(features_selected, columns=['Variable']), 
                                     pd.DataFrame(rf_train.feature_importances_, columns=['Importance'])], axis=1
                                   ).sort_values(by=['Importance'], ascending=False).head(10).reset_index(drop=True)
    print("\nWe can also look at the ten most important features according to the model \n(feature importance measures the percent of variation that variable is responsible for):")
    display(feature_importances)
    
    return rf_train, preprocessor, features_selected, cont_features_selected, cat_features_selected, output_string

rf_training = interactive(train_rf, 
                           {'manual' : True, 'manual_name' : 'Train Random Forest'},
                           max_depth = widgets.BoundedIntText(value=6, 
                                                        min=1,
                                                        max=200, 
                                                        step=1, 
                                                        description="Depth of each tree (between 1 and 200):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False), 
                           min_samples_split = widgets.BoundedIntText(value=2, 
                                                        min=2,
                                                        max=60000, 
                                                        step=1, 
                                                        description="Regularization Parameter (min_samples_split) (between 2 and 60000):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False),
                           n_estimators = widgets.BoundedIntText(value=50, 
                                                        min=1,
                                                        max=500, 
                                                        step=1, 
                                                        description="Number of trees in the forest (between 1 and 500):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False)
                          )
display(rf_training)

interactive(children=(BoundedIntText(value=2, description='Regularization Parameter (min_samples_split) (betwe…

# Boosted Trees

A boosted trees model extends the basic decision tree further. Basically, it is an iterated decision tree. First, you train a decision tree. Then you figure out what errors that decision tree made, and train a second decision tree to correct those errors. To get your final prediction, you take the prediction of the first tree and add to it the prediction from the second tree (technically weighted by a learning rate described below). You can keep doing this as many times as you like, with each iteration fixing the mistakes of the previous iterations. This concept is illustrated in the picture below, where each tree (except for the first) fits the errors of the previous trees. Note that a "tree" here is just the step-like function that predicts the variable (which is what a tree looks like when it is plotted).

<img src="images/gradient-boosted-regression-trees-632x238.png" width="800"/>


You still have to choose the depth of each individual tree (like in a Decision Tree), but you also have to choose a *learning_rate* parameter that determines how the trees are combined. A lower value means that the trees are combined more slowly, so each individual tree contributes less to the overall outcome. So, a lower value reduces overfitting, but too low a value can underfit.
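The iteration and the role of the learning rate can be sketched in a few lines. The example below is a simplified illustration for a numeric target with squared error, using scikit-learn's `DecisionTreeRegressor` on synthetic data. It is not how `XGBClassifier` is implemented internally, and the depth, learning rate, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(201)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

learning_rate = 0.3
n_rounds = 20

# Round 0: start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
training_mse = [float(np.mean((y - prediction) ** 2))]

for _ in range(n_rounds):
    residuals = y - prediction                      # errors made so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=201).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a damped correction
    training_mse.append(float(np.mean((y - prediction) ** 2)))

print("training MSE after 0, 5, 10, 20 rounds:",
      [round(training_mse[i], 4) for i in (0, 5, 10, 20)])
```

In this sketch the training error never goes up from one round to the next, so the training error alone cannot tell you when to stop adding trees.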

You also have to choose how many iterative trees to train. This is different than the number of trees in a Random Forest. Why? Can you overfit with too many trees here?

What is the relationship between these parameters? I.e., if you change the number of trees, should you also change the learning rate? Or are those parameters independent?

These models are very powerful, and very easy to overfit. Pay careful attention to your test set.

In [20]:
def train_xgb(max_depth, learning_rate, n_estimators):
    print("Training the model, this could take a little bit of time...")
    X_train = feature_selection.result[2].values
    X_test = feature_selection.result[3].values
    y_train = feature_selection.result[4].values
    y_test = feature_selection.result[5].values
    cont_features_selected = feature_selection.result[6]
    cat_features_selected = feature_selection.result[7]
    features_selected = feature_selection.result[0]
    preprocessor = feature_selection.result[1]
    algorithm_starts = time.time()
    xgb_model = XGBClassifier(max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators, n_jobs=2, random_state=201)
    xgb_train = xgb_model.fit(X_train, y_train)
    training_time = time.time() - algorithm_starts
    
    xgb_test_prob = pd.DataFrame(xgb_train.predict_proba(X_test))[1]
    xgb_train_prob = pd.DataFrame(xgb_train.predict_proba(X_train))[1]
    
    output_string = "For the Boosted Trees, the continuous features chosen were: " + str(cont_features_selected) + \
                    "\nThe categorical features chosen were: " + str(cat_features_selected) + \
                    "\nThe learning_rate parameter chosen was: " + str(learning_rate) + \
                    "\nThe max_depth parameter chosen was: " + str(max_depth) + \
                    "\nThe n_estimators parameter chosen was: " + str(n_estimators) + "\n"
    
    print("Training the model took " + "{:.3f}".format(training_time) + " seconds\n" 
          + "In the test set, the Gini index is " + "{:.5f}".format(Kaggle_Gini_Index(xgb_test_prob, y_test)) 
          + " and in the training set it is " + "{:.5f}".format(Kaggle_Gini_Index(xgb_train_prob, y_train)))
    
    feature_importances = pd.concat([pd.DataFrame(features_selected, columns=['Variable']), 
                                     pd.DataFrame(xgb_train.feature_importances_, columns=['Importance'])], axis=1
                                   ).sort_values(by=['Importance'], ascending=False).head(10).reset_index(drop=True)
    print("\nWe can also look at the ten most important features according to the model \n(feature importance measures the percent of variation that variable is responsible for):")
    display(feature_importances)
    
    return xgb_train, preprocessor, features_selected, cont_features_selected, cat_features_selected, output_string

xgb_training = interactive(train_xgb, 
                           {'manual' : True, 'manual_name' : 'Train XGB'}, 
                           max_depth = widgets.BoundedIntText(value=6, 
                                                        min=1,
                                                        max=200, 
                                                        step=1, 
                                                        description="Depth of each tree (between 1 and 200):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False), 
                           learning_rate = widgets.BoundedFloatText(value=1, 
                                                        min=.001,
                                                        max=200, 
                                                        step=.1, 
                                                        description="Regularization Parameter (learning_rate) (between .001 and 200):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False),
                           n_estimators = widgets.BoundedIntText(value=50, 
                                                        min=1,
                                                        max=500, 
                                                        step=1, 
                                                        description="Number of trees (between 1 and 500):", 
                                                        style={'description_width': 'initial'},
                                                        layout=Layout(width='500px'),
                                                        disabled=False)
                          )
display(xgb_training)

interactive(children=(BoundedIntText(value=6, description='Depth of each tree (between 1 and 200):', layout=La…

# Submitting your predictions

Now that you have trained and tuned four different models, you must submit your predictions on a held-out data set, called the *validation* set. This validation set is not part of your training or evaluation in any way, and you do not have the labels for this validation set. So, you will not know how your chosen model does on this data set until we reveal it in class.

You should pick a model that does well on your test set, not your training set (though it should also do well on that) because your test set is going to be a better proxy for the performance on the validation set.

In [21]:
def submit_model(model):
    print("Submitting predictions...")
    if model == 'Logistic Regression':
        if not rlr_training.result:
            print("You have not trained a Logistic Regression model yet, train one first.")
            raise ValueError("Model not trained")
        fitted_model = rlr_training.result[0]
        preprocessor = rlr_training.result[1]
        features_selected = rlr_training.result[2]
        model_selected_string = "Logistic Regression was submitted."
    elif model == 'Decision Tree':
        if not dt_training.result:
            print("You have not trained a Decision Tree model yet, train one first.")
            raise ValueError("Model not trained")
        fitted_model = dt_training.result[0]
        preprocessor = dt_training.result[1]
        features_selected = dt_training.result[2]
        model_selected_string = "Decision Tree was submitted."
    elif model == 'Random Forest':
        if not rf_training.result:
            print("You have not trained a Random Forest model yet, train one first.")
            raise ValueError("Model not trained")
        fitted_model = rf_training.result[0]
        preprocessor = rf_training.result[1]
        features_selected = rf_training.result[2]
        model_selected_string = "Random Forest was submitted."
    elif model == 'Boosted Trees':
        if not xgb_training.result:
            print("You have not trained a Boosted Trees model yet, train one first.")
            raise ValueError("Model not trained")
        fitted_model = xgb_training.result[0]
        preprocessor = xgb_training.result[1]
        features_selected = xgb_training.result[2]
        model_selected_string = "Boosted Trees was submitted."
        
    model_parameters_string = ""
    if rlr_training.result:
        model_parameters_string = model_parameters_string + rlr_training.result[5] + "\n"
    if dt_training.result:
        model_parameters_string = model_parameters_string + dt_training.result[5] + "\n"
    if rf_training.result:
        model_parameters_string = model_parameters_string + rf_training.result[5] + "\n"
    if xgb_training.result:
        model_parameters_string = model_parameters_string + xgb_training.result[5] + "\n"
    
    validation_df = pd.read_csv('test.csv')
    validation_df['PurchDate'] = pd.to_datetime(validation_df['PurchDate'], format='%m/%d/%Y')
    validation_df['Month'] = validation_df['PurchDate'].dt.month
    validation_df['Year'] = validation_df['PurchDate'].dt.year
    # Use the columns keyword: passing the axis positionally fails in pandas 2.0 and later
    validation_df = validation_df.drop(columns=['PurchDate', 'RefId'])
    print("The first 10 rows of the validation set are:")
    display(validation_df.head(10))
    X = preprocessor.transform(validation_df)
    prediction = pd.DataFrame(fitted_model.predict_proba(X), columns=['ProbIsGoodBuy', 'ProbIsBadBuy'])
    print("Your first 10 predictions for the validation set are:")
    display(prediction.head(10))
    subpath_to_submission = ""
    prediction['ProbIsBadBuy'].to_csv(subpath_to_submission + "submission.csv", index=False, header=False)
    with open(subpath_to_submission + "model_description.txt", "w") as file:
        file.write(model_parameters_string)
        file.write(model_selected_string)
        
    model_dump = [preprocessor, fitted_model]
    with open("model.dump", 'wb') as f:
        dill.dump(model_dump, f)
    
    print("Predictions submitted!")
    
    return 

model_submission = interactive(submit_model, 
                           {'manual' : True, 'manual_name' : 'Submit Model'}, 
                           model = widgets.Dropdown(options=['Logistic Regression', 'Decision Tree', 'Random Forest', 'Boosted Trees'],
                                                   description='Model to submit:',
                                                   style={'description_width': 'initial'},
                                                   layout=Layout(width='500px'),
                                                   disabled=False),
                          )
display(model_submission)

interactive(children=(Dropdown(description='Model to submit:', layout=Layout(width='500px'), options=('Logisti…

After you have submitted your model, you will need to "Log Out". This lets Amazon Web Services know that you no longer need your virtual machine, so it can be shut down. This saves money, since we do not pay for machines that have been shut down. Please make sure you log out every time you are finished. If you are unsure whether or not you did, you can log back in and log out again.

To log out, click "File -> Log Out" (as shown in the image below). You will know that you have successfully logged out because it will take you back to the login page.

<img src="images/Log Out.png" width="800"/>

We appreciate your help in stewarding UVA's resources!