# Overview

This notebook continues the exercise of predicting car prices based on Craigslist listings.

This notebook uses `pandas` and `scikit-learn`, two popular Python packages for data science.


## Setup

This is where all the packages needed for the exercise are imported. If you get a `ModuleNotFoundError`, install the missing package (with pip or conda) before continuing.
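
One way to find out which packages are missing before running the import cell is sketched below. The helper name `missing_packages` is made up for this illustration; the module names are the top-level names used by the imports in this notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level module names imported by this notebook
needed = ["pandas", "pandas_profiling", "sklearn"]
print("Missing:", missing_packages(needed))
```

An empty list means the import cell should run without a `ModuleNotFoundError`.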

In [1]:
import time
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression, LassoLars, ElasticNet, Ridge, HuberRegressor, PassiveAggressiveRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import explained_variance_score, mean_squared_error

## Read data from CSV to pandas DataFrame

In [2]:
cars = pd.read_csv('../data/vehicles.csv')

## Print first observations for sanity

In [3]:
cars.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,drive,size,type,paint_color,image_url,description,county,state,lat,long
0,7088746062,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,10299,2012.0,acura,tl,,,...,,,other,blue,https://images.craigslist.org/01414_3LIXs9EO33...,2012 Acura TL Base 4dr Sedan Offered by: B...,,nc,35.7636,-78.7443
1,7088745301,https://greensboro.craigslist.org/ctd/d/bmw-3-...,greensboro,https://greensboro.craigslist.org,0,2011.0,bmw,335,,6 cylinders,...,rwd,,convertible,blue,https://images.craigslist.org/00S0S_1kTatLGLxB...,BMW 3 Series 335i Convertible Navigation Dakot...,,nc,,
2,7088744126,https://greensboro.craigslist.org/cto/d/greens...,greensboro,https://greensboro.craigslist.org,9500,2011.0,jaguar,xf,excellent,,...,,,,blue,https://images.craigslist.org/00505_f22HGItCRp...,2011 jaguar XF premium - estate sale. Retired ...,,nc,36.1032,-79.8794
3,7088743681,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,3995,2004.0,honda,element,,,...,fwd,,SUV,orange,https://images.craigslist.org/00E0E_eAUnhFF86M...,2004 Honda Element LX 4dr SUV Offered by: ...,,nc,35.7636,-78.7443
4,7074612539,https://lincoln.craigslist.org/ctd/d/gretna-20...,lincoln,https://lincoln.craigslist.org,41988,2016.0,chevrolet,silverado k2500hd,,,...,,,,,https://images.craigslist.org/00S0S_8msT7RQquO...,"Shop Indoors, Heated Showroom!!!www.gretnaauto...",,ne,41.1345,-96.2458


## Print the columns and their respective types for the cars DataFrame

In [4]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            539759 non-null  int64  
 1   url           539759 non-null  object 
 2   region        539759 non-null  object 
 3   region_url    539759 non-null  object 
 4   price         539759 non-null  int64  
 5   year          538772 non-null  float64
 6   manufacturer  516175 non-null  object 
 7   model         531746 non-null  object 
 8   condition     303707 non-null  object 
 9   cylinders     321264 non-null  object 
 10  fuel          536366 non-null  object 
 11  odometer      440783 non-null  float64
 12  title_status  536819 non-null  object 
 13  transmission  535786 non-null  object 
 14  vin           315349 non-null  object 
 15  drive         383987 non-null  object 
 16  size          168550 non-null  object 
 17  type          392290 non-null  object 
 18  pain

## Create profile of full dataset using `pandas-profiling`

In [None]:
prof = ProfileReport(cars)
prof.to_file(output_file='vehicle_profile.html')

## Filter the data for cars with more than 300,000 miles and sort descending

This is an example of how Python programmers can chain methods together to accomplish multiple tasks in a single statement.

In [None]:
(cars[cars['odometer'] > 300000]
    .filter(items=['year', 'price', 'odometer', 'model'])
    .sort_values('odometer', ascending=False)
    .head())
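
The chained expression above can be tried on a small frame to see what each step contributes. The data below is invented purely for illustration and is not taken from the listings.

```python
import pandas as pd

# Invented rows, only to show the chaining pattern
toy = pd.DataFrame({
    "year": [2001, 2010, 2015, 1999],
    "price": [1500, 7000, 12000, 900],
    "odometer": [350000, 120000, 310000, 400000],
    "model": ["civic", "f-150", "camry", "accord"],
})

high_miles = (
    toy[toy["odometer"] > 300000]              # keep rows above 300,000 miles
    .filter(items=["model", "odometer"])       # keep only two columns
    .sort_values("odometer", ascending=False)  # highest mileage first
)
print(high_miles)
```

Wrapping the chain in parentheses lets each method sit on its own line, which makes it easy to comment out one step while exploring.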

## Filter the number of columns

With wide datasets it is often hard to see all the columns. The code below is an example of how the columns can be filtered to just those that are desired.

As with all things Python, there are many ways to accomplish this task.

In [None]:
cars.filter(items=['year','price','odometer','model']).sort_values('odometer', ascending=False).head()


In [None]:
cars.filter(items=['year','price','odometer','model']).sort_values('price', ascending=False).head()
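
As a sketch of the "many ways" mentioned above, the three selections below return the same columns on a small invented frame. The main practical difference is how a misspelled or absent column name is handled.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({"year": [2012, 2004], "price": [10299, 3995], "state": ["nc", "nc"]})
wanted = ["year", "price"]

by_filter = toy.filter(items=wanted)   # silently skips names that are absent
by_brackets = toy[wanted]              # raises KeyError if a name is absent
by_loc = toy.loc[:, wanted]            # label-based selection, same result here

print(by_filter.equals(by_brackets) and by_brackets.equals(by_loc))

# filter() drops the unknown name instead of raising an error
print(toy.filter(items=["year", "no_such_column"]).columns.tolist())
```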

# Create new DataFrame `cars2`

`cars2` only includes listings that meet all of the following criteria:
*    newer than model year 2000
*    priced below $40,000
*    fewer than 100,000 miles

In [5]:
cars2 = cars[(cars['year']>2000) & (cars['price']<40000) & (cars['odometer']<100000)]
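
One side effect of the boolean mask above is worth knowing: a comparison against a missing value evaluates to `False`, so listings with a missing `year` or `odometer` are dropped as well. The sketch below shows this on invented rows and also shows `DataFrame.query` as an equivalent spelling.

```python
import numpy as np
import pandas as pd

# Invented rows: only the first one satisfies all three conditions
toy = pd.DataFrame({
    "year": [2012, 2015, 1998, 2018],
    "price": [10299, 45000, 3000, 20000],
    "odometer": [60000, 20000, 90000, np.nan],
})

mask = (toy["year"] > 2000) & (toy["price"] < 40000) & (toy["odometer"] < 100000)
kept = toy[mask]
print(kept)

# The same filter written with query(); the last row is dropped because its odometer is missing
kept_query = toy.query("year > 2000 and price < 40000 and odometer < 100000")
print(kept.equals(kept_query))
```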

## Comparing the shape of `cars` and `cars2`

I want to see how many rows and columns `cars` has and how many rows were lost to the filtering above.

The data shows a reduction from about 540K rows to about 219K rows.

In [6]:
cars.shape

(539759, 25)

In [7]:
cars2.shape

(218870, 25)

In [None]:
prof = ProfileReport(cars2)
prof.to_file(output_file='vehicle_profile2.html')

In [8]:
cars2.to_csv('../data/cars2.zip')

In [79]:
cars2 = pd.read_csv('../data/cars2.zip')

In [80]:
cars2.dtypes

Unnamed: 0        int64
id                int64
url              object
region           object
region_url       object
price             int64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
vin              object
drive            object
size             object
type             object
paint_color      object
image_url        object
description      object
county          float64
state            object
lat             float64
long            float64
dtype: object
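
The `Unnamed: 0` column in the listing above is the DataFrame index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. The sketch below reproduces this with an in-memory buffer and shows that `index=False` avoids it. The toy data is invented.

```python
import io
import pandas as pd

toy = pd.DataFrame({"price": [10299, 3995]}, index=[0, 3])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```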

## Setup for data prep and modeling

To make the code more portable and easier to update, the cell below defines the inputs and the target.

The inputs are the attributes of a car that will be used to predict the price, which is the target.

Missing values are treated differently for nominal and interval data, so the lists below identify the type of each variable.

The virtue of this organization is that adding or removing a variable from the model only requires modifying this cell and rerunning the code below it.

It also allows more copy/paste from project to project, saving time in the long run.

In [9]:
inputs = cars2.drop(columns=['price','id','url','region_url','vin','description','county','image_url'])
nominal = ['region', 'manufacturer', 'fuel', 'model','title_status','transmission', 'drive', 'type','paint_color','state']
ordinal = ['condition','size','cylinders']
interval = ['year','odometer','lat','long']
target = cars2.price
print(inputs.shape, target.shape)


(218870, 17) (218870,)
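
Because the variable lists are maintained by hand, a quick consistency check can catch a column that was forgotten or misspelled. This matters because `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`). The sketch below uses invented stand-ins (`toy_inputs`, `toy_nominal`, and so on) rather than the notebook's own objects.

```python
import pandas as pd

# Invented stand-ins for inputs, nominal, ordinal and interval
toy_inputs = pd.DataFrame(columns=["region", "condition", "year", "odometer"])
toy_nominal = ["region"]
toy_ordinal = ["condition"]
toy_interval = ["year", "odometer"]

assigned = toy_nominal + toy_ordinal + toy_interval
unassigned = sorted(set(toy_inputs.columns) - set(assigned))          # in the data, in no list
unknown = sorted(set(assigned) - set(toy_inputs.columns))             # in a list, not in the data
duplicated = sorted({c for c in assigned if assigned.count(c) > 1})   # in more than one list

print(unassigned, unknown, duplicated)
```

All three lists being empty means every input column is assigned to exactly one variable type.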


## Create Pipelines

A `Pipeline` is a set of steps organized to run in a fixed order.

In the pipeline below, each variable is imputed according to its type if it is missing. Interval variables are then scaled, while nominal and ordinal variables are one-hot encoded. Scaling the inputs is a best practice that often improves model performance, as is splitting the data into training and validation sets.

In [10]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, interval),
        ('cat', categorical_transformer, nominal),
        ('ord', categorical_transformer, ordinal)])
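
To see what the preprocessor produces, the same structure can be applied to a tiny invented frame with one numeric and one categorical column. The names `toy` and `toy_preprocessor` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows with one missing value in each column
toy = pd.DataFrame({
    "odometer": [60000.0, np.nan, 20000.0, 90000.0],
    "fuel": ["gas", "diesel", np.nan, "gas"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["odometer"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["fuel"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus one column per category (diesel, gas, missing)
print(transformed.shape)
```

Note that the missing `fuel` value becomes its own `missing` category rather than being dropped.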

## Split the data into training and validation (70/30)

In [11]:
input_train, input_test, target_train, target_test = train_test_split(inputs, target, test_size = 0.30, random_state=9878)
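
The sketch below runs the same call on ten invented rows to show the 70/30 proportion and that inputs and target stay aligned after shuffling. Fixing `random_state` makes the split repeatable from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented rows
toy_inputs = pd.DataFrame({"odometer": range(10)})
toy_target = pd.Series(range(10), name="price")

x_train, x_test, y_train, y_test = train_test_split(
    toy_inputs, toy_target, test_size=0.30, random_state=9878)

print(len(x_train), len(x_test))
# Inputs and target are shuffled together, so their indexes still match
print(x_train.index.equals(y_train.index))
```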

# Build the models

The code below provides a list of algorithms, and for each one a model is built on the training data.

The model is then applied to the validation data and assessment metrics are calculated.

The assessment metrics are then printed so that comparisons can be made.

**Remember:** there is no single best modeling algorithm, so you should try many types and parameterizations based on the business problem.

In [62]:
# Creates a results DataFrame. This code is in a separate cell so that results are not lost from run to run.
results_df = pd.DataFrame(columns=('model','r_2','explainedVariance','mse','buildTime','scoreTime'))
results_df['r_2'] = results_df['r_2'].map('${:,.4f}'.format)
results_df['explainedVariance'] = results_df['explainedVariance'].map('${:,.4f}'.format)
results_df['mse'] = results_df['mse'].map('${:,.0f}'.format)
results_df['buildTime'] = results_df['buildTime'].map('${:,.1f}'.format)
results_df['scoreTime'] = results_df['scoreTime'].map('${:,.1f}'.format)

In [67]:
%%time
# List of models to run
# Full list of candidate models (LassoLars() is left out)
model_list = [LinearRegression(), ElasticNet(), Ridge(), HuberRegressor(), PassiveAggressiveRegressor(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor(), MLPRegressor()]
# Override for this run: fit only the MLP. Remove this line to run the full list.
model_list = [MLPRegressor(early_stopping=True, hidden_layer_sizes=(3,50))]

# loop through model list
for model in model_list:
    build_model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)])
    ts_build = time.time()
    build_model.fit(input_train, target_train)
    ts_score = time.time()
    target_pred = build_model.predict(input_test)
    ts_done = time.time()
    results = pd.DataFrame([[model, 
                build_model.score(input_test, target_test), 
                explained_variance_score(target_test, target_pred), 
                mean_squared_error(target_test, target_pred), 
                ts_score-ts_build, 
                ts_done-ts_score]], columns=results_df.columns)
    # append results to DataFrame (DataFrame.append was removed in pandas 2.0)
    results_df = pd.concat([results_df, results], ignore_index=True)

    print("model score for {}: R_2: {:0.4f} Explained Variance:{:0.4f} MSE:{:0.0f} Build Time(s):{:0.1f} Score Time(s):{:0.1f}".format(model, build_model.score(input_test, target_test), explained_variance_score(target_test, target_pred), mean_squared_error(target_test, target_pred), ts_score-ts_build, ts_done-ts_score))

model score for MLPRegressor(): R_2: 0.6342 Explained Variance:0.6342 MSE:37016331 Build Time(s):19540.3 Score Time(s):2.1
CPU times: user 14h 42min 43s, sys: 2h 16min 48s, total: 16h 59min 31s
Wall time: 5h 25min 45s
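
MSE is in squared price units, which makes it hard to read. Taking the square root gives the RMSE, which is in the same units as `price`. The sketch below does this for the MSE printed above.

```python
import math

mse = 37016331          # MSE reported above for MLPRegressor
rmse = math.sqrt(mse)   # back in the same units as price
print("RMSE: {:,.0f}".format(rmse))
```

Read this way, the model's typical error on the validation data is on the order of six thousand dollars per listing.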


In [78]:
#results_df['r_2'] = results_df['r_2'].map('{:0.4f}'.format)
#results_df['explainedVariance'] = results_df['explainedVariance'].map('{:0.4f}'.format)
#results_df['mse'] = results_df['mse'].map('{:,0.0f}'.format)
#results_df['buildTime'] = results_df['buildTime'].map('{:.1f}'.format)
#results_df['scoreTime'] = results_df['scoreTime'].map('{:.1f}'.format)
results_df

| | model | r_2 | explainedVariance | mse | buildTime | scoreTime |
|---|---|---|---|---|---|---|
| 0 | ElasticNet() | $0.2273 | $0.2273 | $78,199,094 | $21.1 | $0.4 |
| 1 | Ridge() | $0.5074 | $0.5074 | $49,848,804 | $3.6 | $0.4 |
| 2 | HuberRegressor() | $0.4090 | $0.4201 | $59,806,606 | $6.2 | $0.4 |
| 3 | ElasticNet(selection='random') | $0.2273 | $0.2273 | $78,199,078 | $64.4 | $0.4 |
| 4 | Ridge() | $0.5074 | $0.5074 | $49,848,804 | $3.7 | $0.4 |
| 5 | HuberRegressor() | $0.4090 | $0.4201 | $59,806,606 | $6.1 | $0.4 |
| 6 | MLPRegressor() | $0.6342 | $0.6342 | $37,016,331 | $19,540.3 | $2.1 |
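
In the results above, `r_2` and `explainedVariance` agree for most models but differ for `HuberRegressor` (0.4090 versus 0.4201). The two metrics coincide when the residuals average to zero; explained variance ignores a constant bias in the predictions, while R² penalizes it. The invented example below makes the difference visible.

```python
from sklearn.metrics import explained_variance_score, r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_biased = [2.0, 3.0, 4.0, 5.0]   # every prediction is too high by exactly 1

ev = explained_variance_score(y_true, y_biased)   # residuals have zero variance
r2 = r2_score(y_true, y_biased)                   # the constant bias still counts as error
print(ev, r2)
```

A gap between the two columns is therefore a hint that a model's predictions are systematically too high or too low on the validation data.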


In [None]:
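# Fit the scaler on the interval (numeric) training columns only
scaler = StandardScaler()
scaler.fit(input_train[interval])

# One-hot encode the nominal columns (output is sparse by default)
onehot = OneHotEncoder(dtype=int)
onehot.fit(input_train[nominal])

input_train_nominal = onehot.transform(input_train[nominal])
input_train_interval = scaler.transform(input_train[interval])
# Apply the scaler fitted on the training data to the same columns of the test data
input_test_interval = scaler.transform(input_test[interval])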

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.model_selection import cross_val_score



def displaymetrics(code: list, models: list, X_train: np.ndarray, X_test: np.ndarray,
                   y_train: np.ndarray, y_test: np.ndarray, X: pd.DataFrame):
    # Assumes binary classifiers that implement decision_function
    for model in models:  # or for i in range(0, len(models)):
        y_score = model.fit(X_train, y_train).decision_function(X_test)
        # or y_score = models[i].fit(X_train, y_train).decision_function(X_test)
        fpr, tpr, _ = roc_curve(y_test, y_score)
        # etc etc