# Overview

This notebook continues the exercise of predicting car prices based on Craigslist listings.

This notebook uses `pandas` and `scikit-learn`, two popular Python packages for data science.

In order to make the processing more efficient, a sample of the data has been taken.


## Setup

The cell below imports all the packages needed for this exercise. If you get a `ModuleNotFoundError`, install the missing package (with pip or conda) before continuing.
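
One way to see which packages are missing before running the import cell is sketched below. The helper name `find_missing` and the list of names are illustrative assumptions, not part of the course material; the list holds import names, which can differ from the names used to install a package.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names used in this notebook (scikit-learn imports as `sklearn`)
required = ['feather', 'pandas', 'pandas_profiling', 'sklearn']
missing = find_missing(required)
print('Missing packages:', missing if missing else 'none')
```

Anything reported as missing needs to be installed before the import cell will run.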

In [None]:
import time
import feather
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression, LassoLars, ElasticNet, Ridge, HuberRegressor, PassiveAggressiveRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import explained_variance_score, mean_squared_error

# Code from Introduction to Data Science Exercise

This code was used to create the profiles you used in an earlier course in this series, with the exception that it runs on a 10% sample instead of the 500k+ listings.

## Read Sample Data from Feather to Pandas dataFrame

In [None]:
cars = feather.read_dataframe('../data/vehicles_samp.feather')

## Print first observations for sanity

In [None]:
cars.head()

## Print the columns and their respective types for the cars dataFrame

In [None]:
cars.info()

## Create profile of full dataset using `pandas-profiling`

In [None]:
prof = ProfileReport(cars)
prof.to_file(output_file='vehicle_profile.html')

## Filter the data for cars with more than 300,000 miles and sort descending

This is an example of how Python programmers can chain methods together to accomplish multiple tasks in the same line of code.

In [None]:
cars[cars['odometer'] > 300000].filter(items=['year', 'price', 'odometer', 'model']).sort_values('odometer', ascending=False).head()
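
The sketch below shows the same chaining idea on a small made-up DataFrame (the values are invented for illustration only). It compares the step-by-step version with the chained version to show that they produce the same result.

```python
import pandas as pd

# Made-up listings, only for illustration
toy = pd.DataFrame({
    'year': [1998, 2004, 2001, 2010],
    'price': [1500, 2500, 1800, 9000],
    'odometer': [350000, 310000, 420000, 80000],
    'model': ['a', 'b', 'c', 'd'],
    'state': ['nc', 'sc', 'va', 'nc'],
})

# Unchained: one intermediate variable per step
high_miles = toy[toy['odometer'] > 300000]
subset = high_miles.filter(items=['year', 'price', 'odometer', 'model'])
unchained = subset.sort_values('odometer', ascending=False)

# Chained: the same steps in one expression; the parentheses allow line breaks
chained = (
    toy[toy['odometer'] > 300000]
    .filter(items=['year', 'price', 'odometer', 'model'])
    .sort_values('odometer', ascending=False)
)

print(chained)
print(chained.equals(unchained))
```

Wrapping a long chain in parentheses is a common way to keep it readable without creating intermediate variables.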

## Filter the number of columns

With wide datasets it is often hard to see all the columns. The code below is an example of how the columns can be filtered to just those that are desired.

As with all things Python, there are many ways to accomplish this task.

In [None]:
cars.filter(items=['year','price','odometer','model']).sort_values('odometer', ascending=False).head()


In [None]:
cars.filter(items=['year','price','odometer','model']).sort_values('price', ascending=False).head()

# Create new dataFrame `cars2` 

`cars2` only includes listings that meet all of the following criteria:
*    newer than 2000
*    cost less than $40,000
*    have fewer than 100,000 miles

In [None]:
cars2 = cars[(cars['year']>2000) & (cars['price']<40000) & (cars['odometer']<100000)]
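
As a minimal sketch of how this kind of filter behaves, the example below builds the three conditions separately on made-up data and combines them with `&`. The values are invented for illustration. Note that a row with a missing value in any of the compared columns fails the comparison and is dropped as well.

```python
import pandas as pd

# Made-up listings; the last row has a missing odometer reading
toy = pd.DataFrame({
    'year': [1999, 2005, 2012, 2018, 2015],
    'price': [3000, 45000, 12000, 25000, 10000],
    'odometer': [180000, 60000, 95000, 120000, float('nan')],
})

# Each comparison yields a boolean Series with one True/False per row
newer = toy['year'] > 2000
cheaper = toy['price'] < 40000
low_miles = toy['odometer'] < 100000

# & keeps only the rows where all three conditions are True.
# When comparisons are written inline, each needs its own parentheses
# because & binds more tightly than > and <.
keep = newer & cheaper & low_miles
toy2 = toy[keep]

print(keep.tolist())
print(toy.shape, toy2.shape)
```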

## Comparing the shape of `cars` and `cars2`

I want to see how many rows and columns `cars` has and how many rows were lost due to the filtering above.

In the full dataset this filtering reduced the data from 540K to 219K rows; the sample used here will show smaller counts.

In [None]:
cars.shape

In [None]:
cars2.shape

In [None]:
prof = ProfileReport(cars2)
prof.to_file(output_file='vehicle_profile2.html')

In [None]:
# index=False keeps the row index from being written as an extra unnamed column
cars2.to_csv('../data/cars2.zip', index=False)

In [None]:
cars2 = pd.read_csv('../data/cars2.zip')

In [None]:
cars2.dtypes
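
When a DataFrame is written to CSV and read back, whether the row index is written affects the columns that come back. The sketch below uses a made-up two-row DataFrame and an in-memory buffer instead of a file, purely for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({'year': [2005, 2012], 'price': [4500, 12000]})

# Default behavior: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```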

# Introduction to Modeling and Machine Learning

The rest of this notebook is work that should be completed along with the Modeling and Machine Learning course.



# Setup for data prep and modeling

To make the code more portable and easier to update, the cell below defines the inputs and the target.

The inputs are the attributes of a car that will be used to predict the price, which is the target.

Missing values are treated differently for nominal and interval data, so the lists below identify the type of each variable.

The virtue of this type of organization is that adding or removing a variable from the model only requires modifying this cell and then rerunning the code below.

It also allows more copy/paste from project to project, saving time in the long run.

In [None]:
inputs = cars2.drop(columns=['price','id','url','region_url','vin','description','county','image_url'])
nominal = ['region', 'manufacturer', 'fuel', 'model','title_status','transmission', 'drive', 'type','paint_color','state']
ordinal = ['condition','size','cylinders']
interval = ['year','odometer','lat','long']
target = cars2.price
print(inputs.shape, target.shape)
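
Because the lists are maintained by hand, a quick consistency check can catch a misspelled or forgotten column. The sketch below uses a hypothetical helper, `check_variable_lists`, and a made-up one-row DataFrame; neither is part of the original exercise.

```python
import pandas as pd


def check_variable_lists(frame, **groups):
    """Return (names listed but absent from frame, frame columns not in any list)."""
    listed = [name for names in groups.values() for name in names]
    absent = [name for name in listed if name not in frame.columns]
    unlisted = [col for col in frame.columns if col not in listed]
    return absent, unlisted


# Made-up inputs: 'drive' is listed but missing, 'extra' is present but unlisted
toy_inputs = pd.DataFrame({'year': [2005], 'fuel': ['gas'], 'condition': ['good'], 'extra': [1]})
absent, unlisted = check_variable_lists(
    toy_inputs,
    nominal=['fuel', 'drive'],
    ordinal=['condition'],
    interval=['year'],
)
print('listed but not in data:', absent)
print('in data but not listed:', unlisted)
```

With the real data, the same call could be made with `inputs` and the `nominal`, `ordinal`, and `interval` lists defined above.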


## Create Pipelines

A `Pipeline` chains a set of processing steps so that they are always applied in the same order.

In the pipelines below, missing values are imputed based on each variable's type: interval variables get the median and are then scaled, while nominal and ordinal variables get the constant `'missing'` and are then one-hot encoded. Scaling inputs is a best practice, along with splitting data into training and validation sets.

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, interval),
        ('cat', categorical_transformer, nominal),
        ('ord', categorical_transformer, ordinal)])
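
To see what a preprocessor like this produces, the sketch below applies the same two kinds of steps to a made-up four-row DataFrame with one numeric and one categorical column. The column values and the name `toy_preprocessor` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value in each column
toy = pd.DataFrame({
    'year': [2001.0, 2005.0, np.nan, 2010.0],
    'fuel': pd.Series(['gas', np.nan, 'diesel', 'gas'], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())]), ['year']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['fuel']),
])

train_matrix = toy_preprocessor.fit_transform(toy)
# 1 scaled numeric column + 3 one-hot columns (diesel, gas, missing)
print(train_matrix.shape)

# A category not seen during fit is encoded as all zeros instead of raising an error
new_row = pd.DataFrame({'year': [2015.0], 'fuel': pd.Series(['electric'], dtype=object)})
new_matrix = toy_preprocessor.transform(new_row)
print(new_matrix.shape)
```

The imputed `'missing'` value becomes its own one-hot column, so the model can treat "not reported" as a category in its own right.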

## Split the data into training and validation (70/30)

Another needed step in modeling is to split the data into training and validation sets. This helps protect against a model overfitting the training data and, as a result, making poor predictions for future cases.

People disagree on the right proportions for training and validation, but assuming I have "enough" data, I usually split the data into 70% training and 30% validation.

In [None]:
input_train, input_test, target_train, target_test = train_test_split(inputs, target, test_size = 0.30, random_state=9878)
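
The sketch below runs the same kind of split on ten made-up rows to show the resulting sizes and that inputs and target stay aligned. The variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows so the 70/30 split is easy to verify by eye
toy_inputs = pd.DataFrame({'odometer': range(10)})
toy_target = pd.Series(range(10), name='price')

x_train, x_valid, y_train, y_valid = train_test_split(
    toy_inputs, toy_target, test_size=0.30, random_state=9878)

print(len(x_train), len(x_valid))
# Inputs and target keep the same row labels after the shuffle
print((x_train.index == y_train.index).all())
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the notebook gives the same training and validation rows.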

# Build the models

In the code below, a list of many algorithms is provided, and for each one a model is built on the training data.

The model is then applied to the validation data and assessment metrics are calculated.

The assessment metrics are then printed so that comparisons can be made.

**Remember:** there is no single best modeling algorithm, so you should try many types and parameterizations based on the business problem.
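
The assessment metrics can disagree with each other, which matters when comparing models. In the sketch below (made-up numbers), every prediction is exactly 1 too high: explained variance ignores that constant offset, while R² (what a regressor's `score` method returns) and MSE both penalize it.

```python
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

actual = [3, 5, 7, 9]
predicted = [4, 6, 8, 10]  # every prediction is exactly 1 too high

ev = explained_variance_score(actual, predicted)
r2 = r2_score(actual, predicted)
mse = mean_squared_error(actual, predicted)

print('explained variance:', ev)
print('R^2:', r2)
print('MSE:', mse)
```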

In [None]:
# Create the results DataFrame in its own cell so that rerunning the modeling cell does not wipe earlier results.
results_df = pd.DataFrame(columns=('model','r_2','explainedVariance','mse','buildTime','scoreTime'))
results_df['r_2'] = results_df['r_2'].map('${:,.4f}'.format)
results_df['explainedVariance'] = results_df['explainedVariance'].map('${:,.4f}'.format)
results_df['mse'] = results_df['mse'].map('${:,.0f}'.format)
results_df['buildTime'] = results_df['buildTime'].map('${:,.1f}'.format)
results_df['scoreTime'] = results_df['scoreTime'].map('${:,.1f}'.format)

**Note: This step could take several minutes to complete**

The models in the list are regression and decision tree algorithms and their derivatives, nearly all with default parameter values.

A neural network is not trained here because, for this data, it takes over 60 minutes in the Binder environment.

Here is a link to the documentation so that you can [learn more](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). 

In [None]:
%%time
# List of models to run
model_list = [LinearRegression(), ElasticNet(), Ridge(max_iter=300), HuberRegressor(), PassiveAggressiveRegressor(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]
# Removed MLPRegressor() because it was taking 60 minutes to train
#model_list = [MLPRegressor(early_stopping=True, hidden_layer_sizes=(3,50))]
# LassoLars(), 

# loop through model list
for model in model_list:
    build_model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)])
    ts_build = time.time()
    build_model.fit(input_train, target_train)
    ts_score = time.time()
    target_pred = build_model.predict(input_test)
    ts_done = time.time()
    # compute each metric once and reuse it for the table and the printout
    r2 = build_model.score(input_test, target_test)
    ev = explained_variance_score(target_test, target_pred)
    mse = mean_squared_error(target_test, target_pred)
    build_time = ts_score - ts_build
    score_time = ts_done - ts_score
    results = pd.DataFrame([[model, r2, ev, mse, build_time, score_time]],
                           columns=results_df.columns)
    # append results to dataFrame (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
    results_df = pd.concat([results_df, results], ignore_index=True)

    print("model score for {}: R_2: {:0.4f} Explained Variance:{:0.4f} MSE:{:0.0f} Build Time(s):{:0.1f} Score Time(s):{:0.1f}".format(model, r2, ev, mse, build_time, score_time))

In [None]:
# print the results dataFrame
results_df

# Question 1 

Using the table above, which model is the best if `explainedVariance` is the most important metric?

# Answer 1 



# Question 2

Considering all the factors in the table, which model would you choose to deploy? Why?

# Answer 2



# Question 3

How do the times to train (`buildTime`) the model compare to the times to score (`scoreTime`) the model?

# Answer 3



# Question 4

What might be some next steps to take after getting these results?

# Answer 4

# Question 5 BONUS
 
Can you find a model that has a better explained variance than the default models I provided above?


# Answer 5

Your solution here: