<a href="https://colab.research.google.com/github/mvince33/Coding-Dojo/blob/main/week06/6_20_Challenge_Regression_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From data to model

In this notebook we will take a dataset and prepare it, train a model, and make a prediction.

Steps
1. Load and inspect the data
2. Clean the data
3. Split the data
4. Pre-process the data
5. Model the data
6. Evaluate the model

Goal: We want to predict the mpg of the car.


Import Libraries

In [None]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
set_config(display='diagram')

In [None]:
def eval_regression(true, pred):
  """Takes true and predicted values (arrays) and prints MAE, MSE, RMSE and R2"""
  mae = mean_absolute_error(true, pred)
  mse = mean_squared_error(true, pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(true, pred)

  print(f'MAE: {mae}\nMSE: {mse}\nRMSE: {rmse}\nR^2: {r2}')
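
To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The `y_true` and `y_pred` values are made-up toy numbers chosen for easy arithmetic, not values from the car dataset.

```python
import numpy as np

# Toy values (an assumption for illustration, not from the car dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                         # residuals: [1, 0, -1, -2]
mae = np.mean(np.abs(errors))                    # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)                       # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                              # square root of MSE, about 1.22
ss_res = np.sum(errors ** 2)                     # residual sum of squares: 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20
r2 = 1 - ss_res / ss_tot                         # 1 - 6/20 = 0.7

print(f'MAE: {mae}, MSE: {mse}, RMSE: {rmse:.3f}, R^2: {r2:.2f}')
```

MAE and RMSE are in the same units as the target (here, mpg), while MSE is in squared units, which is why RMSE is usually easier to interpret than MSE.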

# 1. Load the Data

The link to the data is already provided for you below.  You can just run the cell.

**Data Dictionary:**

**Attribute** | **Description**  
--- | ---
model | model of the car
price | price car last sold for
transmission | transmission type: Automatic or Manual
mileage | current mileage of the car
fuelType | fuel type the car runs on
tax | tax paid on car at last sale
mpg | miles per gallon of car (target)
engineSize | size of engine in litres

In [None]:
# Load Data
path = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vS2dIT3WEj2j4nSpai7K0wSCwFc_hQBYQR6Xf10VtnyI64EItM9SWxN1UFU_XhrkWdUp6ayrUOoJSgY/pub?output=csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,SLK,2005,5200,Automatic,63000,Petrol,325,32.1,1.8
1,S Class,2017,34948,Automatic,27000,Hybrid,20,61.4,2.1
2,SL CLASS,2016,49948,Automatic,6200,Petrol,555,28.0,5.5
3,G Class,2016,61948,Automatic,16000,Petrol,325,30.4,4.0
4,G Class,2016,73948,Automatic,4000,Petrol,325,30.1,4.0


In [None]:
df.shape

(13119, 9)

# 2. Clean the Data

### Check for duplicates

In [None]:
df.duplicated().sum()

259

There are 259 duplicate rows here, so we will drop them.

In [None]:
df = df.drop_duplicates()

df.duplicated().sum()

0
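
As a small illustration of what `duplicated()` and `drop_duplicates()` do, here is a sketch on a hypothetical three-row frame (the `toy` DataFrame is made up for this example).

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical
toy = pd.DataFrame({
    'model': ['A Class', 'C Class', 'A Class'],
    'mpg': [55.4, 47.9, 55.4],
})

n_dupes = toy.duplicated().sum()   # counts rows that repeat an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence by default

print(n_dupes)                # 1
print(deduped.shape)          # (2, 2)
print(list(deduped.index))    # [0, 1]
```

Note that `drop_duplicates()` keeps the original index labels rather than renumbering the rows, so after dropping rows the index can have gaps.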

Check out the datatypes and missing data.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12860 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         12860 non-null  object 
 1   year          12860 non-null  int64  
 2   price         12860 non-null  int64  
 3   transmission  12860 non-null  object 
 4   mileage       12860 non-null  int64  
 5   fuelType      12860 non-null  object 
 6   tax           12860 non-null  int64  
 7   mpg           12860 non-null  float64
 8   engineSize    12860 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 1004.7+ KB


There is no missing data, and all datatypes are as expected.

Let's inspect the unique values of the categorical columns.


In [None]:
print('unique models', df['model'].unique())
print('\n')
print('unique transmissions', df['transmission'].unique())
print('\n')
print('unique fuel types', df['fuelType'].unique())
print('\n')

unique models ['SLK' 'S Class' 'SL CLASS' 'G Class' 'GLE Class' 'GLA Class' 'A Class'
 'B Class' 'GLC Class' 'C Class' 'E Class' 'GL Class' 'CLS Class'
 'CLC Class' 'CLA Class' 'V Class' 'M Class' 'CL Class' 'GLS Class'
 'GLB Class' 'X-CLASS' '180' 'CLK' 'R Class' '230' '220' '200']


unique transmissions ['Automatic' 'Manual' 'Semi-Auto' 'Other']


unique fuel types ['Petrol' 'Hybrid' 'Diesel' 'Other']




When we one-hot encode this data there will be a column for each one of the above categories, with a value of 0 or 1 for each car.

# 3. Split the Data

We want to predict the mpg of the car, so that will be our `y` variable.  The rest of the columns are the features our model will use to make that prediction, so those are the `X` variable.

In [None]:
# split X and y
X = df.drop(columns=['mpg'])
y = df['mpg']

# split training and test
# set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
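
Since we did not pass `test_size`, `train_test_split` uses its default and holds out 25% of the rows for testing. The sketch below shows this on a made-up 8-row array (`X_toy` and `y_toy` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 rows, 2 features (an assumption for illustration)
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# With no test_size given, 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)   # (6, 2) (2, 2)
```

Applied to the 12,860 rows left after dropping duplicates, the same default should give 9,645 training rows and 3,215 test rows.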

# 4. Prepare the Data

### ColumnSelector

Remember that we only want to one-hot encode our categorical variables, but not our numeric ones.  We are going to use the OneHotEncoder transformer from sklearn, but that class cannot automatically decide which columns to encode.  After all, maybe we want to encode some integer columns because they are actually nominal categories.  

We could make a list of all of the categorical columns to encode.  That sounds like a lot of work, so let's let Python do it for us.

We can use sklearn's `make_column_selector` function to build a selector that returns the names of all columns of a particular dtype.  We will create one for the numeric columns and one for the categorical columns.

In [None]:
# instantiate the column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
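
A selector is a callable: when it is given a DataFrame it returns the list of matching column names. The sketch below calls two selectors directly on a hypothetical two-row frame to show this (the `toy` frame is made up; the string column is created with an explicit `object` dtype so that `dtype_include='object'` matches it).

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    'model': pd.Series(['SLK', 'G Class'], dtype=object),
    'mileage': [63000, 16000],
    'engineSize': [1.8, 4.0],
})

# Calling a selector on a DataFrame returns the matching column names
num_cols = make_column_selector(dtype_include='number')(toy)
cat_cols = make_column_selector(dtype_include='object')(toy)

print(num_cols)   # ['mileage', 'engineSize']
print(cat_cols)   # ['model']
```

We normally never call the selectors ourselves; the ColumnTransformer calls them when it is fit.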

### OneHotEncoder

sklearn includes a class for one-hot encoding nominal variables.  However, we ONLY want to use it on the categorical columns.  If we use it on numeric variables then we will end up with a separate column for each different value in that column (which would be a lot!) and the model will not consider that column as a numeric variable.

* We also want to make sure that OneHotEncoder returns a dense array, rather than a compressed kind of array called a 'sparse array', so we set `sparse=False`.  (In scikit-learn 1.2 and later this parameter is named `sparse_output`.)

* We also want it to ignore any categories in the test data that it didn't see in the training data, so we set `handle_unknown='ignore'`.  An unseen category is then encoded as all zeros.  Otherwise the encoder will raise an error if there is a category in the test set that it didn't see in the training set.  This could be a problem if we put this model into a production environment where it is making predictions on new data!
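
To see what `handle_unknown='ignore'` does, here is a minimal sketch with made-up fuel types. The `'Electric'` category is an assumption added for illustration and does not appear in the dataset. The sketch leaves the encoder's output sparse and converts it with `.toarray()`, so it does not depend on which name the dense-output parameter has in your scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'Electric' is a hypothetical category never seen during fit
train = pd.DataFrame({'fuelType': ['Petrol', 'Diesel', 'Hybrid']}, dtype=object)
new = pd.DataFrame({'fuelType': ['Diesel', 'Electric']}, dtype=object)

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# Convert the default sparse output to a dense array for printing
encoded = enc.transform(new).toarray()

print(enc.categories_[0])   # ['Diesel' 'Hybrid' 'Petrol']
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

The known category gets a single 1 in its column, while the unseen category becomes a row of zeros instead of raising an error.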

In [None]:
# instantiate the scaler and the encoder
scaler = StandardScaler()
# note: on scikit-learn 1.2+, use sparse_output=False instead of sparse=False
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

### Set Up Tuples

We will create one tuple pairing the standard scaler with the numeric column selector, and another pairing the one-hot encoder with the categorical column selector.

In [None]:
num_tuple = (scaler, num_selector)
cat_tuple = (encoder, cat_selector)

### ColumnTransformer

sklearn has another class that allows us to apply certain preprocessing steps, such as imputers, scalers, or encoders, to certain columns and not others.  This is much easier than splitting them up by hand and joining them again after processing.

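`make_column_transformer` takes one or more `(transformer, columns)` tuples.  The columns can be a list of column names, or a column selector like the ones we made above.

By default (`remainder='drop'`) it drops any columns that haven't been selected; setting `remainder='passthrough'` would keep them unchanged instead.  Here every column is either numeric or object, so both selectors together cover all of the columns and we can leave the remainder as `'drop'`.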

In [None]:
preprocessor = make_column_transformer(num_tuple, cat_tuple, remainder = 'drop')
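
The sketch below applies the same kind of preprocessor to a hypothetical four-row frame to show what comes out. The `toy` frame and `toy_preprocessor` name are made up for this example, and `sparse_threshold=0` is an extra argument (not used in the notebook) that asks the ColumnTransformer to always return a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: 1 categorical column (3 categories), 2 numeric columns
toy = pd.DataFrame({
    'fuelType': pd.Series(['Petrol', 'Diesel', 'Hybrid', 'Diesel'], dtype=object),
    'mileage': [63000, 27000, 6200, 16000],
    'engineSize': [1.8, 2.1, 5.5, 4.0],
})

toy_preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include='object')),
    remainder='drop',
    sparse_threshold=0,   # always return a dense array from the combined output
)

out = toy_preprocessor.fit_transform(toy)
print(out.shape)   # (4, 5): 2 scaled numeric columns + 3 one-hot columns
```

The output columns follow the order of the tuples, so the scaled numeric columns come first and the one-hot columns come after them.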

### Baseline Model
Define a baseline model with DummyRegressor using the 'mean' strategy.
Put your ColumnTransformer and the baseline model into a pipeline.
Fit your pipe on the training data.


In [None]:
# instantiate a baseline model
dummy_reg = DummyRegressor(strategy='mean')

# create model pipeline
dummy_pipe = make_pipeline(preprocessor, dummy_reg)

dummy_pipe.fit(X_train, y_train)

In [None]:
train_pred = dummy_pipe.predict(X_train)
test_pred = dummy_pipe.predict(X_test)

# Calculating Regression Metrics

Refer to [this lesson](https://login.codingdojo.com/m/213/7197/70300) for instructions on how to calculate regression metrics in Python.

The R^2 scores for the baseline model should be at or very close to 0.
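
Why close to 0? R^2 compares a model's squared error with the squared error of always predicting the mean of the target, and that is exactly what the baseline does. Here is a minimal sketch with made-up numbers (the arrays below are toy values, not the car data).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy data (an assumption for illustration); the features are ignored by the baseline
X_tr, y_tr = np.zeros((4, 1)), np.array([20.0, 30.0, 40.0, 50.0])
X_te, y_te = np.zeros((2, 1)), np.array([30.0, 50.0])

baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)

print(baseline.predict(X_te))   # [35. 35.]  (the training mean, for every row)

r2_train = r2_score(y_tr, baseline.predict(X_tr))
r2_test = r2_score(y_te, baseline.predict(X_te))
print(r2_train)   # 0.0
print(r2_test)    # -0.25
```

On the training data the score is 0. On the test data it can dip below 0 because the test mean (40 in this toy case) differs from the training mean (35). With thousands of rows the two means should be much closer, so the test score should sit very near 0.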



In [None]:
# find MAE, MSE, RMSE and R2 on the baseline model for both the train and test data


### Linear Regression Model with Pipeline
Instantiate a linear regression model. Put your ColumnTransformer and linear regression model into a pipeline. Fit your pipe on the training data.

In [None]:
# instantiate a linear regression model

# combine the preprocessor object and the linear regression model in a pipeline

# fit your pipe on the training data


In [None]:
# find MAE, MSE, RMSE and R2 on the linear regression model for both the train and test data
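
One possible way to approach these cells is sketched below on synthetic stand-in data, so that it runs on its own. Everything named `*_syn`, `prep`, and `scores` is made up for this example, and the numbers it prints say nothing about the car dataset. In the notebook you would use `preprocessor`, `X_train`, `y_train`, and `eval_regression` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the car dataset): the target is linear in the features
rng = np.random.default_rng(42)
n = 200
fuel = rng.choice(['Petrol', 'Diesel'], size=n)
engine = rng.uniform(1.0, 5.0, size=n)
target = 70 - 8 * engine + 5 * (fuel == 'Diesel') + rng.normal(0, 1, size=n)

X_syn = pd.DataFrame({'fuelType': pd.Series(fuel, dtype=object), 'engineSize': engine})
y_syn = pd.Series(target)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Same structure as the notebook's preprocessor: scale numbers, one-hot encode text
prep = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include='object')),
)

# Fit a baseline pipeline and a linear regression pipeline, then compare test R^2
scores = {}
for name, model in [('baseline', DummyRegressor(strategy='mean')),
                    ('linear', LinearRegression())]:
    pipe = make_pipeline(prep, model)
    pipe.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, pipe.predict(X_te))

print(scores)
```

Because the synthetic target really is linear, the linear model should score far above the baseline here. On real data the gap is usually smaller, which is what the discussion questions ask you to examine.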


# Discuss: What do these metrics tell you?
1. Which is better, the baseline or the linear regression?  How do you know?
2. Which is more important in this case, MAE or RMSE?
3. Is the linear regression model making a lot of large errors?  How can you tell?
4. How much of the variation in the target is your model able to explain?
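
A hint for question 3: RMSE squares the errors before averaging, so it grows faster than MAE when a few errors are large. The sketch below uses two made-up sets of errors with the same MAE to show the difference (`errors_even` and `errors_spiky` are illustrative names).

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same total absolute error
errors_even = np.array([2.0, 2.0, 2.0, 2.0])    # every prediction is off by 2
errors_spiky = np.array([0.0, 0.0, 0.0, 8.0])   # one prediction is off by 8

def mae(errors):
    """Mean absolute error of an array of residuals."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error of an array of residuals."""
    return np.sqrt(np.mean(errors ** 2))

print(mae(errors_even), rmse(errors_even))     # 2.0 2.0
print(mae(errors_spiky), rmse(errors_spiky))   # 2.0 4.0
```

When RMSE is much larger than MAE, a small number of large errors is likely driving it. When the two are close, the errors are more evenly sized.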

# Bonus:
If you finish early, test some other types of models on this data to see if they can perform better than the linear regression:

Some suggestions:
1. [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
2. [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
3. [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor)
4. Try one from the list of regression models near the bottom of [this article](https://www.educative.io/blog/scikit-learn-cheat-sheet-classification-regression-methods).
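
Swapping in a different model only changes the last step of the pipeline. The sketch below loops over two of the suggested regressors on synthetic numeric data. The data, the `candidates` dictionary, and the `results` dictionary are made up for this example. In the notebook you would put `preprocessor` in place of `StandardScaler()` and use the real train and test splits.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (NOT the car dataset)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 5, size=(300, 2))
y_syn = 10 * np.sin(X_syn[:, 0]) + X_syn[:, 1] ** 2 + rng.normal(0, 0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Candidate models to compare; each one becomes the final pipeline step
candidates = {
    'tree': DecisionTreeRegressor(random_state=42),
    'knn': KNeighborsRegressor(),
}

results = {}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, pipe.predict(X_te))

print(results)
```

Comparing test scores this way keeps the preprocessing identical across models, so any difference comes from the model itself.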