From Data to Model
In this notebook we will take a dataset, clean it, train a model, and make predictions.

Steps:
1. Import libraries
2. Load the dataset
3. Make a copy of the original dataset
4. Use the copy for all further processing
5. Check for inconsistencies, missing values, and duplicates
6. Split the dataset into train and test sets
7. Separate the categorical and numerical features
8. Check for missing values in the categorical and numerical features and handle them with a simple imputer
9. Fit the model and evaluate it

We want to predict the mpg of the car

Step 1: Import most of the libraries we will use

In [1]:
import pandas as pd  # handles dataframes
import numpy as np  # handles arrays
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config  # lets us display the pipeline as a diagram
set_config(display='diagram')

The eval_regression function evaluates the performance of the model

In [2]:
def eval_regression(true,predicted_values):
    ''' Takes true and predicted values (arrays) and prints MAE, MSE, RMSE, and R2'''
    # sklearn.metrics computes MAE, MSE, and R2 for us; numpy is only needed for the square root
    # takes 2 parameters because every metric compares the true values with the predicted values
    mae = mean_absolute_error(true, predicted_values)
    mse = mean_squared_error(true, predicted_values)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted_values)
    print(f'MAE {mae}, \n MSE {mse}, \n RMSE {rmse}, \n R^2 {r2}')
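To make the four metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The three true and predicted values are hypothetical toy numbers, chosen only so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import math

# Hypothetical toy values (not from the dataset)
true = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

n = len(true)
errors = [t - p for t, p in zip(true, predicted)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error

mean_true = sum(true) / n
ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in true)   # total sum of squares
r2 = 1 - ss_res / ss_tot                           # coefficient of determination

print(f'MAE {mae}, MSE {mse}, RMSE {rmse}, R^2 {r2}')
```

The sklearn functions used in eval_regression should give the same numbers for the same inputs.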

2. Load Data

In [3]:
path = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vS2dIT3WEj2j4nSpai7K0wSCwFc_hQBYQR6Xf10VtnyI64EItM9SWxN1UFU_XhrkWdUp6ayrUOoJSgY/pub?output=csv'
mercedes_df = pd.read_csv(path)
mercedes_df.info()
mercedes_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13119 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         13119 non-null  object 
 1   year          13119 non-null  int64  
 2   price         13119 non-null  int64  
 3   transmission  13119 non-null  object 
 4   mileage       13119 non-null  int64  
 5   fuelType      13119 non-null  object 
 6   tax           13119 non-null  int64  
 7   mpg           13119 non-null  float64
 8   engineSize    13119 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 922.6+ KB


      model  year  price  transmission  mileage  fuelType  tax   mpg  engineSize
0       SLK  2005   5200     Automatic    63000    Petrol  325  32.1         1.8
1   S Class  2017  34948     Automatic    27000    Hybrid   20  61.4         2.1
2  SL CLASS  2016  49948     Automatic     6200    Petrol  555  28.0         5.5
3   G Class  2016  61948     Automatic    16000    Petrol  325  30.4         4.0
4   G Class  2016  73948     Automatic     4000    Petrol  325  30.1         4.0


In [4]:
# making a copy of original dataset
explore_df = mercedes_df.copy()

In [5]:
explore_df.shape

(13119, 9)

13119 rows
9 columns (features)

checking for duplicates and dropping them

In [6]:

explore_df.duplicated().sum()

259

In [7]:
explore_df = explore_df.drop_duplicates()
explore_df.duplicated().sum()

0
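As a small illustration of what duplicated() and drop_duplicates() do, here is a sketch on a hypothetical three-row frame (toy_df is made up for this example). A row counts as a duplicate only if it repeats an earlier row in every column.

```python
import pandas as pd

# Hypothetical toy frame: row 2 repeats row 0 exactly
toy_df = pd.DataFrame({
    'model': ['A Class', 'C Class', 'A Class'],
    'year': [2016, 2018, 2016],
})

n_duplicates = toy_df.duplicated().sum()   # counts rows that repeat an earlier row
deduped_df = toy_df.drop_duplicates()      # keeps the first occurrence by default

print(n_duplicates)
print(deduped_df)
```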

checking for null values

In [8]:
explore_df.isna().sum()

model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64

There are no null values in this dataset

In [9]:
print('unique models', explore_df['model'].unique())
print('\n')
print('unique transmissions', explore_df['transmission'].unique())
print('\n')
print('unique fuel types', explore_df['fuelType'].unique())
print('\n')

unique models ['SLK' 'S Class' 'SL CLASS' 'G Class' 'GLE Class' 'GLA Class' 'A Class'
 'B Class' 'GLC Class' 'C Class' 'E Class' 'GL Class' 'CLS Class'
 'CLC Class' 'CLA Class' 'V Class' 'M Class' 'CL Class' 'GLS Class'
 'GLB Class' 'X-CLASS' '180' 'CLK' 'R Class' '230' '220' '200']


unique transmissions ['Automatic' 'Manual' 'Semi-Auto' 'Other']


unique fuel types ['Petrol' 'Hybrid' 'Diesel' 'Other']




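All categorical values seem consistent: no category appears under two different spellings, although capitalization varies between labels (for example 'SL CLASS' and 'X-CLASS').
When we one-hot encode this data there will be a column for each one of the above categories, holding a value of 0 or 1 for each car.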
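To show what the one-hot encoded columns look like, here is a minimal sketch on a hypothetical four-row frame. It uses pandas' get_dummies purely for illustration, which is a swapped-in shortcut and not the OneHotEncoder used later in this notebook.

```python
import pandas as pd

# Hypothetical toy column (values borrowed from the fuel types listed above)
toy_df = pd.DataFrame({'fuelType': ['Petrol', 'Diesel', 'Hybrid', 'Petrol']})

# One new 0/1 column per category; each row has exactly one 1
encoded = pd.get_dummies(toy_df, columns=['fuelType'], dtype=int)

print(encoded)
```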

3. Split the Data

In [10]:
# import libraries first
from sklearn.model_selection import train_test_split #used to split dataset
# make_column_selector picks columns by dtype (objects vs. numbers); it does not tell nominal
# from ordinal features, so ordinal ones would have to be encoded manually with the .replace method
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [11]:
# splitting the X and y features
target = 'mpg'
X = explore_df.drop(columns=[target])
y = explore_df[target]

# split training and test
# set random_state=42 for reproducibility (meaning we get the same split, and so the same results, every time we run the notebook)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
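The effect of random_state can be checked directly. The sketch below splits a hypothetical ten-row toy array twice with the same seed and confirms that both calls return identical pieces; X_toy and y_toy are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed
first = train_test_split(X_toy, y_toy, random_state=42)
second = train_test_split(X_toy, y_toy, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

print(same_split)
print(first[0].shape, first[1].shape)
```

Leaving random_state unset would generally give a different split on each run.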

4. Prepare the Data

Column Selector
We only want to one-hot encode our categorical variables, not our numeric ones. We are going to use the OneHotEncoder transformer from sklearn, but that class cannot automatically decide which columns to encode. We can use sklearn's make_column_selector to create a selector that returns the names of the columns of a particular dtype. We will use this to return the categorical columns.


In [12]:
# instantiate the columns selectors
num_selector = make_column_selector(dtype_include='number')
cate_selector = make_column_selector(dtype_include='object')
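A selector is a callable: given a dataframe, it returns the matching column names. Here is a minimal sketch on a hypothetical two-row frame (toy_df is an assumption for illustration; the string column is created with an explicit object dtype so the 'object' selector matches it).

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with one text column and two numeric columns
toy_df = pd.DataFrame({
    'model': pd.Series(['A Class', 'C Class'], dtype=object),
    'year': [2016, 2018],
    'engineSize': [1.6, 2.0],
})

# Calling a selector on a dataframe returns a list of column names
num_cols = make_column_selector(dtype_include='number')(toy_df)
cat_cols = make_column_selector(dtype_include='object')(toy_df)

print(num_cols)
print(cat_cols)
```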

OneHotEncoder
sklearn includes a class for one-hot encoding nominal variables. However, we ONLY want to use it on the categorical columns. If we use it on numeric variables then we will end up with a separate column for each different value in that column (which would be a lot!) and the model will not consider that column as a numeric variable.

We also want to make sure that OneHotEncoder returns a dense array, rather than a compressed kind of array called a 'sparse array', so we set sparse=False. (In scikit-learn 1.2 and later this parameter is named sparse_output, and the old name was removed in 1.4.)

We also want it to ignore any categories in the test data that it doesn't see in the training data, so we set handle_unknown='ignore'. Otherwise it will give us an error if there is a category in the test set that it didn't see in the training set. This could be a problem if we put this model into a production environment where it is making predictions on new data!

In [13]:
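# instantiate the scaler and the encoder
scaler = StandardScaler()
# note: scikit-learn >= 1.2 names this parameter sparse_output (sparse was removed in 1.4)
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')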
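The following sketch illustrates what handle_unknown='ignore' does when the test data contains a category that was absent during fitting. The arrays are hypothetical, and 'Electric' is an invented unseen category. To stay independent of the scikit-learn version, the sketch leaves the sparse setting at its default and converts the result with .toarray().

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'Electric' never appears in the training column
train_fuel = np.array([['Petrol'], ['Diesel'], ['Petrol']], dtype=object)
test_fuel = np.array([['Diesel'], ['Electric']], dtype=object)

toy_ohe = OneHotEncoder(handle_unknown='ignore')
toy_ohe.fit(train_fuel)

# Default output is a sparse matrix; convert to a dense array for printing
encoded_test = toy_ohe.transform(test_fuel).toarray()

print(toy_ohe.categories_)
print(encoded_test)
```

The unseen category is encoded as a row of all zeros instead of raising an error.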

Set Up Tuples. Why?
We will create one tuple pairing the standard scaler with the numeric column selector, and another pairing the one-hot encoder with the categorical column selector.
Why? Because make_column_transformer expects each transformer and its columns as a (transformer, columns) tuple.

In [14]:
num_tuple = (scaler, num_selector)
cate_tuple = (ohe, cate_selector)

We are not using a pipeline for preprocessing because we are only applying one transformer to each group of columns. There was no missing data, so we are just scaling and encoding the data in this dataset.

ColumnTransformer
sklearn has another class that allows us to apply certain preprocessing steps, such as imputers, scalers, or encoders, to certain columns and not others. This is much easier than splitting them up by hand and joining them again after processing.

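ColumnTransformer takes a list of tuples (make_column_transformer takes the same tuples as separate arguments). The columns can be a list of column names, or a selector object like the ones we made above.

By default it drops any columns that haven't been selected (remainder='drop'); setting remainder='passthrough' would keep them unchanged instead. Here the two selectors together cover every feature column, so nothing is dropped either way.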

In [15]:
preprocessor = make_column_transformer(num_tuple, cate_tuple, remainder='drop') # remainder='drop' is the default: columns not selected by either tuple are left out
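The difference between remainder='drop' and remainder='passthrough' only shows up when some column is not selected by any tuple. Here is a minimal sketch on a hypothetical two-column frame where only 'mileage' is scaled and 'tax' is left unselected (toy_df and both transformers are made up for illustration).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; only 'mileage' is given to a transformer
toy_df = pd.DataFrame({
    'mileage': [1000.0, 2000.0, 3000.0],
    'tax': [20.0, 145.0, 325.0],
})

drop_ct = make_column_transformer((StandardScaler(), ['mileage']), remainder='drop')
keep_ct = make_column_transformer((StandardScaler(), ['mileage']), remainder='passthrough')

dropped = drop_ct.fit_transform(toy_df)   # 'tax' is left out
kept = keep_ct.fit_transform(toy_df)      # 'tax' is appended unchanged

print(dropped.shape)
print(kept.shape)
```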

Baseline Model
**Define a baseline model with DummyRegressor using the 'mean' strategy. Put your ColumnTransformer and the baseline model into a pipeline.**
Fit your pipe onto the training data.

In [16]:
# import some more libraries 
from sklearn.linear_model import LinearRegression 
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor #baseline model for regression


In [17]:
# instantiate a baseline model
dummy_reg = DummyRegressor(strategy='mean')

# create model pipeline
dummy_pipe = make_pipeline(preprocessor, dummy_reg) # the preprocessor transforms the data first, then passes it to the model

# this is where the model starts learning 
dummy_pipe.fit(X_train, y_train)




In [19]:
train_predicts = dummy_pipe.predict(X_train)
test_predicts= dummy_pipe.predict(X_test)

In [20]:
train_predicts

array([55.1775324, 55.1775324, 55.1775324, ..., 55.1775324, 55.1775324,
       55.1775324])

In [21]:
test_predicts

array([55.1775324, 55.1775324, 55.1775324, ..., 55.1775324, 55.1775324,
       55.1775324])
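Every prediction is the same number because DummyRegressor with strategy='mean' ignores the features and always returns the mean of the training target. The sketch below shows this on hypothetical toy arrays (X_toy and y_toy are made up), along with the R^2 score that results on the data it was fitted on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Hypothetical toy data; the mean of y_toy is 45.0
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([30.0, 40.0, 50.0, 60.0])

toy_dummy = DummyRegressor(strategy='mean')
toy_dummy.fit(X_toy, y_toy)
toy_predictions = toy_dummy.predict(X_toy)

# Predicting the mean leaves residuals equal to the total variation, so R^2 is 0
toy_r2 = r2_score(y_toy, toy_predictions)

print(toy_predictions)
print(toy_r2)
```

This is why a mean baseline is a useful floor: any model worth keeping should score above it.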

Calculating Regression Metrics

In [22]:
# finding the MAE, MSE, RMSE, and r2 score on the baseline model for both train and test data
eval_regression(y_train, train_predicts)

MAE 10.82498517377261, 
 MSE 227.3778912670944, 
 RMSE 15.079054720608132, 
 R^2 0.0


In [23]:
eval_regression(y_test, test_predicts)

MAE 10.95010438530508, 
 MSE 239.66345129752673, 
 RMSE 15.481067511561557, 
 R^2 -2.6711749175678534e-05


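This is not a good model. Errors are high (an MAE of about 11 mpg) and the r2 score is essentially zero on both train and test, which is what we expect from a baseline that always predicts the training mean. Explore more...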