### Week 3: Explore more - 3

#### Compare models using LinearRegression vs. Ridge
#### Test different regularization values (alpha) with Ridge

## Download and Load data

In [1]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score, mean_squared_error
from sklearn.feature_extraction import DictVectorizer

In [3]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


## 2.2 Data Preparation

The column names do not follow a common format: some are lower case, some contain spaces, and some use underscores. We will standardize them, and apply the same normalization to the string (non-numerical) values stored in those columns.

In [4]:
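# Normalize column names: lower case, underscores instead of spaces
df.columns = df.columns.str.lower().str.replace(' ','_')

# Apply the same normalization to the values of every string column
obj_cols = list(df.dtypes[df.dtypes == 'object'].index)
for col in obj_cols:
    df[col] = df[col].str.lower().str.replace(' ','_')
df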

Unnamed: 0,make,model,year,engine_fuel_type,engine_hp,engine_cylinders,transmission_type,driven_wheels,number_of_doors,market_category,vehicle_size,vehicle_style,highway_mpg,city_mpg,popularity,msrp
0,bmw,1_series_m,2011,premium_unleaded_(required),335.0,6.0,manual,rear_wheel_drive,2.0,"factory_tuner,luxury,high-performance",compact,coupe,26,19,3916,46135
1,bmw,1_series,2011,premium_unleaded_(required),300.0,6.0,manual,rear_wheel_drive,2.0,"luxury,performance",compact,convertible,28,19,3916,40650
2,bmw,1_series,2011,premium_unleaded_(required),300.0,6.0,manual,rear_wheel_drive,2.0,"luxury,high-performance",compact,coupe,28,20,3916,36350
3,bmw,1_series,2011,premium_unleaded_(required),230.0,6.0,manual,rear_wheel_drive,2.0,"luxury,performance",compact,coupe,28,18,3916,29450
4,bmw,1_series,2011,premium_unleaded_(required),230.0,6.0,manual,rear_wheel_drive,2.0,luxury,compact,convertible,28,18,3916,34500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11909,acura,zdx,2012,premium_unleaded_(required),300.0,6.0,automatic,all_wheel_drive,4.0,"crossover,hatchback,luxury",midsize,4dr_hatchback,23,16,204,46120
11910,acura,zdx,2012,premium_unleaded_(required),300.0,6.0,automatic,all_wheel_drive,4.0,"crossover,hatchback,luxury",midsize,4dr_hatchback,23,16,204,56670
11911,acura,zdx,2012,premium_unleaded_(required),300.0,6.0,automatic,all_wheel_drive,4.0,"crossover,hatchback,luxury",midsize,4dr_hatchback,23,16,204,50620
11912,acura,zdx,2013,premium_unleaded_(recommended),300.0,6.0,automatic,all_wheel_drive,4.0,"crossover,hatchback,luxury",midsize,4dr_hatchback,23,16,204,50920


A linear regression model cannot be fit on data that contains missing values: the underlying calculations fail on NaN. So we first check which columns have missing values (we will see later how to handle them).

In [5]:
df.isnull().sum()

make                    0
model                   0
year                    0
engine_fuel_type        3
engine_hp              69
engine_cylinders       30
transmission_type       0
driven_wheels           0
number_of_doors         6
market_category      3742
vehicle_size            0
vehicle_style           0
highway_mpg             0
city_mpg                0
popularity              0
msrp                    0
dtype: int64
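
To illustrate why the missing values matter, here is a minimal sketch on a made-up feature matrix (the arrays `X` and `y` are illustrative and not part of the car dataset). It assumes scikit-learn's input validation, which rejects NaN for `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing value (np.nan).
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values.
    fit_failed = True
    print("fit failed:", err)
```

This is why the missing entries counted above are filled in before the model is trained.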

In [6]:
df['age'] = df['year'].max() - df['year']

In [7]:
df['msrp'] = np.log1p(df['msrp'])
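
Cell 7 replaces `msrp` with `log1p(msrp)`, so every prediction and error metric later in the notebook is on the log scale. A small sketch of the transform and its inverse, `np.expm1`, follows. Two of the prices are taken from the table above; the smallest and largest values are made up to show how a long tail is compressed.

```python
import numpy as np

prices = np.array([2_000.0, 29_450.0, 46_135.0, 2_000_000.0])

log_prices = np.log1p(prices)      # log(1 + price), compresses the long tail
restored = np.expm1(log_prices)    # inverse: exp(x) - 1

print(log_prices.round(3))
print(np.allclose(restored, prices))
```

To report a predicted price in dollars, a log-scale prediction would be passed through `np.expm1` in the same way.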

## Setting up the validation framework

To judge a model we must measure its predictions on data it was not trained on. If we train and evaluate on the same rows, the score looks good only because the model has already seen that data. We therefore split the dataset into three parts:

* training set (60%): used to fit the model
* validation set (20%): unseen data used to evaluate the model and fine-tune it, repeating until the model is good enough
* test set (20%): another unseen set, used at the end to test the fine-tuned model

In [8]:
df_full_train, df_test = train_test_split(df,test_size=0.2,shuffle=True,random_state=42)
df_train,df_val = train_test_split(df_full_train,test_size=0.25,shuffle=True,random_state=42)
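
The second call uses `test_size=0.25` because it operates on the 80% of rows left after the first split, and 0.25 × 0.80 = 0.20 of the original data. A quick check on a stand-in array of 100 rows (the `rows` array is illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in for 100 data rows

full_train, test = train_test_split(rows, test_size=0.2, shuffle=True, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, shuffle=True, random_state=42)

print(len(train), len(val), len(test))  # 60 20 20
```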

In [9]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Displayed only: the result is not assigned, so df_full_train keeps its original index
df_full_train.reset_index(drop=True)

Unnamed: 0,make,model,year,engine_fuel_type,engine_hp,engine_cylinders,transmission_type,driven_wheels,number_of_doors,market_category,vehicle_size,vehicle_style,highway_mpg,city_mpg,popularity,msrp,age
0,cadillac,ct6,2016,premium_unleaded_(recommended),265.0,4.0,automatic,rear_wheel_drive,4.0,luxury,large,sedan,31,22,1624,10.887362,1
1,mercedes-benz,gls-class,2017,premium_unleaded_(required),449.0,8.0,automatic,all_wheel_drive,4.0,"crossover,luxury,performance",large,4dr_suv,18,14,617,11.449464,0
2,kia,forte,2016,regular_unleaded,173.0,4.0,automatic,front_wheel_drive,2.0,,compact,coupe,34,25,1720,9.898023,1
3,dodge,ram_250,1993,regular_unleaded,180.0,6.0,manual,rear_wheel_drive,2.0,,large,regular_cab_pickup,16,11,1851,7.601402,24
4,hyundai,tiburon,2008,regular_unleaded,172.0,6.0,automatic,front_wheel_drive,2.0,hatchback,compact,2dr_hatchback,24,17,1439,9.965100,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9526,toyota,venza,2014,regular_unleaded,181.0,4.0,automatic,front_wheel_drive,4.0,crossover,midsize,wagon,26,20,2031,10.238208,3
9527,pontiac,g6,2009,flex-fuel_(unleaded/e85),219.0,6.0,automatic,front_wheel_drive,4.0,flex_fuel,midsize,sedan,26,17,210,10.115004,8
9528,volkswagen,golf_gti,2016,premium_unleaded_(recommended),220.0,4.0,automated_manual,front_wheel_drive,2.0,"hatchback,performance",compact,2dr_hatchback,33,25,873,10.225245,1
9529,saab,9-5,2009,premium_unleaded_(recommended),260.0,4.0,automatic,front_wheel_drive,4.0,"luxury,performance",midsize,wagon,27,17,376,10.675238,8


In [10]:
y_train = df_train['msrp']
y_val = df_val['msrp']
y_test = df_test['msrp']

y_full_train = df_full_train['msrp']

In [11]:
del df_train['msrp']
del df_val['msrp']
del df_test['msrp']

del df_full_train['msrp']

In [12]:
all_features = ['make', 'model', 'age', 'year', 'engine_fuel_type', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'driven_wheels',
       'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'popularity']

categorical_cols = ['make', 'model', 'engine_fuel_type', 'transmission_type', 'driven_wheels',
       'market_category', 'vehicle_size', 'vehicle_style']

numerical_cols = ['age', 'year', 'engine_hp','engine_cylinders','number_of_doors','highway_mpg','city_mpg','popularity']

Checking for missing values

In [13]:
df.isnull().sum()

make                    0
model                   0
year                    0
engine_fuel_type        3
engine_hp              69
engine_cylinders       30
transmission_type       0
driven_wheels           0
number_of_doors         6
market_category      3742
vehicle_size            0
vehicle_style           0
highway_mpg             0
city_mpg                0
popularity              0
msrp                    0
age                     0
dtype: int64

#### Filling missing values

* For `engine_fuel_type`, fill with the mode.
* For `engine_hp`, fill with the median.
* For `engine_cylinders`, fill with the mode.
* For `number_of_doors`, fill with the mode.
* For `market_category`, fill with the mode. A column indicating the missing value could also be added (the code in this notebook does not add one).
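
The fill values in this notebook are hard-coded. As a hedged sketch of how such values could be derived instead, the snippet below computes the mode and median from a hypothetical miniature training split (`toy_train` and its contents are made up) and also adds the missing-value indicator for `market_category`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature training split with missing values.
toy_train = pd.DataFrame({
    "engine_fuel_type": ["regular_unleaded", "regular_unleaded", None, "diesel"],
    "engine_hp": [100.0, np.nan, 300.0, 200.0],
    "market_category": ["luxury", None, "crossover", "crossover"],
})

# Flag rows where market_category is missing, before filling it.
toy_train["market_category_missing"] = toy_train["market_category"].isnull().astype(int)

# Derive the fill values from the training split only.
toy_impute = {
    "engine_fuel_type": toy_train["engine_fuel_type"].mode()[0],
    "engine_hp": toy_train["engine_hp"].median(),
    "market_category": toy_train["market_category"].mode()[0],
}

toy_filled = toy_train.fillna(toy_impute)
print(toy_impute)
```

Computing the statistics on the training split alone keeps information from the validation and test rows out of the preprocessing.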

In [14]:
impute_values = {
    'engine_fuel_type': 'regular_unleaded',
    'engine_hp': 225.0,
    'engine_cylinders': 4.0,
    'number_of_doors': 4.0,
    'market_category': "crossover",
}

In [15]:
for feature,missing_value in impute_values.items():
    
    df_val[feature] = df_val[feature].fillna(missing_value)
    df_test[feature] = df_test[feature].fillna(missing_value)
    df_train[feature] = df_train[feature].fillna(missing_value)
    
    df_full_train[feature] = df_full_train[feature].fillna(missing_value)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_full_train[feature] = df_full_train[feature].fillna(missing_value)
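
The warning above is raised for `df_full_train` only, the one subset that is still the direct result of `train_test_split`. One common way to avoid it, sketched here on a made-up frame (`toy` is illustrative; 225.0 is the fill value from cell 14), is to take an explicit copy before assigning columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"engine_hp": [100.0, np.nan, 300.0, 200.0, np.nan],
                    "msrp": [1.0, 2.0, 3.0, 4.0, 5.0]})

toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# An explicit copy makes the subset an independent DataFrame,
# so column assignment no longer targets a possible view of `toy`.
toy_full_train = toy_full_train.copy()
toy_full_train["engine_hp"] = toy_full_train["engine_hp"].fillna(225.0)

print(toy_full_train["engine_hp"].isnull().sum())  # 0
```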


In [16]:
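# For each categorical column, keep its 10 most frequent values
# (computed on the full dataset `df`)
categories = {}

for c in categorical_cols:
    categories[c] = list(df[c].value_counts().head(10).index)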

In [17]:
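def prepare_X(df):
    df = df.copy()
    # Copy the list so that appending does not mutate the global `features`
    new_features = list(features)

    # Binary indicator columns for the common door counts
    for v in [2, 3, 4]:
        df[f'num_doors_{v}'] = (df['number_of_doors'] == v).astype(int)
        new_features.append(f'num_doors_{v}')

    # One-hot encode the most frequent values of each categorical column
    for c, values in categories.items():
        for v in values:
            df[f'{c}_{v}'] = (df[c] == v).astype(int)
            new_features.append(f'{c}_{v}')

    # Keep the numeric and indicator columns; drop the raw categorical ones
    return df[new_features].drop(columns=categorical_cols)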
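
`DictVectorizer` is imported in cell 2 but not used. As an alternative to hand-built indicator columns, it could do the one-hot encoding; this is a different technique from `prepare_X`, sketched here on hypothetical records. Unlike `prepare_X`, it encodes every category value it sees, not only the 10 most frequent.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; in the notebook these could come from
# df_train[features].to_dict(orient="records").
records = [
    {"make": "bmw", "engine_hp": 300.0},
    {"make": "audi", "engine_hp": 200.0},
]

dv = DictVectorizer(sparse=False)
X_toy = dv.fit_transform(records)

print(dv.feature_names_)  # ['engine_hp', 'make=audi', 'make=bmw']
print(X_toy)
```

Numeric values pass through unchanged, while each string value becomes its own 0/1 column.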

In [18]:
features = numerical_cols + categorical_cols
new_X_train = prepare_X(df_train[features])

In [19]:
features = numerical_cols + categorical_cols
new_X_val = prepare_X(df_val[features])

In [20]:
features = numerical_cols + categorical_cols
new_X_test = prepare_X(df_test[features])
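
Cells 18 to 20 rebuild `features` before every `prepare_X` call. That guards against a common Python pitfall, sketched below with hypothetical helper names: binding a list to a second name does not copy it, so appending through the alias also changes the original list.

```python
def append_alias(base):
    names = base            # alias: same list object as `base`
    names.append("extra")
    return names

def append_copy(base):
    names = list(base)      # shallow copy: `base` is left untouched
    names.append("extra")
    return names

cols = ["age", "year"]
append_alias(cols)
print(cols)   # ['age', 'year', 'extra']

cols = ["age", "year"]
append_copy(cols)
print(cols)   # ['age', 'year']
```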

In [21]:
X_train = new_X_train.values
X_val = new_X_val.values
X_test = new_X_test.values

In [22]:
model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_val)
(y_pred - y_val).mean()

0.006599843289089046

In [23]:
# np.sqrt of the MSE is the RMSE; the `squared` argument was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_val,y_pred))
rmse

0.4417302461721654
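
Cells 22 and 23 report two different quantities. The mean error lets positive and negative errors cancel, so it measures bias, while the RMSE does not. A minimal NumPy sketch on made-up numbers (the helper name `rmse_manual` and the toy arrays are illustrative):

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_pred - y_true)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_true_toy = np.array([10.0, 11.0, 9.0])
y_pred_toy = np.array([10.5, 10.0, 9.5])

mean_error_toy = float(np.mean(y_pred_toy - y_true_toy))  # errors cancel out
rmse_toy = rmse_manual(y_true_toy, y_pred_toy)            # errors do not cancel

print(mean_error_toy, rmse_toy)
```

Here the errors are 0.5, -1.0 and 0.5, so the mean error is zero even though the RMSE is about 0.71.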

In [24]:
for r in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]:
    model_ridge = Ridge(alpha=r)
    model_ridge.fit(X_train,y_train)
    
    y_pred_ridge = model_ridge.predict(X_val)
    mean_error = (y_pred_ridge - y_val).mean()
    
    rmse = np.sqrt(mean_squared_error(y_val,y_pred_ridge))
    print(r,mean_error,rmse)
    print('\n')

1e-05 0.006597803204595513 0.4417232904751806


0.0001 0.006597812286299296 0.4417232826696619


0.001 0.006597903088900081 0.4417232047260443


0.01 0.006598809691459433 0.4417224363575998


0.1 0.006607742959085255 0.44171577718722427


1 0.0066874258385572986 0.4417092034064096


10 0.007090992320034615 0.442670420176717


100 0.007861380892662579 0.45137632223229585




#### Intermediate observations

The RMSE scores of LinearRegression (0.44173) and Ridge are very similar. Varying alpha also makes little difference for Ridge: the RMSE stays near 0.4417 for alpha up to 1 and only gets worse at alpha=10 (0.4427) and alpha=100 (0.4514). alpha=1 gives the marginally better result (0.44171).
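
To see what alpha does, here is a minimal NumPy sketch of the ridge solution in closed form, `w = (XᵀX + alpha·I)⁻¹ Xᵀy`, on synthetic data. It is a simplification: there is no intercept term, whereas scikit-learn's `Ridge` fits one by default, and the function name and data are made up for illustration.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_weights(X_toy, y_toy, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(alpha, round(norms[-1], 4))
```

The weight norm shrinks as alpha grows. When alpha is small relative to the entries of `XᵀX`, the solution is almost the same as ordinary least squares, which may be why the small alpha values above barely change the RMSE.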

In [25]:
features = numerical_cols + categorical_cols
new_X_full_train = prepare_X(df_full_train[features])

In [26]:
X_full_train = new_X_full_train.values

In [27]:
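#Train using LinearRegression (note: this fits on the training split only, not train+val)
model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_val)
mean_error = (y_pred - y_val).mean()

# np.sqrt of the MSE is the RMSE; the `squared` argument was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_val,y_pred))

print('LinearRegression', mean_error,rmse)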

LinearRegression 0.006599843289089046 0.4417302461721654


In [28]:
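#Train with train+val using Ridge
#Note: X_val is part of the training data here, so this validation score is optimistic
model_ridge = Ridge(alpha=1)
model_ridge.fit(X_full_train,y_full_train)

y_pred_ridge = model_ridge.predict(X_val)
mean_error = (y_pred_ridge - y_val).mean()

# np.sqrt of the MSE is the RMSE; the `squared` argument was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_val,y_pred_ridge))

print('RidgeRegression', mean_error,rmse)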

RidgeRegression 0.004667411563401679 0.43789429368719873


### Final observations

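The RMSE for Ridge (0.4379) is slightly lower than for LinearRegression (0.4417). Note, however, that the two scores are not directly comparable: the Ridge model was fit on train+val and scored on validation rows it had already seen, whereas the LinearRegression model in cell 27 was fit on the training split only. The improvement therefore cannot be attributed to regularization alone.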
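
For a like-for-like comparison, both models would be fit on the same rows and scored on rows neither has seen, for example fit on train+val and scored on the test split, which this notebook prepares but never scores. The sketch below shows the pattern on synthetic stand-in data (all arrays are made up, so the printed numbers say nothing about the car dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic stand-ins for X_full_train / y_full_train and X_test / y_test.
true_w = np.array([1.0, -2.0, 0.5, 0.0])
X_fit = rng.normal(size=(200, 4))
y_fit = X_fit @ true_w + rng.normal(scale=0.3, size=200)
X_holdout = rng.normal(size=(50, 4))
y_holdout = X_holdout @ true_w + rng.normal(scale=0.3, size=50)

results = {}
for name, estimator in [("LinearRegression", LinearRegression()),
                        ("Ridge(alpha=1)", Ridge(alpha=1))]:
    estimator.fit(X_fit, y_fit)              # same training data for both
    y_hat = estimator.predict(X_holdout)     # same unseen data for both
    results[name] = float(np.sqrt(mean_squared_error(y_holdout, y_hat)))

print(results)
```

In the notebook, the stand-ins would be replaced by `X_full_train`, `y_full_train`, `X_test` and `y_test`.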