
# Machine Learning Zoomcamp - Homework 4 - Holley St. Germain

homework: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/04-evaluation/homework.md

notebook on github: https://github.com/holleyst/mlzoomcamp2023/blob/main/mlzc_hw4.ipynb

notebook on colab: https://colab.research.google.com/github/holleyst/mlzoomcamp2023/blob/main/mlzc_hw4.ipynb


In [19]:
# import libraries and such
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


## Dataset
car price dataset

https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

## Features
Use the following columns:
*   Make
*   Model
*   Year
*   Engine HP
*   Engine Cylinders
*   Transmission Type
*   Vehicle Style
*   highway MPG
*   city mpg
*   MSRP

## Data Preparation
*   Lowercase the column names and replace spaces with underscores
*   Fill the missing values with 0
*   Make the price binary (1 if above the average, 0 otherwise) - this will be our target variable above_average


Split the data into 3 parts (train/validation/test) with a 60%/20%/20% distribution. Use the `train_test_split` function with `random_state=1`.



In [27]:
dataurl = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'
hwcols = ['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 'MSRP']

df = pd.read_csv(dataurl, usecols=hwcols)

df.columns = df.columns.str.replace(' ', '_').str.lower()
df = df.fillna(0)

# make price binary
dfbin = df.copy()
mean_price = dfbin.msrp.mean()
# strictly greater than the mean, matching "1 if above the average"
dfbin['above_average'] = np.where(dfbin.msrp > mean_price, 1, 0)

# drop price from dataset
dfbin = dfbin.drop('msrp', axis=1)

# split and process data
# df_train_full: 80%, df_test: 20%
df_train_full, df_test = train_test_split(dfbin, test_size=0.2, random_state=1)
# df_train: 75% of 80% = 60%, df_val: 25% of 80% = 20%
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

del df_train['above_average']
del df_val['above_average']
del df_test['above_average']

### Q1: ROC AUC feature importance

ROC AUC can also be used to evaluate the feature importance of numerical variables.

  *   For each numerical variable, use it as the score and compute the AUC against the above_average variable
  *   Use the training dataset for that

If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. -df_train['engine_hp'])

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating the variable: the negative correlation then becomes positive.
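
The effect of negation can be checked on a small synthetic example. The sketch below is illustrative only: the data is randomly generated (an assumption, not the car price dataset) so that the feature tends to be smaller when the target is 1. Negating the score reverses the ranking, so the two AUC values sum to 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data (assumption): the feature tends to be smaller when y == 1
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
feature = rng.normal(loc=-1.0 * y, scale=1.0)  # negatively related to y

auc_raw = roc_auc_score(y, feature)       # below 0.5: ranking is inverted
auc_negated = roc_auc_score(y, -feature)  # above 0.5 after negation

print(round(auc_raw, 3), round(auc_negated, 3))
```

This is why the loop in the cell below recomputes the score with `-df_train[col]` whenever the first AUC comes out under 0.5.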

Which numerical variable (among the following 4) has the highest AUC?

  *   engine_hp
  *   engine_cylinders
  *   highway_mpg
  *   city_mpg

**A1: engine_hp**


In [28]:
numcols = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

# calculate ROC AUC
for col in numcols:
    auc = roc_auc_score(y_train, df_train[col])
    if auc < 0.5:
        auc = roc_auc_score(y_train, -df_train[col])
    print(col, round(auc, 3))

engine_hp 0.917
engine_cylinders 0.766
highway_mpg 0.633
city_mpg 0.673


### Q2: Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:

`LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`

What's the AUC of this model on the validation dataset? (round to 3 digits)
  *   0.678
  *   0.779
  *   0.878
  *   0.979

**A2: 0.979** (the closest option; the model below scores 0.980)

In [34]:
train_dicts = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

val_dicts = df_val[df_train.columns].to_dict(orient='records')
X_val = dv.transform(val_dicts)

y_pred = model.predict_proba(X_val)[:, 1]

print('%.3f' % roc_auc_score(y_val, y_pred))

0.980
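
To make the one-hot-encoding step concrete, here is a minimal sketch of what `DictVectorizer` does with mixed records. The two records are hypothetical toy values, not rows from the car price dataset: string values become one indicator column per category, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records (assumption), not rows from the car price dataset
records = [
    {'make': 'bmw', 'engine_hp': 300.0},
    {'make': 'audi', 'engine_hp': 200.0},
]

toy_dv = DictVectorizer(sparse=False)
X_toy = toy_dv.fit_transform(records)

# Column names: the numeric feature keeps its name,
# each category of a string feature gets its own "name=value" column
print(toy_dv.feature_names_)
print(X_toy)
```

Categories that appear only at transform time (for example a make present in validation but not in training) are silently ignored, which is why the vectorizer is fitted on the training records only.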


### Q3: Precision and Recall

Now let's compute precision and recall for our model:
  *   Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
  *   For each threshold, compute precision and recall
  *   Plot them

At which threshold do the precision and recall curves intersect?
  *   0.28
  *   0.48
  *   0.68
  *   0.88

**A3:**
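
One possible way to run the threshold sweep is sketched below. It is not the notebook's answer: the labels and scores are synthetic stand-ins for `y_val` and `y_pred`, the helper name `precision_recall_by_threshold` is hypothetical, and treating precision as 1.0 when nothing is predicted positive is an assumed convention.

```python
import numpy as np

def precision_recall_by_threshold(y_true, y_score, thresholds):
    """Return (precision, recall) arrays, one entry per threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    precisions, recalls = [], []
    for t in thresholds:
        predicted_positive = y_score >= t
        tp = np.sum(predicted_positive & (y_true == 1))
        fp = np.sum(predicted_positive & (y_true == 0))
        fn = np.sum(~predicted_positive & (y_true == 1))
        # Convention (assumption): precision is 1.0 when nothing is predicted positive
        precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 1.0)
        recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)
    return np.array(precisions), np.array(recalls)

# Synthetic stand-ins for y_val and y_pred (assumption, not the notebook's data)
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=500), 0, 1)

# Thresholds from 0.0 to 1.0 with step 0.01
thresholds = np.linspace(0.0, 1.0, 101)
precision, recall = precision_recall_by_threshold(y_true, y_score, thresholds)

# Threshold where the two curves are closest, i.e. the approximate intersection
closest = thresholds[np.argmin(np.abs(precision - recall))]
print(round(closest, 2))
```

In the notebook, the same function could be called with `y_val` and `y_pred`, and the two arrays plotted against `thresholds` with `plt.plot` to read the intersection off the chart.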