# Machine Learning Evaluation

Evaluating a machine learning model is a critical step in checking its performance and reliability when making predictions or classifications. The evaluation process shows how well the model generalizes to unseen data and whether it meets the desired objectives.

In [None]:
# download the data into ../data, where the read_csv call below expects it
!wget -P ../data https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

In [34]:
# import all the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, classification_report, mutual_info_score, mean_squared_error, roc_auc_score, precision_recall_curve, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer

## Prepare the data and Exploratory Data Analysis (EDA):

We'll work with the MSRP variable, and we'll transform it to a classification task.

For the rest of the homework, you'll need to use only these columns:

- Make
- Model
- Year
- Engine HP
- Engine Cylinders
- Transmission Type
- Vehicle Style
- highway MPG
- city mpg
- MSRP

In [2]:
features = [
    'Make', 'Model','Year','Engine HP','Engine Cylinders','Transmission Type',
    'Vehicle Style','highway MPG','city mpg','MSRP'
]

df = pd.read_csv('../data/data.csv', iterator=False, usecols=features)
df.head()

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle Style | highway MPG | city mpg | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 |


### Data Preparation

In [3]:
# transform the column names to lower case and replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_').str.lower()

# fill the missing values with 0
df.fillna(0, inplace=True)

# rename msrp to price
df.rename(columns={'msrp': 'price'}, inplace=True)

df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 |


#### Make price binary

In [5]:
# binarize the price: compute the median price, then add a column
# above_average that is 1 when price > median and 0 otherwise
price_median = df['price'].median()
df['above_average'] = (df['price'] > price_median).astype(int)
df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price | above_average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 | 1 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 | 1 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 | 1 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 | 0 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 | 1 |


#### Split the data
- Split your data into train/val/test sets with a 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 1.
- Make sure that the price column, from which the target above_average is derived, is not in your dataframe.
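The two-step split described above can be checked on a toy frame. This is a minimal sketch, assuming a made-up 100-row DataFrame in place of the car data; the seed value does not affect the resulting sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the car data (values are made up)
toy = pd.DataFrame({"x": np.arange(100), "above_average": np.arange(100) % 2})

# First hold out 20% for test, then take 25% of the remaining 80% for validation
full_train, test = train_test_split(toy, test_size=0.2, random_state=1)
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

# The three parts are disjoint and together cover every row
print(len(train), len(val), len(test))
```

Taking 25% of the remaining 80% is what yields a validation set equal to 20% of the original data.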

In [8]:
# split the data in train/val/test sets, with 60%/20%/20% distribution with seed 1
# .2 splits the data into 80% train and 20% test
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
#.25 splits the 80% train into 60% train and 20% val
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# reset the indexes of the dataframes
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# take the binary target (above_average) from the train/val/test sets
y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

# delete the price column, which would leak the target; above_average stays
# because later cells read it, and the feature lists below do not include it
del df_train['price']
del df_val['price']
del df_test['price']

print('train data length: ', len(df_train), 'target values length: ', len(y_train))


train data length:  7148 target values length:  7148


### Question 1: ROC AUC feature importance

ROC AUC (area under the ROC curve) can also be used to evaluate the feature importance of numerical variables.

Let's do that:

- For each numerical variable, use it as score and compute AUC with the above_average variable
- Use the training dataset for that

If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. -df_train['engine_hp']).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. Negating the variable reverses the direction of the correlation, so a negative correlation becomes positive.
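The effect of negating a score can be checked on a tiny example. The sketch below uses made-up labels and scores (not the car data) and relies only on scikit-learn's roc_auc_score: negating the score flips every pairwise ranking, so the two AUC values sum to 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and a score that tends to be lower for the positive class
y = np.array([0, 0, 0, 1, 1, 1])
score = np.array([30, 25, 18, 20, 15, 12])

auc = roc_auc_score(y, score)           # below 0.5: the score ranks positives low
auc_negated = roc_auc_score(y, -score)  # the same information with the ranking reversed

print(auc, auc_negated, auc + auc_negated)
```

An AUC far below 0.5 is therefore as informative as one equally far above it; only the direction differs.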

Which numerical variable (among the following 4) has the highest AUC?

- engine_hp
- engine_cylinders
- highway_mpg
- city_mpg

In [25]:
# define the numerical features
numerical_features = [
    'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg'
]   

# get the categorical features
categorical_features = [
    'make', 'model', 'transmission_type', 'vehicle_style'
]

all_features = numerical_features + categorical_features

In [27]:
# for each numerical feature, use it as score and compute AUC with the above_average variable
auc_scores = {}
for feature in numerical_features:
    # compute AUC for the current numerical feature
    auc = roc_auc_score(df_train['above_average'], df_train[feature])
    auc_scores[feature] = auc
    print(feature, auc)

# get the value and label of the max auc score
max_auc = max(auc_scores.values())
max_auc_label = max(auc_scores, key=auc_scores.get)
print(f'max auc score: { max_auc_label} with {max_auc}')


year 0.7355219698158973
engine_hp 0.9185032753403845
engine_cylinders 0.7359391338287677
highway_mpg 0.377306802286729
city_mpg 0.34417514781409736
max auc score: engine_hp with 0.9185032753403845


### Question 2 - Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:
```python
LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
```
What's the AUC of this model on the validation dataset? (round to 3 digits)

- 0.678
- 0.779
- 0.878
- 0.979

In [32]:
# one-hot encode the training data with DictVectorizer
# fit_transform() learns the feature mapping and then transforms the records

dv = DictVectorizer(sparse=False)

train_dict = df_train[all_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# train the logistic regression model with these parameters:
# solver='liblinear', C=1.0, max_iter=1000
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)


In [37]:
# what is the AUC of this model on the validation dataset? (round to 3 digits)
val_dict = df_val[all_features].to_dict(orient='records')
X_val = dv.transform(val_dict)
# The [:, 1] notation selects the second column of the result, which contains the probabilities for the positive class.
y_pred = model.predict_proba(X_val)[:, 1]
auc = round(roc_auc_score(y_val, y_pred), 3)
print(f'AUC score: {auc}')

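Note: if y_val holds raw prices rather than the binary above_average labels, roc_auc_score fails with `ValueError: multi_class must be in ('ovo', 'ovr')`, because it treats the target as multiclass.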

### Question 3 - Precision and Recall

Now let's compute precision and recall for our model.

- Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them

At which threshold do the precision and recall curves intersect?

- 0.28
- 0.48
- 0.68
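The threshold sweep described above can be sketched with plain NumPy. The labels and scores below are made up, the helper precision_recall_at is hypothetical, and treating precision as 1.0 when nothing is predicted positive is a convention assumed here, not something the exercise specifies.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = y_score >= threshold
    actual = y_true == 1
    tp = np.sum(predicted & actual)
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    # Assumed convention: precision is 1.0 when there are no positive predictions
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.20, 0.35, 0.45, 0.55, 0.70, 0.80, 0.95])

# Thresholds from 0.0 to 1.0 with step 0.01
thresholds = np.linspace(0.0, 1.0, 101)
results = np.array([precision_recall_at(y_true, y_score, t) for t in thresholds])
precisions, recalls = results[:, 0], results[:, 1]

# First threshold at which the two curves are closest to each other
closest = thresholds[np.argmin(np.abs(precisions - recalls))]
print(closest)
```

Plotting precisions and recalls against thresholds on the same axes makes the crossing point visible.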

In [18]:
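# Compute precision and recall at every distinct threshold in the predictions
precision, recall, thresholds = precision_recall_curve(y_val, y_pred)

# Plot both metrics against the threshold so their intersection is visible;
# precision and recall have one more element than thresholds, so drop the last
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs. Threshold')
plt.legend()
plt.grid(True)
plt.show()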



