# Supervised Learning Model Evaluation Lab

Complete the exercises below to solidify your knowledge and understanding of supervised learning model evaluation.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Regression Model Evaluation

In [2]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)

In [15]:
"""
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's
"""

"\nCRIM - per capita crime rate by town\nZN - proportion of residential land zoned for lots over 25,000 sq.ft.\nINDUS - proportion of non-retail business acres per town.\nCHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)\nNOX - nitric oxides concentration (parts per 10 million)\nRM - average number of rooms per dwelling\nAGE - proportion of owner-occupied units built prior to 1940\nDIS - weighted distances to five Boston employment centres\nRAD - index of accessibility to radial highways\nTAX - full-value property-tax rate per $10,000\nPTRATIO - pupil-teacher ratio by town\nB - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\nLSTAT - % lower status of the population\nMEDV - Median value of owner-occupied homes in $1000's\n"

In [4]:
data

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


## 1. Split this data set into training (80%) and testing (20%) sets.

The `MEDV` field represents the median value of owner-occupied homes (in $1000's) and is the target variable that we will want to predict.

In [31]:
# Your code here :
from sklearn.model_selection import train_test_split

X = data.iloc[:, :13]
y = data.iloc[:, 13]
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(506, 13) (506,)
(404, 13) (102, 13)
(404,) (102,)


## 2. Train a `LinearRegression` model on this data set and generate predictions on both the training and the testing set.

In [53]:
# Your code here :
import numpy as np
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

yhat_train = np.round(lr_model.predict(X_train))
yhat_test = np.round(lr_model.predict(X_test))

n = 5

print("Predictions on Train set:")
print(f"{y_train[:n].values}\n{yhat_train[:n]}")
print()
print("Predictions on Test set:")
print(f"{y_test[:n].values}\n{yhat_test[:n]}")

Predictions on Train set:
[17.9 21.2 50.  19.3 32. ]
[ 3. 21. 25. 21. 34.]

Predictions on Test set:
[43.8 20.7 18.4 20.8 29.6]
[34. 21. 16. 19. 24.]


## 3. Calculate and print R-squared for both the training and the testing set.

R² (also known as the coefficient of determination) explains how well the model's predictions fit the actual data. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.

In [59]:
# Your code here :
from sklearn.metrics import r2_score
print("r2 score Train:", r2_score(y_train, yhat_train))
print("r2 score Test:", r2_score(y_test, yhat_test))


r2 score Train: 0.724464736530442
r2 score Test: 0.7998089672959028


This is a relatively good fit, especially for the test data which is over the train score. This is a positive sign for my model's ability to generalize to unseen data.

## 4. Calculate and print mean squared error for both the training and the testing set.

The MSE measures the average of the squares of the errors (i.e., the differences between the predicted values and actual values). It is a common metric used for regression models.

In [60]:
# Your code here :
from sklearn.metrics import mean_squared_error
print("Mean square error Train:", mean_squared_error(y_train, yhat_train))
print("Mean square error Test:", mean_squared_error(y_test, yhat_test))


Mean square error Train: 24.181212871287126
Mean square error Test: 14.248333333333333


The test error is lower than the training error, which could indicate that the model is not overfitting and is doing a better job on the test data than on the training data. However, the differences in MSE need to be considered in conjunction with other metrics and the scale of the problem to fully evaluate the performance.

## 5. Calculate and print mean absolute error for both the training and the testing set.

MAE measures the average of the absolute differences between the predicted values and actual values. It is more interpretable than MSE since it provides an error in the same unit as the predicted value.

In [57]:
# Your code here :
from sklearn.metrics import mean_absolute_error
print("Mean absolute error Train:", mean_absolute_error(y_train, yhat_train))
print("Mean absolute error Test:", mean_absolute_error(y_test, yhat_test))

Mean absolute error Train: 3.4542079207920793
Mean absolute error Test: 2.8362745098039213


Again, the test set has a slightly lower error, which suggests good generalization performance. The error seems relatively small in the context of my target variable (assuming it's on a similar scale).

## Classification Model Evaluation

In [66]:
from sklearn.datasets import load_iris
data = load_iris()
print(type(data))

<class 'sklearn.utils._bunch.Bunch'>


In [11]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [65]:
column_names = data.feature_names  # returns a list
column_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [73]:
df = pd.DataFrame(data['data'], columns=column_names)  # column names required when we load from a Bunch object or they will be named 0, 1, 2, ...

In [79]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [74]:
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [98]:
target = pd.DataFrame(data.target)[0]  # my classes
print(target.shape)
target

(150,)


0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: 0, Length: 150, dtype: int32

In [99]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [110]:
print(data['target_names'])
print(target.value_counts())

['setosa' 'versicolor' 'virginica']
0
0    50
1    50
2    50
Name: count, dtype: int64


The dataset is perfectly balanced, with 50 data sample in each category.

## 6. Split this data set into training (80%) and testing (20%) sets.

The `class` field represents the type of flower and is the target variable that we will want to predict.

In [101]:
# Your code here :
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(120, 4) (30, 4)
(120,) (30,)


## 7. Train a `LogisticRegression` model on this data set and generate predictions on both the training and the testing set.

In [102]:
# Your code here :
from sklearn.linear_model import LogisticRegression

logr_model = LogisticRegression()
logr_model.fit(X_train, y_train)

yhat_train = logr_model.predict(X_train)
yhat_test = logr_model.predict(X_test)

n = 5

print("Predictions on Train set:")
print(f"{y_train[:n].values}\n{yhat_train[:n]}")
print()
print("Predictions on Test set:")
print(f"{y_test[:n].values}\n{yhat_test[:n]}")

Predictions on Train set:
[0 0 0 2 0]
[0 0 0 2 0]

Predictions on Test set:
[1 1 0 0 0]
[1 1 0 0 0]


## 8. Calculate and print the accuracy score for both the training and the testing set.

Accuracy is the most commonly used metric for evaluating classification models. It measures the percentage of correctly predicted instances (both positive and negative) out of all instances in the dataset.

Drawback: If you have an imbalanced dataset (where one class is much more frequent than others), accuracy can be misleading. For example, if 95% of your samples belong to class A, a model that always predicts class A will have an accuracy of 95%, even though it's not necessarily useful.

In [103]:
# Your code here :
from sklearn.metrics import accuracy_score

print("Accuracy score Train:", accuracy_score(y_train, yhat_train))
print("Accuracy score Test:", accuracy_score(y_test, yhat_test))

Accuracy score Train: 0.975
Accuracy score Test: 1.0


## 9. Calculate and print the balanced accuracy score for both the training and the testing set.

Balanced Accuracy adjusts for imbalanced datasets by taking the average of the recall (or sensitivity) for each class. This metric gives equal importance to each class, making it more suitable for datasets where classes are not equally represented.

This is essentially the average of the recall for each class. For multiclass problems, it is generalized to the mean recall across all classes. This ensures that all classes contribute equally to the final score, regardless of their frequencies.

In [104]:
# Your code here :
from sklearn.metrics import balanced_accuracy_score

print("Balanced accuracy score Train:", balanced_accuracy_score(y_train, yhat_train))
print("Balanced accuracy score Test:", balanced_accuracy_score(y_test, yhat_test))

Balanced accuracy score Train: 0.9755799755799756
Balanced accuracy score Test: 1.0


## 10. Calculate and print the precision score for both the training and the testing set.

Precision Score: Useful when minimizing false positives is important, as it measures the proportion of correct positive predictions out of all predicted positives.

Which Method to Use?
- If you have an imbalanced dataset: average='weighted' is the most suitable because it takes into account the different frequencies of each class.
- If you care about all classes equally (whether balanced or not): average='macro' treats all classes equally and gives an unweighted mean of the precision scores.
- For overall model performance: average='micro' might be useful when you want to evaluate the performance of your model across all classes as a single entity, especially when handling multilabel tasks.

In [111]:
# Your code here :
from sklearn.metrics import precision_score

print("Precision score Train:", precision_score(y_train, yhat_train, average='weighted'))  # using weighted as our dataset is perfectly balanced
print("Precision score Test:", precision_score(y_test, yhat_test, average='weighted'))

Precision score Train: 0.9752134146341462
Precision score Test: 1.0


## 11. Calculate and print the recall score for both the training and the testing set.

Recall Score: Focuses on minimizing false negatives, measuring the proportion of actual positives correctly identified by the model.

In [113]:
# Your code here :
from sklearn.metrics import recall_score

print("Recall score Train:", recall_score(y_train, yhat_train, average='weighted'))
print("Recall score Test:", recall_score(y_test, yhat_test, average='weighted'))

Recall score Train: 0.975
Recall score Test: 1.0


## 12. Calculate and print the F1 score for both the training and the testing set.

F1 Score: Balances precision and recall by providing the harmonic mean, making it valuable when you need a trade-off between precision and recall.

In [115]:
# Your code here :
from sklearn.metrics import f1_score

print("F1 score Train:", f1_score(y_train, yhat_train, average='weighted'))
print("F1 score Test:", recall_score(y_test, yhat_test, average='weighted'))

F1 score Train: 0.9750076254384626
F1 score Test: 1.0


## 13. Generate confusion matrices for both the training and the testing set.

Confusion Matrix: Offers a detailed view of model performance by displaying true positives, true negatives, false positives, and false negatives in a matrix format for thorough error analysis.

In [119]:
# Your code here :
from sklearn.metrics import confusion_matrix

print("Confusion matrix on Train set:")
print(confusion_matrix(y_train, yhat_train))
print()

print("Confusion matrix on Test set:")
print(confusion_matrix(y_test, yhat_test))

Confusion matrix on Train set:
[[39  0  0]
 [ 0 40  2]
 [ 0  1 38]]

Confusion matrix on Test set:
[[11  0  0]
 [ 0  8  0]
 [ 0  0 11]]


The model is generalising pretty good, providing better results on the Test data with no false positive/negatives

## Bonus: For each of the data sets in this lab, try training with some of the other models you have learned about, recalculate the evaluation metrics, and compare to determine which models perform best on each data set.

### HOUSING DATASET

In [122]:
# Have fun here !

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)
X = data.iloc[:, :13]
y = data.iloc[:, 13]
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(506, 13) (506,)
(404, 13) (102, 13)
(404,) (102,)


In [138]:
# Lets try an XBoost model
import xgboost as xgb

# Initialise model
xg_reg = xgb.XGBRegressor(
    objective='reg:squarederror',  # Regression task
    colsample_bytree=0.3,          # Fraction of features to consider per tree
    learning_rate=0.1,             # Step size for model updates (tweaked)
    max_depth=5,                   # Maximum tree depth
    alpha=10,                      # L2 regularization term to reduce overfitting
    n_estimators=100               # Number of trees (tweaked)
)

# Train the model
xg_reg.fit(X_train, y_train)

# Make predictions
yhat_train = xg_reg.predict(X_train)
yhat_test = xg_reg.predict(X_test)

# Evaluate metrics
print("r2 score Train:", r2_score(y_train, yhat_train))
print("r2 score Test:", r2_score(y_test, yhat_test))
print()
print("Mean Squared Error Train:", mean_squared_error(y_train, yhat_train))
print("Mean Squared Error Test:", mean_squared_error(y_test, yhat_test))
print()
print("Mean absolute error Train:", mean_absolute_error(y_train, yhat_train))
print("Mean absolute error Test:", mean_absolute_error(y_test, yhat_test))


r2 score Train: 0.9723064816209458
r2 score Test: 0.7187936580841876

Mean Squared Error Train: 2.4998296111893707
Mean Squared Error Test: 16.707298012988694

Mean absolute error Train: 1.1577572369339442
Mean absolute error Test: 2.587977136350146


XBoost gives us a result a less good than our first LinearRegression model. Maybe fine tuning the hyperparameters would improve the prediction results.

In [149]:
# Now let's try using a DecisionTreeRegressor

from sklearn.tree import DecisionTreeRegressor

# Initialise model
tree_reg = DecisionTreeRegressor()

# Train the model
tree_reg.fit(X_train, y_train)

def print_reg_metrics(model):
    
    # Make predictions
    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)

    # Evaluate metrics
    print("r2 score Train:", r2_score(y_train, yhat_train))
    print("r2 score Test:", r2_score(y_test, yhat_test))
    print()
    print("Mean Squared Error Train:", mean_squared_error(y_train, yhat_train))
    print("Mean Squared Error Test:", mean_squared_error(y_test, yhat_test))
    print()
    print("Mean absolute error Train:", mean_absolute_error(y_train, yhat_train))
    print("Mean absolute error Test:", mean_absolute_error(y_test, yhat_test))

print_reg_metrics(tree_reg)


r2 score Train: 1.0
r2 score Test: 0.6620430395873815

Mean Squared Error Train: 0.0
Mean Squared Error Test: 20.07901960784314

Mean absolute error Train: 0.0
Mean absolute error Test: 3.0215686274509808


In [150]:
# Fine tuning hyperparameters

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt', 'log2'],
}

# Set up GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=tree_reg, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best parameters found: ", grid_search.best_params_)

# Evaluate the best model
best_tree_reg = grid_search.best_estimator_

print_reg_metrics(best_tree_reg)


Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters found:  {'max_depth': 10, 'max_features': None, 'min_samples_leaf': 5, 'min_samples_split': 2}
r2 score Train: 0.924998065314742
r2 score Test: 0.586859989855563

Mean Squared Error Train: 6.770250520587774
Mean Squared Error Test: 24.54586629713614

Mean absolute error Train: 1.6877970297029703
Mean absolute error Test: 3.0774074074074083


Fine-tuning the hyperparameters did not improve the model's prediction results. Overall, the Tree model is predicting better on the train set but worse on the test one, meaning that it's overfitting to the training data and not generalising on the test set as good as our first LinearRegression model.

### Iris Dataset

In [151]:
from sklearn.datasets import load_iris

# load dataset
data = load_iris()
df = pd.DataFrame(data['data'], columns=data.feature_names) 
target = pd.DataFrame(data.target)[0]

# split data
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)


In [153]:
# Let's try a DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

# initialize model
tree_class = DecisionTreeClassifier()

# train model
tree_class.fit(X_train, y_train)

# make predictions
def print_class_metrics(model):
    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)

    print("Predictions on Train set:")
    print(f"{y_train[:n].values}\n{yhat_train[:n]}")
    print()
    print("Predictions on Test set:")
    print(f"{y_test[:n].values}\n{yhat_test[:n]}")
    print()
    print("Accuracy score Train:", accuracy_score(y_train, yhat_train))
    print("Accuracy score Test:", accuracy_score(y_test, yhat_test))
    print()
    print("Balanced accuracy score Train:", balanced_accuracy_score(y_train, yhat_train))
    print("Balanced accuracy score Test:", balanced_accuracy_score(y_test, yhat_test))
    print()
    print("Precision score Train:", precision_score(y_train, yhat_train, average='weighted'))  # using weighted as our dataset is perfectly balanced
    print("Precision score Test:", precision_score(y_test, yhat_test, average='weighted'))
    print()
    print("Recall score Train:", recall_score(y_train, yhat_train, average='weighted'))
    print("Recall score Test:", recall_score(y_test, yhat_test, average='weighted'))
    print()
    print("F1 score Train:", f1_score(y_train, yhat_train, average='weighted'))
    print("F1 score Test:", recall_score(y_test, yhat_test, average='weighted'))
    print()
    print("Confusion matrix on Train set:")
    print(confusion_matrix(y_train, yhat_train))
    print()
    print("Confusion matrix on Test set:")
    print(confusion_matrix(y_test, yhat_test))

print_class_metrics(tree_class)

Predictions on Train set:
[2 2 1 2 2]
[2 2 1 2 2]

Predictions on Test set:
[1 0 2 0 2]
[2 0 2 0 2]

Accuracy score Train: 1.0
Accuracy score Test: 0.8666666666666667

Balanced accuracy score Train: 1.0
Balanced accuracy score Test: 0.8293650793650794

Precision score Train: 1.0
Precision score Test: 0.8676190476190477

Recall score Train: 1.0
Recall score Test: 0.8666666666666667

F1 score Train: 1.0
F1 score Test: 0.8666666666666667

Confusion matrix on Train set:
[[39  0  0]
 [ 0 43  0]
 [ 0  0 38]]

Confusion matrix on Test set:
[[11  0  0]
 [ 0  4  3]
 [ 0  1 11]]


In conclusion, using a DecisionTreeClassifier also gives us good results on classifying the iris data. However, we can see that if the Train metrics improve, the Test ones decrease slightly (especially recall and f1 score), showing that the model is overfitting to the training data ans will be less robust when presented with new data. We should rather stick to our initial LogisticRegression model.