# CPSC 330 - Applied Machine Learning 

## Homework 5: Evaluation metrics
### Associated lectures: [Lectures 9, 10](https://ubc-cs.github.io/cpsc330/README.html) 

**Due date: Monday, Feb 27, 2023 at 11:59pm**

## Imports

In [2]:
import os
import re
import sys
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

## Instructions 
<hr>
rubric={points:3}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2022W2/blob/main/docs/homework_instructions.md). 

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).

<br><br>

## Exercise 1: Precision, recall, and f1 score by hand <a name="1"></a>
<hr>

Consider the problem of predicting whether a patient has cancer or not. It is important to catch this disease early to reduce the mortality rate; late diagnosis will result in metastasis to other organs, which adversely impacts the patient's prognosis. Below are confusion matrices of two machine learning models: Model A and Model B. 

- Model A

|         | Predicted disease | Predicted no disease |
| :------------- | -----------------------: | -----------------------: |
| **Actual disease**       | 48 | 32 |
| **Actual no disease**       | 20 | 100 |


- Model B

|        | Predicted disease | Predicted no disease |
| :------------- | -----------------------: | -----------------------: |
| **Actual disease**       | 43 | 22 |
| **Actual no disease**       | 35 | 100 |

- Confusion matrix layout

|        | Predicted disease | Predicted no disease |
| :------------- | -----------------------: | -----------------------: |
| **Actual disease**       | TP | FN |
| **Actual no disease**       | FP | TN |

### 1.1 Positive vs. negative class 
rubric={points:2}

**Your tasks:**

Precision, recall, and f1 score depend upon which class is considered "positive", that is the thing you wish to find. In the example above, which class is likely to be the "positive" class? Why? 

**Answer:** The positive class is ***disease*** (the patient has cancer). The problem statement stresses that it is important to "catch this disease early", so patients ***with cancer*** are the cases we are trying to find.

<br><br>

### 1.2 Accuracy
rubric={points:2}

**Your tasks:**

Calculate accuracies for Model A and Model B. 

We'll store all metrics associated with Model A and Model B in the `results_dict` below. 

In [4]:
results_dict = {"A": {}, "B": {}}

In [6]:
results_dict["A"]["accuracy"] = (48 + 100) / (48 + 100 + 32 + 20)
results_dict["B"]["accuracy"] = (43 + 100) / (43 + 22 + 35 + 100)
print(f'Model A\'s accuracy: {results_dict["A"]["accuracy"]}')
print(f'Model B\'s accuracy: {results_dict["B"]["accuracy"]}')

Model A's accuracy: 0.74
Model B's accuracy: 0.715


<br><br>

### 1.3 Which model would you pick? 
rubric={points:1}

**Your tasks:**

Which model would you pick simply based on the accuracy metric? 

**Answer:** Simply based on accuracy, I would pick ***Model A*** because the accuracy of its predictions is higher.

<br><br>

### 1.4 Precision, recall, f1-score
rubric={points:6}

**Your tasks:**

1. Calculate precision, recall, f1-score for Model A and Model B manually, without using `scikit-learn` tools. 


In [12]:
# Precision = TP / (TP + FP): of the patients flagged as diseased, how many really are.
results_dict["A"]["precision"] = 48 / (48 + 20)
results_dict["B"]["precision"] = 43 / (43 + 35)
# Recall = TP / (TP + FN): of the patients who really are diseased, how many were flagged.
results_dict["A"]["recall"] = 48 / (48 + 32)
results_dict["B"]["recall"] = 43 / (43 + 22)
# F1 is the harmonic mean of precision and recall.
for model in ("A", "B"):
    precision = results_dict[model]["precision"]
    recall = results_dict[model]["recall"]
    results_dict[model]["f1"] = 2 * precision * recall / (precision + recall)


Show the dataframe with all results. 

In [13]:
pd.DataFrame(results_dict)

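|         | A | B |
| :------------- | -----------------------: | -----------------------: |
| **accuracy**       | 0.74 | 0.715 |
| **precision**       | 0.705882 | 0.551282 |
| **recall**       | 0.6 | 0.661538 |
| **f1**       | 0.648649 | 0.601399 |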
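
The hand calculations in 1.2 and 1.4 all follow from the four confusion-matrix cells. As a minimal sketch (the helper name `metrics_from_confusion` is hypothetical and not part of the assignment), the same numbers can be reproduced in plain Python:

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Cell counts copied from the Model A and Model B confusion matrices.
model_a = metrics_from_confusion(tp=48, fn=32, fp=20, tn=100)
model_b = metrics_from_confusion(tp=43, fn=22, fp=35, tn=100)

for name, metrics in {"A": model_a, "B": model_b}.items():
    print(name, {key: round(value, 6) for key, value in metrics.items()})
```

Keeping the cell counts as named arguments makes it harder to swap FP and FN by accident, which is the most common slip in this kind of calculation.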


<br><br>

### 1.5 Discussion
rubric={points:4}

**Your tasks:**
1. Given the type of problem (early cancer diagnosis), which metric is more informative in this problem? Why? 
2. Which model would you pick based on this information? 

1. **Answer:** Recall is the more informative metric. In early cancer diagnosis, a false negative (a patient with cancer who is predicted to be healthy) is far more costly than a false positive (a healthy patient who is flagged for follow-up), so we want to identify as many of the patients who actually have cancer as possible.

2. **Answer:** I would pick ***Model B*** because it has the higher recall (0.66 vs. 0.60).
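
One way to make the recall argument concrete is to look at the miss rate (false negative rate), which is simply 1 - recall. The sketch below uses only the TP and FN cells from the two confusion matrices; the variable names are illustrative.

```python
# TP and FN cells from the Model A and Model B confusion matrices.
cells_by_model = {
    "A": {"tp": 48, "fn": 32},
    "B": {"tp": 43, "fn": 22},
}

miss_rate = {}
for name, cells in cells_by_model.items():
    actual_positives = cells["tp"] + cells["fn"]
    # Miss rate (false negative rate) = FN / (TP + FN) = 1 - recall.
    miss_rate[name] = cells["fn"] / actual_positives

for name, rate in miss_rate.items():
    print(f"Model {name} misses {rate:.1%} of the patients who have the disease")
```

Under these counts Model A misses a larger share of diseased patients than Model B, which is the reason for preferring Model B despite its lower accuracy.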

<br><br>

### (Optional) 1.6 
rubric={points:1}

**Your tasks:**

Provide 2 to 3 example classification datasets (with links) where the accuracy metric would be misleading. Discuss which evaluation metric would be more appropriate for each dataset. You may consider datasets we have used in this course so far. You could also look up datasets on Kaggle. 

1. [Random Sample of NIH Chest X-ray Dataset](https://www.kaggle.com/datasets/nih-chest-xrays/sample): The task is to predict a disease classification from a chest x-ray. The classes are imbalanced, with the majority of x-rays labelled `No Finding`. Recall may be a better evaluation metric because missing a disease is more harmful to the patient than a false alarm.

2. [EMNIST (Extended MNIST)](https://www.kaggle.com/datasets/crawford/emnist?select=emnist-byclass-train.csv): 28x28 pixel images of handwritten characters. The ByClass dataset is significantly imbalanced towards lower-value numerical characters. Per-character precision may be a better measure than overall accuracy: a model that identifies one frequent character consistently but fails on the other (more infrequent) characters can still have high accuracy despite performing poorly.

<br><br><br><br>

### Exercise 2: Classification evaluation metrics using `sklearn` <a name="2"></a>
<hr>

In general, when a dataset is imbalanced, accuracy does not tell the whole story. In class, we looked at a credit card fraud dataset, which is a classic example of an imbalanced dataset. 

Another example is customer churn datasets. [Customer churn](https://en.wikipedia.org/wiki/Customer_attrition) refers to customers leaving a subscription service like Netflix. In this exercise, we will try to predict customer churn in a dataset where most of the customers stay with the service and a small minority cancel their subscription. To start, please download the [Kaggle telecom customer churn dataset](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset).

Once you have the data, you should be able to run the starter code below, which reads the data CSV as a pandas dataframe and splits it into 70% train and 30% test. 

Note that the `churn` column in the dataset is the target. "True" means the customer left the subscription (churned) and "False" means they stayed.

> Note that for this kind of problem a more appropriate technique is something called survival analysis and we'll be talking about it later in the course. For now, we'll just treat it as a binary classification problem. 

In [14]:
df = pd.read_csv("bigml_59c28831336c6604c800002a.csv", encoding="utf-8")
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
train_df

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
1402,NE,70,415,421-8535,no,no,0,213.4,86,36.28,...,77,17.40,256.6,101,11.55,5.7,4,1.54,1,False
1855,WI,67,510,417-2265,no,no,0,109.1,134,18.55,...,76,12.10,91.2,86,4.10,10.9,5,2.94,2,False
633,NJ,122,415,327-9341,no,yes,34,146.4,104,24.89,...,103,7.62,220.0,91,9.90,15.6,4,4.21,2,False
1483,NV,107,510,419-9688,yes,no,0,234.1,91,39.80,...,105,13.86,282.5,100,12.71,10.0,3,2.70,1,False
2638,HI,105,510,364-8128,no,no,0,125.4,116,21.32,...,95,22.23,241.6,104,10.87,11.4,9,3.08,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2154,WY,126,408,339-9798,yes,no,0,197.6,126,33.59,...,112,20.95,285.3,104,12.84,12.5,8,3.38,2,False
3089,WV,70,510,348-3777,no,yes,30,143.4,72,24.38,...,92,14.45,127.9,68,5.76,9.4,4,2.54,3,False
1766,NJ,125,415,406-6400,no,no,0,182.3,64,30.99,...,121,11.88,171.6,96,7.72,11.6,7,3.13,2,False
1122,NE,159,415,362-5111,no,no,0,189.1,105,32.15,...,147,20.92,242.0,106,10.89,10.4,5,2.81,1,True


<br><br>

### 2.1 Distribution of target values
rubric={points:4}

**Your tasks:**

Examine the distribution of target values in the train split. Do you see class imbalance? If yes, do we need to deal with it? Why or why not? 

**ANSWER:**
>  `churn` is a binary target whose value is `False` in $1984/2333$ training examples (about 85%).

Yes, there is class imbalance, and we need to deal with it so that the model does not simply favour the majority class. This is especially important here because the imbalance is towards `False`, while `True` is the class we care about: we would like to find the customers who <ins>left</ins> the subscription.
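
To see why the imbalance matters, consider a model that always predicts `False`. The sketch below uses only the counts quoted in the answer above (1984 `False` examples out of 2333); it is an illustration of the baseline, not part of the required solution.

```python
n_total = 2333  # rows in the training split
n_false = 1984  # training examples with churn == False
n_true = n_total - n_false

# A classifier that always predicts False is right on every False example...
baseline_accuracy = n_false / n_total
# ...but it finds none of the customers who actually churned.
baseline_recall = 0 / n_true

print(f"churn == True examples: {n_true} ({n_true / n_total:.1%})")
print(f"always-False accuracy: {baseline_accuracy:.4f}")
print(f"always-False recall:   {baseline_recall:.1f}")
```

An accuracy of roughly 0.85 is therefore the floor on this dataset rather than evidence of a useful model, so metrics focused on the `True` class are needed.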

In [21]:
pd.set_option('display.max_columns', None)
train_df.describe(include="all")

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
count,2333,2333.0,2333.0,2333,2333,2333,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333.0,2333
unique,51,,,2333,2,2,,,,,,,,,,,,,,,2
top,WV,,,421-8535,no,no,,,,,,,,,,,,,,,False
freq,70,,,1,2106,1695,,,,,,,,,,,,,,,1984
mean,,100.434634,436.324046,,,,8.02829,179.655679,100.567081,30.542015,201.175782,99.885555,17.10021,201.211745,99.988856,9.054591,10.269567,4.503215,2.773365,1.55165,
std,,39.64247,41.8542,,,,13.665229,54.546284,20.202414,9.272847,50.449386,19.788878,4.288194,50.888058,19.406455,2.290012,2.777601,2.507555,0.749929,1.328702,
min,,1.0,408.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0,
25%,,73.0,408.0,,,,0.0,143.4,87.0,24.38,167.3,87.0,14.22,166.9,87.0,7.51,8.5,3.0,2.3,1.0,
50%,,100.0,415.0,,,,0.0,179.2,101.0,30.46,202.4,100.0,17.2,201.6,100.0,9.07,10.4,4.0,2.81,1.0,
75%,,127.0,415.0,,,,19.0,216.3,114.0,36.77,236.0,113.0,20.06,236.6,113.0,10.65,12.1,6.0,3.27,2.0,


In [30]:
print(train_df.shape)
train_df.info()

(2333, 21)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2333 entries, 1402 to 1346
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   2333 non-null   object 
 1   account length          2333 non-null   int64  
 2   area code               2333 non-null   int64  
 3   phone number            2333 non-null   object 
 4   international plan      2333 non-null   object 
 5   voice mail plan         2333 non-null   object 
 6   number vmail messages   2333 non-null   int64  
 7   total day minutes       2333 non-null   float64
 8   total day calls         2333 non-null   int64  
 9   total day charge        2333 non-null   float64
 10  total eve minutes       2333 non-null   float64
 11  total eve calls         2333 non-null   int64  
 12  total eve charge        2333 non-null   float64
 13  total night minutes     2333 non-null   float64
 14  total night calls       23

<br><br>

### (Optional) 2.2 EDA 
rubric={points:1}

**Your tasks:**

Come up with **two** exploratory questions you would like to answer and explore those. Briefly discuss your results in 1-3 sentences.

You are welcome to use `pandas_profiling` (see Lecture 10) but you don't have to.

**ANSWER:** Two exploratory questions
1. Are there any numeric-looking columns that aren't actually numeric? 
    - `area code` seems to be a categorical feature related to the location of the phone number.
    - `customer service calls` looked like a possible ordinal feature at first because it only takes values from 0 to 9. However, judging by the name and the context, it is most likely a count of customer service calls for each customer, with a maximum of 9 in the training split, so it is actually a numeric feature.
2. Are there any ordinal features?
    - None of the categorical features appears to be ordinal, although several of them are binary.

In [67]:
numeric_looking_columns = train_df.select_dtypes(include=np.number).columns.tolist()
non_numeric_looking_cols = set(train_df.columns.tolist()) - set(numeric_looking_columns)
print(numeric_looking_columns)
print(non_numeric_looking_cols)

['account length', 'area code', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls']
{'international plan', 'phone number', 'voice mail plan', 'state', 'churn'}


In [53]:
print(train_df[numeric_looking_columns].nunique())

account length             205
area code                    3
number vmail messages       45
total day minutes         1402
total day calls            115
total day charge          1402
total eve minutes         1337
total eve calls            115
total eve charge          1215
total night minutes       1360
total night calls          111
total night charge         852
total intl minutes         154
total intl calls            21
total intl charge          154
customer service calls      10
dtype: int64


In [60]:
possible_categorical_numeric_looking_cols = ['area code', 'customer service calls']
for n in possible_categorical_numeric_looking_cols:
    print(f'unique {n}: {train_df[n].unique()}')


unique area code: [415 510 408]
unique customer service calls: [1 2 0 5 3 4 8 6 7 9]


In [68]:
# Convert the set to a list: pandas deprecated (and later removed) set indexers.
print(train_df[list(non_numeric_looking_cols)].nunique())

international plan       2
phone number          2333
voice mail plan          2
state                   51
churn                    2
dtype: int64



In [69]:
for n in ['international plan', 'voice mail plan', 'churn']:
    print(f'unique {n}: {train_df[n].unique()}')

unique international plan: ['no' 'yes']
unique voice mail plan: ['no' 'yes']
unique churn: [False  True]


<br><br>

### 2.3 Column transformer 
rubric={points:14}

The code below creates `X_train`, `y_train`, `X_test`, `y_test` for you. 
In preparation for building a classifier, set up a `ColumnTransformer` that performs whatever feature transformations you deem sensible. This can include dropping features if you think they are not helpful. Remember that by default `ColumnTransformer` will drop any columns that aren't accounted for when it's created.

For each group of features (e.g. numeric, categorical or else) explain why you are applying the particular transformation. For example, "I am doing transformation X to the following categorical features: `a`, `b`, `c` because of reason Y," etc.

Finally, fit `ColumnTransformer` on your training set; and use the `ColumnTransformer` to transform your train data.

In [70]:
X_train = train_df.drop(columns=["churn"])
X_test = test_df.drop(columns=["churn"])

y_train = train_df["churn"]
y_test = test_df["churn"]

**ANSWER:** In my EDA, I found no missing values (every column has $2333$ non-null entries), so I will not be using imputation.

- **DROP:** I am dropping the `phone number` column. It is categorical and different for every customer, so a model cannot learn useful patterns from it. Although `area code` relates to the phone number, I am keeping it because it carries location information that may differ from `state`.

- **NUMERIC:** I am standardizing all numeric features so that features with larger values do not dominate scale-sensitive models such as logistic regression.

- **CATEGORICAL:** I am applying one-hot encoding to the categorical features (`state`, `area code`, `international plan`, `voice mail plan`) because the models require a numeric representation of them. 
    - I am passing `handle_unknown="ignore"` so that no error is raised if a category appears only in a validation fold during cross-validation and was never seen during fitting. 
    - Finally, I am passing `drop="if_binary"` so that each binary categorical feature is represented by one numeric column rather than two.
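
Standardization replaces each value with `(x - mean) / std`. One detail worth checking is which standard deviation is used: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the `std` row of `DataFrame.describe()` is the sample standard deviation (`ddof=1`). The sketch below is a hand calculation under that assumption, using the `account length` summary statistics reported by `describe()` in 2.1 and the value 70 from row 1402.

```python
import math

n = 2333               # rows in the training split
mean = 100.434634      # describe() mean of `account length`
sample_std = 39.64247  # describe() std of `account length` (ddof=1)

# Convert the sample std (ddof=1) to the population std (ddof=0).
population_std = sample_std * math.sqrt((n - 1) / n)

x = 70                 # `account length` of row 1402
z = (x - mean) / population_std
print(round(z, 4))
```

The result is about -0.7679, which agrees with the scaled `account length` of row 1402 in `X_train_enc` (-0.767893). With 2333 rows the two definitions of the standard deviation differ only in the fourth decimal place.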



In [81]:
numeric_features = [
    'account length', 'number vmail messages', 'total day minutes', 
    'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 
    'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 
    'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls'
]

categorical_non_binary_features = [
    'state', 'area code'
]

categorical_binary_features = [
    'international plan', 'voice mail plan'
]

categorical_features = categorical_non_binary_features + categorical_binary_features

drop_features = [
    'phone number'
]

# Sanity check: the three groups together account for all 20 feature columns.
assert len(numeric_features) + len(categorical_features) + len(drop_features) == 20

In [95]:
numeric_transformer = make_pipeline(StandardScaler())

categorical_transformer = make_pipeline(
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; `sparse` was removed in 1.4.
    OneHotEncoder(handle_unknown="ignore", sparse=False, drop="if_binary"),
)

preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)

In [96]:
preprocessor.fit(X_train)



In [97]:
preprocessor.named_transformers_

{'drop': 'drop',
 'pipeline-1': Pipeline(steps=[('standardscaler', StandardScaler())]),
 'pipeline-2': Pipeline(steps=[('onehotencoder',
                  OneHotEncoder(drop='if_binary', handle_unknown='ignore',
                                sparse=False))])}

In [98]:
# Adapted from Lecture 10
# get the new list of columns after preprocessing
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-2"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)

new_columns = numeric_features + ohe_columns

# create transformed training set
X_train_enc = pd.DataFrame(
    preprocessor.transform(X_train), index=X_train.index, columns=new_columns
)
# create transformed testing set
X_test_enc = pd.DataFrame(
    preprocessor.transform(X_test), index=X_test.index, columns=new_columns
)

In [99]:
X_train_enc.head()

Unnamed: 0,account length,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,state_AK,state_AL,state_AR,state_AZ,state_CA,state_CO,state_CT,state_DC,state_DE,state_FL,state_GA,state_HI,state_IA,state_ID,state_IL,state_IN,state_KS,state_KY,state_LA,state_MA,state_MD,state_ME,state_MI,state_MN,state_MO,state_MS,state_MT,state_NC,state_ND,state_NE,state_NH,state_NJ,state_NM,state_NV,state_NY,state_OH,state_OK,state_OR,state_PA,state_RI,state_SC,state_SD,state_TN,state_TX,state_UT,state_VA,state_VT,state_WA,state_WI,state_WV,state_WY,area code_408,area code_415,area code_510,international plan_yes,voice mail plan_yes
1402,-0.767893,-0.587624,0.618769,-0.721211,0.618927,0.069871,-1.156734,0.069926,1.088667,0.052115,1.089926,-1.645501,-0.200722,-1.644994,-0.415269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1855,-0.843585,-0.587624,-1.293778,1.655252,-1.293517,-1.167277,-1.207278,-1.166291,-2.162302,-0.72099,-2.164029,0.227019,0.198158,0.222249,0.337507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
633,0.544113,1.900976,-0.609809,0.169963,-0.609654,-2.21013,0.157417,-2.211244,0.369287,-0.463288,0.369252,1.919489,-0.200722,1.916105,0.337507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1483,0.16565,-0.587624,0.998345,-0.473663,0.998611,-0.754894,0.258506,-0.755774,1.597736,0.000574,1.596582,-0.097071,-0.599603,-0.09785,-0.415269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2638,0.115188,-0.587624,-0.994886,0.764078,-0.994731,1.195994,-0.246937,1.196515,0.793839,0.206736,0.792921,0.407069,1.793679,0.408973,0.337507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


<br><br>

### 2.4 area code feature
rubric={points:4}

The original dataset had a feature called `area code`.

1. The area codes are numbers. Does it make sense to encode them with one-hot encoding (OHE) or not? Please justify your response.
2. What were the possible values of `area code`? 
3. If area code is encoded with OHE, how many new features are created to replace it?

**ANSWER:**
1. It makes sense to encode `area code` with OHE. Although the values are numbers, their magnitude and order carry no meaning; they are labels for a small, finite set of categories. The feature provides location information that differs from `state` and, unlike `phone number`, is not unique to each example, so it can be treated as a categorical feature.
2. There are 3 possible values: `area code = {415, 510, 408}`.
3. Because `area code` has 3 possible values, OHE creates 3 columns to replace it: `area code_408`, `area code_415`, and `area code_510`.
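
The encoding described above can be sketched in a few lines of plain Python. The `one_hot` helper is hypothetical and only mimics the effect of `OneHotEncoder(handle_unknown="ignore")` on a single value; the feature counts come from the EDA in 2.2 and the feature lists in 2.3.

```python
def one_hot(value, categories):
    """Encode `value` as a 0/1 list; an unseen value maps to all zeros,
    mirroring the effect of handle_unknown="ignore"."""
    return [1.0 if value == category else 0.0 for category in categories]


area_codes = [408, 415, 510]     # unique values found in the training split
print(one_hot(415, area_codes))  # a known area code
print(one_hot(999, area_codes))  # a hypothetical unseen area code

# Width of the preprocessed data: each binary feature keeps a single
# column because of drop="if_binary".
n_numeric, n_states, n_area_codes, n_binary = 15, 51, 3, 2
n_columns = n_numeric + n_states + n_area_codes + n_binary
print(n_columns)
```

The total matches the number of columns in `X_train_enc`: 15 scaled numeric features, 51 `state_*` columns, 3 `area code_*` columns, and one column each for `international plan_yes` and `voice mail plan_yes`.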

<br><br>

### 2.5 Logistic regression
rubric={points:12} 

**Your tasks:**

1. Report the cross-validation results of a `LogisticRegression` model, with default hyperparameters, on the following metrics: `"accuracy", "precision", "recall", "f1"`
2. Are you satisfied with the results? Explain why or why not. Discuss in a few sentences. 

In [145]:
scoring = [
    "accuracy",
    "f1",
    "recall",
    "precision",
]

# One row per value of C; C=1.0 is the LogisticRegression default.
scores_dict = {
    "C": np.logspace(-3, 5, 9),
    "mean_fit_time": [],
    "mean_score_time": [],
}
for metric in scoring:
    scores_dict[f"mean_test_{metric}"] = []
    scores_dict[f"mean_train_{metric}"] = []

for c in scores_dict["C"]:
    pipe_lr = make_pipeline(preprocessor, LogisticRegression(C=c, max_iter=1000))
    scores = cross_validate(
        pipe_lr, X_train, y_train, return_train_score=True, scoring=scoring
    )
    # Average each cross-validation result over the folds.
    scores_dict["mean_fit_time"].append(scores["fit_time"].mean())
    scores_dict["mean_score_time"].append(scores["score_time"].mean())
    for metric in scoring:
        for split in ("test", "train"):
            scores_dict[f"mean_{split}_{metric}"].append(
                scores[f"{split}_{metric}"].mean()
            )

results = pd.DataFrame(scores_dict)
results


|   | C | mean_fit_time | mean_score_time | mean_test_accuracy | mean_train_accuracy | mean_test_f1 | mean_train_f1 | mean_test_recall | mean_train_recall | mean_test_precision | mean_train_precision |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **0** | 0.001 | 0.007864 | 0.002981 | 0.850408 | 0.850407 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| **1** | 0.01 | 0.006998 | 0.002688 | 0.852982 | 0.854372 | 0.05999 | 0.073544 | 0.031594 | 0.038682 | 0.72 | 0.767074 |
| **2** | 0.1 | 0.009923 | 0.002618 | 0.854696 | 0.862195 | 0.230719 | 0.279605 | 0.146294 | 0.179091 | 0.553662 | 0.641289 |
| **3** | 1.0 | 0.015664 | 0.002717 | 0.855978 | 0.866481 | 0.302739 | 0.355603 | 0.209317 | 0.246423 | 0.548663 | 0.6392 |
| **4** | 10.0 | 0.031864 | 0.003149 | 0.852976 | 0.866802 | 0.314864 | 0.373505 | 0.226501 | 0.265773 | 0.521628 | 0.629905 |
| **5** | 100.0 | 0.03484 | 0.003061 | 0.850405 | 0.867553 | 0.307957 | 0.381023 | 0.223644 | 0.272937 | 0.500641 | 0.63344 |
| **6** | 1000.0 | 0.064308 | 0.00293 | 0.850404 | 0.867981 | 0.30815 | 0.38418 | 0.223644 | 0.275806 | 0.502379 | 0.635289 |
| **7** | 10000.0 | 0.06632 | 0.002942 | 0.850833 | 0.868517 | 0.30867 | 0.38626 | 0.223644 | 0.27724 | 0.504512 | 0.639499 |
| **8** | 100000.0 | 0.048677 | 0.002709 | 0.850833 | 0.86841 | 0.30867 | 0.385414 | 0.223644 | 0.276526 | 0.504512 | 0.638884 |


(2333, 20)
(2333,)


<br><br>

### 2.6 Logistic regression with `class_weight`
rubric={points:6}

**Your tasks:**

1. Set the `class_weight` parameter of your logistic regression model to `'balanced'` and report the same metrics as in the previous part. 
2. Do you prefer this model to the one in the previous part? Discuss your results in a few sentences while comparing the metrics of this model and the previous model.
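
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to the training-split counts from 2.1 (1984 `False`, and therefore 349 `True`); it shows how the weights are derived and is not a substitute for the cross-validation the task asks for.

```python
# Class counts in the training split (see 2.1).
class_counts = {False: 1984, True: 349}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Heuristic documented for class_weight="balanced":
#   weight = n_samples / (n_classes * count_of_that_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}
print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

Each churned customer ends up weighted roughly 5.7 times as heavily as a retained one (the ratio 1984 / 349), so mistakes on the minority class cost more during training.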

<br><br>

### 2.7 Hyperparameter optimization
rubric={points:10}

1. Jointly optimize `C` and `class_weight` with `GridSearchCV` and `scoring="f1"`.
  - For `class_weight`, consider 3 values: 
    - `None` (no weight)
    - "weight of class 0 = 1"  and  "weight of class 1 = 3"
    - '`balanced`'
  - For `C`, choose some reasonable values
2. What values of `C` and `class_weight` are chosen and what is the best cross-validation f1 score?
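
As a rough sketch of what the search space could look like, the dictionary below lists the three `class_weight` options from the task together with an assumed range of `C` values (the range is a placeholder, not a recommendation). The `logisticregression__` prefix assumes the pipeline was built with `make_pipeline`, which names each step after its lowercased class name.

```python
from itertools import product

# Keys follow the "<step name>__<parameter>" convention used by pipelines.
param_grid = {
    "logisticregression__C": [10.0 ** k for k in range(-3, 4)],  # assumed range
    # Dict keys are class labels, written as 0 and 1 as in the task statement.
    "logisticregression__class_weight": [None, {0: 1, 1: 3}, "balanced"],
}

# A grid search evaluates every combination of the listed values.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
print(len(candidates))
print(candidates[0])
```

A dictionary of this shape is what `GridSearchCV` accepts as its `param_grid` argument; with 5-fold cross-validation, every candidate is fitted five times, so the grid size directly controls the run time.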


<br><br>

### 2.8 Test results
rubric={points:10}

**Your tasks**
1. Evaluate the best model on the test set. In particular show each of the following on the test set:  
    - Plot Confusion matrix
    - Plot Precision-recall curve 
    - Calculate average precision score
    - Plot ROC curve
    - Report AUC score
2. Comment on the AUC score and give an intuitive explanation of what this value of AUC means for this problem.
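
One intuitive reading of AUC is that it equals the probability that a randomly chosen positive example (a churned customer) receives a higher score than a randomly chosen negative example. The sketch below computes AUC directly from that definition; the labels and scores are hypothetical and do not come from the churn dataset.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied pairs count as half."""
    positives = [s for y, s in zip(y_true, scores) if y]
    negatives = [s for y, s in zip(y_true, scores) if not y]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted churn probabilities for six customers.
y_true = [True, True, False, True, False, False]
scores = [0.9, 0.6, 0.7, 0.4, 0.3, 0.1]
auc = pairwise_auc(y_true, scores)
print(round(auc, 3))
```

Here 7 of the 9 positive-negative pairs are ordered correctly. A value of 0.5 would correspond to random ranking and 1.0 to a perfect ranking, independent of any particular decision threshold.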

<br><br><br><br>

### Exercise 3: Regression metrics <a name="3"></a>
<hr> 


For this exercise, we'll use the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) from `sklearn.datasets`. The code below loads the dataset.  

In [None]:
from sklearn.datasets import fetch_california_housing

housing_df = fetch_california_housing(as_frame=True).frame

### 3.1: Data splitting and exploration 
rubric={points:4}

**Your tasks:**

1. Split the data into train (75%) and test (25%) splits. 
2. Explore the train split. Do you need to apply any transformations on the data? If yes, create a preprocessor with the appropriate transformations. 
3. Separate `X` and `y` to train and test splits. 

<br><br>

### 3.2 Baseline: Linear Regression 
rubric={points:2}

**Your tasks:**
1. Carry out cross-validation using `sklearn.linear_model.LinearRegression` with default scoring. 
2. What metric is used for scoring by default? 
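
Regressors in scikit-learn report the coefficient of determination, $R^2$, from their `score` method, and that is the score cross-validation uses when no `scoring` argument is given. As a minimal sketch of what that number measures (the helper name and the data are hypothetical):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r2_score_manual(y_true, y_pred)
print(round(r2, 4))

# Always predicting the mean of the targets gives R^2 = 0.
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)
print(r2_score_manual(y_true, mean_prediction))
```

An $R^2$ of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than predicting the mean.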

<br><br>

### 3.3 Random Forest Regressor
rubric={points:7}

In this exercise, we are going to use the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model, which we haven't looked into yet. At this point you should feel comfortable using models with our usual ML workflow even if you don't know the details. We'll talk about `RandomForestRegressor` later in the course.  

The code below creates dictionaries for two models (`models`) and five evaluation metrics (`score_types_reg`). Mean absolute percentage error (MAPE) is included through scikit-learn's built-in `neg_mean_absolute_percentage_error` scorer. 

**Your tasks:**

1. Using the `models` and the evaluation metrics `score_types_reg` in the code below, carry out cross-validation with each model by passing the evaluation metrics to the `scoring` argument of `cross_validate`. Use a pipeline with the model as an estimator if you are applying any transformations. 
2. Show results as a dataframe. 
3. Interpret the results. How do the models compare to the baseline? Which model seems to be performing well with different metrics? 


In [None]:
models = {
    "Ridge": Ridge(),
    "Random Forest": RandomForestRegressor(),
}

score_types_reg = {
    "neg_mean_squared_error": "neg_mean_squared_error",
    "neg_root_mean_squared_error": "neg_root_mean_squared_error",
    "neg_mean_absolute_error": "neg_mean_absolute_error",
    "r2": "r2",
    "neg_mean_absolute_percentage_error": "neg_mean_absolute_percentage_error",
}
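
The `neg_` prefix on these scorer names reflects scikit-learn's convention that a larger score is always better, so error metrics are reported with their sign flipped. The sketch below computes the four error metrics by hand on hypothetical values and negates them; the helper name is illustrative.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE, MAE, and MAPE (for all four, lower is better)."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "mean_absolute_error": sum(abs(r) for r in residuals) / n,
        "mean_absolute_percentage_error": sum(
            abs(r) / abs(t) for r, t in zip(residuals, y_true)
        ) / n,
    }


# Hypothetical targets and predictions, for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
metrics = regression_errors(y_true, y_pred)

# Negate each error so that a larger score means a better model.
neg_scores = {f"neg_{name}": -value for name, value in metrics.items()}
for name, value in neg_scores.items():
    print(f"{name}: {value:.4f}")
```

When interpreting cross-validation results, values closer to zero are better for the four `neg_*` metrics, while `r2` is better the closer it is to 1. MAPE is expressed as a fraction of the true value, which makes it easier to explain than an error in the target's own units.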

<br><br>

### 3.4 Hyperparameter optimization 
rubric={points:1}

1. Carry out hyperparameter optimization using `RandomizedSearchCV` and `Ridge` with the following `param_dist`. The `alpha` hyperparameter of `Ridge` controls the fundamental tradeoff. Choose `neg_mean_absolute_percentage_error` as the hyperparameter optimization metric.

2. What was the best `alpha` hyperparameter found?

In [None]:
from scipy.stats import loguniform

param_dist = {"ridge__alpha": loguniform(1e-3, 1e3)}
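
A log-uniform distribution spreads samples evenly across orders of magnitude, which suits a regularization strength such as `alpha` whose useful values may span `1e-3` to `1e3`. The sketch below imitates that sampling with the standard library only; `sample_loguniform` is a hypothetical helper, not the SciPy implementation.

```python
import math
import random

rng = random.Random(0)  # seeded so the sketch is repeatable


def sample_loguniform(low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


alphas = [sample_loguniform(1e-3, 1e3) for _ in range(1000)]

# On a log scale 1.0 is the midpoint of [1e-3, 1e3], so roughly half of
# the draws should fall below it.
share_below_one = sum(a < 1.0 for a in alphas) / len(alphas)
print(f"share of samples below 1.0: {share_below_one:.2f}")
```

A plain uniform distribution over the same interval would place almost every sample above 1, so small values of `alpha` would rarely be tried.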

<br><br>

### 3.5 Test results
rubric={points:4}

**Your tasks:**

Test the best model (from 3.4) on the test set based on the `neg_mean_absolute_percentage_error` score.

<br><br>

### 3.6 Model interpretation  
rubric={points:4}

Ridge is a linear model and it learns coefficients associated with each feature during `fit()`. 

**Your tasks:**

1. Explore coefficients learned by the `Ridge` model above as a pandas dataframe with two columns: 
   - features 
   - coefficients
2. Increasing which feature values would result in higher housing price? 

<br><br>

## Submission instructions 

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 