## W3&W4 post studio exercises (errors, model fitting)

Name: Gue Zhen Xue
Monash ID: 33521352

Enter your solution in the cell(s) below each exercise. Add a couple of inline comments explaining your code. Don't forget to add comments in a markdown cell after each exercise. Missing comments (in markdown cells and/or inline) and late submissions will incur penalties.

Once done, drag and drop your Python file into your ADS1002-name GitHub account.

Copy the URL of this file on GitHub to the appropriate folder on Moodle by 09:30 am before your next studio.

Solutions will be released later in the semester.

Max 10 marks - 2.5 marks per exercise.

***
We will use 

* [who-health-data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/who-health-data.csv)

* [kaggle-wisconsin-cancer.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/kaggle-wisconsin-cancer.csv)

throughout the exercises. Download the datasets into the same directory as your post-studio notebook.

In [14]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, precision_score
from sklearn.linear_model import LinearRegression

In [15]:
who_data_2015 = (
    pd.read_csv("who-health-data.csv") # Read in the csv data.
    .rename(columns=lambda c: c.strip())      # Clean up column names.
    .query("Year == 2015")                    # Restrict the dataset to records from 2015.
    # Removes two columns which contain a lot of missing data...
    .drop(columns=["Alcohol", "Total expenditure"])
    # ... then drop any rows with missing values.
    .dropna()
)

wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign)
    .assign(diagnosis=lambda df: df['diagnosis']  
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)

### Exercise 1

Given the dataframe `ex1_who_with_predictions` below, compute the Mean Absolute Error for the predicted values of life expectancy. You can repeat the process previously shown, or find a function in `sklearn.metrics` to compute this for you.

In [16]:
"""
Scaffold
"""
ex1_who_with_predictions = (
    who_data_2015[["Schooling", "Life expectancy"]]
    .assign(Predicted=lambda df: df["Schooling"] * 2.3 + 43)
    .dropna()
)
ex1_who_with_predictions.head()

Unnamed: 0,Schooling,Life expectancy,Predicted
0,10.1,65.0,66.23
16,14.2,77.8,75.66
32,14.4,75.6,76.12
48,11.4,52.4,69.22
80,17.3,76.3,82.79


In [17]:
# Error = actual value - predicted value
errors = ex1_who_with_predictions["Life expectancy"] - ex1_who_with_predictions["Predicted"]
# MAE is the mean of the absolute errors
mae = errors.abs().mean()

# Print the MAE rounded to 3 decimal places
print(f"The mean absolute error is: {mae:.3f} years")

The mean absolute error is: 3.790 years
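
The manual calculation can be cross-checked against `mean_absolute_error` from `sklearn.metrics`, which the exercise mentions as an alternative. The sketch below is a minimal, self-contained illustration that uses only the five rows shown in the `head()` output above, so its result is not the full-dataset value; the names `head_rows`, `mae_manual` and `mae_sklearn` are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Only the five rows printed by head() above, not the full 2015 dataset.
head_rows = pd.DataFrame({
    "Life expectancy": [65.0, 77.8, 75.6, 52.4, 76.3],
    "Predicted": [66.23, 75.66, 76.12, 69.22, 82.79],
})

# Manual MAE: mean of the absolute differences between actual and predicted.
mae_manual = (head_rows["Life expectancy"] - head_rows["Predicted"]).abs().mean()

# Library MAE: should agree with the manual calculation.
mae_sklearn = mean_absolute_error(
    y_true=head_rows["Life expectancy"], y_pred=head_rows["Predicted"]
)

print(f"manual: {mae_manual:.3f}, sklearn: {mae_sklearn:.3f}")
```

Both approaches compute the same quantity, so either is acceptable for the exercise.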


### Exercise 2

Given the classification predictions and actual results in the dataframe `ex2_biopsies_with_predictions` below, compute accuracy, precision and recall. Also find the number of false negatives.

In [18]:
"""
Scaffold
"""
ex2_biopsies_with_predictions = (
    wisconsin_cancer_biopsies
    .assign(prediction=lambda df: df['texture_mean'].lt(20)
        .map({True: "benign", False: "malignant"})
    )
    [['radius_mean', 'texture_mean', 'diagnosis', 'prediction']]
)
ex2_biopsies_with_predictions.head()

Unnamed: 0,radius_mean,texture_mean,diagnosis,prediction
0,17.99,10.38,malignant,benign
1,20.57,17.77,malignant,benign
2,19.69,21.25,malignant,malignant
3,11.42,20.38,malignant,malignant
4,20.29,14.34,malignant,benign


In [19]:
confusion_matrix = ex2_biopsies_with_predictions.groupby(['diagnosis', 'prediction']).size().unstack()

# see what is in the confusion matrix
confusion_matrix

diagnosis \ prediction,benign,malignant
benign,274,83
malignant,70,142


In [20]:
# extract confusion matrix components
TP = confusion_matrix.loc["malignant", "malignant"] # both prediction and diagnosis are malignant
TN = confusion_matrix.loc["benign", "benign"] # both prediction and diagnosis are benign
FP = confusion_matrix.loc["benign", "malignant"] # prediction is malignant but diagnosis is benign
FN = confusion_matrix.loc["malignant", "benign"] # prediction is benign but diagnosis is malignant

In [21]:
# Metric formulas from the W3 material
TOTAL = TP + TN + FP + FN
accuracy = (TP + TN) / TOTAL
precision = TP / (TP + FP)
recall = TP / (TP + FN)

# Output
print(f"Accuracy = {(TP + TN) / TOTAL = :.3f}")
print(f"Precision = {TP / (TP + FP) = :.3f}")
print(f"Recall = {TP / (TP + FN) = :.3f}")
print(f"Number of False Negatives: {FN}")

Accuracy = (TP + TN) / TOTAL = 0.731
Precision = TP / (TP + FP) = 0.631
Recall = TP / (TP + FN) = 0.670
Number of False Negatives: 70
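
As a sanity check, the same metrics can be obtained from `sklearn.metrics`. The sketch below is self-contained: rather than loading the CSV, it rebuilds label arrays from the four counts in the confusion matrix above, treating "malignant" as the positive class. The variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts taken from the confusion matrix above ("malignant" is the positive class).
tp, tn, fp, fn = 142, 274, 83, 70

# Rebuild label arrays that reproduce those counts.
actual = np.array(["malignant"] * (tp + fn) + ["benign"] * (tn + fp))
predicted = np.array(
    ["malignant"] * tp + ["benign"] * fn      # actual malignant cases
    + ["benign"] * tn + ["malignant"] * fp    # actual benign cases
)

accuracy_check = accuracy_score(actual, predicted)
precision_check = precision_score(actual, predicted, pos_label="malignant")
recall_check = recall_score(actual, predicted, pos_label="malignant")

# False negatives: malignant cases that were predicted as benign.
false_negatives = int(((actual == "malignant") & (predicted == "benign")).sum())

print(f"{accuracy_check:.3f} {precision_check:.3f} {recall_check:.3f} {false_negatives}")
```

Passing `pos_label="malignant"` matters here because the labels are strings, so scikit-learn cannot infer which class counts as positive.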


### Exercise 3

Consider three different predictors for the cancer biopsy screening dataset:

* Predictor A has an accuracy of 0.95, and recall of 0.99
* Predictor B has an accuracy of 0.99, and recall of 0.95
* Predictor C has an accuracy of 0.5, and a recall of 1.0

The test required to collect data from a new patient (on which the predictor will give a predicted diagnosis) is minimally invasive. If the predictor predicts a positive (malignant) diagnosis, the patient will be referred for further screening which can be expensive.

Considering the context, which predictive model (A, B, or C) would likely be preferred for this task? Write your answer in a markdown cell below, and give a brief explanation of your reasoning.

**Answer:**

When fitting a model, we aim for a predictor that minimises error.

Accuracy measures how often the model's predictions are correct overall (TP and TN out of all cases), while recall measures the proportion of actual malignant cases that are correctly identified, TP / (TP + FN); recall does not take false positives into account.

Therefore, here, we will favour the predictor with the highest accuracy.

For this task, **Model B** would be preferred over A and C as it has the highest accuracy, indicating that it correctly classifies the largest share of cases (TP and TN). Since at most 1% of its predictions are wrong, it also sends the fewest patients to unnecessary, expensive further screening, at the cost of missing slightly more malignant cases than A (recall of 0.95 versus 0.99).
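
To make the trade-off between the three predictors concrete, the accuracy and recall figures can be converted into counts of missed cases and unnecessary referrals. The sketch below assumes a hypothetical cohort of 1000 patients of whom 100 are malignant; these numbers, and the helper `implied_errors`, are illustrative assumptions and do not come from the exercise.

```python
def implied_errors(accuracy, recall, total=1000, malignant=100):
    """Return (false_negatives, false_positives) implied by accuracy and recall.

    `total` and `malignant` describe a hypothetical cohort (an assumption).
    """
    # Recall = TP / (TP + FN), so the missed share of malignant cases is 1 - recall.
    false_negatives = round((1 - recall) * malignant)
    # Accuracy = correct / total, so the number of wrong predictions is (1 - accuracy) * total.
    total_errors = round((1 - accuracy) * total)
    # Every error is either a false negative or a false positive.
    false_positives = total_errors - false_negatives
    return false_negatives, false_positives

# (accuracy, recall) pairs as stated in the exercise.
predictors = {"A": (0.95, 0.99), "B": (0.99, 0.95), "C": (0.5, 1.0)}
implied = {name: implied_errors(acc, rec) for name, (acc, rec) in predictors.items()}

for name, (fn, fp) in implied.items():
    print(f"Predictor {name}: {fn} missed malignant cases, {fp} unnecessary referrals")
```

Changing the assumed prevalence changes the counts, but the pattern stays the same: higher recall means fewer missed malignant cases, and higher accuracy means fewer errors overall.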


### Exercise 4

Choose one different input/feature variable (other than Schooling) and fit a linear regression model to predict Life Expectancy using sklearn. Can you achieve a better error rate than what we found in pre-studio notebook? (RMSE and MAE for Schooling were 4.71 and 3.69, respectively.) Suggest a method to narrow down your choices of variables to use in order to arrive at a good model. 

Hint 1: Correlation.

Hint 2: You can use the functions written in the pre-studio notebook, e.g. prediction_root_mean_squared_error(gradient, intercept), to calculate the model error once you choose your model parameters (features).

In [22]:
# look at the head of the data to see features available
who_data_2015.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,71.279624,65.0,1154,19.1,83,6.0,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
16,Albania,2015,Developing,77.8,74.0,0,364.975229,99.0,0,58.0,0,99.0,99.0,0.1,3954.22783,28873.0,1.2,1.3,0.762,14.2
32,Algeria,2015,Developing,75.6,19.0,21,0.0,95.0,63,59.5,24,95.0,95.0,0.1,4132.76292,39871528.0,6.0,5.8,0.743,14.4
48,Angola,2015,Developing,52.4,335.0,66,0.0,64.0,118,23.3,98,7.0,64.0,1.9,3695.793748,2785935.0,8.3,8.2,0.531,11.4
80,Argentina,2015,Developing,76.3,116.0,8,0.0,94.0,0,62.8,9,93.0,94.0,0.1,13467.1236,43417765.0,1.0,0.9,0.826,17.3


In [23]:
# Here, it's decided to use the following feature as the predictor
feature = "Income composition of resources"

In [24]:
def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters."""
    predictions = who_data_2015[feature] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    # Take the square root explicitly; the `squared` argument is deprecated in newer scikit-learn releases.
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error"""
    predictions = who_data_2015[feature] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)

In [25]:
# Fit a linear regression model (ordinary least squares) on the chosen feature
model = LinearRegression(fit_intercept=True)
data = who_data_2015[[feature, "Life expectancy"]].dropna()
model.fit(X=data[[feature]], y=data["Life expectancy"])
optimal_gradient = model.coef_[0]
optimal_intercept = model.intercept_

In [26]:
# Display the fitted model and its errors
print("Model is y = {:.2f}x + {:.2f}".format(optimal_gradient, optimal_intercept))
print("RMSE = {:.2f}".format(prediction_root_mean_squared_error(optimal_gradient, optimal_intercept)))
print("MAE = {:.2f}".format(prediction_mean_absolute_error(optimal_gradient, optimal_intercept)))

# With the feature "Income composition of resources", a better error rate is achieved:
# RMSE = 3.50 and MAE = 2.74, versus 4.71 and 3.69 for Schooling.
# OLS is used here for its simplicity and efficiency on a dataset of this size (2015 records only).
# Besides, for linear regression OLS gives us the optimal (least-squares) model directly.

Model is y = 47.50x + 38.69
RMSE = 3.50
MAE = 2.74
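
Following Hint 1, one way to narrow down the candidate features is to rank every numeric column by the absolute value of its correlation with "Life expectancy" and try the strongest ones first. The sketch below is a minimal illustration on synthetic data: the frame `toy` and its values are hypothetical stand-ins for `who_data_2015`, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for who_data_2015 (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
income = rng.uniform(0.3, 0.9, n)
toy = pd.DataFrame({
    "Country": [f"country_{i}" for i in range(n)],        # non-numeric, ignored below
    "Income composition of resources": income,
    "Measles": rng.integers(0, 1000, n).astype(float),    # unrelated noise
    "Life expectancy": 40 + 45 * income + rng.normal(0, 3, n),
})

# Absolute correlation of every numeric feature with the target, strongest first.
ranking = (
    toy.select_dtypes("number")
    .corr()["Life expectancy"]
    .drop("Life expectancy")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

On the real data, the same chain applied to `who_data_2015` would give a shortlist of features to fit and compare by RMSE and MAE. Correlation only captures linear association, so it is a screening step rather than a guarantee of the best model.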
