## W3&W4 post studio exercises (errors, model fitting)

Enter your solution in the cell(s) below each exercise. Add a couple of inline comments explaining your code. Don't forget to add comments in a markdown cell after each exercise. Missing comments (in markdown cells and/or inline) and late submissions will incur penalties.

Once done, drag and drop your Python file into your ADS1002-name GitHub account.

Copy the URL of this file on GitHub to the appropriate folder on Moodle by 09.30am prior to your next studio.

Solutions will be released later in the semester.

Max 10 marks - 2.5 marks per exercise.

***
We will use 

* [who-health-data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/who-health-data.csv)

* [kaggle-wisconsin-cancer.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/kaggle-wisconsin-cancer.csv)

throughout the exercises. Download the datasets into the same directory as your post-studio notebook.

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [3]:
who_data_2015 = (
    pd.read_csv("who-health-data.csv") # Read in the csv data.
    .rename(columns=lambda c: c.strip())      # Clean up column names.
    .query("Year == 2015")                    # Restrict the dataset to records from 2015.
    # Removes two columns which contain a lot of missing data...
    .drop(columns=["Alcohol", "Total expenditure"])
    # ... then drop any rows with missing values.
    .dropna()
)

wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign)
    .assign(diagnosis=lambda df: df['diagnosis']  
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)

### Exercise 1

Given the dataframe `ex1_who_with_predictions` below, compute the Mean Absolute Error for the predicted values of life expectancy. You can repeat the process previously shown, or find a function in `sklearn.metrics` to compute this for you.

In [4]:
ex1_who_with_predictions = (
    who_data_2015[["Schooling", "Life expectancy"]]
    .assign(Predicted=lambda df: df["Schooling"] * 2.3 + 43)
    .dropna()
)
ex1_who_with_predictions.head()

Unnamed: 0,Schooling,Life expectancy,Predicted
0,10.1,65.0,66.23
16,14.2,77.8,75.66
32,14.4,75.6,76.12
48,11.4,52.4,69.22
80,17.3,76.3,82.79


In [5]:
def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)

# Set up the design matrix X (Schooling plus a constant column) and target y
X = who_data_2015[['Schooling']].assign(constant=1).to_numpy()
y = who_data_2015["Life expectancy"].to_numpy()

# Solve the normal equations (X^T X) beta = X^T y for the model parameters
beta = np.linalg.inv(X.T @ X) @ X.T @ y
optimal_gradient, optimal_intercept = beta

# Display prediction error
print("MAE = {:.2f}".format(prediction_mean_absolute_error(optimal_gradient, optimal_intercept)))

MAE = 3.69


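The Mean Absolute Error of the least-squares fit on Schooling is 3.69 years, so predictions are off by about 3.7 years on average, which suggests a reasonably accurate model. Note that this is the error of the fitted line, not of the `Predicted` column in `ex1_who_with_predictions`.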
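
The exercise asks for the MAE of the `Predicted` column itself. A minimal sketch of that calculation follows, using only the five rows displayed by `head()` above, so the value it prints is not the MAE for the full dataset. The array names are illustrative.

```python
import numpy as np

# The five rows shown in the head() output above (not the full dataset).
life_expectancy = np.array([65.0, 77.8, 75.6, 52.4, 76.3])
predicted = np.array([66.23, 75.66, 76.12, 69.22, 82.79])

# MAE is the mean of the absolute differences between actual and predicted values.
mae_five_rows = np.mean(np.abs(life_expectancy - predicted))
print("MAE (five displayed rows only) = {:.2f}".format(mae_five_rows))
```

On the full dataframe, the equivalent call would be `mean_absolute_error(ex1_who_with_predictions["Life expectancy"], ex1_who_with_predictions["Predicted"])`.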

### Exercise 2

Given the classification predictions and actual results in the dataframe `ex2_biopsies_with_predictions` below, compute accuracy, precision and recall. Also find the number of false negatives.

In [6]:
ex2_biopsies_with_predictions = (
    wisconsin_cancer_biopsies
    .assign(prediction=lambda df: df['texture_mean'].lt(20)
        .map({True: "benign", False: "malignant"})
    )
    [['radius_mean', 'texture_mean', 'diagnosis', 'prediction']]
)
ex2_biopsies_with_predictions.head()

Unnamed: 0,radius_mean,texture_mean,diagnosis,prediction
0,17.99,10.38,malignant,benign
1,20.57,17.77,malignant,benign
2,19.69,21.25,malignant,malignant
3,11.42,20.38,malignant,malignant
4,20.29,14.34,malignant,benign


In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Define the columns for actual and predicted
actual_values = ex2_biopsies_with_predictions['diagnosis']
predicted_values = ex2_biopsies_with_predictions['prediction']

# Use the sklearn commands for accuracy, precision and recall
accuracy = accuracy_score(actual_values, predicted_values)
precision = precision_score(actual_values, predicted_values, pos_label='malignant')
recall = recall_score(actual_values, predicted_values, pos_label='malignant')

# Create the confusion matrix to calculate the number of false negatives
conf_matrix = confusion_matrix(actual_values, predicted_values, labels=['benign', 'malignant'])
false_negatives = conf_matrix[1, 0]

# Print all the found values
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'False Negatives: {false_negatives}')

Accuracy: 0.7311072056239016
Precision: 0.6311111111111111
Recall: 0.6698113207547169
False Negatives: 70


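The values for accuracy (0.73), precision (0.63) and recall (0.67) are all moderate. The 70 false negatives are malignant biopsies that the model predicted as benign.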
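
As a cross-check, the printed metrics can be reproduced from the four confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are back-calculated from the output above (70 false negatives plus the reported accuracy, precision and recall), so treat them as an inferred illustration rather than values read from `conf_matrix`.

```python
# Confusion-matrix counts with "malignant" as the positive class.
# Assumption: inferred from the printed metrics, not read from conf_matrix.
true_positives = 142
false_negatives = 70
false_positives = 83
true_negatives = 274

total = true_positives + false_negatives + false_positives + true_negatives

# Accuracy: share of all predictions that are correct.
accuracy_check = (true_positives + true_negatives) / total
# Precision: share of predicted-malignant cases that really are malignant.
precision_check = true_positives / (true_positives + false_positives)
# Recall: share of truly malignant cases that the model catches.
recall_check = true_positives / (true_positives + false_negatives)

print(f"Accuracy:  {accuracy_check:.4f}")
print(f"Precision: {precision_check:.4f}")
print(f"Recall:    {recall_check:.4f}")
```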

### Exercise 3

Consider three different predictors for the cancer biopsy screening dataset:

* Predictor A has an accuracy of 0.95, and recall of 0.99
* Predictor B has an accuracy of 0.99, and recall of 0.95
* Predictor C has an accuracy of 0.5, and a recall of 1.0

The test required to collect data from a new patient (on which the predictor will give a predicted diagnosis) is minimally invasive. If the predictor predicts a positive (malignant) diagnosis, the patient will be referred for further screening which can be expensive.

Considering the context, which predictive model (A, B, or C) would likely be preferred for this task? Write your answer in a markdown cell below, and give a brief explanation of your reasoning.

The test needs to pick up as many malignant cases as possible while limiting the number of false positives. The higher the recall, the fewer false negatives there are. Predictor C has the highest recall (1.0), so it misses no malignant cases, but its accuracy of 0.5 means half of all patients are misclassified; with perfect recall, every one of those errors is a false positive, sending a large share of healthy patients for expensive further screening. Predictor B is chosen here because it combines the highest accuracy (0.99) with a high recall (0.95), keeping unnecessary referrals low. The trade-off is that Predictor A's recall of 0.99 would miss fewer malignant cases, which matters because a missed malignancy is life-threatening while a false positive only costs further screening.
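
To make the trade-off concrete, the sketch below converts each predictor's accuracy and recall into counts of missed malignant cases and unnecessary referrals. The cohort size and the number of malignant cases are hypothetical values chosen for illustration; they do not come from the biopsy dataset, and the balance shifts with the assumed prevalence.

```python
def error_breakdown(accuracy, recall, n_patients, n_malignant):
    """Split a predictor's errors into (false negatives, false positives)."""
    # Malignant cases the predictor misses.
    false_negatives = round(n_malignant * (1 - recall))
    # All misclassified patients, malignant or benign.
    total_errors = round(n_patients * (1 - accuracy))
    # Whatever is not a false negative must be a false positive.
    false_positives = total_errors - false_negatives
    return false_negatives, false_positives

# Hypothetical cohort (assumption for illustration only).
N_PATIENTS = 10_000
N_MALIGNANT = 1_000

# (accuracy, recall) pairs as stated in the exercise.
predictors = {"A": (0.95, 0.99), "B": (0.99, 0.95), "C": (0.5, 1.0)}

breakdown = {
    name: error_breakdown(accuracy, recall, N_PATIENTS, N_MALIGNANT)
    for name, (accuracy, recall) in predictors.items()
}

for name, (missed, referred) in breakdown.items():
    print(f"Predictor {name}: {missed} missed malignant, {referred} unnecessary referrals")
```

Under these assumed numbers, A misses 10 malignant cases with 490 unnecessary referrals, B misses 50 with 50, and C misses none with 5000.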

### Exercise 4

Choose one different input/feature variable (other than Schooling) and fit a linear regression model to predict Life Expectancy using sklearn. Can you achieve a better error rate than what we found in the pre-studio notebook? (RMSE and MAE for Schooling were 4.71 and 3.69, respectively.) Suggest a method to narrow down your choices of variables to use in order to arrive at a good model.

Hint 1: Correlation.

Hint 2: You can use the functions written in the pre-studio notebook, e.g. prediction_root_mean_squared_error(gradient, intercept), to calculate the model error once you choose your model parameters (features).

In [13]:
# Dropping the columns with categorical data
who_data_2015_filtered = who_data_2015.drop(["Country", "Year", "Status" ], axis=1)
who_data_2015_filtered.head()

Unnamed: 0,Life expectancy,Adult Mortality,infant deaths,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,65.0,263.0,62,71.279624,65.0,1154,19.1,83,6.0,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
16,77.8,74.0,0,364.975229,99.0,0,58.0,0,99.0,99.0,0.1,3954.22783,28873.0,1.2,1.3,0.762,14.2
32,75.6,19.0,21,0.0,95.0,63,59.5,24,95.0,95.0,0.1,4132.76292,39871528.0,6.0,5.8,0.743,14.4
48,52.4,335.0,66,0.0,64.0,118,23.3,98,7.0,64.0,1.9,3695.793748,2785935.0,8.3,8.2,0.531,11.4
80,76.3,116.0,8,0.0,94.0,0,62.8,9,93.0,94.0,0.1,13467.1236,43417765.0,1.0,0.9,0.826,17.3


In [9]:
# Create a correlation chart to analyse which variables have the strongest correlation
who_data_2015_filtered.corr()

Unnamed: 0,Life expectancy,Adult Mortality,infant deaths,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
Life expectancy,1.0,-0.731215,-0.209304,0.064494,0.372109,-0.049305,0.544987,-0.241013,0.493438,0.466223,-0.620511,0.487018,-0.027594,-0.459153,-0.454897,0.898059,0.806074
Adult Mortality,-0.731215,1.0,0.15484,-0.056151,-0.134546,0.027975,-0.351283,0.178261,-0.300425,-0.228731,0.632276,-0.31454,0.03284,0.254959,0.25909,-0.587776,-0.466075
infant deaths,-0.209304,0.15484,1.0,-0.018945,-0.075556,0.824389,-0.208357,0.993963,-0.120151,-0.106575,0.070195,-0.115139,0.269533,0.557387,0.555316,-0.197383,-0.215488
percentage expenditure,0.064494,-0.056151,-0.018945,1.0,0.053143,-0.018023,0.054433,-0.019436,0.010948,0.047256,-0.046816,-0.026668,-0.020973,-0.020541,-0.020098,0.028133,0.029464
Hepatitis B,0.372109,-0.134546,-0.075556,0.053143,1.0,0.034322,0.147275,-0.093338,0.503902,0.895829,-0.342678,0.0884,-0.045324,-0.038189,-0.086664,0.279625,0.304968
Measles,-0.049305,0.027975,0.824389,-0.018023,0.034322,1.0,-0.125854,0.78733,-0.013857,0.019518,-0.040197,-0.068698,0.125615,0.376052,0.367871,-0.057675,-0.062165
BMI,0.544987,-0.351283,-0.208357,0.054433,0.147275,-0.125854,1.0,-0.218591,0.198913,0.167397,-0.265041,0.387052,0.005963,-0.487245,-0.505187,0.622817,0.612644
under-five deaths,-0.241013,0.178261,0.993963,-0.019436,-0.093338,0.78733,-0.218591,1.0,-0.138411,-0.126753,0.097053,-0.120094,0.308769,0.547162,0.543834,-0.220828,-0.236128
Polio,0.493438,-0.300425,-0.120151,0.010948,0.503902,-0.013857,0.198913,-0.138411,1.0,0.577022,-0.375415,0.218366,-0.23327,-0.175525,-0.176887,0.442925,0.38731
Diphtheria,0.466223,-0.228731,-0.106575,0.047256,0.895829,0.019518,0.167397,-0.126753,0.577022,1.0,-0.406572,0.20014,-0.053164,-0.080184,-0.131753,0.397359,0.388661


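Life expectancy and income composition of resources have the highest correlation (r = 0.898), whereas life expectancy and schooling have r = 0.806. We can therefore expect a lower RMSE and MAE using income composition of resources as the feature. Ranking features by the strength (absolute value) of their correlation with life expectancy is a simple way to narrow down the choice of variables.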
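
Reading the strongest correlation off a wide table is error-prone, so the ranking can also be done in code. This is a minimal sketch: the helper `rank_features_by_correlation` and the synthetic columns are hypothetical names introduced for illustration and are not part of the studio notebooks.

```python
import numpy as np
import pandas as pd

def rank_features_by_correlation(df, target):
    """Return features sorted by absolute Pearson correlation with `target`."""
    correlations = df.corr()[target].drop(target)
    # Use the absolute value so strong negative correlations rank highly too.
    return correlations.abs().sort_values(ascending=False)

# Synthetic stand-in data (hypothetical columns, not the WHO dataset).
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
demo = pd.DataFrame({
    "strong_feature": strong,
    "weak_feature": weak,
    "target": 3 * strong + 0.1 * weak + rng.normal(scale=0.5, size=200),
})

ranking = rank_features_by_correlation(demo, "target")
print(ranking)
```

With the notebook's data, the call would be `rank_features_by_correlation(who_data_2015_filtered, "Life expectancy")`.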

In [12]:
from sklearn.linear_model import LinearRegression

# Set up the model type
model = LinearRegression(fit_intercept=True)

# drop any rows with N/A values
data = who_data_2015_filtered[["Income composition of resources", "Life expectancy"]].dropna()

data.head()

# Fit the model.
model.fit(X=data[["Income composition of resources"]], y=data["Life expectancy"])

# Extract parameters from the model. model.coef_ gives a coefficient for each
# column of X. We are only using one input column, so the [0] element is our
# gradient parameter.
optimal_gradient = model.coef_[0]
optimal_intercept = model.intercept_


# Define functions for RMSE and MAE 
def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Income composition of resources"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
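    # RMSE is the square root of MSE. Taking the root explicitly avoids the
    # `squared=False` argument, which is deprecated in recent scikit-learn releases.
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))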

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Income composition of resources"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)


# Display the fitted model and prediction error measures.
print("Model is y = {:.2f}x + {:.2f}".format(optimal_gradient, optimal_intercept))
print("RMSE = {:.2f}".format(prediction_root_mean_squared_error(optimal_gradient, optimal_intercept)))
print("MAE = {:.2f}".format(prediction_mean_absolute_error(optimal_gradient, optimal_intercept)))

Model is y = 47.50x + 38.69
RMSE = 3.50
MAE = 2.74


The RMSE is 3.50 and the MAE is 2.74, both lower than the values obtained with Schooling as the feature (4.71 and 3.69, respectively).

## Extra exercises

The following exercises with (*) will not be assessed. Use these to check your understanding of topics covered in the past 2 weeks.

### Exercise 5*

The function `model_correct_predictions` below returns the number of correct predictions made by a predictive model for the cancer biopsy dataset, for a given parameter value. This parameter value simply controls the threshold value for radius above which a sample is predicted as malignant.

Try different values of the parameter in this model within the range [0, 30]. Record and plot the resulting accuracy values against the parameter value (similar to the regression cost function example above).

What value of the parameter provides the best error rate? Explain how you can be confident you have found the best result here.

In [None]:
def model_correct_predictions(radius_split_parameter):
    """ Return the number of correct predictions made by the model
    for the given parameter value. """
    data = wisconsin_cancer_biopsies.assign(
        predicted=lambda df: df['radius_mean'].lt(radius_split_parameter)
            .map({True: "benign", False: "malignant"})
    )
    return (data['diagnosis'] == data['predicted']).sum()

model_correct_predictions(12)
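
One way to approach the sweep is sketched below on synthetic data, so that it runs without the CSV file. The radius distributions, class sizes and the helper `accuracy_at_threshold` are assumptions made for illustration; the prediction rule mirrors `model_correct_predictions` (radius below the threshold is predicted benign, otherwise malignant).

```python
import numpy as np

# Synthetic stand-in for radius_mean and diagnosis (assumed distributions).
rng = np.random.default_rng(42)
radius = np.concatenate([rng.normal(12, 2, 300), rng.normal(17, 3, 200)])
is_malignant = np.concatenate([np.zeros(300, dtype=bool), np.ones(200, dtype=bool)])

def accuracy_at_threshold(threshold):
    """Fraction of correct predictions when radius >= threshold means malignant."""
    predicted_malignant = radius >= threshold
    return np.mean(predicted_malignant == is_malignant)

# Evaluate the accuracy on a fine grid over the range [0, 30].
thresholds = np.linspace(0, 30, 301)
accuracies = np.array([accuracy_at_threshold(t) for t in thresholds])

best_index = np.argmax(accuracies)
print("Best threshold = {:.1f}, accuracy = {:.3f}".format(
    thresholds[best_index], accuracies[best_index]))
```

With the real data, `accuracy_at_threshold(t)` would be replaced by `model_correct_predictions(t) / len(wisconsin_cancer_biopsies)`, and `plt.plot(thresholds, accuracies)` gives the requested plot. Because the accuracy only changes when the threshold crosses an observed radius value, a sufficiently fine grid over the whole range is enough to find the best result.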

### Exercise 6*

In the examples in the pre-studio notebook (W4) we used root mean squared error (the standard cost function for linear regression) to fit the model parameters. Try re-running the `scipy.optimize` method using mean absolute error. Are the resulting model parameters the same as above? Give some brief reasoning why there might be a difference here.

In [None]:
# Hint: you only need to make one small change in the prediction_error function to do this.

In [None]:
def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    # Taking the square root of MSE gives us RMSE. Then we're in the same units as MAE.
    # (The `squared=False` argument is deprecated in recent scikit-learn releases.)
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)
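
The comparison could look roughly like the sketch below. It uses synthetic data with a few deliberate outliers rather than the WHO data, and it is not the pre-studio notebook's code: the `rmse` and `mae` cost functions, the starting point and the choice of the Nelder-Mead method (which does not need derivatives, useful because MAE is not smooth) are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: a linear trend plus noise, with a few low outliers (assumed).
rng = np.random.default_rng(1)
x = rng.uniform(5, 20, size=150)
y = 2.0 * x + 45.0 + rng.normal(scale=3.0, size=150)
y[:8] -= 25.0

def rmse(params):
    """Root mean squared error of the line y = gradient * x + intercept."""
    gradient, intercept = params
    return np.sqrt(np.mean((y - (gradient * x + intercept)) ** 2))

def mae(params):
    """Mean absolute error of the line y = gradient * x + intercept."""
    gradient, intercept = params
    return np.mean(np.abs(y - (gradient * x + intercept)))

options = {"xatol": 1e-6, "fatol": 1e-8, "maxiter": 5000}

# Fit by minimising RMSE, then minimise MAE starting from the RMSE solution.
rmse_fit = minimize(rmse, x0=np.array([1.0, 50.0]), method="Nelder-Mead", options=options)
mae_fit = minimize(mae, x0=rmse_fit.x, method="Nelder-Mead", options=options)

print("RMSE fit: gradient {:.3f}, intercept {:.3f}".format(*rmse_fit.x))
print("MAE fit:  gradient {:.3f}, intercept {:.3f}".format(*mae_fit.x))
```

Squaring the residuals makes RMSE weight large errors more heavily, so outliers pull the RMSE fit more than the MAE fit; this is one reason the two sets of parameters can differ.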

### Exercise 7*

We can see above that different methods for determining model parameters arrive at the same result, but what happens if we change the dataset slightly? Experiment by taking several (at least 10) different samples of the data, fitting a linear model for each one, and plotting a histogram of the different gradient and intercept coefficients you find. Is there a significant amount of variation in the parameter values?

In [None]:
sample_data = who_data_2015.sample(30)  # selects a small sample of 30 random rows from the data.
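
A possible shape for this experiment is sketched below on synthetic data. The data-generating line, the number of repeats and the use of `np.polyfit` for the fit are assumptions made so the example runs on its own; with the notebook's data, each iteration would instead call `who_data_2015.sample(30)` and fit a model to that sample.

```python
import numpy as np

# Synthetic stand-in for the Schooling / Life expectancy columns (assumed).
rng = np.random.default_rng(7)
x = rng.uniform(5, 20, size=180)
y = 2.0 * x + 45.0 + rng.normal(scale=4.0, size=180)

n_repeats = 50
sample_size = 30
gradients = []
intercepts = []

for _ in range(n_repeats):
    # Draw 30 distinct rows at random, as DataFrame.sample(30) would.
    rows = rng.choice(len(x), size=sample_size, replace=False)
    # Least-squares straight line through the sampled points.
    gradient, intercept = np.polyfit(x[rows], y[rows], deg=1)
    gradients.append(gradient)
    intercepts.append(intercept)

print("gradient:  mean {:.2f}, std {:.2f}".format(np.mean(gradients), np.std(gradients)))
print("intercept: mean {:.2f}, std {:.2f}".format(np.mean(intercepts), np.std(intercepts)))
```

Passing `gradients` and `intercepts` to `plt.hist` gives the requested histograms; the standard deviations give a rough numerical measure of how much the parameters vary between samples.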