# INFO204 Practical Test 1 (Practice) - Model Answer

## PLEASE NOTE: THE PURPOSE OF THIS PRACTICE TEST IS TO PROVIDE A _ROUGH_ IDEA OF HOW THE TEST WILL BE PRESENTED. WHILE THERE WILL BE OVERLAP IN CONCEPTS, THE ORDER AND MANNER IN WHICH QUESTIONS ARE ASKED MAY DIFFER. ALSO, THE ACTUAL TEST WILL CONTAIN A SMALL NUMBER (2-3) OF EXAM-LIKE SHORT ANSWER QUESTIONS THAT ARE BASED ON LECTURE CONTENT. DETAILS ON HOW TO PREPARE FOR THESE QUESTIONS WILL BE PROVIDED ON BLACKBOARD.

Please enter your details below:<br />
***Student Name:***

***Student ID:***

## Guidelines:
- Attempt **all** tasks as best you can.
- Type in your solution code in the cell *right under* each task and run it. Keep the output. 
- If stuck on one task, don't waste your time - there are other tasks you can attempt.
- All work (bar one exercise) has been discussed in labs - please use the previous labs as inspiration to complete this test.

## Precursors

### <span style="color: #ce2227;">Please run the first cell to import relevant libraries and to define the `cleanup_cv_results` function that will be used later in the test. This will also declare a repeated 10-fold cross validation generator called `rkf` that will be used throughout the test. (The mean squared error scoring function `mse_score` is declared in Q13.)</span>

In [8]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor

from matplotlib import pyplot as plt

def cleanup_cv_results(cv, model_name='model'):
    import re
    
    cv_results = pd.DataFrame(cv.cv_results_)

    ## there's a few columns returned by GridSearchCV.fit() that we don't need, so let's get rid of them to make things clearer
    unwanted_columns = ['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'mean_test_score', 'std_test_score', 'rank_test_score']

    ## remove the "param_"  and "param_model__" prefixes from columns
    r = re.compile(f"param_({model_name}__)*")
    cleaned_names = cv_results.drop(columns=unwanted_columns).rename(columns=lambda x: r.sub('', x))

    ## identify all the columns that are not the per-split cross validation scores
    r = re.compile(r"split.+_test_score")
    header_cols = [ c for c in cleaned_names.columns.values if not r.match(c) ]
    
    ## return the long version of the data
    long_data = cleaned_names.melt(id_vars=header_cols, var_name='split', value_name='score')
    long_data['split'] = long_data['split'].replace('split([0-9]+)_test_score', '\\1', regex=True).astype(int)
    return long_data

rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1234)
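
As a quick check of what a generator like `rkf` produces, the sketch below uses a small made-up array (`X_demo` is an assumption for illustration, not the test data) to show that 10 folds repeated 3 times give 30 train/test splits.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Hypothetical toy data: 20 instances, 2 features (not the test data)
X_demo = np.arange(40).reshape(20, 2)

rkf_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1234)

# Total number of train/test splits is n_splits * n_repeats
n_total_splits = rkf_demo.get_n_splits()

# Size of the held-out fold in every split (20 instances / 10 folds)
test_sizes = [len(test_idx) for _, test_idx in rkf_demo.split(X_demo)]

print(n_total_splits, set(test_sizes))
```

Every cross-validated score reported later is therefore an average over 30 held-out folds rather than 10.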

## Data Manipulation

***Note: <span style="color: #ce2227;">In this test, assume the response variable is named `target` and use this as the response (target) in all subsequent questions</span>***

### Q1: load the CSV file <span style="font-family: monospace">regression_features.csv</span> into a pandas data frame called `features`. Then, load the CSV file <span style="font-family: monospace">regression_target.csv</span> into a pandas data frame called `target`. Join these two data frames together using the common column information - name the resulting data frame `all_data`. Display the data frame `all_data`

In [10]:
features = pd.read_csv('regression_features.csv')
target = pd.read_csv('regression_target.csv')

all_data = target.merge(features, how='inner', on='instance')
display(all_data)
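
To see what the inner join does, here is a minimal sketch on made-up frames (the `_demo` names and values are assumptions, not the contents of the CSV files): each target row is repeated once per matching feature row.

```python
import pandas as pd

# Hypothetical long-format features: one row per (instance, feature) pair
features_demo = pd.DataFrame({
    'instance': [0, 0, 1, 1],
    'feature':  ['a', 'b', 'a', 'b'],
    'value':    [10.0, 20.0, 30.0, 40.0],
})

# Hypothetical targets: one row per instance
target_demo = pd.DataFrame({'instance': [0, 1], 'target': [1.5, 2.5]})

# Inner join on the shared 'instance' column
merged_demo = target_demo.merge(features_demo, how='inner', on='instance')
print(merged_demo)
```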


### Q2: `all_data` is in long format - convert it to wide format, using the `instance` and `target` columns as index. Once the data frame is in wide format, use the [`reset_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) function to restore `instance` and `target` as actual columns. Then, remove the `instance` column from the data frame. Finally, display the data frame

In [None]:
all_data = all_data.pivot(index=[ 'instance', 'target' ], columns='feature', values='value').reset_index()
all_data.drop(columns='instance', inplace=True)
display(all_data)
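
The long-to-wide step can be sketched on a tiny made-up frame (`long_demo` and its values are assumptions for illustration): each distinct `feature` value becomes its own column, with one row per `(instance, target)` pair.

```python
import pandas as pd

# Hypothetical long-format data: one row per (instance, feature) pair
long_demo = pd.DataFrame({
    'instance': [0, 0, 1, 1],
    'target':   [1.5, 1.5, 2.5, 2.5],
    'feature':  ['a', 'b', 'a', 'b'],
    'value':    [10.0, 20.0, 30.0, 40.0],
})

wide_demo = (
    long_demo
    .pivot(index=['instance', 'target'], columns='feature', values='value')
    .reset_index()             # restore 'instance' and 'target' as columns
    .drop(columns='instance')  # the instance id is not a feature
)
print(wide_demo)
```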

### Q3: Save the edited wide-format data frame to a new CSV file called <span style="font-family: monospace">tidy_regression_data.csv</span>

In [None]:
all_data.to_csv('tidy_regression_data.csv', index=False)

## Exploratory Data Analysis

### Q4: Read the CSV file <span style="font-family: monospace">regression_data.csv</span> into a data frame called `dataset`. Then, print out descriptive statistics of the data frame

In [None]:
dataset = pd.read_csv('regression_data.csv')

display(dataset)

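## info() prints its summary directly and returns None, so it is not wrapped in display()
dataset.info()

display(dataset.describe())

## restrict to numeric columns so that corr() is only computed where it is defined
display(dataset.select_dtypes(include='number').corr())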

### Q5: Produce a heatmap of the correlations of the numeric columns in `dataset`, and a pairwise scatter plot of `dataset`

In [None]:
with sns.plotting_context(rc={ 'axes.labelsize' : 20, 'xtick.labelsize' : 10, 'ytick.labelsize' : 10 }):
    fig = plt.figure(figsize=(12, 12))
    sns.heatmap(dataset.select_dtypes(include='number').corr())

    sns.pairplot(dataset, aspect=1, height=3, corner=True);

### Q6: Identify THREE columns in `dataset` for removal and briefly suggest (1-3 sentences total) why these columns can be removed. Then, put the names of the columns into a list called `drop_columns`. Finally, remove these columns from `dataset`

#### _a and c appear to have little to no relationship with the response. While d and e appear to be uncorrelated with the response, they have a clear underlying non-linear relationship with it (so a non-linear learner may be able to exploit them) and are therefore kept. Finally, f has MANY missing values, so rather than remove the instances with missing values, we remove this column._

In [None]:
drop_columns = [ 'a', 'c', 'f' ]
dataset.drop(columns=drop_columns, inplace=True)
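
The two checks behind this answer can be sketched on made-up data (the frame below is an assumption, not `dataset`): a feature can have zero linear correlation with the response while still determining it exactly, and missing values can be counted per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'd' determines 'target' exactly (target = d**2),
# while 'f' is mostly missing
demo = pd.DataFrame({
    'd':      [-2.0, -1.0, 0.0, 1.0, 2.0],
    'target': [4.0, 1.0, 0.0, 1.0, 4.0],
    'f':      [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Pearson correlation only measures linear association
corr_d_target = demo['d'].corr(demo['target'])

# Number of missing values in each column
missing_counts = demo.isna().sum()

print(corr_d_target)
print(missing_counts)
```

This is why the pairwise scatter plot matters alongside the heatmap: the heatmap alone would suggest dropping a feature like `d`.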

### Q7: Extract the features of `dataset` into an array named `X`, and the response into a variable named `t`. Create a list of the feature names, and store the result in a variable named `feature_names`

In [None]:
target = 'target'
X = dataset.drop(columns=target).to_numpy()
t = dataset[target].to_numpy()
feature_names = dataset.drop(columns=target).columns.tolist()

### Q8: define a machine learning pipeline named `mlpipe`. This pipeline needs two steps: one called `'preprocess'` that is set to `'passthrough'` by default, and another step called `'model'` that defaults to a `DummyRegressor`.

In [None]:
mlpipe = Pipeline([
    ('preprocess', 'passthrough'),
    ('model', DummyRegressor())
])

### Q9: define a suitable hyperparameter tuning grid for the pipeline `mlpipe` such that it uses a decision tree to model the data. The tuning grid should explore the following decision tree hyperparameter `min_samples_split` at the values: 2, 4, 8, 16, 32, 64, 128, and 256.

In [None]:
CART = DecisionTreeRegressor(random_state=0)

CART_param_grid = {
    'preprocess' : [ 'passthrough' ],
    'model' : [ CART ],
    'model__min_samples_split' : np.logspace(np.log2(2), np.log2(256), 8, base=2).astype(int)
}
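
The grid keys follow scikit-learn's `step__parameter` naming. As a rough sketch of what the grid search does for each candidate (the `pipe_demo` name is an assumption), `set_params` can replace the `'model'` step and set one of its hyperparameters in a single call:

```python
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

pipe_demo = Pipeline([
    ('preprocess', 'passthrough'),
    ('model', DummyRegressor())
])

# 'model' swaps the whole step; 'model__min_samples_split' reaches inside it
pipe_demo.set_params(
    model=DecisionTreeRegressor(random_state=0),
    model__min_samples_split=8,
)

print(pipe_demo.named_steps['model'])
```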

### Q10: define a suitable hyperparameter tuning grid for the pipeline `mlpipe` such that it uses k-nearest neighbours to model the data. The preprocessing step should explore standardisation (i.e., using `'passthrough'` or a `StandardScaler`). The tuning grid should also explore the k-nearest neighbour hyperparameters:
1. `n_neighbors` at the values: 2, 4, 8, 16, 32, 64, 128, and 256.
2. `weights` at the values 'uniform', 'distance'

Note: we have not discussed the `weights` hyperparameter for kNN in class - <strong>you do not need to know what this hyperparameter does for kNN, you only need to be able to assess its impact on the performance of kNN for the given problem</strong>. The `weights` hyperparameter adjusts the "weight function used in prediction" (in other words, how the neighbouring instances are combined to form the final prediction). Possible values are:
* 'uniform' : uniform weights. All points in each neighbourhood are weighted equally (this is what you've been using all semester).
* 'distance' : weight points by the inverse of their distance. In this case, closer neighbours of a query point will have a greater influence than neighbours which are further away.
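
For the curious, the difference between the two settings can be sketched on made-up 1-D data (all values below are assumptions for illustration). With $k = 2$ and a query at 0.25, the neighbours are at distances 0.25 and 0.75, so their inverse-distance weights are 4 and 4/3.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: three points on a line
X_demo = np.array([[0.0], [1.0], [3.0]])
t_demo = np.array([0.0, 10.0, 30.0])
query = np.array([[0.25]])

# Uniform: plain mean of the two nearest targets, (0 + 10) / 2
uniform_pred = KNeighborsRegressor(n_neighbors=2, weights='uniform').fit(X_demo, t_demo).predict(query)[0]

# Distance: weighted mean, (4 * 0 + (4/3) * 10) / (4 + 4/3)
distance_pred = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X_demo, t_demo).predict(query)[0]

print(uniform_pred, distance_pred)
```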


In [None]:
knn = KNeighborsRegressor()
scaler = StandardScaler()

knn_param_grid = {
    'preprocess' : [ 'passthrough', scaler ],
    'model' : [ knn ],
    'model__n_neighbors' : np.logspace(np.log2(2), np.log2(256), 8, base=2).astype(int),
    'model__weights' : [ 'uniform', 'distance' ]
}
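
As a sanity check on the size of this search, the sketch below counts the candidates for the values listed in Q10 using `ParameterGrid`. It builds the powers of two with integer arithmetic (`2 ** np.arange(1, 9)`), which is an alternative to `np.logspace` and not the form used in the answer above.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Powers of two from 2 to 256, computed with integer arithmetic
k_values = 2 ** np.arange(1, 9)

knn_grid_demo = ParameterGrid({
    'preprocess': ['passthrough', StandardScaler()],
    'model__n_neighbors': k_values,
    'model__weights': ['uniform', 'distance'],
})

# 2 preprocessing options * 8 neighbourhood sizes * 2 weightings
n_knn_candidates = len(knn_grid_demo)
print(k_values.tolist(), n_knn_candidates)
```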

### Q11: Perform cross validation on the tuning grids defined in the previous two questions. Extract the best model identified by cross validation and assign it to a variable named `best_model`. Describe the hyperparameters and score of the best model.

In [None]:
param_grid = [
    CART_param_grid,
    knn_param_grid
]

cv = GridSearchCV(mlpipe, param_grid, cv=rkf)
cv.fit(X, t)

best_model = cv.best_estimator_

print(cv.best_params_)
print(cv.best_score_)
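
A scaled-down sketch of the same search on synthetic data (the data, the reduced grids, and `cv=3` are assumptions chosen to keep it fast) shows how a list of grids is searched as one pool of candidates, and that `best_score_` is the highest mean test score in that pool.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic regression problem (not the test data)
X_demo, t_demo = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

pipe_demo = Pipeline([('preprocess', 'passthrough'), ('model', DummyRegressor())])

grids_demo = [
    {'model': [DecisionTreeRegressor(random_state=0)],
     'model__min_samples_split': [2, 8]},
    {'preprocess': ['passthrough', StandardScaler()],
     'model': [KNeighborsRegressor()],
     'model__n_neighbors': [2, 4]},
]

# No scoring argument, so the regressor's default score (R^2) is used
search_demo = GridSearchCV(pipe_demo, grids_demo, cv=3)
search_demo.fit(X_demo, t_demo)

# 2 tree candidates + 2 * 2 kNN candidates
n_candidates = len(search_demo.cv_results_['params'])
best_is_max = np.isclose(search_demo.best_score_,
                         search_demo.cv_results_['mean_test_score'].max())
print(n_candidates, search_demo.best_params_)
```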

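#### _The best model is the pipeline that standardises the features and then applies kNN with a neighbourhood size of 4 and distance weighting (as reported by `cv.best_params_`). Its score (`cv.best_score_`) is the mean cross-validated $R^2$ over the 30 splits of `rkf`, because `GridSearchCV` uses the estimator's default scorer when no `scoring` argument is given._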

### Q12: Use the provided `cleanup_cv_results` to extract a data frame called `cv_stats`. Then use `cv_stats` to create two line plots: one with the x axis exploring the `min_samples_split` hyperparameter of the decision tree, and another with the x axis exploring the `n_neighbors` hyperparameter of kNN. For the kNN plot, use the `hue` semantic to visualise the `'weights'` hyperparameter, and the `style` semantic to visualise the `'preprocess'` hyperparameter.

In [None]:
cv_stats = cleanup_cv_results(cv)

fig, axs = plt.subplots(ncols=2, figsize=(18, 6))
sns.lineplot(data=cv_stats, x='min_samples_split', y='score', color='black', ax=axs[0]).set(title='CART', xscale='log', xlabel='$minsplit$')
sns.lineplot(data=cv_stats, x='n_neighbors', y='score', hue='weights', style='preprocess', ax=axs[1]).set(title='$k$-NN', xscale='log', xlabel='$k$ (Neighbourhood Size)')
for ax in axs:
    ax.set_ylim(bottom=-0.1, top=1.0)
    ax.set_ylabel('$R^2$')
fig.suptitle('Cross Validation Grid Search Results')
plt.show()
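
The short column names used in these plots (`min_samples_split`, `n_neighbors`, `weights`, `preprocess`) come from the prefix-stripping regular expression inside `cleanup_cv_results`. A small sketch of that renaming on a few representative column names (the list itself is an assumption):

```python
import re

# Same pattern as in cleanup_cv_results with model_name='model'
prefix = re.compile("param_(model__)*")

# Representative GridSearchCV column names
raw_names = [
    'param_preprocess',
    'param_model',
    'param_model__n_neighbors',
    'split0_test_score',
]

# Strip the "param_" and "param_model__" prefixes; other names are untouched
cleaned = [prefix.sub('', name) for name in raw_names]
print(cleaned)
```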

### Q13: declare a new scorer called `mse_score` that uses `mean_squared_error` (in other words, use the scikit-learn function `make_scorer` to make a new scorer based on the `mean_squared_error` loss function). Perform cross validation on linear regression (using `X` and `t` as source data) using `mse_score` as the required scorer and store the result of this in `lm_scores`. Then, perform cross validation on the best model returned in Q11 (using `X` and `t` as source data) using `mse_score` as the required scorer and store the result of this in `best_scores`. 

In [1]:
mse_score = make_scorer(mean_squared_error)

lm_scores = cross_val_score(LinearRegression(), X, t, cv=rkf, scoring=mse_score)
best_scores = cross_val_score(best_model, X, t, cv=rkf, scoring=mse_score)
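
One point worth checking is the sign of the scores. The sketch below, on synthetic data (an assumption, not `X` and `t`), compares a scorer built with `make_scorer(mean_squared_error)` against scikit-learn's built-in `'neg_mean_squared_error'`: the first returns plain MSE values, the second their negation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression problem
X_demo, t_demo = make_regression(n_samples=50, n_features=2, noise=10.0, random_state=0)

mse_demo = make_scorer(mean_squared_error)

# Plain (positive) MSE per fold
plain_mse = cross_val_score(LinearRegression(), X_demo, t_demo, cv=5, scoring=mse_demo)

# Built-in scorer: the same values with the sign flipped
negated_mse = cross_val_score(LinearRegression(), X_demo, t_demo, cv=5,
                              scoring='neg_mean_squared_error')

print(plain_mse)
print(negated_mse)
```

Plain MSE is convenient for reporting, as here. For model selection inside `GridSearchCV`, a scorer made this way would be maximised, so `greater_is_better=False` or the built-in negated scorer would be needed instead.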


### Q14: Create a data frame, called `results`, that contains two columns: the `best_scores` and `lm_scores` arrays from Q13. Convert this data frame from its wide format into long format, naming the "variable" column `'method'` and the "value" column `'MSE'`. Group this long version of the data frame by `'method'` and aggregate the result such that the resulting data frame presents the mean MSE value for each method.

In [None]:
results = pd.DataFrame({
    'Best from Grid Search' : best_scores,
    'Linear Regression'   : lm_scores
}).melt(var_name='method', value_name='MSE')

display(results.groupby('method')[['MSE']].mean().reset_index())
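
The wide-to-long reshape and the grouped mean can be sketched with made-up scores (the method names and numbers below are assumptions, not real results):

```python
import pandas as pd

# Hypothetical per-split scores for two methods
results_demo = pd.DataFrame({
    'Method A': [1.0, 3.0],
    'Method B': [2.0, 6.0],
}).melt(var_name='method', value_name='MSE')

# One row per method, holding the mean of its scores
mean_mse = results_demo.groupby('method')[['MSE']].mean().reset_index()
print(mean_mse)
```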

### Q15: Use the `results` data frame to create a boxplot comparing the cross-validated mean-squared error results obtained in Q13. Briefly discuss the relative performance of the methods, and declare which method you would select as your final model for future predictions.

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
sns.boxplot(data=results, x='method', y='MSE', ax=ax)
ax.set_ylabel('MSE');

#### _Metric is MSE, so we aim to minimise. Linear regression can be considered a reasonable baseline of expected performance, and we can see that the model discovered through cross validation (standardised kNN, with a neighbourhood size of 4 and using distance weighting) is a substantial improvement over this._

**SELECTED MODEL:** The one returned by cross validation

### Q16: load the CSV file <span style="font-family: monospace">new_regression_data.csv</span> into a pandas data frame. Name the data frame `test`. Use `drop_columns` to remove the extraneous columns (as identified in Q6) in `test`. Extract the features in the `test` data frame into an array called `X_test` and the response into a variable called `t_test`. Use the selected model identified in the previous question to obtain predictions from `X_test` and store the result in `y_test`. Finally, compute and print out the mean squared error and $R^2$ of the predictions in `y_test` relative to `t_test`.

In [None]:
test = pd.read_csv('new_regression_data.csv')
test.drop(columns=drop_columns, inplace=True)

X_test = test.drop(columns=target).to_numpy()
t_test = test[target].to_numpy()
## best_model is cv.best_estimator_ from Q11, already refitted on all of X and t
y_test = best_model.predict(X_test)

print(f"MSE: {mean_squared_error(t_test, y_test):.4f}, R^2: {r2_score(t_test, y_test):.4f}")
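
For reference, both metrics can be reproduced by hand. The sketch below uses made-up targets and predictions (assumptions, not test results), with $\text{MSE} = \frac{1}{n}\sum_i (t_i - y_i)^2$ and $R^2 = 1 - \frac{\sum_i (t_i - y_i)^2}{\sum_i (t_i - \bar{t})^2}$.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true responses and predictions
t_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.0, 2.0, 3.0, 6.0])

# Mean squared error: average squared residual
mse_manual = np.mean((t_demo - y_demo) ** 2)

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum((t_demo - y_demo) ** 2)
ss_tot = np.sum((t_demo - t_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(t_demo, y_demo))
print(r2_manual, r2_score(t_demo, y_demo))
```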