# Technical Tasks
The aim of these tasks is to demonstrate a justifiable approach to common ML tasks. Code quality is **not** the main focus.

# 1. Regression: Simple Model Selection
## Goals
#### Write a simple pipeline that will empirically estimate the generalisation error of different strategies so that we can find the one with the lowest error.
- We want to compare:
    - Missing data strategy: mean vs median
    - Model: Random Forest vs Ridge


In [39]:
import numpy as np
import pandas as pd

import sklearn.datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

In [37]:
data, target = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
# Set roughly 10% of entries to NaN (each entry independently, with probability 0.1)
data = data * np.random.choice([1, np.nan], size=data.shape, p=[0.9, 0.1])
data.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | NaN | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | NaN | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | NaN | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
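One possible shape for the comparison asked for in the goals, offered as a minimal sketch rather than a reference solution. It estimates the mean absolute error of each (imputation strategy, model) pair by cross-validation. The fixed seed, the 5-fold split, the `Ridge` alpha and the forest size are all assumptions made for illustration; the task does not specify them.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: a fixed seed so the missing-value mask is reproducible.
rng = np.random.default_rng(0)
data, target = load_diabetes(return_X_y=True, as_frame=True)
data = data * rng.choice([1, np.nan], size=data.shape, p=[0.9, 0.1])

# Assumption: 5-fold shuffled CV as the estimate of generalisation error.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Assumption: illustrative, untuned hyperparameters.
models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for strategy in ("mean", "median"):
    for name, model in models.items():
        # The imputer sits inside the pipeline, so its statistics are
        # computed on the training folds only (no leakage into validation).
        pipeline = make_pipeline(SimpleImputer(strategy=strategy), model)
        scores = cross_val_score(
            pipeline, data, target, cv=cv, scoring="neg_mean_absolute_error"
        )
        results[(strategy, name)] = -scores.mean()

best = min(results, key=results.get)
for key, mae in sorted(results.items(), key=lambda item: item[1]):
    print(key, round(mae, 2))
print("lowest estimated MAE:", best)
```

Differences between strategies that are small relative to the fold-to-fold spread of the scores should not be read as a real ranking.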


# 2. Classifier Evaluation
We have a binary classification task. The ground truth labels are loaded along with predictions from one of our models.

Quantify and comment on the quality and usefulness of these predictions.

In [170]:
predictions = np.load("classifier_predictions.npy")
targets = np.load("classifier_targets.npy")
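A hedged sketch of how such an evaluation might start. The contents of the two `.npy` files are not shown here, so the sketch assumes `targets` holds 0/1 labels and `predictions` holds scores in [0, 1]. It runs on synthetic stand-in arrays, and the helper name `evaluate_scores` and the 0.5 threshold are hypothetical choices, not part of the task.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)


def evaluate_scores(targets, predictions, threshold=0.5):
    """Summarise ranking quality, calibration and thresholded performance.

    Assumes binary 0/1 targets and predicted scores in [0, 1].
    """
    targets = np.asarray(targets).astype(int)
    predictions = np.asarray(predictions, dtype=float)
    hard = (predictions >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(targets, hard, labels=[0, 1]).ravel()
    return {
        # Class balance: the reference point for average precision.
        "prevalence": targets.mean(),
        # Threshold-free ranking metrics.
        "roc_auc": roc_auc_score(targets, predictions),
        "average_precision": average_precision_score(targets, predictions),
        # Calibration-sensitive metric (lower is better).
        "brier": brier_score_loss(targets, predictions),
        # Metrics at the chosen operating point.
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Synthetic stand-ins, NOT the real files: 20% positives, noisy scores.
rng = np.random.default_rng(0)
demo_targets = rng.binomial(1, 0.2, size=1000)
demo_predictions = np.clip(
    0.2 + 0.3 * demo_targets + rng.normal(0.0, 0.15, size=1000), 0.0, 1.0
)
summary = evaluate_scores(demo_targets, demo_predictions)
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

Whether the predictions are useful depends on the cost of each error type. The threshold would normally be chosen from those costs instead of being fixed at 0.5.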

# 3. General ML
To be discussed verbally.
### 3.1
In a regression problem with feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ and targets $y_1, \dots, y_n \in \mathbb{R}$, how would you adjust the following loss function on parameters $\mathbf{b} \in \mathbb{R}^d$ to achieve the sparsest solution?
$$\mathcal{L}(\mathbf{b}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top\mathbf{b})^2 + \lambda\sum_{j=1}^d |b_j|^q$$
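As a starting point for the discussion, the effect of the exponent $q$ in the loss above can be checked empirically. This sketch compares an $L_1$ penalty ($q = 1$, scikit-learn's `Lasso`) with an $L_2$ penalty ($q = 2$, `Ridge`) on synthetic data. The data, the penalty strengths and the seed are assumptions made for illustration. scikit-learn also scales the `Lasso` squared-error term by $1/(2n)$, so the two `alpha` values are not directly comparable to a single $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 20 features, only the first 3 are informative.
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
true_b = np.zeros(d)
true_b[:3] = [2.0, -1.5, 1.0]
y = X @ true_b + rng.normal(scale=0.1, size=n)

# q = 1: L1 penalty. q = 2: L2 penalty. Alphas are illustrative only.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("exact zeros with q=1:", n_zero_lasso)
print("exact zeros with q=2:", n_zero_ridge)
```

The $L_1$ penalty can set coefficients to exactly zero, whereas the $L_2$ penalty only shrinks them towards zero. Values of $q < 1$ favour sparsity even more strongly, but the penalty is then no longer convex.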

### 3.2
What methods and models might you use on a supervised learning problem with one high-cardinality (>10,000 levels) categorical feature, several lower-cardinality (<20 levels) categorical features, and 2-3 real-valued features? Discuss the pros and cons of the different models and methods.

# 4. Model Selection Continued
Build on your work from part 1 to add feature selection and parameter tuning.

In [193]:
data, target = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
# Set roughly 5% of entries to NaN (each entry independently, with probability 0.05)
data = data * np.random.choice([1, np.nan], size=data.shape, p=[0.95, 0.05])
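A minimal sketch of one way to extend the comparison with feature selection and parameter tuning, not a definitive solution. It puts imputation, univariate feature selection and the model in a single pipeline, tunes them jointly with a grid search, and wraps the search in an outer cross-validation loop so the reported error is not biased by the tuning. The grid values, the use of `SelectKBest` with `f_regression`, the restriction to `Ridge` and the seeds are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Assumption: a fixed seed so the missing-value mask is reproducible.
rng = np.random.default_rng(0)
data, target = load_diabetes(return_X_y=True, as_frame=True)
data = data * rng.choice([1, np.nan], size=data.shape, p=[0.95, 0.05])

pipeline = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("select", SelectKBest(score_func=f_regression)),
        ("model", Ridge()),
    ]
)

# Assumption: a small illustrative grid; a real search would be wider.
param_grid = {
    "impute__strategy": ["mean", "median"],
    "select__k": [3, 6, 10],
    "model__alpha": [0.01, 0.1, 1.0],
}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # error estimate

search = GridSearchCV(
    pipeline, param_grid, cv=inner_cv, scoring="neg_mean_absolute_error"
)

# Nested CV: the grid search is re-run inside every outer training fold.
nested_scores = cross_val_score(
    search, data, target, cv=outer_cv, scoring="neg_mean_absolute_error"
)
nested_mae = -nested_scores.mean()

# Refit on all data to inspect which configuration is selected.
search.fit(data, target)
best_params = search.best_params_
print("nested CV MAE:", round(nested_mae, 2))
print("selected configuration:", best_params)
```

Because feature selection happens inside the pipeline, it is refit on each training fold, which avoids selecting features using validation data.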