# Technical Tasks

* The aim of these tasks is to demonstrate a justifiable approach to common ML tasks.
* Code quality is **not** the main focus.
* Spend only a small share of your time (<20% of the total) summarising work in markdown cells; focus on the data manipulation and model build.

# 1. Regression: Data preparation and model build

## Goals

* Explore and plot data
    * establish distributions of variables
    * find any problems in the data to fix
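
As a starting point for the exploration goals above, here is a minimal sketch (not a prescribed solution) that counts missing and non-numeric entries per column. The helper name `summarise_problems` and the toy frame are hypothetical and made up for illustration; they are not part of the task data.

```python
import numpy as np
import pandas as pd


def summarise_problems(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing and non-numeric entries in each column."""
    # Coerce every column to numeric; anything unparseable becomes NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    missing = df.isna().sum()
    # Entries that were present originally but could not be parsed as numbers.
    non_numeric = numeric.isna().sum() - missing
    return pd.DataFrame({"missing": missing, "non_numeric": non_numeric})


# Hypothetical toy frame (not the real data): one NaN and one blank string.
toy = pd.DataFrame({"AGE": [59.0, 48.0, np.nan], "BMI": [" ", "25.3", "30.5"]})
report = summarise_problems(toy)
print(report)
```

Once the columns are numeric, `DataFrame.describe()` and `DataFrame.hist()` are one way to inspect the distributions.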

Write a simple model-building pipeline including the three tasks below:

* Clean data
    * fix any problems in the data that must be resolved before building a model
* Normalise data
    * normalise the variables for use in a linear regression 
* Fit and evaluate model
    * estimate the generalisation error of your model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("diabetes.csv")

X, y = data.iloc[:, :-1], data.iloc[:, -1]

print(X.head())
print("\n")
print(y.head())

    AGE  SEX   BMI     BP     S1     S2    S3   S4          S5    S6
0  59.0  2.0        101.0  157.0   93.2  38.0  NaN  110.383198  87.0
1  48.0  1.0         87.0  183.0  103.2  70.0  3.0   43.731989   NaN
2  72.0  NaN  30.5   93.0  156.0   93.6  41.0  4.0   61.926388  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  154.266706  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  154.005616  80.0


0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: Y, dtype: float64
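
The three pipeline tasks listed in the goals (clean, normalise, fit and evaluate) could be sketched as below. This is one possible approach under stated assumptions, not the expected answer: the median imputation, the 5-fold cross-validation, and the synthetic stand-in data are illustrative choices, and `build_pipeline`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline():
    """Impute missing values, standardise features, then fit a linear model."""
    return make_pipeline(
        SimpleImputer(strategy="median"),  # assumption: median imputation
        StandardScaler(),
        LinearRegression(),
    )


# Synthetic stand-in for X and y (hypothetical, not the diabetes data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 2.0 * X_demo["a"] - X_demo["b"] + rng.normal(scale=0.1, size=200)
X_demo.iloc[::10, 1] = np.nan  # inject some missing values

# Blank or non-numeric strings would need coercing to NaN before imputation.
X_demo = X_demo.apply(pd.to_numeric, errors="coerce")

# Preprocessing is refit inside each fold, so held-out data does not leak
# into the imputer or the scaler.
scores = cross_val_score(
    build_pipeline(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores.mean()
print(f"Cross-validated MAE: {cv_mae:.3f}")
```

Wrapping the preprocessing and the model in a single pipeline means the cross-validated score estimates the generalisation error of the whole procedure, not only of the final regression step.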


# 2. General ML
To be discussed verbally.
### 2.1
In a regression problem with feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ and targets $y_1, \dots, y_n \in \mathbb{R}$, how would you adjust the following loss function on parameters $\mathbf{b} \in \mathbb{R}^d$ to achieve the sparsest solution?
$$\mathcal{L}(\mathbf{b}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{b})^2 + \lambda\sum_{j=1}^d |b_j|^q$$
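
One way to build intuition for the role of $q$ is the special case of an orthonormal design, which is an assumption made here purely for illustration. In that case the loss decouples per coordinate into $(z_j - b_j)^2 + \lambda |b_j|^q$, where $z_j$ is the unpenalised least-squares coefficient. The sketch below compares the closed-form minimisers for $q = 2$ and $q = 1$; the function names and the values in `z` are hypothetical.

```python
import numpy as np


def ridge_shrink(z, lam):
    """Minimiser of (z - b)^2 + lam * |b|^2 for each coordinate (q = 2)."""
    return z / (1.0 + lam)


def soft_threshold(z, lam):
    """Minimiser of (z - b)^2 + lam * |b| for each coordinate (q = 1)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)


# Hypothetical least-squares coefficients under an orthonormal design.
z = np.array([3.0, -0.4, 0.2, -1.5])
lam = 1.0

b_q2 = ridge_shrink(z, lam)
b_q1 = soft_threshold(z, lam)

print("q = 2:", b_q2)
print("q = 1:", b_q1)
print("non-zero coefficients with q = 2:", np.count_nonzero(b_q2))
print("non-zero coefficients with q = 1:", np.count_nonzero(b_q1))
```

With $q = 2$ every coefficient is shrunk but none reaches exactly zero, whereas with $q = 1$ any coefficient with $|z_j| \le \lambda/2$ is set to exactly zero. Values of $q < 1$ favour sparsity even more strongly, at the cost of a non-convex penalty.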

### 2.2
What methods and models might you use on a supervised learning problem with a high cardinality (>10000) categorical feature, several lower cardinality (<20) categoricals, and 2-3 real valued features? Discuss pros and cons of different models and methods.
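
As one concrete talking point, a high cardinality categorical can be replaced by a smoothed mean of the target per category, which avoids creating more than 10000 one-hot columns. The sketch below is a minimal illustration under stated assumptions, not a recommended final method: `smoothed_target_encode`, its `smoothing` parameter, and the toy data are all hypothetical.

```python
import pandas as pd


def smoothed_target_encode(train_cat, train_y, apply_cat, smoothing=10.0):
    """Encode categories as a smoothed mean of the target.

    Rare categories are pulled towards the global mean; categories unseen
    in training fall back to the global mean.
    """
    global_mean = train_y.mean()
    stats = train_y.groupby(train_cat).agg(["mean", "count"])
    # Weight on the category mean grows with the number of observations.
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1.0 - weight) * global_mean
    return apply_cat.map(encoding).fillna(global_mean)


# Hypothetical toy data: "a" is common, "b" is rare, "z" is unseen in training.
train = pd.DataFrame({"cat": ["a"] * 8 + ["b"] * 2, "y": [1.0] * 8 + [0.0] * 2})
new_cat = pd.Series(["a", "b", "z"])
encoded = smoothed_target_encode(train["cat"], train["y"], new_cat, smoothing=2.0)
print(encoded)
```

Because the encoding uses the target, applying it to the same rows it was fitted on leaks label information; fitting it out-of-fold is one common way to reduce that risk. The lower cardinality categoricals are small enough that one-hot encoding remains a reasonable option for them.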