# Technical Tasks
The aim of these tasks is to demonstrate a justifiable approach to common ML tasks. Code quality is **not** the main focus.

# 1. Simple Model Selection
## Goals
- We want to compare:
    - Missing data strategy: mean vs median
    - Model: Lasso vs Ridge
- We want to find the model that will have the lowest error in production.

In [24]:
import sklearn.datasets
import numpy as np

In [92]:
data, target = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
# Set each entry to NaN independently with probability 0.05 (roughly 5% of the data)
data = data * np.random.choice([1, np.nan], size=data.shape, p=[0.95, 0.05])
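
One possible way to set up the comparison is sketched below. It is a starting point under stated assumptions rather than a reference solution: the fixed random seed, the 5-fold split, the RMSE metric, the scaling step and `alpha=1.0` are all choices made here for illustration, and the variable names `X`, `y` and `results` are not part of the task. Imputation and scaling sit inside the pipeline so that they are refit on each training fold, which keeps the cross-validated error closer to what an unseen production sample would give.

```python
import numpy as np
import sklearn.datasets
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a fixed seed so the NaN mask (and therefore the comparison) is reproducible.
rng = np.random.default_rng(0)
X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X * rng.choice([1, np.nan], size=X.shape, p=[0.95, 0.05])

# The same folds are used for every candidate so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for strategy in ("mean", "median"):
    for name, model in (("lasso", Lasso(alpha=1.0)), ("ridge", Ridge(alpha=1.0))):
        # Imputer and scaler are fitted on the training part of each fold only.
        pipe = make_pipeline(SimpleImputer(strategy=strategy), StandardScaler(), model)
        scores = cross_val_score(
            pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        results[(strategy, name)] = (-scores.mean(), scores.std())

# Print candidates from lowest to highest cross-validated RMSE.
for key, (rmse, std) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(key, f"RMSE = {rmse:.2f} +/- {std:.2f}")
```

The fold-to-fold standard deviation is reported alongside the mean because differences between candidates that are smaller than this spread are hard to justify as real.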

# 2. Classifier Evaluation
We have a binary classification task. The ground truth labels are loaded along with predictions from one of our models.

Quantify and comment on the quality and usefulness of these predictions.

In [170]:
predictions = np.load("classifier_predictions.npy")
targets = np.load("classifier_targets.npy")
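
A hedged sketch of how the evaluation might start is given below. It assumes that `predictions` holds scores or probabilities in $[0, 1]$ and that `targets` holds 0/1 labels; neither is confirmed by the task text, so both should be checked first. The helper `summarize_binary_classifier`, the 0.5 threshold and the synthetic demo arrays are illustrative assumptions and are not the contents of the `.npy` files.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)


def summarize_binary_classifier(targets, predictions, threshold=0.5):
    """Collect ranking, calibration and thresholded metrics in one dict.

    Assumes `predictions` are probabilities in [0, 1] and `targets` are 0/1.
    """
    targets = np.asarray(targets).astype(int)
    predictions = np.asarray(predictions, dtype=float)
    hard = (predictions >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(targets, hard, labels=[0, 1]).ravel()
    return {
        # The base rate gives context: accuracy alone is misleading when classes are imbalanced.
        "prevalence": targets.mean(),
        # Threshold-free ranking quality.
        "roc_auc": roc_auc_score(targets, predictions),
        "average_precision": average_precision_score(targets, predictions),
        # Mean squared error of the probabilities (lower is better).
        "brier": brier_score_loss(targets, predictions),
        # Metrics at the chosen operating point.
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Synthetic stand-in data, used only so the sketch runs without the .npy files.
rng = np.random.default_rng(0)
demo_targets = rng.integers(0, 2, size=1000)
demo_predictions = np.clip(0.5 * demo_targets + rng.normal(0.25, 0.2, size=1000), 0, 1)
summary = summarize_binary_classifier(demo_targets, demo_predictions)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With the real arrays, the call would be `summarize_binary_classifier(targets, predictions)`. Whether the predictions are useful then depends on the costs of false positives and false negatives, which the threshold should reflect.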

# 3. General ML
To be discussed verbally.
### 3.1
In a regression problem with feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ and targets $y_1, \ldots, y_n \in \mathbb{R}$, how would you adjust the following loss function on parameters $\mathbf{b} \in \mathbb{R}^d$ to achieve the sparsest solution?
$$\mathcal{L}(\mathbf{b}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{b})^2 + \sum_{j=1}^d |b_j|^q$$
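
As one angle for the discussion, the exponent $q$ governs how the penalty treats small coefficients. With $q = 2$ (ridge) coefficients are shrunk but rarely reach exactly zero, while $q = 1$ (lasso) can set them exactly to zero; values $q < 1$ favour sparsity even more strongly but make the objective non-convex. The sketch below only illustrates the $q = 1$ versus $q = 2$ contrast on synthetic data. The data, `alpha=1.0` and the variable names are assumptions made here, and scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its objective matches the loss above only up to the weight on the penalty.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty (q = 1)
ridge = Ridge(alpha=1.0).fit(X, y)  # squared L2 penalty (q = 2)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("exact zeros with lasso:", n_zero_lasso)
print("exact zeros with ridge:", n_zero_ridge)
```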

### 3.2
What methods and models might you use on a supervised learning problem with one high-cardinality (>10000) categorical feature, several lower-cardinality (<20) categorical features, and 2-3 real-valued features? Discuss the pros and cons of different models and methods.
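
As one baseline to anchor the discussion, the sketch below one-hot encodes every categorical column into a sparse matrix and fits a regularized linear model. This is only one of several reasonable options (others include target encoding, feature hashing, learned embeddings, or tree ensembles with native categorical handling), and it is not presented as the expected answer. The column names `user_id`, `region` and `amount`, the synthetic data and the hyperparameters are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical columns, for illustration only.
rng = np.random.default_rng(0)
n = 20000
df = pd.DataFrame(
    {
        "user_id": rng.integers(0, 12000, size=n).astype(str),  # up to 12000 distinct values
        "region": rng.integers(0, 15, size=n).astype(str),  # low cardinality
        "amount": rng.normal(size=n),  # real-valued
    }
)
# Only `amount` carries signal here; the categoricals are pure noise.
y = (df["amount"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

preprocess = ColumnTransformer(
    [
        # One-hot output is sparse, so memory grows with rows, not with cardinality.
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["user_id", "region"]),
        ("num", StandardScaler(), ["amount"]),
    ]
)
# L2 regularization limits overfitting to rare categories.
model = make_pipeline(preprocess, LogisticRegression(C=1.0, max_iter=1000))

train, test = slice(0, 16000), slice(16000, n)
model.fit(df.iloc[train], y.iloc[train])
accuracy = model.score(df.iloc[test], y.iloc[test])
print(f"held-out accuracy: {accuracy:.3f}")
```

The main trade-off of this baseline is that each rare category gets its own weight estimated from very few rows, which is where alternatives such as target encoding or hashing can be worth comparing.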

# 4. Model Selection Continued
Build on your work from part 1 to add feature selection and parameter tuning.

In [193]:
data, target = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
# Set each entry to NaN independently with probability 0.05 (roughly 5% of the data)
data = data * np.random.choice([1, np.nan], size=data.shape, p=[0.95, 0.05])
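
A minimal sketch of one way to extend the comparison with feature selection and tuning follows. The choice of `SelectKBest` with `f_regression`, the grid values, the fixed seed and the names `pipe`, `param_grid` and `search` are assumptions made for illustration. Feature selection is placed inside the pipeline so that it is refit within each training fold; selecting features on the full data before cross-validation would leak information from the validation folds.

```python
import numpy as np
import sklearn.datasets
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a fixed seed so the NaN mask is reproducible.
rng = np.random.default_rng(0)
X, y = sklearn.datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X * rng.choice([1, np.nan], size=X.shape, p=[0.95, 0.05])

pipe = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_regression)),
        ("model", Ridge()),  # placeholder; the grid swaps in Lasso as well
    ]
)

param_grid = {
    "impute__strategy": ["mean", "median"],
    "select__k": [3, 5, 8, "all"],
    "model": [Lasso(max_iter=10000), Ridge()],
    "model__alpha": np.logspace(-3, 2, 6),
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated RMSE: {-search.best_score_:.2f}")
```

Because the best configuration is picked using the same folds that score it, `best_score_` tends to be optimistic. A held-out test set or nested cross-validation would give a fairer estimate of production error.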