In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import sklearn.linear_model
import sklearn.model_selection

# Applied Machine Learning

## Linear Models

### In the previous episode

- We have some dataset
- We identify the problem and define the loss function
- Then we minimize the total loss (empirical risk) using available (training) data
- Overfitting!

In [21]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
print(data['data'][0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Overfitting

- We can always come up with a model that fits the training data perfectly
- That's not what we want: such a model memorizes the training set and tends to do poorly on unseen data
- Let us try to (at least) measure that
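
A minimal sketch of the idea above, assuming a synthetic 1-D regression task (the data, the noise level and the polynomial degrees are illustrative choices, not part of the course material). A degree-9 polynomial passes through all 10 training points, yet its error on fresh points is far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine sampled at 10 training points
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=x_train.size)

# Noise-free points strictly inside the training range act as unseen data
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(3 * x_test)


def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


coef_simple = np.polyfit(x_train, y_train, deg=3)
coef_complex = np.polyfit(x_train, y_train, deg=9)  # interpolates all 10 points

train_err_simple = mse(coef_simple, x_train, y_train)
train_err_complex = mse(coef_complex, x_train, y_train)
test_err_complex = mse(coef_complex, x_test, y_test)

print("train error, degree 3:", train_err_simple)
print("train error, degree 9:", train_err_complex)
print("test error,  degree 9:", test_err_complex)
```

The gap between training and test error of the degree-9 fit is the quantity we want to measure.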

### Cross validation

- Split the dataset into a few (say 5) non-overlapping parts
- 4 parts go to training data, 1 goes to test data
- Repeat 5 times so that every part serves as test data exactly once
- A good way to *detect* overfitting

In [11]:
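xval = sklearn.model_selection.KFold(n_splits=5)
# Split the list of documents, not the dict-like object returned by fetch_20newsgroups
for train, test in xval.split(data['data']):
    pass  # train and test are arrays of example indices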
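
As a sketch of what the loop body could do, the example below trains and scores a model on every fold. It assumes synthetic regression data and `LinearRegression` as a stand-in model; neither comes from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical data: 100 examples, 3 features, almost linear target
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

xval = KFold(n_splits=5, shuffle=True, random_state=0)
train_scores, test_scores = [], []
for train, test in xval.split(X):
    model = LinearRegression().fit(X[train], y[train])
    train_scores.append(model.score(X[train], y[train]))  # R^2 on seen data
    test_scores.append(model.score(X[test], y[test]))     # R^2 on held-out data

print("mean train R^2:", np.mean(train_scores))
print("mean test R^2: ", np.mean(test_scores))
```

A training score much higher than the test score is the signature of overfitting.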

### Leave-one-out

- Generate as many splits as there are examples
- All but one go to training data, just one goes to testing
- Better estimate if we don't have a lot of data
- We still need a principled way to reduce overfitting during training
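
A small sketch of the splitting scheme using `sklearn.model_selection.LeaveOneOut`; the 6-example array is a made-up placeholder.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical tiny dataset: 6 examples, 2 features
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
splits = list(loo.split(X))

# One split per example: 5 indices for training, 1 for testing
for train, test in splits:
    print("train:", train, "test:", test)
```

Since the model is retrained once per example, this gets expensive quickly on larger datasets.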

### Ill-posed problems

- A mathematical problem is ill-posed when, for example, its solution is not unique
- That's exactly the case of regression/classification/...: many different models fit the same training data
- We need to make the problem well-posed: *regularization*

### Structural risk minimization

- Structural risk is empirical risk plus a regularizer
- Instead of minimizing empirical risk alone, we look for a tradeoff between fit and model complexity
- The regularizer is a function of the model itself, not of the data
- $\mathsf{objective} = \mathsf{loss} + \mathsf{regularizer}$
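
The formula above can be sketched in code. This is only an illustration: it assumes a linear model, the squared loss, and a squared-weights regularizer scaled by a hypothetical strength `lam`.

```python
import numpy as np


def objective(w, X, y, lam):
    """Structural risk: empirical risk plus a regularizer (illustrative choices)."""
    loss = np.sum((y - X @ w) ** 2)    # empirical risk: total squared error
    regularizer = lam * np.sum(w ** 2)  # assumed complexity measure of the model
    return loss + regularizer


X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])

print(objective(w, X, y, lam=0.0))  # loss only
print(objective(w, X, y, lam=0.5))  # loss plus penalty
```

Increasing `lam` shifts the tradeoff towards simpler models at the cost of a worse fit.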

### Regularizer

- A functions that reflects the complexity of a model
- What is the complexity of a set of 'if ... then'?
- Not obvious for linear model

### Gradient Descent

- Last time we used `opt.fmin` which magically found the solution
- The method is simple though
- Start with random weights $w_0$
- Iterate: $w_{i+1} = w_{i} - \alpha \times \nabla \mathsf{objective}(w_i)$, where $\alpha$ is the step size
- All we need to know: the gradients of loss and regularizer
- $\nabla \mathsf{objective} = \nabla \mathsf{loss} + \nabla \mathsf{regularizer}$
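
The update rule can be sketched in a few lines of NumPy. The sketch assumes a linear model with the mean squared loss and a squared-weights regularizer; `alpha`, `lam`, the iteration count and the synthetic data are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free data generated by known weights
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true


def grad_objective(w, X, y, lam):
    """Gradient of mean squared loss plus lam * ||w||^2."""
    grad_loss = -2.0 * X.T @ (y - X @ w) / len(y)
    grad_regularizer = 2.0 * lam * w
    return grad_loss + grad_regularizer


w = rng.normal(size=3)  # start with random weights w_0
alpha = 0.1             # step size
for _ in range(500):
    w = w - alpha * grad_objective(w, X, y, lam=0.0)

print("recovered weights:", w)
```

With `lam=0.0` the iterations recover the generating weights; a positive `lam` pulls the solution towards zero.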

### Gradient of loss

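- Last time we used the squared loss $(y-p)^2$
- Its derivative with respect to the prediction $p$ is $-2 (y - p)$; for a linear model $p = w \cdot x$ the gradient with respect to $w$ is $-2 (y - p) x$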
- The regularizer is not known yet
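
One way to double-check a hand-derived gradient is to compare it with finite differences. The sketch below assumes a linear model $p = w \cdot x$, for which the gradient of $(y-p)^2$ with respect to $w$ is $-2 (y - p) x$; the numbers are arbitrary.

```python
import numpy as np


def loss(w, x, y):
    p = w @ x  # prediction of the linear model
    return (y - p) ** 2


def grad_loss(w, x, y):
    p = w @ x
    return -2.0 * (y - p) * x  # chain rule: d/dw (y - p)^2 with p = w . x


w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
y = 1.0

# Central finite differences along every coordinate of w
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])

print("analytic:", grad_loss(w, x, y))
print("numeric: ", numeric)
```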

### $\ell_1$ regularizer

- The regularizer is $\sum_i |w_i|$; its derivative is constant in magnitude: $\mathrm{sign}(w_i)$
- Forces a weight to be exactly zero if that doesn't hurt performance much
- Use if you believe some features are useless

In [19]:
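# Classification; the default lbfgs solver does not support l1, liblinear does
model = sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear')
# Regression
model = sklearn.linear_model.Lasso()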
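
To see the zeroing effect, here is a sketch on synthetic data where only 2 of 10 features matter. The data and `alpha=0.1` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: only the first two features influence the target
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0))

print("coefficients:", np.round(model.coef_, 3))
print("exactly zero:", n_zero, "of", model.coef_.size)
```

The weights of the useless features come out exactly zero, which works as built-in feature selection.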

### $\ell_2$ regularizer

- The regularizer is $\sum_i w_i^2$; its derivative is linear: $2 w_i$
- Forces weights to have *similar* magnitude if that doesn't hurt performance much
- Use if you believe all features are more or less important

In [18]:
# Classification (l2 is the default penalty)
model = sklearn.linear_model.LogisticRegression(penalty='l2')
# Regression
model = sklearn.linear_model.Ridge()
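
A sketch of the shrinking effect, assuming synthetic data and three arbitrary values of the `alpha` strength: the norm of the weight vector decreases as `alpha` grows, but the weights do not become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data: every feature carries some signal
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 1.5, -0.5]) + 0.1 * rng.normal(size=200)

alphas = (0.01, 10.0, 1000.0)
models = [Ridge(alpha=a).fit(X, y) for a in alphas]
norms = [float(np.linalg.norm(m.coef_)) for m in models]

for a, m, n in zip(alphas, models, norms):
    print(f"alpha={a}: ||w|| = {n:.3f}, weights = {np.round(m.coef_, 3)}")
```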

### Elastic net

- Just a weighted sum of $\ell_1$ and $\ell_2$ regularizers
- An attempt to get useful properties of both

In [20]:
model = sklearn.linear_model.ElasticNet()
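
In scikit-learn the weighting is controlled by `l1_ratio`: the penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The sketch below uses synthetic data and arbitrary parameter values, and checks that `l1_ratio=1.0` reduces to `Lasso`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)

# Hypothetical data: only the first two features matter
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.normal(size=200)

mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # half l1, half l2
pure_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # l1 penalty only
lasso = Lasso(alpha=0.1).fit(X, y)

print("mixed penalty:", np.round(mixed.coef_, 3))
print("l1_ratio=1.0: ", np.round(pure_l1.coef_, 3))
print("Lasso:        ", np.round(lasso.coef_, 3))
```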

### Limitation of linearity

- In low-dimensional spaces linear models are not very 'powerful'
- The higher the dimensionality, the more powerful a linear model becomes
- What if $d > 1000000$?

### High-dimensional mode

- When $d$ becomes large, features become correlated
- The real power of linear models comes from using sparse features

### Sparse features

- We say features are sparse when most of the values are zero
- Examples: visited hosts, movies that user liked, ...
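
Sparse features are usually stored in a format that keeps only the nonzero entries. A minimal sketch with `scipy.sparse`, assuming 3 users and one million possible hosts (the indices are made up):

```python
import numpy as np
from scipy import sparse

# Hypothetical visits: (user row, host column) pairs; everything else is zero
rows = np.array([0, 0, 1, 2])
cols = np.array([3, 999_999, 42, 3])
vals = np.ones(4)

X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 1_000_000))
density = X.nnz / (X.shape[0] * X.shape[1])

print("shape:", X.shape)
print("stored values:", X.nnz)
print("density:", density)
```

Only 4 values are stored, although the matrix nominally has 3 million cells. scikit-learn's linear models accept such matrices directly.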

### One hot encoding, hashing trick

- One way to encode categorical things like visited hosts
- We enumerate all the hosts
- We put 1 at the position of every visited host and 0 elsewhere
- Hashing trick: instead of enumerating the hosts, hash each one to get its position

In [19]:
# Python 3 salts str hashes per process, so this value changes between runs
hash('hse.ru') % 2**16

14011
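
For reproducible feature positions a stable hash is needed. The sketch below swaps the built-in `hash` for `zlib.crc32`, which is one possible choice among many; `bucket` and `encode` are hypothetical helper names.

```python
import zlib

import numpy as np

N_BUCKETS = 2 ** 16


def bucket(host):
    """Map a host name to a position using a hash that is stable across runs."""
    return zlib.crc32(host.encode("utf-8")) % N_BUCKETS


def encode(hosts):
    """Encode a set of visited hosts as a 0/1 vector without enumerating all hosts."""
    x = np.zeros(N_BUCKETS)
    for host in hosts:
        x[bucket(host)] = 1.0
    return x


x = encode(["hse.ru", "example.org"])
print("position of hse.ru:", bucket("hse.ru"))
print("nonzero entries:", int(x.sum()))
```

Different hosts may collide in the same bucket; with enough buckets this usually costs little accuracy. scikit-learn ships a ready-made version of this idea as `sklearn.feature_extraction.FeatureHasher`.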

### Homework 1

- No score, it just has to be done
- For best score, deadline is next class
- Load dataset, create linear model, train, and explain results