Let $X$ be the $n \times m$ feature matrix ($n$ objects, $m$ features): $X = [x^{(i)}_j]^{i=1\dots n}_{j=1\dots m}$

The target vector is $y \in \mathbb{R}^n$.

$$\hat{y} = X\beta \quad \Leftrightarrow \quad \hat{y}^{(i)} = \beta_0 + \beta_1x^{(i)}_1 + \dots$$

$$ L(\beta) = \frac{1}{2n}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2n}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min $$

$$ \Updownarrow $$

$$ L(\beta_0,\beta_1,\dots) = \frac{1}{2n}\sum^{n}_{i=1}(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2n}\sum^{n}_{i=1}(\beta_0 + \beta_1x^{(i)}_1 + \dots - y^{(i)})^2 \rightarrow \min $$
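To make the equivalence of the two forms concrete, here is a minimal sketch on synthetic data (the data, the names `X_design`, `features`, `beta_true`, and the helper functions are assumptions for illustration only). Note that for $\hat{y} = X\beta$ to include $\beta_0$, the matrix needs a leading column of ones.

```python
import numpy as np

# Synthetic data: n objects, m = 2 features (illustrative values only)
rng = np.random.default_rng(0)
n, m = 50, 2
features = rng.normal(size=(n, m))

# Prepend a column of ones so that beta[0] plays the role of beta_0
X_design = np.column_stack([np.ones(n), features])
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_design @ beta_true + rng.normal(scale=0.1, size=n)


def loss_matrix(beta):
    """Matrix form: 1/(2n) * (X beta - y)^T (X beta - y)."""
    r = X_design @ beta - y_demo
    return r @ r / (2 * n)


def loss_sum(beta):
    """Element-wise form: 1/(2n) * sum_i (y_hat_i - y_i)^2."""
    total = 0.0
    for i in range(n):
        y_hat_i = beta[0] + beta[1] * features[i, 0] + beta[2] * features[i, 1]
        total += (y_hat_i - y_demo[i]) ** 2
    return total / (2 * n)


# Least-squares solution, i.e. the minimizer of L(beta)
beta_ls, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print(loss_matrix(beta_true), loss_sum(beta_true))  # the two forms agree
print(beta_ls)                                      # close to beta_true
```
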

### Libraries

We will use two libraries: [`scikit-learn`](http://scikit-learn.org/stable/) and [`statsmodels`](http://statsmodels.sourceforge.net/).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('ggplot')

%matplotlib inline

In [None]:
df = pd.read_csv('http://bit.ly/1gIQs6C')

In [None]:
df.head()

We use `mileage` as the only predictor of `price`.

In [1]:
y = df.price.values
X = df.mileage.values.reshape(-1,1)


In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

In [None]:
print('Model:\nprice = %.2f + (%.2f)*mileage' % (model.intercept_, model.coef_[0]))

In [None]:
x = np.linspace(0, X.max(), 100)  # X is 2-D, so use X.max() to get a scalar upper bound
y_line = model.intercept_ + model.coef_[0]*x

fig, ax = plt.subplots(1, 1, figsize=(10,5))
ax.scatter(X, y)

ax.plot(x, y_line, c='red')

### Quality

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

y_hat = model.predict(X)
res = y - y_hat
ax[0].hist(res)
ax[0].set_xlabel('residuals')
ax[0].set_ylabel('counts')

ax[1].scatter(X, res)
ax[1].set_xlabel('mileage')
ax[1].set_ylabel('residuals')


In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Error metrics computed from the residuals:

* Mean absolute error (MAE): $\frac{1}{n} \sum_i |\hat{y}^{(i)}-y^{(i)}|$
* Mean squared error (MSE): $\frac{1}{n} \sum_i (\hat{y}^{(i)}-y^{(i)})^2$
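As a sanity check, both metrics can be computed by hand with NumPy. The sketch below uses small made-up arrays (`y_true_demo`, `y_pred_demo` are illustrative, not the car data) so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred_demo - y_true_demo   # [-0.5, 0.0, 2.0, 1.0]

mae = np.abs(errors).mean()          # (0.5 + 0 + 2 + 1) / 4
mse = (errors ** 2).mean()           # (0.25 + 0 + 4 + 1) / 4

print('MAE %.4f' % mae)  # MAE 0.8750
print('MSE %.4f' % mse)  # MSE 1.3125
```

MSE squares each residual, so the single large error of 2.0 dominates it much more than it dominates MAE.
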

In [2]:
print('MAE %.2f' % mean_absolute_error(y, y_hat))
print('MSE %.2f' % mean_squared_error(y, y_hat))

* $TSS = \sum_i (y^{(i)}-\bar{y})^2$ - total sum of squares
* $RSS = \sum_i (\hat{y}^{(i)}-y^{(i)})^2$ - residual sum of squares
* $ESS = \sum_i (\hat{y}^{(i)}-\bar{y})^2$ - explained sum of squares

$$TSS = ESS + RSS$$

$R^2=1-\frac{RSS}{TSS}$
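The decomposition $TSS = ESS + RSS$ holds for a least-squares fit that includes an intercept. A minimal sketch on synthetic data (the data and the names `x_demo`, `y_demo`, `y_fit` are assumptions for illustration) checks it numerically:

```python
import numpy as np

# Synthetic one-feature data, for illustration only
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=100)
y_demo = 4.0 - 0.5 * x_demo + rng.normal(scale=1.0, size=100)

# Least-squares line with an intercept (polyfit returns the highest degree first)
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
y_fit = intercept + slope * x_demo

tss = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
rss = np.sum((y_fit - y_demo) ** 2)           # residual sum of squares
ess = np.sum((y_fit - y_demo.mean()) ** 2)    # explained sum of squares

r2 = 1 - rss / tss

print('TSS = %.3f, ESS + RSS = %.3f' % (tss, ess + rss))
print('R^2 = %.3f, ESS / TSS = %.3f' % (r2, ess / tss))
```

Because of the decomposition, $R^2$ can be read as the share of the total variance explained by the model.
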


In [None]:
print('R^2: %.2f' % r2_score(y, y_hat))

In [None]:
import statsmodels.api as sm
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())

## Preprocessing

### Normalization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

In [3]:
print('Model:\nprice = %.2f + (%.2f)*mileage' % (model.intercept_, model.coef_[0]))
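To see what `StandardScaler` does, the sketch below compares it with the manual formula $(x - \text{mean}) / \text{std}$ on a few made-up values (`mileage_demo` and `price_demo` are illustrative, not the real data). It also checks that, once the feature is centered, the fitted intercept equals the mean of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up values, for illustration only
mileage_demo = np.array([[10000.0], [20000.0], [30000.0], [60000.0]])
price_demo = np.array([9000.0, 8000.0, 7500.0, 5000.0])

scaled = StandardScaler().fit_transform(mileage_demo)

# Same transformation by hand: subtract the mean, divide by the (population) std
manual = (mileage_demo - mileage_demo.mean(axis=0)) / mileage_demo.std(axis=0)

fit = LinearRegression(fit_intercept=True).fit(scaled, price_demo)

print(np.allclose(scaled, manual))          # both computations agree
print(fit.intercept_, price_demo.mean())    # intercept equals the mean price
```

After scaling, the coefficient is the change in price per one standard deviation of mileage, which makes coefficients of differently scaled features comparable.
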

### Nominal values

In [None]:
df.head()

In [None]:
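from sklearn.feature_extraction import DictVectorizer

# All columns except the first one are used as features
cols = df.columns[1:]
# One dict per row: {column name: value}
records = df[cols].to_dict(orient='records')

# String-valued columns are one-hot encoded, numeric ones are passed through
dv = DictVectorizer()
X = dv.fit_transform(records)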
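A toy example makes the encoding visible. The records below are made up (`cars_demo` and its `color` field are hypothetical, not columns of the real dataset); `sparse=False` is used only so the result prints as a dense array.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: one numeric and one nominal feature
cars_demo = [
    {'mileage': 12000, 'color': 'red'},
    {'mileage': 45000, 'color': 'blue'},
    {'mileage': 30000, 'color': 'red'},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(cars_demo)

print(dv_demo.feature_names_)
# ['color=blue', 'color=red', 'mileage']
print(X_demo)
# [[0.0e+00 1.0e+00 1.2e+04]
#  [1.0e+00 0.0e+00 4.5e+04]
#  [0.0e+00 1.0e+00 3.0e+04]]  (exact print formatting may differ)
```

Each nominal value gets its own 0/1 column. The indicator columns of one feature always sum to one, which is one plausible reason to fit the model with `fit_intercept=False` afterwards.
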

In [None]:
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

In [None]:
# Coefficients?


In [None]:
y_hat = model.predict(X)

print('MAE %.2f' % mean_absolute_error(y, y_hat))
print('MSE %.2f' % mean_squared_error(y, y_hat))
print('R^2: %.2f' % r2_score(y, y_hat))

Examples of feature transformations that keep the model linear in the coefficients:
$$\log(y) = \beta_0 + \beta_1\log(x_1)$$
or
$$y = \beta_0 + \beta_1\frac{1}{x_1}$$
or
$$y = \beta_0 + \beta_1\log(x_1)$$
or
$$y = \beta_0 + \beta_1 x_1^2 + \beta_2 x_2^2 + \beta_3 x_1x_2 $$
etc.
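As a minimal sketch of the third form, $y = \beta_0 + \beta_1\log(x_1)$, the code below generates synthetic data from that model (all values and names such as `x1`, `X_log` are assumptions for illustration) and fits an ordinary `LinearRegression` on the raw and on the log-transformed feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3 + 2*log(x1) + noise (illustrative only)
rng = np.random.default_rng(2)
x1 = rng.uniform(1, 100, size=200)
y_demo = 3.0 + 2.0 * np.log(x1) + rng.normal(scale=0.1, size=200)

X_raw = x1.reshape(-1, 1)           # feature as is
X_log = np.log(x1).reshape(-1, 1)   # transformed feature

model_raw = LinearRegression().fit(X_raw, y_demo)
model_log = LinearRegression().fit(X_log, y_demo)

# .score() returns R^2 on the given data
print('R^2 raw: %.3f' % model_raw.score(X_raw, y_demo))
print('R^2 log: %.3f' % model_log.score(X_log, y_demo))
print('y = %.2f + (%.2f)*log(x1)' % (model_log.intercept_, model_log.coef_[0]))
```

The model stays linear in $\beta$, so the same solver is used; only the column passed to `fit` changes.
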

Data: [link](https://www.dropbox.com/s/8srfeh34lnj2cb3/weights.csv?dl=0)

In [None]:
df = pd.read_csv('weights.csv', sep=';', index_col=0)
df.head()

In [4]:
df.plot(x = 'body_w', y='brain_w', kind='scatter')
for k, v in df.iterrows():
    # Label each point with its index; pick the coordinates by column name
    plt.annotate(k, (v['body_w'], v['brain_w']))
# ??


Apply a log transform!

In [None]:
# Your Code Here


Fit a linear regression:

In [None]:
# Your Code Here


## 3. Overfitting

<img src=http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning_files/Image%20%5B8%5D.png>
[Andrew Ng's Machine Learning class, Stanford]

### Regularization

Loss without regularization:

$$ L(\beta_0,\beta_1,\dots) = \frac{1}{2n}\sum^{n}_{i=1}(\hat{y}^{(i)} - y^{(i)})^2 $$

Ridge regularization ($L_2$ penalty):

$$ L(\beta_0,\beta_1,\dots) = \frac{1}{2n}[ \sum^{n}_{i=1}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda\sum_{j=1}^{m}\beta_j^2]$$

Lasso regularization ($L_1$ penalty):

$$ L(\beta_0,\beta_1,\dots) = \frac{1}{2n}[ \sum^{n}_{i=1}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda\sum_{j=1}^{m}|\beta_j|]$$

In [None]:
# In sklearn these methods are named as follows
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
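A minimal sketch of how the two penalties behave, on synthetic data where only the first two of five features matter (the data and the `alpha` values are assumptions chosen for illustration). In `sklearn` the regularization strength is called `alpha`; its objectives are scaled differently from the formulas above, so `alpha` matches $\lambda$ only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first two features influence the target
rng = np.random.default_rng(3)
n, m = 100, 5
X_demo = rng.normal(size=(n, m))
beta_demo = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ beta_demo + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)   # L2 penalty
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)    # L1 penalty

np.set_printoptions(precision=3, suppress=True)
print('OLS:  ', ols.coef_)
print('Ridge:', ridge.coef_)   # all coefficients shrunk towards zero
print('Lasso:', lasso.coef_)   # irrelevant coefficients set exactly to zero
```

Ridge shrinks all coefficients but keeps them nonzero, while Lasso can drop features entirely, which makes it usable for feature selection.
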