# Supervised learning linear models



### Notes

* Unsupervised learning: extract from data X a structure that can be generalized

#### Supervised learning - Linear models
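* Target is a linear combination of the features; fitted models expose `coef_` and `intercept_` after `.fit()`
* Linear regression: minimizes the residual sum of squares between observed and predicted targets
* Important that features are independent of each other: when features are correlated, the least-squares estimates become highly sensitive to random errors in the observed target
* Evaluate with mean squared error (MSE)
* Coefficients can be forced to be non-negative with the `positive=True` parameter
* Has no classifier counterpart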


In [4]:
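from sklearn import linear_model

# Toy data: both features are identical, so they are perfectly correlated
X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 2]

# Fit ordinary least squares and inspect the learned coefficients
reg = linear_model.LinearRegression()
reg.fit(X, y)
print('coefficients', reg.coef_)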


coefficients [0.5 0.5]
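The notes above mention evaluating with mean squared error and forcing non-negative coefficients. A minimal sketch of both follows; the synthetic data, the true coefficients (2, -1, intercept 3) and the names `X_demo`, `y_demo`, `ols` and `nn` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed for illustration): y = 2*x0 - 1*x1 + 3 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 3.0 + rng.normal(scale=0.1, size=100)

# Unconstrained ordinary least squares
ols = LinearRegression().fit(X_demo, y_demo)
mse_ols = mean_squared_error(y_demo, ols.predict(X_demo))

# Same model with every coefficient constrained to be >= 0
nn = LinearRegression(positive=True).fit(X_demo, y_demo)
mse_nn = mean_squared_error(y_demo, nn.predict(X_demo))

print('OLS coefficients', ols.coef_, 'MSE', mse_ols)
print('non-negative coefficients', nn.coef_, 'MSE', mse_nn)
```

Because the true coefficient of the second feature is negative here, the constrained model cannot represent it and its training MSE should come out no lower than the unconstrained one.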


#### [Ridge regression and classification](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)

* Adds an L2 penalty, `alpha` times **the sum of the squared coefficients**, to the least-squares loss, which **shrinks** the coefficients (notably those of correlated predictors)
* Tries to handle multicollinearity
* Good when you want to keep most features but reduce the effect of correlation
* When `alpha` is very large, the regularization effect dominates the squared loss function and the coefficients tend to zero
* Comes with a classifier (`RidgeClassifier`), sometimes referred to as a Least Squares Support Vector Machine with a linear kernel
* Cross-validation: `RidgeCV` and `RidgeClassifierCV` work in the same way as `GridSearchCV`, except that they default to efficient Leave-One-Out cross-validation
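The shrinking effect of `alpha` described above can be checked directly. This is a small sketch on the same toy data used in these notes; the names `X_toy`, `y_toy` and `norms`, and the particular alpha values, are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same toy data as in the notes, under separate names to avoid clobbering X and y
X_toy = [[0, 0], [1, 1], [2, 2]]
y_toy = [0, 1, 2]

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # Track the overall size of the coefficient vector for each alpha
    norms.append(np.linalg.norm(model.coef_))
    print('alpha', alpha, 'coefficients', model.coef_)
```

The coefficient norm should decrease as `alpha` grows and approach zero for the largest value.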

In [7]:
import numpy as np
from sklearn import linear_model

# Ridge regression on the same X, y as above; alpha sets the regularization strength
reg = linear_model.Ridge(alpha=.5)
reg.fit(X, y)
print('coefficients ', reg.coef_)
print('intercept ',  reg.intercept_)


coefficients  [0.44444444 0.44444444]
intercept  0.11111111111111116
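For the classifier variant mentioned in the list above, a minimal sketch is given here. The dataset comes from `make_classification` with arbitrary settings, and the names `X_clf`, `y_clf` and `clf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (settings chosen arbitrarily)
X_clf, y_clf = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X_clf, y_clf, random_state=0)

# RidgeClassifier fits a ridge regression on targets encoded as -1/+1
# and predicts the class from the sign of the regression output
clf = RidgeClassifier(alpha=1.0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('test accuracy', accuracy)
```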


In the example below, `RidgeCV` is given several candidate alphas. For each one the leave-one-out mean squared error is computed, and the alpha with the best score (negative mean squared error by default) is selected.

In [None]:
regCV = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
regCV.fit(X, y)


print('alpha ', regCV.alpha_)
print('coefficients ', regCV.coef_)
print('intercept ',  regCV.intercept_)

alpha  1e-06
coefficients  [0.49999988 0.49999988]
intercept  2.491287887096405e-07
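To make the selection procedure more concrete, the sketch below repeats it by hand with `cross_val_score` and `LeaveOneOut`, then compares the result with `RidgeCV`. The synthetic data and the names `X_cv`, `y_cv`, `mean_scores` and `best_alpha_manual` are assumptions for illustration. The two selections would normally be expected to coincide, though near-ties between very small alphas may be resolved differently.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(30, 3))
y_cv = X_cv @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.5, size=30)

alphas = np.logspace(-6, 6, 13)

# Explicit leave-one-out: one mean negative-MSE score per candidate alpha
mean_scores = [
    cross_val_score(Ridge(alpha=a), X_cv, y_cv, cv=LeaveOneOut(),
                    scoring='neg_mean_squared_error').mean()
    for a in alphas
]
best_alpha_manual = alphas[int(np.argmax(mean_scores))]

# RidgeCV performs an efficient form of the same search internally
ridge_cv = RidgeCV(alphas=alphas).fit(X_cv, y_cv)

print('manual best alpha', best_alpha_manual)
print('RidgeCV best alpha', ridge_cv.alpha_)
```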


#### [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso)
* Adds an L1 penalty, `alpha` times **the sum of the absolute values of the coefficients**, to the least-squares loss
* Can set coefficients to exactly zero
* Good when there are many features and you want to reduce their number
* Cross-validation: there are two estimators
    * `LassoCV`:
        * coordinate descent: optimizes one coefficient at a time
        * simpler, more straightforward implementation
        * advised for data with many correlated features
    * `LassoLarsCV`:
        * least-angle regression (LARS): at each step, moves the coefficients of the features most correlated with the current residual, updating them together
        * efficient in contexts where the number of features is significantly greater than the number of samples
        * explores more relevant values of `alpha` and is often faster than `LassoCV` when the number of samples is very small compared to the number of features
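The points above can be sketched as follows: a plain `Lasso` fit that zeroes out some coefficients, followed by the two cross-validated estimators. The dataset from `make_regression`, its settings, and the names `X_l`, `y_l`, `lasso_cv` and `lars_cv` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV, LassoLarsCV

# Synthetic data: 20 features, of which only 5 actually influence the target
X_l, y_l = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=5.0, random_state=0)

# The L1 penalty can drive some coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X_l, y_l)
n_zero = int(np.sum(lasso.coef_ == 0))
print('coefficients set to zero:', n_zero, 'of', lasso.coef_.size)

# Coordinate-descent based search over a grid of alphas
lasso_cv = LassoCV(cv=5).fit(X_l, y_l)
print('LassoCV alpha', lasso_cv.alpha_)

# LARS-based search over the alphas found along the regularization path
lars_cv = LassoLarsCV(cv=5).fit(X_l, y_l)
print('LassoLarsCV alpha', lars_cv.alpha_)
```

The two estimators search different sets of candidate alphas, so they need not select the same value.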


Some notes on [being cautious with coefficients](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py):
* Coefficients can vary significantly when the input dataset changes (e.g. across train/test splits), so their robustness is not guaranteed; a common cause is two or more correlated features
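One way to see this instability is to refit the same model on many cross-validation splits and compare the spread of each coefficient. The sketch below assumes synthetic data in which `x2` is nearly a copy of `x1` while `x3` is independent; all names are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost identical to x1
x3 = rng.normal(size=n)                   # independent of the others
X_c = np.column_stack([x1, x2, x3])
y_c = x1 + x2 + x3 + rng.normal(scale=1.0, size=n)

# Refit on 25 different training subsets and keep each fitted estimator
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
res = cross_validate(LinearRegression(), X_c, y_c, cv=cv, return_estimator=True)

# One row of coefficients per fitted model; std per column measures stability
coefs = np.array([est.coef_ for est in res['estimator']])
coef_std = coefs.std(axis=0)
print('std of each coefficient across splits', coef_std)
```

The coefficients of the two correlated features should fluctuate far more across splits than the coefficient of the independent one.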


Other models:
* `LassoLarsIC` uses the Akaike Information Criterion (AIC) or the Bayes Information Criterion (BIC) to select `alpha`
* These criteria tend to break when the problem is badly conditioned (e.g. more features than samples)
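A minimal sketch of information-criterion based selection follows, on a well-conditioned problem (more samples than features). The dataset and the names `X_ic`, `y_ic`, `aic` and `bic` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# More samples than features, so the criteria are computed on a well-posed problem
X_ic, y_ic = make_regression(n_samples=200, n_features=10, n_informative=3,
                             noise=10.0, random_state=0)

# Select alpha along the LARS path by minimizing AIC or BIC
aic = LassoLarsIC(criterion='aic').fit(X_ic, y_ic)
bic = LassoLarsIC(criterion='bic').fit(X_ic, y_ic)

print('AIC alpha', aic.alpha_, 'non-zero coefficients', np.count_nonzero(aic.coef_))
print('BIC alpha', bic.alpha_, 'non-zero coefficients', np.count_nonzero(bic.coef_))
```

BIC penalizes model size more heavily than AIC for this sample size, so it should select a model with no more non-zero coefficients than AIC does.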