# [Supervised learning - linear models](https://scikit-learn.org/stable/modules/linear_model.html#)


### Notes

* Unsupervised learning: extract from data X a structure that can be generalized

#### Supervised learning - Linear models
* Target is a linear combination of the features: `coef_`, `intercept_`, `.fit()`
* Linear regression: minimize the residual sum of squares between observed and predicted target
* Important that features are independent of each other
* Evaluate with mean squared error
* Coefficients can be forced to be non-negative: `positive=True` parameter
* Does not have a classifier with it


In [4]:
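from sklearn import linear_model

# Ordinary least squares on a toy dataset
reg = linear_model.LinearRegression()
# The two features are identical (perfectly collinear), so the weight is split evenly between them
X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 2]
reg.fit(X, y)
print('coefficients', reg.coef_)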


coefficients [0.5 0.5]
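
The notes above mention evaluating with mean squared error and forcing non-negative coefficients. A minimal sketch of both, assuming synthetic data (the names `X_demo`, `y_demo` and the true coefficients 2, -1, 3 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption for illustration): y = 2*x0 - 1*x1 + 3 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 3.0 + rng.normal(scale=0.1, size=200)

# Unconstrained least squares vs. least squares with non-negative coefficients
ols = LinearRegression().fit(X_demo, y_demo)
nonneg = LinearRegression(positive=True).fit(X_demo, y_demo)

# Training-set mean squared error of each model
mse_ols = mean_squared_error(y_demo, ols.predict(X_demo))
mse_nonneg = mean_squared_error(y_demo, nonneg.predict(X_demo))

print("unconstrained coefficients", ols.coef_, "MSE", mse_ols)
print("non-negative coefficients ", nonneg.coef_, "MSE", mse_nonneg)
```

Because the second true coefficient is negative, the constrained model cannot reproduce it, so its training error should be higher than that of the unconstrained fit.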


#### [Ridge regression and classification](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)

* L2 penalty: proportional to the **sum of the squared coefficients**; it **shrinks** the coefficients, in particular those of correlated predictors
* Tries to handle multicollinearity
* Good when you want to keep most features but reduce the effect of correlation
* When `alpha` is very large, the regularization effect dominates the squared loss function and the coefficients tend to zero.
* Comes with a classifier (`RidgeClassifier`) -> sometimes referred to as a Least Squares Support Vector Machine with a linear kernel
* Cross-validation: `RidgeCV` and `RidgeClassifierCV` work in the same way as `GridSearchCV`, except that they default to efficient Leave-One-Out cross-validation

In [7]:
import numpy as np
from sklearn import linear_model

# Ridge regression with a fixed regularization strength (reuses X, y from the cell above)
reg = linear_model.Ridge(alpha=.5)
reg.fit(X, y)
print('coefficients ', reg.coef_)
print('intercept ',  reg.intercept_)


coefficients  [0.44444444 0.44444444]
intercept  0.11111111111111116
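
The claim that coefficients tend to zero for very large `alpha` can be checked with a small sketch. It reuses the same toy data under new names (`X_demo`, `y_demo`), and the list of alphas is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same toy data as in the cells above, redefined so the block is self-contained
X_demo = [[0, 0], [1, 1], [2, 2]]
y_demo = [0, 1, 2]

norms = []
for alpha in [1e-3, 1.0, 1e3, 1e6]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector for this alpha
    norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:g}  coef={model.coef_}")
```

The coefficient norm should decrease monotonically as `alpha` grows.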


The example below passes several candidate alphas. For each one the cross-validated MSE is computed, and the alpha with the best score (negative mean squared error by default) is selected.

In [None]:
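# Try 13 alphas from 1e-6 to 1e6; RidgeCV defaults to efficient leave-one-out cross-validation
regCV = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
regCV.fit(X, y)

# alpha_ holds the selected regularization strength
print('alpha ', regCV.alpha_)
print('coefficients ', regCV.coef_)
print('intercept ',  regCV.intercept_)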

alpha  1e-06
coefficients  [0.49999988 0.49999988]
intercept  2.491287887096405e-07


#### [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso)
* L1 penalty: proportional to the sum of the **absolute values of the coefficients**
* Can set coefficients to exactly zero
* Good when there are many features and you want to reduce their number
* Cross-validation: there are two estimators
    * `LassoCV`:
        * coordinate descent -> optimizes one coefficient at a time
        * simpler, more straightforward implementation
        * advised for high-dimensional data with many correlated (collinear) features
    * `LassoLarsCV`:
        * based on Least Angle Regression (LARS): moves the coefficients of all active features together, in the direction equiangular to the features most correlated with the current residual
        * efficient when the number of features is significantly greater than the number of samples
        * often faster than `LassoCV` when there are few samples relative to features; explores more relevant values of `alpha`
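
A minimal sketch of the points above, assuming a synthetic dataset from `make_regression` with 3 informative features out of 20 (sizes and `alpha=1.0` are arbitrary illustration choices). It compares how many coefficients Lasso and Ridge set to exactly zero, then fits both cross-validated estimators:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV, LassoLarsCV, Ridge

# Synthetic data (assumption): only 3 of the 20 features carry signal
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=1.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Lasso zeroes out coefficients; Ridge only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("exact zeros  lasso:", n_zero_lasso, " ridge:", n_zero_ridge)

# Both estimators choose alpha by cross-validation, with different solvers
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)  # coordinate descent
lars_cv = LassoLarsCV(cv=5).fit(X_demo, y_demo)               # least angle regression
print("LassoCV alpha:", lasso_cv.alpha_, " LassoLarsCV alpha:", lars_cv.alpha_)
```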


Some notes on [being cautious with coefficients](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py):
* Coefficients can vary significantly when the input dataset changes (e.g. across train/test splits), so their robustness is not guaranteed; a common cause is that two or more features are correlated


Other models:
* `LassoLarsIC` uses the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC): the lower the value, the better the trade-off between goodness of fit and the number of parameters
* `alpha` and the regularization parameter `C` of SVM are connected: `alpha = 1 / C` or `alpha = 1 / (n_samples * C)`, depending on the estimator and the exact objective function
* `MultiTaskLasso`: handles multiple regression problems jointly. Constraint: the selected features are the same for all the regression problems
* `ElasticNetCV` & `MultiTaskElasticNet`: Elastic-Net, i.e. combined Ridge (L2) + Lasso (L1) penalties
* `OrthogonalMatchingPursuit`: a linear model that lets you fix the number of desired non-zero coefficients
* `BayesianRidge` (Bayesian regression): introduces uninformative priors over the hyperparameters of ridge regression. The regularization strength (Ridge's `alpha`) is then treated like a random variable estimated from the data
* `ARDRegression`: Automatic Relevance Determination, similar to Bayesian Ridge regression, but leads to sparser coefficients (different prior)
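
Two of the models listed above, sketched on a synthetic dataset (an assumption for illustration; the sizes and `n_nonzero_coefs=3` are arbitrary). `OrthogonalMatchingPursuit` caps the number of non-zero coefficients, and `LassoLarsIC` picks `alpha` from an information criterion instead of cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC, OrthogonalMatchingPursuit

# Synthetic data (assumption): 3 informative features out of 20
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=1.0, random_state=0
)

# Ask for at most 3 non-zero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X_demo, y_demo)
n_selected = int(np.count_nonzero(omp.coef_))
print("non-zero coefficients kept by OMP:", n_selected)

# Select alpha by minimizing AIC or BIC along the LARS path
aic = LassoLarsIC(criterion="aic").fit(X_demo, y_demo)
bic = LassoLarsIC(criterion="bic").fit(X_demo, y_demo)
print("alpha (AIC):", aic.alpha_, " alpha (BIC):", bic.alpha_)
```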

#### [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
* implemented as a linear model for classification
* The implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L1, L2, or Elastic-Net regularization
* Regularization is applied by default
* Numerical output is the predicted probability -> classifier when setting a threshold
* `.predict` -> class labels, `.predict_proba` -> predicted probabilities
* binary case can be extended to K classes leading to the multinomial logistic regression
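
A short sketch of the relationship between `.predict_proba`, a threshold, and `.predict`, assuming a synthetic binary dataset (the 0.8 threshold is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # one column per class, rows sum to 1
labels = clf.predict(X_demo)        # class labels

# Thresholding the positive-class probability at 0.5 reproduces .predict
manual = clf.classes_[(proba[:, 1] > 0.5).astype(int)]

# A stricter threshold predicts the positive class less often
strict = clf.classes_[(proba[:, 1] > 0.8).astype(int)]
print("positives at 0.5:", int((manual == 1).sum()),
      " positives at 0.8:", int((strict == 1).sum()))
```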


#### [Solvers](https://scikit-learn.org/stable/modules/linear_model.html#differences-between-solvers)
* Characteristics: performance, speed, and suitability for different types of data
* `lbfgs` (default): small to medium-sized datasets; works well with multi-class problems and provides good convergence properties.
* `liblinear`: built on the LIBLINEAR library (designed for large linear classification); in scikit-learn it is recommended as a good choice for small datasets
* `newton-cg`: Suitable for problems where the number of features is much smaller than the number of samples 
* `newton-cholesky`: particularly well-suited for datasets where the features are dense (i.e., the majority of feature values are non-zero)
* `sag`: large datasets
* `saga`: Opens up more flexibility with regularization options while maintaining the efficiency of SAG.
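
A rough sketch for comparing solvers on the same problem, assuming a small synthetic binary dataset. `newton-cholesky` is left out here only because it needs a newer scikit-learn release. On a well-conditioned problem like this one the solvers should reach similar accuracy; the differences show up mainly in speed and supported penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption); scaling helps sag/saga converge
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for solver in ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    # Training accuracy for this solver
    scores[solver] = clf.fit(X_demo, y_demo).score(X_demo, y_demo)

for solver, score in scores.items():
    print(f"{solver:10s} accuracy={score:.3f}")
```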


#### [Generalized Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-models)
* Let you choose the probability distribution of the target, e.g.
    * Poisson: if the target values are counts or relative frequencies (thus non-negative)
    * Gamma: if the target values are positive and skewed
    * Inverse Gaussian: if the target values seem to be heavier tailed than a Gamma distribution
    * Bernoulli: if the target values are probabilities
* `TweedieRegressor` implements the GLM; the `power` parameter selects the distribution (e.g. `power=1` for Poisson, `power=2` for Gamma, `power=3` for Inverse Gaussian)
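
A minimal sketch of the `power` parameter, assuming synthetic count data drawn from a Poisson distribution with a log-linear mean (the coefficients 0.5 and 1.0 are made up). With `power=1` and a log link, `TweedieRegressor` should behave like `PoissonRegressor`:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

# Synthetic counts (assumption): E[y] = exp(0.5 + 1.0 * x0), x1 is irrelevant
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 2))
y_demo = rng.poisson(lam=np.exp(0.5 + 1.0 * X_demo[:, 0]))

# alpha=0.0 switches off the L2 penalty so both fits are plain GLMs
tweedie = TweedieRegressor(power=1, link="log", alpha=0.0).fit(X_demo, y_demo)
poisson = PoissonRegressor(alpha=0.0).fit(X_demo, y_demo)

# With a log link, predictions are always positive
pred = tweedie.predict(X_demo)
print("Tweedie(power=1) coef:", tweedie.coef_)
print("Poisson coef:         ", poisson.coef_)
```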


### Algorithms for large-scale learning (many samples & features)
#### [Stochastic Gradient Descent](https://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd)
* `SGDClassifier`, `SGDRegressor`
* Fits different known model types depending on the loss function
    * `loss = "log_loss"` (`"log"` in older releases) -> logistic regression
    * `loss = "hinge"` -> linear SVM

#### [Perceptron](https://scikit-learn.org/stable/modules/linear_model.html#perceptron)
* No learning rate, no regularization; the model is updated only on mistakes

#### [Passive Aggressive Algorithms](https://scikit-learn.org/stable/modules/linear_model.html#passive-aggressive-algorithms)
* No learning rate, but they require a regularization parameter `C`

### Algorithms for dealing with outliers and modelling errors
* [Scenarios](https://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors): outliers in X or in y, the fraction of outliers, and the amplitude of the outliers
* Some methods (RANSAC, Theil-Sen) fit on subsets of the data instead of the complete dataset
* `HuberRegressor`: faster than RANSAC and Theil-Sen unless the number of samples is very large; does not ignore outliers but gives them less weight; L2-regularized like ridge
* `RANSACRegressor` (RANSAC): good for large outliers in the y direction; fits on random subsets of the data and keeps the inliers
* `TheilSenRegressor` (Theil-Sen): better with medium-sized outliers in the X direction, but this advantage disappears when the number of features is large; uses a generalization of the median
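
A sketch comparing the robust estimators with ordinary least squares, assuming synthetic one-dimensional data with true slope 3 and ten samples corrupted by a large shift in y (all values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic data (assumption): y = 3*x + 1 + small noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

# Corrupt the 10 samples with the largest x by a large shift in y
outlier_idx = np.argsort(X_demo[:, 0])[-10:]
y_demo[outlier_idx] -= 50.0

ols = LinearRegression().fit(X_demo, y_demo)
huber = HuberRegressor().fit(X_demo, y_demo)
ransac = RANSACRegressor(random_state=0).fit(X_demo, y_demo)
theil = TheilSenRegressor(random_state=0).fit(X_demo, y_demo)

# RANSAC exposes the final model, refit on the inliers, as estimator_
slopes = {
    "ols": ols.coef_[0],
    "huber": huber.coef_[0],
    "ransac": ransac.estimator_.coef_[0],
    "theil_sen": theil.coef_[0],
}
for name, slope in slopes.items():
    print(f"{name:10s} slope={slope:.2f}  (true slope 3.0)")
```

In this setup the ordinary least squares slope is pulled strongly towards the corrupted points, while the robust estimators should stay closer to the true slope.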

### Other models:
* Predicting intervals - [Quantile regression](https://scikit-learn.org/stable/modules/linear_model.html#quantile-regression)
* [Polynomial regression](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions) - extend the linear model with non-linear functions of the features -> `PolynomialFeatures`. Allows fitting a wider range of data while maintaining the speed of linear algorithms
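
A minimal sketch of polynomial regression, assuming noise-free synthetic data generated from a quadratic (the coefficients are made up). The model stays linear in its parameters; only the features are expanded:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic target (assumption): y = 1 + 2x - 0.5x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2

# Plain linear fit vs. linear fit on the expanded features [1, x, x^2]
linear = LinearRegression().fit(x, y_demo)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y_demo)

r2_linear = linear.score(x, y_demo)
r2_poly = poly.score(x, y_demo)
print("R^2 linear:", r2_linear, " R^2 degree-2:", r2_poly)
```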


