# Stochastic methods - helpful code and notes for Linear, Ridge, Lasso, Elastic Net, Logistic regression and SVM.

This brief notebook looks at stochastic methods for analysis in machine learning. It focuses primarily on methods that use stochastic gradient descent as the optimization algorithm. It also looks briefly at linear regression as a building block for the rest of the methods. It covers regularized versions of regression (Ridge, Lasso, ElasticNet) before adapting these to account for polynomial relationships. From there it moves on to classifier models, logistic regression and SVM, which are the next logical step from linear regression models.

### Package and prerequisite import

In [6]:
import pandas as pd
import sklearn
import numpy as np

# Prediction

### Simple Linear Regression

In [16]:
from sklearn.linear_model import LinearRegression
X = 2 * np.random.rand(100,1) #create a 100x1 vector of instances
y = 4 + 3 * X + np.random.randn(100,1)

In [33]:
lin_reg = LinearRegression()
lin_reg.fit(X,y)
print(lin_reg.intercept_, lin_reg.coef_)
X_new = np.array([[0], [2]]) #predict requires a 2D array, can reshape data if only one feature but easier to just predict 2
lin_reg.predict(X_new)

[3.82929424] [[3.15667259]]


array([[ 3.82929424],
       [10.14263941]])

- As can be seen, if x_i = 0 then ŷ is simply equal to the intercept.
- scikit-learn's LinearRegression uses SVD (singular value decomposition) as the matrix factoring technique.
- This approach is less computationally complex than using the normal equation: roughly O(n^2) versus roughly O(n^3), where n is the number of features.
- This will still be very slow as n grows very large.
- Simple linear regression also has no regularization and is therefore prone to overfitting.
- The computational cost can be addressed by replacing the normal equation (or similar) with a gradient descent algorithm, and the overfitting by adding a regularization term to the loss. There are three main types of regularized regression, which can be seen below.
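
To make the comparison above concrete, here is a minimal NumPy sketch that solves the same kind of toy problem with the normal equation and with SVD-based least squares. The seeded data and the names `X_b`, `theta_normal`, `theta_svd` and `theta_pinv` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy data in the same shape as the example above (seeded so it is repeatable).
rng = np.random.default_rng(0)
X_toy = 2 * rng.random((100, 1))
y_toy = 4 + 3 * X_toy + rng.standard_normal((100, 1))

# Add a column of ones so the first coefficient acts as the intercept.
X_b = np.c_[np.ones((100, 1)), X_toy]

# Normal equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_toy

# SVD-based least squares (no explicit inverse of X^T X is formed).
theta_svd, residuals, rank, singular_values = np.linalg.lstsq(X_b, y_toy, rcond=None)

# The pseudoinverse (also computed via SVD) gives the same solution.
theta_pinv = np.linalg.pinv(X_b) @ y_toy

print(theta_normal.ravel(), theta_svd.ravel(), theta_pinv.ravel())
```

On a well-conditioned problem like this one, all three should agree to numerical precision; the difference is in cost and numerical stability as the number of features grows.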

### Stochastic methods

Firstly, it must be noted that stochastic methods are not a family of machine learning models; they are simply an optimization technique. You can fit a linear regression without stochastic gradient descent by using the normal equation, but it is not recommended. It is very rare that we will need to use an unregularized linear regression.

Gradient descent has three main variations: batch gradient descent (full GD), stochastic gradient descent and mini-batch gradient descent. Batch gradient descent computes the gradient over every observation in the training set at each step and then adjusts the coefficients. Stochastic gradient descent picks a single observation at random at each step and updates the coefficients from that observation alone; mini-batch is a hybrid of the two and uses small random batches. Batch gradient descent of course takes the longest per step but converges most exactly to the minimum. Stochastic and mini-batch bounce around the minimum but will end up at effectively the same value if enough time is allowed and the learning rate is gradually reduced. The advantage of the latter two techniques is that they are considerably faster.
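
The three variations differ only in how many observations feed each gradient step, which can be sketched in plain NumPy. This is a minimal illustration under assumptions: the function `fit_linear_gd`, the constant learning rate `eta` and the epoch count are all made up for the example, and a constant learning rate means the stochastic variants hover near the minimum rather than settling exactly on it.

```python
import numpy as np

def fit_linear_gd(X, y, batch_size, n_epochs=300, eta=0.05, seed=0):
    """Fit a linear model by gradient descent on the mean squared error.

    batch_size == len(X)  -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    X_b = np.c_[np.ones((len(X), 1)), X]      # prepend the intercept column
    theta = np.zeros((X_b.shape[1], 1))
    m = len(X_b)
    for _ in range(n_epochs):
        order = rng.permutation(m)            # reshuffle the instances every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            X_i, y_i = X_b[idx], y[idx]
            # Gradient of the MSE computed on the current batch only.
            gradient = 2 / len(idx) * X_i.T @ (X_i @ theta - y_i)
            theta -= eta * gradient
    return theta.ravel()

rng = np.random.default_rng(42)
X_gd = 2 * rng.random((100, 1))
y_gd = 4 + 3 * X_gd + rng.standard_normal((100, 1))

theta_batch = fit_linear_gd(X_gd, y_gd, batch_size=len(X_gd))
theta_sgd = fit_linear_gd(X_gd, y_gd, batch_size=1)
theta_mini = fit_linear_gd(X_gd, y_gd, batch_size=16)
print(theta_batch, theta_sgd, theta_mini)
```

All three estimates should land near the least-squares solution, with the batch version closest and the single-instance version noisiest.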

Gradient descent is sensitive to feature scaling (in terms of computational time); therefore, features should always be normalised or standardised before fitting the model.
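
One way to make sure scaling always happens is to put the scaler and the SGD model in a single pipeline. The sketch below assumes two made-up features on very different scales; the names `X_raw`, `y_raw` and `scaled_sgd` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features: one in roughly [0, 1], the other in roughly [0, 1000].
X_raw = np.c_[rng.random(300), 1000 * rng.random(300)]
y_raw = 3 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + 0.1 * rng.standard_normal(300)

# StandardScaler centres each feature and scales it to unit variance,
# so SGD sees both features on a comparable scale.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
scaled_sgd.fit(X_raw, y_raw)
r2_scaled = scaled_sgd.score(X_raw, y_raw)
print(r2_scaled)
```

Because the scaler is fitted inside the pipeline, the same transformation is applied automatically at prediction time.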


### Regularised linear regression (Ridge, Lasso, Elastic Net)

The three main types of regularized linear regression are Ridge, Lasso and Elastic Net. These can all be accessed through scikit-learn's stochastic models.

In [59]:
from sklearn.linear_model import SGDRegressor
# "squared_error" is the default loss; scikit-learn versions before 1.0 call it "squared_loss"
ridge = SGDRegressor(loss="squared_error", penalty="l2", max_iter=500) #ridge uses l2
lasso = SGDRegressor(loss="squared_error", penalty="l1", max_iter=500) #lasso uses l1
elastic_net = SGDRegressor(loss="squared_error", penalty="elasticnet", max_iter=500) #hybrid of l1 and l2

In [80]:
ridge.fit(X,y.ravel()) #ravel flattens y from shape (100,1) to the 1D array that fit expects; the fit is the same here, but it is a useful habit.
ridge.predict(X_new)

array([ 3.46845465, 10.37826369])

In [81]:
lasso.fit(X,y.ravel()) #And it stops the ugly warning coming up.
lasso.predict(X_new)

array([ 3.46090492, 10.39331964])

In [82]:
elastic_net.fit(X,y.ravel()) 
elastic_net.predict(X_new)

array([ 3.47242528, 10.36771542])

As can be seen, the results from all three models are very similar with such simple data. This will not be the case with more complicated data.

Important arguments:
- loss: default = "squared_error" (called "squared_loss" in scikit-learn versions before 1.0); alternatively "huber" modifies OLS to focus less on extreme outliers.
- penalty: This is the regularisation term added to the loss function. "l1" is the sum of the absolute values of the coefficients (the Manhattan norm of the weight vector) and is used in Lasso regression. "l2" (default) is the sum of the squared coefficients (the squared Euclidean norm) and is used in Ridge. "elasticnet" is a hybrid of both l1 and l2. The ratio is controlled by "l1_ratio" and defaults to 0.15.
- alpha: The constant that multiplies the regularisation term; as alpha increases so does the regularisation. Defaults to 0.0001.
- learning_rate: defaults to "invscaling" for SGDRegressor and "optimal" for SGDClassifier. Should rarely be adjusted.
- early_stopping: can be used to stop training when the validation score stops improving.
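
The arguments above can be combined in a single estimator. The following is a sketch under assumptions: the toy data, the names `X_demo`, `y_demo` and `enet_sgd`, and the particular values chosen for `l1_ratio`, `alpha` and `validation_fraction` are illustrative, not recommendations. The `loss` argument is left at its default (squared error) on purpose.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = 2 * rng.random((200, 1))
y_demo = 4 + 3 * X_demo[:, 0] + rng.standard_normal(200)

enet_sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(
        penalty="elasticnet",     # mix of l1 and l2
        l1_ratio=0.5,             # half l1, half l2 (illustrative value)
        alpha=0.001,              # regularisation strength (illustrative value)
        early_stopping=True,      # hold out part of the training data ...
        validation_fraction=0.2,  # ... here 20% of it ...
        n_iter_no_change=5,       # ... and stop after 5 epochs without improvement
        max_iter=1000,
        tol=1e-3,
        random_state=0,
    ),
)
enet_sgd.fit(X_demo, y_demo)

# Number of epochs actually run before early stopping kicked in.
n_epochs_run = enet_sgd.named_steps["sgdregressor"].n_iter_
r2_demo = enet_sgd.score(X_demo, y_demo)
print(n_epochs_run, r2_demo)
```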

Important things to remember
- Lasso regression and elastic net (with a high ratio of L1) effectively do feature selection: they push the coefficients of the least important variables towards zero. This is useful when you think that only a few features are actually important.
- Instances must be independent and identically distributed; therefore, shuffling is suggested (shuffle=True is the default).
- ElasticNet is almost always preferred to lasso (even though they are identical when l1_ratio = 1) because lasso can sometimes be erratic when there are more features than instances or when several features are strongly correlated.
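
The feature-selection effect of the l1 penalty can be seen on synthetic data where only two of ten features matter. Note that this sketch swaps in scikit-learn's coordinate-descent `Lasso` and closed-form `Ridge` estimators rather than `SGDRegressor`, because they make the exact zeros easy to see; the data and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_sparse = rng.standard_normal((200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y_sparse = 3.0 * X_sparse[:, 0] - 2.0 * X_sparse[:, 1] + rng.standard_normal(200)

lasso_cd = Lasso(alpha=0.5).fit(X_sparse, y_sparse)
ridge_cf = Ridge(alpha=0.5).fit(X_sparse, y_sparse)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso_cd.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_cf.coef_ == 0))
print("lasso zeros:", n_zero_lasso, "ridge zeros:", n_zero_ridge)
print(np.round(lasso_cd.coef_, 2))
```

Lasso should set most of the noise coefficients to exactly zero, while ridge only shrinks them towards zero.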

### Polynomial regression

Polynomial regression expands on linear regression by allowing features to mix with each other and themselves. If there were two features a and b, a polynomial regression (degree = 2) would consist of the features a, b, ab, a^2 and b^2. This allows the fitted relationship to be curved, not just a straight line.

Polynomial regressions therefore significantly increase the chance of overfitting, so regularisation is a must.

Apart from the added feature combinations, the process is the same as linear regression. The most appropriate way to get polynomial features is to use sklearn's preprocessing module.

In [84]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(2)
poly.fit_transform(X)
poly = PolynomialFeatures(interaction_only=True) #can choose interaction terms only (no powers of a single feature)
poly.fit_transform(X)

array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

Following this step, the models used before can be fitted in the same way on the transformed features.
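
As a rough sketch of what that looks like end to end, the pipeline below chains the polynomial expansion, scaling and an l2-regularised SGD model, and compares it with a straight-line fit on the same data. The quadratic toy data and the names `X_quad`, `y_quad`, `poly_model` and `linear_model` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_quad = 6 * rng.random((200, 1)) - 3
# Hypothetical quadratic relationship plus noise.
y_quad = 0.5 * X_quad[:, 0] ** 2 + X_quad[:, 0] + 2 + 0.5 * rng.standard_normal(200)

# Expand to [x, x^2], scale, then fit a ridge-style (l2) SGD model.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=0),
)
poly_model.fit(X_quad, y_quad)

# Same model without the polynomial step, for comparison.
linear_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=0),
)
linear_model.fit(X_quad, y_quad)

r2_poly = poly_model.score(X_quad, y_quad)
r2_linear = linear_model.score(X_quad, y_quad)
print(r2_poly, r2_linear)
```

The polynomial pipeline should explain noticeably more of the variance than the straight-line fit, since the underlying relationship is curved.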

# Classification

### Logistic Regression

- Logistic regression is also most commonly fitted using gradient descent algorithms, as there is no closed-form solution.
- Just like linear regression, logistic regression computes a weighted sum of the input features. Instead of outputting that sum directly, however, it outputs the logistic (sigmoid) of the result, a value between 0 and 1 that is read as a probability.
- Log reg can be (and should be) regularized in the same fashion as linear regression.
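
The "weighted sum, then logistic" step can be written out directly. This is a minimal sketch: the functions `sigmoid` and `predict_proba_manual` and the weights and bias are hypothetical values picked for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba_manual(X, weights, bias):
    """Weighted sum of the features, passed through the logistic function."""
    return sigmoid(X @ weights + bias)

X_points = np.array([[-1.0, -1.0], [2.0, 1.0]])
weights = np.array([1.5, 0.5])   # hypothetical coefficients
bias = 0.0                       # hypothetical intercept

proba = predict_proba_manual(X_points, weights, bias)
labels = (proba >= 0.5).astype(int)   # threshold at 0.5 to get a class
print(proba, labels)
```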

In [89]:

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
#could also introduce polynomial variables into the pipeline here too, or fit_transform X beforehand.
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
# Always scale the input. The most convenient way is to use a pipeline.
log_clf = make_pipeline(StandardScaler(),
                        SGDClassifier(max_iter=1000, tol=1e-3, loss="log_loss")) #logistic regression; this loss is called "log" before scikit-learn 1.1, as in the output below
log_clf.fit(X, Y)

#print(log_clf.predict([[-0.8, -1]]))

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('sgdclassifier',
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='log',
                               max_iter=1000, n_iter_no_change=5, n_jobs=None,
                               penalty='l2', power_t=0.5, random_state=None,
                               shuffle=True, tol=0.001, validation_fraction=0.1,
                               verbose=0, warm_start=False))],
         verbose=False)

As can be seen, a lot of the arguments are the same as for the regularized linear regressions. Penalty is automatically set to l2, but can be changed to "elasticnet", with the proportion set by "l1_ratio". Shuffle is also on by default.

### Support Vector Machines

- Support vector machines can be employed in the same way as logistic regression by changing the loss to "hinge".
- SVMs are particularly useful for small to medium sized complex datasets.
- SVMs are strongly sensitive to feature scales, so features must be standardised beforehand.
- In the SVC and LinearSVC classes the model is regularized by controlling "C", the "width of the road": a smaller C gives a wider road and more regularization. With SGDClassifier the equivalent control is "alpha".
- More flexibility can be achieved with the SVC model ("from sklearn.svm import SVC").
- SVMs work on linear separability, so introducing polynomial features works well here.
- The "kernel trick" gives the same result as adding these polynomial features without actually creating them, and therefore without the extra computing power.
- Similarity functions are also a really useful way to add features; the Gaussian radial basis function (RBF) is the most commonly used.
- The most common and simplest approach is to create a landmark for each instance.
- Really good documentation here. 
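
The points above can be sketched on a small non-linear dataset: a linear SVM trained by SGD with the hinge loss, a kernelised `SVC`, and explicit Gaussian RBF similarity features with one landmark per instance. The dataset (`make_moons`) and the values `gamma=2.0` and `C=1.0` are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import SGDClassifier
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Linear SVM fitted with stochastic gradient descent (hinge loss).
linear_svm = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42),
)
linear_svm.fit(X_moons, y_moons)

# Kernelised SVM: the RBF kernel handles the non-linear boundary.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=2.0, C=1.0))
rbf_svm.fit(X_moons, y_moons)

linear_acc = linear_svm.score(X_moons, y_moons)
rbf_acc = rbf_svm.score(X_moons, y_moons)
print(linear_acc, rbf_acc)

# Explicit similarity features: one landmark per instance, so each row holds
# exp(-gamma * squared distance) from that instance to every landmark.
X_scaled = StandardScaler().fit_transform(X_moons)
similarity_features = rbf_kernel(X_scaled, X_scaled, gamma=2.0)
print(similarity_features.shape)
```

The accuracies here are measured on the training data, so they only show how flexible each decision boundary is, not how well the models generalise.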
