### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm for finding the parameter (coefficient) values that minimize a cost function. It is commonly used for discriminative learning of linear classifiers under convex loss functions, such as linear Support Vector Machines (SVM) and logistic regression. It has been successfully applied to large-scale datasets because the coefficients are updated after each training instance, rather than once after a full pass over all instances.
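
The per-instance update described above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name sgd_least_squares, the fixed learning rate lr, the squared loss and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, n_epochs=50, seed=0):
    """Fit y ~ X @ w + b with plain per-sample SGD on the squared loss.

    Hypothetical helper for illustration; a constant learning rate is assumed.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        # Visit the training instances in a fresh random order each epoch.
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]
            # Gradient of 0.5 * error**2 with respect to w and b,
            # computed from this single instance only.
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# Noiseless synthetic data with known coefficients (assumed for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b = sgd_least_squares(X, y)
print(w, b)
```

Each inner-loop step touches one training instance, so the cost of an update does not grow with the size of the dataset.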



### SGD Classifier
The Stochastic Gradient Descent (SGD) classifier implements a plain SGD learning routine that supports various loss functions and penalties for classification. Scikit-learn provides the SGDClassifier class to perform SGD classification.
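
As a rough illustration of "various loss functions and penalties", the sketch below fits the same toy data with a few combinations of the loss and penalty parameters. The particular pairs and the random_state value are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

# (loss, penalty) pairs chosen only for illustration.
settings = [
    ("hinge", "l2"),               # linear SVM-style loss
    ("modified_huber", "l1"),      # smoothed hinge loss, sparse weights
    ("perceptron", "elasticnet"),  # perceptron loss, mixed L1/L2 penalty
]

predictions = {}
for loss, penalty in settings:
    clf = SGDClassifier(loss=loss, penalty=penalty,
                        max_iter=1000, tol=1e-3, random_state=0)
    clf.fit(X, y)
    predictions[(loss, penalty)] = clf.predict([[2.0, 2.0]])
    print(loss, penalty, predictions[(loss, penalty)])
```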


#### Implementation Example
Like other classifiers, SGDClassifier has to be fitted with the following two arrays:
1.   An array X holding the training samples. It is of size [n_samples, n_features].
2.   An array Y holding the target values, i.e. the class labels for the training samples. It is of size [n_samples].

The following Python script uses the SGDClassifier linear model:

In [1]:
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
SGDClf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet")
SGDClf.fit(X, Y)

In [2]:
SGDClf.predict([[2.,2.]])

array([2])
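
To see where such a prediction comes from, the fitted linear model can be inspected. The following is a minimal sketch that refits the same data under the shorter name clf, with random_state=0 added as an assumption for reproducibility. It compares decision_function with the score computed by hand from coef_ and intercept_.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet",
                    random_state=0)
clf.fit(X, Y)

print(clf.classes_)          # class labels seen during fit
print(clf.coef_.shape)       # (1, n_features) for a binary problem
print(clf.intercept_.shape)  # (1,)

# Linear score for a new sample; a positive score maps to clf.classes_[1].
sample = np.array([[2.0, 2.0]])
score = clf.decision_function(sample)
manual = sample @ clf.coef_.T + clf.intercept_
print(score, manual.ravel())
```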

### SGD Regressor

The Stochastic Gradient Descent (SGD) regressor implements a plain SGD learning routine that supports various loss functions and penalties to fit linear regression models. Scikit-learn provides the SGDRegressor class to perform SGD regression.

##### Parameters
The parameters used by SGDRegressor are almost the same as those used by SGDClassifier. The difference lies in the ‘loss’ parameter. For SGDRegressor, the possible values of the ‘loss’ parameter are as follows:
1. squared_loss: It refers to the ordinary least squares fit. (This option is named squared_error in scikit-learn 1.0 and later.)
2. huber: It switches from squared to linear loss past a distance of epsilon. In effect, ‘huber’ modifies ‘squared_loss’ so that the algorithm focuses less on correcting outliers.
3. epsilon_insensitive: It ignores errors smaller than epsilon and is linear past that.
4. squared_epsilon_insensitive: It is the same as epsilon_insensitive, except that it becomes squared loss past a tolerance of epsilon.

Another difference is that the parameter named ‘power_t’ has a default value of 0.25 rather than 0.5 as in SGDClassifier. Furthermore, SGDRegressor doesn’t have the ‘class_weight’ and ‘n_jobs’ parameters.
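
The loss options listed above can be written out as functions of the residual r = prediction - target. This is a sketch using the usual textbook forms. The function names are hypothetical, and the scaling constants are assumptions that may differ from the library's internals.

```python
import numpy as np

def squared_loss(r):
    # Ordinary least squares loss.
    return 0.5 * r ** 2

def huber(r, epsilon):
    # Squared for |r| <= epsilon, linear beyond it, so outliers weigh less.
    a = np.abs(r)
    return np.where(a <= epsilon, 0.5 * r ** 2, epsilon * a - 0.5 * epsilon ** 2)

def epsilon_insensitive(r, epsilon):
    # Errors smaller than epsilon are ignored; linear past that.
    return np.maximum(0.0, np.abs(r) - epsilon)

def squared_epsilon_insensitive(r, epsilon):
    # Same dead zone as above, but squared past the tolerance.
    return np.maximum(0.0, np.abs(r) - epsilon) ** 2

r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(squared_loss(r))
print(huber(r, epsilon=1.0))
print(epsilon_insensitive(r, epsilon=1.0))
print(squared_epsilon_insensitive(r, epsilon=1.0))
```

For a residual of 3 with epsilon equal to 1, the squared loss is 4.5 while the Huber loss is only 2.5, which shows how the linear tail reduces the influence of outliers.
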
##### Attributes
The attributes of SGDRegressor are the same as those of SGDClassifier. In addition, it has the following three attributes:
1. average_coef_: array, shape (n_features,)
As the name suggests, it provides the averaged weights assigned to the features.
2. average_intercept_: array, shape (1,)
As the name suggests, it provides the averaged intercept term.
3. t_: int
It provides the number of weight updates performed during the training phase.

Note: the attributes average_coef_ and average_intercept_ are available only when the parameter ‘average’ is set to True.

In [3]:
import numpy as np
from sklearn import linear_model
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
SGDReg = linear_model.SGDRegressor(max_iter=1000, penalty="elasticnet", loss='huber', tol=1e-3, average=True)
SGDReg.fit(X, y)
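
After fitting, the model can be inspected and used for prediction. The sketch below refits the same random data under the shorter name reg, with random_state=0 added as an assumption, and reads only coef_, intercept_, n_iter_ and t_. The averaged attributes are left out on purpose, since their availability may depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

reg = SGDRegressor(max_iter=1000, penalty="elasticnet", loss="huber",
                   tol=1e-3, average=True, random_state=0)
reg.fit(X, y)

print(reg.coef_.shape)       # (n_features,)
print(reg.intercept_.shape)  # (1,)
print(reg.n_iter_)           # number of epochs run before stopping
print(reg.t_)                # weight-update counter after training
print(reg.predict(X[:2]))    # predictions for the first two samples
```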

### Pros and Cons of SGD

**The pros of SGD are as follows:**
1. Stochastic Gradient Descent (SGD) is very efficient.
2. It is easy to implement and offers many opportunities for code tuning.

**The cons of SGD are as follows:**
1. Stochastic Gradient Descent (SGD) requires several hyperparameters, such as the regularization parameter and the number of iterations.
2. It is sensitive to feature scaling.
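
Both drawbacks can be handled in a few lines. The following is a minimal sketch, assuming synthetic data with deliberately mismatched feature scales and an arbitrary grid of alpha values: the features are standardized inside a pipeline, and the regularization strength is chosen by cross-validated grid search.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic data: the second feature is about 1000 times larger than the first.
X = np.column_stack([rng.randn(200), 1000.0 * rng.randn(200)])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

# Scaling happens inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)

# The alpha grid is an arbitrary choice for illustration.
param_grid = {"sgdclassifier__alpha": [1e-5, 1e-4, 1e-3, 1e-2]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Keeping the scaler in the pipeline means the same transformation is applied automatically when predict is called on new data.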