### INTRO

* Widely used, very scalable
* requires several hyperparameters (e.g. regularization strength, #iterations)
* sensitive to feature scaling

### SGD CLASSIFICATION

* loss functions: 'hinge' (linear SVM), 'modified_huber', 'log' (logistic regression; named 'log_loss' in scikit-learn >= 1.1)
* 'log'/'modified_huber': enable .predict_proba (returns probability estimates per sample)
* penalty options:
   * penalty='l2'; l2 norm (default)
   * penalty='l1'; l1 norm
   * penalty='elasticnet'; convex combination of l1 and l2
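
To see how the three penalties behave, here is a minimal sketch on synthetic data. The dataset, `alpha=0.01`, and the `nonzero_counts` name are assumptions made for illustration. The l1 and elasticnet penalties tend to push some weights to exactly zero, while l2 usually keeps all of them nonzero; the exact counts depend on the data and the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data (assumption): only the first 2 of 20 features carry signal.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

nonzero_counts = {}
for penalty in ("l2", "l1", "elasticnet"):
    clf = SGDClassifier(loss="hinge", penalty=penalty, alpha=0.01, random_state=0)
    clf.fit(X, y)
    # Count how many of the 20 weights are not exactly zero.
    nonzero_counts[penalty] = int(np.count_nonzero(clf.coef_))
    print(penalty, nonzero_counts[penalty])
```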

* supports multiclass by combining binary classifiers in a one-vs-all (OvA) scheme

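* coef_ = 2D (#classes,#features) for multiclass; (1,#features) for binary
* intercept_ = 1D (#classes) for multiclass; (1,) for binary
* classes_ = class labels, ascending order; row i of coef_ belongs to classes_[i]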
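
The attribute shapes can be checked on the iris dataset (3 classes, 4 features). This is a small sketch; `random_state=0` is only there to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0).fit(X, y)

print(clf.coef_.shape)       # (3, 4): one weight vector per class
print(clf.intercept_.shape)  # (3,): one intercept per class
print(clf.classes_)          # [0 1 2]: class labels in ascending order
```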

* weighted classes/instances supported (class_weight, sample_weight)
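
A minimal sketch of both weighting mechanisms on an imbalanced toy problem. The data and the 9:1 weighting are assumptions chosen for illustration; `class_weight` is passed to the constructor, `sample_weight` to `fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Imbalanced toy data (assumption): 90 negatives, 10 positives.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(90, 2), rng.randn(10, 2) + 2.0])
y = np.array([0] * 90 + [1] * 10)

# Per-class weights: "balanced" reweights inversely to class frequency.
clf_cw = SGDClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Per-sample weights: here every positive sample counts 9 times as much.
sample_weight = np.where(y == 1, 9.0, 1.0)
clf_sw = SGDClassifier(random_state=0).fit(X, y, sample_weight=sample_weight)

print(clf_cw.predict(X[-5:]))
print(clf_sw.predict(X[-5:]))
```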

[SGD: max margin separating hyperplane](plot_sgd_separating_hyperplane.ipynb) |
[SGD: multiclass, iris dataset](plot_sgd_iris.ipynb) |
[SGD: weighted samples](plot_sgd_weighted_samples.ipynb) |
[SGD: various online solvers](plot_sgd_comparison.ipynb) |
[SVM: unbalanced classes](plot_separating_hyperplane_unbalanced.ipynb)

[SGD: sparse data: text doc classification](document_classification_20newsgroups.ipynb)

In [5]:
from sklearn.linear_model import SGDClassifier
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2")
clf.fit(X, y)
print(clf.predict([[2.0, 2.0]]))
print(clf.coef_)
print(clf.intercept_)
print(clf.decision_function([[2.0, 2.0]])) # signed score, proportional to distance from hyperplane

[1]
[[ 9.91080278  9.91080278]]
[-9.99002993]
[ 29.65318117]
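
The outputs above are consistent with the decision function being the linear score `coef_ . x + intercept_` (2 * 9.91 * 2.0 - 9.99 ≈ 29.65). A small sketch that checks this relation directly; the `manual` and `predicted` names are illustrative, and the fitted numbers will differ from run to run unless `random_state` is fixed.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0).fit(X, y)

X_new = np.array([[2.0, 2.0]])
scores = clf.decision_function(X_new)

# Recompute the score by hand: w . x + b
manual = (X_new @ clf.coef_.T + clf.intercept_).ravel()

# Binary case: a positive score maps to classes_[1], otherwise classes_[0].
predicted = clf.classes_[(scores > 0).astype(int)]
print(scores, manual, predicted)
```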


In [6]:
clf = SGDClassifier(loss="log").fit(X, y)  # use loss="log_loss" in scikit-learn >= 1.1
clf.predict_proba([[1., 1.]])

array([[ 0.00459185,  0.99540815]])

### SGD REGRESSION

* Use case: very large #training samples (>10K).
* loss function set via the loss parameter: 'squared_loss' (named 'squared_error' in scikit-learn >= 1.0), 'huber', 'epsilon_insensitive'
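
A minimal sketch comparing two of the loss options on synthetic data with a few injected outliers. The data, `true_w`, and the `fitted` dict are assumptions for illustration. Only 'huber' and 'epsilon_insensitive' are used here because their names are the same across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic linear data (assumption) with 10 large outliers in the target.
rng = np.random.RandomState(0)
X = rng.randn(500, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.randn(500)
y[:10] += 20.0

fitted = {}
for loss in ("huber", "epsilon_insensitive"):
    reg = SGDRegressor(loss=loss, random_state=0).fit(X, y)
    fitted[loss] = reg.coef_
    print(loss, reg.coef_)
```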

[API](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) |
[demo: prediction latency](plot_prediction_latency.ipynb)


In [7]:
import numpy as np
from sklearn import linear_model
n_samples, n_features = 10, 5
np.random.seed(0)
y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)
clf = linear_model.SGDRegressor()
clf.fit(X, y)

SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)

### SPARSE DATA

* any scipy.sparse format OK; use scipy.sparse.csr_matrix for best results
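
A short sketch of fitting on a `scipy.sparse.csr_matrix`. The random, roughly 90% zero matrix and the labels are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

# Toy data (assumption): 100 samples, 50 features, about 90% zeros.
rng = np.random.RandomState(0)
dense = rng.randn(100, 50)
dense[rng.rand(100, 50) < 0.9] = 0.0
y = (dense.sum(axis=1) > 0).astype(int)

X_sparse = csr_matrix(dense)  # CSR format, as recommended above
clf = SGDClassifier(random_state=0).fit(X_sparse, y)

# Prediction accepts the sparse matrix directly as well.
pred = clf.predict(X_sparse)
print(X_sparse.nnz, pred[:10])
```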

[demo: text doc classification](document_classification_20newsgroups.ipynb) (note: this notebook is currently bugged)

### TIPS

* sensitive to feature scaling -- rescale each feature to [0,1] or [-1,+1], or standardize to mean 0.0 / variance 1.0
* find a reasonable alpha using [GridSearchCV](), range 10.0**-np.arange(1,7) = .1,.01,.001,.0001,.00001,.000001
* typical convergence after ~10^6 samples; a reasonable first guess (not the default) is n_iter = np.ceil(10**6/#samples); the parameter is named max_iter in scikit-learn >= 0.19
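
The alpha search and the iteration heuristic can be sketched together. This assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection` and the iteration count is set with `max_iter`; the iris data and `cv=3` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate regularization strengths: 1e-1 down to 1e-6.
alphas = 10.0 ** -np.arange(1, 7)

# Heuristic: about 10^6 sample visits in total, so epochs = ceil(10^6 / n).
n_samples = X.shape[0]
first_guess_iters = int(np.ceil(10**6 / n_samples))

search = GridSearchCV(
    SGDClassifier(max_iter=first_guess_iters, random_state=0),
    param_grid={"alpha": alphas},
    cv=3,
)
search.fit(X, y)
print(first_guess_iters, search.best_params_)
```

With the default `tol`, training may stop well before `max_iter` epochs, so the heuristic acts as an upper bound here.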

In [2]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#scaler.fit(X_train)  # Don't cheat - fit only on training data
#X_train = scaler.transform(X_train)
#X_test = scaler.transform(X_test)  # apply same transformation to test data
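
The cell above leaves the fit/transform calls commented out because no `X_train` is defined. A runnable variant, sketched under the assumption that a pipeline is acceptable: `make_pipeline` fits the scaler on the training split only and reapplies the same transformation at prediction time. The iris data and the 70/30 split are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The scaler's mean and variance are learned from X_train only.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
model.fit(X_train, y_train)

# X_test is scaled with the training statistics before scoring.
accuracy = model.score(X_test, y_test)
print(accuracy)
```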