## Content

**Supervised**<br>
[Linear Regression](#LinearRegression)<br>
[Lasso Regression](#LassoRegression)<br>
[Ridge Regression](#RidgeRegression)<br>
[Logistic Regression](#LogisticRegression)<br>
[Decision Tree](#DecisionTree)<br>
[K Nearest Neighbors](#KNN)<br>
[Bagging](#Bagging)<br>
[Random Forest](#RandomForest)<br>
[Gradient Boosting](#GradientBoost)<br>
[AdaBoost](#AdaBoost)<br>
[SVM](#SVM) (Support Vector Machines)<br>
[Tf-Idf](#TfIdf)<br>

**Unsupervised**<br>
[K Means](#KMeans)<br>
[Hierarchical Clustering](#HierarchicalClustering)<br>
[PCA](#PCA) (Principal Component Analysis)<br>
[SVD](#SVD) (Singular Value Decomposition)<br>
[NMF](#NMF) (Non-negative Matrix Factorization)<br>

Imports

In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Data

In [2]:
from sklearn.datasets import load_iris
data = load_iris()
y_all = data.target
X_all = data.data

Train Test Split

In [3]:
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=42)

<a id='LinearRegression'></a>
### Linear Regression

[LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) - sklearn <br>
[OLS](http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html) - statsmodels

- supervised
- needs intercept: yes (can use fit_intercept=True in sklearn)
- needs dummies: yes (drop one)
- needs normalization: no (ordinary least squares predictions do not depend on feature scale)
- needs numeric: yes
- target type: numeric

*get rid of collinear features*

In [4]:
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(X,y)
y_pred = model.predict(X)

model.coef_
model.intercept_
r_squared = model.score(X,y) #for regressors score returns the coefficient of determination R^2, not accuracy

In [5]:
from statsmodels.regression.linear_model import OLS
import statsmodels.api as sm

X_intercept = sm.add_constant(X)

model = OLS(y,X_intercept)
results = model.fit()
results.summary()
results.params

array([ 0.21647142, -0.11268404, -0.05848178,  0.26393642,  0.5272465 ])

<a id='LassoRegression'></a>
### Lasso Regression
[Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

- supervised
- needs intercept: yes
- needs dummies: yes (drop one)
- needs normalization: yes (the l1 penalty depends on feature scale, so standardize first)
- needs numeric: yes
- target type: numeric

*alpha is l1 penalty, alpha = 0 is same as linear regression*

In [6]:
from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=True, alpha=1.)
model.fit(X,y)
y_pred = model.predict(X)

model.coef_
model.intercept_
r_squared = model.score(X,y) #for regressors score returns the coefficient of determination R^2, not accuracy
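
The effect of the l1 penalty is easiest to see by sweeping alpha and counting the coefficients that survive. The following is a minimal sketch, assuming the iris data used throughout; the alpha grid and the names `X_demo`, `y_demo` and `nonzero_counts` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso

X_demo, y_demo = load_iris(return_X_y=True)

# Count the non-zero coefficients as the l1 penalty grows
nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000)
    lasso.fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))

print(nonzero_counts)
```

A large enough alpha sets every coefficient to zero, which is why Lasso is often used for feature selection.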

<a id='RidgeRegression'></a>
### Ridge Regression

[Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)

- supervised
- needs intercept: yes
- needs dummies: yes (drop one)
- needs normalization: yes (the l2 penalty depends on feature scale, so standardize first)
- needs numeric: yes
- target type: numeric

*alpha is l2 penalty, alpha = 0 is same as linear regression*

In [7]:
from sklearn.linear_model import Ridge

model = Ridge(fit_intercept=True, alpha=1.)
model.fit(X,y)
y_pred = model.predict(X)

model.coef_
model.intercept_
r_squared = model.score(X,y) #for regressors score returns the coefficient of determination R^2, not accuracy
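
Ridge has a closed-form solution, which makes the role of alpha explicit. The sketch below assumes the iris data and compares a hand-computed solution on centered data with sklearn's `Ridge`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge

X_demo, y_demo = load_iris(return_X_y=True)
alpha = 1.0

# Center the data so the intercept is not penalized
X_centered = X_demo - X_demo.mean(axis=0)
y_centered = y_demo - y_demo.mean()

# Solve (X^T X + alpha * I) w = X^T y
gram = X_centered.T.dot(X_centered) + alpha * np.eye(X_demo.shape[1])
manual_coef = np.linalg.solve(gram, X_centered.T.dot(y_centered))
manual_intercept = y_demo.mean() - X_demo.mean(axis=0).dot(manual_coef)

ridge = Ridge(alpha=alpha, fit_intercept=True).fit(X_demo, y_demo)
print(np.allclose(manual_coef, ridge.coef_, atol=1e-6))
```

With alpha = 0 the extra term disappears and the system reduces to the normal equations of ordinary least squares.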

<a id='LogisticRegression'></a>
### Logistic Regression
[LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

- supervised
- needs intercept: yes
- needs dummies: yes (drop one)
- needs normalization: recommended (the default l2 penalty depends on feature scale)
- needs numeric: yes
- target type: class

*C is inverse of regularization strength (positive float), smaller C means more regularization*

In [8]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=True,
                           penalty='l2', C=10.)
model.fit(X,y)
y_pred = model.predict(X)

model.coef_
model.intercept_
accuracy = model.score(X,y)
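
To see what "smaller C means more regularization" does in practice, one can compare the size of the fitted coefficients across a few values of C. This is a rough sketch on the iris data; the grid of C values and the names `coef_norms` and `clf` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

# Norm of the coefficient matrix for weak, default and strong regularization
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

print(coef_norms)
```

The coefficient norm grows with C because the penalty term carries less weight relative to the data fit.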

<a id='DecisionTree'></a>
### Decision Tree
[DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

- supervised
- needs intercept: no
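- needs dummies: trees can deal with classes in principle, but the sklearn implementation needs categorical features encoded as numbers
- needs normalization: no
- needs numeric: yes in sklearn (no for tree algorithms in general)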
- target type: class

*criterion can be 'gini' or 'entropy'* <br>
*max_features can be used to randomize the tree*

In [9]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='gini',
                              max_depth=3,
                              max_features=None)
model.fit(X,y)
y_pred = model.predict(X)

model.classes_
model.feature_importances_
model.tree_
accuracy = model.score(X,y)
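
Both split criteria measure how mixed the class labels in a node are. A small hand-written sketch of the two measures (the helper names `gini_impurity` and `entropy` are made up for this example, they are not sklearn functions):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / float(counts.sum())
    return 1.0 - np.sum(proportions ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / float(counts.sum())
    return -np.sum(proportions * np.log2(proportions))

balanced = np.array([0, 0, 1, 1])  # maximally mixed two-class node
pure = np.array([1, 1, 1, 1])      # node with a single class

print(gini_impurity(balanced))
print(entropy(balanced))
print(gini_impurity(pure))
```

A split is chosen to reduce the weighted impurity of the child nodes as much as possible, so a pure node is never split further.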

In [10]:
#to visualize the tree export graphviz (.dot) format to current directory
from sklearn.tree import export_graphviz
export_graphviz(model, out_file='tree.dot')
#then use bash command to make it .png file

In [11]:
%%bash
dot -Tpng tree.dot -o tree.png

<img src='./tree.png' width='100'>

<a id='KNN'></a>
### K Nearest Neighbors
[KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

- supervised
- needs intercept: no
- needs dummies: yes (distances are computed on numeric features)
- needs normalization: yes (distances are scale-sensitive)
- needs numeric: yes
- target type: class

In [12]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X,y)
y_pred = model.predict(X)
y_proba = model.predict_proba(X)

model.classes_
accuracy = model.score(X,y)
kneighbors = model.kneighbors(X[:1], return_distance=False) #indices of the k nearest neighbors of the first sample
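
What the classifier does for a single query point can be reproduced by hand: compute distances, take the k closest training points, and let them vote. A minimal sketch on the iris data, assuming Euclidean distance (the sklearn default); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
query = X_demo[:1]  # first sample, kept two-dimensional
k = 5

# Euclidean distance from the query to every training point
distances = np.sqrt(((X_demo - query) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:k]

# Majority vote among the k nearest labels
votes = np.bincount(y_demo[nearest])
manual_prediction = votes.argmax()

knn = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
sklearn_prediction = knn.predict(query)[0]
print(manual_prediction == sklearn_prediction)
```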

<a id='Bagging'></a>
### Bagging
[BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) <br>
[BaggingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html)

*classifier for classification (classes), regressor for regression (continuous values)*
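
A short usage sketch in the style of the other sections, assuming the iris data and the same split settings as above. The hyperparameter values are arbitrary examples, and the base learner is left at its default (a decision tree).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Each base learner is trained on a bootstrap sample of the training data
bagging = BaggingClassifier(n_estimators=20, max_samples=1.0,
                            bootstrap=True, oob_score=True, random_state=42)
bagging.fit(X_tr, y_tr)
y_pred_demo = bagging.predict(X_te)

bagging.estimators_   # fitted base learners
bagging.oob_score_    # accuracy estimated on out-of-bag samples
test_accuracy = bagging.score(X_te, y_te)
```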

<a id='RandomForest'></a>
### Random Forest
[RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) <br>
[RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

- supervised
- needs intercept: no
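- needs dummies: trees can deal with classes in principle, but the sklearn implementation needs categorical features encoded as numbers
- needs normalization: no
- needs numeric: yes in sklearn (no for tree algorithms in general)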
- target type: class for classifier, numeric for regressor

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

model = RandomForestClassifier(n_estimators=20, criterion='gini', 
                               max_depth=3, max_features='sqrt', #older releases called this 'auto' for classifiers
                               bootstrap=True, oob_score=True,
                               random_state=None, warm_start=False)
model.fit(X,y)
y_pred = model.predict(X)

model.estimators_
model.feature_importances_
model.oob_score_
accuracy = model.score(X,y) #regressor returns coefficient of determination R^2 of the prediction

<a id='GradientBoost'></a>
### Gradient Boosting
[GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) <br>
[GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)

- supervised
- needs intercept: no
- needs dummies: yes (the sklearn trees need categorical features encoded as numbers)
- needs normalization: no
- needs numeric: yes
- target type: class for classifier, numeric for regressor

*uses **trees*** <br>
*loss is loss function to be optimized, 'deviance' is for classification, 'ls' is least squares in regression, 'lad' is least absolute deviation in regression (recent scikit-learn releases rename these to 'log_loss', 'squared_error' and 'absolute_error')* <br>
*learning_rate shrinks the contribution of each tree by learning_rate; lower learning_rate needs more n_estimators* <br>
*subsample is fraction of samples for each base learner; smaller than 1.0 results in Stochastic Gradient Boosting (reduction of variance, increase in bias)*

In [14]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingClassifier(loss='log_loss', #called 'deviance' before scikit-learn 1.1
                                   learning_rate=0.1,
                                   n_estimators=100, subsample=1.0,
                                   max_depth=3, init=None,
                                   random_state=None, max_features=None,
                                   verbose=0, max_leaf_nodes=None, warm_start=False)
model.fit(X,y)
y_pred = model.predict(X)

model.estimators_
model.feature_importances_
accuracy = model.score(X,y) #regressor returns coefficient of determination R^2 of the prediction
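
The trade-off between learning_rate and n_estimators can be inspected with `staged_predict`, which yields the ensemble's predictions after each boosting stage. A sketch under assumed settings (iris data, subsample=0.8, 50 stages); none of these values come from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# subsample < 1.0 gives stochastic gradient boosting
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50,
                                subsample=0.8, max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., 50 trees
staged_accuracy = [(y_stage == y_te).mean() for y_stage in gb.staged_predict(X_te)]
print(staged_accuracy[0])
print(staged_accuracy[-1])
```

Plotting `staged_accuracy` against the stage number is a quick way to judge whether more estimators would still help at a given learning rate.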

<a id='AdaBoost'></a>
### AdaBoost
[AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) <br>
[AdaBoostRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)

- supervised
- needs intercept: no
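- needs dummies: depends on the base estimator (yes for the default sklearn trees)
- needs normalization: depends on the base estimator (no for trees)
- needs numeric: yes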
- target type: class for classifier, numeric for regressor

*can use **trees or any other base estimators*** <br>
*learning_rate shrinks the contribution of each estimator by learning_rate; lower learning_rate needs more n_estimators* <br>

In [15]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0) #default base estimator is a depth-1 decision tree
model.fit(X,y)
y_pred = model.predict(X)

model.estimators_
model.feature_importances_
accuracy = model.score(X,y) #regressor returns coefficient of determination R^2 of the prediction

<a id='SVM'></a>
### SVM (Support Vector Machines)
[SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

*SVC stands for C-Support Vector Classification, the classifier from the Support Vector Machines family*
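
A short usage sketch in the style of the other sections, assuming the iris data. The kernel, C and gamma values are just the common defaults, and the scaling step is added because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Standardize features, then fit an RBF-kernel SVC; smaller C means more regularization
svm_model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_model.fit(X_tr, y_tr)
y_pred_demo = svm_model.predict(X_te)

test_accuracy = svm_model.score(X_te, y_te)
support_vector_count = svm_model.named_steps['svc'].support_vectors_.shape[0]
```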

<a id='TfIdf'></a>
### Tf-idf
[TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

- feature extraction (fit does not use a target)
- needs text data (single feature) as input
- target type: none (k-neighbors or k-means or similar method used for prediction/classification)

*stands for Term Frequency - Inverse Document Frequency*

In [16]:
text_data = pd.DataFrame([['After he slapped two soldiers, US Lieutenant General George S. Patton was sidelined from combat command'],
                         ['The Fringes of the Fleet is a booklet written in 1915 by Rudyard Kipling. It contains essays and poems.'],
                         ['Antarctica, on average, is the coldest, driest, and windiest continent, and has the highest average elevation of all the continents.']])
text_data.columns = ['wiki_content']
text_data

|   | wiki_content |
|---|---|
| 0 | After he slapped two soldiers, US Lieutenant G... |
| 1 | The Fringes of the Fleet is a booklet written ... |
| 2 | Antarctica, on average, is the coldest, driest... |


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='content', lowercase=True, tokenizer=None, 
                        stop_words='english', use_idf=True)
tfidf = vectorizer.fit_transform(text_data['wiki_content'])

#each token (word) is turned into a feature, get full list like this
#(get_feature_names_out replaces get_feature_names in scikit-learn 1.0+)
vectorizer.get_feature_names_out()
#tfidf is sparse matrix, to print it nicely we can convert it to an array
tfidf.toarray()

#nice output (only possible for super small sets)
pandas_tfidf = pd.DataFrame(tfidf.toarray())
pandas_tfidf.columns = vectorizer.get_feature_names_out()
pandas_tfidf.head(3)

|   | 1915 | antarctica | average | booklet | coldest | combat | command | contains | continent | continents | ... | kipling | lieutenant | patton | poems | rudyard | sidelined | slapped | soldiers | windiest | written |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.333333 | 0.333333 | 0.333333 | 0.0 | 0.0 |
| 1 | 0.316228 | 0.0 | 0.0 | 0.316228 | 0.0 | 0.0 | 0.0 | 0.316228 | 0.0 | 0.0 | ... | 0.316228 | 0.0 | 0.0 | 0.316228 | 0.316228 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316228 |
| 2 | 0.0 | 0.288675 | 0.57735 | 0.0 | 0.288675 | 0.0 | 0.0 | 0.0 | 0.288675 | 0.288675 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.288675 | 0.0 |
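
The weights above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then l2 normalization of each row) and uses three toy sentences that are not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat ran away']

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Weight counts by idf, then scale each row to unit l2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that appear in every document (here "the") get the lowest idf, so they contribute least to each row.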


<a id='KMeans'></a>
### K Means
[KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

- unsupervised
- needs intercept: no
- needs dummies: yes (works on numeric features only)
- needs normalization: yes (Euclidean distances are scale-sensitive)
- needs numeric: yes
- target type: none (y is ignored)

In [18]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=8, init='k-means++', 
               n_init=10, max_iter=300, tol=0.0001)
model.fit(X) #y is not needed, clustering is unsupervised
y_pred = model.predict(X)

model.cluster_centers_
model.labels_
model.inertia_
score = model.score(X) #negative of the K-means objective (sum of squared distances to the closest center) on X
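
The objective that K-means minimizes, reported as `inertia_`, is the sum of squared distances from each point to its assigned center. A quick check of that definition, assuming the iris data and three clusters (both are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_demo, _ = load_iris(return_X_y=True)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Look up the center assigned to each sample and sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

print(np.isclose(manual_inertia, kmeans.inertia_, rtol=1e-4))
```

Inertia always decreases as n_clusters grows, so it is usually compared across several values of k rather than read in isolation.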

<a id='HierarchicalClustering'></a>
### Hierarchical Clustering

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

vectorizer = TfidfVectorizer(input='content', lowercase=True, tokenizer=None, 
                        stop_words='english', use_idf=True)
tfidf = vectorizer.fit_transform(text_data['wiki_content'])

tfidf_array = tfidf.toarray()
dist = pdist(tfidf_array) #condensed pairwise distances, the form linkage expects (a square matrix would be treated as raw observations)
links = linkage(dist, method='complete')

In [20]:
#plot as follows
#plt.figure(figsize=(15,5))
#dendrogram(links)
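
Besides plotting the dendrogram, the linkage matrix can be cut into flat clusters with scipy's `fcluster`. A minimal sketch on four made-up 2-D points that form two obvious groups:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# pdist returns the condensed form: n * (n - 1) / 2 pairwise distances
condensed = pdist(points)
links_demo = linkage(condensed, method='complete')

# Cut the tree so that at most two flat clusters remain
flat_labels = fcluster(links_demo, t=2, criterion='maxclust')
print(flat_labels)
```

Each row of the linkage matrix records one merge, so n points always produce n - 1 rows.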

<a id='PCA'></a>
### PCA (Principal Component Analysis)
[PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

Seek an r-dimensional basis that best captures the variance in the data. The direction with the largest projected variance is the first principal component, the orthogonal direction that captures the second largest projected variance is the second principal component, and so on. In practice the eigenvectors of the covariance matrix are calculated. **PCA can be computed with SVD applied to the mean-centered data.**

- unsupervised
- dimensionality reduction / topic modelling - set number of latent features up front

*sklearn implementation centers the data and computes an SVD, keeping the top r singular values and vectors (the exact solver depends on the svd_solver setting).*

In [21]:
from sklearn.decomposition import PCA

model = PCA(n_components=2) #number of dimensions/topics required
model.fit(tfidf.toarray())
pca = model.transform(tfidf.toarray())

sum(model.explained_variance_ratio_) #two components explain 100% of variance here: 3 centered samples span at most 2 dimensions

1.0

In [22]:
print('tfidf shape: %s - articles, words' % str(tfidf.toarray().shape))
print('pca shape:   %s - articles, topics' % str(pca.shape))

tfidf shape: (3, 28) - articles, words
pca shape:   (3, 2) - articles, topics
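
The link between PCA and SVD can be checked directly: center the data, take the SVD, and project onto the first r right singular vectors. This is a sketch on random data (an assumption made to keep it self-contained); signs of individual components may differ from sklearn's, so absolute values are compared.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(20, 5)
r = 2

# PCA by hand: SVD of the mean-centered data
centered = data - data.mean(axis=0)
U_demo, s_demo, Vt_demo = np.linalg.svd(centered, full_matrices=False)
manual_scores = centered.dot(Vt_demo[:r].T)

# Share of variance per component is proportional to the squared singular values
explained_ratio = (s_demo ** 2) / (s_demo ** 2).sum()

pca_model = PCA(n_components=r).fit(data)
sklearn_scores = pca_model.transform(data)
print(np.allclose(np.abs(manual_scores), np.abs(sklearn_scores)))
```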


<a id='SVD'></a>
### SVD (Singular Value Decomposition)
[SVD](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html)

- unsupervised
- dimensionality reduction / topic modelling - returns all latent features, we can reduce the number later
- very specific use case for recommenders (see [this blog](http://sifter.org/~simon/journal/20061211.html))

In [23]:
from numpy.linalg import svd

#svd returns the right singular vectors already transposed, so V here is V^T (topics x words)
U, s, V = svd(tfidf.toarray(), full_matrices=False)

In [24]:
print('tfidf shape:              %s - articles, words' % str(tfidf.toarray().shape))
print('U (weights) shape:        %s - articles, topics' % str(U.shape))
print('s (singular value) shape: %s - topics (rank, diagonal of singular values)' % str(s.shape))
print('V (features) shape:       %s - topics, words' % str(V.shape))

tfidf shape:              (3, 28) - articles, words
U (weights) shape:        (3, 3) - articles, topics
s (singular value) shape: (3,) - topics (rank, diagonal of singular values)
V (features) shape:       (3, 28) - topics, words


In [25]:
#tfidf equals U.dot(np.diag(s)).dot(V) up to floating point error

power = s**2
cum_power = np.cumsum(power)
#smallest number of singular values (topics) whose cumulative power reaches 90% of the total
n = np.searchsorted(cum_power, 0.9 * cum_power[-1]) + 1

#limit data to keep 90% of the total power - everything is sorted, just use first n
U_lim = U[:,:n]
s_lim = s[:n]
V_lim = V[:n,:]
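
Keeping the first n singular values gives a rank-n approximation, and the power that is dropped equals the squared reconstruction error. A small sketch on a random matrix (an assumption, used only so the example runs on its own):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(6, 4)
U_demo, s_demo, Vt_demo = np.linalg.svd(A, full_matrices=False)

# Cumulative share of total power (squared singular values)
power = s_demo ** 2
cum_share = np.cumsum(power) / power.sum()

# Smallest n whose cumulative share reaches 90%
n = int(np.searchsorted(cum_share, 0.9) + 1)

# Rank-n reconstruction from the truncated factors
A_approx = U_demo[:, :n].dot(np.diag(s_demo[:n])).dot(Vt_demo[:n, :])

retained = power[:n].sum() / power.sum()
relative_error = np.linalg.norm(A - A_approx) ** 2 / np.linalg.norm(A) ** 2
print(np.isclose(retained + relative_error, 1.0))
```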

<a id='NMF'></a>
### NMF (Non-negative Matrix Factorization)
[NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)

Find two non-negative matrices W(m,r), H(r,n) whose product approximates the non-negative matrix V(m,n). r is set by user and can be substantially smaller than m or n.

- unsupervised
- dimensionality reduction / topic modelling - set number of latent features up front
- needs non-negative matrix
- best in class option for many recommendation problems
- can fit via ALS (alternating least squares) or SGD (stochastic gradient descent)
- regularize data to avoid overfitting

*the objective function is minimized with an alternating minimization of W and H*<br>
*use X.clip(min=0, max=X.max()) to ensure matrix is non-negative*

In [26]:
from sklearn.decomposition import NMF

model = NMF(n_components=2, max_iter=100) #number of dimensions/topics required
W = model.fit_transform(tfidf.toarray())
H = model.components_

#MSE mean-squared error of (V - WH)
residual = tfidf.toarray() - W.dot(H)
print((residual**2).mean())

#it is the same as the squared Frobenius norm of (V - WH) divided by the number of elements in V
#print(np.linalg.norm(residual)**2 / residual.size)

0.0119047619048


In [27]:
print('tfidf shape: %s - articles, words' % str(tfidf.toarray().shape))
print('W shape:     %s - articles, topics' % str(W.shape))
print('H shape:     %s - topics, words' % str(H.shape))

tfidf shape: (3, 28) - articles, words
W shape:     (3, 2) - articles, topics
H shape:     (2, 28) - topics, words
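
The alternating minimization of W and H can be sketched in a few lines of numpy. Note that this uses the multiplicative update rules of Lee and Seung rather than ALS or SGD, and it is not what sklearn runs by default; the function name `nmf_multiplicative` and the random test matrix are made up for illustration.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (m, n) by W (m, r) times H (r, n)."""
    rng = np.random.RandomState(seed)
    m, n = V.shape
    W = rng.rand(m, r)
    H = rng.rand(r, n)
    errors = []
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed; factors stay non-negative
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
        W *= V.dot(H.T) / (W.dot(H).dot(H.T) + eps)
        errors.append(np.linalg.norm(V - W.dot(H)))
    return W, H, errors

V_demo = np.random.RandomState(1).rand(6, 5)
W_demo, H_demo, errors = nmf_multiplicative(V_demo, r=2)
print(errors[0])
print(errors[-1])
```

Because every update multiplies by a non-negative ratio, no clipping step is needed to keep W and H non-negative.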
