# Perceptron

The most basic classifier is the simple perceptron; sklearn's implementation is better optimized than a hand-written one.

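The resulting accuracy (0.91) falls short of 1.0 because the Iris classes are not perfectly linearly separable, so the perceptron cannot settle on a boundary that classifies every sample correctly.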
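
The effect of linear separability can be illustrated with a minimal hand-written perceptron. This is a sketch of the classic perceptron learning rule, not sklearn's implementation; the function name `train_perceptron`, the 0/1 label encoding, the learning rate of 1.0 and the toy AND/XOR data are all assumptions made for illustration.

```python
def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Train a perceptron on 0/1 labels; return weights, bias and errors per epoch."""
    w = [0.0] * len(X[0])
    b = 0.0
    errors_per_epoch = []
    for _ in range(n_epochs):
        errors = 0
        for xi, target in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            prediction = 1 if z >= 0.0 else 0
            update = eta * (target - prediction)  # zero when the sample is classified correctly
            if update != 0.0:
                w = [wj + update * xj for wj, xj in zip(w, xi)]
                b += update
                errors += 1
        errors_per_epoch.append(errors)
    return w, b, errors_per_epoch


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
_, _, and_errors = train_perceptron(X, [0, 0, 0, 1])  # AND: linearly separable
_, _, xor_errors = train_perceptron(X, [0, 1, 1, 0])  # XOR: not linearly separable
print('AND errors per epoch:', and_errors)
print('XOR errors per epoch:', xor_errors)
```

On the separable AND data the number of updates per epoch drops to zero, whereas on XOR every epoch contains at least one misclassification, because no single hyperplane fits all four points.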

In [5]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features for gradient descent.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Perceptron.
ppn = Perceptron(max_iter = 40, eta0=0.1, random_state = 0)
ppn.fit(X_train_std, y_train)

# Predict.
y_pred = ppn.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.91


# Logistic Regression

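Given a feature vector $\mathbf{x}$ and weights $\mathbf{w}$, logistic regression models the log-odds (logit) of the conditional probability that a sample belongs to class $1$ as a linear combination of the features:

$$
\mathrm{logit}(p(y=1 \mid \mathbf{x}))=\sum_{i=0}^{m}w_{i}x_{i}
$$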

Writing $z = \sum_{i=0}^{m}w_{i}x_{i}$ for the net input, the inverse of the logit (the logistic sigmoid) gives the probability that a sample belongs to class $1$:

$$
\phi(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-\sum_{i=0}^{m}w_{i}x_{i}}}
$$

The resulting function approaches 1 as $z \rightarrow \infty$ and 0 as $z \rightarrow - \infty$, with $\phi(0) = 0.5$. Thresholding the probability at 0.5, which is equivalent to testing $z \geq 0$, gives the classification rule:

$$
\hat{y} = \begin{cases} 1 & \textrm{if} \; \phi(z) \geq 0.5 \\ 0 & \textrm{otherwise} \end{cases}
$$
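
The sigmoid and the thresholding rule can be sketched in a few lines of plain Python. This is only an illustration: the helper names and the weight vector `w` are hypothetical, and `x[0] = 1` is assumed to play the role of the bias input so that the sum can start at $i=0$.

```python
import math


def net_input(w, x):
    """z = sum_i w_i * x_i, assuming x[0] = 1 acts as the bias input."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Logistic sigmoid; fine for moderate z, but math.exp overflows for very negative z."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x, threshold=0.5):
    """Class 1 when the estimated probability reaches the threshold, else class 0."""
    return 1 if sigmoid(net_input(w, x)) >= threshold else 0


w = [-1.0, 2.0]  # hypothetical weights: w_0 (bias) and w_1
for x1 in (0.0, 0.5, 2.0):
    x = [1.0, x1]
    z = net_input(w, x)
    print('z = %.1f, phi(z) = %.3f, class = %d' % (z, sigmoid(z), predict(w, x)))
```

The middle sample has $z = 0$, so $\phi(z) = 0.5$ and it sits exactly on the decision boundary.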

## Overfitting

To generalise well, a model must be neither so complex that it overfits nor so simple that it underfits. Regularization penalises extreme parameter weights, which limits the effective complexity of the model. The regularization parameter $\lambda$ controls the strength of this penalty, but sklearn takes its inverse $C$, so smaller values of $C$ mean stronger regularization:

$$
C = \frac{1}{\lambda}
$$
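
To make the role of $C$ concrete, the sketch below adds a penalty to the logistic loss and evaluates it for the same weights under different values of $C$. It assumes an L2 penalty of the form $\frac{\lambda}{2}\sum_j w_j^2$ that skips the bias weight; the function names, weights and toy data are hypothetical. As far as I know, sklearn scales the loss term by $C$ rather than the penalty by $\lambda$, which differs from this form only by a constant factor.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, X, y):
    """Unregularised logistic cost summed over all samples (x[0] = 1 is the bias input)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += -yi * math.log(p) - (1 - yi) * math.log(1.0 - p)
    return total


def l2_penalty(w, C):
    """(lambda / 2) * ||w||^2 with lambda = 1 / C; the bias weight w[0] is not penalised."""
    lam = 1.0 / C
    return 0.5 * lam * sum(wj ** 2 for wj in w[1:])


def regularized_cost(w, X, y, C):
    return log_loss(w, X, y) + l2_penalty(w, C)


X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = [0.0, 3.0]  # hypothetical, fairly large weight
for C in (0.01, 1.0, 1000.0):
    print('C = %g: penalty = %.4f, cost = %.4f'
          % (C, l2_penalty(w, C), regularized_cost(w, X, y, C)))
```

With $C = 1000$ the penalty is negligible, so large weights are barely discouraged; with $C = 0.01$ the penalty dominates the cost.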

In [12]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features for gradient descent.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# LogisticRegression. C is the inverse regularization strength (smaller numbers regularise more strongly).
lr = LogisticRegression(C=1000.0, random_state = 0)
lr.fit(X_train_std, y_train)

# Predict.
y_pred = lr.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.98


# Support Vector Machines

Where a perceptron tries to define a single hyperplane that separates two classes, SVMs also attempt to maximise the margin around that hyperplane to lower the generalization error.

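Where the data are not linearly separable, a slack variable $\xi$ is introduced that lets samples violate the margin, and $C$ controls how heavily those violations are penalised. As with the inverse regularization parameter in logistic regression, decreasing $C$ increases bias and lowers variance in a model, while increasing $C$ does the opposite.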
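
The slack of a single sample can be computed directly from a given hyperplane. The sketch below uses the standard soft-margin form $\xi_i = \max(0, 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b))$ with labels in $\{-1, +1\}$ and the objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$. The hyperplane, data and function names are hypothetical, and nothing is optimised here; the code only evaluates the quantities.

```python
def slack(w, b, x, y):
    """xi = max(0, 1 - y * (w . x + b)) for a label y in {-1, +1}."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    total_slack = sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))
    return 0.5 * sum(wj ** 2 for wj in w) + C * total_slack


w, b = [1.0, 0.0], 0.0  # hypothetical hyperplane x_1 = 0
X = [[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0], [-3.0, 1.0]]
y = [1, 1, 1, -1]

slacks = [slack(w, b, xi, yi) for xi, yi in zip(X, y)]
print('slacks:', slacks)
for C in (0.1, 10.0):
    print('C = %g: objective = %.2f' % (C, soft_margin_objective(w, b, X, y, C)))
```

A slack of 0 means the sample lies on or outside the margin, a value between 0 and 1 means it is inside the margin but correctly classified, and a value above 1 means it is misclassified. A larger $C$ makes each unit of slack more expensive.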

In [16]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features for gradient descent.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Support Vector Machine. C denotes slack strength (smaller numbers introduce more slack).
svm = SVC(kernel='linear', C=1.0, random_state = 0)
svm.fit(X_train_std, y_train)

"""For online learning via partial fit use:

from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss='hinge') # loss='log_loss' ('log' in scikit-learn versions before 1.1) creates a logistic regression.

"""

"""For hyperplanes in higher dimensions use:
svm = SVC(kernel='rbf', C=1.0, gamma = 1.0, random_state = 0)

A high gamma gives each training sample a tight Gaussian region of influence and tends to overfit; a low gamma gives a smoother boundary and tends to underfit.

"""

# Predict.
y_pred = svm.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 0.98


# Decision Tree Learning 

This form of classifier sorts samples into classes by asking a series of questions. Given a set of features, the model learns the optimal questions to ask to quickly divide the data.

This is done by maximising the Information Gain (IG) of each split, followed by pruning to reduce overfitting. There are three main impurity measures used as splitting criteria:
1. Entropy - computed from $p(i|t)$, the proportion of samples belonging to class $i$ in node $t$. It is minimised by having all samples in the node belong to a single class, and maximised by a uniform distribution of classes.
2. Gini Impurity - the probability of misclassifying a sample that is labelled at random according to the class distribution in the node; splits are chosen to minimise it.
3. Classification Error - the proportion of samples in a node that do not belong to its majority class. Not as sensitive to changes in class probabilities.
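
The three criteria and the resulting information gain can be compared on a small example. This is a sketch using the usual textbook definitions (entropy with a base-2 logarithm, IG as parent impurity minus the size-weighted impurity of the children); the function names and the two candidate splits are made up for illustration.

```python
import math
from collections import Counter


def proportions(labels):
    """Class proportions p(i|t) for the labels in one node (only non-empty classes)."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    return -sum(p * math.log2(p) for p in proportions(labels))


def gini(labels):
    return 1.0 - sum(p ** 2 for p in proportions(labels))


def classification_error(labels):
    return 1.0 - max(proportions(labels))


def information_gain(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [0] * 40 + [1] * 40
split_a = ([0] * 30 + [1] * 10, [0] * 10 + [1] * 30)  # two equally mixed children
split_b = ([0] * 20 + [1] * 40, [0] * 20)             # one mixed child, one pure child

for criterion in (entropy, gini, classification_error):
    print('%-20s IG(A) = %.4f  IG(B) = %.4f'
          % (criterion.__name__,
             information_gain(parent, split_a, criterion),
             information_gain(parent, split_b, criterion)))
```

Entropy and Gini impurity both favour split B, which produces a pure child, while classification error scores the two splits identically. This is the lower sensitivity to class probabilities mentioned in the list.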

In [18]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features - not required for DTs, but good for visualisation.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Decision Tree - max_depth helps reduce overfitting.
tree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 3, random_state = 0)
tree.fit(X_train_std, y_train)

# Predict.
y_pred = tree.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 0.98


## Random Forests

A random forest is an ensemble of decision trees, which combines many weak learners into one strong one. The result is a lower generalization error and a model that is less prone to overfitting. The algorithm works as follows:
1. Draw a random bootstrap sample of size $n$ (with replacement) from the training set.
2. Grow a decision tree from the bootstrap sample, at each node:
   1. Randomly select $d$ features without replacement.
   2. Split the node using the feature that gives the best IG.
3. Repeat steps 1 to 2 $k$ times.
4. Aggregate the predictions of the trees via majority vote.

The reason for using a random forest is that it doesn't rely as heavily on well-tuned hyperparameters, as the noise from individual trees is averaged out across the ensemble.
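
The steps of the algorithm can be sketched in plain Python. To stay short, this sketch replaces full decision trees with depth-1 trees (decision stumps), so the random feature subset is drawn once per tree rather than at every node. All names, the toy data and the defaults for `k` and `d` are assumptions; this is not how sklearn's `RandomForestClassifier` is implemented.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Return (feature, threshold, left_label, right_label) with the highest information gain."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t, majority(left), majority(right))
    if best is None:  # every candidate feature is constant in this sample
        label = majority(y)
        return (feature_ids[0], float('inf'), label, label)
    return best[1:]


def fit_forest(X, y, k=25, d=1, seed=0):
    rng = random.Random(seed)
    n, forest = len(X), []
    for _ in range(k):                                   # step 3: repeat k times
        idx = [rng.randrange(n) for _ in range(n)]       # step 1: bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feature_ids = rng.sample(range(len(X[0])), d)    # step 2.1: d random features
        forest.append(fit_stump(Xb, yb, feature_ids))    # step 2.2: best-IG split
    return forest


def predict_forest(forest, x):
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)                               # step 4: majority vote


X = [[0, 5], [1, 6], [2, 7], [10, 15], [11, 16], [12, 17]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0, 5]), predict_forest(forest, [12, 17]))
```

An individual stump trained on an unlucky bootstrap sample may be useless, but the majority vote over all `k` stumps is much more stable. This is the noise-control effect described above.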

In [20]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features - not required for DTs, but good for visualisation.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Random Forest (low n_estimators results in poorer performance than previous models).
# n_jobs is number of CPU cores to use. 
forest = RandomForestClassifier(criterion = 'entropy', n_estimators = 100, random_state = 1, n_jobs = 4)
forest.fit(X_train_std, y_train)

# Predict.
y_pred = forest.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 0.98


# K-nearest Neighbour Classification

KNN is a lazy learning algorithm, meaning that it memorizes the training data instead of learning a discriminative function from it. The algorithm works in the following way:
1. Choose $k$ and a distance metric.
2. Find the $k$ nearest neighbours of the sample to be classified.
3. Assign the class label via majority vote.

This approach means that the algorithm can immediately adapt to new samples, but the cost of classifying a sample grows with the size of the training set.

In this approach, the choice of $k$ and the distance metric is very important. Standard metrics include the Euclidean and Manhattan distances, which are special cases of the Minkowski distance with $p=2$ and $p=1$ respectively.

In very sparse, high-dimensional datasets even the nearest neighbours are far apart, so feature selection and dimensionality reduction are required to bring neighbours closer together.
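
The three steps and the Minkowski distance can be sketched without any library support. This is a brute-force illustration under assumed names (`minkowski`, `knn_predict`) and made-up data, not sklearn's `KNeighborsClassifier`, which can use faster neighbour-search structures.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance: p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_predict(X_train, y_train, x, k=3, p=2):
    """Label x by majority vote among its k nearest training samples."""
    ranked = sorted(zip(X_train, y_train), key=lambda pair: minkowski(pair[0], x, p))
    neighbours = ranked[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]]
y_train = [0, 0, 0, 1, 1, 1]

print('Manhattan:', minkowski([0, 0], [3, 4], p=1))
print('Euclidean:', minkowski([0, 0], [3, 4], p=2))
print('Predicted:', knn_predict(X_train, y_train, [2.0, 2.0]),
      knn_predict(X_train, y_train, [6.0, 7.0]))
```

There is no training step: all of the work happens in `knn_predict`, which scans the whole training set for every query.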

In [21]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target # Stored as integers for better performance.

# 30% test data, 70% train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Standardise features - needed for KNN, since distances depend on feature scale.
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# KNeighbours classifier - p = 2 is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors = 5, p = 2, metric = 'minkowski')
knn.fit(X_train_std, y_train)

# Predict.
y_pred = knn.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 1.00
