# Writing a pipeline - Google ML YouTube tutorials

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

The feature set of an ML classifier can be thought of as the variable `x` in an equation, and the labels as the output of that equation (the values of `y`).

$$
y = f(x)
$$

In [2]:
x = iris.data
y = iris.target
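
To make the `x`/`y` analogy concrete, the sketch below inspects what `load_iris()` returns. It only uses attributes of the loaded dataset (`data`, `target`, `feature_names`, `target_names`); the variable names `first_features` and `first_label` are introduced here for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Features: one row per flower, one column per measurement.
print(iris.feature_names)
print(iris.data.shape)

# Labels: one integer per flower, used as an index into target_names.
print(iris.target_names)
print(iris.target.shape)

# The first example: its four measurements and the name of its label.
first_features = iris.data[0]
first_label = iris.target_names[iris.target[0]]
print(first_features, first_label)
```

The iris dataset holds 150 flowers with 4 measurements each, so `x` has 150 rows and 4 columns and `y` has 150 entries.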

For any ML classifier, part of the labelled data should be held back as a test set, so that the classifier's accuracy can be measured on examples it has not seen during training. `train_test_split()` partitions the data (features and associated labels) into two sets; here `test_size = 0.5` holds back half of the examples for testing.

In [23]:
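# train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)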
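
As a rough illustration of what a random split does, here is a minimal pure-Python sketch that shuffles the example indices and slices them into two parts. It is not scikit-learn's implementation; `simple_split`, its `seed` parameter, and the toy data are hypothetical names used only for this example.

```python
import random


def simple_split(features, labels, test_size=0.5, seed=None):
    """Shuffle the examples, then cut them into a training part and a test part."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index features and labels with the same indices so pairs stay together.
    f_train = [features[i] for i in train_idx]
    f_test = [features[i] for i in test_idx]
    l_train = [labels[i] for i in train_idx]
    l_test = [labels[i] for i in test_idx]
    return f_train, f_test, l_train, l_test


# Hypothetical toy data: 10 examples with one feature each.
toy_x = [[i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
toy_x_train, toy_x_test, toy_y_train, toy_y_test = simple_split(
    toy_x, toy_y, test_size=0.3, seed=0
)
print(len(toy_x_train), len(toy_x_test))
```

The important property is that each feature row keeps its own label after shuffling, and that no example appears in both sets.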

In [4]:
from sklearn import tree
my_classifier = tree.DecisionTreeClassifier()

my_classifier.fit(x_train, y_train)

predictions = my_classifier.predict(x_test)
print(predictions)

[1 2 2 0 0 0 1 1 0 2 1 2 1 1 0 0 2 0 2 1 0 2 0 2 2 2 2 2 1 0 2 0 0 0 2 1 2
 1 1 2 0 1 0 1 0 1 1 0 0 2 1 0 1 0 0 0 2 0 1 1 1 0 0 1 0 0 0 2 2 1 2 0 2 2
 1]


`sklearn` provides a metrics function, `accuracy_score`, for calculating the accuracy of the ML classifier. It compares the classifier's predictions on the test features (`x_test`) with the known labels of the test set (`y_test`).

In [6]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.946666666667
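
The accuracy reported here is the fraction of test examples whose predicted label matches the known label. A minimal sketch of that calculation follows; `simple_accuracy` and the toy label lists are hypothetical and stand in for `accuracy_score`, `y_test` and `predictions`.

```python
def simple_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical labels: three of the four predictions are correct.
toy_true = [0, 1, 2, 2]
toy_pred = [0, 1, 1, 2]
toy_accuracy = simple_accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.75
```

On a test set of 75 examples, a score of 0.9467 corresponds to 71 correct predictions out of 75.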


## K nearest neighbours
ML classifiers can also be constructed with other methods, such as the k-nearest neighbours algorithm.

In [27]:
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()

my_classifier.fit(x_train, y_train)

k_predictions = my_classifier.predict(x_test)
print(k_predictions)


In [28]:
print(accuracy_score(y_test, k_predictions))

0.973333333333
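
To show the idea behind k-nearest neighbours, the sketch below labels a new point by majority vote among its `k` closest training points, using Euclidean distance. It is a simplified illustration rather than scikit-learn's implementation: `knn_predict`, the choice `k=3`, and the toy points are assumptions made for this example (`KNeighborsClassifier` uses 5 neighbours by default).

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training points."""
    # Distance from the query to every training point, paired with its label.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_features, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
toy_points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
toy_labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(toy_points, toy_labels, [1.1, 1.0])
near_second_cluster = knn_predict(toy_points, toy_labels, [4.0, 4.0])
print(near_first_cluster, near_second_cluster)
```

Unlike a classifier that learns coefficients, this approach keeps the training data itself and does all of its work at prediction time.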


A high-level picture of how a classifier works is to imagine the features of the training data as points on a grid. The classifier's job is to draw a line that separates the points of one label from those of another. If that line can be described by an equation such as $y = mx + b$, the classifier can be thought of as iteratively adjusting the coefficients $m$ and $b$ as it sees each data point from the training data. For example, if the classifier predicts the label of the $n^{\text{th}}$ data point incorrectly during learning, the coefficients are adjusted until that point is classified correctly. This adjustment continues over every data point until the classifier correctly classifies all the data in the training set, which is only possible when a line can actually separate the labels.
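
One simple algorithm that behaves this way is the perceptron, sketched below under stated assumptions. It is not what `DecisionTreeClassifier` or `KNeighborsClassifier` do internally. The line is written as $w_1 x_1 + w_2 x_2 + b = 0$, which is the same as $y = mx + b$ after rearranging when $w_2 \neq 0$. The function names, the learning rate, the labels `+1`/`-1`, and the toy points are all hypothetical choices for this example.

```python
def classify(point, w1, w2, b):
    """Return +1 if the point lies on the positive side of the line, else -1."""
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + b > 0 else -1


def train_perceptron(points, labels, learning_rate=0.1, max_epochs=100):
    """Adjust w1, w2 and b each time a training point is misclassified."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), label in zip(points, labels):
            if classify((x1, x2), w1, w2, b) != label:
                # Nudge the line towards classifying this point correctly.
                w1 += learning_rate * label * x1
                w2 += learning_rate * label * x2
                b += learning_rate * label
                mistakes += 1
        if mistakes == 0:
            # Every training point is now on the correct side of the line.
            break
    return w1, w2, b


# Hypothetical, linearly separable toy data with labels -1 and +1.
toy_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 0.5), (4.0, 4.5), (5.0, 4.0), (4.5, 5.5)]
toy_labels = [-1, -1, -1, 1, 1, 1]

w1, w2, b = train_perceptron(toy_points, toy_labels)
learned = [classify(p, w1, w2, b) for p in toy_points]
print(learned)
```

If no straight line separates the two labels, this loop never reaches zero mistakes and simply stops after `max_epochs` passes.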