<p> One type of machine learning problem is the classification problem. The goal of classification is to assign a discrete, qualitative label to something, given data about it. A classic example is a spam filter. A programmer starts with a list of emails, each with data about it, such as the number of words in the email or the number of capital letters in the title, together with whether or not the programmer considers that email to be spam. The programmer then comes up with a model that relates the probability of an email being spam to the data. This step is called "training the model", and the list of emails is called the training data. When a new email comes in, the model calculates the probability of it being spam. If the email is more likely to be spam than not, it is labeled "spam" and delivered to the spam folder; otherwise it is labeled "not spam" and delivered to the inbox. </p>
<p> In practice, the training data often comes in a table, with each email (an "instance") being a row and each parameter being a column. One of the columns holds the label you want to assign, which is called the target. A new email, for which it is unknown whether it is spam, is also an instance, except that its target column is blank. The problem is then to fill in this column. Conversely, any problem that asks you to fill in an empty column of a table for a certain instance with a discrete label is a classification problem. </p>
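<p> The table view described above can be sketched in a few lines. This is only an illustration under stated assumptions: the eight emails are made up, the column names word_count and capitals_title are hypothetical, and logistic regression is just one possible choice of model (the spam filter description above does not name one). </p>

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per email (instance),
# two parameter columns, and a "spam" target column.
emails = pd.DataFrame({
    "word_count":     [120, 45, 300, 30, 250, 20, 180, 25],
    "capitals_title": [1, 9, 0, 12, 2, 10, 1, 8],
    "spam": ["not spam", "spam", "not spam", "spam",
             "not spam", "spam", "not spam", "spam"],
})

X_emails = emails[["word_count", "capitals_title"]]
y_emails = emails["spam"]

# "Training the model": relate the parameters to the target.
model = LogisticRegression(max_iter=1000).fit(X_emails, y_emails)

# A new email is a row whose target column is still blank.
new_email = pd.DataFrame({"word_count": [35], "capitals_title": [11]})

# One probability per label, in the order given by model.classes_.
probabilities = model.predict_proba(new_email)[0]

# predict() fills the blank with the more probable label.
label = model.predict(new_email)[0]
print(dict(zip(model.classes_, probabilities)), label)
```

<p> The two probabilities sum to 1, and the predicted label is whichever of the two is larger, which matches the "more likely spam than not" rule described above. </p>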
<img src="iris.png">
<p> In our example, we will use the famous Iris flower dataset. The table contains the length and width of the sepal and petal of flower samples from the Iris genus collected in one area, as well as the species name of each flower, determined by a qualified biologist. Now imagine somebody goes to the same area, finds a flower, and measures the length and width of its sepal and petal, but cannot tell its species. The classification problem is then to determine, from the measurements, which species the flower most likely is. </p>

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

df = pd.read_csv("iris-mv.csv")
# Forward fill: replace each missing value with the last valid value above it.
df = df.ffill()
df

It is quite simple to use one of scikit-learn's classifiers with a pandas DataFrame. Check out the code below:

In [3]:
from sklearn.neighbors import KNeighborsClassifier
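# Features: every column except the label. Target: the species label.
X = df.drop(columns=['Species'])
y = df['Species']
clf = KNeighborsClassifier(n_neighbors=3).fit(X,y)
# Predict the species of one new flower; values follow the column order of X.
clf.predict([[3,4,5,6]])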

array(['Iris-virginica'], dtype=object)

You can see that a model trained with the k-nearest neighbors classifier predicts that a flower with sepal length 3 cm, sepal width 4 cm, petal length 5 cm, and petal width 6 cm is a virginica. But how do we know how accurate the model is? We can use the accuracy_score function in scikit-learn, which returns the fraction of positions at which two lists of labels agree:

In [1]:
from sklearn.metrics import accuracy_score
a = [0,1,1,2,2,1]
b = [0,1,2,2,0,1]
accuracy_score(a,b)

0.6666666666666666
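To see where this number comes from, the same score can be worked out by hand. The sketch below reuses the lists a and b from the cell above; the names matches and manual_accuracy are introduced here only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

a = [0, 1, 1, 2, 2, 1]
b = [0, 1, 2, 2, 0, 1]

# Compare position by position: True where the two labels agree.
matches = np.array(a) == np.array(b)

# Accuracy is the fraction of positions that agree (4 of 6 here).
manual_accuracy = matches.mean()
print(matches.sum(), len(a), manual_accuracy)
```

The lists agree at four of the six positions, so the accuracy is 4/6, the same value accuracy_score returns.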

In [None]:
accuracy_score(clf.predict(X),y)

This looks good! But it turns out that a high accuracy score here does not mean much. Remember that the purpose of classification is to create a model that predicts the label of a new data point should one arise, not the label of an existing point that we already used to create the model. So how do we get new points to test our model? One simple way is to randomly divide the dataset into two subsets, a training set and a test set. We use the training set to train the model and the test set to test it. The scikit-learn package has an easy-to-use function that does this:

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
7,5.0,3.4,1.5,0.2
124,6.4,3.1,5.5,2.4
127,6.7,3.1,4.8,2.4
102,6.8,2.5,5.0,2.1
0,5.1,3.5,1.4,0.2
...,...,...,...,...
56,5.9,3.0,4.2,1.5
129,5.8,2.7,5.1,1.9
20,5.4,3.4,1.7,0.2
65,5.9,3.2,4.8,1.8


By default, 75% of the points are assigned to the training set and 25% to the test set, but the proportion can be changed with the test_size parameter. Note the order in which the function returns its results: X_train, X_test, y_train, y_test. Now we can do exactly what we set out to do:

In [6]:
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.9411764705882353

<p> This accuracy score is the more meaningful one, and it tells us our model is still pretty good. </p>
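<p> The gap between the two kinds of score can be made concrete with a small sketch. It assumes scikit-learn's bundled copy of the Iris data (load_iris) as a stand-in for iris-mv.csv, and it deliberately uses n_neighbors=1, which differs from the model above, because that setting makes the effect easy to see. The test_size and random_state arguments set the split proportion and make the random split repeatable. </p>

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for iris-mv.csv.
X_iris, y_iris = load_iris(return_X_y=True)

# test_size sets the held-out fraction; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

# With n_neighbors=1, each training point is its own nearest neighbor,
# so training accuracy is perfect (barring identical points with
# different labels) regardless of how well the model generalizes.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_accuracy = accuracy_score(y_tr, knn.predict(X_tr))
test_accuracy = accuracy_score(y_te, knn.predict(X_te))
print(len(X_tr), len(X_te), train_accuracy, test_accuracy)
```

<p> The training accuracy says nothing about new flowers here, since the model has simply memorized the training points. Only the test accuracy estimates how the model behaves on data it has not seen. </p>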
<p> scikit-learn provides other algorithms as well. One of them is logistic regression. The syntax is very similar: </p>

In [7]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.9117647058823529

And the support vector machine algorithm:

In [9]:
from sklearn.svm import SVC
clf = SVC().fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.9411764705882353

Naive Bayes algorithm:

In [11]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB().fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

1.0

Decision tree algorithm:

In [12]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.8823529411764706

Random forest algorithm:

In [13]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier().fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_pred,y_test)

0.9411764705882353
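Because every classifier above follows the same fit and predict pattern, the comparison can be written as a single loop. The sketch below is one possible way to do it, not the only one. It assumes scikit-learn's bundled Iris data (load_iris) in place of iris-mv.csv, fixes random_state so that all models see the same split, and raises max_iter for logistic regression to avoid convergence warnings. The names classifiers and scores are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for iris-mv.csv.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

classifiers = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Train every model on the same training set and score it on the same test set.
scores = {}
for name, model_i in classifiers.items():
    model_i.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model_i.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With a test set this small, a difference of one or two misclassified flowers moves the score noticeably, so a different random split can change the ranking of the models.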

Exercise: Using the cleaned-up version of the penguin dataframe, create a model that predicts the species of a penguin given its dimensions.