<p> One type of machine learning problem is the classification problem. The goal of classification is to assign a discrete, qualitative label to something, given data about it. A classic example is a spam filter. A programmer has a list of emails, each with data about it, such as the number of words in the email or the number of capital letters in the title, together with whether the programmer considers that email to be spam. The programmer then comes up with a model that relates the probability of an email being spam to the data. This is called "training the model", and the list is called the training data. When a new email comes in, the model calculates the probability of it being spam. If the email is more likely to be spam than not, it is labeled "spam" and delivered to the spam folder; otherwise it is labeled "not spam" and kept out of the spam folder. </p>
<p> In practice, the training data often comes in a table, with each email ("instance") being a row and each piece of data about it being a column. One of the columns holds the label you want to assign, which is called the target. A new email, for which it is not yet known whether it is spam, is also an instance, except that its target column is blank. The problem is then to fill in this column. Conversely, any problem that asks you to fill in an empty column of a table for a certain instance with a discrete label is a classification problem. </p>
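As a rough illustration of this table layout, here is a minimal sketch in plain Python. The column names (<code>n_words</code>, <code>n_capitals_in_title</code>, <code>label</code>) and all the values are made up for this example; they do not come from any real dataset.

```python
# Hypothetical training table: each dict is one email (one instance).
training_rows = [
    {"n_words": 120, "n_capitals_in_title": 1, "label": "not spam"},
    {"n_words": 35, "n_capitals_in_title": 14, "label": "spam"},
    {"n_words": 80, "n_capitals_in_title": 9, "label": "spam"},
]

# A new email is also an instance, but its target column is still blank.
new_email = {"n_words": 60, "n_capitals_in_title": 11, "label": None}

# Separate the columns used for prediction from the target column.
feature_names = ["n_words", "n_capitals_in_title"]
X_rows = [[row[name] for name in feature_names] for row in training_rows]
y_labels = [row["label"] for row in training_rows]
X_new = [[new_email[name] for name in feature_names]]

print(X_rows)    # one list of feature values per training email
print(y_labels)  # one label per training email
print(X_new)     # the instance whose label we want to fill in
```

The classification problem is then to produce a value for the blank <code>label</code> of <code>new_email</code> from the rows whose labels are known.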
<img src="iris.png">
<p> In our example, we will use the famous Iris flower dataset. The table contains the length and width of the sepal and petal of samples of flowers from the Iris genus collected in an area, as well as the species name of the flower, determined by a qualified biologist. Then imagine somebody goes to the same area, finds a flower and measures the length and width of its sepal and petal. But that person cannot tell the species of that flower. Then a classification problem is to determine, from the measurement data, what species the flower most likely is. </p>

In [1]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

As usual, we start by creating a pandas DataFrame from a CSV file in the same folder as this Jupyter notebook.

In [3]:
df = pd.read_csv('iris.csv')
df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


The target, the column 'variety', is usually called y, and is a 1-dimensional array. The data used to predict the species is usually called X, and is a 2-dimensional array that is the table minus the target.

In [4]:
X = df.drop(columns=['variety'])
y = df['variety']

One algorithm used for classification problems is called k-nearest neighbors, or KNN. Machine learning experts disagree on whether learning the mathematics behind the algorithms is an essential part of learning how to use machine learning. In this course we will skip most of the mathematics and focus on how to write correct code. In scikit-learn, the code for training a model with any classification algorithm generally has the form
<p> [a variable name] = [name of scikit-learn class](parameters of that class, if any).fit(X,y). </p>
For KNN, this reads

In [5]:
clf = KNeighborsClassifier(n_neighbors=3).fit(X,y)

where <code>n_neighbors=3</code> means the algorithm has a parameter called n_neighbors, and we set it to 3. In many algorithms, parameters that are not specified take a default value. After the model, called <code>clf</code>, is trained, we pass it another 2-dimensional table with the same number of columns as X, called X_test, to make a prediction, using the code
<p> [the variable name].predict(X_test). </p>
For example, the code

In [6]:
clf.predict([[3,4,5,6],
             [6,5,4,3]])

array(['Virginica', 'Versicolor'], dtype=object)

<p> means we want the model clf to predict the species names of two flowers. The sepal length, sepal width, petal length and petal width of the first flower are 3 cm, 4 cm, 5 cm and 6 cm respectively, and those of the second are 6 cm, 5 cm, 4 cm and 3 cm. The model predicts that the first flower is a Virginica and the second a Versicolor. </p>
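Although we skip the mathematics, it may help to see roughly what the <code>n_neighbors</code> parameter controls. The following is a minimal sketch of the k-nearest neighbors idea in plain Python, assuming Euclidean distance and an unweighted majority vote. It is not scikit-learn's implementation, which is more efficient and handles ties in its own way. The function name <code>knn_predict</code> and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row, paired with that row's label
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest rows and count how often each label occurs among them
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy measurements (not taken from iris.csv)
train_X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, [1.1, 1.0]))  # A
print(knn_predict(train_X, train_y, [5.1, 5.0]))  # B
```

With <code>k=3</code>, each prediction is decided by the three training rows closest to the new point, which is the role <code>n_neighbors=3</code> plays above.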
<p> Other algorithms are used in the same way. For example, logistic regression in scikit-learn: </p>

In [17]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X,y)
clf.predict([[3,4,5,6],
             [6,5,4,3]])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array(['Virginica'], dtype=object)
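The warning above says that the solver stopped before it converged, and it suggests two remedies: raising <code>max_iter</code> or scaling the data. Here is a minimal sketch of both. It assumes scikit-learn's bundled copy of the Iris data (<code>load_iris</code>) rather than <code>iris.csv</code>, so the labels are the integers 0, 1 and 2 instead of species names. The value 1000 for <code>max_iter</code> is an arbitrary choice, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled Iris data instead of iris.csv
X_demo, y_demo = load_iris(return_X_y=True)

# Option 1: allow the solver more iterations (1000 is an arbitrary choice)
clf_more_iter = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Option 2: standardize each column before fitting
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

# The same two flowers as in the cells above
new_flowers = [[3, 4, 5, 6], [6, 5, 4, 3]]
pred_more_iter = clf_more_iter.predict(new_flowers)
pred_scaled = clf_scaled.predict(new_flowers)
print(pred_more_iter)
print(pred_scaled)
```

Putting the scaler and the classifier in one pipeline means new data passed to <code>predict</code> is scaled the same way as the training data.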

And the support vector machine classification:

In [18]:
clf = SVC().fit(X,y)
clf.predict([[3,4,5,6],
             [6,5,4,3]])

array(['Virginica'], dtype=object)
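Because all three classifiers share the same <code>fit</code> and <code>predict</code> methods, they can be swapped in and out of the same code. The sketch below loops over them. It again assumes scikit-learn's bundled Iris data (integer labels 0, 1 and 2) rather than <code>iris.csv</code>, and the dictionary keys and variable names are chosen only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: use scikit-learn's bundled Iris data instead of iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
new_flowers = [[3, 4, 5, 6], [6, 5, 4, 3]]

# Three different algorithms, all used through the same fit/predict interface
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

predictions = {}
for name, model in models.items():
    # fit returns the trained model, so predict can be chained onto it
    predictions[name] = model.fit(X_demo, y_demo).predict(new_flowers)
    print(name, predictions[name])
```

Each model returns one predicted label per row of <code>new_flowers</code>, although the labels themselves may differ from one algorithm to another.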