# Machine Learning: Introduction to Classification

Prepared by John C.S. Lui (www.cse.cuhk.edu.hk/~cslui) for the course CSCI3320 (Fundamentals of Machine Learning).

#### Date:  April 6, 2021


In [None]:
import sys
sys.version

### Iris dataset

The dataset is a collection of measurements of several Iris flowers. These measurements will enable us to distinguish different species of the flowers. The input features include:
* sepal length
* sepal width
* petal length
* petal width

For each input, we also have the **label** to identify the type of Iris.  So we are considering **supervised learning**.  For this dataset, the labels are:
* Setosa 
* Versicolor
* Virginica

### Use visualization to get a feel for the dataset

In [None]:
from matplotlib import pyplot as plt
import numpy as np

# scikit-learn has several data sets, one of them is the iris data set.
# We load the data with load_iris from scikit-learn

from sklearn.datasets import load_iris
data = load_iris()  # load_iris returns an object with several fields


# features is a NumPy array with one row per input and 4 feature columns
features = data.data  
feature_names = data.feature_names

# target is a NumPy array, with each item being the class label (0, 1, or 2) of an input
target = data.target  
target_names = data.target_names

# Let's examine our data
print('first 10 features are: ', features[:10],'\n')
print('shape of features: ', features.shape, '\n')
print('first 10 target are: ', target[:10],'\n')
print('feature_names: ', feature_names,'\n')
print('target_names: ', target_names,'\n')

# Let's do a scatter sub-plot for every pair of features:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)

from itertools import combinations

# one (colour, marker) pair per class label 0, 1, 2
colours_and_markers = [('r', '>'), ('g', 'o'), ('b', 'x')]

for fx, fy in combinations(range(features.shape[1]), 2):
    plt.figure(num=None, figsize=(4,4))
    plt.clf()
    for t, (c, marker) in enumerate(colours_and_markers):
        # notice how we select features via a boolean mask, e.g. features[target == t, 0]
        # gives feature 0 of all inputs whose class label is t
        plt.scatter(features[target == t, fx], features[target == t, fy],
                    marker=marker, c=c, label=target_names[t])
    plt.title('Sub-plot for 2 features')
    plt.xlabel(feature_names[fx])
    plt.ylabel(feature_names[fy])
    plt.legend()
    

### Observation

By looking at the sub-plots, **petal length** seems to be able to separate *Iris Setosa* from the other two flower species.  But how can we use a Python program to help us find this cutoff?

In [None]:
# We use NumPy indexing to get an array of strings:
labels = target_names[target]  # labels is an array, each item is the species name of an input

# Extract feature 2, which is the petal length
plength = features[:, 2]  # plength is an array with the petal length of each input

# build a boolean array that is True for the setosa inputs
is_setosa = (labels == 'setosa')

max_setosa = plength[is_setosa].max()       # find the maximum petal length of setosa
min_non_setosa = plength[~is_setosa].min()  # find the minimum petal length of non-setosa

print('Maximum of setosa: {0}; minimum of others: {1}.'.format(max_setosa, min_non_setosa))


## WE FOUND A CLASSIFIER for Setosa !!!!!

**Classification rule**: If the petal length is smaller than 2, then this is an *Iris Setosa* flower; otherwise it is either *Iris Virginica* or *Versicolor*. 
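
The rule above can be written as a tiny function. This is only a minimal sketch: the function name `classify_by_petal_length` and the sample values are made up for illustration, and the default cutoff of 2 simply mirrors the rule stated above.

```python
import numpy as np

def classify_by_petal_length(petal_length, cutoff=2.0):
    """Return 'setosa' where petal length < cutoff, else 'not setosa'."""
    petal_length = np.asarray(petal_length, dtype=float)
    return np.where(petal_length < cutoff, 'setosa', 'not setosa')

# Hypothetical petal lengths, chosen only to exercise both branches
sample = [1.4, 1.9, 3.0, 5.1]
print(classify_by_petal_length(sample))
```

Any cutoff that lies between the largest Setosa petal length and the smallest non-Setosa petal length would give the same result on this dataset.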

Let's discover another classification rule to distinguish Virginica and Versicolor.

In [None]:
# Find the best feature and its cut-off threshold so as to best distinguish Virginica from Versicolor
#
# The idea is this:  Let's say we consider feature i and threshold t.
# Sample points whose feature i value is greater than t are labeled
# Virginica (or Versicolor), while sample points whose feature i value
# is <= t are labeled Versicolor (or Virginica).
#
# For this feature i and threshold t, we compute the accuracy against the given labels.
#
# The candidate thresholds t are found by sweeping ALL observed values of feature i.
#
# Therefore, for each feature i and threshold t, we have two configurations to consider:
# greater than t is Virginica and the rest is Versicolor, or the other way around.
# This is the reason why we have the 'reverse' flag in the code.
# Finally, we have to consider ALL 4 features and all of their values.

# First, let us select features and labels which are non-Setosa 

features = features[~is_setosa]   # now features array only contains 100 entries
labels = labels[~is_setosa]       # now labels array only contains 100 entries


is_virginica = (labels == 'virginica')   # create a boolean array to identify virginica


#loop over all possible features and thresholds to see which one results in better accuracy


best_acc = -1.0   # initialize best accuracy to be negative first

for fi in range(features.shape[1]):  # loop over each of the 4 features
   # test different thresholds
   threshold = features[:,fi]   # array of candidate thresholds for feature fi
   for t in threshold:          # go through each threshold value
      # access the vector for feature 'fi'
      feature_i = features[:,fi]
      # apply the threshold 't'
      pred = (feature_i > t)    # boolean array: True where the feature value is > t

      acc = (pred == is_virginica).mean()        # fraction correct if "> t" means virginica
      rev_acc = (pred == ~is_virginica).mean()   # fraction correct if "> t" means versicolor
      if rev_acc > acc:
            reverse = True
            acc = rev_acc
      else:
            reverse = False
      
      if acc > best_acc:    # if acc is better than the current one, remember its state
            best_acc = acc
            best_fi = fi
            best_t  = t
            best_reverse = reverse
            
#  the variables best_fi, best_t, and best_reverse hold our model

print('Best feature index is: ', best_fi)
print('Best feature is: ', feature_names[best_fi])
print('Best threshold is: ', best_t)
print('Accuracy is: {:.1%}'.format(best_acc))   # best_acc is a fraction, so format it as a percentage

#print('best_fi:', best_fi, '; Best feature is: ', feature_names[best_fi], '; best_t:', best_t, '; best_acc:', best_acc)



## Remark:

So far, we can classify Setosa perfectly (100% accuracy) and we found the *best feature* to decide whether a flower is Virginica or Versicolor with 94% accuracy.  But is this good?

Well, not really, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The above reasoning is **circular**.

What we want to do is estimate the ability of the model to generalize to *new instances*. We should measure its performance on instances that the algorithm has **not seen during training**.

This can be achieved via **cross-validation**.  In a nutshell, instead of using all the data for training, we hold some out for *validation*.  Let's illustrate.

In [None]:
## Simple illustration of cross-validation using "leave-one-out" for cross validation
## In other words, just leave ONE data point for doing validation and the rest for training

## Let's define some functions

def is_virginica_test(fi, t, reverse, example):
    'Apply the threshold model to a new example; return True if it is predicted to be virginica.'
    test = example[fi] > t
    if reverse:
        test = not test
    return test

#from threshold import fit_model, predict

def fit_model(features, labels):
    '''Learn a simple threshold model'''
    best_acc = -1.0
    # Loop over all the features:
    for fi in range(features.shape[1]):
        thresh = features[:, fi].copy()   # get all the feature values in fi
        # test all feature values in order:
        thresh.sort()     # sort first
        for t in thresh:
            pred = (features[:, fi] > t)

            # Measure the accuracy of this threshold
            acc = (pred == labels).mean()

            rev_acc = (pred == ~labels).mean()
            if rev_acc > acc:
                acc = rev_acc
                reverse = True
            else:
                reverse = False
            if acc > best_acc:
                best_acc = acc
                best_fi = fi
                best_t = t
                best_reverse = reverse

    # A model is a threshold, a feature index, and a reverse flag
    return best_t, best_fi, best_reverse

def predict(model, features):
    '''Apply a learned model'''
    # A model is a triple (t, fi, reverse) as returned by fit_model
    t, fi, reverse = model
    if reverse:
        return features[:, fi] <= t
    else:
        return features[:, fi] > t

correct = 0.0     # initialize

for ei in range(len(features)):   # go through all 100 sample points
    # we will use all but the one at the position ei
    training = np.ones(len(features), bool)  # make an array with 100 entries for True
    training[ei] = False      # exclude the data item with index ei
    testing = ~training
    model = fit_model(features[training], is_virginica[training]) 
    predictions = predict(model, features[testing])
    correct += np.sum(predictions == is_virginica[testing])

acc = correct/float(len(features))
print('Accuracy:{0:.1%}'.format(acc))

## Remark

In the above procedure, we took out one sample before training and tested on it afterwards, so we break the circular argument!  This is known as **leave-one-out cross-validation**. 

We can *generalize* the above concept.  Instead of leaving out a single labeled sample point, we split the data into $k$ groups (folds) and leave out one whole fold at a time. We train on the remaining folds and use the held-out fold for validation.  This is called **k-fold cross-validation**. 

Let's say $k=5$ (i.e., we separate the inputs into $5$ groups). Then each time we leave around $20\%$ (i.e., $\frac{100}{k}\%$) of the data out for validation. The idea is illustrated as follows:
![image](figure-k-fold.png)

For the above example, we have to go through training $5$ times, and the final accuracy is the **average** of the five experiments.  

Note that with a larger $k$, the computational cost is higher (we train $k$ models), but each model is trained on more of the data, so the accuracy estimate is typically less pessimistic.  A common rule of thumb is to use $5$-fold cross-validation.
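
To make the fold bookkeeping concrete, here is a minimal NumPy-only sketch of k-fold splitting. The helpers `k_fold_indices` and `cross_validate`, and the `evaluate` callback they take, are hypothetical names introduced for illustration; in practice scikit-learn's `model_selection.KFold` does this work for us.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once so folds do not follow the data order
    folds = np.array_split(order, k)     # k groups of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate(evaluate, n_samples, k=5):
    """Average the per-fold scores returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return float(np.mean(scores))

# With 100 samples and k=5, each fold holds out 20 samples and trains on the other 80.
for train_idx, test_idx in k_fold_indices(100, k=5):
    print(len(train_idx), len(test_idx))
```

Here `evaluate` would fit a model on the rows in `train_idx` and return its accuracy on the rows in `test_idx`; every sample ends up in the held-out fold exactly once.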

## Explore a classifier:  Decision Tree

The above code separates the classes with a single threshold on one feature.  A **decision tree** generalizes this idea: it applies a sequence of such threshold tests, so it can also handle separation with *multiple features*.

In class, we will discuss the **decision tree** classifier/regressor.  Right now, just think
of it as a way to divide up the feature space so as to get a **good** 
(heuristically speaking) accuracy result.  Let's study this.

In [None]:
# load decision tree classifier
from sklearn import tree

tr = tree.DecisionTreeClassifier(min_samples_leaf=10)

# fit performs the learning (or it fits the model)
tr.fit(features, labels)

# Evaluating performance on the training set (which is not the right way because
# we didn't split the data for training and testing !!!!!!! ) 
prediction = tr.predict(features)
print("Accuracy of the decision tree: {:.1%}".format(np.mean(prediction == labels)))

# Plotting the decision tree (using an intermediate file):

# You may need to first install graphviz via % pip install graphviz

import graphviz

tree.export_graphviz(tr, feature_names=feature_names, rounded=True, out_file='decision.dot')

graphviz.Source(open('decision.dot').read())  # display the decision tree

## Let's try to do a leave-one-out approach to check the accuracy

In [None]:
# Leave-one-out: for each sample i, we remove it, train the tree on the remaining
# samples, and predict the label of the removed sample.
# The accuracy is the fraction of held-out predictions that are correct.

predictions = []
for i in range(len(features)):
    train_features = np.delete(features, i, axis=0)  # all rows except row i
    train_labels = np.delete(labels, i, axis=0)
    tr.fit(train_features, train_labels)
    predictions.append(tr.predict([features[i]]))
predictions = np.array(predictions)

print("Accuracy (with LOO cross-validation): {:.1%}".
              format(np.mean(predictions.ravel() == labels)))

## The right way to do leave-one-out with a decision tree

In [None]:
from sklearn import model_selection

predictions = model_selection.cross_val_predict(
    tr,
    features,
    labels,
    cv=model_selection.LeaveOneOut())
print('Accuracy (with LOO cross-validation):',np.mean(predictions == labels))

## A more complex dataset and a more advanced classifier

Previously, we tried to find a *threshold* on a single feature to do classification. 
This is too simplistic because one can also find a threshold for a **combination** of features.
Let's examine a more advanced classification algorithm and apply it to a more complicated
dataset.

The following agricultural dataset is too large to plot exhaustively.  There are
three possible wheat labels:
* Canadian
* Kama
* Rosa

There are seven features for each sample point:
1. Area $A$
2. Perimeter $P$
3. Compactness $C=4\pi A/P^2$
4. Length of kernel
5. Width of kernel
6. Asymmetry coefficient
7. Length of kernel groove

Notice that feature 3 is derived from features 1 and 2.  Selecting which features to use, as well
as how to *transform* them, is called **feature engineering**.  In general,
it is quite "*laborious*", but one can often get better performance by applying a simple algorithm to a 
dataset with good features than by applying an advanced algorithm to bad or poorly selected features.
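
As a small illustration of a derived feature, the sketch below computes compactness from area and perimeter using the formula $C=4\pi A/P^2$ given above. The function name `compactness` and the test shapes are illustrative choices, not part of the dataset.

```python
import numpy as np

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P**2, which equals 1 for a perfect circle."""
    area = np.asarray(area, dtype=float)
    perimeter = np.asarray(perimeter, dtype=float)
    return 4.0 * np.pi * area / perimeter ** 2

# Sanity check on a circle of radius r: A = pi*r^2 and P = 2*pi*r, so C should be 1
r = 3.0
print(compactness(np.pi * r**2, 2 * np.pi * r))

# A unit square (A = 1, P = 4) is less compact than a circle: C = pi/4
print(compactness(1.0, 4.0))
```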

This dataset is in the directory:  *data/seeds.tsv*

**We should take a look at that file.**
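
The `load_dataset` helper used later comes from a local `load` module that is not shown in this notebook. As a rough sketch of what such a loader might look like, the function below assumes that each line holds tab-separated numeric features followed by a text label in the last column. That layout, the name `parse_tsv`, and the two sample rows are assumptions made for illustration; they are not the contents of the actual file.

```python
import io
import numpy as np

def parse_tsv(lines):
    """Parse tab-separated rows: numeric features, then a label in the last column."""
    features, labels = [], []
    for line in lines:
        tokens = line.strip().split('\t')
        if len(tokens) < 2:          # skip blank or malformed lines
            continue
        features.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
    return np.array(features), np.array(labels)

# Made-up rows standing in for a real file (values are NOT from seeds.tsv)
fake_file = io.StringIO("1.0\t2.0\t0.5\tA\n3.0\t4.0\t0.7\tB\n")
X, y = parse_tsv(fake_file)
print(X.shape, y)
```

With a real file, one would pass an open file handle instead of the `io.StringIO` object.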

## Nearest neighbor classifier as well as KNN classifier

The nearest neighbor classifier is very simple. When classifying a new element, it looks at the training data for the object that is **closest** to it, its nearest neighbor. Then, it returns its label as the answer.

One can also *generalize* the above classifier to the **KNN** model; that is, it looks 
not at a single neighbor, but at the $k\geq 1$ nearest ones and takes a *majority vote* 
amongst the neighbors to do the classification.
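
To see the mechanics before using a library, here is a minimal from-scratch sketch of KNN prediction with Euclidean distance. The function `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements it internally.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two clusters in 2-D
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
train_y = ['a', 'a', 'a', 'b', 'b']
print(knn_predict(train_X, train_y, [0.3, 0.3], k=3))   # query near the first cluster
print(knn_predict(train_X, train_y, [4.8, 5.2], k=1))   # query near the second cluster
```

With `k=1` this reduces to the nearest neighbor classifier described above.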

Scikit-learn has many built-in classification/regression models, including KNN.  Let's 
illustrate it.

In [None]:

###########################################
############## SEEDS DATASET ##############
###########################################

from load import load_dataset

# define feature_names

feature_names = [
    'area',
    'perimeter',
    'compactness',
    'length of kernel',
    'width of kernel',
    'asymmetry coefficient',
    'length of kernel groove',
]

features, labels = load_dataset('seeds')  # load both features and labels


print('Shape of features: ', features.shape,'\n')
print('Shape of labels: ', labels.shape, '\n')
print('first 10 features are: ', features[:10],'\n')
print('first 10 labels are: ', labels[:10],'\n')



from sklearn.neighbors import KNeighborsClassifier

# define parameter for KNN
k=1  # look for the NEAREST NEIGHBOR
classifier = KNeighborsClassifier(n_neighbors=k)  # define  1NN

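# Note: with shuffle=False the folds follow the order of the file; if the file is sorted
# by label, consider shuffle=True (with a fixed random_state) to get more balanced folds.
kf = model_selection.KFold(n_splits=5, shuffle=False)

# kf.split(features) yields, for each fold, the indices of the training and testing points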
means = []
for training, testing in kf.split(features):
   # We learn a model for this fold with `fit` and then apply it to the
   # testing data with `predict`:
   classifier.fit(features[training], labels[training])  # train on the training folds
   prediction = classifier.predict(features[testing])    # predict on the held-out fold

   # np.mean on an array of booleans returns fraction
   # of correct decisions for this fold:
   curmean = np.mean(prediction == labels[testing])  # compute the prediction rate
   means.append(curmean)                             # append the prediction rate to means list
    
print('means list contains: ', means, '\n')   # print out the means
print('Mean accuracy: {:.1%}'.format(np.mean(means)))   # print out the average of means



###  Normalizing features

If you look at the features printed above, you will notice that different features have different ranges. This may create a bias in learning, in particular when we use *distance* to 
find the $k$ *nearest* neighbors: features with larger ranges dominate the distance.

In ML, we usually need to **normalize** all of the features to a common scale. This can
be achieved using a *z-score*, which measures
how far a value is from the feature's mean, in units of the feature's standard deviation. Mathematically, it is:
$$f' = \frac{f-\mu}{\sigma}$$
where $f$ is the original feature's value, $\mu$ is the feature's mean, and $\sigma$ is
the feature's standard deviation.  Therefore $f'$ is *mean-centered* and *scaled*
by the standard deviation.
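
The z-score formula above can be sketched directly in NumPy. The function name `z_score` and the small array are made up for illustration; the population standard deviation (NumPy's default) is used here.

```python
import numpy as np

def z_score(features):
    """Standardize each column to zero mean and unit standard deviation."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)       # per-feature mean
    sigma = features.std(axis=0)     # per-feature (population) standard deviation
    return (features - mu) / sigma

# Made-up data: the two columns live on very different scales
raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
scaled = z_score(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

When cross-validating, $\mu$ and $\sigma$ should be computed on the training fold only and then applied to the held-out fold, so that no information from the validation data leaks into training.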

In scikit-learn, one can use a *Pipeline* to achieve this.  Let's illustrate.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

classifier = KNeighborsClassifier(n_neighbors=1)    # call the 1NN classifier

# The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a 
# step in the pipeline: the first element is a string naming the step, while the 
# second element is the object that performs the transformation.

# Here, we use a pipeline to chain two operations: first convert the input into z-scores, then
# pipe the result to the classifier defined above.  This becomes our ENHANCED classifier
classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

means = []
for training,testing in kf.split(features):  # use the Kfold from previous cell
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    classifier.fit(features[training], labels[training])
    prediction = classifier.predict(features[testing])

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction == labels[testing])
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))

## Plotting the decision area for 1NN, 1NN with normalization, KNN where $k=11$

In [None]:
### Note that this program may take around 30 seconds to run !!!!!

COLOUR_FIGURE = True

from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from load import load_dataset
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

feature_names = [
    'area',
    'perimeter',
    'compactness',
    'length of kernel',
    'width of kernel',
    'asymmetry coefficient',
    'length of kernel groove',
]


def plot_decision(features, labels, num_neighbors=1):
    '''Plots decision boundary for KNN

    Parameters
    ----------
    features : ndarray
    labels : sequence

    Returns
    -------
    fig : Matplotlib Figure
    ax  : Matplotlib Axes
    '''
    y0, y1 = features[:, 2].min() * .9, features[:, 2].max() * 1.1
    x0, x1 = features[:, 0].min() * .9, features[:, 0].max() * 1.1
    X = np.linspace(x0, x1, 1000)
    Y = np.linspace(y0, y1, 1000)
    X, Y = np.meshgrid(X, Y)

    model = KNeighborsClassifier(num_neighbors)
    model.fit(features[:, (0,2)], labels)
    C = model.predict(np.vstack([X.ravel(), Y.ravel()]).T).reshape(X.shape)
    if COLOUR_FIGURE:
        cmap = ListedColormap([(1., .7, .7), (.7, 1., .7), (.7, .7, 1.)])
    else:
        cmap = ListedColormap([(1., 1., 1.), (.2, .2, .2), (.6, .6, .6)])
    fig,ax = plt.subplots()
    ax.set_xlim(x0, x1)
    ax.set_ylim(y0, y1)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[2])
    ax.pcolormesh(X, Y, C, cmap=cmap)
    if COLOUR_FIGURE:
        cmap = ListedColormap([(1., .0, .0), (.1, .6, .1), (.0, .0, 1.)])
        ax.scatter(features[:, 0], features[:, 2], c=labels, cmap=cmap)
    else:
        for lab, ma in zip(range(3), "Do^"):
            ax.plot(features[labels == lab, 0], features[
                     labels == lab, 2], ma, c=(1., 1., 1.), ms=6)
    return fig,ax


features, labels = load_dataset('seeds')
names = sorted(set(labels))
labels = np.array([names.index(ell) for ell in labels])

fig,ax = plot_decision(features, labels)
fig.tight_layout()
fig.savefig('area_vs_compactness_1NN.png')

features -= features.mean(0)
features /= features.std(0)
fig,ax = plot_decision(features, labels)
fig.tight_layout()
fig.savefig('area_vs_compactness_1NN_with_normalization.png')

fig,ax = plot_decision(features, labels, 11)
fig.tight_layout()
fig.savefig('area_vs_compactness_11-NN_with_normalization.png')

plt.show()


## Random forests

An interesting paper, [Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?](http://jmlr.org/papers/v15/delgado14a.html) by Fernández-Delgado et al. (2014), helps explain why we recommend random forests as the **default classifier**.

In [None]:
from sklearn import ensemble
rf = ensemble.RandomForestClassifier(n_estimators=100)

# We use cross-validation to evaluate

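# (named rf_predictions so that we do not overwrite the predict() function defined earlier)
rf_predictions = model_selection.cross_val_predict(rf, features, labels)
print("RF accuracy: {:.1%}".format(np.mean(rf_predictions == labels)))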

# Let's plot it, we define a new decision function


def plot_decision_space(clf, features, target, use_color=True):
    from matplotlib.colors import ListedColormap

    clf.fit(features[:, [0,2]], target)

    y0, y1 = features[:, 2].min() * .9, features[:, 2].max() * 1.1
    x0, x1 = features[:, 0].min() * .9, features[:, 0].max() * 1.1
    X = np.linspace(x0, x1, 1000)
    Y = np.linspace(y0, y1, 1000)
    X, Y = np.meshgrid(X, Y)
    C = clf.predict(np.vstack([X.ravel(), Y.ravel()]).T).reshape(X.shape)
    if use_color:
        cmap = ListedColormap([(1., .7, .7), (.7, 1., .7), (.7, .7, 1.)])
    else:
        cmap = ListedColormap([(1., 1., 1.), (.2, .2, .2), (.6, .6, .6)])

    fig,ax = plt.subplots()
    ax.scatter(features[:, 0], features[:, 2], c=target, cmap=cmap)
    for lab, ma in zip(range(3), "Do^"):
        ax.plot(features[target == lab, 0], features[
                 target == lab, 2], ma, c=(1., 1., 1.), ms=6)

    ax.set_xlim(x0, x1)
    ax.set_ylim(y0, y1)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[2])
    ax.pcolormesh(X, Y, C, cmap=cmap)
    return fig

_= plot_decision_space(rf, features, labels)

## Conclusion

We have learnt: 
* Visualizing subsets of features in our dataset
* Discovering classification rules from visualization
* Using a simple threshold technique to do classification
* The need to split the data into training and validation sets
* From leave-one-out cross-validation to k-fold cross-validation
* Using 1NN and KNN as classifiers
* The need to **normalize** all features
* Color plots of the KNN decision regions (with different values of $k$)
* Classification via random forest

<br>
<span style="font-family:times; font-size:0.9em;">
p.s.: The code above is a "modified" and "enhanced" version of the code from the book, "Building Machine Learning Systems with Python".</span>
