
# Classifier comparison

A comparison of several classifiers in scikit-learn on synthetic datasets.
The point of this example is to illustrate the nature of decision boundaries
of different classifiers.
This should be taken with a grain of salt, as the intuition conveyed by
these examples does not necessarily carry over to real datasets.

Particularly in high-dimensional spaces, data can more easily be separated
linearly and the simplicity of classifiers such as naive Bayes and linear SVMs
might lead to better generalization than is achieved by other classifiers.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
set.


## Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC, NuSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import *
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
# The plot_* helper functions were removed in scikit-learn 1.2;
# the Display classes below replace them.
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

## Show plot output inline in Google Colab

In [None]:
%matplotlib inline

## Upload the CSV files from drawdata.xyz
You can draw and upload several! Feel free to experiment!

In [None]:
from google.colab import files
uploaded = files.upload()

## Create a Python list with the names of all the classifiers you have imported

In [None]:
names = # <- create a list here 

## TODO: Create a Python list with one instance of each classifier named above.
You may need to check the imports to see which class to instantiate for each name.

You can keep the default hyperparameters. For better results, check the documentation of each classifier to see which parameters its constructor accepts:

https://scikit-learn.org/stable/supervised_learning.html

In [None]:
classifiers = # <- create a list here 
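
As a rough illustration of what the two lists could look like, here is a minimal sketch that uses only a handful of the imported classifiers with their default hyperparameters. The display names are arbitrary labels chosen for this sketch, and the selection of classifiers is an assumption; your own lists can include any of the imported classes.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Display names (arbitrary labels) in the same order as the instances below.
names = ["Nearest Neighbors", "RBF SVM", "Decision Tree",
         "Random Forest", "Naive Bayes"]

# One instance per name, all with default hyperparameters.
classifiers = [
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GaussianNB(),
]

# The later cells pair names and classifiers with zip(), so the
# two lists must have the same length and the same order.
print(len(names), len(classifiers))
```

Because the later cells use `zip(names, classifiers)`, a missing entry in either list silently drops the last classifier instead of raising an error, so it is worth checking the lengths.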

## TODO: Read the CSV files you have uploaded. Show the dataframe with the points from your drawing

In [None]:
df = # <- read with pandas your dataframe 
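
One way to approach this step, sketched under assumptions: in Colab, `files.upload()` returns a dictionary that maps each file name to the file's raw bytes, and the CSV has the columns `x`, `y` and `z` used in the next cells. The file name `data.csv` and the rows below are made up so that the sketch runs outside Colab.

```python
import io
import pandas as pd

# Stand-in for the dict returned by files.upload() in Colab:
# file name -> raw bytes. File name and rows are hypothetical.
uploaded = {
    "data.csv": b"x,y,z\n10.0,20.0,a\n12.5,18.0,a\n40.0,55.0,b\n42.0,60.5,b\n"
}

# Take the first uploaded file and parse its bytes as CSV.
file_name = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[file_name]))

print(df)
```

If you uploaded several drawings, you can loop over `uploaded.items()` and read each file the same way.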

## Separate the class labels into another dataframe

In [None]:
df_points = df[['x', 'y']]
df_class = df[['z']]

## TODO: Get the values of both dataframes. Use the `values` attribute

In [None]:
df_points_values = # <- get values of df_points dataframe
df_class_values = # <- get values of df_class dataframe

## Print df_points_values. It should be an array of [x, y] values

In [None]:
df_points_values

## Print the class values.
You will see one single-element array (list) per line. Scikit-learn expects a single one-dimensional array containing all the labels, so we concatenate them.

In [None]:
df_class_values

In [None]:
df_class_values_concat = np.concatenate(df_class_values, axis=0)
df_class_values_concat
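
To see what the concatenation changes, the following sketch compares the array shapes before and after on a tiny made-up dataframe (the values are hypothetical). It also shows `ravel()`, which gives the same flat array here.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with the same column names as the drawing data.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                   "y": [4.0, 5.0, 6.0],
                   "z": ["a", "b", "a"]})

# Selecting with a list of columns keeps two dimensions: shape (3, 1).
column = df[["z"]].values

# Concatenating the rows along axis 0 flattens it to shape (3,).
flat = np.concatenate(column, axis=0)

# ravel() produces the same one-dimensional result.
flat_alt = column.ravel()

print(column.shape, flat.shape, flat_alt.shape)
```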

## TODO: Create a tuple with the points and the class labels

In [None]:
df_tuple = # <- create a tuple with 2 elements, df_points_values    and     df_class_values_concat
df_tuple

## Run this code that will:
1) For each dataset...

2) ... for each classifier ...

3) ......train.......

4) ......predict......

5) ......plot.


In [None]:
h=.2
figure = plt.figure(figsize=(27, 9))
i = 1

# preprocess dataset, split into training and test part
X, y = df_tuple
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.4, random_state=42)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))

# just plot the dataset first
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(1, len(classifiers) + 1, i)
ax.set_title("Input data")
# Plot the training points (solid gold). No cmap is passed because a
# single fixed color is used, so a colormap would be ignored.
ax.scatter(X_train[:, 0], X_train[:, 1], c='gold', edgecolors='k')
# Plot the testing points (semi-transparent green)
ax.scatter(X_test[:, 0], X_test[:, 1], c='green', alpha=0.6,
            edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
i += 1

# iterate over classifiers
for name, clf in zip(names, classifiers):
    ax = plt.subplot(1, len(classifiers) + 1, i)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    if hasattr(clf, "decision_function"):
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    
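    # Put the result into a color plot. If decision_function returns one
    # column per class (more than two classes), Z cannot be reshaped to the
    # mesh; in that case the contour is skipped and the classifier is reported.
    try:
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
    except ValueError:
        print('Error: ' + str(clf))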
    
    # Plot the training points (solid gold)
    ax.scatter(X_train[:, 0], X_train[:, 1], c='gold', edgecolors='k')
    # Plot the testing points (semi-transparent green)
    ax.scatter(X_test[:, 0], X_test[:, 1], c='green',
                edgecolors='k', alpha=0.6)

    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(name)
    ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
            size=15, horizontalalignment='right')
    i += 1

plt.tight_layout()
plt.show()
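
The mesh evaluation used in the cell above can be looked at in isolation. The sketch below is only an illustration under assumptions: it uses `make_moons` data and an `SVC` with default settings instead of your drawing, and it skips the plotting so that only the array shapes are visible.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic two-class data standing in for the drawn points.
X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
clf = SVC().fit(X, y)

# Build a regular grid that covers the data with a margin of 0.5.
h = 0.2  # grid step, same value as in the cell above
xx, yy = np.meshgrid(
    np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, h),
    np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, h),
)

# Flatten the grid into a list of (x, y) points, one row per grid node.
grid = np.c_[xx.ravel(), yy.ravel()]

# Score every grid node, then fold the scores back into the grid shape
# so that contourf can draw them as a colored surface.
Z = clf.decision_function(grid).reshape(xx.shape)

print(grid.shape, Z.shape)
```

A smaller `h` gives a smoother picture at the cost of more points to evaluate.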

## Confusion Matrix
Run this code that will:

1) For each classifier ...

2) ......train.......

3) ......predict......

4) ......calculate confusion matrix...

5) ......plot.

In [None]:
# ConfusionMatrixDisplay replaces plot_confusion_matrix,
# which was removed in scikit-learn 1.2.
from sklearn.metrics import ConfusionMatrixDisplay

figure = plt.figure(figsize=(20, 2))

X, y = df_tuple
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.4, random_state=42)

# iterate over classifiers, one subplot per classifier
for i, (name, clf) in enumerate(zip(names, classifiers), start=1):
    ax = plt.subplot(1, len(classifiers) + 1, i)
    clf.fit(X_train, y_train)
    # colorbar=False leaves out the per-plot colorbar
    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax,
                                          colorbar=False)
    ax.set_title(name)

plt.show()

## ROC Curve
Run this code that will:

1) For each classifier ...

2) ......train.......

3) ......predict......

4) ......calculate ROC and AUC...

5) ......plot.


In [None]:
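# RocCurveDisplay replaces plot_roc_curve,
# which was removed in scikit-learn 1.2.
from sklearn.metrics import RocCurveDisplay

X, y = df_tuple
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.4, random_state=42)
# iterate over classifiers (ROC curves require a two-class problem)
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name)

plt.show()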
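
To connect the ROC plot to the numbers behind it, the following sketch computes the curve and the AUC directly with `roc_curve` and `auc`. It is an illustration under assumptions: it uses `make_moons` data and a `GaussianNB` classifier rather than your drawing and your classifier list.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Synthetic two-class data standing in for the drawn points.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

clf = GaussianNB().fit(X_train, y_train)

# Probability of the positive class (label 1) for each test point.
scores = clf.predict_proba(X_test)[:, 1]

# Each threshold on the score yields one (false positive rate,
# true positive rate) pair; together they trace the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)

# The AUC is the area under that curve.
roc_auc = auc(fpr, tpr)

print(f"AUC = {roc_auc:.3f} from {len(thresholds)} thresholds")
```

`roc_auc_score(y_test, scores)` gives the same area in a single call when the curve itself is not needed.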

## Precision, Recall, F1
Run this code that will:

1) For each classifier ...

2) ......train.......

3) ......predict......

4) ......calculate metrics...

5) ......print them.

In [None]:
# preprocess dataset, split into training and test part
X, y = df_tuple
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.4, random_state=42)
    
# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # each metric is an array with one value per class
    precision, recall, f1, support = precision_recall_fscore_support(y_test, pred)
    print(f"Classifier {name} metrics:\n-P({precision})\n-R({recall})\n-F1({f1})\n-Support=({support})\n")
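
The metrics printed above can be related to the confusion matrix by hand. The sketch below uses ten made-up labels (not your data) and recomputes precision, recall, F1 and support from the matrix, then compares them with `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up ground truth and predictions: 4 points of class "a", 6 of class "b".
y_true = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "b", "b", "b", "a", "a"]

labels = ["a", "b"]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

correct = np.diag(cm)                      # correctly classified, per class
precision_manual = correct / cm.sum(axis=0)  # divided by predicted counts
recall_manual = correct / cm.sum(axis=1)     # divided by true counts
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))
support_manual = cm.sum(axis=1)            # number of true points per class

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels)

print(cm)
print(precision_manual, recall_manual, f1_manual, support_manual)
```

Passing `average="macro"` or `average="weighted"` to `precision_recall_fscore_support` collapses the per-class arrays into single numbers if you want one summary value per classifier.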

## Answer the following questions with a colleague

1.   Which model do you think is best for your data?
2.   Do the results differ much among models?
3.   What do the color gradients mean?
4.   How do the different models work? Discuss the theory.
5.   In Precision / Recall / F1, why are there two values?
6.   In Precision / Recall / F1, what does 'Support' mean?
7.   What different metrics have we used? How do you interpret them?