Part 2 of scikit-learn workshop

We'll start by importing our modules again:

In [37]:
# :: IMPORTS ::

# Scikit-learn specifics:
from sklearn import datasets
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import svm

# Helper modules
import pandas as pd
import matplotlib.pyplot as plt

# <span style = "color:rebeccapurple"> Part 4 - Classification

Now that we are experienced with preprocessing and pipelines, let's move on to a different machine learning task: classification.

### <span style = "color:teal"> Conceptual intermezzo - What is classification?

See slides

## <span style = "color:darkorchid"> Alice: sommelière extraordinaire!

After coming back from her cold polar adventures, Alice is ready for a change of scenery, so she takes a trip to Italy. Being quite the sommelière herself, she visits her favorite ristorante in La Toscana, *Il Cappellaio Matto*, in search of good wine. As it turns out, there is a heated debate going on: everyone is wondering whether sommeliers can actually distinguish different wines, or whether they are just faking it.

Alice decides to settle the dispute with machine learning. Her colleagues trust her, so they give her a dataset containing the chemical composition of different wine samples. All these samples come from the Toscana region, but they were made from three different cultivars (a cultivar is a specific plant variety, in this case varieties of grape vines).

### <span style = "color:teal"> Load data

In [29]:
wine = datasets.load_wine(as_frame = True)

In [65]:
wine_X = wine["data"]
wine_y = wine["target"]

In [66]:
wine_X.shape

(178, 13)

In [31]:
wine_X.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [32]:
wine_y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

Note that the target is a pandas Series. We could convert it to a DataFrame:

In [51]:
wine_y = wine_y.to_frame(name = "cultivar")

In [53]:
wine_y.head()

,cultivar
0,0
1,0
2,0
3,0
4,0


<b>BUT</b>, `scikit-learn` classifiers usually expect the target to be a 1-d array or a Series, so we'll keep it as a Series.

And just to make sure we don't mess things up, let's load the data again so that the target is back to being a Series:

In [54]:
# Load wine data
wine = datasets.load_wine(as_frame = True)
wine_X = wine["data"]
wine_y = wine["target"]
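
Before training anything, it can help to check how many samples each cultivar has and to confirm that the target really is one-dimensional. The following is a minimal sketch, not part of the original workshop cells; the variable names `class_counts` and `target_names` are introduced here only for illustration.

```python
from sklearn import datasets

# Load the wine data the same way as above
wine = datasets.load_wine(as_frame=True)
wine_y = wine["target"]

# Number of samples per cultivar (class labels 0, 1 and 2)
class_counts = wine_y.value_counts().sort_index()
print(class_counts)

# The target is a 1-d Series, which is what classifiers expect
print(wine_y.shape)

# Human-readable names that scikit-learn ships with the dataset
target_names = list(wine["target_names"])
print(target_names)
```

If the classes were very unbalanced, plain accuracy could be misleading, so this is a cheap check worth doing early.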

### <span style = "color:teal"> Using a support vector classifier

For now, we will skip the preprocessing stage and head straight to the classification task. `scikit-learn` provides many different classifiers; here we will use a support vector machine. We can find support vector classifiers in the `svm` module, which we have already imported.

In [89]:
wX_train, wX_test, wy_train, wy_test = train_test_split(wine_X, wine_y, test_size = .15)

In [90]:
# Create SVM classifier:
wine_clf = svm.SVC()

In [91]:
# Fit the SVM classifier:
wine_clf.fit(wX_train, wy_train)

In [92]:
wine_clf.classes_

array([0, 1, 2])

In [93]:
# Predict the cultivars
wine_predictions = wine_clf.predict(wX_test)
wine_predictions

array([2, 1, 1, 0, 2, 2, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 1, 1, 0, 0, 2, 0,
       1, 2, 1, 1, 0])

Let's compare the predictions to the true values:

In [94]:
comparison_df = pd.DataFrame(data = {"Predicted": wine_predictions,
                                    "True Cultivars": wy_test})

In [95]:
comparison_df.head(20)

,Predicted,True Cultivars
158,2,2
84,1,1
87,1,1
19,0,0
100,2,1
153,2,2
9,0,0
135,2,2
54,0,0
125,1,1


Note: we don't have a lot of data points, and since we have three classes (wine varieties), we are a bit tight on data. That's why I chose a small test size. If you go back and change the test size parameter to something like $.3$, you may notice a drop in performance, although the exact numbers will vary from run to run because the split is random.

Let's now check our accuracy:

In [101]:
# Evaluate:
wine_clf.score(wX_test, wy_test)

0.7037037037037037
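
For a classifier, `score` reports accuracy: the fraction of test samples whose predicted class matches the true class. The sketch below recomputes it by hand and also builds a confusion matrix, which shows which cultivars get mixed up. It is only an illustration: the `random_state=0` argument is an assumption added here so that the split is reproducible, so the numbers will not match the output above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

wine = datasets.load_wine(as_frame=True)
X, y = wine["data"], wine["target"]

# random_state is an assumption added for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

clf = svm.SVC().fit(X_train, y_train)
predictions = clf.predict(X_test)

# Accuracy by hand: fraction of correct predictions
manual_accuracy = np.mean(predictions == y_test.to_numpy())

# Accuracy as reported by the classifier
builtin_accuracy = clf.score(X_test, y_test)
print(manual_accuracy, builtin_accuracy)

# Rows are true cultivars, columns are predicted cultivars
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so dividing its sum by the total number of test samples gives the accuracy again.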

### <span style = "color:teal"> Preprocessing

As stated before, the type of preprocessing you need depends on your data and your model. We don't have time to go into much detail, so for now we will stick to the standard scaler.

In [108]:
# Create and fit preprocessor (the scaler only looks at the features, so no target is needed)
wine_preprocessor = preprocessing.StandardScaler().fit(wX_train)

In [109]:
# Transform data
wX_train_trans = wine_preprocessor.transform(wX_train)

In [110]:
# Fit classifier to transformed data
wine_clf = svm.SVC().fit(wX_train_trans, wy_train)

In [113]:
# Evaluate on testing data:
# Transform test data
wX_test_trans = wine_preprocessor.transform(wX_test)

# Check accuracy
wine_clf.score(wX_test_trans, wy_test)

0.9814814814814815

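Whoa! Our accuracy skyrocketed. Looks like preprocessing is very important, huh? Support vector classifiers rely on distances between data points, so a feature with large values (such as proline) can dominate the others unless everything is put on a similar scale.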
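
To see what the standard scaler actually changes, we can compare the spread of the features before and after scaling. This is a small sketch for illustration (the names `raw_std`, `scaled_mean` and `scaled_std` are introduced here); for simplicity it scales the full feature table rather than only the training split.

```python
from sklearn import datasets, preprocessing

wine_X = datasets.load_wine(as_frame=True)["data"]

# Standard deviation of each raw feature: the scales differ a lot
raw_std = wine_X.std()
print(raw_std.sort_values())

# After standard scaling, each column has mean ~0 and standard deviation ~1
scaled = preprocessing.StandardScaler().fit_transform(wine_X)
scaled_mean = scaled.mean(axis=0)
scaled_std = scaled.std(axis=0)
print(scaled_mean.round(3))
print(scaled_std.round(3))
```

In a real workflow the scaler should be fitted on the training data only, exactly as in the cells above.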

### <span style = "color:teal"> Make a pipeline

In [116]:
# Step 1: Load the data
wine = datasets.load_wine(as_frame = True)
wine_X = wine["data"]
wine_y = wine["target"]

# Step 2: Split the data
wX_train, wX_test, wy_train, wy_test = train_test_split(wine_X, wine_y, test_size = .3)

# Step 3: Create the pipeline
wine_pipeline = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)

# Step 4: Fit the pipeline
wine_pipeline.fit(wX_train, wy_train)

# Step 5: Evaluate accuracy of classifier
wine_pipeline.score(wX_test, wy_test)

0.9814814814814815

In [117]:
wine_pipeline
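
A pipeline is a convenient wrapper around the manual steps from the preprocessing section: it fits the scaler on the training data, transforms the data, and fits the classifier, and at prediction time it applies the same fitted scaler to the new data. The sketch below checks that the two routes give the same predictions. The `random_state=0` argument is an assumption added here for reproducibility.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

wine = datasets.load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    wine["data"], wine["target"], test_size=0.3, random_state=0
)

# Route 1: the pipeline handles scaling and classification together
pipeline = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC()),
    ]
)
pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)

# Route 2: the same steps done by hand
scaler = preprocessing.StandardScaler().fit(X_train)
clf = svm.SVC().fit(scaler.transform(X_train), y_train)
manual_predictions = clf.predict(scaler.transform(X_test))

same_predictions = np.array_equal(pipeline_predictions, manual_predictions)
print(same_predictions)

# Individual fitted steps can be retrieved by name
fitted_scaler = pipeline.named_steps["preprocessor"]
print(fitted_scaler.mean_)
```

The practical benefit is that the pipeline never lets you forget to transform the test data, and it never fits the scaler on it.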

## <span style = "color:red"> Long Exercise - Bo's daunting dilemma

Having followed Alice to Italy, Bo travels south to visit the famous city of Pompeii. There, they find the walls are full of ancient Roman graffiti! Bo would like to have a dataset of all the text on the walls, but copying it piece by piece would be incredibly daunting. They decide to take pictures instead and figure it out later.

Back in their lab, Bo needs to create a classifier that takes images of handwritten text as input and maps each image to the symbol it shows. To begin, Bo will focus on numerical digits.

Load the "digits" dataset from `scikit-learn` and build a classifier to achieve this task. Each data point is an image, which is basically a matrix of numbers. You can "flatten" this matrix into a long vector, which will be your feature vector. The labels are the digits that the images represent (0, 1, 2, etc.).
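
If you are unsure how to get started, here is a minimal sketch of the loading and flattening step only; building and evaluating the classifier is left to you. The variable names (`images`, `labels`, `flat`) are just suggestions.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each image is a small matrix of pixel intensities
images = digits.images
labels = digits.target
print(images.shape)

# Flatten each image matrix into one long feature vector
flat = images.reshape(len(images), -1)
print(flat.shape)

# scikit-learn also provides the flattened version directly
print((flat == digits.data).all())
```

From here, `flat` and `labels` play the same roles that `wine_X` and `wine_y` played in the wine example.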

Good luck!

# <span style = "color:rebeccapurple"> Summary

OK, we've done a lot. We have learned:
1. How to preprocess data.
2. How to split data into training and testing.
3. How to build a linear regression object for regression.
4. How to build a support vector classifier for classification.
5. How to build pipelines.

There are so many things left to do! For instance, there are many other types of classifiers: k-nearest neighbors, logistic regression, decision trees, etc. We can't cover them all here, so you'll have to venture into those lands by yourself. But don't worry, you can always come to us for help :-)

## <span style = "color:darkorchid"> What's next?

1. Well, immediately, I suggest you learn about <b>cross-validation</b>. It is an <span style = "color:red">extremely important</span> concept that we did not cover. Luckily, `scikit-learn` can do cross-validation for you, AND it works well with pipelines.

2. Second, our metrics for evaluation were pretty simple. There is a whole world to explore there. For instance, if you are doing classification, you should learn about the "receiver operating characteristic" curve (ROC curve).

3. Third, I suggest you play around with other models for both regression and classification.

4. Fourth, go beyond supervised learning and into the realm of unsupervised learning. You may need to learn what these terms mean first. Then learn about "clustering", the most common unsupervised learning method. I've prepared an optional notebook for that!

5. Finally, just keep learning and keep doing. At times it will be frustrating, that's fine, that's how it is with everything new. Good luck in your machine learning endeavors :-)
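
As a starting point for the first suggestion, here is a minimal sketch of cross-validation with a pipeline, reusing the wine data. Instead of a single train/test split, the data is split into 5 folds and the pipeline is fitted and scored 5 times, each time holding out a different fold. The choice of `cv=5` is an assumption made for this example, not a recommendation from the workshop.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

wine = datasets.load_wine(as_frame=True)

pipeline = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC()),
    ]
)

# One accuracy value per fold; the scaler is refitted inside each fold
scores = cross_val_score(pipeline, wine["data"], wine["target"], cv=5)
print(scores)
print(scores.mean(), scores.std())
```

Because the scaler lives inside the pipeline, it is fitted only on the training part of each fold, which keeps the held-out fold truly unseen.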

# <span style = "color:rebeccapurple"> (Optional) Supervised versus unsupervised learning

If there is time, go to the "clustering" notebook.