# Introduction to Machine Learning
# Part 2: Classification

In this notebook, we'll use the same data set to walk through the process of making a supervised learning model for a classification problem.  The task at hand is to __predict if the wine is 'drinkable' using chemical properties.__  

Data: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

## Overview:

(Repeat from before)

Generally, a supervised machine learning workflow will consist of the following elements:
1. Data Cleaning and Exploration
2. Data Preparation
3. Model Selection
4. Train and Validate Model (training data)
5. Test and Make Predictions (test data)

## Preliminaries

As we did before, go ahead and import Pandas and NumPy.  We won't be doing any visualizations in this notebook, so we won't need Matplotlib or Seaborn.  If you were starting from scratch with new data, you'd want to do some data exploration and visualization like we did previously.  

In [1]:
# Import libraries
import pandas as pd 
import numpy as np

## 1. Data Cleaning and Exploration

Refer to the regression notebook for this section.  We won't repeat the same work here.  Go ahead and import the data and we'll move on to the next section.

In [2]:
# Import data
df = pd.read_csv('winequality-red.csv', delimiter=';')

In [3]:
# Take a look
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


## 2. Data Preparation

Let's prepare the data for modeling.  Since the last notebook, you've become a considerable wine snob and will only drink wines that score 7 or greater.

The first task is to create a new column called `drinkable` based on the existing quality score. This column will be a binary feature and indicates if a particular wine is drinkable `1` or not drinkable `0`.

In [4]:
# Create new binary column
df['drinkable'] = np.where(df['quality']>=7, 1, 0)

In [5]:
# Inspect changes.  
df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,drinkable
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,0
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,0
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,1
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7,1
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,0
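
The same thresholding can also be written as a boolean comparison cast to integers.  Here is a minimal sketch, assuming a stand-in `quality` series that holds only the ten scores shown above rather than the full data set:

```python
import numpy as np
import pandas as pd

# Quality scores of the first ten rows shown above (illustrative subset)
quality = pd.Series([5, 5, 5, 6, 5, 5, 5, 7, 7, 5], name="quality")

# np.where picks 1 where the condition holds and 0 elsewhere
drinkable_where = np.where(quality >= 7, 1, 0)

# Equivalent: compare, then cast the booleans to integers
drinkable_cast = (quality >= 7).astype(int)

print(drinkable_where.tolist())
print(drinkable_cast.tolist())
```

Both approaches flag only the two wines that scored 7, which matches the `drinkable` column above.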


Now let's take a look at class balance.  This tells us how much of the data set is drinkable vs. not drinkable.  Ideally, we'd have an equal class balance (50/50).

In [6]:
# Class balance
df['drinkable'].mean()

0.1357098186366479

In [7]:
# Baseline accuracy
print('Baseline accuracy:', 1 - df['drinkable'].mean())

Baseline accuracy: 0.8642901813633521


Turns out we're not so lucky.  The classes are imbalanced: only 13.6% of the data is 'drinkable', which means 86.4% is 'not drinkable'.  So for our model to be of any value, its accuracy needs to beat 86.4%.  Otherwise, it's doing no better than always guessing the dominant class. 
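
If you'd rather have scikit-learn compute that baseline, `DummyClassifier` can play the part of the "always guess the dominant class" model.  This is only a sketch: the label counts (1,599 wines, 217 of them drinkable) are assumptions chosen to match the class balance printed above, and `X_placeholder` is a made-up stand-in for the real features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Assumed counts: 1,599 wines, 217 of them drinkable (about 13.6%)
y = np.array([1] * 217 + [0] * 1382)

# Features are ignored by this strategy, so a placeholder column is enough
X_placeholder = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_placeholder, y)

baseline_accuracy = dummy.score(X_placeholder, y)
print("Baseline accuracy:", baseline_accuracy)
```

The score should land at roughly 0.864, the same number we got from `1 - df['drinkable'].mean()`.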

Ok, now we're ready to separate out our target variable and features.  For the features, let's go ahead and drop both the `quality` and `drinkable` columns.  For the target variable, we'll want the `drinkable` column.  `quality` won't be used in this model, since `drinkable` was derived directly from it.

In [8]:
# Specify features.
X = df.drop(['quality', 'drinkable'], axis=1)

# Specify target
y = df['drinkable']

Let's take a look to confirm we have the desired features:

In [9]:
# Features.  Quality and drinkable columns have been removed.
X.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


In [10]:
# Target is binary 
y.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    1
9    0
Name: drinkable, dtype: int32

As we did before, the next step is to split the data into training and testing sets.  Because we have an imbalanced class, we'll stratify the split on `y`, which means the training and testing sets will have nearly the same percentage of drinkable and not drinkable wines.  If we don't do this, there's a risk that very few of the drinkable wines will get put into the training data set and our model won't be able to learn what characteristics make a wine drinkable.

In [11]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Specify split.  Reserve 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

In [12]:
# Shapes of split features
print('X_train:', X_train.shape)
print('X_test:', X_test.shape)

X_train: (1119, 11)
X_test: (480, 11)
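
It's worth a quick check that stratifying did what we wanted.  The sketch below uses synthetic stand-in data (random features, and assumed label counts of 217 drinkable wines out of 1,599) rather than the real wine data, and compares the share of positives in each split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 1,599 rows, 217 positives (assumed counts)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1599, 11))
y_demo = np.array([1] * 217 + [0] * 1382)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# With stratify=y, both shares should sit close to the overall 13.6%
print("Train share of positives:", y_tr.mean())
print("Test share of positives:", y_te.mean())
```

On the real data, the equivalent check is comparing `y_train.mean()` with `y_test.mean()`.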


And again, let's standardize the data using `StandardScaler()`.

In [13]:
# Import scaler
from sklearn.preprocessing import StandardScaler

# Create instance
scaler = StandardScaler()

In [14]:
# Use instance to fit and transform training features
X_train_scaled = scaler.fit_transform(X_train)

# Transform test features relative to standard dev. and mean of training features
X_test_scaled = scaler.transform(X_test)

# Take a look
X_train_scaled

array([[-0.47593758, -0.71625367, -0.78123   , ...,  0.37097178,
         0.2389513 , -0.12936731],
       [-0.82501393, -0.65974919,  0.19702143, ...,  1.01777363,
        -0.65095275, -0.86802753],
       [-0.01050245, -1.50731645,  0.76337752, ..., -0.66391117,
         0.83222066, -0.96036006],
       ...,
       [ 1.38580296, -0.99877609,  1.32973361, ..., -1.05199228,
        -0.05768338,  0.51696039],
       [-1.11591089,  1.51567346, -1.39907301, ...,  1.147134  ,
        -0.82893356,  1.07095556],
       [-1.58134603,  2.19372727, -1.39907301, ...,  2.37605751,
         0.41693211,  0.8862905 ]])

## 3. Model Selection

We'll use a Random Forest model again, but this time we need to import the `RandomForestClassifier`.

In [15]:
# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Create model instance
rf = RandomForestClassifier()

## 4. Train and Validate Model (training data only)
Now let's fit our model with the training data and use k-fold cross-validation (k=3) again to estimate model performance.

In [16]:
# Import cross validation
from sklearn.model_selection import cross_val_score

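# Fit with training data (this fitted model is used for the test predictions later)
rf.fit(X_train_scaled, y_train)

# Evaluate using cross validation (scores come from fresh copies of rf fit on each fold)
cross_val_score(rf, X_train_scaled, y_train, cv=3)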

array([0.87165775, 0.8847185 , 0.89784946])

Recall from above that our baseline accuracy was 86.4%.  The three folds average about 88.5%, so based on cross validation the model is performing only slightly better than baseline.  We could try adjusting some hyperparameters, or test out another model to see if it performs any better.  
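
As a rough idea of what adjusting hyperparameters could look like, here is a sketch using `GridSearchCV`.  It runs on synthetic stand-in data from `make_classification`, not the wine data, and the grid values are arbitrary assumptions rather than recommendations.  The scaler sits inside a pipeline so that it is refit on each training fold and the validation fold never influences the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training data (not the wine data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.86, 0.14], random_state=0
)

# Scaling inside the pipeline is refit on each training fold
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# Illustrative grid; these values are assumptions, not tuned recommendations
param_grid = {
    "randomforestclassifier__n_estimators": [50, 100],
    "randomforestclassifier__max_depth": [None, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

To try this on the wine data, you would pass the unscaled `X_train` and `y_train` to `search.fit`, since the pipeline handles the scaling.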

More on accuracy in a minute...

## 5. Test and Make Predictions (test data)

Let's go ahead and feed our test data into the model and make predictions.

In [17]:
# Get score using test data
rf.score(X_test_scaled, y_test)

0.9166666666666666

In [18]:
# Make predictions
y_pred = rf.predict(X_test_scaled)

In terms of accuracy, the model is actually performing quite well relative to baseline: 91.7% on the test set versus 86.4%.

## But is accuracy enough?

This brings up an interesting point specific to classification problems.  Evaluating a classifier can be tricky business.  See: https://en.wikipedia.org/wiki/Confusion_matrix.  You have LOTS of options.

Oftentimes, accuracy is not the best metric to use.  At a minimum, you'll want to look at a confusion matrix.  A confusion matrix tells you *how* your data is being classified (or misclassified).  Depending on your objectives, you may want to focus in on a particular metric.

Let's take a look at the confusion matrix for our model.

In [19]:
# Import confusion matrix
from sklearn.metrics import confusion_matrix

# Confusion matrix
confusion_matrix(y_test, y_pred)

array([[409,   6],
       [ 34,  31]], dtype=int64)

The rows represent actual values and the columns represent predicted values. Let's break it apart:

In [20]:
# Break into components
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print('True positive:', tp)
print('True negative:', tn)
print('False positive:', fp)
print('False negative:', fn)

True positive: 31
True negative: 409
False positive: 6
False negative: 34


Desirable Outcomes: 
* __True positives__ are drinkable wines that were classified as drinkable wines.
* __True negatives__ are undrinkable wines classified as undrinkable wines

Undesirable Outcomes:
* __False positives__ are undrinkable wines classified as drinkable (gross)
* __False negatives__ are drinkable wines classified as undrinkable (say what?!)

We can take this further and calculate the __true positive rate__ (also called sensitivity or recall).  The true positive rate is the number of correctly identified drinkable wines (true positives) divided by the total number of drinkable wines (true positives + false negatives):

In [21]:
sensitivity = tp / (tp + fn)
print(sensitivity)

0.47692307692307695
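
The same four counts give us a few other common metrics.  The sketch below hard-codes the counts from the confusion matrix above and computes precision, specificity, and the F1 score using their standard definitions:

```python
# Counts taken from the confusion matrix above
tp, tn, fp, fn = 31, 409, 6, 34

# Precision: of the wines predicted drinkable, how many really are
precision = tp / (tp + fp)

# Specificity (true negative rate): undrinkable wines correctly rejected
specificity = tn / (tn + fp)

# Recall is the true positive rate computed earlier
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision:   {precision:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1 score:    {f1:.3f}")
```

Precision comes out around 0.84 while recall is only about 0.48.  In other words, the model rarely recommends a bad wine, but it misses roughly half of the good ones.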


Which metric to use ultimately comes down to your objective.  Do you care about missing good wine?  Is it more insulting to be served a bad wine?  It's up to you.
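
If missing good wine bothers you more, one option is to lower the probability cutoff used to call a wine drinkable.  The sketch below uses synthetic stand-in data (not the wine data), and the 0.3 cutoff is an arbitrary assumption for illustration.  It compares recall at two cutoffs using `predict_proba`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the wine data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, weights=[0.86, 0.14], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Estimated probability of the positive class (column 1)
proba_pos = clf.predict_proba(X_te)[:, 1]

# Hypothetical thresholds: 0.5 is close to the default, 0.3 is more lenient
recall_default = recall_score(y_te, (proba_pos >= 0.5).astype(int))
recall_lenient = recall_score(y_te, (proba_pos >= 0.3).astype(int))

print("Recall at 0.5:", recall_default)
print("Recall at 0.3:", recall_lenient)
```

Lowering the cutoff can never reduce recall, but it usually lets in more false positives, so you'd want to look at precision alongside it.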

## Wrap Up:

Congratulations!  You've now walked through both a regression and a classification machine learning model.  I hope you've learned something new along the way.  

Thanks for reading!