# Introduction to Machine Learning
# Part 2: Classification

__Goal: Predict if the wine is 'drinkable' using its chemical properties.__  

Data: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

## Overview:

Generally, a supervised machine learning workflow will consist of the following elements:
1. Data Cleaning and Exploration
2. Data Preparation
3. Model Selection
4. Train and Validate Model (training data)
5. Test and Make Predictions (test data)

## 0. Preliminaries

* Import supporting libraries
* No plotting this time

In [39]:
# Import libraries
import pandas as pd 
import numpy as np

## 1. Data Cleaning and Exploration

* Skip - same as before

In [40]:
# Import data
df = pd.read_csv('winequality-red.csv', delimiter=';')

## 2. Data Preparation


New:
* You're now a wine snob!
* Create a `drinkable` column based on score >= 7
* Class balance

Normal business:
* Specify features and target variable
* Train / test split
* Standardize scale

__Drinkable?__

In [41]:
# Create new binary column
df['drinkable'] = np.where(df['quality']>=7, 1, 0)

In [42]:
# Inspect changes
df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,drinkable
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,0
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,0
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,1
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7,1
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,0


__Class balance__

In [43]:
# Class balance
round(df['drinkable'].mean(), 2)

0.14

In [44]:
# Baseline accuracy
print('Baseline accuracy:', round(1 - df['drinkable'].mean(), 2))

Baseline accuracy: 0.86


__Specify features and target__

In [45]:
# Specify features.
X = df.drop(['quality', 'drinkable'], axis=1)

# Specify target
y = df['drinkable']

In [46]:
# Features.  Quality and drinkable columns have been removed.
X.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


In [47]:
# Target is binary 
y.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    1
9    0
Name: drinkable, dtype: int32

__Train/Test Split__

In [48]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Specify split.  Reserve 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

In [49]:
# Shapes 
print('X_train:', X_train.shape)
print('X_test:', X_test.shape)

X_train: (1119, 11)
X_test: (480, 11)


__Standardization__

In [50]:
# Import scaler
from sklearn.preprocessing import StandardScaler

# Create instance
scaler = StandardScaler()

# Scale features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Take a look
X_train_scaled

array([[-0.47593758, -0.71625367, -0.78123   , ...,  0.37097178,
         0.2389513 , -0.12936731],
       [-0.82501393, -0.65974919,  0.19702143, ...,  1.01777363,
        -0.65095275, -0.86802753],
       [-0.01050245, -1.50731645,  0.76337752, ..., -0.66391117,
         0.83222066, -0.96036006],
       ...,
       [ 1.38580296, -0.99877609,  1.32973361, ..., -1.05199228,
        -0.05768338,  0.51696039],
       [-1.11591089,  1.51567346, -1.39907301, ...,  1.147134  ,
        -0.82893356,  1.07095556],
       [-1.58134603,  2.19372727, -1.39907301, ...,  2.37605751,
         0.41693211,  0.8862905 ]])

## 3. Model Selection

* Random Forest Classifier

In [51]:
# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Create model instance
rf = RandomForestClassifier()

## 4. Train and Validate Model (training data only)
* Fit with training data
* K-folds cross validation (k=3)
* Feature importance (skip)

__Fit & Validate__

In [52]:
# Import cross validation
from sklearn.model_selection import cross_val_score

# Fit 
rf.fit(X_train_scaled, y_train)

# Validate  
cross_val_score(rf, X_train_scaled, y_train, cv=3)

array([0.86631016, 0.86327078, 0.90322581])

* Barely pulling ahead
* Adjust hyperparameters
* Test additional models
* Pick the best and carry on

## 5. Test and Make Predictions (test data)

* Feed test data into model
* Generate predictions
* A word on accuracy

__Test Accuracy__

In [53]:
# Get score using test data
rf.score(X_test_scaled, y_test)

0.9125

__Predictions__

In [54]:
# Make predictions
y_pred = rf.predict(X_test_scaled)

## But is accuracy enough?

* Evaluating a classifier is more complicated

__Confusion Matrix__

In [55]:
# Import confusion matrix
from sklearn.metrics import confusion_matrix

# Confusion matrix
confusion_matrix(y_test, y_pred)

array([[405,  10],
       [ 32,  33]], dtype=int64)

* Rows = actual values 
* Columns = predicted values

Break apart further:

In [56]:
# Break into components
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print('True positive:', tp)
print('True negative:', tn)
print('False positive:', fp)
print('False negative:', fn)

True positive: 33
True negative: 405
False positive: 10
False negative: 32


Desirable Outcomes: 
* __True positives__ are drinkable wines that were classified as drinkable wines.
* __True negatives__ are undrinkable wines classified as undrinkable wines

Undesirable Outcomes:
* __False positives__ are undrinkable wines classified as drinkable (gross)
* __False negatives__ are drinkable wines classified as undrinkable (say what?!)

We can take this further and calculate the __true positive rate__.  The true positive rate takes the number of identified drinkable wines (true positives) divided by the total number of drinkable wines (true positives + false negatives):

In [57]:
sensitivity = tp / (tp + fn)
print(sensitivity)

0.5076923076923077


__Which to Use?__

Depends on objective.  

## Wrap Up:

Congratulations!  You made it through a regression and classification supervised learning problem!