
# Introducing scikit-learn
<br>   
Author: James D. Triveri    
<br>
    




**scikit-learn** is a popular, full-featured machine learning library written for Python. 
The API is remarkably well designed. The library exposes and makes regular use of the following objects: 
<br>     
  * **Estimators**: Any object that can estimate a set of parameters based on a dataset
    is called an estimator. The estimation itself is
    performed by the `fit()` method, which takes only a dataset as a parameter (or
    two for supervised learning algorithms, where the second dataset contains the
    labels). Any other parameter needed to guide the estimation process is
    considered a hyperparameter.  
<br>    
  * **Transformers**: Some estimators can also transform a dataset. The
    transformation is facilitated by the `transform()` method, which returns the 
    transformed dataset. The transformation generally relies on the learned 
    parameters, as is the case for an imputer.   
<br>         
  * **Predictors**: Some estimators are capable of making predictions given a
    dataset. A predictor has a `predict()` method that takes a dataset of new 
    instances and returns a dataset of corresponding predictions. Predictors also 
    have a `score()` method that measures the quality of the predictions given
    a test set (and the corresponding labels in the case of supervised learning
    algorithms).     
<br>   
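
As a minimal sketch of how these three roles look in code (the synthetic dataset from `make_classification` and all variable names are illustrative assumptions, not part of the original material):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic data, used only for illustration =>
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# StandardScaler is an estimator (fit) and a transformer (transform) =>
sclr = StandardScaler()
sclr.fit(X)                    # learns per-feature mean and variance
X_scaled = sclr.transform(X)   # applies the learned parameters

# LogisticRegression is an estimator (fit) and a predictor (predict, score);
# `C` is a hyperparameter, supplied at instantiation rather than to fit() =>
clf = LogisticRegression(C=1.0)
clf.fit(X_scaled, y)           # supervised: features and labels
y_hat = clf.predict(X_scaled)
acc = clf.score(X_scaled, y)   # mean accuracy on the given data
```

Note that `fit_transform()` combines the `fit()` and `transform()` calls in a single step.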



To implement a model in scikit-learn:

* **Select a model appropriate for the task at hand** (see below)        
<br>   
* **Pre-process explanatory data** (scale variables, impute missing data, encode categorical variables)     
<br>
* **Instantiate model**        
<br>        
* **Fit model to training data**            
<br>   
* **Tune hyperparameters via cross-validation**       
<br>    
* **Predict classes on test/holdout data**    
<br>    
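
The steps above can be sketched end to end as follows. This is one possible arrangement under stated assumptions: the data is synthetic, kNN is an arbitrary model choice, and the `Pipeline` object is used here only so that scaling is re-fit inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a real dataset =>
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
                                X, y, test_size=.33, random_state=16)

# pre-processing and model chained into a single estimator =>
pipe = Pipeline([('sclr', StandardScaler()), ('knn', KNeighborsClassifier())])

# tune `n_neighbors` via 5-fold cross-validation on the training data =>
grid_search = GridSearchCV(
                    pipe, {'knn__n_neighbors': [1, 3, 5, 7, 9]}, cv=5,
                    scoring='accuracy'
                    )
grid_search.fit(X_train, y_train)

# predict classes on holdout data with the best configuration found =>
y_hat = grid_search.predict(X_test)
print(grid_search.best_params_, grid_search.score(X_test, y_test))
```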


![sklearn cheat-sheet](ml_map.png)


<br>  
    
[Link to cheat-sheet](http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html)


## Preprocessing

Preprocessing in sklearn is primarily concerned with:

* **Encoding**: Encode string features as ints; 'One-Hot' encode features with more than 2 values.   
 <br>  
* **Scaling**: Standardize explanatory variables. The two most common techniques are:    

    * `StandardScaler`: Standardize features by removing the mean and scaling to unit variance.   
    <br>     
    * `MinMaxScaler`: Scales and translates each feature individually such that it is in the 
      given range on the training set, i.e. between 0 and 1.   
 <br> 
* **Imputing**: Replace missing values with the mean, median or most frequent value of 
  the corresponding feature.  
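
To make the scaling and one-hot encoding steps concrete, here is a small sketch on made-up values (the arrays `amounts` and `levels` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# MinMaxScaler: each column is mapped linearly onto [0, 1]
# via (x - min) / (max - min) =>
amounts = np.array([[10.0], [20.0], [40.0]])
scaled  = MinMaxScaler().fit_transform(amounts)

# OneHotEncoder: one indicator column per category, with columns
# ordered by sorted category ('A', 'B', 'C') =>
levels  = np.array([['A'], ['C'], ['B'], ['A']])
encoded = OneHotEncoder().fit_transform(levels).toarray()

print(scaled.ravel())
print(encoded)
```

The encoder returns a sparse matrix by default, hence the call to `.toarray()` before printing.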


In [None]:
# Preprocessing Example =>

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'COL1'    : [215.43,np.nan,212.08,169.23,428.43,112.15,np.nan,338.20,192.25,213.40],
    'COL2'    : ['A','B','A','B','B','A','B','A','B','B'],
    'COL3'    : [ np.nan,2.39476,1.78377,0.24551,np.nan,-0.99175,-0.01066,-0.88275,0.21074,-0.5943],
    'COL4'    : [5501.75,9465.95,9544.02,11564.05,4984.09, 4467.97,np.nan,26996.26,np.nan,6763.45],
    'COL5'    : ['M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M'],
    'RESPONSE': ['Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'Y']
    })

df.head()

In [None]:
# [1] Impute missing values =>

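# NOTE: `sklearn.preprocessing.Imputer` was removed in later releases of
#       scikit-learn; `sklearn.impute.SimpleImputer` is its replacement =>
from sklearn.impute import SimpleImputer

# `strategy` can be one of 'mean', 'median' or 'most_frequent'
imp = SimpleImputer(missing_values=np.nan, strategy='mean')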

df[['COL1', 'COL3', 'COL4']] = imp.fit_transform(df[['COL1', 'COL3', 'COL4']])

df.head()

In [None]:
# [2] Encode categorical features as integers =>
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['COL2']     = le.fit_transform(df['COL2'])
df['COL5']     = le.fit_transform(df['COL5'])
df['RESPONSE'] = le.fit_transform(df['RESPONSE'])

df.head()

In [None]:
# [3] Scale continuous features with StandardScaler =>

from sklearn.preprocessing import StandardScaler

sclr = StandardScaler()

df[['COL1','COL3','COL4']] = sclr.fit_transform(df[['COL1','COL3','COL4']])

df.head()

In [None]:
# design matrix must be fed to classifier as a numpy array =>

# `.values` coerces DataFrame to numpy array; `drop` without `inplace=True`
# returns a new DataFrame and leaves `RESPONSE` available in `df`
X = df.drop('RESPONSE', axis=1).values
y = df['RESPONSE'].values

# ready to pass X, y to classifier...


## scikit-learn Classifiers

[List of available sklearn models](http://scikit-learn.org/stable/user_guide.html)



We'll discuss the following classifiers:

* **Gaussian Naive Bayes**
* **k-Nearest Neighbors**
* **Logistic Regression**
* **Support Vector Machines**
* **Voting Classifiers** (Ensemble Method)




## Gaussian Naive Bayes

*Naive Bayes* methods are a set of supervised learning algorithms based on applying Bayes’ Theorem with the “naive” assumption of independence between every pair of features.
The Naive Bayes classifier makes two strong assumptions:    
<br>    

1.  **The value of a particular feature is independent of the value of any other feature, given the class variable.**   
<br>
2.  **Within each class, each feature is assumed to follow a normal distribution.**      

<br>

**To create a Gaussian Naive Bayes Classifier (without scikit-learn):**

1. Ensure all explanatory variables are continuous: If the dataset contains categorical features, look into the Bernoulli or Multinomial form of Naive Bayes.  
<br>    
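2. For each explanatory variable, estimate the mean and variance within each class. (The worked example below uses the sample variance, with an $n-1$ denominator; the maximum likelihood estimate divides by $n$ instead.)   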
<br>     
3. To classify a new instance, calculate the posterior probability for each class. There will be as many posterior probabilities per unclassified instance as there are distinct classes.    
<br>    
4. The new instance will be classified based on the class with the greatest posterior probability.           
<br>   

**Example:**

Consider a sample dataset representing business school admissions:
<br>
<br>

|  ID        |  GPA  |  GMAT   |   ADMITTED_IND   |
|:----------:|:-----:|:-------:|:------------:|
| 000000001  |  3.14 |	473	   |1             |
| 000000002  |  3.22 |	482	   |1             |
| 000000003  |  2.96 |	596	   |1             |
| 000000004  |  3.28 |	523	   |1             | 
| 000000005  |  2.72 |	399	   |0             |
| 000000006  |  2.85 |	381	   |0             |
| 000000007  |  2.51 |	458	   |0             | 
| 000000008  |  2.36 |	399	   |0             |

<br>
<br>
We have two additional instances that will be used to test the classifier:
<br>
<br>

|  ID        |  GPA  |  GMAT   |   ADMITTED_IND   |
|:----------:|:-----:|:-------:|:------------:|
| 000000009  |  2.90 |  384    |0             |
| 000000010  |  3.40 |  431    |1             |
<br>
<br>
For each feature, we calculate the mean and variance for admitted and not-admitted:  
<br> 

|  ADMITTED_IND  |$\mu_{GPA}$ |$\sigma^{2}_{GPA}$|$\mu_{GMAT}$|$\sigma^{2}_{GMAT}$|
|:----------:|:----------:|:----------------:|:----------:|:-----------------:|
|   1/yes    |  3.150     |  0.0193          |  518.50    |  3143.00          |        
|   0/no     |  2.610     |  0.0474          |  409.25    |  1128.25          |
<br>  

In the sample dataset, we have equiprobable priors (since $P(admit) = P(!admit) = .5$). However, the prior probabilities need not be derived from the dataset of interest. They can be based on external data sources (such as admissions from prior years). 

<br> 
Recall the general form of Bayes' Theorem:

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
<br> 
<br>

The posterior probability for admitted is given by:   
<br>    
$$
P(admit|data) = \frac {P(admit)P(GPA|admit)P(GMAT|admit)}{P(data)},
$$
<br>
<br> 

and for not-admitted:    
<br>
$$
P(!admit|data) = \frac {P(!admit)P(GPA|!admit)P(GMAT|!admit)}{P(data)},
$$
<br>  
Where: 
   
*  $P(admit)$ and $P(!admit)$ are the prior probabilities, each $.50$ in this example.      
<br>     
*  $P(GPA|admit)P(GMAT|admit)$ represents the likelihood. By the first assumption of Naive Bayes, $GPA$ and $GMAT$ are treated as independent given the class, so the joint likelihood is the product of the per-feature likelihoods.     
<br>     
*  *data* is a stand-in for $GMAT$ and $GPA$ for a given instance.     
<br>     

The second assumption of Naive Bayes is that, within each class, every explanatory variable follows a normal distribution. Thus, $P(GMAT|admit)$ and $P(GPA|admit)$ are calculated by passing the observation's $GMAT$ score and $GPA$, respectively, into the associated normal density function, parameterized by the corresponding estimates of mean and variance determined above.  
<br>    
For the admitted class:


$$
\begin{align*}
P(GMAT|admit) &= \frac {1} {\sqrt{2 \pi \sigma^{2}_{GMAT|admit}}} \exp\Big({-\frac {(GMAT - \mu_{GMAT|admit})^{2}}{2\sigma^{2}_{GMAT|admit}}}\Big)\\ \\
P(GPA|admit) &= \frac {1} {\sqrt{2 \pi \sigma^{2}_{GPA|admit}}} \exp\Big({-\frac {(GPA - \mu_{GPA|admit})^{2}}{2\sigma^{2}_{GPA|admit}}}\Big)
\end{align*}
$$
<br>       

For not-admitted:

$$
\begin{align*}
P(GMAT|!admit) &= \frac {1} {\sqrt{2 \pi \sigma^{2}_{GMAT|!admit}}} \exp\Big({-\frac {(GMAT - \mu_{GMAT|!admit})^{2}}{2\sigma^{2}_{GMAT|!admit}}}\Big) \\ \\
P(GPA|!admit) &= \frac {1} {\sqrt{2 \pi \sigma^{2}_{GPA|!admit}}} \exp\Big({-\frac {(GPA - \mu_{GPA|!admit})^{2}}{2\sigma^{2}_{GPA|!admit}}}\Big)
\end{align*}
$$
<br>  
<br>  

### Classifying Instances

Recall our test observations:
<br>

|  ID        |  GPA  |  GMAT   |   ADMITTED_IND   |
|:----------:|:-----:|:-------:|:------------:|
| 000000009  |  2.90 |  384    |0             |
| 000000010  |  3.40 |  431    |1             |
<br>
<br>
We calculate the admitted and not-admitted posterior for each instance. The observation is classified as admitted/1 or not-admitted/0 based on the class with the greatest posterior probability.   
<br>       

### For **ID=000000009**:  
<br>  

GMAT calculation for admitted:

$$
\begin{align*}
P(GMAT|admit) &= \frac{1}{\sqrt{2 \pi \sigma^{2}_{GMAT|admit}}} \exp\Big({-\frac{(GMAT - \mu_{GMAT|admit})^{2}}{2\sigma^{2}_{GMAT|admit}}}\Big) \\ \\
&= \frac{1}{\sqrt{2 \pi (3143)}} exp\Big({-\frac {(384 - 518.50)^{2}}{2(3143)}}\Big) \\ \\
&=\mathbf{.0004}
\end{align*}
$$

GPA calculation for admitted:

$$
\begin{align*}
P(GPA|admit) &= \frac{1}{\sqrt{2 \pi \sigma^{2}_{GPA|admit}}} \exp\Big({-\frac{(GPA - \mu_{GPA|admit})^{2}}{2\sigma^{2}_{GPA|admit}}}\Big) \\ \\
&= \frac{1}{\sqrt{2 \pi (0.0193)}} exp\Big({-\frac {(2.90 - 3.15)^{2}}{2(0.0193)}}\Big) \\ \\
&=\mathbf{0.568767}
\end{align*}
$$

<br>
<br>

GMAT calculation for not-admitted:

$$
\begin{align*}
P(GMAT|!admit) &= \frac{1}{\sqrt{2 \pi \sigma^{2}_{GMAT|!admit}}} \exp\Big({-\frac{(GMAT - \mu_{GMAT|!admit})^{2}}{2\sigma^{2}_{GMAT|!admit}}}\Big) \\ \\
&= \frac{1}{\sqrt{2 \pi (1128.25)}} exp\Big({-\frac {(384 - 409.25)^{2}}{2(1128.25)}}\Big) \\ \\
&=\mathbf{0.00895364}
\end{align*}
$$

GPA calculation for not-admitted:

$$
\begin{align*}
P(GPA|!admit) &= \frac{1}{\sqrt{2 \pi \sigma^{2}_{GPA|!admit}}} \exp\Big({-\frac{(GPA - \mu_{GPA|!admit})^{2}}{2\sigma^{2}_{GPA|!admit}}}\Big) \\ \\
&= \frac{1}{\sqrt{2 \pi (0.0474)}} exp\Big({-\frac {(2.90 - 2.610)^{2}}{2(0.0474)}}\Big) \\ \\
&=\mathbf{0.7546488}
\end{align*}
$$
<br>   



Then, plugging in values into the posterior expression, class probabilities for **ID=000000009** are given by:   
<br>   

$$
\begin{align*}
P(admit|data) &= \frac{P(admit)P(GPA|admit)P(GMAT|admit)}{P(data)} \\ \\
&= \frac {(.5)*(0.568767)*(.0004)}{(0.568767)*(.0004)*(.5) + (0.7546488)* (0.00895364)*(.5)} \\ \\
&= \mathbf{0.03266} \\ \\
P(!admit|data) &= \frac{P(!admit)P(GPA|!admit)P(GMAT|!admit)}{P(data)} \\ \\
&= \frac {(.5)*(0.7546488)*(0.00895364)}{(0.568767)*(.0004)*(.5) + (0.7546488)* (0.00895364)*(.5)} \\ \\
&= \mathbf{0.967340} \\ \\
\end{align*}
$$
<br> 
<br> 
Thus, an individual with $GPA=2.90$ and $GMAT=384$ would almost certainly not be admitted according to the Gaussian Naive Bayes classifier.
<br>   
<br>  
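
The hand calculation can be reproduced with a few lines of NumPy. This is a sketch of the procedure described in this section, not scikit-learn's implementation; the helper names `normal_pdf` and `posterior` are made up for illustration, and the variance uses `ddof=1` to match the table of class means and variances.

```python
import numpy as np

# training data from the admissions table: columns are GPA, GMAT =>
X = np.array([[3.14, 473], [3.22, 482], [2.96, 596], [3.28, 523],
              [2.72, 399], [2.85, 381], [2.51, 458], [2.36, 399]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

def normal_pdf(x, mu, var):
    """Normal density evaluated elementwise."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior(x_new, priors=(.5, .5)):
    """Return (P(!admit|data), P(admit|data)) for a single instance."""
    joint = []
    for cls in (0, 1):
        mu  = X[y == cls].mean(axis=0)
        var = X[y == cls].var(axis=0, ddof=1)   # sample variance, as in the table
        # naive assumption: per-feature likelihoods multiply =>
        joint.append(priors[cls] * np.prod(normal_pdf(x_new, mu, var)))
    joint = np.array(joint)
    return joint / joint.sum()                  # divide by P(data)

post_09 = posterior(np.array([2.90, 384]))      # ID=000000009
post_10 = posterior(np.array([3.40, 431]))      # ID=000000010
print(post_09, post_10)
```

For **ID=000000009** this yields roughly 0.967 for not-admitted and 0.033 for admitted, in line with the figures above.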



## k-Nearest Neighbors

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice.  
<br>    
Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
<br>
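
The majority-vote rule is simple enough to write out directly. The following is a bare-bones sketch with Euclidean distance on made-up data; `knn_predict`, `X_toy` and `y_toy` are hypothetical names, and scikit-learn's `KNeighborsClassifier` is considerably more efficient in practice.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify `x_new` by majority vote among its k nearest training points."""
    dists   = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(dists)[:k]                          # indices of k closest
    votes   = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data: two well-separated clusters =>
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))  # neighbors all class 0
pred_b = knn_predict(X_toy, y_toy, np.array([4.9, 5.0]))  # neighbors all class 1
print(pred_a, pred_b)
```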


To determine the optimal value for `n_neighbors`, the kNN hyperparameter, use the 
`model_selection.GridSearchCV` class:  
<br>
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = [{
        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]
        }]

knn_clf = KNeighborsClassifier()

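# `scoring='accuracy'` is appropriate for classification;
# 'neg_mean_squared_error' is a regression metric =>
grid_search = GridSearchCV(
                    knn_clf, param_grid, cv=5, scoring='accuracy'
                    )
                    
# run the grid search on the training data; the best `n_neighbors` value
# found is available afterwards as `grid_search.best_params_` =>
grid_search.fit(X_train, y_train)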
```

 
#### [k-Nearest Neighbors]

<br>
<a href="http://www.youtube.com/watch?feature=player_embedded&v=ZWygMcenuWM
" target="_blank"><img src="http://img.youtube.com/vi/ZWygMcenuWM/0.jpg" 
alt="kNN" width="350" height="275" border="0" /></a>
<br>


## Logistic Regression


The `linear_model.LogisticRegression` class implements regularized Logistic Regression. Unless the regularization and intercept scaling parameters are adjusted to offset the penalty, coefficient estimates generated by sklearn's `linear_model.LogisticRegression` *will not* align with estimates generated by `glm` in R. 
<br>
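
In scikit-learn the penalty is controlled through `C`, the inverse of regularization strength. As a rough illustration on synthetic data (the dataset and variable names are assumptions made for this sketch), a small `C` shrinks the coefficients while a very large `C` leaves them nearly unpenalized:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data, used only for illustration =>
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# small C => strong penalty; large C => penalty is nearly switched off =>
lr_strong = LogisticRegression(C=0.01).fit(X_demo, y_demo)
lr_weak   = LogisticRegression(C=1e6, max_iter=1000).fit(X_demo, y_demo)

# the heavily penalized fit has the smaller coefficient norm =>
norm_strong = np.linalg.norm(lr_strong.coef_)
norm_weak   = np.linalg.norm(lr_weak.coef_)
print(norm_strong, norm_weak)
```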

The `LogisticRegression` classifier cannot directly handle categorical explanatory variables with more than 2 classes. If a categorical explanatory variable has more than 2 classes, it must be transformed using the 'one-vs-all' or 'one-hot' encoding schemes.     


In [None]:
# ================================================================================
# Logistic Regression parameter estimation demonstration, with and without       |
# Regularization and intercept scaling.                                          | 
# ================================================================================
import numpy as np
import pandas as pd

# prepare Challenger data =>
dfc = pd.read_table(
        "S:\\public\\Actuarial\\DSSG\\20170721_Materials\\Challenger.csv",
        sep=",")

# bind `TEMPERATURE` and response (`O_RING_FAILURE`) =>
X = dfc.iloc[:,[1]].values
y = dfc['O_RING_FAILURE'].values

# ================================================================================
# Model I: No dampening of regularization parameter (`C`) or intercept scaling.  |
#          The coefficients estimated by Model I will not match those estimated  | 
#          by R's glm function.                                                  |            
# ================================================================================
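from sklearn.linear_model import LogisticRegression
# NOTE: older scikit-learn releases defaulted to `solver='liblinear'`; it is
#       set explicitly so both models in this comparison use the same solver =>
lr_init = LogisticRegression(solver='liblinear').fit(X, y)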

# get coefficients estimated from training data =>
coeffs_init = (lr_init.intercept_[0], lr_init.coef_[0][0])

coeffs_init

# (0.52055517729376832, -0.021002151714167645)

In [None]:
# ================================================================================
# Model II: Dampen regularization by setting C to a large positive value: `C`    |
#           represents the  inverse of regularization strength; smaller values   |
#           of C specify stronger regularization. Model II coefficient           |
#           estimates should match with R's glm output when family = `binomial`  |
#           and link = `logit`.                                                  |
# ================================================================================
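# NOTE: `intercept_scaling` only has an effect with `solver='liblinear'`,
#       the default in older scikit-learn releases =>
lr_mod = LogisticRegression(C=1e10, intercept_scaling=200, solver='liblinear').fit(X, y)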

# get coefficients estimated from training data =>
coeffs_mod = (lr_mod.intercept_[0], lr_mod.coef_[0][0])

coeffs_mod

# sklearn coeffs (no regularization): (15.042890754064651, -0.2321625838604314)
# R glm coeffs   (binomial, logit)  : (15.0429016  ,       -0.2321627 )

## Support Vector Machines

A Support Vector Machine model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

The `svm` module in sklearn is a wrapper around the C libraries `LIBSVM` and `LIBLINEAR`, popular utilities for support vector machines and large linear classification.

<br>

In sklearn, the linear support vector classifier is implemented separately from the 
general purpose SVM classifier:

```python
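# sample usage of LinearSVC classifier =>
from sklearn.svm import LinearSVC, SVC

linear_svm_clf = LinearSVC(C=1.0).fit(X, y)

# LinearSVC exposes the fitted hyperplane (it does not store support vectors) =>
linear_svm_clf.coef_, linear_svm_clf.intercept_

# support vectors are available from SVC with a linear kernel =>
svc_clf = SVC(kernel='linear', C=1.0).fit(X, y)
svc_clf.support_vectors_

# e.g., for the two-point dataset X = [[0, 0], [1, 1]], y = [0, 1]:
# array([[ 0.,  0.],
#        [ 1.,  1.]])

# get number of support vectors for each class
svc_clf.n_support_ 

# array([1, 1]...)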
```

For datasets that are not linearly separable, or when dealing with a non-linear decision function, use the general `svm.SVC` class.

Note that `svm.SVC(kernel='linear')` is essentially the same classifier as `svm.LinearSVC` above, but `svm.LinearSVC` has superior performance characteristics, since it's optimized to solve a single type of optimization problem. 

The most common kernel used with non-linear SVMs is the 'Radial Basis Function' or 'rbf' kernel. 
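
To see why the kernel matters, the sketch below fits a linear and an RBF classifier to two concentric rings generated with `make_circles`. The dataset and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC, LinearSVC

# two concentric rings: not separable by any straight line =>
X_ring, y_ring = make_circles(n_samples=200, factor=.3, noise=.05, random_state=0)

linear_clf = LinearSVC(C=1.0).fit(X_ring, y_ring)
rbf_clf    = SVC(kernel='rbf', C=1.0).fit(X_ring, y_ring)

# training accuracy of each classifier =>
linear_acc = linear_clf.score(X_ring, y_ring)
rbf_acc    = rbf_clf.score(X_ring, y_ring)
print(linear_acc, rbf_acc)
```

On data of this shape the RBF kernel should separate the rings almost perfectly, while the linear classifier cannot do much better than chance.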

Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors.

Here's a comparison of estimator throughput per unit time:   
<br>

<p align="center">
  ![Throughput](latency.png)
</p>


## Implementing Classifiers 

The sample dataset (`mw.data`) contains ~150 records, with each record classified as either `male` (1) or `female` (2). The fields in the dataset are:

* gender (response)
* height (mm)
* hand length (mm)  
* forearm length (mm) 

Prior to fitting the classifiers, we need to preprocess our features. 


In [None]:
# ===================================
# Setup and Preprocessing           |
# ===================================
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
plt.rcParams["figure.figsize"] = (12, 9)


# read in dataset =>
fpath = "S:\\public\\Actuarial\\DSSG\\20170721_Materials\\mw.data"
hdrs  = ["ID", "GENDER", "HEIGHT", "HAND_LENGTH", "FOREARM_LENGTH"]
df    = pd.read_table(fpath, sep=r"\s+", names=hdrs)


# split explanatory variables from response, and convert
# 1/2 response to 0/1 =>
X = df.drop(['GENDER','ID'], axis=1)
y = df['GENDER'].map(lambda x: 0 if x==2 else x).values


# split data into training and test sets...
# NOTE: `train_test_split` lives in `sklearn.model_selection`; the older
#       `sklearn.cross_validation` module has been removed =>
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                X, y, test_size=.33, random_state=16)


# scale explanatory variables =>
from sklearn.preprocessing import StandardScaler
sclr = StandardScaler()
X_train = sclr.fit_transform(X_train)
X_test  = sclr.transform(X_test)


# take a look at preprocessed data =>
X_train[:5]


In [None]:
#==============================================
# Gaussian Naive Bayes Classifier             |
#==============================================
from sklearn.naive_bayes import GaussianNB

nb_clf = GaussianNB().fit(X_train, y_train)

# get estimated class predictions and probabilities =>
nb_y_hat = nb_clf.predict(X_test)
nb_p_hat = nb_clf.predict_proba(X_test)[:,1]

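# generate classification report; `target_names` follows sorted label order,
# and the response was recoded above as 0=female, 1=male =>
from sklearn.metrics import classification_report
print(classification_report(y_test, nb_y_hat, target_names=['Female', 'Male']))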


In [None]:
#=============================================
# kNN Classifier                             |
#=============================================
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# get estimated class predictions and probabilities =>
knn_y_hat = knn_clf.predict(X_test)
knn_p_hat = knn_clf.predict_proba(X_test)[:,1]

# generate classification report =>
from sklearn.metrics import classification_report
print(classification_report(y_test, knn_y_hat, target_names=['Female', 'Male']))


In [None]:
#=============================================
# Logistic Regression Classifier             |
#=============================================
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(C=1).fit(X_train, y_train)

# pass test set to classifier to evaluate model fit =>
lr_y_hat = lr_clf.predict(X_test)
lr_p_hat = lr_clf.predict_proba(X_test)[:,1]

# generate classification report =>
from sklearn.metrics import classification_report
print(classification_report(y_test, lr_y_hat, target_names=['Female', 'Male']))


In [None]:
#=============================================
# SVM Classifier (default RBF kernel)        |
#=============================================
from sklearn.svm import SVC

svm_clf = SVC(C=1.0, probability=True).fit(X_train, y_train)

# pass test set to classifier to evaluate model fit =>
svm_y_hat = svm_clf.predict(X_test)
svm_p_hat = svm_clf.predict_proba(X_test)[:,1]

# generate classification report =>
from sklearn.metrics import classification_report
print(classification_report(y_test, svm_y_hat, target_names=['Female', 'Male']))


In [None]:
#=============================================
# Random Forest Classifier                   |
#=============================================
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier().fit(X_train, y_train)

# pass test set to classifier to evaluate model fit =>
rf_y_hat = rf_clf.predict(X_test)
rf_p_hat = rf_clf.predict_proba(X_test)[:,1]

# generate classification report =>
from sklearn.metrics import classification_report
print(classification_report(y_test, rf_y_hat, target_names=['Female', 'Male']))


In [None]:
#=============================================
# AdaBoost Classifier                        |
#=============================================
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier().fit(X_train, y_train)

# pass test set to classifier to evaluate model fit =>
ab_y_hat = ab_clf.predict(X_test)
ab_p_hat = ab_clf.predict_proba(X_test)[:,1]

# generate classification report =>
from sklearn.metrics import classification_report
print(classification_report(y_test, ab_y_hat, target_names=['Female', 'Male']))


### Voting Classifier

A very simple way to create a superior classifier is to aggregate the predictions of
many classifiers and predict the class that gets the most votes. This majority-vote classifier is called a 'hard voting classifier'.

If all classifiers are able to estimate class probabilities (i.e., they have a 
`predict_proba()` method), then you can tell the voting classifier to predict the class with the highest class probability, averaged over all the individual classifiers. This is called 'soft voting'. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace `voting='hard'` with `voting='soft'` and ensure that all classifiers can estimate class probabilities.
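
A small numeric sketch shows how the two schemes can disagree. The three probabilities below are hypothetical outputs of three classifiers for a single instance:

```python
import numpy as np

# hypothetical P(class=1) from three classifiers for a single instance =>
p_class1 = np.array([0.45, 0.45, 0.95])

# hard voting: each classifier casts one vote for its most likely class =>
votes     = (p_class1 > 0.5).astype(int)          # [0, 0, 1]
hard_pred = int(votes.sum() > len(votes) / 2)     # majority says class 0

# soft voting: average the probabilities, then pick the likelier class =>
mean_p    = p_class1.mean()                       # about 0.617
soft_pred = int(mean_p > 0.5)                     # average favors class 1

print(hard_pred, soft_pred)
```

Two lukewarm votes for class 0 win under hard voting, while the single highly confident vote for class 1 dominates the average under soft voting.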


In [None]:
#=============================================
# Voting Classifier (Ensemble Learner)       |
#=============================================
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# instantiate models known to provide a `predict_proba` method =>
rf_clf  = RandomForestClassifier().fit(X_train, y_train)
lr_clf  = LogisticRegression().fit(X_train, y_train)
nb_clf  = GaussianNB().fit(X_train, y_train)
knn_clf = KNeighborsClassifier().fit(X_train, y_train)

# initialize voting classifier, providing an abbr. for each model =>
voting_clf = VotingClassifier(
            estimators=[('nb', nb_clf),('lr', lr_clf),('rf', rf_clf),('knn', knn_clf)],
            voting='soft'
            )

# If voting=‘soft’, predicts the class label based on the argmax 
# of the sums of the predicted probabilities, which is 
# recommended for an ensemble of well-calibrated classifiers.

# train voting classifier =>
voting_clf.fit(X_train, y_train)

# print accuracy score for all classifiers =>
from sklearn.metrics import accuracy_score

for clf in (lr_clf, rf_clf, knn_clf, nb_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))



## Classifier Assessment

**Precision** is a measure of the accuracy of positive predictions.   
**Recall** is a measure of the number of positive instances detected by the classifier.   

* *True/False* refers to whether the model's prediction is correct or incorrect.
* *Positive/Negative* refers to whether the model predicted the positive or negative class. 
* *P* is the number of positive instances in the actual dataset.
* *N* is the number of negative instances in the actual dataset.

It immediately follows that:

* **True Positive** (TP) - The model of interest correctly predicts the positive class.
* **True Negative** (TN) - The model of interest correctly predicts the negative class.
* **False Positive** (FP) - The model of interest incorrectly predicts the positive class (Type I error).
* **False Negative** (FN) - The model of interest incorrectly predicts the negative class (Type II error).

*Accuracy* is defined as:

$$
\begin{aligned}
ACC = \frac {TP+TN}{P+N} = \frac {TP+TN}{TP+TN+FP+FN}
\end{aligned}
$$
    
<br>

*Precision* is the fraction of positive predictions that are correct:

$$       
\begin{aligned}
Precision = \frac{TP}{TP+FP}
\end{aligned}
$$
 
<br>

   
*Recall* (True Positive Rate) is the fraction of all positive instances the classifier correctly predicts as positive: 

$$  
\begin{aligned}
Recall = TPR = \frac{TP}{TP+FN}
\end{aligned}
$$

<br>

*False Positive Rate* (Type-I error) is the fraction of all negative instances the classifier incorrectly identifies as positive:

$$
\begin{aligned}
FPR = \frac{FP}{TN+FP}
\end{aligned}
$$

<br>

Precision and recall are used in the $F_{1}$ score, which is defined as the harmonic mean of precision and recall:

$$   
\begin{aligned}
F_{1} = 2 \frac{Precision * Recall}{Precision + Recall}
\end{aligned}
$$

<br>
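
The formulas above can be checked against scikit-learn's own metric functions. The labels and predictions below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# hypothetical labels and predictions for ten instances =>
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# for binary labels, ravel() yields counts in the order TN, FP, FN, TP =>
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
fpr       = fp / (tn + fp)
f1        = 2 * precision * recall / (precision + recall)

# the hand-computed values agree with the library functions =>
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```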

### Confusion Matrix
The *confusion matrix* can be used to evaluate the accuracy of a classification. From the scikit-learn [docs](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):

*By definition a confusion matrix $C$ is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.

$$
\begin{aligned}
C_{0, 0} &= \text{True Negatives} \\
C_{1, 0} &= \text{False Negatives} \\
C_{1, 1} &= \text{True Positives} \\
C_{0, 1} &= \text{False Positives}
\end{aligned}
$$


Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.*

  
### ROC Curve
The [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (Receiver Operating Characteristic) is a plot often used in assessing the quality of a binary classifier as the discrimination threshold is varied. The ROC curve uses $TPR$ and $FPR$. Convention dictates plotting $TPR$ as a function of $FPR$, with $TPR$ on the y-axis and $FPR$ along the x-axis. 

The area under the ROC curve (AUC) is used to assess model quality: the higher the AUC, the better the model separates the two classes. A high AUC means the classifier achieves a high true positive rate (recall) while keeping the false positive rate low across the range of thresholds. An AUC of 0.5 corresponds to the diagonal (random guessing), and an AUC of 1 to a perfect classifier.
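
To show how varying the threshold traces out the curve, the sketch below computes $(FPR, TPR)$ pairs by hand for four made-up scores and compares the resulting area with scikit-learn's result. The helper `rates_at` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# hypothetical labels and predicted scores =>
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (FPR, TPR) when scores >= threshold are labeled positive."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return fp / np.sum(y_true == 0), tp / np.sum(y_true == 1)

# lowering the threshold moves along the curve from (0, 0) toward (1, 1) =>
for t in (0.8, 0.4, 0.35, 0.1):
    print(t, rates_at(t))

# roc_curve performs the same sweep; auc integrates TPR over FPR =>
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_area = auc(fpr, tpr)
print(roc_area, roc_auc_score(y_true, scores))
```

Both area calculations give 0.75 for this toy example.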



In [None]:
# ====================================
# Display Confusion Matrix           |
# ====================================
from sklearn.metrics import confusion_matrix

actual_response    = y_test
predicted_response = knn_y_hat

cm = confusion_matrix(actual_response, predicted_response)

sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel('Predicted Value')
plt.ylabel('Actual Value')
plt.show()



In [None]:
# ==========================================
#  Plot ROC curve for each classifier      |
# ==========================================
from sklearn.metrics import roc_curve

fpr1, tpr1, thresholds1 = roc_curve(y_test, rf_p_hat, pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y_test, lr_p_hat, pos_label=1)
fpr3, tpr3, thresholds3 = roc_curve(y_test, nb_p_hat, pos_label=1)
fpr4, tpr4, thresholds4 = roc_curve(y_test, knn_p_hat, pos_label=1)
fpr5, tpr5, thresholds5 = roc_curve(y_test, ab_p_hat, pos_label=1)

plt.plot(fpr1, tpr1, linewidth=2, label='Random Forest')
plt.plot(fpr2, tpr2, linewidth=2, label='Logistic')
plt.plot(fpr3, tpr3, linewidth=2, label='Naive Bayes')
plt.plot(fpr4, tpr4, linewidth=2, label='kNN')
plt.plot(fpr5, tpr5, linewidth=2, label='Ada Boost')
plt.plot([0,1], [0,1], '--')
plt.axis([-.05, 1.05, -.05, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.grid(True)
plt.legend(loc='lower right',prop={'size':20}, frameon=True)
plt.show()


In [None]:
# =============================================
# print AUC Score for kNN classifier          |
# =============================================
from sklearn.metrics import roc_auc_score

print("kNN classifier AUC Score: {}".format(roc_auc_score(y_test, knn_p_hat)))
