<h1> Introduction to Supervised Learning through Scikit-Learn</h1>
<h2>What is scikit-learn?</h2>
<p style="font-size: 16px">Scikit-learn is a machine learning library for Python built on packages you have recently been introduced to, such as numpy, scipy and matplotlib. For more information, visit the <a href='http://scikit-learn.org/stable/index.html#'>scikit-learn homepage</a>.<br>

<br>The library contains functions in the following machine learning categories:
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/supervised_learning.html#supervised-learning'> Classification</a> </li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/supervised_learning.html#supervised-learning'> Regression </a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/clustering.html#clustering'> Clustering </a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/decomposition.html#decompositions'>Dimensionality Reduction</a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/model_selection.html#model-selection'>Model Selection</a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing'>Preprocessing </a></li>

<p style="font-size: 16px">Scikit-learn should be installed along with your Anaconda3 installation. However, if this is not the case, follow the installation instructions provided by scikit-learn <a href='http://scikit-learn.org/stable/install.html'>here</a><br>

<br>Next, let's import some of the packages we will use and see which versions you are running!</p>

In [1]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
import pandas as pd
print('The pandas version is {}'.format(pd.__version__))
import matplotlib
print('The matplotlib version is {}'.format(matplotlib.__version__))
import numpy as np
print('The numpy version is {}'.format(np.__version__))

The scikit-learn version is 0.19.1.
The pandas version is 0.23.0
The matplotlib version is 2.2.2
The numpy version is 1.14.3


<h2>What is supervised learning?</h2>
<p style="font-size: 16px">Supervised learning refers to algorithms that learn patterns from a labeled training subset of the data and generalize them to the set of all possible inputs. Examples of supervised learning techniques:</p> 
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression'>logistic regression</a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/svm.html'>support vector machines</a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/tree.html'>decision trees</a></li>
<li style="font-size: 16px"><a href='http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html'>random forest</a></li>


<p style="font-size: 16px">We will focus on classification tasks in this notebook. Classification is the prediction of discrete variables, e.g. YES/NO. It can be regarded as the problem of finding a function $h(x): \mathbb{R}^d\to\mathbb{K}$ that maps an input space in $\mathbb{R}^d$ onto a discrete set of $k$ target outputs or classes $\mathbb{K}=\{1,...,k\}$.
In contrast, regression problems involve the prediction of continuous variables. </p>

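<p style="font-size: 16px">Input data passed to sklearn objects is structured as a numpy array of shape [n_samples, n_features], with one row per sample and one column per feature. The labels form a vector with one entry per sample.</p>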
$$\text{feature matrix: } \mathbf{X} = 
\begin{pmatrix} 
x_{11} & x_{12} & \cdots & x_{1d} \\ 
x_{21} & x_{22} & \cdots & x_{2d} \\
x_{31} & x_{32} & \cdots & x_{3d} \\
\vdots & \vdots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd} 
\end{pmatrix}
$$
$$ 
\text{label vector: }
\mathbf{y^T} = [y_1, y_2, y_3,\cdots, y_n]
$$
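
<p style="font-size: 16px">As a minimal sketch of this layout, the toy arrays below (hypothetical values, not taken from the Lending Club data) hold four samples with three features each, plus one label per sample.</p>

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) described by 3 features (columns)
X_toy = np.array([[5.1, 3.5, 1.4],
                  [4.9, 3.0, 1.4],
                  [6.2, 2.9, 4.3],
                  [5.9, 3.0, 5.1]])
# One label per sample, in the same order as the rows of X_toy
y_toy = np.array([1, 1, -1, -1])

n_samples, n_features = X_toy.shape
print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```

<p style="font-size: 16px">The number of rows of the feature matrix must match the length of the label vector, otherwise sklearn's fit methods raise an error.</p>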

<h2>Case 1: Lending Club</h2>
<p style="font-size: 16px">This dataset is provided by Lending Club, a peer-to-peer lending company that acts as a hub connecting borrowers and investors: loans are funded by other people rather than by the company itself. The company assesses the risk of clients applying for a loan of a certain amount and whether it will be fully funded. The task is to predict unsuccessful accepted loans, defined as loans where the funded amount (funded_amnt) or the amount funded by investors (funded_amnt_inv) falls short of the requested loan amount (loan_amnt) by 5% or more.</p><br> $$\text{binary classification task of:  }\frac{loan-funded}{loan}\geq0.05$$ 

<br><a href='https://www.lendingclub.com/info/download-data.action'>Download data here for years 2007-2011</a></p>



<table>
    <tr>
        <th>Column</th>
        <th>Description</th>
    </tr>
   <tr>
        <td>annual_inc</td>
        <td>The annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
        <td>delinq_2yrs</td>
        <td> The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
</td>
    </tr>
    <tr>
        <td>dti</td>
        <td> A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
</td>
    </tr>
    <tr>
        <td>earliest_cr_line</td>
        <td> The month the borrower's earliest reported credit line was opened
</td>
    </tr>
    <tr>
        <td>emp_length</td>
        <td> Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
</td>
    </tr>
    <tr>
        <td>home_ownership</td>
        <td> The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
</td>
    </tr>
    <tr>
        <td>installment</td>
        <td> The monthly payment owed by the borrower if the loan originates.
</td>
    </tr>
    <tr>
        <td>int_rate</td>
        <td> Interest Rate on the loan
</td>
    </tr>
    <tr>
        <td>is_inc_v</td>
        <td> Indicates if income was verified by LC, not verified, or if the income source was verified
</td>
    </tr>
    <tr>
        <td>last_fico_range_high</td>
        <td> The last upper boundary of range the borrower’s FICO belongs to pulled.
</td>
    </tr>
    <tr>
        <td>last_fico_range_low</td>
        <td> The last lower boundary of range the borrower’s FICO belongs to pulled.
</td>
    </tr>
    <tr>
        <td>fico_range_high</td>
        <td> The upper boundary of range the borrower’s FICO belongs to.
</td>
    </tr>
    <tr>
        <td>fico_range_low</td>
        <td> The lower boundary of range the borrower’s FICO belongs to.
</td>
    </tr>
    <tr>
        <td>mths_since_last_delinq</td>
        <td> The number of months since the borrower's last delinquency.
</td>
    </tr>
    <tr>
        <td>mths_since_last_major_derog</td>
        <td> Months since most recent 90-day or worse rating
</td>
    </tr>
     <tr>
        <td>open_acc</td>
        <td> The number of open credit lines in the borrower's credit file.
</td>
    </tr>
    <tr>
        <td>term</td>
        <td> The number of payments on the loan. Values are in months and can be either 36 or 60.
</td>
    </tr>
    <tr>
        <td>total_acc</td>
        <td> The total number of credit lines currently in the borrower's credit file
</td>
    </tr>
    <tr>
        <td>loan_amnt</td>
        <td> The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>


</table>

In [2]:
keep_cols = ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'emp_length', 'home_ownership', 'annual_inc', 'dti', 'delinq_2yrs', 'mths_since_last_delinq', 'total_acc']
df_lend = pd.read_csv('files_ch05/LoanStats3a.csv', skiprows=1, low_memory=False)
df_lend = df_lend[keep_cols]
print(df_lend.shape)
print(df_lend.columns.tolist())
df_lend.head()

IOError: File files_ch05/LoanStats3a.csv does not exist

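<p style="font-size: 16px">We will encode our binary target variable as the relative shortfall between the requested and the funded amount, marking a loan as not fully covered when the shortfall is at least 5%, and visualize the results: $$\frac{loan-funded}{loan}\geq0.05$$  </p>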

In [None]:
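# Requested amount and the amount actually funded by investors
loan = df_lend['loan_amnt'].values
funded = df_lend['funded_amnt_inv'].values
# Relative shortfall between the requested and the funded amount
targets = np.abs(loan-funded)/loan

df_lend['targets'] = targets
# -1 marks a loan that was not fully covered, 1 a fully covered loan
y = [-1 if t >= .05 else 1 for t in targets]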
df_lend['failed_loan'] = y
df_lend['failed_loan'].value_counts()


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 
labels = 'Fully Covered', 'Not Fully Covered'
sizes = df_lend['failed_loan'].value_counts()
sizes = [sizes[1], sizes[-1] ]
colors = ['gold', 'lightcoral']
 
# Plot
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
 
plot_ = plt.axis('equal')

<p style="font-size: 16px">Note that there is a significant disproportion between positive and negative labels, making the dataset unbalanced. This can have drastic consequences for a classifier, which can reach a high accuracy simply by favoring the majority class.</p>

In [None]:
loan = df_lend['loan_amnt'].values
funded = df_lend['funded_amnt_inv'].values
targets = np.abs(loan-funded)/loan

df_lend['targets'] = targets
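# Keep only the records where the target could be computed
wrk_records = np.where(~np.isnan(targets))
# True where the relative shortfall is at least 5%
y = targets[wrk_records]>=0.05
plt.hist(targets[wrk_records],bins=30)

print('Larger deviation: {}'.format(np.sum(y)))
print('Smaller deviation: {}'.format(np.sum(~y)))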

<p style="font-size: 16px">Now for a quick introduction to the functionality of sklearn. We will begin with the k-nearest neighbor classification algorithm. For more details, see pages 462-470 in Python Data Science Handbook (VanderPlas, Jake).</p>
<h2>What is K-Nearest Neighbor?</h2>
<p style="font-size: 16px">This algorithm, when used for classification, uses the neighboring values of a particular input value to make a prediction. The input is assigned to the class most common among its $k$ nearest neighbors, where $k\in\mathbb{N}$ is a positive integer. The parameter $k$ is a hyperparameter: a configuration that is set before the model is fit. It is not estimated from the data, but it can be tuned to improve the fit of the model to the data. The distance between the input value and other values is commonly measured by Euclidean distance for continuous variables or Hamming distance for discrete variables. We will not go into too much detail here, but we will use this algorithm as a starting example on sklearn. For more information on sklearn's package, see the <a href='http://scikit-learn.org/stable/modules/neighbors.html#neighbors'>sklearn.neighbors documentation.</a>
<br><br>
One thing to note is that this algorithm is computationally heavy at prediction time. This is because it must hold all training values in memory in order to find the closest $k$ values to the input value.</p>
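
<p style="font-size: 16px">The voting rule described above can be sketched in a few lines of numpy. This is an illustrative brute-force version under simple assumptions (Euclidean distance, majority vote), not sklearn's implementation; the function name knn_predict and the toy data are hypothetical.</p>

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of three points each
X_knn = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_knn = np.array([1, 1, 1, -1, -1, -1])

pred_a = knn_predict(X_knn, y_knn, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_knn, y_knn, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

<p style="font-size: 16px">Every prediction computes a distance to every stored training sample, which is why the cost grows with the size of the training set.</p>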

<h3>Training a Model with sklearn 5.3</h3>
<p style="font-size: 16px">To fit the model, we will use sklearn's object-oriented interface. First we create an object, which we name 'model'. We can then use the model.fit method to set the state of the object based on the training data. The data passed to the method must be a two-dimensional numpy array $\mathbf{X}$ of shape (n_samples, n_predictors) holding the feature matrix and a one-dimensional numpy array $\mathbf{y}$ that holds the response variable values. To view the documentation on this method, <a href='http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html'>visit here</a>.</p>

<p style="font-size: 16px">Once you have fit the model by passing the training data to the fit method, the new state of the model object is stored in instance attributes with a trailing underscore '\_' (e.g. model.coef_ for a linear model). The hyperparameters the object was created with can be retrieved through a method call (e.g. get_params).</p>

<p style="font-size: 16px">Estimator objects that can generate predictions provide a model.predict method. In the case of regression, model.predict will return the predicted regression values, $\hat{\mathbf{y}}$. In the case of classification, it will return the predicted class labels.</p>
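
<p style="font-size: 16px">As a minimal sketch of this create, fit, inspect, predict cycle, the example below fits a LinearRegression (the estimator linked above) on hypothetical noise-free data following $y = 2x + 1$. The variable names X_demo and y_demo are made up for illustration.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, 1)
y_demo = np.array([1.0, 3.0, 5.0, 7.0])          # shape (n_samples,)

model = LinearRegression()   # hyperparameters are set when the object is created
model.fit(X_demo, y_demo)    # fitting stores the learned state on the object

# Learned state lives in attributes with a trailing underscore
print(model.coef_, model.intercept_)

# Predict the response for a new, unseen input
y_hat = model.predict(np.array([[4.0]]))
print(y_hat)
```

<p style="font-size: 16px">Because the data lie exactly on a line, the fitted slope and intercept should recover 2 and 1 up to floating point error.</p>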


In [None]:
from sklearn import neighbors
import pickle

# we will use a smaller dataset for example sake;
# the with-block closes the file once the data is loaded
with open('files_ch05/dataset_small.pkl','rb') as ofname:
    (X,y) = pickle.load(ofname, encoding='bytes')

#Create an instance of K-nearest neighbor classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=11)

#Train the classifier
knn.fit(X, y)

#Compute the prediction according to the model
y_pred = knn.predict(X)

print("Complete!")

<p style="font-size: 16px">Sklearn's classifiers come with a score method that returns the mean accuracy of the model's predictions on the given data and labels.</p>

In [None]:
knn.score(X,y)
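
<p style="font-size: 16px">To make the meaning of score concrete, the sketch below checks on hypothetical toy data that a classifier's score agrees with accuracy_score applied to its own predictions. The names X_sc, y_sc and clf are made up for illustration.</p>

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: two well-separated groups of three points each
X_sc = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_sc = np.array([1, 1, 1, -1, -1, -1])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_sc, y_sc)

# score(X, y) predicts on X and compares the result with y
acc_from_score = clf.score(X_sc, y_sc)
# The same number computed explicitly: accuracy_score(y_true, y_pred)
acc_manual = accuracy_score(y_sc, clf.predict(X_sc))
print(acc_from_score, acc_manual)
```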

In [None]:
labels = 'Fully Covered', 'Not Fully Covered'

# Count the samples of each class in the small dataset
# np.where(condition, val if true, val if false)
sizes = [np.sum(np.where(y==1,1,0)), np.sum(np.where(y==-1,1,0))]
colors = ['gold', 'lightcoral']
explode = (0.1, 0)
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
 
plot_ = plt.axis('equal')

<p style="font-size: 16px">This unbalanced labeling means that if we <b>always</b> predicted the loan will be fully funded, we would be correct 81.4% of the time, very close to the accuracy of our model. This demonstrates that accuracy may not be the best metric for understanding the predictive power of our classifier. A better tool is the <i>confusion matrix</i>.</p>

<table style='border-style : hidden'>
    <tr>
        <th></th>
        <th>Actual Positive</th>
        <th>Actual Negative</th>
        <th></th>
    </tr>
    <tr>
        <th>Predicted Positive</th>
        <td>TP: True Positive, when correctly predicted positive</td>
        <td>FP: False Positive, when incorrectly predicted positive</td>
        <td>$\to$ Precision $\frac{TP}{TP+FP}$</td>
    </tr>
    <tr>
        <th>Predicted Negative</th>
        <td>FN: False Negative, when incorrectly predicted negative</td>
        <td>TN: True Negative, when correctly predicted negative</td>
        <td>$\to$ Negative Predictive Value $\frac{TN}{TN+FN}$</td>
    </tr>
    <tr>
        <th></th>
        <td>$\downarrow$ Sensitivity (Recall) $\frac{TP}{TP+FN}$</td>
        <td>$\downarrow$ Specificity $\frac{TN}{TN+FP}$</td>
        <th></th>
    </tr>
</table>

<br>
$$\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
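
<p style="font-size: 16px">The sketch below shows, on hypothetical labels, why these rates say more than accuracy alone. It scores a "classifier" that always predicts the majority class, treating the failed loan (label -1) as the positive class as in the notebook's own confusion matrix cell. The helper name rates is made up for illustration.</p>

```python
import numpy as np

def rates(y_true, y_pred, positive=-1):
    """Compute accuracy, recall and specificity from true and predicted labels."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty classes to avoid dividing by zero
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    return {'accuracy': accuracy, 'recall': recall, 'specificity': specificity}

# Hypothetical labels: 8 fully covered loans (1) and 2 failed loans (-1)
y_true_demo = np.array([1] * 8 + [-1] * 2)
# A "classifier" that always predicts the majority class
y_majority = np.ones_like(y_true_demo)

majority_rates = rates(y_true_demo, y_majority)
print(majority_rates)
```

<p style="font-size: 16px">The majority predictor is right on 8 of 10 samples, yet it never identifies a failed loan, so its recall on the positive class is zero.</p>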

In [None]:
# Here the positive class is the failed loan (label -1)
cm_counts = dict()
cm_counts['TP'] = np.sum(np.logical_and(y_pred==-1,y==-1))
cm_counts['TN'] = np.sum(np.logical_and(y_pred==1,y==1))
cm_counts['FP'] = np.sum(np.logical_and(y_pred==-1,y==1))
cm_counts['FN'] = np.sum(np.logical_and(y_pred==1,y==-1))
pd.DataFrame(data=cm_counts, index=['results'])

<p style="font-size: 16px">Sklearn also provides a function for this, which makes the calculation easier. For a more elaborate example of how to use it, <a href='http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py'>visit here</a>.</p>

In [None]:
from sklearn.metrics import confusion_matrix
# Arguments are (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(y, y_pred)

<h3>Training and Testing Subsets 5.6</h3>
<p style="font-size: 16px">We have previously used the full dataset for both fitting and evaluating the model; this is not good practice. It is better to train on one subset of the data and then test the accuracy of your model on a smaller subset that was not part of your training dataset. This is because we want to compare the <i>in-sample error rate</i> $E_{in}$, the error on values in the training set, with the <i>out-of-sample error</i> $E_{out}$, the generalization error on unseen data, which we estimate on our test set.</p>

$$E_{in}=\frac{1}{N}\sum_{i=1}^{N}e(x_i, y_i)$$
$$E_{out}=\mathbb{E}_{x,y}(e(x,y))$$
$$\text{ where } e(x_i, y_i)=I[h(x_i)\neq y_i]= 
\begin{cases}
    1     & \quad \text{if } h(x_i)\neq y_i\\
    0 & \quad \text{otherwise }
  \end{cases}$$
$$\text{observe that, typically, }E_{out}\geq E_{in}$$

<p style="font-size: 16px">The goal of model learning is to minimize the generalization error. Desirable properties:</p>
<li>$E_{in}\to 0$</li>
<li>$E_{out}\approx E_{in}$</li>
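
<p style="font-size: 16px">The gap between the two error rates can be sketched on synthetic data. The example below is an illustration under stated assumptions (two overlapping Gaussian blobs generated with a fixed seed, not the Lending Club data): a 1-nearest-neighbor classifier memorizes its training set, so its in-sample error is zero as long as no two identical training points carry different labels, while the error on held-out data is usually larger.</p>

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical synthetic data: two overlapping Gaussian blobs
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_neg = rng.normal(loc=1.5, scale=1.0, size=(200, 2))
X_syn = np.vstack([X_pos, X_neg])
y_syn = np.array([1] * 200 + [-1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

clf_1nn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# Fraction of misclassified samples on the training set (in-sample error)
E_in = np.mean(clf_1nn.predict(X_tr) != y_tr)
# Fraction of misclassified held-out samples, an estimate of E_out
E_out_estimate = np.mean(clf_1nn.predict(X_te) != y_te)
print('E_in = {:.3f}, estimated E_out = {:.3f}'.format(E_in, E_out_estimate))
```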

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# train_test_split splits the data into random subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=y.size)  
print('Training shape: {}, training targets shape: {}'.format(X_train.shape, y_train.shape))
print('Testing shape: {}, testing targets shape: {}'.format(X_test.shape, y_test.shape))

knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

#Check on the training set and visualize performance
y_pred_train =knn.predict(X_train)
y_pred_test = knn.predict(X_test)

results = dict()
# confusion_matrix(y_true, y_pred): rows are true labels, columns are predictions.
# Labels are sorted, so index 0 is the positive class (-1, failed loan).
train_cm = confusion_matrix(y_train, y_pred_train)
results['Training'] = {'classification_acc': accuracy_score(y_train, y_pred_train),
                      'TP': train_cm[0,0], 'FN': train_cm[0,1],
                      'FP': train_cm[1,0], 'TN': train_cm[1,1]}

test_cm = confusion_matrix(y_test, y_pred_test)
results['Test'] = {'classification_acc': accuracy_score(y_test, y_pred_test),
                      'TP': test_cm[0,0], 'FN': test_cm[0,1],
                      'FP': test_cm[1,0], 'TN': test_cm[1,1]}

pd.DataFrame(data=results)
