In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

## Data description

This exercise is based on a subset of the data for the ["Give Me Some Credit" Kaggle competition](https://www.kaggle.com/c/GiveMeSomeCredit). Follow the link and take a look at the competition description.

### Getting the data
Download the data file called `credit_scoring_sample.csv` from https://github.com/Yorko/mlcourse.ai/tree/master/data


### Data columns
Not all of these columns are present in the sample data we use for this exercise.

 - **SeriousDlqin2yrs** (prediction target) - Person experienced 90 days past due delinquency or worse 
 - **RevolvingUtilizationOfUnsecuredLines** - Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans), divided by the sum of credit limits
 - **age** - Age of borrower in years
 - **DebtRatio** - Monthly debt payments, alimony, and living costs, divided by monthly gross income
 - **MonthlyIncome** - Monthly income
 - **NumberOfOpenCreditLinesAndLoans** - Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards)
 - **NumberRealEstateLoansOrLines** - Number of mortgage and real estate loans including home equity lines of credit
 - **NumberOfDependents** - Number of dependents in family excluding themselves (spouse, children etc.)
 - **NumberOfTimes90DaysLate** - Number of times borrower has been 90 days or more past due.
 - **NumberOfTime60-89DaysPastDueNotWorse** - Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
 - **NumberOfTime30-59DaysPastDueNotWorse** - Number of times borrower has been 30-59 days past due but no worse in the last 2 years.


In [None]:
df = pd.read_csv('../../mlcourse.ai/data/credit_scoring_sample.csv', sep=';')

In [4]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| SeriousDlqin2yrs | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| age | 64.0 | 58.0 | 41.0 | 43.0 | 49.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DebtRatio | 0.249908 | 3870.0 | 0.456127 | 0.00019 | 0.27182 |
| NumberOfTimes90DaysLate | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MonthlyIncome | 8158.0 | NaN | 6666.0 | 10500.0 | 400.0 |
| NumberOfDependents | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |


#### How many columns and samples do we have in this dataset?

#### What percentage of people in this dataset had serious delinquency?
 - It's the first column `SeriousDlqin2yrs`
 - This is the column we will be trying to predict today

#### What accuracy score would you expect from the optimistic classifier that expects no delinquency at all?
 - This is called the Null accuracy
 - Verify that accuracy score using the `accuracy_score()` function
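As a rough illustration of the idea, here is a minimal sketch on made-up labels. The array `y_toy` is hypothetical and is not taken from the dataset; with the real data you would use the `SeriousDlqin2yrs` column instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 8 borrowers without delinquency (0), 2 with (1)
y_toy = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# The "optimistic" classifier predicts 0 (no delinquency) for everyone
y_null = np.zeros_like(y_toy)

# Accuracy of the all-zeros prediction
null_accuracy = accuracy_score(y_toy, y_null)

# It should match the share of zeros among the true labels
share_of_zeros = 1 - y_toy.mean()
print(null_accuracy, share_of_zeros)
```

The two numbers agree because the all-zeros prediction is correct exactly for the samples whose true label is 0.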


#### Are there any columns with missing (NaN) values?
Hint: `isnull()`

#### Fill in all the missing values using the median value of the corresponding column
Hint: `fillna()`
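One possible way to combine the two hints is sketched below on a small made-up frame. The frame `toy` is hypothetical: the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (values are made up)
toy = pd.DataFrame({
    "MonthlyIncome": [8158.0, np.nan, 6666.0, 10500.0, 400.0],
    "NumberOfDependents": [0.0, 0.0, np.nan, 2.0, 0.0],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# median() skips NaN by default; passing the resulting Series to fillna()
# fills each column with its own median
toy_filled = toy.fillna(toy.median())
print(missing_counts)
print(toy_filled)
```

The same pattern should carry over to `df`, since every column in this dataset is numeric.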

#### Define X and y to experiment with some classifiers below

In [5]:
# We will use all the columns except the target as the basis for our predictions
# This is the list of the columns
cols = df.columns[1:].tolist()

#### Train a DecisionTreeClassifier using ALL the data and find the accuracy_score

#### Repeat the above with different values of max_depth in the range between 2 and 15. Plot the accuracy score as a function of max_depth
- `max_depth` is passed to the constructor when creating an instance of `DecisionTreeClassifier`. It is the maximum depth the decision tree is allowed to reach, and limiting it is one way to reduce overfitting. By default, when `max_depth` is not specified, `DecisionTreeClassifier` keeps splitting until all leaf nodes are pure (contain only samples of one class) or contain fewer than `min_samples_split` samples. In most cases this is severe overfitting (similar to kNN with k=1).
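The loop could look roughly like the sketch below. It runs on synthetic stand-in data from `make_classification` (an assumption made so the snippet is self-contained); the names `X_demo`, `y_demo`, `demo_depths` and `train_scores` are hypothetical, and with the real data you would use your own `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the credit scoring sample
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)

demo_depths = list(range(2, 16))  # 2 through 15 inclusive
train_scores = []
for depth in demo_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_demo, y_demo)
    # Accuracy measured on the same data the tree was fitted on
    train_scores.append(accuracy_score(y_demo, model.predict(X_demo)))

for depth, score in zip(demo_depths, train_scores):
    print(depth, round(score, 3))
```

Passing `demo_depths` and `train_scores` to `plt.plot` would then draw the curve. Because the score is computed on the training data, it says nothing yet about performance on unseen samples.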

In [None]:
depths = list(range(2, 16))  # 2 through 15 inclusive
# Your code here

In [None]:
# plot

#### At this point we still have no idea how well our tree performs on "out of sample" data. Repeat the loop above, but instead of the training accuracy, compute the 5-fold cross-validation score on each iteration. Plot the scores as a function of max depth.
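A minimal sketch of such a loop, again on synthetic stand-in data (an assumption; `X_cv`, `y_cv`, `cv_depths`, `cv_scores` and `best_depth` are hypothetical names). It uses `cross_val_score` from `sklearn.model_selection`, which returns one score per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the credit scoring sample
X_cv, y_cv = make_classification(n_samples=500, n_features=7, random_state=0)

cv_depths = list(range(2, 16))  # 2 through 15 inclusive
cv_scores = []
for depth in cv_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # One accuracy value per fold; the mean summarizes the 5 folds
    fold_scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_scores.append(fold_scores.mean())

# Depth with the highest mean cross-validation accuracy
best_depth = cv_depths[cv_scores.index(max(cv_scores))]
print(best_depth)
```

Picking the depth with the highest mean score is only one reasonable rule; when several depths score almost the same, the shallower tree is often preferred.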

#### Based on the plot of CV scores, what is the best value for max depth?

In [None]:
maxDepth = None  # Replace None with your answer

#### Let's create a random forest of 20 decision trees using the optimal value for max depth

In [None]:
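from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Single tree limited to the chosen depth
tree = DecisionTreeClassifier(max_depth=6)
# Bagged ensemble of 20 such trees
forest = BaggingClassifier(tree, n_estimators=20)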

#### Find the 5-fold CV score for the `forest` classifier

#### Split the data into training and testing sets, using 40% of the data for testing
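A sketch of a 60/40 split with `train_test_split`, shown on synthetic stand-in data (an assumption). The names `X_s`, `y_s`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical, and `stratify` is an optional extra that is not required by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the credit scoring sample
X_s, y_s = make_classification(n_samples=500, n_features=7, random_state=0)

# test_size=0.4 keeps 40% of the rows for testing;
# stratify=y_s keeps the class proportions similar in both parts (optional)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.4, random_state=0, stratify=y_s
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing the tree and the forest on the same test set.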

#### Fit both the forest and the single tree using the training data

#### Find the predictions according to both the tree and the forest classifiers and the corresponding accuracy scores

#### Take a look at the `confusion_matrix` for the above two predictions and the dummy
 - Where are the actual classes and where are the predicted ones?
 - Try also looking at the matrices as fractions of the samples rather than counts (divide by the total number of samples)
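To see the layout, here is a small sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical). In scikit-learn's `confusion_matrix`, rows correspond to the actual classes and columns to the predicted ones.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 6 actual negatives followed by 4 actual positives
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Same matrix as fractions of all samples
cm_share = cm / cm.sum()
print(cm)
print(cm_share)
```

Here the counts come out as 5 true negatives, 1 false positive, 3 false negatives and 1 true positive.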

#### Take a look at the confusion matrix for the dummy prediction that predicts no delinquency at all

#### Compute the Sensitivity, Specificity and Precision based on each of the confusion matrices
 - Use [Kevin Markham's notebook](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) as reference
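The three metrics can be read off the four cells of a binary confusion matrix. The sketch below uses made-up labels (`y_true_m` and `y_pred_m` are hypothetical) and cross-checks two of the results against `recall_score` and `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 6 actual negatives followed by 4 actual positives
y_true_m = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_m = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true_m, y_pred_m).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found
precision = tp / (tp + fp)    # share of predicted positives that are correct

# Cross-check against the library functions
print(sensitivity, recall_score(y_true_m, y_pred_m))
print(precision, precision_score(y_true_m, y_pred_m))
print(specificity)
```

For the dummy prediction, TP + FP is zero, so precision is undefined there and the manual formula would divide by zero.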

In [None]:
# Sensitivity is also called recall; sklearn.metrics provides recall_score for it
from sklearn.metrics import recall_score
# ypf is assumed to hold the forest's predictions for the test set
recall_score(y_test, ypf)

In [None]:
# Sensitivity manual computation

