# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.

Perhaps you are contemplating lending money to a company, and need to know whether the company
is in near-term danger of not being able to repay.



## Goal

In the warm up exercise, we walked you through some of the challenges that you will confront:
- Messy data
- Correlated features
- Imbalanced dataset

For the Final Project you will create a model, following all the steps in the Recipe, to solve
the Bankruptcy prediction task.



## Learning objectives

- Demonstrate mastery of solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- There will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skills.

## Grading
Prior assignments evaluated you step by step.

This project is results-based. Your goal is to create a well performing model.

We will give you some metrics on which your model will be judged.
Each metric will have 3 thresholds of increasing value
- You will get points for each threshold that your model surpasses

There are 2 files:

- `train/data.csv`:      
    - This is the dataset on which you will train your model
    
- `holdout/data.csv`:
    - This is the dataset by which you will be judged !
    - It has no labels so you can't use it to train or test your model
        - But **the instructors** do have the labels so we can evaluate your model
    - Once you have built your model, you will make predictions on these examples and submit them for grading
    

# Import modules

In [1]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline


In [2]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%reload_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

# API for students

We have defined some utility routines in the file `bankruptcy_helper.py`, which contains a class named `Helper`.

These routines should simplify your work.


`helper = bankruptcy_helper.Helper()`



- `getData`: get the training data and holdout data
  > `train, holdout = helper.getData()`

- `plot_attr`: create multiple plots of the distribution of the feature named `attr`, one plot per possible value of the target/label `y`
  > `helper.plot_attr(X, y, attr, trunc)`

  > `X`: DataFrame of features. Each row is an example.  
  > `y`: DataFrame/ndarray. Label of each example.  
  > `attr`: string. Name of the feature whose distribution will be plotted.  
  > `trunc`: scalar. Optional parameter to truncate the distribution at a threshold percentage.




# Reminders

The data set for this exercise is the same as for the warm up exercise.

In the warm up, we flagged potential issues with the data:
- Numeric values encoded as strings
- Examples that have features with missing values
- Uneven distribution of examples across target values

We also stressed the value of creating your own out of sample dataset on which to
evaluate your model before submitting your results for grading.

Also: the holdout data (the examples without labels for which your predictions will be graded) come from the
same distribution as the data with labels on which you may train/test.
So if there are issues with the training/test data, those same issues may be present in the holdout data.

Please think about whether some of the lessons and code from the warm up may be useful here.
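
One way to build your own out of sample dataset when the target values are unevenly distributed is a stratified split. The sketch below is only an illustration: the toy arrays `X` and `y` are hypothetical stand-ins for the real features and labels, and the 7% positive rate is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 examples, roughly 7% positives (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.07).astype(int)

# stratify=y keeps the class proportions (nearly) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

print("positive rate overall:", y.mean())
print("positive rate train:  ", y_tr.mean())
print("positive rate test:   ", y_te.mean())
```

Without `stratify`, a small test set can end up with very few positive examples, which makes Recall and Precision estimates noisy.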



**Remember**

The holdout data is in the same format as the data we used for training
- Except that it has no attribute for the target
- So you will need to **perform all the transformations on the holdout data**
    - As you did on the training data
    - Including turning the string representation of numbers into actual numeric data types
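
A simple way to guarantee that both datasets receive the same treatment is to put the transformation in one function and call it twice. This is a minimal sketch, not the required approach: `to_numeric_frame`, the tiny DataFrames, and the `'?'` placeholder for a missing value are all hypothetical.

```python
import pandas as pd

def to_numeric_frame(df):
    """Return a copy of df with every column coerced to a numeric dtype.

    Strings that cannot be parsed as numbers become NaN.
    """
    return df.apply(pd.to_numeric, errors="coerce")

# Hypothetical stand-ins for the labeled data and the holdout data
train_raw = pd.DataFrame(
    {"X1": ["0.1", "?", "0.3"], "X2": [1.0, 2.0, 3.0], "Bankrupt": [0, 1, 0]}
)
holdout_raw = pd.DataFrame({"X1": ["0.5", "0.6"], "X2": ["?", "4.0"]})

# The same function is applied to both datasets
train_clean = to_numeric_frame(train_raw)
holdout_clean = to_numeric_frame(holdout_raw)

print(train_clean.dtypes)
print(holdout_clean.dtypes)
```

The values that become NaN still have to be handled afterwards, for example by an imputation step inside a `Pipeline`, so that the statistics learned on the training data are reused on the holdout data.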

# Create your own model, using the Recipe for Machine Learning

Time for you to continue the Recipe for Machine Learning on your own.


In [3]:
# Get the data
#  data: training dataset
#  holdout: hold out dataset without target
data, holdout = helper.getData()
target_attr = 'Bankrupt'

# Convert all attributes to numeric
### BEGIN SOLUTION
non_numeric_cols = data.select_dtypes(exclude=['float', 'int']).columns
data[ non_numeric_cols] = data[ non_numeric_cols ].apply(pd.to_numeric, downcast='float', errors='coerce')
### END SOLUTION

# Separate the target Bankrupt from all features
data, labels = data.drop(columns=[target_attr]), data[target_attr]

# Shuffle the data
data, labels = sklearn.utils.shuffle(data, labels, random_state=42)

# Split data into train and test
### BEGIN SOLUTION
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)
### END SOLUTION

# Results

You hopefully have conducted multiple experiments, tried several forms of data transformation
and used a couple of different algorithms.

Now you need to make a choice: which decisions will give you the *best* predictions out of sample ?
We will refer to this as your "best model".

For your best model, using the test set you created, report
- Accuracy 
- Recall
- Precision

We will evaluate your model using the holdout data.  Grades will be based on
the following metrics meeting certain thresholds
- Accuracy
- Recall
- Precision

We will evaluate the metric using 3 increasing values for the threshold
- You will get points for each threshold that you surpass
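
To see why Accuracy alone is not enough on an imbalanced dataset, consider a "model" that never predicts bankruptcy. The labels below are hypothetical (8 bankrupt companies out of 100), and the `zero_division` argument assumes a reasonably recent version of scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical labels: 100 companies, 8 of which go bankrupt (label 1)
y_true = np.array([1] * 8 + [0] * 92)

# A "model" that always predicts Not Bankrupt (label 0)
y_always_0 = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_0)
# zero_division=0: report 0 when there are no positive predictions
rec = recall_score(y_true, y_always_0, pos_label=1, zero_division=0)
prec = precision_score(y_true, y_always_0, pos_label=1, zero_division=0)

print(f"Accuracy {acc:.1%}, Recall {rec:.1%}, Precision {prec:.1%}")
```

The accuracy is high simply because most companies do not go bankrupt, yet the model finds none of the companies that do.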

In [5]:
### BEGIN SOLUTION
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer 

from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.decomposition import PCA


impute_transformer = SimpleImputer(strategy='median')


## SVM and Random Forest model
# logistic_clf = linear_model.LogisticRegression(solver = 'liblinear', max_iter = 10000)
svm_clf = SVC(gamma="auto", C=.1)
forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)

r = "None"

for name, clf in { "SVM": svm_clf,
                   "Random Forest": forest_clf
                 }.items():
    
    pipe = Pipeline([("imputer", impute_transformer), 
                      ("model", clf)
                     ]
                    )
    
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print("Model: {m:s} (t={r:s}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

    # Out of sample prediction
    _= pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    accuracy_test = accuracy_score(y_test, y_pred)

    # Recall and precision of the positive (Bankrupt) class
    recall_test = recall_score(y_test, y_pred, pos_label=1, average="binary")
    precision_test = precision_score(y_test,   y_pred, pos_label=1, average="binary")


    print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                                a=accuracy_test,
                                                                                r=recall_test,
                                                                                p=precision_test
                                                                                )
         )

    
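### Models with dimensionality reduction and cost-sensitive training
# - Standardize the features, then reduce them to 20 principal components with PCA
# - Weight the Bankrupt class (1) more heavily than the Not Bankrupt class (0)

from sklearn.preprocessing import StandardScaler

stand_transformer = StandardScaler()

# Class weights; the relative weight of class 1 is varied in the loop below
cwt = { 0:1, 1:20 }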



for r in [ 1, 10, 12, 13, 15, 18]:
    cwt = { 0:1, 1:r }
    
    logistic_clf = linear_model.LogisticRegression(
        class_weight = cwt,
        solver = 'liblinear', max_iter = 10000)
    svm_clf = SVC(class_weight = cwt,
              gamma="auto", C=.1)
    
    for name, clf in { "SVM": svm_clf,
                       "Logistic": logistic_clf
                     }.items():

        pipe = Pipeline([("imputer", impute_transformer), 
                         ("Standardize", stand_transformer),
                         ("PCA", PCA(n_components = 20)),
                         ("model", clf)
                         ]
                        )
        scores = cross_val_score(pipe, X_train, y_train, cv=5)
        print("Model: {m:s} (t={r:d}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

        # Out of sample prediction
        _= pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)

        accuracy_test = accuracy_score(y_test, y_pred)

        # Recall and precision of the positive (Bankrupt) class
        recall_test = recall_score(y_test, y_pred, pos_label=1, average="binary")
        precision_test = precision_score(y_test,   y_pred, pos_label=1, average="binary")

        
        print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                                    a=accuracy_test,
                                                                                    r=recall_test,
                                                                                    p=precision_test
                                                                                    )
             )

### END SOLUTION

Model: SVM (t=None) avg cross val score=0.9382



	SVM Accuracy: 91.7%, Recall 0.0%, Precision 0.0%
Model: Random Forest (t=None) avg cross val score=0.9412

	Random Forest Accuracy: 93.2%, Recall 25.0%, Precision 76.9%
Model: SVM (t=1) avg cross val score=0.9382



	SVM Accuracy: 91.7%, Recall 0.0%, Precision 0.0%
Model: Logistic (t=1) avg cross val score=0.9350

	Logistic Accuracy: 91.5%, Recall 0.0%, Precision 0.0%
Model: SVM (t=10) avg cross val score=0.8727

	SVM Accuracy: 85.7%, Recall 47.5%, Precision 28.4%
Model: Logistic (t=10) avg cross val score=0.8755

	Logistic Accuracy: 85.3%, Recall 47.5%, Precision 27.5%
Model: SVM (t=12) avg cross val score=0.8485

	SVM Accuracy: 82.6%, Recall 55.0%, Precision 25.0%
Model: Logistic (t=12) avg cross val score=0.8450

	Logistic Accuracy: 83.4%, Recall 62.5%, Precision 27.8%
Model: SVM (t=13) avg cross val score=0.8321

	SVM Accuracy: 81.3%, Recall 60.0%, Precision 24.5%
Model: Logistic (t=13) avg cross val score=0.8266

	Logistic Accuracy: 81.1%, Recall 62.5%, Precision 24.8%
Model: SVM (t=15) avg cross val score=0.7975

	SVM Accuracy: 78.4%, Recall 65.0%, Precision 22.4%
Model: Logistic (t=15) avg cross val score=0.7897

	Logistic Accuracy: 78.2%, Recall 75.0%, Precision 24.0%
Model: SVM (t=18) avg

# Submission guidelines

You will make a prediction for *each example* in the holdout dataset.

**Question**
- Set a variable `my_predictions` to be a `list` or `ndarray` of predictions

`my_predictions[i]` (element $i$ of `my_predictions`) should be your prediction
for the $i^{th}$ holdout example
- So 
    - the length of `my_predictions` must be equal to the number of holdout examples
    - the ordering of predictions must be the same as the ordering of holdout examples

We will evaluate the performance metrics on `my_predictions` and assign  you a grade.
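
Before submitting, it may be worth checking the shape and contents of `my_predictions`. The helper below is a hypothetical sketch, not part of `bankruptcy_helper`; it assumes that predictions are encoded as 0 (Not Bankrupt) or 1 (Bankrupt). It cannot verify the ordering, so avoid shuffling or re-sorting the holdout examples before predicting.

```python
import numpy as np

def check_predictions(predictions, n_holdout):
    """Hypothetical sanity check: one 0/1 prediction per holdout example."""
    preds = np.asarray(predictions)
    if preds.ndim != 1:
        raise ValueError("predictions should be one-dimensional")
    if len(preds) != n_holdout:
        raise ValueError(
            f"expected {n_holdout} predictions, got {len(preds)}"
        )
    if not set(np.unique(preds)) <= {0, 1}:
        raise ValueError("predictions should contain only 0 and 1")
    return True

# Usage with made-up values; in the notebook, n_holdout would be the
# number of rows of the holdout DataFrame
print(check_predictions([0, 1, 0, 0], n_holdout=4))
```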



In [6]:
# Set variable
#  my_predictions: list/ndarray
my_predictions = None


### BEGIN SOLUTION

# It should create an array of predictions; we initialize it to the empty array for convenience
my_predictions = []

# Relative weight of Bankrupt class to Non Bankrupt class
r = 13

# Class weights
cwt = { 0:1, 1:r }

logistic_clf = linear_model.LogisticRegression(
    class_weight = cwt,
    solver = 'liblinear', max_iter = 10000)
name = "Logistic"

pipe = Pipeline([("imputer", impute_transformer), 
                 ("Standardize", stand_transformer),
                 ("PCA", PCA(n_components = 20)),
                 ("model", logistic_clf)
                 ]
               )

scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Model: {m:s} (t={r:d}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

# Fit the model
_= pipe.fit(X_train, y_train)

# Out of sample prediction 
_, X_hold = helper.getData()

# transform X_hold
non_numeric_cols = X_hold.select_dtypes(exclude=['float', 'int']).columns
X_hold[ non_numeric_cols] = X_hold[ non_numeric_cols ].apply(pd.to_numeric, downcast='float', errors='coerce')

# predict X_hold
y_pred = pipe.predict(X_hold)
my_predictions = y_pred


### END SOLUTION

Model: Logistic (t=13) avg cross val score=0.8270



# Illustration of grading

The following code illustrates how we will grade your predictions.

We suggest that you first try this code on the predictions you make from the test dataset you have created
so that you can identify any issues that may arise with the holdout dataset.
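
A dry run on your own test set might look like the sketch below. The thresholds are copied from the grading cells in this notebook; the function name `grading_dry_run` and the example labels and predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

def grading_dry_run(y_true, y_pred):
    """Apply the notebook's grading thresholds to a set of predictions."""
    acc = accuracy_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    prec = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
    return {
        "accuracy > 0.75": bool(acc > 0.75),
        "recall/precision": bool(
            (rec > 0.50 and prec > 0.15) or (rec > 0.20 and prec > 0.50)
        ),
        "accuracy > 0.80 (extra)": bool(acc > 0.80),
        "recall/precision (extra)": bool(
            (rec > 0.60 and prec > 0.20) or (rec > 0.20 and prec > 0.60)
        ),
    }

# Hypothetical example: 10 bankrupt companies out of 100;
# the model finds 7 of them and raises 13 false alarms
y_true_demo = np.array([1] * 10 + [0] * 90)
y_pred_demo = np.array([1] * 7 + [0] * 3 + [1] * 13 + [0] * 77)

for check, passed in grading_dry_run(y_true_demo, y_pred_demo).items():
    print(f"{check}: {'pass' if passed else 'fail'}")
```

In the notebook you would pass your own `y_test` and the predictions of your best model on `X_test` instead of the demo arrays.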

In [7]:
### BEGIN HIDDEN TESTS

# load the holdout data with targets
DATA_DIR = './Data'
file_name = '5th_yr_with_target.csv'

if not os.path.exists(DATA_DIR):
    DATA_DIR = '../resource/asnlib'

y_hold = pd.read_csv(os.path.join(DATA_DIR, file_name))['Bankrupt']

# accuracy
accuracy_hold = accuracy_score(y_hold, my_predictions)

# recall & precision
recall_hold = recall_score(y_hold, my_predictions, pos_label=1, average="binary")
precision_hold = precision_score(y_hold, my_predictions, pos_label=1, average="binary")

# check accuracy
assert(accuracy_hold > 0.75)

### END HIDDEN TESTS

FileNotFoundError: [Errno 2] File b'./Data/5th_yr_with_target.csv' does not exist: b'./Data/5th_yr_with_target.csv'

In [None]:
### BEGIN HIDDEN TESTS
assert( ( (recall_hold  > 0.50) and (precision_hold > 0.15) )
       or
        ( (recall_hold  > 0.20) and (precision_hold > 0.50) )
      )
### END HIDDEN TESTS

In [None]:
# Extra points
### BEGIN HIDDEN TESTS
assert(accuracy_hold > .80)
### END HIDDEN TESTS

In [None]:
recall_hold, precision_hold

In [None]:
# Extra points
### BEGIN HIDDEN TESTS
assert( ( (recall_hold > .60) and (precision_hold > 0.20) )
       or
        ( (recall_hold  > 0.20) and (precision_hold > 0.60) )
      )
### END HIDDEN TESTS

# Discussion
- Most of the features are expressed as ratios: why is that a good idea ?
- Even if you don't understand all of the financial concepts behind the names of the attributes
    - You should be able to infer some relationships.  For example, here are some definitions of terms
$$
\begin{array}{lll}
X1   & = & \frac{\text{net profit} }{ \text{total assets} } \\
X9   & = & \frac{\text{sales}     }{ \text{total assets} } \\
X23  & = & \frac{\text{net profit} }{ \text{sales} } \\
\end{array}
$$

    - Therefore
$$
\begin{array}{llll}
X23  & = & \frac{X1}{X9} & \text{Algebra !}
\end{array}
$$

    - You might speculate that `net profit` is closely related to `gross profit`
        - The difference between "net" and "gross" is usually some type of additions/subtractions
    - Is this theory reflected in which features are most highly correlated with `X1` ?
- If you perform dimensionality reduction using PCA (the topic of the Unsupervised Learning lecture)
    - PCA is scale sensitive
    - If you *don't* scale the features: how many do you need to capture 95% of the variance ?
    - If you *do* scale the features: how many do you need to capture 95% of the variance ?
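
The scale sensitivity of PCA can be illustrated on synthetic data. This is a sketch under made-up assumptions: the data are 10 independent random features, one of which is measured on a much larger scale, and `n_components_for` is a hypothetical helper. The bankruptcy features will behave differently, so run the same comparison on your own training data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 independent features; the first has a much larger scale
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
X_demo[:, 0] *= 1000.0

def n_components_for(X, variance=0.95):
    """Number of principal components needed to explain `variance` of the total."""
    # A float n_components in (0, 1) asks PCA for that fraction of variance
    pca = PCA(n_components=variance, svd_solver="full").fit(X)
    return pca.n_components_

n_unscaled = n_components_for(X_demo)
n_scaled = n_components_for(StandardScaler().fit_transform(X_demo))

print("components for 95% variance, unscaled:", n_unscaled)
print("components for 95% variance, scaled:  ", n_scaled)
```

Without scaling, the large-scale feature dominates the total variance, so very few components appear to suffice; after standardization every feature contributes comparably.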