# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.

Perhaps you are contemplating lending money to a company, and need to know whether the company
is in near-term danger of not being able to repay.

This task is divided into two parts:
- Part 1 is Assignment 3
- Part 2 is this final project.


## Goal

In Assignment 3, we went through the first few, but very important, steps of solving a machine learning problem and prepared the data for the final project.

Now you will build your own models and train them on your prepared dataset.

## Learning objectives

- Demonstrate mastery in solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- We will make suggestions for ways to approach the problem
    - But there will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skill

## Grading
Prior assignments evaluated you step by step.

This project is results-based. Your goal is to create a well-performing model.

We will evaluate the metric using 3 increasing values for the threshold
- You will get points for each threshold that you surpass

There are 2 data files in this directory:

- train.csv:
    - This is the dataset on which you will train your model
- test.csv:
    - This is the dataset by which you will be judged !
    - It has no labels so you can't use it to train or test your model
        - But we do have the labels so we can test your accuracy
    - Once you have built your model, you will make predictions on these examples and submit them for grading
    

# Import modules

In [1]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline


In [2]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%reload_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

# API for students

We have defined some utility routines in a file `bankruptcy_helper.py`. There is a class named `Helper` in it.  

This will simplify problem solving



`helper = bankruptcy_helper.Helper()`

- plot_attr: plot the distribution of one feature, conditional on the value of the associated target
  > `X`: features      
  > `y`: labels       
  > `attr`: condition feature        
  > `trunc`: percentage of outliers you want to remove  
  
  >`helper.plot_attr(X, y, attr, trunc)`       

- save_data: save the training and test data into a folder named "my_data"
  > `helper.save_data(X_train, X_test, y_train, y_test)`
 
- load_data: load the training and test data from a folder named "my_data"
  > `X_train, X_test, y_train, y_test = helper.load_data()`

# Load the data

The first step in this project is to load the data we prepared in Assignment 3.

In [15]:
# Load the data you have prepared for this project
X_train, X_test, y_train, y_test = helper.load_data()


print('X_train shape: ', X_train.shape)
print('X_test shape:', X_test.shape)


X_train shape:  (4336, 65)
X_test shape: (482, 65)


# Your model

Time for you to continue the Recipe for Machine Learning on your own.

Follow the steps and submit your *best* model.

For your best model, using the test set you created, report
- Accuracy 
- Recall
- Precision

We will evaluate your model using the holdout data.  Grades will be based on
the following metrics meeting certain thresholds
- Accuracy
- Recall
- Precision

We will evaluate the metric using 3 increasing values for the threshold
- You will get points for each threshold that you surpass
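
To make the three metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts on made-up labels (not the bankruptcy data). The helper name `binary_metrics` is hypothetical and is not part of `bankruptcy_helper`. The first example also illustrates why accuracy alone can mislead when the positive class is rare.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Return (accuracy, recall, precision) for a binary problem.

    Recall and precision are reported as 0.0 when their denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# A classifier that never predicts the rare positive class:
# high accuracy, but it finds none of the positives.
y_true_rare = [0] * 9 + [1]
y_pred_never = [0] * 10
print(binary_metrics(y_true_rare, y_pred_never))

# A classifier that trades some accuracy for recall on the positive class.
y_true_mixed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_mixed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
print(binary_metrics(y_true_mixed, y_pred_mixed))
```

In the notebook itself you would most likely call `accuracy_score`, `recall_score` and `precision_score` from `sklearn.metrics`, which should agree with these hand-computed values.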

In [8]:
### BEGIN SOLUTION
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer 

from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.decomposition import PCA

impute_transformer = SimpleImputer(strategy='median')


## SVM and Random Forest model
# logistic_clf = linear_model.LogisticRegression(solver = 'liblinear', max_iter = 10000)
svm_clf = SVC(gamma="auto", C=.1)
forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)

r = "None"

for name, clf in { "SVM": svm_clf,
                   "Random Forest": forest_clf
                 }.items():
    
    pipe = Pipeline([("imputer", impute_transformer), 
                      ("model", clf)
                     ]
                    )
    
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print("Model: {m:s} (t={r:s}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

    # Out of sample prediction
    _= pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    accuracy_test = accuracy_score(y_test, y_pred)

    # Recall and precision for the positive (bankrupt) class
    recall_test = recall_score(y_test, y_pred, pos_label=1, average="binary")
    precision_test = precision_score(y_test,   y_pred, pos_label=1, average="binary")


    print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                                a=accuracy_test,
                                                                                r=recall_test,
                                                                                p=precision_test
                                                                                )
         )


### END SOLUTION

Model: SVM (t=None) avg cross val score=0.9382



  _warn_prf(average, modifier, msg_start, len(result))


	SVM Accuracy: 91.7%, Recall 0.0%, Precision 0.0%
Model: Random Forest (t=None) avg cross val score=0.9405

	Random Forest Accuracy: 93.2%, Recall 25.0%, Precision 76.9%


## Models with Dimensionality reduction

- Reduce the number of features
    - Try a dimensionality reduction technique, for example PCA
- Cost sensitive training
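
Cost sensitive training can be sketched as follows. This is only an illustration on synthetic two-feature data, not the bankruptcy data; the class weight of 10 and the names `X_toy`, `y_toy` and `results` are assumptions chosen for the example. The idea is that giving the rare class a larger weight makes mistakes on it more expensive, which tends to raise recall at the cost of precision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced toy data: 5% positives, shifted away from the negatives
X_neg = rng.normal(loc=0.0, scale=1.0, size=(1900, 2))
X_pos = rng.normal(loc=1.5, scale=1.0, size=(100, 2))
X_toy = np.vstack([X_neg, X_pos])
y_toy = np.concatenate([np.zeros(1900, dtype=int), np.ones(100, dtype=int)])

# Stratify so that both halves keep the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)

results = {}
for weight in (1, 10):
    # weight is the cost of a positive example relative to a negative one
    clf = LogisticRegression(class_weight={0: 1, 1: weight}, solver="liblinear")
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    results[weight] = (
        recall_score(y_te, y_hat, zero_division=0),
        precision_score(y_te, y_hat, zero_division=0),
    )
    print("weight={w:d}: recall={r:.1%}, precision={p:.1%}".format(
        w=weight, r=results[weight][0], p=results[weight][1]))
```

Which weight is appropriate depends on how the grading thresholds trade recall against precision, so it is worth sweeping several values rather than fixing one in advance.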

In [13]:
### BEGIN SOLUTION

from sklearn.preprocessing import StandardScaler

stand_transformer = StandardScaler()

# Sweep the relative weight r of the bankrupt class (1) versus the non-bankrupt class (0)
for r in [ 1, 10, 12, 13, 15, 18]:
    cwt = { 0:1, 1:r }
    
    logistic_clf = linear_model.LogisticRegression(
        class_weight = cwt,
        solver = 'liblinear', max_iter = 10000)
    svm_clf = SVC(class_weight = cwt,
              gamma="auto", C=.1)
    
    for name, clf in { "SVM": svm_clf,
                       "Logistic": logistic_clf
                     }.items():

        pipe = Pipeline([("imputer", impute_transformer), 
                         ("Standardize", stand_transformer),
                         ("PCA", PCA(n_components = 20)),
                         ("model", clf)
                         ]
                        )
        scores = cross_val_score(pipe, X_train, y_train, cv=5)
        print("Model: {m:s} (t={r:d}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

        # Out of sample prediction
        _= pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)

        accuracy_test = accuracy_score(y_test, y_pred)

        # Recall and precision for the positive (bankrupt) class
        recall_test = recall_score(y_test, y_pred, pos_label=1, average="binary")
        precision_test = precision_score(y_test,   y_pred, pos_label=1, average="binary")

        
        print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                                    a=accuracy_test,
                                                                                    r=recall_test,
                                                                                    p=precision_test
                                                                                    )
             )

### END SOLUTION

Model: SVM (t=1) avg cross val score=0.9382



  _warn_prf(average, modifier, msg_start, len(result))


	SVM Accuracy: 91.7%, Recall 0.0%, Precision 0.0%
Model: Logistic (t=1) avg cross val score=0.9345

	Logistic Accuracy: 91.5%, Recall 0.0%, Precision 0.0%
Model: SVM (t=10) avg cross val score=0.8750

	SVM Accuracy: 85.7%, Recall 47.5%, Precision 28.4%
Model: Logistic (t=10) avg cross val score=0.8741

	Logistic Accuracy: 85.3%, Recall 47.5%, Precision 27.5%
Model: SVM (t=12) avg cross val score=0.8522

	SVM Accuracy: 82.6%, Recall 55.0%, Precision 25.0%
Model: Logistic (t=12) avg cross val score=0.8448

	Logistic Accuracy: 83.4%, Recall 62.5%, Precision 27.8%
Model: SVM (t=13) avg cross val score=0.8342

	SVM Accuracy: 81.3%, Recall 60.0%, Precision 24.5%
Model: Logistic (t=13) avg cross val score=0.8263

	Logistic Accuracy: 81.1%, Recall 62.5%, Precision 24.8%
Model: SVM (t=15) avg cross val score=0.7952

	SVM Accuracy: 78.4%, Recall 65.0%, Precision 22.4%
Model: Logistic (t=15) avg cross val score=0.7874

	Logistic Accuracy: 78.2%, Recall 75.0%, Precision 24.0%
Model: SVM (t=18) avg

## Submission guidelines

- You will implement the body of a subroutine `MyModel`
    - That takes as argument a Pandas DataFrame 
        - Each row is an example on which to predict
        - The features of the example are elements of the row
    - Performs predictions on each example
    - Returns an array of predictions with a one-to-one correspondence with the examples in the test set
    

We will evaluate your model against the holdout data
- By reading the holdout examples `X_hold` (as above)
- Calling `y_hold_pred = MyModel(X_hold)` to get the predictions
- Comparing the predicted values `y_hold_pred` against the true labels `y_hold` which are known only to the instructors

See the following cell as an illustration

**Remember**

The holdout data is in the same format as the one we used for training
- Except that it has no attribute for the target
- So you will need to **perform all the transformations on the holdout data**
    - As you did on the training data
    - Including turning the string representation of numbers into actual numeric data types

All of this work *must* be performed within the body of the `MyModel` routine you will write

We will grade you by comparing the predictions array you create to the answers known to us.
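
As an illustration of the string-to-numeric step mentioned above, here is a minimal sketch. The function name `to_numeric_frame`, the toy column values and the `"?"` marker for a missing entry are all assumptions made for the example; check what the raw holdout file actually contains before relying on it.

```python
import pandas as pd


def to_numeric_frame(X):
    """Return a copy of X with every column coerced to a numeric dtype.

    Entries that cannot be parsed as numbers become NaN, so a later
    imputation step (for example a median imputer) can fill them in.
    """
    return X.apply(pd.to_numeric, errors="coerce")


# Toy frame with numbers stored as strings and two unparseable entries
raw = pd.DataFrame({"X1": ["0.12", "?", "0.30"],
                    "X2": ["1.5", "2.5", "bad"]})
clean = to_numeric_frame(raw)
print(clean.dtypes)
print(clean)
```

A transformation like this would need to run inside `MyModel`, before the pipeline's `predict` call, so that the holdout data goes through the same preparation as the training data.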

In [16]:
import pandas as pd
import os

def MyModel(X):
    # It should create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
    ### BEGIN SOLUTION
    
    # Relative weight of Bankrupt class to Non Bankrupt class
    r = 13
    
    # Class weights
    cwt = { 0:1, 1:r }
    
    logistic_clf = linear_model.LogisticRegression(
        class_weight = cwt,
        solver = 'liblinear', max_iter = 10000)
    name = "Logistic"
   
    pipe = Pipeline([("imputer", impute_transformer), 
                     ("Standardize", stand_transformer),
                     ("PCA", PCA(n_components = 20)),
                     ("model", logistic_clf)
                     ]
                   )
   
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print("Model: {m:s} (t={r:d}) avg cross val score={s:3.4f}\n".format(m=name, r=r, s=scores.mean()) )

    # Fit the model
    _= pipe.fit(X_train, y_train)
                    
    # Out of sample prediction 
    y_pred = pipe.predict(X)
    
 
    predictions = y_pred
    
    ### END SOLUTION
    
    return predictions



# Check your work: predict and evaluate metrics on *your* test examples
- Test whether your implementation of `MyModel` works
- See the metrics  your model produces

In [17]:
# Predict the data using X_test
y_test_pred = MyModel(X_test)

# Get the accuracy, recall and precision of your model
accuracy_test = accuracy_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred, pos_label=1, average="binary")
precision_test = precision_score(y_test,   y_test_pred, pos_label=1, average="binary")

print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m="Logistic",
                                                                            a=accuracy_test,
                                                                            r=recall_test,
                                                                            p=precision_test
                                                                            )
         )

Model: Logistic (t=13) avg cross val score=0.8270

	Logistic Accuracy: 81.1%, Recall 62.5%, Precision 24.8%


In [None]:
# Check accuracy
assert(accuracy_test > 0.75)

In [None]:
# Check recall and precision
assert( ( (recall_test  > 0.50) and (precision_test > 0.15) )
       or
        ( (recall_test  > 0.20) and (precision_test > 0.50) )
      )

In [None]:
# Extra points
assert(accuracy_test > .80)

In [None]:
# Extra points
assert( ( (recall_test > .60) and (precision_test > 0.20) )
       or
        ( (recall_test  > 0.20) and (precision_test > 0.60) )
      )

# This is how we will evaluate your model on the holdout examples

In [18]:
### BEGIN HIDDEN TESTS
DATA_DIR = './Data'
file_name = 'test_with_labels.csv'
if not os.path.exists(DATA_DIR):
    DATA_DIR = '../resource/asnlib'
y_hold = pd.read_csv(os.path.join(DATA_DIR, file_name))
y_hold_pred = MyModel(X_hold)

# accuracy
accuracy_hold = accuracy_score(y_hold, y_hold_pred)

# recall & precision
recall_hold = recall_score(y_hold, y_hold_pred, pos_label=1, average="binary")
precision_hold = precision_score(y_hold,   y_hold_pred, pos_label=1, average="binary")

# check accuracy
assert(accuracy_hold > 0.75)
### END HIDDEN TESTS

In [19]:
### BEGIN HIDDEN TESTS
assert( ( (recall_hold  > 0.50) and (precision_hold > 0.15) )
       or
        ( (recall_hold  > 0.20) and (precision_hold > 0.50) )
      )
### END HIDDEN TESTS

In [21]:
# Extra points
### BEGIN HIDDEN TESTS
assert(accuracy_hold > .80)
### END HIDDEN TESTS

In [20]:
# Extra points
### BEGIN HIDDEN TESTS
assert( ( (recall_hold > .60) and (precision_hold > 0.20) )
       or
        ( (recall_hold  > 0.20) and (precision_hold > 0.60) )
      )
### END HIDDEN TESTS

# Discussion
- Most of the features are expressed as ratios: why is that a good idea ?
- Even if you don't understand all of the financial concepts behind the names of the attributes
    - You should be able to infer some relationships.  For example, here are some definitions of terms
$$
\begin{array}{lll}
X1   & = & \frac{\text{net profit} }{ \text{total assets} } \\
X9   & = & \frac{\text{sales}     }{ \text{total assets} } \\
X23  & = & \frac{\text{net profit} }{ \text{sales} } \\
\end{array}
$$

    - Therefore
$$
\begin{array}{llll}
X23  & = & \frac{X1}{X9} & \text{Algebra !}
\end{array}
$$

    - You might speculate that `net profit` is closely related to `gross profit`
        - The difference between "net" and "gross" is usually some type of additions/subtractions
    - Is this theory reflected in which features are most highly correlated with `X1` ?
- If you perform dimensionality reduction using PCA (the topic of the Unsupervised Learning lecture)
    - PCA is scale sensitive
    - If you *don't* scale the features: how many do you need to capture 95% of the variance ?
    - If you *do* scale the features: how many do you need to capture 95% of the variance ?
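
One way to explore the PCA questions above is sketched below on synthetic data (not the bankruptcy features). The helper name `n_components_for` and the toy feature scales are assumptions for illustration. One feature is given a much larger scale than the rest, so without standardization it dominates the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def n_components_for(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # searchsorted returns the first index whose cumulative ratio >= target
    return int(np.searchsorted(cumulative, target) + 1)


rng = np.random.default_rng(0)

# Five independent features; the first has a scale 1000 times the others
X_toy = rng.normal(size=(2000, 5)) * np.array([1000.0, 1.0, 1.0, 1.0, 1.0])

n_raw = n_components_for(X_toy)
n_scaled = n_components_for(StandardScaler().fit_transform(X_toy))

print("components for 95% variance, unscaled:", n_raw)
print("components for 95% variance, scaled:  ", n_scaled)
```

In this toy setup the unscaled data needs far fewer components than the scaled data, because a single large-scale feature accounts for almost all of the raw variance. The same function can be applied to the imputed training features to answer the two questions for the real dataset.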