# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.
- Each row of data corresponds to a single company
- There are 64 attributes, described in the section below
- The column `Bankrupt` is 1 if the company subsequently went bankrupt; 0 if it did not go bankrupt
- The column `Id` is a Company Identifier

## Goal

## Learning objectives

- Demonstrate mastery in solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- There will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skills.

# Import modules

In [1]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline


# API for students

In [2]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

# Get the data

The first step in our Recipe is Get the Data.


In [3]:
# Data directory
DATA_DIR = "./Data"

if not os.path.isdir(DATA_DIR):
    DATA_DIR = "../resource/asnlib/publicdata/bankruptcy/data"

data_file = "5th_yr.csv"
data = pd.read_csv( os.path.join(DATA_DIR, "train", data_file) )

target_attr = "Bankrupt"

n_samples, n_attrs = data.shape
print("Data shape: ", data.shape)

Data shape:  (4818, 66)


## Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [4]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X57,X58,X59,X60,X61,X62,X63,X64,Bankrupt,Id
0,0.025417,0.41769,0.0568,1.1605,-126.39,0.41355,0.025417,1.2395,1.165,0.51773,...,0.049094,0.85835,0.12322,5.6167,7.4042,164.31,2.2214,1.334,0,4510
1,-0.023834,0.2101,0.50839,4.2374,22.034,0.058412,-0.027621,3.6579,0.98183,0.76855,...,-0.031011,1.0185,0.069047,5.7996,7.7529,26.446,13.802,6.4782,0,3537
2,0.030515,0.44606,0.19569,1.565,35.766,0.28196,0.039264,0.88456,1.0526,0.39457,...,0.077337,0.95006,0.25266,15.049,2.8179,104.73,3.4852,2.6361,0,3920
3,0.052318,0.056366,0.54562,10.68,438.2,0.13649,0.058164,10.853,1.0279,0.61173,...,0.085524,0.97282,0.0,6.0157,7.4626,48.756,7.4863,1.0602,0,1806
4,0.000992,0.49712,0.12316,1.3036,-71.398,0.0,0.001007,1.0116,1.2921,0.50288,...,0.001974,0.99925,0.019736,3.4819,8.582,114.58,3.1854,2.742,0,1529


Pretty *unhelpful* !

What are these mysteriously named features ?

## Description of attributes

This may still be somewhat unhelpful for those of you not used to reading Financial Statements.

But that's partially the point of the exercise
- You can *still* perform Machine Learning *even if you are not an expert in the problem domain*
    - That's what makes this a good interview exercise: you can demonstrate your thought process even if you don't know the exact meaning of the terms
- Of course: becoming an expert in the domain *will improve* your ability to create better models
    - Feature engineering is easier if you understand the features, their inter-relationships, and the relationship to the target

Let's get a feel for the data
- What is the type of each attribute ?


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
X1          4818 non-null object
X2          4818 non-null object
X3          4818 non-null object
X4          4818 non-null object
X5          4818 non-null object
X6          4818 non-null object
X7          4818 non-null object
X8          4818 non-null object
X9          4818 non-null float64
X10         4818 non-null object
X11         4818 non-null object
X12         4818 non-null object
X13         4818 non-null float64
X14         4818 non-null object
X15         4818 non-null object
X16         4818 non-null object
X17         4818 non-null object
X18         4818 non-null object
X19         4818 non-null float64
X20         4818 non-null float64
X21         4818 non-null object
X22         4818 non-null object
X23         4818 non-null float64
X24         4818 non-null object
X25         4818 non-null object
X26         4818 non-null object
X27         4818 non-null obje

You may be puzzled:
- Most attributes are `object` and *not* numeric (`float64`)
- But looking at the data via `data.head()` certainly gives the impression that all attributes are numeric

Welcome to the world of messy data !  The dataset has represented numbers as strings.
- These little unexpected challenges are common in the real world
- Data is rarely perfect and clean

So we will first have to convert all attributes to numeric

**Question**

Create an all-numeric version of the data.  Assign it to the variable `data` (replacing the original)

**Hint**
- Look up the Pandas method `to_numeric`
    - We suggest you use the option `errors='coerce'`
    

In [6]:
### BEGIN SOLUTION
non_numeric_cols = data.select_dtypes(exclude=['float', 'int']).columns
data[ non_numeric_cols] = data[ non_numeric_cols ].apply(pd.to_numeric, downcast='float', errors='coerce')
### END SOLUTION

Let's look at the data again, now that it is numeric

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
X1          4816 non-null float32
X2          4816 non-null float32
X3          4816 non-null float32
X4          4803 non-null float32
X5          4808 non-null float32
X6          4816 non-null float32
X7          4816 non-null float32
X8          4804 non-null float32
X9          4818 non-null float64
X10         4816 non-null float32
X11         4816 non-null float32
X12         4803 non-null float32
X13         4818 non-null float64
X14         4816 non-null float32
X15         4812 non-null float32
X16         4804 non-null float32
X17         4804 non-null float32
X18         4816 non-null float32
X19         4818 non-null float64
X20         4818 non-null float64
X21         4744 non-null float32
X22         4816 non-null float32
X23         4818 non-null float64
X24         4702 non-null float32
X25         4816 non-null float32
X26         4804 non-null float32
X27      

Hopefully you will see that all the attributes are now numeric.

Surprise !

Looks like there are some examples with undefined values for some features !
- Why didn't we see this when the data was not encoded as numbers ?
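One way to see why the gaps only show up after conversion is a toy example. This is a minimal sketch on made-up data, not the real file; the placeholder string `"?"` is an assumption, and the actual dataset may mark missing values differently.

```python
import pandas as pd

# Toy column stored as strings; "?" is a hypothetical missing-value marker
toy = pd.DataFrame({"X1": ["0.5", "?", "1.25"]})

# As strings, every cell counts as non-null
n_null_before = int(toy["X1"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(toy["X1"], errors="coerce")
n_null_after = int(converted.isnull().sum())

print(n_null_before, n_null_after)
```

A string placeholder is a perfectly valid `object` value, so `info()` reports it as non-null until the column is coerced to a numeric type.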



**Questions**

List all the attributes of `data` that have a missing value in at least one example.

Set the variable `attrs_missing` to a list or array of these attributes.

In [8]:
### BEGIN SOLUTION

num_examples = data.shape[0]
num_examples_undefined = data.isnull().sum(axis=0)
attrs_missing = num_examples_undefined[ num_examples_undefined > 0 ].index.tolist()
### END SOLUTION

So it looks like you will have to deal with missing data at some point.

We won't do this just now; you will need to address the issue yourself later.
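As one possible starting point (not the required approach), missing values can be filled with a per-column statistic. The sketch below uses `SimpleImputer` from `sklearn` with the median on a small made-up matrix; the choice of median is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing entry in each column
X_toy = np.array([[1.0, np.nan],
                  [2.0, 10.0],
                  [np.nan, 30.0],
                  [6.0, 20.0]])

# Learn each column's median from the non-missing values, then fill the gaps
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
```

In a real workflow the imputer would be fit on the training examples only and then applied, unchanged, to the test and holdout examples.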

But you will hopefully see that our target (`Bankrupt`) is not missing in any example

In [9]:
assert target_attr not in attrs_missing

The label/target is included in this dataset
- It is the attribute `Bankrupt`
- Let's separate it from the feature attributes so we don't accidentally train the model with a feature that **is** the target !

In [10]:
data, labels = data.drop(columns=[target_attr]), data[target_attr]
print("Data shape: ", data.shape)

Data shape:  (4818, 65)


We will shuffle the examples before doing anything else.

In [11]:
# Shuffle the data first
data, labels = sklearn.utils.shuffle(data, labels, random_state=42)

print("Labels shape: ", labels.shape)
print("Label values: ", np.unique(labels))


Labels shape:  (4818,)
Label values:  [0 1]


We will evaluate your submission on a test dataset that we provide
- It has no labels, so **you** can't use it to evaluate your model, but **we** have the labels
- We will call this evaluation dataset the "holdout" data

Let's get it

In [12]:
holdout_data = pd.read_csv( os.path.join(DATA_DIR, "holdout", '5th_yr.csv') )

print("Data shape: ", holdout_data.shape)


Data shape:  (1092, 65)


## Create a test set

To train and evaluate a model, we need to split the original dataset into
a training subset (in-sample) and a test subset (out of sample).

Although **we** are the only ones with the holdout dataset, you probably want
to perform out of sample evaluation of your model.

**Question**

Split the data 
- 90% will be used for training the model
- 10% will be used as validation (out of sample) examples
- Use `train_test_split()` from `sklearn` to perform this split
    -  Set the `random_state` parameter of `train_test_split()` to be 42


In [13]:
# Split data into train and test
# Create variables X_train, X_test, y_train, y_test
#   X_train: training examples
#   y_train: labels of the training examples
#   X_test:  test examples
#   y_test:  labels of test examples

### BEGIN SOLUTION
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)

### END SOLUTION

In [14]:
X_train.shape
X_test.shape

(4336, 65)

(482, 65)

# Exploratory Data Analysis

You may want to analyze potential relationships
- Between features and the target
- Between pairs/groups of features

We'll make some suggestions but, ultimately, it is up to you.

**Remember**

Base your analysis on `X_train`, don't peek at your out of sample data !

## Features correlated with the target

**Question**

List the 5 features whose correlations with the target are largest (most positive).


Set variable `corr_features`
- To be a list or array with the names (e.g., `X3`) of the 5 features
- Most highly correlated with `Bankrupt`
- In *descending order*

**Hint**
- Look up the Pandas `corr` method
- Look up the Pandas `sort_values`

In [15]:
### BEGIN SOLUTION

# Put target back with data to facilitate correlation
df = X_train.copy()
df[ target_attr ] = y_train
corr_matrix = df.corr()

target_corr = corr_matrix['Bankrupt'].sort_values(ascending = False)
corr_features = target_corr.index[ 1:6 ].tolist()

### END SOLUTION

In [16]:
print("Features most correlated with target: ", corr_features)

Features most correlated with target:  ['X2', 'X51', 'X32', 'X9', 'X36']


## Mutually correlated features

When you have a lot of features, you might discover that some of them convey little additional information
- One feature of a highly correlated pair is largely redundant given the other
- A small number of features may adequately represent the whole
    - In the Unsupervised Learning lecture, we will learn about PCA, a way to discover a small set of synthetic features that capture the whole
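To make the idea of redundant pairs concrete, here is a minimal sketch that lists feature pairs whose absolute correlation exceeds a cutoff. The data is synthetic, and the column names `A`, `B`, `C` and the cutoff of 0.95 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + 1.0,         # exact linear function of A
    "C": rng.normal(size=200),  # independent noise
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (assumed) cutoff are candidates for dropping one member
high_pairs = pairs[pairs > 0.95].index.tolist()
print(high_pairs)
```

The same pattern could be applied to `X_train` to find candidate features to drop or combine.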

**Question**

- List the 5 features whose correlations with `X1` are largest (most positive).
    - Set variable `X1_corr_p`
    - To be a list or array with the names (e.g., `X3`) of the 5 features
    - Most highly correlated
    - In *descending order*
    
- List the 5 features whose correlations with `X1` are *most negative*.
    - Set variable `X1_corr_n`
    - To be a list or array with the names (e.g., `X3`) of the 5 features
    - Most highly *negatively* correlated
    - In *ascending order* (most negative first)

In [17]:
### BEGIN SOLUTION

# Put target back with data to facilitate correlation
df = X_train.copy()
df[ target_attr ] = y_train
corr_matrix = df.corr()

X1_corr = corr_matrix['X1'].sort_values(ascending = False)

X1_corr_p = X1_corr.index[ 1: 6].tolist()
X1_corr_n = X1_corr.index[ -1: - 6 : -1 ].tolist()


### END SOLUTION

In [18]:
print("Features most positively correlated with X1", X1_corr_p)
print("Features most negatively correlated with X1", X1_corr_n)

Features most positively correlated with X1 ['X7', 'X14', 'X11', 'X22', 'X35']
Features most negatively correlated with X1 ['X36', 'X38', 'X10', 'X25', 'X53']


# Imbalanced data

We have a binary classification problem.

Do we have roughly the same number of examples associated with each of the two targets ?

**Question**

How many training examples do we have that became Bankrupt ?
- Set variable `num_bankrupt` to this value

How many training examples do we have that *did not become* Bankrupt ?
- Set variable `num_nonbankrupt` to this value

In [None]:
num_examples = X_train.shape[0]

### BEGIN SOLUTION
bankrupt = X_train[ y_train == 1 ] 
nonbankrupt = X_train[ y_train == 0 ]

num_bankrupt    = bankrupt.shape[0]
num_nonbankrupt = nonbankrupt.shape[0]

### END SOLUTION

In [None]:
print("Of the {t:d} total examples: {b:d} became bankrupt and {nb:d} did not become bankrupt".format(
    t=num_examples, b=num_bankrupt, nb=num_nonbankrupt))

In [None]:
assert( num_bankrupt + num_nonbankrupt == num_examples )

This dataset is highly imbalanced: many more examples of one class than the other.

Why might this be a problem ?
    

Consider a naive model that ignores the features and always predicts the *most frequent* value of the target.

Assuming the out of sample data has the same distribution as the training data:
- We will have perfect conditional accuracy for the examples with target in the majority class
- We will have zero conditional accuracy for the examples with target in the non-majority class
- Because the number of examples in the majority class is so much larger:
    - We might get good unconditional accuracy
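The naive model described above can be sketched with `DummyClassifier` from `sklearn`. The 95/5 class split below is a made-up illustration, not the ratio in the bankruptcy data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 95 majority (0) and 5 minority (1); features are ignored
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((100, 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)  # fraction of all examples predicted correctly
rec = recall_score(y_toy, y_pred)    # fraction of class-1 examples that were found

print(acc, rec)
```

Unconditional accuracy looks strong while the model never identifies a single minority-class example, which is why accuracy alone is a poor guide here.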

Recall our lecture on Recall and Precision.

These are metrics that will help us evaluate our model's ability to correctly predict Bankruptcy.

We think that you will find that your model may have
- High Accuracy
- Low Recall

There are several ways for you to deal with imbalanced data
- Class sensitive weights
    - Many models in `sklearn` take an optional argument `class_weight`
    - For each target class: you can assign a weight
    - The Loss will be computed on a class-weighted basis
    - You can choose weights that increase the influence of the non-majority class
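A minimal sketch of class weighting, assuming a `LogisticRegression` model and the built-in `"balanced"` heuristic (both are illustrative choices, not a recommendation). The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 90 of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)

# Synthetic features: class 1 is shifted by +1 in both dimensions
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2)) + y_toy[:, None]

# The same heuristic can be passed directly to the model
clf = LogisticRegression(class_weight="balanced").fit(X_toy, y_toy)

print(weights)
```

With these counts the minority class receives a weight of 5.0 versus roughly 0.56 for the majority class, so each minority error costs about nine times as much in the loss.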

Another way is re-sampling the training set
- Expand the number of training examples
- By increasing the number of examples of the non-majority class
    - Randomly sample examples in the non-majority class
    - So you will have duplicates
- This creates a more balanced dataset on which to train
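The re-sampling idea can be sketched with `sklearn.utils.resample`. The toy frame, the feature name `f`, and the choice to match the majority count exactly are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training set: 8 majority examples (0) and 2 minority examples (1)
X_toy = pd.DataFrame({"f": np.arange(10, dtype=float)})
y_toy = pd.Series([0] * 8 + [1] * 2)

majority = X_toy[y_toy == 0]
minority = X_toy[y_toy == 1]

# Sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_bal = pd.concat([majority, minority_up])
y_bal = pd.Series([0] * len(majority) + [1] * len(minority_up),
                  index=X_bal.index)

print(y_bal.value_counts())
```

Only the training examples should be re-sampled; the test and holdout examples keep their original class distribution so that the metrics remain meaningful.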

These are just some ideas for you to achieve a model with better
conditional metrics.

# Your model

Time for you to continue the Recipe for Machine Learning on your own.

Follow the steps and submit your *best* model.

For your best model, using the test set you created, report
- Accuracy 
- Recall
- Precision
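The three metrics can be computed with functions from `sklearn.metrics`. The sketch below uses small made-up label vectors in place of `y_test` and your model's predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions (1 = Bankrupt)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y_hat  = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

acc  = accuracy_score(y_true, y_hat)    # (TP + TN) / all examples
rec  = recall_score(y_true, y_hat)      # TP / (TP + FN)
prec = precision_score(y_true, y_hat)   # TP / (TP + FP)

print(acc, rec, prec)
```

Here there is 1 true positive, 2 false negatives, and 1 false positive, so accuracy is 0.7, recall is 1/3, and precision is 1/2.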

We will evaluate your model using the holdout data.  Grades will be based on
the following metrics meeting certain thresholds
- Accuracy
- Recall
- Precision

We will evaluate each metric against 3 increasing threshold values
- You will get points for each threshold that you surpass

## Submission guidelines

- You will implement the body of a subroutine `MyModel`
    - That takes as its argument the name of a CSV file containing the holdout set
    - Performs predictions on each example in the holdout set
    - Returns an array of predictions with a one-to-one correspondence with the examples in the holdout set
    
- You will call the subroutine, passing the name of the holdout file that we will supply.


In [None]:
TEST_PATH = "./data/midterm_project/bankruptcy/holdout"

import pandas as pd
import os

testFileName = os.path.join(TEST_PATH, data_file)

def MyModel(fileName=None):
    print("Test file: ", fileName)
    
    # It should create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
    # YOUR CODE GOES HERE
    
    
    return predictions

predicts = MyModel(fileName=testFileName)


**Remember**

The holdout file is in the same format as the one we used for training
- Except that it has no attribute for the target
- So you will need to perform all the transformations on the holdout data
    - As you did on the training data
    - Including turning the string representation of numbers into actual numeric data types

All of this work *must* be performed within the body of the `MyModel` routine you will write
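As a rough illustration of the string-to-numeric step applied to a file, here is a sketch of a helper that could be called from inside `MyModel`. The name `load_numeric`, the in-memory stand-in file, and the `"?"` placeholder are all hypothetical.

```python
import io
import pandas as pd

def load_numeric(csv_source):
    """Hypothetical helper: read a CSV and coerce every column to numeric."""
    raw = pd.read_csv(csv_source)
    # Unparseable strings become NaN, exactly as for the training data
    return raw.apply(pd.to_numeric, errors="coerce")

# Stand-in for a holdout file; "?" is an assumed missing-value marker
fake_csv = io.StringIO("X1,X2,Id\n0.1,?,7\n0.2,3.5,8\n")
frame = load_numeric(fake_csv)

print(frame.dtypes)
```

Any later steps (for example imputation or scaling) would reuse the objects fitted on the training data rather than being refit on the holdout data.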

We will grade you by comparing the `predicts` array you create to the answers known to us.

# Discussion
- Most of the features are expressed as ratios: why is that a good idea ?
- Even if you don't understand all of the financial concepts behind the names of the attributes
    - You should be able to infer some relationships
$$
\begin{array}{llll}
X1   & = & \frac{\text{net profit} }{ \text{total assets} } \\
X9   & = & \frac{\text{sales}     }{ \text{total assets} } \\
X23  & = & \frac{\text{net profit} }{ \text{sales} } \\
X23  & = & \frac{X1}{X9} & \text{Algebra !}
\end{array}
$$
    - You might speculate that `net profit` is closely related to `gross profit`
        - The difference between "net" and "gross" is usually some type of additions/subtractions
    - Is this theory reflected in which features are most highly correlated with `X1` ?
- If you perform dimensionality reduction using PCA (the topic of the Unsupervised Learning lecture)
    - PCA is scale sensitive
    - If you *don't* scale the features: how many do you need to capture 95% of the variance ?
    - If you *do* scale the features: how many do you need to capture 95% of the variance ?
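The scale sensitivity of PCA can be seen on synthetic data. This sketch uses three independent made-up features on very different scales; the helper name `n_components_for` is hypothetical, and the counts it produces say nothing about the bankruptcy data itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent synthetic features with standard deviations 1, 10, 1000
X_toy = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 1000.0])

def n_components_for(X, target=0.95):
    """Smallest number of components whose cumulative explained variance reaches target."""
    cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    return int(np.searchsorted(cum, target) + 1)

n_raw = n_components_for(X_toy)
n_scaled = n_components_for(StandardScaler().fit_transform(X_toy))

print(n_raw, n_scaled)
```

Without scaling, the single large-scale feature dominates the variance and one component appears to suffice; after standardizing, the three independent features contribute about equally and all three components are needed.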