# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.

Perhaps you are contemplating lending money to a company, and need to know whether the company
is in near-term danger of not being able to repay.

This task is divided into two parts:
- Part 1 is this Assignment 3
- Part 2 will be your final project.


## Goal

In the previous two assignments, we prepared the data for you so that you could focus on model building and evaluation. In this assignment, we want you to work through the first few, very important, steps of solving a machine learning problem yourself.

In this assignment you will prepare the data you need and gather some ideas for your final project.

## Learning objectives

- Demonstrate mastery of solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- We will make suggestions for ways to approach the problem
    - But there will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skill

# Import modules

In [None]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline


In [None]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%reload_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

# API for students

We have defined some utility routines in a file `bankruptcy_helper.py`. There is a class named `Helper` in it.  

This will simplify problem solving



`helper = bankruptcy_helper.Helper()`

- getData: get the training data and the holdout data (the holdout data has no labels)
  > `train, holdout = helper.getData()`

- plot_attr: plot the distribution of one feature `attr` from DataFrame `X`, conditional on the value of the associated target value `y`
  > `X`: DataFrame, features            
  > `y`: DataFrame/ndarray, labels       
  > `attr`: string, condition feature        
  > `trunc`: scalar number, fraction of outliers you want to remove (e.g., `0.01`)  
  
  >`helper.plot_attr(X, y, attr, trunc)`       

- save_data: save the training and test data into a folder named "my_data"
  > `helper.save_data(X_train, X_test, y_train, y_test)`
 
- load_data: load the training and test data from a folder named "my_data"
  > `X_train, X_test, y_train, y_test = helper.load_data()`

# Get the data

The first step in our Recipe is Get the Data. 

There are two datasets in this assignment, which are stored in two different directories.
- training data: includes all features and labels, stored as `train/5th_yr.csv`
- holdout data: includes only features, **no labels**; used to test your model's performance. We will grade your final project based on the labels you predict. Stored as `holdout/5th_yr.csv`


For the training data
- Each example is a row of data corresponding to a single company
- There are 64 attributes, described in the section below
- The column `Bankrupt` is 1 if the company subsequently went bankrupt; 0 if it did not go bankrupt
- The column `Id` is a Company Identifier

The holdout data is similar to the training data. The only difference is that it has no target attribute; you need to predict it.

In [None]:
# Get the data
data, holdout = helper.getData()

target_attr = "Bankrupt" # target attribute in training data, 1 for bankrupt and 0 for not bankrupt

n_samples, n_attrs = data.shape
print("Data shape: ", data.shape)

## Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [None]:
# training data
data.head()

In [None]:
# holdout data
holdout.head()

Pretty *unhelpful* !

What are these mysteriously named features ?

## Description of attributes

This may still be somewhat unhelpful for those of you not used to reading Financial Statements.

But that's partially the point of the exercise
- You can *still* perform Machine Learning *even if* you are not an expert in the problem domain
    - That's what makes this a good interview exercise: you can demonstrate your thought process even if you don't know the exact meaning of the terms
- Of course: becoming an expert in the domain *will improve* your ability to create better models
    - Feature engineering is easier if you understand the features, their inter-relationships, and the relationship to the target

Let's get a feel for the data
- What is the type of each attribute ?


In [None]:
data.info()

You may be puzzled:
- Most attributes are `object` and *not* numeric (`float64`)
- But looking at the data via `data.head()` certainly gives the impression that all attributes are numeric

Welcome to the world of messy data !  The dataset has represented numbers as strings.
- These little unexpected challenges are common in the real world
- Data is rarely perfect and clean

So we will first have to convert all attributes to numeric

**Question:**

Create an all-numeric version of the data.  Assign it to the variable `data` (replacing the original)

**Hint:**
- Look up the Pandas method `to_numeric`
    - We suggest you use the option `errors='coerce'`
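
As a minimal sketch of what `errors='coerce'` does, here is a toy DataFrame with invented values. The `"?"` placeholder is an assumption made for illustration only; the assignment data may mark unparseable entries differently.

```python
import pandas as pd

# Toy data (hypothetical values): numbers stored as strings,
# with "?" standing in for an entry that cannot be parsed
toy = pd.DataFrame({"X1": ["0.5", "1.25", "?"], "X2": ["3", "?", "7"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

print(toy_numeric.dtypes)        # dtype of each column after conversion
print(toy_numeric.isna().sum())  # number of NaN entries per column
```

Entries that were ordinary strings before conversion only show up as undefined values once they have been coerced to `NaN`.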
    

In [None]:
### BEGIN SOLUTION
non_numeric_cols = data.select_dtypes(exclude=['float', 'int']).columns
data[ non_numeric_cols] = data[ non_numeric_cols ].apply(pd.to_numeric, downcast='float', errors='coerce')
### END SOLUTION

In [None]:
### BEGIN HIDDEN TESTS
# `in` on a Series checks its index (the column names), so compare the dtype values instead
assert not (data.dtypes == object).any()
### END HIDDEN TESTS

Let's look at the data again, now that it is numeric

In [None]:
data.info()

Hopefully you will see that all the attributes are now numeric.

Surprise !

Looks like there are some examples with undefined values for some features !
- Why didn't we see this when the data was not encoded as numbers ?



**Question:**

List all the attributes of `data` that are missing from at least one example.
- Set `attrs_missing` to a list or array of the attributes that are missing from at least one example.

In [None]:
# Set variable
#  attrs_missing: list or array, attributes that are missing from at least one example
attrs_missing = None

### BEGIN SOLUTION
num_examples = data.shape[0]
num_examples_undefined = data.isnull().sum(axis=0)
attrs_missing = num_examples_undefined[ num_examples_undefined > 0 ].index.tolist()
### END SOLUTION

print("Attributes with values missing for at least some examples\t:\n\t" + "\n\t".join(attrs_missing))

In [None]:
### BEGIN HIDDEN TESTS
tmp = data.isna().sum()
attrs_missing_test = tmp[tmp>0].index.tolist()
assert set(attrs_missing_test) == set(attrs_missing)
### END HIDDEN TESTS

So it looks like you will have to deal with missing data at some point.

We won't do this just now; you will need to address the issue yourself later.

But you will hopefully see that our target (`Bankrupt`) is not missing in any example

In [None]:
# Check that the target is not missing from any example
assert target_attr not in set(attrs_missing)

The label/target is included in this dataset
- It is the attribute `Bankrupt`
- Let's separate it from the feature attributes so we don't accidentally train the model with a feature that **is** the target !

In [None]:
data, labels = data.drop(columns=[target_attr]), data[target_attr]
print("Data shape: ", data.shape)

We will shuffle the examples before doing anything else.

In [None]:
# Shuffle the data first
data, labels = sklearn.utils.shuffle(data, labels, random_state=42)

print("Labels shape: ", labels.shape)
print("Label values: ", np.unique(labels))


## Create a test set 

To train and evaluate a model, we need to split the original dataset into
a training subset (in-sample) and a test subset (out of sample).

Although **we** are the only ones with the holdout dataset, you probably want
to perform out of sample evaluation of your model.

**Question:**

Split the data 
- Set `X_train`, `X_test`, `y_train` and `y_test` to match the description in the comment
- 90% will be used for training the model
- 10% will be held out as test (out of sample) examples
- Use `train_test_split()` from `sklearn` to perform this split
    -  Set the `random_state` parameter of `train_test_split()` to be 42


In [None]:
# Split data into train and test
# Create variables X_train, X_test, y_train, y_test
#   X_train: training examples
#   y_train: labels of the training examples
#   X_test:  test examples
#   y_test:  labels of test examples
X_train = None
X_test = None
y_train = None
y_test = None

### BEGIN SOLUTION
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)
### END SOLUTION

print('X_train shape: ', X_train.shape)
print('X_test shape:', X_test.shape)

In [None]:
### BEGIN HIDDEN TESTS
X_train_, X_test_, y_train_, y_test_ = train_test_split(data, labels, test_size=0.10, random_state=42)
assert np.allclose(X_train_, X_train, equal_nan=True)
assert np.allclose(y_train_, y_train, equal_nan=True)
### END HIDDEN TESTS

## Save the data for your final project

In [None]:
# Save X_train, X_test, y_train, y_test for final project
helper.save_data(X_train, X_test, y_train, y_test)

# Exploratory Data Analysis

You may want to analyze potential relationships
- Between features and the target
- Between pairs/groups of features

We'll make some suggestions but, ultimately, it is up to you.

**Warning**

We will perform *our* exploration using the **raw** data
- Thus, there may be features with missing values
- This may affect your analysis
- For example: how is the correlation of 2 features computed when there are missing values ?
- For the purpose of answering the questions: *leave the missing values in place*
- For *your* model: feel free to deal with missing features before doing Exploratory Data Analysis
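
One way to explore the question about correlation and missing values is a small experiment on toy data. The sketch below uses invented values and column names (`a`, `b`). It compares Pandas' `corr` with a correlation computed by hand after dropping the rows where either value is missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry in each column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [2.0, 1.0, np.nan, 5.0, 6.0],
})

# DataFrame.corr excludes NA values pair by pair
corr_pairwise = toy.corr().loc["a", "b"]

# Manual version: keep only the rows where both columns are defined
complete = toy.dropna(subset=["a", "b"])
corr_manual = np.corrcoef(complete["a"], complete["b"])[0, 1]

print(corr_pairwise, corr_manual)
```

Because rows are excluded pair by pair, different entries of a correlation matrix may be based on different subsets of examples.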

**Remember**

- Base your analysis on `X_train`, don't peek at your out of sample data !


## Features correlated with the target

**Question:**

List the 5 features whose correlations with the target are largest (most positive).


- Set variable `corr_features`
    - To be a list or array with the names (e.g., `X3`) of the 5 features
    - Most highly correlated with `Bankrupt`
    - In *descending order*

**Hint:**
- Look up the Pandas `corr` method
- Look up the Pandas `sort_values`

In [None]:
# Set variable
#  corr_features: list or array, 5 features whose correlations with target are largest
corr_features = None

### BEGIN SOLUTION

# Put target back with data to facilitate correlation
df = X_train.copy()
df[ target_attr ] = y_train
corr_matrix = df.corr()

target_corr = corr_matrix['Bankrupt'].sort_values(ascending = False)
corr_features = target_corr.index[ 1:6 ].tolist()

### END SOLUTION

print("Features most correlated with target: ", corr_features)

In [None]:
### BEGIN HIDDEN TESTS
df_test = X_train.copy()
df_test[ target_attr ] = y_train
corr_matrix_test = df_test.corr()

target_corr_test = corr_matrix_test['Bankrupt'].sort_values(ascending = False)
corr_features_test = target_corr_test.index[ 1:6 ].tolist()

assert list(corr_features) == corr_features_test
### END HIDDEN TESTS

## Mutually correlated features

When you have a lot of features, you might discover that some of them convey little information
- Pairs of highly correlated features
- A small number of features that adequately represent the whole
    - In the Unsupervised Learning lecture, we will learn about PCA, a way to discover a small set of synthetic features that capture the whole

**Questions:**

- List the 5 features whose correlations with `X1` are largest (most positive).
    - Set variable `X1_corr_p`
        - To be a list or array with the names (e.g., `X3`) of the 5 features
        - Most highly correlated
        - In *descending order*
    
- List the 5 features whose correlations with `X1` are *most negative*.
    - Set variable `X1_corr_n`
        - To be a list or array with the names (e.g., `X3`) of the 5 features
        - Most highly *negatively* correlated
        - In *ascending order* (most negative first)

In [None]:
# Set variables
#  X1_corr_p: list or array, 5 features whose correlations with X1 are most positive
#  X1_corr_n: list or array, 5 features whose correlations with X1 are most negative
X1_corr_p = None
X1_corr_n = None

### BEGIN SOLUTION
# Put target back with data to facilitate correlation
df = X_train.copy()
df[ target_attr ] = y_train
corr_matrix = df.corr()

X1_corr = corr_matrix['X1'].sort_values(ascending = False)

X1_corr_p = X1_corr.index[ 1: 6].tolist()
X1_corr_n = X1_corr.index[ -1: - 6 : -1 ].tolist()
### END SOLUTION

print("Features most positively correlated with X1", X1_corr_p)
print("Features most negatively correlated with X1", X1_corr_n)

In [None]:
### BEGIN HIDDEN TESTS
X1_corr_test = corr_matrix_test['X1'].sort_values(ascending = False)

X1_corr_p_test = X1_corr_test.index[ 1: 6].tolist()
X1_corr_n_test = X1_corr_test.index[ -1: - 6 : -1 ].tolist()

assert X1_corr_p_test == list(X1_corr_p)
assert X1_corr_n_test == list(X1_corr_n)
### END HIDDEN TESTS

One thing to consider (we saw something similar in the lecture topic on Influential Points)
- Outliers (feature values that are at the extremes of the distribution) can affect correlation
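
A quick synthetic sketch of this effect: the data below is randomly generated (not taken from the assignment), and a single extreme point is appended to two otherwise independent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)   # generated independently of x

corr_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme point far away from the bulk of the data
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
corr_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"without outlier: {corr_clean:.2f}, with outlier: {corr_outlier:.2f}")
```

A single point can dominate the statistic, which is one reason to look at distributions as well as correlation coefficients.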

To illustrate:
- We will show the distribution of one feature, conditional on the value of the associated target value
- Here we overlay two distributions
    - The distribution of the feature value, conditioned on examples having target 0 (colored green)
    - The distribution of the feature value, conditioned on examples having target 1 (colored red)
    - When the two distributions overlap: the color will be a blend



In [None]:
helper.plot_attr(X_train, y_train, "X51", trunc=0)

The above graph is not very informative
- The distributions overlap for the bins chosen
- But there seem to be many bins with very few values (e.g., X51 > 2)

But let's perform the same plot while *eliminating* extreme values of the feature

In [None]:
helper.plot_attr(X_train, y_train, "X51", trunc=.01)

We can now see that
- When the feature value is greater than 1.25
- The associated example indicates the company will go Bankrupt (`Bankrupt` = 1)

Just something to keep in mind in performing your own analysis and building your models
- Is there value in creating a synthetic feature: `X51 > t` for some threshold `t` ?

**Question:**

- Let `t = 1.1`
- Set variable `cond_frac_pos` to the fraction of examples that go Bankrupt for which `X51 > t`
$$
\frac{ \text{count(Bankrupt == 1 and X51 > t)} }{ \text{count(Bankrupt == 1)} }
$$

- Set variable `cond_frac_neg` to the fraction of examples that *do not* go Bankrupt for which `X51 > t`
$$
\frac{ \text{count(Bankrupt == 0 and X51 > t)} }{ \text{count(Bankrupt == 0)} }
$$
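
To make the two ratios concrete, here is a minimal sketch on a toy DataFrame with invented values. It illustrates the arithmetic only. It is not the graded solution, and it ignores how missing values or outliers in the real data are handled.

```python
import pandas as pd

# Hypothetical toy values, not taken from the assignment data
toy = pd.DataFrame({
    "X51":      [0.2, 1.5, 0.9, 1.3, 0.4, 1.2, 0.1, 0.3],
    "Bankrupt": [0,   1,   0,   1,   0,   0,   1,   0],
})
t = 1.1

# Boolean indicator: this is the synthetic feature `X51 > t`
above = toy["X51"] > t

# The mean of a boolean Series is the fraction of True values
frac_pos = above[toy["Bankrupt"] == 1].mean()   # 2 of the 3 Bankrupt examples
frac_neg = above[toy["Bankrupt"] == 0].mean()   # 1 of the 5 non-Bankrupt examples

print(frac_pos, frac_neg)
```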


In [None]:
# Set variables
#  t: scalar number, threshold
#  cond_frac_pos: scalar number, fraction of examples that go bankrupt where X51 > t
#  cond_frac_neg: scalar number, fraction of examples that do not go bankrupt where X51 > t
t = 1.1
cond_frac_pos = None
cond_frac_neg = None

### BEGIN SOLUTION
def cond_attr(df, attr, trunc=.01, thresh=1):
    X = df[attr]
    
    # Remove outliers, to improve clarity
    mask = (X > X.quantile(trunc)) & (X < X.quantile(1-trunc))
    X_trunc, y_trunc = X[ mask  ], y_train[ mask ]
    
    # Condition on value of target and thresh
    cp = X_trunc[ (y_trunc == 1) & (X_trunc > thresh) ].size/X_trunc[ y_trunc == 1].size
    cn = X_trunc[ (y_trunc == 0) & (X_trunc > thresh) ].size/X_trunc[ y_trunc == 0].size
      
    return cp, cn

attr = "X51"
trunc = 0
cond_frac_pos, cond_frac_neg = cond_attr(X_train, attr, trunc=trunc, thresh=t)
### END SOLUTION

print("The fraction of training examples that go Bankrupt, with ({attr:s} > {t:2.2f}) is {frac:3.1%}".format(attr=attr, 
                                                                                        t=t,
                                                                                        frac=cond_frac_pos)
     )

print("The fraction of training examples that DO NOT go Bankrupt, with ({attr:s} > {t:2.2f}) is {frac:3.1%}".format(attr=attr, 
                                                                                        t=t,
                                                                                        frac=cond_frac_neg)
     )


In [None]:
### BEGIN HIDDEN TESTS
def cond_attr_test(df, attr, trunc=.01, thresh=1):
    X = df[attr]
    
    # Remove outliers, to improve clarity
    mask = (X > X.quantile(trunc)) & (X < X.quantile(1-trunc))
    X_trunc, y_trunc = X[ mask  ], y_train[ mask ]
    
    # Condition on value of target and thresh
    cp = X_trunc[ (y_trunc == 1) & (X_trunc > thresh) ].size/X_trunc[ y_trunc == 1].size
    cn = X_trunc[ (y_trunc == 0) & (X_trunc > thresh) ].size/X_trunc[ y_trunc == 0].size
      
    return cp, cn

cond_frac_pos_test, cond_frac_neg_test = cond_attr_test(X_train, 'X51', trunc=0, thresh=1.1)

assert np.allclose(cond_frac_pos_test, cond_frac_pos)
assert np.allclose(cond_frac_neg_test, cond_frac_neg)
### END HIDDEN TESTS

It seems that we can discover a large fraction of examples that go Bankrupt by examining 
one feature and threshold.

But using this alone will result in some number of False Positives (non-Bankrupt examples)
- And although that percentage is small, we will see that the non-Bankrupt examples are far more numerous, so the absolute number of False Positives can still be large

# Imbalanced data

We have a binary classification problem.

Do we have roughly the same number of examples associated with each of the two targets ?

**Question:**

How many training examples do we have that became Bankrupt ?
- Set variable `num_bankrupt` to this value

How many training examples do we have that *did not become* Bankrupt ?
- Set variable `num_nonbankrupt` to this value

In [None]:
# Set variables
#  num_examples: scalar number, number of examples in the training dataset
#  num_bankrupt: scalar number, number of examples that became bankrupt
#  num_nonbankrupt: scalar number, number of examples that did not become bankrupt
num_examples = X_train.shape[0]
num_bankrupt = None
num_nonbankrupt = None

### BEGIN SOLUTION
bankrupt = X_train[ y_train == 1 ] 
nonbankrupt = X_train[ y_train == 0 ]

num_bankrupt    = bankrupt.shape[0]
num_nonbankrupt = nonbankrupt.shape[0]

### END SOLUTION

print("Of the {t:d} total examples: {b:d} became bankrupt and {nb:d} did not become bankrupt".format(t=num_examples,
                                                                                                    b=num_bankrupt,
                                                                                                    nb=num_nonbankrupt)

     )

In [None]:
### BEGIN HIDDEN TESTS
bankrupt_test = X_train[ y_train == 1 ] 
nonbankrupt_test = X_train[ y_train == 0 ]

num_bankrupt_test = bankrupt_test.shape[0]
num_nonbankrupt_test = nonbankrupt_test.shape[0]

assert num_bankrupt == num_bankrupt_test
assert num_nonbankrupt == num_nonbankrupt_test
### END HIDDEN TESTS

This dataset is highly imbalanced: many more examples of one class than the other.

Why might this be a problem ?
    

Consider a naive model that ignores the features and always predicts the *most frequent* value of the target.

Assuming the out of sample data has the same distribution as the training data:
- We will have perfect conditional accuracy for the examples with target in the majority class
- We will have zero conditional accuracy for the examples with target in the non-majority class
- Because the number of examples in the majority class is so much larger:
    - We might get good unconditional accuracy

Recall our lecture on Recall and Precision.

These are metrics that will help us evaluate our model's ability to correctly predict Bankruptcy.

We think that you will find that your model may have
- High Accuracy
- Low Recall
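
The naive model described above can be sketched with `sklearn`'s `DummyClassifier`. The labels below are synthetic, with an assumed 5% positive rate chosen only for illustration; the actual class balance of the assignment data may differ.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 50 positives, 950 negatives (assumed rate, for illustration)
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))   # the naive model ignores the features

# Always predict the most frequent class seen during fit
naive = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = naive.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)   # recall for the positive class

print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Accuracy looks respectable even though the model never identifies a single positive example, which is why conditional metrics matter here.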

There are several ways for you to deal with imbalanced data
- Class sensitive weights
    - Many models in `sklearn` take an optional argument `class_weight`
    - For each target class: you can assign a weight
    - The Loss will be computed on a class-weighted basis
    - You can choose weights that increase the influence of the non-majority class

Another way is re-sampling the training set
- Expand the number of training examples
- By increasing the number of examples of the non-majority class
    - Randomly sample examples in the non-majority class
    - So you will have duplicates
- This creates a more balanced dataset on which to train
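
Random over-sampling of this kind can be sketched with `sklearn.utils.resample`. The DataFrame below is a toy stand-in with invented values. In practice you would probably apply this only to your training examples, so that the test examples keep their original distribution.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for a training set: 5 positives, 95 negatives (hypothetical values)
toy = pd.DataFrame({"X1": np.arange(100, dtype=float),
                    "Bankrupt": [1] * 5 + [0] * 95})

minority = toy[toy["Bankrupt"] == 1]
majority = toy[toy["Bankrupt"] == 0]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

# Recombine and shuffle the rows
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
counts = balanced["Bankrupt"].value_counts()

print(counts)
```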

These are just some ideas for you to achieve a model with better
conditional metrics.

## Now submit your assignment!

So far, you have prepared the data you need and generated some ideas about how the features are correlated and how to handle imbalanced labels. Next, in your final project, you will build your own Machine Learning model and evaluate it.

Please click on the blue button <span style="color: blue;"> **Submit** </span> in this notebook. 