# Isolate Binary Outcomes

This notebook isolates the loans whose `loan_status` is `Fully Paid` or `Charged Off`, which turns the task into a binary classification problem.

### Load Packages

In [1]:
import pandas as pd
pd.options.display.max_rows = 200

### Read-In Data

In [2]:
df_training = pd.read_csv("data_training_testing/lending_club_training.csv")
df_testing = pd.read_csv("data_training_testing/lending_club_testing.csv")

In [3]:
df_training.head(1).T

Unnamed: 0,0
loan_status,Current
funded_amnt,1275.0
addr_state,OH
annual_inc,50000.0
application_type,Individual
dti,9.53
earliest_cr_line,Aug-1997
emp_length,10+ years
emp_title,Teacher
fico_range_high,669.0


### Explore Data

In [4]:
df_training['loan_status'].value_counts()

loan_status
Fully Paid                                             753863
Current                                                614398
Charged Off                                            188233
Late (31-120 days)                                      15111
In Grace Period                                          5873
Late (16-30 days)                                        3080
Does not meet the credit policy. Status:Fully Paid       1360
Does not meet the credit policy. Status:Charged Off       522
Default                                                    28
Name: count, dtype: int64

In [5]:
df_testing['loan_status'].value_counts()

loan_status
Fully Paid                                             322888
Current                                                263919
Charged Off                                             80326
Late (31-120 days)                                       6356
In Grace Period                                          2563
Late (16-30 days)                                        1269
Does not meet the credit policy. Status:Fully Paid        628
Does not meet the credit policy. Status:Charged Off       239
Default                                                    12
Name: count, dtype: int64
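
The counts above already show how large the binary dataset will be and how the two classes split. As a rough sketch, the arithmetic below uses only the `Fully Paid` and `Charged Off` counts copied from the two outputs; the dictionary and variable names are illustrative and not part of the notebook.

```python
# Counts copied from the value_counts() outputs above.
counts = {
    "training": {"Fully Paid": 753863, "Charged Off": 188233},
    "testing": {"Fully Paid": 322888, "Charged Off": 80326},
}

rates = {}
for split, c in counts.items():
    # Loans that survive the binary filter in this split.
    kept = c["Fully Paid"] + c["Charged Off"]
    # Share of the kept loans that were charged off.
    rates[split] = c["Charged Off"] / kept
    print(f"{split}: {kept} loans kept, charged-off share {rates[split]:.4f}")
```

In both splits about one kept loan in five is charged off, so the two classes are imbalanced.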

### Separate Out Binary Outcomes

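We keep only the loans whose status is exactly `Fully Paid` or `Charged Off` and ignore the rest. Because the match is exact, the `Does not meet the credit policy` variants of both statuses are dropped as well.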

In [6]:
df_binary_training = df_training.query('loan_status == "Fully Paid" | loan_status == "Charged Off"').copy()
df_binary_testing = df_testing.query('loan_status == "Fully Paid" | loan_status == "Charged Off"').copy()
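
The same filter can also be written with `Series.isin`, which may be easier to extend if more statuses are ever kept. The sketch below runs on a small made-up frame rather than the real data; `df_toy` and `BINARY_STATUSES` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for df_training; the real frame has many more columns.
df_toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Default", "Fully Paid"],
    "funded_amnt": [1275.0, 5000.0, 12000.0, 8000.0, 3000.0],
})

# Hypothetical constant listing the two statuses to keep.
BINARY_STATUSES = ["Fully Paid", "Charged Off"]

# The query used in the notebook cell above.
via_query = df_toy.query(
    'loan_status == "Fully Paid" | loan_status == "Charged Off"'
).copy()

# Equivalent boolean-mask filter using isin.
via_isin = df_toy[df_toy["loan_status"].isin(BINARY_STATUSES)].copy()

print(via_isin)
```

Both versions keep the same rows and preserve the original index; `.copy()` avoids a `SettingWithCopyWarning` when a column is added to the result later.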

### Encoding Loan Status

Let's create a new label column called `charged_off`, a binary indicator taken from the one-hot encoding of `loan_status`: it marks `Charged Off` loans as positive and `Fully Paid` loans as negative.

In [7]:
df_binary_training['charged_off'] = pd.get_dummies(df_binary_training[['loan_status']])['loan_status_Charged Off']
df_binary_training.drop(columns=['loan_status'], inplace=True)

In [8]:
df_binary_testing['charged_off'] = pd.get_dummies(df_binary_testing[['loan_status']])['loan_status_Charged Off']
df_binary_testing.drop(columns=['loan_status'], inplace=True)
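
Since only two statuses remain, the label can also be built with a direct comparison. Note that `pd.get_dummies` returns `bool` columns by default in pandas 2.0 and later (`uint8` in earlier versions), so casting to `int` gives an explicit 0/1 label either way. This is a minimal sketch on a made-up frame, not the notebook's own method; `df_toy` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for df_binary_training after filtering.
df_toy = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off", "Fully Paid"]})

# Notebook approach: pick the "Charged Off" indicator column from get_dummies.
via_dummies = pd.get_dummies(df_toy[["loan_status"]])["loan_status_Charged Off"]

# Alternative sketch: compare directly and cast to int for a 0/1 label.
via_compare = (df_toy["loan_status"] == "Charged Off").astype(int)

df_toy["charged_off"] = via_compare
df_toy = df_toy.drop(columns=["loan_status"])
print(df_toy)
```

The comparison does not depend on the generated dummy column name, so it does not raise a `KeyError` if a split happens to contain no `Charged Off` rows.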

### Write to CSV

In [9]:
df_binary_training.to_csv('data_processed/01_binary_training.csv', index=False)
df_binary_testing.to_csv('data_processed/02_binary_testing.csv', index=False)