## <center>**Bankruptcy - Making it easier**</center>


This document is part of the GovHack 2019 challenge. The challenge details and datasets were provided by the Australian Financial Security Authority (AFSA). The challenge aims to make it easier for people to tell AFSA about their income and assets when they become bankrupt. 

### Team - Rookie Hackers
> - Minesh Chunawala
> - Rupesh Singh
> - Dilan Dissanayake
> - Chet Dissanayake

### Background
The __Australian Financial Security Authority (AFSA)__ is responsible for regulating the personal insolvency and personal property securities regimes. 

#### What is bankruptcy? (Source - AFSA Website)
Bankruptcy is a legal process where you're declared unable to pay your debts. It can release you from most debts, provide relief and allow you to make a fresh start.

You can enter into voluntary bankruptcy. We refer to this as a debtor's petition. It's also possible that someone you owe money to (a creditor) can make you bankrupt through a court process. We refer to this as a creditor's petition.

Bankruptcy normally lasts for 3 years and 1 day.

#### Bankruptcy trustee
When you become bankrupt we appoint a trustee. A trustee is a person or body who manages your bankruptcy.

This can either be the Official Trustee (AFSA) or a registered trustee. You can also nominate a registered trustee of your choice.

#### Debtor's Obligations
When you are bankrupt:

> - you must provide details of your debts, income and assets to your trustee.
> - your trustee notifies your creditors that you’re bankrupt - this prevents most creditors from contacting you about your debt.
> - your trustee can sell certain assets to help pay your debts.
> - you may need to make compulsory payments if your income exceeds a set amount.

#### About the data

Personal insolvencies include bankruptcies, debt agreements and personal insolvency agreements. We have provided data on personal insolvencies for calendar years 2007 to 2018. Data for 2007 are from 1 July 2007, and data for 2018 are to 30 June 2018.

When you enter into a personal insolvency, you must complete a statement of affairs. That form asks for a range of information, including your income, debts and assets. We have provided debtors’ responses from this form. We have published this information as the debtor initially reported it, irrespective of subsequent checks and revisions. 

More information about the dataset can be obtained by visiting the link below:

[Bankruptcy Challenge Dataset](https://data.gov.au/data/dataset/attributes-of-insolvent-debtors)

### Problem Statement from AFSA

> 1. __Validate Income and Assets declared by Debtors:__ 
       Going through personal insolvency can be a very stressful time. When people complete the statement of affairs, they might forget to include certain details, e.g. other sources of income or some of their assets. AFSA wants to understand how to predict a debtor's income and assets using the social, employment and other insolvency-related information the debtor provides. 
> 2. __Using Debtors' information, predict Income and Assets over a 3-year period:__
       People who file for bankruptcy are usually bankrupt for 3 years, so AFSA wants to understand when their circumstances might change. Within limits, AFSA might also be able to sell their assets to repay their creditors. 

##### Example
John is 44, a builder and is currently not working because of health problems. If he was bankrupt, how would AFSA tell if he gave AFSA accurate data when he declared bankruptcy and whether his circumstances will change over the next 3 years e.g. getting another job, buying houses or shares etc?

### The Hack

As part of this hackathon, we focus on Problem Statement 1. We use the AFSA dataset to build a __Machine Learning model__ that predicts a debtor's income category from the other socio-economic attributes described in the dataset. 

This document goes through the following steps:

> __1. Importing Dataset and Relevant Python Libraries__

> __2. Data Cleansing and Exploratory Data Analysis__

> __3. Feature Engineering__

> __4. Predictive Model Building__

> __5. Validating Model Accuracy__

Now let's do the deep dive...

## 1. Importing Dataset and Relevant Python Libraries

In this section we load the relevant Python libraries, load the AFSA dataset and perform initial checks on the data.

In [47]:
# Import relevant python libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)

In [48]:
# Import Insolvency dataset
data = pd.read_csv('attributes-insolvent-debtors.csv',index_col='Unique ID')  

In [49]:
# Reading the data to understand data features and size of the dataset 
data.head()

Unique ID,Calendar Year of Insolvency,SA3 of Debtor,SA3 Code of Debtor,GCCSA of Debtor,GCCSA Code of Debtor,State of Debtor,Sex of Debtor,Family Situation,Debtor Occupation Code (ANZSCO),Debtor Occupation Name (ANZSCO),Main Cause of Insolvency,Business Related Insolvency,Debtor Income,Primary Income Source,Unsecured Debts,Value of Assets
3452750,2010,South Canberra,80106,Australian Capital Territory,8ACTE,Australian Capital Territory,Female,Single with Dependants,39.0,Other Technicians and Trades Workers,Unemployment or loss of income,No,$0-$49999,Government benefits/Pensions,$0-$49999,$0-$49999
3563908,2011,Weston Creek,80108,Australian Capital Territory,8ACTE,Australian Capital Territory,Male,Couple without Dependants,84.0,"Farm, Forestry and Garden Workers",Unemployment or loss of income,No,$0-$49999,Government benefits/Pensions,$0-$49999,$0-$49999
3252673,2012,Tuggeranong,80107,Australian Capital Territory,8ACTE,Australian Capital Territory,Female,Single with Dependants,11.0,"Chief Executives, General Managers and Legisla...",Ill health or absence of health insurance,No,$50000-$99999,Gross Wages and Salary,$50000-$99999,$0-$49999
3610744,2013,Tuggeranong,80107,Australian Capital Territory,8ACTE,Australian Capital Territory,Female,Couple with Dependants,99.0,AFSA,Unemployment or loss of income,No,$0-$49999,Unknown,$0-$49999,$0-$49999
3610734,2013,Tuggeranong,80107,Australian Capital Territory,8ACTE,Australian Capital Territory,Male,Couple with Dependants,22.0,"Business, Human Resource and Marketing Profess...",Other business reason or reason unknown,Yes,$50000-$99999,Gross Wages and Salary,$50000-$99999,$0-$49999


In [50]:
data.tail()

Unique ID,Calendar Year of Insolvency,SA3 of Debtor,SA3 Code of Debtor,GCCSA of Debtor,GCCSA Code of Debtor,State of Debtor,Sex of Debtor,Family Situation,Debtor Occupation Code (ANZSCO),Debtor Occupation Name (ANZSCO),Main Cause of Insolvency,Business Related Insolvency,Debtor Income,Primary Income Source,Unsecured Debts,Value of Assets
3950029,2008,Creswick - Daylesford - Ballan,20102,Rest of Vic.,2RVIC,Victoria,Female,Single without Dependants,25.0,Health Professionals,Excessive use of credit facilities including l...,No,$0-$49999,Unknown,$0-$49999,$0-$49999
3873822,2008,Cairns - South,30602,Rest of Qld,3RQLD,Queensland,Female,Single with Dependants,42.0,Carers and Aides,Excessive use of credit facilities including l...,No,$0-$49999,Unknown,$0-$49999,$0-$49999
3867041,2008,Mount Druitt,11603,Greater Sydney,1GSYD,New South Wales,Male,Single without Dependants,56.0,Clerical and Office Support Workers,Domestic discord or relationship breakdown,No,$0-$49999,Unknown,$0-$49999,$0-$49999
3962232,2008,Brisbane Inner - North,30503,Greater Brisbane,3GBRI,Queensland,Male,Single with Dependants,59.0,Other Clerical and Administrative Workers,Domestic discord or relationship breakdown,No,$0-$49999,Unknown,$0-$49999,$0-$49999
3988903,2008,Toowoomba,31701,Rest of Qld,3RQLD,Queensland,Male,Single without Dependants,62.0,Sales Assistants and Salespersons,Unemployment or loss of income,No,$0-$49999,Unknown,$0-$49999,$0-$49999


In [51]:
# Explore the data to understand object type, null values and completeness of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 356500 entries, 3452750 to 3988903
Data columns (total 16 columns):
Calendar Year of Insolvency        356500 non-null int64
SA3 of Debtor                      356500 non-null object
SA3 Code of Debtor                 356500 non-null int64
GCCSA of Debtor                    356500 non-null object
GCCSA Code of Debtor               356500 non-null object
State of Debtor                    356500 non-null object
Sex of Debtor                      356500 non-null object
Family Situation                   356500 non-null object
Debtor Occupation Code (ANZSCO)    340614 non-null float64
Debtor Occupation Name (ANZSCO)    340614 non-null object
Main Cause of Insolvency           356500 non-null object
Business Related Insolvency        356500 non-null object
Debtor Income                      356500 non-null object
Primary Income Source              356500 non-null object
Unsecured Debts                    356500 non-null object
Value of Asse

In [52]:
# The describe function reports the count of each numeric feature along with summary statistics. 
data.describe()

,Calendar Year of Insolvency,SA3 Code of Debtor,Debtor Occupation Code (ANZSCO)
count,356500.0,356500.0,340614.0
mean,2012.335195,26240.855436,55.365698
std,3.228209,15011.277165,26.713706
min,2007.0,0.0,11.0
25%,2010.0,12301.0,33.0
50%,2012.0,21305.0,55.0
75%,2015.0,31503.0,74.0
max,2018.0,80111.0,99.0


In [53]:
# Use the count function again to see the number of non-null values in each column
data.count()

Calendar Year of Insolvency        356500
SA3 of Debtor                      356500
SA3 Code of Debtor                 356500
GCCSA of Debtor                    356500
GCCSA Code of Debtor               356500
State of Debtor                    356500
Sex of Debtor                      356500
Family Situation                   356500
Debtor Occupation Code (ANZSCO)    340614
Debtor Occupation Name (ANZSCO)    340614
Main Cause of Insolvency           356500
Business Related Insolvency        356500
Debtor Income                      356500
Primary Income Source              356500
Unsecured Debts                    356500
Value of Assets                    356500
dtype: int64
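
The `info()` and `count()` outputs above show that only the two occupation columns have missing values. As a minimal sketch, the same check can be made directly with `isna()`. The small frame below is made-up toy data standing in for `data`; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the AFSA data (values are illustrative, not real records).
toy = pd.DataFrame({
    "Family Situation": ["Single with Dependants", "Couple with Dependants", "Unknown"],
    "Debtor Occupation Code (ANZSCO)": [39.0, np.nan, 22.0],
    "Debtor Occupation Name (ANZSCO)": [
        "Other Technicians and Trades Workers",
        np.nan,
        "Business, Human Resource and Marketing Professionals",
    ],
})

# Number of missing values per column, and the share of rows affected.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share.round(3))
```

On the full dataset, the counts above imply 356500 - 340614 = 15886 rows with a missing occupation.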

## 2. Data Cleansing and Exploratory Data Analysis

In this section we go through each of the features and analyse it in detail.

In [54]:
# Count the people filing for bankruptcy by family situation. 
# "Single without Dependants" is the largest group, at about 41% of all records.
data['Family Situation'].value_counts()

Single without Dependants    146679
Couple with Dependants        99447
Couple without Dependants     63320
Single with Dependants        43322
Unknown                        3348
Not Stated                      384
Name: Family Situation, dtype: int64
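
Raw counts are easier to compare as shares. The sketch below converts the counts printed above into percentages; the numbers are copied from that output rather than read from the CSV, so the block runs on its own. On the real frame, `data['Family Situation'].value_counts(normalize=True)` gives the same proportions as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
family_counts = pd.Series({
    "Single without Dependants": 146679,
    "Couple with Dependants": 99447,
    "Couple without Dependants": 63320,
    "Single with Dependants": 43322,
    "Unknown": 3348,
    "Not Stated": 384,
})

# Convert counts to percentage shares of all records.
family_share = (family_counts / family_counts.sum() * 100).round(1)
print(family_share)
```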

In [56]:
# Count the people filing for bankruptcy by their primary income source at the time of filing. 
# The largest category is 'Unknown' (about 54% of records); among the stated sources, wages and salary and government benefits/pensions dominate.
data['Primary Income Source'].value_counts()

Unknown                          191606
Gross Wages and Salary            81803
Government benefits/Pensions      63959
Self Employment                   12119
Business earnings                  2835
Other                              2775
Superannuation                      910
Income from Investments             304
Deceased Estate or Trusts           106
Lump Sum termination payments        62
Income from reverse mortgage         21
Name: Primary Income Source, dtype: int64

In [57]:
# Here we look at the gender spread of people filing for bankruptcy. 
# Males make up about 57% of records and females about 43%; we did not see a strong trend here.
data['Sex of Debtor'].value_counts()

Male          202206
Female        154133
Not Stated       158
Unknown            3
Name: Sex of Debtor, dtype: int64

In [58]:
# The reason for filing for bankruptcy is one of the most important attributes for our model.
# As we see below, the majority of people filed for bankruptcy due to unemployment, excessive use of credit facilities or ill health.
data['Main Cause of Insolvency'].value_counts()

Unemployment or loss of income                                                                                             99140
Excessive use of credit facilities including losses on repossessions, high interest payments and pressure selling          89909
Ill health or absence of health insurance                                                                                  25838
Economic conditions affecting industry, including competition, credit restrictions, fall in prices or increases in cost    23167
Domestic discord or relationship breakdowns                                                                                22496
Other causes or causes unknown                                                                                             18048
Other business reason or reason unknown                                                                                    17160
Domestic discord or relationship breakdown                                                       

In [59]:
# Here we try to understand which unsecured debt range is the most common.
data['Unsecured Debts'].value_counts()

$0-$49999             244724
$50000-$99999          48088
$100000-$149999        19878
$150000-$199999        10619
More Than $1000000      7037
$200000-$249999         6295
$250000-$299999         4219
$300000-$349999         3127
$350000-$399999         2348
$400000-$449999         1853
$450000-$499999         1482
$500000-$549999         1289
$550000-$599999         1045
$600000-$649999          902
$650000-$699999          771
$750000-$799999          614
$700000-$749999          564
$800000-$849999          480
$850000-$899999          426
$900000-$949999          400
$950000-$999999          339
Name: Unsecured Debts, dtype: int64

In [60]:
# The debtor's current occupation is one of the most important features for our model. 
# However, the most frequently occurring class is 'AFSA', which includes people who were unemployed, retired, students or pensioners.
data['Debtor Occupation Name (ANZSCO)'].value_counts()

AFSA                                                        38610
Sales Assistants and Salespersons                           17801
Other Clerical and Administrative Workers                   17545
Other Labourers                                             17412
Road and Rail Drivers                                       17251
Hospitality, Retail and Service Managers                    13933
Carers and Aides                                            13337
Construction Trades Workers                                 12833
Specialist Managers                                         12402
Business, Human Resource and Marketing Professionals        10006
Cleaners and Laundry Workers                                 9822
Automotive and Engineering Trades Workers                    8998
Other Technicians and Trades Workers                         8781
Sales Representatives and Agents                             8657
Food Trades Workers                                          8277
Factory Pr

In [61]:
# Here we look at how the number of insolvencies changed from year to year.
# We can ignore 2007 and 2018, as the data does not cover the full year for either.
# Across the full years the count stays within roughly 29K to 38K insolvencies, averaging about 32K. 
# Hence we will ignore this feature in the predictive model. 
data['Calendar Year of Insolvency'].value_counts()

2009    37921
2010    35205
2008    35163
2012    32805
2017    31973
2011    31717
2013    30582
2016    30220
2014    29618
2015    29060
2018    16316
2007    15920
Name: Calendar Year of Insolvency, dtype: int64
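
As a quick check on the yearly figures, the sketch below recomputes the average over the ten full calendar years. The counts are copied from the output above so that the block runs without the CSV file.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
year_counts = pd.Series({
    2007: 15920, 2008: 35163, 2009: 37921, 2010: 35205, 2011: 31717, 2012: 32805,
    2013: 30582, 2014: 29618, 2015: 29060, 2016: 30220, 2017: 31973, 2018: 16316,
})

# 2007 and 2018 each cover only half a year, so leave them out.
full_years = year_counts.drop([2007, 2018])

print(full_years.mean())                    # average insolvencies per full year
print(full_years.min(), full_years.max())   # lowest and highest full-year counts
```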

In [17]:
# Here we look at how insolvencies are spread across SA3 regions.
data['SA3 of Debtor'].value_counts()

Townsville                    4395
Campbelltown (NSW)            4346
Wyong                         4346
Ormeau - Oxenford             3722
Gosford                       3644
Wyndham                       3497
Mount Druitt                  3442
Fairfield                     3325
Penrith                       3129
Frankston                     2954
Whittlesea - Wallan           2898
Casey - South                 2840
Wanneroo                      2724
Bankstown                     2714
Geelong                       2707
Sydney Inner City             2686
Blacktown                     2652
Merrylands - Guildford        2631
Toowoomba                     2603
Newcastle                     2507
Melton - Bacchus Marsh        2500
Ipswich Inner                 2475
Rockhampton                   2441
Springfield - Redbank         2439
Salisbury                     2435
Tullamarine - Broadmeadows    2434
Onkaparinga                   2410
Mackay                        2392
Brimbank            

In [62]:
# Here we look at how insolvencies are spread across GCCSA regions.
data['GCCSA of Debtor'].value_counts()

Greater Sydney                  72223
Greater Melbourne               53229
Rest of Qld                     52906
Rest of NSW                     44466
Greater Brisbane                44052
Greater Perth                   23338
Rest of Vic.                    19907
Greater Adelaide                16990
Rest of WA                       6396
Rest of Tas.                     5660
Rest of SA                       5061
Greater Hobart                   4207
Australian Capital Territory     3952
Greater Darwin                   1643
Unknown                          1255
Rest of NT                        743
Overseas Address                  381
No Geocode Available QLD           25
No Geocode Available VIC           17
No Geocode Available NT            17
No Geocode Available NSW           16
No Geocode Available WA             7
No Geocode Available SA             5
No Geocode Available TAS            2
No Geocode Available ACT            2
Name: GCCSA of Debtor, dtype: int64

In [63]:
# Here we look at how insolvencies are spread across states and territories.
data['State of Debtor'].value_counts()

New South Wales                 116705
Queensland                       96983
Victoria                         73153
Western Australia                29741
South Australia                  22056
Tasmania                          9869
Australian Capital Territory      3954
Northern Territory                2403
Unknown                           1255
International                      381
Name: State of Debtor, dtype: int64

In [64]:
# We also tried assessing if there was any pattern between people filing for insolvency and if it was business related. 
data['Business Related Insolvency'].value_counts()

No                       289661
Yes                       63116
Not Stated or Unknown      3723
Name: Business Related Insolvency, dtype: int64

In [65]:
# Here we tried to understand how much debt people had and which range was the most common.
data['Unsecured Debts'].value_counts()

$0-$49999             244724
$50000-$99999          48088
$100000-$149999        19878
$150000-$199999        10619
More Than $1000000      7037
$200000-$249999         6295
$250000-$299999         4219
$300000-$349999         3127
$350000-$399999         2348
$400000-$449999         1853
$450000-$499999         1482
$500000-$549999         1289
$550000-$599999         1045
$600000-$649999          902
$650000-$699999          771
$750000-$799999          614
$700000-$749999          564
$800000-$849999          480
$850000-$899999          426
$900000-$949999          400
$950000-$999999          339
Name: Unsecured Debts, dtype: int64

## 3. Feature Engineering

This is one of the most important steps in building a predictive model. Having gone through all the features, we decided to trim the dataset to only those features that we believed were most relevant to a debtor's current circumstances. 

In our understanding, an individual's current circumstances are strongly affected by socio-economic factors: health, employment, employment category, family situation and primary source of income. 

Hence we decided to trim the dataset to the features listed below:

> 1. Debtor Occupation Name (ANZSCO)
> 2. Family Situation
> 3. Main Cause of Insolvency
> 4. Business Related Insolvency
> 5. Primary Income Source
> 6. Unsecured Debts
> 7. Value of Assets
> 8. Debtor Income

We also go through the following steps as part of feature engineering:

> 1. Imputation - dealing with missing values
> 2. Handling outliers
> 3. Binning - makes the model more robust and helps prevent overfitting
> 4. One-hot encoding - dealing with multi-category data
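
To make step 1 concrete, here is a minimal sketch of two common ways to deal with missing values, using a toy series in place of the real occupation column. The notebook itself drops the incomplete rows with `dropna()`; the `"Missing"` label in option B is a hypothetical choice made for this illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the occupation column (illustrative values only).
occupation = pd.Series(
    ["Carers and Aides", np.nan, "Other Labourers", np.nan],
    name="Debtor Occupation Name (ANZSCO)",
)

# Option A: drop rows with a missing occupation.
dropped = occupation.dropna()

# Option B: keep every row and mark the gap with an explicit category.
# "Missing" is a hypothetical label chosen for this sketch.
imputed = occupation.fillna("Missing")

print(len(dropped), len(imputed))
print(imputed.value_counts())
```

Option A is simpler but discards rows; option B keeps them at the cost of adding one more category to encode.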

In [68]:
# Reducing the dataset to selected features.
cols = ['Debtor Occupation Name (ANZSCO)','Family Situation', 'Main Cause of Insolvency', 'Business Related Insolvency', 'Primary Income Source', 'Unsecured Debts', 'Value of Assets', 'Debtor Income']
shrink_data = data[cols]
shrink_data.head()

Unique ID,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Business Related Insolvency,Primary Income Source,Unsecured Debts,Value of Assets,Debtor Income
3452750,Other Technicians and Trades Workers,Single with Dependants,Unemployment or loss of income,No,Government benefits/Pensions,$0-$49999,$0-$49999,$0-$49999
3563908,"Farm, Forestry and Garden Workers",Couple without Dependants,Unemployment or loss of income,No,Government benefits/Pensions,$0-$49999,$0-$49999,$0-$49999
3252673,"Chief Executives, General Managers and Legisla...",Single with Dependants,Ill health or absence of health insurance,No,Gross Wages and Salary,$50000-$99999,$0-$49999,$50000-$99999
3610744,AFSA,Couple with Dependants,Unemployment or loss of income,No,Unknown,$0-$49999,$0-$49999,$0-$49999
3610734,"Business, Human Resource and Marketing Profess...",Couple with Dependants,Other business reason or reason unknown,Yes,Gross Wages and Salary,$50000-$99999,$0-$49999,$50000-$99999


In [72]:
# Here we check which of the selected features are useful for the model, based on how their values are distributed.
# Primary Income Source - has 11 unique categories, with about 192K entries marked 'Unknown' - hence this feature can be ignored for the model.
# Unsecured Debts - has 21 unique categories, with about 245K entries in a single category. We could be overfitting the data if we use this feature.
# Value of Assets - has 21 unique categories, with about 313K entries in a single category. We could be overfitting the data if we use this feature.
# None of these three features has a wide enough spread of values across categories to build a model that would not overfit.
shrink_data.describe()

,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Business Related Insolvency,Primary Income Source,Unsecured Debts,Value of Assets,Debtor Income
count,340614,356500,356500,356500,356500,356500,356500,356500
unique,44,6,23,3,11,21,21,7
top,AFSA,Single without Dependants,Unemployment or loss of income,No,Unknown,$0-$49999,$0-$49999,$0-$49999
freq,38610,146679,99140,289661,191606,244724,312948,269763


In [73]:
# Preview the data without the three skewed features (drop returns a new DataFrame, so shrink_data itself is unchanged)
shrink_data.drop(['Value of Assets','Primary Income Source','Unsecured Debts'], axis=1)

Unique ID,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Business Related Insolvency,Debtor Income
3452750,Other Technicians and Trades Workers,Single with Dependants,Unemployment or loss of income,No,$0-$49999
3563908,"Farm, Forestry and Garden Workers",Couple without Dependants,Unemployment or loss of income,No,$0-$49999
3252673,"Chief Executives, General Managers and Legisla...",Single with Dependants,Ill health or absence of health insurance,No,$50000-$99999
3610744,AFSA,Couple with Dependants,Unemployment or loss of income,No,$0-$49999
3610734,"Business, Human Resource and Marketing Profess...",Couple with Dependants,Other business reason or reason unknown,Yes,$50000-$99999
3631198,Skilled Animal and Horticultural Workers,Single without Dependants,Domestic discord or relationship breakdowns,No,$0-$49999
3648556,Other Clerical and Administrative Workers,Couple without Dependants,Excessive use of credit facilities including l...,No,$50000-$99999
3019574,Hospitality Workers,Single without Dependants,Other business reason or reason unknown,Yes,$50000-$99999
3217157,"Hospitality, Retail and Service Managers",Couple with Dependants,Other business reason or reason unknown,Yes,$0-$49999
3389604,Sports and Personal Service Workers,Single without Dependants,"Personal reasons, including ill health of self...",Yes,$0-$49999


In [74]:
# Summarise shrink_data again; it still contains all eight columns because the drop above was not applied in place. 
shrink_data.describe()

,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Business Related Insolvency,Primary Income Source,Unsecured Debts,Value of Assets,Debtor Income
count,340614,356500,356500,356500,356500,356500,356500,356500
unique,44,6,23,3,11,21,21,7
top,AFSA,Single without Dependants,Unemployment or loss of income,No,Unknown,$0-$49999,$0-$49999,$0-$49999
freq,38610,146679,99140,289661,191606,244724,312948,269763


In [75]:
# Now we create a new dataframe with only the required features, dropping rows that have missing values. 
columns = ['Debtor Occupation Name (ANZSCO)','Family Situation', 'Main Cause of Insolvency', 'Debtor Income']
new_data = data[columns].dropna()
new_data.head()

Unique ID,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Debtor Income
3452750,Other Technicians and Trades Workers,Single with Dependants,Unemployment or loss of income,$0-$49999
3563908,"Farm, Forestry and Garden Workers",Couple without Dependants,Unemployment or loss of income,$0-$49999
3252673,"Chief Executives, General Managers and Legisla...",Single with Dependants,Ill health or absence of health insurance,$50000-$99999
3610744,AFSA,Couple with Dependants,Unemployment or loss of income,$0-$49999
3610734,"Business, Human Resource and Marketing Profess...",Couple with Dependants,Other business reason or reason unknown,$50000-$99999


In [76]:
new_data.describe(include = 'all')

,Debtor Occupation Name (ANZSCO),Family Situation,Main Cause of Insolvency,Debtor Income
count,340614,340614,340614,340614
unique,44,6,22,7
top,AFSA,Single without Dependants,Unemployment or loss of income,$0-$49999
freq,38610,142180,95841,255383
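
The summary above shows that the target, `Debtor Income`, is heavily concentrated in one band. A useful reference point when judging any classifier on this data is the accuracy of always predicting that band. The sketch below works this out from the figures printed above; it is a sanity-check calculation, not part of the original modelling code.

```python
# Figures copied from the describe() output above.
total_rows = 340614       # rows left after dropna()
top_class_rows = 255383   # rows in the most frequent income band, $0-$49999

# Accuracy of a model that always predicts the most frequent band.
baseline_accuracy = top_class_rows / total_rows
print(round(baseline_accuracy, 4))
```

A trained model needs to beat this figure (roughly 75%) before its accuracy says anything about the predictors.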


### One-Hot Encoding

One-hot encoding is one of the most common encoding methods in machine learning. It spreads the values of a column across multiple flag columns and assigns 0 or 1 to each. A 1 in a flag column means the row belongs to that category.
It converts categorical data, which algorithms cannot use directly, into a numerical format, and lets us group our categorical data without losing any information.
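
As a small illustration of the idea, the sketch below one-hot encodes a toy column with `pandas.get_dummies`. The category names come from the dataset, but the rows are made up; the notebook's own preprocessing uses scikit-learn's `OneHotEncoder` instead.

```python
import pandas as pd

# Toy column using category names from the dataset (rows are illustrative).
toy = pd.DataFrame({
    "Family Situation": [
        "Single with Dependants",
        "Couple with Dependants",
        "Single with Dependants",
    ]
})

# One flag column per category; each row has exactly one flag set.
encoded = pd.get_dummies(toy["Family Situation"]).astype(int)
print(encoded)
```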

In [78]:
# Importing relevant scikit-learn classes
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [79]:
# Define a function to preprocess new_data. It takes the selected features and performs one-hot encoding.
# The encoders are fitted on the training data only and then applied to both the training and test data. 
# It returns the encoded training and test sets. 
def preprocess_bankruptcy(X_train, X_test, y_train, y_test):

    ### Hardcode variables which need categorical encoding
    to_encode = ['Debtor Occupation Name (ANZSCO)','Family Situation', 'Main Cause of Insolvency']

    ### Find top categories in categorical columns
    ### Used for dropping the majority class to prevent multicollinearity
    top_categories = []

    for col in to_encode:
        top_categories.append(X_train[col].value_counts().index[0])

    ### Create and fit one-hot encoder for categoricals
    ### (scikit-learn 1.2 and later name this parameter sparse_output instead of sparse)
    OHE = OneHotEncoder(sparse = False)
    OHE.fit(X_train[to_encode])

    ## Create and fit Label encoder for target
    LabEnc = LabelEncoder()
    LabEnc.fit(y_train)

    def create_encoded_df(X, to_encode = to_encode, OHE = OHE, top_categories = top_categories):
        # Return columns which need encoding.
        def return_encoded_cols(X, to_encode = to_encode, OHE = OHE, top_categories = top_categories):
            # Use onehotencoder to transform.
            # Use "categories" to name
            toRet = pd.DataFrame(OHE.transform(X[to_encode]), columns = np.concatenate(OHE.categories_))

            # Drop top_categories and return
            return toRet.drop(top_categories, axis = 1)

        # create encoded columns
        ret_cols = return_encoded_cols(X)

        # Drop columns that were encoded
        dr_enc = X.drop(to_encode, axis = 1)

        # Concatenate values
        # use index from original data
        # use combined column names
        return pd.DataFrame(np.concatenate([ret_cols.values, dr_enc.values],axis = 1),
                            index = dr_enc.index,
                            columns = list(ret_cols.columns) + list(dr_enc.columns))


    def encode_target(y, LabEnc = LabEnc):
        # Use label encoder, and supply with original index
        return pd.Series(LabEnc.transform(y), index= y.index)

    return create_encoded_df(X_train), create_encoded_df(X_test), encode_target(y_train), encode_target(y_test)
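
The function above fits the encoders on the training split only. One edge case worth knowing: with default settings, `OneHotEncoder.transform` raises an error if the test split contains a category that never appeared in training. The sketch below shows one possible variation, `handle_unknown="ignore"`, on toy data. This is an illustration, not what the notebook does, and it does not include the step of dropping the top category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test frames (illustrative values only).
train = pd.DataFrame({"Family Situation": ["Single with Dependants", "Couple with Dependants"]})
test = pd.DataFrame({"Family Situation": ["Couple with Dependants", "Not Stated"]})

# Fit on the training data only; categories unseen at fit time are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
test_encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(test_encoded)
```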

In [80]:
# Finally, we separate the 'Debtor Income' target from the predictors and perform the 80/20 train/test split. 
from sklearn.model_selection import train_test_split

# Create training and testing sets; preprocess them.
target = new_data['Debtor Income']
predictors = new_data.drop("Debtor Income", axis = 'columns')

X_train, X_test, y_train, y_test = preprocess_bankruptcy(*train_test_split(predictors, target, test_size = .2))
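
Because the income bands are very unevenly populated, it can help to keep the class proportions identical in both splits and to make the split repeatable. The sketch below shows the `stratify` and `random_state` arguments of `train_test_split` on toy data. These two arguments are a suggested variation, not part of the original call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced target: 80 low-income rows, 20 higher-income rows.
toy = pd.DataFrame({
    "Family Situation": ["Single without Dependants"] * 100,
    "Debtor Income": ["$0-$49999"] * 80 + ["$50000-$99999"] * 20,
})

X = toy.drop("Debtor Income", axis="columns")
y = toy["Debtor Income"]

# stratify keeps class proportions equal in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```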

## 4. Predictive Model Building

There are various ways to build machine learning models for prediction. AFSA mentioned that people are stressed when declaring bankruptcy and may not always give correct or complete information, so we believe data accuracy and data quality may be an issue. As a result, we use __Ensemble Adaptive Boosting and Random Forest Classification__ techniques to train our model. 

#### What is an ensemble method?
An ensemble is a machine learning approach in which multiple models are trained using the same learning algorithm. Ensembles belong to a larger group of methods, called multiclassifiers, in which a set of hundreds or thousands of learners with a common objective are fused together to solve a problem.

The second group of multiclassifiers contains the hybrid methods. They also use a set of learners, but the learners can be trained using different learning techniques. Stacking is the best known of these; the post “Dream team combining classifiers” covers it in more detail.

The main causes of error in learning are due to noise, bias and variance. Ensemble helps to minimize these factors. These methods are designed to improve the stability and the accuracy of Machine Learning algorithms. Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.

[More Information on Ensemble Methods](https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/)
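
To illustrate why combining classifiers can give a more reliable result than a single classifier, here is a minimal simulation. It assumes, unrealistically, that every learner makes its mistakes independently of the others; the function name `majority_vote_accuracy` and the numbers used (60% individual accuracy, 101 learners) are hypothetical and have nothing to do with the AFSA data.

```python
import random

def majority_vote_accuracy(n_learners, p_correct, n_trials=2000, seed=0):
    """Estimate how often a majority vote of independent learners is right.

    Assumption: each learner is correct with probability p_correct,
    independently of all other learners.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Count how many learners vote for the correct class in this trial.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_learners))
        if correct_votes > n_learners / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1, 0.6)
committee = majority_vote_accuracy(101, 0.6)
print(f"single learner: {single:.2f}, committee of 101: {committee:.2f}")
```

In practice learners trained on the same data are correlated, so the real gain is smaller than in this idealised setting. That is why bagging and boosting deliberately vary the data each learner sees.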

#### What is Boosting?

Boosting is a method of converting a set of weak learners into a strong learner. Suppose we have a binary classification task. A weak learner has an error rate slightly less than 0.5, i.e. it is only slightly better than deciding by a coin toss. A strong learner has an error rate close to 0. To convert weak learners into a strong learner, we take a family of weak learners, combine them and let them vote. This turns the family of weak learners into a single strong learner.

#### AdaBoost (Adaptive Boosting):

The Adaptive Boosting technique was formulated by Yoav Freund and Robert Schapire, who won the Gödel Prize for their work. AdaBoost works on improving the areas where the base learner fails. The base learner is a weak machine learning algorithm to which the boosting method is applied to turn it into a strong learner. Any machine learning algorithm that accepts weights on training data can be used as a base learner. In the example below, decision stumps are used as the base learner.

We take the training data, draw a weighted random sample of points from it and fit a decision stump to classify those points. The stump is then evaluated on the complete training data, and the points it misclassifies are given more weight so that the next stump concentrates on them. This process repeats until the complete training data is fit without any error or until a specified maximum number of estimators is reached.

[More Information on Boosting](https://hackernoon.com/boosting-algorithms-adaboost-gradient-boosting-and-xgboost-f74991cad38c)
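
The boosting loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: it uses the re-weighting form of AdaBoost (rather than resampling), a single numeric feature, labels coded as +1/-1, and a hypothetical ten-point toy dataset. The helper names `fit_stump`, `stump_predict`, `adaboost` and `ensemble_predict` are invented for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: points below the threshold get `polarity` (+1 or -1),
    all other points get the opposite label."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    thresholds = [x + 0.5 for x in xs] + [min(xs) - 0.5]
    for threshold in thresholds:
        for polarity in (1, -1):
            # Weighted error: total weight of the misclassified points.
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    """Fit n_rounds stumps, re-weighting the data after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) / division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def training_accuracy(ensemble):
    return sum(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

stump_acc = training_accuracy(adaboost(xs, ys, n_rounds=1))
boosted_acc = training_accuracy(adaboost(xs, ys, n_rounds=3))
print("one stump:", stump_acc, "three boosted stumps:", boosted_acc)
```

On this toy data a single stump gets 7 of the 10 points right, while three boosted stumps classify all ten correctly. Real data is noisier, so training accuracy alone should not be read as a measure of how well the model generalises.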

#### What is Bagging?

Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method. Bagging is the application of the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Suppose there are N observations and M features. A sample of observations is selected randomly with replacement (bootstrapping). A subset of features is selected, and a model is created from the sampled observations and that subset of features; the feature from the subset that gives the best split on the training data is chosen. This is repeated to create many models, and the models can be trained in parallel. The final prediction is an aggregation of the predictions from all the models.

[More information on Bagging](https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e)
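
The two core steps described above, bootstrapping and aggregation, can be sketched as follows. This is a simplified illustration: it covers only row sampling and voting, leaves out the feature-subset step, and uses a 1-nearest-neighbour rule as a stand-in for a high-variance base learner instead of a decision tree. The data and the helper names `bootstrap_sample`, `one_nn_predict` and `bagged_predict` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]

def one_nn_predict(sample, x):
    """High-variance base learner: copy the label of the nearest sampled point."""
    nearest = min(sample, key=lambda row: abs(row[0] - x))
    return nearest[1]

def bagged_predict(samples, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(one_nn_predict(sample, x) for sample in samples)
    return votes.most_common(1)[0][0]

rng = random.Random(42)

# Hypothetical toy data: (x, label) with label 1 when x > 5 ...
rows = [(x, int(x > 5)) for x in range(11)]
rows[2] = (2, 1)  # ... plus one deliberately wrong (noisy) label

# An odd number of models avoids tied votes between the two classes.
samples = [bootstrap_sample(rows, rng) for _ in range(25)]

# Because sampling is with replacement, each sample omits some rows.
unique_fraction = len(set(samples[0])) / len(rows)
print("share of distinct rows in the first sample:", round(unique_fraction, 2))
print("bagged prediction at x=0:", bagged_predict(samples, 0))
print("bagged prediction at x=9:", bagged_predict(samples, 9))
```

Each bootstrap sample sees a slightly different version of the data, so individual learners disagree near the noisy point, and the vote smooths those disagreements out.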

#### Random Forest Classification:
Random forest is an ensemble model using bagging as the ensemble method and decision tree as the individual model. Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.

[More information on Random Forest Classification](https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed)

Now we use scikit-learn's built-in classes to instantiate both classifier models, fit them to the training data and predict income on the test data.

In [81]:
# Using classifiers to train and test the data
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Random forest with 100 decision trees
RF = RandomForestClassifier(n_estimators = 100)
RF.fit(X_train, y_train)
RF.predict(X_test)

# AdaBoost with at most 100 boosted estimators
ABC = AdaBoostClassifier(n_estimators = 100)
ABC.fit(X_train, y_train)
ABC.predict(X_test)  # last expression in the cell, so its result is displayed

array([2, 2, 2, ..., 2, 2, 2])

## 5. Validating Model Accuracy

Now we will compute the accuracy score of the predicted results by comparing them with the actual income stated by the debtors when filling in the AFSA Statement of Affairs. The higher the score, the better the model's accuracy.

In [32]:
print("Random Forest Model Accuracy Score:", RF.score(X_test, y_test))

Random Forest Model Accuracy Score: 0.7563084420827033


In [33]:
print("Adaptive Boosting Model Accuracy Score:", ABC.score(X_test, y_test))

Adaptive Boosting Model Accuracy Score: 0.7325132480953569


As you can see, both models give us close to 75% accuracy:

> - Random Forest Model Accuracy Score is about 76%
> - Adaptive Boosting Model Accuracy Score is about 73%

This means that, using the given dataset and selecting certain key features, we could correctly match debtors' income for about 75% of the test dataset. This suggests that AFSA may need to review the income information filled in by the remaining 25% or so of debtors.
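
For a classifier, the `score` method used above returns the mean accuracy, that is, the fraction of test rows whose predicted label equals the true label. The sketch below reproduces that calculation by hand and compares it with a simple baseline that always predicts the most common label. The label lists are hypothetical and are not the AFSA data; we have not checked how such a baseline performs on the real dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n_predictions):
    """Always predict the most frequent label seen in training."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions

# Hypothetical encoded income labels (NOT the AFSA data).
y_train = [2, 2, 2, 1, 2, 0, 2, 2, 1, 2]
y_test = [2, 2, 1, 2, 0, 2, 2, 2]
model_pred = [2, 2, 2, 2, 0, 2, 2, 1]

model_acc = accuracy(y_test, model_pred)
baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
print("model accuracy:", model_acc)        # 6 of 8 correct -> 0.75
print("baseline accuracy:", baseline_acc)  # 6 of 8 correct -> 0.75
```

In this made-up example the model and the baseline both reach 75%, which shows why an accuracy figure is easier to interpret next to a baseline when one class dominates the target.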

### What's Next

> - Predict assets and compare them with the declared assets.
> - Improve model accuracy by using other machine learning techniques. 

It was fun working on this challenge. We hope the effort we put in will help AFSA in some way. We are glad we could contribute to a cause. Please feel free to reach out to Minesh Chunawala at minesh@au@gmail.com if you have any questions about the model.

### Thank you