# Predicting Donors for a Charity - Part 2 - Deep learning

## Introduction

In an earlier __[project](https://github.com/pranath/predict_charity_donors/blob/master/finding_donors.ipynb)__ I developed a machine learning model that predicts, from census data, which people earn over $50K. That project used conventional models, in particular a Gradient Boosting Classifier.

In this project I will use a deep learning model to predict the same target variable, and then compare its results with those of the earlier model.

## Load libraries

In [1]:
import pandas as pd
from fastai.tabular import *

## Load data 

In [11]:
df = pd.read_csv('census.csv')
df.head()

| Unnamed: 0 | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 3 | 53 | Private | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Cuba | <=50K |


In [12]:
# Define target variable
dep_var = 'income'
# Define categorical columns
cat_names = ['workclass', 'education_level', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
# Define numeric columns
cont_names = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
# Define data transforms
procs = [FillMissing, Categorify, Normalize]
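
The three transforms in `procs` can be approximated in plain pandas to show what each one does to a column. This is only a rough sketch on a toy frame with made-up values; the `_na`, `_code` and `_norm` column names are illustrative, and fastai's internal encoding (for example, the integer it assigns to each category) may differ.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror census.csv.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Local-gov"],
    "hours-per-week": [40.0, None, 50.0, 30.0],
})

# FillMissing (rough equivalent): record which rows were missing,
# then fill the gap with the column median.
toy["hours-per-week_na"] = toy["hours-per-week"].isna()
toy["hours-per-week"] = toy["hours-per-week"].fillna(toy["hours-per-week"].median())

# Categorify (rough equivalent): map each category to an integer code.
toy["workclass_code"] = toy["workclass"].astype("category").cat.codes

# Normalize (rough equivalent): subtract the mean, divide by the standard deviation.
col = toy["hours-per-week"]
toy["hours-per-week_norm"] = (col - col.mean()) / col.std()

print(toy)
```

Normalization of this kind is why the continuous columns in the batch preview further down appear as small values centred on zero rather than as raw ages or hours.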

In [13]:
# Create test dataset (reuses rows 800-999, the same rows held out for validation in the next cell)
test = TabularList.from_df(df.iloc[800:1000].copy(), cat_names=cat_names, cont_names=cont_names)

In [14]:
# Create dataset object
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

In [15]:
# View dataset object (with transforms applied)
data.show_batch(rows=10)

| workclass | education_level | marital-status | occupation | relationship | race | sex | native-country | age | education-num | capital-gain | capital-loss | hours-per-week | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Private | Some-college | Married-civ-spouse | Craft-repair | Husband | White | Male | United-States | 0.4126 | -0.0462 | -0.1468 | -0.2186 | -0.0777 | <=50K |
| Private | HS-grad | Married-civ-spouse | Sales | Husband | White | Male | United-States | -0.6464 | -0.4379 | -0.1468 | -0.2186 | 0.3388 | <=50K |
| Private | Assoc-voc | Never-married | Other-service | Not-in-family | White | Male | United-States | -0.4952 | 0.3455 | -0.1468 | -0.2186 | -0.0777 | <=50K |
| Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | Male | Philippines | -0.2682 | 1.1289 | -0.1468 | -0.2186 | -0.0777 | >50K |
| Self-emp-not-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | 0.5639 | 1.9123 | -0.1468 | 4.4831 | 0.3388 | >50K |
| Self-emp-not-inc | HS-grad | Divorced | Craft-repair | Not-in-family | White | Male | United-States | -1.1003 | -0.4379 | -0.1468 | -0.2186 | 0.7552 | <=50K |
| Private | Some-college | Married-civ-spouse | Adm-clerical | Husband | White | Male | United-States | -0.3439 | -0.0462 | -0.1468 | -0.2186 | -0.0777 | <=50K |
| Local-gov | Masters | Separated | Prof-specialty | Not-in-family | White | Male | United-States | 0.9421 | 1.5206 | -0.1468 | -0.2186 | -0.161 | >50K |
| Private | Assoc-voc | Divorced | Tech-support | Not-in-family | White | Female | United-States | 0.7908 | 0.3455 | -0.1468 | -0.2186 | -0.0777 | <=50K |
| Private | Assoc-voc | Never-married | Tech-support | Not-in-family | White | Male | United-States | -1.1759 | 0.3455 | -0.1468 | -0.2186 | -0.0777 | <=50K |


In [18]:
# Create deep learning model
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

In [19]:
# Fit the model
learn.fit(6, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.343425 | 0.320738 | 0.835 | 00:04 |
| 1 | 0.330152 | 0.286586 | 0.865 | 00:04 |
| 2 | 0.32978 | 0.277911 | 0.89 | 00:04 |
| 3 | 0.336001 | 0.30098 | 0.835 | 00:04 |
| 4 | 0.330773 | 0.279806 | 0.88 | 00:04 |
| 5 | 0.336908 | 0.293961 | 0.87 | 00:04 |


## Results

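In our earlier study with the Gradient Boosting Classifier, we achieved a best accuracy of 0.86. With this model, after just a few epochs of training, we achieve a validation accuracy between 0.835 and 0.89, so a comparable level of accuracy. Note that the validation set here holds only 200 rows (rows 800 to 999), so these figures are fairly noisy.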
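
To get a feel for how much weight the comparison can bear, the per-epoch accuracies from the training log can be summarised along with a rough uncertainty estimate. The sketch below uses the binomial standard error, which assumes each validation row is an independent trial; that is a simplifying assumption, not something established in this notebook.

```python
import math

# Validation accuracy per epoch, copied from the training log above.
accuracies = [0.835, 0.865, 0.89, 0.835, 0.88, 0.87]
n_valid = 200  # rows 800-999 were held out by split_by_idx

mean_acc = sum(accuracies) / len(accuracies)

# Binomial standard error of an accuracy measured on n_valid rows,
# treating each row as an independent trial (an assumption).
final_acc = accuracies[-1]
std_err = math.sqrt(final_acc * (1 - final_acc) / n_valid)

print(f"range: {min(accuracies):.3f} to {max(accuracies):.3f}")
print(f"mean over epochs: {mean_acc:.3f}")
print(f"final epoch: {final_acc:.3f} +/- {1.96 * std_err:.3f} (approx. 95% interval)")
```

With only 200 validation rows the interval spans several percentage points, so the two models are best described as comparable rather than ranked.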

## Inference

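Test the model by predicting the income class for a single row of the dataset. Row 175 falls in the training split, so this demonstrates how to call the model rather than measuring its performance on unseen data.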

In [20]:
row = df.iloc[175]

In [21]:
learn.predict(row)

(Category >50K, tensor(1), tensor([0.3134, 0.6866]))
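
The returned tuple holds the predicted category, its class index, and the probability assigned to each class. A minimal sketch of how the three pieces relate, using plain Python lists in place of tensors and assuming the classes are stored in sorted label order:

```python
# Values copied from the learn.predict(row) output above.
probs = [0.3134, 0.6866]

# Assumption: classes are in sorted label order, so index 0 is '<=50K'
# and index 1 is '>50K'; this matches the tensor(1) / '>50K' pairing above.
classes = ["<=50K", ">50K"]

# The predicted class is the one with the highest probability.
pred_idx = max(range(len(probs)), key=lambda i: probs[i])
pred_label = classes[pred_idx]

print(f"predicted: {pred_label} (index {pred_idx}), confidence {probs[pred_idx]:.1%}")
```

Here the model puts roughly 69% of the probability on `>50K`, so the prediction for this row is not a particularly confident one.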