<a href="https://colab.research.google.com/github/pranath/predict_charity_donors/blob/master/finding_donors3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Donors for a Charity - Part 3 - Deep learning

## Introduction

In an earlier __[project](https://github.com/pranath/predict_charity_donors/blob/master/finding_donors.ipynb)__ I developed a machine learning model to predict, from census data, which individuals earn over $50K. That project used conventional models, in particular a Gradient Boosting Classifier.

In a later [project](https://github.com/pranath/predict_charity_donors/blob/master/finding_donors2.ipynb) I used the fastai deep learning library for the prediction problem.

In this project I will use version 2 of the fastai library, the latest version, and compare the results with those of the earlier projects.

## Load libraries

In [1]:
!pip install fastai --upgrade -q # Upgrade to fastai v2
import fastai
from fastai.tabular.all import *
import pandas as pd

fastai.__version__

'2.2.7'

## Load data 

In [5]:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()

(#3) [Path('/root/.fastai/data/adult_sample/export.pkl'),Path('/root/.fastai/data/adult_sample/models'),Path('/root/.fastai/data/adult_sample/adult.csv')]

In [10]:
df = pd.read_csv(path/'adult.csv')
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | | Divorced | | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |


We have a mixture of numeric and categorical columns.
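
As a quick check of that mixture, the columns can be separated by dtype with pandas. This is a minimal sketch on a hand-built frame that copies a few columns from the first two rows shown above; the names `sample`, `numeric_cols` and `categorical_cols` are illustrative and not part of the original notebook. Note that `salary` lands in the categorical group because it is the text label we want to predict.

```python
import pandas as pd

# Two rows copied from the df.head() output above, limited to a few columns.
sample = pd.DataFrame({
    "age": [49, 44],
    "workclass": ["Private", "Private"],
    "fnlwgt": [101320, 236746],
    "education": ["Assoc-acdm", "Masters"],
    "hours-per-week": [40, 45],
    "salary": [">=50k", ">=50k"],
})

# Numeric columns are candidates for cont_names, the rest for cat_names.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the full dataset the same two calls could be run on `df` directly.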

In [11]:
# Create dataloader using factory method
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

In [13]:
# Create dataloader using tabular pandas class
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
# Preview data pre-processed by tabular pandas
to.xs.iloc[:2]

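| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 8681 | 5 | 16 | 5 | 2 | 4 | 5 | 1 | -1.14295 | 2.450554 | -0.033488 |
| 13173 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.50928 | 0.718655 | -0.033488 |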
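
The preview shows categorical columns turned into integer codes, an extra `education-num_na` column, and continuous columns rescaled around zero. The sketch below approximates those three steps in plain pandas to make the idea concrete. It is not fastai's implementation: the rows are made up, the variable names are hypothetical, and the details (code 0 for missing categories, median filling, population standard deviation) are assumptions about the default behaviour of `Categorify`, `FillMissing` and `Normalize`.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the dataset.
raw = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp-inc", None],
    "education-num": [12.0, 14.0, None, 15.0],
})

# Categorify-like step: map each category to an integer code.
# pandas uses -1 for missing values, so adding 1 leaves code 0 for "missing".
workclass_codes = raw["workclass"].astype("category").cat.codes + 1

# FillMissing-like step: record which values were missing, then fill with the median.
education_num_na = raw["education-num"].isna()
median = raw["education-num"].median()
education_num_filled = raw["education-num"].fillna(median)

# Normalize-like step: subtract the mean and divide by the standard deviation.
mean = education_num_filled.mean()
std = education_num_filled.std(ddof=0)
education_num_scaled = (education_num_filled - mean) / std

processed = pd.DataFrame({
    "workclass": workclass_codes,
    "education-num_na": education_num_na,
    "education-num": education_num_scaled,
})
print(processed)
```

In the real pipeline these statistics would be computed on the training split only and then reused for the validation split.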


In [15]:
# Create dataloader from tabular pandas object
dls = to.dataloaders(bs=64)
# Show some
dls.show_batch()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-inc | Bachelors | Widowed | Exec-managerial | Not-in-family | White | False | 46.0 | 200948.999623 | 13.0 | <50k |
| 1 | Local-gov | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 30.0 | 44565.99371 | 13.0 | <50k |
| 2 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 59.0 | 179594.000235 | 9.0 | <50k |
| 3 | Private | Assoc-acdm | Never-married | Handlers-cleaners | Own-child | Black | False | 22.000001 | 230703.998842 | 12.0 | <50k |
| 4 | Private | HS-grad | Divorced | Other-service | Unmarried | White | False | 32.0 | 153963.001214 | 9.0 | <50k |
| 5 | Private | Some-college | Never-married | Prof-specialty | Own-child | White | False | 20.999999 | 151158.000568 | 10.0 | <50k |
| 6 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 50.0 | 767403.001341 | 9.0 | >=50k |
| 7 | Private | Some-college | Married-civ-spouse | Craft-repair | Husband | White | False | 38.0 | 32271.001721 | 10.0 | <50k |
| 8 | Local-gov | Some-college | Never-married | Transport-moving | Unmarried | White | False | 63.999999 | 198728.000129 | 10.0 | <50k |
| 9 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | Black | False | 34.0 | 318640.993793 | 13.0 | >=50k |


In [21]:
# Create deep learning model
learn = tabular_learner(dls, metrics=accuracy)
# Fit the model (fine-tuning is pointless as we don't use a pre-trained model)
learn.fit_one_cycle(6)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.355588 | 0.375851 | 0.823864 | 00:07 |
| 1 | 0.353663 | 0.371332 | 0.826474 | 00:07 |
| 2 | 0.363593 | 0.357748 | 0.831235 | 00:07 |
| 3 | 0.353131 | 0.356008 | 0.836149 | 00:07 |
| 4 | 0.341748 | 0.354585 | 0.836763 | 00:07 |
| 5 | 0.32347 | 0.354438 | 0.835534 | 00:07 |
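
The `accuracy` column is the fraction of validation rows whose predicted class matches the true label. A minimal sketch of that calculation in plain Python follows; the `accuracy` helper and the five example labels are hypothetical and are not taken from the dataset or from fastai's code.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical class indices: 0 stands for <50k and 1 for >=50k.
predicted = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0]

score = accuracy(predicted, actual)
print(score)
```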


In [17]:
# Show some predictions
row, clas, probs = learn.predict(df.iloc[0])
row.show()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.00106 | 12.0 | >=50k |


In [18]:
clas, probs

(tensor(1), tensor([0.4300, 0.5700]))
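
The first value is the index of the predicted class and the second holds one probability per class. The sketch below shows how the two relate, using plain Python lists in place of tensors. The label order in `labels` is an assumption (it is what I would expect `dls.vocab` to contain, and it agrees with the `>=50k` shown by `row.show()`), and the variable names are illustrative.

```python
# Values copied from the prediction output above.
predicted_index = 1
probabilities = [0.43, 0.57]

# Assumed label order; not shown in the original notebook.
labels = ["<50k", ">=50k"]

# The predicted class is the position of the largest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = labels[predicted_index]

print(predicted_label, probabilities[predicted_index])
```

A probability of 0.57 is only slightly above one half, so this particular prediction is not a confident one.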

## Results

In our earlier study with the Gradient Boosting model, we achieved a best accuracy of 0.86. In the training run shown above, the fastai model reaches a validation accuracy of 0.82 after the first epoch and about 0.84 after six epochs, so a comparable level of accuracy.