# Charity data analysis

This project was originally developed in R for a machine learning course.
It is being redeveloped in Python 3 as a programming exercise.

The goal is to maximize the return on investment of a direct mailing by sending it to likely donors.

**To do:**
* Can x_df and c_df creation be captured in a function (DRY)?
* Outlier and influential point detection and management
* Missing value handling
* Adjustments for non-normal distributions in predictors and target
* Logistic regression
* GAM
* LDA (not usually used with qualitative variables, but included anyway as one of the given methods to try)
* QDA
* KNN classifier
* Decision tree
* Bagging and random forest
* Boosting
* SVC

In [1]:
import numpy as np
import pandas as pd

In [2]:
charity = pd.read_csv('charity.csv')

In [3]:
# Pause EDA until the environment issue with pandas_profiling is worked out.
# Work from known facts about the data from the previous project version; the dataset is the same file from 2016.

charity.head(5)

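|   | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat | ... | npro | tgif | lgif | rgif | tdon | tlag | agif | donr | damt | part |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 8 | ... | 20 | 81 | 81 | 19 | 17 | 6 | 21.05 | 0.0 | 0.0 | train |
| 1 | 2 | 0 | 0 | 1 | 0 | 1 | 2 | 4 | 0 | 8 | ... | 95 | 156 | 16 | 17 | 19 | 3 | 13.26 | 1.0 | 15.0 | train |
| 2 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | 5 | 1 | 8 | ... | 64 | 86 | 15 | 10 | 22 | 8 | 17.37 | NaN | NaN | test |
| 3 | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 8 | ... | 51 | 56 | 18 | 7 | 14 | 7 | 9.59 | NaN | NaN | test |
| 4 | 5 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 4 | ... | 85 | 132 | 15 | 10 | 10 | 6 | 12.07 | 1.0 | 17.0 | valid |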


In [4]:
charity.shape

(8009, 24)
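
In the `head()` output above, the rows tagged `test` have no `donr` or `damt` values. Since missing value handling is on the to-do list, a quick per-partition count of missing targets may be a useful first check. The sketch below is not part of the original notebook; it uses a small stand-in frame named `demo` (a hypothetical name) built from the five rows shown above, and the same expression could be applied to `charity` itself.

```python
import numpy as np
import pandas as pd

# Stand-in frame: target and partition columns copied from the head() rows above.
demo = pd.DataFrame({
    'donr': [0.0, 1.0, np.nan, np.nan, 1.0],
    'damt': [0.0, 15.0, np.nan, np.nan, 17.0],
    'part': ['train', 'train', 'test', 'test', 'valid'],
})

# Flag missing targets, then count them within each partition.
missing_by_part = demo[['donr', 'damt']].isna().groupby(demo['part']).sum()
print(missing_by_part)
```

If the counts for `train` and `valid` come back as zero on the full data, the missing targets are confined to the test partition and need no imputation.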

In [5]:
charity.columns

Index(['ID', 'reg1', 'reg2', 'reg3', 'reg4', 'home', 'chld', 'hinc', 'genf',
       'wrat', 'avhv', 'incm', 'inca', 'plow', 'npro', 'tgif', 'lgif', 'rgif',
       'tdon', 'tlag', 'agif', 'donr', 'damt', 'part'],
      dtype='object')

In [6]:
# The dataset comes with a 'part' column that assigns each row to the training, validation, or test split.

charity_train = charity.loc[charity['part'] == 'train']

In [7]:
charity_train.shape

(3984, 24)

In [8]:
# Select predictive features, dropping the ID value and the targets.
# iloc is zero-based and excludes the end position, so 1:21 selects the columns at positions 1-20 (reg1 through agif).

x_train = charity_train.iloc[:, 1:21]
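
The positional slice depends on the column order staying fixed. A label-based alternative, sketched here under the assumption that `ID`, `donr`, `damt` and `part` are the only non-predictor columns (as the `charity.columns` output suggests), drops those columns by name instead. The `demo` frame and the `non_features` list are hypothetical names used only for illustration.

```python
import pandas as pd

# Stand-in frame with a subset of the real column names.
demo = pd.DataFrame({
    'ID': [1, 2],
    'reg1': [0, 0],
    'agif': [21.05, 13.26],
    'donr': [0.0, 1.0],
    'damt': [0.0, 15.0],
    'part': ['train', 'train'],
})

# Drop everything that is not a predictor, by name rather than position.
non_features = ['ID', 'donr', 'damt', 'part']
x_by_label = demo.drop(columns=non_features)

# For this stand-in frame the equivalent positional slice is 1:3.
x_by_position = demo.iloc[:, 1:3]
print(x_by_label.equals(x_by_position))
```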

In [9]:
x_train.head(5)

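|   | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag | agif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 8 | 302 | 76 | 82 | 0 | 20 | 81 | 81 | 19 | 17 | 6 | 21.05 |
| 1 | 0 | 0 | 1 | 0 | 1 | 2 | 4 | 0 | 8 | 262 | 130 | 130 | 1 | 95 | 156 | 16 | 17 | 19 | 3 | 13.26 |
| 5 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 0 | 9 | 114 | 17 | 25 | 44 | 83 | 131 | 5 | 3 | 13 | 4 | 4.12 |
| 9 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 1 | 7 | 200 | 38 | 58 | 5 | 42 | 63 | 12 | 10 | 19 | 3 | 9.42 |
| 11 | 0 | 0 | 0 | 1 | 1 | 3 | 4 | 1 | 6 | 272 | 69 | 69 | 0 | 98 | 169 | 29 | 36 | 23 | 7 | 8.97 |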


In [10]:
# Create a label vector to hold donr values

c_train = charity_train.iloc[:, 21]

In [30]:
c_train.head(10)

0     0.0
1     1.0
5     1.0
9     0.0
11    1.0
12    1.0
16    1.0
18    1.0
23    0.0
25    0.0
Name: donr, dtype: float64

In [28]:
c_train_len = len(c_train)
c_train_len

3984

In [31]:
# Create response variable showing donation amounts for known donors.

y_train = charity_train.loc[charity_train['donr'] == 1, ['damt']]

In [32]:
y_train_len = len(y_train)
y_train_len

1995

In [12]:
charity_valid = charity.loc[charity['part'] == 'valid']

In [13]:
charity_valid.shape

(2018, 24)

In [16]:
x_valid = charity_valid.iloc[:, 1:21]

In [17]:
x_valid.head()

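|   | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat | avhv | incm | inca | plow | npro | tgif | lgif | rgif | tdon | tlag | agif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 4 | 295 | 39 | 71 | 14 | 85 | 132 | 15 | 10 | 10 | 6 | 12.07 |
| 6 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 0 | 8 | 145 | 39 | 42 | 10 | 50 | 74 | 6 | 5 | 22 | 3 | 6.5 |
| 7 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 0 | 5 | 165 | 34 | 35 | 19 | 11 | 41 | 4 | 2 | 20 | 7 | 3.45 |
| 10 | 0 | 0 | 1 | 0 | 1 | 3 | 2 | 1 | 8 | 152 | 46 | 46 | 20 | 100 | 414 | 25 | 14 | 39 | 7 | 10.12 |
| 13 | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 0 | 8 | 108 | 21 | 36 | 32 | 54 | 117 | 5 | 4 | 15 | 9 | 5.11 |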


In [18]:
c_valid = charity_valid.iloc[:, 21]

In [19]:
c_valid.head()

4     1.0
6     0.0
7     0.0
10    0.0
13    1.0
Name: donr, dtype: float64

In [14]:
charity_test = charity.loc[charity['part'] == 'test']

In [15]:
charity_test.shape

(2007, 24)
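
The first to-do item asks whether the x and c frame creation can be captured in a function. One possible shape for that is sketched below. `split_partition` and `NON_FEATURES` are hypothetical names, not part of the original notebook, and the function also returns the donor-only `damt` frame for every partition, which the cells above build only for training. The `demo` frame stands in for `charity` and copies a few columns from the first five rows shown by `head()`.

```python
import numpy as np
import pandas as pd

# Columns that are not predictors (names taken from charity.columns).
NON_FEATURES = ['ID', 'donr', 'damt', 'part']


def split_partition(df, part):
    """Return (x, c, y) for one partition label: 'train', 'valid' or 'test'.

    Hypothetical helper sketched for the DRY to-do item.
    """
    subset = df.loc[df['part'] == part]
    x = subset.drop(columns=NON_FEATURES)           # predictors only
    c = subset['donr']                              # donor / non-donor labels
    y = subset.loc[subset['donr'] == 1, ['damt']]   # donation amounts, donors only
    return x, c, y


# Stand-in for the real `charity` frame.
demo = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'reg1': [0, 0, 0, 0, 0],
    'agif': [21.05, 13.26, 17.37, 9.59, 12.07],
    'donr': [0.0, 1.0, np.nan, np.nan, 1.0],
    'damt': [0.0, 15.0, np.nan, np.nan, 17.0],
    'part': ['train', 'train', 'test', 'test', 'valid'],
})

x_train, c_train, y_train = split_partition(demo, 'train')
x_valid, c_valid, y_valid = split_partition(demo, 'valid')
x_test, c_test, y_test = split_partition(demo, 'test')
print(x_train.shape, len(c_train), len(y_train))
```

For the test partition the labels are missing, so `c_test` holds only NaN and `y_test` comes back empty; only `x_test` is meaningful there.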