# Charity data analysis

This project was originally developed in R for a machine learning course.
It is being redeveloped in Python 3 as a programming exercise.

The goal is to maximize the return on investment of a direct mailing by sending it only to likely donors.

**To do:**
* Can x_df and c_df creation be captured in a function (DRY)?
* pop vs iloc?
* Outlier and influential point detection and management
* Missing value handling
* Adjustments for non-normal distributions in predictors and target
* Logistic regression
* GAM
* LDA (not usually used with qualitative variables, but included anyway as a given method to try)
* QDA
* KNN classifier
* Decision tree
* Bagging and random forest
* Boosting
* SVC
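
For the first to-do item, a minimal sketch of one way the per-partition setup could be folded into a single helper. The function name `split_partition`, its parameters, and the toy frame are hypothetical; only the column names `ID`, `part`, `donr`, and `damt` come from the dataset shown in this notebook.

```python
import pandas as pd


def split_partition(df, part, id_cols=("ID", "part"), label_col="donr", amount_col="damt"):
    """Return (x, c, y) for one partition: predictors, donor labels, donor amounts.

    Hypothetical helper; assumes the frame carries a 'part' column naming the split.
    """
    subset = df.loc[df["part"] == part].drop(columns=list(id_cols))
    c = subset[label_col]                                # classification target
    y = subset.loc[subset[label_col] == 1, amount_col]   # amounts for known donors only
    x = subset.drop(columns=[label_col, amount_col])     # everything else is a predictor
    return x, c, y


# Toy stand-in for the charity frame (values are made up for illustration).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "reg1": [0, 1, 0, 1],
    "hinc": [4, 5, 4, 3],
    "donr": [0.0, 1.0, None, 1.0],
    "damt": [0.0, 15.0, None, 17.0],
    "part": ["train", "train", "test", "valid"],
})

x_train, c_train, y_train = split_partition(toy, "train")
```

Selecting by column name rather than by position keeps the helper valid even if columns are added or dropped earlier in the notebook.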

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing as pp
from sklearn.linear_model import LogisticRegressionCV
from sklearn.compose import ColumnTransformer

%matplotlib inline

In [2]:
charity = pd.read_csv('charity.csv')

In [10]:
# Pause EDA until environmental issue with pandas_profiling can be worked out.
# Work from known facts about the data from previous project version; dataset is the same file from 2016.

charity.head(5)

Unnamed: 0,ID,reg1,reg2,reg3,reg4,home,chld,hinc,genf,wrat,...,npro,tgif,lgif,rgif,tdon,tlag,agif,donr,damt,part
0,1,0,0,1,0,1,1,4,1,8,...,20,81,81,19,17,6,21.05,0.0,0.0,train
1,2,0,0,1,0,1,2,4,0,8,...,95,156,16,17,19,3,13.26,1.0,15.0,train
2,3,0,0,1,0,1,1,5,1,8,...,64,86,15,10,22,8,17.37,,,test
3,4,0,0,0,0,1,1,4,0,8,...,51,56,18,7,14,7,9.59,,,test
4,5,0,0,1,0,1,0,4,1,4,...,85,132,15,10,10,6,12.07,1.0,17.0,valid


In [4]:
# Dataset pre-prepared with designations for training/validation/testing split.
# Normal response rate is around 10%, and training/validation sets have oversampled donors to address class imbalance.

charity_train = charity.loc[charity['part'] == 'train']
charity_train = charity_train.drop(columns = ['ID', 'part'])
c_train = charity_train.pop('donr').values
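
On the "pop vs iloc" question from the to-do list, a small sketch of the practical difference, using a made-up three-column frame rather than the charity data: `iloc` only reads, while `pop` removes the column from the frame in place, which shifts the position of every column after it.

```python
import pandas as pd

# Toy frame; values are illustrative only.
frame = pd.DataFrame({"reg1": [0, 1], "donr": [1.0, 0.0], "damt": [15.0, 0.0]})

# iloc is a read-only selection: frame still has all three columns afterwards.
by_position = frame.iloc[:, 1]
columns_after_iloc = list(frame.columns)

# pop returns the column and deletes it from frame in place.
popped = frame.pop("donr")
columns_after_pop = list(frame.columns)

# After the pop, position 1 now refers to damt, not donr.
now_at_position_1 = frame.columns[1]
```

Any later cell that slices the same frame by position has to account for the columns that `pop` and `drop` have already removed.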

In [17]:
log_train = charity_train.copy()
log_train['hm_ch_int'] = log_train['home'] * log_train['chld']
log_train['incm_tgif_int'] = log_train['incm'] * log_train['tgif']
log_train['hinc_sq'] = np.square(log_train['hinc'])
log_train = log_train.drop(columns = ['home', 'chld', 'hinc', 'incm', 'tgif'])
log_train.head()

Unnamed: 0,reg1,reg2,reg3,reg4,genf,wrat,avhv,inca,plow,npro,lgif,rgif,tdon,tlag,agif,damt,hm_ch_int,incm_tgif_int,hinc_sq
0,0,0,1,0,1,8,302,82,0,20,81,19,17,6,21.05,0.0,1,6156,16
1,0,0,1,0,0,8,262,130,1,95,16,17,19,3,13.26,15.0,2,20280,16
5,0,1,0,0,0,9,114,25,44,83,5,3,13,4,4.12,12.0,1,2227,25
9,0,0,0,0,1,7,200,58,5,42,12,10,19,3,9.42,0.0,3,2394,16
11,0,0,0,1,1,6,272,69,0,98,29,36,23,7,8.97,17.0,3,11661,16


In [None]:
# TESTING: ColumnTransformer with logistic regression features identified in my course paper
# Paper uses log transform to normalize data; sklearn has Box-Cox and Yeo-Johnson transforms.

column_trans = ColumnTransformer(
    [('incm_bc', pp.PowerTransformer(method='box-cox', standardize=False), ['incm']),
    ('tgif_bc', pp.PowerTransformer(method='box-cox', standardize=False), ['tgif'])],
    remainder='passthrough')

log_train = column_trans.fit_transform(charity_train)
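
As a rough sketch of how the log transform from the paper could sit beside the Box-Cox version in the same `ColumnTransformer` pattern: the toy values below are made up, and `FunctionTransformer(np.log)` is one possible way to express the log transform, not necessarily what the paper's R code did. Both transforms assume strictly positive inputs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Toy stand-in for two skewed predictors plus one untouched column.
toy = pd.DataFrame({
    "incm": [20.0, 35.0, 50.0, 120.0, 300.0],
    "tgif": [30.0, 60.0, 90.0, 150.0, 800.0],
    "chld": [0, 1, 2, 0, 3],
})

log_trans = ColumnTransformer(
    [("log", FunctionTransformer(np.log), ["incm", "tgif"])],
    remainder="passthrough",
)
bc_trans = ColumnTransformer(
    [("bc", PowerTransformer(method="box-cox", standardize=False), ["incm", "tgif"])],
    remainder="passthrough",
)

log_out = log_trans.fit_transform(toy)
bc_out = bc_trans.fit_transform(toy)

# The result is a NumPy array: transformed columns come first, then the
# passthrough columns in their original order, so names are re-attached by hand.
out_cols = ["incm", "tgif", "chld"]
log_df = pd.DataFrame(log_out, columns=out_cols, index=toy.index)
bc_df = pd.DataFrame(bc_out, columns=out_cols, index=toy.index)

# Fitted Box-Cox lambdas, one per transformed column; a value near 0 means
# the fit is close to a plain log transform.
lambdas = bc_trans.named_transformers_["bc"].lambdas_
```

On the real frame, `incm` and `tgif` would move to the front of the output, so the column list has to be rebuilt in that order before wrapping the array in a DataFrame.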

In [None]:
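# Select predictive features, excluding the ID value and the targets.
# ID, part, and donr have already been removed from charity_train above, so column positions have shifted.
# Select the predictor block (reg1 through agif) by label rather than by position.

x_train = charity_train.loc[:, 'reg1':'agif']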

In [None]:
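# Create a label vector to hold donr values.
# donr was already popped from charity_train above, so take it by label from the full frame.

c_train = charity.loc[charity['part'] == 'train', 'donr']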

In [None]:
c_train_len = len(c_train)

In [None]:
# Create response variable showing donation amounts for known donors.

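# donr is no longer a column of charity_train, so filter on the full frame.
y_train = charity.loc[(charity['part'] == 'train') & (charity['donr'] == 1), ['damt']]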

In [None]:
y_train_len = len(y_train)

In [None]:
charity_valid = charity.loc[charity['part'] == 'valid']

In [None]:
x_valid = charity_valid.iloc[:, 1:21]

In [None]:
c_valid = charity_valid.iloc[:, 21]

In [None]:
y_valid = charity_valid[(charity_valid.donr == 1)][['damt']]

In [None]:
y_valid_len = len(y_valid)
y_valid_len

In [None]:
charity_test = charity.loc[charity['part'] == 'test']

In [None]:
x_test = charity_test.iloc[:, 1:21]

In [None]:
# Standardize features to zero mean and unit standard deviation for algorithms that require standardization.

df_list = [x_train, x_test, x_valid]

In [None]:
scaler = pp.StandardScaler()  # sklearn.preprocessing is imported as pp above

In [None]:
x_train_std = scaler.fit_transform(x_train)  # Returns a NumPy array; still needs to be sent to a DataFrame

In [None]:
x_valid_std = scaler.transform(x_valid)  # Reuse the training fit; do not refit on validation data

In [None]:
x_test_std = scaler.transform(x_test)  # Reuse the training fit; do not refit on test data
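
A minimal sketch of one common standardization pattern, assuming toy frames in place of the real partitions: fit the scaler on the training features only, apply the same fit to the other partitions, and wrap each result back into a DataFrame. The helper `to_frame` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for x_train and x_valid; the indices mimic non-contiguous rows.
x_train_toy = pd.DataFrame(
    {"avhv": [100.0, 200.0, 300.0], "npro": [10.0, 20.0, 60.0]}, index=[0, 1, 5]
)
x_valid_toy = pd.DataFrame(
    {"avhv": [200.0, 400.0], "npro": [30.0, 30.0]}, index=[4, 7]
)

# Learn mean and standard deviation from the training partition only.
scaler = StandardScaler().fit(x_train_toy)


def to_frame(scaled, like):
    """Re-attach the column names and row index that the scaler strips off."""
    return pd.DataFrame(scaled, columns=like.columns, index=like.index)


x_train_std = to_frame(scaler.transform(x_train_toy), x_train_toy)
x_valid_std = to_frame(scaler.transform(x_valid_toy), x_valid_toy)
```

Only the training columns end up with exactly zero mean and unit variance; the validation and test columns are expressed in training units, which is what keeps information from those partitions out of the fit.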