# Charity data analysis

This project was originally developed in R for a machine learning course.
It is being redeveloped in Python 3 as a programming exercise.

The goal is to maximize the return on investment of a targeted mailing by identifying and mailing likely donors.

**To do:**
* Can x_df and c_df creation be captured in a function (DRY)?
* Outlier and influential point detection and management
* Missing value handling
* Adjustments for non-normal distributions in predictors and target
* Logistic regression
* GAM
* LDA (not usually used with qualitative variables, but included anyway as one of the given methods to try)
* QDA
* KNN classifier
* Decision tree
* Bagging and random forest
* Boosting
* SVC

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegressionCV
from sklearn.compose import ColumnTransformer

%matplotlib inline

In [2]:
charity = pd.read_csv('charity.csv')

In [3]:
# EDA is paused until the environment issue with pandas_profiling is resolved.
# For now, work from known facts about the data from the previous project version; the dataset is the same file from 2016.

charity.head(5)

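|   | ID | reg1 | reg2 | reg3 | reg4 | home | chld | hinc | genf | wrat | ... | npro | tgif | lgif | rgif | tdon | tlag | agif | donr | damt | part |
|---|----|------|------|------|------|------|------|------|------|------|-----|------|------|------|------|------|------|-------|------|------|-------|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 1 | 8 | ... | 20 | 81 | 81 | 19 | 17 | 6 | 21.05 | 0.0 | 0.0 | train |
| 1 | 2 | 0 | 0 | 1 | 0 | 1 | 2 | 4 | 0 | 8 | ... | 95 | 156 | 16 | 17 | 19 | 3 | 13.26 | 1.0 | 15.0 | train |
| 2 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | 5 | 1 | 8 | ... | 64 | 86 | 15 | 10 | 22 | 8 | 17.37 | NaN | NaN | test |
| 3 | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 0 | 8 | ... | 51 | 56 | 18 | 7 | 14 | 7 | 9.59 | NaN | NaN | test |
| 4 | 5 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1 | 4 | ... | 85 | 132 | 15 | 10 | 10 | 6 | 12.07 | 1.0 | 17.0 | valid |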


In [6]:
# Dataset pre-prepared with designations for training/validation/testing split.
# Normal response rate is around 10%, and training/validation sets have oversampled donors to address class imbalance.

charity_train = charity.loc[charity['part'] == 'train']
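
The oversampling noted in the comment above can be checked by computing the donor share in each partition. This is a minimal sketch on a toy frame that only borrows the `donr` and `part` column names; the values are made up for illustration and are not the charity data.

```python
import pandas as pd

# Toy stand-in for `charity` (made-up values, not the real data).
toy = pd.DataFrame({
    'donr': [1.0, 0.0, 1.0, 0.0, None, None, 1.0, 0.0],
    'part': ['train', 'train', 'train', 'train', 'test', 'test', 'valid', 'valid'],
})

# The mean of a 0/1 flag is the donor share.
# Test rows carry no label, so their share comes out as NaN.
donor_share = toy.groupby('part')['donr'].mean()
print(donor_share)
```

On the real data, a share near 0.5 in `train` and `valid` would be consistent with the oversampling described above.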

In [51]:
# TESTING: ColumnTransformer with logistic regression features identified in my course paper
# Paper uses log transform to normalize data; sklearn has Box-Cox and Yeo-Johnson transforms.

column_trans = ColumnTransformer(
    [('incm_bc', preprocessing.PowerTransformer(method='box-cox', standardize=False), ['incm']),
    ('tgif_bc', preprocessing.PowerTransformer(method='box-cox', standardize=False), ['tgif'])],
    remainder='passthrough')

test = column_trans.fit_transform(charity_train)

ImportError: cannot import name 'PowerTransformer'
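
`PowerTransformer` was added in scikit-learn 0.20, so an error like this usually points to an older install. Until the environment is upgraded, the log transform used in the course paper can be applied with NumPy alone (Box-Cox with a lambda of 0 reduces to the natural log). This is a minimal sketch, assuming `incm` and `tgif` are strictly positive; the toy values are made up and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for charity_train (made-up values, not the real data).
toy_train = pd.DataFrame({
    'incm': [20.0, 45.0, 110.0],
    'tgif': [56.0, 81.0, 156.0],
    'chld': [1, 2, 0],
})

log_cols = ['incm', 'tgif']  # the columns the notebook passes to Box-Cox

# The natural log is only defined for positive values, so fail fast otherwise.
assert (toy_train[log_cols] > 0).all().all()

# Transform a copy so the untransformed frame stays available.
toy_log = toy_train.copy()
toy_log[log_cols] = np.log(toy_train[log_cols])
```

Columns not listed in `log_cols` pass through unchanged, which mirrors `remainder='passthrough'` in the cell above.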

In [8]:
# Select predictive features, dropping ID value and target.
# iloc is zero-based and the slice end is exclusive, so 1:21 selects columns 1-20 (reg1 through agif).

x_train = charity_train.iloc[:, 1:21]

In [10]:
# Create a label vector to hold donr values

c_train = charity_train.iloc[:, 21]

In [12]:
c_train_len = len(c_train)
c_train_len

3984

In [13]:
# Create response variable showing donation amounts for known donors.

y_train = charity_train[(charity_train.donr == 1)][['damt']]

In [14]:
y_train_len = len(y_train)
y_train_len

1995

In [15]:
charity_valid = charity.loc[charity['part'] == 'valid']

In [17]:
x_valid = charity_valid.iloc[:, 1:21]

In [19]:
c_valid = charity_valid.iloc[:, 21]

In [21]:
y_valid = charity_valid[(charity_valid.donr == 1)][['damt']]

In [22]:
y_valid_len = len(y_valid)
y_valid_len

999

In [23]:
charity_test = charity.loc[charity['part'] == 'test']

In [25]:
x_test = charity_test.iloc[:, 1:21]
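
The train/valid/test cells above repeat the same three steps, which is what the DRY item in the to-do list asks about. One possible shape for a shared helper is sketched below. `split_part` and its parameters are hypothetical names, not part of the original notebook, and it selects columns by name rather than by position as the `iloc` calls do. The toy values are made up.

```python
import pandas as pd

def split_part(df, part, target='donr', amount='damt', drop=('ID', 'part')):
    """Return (x, c, y) for one partition. Hypothetical helper for illustration."""
    subset = df.loc[df['part'] == part]
    x = subset.drop(columns=[target, amount, *drop])  # predictors only
    c = subset[target]                                # donor flag
    y = subset.loc[subset[target] == 1, [amount]]     # amounts for known donors
    return x, c, y

# Toy frame with made-up values to exercise the helper.
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'reg1': [0, 1, 0, 1],
    'donr': [0.0, 1.0, None, 1.0],
    'damt': [0.0, 15.0, None, 17.0],
    'part': ['train', 'train', 'test', 'valid'],
})

x_tr, c_tr, y_tr = split_part(toy, 'train')
```

For the test partition, `c` is all NaN and `y` is empty, since those rows have no labels.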

In [27]:
# Standardize features to zero mean and unit standard deviation for algorithms that require standardization.

df_list = [x_train, x_test, x_valid]

In [31]:
scaler = preprocessing.StandardScaler()

In [32]:
x_train_std = scaler.fit_transform(x_train[x_train.columns]) # TODO: convert the returned NumPy array back to a DataFrame

In [37]:
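# Reuse the mean and standard deviation learned from the training set; do not refit on validation data.
x_valid_std = scaler.transform(x_valid[x_valid.columns])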

In [36]:
# Reuse the training-set scaling here as well, and store the result under its own name.
x_test_std = scaler.transform(x_test[x_test.columns])
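
Scaling statistics are usually learned from the training partition only and then reused for the other partitions, so that no information from validation or test data leaks into preprocessing. The sketch below shows that pattern and wraps the returned arrays back into DataFrames. The toy frames and their values are made up; only the column names come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy predictors (made-up values); non-contiguous indices mimic a filtered partition.
toy_train = pd.DataFrame(
    {'hinc': [4.0, 5.0, 3.0, 4.0], 'npro': [20.0, 95.0, 64.0, 51.0]},
    index=[0, 1, 5, 8],
)
toy_valid = pd.DataFrame({'hinc': [4.0, 2.0], 'npro': [85.0, 30.0]}, index=[4, 9])

scaler = StandardScaler()

# Fit on the training data only, then keep column names and row labels.
train_std = pd.DataFrame(
    scaler.fit_transform(toy_train), columns=toy_train.columns, index=toy_train.index
)

# transform() reuses the training mean and standard deviation.
valid_std = pd.DataFrame(
    scaler.transform(toy_valid), columns=toy_valid.columns, index=toy_valid.index
)
```

Keeping the original index makes it possible to align the standardized rows with `c_train` or `y_train` later.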