# Finding Donors for *CharityML*
## Feature Engineering
### Kebei Jiang 06/04/2019

### Goal  
 * Benchmark feature engineering: standard normalization and scaling only, with no features discarded or regrouped  
 * EDA-inspired feature engineering: features discarded or regrouped based on the exploratory analysis

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

#plt.style.use('ggplot')
%matplotlib inline

sns.set(color_codes=True)

In [2]:
# Load the Census dataset
data = pd.read_csv("census.csv")
ft_num = data.select_dtypes(include=['int64','float64']).columns.values
ft_cat = data.select_dtypes(exclude=['int64','float64']).columns.values

## 1st round EDA decisions

| feature-numerical | 1st round decision        |
|-------------------|---------------------------|
| age               | as-is                     |
| education-num     | as-is                     |
| capital-gain      | divide into zero/non-zero |
| capital-loss      | divide into zero/non-zero |
| hours-per-week    | as-is                     |

| feature-categorical | 1st round decision                                                                                         | reasoning                                                         |
|---------------------|------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
| workclass           | {without-pay}, {*-gov, private, self-emp-not-inc}, {self-emp-inc}                                          | '-gov', 'private' and 'self-emp-not-inc' are all paid employees   |
| occupation          | {Exec-managerial, prof-specialty}, {farming, handlers, machine-op, other-service, priv-house-serv}, {rest} | based on income classes ratio                                     |
| marital-status      | {Married-AF-spouse, Married-civ-spouse}, {rest}                                                            |                                                                   |
| relationship        | {Husband, Wife}, {rest}                                                                                    | correlation with marital-status, should include both or just one? |
| race                | {Asian-Pac-Islander, white}, {rest}                                                                        | based on income classes ratio                                     |
| sex                 | as-is                                                                                                      |                                                                   |
| native-country      | drop                                                                                                       | just assume everyone is from US                                   |
| education_level     | drop                                                                                                       | duplicates education-num                                          |
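
Several of the groupings above are described as "based on income classes ratio". A minimal sketch of how such a ratio could be computed with pandas is shown below. The helper name `high_income_ratio` and the toy DataFrame are illustrative assumptions, not part of the original notebook; only the column names and the leading-space label style mirror `census.csv`.

```python
import pandas as pd

def high_income_ratio(df, feature, target="income", positive=">50K"):
    """Return the share of `positive` target rows within each category of `feature`."""
    is_positive = df[target].eq(positive)
    return is_positive.groupby(df[feature]).mean().sort_values(ascending=False)

# Toy data for illustration only (not taken from census.csv)
toy = pd.DataFrame({
    "race": [" White", " White", " Black", " Black", " Other", " White"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

ratios = high_income_ratio(toy, "race")
print(ratios)
```

Categories with similar ratios are natural candidates for merging into one group, which is the idea behind the `high`/`low` split for `race` and the `low`/`mid`/`high` split for `occupation`.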

 [ref 1](http://scg.sdsu.edu/dataset-adult_r/)
 * Capital-gain/loss into low/high groups
 * combine government works; self-employed...  
 * 'occupation' to 'blue collar' and 'white collar'  
 * 'native-country' into continents  
 * scaling/normalizing features  
 * put all feature engineering into a function

[ref 2](https://faculty.biu.ac.il/~yahavi1/Projects/CP2010T1_rep.pdf)  
 * visualize DT  
 * average hours-per-week w.r.t. Gender (married or not)  
 * check predictive error in different classes  

[ref 3](http://rstudio-pubs-static.s3.amazonaws.com/265200_a8d21a65d3d34b979c5aafb0de10c221.html)  
Capital gain:

We mark all values of “capital_gain” which are less than the first quartile of the nonzero capital gain (which is equal to 3464) as “Low”; all values that are between the first and third quartile (between 3464 and 14080) - as “Medium”; and all values greater than or equal to the third quartile are marked as “High”.
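
The quartile-based binning quoted above could be sketched in pandas as follows. The helper name `bin_capital` and the toy values are assumptions for illustration. The thresholds are computed from the nonzero values rather than hard-coded, so they would only equal 3464 and 14080 on the data used in ref 3. Zero values fall below the first quartile and therefore end up in "Low" here.

```python
import numpy as np
import pandas as pd

def bin_capital(values):
    """Label values Low/Medium/High using quartiles of the nonzero entries."""
    values = pd.Series(values, dtype=float)
    nonzero = values[values > 0]
    q1, q3 = nonzero.quantile([0.25, 0.75])
    # right=False makes the bins [-inf, q1), [q1, q3), [q3, inf)
    return pd.cut(values,
                  bins=[-np.inf, q1, q3, np.inf],
                  labels=["Low", "Medium", "High"],
                  right=False)

# Made-up capital-gain values, not taken from the census data
toy_gain = [0, 0, 1000, 2000, 3000, 4000, 5000]
binned = bin_capital(toy_gain)
print(binned.tolist())
```

Note that `pd.cut` requires strictly increasing bin edges, so this sketch would need adjusting for a column whose first and third nonzero quartiles coincide.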


Asia_East <- c(" Cambodia", " China", " Hong", " Laos", " Thailand",
               " Japan", " Taiwan", " Vietnam")

Asia_Central <- c(" India", " Iran")

Central_America <- c(" Cuba", " Guatemala", " Jamaica", " Nicaragua", 
                     " Puerto-Rico",  " Dominican-Republic", " El-Salvador", 
                     " Haiti", " Honduras", " Mexico", " Trinadad&Tobago")

South_America <- c(" Ecuador", " Peru", " Columbia")


Europe_West <- c(" England", " Germany", " Holand-Netherlands", " Ireland", 
                 " France", " Greece", " Italy", " Portugal", " Scotland")

Europe_East <- c(" Poland", " Yugoslavia", " Hungary")
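
The R vectors above group `native-country` values into regions. A rough pandas equivalent is sketched below for two of the groups. The helper name `country_to_region`, the `Other` fallback label, and the toy Series are assumptions made for illustration; the country strings (including leading spaces and original spellings) are copied from the R vectors.

```python
import pandas as pd

# Region groups copied from the R vectors above (only two shown for brevity)
REGIONS = {
    "Asia_Central": [" India", " Iran"],
    "South_America": [" Ecuador", " Peru", " Columbia"],
}

# Invert into a flat {country: region} lookup
COUNTRY_TO_REGION = {
    country: region
    for region, countries in REGIONS.items()
    for country in countries
}

def country_to_region(countries, fallback="Other"):
    """Map each country to its region; unlisted countries get `fallback` (an assumed label)."""
    return countries.map(COUNTRY_TO_REGION).fillna(fallback)

toy_countries = pd.Series([" India", " Peru", " United-States"])
regions = country_to_region(toy_countries)
print(regions.tolist())
```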

----

## benchmark feature engineering

In [3]:
from sklearn.preprocessing import MinMaxScaler

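def ft_num_engineer(data, ft_num):
    """Log-transform the capital features, then min-max scale all columns in `ft_num`.

    Note: `data` is modified in place, so pass a copy to keep the original intact.
    """

    # logarithmic transform on 'capital-gain' and 'capital-loss'
    data['capital-gain'] = np.log(data['capital-gain'] + 1)
    data['capital-loss'] = np.log(data['capital-loss'] + 1)

    # scale the numerical features to [0, 1]
    # the scaler works on multiple features simultaneously
    scaler = MinMaxScaler()
    data[ft_num] = scaler.fit_transform(data[ft_num])

    return data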
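
To see what the two steps in `ft_num_engineer` do to a skewed column, the toy example below applies the same log transform and min-max scaling to a handful of made-up capital-gain values. The values and the column names `log` and `scaled` are illustrative assumptions, not rows from `census.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({"capital-gain": [0.0, 9.0, 99.0, 9999.0]})

# log(x + 1) compresses the long right tail while keeping zero at zero
toy["log"] = np.log(toy["capital-gain"] + 1)

# min-max scaling maps the smallest value to 0 and the largest to 1
toy["scaled"] = MinMaxScaler().fit_transform(toy[["log"]]).ravel()

print(toy.round(3))
```

Because the column minimum is 0 in this example, zero remains exactly 0 after both steps. That is what allows a later `x==0` check to separate zero from non-zero capital values.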

In [36]:
def ft_cat_eda(data, workclass, occupation, marital, relationship, race):
    """Regroup categorical features following the 1st round EDA decisions.

    Each boolean flag switches the regrouping of the corresponding feature on or off.
    """

    # workclass: unpaid / paid employee / business owner
    workclass_dict = {' Without-pay': 'without-pay', ' State-gov': 'employee', ' Federal-gov': 'employee',
                      ' Local-gov': 'employee', ' Private': 'employee', ' Self-emp-not-inc': 'employee',
                      ' Self-emp-inc': 'owner'}
    # occupation: one income-level code per label, in sorted (alphabetical) order of the labels
    occupation_income = pd.Series([0, 1, 1, 2, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1]).map({0:'low', 1:'mid', 2:'high'})
    occupation_dict = dict(zip(sorted(data['occupation'].unique()), occupation_income))
    # marital-status: Married-civ-spouse / Married-AF-spouse -> 'couple', everything else -> 'single'
    marital_dict = {x: 'couple' if x in (' Married-civ-spouse', ' Married-AF-spouse') else 'single'
                    for x in data['marital-status'].unique()}
    # relationship: Husband / Wife -> 'couple', everything else -> 'single'
    relationship_dict = {x: 'couple' if x in (' Husband', ' Wife') else 'single'
                         for x in data['relationship'].unique()}
    # race
    race_dict = {' White': 'high', ' Asian-Pac-Islander': 'high', ' Black':'low', ' Amer-Indian-Eskimo':'low', ' Other':'low'}

    # keep only the mappings whose flag is switched on
    # (named `flags` rather than `filter` to avoid shadowing the built-in)
    flags = [workclass, occupation, marital, relationship, race]
    fts = ['workclass', 'occupation', 'marital-status', 'relationship', 'race']
    dicts = [workclass_dict, occupation_dict, marital_dict, relationship_dict, race_dict]
    replace_dict = {ft: d for ft, d, flag in zip(fts, dicts, flags) if flag}

    # the nested form {column: {old: new}} restricts each mapping to its own column
    data = data.replace(replace_dict)

    return data

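def ft_engineer(data, ft_num, capital=False, workclass=False, occupation=False, marital=False,
                relationship=False, race=False, drop_native_country=False):
    """Apply numerical scaling plus the optional EDA-inspired regrouping; return (features, target)."""

    # numerical engineering
    # note: the numerical columns are re-derived from `data` here, so the `ft_num` argument is overwritten
    ft_num = data.select_dtypes(include=['int64','float64']).columns.values
    data = ft_num_engineer(data, ft_num)

    # optionally binarize 'capital-gain' and 'capital-loss' into zero ('no') / non-zero ('yes');
    # a raw value of 0 is still 0 after the log transform and scaling when the column minimum is 0
    if capital:
        data['capital-gain'] = data['capital-gain'].apply(lambda x: 'no' if x==0 else 'yes')
        data['capital-loss'] = data['capital-loss'].apply(lambda x: 'no' if x==0 else 'yes')

    # EDA-suggested regrouping of the categorical features
    data = ft_cat_eda(data, workclass, occupation, marital, relationship, race)

    # optionally drop 'native-country'
    if drop_native_country:
        data.drop('native-country', axis=1, inplace=True)

    # target: 1 for '<=50K', 0 otherwise (so '<=50K' is the positive class in the scores)
    target = np.array(data['income'] == '<=50K').astype(int)

    # drop 'education_level' and the label column, then one-hot encode the remaining categoricals
    data.drop(['education_level', 'income'], axis=1, inplace=True)
    data = pd.get_dummies(data)

    return data, target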
    

----

## model selection

In [38]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the 'features' and 'income' data into training and testing sets
def xy_split(data, target, random_state):
    X_train, X_test, y_train, y_test = train_test_split(data, 
                                                    target, 
                                                    test_size = 0.2, 
                                                    random_state = random_state)
    # Show the results of the split
    print("Training set has {} samples.".format(X_train.shape[0]))
    print("Testing set has {} samples.".format(X_test.shape[0]))

    return X_train, X_test, y_train, y_test

In [88]:
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

def train_predict(data, target, clf):
    """Fit `clf` on an 80/20 split and print the test accuracy followed by the F0.5 score."""

    X_train, X_test, y_train, y_test = xy_split(data, target, 10)

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # accuracy first, then F-beta with beta=0.5 (precision weighted more than recall)
    print(accuracy_score(y_test, y_pred))
    print(fbeta_score(y_test, y_pred, beta=0.5))
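
`train_predict` reports accuracy and the F-beta score with `beta=0.5`, which weights precision more heavily than recall. As a sanity check on what that number means, the sketch below computes F0.5 by hand from precision and recall and compares it with `fbeta_score`. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 4 true positives, 1 false positive, 1 false negative
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

beta = 0.5
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library = fbeta_score(y_true, y_pred, beta=beta)

print(manual, library)
```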

In [124]:
tmp, target = ft_engineer(data.copy(), 
                          ft_num, 
                          capital=False, 
                          workclass=True, 
                          occupation=True, 
                          marital=True, 
                          relationship=True, 
                          race=True, 
                          drop_native_country=True)
tmp.shape

(45222, 19)

In [125]:
train_predict(tmp, target, AdaBoostClassifier())

Training set has 36177 samples.
Testing set has 9045 samples.
0.8582642343836374
0.8933134881631049


In [86]:
train_predict(tmp, target, LogisticRegression())

Training set has 36177 samples.
Testing set has 9045 samples.
0.8410171365395246
0.8846949474340445


