<div class="alert alert-info" role="alert">
    <H1> Introduction </H1>
<p>
    This notebook contains the steps I took to analyze and label the WiDS 2018 Datathon data. The data contains demographic and behavioral information from a representative sample of survey respondents in India, including their usage of traditional and mobile financial services. The dataset is a product of InterMedia’s research to help the world’s poorest people take advantage of widely available mobile phones and other digital technology to access financial tools and participate more fully in their local economies. 
    To obtain the data, contact InterMedia directly at http://finclusion@intermedia.org and fill out a data request form [here](http://finclusion.org/data_fiinder/).
</p>

<p>

The goal of this datathon was to determine if a survey respondent was male or female (0 or 1), based on how they answered questions.

I performed the following steps to produce a model with a resulting accuracy of 0.97107.

   <li>Wrangling the data</li>
   <li>Feature selection</li>
   <li>Optimization of an XGBoost model</li>
   <li>Use optimized model to predict labels of test dataset</li>
    
</p>
</div>

In [1]:
# general imports
import pickle
import pandas as pd
import numpy as np

#imports for chi-squared
from scipy.stats import chi2_contingency
from collections import defaultdict

# imports for xgboost
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from xgboost import cv
from sklearn.model_selection import RandomizedSearchCV



<div class="alert alert-info" role="alert">
    <H1> Data Wrangle </H1>
<p>
    In order to build an effective model, the data needs to be cleaned and organized. 
    
</p>

<p>
    <li>Read data into memory as a pandas dataframe</li>
    <li>Remove empty columns</li>
    <li>Ensure feature agreement between test and training data</li>
    <li>Separate the different data types and cast the categorical data as object type</li>
    
</p>
</div>

In [2]:
# import training data
df_train = pd.read_csv(r'train.csv', low_memory=False)
df_test = pd.read_csv(r'test.csv', low_memory=False)

The helper function common_columns ensures feature agreement between two dataframes, and the sufficiently_filled function removes columns whose number of missing values reaches the threshold.

In [3]:
#helper functions
def common_columns(df1, df2):
    """Returns tuple (df1, df2) with columns that BOTH df1 and df2 have in common"""
    joint_column_list = list(set(df1.columns) & set(df2.columns))
    
    return (df1[joint_column_list], df2[joint_column_list])

def sufficiently_filled(df, threshold):
    """
    Removes columns from df that are below the threshold for being filled

    Parameters
    ----------
    df : pandas dataframe
        dataframe with NaN values
    threshold : int or float
        number of NaN values a column must stay below to be kept
    Returns
    -------
    dataframe
        dataframe with columns that have fewer than the threshold number of NaN values
    """
    # remove all columns with no data
    df1 = df.dropna(axis=1, how='all')
    # count the NaNs in each column and keep only the columns with fewer NaNs than threshold
    good_cols = df1.isna().sum() < threshold
    cols_to_keep = good_cols[good_cols].index
    return df1[cols_to_keep]
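
To make the behaviour of these two helpers concrete, here is a small sketch on made-up data. The frames `toy_train` and `toy_test` and their column names are invented for illustration and are not taken from the Datathon files; the two steps mirror the logic of `sufficiently_filled` and `common_columns` with plain pandas calls.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical columns, not from the survey data)
toy_train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],   # 3 of 4 values missing
    "c": [np.nan] * 4,                     # entirely empty
    "target": [0, 1, 0, 1],                # only present in the training data
})
toy_test = pd.DataFrame({
    "a": [5, 6],
    "b": [1.0, 2.0],
    "d": [7, 8],                           # only present in the test data
})

# Keep columns whose NaN count is strictly below the threshold
threshold = len(toy_train) / 2
nan_counts = toy_train.isna().sum()
filled_train = toy_train.loc[:, nan_counts < threshold]

# Keep only the columns both frames share (sorted for a stable order)
shared = sorted(set(filled_train.columns) & set(toy_test.columns))
aligned_train, aligned_test = filled_train[shared], toy_test[shared]

print(list(filled_train.columns))   # ['a', 'target']
print(shared)                       # ['a']
```

Sorting the shared names is a small design choice: a Python set has no guaranteed order, so sorting keeps the column order reproducible between runs. Note also that the label column drops out of the training frame at this step, because the test frame does not contain it.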

In [4]:
# create dataframes where each column has more than 50% of entries filled
threshold = len(df_train)/2. # define threshold
# removes empty or insufficiently filled columns
# note: the same threshold (half the training row count) is applied to the test data
training_data = sufficiently_filled(df_train, threshold) 
test_data = sufficiently_filled(df_test, threshold)

# return dataframes with columns that are only in both test and training data
clean_train, clean_test = common_columns(training_data, test_data)

In [5]:
num_removed_cols = len(list(df_train)) - len(list(clean_train))
num_removed_rows = len(df_train)-len(clean_train)
print('The cleaning process removed {} columns and {} rows in the training data'.format(num_removed_cols, num_removed_rows))

The cleaning process removed 946 columns and 0 rows in the training data


<H2> Separate data by type </H2>

In [6]:
# Isolate text data
text_train = clean_train.select_dtypes(exclude=['float64','int64'])
text_test = clean_test.select_dtypes(exclude=['float64', 'int64'])
print('There are {} columns of text training data'.format(len(text_train.columns)))

There are 0 columns of text training data


<H4> The data dictionary provides a description of all of the categorical data, so I will use its column names to separate the categorical data from the numerical data, keeping only the categories that remained after cleaning the data. </H4>

In [7]:
# Create list of categorical feature names
data_dictionary = pd.read_excel('WiDS data dictionary v2.xlsx')
col_list = list(data_dictionary['Column Name'][1:].apply(lambda x: str(x)))
# Create list of columns in cleaned training data
clean_data_columns = clean_train.columns
# Create list of columns that are both categorical and in the cleaned training data
categorical_column_names = [name for name in clean_data_columns if name in col_list]
# Cast categorical data as object datatype
categorical_train = clean_train[categorical_column_names].drop(columns='DG1').astype('object')
categorical_test = clean_test[categorical_column_names].drop(columns='DG1').astype('object')
print('There are {} columns of categorical training data'.format(len(categorical_train.columns)))

There are 281 columns of categorical training data


In [8]:
# Dataframe of numerical data
drop_columns = categorical_column_names + list(text_train)
numerical_train = clean_train.drop(columns=drop_columns)
numerical_test = clean_test.drop(columns=drop_columns)
print('There are {} columns of numerical training data'.format(len(numerical_train.columns)))

There are 7 columns of numerical training data


<div class="alert alert-info" role="alert">
    <H1> Feature Selection </H1>

<p>
    <li>Remove any column that is more than 50% NaN</li>
    <li>Use the Chi Squared metric to determine if the categorical data is dependent on gender</li>
    <li>One-hot encode the categorical data</li>
    <li>Ensure training and test data have the same features</li>
 
</p>
</div>

In [9]:
# Transform dataframes where NaN = 0 and value=1, and sum them
text_count = text_train.notna().astype(int).sum()
categorical_count = categorical_train.notna().astype(int).sum()
numerical_count = numerical_train.notna().astype(int).sum()
# Define threshold for 50% filled in
threshold = 18255*0.5

# Create list of columns that exceed the threshold for each data type
valid_text_columns = []
valid_categorical_columns = []
valid_numerical_columns = []
# NOTE: zip() stops at its shortest input; text_count is empty (there are no
# text columns), so the body of this loop never runs
for (text_name, text_num, categorical_name, categorical_num,
     numerical_name, numerical_num) in zip(text_count.index, text_count,
                                           categorical_count.index, categorical_count,
                                           numerical_count.index, numerical_count):
    if text_num > threshold:
        valid_text_columns.append(text_name)
    if categorical_num > threshold:
        valid_categorical_columns.append(categorical_name)
    if numerical_num > threshold:
        valid_numerical_columns.append(numerical_name)

In [10]:
# print the number of valid columns for each datatype
print(len(valid_text_columns))
print(len(valid_categorical_columns))
print(len(valid_numerical_columns))

0
0
0


<div class="alert alert-warning">
<p>
    <H7>
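    No columns pass this filter, so I will not use this method during the final feature selection. Because the earlier cleaning step already kept only columns with fewer than 50% missing values, the zero counts most likely come from the loop above, which zips over the empty text-column counts and therefore never runs, rather than from the data itself. Instead, I will focus on using the chi-squared metric for filtering categorical data.  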
    </H7>
</p>
</div>
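
One thing worth checking before trusting the zeros printed above: `zip` stops as soon as its shortest input is exhausted, and the text-column counts are empty. The sketch below applies the same 50% filter to each frame independently, on invented data. The helper `columns_above` and the toy column names are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def columns_above(frame, threshold):
    """Return names of columns with more than `threshold` non-NaN entries."""
    counts = frame.notna().sum()
    return list(counts[counts > threshold].index)

# Invented stand-ins for the three per-type frames
toy_text = pd.DataFrame(index=range(4))                      # no text columns
toy_categorical = pd.DataFrame({"q1": [1, 2, 1, np.nan],
                                "q2": [np.nan, np.nan, np.nan, 3]})
toy_numerical = pd.DataFrame({"age": [20, 35, np.nan, 50]})

threshold = len(toy_categorical) * 0.5

# zip() over all three yields nothing because the text counts are empty
zipped = list(zip(toy_text.notna().sum(),
                  toy_categorical.notna().sum(),
                  toy_numerical.notna().sum()))
print(len(zipped))   # 0

# Filtering each frame on its own is unaffected by the empty one
valid_text = columns_above(toy_text, threshold)
valid_categorical = columns_above(toy_categorical, threshold)
valid_numerical = columns_above(toy_numerical, threshold)
print(valid_text, valid_categorical, valid_numerical)
```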

<div class="alert alert-info" role="alert">
    <H3> Categorical data feature selection </H3>
    <p> 
    I need to determine whether any of the categorical questions are answered differently depending on the gender of the respondent. To do that, I will create a joint contingency table for each of the categorical columns and then use the chi-squared test statistic to determine whether the distribution of answers depends on gender. If it does, I will keep that column to train the final model. The last step is to determine whether this feature selection method actually improves the model.
    </p>
    

<p>
    <li>For each categorical column:
        
        <ol>
            <li>Create joint distribution table</li>
            <li>Calculate the chi-squared contingency test statistic</li>
        </ol>
    </li>
    <li>Filter columns based on the resulting p-value</li>
    <li>Perform one-hot encoding for the gender dependent categorical data</li>
    <li>Verify feature selection improved the model</li>
</p>
</div>
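
As a worked illustration of the test described above, the sketch below runs `chi2_contingency` on one invented question. It builds the contingency table with `pd.crosstab`, which is a swapped-in shortcut rather than the hand-built table used in this notebook, and the `toy` frame and its `answer` column are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented survey answers: one categorical question plus the gender label
toy = pd.DataFrame({
    "is_female": [1] * 50 + [0] * 50,
    "answer":    [1] * 40 + [2] * 10 + [1] * 15 + [2] * 35,
})

# Rows are genders, columns are answer categories; rows with NaN are skipped
table = pd.crosstab(toy["is_female"], toy["answer"])
print(table)

# Returns the test statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = chi2_contingency(table)
print(dof, p_value < 0.05)   # 1 True
```

Here women mostly pick answer 1 and men mostly pick answer 2, so the p-value falls well below 0.05 and the column would be kept. One difference from the notebook's own table: `pd.crosstab` keeps categories that only one gender used, while the function below keeps only categories used by both.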

Create functions to turn categorical columns into joint distribution tables

In [11]:
# function to count values for each possible category
def cat_count(pd_series):
    '''Returns a dict mapping each value in a pandas series to the number of times it occurs'''
    categories = list(set(pd_series))
    cat_count = dict.fromkeys(categories, 0)
    for cat in pd_series:
        cat_count[cat] += 1
    return cat_count

# function to create joint dist table
def joint_dist_table(cat_series, df):
    '''
    Create a joint distribution (contingency) table for a categorical pandas series
    
    Parameters
    ----------
    cat_series : pandas series
        answers to one categorical question
    df : pandas dataframe
        dataframe containing the is_female column, row-aligned with cat_series
    
    Returns
    -------
    Joint distribution table of the categorical distribution for each gender
    
    '''
    data = df.copy()
    # split female and male answers, and drop any NaNs
    F_series = cat_series[data.is_female == 1].dropna()
    M_series = cat_series[data.is_female == 0].dropna()
    # create count of each category, for each gender
    F = cat_count(F_series)
    M = cat_count(M_series) 
    # keep only the categories that both genders used
    keep = set(F) & set(M)
    F_new = {k: F[k] for k in keep}
    M_new = {k: M[k] for k in keep}
    # combine counts in dataframe (column 0 holds the female counts)
    dist_table = pd.DataFrame.from_dict(F_new, orient='index')
    dist_table[1] = list(M_new.values())
    # format to distribution table; labels match the counts stored in each column
    final_dist_table = dist_table.rename(columns={0:'Female',1:'Male'}).transpose()
    return final_dist_table   
    

Calculate the chi-squared test statistic for each categorical feature and filter by resulting p-value

In [12]:
# calculate p-value for each categorical value
chi_dict = defaultdict(list)
for cat_cols in categorical_train:
    jd_table = joint_dist_table(categorical_train[cat_cols], df_train)
    chi_test_value, chi_p, degfree, exp_val = chi2_contingency(jd_table)
    chi_dict[cat_cols] = [chi_test_value, chi_p, degfree, exp_val]


In [13]:
# filter columns based on p-value
sig_level = 0.05 # significance level
sig_cols = []

for k, v in chi_dict.items():
    if v[1] < sig_level:
        sig_cols.append(k)
# print the number of significant features
print('There are {} significant categorical features'.format(len(sig_cols)))

There are 204 significant categorical features


Filter and one-hot encode the significant categorical features

In [14]:
# create a dataframe for only the categorical data dependent on gender
significant_categorical_train = df_train[sig_cols].astype('object') 
significant_categorical_test = df_test[sig_cols].astype('object')
# one-hot encode
encoded_categorical_train = pd.get_dummies(significant_categorical_train, dummy_na=True)
encoded_categorical_test = pd.get_dummies(significant_categorical_test, dummy_na=True)
# make sure training and test features are the same
joint_features = list(set(encoded_categorical_test.columns) & set(encoded_categorical_train))
encoded_categorical_train = encoded_categorical_train[joint_features]
encoded_categorical_test = encoded_categorical_test[joint_features]

In order to verify the feature selection improved the model, I will use an unoptimized XGBoost model to see which datasets performed the best.

In [15]:
# one-hot encode categorical data, without feature selection
unfiltered_categorical_train = pd.get_dummies(categorical_train, dummy_na=True)
unfiltered_categorical_test = pd.get_dummies(categorical_test, dummy_na=True)
joint_features = list(set(unfiltered_categorical_test.columns) & set(unfiltered_categorical_train))
unfiltered_categorical_train = unfiltered_categorical_train[joint_features]
unfiltered_categorical_test = unfiltered_categorical_test[joint_features]

In [16]:
# Use un-optimized xgboost model with the numerical, categorical, and a combination of the two
combined_data = pd.concat([encoded_categorical_train, numerical_train], axis=1) 
datasets = [unfiltered_categorical_train,
            significant_categorical_train.fillna(value=100).astype(int),
            encoded_categorical_train,
            numerical_train,
            combined_data]
# define parameters
fixed_parameters = {
               'max_depth':3,
               'learning_rate':0.3,
               'min_child_weight':3,
               'colsample_bytree':0.8,
               'subsample':0.8,
               'gamma':0,
               'max_delta_step':0,
               'colsample_bylevel':1,
               'scale_pos_weight':1,
               'base_score':0.5,
               'random_state':5,
               'objective':'binary:logistic',
               'silent': 1}

accuracy_scores = []
for data in datasets:
    # define features(X), and target(y)
    X = data
    y = df_train.is_female
    # instantiate model
    xg_reg = XGBRegressor(**fixed_parameters)
    # fit model
    xg_reg.fit(X, y)
    # predict y values (probabilities) for the same rows the model was fit on
    y_pred = xg_reg.predict(X)
    predictions = [round(value) for value in y_pred]
    # score model (training accuracy)
    score = accuracy_score(y, predictions)
    accuracy_scores.append(score)
    print(score)
   

0.927526705012
0.926540673788
0.927745823062
0.622678718159
0.928403177212


I also submitted the most promising models to Kaggle.com to see how the model scored for the test data. The results are the following:
<li>train: 0.9280 (test: 0.96762) with only encoded categorical data</li>
<li>train: 0.9296 (test: 0.96780) with encoded categorical and numerical data</li>
<li>train: 0.9275 (test: 0.96640) with NO feature selection on categorical data</li>
<li>train: 0.9287 (test: 0.96809) with NO feature selection on categorical data and numerical data</li>


<div class="alert alert-warning">
<H2> Feature selection conclusions </H2>

<p> The feature selection did not improve the model, although it didn't really hurt the model either. Thus, I will continue with parameter optimization using the filtered categorical data and numerical data. That will decrease the computational resources that need to be used for optimization.
</p>
</div>
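
The accuracies printed by the comparison loop are computed on the same rows the models were fit on. A held-out split gives a local estimate that is closer to what a leaderboard would report. The sketch below shows the pattern on synthetic data with scikit-learn's `GradientBoostingClassifier` as a stand-in model; both the data and the choice of model are assumptions for illustration, not the XGBoost setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the survey features
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=5)

# Hold out 25% of the rows, keeping the class balance in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=5)

# Stand-in model; the notebook itself uses XGBoost
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=5)
model.fit(X_fit, y_fit)

# Compare accuracy on the fitted rows with accuracy on the unseen rows
train_acc = accuracy_score(y_fit, model.predict(X_fit))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(train_acc, val_acc)
```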

<div class="alert alert-info" role="alert">
    <H1>XGBoost Model Optimization</H1>

<p>
        Sklearn has two automated methods for parameter tuning: RandomizedSearchCV and GridSearchCV. RandomizedSearchCV runs a set number of trials, randomly sampling each one from the given parameter ranges and scoring it with k-fold cross-validation. GridSearchCV tests every possible combination of parameters, also using k-fold cross-validation. Given the large number of parameters I would like to optimize, I will repeat the following two steps until I've achieved an accuracy of 0.95 or above on the test set.
    
    <li>Use random grid search to tune hyper-parameters</li>
    <li>Verify random search parameters improved the model, and test to see if further optimization is needed</li>
  
 
</p>
</div>

In [57]:
# dictionary of fixed parameters, which will not be optimized
fixed_parameters = {
    'objective':'binary:logistic',
    'max_delta_step':0,
    'scale_pos_weight':1,
    'base_score':0.5,
    'random_state':5,
    'subsample':0.8,
    'silent': 1
}

In [58]:
# dictionary of parameters to optimize, and the range of optimization values
reg_param_grid = {'max_depth': range(2,10),
                  'learning_rate': [0.05, 0.1, 0.15, 0.3],
                  'min_child_weight':[2,3,4],
                  'colsample_bytree':[0.6, 0.7, 0.8],
                  'gamma':[0, 2, 5, 8],
                  'colsample_bylevel':[0.7, 1]
                 }
    


<div class="alert alert-info" role="alert">

<p> 
    In order to conserve computational resources, I chose only 6 parameters to optimize initially. The results of this optimization will determine whether more parameters need to be considered.
</p>
</div>
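
To put a number on the savings, the sketch below counts how many model fits an exhaustive grid search would need for the parameter grid above, compared with a randomized search of 200 trials and 4 folds. It only uses `ParameterGrid` to enumerate combinations; the fit counts ignore the single final refit on the full data.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, with the range written out as a list
reg_param_grid = {'max_depth': list(range(2, 10)),
                  'learning_rate': [0.05, 0.1, 0.15, 0.3],
                  'min_child_weight': [2, 3, 4],
                  'colsample_bytree': [0.6, 0.7, 0.8],
                  'gamma': [0, 2, 5, 8],
                  'colsample_bylevel': [0.7, 1]}

cv_folds = 4
n_iter = 200

# 8 * 4 * 3 * 3 * 4 * 2 distinct parameter combinations
n_combinations = len(ParameterGrid(reg_param_grid))
grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds

print(n_combinations)   # 2304
print(grid_fits)        # 9216
print(random_fits)      # 800
```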

In [61]:
# define features and labels for training data
X = combined_data
y = df_train.is_female

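# instantiate model (an XGBRegressor with a logistic objective, so it outputs probabilities)
xg_reg = XGBRegressor(**fixed_parameters)

# Randomized search
# NOTE: scoring is left at its default, so each candidate is ranked by the
# estimator's own score method (R^2 for a regressor) rather than by accuracy
grid_search = RandomizedSearchCV(param_distributions = reg_param_grid, estimator = xg_reg, cv=4, n_iter=200)
grid_search.fit(X,y)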

RandomizedSearchCV(cv=4, error_score='raise',
          estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=5, silent=1, subsample=0.8),
          fit_params=None, iid=True, n_iter=200, n_jobs=1,
          param_distributions={'max_depth': range(2, 10), 'learning_rate': [0.05, 0.1, 0.15, 0.3], 'min_child_weight': [2, 3, 4], 'colsample_bytree': [0.6, 0.7, 0.8], 'gamma': [0, 2, 5, 8], 'colsample_bylevel': [0.7, 1]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [62]:
# Print best parameters and results
print(grid_search.best_params_)
print(grid_search.best_score_)

{'min_child_weight': 3, 'max_depth': 8, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.6, 'colsample_bylevel': 1}
0.736253869383
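
The best score printed above is not an accuracy: with `scoring` left unset, the search ranks candidates by the estimator's own `score` method, which for a regressor is R². The sketch below shows one way to make the metric explicit by passing `scoring="accuracy"` to a classifier. The synthetic data, the small parameter lists, and the scikit-learn `GradientBoostingClassifier` stand-in are all assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary-classification data standing in for the survey features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=5)

# Deliberately tiny search space so the example runs in a few seconds
param_distributions = {"max_depth": [2, 3, 4],
                       "learning_rate": [0.05, 0.1, 0.3]}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=5),
    param_distributions=param_distributions,
    n_iter=4,             # number of sampled parameter settings
    cv=4,                 # folds per setting
    scoring="accuracy",   # report mean cross-validated accuracy
    random_state=5,       # make the sampled settings reproducible
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Fixing `random_state` on the search is a second small change from the cell above: without it, each run samples different parameter settings, so the reported best parameters can vary between runs.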


Train the model using the parameters optimized by random grid search to test if it improved the model from the baseline unoptimized model.

In [63]:
optimized_fixed_parameters = {
    'objective':'binary:logistic',
    'max_delta_step':0,
    'scale_pos_weight':1,
    'base_score':0.5,
    'random_state':5,
    'subsample':0.8,
    'silent': 1,
    'min_child_weight': 3,
    'max_depth': 8,
    'learning_rate': 0.1,
    'gamma': 0,
    'colsample_bytree': 0.6,
    'colsample_bylevel': 1
}

In [65]:
# instantiate classifier
xg_reg = XGBRegressor(**optimized_fixed_parameters)
xg_reg.fit(X,y)

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.6,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=8,
       min_child_weight=3, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=5, silent=1, subsample=0.8)

In [66]:
# predict the labels
y_pred = xg_reg.predict(X)
# convert probabilities to binary output
predictions = [round(value) for value in y_pred]
# score model
score = accuracy_score(y, predictions)
# print accuracy
print(score)

0.958641468091
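
A note on the rounding step: Python's built-in `round` rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0. An explicit threshold makes the cutoff visible and works on the whole array at once. The probabilities below are invented, and treating 0.5 itself as the positive class is an assumption.

```python
import numpy as np

# Invented predicted probabilities
probabilities = np.array([0.12, 0.5, 0.51, 0.97])

# Explicit cutoff: values at or above 0.5 are labelled 1
labels = (probabilities >= 0.5).astype(int)
print(labels.tolist())   # [0, 1, 1, 1]

# Built-in round() sends exactly 0.5 to 0 (round-half-to-even)
print(round(0.5))        # 0
```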



<div class="alert alert-info" role="alert">

<p> 
    The optimized parameters did improve the model from the baseline, so the next step is to use the model to predict the labels for the test data. The test predictions then need to be saved to a csv file and submitted to Kaggle.com.
</p>
</div>

In [67]:
# Define test features
X_sub = pd.concat([encoded_categorical_test, numerical_test], axis=1) 

# Predict label of test data with optimized model
test_predictions = xg_reg.predict(X_sub)

In [71]:
# Place predictions and their corresponding test ID in a dataframe
submission_df = pd.DataFrame({'test_id': df_test.test_id, 'is_female': test_predictions})

In [73]:
# export the results to a csv file, so that it can be submitted to Kaggle.com
submission_df.to_csv('sub20.csv')
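
By default `to_csv` also writes the dataframe index as an unnamed first column. If the submission is expected to contain only the two named columns (an assumption worth checking against the competition's sample submission file), passing `index=False` leaves it out. A small sketch with invented predictions, written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Invented IDs and predicted probabilities
toy_submission = pd.DataFrame({"test_id": [0, 1, 2],
                               "is_female": [0.91, 0.08, 0.55]})

# Write without the index column
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)

header = buffer.getvalue().splitlines()[0]
print(header)   # test_id,is_female
```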

<div class="alert alert-warning">
<H2> Model optimization conclusions </H2>

<p> A score of 0.97107 was achieved on the test data (shown below), which exceeds my goal of 0.95. Thus, I will not take any more optimization steps. However, a training score so much lower than the test score might indicate that the model is underfitted. Further steps would include using more forgiving cleaning steps to increase the number of features.
</p>
</div>

![Image of Score](img/optimized_model_score.png)

<div class="alert alert-info" role="alert">
    <H1>Conclusions</H1>

<p>
     Because I do not own this data, I cannot publish which features were the most indicative of gender. However, I will comment on the modeling process itself.
     
</p>

<p>
   XGBoost worked very well, with very little optimization needed. If this were not the case, I would have tried straight logistic regression or random forests. Since the goal of this analysis was to determine the underlying issues women face, I would not use neural nets, because of poor interpretability.
</p>

</div>