Zindi Challenge link: https://zindi.africa/competitions/busara-mental-health-prediction-challenge-indabax-nigeria

##### References and Resources: 

1. https://www.researchgate.net/publication/352810639_Predicting_Individuals_Mental_Health_Status_in_Kenya_using_Machine_Learning_Methods
2. https://zindi.africa/learn/meet-the-winners-of-the-busara-mental-health-challenge

###### Mount drive

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import the required packages and dataset

In [2]:
import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

df_train = pd.read_csv('/content/drive/MyDrive/machine_learning_stories/busara-mental-health-prediction-challenge-indabax-nigeria/train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/machine_learning_stories/busara-mental-health-prediction-challenge-indabax-nigeria/test.csv')
#ss = pd.read_csv('/content/drive/MyDrive/machine_learning_stories/busara-mental-health-prediction-challenge-indabax-nigeria/sample_submission.csv')
df_train.head()

Unnamed: 0,surveyid,village,survey_date,femaleres,age,married,children,hhsize,edu,hh_children,...,given_mpesa,amount_given_mpesa,received_mpesa,amount_received_mpesa,net_mpesa,saved_mpesa,amount_saved_mpesa,early_survey,depressed,day_of_week
0,926,91,23-Nov-61,1,28.0,1,4,6,10,0,...,0,0.0,0,0.0,0.0,1,0.0,0,0,5
1,747,57,24-Oct-61,1,23.0,1,3,5,8,0,...,0,0.0,1,4.804611,4.804611,0,0.0,0,1,3
2,1190,115,05-Oct-61,1,22.0,1,3,5,9,0,...,0,0.0,0,8.007685,8.007685,1,0.0,0,0,5
3,1065,97,23-Sep-61,1,27.0,1,2,4,10,2,...,0,0.0,0,0.0,0.0,1,1.249199,0,0,0
4,806,42,12-Sep-61,0,59.0,0,4,6,10,4,...,0,0.0,0,0.0,0.0,0,0.0,0,0,3


## Data Cleaning

The `age` column in the test set was read as an object type because it contained a weird placeholder value (`.d`), so I replaced that value with zero and converted the column to float. Next, we label encode the survey date and fill in missing values with pandas `interpolate`, which uses linear interpolation by default. Padding (forward fill) turned out to be a bad idea.

In [4]:
seps = df_train.shape[0]
comb = pd.concat([df_train, df_test], axis=0)
comb['age'] = comb['age'].apply(lambda x: str(x) )
comb['age'] = comb['age'].apply(lambda x: str(0) if x == ".d" else x)
comb['age'] = comb['age'].apply(lambda x: float(x))

le = LabelEncoder()

comb['survey_date'] = comb['survey_date'].apply(lambda x: str(x))
le.fit(comb['survey_date'])
comb['survey_date'] = le.transform(comb['survey_date'])

colNull = comb.isnull().sum()
colNull = [keys for keys, values in colNull.items() if values > 0]
for i in colNull:
    comb[i] = comb[i].interpolate()
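
As a point of comparison, here is a minimal sketch of the same cleaning idea on a toy column (the values are made up for illustration and are not taken from the challenge data). It assumes `pd.to_numeric` with `errors="coerce"` as an alternative to the string replacement above. Note the difference: coercion turns the `.d` placeholder into NaN instead of 0, so it is interpolated together with the other gaps. The sketch also contrasts linear interpolation with padding.

```python
import pandas as pd

# Toy column mimicking an object-typed age field with a ".d" placeholder (made-up values).
toy = pd.DataFrame({"age": ["20", ".d", "40", None, "60"]})

# Coerce anything non-numeric (".d", None) to NaN; the result is a float column.
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Linear interpolation fills each gap from its neighbours ...
linear = toy["age"].interpolate(method="linear")
# ... while forward fill (padding) repeats the last seen value.
padded = toy["age"].ffill()

print(linear.tolist())
print(padded.tolist())
```

Whether zero or an interpolated value is the better stand-in for `.d` depends on the data; the notebook keeps the zero replacement.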

Although the dataset has many features, its highly skewed nature means that not all of them are needed. The following is a curated list of features that I considered redundant.

In [5]:
redundant = ['hh_children', 'hh_totalmembers',
       'cons_nondurable', 'asset_livestock', 'asset_durable', 'asset_phone',
       'asset_savings', 'asset_land_owned_total', 'asset_niceroof',
       'cons_allfood', 'cons_ownfood', 'cons_alcohol', 'cons_tobacco',
        'cons_med_children','med_expenses_hh_ep',
       'med_expenses_sp_ep', 'med_expenses_child_ep',
       'med_portion_sickinjured', 'med_port_sick_child', 'med_afford_port',
       'med_sickdays_hhave', 'med_healthconsult',  'med_u5_deaths', 'ed_expenses', 'ed_expenses_perkid',
       'ed_schoolattend', 'ed_sch_missedpc', 'ed_work_act_pc'
          ,'fs_chskipm_often','fs_chwholed_often','fs_meat','fs_enoughtom','fs_sleephun']
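
A quick way to look at skew and to drop a column list safely is sketched below on a toy frame. This assumes that "skewed" refers to an imbalanced `depressed` target, which the notebook does not state explicitly; the values and the name `not_a_column` are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values) with a skewed binary target and one column to drop.
toy = pd.DataFrame({
    "age": [28, 23, 22, 27, 59, 31],
    "asset_phone": [1, 0, 1, 1, 0, 1],
    "depressed": [0, 1, 0, 0, 0, 0],
})

# Share of each class; a value far from 0.5 indicates a skewed target.
class_share = toy["depressed"].value_counts(normalize=True)
print(class_share)

# errors="ignore" skips names that are not present instead of raising KeyError.
to_drop = ["asset_phone", "not_a_column"]  # the second name is deliberately absent
reduced = toy.drop(columns=to_drop, errors="ignore")
print(reduced.columns.tolist())
```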

Separate the data back into train and test sets. Note that I haven't dropped the redundant features here.

In [6]:
train = comb[:seps].copy()  # copy so the in-place calls below do not act on a slice of comb
test = comb[seps:].copy()
train.dropna(inplace=True)
train.reset_index(drop=True, inplace=True)

y_train = train['depressed']
x_train = train.drop(labels=['depressed'], axis=1)
x_test = test.drop(labels=['depressed'], axis=1)

A blend of 5 models.

- 3 Gradient Boosting Classifiers
- 1 XGB model
- 1 Random Forest model
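
Three of the five models in the cell that follows do not use the default 0.5 cut-off. Instead, they flag a row only when the positive-class probability falls inside an open band (for example, between 0.5 and 0.8). A minimal vectorized sketch of that rule is shown here; the helper name `band_predict` and the probabilities are hypothetical and only serve as illustration.

```python
import numpy as np

def band_predict(proba_pos, low, high):
    """Return 1 where low < p < high (strict on both sides), else 0."""
    proba_pos = np.asarray(proba_pos, dtype=float)
    return ((proba_pos > low) & (proba_pos < high)).astype(int)

# Made-up positive-class probabilities, i.e. what predict_proba(...)[:, 1] would return.
p = np.array([0.10, 0.49, 0.55, 0.70, 0.95])

wide_band = band_predict(p, 0.5, 0.8)      # same bounds as the second GB model
narrow_band = band_predict(p, 0.484, 0.5)  # same bounds as the third GB model
print(wide_band)
print(narrow_band)
```

Note that a very confident prediction such as 0.95 falls outside the 0.5 to 0.8 band and is therefore labelled 0 by that model on its own.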

In [7]:
########################################################################################
# Gradient Boosting Classifier - Base
########################################################################################

model = GradientBoostingClassifier(n_estimators=90, max_depth=3, random_state=8) 
model.fit(x_train,y_train)
gb_pred = model.predict(x_test)

##########################################################################################
# Gradient Boost Classifier - Edge Cases
#########################################################################################

model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=23) 
model.fit(x_train,y_train)
p_pred = model.predict_proba(x_test)
gb_pred2 = []
for pp in p_pred:
    if 0.5 < pp[1] < 0.8:
        gb_pred2.append(1)
    else:
        gb_pred2.append(0)
###########################################################################################
# XGB - Edge Cases
##########################################################################################

model= xgb.XGBClassifier(seed=3)
model.fit(x_train, y_train)
p_pred = model.predict_proba(x_test)

xgb_pred = []
for pp in p_pred:
    if 0.5 < pp[1] < 0.6:
        xgb_pred.append(1)
    else:
        xgb_pred.append(0)

##########################################################################################
# Random Forests - Base
#########################################################################################

model = RandomForestClassifier(random_state=3, n_estimators=20)
model.fit(x_train, y_train)
rf_pred = model.predict(x_test)


############################################################################################
# Clean dataset by removing redundant columns
###########################################################################################

comb_ = comb.drop(redundant, axis=1)
train = comb_[:seps].copy()
test = comb_[seps:].copy()
train.ffill(inplace=True)  # same forward fill; fillna(method="ffill") is deprecated in recent pandas
train.dropna(inplace=True)
train.reset_index(drop=True, inplace=True)
y_train = train['depressed']
x_train = train.drop(labels=['depressed'], axis=1)
x_test = test.drop(labels=['depressed'], axis=1)

###########################################################################################
# Gradient Boost with Clean Dataset
###########################################################################################

model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=23) 
model.fit(x_train,y_train)
p_pred = model.predict_proba(x_test)
gb_pred3 = []
for pp in p_pred:
    if 0.484 < pp[1] < 0.5:
        gb_pred3.append(1)
    else:
        gb_pred3.append(0)
        
###########################################################################################
# Blending the Models
###########################################################################################

blend = []
for p in range(len(gb_pred)):
    # & binds tighter than |, so the rule is (gb AND gb2) OR xgb OR rf OR gb3
    if ((gb_pred[p] > 0) & (gb_pred2[p] > 0)) | (xgb_pred[p] > 0) | (rf_pred[p] > 0) | (gb_pred3[p] > 0):
        blend.append(1)
    else:
        blend.append(0)

submiss = pd.DataFrame({"surveyid": x_test['surveyid'],  "depressed": blend})
submiss = submiss[['surveyid', 'depressed']]
submiss.to_csv("Baseline_indaba_X_2022_IB.csv", index = False)
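
The blending loop above can also be expressed without an explicit Python loop. The following is a minimal sketch, assuming the five prediction lists are 0/1 arrays of equal length; the helper name `blend_votes` and the votes are made up for illustration.

```python
import numpy as np

def blend_votes(gb, gb2, xgb_band, rf, gb3):
    """(gb AND gb2) OR xgb_band OR rf OR gb3, element-wise, returned as 0/1 integers."""
    gb, gb2, xgb_band, rf, gb3 = (
        np.asarray(a) > 0 for a in (gb, gb2, xgb_band, rf, gb3)
    )
    return ((gb & gb2) | xgb_band | rf | gb3).astype(int)

# Made-up votes for four rows.
out = blend_votes(
    gb=[1, 1, 0, 0],
    gb2=[1, 0, 0, 0],
    xgb_band=[0, 0, 0, 0],
    rf=[0, 0, 1, 0],
    gb3=[0, 0, 0, 0],
)
print(out)
```

The second row shows the effect of the AND: the base Gradient Boosting vote alone is not enough to produce a positive label unless the banded model agrees or one of the other three models votes 1.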