# Random Forest Classifier

We begin our analysis with the random forest classifier. A random forest is a meta-estimator that fits a number of decision tree classifiers on various subsamples of the data and averages their predictions to improve the accuracy of the model.
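The averaging step can be illustrated on synthetic data. This is a minimal sketch, assuming a toy dataset from `make_classification` (not the survey data used in this notebook); the names `X_demo`, `y_demo` and `forest` are made up for the illustration. It checks that the forest's class probabilities are the mean of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real dataset (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect the class probabilities predicted by every individual tree
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
avg_probs = tree_probs.mean(axis=0)
matches = np.allclose(avg_probs, forest.predict_proba(X_demo))
print(matches)
```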

In [12]:
# Load required packages
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

## Modelling Company Employees

In [41]:
# Load data into dataframe
df = pd.read_csv('./../../../datasets/preprocessed_ce.csv')

### Splitting data

In [42]:
tgt_col = 'have you ever sought treatment for a mental health disorder from a mental health professional?'
y = df[tgt_col]
X = df.drop(tgt_col, axis=1)

Next, we hold out 20% of the data as a test set. Whether the data is imbalanced is checked further below, once the features have been encoded.

In [43]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
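The split above is purely random. When the classes are imbalanced, one optional variation is to pass `stratify` so that both splits keep roughly the same class ratio. The sketch below is an assumption about how that could look, not what this notebook does; the labels in `y_demo` are hypothetical and only mimic an imbalance of about 64% to 36%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 64% ones, 36% zeros
y_demo = np.array([1] * 640 + [0] * 360)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_ratio = y_tr.mean()  # share of class 1 in the training split
test_ratio = y_te.mean()   # share of class 1 in the test split
print(train_ratio, test_ratio)
```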

In [44]:
# Keep copy of original variables
X_train_ori = X_train.copy()
X_test_ori = X_test.copy()

### Categorical features encoding

Before we move forward to encode categorical features, it is necessary to identify them first.

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1225 entries, 0 to 1224
Data columns (total 55 columns):
 #   Column                                                                                                                                                                                                                          Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                          --------------  -----  
 0   are you self-employed?                                                                                                                                                                                                          1225 non-null   int64  
 1   how many employees does your company or organization have?                                                                       

Looking at the dataframe information, quite a lot of features have the data type "object". Not every "object" feature is necessarily categorical: some columns might hold binary values that could be represented as booleans, so it is better to check the columns one by one.  
But for now, I would like to go with the assumption that all the columns with data type "object" are categorical columns.
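One quick way to do the column-by-column check is to count the distinct values in each "object" column. The following is a minimal sketch on a hypothetical frame (`demo` and its column names are invented for the example); columns with exactly two distinct values are candidates for a boolean representation.

```python
import pandas as pd

# Hypothetical frame standing in for df; the column names are made up
demo = pd.DataFrame({
    "has_benefits": ["Yes", "No", "Yes", "No"],
    "company_size": ["1-5", "6-25", "26-100", "1-5"],
    "age": [31, 45, 28, 39],
}).astype({"has_benefits": object, "company_size": object})

# Count distinct values per object column
object_cols = demo.select_dtypes(include=["object"]).columns
unique_counts = demo[object_cols].nunique()

# Columns with exactly two distinct values could be stored as booleans
binary_candidates = unique_counts[unique_counts == 2].index.tolist()
print(binary_candidates)
```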

In [46]:
cat_cols = df.select_dtypes(include=['object']).columns
cat_cols

Index(['how many employees does your company or organization have?',
       'does your employer provide mental health benefits as part of healthcare coverage?',
       'do you know the options for mental health care available under your employer-provided health coverage?',
       'has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
       'does your employer offer resources to learn more about mental health disorders and options for seeking help?',
       'is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
       'if a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?',
       'would you feel more comfortable talking to your coworkers about your physical health or your mental health?',
       'would you feel comfortable discussing a 

There are 32 columns out of 55 which are categorical in nature. After examining the data manually, we can infer that one of them is ordinal and the others can be treated as nominal columns. The column "how many employees does your company or organization have?", which describes the size of the company, can be treated as an ordinal column.
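As an illustration of what "ordinal" means here, the size bands can be given an explicit order and turned into integer codes. This is a sketch of one possible approach using an ordered `pd.Categorical`; the `sizes` series is hypothetical sample data, and the band labels are the ones used for this column in the notebook.

```python
import pandas as pd

# Size bands listed from smallest to largest
size_order = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']

# Hypothetical sample of the company-size column
sizes = pd.Series(['6-25', 'More than 1000', '1-5', '100-500'])

# An ordered categorical assigns codes 0..5 following size_order;
# values not in size_order would get code -1 (0 after the shift below)
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
codes = pd.Series(ordered.codes + 1, index=sizes.index)  # shift so '1-5' maps to 1
print(codes.tolist())
```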

In [47]:
# Encode the ordinal company-size column, using the same mapping for both splits
size_col = 'how many employees does your company or organization have?'
size_map = {'1-5': 1, '6-25': 2, '26-100': 3, '100-500': 4, '500-1000': 5, 'More than 1000': 6}

# Take explicit copies first so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[size_col] = X_train[size_col].replace(size_map)
X_test[size_col] = X_test[size_col].replace(size_map)


In [48]:
# Encoding nominal columns for training data
for column in cat_cols:
    dummy = pd.get_dummies(X_train[column], prefix=str(column))
    X_train = pd.concat([X_train, dummy], axis=1)
    X_train.drop(column, axis=1, inplace=True)
    
# Encoding nominal columns for testing data
for column in cat_cols:
    dummy = pd.get_dummies(X_test[column], prefix=str(column))
    X_test = pd.concat([X_test, dummy], axis=1)
    X_test.drop(column, axis=1, inplace=True)

In [51]:
# Fill value 0 for mismatched columns
mis_cols = list(set(X_train.columns) - set(X_test.columns))
X_test[mis_cols] = 0
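Note that columns added through a set difference are appended at the end of `X_test`, so the column order can differ from `X_train`, and categories that appear only in the test set are left in place. One possible alternative, shown here as a sketch on hypothetical toy frames (`train_demo` and `test_demo` are invented), is `DataFrame.reindex`, which adds missing columns, drops unseen ones, and matches the training column order in one step.

```python
import pandas as pd

# Hypothetical one-hot encoded frames; the column names are made up
train_demo = pd.DataFrame({"age": [30, 41], "country_DE": [1, 0], "country_US": [0, 1]})
test_demo = pd.DataFrame({"country_US": [1, 0], "country_FR": [0, 1], "age": [25, 52]})

# Align the test frame to the training columns:
# - country_DE is missing in the test frame and is filled with 0
# - country_FR was never seen in training and is dropped
# - the resulting column order matches the training frame
aligned = test_demo.reindex(columns=train_demo.columns, fill_value=0)
print(aligned)
```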

### Imbalance check

In [52]:
y.value_counts()

1    779
0    446
Name: have you ever sought treatment for a mental health disorder from a mental health professional?, dtype: int64

The data is imbalanced. To keep the model from favouring the majority class, we need to either oversample the minority class or downsample the majority class. Since the data set has relatively few records, it is better to oversample. However, only the training data should be oversampled, so that the test set keeps its original class distribution.  
For oversampling, the Synthetic Minority Over-sampling Technique (SMOTE) will be used.
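The core idea behind SMOTE is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that interpolation step only, written with NumPy and scikit-learn; it is not the `imblearn` implementation, and the `minority` points are random toy data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # hypothetical minority-class samples

# k nearest minority neighbours of each sample (column 0 is the sample itself)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
neighbour_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]

# For each new point: pick a base sample, pick one of its neighbours,
# and place the new point at a random position on the segment between them
n_new = 10
base = rng.integers(0, len(minority), size=n_new)
chosen = neighbour_idx[base, rng.integers(0, k, size=n_new)]
lam = rng.random((n_new, 1))
synthetic = minority[base] + lam * (minority[chosen] - minority[base])
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region already covered by the minority class.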

In [53]:
X_train

Unnamed: 0,are you self-employed?,is your employer primarily a tech company/organization?,is your primary role within your company related to tech/it?,have you ever discussed your mental health with your employer?,have you ever discussed your mental health with coworkers?,have you ever had a coworker discuss their or another coworker's mental health with you?,"overall, how much importance does your employer place on physical health?","overall, how much importance does your employer place on mental health?",do you have previous employers?,was your employer primarily a tech company/organization?,...,what country do you work in?_Ireland,what country do you work in?_Mexico,what country do you work in?_Netherlands,what country do you work in?_Norway,what country do you work in?_Poland,what country do you work in?_Portugal,what country do you work in?_Spain,what country do you work in?_Switzerland,what country do you work in?_United Kingdom,what country do you work in?_United States of America
371,0,1.0,1.0,1.0,1.0,1.0,7.0,8.0,1,1.0,...,0,0,0,0,0,0,0,0,0,1
306,0,1.0,1.0,0.0,1.0,1.0,2.0,8.0,0,0.0,...,0,0,1,0,0,0,0,0,0,0
1096,0,1.0,1.0,0.0,1.0,1.0,8.0,8.0,1,1.0,...,0,0,0,0,0,0,0,0,0,1
10,0,1.0,1.0,0.0,0.0,0.0,3.0,2.0,1,0.0,...,0,0,0,0,0,0,0,0,0,1
535,0,1.0,1.0,0.0,0.0,0.0,7.0,5.0,1,1.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1044,0,1.0,1.0,1.0,0.0,0.0,5.0,5.0,1,1.0,...,0,0,0,0,0,0,0,0,0,1
1095,0,0.0,1.0,0.0,1.0,1.0,7.0,8.0,0,0.0,...,0,0,0,0,0,0,0,0,0,0
1130,0,1.0,0.0,0.0,0.0,0.0,5.0,0.0,1,0.0,...,0,0,0,0,0,0,0,0,0,1
860,0,1.0,1.0,1.0,1.0,1.0,7.0,7.0,1,1.0,...,0,0,0,0,0,0,0,0,0,1


In [54]:
# Oversample the minority class in the target variable
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train.values, y_train.ravel())

### Model training

The random forest algorithm uses various parameters to train the model. Our aim is to find the values of those parameters, also known as hyperparameters, which yield the model with the best fit.

In [55]:
# Declare parameters for grid search

# Declare the classifier
clf = RandomForestClassifier(class_weight="balanced", bootstrap=True, oob_score=True)

# Declare the parameter grid for searching
param_grid = dict(
    n_estimators = [100, 200, 400],
    criterion = ['gini', 'entropy'],
    max_depth = [10, 20, 40, None],
    max_features = ['sqrt', 'log2', None],
    max_samples = [0.4, 0.8, None]
)
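The size of this search space can be checked before training. The sketch below rebuilds the same grid and counts its combinations with scikit-learn's `ParameterGrid`; multiplying by the number of cross-validation folds (5 is assumed here, matching `cv=5` in the grid search) gives the total number of model fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as declared above
param_grid = dict(
    n_estimators=[100, 200, 400],
    criterion=['gini', 'entropy'],
    max_depth=[10, 20, 40, None],
    max_features=['sqrt', 'log2', None],
    max_samples=[0.4, 0.8, None],
)

# 3 * 2 * 4 * 3 * 3 hyperparameter combinations
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (5 folds assumed)
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 216 1080
```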

In [56]:
# Train the model
rf_clf = GridSearchCV(clf, param_grid, scoring='f1', n_jobs=7, cv=5, verbose=2)
rf_clf.fit(X_train, y_train)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


GridSearchCV(cv=5,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              oob_score=True),
             n_jobs=7,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [10, 20, 40, None],
                         'max_features': ['sqrt', 'log2', None],
                         'max_samples': [0.4, 0.8, None],
                         'n_estimators': [100, 200, 400]},
             scoring='f1', verbose=2)

In [57]:
rf_clf.best_estimator_

RandomForestClassifier(class_weight='balanced', max_depth=10,
                       max_features='sqrt', max_samples=0.8, oob_score=True)

In [58]:
# Save and load the model if required
import joblib
joblib.dump(rf_clf.best_estimator_, './../../../models/rf_clf.pkl')
rf_clf = joblib.load('./../../../models/rf_clf.pkl')
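After the cell above, `rf_clf` refers to the reloaded `RandomForestClassifier` itself rather than the `GridSearchCV` object. The save and load round trip can be sanity-checked as in the following sketch, which uses a toy model and a temporary directory; the file name `rf_demo.pkl` and the data are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the tuned estimator
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_demo.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly like the original
same_predictions = (restored.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```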

In [59]:
# Predict outcomes on the test set
# (rf_clf now holds the reloaded best estimator, so it is used directly)
y_pred = rf_clf.predict(X_test)

### Model Evaluation

In order to compute sensitivity and specificity, we need the numbers of true positives, true negatives, false positives and false negatives. These values can be easily obtained from the confusion matrix.

In [60]:
# Get values from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [61]:
# Compute sensitivity
sensitivity = tp/(tp+fn)
print(f"Sensitivity: {sensitivity} \n")
# Compute specificity
specificity = tn/(tn+fp)
print(f"Specificity: {specificity} \n")

# Compute classification report
print(classification_report(y_test, y_pred))

Sensitivity: 0.8662420382165605 

Specificity: 0.8181818181818182 

              precision    recall  f1-score   support

           0       0.77      0.82      0.80        88
           1       0.89      0.87      0.88       157

    accuracy                           0.85       245
   macro avg       0.83      0.84      0.84       245
weighted avg       0.85      0.85      0.85       245



From the above report, it can be inferred that the model finds it harder to identify people who do not need to seek help from a mental health professional (class 0: precision 0.77, recall 0.82) than those who do (class 1: precision 0.89, recall 0.87), and that can be acceptable. It won't harm any of us to visit a mental health professional even if we turn out not to need help for any mental health issue. On the other hand, the model is quite good at flagging the cases where help from a mental health professional is much needed. A weighted-average F1 score of 0.85 is quite good considering the amount of data that we are training with. Although the model is better at predicting the individuals who need help than the ones who do not, sensitivity (0.87) and specificity (0.82) are not far apart, and hence the overall performance of the model is good.
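The reported figures can be cross-checked by hand. The confusion-matrix counts below are not printed in the notebook; they are inferred from the reported support (88 and 157) together with the printed sensitivity and specificity, so treat them as an assumption. The sketch recomputes the main metrics from those counts.

```python
# Counts inferred from the reported support and rates (not notebook output)
tn, fp, fn, tp = 72, 16, 21, 136

sensitivity = tp / (tp + fn)    # recall for class 1
specificity = tn / (tn + fp)    # recall for class 0
precision_pos = tp / (tp + fp)  # precision for class 1
precision_neg = tn / (tn + fn)  # precision for class 0
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(sensitivity, 2), round(specificity, 2))
print(round(precision_pos, 2), round(precision_neg, 2), round(accuracy, 2))
```

The rounded values agree with the classification report: the 16 false positives lower the precision for class 1 only slightly, while the 21 false negatives account for the weaker precision on class 0.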

### Fairness evaluation

In [None]:
from fairlearn.widget import FairlearnDashboard

In [None]:
FairlearnDashboard(sensitive_features=X_test,
                  sensitive_feature_names=list(X_test.columns),
                  y_true=y_test.tolist(),
                  y_pred=y_pred.tolist())

In [None]:
X_test.columns