<img src='https://bit.ly/2VnXWr2' width='100' align='left'>

# Final project: NLP to predict Myers-Briggs Personality Type

### Imports

In [2]:
!pip freeze > requirements4.txt

In [1]:
# Data Analysis
import pandas as pd
import numpy as np

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt


# Ignore noise warning
import warnings
warnings.filterwarnings('ignore')

# Work with pickles
import pickle

#Metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score, balanced_accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, multilabel_confusion_matrix, confusion_matrix
from sklearn.metrics import classification_report

# Model training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, RandomizedSearchCV

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

## 5. Hyperparameter Tuning of the Models (Types)

Although the metrics of the different models are already good, their performance can still be improved. The hyperparameters of each model are therefore fine-tuned below.

In [2]:
result_svd_vec_types  = pd.read_csv('data/output_csv/result_svd_vec_types.csv')
result_svd_vec_types.drop(['Unnamed: 0'], axis=1, inplace=True)

In [3]:
X = result_svd_vec_types.drop(['type','enfj', 'enfp', 'entj', 'entp', 'esfj', 'esfp', 'estj', 'estp','infj', 'infp', 'intj',
                               'intp', 'isfj', 'isfp', 'istj', 'istp'], axis=1).values
y = result_svd_vec_types['type'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print ((X_train.shape),(y_train.shape),(X_test.shape),(y_test.shape))

(6940, 102) (6940,) (1735, 102) (1735,)


<img src='https://www.nicepng.com/png/detail/148-1486992_discover-the-most-powerful-ways-to-automate-your.png' width='1000'>

In [4]:
raise SystemExit('Stop right there! The following cells take some time to complete.')

Because there are many parameters, I show the parameter grids I used first, and then the models trained with the best settings found.

These grids were searched in Google Colab two or three parameters at a time, not all at once.
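To illustrate why the grids were searched a few parameters at a time, here is a rough sketch that counts candidates. The sizes are read off the RandomForest grid in the next section (with `n_estimators` counted as the ten `linspace` values only). The grouping into consecutive pairs is an assumption made for illustration, not necessarily the grouping used in Colab.

```python
from math import prod

# Number of candidate values per parameter in the RandomForest grid
# (assumption: n_estimators counted as the ten linspace values only)
grid_sizes = {
    'class_weight': 2,
    'criterion': 2,
    'max_depth': 12,          # 11 linspace values plus None
    'max_features': 3,
    'n_estimators': 10,
    'min_samples_leaf': 19,   # np.arange(1, 20)
    'min_samples_split': 23,  # np.arange(2, 25)
    'bootstrap': 2,
    'oob_score': 2,
}


def chunked(names, size):
    """Split a list of names into consecutive groups of at most `size`."""
    return [names[i:i + size] for i in range(0, len(names), size)]


# Exhaustive search: every combination of every parameter
full_candidates = prod(grid_sizes.values())

# Staged search: one small grid per group of two parameters (hypothetical grouping)
names = list(grid_sizes)
pair_candidates = sum(
    prod(grid_sizes[name] for name in group) for group in chunked(names, 2)
)

cv_folds = 3  # same cv as the GridSearchCV calls in this notebook
print('full grid fits:', full_candidates * cv_folds)
print('pairwise fits: ', pair_candidates * cv_folds)
```

The trade-off is that a staged search cannot see interactions between parameters that sit in different groups, so its result is not guaranteed to match the optimum of the full grid.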

### RandomForest Tuning

##### GridSearchCV

In [None]:
random_forest = RandomForestClassifier(random_state = 42)

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# extend (not append) so the grid holds individual integers instead of one nested list
n_estimators.extend(np.arange(50, 200).tolist())

param_grid =  {'class_weight': [None,'balanced'],
               'criterion': ['gini', 'entropy'],
               'max_depth': max_depth, 
               'max_features': ['auto', 'sqrt', 'log2'],
               'n_estimators' : n_estimators,
               'min_samples_leaf': np.arange(1,20),
               'min_samples_split': np.arange(2,25),
               'bootstrap': [True, False],
               # Note: oob_score=True is only valid with bootstrap=True; the other combinations fail to fit
               'oob_score': [True, False] 
            }

grid = GridSearchCV(random_forest, param_grid, cv=3, scoring='f1_weighted', verbose=2, n_jobs=-1, refit=True)

grid.fit(X_train, y_train)

grid.best_estimator_

print(grid.best_params_)
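`RandomizedSearchCV` is imported at the top of this notebook but not used. As a minimal sketch of how it could limit the cost of a large grid, the example below samples a fixed number of candidates. The synthetic data, the small parameter lists and `n_iter=5` are assumptions chosen so the cell runs quickly; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the SVD features (assumption: not the project's data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=4, random_state=42
)

# Deliberately small candidate lists so the example finishes in seconds
param_distributions = {
    'n_estimators': [10, 20, 40],
    'max_depth': [5, 10, None],
    'min_samples_leaf': list(range(1, 6)),
    'max_features': ['sqrt', 'log2'],
}

# n_iter caps the number of sampled candidates, whatever the grid size
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring='f1_weighted',
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

With this approach the number of fits is `n_iter * cv`, regardless of how many values each parameter lists.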

### GradientBooster Tuning

##### GridSearchCV

In [None]:
gradient_booster = GradientBoostingClassifier(random_state = 42)

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# extend (not append) so the grid holds individual integers instead of one nested list
n_estimators.extend(np.arange(50, 200).tolist())

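# Note: in scikit-learn, 'exponential' loss supports binary classification only, so it fails on the 16 types
param_grid =  {'loss':['deviance', 'exponential'],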
               'learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
               'max_depth': max_depth, 
               'n_estimators' : n_estimators,
               'min_samples_leaf': np.arange(1,20),
               'min_samples_split': np.arange(2,25),
               'max_features':['auto', 'sqrt', 'log2'],
               'criterion': ['friedman_mse', 'mse', 'mae'],
               'subsample':[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0]
              }

grid = GridSearchCV(gradient_booster, param_grid, cv=3, scoring='f1_weighted',  verbose=2, n_jobs=-1, refit=True)

grid.fit(X_train, y_train)

grid.best_estimator_

print(grid.best_params_)

### Final results

In [7]:
def baseline_report(model, X_train, X_test, y_train, y_test, name):
    # Fixed random_state so that every metric is cross-validated on the same folds
    strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model.fit(X_train, y_train)
    # Cross-validated metrics, computed on the training set
    accuracy     = np.mean(cross_val_score(model, X_train, y_train, cv=strat_k_fold, scoring='accuracy'))
    precision    = np.mean(cross_val_score(model, X_train, y_train, cv=strat_k_fold, scoring='precision_weighted'))
    recall       = np.mean(cross_val_score(model, X_train, y_train, cv=strat_k_fold, scoring='recall_weighted'))
    f1score      = np.mean(cross_val_score(model, X_train, y_train, cv=strat_k_fold, scoring='f1_weighted'))
    # Specificity, computed on the held-out test set from one-vs-rest confusion matrices
    y_pred = model.predict(X_test)
    mcm = multilabel_confusion_matrix(y_test, y_pred)
    tn = mcm[:, 0, 0]
    fp = mcm[:, 0, 1]
    # Macro average over the classes present, instead of a hard-coded 16
    specificity = np.mean(tn / (tn + fp))

    df_model = pd.DataFrame({'model'        : [name],
                             'accuracy'     : [accuracy],
                             'precision'    : [precision],
                             'recall'       : [recall],
                             'f1score'      : [f1score],
                             'specificity'  : [specificity]
                            })   
    return df_model
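The specificity step in `baseline_report` can be checked by hand on a tiny example. The sketch below uses three hypothetical labels (`a`, `b`, `c`) that are not part of the dataset, and computes the per-class specificity TN / (TN + FP) from the one-vs-rest matrices returned by `multilabel_confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical toy labels, for illustration only
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

# One 2x2 matrix per class, laid out as [[tn, fp], [fn, tp]]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=['a', 'b', 'c'])
tn = mcm[:, 0, 0]
fp = mcm[:, 0, 1]

# Specificity per class: share of "not this class" samples that were not assigned to it
specificities = tn / (tn + fp)
macro_specificity = specificities.mean()

print(specificities)
print(macro_specificity)
```

Because each one-vs-rest matrix counts every sample of the other classes as a negative, specificity tends to be high when there are many classes, which is worth keeping in mind when reading the values in the results table.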

In [8]:
models = {'randomforest': RandomForestClassifier(random_state = 42, bootstrap=False, class_weight = 'balanced', 
                                                 criterion = 'gini', max_depth = 50, max_features = 'sqrt', 
                                                 min_samples_leaf = 5, min_samples_split = 12, n_estimators = 1800, 
                                                 oob_score = False),
          # Note: 'xgboost' is only a label here; the model is scikit-learn's GradientBoostingClassifier
          'xgboost': GradientBoostingClassifier(random_state = 42, loss = 'deviance', max_depth = 3, n_estimators = 1600,
                                                max_features = 'sqrt', learning_rate = 0.075, criterion = 'friedman_mse',
                                                subsample = 0.9, min_samples_leaf = 6, min_samples_split = 15)
         } 

In [11]:
models_df = pd.concat([baseline_report(model, X_train, X_test, y_train, y_test, name) for (name, model) in models.items()]) 
models_df.to_csv('data/output_csv/models_tuned_types.csv')
models_df

| | model | accuracy | precision | recall | f1score | specificity |
|---|---|---|---|---|---|---|
| 0 | randomforest | 0.642795 | 0.645832 | 0.640346 | 0.638833 | 0.974255 |
| 0 | xgboost | 0.660519 | 0.65806 | 0.663401 | 0.651957 | 0.975184 |


## Conclusions

The best model, gradient boosting, reaches a cross-validated weighted F1 score of 0.652 and an accuracy of 0.661. In other words, it predicts the correct MBTI personality type about 66% of the time.

These results may not look outstanding, but this is a multiclass problem with 16 types, where uniform random guessing would be right only 6.25% of the time (1/16). The model's predictions are therefore more than 10 times more accurate than guessing at random.
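The comparison with random guessing is simple arithmetic, sketched below with the accuracy from the results table. The `majority_baseline` helper and the toy label list are hypothetical additions: they show how a stricter baseline (always predicting the most frequent type) could be computed, since its value depends on the class distribution, which is not shown in this section.

```python
from collections import Counter

n_classes = 16
best_accuracy = 0.660519  # accuracy of the gradient boosting model in the results table

# Uniform random guessing over 16 types
uniform_baseline = 1 / n_classes
improvement = best_accuracy / uniform_baseline


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy labels, NOT the real class distribution
toy_labels = ['infp'] * 5 + ['infj'] * 3 + ['entp'] * 2

print('uniform baseline:', uniform_baseline)
print('improvement over uniform guessing:', round(improvement, 2))
print('majority baseline on toy labels:', majority_baseline(toy_labels))
```

On the real data, `majority_baseline(y_train)` would give the figure to compare against if the types are unevenly represented.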