# Testing Different Supervised Learning Techniques

In this notebook, several types of supervised learning model are compared using 5-fold cross-validation. The results are shown for comparison; the parameters of each model have already been tuned.

Due to the nature of the problem (looking for small differences in a large corpus of similar words), decision tree ensemble methods generally perform better than other techniques.
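
To make the evaluation procedure concrete, the sketch below shows how a 5-fold split partitions sample indices so that every sample is held out exactly once. It is a plain-Python illustration only: `k_fold_indices` is a hypothetical helper, not part of `JHPModel`, whose `cross_validate` method may shuffle or stratify the data differently.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration; not taken from JHPModel.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {fold_number}: train on {len(train_idx)} samples, test on {test_idx}")
```

In this scheme a model is fitted five times, each time on four folds, and scored on the remaining fold, so no prediction is made on data the model was trained on.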

Key findings:
- Naive Bayes, which normally performs well on NLP problems, performs poorly in this case (accuracy of about 0.56).
- Random Forest has good out-of-the-box performance (accuracy of about 0.72).
- AdaBoost performs worse than Random Forest (accuracy of about 0.60).
- Extreme gradient boosting (XGBoost) outperforms the other decision tree ensemble methods (accuracy of about 0.73).

For reference:
- Class 0 = San Francisco
- Class 1 = New York
- Class 2 = Chicago
- Class 3 = Austin

### Import modules and data

In [2]:
from src.utils import import_data
from src.build_model import JHPModel
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = import_data("job-hunter-plus-data", "indeed_data.csv")

# Models On All Four Cities

### Random Forest

In [16]:
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

In [18]:
m = JHPModel(clf, stemlem="", min_df=0.01, max_df=0.95, num_cities=4,
             n_grams=(1, 2), use_stopwords=True)
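
The `min_df`, `max_df`, and `n_grams` arguments resemble the document-frequency and n-gram settings of scikit-learn's text vectorizers. Assuming `JHPModel` forwards them in that way (the notebook does not confirm this), the plain-Python sketch below shows the effect of such settings on a toy corpus. The helper names `ngrams` and `filter_by_document_frequency` and the example postings are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_range=(1, 2)):
    """Return all n-grams of the token list for n in the inclusive range."""
    low, high = n_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_by_document_frequency(docs, min_df=0.01, max_df=0.95, n_range=(1, 2)):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    doc_terms = [set(ngrams(doc.lower().split(), n_range)) for doc in docs]
    # Count the number of documents each term appears in (not total occurrences).
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    n_docs = len(docs)
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )


# Toy postings, invented for illustration only.
docs = [
    "python developer startup",
    "python developer finance",
    "python analyst finance",
    "python engineer energy",
]
vocab = filter_by_document_frequency(docs, min_df=0.5, max_df=0.95)
print(vocab)
```

Here `python` appears in every posting and is removed by `max_df`, while terms that occur in a single posting are removed by `min_df`. The terms that remain are the ones that can help separate the classes.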

In [27]:
m.cross_validate(df)

Model Accuracy = 0.7176662266749533
Confusion_Matrix:
[[ 2380.   656.   125.    47.]
 [  346.  3752.   149.    36.]
 [  183.   835.  1175.    56.]
 [  175.   427.    47.   526.]]
Class 0 | Precision = 0.742 | Recall = 0.772 | F1 = 0.757
Class 1 | Precision = 0.876 | Recall = 0.662 | F1 = 0.754
Class 2 | Precision = 0.522 | Recall = 0.785 | F1 = 0.628
Class 3 | Precision = 0.448 | Recall = 0.791 | F1 = 0.572
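
The per-class figures above can be reproduced from the confusion matrix. In the sketch below, the value printed as "Precision" equals the diagonal entry divided by its row sum, and the value printed as "Recall" equals the diagonal entry divided by its column sum. The notebook does not state which axis holds the true labels, so the code only mirrors the arithmetic that matches the printed output. `per_class_metrics` is a hypothetical helper, not part of `JHPModel`.

```python
def per_class_metrics(matrix):
    """Return (row_ratio, col_ratio, f1) per class for a square confusion matrix."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    results = []
    for k in range(n):
        row_ratio = matrix[k][k] / row_sums[k]  # printed as "Precision"
        col_ratio = matrix[k][k] / col_sums[k]  # printed as "Recall"
        # F1 is the harmonic mean of the two ratios.
        f1 = 2 * row_ratio * col_ratio / (row_ratio + col_ratio)
        results.append((row_ratio, col_ratio, f1))
    return results


# Random Forest confusion matrix copied from the output above.
rf_matrix = [
    [2380, 656, 125, 47],
    [346, 3752, 149, 36],
    [183, 835, 1175, 56],
    [175, 427, 47, 526],
]

rf_metrics = per_class_metrics(rf_matrix)
for k, (p, r, f1) in enumerate(rf_metrics):
    print(f"Class {k} | Precision = {p:.3f} | Recall = {r:.3f} | F1 = {f1:.3f}")
```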


### Naive Bayes

In [5]:
clf = MultinomialNB(alpha=0.1)

In [6]:
m = JHPModel(clf, stemlem="", min_df=0.01, max_df=0.95, num_cities=4,
             n_grams=(1, 2), use_stopwords=True)
m.cross_validate(df)

Model Accuracy = 0.5556574578605649
Confusion_Matrix:
[[ 1949.   894.   330.    30.]
 [  896.  2971.   606.    59.]
 [  364.   828.  1004.    44.]
 [  316.   357.   234.   276.]]
Class 0 | Precision = 0.608 | Recall = 0.553 | F1 = 0.579
Class 1 | Precision = 0.656 | Recall = 0.588 | F1 = 0.620
Class 2 | Precision = 0.448 | Recall = 0.462 | F1 = 0.455
Class 3 | Precision = 0.233 | Recall = 0.675 | F1 = 0.347


### XGBoost

In [8]:
clf = XGBClassifier(n_estimators=2000, max_depth=3, learning_rate=0.25, n_jobs=-1)
m = JHPModel(clf, stemlem="", min_df=0.01, max_df=0.95, num_cities=4,
             n_grams=(1, 2), use_stopwords=True)
m.cross_validate(df)

Model Accuracy = 0.7305791158616927
Confusion_Matrix:
[[ 2376.   531.   222.    61.]
 [  379.  3758.   330.    50.]
 [  221.   573.  1380.    72.]
 [  158.   264.   136.   613.]]
Class 0 | Precision = 0.745 | Recall = 0.758 | F1 = 0.751
Class 1 | Precision = 0.832 | Recall = 0.733 | F1 = 0.779
Class 2 | Precision = 0.614 | Recall = 0.667 | F1 = 0.640
Class 3 | Precision = 0.523 | Recall = 0.770 | F1 = 0.623


### AdaBoost

In [11]:
clf = AdaBoostClassifier(n_estimators=1000, learning_rate=0.25)
m = JHPModel(clf, stemlem="", min_df=0.01, max_df=0.95, num_cities=4,
             n_grams=(1, 2), use_stopwords=True)
m.cross_validate(df)

Model Accuracy = 0.5975928207496455
Confusion_Matrix:
[[ 1868.  1104.   181.    40.]
 [  504.  3625.   341.    56.]
 [  259.  1140.   791.    61.]
 [  192.   463.   141.   375.]]
Class 0 | Precision = 0.585 | Recall = 0.662 | F1 = 0.621
Class 1 | Precision = 0.801 | Recall = 0.572 | F1 = 0.668
Class 2 | Precision = 0.351 | Recall = 0.544 | F1 = 0.427
Class 3 | Precision = 0.320 | Recall = 0.705 | F1 = 0.440
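
To compare the four models side by side, the sketch below ranks them by the accuracy printed in each output. It also averages the printed per-class F1 scores into a macro F1; the notebook does not report that figure, so it is derived here purely from the values shown above. The `results` dictionary and `macro_f1` helper are illustrative names.

```python
# Accuracy (rounded to four decimals) and per-class F1 values copied from the
# cross-validation outputs above.
results = {
    "Random Forest": {"accuracy": 0.7177, "f1": [0.757, 0.754, 0.628, 0.572]},
    "Naive Bayes": {"accuracy": 0.5557, "f1": [0.579, 0.620, 0.455, 0.347]},
    "XGBoost": {"accuracy": 0.7306, "f1": [0.751, 0.779, 0.640, 0.623]},
    "AdaBoost": {"accuracy": 0.5976, "f1": [0.621, 0.668, 0.427, 0.440]},
}


def macro_f1(f1_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)


# Sort model names from highest to lowest accuracy.
ranking = sorted(results, key=lambda name: results[name]["accuracy"], reverse=True)
for name in ranking:
    scores = results[name]
    print(f"{name:<14} accuracy = {scores['accuracy']:.3f} | "
          f"macro F1 = {macro_f1(scores['f1']):.3f}")
```

Both measures give the same ordering: XGBoost, Random Forest, AdaBoost, then Naive Bayes.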
