# Assignment 3 Natural Language Processing - Part 2: traditional machine learning approach: XGBoost

In part 1 of the assignment, we concluded that XGBoost performed better than the SVM among the traditional machine learning approaches. We therefore use XGBoost as the machine learner for the remainder of the assignment.
For the traditional methods, we investigated TF-IDF, Part-of-Speech tagging and Bag-of-Words as input to the model. In part 1 we could not identify a clear optimum, because a different input worked best for each target variable. We therefore test the three inputs again, this time with one model that predicts all 5 personality traits in one go.

The preprocessing is unchanged, as part 1 gave no insights that called for changes to it.

In the remainder of the assignment, we no longer predict the class (1 or 0) but work with the probability of the text belonging to that class. We thus have one model that predicts the 5 classes in one go, with probabilities as outcomes.

This requires some changes to the model. We previously used an XGBClassifier with the 'binary:hinge' objective function for the classification, which yields 1 or 0. To predict probabilities, we replace the XGBClassifier with an XGBRegressor and change the objective function to 'binary:logistic', which yields outcome probabilities.
To do the 5 classifications at once, we pass the data frame containing all 5 target variables as y, instead of a single column.
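
A minimal sketch of the multi-target setup, under stated assumptions: the data is synthetic, and scikit-learn's `MultiOutputClassifier` around `LogisticRegression` stands in for XGBoost to keep the example small. It only illustrates the shapes involved: a five-column `y` goes in, and one probability per trait comes out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

TRAITS = ['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))                     # synthetic features (assumption)
Y_toy = rng.integers(0, 2, size=(40, len(TRAITS)))   # synthetic 0/1 labels, one column per trait
Y_toy[0, :] = 0  # make sure both classes occur in every column
Y_toy[1, :] = 1

# Stand-in model: one logistic regression per trait (not the XGBoost model of this notebook)
model = MultiOutputClassifier(LogisticRegression())
model.fit(X_toy, Y_toy)

# predict_proba returns one (n_samples, 2) array per trait; keep P(class == 1)
prob_toy = np.column_stack([p[:, 1] for p in model.predict_proba(X_toy)])
print(prob_toy.shape)  # (40, 5)
```

Each row of `prob_toy` holds five probabilities, one per trait, which is the kind of output the rest of this part works with.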

Because we changed from binary classification to probability prediction, we also test how a binary classifier performs when predicting the 5 traits in one go. We therefore also train and test 3 additional XGBClassifiers with the 'binary:hinge' objective function. This lets us see whether changing from binary classification to probability prediction affects the performance. For the parameters, we use the optimal parameters found when investigating the TF-IDF model of part 2.


To assess the performance, we could not initially use F1, Accuracy, Precision and Recall, as these metrics require categorical predictions, whereas the regressor yields numerical values. This would imply regressor-based metrics such as RMSE, but then the results could not be compared with part 1. Therefore, after prediction we convert each probability to 1 if it is higher than 0.5 and to 0 otherwise. This way, we can still compute the desired performance metrics and compare them to other models. As we found F1 and ROC-AUC more important than Accuracy, Recall and Precision, we dropped the latter three and only looked at F1 and ROC-AUC.
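
The conversion can be sketched in a few lines of NumPy. The helper names `probabilities_to_labels` and `f1_positive` and the toy numbers are illustrative assumptions; the comparison follows the `> 0.5` rule used in `XGB_model_r` further down.

```python
import numpy as np

def probabilities_to_labels(prob, threshold=0.5):
    """Map probabilities to 0/1 labels: 1 when prob > threshold, else 0."""
    return (np.asarray(prob) > threshold).astype(int)

def f1_positive(y_true, y_pred):
    """F1 of the positive class, pooled over all entries (hypothetical helper)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

toy_prob = np.array([[0.9, 0.2], [0.4, 0.7], [0.6, 0.1]])  # made-up probabilities
toy_true = np.array([[1, 0], [1, 1], [0, 0]])              # made-up labels
toy_pred = probabilities_to_labels(toy_prob)

print(toy_pred.tolist())                # [[1, 0], [0, 1], [1, 0]]
print(f1_positive(toy_true, toy_pred))  # 0.666...
```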

For this part, we again test the three vectorization techniques separately. We first run a hyperparameter search to find the optimal XGBoost parameters, with some parameters changed to fit the XGBRegressor instead of the XGBClassifier used in part 1. We then validate the models and compare them to see which vectorization method works best. Finally, we investigate how well XGBoost predicts the 5 traits in one go, and compare it to predicting the traits separately, as done in part 1, and to the deep learners.


In [18]:
import nltk
import numpy as np
import pandas as pd
import xgboost as xgb
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, f1_score, make_scorer, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (RandomizedSearchCV, RepeatedKFold, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB

# Reading the data

In [2]:
df = pd.read_csv('processed_data_english_no_lowercasing.csv')
df

| Unnamed: 0 | TEXT | cEXT | cNEU | cAGR | cCON | cOPN |
| :--: | :-- | :--: | :--: | :--: | :--: | :--: |
| 0 | Well right woke midday nap Its sort weird ever... | 0 | 1 | 1 | 0 | 1 |
| 1 | Well stream consciousness essay used thing lik... | 0 | 0 | 1 | 0 | 0 |
| 2 | open keyboard button push The thing finally wo... | 0 | 1 | 0 | 1 | 1 |
| 3 | cant believe Its really happening pulse racing... | 1 | 0 | 1 | 1 | 0 |
| 4 | Well good old stream consciousness assignment ... | 1 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2958 | motivated day day basis need provide little fa... | 1 | 0 | 0 | 1 | 1 |
| 2959 | son biggest part life without reckless person ... | 1 | 1 | 0 | 0 | 0 |
| 2960 | kid grandkids keep motivated everyday inspire ... | 1 | 0 | 1 | 1 | 0 |
| 2961 | biggest drive earn money retire beach schedule... | 0 | 0 | 0 | 0 | 0 |


In [3]:
corpus = df['TEXT'].tolist()

In [4]:
y = df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]

## Using TF-IDF 

First, a single train-test split is made for all 5 target variables together. Afterwards, X_train and X_test are transformed using TF-IDF.

In [5]:
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(corpus, y, test_size=0.2, random_state=42)

In [6]:
tfidf = TfidfVectorizer()

# Fit on the training texts only, then apply the same vocabulary to the test texts
X_train_t = tfidf.fit_transform(X_train_t)
X_test_t = tfidf.transform(X_test_t)
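
A small sketch of why the vectorizer is fitted on the training texts only. The toy sentences are made up for illustration; the point is that `transform` reuses the vocabulary learned by `fit_transform`, so a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good old stream consciousness", "cant believe really happening"]  # toy texts
test_docs = ["stream essay really"]  # 'essay' never occurs in the training texts

toy_tfidf = TfidfVectorizer()
X_train_toy = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = toy_tfidf.transform(test_docs)        # reuses them; unseen words are dropped

print(X_train_toy.shape, X_test_toy.shape)  # (2, 8) (1, 8)
print('essay' in toy_tfidf.vocabulary_)     # False
```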

### Hyperparameter tuning

In order to have the optimal XGBoost model, a Randomized Search is done to find the best combination of 'learning rate', 'gamma', 'lambda' and 'alpha' parameters of the XGBoost.
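
To give a sense of the size of the search space, the sketch below enumerates the grid used in the next cell and draws 10 candidates from it, matching the default `n_iter=10` of `RandomizedSearchCV`. The sampling with `random.sample` is an illustrative stand-in, not scikit-learn's own sampler.

```python
import itertools
import random

params_grid = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0]}

# Every combination of the four parameters: 3 * 3 * 3 * 3 = 81
all_combinations = [dict(zip(params_grid, values))
                    for values in itertools.product(*params_grid.values())]

# Draw 10 distinct candidates (illustrative sampler, fixed seed for repeatability)
random.seed(0)
candidates = random.sample(all_combinations, k=10)

print(len(all_combinations), len(candidates))  # 81 10
```

With 10 of 81 combinations tried, the search covers roughly one eighth of the grid.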

In [22]:
# Silence scikit-learn warnings by replacing warnings.warn with a no-op
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn

In [24]:
xgb_model = xgb.XGBRegressor(objective='binary:logistic')  # logistic objective: outputs are probabilities

# Define the hyperparameter grid to search
params_dict = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0],
              }

scoring = {
    'f1': make_scorer(f1_score, average='micro'),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', average='micro'),
}

search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params_dict,
    scoring=scoring,
    refit='f1', 
    cv=2,
    verbose=3)
    
search.fit(X_train_t, y_train_t)

search.best_params_

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END alpha=0.01, gamma=0.1, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.552) total time=  49.8s
[CV 2/2] END alpha=0.01, gamma=0.1, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.556) total time=  54.7s
[CV 1/2] END alpha=1.0, gamma=0.01, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.568) total time=  47.1s
[CV 2/2] END alpha=1.0, gamma=0.01, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.565) total time=  46.1s
[CV 1/2] END alpha=0.1, gamma=0.01, lambda=0.01, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.560) total time=  41.8s
[CV 2/2] END alpha=0.1, gamma=0.01, lambda=0.01, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.556) total time=  38.6s
[CV 1/2] END alpha=0.1, gamma=0.1, lambda=0.1, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.567) total time=  40.5s
[CV 2/2] END alpha=0.1, gamma=0.1, lambda=0.1, learning_rate=0.1; f1: (

{'learning_rate': 0.01, 'lambda': 0.01, 'gamma': 0.1, 'alpha': 0.01}
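
In the log above, every `f1` entry is `nan` while `roc_auc` is populated. One plausible cause (an assumption, not verified against the original run) is that `f1_score` receives the regressor's continuous outputs rather than 0/1 labels. With `refit='f1'` and no valid F1 values, the selected parameters may not reflect a real comparison. A hedged sketch of a scorer that thresholds the predictions first; `thresholded_micro_f1` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

def thresholded_micro_f1(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1 after mapping probabilities to 0/1 labels (hypothetical helper)."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return f1_score(np.asarray(y_true), y_pred, average='micro')

# Could be passed as scoring={'f1': f1_from_prob, ...} to RandomizedSearchCV
f1_from_prob = make_scorer(thresholded_micro_f1)

toy_true = np.array([[1, 0], [0, 0], [0, 0]])              # made-up labels
toy_prob = np.array([[0.8, 0.3], [0.2, 0.6], [0.1, 0.4]])  # made-up probabilities
print(thresholded_micro_f1(toy_true, toy_prob))  # 0.666...
```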

In [7]:
opt_params_r_t = {'learning_rate': 0.01, 'lambda': 0.01, 'gamma': 0.1, 'alpha': 0.01}
opt_params_c_t = {'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 0.5}

### Actual training and testing

With the optimal parameters found, the XGBoost is trained on the training set and then tested for validation. 

In [20]:
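def XGB_model_c(X_train, X_test, y_train, y_test, opt_params):
    """Fit one XGBClassifier per trait and report F1 and ROC AUC on hard 0/1 predictions."""
    # MultiOutputClassifier fits a separate copy of the classifier for each target column.
    # Note: this cell sets 'multi:softmax' with num_class=5, whereas the text names
    # 'binary:hinge' as the classifier objective.
    xgb_model = MultiOutputClassifier(xgb.XGBClassifier(
        objective='multi:softmax',
        num_class=5,
        min_child_weight=opt_params['min_child_weight'],
        max_depth=opt_params['max_depth'],
        learning_rate=opt_params['learning_rate'],
        gamma=opt_params['gamma']
    ))

    xgb_model.fit(X_train, y_train)

    predictions = xgb_model.predict(X_test)

    # Both metrics are computed over all traits at once by flattening to 1-D
    f1 = f1_score(y_test.values.flatten(), predictions.flatten(), average='micro')
    roc_auc = roc_auc_score(y_test.values.flatten(), predictions.flatten(), average='micro')

    print("F1 Score: {:.2f}".format(f1))
    print("ROC AUC: {:.2f}".format(roc_auc))

    return [f1, roc_auc]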

In [27]:
def XGB_model_r(X_train, X_test, y_train, y_test, opt_params):
    """Fit one XGBRegressor on all traits and report F1 and ROC AUC from its probabilities."""
    xgb_model = xgb.XGBRegressor(objective='binary:logistic',
                                 reg_lambda=opt_params['lambda'],
                                 alpha=opt_params['alpha'],
                                 learning_rate=opt_params['learning_rate'],
                                 gamma=opt_params['gamma'],
                                 eval_metric='rmse')

    xgb_model.fit(X_train, y_train)

    predictions_prob = xgb_model.predict(X_test)

    # Convert probabilities to binary predictions
    predictions_binary = (predictions_prob > 0.5).astype(int)

    # F1 uses the thresholded labels; ROC AUC uses the raw probabilities
    f1 = f1_score(y_test.values.flatten(), predictions_binary.flatten(), average='micro')
    roc_auc = roc_auc_score(y_test.values.flatten(), predictions_prob.flatten(), average='micro')

    print("F1 Score: {:.2f}".format(f1))
    print("ROC AUC: {:.2f}".format(roc_auc))

    return [f1, roc_auc]
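
One detail of the metric computation may be worth checking when interpreting the scores. In scikit-learn, `f1_score(..., average='micro')` on flattened 1-D binary labels counts classes 0 and 1 alike, which makes it equal to accuracy, whereas the same call on the 2-D label matrix pools only the positive entries. The sketch below uses made-up labels to show the difference; it is an illustration, not a re-run of the models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

toy_true = np.array([[1, 0], [0, 0], [0, 0]])  # made-up labels, 2 traits
toy_pred = np.array([[1, 0], [0, 1], [0, 0]])  # made-up predictions

# Flattened 1-D input, as in the functions above: classes 0 and 1 are both counted
f1_flat = f1_score(toy_true.flatten(), toy_pred.flatten(), average='micro')
acc_flat = accuracy_score(toy_true.flatten(), toy_pred.flatten())

# 2-D multilabel input: only the positive entries of each trait are counted
f1_multilabel = f1_score(toy_true, toy_pred, average='micro')

print(f1_flat, acc_flat)  # both 0.8333...
print(f1_multilabel)      # 0.6666...
```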

In [21]:
# With binary classification
metrics_t_c = XGB_model_c(X_train_t, X_test_t, y_train_t, y_test_t, opt_params_c_t)

F1 Score: 0.55
ROC AUC: 0.55


In [28]:
# With probability outcomes
metrics_t_r = XGB_model_r(X_train_t, X_test_t, y_train_t, y_test_t, opt_params_r_t)

F1 Score: 0.54
ROC AUC: 0.56


## Using Part-of-Speech (PoS) tagging

As the second method, PoS tagging was used. To do so, the text inputs are first tagged.

In [29]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\maxma\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [30]:
pos_df = df.copy()

def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return [tag for word, tag in pos_tags]

pos_df['pos_tags'] = pos_df['TEXT'].apply(pos_tagging)

In [31]:
vectorizer = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
X_p = vectorizer.fit_transform(pos_df['pos_tags'])

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_p, y, test_size=0.2, random_state=42)
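
To make the vectorization step concrete, the sketch below applies the same `CountVectorizer` settings to two hand-written tag sequences. The tag lists are assumptions standing in for the output of `pos_tagging`; because each document is already a list of tags, the tokenizer simply returns its input.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written tag sequences standing in for the output of pos_tagging (assumption)
toy_tags = [['DT', 'NN', 'VBZ'], ['NN', 'NN']]

# Identity tokenizer: each document is already a list of tags
toy_vectorizer = CountVectorizer(tokenizer=lambda tags: tags, lowercase=False)
toy_counts = toy_vectorizer.fit_transform(toy_tags)

print(sorted(toy_vectorizer.vocabulary_))  # ['DT', 'NN', 'VBZ']
print(toy_counts.toarray().tolist())       # [[1, 1, 1], [0, 2, 0]]
```

Each text is thus reduced to a vector of tag counts, so the model sees how often each part of speech occurs but not which words were used.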



### Hyperparameter tuning

In order to have the optimal XGBoost model, a Randomized Search is done to find the best combination of 'learning rate', 'gamma', 'lambda' and 'alpha' parameters of the XGBoost.

In [33]:
xgb_model = xgb.XGBRegressor(objective='binary:logistic')  # logistic objective: outputs are probabilities

# Define the hyperparameter grid to search
params_dict = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0],
              }

scoring = {
    'f1': make_scorer(f1_score, average='micro'),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', average='micro'),
}

search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params_dict,
    scoring=scoring,
    refit='f1', 
    cv=2,
    verbose=3)
    
search.fit(X_train_p, y_train_p)

search.best_params_

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END alpha=1.0, gamma=1.0, lambda=0.1, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.524) total time=   0.1s
[CV 2/2] END alpha=1.0, gamma=1.0, lambda=0.1, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.517) total time=   0.1s
[CV 1/2] END alpha=0.1, gamma=1.0, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.524) total time=   0.0s
[CV 2/2] END alpha=0.1, gamma=1.0, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.525) total time=   0.1s
[CV 1/2] END alpha=0.1, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.528) total time=   0.4s
[CV 2/2] END alpha=0.1, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.524) total time=   0.6s
[CV 1/2] END alpha=0.1, gamma=0.1, lambda=0.01, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.518) total time=   0.5s
[CV 2/2] END alpha=0.1, gamma=0.1, lambda=0.01, learning_rate=0.1; f1: (test=n

{'learning_rate': 0.1, 'lambda': 0.1, 'gamma': 1.0, 'alpha': 1.0}

In [35]:
opt_params_r_p = {'learning_rate': 0.1, 'lambda': 0.1, 'gamma': 1.0, 'alpha': 1.0}
opt_params_c_p = {'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 0.5}

### Actual training and testing

With the optimal parameters found, the XGBoost is trained on the training set and then tested for validation.

In [36]:
# With binary classification
metrics_p_c = XGB_model_c(X_train_p, X_test_p, y_train_p, y_test_p, opt_params_c_p)

F1 Score: 0.52
ROC AUC: 0.51


In [37]:
# With probability outcomes
metrics_p_r = XGB_model_r(X_train_p, X_test_p, y_train_p, y_test_p, opt_params_r_p)

F1 Score: 0.52
ROC AUC: 0.52


## Using Bag-of-Words (BoW)

First, the texts are converted into a Bag-of-Words representation and the train-test split is made.

In [38]:
vectorizer = CountVectorizer()
X_b = vectorizer.fit_transform(df['TEXT'])

In [39]:
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y, test_size=0.2, random_state=42)

### Hyperparameter tuning

In order to have the optimal XGBoost model, a Randomized Search is done to find the best combination of 'learning rate', 'gamma', 'lambda' and 'alpha' parameters of the XGBoost.

In [38]:
xgb_model = xgb.XGBRegressor(objective='binary:logistic')  # logistic objective: outputs are probabilities

# Define the hyperparameter grid to search
params_dict = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0],
              }

scoring = {
    'f1': make_scorer(f1_score, average='micro'),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', average='micro'),
}

search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params_dict,
    scoring=scoring,
    refit='f1', 
    cv=2,
    verbose=3)
    
search.fit(X_train_b, y_train_b)

search.best_params_

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END alpha=1.0, gamma=1.0, lambda=0.1, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.559) total time=  11.5s
[CV 2/2] END alpha=1.0, gamma=1.0, lambda=0.1, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.564) total time=  16.8s
[CV 1/2] END alpha=1.0, gamma=0.1, lambda=0.1, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.566) total time=  17.4s
[CV 2/2] END alpha=1.0, gamma=0.1, lambda=0.1, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.569) total time=  21.7s
[CV 1/2] END alpha=1.0, gamma=1.0, lambda=0.01, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.562) total time=  11.9s
[CV 2/2] END alpha=1.0, gamma=1.0, lambda=0.01, learning_rate=0.1; f1: (test=nan) roc_auc: (test=0.561) total time=  11.9s
[CV 1/2] END alpha=0.01, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.554) total time=  18.9s
[CV 2/2] END alpha=0.01, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test

{'learning_rate': 0.01, 'lambda': 0.1, 'gamma': 1.0, 'alpha': 1.0}

In [41]:
opt_params_r_b = {'learning_rate': 0.01, 'lambda': 0.1, 'gamma': 1.0, 'alpha': 1.0}
opt_params_c_b = {'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 0.5}

### Actual training and testing

With the optimal parameters found, the XGBoost is trained on the training set and then tested for validation.

In [43]:
# With binary classification
metrics_b_c = XGB_model_c(X_train_b, X_test_b, y_train_b, y_test_b, opt_params_c_b)

F1 Score: 0.54
ROC AUC: 0.54


In [44]:
# With probability outcomes
metrics_b_r = XGB_model_r(X_train_b, X_test_b, y_train_b, y_test_b, opt_params_r_b)

F1 Score: 0.53
ROC AUC: 0.55


# Comparison

In [45]:
# binary classification
overall_metrics = pd.DataFrame([metrics_t_c,
                                metrics_p_c,
                                metrics_b_c],
                                columns = ['F1', 'ROC-AUC'])

new_index_values = ['TF-IDF', 'PoS', 'BoW']
overall_metrics = overall_metrics.set_index(pd.Index(new_index_values))
overall_metrics

| | F1 | ROC-AUC |
| :--: | :--: | :--: |
| TF-IDF | 0.550422 | 0.550363 |
| PoS | 0.515683 | 0.514978 |
| BoW | 0.540304 | 0.540084 |


In [46]:
# predicting probability
overall_metrics = pd.DataFrame([metrics_t_r,
                                metrics_p_r,
                                metrics_b_r],
                                columns = ['F1', 'ROC-AUC'])

new_index_values = ['TF-IDF', 'PoS', 'BoW']
overall_metrics = overall_metrics.set_index(pd.Index(new_index_values))
overall_metrics

| | F1 | ROC-AUC |
| :--: | :--: | :--: |
| TF-IDF | 0.537605 | 0.561213 |
| PoS | 0.51602 | 0.516281 |
| BoW | 0.53086 | 0.549288 |


# Conclusions

As in part 1, we compared the 3 vectorization techniques, because in part 1 each personality trait favoured a different technique. We wanted to see whether a clear optimal technique could now be identified.

Inspecting the results, we first see that F1 and ROC-AUC are low for all 3 vectorization techniques, with every score between 0.51 and 0.57. A machine learner clearly does not perform well when predicting 5 target variables at once. TF-IDF scores highest on both metrics in both settings, followed by BoW and PoS, although the margins are small. Compared to binary classification, predicting probabilities gives a slightly lower F1 for TF-IDF and BoW, so in terms of F1 probability prediction appears slightly more difficult for the model. The differences are small, however, and both prediction methods perform badly.


## Comparison to part 1
The following table shows the F1 scores of part 1 and of this part (both binary and probability).

| Vectorization technique | Separate predicting | Simultaneous predicting (binary) | Simultaneous predicting (probability) |
| :----------: | :--: | :--: | :--: |
| TF-IDF | 0.661805 | 0.550422 | 0.537605 |
| PoS | 0.665579 | 0.515683 | 0.516020 |
| BoW | 0.664900 | 0.540304 | 0.530860 |

The table shows that moving from predicting each target variable separately to predicting all of them at once lowers the F1 by roughly 0.11 to 0.15. The model clearly predicts each personality trait better when it predicts that trait alone. This finding makes sense, as the model now has to consider 5 targets instead of 1, which might imply increased complexity and interdependence of the personality traits. It is thus better to train a model for each trait separately.
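
The size of the drop can be read off the table with a few lines of arithmetic. The sketch below copies the F1 values from the table and computes the absolute difference per technique; the dictionary layout and the key names are illustrative choices.

```python
# F1 scores copied from the comparison table above
f1_scores = {
    'TF-IDF': {'separate': 0.661805, 'binary': 0.550422, 'probability': 0.537605},
    'PoS':    {'separate': 0.665579, 'binary': 0.515683, 'probability': 0.516020},
    'BoW':    {'separate': 0.664900, 'binary': 0.540304, 'probability': 0.530860},
}

# Absolute F1 drop when moving from separate to simultaneous prediction
drops = {
    name: {mode: round(row['separate'] - row[mode], 6) for mode in ('binary', 'probability')}
    for name, row in f1_scores.items()
}

for name, row in drops.items():
    print(name, row)
```

PoS shows the largest drop in both settings and TF-IDF the smallest.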

In conclusion, the performance when predicting everything in one go is not high. It makes little difference whether the prediction is made with binary classification or with probabilities, as both perform almost identically. Compared to part 1, the performance dropped, so a machine learner predicts the traits better separately than all in one go.

