# Assignment 3 Natural Language Processing - Part 2: traditional machine learning approach: XGBoost

In this part of the assignment, we explore the possibilities of multilingual and cross-lingual modelling with traditional machine learning methods. Because multilingual and cross-lingual models differ, we look at them separately.

Firstly, we look into cross-lingual models. A cross-lingual model is trained on texts in one language and then used to make predictions on texts in another language. Given this, we identified two options to explore:

- We train a model on the English text corpus and validate it on the Dutch text corpus.
- We then switch the languages: we train the model on the Dutch text corpus and validate it on the English text corpus.

We can then compare the performance of these two models to see how the choice of training language (Dutch or English) affects the predictions on the other language.
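One thing to keep in mind with this setup is that a bag-of-words vocabulary learned from one language only covers the tokens that also occur in the other language; scikit-learn's `TfidfVectorizer.transform` ignores terms that were not seen during fitting. The sketch below estimates that coverage in both directions. The helper `token_coverage`, the whitespace tokenisation, and the toy sentences are assumptions made for illustration and are not part of the original notebook.

```python
def token_coverage(train_docs, test_docs):
    """Fraction of test tokens that also occur in the training vocabulary."""
    # Simplified tokenisation: lowercase and split on whitespace.
    vocab = {tok.lower() for doc in train_docs for tok in doc.split()}
    test_tokens = [tok.lower() for doc in test_docs for tok in doc.split()]
    if not test_tokens:
        return 0.0
    known = sum(tok in vocab for tok in test_tokens)
    return known / len(test_tokens)


# Toy sentences, invented for illustration only (not taken from the dataset).
english = ["the open keyboard works", "my essay is good"]
dutch = ["het open toetsenbord werkt", "mijn essay is goed"]

for name, train, test in [("English -> Dutch", english, dutch),
                          ("Dutch -> English", dutch, english)]:
    print(name, token_coverage(train, test))
```

Running the same helper on the real English and Dutch corpora would give a rough idea of how much of the test language a model can actually see in each direction.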

Secondly, we investigate the potential of multilingual models. Multilingual models take texts in different languages as input. Given this, we again identified two approaches:

- As we have both the English and Dutch versions of each text, we concatenate the two into one larger text that contains both versions and feed this into a model. The model thus receives longer documents as input.
- As a second option, we add the Dutch texts with their respective target variables to the English dataset as extra rows. This enlarges the dataset: the model receives more documents as input and is trained on both English and Dutch.

As with the cross-lingual models, we can then compare the performance of these two models against each other.
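The difference between the two multilingual options can be sketched in plain Python. The column names follow the dataset used in this notebook; the helper names `concatenate` and `append` and the two toy rows are hypothetical and only serve the illustration.

```python
LABELS = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]


def concatenate(rows):
    """Option 1: one document per essay, English and Dutch text joined."""
    return [
        {"TEXT": r["TEXT"] + " " + r["TEXT_NL"], **{k: r[k] for k in LABELS}}
        for r in rows
    ]


def append(rows):
    """Option 2: two documents per essay, English rows followed by Dutch rows."""
    eng = [{"TEXT": r["TEXT"], **{k: r[k] for k in LABELS}} for r in rows]
    nl = [{"TEXT": r["TEXT_NL"], **{k: r[k] for k in LABELS}} for r in rows]
    return eng + nl


# Toy rows, invented for illustration only.
rows = [
    {"TEXT": "good essay", "TEXT_NL": "goed essay",
     "cEXT": 1, "cNEU": 0, "cAGR": 1, "cCON": 0, "cOPN": 1},
    {"TEXT": "open keyboard", "TEXT_NL": "open toetsenbord",
     "cEXT": 0, "cNEU": 1, "cAGR": 0, "cCON": 1, "cOPN": 0},
]

print(len(concatenate(rows)), "documents after concatenating")
print(len(append(rows)), "documents after appending")
```

Concatenating keeps the number of documents the same but doubles their length, whereas appending doubles the number of documents and repeats each label vector once per language.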

For the machine learner, we again use an XGBoost regressor with the 'binary:logistic' objective function, because it outperformed the SVM. As the vectorisation technique we use TF-IDF: part 2 showed that, although its performance is not great, it achieved the highest F1 scores when the model was trained to predict all 5 target variables in one go. We apply hyperparameter tuning to find a suitable set of parameters for the XGBoost model.

As with the other parts, we used the F1 score as the main performance metric but also determined the ROC AUC.
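It helps to be explicit about what the F1 score measures in this notebook. The evaluation flattens the five trait labels into one 0/1 vector and uses `average='micro'`. As far as we understand scikit-learn's behaviour, micro-averaging over a flat binary vector counts both classes (0 and 1), in which case the score coincides with plain accuracy. The pure-Python sketch below illustrates that reading; the functions `micro_f1` and `accuracy` and the toy vectors are assumptions for illustration, not the notebook's own code.

```python
def micro_f1(y_true, y_pred, classes=(0, 1)):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += int(t == c and p == c)
            fp += int(t != c and p == c)
            fn += int(t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(y_true, y_pred):
    """Share of positions where prediction and label agree."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy flattened labels and predictions, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

print("micro F1:", micro_f1(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
```

If that reading is right, the F1 values reported below can be interpreted as the share of correctly predicted trait labels, which is why 0.5 is a reasonable chance baseline for comparison.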


In [1]:
import numpy as np
import pandas as pd
import nltk
import xgboost as xgb
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, f1_score, make_scorer, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (RandomizedSearchCV, RepeatedKFold, StratifiedKFold,
                                     cross_val_score, train_test_split)



# Cross-lingual

First, we determine the performance of the cross-lingual methods by training on the English text and validating on the Dutch text, and vice versa.

In [3]:
eng = pd.read_csv('processed_data_english_no_lowercasing.csv')
nl = pd.read_csv('processed_data_dutch_no_lowercasing.csv')
nl = nl.rename(columns={'TEXT_NL': 'TEXT'})

In [5]:
corpus_train = eng['TEXT'].tolist()
y_train_eng_nl = eng[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]

corpus_test = nl['TEXT'].tolist()
# Labels for the Dutch test corpus (taken from nl, not eng)
y_test_eng_nl = nl[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]

In [6]:
tfidf = TfidfVectorizer()  

# Fit the vocabulary on the English corpus, then apply it to the Dutch corpus
X_train_eng_nl = tfidf.fit_transform(corpus_train)
X_test_eng_nl = tfidf.transform(corpus_test)

In [7]:
opt_param_t = {'learning_rate': 0.01, 'lambda': 0.01, 'gamma': 0.1, 'alpha': 0.01}

In [8]:
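def XGB_model(X_train, X_test, y_train, y_test, opt_params):
    """Fit XGBoost on the training data and return [F1, ROC AUC] on the test data.

    Predicted probabilities are thresholded at 0.5, and the five target
    columns are flattened before both metrics are computed.
    """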
    xgb_model = xgb.XGBRegressor(objective='binary:logistic', 
                                reg_lambda=opt_params['lambda'], 
                                alpha=opt_params['alpha'], 
                                learning_rate=opt_params['learning_rate'], 
                                gamma=opt_params['gamma'],
                                eval_metric='rmse')
    
    xgb_model.fit(X_train, y_train)

    predictions_prob = xgb_model.predict(X_test)

    predictions_binary = (predictions_prob > 0.5).astype(int)
    
    f1 = f1_score(y_test.values.flatten(), predictions_binary.flatten(), average='micro')
    print('F1 Score:', f1)

    auc_roc = roc_auc_score(y_test.values.flatten(), predictions_prob.flatten(), average='micro')
    print('AUC/ROC:', auc_roc)
    
    return [f1, auc_roc]

In [9]:
# English with Dutch validation
metrics_eng_nl = XGB_model(X_train_eng_nl, X_test_eng_nl, y_train_eng_nl, y_test_eng_nl, opt_param_t)

F1 Score: 0.5044211947350659
AUC/ROC: 0.5250265708475865


In [10]:
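# Dutch with English validation (arguments swapped: the Dutch matrices are now the training data)
# Note: both matrices still use the TF-IDF vocabulary that was fitted on the English corpus above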
metrics_nl_eng = XGB_model(X_test_eng_nl, X_train_eng_nl, y_test_eng_nl, y_train_eng_nl, opt_param_t)

F1 Score: 0.5455956800539993
AUC/ROC: 0.5635992984231872


# Multi-lingual

As the second part, we determine the performance of a model that receives text in multiple languages. As mentioned, we try two methods for doing so.

## Concatenating

The first method is to append each Dutch text to its English counterpart, enlarging the text that the model gets as input.

In [54]:
df = pd.read_csv('processed_data_full_no_lowercasing.csv')
df

Unnamed: 0,TEXT,TEXT_NL,cEXT,cNEU,cAGR,cCON,cOPN
0,Well right woke midday nap Its sort weird ever...,Nou moment net wakker middagdutje Het beetje r...,0,1,1,0,1
1,Well stream consciousness essay used thing lik...,Nou gaan stroom bewustzijn essay deed soort di...,0,0,1,0,0
2,open keyboard button push The thing finally wo...,Een open toetsenbord knoppen drukken Het ding ...,0,1,0,1,1
3,cant believe Its really happening pulse racing...,geloven Het gebeurt echt Mijn pol raast gek Du...,1,0,1,1,0
4,Well good old stream consciousness assignment ...,Welnu weer goede oude stroom bewustzijnstoewij...,1,0,1,0,1
...,...,...,...,...,...,...,...
2958,motivated day day basis need provide little fa...,word dagelijks gemotiveerd noodzaak kleine gez...,1,0,0,1,1
2959,son biggest part life without reckless person ...,Mijn zoon grootste deel leven gewoon roekeloze...,1,1,0,0,0
2960,kid grandkids keep motivated everyday inspire ...,Mijn kinderen kleinkinderen houden elke dag ge...,1,0,1,1,0
2961,biggest drive earn money retire beach schedule...,Mijn grootste drijfveer geld verdienen zodat p...,0,0,0,0,0


In [55]:
df['TEXT_NL'][0]

'Nou moment net wakker middagdutje Het beetje raar sind Texas verhuisde problemen gehad concentreren dingen herinner huiswerk klas begon zodra klok sloeg stopte totdat klaar Natuurlijk gemakkelijker deed steed Maar hierheen verhuisde huiswerk beetje uitdagender druk werk besloot uren besteden gewoon redden Maar ding klas oplette gewoon spul kende terugkijk echt hard gewerkt afgelopen twee jaar goede spoor gebleven lui genie hey allemaal goed Het laat verleden corrigeren weet echt toekomst gefocust blijven Het enige weet wanneer mensen zeggen campus wonen concentreren Voor gemakkelijker helaas woon thuis waakzame oog ouders klein zeurend zusje alleen zeurt zeurt begrijpt bedoel Een ander ding gewoon gedoe helemaal terug school moeten gaan gewoon bibliotheek gaan studeren verhuizen weet vertellen Begrijp verkeerd zie waar vandaan komen waarom willen verhuis weggaan alleen zoveel beschermd maak zorgen wereld Het enige vragen kamer schoon houden toe helpen bedrijf Maar wel Maar genoeg geld

In [56]:
df['TEXT_COMBINED'] = df['TEXT'] + ' ' + df['TEXT_NL']
df = df[['TEXT', 'TEXT_NL',  'TEXT_COMBINED', 'cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]
df

Unnamed: 0,TEXT,TEXT_NL,TEXT_COMBINED,cEXT,cNEU,cAGR,cCON,cOPN
0,Well right woke midday nap Its sort weird ever...,Nou moment net wakker middagdutje Het beetje r...,Well right woke midday nap Its sort weird ever...,0,1,1,0,1
1,Well stream consciousness essay used thing lik...,Nou gaan stroom bewustzijn essay deed soort di...,Well stream consciousness essay used thing lik...,0,0,1,0,0
2,open keyboard button push The thing finally wo...,Een open toetsenbord knoppen drukken Het ding ...,open keyboard button push The thing finally wo...,0,1,0,1,1
3,cant believe Its really happening pulse racing...,geloven Het gebeurt echt Mijn pol raast gek Du...,cant believe Its really happening pulse racing...,1,0,1,1,0
4,Well good old stream consciousness assignment ...,Welnu weer goede oude stroom bewustzijnstoewij...,Well good old stream consciousness assignment ...,1,0,1,0,1
...,...,...,...,...,...,...,...,...
2958,motivated day day basis need provide little fa...,word dagelijks gemotiveerd noodzaak kleine gez...,motivated day day basis need provide little fa...,1,0,0,1,1
2959,son biggest part life without reckless person ...,Mijn zoon grootste deel leven gewoon roekeloze...,son biggest part life without reckless person ...,1,1,0,0,0
2960,kid grandkids keep motivated everyday inspire ...,Mijn kinderen kleinkinderen houden elke dag ge...,kid grandkids keep motivated everyday inspire ...,1,0,1,1,0
2961,biggest drive earn money retire beach schedule...,Mijn grootste drijfveer geld verdienen zodat p...,biggest drive earn money retire beach schedule...,0,0,0,0,0


In [57]:
df['TEXT_COMBINED'][0]

'Well right woke midday nap Its sort weird ever since moved Texas problem concentrating thing remember starting homework grade soon clock struck stopping done course easier still But moved homework got little challenging lot busy work decided spend hour getting But thing always paid attention class plain knew stuff look back really worked hard stayed track last two year without getting lazy would genius hey thats good Its late correct past dont really know stay focused future The one thing know people say live campus cant concentrate For would easier ala living home watchful eye parent little nagging sister nag nag nag You get point Another thing hassle way back school library study need move dont know tell Dont get wrong see theyre coming dont want move need get away Theyve sheltered much dont worry world The thing ask keep room clean help business cant even But need But got enough money live dorm apartment next semester think Ill take advantage But topic went sixth street last night 

In [62]:
corpus = df['TEXT_COMBINED'].tolist()
y = df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]

In [63]:
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(corpus, y, test_size=0.2, random_state=42)

In [64]:
tfidf = TfidfVectorizer()  
X_train_t = tfidf.fit_transform(X_train_t)
X_test_t = tfidf.transform(X_test_t)

In [69]:
# To not show the Sklearn warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [70]:
xgb_model = xgb.XGBRegressor(objective='binary:logistic')  # Logistic objective: predictions are probabilities in [0, 1]

# Define the hyperparameter grid to search
params_dict = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0],
              }

scoring = {
    'f1': make_scorer(f1_score, average='micro'),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', average='micro'),
}

search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params_dict,
    scoring=scoring,
    refit='f1', 
    cv=2,
    verbose=3)
    
search.fit(X_train_t, y_train_t)

search.best_params_

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END alpha=1.0, gamma=1.0, lambda=1.0, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.583) total time=  45.4s
[CV 2/2] END alpha=1.0, gamma=1.0, lambda=1.0, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.556) total time=  44.5s
[CV 1/2] END alpha=1.0, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.574) total time= 2.0min
[CV 2/2] END alpha=1.0, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.553) total time= 2.0min
[CV 1/2] END alpha=0.01, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.572) total time= 1.6min
[CV 2/2] END alpha=0.01, gamma=1.0, lambda=0.01, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.549) total time= 1.7min
[CV 1/2] END alpha=0.01, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.575) total time= 2.0min
[CV 2/2] END alpha=0.01, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (te

{'learning_rate': 0.2, 'lambda': 1.0, 'gamma': 1.0, 'alpha': 1.0}
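The log above reports `f1: (test=nan)` for every candidate, and the returned parameters coincide with the first candidate in the log, so the `refit='f1'` criterion probably did not rank the candidates at all. The most likely cause is that `f1_score` receives the regressor's continuous probabilities instead of 0/1 predictions. A minimal sketch of a scorer that thresholds the probabilities first is given below; the function name `thresholded_micro_f1` and the 0.5 threshold are assumptions that mirror the evaluation in `XGB_model`, and the sketch was not used to produce the results in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer


def thresholded_micro_f1(y_true, y_prob, threshold=0.5):
    """Binarise predicted probabilities, flatten all targets, return micro F1."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return f1_score(np.asarray(y_true).ravel(), y_pred.ravel(), average="micro")


# Scorer that could replace make_scorer(f1_score, average='micro') in the search.
f1_from_probabilities = make_scorer(thresholded_micro_f1)

# Toy check with invented labels and probabilities.
y_true_toy = np.array([[1, 0], [0, 1]])
y_prob_toy = np.array([[0.9, 0.2], [0.6, 0.4]])
print(thresholded_micro_f1(y_true_toy, y_prob_toy))
```

Passing `scoring={'f1': f1_from_probabilities, ...}` to `RandomizedSearchCV` should then yield numeric F1 values, so that the refit step can select among the candidates.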

In [71]:
opt_param = {'learning_rate': 0.2, 'lambda': 1.0, 'gamma': 1.0, 'alpha': 1.0}

In [72]:
def XGB_model(X_train, X_test, y_train, y_test, opt_params):
    xgb_model = xgb.XGBRegressor(objective='binary:logistic', 
                                reg_lambda=opt_params['lambda'], 
                                alpha=opt_params['alpha'], 
                                learning_rate=opt_params['learning_rate'], 
                                gamma=opt_params['gamma'],
                                eval_metric='rmse')
    
    xgb_model.fit(X_train, y_train)

    predictions_prob = xgb_model.predict(X_test)

    predictions_binary = (predictions_prob > 0.5).astype(int)
    
    f1 = f1_score(y_test.values.flatten(), predictions_binary.flatten(), average='micro')
    print('F1 Score:', f1)

    auc_roc = roc_auc_score(y_test.values.flatten(), predictions_prob.flatten(), average='micro')
    print('AUC/ROC:', auc_roc)
    
    return [f1, auc_roc]

In [92]:
metrics = XGB_model(X_train_t, X_test_t, y_train_t, y_test_t, opt_param)

F1 Score: 0.5748735244519393
AUC/ROC: 0.5991152231699481


## Appending

The second method is to enlarge the English dataframe by adding the Dutch versions as new rows.

In [74]:
eng = pd.read_csv('processed_data_english_no_lowercasing.csv')
nl = pd.read_csv('processed_data_dutch_no_lowercasing.csv')

In [78]:
eng

Unnamed: 0,TEXT,cEXT,cNEU,cAGR,cCON,cOPN
0,Well right woke midday nap Its sort weird ever...,0,1,1,0,1
1,Well stream consciousness essay used thing lik...,0,0,1,0,0
2,open keyboard button push The thing finally wo...,0,1,0,1,1
3,cant believe Its really happening pulse racing...,1,0,1,1,0
4,Well good old stream consciousness assignment ...,1,0,1,0,1
...,...,...,...,...,...,...
2958,motivated day day basis need provide little fa...,1,0,0,1,1
2959,son biggest part life without reckless person ...,1,1,0,0,0
2960,kid grandkids keep motivated everyday inspire ...,1,0,1,1,0
2961,biggest drive earn money retire beach schedule...,0,0,0,0,0


In [79]:
nl

Unnamed: 0,TEXT_NL,cEXT,cNEU,cAGR,cCON,cOPN
0,Nou moment net wakker middagdutje Het beetje r...,0,1,1,0,1
1,Nou gaan stroom bewustzijn essay deed soort di...,0,0,1,0,0
2,Een open toetsenbord knoppen drukken Het ding ...,0,1,0,1,1
3,geloven Het gebeurt echt Mijn pol raast gek Du...,1,0,1,1,0
4,Welnu weer goede oude stroom bewustzijnstoewij...,1,0,1,0,1
...,...,...,...,...,...,...
2958,word dagelijks gemotiveerd noodzaak kleine gez...,1,0,0,1,1
2959,Mijn zoon grootste deel leven gewoon roekeloze...,1,1,0,0,0
2960,Mijn kinderen kleinkinderen houden elke dag ge...,1,0,1,1,0
2961,Mijn grootste drijfveer geld verdienen zodat p...,0,0,0,0,0


In [87]:
nl = nl.rename(columns={'TEXT_NL': 'TEXT'})
concat_df = pd.concat([eng, nl], ignore_index=True)

In [95]:
corpus_a = concat_df['TEXT'].tolist()
y = concat_df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']]

In [96]:
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(corpus_a, y, test_size=0.2, random_state=42)
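A possible concern with a row-level random split on `concat_df` is that an English essay can end up in the training set while its Dutch translation, which carries identical labels, ends up in the test set. Whether this affects the reported scores is not tested in this notebook. The sketch below shows one way to split by essay instead, assuming the appended frame holds the English rows first and the Dutch rows in the same order directly after them; the helper `split_by_essay` is hypothetical.

```python
import random


def split_by_essay(n_essays, test_size=0.2, seed=42):
    """Return (train_rows, test_rows) so both versions of an essay stay together.

    Assumes rows 0..n_essays-1 are English and rows n_essays..2*n_essays-1
    are the Dutch translations in the same order.
    """
    ids = list(range(n_essays))
    random.Random(seed).shuffle(ids)
    n_test = int(round(n_essays * test_size))
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    # Each essay id maps to two row positions: English (i) and Dutch (i + n_essays).
    train_rows = [i + offset for offset in (0, n_essays) for i in train_ids]
    test_rows = [i + offset for offset in (0, n_essays) for i in test_ids]
    return train_rows, test_rows


train_rows, test_rows = split_by_essay(10)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The returned positions could be used with `concat_df.iloc[...]` to build the training and test sets before vectorising.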

In [98]:
tfidf = TfidfVectorizer()  

# Fit the vocabulary on the training split, then apply it to the test split
X_train_a = tfidf.fit_transform(X_train_a)
X_test_a = tfidf.transform(X_test_a)

In [93]:
xgb_model = xgb.XGBRegressor(objective='binary:logistic')  # Logistic objective: predictions are probabilities in [0, 1]

# Define the hyperparameter grid to search
params_dict = {'learning_rate': [0.01, 0.1, 0.2],
               'gamma': [0.01, 0.1, 1.0],
               'lambda': [0.01, 0.1, 1.0],
               'alpha': [0.01, 0.1, 1.0],
              }

scoring = {
    'f1': make_scorer(f1_score, average='micro'),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', average='micro'),
}

search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params_dict,
    scoring=scoring,
    refit='f1', 
    cv=2,
    verbose=3)

search.fit(X_train_a, y_train_a)

search.best_params_

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END alpha=0.1, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.558) total time= 1.7min
[CV 2/2] END alpha=0.1, gamma=1.0, lambda=1.0, learning_rate=0.01; f1: (test=nan) roc_auc: (test=0.562) total time= 1.8min
[CV 1/2] END alpha=1.0, gamma=0.01, lambda=0.1, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.577) total time= 1.8min
[CV 2/2] END alpha=1.0, gamma=0.01, lambda=0.1, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.572) total time= 1.8min
[CV 1/2] END alpha=0.1, gamma=0.1, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.574) total time= 1.4min
[CV 2/2] END alpha=0.1, gamma=0.1, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.572) total time= 1.5min
[CV 1/2] END alpha=1.0, gamma=0.01, lambda=0.01, learning_rate=0.2; f1: (test=nan) roc_auc: (test=0.573) total time= 1.9min
[CV 2/2] END alpha=1.0, gamma=0.01, lambda=0.01, learning_rate=0.2; f1: (test

{'learning_rate': 0.01, 'lambda': 1.0, 'gamma': 1.0, 'alpha': 0.1}

In [94]:
opt_param_a = {'learning_rate': 0.01, 'lambda': 1.0, 'gamma': 1.0, 'alpha': 0.1}

In [99]:
metrics_a = XGB_model(X_train_a, X_test_a, y_train_a, y_test_a, opt_param_a)

F1 Score: 0.5581787521079258
AUC/ROC: 0.580366337475164


# Conclusion

We investigated both cross-lingual and multi-lingual models. The performance metrics for the cross-lingual models are shown in the table below. The results of part 2 are included so that the monolingual model can be compared with the cross-lingual and multi-lingual models.


| Model | F1 score | ROC AUC |
| :----------: | :--: | :--: |
| Trained and tested on English | 0.537605 | 0.561213 |
| Trained on English, tested on Dutch | 0.504421 | 0.525026 |
| Trained on Dutch, tested on English | 0.545595 | 0.563599 |

Besides the cross-lingual models, we also investigated the potential of multi-lingual models. The performance metrics of these models can be found below:


| Approach | F1 score | ROC AUC |
| :----------: | :--: | :--: |
| Concatenating | 0.574873 | 0.599115 |
| Appending | 0.558178 | 0.580366 |



For the cross-lingual models, we trained the XGBoost model on English text and validated it on Dutch text. This yielded an F1 score of only 0.504. This score is low, indicating that a cross-lingual model with English as the base is not very good at predicting when given Dutch text.

Vice versa, we also trained a Dutch model and validated it on English data, which yielded an F1 score of 0.545. Again, this score is rather low, indicating that a Dutch model is also not very good at predicting when given English text.
It is remarkable, however, that the Dutch-English model outperforms the English-Dutch model, although the difference is small (0.545 versus 0.504). The model may find some characteristics in the Dutch text that it can apply to English text better than vice versa. As presented in the data analysis part, the average sentence length for Dutch was longer than for English. This might imply that the model simply gets more data when it is trained on the Dutch text as opposed to the English text.

All in all, the capabilities of XGBoost as a cross-lingual model are very limited. Its performance is only slightly better than random guessing (F1 just above 0.5). Document size might, however, positively affect the performance. In contrast, the Dutch-trained model tested on English (0.545) scores slightly higher than the model trained and tested on English only (0.537), whereas the English-trained model tested on Dutch (0.504) scores lower.

For the multilingual approaches, we first enlarged the texts by concatenating the English and Dutch versions of each text. This model yielded an F1 of 0.574. Again, this is rather low, indicating that XGBoost cannot predict very well when it receives multiple versions of the same text at once.

With the approach of appending the Dutch texts to the data as extra rows, the model yielded an F1 of 0.558. This indicates that this approach does not predict very well either.

Comparing the two, we can see that XGBoost benefits more from having more words per input than from having more documents. XGBoost thus seems better able to pick up certain aspects of the text if it receives both versions at once. This might imply that there is some similarity between the English and Dutch languages that the model can use to make better predictions. This seems plausible given the shared Germanic roots of Dutch and English.

Although the performance of both the cross-lingual and the multilingual approaches is mediocre, XGBoost shows more potential as a multilingual model than as a cross-lingual model.

## Comparison to part 2 
All in all, inputting multiple languages at once does not make XGBoost a great predictor, as results are only slightly better than random guessing (F1 just above 0.5). It again highlights the inherently difficult nature of predicting personality traits from text. Compared with the findings of part 2, however, we do see an increase: both multilingual models (0.574 and 0.558) and the Dutch-trained cross-lingual model (0.545) score above the monolingual English model (0.537), and only the English-trained model tested on Dutch (0.504) scores lower. XGBoost performs best when multiple languages are given at once as input. This suggests that, despite the difficulties of text prediction, the models can be improved by adding Dutch and English texts together. It may also reflect commonalities between the languages, due to their similar roots, that give the model extra information.



## Comparison between DL and ML approaches
Comparing the machine learning approach to the deep learning approach again reveals some interesting findings. Firstly, regarding cross-lingual modelling, the monolingual BERT as a deep learner and XGBoost display similar performance when they are trained on English texts and validated on Dutch texts. Both perform only slightly better than random guessing, although BERT always performs a little better than XGBoost. This is interesting because the deep learning approaches outperformed the machine learners in part 2 of the assignment, when they were trained and tested on English texts only; in the cross-lingual setting there is hardly any difference between deep learning and machine learning methods.

More interestingly, both models show an increase in performance when they are trained on Dutch texts and validated on English texts. This suggests that there are linguistic nuances of the Dutch language that can be applied to the English language, but not vice versa. Both approaches appear to exploit this, hence the higher performance when trained on the Dutch language. The use of BERT and XGBoost for cross-lingual modelling is nevertheless debatable, as both show performance similar to random guessing.

The introduction of mBERT as an alternative deep learning method gives a very different picture: it shows superior performance to both BERT and XGBoost as cross-lingual models. The multilingual nature and specific architecture of mBERT clearly give it an advantage over the machine learning approach. As such, deep learners are only able to outperform machine learners when they have a compatible architecture. Otherwise, machine learners and deep learners show comparable performance, which indicates the increased complexity of having multiple languages instead of one.

Regarding multilingual modelling, the machine learner and deep learner again show similar performance, with the deep learner performing only slightly better than the machine learner. Most noticeable here is that the performance of the machine learners increased compared to part 2, whereas the performance of the deep learners decreased compared to part 2. The inclusion of extra languages is thus more beneficial for the machine learners than for the deep learners. Nevertheless, deep learners still outperform machine learners, showing that their architectures are more suitable for text classification than standard machine learners.
