<a href="https://colab.research.google.com/github/oamerl/machine-learning-projects/blob/main/Machine-Learning/reddit-fake-post-detection/Reddit_Fake_Post_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Experiment Summary**

In all models we use TF-IDF as the vectorizer and a separate, predefined validation set for validation. We compare two text preprocessing techniques. We start with stemming and try three models on it: two logistic regression models (one with word n-grams and one with character n-grams) and one XGBoost model. For these three models we use random search to tune the hyperparameters of both the vectorizer and the models themselves.
We then re-process the data using lemmatization to see whether performance differs from the stemming case, and assess it on two models: a logistic regression with hyperparameter tuning, and an XGBoost model that reuses the previously found hyperparameters to save time.


# **2.** Data and Libraries Importing 📋

Importing needed libraries


In [None]:
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
import holoviews as hv
import nltk
from bokeh.io import output_notebook
output_notebook()

from pathlib import Path

# some settings for pandas and hvplot

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Importing the data

In [None]:
train_data = pd.read_csv("/content/xy_train.csv", sep=",", na_values=[""]) # reading the training dataset file
test_data = pd.read_csv("/content/x_test.csv", sep=",", na_values=[""]) # reading the testing dataset file

In [None]:
train_data

Unnamed: 0,id,text,label
0,265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
1,284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
2,207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
3,551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
4,8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0
...,...,...,...
59995,70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0
59996,189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1
59997,93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0
59998,140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0


# **3.** Data Cleaning and Pre-processing 🧼

We start with two routine steps: keeping the record id as the dataframe index, and checking the range of values of the target label.

In [None]:
train_data.set_index('id', inplace = True) # use the "id" column as the index of the training dataframe so that we don't lose it
test_data.set_index('id', inplace = True) # use the "id" column as the index of the testing dataframe so that we don't lose it

Checking target label

In [None]:
print(train_data["label"].value_counts()) # checking the values count of the target label

0    32172
1    27596
2      232
Name: label, dtype: int64


We notice a label value of 2, which is inconsistent with the problem definition of classifying a post as fake or not, so we drop the records (titles) that carry this label.

In [None]:
train_data[train_data.label == 2].index # checking that we are getting the index correctly

Int64Index([540454, 342238, 552146, 549212, 398378, 337016, 357384, 513998,
            448641, 378658,
            ...
            398826,  60959, 186950, 420136,  89505, 219497,  54937, 505566,
            288391,  99749],
           dtype='int64', name='id', length=232)

In [None]:
train_data.drop(train_data[train_data.label == 2].index, inplace=True) # dropping records of label = 2


In [None]:
print(train_data["label"].value_counts()) # checking the values count of the target label after dropping inconsistent values

0    32172
1    27596
Name: label, dtype: int64


## **3.1-** Text cleaning function definition

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


def clean_text(text, root_form = "stemming", for_embedding=False):
    """ steps:
        - remove any html tags (< /br> often found)
        - keep only ASCII letters (A-Z, a-z) and spaces; digits and all other characters are dropped
        - remove single letter chars
        - convert all whitespaces (tabs etc.) to single wspace
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords and punctuation, then stem or lemmatize (see root_form)
    """
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-z ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-z]\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        RE_ASCII = re.compile(r"[^A-Za-z,.!? ]", re.IGNORECASE)
        RE_SINGLECHAR = re.compile(r"\b[A-Za-z,.!?]\b", re.IGNORECASE)

    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    word_tokens = word_tokenize(text) # returns a tokenized copy of text as a list
    words_tokens_lower = [word.lower() for word in word_tokens]

    if for_embedding:
        # no stemming, lowering and punctuation / stop words removal
        words_filtered = word_tokens
    else:
        if root_form == "stemming":
            words_filtered = [stemmer.stem(word) for word in words_tokens_lower if word not in stop_words]
        elif root_form == "lemmatization":
            words_filtered = [lemmatizer.lemmatize(word) for word in words_tokens_lower if word not in stop_words]
        else:
            # fail early instead of hitting an undefined words_filtered below
            raise ValueError('root_form must be "stemming" or "lemmatization"')

    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
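
The regex stage of `clean_text` can be followed step by step on a made-up title. This is only a sketch using the standard library: the sample string, the name `RE_NON_LETTER` and the final `.strip()` are assumptions for illustration, and tokenization, stop-word removal and stemming are left out.

```python
import re

# Same patterns as in clean_text (RE_NON_LETTER is a hypothetical rename of RE_ASCII)
RE_TAGS = re.compile(r"<[^>]+>")
RE_NON_LETTER = re.compile(r"[^A-Za-z ]")
RE_SINGLECHAR = re.compile(r"\b[A-Za-z]\b")
RE_WSPACE = re.compile(r"\s+")

sample = "<br>Is it a fake? 10 reasons!"  # made-up title

step1 = RE_TAGS.sub(" ", sample)            # drop html tags
step2 = RE_NON_LETTER.sub(" ", step1)       # drop digits and punctuation
step3 = RE_SINGLECHAR.sub(" ", step2)       # drop single-letter words such as "a"
step4 = RE_WSPACE.sub(" ", step3).strip()   # collapse runs of whitespace

print(step4)
```

Each substitution inserts a space rather than deleting characters, so neighbouring words are never glued together; the last step then collapses the extra spaces.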


Checking if *Stemming* is working correctly

In [None]:
clean_text("éˆ¥æ·²e are common people doing an exceptional job / Heroes in Colombia do existéˆ¥?[Modern]")

'common peopl except job hero colombia exist modern'

Checking if *Lemmatization* is working correctly

In [None]:
clean_text("éˆ¥æ·²e are common people doing an exceptional job / Heroes in Colombia do existéˆ¥?[Modern]", root_form = "lemmatization")

'common people exceptional job hero colombia exist modern'

Making 2 copies of the original dataframe before applying any changes, one copy for stemming and one for lemmatization preprocessing

In [None]:
train_data_clean = train_data.copy() # training data to be used in stemming case
train_data_clean_lemm = train_data.copy() # training data to be used in lemmatization case

In [None]:
test_data_lemm = test_data.copy() # testing data to be used in lemmatization case
# original test_data will be used in case of stemming

In [None]:
train_data_clean.head(5)

Unnamed: 0_level_0,text,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0


## **3.2-** Stemming preprocessing

As discussed in the "Experiment Summary" in the introduction, we first use stemming to reduce inflected words to their root form and build three models on the result. Later we reprocess the data using lemmatization and build another two models.

In [None]:
%%time
# Training clean titles (stemming used)
train_data_clean["comment_clean"] = train_data_clean["text"].map(lambda x: clean_text(x, root_form = "stemming", for_embedding=False) if isinstance(x, str) else x)

CPU times: user 25 s, sys: 98 ms, total: 25.1 s
Wall time: 29.6 s


In [None]:
# Testing clean titles (stemming used)
test_data["comment_clean"] = test_data["text"].map(lambda x: clean_text(x, root_form = "stemming", for_embedding=False) if isinstance(x, str) else x)

In [None]:
train_data_clean.head(5)

Unnamed: 0_level_0,text,label,comment_clean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0,goodyear releas kit allow ps brought heel https youtub com watch alxulk cg zwillc fish midatlant...
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0,happi birthday bob barker price right host like rememb man said ave pet spay neuter fuckincorpor...
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0,obama nation innoc cop unarm young black men die magic johnson jimbobshawobodob olymp athlet sho...
...,...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0,finish sniper simo yh invas finland ussr color
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1,nigerian princ scam took kansa man year later get back
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0,safe smoke marijuana pregnanc surpris answer
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0,julius caesar upon realiz everyon room knife except bc


In [None]:
# Drop rows whose cleaned title is empty or is the literal string "null"
train_data_clean = train_data_clean[(train_data_clean["comment_clean"] != "") & (train_data_clean["comment_clean"] != "null")]

In [None]:
train_data_clean.shape

(59758, 3)

Dropping the uncleaned title column

In [None]:
# Remove the 'text' column (the uncleaned title) from the training set
train_data_clean = train_data_clean.drop(['text'], axis=1)

# Remove the 'text' column (the uncleaned title) from the testing set
test_data = test_data.drop(['text'], axis=1)

In [None]:
train_data_clean.head(5)

Unnamed: 0_level_0,label,comment_clean
id,Unnamed: 1_level_1,Unnamed: 2_level_1
265723,0,group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
284269,0,british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...


In [None]:
train_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59758 entries, 265723 to 34509
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   label          59758 non-null  int64 
 1   comment_clean  59758 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.4+ MB


### We considered filtering the records by title length but did not pursue the idea; the title lengths are kept below for later reference if needed.

In [None]:
train_data_clean["comment_clean"].str.len()

id
265723    12353
284269    11837
207715    11024
551106     9480
8584       9453
          ...  
70046        46
189377       54
93486        44
140950       54
34509        61
Name: comment_clean, Length: 59758, dtype: int64

In [None]:
titles_length = [len(title) for title in train_data_clean['comment_clean']]

In [None]:
titles_length

In [None]:
print("maximum title length:", max(titles_length))
print("minimum title length:", min(titles_length))
average = sum(titles_length)/len(titles_length)
print("average title length:", average)

maximum title length: 12353
minimum title length: 2
average title length: 77.80586699688745


In [None]:
#train_data_clean["comment_clean"] = train_data_clean.loc[train_data_clean["comment_clean"].str.len() > 20, "comment_clean"]
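
If the length filter were ever applied, one way to do it would be a boolean mask over whole rows. The sketch below runs on a small made-up frame; the rows and the name `min_len` are assumptions for illustration, with the threshold of 20 taken from the commented-out line. Assigning a filtered Series back to the column, as that line does, would leave `NaN` in the short rows instead of removing them, because pandas aligns on the index.

```python
import pandas as pd

# Toy stand-in for train_data_clean (hypothetical rows, same column names)
toy = pd.DataFrame(
    {
        "label": [0, 1, 0],
        "comment_clean": [
            "short titl",
            "nigerian princ scam took kansa man",
            "julius caesar upon realiz everyon room",
        ],
    },
    index=pd.Index([11, 22, 33], name="id"),
)

min_len = 20  # assumed threshold, taken from the commented-out line

# Keep only the rows whose cleaned title is longer than min_len characters
mask = toy["comment_clean"].str.len() > min_len
toy_filtered = toy[mask]

print(toy_filtered.index.tolist())
```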


## **3.3-** Descriptive analysis

In [None]:
from bokeh.models import NumeralTickFormatter
# Word frequency of the most common words
word_freq = pd.Series(" ".join(train_data_clean["comment_clean"]).split()).value_counts()
word_freq[1:10] # ranks 2 to 10; the slice start of 1 skips the single most frequent token

one      3285
like     3128
new      2998
look     2847
color    2737
man      2729
get      2602
trump    2578
say      2347
dtype: int64

In [None]:
# list most uncommon words
word_freq[-10:]

angriff     1
delusion    1
wane        1
undament    1
miku        1
hatsun      1
nfler       1
hicock      1
mccall      1
wahr        1
dtype: int64

In [None]:
# Distribution of labels (class balance)
train_data_clean["label"].value_counts(normalize=True)

0    0.538221
1    0.461779
Name: label, dtype: float64

## **3.4-** Division of training set into training and validation and decoupling the features from the labels

In [None]:
from sklearn.model_selection import PredefinedSplit

# split the original training set to a train and a validation set - 25% of data as validation
train, valid = train_test_split(train_data_clean, stratify=train_data_clean['label'], random_state=1, test_size=0.25, shuffle=True)

X_train = train["comment_clean"]
y_train = train["label"]

X_valid = valid["comment_clean"]
y_valid = valid["label"]

print(X_train.shape)
print(X_valid.shape)

(44818,)
(14940,)


Validation data indices

In [None]:
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in train_data_clean.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
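
To make the `-1` / `0` convention concrete, here is a minimal sketch on five made-up samples (the `test_fold` values and the name `toy_pds` are assumptions for illustration). Entries marked `-1` are never placed in a validation fold, while entries marked `0` form validation fold number 0, so the splitter yields exactly one train/validation split.

```python
from sklearn.model_selection import PredefinedSplit

# -1 = always in the training part, 0 = member of validation fold 0
test_fold = [-1, -1, 0, 0, -1]
toy_pds = PredefinedSplit(test_fold=test_fold)

splits = list(toy_pds.split())
train_idx, valid_idx = splits[0]

print(toy_pds.get_n_splits())   # number of train/validation splits
print(train_idx, valid_idx)     # positional indices of each part
```

This is why the search logs later in the notebook report "Fitting 1 folds" for every candidate.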

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
y_tr = train_data_clean["label"] # target label series
X_tr = train_data_clean["comment_clean"] # training features df

In [None]:
X_tr.head(2)

id
265723    group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
284269    british prime minist theresa may nerv attack former russian spi govern conclud high like russia ...
Name: comment_clean, dtype: object

# **4.** Models Training and Evaluation 📈

As a general approach in this notebook, we limit the hyperparameter search space to reduce tuning time: we first optimize the feature creation (vectorization) step by trying different values for ngram_range, max_df and min_df, then fix those values and tune again, this time searching for the model's best hyperparameters.
This may lead to sub-optimal values, since tuning all hyperparameters together would be more thorough, but we accept this sub-optimality to save time.
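
The saving can be estimated by counting candidates. The sketch below uses the value grids that appear in the logistic regression cells of this section and counts combinations with `ParameterGrid`; the variable names are assumptions, and single-valued entries such as the penalty and max_iter are omitted because they do not change the count.

```python
from sklearn.model_selection import ParameterGrid

vectorizer_space = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4)],
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30, 50],
}
model_space = {
    "logreg__C": [0.01, 0.1, 1, 10],
    "logreg__solver": ["newton-cholesky", "saga", "lbfgs", "liblinear"],
}

n_vec = len(ParameterGrid(vectorizer_space))                        # stage 1 candidates
n_model = len(ParameterGrid(model_space))                           # stage 2 candidates
n_joint = len(ParameterGrid({**vectorizer_space, **model_space}))   # joint search

print(n_vec, n_model, n_vec + n_model, n_joint)
```

Searching the two stages one after the other covers at most 75 + 16 = 91 candidates, against 1200 for the joint space, at the cost of ignoring interactions between vectorizer and model settings.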

## **4.1-** Case 1: Models trained on data preprocessed using ***Stemming***

### Model 1 Logistic Regression with TF-IDF vectorization (analyzer=word) and random search as hyperparameter tuner

Optimizing the feature creation (vectorization) step by trying different values for ngram_range, max_df and min_df, where the last two parameters set an upper and a lower limit on the document frequency of the terms kept in the vocabulary.

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a full tunable pipeline
# the steps are named so that we can set their hyperparameters
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("logreg", LogisticRegression(n_jobs=-1))]
)

# define parameter space to test
# in this first stage only the vectorizer hyperparameters are searched
params = {
    # vectorizer hyperparameters (3*5*5 = 75 combinations)
    "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30, 50]
}

# logistic regression random search instance
pipe_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                              params, # pipeline hyperparameters
                              cv=pds, # predefined split
                              scoring="roc_auc", # metric used to score each candidate on the validation data
                              n_iter=37, # number of trials (hyperparameter combinations to try)
                              n_jobs=-1, # number of parallel jobs; -1 means using all processors
                              verbose=1,)

# fit the search; with the default refit=True the best candidate is then refit on all of X_tr
# we pass the full X_tr here; the randomized search
# uses our predefined split internally to determine
# which samples belong to the validation set
pipe_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_tfidf.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

Fitting 1 folds for each of 37 candidates, totalling 37 fits
CPU times: user 6.48 s, sys: 413 ms, total: 6.89 s
Wall time: 2min 8s


In [None]:
print('best score {}'.format(pipe_clf.best_score_)) # getting the best validation AUC score
print('best hyperparameters {}'.format(pipe_clf.best_params_)) # getting the optimal hyperparameters values found

best score 0.869891854975242
best hyperparameters {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 5, 'tfidf__max_df': 0.8}


Now, using these best parameters for TF-IDF, we can search for optimal parameters for the LogisticRegression classifier:

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a full tunable pipeline
# the steps are named so that we can set their hyperparameters
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("logreg", LogisticRegression(n_jobs=-1))]
)

# define parameter space to test
# vectorizer values are fixed to the best ones found above; only the model is searched
params = {
    # vectorizer hyperparameters
    "tfidf__ngram_range": [(1, 3)], # pipe_clf.best_params_["tfidf__ngram_range"]
    "tfidf__max_df": [0.8], # pipe_clf.best_params_["tfidf__max_df"]
    "tfidf__min_df": [5], # pipe_clf.best_params_["tfidf__min_df"]
    # model hyperparameters
    'logreg__penalty': ["l2"],
    'logreg__C': [0.01, 0.1, 1,10],
    'logreg__max_iter': [10000],
    'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]
}

# logistic regression random search instance
pipe_logreg_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                                    params, # pipeline hyperparameters
                                    cv=pds, # predefined split
                                    scoring="roc_auc", # metric used to score each candidate on the validation data
                                    n_iter=10, # number of trials (hyperparameter combinations to try)
                                    n_jobs=-1, # number of parallel jobs; -1 means using all processors
                                    verbose=1,)

# fit the search; with the default refit=True the best candidate is then refit on all of X_tr
# we pass the full X_tr here; the randomized search uses our predefined split internally to determine
# which samples belong to the validation set
pipe_logreg_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_logreg.pck", "wb") as f:
    pickle.dump(pipe_logreg_clf, f)

Fitting 1 folds for each of 10 candidates, totalling 10 fits
CPU times: user 8.03 s, sys: 776 ms, total: 8.8 s
Wall time: 9min 51s


In [None]:
print('best score {}'.format(pipe_logreg_clf.best_score_)) # getting the best validation AUC score

best_params_logreg = pipe_logreg_clf.best_params_
print('best hyperparameters {}'.format(best_params_logreg)) # getting the optimal hyperparameters values found

best score 0.8698897639379309
best hyperparameters {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 5, 'tfidf__max_df': 0.8, 'logreg__solver': 'lbfgs', 'logreg__penalty': 'l2', 'logreg__max_iter': 10000, 'logreg__C': 1}


Training on whole dataset

In [None]:
pipe.set_params(**best_params_logreg).fit(X_tr, y_tr)
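
Since the search objects are created with the default `refit=True`, the fitted search already holds a `best_estimator_` that was refit on everything passed to `fit`. The toy sketch below (made-up titles and labels; `GridSearchCV` stands in for `RandomizedSearchCV`, as both share this refit behaviour) checks that a manual refit with the best parameters gives the same probabilities.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline

# Made-up data: first four rows are training, last four are validation
X_toy = ["fake news stori", "real report stori", "fake claim viral", "real fact check",
         "fake rumor post", "real news report", "fake viral claim", "real check fact"]
y_toy = [1, 0, 1, 0, 1, 0, 1, 0]
toy_split = PredefinedSplit(test_fold=[-1, -1, -1, -1, 0, 0, 0, 0])

toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("logreg", LogisticRegression())])
search = GridSearchCV(toy_pipe, {"logreg__C": [0.1, 1, 10]}, cv=toy_split, scoring="roc_auc")
search.fit(X_toy, y_toy)

# Manual refit on all rows, mirroring pipe.set_params(**best_params).fit(X_tr, y_tr)
manual = clone(toy_pipe).set_params(**search.best_params_).fit(X_toy, y_toy)

same = np.allclose(search.best_estimator_.predict_proba(X_toy),
                   manual.predict_proba(X_toy))
print(same)
```

Either route can therefore be used for the test-set predictions; the explicit refit in the notebook simply makes the step visible.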

Testing (Predictions of test set)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data.index
submission['label'] = pipe.predict_proba(test_data['comment_clean'])[:,1]
submission.to_csv('submission.csv', index=False)

Observations and trial summary

| **Aspect**                                                | **Comment**                                                                                                                                                                                |
|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _1- Model type_                                           | Logistic Regression                                                                                                                                                                        |
| _2- Root-form preprocessing_                              | Stemming                                                                                                                                                                                   |
| _3- Vectorizer_                                           | TF-IDF (word n-gram)                                                                                                                                                                       |
| _4- Hyperparameters tuner_                                | Randomized Search (using predefined validation set)                                                                                                                                        |
| _5- Vectorizer hyperparameters space_                     | "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),<br>"tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],<br>"tfidf__min_df": [5, 10, 20, 30, 50]                                                       |
| _6- Model hyperparameters space_                          | 'logreg__penalty': ["l2"],<br>'logreg__C': [0.01, 0.1, 1,10],<br>'logreg__max_iter': [10000],<br>'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]                       |
| _7- Optimal hyperparameters_                             | 'tfidf__ngram_range': (1, 3),<br>'tfidf__min_df': 5,<br>'tfidf__max_df': 0.8,<br>'logreg__solver': 'lbfgs', <br>'logreg__penalty': 'l2', <br>'logreg__max_iter': 10000, <br>'logreg__C': 1 |
| _8- AUC score predefined validation set_                  | 0.8699                                                                                                                                                                                     |
| _9- AUC score on kaggle test set (public)_                | 0.8376                                                                                                                                                                                     |
| _10- Observed performance and thoughts on it_             | We started with a logistic regression model because it usually fits most problems well without being complex.<br> Logistic regression performs well here and does not appear to be overfitting, but we think the score could be improved further <br> by using a hyperparameter tuner other than Randomized Search, such as Bayesian Search or Grid Search, to obtain better <br>hyperparameters, since other values may fit better given that the model already has acceptable metrics.                                                                                                                                                                                         |
| _11- Reason for changes (if any) and plan for next trial_ | Next we keep the logistic regression model but try character-level n-grams, although we expect lower performance.                                                                                                                                                                                       |

### Model 2 Logistic Regression with TF-IDF vectorization (analyzer is character-level) and random search as hyperparameter tuner

Optimizing the feature creation (vectorization) step by trying different values for ngram_range, max_df and min_df, where the last two parameters set an upper and a lower limit on the document frequency of the character n-grams kept in the vocabulary.

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a full tunable pipeline
# the steps are named so that we can set their hyperparameters
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="char")),
                 ("logreg", LogisticRegression(n_jobs=-1))]
)

# define parameter space to test
# in this first stage only the vectorizer hyperparameters are searched
params = {
    # vectorizer hyperparameters (3*5*5 = 75 combinations)
    "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30, 50]
}

# logistic regression random search instance
pipe_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                              params, # pipeline hyperparameters
                              cv=pds, # predefined split
                              scoring="roc_auc", # metric used to score each candidate on the validation data
                              n_iter=37, # number of trials (hyperparameter combinations to try)
                              n_jobs=-1, # number of parallel jobs; -1 means using all processors
                              verbose=1,)

# fit the search; with the default refit=True the best candidate is then refit on all of X_tr
# we pass the full X_tr here; the randomized search
# uses our predefined split internally to determine
# which samples belong to the validation set
pipe_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_tfidf.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

Fitting 1 folds for each of 37 candidates, totalling 37 fits
CPU times: user 14.1 s, sys: 1.31 s, total: 15.4 s
Wall time: 5min 52s


In [None]:
print('best score {}'.format(pipe_clf.best_score_)) # getting the best validation AUC score
print('best hyperparameters {}'.format(pipe_clf.best_params_)) # getting the optimal hyperparameters values found

best score 0.8511221452586296
best hyperparameters {'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.7}


Now, using these best parameters for TF-IDF, we can search for optimal parameters for the LogisticRegression classifier:

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a full tunable pipeline
# the steps are named so that we can set their hyperparameters
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="char")),
                 ("logreg", LogisticRegression(n_jobs=2))]
)

# define parameter space to test
# vectorizer values are fixed to the best ones found above; only the model is searched
params = {
    # vectorizer hyperparameters
    "tfidf__ngram_range": [(1, 4)],
    "tfidf__max_df": [0.7],
    "tfidf__min_df": [5],
    # model hyperparameters
    'logreg__penalty': ["l2"],
    'logreg__C': [0.01, 0.1, 1, 10],
    'logreg__max_iter': [10000],
    'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]
}

# logistic regression random search instance
pipe_logreg_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                                    params, # pipeline hyperparameters
                                    cv=pds, # predefined split
                                    scoring="roc_auc", # metric used to score each candidate on the validation data
                                    n_iter=10, # number of trials (hyperparameter combinations to try)
                                    verbose=1,)

# fit the search; with the default refit=True the best candidate is then refit on all of X_tr
# we pass the full X_tr here; the randomized search uses our predefined split internally to determine
# which samples belong to the validation set
pipe_logreg_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_logreg.pck", "wb") as f:
    pickle.dump(pipe_logreg_clf, f)

Fitting 1 folds for each of 10 candidates, totalling 10 fits


2 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_logistic.py", line 1291, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
  File "/usr/local/lib/python3.9/dist-packages/sklearn/utils/parallel.py", line 63, in __

CPU times: user 2min 24s, sys: 6.79 s, total: 2min 30s
Wall time: 4min 9s
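
When some fits fail, their score is recorded as NaN, so the failing candidates can be located in `cv_results_`. The sketch below runs on made-up numeric data with `GridSearchCV` standing in for the randomized search; the deliberately invalid `C=-1.0` triggers a failed fit purely for illustration and is not claimed to be the cause of the failures reported above.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Made-up data: 10 training rows followed by 10 validation rows
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(20, 3))
y_toy = np.array([0, 1] * 10)
toy_split = PredefinedSplit(test_fold=[-1] * 10 + [0] * 10)

# C=-1.0 is invalid on purpose, so that one candidate fails to fit
search = GridSearchCV(LogisticRegression(), {"C": [-1.0, 0.1, 1.0]},
                      cv=toy_split, scoring="roc_auc", refit=False)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the failed-fit warning in this demo
    search.fit(X_toy, y_toy)

# Failed candidates are the rows whose mean validation score is NaN
results = pd.DataFrame(search.cv_results_)
failed = results[results["mean_test_score"].isna()]
print(failed["params"].tolist())
```

Applied to the run above, the same filter on `pipe_logreg_clf.cv_results_` would show which two hyperparameter combinations failed, and passing `error_score='raise'` to the search would surface the full exception.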


In [None]:
print('best score {}'.format(pipe_logreg_clf.best_score_)) # getting the best validation AUC score

best_params_logreg = pipe_logreg_clf.best_params_
print('best hyperparameters {}'.format(best_params_logreg)) # getting the optimal hyperparameters values found

best score 0.8512412262282633
best hyperparameters {'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.7, 'logreg__solver': 'saga', 'logreg__penalty': 'l2', 'logreg__max_iter': 10000, 'logreg__C': 10}


Training on whole dataset

In [None]:
pipe.set_params(**best_params_logreg).fit(X_tr, y_tr)

Testing (Predictions of test set)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data.index
submission['label'] = pipe.predict_proba(test_data['comment_clean'])[:,1]
submission.to_csv('submission.csv', index=False)

Observations and trial summary

| **Aspect**                                                | **Comment**                                                                                                                                                                                |
|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _1- Model type_                                           | Logistic Regression                                                                                                                                                                        |
| _2- Root-form preprocessing_                              | Stemming                                                                                                                                                                                   |
| _3- Vectorizer_                                           | TF-IDF (**character** n-gram)                                                                                                                                                              |
| _4- Hyperparameters tuner_                                | Randomized Search (using predefined validation set)                                                                                                                                        |
| _5- Vectorizer hyperparameters space_                     | "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),<br>"tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],<br>"tfidf__min_df": [5, 10, 20, 30, 50]                                                       |
| _6- Model hyperparameters space_                          | 'logreg__penalty': ["l2"],<br>'logreg__C': [0.01, 0.1, 1,10],<br>'logreg__max_iter': [10000],<br>'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]                       |
| _7- Optimal hyperparameters_                             | 'tfidf__ngram_range': (1, 4),<br>'tfidf__min_df': 5,<br>'tfidf__max_df': 0.7,<br>'logreg__solver': 'saga', <br>'logreg__penalty': 'l2', <br>'logreg__max_iter': 10000, <br>'logreg__C': 10 |
| _8- AUC score predefined validation set_                  | 0.8512                                                                                                                                                                                     |
| _9- AUC score on kaggle test set (public)_                | 0.7983                                                                                                                                                                                     |
| _10- Observed performance and thoughts on it_             | As expected, this model performs worse than the previous trial, which used word-level n-grams.<br>A likely reason is that a character n-gram carries little sentiment meaning on its own: the same character sequence can appear in both negative and positive sentences.<br>A word n-gram is more likely to be tied to a sentiment meaning and therefore to appear mostly in one class, so the model can learn the relation between the feature and the label more easily. |
| _11- Reason for changes (if any) and plan for next trial_ | Next we will try another algorithm, XGBoost, since it can act as a feature-selection method, which should help given the huge number of features that we have.<br>We will stick with word n-grams in all the following trials. |
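
To make the word-level versus character-level comparison concrete, here is a small pure-Python sketch of the two kinds of n-grams. The helper names `word_ngrams` and `char_ngrams` and the example words are assumptions for illustration; `TfidfVectorizer` has its own tokenization and whitespace handling, which this toy version does not reproduce.

```python
def word_ngrams(text, ngram_range):
    """All word n-grams of `text` for n in the inclusive range (lo, hi)."""
    tokens = text.split()
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, ngram_range):
    """All character n-grams of `text` for n in the inclusive range (lo, hi)."""
    lo, hi = ngram_range
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]


print(word_ngrams("not good news", (1, 2)))

# two unrelated words share most of their character n-grams
shared = set(char_ngrams("good", (2, 3))) & set(char_ngrams("mood", (2, 3)))
print(sorted(shared))
```

Unrelated words such as "good" and "mood" share the character n-grams "od", "oo" and "ood", which is one way to picture why a single character n-gram says less about the class than a word n-gram does.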

### Model 3 XGboosting with TF-IDF vectorization (analyzer=word) and random search as hyperparameter tuner

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a fully tunable pipeline
# the steps are named so their hyperparameters can be addressed as "<step>__<param>"
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("xgb", XGBClassifier())]
)

# define parameter space to test
# only the vectorizer is tuned here; the classifier keeps its default settings
params = {
    # vectorizer hyperparameters (3*5*5 = 75 combinations)
    "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30, 50]
}

# XGBoost random search instance
pipe_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                              params, # pipeline hyperparameters
                              cv=pds, # predefined split
                              scoring="roc_auc", # metric used to score the validation data
                              n_iter=37, # number of trials (hyperparameter combinations to try)
                              n_jobs=-1, # number of parallel jobs; -1 uses all processors
                              verbose=1,)

# fit the search; by default it then refits the pipeline on all of X_tr with the best hyperparameters found
# we still pass X_tr; the randomized search uses our predefined split internally to determine
# which samples belong to the validation set
pipe_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_tfidf.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

Fitting 1 folds for each of 37 candidates, totalling 37 fits
CPU times: user 1min 14s, sys: 1.24 s, total: 1min 15s
Wall time: 15min 28s


In [None]:
print('best score {}'.format(pipe_clf.best_score_)) # getting the best validation AUC score
print('best hyperparameters {}'.format(pipe_clf.best_params_)) # getting the optimal hyperparameters values found

best score 0.8291936442776718
best hyperparameters {'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.5}
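
As a rough sanity check on how much of the vectorizer search space the 37 trials above cover, the sketch below counts the combinations. It relies on the documented behaviour that, when every parameter is given as a list, the randomized search samples combinations without replacement; the variable names are illustrative.

```python
from math import prod

# the vectorizer search space used in the cell above
vectorizer_space = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4)],
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30, 50],
}
n_iter = 37

# size of the full grid and the fraction of it that is actually tried
n_combinations = prod(len(values) for values in vectorizer_space.values())
coverage = n_iter / n_combinations
print(n_combinations, round(coverage, 3))
```

With 75 combinations and 37 trials, roughly half of the grid is evaluated, so the best grid point has about a one-in-two chance of being among the candidates tried.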


Now, using these best parameters for TF-IDF, we can search for optimal hyperparameters for the XGBoost classifier:

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a fully tunable pipeline
# the steps are named so their hyperparameters can be addressed as "<step>__<param>"
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("xgb", XGBClassifier(n_jobs=2))]
)

# define parameter space to test
# hyperparameters search space of both the model and preprocessing
params = {
    # vectorizer hyperparameters (fixed to the best values found above)
    "tfidf__ngram_range": [(1, 4)],
    "tfidf__max_df": [0.5],
    "tfidf__min_df": [5],
    # model hyperparameters (2*3*2*2*3 = 72 combinations)
    # note: subsample and colsample_bytree are not used by the gblinear booster
    # (XGBoost warns about this during fitting)
    'xgb__learning_rate': [0.01, 0.1],
    'xgb__n_estimators': [100, 250, 500],
    'xgb__subsample': [0.5, 0.8],
    'xgb__colsample_bytree': [0.5, 0.8],
    'xgb__booster': ['gbtree','gblinear', 'dart']
}

# XGBoost random search instance
pipe_xgb_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                                    params, # pipeline hyperparameters
                                    cv=pds, # predefined split
                                    scoring="roc_auc", # metric used to score the validation data
                                    n_iter=36, # number of trials (hyperparameter combinations to try)
                                    verbose=1,)

# fit the search; by default it then refits the pipeline on all of X_tr with the best hyperparameters found
# we still pass X_tr; the randomized search uses our predefined split internally to determine
# which samples belong to the validation set
pipe_xgb_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_xgb.pck", "wb") as f:
    pickle.dump(pipe_xgb_clf, f)

Fitting 1 folds for each of 36 candidates, totalling 36 fits
Parameters: { "colsample_bytree", "subsample" } are not used.

CPU times: user 1h 57min 56s, sys: 22.7 s, total: 1h 58min 19s
Wall time: 1h 9min 7s


In [None]:
print('best score {}'.format(pipe_xgb_clf.best_score_)) # getting the best validation AUC score

best_params_xgb = pipe_xgb_clf.best_params_
print('best hyperparameters {}'.format(best_params_xgb)) # getting the optimal hyperparameters values found

best score 0.858952818609237
best hyperparameters {'xgb__subsample': 0.5, 'xgb__n_estimators': 100, 'xgb__learning_rate': 0.01, 'xgb__colsample_bytree': 0.8, 'xgb__booster': 'gblinear', 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.5}


Training on whole dataset

In [None]:
pipe.set_params(**best_params_xgb).fit(X_tr, y_tr)

Testing (Predictions of test set)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data.index
submission['label'] = pipe.predict_proba(test_data['comment_clean'])[:,1]
submission.to_csv('submission.csv', index=False)

Observations and trial summary

| **Aspect**                                                | **Comment**                                                                                                                                                                                                                        |
|-----------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _1- Model type_                                           | XGBoosting                                                                                                                                                                                                                         |
| _2- Root-form preprocessing_                              | Stemming                                                                                                                                                                                                                           |
| _3- Vectorizer_                                           | TF-IDF (word n-gram)                                                                                                                                                                                                               |
| _4- Hyperparameters tuner_                                | Randomized Search (using predefined validation set)                                                                                                                                                                                |
| _5- Vectorizer hyperparameters space_                     | "tfidf__ngram_range": ((1, 2), (1, 3),(1,4)),<br>"tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],<br>"tfidf__min_df": [5, 10, 20, 30, 50]                                                                                               |
| _6- Model hyperparameters space_                          | 'xgb__learning_rate': [0.01, 0.1],<br>'xgb__n_estimators': [100, 250, 500],<br>'xgb__subsample': [0.5, 0.8],<br>'xgb__colsample_bytree': [0.5, 0.8],<br>'xgb__booster': ['gbtree','gblinear', 'dart']                              |
| _7- Optimal hyperparameters_                              | 'xgb__subsample': 0.5,<br>'xgb__n_estimators': 100,<br>'xgb__learning_rate': 0.01,<br>'xgb__colsample_bytree': 0.8,<br>'xgb__booster': 'gblinear',<br>'tfidf__ngram_range': (1, 4),<br>'tfidf__min_df': 5,<br>'tfidf__max_df': 0.5 |
| _8- AUC score predefined validation set_                  | 0.8590                                                                                                                                                                                                                             |
| _9- AUC score on kaggle test set (public)_                | 0.8056                                                                                                                                                                                                                             |
| _10- Observed performance and thoughts on it_             | Surprisingly, the model performs worse than the first logistic regression trial,<br>which we did not expect, since XGBoost usually copes well with huge numbers of features and has been one of the best classifiers that I have tried.<br>I suspect the hyperparameters are responsible: we used Randomized Search, so it is quite possible that we missed the optimal values.<br>We chose Randomized Search in the first place for the sake of faster training time. |
| _11- Reason for changes (if any) and plan for next trial_ | Next we will try another preprocessing technique, lemmatization, which returns each word to its dictionary-based root.<br>We expect it to perform better than stemming, as we read that it is usually more accurate in cases where context is important,<br>so we will test it on the same classifiers to check this claim. |

## **4.2-** Case 2: Models trained on data preprocessed using ***Lemmatization***

### Pre-processing re-do for lemmatization case

In [None]:
from sklearn.model_selection import PredefinedSplit
from sklearn.metrics import roc_auc_score

In [None]:
# Clean titles for training and testing
train_data_clean_lemm["title_clean"] = train_data_clean_lemm["text"].map(lambda x: clean_text(x, root_form = "lemmatization", for_embedding=False) if isinstance(x, str) else x)
test_data_lemm["title_clean"] = test_data_lemm["text"].map(lambda x: clean_text(x, root_form = "lemmatization", for_embedding=False) if isinstance(x, str) else x)

# Drop rows whose cleaned title is empty or the literal string "null"
train_data_clean_lemm = train_data_clean_lemm[(train_data_clean_lemm["title_clean"] != "") & (train_data_clean_lemm["title_clean"] != "null")]

# Drop the raw 'text' column, which is no longer needed
train_data_clean_lemm = train_data_clean_lemm.drop(['text'], axis=1)
test_data_lemm = test_data_lemm.drop(['text'], axis=1)

# Checking the dataframe
print(train_data_clean_lemm.head(5))

# split the original training set to a train and a validation set - 25% of data as validation
train, valid = train_test_split(train_data_clean_lemm, stratify=train_data_clean_lemm['label'], random_state=1, test_size=0.25, shuffle=True)
X_train = train["title_clean"]
y_train = train["label"]
X_valid = valid["title_clean"]
y_valid = valid["label"]

print("X_train shape:", X_train.shape)
print("X_valid shape:", X_valid.shape)

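# Create a list where training rows are marked -1 (never used for validation) and validation rows are marked 0 (fold 0)
# the list is aligned with the full training dataframe (train_data_clean_lemm), from which X_tr is taken below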
split_index = [-1 if x in X_train.index else 0 for x in train_data_clean_lemm.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

y_tr = train_data_clean_lemm["label"] # target label series
X_tr = train_data_clean_lemm["title_clean"] # training features df

        label  \
id              
265723      0   
284269      0   
207715      0   
551106      0   
8584        0   

                                                                                                title_clean  
id                                                                                                           
265723  group friend began volunteer homeless shelter neighbor protested seeing another person also need...  
284269  british prime minister theresa may nerve attack former russian spy government concluded highly l...  
207715  goodyear released kit allows p brought heel http youtube com watch alxulk cg zwillc fishing mida...  
551106  happy birthday bob barker price right host like remembered man said ave pet spayed neutered fuck...  
8584    obama nation innocent cop unarmed young black men dying magic johnson jimbobshawobodob olympic a...  
X_train shape: (44818,)
X_valid shape: (14940,)
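
The `split_index` convention used above can be checked on a toy example. The sketch below uses a hypothetical six-row fold assignment (`toy_split_index`) to show that `PredefinedSplit` yields exactly one split, with the rows marked `-1` as training and the rows marked `0` as validation.

```python
from sklearn.model_selection import PredefinedSplit

# hypothetical fold assignment: -1 = always in training, 0 = validation fold 0
toy_split_index = [-1, 0, -1, -1, 0, -1]
toy_pds = PredefinedSplit(test_fold=toy_split_index)

# a single fold id (0) produces a single train/validation split
splits = list(toy_pds.split())
train_idx, valid_idx = splits[0]
print(toy_pds.get_n_splits(), train_idx, valid_idx)
```

This is why the search output reports "Fitting 1 folds": each candidate is trained once on the `-1` rows and scored once on the `0` rows.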


In [None]:
test_data_lemm.head(5)

| id | title_clean |
|----|-------------|
| 0  | stargazer |
| 1  | yeah |
| 2  | pd phoenix car thief get instruction youtube video |
| 3  | trump accuses iran one problem credibility |
| 4  | believer hezbollah |


### Model 4 Logistic Regression with TF-IDF vectorization (analyzer=word) and random search as hyperparameter tuner ⭐⭐⭐

We first optimize the feature creation (vectorization) step by trying different values for ngram_range, max_df and min_df, where the last two parameters set an upper and a lower limit on the document frequency of a term (the number or share of documents it appears in).
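
As an illustration of what min_df and max_df do, here is a pure-Python sketch of document-frequency filtering. It follows the convention in which an integer min_df is an absolute document count and a float max_df is a proportion of documents; the function `filter_vocabulary` and the toy documents are assumptions, not the vectorizer's actual implementation.

```python
from collections import Counter


def filter_vocabulary(docs, min_df, max_df):
    """Keep terms that occur in at least `min_df` documents (absolute count)
    and in at most a `max_df` share of all documents (proportion)."""
    n_docs = len(docs)
    # count each term once per document it appears in
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= max_df * n_docs)


toy_docs = [
    "news report fake",
    "news report real",
    "news story fake",
    "news rumor",
]
print(filter_vocabulary(toy_docs, min_df=2, max_df=0.75))
```

Here "news" is dropped because it appears in every document, and "real", "story" and "rumor" are dropped because they appear only once, leaving "fake" and "report".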

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a fully tunable pipeline
# the steps are named so their hyperparameters can be addressed as "<step>__<param>"
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("logreg", LogisticRegression(n_jobs=-1))]
)

# define parameter space to test
# only the vectorizer is tuned here; the classifier keeps its default settings
params = {
    # vectorizer hyperparameters (4*5*4 = 80 combinations)
    "tfidf__ngram_range": ((1, 2), (1, 3),(1,4), (1,5)),
    "tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],
    "tfidf__min_df": [5, 10, 20, 30]
}

# logistic regression random search instance
pipe_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                              params, # pipeline hyperparameters
                              cv=pds, # predefined split
                              scoring="roc_auc", # metric used to score the validation data
                              n_iter=37, # number of trials (hyperparameter combinations to try)
                              verbose=1,)

# fit the search; by default it then refits the pipeline on all of X_tr with the best hyperparameters found
# we still pass X_tr; the randomized search uses our predefined split internally to determine
# which samples belong to the validation set
pipe_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_tfidf_lemm.pck", "wb") as f:
    pickle.dump(pipe_clf, f)

In [None]:
print('best score {}'.format(pipe_clf.best_score_)) # getting the best validation AUC score
print('best hyperparameters {}'.format(pipe_clf.best_params_)) # getting the optimal hyperparameters values found

Now, using these best parameters for TF-IDF, we can search for optimal hyperparameters for the LogisticRegression classifier:

In [None]:
%%time
# feature creation and modelling in a single pipeline
# combine the vectorizer with the model as a fully tunable pipeline
# the steps are named so their hyperparameters can be addressed as "<step>__<param>"
pipe = Pipeline(
        steps = [("tfidf", TfidfVectorizer(analyzer="word")),
                 ("logreg", LogisticRegression(n_jobs=-1))]
)

# define parameter space to test (runtime about 35 min)
# hyperparameters search space of both the model and preprocessing
params = {
    # vectorizer hyperparameters (fixed to the best values found above)
    "tfidf__ngram_range": [(1, 4)],
    "tfidf__max_df": [0.9],
    "tfidf__min_df": [5],
    # model hyperparameters (4*4 = 16 combinations)
    'logreg__penalty': ["l2"],
    'logreg__C': [0.01, 0.1, 1,10],
    'logreg__max_iter': [10000],
    'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]
}

# logistic regression random search instance
pipe_logreg_clf = RandomizedSearchCV(pipe, # pipeline containing the vectorizer and model
                                    params, # pipeline hyperparameters
                                    cv=pds, # predefined split
                                    scoring="roc_auc", # metric used to score the validation data
                                    n_iter=10, # number of trials (hyperparameter combinations to try)
                                    verbose=1,)

# fit the search; by default it then refits the pipeline on all of X_tr with the best hyperparameters found
# we still pass X_tr; the randomized search
# uses our predefined split internally to determine
# which samples belong to the validation set
pipe_logreg_clf.fit(X_tr, y_tr)
with open("./pipeline_clf_logreg_lemm.pck", "wb") as f:
    pickle.dump(pipe_logreg_clf, f)

In [None]:
print('best score {}'.format(pipe_logreg_clf.best_score_)) # getting the best validation AUC score

best_params_logreg = pipe_logreg_clf.best_params_
print('best hyperparameters {}'.format(best_params_logreg)) # getting the optimal hyperparameters values found

Training on whole dataset

In [None]:
pipe.set_params(**best_params_logreg).fit(X_tr, y_tr)

Testing (Predictions of test set)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data_lemm.index
submission['label'] = pipe.predict_proba(test_data_lemm['title_clean'])[:,1]
submission.to_csv('submission_4.csv', index=False)

Observations and trial summary

***Unfortunately, the cell outputs for this trial were lost: I pressed the run button of this model's subsection by mistake while editing the header without being connected to a runtime. Luckily, I had already filled in the summary table below.***

| **Aspect**                                                | **Comment**                                                                                                                                                                                               |
|-----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _1- Model type_                                           | Logistic Regression                                                                                                                                                                                       |
| _2- Root-form preprocessing_                              | Lemmatization                                                                                                                                                                                             |
| _3- Vectorizer_                                           | TF-IDF (word n-gram)                                                                                                                                                                                      |
| _4- Hyperparameters tuner_                                | Randomized Search (using predefined validation set)                                                                                                                                                       |
| _5- Vectorizer hyperparameters space_                     | "tfidf__ngram_range": ((1, 2), (1, 3),(1,4),(1,5)), ---> added (1,5) to search space<br>"tfidf__max_df": [0.5, 0.6, 0.7, 0.8, 0.9],<br>"tfidf__min_df": [5, 10, 20, 30] ---> removed 50 from search space |
| _6- Model hyperparameters space_                          | 'logreg__penalty': ["l2"],<br>'logreg__C': [0.01, 0.1, 1,10],<br>'logreg__max_iter': [10000],<br>'logreg__solver': ["newton-cholesky", "saga", "lbfgs", "liblinear"]                                      |
| _7- Optimal hyperparameters_                              | 'tfidf__ngram_range': (1, 4),<br>'tfidf__min_df': 5,<br>'tfidf__max_df': 0.9,<br>'logreg__solver': 'newton-cholesky',<br>'logreg__penalty': 'l2',<br>'logreg__max_iter': 10000,<br>'logreg__C': 1         |
| _8- AUC score predefined validation set_                  | 0.8720                                                                                                                                                                                                    |
| _9- AUC score on kaggle test set (public)_                | 0.8404                                                                                                                                                                                                    |
| _10- Observed performance and thoughts on it_             | As claimed, lemmatization performs better than stemming: this model is the best performer on both the validation set<br>and Kaggle's test set. A likely reason is that lemmatization maps each word to its dictionary form based on its meaning, which preserves more context for the model,<br>whereas stemming simply cuts off word endings and may also suffer from overstemming or understemming. |
| _11- Reason for changes (if any) and plan for next trial_ | Next we will try XGBoost again on the lemmatized data; hopefully the score will improve as it did with logistic regression. |
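
The difference between the two root-form techniques can be pictured with a toy sketch. Both helpers below are illustrative assumptions: `naive_stem` is a crude suffix stripper, not the stemmer used in this notebook, and `TOY_LEMMAS` is a hypothetical three-word dictionary standing in for a real lemmatizer's lexicon.

```python
def naive_stem(word):
    """Toy suffix stripper (assumption: NOT the stemmer used in this notebook)."""
    for suffix in ("ies", "ing", "es", "s"):
        # strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# hypothetical mini-dictionary standing in for a real lemmatizer's lexicon
TOY_LEMMAS = {"studies": "study", "universities": "university", "better": "good"}


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ("studies", "universities", "better"):
    print(w, naive_stem(w), toy_lemmatize(w))
```

The stemmer produces truncated strings such as "stud" and "universit" and leaves "better" untouched, while the dictionary lookup returns real words, which is the behaviour the observation above attributes to lemmatization.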

### Model 5 XGboosting with TF-IDF vectorization (analyzer=word)

Best hyperparameters reused from the previous XGBoost trial (Model 3): {'xgb__subsample': 0.5,
'xgb__n_estimators': 100,
'xgb__learning_rate': 0.01,
'xgb__colsample_bytree': 0.8,
'xgb__booster': 'gblinear',
'tfidf__ngram_range': (1, 4),
'tfidf__min_df': 5,
'tfidf__max_df': 0.5}


In [None]:
vectorizer_tfidf = TfidfVectorizer(analyzer="word",
                                 max_df=0.5,
                                 min_df=5,
                                 ngram_range=(1, 4),
                                 norm = "l2")
vectorizer_tfidf.fit(X_tr)

In [None]:
# transform each sentence to numeric vector with tf-idf value as elements

X_tr_vec = vectorizer_tfidf.transform(X_tr)
X_test_vec = vectorizer_tfidf.transform(test_data_lemm['title_clean'])

X_tr_vec.get_shape()

(59758, 23374)

In [None]:
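# hyperparameters reused from the Model 3 search
# note: with booster='gblinear', subsample and colsample_bytree are not used (see the warning in the output below)
xgb_clf = XGBClassifier(subsample=0.5,
                            n_estimators=100,
                            learning_rate= 0.01,
                            colsample_bytree=0.8,
                            booster = 'gblinear',
                            n_jobs=-1)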

xgb_clf.fit(X_tr_vec,y_tr)

Parameters: { "colsample_bytree", "subsample" } are not used.



Testing (Predictions of test set)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data_lemm.index
submission['label'] = xgb_clf.predict_proba(X_test_vec)[:,1]
submission.to_csv('submission_5_2.csv', index=False)

Observations and trial summary

| **Aspect**                                                | **Comment**                                                                                                                                                                                                                        |
|-----------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _1- Model type_                                           | XGBoosting                                                                                                                                                                                                                         |
| _2- Root-form preprocessing_                              | Lemmatization                                                                                                                                                                                                                           |
| _3- Vectorizer_                                           | TF-IDF (word n-gram)                                                                                                                                                                                                               |
| _4- Hyperparameters tuner_                                | None - will use previous XGBoosting trial hyperparameters                                                                                                                                                                          |
| _5- Vectorizer hyperparameters space_                     | None                                                                                                                                                                                                                               |
| _6- Model hyperparameters space_                          | None                                                                                                                                                                                                                               |
| _7- Optimal hyperparameters_                              | 'xgb__subsample': 0.5,<br>'xgb__n_estimators': 100,<br>'xgb__learning_rate': 0.01,<br>'xgb__colsample_bytree': 0.8,<br>'xgb__booster': 'gblinear',<br>'tfidf__ngram_range': (1, 4),<br>'tfidf__min_df': 5,<br>'tfidf__max_df': 0.5 |
| _8- AUC score predefined validation set_                  | None - no validation set was used; the model was trained on the whole data directly                                                                                                                                                |
| _9- AUC score on kaggle test set (public)_                | 0.8087                                                                                                                                                                                                                             |
| _10- Observed performance and thoughts on it_             | The claim that lemmatization performs better than stemming holds in this case as well:<br>this trial scores higher than trial 3, which used XGBoost with stemming and had a test AUC of 0.8056.<br>We did not run a hyperparameter tuner here and simply reused the parameters obtained in the 3rd trial, because tuning the XGBoost model took a long time;<br>the model still performed better than in the stemming case. |
| _11- Reason for changes (if any) and plan for next trial_ | As shown, the lemmatization AUC score is better than the stemming one even with hyperparameters that were not tuned for this data,<br>so a possible setup for a next trial is to run a hyperparameter tuner for the XGBoost model on the lemmatized data,<br>which may yield a higher AUC score than this trial. |
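
To compare the trials in this section side by side, the sketch below collects the AUC scores reported in the summary tables above and computes the gap between the validation and Kaggle scores. The dictionary keys are descriptive labels chosen here for readability; the numbers are copied from the tables.

```python
# (validation AUC, Kaggle public test AUC) as reported in the summary tables above
reported = {
    "LogReg, stemming, char n-gram": (0.8512, 0.7983),
    "XGBoost, stemming, word n-gram": (0.8590, 0.8056),
    "LogReg, lemmatization, word n-gram": (0.8720, 0.8404),
    "XGBoost, lemmatization, word n-gram": (None, 0.8087),  # no validation score reported
}

# gap between validation and test AUC, where a validation score exists
gaps = {
    name: None if valid_auc is None else round(valid_auc - test_auc, 4)
    for name, (valid_auc, test_auc) in reported.items()
}
for name, (valid_auc, test_auc) in reported.items():
    print(f"{name}: test AUC={test_auc:.4f}, validation-test gap={gaps[name]}")

# trial with the highest Kaggle public test AUC
best_on_test = max(reported, key=lambda name: reported[name][1])
print(best_on_test)
```

On these reported numbers, the lemmatized logistic regression has both the highest test AUC and the smallest gap between validation and test scores.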