# Notebook 2: Model training and evaluation

This notebook walks through training a classification model on a hate-speech dataset. Each observation has one of three target labels:
- 0: non-harmful
- 1: cyberbullying (addressed towards a single person)
- 2: hate-speech (addressed towards a public person/entity/large group)

The dataset comes from the 2019 PolEval competition: http://2019.poleval.pl/index.php/tasks/task6

The original training set will be split into two subsets: 75% for training the models, 25% for validation. The class responsible for that is defined in the codebase.

After the split, I will augment the training subset for the underrepresented classes 1 and 2.
The augmentation uses the Google Translate API to translate selected sentences from Polish to English and then back to Polish.
This should produce sentences with the same meaning, phrased in different words.
However, if this mostly produces duplicates of the original sentences, I will not proceed with this technique.
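
One way to check the duplication concern is to compare each back-translated sentence with its source after light normalization. The sketch below only illustrates that idea: `normalize` and `duplication_rate` are hypothetical helpers, not part of the codebase, and the sentences are made-up stand-ins for real translator output.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def duplication_rate(originals: List[str], rephrased: List[str]) -> float:
    """Share of rephrased sentences identical to their source after normalization."""
    if len(originals) != len(rephrased):
        raise ValueError("Both lists must have the same length")
    if not originals:
        return 0.0
    same = sum(normalize(o) == normalize(r) for o, r in zip(originals, rephrased))
    return same / len(originals)


# Illustrative pairs: the first is genuinely rephrased, the second differs only in whitespace
originals = [
    "Nikt ciebie tu nie chce",
    "Masz rację w każdym punkcie.",
]
rephrased = [
    "Nikt cię tu nie chce",
    "Masz rację  w każdym punkcie.\n",
]

rate = duplication_rate(originals, rephrased)
print(f"Duplicated after back-translation: {rate:.0%}")
```

A high rate would suggest the round trip adds little new information, which is the condition under which the technique would be dropped.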

Every observation will be cleaned using the TextPreprocessor class defined in the codebase. The steps are:
- converting all text to lowercase,
- removing emojis,
- removing user mentions (e.g. @mariusz),
- removing stop words,
- removing URL addresses,
- removing special characters such as new line, carriage return, and tab,
- removing hashtags,
- removing punctuation marks,
- removing redundant spaces.
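
The cleaning steps above can be sketched as a single function. This is not the actual TextPreprocessor implementation, which lives in the codebase; `preprocess`, the tiny `STOP_WORDS` set, and the emoji ranges are assumptions made for illustration. The sketch removes URLs, mentions, and hashtags before stripping punctuation, because stripping punctuation first would break those patterns.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the real list is defined in the codebase
STOP_WORDS = {"i", "to", "na", "nie", "się", "jak", "im", "z", "po", "wszystko", "będzie"}

# Rough approximation of common emoji ranges (assumption), not an exhaustive definition
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u203C]")

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URL addresses
    text = re.sub(r"@\w+", " ", text)           # user mentions
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = EMOJI_PATTERN.sub(" ", text)         # emojis
    text = re.sub(r"[\n\r\t]", " ", text)       # new line, carriage return, tab
    text = text.translate(PUNCTUATION_TABLE)    # punctuation marks
    # Splitting and re-joining drops stop words and redundant spaces in one pass
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess(
    "@anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw "
    "to wszystko będzie ok.\n"
))
```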

I decided to use the SVM algorithm, which should offer a good balance between simplicity and performance. Also, it was an SVM that took first place in the PolEval competition :)

The text data will have to be transformed into numbers. I will use the TF-IDF vectorizer for that purpose.
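
As a small illustration of that transformation, the sketch below fits scikit-learn's `TfidfVectorizer` on three already-cleaned sentences. The corpus and the `max_features=10` cap are chosen only for the example; they are not the values used later in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three cleaned sentences used purely as a toy corpus
corpus = [
    "spoko duda morawieckim zamówią pięć piw ok",
    "miał szans zagrania proba czysta prowizorka",
    "prezes miał racji zdradzieckie mordy miał racji",
]

# max_features keeps only the most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=10)
matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per retained term
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The corpus contains 17 distinct tokens, so the cap actually trims the vocabulary here, which is the same mechanism used to keep the number of features below the number of observations.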

In the first approach, both SVM and TFIDF will be trained on default hyperparameters (I will just pass the max_features parameter to TFIDF so that we don't get more features than observations in the training dataset).

In the second approach, I will perform hyperparameter tuning with Optuna, using the hyperparameter ranges specified in the "hyperparameters.yaml" file, and choose the best configuration at the end.

The results of both approaches will be evaluated by the TextClassificationEvaluator class defined in the codebase.

The default approach, the hyperparameter tuning trials, and the best hyperparameters will be logged in MLflow.

The approaches will be tested on the original "test" dataset.

Since the model is meant to be a prototype and we are dealing with a multiclass classification problem, I do not have strict expectations for the metrics. The main goal is to show the data preparation and modelling process, and later to propose a deployment strategy.

The approach with the best results will be chosen for deployment as a working service.

In [1]:
import os
os.chdir("..")

In [2]:
# I install the googletrans package here to avoid conflicts with pytest http
!pip install googletrans==3.1.0a0

In [49]:
import logging
import warnings
warnings.filterwarnings("ignore")
import random
from tqdm import tqdm

import mlflow
import numpy as np
import optuna
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from googletrans import Translator

from src.text_preprocessor import TextPreprocessor
from model_training.text_classification_evaluator import TextClassificationEvaluator
from model_training.config_parser import ConfigParser
from model_training.train_test_split_transformation import TrainValidationSplitTransformation

In [107]:
def load_data(text_path: str, labels_path: str) -> pd.DataFrame:
    # The tweets are Polish text, so read the files explicitly as UTF-8
    with open(text_path, "r", encoding="utf-8") as f:
        examples = f.readlines()

    with open(labels_path, "r", encoding="utf-8") as f:
        labels = f.readlines()

    df = pd.DataFrame({
        "text": examples,
        "label": labels,
    })

    # Labels are read as strings with a trailing newline; strip it and cast to int
    df["label"] = df["label"].str.strip().astype(int)
    return df

In [108]:
df = load_data("data/train/training_set_clean_only_text.txt", "data/train/training_set_clean_only_tags.txt")

In [109]:
df.head(10)

Unnamed: 0,text,label
0,"Dla mnie faworytem do tytułu będzie Cracovia. Zobaczymy, czy typ się sprawdzi.\n",0
1,@anonymized_account @anonymized_account Brawo ty Daria kibic ma być na dobre i złe\n,0
2,"@anonymized_account @anonymized_account Super, polski premier składa kwiaty na grobach kolaborantów. Ale doczekaliśmy czasów.\n",0
3,@anonymized_account @anonymized_account Musi. Innej drogi nie mamy.\n,0
4,"Odrzut natychmiastowy, kwaśna mina, mam problem\n",0
5,"Jaki on był fajny xdd pamiętam, że spóźniłam się na jego pierwsze zajęcia i to sporo i za karę kazał mi usiąść w pierwszej ławce XD\n",0
6,@anonymized_account No nie ma u nas szczęścia 😉\n,0
7,@anonymized_account Dawno kogoś tak wrednego nie widziałam xd\n,0
8,"@anonymized_account @anonymized_account Zaległości były, ale ważne czy były wezwania do zapłaty z których się klub nie wywiązał.\n",0
9,@anonymized_account @anonymized_account @anonymized_account Gdzie jest @anonymized_account . Brudziński jesteś kłamcą i marnym kutasem @anonymized_account\n,2


In [110]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10041 entries, 0 to 10040
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    10041 non-null  object
 1   label   10041 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 157.0+ KB


In [111]:
df.shape

(10041, 2)

## Splitting the dataset into training and validation subsets

In [112]:
train_valid_transformation = TrainValidationSplitTransformation(
    input_column_name="text",
    target_column_name="label",
    train_size=0.75,
    do_stratify=True,
)

df_train, df_val = train_valid_transformation.split_to_training_and_val_subsets(dataframe=df)

In [113]:
df_train.shape

(7530, 2)

In [114]:
df_val.shape

(2511, 2)

In [115]:
df_train["label"].value_counts(normalize=True)

0    0.915272
2    0.059495
1    0.025232
Name: label, dtype: float64

In [116]:
df_val["label"].value_counts(normalize=True)

0    0.915173
2    0.059737
1    0.025090
Name: label, dtype: float64
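
The near-identical class shares in both subsets are what a stratified split is expected to produce. TrainValidationSplitTransformation is defined in the codebase and not shown here, so the sketch below is only an assumption about the underlying idea, using scikit-learn's `train_test_split` on synthetic labels with a similar class balance. `class_shares` is a hypothetical helper.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class balance seen above (illustration only)
labels = [0] * 915 + [1] * 25 + [2] * 60
texts = [f"example {i}" for i in range(len(labels))]

texts_train, texts_val, labels_train, labels_val = train_test_split(
    texts,
    labels,
    train_size=0.75,
    stratify=labels,   # keep class proportions the same in both subsets
    random_state=42,
)


def class_shares(values):
    """Return the share of each label, sorted by label."""
    counts = Counter(values)
    return {label: round(count / len(values), 3) for label, count in sorted(counts.items())}


print(class_shares(labels_train))
print(class_shares(labels_val))
```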

## Augmenting classes 1 and 2 - experiment

In [117]:
translator = Translator()

In [118]:
def rephrase_text(
    text: str,
    translator_object: Translator = translator
) -> str:
    # Back-translation: Polish -> English -> Polish.
    # translate() returns a Translated object; the translated string is in its .text attribute
    english_text = translator_object.translate(text, src="pl", dest="en").text
    rephrased_text = translator_object.translate(english_text, src="en", dest="pl").text
    return rephrased_text

In [119]:
sample_indices = range(12)
sample_sentences = df_train["text"].values[sample_indices]
rephrased_sentences = [rephrase_text(text) for text in sample_sentences]
rephrasing_sample = pd.DataFrame({
    "original": sample_sentences,
    "rephrased": rephrased_sentences,
})

In [120]:
pd.options.display.max_colwidth = 200
rephrasing_sample

Unnamed: 0,original,rephrased
0,"Justas Lasickas z golem dla Litwy U-21, która remisuje z Wyspami Owczymi 2:2.\n","Justas Lasickas po golu dla Litwy U-21, który zremisował z Wyspami Owczymi 2-2."
1,@anonymized_account @anonymized_account Kostevycha ręka była rozmyślna ?\n,@anonymized_account @anonymized_account Ręka Kostevycha była celowa?
2,"@anonymized_account Ja już nie odczuwam, regularność codziennie do pracy, śmieję się wiatru w twarz, czym mocniej wieje tym głośniej sié śmieję\n","@anonymized_account Już tego nie czuję, regularność codziennie do pracy, śmieję się w twarz wiatrowi, im mocniej wieje, tym głośniej się śmieję"
3,@anonymized_account @anonymized_account Nikt ciebie tu nie chce rasistowski pętaku\n,@anonymized_account @anonymized_account Nikt cię tu nie chce rasistowski draniu
4,#dividetourwarsaw zrób ktos live spod stadionu prosze potrzebuje tego\n,#dividetourwarsaw czy ktoś na żywo spod stadionu proszę potrzebuję tego
5,@anonymized_account @anonymized_account @anonymized_account @anonymized_account @anonymized_account Kaliciaka mogli też zaprosić :)\n,@anonymized_account @anonymized_account @anonymized_account @anonymized_account @anonymized_account Kaliciak też mógł zaprosić :)
6,@anonymized_account Może nie było ciekawszych? :) a może ta agencja ma też znajomości w gazecie i poprosili żeby o nim wspomnieć ? :)\n,@anonymized_account Może nie było ciekawiej? :) a może ta agencja też ma znajomości w gazecie i poprosili mnie o wzmiankę o nim? :)
7,@anonymized_account @anonymized_account @anonymized_account Masz rację w każdym punkcie. I PiS też ma i dokładnie wie co robi i z kim ma do czynienia.\n,"@anonymized_account @anonymized_account @anonymized_account Masz rację w każdym punkcie. A PiS też ma i dokładnie wie, co robi iz kim ma do czynienia."
8,W sumie to polubiłam już zajęcia z konstrukcji przekazu reklamowego i prowadzący jest świetny\n,W sumie już mi się podobały zajęcia z budowy przekazu reklamowego i prowadzący jest super
9,Tłit aktualny szczególnie dzisiaj‼️\n\nDzięki @anonymized_account \n\nhttps://t.co/w4nJSmo1J4\n,Ttit aktualne szczególnie dzisiaj‼️\n\nDzięki @anonymized_account \n\nhttps://t.co/w4nJSmo1J4


I think this approach is at least worth trying.

In [121]:
def augment_data(
    dataframe: pd.DataFrame,
    new_ones: int,  # how many new observations from class "1" we want to get
    new_twos: int,  # how many new observations from class "2" we want to get
) -> pd.DataFrame:
    ones_number = dataframe[dataframe["label"] == 1].shape[0]
    twos_number = dataframe[dataframe["label"] == 2].shape[0]
    assert new_ones <= ones_number
    assert new_twos <= twos_number
    
    ones_indices = random.sample(range(ones_number), new_ones)
    twos_indices = random.sample(range(twos_number), new_twos)
    
    original_ones = dataframe[dataframe["label"] == 1]["text"].values[ones_indices]
    original_twos = dataframe[dataframe["label"] == 2]["text"].values[twos_indices]
    
    rephrased_ones = [rephrase_text(text) for text in tqdm(original_ones)]
    rephrased_twos = [rephrase_text(text) for text in tqdm(original_twos)]
    
    rephrased_ones_df = pd.DataFrame({
        "text": rephrased_ones,
        "label": [1 for i in range(len(rephrased_ones))]
    })
    rephrased_twos_df = pd.DataFrame({
        "text": rephrased_twos,
        "label": [2 for i in range(len(rephrased_twos))]
    })
    
    return pd.concat([dataframe, rephrased_ones_df, rephrased_twos_df], axis="rows")

In [122]:
df_train["label"].value_counts()

0    6892
2     448
1     190
Name: label, dtype: int64

Let's create 90 new observations for class "1" and 215 new observations for class "2".

In [18]:
df_train_augmented = augment_data(dataframe=df_train, new_ones=90, new_twos=215)

100%|██████████| 90/90 [00:43<00:00,  2.05it/s]
100%|██████████| 215/215 [01:47<00:00,  2.01it/s]


In [19]:
df_train_augmented["label"].value_counts()

0    6892
2     663
1     280
Name: label, dtype: int64

In [20]:
df_train_augmented["label"].value_counts(normalize=True)

0    0.879643
2    0.084620
1    0.035737
Name: label, dtype: float64

In [21]:
df_train["label"].value_counts(normalize=True)

0    0.915272
2    0.059495
1    0.025232
Name: label, dtype: float64

The share of both underrepresented classes increased: class 1 from 2.5% to 3.6% and class 2 from 5.9% to 8.5%.

## A function for logging experiments into MLflow

In [68]:
from typing import Any, Dict, Optional

def log_experiment(
    experiment_name: str,
    metrics: Dict[str, Any],
    model: Optional[SVC] = None,
    vectorizer: Optional[TfidfVectorizer] = None,
    hyperparams: Optional[Dict[str, Any]] = None,
    run_name: Optional[str] = None,
    tag: Optional[str] = None,
):
    """Log hyperparameters, metrics and (optionally) fitted objects to one MLflow run."""
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=run_name):
        # Hyperparameters are optional, e.g. the default run has none to log
        if hyperparams is not None:
            mlflow.log_params(hyperparams)

        mlflow.log_metrics(metrics)

        # Fitted objects are stored as artifacts only when provided
        if model is not None:
            mlflow.sklearn.log_model(model, "model")
        if vectorizer is not None:
            mlflow.sklearn.log_model(vectorizer, "vectorizer")
        if tag is not None:
            mlflow.set_tag("tag1", tag)

## Training SVM and TFIDF with default hyperparameters

In [23]:
tfidf_default = TfidfVectorizer(max_features=df_train_augmented.shape[0])
#just to make sure we don't have more features than observations
svm_default = SVC(verbose=1)

preprocessor = TextPreprocessor()
evaluator = TextClassificationEvaluator()

In [24]:
# text preprocessing
df_train_augmented["text"] = df_train_augmented["text"].apply(lambda x: preprocessor.preprocess(x))
df_val["text"] = df_val["text"].apply(lambda x: preprocessor.preprocess(x))

In [25]:
X_train_tfidf = tfidf_default.fit_transform(df_train_augmented["text"].values)
X_val_tfidf = tfidf_default.transform(df_val["text"].values)

In [26]:
y_train = df_train_augmented["label"].values
y_val = df_val["label"].values

In [27]:
svm_default.fit(X_train_tfidf, y_train)

[LibSVM]...*..*
optimization finished, #iter = 5694
obj = -413.500946, rho = -0.941106
nSV = 4220, nBSV = 286
.....*...*
optimization finished, #iter = 8067
obj = -872.366028, rho = -0.839806
nSV = 5371, nBSV = 648
.*
optimization finished, #iter = 1396
obj = -338.880396, rho = 0.751311
nSV = 877, nBSV = 269
Total nSV = 6461


In [28]:
def evaluate_model(
    model: SVC,
    labels_train: np.ndarray,
    labels_val: np.ndarray,
    features_train: np.ndarray,
    features_val: np.ndarray,
    eval_object: TextClassificationEvaluator = evaluator,
):
    predictions_train = model.predict(features_train)
    predictions_val = model.predict(features_val)
    
    # Use the evaluator passed as an argument rather than the global object
    train_metrics = eval_object.calculate_metrics(labels_train, predictions_train)
    validation_metrics = eval_object.calculate_metrics(labels_val, predictions_val)
    
    return train_metrics, validation_metrics

In [29]:
metrics_train, metrics_val = evaluate_model(
    model=svm_default,
    labels_train=y_train,
    labels_val=y_val,
    features_train=X_train_tfidf,
    features_val=X_val_tfidf,
)

In [30]:
print(f"Train metrics:\n{metrics_train}")
print()
print("-"*20)
print()
print(f"Validation metrics:\n{metrics_val}")

Train metrics:
{'accuracy': 0.9624760689215061, 'f1_macro': 0.8420009498522497, 'f1_micro': 0.9624760689215061}

--------------------

Validation metrics:
{'accuracy': 0.9215452011150936, 'f1_macro': 0.4065510938891632, 'f1_micro': 0.9215452011150936}


As expected, the model trained with default hyperparameters overfits strongly: f1_macro is 0.84 on the training set but only 0.41 on the validation set.

In the hyperparameter tuning process, I will use f1_macro on the validation dataset as the optimization target.
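
To see why f1_macro is a more informative target than accuracy on data this imbalanced, consider a trivial baseline that always predicts class 0. The sketch below uses synthetic labels with roughly the class balance of this dataset (an assumption for illustration) and scikit-learn's metric functions directly, not the TextClassificationEvaluator from the codebase.

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: about 91.5% class 0, 2.5% class 1, 6% class 2 (illustration only)
y_true = [0] * 915 + [1] * 25 + [2] * 60

# Baseline that ignores the input and always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 instead of warning
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {accuracy:.3f}")
print(f"f1_macro: {f1_macro:.3f}")
```

The baseline reaches high accuracy while its macro F1 stays low, because classes 1 and 2 each contribute an F1 of zero to the average.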

In [31]:
metrics_to_log = {}

for metric, value in metrics_train.items():
    new_key = metric + "_train"
    metrics_to_log[new_key] = value
    
for metric, value in metrics_val.items():
    new_key = metric + "_val"
    metrics_to_log[new_key] = value

In [32]:
log_experiment(
    experiment_name="Default_hyperparameters",
    metrics=metrics_to_log,
    model=svm_default,
    vectorizer=tfidf_default,
)



## Tuning hyperparameters with Optuna

In [33]:
# loading hyperparameters ranges
svm_hyperparams, tfidf_hyperparams = ConfigParser.load_hyperparams("hyperparameters.yaml")

In [37]:
def objective(
    trial, 
    X_train, 
    X_test, 
    y_train, 
    y_test,
    svm_hyperparams,
    tfidf_hyperparams,
    evaluator,
):
    # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
    svm_C = trial.suggest_float('svm_c', *svm_hyperparams.C, log=True)
    svm_kernel = trial.suggest_categorical('svm_kernel', svm_hyperparams.kernel)
    svm_class_weight = svm_hyperparams.class_weight  # fixed value, not tuned
    tfidf_max_features = trial.suggest_int('tfidf_max_features', *tfidf_hyperparams.max_features)
    tfidf_min_df = trial.suggest_int('tfidf_min_df', *tfidf_hyperparams.min_df)
    tfidf_max_df = trial.suggest_float('tfidf_max_df', *tfidf_hyperparams.max_df)
    
    vectorizer = TfidfVectorizer(
        max_df=tfidf_max_df,
        max_features=tfidf_max_features,
        min_df=tfidf_min_df
    )
    
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    
    svm = SVC(
        C=svm_C,
        kernel=svm_kernel,
        class_weight=svm_class_weight
    )
    
    svm.fit(X_train_tfidf, y_train)
    
    y_pred = svm.predict(X_test_tfidf)
    
    f1_macro = evaluator.calculate_f1_macro(y_test, y_pred)
    
    return f1_macro

In [38]:
X_train = df_train_augmented["text"].values
X_val = df_val["text"].values

In [70]:
study = optuna.create_study(direction='maximize')

In [71]:
def log_callback(study, trial):
    best_trial = study.best_trial
    if trial.number % 10 == 0 or trial.number == 199:
        print(f"Trial {trial.number}: {trial.value} Best trial so far: {best_trial.number}: {best_trial.value}")

        log_experiment(
            experiment_name="hyperparams_tuning",
            metrics={"f1_macro": best_trial.value},
            hyperparams=best_trial.params,
            run_name=f"trial_{trial.number}",
        )

In [72]:
logging.disable(logging.CRITICAL)

In [73]:
study.optimize(
    lambda trial: objective(
        trial,
        X_train,
        X_val,
        y_train,
        y_val,
        svm_hyperparams,
        tfidf_hyperparams,
        evaluator,
    ),
    n_trials=200,
    callbacks=[log_callback]
)

Trial 0: 0.03757985719654265 Best trial so far: 0: 0.03757985719654265
Trial 10: 0.4217346846256567 Best trial so far: 2: 0.4301355440304142
Trial 20: 0.42054829993645715 Best trial so far: 17: 0.4589135397543051
Trial 30: 0.40993178954134396 Best trial so far: 26: 0.46558393062556785
Trial 40: 0.40802719172193536 Best trial so far: 31: 0.466316976213899
Trial 50: 0.470975173187786 Best trial so far: 50: 0.470975173187786
Trial 60: 0.47335940495360784 Best trial so far: 58: 0.481140752772768
Trial 70: 0.4803065134099617 Best trial so far: 58: 0.481140752772768
Trial 80: 0.47491313237317384 Best trial so far: 58: 0.481140752772768
Trial 90: 0.40441337988636894 Best trial so far: 88: 0.48179449757396525
Trial 100: 0.481140752772768 Best trial so far: 88: 0.48179449757396525
Trial 110: 0.4009299026351485 Best trial so far: 88: 0.48179449757396525
Trial 120: 0.4447613946678073 Best trial so far: 88: 0.48179449757396525
Trial 130: 0.4800495375180718 Best trial so far: 88: 0.4817944975739652

In [74]:
best_trial = study.best_trial

best_params = best_trial.params
print(best_params)

{'svm_c': 7.6495944342965165, 'svm_kernel': 'rbf', 'tfidf_max_features': 4692, 'tfidf_min_df': 1, 'tfidf_max_df': 0.518276935298537}


In [84]:
best_svm_hyperparams = {}
best_tfidf_hyperparams = {}

for key in best_params.keys():
    if key.startswith("svm_"):
        best_svm_hyperparams[key.replace("svm_", "")] = best_params[key]
    elif key.startswith("tfidf_"):
        best_tfidf_hyperparams[key.replace("tfidf_", "")] = best_params[key]

best_svm_hyperparams["C"] = best_svm_hyperparams["c"]
del best_svm_hyperparams["c"]

In [86]:
best_tfidf = TfidfVectorizer(**best_tfidf_hyperparams)
best_svm = SVC(**best_svm_hyperparams, class_weight="balanced")

X_train_tfidf = best_tfidf.fit_transform(X_train)
X_val_tfidf = best_tfidf.transform(X_val)

best_svm.fit(X_train_tfidf, y_train)

In [89]:
best_metrics_train, best_metrics_val = evaluate_model(
    model=best_svm,
    labels_train=y_train,
    labels_val=y_val,
    features_train=X_train_tfidf,
    features_val=X_val_tfidf,
)

In [91]:
print(f"Train metrics:\n{best_metrics_train}")
print()
print("-"*20)
print()
print(f"Validation metrics:\n{best_metrics_val}")

Train metrics:
{'accuracy': 0.9941289087428207, 'f1_macro': 0.9790723331833172, 'f1_micro': 0.9941289087428207}

--------------------

Validation metrics:
{'accuracy': 0.9203504579848666, 'f1_macro': 0.48179449757396525, 'f1_micro': 0.9203504579848666}


Tuning improved validation f1_macro slightly (from 0.41 to 0.48), but the model still overfits strongly (0.98 on the training set). However, building the best possible solution is not the main purpose, so let's keep these results.

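By the way, accuracy is always equal to f1_micro here, and this is expected. In single-label multiclass classification, every misclassified observation counts once as a false positive (for the predicted class) and once as a false negative (for the true class). Summed over all classes, false positives therefore equal false negatives, so micro-averaged precision, recall, and F1 all reduce to the share of correct predictions, which is accuracy.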
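
The equality of accuracy and f1_micro can be checked by counting true positives, false positives, and false negatives by hand. The sketch below is a plain-Python illustration on made-up labels; `micro_f1` is a hypothetical helper written for this example, not the evaluator used in the notebook.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1   # predicted this class, but the truth is another class
            elif pred != label and truth == label:
                fn += 1   # missed an observation of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 0, 0, 1, 1, 0, 2, 0, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_micro = micro_f1(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}, f1_micro: {f1_micro:.3f}")
```

Each of the three wrong predictions adds exactly one false positive and one false negative to the pooled counts, which is why the two numbers coincide.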

In [94]:
best_metrics_to_log = {}

for metric, value in best_metrics_train.items():
    new_key = metric + "_train"
    best_metrics_to_log[new_key] = value
    
for metric, value in best_metrics_val.items():
    new_key = metric + "_val"
    best_metrics_to_log[new_key] = value

log_experiment(
    experiment_name="Best_hyperparams",
    metrics=best_metrics_to_log,
    model=best_svm,
    vectorizer=best_tfidf,
)

I downloaded the best SVM and TF-IDF objects and saved them in the /models directory.

In [96]:
import pickle

with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)

with open("models/vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)

assert isinstance(model, SVC)
assert isinstance(vectorizer, TfidfVectorizer)

## Running best objects on test dataset

Now let's run the evaluation process on the test dataset:

In [98]:
with open("data/test/test_set_only_text.txt", "r") as f:
    test_examples = f.readlines()

with open("data/test/test_set_only_tags.txt", "r") as f:
    test_labels = f.readlines()

df_test = pd.DataFrame({
    "text": test_examples,
    "label": test_labels,
})

df_test["label"] = df_test["label"].apply(lambda x: x.replace("\n", ""))
df_test["label"] = df_test["label"].astype(int)

In [99]:
df_test.head()

Unnamed: 0,text,label
0,"@anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.\n",0
1,"@anonymized_account @anonymized_account Ale on tu nie miał szans jej zagrania, a ta 'proba' to czysta prowizorka.\n",0
2,"@anonymized_account No czy Prezes nie miał racji, mówiąc,ze to są zdradzieckie mordy? No czy nie miał racji?😁😁\n",0
3,@anonymized_account @anonymized_account Przecież to nawet nie jest przewrotka 😂\n,0
4,@anonymized_account @anonymized_account Owszem podatki tak. Ale nie w takich okolicznościach. Czemu Małysza odpalili z teamu Orlen?\n,0


In [100]:
df_test.shape

(1000, 2)

In [101]:
df_test["text"] = df_test["text"].apply(lambda x: preprocessor.preprocess(x))

In [102]:
df_test.head()

Unnamed: 0,text,label
0,spoko duda morawieckim zamówią pięć piw ok,0
1,miał szans zagrania proba czysta prowizorka,0
2,prezes miał racji mówiącze zdradzieckie mordy miał racji,0
3,przewrotka,0
4,podatki tak takich okolicznościach małysza odpalili teamu orlen,0


In [103]:
X_test_tfidf = vectorizer.transform(df_test["text"].values)
y_test = df_test["label"].values

In [104]:
y_test_preds = model.predict(X_test_tfidf)

In [105]:
print(f"Metrics values on test set:\n{evaluator.calculate_metrics(y_test, y_test_preds)}")

Metrics values on test set:
{'accuracy': 0.871, 'f1_macro': 0.4379092184593614, 'f1_micro': 0.871}


Performance dropped further on previously unseen data: f1_macro fell from 0.48 on the validation set to 0.44 on the test set. Achieving better results would require several techniques for reducing overfitting. Getting more data would help, as the training dataset had only around 10,000 observations. It may also be worth researching other text augmentation techniques and using an algorithm that is less prone to overfitting.

However, let's treat the current model and vectorizer as a prototype.