# Notebook 2: Model training and evaluation

This notebook walks through training a classification model on a hate-speech dataset. Each observation has one of three possible target labels:
- 0: non-harmful
- 1: cyberbullying (addressed towards a single person)
- 2: hate-speech (addressed towards a public person/entity/large group)

The original training set will be split into two subsets: 75% for training the models, 25% for validation.
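A minimal sketch of the 75/25 split, assuming scikit-learn's `train_test_split` is used. The toy texts and labels are made up, and stratifying by label is my assumption here (it keeps the rare classes 1 and 2 present in both subsets), not something confirmed above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real training set (texts and labels are made up).
texts = [f"tweet {i}" for i in range(40)]
labels = [0] * 28 + [1] * 8 + [2] * 4

# 75% for training, 25% for validation.
# stratify=labels keeps the class proportions the same in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_train), len(X_val))
print(Counter(y_train), Counter(y_val))
```

Fixing `random_state` makes the split reproducible between notebook runs.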

After the split, I will perform data augmentation on the training subset for the underrepresented classes 1 and 2.

The augmentation uses the Google Translate API to translate some sentences from Polish to English, and then back to Polish.

This way, we should get sentences that keep the same meaning but are phrased in different words.
However, if this causes data duplication, I will not proceed with this technique.
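The back-translation idea, together with the duplication check, could look roughly like the sketch below. The `translate` callable is a hypothetical stand-in for whatever Google Translate client is actually used, and `fake_translate` is a hard-coded stub so the example runs offline.

```python
from typing import Callable, List

# Signature of the hypothetical translator: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator) -> str:
    """Translate Polish -> English -> Polish with the supplied callable."""
    english = translate(text, "pl", "en")
    return translate(english, "en", "pl")


def augment(texts: List[str], translate: Translator) -> List[str]:
    """Return back-translated variants, skipping any that duplicate existing text."""
    seen = {t.strip().lower() for t in texts}
    new_samples = []
    for text in texts:
        candidate = back_translate(text, translate)
        key = candidate.strip().lower()
        if key not in seen:  # drop exact round-trips and repeated variants
            seen.add(key)
            new_samples.append(candidate)
    return new_samples


def fake_translate(text: str, src: str, dest: str) -> str:
    """Offline stub (hypothetical); a real client would call the translation API."""
    table = {
        ("pl", "en"): {"to jest zły pomysł": "this is a bad idea",
                       "dzień dobry": "good morning"},
        ("en", "pl"): {"this is a bad idea": "to kiepski pomysł",
                       "good morning": "dzień dobry"},
    }
    return table[(src, dest)].get(text, text)


originals = ["to jest zły pomysł", "dzień dobry"]
augmented = augment(originals, fake_translate)
print(augmented)
```

In this stub the second sentence comes back unchanged, so only one new sample is kept. If most sentences round-trip to themselves like that, the augmentation adds little.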

Every observation will be cleaned using the TextPreprocessor class defined in the codebase. The steps are:
- converting all text to lowercase,
- removing emojis,
- removing user mentions (e.g. @mariusz),
- removing stop words,
- removing URL addresses,
- removing special characters like new line, carriage return, tab,
- removing punctuation marks,
- removing redundant spaces.
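The cleaning steps above can be sketched as a single function. This is not the actual `TextPreprocessor` implementation. The `clean_text` name, the tiny stop-word set and the emoji ranges are assumptions for illustration. URLs and mentions are removed before punctuation here, because stripping punctuation first would break the patterns that detect them.

```python
import re
import string

# Tiny illustrative stop-word set; the real list is an assumption.
STOP_WORDS = {"i", "w", "na", "to", "się", "że"}

# Covers common emoji blocks only; not an exhaustive emoji definition.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")

# Maps every ASCII punctuation mark to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URL addresses
    text = re.sub(r"@\w+", " ", text)                    # user mentions
    text = EMOJI_PATTERN.sub(" ", text)                  # emojis
    text = re.sub(r"[\n\r\t]", " ", text)                # new line, carriage return, tab
    text = text.translate(PUNCT_TABLE)                   # punctuation marks
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    return " ".join(tokens)                              # also collapses redundant spaces


raw = "@mariusz To jest\tTEST!!! 😀 https://example.com/x  i koniec.\n"
print(clean_text(raw))
```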

I decided to use the SVM algorithm, which should offer a good balance between simplicity and performance. Also, it was an SVM that took first place in the PolEval competition :)

The text data will have to be transformed into numbers. I will use the TF-IDF vectorizer for that purpose.

In the first approach, both the SVM and the TF-IDF vectorizer will be used with their default hyperparameters.
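A minimal sketch of such a baseline, assuming scikit-learn's `TfidfVectorizer` and `SVC` chained in a `Pipeline` (whether `SVC` or another SVM variant is used is an assumption). The texts are placeholders, not samples from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Both steps keep their default hyperparameters.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Placeholder training data (made up).
train_texts = [
    "neutral example one", "neutral example two",
    "bullying example one", "bullying example two",
    "hate example one", "hate example two",
]
train_labels = [0, 0, 1, 1, 2, 2]

baseline.fit(train_texts, train_labels)
predictions = baseline.predict(["neutral example three", "hate example three"])
print(predictions)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training subset, so no vocabulary or IDF statistics leak from the validation or test data.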

In the second approach, I will tune the hyperparameters with a randomized search over the ranges specified in the "hyperparameters.yaml" file, and choose the best configuration at the end.
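The randomized search could be set up along these lines with scikit-learn's `RandomizedSearchCV`. The ranges below are hypothetical and written inline so the sketch is self-contained (in the notebook they would come from "hyperparameters.yaml"). The `f1_macro` scoring and `cv=2` are also assumptions chosen to fit the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Hypothetical search space; "<step>__<param>" addresses a pipeline step's parameter.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [True, False],
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}

# Placeholder data: 4 made-up samples per class.
texts = [f"{word} sample {i}" for word in ("neutral", "bullying", "hate") for i in range(4)]
labels = [label for label in (0, 1, 2) for _ in range(4)]

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=2,                # small because the toy data is tiny
    scoring="f1_macro",
    random_state=0,
)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data passed to `fit` with the best configuration found.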

The results of both approaches will be evaluated by the TextClassificationEvaluator class defined in the codebase.
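I am not reproducing `TextClassificationEvaluator` here. As a rough idea of the kind of metrics such an evaluation could report, here is a sketch using scikit-learn's metric functions on made-up labels. The choice of accuracy, micro/macro F1 and a confusion matrix is an assumption.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(accuracy, micro_f1, macro_f1)
print(cm)
```

Macro F1 weights all three classes equally, so it is more sensitive than accuracy to mistakes on the rare classes 1 and 2.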

Both approaches will be logged in MLflow.

The approaches will be tested on the original "test" dataset.
When choosing the better approach, I will take the following things into consideration:
- performance on the training set,
- performance on the validation and test sets,
- difference in performance between the training set and the validation/test sets.
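One possible way to turn these criteria into a simple rule is sketched below. The `ApproachScores` class, the `choose` function, the 0.10 gap threshold and all the numbers are hypothetical placeholders, not results from this project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApproachScores:
    name: str
    train: float
    validation: float
    test: float

    @property
    def gap(self) -> float:
        """Train score minus the weaker of validation/test; a large gap hints at overfitting."""
        return self.train - min(self.validation, self.test)


def choose(approaches: List[ApproachScores], max_gap: float = 0.10) -> ApproachScores:
    """Prefer approaches whose gap stays within max_gap, then rank by validation score."""
    acceptable = [a for a in approaches if a.gap <= max_gap] or list(approaches)
    return max(acceptable, key=lambda a: a.validation)


# Placeholder numbers only, not measured results.
candidates = [
    ApproachScores("baseline", train=0.99, validation=0.70, test=0.68),
    ApproachScores("tuned", train=0.85, validation=0.78, test=0.77),
]
best = choose(candidates)
print(best.name, round(best.gap, 2))
```

If no approach stays within the threshold, the rule falls back to comparing all of them by validation score.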

The approach with the best results will be deployed as a working service.