In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset Overview

In [None]:
dataset_file_path = './data/sportoclanky.csv'
assert os.path.exists(dataset_file_path)
df = pd.read_csv(dataset_file_path)

Normally I would print out `df.head()`, but since the data is too sensitive to be public, only metadata can be used for analysis.


In [None]:
df.columns

Behind the scenes I checked manually that all the categories are truly unique (no duplicates caused by case sensitivity, spelling differences, etc.) and that there are no missing values.
```
df['category'].unique()
```
The categories are indeed unique.
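
A check of this kind could be sketched roughly as follows. The category names are made up for illustration (the real ones cannot be shown), and the normalization steps (trimming, lower-casing, stripping diacritics) are only an assumption about what "spelling differences" might cover.

```python
import unicodedata

import pandas as pd


def normalize_label(label: str) -> str:
    """Trim, lower-case and strip diacritics so that near-duplicates collapse."""
    decomposed = unicodedata.normalize("NFKD", label.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical category names, not taken from the real dataset.
toy = pd.DataFrame({"category": ["Fotbal", "fotbal ", "Hokej", "Házená", "Hazena"]})

raw_unique = toy["category"].nunique()
normalized_unique = toy["category"].map(normalize_label).nunique()

# If the two counts differ, some categories are duplicates in disguise.
has_hidden_duplicates = raw_unique != normalized_unique
missing_values = int(toy["category"].isna().sum())
print(raw_unique, normalized_unique, has_hidden_duplicates, missing_values)
```

On the toy data the counts differ, which is exactly the situation the check is meant to catch; on the real data they should be equal.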
To be able to use the outputs publicly, I need to encode the categories. 

In [None]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
df['category_enc'] = labelencoder.fit_transform(df['category'])

In [None]:
no_classes = len(df['category_enc'].unique())
no_classes

In [None]:
# count by category
cat_counts = df.groupby(['category_enc'])['rss_title'].count().sort_values()
ax = cat_counts.plot.bar(logy=True, title='Number of data points in each category')
for container in ax.containers:
    ax.bar_label(container)

From the cell outputs above we can see that the dataset contains 24 categories and is heavily imbalanced: one of the categories has only 6 data points, while another contains ~46K (~41% of the whole dataset).
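
To put such an imbalance into numbers, one could compute the class shares and the accuracy of a trivial classifier that always predicts the largest class. The sketch below uses hypothetical label counts, not the real ones.

```python
import pandas as pd

# Hypothetical label counts that mimic a heavily imbalanced dataset.
labels = pd.Series([0] * 41 + [1] * 30 + [2] * 23 + [3] * 6, name="category_enc")

counts = labels.value_counts().sort_values()
shares = counts / counts.sum()

# Accuracy of a classifier that always predicts the largest class.
majority_baseline = shares.max()
# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(2).to_dict())
print(f"majority baseline accuracy: {majority_baseline:.2f}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

Any model evaluated by plain accuracy should be compared against this baseline rather than against zero.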

Next, I would like to see whether there are any outliers in the `rss_perex` column (some of them were already visible in `df.head()`).
```
df[df['rss_perex'].str.len() < 20]
```
There are indeed 3500 of them. These anomalies were caused by mistakes during web scraping: instead of the perex, the field contains the publication date of the article, a subtitle, an author, etc. However, even the subtitles can be useful for classification in our case (e.g. 'Bundesliga' is closely associated with soccer).

It is also good to know that neither `rss_perex` nor `rss_title` has missing values.
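
Both checks (missing values and suspiciously short perexes) can be sketched on a toy frame. The rows below are invented; only the column names and the 20-character threshold come from this notebook.

```python
import pandas as pd

# Hypothetical rows; the short values imitate scraping artefacts.
toy = pd.DataFrame({
    "rss_title": ["Title A", "Title B", "Title C"],
    "rss_perex": [
        "Bundesliga",
        "A full-length lead paragraph about a match.",
        "12. 3. 2021",
    ],
})

# Number of missing values per text column.
missing = toy[["rss_title", "rss_perex"]].isna().sum()

# Rows whose perex is shorter than 20 characters.
short_mask = toy["rss_perex"].str.len() < 20
n_short = int(short_mask.sum())

print(missing.to_dict(), n_short)
```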

In [None]:
len(df[df['rss_perex'].str.len() < 20])

In [None]:
df['text'] = df['rss_title'] + ' ' + df['rss_perex']

The dataset is rather large and also imbalanced.
Before trying any simpler methods, I use a text processing approach that has become highly popular over the last couple of years: transformers. Instead of training one from scratch, I can take a pretrained one, use the latent space of its embeddings to vectorize the sentences, and then build a shallow network for classification.

The main challenge is finding a suitable model for embedding the sentences, since most of the widely used models are trained for the English language.
However, I was able to find several multilingual transformers (e.g. [Roberta](https://tfhub.dev/jeongukjae/xlm_roberta_multi_cased_L-24_H-1024_A-16/1)) and a [model trained on the Czech Wikipedia dump](https://tfhub.dev/google/wiki40b-lm-cs/1), which I decided to use. In retrospect, this was a very problematic model and I should probably have used [Small-E-Czech](https://huggingface.co/Seznam/small-e-czech) instead. One of the problems was that the model has a built-in tokenizer: the expected input is a raw sentence, but the output is a 2D embedding of variable length that had to be dealt with. Another concern is that I have not found any available evaluation of the model, so using it was indeed premature.
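
One common way to deal with a variable-length output like this (not the one used below, which pads instead) is to average the token embeddings into a fixed-size sentence vector. A minimal sketch, with random arrays standing in for real embeddings and a hypothetical embedding width:

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average a (num_tokens, emb_dim) matrix over its tokens -> (emb_dim,)."""
    return token_embeddings.mean(axis=0)


rng = np.random.default_rng(0)
emb_dim = 8  # hypothetical embedding width

# Two "sentences" with a different number of tokens.
short_sentence = rng.normal(size=(3, emb_dim))
long_sentence = rng.normal(size=(11, emb_dim))

# After pooling, both sentences have the same shape and can be stacked.
pooled = np.stack([mean_pool(short_sentence), mean_pool(long_sentence)])
print(pooled.shape)
```

Pooling throws away word order, so it trades some information for a much simpler downstream classifier.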

## TFHub model

In [None]:
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
import tensorflow_text
tf.disable_eager_execution()

In [None]:
module_name = "https://tfhub.dev/google/wiki40b-lm-cs/1"

### Testing the embedding functionality of the model.

In [None]:
g = tf.Graph()
with g.as_default():
    # Word embeddings.
    text = tf.placeholder(dtype=tf.string, shape=(1,))
    module = hub.Module(module_name)
    embeddings = module(dict(text=text), signature="word_embeddings",
                        as_dict=True)
    embeddings = embeddings["word_embeddings"]
    init_op = tf.group([tf.global_variables_initializer(),
                      tf.tables_initializer()])
    

In [None]:
# Initialize session.
with tf.Session(graph=g).as_default() as session:
  session.run(init_op)

In [None]:
# Get the embedding size of the longest perex.
max_index = df['rss_perex'].str.len().argmax()  # positional index of the longest perex
longest_str = df['rss_perex'].iloc[max_index]

In [None]:
with session.as_default():
    em = session.run(embeddings, feed_dict={text: [longest_str]})

In [None]:
np.shape(em)

In [None]:
embed_dim = np.shape(em)

In [None]:
em

### Training the model. 

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(df['text'], 
                                                    df['category_enc'], 
                                                    stratify=df['category_enc'], 
                                                    random_state=42,
                                                    test_size=0.2)

X_train, X_val, y_train, y_val = train_test_split(X_train, 
                                                  y_train, 
                                                  stratify=y_train, 
                                                  random_state=42,
                                                  test_size=0.25)  # 0.25 x 0.8 = 0.2 of the full dataset
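
To confirm that such a two-step stratified split yields the intended 60/20/20 partition and keeps the class proportions, one can run it on synthetic labels. The class sizes and variable names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 3 classes with 600, 300 and 100 samples.
y = np.repeat([0, 1, 2], [600, 300, 100])
X = np.arange(len(y))

# Step 1: hold out 20% as the test set.
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
# Step 2: take 25% of the remaining 80% as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_trainval, y_trainval, stratify=y_trainval, test_size=0.25, random_state=42)

# Print the size and the class shares of each partition.
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    shares = np.bincount(part, minlength=3) / len(part)
    print(name, len(part), shares.round(2))
```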

In [None]:
class PadLayer(tf.keras.layers.Layer):
    """This layer is necessary to pad the sentences of variable size to the size of the largest embedding."""
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def pad_up_to(self, t, max_in_dims, constant_values):
        """https://stackoverflow.com/a/48535322/13591234"""
        s = tf.shape(t)
        paddings = [[0, m-s[i]] for (i,m) in enumerate(max_in_dims)]
        return tf.pad(t, paddings, 'CONSTANT', constant_values=constant_values)

    def call(self, inputs):
        output = self.pad_up_to(inputs, max_in_dims=embed_dim, constant_values=0)
        output = tf.reshape(output, embed_dim)
        return output

In [None]:
model = tf.keras.models.Sequential()
model.add(hub.KerasLayer(module_name, 
                        input_shape=[], 
                        dtype=tf.string, 
                        trainable=False, 
                        signature="word_embeddings",
                        signature_outputs_as_dict=True 
                        ))
model.add(PadLayer())
model.add(tf.keras.layers.Dense(256, activation='relu', input_shape=(embed_dim[2],)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(no_classes, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['sparse_categorical_accuracy'])


In [None]:
model.summary()

In retrospect, this model probably has too many parameters; a smaller number would have sufficed.
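
A back-of-the-envelope count shows where the parameters come from. The sizes below are hypothetical (the real ones are given by `embed_dim` and `no_classes`), and the pooled variant is only one possible alternative, not something tried in this notebook.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Hypothetical sizes; the real ones come from `embed_dim` and `no_classes`.
seq_len, emb_dim, hidden, n_classes = 100, 768, 256, 24

# Dense(256) is applied to each token vector, so it only depends on emb_dim.
hidden_layer = dense_params(emb_dim, hidden)
# Flatten turns (seq_len, 256) into seq_len * 256 inputs for the output layer.
output_after_flatten = dense_params(seq_len * hidden, n_classes)
# Alternative: average over tokens first, so the output layer sees only 256 inputs.
output_after_pooling = dense_params(hidden, n_classes)

print(hidden_layer, output_after_flatten, output_after_pooling)
```

With these numbers the output layer after `Flatten` is the largest trainable layer, and its size grows linearly with the padded sequence length.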

In [None]:
model.fit(X_train.to_numpy(), 
          y_train.to_numpy(), 
          epochs=2, 
          batch_size=1,
          validation_data=(X_val.to_numpy(),  y_val.to_numpy()))

In [None]:
results = model.evaluate(X_test.to_numpy(), y_test.to_numpy(), batch_size=1)
print("test loss, test acc:", results)

The accuracy on the test data is 78%, which is not much considering how large the dataset is and that 41% of the data belongs to a single class. However, the second epoch showed some improvement, so it is likely that the model could achieve higher accuracy with more epochs. Unfortunately, the training had to be kept short due to time and computational constraints.

The next possible steps when training the model can be: 
- changing / tweaking the model's architecture; 
- using a different embedding model; 
- utilizing the `rss_title` better, e.g. using two heads for processing `rss_title` and `rss_perex` individually.

Another thing that is left unfinished is verifying the balanced accuracy of the predictions.  
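
Balanced accuracy is the mean of the per-class recalls, so it is available directly in scikit-learn. The sketch below uses hypothetical labels; in this notebook the predictions would presumably be obtained by taking the `argmax` over the outputs of `model.predict`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: class 0 dominates, class 1 is rare.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two rare samples is missed

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```

Plain accuracy barely notices the missed rare sample, while balanced accuracy drops sharply because half of the rare class is misclassified.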

A further (rather desperate) attempt to improve the results could be translating the text to English (e.g. using DeepL) and then applying classic, well-tested methods for working with text.

#### Sources for further improvements of the embeddings 
[Czech word2vec model](https://zenodo.org/record/3975038#.Y9F4MrXMJPY), 
[Training word2vec](https://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim), 
[Scripts for custom training of word2vec](https://github.com/anastazie/nlp_czech_wiki)

[Electra paper](https://arxiv.org/abs/2003.10555), 
[Small-e-Czech hf hub](https://huggingface.co/Seznam/small-e-czech), 
[Small-e-Czech github](https://github.com/seznam/small-e-czech),
[Roberta transformer model](https://tfhub.dev/jeongukjae/xlm_roberta_multi_cased_L-24_H-1024_A-16/1)

[Czech embeddings](https://dspace.cuni.cz/bitstream/handle/20.500.11956/147648/120397596.pdf?sequence=1),    
[An evaluation of Czech word embeddings](https://aclanthology.org/W19-6107.pdf)

#### Possible architectures can be tried from 
[Sparse categorical entropy classification](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-use-sparse-categorical-crossentropy-in-keras.md), 
[Pretrained Embedding Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html), 
[Classifying tweets example](https://towardsdatascience.com/text-classification-using-word-embeddings-and-deep-learning-in-python-classifying-tweets-from-6fe644fcfc81)


### Czech Wikipedia transformer model
[Embedding model](https://tfhub.dev/google/wiki40b-lm-cs/1),
[Colab](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/wiki40b_lm.ipynb#scrollTo=sv2CmI7BdaML)


## Using a BERT-like model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
# https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=no_classes
)

In [None]:
import evaluate
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1_metric.compute(predictions=predictions, references=labels, average='weighted')

In [None]:
training_args = TrainingArguments(
    output_dir="bert_classification",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [None]:
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    """
    https://stackoverflow.com/a/65571687
    """
    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test
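
The only non-obvious step in the function above is the second `test_size`, which is relative to the temporary set rather than to the full dataset. The arithmetic for the default fractions can be checked as follows:

```python
frac_train, frac_val, frac_test = 0.6, 0.15, 0.25  # defaults of the function above

# Stage 1: everything that is not training data goes to a temporary set.
frac_temp = 1.0 - frac_train
# Stage 2: share of the temporary set that becomes the test set.
relative_frac_test = frac_test / (frac_val + frac_test)

# Resulting shares of the full dataset.
final_test = frac_temp * relative_frac_test
final_val = frac_temp * (1.0 - relative_frac_test)

print(round(relative_frac_test, 3), round(final_val, 3), round(final_test, 3))
```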

In [None]:
import torch
class Perex_Dataset(torch.utils.data.Dataset):

    def __init__(self, df, tokenizer):

        self.labels = df['category_enc'].tolist()
        self.texts = tokenizer(df['text'].tolist(), 
                               padding='max_length', max_length = 512, 
                               truncation=True, return_tensors="pt"
                               )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The tokenizer already returns tensors (return_tensors="pt"), so index them directly.
        item = {key: val[idx] for key, val in self.texts.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


In [None]:
df_train, df_val, df_test = split_stratified_into_train_val_test(df, stratify_colname='category_enc')
train, val, test = Perex_Dataset(df_train, tokenizer), Perex_Dataset(df_val, tokenizer), Perex_Dataset(df_test, tokenizer)


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
trainer.predict(test)[2]

#### Results on validation data
{'eval_loss': 0.08356519043445587, 'eval_f1': 0.981017282266466, 'eval_runtime': 217.2683, 'eval_samples_per_second': 76.785, 'eval_steps_per_second': 4.801, 'epoch': 2.0}

#### Results on testing data
{'test_loss': 0.07817238569259644, 'test_f1': 0.9824589477771943, 'test_runtime': 362.4899, 'test_samples_per_second': 76.706, 'test_steps_per_second': 4.795}

We can see that the model achieves a weighted F1 score of about 98% on both the validation and the test data. This is a very good result, since the F1 score takes recall as well as precision into account, which helps to account for the imbalance in the dataset.
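
One caveat is that `average='weighted'` weights each class's F1 by its number of samples, so large classes still dominate the score. The macro average treats all classes equally and may be worth reporting alongside it. The toy labels below are hypothetical and only illustrate the difference.

```python
from sklearn.metrics import f1_score

# Hypothetical predictions: the rare class 1 is never predicted.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

# zero_division=0 assigns F1 = 0 to the class that is never predicted.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"weighted F1: {weighted_f1:.3f}, macro F1: {macro_f1:.3f}")
```

The gap between the two numbers indicates how much of the score is carried by the majority class.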

A direct comparison with the TF model is not possible, since a) the metric is different, and b) the training partition was contaminated when the dataset was split into training and test partitions. A conclusion can be drawn anyway: these problems give the TF model an unfair advantage, and the DistilBERT model still seems to perform better.

### Sources
- [Sequence classification HF](https://huggingface.co/docs/transformers/tasks/sequence_classification)
- [Text classification HF](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
- [Creating custom dataset](https://huggingface.co/transformers/v3.2.0/custom_datasets.html)

#### Not utilized
- [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)