# Disclaimer

Released under the CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/)

# Purpose of this notebook

The purpose of this document is to show how I approached the presented problem and to record what I learned about using TensorFlow 2 and CatBoost to perform a classification task on text data.

If, while reading this document, you think _"Why didn't you do `<this>` instead of `<that>`?"_, the answer could be simply because I don't know about `<this>`. Comments, questions and constructive criticism are of course welcome.

# Intro

This simple classification task was developed to become familiar with how TensorFlow 2 and CatBoost handle text data. In summary, the task is to predict the author of a short text.

To collect a number of train/test examples, it is enough to create a Twitter app and, using the Python client library for Twitter, read the user timeline of multiple accounts. This process is not covered here. If you are interested in this topic, feel free to contact me.


## Features

It is assumed the collected raw data consists of:

1.   The author handle (the label that will be predicted)
2.   The timestamp of the post
3.   The raw text of the post

### Preparing the dataset

When preparing the dataset, the content of the post is preprocessed using these rules:

1.   Newlines are replaced with a space
2.   Links are replaced with a placeholder (e.g. `<link>`)
3.   For each possible unicode char category, the number of chars in that category is added as a feature
4.   The number of words for each tweet is added as a feature
5.   Retweets (even retweets with comment) are discarded. Only responses and original tweets are taken into account
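
The first four rules above can be sketched as follows. This is only an illustration, not the code used to build the dataset: the `preprocess_post` helper and the `LINK_PATTERN` regular expression are hypothetical, and rule 5 (discarding retweets) is not covered because it depends on metadata returned by the Twitter API.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical pattern: the notebook does not show how links are detected.
LINK_PATTERN = re.compile(r"https?://\S+")


def preprocess_post(text, placeholder="<link>"):
    """Return the cleaned text, per-category character counts and the word count."""
    # Rule 1: newlines are replaced with a space.
    cleaned = text.replace("\r\n", " ").replace("\n", " ")
    # Rule 2: links are replaced with a placeholder.
    cleaned = LINK_PATTERN.sub(placeholder, cleaned)
    # Rule 3: count the characters in each Unicode general category (e.g. "Lu", "Po").
    category_counts = dict(Counter(unicodedata.category(ch) for ch in cleaned))
    # Rule 4: count the whitespace-separated words.
    word_count = len(cleaned.split())
    return cleaned, category_counts, word_count


cleaned, counts, n_words = preprocess_post("Hello World!\nSee https://example.com")
print(cleaned)   # Hello World! See <link>
print(n_words)   # 4
```

Categories that never appear in a post are simply missing from the dictionary; in a tabular dataset they would be stored as zero.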

The dataset has been randomly split into three different files for train (70%), validation (10%) and test (20%). For each label, it has been verified that the same percentages hold in all three files.
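
One way to get the same percentages for every label is to perform the split label by label. The sketch below assumes this approach; the actual split was random and only verified afterwards, and the `stratified_split` helper is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split rows into train/validation/test, keeping the fractions for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, val, test = [], [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_train = round(fractions[0] * len(label_rows))
        n_val = round(fractions[1] * len(label_rows))
        train.extend(label_rows[:n_train])
        val.extend(label_rows[n_train:n_train + n_val])
        # Whatever is left goes to the test split.
        test.extend(label_rows[n_train + n_val:])
    return train, val, test


# Toy data: 100 posts for label "a" and 50 posts for label "b".
rows = [("a", i) for i in range(100)] + [("b", i) for i in range(50)]
train, val, test = stratified_split(rows, label_of=lambda row: row[0])
print(len(train), len(val), len(test))  # 105 15 30
```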

Before fitting the data and before evaluation on the test dataset, the timestamp values are normalized, using the mean and standard deviation computed on the train dataset.
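
As a minimal sketch of this step, the statistics are computed once on the train timestamps and then reused unchanged for the other splits. The helper names and the timestamp values are made up for illustration.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and sample standard deviation on the train split only."""
    mean = statistics.mean(train_values)
    std = statistics.stdev(train_values)  # sample std, like pandas Series.std()
    return mean, std


def standardize(values, mean, std):
    """Apply statistics computed elsewhere (on the train split) to any split."""
    return [(v - mean) / std for v in values]


# Made-up Unix timestamps, one day apart.
train_ts = [1_600_000_000, 1_600_086_400, 1_600_172_800]
test_ts = [1_600_259_200]

mean, std = fit_standardizer(train_ts)
train_norm = standardize(train_ts, mean, std)
test_norm = standardize(test_ts, mean, std)
print(train_norm, test_norm)
```

Reusing the train statistics means the test values are not centered on zero themselves, which is expected: the model only ever sees the scale it was trained on.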

# TensorFlow 2 model

The model has five input features:

1.    The normalized timestamp.
2.    The input text, represented as the whole sentence. This is transformed into a 128-dimensional vector by an embedding layer.
3.    The input text, this time represented as a sequence of words, expressed as token indexes. This representation is used by an LSTM layer to try to extract some meaning from the order of the words.
4.    The Unicode character category usage. This should help identify handles that use emojis, a lot of punctuation or unusual characters.
5.    The number of words in the tweet.
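
The word-sequence representation fed to the LSTM can be illustrated with a simplified, pure-Python stand-in for the Keras `Tokenizer`. The helpers below are hypothetical, and padding with `0` is an assumption made for this sketch (the notebook pads with the index of a dedicated token instead).

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer indexes, most frequent first; index 1 is out-of-vocabulary."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}  # index 0 is left unused
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def to_fixed_length(text, word_index, max_len, pad_value, oov_token="<OOV>"):
    """Convert a text to exactly max_len token indexes (truncate, then pad)."""
    oov = word_index[oov_token]
    sequence = [word_index.get(word, oov) for word in text.lower().split()]
    sequence = sequence[:max_len]
    return sequence + [pad_value] * (max_len - len(sequence))


word_index = build_word_index(["good morning", "good night everyone"])
seq = to_fixed_length("good evening everyone", word_index, max_len=5, pad_value=0)
print(seq)  # [2, 1, 5, 0, 0]
```

The unseen word "evening" maps to the out-of-vocabulary index, and the two trailing values are padding that a masking layer can later ignore.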

The resulting layers are concatenated; then, after a sequence of three dense layers (with dropout applied after the first two), the final layer computes the logits for the different classes. The loss function is *sparse categorical crossentropy*, since the labels are represented as indexes into a list of Twitter handles.
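
For a single example, sparse categorical crossentropy computed from logits is the negative log of the softmax probability assigned to the true class index. The function below is a pure-Python sketch of that formula for illustration, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(logits, label_index):
    """Loss for one example: -log(softmax(logits)[label_index])."""
    # Shift by the maximum logit so that exp() cannot overflow.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label_index]


logits = [2.0, 0.5, -1.0]  # raw outputs of the final dense layer for 3 classes
loss_if_class_0 = sparse_categorical_crossentropy(logits, 0)
loss_if_class_2 = sparse_categorical_crossentropy(logits, 2)
print(loss_if_class_0 < loss_if_class_2)  # True
```

The label is passed as an integer index rather than a one-hot vector, which is what "sparse" refers to; the loss is small when the true class has the largest logit.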

## Imports for the TensorFlow 2 model

In [None]:
import functools
import os

from tensorflow.keras import Input, layers
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import regularizers

import pandas as pd
import numpy as np

import copy

import calendar
import datetime
import re


from tensorflow.keras.preprocessing.text import Tokenizer

import unicodedata
# Masking layers and GPU don't mix: hide all GPUs so that TensorFlow runs on the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

## Definitions for the TensorFlow 2 model


In [None]:
#Download size: ~446MB
hub_layer = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",
    output_shape=[128],
    input_shape=[],
    dtype=tf.string,
    trainable=False
)

embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")

unicode_data_categories = [
    "Cc",
    "Cf",
    "Cn",
    "Co",
    "Cs",
    "LC",
    "Ll",
    "Lm",
    "Lo",
    "Lt",
    "Lu",
    "Mc",
    "Me",
    "Mn",
    "Nd",
    "Nl",
    "No",
    "Pc",
    "Pd",
    "Pe",
    "Pf",
    "Pi",
    "Po",
    "Ps",
    "Sc",
    "Sk",
    "Sm",
    "So",
    "Zl",
    "Zp",
    "Zs"
]

column_names = [
    "handle",
    "timestamp",
    "text"
]

column_names.extend(unicode_data_categories)

n_tokens = 100000

tokenizer = Tokenizer(n_tokens, oov_token='<OOV>')

#List of handles (labels)
#Fill with the handles you want to consider in your dataset
handles = [

]

end_token = "XEND"

train_file = os.path.realpath("data/train.csv")
val_file = os.path.realpath("data/val.csv")
test_file = os.path.realpath("data/test.csv")

## Preprocessing and computing dataset features

In [None]:
def get_pandas_dataset(input_file, fit_tokenizer=False, timestamp_mean=None, timestamp_std=None, pad_sequence=None):

    pd_dat = pd.read_csv(input_file, names=column_names)
    
    pd_dat = pd_dat[pd_dat.handle.isin(handles)]

    if(timestamp_mean is None):
         timestamp_mean = pd_dat.timestamp.mean()
    
    if(timestamp_std is None):
        timestamp_std = pd_dat.timestamp.std()

    pd_dat.timestamp = (pd_dat.timestamp - timestamp_mean) / timestamp_std

    pd_dat["handle_index"] = pd_dat['handle'].map(lambda x: handles.index(x))

    if(fit_tokenizer):
        tokenizer.fit_on_texts(pd_dat["text"])
        pad_sequence = tokenizer.texts_to_sequences([[end_token]])[0][0]
    
    pd_dat["sequence"] = tokenizer.texts_to_sequences(pd_dat["text"])


    max_seq_length = 30
    pd_dat = pd_dat.reset_index(drop=True)

    #max length
    pd_dat["sequence"] = pd.Series(el[0:max_seq_length] for el in pd_dat["sequence"])

    #padding
    pd_dat["sequence"] = pd.Series([el + ([pad_sequence] * (max_seq_length - len(el))) for el in pd_dat["sequence"]])
    
    # number of space-separated words in each tweet
    pd_dat["words_in_tweet"] = pd_dat["text"].str.strip().str.split(" ").str.len()
    
    return pd_dat, timestamp_mean, timestamp_std, pad_sequence

train_dataset, timestamp_mean, timestamp_std, pad_sequence = get_pandas_dataset(train_file, fit_tokenizer=True)

test_dataset, _, _, _= get_pandas_dataset(test_file, timestamp_mean=timestamp_mean, timestamp_std=timestamp_std, pad_sequence=pad_sequence)

val_dataset, _, _, _ = get_pandas_dataset(val_file, timestamp_mean=timestamp_mean, timestamp_std=timestamp_std, pad_sequence=pad_sequence)

#selecting as features only the unicode categories that are used in the train dataset
non_null_unicode_categories = []
for unicode_data_category in unicode_data_categories:
    category_name = unicode_data_category
    category_sum = train_dataset[category_name].sum()

    if(category_sum > 0):
        non_null_unicode_categories.append(category_name) 

print("Bucketized unicode categories used as features: " + repr(non_null_unicode_categories))

## Defining input/output features from the datasets

In [None]:
def split_inputs_and_outputs(pd_dat):


    labels = pd_dat['handle_index'].values

    icolumns = pd_dat.columns

    timestamps = pd_dat.loc[:, "timestamp"].astype(np.float32)
    text = pd_dat.loc[:, "text"]
    sequence = np.asarray([np.array(el) for el in pd_dat.loc[:, "sequence"]])
    #unicode_char_ratios = pd_dat[unicode_data_categories].astype(np.float32)
    unicode_char_categories = {
        category_name: pd_dat[category_name] for category_name in non_null_unicode_categories
    }
    
    words_in_tweet = pd_dat['words_in_tweet']

    return timestamps, text, sequence, unicode_char_categories, words_in_tweet, labels

timestamps_train, text_train, sequence_train, unicode_char_categories_train, words_in_tweet_train, labels_train = split_inputs_and_outputs(train_dataset)
timestamps_val, text_val, sequence_val, unicode_char_categories_val, words_in_tweet_val, labels_val = split_inputs_and_outputs(val_dataset)
timestamps_test, text_test, sequence_test, unicode_char_categories_test, words_in_tweet_test, labels_test = split_inputs_and_outputs(test_dataset)

## Input tensors

In [None]:
input_timestamp = Input(shape=(1, ), name='input_timestamp', dtype=tf.float32)
input_text = Input(shape=(1, ), name='input_text', dtype=tf.string)
input_sequence = Input(shape=(None, 1 ), name="input_sequence", dtype=tf.float32)
input_unicode_char_categories = [
    Input(shape=(1, ), name="input_"+category_name, dtype=tf.float32) for category_name in non_null_unicode_categories
]
input_words_in_tweet = Input(shape=(1, ), name="input_words_in_tweet", dtype=tf.float32)

inputs_train = {
    'input_timestamp': timestamps_train,
    "input_text": text_train,
    "input_sequence": sequence_train,
    'input_words_in_tweet': words_in_tweet_train,
}

inputs_train.update({
    'input_' + category_name: unicode_char_categories_train[category_name] for category_name in non_null_unicode_categories
})

outputs_train = labels_train

inputs_val = {
    'input_timestamp': timestamps_val,
    "input_text": text_val,
    "input_sequence": sequence_val,
    'input_words_in_tweet': words_in_tweet_val
}

inputs_val.update({
    'input_' + category_name: unicode_char_categories_val[category_name] for category_name in non_null_unicode_categories
})
          
outputs_val = labels_val

inputs_test = {
    'input_timestamp': timestamps_test,
    "input_text": text_test,
    "input_sequence": sequence_test,
    'input_words_in_tweet': words_in_tweet_test
}

inputs_test.update({
    'input_' + category_name: unicode_char_categories_test[category_name] for category_name in non_null_unicode_categories
})

outputs_test = labels_test

## TensorFlow 2 model definition

In [None]:
def get_model():

    reg = None
    activation = 'relu'

    reshaped_text = layers.Reshape(target_shape=())(input_text)
    embedded = hub_layer(reshaped_text)
    x = layers.Dense(256, activation=activation)(embedded)

    masking = layers.Masking(mask_value=pad_sequence)(input_sequence)

    lstm_layer = layers.Bidirectional(layers.LSTM(32))(masking)

    flattened_lstm_layer = layers.Flatten()(lstm_layer)

    x = layers.concatenate([
        input_timestamp, 
        flattened_lstm_layer,
        *input_unicode_char_categories,
        input_words_in_tweet,
        x
    ])

    x = layers.Dense(n_tokens // 30, activation=activation, kernel_regularizer=reg)(x)

    x = layers.Dropout(0.1)(x)

    x = layers.Dense(n_tokens // 50, activation=activation, kernel_regularizer=reg)(x)
    
    x = layers.Dropout(0.1)(x)

    x = layers.Dense(256, activation=activation, kernel_regularizer=reg)(x)

    y = layers.Dense(len(handles), activation='linear')(x)

    model = tf.keras.Model(
        inputs=[
                input_timestamp, 
                input_text, 
                input_sequence,
                *input_unicode_char_categories,
                input_words_in_tweet
        ],
        outputs=[y]
    )

    cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    model.compile(
        optimizer='adam',
        loss=cce,
        metrics=['sparse_categorical_accuracy']
    )

    return model

model = get_model()

tf.keras.utils.plot_model(model, to_file='twitstar.png', show_shapes=True)

## TensorFlow 2 model fitting

In [None]:
history = model.fit(
    inputs_train,
    outputs_train,
    epochs=15,
    batch_size=64,
    verbose=True,
    validation_data=(inputs_val, outputs_val),
    callbacks=[
        tf.keras.callbacks.ModelCheckpoint(
            os.path.realpath("weights.h5"),
            monitor="val_sparse_categorical_accuracy", 
            save_best_only=True,
            verbose=2
        ),
        tf.keras.callbacks.EarlyStopping(
            patience=3,
            monitor="val_sparse_categorical_accuracy"
        ),
    ]
)

## TensorFlow 2 model plots for train loss and accuracy


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()


plt.plot(history.history['sparse_categorical_accuracy'])
plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.title('Accuracy vs. epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()

## TensorFlow 2 model evaluation

In [None]:
#loading the "best" weights
model.load_weights(os.path.realpath("weights.h5"))

model.evaluate(inputs_test, outputs_test)

### TensorFlow 2 model confusion matrix

A confusion matrix is computed from the predictions on the test set.

In [None]:
def tf2_confusion_matrix(inputs, outputs):
    predictions = model.predict(inputs)
    wrong_labelled_counter = np.zeros((len(handles), len(handles)))

    # np.object was removed in NumPy 1.24; the builtin object is equivalent
    wrong_labelled_sequences = np.empty((len(handles), len(handles)), object)

    for i in range(len(handles)):
        for j in range(len(handles)):
            wrong_labelled_sequences[i][j] = []

    tot_wrong = 0

    for i in range(len(predictions)):
        predicted = int(predictions[i].argmax())
        
        true_value = int(outputs[i])

        wrong_labelled_counter[true_value][predicted] += 1
        wrong_labelled_sequences[true_value][predicted].append(inputs.get('input_text')[i])
        
        ok = (int(true_value) == int(predicted))
        if(not ok):
            tot_wrong += 1

    return wrong_labelled_counter, wrong_labelled_sequences, predictions

def print_confusion_matrix(wrong_labelled_counter):
    the_str = "\t"
    for handle in handles:
        the_str += handle + "\t"
    print(the_str)

    ctr = 0
    for row in wrong_labelled_counter:
        the_str = handles[ctr] + '\t'
        ctr+=1
        for i in range(len(row)):
            the_str += str(int(row[i]))
            if(i != len(row) -1):
                the_str += "\t"
        print(the_str)

wrong_labelled_counter, wrong_labelled_sequences, predictions = tf2_confusion_matrix(inputs_test, outputs_test)


print_confusion_matrix(wrong_labelled_counter)

# CatBoost model

This CatBoost model instance was developed reusing the ideas presented in these tutorials from the official repository: [classification](https://github.com/catboost/tutorials/blob/master/classification/classification_tutorial.ipynb) and [text features](https://github.com/catboost/tutorials/blob/master/text_features/text_features_in_catboost.ipynb)

## Imports for the CatBoost model

In [None]:
import functools
import os

import pandas as pd
import numpy as np

import copy

import calendar
import datetime
import re

import unicodedata
from catboost import Pool, CatBoostClassifier

## Definitions for the CatBoost model

In [None]:

unicode_data_categories = [
    "Cc",
    "Cf",
    "Cn",
    "Co",
    "Cs",
    "LC",
    "Ll",
    "Lm",
    "Lo",
    "Lt",
    "Lu",
    "Mc",
    "Me",
    "Mn",
    "Nd",
    "Nl",
    "No",
    "Pc",
    "Pd",
    "Pe",
    "Pf",
    "Pi",
    "Po",
    "Ps",
    "Sc",
    "Sk",
    "Sm",
    "So",
    "Zl",
    "Zp",
    "Zs"
]



column_names = [
    "handle",
    "timestamp",
    "text"
]

column_names.extend(unicode_data_categories)

#List of handles (labels)
#Fill with the handles you want to consider in your dataset
handles = [

]

train_file = os.path.realpath("./data/train.csv")
val_file = os.path.realpath("./data/val.csv")
test_file = os.path.realpath("./data/test.csv")


## Preprocessing and computing dataset features

In [None]:
def get_pandas_dataset(input_file, timestamp_mean=None, timestamp_std=None):

    pd_dat = pd.read_csv(input_file, names=column_names)

    pd_dat = pd_dat[pd_dat.handle.isin(handles)]

    if(timestamp_mean is None):
         timestamp_mean = pd_dat.timestamp.mean()
    
    if(timestamp_std is None):
        timestamp_std = pd_dat.timestamp.std()

    pd_dat.timestamp = (pd_dat.timestamp - timestamp_mean) / timestamp_std

    pd_dat["handle_index"] = pd_dat['handle'].map(lambda x: handles.index(x))

    pd_dat = pd_dat.reset_index(drop=True)

    return pd_dat, timestamp_mean, timestamp_std

train_dataset, timestamp_mean, timestamp_std = get_pandas_dataset(train_file)

test_dataset, _, _ = get_pandas_dataset(test_file, timestamp_mean=timestamp_mean, timestamp_std=timestamp_std)

val_dataset, _, _ = get_pandas_dataset(val_file, timestamp_mean=timestamp_mean, timestamp_std=timestamp_std)

def split_inputs_and_outputs(pd_dat):

    labels = pd_dat['handle_index'].values

    del(pd_dat['handle'])
    del(pd_dat['handle_index'])

    return pd_dat, labels

X_train, labels_train = split_inputs_and_outputs(train_dataset)
X_val, labels_val = split_inputs_and_outputs(val_dataset)
X_test, labels_test = split_inputs_and_outputs(test_dataset)

## CatBoost model definition

In [None]:
def get_model(catboost_params=None):
    cat_features = []
    text_features = ['text']

    catboost_default_params = {
        'iterations': 1000,
        'learning_rate': 0.03,
        'eval_metric': 'Accuracy',
        'task_type': 'GPU',
        'early_stopping_rounds': 20
    }
    
    # user-supplied parameters override the defaults
    catboost_default_params.update(catboost_params or {})
    
    model = CatBoostClassifier(**catboost_default_params)

    return model, cat_features, text_features


model, cat_features, text_features = get_model()

## CatBoost model fitting

In [None]:
def fit_model(X_train, X_val, y_train, y_val, model, cat_features, text_features, verbose=100):

    learn_pool = Pool(
        X_train, 
        y_train, 
        cat_features=cat_features,
        text_features=text_features,
        feature_names=list(X_train)
    )

    val_pool = Pool(
        X_val, 
        y_val, 
        cat_features=cat_features,
        text_features=text_features,
        feature_names=list(X_val)
    )

    model.fit(learn_pool, eval_set=val_pool, verbose=verbose)

    return model

model = fit_model(X_train, X_val, labels_train, labels_val, model, cat_features, text_features)

## CatBoost model evaluation

As for the TensorFlow 2 model, a confusion matrix is computed from the predictions on the test set.

In [None]:
def predict(X, model, cat_features, text_features):
    pool = Pool(
        data=X,
        cat_features=cat_features,
        text_features=text_features,
        feature_names=list(X)
    )

    probs = model.predict_proba(pool)

    return probs

def check_predictions_on(inputs, outputs, model, cat_features, text_features, handles):
    predictions = predict(inputs, model, cat_features, text_features)
    labelled_counter = np.zeros((len(handles), len(handles)))

    # np.object was removed in NumPy 1.24; the builtin object is equivalent
    labelled_sequences = np.empty((len(handles), len(handles)), object)

    for i in range(len(handles)):
        for j in range(len(handles)):
            labelled_sequences[i][j] = []

    tot_wrong = 0

    for i in range(len(predictions)):
        predicted = int(predictions[i].argmax())
        
        true_value = int(outputs[i])

        labelled_counter[true_value][predicted] += 1
        labelled_sequences[true_value][predicted].append(inputs.get('text').values[i])
        
        ok = (int(true_value) == int(predicted))
        if(not ok):
            tot_wrong += 1

    return labelled_counter, labelled_sequences, predictions

def confusion_matrix(labelled_counter, handles):
    the_str = "\t"
    for handle in handles:
        the_str += handle + "\t"

    the_str += "\n"

    ctr = 0
    for row in labelled_counter:
        the_str += handles[ctr] + '\t'
        ctr+=1
        for i in range(len(row)):
            the_str += str(int(row[i]))
            if(i != len(row) -1):
                the_str += "\t"
        the_str += "\n"

    return the_str

labelled_counter, labelled_sequences, predictions = check_predictions_on(
    X_test,
    labels_test,
    model,
    cat_features,
    text_features,
    handles
)

confusion_matrix_string = confusion_matrix(labelled_counter, handles)

print(confusion_matrix_string)

# Evaluation

To perform some experiments and evaluate the two models, 18 Twitter users were selected and, for each user, a number of tweets and responses to other users' tweets were collected. In total, 39786 tweets were collected. The difference in class representation could be eliminated, for example by limiting the number of tweets for each label to the number of tweets in the least represented class. This difference, however, was not eliminated, in order to test whether it is an issue for the accuracy of the two trained models.
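
For reference, the balancing option mentioned above (which was not applied here) could look like the following sketch. The `downsample_to_smallest_class` helper is hypothetical and the data is a toy example.

```python
import random
from collections import defaultdict


def downsample_to_smallest_class(rows, label_of, seed=0):
    """Keep, for every label, as many random rows as the smallest class has."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    target = min(len(label_rows) for label_rows in by_label.values())
    balanced = []
    for label_rows in by_label.values():
        # Sample without replacement so no row is duplicated.
        balanced.extend(rng.sample(label_rows, target))
    return balanced


# Toy data: 10 posts for label "a" and 3 posts for label "b".
rows = [("a", i) for i in range(10)] + [("b", i) for i in range(3)]
balanced = downsample_to_smallest_class(rows, label_of=lambda row: row[0])
print(len(balanced))  # 6
```

The obvious cost is that data from the larger classes is thrown away, which is one reason to first check whether the imbalance actually hurts accuracy.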

The number of tweets for each Twitter handle in each file (train, test, validation) is reported in the following table. To avoid policy issues (better safe than sorry), the actual user handle is masked using C_x placeholders and a brief description of the Twitter user is given instead.

|Description|Handle|Train|Test|Validation|Sum|
|-------|-------|-------|-------|-------|-------|
|UK-based Labour politician|C_1|1604|492|229|2325|
|US-based Democratic politician|C_2|1414|432|195|2041|
|US-based Democratic politician|C_3|1672|498|273|2443|
|US-based actor|C_4|1798|501|247|2546|
|UK-based actress|C_5|847|243|110|1200|
|US-based Democratic politician|C_6|2152|605|304|3061|
|US-based singer|C_7|2101|622|302|3025|
|US-based singer|C_8|1742|498|240|2480|
|Civil rights activist|C_9|314|76|58|448|
|US-based Republican politician|C_10|620|159|78|857|
|US-based TV host|C_11|2022|550|259|2831|
|Parody account of C_15|C_12|2081|624|320|3025|
|US-based Democratic politician|C_13|1985|557|303|2845|
|US-based actor/director|C_14|1272|357|183|1812|
|US-based Republican politician|C_15|1121|298|134|1553|
|US-based writer|C_16|1966|502|302|2770|
|US-based writer|C_17|1095|305|153|1553|
|US-based entrepreneur|C_18|2084|581|306|2971|
|Sum||27890|7900|3996|39786|



## TensorFlow 2 model

The following charts show loss and accuracy vs epochs for train and validation for a typical run of the TF2 model:

![loss](img/tf2_train_val_loss.png)
![accuracy](img/tf2_train_val_accuracy.png)

If the images do not show correctly, they can be found at these links: [loss](https://github.com/icappello/ml-predict-text-author/blob/master/img/tf2_train_val_loss.png) [accuracy](https://github.com/icappello/ml-predict-text-author/blob/master/img/tf2_train_val_accuracy.png)

After a few epochs, the model starts overfitting on the train data, and the accuracy for the validation set quickly reaches a plateau.

The accuracy obtained on the test set is 0.672.

## CatBoost model

The fit procedure stopped after 303 iterations. The accuracy obtained on the test set is 0.808.
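
The fit stopped well before the configured 1000 iterations because of the `early_stopping_rounds` setting. The general idea of patience-based early stopping can be sketched as below; this is a generic illustration with a hypothetical helper and made-up metric values, not CatBoost's internal logic.

```python
def early_stopping_round(metric_per_round, patience):
    """Return (round where training stops, best round), both 0-indexed.

    Training stops once the validation metric has not improved
    for `patience` consecutive rounds.
    """
    best_value = float("-inf")
    best_round = -1
    for i, value in enumerate(metric_per_round):
        if value > best_value:
            best_value, best_round = value, i
        elif i - best_round >= patience:
            return i, best_round
    # The metric kept improving often enough: all rounds were used.
    return len(metric_per_round) - 1, best_round


# Made-up validation accuracies, one per round.
val_accuracy = [0.5, 0.6, 0.7, 0.69, 0.7, 0.68, 0.66]
stopped_at, best = early_stopping_round(val_accuracy, patience=3)
print(stopped_at, best)  # 5 2
```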

## Confusion matrices

The confusion matrices for the two models are reported [here](https://docs.google.com/spreadsheets/d/17JGDXYRajnC4THrBnZrbcqQbgzgjo0Jb7KAvPYenr-w/edit?usp=sharing), since large tables are not displayed correctly in the embedded GitHub viewer for Jupyter notebooks. Rows represent the actual classes, while columns represent the predicted ones.
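
To make the layout concrete, the sketch below builds a small confusion matrix with the same convention (rows are actual classes, columns are predicted classes) and derives a per-label accuracy from it. The helper names and the toy labels are made up for illustration.

```python
def count_confusions(true_labels, predicted_labels, n_classes):
    """matrix[actual][predicted] = number of examples with that combination."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(true_labels, predicted_labels):
        matrix[actual][predicted] += 1
    return matrix


def per_label_accuracy(matrix):
    """Fraction of each actual class (row) that was predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


true_labels = [0, 0, 1, 1, 1, 2]
predicted_labels = [0, 1, 1, 1, 0, 2]

matrix = count_confusions(true_labels, predicted_labels, n_classes=3)
for row in matrix:
    print(row)
# [1, 1, 0]
# [1, 2, 0]
# [0, 0, 1]
```

The diagonal holds the correct predictions, so the overall accuracy is the sum of the diagonal divided by the total number of examples.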

## Summary

The CatBoost model obtained a better overall accuracy, as well as a better accuracy on all but one label. No particular optimization was done on the definition of the CatBoost model. The TF2 model may need more data, as well as some changes to its definition, to perform better (comments and pointers on this are welcome). Some variants of the TF2 model were tried: a deeper model with more dense layers, a higher dropout rate, more or fewer units per layer, only a subset of the features, regularization methods (L1, L2, batch normalization) and different activation functions (sigmoid, tanh), but none performed significantly better than the one presented.

Looking at the results summarized in the confusion matrices, tweets from C_9 were clearly a problem, either because the class is under-represented relative to the others or because of the actual content of the tweets (some were not written in English). Tweets from handles C_5 and C_14 were also hard to classify correctly for both models, even though they were not under-represented with respect to the other labels.