# Text classification

The task concentrates on content-based text classification.


## Tasks

1. Get acquainted with the data of the [Polish Cyberbullying detection dataset](http://2019.poleval.pl/index.php/tasks/task6).
   Pay special attention to the distribution of the positive and negative examples in the first task as well as
   the distribution of the classes in the second task.
2. Train the following classifiers on the training sets (for task 1 and task 2):

   i. Bayesian classifier with TF * IDF weighting.

   ii. Fasttext text classifier.

```python
import os
import subprocess

import fasttext


def read_file_as_list(filename, directory):
    """Return the lines of directory/filename without trailing newlines."""
    # Assumption: the PolEval files are UTF-8 encoded.
    with open(os.path.join(directory, filename), encoding='utf-8') as file:
        return [line.rstrip('\n') for line in file]


def combine_tags_and_texts(directory):
    """Write all.txt in the fastText format: '__label__<tag> <text>' per line."""
    tags = read_file_as_list('training_set_clean_only_tags.txt', directory)
    texts = read_file_as_list('training_set_clean_only_text.txt', directory)

    with open(os.path.join(directory, 'all.txt'), 'w', encoding='utf-8') as file:
        for tag, text in zip(tags, texts):
            file.write(f'__label__{tag} {text}\n')


def fasttext_train_predict(directory):
    """Train fastText on the training set and write one predicted label per test line."""
    combine_tags_and_texts(directory)
    test_data = read_file_as_list('test_set_clean_only_text.txt', directory)
    model = fasttext.train_supervised(input=os.path.join(directory, 'all.txt'))
    with open(os.path.join(directory, 'FastText_results.txt'), 'w') as file:
        for text in test_data:
            # predict() returns (labels, probabilities); keep only the top label.
            label = model.predict(text)[0][0].replace('__label__', '')
            file.write(label + '\n')


def evaluate(directory, evaluator):
    """Run the Perl evaluation script on the predictions and print its report."""
    results_path = os.path.join(directory, 'FastText_results.txt')
    output_path = os.path.join(directory, 'FastText_output.txt')
    with open(output_path, 'w') as file:
        subprocess.run(['perl', os.path.join(directory, evaluator), results_path], stdout=file)
    with open(output_path) as file:
        for line in file:
            print(line.rstrip('\n'))


print('Fasttext task1:')
directory = 'task_6-1/'
fasttext_train_predict(directory)
evaluate(directory, 'evaluate1.pl')

print('Fasttext task2:')
directory = 'task_6-2/'
fasttext_train_predict(directory)
evaluate(directory, 'evaluate2.pl')
```

```text
Fasttext task1:
Precision = 68.42%
Recall = 9.70%
Balanced F-score = 16.99%
Accuracy = 87.30%

Fasttext task2:
Micro-Average F-score = 86.60%
Macro-Average Precision = 0.622534269475092%
Macro-Average Recall = 0.336006525838507%
Macro-Average F-score = 43.64%
precision0 is 0.867602808425276
recall0 is 0.998845265588915
precision1 is 0
recall1 is 0
precision2 is 1
recall2 is 0.00917431192660551
```


   iii. Transformer classifier (take into account that a number of experiments should be performed for this model).

3. Compare the results of classification on the test set. Select the appropriate measures (from accuracy, F1,
   macro/micro F1, MCC) to compare the results.


4. Select 1 TP, 1 TN, 1 FP and 1 FN from your predictions (for the best classifier) and compare the decisions of each
   classifier on these examples using [SHAP](https://github.com/slundberg/shap).

5. Answer the following questions:
   1. Which of the classifiers works best for task 1 and task 2?
   1. Did you achieve results comparable with the results of [PolEval Task](http://2019.poleval.pl/index.php/results/)?
   1. Did you achieve results comparable with the [Klej leaderboard](https://klejbenchmark.com/leaderboard/)?
   1. Describe strengths and weaknesses of each of the compared algorithms.
   1. Do you think comparison of raw performance values on a single task is enough to assess the value of a given
      algorithm/model?
   1. Did SHAP show that the models use valuable features/words when making their decisions?
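
For the first task (inspecting the class distribution), a short sketch like the one below may help. It assumes labels are available as a list of strings, for example one label per line read from `training_set_clean_only_tags.txt`. The function names and the toy labels are hypothetical. The majority-class baseline is included because, on imbalanced data, it shows how little a high accuracy can mean on its own.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels for illustration only; in practice read one label per line
# from the tags file in task_6-1/ or task_6-2/.
toy_labels = ["0"] * 9 + ["1"]
for label, (count, share) in label_distribution(toy_labels).items():
    print(f"{label}: {count} ({share:.1%})")
print(f"majority baseline accuracy: {majority_baseline_accuracy(toy_labels):.1%}")
```

Comparing a classifier's accuracy against this baseline is one way to approach the question of whether raw performance values are enough to assess a model.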

## Hints

1. You can use [Google colab](https://colab.research.google.com/notebooks/intro.ipynb) to perform experiments which
   require access to GPU or TPU.
1. [Fasttext](https://fasttext.cc/docs/en/supervised-tutorial.html) is a popular baseline classifier. Don't report the Precision/Recall/F1 provided by
   Fasttext since they might be [wrong](https://github.com/facebookresearch/fastText/issues/261).
1. [Hugging Face Transformers](https://github.com/huggingface/transformers) is a popular library for performing NLP tasks based on the transformer
   architecture.
1. [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Jurafsky and Martin
   has a [chapter](https://web.stanford.edu/~jurafsky/slp3/4.pdf) devoted to the problem of classification.
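
Following the hint about not relying on the scores reported by Fasttext, the measures listed in task 3 can be computed directly from gold and predicted labels. The sketch below is one possible way to do this in plain Python. The function names and toy labels are assumptions for illustration. Note that macro F1 is computed here as the mean of per-class F1 scores. The macro F-score printed by the evaluation script for task 2 (43.64%) appears instead to be the harmonic mean of the macro-averaged precision and recall, so the two numbers are not directly comparable.

```python
import math


def binary_metrics(gold, pred, positive="1"):
    """Precision, recall, F1, accuracy and MCC with one class treated as positive."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    # MCC is defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": (tp + tn) / len(pairs), "mcc": mcc,
    }


def macro_micro_f1(gold, pred):
    """Macro F1 (mean of per-class F1) and micro F1 (from pooled counts)."""
    classes = sorted(set(gold) | set(pred))
    per_class = [binary_metrics(gold, pred, positive=c) for c in classes]
    macro = sum(m["f1"] for m in per_class) / len(classes)
    tp = sum(m["tp"] for m in per_class)
    fp = sum(m["fp"] for m in per_class)
    fn = sum(m["fn"] for m in per_class)
    micro = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return macro, micro


# Toy labels for illustration only (hypothetical, not taken from the dataset).
binary_gold = ["1", "1", "0", "0", "0", "0"]
binary_pred = ["1", "0", "1", "0", "0", "0"]
print(binary_metrics(binary_gold, binary_pred))

multi_gold = ["0", "0", "0", "1", "1", "2"]
multi_pred = ["0", "0", "1", "1", "0", "0"]
print(macro_micro_f1(multi_gold, multi_pred))
```

For single-label classification, micro F1 computed this way equals accuracy, which is why macro F1 and MCC tend to be more informative on imbalanced data such as this.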