The purpose of this notebook is to load all the loose .txt files into a single DataFrame and save it to a CSV file ("../data/newdata.csv").

This notebook:
1. Loads all the texts (.txt) into a pandas DataFrame.
2. Loads all the corresponding gold labels into the DataFrame.
3. Encodes those labels and saves the encodings.
4. Translates all non-English texts into English.

Now that's all clear, let's import what we will need in the notebook.

In [85]:
import os
import pandas as pd
import numpy as np
import nltk
import pickle

from deep_translator import GoogleTranslator
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.tokenize import sent_tokenize

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jochem/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Now that we have all the relevant packages, let's start off with point 1: loading all .txt files into a DataFrame.

So, let's define a function that loops through a directory, reads the text files, and puts their contents into a DataFrame.

Next, we load the data from all the directories.

In [86]:
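def load_data(directory: str) -> pd.DataFrame:
    """Read every file in `directory` into a DataFrame with id, text and an English flag."""
    data = []

    for file in os.listdir(directory):
        file_path = os.path.join(directory, file)

        # Files of the English subset are prefixed with "EN"
        english = 1 if file.startswith("EN") else 0

        try:
            # The corpus contains Bulgarian, Hindi and Portuguese texts; the files are
            # assumed to be UTF-8, so we do not rely on the platform default encoding
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
            data.append({"id": file, "text": text, "english": english})
        except Exception as e:
            # Report which file failed and why, then continue with the next file
            print(f"Error while reading {file}: {e}")

    print("Data length: ", len(data))
    return pd.DataFrame(data)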
        


In [87]:
df_text_en = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/EN/raw-documents")
df_text_bg = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/BG/raw-documents")
df_text_hi = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/HI/raw-documents")
df_text_pt = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/PT/raw-documents")
df_text = pd.concat([df_text_en, df_text_bg, df_text_hi, df_text_pt])
print(len(df_text))

Data length:  200
Data length:  211
Data length:  115
Data length:  200
726
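The four near-identical calls above could also be written as a loop over language codes. The following is a minimal sketch, assuming the release keeps the `<LANG>/raw-documents` layout visible in the paths above; `load_language_dirs` and `base_dir` are hypothetical names that do not appear elsewhere in this notebook, and the demo files are created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_language_dirs(base_dir, languages):
    """Read every .txt file under <base_dir>/<LANG>/raw-documents into one DataFrame."""
    rows = []
    for lang in languages:
        doc_dir = Path(base_dir) / lang / "raw-documents"
        # sorted() makes the row order reproducible across platforms
        for path in sorted(doc_dir.glob("*.txt")):
            rows.append({
                "id": path.name,
                "text": path.read_text(encoding="utf-8"),  # UTF-8 is an assumption
                "english": int(path.name.startswith("EN")),
            })
    return pd.DataFrame(rows)


# Demo on made-up files in a temporary directory (not the real training data)
with tempfile.TemporaryDirectory() as tmp:
    for lang, name, text in [("EN", "EN_demo_1.txt", "Hello."),
                             ("BG", "BG_demo_1.txt", "Здравей.")]:
        doc_dir = Path(tmp) / lang / "raw-documents"
        doc_dir.mkdir(parents=True)
        (doc_dir / name).write_text(text, encoding="utf-8")
    demo_df = load_language_dirs(tmp, ["EN", "BG"])

print(demo_df)
```

Because all rows are collected in one list, the resulting DataFrame has a single continuous index, which is not the case after a plain `pd.concat` of separate frames.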


We still need the correct labels for these files though. 

To get these we: 
1. Define a function that goes through the subtask 2 annotation file and loads all the appropriate labels into a DataFrame.
2. Load the labels for all languages.
3. Encode the labels with a MultiLabelBinarizer.
4. Save these in the DataFrame as well.
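To make step 1 concrete, here is a small sketch of how a single annotation line might be split. It assumes each line holds three tab-separated fields (file id, dominant narratives, sub-narratives) with `;` between multiple labels, which is what the parsing code in this notebook implies. The helper name `parse_annotation_line` and the sample line, including its label names, are made up for illustration.

```python
def parse_annotation_line(line):
    """Split one tab-separated annotation line into id, dominant and sub-narratives."""
    # Raises ValueError if the line has fewer than three fields
    text_id, dom_field, sub_field = line.rstrip("\n").split("\t")[:3]
    return {
        "id": text_id.strip(),
        "dom_narr": [narr.strip() for narr in dom_field.split(";")],
        "sub_narr": [narr.strip() for narr in sub_field.split(";")],
    }


# Hypothetical line; the label names are invented for this example
sample = "EN_demo_1.txt\tNarrative A;Narrative B\tNarrative A: Sub 1;Narrative B: Sub 2\n"
parsed = parse_annotation_line(sample)
print(parsed)
```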

In [88]:
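def load_labels(file_path: str) -> pd.DataFrame:
    """Parse a tab-separated annotation file into a DataFrame of label lists."""
    labels = []
    # encoding="utf-8" is an assumption about the annotation files
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of failing on tags[1]
            tags = line.strip().split("\t")

            text_id = tags[0].strip()

            # Multiple narratives for one document are separated by ";"
            dom_narrs = [narr.strip() for narr in tags[1].split(";")]
            sub_narrs = [narr.strip() for narr in tags[2].split(";")]

            labels.append({"id": text_id, "dom_narr": dom_narrs, "sub_narr": sub_narrs})
    print("Labels length:", len(labels))
    return pd.DataFrame(labels)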

In [89]:
df_labels_en = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/EN/subtask-2-annotations.txt")
df_labels_bg = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/BG/subtask-2-annotations.txt")
df_labels_hi = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/HI/subtask-2-annotations.txt")
df_labels_pt = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/PT/subtask-2-annotations.txt")
df_labels = pd.concat([df_labels_en, df_labels_bg, df_labels_hi, df_labels_pt])
print(len(df_labels))

Labels length: 200
Labels length: 211
Labels length: 115
Labels length: 200
726


Now for the encoding:
1. Again, we define a function to do this for us.
2. Encode the labels and save them in the DF.
3. Save the MultiLabelBinarizer objects in pickle (.pkl) files, so we can load them later and convert predictions back into the full class names.
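The reason for pickling the binarizers can be shown with a small round trip. This is a sketch on made-up labels, not the notebook's data: it fits a `MultiLabelBinarizer`, serializes it in memory (the notebook writes to .pkl files instead), and uses the restored object to turn a 0/1 prediction matrix back into label names.

```python
import pickle

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists standing in for the "dom_narr" column
labels = [["A", "B"], ["B"], ["C"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)  # one 0/1 column per class, in sorted class order

# Round trip through pickle, in memory rather than via a file
restored = pickle.loads(pickle.dumps(mlb))

# A 0/1 prediction matrix whose columns follow restored.classes_
predictions = np.array([[1, 0, 1], [0, 1, 0]])
decoded = restored.inverse_transform(predictions)

print(list(restored.classes_))
print(decoded)
```

Without the saved binarizer, the column order of the encoding would have to be reconstructed by hand before predictions could be mapped back to narratives.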

In [90]:
# Label preprocessing: binary (multi-hot) encodes the dominant- and sub-narrative labels
# and, if save is True, pickles the fitted binarizers for later decoding.
def encode_labels(df, save: bool = True):
    dom_mlb = MultiLabelBinarizer()
    sub_mlb = MultiLabelBinarizer()

    # index=df.index keeps the encoded rows aligned with df during the concat,
    # even when df does not have a default 0..n-1 index
    dom_narr_enc = dom_mlb.fit_transform(df["dom_narr"])
    df = pd.concat([df, pd.DataFrame(dom_narr_enc, columns=dom_mlb.classes_, index=df.index)], axis=1)

    sub_narr_enc = sub_mlb.fit_transform(df["sub_narr"])
    df = pd.concat([df, pd.DataFrame(sub_narr_enc, columns=sub_mlb.classes_, index=df.index)], axis=1)

    if save:
        with open("../pkl_files/dom_mlb.pkl", "wb") as f:
            pickle.dump(dom_mlb, f)

        with open("../pkl_files/sub_mlb.pkl", "wb") as f:
            pickle.dump(sub_mlb, f)

    return df

In [91]:
df = encode_labels(pd.merge(df_text, df_labels, on="id"))
df.to_clipboard()
print(len(df))

726
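The row count of 726 matches the number of texts and labels, which suggests every text found its labels. A more explicit check is possible with the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses two tiny made-up frames rather than `df_text` and `df_labels`, and an outer join so that unmatched ids stay visible.

```python
import pandas as pd

# Made-up frames: "c.txt" has no labels and "d.txt" has no text
texts = pd.DataFrame({"id": ["a.txt", "b.txt", "c.txt"], "text": ["x", "y", "z"]})
labels = pd.DataFrame({"id": ["a.txt", "b.txt", "d.txt"], "dom_narr": [["A"], ["B"], ["C"]]})

# validate raises MergeError if an id occurs more than once on either side;
# indicator adds a "_merge" column telling where each row came from
merged = pd.merge(texts, labels, on="id", how="outer",
                  validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] != "both", "id"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the default inner join did not silently drop any document.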


Next, we would like to translate the non-English texts into English. So, we:

1. Define a function to translate a list of strings into English.
2. Apply this function to all non-English texts.
3. Before translating, split long texts into chunks of whole sentences, using nltk's sent_tokenize to find the sentence boundaries.

In [97]:
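# GoogleTranslator defaults: source language "auto", target language "en"
translator = GoogleTranslator()

def translate_text(text: list[str]) -> str:
    """Translate a list of text chunks and join the results into one string."""
    # Returned as-is (empty string) if the request fails, so the return type
    # is always str
    translation = ""

    try:
        translated_chunks = translator.translate_batch(text)
        # Drop chunks for which no translation came back before joining
        translation = " ".join(t for t in translated_chunks if t is not None)

    except Exception as error:
        print("translation failed")
        print(error)

    return translation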

In [98]:
# Translating every sentence individually was far too slow,
# so instead we pack whole sentences into chunks of a manageable length.
def chunk_text(text: str, chunk_size: int = 4000) -> list[str]:
    if len(text) < chunk_size:
        return [text]

    chunks = []
    current_chunk = []
    current_chunk_length = 0

    for sent in sent_tokenize(text):
        sent_length = len(sent)

        # Start a new chunk once adding this sentence would reach the limit.
        # Checking current_chunk avoids emitting an empty chunk when the very
        # first sentence is already longer than chunk_size; such a sentence
        # still ends up as its own oversized chunk.
        if current_chunk and (sent_length + current_chunk_length) >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent]
            current_chunk_length = sent_length + 1  # +1 for the joining space

        else:
            current_chunk.append(sent)
            current_chunk_length += sent_length + 1

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks



In [99]:
chunked_text = df.loc[(df["english"] == 0), "text"].apply(lambda x: chunk_text(x))
chunked_text.to_clipboard()

In [100]:
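print(len(chunked_text))
# Count the texts that were split into more than one chunk
# (iterating over the values avoids hard-coding the index range 200..725)
amount = sum(len(chunks) > 1 for chunks in chunked_text)
print(amount)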

526
44


In [101]:
chunked_text = chunked_text.apply(lambda x: translate_text(x))

translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
translation failed
Request exception can happen due to an api connection error. Please check your connection and try again


KeyboardInterrupt: 

Be aware: the following cell takes a long time to complete!

It applies the translation to all the non-English texts.
Expect to wait at least 10 minutes for the result.
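Since a long run like this can hit connection errors part-way through, one option is to retry a failed request a few times before giving up. The following is a generic sketch and not part of deep_translator: `with_retries` is a hypothetical helper, and `flaky_translate` is a stand-in that simulates a network call failing twice, so the example runs without any network access.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap func so a failed call is retried with exponentially growing pauses."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except Exception as error:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the error to the caller
                print(f"attempt {attempt + 1} failed ({error}); retrying")
                sleep(base_delay * 2 ** attempt)
    return wrapper


# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"count": 0}


def flaky_translate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection error")
    return text.upper()


# Record the requested pauses instead of actually sleeping in this demo
delays = []
safe_translate = with_retries(flaky_translate, attempts=3, base_delay=0.5, sleep=delays.append)
result = safe_translate("hallo")
print(result, delays)
```

Note that `translate_text` catches its own exceptions, so a wrapper like this would have to go around the `translator.translate_batch` call inside it rather than around `translate_text` itself.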

In [None]:
df.loc[(df["english"] == 0), "text"] = df.loc[(df["english"] == 0), "text"].apply(lambda x: translate_text(chunk_text(x)))

['Арестович: Налице са всички предпоставки за срив на фронта\n\nИма предпоставки за пълен крах на украинския фронт и ще има още пробиви на руските въоръжени сили, тъй като войските се попълват с войници, чиято възраст вече е над 55 години. Това заяви във видеоблога на Alpha Олексий Арестович, бивш съветник в канцеларията на президента на Украйна, който избяга на Запад след уволнението на главнокомандващия ВСУ ген. Валерий Залужни. „Какво ще стане, ако фронтът продължи да се срива, за което има всички условия, всички предпоставки? Може да има още и още пробиви като този при Торецк, защото войските са изтощени, липсват попълнения, качеството на мобилизираните е под въпрос", обясни той. "Знам за случай, когато на един командир физически му свършиха бойците, той поиска попълнения, защото заемаше много трудна позиция. Дадено му е подкрепление - 20 души, най-младият от които е на 55 г. Половината от тях загинаха още на първия ден и знаете ли защо? Когато FPV лети към тях, те дори не могат да

In [82]:
print(len(df))
df.to_clipboard()

726


The last thing we want to do is save our full dataset to a CSV file.

In [None]:
df.to_csv("../data/newdata.csv")
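One thing to keep in mind when reading the CSV back: list-valued columns such as `dom_narr` and `sub_narr` are written as their string representation, so they come back as plain strings. A minimal sketch of restoring them with `ast.literal_eval`, using a made-up one-row frame and an in-memory buffer instead of the real file:

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column, like "dom_narr" in this notebook
demo = pd.DataFrame({"id": ["a.txt"], "dom_narr": [["A", "B"]]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain read: the list comes back as the string "['A', 'B']"
buffer.seek(0)
raw = pd.read_csv(buffer)
raw_type = type(raw.loc[0, "dom_narr"]).__name__

# With a converter, the string is parsed back into a Python list
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"dom_narr": ast.literal_eval})

print(raw_type, parsed.loc[0, "dom_narr"])
```

The demo passes `index=False`; the cell above does not, so the saved file also contains the DataFrame index as an extra unnamed first column.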