This notebook loads all the loose .txt files into a single DataFrame, which is saved in the "data.csv" file.

This notebook:
1. Loads all the texts (.txt) into a pandas DataFrame.
2. Loads all the corresponding gold labels into the DataFrame.
3. Encodes those labels and saves the encodings.
4. Translates all non-English texts into English.

With that clear, let's import what we need in this notebook.

In [3]:
import os
import pandas as pd
import numpy as np
import nltk
import pickle

from deep_translator import GoogleTranslator, single_detection
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import sent_tokenize
from langdetect import detect

import sys
sys.path.append("../src")
import utils


nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jochem/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Now that we have all the relevant packages, let's start off with point 1: loading all .txt files into a DataFrame.

First, we define a function that loops through a directory, reads the text files, and puts them into a DataFrame.

Next, we load the data from all the directories.

In [11]:
def load_data(directory:str):
    data = []
    print("current directory: ", directory)
    # The language code is the name of the folder that contains raw-documents.
    language = directory.split("/")[-2]
    
    for file in os.listdir(directory):
        file_path = os.path.join(directory,file)

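        try:
            # assumes the documents are UTF-8 encoded
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                data.append({"id":file, "text": text, "language": language})

        except Exception as e:
            # report the actual error instead of silently discarding it
            print(f"Error while reading {file}: {e}")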
    
    print("Data length: ", len(data))
    print("Language: ", language)
    print()
    df_text = pd.DataFrame(data)
    return df_text
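
The language code used by `load_data` is taken from the folder that contains `raw-documents`. A minimal sketch of that step with `pathlib`, which also tolerates a trailing slash. The helper name `language_from_directory` and the example path are hypothetical and only serve as illustration.

```python
from pathlib import Path

def language_from_directory(directory: str) -> str:
    """Return the name of the folder that holds the raw-documents folder."""
    # Hypothetical helper: mirrors directory.split("/")[-2] in load_data,
    # but still works when the path ends with a slash.
    return Path(directory).parent.name

# Example path is illustrative only, not the real data location.
example = "/data/train/RU/raw-documents"
print(language_from_directory(example))        # RU
print(language_from_directory(example + "/"))  # RU
```

With a trailing slash, `directory.split("/")[-2]` would return `raw-documents` instead of the language code, so the paths passed to `load_data` should not end in a slash.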
        


In [12]:
text_dfs = []
train_folder = "/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release"
# lang_dir instead of dir, so the built-in dir() is not shadowed
for lang_dir in os.listdir(train_folder):
    path = os.path.join(train_folder, lang_dir, "raw-documents")
    print(path)
    text_dfs.append(load_data(path))
df_text = pd.concat(text_dfs)
print("Total data length: ", len(df_text))

/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/RU/raw-documents
current directory:  /Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/RU/raw-documents
Data length:  133
Language:  RU

/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/PT/raw-documents
current directory:  /Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/PT/raw-documents
Data length:  400
Language:  PT

/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/BG/raw-documents
current directory:  /Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/BG/raw-documents
Data length:  401
Language:  BG

/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/HI/raw-documents
current directory:  /Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/

We still need the gold labels for these files.

To get them, we:
1. Define a function that reads the subtask 2 annotation file and loads the labels into a DataFrame.
2. Load the labels for every language.
3. Encode the labels with a MultiLabelBinarizer.
4. Store the encodings in the DataFrame as well.
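
As a rough illustration of step 1, the sketch below parses a single annotation line. It assumes the layout that the loading code in this notebook relies on: three tab-separated fields (file id, dominant narratives, sub-narratives), with multiple narratives separated by `;`. The function name `parse_annotation_line`, the file name, and the narrative names are invented for the example.

```python
def parse_annotation_line(line: str) -> dict:
    """Split one tab-separated annotation line into an id and two label lists."""
    fields = line.strip().split("\t")
    return {
        "id": fields[0].strip(),
        "dom_narr": [n.strip() for n in fields[1].split(";")],
        "sub_narr": [n.strip() for n in fields[2].split(";")],
    }

# The file name and narrative names below are invented for illustration.
sample = "doc_001.txt\tNarrative A;Narrative B\tNarrative A: Sub 1;Narrative B: Sub 2\n"
print(parse_annotation_line(sample))
```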

In [13]:
def load_labels(file_path:str):
    labels = []
    # assumes the annotation files are UTF-8 encoded
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            tags = line.strip().split("\t")
            
            text_id = tags[0].strip()

            dom_narrs = [narr.strip() for narr in tags[1].split(";")]
            sub_narrs = [narr.strip() for narr in tags[2].split(";")]

            labels.append({"id":text_id,"dom_narr": dom_narrs, "sub_narr": sub_narrs})
    print("Labels length:", len(labels))
    df_labels = pd.DataFrame(labels)
    return df_labels

In [14]:
label_dfs = []
# train_folder was defined in the cell that loaded the texts
for lang_dir in os.listdir(train_folder):
    path = os.path.join(train_folder, lang_dir, "subtask-2-annotations.txt")
    print(path)
    label_dfs.append(load_labels(path))

df_labels = pd.concat(label_dfs)
print(len(df_labels))

/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/RU/subtask-2-annotations.txt
Labels length: 133
/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/PT/subtask-2-annotations.txt
Labels length: 400
/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/BG/subtask-2-annotations.txt
Labels length: 401
/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/HI/subtask-2-annotations.txt
Labels length: 366
/Users/jochem/Documents/School/Uni KI jaar 4/Scriptie/Train Data/target_4_December_release/EN/subtask-2-annotations.txt
Labels length: 399
1699


Now for the encoding:
1. Again, we define a function to do this for us.
2. Encode the labels and save them in the DataFrame.
3. Save the MultiLabelBinarizer objects in pickle (.pkl) files, so we can load them later and convert predictions back into class names.
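
To make step 3 concrete, here is a small sketch of the round trip: fit a `MultiLabelBinarizer`, pickle it, and use the restored object to turn a binary row back into class names. The labels are toy values, and the pickle is kept in memory here rather than written to disk.

```python
import pickle
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels, invented for illustration.
labels = [["A", "B"], ["B"], ["C"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)   # one row per document, one column per class
print(mlb.classes_)                   # ['A' 'B' 'C']
print(encoded)

# Round trip through pickle (in memory here; the notebook writes to disk).
restored = pickle.loads(pickle.dumps(mlb))

# Map a binary row back to the class names it stands for.
decoded = restored.inverse_transform(encoded[:1])
print(decoded)                        # [('A', 'B')]
```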

In [15]:
# Label preprocessing:
# multi-hot encodes the dominant- and sub-narrative labels
# and, when save=True, pickles both fitted MultiLabelBinarizer objects.
def encode_labels(df, save:bool=True):
    dom_mlb = MultiLabelBinarizer()
    sub_mlb = MultiLabelBinarizer()

    dom_narr_enc = dom_mlb.fit_transform(df["dom_narr"])
    # index=df.index keeps the encoded rows aligned with df during the concat
    df = pd.concat([df, pd.DataFrame(dom_narr_enc, columns=dom_mlb.classes_, index=df.index)], axis=1)

    sub_narr_enc = sub_mlb.fit_transform(df["sub_narr"])
    df = pd.concat([df, pd.DataFrame(sub_narr_enc, columns=sub_mlb.classes_, index=df.index)], axis=1)
    
    if save:
        with open("../pkl_files/dom_mlb.pkl", "wb") as f:
            pickle.dump(dom_mlb,f)
        
        with open("../pkl_files/sub_mlb.pkl", "wb") as f:
            pickle.dump(sub_mlb,f)

    return df

In [16]:
df = encode_labels(pd.merge(df_text, df_labels, on="id"))
df.to_clipboard()
print(len(df))

1699
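
`pd.merge` performs an inner join by default, so a text without labels (or a label without text) would be dropped without any message. A small sketch of one way to check for that, using toy frames invented for illustration:

```python
import pandas as pd

# Toy frames, invented for illustration.
texts = pd.DataFrame({"id": ["a.txt", "b.txt", "c.txt"], "text": ["x", "y", "z"]})
labels = pd.DataFrame({"id": ["a.txt", "b.txt", "d.txt"], "dom_narr": [["A"], ["B"], ["C"]]})

# An outer merge with indicator=True records where each row came from.
check = pd.merge(texts, labels, on="id", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```

In this notebook the merged length (1699) equals the number of label rows, which suggests nothing was lost. Passing `validate="one_to_one"` to `pd.merge` would additionally raise an error if the same file name occurred more than once on either side.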


Next, we translate the non-English texts into English:

1. Define a function that translates a list of strings into English.
2. Apply this function to all non-English texts.
3. While applying the translation function, we call nltk's sent_tokenize to split each text into a list of sentence strings.

In [17]:
# GoogleTranslator defaults: source="auto", target="en"
translator = GoogleTranslator()
def translate_text(text:list[str]):
    # Returns the joined translation as one string, or [] if the translation failed.
    translation = []
    
    try:
        translation = [translator.translate(t) for t in text]
        translation = [t for t in translation if t is not None]
        translation = " ".join(translation)

    except Exception as error:
        print("Translation failed")
        print(error)
        
    return translation

In [18]:
# Translating each sentence individually with sent_tokenize was very slow,
# so we chunk the text into pieces of a manageable length instead.
def chunk_text(text:str, chunk_size=1500) -> list[str]:
    if len(text) < chunk_size:
        return [text]
    
    chunks = []
    current_chunk = []
    current_chunk_length = 0

    sentence_tokens = sent_tokenize(text)
    for sent in sentence_tokens:
        sent_length = len(sent)

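        # only flush a non-empty chunk, so a very long first sentence does not produce an empty string
        if current_chunk and (sent_length + current_chunk_length) >= chunk_size: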
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent]
            current_chunk_length = sent_length + 1
        
        else:
            current_chunk.append(sent)
            current_chunk_length += sent_length + 1
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    
    return chunks
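
The greedy packing idea can be tried out without nltk. The sketch below works on a list of sentences that has already been split, so `pack_sentences` is a hypothetical stand-in for the packing loop only, not a replacement for `chunk_text`.

```python
def pack_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Greedily join sentences into chunks of roughly chunk_size characters."""
    chunks, current, current_length = [], [], 0
    for sent in sentences:
        # Start a new chunk when adding this sentence would reach the limit.
        if current and current_length + len(sent) >= chunk_size:
            chunks.append(" ".join(current))
            current, current_length = [], 0
        current.append(sent)
        current_length += len(sent) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["One two.", "Three four.", "Five six.", "Seven."]
print(pack_sentences(sentences, chunk_size=22))
# ['One two. Three four.', 'Five six. Seven.']
```

A single sentence that is longer than `chunk_size` still ends up in a chunk of its own, so the limit is a target rather than a hard guarantee.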



BE AWARE: THE FOLLOWING CELL TAKES A LONG TIME TO COMPLETE!

It applies the translation to all non-English texts.
Expect to wait at least 10 minutes for the result.

In [19]:
df["translated_text"] = df["text"].apply(lambda x: translate_text(chunk_text(x)) if detect(x) != "en" else x)

Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
संयुक्त रिपोर्ट: यूक्रेन युद्ध का विभिन्न क्षेत्रों/अंतरराष्ट्रीय चिंताओं पर ...

वीआईएफ यंग स्कॉलर्स फोरम ने 08 अप्रैल 2022 को अपनी साप्ताहिक बैठक में ‘विभिन्न क्षेत्रों/अंतरराष्ट्रीय सरोकारों पर यूक्रेन युद्ध के प्रभाव' की चर्चा की। इन विद्व

After the translation, we would like to know the length of the texts.

In [None]:
# failed translations are stored as [] and therefore get length 0
df["translated_text_length"] = df["translated_text"].apply(len)
df["natural_text_length"] = df["text"].apply(len)

The texts that failed translation now have a length of 0. Let's drop them from the DataFrame before we apply the train/test split.

In [None]:
df = df.drop(df[df["translated_text_length"] == 0].index)
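
A small sketch of this filtering step on toy data. It relies on the fact that `translate_text` returns an empty list when a translation fails; the example rows are invented.

```python
import pandas as pd

# Toy data: a failed translation is stored as an empty list.
df_demo = pd.DataFrame({"translated_text": ["Hello world", [], "Good morning"]})

# len() is 0 for the empty list, so failed rows are easy to filter out.
df_demo["translated_text_length"] = df_demo["translated_text"].apply(len)
df_demo = df_demo[df_demo["translated_text_length"] > 0]
print(df_demo["translated_text_length"].tolist())  # [11, 12]
```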

Now let's split the data into a train set and a test set. We use an 80/20 split and save each part to a separate file.
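
As a quick sanity check on the sizes to expect, the sketch below splits a placeholder DataFrame with the same number of rows (1699) and the same seed. Only the row count and the `train_test_split` arguments come from this notebook; the DataFrame contents are made up.

```python
import math
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 1699  # total number of documents reported in this notebook
toy = pd.DataFrame({"id": range(n_rows)})

train, test = train_test_split(toy, test_size=0.2, random_state=102493)
print(len(train), len(test))    # 1359 340
print(math.ceil(0.2 * n_rows))  # 340, the test set size is rounded up
```

Because `random_state` is fixed, rerunning the split on the same data yields the same partition.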

In [5]:
df = utils.load_data()
df_train, df_test = train_test_split(df, test_size=0.2, random_state=102493)
print("Amount of training instances: ", len(df_train))
print("Amount of test instances: ", len(df_test))
print("Total instances: ", len(df_train)+len(df_test))

Amount of training instances:  1359
Amount of test instances:  340
Total instances:  1699


In [6]:
df_train.to_pickle("../data/train_data.pkl")
df_test.to_pickle("../data/test_data.pkl")
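
The saved splits can later be read back with `pd.read_pickle`. A minimal sketch using a temporary directory and a toy DataFrame, so nothing in `../data` is touched:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy row, invented for illustration.
toy = pd.DataFrame({"id": ["a.txt"], "dom_narr": [["A", "B"]]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_data.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# List-valued cells come back as Python lists; a CSV file would store them as text.
print(type(restored.loc[0, "dom_narr"]))  # <class 'list'>
```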