The purpose of this notebook is to load all the loose .txt files into a single DataFrame, which is saved in the "data.csv" file.

This notebook:
1. Loads all the texts (.txt) into a pandas DataFrame.
2. Loads all the corresponding gold labels into the DataFrame.
3. Encodes those labels and saves the encodings.
4. Translates all non-English texts into English.

With that clear, let's import what we need in this notebook.

In [5]:
import os
import pandas as pd
import numpy as np
import nltk
import pickle

from deep_translator import GoogleTranslator, single_detection
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.tokenize import sent_tokenize
from langdetect import detect


nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jochem/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Now that we have all the relevant packages, let's start with point 1: loading all .txt files into a DataFrame.

First, we define a function that loops through a directory, reads the text files and puts them into a DataFrame.

Then we load the data from all the directories.

In [None]:
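def load_data(directory: str) -> pd.DataFrame:
    """Read every file in `directory` into a DataFrame with id, text and language columns."""
    data = []
    print("current directory: ", directory)
    # The language code is the parent folder name, e.g. ".../EN/raw-documents" -> "EN".
    language = directory.split("/")[-2]

    for file in os.listdir(directory):
        file_path = os.path.join(directory, file)

        try:
            # Read as UTF-8 explicitly so the result does not depend on the platform default.
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
                data.append({"id": file, "text": text, "language": language})

        except Exception as e:
            # Report the cause as well, otherwise failures are hard to diagnose.
            print(f"Error while reading: {file} ({e})")

    print("Data length: ", len(data))
    print("Language: ", language)
    print()
    df_text = pd.DataFrame(data)
    return df_text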
        


In [13]:
df_text_en = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/EN/raw-documents")
df_text_bg = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/BG/raw-documents")
df_text_hi = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/HI/raw-documents")
df_text_pt = load_data("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/PT/raw-documents")
df_text = pd.concat([df_text_en, df_text_bg, df_text_hi, df_text_pt])
print("Total data length: ", len(df_text))

current directory:  /Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/EN/raw-documents
Data length:  200
Language:  EN

current directory:  /Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/BG/raw-documents
Data length:  211
Language:  BG

current directory:  /Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/HI/raw-documents
Data length:  115
Language:  HI

current directory:  /Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/PT/raw-documents
Data length:  200
Language:  PT

Total data length:  726
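
The four near-identical calls above could also be driven by a loop over the language codes. The following is a minimal stdlib-only sketch of that idea, not the notebook's own code: `read_language_dir`, `read_all` and the temporary directory tree are hypothetical stand-ins that only mimic the `<language>/raw-documents` layout. The rows it collects could be passed to `pd.DataFrame`, as `load_data` does.

```python
import tempfile
from pathlib import Path


def read_language_dir(root: Path, language: str) -> list[dict]:
    """Read every .txt file under <root>/<language>/raw-documents (hypothetical helper)."""
    rows = []
    doc_dir = root / language / "raw-documents"
    for path in sorted(doc_dir.glob("*.txt")):
        # An explicit encoding avoids platform-dependent defaults for non-Latin scripts.
        rows.append({
            "id": path.name,
            "text": path.read_text(encoding="utf-8"),
            "language": language,
        })
    return rows


def read_all(root: Path, languages: list[str]) -> list[dict]:
    """Concatenate the rows of all language directories (hypothetical helper)."""
    rows = []
    for language in languages:
        rows.extend(read_language_dir(root, language))
    return rows


# Tiny demo on a throwaway directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for lang, text in [("EN", "Hello."), ("HI", "नमस्ते।")]:
        doc_dir = demo_root / lang / "raw-documents"
        doc_dir.mkdir(parents=True)
        (doc_dir / f"{lang}_1.txt").write_text(text, encoding="utf-8")
    demo_rows = read_all(demo_root, ["EN", "HI"])

print(len(demo_rows))
```

Passing the language explicitly also removes the need to derive it from the path with `split("/")`.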


We still need the correct labels for these files though. 

To get these, we:
1. Define a function that goes through the subtask 2 annotation file and loads all the appropriate labels into a DataFrame.
2. Load the labels for each language.
3. Encode the labels with a MultiLabelBinarizer.
4. Save the encodings in the DataFrame as well.

In [None]:
def load_labels(file_path: str) -> pd.DataFrame:
    """Read a tab-separated annotation file: id, dominant narratives, sub-narratives."""
    labels = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            tags = line.strip().split("\t")

            text_id = tags[0].strip()

            # Multiple narratives within one field are separated by ";".
            dom_narrs = [narr.strip() for narr in tags[1].split(";")]
            sub_narrs = [narr.strip() for narr in tags[2].split(";")]

            labels.append({"id": text_id, "dom_narr": dom_narrs, "sub_narr": sub_narrs})
    print("Labels length:", len(labels))
    df_labels = pd.DataFrame(labels)
    return df_labels
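
To make the line format that `load_labels` assumes concrete, here is a small sketch that parses a single line. The helper name `parse_annotation_line` and the example line are made up for illustration; the labels are placeholders, not real annotations.

```python
def parse_annotation_line(line: str) -> dict:
    """Split one tab-separated annotation line into an id and two label lists."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        # Fail loudly on malformed lines instead of raising an IndexError later.
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    text_id, dom_field, sub_field = fields[0], fields[1], fields[2]
    return {
        "id": text_id.strip(),
        "dom_narr": [label.strip() for label in dom_field.split(";")],
        "sub_narr": [label.strip() for label in sub_field.split(";")],
    }


# Made-up example line with two dominant and two sub-narrative labels.
example = "EN_0001.txt\tLabel A;Label B\tLabel A: detail 1;Label B: detail 2\n"
parsed = parse_annotation_line(example)
print(parsed)
```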

In [15]:
df_labels_en = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/EN/subtask-2-annotations.txt")
df_labels_bg = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/BG/subtask-2-annotations.txt")
df_labels_hi = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/HI/subtask-2-annotations.txt")
df_labels_pt = load_labels("/Users/jochem/Documents/school/Uni KI jaar 4/Scriptie/Train Data/training_data_16_October_release/PT/subtask-2-annotations.txt")
df_labels = pd.concat([df_labels_en, df_labels_bg, df_labels_hi, df_labels_pt])
print(len(df_labels))

Labels length: 200
Labels length: 211
Labels length: 115
Labels length: 200
726


Now for the encoding:
1. Again, we define a function to do this for us.
2. Encode the labels and save them in the DF.
3. Save the MultiLabelBinarizer objects in a pickle (.pkl) file, so we can access them later, to revert predictions back into full classes.
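
As a small illustration of why the fitted binarizers are worth keeping, the sketch below encodes a few placeholder label lists, round-trips the binarizer through pickle (in memory rather than via a .pkl file), and maps a hypothetical prediction back to label names. The labels and the prediction are invented for the example.

```python
import pickle

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder label lists, one list per document.
labels = [["Other"], ["Label A", "Label B"], ["Label B"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)  # shape: (n_documents, n_classes)
print(mlb.classes_)                  # column order of the encoded matrix

# Round-trip through pickle, in memory here instead of a .pkl file.
restored = pickle.loads(pickle.dumps(mlb))

# A hypothetical model prediction for one document: columns 0 and 1 active.
prediction = np.array([[1, 1, 0]])
decoded = restored.inverse_transform(prediction)
print(decoded)
```

`inverse_transform` returns one tuple of label names per row, which is what lets predictions be reverted to full classes later.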

In [16]:
# Label preprocessing step:
# multi-hot encodes the dominant- and sub-narrative labels
# and, if save=True, pickles the fitted binarizers for later use.
def encode_labels(df, save: bool = True):
    dom_mlb = MultiLabelBinarizer()
    sub_mlb = MultiLabelBinarizer()

    # pd.concat(axis=1) aligns on the index, so df is expected to have a default RangeIndex.
    # Note: a label present in both label sets (e.g. "Other") yields two columns with the same name.
    dom_narr_enc = dom_mlb.fit_transform(df["dom_narr"])
    df = pd.concat([df, pd.DataFrame(dom_narr_enc, columns=dom_mlb.classes_)], axis=1)

    sub_narr_enc = sub_mlb.fit_transform(df["sub_narr"])
    df = pd.concat([df, pd.DataFrame(sub_narr_enc, columns=sub_mlb.classes_)], axis=1)

    if save:
        with open("../pkl_files/dom_mlb.pkl", "wb") as f:
            pickle.dump(dom_mlb, f)

        with open("../pkl_files/sub_mlb.pkl", "wb") as f:
            pickle.dump(sub_mlb, f)

    return df

In [17]:
df = encode_labels(pd.merge(df_text, df_labels, on="id"))
df.to_clipboard()
print(len(df))

726
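
The row count is unchanged by the merge (726 texts, 726 label rows, 726 merged rows), which suggests every text found its labels. Since the default inner merge drops unmatched ids silently, a check along the following lines can make that explicit. It is a sketch on placeholder frames, not the notebook's data.

```python
import pandas as pd

# Placeholder frames: "b.txt" has no label row and "z.txt" has no text file.
texts = pd.DataFrame({"id": ["a.txt", "b.txt"], "text": ["first", "second"]})
labels = pd.DataFrame({"id": ["a.txt", "z.txt"], "dom_narr": [["Other"], ["Other"]]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each id was found in the left frame, the right frame, or both.
check = pd.merge(texts, labels, on="id", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["id", "_merge"]]
print(unmatched)
```

An empty `unmatched` frame would mean that the inner merge loses nothing.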


Next, we would like to translate the non-English texts into English. So, we:

1. Define a function to translate a list of strings into English.
2. Apply this function to all non-English texts.
3. While applying the translation function, use nltk's sent_tokenize to split long texts into chunks of whole sentences.

In [8]:
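# GoogleTranslator defaults: source lang = "auto", target lang = "en"
translator = GoogleTranslator()
def translate_text(text: list[str]):
    """Translate each chunk and join the results into one string.

    Returns an empty list if the translation fails; later cells use that to find failed texts.
    """
    translation = []

    try:
        translation = [translator.translate(t) for t in text]
        translation = [t for t in translation if t is not None]
        translation = " ".join(translation)

    except Exception as error:
        print("Translation failed")
        print(error)
        # Reset explicitly so a failure always yields the empty-list marker.
        translation = []

    return translation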

In [9]:
# Sentence-tokenizing the text and translating each sentence individually had serious performance issues,
# so instead we group the sentences into chunks of a manageable length.
def chunk_text(text: str, chunk_size=1500) -> list[str]:
    if len(text) < chunk_size:
        return [text]

    chunks = []
    current_chunk = []
    current_chunk_length = 0

    sentence_tokens = sent_tokenize(text)
    for sent in sentence_tokens:
        sent_length = len(sent)

        # Close the current chunk once adding this sentence would reach the limit.
        # The `current_chunk` check avoids emitting an empty chunk when the very
        # first sentence is already longer than chunk_size.
        if current_chunk and (sent_length + current_chunk_length) >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent]
            current_chunk_length = sent_length + 1

        else:
            current_chunk.append(sent)
            current_chunk_length += sent_length + 1

    # Flush the remaining sentences. A single sentence longer than chunk_size
    # still ends up as one oversized chunk.
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks



BE AWARE: THE FOLLOWING CELL TAKES A LONG TIME TO COMPLETE!

It applies the translation to all the non-English texts.
Expect to wait at least 10 minutes for the result.

In [10]:
df["translated_text"] = df["text"].apply(lambda x: translate_text(chunk_text(x)) if detect(x) != "en" else x)

Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
Request exception can happen due to an api connection error. Please check your connection and try again
Translation failed
संयुक्त रिपोर्ट: यूक्रेन युद्ध का विभिन्न क्षेत्रों/अंतरराष्ट्रीय चिंताओं पर ...

वीआईएफ यंग स्कॉलर्स फोरम ने 08 अप्रैल 2022 को अपनी साप्ताहिक बैठक में ‘विभिन्न क्षेत्रों/अंतरराष्ट्रीय सरोकारों पर यूक्रेन युद्ध के प्रभाव' की चर्चा की। इन विद्वानों ने अध्ययन के अपने निर्धारित क्षेत्र से संबंधित यूक्रेन युद्ध के प्रभावों और प्रतिक्रियाओं पर चर्चा की। इस परिचर्चा में

In [27]:
df.to_csv("../data/newdata.csv")

Some texts failed to translate. The CSV saved above still contains them; removing them can also be handled in the data cleaning script.
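
The errors printed above mention an API connection error, which is often transient. One option, sketched below under that assumption, is to retry a failing call a few times with an increasing delay. `call_with_retries` and `flaky_translate` are hypothetical names; the fake translator only simulates a connection error, and whether retrying would actually have recovered the failed texts is not something this notebook establishes.

```python
import time
from typing import Callable


def call_with_retries(func: Callable[[str], str], arg: str,
                      attempts: int = 3, base_delay: float = 1.0) -> str:
    """Call func(arg), retrying with exponential backoff if it raises (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return func(arg)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
            time.sleep(base_delay * 2 ** attempt)


# Demo with a fake translator that fails twice before succeeding.
calls = {"count": 0}


def flaky_translate(text: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection error")
    return text.upper()


result = call_with_retries(flaky_translate, "hello", attempts=3, base_delay=0.01)
print(result, calls["count"])
```

In the notebook, the callable passed in would be the per-chunk translation call rather than the fake one used here.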

In [11]:
print(len(df))
df.to_clipboard()

726


Let's see which translated texts went wrong:

In [15]:
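# Failed translations are stored as an empty list (see translate_text).
failed = df["translated_text"].apply(lambda x: isinstance(x, list) and len(x) == 0)
for text_id in df.loc[failed, "id"]:
    print(text_id)

print(failed.sum())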

HI_82.txt
HI_115.txt
HI_40.txt
HI_54.txt
HI_56.txt
HI_117.txt
HI_80.txt
HI_84.txt
HI_53.txt
HI_46.txt
HI_85.txt
HI_105.txt
HI_79.txt
HI_86.txt
HI_92.txt
HI_22.txt
HI_20.txt
HI_2.txt
HI_39.txt
HI_6.txt
HI_120.txt
HI_49.txt
HI_136.txt
HI_62.txt
HI_89.txt
HI_99.txt
HI_66.txt
HI_73.txt
HI_67.txt
HI_131.txt
HI_65.txt
HI_64.txt
HI_130.txt
33


Now let's remove them. The filtered result is stored in a separate DataFrame, dfwipe, so df itself is unchanged.

In [24]:
print(df.head())
dfwipe = df[df["translated_text"].apply(lambda x: len(x)>0)]
print(len(dfwipe))
print(len(df))

                 id                                               text  \
0  EN_UA_104876.txt  Putin honours army unit blamed for Bucha massa...   
1  EN_UA_023211.txt  Europe Putin thanks US journalist Tucker Carls...   
2  EN_UA_011260.txt  Russia has a clear plan to resolve the conflic...   
3  EN_UA_101067.txt  First war of TikTok era sees tragedy, humor an...   
4  EN_UA_102963.txt  Ukraine's President Zelenskyy to address Mexic...   

  language                                           dom_narr  \
0       EN                                            [Other]   
1       EN                                            [Other]   
2       EN  [URW: Russia is the Victim, URW: Discrediting ...   
3       EN                                            [Other]   
4       EN                                            [Other]   

                                            sub_narr  \
0                                            [Other]   
1                                            [Other]

In [26]:
print(df.head())

                 id                                               text  \
0  EN_UA_104876.txt  Putin honours army unit blamed for Bucha massa...   
1  EN_UA_023211.txt  Europe Putin thanks US journalist Tucker Carls...   
2  EN_UA_011260.txt  Russia has a clear plan to resolve the conflic...   
3  EN_UA_101067.txt  First war of TikTok era sees tragedy, humor an...   
4  EN_UA_102963.txt  Ukraine's President Zelenskyy to address Mexic...   

  language                                           dom_narr  \
0       EN                                            [Other]   
1       EN                                            [Other]   
2       EN  [URW: Russia is the Victim, URW: Discrediting ...   
3       EN                                            [Other]   
4       EN                                            [Other]   

                                            sub_narr  \
0                                            [Other]   
1                                            [Other]

The last thing we want to do is save our full data set to a CSV file.
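
One thing to keep in mind when saving: list-valued columns such as dom_narr and sub_narr are written to CSV as their string representation. The sketch below, on a placeholder frame and a temporary file (not the notebook's actual output path), shows the round trip and one way to parse the lists back with `ast.literal_eval`.

```python
import ast
import os
import tempfile

import pandas as pd

# Placeholder frame with a list-valued column, like "dom_narr" in the notebook.
demo = pd.DataFrame({"id": ["a.txt"], "dom_narr": [["Label A", "Label B"]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    # index=False keeps the row index out of the file.
    demo.to_csv(csv_path, index=False)

    # Read naively: the list comes back as its string representation.
    as_text = pd.read_csv(csv_path)
    # Read with a converter: the string is parsed back into a Python list.
    as_list = pd.read_csv(csv_path, converters={"dom_narr": ast.literal_eval})

print(type(as_text["dom_narr"][0]).__name__)
print(type(as_list["dom_narr"][0]).__name__)
```

The multi-hot columns are plain integers and survive the round trip unchanged, so the converter is only needed for the list columns.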