Author: [Jule Godbersen](mailto:godbersj@tcd.ie)

Content of file: Loading, preprocessing and analyzing the data used within this project

In [47]:
from collections import Counter
from datasets import load_dataset
import random
import pickle
import os
from deep_translator import GoogleTranslator
import time
from tqdm import tqdm
from data_analysis import analyze_dataset

### Accessing the [MMS dataset](https://brand24-ai.github.io/mms_benchmark/)

Note: As the dataset is private, log in to your Hugging Face account and accept the terms of use on the [Hugging Face page of the dataset](https://huggingface.co/datasets/Brand24/mms). Also make sure that you have run `huggingface-cli login`.

In [48]:
dataset = load_dataset("Brand24/mms",cache_dir="/mount/studenten-temp1/users/godberja/HuggingfaceCache") # TODO adapt path

# Collection
Within this project I use the MMS dataset, a multilingual sentiment dataset that contains data for many languages. I will only use the English and German data.

Let's explore the dataset a bit first:

In [3]:
labels = dataset["train"].features["label"].names
print(labels) 

['negative', 'neutral', 'positive']


In [4]:
# looking at the documentation I found that the labels are represented via indices within the dataset
label2ind = {"negative":0,"neutral":1,"positive":2}
ind2label = {0:"negative",1:"neutral",2:"positive"}

In [5]:
# look at an example data sample
dataset["train"][0]

{'_id': 0,
 'text': '" آللهمَ إني أسألك من الخير كله ..عاجله وآجله ، ما علمت منه وما لم أعلم،،وأعوذ بك من الشر كله ..عاجله وآجله، ما علمت منه وما لم أعلم "',
 'label': 2,
 'original_dataset': 'ar_arsentdl',
 'domain': 'social_media',
 'language': 'ar',
 'Family': 'Afro-Asiatic',
 'Genus': 'Semitic',
 'Definite articles': 'definite affix',
 'Indefinite articles': 'no article',
 'Number of cases': '3',
 'Order of subject, object, verb': 'SVO',
 'Negative morphemes': 'negative particle',
 'Polar questions': 'interrogative intonation only',
 'Position of negative word wrt SOV': 'SNegVO',
 'Prefixing vs suffixing': 'weakly suffixing',
 'Coding of nominal plurality': 'mixed morphological plural',
 'Grammatical genders': 'masculine, feminine',
 'cleanlab_self_confidence': 0.21071475744247437}

In [7]:
# check for distribution of different languages supported / labels
languages = []
labels = []
for data_sample in dataset["train"]:
    languages += [data_sample["language"]]
    labels += [data_sample["label"]]

In [8]:
print("Distribution of data samples across languages: ",Counter(languages))

Distribution of data samples across languages:  Counter({'en': 2330486, 'ar': 932075, 'es': 418712, 'zh': 331702, 'de': 315887, 'pl': 236688, 'fr': 210631, 'ja': 209780, 'cs': 196287, 'pt': 157834, 'sl': 113543, 'ru': 110930, 'hr': 77594, 'sr': 76368, 'th': 72319, 'bg': 62150, 'hu': 56682, 'sk': 56623, 'sq': 44284, 'sv': 41346, 'bs': 36183, 'ur': 19660, 'hi': 16999, 'fa': 13525, 'it': 12065, 'he': 8619, 'lv': 5790, 'el': 500})


In [9]:
print("Distribution of data samples across sentiment classes: ",Counter(labels))

Distribution of data samples across sentiment classes:  Counter({2: 3494710, 1: 1341392, 0: 1329160})
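
To make the class imbalance easier to read, the counts printed above can be turned into relative shares. This is a minimal sketch that copies the label counts from the output above; the names `label_counts` and `label_shares` are introduced here for illustration only.

```python
from collections import Counter

# label counts copied from the output above (label index -> number of samples)
label_counts = Counter({2: 3494710, 1: 1341392, 0: 1329160})
ind2label = {0: "negative", 1: "neutral", 2: "positive"}

total = sum(label_counts.values())
# relative share of each sentiment class
label_shares = {ind2label[i]: n / total for i, n in label_counts.items()}

# print the classes from most to least frequent
for name, share in sorted(label_shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

More than half of all samples are positive, which is one reason why the data is balanced for labels later on.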


# Preparation

## Filter dataset for languages

Only get the German data:

In [4]:
# filter dataset for german only
german = dataset.filter(lambda row: row["language"] == "de")

In [7]:
print("There are",len(german["train"]),"data samples available in German.")

There are 315887 data samples available in German.


In [8]:
# just out of interest: check dataset sources for german
german_sources = []
for datasample in german["train"]:
    german_sources += [datasample["original_dataset"]]
print("Sources of datasets: ",Counter(german_sources))

Sources of datasets:  Counter({'de_multilan_amazon': 209073, 'de_twitter_sentiment': 90534, 'de_sb10k': 9948, 'de_omp': 3598, 'de_dai_labor': 1781, 'de_ifeel': 953})


As this is quite a lot of data and I want to simulate a low-resource scenario, I will limit the amount of data later on.

Only get the English data:

In [5]:
english = dataset.filter(lambda row: row["language"] == "en")

In [10]:
print("There are",len(english["train"]),"data samples available in English.")

There are 2330486 data samples available in English.


In [11]:
# just out of interest: check dataset sources for english
english_sources = []
for datasample in english["train"]:
    english_sources += [datasample["original_dataset"]]
print("Sources of datasets: ",Counter(english_sources))

Sources of datasets:  Counter({'en_amazon': 1883238, 'en_multilan_amazon': 209393, 'en_twitter_sentiment': 85784, 'en_semeval_2017': 65071, 'en_tweet_airlines': 14427, 'en_silicone_meld_s': 12138, 'en_sentistrength': 11759, 'en_vader_movie_reviews': 10605, 'en_dai_labor': 7073, 'en_per_sent': 5333, 'en_vader_nyt': 5190, 'en_silicone_sem': 4643, 'en_vader_twitter': 4200, 'en_vader_amazon': 3708, 'en_financial_phrasebank_sentences_75agree': 3448, 'en_tweets_sanders': 3424, 'en_poem_sentiment': 1052})


## Dataset choice

As there is far too much data available, I check whether there are dataset sources that contain samples for German as well as for English. Later I take a subset of each of these.

(Remember: I want to simulate a low-resource scenario, which means pretending that only a small amount of German data is available. Normally one would of course use as much data as is available, but this would not help me answer my research questions.)

In [7]:
# compare which dataset sources are common for both languages
en_s = {'en_amazon': 1883238, 'en_multilan_amazon': 209393, 'en_twitter_sentiment': 85784, 'en_semeval_2017': 65071, 'en_tweet_airlines': 14427, 'en_silicone_meld_s': 12138, 'en_sentistrength': 11759, 'en_vader_movie_reviews': 10605, 'en_dai_labor': 7073, 'en_per_sent': 5333, 'en_vader_nyt': 5190, 'en_silicone_sem': 4643, 'en_vader_twitter': 4200, 'en_vader_amazon': 3708, 'en_financial_phrasebank_sentences_75agree': 3448, 'en_tweets_sanders': 3424, 'en_poem_sentiment': 1052}
de_s = {'de_multilan_amazon': 209073, 'de_twitter_sentiment': 90534, 'de_sb10k': 9948, 'de_omp': 3598, 'de_dai_labor': 1781, 'de_ifeel': 953}
en_s_c = [k[3:] for k in en_s]
for k in de_s:
    if k[3:] in en_s_c:
        print(k[3:])

multilan_amazon
twitter_sentiment
dai_labor
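
The same comparison can also be written as a set intersection. The sketch below copies the source names from the outputs above (counts omitted); in the notebook they could instead be taken from the keys of `Counter(english_sources)` and `Counter(german_sources)`. The helper `strip_language` is a name introduced here for illustration.

```python
# source names copied from the outputs above
en_sources = [
    "en_amazon", "en_multilan_amazon", "en_twitter_sentiment", "en_semeval_2017",
    "en_tweet_airlines", "en_silicone_meld_s", "en_sentistrength",
    "en_vader_movie_reviews", "en_dai_labor", "en_per_sent", "en_vader_nyt",
    "en_silicone_sem", "en_vader_twitter", "en_vader_amazon",
    "en_financial_phrasebank_sentences_75agree", "en_tweets_sanders",
    "en_poem_sentiment",
]
de_sources = [
    "de_multilan_amazon", "de_twitter_sentiment", "de_sb10k",
    "de_omp", "de_dai_labor", "de_ifeel",
]


def strip_language(name):
    # drop the leading language code, e.g. "de_sb10k" -> "sb10k"
    return name.split("_", 1)[1]


# sources that exist for both languages
common_sources = {strip_language(s) for s in en_sources} & {strip_language(s) for s in de_sources}
print(sorted(common_sources))
```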


Find more information about the datasets on the papers mentioned on the following website: https://brand24-ai.github.io/mms_benchmark/citations.html

To have different "genres" available, I will use multilan_amazon and twitter_sentiment from here on. I disregard dai_labor because, like twitter_sentiment, it contains texts from social media.

In [5]:
# extract datasets and print their sizes
en_reviews = dataset.filter(lambda row: row["original_dataset"] == "en_multilan_amazon")
print("Amazon review dataset with ",len(en_reviews["train"])," English samples")
de_reviews = dataset.filter(lambda row: row["original_dataset"] == "de_multilan_amazon")
print("Amazon review dataset with ",len(de_reviews["train"])," German samples")

en_twitter = dataset.filter(lambda row: row["original_dataset"] == "en_twitter_sentiment")
print("Twitter sentiment dataset with ",len(en_twitter["train"])," English samples")
de_twitter = dataset.filter(lambda row: row["original_dataset"] == "de_twitter_sentiment")
print("Twitter sentiment dataset with ",len(de_twitter["train"])," German samples")

Amazon review dataset with  209393  English samples
Amazon review dataset with  209073  German samples
Twitter sentiment dataset with  85784  English samples
Twitter sentiment dataset with  90534  German samples


## Selection of datasamples and creation of datasplits

I now have access to German and English data. They are all within one training split and are probably imbalanced with respect to class labels. Thus I will now balance the data and split it into three splits for training, development and testing. The test split has a fixed size and the train-dev ratio is roughly 70:30 (the exact sizes are set in the code further down).

Note that I ignore the dataset source when balancing the data.

### Balance for labels

In [6]:
# function to extract ids of the instances in given data dataset
def get_ids(data):
    return [instance["_id"] for instance in data]

In [7]:
# function to balance the dataset for having the same amount of instances for each label
def balance_data(data):

    # split data according to labels
    neg_instances = data.filter(lambda row: row["label"]==0)["train"]
    all_neg_ids = get_ids(neg_instances)
    neut_instances = data.filter(lambda row: row["label"]==1)["train"]
    all_neut_ids = get_ids(neut_instances)
    pos_instances = data.filter(lambda row: row["label"]==2)["train"]
    all_pos_ids = get_ids(pos_instances)

    # get the minimum amount of instances we can extract for each of the labels 
    class_minimum = min(len(all_neg_ids),len(all_neut_ids),len(all_pos_ids))
    print("We have ",class_minimum," instances for all three labels")

    # shuffle list with ids to make sure we randomly choose instances
    random.shuffle(all_neg_ids)
    random.shuffle(all_neut_ids)
    random.shuffle(all_pos_ids)

    # extract class_minimum amount of (shuffled!) instances for each of the labels
    neg_ids = all_neg_ids[:class_minimum]
    neut_ids = all_neut_ids[:class_minimum]
    pos_ids = all_pos_ids[:class_minimum]

    return neg_ids,neut_ids,pos_ids 
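
Note that `random.shuffle` is called without a fixed seed, so rerunning the notebook selects different instances. The following is a minimal sketch of seeded balancing on plain `(id, label)` pairs. It is not the function used in this notebook; `balance_ids`, the toy data and the seed value 42 are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_ids(samples, seed=42):
    """Return {label: list of ids}, with the same number of ids for every label.

    samples: iterable of (id, label) pairs.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    by_label = defaultdict(list)
    for sample_id, label in samples:
        by_label[label].append(sample_id)

    # the rarest label determines how many ids can be kept per label
    class_minimum = min(len(ids) for ids in by_label.values())

    balanced = {}
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)  # random choice of instances, reproducible via the seed
        balanced[label] = ids[:class_minimum]
    return balanced


# toy data: 5 negative, 3 neutral and 12 positive samples
toy_samples = (
    [(i, 0) for i in range(5)]
    + [(i, 1) for i in range(5, 8)]
    + [(i, 2) for i in range(8, 20)]
)
balanced = balance_ids(toy_samples)
print({label: len(ids) for label, ids in balanced.items()})
```

With the same seed, the same ids are selected on every run, which makes the saved splits reproducible.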

In [8]:
# balance the German data
de_review_neg, de_review_neut, de_review_pos = balance_data(de_reviews)
print("There are ",len(de_review_neg)," instances for each label in the German review data")
de_twitter_neg, de_twitter_neut, de_twitter_pos = balance_data(de_twitter)
print("There are ",len(de_twitter_neg)," instances for each label in the German twitter data")

We have  41935  instances for all three labels
There are  41935  instances for each label in the German review data
We have  17447  instances for all three labels
There are  17447  instances for each label in the German twitter data


In [9]:
# balance the English data
en_review_neg, en_review_neut, en_review_pos = balance_data(en_reviews)
print("There are ",len(en_review_neg)," instances for each label in the English review data")
en_twitter_neg, en_twitter_neut, en_twitter_pos = balance_data(en_twitter)
print("There are ",len(en_twitter_neg)," instances for each label in the English twitter data")

We have  41940  instances for all three labels
There are  41940  instances for each label in the English review data
We have  22829  instances for all three labels
There are  22829  instances for each label in the English twitter data


### Split into train, dev, test

I couldn't find a clear recommendation on how many data samples are needed for training a model. To represent the low-resource scenario I will use 600 German training instances (100 per "genre" and class label, i.e. 100 × 2 genres × 3 labels). The high-resource scenario is represented by 6000 English training instances (1000 per "genre" and class label). The test split is the same in all settings and contains 120 instances (balanced across the three classes). Note that I want a ratio of approx. 70:30 for train-dev.

If there is time left after my experiments, I want to experiment further with the dataset size. Thus I also create a small, a medium and a large training and development split. 

In [10]:
# shuffle the datasets to counteract any patterns in the dataset
random.shuffle(de_twitter_neg)
random.shuffle(de_twitter_neut)
random.shuffle(de_twitter_pos)
random.shuffle(en_twitter_neg)
random.shuffle(en_twitter_neut)
random.shuffle(en_twitter_pos)
random.shuffle(de_review_neg)
random.shuffle(de_review_neut)
random.shuffle(de_review_pos)
random.shuffle(en_review_neg)
random.shuffle(en_review_neut)
random.shuffle(en_review_pos)

In [11]:
def create_split(ids,size):
    split = ids[:size]
    left_ids = ids[size:] # remove ids from list
    return split,left_ids

In [12]:
# manually set the values I decided to choose
test_size = 20 # leads to: 20 instances * 3 labels * 2 dataset sources = 120 test instances
train_size_small = 100 # leads to: 100 instances * 3 labels * 2 dataset sources = 600 train instances
train_size_medium = 500 # leads to: 500 instances * 3 labels * 2 dataset sources = 3000 train instances
train_size_large = 1000 # leads to: 1000 instances * 3 labels * 2 dataset sources = 6000 train instances
train_dev_ratio = [70,30]

In [13]:
def extract_splits(ids,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio):
    # calculate size of data split
    dev_size_large = int(train_size_large / train_dev_ratio[0] * train_dev_ratio[1])
    dev_size_medium = int(train_size_medium / train_dev_ratio[0] * train_dev_ratio[1])
    dev_size_small = int(train_size_small / train_dev_ratio[0] * train_dev_ratio[1])

    test, ids = create_split(ids,test_size)

    train_large, ids = create_split(ids,train_size_large)
    train_medium, _ = create_split(train_large,train_size_medium)
    train_small, _ = create_split(train_large,train_size_small)

    dev_large, _ = create_split(ids,dev_size_large)
    dev_medium, _ = create_split(dev_large,dev_size_medium)
    dev_small, _ = create_split(dev_large,dev_size_small)

    return test, train_large, train_medium, train_small, dev_large, dev_medium, dev_small
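
The development sizes follow from the train sizes and the 70:30 ratio, truncated to whole numbers by `int()`. The sketch below repeats that arithmetic for the values set above and multiplies by the six groups (3 labels × 2 dataset sources) that are merged later. `dev_sizes` and `dev_totals` are names introduced here for illustration.

```python
train_dev_ratio = [70, 30]
train_sizes = {"small": 100, "medium": 500, "large": 1000}
groups = 3 * 2  # 3 labels * 2 dataset sources

# same formula as in extract_splits: int() truncates towards zero
dev_sizes = {
    name: int(size / train_dev_ratio[0] * train_dev_ratio[1])
    for name, size in train_sizes.items()
}
dev_totals = {name: size * groups for name, size in dev_sizes.items()}

for name in train_sizes:
    print(name, "dev per group:", dev_sizes[name], "dev total:", dev_totals[name])
```

Because of the truncation, the development splits are slightly smaller than 30% of train plus dev. Also note that the smaller splits are prefixes of the large ones, so `train_small` is contained in `train_medium`, which is contained in `train_large`.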

In [17]:
# create splits for German twitter

# look at negative samples
de_twitter_neg_test, de_twitter_neg_train_large, de_twitter_neg_train_medium, de_twitter_neg_train_small, de_twitter_neg_dev_large, de_twitter_neg_dev_medium, de_twitter_neg_dev_small = extract_splits(de_twitter_neg,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at neutral samples
de_twitter_neut_test, de_twitter_neut_train_large, de_twitter_neut_train_medium, de_twitter_neut_train_small, de_twitter_neut_dev_large, de_twitter_neut_dev_medium, de_twitter_neut_dev_small = extract_splits(de_twitter_neut,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at positive samples
de_twitter_pos_test, de_twitter_pos_train_large, de_twitter_pos_train_medium, de_twitter_pos_train_small, de_twitter_pos_dev_large, de_twitter_pos_dev_medium, de_twitter_pos_dev_small = extract_splits(de_twitter_pos,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

In [18]:
# create splits for English twitter

# look at negative samples
en_twitter_neg_test, en_twitter_neg_train_large, en_twitter_neg_train_medium, en_twitter_neg_train_small, en_twitter_neg_dev_large, en_twitter_neg_dev_medium, en_twitter_neg_dev_small = extract_splits(en_twitter_neg,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at neutral samples
en_twitter_neut_test, en_twitter_neut_train_large, en_twitter_neut_train_medium, en_twitter_neut_train_small, en_twitter_neut_dev_large, en_twitter_neut_dev_medium, en_twitter_neut_dev_small = extract_splits(en_twitter_neut,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at positive samples
en_twitter_pos_test, en_twitter_pos_train_large, en_twitter_pos_train_medium, en_twitter_pos_train_small, en_twitter_pos_dev_large, en_twitter_pos_dev_medium, en_twitter_pos_dev_small = extract_splits(en_twitter_pos,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

In [19]:
# create splits for German review

# look at negative samples
de_review_neg_test, de_review_neg_train_large, de_review_neg_train_medium, de_review_neg_train_small, de_review_neg_dev_large, de_review_neg_dev_medium, de_review_neg_dev_small = extract_splits(de_review_neg,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at neutral samples
de_review_neut_test, de_review_neut_train_large, de_review_neut_train_medium, de_review_neut_train_small, de_review_neut_dev_large, de_review_neut_dev_medium, de_review_neut_dev_small = extract_splits(de_review_neut,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at positive samples
de_review_pos_test, de_review_pos_train_large, de_review_pos_train_medium, de_review_pos_train_small, de_review_pos_dev_large, de_review_pos_dev_medium, de_review_pos_dev_small = extract_splits(de_review_pos,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

In [20]:
# create splits for English review

# look at negative samples
en_review_neg_test, en_review_neg_train_large, en_review_neg_train_medium, en_review_neg_train_small, en_review_neg_dev_large, en_review_neg_dev_medium, en_review_neg_dev_small = extract_splits(en_review_neg,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at neutral samples
en_review_neut_test, en_review_neut_train_large, en_review_neut_train_medium, en_review_neut_train_small, en_review_neut_dev_large, en_review_neut_dev_medium, en_review_neut_dev_small = extract_splits(en_review_neut,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

# look at positive samples
en_review_pos_test, en_review_pos_train_large, en_review_pos_train_medium, en_review_pos_train_small, en_review_pos_dev_large, en_review_pos_dev_medium, en_review_pos_dev_small = extract_splits(en_review_pos,test_size,train_size_small,train_size_medium,train_size_large,train_dev_ratio)

In [22]:
# merge splits, shuffle lists and save ids in a dict
all_ids = dict()

# test files
de_test = de_twitter_neg_test + de_twitter_neut_test + de_twitter_pos_test + de_review_neg_test + de_review_neut_test + de_review_pos_test
random.shuffle(de_test)
all_ids["de_test"] = de_test

en_test = en_twitter_neg_test + en_twitter_neut_test + en_twitter_pos_test + en_review_neg_test + en_review_neut_test + en_review_pos_test
random.shuffle(en_test)
all_ids["en_test"] = en_test

# train files large
de_train_large = de_twitter_neg_train_large + de_twitter_neut_train_large + de_twitter_pos_train_large + de_review_neg_train_large + de_review_neut_train_large + de_review_pos_train_large
random.shuffle(de_train_large)
all_ids["de_train_large"] = de_train_large

en_train_large = en_twitter_neg_train_large + en_twitter_neut_train_large + en_twitter_pos_train_large + en_review_neg_train_large + en_review_neut_train_large + en_review_pos_train_large
random.shuffle(en_train_large)
all_ids["en_train_large"] = en_train_large

# train files medium
de_train_medium = de_twitter_neg_train_medium + de_twitter_neut_train_medium + de_twitter_pos_train_medium + de_review_neg_train_medium + de_review_neut_train_medium + de_review_pos_train_medium
random.shuffle(de_train_medium)
all_ids["de_train_medium"] = de_train_medium

en_train_medium = en_twitter_neg_train_medium + en_twitter_neut_train_medium + en_twitter_pos_train_medium + en_review_neg_train_medium + en_review_neut_train_medium + en_review_pos_train_medium
random.shuffle(en_train_medium)
all_ids["en_train_medium"] = en_train_medium

# train files small
de_train_small = de_twitter_neg_train_small + de_twitter_neut_train_small + de_twitter_pos_train_small + de_review_neg_train_small + de_review_neut_train_small + de_review_pos_train_small
random.shuffle(de_train_small)
all_ids["de_train_small"] = de_train_small

en_train_small = en_twitter_neg_train_small + en_twitter_neut_train_small + en_twitter_pos_train_small + en_review_neg_train_small + en_review_neut_train_small + en_review_pos_train_small
random.shuffle(en_train_small)
all_ids["en_train_small"] = en_train_small

# dev files large
de_dev_large = de_twitter_neg_dev_large + de_twitter_neut_dev_large + de_twitter_pos_dev_large + de_review_neg_dev_large + de_review_neut_dev_large + de_review_pos_dev_large
random.shuffle(de_dev_large)
all_ids["de_dev_large"] = de_dev_large

en_dev_large = en_twitter_neg_dev_large + en_twitter_neut_dev_large + en_twitter_pos_dev_large + en_review_neg_dev_large + en_review_neut_dev_large + en_review_pos_dev_large
random.shuffle(en_dev_large)
all_ids["en_dev_large"] = en_dev_large

# dev files medium
de_dev_medium = de_twitter_neg_dev_medium + de_twitter_neut_dev_medium + de_twitter_pos_dev_medium + de_review_neg_dev_medium + de_review_neut_dev_medium + de_review_pos_dev_medium
random.shuffle(de_dev_medium)
all_ids["de_dev_medium"] = de_dev_medium

en_dev_medium = en_twitter_neg_dev_medium + en_twitter_neut_dev_medium + en_twitter_pos_dev_medium + en_review_neg_dev_medium + en_review_neut_dev_medium + en_review_pos_dev_medium
random.shuffle(en_dev_medium)
all_ids["en_dev_medium"] = en_dev_medium

# dev files small
de_dev_small = de_twitter_neg_dev_small + de_twitter_neut_dev_small + de_twitter_pos_dev_small + de_review_neg_dev_small + de_review_neut_dev_small + de_review_pos_dev_small
random.shuffle(de_dev_small)
all_ids["de_dev_small"] = de_dev_small

en_dev_small = en_twitter_neg_dev_small + en_twitter_neut_dev_small + en_twitter_pos_dev_small + en_review_neg_dev_small + en_review_neut_dev_small + en_review_pos_dev_small
random.shuffle(en_dev_small)
all_ids["en_dev_small"] = en_dev_small
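
The cells above repeat the same steps for every language, source and label. As an illustration only, the same result could be built with loops and a dictionary. This sketch uses toy id lists instead of the real data, and its `extract_splits` is a simplified stand-in that takes dictionaries of sizes, not the function defined above.

```python
import random


def extract_splits(ids, test_size, train_sizes, dev_sizes):
    """Cut one shuffled id list into a test block, nested train blocks and nested dev blocks."""
    largest_train = max(train_sizes.values())
    train_pool = ids[test_size:test_size + largest_train]
    dev_pool = ids[test_size + largest_train:]
    splits = {"test": ids[:test_size]}
    for name, size in train_sizes.items():
        splits[f"train_{name}"] = train_pool[:size]
    for name, size in dev_sizes.items():
        splits[f"dev_{name}"] = dev_pool[:size]
    return splits


train_sizes = {"small": 100, "medium": 500, "large": 1000}
dev_sizes = {"small": 42, "medium": 214, "large": 428}

# toy stand-in for the 12 balanced id lists (2 languages x 2 sources x 3 labels)
rng = random.Random(0)
groups = {}
next_id = 0
for lang in ("de", "en"):
    for source in ("twitter", "review"):
        for label in ("neg", "neut", "pos"):
            ids = list(range(next_id, next_id + 2000))
            next_id += 2000
            rng.shuffle(ids)
            groups[(lang, source, label)] = ids

# merge the per-group splits into one id list per language and split name
all_ids = {}
for (lang, source, label), ids in groups.items():
    for split_name, split_ids in extract_splits(ids, 20, train_sizes, dev_sizes).items():
        all_ids.setdefault(f"{lang}_{split_name}", []).extend(split_ids)

# shuffle the merged lists so that groups are not stored in blocks
for merged in all_ids.values():
    rng.shuffle(merged)

print({key: len(value) for key, value in all_ids.items()})
```

Keeping the groups in a dictionary also makes it easy to check that test, train and dev ids do not overlap.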

In [24]:
# once again look at the length of the splits
for k,v in all_ids.items():
    print(k, ": ",len(v)," instances")

de_test :  120  instances
en_test :  120  instances
de_train_large :  6000  instances
en_train_large :  6000  instances
de_train_medium :  3000  instances
en_train_medium :  3000  instances
de_train_small :  600  instances
en_train_small :  600  instances
de_dev_large :  2568  instances
en_dev_large :  2568  instances
de_dev_medium :  1284  instances
en_dev_medium :  1284  instances
de_dev_small :  252  instances
en_dev_small :  252  instances


In [25]:
# save indices to file
with open('data/ids.pkl', 'wb') as file:
    pickle.dump(all_ids, file)

## Translating the dataset

Now I still need to translate the datasets. For this I use Google Translate via the `deep_translator` package.

In [49]:
# in case you want to get the indices by loading from file, use the following:
with open('data/ids.pkl', 'rb') as file:
    all_ids = pickle.load(file)

In [4]:
g2e_translator = GoogleTranslator(source="de",target="en")
e2g_translator = GoogleTranslator(source="en",target="de")

In [5]:
# translate a text represented as string with help of the given translator
def translate_text(translator, text):
    return translator.translate(text=text)

In [106]:
# test if translation works using some example instances
german_example = dataset["train"][all_ids["de_test"][0]]["text"]
print(german_example)
print(translate_text(g2e_translator,german_example))
print("")
english_example = dataset["train"][all_ids["en_test"][0]]["text"]
print(english_example)
print(translate_text(e2g_translator,english_example))

Sehr guter Film...spannend , sehenswert
Very good film...exciting, worth seeing

Book was good, easy to read, plot was low keyed. Could read but put down and come back to read later . Was not a thriller ,you could not put down. Just a good book.
Das Buch war gut, leicht zu lesen, die Handlung war zurückhaltend. Konnte es lesen, aber zur Seite legen und später wiederkommen, um es zu lesen. War kein Thriller, den man nicht aus der Hand legen konnte. Einfach ein gutes Buch.


In [27]:
# also try if the translate_batch command works
t_pl = ["Ich hab dich lieb","Ich mag dich","Du bist toll"]
translated = GoogleTranslator('de', 'en').translate_batch(t_pl)
print(translated)

['I love you', 'I like you', 'You are awesome']


In [51]:
def translate_ids(data,ids,src,tgt):
    """Translate the texts that belong to the given ids in one batch.

    Args:
        data: Dataset from huggingface
        ids (list): list of ids
        src (str): represents source language
        tgt (str): represents target language

    Returns:
        dict: key = id, value = text translated into the target language
    """
    texts = [data["train"][data_id]["text"][:4999] for data_id in ids]
    translated_texts = GoogleTranslator(src,tgt).translate_batch(texts)
    translations = dict()
    for i in range(len(ids)):
        translations[ids[i]] = translated_texts[i] # key = id, value = translation
    return translations
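
`translate_ids` uses each `_id` directly as a row index. This only works if `_id` equals the row position in `dataset["train"]`, which holds for the first sample shown above but is not checked here for all rows. The following sketch shows an explicit lookup on toy rows; `build_id_index`, `texts_for_ids` and the toy rows are assumptions made for illustration.

```python
def build_id_index(rows):
    """Map each '_id' value to its row position."""
    return {row["_id"]: position for position, row in enumerate(rows)}


def texts_for_ids(rows, ids, id_index, max_chars=4999):
    # truncate like the notebook does, to stay below the 5000-character limit
    return [rows[id_index[i]]["text"][:max_chars] for i in ids]


# toy rows whose ids do not match their positions
toy_rows = [
    {"_id": 10, "text": "Sehr guter Film"},
    {"_id": 11, "text": "Einfach ein gutes Buch"},
    {"_id": 12, "text": "x" * 6000},
]
id_index = build_id_index(toy_rows)

texts = texts_for_ids(toy_rows, [12, 10], id_index)
print([len(text) for text in texts])
```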

In [53]:
# do batched translations and save them to file

# translate german to english
de_train_translation = translate_ids(dataset,all_ids["de_train_large"],"de","en")
print("de_train_translation done")
de_dev_translation = translate_ids(dataset,all_ids["de_dev_large"],"de","en")
print("de_dev_translation done")
de_test_translation = translate_ids(dataset,all_ids["de_test"],"de","en")
print("de_test_translation done")

# translate english to german
en_train_translation = translate_ids(dataset,all_ids["en_train_large"],"en","de")
print("en_train_translation done")
en_dev_translation = translate_ids(dataset,all_ids["en_dev_large"],"en","de")
print("en_dev_translation done")
en_test_translation = translate_ids(dataset,all_ids["en_test"],"en","de")
print("en_test_translation done")

translations = dict(**de_train_translation,**de_dev_translation,**de_test_translation,**en_train_translation,**en_dev_translation,**en_test_translation)

with open("data/translations.pkl", 'wb') as file:
    pickle.dump(translations, file)

de_train_translation done
de_dev_translation done
de_test_translation done
en_train_translation done
en_dev_translation done
en_test_translation done


In [9]:
# # translate german text content of according indices and save translations to files
# language = "de_test"
# translations = dict()
# translator = g2e_translator
# target = "english"
# checkpoint_frequency = 1000
# for split in ids[language].keys():
#     checkpoint_filename = "data/" + language + "2" + target + "_" + split+ "_checkpoint.pkl"
#     filename = "data/" + language + "2" + target + "_" + split+ ".pkl"
#     if os.path.exists(checkpoint_filename):
#         with open(checkpoint_filename, 'rb') as file:
#             translations = pickle.load(file)
#     for ind in tqdm(ids[language][split]):
#         if ind not in translations:
#             text = dataset["train"][ind]["text"][:4999] # max amount of chars is 5000
#             translations[ind] = translate_text(translator,text=text)
#             # save intermediate progress of translations to file
#         if len(translations) % checkpoint_frequency == 0: 
#             with open(checkpoint_filename, 'wb') as file:
#                 pickle.dump(translations, file)
#             time.sleep(60) # wait a minute until translating again

#     if set(translations.keys()) == set(ids[language][split]):
#         print("Finished translating ",split, " split from ",language," into ",target)
#     else:
#         print("Unfortunately there are some IDs missing")
#         break
    
#     # save final file version
#     with open(filename, 'wb') as file:
#         pickle.dump(translations, file)
    
#     os.remove(checkpoint_filename)

  3%|▎         | 6978/210147 [14:57<7:15:30,  7.78it/s] 


TooManyRequests: Server Error: You made too many requests to the server.According to google, you are allowed to make 5 requests per secondand up to 200k requests per day. You can wait and try again later oryou can try the translate_batch function
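
One common way to deal with such rate-limit errors is to retry with growing waiting times. The sketch below shows this idea with a generic wrapper and a toy function that fails twice. It is not what was used in this notebook (the batched translation above was used instead), and `call_with_backoff`, `flaky_translate` and the delay values are assumptions made for illustration.

```python
import time


def call_with_backoff(fn, *args, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            # in real use, catch only the rate-limit error of the translation library
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# toy stand-in for a translator call that fails twice before it succeeds
calls = {"n": 0}


def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("too many requests")
    return text.upper()


# record the waiting times instead of actually sleeping
waits = []
result = call_with_backoff(flaky_translate, "hallo", sleep=waits.append)
print(result, waits)
```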

## Analysis

Use the file ``data_analysis.py`` to analyze the dataset for e.g. distribution of labels across datasplits.
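
The content of `data_analysis.py` is not shown here, so the following is only an illustrative sketch of one such check and not the `analyze_dataset` function itself. It counts the labels per split for a dictionary of id lists; `label_distribution` and the toy data are assumptions.

```python
from collections import Counter


def label_distribution(split_ids, id2label):
    """Return {split name: Counter of labels} for a dict of id lists."""
    return {
        split: Counter(id2label[i] for i in ids)
        for split, ids in split_ids.items()
    }


# toy data: two tiny splits that are balanced across the three labels
toy_id2label = {
    0: "negative", 1: "neutral", 2: "positive",
    3: "negative", 4: "neutral", 5: "positive",
}
toy_splits = {"de_test": [0, 1, 2], "de_train_small": [3, 4, 5]}

distribution = label_distribution(toy_splits, toy_id2label)
for split, counts in distribution.items():
    print(split, dict(counts))
```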