Author: [Jule Godbersen](mailto:godbersj@tcd.ie)

Content of file: Loading, preprocessing and analyzing the data used within this project

In [1]:
from collections import Counter
from datasets import load_dataset
import random
import pickle
import os
from deep_translator import GoogleTranslator
import time
from tqdm import tqdm
from data_analysis import analyze_dataset



### Accessing the [MMS dataset](https://brand24-ai.github.io/mms_benchmark/)

Note: Access to the dataset is restricted. Log in to your Hugging Face account and accept the terms of use on the [Hugging Face page of the dataset](https://huggingface.co/datasets/Brand24/mms). Also make sure you have run `huggingface-cli login`.

In [2]:
dataset = load_dataset("Brand24/mms",cache_dir="/mount/studenten-temp1/users/godberja/HuggingfaceCache")

Exploring the dataset a bit:

# Collection
Within this project I use the MMS dataset, a multilingual sentiment dataset that contains data for many languages. I will only use the English and German data.

In [4]:
labels = dataset["train"].features["label"].names
print(labels) 

['negative', 'neutral', 'positive']


In [5]:
# looking at the documentation I found that the labels are represented via indices within the dataset
label2ind = {"negative":0,"neutral":1,"positive":2}
ind2label = {0:"negative",1:"neutral",2:"positive"}

In [6]:
# look at an example data sample
dataset["train"][0]

{'_id': 0,
 'text': '" آللهمَ إني أسألك من الخير كله ..عاجله وآجله ، ما علمت منه وما لم أعلم،،وأعوذ بك من الشر كله ..عاجله وآجله، ما علمت منه وما لم أعلم "',
 'label': 2,
 'original_dataset': 'ar_arsentdl',
 'domain': 'social_media',
 'language': 'ar',
 'Family': 'Afro-Asiatic',
 'Genus': 'Semitic',
 'Definite articles': 'definite affix',
 'Indefinite articles': 'no article',
 'Number of cases': '3',
 'Order of subject, object, verb': 'SVO',
 'Negative morphemes': 'negative particle',
 'Polar questions': 'interrogative intonation only',
 'Position of negative word wrt SOV': 'SNegVO',
 'Prefixing vs suffixing': 'weakly suffixing',
 'Coding of nominal plurality': 'mixed morphological plural',
 'Grammatical genders': 'masculine, feminine',
 'cleanlab_self_confidence': 0.21071475744247437}

In [7]:
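# check the distribution of languages and labels across the dataset
# note: `labels` held the label names before and is reused here for the label index of every sample
languages = []
labels = []
for data_sample in dataset["train"]:
    languages.append(data_sample["language"])
    labels.append(data_sample["label"])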

In [8]:
print("Distribution of data samples across languages: ",Counter(languages))

Distribution of data samples across languages:  Counter({'en': 2330486, 'ar': 932075, 'es': 418712, 'zh': 331702, 'de': 315887, 'pl': 236688, 'fr': 210631, 'ja': 209780, 'cs': 196287, 'pt': 157834, 'sl': 113543, 'ru': 110930, 'hr': 77594, 'sr': 76368, 'th': 72319, 'bg': 62150, 'hu': 56682, 'sk': 56623, 'sq': 44284, 'sv': 41346, 'bs': 36183, 'ur': 19660, 'hi': 16999, 'fa': 13525, 'it': 12065, 'he': 8619, 'lv': 5790, 'el': 500})


In [9]:
print("Distribution of data samples across sentiment classes: ",Counter(labels))

Distribution of data samples across sentiment classes:  Counter({2: 3494710, 1: 1341392, 0: 1329160})
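
The raw counts are easier to compare as shares. The following is a minimal sketch that hard-codes the counts printed above instead of reloading the dataset; `label_shares` is a hypothetical helper name and not part of the project code.

```python
from collections import Counter

# counts copied from the output above (label index -> number of samples)
label_counts = Counter({2: 3494710, 1: 1341392, 0: 1329160})
ind2label = {0: "negative", 1: "neutral", 2: "positive"}


def label_shares(counts, names):
    """Return the fraction of samples per label name."""
    total = sum(counts.values())
    return {names[index]: count / total for index, count in counts.items()}


shares = label_shares(label_counts, ind2label)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

More than half of all samples carry the positive label, so the full dataset is clearly imbalanced.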


# Preparation

## Filter dataset for languages

Only get the German data:

In [10]:
# filter dataset for German only
german = dataset.filter(lambda row: row["language"] == "de")

In [11]:
print("There are",len(german["train"]),"data samples available in German.")

There are 315887 data samples available in German.


In [12]:
# just out of interest: check dataset sources for german
german_sources = []
for datasample in german["train"]:
    german_sources += [datasample["original_dataset"]]
print("Sources of datasets: ",Counter(german_sources))

Sources of datasets:  Counter({'de_multilan_amazon': 209073, 'de_twitter_sentiment': 90534, 'de_sb10k': 9948, 'de_omp': 3598, 'de_dai_labor': 1781, 'de_ifeel': 953})


Only get the English data:

In [13]:
english = dataset.filter(lambda row: row["language"] == "en")

In [14]:
print("There are",len(english["train"]),"data samples available in English.")

There are 2330486 data samples available in English.


## Selection of data samples and creation of data splits

I now have access to German and English data. All samples sit in a single training split and are probably imbalanced with respect to class labels, as the label distribution of the full dataset above suggests. I therefore balance the data and divide it into three splits for training, development and testing, using a ratio of 70:20:10.

Note that I ignore the dataset source when balancing the data.

### Balance for labels

In [41]:
# function to extract the ids of the instances in the given dataset
def get_ids(data):
    return [instance["_id"] for instance in data]

In [48]:
# function to balance the dataset for having the same amount of instances for each label
def balance_data(data):

    # split data according to labels
    neg_instances = data.filter(lambda row: row["label"]==0)["train"]
    all_neg_ids = get_ids(neg_instances)
    neut_instances = data.filter(lambda row: row["label"]==1)["train"]
    all_neut_ids = get_ids(neut_instances)
    pos_instances = data.filter(lambda row: row["label"]==2)["train"]
    all_pos_ids = get_ids(pos_instances)

    # get the minimum amount of instances we can extract for each of the labels 
    class_minimum = min(len(all_neg_ids),len(all_neut_ids),len(all_pos_ids))
    print("We have ",class_minimum," instances for all three labels")

    # shuffle list with ids to make sure we randomly choose instances
    random.shuffle(all_neg_ids)
    random.shuffle(all_neut_ids)
    random.shuffle(all_pos_ids)

    # extract class_minimum amount of (shuffled!) instances for each of the labels
    neg_ids = all_neg_ids[:class_minimum]
    neut_ids = all_neut_ids[:class_minimum]
    pos_ids = all_pos_ids[:class_minimum]

    return neg_ids,neut_ids,pos_ids 

In [49]:
# balance the German data
g_neg, g_neut, g_pos = balance_data(german)

Filter: 100%|██████████| 315887/315887 [00:18<00:00, 17035.06 examples/s]
Filter: 100%|██████████| 315887/315887 [00:18<00:00, 17091.37 examples/s]
Filter: 100%|██████████| 315887/315887 [00:18<00:00, 16853.64 examples/s]


We have  100071  instances for all three labels


In [57]:
# balance the English data
e_neg, e_neut, e_pos = balance_data(english)

Filter: 100%|██████████| 2330486/2330486 [02:57<00:00, 13130.58 examples/s]
Filter: 100%|██████████| 2330486/2330486 [02:57<00:00, 13103.31 examples/s]
Filter: 100%|██████████| 2330486/2330486 [02:56<00:00, 13169.29 examples/s]


We have  290823  instances for all three labels
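
Because `random.shuffle` is called without a fixed seed, every run of `balance_data` selects different instances. The following is a minimal sketch of a reproducible variant on toy ids. The name `balance_ids`, the `seed` parameter and the toy data are assumptions for illustration, not the code that produced the splits in this project.

```python
import random


def balance_ids(ids_per_label, seed=42):
    """Downsample every label to the size of the smallest one, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    class_minimum = min(len(ids) for ids in ids_per_label.values())
    balanced = {}
    for label, ids in ids_per_label.items():
        shuffled = list(ids)  # copy so the input lists stay unchanged
        rng.shuffle(shuffled)
        balanced[label] = shuffled[:class_minimum]
    return balanced


# toy ids standing in for the "_id" values of the real dataset
toy_ids = {
    "negative": list(range(0, 5)),
    "neutral": list(range(10, 13)),
    "positive": list(range(20, 40)),
}
balanced = balance_ids(toy_ids)
print({label: len(ids) for label, ids in balanced.items()})
```

Calling `balance_ids` twice with the same seed returns the same selection, which makes the later splits repeatable.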


### Split into train, dev, test

In [96]:
# function for splitting a list of indices (for a certain label) into train, dev and test splits of ratio 70:20:10
def create_splits_for_list(list_of_indices):
    # identify amount of instances in train and dev split (the amount of test is the rest)
    num_all_instances = len(list_of_indices)
    num_train_instances = int(num_all_instances * 0.7)
    num_dev_instances = int(num_all_instances * 0.2)

    # extract the specific amount of instances and put into different splits (note: as the input was already shuffled before, there is no need to do that here again)
    train = list_of_indices[:num_train_instances]
    dev = list_of_indices[num_train_instances:num_train_instances + num_dev_instances]
    test = list_of_indices[num_train_instances + num_dev_instances:]

    return train, dev, test

In [97]:
# function to generate final train, dev, test splits (represented via lists of indices)
def create_splits(neg,neut,pos):
    # create individual splits, still considering different labels
    neg_train, neg_dev, neg_test = create_splits_for_list(neg)
    neut_train, neut_dev, neut_test = create_splits_for_list(neut)
    pos_train, pos_dev, pos_test = create_splits_for_list(pos)

    # join ids of different labels into split each
    train = neg_train + neut_train + pos_train
    dev = neg_dev + neut_dev + pos_dev
    test = neg_test + neut_test + pos_test

    # shuffle the data to make sure the dataset instances are not sorted according to the labels
    random.shuffle(train)
    random.shuffle(dev)
    random.shuffle(test)
    return train, dev, test

In [98]:
# generate splits for German
g_train, g_dev, g_test = create_splits(g_neg, g_neut, g_pos)

In [100]:
# generate splits for English
e_train, e_dev, e_test = create_splits(e_neg, e_neut, e_pos)
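
The resulting split sizes can be checked by repeating the arithmetic of `create_splits_for_list` on the per-label count alone. This is a small sketch; `split_sizes` is a hypothetical helper that only mirrors the calculation and does not touch the data.

```python
def split_sizes(num_instances, train_ratio=0.7, dev_ratio=0.2):
    """Mirror the arithmetic of create_splits_for_list: test receives the remainder."""
    num_train = int(num_instances * train_ratio)
    num_dev = int(num_instances * dev_ratio)
    num_test = num_instances - num_train - num_dev
    return num_train, num_dev, num_test


# 100071 instances per label are available for German (see the balancing output)
per_label = split_sizes(100071)
# each final split joins the negative, neutral and positive ids
train_total, dev_total, test_total = (3 * size for size in per_label)
print(per_label, train_total, dev_total, test_total)
```

For German this gives 70049, 20014 and 10008 ids per label, so the training split holds 210147 ids in total.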

In [101]:
# not sure if I make use of this later, but also consider German data with half amount of instances
half_g_size = int(len(g_neg)/2)
g_train_small, g_dev_small, g_test_small = create_splits(g_neg[:half_g_size], g_neut[:half_g_size], g_pos[:half_g_size])

In [102]:
# save indices to file
indices = {"german": {"train":g_train, "dev":g_dev, "test":g_test},
           "german_small": {"train":g_train_small, "dev":g_dev_small, "test":g_test_small},
           "english": {"train":e_train, "dev":e_dev, "test":e_test}}

with open('data/indices.pkl', 'wb') as file:
    pickle.dump(indices, file)
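
Before relying on the saved indices it can be worth confirming that no id appears in more than one split. The following is a minimal sketch on toy data; `check_splits` is a hypothetical helper and not part of the project code.

```python
def check_splits(splits):
    """Return True if no id occurs twice within a split or in more than one split."""
    seen = set()
    for ids in splits.values():
        unique_ids = set(ids)
        if len(unique_ids) != len(ids):  # duplicate inside one split
            return False
        if seen & unique_ids:  # id shared with an earlier split
            return False
        seen |= unique_ids
    return True


toy_splits = {"train": [1, 2, 3, 4], "dev": [5, 6], "test": [7]}
leaky_splits = {"train": [1, 2, 3], "dev": [3, 4], "test": [5]}
print(check_splits(toy_splits), check_splits(leaky_splits))
```

With the real data the same check could be run as `check_splits(indices["german"])`.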

## Translating the dataset

Now I still need to translate the datasets. I'm using Google Translate through the `deep_translator` package.

In [3]:
# in case you want to get the indices by loading from file, use the following:
with open('data/indices.pkl', 'rb') as file:
    indices = pickle.load(file)

In [4]:
g2e_translator = GoogleTranslator(source="de",target="en")
e2g_translator = GoogleTranslator(source="en",target="de")

In [5]:
# translate a text represented as string with help of the given translator
def translate_text(translator, text):
    return translator.translate(text=text)

In [106]:
# test if translation works using some example instances
german_example = dataset["train"][indices["german"]["train"][0]]["text"]
print(german_example)
print(translate_text(g2e_translator,german_example))
print("")
english_example = dataset["train"][indices["english"]["train"][0]]["text"]
print(english_example)
print(translate_text(e2g_translator,english_example))

Sehr guter Film...spannend , sehenswert
Very good film...exciting, worth seeing

Book was good, easy to read, plot was low keyed. Could read but put down and come back to read later . Was not a thriller ,you could not put down. Just a good book.
Das Buch war gut, leicht zu lesen, die Handlung war zurückhaltend. Konnte es lesen, aber zur Seite legen und später wiederkommen, um es zu lesen. War kein Thriller, den man nicht aus der Hand legen konnte. Einfach ein gutes Buch.


In [107]:
# no need to translate german_small data as its instances are part of larger german data
del indices["german_small"] 

In [12]:
t_pl = ["Ich hab dich lieb","Ich mag dich","Du bist toll"]
translated = GoogleTranslator('de', 'en').translate_batch(t_pl)
print(translated)

TooManyRequests: Server Error: You made too many requests to the server.According to google, you are allowed to make 5 requests per secondand up to 200k requests per day. You can wait and try again later oryou can try the translate_batch function
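
The error above is a rate limit, so one option is to wait and retry with growing pauses. The following is a minimal sketch of such a retry loop with exponential backoff. `call_with_backoff` and all of its parameters are hypothetical, and the example uses a toy function instead of the real translator; in the notebook one would pass the translator call and the rate-limit exception class of `deep_translator` as `retry_on` (which class that is should be checked in the library).

```python
import time


def call_with_backoff(func, *args, retry_on=Exception, max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on a retry_on error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# toy stand-in for a translator call that is rate limited twice before it succeeds
calls = {"count": 0}


def flaky_translate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("too many requests")
    return text.upper()


waits = []  # record the requested pauses instead of really sleeping
result = call_with_backoff(flaky_translate, "hallo", retry_on=RuntimeError, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast; with the default `time.sleep` the pauses double after every failed attempt.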

In [27]:
# batched version

# translate the German texts of the selected indices in batches and save the translations to files
language = "german"
translator = g2e_translator
target = "english"
chunksize = 1000
for split in indices[language].keys():
    checkpoint_filename = "data/" + language + "2" + target + "_" + split + "_checkpoint.pkl"
    filename = "data/" + language + "2" + target + "_" + split + ".pkl"
    # start every split with an empty dict, otherwise the completeness check below
    # fails as soon as translations of a previous split are still included
    translations = dict()
    # resume from a checkpoint if one exists for this split
    if os.path.exists(checkpoint_filename):
        with open(checkpoint_filename, 'rb') as file:
            translations = pickle.load(file)
    split_ids = indices[language][split]
    chunked_ids = [split_ids[i:i + chunksize] for i in range(0, len(split_ids), chunksize)]
    for chunked_id in tqdm(chunked_ids):
        if chunked_id[0] not in translations: # we haven't looked at this chunk yet
            texts = [dataset["train"][sample_id]["text"][:4999] for sample_id in chunked_id] # max amount of chars is 5000
            translated_texts = translator.translate_batch(texts)
            for sample_id, translated_text in zip(chunked_id, translated_texts):
                translations[sample_id] = translated_text
            # save intermediate progress of translations to file
            with open(checkpoint_filename, 'wb') as file:
                pickle.dump(translations, file)
            time.sleep(20) # wait until translating again

    if set(translations.keys()) == set(split_ids):
        print("Finished translating ",split, " split from ",language," into ",target)
    else:
        print("Unfortunately there are some IDs missing")
        break

    # save final file version
    with open(filename, 'wb') as file:
        pickle.dump(translations, file)

    # the checkpoint is no longer needed (it does not exist if nothing had to be translated)
    if os.path.exists(checkpoint_filename):
        os.remove(checkpoint_filename)
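
The chunking step of the cell above can be isolated in a small helper so that it can be checked on toy data. This is a sketch; `make_chunks` is a hypothetical name and the ids are made up.

```python
def make_chunks(items, chunksize):
    """Split a list into consecutive chunks of at most chunksize items."""
    return [items[start:start + chunksize] for start in range(0, len(items), chunksize)]


toy_ids = list(range(10))
chunks = make_chunks(toy_ids, 4)
print(chunks)
```

The last chunk is shorter whenever the number of ids is not a multiple of the chunk size, so no id is dropped.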


In [9]:
# # translate german text content of according indices and save translations to files
# language = "german"
# translations = dict()
# translator = g2e_translator
# target = "english"
# checkpoint_frequency = 1000
# for split in indices[language].keys():
#     checkpoint_filename = "data/" + language + "2" + target + "_" + split+ "_checkpoint.pkl"
#     filename = "data/" + language + "2" + target + "_" + split+ ".pkl"
#     if os.path.exists(checkpoint_filename):
#         with open(checkpoint_filename, 'rb') as file:
#             translations = pickle.load(file)
#     for ind in tqdm(indices[language][split]):
#         if ind not in translations:
#             text = dataset["train"][ind]["text"][:4999] # max amount of chars is 5000
#             translations[ind] = translate_text(translator,text=text)
#             # save intermediate progress of translations to file
#         if len(translations) % checkpoint_frequency == 0: 
#             with open(checkpoint_filename, 'wb') as file:
#                 pickle.dump(translations, file)
#             time.sleep(60) # wait a minute until translating again

#     if set(translations.keys()) == set(indices[language][split]):
#         print("Finished translating ",split, " split from ",language," into ",target)
#     else:
#         print("Unfortunately there are some IDs missing")
#         break
    
#     # save final file version
#     with open(filename, 'wb') as file:
#         pickle.dump(translations, file)
    
#     os.remove(checkpoint_filename)

  3%|▎         | 6978/210147 [14:57<7:15:30,  7.78it/s] 


TooManyRequests: Server Error: You made too many requests to the server.According to google, you are allowed to make 5 requests per secondand up to 200k requests per day. You can wait and try again later oryou can try the translate_batch function

In [None]:
# translate the English texts of the selected indices and save the translations to files
language = "english"
translator = e2g_translator
target = "german"
checkpoint_frequency = 1000
for split in indices[language].keys():
    checkpoint_filename = "data/" + language + "2" + target + "_" + split + "_checkpoint.pkl"
    filename = "data/" + language + "2" + target + "_" + split + ".pkl"
    # start every split with an empty dict, otherwise the completeness check below
    # fails as soon as translations of a previous split are still included
    translations = dict()
    # resume from a checkpoint if one exists for this split
    if os.path.exists(checkpoint_filename):
        with open(checkpoint_filename, 'rb') as file:
            translations = pickle.load(file)
    for ind in tqdm(indices[language][split]):
        if ind not in translations:
            text = dataset["train"][ind]["text"][:4999] # max amount of chars is 5000
            translations[ind] = translate_text(translator,text=text)
            # save intermediate progress of translations to file; this is nested in the branch
            # above so that already translated ids do not trigger a write and a pause
            if len(translations) % checkpoint_frequency == 0:
                with open(checkpoint_filename, 'wb') as file:
                    pickle.dump(translations, file)
                time.sleep(60) # wait a minute until translating again

    if set(translations.keys()) == set(indices[language][split]):
        print("Finished translating ",split, " split from ",language," into ",target)
    else:
        print("Unfortunately there are some IDs missing")
        break

    # save final file version
    with open(filename, 'wb') as file:
        pickle.dump(translations, file)

    # the checkpoint is no longer needed (it does not exist for splits with fewer ids than checkpoint_frequency)
    if os.path.exists(checkpoint_filename):
        os.remove(checkpoint_filename)
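
A translation run takes hours, so an interruption in the middle of `pickle.dump` could leave a damaged checkpoint behind. One possible safeguard is to write to a temporary file first and rename it afterwards. The following is a minimal sketch of that idea; `save_checkpoint`, `load_checkpoint` and the file name are hypothetical and not part of the project code.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Write obj to a temporary file first and then rename it to path."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as file:
            pickle.dump(obj, file)
        os.replace(tmp_path, path)  # swaps the file in one step on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial temporary file
        raise


def load_checkpoint(path):
    """Return the saved object, or an empty dict if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "english2german_train_checkpoint.pkl")
    save_checkpoint({17: "Hallo Welt"}, checkpoint_path)
    restored = load_checkpoint(checkpoint_path)
    print(restored)
```

If the write is interrupted, the previous checkpoint stays in place and only the temporary file is lost.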

## Analysis

In [23]:
from data_analysis import analyze_dataset
analyze_dataset(indices["german"],True,"data/_german_data_analysis.txt")

ImportError: cannot import name 'analyze_dataset2' from 'data_analysis' (/mount/studenten-temp1/users/godberja/GermanSentiment/data_analysis.py)