# Dataset Creation Notebook





In this notebook, we read from three datasets (links are provided in the README of the dataset folder):
* `BRM-emot-submit.csv`: provides our candidate stimuli and their emotional valence $\to$ `stimuli_and_valence_df`
* `SUBTLEXusfrequencyabove1.xls`: provides word frequencies $\to$ `freq_df`
* `CogNet-v1.0.tsv`: provides words and their cognates across languages $\to$ `cognate_df`

We take our candidate stimuli from `stimuli_and_valence_df` and retain only those that meet a set of predefined conditions.
The remaining stimuli (words) are then categorized as positive, negative, or neutral, depending on their valence values.
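
The categorization step can be sketched in isolation as follows. This is only a minimal illustration: the function name `categorize_valence` is hypothetical, and the thresholds are the ones stated in the comments of the processing cell in this notebook (3 or lower, 7 or higher, 4.5 to 5.5 inclusive).

```python
from typing import Optional


def categorize_valence(valence: float) -> Optional[str]:
    """Map a continuous valence rating to a category, or None if it is ambiguous.

    Hypothetical helper; the thresholds mirror those used in this notebook.
    """
    if valence <= 3:
        return "negative"
    if valence >= 7:
        return "positive"
    if 4.5 <= valence <= 5.5:
        return "neutral"
    # Ratings strictly between 3 and 4.5, or strictly between 5.5 and 7,
    # match none of the comparisons above. The same holds for NaN.
    return None


for rating in (1.53, 5.43, 7.00, 6.2):
    print(rating, categorize_valence(rating))
```

Returning `None` for the in-between ratings keeps the "skip fuzzy boundary cases" decision in one place, so the caller only has to check for `None`.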

> Note: the datasets parsed here are fairly large, so running this notebook may take a while.
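
One place where run time might be reduced is the cognate lookup, which the code cell builds with a row-by-row loop. The sketch below shows a possible alternative using boolean masks. It is an assumption-laden illustration, not the notebook's method: the helper name `english_german_cognates` and the toy rows are made up, and only the column names (`word 1`, `lang 1`, `word 2`, `lang 2`) and language codes (`eng`, `deu`) are taken from the code cell.

```python
import pandas as pd

# Toy stand-in for CogNet-v1.0.tsv (illustrative rows only, not real CogNet data).
toy_cognates = pd.DataFrame({
    "word 1": ["house", "Hund", "night", "nuit"],
    "lang 1": ["eng", "deu", "eng", "fra"],
    "word 2": ["Haus", "hound", "noche", "night"],
    "lang 2": ["deu", "eng", "spa", "eng"],
})


def english_german_cognates(df: pd.DataFrame) -> set:
    """Collect English words from rows that pair English with German."""
    eng_first = (df["lang 1"] == "eng") & (df["lang 2"] == "deu")
    deu_first = (df["lang 1"] == "deu") & (df["lang 2"] == "eng")
    # Take the English side of each matching row, whichever column it is in.
    words = pd.concat([df.loc[eng_first, "word 1"], df.loc[deu_first, "word 2"]])
    return set(words.astype(str))


print(sorted(english_german_cognates(toy_cognates)))
```

The masks select all matching rows at once, so no Python-level loop over the rows is needed. Whether this makes a noticeable difference depends on the size of the file.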

In [1]:
# todo: comments & make pretty

import pandas as pd
import nltk

# -----------------------------
# Load datasets
# -----------------------------
""" We read the datasets that we have collected (i.e. downloaded).
-> If you run this, make sure they are in the same folder as this notebook."""
stimuli_and_valence_df = pd.read_csv("BRM-emot-submit.csv", index_col=0)        # the words themselves, along with their emotional valence
freq_df = pd.read_excel("SUBTLEXusfrequencyabove1.xls")        # the frequency of these words (reading .xls files requires the xlrd package)
cognate_df = pd.read_csv("CogNet-v1.0.tsv", sep="\t")        # the cognate status

# -----------------------------
# Precompute frequency dictionary
# -----------------------------
"""We build a dictionary from the frequency dataset that keeps only the "Lg10WF" (log10 word frequency) values
-> the dataset is very big, and looking words up in a dictionary helps with computation time"""
freq_dict = dict(zip(freq_df["Word"].astype(str), freq_df["Lg10WF"])) # e.g., {"dog": 2.111, ...}

# -----------------------------
# Precompute POS tags for ALL stimuli
# -----------------------------
"""nltk.pos_tag gives us part-of-speech (POS) tags for all of our candidate words.
(It may seem odd to tag all words when we only use a few of them later, but tagging everything in one call is quicker than calling the tagger once per word.)
-> the tagger is not always accurate, especially since the word list is tagged without any sentence context, so we still have to go over the data manually later to find potential mistakes
-> but it is the easiest and quickest way to do this, especially for nouns (as in our case)"""
stimuli = stimuli_and_valence_df["Word"].astype(str).tolist()
tagged = dict(nltk.pos_tag(stimuli))   # e.g., {"dog": "NN", ...}

# -----------------------------
# Precompute cognate lookup
# -----------------------------
"""We go through the cognate dataset. Each row holds one cognate pair: word 1 with its language (lang 1) and the corresponding word 2 with its language (lang 2).
To make things easier for us, we only use the rows where one of the two languages is English and the other one is German."""
# Create a set of English words where the matching row
# contains German ("deu") as the other language.
cognate_set = set()

for _, r in cognate_df.iterrows():
    w1, l1 = r["word 1"], r["lang 1"]
    w2, l2 = r["word 2"], r["lang 2"]

    # We only care about rows where one language is English and the *other* language is German
    if l1 == "eng" and l2 == "deu":
        cognate_set.add(str(w1))
    if l2 == "eng" and l1 == "deu":
        cognate_set.add(str(w2))

# -----------------------------
# Process rows
# -----------------------------
"""Now that we have prepared everything we need, we can start processing our dataset.
For each word, we check our dataset conditions (through the "if" statements).
Dataset conditions:
- length shorter than 8 characters (i.e. at most 7 characters long)
- part of speech is noun
- frequency (Lg10WF) higher than 2
- not a cognate

If any of these conditions is not met, we "continue", which means that the word is dropped (i.e. not added to the dataset)
and we move on to the next one.

If the word passes all the checks, we look up its emotional valence and assign a categorical value to it (pos/neg/neutr):
- pos: continuous valence of 7 or higher
- neg: continuous valence of 3 or lower
- neutral: continuous valence between 4.5 and 5.5 (inclusive)
Everything that falls outside of these ranges is left out of the dataset, because we want clear distinctions between the valence categories
-> we don't want fuzzy boundaries

After that, we add the word to our dataset, along with the additional information we have collected about it along the way.

Note: we count how many words of each valence category we have. We want 30 of each in the end, but at this point the numbers should be higher,
because we will go over the dataset manually later (to account for any potential mistakes in the POS and non-cognate assignments).
We also want to make sure that only words with a clear valence are included, so we will check that manually too.
Therefore, the numbers should be high enough to allow for manual post-hoc deletions.
"""
rows = []
neg_count = pos_count = neutr_count = none_count = 0

for _, row in stimuli_and_valence_df.iterrows():
    stimulus = str(row["Word"])
    valence = row["V.Mean.Sum"]

    # Length filter
    if len(stimulus) > 7:
        continue

    # POS filter (noun)
    tag = tagged.get(stimulus, "")
    if not tag.startswith("NN"):
        continue

    # Frequency filter
    freq = freq_dict.get(stimulus, 0) # if the word does not exist in the freq dataset, we assume its frequency is 0
    if freq <= 2:
        continue

    # Cognate check
    ## for now, we assume that anything that is not in the cognate dataset is not a cognate
    ## of course this may be wrong, as a cognate could simply be missing from the dataset
    ## we will go over this and check it manually later
    if stimulus in cognate_set:
        continue

    # process valence into categorical (cat_valence) and continuous (valence)
    cat_valence = None


    if valence <= 3:
        cat_valence = "negative"
        neg_count += 1

    elif valence >= 7:
        cat_valence = "positive"
        pos_count += 1

    elif 4.5 <= valence <= 5.5:
        cat_valence = "neutral"
        neutr_count += 1

    else:
        continue            # skip fuzzy boundary cases (none_count is not incremented here, so it always prints 0)

    # add the word, along with its continuous and categorical valence, its length, and frequency
    if pd.notna(valence):
        rows.append({
            "Experiment": "LDT",
            "Item_type": "exp",
            "Stimulus": stimulus.upper(),
            "Condition": "word",
            "Cognate_status": "noncognate",
            "Emotional_valence_cat": cat_valence,
            "Emotional_valence_cont": valence,
            "Length": len(stimulus),
            "Lg10SUBTLEX_US": freq,
            "correct_response": "j"
        })

# -----------------------------
# Output
# -----------------------------
print(f"Positive: {pos_count}, Negative: {neg_count}, Neutral: {neutr_count}, None: {none_count}")
exp_df = pd.DataFrame(rows)
exp_df


Positive: 142, Negative: 179, Neutral: 595, None: 0


Unnamed: 0,Experiment,Item_type,Stimulus,Condition,Cognate_status,Emotional_valence_cat,Emotional_valence_cont,Length,Lg10SUBTLEX_US,correct_response
0,LDT,exp,ABDOMEN,word,noncognate,neutral,5.43,7,2.235528,j
1,LDT,exp,ABILITY,word,noncognate,positive,7.00,7,2.991669,j
2,LDT,exp,ABSORB,word,noncognate,neutral,5.50,6,2.008600,j
3,LDT,exp,ABUSE,word,noncognate,negative,1.53,5,2.719331,j
4,LDT,exp,ACCOUNT,word,noncognate,neutral,5.39,7,3.358125,j
...,...,...,...,...,...,...,...,...,...,...
911,LDT,exp,YEN,word,noncognate,neutral,5.00,3,2.495544,j
912,LDT,exp,YUMMY,word,noncognate,positive,7.52,5,2.359835,j
913,LDT,exp,ZAP,word,noncognate,neutral,5.39,3,2.100371,j
914,LDT,exp,ZIP,word,noncognate,neutral,5.06,3,2.591065,j
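
The per-category counts can also be read directly from the resulting DataFrame, which is handy for re-checking them after manual deletions. The following is a minimal sketch on a toy frame: `toy_exp_df` and `TARGET_PER_CATEGORY` are hypothetical names, the four rows are copied from the output above, and the target of 30 per category is the one mentioned in the processing comments.

```python
import pandas as pd

# Toy frame with four rows copied from the output table above.
toy_exp_df = pd.DataFrame({
    "Stimulus": ["ABUSE", "ABILITY", "ABSORB", "ACCOUNT"],
    "Emotional_valence_cat": ["negative", "positive", "neutral", "neutral"],
})
TARGET_PER_CATEGORY = 30  # hypothetical constant; 30 per category is the stated goal

# Count rows per category, then compute how many words each category still lacks.
counts = toy_exp_df["Emotional_valence_cat"].value_counts()
shortfall = (TARGET_PER_CATEGORY - counts).clip(lower=0)

print(counts.to_dict())
print(shortfall.to_dict())
```

Run on the real `exp_df`, a shortfall of zero for every category would indicate that enough candidates remain to reach the target.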


In [None]:
# Save to file
exp_df.to_csv("dataframe_exp.csv", encoding="utf-8", index=False, sep=";")
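
Because the file is written with a semicolon as separator, it has to be read back with the same `sep`. Otherwise pandas, which defaults to a comma, puts each line into a single column. The sketch below illustrates the round trip using an in-memory buffer instead of a real file. The toy frame `toy_exp_df` is hypothetical and only reuses some of the notebook's column names and values.

```python
import io

import pandas as pd

# Toy frame reusing some of the notebook's column names (values from the output above).
toy_exp_df = pd.DataFrame({
    "Stimulus": ["ABUSE", "ABILITY"],
    "Emotional_valence_cont": [1.53, 7.00],
    "Length": [5, 7],
})

# Write to an in-memory buffer with the same separator as the notebook uses.
buffer = io.StringIO()
toy_exp_df.to_csv(buffer, index=False, sep=";")

# Read it back, passing the same separator so the columns are split correctly.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```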