# Dataset Creation Notebook

This notebook collects stimuli for a psycholinguistic experiment on emotionally valent words and their perception (via reaction times).

> Code by: \
*Jasmin Orth, LMU Munich*

### Overview

In this notebook, we read from three datasets in total:
* [`BRM-emot-submit.csv`](https://link.springer.com/article/10.3758/s13428-012-0314-x): This dataset gives us our stimuli and their emotional valence $\to$ `stimuli_and_valence_df`
* [`SUBTLEXusfrequencyabove1.xls`](https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/overview.htm): This dataset gives us frequencies of words $\to$ `freq_df`
* [`CogNet-v1.0.tsv`](https://github.com/kbatsuren/CogNet/blob/master/CogNet-v1.0.zip): This dataset gives us words and their cognates across languages $\to$ `cognate_df`

We will take our potential stimuli from the `stimuli_and_valence_df`. Then, by checking against pre-set conditions, we only retain the ones that meet them. The resulting stimuli (words) are categorized into positive, negative, and neutral, depending on their valence values.

> Note: we are parsing pretty big datasets here, so some parts of this notebook might take a while to run.

### Conditions For Words To Be Included:

* length shorter than 8 characters (i.e. maximum 7 characters long)
* part-of-speech (POS) is noun
* log frequency is higher than 2
* non-cognates

## Preparation

### Imports

In [1]:
import pandas as pd
import nltk

### Load Datasets
We read the datasets that we have collected (i.e., downloaded).
> If you run this, make sure they are in the same folder as this code notebook.

In [4]:
stimuli_and_valence_df = pd.read_csv("BRM-emot-submit.csv", index_col=0)        # stimuli, emotional valence
freq_df = pd.read_excel("SUBTLEXusfrequencyabove1.xls")        # frequency
cognate_df = pd.read_csv("CogNet-v1.0.tsv", sep="\t")        # cognate status

In [None]:
# run to see our datasets
print(f"Stimuli Dataset: \n {stimuli_and_valence_df[:3]} \n")
print(f"Frequency Dataset: \n {freq_df[:3]} \n")
print(f"Cognate Dataset: \n {cognate_df[:3]} \n")

## Process Datasets

We will now parse the datasets we loaded. Then we take everything we need from them (e.g., frequencies) and/or perform preprocessing (e.g., POS tagging).

> We do this in advance because the datasets are pretty big and this helps with computation time.

### Precompute Frequency Dictionary

We parse our frequency dataset and take only the "Lg10WF" frequencies from it. We then save them in a dictionary to be used later.

In [12]:
freq_dict = dict(zip(freq_df["Word"].astype(str), freq_df["Lg10WF"])) # e.g., {"dog": 2.111, ...}

In [None]:
# run to see our frequency dictionary
freq_dict
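
As a rough illustration of how this dictionary is used later, the sketch below builds a toy version and looks up one word that is present and one that is missing. The words and values are invented (not taken from SUBTLEX-US), and the helper name `log_freq` is hypothetical.

```python
# Toy frequency table: the words and values below are invented for illustration.
toy_words = ["dog", "house", "zyzzyva"]
toy_lg10wf = [3.1, 3.7, 0.3]

# Same construction as in the notebook: pair each word with its log frequency.
toy_freq_dict = dict(zip(toy_words, toy_lg10wf))

def log_freq(word, table, default=0):
    """Return the log frequency of `word`, or `default` if it is not in the table."""
    return table.get(word, default)

in_table = log_freq("dog", toy_freq_dict)
missing = log_freq("quokka", toy_freq_dict)
print(in_table, missing)
```

A dictionary lookup takes roughly constant time per word, which is why building it once up front is cheaper than searching the frequency table again for every stimulus.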

### Precompute POS Tags For All Stimuli

We parse our stimuli dataset and assign part-of-speech (POS) tags, i.e., word classes. \
We use the `pos_tag` function from NLTK (**N**atural **L**anguage **T**ool**K**it) to get POS tags for all of our potential words. We then save them in a dictionary to be used later.

> Note: \
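This is a very basic method for assigning POS tags: the tagger is designed for words in sentence context, whereas here it only sees a bare word list. We will therefore have to check our data manually later to find potential mistakes. \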
$\to$ but it's the easiest and quickest way to do this, especially for nouns (like in our case).

In [16]:
stimuli = stimuli_and_valence_df["Word"].astype(str).tolist()
tagged = dict(nltk.pos_tag(stimuli))   # e.g., {"dog": "NN", ...}

In [None]:
# run to see our POS dictionary
tagged

### Precompute Cognate Lookup

We parse the cognate dataset. Each row pairs a word and its language (`word 1`, `lang 1`) with its cognate and that word's language (`word 2`, `lang 2`). \
As we are interested in English-German cognates, we only take the rows of the dataset where either `language1` or `language2` is English and the other one is German.

> Note: \
For now, we will assume that anything that is not in the cognate dataset is actually not a cognate. \
This does not necessarily filter out **all** cognates. It could be the case that a word is simply not present in the dataset even though it is a cognate. We will manually check the dataset later.

In [None]:
# big dataset -> running this will take some time
cognate_set = set()

for _, r in cognate_df.iterrows():
    w1, l1 = r["word 1"], r["lang 1"]   # e.g. w1: "cat", l1: "eng", w2: "Katze", l2: "deu"
    w2, l2 = r["word 2"], r["lang 2"]

    if l1 == "eng" and l2 == "deu":
        cognate_set.add(str(w1))
    if l2 == "eng" and l1 == "deu":
        cognate_set.add(str(w2))

In [None]:
# run to see our cognate set
cognate_set
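
To make the filtering logic easier to follow, here is a minimal sketch that applies the same English-German check to a handful of hand-written rows. The rows are illustrative and not copied from CogNet; the column names follow the ones used in the loop above, and `toy_cognates` is a hypothetical name.

```python
import csv
import io

# Hand-written rows in the column layout the notebook's loop expects
# ("lang 1", "word 1", "lang 2", "word 2"); they are not taken from CogNet.
toy_tsv = (
    "lang 1\tword 1\tlang 2\tword 2\n"
    "eng\thouse\tdeu\tHaus\n"
    "deu\tWasser\teng\twater\n"
    "eng\tnight\tfra\tnuit\n"
    "nld\tboek\tdeu\tBuch\n"
)

toy_cognates = set()
for row in csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"):
    pair = (row["lang 1"], row["lang 2"])
    if pair == ("eng", "deu"):
        toy_cognates.add(row["word 1"])   # English word is in the first slot
    elif pair == ("deu", "eng"):
        toy_cognates.add(row["word 2"])   # English word is in the second slot

print(sorted(toy_cognates))
```

Only the English side of each English-German pair is kept, because the stimuli we check against this set are English words.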

## Final Steps

### Process Rows

Now that we have prepared everything that we need, we can start processing our dataset.
This means that for each word, we check our dataset conditions (through the `if`-statements).

<small>

**Reminder** - these are our dataset conditions:
* length shorter than 8 characters (i.e. maximum 7 characters long)
* part-of-speech (POS) is noun
* log frequency is higher than 2
* non-cognates

</small>

If any of these conditions are not met, we "`continue`", which means that the word will be dropped (i.e., not added to the dataset) and we continue with the next one.

If the word passes all the checks, we look up its emotional valence and assign a categorical value to it (`positive`/`negative`/`neutral`):
* **positive**: continuous valence of 7 or higher
* **negative**: continuous valence of 3 or lower
* **neutral**: continuous valence between (or equal to) 4.5 and 5.5

Everything that falls outside of these ranges will be left out of the dataset, because we want there to be clear distinctions between the valences. \
$\to$ We don't want fuzzy boundaries
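
The thresholds above can be sketched as a small standalone function. This is only an illustration of the cut-offs; the function name `categorize_valence` is hypothetical and does not appear in the notebook's own code.

```python
import math

def categorize_valence(valence):
    """Map a continuous valence rating to a category, or None if it is excluded."""
    if valence is None or math.isnan(valence):
        return None          # missing rating
    if valence <= 3:
        return "negative"
    if valence >= 7:
        return "positive"
    if 4.5 <= valence <= 5.5:
        return "neutral"
    return None              # in a gap between categories (3 to 4.5, or 5.5 to 7)

for v in (1.53, 3.0, 4.0, 5.0, 6.2, 7.0):
    print(v, categorize_valence(v))
```

Ratings in the two gaps (above 3 but below 4.5, and above 5.5 but below 7) return `None`, which corresponds to the words that are left out.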

#### Idea Behind The Code

> You can imagine the following code like words literally trying to walk through it top to bottom. If a condition is not met, its journey stops and the next word tries its luck. If a word makes it to the bottom, we can save it. Basically a decision tree:

<img src="dec_tree.png" width="200">

<small>(increase the `width` of this image to $\approx$ 500 to see it better)</small>
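
The "walk through the checks" idea can also be sketched as a function that reports the first check a word fails. All lookups below are invented toy data, and `first_failed_check` is a hypothetical helper, not part of the notebook's pipeline.

```python
def first_failed_check(word, tags, freqs, cognates, max_len=7, min_log_freq=2):
    """Return the name of the first check that `word` fails, or None if it passes all."""
    if len(word) > max_len:
        return "length"
    if not tags.get(word, "").startswith("NN"):
        return "pos"
    if freqs.get(word, 0) <= min_log_freq:
        return "frequency"
    if word in cognates:
        return "cognate"
    return None

# Invented lookups for illustration only.
toy_tags = {"dog": "NN", "run": "VB", "house": "NN", "elephant": "NN", "zap": "NN"}
toy_freqs = {"dog": 3.1, "run": 3.9, "house": 3.7, "elephant": 2.4, "zap": 1.2}
toy_cognates = {"house"}

for w in ["dog", "run", "house", "elephant", "zap"]:
    print(w, first_failed_check(w, toy_tags, toy_freqs, toy_cognates))
```

Returning the name of the failed check, rather than just dropping the word, can help when you later want to see which condition removes the most candidates.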

### Adding Words
After doing all that, we will add the word to our dataset, along with all the info that we have collected about it along the way:

* `"Stimulus": stimulus.upper(),` $\to$ the word itself (all upper case)
* `"Condition": "word",` $\to$ in contrast to *non-words*, which we will add to our dataset separately
* `"Cognate_status": "noncognate",` $\to$ all our words should be non-cognates
* `"Emotional_valence_cat": cat_valence,` $\to$ the assigned categorical valence (positive/negative/neutral)
* `"Emotional_valence_cont": valence,` $\to$ the original continuous valence of the word
* `"Length": len(stimulus),` $\to$ the length of the word
* `"Lg10SUBTLEX_US": freq,` $\to$ the frequency of the word
* `"correct_response": "j"` $\to$ for words, the correct response is always "j" (in contrast to "f" for *non-words*)

> Note: \
We also count how many of each valence we have: \
Whilst we want to have 30 of each in the end, ideally, at this point, the numbers should be higher. \
This is because we want to go over the dataset manually later (to account for any potential mistakes with POS and non-cognate assignments).
We also want to make sure that we only have clear-valence-words in there, so we will check that manually too. \
Therefore, the **numbers should be high enough to account for any manual post-hoc deletions.**
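
One way to check whether each category still has enough words is sketched below, assuming a target of 30 per category as stated in the note. The labels are invented, and `category_shortfalls` is a hypothetical helper.

```python
from collections import Counter

TARGET_PER_CATEGORY = 30  # final number of words wanted per valence category

def category_shortfalls(categories, target=TARGET_PER_CATEGORY):
    """Count items per category and return those that fall below `target`."""
    counts = Counter(categories)
    return {cat: counts[cat] for cat in ("positive", "negative", "neutral")
            if counts[cat] < target}

# Invented labels for illustration: 40 positive, 35 negative, 12 neutral.
toy_labels = ["positive"] * 40 + ["negative"] * 35 + ["neutral"] * 12
shortfalls = category_shortfalls(toy_labels)
print(shortfalls)
```

An empty result means every category meets the target; re-running such a check after manual deletions shows whether any category has dropped too low.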

In [40]:
rows = []
neg_count = pos_count = neutr_count = none_count = 0

for _, row in stimuli_and_valence_df.iterrows():
    stimulus = str(row["Word"])
    valence = row["V.Mean.Sum"]

    # length filter: discard everything longer than 7 characters
    if len(stimulus) > 7:
        none_count += 1
        continue

    # POS filter: discard everything that is not a noun
    tag = tagged.get(stimulus, "")
    if not tag.startswith("NN"):
        none_count += 1
        continue

    # frequency filter: discard everything with a log frequency of 2 or lower
    freq = freq_dict.get(stimulus, 0) # If a word does not exist in the frequency dataset, we assume its frequency is 0
    if freq <= 2:
        none_count += 1
        continue

    # cognate check: discard all cognates
    if stimulus in cognate_set:
        none_count += 1
        continue

    # process valence into categorical (cat_valence) and continuous (valence)
    # increase the valence counters accordingly
    cat_valence = None

    if valence <= 3:
        cat_valence = "negative"
        neg_count += 1

    elif valence >= 7:
        cat_valence = "positive"
        pos_count += 1

    elif 4.5 <= valence <= 5.5:
        cat_valence = "neutral"
        neutr_count += 1

    else:
        none_count += 1
        continue            # skip fuzzy boundary cases (and missing valence values)

    # add the word
    if pd.notna(valence):
        rows.append({
            "Experiment": "LDT",
            "Item_type": "exp",
            "Stimulus": stimulus.upper(),
            "Condition": "word",
            "Cognate_status": "noncognate",
            "Emotional_valence_cat": cat_valence,
            "Emotional_valence_cont": valence,
            "Length": len(stimulus),
            "Lg10SUBTLEX_US": freq,
            "correct_response": "j"
        })

exp_df = pd.DataFrame(rows)

In [44]:
# run to see how many words we have in each category
print(f"Positive: {pos_count}, Negative: {neg_count}, Neutral: {neutr_count}. \n\nWe have discarded {none_count} words so far!")

Positive: 142, Negative: 179, Neutral: 595. 

We have discarded 12999 words so far!


In [41]:
# run to see our final dataset
exp_df

|     | Experiment | Item_type | Stimulus | Condition | Cognate_status | Emotional_valence_cat | Emotional_valence_cont | Length | Lg10SUBTLEX_US | correct_response |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | LDT | exp | ABDOMEN | word | noncognate | neutral | 5.43 | 7 | 2.235528 | j |
| 1 | LDT | exp | ABILITY | word | noncognate | positive | 7.00 | 7 | 2.991669 | j |
| 2 | LDT | exp | ABSORB | word | noncognate | neutral | 5.50 | 6 | 2.008600 | j |
| 3 | LDT | exp | ABUSE | word | noncognate | negative | 1.53 | 5 | 2.719331 | j |
| 4 | LDT | exp | ACCOUNT | word | noncognate | neutral | 5.39 | 7 | 3.358125 | j |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 911 | LDT | exp | YEN | word | noncognate | neutral | 5.00 | 3 | 2.495544 | j |
| 912 | LDT | exp | YUMMY | word | noncognate | positive | 7.52 | 5 | 2.359835 | j |
| 913 | LDT | exp | ZAP | word | noncognate | neutral | 5.39 | 3 | 2.100371 | j |
| 914 | LDT | exp | ZIP | word | noncognate | neutral | 5.06 | 3 | 2.591065 | j |


### Save The Dataset

We can now save the dataset to a `.csv` file.

> Important: \
If you already have a file saved under the same name, this **will overwrite it**.

In [43]:
exp_df.to_csv("dataframe_exp.csv", encoding='utf-8', index=False, sep = ";")
print(f"Done! The dataset was saved to \"dataframe_exp.csv\". Check your folder :)")

Done! The dataset was saved to "dataframe_exp.csv". Check your folder :)
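
If overwriting an existing file is a concern, one possible guard is to open the target in exclusive-creation mode, which fails instead of replacing the file. The sketch below uses only the standard library with an invented row and a temporary folder; `save_rows_no_overwrite` is a hypothetical helper and not what the notebook itself does.

```python
import csv
import tempfile
from pathlib import Path

def save_rows_no_overwrite(rows, path, delimiter=";"):
    """Write `rows` (a list of dicts) to `path`, refusing to replace an existing file."""
    path = Path(path)
    if not rows:
        raise ValueError("nothing to save")
    # Mode "x" raises FileExistsError instead of silently overwriting.
    with path.open("x", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
    return path

# Demonstration in a temporary folder, with an invented row.
toy_rows = [{"Stimulus": "DOG", "Length": 3}]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dataframe_exp.csv"
    save_rows_no_overwrite(toy_rows, target)
    first_write = target.read_text(encoding="utf-8")
    try:
        save_rows_no_overwrite(toy_rows, target)
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True

print(first_write)
print(second_write_blocked)
```

The second call fails on purpose: the file already exists, so nothing is overwritten and the earlier version stays intact.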
