<h2> Generating the Stimuli file</h2>
    
This notebook is a modified version of prior work that generates the stimuli files used in the anagram experiments.

First we grab a list of words and shuffle the letters of each one. We then split the shuffled set into separate groups (sets), and use the paired shuffled and "correct" words to make the stimuli file used in the anagram experiments.

This version differs from the earlier one because it includes the sources and the frequency data for the words in the set. I will also use this notebook to make a JSON object of valid solutions for each shuffle. The stimuli file includes the word that was shuffled; however, a shuffled string can often be solved to more than one real English word, and every such word counts as "valid".

We can use dictionaries that hold all the words of a specific length, then use a function to give us some number of those words.

#### The word bank we are using is from WordNet:
    George A. Miller (1995). WordNet: A Lexical Database for English.
    Communications of the ACM Vol. 38, No. 11: 39-41.
    Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

#### The frequency information is from wordfreq:
    Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo. https://doi.org/10.5281/zenodo.7199437


In [1]:
%load_ext autoreload
%reload_ext autoreload

import csv
import json
import random
import pandas as pd
from collections import Counter

import nltk
import numpy as np
import itertools
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import spacy 

from anagram_utils import (
    mk_dict_from_wordnet_for_length,
    remove_from_dict,
    get_word_frequencies,
    sort_words_by_frequency,
    get_top_n_words,
    shuffle_list,
    reformat_sorted_wordlist,
    check_for_word,
    check_for_doubles,
    check_for_same,
    find_valid_words
)
nltk.download("wordnet") # @ russ do I need this if i've already downloaded?

spacy.load('en_core_web_sm')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lyndefolsom/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<spacy.lang.en.English at 0x16ad34c80>

<h3> Making the dictionary </h3>
Using the utility script we make a dictionary of the words we want to use. For each word length, the function `mk_dict_from_wordnet_for_length` pulls the WordNet words of that length to use for the stimuli generation.

Still debugging the remove proper nouns function @to-do 
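
To make the length filter concrete, here is a minimal sketch of what a helper like `mk_dict_from_wordnet_for_length` might do. The function name `words_of_length` and the filtering rules (lowercase, letters only) are assumptions for illustration, not the actual code in `anagram_utils`. It takes any iterable of strings, so in the notebook the input could be `wn.all_lemma_names()`.

```python
from typing import Iterable, List


def words_of_length(lemma_names: Iterable[str], wordlength: int) -> List[str]:
    """Keep lowercase, purely alphabetic entries with exactly `wordlength` letters.

    Hypothetical helper for illustration; not the anagram_utils implementation.
    """
    kept = set()
    for name in lemma_names:
        # WordNet joins multi-word entries with underscores; isalpha() drops those
        if len(name) == wordlength and name.isalpha() and name.islower():
            kept.add(name)
    return sorted(kept)


# A tiny stand-in list; in the notebook this could be wn.all_lemma_names().
sample = ["that", "water", "new_york", "Paris", "a.m.", "trust", "that"]
four_letter = words_of_length(sample, 4)
five_letter = words_of_length(sample, 5)
print(four_letter, five_letter)
```

Using a set before sorting removes repeated entries, and the capital-letter check is one cheap way to drop some proper nouns while the proper-noun function is still being debugged.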
 

In [2]:
word_dict = {}
wordlengths = [4,5,6] # set the lengths outside of the loop bc we use them later too 
for wordlength in wordlengths:
    word_dict[wordlength] = mk_dict_from_wordnet_for_length(wordlength)
    print(f'found {len(word_dict[wordlength])} words of length {wordlength}')

# making a loop through the word dic to use the remove proper nouns function
# for wordlength in word_dict:
#     word_dict[wordlength] = remove_proper_nouns(word_dict[wordlength])
#     print(f'found after removing proper nouns {len(word_dict[wordlength])} words of length {wordlength}')

# define our curse words or other words to remove
curse_words = ["shit", "piss", "fuck", "cunt", "cocksucker", "motherfucker", "tits"] #rip george carlin
other_words_to_remove = ["jesus", "george", "john", 'james', 'york', 'david', 'google', 'robert', 'thomas','kill','trump', 'stupid', 'centre' ]
# use the remove from dict func to take those ones out
for wordlength in word_dict:
    word_dict[wordlength] = remove_from_dict(word_dict[wordlength], curse_words)
    print(f'found after removing curse words {len(word_dict[wordlength])} words of length {wordlength}')

for wordlength in word_dict:
    word_dict[wordlength] = remove_from_dict(word_dict[wordlength], other_words_to_remove)
    print(f'found after removing other words {len(word_dict[wordlength])}words of length {wordlength}')



found 2310 words of length 4
found 4095 words of length 5
found 6258 words of length 6
found after removing curse words 2307 words of length 4
found after removing curse words 4095 words of length 5
found after removing curse words 6258 words of length 6
found after removing other words 2304words of length 4
found after removing other words 4091words of length 5
found after removing other words 6252words of length 6


## Dictionary made, let's get our words
I made a `get_top_n_words` function to grab the list I want, based on how many anagrams per word length we want to have at the end.
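
As a rough illustration of the selection step, the sketch below picks the `n` most frequent words of each length from a nested dictionary. The function name `top_n_per_length`, the dictionary layout, and the frequency values are all made up for this example; the real helpers in `anagram_utils` may be organised differently.

```python
from typing import Dict, List


def top_n_per_length(freqs_by_length: Dict[int, Dict[str, float]], n: int) -> List[str]:
    """Return the n most frequent words of each length, shortest length first.

    Hypothetical helper for illustration; not the anagram_utils implementation.
    """
    selected = []
    for length in sorted(freqs_by_length):
        freqs = freqs_by_length[length]
        # sort by descending frequency, break ties alphabetically for reproducibility
        ranked = sorted(freqs, key=lambda w: (-freqs[w], w))
        if len(ranked) < n:
            raise ValueError(f"only {len(ranked)} words of length {length}, need {n}")
        selected.extend(ranked[:n])
    return selected


# Made-up frequencies for illustration only (not wordfreq output).
toy_freqs = {
    4: {"that": 0.9, "with": 0.8, "zarf": 0.001},
    5: {"about": 0.7, "water": 0.5, "xylem": 0.002},
}
top_words = top_n_per_length(toy_freqs, 2)
print(top_words)
```

Raising an error when a length has too few words keeps the later set sizes predictable, instead of silently producing a shorter list.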

In [11]:
word_frequencies = get_word_frequencies(word_dict) # we get the word frequencies
sorted_wordlist = sort_words_by_frequency(word_frequencies) # we sort the dictionary by frequency
sorted_wordlist = reformat_sorted_wordlist(sorted_wordlist) # we reformat the dictionary for the get_top_n_words function
# make your subset list of words for the anagrams
number_of_anagrams_per_word_length = 200
full_list = get_top_n_words(sorted_wordlist, number_of_anagrams_per_word_length)
# cool, so let's make a list of those words with their letters shuffled, i.e. our anagrams
shuffled_list = shuffle_list(full_list) # we take the list and make a new list of each word's letters shuffled (anagram)
# for the next few steps it helps to also have a df of these two lists, named root and shuffled
cat_full_list = pd.DataFrame({
    "root" : full_list,
    "shuffled" : shuffled_list,
    # id column: the index + 1 as a string, zero-padded so that it's 3 digits long
    "id" : [str(i+1).zfill(3) for i in range(len(full_list))]    
})
cat_full_list= check_for_doubles(cat_full_list) # we check for any words that are the same in the root and shuffled columns 
cat_full_list= check_for_same(cat_full_list) # we check for any duplicates in the shuffled col and reshuffle them
# check that the "shuffled" strings are not themselves words in the dictionary, and reshuffle any that are
cat_full_list = check_for_word(cat_full_list, word_dict)

## Make a solution key

Many anagrams have more than one response that is a valid solution. 
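
One common way to find every valid solution is to index the vocabulary by its sorted letters, so that all anagrams of one another share a key. The sketch below shows that idea and how the result could be dumped as a JSON solution key. The names `build_anagram_index` and `solutions_for` are hypothetical, and this is not necessarily how `find_valid_words` works internally.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def build_anagram_index(words: Iterable[str]) -> Dict[str, List[str]]:
    """Map each sorted-letter signature to the sorted list of words that share it."""
    index = defaultdict(set)
    for word in words:
        index["".join(sorted(word))].add(word)
    return {signature: sorted(group) for signature, group in index.items()}


def solutions_for(shuffled: str, index: Dict[str, List[str]]) -> List[str]:
    """Return every word in the index that uses exactly the letters of `shuffled`."""
    return index.get("".join(sorted(shuffled)), [])


# Toy vocabulary for illustration only.
vocabulary = ["post", "stop", "spot", "tops", "that", "water"]
index = build_anagram_index(vocabulary)

# Key: the shuffled string shown to participants; value: all accepted answers.
solution_key = {s: solutions_for(s, index) for s in ["otps", "atth"]}
solution_json = json.dumps(solution_key, indent=4)
print(solution_json)
```

Because the lookup depends only on the letters, reshuffling an anagram later does not change its list of valid solutions.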

In [15]:
valid_words_gram = [find_valid_words(word, word_dict) for word in shuffled_list] # we find the valid words for each anagram in the shuffled list
valid_words = [list(set(words)) for words in valid_words_gram] # we remove any duplicates in the valid words list and set to just the one list
cat_full_list['valid_words'] = valid_words # we add the valid words to the df
cat_full_list=check_for_word(cat_full_list, word_dict) # we re-check that no shuffled string is itself a word in the dictionary

In [16]:

# function to use the root in the df to assign its string length in a new column of the same df
def get_length(row):
    return len(row['root']) 


# apply the function to the df
cat_full_list['length'] = cat_full_list.apply(get_length, axis=1)

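# function that assigns every row to one of four equally sized sets, SetA through SetD
# NOTE: the set labels are handed out in row order; the length groups built below are not used for the assignment yet
def group_by_length(df):
    # the rows must split evenly between the four sets
    if len(df) % 4 != 0:
        raise ValueError("Error in distributing between 4 groups, please revise counts")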
    # group by the length of the word
    grouped = df.groupby('length')
    # make a list of the groups
    groups = [group for name, group in grouped]
    # make a list of the group names
    group_names = ['SetA', 'SetB', 'SetC', 'SetD']
    # calculate number of words per group
    num_words_per_group = len(df) // len(group_names)
    # make a dictionary of the group names and the groups
    group_dict = dict(zip(group_names, groups))
    # make a list of the group names for each word in the df don't exceed the num words per group
    group_list = [] # make an empty list to append to
    # loop through the group names until the number of words in the group is equal to the number of words per group
    for group_name in group_names:
        for i in range(num_words_per_group):
            group_list.append(group_name)
    # assign the group list to the df
    df['Set'] = group_list
    return df

# apply the function to the df
cat_full_list = group_by_length(cat_full_list)

# check that each group in the df has the same number of words
grouped = cat_full_list.groupby('Set')
for name, group in grouped:
    print(f'{name} has {len(group)} words')

#### troublesome function: create 4 unique set runs for each set, named by the set plus a run number
### so that we have 4 unique runs for each set; our final js stim would then have 16 unique runs, 4 for each set, and a file that ought to be len(id) * 4 entries long

def set_run(df):
    # group by set
    grouped = df.groupby('Set')
    # for each group, shuffle and store the order under a run name: SetA1, SetA2, SetA3, SetA4, SetB1, etc.
    set_runs = {}
    # loop through the groups, shuffling each one four times
    for name, group in grouped:
        for i in range(4):
            # take the ids and shuffle the order (but not the contents, just the order)
            set_runs[f"{name}{i+1}"] = group['id'].sample(frac=1).tolist()
    # returned as a dict: 16 run orders cannot be assigned as a single column of the word-level df
    return set_runs

#save the cat_full_list to a csv
cat_full_list.to_csv('cat_full_list.csv', index=False)

SetA has 150 words
SetB has 150 words
SetC has 150 words
SetD has 150 words


In [None]:
#tester chunk
print(type(cat_full_list))

In [6]:
def format_js_stimuli(df):
    js_stimuli = []
    for _, row in df.iterrows():
        # iterrows() yields only data rows, so there is no header row to skip
        js_entry = {
                "id": row['id'],
                "type": int(row['length']),  # plain int so the entry can be written as JSON
                "anagram": row['shuffled'],
                "correct": row['root'],
                "valid": row['valid_words'],
                "set": row['Set'],
                "set_run": row['Set']  # placeholder: run orders are not assigned yet
            }
        js_stimuli.append(js_entry) 
    return js_stimuli

js_stimuli = format_js_stimuli(cat_full_list)

In [None]:
# this chunk saves out the stimuli file as a .js file
# json.dumps writes valid JS literals; default=int converts any NumPy integers to plain ints
stimuli_js_content = "let trial_objects = " + json.dumps(js_stimuli, indent=4, default=int) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

#  ________________________________________________________

In [None]:
# refactor the below functions
# create stimulus JSON file

# needs refactoring

def set_shuffle(word_list, set_run):
    # shuffle the rows of one set and label every entry with its run name (e.g. "SetA1")
    shuffled_list = word_list.sample(frac=1).reset_index(drop=True)  # Shuffle the DataFrame
    return [
        {
            "id": word["id"],
            "type": int(word["length"]),  # plain int so the entry can be written as JSON
            "anagram": word["shuffled"],
            "correct": word["root"],
            "valid": word["valid_words"],
            "set": word["Set"],
            "set_run": set_run,
        }
        for _, word in shuffled_list.iterrows()
    ]

def run_set_shuffle(df): # function to shuffle words in each set and assign to a unique run sequence called a set_run
    all_entries = []
    for run_number in range(1, 5):  # Create 4 runs for each set
        for set_name in ['SetA', 'SetB', 'SetC', 'SetD']:
            set_run = f"{set_name}{run_number}" 
            set_df = df[df['Set'] == set_name] # filter the df by set; every run reuses the same words
            shuffled_set = set_shuffle(set_df, set_run)
            all_entries.extend(shuffled_set)
    return all_entries


# run on the DataFrame (not the list of dicts), because set_shuffle reads the DataFrame columns
js_stimuli = run_set_shuffle(cat_full_list)
print(type(js_stimuli))
print(js_stimuli)


In [None]:
# refactor the below functions to work with new frame

# Placeholder for all JS entries
all_entries = []

# Loop over each CSV file and set name
for i in range(4):
    filename = f"group_{i+1}_word_pairs.csv"
    set_name = f"Set{chr(65 + i)}"  # 'SetA', 'SetB', 'SetC', 'SetD'

    with open(filename, newline="") as csvfile:
        csvreader = csv.reader(csvfile)
        js_entries = csv_to_js_format(
            csvreader, set_name
        )  # Pass csvreader and set_name
        all_entries.extend(js_entries)

## Writing the JS file
# Save all entries into the JS file
stimuli_js_content = "let trial_objects = " + str(all_entries) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)


# Deprecated
## Now we start making the stimulus files

First we need to distribute the unique words among the groups. Each group then becomes a CSV file.
We also need to make the JSON file that's used for the stimuli in the experiment, and I prefer it to be formatted in a way that's legible.


In [5]:
### ----- Deprecated Functions ----- ###

def wordlist2length_dict(anagram_list):
    #  grouped by the lengths of the words
    anagram_by_length = {}
    for word in anagram_list:
        length = len(word)
        if length not in anagram_by_length:
            anagram_by_length[length] = []
        anagram_by_length[length].append(word)
    return anagram_by_length
        
def distribute_words(anagram_by_length, num_set = 4):
    #  distribute the words to the sets
    anagram_sets = {}
    for length, words in anagram_by_length.items():
        random.shuffle(words)
        anagram_sets[length] = [words[i::num_set] for i in range(num_set)]
    return anagram_sets


# function to join the sets into unique columns named group_a-d based on the rows of the df
def join_sets_into_groups(anagram_sets, list):
    test_group = pd.DataFrame()
    test_list_length = len(list)
    for i in range(test_list_length):
        group = pd.DataFrame()
        for length, words in anagram_sets.items():
            group[f'length_{length}'] = words[i]
        test_group = pd.concat([test_group, group], axis=1)
    return test_group.transpose()

# each row of a set is joined into a new list, joined into a new df


In [None]:
### testing 
######
# create a test list with 20 words of length 4, 5, and 6
import random
import string

def generate_random_word(length):
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Generate a list of 20 words with lengths 4, 5, and 6
word_lengths = [4, 5, 6]
test_list = [generate_random_word(length) for _ in range(20) for length in word_lengths]



## New Functions
- distribute into groups
- make csv
- turn the csv into the json file
- shuffle the run order and assign

In [None]:
# making a distributing function to spread the words across the four groups
# def distribute_words(word_list, num_sets):
#     # Calculate the number of words per group
#     group_size = len(word_list) // num_sets

#     # Create groups
#     groups = [[] for _ in range(num_sets)]

#     # Distribute words to each group
#     for index, word in enumerate(word_list):
#         group_index = index // group_size
#         if (
#             group_index < num_sets
#         ):  # This check prevents index out of range if not perfectly divisible
#             groups[group_index].append(word)

#     return groups

# need to mod this to take an array and convert to js format
def format_js_stimuli(sets, set_name):
    # set_name is passed in explicitly; it was an undefined name inside this function before
    js_stimuli = []
    for idx, row in enumerate(sets):
        if idx == 0:  # skip the header row
            continue
        word_type, word_pair = row
        original, anagram = word_pair.split(", ")
        js_entry = {
            "id": f"{str(idx + 1).zfill(3)}",  # idx 0 is the header, so the first data row gets ID 002
            "type": word_type,
            "anagram": anagram,
            "correct": original,
            "set": set_name,
        }
        js_stimuli.append(js_entry)  # append to the list that is returned
    return js_stimuli


def csv_to_js_format(csv_content, set_name):
    js_entries = []
    for idx, row in enumerate(csv_content):
        if idx == 0:  # Skip header row
            continue
        word_type, word_pair = row
        original, anagram = word_pair.split(", ")
        js_entry = {
            "id": f"{str(idx + 1).zfill(3)}",  # idx 0 is the header, so the first data row gets ID 002
            "type": word_type,
            "anagram": anagram,
            "correct": original,
            "set": set_name,
        }
        js_entries.append(js_entry)
    return js_entries

# Function to shuffle a set of words and include set run order
def set_shuffle(word_list, set_name, run_number):
    shuffled_list = word_list[:]  # Create a copy of the word_list to shuffle
    random.shuffle(shuffled_list)  # Shuffle the list of words
    set_run = f"{set_name}{run_number}"
    return [
        {
            "id": word["id"],
            "type": word["type"],
            "anagram": word["anagram"],
            "correct": word["correct"],
            "set": word["set"],
            "setRun": set_run,
        }
        for word in shuffled_list
    ]



In [None]:
# make the possible_words_dict into a json file in which the key is the shuffled word and the value is the list of possible words
# all_possible_words_json = json.dumps(possible_words_dict)
# with open("possible_words.json", "w") as file:
#     file.write(all_possible_words_json)

<h3> Concatenate and create the Sets </h3>

So now we need to make the stimuli for each of the blocks. For this we want to have 10 strings of each length per block, for 3 blocks.

Then we make groups of those until all the words are assigned. Finally, we will collect 10-30 participants per group.

So in total: 3 blocks of 30 anagrams, each block with 10 four-letter, 10 five-letter, and 10 six-letter strings. That is 30 words of each length per group, so we need 4 groups to cover all 120 words of each length.
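
The counts above can be double-checked with a few lines of arithmetic. The variable names below are only for this check and do not appear elsewhere in the notebook.

```python
# Design numbers taken from the description above.
blocks_per_group = 3
anagrams_per_length_per_block = 10
word_lengths = [4, 5, 6]
num_groups = 4

# 10 strings of each of the 3 lengths in every block
anagrams_per_block = anagrams_per_length_per_block * len(word_lengths)
# 3 blocks per group
words_per_group = anagrams_per_block * blocks_per_group
# words of one length across all groups
words_per_length = anagrams_per_length_per_block * blocks_per_group * num_groups
# every word across all groups
total_words = words_per_group * num_groups

print(anagrams_per_block, words_per_group, words_per_length, total_words)
```

If any of the design numbers change, the last two values say how many words of each length (and in total) the word list has to supply.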

In [None]:

num_groups = 4
# making a distributing function to spread the words across the four groups
def distribute_words(word_list, num_groups):
    # Calculate the number of words per group
    group_size = len(word_list) // num_groups

    # Create groups
    groups = [[] for _ in range(num_groups)]

    # Distribute words to each group
    for index, word in enumerate(word_list):
        group_index = index // group_size
        if (
            group_index < num_groups
        ):  # This check prevents index out of range if not perfectly divisible
            groups[group_index].append(word)

    return groups

# Save each group to a separate CSV file-- we do this so we can review the words in each group and how they are distributed.
# for i in range(4):
#     filename = f"group_{i+1}_word_pairs.csv"
#     with open(filename, "w", newline="") as file:
#         writer = csv.writer(file)
#         writer.writerow(["Type", "Word Pairs"])
#         writer.writerows(
#             [["Four-Letter", word] for word in grouped_four_letter_words[i]]
#         )
#         writer.writerows(
#             [["Five-Letter", word] for word in grouped_five_letter_words[i]]
#         )
#         writer.writerows([["Six-Letter", word] for word in grouped_six_letter_words[i]])


Okay, now we need to create the stimuli file, which should be a .js file that looks like:

let trial_objects = [
    {
        "id": "001",
        "type": "Four-Letter",
        "anagram": "atth",
        "correct": "that",
        "set": "A"
    }
]
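
A minimal sketch of producing that layout from Python, and of reading it back for a sanity check, could look like this. The helper names `to_js` and `from_js` are hypothetical. Using `json.dumps` rather than `str()` keeps the output valid JavaScript even when an entry contains values such as `None` or `True`.

```python
import json

PREFIX = "let trial_objects = "


def to_js(entries):
    """Serialize a list of stimulus dicts as a JS variable declaration."""
    return PREFIX + json.dumps(entries, indent=4) + ";"


def from_js(text):
    """Recover the list of stimulus dicts from text written by to_js."""
    text = text.strip()
    if not (text.startswith(PREFIX) and text.endswith(";")):
        raise ValueError("not a trial_objects declaration")
    return json.loads(text[len(PREFIX):-1])


entries = [
    {
        "id": "001",
        "type": "Four-Letter",
        "anagram": "atth",
        "correct": "that",
        "set": "A",
    }
]
js_text = to_js(entries)
round_trip = from_js(js_text)
print(js_text)
```

Reading the file back with `from_js` is a quick way to confirm that nothing was lost or mangled when the stimuli were written out.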

In [None]:
# csv_directory = (
#     "./group_1_word_pairs.csv",
#     "./group_2_word_pairs.csv",
#     "./group_3_word_pairs.csv",
#     "./group_4_word_pairs.csv",
# )



# Placeholder for all JS entries
all_entries = []

# Loop over each CSV file and set name
for i in range(4):
    filename = f"group_{i+1}_word_pairs.csv"
    set_name = f"Set{chr(65 + i)}"  # 'SetA', 'SetB', 'SetC', 'SetD'

    with open(filename, newline="") as csvfile:
        csvreader = csv.reader(csvfile)
        js_entries = csv_to_js_format(
            csvreader, set_name
        )  # Pass csvreader and set_name
        all_entries.extend(js_entries)

## Writing the JS file
# Save all entries into the JS file
stimuli_js_content = "let trial_objects = " + str(all_entries) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

# modifying the stimuli.js file to be in the preferred format (mostly for readability)
# json.dumps with indent=4 puts every key on its own line with consistent indentation
stimuli_js_content = "let trial_objects = " + json.dumps(all_entries, indent=4) + ";"

with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

Alright, we've now got the sets, but we are going to take each set and shuffle it into 4 coded runs.
Randomizing the order on the fly can introduce a bunch of bugs into the timeline variables.
By hardcoding 4 unique run orders for each set of words, we address order effects without risking more bugs.

So below we take the stimuli file, filter by set, and add a new parameter to each entry: its run order assignment, numbered 1-4.
Now ALL of the words in set A are shuffled into 4 unique orders and assigned to SetA1, SetA2, SetA3, SetA4. All the A words are the same but their order is now randomized.
This is saved into the stimuli file as "setRun": "SetA1", and so on.

This makes the stimuli.js file substantially larger (four times the entries) but keeps our code clean and modular. It also means fewer headaches, since the other option is adjusting the JS utility file that builds the variable order, and we really don't want to do that.
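
The hardcoded run orders can be sketched on toy data as follows. The function name `make_run_orders` and the fixed seed are assumptions added for illustration; a seeded generator makes the orders reproducible, which the cells in this notebook do not currently do. Note that the sketch does not enforce that the four orders differ, although with a full-size set a repeated order is extremely unlikely.

```python
import random
from typing import Dict, List


def make_run_orders(ids_by_set: Dict[str, List[str]],
                    runs_per_set: int = 4,
                    seed: int = 0) -> Dict[str, List[str]]:
    """Return a shuffled id order for every run, keyed like 'SetA1', 'SetA2', ..."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    run_orders = {}
    for set_name in sorted(ids_by_set):
        for run_number in range(1, runs_per_set + 1):
            order = list(ids_by_set[set_name])  # copy, so the input list is not modified
            rng.shuffle(order)
            run_orders[f"{set_name}{run_number}"] = order
    return run_orders


# Toy ids for illustration only.
toy_sets = {
    "SetA": ["001", "002", "003", "004", "005"],
    "SetB": ["006", "007", "008", "009", "010"],
}
run_orders = make_run_orders(toy_sets)
for run_name, order in run_orders.items():
    print(run_name, order)
```

Every run of a set contains exactly the same ids, only in a different order, so the stimuli file grows by a factor of `runs_per_set`.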

In [None]:

# Placeholder for all JS entries
all_entries = []

# Loop over each CSV file and set name
for i in range(4):
    filename = f"group_{i+1}_word_pairs.csv"
    set_name = f"Set{chr(65 + i)}"  # 'SetA', 'SetB', 'SetC', 'SetD'

    with open(filename, newline="") as csvfile:
        csvreader = csv.reader(csvfile)
        csv_content = list(csvreader)  # Convert csvreader to a list
        js_entries = csv_to_js_format(
            csv_content, set_name
        )  # Pass csv_content and set_name
        for run_number in range(1, 5):  # Create 4 runs for each set
            shuffled_set = set_shuffle(js_entries, set_name, run_number)
            all_entries.extend(shuffled_set)

# Save all entries into the JS file
stimuli_js_content = "let trial_objects = " + json.dumps(all_entries, indent=4) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

# Count occurrences of each id. The counts are not right, but it is not really important, so I'll leave it as is.
# The ids are assigned inside the loop over the CSV files, so each id ends up on 4 different string pairs. For example, 032 is about, water, birth, and trust.
id_counter = Counter(entry["id"] for entry in all_entries)
print("ID Occurrences:")
for id_, count in id_counter.items():
    print(f"id {id_}: {count} times")


Okay, I'm just looking for a sanity check, so I'm going to run a loop over the stimuli js file to correct the IDs and then see if that does it. I think this notebook should be a code review for lab someday.

In [None]:
import json
from collections import defaultdict

# Load the stimuli file
with open("stimuli.js", "r") as file:
    stimuli_js_content = file.read()

# Extract the JSON data from the stimuli file
json_data = json.loads(stimuli_js_content[len("let trial_objects = ") : -1])

# Create a mapping for unique anagrams to new IDs
unique_anagram_to_id = {}
id_counter = 1

for entry in json_data:
    anagram = entry["anagram"]
    if anagram not in unique_anagram_to_id:
        unique_anagram_to_id[anagram] = f"{id_counter:03d}"
        id_counter += 1

# Reassign IDs in the JSON data
for entry in json_data:
    entry["id"] = unique_anagram_to_id[entry["anagram"]]

# Save the updated entries into the JS file
updated_stimuli_js_content = (
    "let trial_objects = " + json.dumps(json_data, indent=4) + ";"
)
with open("updated_stimuli.js", "w") as file:
    file.write(updated_stimuli_js_content)

# Print the new ID mapping for verification
print("New ID mapping for unique anagrams:")
for anagram, new_id in unique_anagram_to_id.items():
    print(f"Anagram: {anagram}, New ID: {new_id}")

# This worked, so let's save it to the stimuli.js file
with open("stimuli.js", "w") as file:
    file.write(updated_stimuli_js_content)