<h2> Generating the Stimuli file</h2>
    
This notebook is a modified version of prior work to generate the stimuli files used in anagram experiments.

First we grab a list of words and shuffle the letters of each one. We then split the resulting pairs of shuffled and "correct" words into separate groups (sets) and use them to build the stimuli file for the anagram experiments.

This version differs from the earlier one because it records the sources and frequency data for the words in the set. This notebook also builds a JSON object of valid solutions for each shuffle. The stimuli file stores the word each shuffle came from, but a shuffled string can often be solved to more than one real English word, and we count any such word as "valid".

We can use dictionaries that hold all the words of a given length, then use a function to pull some number of those words.

#### The word bank we are using is from WordNet:
    George A. Miller (1995). WordNet: A Lexical Database for English.
    Communications of the ACM Vol. 38, No. 11: 39-41.
    Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

#### The frequency information is from wordfreq:
    Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo. https://doi.org/10.5281/zenodo.7199437


In [1]:
%load_ext autoreload
%reload_ext autoreload

import csv
import json
import random
import pandas as pd
from collections import Counter

import nltk
import numpy as np
import itertools
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import spacy 

from propernoun import remove_proper_nouns
from anagram_utils import (
    mk_dict_from_wordnet_for_length,
    remove_from_dict,
    get_word_frequencies,
    sort_words_by_frequency,
    get_top_n_words,
    shuffle_letters,# @ russ do I keep this if it is used in the shuffle list function? 
    shuffle_list,
    reformat_sorted_wordlist,
    check_for_doubles,
    check_for_same
)
nltk.download("wordnet") # @ russ do I need this if i've already downloaded?

spacy.load('en_core_web_sm')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lyndefolsom/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<spacy.lang.en.English at 0x30136a0f0>

A code chunk for script testing (bc lynde is scared of using their terminal and messing with their environment)

<h3> Making the dictionary </h3>
Using the utility script, we make a dictionary of the words we want to use. The function mk_dict_from_wordnet_for_length takes a word length and pulls the WordNet words of that length to use for the stimuli generation.

Still debugging the remove proper nouns function @to-do 
 

In [2]:
word_dict = {}
wordlengths = [4,5,6] # set the lengths outside of the loop bc we use them later too 
for wordlength in wordlengths:
    word_dict[wordlength] = mk_dict_from_wordnet_for_length(wordlength)
    print(f'found {len(word_dict[wordlength])} words of length {wordlength}')

# loop through the word dict and apply the remove_proper_nouns function to each length
for wordlength in word_dict:
    word_dict[wordlength] = remove_proper_nouns(word_dict[wordlength])
    print(f'found after removing proper nouns {len(word_dict[wordlength])} words of length {wordlength}')

# define our curse words or other words to remove
curse_words = ["shit", "piss", "fuck", "cunt", "cocksucker", "motherfucker", "tits"] #rip george carlin
other_words_to_remove = ["jesus", "george", "john", 'james', 'york', 'david', 'google', 'robert', 'thomas','kill','trump', 'stupid' ]
# use the remove from dict func to take those ones out
for wordlength in word_dict:
    word_dict[wordlength] = remove_from_dict(word_dict[wordlength], curse_words)
    print(f'found after removing curse words {len(word_dict[wordlength])} words of length {wordlength}')

for wordlength in word_dict:
    word_dict[wordlength] = remove_from_dict(word_dict[wordlength], other_words_to_remove)
    print(f'found after removing other words {len(word_dict[wordlength])} words of length {wordlength}')




found 2310 words of length 4
found 4095 words of length 5
found 6258 words of length 6
found after removing proper nouns 2234 words of length 4
found after removing proper nouns 3982 words of length 5
found after removing proper nouns 6071 words of length 6
found after removing curse words 2231 words of length 4
found after removing curse words 3982 words of length 5
found after removing curse words 6071 words of length 6
found after removing other words 2228 words of length 4
found after removing other words 3978 words of length 5
found after removing other words 6066 words of length 6


## Dictionary made, let's get our words
I made a get words function to grab the list I want, based on how many anagrams we want to have at the end.
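
To make the selection step concrete, here is a minimal sketch of picking the top n words by frequency. It is not the code in anagram_utils: top_n_by_frequency is a hypothetical helper, and the frequency values are made up for illustration rather than taken from wordfreq.

```python
def top_n_by_frequency(freqs, n):
    """Return the n most frequent words, breaking ties alphabetically."""
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Made-up frequencies, for illustration only (not wordfreq output)
toy_freqs = {"that": 0.0102, "with": 0.0071, "crow": 0.000004, "have": 0.0043}
top_words = top_n_by_frequency(toy_freqs, 2)
print(top_words)
```

Breaking ties alphabetically keeps the selection deterministic, so rerunning the notebook gives the same word list.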

In [3]:
word_frequencies = get_word_frequencies(word_dict) # we get the word frequencies
sorted_wordlist = sort_words_by_frequency(word_frequencies) # we sort the dictionary by frequency
sorted_wordlist = reformat_sorted_wordlist(sorted_wordlist) # we reformat the dictionary for the get_top_n_words function
# make your subset list of words for the anagrams
number_of_anagrams_per_word_length = 200
full_list = get_top_n_words(sorted_wordlist, number_of_anagrams_per_word_length)
# kewl, so let's make a list of those words with shuffled letters, i.e. our anagrams
shuffled_list = shuffle_list(full_list) # new list with each word's letters shuffled (the anagram)
# for the next few steps it helps to also have a df of these two lists, named root and shuffled
cat_full_list = pd.DataFrame({
    "root" : full_list,
    "shuffled" : shuffled_list # must be named shuffled for the next function
})
cat_full_list = check_for_doubles(cat_full_list) # we check for any words that are the same in the root and shuffled columns
cat_full_list = check_for_same(cat_full_list) # we check for any duplicates in the shuffled column and reshuffle them

# New functions
Okay I'm realizing that I've gotta make more functions (RIP). I draft them here and then move them to the Utils file. 
After the util file, I'll need to write a test (double RIP). 

In [4]:
# function to check whether a word is in the dictionary (needed for the function below)
# assumes word_dict[n] supports `in` for words of length n (list, set, or dict keyed by word)
def check_word(word):
    # look up by the word's own length; indexing with the whole wordlengths list
    # raised "TypeError: unhashable type: 'list'"
    return word in word_dict.get(len(word), [])

# function for finding the real words that can be made from the letters of a shuffled string;
# we are calling these the valid words
def find_valid_words(input_string):
    valid_words = set()  # a set drops repeats caused by duplicated letters
    for perm in itertools.permutations(input_string):
        perm_word = "".join(perm)
        if check_word(perm_word):
            valid_words.add(perm_word)
    return sorted(valid_words)

# find_valid_words takes one string at a time, so loop over the test strings
test_strings = ['eltie','dcoe','cowr','earb']
test_strings_valid_words = {s: find_valid_words(s) for s in test_strings}

## Make a solution key

Many anagrams have more than one response that is a valid solution. 
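
One way to build such a key without walking through every permutation is to index the dictionary by each word's sorted letters: two strings are anagrams of each other exactly when their sorted letters match. The sketch below is only an illustration of that idea. The names build_signature_index and solution_key are hypothetical, and the toy word list stands in for the WordNet-based word_dict.

```python
from collections import defaultdict


def build_signature_index(words):
    """Map each sorted-letter signature to the set of words that share it."""
    index = defaultdict(set)
    for word in words:
        index["".join(sorted(word))].add(word)
    return index


def solution_key(shuffled_strings, index):
    """Return {shuffled string: alphabetized list of valid solutions}."""
    return {
        s: sorted(index.get("".join(sorted(s)), []))
        for s in shuffled_strings
    }


# Toy word list (an assumption for illustration, not the WordNet data)
toy_words = ["code", "coed", "deco", "bare", "bear", "crow", "elite"]
index = build_signature_index(toy_words)
key = solution_key(["dcoe", "earb", "cowr", "eltie", "zzzz"], index)
print(key)
```

The index is built once, so each lookup costs one sort of the shuffled string instead of up to 720 permutations for a six-letter string.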

In [None]:
# def find_valid_words(input_string):
#     valid_words = []
#     permutations = itertools.permutations(input_string)

#     for perm in permutations:
#         perm_word = "".join(perm)
#         result = check_word(perm_word)
#         if "is not in the dictionary" not in result:
#             valid_words.append(perm_word)

#     return valid_words


# now we go through the list of shuffled words use the find_valid_words for each word and append the results to a list
# five_possible_words = []
# for word in shuffled_five:
#     possible_words = find_valid_words(word)
#     five_possible_words.append(possible_words)
# six_possible_words = []
# for word in shuffled_six:
#     possible_words = find_valid_words(word)
#     six_possible_words.append(possible_words)
# four_possible_words = []
# for word in shuffled_fours:
#     possible_words = find_valid_words(word)
#     four_possible_words.append(possible_words)

# # aggregate all of these lists into one list
# all_possible_words = five_possible_words + six_possible_words + four_possible_words
# # convert the list into a dictionary with the shuffled word as the key and the list of possible words as the value
# possible_words_dict = dict(
#     zip(shuffled_five + shuffled_six + shuffled_fours, all_possible_words)
# )


## With an All Possible file we can check valid
Now that we have all the possible words that can be made from a particular string, we can use that dictionary as a json that could be read into the experiment.

In [None]:
# make the possible_words_dict into a json file in which the key is the shuffled word and the value is the list of possible words
# all_possible_words_json = json.dumps(possible_words_dict)
# with open("possible_words.json", "w") as file:
#     file.write(all_possible_words_json)

<h3> Concatenate and create the Sets </h3>

Now we need to make the stimuli for each of the blocks. We want 10 strings of each length per block, for 3 blocks.

We then make groups of those until all the words are assigned. Finally, we will collect 10-30 participants per group.

So in total: 3 blocks of 30 anagrams, each with 10 four-letter, 10 five-letter, and 10 six-letter strings. That is 30 words of each length per group, so we need 4 groups to cover the 120 words of each string length.
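
As a check on that arithmetic, here is a minimal sketch that splits 120 words of one length into 4 groups of 3 blocks with 10 words each. The function assign_groups_and_blocks and the placeholder words are hypothetical; the notebook's own distribute_words only handles the group level.

```python
def assign_groups_and_blocks(words, num_groups=4, num_blocks=3, per_block=10):
    """Split words into groups, then split each group into blocks."""
    needed = num_groups * num_blocks * per_block
    if len(words) != needed:
        raise ValueError(f"expected {needed} words, got {len(words)}")
    per_group = num_blocks * per_block
    layout = []
    for g in range(num_groups):
        group_words = words[g * per_group:(g + 1) * per_group]
        blocks = [
            group_words[b * per_block:(b + 1) * per_block]
            for b in range(num_blocks)
        ]
        layout.append(blocks)
    return layout


# Placeholder words w000..w119 stand in for the 120 words of one length
placeholder_words = [f"w{i:03d}" for i in range(120)]
layout = assign_groups_and_blocks(placeholder_words)
print(len(layout), len(layout[0]), len(layout[0][0]))
```

Raising an error on a wrong word count is a design choice for the sketch: it makes a mismatch between the list size and the block plan visible instead of silently dropping words.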

In [None]:
# making a distributing function to spread the words across the four groups
def distribute_words(word_list, num_groups=4):
    # Calculate the number of words per group
    group_size = len(word_list) // num_groups

    # Create groups
    groups = [[] for _ in range(num_groups)]

    # Distribute words to each group in order
    for index, word in enumerate(word_list):
        group_index = index // group_size
        # Guard against an out-of-range index when the list does not divide
        # evenly; any leftover words at the end are left unassigned
        if group_index < num_groups:
            groups[group_index].append(word)

    return groups


# Distribute words to the groups
# grouped_four_letter_words = distribute_words(shuffled_four_pairs)
# grouped_five_letter_words = distribute_words(shuffled_pairs_five)
# grouped_six_letter_words = distribute_words(shuffled_six_pairs)

# Save each group to a separate CSV file-- we do this so we can review the words in each group and how they are distributed.
# for i in range(4):
#     filename = f"group_{i+1}_word_pairs.csv"
#     with open(filename, "w", newline="") as file:
#         writer = csv.writer(file)
#         writer.writerow(["Type", "Word Pairs"])
#         writer.writerows(
#             [["Four-Letter", word] for word in grouped_four_letter_words[i]]
#         )
#         writer.writerows(
#             [["Five-Letter", word] for word in grouped_five_letter_words[i]]
#         )
#         writer.writerows([["Six-Letter", word] for word in grouped_six_letter_words[i]])


Okay, now we need to create the stimuli file, a .js file that will look like:

let trial_objects = [
    {
        "id": "001",
        "type": "Four-Letter",
        "anagram": "atth",
        "correct": "that",
        "set": "A"
    }
]
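
A file in this shape can be written by serializing a list of dicts with json.dumps and adding the let prefix, and read back by stripping the prefix and the trailing semicolon. This is a minimal sketch under that assumption; to_js_source and from_js_source are hypothetical helper names, not part of the utility script.

```python
import json


def to_js_source(entries, variable_name="trial_objects"):
    """Serialize entries as a JS `let` declaration holding a JSON array."""
    return f"let {variable_name} = " + json.dumps(entries, indent=4) + ";"


def from_js_source(source, variable_name="trial_objects"):
    """Recover the list of entries from text produced by to_js_source."""
    prefix = f"let {variable_name} = "
    body = source.strip()
    if not body.startswith(prefix) or not body.endswith(";"):
        raise ValueError("unexpected stimuli file layout")
    return json.loads(body[len(prefix):-1])


example = [
    {"id": "001", "type": "Four-Letter", "anagram": "atth",
     "correct": "that", "set": "A"}
]
source = to_js_source(example)
roundtrip = from_js_source(source)
print(source)
```

Going through json.dumps rather than str() keeps the strings double-quoted, so the same text parses as both JavaScript and JSON.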

In [None]:
csv_directory = (
    "./group_1_word_pairs.csv",
    "./group_2_word_pairs.csv",
    "./group_3_word_pairs.csv",
    "./group_4_word_pairs.csv",
)

# Function to convert CSV content to JS format
def csv_to_js_format(csv_content, set_name):
    js_entries = []
    for idx, row in enumerate(csv_content):
        word_type, word_pair = row
        if idx == 0:  # Skip header row
            continue
        original, anagram = word_pair.split(", ")
        js_entry = {
            "id": f"{str(idx).zfill(3)}",
            "type": word_type,
            "anagram": anagram,
            "correct": original,
            "set": set_name,
        }
        js_entries.append(js_entry)
    return js_entries


# Placeholder for all JS entries
all_entries = []

# Loop over each CSV file and set name
for i in range(4):
    filename = f"group_{i+1}_word_pairs.csv"
    set_name = f"Set{chr(65 + i)}"  # 'SetA', 'SetB', 'SetC', 'SetD'

    with open(filename, newline="") as csvfile:
        csvreader = csv.reader(csvfile)
        js_entries = csv_to_js_format(
            csvreader, set_name
        )  # Pass csvreader and set_name
        all_entries.extend(js_entries)

## Writing the JS file
# Save all entries into the JS file.
# json.dumps writes double-quoted strings with readable indentation;
# str(all_entries) would write the Python repr (single quotes) instead.
stimuli_js_content = "let trial_objects = " + json.dumps(all_entries, indent=4) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

Alright, we've now got the sets, but we are gonna take each set and shuffle it into 4 hard-coded runs.
Randomizing the order on the fly could introduce a bunch of bugs into the timeline variables.
By hardcoding 4 unique run orders for each set of words, we address order effects without risking more bugs.

So below we take the stimuli file, filter by set, and add a new parameter to each entry: its run order assignment, numbered 1-4.
Now ALL of the words in set A will be shuffled into 4 unique orders and assigned to A1, A2, A3, A4. All the A words are the same but their order is now randomized.
This will be saved into the stimuli file as "setRun": "SetA1" etc.

This makes the stimuli.js file substantially larger (four times the size), but keeps our code clean and modular. It also means fewer headaches, since the other option is adjusting the JS utility file that makes the variable order, and we really don't wanna do that.
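
Since the run orders are meant to be hard-coded, it can help to make the shuffles reproducible so that rerunning the notebook does not silently change them. The sketch below shows one way to do that with a seeded local generator. It is an illustration only: make_run_orders is a hypothetical name, the seed value is arbitrary, and the toy entries stand in for the real trial objects.

```python
import random


def make_run_orders(entries, set_label, num_runs=4, seed=0):
    """Return num_runs shuffled copies of entries, each tagged with its setRun."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    runs = []
    for run_number in range(1, num_runs + 1):
        order = list(entries)  # copy so the input list is not reordered
        rng.shuffle(order)
        runs.extend(
            {**entry, "setRun": f"{set_label}{run_number}"} for entry in order
        )
    return runs


# Toy entries (an assumption for illustration, not the real stimuli)
toy_entries = [
    {"id": f"{i:03d}", "anagram": anagram}
    for i, anagram in enumerate(["atth", "dcoe", "cowr", "earb"], start=1)
]
runs = make_run_orders(toy_entries, "A")
print(len(runs))
```

Shuffling does not guarantee that the four orders differ from each other. With a full set of words a repeat is very unlikely, but a check for duplicate orders would be needed to be sure.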

In [None]:
# Function to convert CSV content to JS format
def csv_to_js_format(csv_content, set_name):
    js_entries = []
    for idx, row in enumerate(csv_content):
        if idx == 0:  # Skip header row
            continue
        word_type, word_pair = row
        original, anagram = word_pair.split(", ")
        js_entry = {
            "id": f"{str(idx + 1).zfill(3)}",  # header is row 0, so IDs start at 002 and restart for each CSV file
            "type": word_type,
            "anagram": anagram,
            "correct": original,
            "set": set_name,
        }
        js_entries.append(js_entry)
    return js_entries


# Function to shuffle a set of words and include set run order
def set_shuffle(word_list, set_name, run_number):
    shuffled_list = word_list[:]  # Create a copy of the word_list to shuffle
    random.shuffle(shuffled_list)  # Shuffle the list of words
    set_run = f"{set_name}{run_number}"
    return [
        {
            "id": word["id"],
            "type": word["type"],
            "anagram": word["anagram"],
            "correct": word["correct"],
            "set": word["set"],
            "setRun": set_run,
        }
        for word in shuffled_list
    ]


# Placeholder for all JS entries
all_entries = []

# Loop over each CSV file and set name
for i in range(4):
    filename = f"group_{i+1}_word_pairs.csv"
    set_name = f"Set{chr(65 + i)}"  # 'SetA', 'SetB', 'SetC', 'SetD'

    with open(filename, newline="") as csvfile:
        csvreader = csv.reader(csvfile)
        csv_content = list(csvreader)  # Convert csvreader to a list
        js_entries = csv_to_js_format(
            csv_content, set_name
        )  # Pass csv_content and set_name
        for run_number in range(1, 5):  # Create 4 runs for each set
            shuffled_set = set_shuffle(js_entries, set_name, run_number)
            all_entries.extend(shuffled_set)

# Save all entries into the JS file
stimuli_js_content = "let trial_objects = " + json.dumps(all_entries, indent=4) + ";"
with open("stimuli.js", "w") as file:
    file.write(stimuli_js_content)

# Count occurrences of each id. The counts look off, but it isn't really important, so I'll leave it as is.
# The ids restart for every CSV file, so one id is shared by a different word pair in each of the
# four sets. For example, 032 is about, water, birth, and trust.
id_counter = Counter(entry["id"] for entry in all_entries)
print("ID Occurrences:")
for id_, count in id_counter.items():
    print(f"id {id_}: {count} times")


Okay, I'm just looking for a sanity check, so I'm gonna run a loop over the stimuli js file to correct the IDs and see if that does it. I think this notebook should be a code review for lab someday.

In [None]:
import json
from collections import defaultdict

# Load the stimuli file
with open("stimuli.js", "r") as file:
    stimuli_js_content = file.read()

# Extract the JSON data from the stimuli file
json_data = json.loads(stimuli_js_content[len("let trial_objects = ") : -1])

# Create a mapping for unique anagrams to new IDs
unique_anagram_to_id = {}
id_counter = 1

for entry in json_data:
    anagram = entry["anagram"]
    if anagram not in unique_anagram_to_id:
        unique_anagram_to_id[anagram] = f"{id_counter:03d}"
        id_counter += 1

# Reassign IDs in the JSON data
for entry in json_data:
    entry["id"] = unique_anagram_to_id[entry["anagram"]]

# Save the updated entries into the JS file
updated_stimuli_js_content = (
    "let trial_objects = " + json.dumps(json_data, indent=4) + ";"
)
with open("updated_stimuli.js", "w") as file:
    file.write(updated_stimuli_js_content)

# Print the new ID mapping for verification
print("New ID mapping for unique anagrams:")
for anagram, new_id in unique_anagram_to_id.items():
    print(f"Anagram: {anagram}, New ID: {new_id}")

# This worked, so let's save to the stimuli.js file
with open("stimuli.js", "w") as file:
    file.write(updated_stimuli_js_content)