<a href="https://colab.research.google.com/github/nohstns/SpaceSimRep/blob/main/SimSpan_Rep_ItemGen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pseudoword generation for the SimSpan replication
This notebook contains the code that generates the pseudowords for our replication of [Nakata & Suzuki (2019)](https://www.cambridge.org/core/journals/studies-in-second-language-acquisition/article/effects-of-massing-and-spacing-on-the-learning-of-semantically-related-and-unrelated-words/F58BA8D70385603B9C42E408BFCB8A10). We rely on the `Wuggy` package, the [Python implementation](https://wuggycode.github.io/wuggy/) of the algorithm developed by [Keuleers & Brysbaert (2010)](http://crr.ugent.be/papers/Wuggy_BRM.pdf).

The notebook takes eight hardcoded English vocabulary sets and, for each item, either its Basque translation or, when the translation is not available in the Basque plugin data, a semantically related Basque word. It writes a `.csv` file with the following columns:
`INDEX  ITEM  SET  ENGLISH  SOURCE  PSEUDO`

The output file holds three candidate pseudowords per item, one per row. The final pseudoword for each item is meant to be selected manually after generation, so that unsuitable candidates can be discarded.
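To support the manual selection step, the candidates can be grouped by item once the file exists. The following is a minimal sketch, not part of the original pipeline: the helper name `candidates_by_item` is hypothetical, and it assumes only the six columns listed above, with empty `PSEUDO` cells marking missing candidates.

```python
import csv


def candidates_by_item(csv_path):
    """Group PSEUDO candidates by (ITEM, ENGLISH, SOURCE).

    Hypothetical helper: assumes the CSV has the columns
    INDEX, ITEM, SET, ENGLISH, SOURCE, PSEUDO described above.
    """
    grouped = {}
    with open(csv_path, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            key = (int(row["ITEM"]), row["ENGLISH"], row["SOURCE"])
            # setdefault keeps items that have no usable candidate at all
            candidates = grouped.setdefault(key, [])
            if row["PSEUDO"]:  # skip empty padding cells
                candidates.append(row["PSEUDO"])
    return grouped
```

Calling `candidates_by_item("generatedPseudoWords.csv")` after the notebook has run would return one entry per item, which can then be reviewed by hand.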

Questions can be sent to nafal@utexas.edu.

## Loading packages


In [1]:
!pip install Wuggy



In [2]:
from wuggy import WuggyGenerator
from csv import DictWriter

## Loading reference vocabulary

Reference (English) items

In [3]:
ref_related_set1 = ["baboon", "badger", "otter", "porcupine", "raccoon", "weasel"]
ref_related_set2 = ["diaphragm", "intestine", "placenta", "rectum", "tympanum", "womb"]
ref_related_set3 = ["bluff", "estuary", "plateau", "ravine", "shoal", "strait"]
ref_related_set4 = ["azalea", "camellia", "camphor", "cedar", "magnolia", "willow"]

ref_unrelated_set5 = ["alloy", "apparition", "kerosene", "kiln", "plumage", "rudder"]
ref_unrelated_set6 = ["cistern", "insurgent", "pall", "parable", "sardine", "venom"]
ref_unrelated_set7 = ["alcove", "pail", "pigment", "potassium", "relic", "toupee"]
ref_unrelated_set8 = ["berth", "fuselage", "ointment", "ore", "sentry", "tuberculosis"]

english_sets = [ref_related_set1, ref_related_set2, ref_related_set3, ref_related_set4,
                ref_unrelated_set5, ref_unrelated_set6, ref_unrelated_set7, ref_unrelated_set8]

Source (real Basque words) items

In [4]:
source_related_set1 = ["tximino", "azkonar", "igaraba", "triku", "ugaztun", "erbinude"]
source_related_set2 = ["diafragma", "heste", "plazenta", "uzki", "tinpano", "umetoki"]
source_related_set3 = ["labar", "estuario", "ordoki", "sakan", "saldo", "itsasarte"]
source_related_set4 = ["arrosa", "infusio", "iraunkor", "zedro", "lore", "sahats"]



In [5]:
source_unrelated_set5 = ["aleazio", "agerpen", "erregai", "labe", "luma", "lema"]
source_unrelated_set6 = ["urtegi", "matxinatu", "oihal", "parabola", "sardina", "pozoi"]
source_unrelated_set7 = ["bazter", "ontzi", "pigmentu", "potasio", "erlikia", "ileorde"]
source_unrelated_set8 = ["ohatze", "hegazkin", "ukendu", "meatze", "zaintzaile", "tuberkuloso"]

In [6]:
source_sets = [source_related_set1, source_related_set2, source_related_set3, source_related_set4,
               source_unrelated_set5, source_unrelated_set6, source_unrelated_set7, source_unrelated_set8]
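The English and Basque lists are paired by position later in the notebook, and Python's `zip` silently stops at the shorter of two sequences, so a missing item would go unnoticed. A small check along the following lines could catch that before generation. This is a sketch only; the function name `find_misaligned_sets` is hypothetical and not part of the original notebook.

```python
def find_misaligned_sets(english_sets, source_sets):
    """Return a list of problem descriptions; an empty list means the sets line up."""
    problems = []
    if len(english_sets) != len(source_sets):
        problems.append(
            f"{len(english_sets)} English sets vs {len(source_sets)} source sets"
        )
    # Compare the sets pairwise by position, as the generation loop does
    for position, (english, source) in enumerate(zip(english_sets, source_sets), start=1):
        if len(english) != len(source):
            problems.append(
                f"set {position}: {len(english)} English items vs {len(source)} source items"
            )
    return problems


# Toy data for illustration; in the notebook one would pass english_sets and source_sets
demo_problems = find_misaligned_sets([["otter", "weasel"]], [["igaraba"]])
print(demo_problems)
```

An empty result from `find_misaligned_sets(english_sets, source_sets)` would indicate that every English item has a Basque counterpart at the same position.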

## Running Wuggy
### Loading algorithm

In [7]:
g = WuggyGenerator()

In [8]:
g.supported_official_language_plugin_names

['orthographic_basque',
 'orthographic_dutch',
 'orthographic_english',
 'orthographic_french',
 'orthographic_german',
 'orthographic_italian',
 'orthographic_polish',
 'orthographic_serbian_cyrillic',
 'orthographic_serbian_latin',
 'orthographic_spanish',
 'orthographic_vietnamese',
 'orthographic_estonian',
 'phonetic_english_celex',
 'phonetic_english_cmu',
 'phonetic_french',
 'phonetic_italian']

In [9]:
g.load('orthographic_basque')

### Data preparation

In [10]:
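newData = []
# Labels written to the SET column, in the same order as english_sets and source_sets
set_names = [
    "related_set1", "related_set2", "related_set3", "related_set4",
    "unrelated_set5", "unrelated_set6", "unrelated_set7", "unrelated_set8"
]
counter = 1       # running item number (ITEM column), shared by all candidates of one word
idx_counter = 1   # running row number (INDEX column), one per candidate
ncandidates = 3   # number of candidate pseudowords stored per source word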

In [11]:
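# Create (or overwrite) the output file and write the header row;
# the generation cell appends the data rows afterwards.
with open('generatedPseudoWords.csv', 'w', newline="") as csvfile:
    fieldnames = ['INDEX', 'ITEM', 'SET', 'ENGLISH', 'SOURCE', 'PSEUDO']
    writer = DictWriter(csvfile, fieldnames)

    writer.writeheader()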

### Generating pseudowords
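For each source word, the generation cell in this section keeps the first `ncandidates` matches and pads the list with empty strings when fewer are available, so every item always occupies the same number of rows. That selection step can be sketched in isolation as follows. The helper name `first_n_pseudowords` is hypothetical, and the sketch assumes, as the cell does, that each match is a dictionary with a `"pseudoword"` key.

```python
def first_n_pseudowords(matches, n):
    """Return exactly n strings: the first n pseudowords, padded with "" if needed.

    Assumes each element of `matches` is a dict with a "pseudoword" key.
    """
    pseudowords = [match["pseudoword"] for match in matches[:n]]
    # Pad so that every source word yields the same number of candidates
    return pseudowords + [""] * (n - len(pseudowords))


# Illustration with hand-written matches rather than real generator output
example = first_n_pseudowords([{"pseudoword": "trolu"}, {"pseudoword": "trelu"}], 3)
print(example)
```

Padding rather than skipping keeps the `ITEM` numbering aligned across words, at the cost of empty `PSEUDO` cells that have to be filtered out during manual selection.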

In [12]:
for set_idx, (source_word_list, english_word_list, set_name) in enumerate(zip(source_sets, english_sets, set_names)):
    print(f"Processing Set: {set_name} (Index: {set_idx})")
    for word_idx, (source_word, english_word) in enumerate(zip(source_word_list, english_word_list)):
        print(f"  - Word Pair {word_idx+1}: Source='{source_word}', English='{english_word}'")

        pseudoword_matches = g.generate_classic([source_word])

        pseudowords_for_current_source_word = []

        for i in range(ncandidates):
            if i < len(pseudoword_matches):
                pseudowords_for_current_source_word.append(pseudoword_matches[i]["pseudoword"])
            else:
                pseudowords_for_current_source_word.append("")

        print(f"    Generated pseudowords: {pseudowords_for_current_source_word}")

        print("    Storing pseudowords in dataset")
        # Append one row per candidate; the header row was written when the file was created.
        with open('generatedPseudoWords.csv', 'a', newline="") as csvfile:
            fieldnames = ['INDEX', 'ITEM', 'SET', 'ENGLISH', 'SOURCE', 'PSEUDO']
            writer = DictWriter(csvfile, fieldnames)
            for pseudo_word in pseudowords_for_current_source_word:
                writer.writerow({
                    "INDEX": idx_counter,
                    "ITEM": counter,
                    "SET": set_name,
                    "ENGLISH": english_word,
                    "SOURCE": source_word,
                    "PSEUDO": pseudo_word
                })
                idx_counter += 1
        counter += 1
        print(f'  Pseudowords for {word_idx+1}. {source_word} stored.\n')
    print(f'Set {set_name} stored.\n')

print('DONE')

Processing Set: related_set1 (Index: 0)
  - Word Pair 1: Source='tximino', English='baboon'
    Generated pseudowords: ['tribido', 'tribilo', 'trihido']
    Storing pseudowords in dataset
  Pseudowords for 1. tximino stored.

  - Word Pair 2: Source='azkonar', English='badger'
    Generated pseudowords: ['azpudar', 'azpular', 'aztudar']
    Storing pseudowords in dataset
  Pseudowords for 2. azkonar stored.

  - Word Pair 3: Source='igaraba', English='otter'
    Generated pseudowords: ['odanama', 'odanaca', 'odanasa']
    Storing pseudowords in dataset
  Pseudowords for 3. igaraba stored.

  - Word Pair 4: Source='triku', English='porcupine'
    Generated pseudowords: ['trolu', 'trelu', 'trili']
    Storing pseudowords in dataset
  Pseudowords for 4. triku stored.

  - Word Pair 5: Source='ugaztun', English='raccoon'
    Generated pseudowords: ['ubentun', 'unentun', 'udiltun']
    Storing pseudowords in dataset
  Pseudowords for 5. ugaztun stored.

  - Word Pair 6: Source='erbinude', E