<a href="https://colab.research.google.com/github/nohstns/SpaceSimRep/blob/main/SimSpan_Rep_ItemGen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pseudoword generation for the SimSpan replication
This notebook contains the code that generates the pseudowords for our replication of [Nakata & Suzuki (2019)](https://www.cambridge.org/core/journals/studies-in-second-language-acquisition/article/effects-of-massing-and-spacing-on-the-learning-of-semantically-related-and-unrelated-words/F58BA8D70385603B9C42E408BFCB8A10). We rely on `Wuggy`, the [Python implementation](https://wuggycode.github.io/wuggy/) of the pseudoword-generation algorithm developed by [Keuleers & Brysbaert (2010)](http://crr.ugent.be/papers/Wuggy_BRM.pdf).

## Loading packages


In [None]:
!pip install Wuggy

Collecting Wuggy
  Downloading wuggy-1.1.2-py3-none-any.whl.metadata (601 bytes)
Collecting Levenshtein>=0.12.0 (from Wuggy)
  Downloading levenshtein-0.27.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (3.7 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein>=0.12.0->Wuggy)
  Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading wuggy-1.1.2-py3-none-any.whl (14 kB)
Downloading levenshtein-0.27.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (153 kB)
Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
Installing collected packages: rapidfuzz, Levenshtein, Wuggy
Successfully installed Levensh

In [None]:
from wuggy import WuggyGenerator
from csv import DictWriter

In [None]:
test_set0 = ["tximino", "azkonar", "igaraba"]

## Loading reference vocabulary

Reference (English) items

In [None]:
ref_related_set1 = ["baboon", "badger", "otter", "porcupine", "raccoon", "weasel"]
ref_related_set2 = ["diaphragm", "intestine", "placenta", "rectum", "tympanum", "womb"]
ref_related_set3 = ["bluff", "estuary", "plateau", "ravine", "shoal", "strait"]
ref_related_set4 = ["azalea", "camellia", "camphor", "cedar", "magnolia", "willow"]

ref_unrelated_set5 = ["alloy", "apparition", "kerosene", "kiln", "plumage", "rudder"]
ref_unrelated_set6 = ["cistern", "insurgent", "pall", "parable", "sardine", "venom"]
ref_unrelated_set7 = ["alcove", "pail", "pigment", "potassium", "relic", "toupee"]
ref_unrelated_set8 = ["berth", "fuselage", "ointment", "ore", "sentry", "tuberculosis"]

english_sets = [ref_related_set1, ref_related_set2, ref_related_set3, ref_related_set4,
                ref_unrelated_set5, ref_unrelated_set6, ref_unrelated_set7, ref_unrelated_set8]

Source (real Basque words) items

In [None]:
source_related_set1 = ["tximino", "azkonar", "igaraba", "triku", "mapatxea", "erbiarra"]
source_related_set2 = ["diafragma", "hestea", "plazenta", "ondestea", "tinpanoa", "umetokia"]
source_related_set3 = ["labarra", "estuarioa", "lautada", "arroka", "sardun-arraina", "itsasartea"]
source_related_set4 = ["azalea", "kamelia", "kanforra", "zedroa", "magnolia", "sahats"]



In [None]:
source_unrelated_set5 = ["aleazioa", "agerpena", "kerosenoa", "labea", "lumajea", "lema"]
source_unrelated_set6 = ["zisterna", "matxinatua", "oihal-zapia", "parabola", "sardina", "pozoia"]
source_unrelated_set7 = ["alkoba", "ontzia", "pigmentua", "potasioa", "erlikia", "tupea"]
source_unrelated_set8 = ["kaiola", "fuselajea", "ukendua", "mea", "zaintzailea", "tuberkulosia"]

In [None]:
source_sets = [source_related_set1, source_related_set2, source_related_set3, source_related_set4,
               source_unrelated_set5, source_unrelated_set6, source_unrelated_set7, source_unrelated_set8]

## Running Wuggy

In [None]:
g = WuggyGenerator()

In [None]:
g.supported_official_language_plugin_names

['orthographic_basque',
 'orthographic_dutch',
 'orthographic_english',
 'orthographic_french',
 'orthographic_german',
 'orthographic_italian',
 'orthographic_polish',
 'orthographic_serbian_cyrillic',
 'orthographic_serbian_latin',
 'orthographic_spanish',
 'orthographic_vietnamese',
 'orthographic_estonian',
 'phonetic_english_celex',
 'phonetic_english_cmu',
 'phonetic_french',
 'phonetic_italian']

In [None]:
g.load('orthographic_basque')

In [None]:
ncandidates = 3  # intended number of candidates per word; not yet passed to generate_classic below

In [None]:
for match in g.generate_classic(test_set0):
    print(match["pseudoword"])

pribido
pribilo
prihilo
prihido
trihido
trihilo
tribilo
tribido
fribilo
fribido
azpudar
azpular
aztudar
aztular
askular
askudar
aspolar
aspodar
aspunar
aktolar
omalada
omalafa
omalaja
omalasa
omalapa
omalaxa
omalama
omalaza
omalaha
omalaga


Steps to add:

- Zip each generated pseudoword with its original English word.
- Structure the export so it has the columns `Item`, `Set`, `English`, `Pseudo`, `Reference`.
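
The two steps above could be sketched as follows. This is only a minimal illustration, not the final export code: the helper names `build_rows` and `write_rows`, the set label `"related_1"`, and the reading of `Reference` as the Basque source word are all assumptions. The pseudowords are the first candidates printed above for each test item, and the sketch writes to an in-memory buffer so it runs without Wuggy.

```python
import csv
import io

# Column layout taken from the to-do list above.
FIELDNAMES = ["Item", "Set", "English", "Pseudo", "Reference"]


def build_rows(set_label, english_words, source_words, pseudowords, start_item=1):
    """Pair each English word with its source word and one chosen pseudoword.

    Hypothetical helper: "Reference" is assumed to hold the Basque source word.
    """
    if not (len(english_words) == len(source_words) == len(pseudowords)):
        raise ValueError("all three lists must have the same length")
    rows = []
    for offset, (english, source, pseudo) in enumerate(
        zip(english_words, source_words, pseudowords)
    ):
        rows.append(
            {
                "Item": start_item + offset,
                "Set": set_label,
                "English": english,
                "Pseudo": pseudo,
                "Reference": source,
            }
        )
    return rows


def write_rows(rows, handle):
    """Write the rows as CSV with a header line (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# First three items of related set 1, paired by position.
english = ["baboon", "badger", "otter"]
source = ["tximino", "azkonar", "igaraba"]
pseudo = ["pribido", "azpudar", "omalada"]  # first candidate per word in the output above

rows = build_rows("related_1", english, source, pseudo)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write a file instead, the same `write_rows` call can be given a handle opened with `open(path, "w", newline="")`.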

In [None]:
# Regenerate candidates for the test items and save them with Wuggy's built-in CSV exporter
pseudoword_matches = g.generate_classic(test_set0)
g.export_classic_pseudoword_matches_to_csv(pseudoword_matches, "./pseudowords.csv")