<h1>What does this program do?</h1>

Turn a selection of words into a puzzle.

<h2>What is the puzzle?</h2>

One aspect of every word changes, and the reader must either figure out what changed, or what the original text said.

<h2>What is changed about the words?</h2>

There are three possible changes for each word: spelling, pronunciation, and definition.

<h3>Definition</h3>

The program looks the word up in an online dictionary, and replaces it with a word offset from the original by some amount. For example, the 10th word after it, or the 6th word before it.
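
A minimal sketch of the offset lookup, assuming the dictionary can be treated as an alphabetically sorted list of headwords. `HEADWORDS` and `offset_word` are hypothetical names, and the tiny list only stands in for the online dictionary.

```python
from bisect import bisect_left

# Hypothetical stand-in for an online dictionary's sorted headword list.
HEADWORDS = sorted(["apple", "bog", "cat", "dog", "eel", "fig", "thaw", "tho"])

def offset_word(word, offset, headwords=HEADWORDS):
    """Return the headword `offset` entries away from `word`, or None if unavailable."""
    i = bisect_left(headwords, word)
    if i == len(headwords) or headwords[i] != word:
        return None  # the word is not a headword
    j = i + offset
    return headwords[j] if 0 <= j < len(headwords) else None

print(offset_word("cat", 2))   # eel   (2nd word after "cat")
print(offset_word("dog", -3))  # apple (3rd word before "dog")
```

Returning `None` at the edges of the list is one possible choice; wrapping around would also work.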

<h3>Spelling & Pronunciation</h3>

Since this puzzle is entirely text-based, changes to pronunciation must actually affect a word's spelling. Therefore, changes to spelling and pronunciation both change spelling, but in different ways.

A word consists of three aspects: appearance, sound, and meaning. Its meaning is abstract, applied to a word by historical use and present societal use and understanding. As for appearance and sound, a graphical representation of an English word consists of letters, which then represent sounds. Individual sounds are called __phonemes__, and the letters or groups of letters that represent phonemes are called __graphemes__.

Phonemes in English have a plethora of irregular spellings, some appearing in only a single word (and its derivatives). That is what my program will play on. When changing the _spelling_ of a word, it will pick a random phoneme and its associated grapheme in the word, and change the grapheme to another valid representation of that same phoneme, for example changing 'thaw' to 'tho', where the 'aw' sound is spelled like the 'o' in 'bog'. When changing a _pronunciation_, however, it selects a random phoneme in the word, replaces it with another phoneme, and picks a valid grapheme for the new phoneme.
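
The two kinds of change can be sketched as follows. This assumes the word has already been aligned into (phoneme, grapheme) pairs; `GRAPHEMES` is a made-up toy table, and all function names are hypothetical. The real program would pick the index and replacement at random, but they are passed in here so the result is repeatable.

```python
# Toy grapheme table (an assumption for illustration, not learned data).
GRAPHEMES = {
    "TH": ["th"],
    "AO": ["aw", "o", "au"],
    "OW": ["oa", "o", "ow"],
}

def change_spelling(aligned, index, new_grapheme):
    """Keep the phoneme at `index` and swap only its grapheme."""
    phoneme, _ = aligned[index]
    if new_grapheme not in GRAPHEMES[phoneme]:
        raise ValueError(f"{new_grapheme!r} is not a known spelling of {phoneme}")
    out = list(aligned)
    out[index] = (phoneme, new_grapheme)
    return out

def change_pronunciation(aligned, index, new_phoneme):
    """Swap the phoneme at `index` and spell it with its first known grapheme."""
    out = list(aligned)
    out[index] = (new_phoneme, GRAPHEMES[new_phoneme][0])
    return out

def spell_out(aligned):
    """Join the graphemes back into a written word."""
    return "".join(g for _, g in aligned)

thaw = [("TH", "th"), ("AO", "aw")]
print(spell_out(change_spelling(thaw, 1, "o")))        # tho
print(spell_out(change_pronunciation(thaw, 1, "OW")))  # thoa
```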

<h4>How does the program know what phonemes are?</h4>
It will look up the word on Wiktionary, which has the IPA representation of the word. For a first pass, the program treats each IPA symbol as one phoneme, although diphthongs and affricates are often written with two symbols.

<h4>How will Python recognize IPA symbols?</h4>
Hopefully I can use Unicode. Otherwise, I don't know.
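
A quick check of the Unicode idea: Python 3 strings are sequences of Unicode code points, so IPA symbols can be stored and iterated like any other characters. The transcription of 'thaw' below is assumed for illustration.

```python
import unicodedata

ipa = "θɔ"  # an IPA rendering of "thaw" (assumed for illustration)
for symbol in ipa:
    # Print the symbol, its code point, and its official Unicode name.
    print(symbol, f"U+{ord(symbol):04X}", unicodedata.name(symbol))

print(len(ipa))   # 2: one code point per symbol
print(len("tʃ"))  # 2: the 'ch' sound is written with two code points
```

The last line is the case to watch: a sound usually counted as one phoneme can take two symbols, so counting characters will not always count phonemes.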

<h4>What about English vowels? How will you represent their complexity?</h4>


<h1>IDEA:</h1>
Instead of directly changing the phoneme/grapheme, the program picks an English dialect to start with, and another to change it to. It then changes the spelling of the word to match how someone speaking the first dialect would spell the pronunciation of someone speaking the second dialect.

10/15

Step 1: write code that can understand phonemes and break words down into them

Step 2: use some base text to derive all possible spellings of each phoneme

In [1]:
with open('dracula_frankenstein.txt', 'r', encoding='utf-8') as f:
    rawText = f.read()

print(rawText[:10])


You will 


In [2]:
with open('cmuDict.txt', 'r', encoding='utf-8') as f:
    rawDict = f.read()

cmu = rawDict.split('\n')[56:]
print(cmu[0])

!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T


Problem: though I now have a phonetic spelling of most English words, it cannot tell me which grapheme each sound belongs to. Some graphemes consist of multiple letters, and others represent multiple distinct phonemes.

Solution?: first, assume that each letter is its own grapheme. If that lines up with the phonemes, move on. Otherwise, for the graphemes that do not cleanly match, try sticking adjacent letters together until they do.
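
The first step of that idea, sketched under the assumption that "lines up" simply means the letter count equals the phoneme count. `naive_align` is a hypothetical helper, not part of the notebook's code.

```python
def naive_align(spelling, phonemes):
    """Pair letters with phonemes one-to-one, or return None if the counts differ."""
    if len(spelling) != len(phonemes):
        return None  # some grapheme spans several letters (or several phonemes)
    return list(zip(spelling, phonemes))

print(naive_align("bat", ["B", "AE", "T"]))       # [('b', 'B'), ('a', 'AE'), ('t', 'T')]
print(naive_align("thought", ["TH", "AO", "T"]))  # None
```

A matching count is necessary but not sufficient: 'axe' (AE K S) has three letters and three phonemes, yet the 'x' covers both K and S and the 'e' is silent.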

In [3]:
testWord = cmu[0].strip('! ')
word, pron = testWord.split(maxsplit=1)
word = word.strip()
pron = pron.strip()
print(word, '\n', pron)
phonemes_old = pron.split()


EXCLAMATION-POINT 
 EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T


In [4]:
class coolWord:
    def __init__(self, spelling):
        self.s = spelling
        

In [5]:
baseWords = rawText.split()
baseWords = set(baseWords)
for word in baseWords:
    pass

10/17

in the ARPAbet, some symbols represent vowel sounds, and others consonants. there is always at least one vowel per word. so, words can be broken up into chunks with exactly one vowel sound each.
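
A rough sketch of that chunking, assuming the 15 vowel symbols used by the CMU dictionary and stress digits on the end of vowels. `VOWELS` and `vowel_chunks` are hypothetical names. Each chunk closes at a vowel, and trailing consonants join the last chunk; this is not real syllabification, just one way to guarantee one vowel per chunk.

```python
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def vowel_chunks(pron):
    """Split a CMU-style pronunciation into chunks with one vowel sound each."""
    chunks, current = [], []
    for phone in (p.rstrip("012") for p in pron.split()):  # drop stress digits
        current.append(phone)
        if phone in VOWELS:
            chunks.append(current)
            current = []
    if current:                     # consonants left over after the last vowel
        if chunks:
            chunks[-1].extend(current)
        else:
            chunks.append(current)  # no vowel sound at all
    return chunks

print(vowel_chunks("L IY0 EY1 Z AA2 N"))
# [['L', 'IY'], ['EY'], ['Z', 'AA', 'N']]
```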

In [6]:
useful_words = {}
for w in cmu:
    try:
        t, p = w.split(maxsplit=1)
        useful_words[t] = p
    except (IndexError, ValueError) as e:
        print(w)
        print(e)

In [7]:
useful_words = {w.split(maxsplit=1)[0]: w.split(maxsplit=1)[1] for w in cmu}

In [8]:
useful_words['LIAISON']

'L IY0 EY1 Z AA2 N'

In [9]:
testWord2 = cmu[100].strip('! ')
word, pron = testWord2.split(maxsplit=1)
word = word.strip()
pron = pron.strip()
print(word, '\n', pron)
phonemes = pron.split()

AARONSON(1) 
 AA1 R AH0 N S AH0 N


In [10]:
print(type(phonemes[0]))

<class 'str'>


In [1]:
vowels = [    # General American English specifically
    'AA',     # balm, bot
    'AE',     # bat
    'AH',     # butt
    'AO',     # stOry      this gave me a heart attack when Wikipedia gave another example as 'cAUGHt' which is not at all the same sound to me
    'AW',     # bout
    'AX',     # commA (schwa)
    'AY',     # bite
    'EH',     # bet
    'ER',     # bIRd, forewORd
    'EY',     # bait
    'IH',     # bit
    'IX',     # rosEs, rabbIt
    'IY',     # beat
    'OW',     # boat
    'OY',     # boy
    'UH',     # book
    'UW'      # boot
]
# source: https://en.wikipedia.org/wiki/ARPABET
consonants = [
    'B',      # buy
    'CH',     # China
    'D',      # die
    'DH',     # thy
    'DX',     # buTTer
    'EL',     # bottLE
    'EM',     # rhythM
    'EN',     # buttON
    'F',      # fight
    'G',      # guy
    'HH',     # High
    'JH',     # jive
    'K',      # kite
    'L',      # lie
    'M',      # my
    'N',      # nigh
    'NG',     # siNG
    'P',      # pie
    'Q',      # uh-oh (glottal stop)
    'R',      # rye
    'S',      # sigh
    'SH',     # shy
    'T',      # tie
    'TH',     # thigh
    'V',      # vie
    'W',      # wise
    'WH',     # why (for fancy people)
    'Y',      # yacht
    'Z',      # zoo
    'ZH'      # pleaSure
]

In [12]:
class coolWord:
    def __init__(self, spelling, pronounciation):
        symbols = "~!@#$%^&*()-_=+[]{}\\|;:\'\",<.>/?1234567890"
        self.word = spelling.strip(symbols).lower()
        ps = pronounciation.split()
        ps2 = [p.strip(symbols) for p in ps]
        self.p = ' '.join(ps2)        
        self.x = False
        self.xx = False
        self.xIndex = 0
        if 'x' in self.word:
            self.x = True
            if 'xx' in self.word:
                self.xx = True
                self.s = self.word.replace('xx', 'ks')
            else:
                self.xIndex = self.word.index('x')
                if self.xIndex == 0:
                    self.s = self.word.replace('x', 'z')
                elif self.xIndex == len(self.word) - 1 and ps2[-1] == 'OW':
                    self.s = self.word.replace('x', '')
                elif self.word[self.xIndex - 1] in 'aeiou' and self.xIndex != len(self.word) - 1 and self.word[self.xIndex + 1] in 'aeiou':
                    self.s = self.word.replace('x', 'gz')
                else:
                    self.s = self.word.replace('x', 'ks')
        else:
            self.s = self.word
    def __str__(self):
        return self.word
    def __repr__(self):
        return self.word
    def atomize(self, v, c):
        current_letters = []
        current_sounds = []
        final = {}
        vows = []
        cons = []
        letter_counter = 0
        while letter_counter < len(self.s):
            current_letter = letter_counter
            while letter_counter < len(self.s) and self.s[letter_counter] in "qwrtypsdfghjklzxcvbnm":
                current_letters.append(self.s[letter_counter])
                letter_counter += 1
            final[current_letter] = ''.join(current_letters)
            current_letters = []
            if letter_counter >= len(self.s):
                break
            current_letter = letter_counter
            while letter_counter < len(self.s) and self.s[letter_counter] in "aeiou":
                current_letters.append(self.s[letter_counter])
                letter_counter += 1
            final[current_letter] = ''.join(current_letters)
            current_letters = []
        self.c = cons
        self.v = vows
        self.f = final
#        for i in self.p:
#            if i in v:
#                pass
#            elif i in c:
#                current_sounds.append(i)
#                self.c[letter_counter].append(current_letters)
#                letter_counter += 1
#            else:
#                print("weirdo ->", i)

In [13]:
useful_words['ANGST']

'AA1 NG K S T'

In [14]:
useful_words['THOUGHT']

'TH AO1 T'

In [15]:
wordObjs = []
for line in cmu:
    wordObjs.append(coolWord(*line.split(maxsplit=1)))

In [16]:
print(wordObjs[0])

exclamation-point


In [17]:
wordObjs2 = {line.split(maxsplit=1)[0]: coolWord(*line.split(maxsplit=1)) for line in cmu}

In [18]:
hello = wordObjs2['HELLO']
hello.atomize('v', 'c')
print(hello.f)
print(hello.p)

{0: 'h', 1: 'e', 2: 'll', 4: 'o'}
HH AH L OW


10/22

In [19]:
def getPhoneStructure(word):
    phones = []
    for phone in word.p.split():
        if phone in vowels:
            phones.append('V')
        else:
            phones.append('C')
    return ''.join(phones)

In [20]:
getPhoneStructure(wordObjs2['HELLO'])

'CVCV'

In [21]:
angst = wordObjs2['ANGST']
angst.atomize('v', 'c')
print(angst.f)
print(angst.p)
getPhoneStructure(angst)

{0: 'a', 1: 'ngst'}
AA NG K S T


'VCCCC'

In [22]:
wordObjs2['BINGE'].p

'B IH N JH'

start at the end of a word.

if it ends in a vowel sound, then the last letter must be part of the grapheme for a vowel phoneme.
<ul>
    <li>if the next sound is also a vowel, find the number of vowel phonemes at the end (vPhon) and vowel letters (vLet)</li>
    <ul>
        <li>if those are equal, assign each phoneme to each letter in order as its grapheme. [note: is this always accurate?]</li>
        <li>then return to loop</li>
        <li>otherwise, ???</li>
    </ul>
    <li>otherwise, while the last letter is not a vowel, add letters to the current grapheme.</li>
    <ul>
        <li>once the last letter is a vowel(aeiouy), add that to the grapheme.</li>
        <li>If the next letter is a vowel and the next sound is not, add vowels until you reach a consonant</li>
        <li>Otherwise, return to start of loop</li>
    </ul>
</ul>
if the word ends in a consonant sound, then the last letter must be part of the grapheme for a consonant phoneme.
<ul>
    <li>add all letters to the current grapheme, until reaching a vowel letter, unless the first letter is a vowel</li>
    <li>add all consonant phonemes up to the next vowel sound to some object, to analyze later. do the same for the letters</li>
    
</ul>
then, sort out any clumps of consonants (somehow)

In [23]:
thought = wordObjs2['THOUGHT']
thought.atomize('v', 'c')
print(thought.f)
print(thought.p)

{0: 'th', 2: 'ou', 4: 'ght'}
TH AO T


In [24]:
villa = wordObjs2['VILLA']
villa.atomize('v', 'c')
print(villa.f)
print(villa.p)

{0: 'v', 1: 'i', 2: 'll', 4: 'a'}
V IH L AH


In [25]:
shingle = wordObjs2['SHINGLE']
shingle.atomize('v', 'c')
print(shingle.f)
print(shingle.p)
getPhoneStructure(shingle)

{0: 'sh', 2: 'i', 3: 'ngl', 6: 'e'}
SH IH NG G AH L


'CVCCVC'

In [26]:
misshapen = wordObjs2['MISSHAPEN']
misshapen.atomize('v', 'c')
print(misshapen.f)
print(misshapen.p)
getPhoneStructure(misshapen)

{0: 'm', 1: 'i', 2: 'ssh', 5: 'a', 6: 'p', 7: 'e', 8: 'n'}
M IH S SH EY P AH N


'CVCCVCVC'

In [27]:
missile = wordObjs2['MISSILE']
missile.atomize('v', 'c')
print(missile.f)
print(missile.p)
getPhoneStructure(missile)

{0: 'm', 1: 'i', 2: 'ss', 4: 'i', 5: 'l', 6: 'e'}
M IH S AH L


'CVCVC'

the first letter of a word must belong to its first phoneme, and the last letter must belong to its last phoneme.

what if you gave a program the very basic pronunciations of letters, and when it comes across a word that it cannot pronounce, have it learn the new pronunciations?

or, give it the basic spellings of phonemes, and when it doesn't spell a word right, have it learn the spelling?

In [28]:
singer = wordObjs2['SINGER']
singer.atomize('v', 'c')
print(singer.f)
print(singer.p)
getPhoneStructure(singer)

{0: 's', 1: 'i', 2: 'ng', 4: 'e', 5: 'r'}
S IH NG ER


'CVCV'

In [29]:
spellings = {
    'AA': ['o'],
    'AE': ['a'],
    'AH': ['u'],
    'AO': ['o'],
    'AW': ['ou'],
    'AX': ['a'],
    'AXR': ['er'],
    'AY': ['i'],
    'EH': ['e'],
    'ER': ['ir'],
    'EY': ['ai'],
    'IH': ['i'],
    'IX': ['e'],
    'IY': ['ea'],
    'OW': ['oa'],
    'OY': ['oy'],
    'UH': ['oo'],
    'UW': ['oo'],
    'UX': ['u'],
    'B': ['b'],
    'CH': ['ch'],
    'D': ['d'],
    'DX': ['tt'],
    'EL': ['le'],
    'EM': ['m'],
    'EN': ['on'],
    'F': ['f'],
    'G': ['g'],
    'H': ['h'],
    'HH': ['h'],
    'JH': ['j'],
    'K': ['k'],
    'L': ['l'],
    'M': ['m'],
    'N': ['n'],
    'NX': ['ng'],
    'NG': ['ng'],
    'P': ['p'],
    'Q': ['-'],
    'R': ['r'],
    'S': ['s'],
    'SH': ['sh'],
    'T': ['t'],
    'TH': ['th'],
    'V': ['v'],
    'W': ['w'],
    'WH': ['wh'],
    'Y': ['y'],
    'Z': ['z'],
    'ZH': ['s']
}

In [30]:
def spell(word):
    phones = word.p.split()
    spelling = []
    for p in phones:
        spelling.append(spellings[p][0])
    return spelling

In [31]:
spell(wordObjs2['HELLO'])

['h', 'u', 'l', 'oa']

In [32]:
spell(wordObjs2['BAT'])

['b', 'a', 't']

In [33]:
wordObjs2 = {word: wordObjs2[word] for word in wordObjs2 if len(wordObjs2[word].p.split()) < 2 * len(word)}

In [34]:
threeLetter = [wordObjs2[word] for word in wordObjs2 if len(wordObjs2[word].s) == 3]

In [35]:
len(threeLetter)

1662

In [36]:
twoLetter = [wordObjs2[word] for word in wordObjs2 if len(wordObjs2[word].s) == 2]

In [37]:
len(twoLetter)

233

In [38]:
for word in twoLetter:
    news = spell(word)
    print(f"{word.s.lower()}: {word.p}: {news}")
    if word.s.lower() != ''.join(news):  # spell() returns a list of graphemes, so join before comparing
        break

em: AH M: ['u', 'm']


In [39]:
print(spell(wordObjs2['BY']))

['b', 'i']


In [40]:
def spell2(word):
    phones = word.p.split()
    spelling = []
    
    while ''.join(spelling) != word.s:
        spelling = []
        for p in phones:
            pass
            #if ''.join(spelling + spellings[p][0]

In [41]:
print(len(''.join([])))

0


10/24

So I have the basic framework: the program "learns" every possible different spelling of each English phoneme, by trying to spell words and memorizing new spellings.
However, this assumes that each word has one new grapheme maximum. Any more and it couldn't tell them apart.

So, it must start with small words, to minimize the chance of multiple new graphemes per word. It'll start with two-letter words, then three, and so on.
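
The shortest-first ordering could be sketched like this, assuming plain strings for words. `by_length` is a hypothetical helper.

```python
from collections import defaultdict

def by_length(words):
    """Group words by letter count and return the groups shortest first."""
    groups = defaultdict(list)
    for w in words:
        groups[len(w)].append(w)
    return [groups[n] for n in sorted(groups)]

print(by_length(["thaw", "by", "bat", "mo", "hello", "em"]))
# [['by', 'mo', 'em'], ['bat'], ['thaw'], ['hello']]
```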

However. It still needs a way to figure out which part is misspelled, and there's not a simple way to subtract two arbitrary strings. They could be different lengths, offsetting the letters from each other, so a 1-1 comparison wouldn't work.

But I had an idea in the shower yesterday: we're already assuming that there's no more than one new grapheme per word, and we can use that here.

We have the original word, with its pronunciation, as well as the code's attempt at spelling it. The attempt is also split into its graphemes.

1. Start at the beginning of the original word. Check if it starts with the same letters as the first guessed grapheme.
2. If it does, then we discard those letters, and repeat with the next grapheme.
3. If not, then we learn that new grapheme.

However, this method has a critical flaw.

A word starting with the same letters as the guessed grapheme does not mean that it guessed the grapheme correctly. For example, say the original word is 'witty' and it guessed 'wity'. It would recognize that it spelled 'w' and 'i' correctly, so it looks at the rest of the word, 'tty'. 'tty' starts with 't', so it would recognize that as correct, and remove one 't'. Then, it checks the remaining 'ty', and finds that its guess of 'y' was incorrect, learning 'ty' as a new spelling. But this was incorrect.
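
The 'witty' failure can be reproduced with a small sketch of the left-to-right check. `greedy_prefix_check` is a hypothetical name, and the guess is passed in as a list of graphemes.

```python
def greedy_prefix_check(word, guess):
    """Consume `word` from the left, one guessed grapheme at a time.

    Returns (index of the first grapheme that fails, letters left over),
    or (None, '') if every grapheme matched and nothing is left.
    """
    rest = word
    for i, g in enumerate(guess):
        if not rest.startswith(g):
            return i, rest
        rest = rest[len(g):]
    return (None, "") if not rest else (len(guess), rest)

print(greedy_prefix_check("witty", ["w", "i", "t", "y"]))  # (3, 'ty')
```

The check blames grapheme 3 (the final 'y') and would learn 'ty' for it, even though the grapheme it actually got wrong was the 'tt' at index 2.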

<ul>
    <li>Starting at the end of the word instead would not fix the issue.</li>
    <li>I did have an idea on Tuesday:</li>
</ul>

1. Start at index 0 of the original word. Capture letters from the word until it finds a letter that doesn't match the guess.
2. Do this for every index (letter)
3. Check each group of letters to see if they correspond to a guessed grapheme.

I FORGOT MY BETTER IDEA FROM THE SHOWER (included below) (I had the basic idea in the shower, but actually worked it out and refined it today)

<ol>
    <li>Guess a spelling for the first phoneme.</li>
    <li>If the word starts with that spelling, then save that guess.</li>
    <li>If it doesn't start with that, guess another spelling until it matches.</li>
    <ol>
        <li>If no known graphemes match, skip to step 8.</li>
    </ol>
    <li>Guess a spelling for the second phoneme.</li>
    <li>Add it to the current correct guess, and check whether it matches the start of the word.</li>
    <li>If it does, save that guess and repeat steps 5-6.</li>
    <li>If the current correct guess is the same as the original word, then it is spelled correctly. Exit loop.</li>
    <li>Do the same procedure, but starting at the end of the word. Save the current correct guess from this step separately.</li>
    <li></li>
</ol>
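
A compact sketch of the front-and-back idea from the list above, kept separate from the notebook's own functions. It assumes Python 3.9+ for `str.removeprefix`/`str.removesuffix`; `match_from`, `missing_grapheme`, and `TABLE` are hypothetical names, and the table is a toy.

```python
def match_from(word, phones, table, from_end=False):
    """Greedily match known graphemes from one end of the word.

    Stops at the first phoneme for which no known grapheme fits.
    """
    matched = []
    order = reversed(phones) if from_end else phones
    for p in order:
        so_far = "".join(matched)
        for g in table.get(p, []):
            candidate = g + so_far if from_end else so_far + g
            fits = word.endswith(candidate) if from_end else word.startswith(candidate)
            if fits:
                if from_end:
                    matched.insert(0, g)
                else:
                    matched.append(g)
                break
        else:
            break  # no known grapheme fits this phoneme
    return matched

def missing_grapheme(word, phones, table):
    """Return (front match, unexplained middle letters, back match)."""
    front = match_from(word, phones, table)
    back = match_from(word, phones, table, from_end=True)
    middle = word.removeprefix("".join(front)).removesuffix("".join(back))
    return front, middle, back

TABLE = {"B": ["b"], "IY": ["ee"], "AY": ["i"], "T": ["t"]}  # toy table (assumed)
print(missing_grapheme("by", ["B", "AY"], TABLE))          # (['b'], 'y', [])
print(missing_grapheme("beat", ["B", "IY", "T"], TABLE))   # (['b'], 'ea', ['t'])
```

When exactly one phoneme is left unmatched between the two sides, the middle letters are the candidate new grapheme for it. This only holds under the one-new-grapheme-per-word assumption.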



In [42]:
which = wordObjs2['BORDEAUX']
print(which.word, which.s)
which.atomize('v', 'c')
print(which.f)
print(which.p)
getPhoneStructure(which)

bordeaux bordeau
{0: 'b', 1: 'o', 2: 'rd', 4: 'eau'}
B AO R D OW


'CVCCV'

In [43]:
from time import sleep
from copy import copy

def sideSpell(word, phones, spellings, polarity):
    currentSpell = []
    for p in phones:
        oldLen = len(currentSpell)
        for s in spellings[p]:
            newSpell = copy(currentSpell)
            if not polarity:
                newSpell.append(s)
                if word.s.startswith(''.join(newSpell)):
                    currentSpell.append(s)
                    break
            else:
                newSpell.insert(0, s)
                if word.s.endswith(''.join(newSpell)):
                    currentSpell.insert(0, s)
                    break                
        if len(currentSpell) == oldLen:
            break
    return currentSpell

def spell3(word, spellings):
    word.s = word.s.lower()
    if not word.s.isalnum():
        print("bad characters", word.s)
        return False
    print(word)
    phones = word.p.split()
    if len(word.s) < len(phones):
        print("too long!", word.s, phones)
        return False
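    # Words ending in consonant + 'le' are pronounced ... AH L, so move AH after L
    # to line the phonemes up with the letter order 'l', 'e'.
    if len(word.s) > 2 and word.s[-2:] == 'le' and phones[-2:] == ['AH', 'L'] and word.s[-3] not in 'aeiou':
        phones.append(phones.pop(-2))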
        print("special case: L")
    rphones = copy(phones)
    rphones.reverse()
    print("phones", phones, rphones)
    frontSpell = sideSpell(word, phones, spellings, 0)
    backSpell = sideSpell(word, rphones, spellings, 1)
    print("front & back", frontSpell, backSpell)
    if '' in frontSpell:
        frontSpell.remove('')
    if '' in backSpell:
        backSpell.remove('')
    if frontSpell == backSpell and ''.join(frontSpell) == word.s:
        print('yay!')
        return True
    missingG = word.s.removeprefix(''.join(frontSpell)).removesuffix(''.join(backSpell))
    if not missingG:
        missingG = backSpell[0]
    print("missing grapheme", missingG)
    if len(frontSpell + backSpell) > len(phones):
            #print(word, frontSpell, len(frontSpell))
            #print(backSpell, len(backSpell))
            #print(phones)
            print("too long")
            del backSpell[0]
    if len(frontSpell + backSpell) < len(phones):
        try:
            missingP = phones[len(frontSpell)]
            print("missing phoneme (clean)", missingP)
            spellings[missingP].append(missingG)
        except IndexError as e:
            print(word, frontSpell, len(frontSpell))
            print(backSpell, len(backSpell))
            print(phones)
            raise e
    else:
        missingP = phones[len(frontSpell) - 1]
        print("missing phoneme (overlap)", missingP)
        try:
            spellings[missingP].append(''.join((frontSpell[-1], missingG) if frontSpell else (missingG, backSpell[0])))
            print(''.join((frontSpell[-1], missingG) if frontSpell else (missingG, backSpell[0])))
        except IndexError as e:
            print(word, frontSpell, len(frontSpell))
            print(backSpell, len(backSpell))
            print(phones)
            print(missingP)
            raise e

def learn(spellings, words):
    for word in words:
        spell3(word, spellings)
        print(word, word.p)
        break
    return spellings

In [44]:
def clean(aDict):
    for key in aDict:
        while '' in aDict[key]:
            aDict[key].remove('')

In [45]:
spellings = {
    'AA': ['o'],
    'AE': ['a'],
    'AH': ['u'],
    'AO': ['o'],
    'AW': ['ou'],
    'AX': ['a'],
    'AXR': ['er'],
    'AY': ['i'],
    'EH': ['e'],
    'ER': ['ir'],
    'EY': ['ai'],
    'IH': ['i'],
    'IX': ['e'],
    'IY': ['ea'],
    'OW': ['oa'],
    'OY': ['oy'],
    'UH': ['oo'],
    'UW': ['oo'],
    'UX': ['u'],
    'B': ['b'],
    'CH': ['ch'],
    'D': ['d'],
    'DX': ['tt'],
    'EL': ['le'],
    'EM': ['m'],
    'EN': ['on'],
    'F': ['f'],
    'G': ['g'],
    'H': ['h'],
    'HH': ['h'],
    'JH': ['j'],
    'K': ['k'],
    'L': ['l'],
    'M': ['m'],
    'N': ['n'],
    'NX': ['ng'],
    'NG': ['ng'],
    'P': ['p'],
    'Q': ['-'],
    'R': ['r'],
    'S': ['s'],
    'SH': ['sh'],
    'T': ['t'],
    'TH': ['th'],
    'V': ['v'],
    'W': ['w'],
    'WH': ['wh'],
    'Y': ['y'],
    'Z': ['z'],
    'ZH': ['s']
}

In [46]:
from random import choice

spell3(choice(twoLetter), spellings)
clean(spellings)

mo
phones ['M', 'OW'] ['OW', 'M']
front & back ['m'] []
missing grapheme o
missing phoneme (clean) OW


In [47]:
spellings

{'AA': ['o'],
 'AE': ['a'],
 'AH': ['u'],
 'AO': ['o'],
 'AW': ['ou'],
 'AX': ['a'],
 'AXR': ['er'],
 'AY': ['i'],
 'EH': ['e'],
 'ER': ['ir'],
 'EY': ['ai'],
 'IH': ['i'],
 'IX': ['e'],
 'IY': ['ea'],
 'OW': ['oa', 'o'],
 'OY': ['oy'],
 'UH': ['oo'],
 'UW': ['oo'],
 'UX': ['u'],
 'B': ['b'],
 'CH': ['ch'],
 'D': ['d'],
 'DX': ['tt'],
 'EL': ['le'],
 'EM': ['m'],
 'EN': ['on'],
 'F': ['f'],
 'G': ['g'],
 'H': ['h'],
 'HH': ['h'],
 'JH': ['j'],
 'K': ['k'],
 'L': ['l'],
 'M': ['m'],
 'N': ['n'],
 'NX': ['ng'],
 'NG': ['ng'],
 'P': ['p'],
 'Q': ['-'],
 'R': ['r'],
 'S': ['s'],
 'SH': ['sh'],
 'T': ['t'],
 'TH': ['th'],
 'V': ['v'],
 'W': ['w'],
 'WH': ['wh'],
 'Y': ['y'],
 'Z': ['z'],
 'ZH': ['s']}

In [48]:
learn(spellings, twoLetter)

em
phones ['AH', 'M'] ['M', 'AH']
front & back [] ['m']
missing grapheme e
missing phoneme (clean) AH
em AH M


{'AA': ['o'],
 'AE': ['a'],
 'AH': ['u', 'e'],
 'AO': ['o'],
 'AW': ['ou'],
 'AX': ['a'],
 'AXR': ['er'],
 'AY': ['i'],
 'EH': ['e'],
 'ER': ['ir'],
 'EY': ['ai'],
 'IH': ['i'],
 'IX': ['e'],
 'IY': ['ea'],
 'OW': ['oa', 'o'],
 'OY': ['oy'],
 'UH': ['oo'],
 'UW': ['oo'],
 'UX': ['u'],
 'B': ['b'],
 'CH': ['ch'],
 'D': ['d'],
 'DX': ['tt'],
 'EL': ['le'],
 'EM': ['m'],
 'EN': ['on'],
 'F': ['f'],
 'G': ['g'],
 'H': ['h'],
 'HH': ['h'],
 'JH': ['j'],
 'K': ['k'],
 'L': ['l'],
 'M': ['m'],
 'N': ['n'],
 'NX': ['ng'],
 'NG': ['ng'],
 'P': ['p'],
 'Q': ['-'],
 'R': ['r'],
 'S': ['s'],
 'SH': ['sh'],
 'T': ['t'],
 'TH': ['th'],
 'V': ['v'],
 'W': ['w'],
 'WH': ['wh'],
 'Y': ['y'],
 'Z': ['z'],
 'ZH': ['s']}

In [49]:
for i in twoLetter:
    print(i, i.p)

em AH M
aa EY EY
ab AE B
ac EY S IY
ad AE D
ae EY
ag AE G
ag EY G IY
ah AA
ai AY
ai EY AY
al AE L
al AE L
al AE L AH B AE M AH
am AE M
am EY EH M
an AE N
an AH N
ap EY P IY
ar AA R
as AE Z
as EH Z
at AE T
au OW
aux OW
av EY V IY
aw AO
ay EY
ay AY
ba B IY EY
ba B AA
be B IY
be B IY
bi B AY
bo B OW
by B AY
ca K AH
ca S IY EY
ca K AA
ce S IY IY
co K OW
co K OW
co K AH P AH N IY
co S IY OW T UW
cy S AY
da D AA
da D IY EY
de D IY
de D EY
de D AH
di D IY
di D AY
do D UW
dr D AA K T ER
dr D R AY V
dr D AA K T ER
du D UW
du D AH
eb EH B
ed EH D
ee IY IY
eh EH
ek EH K
ek IY K EY
el EH L
em EH M
en EH N
er ER
es EH S
et EH T
ev EH V
fe F EY
fi F AY
fi F IY
fu F UW
ga G AA
ga JH IY EY
ga JH AO R JH AH
go G OW
gu G UW
ha HH AA
he HH IY
hi HH AY
hm HH AH M
ho HH OW
hu HH UW
hy HH AY
ia IY AH
ib IH B
ib AY B IY
id IH D
id AY D IY
if IH F
il IH L
im IH M
in IH N
in IH N
in IH N
in IH N CH
io AY OW
ip AY P IY
ip IH P
is IH Z
is IH Z
it IH T
it IH T
ja Y AA
je JH IY
ji JH IY
jo JH OW
jr JH UW N ER
ju J

10/29

So. It does mostly work.

But.

The presence of abbreviations in the dataset means that it contains non-phonetic transcriptions. For example, one pronunciation of "AL" is the full phonetic spelling of Alabama.

(also "MR" sounds like "mister" according to the dictionary)

Currently, the code just refuses to analyze words with more phonemes than letters. This assumes that the only letter in English that can represent two distinct phones in succession is 'x', and that my attempt to account for x works. That seems to hold, though I haven't tested it on any words with more than three letters yet.

I did fail to account for another edge case though: the """word""" 'ng'. According to this version of the CMUPD, 'ng' can be pronounced 'IH NG'. This totally bypasses my attempt to check edge cases: the word is two letters long, so it doesn't have more phones than letters.

LIGHTBULB MOMENT: it has no vowel letter, which means (for our purposes) it isn't a valid pronounceable English word! So I just need to find a reasonably efficient way of checking if a vowel is in a word and build that check into my code!

I also plan to add a method of tracking examples of grapheme usage, which should help with explaining the solutions to its puzzles.

_oh right i'm building a puzzle generator_

NOTE: this fix does not account for """words""" like 'aa' -- pronounced /EY EY/ --, so my code learned that 'aa' is a valid spelling of /EY/. This actually does appear in fringe cases, like Baal, so maybe this is fine. Will add further comment if any other bad examples happen.

Actually, """words""" like 'er' pronounced /ER/ represent an intended example of the above result. 

_at least I think I intended it_

Well I did find another problem word, but nothing to do with the previous comment. 'of' posed a problem. My code does not start off knowing that 'o' can be pronounced /AH/, or that 'f' can be /V/. So it can't differentiate between the two new spellings, and learns that /AH/ can be spelled 'of'. I'll try just adding that pronunciation of 'o' in at the beginning, but obviously this isn't ideal, nor is it consistent. Depending on what order the dictionary of words happens to be in, it'll learn spellings at different times, potentially causing this kind of mistake for different words each time. I am also unsure at this stage whether this is an unlikely occurrence, or if I'll need to build in more knowledge beforehand, which I wanted to minimize.

Also I realized I forgot to mention another issue from earlier. After running my code through every two-letter word and learning each new spelling, the lists of spellings for each phoneme were cluttered with empty strings. As a solution, I just added a function to remove those, but I need to figure out why that happens in the first place.

In [50]:
## from time import sleep
from copy import copy
from random import choice  # spell3 below calls choice()

def sideSpell(word, phones, spells, polarity):
    currentSpell = []
    for p in phones:
        oldLen = len(currentSpell)
        for s in spells[p]:
            newSpell = copy(currentSpell)
            if not polarity:
                newSpell.append(s.g)
                if word.s.startswith(''.join([a if not isinstance(a, graphExample) else a.g for a in newSpell])):
                    currentSpell.append(s)
                    break
            else:
                newSpell.insert(0, s.g)
                if word.s.endswith(''.join([a if not isinstance(a, graphExample) else a.g for a in newSpell])):
                    currentSpell.insert(0, s)
                    break                
        if len(currentSpell) == oldLen:
            break
    return currentSpell

def spell3(word, spellings, exLen=4, debug=False):
    word.s = word.s.lower()
    if not word.s.isalnum():
        if debug:
            print("bad characters", word.s)
        return False
    if debug:
        print(word)
    phones = word.p.split()
    if len(word.s) < len(phones):
        if debug:
            print("too long!", word.s, phones)
        return False
    if not set(word.s) & set('aeiouy'):
        if debug:
            print("no vowels -> not a word!", word.s, phones)
        return False
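    # Words ending in consonant + 'le' are pronounced ... AH L, so move AH after L
    # to line the phonemes up with the letter order 'l', 'e'.
    if len(word.s) > 2 and word.s[-2:] == 'le' and phones[-2:] == ['AH', 'L'] and word.s[-3] not in 'aeiou':
        phones.append(phones.pop(-2))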
        if debug:
            print("special case: L")
    rphones = copy(phones)
    rphones.reverse()
    if debug:
        print("phones", phones, rphones)
    frontSpell = sideSpell(word, phones, spellings, 0)
    frontStr = [a.g for a in frontSpell]
    backSpell = sideSpell(word, rphones, spellings, 1)
    backStr = [a.g for a in backSpell]
    if debug:
        print("front & back", frontSpell, backSpell)
    if '' in frontSpell:
        frontSpell.remove('')
    if '' in backSpell:
        backSpell.remove('')
    if frontSpell == backSpell and ''.join(frontStr) == word.s:
        if debug:
            print('yay!')
            print(word, frontSpell, len(frontSpell))
            print(backSpell, len(backSpell))
            print(phones)
        exPhone = choice(phones)
        g = frontSpell[phones.index(exPhone)]
        p = spellings[exPhone]
        exGraph = p[p.index(g)]
        if len(word.s) == exLen and not exGraph.set:
            exGraph.setExample(word)
        return True
    missingG = graphExample(word.s.removeprefix(''.join(frontStr)).removesuffix(''.join(backStr)), isX=word.x)
    #if not missingG.g:
    #    missingG = backSpell[0]
    if debug:
        print("missing grapheme", missingG)
    if len(frontSpell + backSpell) > len(phones):
            #print(word, frontSpell, len(frontSpell))
            #print(backSpell, len(backSpell))
            #print(phones)
            if debug:
                print("too long")
            del backSpell[0]
    if len(frontSpell + backSpell) < len(phones):
        try:
            missingP = phones[len(frontSpell)]
            if debug:
                print("missing phoneme (clean)", missingP)
            if len(word.s) == exLen and not missingG.set:
                missingG.setExample(word)        
            spellings[missingP].append(missingG)    
        except IndexError as e:
            if debug:
                print(word, frontSpell, len(frontSpell))
                print(backSpell, len(backSpell))
                print(phones)
            raise e
    else:
        missingP = phones[len(frontSpell) - 1]
        if debug:
            print("missing phoneme (overlap)", missingP)
        try:
            newG = graphExample(''.join((frontStr[-1], missingG.g) if frontStr else (missingG.g, backStr[0])), isX = missingG.g.count('x') or (frontStr[-1].count('x') if frontStr else backStr[0].count('x')))
            if len(word.s) == exLen and not newG.set:
                newG.setExample(word)  
            spellings[missingP].append(newG)
            if debug:
                print(newG)
        except IndexError as e:
            if debug:
                print(word, frontSpell, len(frontSpell))
                print(backSpell, len(backSpell))
                print(phones)
                print(missingP)
            raise e

def learn3(spellings, words, exLen=4, debug=False):
    newSpells = {s: copy(spellings[s]) for s in spellings}
    for word in words:
        spell3(word, newSpells, exLen=exLen, debug=debug)
        if debug:
            print(word, word.p)
        #break
    return newSpells

In [51]:
print(bool(set('geoff') & set('aeiou')))

True


In [52]:
baseSpellings = {
    'AA': ['o'],
    'AE': ['a'],
    'AH': ['u', 'o'],
    'AO': ['o'],
    'AW': ['ou'],
    'AX': ['a'],
    'AXR': ['er'],
    'AY': ['i'],
    'EH': ['e'],
    'ER': ['ir'],
    'EY': ['ai'],
    'IH': ['i'],
    'IX': ['e'],
    'IY': ['ea'],
    'OW': ['oa'],
    'OY': ['oy'],
    'UH': ['oo'],
    'UW': ['oo'],
    'UX': ['u'],
    'B': ['b'],
    'CH': ['ch'],
    'D': ['d'],
    'DX': ['tt'],
    'EL': ['le'],
    'EM': ['m'],
    'EN': ['on'],
    'F': ['f'],
    'G': ['g'],
    'H': ['h'],
    'HH': ['h'],
    'JH': ['j'],
    'K': ['k'],
    'L': ['l'],
    'M': ['m'],
    'N': ['n'],
    'NX': ['ng'],
    'NG': ['ng'],
    'P': ['p'],
    'Q': ['-'],
    'R': ['r'],
    'S': ['s'],
    'SH': ['sh'],
    'T': ['t'],
    'TH': ['th'],
    'V': ['v'],
    'W': ['w'],
    'WH': ['wh'],
    'Y': ['y'],
    'Z': ['z'],
    'ZH': ['s']
}

In [53]:
def initSpellings(plain):
    ans = {p: [graphExample(a, isX=bool(a.count('x'))) for a in plain[p]] for p in plain}
    return ans

In [59]:
coolSpellings = initSpellings(baseSpellings)

In [60]:
spellings

{'AA': ['o'],
 'AE': ['a'],
 'AH': ['u', 'e'],
 'AO': ['o'],
 'AW': ['ou'],
 'AX': ['a'],
 'AXR': ['er'],
 'AY': ['i'],
 'EH': ['e'],
 'ER': ['ir'],
 'EY': ['ai'],
 'IH': ['i'],
 'IX': ['e'],
 'IY': ['ea'],
 'OW': ['oa', 'o'],
 'OY': ['oy'],
 'UH': ['oo'],
 'UW': ['oo'],
 'UX': ['u'],
 'B': ['b'],
 'CH': ['ch'],
 'D': ['d'],
 'DX': ['tt'],
 'EL': ['le'],
 'EM': ['m'],
 'EN': ['on'],
 'F': ['f'],
 'G': ['g'],
 'H': ['h'],
 'HH': ['h'],
 'JH': ['j'],
 'K': ['k'],
 'L': ['l'],
 'M': ['m'],
 'N': ['n'],
 'NX': ['ng'],
 'NG': ['ng'],
 'P': ['p'],
 'Q': ['-'],
 'R': ['r'],
 'S': ['s'],
 'SH': ['sh'],
 'T': ['t'],
 'TH': ['th'],
 'V': ['v'],
 'W': ['w'],
 'WH': ['wh'],
 'Y': ['y'],
 'Z': ['z'],
 'ZH': ['s']}

In [61]:
coolSpellings

{'AA': [o],
 'AE': [a],
 'AH': [u, o],
 'AO': [o],
 'AW': [ou],
 'AX': [a],
 'AXR': [er],
 'AY': [i],
 'EH': [e],
 'ER': [ir],
 'EY': [ai],
 'IH': [i],
 'IX': [e],
 'IY': [ea],
 'OW': [oa],
 'OY': [oy],
 'UH': [oo],
 'UW': [oo],
 'UX': [u],
 'B': [b],
 'CH': [ch],
 'D': [d],
 'DX': [tt],
 'EL': [le],
 'EM': [m],
 'EN': [on],
 'F': [f],
 'G': [g],
 'H': [h],
 'HH': [h],
 'JH': [j],
 'K': [k],
 'L': [l],
 'M': [m],
 'N': [n],
 'NX': [ng],
 'NG': [ng],
 'P': [p],
 'Q': [-],
 'R': [r],
 'S': [s],
 'SH': [sh],
 'T': [t],
 'TH': [th],
 'V': [v],
 'W': [w],
 'WH': [wh],
 'Y': [y],
 'Z': [z],
 'ZH': [s]}

In [62]:
spell3(choice(twoLetter), coolSpellings)
clean(spellings)

In [63]:
coolSpellings

{'AA': [o],
 'AE': [a],
 'AH': [u, o],
 'AO': [o],
 'AW': [ou],
 'AX': [a],
 'AXR': [er],
 'AY': [i],
 'EH': [e],
 'ER': [ir],
 'EY': [ai],
 'IH': [i],
 'IX': [e],
 'IY': [ea],
 'OW': [oa],
 'OY': [oy],
 'UH': [oo],
 'UW': [oo],
 'UX': [u],
 'B': [b],
 'CH': [ch],
 'D': [d],
 'DX': [tt],
 'EL': [le],
 'EM': [m],
 'EN': [on],
 'F': [f],
 'G': [g],
 'H': [h],
 'HH': [h],
 'JH': [j],
 'K': [k, qu],
 'L': [l],
 'M': [m],
 'N': [n],
 'NX': [ng],
 'NG': [ng],
 'P': [p],
 'Q': [-],
 'R': [r],
 'S': [s],
 'SH': [sh],
 'T': [t],
 'TH': [th],
 'V': [v],
 'W': [w],
 'WH': [wh],
 'Y': [y],
 'Z': [z],
 'ZH': [s]}

In [64]:
spellings

{'AA': ['o'],
 'AE': ['a'],
 'AH': ['u', 'e'],
 'AO': ['o'],
 'AW': ['ou'],
 'AX': ['a'],
 'AXR': ['er'],
 'AY': ['i'],
 'EH': ['e'],
 'ER': ['ir'],
 'EY': ['ai'],
 'IH': ['i'],
 'IX': ['e'],
 'IY': ['ea'],
 'OW': ['oa', 'o'],
 'OY': ['oy'],
 'UH': ['oo'],
 'UW': ['oo'],
 'UX': ['u'],
 'B': ['b'],
 'CH': ['ch'],
 'D': ['d'],
 'DX': ['tt'],
 'EL': ['le'],
 'EM': ['m'],
 'EN': ['on'],
 'F': ['f'],
 'G': ['g'],
 'H': ['h'],
 'HH': ['h'],
 'JH': ['j'],
 'K': ['k'],
 'L': ['l'],
 'M': ['m'],
 'N': ['n'],
 'NX': ['ng'],
 'NG': ['ng'],
 'P': ['p'],
 'Q': ['-'],
 'R': ['r'],
 'S': ['s'],
 'SH': ['sh'],
 'T': ['t'],
 'TH': ['th'],
 'V': ['v'],
 'W': ['w'],
 'WH': ['wh'],
 'Y': ['y'],
 'Z': ['z'],
 'ZH': ['s']}

In [65]:
class graphExample:
    def __init__(self, grapheme, isX=False):
        self.g = grapheme
        self.set = False
        self.isx = isX
        self.x = 'x' if isX else None

    def __str__(self):
        # self.w only exists once setExample() has been called
        try:
            return f"'{self.g if not self.isx else self.x}' in '{self.w}'"
        except AttributeError:
            return self.g if not self.isx else self.x

    def __repr__(self):
        return self.__str__()

    def __eq__(self, other):
        if isinstance(other, str):
            return self.g == other
        elif isinstance(other, graphExample):
            return self.g == other.g
        return False

    def __hash__(self):
        # Hash on the same field __eq__ compares, so objects that are equal also hash equally
        return hash(self.g)

    def setExample(self, ex):
        self.w = ex
        self.set = True

10/30

Testing & Debugging

Removed a couple of leftover lines that didn't work with the new object system, and fixed the conditionals in the list comprehensions.

Currently the grapheme objects don't account for my handling of x, so I'm working on that now.

In [66]:
test1 = learn3(coolSpellings, twoLetter, exLen=2)

TypeError: sequence item 1: expected str instance, graphExample found

In [67]:
test1

NameError: name 'test1' is not defined

In [1]:
## set operations and equivalency testing

a = graphExample('e')
b = graphExample('e')
adict = {'EE': [a]}
bdict = {'EE': [b]}
aset = set(adict['EE'])
bset = set(bdict['EE'])
aset ^ bset

NameError: name 'graphExample' is not defined

I had to define the \_\_hash__() magic method on my graphExample objects to get set operations to work on them.

Then I realized I could just call the long version, symmetric_difference, and keep them as lists. Oh well.

And THEN it turned out that the method doesn't exist on lists. The Python docs say it accepts other iterables, but that only applies to the argument: the object it is called on still has to be a set, and the elements still have to be hashable.
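
A small check of the behaviour described above. `Grapheme` is a hypothetical stand-in written for this example (it is not the notebook's `graphExample`); the sketch only shows that the receiver has to be a set while the argument may stay a list.

```python
class Grapheme:
    """Hypothetical stand-in: equality and hash both use the grapheme text."""
    def __init__(self, g):
        self.g = g

    def __eq__(self, other):
        return isinstance(other, Grapheme) and self.g == other.g

    def __hash__(self):
        return hash(self.g)

    def __repr__(self):
        return self.g


left = [Grapheme('e'), Grapheme('ea')]
right = [Grapheme('e'), Grapheme('ee')]

# The receiver must be a set; the argument can be any iterable, e.g. a list.
diff = set(left).symmetric_difference(right)
print(sorted(d.g for d in diff))                 # ['ea', 'ee']

# Lists themselves have no such method.
print(hasattr(list, 'symmetric_difference'))     # False
```

Without `__hash__`, the `set(left)` call would already fail, because defining `__eq__` alone makes a class unhashable.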

In [None]:
def validate(spell1, spell2):
    differences = {}
    for phone in spell1:
        setDiffs = set(spell1[phone]) ^ set(spell2[phone])
        if setDiffs:
            differences[phone] = [spell1[phone], spell2[phone]]
    if differences:
        return differences

In [None]:
from random import sample

def get_mistakes(baseSpells, words, completeSpells, exLen=4):
    generation = 0
    while generation < 100:
        testSpells = learn3(baseSpells, sample(words, len(words)), exLen=exLen)
        check = validate(completeSpells, testSpells)
        if check:
            return generation, check
        generation += 1
    return 'nope'

I wanted to test whether this program could always find the correct spellings, or whether the result depended on the order of the words. So far, it depends greatly on the order. In the current order, I only had the one exception ('of').

I JUST REALIZED

that shuffling affects words outside its intended scope: shuffling a list in place (as random.shuffle does) also changes its order outside this function. That's not good.
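
A minimal illustration of the difference, using a made-up word list (the helper names `reorder_copy` and `reorder_in_place` are hypothetical): `random.sample` returns a new list, while `random.shuffle` rearranges the list it is given.

```python
from random import sample, shuffle

words = ['be', 'of', 'to', 'in']
original = list(words)           # snapshot for comparison

def reorder_copy(ws):
    # sample() builds a new list, so the caller's list keeps its order
    return sample(ws, len(ws))

def reorder_in_place(ws):
    # shuffle() rearranges the very list it was given and returns None
    shuffle(ws)
    return ws

copied = reorder_copy(words)
print(copied is words, words == original)   # False True

same = reorder_in_place(words)
print(same is words)                        # True: the caller's own list was reordered
```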

In [None]:
spell3(wordObjs2['BE'], coolSpellings, exLen=2, debug=True)
spell3(wordObjs2['UY'], coolSpellings, exLen=2, debug=True)
spell3(wordObjs2['VI'], coolSpellings, exLen=2, debug=True)

spell3(wordObjs2['EE'], coolSpellings, exLen=2, debug=True)

In [None]:
coolSpellings

In [None]:
coolSpellings = initSpellings(baseSpellings)

In [None]:
get_mistakes(coolSpellings, twoLetter, test1, exLen=2)

In [None]:
spell3(wordObjs2['OU'], coolSpellings, exLen=2, debug=True)

In [None]:
test2 = learn(test1, threeLetter, exLen=3, debug=True)

In [None]:
wordObjs2['QU'].p

After testing my model with three-letter words, I found three distinct issues:

1. When my program entirely fails to spell a word with multiple sounds, it assigns the entire word to the first sound as a single grapheme. This issue is not new, but it is still there.
2. The program can find a 'correct' spelling that leaves out one or more sounds. As a result, it learns to spell a sound with an empty string.
3. Once it learns that a letter can be silent, the mismatch between the number of graphemes and phonemes causes an IndexError.

Here is an example of issues one and two: <img src="issue1.png">

And here is issue three: <img src="issue2.png">

Brainstorming potential solutions:

For issue one:
<ul>
    <li>Completely ignore words that my program thinks are made of a single grapheme</li>
    <li>Do a first pass where those words are ignored, then allow monographic words. This assumes that a later word will contain the correct spellings to fix it</li>
</ul>
I like the second idea, but I'm not sure it would actually work. I suppose I could just check whether the word really contains only one phoneme, and only allow it in that case.

Issue two:
<ul>
    <li>Figure out a way to incorporate silent letters into my program's understanding</li>
    <li>When my program thinks it's spelled a word correctly, check if it accounts for every phoneme. If it doesn't, force it to move on</li>
</ul>

I feel like the first idea goes against what I'm trying to do here. There's no consistent way to determine which letters are silent and which aren't, and my program assumes exactly one grapheme (excepting x) for every phoneme. As for the second idea, I don't know what the program will do if it 'moves on' and then fails to spell the word. I need to work that out.

Issue three: solve issue two
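
The second idea for issue two (reject a 'correct' spelling that does not account for every phoneme) could be sketched as a standalone check like the one below. The function name and its plain-string arguments are assumptions made for this example; it is not wired into `spell3`, and it does not answer what should happen after a spelling is rejected.

```python
def accounts_for_every_phoneme(letters, phones, graphemes):
    """Return True only if `graphemes` is a full, gap-free spelling of `letters`.

    letters   -- the written word, e.g. 'ewe'
    phones    -- its phonemes as a list, e.g. ['Y', 'UW']
    graphemes -- one candidate grapheme string per phoneme, in order
    """
    if len(graphemes) != len(phones):
        return False                      # some phoneme has no grapheme
    if any(g == '' for g in graphemes):
        return False                      # a phoneme was "spelled" with nothing
    return ''.join(graphemes) == letters  # the pieces must rebuild the word


print(accounts_for_every_phoneme('ewe', ['Y', 'UW'], ['e', 'we']))   # True
print(accounts_for_every_phoneme('ewe', ['Y', 'UW'], ['ewe']))       # False
print(accounts_for_every_phoneme('ewe', ['Y', 'UW'], ['ewe', '']))   # False
```

Because blank graphemes are never accepted, a check like this would also keep the grapheme and phoneme counts equal, which is the mismatch behind issue three.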

Important note: I don't know which letters of the word 'ewe' make which sounds. My opinion is that the first 'e' is the /J/ sound, and the remaining 'we' represents the ending /EW/. But I don't see one correct answer. And that means I don't have an objective way to measure my program's success. That sounds like a problem, but maybe it's cool that I, in a way, made something with its own ability to interpret language and develop its own opinions.

In [68]:
## from time import sleep
from copy import copy
from random import choice   # spell4 uses choice() to pick an example phoneme

def sideSpell4(word, phones, spells, polarity):
    currentSpell = []
    for p in phones:
        oldLen = len(currentSpell)
        for s in spells[p]:
            newSpell = copy(currentSpell)
            if not polarity:
                newSpell.append(s.g)
                if word.s.startswith(''.join([a if not isinstance(a, graphExample) else a.g for a in newSpell])):
                    currentSpell.append(s)
                    break
            else:
                newSpell.insert(0, s.g)
                if word.s.endswith(''.join([a if not isinstance(a, graphExample) else a.g for a in newSpell])):
                    currentSpell.insert(0, s)
                    break                
        if len(currentSpell) == oldLen:
            break
    return currentSpell

def spell4(word, spellings, exLen=4, debug=False):
    word.s = word.s.lower()
    if not word.s.isalnum():
        if debug:
            print("bad characters", word.s)
        return False
    if debug:
        print(word)
    phones = word.p.split()
    if len(word.s) < len(phones):
        if debug:
            print("too long!", word.s, phones)
        return False
    if not set(word.s) & set('aeiouy'):
        if debug:
            print("no vowels -> not a word!", word.s, phones)
        return False
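    # Words ending in consonant + 'le' (e.g. 'little') are transcribed ... AH L but written 'l' then 'e':
    # move AH to the end so the phonemes line up with the letters.
    if len(word.s) > 2 and word.s[-2:] == 'le' and phones[-2:] == ['AH', 'L'] and word.s[-3] not in 'aeiou':
        phones.append(phones.pop(-2))
        if debug:
            print("special case: L")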
    rphones = copy(phones)
    rphones.reverse()
    if debug:
        print("phones", phones, rphones)
    frontSpell = sideSpell4(word, phones, spellings, 0)
    frontStr = [a.g for a in frontSpell]
    backSpell = sideSpell4(word, rphones, spellings, 1)
    backStr = [a.g for a in backSpell]
    if debug:
        print("front & back", frontSpell, backSpell)
    if '' in frontSpell:
        frontSpell.remove('')
    if '' in backSpell:
        backSpell.remove('')
    if len(phones) > (len(frontSpell) + len(backSpell) + 1):
        if debug:
            print("failure to spell")
        return False
    if frontSpell == backSpell and ''.join(frontStr) == word.s:
        if debug:
            print('yay!')
            print(word, frontSpell, len(frontSpell))
            print(backSpell, len(backSpell))
            print(phones)
        exPhone = choice(phones)
        g = frontSpell[phones.index(exPhone)]
        p = spellings[exPhone]
        exGraph = p[p.index(g)]
        if len(word.s) == exLen and not exGraph.set:
            exGraph.setExample(word)
        return True
    missingG = graphExample(word.s.removeprefix(''.join(frontStr)).removesuffix(''.join(backStr)), isX=word.x)
    #if not missingG.g:
    #    missingG = backSpell[0]
    if debug:
        print("missing grapheme", missingG)
    if len(frontSpell + backSpell) > len(phones):
        #print(word, frontSpell, len(frontSpell))
        #print(backSpell, len(backSpell))
        #print(phones)
        if debug:
            print("too long")
        del backSpell[0]
    if len(frontSpell + backSpell) < len(phones):
        try:
            missingP = phones[len(frontSpell)]
            if debug:
                print("missing phoneme (clean)", missingP)
            if len(word.s) == exLen and not missingG.set:
                missingG.setExample(word)        
            spellings[missingP].append(missingG)    
        except IndexError as e:
            if debug:
                print(word, frontSpell, len(frontSpell))
                print(backSpell, len(backSpell))
                print(phones)
            raise e
    else:
        missingP = phones[len(frontSpell) - 1]
        if debug:
            print("missing phoneme (overlap)", missingP)
        try:
            newG = graphExample(''.join((frontStr[-1], missingG.g) if frontStr else (missingG.g, backStr[0])), isX = missingG.g.count('x') or (frontStr[-1].count('x') if frontStr else backStr[0].count('x')))
            if len(word.s) == exLen and not newG.set:
                newG.setExample(word)  
            spellings[missingP].append(newG)
            if debug:
                print(newG)
        except IndexError as e:
            if debug:
                print(word, frontSpell, len(frontSpell))
                print(backSpell, len(backSpell))
                print(phones)
                print(missingP)
            raise e

def learn3(spellings, words, exLen=4, debug=False):
    newSpells = {s: copy(spellings[s]) for s in spellings}
    for word in words:
        spell4(word, newSpells, exLen=exLen, debug=debug)   # call the spell4 defined in this cell
        if debug:
            print(word, word.p)
        #break
    return newSpells

In [69]:
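def getGLen(word):
    """Return (maxG, minG): the longest and shortest grapheme lengths to try for this word."""
    pNum = len(word.p.split())   # number of phonemes
    lNum = len(word.s)           # number of letters
    minG = (lNum / pNum)         # average letters per phoneme
    if minG > 4:
        raise ValueError("too many letters")
    if minG < 1:
        raise ValueError("too many phonemes")
    minG = int(minG)
    # If every other phoneme takes one letter, a single grapheme can absorb the rest (capped at 4)
    maxG = min(lNum - pNum + 1, 4)
    return (maxG, minG)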

In [70]:
getGLen(wordObjs2['DR.'])

ValueError: too many phonemes

In [71]:
def atomize(word):
    maxGraph, minGraph = getGLen(word)
    lIndex = 0
    potential = []
    while lIndex < len(word.s):
        for i in range(minGraph, maxGraph + 1):
            atom = "*" * lIndex
            atom += word.s[lIndex:lIndex + i]
            atom += "*" * (len(word.s) - len(atom))
            if atom not in potential:
                potential.append(atom)
        lIndex += 1
    return potential

In [72]:
atomize(wordObjs2['BREATH'])

['b*****',
 'br****',
 'bre***',
 '*r****',
 '*re***',
 '*rea**',
 '**e***',
 '**ea**',
 '**eat*',
 '***a**',
 '***at*',
 '***ath',
 '****t*',
 '****th',
 '*****h']

In [83]:
import re
from math import ceil

def match_phones(word):
    graphs = atomize(word)
    phones = word.p.split()
    print(phones)
    matches = {}
    grange = range(len(graphs))
    for p in phones:
        matches[p] = []
    for gInd in grange:
        g = graphs[gInd]
        if g.lstrip('*') == g:
            matches[phones[0]].append(g.strip('*'))
        elif g.rstrip('*') == g:
            matches[phones[len(phones) - 1]].append(g.strip('*'))
        else:
            a = re.split(r'\w', g)  # raw string avoids the invalid-escape warning for '\w'
            endGaps = len(a[len(a) - 1])
            absMin = ceil(len(a[0]) / getGLen(word)[0])
            endPhones = len(phones[absMin + 1:])
            pmin = absMin + (endPhones - endGaps if endGaps < endPhones else 0)
            #pmax = max(len(a[0]) - ceil(endGaps / getGLen(word)[0]), 1)
            absMax = ceil(endGaps / getGLen(word)[0])
            pmax = min(len(phones) - 2, len(a[0]))
            #pmax = min(len(a[0]), len(phones))
            print(g, pmin, pmax)
            for p1 in phones[pmin: pmax + 1]:
                matches[p1].append(g.strip('*'))
    return matches

In [74]:
breath = wordObjs2['BREATH']

match_phones(breath)

['B', 'R', 'EH', 'TH']
*r**** 1 1
*re*** 1 1
*rea** 1 1
**e*** 1 2
**ea** 1 2
**eat* 2 2
***a** 1 2
***at* 2 2
****t* 2 2


{'B': ['b', 'br', 'bre'],
 'R': ['r', 're', 'rea', 'e', 'ea', 'a'],
 'EH': ['e', 'ea', 'eat', 'a', 'at', 't'],
 'TH': ['ath', 'th', 'h']}

In [75]:
breathy = wordObjs2['BREATHY']

match_phones(breathy)

['B', 'R', 'EH', 'TH', 'IY']
*r***** 1 1
*re**** 1 1
*rea*** 1 1
**e**** 1 2
**ea*** 1 2
**eat** 2 2
***a*** 1 3
***at** 2 3
***ath* 3 3
****t** 2 3
****th* 3 3
*****h* 3 3


{'B': ['b', 'br', 'bre'],
 'R': ['r', 're', 'rea', 'e', 'ea', 'a'],
 'EH': ['e', 'ea', 'eat', 'a', 'at', 't'],
 'TH': ['a', 'at', 'ath', 't', 'th', 'h'],
 'IY': ['thy', 'hy', 'y']}

In [76]:
breathing = wordObjs2['BREATHING']

match_phones(breathing)

['B', 'R', 'IY', 'DH', 'IH', 'NG']
*r******* 1 1
*re****** 1 1
*rea***** 1 1
*reat**** 1 1
**e****** 1 2
**ea***** 1 2
**eat**** 1 2
**eath*** 2 2
***a***** 1 3
***at**** 1 3
***ath*** 2 3
***athi** 3 3
****t**** 1 4
****th*** 2 4
****thi** 3 4
****thin* 4 4
*****h*** 2 4
*****hi** 3 4
*****hin* 4 4
******i** 3 4
******in* 4 4
*******n* 4 4


{'B': ['b', 'br', 'bre', 'brea'],
 'R': ['r', 're', 'rea', 'reat', 'e', 'ea', 'eat', 'a', 'at', 't'],
 'IY': ['e', 'ea', 'eat', 'eath', 'a', 'at', 'ath', 't', 'th', 'h'],
 'DH': ['a', 'at', 'ath', 'athi', 't', 'th', 'thi', 'h', 'hi', 'i'],
 'IH': ['t', 'th', 'thi', 'thin', 'h', 'hi', 'hin', 'i', 'in', 'n'],
 'NG': ['hing', 'ing', 'ng', 'g']}

In [77]:
l = [1,2,3,4]
print(l[1:7])
print("_"*1)
'***ertyui'.partition('tqweryuiopasdfghjklzxcvbnm')

[2, 3, 4]
_


('***ertyui', '', '')

11/7

<h1>What I want</h1>
a way to determine how each phoneme in a word is spelled
<h2>What is the scope?</h2>
most English words, excepting obscure scientific/technical terms, foreign words directly borrowed (like names), and abbreviations
<h2>What other goals did I want to incorporate?</h2>
Keep my own intervention to a minimum: for example, avoid hard-coding exceptions for irregular words
<h2>How have I attempted to realize this idea?</h2>
<ol>
    <li>Determine spellings via deduction, i.e. use a series of logical decisions and processes that can determine the spelling of every English word</li>
    <li>Start with some basic graphemes for each phoneme, then analyze each word, learning new graphemes along the way</li>
    <li>(WIP) Using the known number and order of phonemes, calculate each possible representation of the phonemes in every word and do something with them</li>
</ol>
<h2>Issues with each attempt:</h2>
<ol>
    <li>There are too many irregular English words to do this, as far as I can tell</li>
    <li>My code assumed that each word would have no more than one new grapheme, which did not hold. If there's more than one new grapheme, my code can't tell where one ends and the other begins. Also, the code found false positives, which made it learn blank graphemes, which in turn caused a catastrophic error</li>
    <li>It's unclear how this helps me. One idea was to keep the most common graphemes, but that both leaves out real graphemes that only occur once and includes common mistakes. I like how it accounts for uncertainty, but it needs more time to cure. Also, it currently doesn't handle the '-le' exception.
    <ul>
        <li>Actually, that's only part of the problem. When a sound occurs more than once within a word, the code doesn't keep track of each occurrence's representations separately. This may be a problem, or it may be a desired outcome. I'm not sure yet.</li>
    </ul>
    </li>
</ol>
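
To make the repeated-sound point concrete, here is a small sketch with hand-picked candidates (not real `match_phones` output). It assumes that keying the dictionary by position as well as phoneme is acceptable; whether the occurrences should be kept apart at all is still an open question.

```python
phones = ['L', 'IH', 'T', 'AH', 'L']

# Keyed by phoneme alone: the two 'L' sounds share a single entry.
by_phone = {p: [] for p in phones}
by_phone[phones[0]].append('l')      # candidate for the first L
by_phone[phones[-1]].append('le')    # candidate for the last L
print(len(by_phone), by_phone['L'])  # 4 ['l', 'le']

# Keyed by (position, phoneme): each occurrence keeps its own list.
by_position = {(i, p): [] for i, p in enumerate(phones)}
by_position[(0, 'L')].append('l')
by_position[(4, 'L')].append('le')
print(len(by_position), by_position[(0, 'L')], by_position[(4, 'L')])  # 5 ['l'] ['le']
```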

In [79]:
match_phones(wordObjs2['LITTLE'])

['L', 'IH', 'T', 'AH', 'L']
*i**** 1 1
*it*** 1 1
**t*** 1 2
**tt** 2 2
***t** 2 3
***tl* 3 3
****l* 3 3


{'L': ['l', 'li', 'le', 'e'],
 'IH': ['i', 'it', 't'],
 'T': ['t', 'tt', 't'],
 'AH': ['t', 'tl', 'l']}

In [80]:
match_phones(wordObjs2['KIBBLE'])

['K', 'IH', 'B', 'AH', 'L']
*i**** 1 1
*ib*** 1 1
**b*** 1 2
**bb** 2 2
***b** 2 3
***bl* 3 3
****l* 3 3


{'K': ['k', 'ki'],
 'IH': ['i', 'ib', 'b'],
 'B': ['b', 'bb', 'b'],
 'AH': ['b', 'bl', 'l'],
 'L': ['le', 'e']}

How should my code distinguish between correct and incorrect graphemes?
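
One option already mentioned in the notes is to keep the most common graphemes. A rough sketch of that tally is below; the `observations` data is made up for illustration, and the approach keeps the downsides noted earlier (rare but real graphemes lose, common mistakes win).

```python
from collections import Counter, defaultdict

# Hypothetical candidate lists in the shape match_phones returns,
# invented here for two words that share the same phonemes.
observations = [
    {'B': ['b', 'br'], 'EH': ['e', 'ea']},
    {'B': ['b', 'be'], 'EH': ['e', 'ed']},
]

tallies = defaultdict(Counter)
for matches in observations:
    for phone, candidates in matches.items():
        # set() so a grapheme counts once per word, however often it was matched
        tallies[phone].update(set(candidates))

for phone, counts in tallies.items():
    print(phone, counts.most_common(1))
# B [('b', 2)]
# EH [('e', 2)]
```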



In [84]:
import re
from math import ceil

def match_phones2(word):
    graphs = atomize(word)
    phones = word.p.split()
    print(phones)
    matches = {}
    grange = range(len(graphs))
    for p in phones:
        matches[p] = []
    for gInd in grange:
        g = graphs[gInd]
        if g.lstrip('*') == g:
            matches[phones[0]].append(g)
        elif g.rstrip('*') == g:
            matches[phones[len(phones) - 1]].append(g)
        else:
            a = re.split(r'\w', g)  # raw string avoids the invalid-escape warning for '\w'
            endGaps = len(a[len(a) - 1])
            absMin = ceil(len(a[0]) / getGLen(word)[0])
            endPhones = len(phones[absMin + 1:])
            pmin = absMin + (endPhones - endGaps if endGaps < endPhones else 0)
            #pmax = max(len(a[0]) - ceil(endGaps / getGLen(word)[0]), 1)
            absMax = ceil(endGaps / getGLen(word)[0])
            pmax = min(len(phones) - 2, len(a[0]))
            #pmax = min(len(a[0]), len(phones))
            print(g, pmin, pmax)
            for p1 in phones[pmin: pmax + 1]:
                matches[p1].append(g)
    return matches

In [86]:
breathPhones = match_phones2(breath)
breathPhones

['B', 'R', 'EH', 'TH']
*r**** 1 1
*re*** 1 1
*rea** 1 1
**e*** 1 2
**ea** 1 2
**eat* 2 2
***a** 1 2
***at* 2 2
****t* 2 2


{'B': ['b*****', 'br****', 'bre***'],
 'R': ['*r****', '*re***', '*rea**', '**e***', '**ea**', '***a**'],
 'EH': ['**e***', '**ea**', '**eat*', '***a**', '***at*', '****t*'],
 'TH': ['***ath', '****th', '*****h']}

In [129]:
def count_spaces(word, reverse=False):
    """Count the '*' placeholders at the start (or, with reverse=True, the end) of a masked grapheme."""
    # Stripping instead of popping also copes with a string made only of '*'
    stripped = word.rstrip('*') if reverse else word.lstrip('*')
    return len(word) - len(stripped)

def count_letters(word):
    """Count the characters that come before the first '*'."""
    return len(word.split('*', 1)[0])
    
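def stitch(matches):
    """Join masked graphemes (one per phoneme, in order) into the full word.

    Returns (word, mapping), where mapping[i] is the index of the phoneme
    that letter i was assigned to. Raises ValueError if a grapheme does not
    start exactly where the previous one ended.
    """
    stitched = matches[0]   # index instead of pop(0), so the caller's list is left intact
    mapping = [0] * (len(stitched) - count_spaces(stitched, reverse=True))
    for num, grapheme in enumerate(matches[1:], start=1):
        start = count_spaces(grapheme)
        if start != len(stitched.strip('*')):
            raise ValueError('bad match')
        stitched = stitched.strip('*') + grapheme.lstrip('*')
        mapping += [num] * count_letters(grapheme.lstrip('*'))
    return stitched, mapping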

def fancyPrint(stitchObj):
    print(stitchObj[0])
    print(''.join([str(i) for i in stitchObj[1]]))

def recursionPractice(listOfLists):
    return stitch(['something', recursionPractice('something else')])

def combinatory(length, bases):
    combs = []
    

def getIndices(matchDict):
    lengths = [len(matchDict[i]) for i in matchDict]
    indices = []
    pass

def stitchAll(matchDict, phoneInds, graphInds):
    phones = matchDict.keys()
    for graph in matchDict[phone]:
        pass
        

In [106]:
count_spaces(breathPhones['B'][0], reverse=False)

b*****
['*', '*', '*', '*', '*']


0

In [122]:
a = stitch([breathPhones['B'][0], breathPhones['R'][0], breathPhones['EH'][0], breathPhones['TH'][0]])
a

('breath', [0, 1, 2, 3, 3, 3])

In [123]:
print(a[0])
print(''.join([str(i) for i in a[1]]))

breath
012333
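
A possible next step, sketched under assumptions: try every combination of one candidate per phoneme and keep only the combinations that tile the word with no gaps or overlaps. The helper `tiles_word` is hypothetical, the candidates are copied from the `match_phones2(breath)` output above, and the phoneme-keyed dict only works here because 'breath' has no repeated phoneme.

```python
from itertools import product

def tiles_word(combo, length):
    """True if the masked graphemes in `combo` cover positions 0..length-1
    exactly once, left to right, with no gaps or overlaps."""
    pos = 0
    for masked in combo:
        start = len(masked) - len(masked.lstrip('*'))   # leading placeholders
        if start != pos:
            return False
        pos = start + len(masked.strip('*'))            # end of this grapheme
    return pos == length

# Masked candidates, copied from the match_phones2(breath) output.
candidates = {
    'B': ['b*****', 'br****', 'bre***'],
    'R': ['*r****', '*re***', '*rea**', '**e***', '**ea**', '***a**'],
    'EH': ['**e***', '**ea**', '**eat*', '***a**', '***at*', '****t*'],
    'TH': ['***ath', '****th', '*****h'],
}

segmentations = [
    [m.strip('*') for m in combo]
    for combo in product(*candidates.values())   # one candidate per phoneme, in order
    if tiles_word(combo, 6)
]
for seg in segmentations:
    print(seg)
print(len(segmentations))   # 10
```

This only enumerates the consistent segmentations (for example `['b', 'r', 'ea', 'th']`); choosing between them is still the open problem.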
