<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">

Linguistics 578<br>
Fall 2024<br>
Hammond

## Things to remember about any homework assignment:

1. For this assignment, you will edit this Jupyter notebook and turn it in. Do not turn in PDF files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Please do not add, remove, or copy autograded cells.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

***

***my name***: Kathleen Costa

***people I talked to about the assignment***: N/A

***

## Homework #4

Here are the imports. Please do not import anything else.

In [1]:
import graphviz,re
import pywrapfst as fst
import numpy as np

1. Using `pywrapfst`, create a symbol table for the (lowercase) English alphabet plus $\epsilon$ and `#`. (The second symbol added should be `#`.)

In [2]:
st = fst.SymbolTable()
# YOUR CODE HERE
st.add_symbol('<epsilon>')  # index 0
st.add_symbol('#')          # index 1: the word-boundary marker

for letter in 'abcdefghijklmnopqrstuvwxyz':
    st.add_symbol(letter)

In [3]:
assert st.num_symbols() == 28

In [4]:
assert st.find(0) == '<epsilon>'

In [5]:
assert st.find(1) == '#'

2. We'll now build a character *bigram model* based on the *Alice in Wonderland* text. The following function should strip the Gutenberg information at the beginning and end of the file, convert everything to lowercase, convert everything except lowercase letters to space, and return a list of lowercase words.
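The normalization steps described above can be tried on a short string before touching the full file. This is only a sketch: the helper name `normalize` and the sample string are made up for illustration, and the Gutenberg header and footer stripping is left out.

```python
import re

def normalize(text):
    """Lowercase the text, turn every run of non-letters into a space, and split."""
    text = text.lower()
    text = re.sub('[^a-z]+', ' ', text)
    return text.split()

# hypothetical sample, not taken from the actual file
sample = "Alice's Adventures -- CHAPTER I.\nDown the Rabbit-Hole"
tokens = normalize(sample)
print(tokens)
```

Note that apostrophes and hyphens become spaces, so `Alice's` yields the two tokens `alice` and `s`.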

In [6]:
def getwords(alice):
    '''tokenizes the Alice text
    args:
        alice: the location of the Gutenberg alice file
    returns:
        a list of lowercase words
    '''
    # YOUR CODE HERE
    with open(alice, 'r') as f:
        text = f.readlines()[54:3406]  # keep lines 55 through 3406 (1-indexed)

    text = ''.join(text)

    text = text.lower()
    text = re.sub('[^a-z]+', ' ', text)

    words = text.split()
    
    return words


In [7]:
words = getwords('alice.txt')
assert words[:12] == [
    'chapter', 'i', 'down', 'the', 'rabbit', 'hole',
    'alice', 'was', 'beginning', 'to', 'get', 'very'
]

In [8]:
assert words[-5:] == ['happy', 'summer', 'days', 'the', 'end']

In [9]:
assert len(words) == 27336

3. We now write a function that returns two dictionaries of counts, one for letter unigrams and one for letter bigrams. You'll need to pad each word with `#` on each side before counting.
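To see what the padding does to the counts, here is a small sketch on a two-word list. The function name `toy_counts` and the word list are hypothetical and chosen only so the result can be checked by hand.

```python
def toy_counts(words):
    """Count symbols and adjacent symbol pairs in '#'-padded words."""
    unigrams, bigrams = {}, {}
    for word in words:
        padded = '#' + word + '#'
        for ch in padded:
            unigrams[ch] = unigrams.get(ch, 0) + 1
        # zip pairs each symbol with its right-hand neighbor
        for first, second in zip(padded, padded[1:]):
            pair = first + second
            bigrams[pair] = bigrams.get(pair, 0) + 1
    return unigrams, bigrams

toy_ugs, toy_bgs = toy_counts(['ab', 'b'])
print(toy_ugs)
print(toy_bgs)
```

Each word contributes two `#` symbols, which is why the `#` count is always twice the number of words.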

In [10]:
def getcounts(wds):
    '''get letter unigram and bigram counts from a list of words
    args:
        wds: a list of words
    returns:
        unigrams: a dictionary from letters to counts
        bigrams: a dictionary from letter pairs to counts
    '''
    # YOUR CODE HERE
    unigrams = {}
    bigrams = {}

    for word in wds:
        # pad with the boundary marker on both sides
        padded = '#' + word + '#'

        for letter in padded:
            unigrams[letter] = unigrams.get(letter, 0) + 1

        # adjacent pairs, including the pairs that contain '#'
        for i in range(len(padded) - 1):
            bigram = padded[i:i + 2]
            bigrams[bigram] = bigrams.get(bigram, 0) + 1

    return unigrams, bigrams

In [11]:
ugs,bgs = getcounts(words)
assert len(ugs) == 27

In [12]:
assert len(bgs) == 427

In [13]:
assert bgs['ab'] == 214

In [14]:
assert ugs['#'] == 54672

In [15]:
assert ugs['q'] == 209

4. Let's now write a function that takes our unigram and bigram counts and creates a dictionary of log conditional probabilities.
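The quantity in question is $\log P(y \mid x) = \log \frac{c(xy)}{c(x)}$, where $c$ is a count. A minimal sketch with hand-made counts (those of the padded words `#ab#` and `#b#`) follows; the variable names are illustrative, and `math` is used here only to keep the example free of other dependencies.

```python
import math

# hypothetical counts for the padded words '#ab#' and '#b#'
toy_unigrams = {'#': 4, 'a': 1, 'b': 2}
toy_bigrams = {'#a': 1, 'ab': 1, 'b#': 2, '#b': 1}

# log P(second | first) = log(count(first second) / count(first))
toy_logprobs = {
    pair: math.log(count / toy_unigrams[pair[0]])
    for pair, count in toy_bigrams.items()
}

for pair, lp in toy_logprobs.items():
    print(pair, round(lp, 3), round(math.exp(lp), 3))
```

Exponentiating a value recovers the probability, so the values for all bigrams that share a first letter can be checked by summing their exponentials.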

In [16]:
def makecondprobs(unigrams,bigrams):
    '''calculate conditional probabilities
    args:
        unigrams: dictionary of unigram counts
        bigrams: dictionary of bigram counts
    returns:
        a dictionary of log conditional probabilities
    '''
    # YOUR CODE HERE
    condprobs = {}

    for bigram, bigram_count in bigrams.items():
        # P(second | first) = count(first second) / count(first)
        first_count = unigrams.get(bigram[0], 0)

        if first_count > 0:
            condprobs[bigram] = np.log(bigram_count / first_count)

    return condprobs

In [17]:
cp = makecondprobs(ugs,bgs)
assert len(cp) == 427

In [18]:
assert np.isclose(cp['ab'],-3.71,atol=.01)

In [19]:
total = 0
for bigram in cp:
    if bigram[0] == 'a':
        total += np.exp(cp[bigram])
assert np.isclose(total,1.0,atol=.0001)

5. We now create a WFSA for this model using the symbol table you created in question #1 and the dictionary we just created. The state geometry is critical here. There will be a start state, which you can think of as the `#` on the left. From there, the first arc takes you to various letter states. The probabilities of those arcs will be the conditional probabilities of the different letters based on the initial `#`. Now from each letter state you can go to any other letter state or the final `#`. In each case, the relevant probability is the conditional probability of the second letter/`#` based on the first letter. The second `#` state is the only final state. Note that there will be no arc corresponding to the first `#`.
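The state geometry described above can be sketched in plain Python, without `pywrapfst`, on a two-letter alphabet. Everything here is an assumption made for illustration: the toy alphabet, the probabilities, and the names `state`, `arcs`, and `path_logprob`. Arcs are stored in a dictionary rather than in an FST.

```python
import math

letters = 'ab'  # toy alphabet; the assignment uses all 26 letters
# state 0 is the initial '#', one state per letter follows, the last is the final '#'
state = {letter: i + 1 for i, letter in enumerate(letters)}
final = len(letters) + 1

# hypothetical log conditional probabilities keyed by two-character strings
logprobs = {
    '#a': math.log(0.5), '#b': math.log(0.5),
    'ab': math.log(0.75), 'a#': math.log(0.25),
    'ba': math.log(0.5), 'b#': math.log(0.5),
}

arcs = {}  # (source state, label) -> (destination state, weight)
for pair, lp in logprobs.items():
    first, second = pair
    src = 0 if first == '#' else state[first]
    dst = final if second == '#' else state[second]
    # the arc is labeled with the second symbol; the first '#' gets no arc
    arcs[(src, second)] = (dst, lp)

def path_logprob(word):
    """Sum the arc weights along word + '#', starting in state 0."""
    current, total = 0, 0.0
    for symbol in word + '#':
        current, weight = arcs[(current, symbol)]
        total += weight
    return total

p_ab = math.exp(path_logprob('ab'))
print(p_ab)
```

With 26 letters the same layout gives 28 states: one start state, 26 letter states, and one final state.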

In [20]:
def makeWFSA(tab,dic):
    '''create a WFSA that encodes a bigram model
    args:
        tab: a symbol table including all letters
            and boundary markers
        dic: a dictionary of the conditional probabilities
    returns:
        a WFSA
    '''
    # YOUR CODE HERE
    c = fst.Compiler(
        isymbols=tab,
        osymbols=tab,
        keep_isymbols=True,
        keep_osymbols=True,
        arc_type='log'
    )

    letters = 'abcdefghijklmnopqrstuvwxyz'
    # state 0 is the initial '#', states 1-26 are the letters,
    # and state 27 is the final '#'
    state = {letter: i + 1 for i, letter in enumerate(letters)}
    state['#'] = len(letters) + 1

    # dic maps two-character strings such as 'ab' to log conditional
    # probabilities; sorting puts the '#x' bigrams first, so the first
    # line written starts at state 0, which makes it the start state
    for bigram in sorted(dic):
        first, second = bigram
        src = 0 if first == '#' else state[first]
        # the arc is labeled with the second symbol; its weight is the
        # log probability itself, which the last check exponentiates
        c.write(f'{src} {state[second]} {second} {second} {float(dic[bigram])}')

    # the final '#' is the only final state
    c.write(f'{state["#"]}')

    wfsa = c.compile()
    return wfsa

In [21]:
wfsa = makeWFSA(st,cp)
assert type(wfsa) == fst.MutableFst

In [22]:
assert wfsa.num_states() == 28

In [23]:
c2 = fst.Compiler(
    isymbols=st,
    osymbols=st,
    keep_isymbols=True,
    keep_osymbols=True,
    arc_type='log'
)
c2.write('0 1 h h')
c2.write('1 2 a a')
c2.write('2 3 t t')
c2.write('3 4 # #')
c2.write('4')
f2 = c2.compile()
f3 = fst.compose(f2,wfsa)
assert f3.num_states() == 5

In [None]:
f3s = str(f3).split('\n')[:-2]
val = np.exp(sum([float(s.split('\t')[-1]) for s in f3s]))
assert np.isclose(val,0.00019,atol=.001)