# Overview of baseline code for morphological inflection
LING 409 Fall 2024

In [2]:
import nonneural  # import nonneural module to have access to functions defined in it
import random # for doing some testing
import os

In this notebook, I'll start to demonstrate how I might work through the `nonneural.py` code to understand what it is doing. 

We can display the text of the `nonneural.py` Python script in this Jupyter notebook using the [magic command `%pycat`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-pycat).

In [3]:
%pycat nonneural.py

#!/usr/bin/env python3
"""
Non-neural baseline system for the SIGMORPHON 2020 Shared Task 0.
Author: Mans Hulden
Modified by: Tiago Pimentel
Modified by: Jordan Kodner
Modified by: Omer Goldman
Last Update: 22/03/2021
"""

import sys, os, getopt, re
from functools import wraps
from glob import glob

"""
Finds the Hamming distance of two strings s and t (number of places where the two characters differ)
unts default starting from word initial and disregards any trailing characters that don't have corresponding place in the other word (so max possible value is t

Looking over the code structure, we can see that it first defines a set of utility/helper functions at the top and then a `main()` function at the bottom (starting at line 162).

To understand the code structure, let's start by looking at the `main()` function.

## The `main()` function

### Running the module as a script vs. importing it

At the very end of the code (line 260), there is a short conditional statement that says:

```python
if __name__ == "__main__":
    main(sys.argv)
```

This conditional statement makes the `nonneural.py` code ["usable as a script as well as an importable module"](https://docs.python.org/3/tutorial/modules.html#executing-modules-as-scripts).

If we type

```bash
# I am in the shell!
python nonneural.py
```

on the command line (e.g., in a bash shell), the special variable `__name__` is set to `"__main__"` and the code inside the `if` statement is run.
But if we import the code as a module in a python interpreter using

```python
# I am in a python interpreter
import nonneural
```

then the `main()` function is not run, but we get to access all the functions that were defined in the script.
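
To see the guard in isolation, here is a minimal sketch with a toy stand-in for `main()`. The function body and the argument list are made up for illustration and are not taken from `nonneural.py`. When a file like this is run as a script, the guarded line executes; when it is imported, only the function definition does.

```python
def main(argv):
    """Toy stand-in for nonneural.main(): report the arguments it was given."""
    return "main() called with: " + " ".join(argv[1:])

# __name__ is "__main__" when this file is run directly,
# and the module's own name when it is imported.
print("__name__ is", __name__)

if __name__ == "__main__":
    # Hypothetical argument list, standing in for sys.argv
    print(main(["toy_script.py", "-o"]))
```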



## Arguments for script, help, variable initialization: lines 162-186

The first section of the `main()` function is important if we are running `nonneural.py` as a script.
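It sets the options we can pass as arguments when we run the script. (Note: the `-t` option doesn't actually work! The `getopt.gnu_getopt` call below never declares `t` or `test`, so `getopt` rejects `-t` as an unrecognized option before the `if opt in ('-t', '--test')` branch can be reached.)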
It also sets defaults for the variables `TEST`, `OUTPUT`, `HELP` and `path`.

```python
def main(argv):
    options, remainder = getopt.gnu_getopt(argv[1:], 'ohp:', ['output','help','path='])
    TEST, OUTPUT, HELP, path = False,False, False, '../data/'
    for opt, arg in options:
        if opt in ('-o', '--output'):
            OUTPUT = True
        if opt in ('-t', '--test'):
            TEST = True
        if opt in ('-h', '--help'):
            HELP = True
        if opt in ('-p', '--path'):
            path = arg

```

So, for instance, we can generate output files if we run the script with the `-o` argument:

```bash
# I am in the shell!
python nonneural.py -o
```
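
To check how the option string behaves without running the whole script, we can call `getopt.gnu_getopt` directly with the same option specification as in `main()`. This is a small sketch; the helper name `parse_args` and the sample argument lists are my own.

```python
import getopt

def parse_args(args):
    """Parse args with the same option spec that main() uses."""
    options, remainder = getopt.gnu_getopt(args, 'ohp:', ['output', 'help', 'path='])
    return options, remainder

# -o takes no value; -p requires one (that is what the ':' after 'p' means)
print(parse_args(['-o', '-p', 'data/']))

# -t is not declared in 'ohp:', so getopt raises an error
try:
    parse_args(['-t'])
except getopt.GetoptError as err:
    print("GetoptError:", err)
```

Each parsed option comes back as an `(option, value)` pair, which is what the `for opt, arg in options` loop in `main()` iterates over.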


The next part (lines 175-184) sets the message printed if we ask for help in the shell.

```python
    if HELP:
            print("\n*** Baseline for the SIGMORPHON 2020 shared task ***\n")
            print("By default, the program runs all languages only evaluating accuracy.")
            print("To create output files, use -o")
            print("The training and dev-data are assumed to live in ./part1/development_languages/")
            print("Options:")
            print(" -o         create output files with guesses (and don't just evaluate)")
            print(" -t         evaluate on test instead of dev")
            print(" -p [path]  data files path. Default is ../data/")
            quit()

```

We can ask for help in the shell:

```bash
# I am in the shell!
python nonneural.py -h
python nonneural.py --help
```

If we are in a Python interpreter and ask for help, we get back the docstrings in the script.

```python
# I am in a python interpreter
import nonneural
help(nonneural)
```
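
`help()` builds its output from docstrings, the string literals placed at the top of a module or directly under a `def` line. A small sketch with a made-up function (`count_vowels` is not part of `nonneural.py`) shows where that text comes from:

```python
import inspect

def count_vowels(word):
    """Return the number of vowels (a, e, i, o, u) in word."""
    return sum(ch in "aeiou" for ch in word.lower())

# The docstring is stored on the function object itself
print(count_vowels.__doc__)

# inspect.getdoc() returns the same text with indentation cleaned up
print(inspect.getdoc(count_vowels))
```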

In [4]:
help(nonneural)

Help on module nonneural:

NAME
    nonneural

DESCRIPTION
    Non-neural baseline system for the SIGMORPHON 2020 Shared Task 0.
    Author: Mans Hulden
    Modified by: Tiago Pimentel
    Modified by: Jordan Kodner
    Modified by: Omer Goldman
    Last Update: 22/03/2021

FUNCTIONS
    add_accent_to_third_syllable(word, msd)
        Add accent to the third syllable from the end if:
        1. The word is imperative
        2. The word doesn't already have an accent mark
        3. The word is long enough
        4. The word is not negative
        5. The word is not INFM,2,PL

    alignprs(lemma, form)
        Break lemma/form into three parts:
        based on leading and trailing '_', doesn't take into account different characters
        ex. ('', 'demonstrate', '__', '', 'demonstrati', 'on')
        IN:  1 | 2 | 3
        OUT: 4 | 5 | 6
        1/4 are assumed to be prefixes, 2/5 the stem, and 3/6 a suffix.
        1/4 and 3/6 may be empty.

    apply_best_rule(lemma, msd, allprul

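Finally, line 186 initializes the variables `totalavg` and `numlang` to 0.0 and 0. (They are local to `main()` rather than true module-level globals, but they persist across all iterations of the loop that follows.)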

```python
    totalavg, numlang = 0.0, 0
```

## The giant for loop (lines 187-257)

Line 187 starts a giant `for` loop that spans the rest of the `main()` function.

```python
    for lang in [os.path.splitext(d)[0] for d in os.listdir(path) if '.trn' in d]:
```
The loop iterates over each language `lang` that has a training data file in the `path` directory (a file whose name contains `.trn`). `os.path.splitext(d)[0]` strips the extension, leaving just the language code.

We can take a look at what the list comprehension in line 187 does:

In [11]:
path = os.path.join(os.getcwd(),"data")
[os.path.splitext(d)[0] for d in os.listdir(path) if '.trn' in d]

['spa']
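
To see what each piece of the comprehension contributes, here is a sketch on a made-up list of file names (only `spa.trn` is suggested by the output above; the other names are invented). Note that `'.trn' in d` is a substring test, so it also matches names that merely contain `.trn`; `d.endswith('.trn')` would be stricter.

```python
import os

# Hypothetical directory listing, standing in for os.listdir(path)
filenames = ['spa.trn', 'spa.dev', 'spa.tst', 'notes.trn.bak']

# os.path.splitext splits off the last extension only
print(os.path.splitext('spa.trn'))

# Same comprehension as in line 187, applied to the made-up list
langs = [os.path.splitext(d)[0] for d in filenames if '.trn' in d]
print(langs)

# Stricter variant: keep only names that end in .trn
strict_langs = [os.path.splitext(d)[0] for d in filenames if d.endswith('.trn')]
print(strict_langs)
```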

If we skim over the rest of the `for` loop, the comments give hints about what the code is supposed to do:

1. "test if language is predominantly suffixing or prefixing" (line 193)
2. "Read in lines and extract transformation rules from pairs" (line 202, kind of hidden)
3. "run eval on dev" (line 226)

Let's first look at the beginning of the `for` loop, lines 187-191, which comes before the test for suffixing/prefixing:

```python
    for lang in [os.path.splitext(d)[0] for d in os.listdir(path) if '.trn' in d]:
        allprules, allsrules = {}, {}
        if not os.path.isfile(path + lang +  ".trn"):
            continue
        lines = [line.strip() for line in open(path + lang + ".trn", "r", encoding='utf8') if line != '\n']

```

What's going on in this part of the code?
Can we take a look at what `lines` is?

In [None]:
for lang in [os.path.splitext(d)[0] for d in os.listdir(path) if '.trn' in d]:
    lines = [line.strip() for line in open(os.path.join(path, lang + ".trn"), "r", encoding='utf8') if line != '\n']
    print(lines[0:5])


### Testing for prefixing/suffixing (lines 193-201)

Now let's look at the first chunk of code that tests if a language is predominantly suffixing or prefixing: lines 193-201.

```python
        # First, test if language is predominantly suffixing or prefixing
        # If prefixing, work with reversed strings
        prefbias, suffbias = 0,0
        for l in lines:
            lemma, _, form = l.split(u'\t')
            aligned = halign(lemma, form)
            if ' ' not in aligned[0] and ' ' not in aligned[1] and '-' not in aligned[0] and '-' not in aligned[1]:
                prefbias += numleadingsyms(aligned[0],'_') + numleadingsyms(aligned[1],'_')
                suffbias += numtrailingsyms(aligned[0],'_') + numtrailingsyms(aligned[1],'_')
```

Line 195 initializes two variables to the integer 0. What do you think these variables are for?

```python
        prefbias, suffbias = 0,0
```

Then we have a for loop that iterates over each line in `lines`.

To better understand what's going on in the for loop, we can see what happens when we run through a single iteration of the loop.


In [None]:
# let's try out the for loop on a random, single entry of lines
# randrange(n) picks from 0..n-1; randint(0, n) could return n, which is out of range
rand_line_ind = random.randrange(len(lines))
l0 = lines[rand_line_ind]
lemma, _, form = l0.split(u'\t')

print(lemma)
print(_) # why use underscore as a variable name?
print(form)
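
Each line of the training file is split on tab characters into three fields. The sketch below uses an invented line (not taken from the data), with the fields in the order the quoted code expects: lemma, morphosyntactic description (`msd`), inflected form. The `_demo` names are just there to avoid overwriting the variables above.

```python
# Made-up example line; real lines come from the .trn file
sample_line = "hablar\tV;IND;PRS;1;SG\thablo"

fields = sample_line.split('\t')
print(fields)

# Tuple unpacking assigns one field per name; this raises ValueError
# if the line does not have exactly three tab-separated fields
lemma_demo, msd_demo, form_demo = fields
print(lemma_demo, msd_demo, form_demo)

# By convention, '_' marks a value we unpack but do not intend to use
lemma_demo, _, form_demo = fields
```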


In [None]:
# CODE BLOCK FOR CALLING halign()
# We'll come back to the details of halign() later
aligned = nonneural.halign(lemma, form)

print(aligned)

In [None]:
# CODE BLOCK FOR LINES 199-201
prefbias, suffbias = 0,0

if ' ' not in aligned[0] and ' ' not in aligned[1] and '-' not in aligned[0] and '-' not in aligned[1]:
    prefbias += nonneural.numleadingsyms(aligned[0],'_') + nonneural.numleadingsyms(aligned[1],'_')
    suffbias += nonneural.numtrailingsyms(aligned[0],'_') + nonneural.numtrailingsyms(aligned[1],'_')

print(nonneural.numleadingsyms(aligned[0],'_'))
print(nonneural.numleadingsyms(aligned[1],'_'))
print(nonneural.numtrailingsyms(aligned[0],'_'))
print(nonneural.numtrailingsyms(aligned[1],'_'))

print("prefbias = ", prefbias)
print("suffbias = ", suffbias)
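
As a rough picture of what these counts measure, the sketch below counts leading and trailing `_` padding on a hand-written aligned pair. The pair is modeled on the `alignprs` docstring example shown by `help(nonneural)`, and `count_leading`/`count_trailing` are hypothetical stand-ins, not the actual `numleadingsyms`/`numtrailingsyms` implementations.

```python
def count_leading(s, sym):
    """Number of consecutive sym characters at the start of s."""
    return len(s) - len(s.lstrip(sym))

def count_trailing(s, sym):
    """Number of consecutive sym characters at the end of s."""
    return len(s) - len(s.rstrip(sym))

# Hand-written aligned pair (assumed shape of halign() output)
aligned_demo = ('demonstrate__', 'demonstration')

pref_demo = count_leading(aligned_demo[0], '_') + count_leading(aligned_demo[1], '_')
suff_demo = count_trailing(aligned_demo[0], '_') + count_trailing(aligned_demo[1], '_')

# Padding at the end of the lemma points to suffixing
print("prefix padding:", pref_demo)
print("suffix padding:", suff_demo)
```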

### Extracting transformation rules (lines 202-224)

Now let's take a look at the "extracting transformation rules" part.

The first chunk of code reverses `lemma` and `form` when `prefbias` is greater than `suffbias` (recall the earlier comment: "If prefixing, work with reversed strings").

```python
            lemma, msd, form = l.split(u'\t')
            if prefbias > suffbias:
                lemma = lemma[::-1]
                form = form[::-1]
```

We can try it out on our test line.

In [None]:
# CODE BLOCK FOR TESTING LINES 202-206

print("Reminder: prefbias =", prefbias)
print("Reminder: suffbias =", suffbias)

lemma, msd, form = l0.split(u'\t')
print("lemma:", lemma, "\nmsd:", msd, "\nform:", form)

if prefbias > suffbias:
    lemma = lemma[::-1]
    print(lemma)
    form = form[::-1]
    print(form)
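
The slice `[::-1]` returns a reversed copy of a string. Reversing both strings turns a prefix into a suffix, presumably so that the same suffix-oriented code can be reused for prefixing languages. A sketch with an English pair chosen only for illustration:

```python
lemma_demo = "happy"
form_demo = "unhappy"   # differs from the lemma by the prefix "un"

# [::-1] steps through the string backwards
rev_lemma = lemma_demo[::-1]
rev_form = form_demo[::-1]
print(rev_lemma, rev_form)

# After reversal the extra material sits at the end of the string
print(rev_form.startswith(rev_lemma))
print(rev_form[len(rev_lemma):])
```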



Line 207 then calls `prefix_suffix_rules_get()`:

```python
            prules, srules = prefix_suffix_rules_get(lemma, form)
```

We can try it out on our test line.

In [None]:
prules, srules = nonneural.prefix_suffix_rules_get(lemma, form)
print("prules:", prules)
print("\n")
print("srules:", srules)

The next code chunk (lines 209-212) creates an empty dictionary entry for `msd` in `allprules` and `allsrules` (initialized in line 188) if there is not one already and at least one rule was extracted:

```python
            if msd not in allprules and len(prules) > 0:
                allprules[msd] = {}
            if msd not in allsrules and len(srules) > 0:
                allsrules[msd] = {}

```

Let's try it out.

In [None]:
allprules, allsrules = {}, {}  # from line 188

print("Reminder: msd=", msd)

if msd not in allprules and len(prules) > 0:
    allprules[msd] = {}
if msd not in allsrules and len(srules) > 0:
    allsrules[msd] = {}

print("allprules:", allprules)
print("allsrules:", allsrules)

Then we have two code chunks that do the same thing for `prules` and `srules`: for each rule, they increment a count stored under the current `msd`.

```python

            for r in prules:
                if (r[0],r[1]) in allprules[msd]:
                    allprules[msd][(r[0],r[1])] = allprules[msd][(r[0],r[1])] + 1
                else:
                    allprules[msd][(r[0],r[1])] = 1

            for r in srules:
                if (r[0],r[1]) in allsrules[msd]:
                    allsrules[msd][(r[0],r[1])] = allsrules[msd][(r[0],r[1])] + 1
                else:
                    allsrules[msd][(r[0],r[1])] = 1


```

Let's try out the prules chunk:

In [None]:
# CODE BLOCK FOR prules SECTION

print("Reminder: msd =", msd)
print("Reminder: prules =", prules, "\n")

print("Reminder: allprules =", allprules)
print("Reminder: allprules[msd] =", allprules[msd], "\n")


for r in prules:
    if (r[0],r[1]) in allprules[msd]:
        allprules[msd][(r[0],r[1])] = allprules[msd][(r[0],r[1])] + 1
    else:
        allprules[msd][(r[0],r[1])] = 1

print("Now allprules[msd] is:", allprules[msd])


In [None]:
# CODE BLOCK FOR srules SECTION
print("Reminder: msd =", msd)
print("Reminder: srules =", srules, "\n")

print("Reminder: allsrules =", allsrules)
print("Reminder: allsrules[msd] =", allsrules[msd], "\n")


for r in srules:
    if (r[0],r[1]) in allsrules[msd]:
        allsrules[msd][(r[0],r[1])] = allsrules[msd][(r[0],r[1])] + 1
    else:
        allsrules[msd][(r[0],r[1])] = 1

print("Now allsrules[msd] is:", allsrules[msd])
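
The two loops build a nested dictionary: the outer key is the `msd`, the inner key is a rule (a pair of strings), and the value is how many times that rule was seen. The sketch below reproduces the pattern on invented tags and rules and compares it with `collections.Counter`, which is one alternative way to write the same counting (it is not what `nonneural.py` does).

```python
from collections import Counter, defaultdict

# Invented (msd, rule) observations for illustration
observations = [
    ("V;PRS", ("ar", "o")),
    ("V;PRS", ("ar", "o")),
    ("V;PRS", ("r", "o")),
    ("V;PST", ("ar", "é")),
]

# Pattern used in the script: explicit membership checks
counts = {}
for msd_demo, rule in observations:
    if msd_demo not in counts:
        counts[msd_demo] = {}
    if rule in counts[msd_demo]:
        counts[msd_demo][rule] = counts[msd_demo][rule] + 1
    else:
        counts[msd_demo][rule] = 1

# Alternative: a defaultdict of Counters handles missing keys itself
counts_alt = defaultdict(Counter)
for msd_demo, rule in observations:
    counts_alt[msd_demo][rule] += 1

# Convert to plain dicts so the two results can be compared directly
counts_alt_plain = {m: dict(c) for m, c in counts_alt.items()}
print(counts)
print(counts == counts_alt_plain)
```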




## The `halign()` function

What does the `halign()` function do?
We can print out the function definition:

In [None]:
# From https://stackoverflow.com/questions/1562759/can-python-print-a-function-definition
import inspect
import sys
sys.stdout.write(inspect.getsource(nonneural.halign))