In this notebook I am going to try to use the [FormosanBank](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/corpora) resources to make a basic autocorrect prediction system for Amis.
Even assuming the corpora have enough word distribution to be useful for a frequency-based prediction system, I already see the following problems:
* I don't know any Amis, and will have difficulty checking correctness
* No existing test sets
* I'm not sure how similar the morphology of Amis and Truku is (probably not very), and I have no idea how an autocorrect should work for a fusional (?) language. If 'blah' is "go", 'klah' is "went", and 'glah' is "will go", should there be some bias toward forms of this stem if the input is similar enough in certain ways? At that point it starts to bleed into a stemmer/tagger.

**Note**: I have already used some of the validation/sterilization tools in FormosanBank. According to the FormosanBank README it should have been a few simple checks and changes like standardizing all punctuation to single-spaced equivalents, but I am unclear what exactly was changed.

In [8]:
corpora_dir = "./FormosanBank/Corpora"
find_lang = "ami" # Amis
find_glotto = "cent2104"
find_dialect = "Coastal"

First, let's make sure we're finding the expected files.

In [9]:
import os
all_xmls = []
for root, dirname, filenames in os.walk(corpora_dir):
    for f in filenames:
        if f.endswith(".xml"):  # match the extension, not just a name ending in "xml"
            all_xmls.append(os.path.join(root,f))
print(len(all_xmls))
print(all_xmls[:2])

17108
['./FormosanBank/Corpora/Wikipedias/XML/Seediq/Hnigan.xml', './FormosanBank/Corpora/Wikipedias/XML/Seediq/Jingay_siang.xml']


In [10]:
lang_xmls = []
import xml.etree.ElementTree as ET
for filepath in all_xmls:
    tree = ET.parse(filepath)
    root = tree.getroot()
    # taken from formosanbank validate_xml.py
    lang = root.get("{http://www.w3.org/XML/1998/namespace}lang")
    if lang:
        lang = lang.lower()
    else:
        # print(f"{filepath} doesn't appear to have a [lang] attrib: {root.attrib}")
        continue
    glottocode = root.get("glottocode")
    dialect = root.get("dialect")
    if lang.lower() == find_lang.lower():
        if not glottocode and not dialect: # they're both None
            print(f"glotto: {glottocode} | dialect: {dialect} | file: {' '.join(filepath.split('/')[-5:])}")
            # we assume the language is correct
            lang_xmls.append(filepath)
        else:
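            # match on either attribute, case-insensitively; append each file at most once
            glotto_match = bool(glottocode) and glottocode.lower() == find_glotto.lower()
            # compare the dialect attribute itself (not the glottocode)
            dialect_match = bool(dialect) and dialect.lower() == find_dialect.lower()
            if glotto_match or dialect_match:
                lang_xmls.append(filepath)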

glotto: None | dialect: None | file: Corpora Presidential_Apologies XML Amis Amis.xml
glotto: None | dialect: None | file: Corpora ILRDF_Dicts XML Amis Amis.xml
glotto: None | dialect: None | file: FormosanBank Corpora Virginia_Fey_Dictionary XML Amis.xml


In [11]:
print(len(lang_xmls))

19


Now it looks like we have the right files for the language and dialect we want. There is some gray area here in terms of language and dialect definition, but this should be good enough.
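
To see how much gray area there actually is, one option is to tally the (lang, glottocode, dialect) combinations that occur across files. The sketch below does this on a few made-up XML snippets; `variety_key` is a hypothetical helper, the `TEXT` root tag is an assumption, and the attribute names are the ones read in the loop above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def variety_key(root):
    """Return (lang, glottocode, dialect), lowercased, with None for missing attributes."""
    values = (root.get(XML_LANG), root.get("glottocode"), root.get("dialect"))
    return tuple(v.lower() if v else None for v in values)

# Made-up documents for illustration only; real roots would come from all_xmls.
samples = [
    '<TEXT xml:lang="ami" glottocode="cent2104" dialect="Coastal"/>',
    '<TEXT xml:lang="ami" glottocode="cent2104" dialect="Coastal"/>',
    '<TEXT xml:lang="ami"/>',
]
variety_counts = Counter(variety_key(ET.fromstring(s)) for s in samples)
for key, n in variety_counts.most_common():
    print(key, n)
```

Running the same tally over the real files would show at a glance which combinations the filter above accepts or skips.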

In [12]:
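def get_sent_list(root) -> list[str]:
    """Return the text of each sentence (<S>) under root, preferring the 'standard' FORM."""
    sents = root.findall(".//S")
    texts = []
    for s in sents:
        form_children = []
        for child in s:
            if child.tag == "FORM":
                form_children.append(child)
            # there are 'standard' and 'original' forms
            # NOTE: this check runs once per child of <S>, not once per <S>, so an <S>
            # with several children can contribute the same text more than once
            if len(form_children) == 1:
                texts.append(form_children[0].text)
            else:
                for form in form_children:
                    kind = form.get("kindOf")
                    if kind == "standard":
                        texts.append(form.text)
    return texts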

Let's test our new method. 

In [13]:
testfile = lang_xmls[0]
print(testfile)
root = ET.parse(testfile).getroot()
ret = get_sent_list(root)
print(len(ret))
print(root.get("doesnotexist"))

./FormosanBank/Corpora/ePark/XML/ep3_文化篇/Amis/Coastal_Amis.xml
2128
None


Great, now let's see how it works on our collected files. Hopefully it's all formatted correctly!

In [14]:
all_sents = []
for file in lang_xmls:
    tree = ET.parse(file)
    root = tree.getroot()
    file_text_as_list = get_sent_list(root)
    print(f"{' '.join(file.split('/')[-3:])} length: {len(file_text_as_list)}")
    all_sents += file_text_as_list

ep3_文化篇 Amis Coastal_Amis.xml length: 2128
ep2_文化篇 Amis Coastal_Amis.xml length: 2128
ep2_閱讀書寫篇 Amis Coastal_Amis.xml length: 3375
ep3_族語短文 Amis Coastal_Amis.xml length: 492
ep3_繪本平台 Amis Coastal_Amis.xml length: 5028
ep2_族語短文 Amis Coastal_Amis.xml length: 814
ep2_生活會話篇 Amis Coastal_Amis.xml length: 3096
ep3_圖畫故事篇 Amis Coastal_Amis.xml length: 172
ep3_句型篇國中 Amis Coastal_Amis.xml length: 2404
ep3_句型篇高中 Amis Coastal_Amis.xml length: 3456
ep2_情境族語 Amis Coastal_Amis.xml length: 4578
ep3_閱讀書寫篇 Amis Coastal_Amis.xml length: 3465
ep3_情境族語 Amis Coastal_Amis.xml length: 3084
ep3_生活會話篇 Amis Coastal_Amis.xml length: 3186
ep2_學習詞表 Amis Coastal_Amis.xml length: 5444
ep1_九階教材 Amis Coastal_Amis.xml length: 4618
XML Amis Amis.xml length: 132
XML Amis Amis.xml length: 21928
Virginia_Fey_Dictionary XML Amis.xml length: 8883


Hmm, it's a bit odd that the first two match exactly in length, and even though they're episode 2/3, they are also the same topic. I bet they're the same file accidentally - let's use a `set` to make sure we're not getting biased counts.

In [15]:
print(len(all_sents))
print(len(set(all_sents)))

78411
14103


Whoa, that's a huge difference! There's a good chance that the Virginia Fey dictionary and some of the other sources contain single-word entries and vocabulary lists that got duplicated. Let's check and see what's actually getting duplicated.

In [16]:
counts= {}
for e in all_sents:
    if e in counts:
        counts[e] = counts[e] + 1
    else:
        counts[e] = 1
i = sorted(counts.items(), key=lambda x: x[1], reverse=True)

In [17]:
print(i[:20])

[('Cima kiso?', 51), ('masadak', 41), ('Cima ko ngangan iso?', 36), ('kaka', 34), ("romi'ad", 34), ("mafana'", 34), ('Papina ko salikaka iso?', 34), ('romadiw', 33), ('katayni', 33), ("Nga'ay ho kiso?", 32), ('adada', 30), ('tatodong', 29), ('folad', 29), ('anini', 29), ('kolong', 29), ('niyam', 29), ("riko'", 29), ('pising', 29), ('Talacowa kiso?', 29), ('masakero', 28)]


As expected, most are single words, but there are some sentences in there. This is a bit odd, but makes sense when you consider that most of our corpus is from an online language teaching platform. More than likely, a sentence is repeated multiple times throughout a 'unit', and potentially across multiple units as well.
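
One more thing worth ruling out is `get_sent_list` itself. The `len(form_children)` check sits inside the loop over the children of `S`, so it runs once per child. If an `S` holds anything besides `FORM` (a translation element, say), the same text can be appended more than once. I have not confirmed that this is what happens in these files. Below is a minimal sketch of a variant that decides once per sentence, run on a made-up snippet; `get_sent_list_once`, the `TEXT` and `TRANSL` tags, and the placeholder translations are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def get_sent_list_once(root):
    """Collect one text per <S>: the only FORM, or the FORM(s) with kindOf="standard"."""
    texts = []
    for s in root.findall(".//S"):
        forms = [child for child in s if child.tag == "FORM"]
        if len(forms) == 1:
            texts.append(forms[0].text)
        else:
            # several forms (or none): keep only the standardized one(s)
            texts.extend(f.text for f in forms if f.get("kindOf") == "standard")
    return texts

# Made-up document: each S has a non-FORM child after its FORM(s).
toy = ET.fromstring(
    "<TEXT>"
    "<S><FORM>Cima kiso?</FORM><TRANSL>placeholder translation</TRANSL></S>"
    "<S><FORM kindOf='original'>Kaka.</FORM><FORM kindOf='standard'>kaka</FORM>"
    "<TRANSL>placeholder translation</TRANSL></S>"
    "</TEXT>"
)
once = get_sent_list_once(toy)
print(once)
```

On this snippet the version defined earlier appends `Cima kiso?` twice and also picks up the `original` form of the second sentence, so it may be worth comparing the two on a real file.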

Regardless, we now have a set of sentences - our corpus! - and can go ahead with the usual autocorrect steps!

In [18]:
small_sentence_list= list(set(all_sents))

In [28]:
corpus = []
for s in small_sentence_list:
    corpus += [w.strip('/\\ .,?!;:[]()<>#$%^&*') for w in s.lower().split(' ') if w.strip(' \'/\\ .,?!;:[]()<>#$%^&*') != '']
dictionary = set(corpus) # "Dictionary" of known words

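I'm not so confident about putting everything to lowercase, but this should work. I'm also unsure how to treat the single quote `'`, since some languages use it to mark a glottal stop. As written, the code keeps `'` inside and at the edges of words and only drops tokens that consist entirely of quotes and punctuation.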
Anyway, at this point we have our corpus, and we have our set of all words that we've seen. Next we need to make a count for each word in the corpus, which will be used to predict which 'autocorrected' word is most likely.
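
On the quote question: a regular expression makes it explicit which characters count as part of a word. This is only a sketch, assuming that word characters and `'` should be kept together; `tokenize` and `WORD_RE` are hypothetical names, not part of FormosanBank.

```python
import re

# Assumption: a word is a run of word characters and single quotes.
WORD_RE = re.compile(r"[\w']+")

def tokenize(sentence):
    """Lowercase a sentence and return its words, keeping ' inside and at the edges of words."""
    tokens = WORD_RE.findall(sentence.lower())
    # drop tokens made only of quotes, e.g. a stray ' between spaces
    return [t for t in tokens if t.strip("'")]

print(tokenize("Nga'ay ho kiso?"))
print(tokenize("romi'ad, mafana' ' folad!"))
```

Unlike `split(' ')`, this also splits on punctuation inside a token, such as a hyphen, which may or may not be what we want for Amis.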

In [29]:
wcount = {}
for w in corpus:
    if w in wcount:
        wcount[w] +=1
    else:
        wcount[w] = 1
m = sum(wcount.values())
wprobs = {k:wcount[k]/m for k in wcount} # probability of each word according to our corpus

In [37]:
print(f"Corpus has {len(corpus)} words")
print(f"Sanity check for full corpus: {str(len(corpus) == m)}")
print(f"Corpus has {len(dictionary)} unique words")
print(f"Sanity check for unique words: {str(len(wcount) == len(wprobs) == len(dictionary))}")
testwords = list(dictionary)[:10]
testwords += ['a', 'o', 'no'] # known common words
print(f"Testing words: {testwords}")
print("")
for t in testwords:
    print(f"Count for {t}: {wcount[t]}")
    print(f"Prob  for {t}: {wprobs[t]}")

Corpus has 108919 words
Sanity check for full corpus: True
Corpus has 13695 unique words
Sanity check for unique words: True
Testing words: ['nipacamolan', 'masanekay', "mianako'", 'satadamaanay', "misatata'ak", 'hofocan', "maloci'ay", "li'akong", 'longec', "i'ayawho", 'a', 'o', 'no']

Count for nipacamolan: 1
Prob  for nipacamolan: 9.181134604614438e-06
Count for masanekay: 2
Prob  for masanekay: 1.8362269209228875e-05
Count for mianako': 1
Prob  for mianako': 9.181134604614438e-06
Count for satadamaanay: 7
Prob  for satadamaanay: 6.426794223230107e-05
Count for misatata'ak: 1
Prob  for misatata'ak: 9.181134604614438e-06
Count for hofocan: 1
Prob  for hofocan: 9.181134604614438e-06
Count for maloci'ay: 1
Prob  for maloci'ay: 9.181134604614438e-06
Count for li'akong: 1
Prob  for li'akong: 9.181134604614438e-06
Count for longec: 1
Prob  for longec: 9.181134604614438e-06
Count for i'ayawho: 1
Prob  for i'ayawho: 9.181134604614438e-06
Count for a: 6334
Prob  for a: 0.058153306585627854
Co

The above looks good, although I am a little bit suspicious of how high the counts are for `a`, `no`, and `o`. Then again, `a` is 6334 of 108919 tokens (about 5.8%), which doesn't seem unreasonable for a very common word.
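
To put counts like these in context, it can help to look at what share of all tokens the most frequent words cover. Here is a small sketch using `collections.Counter` on a made-up token list; `top_share` is a hypothetical helper, and on the real data it would be called with `corpus`.

```python
from collections import Counter

def top_share(tokens, n):
    """Return the n most frequent tokens, each with its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(n)]

# Made-up tokens for illustration only.
toy_tokens = ["a", "o", "a", "no", "kaka", "a", "o", "folad"]
for word, share in top_share(toy_tokens, 3):
    print(f"{word}: {share:.3f}")
```
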
Now we will actually do autocorrect prediction.