# Code for Chapter 3: Learning Languages

This notebook provides the code written for and used in Chapter 3 of my dissertation **_SigmaPie_ for subregular and subsequential grammar induction**. All the links will be added soon. :)

# Generators and evaluators: the setup for the experiments

## Step 1: loading dependencies, including _SigmaPie_

In [2]:
import codecs
from random import choice, randint
from pprint import pprint

# I'm also adding a pickle-based helper to make saving results easier
import pickle
def write_report(this, which):
    # `this` names a global grammar object; `which` selects what gets pickled
    if which == "data":
        obj = globals()[this].data
    elif which == "sample":
        obj = globals()[this+"_sample"]
    elif which == "grammar":
        obj = globals()[this]
    else:
        raise ValueError("which must be 'data', 'sample' or 'grammar'")
    # the with-block closes the file even if pickling fails
    with open("results_/" + this + "/" + which, "wb") as f:
        pickle.dump(obj, f)
    print("recorded", which)
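
`write_report` expects the folder `results_/<name>/` to exist already; otherwise `open(..., "wb")` raises `FileNotFoundError`. Below is a minimal sketch of the same pickle round trip that creates the folder first. The helpers `save_pickle` and `load_pickle`, the name `toy`, and the word list are hypothetical and are not part of this notebook or of SigmaPie; a temporary directory stands in for `results_/`.

```python
import os
import pickle
import tempfile

def save_pickle(obj, root, name, which):
    """Pickle obj to <root>/<name>/<which>, creating the folder if needed."""
    folder = os.path.join(root, name)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, which)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A temporary directory stands in for results_/ so nothing is left on disk.
with tempfile.TemporaryDirectory() as root:
    words = ["axIxa", "ixexi", ""]  # toy data, not taken from the experiments
    path = save_pickle(words, root, "toy", "data")
    restored = load_pickle(path)

print(restored == words)
```

Returning the path from the save helper makes it easy to log where each artifact ended up.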

In [3]:
# accessing SigmaPie toolkit: I know, horrible!
# I promise I'll make it a package soon
'''%cd local_sigmapie/code/
from main import *
%cd ../..'''
%cd ../SigmaPie/src
from sigmapie import *
%cd ../../subregular-experiments

c:\C\2022Summer\SigmaPie\src

You successfully loaded SigmaPie. 

Formal language classes and grammars available:
	* strictly piecewise: SP(alphabet, grammar, k, data, polar);
	* strictly local: SL(alphabet, grammar, k, data, edges, polar);
	* tier-based strictly local: TSL(alphabet, grammar, k, data, edges, polar, tier);
	* multiple tier-based strictly local: MTSL(alphabet, grammar, k, data, edges, polar).

Alternatively, you can initialize a transducer: FST(states, sigma, gamma, initial, transitions, stout).
Learning algorithm:
	OSTIA: ostia(sample, sigma, gamma).
c:\C\2022Summer\subregular-experiments


In [4]:
def backness_harmony(string):
    """
    Tells if a string is well-formed according to rules
    of Turkish backness harmony.
    """
    # I (ı), a, o, u are the back vowels; i, e, O (ö), U (ü) are the front ones
    back_class, front_class = "Iaou", "ieOU"
    front, back = False, False
    
    for v in front_class + back_class:
        if v in string:
            front = True if v in front_class else front
            back = True if v in back_class else back

    return not (front and back)
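
A quick way to see what the backness check accepts is to restate it with sets and run it on a few hand-made words. This is only a sketch: `is_backness_harmonic` is a hypothetical name, and reading `I`, `O`, `U` as ı, ö, ü is inferred from the feature table used in `turkish_word`.

```python
# Compact restatement of the backness check so the snippet runs on its own.
BACK, FRONT = set("Iaou"), set("ieOU")

def is_backness_harmonic(word):
    """True unless the word contains both back and front vowels."""
    letters = set(word)
    return not (letters & BACK and letters & FRONT)

# Hand-made examples: two harmonic words, one mixed word, and two vowel-less ones.
for w in ["axIxa", "ixexU", "axi", "xxx", ""]:
    print(repr(w), is_backness_harmonic(w))
```

Words without vowels, including the empty string, count as harmonic because neither vowel class is present.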

In [5]:
def rounding_harmony(string):
    """
    Tells if a string is well-formed according to rules
    of Turkish rounding harmony.
    """
    high, low, rounded = "iIuU", "aeoO", "uUoO"
    
    vowels = "".join([v for v in string if v in high + low])
    if len(vowels) < 2:
        return True
    
    ro = vowels[0] in rounded
    
    for v in vowels[1:]:
        if v in low:
            if v in rounded:
                return False
            ro = False
        elif (ro and v not in rounded) or (not ro and v in rounded):
            return False
            
    return True
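
When a word fails the rounding check, it helps to know which vowel is responsible. The sketch below follows the same logic as `rounding_harmony` but reports the position of the first offending vowel; `first_rounding_violation` is a hypothetical helper written for illustration, not part of the notebook's pipeline.

```python
HIGH, LOW, ROUNDED = "iIuU", "aeoO", "uUoO"

def first_rounding_violation(word):
    """Return the index of the first vowel that breaks rounding harmony, or None."""
    ro = None  # rounding value being spread; None until the first vowel is seen
    for idx, ch in enumerate(word):
        if ch not in HIGH + LOW:
            continue  # consonants are skipped
        if ro is None:
            ro = ch in ROUNDED  # the first vowel is unrestricted
        elif ch in LOW:
            if ch in ROUNDED:
                return idx  # a non-initial o/O is never allowed
            ro = False  # a low vowel resets the spreading to unrounded
        elif (ch in ROUNDED) != ro:
            return idx  # a high vowel must agree with the current value
    return None

for w in ["uxuxa", "Uxxi", "oxo", "axu", "xxx"]:
    print(w, first_rounding_violation(w))
```

A result of `None` corresponds to `rounding_harmony` returning `True`; any integer corresponds to `False`.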

In [6]:
def backness_and_rounding(string):
    return backness_harmony(string) and rounding_harmony(string)

In [7]:
def turkish_word(length = 10, cons = "x", vowel_cluster = (1, 2),
                          cons_cluster = (0, 3)):
    """
    This generator generates fake Turkish words: namely, words in which
    the harmonic system and rules of Turkish are preserved, but all consonants
    are replaced by a single given consonant.
    
    Arguments:
    * length (int): the length of the word to be generated;
    * cons (str): a single character (or an empty string if only vowels
                  need to be generated), the consonant that separates
                  the vowels and makes this harmony long-distance;
    * vowel_cluster (tuple[int, int]): a tuple of integers representing
                                       minimal and maximal length of
                                       the vowel cluster;
    * cons_cluster (tuple[int, int]): a tuple of integers representing
                                      minimal and maximal length of
                                      the consonantal cluster.
                                      
    Returns:
    * str: a fake Turkish harmonic word, where all consonants are masked.
    """
    if length < 1:
        raise ValueError("Words cannot be so short.")
    
    vowels = {
        (True, True, True):"u",
        (True, True, False):"I",
        (True, False, True):"o",
        (True, False, False):"a",
        (False, True, True):"U",
        (False, True, False):"i",
        (False, False, True):"O",
        (False, False, False):"e"
    }
    
    backness = choice([True, False])
    height = choice([True, False])
    rounding = choice([True, False])
    
    specs = (backness, height, rounding)
    word = ""
    
    if choice([0, 1]):
        word += cons * randint(*cons_cluster)
            
    while len(word) < length:
        vc = vowels[specs] * randint(*vowel_cluster)
        
        # this part is needed to avoid the word-initial *oo clusters
        # (temporarily disabled)
        # if len(vc) > 1 and not height and rounding:
        #     rounding = False
        #     vc = vc[0] + vowels[(backness, height, rounding)] * (len(vc) - 1)
        word += vc
        word += cons * randint(*cons_cluster)
        
        height = choice([True, False])
        rounding = False if not height else rounding
        specs = (backness, height, rounding)
        
    return word[:length]

In [8]:
def generate_turkish_words(n = 10, length = 10, cons = "x",
                           vowel_cluster = (1, 2), cons_cluster = (1, 3)):
    """
    This generator generates a list of fake Turkish words.
    
    Arguments:
    * n (int): a number of strings that need to be generated;
    ... for the rest of the arguments, see turkish_word.
    
    Outputs:
    * list: the list containing n fake Turkish words.
    """
    return [turkish_word(length, cons, vowel_cluster, cons_cluster) for i in range(n)]

In [9]:
def harmonic_evaluator(data, rule):
    """
    Evaluates the provided data with respect to a given
    rule of harmony.
    
    Arguments:
    * data (list[str]): a list of strings that need to be evaluated;
    * rule (function): a function that evaluates a string according
                       to some harmony.
                       
    Results:
    * Prints the report that shows if the data follows the rule.
    """
    correct = 0
    bad = list()
    for w in progressBar(data, prefix = "evaluating"):
        # keep track of the offending words as well, not only of the count
        if rule(w):
            correct += 1
        else:
            bad.append(w)
        
    ratio = (correct / len(data))
    print(f"Percentage of harmonic words: {int(ratio * 100)}%.")
    if len(bad) > 0:
        print("illegal words:", bad)
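
Two details of `harmonic_evaluator` are worth keeping in mind: it divides by `len(data)`, so an empty list raises `ZeroDivisionError`, and `int(ratio * 100)` truncates, so 995 harmonic words out of 1000 are reported as 99%. Below is a minimal sketch of a variant that returns its results instead of printing them and has no dependency on `progressBar`. The names `harmonic_report` and `no_mixed_backness` are hypothetical, and the four input words are made up.

```python
def harmonic_report(data, rule):
    """Return (ratio, bad): the share of words accepted by rule and the rejected words."""
    bad = [w for w in data if not rule(w)]
    # Guard against empty input instead of dividing by zero.
    ratio = (len(data) - len(bad)) / len(data) if data else 0.0
    return ratio, bad

def no_mixed_backness(word):
    """Toy rule: reject words containing both back and front vowels."""
    letters = set(word)
    return not (letters & set("Iaou") and letters & set("ieOU"))

ratio, bad = harmonic_report(["axa", "ixe", "axi", "xx"], no_mixed_backness)
print(f"{int(ratio * 100)}% harmonic; illegal words: {bad}")
```

Returning the ratio as a float leaves the rounding decision to the caller.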

In [11]:
this = "mitsl9"
globals()[this] = MITSL(polar = "n")
print("created grammar object")
# #globals()[this].data = toy_mhwb
globals()[this].data = pickle.load(open("results_/" + this + "/data", "rb"))
#write_report(this, "data")
globals()[this].data.append("") # added to eliminate *>< on all tiers
print("populated data")
globals()[this].extract_alphabet()
globals()[this].learn()
write_report(this, "grammar")

globals()[this+"_sample"] = globals()[this].generate_sample(n = 1000)
write_report(this, "sample")
harmonic_evaluator(globals()[this+"_sample"], backness_and_rounding)
'''print("--------------------------")
print("Generates such strings:", globals()[this+"_sample"][:15])
print("--------------------------")
print("Size of the grammar:", len(globals()[this].grammar))
print("--------------------------")
print("Grammars:", globals()[this].grammar)'''


created grammar object
populated data
extracting alphabet |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
generating ngrams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
extracting symbols |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
annotating input, attesting k-grams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
calculating paths |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
learning unattested grams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
gathering grammars |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
recorded grammar
generating i


In [16]:
harmonic_evaluator(globals()[this+"_sample"], backness_and_rounding)


evaluating |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
Percentage of harmonic words: 99%.
illegal words: ['Uxxi', 'Uxxii', 'Uxxiix', 'uxxuxuxxI', 'Uxxi']
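
The rejected words listed above can be checked against each harmony separately to see which rule they break. In the sketch below both checks are restated so that the snippet runs on its own: the rounding check is copied from the cell that defines it, and the backness check is a one-line equivalent. Under these definitions every listed word keeps backness harmony but pairs a rounded initial vowel with a later unrounded high vowel.

```python
def backness_harmony(string):
    # One-line equivalent of the notebook's check: the two vowel classes must not mix.
    return not (any(v in string for v in "Iaou") and any(v in string for v in "ieOU"))

def rounding_harmony(string):
    # Copied from the notebook's rounding check.
    high, low, rounded = "iIuU", "aeoO", "uUoO"
    vowels = "".join([v for v in string if v in high + low])
    if len(vowels) < 2:
        return True
    ro = vowels[0] in rounded
    for v in vowels[1:]:
        if v in low:
            if v in rounded:
                return False
            ro = False
        elif (ro and v not in rounded) or (not ro and v in rounded):
            return False
    return True

# The words reported as illegal in the output above.
illegal = ['Uxxi', 'Uxxii', 'Uxxiix', 'uxxuxuxxI', 'Uxxi']
rules = (("backness", backness_harmony), ("rounding", rounding_harmony))
for w in sorted(set(illegal)):
    broken = [name for name, rule in rules if not rule(w)]
    print(w, "violates:", ", ".join(broken))
```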


In [18]:
d = generate_turkish_words(n=15000,cons_cluster=(1,4))
g=MITSL(polar='n')
g.data = d
g.extract_alphabet()
g.learn()
s = g.generate_sample()  # I made a mistake here by not specifying n=1000, so see the next cell
harmonic_evaluator(s, backness_and_rounding)


extracting alphabet |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
generating ngrams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
extracting symbols |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
annotating input, attesting k-grams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
calculating paths |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
learning unattested grams |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
gathering grammars |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
generating items |█████████████████████████████████████████████████

In [28]:
s = g.generate_sample(n = 1000)
harmonic_evaluator(s, backness_and_rounding)

generating items |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
evaluating |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
Percentage of harmonic words: 95%.
illegal words: ['oxxIxa', 'Oxxi', 'Oxxi', 'UxUxxixi', 'oxxxI', 'Oxxi', 'uxxI', 'Uxxxi', 'Oxxixe', 'Oxxixi', 'uxuxxIxxI', 'Uxxixi', 'uxxIx', 'Uxxi', 'Uxxi', 'uxxI', 'oxxxIxxIxxaxa', 'Oxxixixxixexixx', 'oxxIxIxaxx', 'Uxxixe', 'Oxxi', 'oxxxxIx', 'Oxxxix', 'oxxxIxI', 'uxxIxaxx', 'oxxI', 'Uxxxixxxe', 'Uxxi', 'oxxI', 'Uxxixi', 'Uxxxixixexxix', 'UxUxxxixxxi', 'Uxxxxxxixe', 'uxxI', 'Oxxix', 'oxxxIx', 'UxUxUxxi', 'oxuxxI', 'Oxxix', 'oxuxxIx', 'uxxI', 'uxxI', 'oxxIx', 'oxxxIxI']


In [4]:
m = pickle.load(open("results_/mitsl9/grammar", "rb"))
mm = [i for i in progressBar(m.data) if not m.scan(i)]
mm

 |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 


[]