# Error Modelling Function

Most string similarity functions measure only the "overlap" between two strings. This notebook demonstrates how to build a graphical structure that maps the unmatched segments of one string onto those of the other.

First, we import some math functions and the shingle functions written in earlier notebooks.
Then we initialize two dictionaries, `port_ro_cog` and `port_ro_noncog`.

`port_ro_cog` holds the edge frequencies for Portuguese - Romanian cognates, and `port_ro_noncog` holds the edge frequencies for Portuguese - Romanian non-cognates.

In [13]:
from math import ceil, floor
from src.shingle import *

port_ro_cog = dict()
port_ro_noncog = dict()

## Graphical Model Algorithm

This algorithm takes two shingle lists, `first` and `second`, as parameters and returns the graph as a set of edge tuples. Note that it pads the two lists in place.

It proceeds in three steps, each marked in the code:
1. Initialization of sets
2. Equalization of the set cardinalities
3. Inserting the mappings of the set members into the graph

In [14]:
def graph_model(first, second):
    ''' Constructs the graphical structure between two shingle sets. '''
    
    # Step 1: Initialization
    # If the given sets first and second are empty, we initialize 
    # them by inserting an empty token, (nun), into those sets.
    
    if len(first) == 0:
        first.append("nun") #insert empty token if found empty
    if len(second) == 0:
        second.append("nun") #insert empty token if found empty
        
    # Step 2: Equalization of the set cardinalities
    # The cardinalities of first and second are made equal
    # by inserting empty tokens (nun) into the middle of
    # the shorter one.
    
    # Pad first while it is shorter than second
    while(len(first) < len(second)):
        pos = ceil(len(first) / 2)
        first.insert(pos, "nun")
    
    # Pad second while it is shorter than first
    while(len(first) > len(second)):
        pos = floor(len(second) / 2)
        second.insert(pos, "nun")
        
    # Step 3: Inserting the mappings of the set members into the graph
    # The graph starts empty. Directed edges run from each member of
    # first to the members of second at the same index and at the two
    # neighbouring indices (i - 1 and i + 1), where those exist.
    
    # Edges are stored as tuples in a set to avoid duplicates
    graph = set()
    
    for i in range(len(first)):
        pair = (first[i], second[i]) # One to one mapping with same index
        graph.add(pair)
    for i in range(len(first) - 1):
        pair = (first[i], second[i + 1]) # One to one mapping with an index ahead
        graph.add(pair)
    if len(first) > 1:
        for i in range(1, len(first)):
            pair = (first[i], second[i - 1]) # One to one mapping with an index before
            graph.add(pair)
    return graph
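
To make the padding and edge rules concrete, here is a minimal sketch that reproduces steps 1 to 3 on a tiny input. The helper names `padded` and `edges` are hypothetical and not part of the notebook; unlike `graph_model`, this sketch works on copies rather than modifying its arguments.

```python
from math import ceil, floor

EMPTY = "nun"  # empty-token marker, as used in graph_model

def padded(first, second):
    """Return copies of both lists, padded in the middle to equal length."""
    first, second = list(first) or [EMPTY], list(second) or [EMPTY]
    while len(first) < len(second):
        first.insert(ceil(len(first) / 2), EMPTY)
    while len(first) > len(second):
        second.insert(floor(len(second) / 2), EMPTY)
    return first, second

def edges(first, second):
    """Link index i of first to indices i - 1, i and i + 1 of second."""
    n = len(first)
    return {(first[i], second[j])
            for i in range(n)
            for j in (i - 1, i, i + 1)
            if 0 <= j < n}

first, second = padded(["as", "at"], ["ar"])
print(first, second)                  # ['as', 'at'] ['nun', 'ar']
print(sorted(edges(first, second)))
# [('as', 'ar'), ('as', 'nun'), ('at', 'ar'), ('at', 'nun')]
```

With only two positions, every member of `first` reaches every member of `second`; for longer lists the edges stay within one index of the diagonal.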

Two helper functions are defined. `common_elements` is similar to list intersection. `uncommon_elements` is similar to list difference: it keeps the elements of `list1` that do not occur in `list2`.

In [15]:
def common_elements(list1, list2):
    return [element for element in list1 if element in list2]

def uncommon_elements(list1, list2):
    return [element for element in list1 if element not in list2]
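
The following short example, with made-up shingles, shows how these helpers differ from Python's set operations: order and duplicates from `list1` are preserved, and `uncommon_elements` is one-sided, so swapping the arguments changes the result.

```python
def common_elements(list1, list2):
    return [element for element in list1 if element in list2]

def uncommon_elements(list1, list2):
    return [element for element in list1 if element not in list2]

# Illustrative shingle lists (not taken from the data files)
query = ["as", "sp", "as"]
document = ["as", "ar"]

print(common_elements(query, document))    # ['as', 'as']
print(uncommon_elements(query, document))  # ['sp']
print(uncommon_elements(document, query))  # ['ar']
```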

Now that all functions are defined, we populate the dictionaries with the edge frequencies stored in the data files.

In [16]:
def init_dict():
    
    # for cognates: each line holds a frequency followed by the two edge members
    with open("data/portuguese/graph_cognates_freq.txt", "r") as cogs:
        for line in cogs:
            splits = line.rstrip().split()
            port_ro_cog[(splits[1], splits[2])] = int(splits[0])
    
    # for non cognates: same line format
    with open("data/portuguese/graph_noncognates_freq.txt", "r") as noncogs:
        for line in noncogs:
            splits = line.rstrip().split()
            port_ro_noncog[(splits[1], splits[2])] = int(splits[0])
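
The parsing step can be tried without the data files. The sketch below assumes the line format implied by the code above (a frequency followed by the two edge members, separated by whitespace); the function name `load_freqs` and the sample lines are hypothetical. It also skips malformed lines, which `init_dict` does not do.

```python
from io import StringIO

def load_freqs(handle):
    """Parse 'frequency source target' lines into {(source, target): frequency}."""
    freqs = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        count, src, tgt = parts
        freqs[(src, tgt)] = int(count)
    return freqs

# Made-up sample standing in for a frequency file
sample = StringIO("12 at ar\n3 nun ar\n\n")
print(load_freqs(sample))  # {('at', 'ar'): 12, ('nun', 'ar'): 3}
```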

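The error modelling function, pi, as implemented in the code, is the mean of the $k$-th powers of the edge frequencies over the graph $G$ built from the unmatched shingles:
$$
\pi(q, d) = \frac{1}{|G|} \sum_{e \in G} f(e)^{k}
$$
where $q$ and $d$ are the shingle lists of the source and the target, $G$ is the graph built from $q - (q \cap d)$ and $d - (q \cap d)$, and $f(e)$ is the frequency of edge $e$ in `port_ro_cog`.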

In [24]:
def pi(source, target, k = 1):
    query = two_ends(source, 2) #Your query
    document = two_ends(target, 2) #Your document
    qd = common_elements(query, document) # q cap d
    first = uncommon_elements(query, qd) # q - (q cap d)
    second = uncommon_elements(document, qd) # d - (q cap d)
    graph = graph_model(first,second)
    res = sum([port_ro_cog[i]**k for i in graph]) # sum the frequencies in the dictionary
    return res / len(graph)
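
The scoring step can be checked by hand on a small case. In the sketch below, the function name `mean_edge_score`, the frequency table and the graph are all hypothetical. One deliberate difference from `pi`: `pi` indexes `port_ro_cog[i]` directly, which raises a `KeyError` for an edge missing from the table, whereas this sketch assumes a missing edge counts as frequency 0.

```python
def mean_edge_score(graph, freqs, k=1):
    """Average the k-th powers of the edge frequencies over the graph."""
    # Assumption: an edge absent from the table contributes 0.
    total = sum(freqs.get(edge, 0) ** k for edge in graph)
    return total / len(graph)

# Hypothetical frequency table and graph, for illustration only
freqs = {("at", "ar"): 6, ("nun", "ar"): 3}
graph = {("at", "ar"), ("nun", "ar"), ("at", "nun")}

print(mean_edge_score(graph, freqs))       # (6 + 3 + 0) / 3 = 3.0
print(mean_edge_score(graph, freqs, k=2))  # (36 + 9 + 0) / 3 = 15.0
```

Raising `k` weights frequent edges more heavily relative to rare ones.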

In [25]:
init_dict()
pi("aspirat", "aspirar")

4