# Final Scoring Function

This final notebook shows how the functions from the earlier notebooks fit together.

First, we import the functions that we created.

In [None]:
from src.shingle import *
from src.graph import *
import numpy as np

The file `port.txt` contains the Portuguese lexicon, one word per line. We build a dictionary that maps each lexeme to its shingle set.

In [None]:
wordlist = {}
# Read-only access is enough; the context manager closes the file.
with open('data/portuguese/port.txt', "r") as portuguese:
    for line in portuguese:
        word = line.rstrip()
        wordlist[word] = two_ends(word, 2)

The file `portuguese_train_mrr.txt` contains Portuguese-Romanian cognate pairs, one pair per line, with the two words separated by `____`.

We store them in the following structure:

`data = [(query, shingle-set of query, correct answer)]`

In [None]:
pairs = []
with open('data/portuguese/portuguese_train_mrr.txt', "r") as train:
    for line in train:
        word = line.rstrip().split('____')
        pairs.append((word[0], word[1]))
data = [(p[0], two_ends(p[0], 2), p[1]) for p in pairs]
# Average number of shingles per query
avgdl = np.mean([len(s[1]) for s in data])
print("Average is", avgdl)
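
To make the tuple layout concrete, here is a minimal sketch that builds the same `(query, shingle set, correct answer)` records from made-up lines. The helper `char_bigrams` is a hypothetical stand-in for `two_ends(word, 2)`, whose actual output is defined in `src.shingle` and may differ; `parse_pairs` is likewise an illustrative name.

```python
def char_bigrams(word):
    # Hypothetical stand-in for two_ends(word, 2): plain character bigrams.
    return [word[i:i + 2] for i in range(len(word) - 1)]

def parse_pairs(lines, sep="____"):
    # Each line holds "query<sep>answer"; skip lines without the separator.
    records = []
    for line in lines:
        parts = line.rstrip().split(sep)
        if len(parts) < 2:
            continue
        query, answer = parts[0], parts[1]
        records.append((query, char_bigrams(query), answer))
    return records

sample_lines = ["aspirat____aspirar\n", "malformed line\n"]  # made-up sample
records = parse_pairs(sample_lines)
print(records)
```

Skipping malformed lines is a design choice of this sketch; the cell above would raise an `IndexError` on such a line instead.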

As in the 4th notebook, we initialize the frequency dictionaries for cognate and non-cognate pairs in the `init_dict()` function.

In [None]:
port_ro_cog = dict()
port_ro_noncog = dict()

def init_dict():
    # Each line: a frequency followed by the two elements of the pair
    with open("data/portuguese/graph_cognates_freq.txt", "r") as cogs:
        for line in cogs:
            splits = line.rstrip().split()
            port_ro_cog[(splits[1], splits[2])] = int(splits[0])
    with open("data/portuguese/graph_noncognates_freq.txt", "r") as noncogs:
        for line in noncogs:
            splits = line.rstrip().split()
            port_ro_noncog[(splits[1], splits[2])] = int(splits[0])

init_dict()
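
The parsing step in `init_dict()` can be tried without the data files. The sketch below reads the same three-column layout (a count followed by two elements) from an in-memory buffer; the function name `load_pair_counts` and the sample counts are invented for illustration.

```python
import io

def load_pair_counts(handle):
    # Line layout inferred from init_dict(): "<count> <first> <second>".
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        counts[(fields[1], fields[2])] = int(fields[0])
    return counts

sample = io.StringIO("12 at ar\n3 ct pt\n\n")  # made-up counts
pair_counts = load_pair_counts(sample)
print(pair_counts)
```

Returning a fresh dictionary instead of filling a global one makes the loader easier to reuse for both the cognate and the non-cognate file.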

This is the `pi` function that we created in the 4th notebook.

In [None]:
def pi(source, target, k = 1.5):
    query = two_ends(source, 2) # shingles of the query word
    document = two_ends(target, 2) # shingles of the candidate word
    qd = common_elements(query, document) # q cap d
    first = uncommon_elements(query, qd) # q - (q cap d)
    second = uncommon_elements(document, qd) # d - (q cap d)
    graph = graph_model(first, second)
    if len(graph) == 0: # nothing left to align; assumption: score it as 0
        return 0.0
    res = 0 # sum of the cognate frequencies, each raised to the power k
    for i in graph:
        if i in port_ro_cog:
            res += port_ro_cog[i] ** k
    return res / len(graph)

In [None]:
pi("aspirat", "aspirar")
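
The arithmetic inside `pi` is an average over graph edges: every edge found in the cognate dictionary contributes its frequency raised to the power `k`, and the sum is divided by the total number of edges. The sketch below reproduces only that arithmetic on made-up data. It assumes that `graph_model` yields pairs usable as dictionary keys, and it treats an empty graph as a score of zero, which is a choice of this sketch rather than something taken from the 4th notebook.

```python
def average_edge_weight(edges, cognate_counts, k=1.5):
    # Sum count**k over the edges found in the table,
    # then divide by the total number of edges.
    if not edges:
        return 0.0  # assumption: an empty graph scores zero
    total = sum(cognate_counts.get(edge, 0) ** k for edge in edges)
    return total / len(edges)

toy_counts = {("at", "ar"): 4}            # made-up frequency table
toy_edges = [("at", "ar"), ("xx", "yy")]  # made-up graph edges
print(average_edge_weight(toy_edges, toy_counts))
```

Here the known edge contributes `4 ** 1.5 = 8.0`, the unknown edge contributes nothing, and the average over two edges is `4.0`.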

From the third notebook, we take the Dirichlet smoothing functions for the similarity part.

In [None]:
shingles = 470751 # normalizer for the shingle frequencies

def smooth(intersection, document, mu):
    values = [] # renamed so the list no longer shadows the function name
    for word in intersection:
        prob = 1.0 + np.divide(document.count(word), mu * frequencies[word] / shingles)
        values.append(np.log10(prob))
    return np.array(values)

def dirichlet(query, document, mu = 100.0):
    intersection = [word for word in document if word in query] # intersection
    add = len(query) * np.log10(np.divide(mu, mu + len(document)))
    score = np.dot(tf(intersection, query), smooth(intersection, document, mu)) + add
    return score

def tf(intersection, document):
    counts = [document.count(word) for word in intersection]
    return np.array(counts)
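
Read directly off the code, the quantity returned by `dirichlet` can be written as follows. The symbols are notation introduced here, not taken from the third notebook: $c(w, x)$ is the number of times shingle $w$ occurs in $x$, $f(w)$ is `frequencies[w]`, $N$ is `shingles`, and $|q|$, $|d|$ are the lengths of the query and document shingle lists.

$$
\text{dirichlet}(q, d) = \sum_{\substack{w \in d \\ w \in q}} c(w, q)\,\log_{10}\!\left(1 + \frac{c(w, d)}{\mu \, f(w) / N}\right) + |q|\,\log_{10}\frac{\mu}{\mu + |d|}
$$

The sum runs over the entries of the document list, so a shingle that occurs twice in $d$ contributes two terms. With the default $\mu = 100$, the last term is always negative and penalizes long documents.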

We also need to load the shingle counts in order to capture the document frequencies. `smooth` reads them from the `frequencies` dictionary, so this cell must run before `dirichlet` is called.

In [None]:
# Initialize counts
frequencies = {}
with open("data/portuguese/two_ends.txt", "r") as text:
    for line in text:
        word = line.strip().split(' ')
        frequencies[word[0]] = float(word[1])

## Finally putting it together

The function `score` combines the error-modelling function `pi` with the similarity function `dirichlet`, weighted by `alpha`.

In [None]:
def score(query, document, alpha = 0.5):
    return alpha * pi(query, document) + (1.0 - alpha) * dirichlet(two_ends(query, 2), two_ends(document, 2))
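
The combination in `score` is a linear interpolation. A small sketch with made-up component values shows how `alpha` shifts the balance between the two parts; the name `interpolate` and the numbers are illustrative only.

```python
def interpolate(error_score, similarity_score, alpha=0.5):
    # alpha weights the error-model score; (1 - alpha) weights the similarity score.
    return alpha * error_score + (1.0 - alpha) * similarity_score

# Made-up component scores, only to show how alpha shifts the balance.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, interpolate(4.0, -2.0, alpha))
```

With `alpha = 0` only the similarity score counts, and with `alpha = 1` only the error-model score counts. Since the two components live on different scales (one is an average of powered frequencies, the other a sum of logarithms), the useful range of `alpha` probably has to be found empirically.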

In [None]:
print(pi("aspirat", "aspirar"))
print(dirichlet(two_ends("aspirat", 2), two_ends("aspirar", 2)))

In [None]:
print(score("aspirat", "aspirar"))

Finally, for every query word in the training data, we score every word in the Portuguese lexicon against it, sort the candidates by score, and print the five best. This can be done in the following manner:

In [None]:
for source in data:
    score_list = []
    for key in wordlist:
        score_list.append((key, score(key, source[0])))
    score_list = sorted(score_list, key=lambda x: x[1], reverse=True)

    # Slicing avoids an IndexError when fewer than five candidates exist
    for item in score_list[:5]:
        print(item)
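
Since only the five best candidates are printed, the full sort is not strictly needed. As an alternative sketch, `heapq.nlargest` from the standard library keeps just the top `k` entries. The scorer `toy_score` and the lexicon below are made up so that the example runs without the data files; in the notebook, the scorer would be `score` applied to the current query.

```python
import heapq

def top_k(candidates, score_fn, k=5):
    # Keep only the k best (candidate, score) pairs instead of sorting everything.
    scored = ((cand, score_fn(cand)) for cand in candidates)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

def toy_score(word, query="aspirat"):
    # Hypothetical scorer: number of positions where the letters agree.
    return sum(a == b for a, b in zip(word, query))

lexicon = ["aspirar", "aspecto", "casa", "respirar"]  # made-up lexicon
print(top_k(lexicon, toy_score, k=2))
```

If `k` exceeds the number of candidates, `heapq.nlargest` simply returns all of them in descending order of score.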