# Morfessor Implementation (Vector Abstractions)



## Outline
  - Objective : find constructions
      - Compounds = words
      - Constructions = morphs
      - Atoms = letters/characters
  - Components 
      - Cost Function
      - Training
      - Decoding
  - Model
      - Lexicon, Grammar
      - Independence assumption vis-a-vis constructions
      - MAP estimate - Likelihood + MDL-Prior
  - Training Algorithm
      - Greedy & Local Search
      - Starts with an initial lexicon and tries to find the optimal segmentation


## Estimate
$argmax_M\,P(M|corpus)\,=\,argmax_M\,P(corpus|M)\,P(M)$

$P(M)\,=\,P(lexicon,grammar)=\,P(lexicon)$ (for baseline)

### Prior Probability - MDL Formulation
Supposing that there are $L$ different morphs,

$P(lexicon)\,=\,L!\,P(properties(\mu_1)...properties(\mu_L))$

There are $L!$ ways to order the list of morphs. The properties of each morph are its frequency and its string of letters.


$P(properties(\mu_1)...properties(\mu_L))\,=\,P(f_{\mu_1},..f_{\mu_L}).P(s_{\mu_1},..s_{\mu_L})$

$P(f_{\mu_1},..f_{\mu_L})\,=\,\frac{(L-1)!(N-L)!}{(N-1)!}$ (implicit frequency modeling, see Appendix A)

where $N=\Sigma_{j=1}^{L} f_{\mu_j}$ is the total number of morph tokens.
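The frequency prior is the reciprocal of a binomial coefficient, $1 / \binom{N-1}{L-1}$, so its cost is the log of that coefficient. A minimal sketch, assuming natural logarithms and using `math.lgamma` for exact log-factorials; the helper names `log_factorial` and `frequency_cost` are illustrative and not part of the notebook:

```python
import math

def log_factorial(n):
    """Exact ln(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)."""
    return math.lgamma(n + 1)

def frequency_cost(num_types, num_tokens):
    """-ln P(f_1..f_L) = ln((N-1)!) - ln((L-1)!) - ln((N-L)!)."""
    return (log_factorial(num_tokens - 1)
            - log_factorial(num_types - 1)
            - log_factorial(num_tokens - num_types))

# Toy lexicon: L = 3 morph types, N = 5 morph tokens.
# There are C(4, 2) = 6 equally likely frequency assignments, so the cost is ln 6.
cost = frequency_cost(3, 5)
print(cost, math.log(6))
```

Using `lgamma` avoids both the overflow of `math.factorial` for large counts and the approximation error of Stirling's formula.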

$P(s_{\mu_1},..s_{\mu_L})\,=\,\Pi_{i=1}^{L} P(s_{\mu_i})\,=\,\Pi_{i=1}^{L} \Pi_{j=1}^{l_{\mu_i}} P(c_{ij})$

The probability of each morph string is the product of the probabilities of its characters. Morph length is modeled implicitly by appending an end marker '#' to each morph.
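As a rough illustration of the string prior with the end marker, the sketch below estimates character probabilities from the morph strings themselves and sums the per-character costs. This is an assumption made for the example: the notebook estimates character probabilities from the corpus tokens and does not yet append '#'. The names `END`, `char_costs` and `string_cost` are hypothetical.

```python
import math
from collections import Counter

END = "#"  # end-of-morph marker; assumed not to occur inside the morphs

def char_costs(morphs):
    """Estimate -ln P(c) from the characters of the morphs, END included."""
    counts = Counter()
    for morph in morphs:
        counts.update(morph + END)
    total = sum(counts.values())
    return {c: -math.log(n / total) for c, n in counts.items()}

def string_cost(morphs, costs):
    """-ln P(s_1..s_L): sum of character costs, with END closing each morph."""
    return sum(costs[c] for morph in morphs for c in morph + END)

morphs = ["ab", "a"]           # toy lexicon (illustrative only)
costs = char_costs(morphs)     # counts: a=2, b=1, #=2 out of 5 characters
total_cost = string_cost(morphs, costs)
print(total_cost)
```

Because every morph pays for one '#', longer lexicons with many short morphs are penalized through the marker rather than through an explicit length distribution.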

### Likelihood - MLE Formulation
$P(corpus|M)\,=\,\Pi_{j=1}^{W}\Pi_{k=1}^{n_j}P(\mu_{jk})$ - There are $W$ words, each word split into $n_j$ morphs

$P(\mu_i)\,=\,\frac{f_{\mu_i}}{N}\,=\,\frac{f_{\mu_i}}{\Sigma_{j=1}^{L} f_{\mu_j}}$
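With maximum-likelihood morph probabilities, the negative log-likelihood depends only on the morph counts: $-\log P(corpus|M) = N \log N - \Sigma_i f_{\mu_i} \log f_{\mu_i}$. A small sketch checking this identity on toy counts (the counts and function names are made up for illustration, and all counts are assumed positive):

```python
import math

def corpus_cost(freqs):
    """-ln P(corpus | M) = N ln N - sum_i f_i ln f_i for morph counts f_i."""
    n = sum(freqs.values())
    return n * math.log(n) - sum(f * math.log(f) for f in freqs.values())

def corpus_cost_tokenwise(freqs):
    """Same quantity, paying -ln(f_i / N) once for each of the f_i tokens."""
    n = sum(freqs.values())
    return sum(-f * math.log(f / n) for f in freqs.values())

freqs = {"walk": 3, "ed": 1, "ing": 1}  # toy morph counts (illustrative only)
closed_form = corpus_cost(freqs)
tokenwise = corpus_cost_tokenwise(freqs)
print(closed_form, tokenwise)
```

The closed form is what makes it possible to evaluate the corpus cost without re-reading the corpus.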

### Putting it all together
$argmax_M\,P(M|corpus)\,=\,argmax_M\,P(corpus|M)\,P(M)$


$argmax_M\,P(M|corpus)\,=\,argmax_M\,\Pi_{j=1}^{W}\Pi_{k=1}^{n_j}P(\mu_{jk})\,.\,L!\frac{(L-1)!(N-L)!}{(N-1)!}.\Pi_{i=1}^{L} \Pi_{j=1}^{l_{\mu_i}} P(c_{ij})$
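Taking negative logarithms turns the product above into a sum of costs, which is the form the code works with. Writing $C(\cdot) = -\log P(\cdot)$ and using the definitions above, the decomposition can be sketched as:

$argmax_M\,P(M|corpus)\,=\,argmin_M\,[\,C(corpus) + C(F) + C(S) - \log L!\,]$

$C(corpus)\,=\,N \log N - \Sigma_{i=1}^{L} f_{\mu_i} \log f_{\mu_i}$

$C(F)\,=\,\log (N-1)! - \log (L-1)! - \log (N-L)!$

$C(S)\,=\,\Sigma_{i=1}^{L} \Sigma_{j=1}^{l_{\mu_i}} -\log P(c_{ij})$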

In [None]:
import math
import sys
from collections import Counter
from pymagnitude import Magnitude

In [None]:
data_file = 'data/wikitext-2/train.txt'
#mag_file  = 'vectors/wiki_train_words/wiki_train_words.magnitude'
mag_file = '../tools/GloVe/ukwac-10L.magnitude'
#mag_file = '../tools/GloVe/tel_tok.magnitude'
#data_file = 'test_corpus.txt'
#data_file = 'data/tel/train.txt.tok.tok'


In [1415]:
vectors = Magnitude(mag_file)
VECDIM = vectors.dim
VECCOST = math.log(VECDIM)
print(VECDIM)

100


In [1416]:
class Node:
    def __init__(self, form, count=0, cxn = False):
        self.form = form
        self.count = count
        self.slotvectors = {}
        self.slots = Counter()
        self.cxn = cxn
        
        
    def __str__(self):
        if type(self.form) == str:
            form = f"{self.form}/{self.count}"    
        else:
            comps = []
            for slot_pos in range(len(self.form)):
                if slot_pos not in self.slots:
                    f = self.form[slot_pos]
                else:
                    neibs = [x[0] for x in sorted(self.slots[slot_pos].items(), key = lambda x: x[1], reverse=True)[:3]]
                    f = '['+'/'.join(neibs)+']'
                comps.append(f)
            comps = '_'.join(comps)
            form = f"{comps}/{self.count}"
        return f"({form})"
    
    
    def __repr__(self):
        return self.__str__()
    

In [1417]:
nodes = {}
charmap = {}
num_tokens = 0
import re
by_space = re.compile(r'\s+')
with open(data_file, errors='ignore') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0:
            #line = line.replace("#", '')
            line = by_space.split(line)
                       
            for tok in line:
                if tok not in nodes:
                    nodes[tok] = Node(tok)
                for c in tok:
                    if c not in charmap:
                        charmap[c] = 0
                    charmap[c] += 1
                nodes[tok].count += 1
                num_tokens += 1
                

In [1418]:
import math
def print_nodes(nodes):
    for tok, node in nodes.items():
        print(node)
        
def counts_to_logprobs(counts):
    total_count = sum(counts.values())
    logprobs = {}
    for key, value in counts.items():
        logprobs[key] = -math.log(value/total_count)
    return logprobs


def stirlings_approximation(n):
    return n * math.log(n) - n + 0.5*(math.log(n) + math.log(2*math.pi))

def log_fact(n):
    if n < 2:
        return 0
    if n < 20:
        return math.log(math.factorial(n))
    return stirlings_approximation(n)

def implicit_frequency(num_types, num_tokens):
    """
        P(F) = ((L-1)! x (N-L) !) / (N-1) !
        C(F) = -(log((L-1)!) + log((N-L)!) - log((N-1)!))
    """ 
    return log_fact(num_tokens - 1) - log_fact(num_tokens - num_types) - log_fact(num_types -1)


# Currently not adding '#' character at the end. Need to add that and see how it works out.
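def implicit_length_cost(types, char_costs):
    """
        C(S) = Sigma_i_|words| ( Sigma_j_|wi| (-log(P(c_ij))) )
        char_costs maps each character to -log(P(c)), e.g. the output of
        counts_to_logprobs, not the raw character counts.
    """
    total_cost = 0
    for t in types:
        for char in t:
            total_cost += char_costs[char]
    return total_cost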

def lexicon_cost(num_tokens, nodes, charmap):
    
    return  implicit_frequency(len(nodes), num_tokens) + implicit_length_cost(nodes, charmap) -  log_fact(len(nodes))


def corpus_cost(node_probs):
    total_cost = 0
    with open(data_file) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0:
                line = by_space.split(line)
                for tok in line:
                    total_cost += node_probs[tok]
    return total_cost

def corpus_cost_eff(nodes, num_tokens):
    """
        compute NlogN - Sigma_i_L(f(i) * log(f(i))), the corpus cost from the type counts alone
    """
    
    node_counts = {k : v.count*math.log(v.count) for k, v in nodes.items()}
    return -(sum(node_counts.values()) - num_tokens * math.log(num_tokens))


from scipy.spatial.distance import cosine
import numpy
def distance(x, y, yvector = False):
    if not yvector:
        d = 1-cosine(vectors.query(x), vectors.query(y))
    else:
        d = 1-cosine(vectors.query(x), y)
    if d <= 0:
        return numpy.inf
    return -math.log(d)

from fastdist import fastdist
def fdistance(x, y):
    if type(x) == str:
        x = vectors.query(x)
    if type(y) == str:
        y = vectors.query(y)
    d = fastdist.cosine(x, y)
    if d <= 0:
        return numpy.inf
    return -math.log(d)


def fdistance_pw(x, y):
    if type(x) == str:
        x = vectors.query(x)
    if type(y) == str:
        y = {y : 1}
        
    y_vectors = [c * vectors.query(z) for z, c in y.items()]
    try:
        mv = numpy.mean(y_vectors, axis=0)
    except Exception:
        # Show the offending inputs, then re-raise: mv is undefined past this point.
        print(x, y)
        raise
    y_vectors.append(x)
    ds = [fastdist.cosine(p, mv) for p in y_vectors]
   
    D = sum(ds)
    d = ds[-1] / D
    #print(ds, D)
#     d = 0
#     for z in y:
#         if type(z) == str:
#             z = vectors.query(z)
        
#         d += fastdist.cosine(x, z)
        
#     d = d / len(y)
    if d <= 0:
        return numpy.inf
    return -math.log(d)


def normalize(word_vec):
    norm=numpy.linalg.norm(word_vec)
    if norm == 0: 
        return word_vec
    return word_vec/norm

# def get_top_n_keys(d, n=5):
#     return {x[0] : x[1] for x in sorted(d.items(), key=lambda x : x[1], reverse=True)[:n]}

def get_top_n_keys(d, n=5):
    return {x[0] : x[1] for x in d.most_common()[:n]}

In [1419]:
num_types = len(nodes.keys())
charmap_logs = counts_to_logprobs(charmap)

In [1420]:
print(implicit_frequency(num_types, num_tokens))
corpus_cost_eff(nodes, num_tokens)

170151.9303001716


14566281.632556107

## Update Rules

### Removing a node t

1. Frequency Cost:
 * Number of types decreases by 1.    
    $L' = L - 1$      
 * Number of tokens decreases by f(t).   
    $N' = N - f(t)$        
 * $C(F)' = -(log((L'-1)!) + log((N'-L')!) - log((N'-1)!))$
  
  
2. Length Cost:
  * $C(S)' = C(S) - \Sigma_{i=1}^{|t|} -log(p(c_i))$
  
  
3. Corpus cost:
  * Number of tokens decreases by f(t).   
     $N' = N - f(t)$        
  * A decrease in the total count by f(t)
  * $C(corpus)  = NlogN - \Sigma_i^L (f(i) * log(f(i)))$
  * $C(corpus)' = C(corpus) + f(t)*log(f(t)) - NlogN + N'logN'$
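The removal rule for the corpus cost can be checked numerically against a full recomputation. The sketch below does this on toy counts; it assumes the node is removed entirely and that at least one token remains afterwards. The function names are illustrative, not the notebook's.

```python
import math

def corpus_cost(freqs):
    """C(corpus) = N ln N - sum_i f_i ln f_i, recomputed from scratch."""
    n = sum(freqs.values())
    return n * math.log(n) - sum(f * math.log(f) for f in freqs.values())

def cost_after_removal(cost, freqs, t):
    """Apply C' = C + f(t) ln f(t) - N ln N + N' ln N' with N' = N - f(t)."""
    n = sum(freqs.values())
    f_t = freqs[t]
    n_new = n - f_t
    return cost + f_t * math.log(f_t) - n * math.log(n) + n_new * math.log(n_new)

freqs = {"a": 4, "b": 2, "c": 1}  # toy counts (illustrative only)
incremental = cost_after_removal(corpus_cost(freqs), freqs, "b")
recomputed = corpus_cost({k: v for k, v in freqs.items() if k != "b"})
print(incremental, recomputed)
```

The incremental form touches only the removed count and the two totals, which is why each update is constant time regardless of lexicon size.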

### Adding a node t with parent u

1. Frequency Cost:
 * If t is new, Number of types increases by 1.    
    $L' = L + 1$      
 * Number of tokens increases by f(u).   
    $N' = N + f(u)$        
 * $C(F)' = -(log((L'-1)!) + log((N'-L')!) - log((N'-1)!))$
  
  
2. Length Cost (if t is new):
  * $C(S)' = C(S) + \Sigma_{i=1}^{|t|} -log(p(c_i))$
  
  
3. Corpus cost:
  * Number of tokens increases by f(u).   
     $N' = N + f(u)$        
  * An increase in the total count by f(u)
  * $C(corpus)  = NlogN - \Sigma_i^L (f(i) * log(f(i)))$
  * $C(corpus)' = C(corpus) - f(u)*log(f(u)) - NlogN + N'logN'$
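The addition rule as written covers a node t that is new to the lexicon. When t already exists with count f(t), its old term f(t) log f(t) has to be swapped for (f(t) + f(u)) log(f(t) + f(u)), which is what `add_node` does through `old_count`. A minimal sketch covering both cases on toy counts, with hypothetical helper names and the convention 0 log 0 = 0:

```python
import math

def xlogx(x):
    """x ln x with the convention 0 ln 0 = 0."""
    return x * math.log(x) if x > 0 else 0.0

def corpus_cost(freqs):
    """C(corpus) = N ln N - sum_i f_i ln f_i, recomputed from scratch."""
    n = sum(freqs.values())
    return xlogx(n) - sum(xlogx(f) for f in freqs.values())

def cost_after_addition(cost, freqs, t, f_u):
    """Add f_u tokens of t; t may be new (old count 0) or already present."""
    n = sum(freqs.values())
    old = freqs.get(t, 0)
    return cost + xlogx(old) - xlogx(old + f_u) - xlogx(n) + xlogx(n + f_u)

freqs = {"a": 4, "b": 2}  # toy counts (illustrative only)
base = corpus_cost(freqs)
new_type = cost_after_addition(base, freqs, "c", 3)       # t is new
existing_type = cost_after_addition(base, freqs, "b", 3)  # t already present
print(new_type, corpus_cost({"a": 4, "b": 2, "c": 3}))
print(existing_type, corpus_cost({"a": 4, "b": 5}))
```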
  
   
   
   


In [1427]:
class MorfessorModel:
    def __init__(self, charmap_logs):
        self.frequency_cost = 0
        self.length_cost = 0
        self.corpus_cost = 0
        self.total_cost = 0
        
        self.num_types = 0
        self.num_tokens = 0
        
        self.charmap_logs = charmap_logs
        
        self.nodes = {}
        self.cxns = {}
        
   
        
    def add_node(self, token, count, parent=None, debug=False):
        num_types_new = self.num_types
        old_count = 0
        node_count = count
        if debug:
            print("A Cost Before", self.corpus_cost, self.frequency_cost, self.length_cost)

        
        if token not in self.nodes:
            self.num_types = self.num_types + 1
            self.length_cost = self.length_cost + sum([self.charmap_logs[char] for char in token if char not in ['_','[',']']])
            if '_' in token:
                split_tok = token.split('_')
                self.nodes[token] = Node(split_tok, count, cxn = True)    
                head, tail = '_'.join(split_tok[:-1])+'_', '_' + split_tok[-1]
                if head not in self.cxns:
                    self.cxns[head] = {}
                self.cxns[head][tail] = self.nodes[token]
                if tail not in self.cxns:
                    self.cxns[tail] = {}
                self.cxns[tail][head] = self.nodes[token]
                
            else:
                self.nodes[token] = Node(token, count)
        
        else:
            if self.nodes[token].count == 0:
                self.num_types += 1

            else:
                count += self.nodes[token].count # Update the count            
                old_count = self.nodes[token].count * math.log(self.nodes[token].count)


            self.nodes[token].count = count

               
        num_tokens_new = self.num_tokens + node_count
        
        
        self.frequency_cost = implicit_frequency(self.num_types, num_tokens_new)    
        
        if debug:
            print("A components ", old_count, - count * math.log(count), "delta : ", old_count - count * math.log(count), - ((self.num_tokens * math.log(self.num_tokens)) if self.num_tokens > 0 else 0), (num_tokens_new * math.log(num_tokens_new))  )

        self.corpus_cost = self.corpus_cost + old_count \
                                            - count * math.log(count) \
                                            - ((self.num_tokens * math.log(self.num_tokens)) if self.num_tokens > 0 else 0) \
                                            + (num_tokens_new * math.log(num_tokens_new))
        
        if debug:
            print("A Cost After", self.corpus_cost, self.frequency_cost, self.length_cost)

        self.num_tokens = num_tokens_new
        self.total_cost = self.corpus_cost + self.frequency_cost + self.length_cost
        
        
    def remove_node(self, token, decrease_by, debug = False):
        node = self.nodes[token]
        count = node.count - decrease_by
        
        if debug:
            print("R Cost Before", self.corpus_cost, self.frequency_cost, self.length_cost)

        
        if count == 0:
            self.num_types = self.num_types - 1
            self.length_cost = self.length_cost - sum([self.charmap_logs[char] for char in token if char not in ['_','[',']']])
            
        num_tokens_new = self.num_tokens - decrease_by
    
        self.frequency_cost = implicit_frequency(self.num_types, num_tokens_new)               
        
        self.corpus_cost = self.corpus_cost + node.count * math.log(node.count) \
                                            - (count * math.log(count) if count > 0 else 0) \
                                            - (self.num_tokens * math.log(self.num_tokens)) \
                                            + (num_tokens_new * math.log(num_tokens_new))
        
        if debug:
            print("R Cost After", self.corpus_cost, self.frequency_cost, self.length_cost)

        
        self.total_cost = self.corpus_cost + self.frequency_cost + self.length_cost
        self.num_tokens = num_tokens_new
        
        self.nodes[token].count -= decrease_by
        
        if self.nodes[token].count <= 0:
            del self.nodes[token]
            if '_' in token:
                split_tok = token.split('_')
                head, tail = '_'.join(split_tok[:-1])+'_', '_' + split_tok[-1]
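                # Drop both index entries; check emptiness only where the key exists,
                # otherwise the lookup itself raises KeyError.
                if head in self.cxns:
                    del self.cxns[head][tail]
                    if len(self.cxns[head]) == 0:
                        del self.cxns[head]

                if tail in self.cxns:
                    del self.cxns[tail][head]
                    if len(self.cxns[tail]) == 0:
                        del self.cxns[tail]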

   

In [1452]:
m = MorfessorModel(charmap_logs)
import random
tokens = [(k,v.count) for k, v in nodes.items()]
#random.shuffle(tokens)

In [1453]:
import tqdm
m.originals = {}
for token, count in tqdm.tqdm(tokens):
    m.add_node(token, count)
    m.originals[token] = count

100%|██████████| 33277/33277 [00:00<00:00, 51868.07it/s] 


In [1454]:
m.num_tokens, m.total_cost, m.num_types


(2051910, 15516165.546927406, 33277)

In [1455]:
import copy
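def should_join(prev, tok, debug=False):
    """Tentatively merge prev and tok into prev_tok in the global model m.

    Returns (None, None) if tok is not in the lexicon (the model is restored),
    otherwise (joined, cand_count), where joined is True if the merge lowered
    the total cost. The merged node stays in the model; the caller reverts it.
    """
    candidate = prev + '_' + tok
    old_cost = m.total_cost
    # Move one token if the candidate is already a node or was never seen,
    # otherwise all earlier sightings plus the current one.
    if candidate not in m.nodes and candidate in m.originals:
        cand_count = m.originals[candidate] + 1
    else:
        cand_count = 1

    m.remove_node(prev, cand_count, debug=debug)

    if tok not in m.nodes:
        # Nothing to join with: put prev back and report no change.
        m.add_node(prev, cand_count, debug=debug)
        return None, None

    m.remove_node(tok, cand_count, debug=debug)
    m.add_node(candidate, cand_count, debug=debug)
    # The join is worthwhile only if the total cost went down.
    return m.total_cost < old_cost, cand_count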




def run_once(data_file, debug):
    new_candidate = False
    with open(data_file, errors='ignore') as f:
        for i, line in tqdm.tqdm(enumerate(f)):
            line = line.strip()
            if len(line) > 0:
                #line = line.replace("#", '')
                line = by_space.split(line)
                #line = "#".join(line)
                prev = line[0]
                
                for tok in line[1:]:
                    candidate = prev+'_'+tok
                    new_candidate = False
                    
                    if prev not in m.nodes:
                        prev = candidate
                    elif tok in m.nodes:
                        
                        best_prev = tok, m.total_cost, 'I'
                        best_match = [None, None]
                        
                        # Word_Word compound.
                        sj, cand_count = should_join(prev, tok, debug=debug)
                        
                        
                        # Revert prev and tok only if the join was attempted and rejected.
                        revert_prev_tok = sj is False
                        if sj == True:
                            #print()
                            best_prev = candidate, m.total_cost, 'T_T'
                        
                        
                        # First remove the candidate to search for matches
                        # This does not depend on whether the candidate is to be joined or not.
                        # After this point, we would have removed prev, tok and candidate. Ideal for finding the match.
                        if sj != None:
                            m.remove_node(candidate, cand_count, debug=debug)
                        
                        # Let's search all the cxns that start with prev.
                        # The construction can be filled only if the slot we are dealing with is terminal.
                        # That is why there can be no continuation ("_") after the current token.
                        
                        head, tail = prev + '_', '_' + tok
                        
                        # Check for X_C(Y) first.
                        if head in m.cxns:
                            matched = m.cxns[head]
                            #print(matched)
                            
                            most_similar = (None, numpy.inf)
                            
                            for ctail, mat in matched.items():
                                #print(ctail, mat)
                                # Check if match already has a slot 
                                slot_pos = (len(mat.form) - 1)
                                if slot_pos in mat.slots:
                                    #likely_similar = list(mat.slots[slot_pos].keys())
                                    likely_similar = get_top_n_keys(mat.slots[slot_pos])
                                else:
                                    likely_similar = mat.form[-1]
                                    if likely_similar == tok: # No point in merging the same things
                                        continue 
                                        
                                similarity = fdistance_pw(tok, likely_similar)
                                if similarity < most_similar[1]:
                                    most_similar = (mat, similarity)
                                    
                            if most_similar[0]:
                                #print(prev, tok, most_similar, best_prev, sj, m.total_cost)
                                mat = most_similar[0]
                                old_mat = copy.deepcopy(mat)
                                old_cand = None
                                slot_pos = (len(mat.form) - 1)
                                merge_cost = 0
                                slot_cost = 0
                                merged = [False, False]
                                # If we are creating a new construction
                                if slot_pos not in mat.slots:
                                    cxn_id = len(matched) + 1 
                                    cxn_candidate = prev + '_' + f'[Q{cxn_id}]'
                                    cxn_candidate_count = mat.count + 1       
                                    #tgt_vector = vectors.query(mat.form[-1])         
                                    # As we are merging the two constructions, let us remove the old
                                    if '_'.join(mat.form) in m.nodes:
                                        m.remove_node('_'.join(mat.form), mat.count, debug=debug)
                                        merged[0] = True
                                    if candidate in m.nodes:
                                        old_cand = copy.deepcopy(m.nodes[candidate])
                                        m.remove_node(candidate, m.nodes[candidate].count, debug=debug)
                                        merged[1] = True
                                    slot_cost = math.log(sys.getsizeof(str(tok+mat.form[-1])))
                                    
                                else:
                                    # Construction already exists. Need to update the count and avg vector
                                    cxn_candidate = '_'.join(mat.form)
                                    cxn_candidate_count = 1  
                                    cxn_node = m.nodes[cxn_candidate]
                                    if tok not in mat.slots[slot_pos]:
                                        slot_cost += math.log(sys.getsizeof(tok))
                                    #tgt_vector = cxn_node.slotvectors[slot_pos]
                        
                                m.add_node(cxn_candidate, cxn_candidate_count, debug=debug)
                                
                                cxn_cost = m.total_cost + most_similar[1] + slot_cost
                                if cxn_cost >= best_prev[1]:
                                    #print("Not making a construction now.")
                                    
                                    m.remove_node(cxn_candidate, cxn_candidate_count, debug=debug)
                                    if merged[0]:
                                        m.add_node('_'.join(old_mat.form), old_mat.count, debug=debug)                                        
                                    if merged[1]:
                                        m.add_node('_'.join(old_cand.form), old_cand.count, debug=debug)                                        
                                        
                                    
                                else:
                                    #print("Updating the CxN", prev, tok, cxn_candidate, mat, mat.form[-1], cxn_cost, sj)
                                    best_match = mat, cxn_candidate_count
                                    best_prev = cxn_candidate, cxn_cost, 'T_C'                                    
                                    revert_prev_tok = False
                            
                        if best_prev[-1] == 'T_C':
                            m.remove_node(cxn_candidate, cxn_candidate_count, debug=debug)
                            
                        # Lets check for C(X)_Y        
                        if tail in m.cxns and prev.count('_') == 0:
                            matched = m.cxns[tail]
                            #print(matched)
                            
                            most_similar = (None, numpy.inf)
                            
                            prev_vector = numpy.zeros(VECDIM)
                            if prev in m.nodes:
                                prev_node = m.nodes[prev]
                            else:
                                prev_node = Node(prev, count=1)
                                    
                            for slot_pos in range(len(prev_node.form)):
                                if slot_pos in prev_node.slots:
                                    prev_vector += numpy.mean(list(prev_node.slots[slot_pos].keys()), axis=0)
                                else:
                                    prev_vector += vectors.query(prev_node.form[slot_pos])
                            
                            prev_vector = normalize(prev_vector)
                            slot_pos = 0
                            for chead, mat in matched.items():
                                #print(ctail, mat)
                                # Check if match already has a slot 
                                
                                if chead != head:
                                    likely_similar = numpy.zeros(VECDIM)
                                    if len(mat.form) == 2:
#                                         for slot_pos in range(len(mat.form) - 1):
#                                             if slot_pos in mat.slots:
#                                                 likely_similar += mat.slotvectors[slot_pos]                                    
#                                             else:
#                                                 likely_similar += vectors.query(mat.form[slot_pos])
                                        if slot_pos in mat.slots:
                                            #likely_similar = list(mat.slots[slot_pos].keys())
                                            likely_similar = get_top_n_keys(mat.slots[slot_pos])
                                        else:
                                            likely_similar = mat.form[slot_pos]


                                        similarity = fdistance_pw(prev_vector, likely_similar)
                                        if similarity < most_similar[1]:
                                            most_similar = (mat, similarity)

                            if most_similar[0]:
                                #print(prev_node, tok, most_similar, best_prev, sj, m.total_cost)
                                mat = most_similar[0]
                                old_mat = copy.deepcopy(mat)
                                old_cand = None
                                merged = [False, False]
                                slot_pos = 0                                
                                slot_cost = 0
                                if (len(mat.form) == 2) and (slot_pos in mat.slots):
                                    # Something like [Q]_Y exists
                                    cxn_candidate = '_'.join(mat.form)
                                    cxn_candidate_count = 1  
                                    cxn_node = m.nodes[cxn_candidate]
                                    
                                    if prev not in mat.slots[slot_pos]:
                                        slot_cost += math.log(sys.getsizeof(prev))
                                    #tgt_vector = cxn_node.slotvectors[slot_pos]
                                    
                                else:
                                    cxn_id = len(matched) + 1 
                                    cxn_candidate = f'[Q{cxn_id}]' + "_" + tok
                                    cxn_candidate_count = mat.count + 1      
                                    slot_cost = math.log(sys.getsizeof(str(prev+mat.form[0])))
                                    

                                    # As we are merging the two constructions, let us remove the old
                                    if '_'.join(mat.form) in m.nodes:
                                        m.remove_node('_'.join(mat.form), mat.count, debug=debug)
                                        merged[0] = True
                                    
                                                                            
                                                                                                                                       
                                if candidate in m.nodes:
                                    old_cand = copy.deepcopy(m.nodes[candidate])
                                    m.remove_node(candidate, m.nodes[candidate].count, debug=debug)
                                    merged[1] = True
                                    
                                m.add_node(cxn_candidate, cxn_candidate_count, debug=debug)
                                cxn_cost = m.total_cost + most_similar[1] + slot_cost
                                if cxn_cost >= best_prev[1]:
                                    #print("Not making a construction now.")
                                    
                                    m.remove_node(cxn_candidate, cxn_candidate_count, debug=debug)
                                    if merged[0]:
                                        m.add_node('_'.join(old_mat.form), old_mat.count, debug=debug)                                        
                                    if merged[1]:
                                        m.add_node('_'.join(old_cand.form), old_cand.count, debug=debug)       
                                    
                                else:
                                    #print("Updating the CxN", prev, tok, cxn_candidate, mat, mat.form[-1], cxn_cost, sj)
                                    cxn_node = m.nodes[cxn_candidate]
                                    best_match = mat, cxn_candidate_count
                                
                                    if tok in m.nodes:
                                        for spos, sloti in m.nodes[tok].slots.items():
                                            if spos not in cxn_node.slots:
                                                cxn_node.slots[spos] = Counter()
                                            cxn_node.slots[spos].update(sloti)
                                             
                                    # Update the avg vector
                                    #print(f"I am merging {mat} with {candidate} best_prev : {best_prev} similarity : {most_similar[1]}")
                                    if slot_pos not in cxn_node.slots:
                                        cxn_node.slots[slot_pos] = Counter({mat.form[0]:1})
                                    if prev not in cxn_node.slots[slot_pos]:
                                        cxn_node.slots[slot_pos][prev] = 0
                                    cxn_node.slots[slot_pos][prev] += 1
                                        
                                    #cxn_node.slots[slot_pos] = normalize(tgt_vector +  prev_vector)
                                    #xn_node.slotvectors[slot_pos] = numpy.mean([tgt_vector,prev_vector], axis=0)
                                   # print(f"Results {cxn_node}")
                                    best_prev = cxn_candidate, cxn_cost, 'C_T'
                                    revert_prev_tok = False
                                    
                        # Because we removed the candidate, add it back.
                        if best_prev[-1] == 'T_C':
                            
                            cxn_candidate =  best_prev[0]
                            cxn_candidate_count = best_match[1]
                            #print(best_prev)
                            m.add_node(cxn_candidate, cxn_candidate_count, debug=debug)
                            
                            cxn_node = m.nodes[cxn_candidate]
                            
                            if prev in m.nodes:
                                for spos, sloti in m.nodes[prev].slots.items():
                                    if spos not in cxn_node.slots:
                                        cxn_node.slots[spos] = Counter()
                                    cxn_node.slots[spos].update(sloti)
                                             
                            
                            
                            slot_pos = len(best_match[0].form)  - 1
                            if slot_pos not in cxn_node.slots:
                                cxn_node.slots[slot_pos] = Counter({best_match[0].form[-1]:1})
                                            
                            if tok not in cxn_node.slots[slot_pos]:
                                cxn_node.slots[slot_pos][tok] = 0
                                        
                            cxn_node.slots[slot_pos][tok] += 1
                                    
                            
                            
                        elif best_prev[2] == 'T_T':
                            m.add_node(candidate, cand_count, debug = debug)
                            
                        if revert_prev_tok == True:
                              # Revert back !
                            #m.remove_node(candidate, cand_count, debug=debug)
                            
                            m.add_node(prev, cand_count, debug=debug)
                            m.add_node(tok, cand_count, debug=debug)
                            
                        prev = best_prev[0]
#                             prev = tok
                                                                         

                        
                        
                    if candidate not in m.originals:
                        m.originals[candidate] = 0
                    m.originals[candidate] += 1
#             if i > 50:
#                  break
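
The slot bookkeeping in the `T_C` branch above (copy the fillers already observed for `prev`, then count `tok` at the construction's last position) can be looked at in isolation. The sketch below is a simplified stand-in, assuming each node keeps a `slots` dict that maps a position to a `Counter` of fillers. `merge_slots` and `record_filler` are hypothetical helper names that do not exist in this notebook, and the data is made up.

```python
from collections import Counter

def merge_slots(target, source):
    """Add every filler count from `source` slots into `target` slots, in place."""
    for pos, fillers in source.items():
        # setdefault creates an empty Counter the first time a position is seen
        target.setdefault(pos, Counter()).update(fillers)
    return target

def record_filler(slots, pos, tok):
    """Count one more occurrence of `tok` at slot position `pos`."""
    # A Counter returns 0 for missing keys, so no explicit initialisation is needed
    slots.setdefault(pos, Counter())[tok] += 1
    return slots

# Hypothetical data: fillers already seen for a construction and for `prev`
cxn_slots = {0: Counter({'it': 2})}
prev_slots = {0: Counter({'it': 1, 'she': 1})}

merge_slots(cxn_slots, prev_slots)
record_filler(cxn_slots, 2, 'later')
print(cxn_slots)
```

Unlike the cell above, this sketch does not seed a new slot with the construction's own last form element; it only shows the merge and increment steps.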


In [1456]:
run_once(data_file, debug=False)
# # Reset Originals
# m.originals = {}
# for k, v in m.nodes.items():
#     m.originals[k] = v.count


# %time print(distance('this', 'the'))
# %time print(fdistance('this', 'the'))

#%prun normalize(a + b)
#%time normalize(a + b)
#%time numpy.mean([a, b], axis=0)

36718it [02:05, 292.78it/s]


In [1457]:
m.num_tokens, m.total_cost, m.num_types

(720655, 7687279.370708236, 34503)

In [1488]:
sorted(
    ((k, v) for k, v in m.nodes.items() if k.count('_') > 2 and v.count > 0),
    key=lambda kv: kv[1].count,
    reverse=True,
)

[('(_[Q2]_[Q2]_[Q2]', ((_[died/now/or]_[–///mi]_[)/or/long]/233)),
 ('[Q2]_to_[Q2]_[Q2]', ([back/began/Ode]_to_[have/do/an]_[to/on/their]/191)),
 ('[Q2]_(_[Q2]_[Q2]_[Q2]', ([-/miles/feet]_(_[)/5/9]_[//)/;]_[;/long/at]/129)),
 ('[Q2]_that_[Q2]_[Q2]',
  ([stated/noted/said]_that_[it/he/was]_[was/had/to]/127)),
 ('[Q5]_(_[Q2]_[Q2]', ([acres/6/long]_(_[10/100/PHP]_[//ISBN/:]/76)),
 ('[Q3]_@-@_[Q2]_[Q2]',
  ([All/all/three]_@-@_[inch/day/gun]_[line/@-@/fire]/75)),
 ('[Q2]_were_[Q2]_[Q2]',
  ([they/They/guns]_were_[made/also/created]_[over/:/by]/67)),
 ('[Q2]_a_[Q2]_[Q2]', ([in/,/as]_a_[gold/result/god]_[of/@-@/present]/59)),
 ('[Q2]_"_[Q2]_[Q2]', ([./was/that]_"_[A./"/was]_["/also/that]/46)),
 ('[Q2]_was_[Q2]_[Q2]',
  ([it/that/she]_was_[to/later/released]_[to/by/was]/41)),
 ('[Q2]_@.@_[Q2]_[Q2]', ([1/2/5]_@.@_[2/0/5]_[%/in/Rowson]/33)),
 ('[Q2]_)_–_@-@', ([Q2]_)_–_@-@/24)),
 ("[Q2]_'_[Q2]_'", ([Q2]_'_[Q2]_'/17)),
 ('[Q2]_Light_Horse_Brigade', ([Q2]_Light_Horse_Brigade/16)),
 ('[Q2]_in_[Q2]

In [1459]:
{k : v for k, v in m.nodes.items() if k.startswith('the_')}

{'the_[Q2]_[Q2]': (the_[game/Little/series]_[of/./,]/89)}

In [1491]:
m.nodes['[Q2]_was_[Q2]_[Q2]'].slots[2]

Counter({'in': 20,
         'also': 59,
         'quickly': 20,
         'permanently': 20,
         'trapped': 20,
         'unscathed': 20,
         'found': 40,
         'popular': 20,
         'later': 198,
         'erected': 20,
         'opened': 20,
         'proposed': 20,
         'compiled': 20,
         'continued': 20,
         'fully': 20,
         'inspired': 20,
         'released': 100,
         'used': 100,
         'cast': 40,
         '@-@': 40,
         'sold': 20,
         'introduced': 40,
         'against': 20,
         'largely': 40,
         'rebuilt': 20,
         'struggling': 20,
         'bought': 20,
         'appointed': 40,
         'replaced': 60,
         'followed': 60,
         'to': 270,
         'revealed': 40,
         'awarded': 100,
         'completed': 20,
         'inducted': 20,
         'Villa': 20,
         'broken': 20,
         'facing': 20,
         'mother': 20,
         'unknown': 20,
         'held': 20,
         'submitted': 20,
 

In [1282]:
y = get_top_n_keys(m.nodes['the_[Q2]'].slots[1], n=6)
x = 'York'
fdistance_pw(x, y)

2.436686605843415

In [1223]:
D = [0.7505350605337768, 0.5774341990944761, 0.563967412483534, 0.5443518380846117, 0.4899049485976567]
0.28/sum(D)

0.0956874533221716

In [956]:
m.nodes['[Q2]_Rock'].slots

{0: ['Little',
  'at',
  'at',
  'at',
  'at',
  'at',
  'after',
  'renamed',
  'North',
  'Big',
  'North',
  'became',
  ',']}

In [888]:
import math
import sys
sz = sys.getsizeof(str())
print(sz)
math.log(sz)

49


3.8918202981106265

In [1486]:
import math

import numpy
import tqdm

def write_to_file(data_file):
    to_file = f"{data_file}.wtok"
    with open(data_file, errors='ignore') as f, open(to_file, 'w') as tf:
        for i, line in tqdm.tqdm(enumerate(f)):
            line = line.strip()
            if len(line) > 0:
                #line = line.replace("#", '')
                line = by_space.split(line)
                #line = "#".join(line)
                prev = line[0]
                new_line = [prev]
                
                for tok in line[1:]:

                    candidate = prev+'_'+tok
                    prev_cost, tok_cost = numpy.inf, numpy.inf
                    if prev in m.nodes:
                        prev_cost = -math.log(m.nodes[prev].count/m.num_tokens)
                    if tok in m.nodes:
                        tok_cost = -math.log(m.nodes[tok].count/m.num_tokens)

                    individual_cost = prev_cost + tok_cost
                    mode = 'I'
                    best_cost = individual_cost
                    head, tail = prev + '_', '_' + tok
                    
                    
                    if head in m.cxns:
                        matched = m.cxns[head]
                        most_similar = (None, numpy.inf)
                            
                        for ctail, mat in matched.items():
                            slot_pos = (len(mat.form) - 1)
                            if slot_pos in mat.slots:
                                likely_similar = get_top_n_keys(mat.slots[slot_pos])
                            else:
                                likely_similar = mat.form[-1]
                                        
                            similarity = fdistance_pw(tok, likely_similar)
                            if similarity < most_similar[1]:
                                most_similar = (mat, similarity)
                           
                        if most_similar[0]:
                            mat = most_similar[0]
                            slot_pos = (len(mat.form) - 1)
                            slot_cost = 0
                            if slot_pos in mat.slots:
                                # The construction already exists: price it by its relative
                                # frequency plus the distance of tok to its typical fillers.
                                cxn_candidate = '_'.join(mat.form)
                                cxn_node = m.nodes[cxn_candidate]
                                cxn_candidate_cost = -math.log(cxn_node.count / m.num_tokens)

                                cxn_cost = cxn_candidate_cost + most_similar[1] + slot_cost
                                if cxn_cost < best_cost:
                                    prev = cxn_candidate
                                    new_line[-1] = prev
                                    mode = 'T_C'
                                    best_cost = cxn_cost
                    
                    if tail in m.cxns:
                        matched = m.cxns[tail]
                        most_similar = (None, numpy.inf)
                            
                        for ctail, mat in matched.items():
                            slot_pos = 0
                            if slot_pos in mat.slots:
                                likely_similar = get_top_n_keys(mat.slots[slot_pos])
                            else:
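                                # The slot of a tail construction sits at position 0,
                                # so fall back to that form element (not the last one).
                                likely_similar = mat.form[slot_pos]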
                                        
                            similarity = fdistance_pw(tok, likely_similar)
                            if similarity < most_similar[1]:
                                most_similar = (mat, similarity)
                           
                        if most_similar[0]:
                            mat = most_similar[0]
                            slot_pos = 0
                            if slot_pos in mat.slots:
                                # The construction already exists: price it by its relative
                                # frequency plus the distance to its typical fillers.
                                cxn_candidate = '_'.join(mat.form)
                                cxn_node = m.nodes[cxn_candidate]
                                cxn_candidate_cost = -math.log(cxn_node.count / m.num_tokens)

                                cxn_cost = cxn_candidate_cost + most_similar[1]
                                if cxn_cost < best_cost:
                                    prev = cxn_candidate
                                    new_line[-1] = prev
                                    mode = 'C_T'
                                    best_cost = cxn_cost
                                
                    if candidate in m.nodes:
                        candidate_cost = -math.log(m.nodes[candidate].count/m.num_tokens)
                        if candidate_cost < best_cost:
                            prev = candidate
                            new_line[-1] = prev
                            mode = 'T_T'
                            best_cost = candidate_cost
                    if mode == 'I':
                        # Nothing was merged: emit tok as its own unit. Checking the mode
                        # here keeps a T_C/C_T match from also appending tok.
                        prev = tok
                        new_line.append(tok)
                    
                #tf.write(' '.join(new_line) + '\n')
                print(line)
                print(' '.join(new_line), end=' ')
                print("\n"+"-"*20)

            if i > 10:
                break
                   

In [1487]:
write_to_file(data_file)

6it [00:00, 57.52it/s]

['=', 'Valkyria', 'Chronicles', 'III', '=']
= Valkyria_[Q2] [Q2]_= 
--------------------
['Senjō', 'no', 'Valkyria', '3', ':', '<unk>', 'Chronicles', '(', 'Japanese', ':', '戦場のヴァルキュリア3', ',', 'lit', '.', 'Valkyria', 'of', 'the', 'Battlefield', '3', ')', ',', 'commonly', 'referred', 'to', 'as', 'Valkyria', 'Chronicles', 'III', 'outside', 'Japan', ',', 'is', 'a', 'tactical', 'role', '@-@', 'playing', 'video', 'game', 'developed', 'by', 'Sega', 'and', 'Media.Vision', 'for', 'the', 'PlayStation', 'Portable', '.', 'Released', 'in', 'January', '2011', 'in', 'Japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the', 'Valkyria', 'series', '.', '<unk>', 'the', 'same', 'fusion', 'of', 'tactical', 'and', 'real', '@-@', 'time', 'gameplay', 'as', 'its', 'predecessors', ',', 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and', 'follows', 'the', '"', 'Nameless', '"', ',', 'a', 'penal', 'military', 'unit', 'serving', 'the', 'nation', 'of', 'Gallia', 'during', 'the', 'Second', '

11it [00:00, 39.00it/s]

['The', 'game', "'s", 'battle', 'system', ',', 'the', '<unk>', 'system', ',', 'is', 'carried', 'over', 'directly', 'from', '<unk>', 'Chronicles', '.', 'During', 'missions', ',', 'players', 'select', 'each', 'unit', 'using', 'a', 'top', '@-@', 'down', 'perspective', 'of', 'the', 'battlefield', 'map', ':', 'once', 'a', 'character', 'is', 'selected', ',', 'the', 'player', 'moves', 'the', 'character', 'around', 'the', 'battlefield', 'in', 'third', '@-@', 'person', '.', 'A', 'character', 'can', 'only', 'act', 'once', 'per', '@-@', 'turn', ',', 'but', 'characters', 'can', 'be', 'granted', 'multiple', 'turns', 'at', 'the', 'expense', 'of', 'other', 'characters', "'", 'turns', '.', 'Each', 'character', 'has', 'a', 'field', 'and', 'distance', 'of', 'movement', 'limited', 'by', 'their', 'Action', '<unk>', '.', 'Up', 'to', 'nine', 'characters', 'can', 'be', 'assigned', 'to', 'a', 'single', 'mission', '.', 'During', 'gameplay', ',', 'characters', 'will', 'call', 'out', 'if', 'something', 'happens'
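
The decision made for each token pair in `write_to_file` comes down to comparing negative log relative frequencies: the summed cost of the two separate tokens against the cost of the merged node. The following is a minimal sketch of just that comparison, with made-up counts. `token_cost` is a hypothetical helper, and the real cell additionally adds a vector distance term for construction matches.

```python
import math

def token_cost(count, num_tokens):
    """Negative log relative frequency; unseen items get infinite cost."""
    if count <= 0:
        return math.inf
    return -math.log(count / num_tokens)

# Hypothetical counts, chosen only for illustration
num_tokens = 1000
counts = {'New': 20, 'York': 10, 'New_York': 9}

individual_cost = (token_cost(counts['New'], num_tokens)
                   + token_cost(counts['York'], num_tokens))
merged_cost = token_cost(counts['New_York'], num_tokens)

# 'T_T' = merge the two tokens, 'I' = keep them as individual units
mode = 'T_T' if merged_cost < individual_cost else 'I'
print(mode, round(individual_cost, 3), round(merged_cost, 3))
```

With these counts the merged node is cheaper, because a pair that almost always co-occurs pays for one code word instead of two.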




In [771]:
math.log(VECDIM)

4.605170185988092

In [1379]:
from collections import Counter

In [1380]:
x = Counter()

In [1385]:
x['b'] = 2
x['a'] = 1
x['c'] = 1

In [1386]:
x

Counter({'a': 1, 'b': 2, 'c': 1})

In [1394]:
x.most_common()

[('b', 2), ('a', 1), ('c', 1)]
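
`Counter.most_common` is the natural building block for picking the typical fillers of a slot. The sketch below shows one way a helper in the spirit of `get_top_n_keys` could be written on top of it. The name `top_n_keys` and its default `n` are assumptions for illustration; the notebook's own helper is defined elsewhere and may differ.

```python
from collections import Counter

def top_n_keys(counter, n=3):
    """Return the n most frequent keys, most frequent first."""
    return [key for key, _ in counter.most_common(n)]

x = Counter({'b': 2, 'a': 1, 'c': 1})
print(top_n_keys(x, n=2))
```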