# Positional Character-based Shingling

This notebook details the algorithms discussed in Section 2 of the paper "Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection".

## K-gram shingling

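A $k$-gram shingle set consists of every substring of $k$ consecutive characters in a word. The implementation here also keeps the shorter prefixes and suffixes at the two ends of the word, so the first and last characters appear as shingles of their own.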

To illustrate, the shingle set of the word *rosmarin* with $k = 2$ is $S = \left\lbrace \textit{r, ro, os, sm, ma, ar, ri, in, n} \right\rbrace$.

In [None]:
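def shingle(text, k):
    '''
    Returns the shingles of the string as a list, in left-to-right order.
    Takes the input string and k (length of shingle) as parameters.
    '''
    k = min(len(text), k)  # k cannot exceed the length of the string
    # Partial shingles of length 1 .. k-1 anchored at the start
    start_combinations = [text[:i] for i in range(1, k)]
    # Full k-grams
    kgrams = [text[i:i + k] for i in range(len(text) - k + 1)]
    # Partial shingles of length k-1 .. 1 anchored at the end
    end_combinations = [text[-i:] for i in range(k - 1, 0, -1)]
    return start_combinations + kgrams + end_combinations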

In [None]:
shingle("rosmarin", 2)
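
The padding behaviour is easier to see with a larger $k$. The sketch below restates the function in compact form so that it runs on its own. The name `shingle_sketch` is illustrative and does not come from the paper. It also checks that a string of length $n$ yields $n + k - 1$ shingles when $k \le n$.

```python
def shingle_sketch(text, k):
    """Compact restatement of shingle(); the name is illustrative only."""
    k = min(len(text), k)
    prefixes = [text[:i] for i in range(1, k)]             # partial shingles at the start
    kgrams = [text[i:i + k] for i in range(len(text) - k + 1)]
    suffixes = [text[-i:] for i in range(k - 1, 0, -1)]    # partial shingles at the end
    return prefixes + kgrams + suffixes


print(shingle_sketch("rosmarin", 2))
print(shingle_sketch("rosmarin", 3))

# For k <= n, there are (k - 1) prefixes, (n - k + 1) k-grams and (k - 1) suffixes.
n, k = len("rosmarin"), 3
assert len(shingle_sketch("rosmarin", k)) == n + k - 1
```

With $k = 3$, the partial shingles *r*, *ro* at the start and *in*, *n* at the end surround the six full trigrams.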

## Positional Shingling from 2 Ends

Each shingle has two candidate position numbers: one counted from the start of the word and one counted from the end.
We keep the smaller of the two, writing it to the left of the shingle if it was counted from the start and to the right if it was counted from the end.
If the two position numbers are equal, we keep the left (start) number by convention.

<img src="fig1.png" alt="algorithm" width="400px"/>

The process of positional tokenisation from both ends. On the left, the algorithm segments the Romanian word *romarin* into the split-set $\left\lbrace \textit{1r, 2ro, 3om, 4ma, ar4, ri3, in2, n1} \right\rbrace$. On the right, the algorithm segments *rosmarin* into $\left\lbrace \textit{1r, 2ro, 3os, 4sm, 5ma, ar4, ri3, in2, n1} \right\rbrace$.

In [None]:
def two_ends(text, k):
    '''
    Returns the doubly-ended positional shingles of the string as a list.
    Takes the input string and k (length of shingle) as parameters.
    '''
    basic = shingle(text, k)
    result = []
    for i in range(1, len(basic) + 1):
        from_end = len(basic) - i + 1  # position counted from the end
        # Compare against len(basic) so the rule holds for any k, not only k = 2.
        if i <= from_end:  # ties go to the left (start) number
            result.append(str(i) + basic[i - 1])  # number from the start, on the left
        else:
            result.append(basic[i - 1] + str(from_end))  # number from the end, on the right
    return result

In [None]:
two_ends("rosmarin", 2)
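
To see why numbering from both ends helps, the following self-contained sketch compares the two words from the figure. The function name `positional_shingles` and the variable names are illustrative assumptions, not taken from the paper. The logic follows the rule described above: keep the smaller position number, with ties going to the left.

```python
def positional_shingles(text, k=2):
    """Two-ended positional shingles; a sketch with illustrative names."""
    k = min(len(text), k)
    grams = ([text[:i] for i in range(1, k)]
             + [text[i:i + k] for i in range(len(text) - k + 1)]
             + [text[-i:] for i in range(k - 1, 0, -1)])
    out = []
    for idx, gram in enumerate(grams):
        from_start = idx + 1
        from_end = len(grams) - idx
        if from_start <= from_end:       # ties go to the left-hand number
            out.append(f"{from_start}{gram}")
        else:
            out.append(f"{gram}{from_end}")
    return out


a = positional_shingles("romarin")
b = positional_shingles("rosmarin")
print(a)
print(b)
print(set(a) & set(b))   # shingles shared by the two words
```

The extra *s* in *rosmarin* shifts only the shingles numbered from the start that follow it (*4ma* becomes *5ma*). The shingles numbered from the end (*ar4*, *ri3*, *in2*, *n1*) still match.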

## Similarity Checking

Finally, we implement a Jaccard similarity function that incorporates the shingling functions above.

In [None]:
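def jaccard(str1, str2, ends=2, k=2):
    '''
    Returns the Jaccard similarity of the shingle sets of two strings.
    Uses two-ended positional shingles if ends == 2, plain shingles otherwise.
    '''
    if ends == 2:
        set1 = set(two_ends(str1, k))
        set2 = set(two_ends(str2, k))
    else:
        set1 = set(shingle(str1, k))
        set2 = set(shingle(str2, k))
    numerator = len(set1.intersection(set2))  # shingles common to both strings
    denominator = len(set1.union(set2))       # shingles in either string
    return numerator / denominator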

In [None]:
print(jaccard("apple", "appe"))
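
As a worked check of the call above (derived by hand from the definitions in this notebook, with the defaults `ends = 2` and `k = 2`), the two positional shingle sets are $A = \left\lbrace \textit{1a, 2ap, 3pp, pl3, le2, e1} \right\rbrace$ for *apple* and $B = \left\lbrace \textit{1a, 2ap, 3pp, pe2, e1} \right\rbrace$ for *appe*. They share four shingles and their union contains seven, so:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{4}{7} \approx 0.571
$$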