## Functions
- For questions 5, 6 and 7, use the function `levenshtein`.
- For question 6, modify the `substitutions` variable inside `levenshtein`.
- For question 8, use the function `jaro_winkler`, which is defined in the file `Edistance.py`.
- For questions 5 to 10, use the function `uniFreq` to get the unigram counts in the corpus $C_3$.
- For question 9, use the function `bigramFreq` to get the bigram counts in the corpus $C_3$.
- For question 10, use the code snippet given in the last cell.

## Files
- Use `unigram.csv` for questions 5, 6, 7 and 8.
- Use `bigrams.csv` for questions 9 and 10.

In [21]:
def uniFreq():
    # Each line of unigram.csv has the form: word,count
    with open('unigram.csv') as f:
        unigrams = f.read().splitlines()
    wordFreq = dict()
    for word in unigrams:
        item = word.split(',')
        wordFreq[item[0].strip()] = int(item[1].strip())
    return wordFreq

def bigramFreq():
    # Each line of bigrams.csv has the form: word1 word2,count
    with open('bigrams.csv') as f:
        bigrams = f.read().splitlines()
    wordBigram = dict()
    for word in bigrams:
        item = word.split(',')
        wordBigram[item[0].strip()] = int(item[1].strip())
    return wordBigram
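
The two loaders above differ only in the file they read. A minimal sketch of a single generic loader, assuming every row has the form `key,count`; the name `load_counts` is hypothetical and is not part of the assignment code.

```python
import csv

def load_counts(path):
    """Read rows of the form `key,count` into a dict mapping key -> int count."""
    counts = {}
    with open(path) as handle:          # the file is closed when the block exits
        for row in csv.reader(handle):
            if len(row) < 2:            # skip blank or malformed rows
                continue
            counts[row[0].strip()] = int(row[1].strip())
    return counts
```

Under that assumption, `load_counts('unigram.csv')` and `load_counts('bigrams.csv')` would return the same kind of dictionaries as `uniFreq()` and `bigramFreq()`.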

## levenshtein

In [15]:
# A function definition starts with def <function name>(<input arguments>)
# There are 2 strings to compare, hence 2 input arguments

# mind the indentation
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)

    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # Cost of each of the 3 operations.
            # previous_row and current_row are one element longer than s2, hence j + 1 rather than j.
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2) # (c1 != c2) is 1 if the characters differ, otherwise 0
            
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]
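
Question 6 asks for a change to the `substitutions` line. A minimal sketch of one way to make that change, assuming what is wanted is a different substitution cost: the name `levenshtein_sub_cost` and the parameter `sub_cost` are hypothetical, and the value 2 used below is only an illustration, not necessarily the value the question asks for.

```python
def levenshtein_sub_cost(s1, s2, sub_cost=1):
    """Edit distance where insertions and deletions cost 1 and substitutions cost sub_cost."""
    if len(s1) < len(s2):
        return levenshtein_sub_cost(s2, s1, sub_cost)
    if len(s2) == 0:
        return len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            # the only changed line: a mismatch costs sub_cost instead of 1
            substitutions = previous_row[j] + (sub_cost if c1 != c2 else 0)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein_sub_cost('kitten', 'sitting'))     # 3
print(levenshtein_sub_cost('kitten', 'sitting', 2))  # 5
```

With `sub_cost=1` the sketch behaves like `levenshtein`. With `sub_cost=2` a substitution costs as much as a deletion followed by an insertion, which is why the second result is larger.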

## Jaro Winkler

In [23]:
from Edistance import jaro_winkler
print jaro_winkler('bimal','vimal')

0.866666666667


### Jaro Winkler Distance

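The Jaro–Winkler similarity measures how alike two strings are, and the Jaro–Winkler distance is `1 - Jaro–Winkler similarity`. The metric is designed for, and best suited to, short strings such as person names. The similarity score is normalized so that 0 means no similarity and 1 means an exact match. The `jaro_winkler` call above returns about 0.867 for two names that differ in one letter, so the value it reports is the similarity. The score is built on the Jaro similarity $d_j$ of the two strings:

$ d_{j}=\left\{\begin{array}{ll}0 & \text{if } m=0\\ \frac{1}{3}\left(\frac{m}{|s_{1}|}+\frac{m}{|s_{2}|}+\frac{m-t}{m}\right) & \text{otherwise}\end{array}\right. $


Where 
- **s1** and **s2** are the two strings being compared;
- **m** is the number of matching characters: two characters match if they are identical and no more than $\left\lfloor \frac{\max(|s_{1}|,|s_{2}|)}{2} \right\rfloor - 1$ positions apart;
- **t** is half the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings.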

Source - [Jaro-Winkler Distance (Wikipedia)](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
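
The following is a minimal sketch of the computation described in the Wikipedia article, written to make the roles of **m** and **t** concrete. It is not the code in `Edistance.py`, whose implementation is not shown here. The names `jaro` and `jaro_winkler_sketch` are hypothetical, and the Winkler step assumes the usual prefix scale `p = 0.1` with the common prefix capped at 4 characters.

```python
def jaro(s1, s2):
    """Jaro similarity: 0 means no similarity, 1 means an exact match."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # two equal characters match only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2.0
    return (m / float(len1) + m / float(len2) + (m - t) / m) / 3.0


def jaro_winkler_sketch(s1, s2, p=0.1):
    """Raise the Jaro similarity according to the length of the common prefix (at most 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

print(round(jaro_winkler_sketch('bimal', 'vimal'), 4))  # 0.8667
```

For `'bimal'` and `'vimal'` there are 4 matching characters, no transpositions and no common prefix, so the result equals the Jaro similarity (0.8 + 0.8 + 1) / 3, which agrees with the value printed by `jaro_winkler` above.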

## Bigram Likelihood

In [25]:
wordFreq = uniFreq()
wordBigram = bigramFreq()
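def bigram_likelihood(bigramWord1,word1,word2,uniDict=wordFreq,biDict=wordBigram):
    # bigramWord1 is the bigram 'word1 word2'; word2 is accepted but not used in the computation
    bi = biDict[bigramWord1]   # count of the bigram
    uni = uniDict[word1]       # count of its first word

    print bi,uni

    # maximum-likelihood estimate: count(word1 word2) / count(word1)
    bigramProb = bi/(uni*1.0)
    return bigramProb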

bigram_likelihood('iron safe','iron','safe')

8 12


0.6666666666666666
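
The value returned is the maximum-likelihood estimate of the probability of the second word given the first. Writing $c(\cdot)$ for a count in the corpus (a notation introduced here for illustration), the call above works out as follows with the two counts it prints, 8 for the bigram and 12 for `iron`:

$ P(\text{safe}\mid\text{iron})=\frac{c(\text{iron safe})}{c(\text{iron})}=\frac{8}{12}\approx 0.667 $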

## Add-one smoothing for finding likelihood of a sentence

In [27]:
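# The sentence boundary markers get a nominal unigram count of 1.
wordFreq['<s>'] = 1
wordFreq['</s>'] = 1
stri2 = ['<s> sandip babu sang bande mataram </s>','<s> chandranath babu asked for betel leaves </s>','<s> poor bimala went to the dressing room </s>']
for stri in stri2:
    mult = 1.0  # running product of the smoothed bigram probabilities
    for i,item in enumerate(stri.split(' ')):
        try:
            # seen bigram: (count(w1 w2) + 1) / (count(w1) + V), where V = len(wordFreq)
            print (wordBigram[item+' '+stri.split()[i+1]] + 1)/((wordFreq[item] + len(wordFreq.keys()))*1.0),item+' '+stri.split()[i+1]
            mult = mult * (wordBigram[item+' '+stri.split()[i+1]] + 1)/((wordFreq[item] + len(wordFreq.keys()))*1.0)
        except (KeyError, IndexError):
            try:
                # unseen bigram: 1 / (count(w1) + V)
                print (1)/((wordFreq[item] + len(wordFreq.keys()))*1.0),item+' '+stri.split()[i+1]
                mult = mult * (1)/((wordFreq[item] + len(wordFreq.keys()))*1.0)
            except (KeyError, IndexError):
                # item is missing from wordFreq, or it is the final </s> with no following word;
                # the pair is skipped and mult is left unchanged
                print item
    print mult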

0.00980244307043 <s> sandip
0.00905109489051 sandip babu
0.000149075730471 babu sang
0.000150784077201 sang bande
0.00584883023395 bande mataram
0.000149970005999 mataram </s>
0.000150806816468 </s>
1.74932821106e-18
0.000904840898809 <s> chandranath
0.00180695678362 chandranath babu
0.000149075730471 babu asked
0.0001495215311 asked for
for
0.000301477238468 betel leaves
0.000150625094141 leaves </s>
0.000150806816468 </s>
1.65494105455e-21
0.00120645453174 <s> poor
0.00045051809581 poor bimala
0.000297707651087 bimala went
0.00014905351021 went to
0.000150693188668 to the
0.000150715900528 the dressing
0.00105342362679 dressing room
0.000148345942738 room </s>
0.000150806816468 </s>
8.56025744882e-29
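
The cell above applies the add-one estimate `(count(w1 w2) + 1) / (count(w1) + V)` to each consecutive word pair and multiplies the results. A minimal sketch of the same idea as a reusable function, under two stated assumptions: the name `add_one_sentence_prob` is hypothetical, and a first word missing from the unigram counts is treated as having count 0, whereas the cell above skips such pairs (for example `for betel`). The toy counts are invented for illustration and do not come from `unigram.csv` or `bigrams.csv`.

```python
def add_one_sentence_prob(sentence, uni_counts, bi_counts):
    """Product of add-one smoothed bigram probabilities over a whitespace-tokenised sentence."""
    tokens = sentence.split()
    vocab_size = len(uni_counts)                 # V in the formula
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):       # consecutive word pairs only
        bigram_count = bi_counts.get(w1 + ' ' + w2, 0)   # unseen bigram -> 0
        unigram_count = uni_counts.get(w1, 0)            # unseen first word -> 0 (assumption)
        prob *= (bigram_count + 1.0) / (unigram_count + vocab_size)
    return prob

# Toy counts, invented purely for illustration.
toy_uni = {'<s>': 1, 'iron': 12, 'safe': 9, '</s>': 1}
toy_bi = {'iron safe': 8}
print(add_one_sentence_prob('<s> iron safe </s>', toy_uni, toy_bi))
```

Pairing the tokens with `zip(tokens, tokens[1:])` stops at the last real pair, so the final `</s>` never needs a successor and no exception handling is required.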
