## Language Models

- Machine Translation
- Spell Correction
- Speech Recognition
- Summarisation, Q & A


### Formulations

- How likely a sentence is well formed
$$ P(S) = P(w_1,w_2,w_3,\ldots,w_n) $$

- Predict the next word - Mobile Keyboards
    $$ P(w_5 |  w_1,w_2,w_3,w_4)  $$
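
As a rough illustration of the next-word formulation, the sketch below estimates which word is most likely to follow a given word from raw counts. The corpus and the helper name `predict_next` are invented for this example, and for brevity it conditions only on the single previous word rather than on all four preceding words.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "i like green tea",
    "i like black coffee",
    "i like green apples",
    "you like green tea",
]

# Count how often each word follows the previous word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1

def predict_next(prev_word):
    """Return (word, estimated probability) for the most frequent follower."""
    followers = next_word_counts[prev_word]
    total = float(sum(followers.values()))
    word, count = followers.most_common(1)[0]
    return word, count / total

print(predict_next("like"))   # most frequent follower of "like" and its relative frequency
print(predict_next("green"))
```

The same counting idea carries over to characters, which is what the Hangman guessers in the rest of this notebook rely on.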

## The Hangman Game

The <a href="https://en.wikipedia.org/wiki/Hangman_(game)">Hangman game</a> is a simple game whereby one person thinks of a word, which they keep secret from their opponent, who tries to guess the word one character at a time. The game ends when the opponent makes more than a fixed number of incorrect guesses, or they figure out the secret word before then (in which case they *win*). 

Here's a simple version of the game, and a method allowing interactive play. 

Guessing the secret word amounts to predicting it one character at a time:

- Predict the next character given the previous ones - Mobile Keyboards
    $$ P(c_5 |  c_1,c_2,c_3,c_4)  $$

The description and the methods **`hangman`** and **`human`** are taken from the University of Melbourne course [Web Search and Text Analysis](http://people.eng.unimelb.edu.au/tcohn/comp90042.html).


In [None]:
# allowing better python 2 & python 3 compatibility 
from __future__ import print_function 

def hangman(secret_word, guesser, max_mistakes=8, verbose=True):
    secret_word = secret_word.lower()
    mask = ['_'] * len(secret_word)
    guessed = set()
    if verbose:
        print("Starting hangman game. Target is", ' '.join(mask), 'length', len(secret_word))
    
    mistakes = 0
    while mistakes < max_mistakes:
        if verbose:
            print("You have", (max_mistakes-mistakes), "attempts remaining.")
        guess = guesser(mask, guessed)

        if verbose:
            print('Guess is', guess)
        if guess in guessed:
            if verbose:
                print('Already guessed this before.')
            mistakes += 1
        else:
            guessed.add(guess)
            if guess in secret_word:
                for i, c in enumerate(secret_word):
                    if c == guess:
                        mask[i] = c
                if verbose:
                    print('Good guess:', ' '.join(mask))
            else:
                if verbose:
                    print('Sorry, try again.')
                mistakes += 1
                
        if '_' not in mask:
            if verbose:
                print('Congratulations, you won.')
            return mistakes
        
    if verbose:
        print('Out of guesses. The word was', secret_word)    
    return mistakes

def human(mask, guessed):
    print('Enter your guess:')
    try:
        read_line = raw_input  # Python 2
    except NameError:
        read_line = input      # Python 3
    return read_line().lower().strip()


In [None]:
hangman('whatever', human, 8, True)

## Loading the dataset

In [None]:
# Read the word list, one word per line, and close the file afterwards
with open('corncob_lowercase.txt') as fh:
    f = fh.read().splitlines()
f[:10]

In [None]:
test = set()
for i,item in enumerate(f):
    if i%58 == 0:
        test.add(item)
    if len(test) == 1000:
        break
        
train = set(f) - test

len(train), len(test)

## Randomly predicting a character at a time

 - Randomly pick a number from 1 to 26.
 - Pick the corresponding letter from the alphabet.
 - If that letter has already been guessed, draw again instead of predicting it.

In [None]:
import numpy as np
import string    


def randomG(mask,guessed):
    
    # Initialise a dictionary with all the characters in the alphabet
    d = dict.fromkeys(string.ascii_lowercase, 0)
    
    # Give an integer index for each character
    pd = dict()
    for i,j in enumerate(d.keys()):
        pd[i+1] = j
    
    while 1:
        trials = np.random.choice(range(1,27), 1,replace=False)
        # Compare the letter (not its integer index) against the guessed set
        if pd[trials[0]] not in guessed:
            return pd[trials[0]]

In [None]:
mist = hangman('whatever',randomG,8)

In [None]:
# Number of mistakes made in count of characters - Highest possible for 1000 words?
totMist = 0

# Number of words correctly predicted?
corrs = 0

#Testing for effectiveness of the method on the test data
for i,item in enumerate(test):
    mist = hangman(item, randomG, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1
        print(i,item)

print(totMist,corrs)

## Can we do any better? ...In a principled and efficient way?

1) Chain Rule
$$ P(w_1,w_2,w_3,\ldots,w_n) = P(w_1)\cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdots P(w_n | w_1,\ldots,w_{n-1}) $$

2) Markov Assumption


$$ P(w_1,w_2,w_3,\ldots,w_n) \approx \prod_{i} P(w_i | w_{i-k},\ldots,w_{i-1} ) $$

$$ P(w_i | w_1,w_2,\ldots,w_{i-1}) \approx  P(w_i | w_{i-k},\ldots,w_{i-1} ) $$
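
To make the two steps concrete, here is a minimal sketch that applies the chain rule with a first-order Markov assumption (k = 1) to characters. The four training words and the function name `markov_probability` are assumptions made for this illustration; `$` is used as a padding symbol, in the same way as the bigram cells later in this notebook.

```python
from collections import Counter

# Toy training words, invented for illustration.
words = ["cat", "cab", "can", "cot"]

# Pad with '$' so the first and last characters also have a context.
bigram_counts = Counter()
context_counts = Counter()
for word in words:
    padded = "$" + word + "$"
    for prev, nxt in zip(padded, padded[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def markov_probability(word):
    """P(word) as a product of P(next char | previous char) terms."""
    padded = "$" + word + "$"
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, nxt)] / float(context_counts[prev])
    return prob

print(markov_probability("cat"))  # 1 * 3/4 * 1/3 * 1 = 0.25
print(markov_probability("cob"))  # 0.0, because "ob" never occurs in the toy data
```

The zero for an unseen pair is one reason the later sections combine several model orders instead of trusting a single one.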


## Unigram character prediction

- Predict a character based on its frequency of occurrence in the corpus
$$ P(w_1,w_2,w_3,\ldots,w_n) \approx \prod_{i} P(w_i) $$

In [None]:
charaFreq = dict()
totFreq = 0
for item in train:
    for chara in item:
        try:
            charaFreq[chara] += 1
        except:
            charaFreq[chara] = 1            
        totFreq += 1
import operator
sorted_x = sorted(charaFreq.items(), key=operator.itemgetter(1),reverse=True)

sorted_x[:5]

In [None]:
def unig(mask,guessed):
    for item in sorted_x:
        if item[0] not in guessed:
            return item[0]
hangman('whatever', unig, 8, True)

In [None]:
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman(item, unig, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1

corrs,totMist

## Unigram with Length

In [None]:
from collections import defaultdict
frequni = defaultdict(dict)
for item in train:
    for chara in item:
        try:
            frequni[len(item)][chara] += 1
        except:
            frequni[len(item)][chara] = 1

In [None]:
def unigLen(mask,guessed):
    sorted_x = sorted(frequni[len(mask)].items(), key=operator.itemgetter(1),reverse=True)
    for item in sorted_x:
        if item[0] not in guessed:
            return item[0]
        
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman(item, unigLen, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)

## Unigram with Position

- Probability distribution conditioned on the individual position


In [None]:
from collections import defaultdict
frequni = defaultdict(dict)
posCountDict = dict()
for item in train:
    for pos,chara in enumerate(item):
        try:
            frequni[pos][chara] += 1
        except KeyError:
            frequni[pos][chara] = 1
        try:
            posCountDict[pos] += 1
        except KeyError:
            posCountDict[pos] = 1

In [None]:
freqProb = defaultdict(dict)
for item in frequni.keys():
    if posCountDict[item] > 50:
        for stuff in frequni[item].keys():
            freqProb[item][stuff] = frequni[item][stuff]*1.0/posCountDict[item]

freqProb.keys()

In [None]:
def unigPos(mask,guessed):
    locDict = dict()
    for item in freqProb.keys():
        if item < len(mask):  # positions are 0-indexed
            sorted_x = sorted(freqProb[item].items(), key=operator.itemgetter(1),reverse=True)
            for stuff in sorted_x:
                if stuff[0] not in guessed:
                    locDict[stuff[0]] = stuff[1]
    sorted_x = sorted(locDict.items(), key=operator.itemgetter(1),reverse=True)
    for stuff in sorted_x:
        if stuff[0] not in guessed:
            return stuff[0]
    
    print('nothing left to guess')
        
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman(item, unigPos, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)

- Shall we keep the max per position?

```python
for stuff in sorted_x:
    if stuff[0] not in guessed:
        if stuff[0] in locDict.keys():
            if stuff[1] > locDict[stuff[0]]:
                locDict[stuff[0]] = stuff[1]
                break
        else:
            locDict[stuff[0]] = stuff[1]
            break
```
                 
- Shall we add up the probabilities?
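
One possible reading of the "add up the probabilities" idea is sketched below. It is only an illustration: `toy_freq_prob` is a made-up table with the same shape as `freqProb`, the function name `sum_over_positions` is hypothetical, and restricting the sum to positions that are still masked is an assumption that goes beyond what `unigPos` does.

```python
# Hypothetical per-position distributions P(char | position), shaped like freqProb.
toy_freq_prob = {
    0: {"s": 0.5, "c": 0.3, "e": 0.2},
    1: {"a": 0.6, "e": 0.4},
    2: {"e": 0.7, "t": 0.3},
}

def sum_over_positions(mask, guessed, freq_prob):
    """Score each unguessed character by summing its probability over the
    positions that are still masked, then return the best-scoring character."""
    scores = {}
    for pos, slot in enumerate(mask):
        if slot != "_":
            continue  # position already revealed
        for chara, prob in freq_prob.get(pos, {}).items():
            if chara not in guessed:
                scores[chara] = scores.get(chara, 0.0) + prob
    if not scores:
        return None
    return max(scores, key=scores.get)

print(sum_over_positions(["_", "_", "_"], set(), toy_freq_prob))   # 'e' (0.2 + 0.4 + 0.7)
print(sum_over_positions(["_", "_", "_"], {"e"}, toy_freq_prob))   # 'a'
```

Summing favours characters that are moderately likely in many open positions, whereas keeping the maximum favours characters that are very likely in one specific position.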

## Unigram with Position bidirectionally

In [None]:
from collections import defaultdict
frequni = defaultdict(dict)
posCountDict = dict()
frequniRev = defaultdict(dict)
posCountDictRev = dict()


for item in train:
    for pos,chara in enumerate(item):
        revIndex = len(item)-1-pos

        try:
            frequni[pos][chara] += 1
        except KeyError:
            frequni[pos][chara] = 1
            
        try:
            frequniRev[revIndex][chara] += 1
        except KeyError:
            frequniRev[revIndex][chara] = 1
            
            
        try:
            posCountDict[pos] += 1
        except KeyError:
            posCountDict[pos] = 1
            
        try:
            posCountDictRev[revIndex] += 1
        except KeyError:
            posCountDictRev[revIndex] = 1


freqProb = defaultdict(dict)
freqProbRev = defaultdict(dict)

for item in frequni.keys():
    if posCountDict[item] > 50:
        for stuff in frequni[item].keys():
            freqProb[item][stuff] = frequni[item][stuff]*1.0/posCountDict[item]
            
for item in frequniRev.keys():
    if posCountDictRev[item] > 50:
        for stuff in frequniRev[item].keys():
            freqProbRev[item][stuff] = frequniRev[item][stuff]*1.0/posCountDictRev[item]

freqProb.keys(),freqProbRev.keys()

In [None]:
def unigPosBi(mask,guessed):
    locDict = list()
    for item in freqProb.keys():
        if item < len(mask):  # positions are 0-indexed
            sorted_x = sorted(freqProb[item].items(), key=operator.itemgetter(1),reverse=True)
            for stuff in sorted_x:
                if stuff[0] not in guessed:
                    locDict.append((stuff[0],stuff[1],item))
                    break
    for item in freqProbRev.keys():
        if item < len(mask):  # reverse positions are 0-indexed as well
            sorted_x = sorted(freqProbRev[item].items(), key=operator.itemgetter(1),reverse=True)
            for stuff in sorted_x:
                if stuff[0] not in guessed:
                    locDict.append((stuff[0],stuff[1],-1*item))
                    break
    loctDict2 = dict()
    for item in locDict:
        try:
            loctDict2[item[0]] += item[1]
        except:
            loctDict2[item[0]] = item[1]
    
    #sorted_x = sorted(locDict, key=lambda x: x[1],reverse=True)
    sorted_x = sorted(loctDict2.items(), key=operator.itemgetter(1),reverse=True)
    for stuff in sorted_x:
        if stuff[0] not in guessed:
            return stuff[0]
    
    print('nothing left to guess')
        
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman(item, unigPosBi, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1

print (corrs,totMist)

### Bigram Models

$$
P(w_i | w_{i-1}) = \frac{count(w_{i-1},w_i)}{count(w_{i-1})} $$

In [None]:
from collections import defaultdict
bigram = defaultdict(dict)

for item in train:
    item2 = '$'+item+'$'
    for i,chara in enumerate(item2[:-1]):
        try:
            bigram[chara][item2[i+1]] += 1
        except:
            bigram[chara][item2[i+1]] = 1


In [None]:
outerKeys = dict()

for item in bigram.keys():
    for stuff in bigram[item]:
        try:
            outerKeys[item] += bigram[item][stuff]
        except:
            outerKeys[item] = bigram[item][stuff]

            
for item in bigram.keys():
    try:
        del bigram[item]['$']
    except:
        pass
    
    for stuff in bigram[item]:
        bigram[item][stuff] = (1.0*bigram[item][stuff])/outerKeys[item]


def bigramM(mask,guessed):
    currMax = 0.0
    
    currBi = '_'
    if len(mask) == mask.count('_'):
        return unig(mask,guessed).split('_')[0]
    else:
        for i,item in enumerate(mask):
            try:
                if item != '_' and mask[i+1] == '_':
                    sorted_x = sorted(bigram[item].items(), key=operator.itemgetter(1),reverse=True)
                    for thing in sorted_x:
                        if thing[1] > currMax and thing[0] not in guessed:
                            currMax = thing[1]
                            currBi = thing[0]
                            break
            except IndexError:
                pass

        return currBi

hangman('whatevera', bigramM, 8, False)               

In [None]:
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman(item, bigramM, 8, False)
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)

### N-gram Models

In [None]:
from collections import defaultdict

def ngramModel(train,ngt,countsOnly=1):
    ng = ngt - 1
    bigram = defaultdict(dict)
    outerKeys =dict()
    for item in train:
        item2 = '$'+item+'$'
        for i,chara in enumerate(item2):
            if i+ng >= len(item2)-1:
                break
            try:
                bigram[item2[i:i+ng]][item2[i+ng]] += 1
            except:
                try:
                    bigram[item2[i:i+ng]][item2[i+ng]] = 1
                except IndexError:
                    pass
    for item in bigram.keys():
        for stuff in bigram[item]:
            try:
                outerKeys[item] += bigram[item][stuff]
            except:
                outerKeys[item] = bigram[item][stuff]
    if countsOnly != 1:
        for item in bigram.keys():
            try:
                del bigram[item]['$']
            except:
                pass
            for stuff in bigram[item]:
                bigram[item][stuff] = (1.0*bigram[item][stuff])/outerKeys[item]
    
    return bigram,outerKeys

In [None]:
ngModels = dict()
outKeys = dict()

for i in range(2,6):
    ngModels[i],outKeys[i] = ngramModel(train,i)

In [None]:
ngModels[5]

## Backoff Models

- Use the n-gram if you have good evidence for it
- Otherwise back off to the (n-1)-gram, (n-2)-gram, ..., bigram, unigram
- How do we decide what counts as good evidence?
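
A minimal sketch of the backoff idea follows, assuming that "good evidence" means the context has been seen at least a threshold number of times. The table `toy_counts`, the function `backoff_guess` and the parameter `min_evidence` are hypothetical names introduced for this example, with made-up counts.

```python
# Hypothetical next-character counts, keyed by context string (made-up numbers).
toy_counts = {
    "the": {"r": 2, "n": 1},             # long context, seen only 3 times
    "he": {"r": 30, "n": 20, "l": 10},   # shorter context, seen 60 times
    "e": {"r": 400, "s": 350, "d": 250},
}

def backoff_guess(context, guessed, counts, min_evidence=5):
    """Try the longest context first and back off to shorter suffixes when
    the context was seen fewer than min_evidence times."""
    for start in range(len(context)):
        suffix = context[start:]
        followers = counts.get(suffix, {})
        total = sum(followers.values())
        if total < min_evidence:
            continue  # not enough evidence, back off to a shorter context
        candidates = [(n, c) for c, n in followers.items() if c not in guessed]
        if candidates:
            count, chara = max(candidates)
            return chara, suffix
    return None, None

print(backoff_guess("the", set(), toy_counts))                   # ('r', 'he')
print(backoff_guess("the", set(), toy_counts, min_evidence=2))   # ('r', 'the')
```

The threshold plays the same role as the per-order entries of the parameter list that the cells below search over.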

In [None]:
from __future__ import print_function 

def hangman2(secret_word, guesser, max_mistakes=8, verbose=True,param=20):
    secret_word = secret_word.lower()
    mask = ['_'] * len(secret_word)
    guessed = set()
    if verbose:
        print("Starting hangman game. Target is", ' '.join(mask), 'length', len(secret_word))
    
    mistakes = 0
    while mistakes < max_mistakes:
        if verbose:
            print("You have", (max_mistakes-mistakes), "attempts remaining.")
        guess = guesser(mask, guessed,param)

        if verbose:
            print('Guess is', guess)
        if guess in guessed:
            if verbose:
                print('Already guessed this before.')
            mistakes += 1
        else:
            guessed.add(guess)
            if guess in secret_word:
                for i, c in enumerate(secret_word):
                    if c == guess:
                        mask[i] = c
                if verbose:
                    print('Good guess:', ' '.join(mask))
            else:
                if verbose:
                    print('Sorry, try again.')
                mistakes += 1
                
        if '_' not in mask:
            if verbose:
                print('Congratulations, you won.')
            return mistakes
        
    if verbose:
        print('Out of guesses. The word was', secret_word)    
    return mistakes

In [None]:
def ngramM(mask,guessed,paramList):
    proList = list()
    maski = '$' +  ''.join(mask) + '$'
    if len(mask) == mask.count('_'):
        return unigPos(mask,guessed)
    maskSort = sorted([item[-4:] for item in maski.split('_') if len(item)>=1 and '$'!=item[-1]],key=lambda x: len(x),reverse=True)
    for item in maskSort:
        for i in range(4):
            proAdded = 0
            try:
                if outKeys[len(item[i:])+1][item[i:]] > paramList[i]:
                    pro = sorted(ngModels[len(item[i:])+1][item[i:]].items(),key=operator.itemgetter(1),reverse=True)
                    for pros in pro:
                        if pros[0] not in guessed:
                            proList.append((pros[0],pros[1],pros[1]*1.0/outKeys[len(item[i:])+1][item[i:]],len(item[i:])+1))
                            proAdded = 1
                            break
                    if proAdded == 1:
                        break
            except (IndexError,KeyError) as e:
                pass
                
    #print(proList)
    if len(proList) == 0:
        return unigPos(mask,guessed)
    return sorted(proList, key=lambda vertex: (vertex[3], vertex[2]),reverse=True)[0][0]

In [None]:
maxparai = 0
maxCorr = 0


for uni in  range(1,20):
    totMist = 0
    corrs = 0
    parai = [uni,uni,uni,uni]
    for i,item in enumerate(test):
        mist = hangman2(item, ngramM, 8, False,parai)
        totMist += mist
        if mist < 8:
            corrs += 1
    if corrs > maxCorr:
        print ('Best - ',parai,corrs,totMist)
        maxparai = parai
        maxCorr = corrs


In [None]:
maxparai = 0
maxCorr = 0


for uni in  range(1,20):
    for bi in range(1,20):
        for tri in range(1,20):
            for quad in range(1,20):
                totMist = 0
                corrs = 0
                parai = [quad,tri,bi,uni]
                for i,item in enumerate(test):
                    mist = hangman2(item, ngramM, 8, False,parai)
                    totMist += mist
                    if mist < 8:
                        corrs += 1
                if corrs > maxCorr:
                    print ('Best - ',parai,corrs,totMist)
                    maxparai = parai
                    maxCorr = corrs

## Interpolation
- Mix unigram, bigram, trigram and so on

$$ \hat{P}(w_n|w_{n-2},w_{n-1}) = \lambda_1 P(w_n|w_{n-2},w_{n-1}) + \lambda_2 P(w_n|w_{n-1}) + \lambda_3 P(w_n)   $$
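
The formula can be tried out on a small made-up example. In the sketch below the three distributions and the weights are invented, and `interpolate` is a hypothetical helper name. The weights are chosen to sum to 1, so the mixture is again a probability distribution.

```python
def interpolate(distributions, weights):
    """Linear interpolation: mixed[c] = sum over k of lambda_k * P_k(c)."""
    mixed = {}
    for dist, lam in zip(distributions, weights):
        for chara, prob in dist.items():
            mixed[chara] = mixed.get(chara, 0.0) + lam * prob
    return mixed

# Hypothetical estimates of P(next char | context) from three model orders.
p_trigram = {"r": 0.9, "n": 0.1}
p_bigram = {"r": 0.5, "n": 0.3, "l": 0.2}
p_unigram = {"e": 0.4, "r": 0.3, "n": 0.2, "l": 0.1}

lambdas = [0.5, 0.3, 0.2]  # assumed weights, chosen to sum to 1
mixed = interpolate([p_trigram, p_bigram, p_unigram], lambdas)
best = max(mixed, key=mixed.get)
print(best, mixed[best])  # 'r' with 0.5*0.9 + 0.3*0.5 + 0.2*0.3 = 0.66
```

Unlike backoff, every model order contributes to every candidate, so a character unseen by the trigram model can still receive a non-zero score.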

In [None]:
def ngraInterp(mask,guessed,paramList=[0.1,0.2,0.3,0.4]):
    proList = list()
    maski = '$' +  ''.join(mask) + '$'
    if len(mask) == mask.count('_'):
        return unigPos(mask,guessed)
    maskSort = sorted([item[-4:] for item in maski.split('_') if len(item)>=1 and '$'!=item[-1]],key=lambda x: len(x),reverse=True)
    #print (maskSort)
    for item in maskSort:
        candidates = dict()
        for i in range(-1*(len(item)-1),1):
            #print(item,item[-i:],-i,len(item[-i:]))
            proAdded = 0
            try:
                pro = sorted(ngModels[len(item[-i:])+1][item[-i:]].items(),key=operator.itemgetter(1),reverse=True)
                for pros in pro:
                    if pros[0] not in guessed:
                        try:
                            candidates[pros[0]] += paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
                        except:
                            candidates[pros[0]] = paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
            except (IndexError,KeyError) as e:
                pass
        finMark = list(sorted(candidates.items(), key=operator.itemgetter(1),reverse=True)[0])
        finMark.append(item)
        proList.append(tuple(finMark))
                
    #print(proList)
    if len(proList) == 0:
        return unigPos(mask,guessed)
    #print(sorted(proList, key=lambda x: x[1],reverse=True)[0][0])
    # Rank candidates by interpolated score (x[1]), not by character (x[0])
    return sorted(proList, key=lambda x: x[1],reverse=True)[0][0]

In [None]:
hangman2('whatever', ngraInterp, 8, False,[0.9,0.3,1.3,0.4])             
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman2(item, ngraInterp, 8, False,[0.3,0.5,0.7,0.9])
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)

In [None]:
def ngraInterp(mask,guessed,paramList=[0.25,0.25,0.25,0.25]):
    proList = list()
    maski = '$' +  ''.join(mask) + '$'
    if len(mask) == mask.count('_'):
        return unig(mask,guessed)
    maskSort = sorted([item[-4:] for item in maski.split('_') if len(item)>=1 and '$'!=item[-1]],key=lambda x: len(x),reverse=True)
    #print (maskSort)
    for item in maskSort:
        candidates = dict()
        for i in range(-1*(len(item)-1),1):
            #print(item,item[-i:],-i,len(item[-i:]))
            proAdded = 0
            try:
                pro = sorted(ngModels[len(item[-i:])+1][item[-i:]].items(),key=operator.itemgetter(1),reverse=True)
                for pros in pro:
                    if pros[0] not in guessed:
                        try:
                            candidates[pros[0]] += paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
                        except:
                            candidates[pros[0]] = paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
            except (IndexError,KeyError) as e:
                pass
        finMark = list(sorted(candidates.items(), key=operator.itemgetter(1),reverse=True)[0])
        finMark.append(item)
        proList.append(tuple(finMark))
                
    #print(proList)
    if len(proList) == 0:
        return unigPos(mask,guessed)
    # Rank candidates by interpolated score (x[1]), not by character (x[0])
    return sorted(proList, key=lambda x: x[1],reverse=True)[0][0]

In [None]:
hangman2('whatever', ngraInterp, 8, False, [0.25,0.25,0.25,0.25])   

In [None]:
totMist = 0
corrs = 0
for i,item in enumerate(test):
    mist = hangman2(item, ngraInterp, 8, False, [ 0.25,0.25,0.25,0.25])
    #[ 6.18284023  0.66143727  0.91415198  2.9306201 ]
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)

## Parameter Estimation

In [None]:
def ngraInterpWeights(mask,guessed,paramList=[ 0.25,0.25,0.25,0.25]):
    proList = list()
    scores = defaultdict(dict)
    maski = '$' +  ''.join(mask) + '$'
    if len(mask) == mask.count('_'):
        return unig(mask,guessed),'nil'
    maskSort = sorted([item[-4:] for item in maski.split('_') if len(item)>=1 and '$'!=item[-1]],key=lambda x: len(x),reverse=True)
    #print (maskSort)
    for item in maskSort:
        candidates = dict()

        
        for i in range(-1*(len(item)-1),1):
            #print(item,item[-i:],-i,len(item[-i:]))
            proAdded = 0
            try:
                pro = sorted(ngModels[len(item[-i:])+1][item[-i:]].items(),key=operator.itemgetter(1),reverse=True)
                for pros in pro:
                    if pros[0] not in guessed:
                        try:
                            candidates[pros[0]] += paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
                        except:
                            candidates[pros[0]] = paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]
                        scores[item+'_'+pros[0]][len(item[-i:])+1] =  paramList[-i]* pros[1]*1.0/outKeys[len(item[-i:])+1][item[-i:]]

            except (IndexError,KeyError) as e:
                pass
        finMark = list(sorted(candidates.items(), key=operator.itemgetter(1),reverse=True)[0])
        finMark.append(item)
        proList.append(tuple(finMark))
                
    #print(proList)
    if len(proList) == 0:
        return unig(mask,guessed),'nil'
    
    # Rank candidates by interpolated score (x[1]), not by character (x[0])
    finVal = sorted(proList, key=lambda x: x[1],reverse=True)[0][0]
    newScores = defaultdict(dict)
    for item in scores.keys():
        if item.split('_')[1]== finVal:
            newScores[item] = scores[item]
    return finVal,newScores


def hangman3(secret_word, guesser, max_mistakes=8, verbose=True,param=20):
    tuples = open('filePar.csv','a')
    secret_word = secret_word.lower()
    mask = ['_'] * len(secret_word)
    guessed = set()
    if verbose:
        print("Starting hangman game. Target is", ' '.join(mask), 'length', len(secret_word))
    
    mistakes = 0
    while mistakes < max_mistakes:
        if verbose:
            print("You have", (max_mistakes-mistakes), "attempts remaining.")
        guess,scores = guesser(mask, guessed,param)
        finScores = defaultdict(dict)
        for i in range(1,6):
            finScores[guess][i] = 0.0
        if scores != 'nil': 
            for item in scores.keys():
                for stuff in scores[item].keys():
                    try:
                        finScores[guess][stuff] += scores[item][stuff]
                    except KeyError:
                        print(item,stuff)

                
        if verbose:
            print('Guess is', guess)
        if guess in guessed:
            if verbose:
                print('Already guessed this before.')
            mistakes += 1
        else:
            guessed.add(guess)
            if guess in secret_word:
                for i, c in enumerate(secret_word):
                    if c == guess:
                        mask[i] = c
                if verbose:
                    print('Good guess:', ' '.join(mask))
                finScores[guess]['correct'] = 1
            else:
                if verbose:
                    print('Sorry, try again.')
                mistakes += 1
                finScores[guess]['correct'] = 0
        if scores != 'nil':
            print(guess,finScores[guess][2],finScores[guess][3],finScores[guess][4],finScores[guess][5],finScores[guess]['correct'],file=tuples,sep=',')
        if '_' not in mask:
            if verbose:
                print('Congratulations, you won.')
            tuples.close()  # close the log file before returning early
            return mistakes
        
    if verbose:
        print('Out of guesses. The word was', secret_word)    
    tuples.close()
    return mistakes

In [None]:
corrs = 0
totMist = 0 
tuples = open('filePar.csv','w')
tuples.close()
for i,item in enumerate(test):
    mist = hangman3(item, ngraInterpWeights, 8, False,[ 0.25,0.25,0.25,0.25])
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)


In [None]:
import pandas
params = pandas.read_csv('filePar.csv',names=['char','bi','tri','quad','five','label'])
params.groupby('label').count()

In [None]:
from sklearn import linear_model
log_model = linear_model.LogisticRegression(max_iter=1000)
train_features = params[['bi','tri','quad','five']]
train_labels = params['label']  # 1 if the guess was correct, else 0
log_model.fit(X = train_features ,
              y = train_labels)

print(log_model.coef_,log_model.get_params())

In [None]:
a = log_model.coef_[0]
b = [item/sum(a) for item in a]
b,b[::-1]


In [None]:
corrs = 0
totMist = 0 
for i,item in enumerate(test):
    mist = hangman3(item, ngraInterpWeights, 8, False,b)
    totMist += mist
    if mist < 8:
        corrs += 1
print (corrs,totMist)


## Reference
 - [Stanford CS124 slides](https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf)