# Week 5: Sentence Completion Challenge

The cell below will load the language_model class (developed last week) and train it using the files in the training directory.

In [8]:
%load_ext autoreload
# language_model will be reloaded when you run this cell - this is important if you change the language_model class!
%autoreload 2
import os
# import the language model from the previous lab
from language_model import *

# you may need to update this path
parentdir="/Users/finpearson/Desktop/Github/ANLE---Python-/Week4/sentence-completion/"

trainingdir=os.path.join(parentdir,"Holmes_Training_Data")
training,testing=get_training_testing(trainingdir)
# use a small number here whilst developing your solutions
MAX_FILES=10
mylm=language_model(trainingdir=trainingdir,files=training[:MAX_FILES],adjust_unknowns=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
There are 522 files in the training directory: /Users/finpearson/Desktop/Github/ANLE---Python-/Week4/sentence-completion/Holmes_Training_Data
Processing 19TOM10.TXT
Processing SNOWI10.TXT
Processing FBRLS10.TXT
Processing WTSLW10.TXT
UnicodeDecodeError processing WTSLW10.TXT: ignoring rest of file
Processing MOHIC10.TXT
UnicodeDecodeError processing MOHIC10.TXT: ignoring rest of file
Processing CEVEN10.TXT
Processing WNLAW10.TXT
Processing PRESC10.TXT
Processing MPOOL10.TXT
Processing AHERO10.TXT


Let's have a look at the most frequent words in the training data.

In [9]:
vocab=sorted(mylm.unigram.items(),key=lambda x:x[1],reverse =True)

In [10]:
vocab[:10]

[('__START', 0.0767994807109772),
 ('__END', 0.0767994807109772),
 (',', 0.055092985858456345),
 ('the', 0.03996174784564508),
 ('.', 0.03794896863430192),
 ('of', 0.021419336570876007),
 ('and', 0.018276622911758262),
 ('to', 0.01652963206342592),
 ('a', 0.016050144514074764),
 ('``', 0.015263464384359356)]

How big is the vocabulary? What kinds of words are low-frequency? What kinds of words are mid-frequency?

In [None]:
len(vocab)

In [None]:
vocab[-10:]

In [None]:
topvocab=vocab[:7000]

In [None]:
topvocab[-10:]
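
One way to explore these questions is to slice the sorted list into rank bands and check how much probability mass the top of the list covers. The following is a minimal sketch, assuming the list has the same shape as `vocab` above (pairs of word and probability, sorted by descending probability). The toy list and the helper names `frequency_band` and `cumulative_mass` are illustrative and are not part of `language_model`.

```python
# Toy stand-in for the sorted (word, probability) list; the numbers are made up.
toy_vocab = [("the", 0.4), ("of", 0.3), ("house", 0.15),
             ("moor", 0.1), ("hansom", 0.04), ("zoetrope", 0.01)]

def frequency_band(sorted_vocab, start, stop):
    """Return the words ranked start..stop-1 in a list sorted by descending probability."""
    return [word for word, _ in sorted_vocab[start:stop]]

def cumulative_mass(sorted_vocab, k):
    """Return the total probability covered by the k most frequent entries."""
    return sum(prob for _, prob in sorted_vocab[:k])

print(frequency_band(toy_vocab, 0, 2))   # high-frequency band
print(frequency_band(toy_vocab, 2, 4))   # mid-frequency band
print(frequency_band(toy_vocab, 4, 6))   # low-frequency band
print(cumulative_mass(toy_vocab, 2))     # mass covered by the top two words
```

On the real data the same calls could be made with `vocab` in place of `toy_vocab`, for example to see how much of the probability mass the 7000 entries in `topvocab` account for.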

Make sure you can:
* look up bigram probabilities
* generate a sentence according to the model
* calculate the perplexity of a test sentence
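
As a reminder of what the first and last of these involve, here is a small self-contained sketch of a bigram lookup and a perplexity calculation. It does not use the `language_model` class: the corpus is a toy, and it uses add-one smoothing purely for brevity (not the Kneser-Ney or absolute discounting used in the cells that follow). The names `bigram_prob` and `perplexity` are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training corpus with explicit sentence boundary tokens.
corpus = [["__START", "the", "dog", "barked", "__END"],
          ["__START", "the", "cat", "slept", "__END"]]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[prev][word] += 1

vocab_size = len({w for sent in corpus for w in sent})

def bigram_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    counts = bigram_counts[prev]
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def perplexity(sentence):
    """exp of the average negative log probability per predicted token."""
    log_prob = sum(math.log(bigram_prob(prev, word))
                   for prev, word in zip(sentence, sentence[1:]))
    return math.exp(-log_prob / (len(sentence) - 1))

seen = ["__START", "the", "dog", "barked", "__END"]
unseen = ["__START", "dog", "the", "slept", "__END"]
print(bigram_prob("__START", "the"))
print(perplexity(seen), perplexity(unseen))
```

A sentence made of bigrams seen in training should receive a lower perplexity than one made of unseen bigrams, which is the behaviour to check for in your own model.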

In [11]:
mylm.compute_perplexity(filenames=testing[:MAX_FILES],methodparams={"method":"bigram","smoothing":"kneser-ney"})

Processing file 0:ADAMB10.TXT
Processing file 1:ECORE10.TXT
Processing file 2:INDHE10.TXT
Processing file 3:GNDIN10.TXT
Processing file 4:BBEAU10.TXT
Processing file 5:FWALD10.TXT
Processing file 6:RCRIM10.TXT
Processing file 7:KDNPD10.TXT
Processing file 8:SWGEM10.TXT
Processing file 9:TSAMU10.TXT


488.57547866385994

In [None]:
mylm.compute_perplexity(filenames=testing[:MAX_FILES],methodparams={"method":"bigram","smoothing":"absolute"})

In [None]:
mylm.compute_perplexity(filenames=testing[:MAX_FILES],methodparams={"method":"unigram"})

Now let's load and have a look at the sentence completion challenge questions.

In [12]:
import pandas as pd, csv
questions=os.path.join(parentdir,"testing_data.csv")
answers=os.path.join(parentdir,"test_answer.csv")

with open(questions) as instream:
    csvreader=csv.reader(instream)
    lines=list(csvreader)
qs_df=pd.DataFrame(lines[1:],columns=lines[0])
qs_df.head()



| | id | question | a) | b) | c) | d) | e) |
|---|---|---|---|---|---|---|---|
| 0 | 1 | I have it from the same source that you are bo... | crying | instantaneously | residing | matched | walking |
| 1 | 2 | It was furnished partly as a sitting and partl... | daintily | privately | inadvertently | miserably | comfortably |
| 2 | 3 | As I descended , my old ally , the _____ , cam... | gods | moon | panther | guard | country-dance |
| 3 | 4 | We got off , _____ our fare , and the trap rat... | rubbing | doubling | paid | naming | carrying |
| 4 | 5 | He held in his hand a _____ of blue paper , sc... | supply | parcel | sign | sheet | chorus |


The questions need to be tokenized so that the gaps can be located.

In [None]:
from nltk import word_tokenize as tokenize

tokens=[tokenize(q) for q in qs_df['question']]
print(tokens)

Getting the context of the blank: look at the words that precede it (the number of words is given by `window`).

In [14]:
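def get_left_context(sent_tokens,window,target="_____"):
    #return up to `window` tokens immediately before the first occurrence of target
    found=-1
    for i,token in enumerate(sent_tokens):
        if token==target:
            found=i
            break

    if found>-1:
        #clamp at 0 so that a blank near the start does not produce a negative slice index
        return sent_tokens[max(0,found-window):found]
    else:
        return []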
    

qs_df['tokens']=qs_df['question'].map(tokenize)
qs_df['left_context']=qs_df['tokens'].map(lambda x: get_left_context(x,2))
qs_df.head()    

| | id | question | a) | b) | c) | d) | e) | tokens | left_context |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I have it from the same source that you are bo... | crying | instantaneously | residing | matched | walking | [I, have, it, from, the, same, source, that, y... | [and, are] |
| 1 | 2 | It was furnished partly as a sitting and partl... | daintily | privately | inadvertently | miserably | comfortably | [It, was, furnished, partly, as, a, sitting, a... | [flowers, arranged] |
| 2 | 3 | As I descended , my old ally , the _____ , cam... | gods | moon | panther | guard | country-dance | [As, I, descended, ,, my, old, ally, ,, the, _... | [,, the] |
| 3 | 4 | We got off , _____ our fare , and the trap rat... | rubbing | doubling | paid | naming | carrying | [We, got, off, ,, _____, our, fare, ,, and, th... | [off, ,] |
| 4 | 5 | He held in his hand a _____ of blue paper , sc... | supply | parcel | sign | sheet | chorus | [He, held, in, his, hand, a, _____, of, blue, ... | [hand, a] |


##  Building and evaluating an SCC system
1. always predict the same answer (e.g., "a")


In [116]:
# from scc import *
# You could import these classes with the line above, but the code is included here to make it easier to inspect.
import random  #needed by chooseRandom

class question:
    
    def __init__(self,aline):
        self.fields=aline
    
    def get_field(self,field):
        return self.fields[question.colnames[field]]
    
    def add_answer(self,fields):
        self.answer=fields[1]
   
    def chooseA(self):
        return("a")

    def chooseRandom(self):
        return random.choice(["a", "b", "c", "d", "e"])

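    def chooseUnigramProb(self):
        #NOTE: as written this raises a KeyError (see the output of SCC.predict(method="chooseUnigramProb") further down):
        #the value returned by nextlikely is passed to get_field, which expects a column name such as "a)"
        print (self.fields[1])
        return self.get_field(mylm.nextlikely(k=1, current=self.fields[1], method="unigram")), "a"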
        
    
    def predict(self,method="chooseA"):
        #eventually there will be lots of methods to choose from
        if method=="chooseA":
            return self.chooseA()
        if method=="chooseRandom":
            return self.chooseRandom()
        if method=="chooseUnigramProb":
            return self.chooseUnigramProb()
        
    def predict_and_score(self,method="chooseA"):
        
        #compare prediction according to method with the correct answer
        #return 1 or 0 accordingly
        prediction=self.predict(method=method)
        #print(prediction, self.answer)
        if prediction ==self.answer:
            return 1
        else:
            return 0

class scc_reader:
    
    def __init__(self,qs=questions,ans=answers):
        self.qs=qs
        self.ans=ans
        self.read_files()
        
    def read_files(self):
        
        #read in the question file
        with open(self.qs) as instream:
            csvreader=csv.reader(instream)
            qlines=list(csvreader)
        
        #store the column names as a reverse index so they can be used to reference parts of the question
        question.colnames={item:i for i,item in enumerate(qlines[0])}
        
        #create a question instance for each line of the file (other than heading line)
        self.questions=[question(qline) for qline in qlines[1:]]
        
        #read in the answer file
        with open(self.ans) as instream:
            csvreader=csv.reader(instream)
            alines=list(csvreader)
            
        #add answers to questions so predictions can be checked    
        for q,aline in zip(self.questions,alines[1:]):
            q.add_answer(aline)
        
    def get_field(self,field):
        return [q.get_field(field) for q in self.questions] 
    
    def predict(self,method="chooseA"):
        return [q.predict(method=method) for q in self.questions]
    
    def predict_and_score(self,method="chooseA"):
        scores=[q.predict_and_score(method=method) for q in self.questions]
        return sum(scores)/len(scores)
    
            

In [117]:
SCC = scc_reader()

In [101]:
SCC.get_field("b)")

['instantaneously',
 'privately',
 'moon',
 'doubling',
 'parcel',
 'stick',
 'communication',
 'speedy',
 'farmhouse',
 'intermittently',
 'begged',
 'stars',
 'delicate',
 'cheers',
 'advocate',
 'prospect',
 'accustomed',
 'dared',
 'moonlight',
 'meditation',
 'weak',
 'touched',
 'seamanship',
 'affairs',
 'wayside',
 'knocker',
 'darkly',
 'inevitable',
 'glanced',
 'abilities',
 'confirmed',
 'misfortunes',
 'shadow',
 'marched',
 'universities',
 'pedantic',
 'backed',
 'varnish',
 'stripped',
 'mellowed',
 'control',
 'resolution',
 'correspondence',
 'ride',
 'choose',
 'note',
 'encumbered',
 'affliction',
 'dirty',
 'smiling',
 'running',
 'surrounded',
 'message',
 'childish',
 'folded',
 'translated',
 'tired',
 'softened',
 'slouched',
 'matters',
 'related',
 'chamber',
 'struggle',
 'smiled',
 'realised',
 'explain',
 'thicket',
 'weather',
 'saints',
 'devil',
 'bent',
 'muddle',
 'contradict',
 'sword',
 'pleasantries',
 'indirect',
 'loud',
 'parties',
 'peace',
 'e

In [107]:
SCC.predict()

['a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a'

In [29]:
SCC.predict_and_score()

0.19903846153846153

### Adding a random choice

In [None]:
SCC.predict(method="chooseRandom")


In [94]:
SCC.predict_and_score(method="chooseRandom")

0.20961538461538462

### Using the language model
using unigram probabilities

In [118]:
SCC.predict(method="chooseUnigramProb")

I have it from the same source that you are both an orphan and a bachelor and are _____ alone in London.


KeyError: '__END'
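
The `KeyError` is consistent with `nextlikely` returning the most probable token in the whole vocabulary (`__END` is tied for the top of the unigram list shown earlier), which `get_field` then tries to use as a column name. What the task needs instead is a comparison of the five candidate words only. The following is a minimal sketch of that idea, assuming a word-to-probability dictionary of the same shape as `mylm.unigram`. The function name `choose_by_unigram` and the toy probabilities are illustrative and not taken from the lab code.

```python
def choose_by_unigram(choices, unigram, default=0.0):
    """Return the option key whose candidate word has the highest unigram probability.

    choices: dict mapping an option key ("a".."e") to its candidate word
    unigram: dict mapping a word to its probability
    default: probability assumed for words missing from the model
    """
    return max(choices, key=lambda key: unigram.get(choices[key], default))

# Candidate words from the first question; the probabilities are made up.
choices = {"a": "crying", "b": "instantaneously", "c": "residing",
           "d": "matched", "e": "walking"}
toy_unigram = {"walking": 0.002, "crying": 0.001, "residing": 0.0001}

print(choose_by_unigram(choices, toy_unigram))
```

Inside the `question` class the `choices` dictionary could be built from `get_field("a)")` through `get_field("e)")`. Ties, including the case where no candidate is in the model, resolve to the first key.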

### Adding Context
looking up context and bigram probabilities
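
A possible shape for this step is sketched below: take the last token of the left context and pick the candidate that the bigram model considers most likely to follow it. The nested dictionary `bigram[prev][word]` is an assumption about how the probabilities might be stored, not the confirmed layout of `language_model`. The function name `choose_by_bigram` and the toy probabilities are likewise illustrative.

```python
def choose_by_bigram(choices, left_context, bigram):
    """Return the option key whose word is most probable after the last left-context token.

    choices: dict mapping an option key to its candidate word
    left_context: list of tokens preceding the blank (may be empty)
    bigram: dict of dicts, bigram[prev][word] = P(word | prev)  (assumed layout)
    """
    prev = left_context[-1] if left_context else "__START"
    following = bigram.get(prev, {})
    return max(choices, key=lambda key: following.get(choices[key], 0.0))

# Candidate words and left context from the fifth question; probabilities are made up.
choices = {"a": "supply", "b": "parcel", "c": "sign", "d": "sheet", "e": "chorus"}
toy_bigram = {"a": {"sheet": 0.01, "parcel": 0.005, "sign": 0.004}}

print(choose_by_bigram(choices, ["hand", "a"], toy_bigram))
```

When the blank is the first token, the sketch conditions on `__START`, which matches the boundary token seen in the unigram list.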


### Right context
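
The words after the blank can be collected in the same way as the words before it. The sketch below mirrors `get_left_context`; the name `get_right_context` is an assumption, and the example uses a plain `split()` on the visible part of the fourth question rather than the NLTK tokenizer.

```python
def get_right_context(sent_tokens, window, target="_____"):
    """Return up to `window` tokens that follow the first occurrence of `target`."""
    if target not in sent_tokens:
        return []
    found = sent_tokens.index(target)
    # Slicing past the end of a list is safe, so no upper clamp is needed.
    return sent_tokens[found + 1:found + 1 + window]

tokens = "We got off , _____ our fare ,".split()
print(get_right_context(tokens, 2))
print(get_right_context(tokens, 10))   # window longer than the remaining tokens
```

With a bigram model, the right context lets each candidate be scored on two transitions, P(candidate | previous word) and P(next word | candidate), rather than on the first alone.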

### Backing off to unigram probs

Backing off might not change the decision: the correct answer may not be among the best choices returned by the bigram model.
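
One simple form of back-off is sketched here: use the bigram scores when at least one candidate has been seen after the preceding word, and fall back to unigram probabilities otherwise. This is only one of several possible back-off schemes. The dictionary layouts, the function name `choose_with_backoff`, and the toy numbers are all assumptions made for illustration.

```python
def choose_with_backoff(choices, left_context, bigram, unigram):
    """Choose by bigram probability, backing off to unigram when no bigram evidence exists."""
    prev = left_context[-1] if left_context else "__START"
    following = bigram.get(prev, {})
    bigram_scores = {key: following.get(word, 0.0) for key, word in choices.items()}
    if max(bigram_scores.values()) > 0:
        # At least one candidate was observed after `prev`: trust the bigram model.
        return max(bigram_scores, key=bigram_scores.get)
    # No candidate was observed after `prev`: back off to unigram probabilities.
    return max(choices, key=lambda key: unigram.get(choices[key], 0.0))

# Toy data; the probabilities are made up.
choices = {"a": "gods", "b": "moon", "c": "panther", "d": "guard", "e": "country-dance"}
toy_bigram = {"the": {"moon": 0.003, "guard": 0.002}}
toy_unigram = {"gods": 0.0001, "moon": 0.0004, "guard": 0.0006}

print(choose_with_backoff(choices, [",", "the"], toy_bigram, toy_unigram))  # bigram evidence
print(choose_with_backoff(choices, ["old", "ally"], toy_bigram, toy_unigram))  # backs off
```

In this scheme the unigram model is only consulted when the bigram model has nothing to say, so it cannot overturn a bigram decision that was made on sparse evidence.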

Investigate: 
* the effect of the amount of training data on each of the strategies
* plot the results on a graph - you should see a cross-over (unigram better than bigram for small training data, but bigram better than unigram for large training data)
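
The investigation above could be organised along the following lines. This is a sketch only: `run_experiment` and `find_crossover` are hypothetical helpers, and `placeholder_scorer` returns invented values so that the code runs on its own. They are not measured accuracies. In a real run the scorer would retrain the language model on the first `n` training files and return the accuracy of the chosen strategy.

```python
def run_experiment(sizes, train_and_score, methods=("unigram", "bigram")):
    """Call train_and_score(n_files, method) for every size and method; collect the accuracies."""
    results = {method: [] for method in methods}
    for n_files in sizes:
        for method in methods:
            results[method].append(train_and_score(n_files, method))
    return results

def find_crossover(sizes, results):
    """Return the first size at which bigram accuracy exceeds unigram accuracy, or None."""
    for n_files, uni, bi in zip(sizes, results["unigram"], results["bigram"]):
        if bi > uni:
            return n_files
    return None

def placeholder_scorer(n_files, method):
    """Stand-in for training and scoring; the values are invented, not results."""
    return 0.26 if method == "unigram" else 0.20 + 0.001 * n_files

sizes = [10, 50, 100, 200]
results = run_experiment(sizes, placeholder_scorer)
print(results)
print(find_crossover(sizes, results))
```

The two lists in `results` can be passed straight to a plotting library, with `sizes` on the x-axis, to look for the cross-over visually.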

Extend:
* trigram model
* incorporation of distributional similarity / word2vec vectors
* RNNLM ...?