Task: Implement an n-gram Language Model on the BeRP Dataset
----

Your language model:
- *must* work for the unigram, bigram, and trigram cases (5 points are allocated to an experiment involving larger values of `n`)
    - hint: try to implement the bigram case as a generalized "n greater than 1" case
- should be *token agnostic*: given text tokenized as single characters, it functions as a character language model; given text tokenized as "words" (the "traditional" approach), it functions as a language model over those tokens
- will use Laplace smoothing
- will replace all tokens that occur only once with `<UNK>` at train time
    - do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
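
The two training-time requirements above (singleton replacement and Laplace smoothing) can be sketched as follows. This is a minimal illustration, not the reference solution: the helper names `replace_singletons`, `make_ngrams`, `count_tables`, and `laplace_prob` are hypothetical and are not part of `lm_model`, and the toy corpus is invented. It also assumes the context count is taken as the number of n-grams that start with that context, which is one common convention; your unit tests may expect a different one.

```python
from collections import Counter

UNK = "<UNK>"

def replace_singletons(tokens):
    """Replace every token that occurs exactly once with <UNK>."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]

def make_ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_tables(tokens, n):
    """Count n-grams and their (n-1)-token contexts; works for any n >= 1."""
    ngrams = make_ngrams(tokens, n)
    # for n == 1 the context is the empty tuple, so its count is the corpus size
    return Counter(ngrams), Counter(g[:-1] for g in ngrams)

def laplace_prob(ngram, ngram_counts, context_counts, vocab_size):
    """Add-one estimate: (count(ngram) + 1) / (count(context) + |V|)."""
    return (ngram_counts[ngram] + 1) / (context_counts[ngram[:-1]] + vocab_size)

# toy corpus (invented for illustration)
tokens = "<s> i am sam </s> <s> sam i am </s> <s> i like ham </s>".split()
train = replace_singletons(tokens)
# <UNK> enters the vocabulary only if some token was actually replaced
vocab = set(train)

unigram_counts, unigram_contexts = count_tables(train, 1)
bigram_counts, bigram_contexts = count_tables(train, 2)

print(laplace_prob(("i",), unigram_counts, unigram_contexts, len(vocab)))
print(laplace_prob(("i", "am"), bigram_counts, bigram_contexts, len(vocab)))
```

Because the vocabulary is built from the tokens after replacement, `<UNK>` only appears in it when at least one singleton existed, which matches the last requirement in the list. The same `count_tables` call handles the unigram case and any `n` greater than 1.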

In [1]:
import lm_model as lm
import numpy as np
import matplotlib.pyplot as plt
import statistics

In [2]:
# test the language model (unit tests)
import test_minitrainingprovided as test

# Passing all of these tests is a good indication that your model
# is correct, but it is *not a guarantee*, so make sure to look
# at the tests and the cases that they cover. (We'll also be testing
# your model against all of the testing data.)

# The Gradescope autograder assigns SIXTY points; this is
# essentially 60 points for correctly implementing your
# underlying model.
# An additional 10 points are manually graded for the correctness
# of your sentence generation.

# make sure all training files are in a "training_files" directory 
# that is in the same directory as this notebook

unittest = test.TestMiniTraining()
unittest.test_createunigrammodellaplace()
unittest.test_createbigrammodellaplace()
unittest.test_unigramlaplace()
unittest.test_unigramunknownslaplace()
unittest.test_bigramlaplace()
unittest.test_bigramunknownslaplace()
# produces output
unittest.test_generateunigramconcludes()
# produces output
unittest.test_generatebigramconcludes()

unittest.test_onlyunknownsgenerationandscoring()

[['<s>', 'am', 'ham', 'am', '</s>'], ['<s>', '</s>']]
[['<s>', 'sam', '</s>'], ['<s>', 'i', 'am', '</s>']]


In [3]:
# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"
# optional parameter tells the tokenize function how to tokenize
by_char = False
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

model = lm.LanguageModel(ngram, training_file=training_file_path)
model.train(tokens)
print(*model.generate(5), sep='\n')

['<s>', 'start', 'over', '</s>']
['<s>', "i'd", 'like', 'some', 'uh', 'i', 'will', 'spend', 'more', 'than', 'one', 'hundred', 'dollars', '</s>']
['<s>', 'asian', 'lunch', 'in', 'berkeley', '</s>']
['<s>', "let's", 'start', 'over', '</s>']
['<s>', 'looking', 'for', 'dinner', '</s>']
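
For readers who want to see how sentences like these can be produced, here is a rough sketch of bigram generation by sampling. It is an assumption-laden illustration rather than the method `lm.LanguageModel.generate` actually uses: `build_successors` and `generate_sentence` are hypothetical helpers, the toy corpus is invented, and sampling is done from raw (unsmoothed) counts for brevity.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each token to a Counter of the tokens observed right after it."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        if prev != "</s>":  # never continue past a sentence end
            successors[prev][nxt] += 1
    return successors

def generate_sentence(successors, rng):
    """Start at <s> and sample the next token until </s> is drawn."""
    sentence = ["<s>"]
    while sentence[-1] != "</s>":
        options = successors[sentence[-1]]
        words = list(options)
        weights = list(options.values())
        # sample proportionally to how often each word followed the last one
        sentence.append(rng.choices(words, weights=weights, k=1)[0])
    return sentence

# toy corpus (invented for illustration)
toy_tokens = "<s> i am sam </s> <s> sam i am </s>".split()
successors = build_successors(toy_tokens)
rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(3):
    print(" ".join(generate_sentence(successors, rng)))
```

Excluding `</s>` as a context keeps `<s>` from being sampled in the middle of a sentence, so every generated sentence has exactly one start and one end marker.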


In [4]:
# evaluate your bigram model on the test data
# score each line in the test data individually, then calculate the average score
# you need not re-train your model
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

scores = []

for sentence in test_data:
    tok = lm.tokenize_line(sentence, ngram, by_char)
    scores.append(model.score(tok))


# Print out the mean score and standard deviation
# for words-as-tokens, these values should be
# ~4.9 * 10^-5 and 0.000285
print("Average = ", sum(scores)/len(scores))
print("Standard Deviation = ", statistics.stdev(scores))

Average =  4.962082362726267e-05
Standard Deviation =  0.000286735365135695


In [6]:
# see if you can train your model on the data you found for your first homework
n = 30
model = lm.LanguageModel(n, r'./training_files/cancer_small_data.txt')
model.train()
# what is the maximum value of n <= 10 that you can train a model
# *in your programming environment* in a reasonable amount of time?
# (less than 3 - 5 minutes)
# Tried n = 100 and it worked; training took about 1 minute.

In [7]:
# generate three sentences with this model
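for cnt in range(1, 4):
    print(f"SENTENCE {cnt}")
    sentence1 = model.generate_sentence()
    print("Original Sentence Length ", len(sentence1))
    # NOTE: the slice below skips the first n-1 tokens (assumed to be the
    # leading <s> padding) and keeps 51 tokens, not 100 as the label says
    print("First 100 Words:")
    print(" ".join(sentence1[n-1:n+50]))
    print("*"*20)
    print()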

SENTENCE 1
Original Sentence Length  4677
First 100 Words:
<UNK> the frequency and clinical features of lungnodules in IgG4 related disease IgG4RD patients as an insight for help with the diagnosis of lung nodulesMethods A retrospective study was carried out in West China Hospital Sichuan University from January toDecember patients with definite IgG4RD were <UNK> Fifty of patients with definite
********************

SENTENCE 2
Original Sentence Length  3327
First 100 Words:
Human transcription factor and protein kinase gene fusions in human <UNK> <UNK> <UNK> G <UNK> <UNK> <UNK> gene fusions are estimated to account for upto of cancer morbidity Recently <UNK> studies have established oncofusions throughout all tissue types However the functional implications of the identified oncofusions have often not been investigated
********************

SENTENCE 3
Original Sentence Length  3095
First 100 Words:
This study aimed to investigate serum matrix metalloproteinase MMP2 and MMP9levels in pa

Implement a `perplexity` function and evaluate the perplexity of your model on the first 20 lines of the test data for values of `n` from 1 to 3. Calculate perplexity individually for each line.
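
As a reminder of the quantity being asked for, the perplexity of a sequence `W = w_1 ... w_N` is `P(W) ** (-1/N)`. The sketch below computes it in log space from per-token probabilities. It is only an illustration under stated assumptions: the standalone `perplexity` function here is hypothetical (it is not the `model.perplexity` method), the probabilities in `line_probs` are invented, and `N` is taken to be the number of scored tokens, whereas conventions differ on whether `<s>` is counted.

```python
import math

def perplexity(probabilities):
    """Perplexity of one sequence given its per-token conditional probabilities."""
    n = len(probabilities)
    # sum logs instead of multiplying raw probabilities to avoid underflow
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / n)

# invented per-token probabilities for three "lines" of text
line_probs = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.125],
    [0.1, 0.2, 0.05],
]

# perplexity is computed separately for each line, then summarized
per_line = [perplexity(probs) for probs in line_probs[:20]]
print(per_line)
print("mean perplexity:", sum(per_line) / len(per_line))
```

Lower perplexity means the model found the line less surprising; a line whose tokens each receive probability 0.25 has a perplexity of 4.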

In [8]:
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

for ngram in range(1, 4):
    print("********")
    print("Ngram model:", ngram)
    # note: this cell trains on the test file and scores 10 generated
    # sentences, not the first 20 lines of test_data named in the task
    model = lm.LanguageModel(ngram, training_file=test_path)
    model.train()
    sentences = model.generate(10)
    perplexities = [model.perplexity(sentence) for sentence in sentences]
    print(*[" ".join(sentence) for sentence in sentences], sep='\n')
    print(f"############ Mean Perplexity {sum(perplexities)/len(perplexities)} ############\n")

********
Ngram model: 1
<s> restaurant </s>
<s> me <UNK> restaurant restaurant please like list i night <UNK> dollars some for could about is icsi dinner know <UNK> have any would food </s>
<s> dollars to uh <UNK> </s>
<s> i'd </s>
<s> don't </s>
<s> are there i give for computer restaurant on </s>
<s> distance i'd than give </s>
<s> to pizza want have <UNK> go food would is restaurant <UNK> best know information list travel is <UNK> restaurants the </s>
<s> </s>
<s> <UNK> <UNK> <UNK> more a are expensive do <UNK> <UNK> <UNK> i <UNK> cost best restaurant about </s>
############ Mean Perplexity 41.94660860789063 ############

********
Ngram model: 2
<s> do you at <UNK> not have <UNK> for breakfast any of the restaurant </s>
<s> i'd like to the best <UNK> restaurant </s>
<s> i'd like to eat <UNK> information on <UNK> how about <UNK> </s>
<s> what is <UNK> from the list the best <UNK> </s>
<s> i'd like to eat on <UNK> not expensive </s>
<s> the restaurants <UNK> <UNK> </s>
<s> uh i'm will