# HW 2: N-gram Language Models

## Date Out: Thursday, February 20
## Due Date: Thursday, March 5

This programming assignment is more open-ended than the previous ones. It centers on N-gram language models and asks you to:

* download and process a large text dataset in Python using the <code>csv</code> library
* perform sentence and word tokenization
* calculate N-gram counts and probabilities
* compare the characteristics of the N-grams across different models
* generate random sentences using the models

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

In [40]:
import nltk

In [41]:
import csv

### Task #1

<u>Download two large text datasets from Kaggle.</u>

The <a href="http://kaggle.com">Kaggle competition hosting site</a> offers a number of free datasets that contain interesting text fields. For this assignment, we will use the "Wine Reviews" and "All the News" datasets. To find them, select the "Datasets" header and search for each dataset by name. Then choose "Data" from the sub-header, preview some of the csv data, and notice that at least one column in each dataset contains a substantial amount of text. I chose these two datasets because their textual content seemed interesting and should have different language characteristics, and because both are csv files large enough to generate significant n-gram counts without being unmanageably large.

<em>(You can use other datasets if you wish. Others that looked interesting on Kaggle include the "Yelp Dataset" (but it's over 3 GB!), "SMS Spam Collection Dataset", "Russian Troll Tweets", and "A Million News Headlines".)</em>

In [42]:
# Downloaded wine-reviews and all-the-news

### Task #2

<u>Process the downloaded <code>csv</code> files in python.</u>

Python already includes a nice csv library for accessing values that are stored in a comma-separated values (csv) format. Read the <a href="https://docs.python.org/3/library/csv.html">csv library documentation</a>.
What is the delimiter in your csv files? Use this library to open each of the two .csv files that you downloaded and read in the data. Note that we really only care about the text column in this assignment.
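
One way to check the delimiter programmatically is sketched below. It assumes a small made-up sample in place of a real downloaded file, and `detect_delimiter` is a hypothetical helper name; `csv.Sniffer` is a heuristic, so its guess is worth confirming by previewing the file.

```python
import csv
import io

def detect_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter of a CSV sample with csv.Sniffer (a heuristic)."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Made-up sample standing in for the first few lines of a downloaded file.
sample = "id,title,content\n1,Hello,Some text\n2,World,More text\n"
delimiter = detect_delimiter(sample)

# Keep only the text column, since that is all the assignment needs.
reader = csv.DictReader(io.StringIO(sample), delimiter=delimiter)
texts = [row["content"] for row in reader]
print(delimiter, texts)
```

For a real file, the sample could be the first few kilobytes read from the opened file, followed by `seek(0)` before constructing the reader.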

In [84]:
# Use the head.csv file in each folder for testing
# Both csv files are comma delimited

#Field limit
import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

news = open("all-the-news/articles1.csv", "r")
wine = open("wine-reviews/winemag-data_first150k.csv", "r")
#news = open("all-the-news/head1000.csv", "r")
#wine = open("wine-reviews/head1000.csv", "r")
#news = open("all-the-news/head500.csv", "r")
#wine = open("wine-reviews/head500.csv", "r")
#news = open("all-the-news/head50.csv", "r")
#wine = open("wine-reviews/head50.csv", "r")
#news = open("all-the-news/head.csv", "r")
#wine = open("wine-reviews/head.csv", "r")

reviews = []
articles = []
with news as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        articles.append(lines["content"])
with wine as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        reviews.append(lines["description"])

In [85]:
#print(reviews)
#print(articles)

### Task #3

<u>Perform sentence segmentation and word tokenization.</u>

Utilize the nltk module to perform sentence segmentation and word tokenization. But at this point, there are a few decisions that need to be made:

* How should we handle the .csv rows from the previous step? If we ignore row markers and "lump everything together", how will that affect our language model?
* Do we want to remove punctuation? What is the effect of keeping punctuation in the model?
* Do we want to add sentence boundary markers, such as <samp>&lt;S&gt;</samp> and <samp>&lt;/S&gt;</samp>?
* Should the two words <samp>The</samp> and <samp>the</samp> be treated as the same? What are the effects of doing, or not doing, this?
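
The effect of two of these decisions, case folding and boundary markers, can be seen on a toy example. The sketch below is only an illustration: the two sentences are made up, the helper names are hypothetical, and a crude regular-expression tokenizer (which also drops punctuation) stands in for nltk so the block runs on its own.

```python
import re
from collections import Counter

def tokenize(sentence, lowercase=True):
    """Crude stand-in tokenizer: keeps runs of word characters, drops punctuation."""
    if lowercase:
        sentence = sentence.lower()
    return re.findall(r"\w+", sentence)

def add_boundaries(tokens):
    """Wrap one sentence in start and end markers."""
    return ["<S>"] + tokens + ["</S>"]

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

sentences = ["The wine is dry.", "the finish is long."]

cased_bigrams = Counter()
folded_bigrams = Counter()
for sentence in sentences:
    cased_bigrams.update(bigrams(add_boundaries(tokenize(sentence, lowercase=False))))
    folded_bigrams.update(bigrams(add_boundaries(tokenize(sentence, lowercase=True))))

# Without case folding, "The" and "the" are counted as different words.
print(cased_bigrams[("<S>", "The")], cased_bigrams[("<S>", "the")])  # 1 1
# With case folding, the two sentence openings are merged.
print(folded_bigrams[("<S>", "the")])  # 2
# Boundary markers keep bigrams from spanning two sentences.
print(folded_bigrams[("dry", "the")])  # 0
```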

In [86]:
# Rows will be lumped into a single string.  A single " " will be added to the end of each row to ensure sentences
# are not being combined (Ex. "lastword. firstword" instead of "lastword.firstword")
# Punctuation will NOT be kept when counting n-grams
# "The" and "the" will be treated as the same word; this will decrease the number of distinct n-grams
# N-gram words will be counted within sentence boundaries (a trigram/bigram will not overlap into another sentence)
import string
import datetime

reviewsraw = ""
articlesraw = ""
print("READ REVIEWS")
for review in reviews:
    #review = review.translate(str.maketrans('', '', string.punctuation))
    review = review.lower()
    reviewsraw += review + " "
print("READ REVIEWS COMPLETE")
print("READ NEWS")
for article in articles:
    #article = article.translate(str.maketrans('', '', string.punctuation))
    article = article.lower()
    articlesraw += article + " "
print("READ NEWS COMPLETE")
    
#reviewTokens = nltk.word_tokenize(reviewsraw)
print("SENT TOKE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
reviewSents = nltk.sent_tokenize(reviewsraw)
print("SENT TOKE REVIEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
print("SENT TOKE NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
#articleTokens = nltk.word_tokenize(articlesraw)
articleSents = nltk.sent_tokenize(articlesraw)
print("SENT TOKE NEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

READ REVIEWS
READ REVIEWS COMPLETE
READ NEWS
READ NEWS COMPLETE
SENT TOKE REVIEWS
2020-02-23 23:21:48
SENT TOKE REVIEWS COMPLETE
2020-02-23 23:22:10
SENT TOKE NEWS
2020-02-23 23:22:10
SENT TOKE NEWS COMPLETE
2020-02-23 23:22:10


In [87]:
#print(reviewTokens)
#print(reviewSents)

### Task #4

<u>Calculate N-gram counts and compute probabilities.</u>

Use a Python dictionary (or any suitable data structure) to first compute unigram counts. Then try bigram counts. Finally, trigram counts.

How much memory are you using? How fast, or slow, is the code -- how long is this step taking? If it is taking too long, try only using a fraction of your corpus: instead of loading the entire .csv file, try only reading the first 1000 rows of data.
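
A rough way to try a smaller slice and to get timing and memory numbers is sketched below. It assumes made-up in-memory data in place of a real file, and `read_text_column` is a hypothetical helper. Note that `tracemalloc` only tracks memory allocated by Python, so it will not match the process size reported by the operating system.

```python
import csv
import io
import time
import tracemalloc
from itertools import islice

def read_text_column(csv_file, column, max_rows=None):
    """Return one column from at most max_rows rows of an open CSV file."""
    reader = csv.DictReader(csv_file)
    # islice stops early; max_rows=None reads every row.
    return [row[column] for row in islice(reader, max_rows)]

# Made-up data standing in for a large downloaded file.
fake_csv = "id,description\n" + "".join(
    f"{i},review number {i}\n" for i in range(5000)
)

tracemalloc.start()
start = time.perf_counter()
rows = read_text_column(io.StringIO(fake_csv), "description", max_rows=1000)
elapsed = time.perf_counter() - start
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(rows)} rows in {elapsed:.4f} s, peak traced memory {peak_bytes / 1e6:.2f} MB")
```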

Using those counts, compute the probabilities for the unigrams, bigrams, and trigrams, and store those in a new Python dictionary (or some other data structure).
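
As a point of reference, the usual relative-frequency (maximum likelihood) estimate divides the count of an n-gram by the count of its context, that is, the n-gram without its last word. A trigram is therefore normalized by a bigram count, not a unigram count. The sketch below works on two made-up, already tokenized sentences; the helper names are hypothetical, and log probabilities are stored to avoid very small numbers.

```python
import math
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams (as tuples) without crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def conditional_log_probs(ngram_counts, context_counts):
    """log P(w_n | w_1..w_{n-1}) = log(C(w_1..w_n) / C(w_1..w_{n-1}))."""
    return {
        ngram: math.log(count / context_counts[ngram[:-1]])
        for ngram, count in ngram_counts.items()
    }

sentences = [["the", "wine", "is", "dry"], ["the", "wine", "is", "sweet"]]
unigrams = count_ngrams(sentences, 1)
bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)

total_tokens = sum(unigrams.values())
unigram_lp = {gram: math.log(count / total_tokens) for gram, count in unigrams.items()}
bigram_lp = conditional_log_probs(bigrams, unigrams)    # context is one word
trigram_lp = conditional_log_probs(trigrams, bigrams)   # context is two words

print(math.exp(trigram_lp[("wine", "is", "dry")]))  # "dry" follows "wine is" in 1 of 2 cases
```

Without an end-of-sentence marker, a context that also occurs at the end of a sentence has probabilities that sum to less than 1 over its continuations, which is one argument for adding the markers discussed in Task #3.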

In [88]:
# Starting memory usage 1.7 GB -  - 10:54 AM
# Tokenize ended 11:05 AM - memory at 5.9 GB - Starting ngram counts and prob at 11:06 AM
import math
from nltk.util import ngrams
from collections import Counter
from multiprocessing import Process
import multiprocessing
import datetime

def ngram_function(index, sentences, tokenCountDict, unigramsDict, bigramsDict, trigramsDict, stopwords):
    sentCount = 0
    
    #use temp vars to avoid race conditions
    tmpunigramsDict = Counter()
    tmpbigramsDict = Counter()
    tmptrigramsDict = Counter()
    tmptokenCount = 0
    
    for sent in sentences[index]:
        if sentCount % 1000 == 0:
            now = datetime.datetime.now()
            print("Thread " + str(index) + ": " + str(sentCount) + " of " + str(len(sentences[index])) + " " + now.strftime("%Y-%m-%d %H:%M:%S"))
        sent = sent.translate(str.maketrans('', '', string.punctuation))
        words = nltk.word_tokenize(sent)
        # Drop unwanted tokens in a single pass
        words = [word for word in words if word not in stopwords]
        tmptokenCount += len(words)
        tmpunigramsDict += Counter(ngrams(words, 1))
        tmpbigramsDict += Counter(ngrams(words, 2))
        tmptrigramsDict += Counter(ngrams(words, 3))
        sentCount += 1
    tokenCountDict[index] = tmptokenCount
    unigramsDict[index] = tmpunigramsDict
    bigramsDict[index] = tmpbigramsDict
    trigramsDict[index] = tmptrigramsDict

def processUnigrams(tokenCount, rawCounter, percentDict, countDict):
    for word in rawCounter:
        firstword = list(word)[0]
        count = rawCounter[(firstword,)]
        if count > 0:
            percent = math.log(count / tokenCount)
            percentDict.update({firstword : percent})        
            countDict.update({firstword : count})
        
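def processNGrams(rawCounter, reviewUGCount, percentDict, countDict):
    # Stores log P(last word | preceding words) for each n-gram.
    # Bigrams are normalized by the unigram count of their first word. For longer
    # n-grams the context is more than one word, so its count is rebuilt here by
    # summing the counts of all n-grams that share the same prefix.
    prefixTotals = Counter()
    for ngram, count in rawCounter.items():
        prefixTotals[ngram[:-1]] += count
    for ngram, count in rawCounter.items():
        if count > 0:
            firstword = ngram[0]
            if firstword in reviewUGCount:
                total = reviewUGCount[firstword] if len(ngram) == 2 else prefixTotals[ngram[:-1]]
                percent = math.log(count / total)
                percentDict[ngram] = percent
                countDict[ngram] = count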

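def nGramDictToCSV(path, fileName, theDict):
    theDict = dict(theDict)
    # Use a context manager so the file is flushed and closed after writing;
    # newline="" is what the csv module expects for files it writes to.
    with open(path + "/" + fileName + ".csv", "w", newline="") as csv_file:
        w = csv.writer(csv_file)
        for key, val in theDict.items():
            w.writerow([key, val])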
        
def sortDict(theDict):
    return dict(sorted(theDict.items(), key=lambda y: y[1], reverse=True))

print("START WINE REVIEWS")
reviewTokenCount = 0

reviewUnigrams = Counter()
reviewBigrams = Counter()
reviewTrigrams = Counter()

# Tokens to exclude from the n-gram results (stray quote marks and clitic fragments)
stopwords = ["'", "s", "’", "”", "“", "t"]

print("CREATING NGRAM COUNTS - WINE REIVEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

#This is taking too long, use all cores on the machine.
#Match to the number of threads on the machine
threadCount = 8

#Prep for multithread
interval = math.ceil(len(reviewSents) / threadCount)

manager = multiprocessing.Manager()
reviewThreadDict = manager.dict()
reviewUnigramsDict = manager.dict()
reviewBigramsDict = manager.dict()
reviewTrigramsDict = manager.dict()
reviewTokenCountDict = manager.dict()

for x in range(threadCount):
    reviewThreadDict.update({x : reviewSents[interval*x:interval*(x+1)]})
    reviewUnigramsDict.update({x : {}})
    reviewBigramsDict.update({x : {}})
    reviewTrigramsDict.update({x : {}})
    reviewTokenCountDict.update({x : 0})

print("Review sentence length: " + str(len(reviewSents)))    
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, reviewThreadDict, reviewTokenCountDict, reviewUnigramsDict, reviewBigramsDict, reviewTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]       
        thread.join()        
        reviewUnigrams += reviewUnigramsDict[a]
        reviewBigrams += reviewBigramsDict[a]
        reviewTrigrams += reviewTrigramsDict[a]
        reviewTokenCount += reviewTokenCountDict[a]
        print("End thread " + str(a))

print("All threads complete.")

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - WINE REIVEWS")

reviewNGramCounts = {
    "unigrams" : Counter(reviewUnigrams),
    "bigrams" : Counter(reviewBigrams),
    "trigrams" : Counter(reviewTrigrams)
}

reviewUGPercent = {}
reviewUGCount = {}
reviewBGPercent = {}
reviewBGCount = {}
reviewTGPercent = {}
reviewTGCount = {}

print("PROCCESSING UNIGRAMS PERCENTAGE - WINE REVIEWS")
processUnigrams(reviewTokenCount, reviewNGramCounts["unigrams"], reviewUGPercent, reviewUGCount)
print("PROCCESSING BIGRAM PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["bigrams"], reviewUGCount, reviewBGPercent, reviewBGCount)
print("PROCCESSING TRIGRAMS PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["trigrams"], reviewUGCount, reviewTGPercent, reviewTGCount)

print("SORTING WINE UNIGRAM PERCENT")
reviewUGPercent = sortDict(reviewUGPercent)
print("SORTING WINE UNIGRAM COUNT")
reviewUGCount = sortDict(reviewUGCount)
print("SORTING WINE BIGRAM PERCENT")
reviewBGPercent = sortDict(reviewBGPercent)
print("SORTING WINE BIGRAM COUNT")
reviewBGCount = sortDict(reviewBGCount)
print("SORTING WINE TRIGRAM PERCENT")
reviewTGPercent = sortDict(reviewTGPercent)
print("SORTING WINE TRIGRAM COUNT")
reviewTGCount = sortDict(reviewTGCount)

print("WRITING WINE FILES")
print("WRITING WINE unigramCounts")
nGramDictToCSV("wine-reviews/", "unigramCounts", reviewUGCount)
print("WRITING WINE unigramPercent")
nGramDictToCSV("wine-reviews/", "unigramPercent", reviewUGPercent)
print("WRITING WINE bigramCounts")
nGramDictToCSV("wine-reviews/", "bigramCounts", reviewBGCount)
print("WRITING WINE bigramPercent")
nGramDictToCSV("wine-reviews/", "bigramPercent", reviewBGPercent)
print("WRITING WINE trigramCounts")
nGramDictToCSV("wine-reviews/", "trigramCounts", reviewTGCount)
print("WRITING WINE trigramPercent")
nGramDictToCSV("wine-reviews/", "trigramPercent", reviewTGPercent)

print("END WINE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

###############################################################

print("START NEWS")
articleTokenCount = 0    

articleUnigrams = Counter()
articleBigrams = Counter()
articleTrigrams = Counter()

print("CREATING NGRAM COUNTS - NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

#Prep for multithread
interval = math.ceil(len(articleSents) / threadCount)

articleThreadDict = manager.dict()
articleUnigramsDict = manager.dict()
articleBigramsDict = manager.dict()
articleTrigramsDict = manager.dict()
articleTokenCountDict = manager.dict()

for x in range(threadCount):
    articleThreadDict.update({x : articleSents[interval*x:interval*(x+1)]})
    articleUnigramsDict.update({x : {}})
    articleBigramsDict.update({x : {}})
    articleTrigramsDict.update({x : {}})
    articleTokenCountDict.update({x : 0})

print("News sentence length: " + str(len(articleSents)))
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, articleThreadDict, articleTokenCountDict, articleUnigramsDict, articleBigramsDict, articleTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]
        thread.join()
        articleUnigrams += articleUnigramsDict[a]
        articleBigrams += articleBigramsDict[a]
        articleTrigrams += articleTrigramsDict[a]
        articleTokenCount += articleTokenCountDict[a]
        print("End thread " + str(a))
        
print("All threads complete")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - NEWS")
articleNGramCounts = {
    "unigrams" : Counter(articleUnigrams),
    "bigrams" : Counter(articleBigrams),
    "trigrams" : Counter(articleTrigrams)
}

articleUGPercent = {}
articleUGCount = {}
articleBGPercent = {}
articleBGCount = {}
articleTGPercent = {}
articleTGCount = {}

print("PROCCESSING UNIGRAMS PERCENTAGE - NEWS")
processUnigrams(articleTokenCount, articleNGramCounts["unigrams"], articleUGPercent, articleUGCount)
print("PROCCESSING BIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["bigrams"], articleUGCount, articleBGPercent, articleBGCount)
print("PROCCESSING TRIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["trigrams"], articleUGCount, articleTGPercent, articleTGCount)

print("SORTING NEWS UNIGRAM PERCENT")
articleUGPercent = sortDict(articleUGPercent)
print("SORTING NEWS UNIGRAM COUNT")
articleUGCount = sortDict(articleUGCount)
print("SORTING NEWS BIGRAM PERCENT")
articleBGPercent = sortDict(articleBGPercent)
print("SORTING NEWS BIGRAM COUNT")
articleBGCount = sortDict(articleBGCount)
print("SORTING NEWS TRIGRAM PERCENT")
articleTGPercent = sortDict(articleTGPercent)
print("SORTING NEWS TRIGRAM COUNT")
articleTGCount = sortDict(articleTGCount)

print("WRITING NEWS FILES")
print("WRITING NEWS unigramCounts")
nGramDictToCSV("all-the-news/", "unigramCounts", articleUGCount)
print("WRITING NEWS unigramPercent")
nGramDictToCSV("all-the-news/", "unigramPercent", articleUGPercent)
print("WRITING NEWS bigramCounts")
nGramDictToCSV("all-the-news/", "bigramCounts", articleBGCount)
print("WRITING NEWS bigramPercent")
nGramDictToCSV("all-the-news/", "bigramPercent", articleBGPercent)
print("WRITING NEWS trigramCounts")
nGramDictToCSV("all-the-news/", "trigramCounts", articleTGCount)
print("WRITING NEWS trigramPercent")
nGramDictToCSV("all-the-news/", "trigramPercent", articleTGPercent)

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("END NEWS")

print("DONE")

START WINE REVIEWS
CREATING NGRAM COUNTS - WINE REIVEWS
2020-02-23 23:22:29
Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
[Interleaved per-process progress output omitted: each of the 8 processes reported its position every 100 sentences within its slice of 50863 sentences (50856 for process 7). The captured output is truncated and resumes partway through the news run.]

SORTING NEWS UNIGRAM PERCENT
SORTING NEWS UNIGRAM COUNT
SORTING NEWS BIGRAM PERCENT
SORTING NEWS BIGRAM COUNT
SORTING NEWS TRIGRAM PERCENT
SORTING NEWS TRIGRAM COUNT
WRITING NEWS FILES
WRITING NEWS unigramCounts
WRITING NEWS unigramPercent
WRITING NEWS bigramCounts
WRITING NEWS bigramPercent
WRITING NEWS trigramCounts
WRITING NEWS trigramPercent
2020-02-23 23:57:29
END NEWS
DONE


### Task #5

<u>Compare the statistics of the corpora.</u>
                        
Use the results of those calculations that you just made the poor computer painstakingly compute. What are the differences in the most common unigrams between the two language models? Are there interesting differences between the bigram models or trigram models?

Be able to sort the n-grams to output the top k with the highest count or probability.
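
A minimal sketch of a top-k lookup follows. The counts are made up and `top_k` is a hypothetical helper; it works for any dictionary of scores (counts or log probabilities), while `Counter.most_common` covers the count case directly.

```python
import heapq
from collections import Counter

def top_k(scores, k):
    """Return the k (item, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Made-up unigram counts keyed by 1-tuples.
counts = Counter({("the",): 5, ("wine",): 3, ("dry",): 1, ("is",): 4})

print(top_k(counts, 2))        # [(('the',), 5), (('is',), 4)]
print(counts.most_common(2))   # same result for a Counter
```

`heapq.nlargest` avoids sorting the whole dictionary, which can matter when there are millions of distinct trigrams and only the top 15 are needed.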

In [9]:
# PYTHON CODE HERE

### Task #6

<u>Generate random sentences from the N-grams models for both datasets.</u>
                        
We briefly talked about this idea in class. It's also introduced at a high level in J&M 4.3. How can a random number in the range [0,1] probabilistically generate a word using your model?
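
One possible answer is to lay the candidate words end to end along the interval [0, 1), each taking a segment as wide as its probability, and pick the word whose segment contains the random number. The sketch below does this for a bigram model built from two made-up sentences. All helper names are hypothetical, and it assumes boundary markers were added so that drawing `</S>` ends the sentence.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(sentences):
    """Map each context word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<S>"] + tokens + ["</S>"]
        for prev, nxt in zip(padded, padded[1:]):
            model[prev][nxt] += 1
    return model

def sample_next(followers, r):
    """Walk the cumulative distribution and return the first word whose total exceeds r."""
    total = sum(followers.values())
    cumulative = 0.0
    for word, count in followers.items():
        cumulative += count / total
        if r < cumulative:
            return word
    return word  # guards against floating-point round-off when r is very close to 1

def generate(model, rng, max_len=20):
    """Generate words until the end marker is drawn or max_len is reached."""
    words, current = [], "<S>"
    while len(words) < max_len:
        current = sample_next(model[current], rng.random())
        if current == "</S>":
            break
        words.append(current)
    return words

sentences = [["the", "wine", "is", "dry"], ["the", "finish", "is", "long"]]
model = build_bigram_model(sentences)

# "wine" owns [0, 0.5) and "finish" owns [0.5, 1) after the context "the".
print(sample_next(model["the"], 0.25), sample_next(model["the"], 0.75))  # wine finish
print(" ".join(generate(model, random.Random(0))))
```

The `max_len` cap is a safety net for models in which the end marker is rare; a unigram model would need a similar stopping rule, since it has no context to condition on.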

In [None]:
# PYTHON CODE HERE

### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include the following details in your report:

1. Names of the datasets used.
1. Does your model use all of the data in the .csv file or only a subset of it (e.g., the first 1,000 rows)?
1. What is the vocabulary and size of each dataset?
1. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in Task #3.
1. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?
1. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?
1. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

Also submit this Python notebook `.ipynb` to D2L.

In [None]:
# PYTHON CODE AND REPORT HERE