# HW 2: N-gram Language Models

## Date Out: Thursday, February 20
## Due Date: Thursday, March 5

This programming assignment is more open-ended than the previous ones. It centers on N-gram language models and asks you to:

* download and process a large text dataset in python using the <code>csv</code> library
* perform sentence and word tokenization
* calculate N-gram counts and probabilities
* compare the characteristics of the N-grams across different models
* generate random sentences using the models

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

In [1]:
import nltk

In [2]:
import csv

### Task #1

<u>Download two large text datasets from Kaggle.</u>

The <a href="http://kaggle.com">Kaggle competition hosting site</a> offers a number of free datasets that contain interesting text fields. For this assignment, we will use the "Wine Reviews" and "All the News" datasets. They can be accessed by selecting the "Datasets" header and then searching for these specific datasets. Then, choose "Data" from the sub-header, preview some of the csv data and notice how at least one of the columns in the dataset will contain sufficient text. I chose to direct you to these two datasets because the textual content seemed interesting and would have different language characteristics, and both were large csv files that could generate significant n-gram counts, but not be too large of a file.

<em>(You can use other datasets if you wish. Others that looked interesting on Kaggle include the "Yelp Dataset" (but it's over 3 GB!), "SMS Spam Collection Dataset", "Russian Troll Tweets", and "A Million News Headlines".)</em>

In [3]:
# Downloaded wine-reviews and all-the-news

### Task #2

<u>Process the downloaded <code>csv</code> files in python.</u>

Python already includes a nice csv library for accessing values that are stored in comma-separated values (csv) format. Read the <a href="https://docs.python.org/3/library/csv.html">csv library documentation</a>.
What is the delimiter in your csv files? Open each of the two .csv files that you downloaded using this library and be able to read in the data. Note that we really only care about the text column in this assignment.
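
As a minimal sketch of this reading step, the snippet below pulls a single text column out of csv data with `csv.DictReader` and optionally stops after a fixed number of rows. The helper name `read_text_column`, the column name `description`, and the in-memory sample are assumptions made for illustration; substitute the real file and column.

```python
import csv
import io
from itertools import islice

def read_text_column(csv_file, column, max_rows=None):
    """Return the values of one column, optionally limited to the first max_rows rows."""
    reader = csv.DictReader(csv_file, delimiter=",")
    rows = islice(reader, max_rows)  # max_rows=None reads every row
    return [row[column] for row in rows]

# Hypothetical in-memory sample standing in for a downloaded file.
sample = io.StringIO(
    'id,description\n'
    '0,"Ripe, fruity wine."\n'
    '1,"Tart and snappy."\n'
    '2,"Dry, with firm tannins."\n'
)
texts = read_text_column(sample, "description", max_rows=2)
print(texts)
```

With a real file, the csv documentation recommends opening it with `newline=""`; the right `encoding` depends on the download and is not assumed here.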

In [4]:
# Use the head.csv file in each folder for testing
# Both csv files are comma delimited

#Field limit
import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

#news = open("all-the-news/articles1.csv", "r")
#wine = open("wine-reviews/winemag-data-130k-v2.csv", "r")
#news = open("all-the-news/head1000.csv", "r")
#wine = open("wine-reviews/head1000.csv", "r")
#news = open("all-the-news/head500.csv", "r")
#wine = open("wine-reviews/head500.csv", "r")
#news = open("all-the-news/head50.csv", "r")
#wine = open("wine-reviews/head50.csv", "r")
news = open("all-the-news/head.csv", "r")
wine = open("wine-reviews/head.csv", "r")

reviews = []
articles = []
with news as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        articles.append(lines["content"])
with wine as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        reviews.append(lines["description"])

In [5]:
#print(reviews)
#print(articles)

### Task #3

<u>Perform sentence segmentation and word tokenization.</u>

Utilize the nltk module to perform sentence segmentation and word tokenization. But at this point, there are a few decisions that need to be made:

* How should we handle the .csv rows from the previous step? If we ignore row markers and "lump everything together", how will that affect our language model?
* Do we want to remove punctuation? What is the effect of keeping punctuation in the model?
* Do we want to add sentence boundary markers, such as <samp>&lt;S&gt;</samp> and <samp>&lt;/S&gt;</samp>?
* Should the two words <samp>The</samp> and <samp>the</samp> be treated as the same? What are the effects of doing, or not doing, this?
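
To make the boundary-marker and case-folding questions concrete, here is a small sketch that lowercases a tokenized sentence and pads it with markers. The helper name `mark_sentence` and the choice of `n - 1` start markers are assumptions for illustration, not a requirement of the assignment.

```python
def mark_sentence(tokens, n=2, lowercase=True):
    """Pad a token list with n-1 start markers and one end marker."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return ["<S>"] * (n - 1) + tokens + ["</S>"]

sentence = ["The", "wine", "is", "dry"]
bigram_ready = mark_sentence(sentence, n=2)
trigram_ready = mark_sentence(sentence, n=3)
print(bigram_ready)
print(trigram_ready)
```

With markers in place, the model can learn which words tend to open or close a sentence, and a generator can stop when it draws the end marker.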

In [6]:
# Rows will be lumped into a single string.  A single " " will be added to the end of each row to ensure sentences
# are not being combined (Ex. "lastword. firstword" instead of "lastword.firstword")
# Punctuation will NOT be kept when counting n-grams
# The and the will be treated as the same word, this will decrease total n-grams
# N-gram words will be counted within sentence boundaries (a trigram/bigram will not overlap into another sentence)
import string
import datetime

reviewsraw = ""
articlesraw = ""
print("READ REVIEWS")
for review in reviews:
    #review = review.translate(str.maketrans('', '', string.punctuation))
    review = review.lower()
    reviewsraw += review + " "
print("READ REVIEWS COMPLETE")
print("READ NEWS")
for article in articles:
    #article = article.translate(str.maketrans('', '', string.punctuation))
    article = article.lower()
    articlesraw += article + " "
print("READ NEWS COMPLETE")
    
#reviewTokens = nltk.word_tokenize(reviewsraw)
print("SENT TOKE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
reviewSents = nltk.sent_tokenize(reviewsraw)
print("SENT TOKE REVIEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
print("SENT TOKE NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
#articleTokens = nltk.word_tokenize(articlesraw)
articleSents = nltk.sent_tokenize(articlesraw)
print("SENT TOKE NEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

READ REVIEWS
READ REVIEWS COMPLETE
READ NEWS
READ NEWS COMPLETE
SENT TOKE REVIEWS
2020-02-27 17:50:06
SENT TOKE REVIEWS COMPLETE
2020-02-27 17:50:06
SENT TOKE NEWS
2020-02-27 17:50:06
SENT TOKE NEWS COMPLETE
2020-02-27 17:50:06


In [7]:
#print(reviewTokens)
#print(reviewSents)

### Task #4

<u>Calculate N-gram counts and compute probabilities.</u>

Use a python dictionary (or any suitable data structure) to first compute unigram counts. Then try bigram counts. Finally, trigram counts.

How much memory are you using? How fast, or slow, is the code -- how long is this step taking? If it is taking too long, try only using a fraction of your corpus: instead of loading the entire .csv file, try only reading the first 1000 rows of data.
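
One way to answer the time and memory questions is to wrap the counting step in a small measuring helper. The sketch below uses the standard library's `time.perf_counter` and `tracemalloc`; the helper name `measure` and the toy workload are assumptions for illustration.

```python
import time
import tracemalloc
from collections import Counter

def measure(func, *args):
    """Run func(*args); return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical workload: count unigrams in a small repeated token list.
tokens = "the wine is dry and the finish is long".split() * 1000
counts, elapsed, peak = measure(Counter, tokens)
print(f"{elapsed:.4f} s, peak {peak / 1024:.1f} KiB, {len(counts)} distinct tokens")
```

`tracemalloc` only traces allocations made by Python in the current process, so it will not account for memory used by worker processes.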

Using those counts, compute the probabilities for the unigrams, bigrams, and trigrams, and store those in a new python dictionary (or some other data structure).
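
The counting and probability steps can be sketched as follows, assuming sentences are already tokenized. The function names and the toy corpus are hypothetical. The estimate used is the usual maximum-likelihood one, in which an n-gram's count is divided by the count of its (n-1)-word prefix, so a trigram is normalized by its two-word prefix rather than by its first word alone.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams (as tuples) without crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def conditional_probs(ngram_counts, prefix_counts):
    """MLE estimate: P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1})."""
    return {
        gram: count / prefix_counts[gram[:-1]]
        for gram, count in ngram_counts.items()
    }

# Hypothetical toy corpus, already tokenized and lowercased.
sentences = [
    ["the", "wine", "is", "dry"],
    ["the", "wine", "is", "sweet"],
    ["the", "finish", "is", "dry"],
]
unigrams = count_ngrams(sentences, 1)
bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)

bigram_probs = conditional_probs(bigrams, unigrams)
trigram_probs = conditional_probs(trigrams, bigrams)
print(bigram_probs[("the", "wine")])
print(trigram_probs[("wine", "is", "dry")])
```

Without an end-of-sentence marker, the probabilities that follow a given prefix may sum to less than 1, because a prefix that ends a sentence has no continuation. Applying `math.log` to each value gives log probabilities if those are preferred for storage.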

In [11]:

import math
from nltk.util import ngrams
from collections import Counter
from multiprocessing import Process
import multiprocessing
import datetime

def ngram_function(index, sentences, tokenCountDict, unigramsDict, bigramsDict, trigramsDict, stopwords):
    sentCount = 0
    
    #use temp vars to avoid race conditions
    tmpunigramsDict = []
    tmpbigramsDict = []
    tmptrigramsDict = []
    tmptokenCount = 0
    
    for sent in sentences[index]:
        if sentCount % 1000 == 0:
            now = datetime.datetime.now()
            print("Thread " + str(index) + ": " + str(sentCount) + " of " + str(len(sentences[index])) + " " + now.strftime("%Y-%m-%d %H:%M:%S"))
        sent = sent.translate(str.maketrans('', '', string.punctuation))
        words = nltk.word_tokenize(sent)
        for word in list(words):  
            if word in stopwords:
                words.remove(word)
        tmptokenCount += len(words)
        tmpunigramsDict += (ngrams(words, 1))
        tmpbigramsDict += (ngrams(words, 2))
        tmptrigramsDict += (ngrams(words, 3))
        sentCount += 1
    tokenCountDict[index] = tmptokenCount
    unigramsDict[index] = tmpunigramsDict
    bigramsDict[index] = tmpbigramsDict
    trigramsDict[index] = tmptrigramsDict

def processUnigrams(tokenCount, rawCounter, percentDict, countDict):
    for word in rawCounter:
        firstword = list(word)[0]
        count = rawCounter[(firstword,)]
        if count > 0:
            percent = math.log(count / tokenCount)
            percentDict.update({firstword : percent})        
            countDict.update({firstword : count})
        
def processNGrams(rawCounter, unigramCounts, percentDict, countDict):
    # Store the log relative frequency and the raw count of each n-gram.
    # Note: the denominator is the unigram count of the n-gram's FIRST word,
    # for bigrams and trigrams alike.
    for ngram in rawCounter:
        count = rawCounter[ngram]
        if count > 0:
            firstword = ngram[0]
            if firstword in unigramCounts:
                total = unigramCounts[firstword]
                percent = math.log(count / total)
                percentDict.update({ngram : percent})
                countDict.update({ngram : count})

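def nGramDictToCSV(path, fileName, theDict):
    # Write one "key,value" row per n-gram.  The with-block closes the file,
    # so every row is flushed to disk before a later cell reads it back.
    theDict = dict(theDict)
    with open(path + "/" + fileName + ".csv", "w", newline="") as csv_file:
        w = csv.writer(csv_file)
        for key, val in theDict.items():
            w.writerow([key, val])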
        
def sortDict(theDict):
    return dict(sorted(theDict.items(), key=lambda y: y[1], reverse=True))

print("START WINE REVIEWS")
reviewTokenCount = 0

reviewUnigrams = Counter()
reviewBigrams = Counter()
reviewTrigrams = Counter()

# I don't want these in my n-gram results
stopwords = ["'", "s", "’", "”", "“", "t"]

print("CREATING NGRAM COUNTS - WINE REIVEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

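# This is taking too long, so use all cores on the machine.
# Set to the number of hardware threads available.  Despite the names below,
# each worker is a separate process (multiprocessing.Process), not a thread.
threadCount = 8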

#Prep for multithread
interval = math.ceil(len(reviewSents) / threadCount)

manager = multiprocessing.Manager()
reviewThreadDict = manager.dict()
reviewUnigramsDict = manager.dict()
reviewBigramsDict = manager.dict()
reviewTrigramsDict = manager.dict()
reviewTokenCountDict = manager.dict()

for x in range(threadCount):
    reviewThreadDict.update({x : reviewSents[interval*x:interval*(x+1)]})
    reviewUnigramsDict.update({x : []})
    reviewBigramsDict.update({x : []})
    reviewTrigramsDict.update({x : []})
    reviewTokenCountDict.update({x : 0})

print("Review sentence length: " + str(len(reviewSents)))    
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, reviewThreadDict, reviewTokenCountDict, reviewUnigramsDict, reviewBigramsDict, reviewTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]       
        thread.join()        
        reviewUnigrams += Counter(reviewUnigramsDict[a])
        reviewBigrams += Counter(reviewBigramsDict[a])
        reviewTrigrams += Counter(reviewTrigramsDict[a])
        reviewTokenCount += reviewTokenCountDict[a]
        print("End thread " + str(a))

print("All threads complete.")

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - WINE REIVEWS")

reviewNGramCounts = {
    "unigrams" : Counter(reviewUnigrams),
    "bigrams" : Counter(reviewBigrams),
    "trigrams" : Counter(reviewTrigrams)
}

reviewUGPercent = {}
reviewUGCount = {}
reviewBGPercent = {}
reviewBGCount = {}
reviewTGPercent = {}
reviewTGCount = {}

print("PROCCESSING UNIGRAMS PERCENTAGE - WINE REVIEWS")
processUnigrams(reviewTokenCount, reviewNGramCounts["unigrams"], reviewUGPercent, reviewUGCount)
print("PROCCESSING BIGRAM PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["bigrams"], reviewUGCount, reviewBGPercent, reviewBGCount)
print("PROCCESSING TRIGRAMS PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["trigrams"], reviewUGCount, reviewTGPercent, reviewTGCount)

print("SORTING WINE UNIGRAM PERCENT")
reviewUGPercent = sortDict(reviewUGPercent)
print("SORTING WINE UNIGRAM COUNT")
reviewUGCount = sortDict(reviewUGCount)
print("SORTING WINE BIGRAM PERCENT")
reviewBGPercent = sortDict(reviewBGPercent)
print("SORTING WINE BIGRAM COUNT")
reviewBGCount = sortDict(reviewBGCount)
print("SORTING WINE TRIGRAM PERCENT")
reviewTGPercent = sortDict(reviewTGPercent)
print("SORTING WINE TRIGRAM COUNT")
reviewTGCount = sortDict(reviewTGCount)

print("WRITING WINE FILES")
print("WRITING WINE unigramCounts")
nGramDictToCSV("wine-reviews/", "unigramCounts", reviewUGCount)
print("WRITING WINE unigramPercent")
nGramDictToCSV("wine-reviews/", "unigramPercent", reviewUGPercent)
print("WRITING WINE bigramCounts")
nGramDictToCSV("wine-reviews/", "bigramCounts", reviewBGCount)
print("WRITING WINE bigramPercent")
nGramDictToCSV("wine-reviews/", "bigramPercent", reviewBGPercent)
print("WRITING WINE trigramCounts")
nGramDictToCSV("wine-reviews/", "trigramCounts", reviewTGCount)
print("WRITING WINE trigramPercent")
nGramDictToCSV("wine-reviews/", "trigramPercent", reviewTGPercent)

print("END WINE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

###############################################################

print("START NEWS")
articleTokenCount = 0    

articleUnigrams = Counter()
articleBigrams = Counter()
articleTrigrams = Counter()

print("CREATING NGRAM COUNTS - NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

#Prep for multithread
interval = math.ceil(len(articleSents) / threadCount)

articleThreadDict = manager.dict()
articleUnigramsDict = manager.dict()
articleBigramsDict = manager.dict()
articleTrigramsDict = manager.dict()
articleTokenCountDict = manager.dict()

for x in range(threadCount):
    articleThreadDict.update({x : articleSents[interval*x:interval*(x+1)]})
    articleUnigramsDict.update({x : []})
    articleBigramsDict.update({x : []})
    articleTrigramsDict.update({x : []})
    articleTokenCountDict.update({x : 0})

print("News sentence length: " + str(len(articleSents)))
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, articleThreadDict, articleTokenCountDict, articleUnigramsDict, articleBigramsDict, articleTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]
        thread.join()
        articleUnigrams += Counter(articleUnigramsDict[a])
        articleBigrams += Counter(articleBigramsDict[a])
        articleTrigrams += Counter(articleTrigramsDict[a])
        articleTokenCount += articleTokenCountDict[a]
        print("End thread " + str(a))
        
print("All threads complete")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - NEWS")
articleNGramCounts = {
    "unigrams" : Counter(articleUnigrams),
    "bigrams" : Counter(articleBigrams),
    "trigrams" : Counter(articleTrigrams)
}

articleUGPercent = {}
articleUGCount = {}
articleBGPercent = {}
articleBGCount = {}
articleTGPercent = {}
articleTGCount = {}

print("PROCCESSING UNIGRAMS PERCENTAGE - NEWS")
processUnigrams(articleTokenCount, articleNGramCounts["unigrams"], articleUGPercent, articleUGCount)
print("PROCCESSING BIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["bigrams"], articleUGCount, articleBGPercent, articleBGCount)
print("PROCCESSING TRIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["trigrams"], articleUGCount, articleTGPercent, articleTGCount)

print("SORTING NEWS UNIGRAM PERCENT")
articleUGPercent = sortDict(articleUGPercent)
print("SORTING NEWS UNIGRAM COUNT")
articleUGCount = sortDict(articleUGCount)
print("SORTING NEWS BIGRAM PERCENT")
articleBGPercent = sortDict(articleBGPercent)
print("SORTING NEWS BIGRAM COUNT")
articleBGCount = sortDict(articleBGCount)
print("SORTING NEWS TRIGRAM PERCENT")
articleTGPercent = sortDict(articleTGPercent)
print("SORTING NEWS TRIGRAM COUNT")
articleTGCount = sortDict(articleTGCount)

print("WRITING NEWS FILES")
print("WRITING NEWS unigramCounts")
nGramDictToCSV("all-the-news/", "unigramCounts", articleUGCount)
print("WRITING NEWS unigramPercent")
nGramDictToCSV("all-the-news/", "unigramPercent", articleUGPercent)
print("WRITING NEWS bigramCounts")
nGramDictToCSV("all-the-news/", "bigramCounts", articleBGCount)
print("WRITING NEWS bigramPercent")
nGramDictToCSV("all-the-news/", "bigramPercent", articleBGPercent)
print("WRITING NEWS trigramCounts")
nGramDictToCSV("all-the-news/", "trigramCounts", articleTGCount)
print("WRITING NEWS trigramPercent")
nGramDictToCSV("all-the-news/", "trigramPercent", articleTGPercent)

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("END NEWS")

print("DONE")



START WINE REVIEWS
CREATING NGRAM COUNTS - WINE REIVEWS
2020-02-27 17:51:54
Review sentence length: 20
Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Thread 0: 0 of 3 2020-02-27 17:51:54
Starting thread 5
Starting thread 6
Starting thread 7
Thread 2: 0 of 3 2020-02-27 17:51:54
Thread 1: 0 of 3 2020-02-27 17:51:54
Thread 3: 0 of 3 2020-02-27 17:51:54
Thread 5: 0 of 3 2020-02-27 17:51:54
End thread 0
Thread 4: 0 of 3 2020-02-27 17:51:54
Thread 6: 0 of 2 2020-02-27 17:51:54
End thread 1
End thread 2
End thread 3
End thread 4
End thread 5
End thread 6
End thread 7
All threads complete.
2020-02-27 17:51:54
CREATING COUNTER DICTIONARY - WINE REIVEWS
PROCCESSING UNIGRAMS PERCENTAGE - WINE REVIEWS
PROCCESSING BIGRAM PERCENTAGE - WINE REVIEWS
PROCCESSING TRIGRAMS PERCENTAGE - WINE REVIEWS
SORTING WINE UNIGRAM PERCENT
SORTING WINE UNIGRAM COUNT
SORTING WINE BIGRAM PERCENT
SORTING WINE BIGRAM COUNT
SORTING WINE TRIGRAM PERCENT
SORTING WINE TRIGRAM COUNT


### Task #5

<u>Compare the statistics of the corpora.</u>
                        
Use the results of those calculations that you just made the poor computer painstakingly compute. What are the differences in the most common unigrams between the two language models? Are there interesting differences between the bigram models or trigram models?

Be able to sort the n-grams to output the top k with the highest count or probability.
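
A minimal sketch of the top-k step, using hypothetical counts: `heapq.nlargest` selects the k largest items without sorting the whole dictionary, and `Counter.most_common` gives the same ranking when the counts live in a `Counter`.

```python
import heapq
from collections import Counter

def top_k(counts, k):
    """Return the k (ngram, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

# Hypothetical counts for illustration.
bigram_counts = Counter({
    ("of", "the"): 5,
    ("in", "the"): 4,
    ("the", "palate"): 3,
    ("he", "said"): 1,
})
print(top_k(bigram_counts, 2))
print(bigram_counts.most_common(2))
```

The same helper works on a dictionary of probabilities, since it only compares the values.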

In [13]:
# Load ngram percents and counts from files
# Use head 100 files for testing

# Major difference: the top news n-grams were mainly political, with words about the president, the White House, the US, etc.
# Wine reviews had none of these in their top n-grams

def readCountPercentFiles(fp):
    file = open(fp, "r")
    tmpDict = {}
    with file as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        for lines in csv_reader:
            tmpDict.update({lines[0] : lines[1]})
    return tmpDict

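def writeTopN(theDict, top):
    # The csv files are already sorted, so the first `top` items are the top N.
    # Slicing also avoids an IndexError when the dict has fewer than `top` items.
    for key, value in list(theDict.items())[:top]:
        print(key + "," + value)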
        
newsUGCountDict = readCountPercentFiles("all-the-news/unigramCounts.csv")
newsBGCountDict = readCountPercentFiles("all-the-news/bigramCounts.csv")
newsTGCountDict = readCountPercentFiles("all-the-news/trigramCounts.csv")
newsUGPercentDict = readCountPercentFiles("all-the-news/unigramPercent.csv")
newsBGPercentDict = readCountPercentFiles("all-the-news/bigramPercent.csv")
newsTGPercentDict = readCountPercentFiles("all-the-news/trigramPercent.csv")

wineUGCountDict = readCountPercentFiles("wine-reviews/unigramCounts.csv")
wineBGCountDict = readCountPercentFiles("wine-reviews/bigramCounts.csv")
wineTGCountDict = readCountPercentFiles("wine-reviews/trigramCounts.csv")
wineUGPercentDict = readCountPercentFiles("wine-reviews/unigramPercent.csv")
wineBGPercentDict = readCountPercentFiles("wine-reviews/bigramPercent.csv")
wineTGPercentDict = readCountPercentFiles("wine-reviews/trigramPercent.csv")

#newsUGCountDict = readCountPercentFiles("all-the-news/unigramCounts100.csv")
#newsBGCountDict = readCountPercentFiles("all-the-news/bigramCounts100.csv")
#newsTGCountDict = readCountPercentFiles("all-the-news/trigramCounts100.csv")
#newsUGPercentDict = readCountPercentFiles("all-the-news/unigramPercent100.csv")
#newsBGPercentDict = readCountPercentFiles("all-the-news/bigramPercent100.csv")
#newsTGPercentDict = readCountPercentFiles("all-the-news/trigramPercent100.csv")

#wineUGCountDict = readCountPercentFiles("wine-reviews/unigramCounts100.csv")
#wineBGCountDict = readCountPercentFiles("wine-reviews/bigramCounts100.csv")
#wineTGCountDict = readCountPercentFiles("wine-reviews/trigramCounts100.csv")
#wineUGPercentDict = readCountPercentFiles("wine-reviews/unigramPercent100.csv")
#wineBGPercentDict = readCountPercentFiles("wine-reviews/bigramPercent100.csv")
#wineTGPercentDict = readCountPercentFiles("wine-reviews/trigramPercent100.csv")

#newsUGCountDict = readCountPercentFiles("all-the-news/unigramCountsFull.csv")
#newsBGCountDict = readCountPercentFiles("all-the-news/bigramCountsFull.csv")
#newsTGCountDict = readCountPercentFiles("all-the-news/trigramCountsFull.csv")
#newsUGPercentDict = readCountPercentFiles("all-the-news/unigramPercentFull.csv")
#newsBGPercentDict = readCountPercentFiles("all-the-news/bigramPercentFull.csv")
#newsTGPercentDict = readCountPercentFiles("all-the-news/trigramPercentFull.csv")

#wineUGCountDict = readCountPercentFiles("wine-reviews/unigramCountsFull.csv")
#wineBGCountDict = readCountPercentFiles("wine-reviews/bigramCountsFull.csv")
#wineTGCountDict = readCountPercentFiles("wine-reviews/trigramCountsFull.csv")
#wineUGPercentDict = readCountPercentFiles("wine-reviews/unigramPercentFull.csv")
#wineBGPercentDict = readCountPercentFiles("wine-reviews/bigramPercentFull.csv")
#wineTGPercentDict = readCountPercentFiles("wine-reviews/trigramPercentFull.csv")

#print top N from dict
#csv's are already sorted
writeTopN(newsUGCountDict, 10)

print(len(wineUGCountDict))
print(len(newsUGCountDict))



the,980
a,459
of,431
to,425
and,419
in,404
that,197
he,186
was,156
for,154
174
3991


In [14]:
print("news unigrams")
writeTopN(newsUGCountDict, 15)
print("news bigrams")
writeTopN(newsBGCountDict, 15)
print("news trigrams")
writeTopN(newsTGCountDict, 15)
print("wine unigrams")
writeTopN(wineUGCountDict, 15)
print("wine bigrams")
writeTopN(wineBGCountDict, 15)
print("wine trigrams")
writeTopN(wineTGCountDict, 15)

news unigrams
the,980
a,459
of,431
to,425
and,419
in,404
that,197
he,186
was,156
for,154
on,150
his,148
said,119
mr,108
they,106
news bigrams
('in', 'the'),127
('of', 'the'),99
('and', 'the'),42
('in', 'a'),41
('on', 'the'),39
('to', 'the'),37
('he', 'said'),30
('mr', 'wong'),30
('ms', 'kerr'),27
('at', 'the'),26
('he', 'was'),26
('the', '40th'),25
('40th', 'precinct'),24
('mr', 'leahy'),23
('for', 'the'),22
news trigrams
('the', '40th', 'precinct'),22
('in', 'the', 'city'),13
('the', 'united', 'states'),12
('in', 'the', '40th'),10
('in', 'the', 'precinct'),9
('in', 'the', 'bronx'),9
('mr', 'wong', 'was'),8
('more', 'than', 'a'),7
('one', 'of', 'the'),6
('of', 'the', 'precinct'),6
('the', 'biggest', 'loser'),6
('calories', 'a', 'day'),6
('the', 'police', 'department'),5
('the', 'betances', 'houses'),5
('back', 'to', 'the'),5
wine unigrams
and,16
the,10
with,10
a,9
acidity,6
this,6
of,6
is,5
wine,5
its,5
aromas,4
flavors,4
in,4
dried,3
palate,3
wine bigrams
('the', 'palate'),3
('and', '

### Task #6

<u>Generate random sentences from the N-grams models for both datasets.</u>
                        
We briefly talked about this idea in class. It's also introduced at a high-level in J&M 4.3. How can a random number in the range [0,1] probabilistically generate a word using your model?
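
One common answer, sketched below under the assumption that the model is a dictionary mapping words to probabilities that sum to 1: lay the probabilities end to end on the interval [0, 1) and return the word whose segment contains the random number. The helper name `sample_word` and the toy distribution are hypothetical.

```python
import bisect
import random
from itertools import accumulate

def sample_word(probs, u):
    """Map a number u in [0, 1) to a word by walking the cumulative distribution."""
    words = list(probs)
    cumulative = list(accumulate(probs[w] for w in words))
    index = bisect.bisect_right(cumulative, u)
    return words[min(index, len(words) - 1)]  # guard against floating-point round-off

# Hypothetical unigram distribution.
probs = {"the": 0.5, "wine": 0.3, "dry": 0.2}
print(sample_word(probs, 0.10))  # u falls in the segment [0.0, 0.5)
print(sample_word(probs, 0.65))  # u falls in the segment [0.5, 0.8)
print(sample_word(probs, 0.95))  # u falls in the segment [0.8, 1.0)

rng = random.Random(0)  # seeded only to make the sketch repeatable
sentence = [sample_word(probs, rng.random()) for _ in range(5)]
print(" ".join(sentence))
```

For a bigram or trigram model the same function applies once `probs` is restricted to the continuations of the current context.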

In [15]:
# PYTHON CODE HERE

In [16]:
import random
from math import exp
from random import randint

def getDictCount(theDict):
    count = 0
    for w in theDict:
        count += int(theDict[w])
    return count

def weighted_random_by_dct(dct, count):
    rand_val = random.random()
    total = 0
    for k, v in dct.items():
        total += float(v) / count
        if rand_val <= total:
            return k

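from ast import literal_eval

def findGram(firstword, theDict):
    # Collect every n-gram whose first word matches, then pick one at random,
    # weighted by count.  Keys were read back from csv as strings such as
    # "('in', 'the')", so literal_eval turns them back into tuples.
    # Returns None when no n-gram starts with firstword.
    fwordlist = {}
    fwordcount = 0
    for w in theDict:
        theKey = literal_eval(w)[0]
        if theKey == firstword:
            fwordlist.update({w : theDict[w]})
            fwordcount += int(theDict[w])
    return weighted_random_by_dct(fwordlist, fwordcount)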

def makeTriSentence(setChoice):
    uniDict = newsUGCountDict
    triDict = newsTGCountDict
   
    if setChoice == "wine":
        uniDict = wineUGCountDict
        triDict = wineTGCountDict
    
    uniCount = getDictCount(uniDict)
    triCount = getDictCount(triDict)
    
    firstword = weighted_random_by_dct(uniDict, uniCount)
    
    numtri = randint(6, 10)
    
    sentGrams = firstword + " "
    while numtri >= 0:
        theGram = findGram(firstword, triDict)
        if theGram is not None:
            tupp = eval(theGram)
            tupp = list(tupp)
            sentGrams += tupp[1] + " " + tupp[2]  + " "
            firstword = eval(theGram)[2]
            numtri = numtri - 1
        else:
            firstword = weighted_random_by_dct(uniDict, uniCount)
           
    return sentGrams

def makeBiSentence(setChoice):
    uniDict = newsUGCountDict
    biDict = newsBGCountDict
   
    if setChoice == "wine":
        uniDict = wineUGCountDict
        biDict = wineBGCountDict
    
    uniCount = getDictCount(uniDict)
    biCount = getDictCount(biDict)
    
    firstword = weighted_random_by_dct(uniDict, uniCount)
    
    numbi = randint(8, 15)
    
    sentGrams = firstword + " "
    while numbi >= 0:
        theGram = findGram(firstword, biDict)
        if theGram is not None:
            tupp = eval(theGram)
            tupp = list(tupp)
            sentGrams += tupp[1] + " "
            firstword = eval(theGram)[1]
            numbi = numbi - 1
        else:
            firstword = weighted_random_by_dct(uniDict, uniCount)
           
    return sentGrams

def makeUniSentence(setChoice):
    uniDict = newsUGCountDict
   
    if setChoice == "wine":
        uniDict = wineUGCountDict
    
    uniCount = getDictCount(uniDict)
    
    firstword = weighted_random_by_dct(uniDict, uniCount)
    
    numuni = randint(10, 20)
    
    sentGrams = firstword + " "
    while numuni >= 0:
        firstword = weighted_random_by_dct(uniDict, uniCount)
        sentGrams += firstword + " "
        numuni = numuni - 1
           
    return sentGrams
    
print("wine")
print(makeUniSentence("wine"))
print(makeBiSentence("wine"))
print(makeTriSentence("wine"))

print("news")
print(makeUniSentence("news"))
print(makeBiSentence("news"))
print(makeTriSentence("news"))


wine
a pleasantly with aromas restrained nonetheless much if with plum a hearty elegant with regular for country expressive wine crisp 
this is fresh with juicy red berry fruits and savory 
a typical navarran whiff of preserved peach in this is fairly full bodied with tomatoey was all stainlesssteel fermented offers spice 
news
people the father hair chiefs states former hungry in their 40th monarch is epithet parts on of problem 2017 
potentially decision that will of the contract between 152 and brains of retribution 
police say those crimes the pounds come out of leadership changes in the successful ground was different 


### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include the following details in your report:

1. Names of the datasets used.
1. Does your model use all of the data in the .csv file or only a subset of it (e.g., the first 1,000 rows)?
1. What is the vocabulary and size of each dataset?
1. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in step #3.
1. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?
1. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?
1. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

Also submit this python notebook `.ipynb` to D2L.

## Technical Report

### Data Sets

### <a href="https://www.kaggle.com/snapcrack/all-the-news">All the news</a>
News articles from the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, the Guardian, NPR, Reuters, Vox, and the Washington Post.
Only articles1.csv from the data set was used for analysis.
<u>articles1.csv</u>

* 50,000 news articles
* 194.11 MB

### <a href="https://www.kaggle.com/zynicide/wine-reviews">Wine Reviews</a>
130,000 wine reviews scraped from <a href="http://www.winemag.com/?s=&drink_type=wine">WineEnthusiast</a> during June 2017.
Only winemag-data-130k-v2.csv from the data set was used for analysis.
<u>winemag-data-130k-v2.csv</u>

* 130,000 wine reviews
* 50.46 MB

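The n-gram models were built from the entire file in each case. For each data set, two CSV files were written per n-gram order (unigram, bigram, trigram): one with counts and one with log probabilities (the "Percent" files), for a total of six files per data set.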

### News n-gram files

<u>unigramCountsFull.csv</u> - 2.2 MB - 177449 lines<br>
<u>unigramPercentFull.csv</u> - 5.3 MB - 177449 lines<br>
<u>bigramCountsFull.csv</u> - 124.3 MB - 4719838 lines<br>
<u>bigramPercentFull.csv</u> - 203.8 MB - 4719838 lines<br>
<u>trigramCountsFull.csv</u> - 494.5 MB - 14689607 lines<br>
<u>trigramPercentFull.csv</u> - 756.7 MB - 14689607 lines<br>

### Wine n-gram files

<u>unigramCountsFull.csv</u> - 584.3 kB - 44771 lines<br>
<u>unigramPercentFull.csv</u> - 1.4 MB - 44771 lines<br>
<u>bigramCountsFull.csv</u> - 14.1 MB - 535868 lines<br>
<u>bigramPercentFull.csv</u> - 22.9 MB - 535868 lines<br>
<u>trigramCountsFull.csv</u> - 51.8 MB - 1527733 lines<br>
<u>trigramPercentFull.csv</u> - 77.7 MB - 1527733 lines<br>

The vocabulary size of each data set equals the line count of its unigram count file:

* News vocabulary count - 177449
* Wine vocabulary count - 44771

### Sentence Segmentation and Word Tokenization
After the column containing the articles/reviews was read from each data set's csv file, the entries were joined into a single string, with a space (" ") separating entries to prevent sentences from running together.  Sentence segmentation was then performed on this string using NLTK's `sent_tokenize()` function.  Each sentence was broken into its corresponding words using NLTK's `word_tokenize()` function, and the resulting word list was used to count the three kinds of n-grams it contains.  Counting the n-grams this way, sentence by sentence, ensured that no n-gram spans a sentence boundary.  Special characters and punctuation were removed from the words, and all words were lowercased, before counting n-grams.

### Program Performance
The code in this notebook can parse the entire files from both data sets only if the machine has sufficient RAM (16+ GB).  Initially, processing the News data set took over 8 hours.  I rewrote the n-gram counting code so the data could be processed on multiple CPUs at the same time to cut the processing time.  I eventually found the problem that was causing the excessive processing times: two Counter objects were being added together after each sentence was processed, and as the running-total Counter grew larger, each addition took longer.  I changed this to update the total Counter only once every 150000 sentences.  This ran much more quickly, at the cost of higher RAM utilization.  Each file can now be processed in less than 5 minutes.
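
The slowdown described above is consistent with how `Counter` addition behaves: `a + b` builds a new `Counter`, so adding after every sentence re-copies the running total each time. A sketch of an alternative, assuming tokenized sentences and a hypothetical helper name, is to accumulate in place with `Counter.update`:

```python
from collections import Counter

def count_bigrams(sentences):
    """Accumulate bigram counts in place, one sentence at a time."""
    total = Counter()
    for tokens in sentences:
        # update() adds into the existing Counter; its cost depends on the
        # sentence length, not on how large total has already grown.
        total.update(zip(tokens, tokens[1:]))
    return total

# Hypothetical tokenized sentences.
sentences = [["on", "the", "palate"], ["on", "the", "finish"]]
totals = count_bigrams(sentences)
print(totals.most_common(1))
```

This is not the batching approach used in the notebook; it is shown only as another way to avoid repeated full-size additions.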

### News Top N-Grams

#### Unigrams
the,1873788<br>
to,891377<br>
of,810531<br>
a,767762<br>
and,757945<br>
in,649374<br>
that,452533<br>
for,300547<br>
is,295052<br>
on,292685<br>
he,256274<br>
it,252612<br>
was,226418<br>
with,211355<br>
said,208127<br>

#### Bigrams
('of', 'the'),192662<br>
('in', 'the'),155857<br>
('to', 'the'),88735<br>
('on', 'the'),66703<br>
('for', 'the'),53846<br>
('at', 'the'),51722<br>
('in', 'a'),51372<br>
('and', 'the'),49236<br>
('to', 'be'),48117<br>
('that', 'the'),47247<br>
('with', 'the'),38812<br>
('from', 'the'),35589<br>
('of', 'a'),33050<br>
('as', 'a'),33041<br>
('he', 'said'),32528<br>

#### Trigrams
('the', 'united', 'states'),21374<br>
('one', 'of', 'the'),14553<br>
('the', 'white', 'house'),9812<br>
('according', 'to', 'the'),9302<br>
('a', 'lot', 'of'),8498<br>
('the', 'new', 'york'),6098<br>
('as', 'well', 'as'),6017<br>
('said', 'in', 'a'),5930<br>
('in', 'a', 'statement'),5795<br>
('in', 'the', 'united'),5598<br>
('some', 'of', 'the'),5331<br>
('new', 'york', 'times'),5061<br>
('part', 'of', 'the'),4985<br>
('to', 'be', 'a'),4899<br>
('going', 'to', 'be'),4749<br>


### Wine Top N-Grams

#### Unigrams
and,404954<br>
the,258453<br>
a,215622<br>
of,184159<br>
with,152655<br>
is,111531<br>
this,109710<br>
wine,87624<br>
flavors,77830<br>
in,74619<br>
to,64069<br>
it,63720<br>
its,60771<br>
fruit,56501<br>
but,48420<br>

#### Bigrams
('on', 'the'),34841<br>
('this', 'is'),27562<br>
('is', 'a'),24238<br>
('in', 'the'),22983<br>
('the', 'palate'),21646<br>
('and', 'a'),20614<br>
('the', 'wine'),19738<br>
('the', 'finish'),19696<br>
('flavors', 'of'),19612<br>
('aromas', 'of'),14910<br>
('of', 'the'),14066<br>
('this', 'wine'),14013<br>
('the', 'nose'),12380<br>
('wine', 'is'),12180<br>

#### Trigrams
('this', 'is', 'a'),14062<br>
('on', 'the', 'finish'),10524<br>
('in', 'the', 'mouth'),8994<br>
('on', 'the', 'palate'),7633<br>
('the', 'wine', 'is'),7256<br>
('on', 'the', 'nose'),6931<br>
('the', 'palate', 'is'),6179<br>
('a', 'touch', 'of'),5641<br>
('a', 'hint', 'of'),4099<br>
('the', 'finish', 'is'),3437<br>
('the', 'mouth', 'with'),2841<br>
('this', 'wine', 'is'),2830<br>
('over', 'the', 'next'),2670<br>
('as', 'well', 'as'),2544<br>
('ready', 'to', 'drink'),2487<br>


### Sentence Generation
The first word in the sentence was selected by using a weighted random selection from the unigram model.  This word was then used to create a list of n-grams whose first word is the selected word.  The n-gram for the sentence was then chosen from this list using the same weighted selection algorithm.  The last word of the chosen n-gram was then used as the first word for the next n-gram in the sentence, and the process repeated until the desired number of n-grams was reached.  The number of n-grams in the sentence was determined by a random number generated between 8 and 20.
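
The generation loop described above might look something like the sketch below. It assumes the selection is weighted by raw count, that unigram keys are plain strings and n-gram keys are tuples, and that generation stops early if no n-gram starts with the current word; the names `weighted_pick` and `generate_sentence` and the toy counts are hypothetical.

```python
import random
from collections import Counter

def weighted_pick(counts, rng):
    """Pick one key at random, with probability proportional to its count."""
    items = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(items, weights=weights, k=1)[0]

def generate_sentence(unigrams, ngrams, rng=None):
    """Chain n-grams together: the last word of each one seeds the next."""
    rng = rng or random.Random()
    word = weighted_pick(unigrams, rng)
    words = [word]
    for _ in range(rng.randint(8, 20)):      # number of n-grams in the sentence
        # Keep only the n-grams whose first word is the current word.
        candidates = {g: c for g, c in ngrams.items() if g[0] == word}
        if not candidates:                   # assumption: stop if nothing can follow
            break
        gram = weighted_pick(candidates, rng)
        words.extend(gram[1:])
        word = gram[-1]
    return " ".join(words)

# Toy counts for illustration only; not taken from either data set.
toy_unigrams = Counter({"the": 3, "wine": 2, "is": 1})
toy_bigrams = Counter({("the", "wine"): 2, ("wine", "is"): 1, ("is", "the"): 1})

print(generate_sentence(toy_unigrams, toy_bigrams, rng=random.Random(0)))
```

Passing trigram counts as `ngrams` works the same way: each chosen trigram contributes its last two words, and its final word seeds the next lookup.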

### News Model Generated Sentences

Unigram - antiquated block his the like with administration central sitting unsustainable how manhattan studies been the of now to from islam him his <br>
Bigram - barack obama the interview when he added she believes in communicating with sweden but the slave labor<br>
Trigram - them to the petitions committee on april 27 but not overwhelming during the kingdom said john rubey chief executive<br>

### Wine Review Model Generated Sentences

Unigram - mineral this but and also mourvèdre but ragù berries greenness in rich fruit fruit reflects light a with <br>
Bigram - this case for the nose its maturation a whiff of tang <br>
Trigram - to balance the most beautiful example of washington rieslings this is a good everyday quaff at a decent introduction <br>
