# HW 2: N-gram Language Models

## Date Out: Thursday, February 20
## Due Date: Thursday, March 5

This programming assignment is more open-ended than the previous ones. It centers on N-gram language models and asks you to:

* download and process a large text dataset in python using the <code>csv</code> library
* perform sentence and word tokenization
* calculate N-gram counts and probabilities
* compare the characteristics of the N-grams across different models
* generate random sentences using the models

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

In [40]:
import nltk

In [41]:
import csv

### Task #1

<u>Download two large text datasets from Kaggle.</u>

The <a href="http://kaggle.com">Kaggle competition hosting site</a> offers a number of free datasets that contain interesting text fields. For this assignment, we will use the "Wine Reviews" and "All the News" datasets. They can be accessed by selecting the "Datasets" header and then searching for these specific datasets. Then choose "Data" from the sub-header, preview some of the csv data, and notice that at least one of the columns in each dataset contains sufficient text. I chose these two datasets because their textual content seemed interesting and should have different language characteristics, and because both are csv files large enough to generate significant n-gram counts without being unmanageably large.

<em>(You can use other datasets if you wish. Others that looked interesting on Kaggle include the "Yelp Dataset" (but it's over 3 GB!), "SMS Spam Collection Dataset", "Russian Troll Tweets", and "A Million News Headlines".)</em>

In [42]:
# Downloaded wine-reviews and all-the-news

### Task #2

<u>Process the downloaded <code>csv</code> files in python.</u>

There's a nice csv library already included in Python for accessing values that are stored in a comma-separated values (csv) format. Read the <a href="https://docs.python.org/3/library/csv.html">csv library documentation</a>.
What is the delimiter in your csv files? Open each of the two .csv files that you downloaded using this library and be able to read in the data. Note that we really only care about the text column in this assignment.
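
As a minimal sketch of this reading step (one possible approach, not the required solution), the snippet below uses `csv.Sniffer` to guess the delimiter from the header line and `csv.DictReader` to pull out a single text column. The sample data, the `description` column name, and the `read_text_column` helper are illustrative assumptions; an in-memory string stands in for a downloaded file.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for a downloaded file; note the quoted field containing a comma.
SAMPLE = (
    "id,country,description\n"
    '0,US,"Ripe, jammy fruit with a long finish."\n'
    "1,Spain,Earthy and firm.\n"
    "2,France,Bright acidity.\n"
)

def read_text_column(handle, column, max_rows=None):
    """Return the values of `column`, optionally limited to the first `max_rows` rows."""
    # Guess the dialect from the header line only, restricted to common delimiters.
    dialect = csv.Sniffer().sniff(handle.readline(), delimiters=",;\t|")
    handle.seek(0)  # rewind so DictReader sees the header again
    reader = csv.DictReader(handle, dialect=dialect)
    # islice(reader, None) yields every row; islice(reader, n) stops after n rows.
    return [row[column] for row in islice(reader, max_rows)]

print(read_text_column(io.StringIO(SAMPLE), "description", max_rows=2))
```

With a real file, the same helper could be called on a handle opened with `open(path, newline="")`, which is the mode the csv documentation recommends.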

In [1]:
# Use the head.csv file in each folder for testing
# Both csv files are comma delimited

#Field limit
import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

#news = open("all-the-news/articles1.csv", "r")
#wine = open("wine-reviews/winemag-data_first150k.csv", "r")
#news = open("all-the-news/head1000.csv", "r")
#wine = open("wine-reviews/head1000.csv", "r")
#news = open("all-the-news/head500.csv", "r")
#wine = open("wine-reviews/head500.csv", "r")
#news = open("all-the-news/head50.csv", "r")
#wine = open("wine-reviews/head50.csv", "r")
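# newline="" lets the csv module handle line breaks inside quoted fields itself
news = open("all-the-news/head.csv", "r", newline="")
wine = open("wine-reviews/head.csv", "r", newline="")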

reviews = []
articles = []
with news as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        articles.append(lines["content"])
with wine as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for lines in csv_reader:
        reviews.append(lines["description"])

FileNotFoundError: [Errno 2] No such file or directory: 'all-the-news/head.csv'

In [85]:
#print(reviews)
#print(articles)

### Task #3

<u>Perform sentence segmentation and word tokenization.</u>

Utilize the nltk module to perform sentence segmentation and word tokenization. But at this point, there are a few decisions that need to be made:

* How should we handle the .csv rows from the previous step? If we ignore row markers and "lump everything together", how will that affect our language model?
* Do we want to remove punctuation? What is the effect of keeping punctuation in the model?
* Do we want to add sentence boundary markers, such as <samp>&lt;S&gt;</samp> and <samp>&lt;/S&gt;</samp>?
* Should the two words <samp>The</samp> and <samp>the</samp> be treated as the same? What are the effects of doing, or not doing, this?
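
To make these choices concrete, here is a small sketch in which each decision is a switch that can be flipped to compare the resulting token lists. The `naive_sentences` and `prepare` helpers are hypothetical names, and the regular-expression sentence splitter is only a rough stand-in for `nltk.sent_tokenize`, used so the example runs without any NLTK data installed.

```python
import re
import string

# Translation table that deletes ASCII punctuation (curly quotes are not covered).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def naive_sentences(text):
    """Rough sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def prepare(text, lowercase=True, strip_punct=True, markers=True):
    """Turn raw text into a list of token lists, one list per sentence."""
    result = []
    for sent in naive_sentences(text):
        if lowercase:
            sent = sent.lower()  # "The" and "the" become the same type
        if strip_punct:
            sent = sent.translate(_PUNCT_TABLE)
        tokens = sent.split()
        if not tokens:
            continue
        if markers:
            tokens = ["<S>"] + tokens + ["</S>"]  # explicit sentence boundaries
        result.append(tokens)
    return result

print(prepare("The wine is dry. The finish is long!"))
print(prepare("The wine is dry. The finish is long!", lowercase=False, markers=False))
```

Running `prepare` with different switches on the same rows is a quick way to see how vocabulary size and n-gram counts change with each decision.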

In [2]:
# Rows will be lumped into a single string. A single " " will be added to the end of each row to ensure sentences
# are not being combined (e.g. "lastword. firstword" instead of "lastword.firstword")
# Punctuation will NOT be kept when counting n-grams
# "The" and "the" will be treated as the same word; this will decrease the total number of distinct n-grams
# N-gram words will be counted within sentence boundaries (a trigram/bigram will not overlap into another sentence)
import string
import datetime

reviewsraw = ""
articlesraw = ""
print("READ REVIEWS")
for review in reviews:
    #review = review.translate(str.maketrans('', '', string.punctuation))
    review = review.lower()
    reviewsraw += review + " "
print("READ REVIEWS COMPLETE")
print("READ NEWS")
for article in articles:
    #article = article.translate(str.maketrans('', '', string.punctuation))
    article = article.lower()
    articlesraw += article + " "
print("READ NEWS COMPLETE")
    
#reviewTokens = nltk.word_tokenize(reviewsraw)
print("SENT TOKE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
reviewSents = nltk.sent_tokenize(reviewsraw)
print("SENT TOKE REVIEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
print("SENT TOKE NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))
#articleTokens = nltk.word_tokenize(articlesraw)
articleSents = nltk.sent_tokenize(articlesraw)
print("SENT TOKE NEWS COMPLETE")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

READ REVIEWS


NameError: name 'reviews' is not defined

In [87]:
#print(reviewTokens)
#print(reviewSents)

### Task #4

<u>Calculate N-gram counts and compute probabilities.</u>

Use a python dictionary (or any suitable data structure) to first compute unigram counts. Then try bigram counts. Finally, trigram counts.

How much memory are you using? How fast, or slow, is the code -- how long is this step taking? If it is taking too long, try only using a fraction of your corpus: instead of loading the entire .csv file, try only reading the first 1000 rows of data.
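
One way to answer the timing and memory questions is sketched below with the standard library's `time.perf_counter` and `tracemalloc`. The `measure` and `count_unigrams` helpers and the toy corpus are assumptions for illustration. Note that `tracemalloc` only reports memory allocated by Python in the current process, so it will not see worker processes.

```python
import time
import tracemalloc
from collections import Counter

def measure(func, *args):
    """Run func(*args) and return (result, seconds elapsed, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak

def count_unigrams(sentences):
    """Count word types over a list of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

# Toy corpus: two tokenized sentences repeated 1000 times.
toy = [["the", "wine", "is", "dry"], ["the", "finish", "is", "long"]] * 1000
counts, seconds, peak_bytes = measure(count_unigrams, toy)
print(f"{len(counts)} types, {seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

Timing the same function on the first 1,000 rows and then on larger slices gives a rough idea of how the cost grows before committing to the full file.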

Using those counts, compute the probabilities for the unigrams, bigrams, and trigrams, and store those in a new python dictionary (or some other data structure).
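
The counting and probability steps can be sketched as follows, using maximum-likelihood estimates: the probability of an n-gram's last word given the words before it is the n-gram count divided by the total count of that context. The helper names and the two toy sentences are assumptions, and no smoothing is applied.

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams (as tuples) without crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def conditional_probs(counts):
    """Maximum-likelihood P(last word | preceding words) for every n-gram."""
    context_totals = Counter()
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c  # for unigrams the context is the empty tuple
    return {gram: c / context_totals[gram[:-1]] for gram, c in counts.items()}

# Toy corpus with sentence boundary markers already added.
sentences = [["<S>", "the", "wine", "is", "dry", "</S>"],
             ["<S>", "the", "finish", "is", "long", "</S>"]]

unigram_probs = conditional_probs(ngram_counts(sentences, 1))
bigram_probs = conditional_probs(ngram_counts(sentences, 2))
trigram_probs = conditional_probs(ngram_counts(sentences, 3))

# Log probabilities avoid underflow when many small probabilities are multiplied.
bigram_log_probs = {gram: math.log(p) for gram, p in bigram_probs.items()}
print(bigram_probs[("the", "wine")], trigram_probs[("<S>", "the", "wine")])
```

Storing the context totals separately, rather than recomputing them per n-gram, keeps the probability pass linear in the number of distinct n-grams.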

In [3]:
# Because the n-grams are restricted to sentence boundaries, updating the n-gram count dicts was taking a very
# long time. Parsing the news articles took over 10 hours with the sentences split between 8 threads;
# as the size of the n-gram count dictionary increased, so did the time it took to update it with each consecutive sentence.
# I ended up moving this to the Bridges supercomputer. Running the process on 28 cores, each thread had to handle
# only 50000 sentences (as opposed to the 180000 sentences per core on my 8-core desktop).
# This ran in about 40 minutes.
# I have only included head stubs of the files for my submission. The full files totaled over 1.5 GB.
# If you would like to see the full files please let me know.

import math
from nltk.util import ngrams
from collections import Counter
from multiprocessing import Process
import multiprocessing
import datetime

def ngram_function(index, sentences, tokenCountDict, unigramsDict, bigramsDict, trigramsDict, stopwords):
    sentCount = 0
    
    # accumulate in local variables and publish once at the end to avoid race conditions
    tmpunigramsDict = Counter()
    tmpbigramsDict = Counter()
    tmptrigramsDict = Counter()
    tmptokenCount = 0

    # build the punctuation-stripping table once instead of once per sentence
    punctTable = str.maketrans('', '', string.punctuation)
    # fetch this worker's slice once; every lookup on a manager dict copies the value
    mySents = sentences[index]
    
    for sent in mySents:
        if sentCount % 1000 == 0:
            now = datetime.datetime.now()
            print("Thread " + str(index) + ": " + str(sentCount) + " of " + str(len(mySents)) + " " + now.strftime("%Y-%m-%d %H:%M:%S"))
        sent = sent.translate(punctTable)
        # drop the unwanted tokens listed in stopwords
        words = [word for word in nltk.word_tokenize(sent) if word not in stopwords]
        tmptokenCount += len(words)
        # Counter.update only touches the new n-grams; `+=` rescans the whole
        # Counter on every call, which gets slower as the counts grow
        tmpunigramsDict.update(ngrams(words, 1))
        tmpbigramsDict.update(ngrams(words, 2))
        tmptrigramsDict.update(ngrams(words, 3))
        sentCount += 1
    tokenCountDict[index] = tmptokenCount
    unigramsDict[index] = tmpunigramsDict
    bigramsDict[index] = tmpbigramsDict
    trigramsDict[index] = tmptrigramsDict

def processUnigrams(tokenCount, rawCounter, percentDict, countDict):
    for word in rawCounter:
        firstword = list(word)[0]
        count = rawCounter[(firstword,)]
        if count > 0:
            percent = math.log(count / tokenCount)
            percentDict.update({firstword : percent})        
            countDict.update({firstword : count})
        
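def processNGrams(rawCounter, reviewUGCount, percentDict, countDict):
    # P(w_n | context) = C(context, w_n) / C(context), where the context is every
    # word but the last. Dividing a trigram count by the unigram count of its first
    # word would not give P(w3 | w1, w2), so total the counts per context instead.
    contextTotals = Counter()
    for ngram, count in rawCounter.items():
        contextTotals[ngram[:-1]] += count
    for ngram, count in rawCounter.items():
        # as before, skip n-grams whose first word has no unigram count
        if count > 0 and ngram[0] in reviewUGCount:
            # the stored value is a natural-log probability
            percentDict[ngram] = math.log(count / contextTotals[ngram[:-1]])
            countDict[ngram] = count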

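def nGramDictToCSV(path, fileName, theDict):
    theDict = dict(theDict)
    # the context manager closes (and flushes) the file; newline="" is what the csv docs recommend
    with open(path.rstrip("/") + "/" + fileName + ".csv", "w", newline="") as csv_file:
        w = csv.writer(csv_file)
        for key, val in theDict.items():
            w.writerow([key, val])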
        
def sortDict(theDict):
    return dict(sorted(theDict.items(), key=lambda y: y[1], reverse=True))

print("START WINE REVIEWS")
reviewTokenCount = 0

reviewUnigrams = Counter()
reviewBigrams = Counter()
reviewTrigrams = Counter()

# I don't want these in my n-gram results
stopwords = ["'", "s", "’", "”", "“", "t"]

print("CREATING NGRAM COUNTS - WINE REIVEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

#This is taking too long, use all cores on the machine.
#Match to the number of threads on the machine
threadCount = 8

#Prep for multithread
interval = math.ceil(len(reviewSents) / threadCount)

manager = multiprocessing.Manager()
reviewThreadDict = manager.dict()
reviewUnigramsDict = manager.dict()
reviewBigramsDict = manager.dict()
reviewTrigramsDict = manager.dict()
reviewTokenCountDict = manager.dict()

for x in range(threadCount):
    reviewThreadDict.update({x : reviewSents[interval*x:interval*(x+1)]})
    reviewUnigramsDict.update({x : {}})
    reviewBigramsDict.update({x : {}})
    reviewTrigramsDict.update({x : {}})
    reviewTokenCountDict.update({x : 0})

print("Review sentence length: " + str(len(reviewSents)))    
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, reviewThreadDict, reviewTokenCountDict, reviewUnigramsDict, reviewBigramsDict, reviewTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]       
        thread.join()        
        reviewUnigrams += reviewUnigramsDict[a]
        reviewBigrams += reviewBigramsDict[a]
        reviewTrigrams += reviewTrigramsDict[a]
        reviewTokenCount += reviewTokenCountDict[a]
        print("End thread " + str(a))

print("All threads complete.")

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - WINE REVIEWS")

reviewNGramCounts = {
    "unigrams" : Counter(reviewUnigrams),
    "bigrams" : Counter(reviewBigrams),
    "trigrams" : Counter(reviewTrigrams)
}

reviewUGPercent = {}
reviewUGCount = {}
reviewBGPercent = {}
reviewBGCount = {}
reviewTGPercent = {}
reviewTGCount = {}

print("PROCESSING UNIGRAMS PERCENTAGE - WINE REVIEWS")
processUnigrams(reviewTokenCount, reviewNGramCounts["unigrams"], reviewUGPercent, reviewUGCount)
print("PROCESSING BIGRAM PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["bigrams"], reviewUGCount, reviewBGPercent, reviewBGCount)
print("PROCESSING TRIGRAMS PERCENTAGE - WINE REVIEWS")
processNGrams(reviewNGramCounts["trigrams"], reviewUGCount, reviewTGPercent, reviewTGCount)

print("SORTING WINE UNIGRAM PERCENT")
reviewUGPercent = sortDict(reviewUGPercent)
print("SORTING WINE UNIGRAM COUNT")
reviewUGCount = sortDict(reviewUGCount)
print("SORTING WINE BIGRAM PERCENT")
reviewBGPercent = sortDict(reviewBGPercent)
print("SORTING WINE BIGRAM COUNT")
reviewBGCount = sortDict(reviewBGCount)
print("SORTING WINE TRIGRAM PERCENT")
reviewTGPercent = sortDict(reviewTGPercent)
print("SORTING WINE TRIGRAM COUNT")
reviewTGCount = sortDict(reviewTGCount)

print("WRITING WINE FILES")
print("WRITING WINE unigramCounts")
nGramDictToCSV("wine-reviews/", "unigramCounts", reviewUGCount)
print("WRITING WINE unigramPercent")
nGramDictToCSV("wine-reviews/", "unigramPercent", reviewUGPercent)
print("WRITING WINE bigramCounts")
nGramDictToCSV("wine-reviews/", "bigramCounts", reviewBGCount)
print("WRITING WINE bigramPercent")
nGramDictToCSV("wine-reviews/", "bigramPercent", reviewBGPercent)
print("WRITING WINE trigramCounts")
nGramDictToCSV("wine-reviews/", "trigramCounts", reviewTGCount)
print("WRITING WINE trigramPercent")
nGramDictToCSV("wine-reviews/", "trigramPercent", reviewTGPercent)

print("END WINE REVIEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

###############################################################

print("START NEWS")
articleTokenCount = 0    

articleUnigrams = Counter()
articleBigrams = Counter()
articleTrigrams = Counter()

print("CREATING NGRAM COUNTS - NEWS")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

#Prep for multithread
interval = math.ceil(len(articleSents) / threadCount)

articleThreadDict = manager.dict()
articleUnigramsDict = manager.dict()
articleBigramsDict = manager.dict()
articleTrigramsDict = manager.dict()
articleTokenCountDict = manager.dict()

for x in range(threadCount):
    articleThreadDict.update({x : articleSents[interval*x:interval*(x+1)]})
    articleUnigramsDict.update({x : {}})
    articleBigramsDict.update({x : {}})
    articleTrigramsDict.update({x : {}})
    articleTokenCountDict.update({x : 0})

print("News sentence length: " + str(len(articleSents)))      
    
if __name__ == "__main__":

    threads = list()
    for index in range(threadCount):
        x = Process(target=ngram_function, args=(index, articleThreadDict, articleTokenCountDict, articleUnigramsDict, articleBigramsDict, articleTrigramsDict, stopwords))
        threads.append(x)
        print("Starting thread " + str(index))
        x.start()

    for a in range(threadCount):
        thread = threads[a]
        thread.join()
        articleUnigrams += articleUnigramsDict[a]
        articleBigrams += articleBigramsDict[a]
        articleTrigrams += articleTrigramsDict[a]
        articleTokenCount += articleTokenCountDict[a]
        print("End thread " + str(a))
        
print("All threads complete")
now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("CREATING COUNTER DICTIONARY - NEWS")
articleNGramCounts = {
    "unigrams" : Counter(articleUnigrams),
    "bigrams" : Counter(articleBigrams),
    "trigrams" : Counter(articleTrigrams)
}

articleUGPercent = {}
articleUGCount = {}
articleBGPercent = {}
articleBGCount = {}
articleTGPercent = {}
articleTGCount = {}

print("PROCESSING UNIGRAMS PERCENTAGE - NEWS")
processUnigrams(articleTokenCount, articleNGramCounts["unigrams"], articleUGPercent, articleUGCount)
print("PROCESSING BIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["bigrams"], articleUGCount, articleBGPercent, articleBGCount)
print("PROCESSING TRIGRAM PERCENTAGE - NEWS")
processNGrams(articleNGramCounts["trigrams"], articleUGCount, articleTGPercent, articleTGCount)

print("SORTING NEWS UNIGRAM PERCENT")
articleUGPercent = sortDict(articleUGPercent)
print("SORTING NEWS UNIGRAM COUNT")
articleUGCount = sortDict(articleUGCount)
print("SORTING NEWS BIGRAM PERCENT")
articleBGPercent = sortDict(articleBGPercent)
print("SORTING NEWS BIGRAM COUNT")
articleBGCount = sortDict(articleBGCount)
print("SORTING NEWS TRIGRAM PERCENT")
articleTGPercent = sortDict(articleTGPercent)
print("SORTING NEWS TRIGRAM COUNT")
articleTGCount = sortDict(articleTGCount)

print("WRITING NEWS FILES")
print("WRITING NEWS unigramCounts")
nGramDictToCSV("all-the-news/", "unigramCounts", articleUGCount)
print("WRITING NEWS unigramPercent")
nGramDictToCSV("all-the-news/", "unigramPercent", articleUGPercent)
print("WRITING NEWS bigramCounts")
nGramDictToCSV("all-the-news/", "bigramCounts", articleBGCount)
print("WRITING NEWS bigramPercent")
nGramDictToCSV("all-the-news/", "bigramPercent", articleBGPercent)
print("WRITING NEWS trigramCounts")
nGramDictToCSV("all-the-news/", "trigramCounts", articleTGCount)
print("WRITING NEWS trigramPercent")
nGramDictToCSV("all-the-news/", "trigramPercent", articleTGPercent)

now = datetime.datetime.now()
print (now.strftime("%Y-%m-%d %H:%M:%S"))

print("END NEWS")

print("DONE")

START WINE REVIEWS
CREATING NGRAM COUNTS - WINE REIVEWS
2020-02-24 15:00:59


NameError: name 'reviewSents' is not defined

### Task #5

<u>Compare the statistics of the corpora.</u>
                        
Use the results of those calculations that you just made the poor computer painstakingly compute. What are the differences in the most common unigrams between the two language models? Are there interesting differences between the bigram models or trigram models?

Be able to sort the n-grams to output the top k with the highest count or probability.
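
A short sketch of the top-k step: `heapq.nlargest` picks the k best entries without sorting the whole dictionary, and works for counts as well as (log) probabilities, while `Counter.most_common` does the same for counts. The `top_k` helper and the toy counts are made up for illustration.

```python
import heapq
from collections import Counter

def top_k(scores, k):
    """Return the k (item, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Toy trigram counts, invented for this example.
counts = Counter({("on", "the", "finish"): 5,
                  ("this", "is", "a"): 7,
                  ("a", "hint", "of"): 2})

print(top_k(counts, 2))
print(counts.most_common(2))
```

For a dictionary with millions of n-grams and a small k, selecting the top k this way avoids the cost of a full sort.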

In [17]:
# Load ngram percents and counts from files
# Use head 100 files for testing

# The major difference: the top news n-grams were mainly political, with words about the president, the White House, the US, etc.
# Wine reviews had none of this in their top n-grams

def readCountPercentFiles(fp):
    file = open(fp, "r")
    tmpDict = {}
    with file as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        for lines in csv_reader:
            tmpDict.update({lines[0] : lines[1]})
    return tmpDict

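def writeTopN(theDict, top):
    # dicts keep insertion order, so the first `top` items of a pre-sorted CSV are the top entries;
    # slicing also avoids an IndexError when the file has fewer than `top` rows
    for key, val in list(theDict.items())[:top]:
        print(key + "," + val)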

newsUGCountDict = readCountPercentFiles("all-the-news/unigramCounts100.csv")
newsBGCountDict = readCountPercentFiles("all-the-news/bigramCounts100.csv")
newsTGCountDict = readCountPercentFiles("all-the-news/trigramCounts100.csv")
newsUGPercentDict = readCountPercentFiles("all-the-news/unigramPercent100.csv")
newsBGPercentDict = readCountPercentFiles("all-the-news/bigramPercent100.csv")
newsTGPercentDict = readCountPercentFiles("all-the-news/trigramPercent100.csv")

wineUGCountDict = readCountPercentFiles("wine-reviews/unigramCounts100.csv")
wineBGCountDict = readCountPercentFiles("wine-reviews/bigramCounts100.csv")
wineTGCountDict = readCountPercentFiles("wine-reviews/trigramCounts100.csv")
wineUGPercentDict = readCountPercentFiles("wine-reviews/unigramPercent100.csv")
wineBGPercentDict = readCountPercentFiles("wine-reviews/bigramPercent100.csv")
wineTGPercentDict = readCountPercentFiles("wine-reviews/trigramPercent100.csv")

#newsUGCountDict = readCountPercentFiles("all-the-news/unigramCountsFull.csv")
#newsBGCountDict = readCountPercentFiles("all-the-news/bigramCountsFull.csv")
#newsTGCountDict = readCountPercentFiles("all-the-news/trigramCountsFull.csv")
#newsUGPercentDict = readCountPercentFiles("all-the-news/unigramPercentFull.csv")
#newsBGPercentDict = readCountPercentFiles("all-the-news/bigramPercentFull.csv")
#newsTGPercentDict = readCountPercentFiles("all-the-news/trigramPercentFull.csv")

#wineUGCountDict = readCountPercentFiles("wine-reviews/unigramCountsFull.csv")
#wineBGCountDict = readCountPercentFiles("wine-reviews/bigramCountsFull.csv")
#wineTGCountDict = readCountPercentFiles("wine-reviews/trigramCountsFull.csv")
#wineUGPercentDict = readCountPercentFiles("wine-reviews/unigramPercentFull.csv")
#wineBGPercentDict = readCountPercentFiles("wine-reviews/bigramPercentFull.csv")
#wineTGPercentDict = readCountPercentFiles("wine-reviews/trigramPercentFull.csv")

#print top N from dict
#csv's are already sorted
writeTopN(wineTGCountDict, 10)



('this', 'is', 'a'),14062
('on', 'the', 'finish'),10524
('in', 'the', 'mouth'),8994
('on', 'the', 'palate'),7633
('the', 'wine', 'is'),7256
('on', 'the', 'nose'),6931
('the', 'palate', 'is'),6179
('a', 'touch', 'of'),5641
('a', 'hint', 'of'),4099
('the', 'finish', 'is'),3437


### Task #6

<u>Generate random sentences from the N-grams models for both datasets.</u>
                        
We briefly talked about this idea in class. It's also introduced at a high level in J&M 4.3. How can a random number in the range [0,1] probabilistically generate a word using your model?
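
One common answer is inverse-CDF sampling: lay the candidate words end to end on the interval [0,1), each taking a width equal to its probability, and pick the word whose segment contains the random number. The sketch below applies this to a bigram model and stops when the end marker is drawn. The function names and the toy probabilities are assumptions, and it relies on every word having at least one successor, which holds when sentences end in a boundary marker.

```python
import random
from collections import defaultdict

def build_successors(bigram_probs):
    """Group P(next | prev) by prev: {prev: [(next, prob), ...]}."""
    successors = defaultdict(list)
    for (prev, nxt), p in bigram_probs.items():
        successors[prev].append((nxt, p))
    return successors

def sample_word(choices, r):
    """Return the first word whose cumulative probability exceeds r (0 <= r < 1)."""
    cumulative = 0.0
    for word, p in choices:
        cumulative += p
        if r < cumulative:
            return word
    return choices[-1][0]  # guard against floating-point round-off

def generate(successors, rng, max_len=20):
    """Walk the bigram chain from <S> until </S> is drawn or max_len words are produced."""
    words, prev = [], "<S>"
    while len(words) < max_len:
        prev = sample_word(successors[prev], rng.random())
        if prev == "</S>":
            break
        words.append(prev)
    return words

# Toy bigram probabilities, invented for this example.
toy_probs = {("<S>", "the"): 1.0, ("the", "wine"): 0.5, ("the", "finish"): 0.5,
             ("wine", "is"): 1.0, ("finish", "is"): 1.0,
             ("is", "dry"): 0.5, ("is", "long"): 0.5,
             ("dry", "</S>"): 1.0, ("long", "</S>"): 1.0}
print(" ".join(generate(build_successors(toy_probs), random.Random(0))))
```

The `max_len` cap is a safety net for models where the end marker is rare; for a unigram model the same `sample_word` function can be used with a single list of all words.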

In [None]:
# PYTHON CODE HERE

### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include the following in your report:

1. Names of the datasets used.
1. Does your model use all of the data in the .csv file or only a subset of it (e.g., the first 1,000 rows)?
1. What is the vocabulary and size of each dataset?
1. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in step #3.
1. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?
1. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?
1. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

Also submit this python notebook `.ipynb` to D2L.

In [None]:
# PYTHON CODE AND REPORT HERE