# CSE 258, Fall 2019: Homework 4
Data: http://cseweb.ucsd.edu/classes/fa19/cse258-a/files/assignment1.tar.gz

Using the code provided on the webpage, read the first 10,000 reviews from the corpus, converting each review to lowercase and removing all punctuation.

In [96]:
# Import dependencies
import gzip
import string
from collections import defaultdict
from sklearn import linear_model
import numpy as np

In [26]:
# Load data
def readGz(path):
    for l in gzip.open(path, 'rt'):
        yield eval(l)

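punctuation = set(string.punctuation)
data = []
for line in readGz("train_Category.json.gz"):
    # Lowercase the review text and drop every punctuation character
    line["review_text"] = ''.join([c for c in line['review_text'].lower() if c not in punctuation])
    data.append(line)
    if len(data) == 10000:  # Only the first 10,000 reviews are needed
        break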

In [28]:
data[0]

{'n_votes': 0,
 'review_id': 'r99763621',
 'user_id': 'u17334941',
 'review_text': 'genuinely enthralling if collins or bernard did invent this out of whole cloth they deserve a medal for imagination lets leave the veracity aside for a moment  always a touchy subject when it comes to real life stories of the occult  and talk about the contents \n the black alchemist covers a period of two years in which collins a magician and bernard a psychic undertook a series of psychic quests that put them in opposition with the titular black alchemist as entertainment goes the combination of harrowing discoveries ancient lore and going down the pub for a cigarette and a guinness trying to make sense of it all while a hen party screams at each other is a winner it is simultaneously down to earth and out of this world \n it reads fast both because of the curiousity and because collins has a very clear writing style sometimes its a little clunky or over repetitive and theres a few meetings that get u

## Task 1
How many unique bigrams are there amongst the reviews? List the 5 most-frequently-occurring bigrams
along with their number of occurrences in the corpus (1 mark).

In [70]:
bigrams = defaultdict(int)
for d in data:
    review = d["review_text"].split()
    for x in range(1, len(review)):
        a,b = review[x-1].strip(), review[x].strip()
        bigram = "{} {}".format(a,b)
        bigrams[bigram] += 1
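
Task 1 first asks for the number of unique bigrams. Since every distinct bigram becomes one key of the `bigrams` dictionary built above, `len(bigrams)` gives that count. The sketch below shows the same idea on a toy corpus; `count_bigrams` and `toy_texts` are hypothetical names used only for illustration and are not part of the provided code.

```python
from collections import Counter

def count_bigrams(texts):
    """Count adjacent word pairs across an iterable of pre-cleaned strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip pairs each token with its successor: (w0, w1), (w1, w2), ...
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical stand-in for the cleaned review texts
toy_texts = ["the cat sat", "the cat ran"]
toy_counts = count_bigrams(toy_texts)

print("Unique bigrams:", len(toy_counts))
print("Most common:", toy_counts.most_common(1))
```

In the notebook itself, `print(len(bigrams))` after the loop above would report the number of unique bigrams in the 10,000 reviews.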

In [71]:
bigram_items = list(bigrams.items())
bigram_items.sort(key=lambda item:item[1], reverse=True)
print("5 most frequently occurring bigrams and their number of occurrences:")
for bigram, count in bigram_items[:5]:
    print('"{}": {}'.format(bigram, count))

5 most frequently occurring bigrams and their number of occurrences:
"of the": 7927
"this book": 5850
"in the": 5627
"and the": 3189
"is a": 3183


## Task 2
The code provided performs least squares using the 1000 most common unigrams. Adapt it to use
the 1000 most common bigrams and report the MSE obtained using the new predictor (use bigrams
only, i.e., not unigrams+bigrams) (1 mark). Note that the code performs regularized regression with a
regularization parameter of 1.0. The prediction target should be the ‘rating’ field in each review.

In [72]:
words = [x[0] for x in bigram_items[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

def feature(datum):
    feat = [0]*len(words)
    review = datum["review_text"].split()
    for x in range(1, len(review)):
        a,b = review[x-1].strip(), review[x].strip()
        bigram = "{} {}".format(a,b)
        if bigram in wordSet:
            feat[wordId[bigram]] += 1
    feat.append(1) #offset
    return feat

X = [feature(d) for d in data]
y = [d['rating'] for d in data]

In [73]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)
    
# Regularized regression
clf = linear_model.Ridge(1.0, fit_intercept=False) # Squared error + L2 penalty (strength 1.0); the offset is the constant feature
clf.fit(X, y)
theta = clf.coef_
predictions = clf.predict(X)

# Evaluate
mse = MSE(predictions, y)
print("MSE: {}".format(mse))

MSE: 1.0178487590567005


## Task 3
Repeat the above experiment using unigrams and bigrams, still considering the 1000 most common.
That is, your model will still use 1000 features (plus an offset), but those 1000 features will be some
combination of unigrams and bigrams. Report the MSE obtained using the new predictor (1 mark).

In [79]:
unigrams = defaultdict(int)
for d in data:
    review = d["review_text"].split()
    for word in review:
        unigrams[word] += 1
unigram_items = list(unigrams.items())
unigram_items.sort(key=lambda item: item[1], reverse=True)
for unigram, count in unigram_items[:5]:
    print('"{}": {}'.format(unigram, count))

"the": 73431
"and": 44301
"a": 39577
"to": 36821
"i": 36581


In [85]:
# Merge and sort lists of uni- and bigrams
uni_and_bigrams = unigram_items + bigram_items
uni_and_bigrams.sort(key=lambda item:item[1], reverse=True)

# Find the 1000 most common of these
words = [x[0] for x in uni_and_bigrams[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

def feature(datum):
    feat = [0]*len(words)
    review = datum["review_text"].split()
    
    # Check unigrams
    for word in review:
        if word in wordSet:
            feat[wordId[word]] += 1
            
    # Check bigrams
    for x in range(1, len(review)):
        a,b = review[x-1].strip(), review[x].strip()
        bigram = "{} {}".format(a,b)
        if bigram in wordSet:
            feat[wordId[bigram]] += 1
    
    feat.append(1) #offset
    return feat

X = [feature(d) for d in data]
y = [d['rating'] for d in data]

In [87]:
# Regularized regression
clf = linear_model.Ridge(1.0, fit_intercept=False) # Squared error + L2 penalty (strength 1.0); the offset is the constant feature
clf.fit(X, y)
theta = clf.coef_
predictions = clf.predict(X)

# Evaluate
mse = MSE(predictions, y)
print("MSE: {}".format(mse))

MSE: 0.9683729530414926


## Task 4
What is the inverse document frequency of the words ‘stories’, ‘magician’, ‘psychic’, ‘writing’, and ‘wonder’? What are their tf-idf scores in the first review (using log base 10) (1 mark)?

In [94]:
# Find the document frequencies of each unigram
dfs = defaultdict(int)
check_words = ['stories', 'magician', 'psychic', 'writing', 'wonder']
for d in data:
    review = d["review_text"].split()
    for word in check_words:
        if word in review:
            dfs[word] += 1

In [110]:
N = len(data) # Total number of documents
idfs = dict()
# Calculate inverse document frequencies
for word in check_words:
    df = dfs[word]
    idfs[word] = np.log10(N/df)
    print("IDF {}: {}".format(word, idfs[word]))

IDF stories: 1.1174754620451195
IDF magician: 2.657577319177794
IDF psychic: 2.6020599913279625
IDF writing: 0.9978339382434923
IDF wonder: 1.7670038896078462


In [111]:
# Calculate TF-IDF for first document
tfidfs = dict()
first_review = data[0]["review_text"].split()
for word in check_words:
    tf = first_review.count(word)
    idf = idfs[word]
    tfidfs[word] = tf * idf
    print("TFIDF {}: {}".format(word, tfidfs[word]))

TFIDF stories: 1.1174754620451195
TFIDF magician: 2.657577319177794
TFIDF psychic: 5.204119982655925
TFIDF writing: 0.9978339382434923
TFIDF wonder: 1.7670038896078462


## Task 5
Adapt your unigram model to use the tfidf scores of words, rather than a bag-of-words representation.
That is, rather than your features containing the word counts for the 1000 most common unigrams, it
should contain tfidf scores for the 1000 most common unigrams. Report the MSE of this new model (1
mark).
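
No solution is recorded for this task. One possible approach, sketched below, reuses the Task 4 definitions (raw term count times `log10(N / df)`) and restricts the features to the most common unigrams. The helper name `build_tfidf_featurizer`, the variable `tfidf_feature`, and the toy documents are assumptions made for illustration; computing document frequencies over the same reviews used for training is also an assumption.

```python
import math
from collections import Counter

def build_tfidf_featurizer(docs, vocab_size=1000):
    """Return (feature_fn, vocab) for tokenized docs (a list of token lists)."""
    term_counts = Counter()  # corpus-wide counts, used to pick the vocabulary
    doc_freq = Counter()     # number of documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document

    vocab = [w for w, _ in term_counts.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)
    idf = {w: math.log10(n_docs / doc_freq[w]) for w in vocab}

    def feature(tokens):
        feat = [0.0] * len(vocab)
        for w, tf in Counter(tokens).items():
            if w in index:
                feat[index[w]] = tf * idf[w]  # tf-idf replaces the raw count
        feat.append(1.0)  # offset, as in the bag-of-words models
        return feat

    return feature, vocab

# Hypothetical toy corpus standing in for the tokenized reviews
toy_docs = [["good", "book"], ["bad", "book"], ["good", "good", "read"]]
tfidf_feature, vocab = build_tfidf_featurizer(toy_docs, vocab_size=2)
print(vocab)
print(tfidf_feature(toy_docs[2]))
```

With the real data, the token lists would be `d["review_text"].split()` for each review, and the resulting feature matrix could be passed to the same `linear_model.Ridge(1.0, fit_intercept=False)` fit and `MSE` function used in Task 2.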

## Task 6
Which other review has the highest cosine similarity compared to the first review (provide the review id,
or the text of the review) (1 mark)?
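
No solution is recorded for this task either. A minimal sketch of the search is given below, assuming each review has already been turned into a numeric vector (for example tf-idf scores without the offset term, since a constant feature would inflate every similarity). The function names `cosine` and `most_similar` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(vectors, query_index=0):
    """Return (index, similarity) of the vector closest to vectors[query_index]."""
    best_index, best_score = None, -1.0
    for i, vec in enumerate(vectors):
        if i == query_index:
            continue  # skip the query review itself
        score = cosine(vectors[query_index], vec)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Hypothetical toy vectors standing in for per-review tf-idf features
toy_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 0.0, 2.5]]
best_index, best_score = most_similar(toy_vectors, query_index=0)
print(best_index, best_score)
```

Applied to the notebook's data, the returned index `i` would identify the answer as `data[i]['review_id']`.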

## Task 7
Implement a validation pipeline for this same data, by randomly shuffling the data, using 10,000 reviews
for training, another 10,000 for validation, and another 10,000 for testing. Consider regularization
parameters in the range {0.01, 0.1, 1, 10, 100}, and report MSEs on the test set for the model that performs
best on the validation set. Using this pipeline, compare the following alternatives in terms of their
performance:
* Unigrams vs. bigrams
* Removing punctuation vs. preserving it. The model that preserves punctuation should treat punctuation characters as separate words, e.g. "Amazing!" would become ["amazing", "!"]
* tfidf scores vs. word counts   
In total you should compare 2 × 2 × 2 = 8 models, and produce a table comparing their performance (2
marks)
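
No solution is recorded for this task. The sketch below outlines the pieces such a pipeline might need: a tokenizer that can keep punctuation as separate tokens, a shuffled three-way split, and a loop that picks the regularization parameter on the validation set and reports the test MSE. All names here (`tokenize`, `split_three_ways`, `select_and_test`, `CONFIGS`) are hypothetical, and `fit` and `evaluate` are placeholders that would wrap `linear_model.Ridge(lam, fit_intercept=False)` and the `MSE` function in the notebook.

```python
import itertools
import random
import string

PUNCT = set(string.punctuation)

def tokenize(text, keep_punctuation):
    """Lowercase text; drop punctuation or split it off as separate tokens."""
    pieces = []
    for c in text.lower():
        if c in PUNCT:
            if keep_punctuation:
                pieces.append(' ' + c + ' ')  # isolate the character as its own token
        else:
            pieces.append(c)
    return ''.join(pieces).split()

def split_three_ways(items, size, seed=0):
    """Shuffle a copy of items and return train, validation and test slices."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:size], shuffled[size:2 * size], shuffled[2 * size:3 * size]

def select_and_test(lambdas, fit, evaluate, train, valid, test):
    """Pick the lambda with the lowest validation MSE; return (lambda, test MSE)."""
    best_lam, best_model, best_valid = None, None, float('inf')
    for lam in lambdas:
        model = fit(train, lam)
        valid_mse = evaluate(model, valid)
        if valid_mse < best_valid:
            best_lam, best_model, best_valid = lam, model, valid_mse
    return best_lam, evaluate(best_model, test)

# The 2 x 2 x 2 grid of alternatives to compare
CONFIGS = list(itertools.product(("unigrams", "bigrams"),
                                 (False, True),        # keep_punctuation
                                 ("counts", "tfidf")))

print(tokenize("Amazing!", keep_punctuation=True))
print(len(CONFIGS))
```

Note that this task needs 30,000 reviews, so the loading cell would have to keep at least that many rather than only the first 10,000, and the raw text would need to be retained for the variants that preserve punctuation.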