<a href="https://colab.research.google.com/github/lingo-mit/6864-hw1/blob/master/6864_hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit 
rm -rf 6864-hw1
git clone https://github.com/lingo-mit/6864-hw1.git

Cloning into '6864-hw1'...


In [22]:
import sys
sys.path.append("/content/6864-hw1")

import csv
import itertools as it
import numpy as np
np.random.seed(0)

import lab_util

## Introduction

In this lab, you'll explore three different ways of using unlabeled text data to learn pretrained word representations. Your lab report will describe the effects of different modeling decisions (representation learning objective, context size, etc.) on both qualitative properties of learned representations and their effect on a downstream prediction problem.

**General lab report guidelines**

Homework assignments should be submitted in the form of a research report. (We'll be providing a place to upload them before the due date, but are still sorting out some logistics.) Please upload PDFs, with a maximum of four single-spaced pages. (If you want, you can use the [Association for Computational Linguistics style files](http://acl2020.org/downloads/acl2020-templates.zip).) Reports should have one section for each part of the homework assignment below. Each section should describe the details of your code implementation and include whatever charts or tables are necessary to answer the set of questions at the end of the corresponding homework part.



We're going to be working with a dataset of product reviews. It looks like this:

In [23]:
data = []
n_positive = 0
n_disp = 0
with open("reviews.csv") as reader:
  csvreader = csv.reader(reader)
  next(csvreader)
  for id, review, label in csvreader:
    label = int(label)

    # hacky class balancing
    if label == 1:
      if n_positive == 2000:
        continue
      n_positive += 1
    if len(data) == 4000:
      break

    data.append((review, label))
    
    if n_disp > 5:
      continue
    n_disp += 1
    print("review:", review)
    print("rating:", label, "(good)" if label == 1 else "(bad)")
    print()

print(f"Read {len(data)} total reviews.")
np.random.shuffle(data)
reviews, labels = zip(*data)
train_reviews = reviews[:3000]
train_labels = labels[:3000]
val_reviews = reviews[3000:3500]
val_labels = labels[3000:3500]
test_reviews = reviews[3500:]
test_labels = labels[3500:]

review: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
rating: 1 (good)

review: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
rating: 0 (bad)

review: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother an

We've provided a little bit of helper code for reading in the dataset; your job is to implement the learning!

## Part 1: word representations via matrix factorization

First, we'll construct the term-document matrix (look at `/content/6864-hw1/lab_util.py` in the file browser on the left if you want to see how this works).

In [24]:
vectorizer = lab_util.CountVectorizer()
vectorizer.fit(train_reviews)
td_matrix = vectorizer.transform(train_reviews).T
print(f"TD matrix is {td_matrix.shape[0]} x {td_matrix.shape[1]}")

TD matrix is 2006 x 3000
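
To make the shape of this matrix concrete, here is a minimal sketch of how a term-document count matrix could be built for a toy corpus. It does not use `lab_util.CountVectorizer`; the `build_td_matrix` helper, the whitespace tokenization, and the toy documents are assumptions made purely for illustration.

```python
import numpy as np

def build_td_matrix(docs):
    # Hypothetical helper: lowercase + whitespace tokenization only.
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    word_to_row = {w: i for i, w in enumerate(vocab)}
    # Rows index words, columns index documents.
    matrix = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, doc in enumerate(docs):
        for w in doc.lower().split():
            matrix[word_to_row[w], j] += 1
    return vocab, matrix

toy_docs = ["good dog food", "bad dog", "good good cookie"]
toy_vocab, toy_td = build_td_matrix(toy_docs)
print(toy_vocab)
print(toy_td)
```

Each row holds one word's counts across all documents, which is why the real matrix above has one row per vocabulary entry and one column per training review.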


Next, implement a function that computes word representations via latent semantic analysis (LSA):

In [25]:
def learn_reps_lsa(matrix, rep_size):
  # `matrix` is a `|V| x n` matrix, where `|V|` is the number of words in the
  # vocabulary. This function should return a `|V| x rep_size` matrix with each
  # row corresponding to a word representation. The `sklearn.decomposition` 
  # package may be useful.

  # Your code here!
  from sklearn.decomposition import TruncatedSVD
  svd = TruncatedSVD(rep_size)
  # `fit_transform` returns U * Sigma; dividing each column by its singular
  # value recovers U, whose rows serve as the word representations.
  u_sigma = svd.fit_transform(matrix)
  word_rep = u_sigma / svd.singular_values_
  return word_rep

Let's look at some representations:

In [26]:
reps = learn_reps_lsa(td_matrix, 500)
words = ["good", "bad", "cookie", "jelly", "dog", "the", "3"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

good 47
  gerber 1.876
  crazy 1.885
  luck 1.887
  beat 1.907
  suspect 1.907
bad 201
  disgusting 1.624
  gone 1.772
  horrible 1.772
  positive 1.778
  shortbread 1.783
cookie 504
  nana's 0.946
  bars 1.392
  odd 1.445
  wants 1.463
  cookies 1.479
jelly 351
  cardboard 1.194
  twist 1.196
  advertised 1.329
  peanuts 1.356
  plastic 1.503
dog 925
  happier 1.667
  earlier 1.691
  eats 1.710
  standard 1.730
  stays 1.734
the 36
  suspect 1.950
  flowers 1.966
  leaked 1.968
  burn 1.969
  m 1.969
3 289
  omega 1.756
  vendor 1.768
  supermarket 1.777
  nutty 1.778
  facts 1.794
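
For readers who want to see what a nearest-neighbor lookup like this involves, the sketch below ranks rows of a representation matrix by cosine distance to a query row. This is an assumption about the general technique only: `nearest_neighbors` is a hypothetical helper, and `lab_util.show_similar_words` may use a different distance or normalization.

```python
import numpy as np

def nearest_neighbors(reps, query_idx, k=5):
    # Assumption: cosine distance; lab_util may use a different measure.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    dists = 1.0 - unit @ unit[query_idx]
    order = np.argsort(dists)
    # Drop the query itself, then keep the k closest rows.
    order = order[order != query_idx][:k]
    return [(int(i), float(dists[i])) for i in order]

toy_reps = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_neighbors(toy_reps, 0, k=2))
```

Smaller values mean more similar rows, matching the ordering of the scores printed above.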


We've been operating on the raw count matrix, but in class we discussed several reweighting schemes aimed at making LSA representations more informative. 

Here, implement the TF-IDF transform and see how it affects learned representations.

In [27]:
def transform_tfidf(matrix):
  # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the 
  # vocabulary size and `|D|` is the number of documents in the corpus. This
  # function should (nondestructively) return a version of `matrix` with the
  # TF-IDF transform applied.

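  # Your code here!
  # Work on a float copy so the input matrix is left untouched.
  matrix = np.array(matrix, dtype=float)
  n_docs = matrix.shape[1]

  # Document frequency: number of documents each word appears in.
  doc_freq = np.count_nonzero(matrix, axis=1)
  # Document length: total token count per document (guarded so that an
  # empty document does not cause a division by zero).
  doc_len = np.maximum(matrix.sum(axis=0), 1)

  # Words that never occur keep an IDF of 0.
  idf = np.zeros(matrix.shape[0])
  seen = doc_freq > 0
  idf[seen] = np.log(n_docs / doc_freq[seen])

  # Term frequency (count / document length) scaled by each word's IDF.
  return (matrix / doc_len) * idf[:, np.newaxis]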
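
As a quick sanity check on the weighting used above (count divided by document length, times `log(|D| / document frequency)`), the sketch below applies it to a hand-made count matrix. This is only one of several common TF-IDF variants, and the toy counts are made up for illustration.

```python
import numpy as np

counts = np.array([[2, 1, 1],   # word appearing in every document
                   [1, 0, 0],   # word appearing in one document
                   [0, 3, 0]])  # word appearing in one document

n_docs = counts.shape[1]
doc_freq = np.count_nonzero(counts, axis=1)   # documents containing each word
doc_len = counts.sum(axis=0)                  # tokens per document
tf = counts / doc_len
idf = np.log(n_docs / doc_freq)
weighted = tf * idf[:, np.newaxis]
print(np.round(weighted, 3))
```

The first row becomes all zeros because a word that appears in every document has an IDF of `log(1) = 0`, which is the behavior that pushes words like _the_ toward the origin.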

How does this change the learned similarity function?

In [115]:
Wtt = td_matrix.dot(td_matrix.T)
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_Wtt = learn_reps_lsa(Wtt, 200)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)
lab_util.show_similar_words(vectorizer.tokenizer, reps_Wtt, show_tokens)
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

good 47
  required 1.742
  sample 1.766
  lays 1.790
  suspect 1.805
  customers 1.808
bad 201
  death 1.374
  classic 1.546
  positive 1.577
  ate 1.590
  floor 1.594
cookie 504
  nana's 1.061
  shape 1.335
  bars 1.484
  likely 1.518
  keeps 1.545
jelly 351
  shocked 1.292
  plum 1.300
  beans 1.306
  anyone 1.351
  softer 1.363
dog 925
  required 1.270
  lamb 1.608
  pets 1.620
  ball 1.638
  animals 1.673
the 36
  <unk> 1.433
  and 1.492
  of 1.561
  bottom 1.563
  best 1.604
3 289
  earlier 1.426
  serious 1.457
  omega 1.504
  pricing 1.550
  includes 1.559
good 47
  crazy 1.640
  flaxseed 1.723
  suspect 1.734
  pretty 1.746
  bread 1.747
bad 201
  awful 1.273
  huge 1.313
  smells 1.328
  disgusting 1.359
  minutes 1.443
cookie 504
  cookies 0.499
  nana's 0.581
  oreos 1.051
  bites 1.159
  bars 1.283
jelly 351
  twist 1.166
  creamer 1.259
  refund 1.398
  cardboard 1.400
  shipped 1.412
dog 925
  foods 1.107
  stays 1.147
  switched 1.248
  appeal 1.269
  pet 1.280
the 36
  

Now that we have some representations, let's see if we can do something useful with them.

Below, implement a feature function that represents a document as the sum of its
learned word embeddings.

The remaining code trains a logistic regression model on a set of *labeled* reviews; we're interested in seeing how much representations learned from *unlabeled* reviews improve classification.
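
The "sum of word embeddings" featurization reduces to a single matrix product between the count matrix and the embedding matrix. The sketch below checks this on toy values; `toy_reps` and `toy_counts` are made-up stand-ins for the learned representations and the vectorized reviews.

```python
import numpy as np

toy_reps = np.array([[1.0, 0.0],    # embedding for word 0
                     [0.0, 1.0],    # embedding for word 1
                     [1.0, 1.0]])   # embedding for word 2
toy_counts = np.array([[2, 0, 1],   # review A: word 0 twice, word 2 once
                       [0, 1, 0]])  # review B: word 1 once

# One row per review, one column per embedding dimension.
summed = toy_counts @ toy_reps
# The same result for review A, written out by hand.
manual = 2 * toy_reps[0] + 1 * toy_reps[2]
print(summed)
print(np.allclose(summed[0], manual))
```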

In [116]:
def word_featurizer(xs):
  # normalize
  return xs / np.sqrt((xs ** 2).sum(axis=1, keepdims=True))

def lsa_featurizer(xs):
  # This function takes in a matrix in which each row contains the word counts
  # for the given review. It should return a matrix in which each row contains
  # the learned feature representation of each review (e.g. the sum of LSA 
  # word representations).
  
  # Your code here!
  # `xs` is `n_reviews x |V|` and `reps_tfidf` is `|V| x rep_size`, so the
  # product sums each review's word embeddings weighted by their counts.
  feats = xs.dot(reps_tfidf)
  # normalize
  return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

def combo_featurizer(xs):
  return np.concatenate((word_featurizer(xs), lsa_featurizer(xs)), axis=1)

def train_model(featurizer, xs, ys):
  import sklearn.linear_model
  xs_featurized = featurizer(xs)
  model = sklearn.linear_model.LogisticRegression()
  model.fit(xs_featurized, ys)
  return model

def eval_model(model, featurizer, xs, ys):
  xs_featurized = featurizer(xs)
  pred_ys = model.predict(xs_featurized)
  print("test accuracy", np.mean(pred_ys == ys))

def training_experiment(name, featurizer, n_train):
  print(f"{name} features, {n_train} examples")
  train_xs = vectorizer.transform(train_reviews[:n_train])
  train_ys = train_labels[:n_train]
  test_xs = vectorizer.transform(test_reviews)
  test_ys = test_labels
  print('------------ Train ------------')
  print(train_xs.shape)
  model = train_model(featurizer, train_xs, train_ys)
  print('------------ Eval ------------')
  print(test_xs.shape)
  eval_model(model, featurizer, test_xs, test_ys)
  print()

training_experiment("word", word_featurizer, 10)
training_experiment("lsa", lsa_featurizer, 10)
training_experiment("combo", combo_featurizer, 10)

word features, 10 examples
------------ Train ------------
(10, 2006)
------------ Eval ------------
(500, 2006)
test accuracy 0.496

lsa features, 10 examples
------------ Train ------------
(10, 2006)
------------ Eval ------------
(500, 2006)
test accuracy 0.49

combo features, 10 examples
------------ Train ------------
(10, 2006)
------------ Eval ------------
(500, 2006)
test accuracy 0.5



**Part 1: Lab writeup**

Part 1 of your lab report should discuss any implementation details that were important to filling out the code above. Then, use the code to set up experiments that answer the following questions:

1. Qualitatively, what do you observe about nearest neighbors in representation space? (E.g. what words are most similar to _the_, _dog_, _3_, and _good_?)

2. How does the size of the LSA representation affect this behavior?


3. Recall that we can compute the word co-occurrence matrix $W_{tt} = W_{td} W_{td}^\top$. What can you prove about the relationship between the
   left singular vectors of $W_{td}$ and $W_{tt}$? Do you observe this behavior
   with your implementation of `learn_reps_lsa`? Why or why not?

4. Do learned representations help with the review classification problem? What
   is the relationship between the number of labeled examples and the effect of
   word embeddings?
   
5. What is the relationship between the size of the word embeddings and their usefulness for the classification task?
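
For question 3, a small numerical experiment can complement the proof. If $W_{td} = U \Sigma V^\top$, then $W_{tt} = W_{td} W_{td}^\top = U \Sigma^2 U^\top$, so the two matrices should share left singular vectors up to sign. The sketch below checks this with an exact SVD on a random toy matrix; the matrix sizes and the random seed are arbitrary assumptions. Note that `TruncatedSVD` uses a randomized solver by default, so results from `learn_reps_lsa` may match only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
W_td = rng.normal(size=(6, 4))   # toy "term-document" matrix
W_tt = W_td @ W_td.T             # toy "term-term" matrix

U_td, s_td, _ = np.linalg.svd(W_td, full_matrices=False)
U_tt, s_tt, _ = np.linalg.svd(W_tt)

k = 4  # rank of the toy matrix
# Absolute inner products between matching singular vectors; a value of 1
# means the vectors agree up to a sign flip.
overlap = np.abs(np.sum(U_td[:, :k] * U_tt[:, :k], axis=0))
print(np.round(overlap, 6))
print(np.allclose(s_tt[:k], s_td ** 2))
```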

## Part 2: word representations via language modeling

In this section, we'll train a word embedding model with a word2vec-style objective rather than a matrix factorization objective. This requires a little more work; we've provided scaffolding for a PyTorch model implementation below.
(If you've never used PyTorch before, there are some tutorials [here](https://pytorch.org/tutorials/). You're also welcome to implement these experiments in
any other framework of your choosing.)

In [92]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as torch_data

class Word2VecModel(nn.Module):
  # A torch module implementing a word2vec predictor. The `forward` function
  # should take a batch of context word ids as input and predict the word 
  # in the middle of the context as output, as in the CBOW model from lecture.

  def __init__(self, vocab_size, embed_dim):
      super().__init__()
      # Your code here!
      self.voc_num = vocab_size
      self.emb_dim = embed_dim
      # Input embeddings: one `embed_dim`-dimensional vector per word.
      self.word_embed = nn.Embedding(vocab_size, embed_dim)
      # Output layer: maps the pooled context vector to a score per word.
      self.weight1 = nn.Linear(embed_dim, vocab_size)

  def forward(self, context):
      # Context is an `n_batch x n_context` matrix of integer word ids
      # this function should return a set of scores for predicting the word 
      # in the middle of the context

      # Your code here!
      # Average the context word embeddings: `n_batch x embed_dim`.
      mean_rep = self.word_embed(context).mean(dim=1)
      out_rep = self.weight1(mean_rep)
      # Log-probabilities over the vocabulary, to pair with `nn.NLLLoss`.
      score = F.log_softmax(out_rep, dim=1)
      return score

In [99]:
def learn_reps_word2vec(corpus, window_size, rep_size, n_epochs, n_batch):
  # This method takes in a corpus of training sentences. It returns a matrix of
  # word embeddings with the same structure as used in the previous section of 
  # the assignment. (You can extract this matrix from the parameters of the 
  # Word2VecModel.)

  tokenizer = lab_util.Tokenizer()
  tokenizer.fit(corpus)
  tokenized_corpus = tokenizer.tokenize(corpus)

  ngrams = lab_util.get_ngrams(tokenized_corpus, window_size)

  device = torch.device('cpu')  # runs on CPU (to use the Colab GPU, also move each batch to the device)
  model = Word2VecModel(tokenizer.vocab_size, rep_size).to(device)
  optimizer = optim.Adam(model.parameters(), lr=0.001)
  loss_fn = nn.NLLLoss() # Your code here

  loader = torch_data.DataLoader(ngrams, batch_size=n_batch, shuffle=True)

  for epoch in range(n_epochs):
    for context, label in loader:
      # as described above, `context` is a batch of context word ids, and
      # `label` is a batch of predicted word labels
      # Your code here!
      model.zero_grad()
      pred = model(context)
      loss = loss_fn(pred, label)
      loss.backward()
      optimizer.step()

  # reminder: you want to return a `vocab_size x embedding_size` numpy array
  # Your code here!
  # Note: this returns a torch tensor of input embeddings; the
  # nearest-neighbor cell below converts it with `.numpy()`.
  return model.word_embed.weight.data
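
The training loop above consumes `(context, label)` pairs, where the label is the word in the middle of the context window. The sketch below shows one plausible way such pairs could be generated from a tokenized sentence. `make_cbow_pairs` is a hypothetical helper written for illustration; `lab_util.get_ngrams` may pad sentence boundaries or order the context differently.

```python
def make_cbow_pairs(tokens, window_size):
    # Hypothetical helper; lab_util.get_ngrams may pad or index differently.
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        # `window_size` ids on each side of position i, skipping the center.
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

toy_tokens = [10, 11, 12, 13, 14]
print(make_cbow_pairs(toy_tokens, 1))
```

With `window_size = 1`, each context holds just the two immediate neighbors of the target word; larger windows add more distant words to every context.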

In [113]:
reps_word2vec = learn_reps_word2vec(train_reviews, 1, 500, 10, 100)

After training the embeddings, we can try to visualize the embedding space to see if it makes sense. First, we can take any word in the space and check its closest neighbors.

In [114]:
lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec.numpy(), show_tokens)

good 47
  superior 1.687
  nutritious 1.690
  doing 1.691
  hole 1.696
  part 1.706
bad 201
  sugary 1.680
  tart 1.712
  wide 1.720
  liquid 1.722
  bottom 1.723
cookie 504
  exact 1.679
  beyond 1.713
  satisfying 1.720
  0 1.723
  nicely 1.728
jelly 351
  local 1.633
  bears 1.662
  junk 1.666
  nutrition 1.668
  egg 1.670
dog 925
  cat 1.594
  acid 1.651
  puppy 1.671
  belgian 1.685
  wrong 1.689
the 36
  their 1.540
  our 1.540
  my 1.668
  every 1.715
  a 1.725
3 289
  9 1.663
  5 1.678
  salsa 1.678
  zero 1.693
  40 1.694


We can also cluster the embedding space. Clustering in 4 or more dimensions is hard to visualize, and even clustering in 2 or 3 can be difficult because there are so many words in the vocabulary. One thing we can try to do is assign cluster labels and qualitatively look for an underlying pattern in the clusters.

In [103]:
from sklearn.cluster import KMeans

indices = KMeans(n_clusters=10).fit_predict(reps_word2vec)
zipped = list(zip(range(vectorizer.tokenizer.vocab_size), indices))
np.random.shuffle(zipped)
zipped = zipped[:100]
zipped = sorted(zipped, key=lambda x: x[1])
for token, cluster_idx in zipped:
  word = vectorizer.tokenizer.token_to_word[token]
  print(f"{word}: {cluster_idx}")

w: 0
seeds: 0
themselves: 1
provide: 1
work: 1
close: 1
bake: 1
replace: 1
feel: 1
carbonated: 2
what: 2
pizza: 2
stomach: 2
science: 2
delivery: 2
plum: 2
coconut: 2
constantly: 2
fact: 3
list: 3
holds: 3
him: 3
anywhere: 3
service: 3
bbq: 5
greta: 5
points: 5
part: 5
tassimo: 5
toddler: 5
traditional: 5
candy: 5
varieties: 5
http: 5
lot: 5
result: 5
seller: 5
marinade: 5
<unk>: 5
suggested: 5
party: 5
needed: 6
seems: 6
would: 6
might: 6
followed: 6
looked: 6
lock: 6
makes: 6
tangy: 7
awful: 7
bold: 7
china: 7
rotten: 7
black: 7
great: 7
little: 7
natural: 7
flavored: 7
eater: 7
almond: 7
hot: 7
granted: 7
priced: 7
terrific: 7
health: 8
15: 8
literally: 8
longer: 8
stevia: 8
this: 8
acid: 8
dead: 8
most: 8
wheat: 8
finally: 8
wild: 8
kids: 8
portion: 8
clams: 8
oats: 8
otherwise: 8
fence: 8
perfectly: 8
same: 8
also: 8
items: 8
gummy: 8
e: 8
?: 9
here: 9
until: 9
stock: 9
advertised: 9
careful: 9
life: 9
favorites: 9
excited: 9
comparable: 9
chance: 9
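
A flat `word: cluster` listing like the one above can be hard to scan. One option, sketched below under the assumption that you already have parallel lists of words and cluster ids, is to group the words by cluster before printing. `group_by_cluster` is a hypothetical helper; the toy words and ids are taken from the output above.

```python
from collections import defaultdict

def group_by_cluster(words, cluster_ids):
    # Collect the words assigned to each cluster id, sorted by id.
    groups = defaultdict(list)
    for word, cluster_id in zip(words, cluster_ids):
        groups[int(cluster_id)].append(word)
    return dict(sorted(groups.items()))

toy_words = ["tangy", "would", "awful", "might", "great"]
toy_ids = [7, 6, 7, 6, 7]
for cluster_id, members in group_by_cluster(toy_words, toy_ids).items():
    print(cluster_id, ", ".join(members))
```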


Finally, we can use the trained word embeddings to construct vector representations of full reviews. One common approach is to simply average all the word embeddings in the review to create an overall embedding. Implement the transform function in Word2VecFeaturizer to do this.

In [104]:
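def lsa_featurizer(xs):
  # Note: this redefines `lsa_featurizer` from Part 1 to use word2vec.
  # Your code here!
  # Average each review's word embeddings: sum the embeddings weighted by
  # word counts, then divide by the review's total word count.
  feats = xs.dot(reps_word2vec) / np.sum(xs, axis=1, keepdims=True)

  # normalize
  return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))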

training_experiment("word2vec", lsa_featurizer, 3000)

word2vec features, 3000 examples
------------ Train ------------
(3000, 2006)
------------ Eval ------------
(500, 2006)
test accuracy 0.788



**Part 2: Lab writeup**

Part 2 of your lab report should discuss any implementation details that were important to filling out the code above. Then, use the code to set up experiments that answer the following questions:

1. Qualitatively, what do you observe about nearest neighbors in representation space? (E.g. what words are most similar to _the_, _dog_, _3_, and _good_?) How well do word2vec representations correspond to your intuitions about word similarity?

2. One important parameter in word2vec-style models is context size. How does changing the context size affect the kinds of representations that are learned?

3. How do results on the downstream classification problem compare to 
   part 1?

4. What are some advantages and disadvantages of learned embedding representations, relative to the featurization done in part 1?

5. What are some potential problems with constructing a representation of the review by averaging the embeddings of the individual words?
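
As a starting point for question 5, the sketch below illustrates one property of averaged embeddings that may be worth discussing: the average ignores word order. The toy vocabulary and the random embeddings are assumptions for illustration and are unrelated to the trained model.

```python
import numpy as np

toy_vocab = {"dog": 0, "bites": 1, "man": 2}
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(len(toy_vocab), 4))

def average_embedding(sentence):
    # Look up each word's embedding and average them into one vector.
    ids = [toy_vocab[w] for w in sentence.split()]
    return toy_embeddings[ids].mean(axis=0)

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")
# Both sentences contain the same words, so their averages coincide.
print(np.allclose(a, b))
```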