# 3rd Exercise

Prepared by: *Hardian Lawi*

In [0]:
import re
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

In [0]:
%%bash

wget -qO yelp_review_polarity_csv.tgz https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
wget -qO GoogleNews-vectors-negative300.bin.gz "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
gunzip GoogleNews-vectors-negative300.bin.gz
tar -xvzf yelp_review_polarity_csv.tgz
ls

yelp_review_polarity_csv/
yelp_review_polarity_csv/train.csv
yelp_review_polarity_csv/readme.txt
yelp_review_polarity_csv/test.csv
GoogleNews-vectors-negative300.bin
GoogleNews-vectors-negative300.bin.gz
sample_data
yelp_review_polarity_csv
yelp_review_polarity_csv.tgz


gzip: GoogleNews-vectors-negative300.bin already exists;	not overwritten


# Load pre-trained Embeddings

In [0]:
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


# Process Dataset

In [0]:
train = pd.read_csv('yelp_review_polarity_csv/train.csv', names=["label", "text"])
test = pd.read_csv('yelp_review_polarity_csv/test.csv', names=["label", "text"])

train = train.head(10000)
test = test.head(1000)

y_train = train.label - 1
y_test = test.label - 1

corpus = train.text.to_list() + test.text.to_list()

## Tokenization

In [0]:
def tokenize(corpus, lower=True, pattern=r'(?u)\b\w\w+\b'):
  """Split each document into tokens of two or more word characters."""
  rx = re.compile(pattern)
  tokenized_corpus = []
  for sent in corpus:
    tokens = rx.findall(sent)
    if lower:
      tokens = [x.lower() for x in tokens]
    tokenized_corpus.append(tokens)
  return tokenized_corpus

tokenized_corpus = tokenize(corpus)
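
To see what the default pattern keeps and drops, here is a small check on a made-up sentence (the sentence is illustrative and not taken from the dataset). Single-character tokens and punctuation are discarded, and a contraction is split at its apostrophe.

```python
import re

# Same pattern as the tokenizer above: runs of two or more word characters.
pattern = re.compile(r'(?u)\b\w\w+\b')

# Hypothetical example sentence, lowercased as in tokenize(..., lower=True).
text = "I didn't love it. A+ food!"
tokens = pattern.findall(text.lower())
print(tokens)  # ['didn', 'love', 'it', 'food']
```

Note that "I", "t", and "A" disappear because the pattern requires at least two word characters.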

## Convert each sentence to a vector



In [0]:
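embed_corpus = []
invalid_counts = 0  # documents with no in-vocabulary token
for doc in tokenized_corpus:
  temp = np.zeros((1, 300), dtype=float)
  count = 0
  for token in doc:
    # `token in model` works across gensim versions; `model.vocab` was removed in gensim 4
    if token in model:
      temp += model[token]
      count += 1
  if count != 0:
    temp /= count  # mean of the word vectors
  else:
    invalid_counts += 1  # document stays an all-zero vector
  embed_corpus.append(temp)
embed_corpus = np.concatenate(embed_corpus)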

print('Invalid counts:', invalid_counts)

In [0]:
train_emb = embed_corpus[:train.shape[0]]
test_emb = embed_corpus[train.shape[0]:]

print('train size:', train_emb.shape)
print('test size:', test_emb.shape)

# Training

In [0]:
# Use a separate name so the word2vec `model` loaded above is not overwritten
clf = LogisticRegression()
clf.fit(train_emb, y_train)

print('Train acc:', clf.score(train_emb, y_train))
print('Test acc:', clf.score(test_emb, y_test))

# Bonus

This model performs worse than both the TF and TF-IDF representations. Even if some dimensions of the embeddings encode how positive or negative a word is, taking the mean over all words dilutes that signal, because most words in a document carry neutral sentiment. A **weighted average** of the embeddings (weighting each word by its TF or TF-IDF score) should mitigate this problem. Alternatively, instead of averaging, we could **concatenate** the embeddings into one long vector: if each word is a 300-dimensional vector, a 5-word document becomes a 1500-dimensional vector. Concatenation raises a different problem, since documents differ in length, so the vectors have to be padded with zeros to a common length.
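
Both ideas can be sketched on toy data. The snippet below is a minimal illustration, not the exercise's reference solution: the 3-dimensional `toy_vectors`, the three tiny documents, and the helper names `idf_weights`, `weighted_mean`, and `padded_concat` are all assumptions made for this example, and the IDF uses one common smoothed form.

```python
import math
from collections import Counter

import numpy as np

# Toy 3-dimensional embeddings (hypothetical values standing in for word2vec).
toy_vectors = {
    "food":  np.array([0.0, 1.0, 0.0]),
    "was":   np.array([0.0, 0.0, 1.0]),
    "great": np.array([1.0, 0.0, 0.0]),
    "awful": np.array([-1.0, 0.0, 0.0]),
}
docs = [["food", "was", "great"], ["food", "was", "awful"], ["food", "was", "food"]]

def idf_weights(docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

def weighted_mean(doc, vectors, idf, dim=3):
    """TF-IDF-weighted average of the vectors of in-vocabulary tokens."""
    total, weight_sum = np.zeros(dim), 0.0
    for tok, tf in Counter(doc).items():
        if tok in vectors:
            w = tf * idf.get(tok, 1.0)
            total += w * vectors[tok]
            weight_sum += w
    return total / weight_sum if weight_sum > 0 else total

def padded_concat(doc, vectors, max_len, dim=3):
    """Concatenate token vectors, truncating or zero-padding to max_len tokens."""
    out = np.zeros((max_len, dim))
    kept = [vectors[tok] for tok in doc if tok in vectors][:max_len]
    if kept:
        out[:len(kept)] = np.stack(kept)
    return out.ravel()

idf = idf_weights(docs)
plain = np.mean([toy_vectors[t] for t in docs[0]], axis=0)
weighted = weighted_mean(docs[0], toy_vectors, idf)
concat = padded_concat(docs[0][:2], toy_vectors, max_len=4)

print(plain[0], weighted[0])  # the rare word "great" gets more weight
print(concat.shape)           # (12,): 2 real tokens + 2 zero-padded slots
```

In this toy setup the first dimension plays the role of a sentiment axis: the rare word "great" contributes more to the weighted mean than to the plain mean, while the words that appear in every document are down-weighted.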

The approaches we have seen so far do not seem to learn anything about the positions of the words. One way to incorporate positional information into our linear model is to use *positional embeddings*. Alternatively, we could use a model such as a *Recurrent Neural Network*, which is specifically designed to learn from sequence data.
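
The loss of word order under averaging is easy to check on a toy case. In this sketch the 2-dimensional `toy_vectors` and the helpers `mean_embed` and `concat_embed` are hypothetical, chosen only to show that two documents with the same words in a different order get identical mean embeddings, whereas an order-preserving representation such as concatenation keeps them apart.

```python
import numpy as np

# Toy 2-dimensional embeddings (hypothetical values).
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "not":  np.array([0.0, 1.0]),
    "bad":  np.array([-1.0, 0.5]),
}

def mean_embed(doc):
    """Average the word vectors; the result ignores word order."""
    return np.mean([toy_vectors[t] for t in doc], axis=0)

def concat_embed(doc):
    """Concatenate the word vectors; each position gets its own slots."""
    return np.concatenate([toy_vectors[t] for t in doc])

a = ["good", "not", "bad"]   # "good, not bad"
b = ["bad", "not", "good"]   # "bad, not good"

same_mean = np.allclose(mean_embed(a), mean_embed(b))
same_concat = np.allclose(concat_embed(a), concat_embed(b))
print(same_mean, same_concat)  # True False
```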