# Sentiment Analysis

### ECE590 Homework assignment 2
Name: Javier Cervantes
net id: jc1010

We are interested in sentiment analysis. Given a short document, we wish to assess whether the corresponding sentiment is positive (label $y=1$) or negative (label $y=0$). The assignment is as follows:

1. For every word $i$ in the vocabulary, we will learn a corresponding $d$-dimensional vector $x_i \in \mathbb{R}^d$.

2. Assume that there are $M_j$ words in document $j$. The feature vector for this document is $f_j = \frac{1}{M_j} \sum_{m=1}^{M_j} x_{(m, j)}$, where $x_{(m, j)}$ is the $d$-dimensional vector for the $m$-th word in document $j$.

3. The probability of positive sentiment for document $j$ is modeled as $P(y_j = 1 \mid f_j) = \sigma(w \cdot f_j + b)$, where $\sigma$ is the sigmoid function, $w \in \mathbb{R}^d$ is the weight vector, and $b \in \mathbb{R}$ is the bias.
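
To make the three steps concrete, here is a minimal sketch in plain Python (no PyTorch) that averages word vectors and applies the sigmoid. The 2-dimensional vectors, the weights `w`, and the bias `b` are made-up values for illustration, not learned parameters, and the function names are my own.

```python
import math

# Hypothetical 2-dimensional word vectors (d = 2); the real ones are learned.
word_vectors = {
    "great": [1.0, 0.5],
    "food": [0.0, 0.5],
    "awful": [-1.0, -0.5],
}


def document_feature(tokens, vectors):
    """Average the word vectors of a document: f_j = (1 / M_j) * sum_m x_(m, j)."""
    d = len(next(iter(vectors.values())))
    total = [0.0] * d
    for token in tokens:
        for k, value in enumerate(vectors[token]):
            total[k] += value
    return [value / len(tokens) for value in total]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(tokens, vectors, w, b):
    """P(y_j = 1 | f_j) = sigmoid(w . f_j + b)."""
    f = document_feature(tokens, vectors)
    return sigmoid(sum(wk * fk for wk, fk in zip(w, f)) + b)


w, b = [2.0, 1.0], 0.0  # illustrative weights and bias
p_pos = positive_probability(["great", "food"], word_vectors, w, b)
p_neg = positive_probability(["awful", "food"], word_vectors, w, b)
```

With these toy values, the document containing "great" gets a probability above 0.5 and the one containing "awful" gets a probability below 0.5.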

In [1]:
from datasets import Dataset, load_dataset, load_from_disk
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchtext
import nltk
from nltk.corpus import stopwords
import numpy as np
import tqdm
import pandas as pd
import collections

In [2]:
seed = 257

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

## Prepare the data

We begin by tokenizing and cleaning the text: we convert it to lowercase, remove punctuation, and remove stopwords. It's worth noting that removing stopwords can be problematic in some model designs, since the stopword list includes negations such as "not". This particular model averages word vectors and therefore ignores word order, so it cannot tell which word a "not" modifies. For that reason I expect removing such words to have little effect on the model's performance.
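
The order-invariance behind this argument can be checked directly. The sketch below uses made-up 2-dimensional vectors (an assumption for illustration only) and shows that two reviews with opposite meanings but the same words collapse to the same averaged feature.

```python
# Hypothetical word vectors; the values are arbitrary.
vectors = {
    "not": [0.0, -1.0],
    "good": [1.0, 1.0],
    "bad": [-1.0, 0.0],
    "just": [0.5, 0.5],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the given tokens, dimension by dimension."""
    columns = zip(*(vectors[token] for token in tokens))
    return [sum(column) / len(tokens) for column in columns]


review_a = ["not", "good", "just", "bad"]  # negative in meaning
review_b = ["not", "bad", "just", "good"]  # positive in meaning

feature_a = mean_vector(review_a, vectors)
feature_b = mean_vector(review_b, vectors)
same_feature = feature_a == feature_b  # the average ignores word order
```

Because both reviews contain the same multiset of words, the model receives identical inputs for them, whether or not "not" is kept.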

In [3]:
# load the dataset
train_data, test_data = load_dataset("yelp_polarity", split=["train", "test"])

In [7]:
# tokenize the dataset
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")


def tokenize(obs, tokenizer, max_length):
    """
    Tokenize an observation
    max_length: the maximum length of the tokenized sequence
    """
    return {"tokens": tokenizer(obs["text"])[:max_length]}

In [5]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cerva\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# remove stopwords and punctuation
# a set makes the membership tests below O(1) instead of scanning a list
stop_words = set(stopwords.words("english"))


def remove_stopwords(obs):
    """
    Removes stopwords from tokens for each obs in Dataset
    """
    obs["tokens"] = [word for word in obs["tokens"] if word not in stop_words]
    return obs


def remove_punctuation(obs):
    """
    Removes punctuation from tokens for each obs in Dataset
    """
    obs["tokens"] = [word for word in obs["tokens"] if word.isalpha()]
    return obs


def tokenize_and_clean(obs, max_length):
    """
    Tokenize, remove stopwords and punctuation from observation
    """
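    # The slice is applied to the raw string, so max_length caps characters,
    # not tokens. To cap tokens, use tokenizer(obs["text"])[:max_length].
    tokens = tokenizer(obs["text"][:max_length])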
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [word for word in tokens if word.isalpha()]
    return {"tokens": tokens}


# train_data = train_data.map(remove_stopwords)

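Working under the assumption that a document's sentiment can be identified rather quickly, I've capped each document at `max_length = 100`. As the code is written, the cap is applied to the raw text before tokenization, so it keeps the first 100 characters of each document rather than the first 100 words. This is a hyperparameter that can be adjusted to improve the model's performance.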
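
Where the cap is applied matters: slicing the raw string limits characters, while slicing the token list limits words. The sketch below shows the difference on a made-up sentence, using a plain whitespace split as a stand-in for the `basic_english` tokenizer (an assumption to keep the example dependency-free).

```python
text = "The food was absolutely wonderful and the service was even better"
max_length = 10

# Slice first, then tokenize: keeps only the first 10 characters.
by_characters = text[:max_length].split()

# Tokenize first, then slice: keeps the first 10 tokens.
by_tokens = text.split()[:max_length]
```

Here `by_characters` holds three fragments, the last one a cut-off word, while `by_tokens` holds ten whole words.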

In [9]:
max_length = 100

# tokenizer(train_data[0]["text"][:512])

train_data = train_data.map(tokenize_and_clean, fn_kwargs={"max_length": max_length})

# test_data = test_data.map(tokenize_and_clean)


In [8]:
# train_data.save_to_disk("/datasets/yelp_polarity_train")
# train_data = load_from_disk("/datasets/yelp_polarity_train/")

Now that our data has been tokenized and cleaned, we can create a validation set.

In [10]:
# validation data
train_valid_data = train_data.train_test_split(test_size=0.25)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

From the training data, we now build a vocabulary composed of its unique words. I'll require a minimum frequency of 30 for a word to be included. Given the large number of documents in the training set (420,000), dropping words that appear only a few times should not noticeably affect the model's performance.

It is also important to add two special tokens to the vocabulary: one for unknown words and one for padding. The unknown token stands in for words that are not in the vocabulary, and the padding token is used to make all documents in a batch the same length.
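
The idea of a frequency threshold plus special tokens can be sketched without torchtext. The function below is a hypothetical stand-in written with `collections.Counter`; its ordering rule (most frequent first, ties broken alphabetically) is an assumption of this sketch, not a claim about how torchtext orders its entries.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials, min_freq):
    """Map tokens to indices: specials first, then tokens seen at least min_freq times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    kept = sorted(
        (token for token, n in counts.items() if n >= min_freq),
        key=lambda token: (-counts[token], token),
    )
    return {token: index for index, token in enumerate(list(specials) + kept)}


docs = [["good", "food"], ["good", "service"], ["bad", "food"]]
toy_vocab = build_toy_vocab(docs, specials=["<unk>", "<pad>"], min_freq=2)
```

With `min_freq=2`, "service" and "bad" appear only once and are left out, so they would later be treated as unknown words.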

In [11]:
# creating the vocabulary
special_tokens = ["<unk>", "<pad>"]

# setting a minimum frequency for the tokens: 30 occurrences across 420,000 documents is a low bar
vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["tokens"], specials=special_tokens, min_freq=30
)
vocab.set_default_index(vocab["<unk>"])
len(vocab)

7827

We now have a vocabulary of 7,827 unique tokens (including the two special tokens). The call to `set_default_index` tells the `Vocab` object which index to return for out-of-vocabulary words. We set it to the index of `<unk>`, so every unknown word is mapped to the unknown token.

Now that we have the vocabulary, we can numerically encode the words in the data using indices from the vocabulary we just created. We also need to pad the sequences so that they're all the same length and we don't run into issues when inputting them into the model.
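
As a rough picture of what the encoding step does, here is a minimal sketch with a hand-written dictionary in place of the real vocabulary. The dictionary contents and the function name `numericalize` are hypothetical.

```python
# Hypothetical four-entry vocabulary; the real one has 7,827 entries.
small_vocab = {"<unk>": 0, "<pad>": 1, "food": 2, "good": 3}


def numericalize(tokens, vocab, unk_token="<unk>"):
    """Replace each token by its index, falling back to the <unk> index."""
    unk_index = vocab[unk_token]
    return [vocab.get(token, unk_index) for token in tokens]


ids = numericalize(["good", "sushi", "food"], small_vocab)
```

The word "sushi" is not in the toy vocabulary, so it receives index 0, the index of `<unk>`.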

In [12]:
def numericalize_example(obs, vocab):
    ids = vocab.lookup_indices(obs["tokens"])
    return {"ids": ids}


train_data = train_data.map(numericalize_example, fn_kwargs={"vocab": vocab})
valid_data = valid_data.map(numericalize_example, fn_kwargs={"vocab": vocab})
# test_data = test_data.map(numericalize_example, fn_kwargs={"vocab": vocab})


In [16]:
train_data = train_data.with_format("torch", columns=["ids", "label"])
valid_data = valid_data.with_format("torch", columns=["ids", "label"])
# test_data = test_data.with_format("torch", columns = ["ids", "label"])

We're going to make use of data loaders, which let us load the data in batches for training. This is where the padding mentioned above happens. A collate function structures each batch for the model: it pads the documents in the batch to a common length and builds a single tensor for the data and a single tensor for the labels. Note that `pad_sequence` pads to the longest document in the batch, so the data tensor has shape (batch_size, longest_length_in_batch), and stacking the scalar labels gives a labels tensor of shape (batch_size,).
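
The padding step can be pictured with plain lists. This is only a sketch of the idea (right-padding to the longest sequence in the batch), with a hypothetical helper `pad_batch` and a made-up batch; the notebook itself relies on `nn.utils.rnn.pad_sequence`.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(sequence) for sequence in sequences)
    return [
        sequence + [pad_index] * (longest - len(sequence)) for sequence in sequences
    ]


batch = [[3, 0, 2], [2], [3, 2]]  # three documents of different lengths
padded = pad_batch(batch, pad_index=1)
shape = (len(padded), len(padded[0]))  # (batch_size, longest length in the batch)
```

One consequence worth keeping in mind: the padded positions are not real words, so they should not count toward the average when the document feature is computed.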

In [19]:
pad_index = vocab["<pad>"]


def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [doc["ids"] for doc in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_labels = [doc["label"] for doc in batch]
        batch_labels = torch.stack(batch_labels)
        return {"ids": batch_ids, "label": batch_labels}

    return collate_fn


def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=shuffle
    )
    return data_loader


Now we set the batch size and create the data loaders for the training and validation sets. The test loader is left commented out for now.

In [None]:
batch_size = 256

train_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_loader = get_data_loader(valid_data, batch_size, pad_index, shuffle=False)
# test_loader = get_data_loader(test_data, batch_size, pad_index, shuffle=False)
