# 2. Sentiment Analysis Using Neural Networks

In class, we learned about sequence representation and how language models are developed for different tasks. In this assignment, we will implement two neural network models for the sentiment analysis task using the IMDB dataset. Sentiment analysis is a Natural Language Processing (NLP) task that classifies sentences or longer text into categories based on the sentiment expressed. Here, the goal is to determine whether the sentiment of a text is positive or negative. This analysis helps in understanding the overall opinion or emotion conveyed by the text.

Please note that this assignment was built and tested on *Google Colaboratory*. If you work on a local machine, you will need to resolve any version issues on your own. Please complete the given Jupyter notebook file and submit it along with your answers in the LaTeX file.

In [4]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Preprocessing

In this assignment, we will use the TorchText package to handle the data. There are several ways to process the data, but this package makes the procedure much easier. We will go through the following steps with TorchText:

- Preprocessing
- Split into train and test set
- Build dataset
- Building vocabulary
- Batching the data

Once you load the data, there will be two columns: the review text and the sentiment label (1 for positive and 0 for negative).
Out of the 50,000 samples, we will use the first 30,000 as our training set and the remaining 20,000 as our test set.
You can refer to [the TorchText documentation](https://torchtext.readthedocs.io/en/latest/index.html) for more information on TorchText.

In [2]:
!pip install torchtext==0.6.0

Collecting torchtext==0.6.0
  Downloading torchtext-0.6.0-py3-none-any.whl.metadata (6.3 kB)
Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
   64.2/64.2 kB 2.0 MB/s eta 0:00:00
Installing collected packages: torchtext
Successfully installed torchtext-0.6.0


In [3]:
import urllib.request
import pandas as pd
from torchtext import data, datasets
from torchtext.vocab import Vocab
from torchtext.data import TabularDataset
from torchtext.data import Iterator

In [5]:
# load data
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv",
    filename="IMDb_Reviews.csv",
)

('IMDb_Reviews.csv', <http.client.HTTPMessage at 0x7b81d017dcc0>)

In [6]:
df = pd.read_csv("IMDb_Reviews.csv", encoding="latin1")
df.head()

Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


In [7]:
print("Total number of sample data: {}".format(len(df)))

Total number of sample data: 50000


In [8]:
train_df = df[:30000]
test_df = df[30000:]

train_df.to_csv("train_data.csv", index=False)
test_df.to_csv("test_data.csv", index=False)

In [9]:
# data preprocessing
TEXT = data.Field(
    sequential=True,
    use_vocab=True,
    tokenize=str.split,
    lower=True,
    batch_first=True,
    fix_length=100,
)

LABEL = data.Field(sequential=False, use_vocab=False, batch_first=False, is_target=True)

In [10]:
# build dataset
train_data, test_data = TabularDataset.splits(
    path=".",
    train="train_data.csv",
    test="test_data.csv",
    format="csv",
    fields=[("text", TEXT), ("label", LABEL)],
    skip_header=True,
)

In [11]:
print("number of training data : {}".format(len(train_data)))
print("number of test data : {}".format(len(test_data)))

number of training data : 30000
number of test data : 20000


In [12]:
print(train_data.fields.items())

dict_items([('text', <torchtext.data.field.Field object at 0x7b81d017ff40>), ('label', <torchtext.data.field.Field object at 0x7b81d017fcd0>)])


In [13]:
print(train_data[0].text)
print(train_data[0].label)

['my', 'family', 'and', 'i', 'normally', 'do', 'not', 'watch', 'local', 'movies', 'for', 'the', 'simple', 'reason', 'that', 'they', 'are', 'poorly', 'made,', 'they', 'lack', 'the', 'depth,', 'and', 'just', 'not', 'worth', 'our', 'time.<br', '/><br', '/>the', 'trailer', 'of', '"nasaan', 'ka', 'man"', 'caught', 'my', 'attention,', 'my', 'daughter', 'in', "law's", 'and', "daughter's", 'so', 'we', 'took', 'time', 'out', 'to', 'watch', 'it', 'this', 'afternoon.', 'the', 'movie', 'exceeded', 'our', 'expectations.', 'the', 'cinematography', 'was', 'very', 'good,', 'the', 'story', 'beautiful', 'and', 'the', 'acting', 'awesome.', 'jericho', 'rosales', 'was', 'really', 'very', 'good,', "so's", 'claudine', 'barretto.', 'the', 'fact', 'that', 'i', 'despised', 'diether', 'ocampo', 'proves', 'he', 'was', 'effective', 'at', 'his', 'role.', 'i', 'have', 'never', 'been', 'this', 'touched,', 'moved', 'and', 'affected', 'by', 'a', 'local', 'movie', 'before.', 'imagine', 'a', 'cynic', 'like', 'me', 'dabbi

In [14]:
print(vars(train_data[0]))

{'text': ['my', 'family', 'and', 'i', 'normally', 'do', 'not', 'watch', 'local', 'movies', 'for', 'the', 'simple', 'reason', 'that', 'they', 'are', 'poorly', 'made,', 'they', 'lack', 'the', 'depth,', 'and', 'just', 'not', 'worth', 'our', 'time.<br', '/><br', '/>the', 'trailer', 'of', '"nasaan', 'ka', 'man"', 'caught', 'my', 'attention,', 'my', 'daughter', 'in', "law's", 'and', "daughter's", 'so', 'we', 'took', 'time', 'out', 'to', 'watch', 'it', 'this', 'afternoon.', 'the', 'movie', 'exceeded', 'our', 'expectations.', 'the', 'cinematography', 'was', 'very', 'good,', 'the', 'story', 'beautiful', 'and', 'the', 'acting', 'awesome.', 'jericho', 'rosales', 'was', 'really', 'very', 'good,', "so's", 'claudine', 'barretto.', 'the', 'fact', 'that', 'i', 'despised', 'diether', 'ocampo', 'proves', 'he', 'was', 'effective', 'at', 'his', 'role.', 'i', 'have', 'never', 'been', 'this', 'touched,', 'moved', 'and', 'affected', 'by', 'a', 'local', 'movie', 'before.', 'imagine', 'a', 'cynic', 'like', 'me

In [15]:
# build vocabulary
TEXT.build_vocab(train_data, min_freq=10, max_size=10000)

In [16]:
print("size of the vocabulary : {}".format(len(TEXT.vocab)))  # includes <unk> and <pad>

size of the vocabulary : 10002
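To make the roles of `min_freq` and `max_size` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. The helper `build_vocab` is hypothetical and is not TorchText's implementation (tie-breaking between equally frequent tokens may differ); it only illustrates why the reported size is `max_size` plus the two special tokens.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=1, max_size=None, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids, keeping only sufficiently frequent tokens.

    Illustrative helper, not part of TorchText.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ranked if freq >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]  # the cap does not count the special tokens
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
stoi, itos = build_vocab(docs, min_freq=2, max_size=10)
print(itos)  # "bad" and "plot" occur once, so they are dropped
```

With `max_size=10000` and more than 10,000 tokens passing the frequency threshold, the same logic yields 10,000 regular tokens plus `<unk>` and `<pad>`, which matches the size of 10002 printed above.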


In [48]:
print(TEXT.vocab.stoi)

Output hidden; open in https://colab.research.google.com to view.

In [18]:
# build data loader and batching the data
batch_size = 64
train_loader = Iterator(dataset=train_data, batch_size=batch_size)
test_loader = Iterator(dataset=test_data, batch_size=batch_size)

In [19]:
print("number of mini-batches for training data : {}".format(len(train_loader)))
print("number of mini-batches for test data : {}".format(len(test_loader)))

number of mini-batches for training data : 469
number of mini-batches for test data : 313
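The mini-batch counts above follow from dividing the dataset size by the batch size and rounding up, since the last batch may be smaller. A short check, using a hypothetical helper `num_batches`:

```python
import math


def num_batches(num_examples, batch_size):
    """Number of mini-batches when the final, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(30000, 64)
test_batches = num_batches(20000, 64)

# Size of the final (partial) batch in each split.
last_train_batch = 30000 - (train_batches - 1) * 64
last_test_batch = 20000 - (test_batches - 1) * 64

print(train_batches, test_batches)        # 469 313
print(last_train_batch, last_test_batch)  # 48 32
```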


In [20]:
batch = next(iter(train_loader))
print(type(batch))

<class 'torchtext.data.batch.Batch'>


In [22]:
print(batch.text)

tensor([[ 265,   29,    2,  ...,  584,   37,   33],
        [   3,  773, 2391,  ...,   22,   49, 7061],
        [  80, 2224,   26,  ...,    2,  369,  151],
        ...,
        [  84,    5,  376,  ...,   17,    2, 1305],
        [   2,   20,   41,  ...,  936,   75, 9638],
        [ 338, 4634,    0,  ...,   33,  215,    0]])


In [23]:
print(batch.label)

tensor([0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])
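Each row of `batch.text` has exactly `fix_length=100` entries because reviews are truncated or padded before the tokens are mapped to ids. The sketch below mimics that step in plain Python. The helper `numericalize` and the toy vocabulary are hypothetical; it assumes padding and truncation happen at the end of the sequence and that `<unk>` and `<pad>` take ids 0 and 1, which we believe are the TorchText 0.6 defaults.

```python
def numericalize(tokens, stoi, fix_length, unk_token="<unk>", pad_token="<pad>"):
    """Truncate or pad a token list to fix_length, then map tokens to ids."""
    clipped = tokens[:fix_length]
    padded = clipped + [pad_token] * (fix_length - len(clipped))
    # Tokens missing from the vocabulary fall back to the <unk> id.
    return [stoi.get(tok, stoi[unk_token]) for tok in padded]


toy_stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "was": 4, "good": 5}

short = numericalize("the movie was good".split(), toy_stoi, fix_length=6)
long = numericalize(
    "the movie was surprisingly good indeed really".split(), toy_stoi, fix_length=6
)

print(short)  # [2, 3, 4, 5, 1, 1]  -> padded with <pad>
print(long)   # [2, 3, 4, 0, 5, 0]  -> truncated, unknown words become <unk>
```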


## Problem 2-1. Multilayer Perceptron (MLP) (30 pts)

In this question, we are going to implement a simple Multilayer Perceptron (MLP) model to classify the IMDB dataset. The MLP is the classical type of neural network and is composed of one or more layers of neurons. Data is fed to the input layer, one or more hidden layers provide levels of abstraction, and predictions are made at the output layer.

In [24]:
import torch
import torch.nn as nn
import torch.nn.functional as F

### Problem 2-1 (a) (10 pts)

Implement a two-layer fully-connected neural network. Your model should contain an embedding layer that maps each word to a dense vector representation. Use ReLU as the activation function.

In [43]:
class Linear(nn.Module):
    def __init__(self, input_size, hidden_size, embed_dim, num_classes):
        super(Linear, self).__init__()
        #################### YOUR CODE (3 lines) #####################
        self.embedding = nn.Embedding(len(TEXT.vocab), embed_dim)
        # the mean-pooled embedding has embed_dim features, so fc1 takes embed_dim inputs
        self.fc1 = nn.Linear(embed_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        ###############################################################

    def forward(self, text):
        #################### YOUR CODE (3-5 lines) #####################
        # hint: you can use F.relu and F.softmax
        embedded = self.embedding(text).mean(dim=1)
        tmp = F.relu(self.fc1(embedded))
        preds = self.fc2(tmp)
        ###############################################################
        return preds
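The `forward` pass above averages the word embeddings over the sequence dimension before the first linear layer. As a minimal sketch in plain Python (the helper `mean_pool` and the toy vectors are hypothetical), the pooled vector always has as many features as the embedding dimension, regardless of how many tokens the review contains:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length vectors element-wise."""
    seq_len = len(token_vectors)
    return [sum(column) / seq_len for column in zip(*token_vectors)]


# Two toy "reviews" with different lengths but the same embedding size (3).
short_review = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_review = [[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [3.0, 0.0, 2.0], [2.0, 4.0, 2.0]]

pooled_short = mean_pool(short_review)
pooled_long = mean_pool(long_review)

print(pooled_short)  # [2.0, 3.0, 4.0]
print(pooled_long)   # [2.0, 1.0, 2.0]
```

One consequence of this design is that word order is discarded: any permutation of the same tokens produces the same pooled vector.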

### Problem 2-1 (b) (10 pts)

Implement a function to check the accuracy of your model.

In [26]:
def model_accuracy(output, target):
    #################### YOUR CODE (2-3 lines) #####################
    pred_output = torch.argmax(output, dim=1)
    accuracy = (pred_output == target).float().mean()
    ###############################################################
    return accuracy
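For intuition, the same computation can be written without tensors: take the index of the largest logit in each row as the predicted class, then report the fraction of predictions that match the targets. The function `accuracy` and the toy values below are illustrative only.

```python
def accuracy(logits, targets):
    """Fraction of rows whose arg-max class equals the target label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)


toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9]]
toy_targets = [0, 1, 1, 1]

toy_acc = accuracy(toy_logits, toy_targets)
print(toy_acc)  # 0.75: the third example is predicted as class 0 but is class 1
```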

### Problem 2-1 (c) (5 pts)

Complete the code below to train the model.

In [42]:
def train(model, iterator, optimizer, criterion):

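    loss_ep = 0
    acc_ep = 0

    # evaluate() switches the model to eval mode; restore training mode so dropout is active
    model.train()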

    for batch in iterator:
        optimizer.zero_grad()

        #################### YOUR CODE (1-2 lines) #####################
        output = model(batch.text)  # [batch_size, num_classes]
        ###############################################################
        loss = criterion(output, batch.label)
        acc = model_accuracy(output, batch.label)

        loss.backward()
        optimizer.step()
        loss_ep += loss.item()
        acc_ep += acc.item()
    return loss_ep / len(iterator), acc_ep / len(iterator)

### Problem 2-1 (d) (5 pts)

Complete the code below to evaluate the model.

In [40]:
def evaluate(model, iterator, criterion):

    loss_ep = 0
    acc_ep = 0

    model.eval()
    with torch.no_grad():
        for batch in iterator:
            #################### YOUR CODE (1-2 lines) #####################
            output = model(batch.text)  # [batch_size, num_classes]
            ###############################################################
            loss = criterion(output, batch.label)
            acc = model_accuracy(output, batch.label)

            loss_ep += loss.item()
            acc_ep += acc.item()

    return loss_ep / len(iterator), acc_ep / len(iterator)

Now let's see the model performance.

In [41]:
def train_the_model(epochs, model, train_loader, valid_loader, optimizer, criterion):

    for epoch in range(epochs):

        train_loss, train_acc = train(model, train_loader, optimizer, criterion)
        valid_loss, valid_acc = evaluate(model, valid_loader, criterion)

        print(f"\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%")
        print(f"\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc * 100:.2f}%")

    torch.save(model.state_dict(), "saved_weights_linear.pt")

In [35]:
# hyperparameter
num_epochs = 10
lr = 1e-4
max_token_length = 100
hidden_size = 100
embed_dim = 100
seed = 1
num_classes = 2

In [44]:
loss_function = nn.CrossEntropyLoss()
linear_model = Linear(max_token_length, hidden_size, embed_dim, num_classes)
optimizer = torch.optim.Adam(linear_model.parameters(), lr=lr)

train_the_model(
    num_epochs, linear_model, train_loader, test_loader, optimizer, loss_function
)

	Train Loss: 0.690 | Train Acc: 54.08%
	 Val. Loss: 0.683 |  Val. Acc: 60.15%
	Train Loss: 0.672 | Train Acc: 61.08%
	 Val. Loss: 0.655 |  Val. Acc: 64.17%
	Train Loss: 0.638 | Train Acc: 65.34%
	 Val. Loss: 0.614 |  Val. Acc: 68.20%
	Train Loss: 0.597 | Train Acc: 69.13%
	 Val. Loss: 0.576 |  Val. Acc: 70.80%
	Train Loss: 0.561 | Train Acc: 71.49%
	 Val. Loss: 0.548 |  Val. Acc: 72.46%
	Train Loss: 0.533 | Train Acc: 73.43%
	 Val. Loss: 0.527 |  Val. Acc: 73.75%
	Train Loss: 0.511 | Train Acc: 74.94%
	 Val. Loss: 0.511 |  Val. Acc: 74.78%
	Train Loss: 0.492 | Train Acc: 76.24%
	 Val. Loss: 0.498 |  Val. Acc: 75.57%
	Train Loss: 0.476 | Train Acc: 77.28%
	 Val. Loss: 0.488 |  Val. Acc: 76.13%
	Train Loss: 0.462 | Train Acc: 78.22%
	 Val. Loss: 0.478 |  Val. Acc: 76.65%
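As a sanity check on the numbers above, the first-epoch training loss of 0.690 is close to ln 2 ≈ 0.693, which is the cross-entropy of a two-class model that assigns equal probability to both classes. The sketch below computes this by hand; `cross_entropy` is a hypothetical single-example helper, not the `nn.CrossEntropyLoss` implementation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (log-softmax, then negate)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]


chance_loss = cross_entropy([0.0, 0.0], target=1)     # equal probability for both classes
confident_loss = cross_entropy([0.0, 3.0], target=1)  # confident and correct

print(round(chance_loss, 3))     # 0.693
print(round(confident_loss, 3))  # 0.049
```

A loss that starts near 0.693 and decreases steadily, as in the log above, suggests the model begins at chance level and is learning.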


## Problem 2-2. Convolutional Neural Network (CNN) (20 pts)

Next, we will perform sentiment analysis on the same dataset with a Convolutional Neural Network (CNN). In a CNN, the text is organised into a matrix in which each row is a word embedding. The convolutional layer slides filters over this matrix, much as it would over an image, to extract local features (patterns spanning a few consecutive words) that are then used to predict the label.
You can refer to [here](https://emnlp2014.org/papers/pdf/EMNLP2014181.pdf) for the use of CNN on text classification.


### Problem 2-2 (a) (15 pts)

Complete the code below for the CNN. Again, your model will require an embedding layer to represent each word as a dense vector. After getting the embeddings, we will feed the tensors through the convolutional layers and apply the ReLU activation function to their outputs. Then the tensors will be passed through pooling layers. Lastly, apply dropout to the concatenated filter outputs and pass them through a linear layer to *generate* our predictions.
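Before writing the model, it can help to work out the tensor sizes along this pipeline. The sketch below is plain arithmetic, assuming a sequence length of 100 (the `fix_length` set during preprocessing) and the filter settings used later in this notebook (100 filters each of sizes 3, 5, and 7); `conv_output_length` is a hypothetical helper.

```python
def conv_output_length(sen_len, filter_size):
    """Positions a filter can occupy along the sequence (stride 1, no padding)."""
    return sen_len - filter_size + 1


sen_len, n_filters, filter_sizes = 100, 100, [3, 5, 7]

# One feature map of this length per filter, for each filter size.
conv_lengths = [conv_output_length(sen_len, fs) for fs in filter_sizes]

# Max-pooling over the sequence keeps one value per filter, so after
# concatenation the linear layer sees n_filters values per filter size.
fc_in_features = n_filters * len(filter_sizes)

print(conv_lengths)    # [98, 96, 94]
print(fc_in_features)  # 300
```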

In [45]:
class CNN(nn.Module):
    def __init__(
        self, input_dim, embed_dim, n_filters, filter_sizes, num_classes, dropout_rate
    ):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embed_dim)
        # convolutions with different size of filters
        self.convs = nn.ModuleList(
            [
                nn.Conv2d(
                    in_channels=1,
                    out_channels=n_filters,
                    kernel_size=(filter_size, embed_dim),
                )
                for filter_size in filter_sizes
            ]
        )

        self.fc = nn.Linear(len(filter_sizes) * n_filters, num_classes)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, text):
        # text = [batch_size, sen_len]

        embedded = self.embedding(text)
        # embedded = [batch_size, sen_len, embed_dim]

        embedded = embedded.unsqueeze(1)
        # embedded = [batch_size, 1, sen_len, embed_dim]

        #################### YOUR CODE (4 lines) #####################
        # hint: you can use F.relu, F.max_pool1d, and torch.cat

        # Feed the embeddings through each convolutional layer and apply ReLU.
        # output size of the n-th convolutional layer = [batch_size, n_filters, sen_len - filter_sizes[n] + 1]
        conv_out = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]

        # Max-pool each feature map over the sequence dimension, then concatenate.
        # output size of the n-th pooling layer = [batch_size, n_filters]
        pool_out = [F.max_pool1d(out, out.shape[2]).squeeze(2) for out in conv_out]
        pool_out = torch.cat(pool_out, dim=1)

        # Apply dropout to the concatenated filter outputs, then the linear layer.
        # output size of the dropout = [batch_size, n_filters * len(filter_sizes)]
        drop = self.dropout(pool_out)
        output = self.fc(drop)
        ###############################################################
        return output
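To see what a single filter in the model above computes, here is a toy version in plain Python: one filter slides over the embedded sentence, ReLU is applied at each position, and max-pooling keeps the strongest response. The function `conv_relu_maxpool`, the 2-dimensional embeddings, and the kernel values are all made up for illustration; the real model does this for every filter in parallel with `nn.Conv2d`.

```python
def conv_relu_maxpool(embedded, kernel, bias=0.0):
    """Apply one text-CNN filter, then ReLU, then max-over-time pooling.

    embedded: list of sen_len word vectors
    kernel:   list of filter_size weight vectors (same width as the embeddings)
    """
    filter_size = len(kernel)
    feature_map = []
    for start in range(len(embedded) - filter_size + 1):
        window = embedded[start:start + filter_size]
        activation = bias + sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        feature_map.append(max(0.0, activation))  # ReLU
    return feature_map, max(feature_map)


embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # sen_len=4, embed_dim=2
kernel = [[1.0, 0.0], [0.0, 1.0]]                             # filter_size=2

feature_map, pooled = conv_relu_maxpool(embedded, kernel)
print(feature_map)  # [2.0, 1.0, 1.0]  -> sen_len - filter_size + 1 = 3 positions
print(pooled)       # 2.0
```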

Now let's see the model performance.

We will use the same functions that we used for the MLP model for training and evaluation.
It will take about 30 minutes to run 5 epochs.

In [46]:
# hyperparameter
num_epochs = 5
lr = 1e-4
input_dim = len(TEXT.vocab)
embed_dim = 300
num_filters = 100
filter_sizes = [3, 5, 7]
dropout_rate = 0.25
num_classes = 2

In [47]:
cnn_model = CNN(
    input_dim, embed_dim, num_filters, filter_sizes, num_classes, dropout_rate
)


optimizer = torch.optim.Adam(cnn_model.parameters(), lr=lr)

train_the_model(
    num_epochs, cnn_model, train_loader, test_loader, optimizer, loss_function
)

	Train Loss: 0.671 | Train Acc: 59.68%
	 Val. Loss: 0.563 |  Val. Acc: 71.97%
	Train Loss: 0.504 | Train Acc: 77.34%
	 Val. Loss: 0.513 |  Val. Acc: 74.90%
	Train Loss: 0.420 | Train Acc: 83.32%
	 Val. Loss: 0.487 |  Val. Acc: 76.46%
	Train Loss: 0.340 | Train Acc: 88.64%
	 Val. Loss: 0.463 |  Val. Acc: 77.78%
	Train Loss: 0.266 | Train Acc: 92.60%
	 Val. Loss: 0.445 |  Val. Acc: 78.89%


### Problem 2-2 (b) (5 pts)

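Compare the linear model (Problem 2-1) and the CNN (Problem 2-2), and discuss their advantages and limitations.

The linear model (Problem 2-1) is small and fast to train: it has far fewer parameters and each epoch finishes quickly. Its limitation is accuracy: averaging the word embeddings discards word order, and after 10 epochs it reaches only 76.65% validation accuracy. The CNN (Problem 2-2) captures local word patterns through filters of width 3, 5, and 7 and reaches a higher validation accuracy in fewer epochs (78.89% after 5 epochs). Its limitations are cost and overfitting. It trains significantly more parameters than the linear model, so training takes much longer (about 30 minutes for 5 epochs), and by epoch 5 its training accuracy (92.60%) is well above its validation accuracy (78.89%), which is a sign of overfitting. It also introduces more hyperparameters to tune, such as the number of filters, the filter sizes, and the dropout rate.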
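The difference in model size can be estimated by hand from the hyperparameters used in this notebook. The sketch below counts trainable parameters with hypothetical helper functions, assuming the vocabulary size of 10002 printed earlier and that every linear and convolutional layer has a bias term (the PyTorch default).

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim


def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + biases


def conv2d_params(in_channels, out_channels, kernel_h, kernel_w):
    return out_channels * in_channels * kernel_h * kernel_w + out_channels


vocab_size = 10002

# MLP: embedding (dim 100) -> linear (100 -> 100) -> linear (100 -> 2)
mlp_total = (
    embedding_params(vocab_size, 100)
    + linear_params(100, 100)
    + linear_params(100, 2)
)

# CNN: embedding (dim 300) -> 100 filters each of sizes 3, 5, 7 -> linear (300 -> 2)
cnn_total = (
    embedding_params(vocab_size, 300)
    + sum(conv2d_params(1, 100, fs, 300) for fs in [3, 5, 7])
    + linear_params(3 * 100, 2)
)

print(f"MLP parameters: {mlp_total:,}")  # 1,010,502
print(f"CNN parameters: {cnn_total:,}")  # 3,451,502
```

Under these assumptions the CNN has roughly 3.4 times as many parameters as the MLP, most of them in the larger embedding table.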