# Sentiment Analysis

---
### Load in and visualize the data

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv("mini_sentiment_dataset.txt",
                 sep="\t", names=["label", "text"])
print(df.head())

# --- Split into two datasets ------------------------------------
reviews = df[["text"]].copy()    # DataFrame with a single 'text' column
labels  = df[["label"]].copy()   # DataFrame with a single 'label' column

# Optional: inspect them
print("Reviews:")
print(reviews.head(), "\n")

print("Labels:")
print(labels.head())

   label                                              text
0      1   Absolutely loved this product, would buy again!
1      0               Waste of money, completely useless.
2      1              This phone exceeded my expectations.
3      0  The battery died after two hours, disappointing.
4      1                 Great service and friendly staff.
Reviews:
                                               text
0   Absolutely loved this product, would buy again!
1               Waste of money, completely useless.
2              This phone exceeded my expectations.
3  The battery died after two hours, disappointing.
4                 Great service and friendly staff. 

Labels:
   label
0      1
1      0
2      1
3      0
4      1


In [7]:
type(reviews)

pandas.core.frame.DataFrame

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
* Lower-case every review so that `Great` and `great` map to the same token.
* Get rid of periods and extraneous punctuation by replacing every character that is not a letter or whitespace with a space.
* Split each review on whitespace into individual words. Each review already sits in its own DataFrame row, so there is no need to split the text on newline characters `\n` first.

First, let's remove all punctuation. Then split the text into individual words and collect the unique ones into a vocabulary.
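To make the cleaning rule concrete, here is a minimal sketch that applies the same regular expression to a single string with the standard library instead of pandas. The helper name `tokenize` is an assumption introduced for illustration; it is not part of the notebook.

```python
import re

def tokenize(text):
    """Lower-case, replace anything that is not a-z or whitespace with a space, then split."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()

sample = "Waste of money, completely useless."
tokens = tokenize(sample)
print(tokens)  # ['waste', 'of', 'money', 'completely', 'useless']

# Side effect of this rule: digits are dropped and apostrophes split words.
print(tokenize("It's 5-star!"))  # ['it', 's', 'star']
```

The second call shows a trade-off of this simple rule: contractions are broken apart and numbers disappear, which may or may not matter for a given dataset.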

In [None]:
import pandas as pd
import re

# ── 1.  Lower‑case everything and strip out non‑letters ───────────
clean_words = (
    reviews['text']                         # the column with full sentences
      .str.lower()                          # lower‑case
      .str.replace(r'[^a-z\s]', ' ', regex=True)  # keep only a–z and spaces
      .str.split()                          # split on whitespace → lists of words
      .explode()                            # one word per row
)

# ── 2.  Drop any empty tokens, get uniques, and build the vocab DF ──
words = (
    clean_words[clean_words != '']          # remove blanks
      .drop_duplicates()
      .sort_values()
      .reset_index(drop=True)
      .to_frame(name='word')                # single‑column DataFrame
)

print(words.head())


         word
0           a
1  absolutely
2      acting
3       after
4       again


### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_ints`. 

In [16]:
# feel free to use this import 
from collections import Counter

## Build a dictionary that maps words to integers
word2idx = {word: idx + 1 for idx, word in enumerate(words['word'])}
word2idx

## Use the dict to tokenize each review in reviews['text']
## and store the tokenized reviews in reviews_ints.
# The cleaning steps must match the ones used to build the vocabulary;
# reviews_ints starts out as a pandas Series of integer lists.
reviews_ints = (
    reviews['text']
        .str.lower()
        .str.replace(r'[^a-z\s]', ' ', regex=True)
        .str.split()
        .apply(lambda toks: [word2idx[w] for w in toks if w in word2idx])
)

# Optional: convert to plain Python list for downstream code
reviews_ints = reviews_ints.tolist()
reviews_ints

[[2, 54, 80, 66, 93, 18, 5],
 [88, 60, 56, 20, 84],
 [80, 62, 30, 58, 31],
 [79, 15, 26, 4, 81, 44, 27],
 [41, 73, 10, 40, 76],
 [71, 76, 10, 78, 73],
 [46, 7, 85, 63, 91, 79, 67],
 [33, 28, 70, 51],
 [9, 8, 32, 61],
 [59, 92, 79, 45, 14, 6],
 [35, 74, 10, 52, 13, 25],
 [52, 12, 17, 10, 53],
 [79, 57, 87, 34, 43, 69],
 [16, 64, 10, 65, 3],
 [79, 11, 72, 82, 49, 48, 10, 75],
 [22, 21, 83],
 [24, 55, 90, 86, 5],
 [38, 87, 19, 10, 37],
 [42, 23, 77, 36, 58, 50, 68],
 [77, 47, 58, 29, 39, 1, 89]]
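A quick way to sanity-check an encoding like the one above is to invert the dictionary and decode a review back into words. The sketch below uses a toy five-word vocabulary (an assumption, not the notebook's full vocabulary) and hypothetical names `toy_word2idx` and `toy_idx2word`.

```python
# Toy vocabulary, sorted alphabetically like the notebook's `words` frame.
vocab = sorted({"great", "service", "and", "friendly", "staff"})

# Indices start at 1 so that 0 stays free for padding.
toy_word2idx = {word: idx + 1 for idx, word in enumerate(vocab)}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

review = ["great", "service", "and", "friendly", "staff", "today"]

# Out-of-vocabulary words ("today") are silently skipped, as in the cell above.
encoded = [toy_word2idx[w] for w in review if w in toy_word2idx]
decoded = [toy_idx2word[i] for i in encoded]

print(encoded)  # [3, 4, 1, 2, 5]
print(decoded)  # ['great', 'service', 'and', 'friendly', 'staff']
```

Dropping unknown words keeps the code short, but it also means a review made entirely of unseen words would encode to an empty list.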

### Encoding the labels

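Our labels are "positive" or "negative". To use these labels in our network, they need to be the integers 1 and 0. In this dataset the `label` column already stores 1 for positive and 0 for negative, so we only need to cast it to `int` and convert it to a list.

> **Exercise:** Make sure the labels are encoded as 1 (`positive`) and 0 (`negative`), and place them in a new list, `encoded_labels`.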

In [19]:
encoded_labels = labels['label'].astype(int).tolist()   # <- plain Python list
# quick check
print(encoded_labels[:5])        # [1, 0, 1, 0, 1]  etc.

[1, 0, 1, 0, 1]
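For a dataset whose labels arrive as the strings `positive` and `negative`, the conversion could be sketched as follows. The names `label_map` and `raw_labels`, and the sample values, are assumptions for illustration only.

```python
# Hypothetical string labels; the notebook's file already stores 1/0.
raw_labels = ["positive", "negative", "positive", "negative"]

# Explicit mapping makes the encoding easy to audit.
label_map = {"negative": 0, "positive": 1}

# A KeyError here would flag an unexpected label value early.
encoded_from_strings = [label_map[label] for label in raw_labels]

print(encoded_from_strings)  # [1, 0, 1, 0]
```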


### Removing Outliers

In [21]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
# we are good

Zero-length reviews: 0
Maximum review length: 8
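No review here is empty, so nothing has to be removed. If zero-length reviews did appear, they would need to be dropped together with their labels so the two lists stay aligned. A minimal sketch, using made-up data and hypothetical names (`demo_reviews`, `demo_labels`):

```python
# Made-up encoded reviews; two of them are empty.
demo_reviews = [[2, 54], [], [88, 60, 56], []]
demo_labels = [1, 0, 0, 1]

# Collect the positions of the non-empty reviews once...
keep = [i for i, review in enumerate(demo_reviews) if len(review) > 0]

# ...and index both lists with them so reviews and labels stay paired.
filtered_reviews = [demo_reviews[i] for i in keep]
filtered_labels = [demo_labels[i] for i in keep]

print(filtered_reviews)  # [[2, 54], [88, 60, 56]]
print(filtered_labels)   # [1, 0]
```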


---
## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. The code below uses `seq_length = 200`, a value suited to much longer reviews; the longest review in this mini dataset has only 8 tokens, so a far smaller value would work just as well.

In [None]:
import numpy as np

def pad_features(reviews_ints, seq_length=200):  
    n_reviews = len(reviews_ints)
    
    # Start with a zero‑matrix; dtype=int for PyTorch or TensorFlow later.
    features = np.zeros((n_reviews, seq_length), dtype=np.int32)
    
    for i, review in enumerate(reviews_ints):
        # Use only the first `seq_length` tokens if review is too long
        review_slice = review[:seq_length]
        
        # Place tokens at the END of the row (left‑pad with 0s).
        # Skip empty reviews: a slice of `-0:` would select the whole row.
        if review_slice:
            features[i, -len(review_slice):] = review_slice

    return features

# ------------------------------------------------------------------
# Example usage
# ------------------------------------------------------------------
seq_length = 200
features = pad_features(reviews_ints, seq_length)

print("Feature matrix shape :", features.shape)   # (num_reviews, 200)
print("First row (non‑zero tail):\n", features[0][-10:])  # peek last 10 positions


Feature matrix shape : (20, 200)
First row (non‑zero tail):
 [ 0  0  0  2 54 80 66 93 18  5]
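With 200 columns the padding is hard to see, so here is the same left-pad and truncate rule on a tiny scale. This is a sketch with plain Python lists rather than NumPy; `pad_left` is a hypothetical helper that mirrors the behavior of `pad_features` for one review.

```python
def pad_left(seq, seq_length):
    """Keep the first `seq_length` tokens, then left-pad with zeros."""
    seq = seq[:seq_length]
    return [0] * (seq_length - len(seq)) + seq

print(pad_left([88, 60, 56], 5))                  # [0, 0, 88, 60, 56]
print(pad_left([79, 15, 26, 4, 81, 44, 27], 5))   # [79, 15, 26, 4, 81]
print(pad_left([], 5))                            # [0, 0, 0, 0, 0]
```

Left-padding puts the real tokens at the end of the row, which is where the model later reads the final LSTM output.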


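## Training, Validation, Test

With the data padded, we split it into training, validation, and test sets: the first 80% of the rows go to training, and the remaining 20% are divided evenly between validation and test.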

## DataLoaders and Batching

In [33]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# ------------------------------------------------------------------
# 0) Make sure labels are a NumPy array with shape (N,)
# ------------------------------------------------------------------
labels_np = np.asarray(encoded_labels, dtype=np.int64)    # <- 1‑D

# `features` is already the padded NumPy array from pad_features()
# features.shape  ->  (N, seq_length)
features_np = features.astype(np.int64)                   # keep dtype ints

# ------------------------------------------------------------------
# 1) Train / validation / test split
# ------------------------------------------------------------------
split_frac = 0.8

split_idx      = int(len(features_np) * split_frac)
train_x        = features_np[:split_idx]
train_y        = labels_np[:split_idx]

remaining_x    = features_np[split_idx:]
remaining_y    = labels_np[split_idx:]

test_idx       = int(len(remaining_x) * 0.5)
val_x, test_x  = remaining_x[:test_idx],  remaining_x[test_idx:]
val_y, test_y  = remaining_y[:test_idx],  remaining_y[test_idx:]

print("\n\t\tFeature Shapes")
print("Train :", train_x.shape,
      "\nValid :", val_x.shape,
      "\nTest  :", test_x.shape)

# ------------------------------------------------------------------
# 2) Wrap splits in TensorDataset objects
# ------------------------------------------------------------------
train_data = TensorDataset(torch.from_numpy(train_x),
                           torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x),
                           torch.from_numpy(val_y))
test_data  = TensorDataset(torch.from_numpy(test_x),
                           torch.from_numpy(test_y))

# ------------------------------------------------------------------
# 3) Build DataLoaders (shuffle the training set)
# ------------------------------------------------------------------
batch_size  = 1

train_loader = DataLoader(train_data, shuffle=True,  batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=batch_size)
test_loader  = DataLoader(test_data,  shuffle=False, batch_size=batch_size)


		Feature Shapes
Train : (16, 200) 
Valid : (2, 200) 
Test  : (2, 200)
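The split above takes rows in file order, so any ordering in the file carries over into the three sets. One possible alternative is to shuffle the row indices first and then cut them with the same fractions. The sketch below is an assumption about how that could look, not what the notebook does; `shuffled_split` and its `seed` parameter are hypothetical.

```python
import random

def shuffled_split(n_samples, train_frac=0.8, seed=0):
    """Return shuffled index lists for train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility

    n_train = int(n_samples * train_frac)
    n_val = (n_samples - n_train) // 2     # half of the remainder

    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = shuffled_split(20)
print(len(train_idx), len(val_idx), len(test_idx))  # 16 2 2
```

The index lists can then be used to select rows from both the feature matrix and the label array, which keeps each review paired with its label.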


In [34]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([1, 200])
Sample input: 
 tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 88, 60, 56,
         20, 84]])

Sample label size:  torch.Size([

In [35]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

No GPU available, training on CPU.


In [36]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        lstm_out = lstm_out[:, -1, :] # getting the last time step output
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

## Instantiate the network

In [37]:
# Instantiate the model w/ hyperparams
vocab_size = len(word2idx)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(94, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)



## Training

In [38]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
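For intuition about what the binary cross-entropy criterion measures, here is the per-sample formula written out with the standard library. This is a sketch for a single prediction; `bce` is a hypothetical helper, and `nn.BCELoss` with its default settings averages this quantity over the batch.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

# Confident and correct: small loss.
print(round(bce(0.9, 1), 4))  # 0.1054

# Confident and wrong: large loss.
print(round(bce(0.9, 0), 4))  # 2.3026
```

The loss grows quickly as the model becomes confidently wrong, which is why a handful of bad predictions can dominate the average on a very small test set.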


In [40]:
# training params

epochs = 4

counter = 0
print_every = 100  # with 16 training batches x 4 epochs = 64 steps, this is never reached; lower it to see progress
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop (note: `labels` here shadows the `labels` DataFrame defined earlier)
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(1), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(1), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
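The gradient clipping step in the loop rescales all gradients together when their combined norm exceeds `clip`. The idea can be sketched on a flat list of numbers as below; `clip_by_global_norm` is a hypothetical helper, and it leaves out the numerical-stability details of the library routine.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale `grads` so their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_by_global_norm([6.0, 8.0], max_norm=5)
print(norm)     # 10.0
print(clipped)  # [3.0, 4.0]

# Gradients already inside the limit are returned unchanged.
print(clip_by_global_norm([0.3, 0.4], max_norm=5)[0])  # [0.3, 0.4]
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.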

---
## Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on the `test_data` defined above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.

In [43]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(1), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze(1))
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 3.557
Test accuracy: 0.000
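For the second item in the list above, inference on a single user-written review, the text first has to go through the same cleaning, encoding, and padding as the training data. A minimal sketch of that input-preparation step follows; `prepare_review` and the three-word `toy_word2idx` are assumptions for illustration, and the call to the trained network is deliberately left out.

```python
import re

def prepare_review(text, word2idx, seq_length):
    """Clean, encode, and left-pad one review into a fixed-length list of ints."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    # Unknown words are skipped; long reviews are truncated.
    ints = [word2idx[w] for w in tokens if w in word2idx][:seq_length]
    return [0] * (seq_length - len(ints)) + ints

# Toy vocabulary standing in for the notebook's word2idx.
toy_word2idx = {"great": 1, "service": 2, "useless": 3}

row = prepare_review("Great service, truly great!", toy_word2idx, seq_length=6)
print(row)  # [0, 0, 0, 1, 2, 1]
```

The resulting row has the same layout as one row of `features`, so it could be wrapped in a batch of size 1 and passed to the model in evaluation mode.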
