# Sentiment Network with PyTorch

Below is where you'll define the network.

<img src="assets/network_diagram.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden state size and a number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired `output_size`.
4. A sigmoid activation layer that turns each output into a value between 0 and 1; return **only the last sigmoid output** as the output of this network.

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because our vocabulary is large: the tokenizer built below produces more than 20,000 distinct word indices. One-hot encoding that many classes is massively inefficient. Instead, we use an embedding layer as a lookup table that maps each word index to a dense vector. You could train an embedding using Word2Vec and load it here, or load pretrained GloVe vectors, as this notebook does further down. It is also fine to create a new layer, use it only for dimensionality reduction, and let the network learn the weights.
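
To make the lookup-table idea concrete, here is a minimal plain-Python sketch (no PyTorch). The table values and the helper names `one_hot` and `matvec_onehot` are made up for illustration; the point is only that multiplying a one-hot vector by the table gives the same row as indexing the table directly.

```python
# Toy embedding table: 5 "words", 3-dimensional embeddings (values are arbitrary).
embedding_table = [
    [0.0, 0.0, 0.0],   # index 0, typically reserved for padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Return a one-hot vector with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec_onehot(vec, table):
    """Multiply a one-hot row vector by the table (the inefficient route)."""
    dim = len(table[0])
    return [sum(vec[r] * table[r][c] for r in range(len(table))) for c in range(dim)]

token = 3
via_one_hot = matvec_onehot(one_hot(token, len(embedding_table)), embedding_table)
via_lookup = embedding_table[token]   # what an embedding layer effectively does

print(via_one_hot == via_lookup)
```

The one-hot route touches every row of the table for every token, while the lookup touches exactly one row, which is why the lookup scales to large vocabularies.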


### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, your network will perform better with more layers, typically 2 or 3. Adding more layers allows the network to learn more complex relationships.

> **Exercise:** Complete the `__init__`, `forward`, and `init_hidden` functions for the `SentimentRNN` model class.

Note: `init_hidden` should initialize the hidden and cell states of an LSTM layer to all zeros and move those states to the GPU, if available.


In [1]:
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, fbeta_score
from transformers import BertTokenizer, BertModel
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from IPython.display import Image
%matplotlib inline

In [2]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

No GPU available, training on CPU.


In [3]:
df = pd.read_csv("/kaggle/input/formspring-csv/formspring.csv")

In [4]:
df.drop(['post', 'asker', 'bully1', 'bully2', 'bully3'], axis = 1, inplace = True)

In [5]:
def impute_ans_columns(value):
    v = ['No','nan']
    if value in v:
        return 0
    return 1

In [6]:
for col in ['ans1', 'ans2', 'ans3']:
    df[col] = df[col].apply(impute_ans_columns)
df.sample(10)

Unnamed: 0,userid,ques,ans,ans1,severity1,ans2,severity2,ans3,severity3
6571,zooshay,or the annoying back driver?,nope not me hate the back i always say i get ...,0,0,0,0,0,0.0
0,aguitarplayer94,what&#039;s your favorite song? :D<br>,I like too many songs to have a favorite,0,0,0,0,0,0.0
4244,freshswagg21,fagg,hahahahaha you the fagg buddy lol:),1,5,1,4,1,
10957,outlaw9000,Ahh i see. tut tut :P bit naughty eh :P,that's why I am an Outlaw ! ! LOL,0,0,0,0,0,0.0
9799,kellyblake1,What is your saddest memory?,Loosing the people that I love... I will neve...,0,0,0,0,0,0.0
2226,teaachgee,What&#039;s one thing you hate the feeling of?,being lied to,0,0,0,0,0,0.0
12618,outlaw9000,what&#039;s your biggest pet peeve?,sloopy people,0,0,0,0,0,0.0
7135,zooshay,when do u feel most yourself? online? in persom?,Both,0,0,0,0,0,0.0
2680,dearalexiis,Mall or Park?,Mall i guesss,0,0,0,0,0,0.0
6377,zooshay,lol u have a stalker O_O,i doo i actually have many :) can u take sum ...,0,0,1,3,0,0.0


In [7]:
def impute_severity_columns(value):
    '''Convert a severity value to an int.'''
    # Missing-value markers map to a severity of 0
    v = ['nan', 'None', '0']
    if value in v:
        return 0
    try:
        return int(value)
    except ValueError:
        # Anything int() rejects with a ValueError falls back to severity 5
        return 5

In [8]:
for col in ['severity1', 'severity2', 'severity3']:
    df[col] = df[col].apply(impute_severity_columns)

In [9]:
df['IsBully'] = (
    (df.ans1 * df.severity1 + df.ans2 * df.severity2 + df.ans3 * df.severity3) / 30) >= 0.0333

# Remove unnecessary columns
df_2 = df.drop(['userid','ans1', 'severity1','ans2','severity2','ans3','severity3'], axis = 1)
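
As a worked example of the labelling rule above, the sketch below recomputes the score for a few hand-made rows. The helper names `bully_score` and `is_bully` are hypothetical, and the reading of 30 as "three annotators times a maximum severity of 10" is an assumption, not something stated in the notebook.

```python
def bully_score(answers, severities):
    """Sum of answer * severity over the three annotators, divided by 30."""
    return sum(a * s for a, s in zip(answers, severities)) / 30

def is_bully(answers, severities, threshold=0.0333):
    """Apply the same >= threshold rule as the IsBully column."""
    return bully_score(answers, severities) >= threshold

print(is_bully([0, 0, 0], [0, 0, 0]))      # no annotator flagged the post
print(is_bully([1, 0, 0], [1, 0, 0]))      # one annotator, severity 1
print(is_bully([0, 0, 1], [5, 5, 0]))      # severities without a matching "yes" do not count
```

Because 1/30 is just above 0.0333, a single annotator answering "yes" with severity 1 is already enough to set `IsBully` to `True` under this rule.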

In [10]:
df_2.sample(10)

Unnamed: 0,ques,ans,IsBully
9934,Who&#039;s the most underrated actor?,Without doubt Lily Loveless,False
1676,If you saw a serious crime take place and if y...,YES :),False
7637,but never dtf,Neverz,False
6008,I LiKe U G stRiNg,cool,False
4973,0,b****,True
2243,what&#039;s the last thing you said to your si...,I love you...? haha,False
7963,Oh never mind good luck child.,Who needs luck?,False
6784,waht are some stores you go to?,supre myer dereon ahh i cant really think of...,False
22,Do you believe in life after death?,Yes of course i do... Its called Eternity wit...,False
2645,Hot or Cold?,Dependss .,False


In [11]:
for col in ['ques', 'ans']:
    df_2[col] = df_2[col].str.replace("&#039;", "'") # Put back the apostrophe

    df_2[col] = df_2[col].str.replace("<br>", "") 
    df_2[col] = df_2[col].str.replace("&quot;", "") 
    #df_2[col] = df_2[col].str.replace("<3", "love")

In [12]:
df_2 = df_2.dropna(how='all')

In [13]:
df_2.head()

Unnamed: 0,ques,ans,IsBully
0,what's your favorite song? :D,I like too many songs to have a favorite,False
1,<3,</3 ? haha jk! <33,False
2,hey angel you duh sexy,Really?!?! Thanks?! haha,False
3,(:,;(,False
4,******************MEOWWW*************************,*RAWR*?,False


In [14]:
df_2['ques_ans'] = df_2['ques'] + ' ' + df_2['ans'] 

In [15]:
df_2.head()

Unnamed: 0,ques,ans,IsBully,ques_ans
0,what's your favorite song? :D,I like too many songs to have a favorite,False,what's your favorite song? :D I like too many...
1,<3,</3 ? haha jk! <33,False,<3 </3 ? haha jk! <33
2,hey angel you duh sexy,Really?!?! Thanks?! haha,False,hey angel you duh sexy Really?!?! Thanks?! haha
3,(:,;(,False,(: ;(
4,******************MEOWWW*************************,*RAWR*?,False,******************MEOWWW**********************...


In [16]:
columns = ['ques_ans','IsBully']
# Take an explicit copy so the assignments below do not write to a slice of df_2
df2_ordered = df_2[columns].copy()
df2_ordered['ques_ans'] = df2_ordered['ques_ans'].str.lower()
# Remove punctuation using regex (escaped so that characters such as ] and \ are treated literally)
df2_ordered['ques_ans'] = df2_ordered['ques_ans'].str.replace(f'[{re.escape(string.punctuation)}]', '', regex=True)


In [17]:
df2_ordered = df2_ordered[df2_ordered['ques_ans'].notna()]  # Remove NaN values
df2_ordered = df2_ordered[df2_ordered['ques_ans'].str.strip() != '']  # Remove empty strings
df2_ordered.head()

Unnamed: 0,ques_ans,IsBully
0,whats your favorite song d i like too many so...,False
1,3 3 haha jk 33,False
2,hey angel you duh sexy really thanks haha,False
4,meowww rawr,False
5,any makeup tips i suck at doing my makeup lol ...,False


In [18]:
X = df2_ordered['ques_ans'].values
y = df2_ordered['IsBully'].values

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Initialize the tokenizer and fit it on the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)

# Convert the text to sequences of integers
sequences = tokenizer.texts_to_sequences(X)

# Pad sequences to ensure uniform input length (here max_len=50)
max_len = 50  # Maximum length of sequences
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1  # +1 because the tokenizer index starts at 1
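
The tokenizing and padding steps above can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: `build_word_index`, `encode`, and `pad_post` are hypothetical helpers that only mimic the behaviour relied on here (indices start at 1 with the most frequent word first, padding is appended at the end, and over-long sequences keep their last `max_len` tokens).

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index):
    """Replace every known word by its integer index."""
    return [word_index[w] for w in text.split() if w in word_index]

def pad_post(seq, max_len, pad_value=0):
    """Keep the last `max_len` tokens, then pad at the end with `pad_value`."""
    seq = seq[-max_len:]
    return seq + [pad_value] * (max_len - len(seq))

texts = ["hey you", "you are cool you know"]
word_index = build_word_index(texts)
padded = [pad_post(encode(t, word_index), max_len=4) for t in texts]

print(word_index["you"], padded)
print(len(word_index) + 1)   # vocabulary size including the padding index
```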


In [20]:
# Hold out 20% of the data as the test set (this first split is not stratified)
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, y, test_size=0.2, random_state=42)

# Further split train into train and validation; stratify keeps the class proportions in both parts
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
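
The two splits leave 60% of the rows for training and 20% each for validation and testing. As a quick arithmetic check, the sketch below redoes the split sizes, taking the total of 12,805 rows from the class distributions printed by the undersampling cell further down and assuming the held-out size is rounded up.

```python
from fractions import Fraction
import math

n_total = 12805                                    # rows before splitting (sum of the printed class counts)
n_test = math.ceil(Fraction(1, 5) * n_total)       # test_size=0.2
n_train_val = n_total - n_test
n_val = math.ceil(Fraction(1, 4) * n_train_val)    # test_size=0.25 of what is left
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
```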


In [21]:
from imblearn.under_sampling import RandomUnderSampler

# Assuming X_train is your feature matrix and y_train is your target (label)
undersampler = RandomUnderSampler(random_state=42)

# Perform undersampling
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)
X_val_resampled, y_val_resampled = undersampler.fit_resample(X_val, y_val)
X_test_resampled, y_test_resampled = undersampler.fit_resample(X_test, y_test)

# Check the new class distribution after undersampling
print("Original class distribution:", pd.Series(y_train).value_counts())
print("Resampled class distribution:", pd.Series(y_train_resampled).value_counts())
print("Original validation class distribution:", pd.Series(y_val).value_counts())
print("Resampled validation class distribution:", pd.Series(y_val_resampled).value_counts())
print("Original test class distribution:", pd.Series(y_test).value_counts())
print("Resampled test class distribution:", pd.Series(y_test_resampled).value_counts())


Original class distribution: False    6535
True     1148
Name: count, dtype: int64
Resampled class distribution: False    1148
True     1148
Name: count, dtype: int64
Original validation class distribution: False    2179
True      382
Name: count, dtype: int64
Resampled validation class distribution: False    382
True     382
Name: count, dtype: int64
Original test class distribution: False    2190
True      371
Name: count, dtype: int64
Resampled test class distribution: False    371
True     371
Name: count, dtype: int64
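
For intuition, random undersampling can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the imbalanced-learn implementation; `random_undersample` and the toy data are made up for the example.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(ids) for ids in by_class.values())
    keep = sorted(i for ids in by_class.values() for i in rng.sample(ids, n_min))
    return [X[i] for i in keep], [y[i] for i in keep]

X_toy = [[i] for i in range(10)]
y_toy = [False] * 8 + [True] * 2          # imbalanced toy labels
X_bal, y_bal = random_undersample(X_toy, y_toy)

print(Counter(y_bal))
```

The minority class is kept in full, so the balanced set is only twice the size of the minority class, as in the counts printed above.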


## Tokenizing and Feature Engineering

In [22]:
def load_glove_embeddings(vocab, glove_file='glove.6B.50d.txt', embedding_dim=50):
    embeddings_index = {}
    
    # Load GloVe embeddings
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    
    # Create embedding matrix for our vocabulary (tokenizer indices start at 1, so row 0 stays all zeros for padding)
    embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim))
    for word, i in vocab.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            # Words missing from GloVe get a small random vector
            embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))
    
    return embedding_matrix
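
Since words missing from GloVe receive random vectors, it can be useful to check how much of the vocabulary is actually covered. The sketch below shows one way to do that on made-up data: the two GloVe-style lines and `toy_vocab` are invented for illustration and are not taken from the real file.

```python
import io

# Two made-up GloVe-style lines: a word followed by its vector components.
fake_glove = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

embeddings_index = {}
for line in fake_glove:
    word, *values = line.split()
    embeddings_index[word] = [float(v) for v in values]

# Hypothetical tokenizer vocabulary, including one misspelling.
toy_vocab = {"hello": 1, "wrld": 2, "world": 3}
missing = [w for w in toy_vocab if w not in embeddings_index]
coverage = 1 - len(missing) / len(toy_vocab)

print(f"coverage: {coverage:.2f}, missing: {missing}")
```

On informal text with many misspellings and slang, a low coverage would mean that a large share of the embedding matrix starts from random values.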

In [23]:
# Load GloVe embeddings for the tokenizer vocabulary
embedding_matrix = load_glove_embeddings(tokenizer.word_index, '/kaggle/input/glove-embeddings/glove.6B.50d.txt', 50)

In [24]:
from torch.utils.data import DataLoader, TensorDataset

# X_*_resampled and y_*_resampled are the NumPy arrays produced by the undersampling step

# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_resampled, dtype=torch.long)
y_train_tensor = torch.tensor(y_train_resampled, dtype=torch.float32)  # Assuming binary classification

X_test_tensor = torch.tensor(X_test_resampled, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_resampled, dtype=torch.float32)

X_val_tensor = torch.tensor(X_val_resampled, dtype=torch.long)
y_val_tensor = torch.tensor(y_val_resampled, dtype=torch.float32)


# Create TensorDataset (combines inputs and labels)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

# Create DataLoaders for the training, validation, and test sets
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [25]:
class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, embedding_matrix=None, drop_prob=0.01):

        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # If GloVe embeddings are provided, use them (and keep fine-tuning them);
        # otherwise, fall back to a randomly initialized embedding layer
        if embedding_matrix is not None:
            self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix), freeze=False)
        else:
            self.embedding = nn.Embedding(vocab_size, embedding_dim)


        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        # Re-initialize the hidden state for the current batch size
        # (this discards the `hidden` argument, so a smaller final batch also works)
        hidden = self.init_hidden(batch_size)
        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first and keep only the output of the last time step
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
       
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM.
        # new_zeros allocates them on the same device (CPU or GPU) as the model weights.
        weight = next(self.parameters())
        hidden = (weight.new_zeros(self.n_layers, batch_size, self.hidden_dim),
                  weight.new_zeros(self.n_layers, batch_size, self.hidden_dim))

        return hidden
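
One detail worth checking: the sequences were padded with `padding='post'`, so for any sequence shorter than `max_len` the last position holds a padding token, and `sig_out[:, -1]` is read after the LSTM has stepped over that padding. A possible alternative, which this notebook does not use, is to read the output at the last real token of each row. The plain-Python sketch below only shows how that index could be found on a toy batch; `last_token_index` and the pad value 0 are illustrative assumptions.

```python
def last_token_index(padded_row, pad_value=0):
    """Index of the last non-padding token in a post-padded row (0 if the row is all padding)."""
    length = sum(1 for tok in padded_row if tok != pad_value)
    return max(length - 1, 0)

batch = [
    [7, 3, 9, 0, 0],   # 3 real tokens followed by 2 padding tokens
    [4, 0, 0, 0, 0],   # 1 real token
    [5, 8, 2, 6, 1],   # no padding at all
]

print([last_token_index(row) for row in batch])   # position of the last real token
print([len(row) - 1 for row in batch])            # position that [:, -1] reads
```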


## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary, i.e. the range of values for our input word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (here a single bully / not-bully score).
* `embedding_dim`: Number of columns in the embedding lookup table; the size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better; common values are 128, 256, and 512.
* `n_layers`: Number of LSTM layers in the network, typically between 1 and 3.

> Define the model hyperparameters.

In [26]:
# Instantiate the model w/ hyperparams
output_size = 1
embedding_dim = 50
hidden_dim = 256
n_layers = 2
num_epochs=6
# Initialize the model
model = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, embedding_matrix=embedding_matrix)
print(model)

SentimentRNN(
  (embedding): Embedding(20637, 50)
  (lstm): LSTM(50, 256, num_layers=2, batch_first=True, dropout=0.01)
  (dropout): Dropout(p=0.01, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
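
To get a feel for the size of this model, the sketch below counts its trainable parameters by hand from the shapes printed above. It assumes the standard PyTorch LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_size, hidden_size):
    """4 gates, each with input weights, recurrent weights and two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, embedding_dim, hidden_dim, output_size = 20637, 50, 256, 1

embedding = vocab_size * embedding_dim
lstm = lstm_layer_params(embedding_dim, hidden_dim) + lstm_layer_params(hidden_dim, hidden_dim)
fc = hidden_dim * output_size + output_size
total = embedding + lstm + fc

print(embedding, lstm, fc, total)
```

More than half of the roughly 1.9 million parameters sit in the embedding table, while the balanced training set holds only 2,296 examples.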


In [27]:
# loss and optimization functions
lr=0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
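
As a reference point for reading the losses printed by the training cell, here is the binary cross-entropy formula for a single prediction, written in plain Python. `bce` is a hypothetical helper; `nn.BCELoss` averages this quantity over the batch by default.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and one label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(0.5, 1), 4))   # an undecided prediction
print(round(bce(0.9, 1), 4))   # confident and right
print(round(bce(0.9, 0), 4))   # confident and wrong
```

A model that outputs 0.5 for every example scores ln 2, about 0.693, so a loss that stays close to 0.69 suggests predictions that are barely better than undecided.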

In [28]:
# training params

counter = 0
print_every = 100
clip=5 # gradient clipping

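# Move the model to the GPU if one is available
if train_on_gpu:
    model.cuda()

# Train the model (simplified training loop)
for epoch in range(num_epochs):
    model.train()
    hidden = model.init_hidden(batch_size)
    
    for inputs, labels in train_loader:
        hidden = tuple([each.data for each in hidden])  # Detach hidden states
        counter += 1

        # Move the batch to the same device as the model
        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        # Zero the gradients
        model.zero_grad()
        
        # Forward pass
        output, hidden = model(inputs, hidden)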
        
        # Loss and backward pass
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        # Update weights
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = model(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            model.train()
            print("Epoch: {}/{}...".format(epoch+1, num_epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 2/6... Step: 100... Loss: 0.690578... Val Loss: 0.691167
Epoch: 3/6... Step: 200... Loss: 0.689891... Val Loss: 0.697322
Epoch: 5/6... Step: 300... Loss: 0.625658... Val Loss: 0.693707
Epoch: 6/6... Step: 400... Loss: 0.629568... Val Loss: 0.680452
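
The gradient clipping used in the training loop rescales all gradients by one common factor whenever their combined L2 norm exceeds `clip`. The plain-Python sketch below illustrates that idea on toy numbers; `clip_by_global_norm` is a hypothetical helper written for this example, not the PyTorch function itself.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by a common factor if their overall L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[30.0, 40.0], [0.0]]            # toy gradients with an overall L2 norm of 50
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before, round(norm_after, 3))
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only intervenes on unusually large updates.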


In [29]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = model.init_hidden(batch_size)
y_pred = []
y_true = []
model.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = model(inputs, h)
   
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer

    # collect predictions and labels as plain Python booleans for the report
    y_pred.extend(pred.bool().cpu().tolist())
    y_true.extend(labels.bool().cpu().tolist())

    # compare predictions to true labels
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

# Generate classification report
report = classification_report(y_true, y_pred, target_names=['False', 'True'])
print(report)


Test loss: 0.699
Test accuracy: 0.558
              precision    recall  f1-score   support

       False       0.53      0.89      0.67       371
        True       0.67      0.23      0.34       371

    accuracy                           0.56       742
   macro avg       0.60      0.56      0.50       742
weighted avg       0.60      0.56      0.50       742
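
To see how the numbers in the report relate to each other, the sketch below recomputes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical: the notebook does not print its confusion matrix, and these values are just one set (not the only one) that is consistent with the rounded report above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for 371 "False" and 371 "True" test examples.
tn, fp, fn, tp = 329, 42, 286, 85

accuracy = (tp + tn) / (tp + tn + fp + fn)
p_true, r_true, f1_true = prf(tp, fp, fn)
p_false, r_false, f1_false = prf(tn, fn, fp)   # treat "False" as the positive class

print(f"accuracy={accuracy:.3f}")
print(f"True:  precision={p_true:.2f} recall={r_true:.2f} f1={f1_true:.2f}")
print(f"False: precision={p_false:.2f} recall={r_false:.2f} f1={f1_false:.2f}")
```

Read this way, the report describes a model that labels most examples as `False`: it recovers about 89% of the non-bullying examples but only about 23% of the bullying ones.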

