# Sentiment Network with PyTorch: Network Architecture

Below is where you'll define the network.

<img src="assets/network_diagram.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden state size and a number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output size.
4. A sigmoid activation layer that squashes every output into a value between 0 and 1. Return **only the last sigmoid output** as the output of this network.

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 74000+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.
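
To see why a lookup table can stand in for one-hot encoding, the sketch below multiplies a one-hot vector by a small weight matrix and compares the result with plain row indexing. The toy vocabulary size, embedding size, and weight values are made up for illustration; an embedding layer stores the same kind of matrix and learns its values.

```python
# Toy embedding matrix: 5 "words" in the vocabulary, 3 numbers per embedding.
# The values are arbitrary; in a trained network they would be learned.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix using plain dot products."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

token = 3
via_one_hot = matvec(one_hot(token, len(weights)), weights)  # 5 multiplications per column
via_lookup = weights[token]                                  # a single row read

print(via_one_hot == via_lookup)
```

With a vocabulary of 74,000+ words the one-hot route would multiply by tens of thousands of zeros per token, while the lookup reads one row.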


### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, your network will perform better with more layers, typically 2 or 3. Adding more layers allows the network to learn more complex relationships.
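
As a quick illustration of those constructor arguments, the sketch below builds a small `nn.LSTM` and prints the shapes it returns. The sizes are illustrative assumptions, not this notebook's hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
batch_size, seq_len, input_size = 4, 7, 10
hidden_dim, n_layers = 16, 2

# dropout is applied between stacked layers, so it only matters when n_layers > 1
lstm = nn.LSTM(input_size, hidden_dim, n_layers, dropout=0.2, batch_first=True)

x = torch.randn(batch_size, seq_len, input_size)
out, (h_n, c_n) = lstm(x)  # no state passed, so hidden and cell states start at zero

print(out.shape)   # one hidden vector per time step: (batch, seq_len, hidden_dim)
print(h_n.shape)   # final hidden state of each layer: (n_layers, batch, hidden_dim)

last_step = out[:, -1, :]  # the slice a fully-connected output layer would receive
print(last_step.shape)
```

With `batch_first=True` the batch dimension comes first in both the input and `out`, while `h_n` and `c_n` keep the layer dimension first.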

> **Exercise:** Complete the `__init__`, `forward`, and `init_hidden` functions for the `SentimentRNN` model class.

Note: `init_hidden` should initialize the hidden and cell states of an LSTM layer to all zeros, and move those states to the GPU, if available.


In [1]:
import numpy as np
import pandas as pd
import torch
import string
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.metrics import fbeta_score
from IPython.display import Image
from transformers import BertTokenizer, BertModel
%matplotlib inline

In [2]:

class SentimentRNN(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, n_layers=2, dropout=0.2):
        super(SentimentRNN, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)  # Set batch_first=True
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sig = nn.Sigmoid()
        
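    def init_hidden(self, batch_size):
        # Zero-initialize the hidden and cell states, each of shape
        # (n_layers, batch_size, hidden_dim), on the same device as the model
        device = next(self.parameters()).device
        return (torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device),
                torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device))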

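    def forward(self, x):
        # x: precomputed embeddings of shape (batch_size, seq_len, embedding_dim).
        # No hidden state is passed, so the LSTM starts from zeros.
        out, _ = self.lstm(x)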
        # Take the output of the last time step
        out = out[:, -1, :]
        # Pass through a fully connected layer
        out = self.fc(out)
        sig_out = self.sig(out)
 
        return sig_out


In [3]:
# First checking if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
df = pd.read_csv("/kaggle/input/formspring-csv/formspring.csv")

In [5]:
df.drop(['post', 'asker', 'bully1', 'bully2', 'bully3'], axis = 1, inplace = True)

In [6]:
def impute_ans_columns(value):
    v = ['No','nan']
    if value in v:
        return 0
    return 1

In [7]:
for col in ['ans1', 'ans2', 'ans3']:
    df[col] = df[col].apply(impute_ans_columns)
df.sample(10)

Unnamed: 0,userid,ques,ans,ans1,severity1,ans2,severity2,ans3,severity3
9085,kellyblake1,have you missed me?... i feel as though i&#039...,Yeah I had noticed :( Any particular reason?...,0,0,0,0.0,0,0.0
1783,teaachgee,Name a movie or movies you can watch over and ...,Juno Sweet november mr. deeds :],0,0,0,,0,0.0
1353,teaachgee,Have you ever found a four leaf clover?,Nope has anyone?,0,0,0,0.0,0,0.0
8020,avlarios,THEY&#039;RE BE DOUCHEBAGS,SHOVE A PENIS DOWN THEIR THROAT,1,8,1,,1,
12723,outlaw9000,Who can you picture spending your entire life ...,It used to be my Wife but she is out of my L...,0,0,0,0.0,0,0.0
7297,zooshay,Would you mind if your boyfriend went out to p...,nope if he dont mind me doing stuff without h...,0,0,0,0.0,0,0.0
3425,tabithalocascio,Tabi. If you stopped hating on Texas everyone ...,ummmm jack screw you not my fault people get...,0,0,0,0.0,0,0.0
8879,kellyblake1,Do you miss someone a lot right now?,I do but I think its going down like how mu...,0,0,0,0.0,0,0.0
850,teaachgee,Current worry?,how to survive in this crazy khaos of a world...,0,0,0,0.0,0,0.0
6726,zooshay,to you whats the best thing about austraila?,ill have to say the beaches all year round th...,0,0,0,0.0,0,0.0


In [8]:
def impute_severity_columns(value):
    '''Value will be a string. We need to convert it to int'''
    v = ['nan', 'None', '0']
    if value in v:
        return 0
    try:
        return int(value)
    except ValueError as e:
        #print(value)
        return 5

In [9]:
for col in ['severity1', 'severity2', 'severity3']:
    df[col] = df[col].apply(impute_severity_columns)

In [10]:
df['IsBully'] = (
    (df.ans1 * df.severity1 + df.ans2 * df.severity2 + df.ans3 * df.severity3) / 30) >= 0.0333

# Remove unnecessary columns
df_2 = df.drop(['userid','ans1', 'severity1','ans2','severity2','ans3','severity3'], axis = 1)
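
As a worked check of the labelling rule above, here is the same formula on a few made-up rows. Since 0.0333 × 30 is just under 1 and the severities are integers, a row is flagged as soon as one annotator who answered yes also gave a severity of at least 1. The helper name `is_bully` is hypothetical, and the maximum severity of 10 is an assumption suggested by the division by 30.

```python
def is_bully(ans, severity, threshold=0.0333):
    """ans and severity are 3-item lists, one entry per annotator."""
    score = sum(a * s for a, s in zip(ans, severity)) / 30
    return score >= threshold

print(is_bully([0, 0, 0], [0, 0, 0]))      # nobody flagged the post
print(is_bully([1, 0, 0], [1, 0, 0]))      # 1/30 is about 0.03333, just above 0.0333
print(is_bully([0, 1, 0], [0, 0, 0]))      # flagged, but with severity 0
print(is_bully([1, 1, 1], [10, 10, 10]))   # 30/30 = 1.0, assuming severities top out at 10
```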

In [11]:
df_2.sample(10)

Unnamed: 0,ques,ans,IsBully
10088,Hows Lifeee Now Thaaat Yoooh Knoeee Thuhhh As,Lmao Great . (: Couldntt Bee Betterr .,False
2900,Lmao but then alix will see mee,lol how bout ya give it tew mi nd iWont accep...,True
9687,We need some more girls in here theres Too ma...,I don't like that song.,False
12145,one memorable quote you remember from a song? ...,Keep your drink just give me the Money. : Pi...,False
12569,What type of sexual performance do you preform...,Intercourse with women Oral with Men. I like ...,False
3067,haha so many people are hating on your formspr...,Juss krista as usual..,False
1364,Have you ever had a poem or a song written abo...,yes lots.,False
8991,fight for the one or settle for someone amazing?,Hmmm well 'the one' may not actually be the ...,False
884,Do any of your friends have children?,yes they do,False
9848,What would you want written on your tombstone?,OMG! I haven't even thought about it lol. I d...,False


In [12]:
for col in ['ques', 'ans']:
    df_2[col] = df_2[col].str.replace("&#039;", "'") # Put back the apostrophe

    df_2[col] = df_2[col].str.replace("<br>", "") 
    df_2[col] = df_2[col].str.replace("&quot;", "") 
    #df_2[col] = df_2[col].str.replace("<3", "love")

In [13]:
df_2 = df_2.dropna(how='all')

In [14]:
df_2.head()

Unnamed: 0,ques,ans,IsBully
0,what's your favorite song? :D,I like too many songs to have a favorite,False
1,<3,</3 ? haha jk! <33,False
2,hey angel you duh sexy,Really?!?! Thanks?! haha,False
3,(:,;(,False
4,******************MEOWWW*************************,*RAWR*?,False


In [15]:
df_2['ques_ans'] = df_2['ques'] + ' ' + df_2['ans'] 

In [16]:
df_2.head()

Unnamed: 0,ques,ans,IsBully,ques_ans
0,what's your favorite song? :D,I like too many songs to have a favorite,False,what's your favorite song? :D I like too many...
1,<3,</3 ? haha jk! <33,False,<3 </3 ? haha jk! <33
2,hey angel you duh sexy,Really?!?! Thanks?! haha,False,hey angel you duh sexy Really?!?! Thanks?! haha
3,(:,;(,False,(: ;(
4,******************MEOWWW*************************,*RAWR*?,False,******************MEOWWW**********************...


In [17]:
# Keep only the combined text and the label. Taking an explicit copy avoids
# pandas' SettingWithCopyWarning on the assignments below.
columns = ['ques_ans','IsBully']
df2_ordered = df_2[columns].copy()
df2_ordered['ques_ans'] = df2_ordered['ques_ans'].str.lower()
# Remove punctuation using regex
df2_ordered['ques_ans'] = df2_ordered['ques_ans'].str.replace(f'[{string.punctuation}]', '', regex=True)
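
The lowercase-and-strip-punctuation step can be checked on a single string with the standard library alone. This is a minimal sketch, not the notebook's exact pandas call: the helper name `clean_text` is hypothetical, and it adds `re.escape` so that characters such as `]`, `\` and `^` are treated literally inside the character class.

```python
import re
import string

# Escape the punctuation so none of it is read as regex syntax.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def clean_text(text):
    """Lowercase a string and delete ASCII punctuation."""
    return PUNCT_RE.sub("", text.lower())

print(clean_text("What's your favorite song? :D"))
```

Rows made up entirely of punctuation, such as emoticons, collapse to whitespace under this rule, which is why empty strings are filtered out afterwards.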


In [18]:
df2_ordered = df2_ordered[df2_ordered['ques_ans'].notna()]  # Remove NaN values
df2_ordered = df2_ordered[df2_ordered['ques_ans'].str.strip() != '']  # Remove empty strings
df2_ordered.head()

Unnamed: 0,ques_ans,IsBully
0,whats your favorite song d i like too many so...,False
1,3 3 haha jk 33,False
2,hey angel you duh sexy really thanks haha,False
4,meowww rawr,False
5,any makeup tips i suck at doing my makeup lol ...,False


In [19]:
from imblearn.under_sampling import RandomUnderSampler

# Balance the two classes by randomly dropping rows from the majority class
undersampler = RandomUnderSampler(random_state=42)

# Perform undersampling
X_resampled, y_resampled = undersampler.fit_resample(df2_ordered['ques_ans'].values.reshape(-1, 1), df2_ordered['IsBully'])
# Check the new class distribution after undersampling
#print("Original class distribution:", df2_ordered['ques_ans'].value_counts())
print("Resampled class distribution:", pd.Series(y_resampled).value_counts())


Resampled class distribution: IsBully
False    1901
True     1901
Name: count, dtype: int64
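
The idea behind random undersampling can be sketched in a few lines of plain Python. This is an illustration of the technique under simple assumptions, not imblearn's implementation; the function name `random_undersample` and the toy data are made up.

```python
import random
from collections import Counter

def random_undersample(texts, labels, seed=42):
    """Randomly drop rows so every class keeps as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original row order

    return [texts[i] for i in kept], [labels[i] for i in kept]

# Toy data: 6 negative rows and 2 positive rows.
texts = [f"row {i}" for i in range(8)]
labels = [False, False, True, False, False, False, True, False]

X_small, y_small = random_undersample(texts, labels)
print(Counter(y_small))
```

The cost of balancing this way is that most majority-class rows are discarded, which is why the resampled set here has only 1901 rows per class.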


In [20]:
X = df2_ordered['ques_ans'].values
y = df2_ordered['IsBully'].values

In [21]:
print(X_resampled)

[['what is the last film you saw  get him to the greek  fucking hilarious ']
 ['whaaat aree 5thingsss yoooh loveee about angieee ampamp ashely    i cantt evenn thinkk uvv o1  lett alonee o5  lmao ']
 ['why are you talking to people that you don39t even know just wondering  no']
 ...
 ['youre a bushwhacking  alchy piece of shit scrub  and youre blocked']
 ['your 44 years old o dumbass  pedifile d  you feel better now  if you dont want to talk to me  then just say so']
 ['if i told u den it would make it all the less fun  or it would make yuh loook like a person not a fake scaredyy cat ']]


## Data cleanup ends here

In [22]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.53.0-py3-none-any.whl.metadata (24 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.53.0-py3-none-any.whl (387 kB)
Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
Installing collected packages: jiter, openai
Successfully installed jiter-0.6.1 openai-1.53.0


In [23]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [24]:
embedding_model = "text-embedding-3-large"  # the model get_embedding uses by default; returns 3072-dimensional vectors
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum input for the text-embedding-3 models is 8191 tokens

In [25]:
import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it in the notebook
api_key = os.environ["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)


def get_embedding(text: str, model="text-embedding-3-large", **kwargs):
    # replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model, **kwargs)

    return response.data[0].embedding


In [26]:
# Reshape X_resampled and y_resampled if necessary
X_resampled_np = X_resampled.reshape(-1) if X_resampled.ndim > 1 else X_resampled
y_resampled_np = y_resampled.reshape(-1) if y_resampled.ndim > 1 else y_resampled

# Now create the DataFrame
df_resampled = pd.DataFrame({
    "ques_ans": X_resampled_np,
    "IsBully": y_resampled_np
})

df_resampled["embedding"] = df_resampled["ques_ans"].apply(lambda x: get_embedding(x))
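
Calling the API once per row means several thousand requests for this dataset. Because the embeddings endpoint accepts a list of inputs, the texts could be sent in batches instead. The sketch below is one possible way to do that; the helper name `embed_in_batches` and the batch size of 100 are assumptions, not part of the original notebook.

```python
def embed_in_batches(client, texts, model="text-embedding-3-large", batch_size=100):
    """Return one embedding per text, sending `batch_size` texts per request."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        # Same newline handling as get_embedding
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by the index field so the results line up with the inputs
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

It could then replace the row-wise `apply`, for example `df_resampled["embedding"] = embed_in_batches(client, df_resampled["ques_ans"].tolist())`.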

In [27]:
df_resampled.head()

Unnamed: 0,ques_ans,IsBully,embedding
0,what is the last film you saw get him to the ...,False,"[-0.017185905948281288, 0.06371483206748962, -..."
1,whaaat aree 5thingsss yoooh loveee about angie...,False,"[-0.0150068374350667, 0.021200960502028465, -0..."
2,why are you talking to people that you don39t ...,False,"[0.02681851200759411, 0.014757352881133556, -0..."
3,do you own a striped sweater i may,False,"[-0.07465637475252151, 0.02348770759999752, -0..."
4,would you ever wait tables at a restaurant i ...,False,"[-0.04610820114612579, -0.024261515587568283, ..."


We do not need to pad or truncate the inputs to a common length, because every embedding already has the same fixed length.

In [28]:
## Convert embeddings to a list of lists, ensuring they are flattened
embeddings_list = [embedding if isinstance(embedding[0], (float, int)) else embedding[0] for embedding in df_resampled["embedding"].to_list()]

# Convert to a PyTorch tensor
embeddings_tensor = torch.tensor(embeddings_list, dtype=torch.float32).squeeze()

# Check the shape
print(embeddings_tensor.shape)  # Should output (3802, 3072) for text-embedding-3-large


torch.Size([3802, 3072])


In [29]:
print(embeddings_tensor.shape) 

torch.Size([3802, 3072])


## Tokenizing and Feature Engineering

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary, or the range of values for our input word tokens. It is not needed in this notebook, because the inputs are precomputed sentence embeddings rather than word tokens.
* `output_dim`: Size of our desired output; the number of class scores we want to output. Here it is 1, a single probability that the text is bullying.
* `embedding_dim`: Size of our embeddings. With a learned embedding layer this is the number of columns in the lookup table; here it is the length of the precomputed embedding vectors.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better; common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network, typically between 1 and 3.

> Define the model hyperparameters.

In [30]:
# The embedding size is taken from the precomputed embeddings tensor
hidden_dim = 128
output_dim = 1
batch_size = 32
embedding_dim = embeddings_tensor.shape[1]
# Move the model to the same device the batches are sent to during training
model = SentimentRNN(embedding_dim, hidden_dim, output_dim).to(device)
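
To get a feel for how these hyperparameters set the model size, the sketch below counts the trainable parameters by hand. It assumes PyTorch's usual LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors); the function name `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_size, hidden_dim, n_layers):
    """Parameters in a unidirectional LSTM stack without projections."""
    total = 0
    for layer in range(n_layers):
        # Only the first layer sees the raw input; later layers see hidden states
        layer_input = input_size if layer == 0 else hidden_dim
        total += 4 * (layer_input * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return total

# Values used in this notebook (n_layers is the class default)
embedding_dim, hidden_dim, output_dim, n_layers = 3072, 128, 1, 2

lstm_params = lstm_param_count(embedding_dim, hidden_dim, n_layers)
fc_params = hidden_dim * output_dim + output_dim  # weights plus bias
print(lstm_params, fc_params, lstm_params + fc_params)
```

Under those assumptions the first LSTM layer dominates, because its input weights scale with the 3072-dimensional embeddings. The total can be compared against `sum(p.numel() for p in model.parameters())`.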


In [31]:
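# loss and optimization functions
lr = 0.001
# Note: SentimentRNN.forward already ends with a sigmoid, while BCEWithLogitsLoss
# applies its own sigmoid to its input; nn.BCELoss() is the criterion that
# matches probability outputs.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)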
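
One thing worth checking in this setup: `SentimentRNN.forward` already ends with a sigmoid, and `nn.BCEWithLogitsLoss` applies a sigmoid of its own. The sketch below reimplements the loss formula in plain Python to show what happens when its input is already a probability. It illustrates the arithmetic only and is not PyTorch's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(x, y):
    """Binary cross-entropy that applies a sigmoid to x first."""
    p = sigmoid(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# If x is already a probability in [0, 1], the second sigmoid maps it into
# roughly [0.5, 0.731], so even a perfectly confident output is penalised.
best_positive = bce_with_logits(1.0, 1)  # model outputs 1.0 for a true positive
best_negative = bce_with_logits(0.0, 0)  # model outputs 0.0 for a true negative

print(round(best_positive, 4), round(best_negative, 4))
```

On a roughly balanced dataset this puts a floor of about 0.5 under the average loss, which may be why the losses logged in this notebook stay near 0.6 even at about 80% accuracy. Using `nn.BCELoss`, or returning raw logits from `forward`, would avoid the double sigmoid.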

Since LSTMs expect a 3D tensor of shape (batch_size, sequence_length, embedding_dim), you need to add a sequence dimension to your embeddings. For each input sentence embedding, the sequence length is 1, as you only have a single embedding vector per sentence.
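
A minimal sketch of that reshaping, using a random stand-in batch with the same dimensions as the embeddings in this notebook:

```python
import torch

# Stand-in batch: 32 sentence embeddings of length 3072 (random values)
batch = torch.randn(32, 3072)

# Insert a sequence dimension of length 1 at position 1
sequence_batch = batch.unsqueeze(1)

print(batch.shape)           # (batch_size, embedding_dim)
print(sequence_batch.shape)  # (batch_size, sequence_length, embedding_dim)
```

`unsqueeze` returns a view, so the values are untouched; only the shape changes.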

In [32]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    embeddings_tensor.numpy(),  # Convert to NumPy array for splitting
    y_resampled,               # Labels
    test_size=0.2, 
    shuffle=True,
    random_state=42            
)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15,random_state=42, shuffle=True)

# Convert the split data back to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).squeeze()  # Shape: (num_train_samples, embedding_dim)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).squeeze()    # Shape: (num_test_samples, embedding_dim)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32).squeeze()      # Shape: (num_val_samples, embedding_dim)

y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)  # Shape: (num_train_samples, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)    # Shape: (num_test_samples, 1)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32).unsqueeze(1)      # Shape: (num_val_samples, 1)

# Check shapes to ensure correctness
print("Training set shapes:", X_train_tensor.shape, y_train_tensor.shape)
print("Testing set shapes:", X_test_tensor.shape, y_test_tensor.shape)
print("Validation set shapes:", X_val_tensor.shape, y_val_tensor.shape)

Training set shapes: torch.Size([2584, 3072]) torch.Size([2584, 1])
Testing set shapes: torch.Size([761, 3072]) torch.Size([761, 1])
Validation set shapes: torch.Size([457, 3072]) torch.Size([457, 1])
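
The printed sizes can be reproduced by hand. The sketch below assumes the held-out fraction is rounded up to a whole number of rows, which is consistent with the shapes above; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 3802
n_train_val, n_test = split_sizes(n_total, 0.2)     # first split: 20% for testing
n_train, n_val = split_sizes(n_train_val, 0.15)     # second split: 15% of the rest

print(n_train, n_val, n_test)
```

Because the second split takes 15% of the remaining 80%, the overall proportions are roughly 68% training, 12% validation, and 20% test.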


In [33]:
# Create TensorDataset (combines inputs and labels)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

# Create DataLoaders for the training, validation, and test sets
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [38]:
num_epochs = 4
clip = 5  # Gradient clipping value
counter = 0
print_every = 100
model.train()
for epoch in range(num_epochs):
    for embeddings_batch, labels_batch in train_loader:

        # Move data to the device if using GPU
        embeddings_batch, labels_batch = embeddings_batch.to(device), labels_batch.to(device)
        embeddings_batch = embeddings_batch.unsqueeze(1)
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(embeddings_batch)
     
        # Calculate loss
        loss = criterion(outputs.squeeze(1), labels_batch.squeeze(1))
        
        # Backward pass and optimization
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        # Optimizer step
        optimizer.step()
        
        # Increment step counter
        counter += 1

        # Print loss and validation loss every `print_every` steps
        if counter % print_every == 0:
            val_losses = []
            model.eval()  # Set model to evaluation mode
            
            with torch.no_grad():  # Disable gradient calculation for validation
                for inputs, labels in valid_loader:
                    inputs, labels = inputs.to(device), labels.to(device)  # Move to device
                    output = model(inputs.unsqueeze(1))
                    val_loss = criterion(output, labels)
                    val_losses.append(val_loss.item())
            
            model.train()  # Set model back to training mode
            
            # Print training and validation losses
            print("Epoch: {}/{}...".format(epoch + 1, num_epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))


Epoch: 2/4... Step: 100... Loss: 0.563460... Val Loss: 0.596806
Epoch: 3/4... Step: 200... Loss: 0.629637... Val Loss: 0.587182
Epoch: 4/4... Step: 300... Loss: 0.568587... Val Loss: 0.577696
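
The log has no line for epoch 1 because one epoch is shorter than `print_every` steps. A short calculation with the values from this notebook shows where each logging step falls, assuming the last, smaller batch of each epoch is kept.

```python
import math

n_train, batch_size, num_epochs, print_every = 2584, 32, 4, 100

steps_per_epoch = math.ceil(n_train / batch_size)  # the final batch is smaller
total_steps = steps_per_epoch * num_epochs

# Epoch (1-based) in which each logging step falls
log_epochs = {step: math.ceil(step / steps_per_epoch)
              for step in range(print_every, total_steps + 1, print_every)}

print(steps_per_epoch, total_steps, log_epochs)
```

Lowering `print_every` below the number of steps per epoch would give at least one validation reading in every epoch.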


In [39]:
# Get test data loss and accuracy

test_losses = []  # track loss
num_correct = 0

y_pred = []
y_true = []
model.eval()
# iterate over test data without tracking gradients
with torch.no_grad():
    for inputs, labels in test_loader:
        # move the batch to the same device as the model
        inputs, labels = inputs.to(device), labels.to(device)

        # get predicted outputs (sigmoid probabilities), shape (batch, 1)
        output = model(inputs.unsqueeze(1))

        # calculate loss
        test_loss = criterion(output, labels.float())
        test_losses.append(test_loss.item())

        # convert output probabilities to predicted class (0 or 1);
        # squeeze(1) keeps the batch dimension even for a batch of one
        pred = torch.round(output.squeeze(1))

        # collect predictions and labels as Python booleans for the report
        y_pred.extend(pred.bool().cpu().tolist())
        y_true.extend(labels.squeeze(1).bool().cpu().tolist())

        # compare predictions to true labels
        correct_tensor = pred.eq(labels.view_as(pred).float())
        num_correct += correct_tensor.sum().item()


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

# Generate classification report
report = classification_report(y_true, y_pred, target_names=['False', 'True'])
print(report)

Test loss: 0.611
Test accuracy: 0.799
              precision    recall  f1-score   support

       False       0.80      0.83      0.81       401
        True       0.80      0.77      0.78       360

    accuracy                           0.80       761
   macro avg       0.80      0.80      0.80       761
weighted avg       0.80      0.80      0.80       761
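
For reference, the per-class numbers in a report like this come from simple counts of true positives, false positives, and false negatives. The sketch below computes them for the positive class on made-up labels; the function name `binary_report` and the toy data are hypothetical and unrelated to the results above.

```python
def binary_report(y_true, y_pred):
    """Precision, recall and F1 for the positive (True) class, plus accuracy."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(y_true)}

# Made-up labels and predictions, only to exercise the function
y_true_toy = [True, True, True, False, False, False, False, True]
y_pred_toy = [True, False, True, False, True, False, False, True]

print(binary_report(y_true_toy, y_pred_toy))
```

Recall on the `True` class is the figure to watch for a bullying detector, since it measures how many bullying posts are actually caught.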

