
## Convolutional Neural Network Sample

<b> Step one: Load the data into pandas. The tweets and their labels are stored in separate files, so we combine them into a single pandas DataFrame. Dataset: https://github.com/cardiffnlp/tweeteval</b>

In [1]:
#Import the main tools for the task:
import os
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch

In [3]:
# create a dictionary of the mapping between numbers and labels
mappings = {"anger": 3, "joy": 1, "optimism": 2, "sadness": 0}

def load_data(mapping_dictionary: dict, tweet_file_path: str, label_file_path: str) -> pd.DataFrame:
    """
    Load the tweets and their labels and combine them into a pandas DataFrame.
    """
    with open(tweet_file_path, 'r', encoding="utf-8") as file:
        tweets = [line.strip() for line in file]

    with open(label_file_path, 'r', encoding="utf-8") as file:
        label_numbers = [int(line.strip()) for line in file]

    # Invert the mapping once (number -> label name) instead of scanning it for every row
    number_to_label = {value: key for key, value in mapping_dictionary.items()}
    label_texts = [number_to_label.get(label_number) for label_number in label_numbers]
    
    df = pd.DataFrame({
        'text': tweets,
        'label_text': label_texts,
        'label_number': label_numbers
    })
    
    return df

# Make sure the data files are in the same directory as the notebook
train_data = load_data(mappings, "train_text.txt", "train_labels.txt")
validation_data = load_data(mappings, "val_text.txt", "val_labels.txt")
test_data = load_data(mappings, "test_text.txt", "test_labels.txt")

# print the head of one of them
print(train_data.head())

# Print the length of each data
print(f"Length of train_data: {len(train_data)}")
print(f"Length of validation_data: {len(validation_data)}")
print(f"Length of test_data: {len(test_data)}")

                                                text label_text  label_number
0  “Worry is a down payment on a problem you may ...   optimism             2
1  My roommate: it's okay that we can't spell bec...    sadness             0
2  No but that's so cute. Atsu was probably shy a...        joy             1
3  Rooneys fucking untouchable isn't he? Been fuc...    sadness             0
4  it's pretty depressing when u hit pan on ur fa...      anger             3
Length of train_data: 3257
Length of validation_data: 374
Length of test_data: 1421
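
The loader above relies on the tweet file and the label file having the same number of lines in the same order. A minimal sketch of that pairing with an explicit length check is shown below. The helper name `read_pairs` and the throwaway demo files are hypothetical and only serve as an illustration; they are not part of the original notebook.

```python
import os
import tempfile


def read_pairs(tweet_path: str, label_path: str) -> list:
    """Return (tweet, label) tuples; raise if the two files are misaligned."""
    with open(tweet_path, "r", encoding="utf-8") as f:
        tweets = [line.strip() for line in f]
    with open(label_path, "r", encoding="utf-8") as f:
        labels = [int(line.strip()) for line in f]
    if len(tweets) != len(labels):
        raise ValueError(
            f"{len(tweets)} tweets but {len(labels)} labels: files are misaligned"
        )
    return list(zip(tweets, labels))


# Tiny demonstration with temporary files (the content is made up).
with tempfile.TemporaryDirectory() as tmp:
    tweet_path = os.path.join(tmp, "demo_text.txt")
    label_path = os.path.join(tmp, "demo_labels.txt")
    with open(tweet_path, "w", encoding="utf-8") as f:
        f.write("first tweet\nsecond tweet\n")
    with open(label_path, "w", encoding="utf-8") as f:
        f.write("0\n1\n")
    pairs = read_pairs(tweet_path, label_path)

print(pairs)
```

Failing early on a length mismatch gives a clearer error than letting the DataFrame constructor complain about arrays of different lengths.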


In [4]:
### Now for the main assignment:
## Task 2: a dataset restricted to two emotions
# We reuse the train, validation and test sets loaded above
## and filter each of them down to a two-class subset.
# The first subset is sadness (0) and joy (1)
sadjoylist = ['sadness','joy']
# .copy() returns an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
train_data = train_data[train_data['label_text'].isin(sadjoylist)].copy()

#Alright, got only the sadness and joy!
print(train_data.head())
print(type(train_data))
## Count total with only sadness and joy:
print(len(train_data))

                                                text label_text  label_number
1  My roommate: it's okay that we can't spell bec...    sadness             0
2  No but that's so cute. Atsu was probably shy a...        joy             1
3  Rooneys fucking untouchable isn't he? Been fuc...    sadness             0
5  @user but your pussy was weak from what I hear...    sadness             0
7  Tiller and breezy should do a collab album. Ra...        joy             1
<class 'pandas.core.frame.DataFrame'>
2108


In [5]:
## Now we do the same for the validation and test sets:
validation_data = validation_data[validation_data['label_text'].isin(sadjoylist)].copy()
test_data = test_data[test_data['label_text'].isin(sadjoylist)].copy()

In [6]:
#Validation set got only the sadness and joy!
print(validation_data.head())
print(type(validation_data))
# Count validation:
print(len(validation_data))

                                                text label_text  label_number
0  @user @user Oh, hidden revenge and anger...I r...    sadness             0
1  if not then #teamchristine bc all tana has don...    sadness             0
2  Hey @user #Fields in #skibbereen give your onl...    sadness             0
3  Why have #Emmerdale had to rob #robron of havi...    sadness             0
4  @user I would like to hear a podcast of you go...    sadness             0
<class 'pandas.core.frame.DataFrame'>
257


In [7]:
#Test set got only the sadness and joy!
print(test_data.head())
print(type(test_data))
#count:
print(len(test_data))

                                                text label_text  label_number
1  @user Interesting choice of words... Are you c...    sadness             0
3  @user Welcome to #MPSVT! We are delighted to h...        joy             1
4                       What makes you feel #joyful?        joy             1
5                                    i am revolting.    sadness             0
9  @user Get Donovan out of your soccer booth. He...    sadness             0
<class 'pandas.core.frame.DataFrame'>
916


<b> The next step after loading the data is preprocessing: tokenization, building the vocabulary, embedding, and padding. We followed this tutorial on building a CNN text classifier: https://chriskhanhtran.github.io/posts/cnn-sentence-classification/ </b>

In [8]:
# Function to tokenize the strings in text column:
def tokenize_sentence(sentence: str) -> list:
    """
    Tokenizes a sentence using nltk's word_tokenize method.

    Args:
    - sentence (str): The sentence to tokenize.

    Returns:
    - List of tokens.
    """
    return word_tokenize(sentence)

In [9]:
## Now we apply the tokenizer to the training subset and to the validation and test sets:
train_data['tokenized_text'] = train_data['text'].apply(tokenize_sentence)
validation_data['tokenized_text'] = validation_data['text'].apply(tokenize_sentence)
test_data['tokenized_text'] = test_data['text'].apply(tokenize_sentence)

In [10]:
# now the data has a new column "tokenized_text" which is a list of tokens
#print(train_data.head())
print(train_data['tokenized_text'].head())

1    [My, roommate, :, it, 's, okay, that, we, ca, ...
2    [No, but, that, 's, so, cute, ., Atsu, was, pr...
3    [Rooneys, fucking, untouchable, is, n't, he, ?...
5    [@, user, but, your, pussy, was, weak, from, w...
7    [Tiller, and, breezy, should, do, a, collab, a...
Name: tokenized_text, dtype: object
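
In the output above, `word_tokenize` splits the anonymized mention `@user` into the two tokens `@` and `user`. If mentions and hashtags should stay in one piece, a small regex tokenizer is one possible alternative. This is only a sketch and not what the notebook uses; the name `simple_tweet_tokenize` and the pattern are assumptions made for illustration, and contractions such as "it's" are kept whole here rather than split into `it` and `'s`.

```python
import re

# Order matters: mentions and hashtags are tried before plain words,
# and any remaining non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"@\w+|#\w+|\w+(?:'\w+)?|[^\w\s]")


def simple_tweet_tokenize(text: str) -> list:
    """Split a tweet into tokens, keeping @mentions and #hashtags intact."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tweet_tokenize("@user it's okay #joy!")
print(tokens)
```

Keeping `@user` as a single token would shrink the sequences slightly, since every mention then costs one index instead of two.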


<b> After tokenization, we collect the set of all unique tokens in the training data and create a mapping from each token to an index. We then convert the tokens into numbers. </b>

In [11]:
# Build a set of all unique tokens in the training data
vocab_set = set()
for tokens in train_data['tokenized_text']:
    vocab_set.update(tokens)

# Convert the set to a list to index tokens
vocab_list = list(vocab_set)

print(vocab_list[:4])

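# Create a word to index mapping. Indices start at 1 because index 0 is
# reserved for padding (see pad_idx and pad_sequence below); starting at 0
# would make one real token indistinguishable from padding.
word_to_index = {word: index for index, word in enumerate(vocab_list, start=1)}

['physical', 'inferno', 'Point', 'yuk']


In [12]:
# Add an OOV (out-of-vocabulary) token and its index to the vocabulary,
# because the validation/test sets may contain tokens that never occur in training
OOV_TOKEN = "<OOV>"
if OOV_TOKEN not in word_to_index:
    # Word indices run from 1 to len(word_to_index), so the next free index is len + 1
    word_to_index[OOV_TOKEN] = len(word_to_index) + 1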

In [13]:
## Main function for conversion of tokens from previous step to index numbers:
def tokens_to_numbers(tokens: list, word_to_index: dict) -> list:
    """
    Converts a list of tokens to their corresponding indices using a word-to-index mapping.
    Returns the index of OOV_TOKEN for out-of-vocabulary words.
    """
    return [word_to_index.get(token, word_to_index[OOV_TOKEN]) for token in tokens]

In [14]:
## Apply the function for token to index conversion:
train_data['numeric_tokens'] = train_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))
validation_data['numeric_tokens'] = validation_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))
test_data['numeric_tokens'] = test_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))

In [15]:
# Now the df has a new column "numeric_tokens"
#print(train_data.head())
print(train_data['numeric_tokens'])

## Seems alright

1       [1269, 2851, 2118, 7359, 2559, 5508, 1162, 594...
2       [1212, 4011, 1162, 2559, 7581, 4218, 4141, 688...
3       [3434, 3943, 6005, 5952, 3404, 1211, 4483, 705...
5       [6281, 6391, 4011, 369, 6506, 7642, 7693, 2820...
7       [7254, 2358, 4757, 6699, 4093, 6716, 4140, 605...
                              ...                        
3250    [6281, 6391, 447, 7120, 2533, 2527, 2408, 1855...
3251    [6281, 6391, 2709, 5099, 2992, 1936, 4150, 448...
3254    [6281, 6391, 6281, 6391, 6281, 6391, 6281, 639...
3255    [1263, 3970, 6716, 2527, 6541, 4483, 1804, 142...
3256    [6281, 6391, 6281, 6391, 5616, 3672, 5198, 649...
Name: numeric_tokens, Length: 2108, dtype: object
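
The vocabulary steps above can be condensed into a small self-contained sketch. It reserves index 0 for padding and index 1 for unknown tokens, and it sorts the tokens so that the mapping is the same on every run (iterating over a Python set of strings gives no such guarantee). The names `build_vocab`, `encode` and `PAD_TOKEN` are hypothetical, and the toy sentences are made up.

```python
PAD_TOKEN = "<PAD>"
OOV_TOKEN = "<OOV>"


def build_vocab(tokenized_texts: list) -> dict:
    """Map tokens to indices, reserving 0 for padding and 1 for unknown tokens."""
    vocab = {PAD_TOKEN: 0, OOV_TOKEN: 1}
    # Sorting makes the mapping reproducible across runs.
    for token in sorted({tok for tokens in tokenized_texts for tok in tokens}):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens: list, vocab: dict) -> list:
    """Replace each token by its index; unseen tokens get the OOV index."""
    return [vocab.get(tok, vocab[OOV_TOKEN]) for tok in tokens]


train_tokens = [["so", "cute"], ["so", "sad"]]
vocab = build_vocab(train_tokens)
encoded = encode(["so", "happy"], vocab)
print(vocab)
print(encoded)
```

With this layout the embedding table needs exactly `len(vocab)` rows, and no real token can share an index with the padding value.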


<b> Next step is to do padding </b>

In [16]:
# Find the maximum sequence length. Tweets are short, so we can simply use
# the length of the longest tokenized tweet in the training data:
MAX_SEQUENCE_LENGTH = max(train_data['numeric_tokens'].apply(len))
print('The Maximum Length is:', MAX_SEQUENCE_LENGTH)

The Maximum Length is: 48


In [17]:
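## Padding is required so that every sequence has the same length:
def pad_sequence(numeric_tokens: list, max_length: int) -> list:
    """
    Pads a sequence to a given length. If the sequence is shorter than the target
    length, it is padded with zeros at the end. Longer sequences (possible in the
    validation/test sets, since max_length comes from the training data) are truncated.
    """
    numeric_tokens = numeric_tokens[:max_length]
    return numeric_tokens + [0]*(max_length - len(numeric_tokens))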

In [18]:
## Apply the padding:
train_data['padded_tokens'] = train_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))
validation_data['padded_tokens'] = validation_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))
test_data['padded_tokens'] = test_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))

In [19]:
# Now the df has a new column "padded_tokens". 
# Usually we shouldn't load all that to memory but once again, Tweets are small, so it is ok:
#print(train_data.head())
print(train_data['padded_tokens'])

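## Looks the same as the previous output because the padding zeros are appended
## at the end of each sequence, which the truncated printout does not show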

1       [1269, 2851, 2118, 7359, 2559, 5508, 1162, 594...
2       [1212, 4011, 1162, 2559, 7581, 4218, 4141, 688...
3       [3434, 3943, 6005, 5952, 3404, 1211, 4483, 705...
5       [6281, 6391, 4011, 369, 6506, 7642, 7693, 2820...
7       [7254, 2358, 4757, 6699, 4093, 6716, 4140, 605...
                              ...                        
3250    [6281, 6391, 447, 7120, 2533, 2527, 2408, 1855...
3251    [6281, 6391, 2709, 5099, 2992, 1936, 4150, 448...
3254    [6281, 6391, 6281, 6391, 6281, 6391, 6281, 639...
3255    [1263, 3970, 6716, 2527, 6541, 4483, 1804, 142...
3256    [6281, 6391, 6281, 6391, 5616, 3672, 5198, 649...
Name: padded_tokens, Length: 2108, dtype: object
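
Because the printout above only shows the first few indices of each row, the effect of padding is easier to see on a toy example. The sketch below also clips sequences that are longer than the target length, which matters when the maximum length is measured on the training data only. The function name `pad_or_truncate` and the toy index lists are assumptions for illustration.

```python
def pad_or_truncate(ids: list, max_length: int, pad_idx: int = 0) -> list:
    """Return a list of exactly max_length indices."""
    clipped = ids[:max_length]  # drop anything beyond max_length
    return clipped + [pad_idx] * (max_length - len(clipped))


short = pad_or_truncate([5, 9, 2], 6)
long = pad_or_truncate([5, 9, 2, 7, 7, 1, 3, 8], 6)
print(short)
print(long)
```

Every row having the same length is what allows the default `DataLoader` collation to stack the rows into a single batch tensor.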


In [20]:
### MAIN CLASS FOR Convolutional Neural Network ###:
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx, activation_function='ReLU', pooling_strategy='max'):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

        # Activation function
        if activation_function == 'ReLU':
            self.activation = F.relu
        elif activation_function == 'LeakyReLU':
            self.activation = F.leaky_relu
        elif activation_function == 'ELU':
            self.activation = F.elu
        else:
            raise ValueError("Unsupported activation function")

        # Pooling strategy
        self.pooling_strategy = pooling_strategy

    def forward(self, text):
        # text = [batch size, sent len]
        embedded = self.embedding(text)
        # embedded = [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1)
        #embedded = embedded.squeeze(1)
        # embedded = [batch size, 1, sent len, emb dim]
        
        conved = [self.activation(conv(embedded)).squeeze(3) for conv in self.convs]
        # conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        if self.pooling_strategy == 'max':
            pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        elif self.pooling_strategy == 'avg':
            pooled = [F.avg_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        else:
            raise ValueError("Unsupported pooling strategy")
        # pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim=1))
        # cat = [batch size, n_filters * len(filter_sizes)]
        
        return self.fc(cat)

In [21]:
#Main tools for the encoding and loading:
from torch.utils.data import Dataset, DataLoader

In [22]:
## As in the previous exercise, a Dataset class that feeds the DataLoader:
class TextDataset(Dataset):
    def __init__(self, tokenized_texts, labels):
        self.tokenized_texts = tokenized_texts
        self.labels = labels

    def __len__(self):
        return len(self.tokenized_texts)

    def __getitem__(self, idx):
        return torch.tensor(self.tokenized_texts[idx], dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)

In [23]:
# Conversion to list:
train_dataset = TextDataset(train_data['padded_tokens'].tolist(), train_data['label_number'].tolist())
val_dataset = TextDataset(validation_data['padded_tokens'].tolist(), validation_data['label_number'].tolist())
test_dataset = TextDataset(test_data['padded_tokens'].tolist(), test_data['label_number'].tolist())

In [24]:
## Batch size to work with: GS of 32, 64 and 128:
batch_size = 32

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)


In [25]:
print(train_loader)

<torch.utils.data.dataloader.DataLoader object at 0x000001BC84637850>


In [26]:
# Use the GPU if one is available; otherwise fall back to the CPU, which is enough for this small tweet dataset:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('The device gpu or cpu is:', device)

The device gpu or cpu is: cpu


In [26]:
# Main hyperparameters:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
n_filters = 110
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
dropout=0.1
#dropout=0.24691844248854944
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [27]:
## Run the main CNN model builder:
model = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [27]:
### Alternative setting for applying the Optima best settings for Sadness and Joy dataset.
### This cell overwrites n_filters and dropout, so it must run after the base model above has been built:
# Main hyperparameters:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
n_filters = 130
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
dropout=0.30216765316720884
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [28]:
## Alternate for the CNN model builder for Optima Settings for Sadjoy:
model_optima_sadjoy = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_optima_sadjoy.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])
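
The remaining layer sizes follow from the hyperparameters by simple arithmetic. Each convolution slides over the 48 padded positions with stride 1 and no padding, so a filter of height `fs` yields `48 - fs + 1` positions, and the pooled outputs of all filter sizes are concatenated before the linear layer. The helper names below are hypothetical; the numbers are the ones used in this notebook.

```python
def conv_output_lengths(sent_len: int, filter_sizes: list) -> list:
    """Length along the sentence axis after a stride-1, unpadded convolution."""
    return [sent_len - fs + 1 for fs in filter_sizes]


def fc_input_size(n_filters: int, filter_sizes: list) -> int:
    """Size of the concatenated pooled vector fed to the linear layer."""
    return n_filters * len(filter_sizes)


lengths = conv_output_lengths(48, [2, 3, 4])
base_fc = fc_input_size(110, [2, 3, 4])    # base configuration
optima_fc = fc_input_size(130, [2, 3, 4])  # alternative configuration
print(lengths)
print(base_fc, optima_fc)
```

Pooling over the whole sentence axis removes the dependence on the sequence length, which is why the linear layer only depends on the number of filters and filter sizes.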


In [28]:
#model.embedding.weight.data.copy_(torch.tensor(embedding_matrix))
model.embedding.weight.requires_grad = False  # Freeze the embedding layer

In [29]:
#Alternate for optima sadjoy:
model_optima_sadjoy.embedding.weight.requires_grad = False  # Freeze the embedding layer

In [29]:
# Loss and optimizer
## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#lr=0.000682446937962706
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [30]:
# Loss and optimizer for sadjoy optima config:

## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#lr=0.000682446937962706
optimizer = torch.optim.Adam(model_optima_sadjoy.parameters(), lr=0.0007809671359222956)

In [31]:
## Main function to perform training:
def train(model, iterator, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    for batch_texts, batch_labels in iterator:
        batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
        batch_labels = batch_labels.view(-1, 1)
        # Convert the labels to float, as BCELoss expects float targets:
        batch_labels = batch_labels.float()
        optimizer.zero_grad()
        predictions = model(batch_texts)
        # Apply sigmoid to map the raw outputs to probabilities in the range (0, 1)
        predictions = torch.sigmoid(predictions)
        loss = criterion(predictions, batch_labels)

        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [32]:
### Function for evaluation and metric scores:
def evaluate(model, iterator, criterion, device):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for batch_texts, batch_labels in iterator:
            batch_texts, batch_labels = batch_texts.to(device), batch_labels.to(device)
            # Reshape the labels to [batch_size, 1]
            batch_labels = batch_labels.view(-1, 1)
            # Cast the labels to float
            batch_labels = batch_labels.float()
            predictions = model(batch_texts)
            # Apply sigmoid activation to the predictions
            predictions = torch.sigmoid(predictions)
            loss = criterion(predictions, batch_labels)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)
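
Both functions above apply `torch.sigmoid` and then `nn.BCELoss`. PyTorch also provides `nn.BCEWithLogitsLoss`, which takes the raw model outputs and combines the two steps in a numerically more stable way. The pure-Python sketch below illustrates why the combined form is more robust for a single example; it is an illustration of the formula, not PyTorch's implementation, and the function names are made up.

```python
import math


def naive_bce(logit: float, target: float) -> float:
    """Sigmoid followed by binary cross-entropy, computed in two steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit: float, target: float) -> float:
    """Same loss in the stable form max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits both versions agree.
moderate_naive = naive_bce(2.0, 1.0)
moderate_stable = bce_with_logits(2.0, 1.0)

# For an extreme logit the two-step version breaks down, the stable one does not.
try:
    naive_bce(-800.0, 1.0)
    naive_failed = False
except OverflowError:
    naive_failed = True
extreme_stable = bce_with_logits(-800.0, 1.0)

print(moderate_naive, moderate_stable)
print(naive_failed, extreme_stable)
```

If the notebook were switched to `nn.BCEWithLogitsLoss`, the explicit `torch.sigmoid` call would have to be removed before the loss and kept only where probabilities are rounded into class predictions.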

In [32]:
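# Number of epochs at 20
n_epochs = 20
best_valid_loss = float('inf')

# Bind a dedicated optimizer to the base model: when the notebook is run top to bottom,
# the shared `optimizer` variable has already been reassigned to model_optima_sadjoy
base_optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(n_epochs):
    train_loss = train(model, train_loader, base_optimizer, criterion, device)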
    valid_loss = evaluate(model, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'sadjoy_model.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 1.003
	 Val. Loss: 0.667
Epoch: 02
	Train Loss: 0.374
	 Val. Loss: 0.740
Epoch: 03
	Train Loss: 0.269
	 Val. Loss: 2.171
Epoch: 04
	Train Loss: 0.486
	 Val. Loss: 2.449
Epoch: 05
	Train Loss: 1.021
	 Val. Loss: 3.906
Epoch: 06
	Train Loss: 0.818
	 Val. Loss: 3.487
Epoch: 07
	Train Loss: 0.449
	 Val. Loss: 3.835
Epoch: 08
	Train Loss: 1.007
	 Val. Loss: 4.670
Epoch: 09
	Train Loss: 1.143
	 Val. Loss: 5.624
Epoch: 10
	Train Loss: 0.639
	 Val. Loss: 7.256
Epoch: 11
	Train Loss: 0.981
	 Val. Loss: 7.832
Epoch: 12
	Train Loss: 1.038
	 Val. Loss: 8.242
Epoch: 13
	Train Loss: 1.013
	 Val. Loss: 8.673
Epoch: 14
	Train Loss: 0.973
	 Val. Loss: 10.629
Epoch: 15
	Train Loss: 1.381
	 Val. Loss: 8.474
Epoch: 16
	Train Loss: 1.039
	 Val. Loss: 10.696
Epoch: 17
	Train Loss: 3.416
	 Val. Loss: 22.250
Epoch: 18
	Train Loss: 1.999
	 Val. Loss: 15.437
Epoch: 19
	Train Loss: 1.782
	 Val. Loss: 20.173
Epoch: 20
	Train Loss: 2.274
	 Val. Loss: 15.420
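
In the log above the validation loss is lowest in the first epoch and grows afterwards, so the checkpoint saved by the loop is the one from epoch 1 and the remaining epochs do not improve it. One common way to avoid the wasted epochs is early stopping with a patience counter. The class below is a minimal sketch under that assumption; `EarlyStopping` is a hypothetical helper and not part of the notebook or of PyTorch.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best validation loss."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, valid_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Validation losses of the first six epochs, copied from the log above.
recorded = [0.667, 0.740, 2.171, 2.449, 3.906, 3.487]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(recorded, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

Inside the training loop, `stopper.step(valid_loss)` would be called once per epoch right after the checkpoint logic, followed by a `break` when it returns `True`.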


In [34]:
### Training Sadjoy Optima model ###

# Number of epochs at 20
n_epochs = 20
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model_optima_sadjoy, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_optima_sadjoy, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_optima_sadjoy.state_dict(), 'sadjoy_model_plus_optima.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 0.186
	 Val. Loss: 0.473
Epoch: 02
	Train Loss: 0.102
	 Val. Loss: 0.476
Epoch: 03
	Train Loss: 0.058
	 Val. Loss: 0.456
Epoch: 04
	Train Loss: 0.048
	 Val. Loss: 0.488
Epoch: 05
	Train Loss: 0.032
	 Val. Loss: 0.464
Epoch: 06
	Train Loss: 0.033
	 Val. Loss: 0.517
Epoch: 07
	Train Loss: 0.028
	 Val. Loss: 0.480
Epoch: 08
	Train Loss: 0.018
	 Val. Loss: 0.490
Epoch: 09
	Train Loss: 0.019
	 Val. Loss: 0.484
Epoch: 10
	Train Loss: 0.014
	 Val. Loss: 0.573
Epoch: 11
	Train Loss: 0.017
	 Val. Loss: 0.536
Epoch: 12
	Train Loss: 0.013
	 Val. Loss: 0.563
Epoch: 13
	Train Loss: 0.008
	 Val. Loss: 0.500
Epoch: 14
	Train Loss: 0.019
	 Val. Loss: 0.508
Epoch: 15
	Train Loss: 0.010
	 Val. Loss: 0.545
Epoch: 16
	Train Loss: 0.011
	 Val. Loss: 0.531
Epoch: 17
	Train Loss: 0.006
	 Val. Loss: 0.518
Epoch: 18
	Train Loss: 0.012
	 Val. Loss: 0.528
Epoch: 19
	Train Loss: 0.013
	 Val. Loss: 0.540
Epoch: 20
	Train Loss: 0.010
	 Val. Loss: 0.567


In [35]:
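## The training loss decreases drastically while the validation loss does not, which suggests overfitting
#### Now we compute the evaluation metrics, starting with the base model (no hyperparameter changes):
## The function below evaluates a model and collects what is needed for the F1 score and accuracy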
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model(model, data_loader, criterion, device):
    model.eval()
    all_predictions = []
    all_true_labels = []
    total_loss = 0.0
    
    with torch.no_grad():
        for batch in data_loader:
            text, labels = batch
            text, labels = text.to(device), labels.to(device)  # Move data to device
            # Flatten the labels to shape [batch_size] to match the squeezed predictions
            labels = labels.view(-1)
            # Cast the labels to float
            labels = labels.float()
            predictions = model(text).squeeze(1)
            # Apply sigmoid activation to the predictions
            predictions = torch.sigmoid(predictions)
            loss = criterion(predictions, labels)
            total_loss += loss.item()
            
            # Convert predictions to class labels and append to lists
            predicted_labels = (predictions.round().long())
            all_predictions.extend(predicted_labels.tolist())
            all_true_labels.extend(labels.tolist())

    return all_predictions, all_true_labels, total_loss / len(data_loader)

In [36]:
## Save the full model object (weights after the final epoch):
torch.save(model, 'sadjoymodel.pth')
## Restore the weights with the best validation loss, saved during training:
model.load_state_dict(torch.load('sadjoy_model.pt'))

<All keys matched successfully>

In [37]:
## Save the full model object for the alternative optima settings (weights after the final epoch):
torch.save(model_optima_sadjoy, 'sadjoymodel_plus_optima.pth')
## Restore the optima model weights with the best validation loss, saved during training:
model_optima_sadjoy.load_state_dict(torch.load('sadjoy_model_plus_optima.pt'))

<All keys matched successfully>

In [35]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model, test_loader, criterion, device)

In [39]:
### All predictions, labels and testing loss values

### Alternative for optima sadjoy:
all_preds_alt, all_labels_alt, test_loss_alt = evaluate_model(model_optima_sadjoy, test_loader, criterion, device)

In [36]:
#### Now we calculate accuracy and F1 score for the base version of SadJoy model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.671
F1 Score: 0.665


In [40]:
#Alternative for optima sadjoy:
#### Now we calculate accuracy and F1 score for SadJoy with Optima:
accuracy_alt = accuracy_score(all_labels_alt, all_preds_alt)
f1_alt = f1_score(all_labels_alt, all_preds_alt, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy_alt:.3f}")
print(f"F1 Score: {f1_alt:.3f}")

Accuracy: 0.757
F1 Score: 0.736


In [37]:
## First set of experiments:
## Hyperparameter changes: increase in batch size and filter increase
## Batch size to work with: GS of 32, 64 and 128:
### Batch size INCREASED from 32 to 64:
batch_size = 64

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)

In [38]:
# Main hyperparameters:
## We adjust the number of filters; the other settings match the base model
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
## Number of filters INCREASED from 110 to 210:
n_filters = 210
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
dropout=0.1
#dropout=0.24691844248854944
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [39]:
## Run the main CNN model builder:
model_alt_one = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_alt_one.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [40]:
#model.embedding.weight.data.copy_(torch.tensor(embedding_matrix))
model_alt_one.embedding.weight.requires_grad = False  # Freeze the embedding layer

In [41]:
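# Number of epochs at 20
n_epochs = 20
best_valid_loss = float('inf')

# Bind an optimizer to model_alt_one. No optimizer was created for this model,
# so `optimizer` still pointed at another model's parameters and model_alt_one
# was never updated. The constant validation loss of 0.603 in the recorded
# output below is consistent with that and predates this fix.
optimizer = torch.optim.Adam(model_alt_one.parameters(), lr=0.01)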

for epoch in range(n_epochs):
    train_loss = train(model_alt_one, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_alt_one, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_alt_one.state_dict(), 'sadjoy_model_alt_1.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 0.642
	 Val. Loss: 0.603
Epoch: 02
	Train Loss: 0.640
	 Val. Loss: 0.603
Epoch: 03
	Train Loss: 0.645
	 Val. Loss: 0.603
Epoch: 04
	Train Loss: 0.643
	 Val. Loss: 0.603
Epoch: 05
	Train Loss: 0.644
	 Val. Loss: 0.603
Epoch: 06
	Train Loss: 0.640
	 Val. Loss: 0.603
Epoch: 07
	Train Loss: 0.643
	 Val. Loss: 0.603
Epoch: 08
	Train Loss: 0.642
	 Val. Loss: 0.603
Epoch: 09
	Train Loss: 0.640
	 Val. Loss: 0.603
Epoch: 10
	Train Loss: 0.638
	 Val. Loss: 0.603
Epoch: 11
	Train Loss: 0.642
	 Val. Loss: 0.603
Epoch: 12
	Train Loss: 0.637
	 Val. Loss: 0.603
Epoch: 13
	Train Loss: 0.641
	 Val. Loss: 0.603
Epoch: 14
	Train Loss: 0.640
	 Val. Loss: 0.603
Epoch: 15
	Train Loss: 0.640
	 Val. Loss: 0.603
Epoch: 16
	Train Loss: 0.644
	 Val. Loss: 0.603
Epoch: 17
	Train Loss: 0.644
	 Val. Loss: 0.603
Epoch: 18
	Train Loss: 0.639
	 Val. Loss: 0.603
Epoch: 19
	Train Loss: 0.639
	 Val. Loss: 0.603
Epoch: 20
	Train Loss: 0.637
	 Val. Loss: 0.603


In [42]:
## Save the full model object (weights after the final epoch):
torch.save(model_alt_one, 'sadjoymodel_alt_one.pth')
## Restore the weights of the first experiment with the best validation loss:
model_alt_one.load_state_dict(torch.load('sadjoy_model_alt_1.pt'))

<All keys matched successfully>

In [43]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model_alt_one, test_loader, criterion, device)

In [44]:
#### Now we calculate accuracy and F1 score for the first alternative experiment of SadJoy model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.609
F1 Score: 0.384


In [45]:
#### Second set of experiments:
### Learning rate increased and epoch number increased:

## First we restore batch size to GS 32:
## Batch size to work with: GS of 32, 64 and 128:
batch_size = 32

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)

In [46]:
# Main hyperparameters:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
n_filters = 110
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
dropout=0.1
#dropout=0.24691844248854944
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [47]:
## Run the main CNN model builder again for model alternative 2:
model_alt_two = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_alt_two.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [48]:
# Loss and optimizer
## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#lr=0.000682446937962706
### Learning rate is increased from 0.01 to 0.05
optimizer = torch.optim.Adam(model_alt_two.parameters(), lr=0.05)

In [49]:
# Number of epochs INCREASED to 30:
n_epochs = 30
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model_alt_two, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_alt_two, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_alt_two.state_dict(), 'sadjoy_model_alt2.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 33.231
	 Val. Loss: 33.681
Epoch: 02
	Train Loss: 33.597
	 Val. Loss: 33.681
Epoch: 03
	Train Loss: 33.617
	 Val. Loss: 33.681
Epoch: 04
	Train Loss: 33.604
	 Val. Loss: 33.681
Epoch: 05
	Train Loss: 33.590
	 Val. Loss: 33.681
Epoch: 06
	Train Loss: 33.563
	 Val. Loss: 33.681
Epoch: 07
	Train Loss: 33.577
	 Val. Loss: 33.681
Epoch: 08
	Train Loss: 33.611
	 Val. Loss: 33.681
Epoch: 09
	Train Loss: 33.584
	 Val. Loss: 33.681
Epoch: 10
	Train Loss: 33.570
	 Val. Loss: 33.681
Epoch: 11
	Train Loss: 33.577
	 Val. Loss: 33.681
Epoch: 12
	Train Loss: 33.557
	 Val. Loss: 33.681
Epoch: 13
	Train Loss: 33.617
	 Val. Loss: 33.681
Epoch: 14
	Train Loss: 33.590
	 Val. Loss: 33.681
Epoch: 15
	Train Loss: 33.584
	 Val. Loss: 33.681
Epoch: 16
	Train Loss: 33.584
	 Val. Loss: 33.681
Epoch: 17
	Train Loss: 33.590
	 Val. Loss: 33.681
Epoch: 18
	Train Loss: 33.550
	 Val. Loss: 33.681
Epoch: 19
	Train Loss: 33.611
	 Val. Loss: 33.681
Epoch: 20
	Train Loss: 33.604
	 Val. Loss: 33.681
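
A loss that sits near 33.6 is far above what binary cross-entropy normally produces. One plausible reading, which the notebook does not verify, is that the high learning rate pushed the sigmoid outputs into saturation. PyTorch documents that `nn.BCELoss` clamps its log terms to be at least -100, so a fully confident wrong prediction costs 100 and a fully confident correct one costs 0. The sketch below reproduces that arithmetic in pure Python on a made-up batch; the helper names are hypothetical.

```python
import math

LOG_CLAMP = -100.0  # lower bound that PyTorch's BCELoss documents for its log terms


def clamped_log(value: float) -> float:
    """Natural log with the lower bound applied; log(0) maps to the bound."""
    return max(math.log(value), LOG_CLAMP) if value > 0.0 else LOG_CLAMP


def clamped_bce(prob: float, target: float) -> float:
    """Binary cross-entropy for one example, using the clamped log."""
    return -(target * clamped_log(prob) + (1.0 - target) * clamped_log(1.0 - prob))


# Hypothetical batch: a saturated model outputs 0.0 everywhere and one of the
# three labels is 1, so one example is confidently wrong.
probs = [0.0, 0.0, 0.0]
targets = [1.0, 0.0, 0.0]
mean_loss = sum(clamped_bce(p, t) for p, t in zip(probs, targets)) / len(probs)
print(mean_loss)
```

Under this reading, a mean loss of about 33 would correspond to roughly a third of the examples being confidently misclassified, and gradients through a saturated sigmoid are close to zero, which would also explain why the loss barely moves.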


In [50]:
## Save the full model object (weights after the final epoch):
torch.save(model_alt_two, 'sadjoymodel_alt_two.pth')
## Restore the weights of the second experiment with the best validation loss:
model_alt_two.load_state_dict(torch.load('sadjoy_model_alt2.pt'))

<All keys matched successfully>

In [51]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model_alt_two, test_loader, criterion, device)

In [52]:
#### Now we calculate accuracy and F1 score for the second experiment set of the SadJoy model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.609
F1 Score: 0.379
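
An accuracy near 0.61 paired with a much lower macro F1 is the pattern that appears when almost every prediction falls into a single class. The sketch below is one possible reading, not a check against the actual predictions. It uses plain Python rather than scikit-learn, and the 916-row test set with a 558/358 class split is a hypothetical chosen for illustration, not a value taken from the notebook.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical class split (assumption, not from the notebook).
labels = [0] * 558 + [1] * 358
preds = [0] * len(labels)  # a model that always predicts class 0

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
f1 = macro_f1(labels, preds)
print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {f1:.3f}")
```

If the real predictions look like this, a confusion matrix or a count of predicted classes would confirm it quickly.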


In [53]:
#### Third (last) set of experiments:
### Different optimizer and dropout increased:
## Batch size to work with (grid search over 32, 64 and 128):
batch_size = 32

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)


In [54]:
# Main hyperparameters for experiment 3:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
n_filters = 110
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
#### Dropout INCREASED from 0.1 to 0.2
dropout=0.2
#dropout=0.24691844248854944
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [55]:
## Run the main CNN model builder again for model alternative 3:
model_alt_three = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_alt_three.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [56]:
# Loss and optimizer
## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#lr=0.000682446937962706
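#### Optimizer changed to SGD
## The optimizer must be given the parameters of the model being trained (model_alt_three)
optimizer = torch.optim.SGD(model_alt_three.parameters(), lr=0.01)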

In [57]:
# Number of epochs back to 20:
n_epochs = 20
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model_alt_three, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_alt_three, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_alt_three.state_dict(), 'sadjoy_model_alt3.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 0.643
	 Val. Loss: 0.623
Epoch: 02
	Train Loss: 0.647
	 Val. Loss: 0.623
Epoch: 03
	Train Loss: 0.643
	 Val. Loss: 0.623
Epoch: 04
	Train Loss: 0.644
	 Val. Loss: 0.623
Epoch: 05
	Train Loss: 0.649
	 Val. Loss: 0.623
Epoch: 06
	Train Loss: 0.645
	 Val. Loss: 0.623
Epoch: 07
	Train Loss: 0.639
	 Val. Loss: 0.623
Epoch: 08
	Train Loss: 0.642
	 Val. Loss: 0.623
Epoch: 09
	Train Loss: 0.644
	 Val. Loss: 0.623
Epoch: 10
	Train Loss: 0.642
	 Val. Loss: 0.623
Epoch: 11
	Train Loss: 0.643
	 Val. Loss: 0.623
Epoch: 12
	Train Loss: 0.644
	 Val. Loss: 0.623
Epoch: 13
	Train Loss: 0.640
	 Val. Loss: 0.623
Epoch: 14
	Train Loss: 0.635
	 Val. Loss: 0.623
Epoch: 15
	Train Loss: 0.642
	 Val. Loss: 0.623
Epoch: 16
	Train Loss: 0.641
	 Val. Loss: 0.623
Epoch: 17
	Train Loss: 0.642
	 Val. Loss: 0.623
Epoch: 18
	Train Loss: 0.642
	 Val. Loss: 0.623
Epoch: 19
	Train Loss: 0.641
	 Val. Loss: 0.623
Epoch: 20
	Train Loss: 0.644
	 Val. Loss: 0.623


In [58]:
## Save the model:
torch.save(model_alt_three, 'sadjoymodel_alt_three.pth')
## Reload the best checkpoint (lowest validation loss) saved during training:
model_alt_three.load_state_dict(torch.load('sadjoy_model_alt3.pt'))

<All keys matched successfully>

In [59]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model_alt_three, test_loader, criterion, device)

In [60]:
#### Now we calculate accuracy and F1 score for the third (last) experiment set of the SadJoy model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.610
F1 Score: 0.387


In [54]:
### Use Optuna to search for the best hyperparameter configuration:
## Applied only to the baseline sadness/joy dataset, starting from the untuned configuration:
import optuna

def objective(trial):
    # 1. Define range of hyperparameters:
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)
    dropout = trial.suggest_float('dropout', 0.1, 0.7)
    n_filters = trial.suggest_int('n_filters', 50, 150, step=10)
    filter_sizes = trial.suggest_categorical('filter_sizes', [[2,3,4], [3,4,5], [4,5,6]])
    emb_dim = trial.suggest_categorical('embedding_dim', [100, 200, 300])
    
    # Regularization
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True)
    
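    # Training specifics
    # NOTE: batch_size is sampled but not applied below; train() and evaluate()
    # still use the global train_loader / val_loader.
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])
    optimizer_type = trial.suggest_categorical('optimizer_type', ['Adam', 'SGD', 'RMSProp'])
    
    # Model specifics
    # NOTE: activation_function and pooling_strategy are sampled but never passed
    # to TextCNN, so they do not affect the trained model.
    activation_function = trial.suggest_categorical('activation_function', ['ReLU', 'LeakyReLU', 'ELU'])
    pooling_strategy = trial.suggest_categorical('pooling_strategy', ['max', 'avg'])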
    
    # 2. Create and train model with these hyperparameters:
    model = TextCNN(vocab_size, emb_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
    # Initialize the appropriate optimizer
    if optimizer_type == 'Adam':
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    elif optimizer_type == 'SGD':
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    elif optimizer_type == 'RMSProp':
        optimizer = torch.optim.RMSprop(model.parameters(), lr=lr, weight_decay=weight_decay)
    else:
        raise ValueError("Unsupported optimizer type")

    criterion = nn.BCELoss().to(device)
    
    best_valid_loss = float('inf')
    for epoch in range(n_epochs):
        train_loss = train(model, train_loader, optimizer, criterion, device)
        valid_loss = evaluate(model, val_loader, criterion, device)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            
    # 3. Return validation loss for this set of hyperparameters:
    return best_valid_loss

# Use Optuna to find best hyperparameters:
study = optuna.create_study(direction='minimize')  # We want to minimize the validation loss
study.optimize(objective, n_trials=100)  # Number of trials can be adjusted based on computational resources

# Get the best hyperparameters:
best_params = study.best_params
best_loss = study.best_value
print(f"Best hyperparameters:\n{best_params}\nValidation Loss with best hyperparameters: {best_loss}")


[I 2023-11-09 10:00:11,743] A new study created in memory with name: no-name-427c8648-16e2-4e4b-a4ed-c51135f55009
[I 2023-11-09 10:00:33,210] Trial 0 finished with value: 0.5087175882524915 and parameters: {'lr': 0.0008722720369465334, 'dropout': 0.5319940062916195, 'n_filters': 100, 'filter_sizes': [3, 4, 5], 'embedding_dim': 100, 'weight_decay': 0.00028948772875134015, 'batch_size': 64, 'optimizer_type': 'RMSProp', 'activation_function': 'ELU', 'pooling_strategy': 'max'}. Best is trial 0 with value: 0.5087175882524915.
[I 2023-11-09 10:01:20,747] Trial 1 finished with value: 0.6324671573109097 and parameters: {'lr': 0.0005562504815944998, 'dropout': 0.577547835208084, 'n_filters': 120, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.0025162111599918873, 'batch_size': 64, 'optimizer_type': 'SGD', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'max'}. Best is trial 0 with value: 0.5087175882524915.
[I 2023-11-09 10:01:53,498] Trial 2 finished with value: 0.5

[I 2023-11-09 10:19:20,609] Trial 20 finished with value: 0.4915858167741034 and parameters: {'lr': 0.0001950511965164039, 'dropout': 0.3075190348382899, 'n_filters': 50, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.009179754645275687, 'batch_size': 128, 'optimizer_type': 'Adam', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'avg'}. Best is trial 19 with value: 0.46767764869663453.
[I 2023-11-09 10:20:30,452] Trial 21 finished with value: 0.5322356099883715 and parameters: {'lr': 0.00017171977289842796, 'dropout': 0.31937351062925834, 'n_filters': 50, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.0073441153634759065, 'batch_size': 128, 'optimizer_type': 'Adam', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'avg'}. Best is trial 19 with value: 0.46767764869663453.
[I 2023-11-09 10:21:41,485] Trial 22 finished with value: 0.5036296836204 and parameters: {'lr': 0.00029417770805633824, 'dropout': 0.30625406810096645, 'n_filters

[I 2023-11-09 10:48:43,811] Trial 40 finished with value: 0.6382001539071401 and parameters: {'lr': 0.00038846710068263083, 'dropout': 0.34834187483448265, 'n_filters': 120, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.004262904365149423, 'batch_size': 128, 'optimizer_type': 'SGD', 'activation_function': 'ReLU', 'pooling_strategy': 'avg'}. Best is trial 35 with value: 0.4274591205434667.
[I 2023-11-09 10:50:58,650] Trial 41 finished with value: 0.4797075374258889 and parameters: {'lr': 0.0005555784757515292, 'dropout': 0.2468881273955814, 'n_filters': 130, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.0036613802736826235, 'batch_size': 128, 'optimizer_type': 'Adam', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'avg'}. Best is trial 35 with value: 0.4274591205434667.
[I 2023-11-09 10:52:50,640] Trial 42 finished with value: 0.48701101707087624 and parameters: {'lr': 0.00031246926833652705, 'dropout': 0.2753183024941234, 'n_filters': 

[I 2023-11-09 11:45:00,244] Trial 60 finished with value: 0.48126839473843575 and parameters: {'lr': 0.00023329379580767378, 'dropout': 0.18325370957764972, 'n_filters': 60, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.00814962766646767, 'batch_size': 128, 'optimizer_type': 'Adam', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'avg'}. Best is trial 35 with value: 0.4274591205434667.
[I 2023-11-09 11:46:14,342] Trial 61 finished with value: 0.46948254315389526 and parameters: {'lr': 0.0005305044063785922, 'dropout': 0.39105930125708177, 'n_filters': 90, 'filter_sizes': [2, 3, 4], 'embedding_dim': 200, 'weight_decay': 0.0016618229866880727, 'batch_size': 128, 'optimizer_type': 'Adam', 'activation_function': 'LeakyReLU', 'pooling_strategy': 'avg'}. Best is trial 35 with value: 0.4274591205434667.
[I 2023-11-09 11:47:23,539] Trial 62 finished with value: 0.49015243392851615 and parameters: {'lr': 0.0005007080902041266, 'dropout': 0.4199244563605796, 'n_filte

[I 2023-11-09 12:31:13,716] Trial 80 finished with value: 0.556060796810521 and parameters: {'lr': 0.0006573890120055478, 'dropout': 0.24407069570510326, 'n_filters': 120, 'filter_sizes': [2, 3, 4], 'embedding_dim': 300, 'weight_decay': 0.0024868335705130486, 'batch_size': 32, 'optimizer_type': 'RMSProp', 'activation_function': 'ELU', 'pooling_strategy': 'avg'}. Best is trial 71 with value: 0.4168363617112239.
[I 2023-11-09 12:33:59,036] Trial 81 finished with value: 0.46203165128827095 and parameters: {'lr': 0.0008507936062455343, 'dropout': 0.30071679954065467, 'n_filters': 130, 'filter_sizes': [2, 3, 4], 'embedding_dim': 300, 'weight_decay': 0.005261700976500337, 'batch_size': 32, 'optimizer_type': 'Adam', 'activation_function': 'ELU', 'pooling_strategy': 'avg'}. Best is trial 71 with value: 0.4168363617112239.
[I 2023-11-09 12:36:50,774] Trial 82 finished with value: 0.4709513398508231 and parameters: {'lr': 0.0007822174891742368, 'dropout': 0.2759241682580842, 'n_filters': 140, 'f

Best hyperparameters:
{'lr': 0.0007809671359222956, 'dropout': 0.30216765316720884, 'n_filters': 130, 'filter_sizes': [2, 3, 4], 'embedding_dim': 300, 'weight_decay': 0.003736652093157015, 'batch_size': 32, 'optimizer_type': 'Adam', 'activation_function': 'ELU', 'pooling_strategy': 'avg'}
Validation Loss with best hyperparameters: 0.4168363617112239


<b>Best combination after 100 Optuna trials: trial 71 is the optimal configuration.
Best hyperparameters:
{'lr': 0.0007809671359222956, 'dropout': 0.30216765316720884, 'n_filters': 130, 'filter_sizes': [2, 3, 4], 'embedding_dim': 300, 'weight_decay': 0.003736652093157015, 'batch_size': 32, 'optimizer_type': 'Adam', 'activation_function': 'ELU', 'pooling_strategy': 'avg'}
Validation loss with best hyperparameters: 0.4168363617112239</b>
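
The learning rate and weight decay in the search above are drawn with `log=True`, which means the range is sampled in the log domain rather than linearly. The sketch below illustrates only that idea with the standard library. It is not Optuna's implementation (its default sampler also adapts to earlier trials), and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# In the log domain the midpoint of [1e-5, 1e-3] is 1e-4, so roughly half of
# the draws should fall below it. A linear draw would put only about 9% there.
below_mid = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of draws below 1e-4: {below_mid:.2f}")
```

This is why a log scale is the usual choice for parameters such as `lr` that span several orders of magnitude.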

In [1]:
#### Main Task 3 - Alternative dataset
## This time with sadness and optimism:
#Import the main tools for the task:
import os
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch

In [2]:
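# create a dictionary of the mapping between numbers and labels
# NOTE: relative to the first mapping (joy=1, optimism=2), the names of labels 1 and 2 are swapped here.
# This only renames the label numbers, so rows with label_number 1 are now called "optimism";
# check the dataset's own label mapping before interpreting the results.
mappings = {"anger": 3, "joy": 2, "optimism": 1, "sadness": 0}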

def load_data(mapping_dictionary:dict , tweet_file_path: str, label_file_path:str)-> pd.DataFrame:
    """
    function to load both the tweets and the labels, combine them together as pandas dataframe
    """
    with open(tweet_file_path, 'r', encoding="utf-8") as file:
        tweets = [line.strip() for line in file.readlines()]

    with open(label_file_path, 'r', encoding="utf-8") as file:
        label_numbers = [int(line.strip()) for line in file.readlines()]

    label_texts = [next((key for key, value in mapping_dictionary.items() if value == label_number), None) for label_number in label_numbers]
    
    df = pd.DataFrame({
        'text': tweets,
        'label_text': label_texts,
        'label_number': label_numbers
    })
    
    return df

# make sure the files in the same directory as the notebook
train_data = load_data(mappings, "train_text.txt", "train_labels.txt")
validation_data = load_data(mappings, "val_text.txt", "val_labels.txt")
test_data = load_data(mappings, "test_text.txt", "test_labels.txt")

# print the head of one of them
print(train_data.head())

# Print the length of each data
print(f"Length of train_data: {len(train_data)}")
print(f"Length of validation_data: {len(validation_data)}")
print(f"Length of test_data: {len(test_data)}")

                                                text label_text  label_number
0  “Worry is a down payment on a problem you may ...        joy             2
1  My roommate: it's okay that we can't spell bec...    sadness             0
2  No but that's so cute. Atsu was probably shy a...   optimism             1
3  Rooneys fucking untouchable isn't he? Been fuc...    sadness             0
4  it's pretty depressing when u hit pan on ur fa...      anger             3
Length of train_data: 3257
Length of validation_data: 374
Length of test_data: 1421


In [4]:
### Replace the emotion of interest
# Start again from the full dataset (training, validation and test splits),
## then filter it down to the two emotions of interest:
# This time sadness (0) and optimism (1)
sadoptlist = ['sadness','optimism']
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
train_data = train_data[train_data['label_text'].isin(sadoptlist)].copy()

# Only sadness and optimism remain:
print(train_data.head())
print(type(train_data))
## Count of rows with only sadness and optimism:
print(len(train_data))

                                                text label_text  label_number
1  My roommate: it's okay that we can't spell bec...    sadness             0
2  No but that's so cute. Atsu was probably shy a...   optimism             1
3  Rooneys fucking untouchable isn't he? Been fuc...    sadness             0
5  @user but your pussy was weak from what I hear...    sadness             0
7  Tiller and breezy should do a collab album. Ra...   optimism             1
<class 'pandas.core.frame.DataFrame'>
2108


In [5]:
## Now we do the same for the other 2 sets for validation and testing:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
validation_data = validation_data[validation_data['label_text'].isin(sadoptlist)].copy()
test_data = test_data[test_data['label_text'].isin(sadoptlist)].copy()

In [6]:
#Validation set got only the sadness and optimism!
print(validation_data.head())
print(type(validation_data))
# Count validation:
print(len(validation_data))

                                                text label_text  label_number
0  @user @user Oh, hidden revenge and anger...I r...    sadness             0
1  if not then #teamchristine bc all tana has don...    sadness             0
2  Hey @user #Fields in #skibbereen give your onl...    sadness             0
3  Why have #Emmerdale had to rob #robron of havi...    sadness             0
4  @user I would like to hear a podcast of you go...    sadness             0
<class 'pandas.core.frame.DataFrame'>
257


In [7]:
#Test set got only the sadness and optimism!
print(test_data.head())
print(type(test_data))
#count:
print(len(test_data))

                                                text label_text  label_number
1  @user Interesting choice of words... Are you c...    sadness             0
3  @user Welcome to #MPSVT! We are delighted to h...   optimism             1
4                       What makes you feel #joyful?   optimism             1
5                                    i am revolting.    sadness             0
9  @user Get Donovan out of your soccer booth. He...    sadness             0
<class 'pandas.core.frame.DataFrame'>
916


In [9]:
# Function to tokenize the strings in text column:
def tokenize_sentence(sentence: str) -> list:
    """
    Tokenizes a sentence using nltk's word_tokenize method.

    Args:
    - sentence (str): The sentence to tokenize.

    Returns:
    - List of tokens.
    """
    return word_tokenize(sentence)

In [10]:
## Now we apply the tokenizer to the new subset and the validation/testing kit:
train_data['tokenized_text'] = train_data['text'].apply(tokenize_sentence)
validation_data['tokenized_text'] = validation_data['text'].apply(tokenize_sentence)
test_data['tokenized_text'] = test_data['text'].apply(tokenize_sentence)

In [11]:
# now the data has a new column "tokenized_text" which is a list of tokens
#print(train_data.head())
print(train_data['tokenized_text'].head())

1    [My, roommate, :, it, 's, okay, that, we, ca, ...
2    [No, but, that, 's, so, cute, ., Atsu, was, pr...
3    [Rooneys, fucking, untouchable, is, n't, he, ?...
5    [@, user, but, your, pussy, was, weak, from, w...
7    [Tiller, and, breezy, should, do, a, collab, a...
Name: tokenized_text, dtype: object


In [12]:
# Build a set of all unique tokens in the training data
vocab_set = set()
for tokens in train_data['tokenized_text']:
    vocab_set.update(tokens)

# Convert the set to a list to index tokens
vocab_list = list(vocab_set)

print(vocab_list[:4])

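# Create a word to index mapping
# NOTE: indices start at 0, which is also the padding index (pad_idx) used later,
# so one real token shares index 0 with padding.
word_to_index = {word: index for index, word in enumerate(vocab_list)}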

['Maybe', 'loyal', '`', 'appearance']


In [13]:
# Add an OOV token and its index to the vocabulary, because the validation/test sets may contain tokens that never appear in training
OOV_TOKEN = "<OOV>"
if OOV_TOKEN not in word_to_index:
    word_to_index[OOV_TOKEN] = len(word_to_index)

In [14]:
## Main function for conversion of tokens from previous step to index numbers:
def tokens_to_numbers(tokens: list, word_to_index: dict) -> list:
    """
    Converts a list of tokens to their corresponding indices using a word-to-index mapping.
    Returns the index of OOV_TOKEN for out-of-vocabulary words.
    """
    return [word_to_index.get(token, word_to_index[OOV_TOKEN]) for token in tokens]
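
The vocabulary steps above can also be written so that index 0 is reserved for padding and the token order is reproducible between runs. The following is a minimal sketch under those assumptions, not the notebook's own method. `PAD_TOKEN`, `build_vocab` and `encode` are hypothetical names introduced here.

```python
PAD_TOKEN = "<PAD>"  # hypothetical name; index 0 is reserved for it
OOV_TOKEN = "<OOV>"

def build_vocab(tokenized_texts):
    """Map tokens to indices, keeping 0 for padding and 1 for unknown tokens."""
    word_to_index = {PAD_TOKEN: 0, OOV_TOKEN: 1}
    unique_tokens = {tok for tokens in tokenized_texts for tok in tokens}
    # Sorting makes the mapping reproducible; iteration order of a set of
    # strings can differ from one Python process to the next.
    for token in sorted(unique_tokens):
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    return word_to_index

def encode(tokens, word_to_index, max_length):
    """Convert tokens to indices, then truncate or right-pad to max_length."""
    ids = [word_to_index.get(tok, word_to_index[OOV_TOKEN]) for tok in tokens]
    ids = ids[:max_length]
    return ids + [word_to_index[PAD_TOKEN]] * (max_length - len(ids))

train_tokens = [["so", "cute", "."], ["not", "so", "cute"]]
vocab = build_vocab(train_tokens)
print(vocab)
print(encode(["so", "very", "cute"], vocab, max_length=5))
```

With this layout the embedding size is simply `len(vocab)`, and no real token shares an index with padding.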

In [15]:
## Apply the function for token to index conversion:
train_data['numeric_tokens'] = train_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))
validation_data['numeric_tokens'] = validation_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))
test_data['numeric_tokens'] = test_data['tokenized_text'].apply(lambda x: tokens_to_numbers(x, word_to_index))

In [16]:
# Now the df has a new column "numeric_tokens"
#print(train_data.head())
print(train_data['numeric_tokens'])

## Seems alright

1       [860, 3304, 7656, 6035, 5995, 3419, 6988, 2848...
2       [4295, 236, 6988, 5995, 7359, 6378, 3328, 4109...
3       [7509, 245, 2603, 4774, 2729, 3230, 3124, 6401...
5       [4129, 3368, 236, 4350, 1576, 3137, 3387, 2471...
7       [6409, 579, 6920, 962, 7497, 2934, 5047, 3347,...
                              ...                        
3250    [4129, 3368, 4262, 1675, 6323, 7416, 4506, 719...
3251    [4129, 3368, 6210, 5103, 4069, 3322, 2770, 312...
3254    [4129, 3368, 4129, 3368, 4129, 3368, 4129, 336...
3255    [2736, 531, 2934, 7416, 699, 3124, 2983, 5379,...
3256    [4129, 3368, 4129, 3368, 7592, 1129, 1172, 349...
Name: numeric_tokens, Length: 2108, dtype: object


In [17]:
# Tweets are conveniently short, so the longest training tweet (in tokens)
# is used as the maximum sequence length:
MAX_SEQUENCE_LENGTH = max(train_data['numeric_tokens'].apply(len))
print('The Maximum Length is:', MAX_SEQUENCE_LENGTH)

The Maximum Length is: 48


In [18]:
## Padding is required for the dataset:
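def pad_sequence(numeric_tokens: list, max_length: int) -> list:
    """
    Pads or truncates a sequence to a given length. Shorter sequences are padded
    with zeros at the end; longer ones (possible in validation/test data, since
    max_length comes from the training set) are cut to max_length.
    """
    truncated = numeric_tokens[:max_length]
    return truncated + [0]*(max_length - len(truncated))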

In [19]:
## Apply the padding:
train_data['padded_tokens'] = train_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))
validation_data['padded_tokens'] = validation_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))
test_data['padded_tokens'] = test_data['numeric_tokens'].apply(lambda x: pad_sequence(x, MAX_SEQUENCE_LENGTH))

In [20]:
# Now the df has a new column "padded_tokens". 
# Usually we shouldn't load all that to memory but once again, Tweets are small, so it is ok:
#print(train_data.head())
print(train_data['padded_tokens'])

## The first entries look the same as before because the padding zeros are appended at the end

1       [860, 3304, 7656, 6035, 5995, 3419, 6988, 2848...
2       [4295, 236, 6988, 5995, 7359, 6378, 3328, 4109...
3       [7509, 245, 2603, 4774, 2729, 3230, 3124, 6401...
5       [4129, 3368, 236, 4350, 1576, 3137, 3387, 2471...
7       [6409, 579, 6920, 962, 7497, 2934, 5047, 3347,...
                              ...                        
3250    [4129, 3368, 4262, 1675, 6323, 7416, 4506, 719...
3251    [4129, 3368, 6210, 5103, 4069, 3322, 2770, 312...
3254    [4129, 3368, 4129, 3368, 4129, 3368, 4129, 336...
3255    [2736, 531, 2934, 7416, 699, 3124, 2983, 5379,...
3256    [4129, 3368, 4129, 3368, 7592, 1129, 1172, 349...
Name: padded_tokens, Length: 2108, dtype: object


In [24]:
# Convert the padded tokens and labels to lists and wrap them in TextDataset:
train_dataset = TextDataset(train_data['padded_tokens'].tolist(), train_data['label_number'].tolist())
val_dataset = TextDataset(validation_data['padded_tokens'].tolist(), validation_data['label_number'].tolist())
test_dataset = TextDataset(test_data['padded_tokens'].tolist(), test_data['label_number'].tolist())

In [25]:
## Batch size to work with (grid search over 32, 64 and 128):
batch_size = 32

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)
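
The batch size determines how many optimizer steps make up one epoch. As a small worked example, the sketch below counts batches for the 2108 training rows at each candidate batch size. It assumes no batch is dropped, which matches `DataLoader`'s default of `drop_last=False`. `batches_per_epoch` is a hypothetical helper name.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Return (number of batches, size of the last batch) when none is dropped."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

for bs in (32, 64, 128):
    n_batches, last_batch = batches_per_epoch(2108, bs)
    print(f"batch_size={bs}: {n_batches} batches, last batch has {last_batch} samples")
```

Smaller batches mean more parameter updates per epoch, which is one reason the same learning rate can behave differently across batch sizes.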

In [28]:
# Main hyperparameters: Using the "best" configuration from sadjoy set:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300  
n_filters = 110
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
dropout=0.1
#dropout=0.24691844248854944
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [29]:
## Run the main CNN model builder:
model_sadopt = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_sadopt.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [30]:
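#model.embedding.weight.data.copy_(torch.tensor(embedding_matrix))
# NOTE: the pretrained-embedding copy above is commented out, so this freezes randomly initialized embeddings
model_sadopt.embedding.weight.requires_grad = False  # Freeze the embedding layer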

In [32]:
# Loss and optimizer
## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#lr=0.000682446937962706
optimizer = torch.optim.Adam(model_sadopt.parameters(), lr=0.01)
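
For reference when reading the loss values below, binary cross-entropy can be computed by hand. `nn.BCELoss` expects probabilities in [0, 1] (for example sigmoid outputs) and 0/1 targets. The sketch is a plain-Python illustration of the formula, not PyTorch's implementation, and the `eps` clamp is an assumption made here to keep the logarithm finite.

```python
import math

def bce(probabilities, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

targets = [0, 1, 1, 0]
uninformative = bce([0.5, 0.5, 0.5, 0.5], targets)    # equals ln(2)
confident_right = bce([0.1, 0.9, 0.8, 0.2], targets)
confident_wrong = bce([0.9, 0.1, 0.2, 0.8], targets)
print(f"{uninformative:.3f} {confident_right:.3f} {confident_wrong:.3f}")
```

A loss near ln(2), about 0.693, means the model is doing no better than predicting 0.5 for everything. Values far above it point to confident wrong predictions.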

In [36]:
# Number of epochs at 20
##Now we are performing the main training for sadopt
n_epochs = 20
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model_sadopt, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_sadopt, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_sadopt.state_dict(), 'sadopt_model.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 0.298
	 Val. Loss: 1.684
Epoch: 02
	Train Loss: 0.266
	 Val. Loss: 3.905
Epoch: 03
	Train Loss: 0.337
	 Val. Loss: 3.905
Epoch: 04
	Train Loss: 0.314
	 Val. Loss: 4.507
Epoch: 05
	Train Loss: 0.485
	 Val. Loss: 3.888
Epoch: 06
	Train Loss: 0.770
	 Val. Loss: 3.969
Epoch: 07
	Train Loss: 0.689
	 Val. Loss: 7.267
Epoch: 08
	Train Loss: 1.057
	 Val. Loss: 8.875
Epoch: 09
	Train Loss: 1.196
	 Val. Loss: 27.482
Epoch: 10
	Train Loss: 1.944
	 Val. Loss: 9.102
Epoch: 11
	Train Loss: 1.163
	 Val. Loss: 12.940
Epoch: 12
	Train Loss: 1.122
	 Val. Loss: 11.982
Epoch: 13
	Train Loss: 1.376
	 Val. Loss: 22.436
Epoch: 14
	Train Loss: 2.456
	 Val. Loss: 21.033
Epoch: 15
	Train Loss: 2.408
	 Val. Loss: 28.721
Epoch: 16
	Train Loss: 2.243
	 Val. Loss: 17.180
Epoch: 17
	Train Loss: 3.168
	 Val. Loss: 18.607
Epoch: 18
	Train Loss: 2.116
	 Val. Loss: 16.897
Epoch: 19
	Train Loss: 4.014
	 Val. Loss: 16.611
Epoch: 20
	Train Loss: 2.498
	 Val. Loss: 20.074


In [39]:
## Save the model:
torch.save(model_sadopt, 'sadoptmodel.pth')
## Reload the best sadopt checkpoint (lowest validation loss) saved during training:
model_sadopt.load_state_dict(torch.load('sadopt_model.pt'))

<All keys matched successfully>

In [40]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model_sadopt, test_loader, criterion, device)

In [41]:
#### Now we calculate accuracy and F1 score for the base version of Sadopt model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.684
F1 Score: 0.678


In [42]:
#### Now we use the hyperparameters from sadjoy ###
## The configuration found with Optuna is used here for sadness-optimism:
## Best Batch size = 32
batch_size = 32

#Loaded and ready:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)

In [43]:
# Main hyperparameters: Using the best Optuna configuration from the sadjoy set:
vocab_size = len(word_to_index) + 1  # +1 for padding 
print(vocab_size)
embedding_dim = 300
#Filters increased to 130:
n_filters = 130
filter_sizes = [2,3,4]
#output_dim = len(train_data['label_number'].unique())
output_dim = 1
print(output_dim)
#Best dropout is: 0.30216765316720884
dropout=0.30216765316720884
pad_idx = 0  # Assuming 0 is the index for padding

7749
1


In [44]:
## Run the main CNN model builder:
model_sadopt_optima = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx).to(device)
print("Model's embedding layer size:", model_sadopt_optima.embedding.weight.data.size())

Model's embedding layer size: torch.Size([7749, 300])


In [45]:
#model.embedding.weight.data.copy_(torch.tensor(embedding_matrix))
model_sadopt_optima.embedding.weight.requires_grad = False  # Freeze the embedding layer

In [46]:
# Loss and optimizer
## BCE loss is used since it's only two categories
criterion = nn.BCELoss().to(device)
#criterion = nn.CrossEntropyLoss().to(device)
#Learning rate changed to 0.0007809671359222956 as per the best Optuna trial from sadjoy
#(the trial's weight_decay value is not applied here)
optimizer = torch.optim.Adam(model_sadopt_optima.parameters(), lr=0.0007809671359222956)

In [47]:
# Number of epochs at 20
##Now we are performing the main training for sadopt
n_epochs = 20
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model_sadopt_optima, train_loader, optimizer, criterion, device)
    valid_loss = evaluate(model_sadopt_optima, val_loader, criterion, device)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_sadopt_optima.state_dict(), 'sadopt_model_optima.pt')
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

Epoch: 01
	Train Loss: 0.590
	 Val. Loss: 0.598
Epoch: 02
	Train Loss: 0.329
	 Val. Loss: 0.536
Epoch: 03
	Train Loss: 0.182
	 Val. Loss: 0.543
Epoch: 04
	Train Loss: 0.098
	 Val. Loss: 0.554
Epoch: 05
	Train Loss: 0.062
	 Val. Loss: 0.544
Epoch: 06
	Train Loss: 0.038
	 Val. Loss: 0.559
Epoch: 07
	Train Loss: 0.033
	 Val. Loss: 0.608
Epoch: 08
	Train Loss: 0.025
	 Val. Loss: 0.583
Epoch: 09
	Train Loss: 0.025
	 Val. Loss: 0.599
Epoch: 10
	Train Loss: 0.026
	 Val. Loss: 0.606
Epoch: 11
	Train Loss: 0.025
	 Val. Loss: 0.622
Epoch: 12
	Train Loss: 0.024
	 Val. Loss: 0.667
Epoch: 13
	Train Loss: 0.018
	 Val. Loss: 0.606
Epoch: 14
	Train Loss: 0.024
	 Val. Loss: 0.610
Epoch: 15
	Train Loss: 0.011
	 Val. Loss: 0.657
Epoch: 16
	Train Loss: 0.008
	 Val. Loss: 0.702
Epoch: 17
	Train Loss: 0.014
	 Val. Loss: 0.682
Epoch: 18
	Train Loss: 0.010
	 Val. Loss: 0.660
Epoch: 19
	Train Loss: 0.014
	 Val. Loss: 0.697
Epoch: 20
	Train Loss: 0.010
	 Val. Loss: 0.658
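
In the log above, training loss keeps falling while validation loss bottoms out early, which is the usual sign of overfitting. The notebook handles this by keeping the best checkpoint. An alternative is to stop training once validation loss stops improving. The sketch below replays the logged validation losses through a simple patience rule. `early_stopping_epoch` and the patience of 3 are assumptions for illustration only.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` epochs in a row fail to improve on the
    best validation loss seen so far; otherwise it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses copied from the 20 epochs logged above.
val_losses = [0.598, 0.536, 0.543, 0.554, 0.544, 0.559, 0.608, 0.583, 0.599, 0.606,
              0.622, 0.667, 0.606, 0.610, 0.657, 0.702, 0.682, 0.660, 0.697, 0.658]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

The saved checkpoint is the same either way, since the best epoch does not change. Early stopping only saves the compute spent on the later epochs.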


In [48]:
## Save the model:
torch.save(model_sadopt_optima, 'sadoptmodeloptima.pth')
## Reload the best sadopt checkpoint for the Optuna configuration (lowest validation loss):
model_sadopt_optima.load_state_dict(torch.load('sadopt_model_optima.pt'))

<All keys matched successfully>

In [49]:
### All predictions, labels and testing loss values
all_preds, all_labels, test_loss = evaluate_model(model_sadopt_optima, test_loader, criterion, device)

In [50]:
#### Now we calculate accuracy and F1 score for the final Optuna-configuration model:
accuracy = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds, average='macro') # 'macro' calculates metrics for each label and then does an unweighted mean

#precision = precision_score(all_labels, all_preds, average='macro')
#recall = recall_score(all_labels, all_preds, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")

Accuracy: 0.733
F1 Score: 0.687


In [51]:
### ---- Finished ---- ###