In [1]:
import torch

In [64]:
# Check if CUDA is available
if torch.cuda.is_available():
    # Get the number of available CUDA devices
    num_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_devices)

    # Iterate over each CUDA device and print details
    for i in range(num_devices):
        device = torch.device(f"cuda:{i}")
        print(f"CUDA Device {i}: {torch.cuda.get_device_name(i)}")
        print(f"Memory Usage - Allocated: {torch.cuda.memory_allocated(device)} bytes")
        print(f"Memory Usage - Cached: {torch.cuda.memory_reserved(device)} bytes\n")

Number of CUDA devices: 1
CUDA Device 0: NVIDIA GeForce RTX 4060 Laptop GPU
Memory Usage - Allocated: 6834670592 bytes
Memory Usage - Cached: 22779265024 bytes



Explanation: Here we check whether CUDA, the parallel computing platform created by NVIDIA, is available, using the PyTorch function torch.cuda.is_available(). If it is, torch.cuda.device_count() returns the number of CUDA devices. For each device we then print its name, the memory currently allocated to tensors (torch.cuda.memory_allocated()), and the memory reserved by PyTorch's caching allocator (torch.cuda.memory_reserved()), both in bytes.
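The raw byte counts printed above are hard to read at a glance. As a small convenience sketch (the helper name format_bytes is hypothetical and not part of PyTorch), the values can be rendered in binary units:

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["bytes", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(value)} bytes"
            return f"{value:.2f} {unit}"
        value /= 1024


# The two figures reported in the output above
print(format_bytes(6834670592))
print(format_bytes(22779265024))
```

In the cell above, the same helper could wrap the results of torch.cuda.memory_allocated(device) and torch.cuda.memory_reserved(device).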

In [2]:
import pandas as pd
import torch
from torch.optim import AdamW  # torch.optim.AdamW replaces the deprecated transformers.AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, T5Tokenizer, BertForSequenceClassification, T5ForConditionalGeneration
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [4]:
# Load the CSV dataset
df = pd.read_csv('C:/Users/geeth/Downloads/sample (1).csv')

In [5]:
print(df.head(100))

   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   
5   6  B006K2ZZ7K   ADT0SRK1MGOEU                   Twoapennything   
6   7  B006K2ZZ7K  A1SP2KVKFXXRU1                David C. Sullivan   
7   8  B006K2ZZ7K  A3JRGQVEQN31IQ               Pamela G. Williams   
8   9  B000E7L2R4  A1MZYO9TZK0BBI                         R. James   
9  10  B00171APVA  A21BT40VZCCYT4                    Carol A. Reed   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                  

In [7]:
# Check for null values in the Summary column
df['Summary'].isnull().sum()  # returns 0, so there are no null summaries

0

In [9]:
# Remove duplicates: among rows that share the same Score and Summary, keep only the first one
df.drop_duplicates(subset=['Score', 'Summary'], keep='first', inplace=True)
print(df)

   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   
5   6  B006K2ZZ7K   ADT0SRK1MGOEU                   Twoapennything   
6   7  B006K2ZZ7K  A1SP2KVKFXXRU1                David C. Sullivan   
7   8  B006K2ZZ7K  A3JRGQVEQN31IQ               Pamela G. Williams   
8   9  B000E7L2R4  A1MZYO9TZK0BBI                         R. James   
9  10  B00171APVA  A21BT40VZCCYT4                    Carol A. Reed   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                  

In [7]:
# Tokenizers
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Explanation: Here we initialize two pre-trained tokenizers, one for BERT and one for T5. The BERT tokenizer is loaded from the 'bert-base-uncased' model, and the T5 tokenizer is loaded from the 't5-small' model. Tokenizers break the input text into tokens and map them to the integer ids that the models take as input.
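To make "breaking input into tokens" more concrete, here is a toy sketch of the greedy longest-match-first idea behind BERT's WordPiece tokenizer. The tiny vocabulary and the function name wordpiece_tokenize are made up for illustration; the real tokenizer uses the full 'bert-base-uncased' vocabulary and extra normalization steps, and T5 uses a different (SentencePiece-based) tokenizer that is not sketched here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (not the real BERT vocabulary)
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```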

In [45]:
# Model Initialization
from transformers import BertModel

bert_model = BertModel.from_pretrained('bert-base-uncased')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

In [46]:
# Splitting data into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)


In [47]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, bert_tokenizer, t5_tokenizer):
        self.data = dataframe
        self.bert_tokenizer = bert_tokenizer
        self.t5_tokenizer = t5_tokenizer
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        review = self.data.iloc[idx]['Text']
        summary = self.data.iloc[idx]['Summary']
        
        # Tokenize inputs for BERT
        bert_inputs = self.bert_tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=512,  # Set max length to the desired maximum sequence length
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        # Tokenize inputs for T5
        t5_inputs = self.t5_tokenizer(
            "summarize: " + review,
            return_tensors="pt",
            max_length=512,  # Set max length to the desired maximum sequence length
            padding='max_length',
            truncation=True
        )
        
        return {
            'bert_input_ids': bert_inputs['input_ids'].flatten(),
            'bert_attention_mask': bert_inputs['attention_mask'].flatten(),
            't5_input_ids': t5_inputs['input_ids'].flatten(),
            't5_attention_mask': t5_inputs['attention_mask'].flatten(),
            'summary': summary
        }


Explanation: Here we define a custom dataset class named "CustomDataset", which inherits from the PyTorch Dataset class. The class takes a dataframe containing the text data, along with the BERT and T5 tokenizers, as input. In the __getitem__ method, the review text and summary for one data sample are retrieved from the dataframe. The review text is then tokenized with both the BERT and T5 tokenizers (with the "summarize: " prefix for T5), using special tokens, truncation, and padding to a maximum length of 512 so that every sample has the same input length. The summary is returned as a plain string.
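The effect of padding='max_length' together with truncation=True can be illustrated with a simplified, tokenizer-free sketch. The function pad_or_truncate and the token ids below are hypothetical. Real tokenizers also keep the closing special token when they truncate, which this toy version does not do.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]       # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)      # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut off
print(pad_or_truncate([7, 8, 9], max_length=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4))
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside the input ids.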

In [48]:
# Create datasets
train_dataset = CustomDataset(train_df, bert_tokenizer, t5_tokenizer)
test_dataset = CustomDataset(test_df, bert_tokenizer, t5_tokenizer)


In [49]:
# Define DataLoader
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)


In [50]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device =", device.type)

device = cuda


In [81]:
bert_model.to(device)
t5_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [52]:
optimizer = AdamW(list(bert_model.parameters()) + list(t5_model.parameters()), lr=2e-5)
num_epochs = 3

Explanation: Here we initialize the optimizer used to train the models, based on the AdamW optimization algorithm. We pass it the parameters of both bert_model and t5_model so that a single optimizer covers both models. The AdamW optimizer is instantiated with a learning rate of 2e-5 (0.00002). The number of epochs is set to 3, meaning that the entire training set is iterated over three times during training.
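The notebook imports get_linear_schedule_with_warmup but never calls it, so the learning rate stays fixed at 2e-5. As a rough sketch of what such a schedule would compute (linear warmup followed by linear decay), the multiplier applied to the base learning rate can be written in plain Python. The function name and the warmup length of 60 steps are assumptions for illustration; the 200 batches per epoch come from the training progress bars in this notebook.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
num_epochs = 3
batches_per_epoch = 200                        # as reported by the training progress bars
total_steps = num_epochs * batches_per_epoch   # 3 * 200 = 600 optimizer steps
warmup_steps = 60                              # hypothetical choice: 10% of all steps

for step in (0, 30, 60, 330, 600):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

With the real library, the scheduler would be created from the optimizer and stepped once after every optimizer.step().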

In [56]:

for epoch in range(num_epochs):
    bert_model.train()
    t5_model.train()
    for batch_idx, batch in enumerate(tqdm(train_loader, desc="Epoch " + str(epoch+1))):
        # Reset the gradients at the start of every batch
        optimizer.zero_grad()

        bert_input_ids = batch['bert_input_ids'].to(device)
        bert_attention_mask = batch['bert_attention_mask'].to(device)
        t5_input_ids = batch['t5_input_ids'].to(device)
        t5_attention_mask = batch['t5_attention_mask'].to(device)

        # Forward pass through BERT (its output is not used in the loss below)
        bert_outputs = bert_model(input_ids=bert_input_ids, attention_mask=bert_attention_mask)
        bert_last_hidden_state = bert_outputs.last_hidden_state

        # Forward pass through T5
        # NOTE: the labels are the T5 input ids, so this loss measures how well T5
        # reproduces its own input, not how well it matches the reference summary
        t5_outputs = t5_model(input_ids=t5_input_ids, attention_mask=t5_attention_mask, labels=t5_input_ids)
        t5_loss = t5_outputs.loss

        # Only the T5 loss is backpropagated, so BERT receives no gradient from it
        loss = t5_loss
        loss.backward()
        optimizer.step()

Epoch 1: 100%|██████████| 200/200 [17:59<00:00,  5.40s/it]
Epoch 2: 100%|██████████| 200/200 [17:05<00:00,  5.13s/it]
Epoch 3: 100%|██████████| 200/200 [17:00<00:00,  5.10s/it]


Explanation: This code is the training loop for fine-tuning bert_model and t5_model, which are intended here for text summarization. It runs for a set number of epochs, and each epoch iterates over the batches produced by train_loader. The gradients are reset at the start of every batch with optimizer.zero_grad(). Each batch then goes through a forward pass of both models: BERT returns its last hidden state, and T5 returns a loss. Only the T5 loss serves as the total loss, so BERT's output does not influence training and BERT's parameters receive no gradient. Because the labels passed to T5 are its own input ids, this loss measures how well T5 reproduces the input text rather than the reference summary. loss.backward() computes the gradients by backpropagation, and optimizer.step() then updates the parameters.
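If the reference summaries were used as labels instead of the input ids, one common preparation step is to replace the padding ids in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows this on plain Python lists. The token ids are made up, the helper name mask_padding is hypothetical, and a padding id of 0 is assumed (the pad token id of the t5-small tokenizer).

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX so padded positions add nothing to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in row]
        for row in label_ids
    ]


# Hypothetical token ids for two tokenized summaries, padded to length 6 with id 0
labels = [[71, 1237, 1, 0, 0, 0], [363, 19, 48, 58, 1, 0]]
print(mask_padding(labels))
```

On a tensor of label ids the same masking is usually written as labels[labels == pad_token_id] = -100 before the labels are passed to the model.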

In [78]:

# Model Testing
bert_model.eval()
t5_model.eval()

with torch.no_grad():
    for i, batch in enumerate(test_loader):
        if i == 30:  # Stop after processing 30 batches
            break

        bert_input_ids = batch['bert_input_ids'].to(device)
        bert_attention_mask = batch['bert_attention_mask'].to(device)
        t5_input_ids = batch['t5_input_ids'].to(device)
        t5_attention_mask = batch['t5_attention_mask'].to(device)

        # Forward pass through BERT (its output is not used for generation)
        bert_outputs = bert_model(input_ids=bert_input_ids, attention_mask=bert_attention_mask)
        bert_last_hidden_state = bert_outputs.last_hidden_state

        # Generate summaries with T5 using beam search
        t5_outputs = t5_model.generate(input_ids=t5_input_ids, attention_mask=t5_attention_mask, max_length=50, num_beams=4, early_stopping=True)

        # Decode the generated summaries
        generated_summaries = t5_tokenizer.batch_decode(t5_outputs, skip_special_tokens=True)

        # Print the first review of this batch and its generated summary.
        # test_loader does not shuffle, so the first sample of batch i is row i * batch_size of test_df.
        first_idx = i * test_loader.batch_size
        print("Review:", test_df.iloc[first_idx]['Text'])
        print("Generated Summary:", generated_summaries[0])
        print()

Review: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
Generated Summary: smells and looks good. best for finicky dogs.

Review: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as 'Jumbo'.
Generated Summary: Product does not match description.

Review: This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' 'The Lion, The Witch, and The 

Explanation:
This code is the testing step for the trained models. It starts by putting both the BERT and T5 models in evaluation mode with eval(), which disables dropout. Inside a torch.no_grad() context, so that no gradients are tracked, it iterates over batches from test_loader and stops after 30 batches through the condition if i == 30. For each batch, it moves the input tensors to the selected device.
It then runs a forward pass through BERT and generates summaries with T5. T5 creates the summaries with its generate() method, using a maximum length of 50 tokens, beam search with 4 beams, and early stopping. The generated token ids are decoded back into text with the T5 tokenizer. Finally, the original review and the generated summary for the first sample of each batch are printed, allowing a visual check of how well the model summarizes.
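Visual inspection can be complemented with a simple number. The sketch below computes a unigram-overlap F1 score between a reference summary and a generated one. It is only a simplified, ROUGE-1-style illustration on whitespace tokens, not the official ROUGE implementation, and the function name unigram_f1 and the example strings are assumptions.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 of the unigram overlap between two texts (lowercased, whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps, per word, the smaller of the two counts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary compared with a hypothetical generated summary
print(unigram_f1("good quality dog food", "good dog food"))
```

Averaging such a score over the test batches would give a rough single-number view of summary quality, under the assumption that the dataset's Summary column is an acceptable reference.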