## Summarization

Dataset: [Amazon fine food reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews/data)

### Imports

In [1]:
import re
import random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch.optim as optim

2024-01-11 19:46:45.001485: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-11 19:46:45.055806: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Set device to cuda if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


## Dataset

#### Read Data

In [3]:
data_path = "./data/summarization/Reviews.csv"

In [4]:
# Read the data file into pandas df
reviews_df = pd.read_csv(data_path)

In [5]:
# Display all the columns in the df
reviews_df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
# Since we only need Summary and Text keep only these columns and drop remaining columns
reviews_df = reviews_df[['Summary', 'Text']]

In [7]:
# Check df info
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Summary  568427 non-null  object
 1   Text     568454 non-null  object
dtypes: object(2)
memory usage: 8.7+ MB


In [8]:
# Drop rows with null values and duplicates if any
reviews_df = reviews_df.dropna().drop_duplicates()

In [9]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 394967 entries, 0 to 568453
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Summary  394967 non-null  object
 1   Text     394967 non-null  object
dtypes: object(2)
memory usage: 9.0+ MB


In [10]:
reviews_df.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


The dataframe now contains summary and text.
Now, we create a new column such that it contains `<<text>>, TL;DR <<summary>>`

In [11]:
reviews_df['data'] = reviews_df['Text'] + ', TL;DR ' + reviews_df['Summary']

##### Explore few examples

In [12]:
print(reviews_df.iloc[0]['Summary'])
print(reviews_df.iloc[0]['Text'])
print(reviews_df.iloc[0]['data'])

Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most., TL;DR Good Quality Dog Food


In [13]:
print(reviews_df.iloc[1]['Summary'])
print(reviews_df.iloc[1]['Text'])
print(reviews_df.iloc[1]['data'])

Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo"., TL;DR Not as Advertised


### Setup

In [14]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelWithLMHead.from_pretrained('gpt2')



In [15]:
# Path to save the finetuned model
model_path = './output/summarization/'

In [16]:
# Move model to gpu for processing in gpu
model = model.to(device)

    Found GPU1 NVIDIA GeForce GT 710 which is of cuda capability 3.5.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 3.7.
    


In [17]:
# Optimizer for training
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

# Max length to cover all the examples
max_length = 200

In [18]:
extra_length = len(tokenizer.encode(" TL;DR "))

### Dataset class defination

In [19]:
class CustomDataset(Dataset):
    def __init__(self, tokenizer, reviews, max_len):
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token
        self.eos_id = self.tokenizer.eos_token_id
        self.reviews = reviews
        self.result = []

        for review in self.reviews:
            # Encode the text using the tokenizer
            tokenized = self.tokenizer.encode(review + self.eos)

            # padding the encoded sequence to max_len
            padded = self.pad_truncate(tokenized)

            # Creating a tensor and adding  to the result
            self.result.append(torch.tensor(padded))

    def __len__(self):
        return len(self.result)
    
    def __getitem__(self, item):
        return self.result[item]
    
    def pad_truncate(self, tokenized):
        tokenized_length = len(tokenized) - extra_length
        if tokenized_length < self.max_len:
            difference = self.max_len - tokenized_length
            padded = tokenized + [self.eos_id] * difference
        elif tokenized_length > self.max_len:
            padded = tokenized[:self.max_len + 3]+[self.eos_id]
        else:
            padded = tokenized
        return padded

In [20]:
# Load data into this custom dataset
dataset = CustomDataset(tokenizer=tokenizer,
                        reviews=reviews_df['data'],
                        max_len=max_length)

Token indices sequence length is longer than the specified maximum sequence length for this model (1430 > 1024). Running this sequence through the model will result in indexing errors


In [21]:
# Preview a object in dataset class
dataset.__getitem__(0)

tensor([   40,   423,  5839,  1811,   286,   262, 28476,   414, 32530,  3290,
         2057,  3186,   290,   423,  1043,   606,   477,   284,   307,   286,
          922,  3081,    13,   383,  1720,  3073,   517,   588,   257, 20798,
          621,   257, 13686,  6174,   290,   340, 25760,  1365,    13,  2011,
        45246,   318,   957, 17479,   290,   673,  5763,   689,   428,  1720,
         1365,   621,   220,   749,  1539, 24811,    26,  7707,  4599, 14156,
         8532,  7318, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 

In [22]:
# Create a dataloader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

### Training

In [25]:
def fine_tune(model, optimizer, dl, epochs):
    
    for epoch in range(epochs):
        model.train()
        for idx, batch in enumerate(dl):
            optimizer.zero_grad()
            batch = batch.to(device)
            output = model(batch, labels=batch)
            loss = output[0]
            loss.backward()
            optimizer.step()
            if idx%100 == 0:
                print(f"Loss: {loss}, Batches: {idx}")
                model.save_pretrained(model_path)
                # torch.save(model.state_dict(), model_path)

In [26]:
fine_tune(model=model, optimizer=optimizer, dl=dataloader, epochs=1)

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 47.54 GiB total capacity; 2.35 GiB already allocated; 10.88 MiB free; 2.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

### Inference

In [None]:
# Topk implementation for selecting from topk choices
def topk(probs, n=9):
    # Convert the scores using softmax to probabilities
    probs = torch.softmax(probs, dim=-1)

    # Use topk implemntation provided by pytorch
    tokens_prob, topix = torch.topk(probs, k=n)

    # the new selection pool is normalized
    tokens_prob = tokens_prob / torch.sum(tokens_prob)

    # To CPU for handling by numpy
    tokens_prob = tokens_prob.cpu().detach().numpy()

    # Randomly select from the pool of prob distribution
    choice = np.random.choice(n, 1, p = tokens_prob)
    token_id = topix[choice][0]

    return int(token_id)

In [None]:
def model_inference(model, tokenizer, review, max_length=15):
    # Preprocess the initial tokens
    review_encoded = tokenizer.encode(review)
    result = review_encoded
    initial_input = torch.tensor(review_encoded).unsqueeze(0).to(device)

    model.eval()
    # Feed the initial input to the model
    output = model(initial_input)

    # Flatten the logits at the final time step
    logits = output.logits[0, -1]

    # Make a top-k choice and append to the result
    result.append(topk(logits))

    # For max_length times
    for _ in range(max_length):
        # Feed the current updated sequence to the model and make a choice
        input = torch.tensor(result).unsqueeze(0).to(device)
        output = model(input)
        logits = output.logits[0,-1]
        res_id = topk(logits)

        # If EOS is encountered return the result
        if res_id == tokenizer.eos_token_id:
            return tokenizer.decode(result)
        else:
            # Append the token to the sequence
            result.append(res_id)

    # IF max_length is encountered
    return tokenizer.decode(result)

In [None]:
# model.load_state_dict(torch.load(model_path))
model = model.from_pretrained(model_path)

In [None]:
test_review = """My local coffee shop has me addicted to their 20 oz vanilla chai lattes. 
                At $3.90 a pop I was spending a lot of money.  I asked what brand they used,
                need nutritional information, of course!  They told me it was Big Train Chai Vanilla.
                <br />It's important to follow the directions on the can.  I made mine with just milk
                  with a yucky result.  Use the water with a little milk as there is milk powder in the 
                  mix.<br /><br />WARNING:It's addicting!!!"""

In [None]:
summary = model_inference(model, tokenizer, test_review + " TL;DR ")

In [None]:
print(summary.split(' TL;DR ')[1])

<|endoftext|>
