# text2poem

### Aim
Given a textual description, generate a poem from it. 

### To-Do
- [ ] Clean up the variable names, etc
- [ ] Understand the T5 architecture and freeze the base layers if needed
- [ ] Create the $(\text{summary}, \ \text{poem})$ dataset using sites like [Poem Analysis](https://poemanalysis.com/)

### References
- [Base code for this notebook](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)  
- [T5 finetuning tips](https://discuss.huggingface.co/t/t5-finetuning-tips/684/2)
- [T5 Docs](https://huggingface.co/transformers/model_doc/t5.html)


## Install Dependencies

In [1]:
# The transformers library changes frequently, so the version is pinned
! pip install transformers==4.5.1
! pip install sentencepiece==0.1.94 # Pinned: T5's tokenizer depends on this version

Collecting transformers==4.5.1
Collecting tokenizers<0.11,>=0.10.1
Collecting sacremoses
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
Collecting sentencepiece==0.1.94

## Connect to GDrive

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


## Start
If you want to restart the notebook, do so from this point.

In [3]:
%reset -f

## Import Libraries

In [4]:
# Importing stock libraries
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split as tts
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers.optimization import Adafactor
from transformers import T5Tokenizer, T5ForConditionalGeneration

## Connecting to the GPU

In [5]:
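DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# See which GPU has been allotted (the query is skipped on CPU, where it would raise an error)
if DEVICE == "cuda":
    print(torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print("No GPU available, running on CPU")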

Tesla P100-PCIE-16GB


## Constants

In [6]:
PATH_DATA = "/content/gdrive/MyDrive/text2poem/news_summary.csv"
MODEL_NAME = "t5-small"

SEED = 42 

MAX_LEN = 512
SUMMARY_LEN = 150

TRAIN_BATCH_SIZE = 2
VALID_BATCH_SIZE = 2

N_EPOCHS = 2
LR = 1e-3

In [7]:
# Set random seeds and make pytorch deterministic for reproducibility
torch.manual_seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True

## Helper Functions

In [8]:
def countParameters(model):
    """ Counts the total number of trainable and frozen parameters in the model """
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen
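
The helper above separates trainable from frozen parameters, which becomes relevant once the base layers are frozen (see the To-Do list). Below is a minimal sketch of how freezing changes the two counts. It uses a toy two-layer network rather than T5, and the names `count_parameters` and `toy` are assumptions made for illustration only.

```python
import torch.nn as nn


def count_parameters(model):
    """Same logic as countParameters: (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


# Toy stand-in for a real model: Linear(4, 3) has 4*3 + 3 = 15 parameters,
# Linear(3, 2) has 3*2 + 2 = 8 parameters.
toy = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 2))
before = count_parameters(toy)

# Freeze the first layer by switching off gradient tracking for its parameters.
for p in toy[0].parameters():
    p.requires_grad = False
after = count_parameters(toy)

print(before, after)
```

The same loop could be applied to a submodule of the T5 model, although which layers are worth freezing is still an open question here.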

## Data Handlers

In [9]:
def loadDF(path, n_total = 100, split = 0.8,  prefix = "summarize: "):
    
    # Read from the `path` argument rather than the global constant
    df = pd.read_csv(path, encoding = "latin-1")
    df = df[['text', 'ctext']]
    df.ctext = prefix + df.ctext
    df = df[:n_total]

    df_train, df_val = tts(df, train_size = split, random_state = SEED, shuffle = True)
    
    df_train.reset_index(drop=True, inplace=True)
    df_val.reset_index(drop=True, inplace=True)

    return df, df_train, df_val

In [10]:
def getDataLoaders(df_train, df_val, tokenizer):
    ds_train = CustomDataset(df_train, tokenizer, MAX_LEN, SUMMARY_LEN)
    ds_val   = CustomDataset(df_val, tokenizer, MAX_LEN, SUMMARY_LEN)

    dl_train = DataLoader(ds_train, batch_size = TRAIN_BATCH_SIZE, shuffle = False, num_workers = 0)
    dl_val   = DataLoader(ds_val, batch_size = VALID_BATCH_SIZE, shuffle = False, num_workers = 0)

    return dl_train, dl_val

## Custom Dataset

In [11]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.text
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        # Pad to a fixed length and truncate explicitly (pad_to_max_length is deprecated)
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
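
The preprocessing in `__getitem__` comes down to two steps: collapsing whitespace and forcing every sequence to a fixed length. The following is a plain-Python sketch of both steps, with no tokenizer involved. The helper names and the token ids are hypothetical, and the real T5 tokenizer additionally appends an end-of-sequence token, which this sketch ignores.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(str(text).split())


def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(ids)[:max_length]          # truncate if too long
    n_pad = max_length - len(ids)         # number of padding slots needed
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


clean = normalize_whitespace("summarize:  The AA\n has fired\tits boss. ")
print(clean)  # summarize: The AA has fired its boss.

# Hypothetical token ids, chosen only to show the shapes
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up the same length, the default `DataLoader` collation can stack them into a batch without a custom collate function.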

## Train and Validate Functions

In [12]:
def train(epoch, tokenizer, model, loader, optimizer):

    model.train()

    for step, data in enumerate(tqdm(loader)):
        y = data['target_ids'].to(DEVICE, dtype = torch.long)
        # Teacher forcing: the decoder sees tokens [0, n-1) and predicts tokens [1, n)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Padding positions are set to -100 so that the loss ignores them
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(DEVICE, dtype = torch.long)
        mask = data['source_mask'].to(DEVICE, dtype = torch.long)

        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        # Log the loss every 20 steps
        if step % 20 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
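
The slicing in `train` is easier to follow on a single sequence. The sketch below reproduces the shift and the `-100` masking with plain Python lists instead of tensors. The token ids are hypothetical, apart from `0` and `1`, which are the pad and end-of-sequence ids of the T5 tokenizer. The helper name `shift_for_teacher_forcing` is an assumption made for illustration.

```python
PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def shift_for_teacher_forcing(target_ids, pad_id=PAD_ID):
    """Mirror y[:, :-1] and y[:, 1:] from train() for one sequence."""
    decoder_input_ids = target_ids[:-1]  # everything except the last token
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels


# Hypothetical target: three word pieces, </s> (id 1), then two pad tokens
target = [71, 1782, 19, 1, 0, 0]
dec_in, labels = shift_for_teacher_forcing(target)
print(dec_in)  # [71, 1782, 19, 1, 0]
print(labels)  # [1782, 19, 1, -100, -100]
```

With this scheme the first target token only ever serves as a decoder input and is never a prediction target. `T5ForConditionalGeneration` can also build the decoder inputs itself by shifting the labels right when only `labels` is passed, which may be worth trying as an alternative.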

In [13]:
def validate(epoch, tokenizer, model, loader):

    model.eval()

    predictions = []
    actuals = []

    with torch.no_grad():
        for data in tqdm(loader):

            y = data['target_ids'].to(DEVICE, dtype = torch.long)
            ids = data['source_ids'].to(DEVICE, dtype = torch.long)
            mask = data['source_mask'].to(DEVICE, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=SUMMARY_LEN, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

## Main

In [14]:
# T5's tokenizer for encoding the text
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)


### Loading the Data

In [15]:
df, df_train, df_val = loadDF(PATH_DATA)
print(len(df), len(df_train), len(df_val))

100 80 20


In [16]:
dl_train, dl_val = getDataLoaders(df_train, df_val, tokenizer)
print(len(dl_train), len(dl_val))

40 10
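
The loader lengths printed above follow directly from the split sizes and the batch size: each loader yields the number of examples divided by the batch size, rounded up. Here is a small sketch of that arithmetic, assuming the `DataLoader` default of keeping the last partial batch. The helper name `n_batches` is hypothetical.

```python
import math


def n_batches(n_examples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of n_examples items."""
    if drop_last:
        return n_examples // batch_size        # partial last batch discarded
    return math.ceil(n_examples / batch_size)  # partial last batch kept


# 80 training and 20 validation examples, batch size 2 (values from this notebook)
print(n_batches(80, 2), n_batches(20, 2))  # 40 10
```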


### Loading the Model

In [17]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model = model.to(DEVICE)


In [18]:
p_train, p_frozen = countParameters(model)
print(f"The model has {p_train:,} trainable and {p_frozen:,} frozen parameters")

The model has 60,506,624 trainable and 0 frozen parameters


### Optimization

In [19]:
optimizer = Adafactor(
    params = model.parameters(), 
    lr = LR, 
    scale_parameter = False, 
    relative_step = False
)

In [20]:
for e in range(N_EPOCHS):
    print(f"Epoch {e}")
    train(e, tokenizer, model, dl_train, optimizer)

Epoch 0



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch: 0, Loss:  3.2874932289123535


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg_sq_row.mul_(beta2t).add_(1.0 - beta2t, update.mean(dim=-1))


Epoch: 0, Loss:  2.9610185623168945

Epoch 1



Epoch: 1, Loss:  1.8120074272155762
Epoch: 1, Loss:  2.388127088546753



In [21]:
predictions, actuals = validate(0, tokenizer, model, dl_val)
results = pd.DataFrame({'Generated Text' : predictions, 'Actual Text' : actuals})


In [22]:
display(results)

| | Generated Text | Actual Text |
|---|---|---|
| 0 | a Delhi University college is starting a crèch... | Lakshmibai College will be the first college i... |
| 1 | Allegations that British American Tobacco was ... | UK tobacco company British American Tobacco (B... |
| 2 | Priyanka Chopra shared a picture of herself wi... | Priyanka Chopra has shared a picture with sing... |
| 3 | The AA has fired its boss, Bob Mackenzie, for ... | UK roadside repair firm AA on Tuesday lost ove... |
| 4 | Finance Minister Arun Jaitley said a "proper d... | Finance Minister Arun Jaitley has said that a ... |
| 5 | The Ghaziabad police arrested three persons an... | The Ghaziabad Police has booked 14 people incl... |
| 6 | Over 400 farmers, including women, took to the... | Over 400 farmers from Greater Noida and adjoin... |
| 7 | police have arrested one of the accused, a juv... | A 15-year-old Haryana girl has been found to b... |
| 8 | The Food Safety and Standards Authority of Ind... | India's food regulator Food Safety and Standar... |
| 9 | The Daman and Diu administration on Wednesday ... | The Administration of Union Territory Daman an... |


In [23]:
print(results.iloc[3]["Generated Text"], "\n")
print(results.iloc[3]["Actual Text"])

The AA has fired its boss, Bob Mackenzie, for gross misconduct. "The family trusts all parties will act responsibly towards a loyal servant of the company," the statement said. 

UK roadside repair firm AA on Tuesday lost over £200 million (nearly?1,690 crore) or nearly a fifth of its value. This came after shares fell as much as 18% during the day after it fired its Executive Chairman Bob Mackenzie with immediate effect for 'gross misconduct'. However, his son said he resigned over an 'extremely distressing mental health issue'.


## Dummy cells added by the script
After starting the training, inject the following JS code into the browser console to avoid losing the connection after 90 minutes. The code adds a new cell every 30 minutes, so select the last cell before injecting it, and delete the added cells manually once training has finished.
```javascript
var t_interval = 1800; // In seconds, 30 mins
function AddCell(){
    console.log("Added cell"); 
    document.querySelector("#toolbar-add-code").click() 
}
setInterval(AddCell, t_interval*1000); // this is in ms, thus multiplied by 1000 
```