# Fine-Tuning a Transformer for Headline Generation


### Introduction

This task generates a summary of a given article or document by passing it through a network. There are two types of summary generation mechanisms:

1. ***Extractive Summary:*** The network identifies the most important sentences in the article and combines them to convey its most meaningful information.
2. ***Abstractive Summary:*** The network writes new sentences that capture the gist of the article and outputs them. The sentences in the summary may or may not appear in the article. 

Here we will be generating an ***Abstractive Summary***: the model is fine-tuned to produce a headline from the complete article text. 
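To make the contrast concrete, here is a minimal sketch of the *extractive* approach, which this notebook does not use. It scores each sentence by the summed document-wide frequency of its words and returns the top sentences verbatim. The function name `extractive_summary` and the scoring rule are illustrative assumptions, not part of the original notebook, and the rule is deliberately naive (it favours long sentences).

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole document
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # A sentence scores the summed frequency of its words
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return ' '.join(sentences[i] for i in chosen)

doc = ("The phone has a big battery. The phone also has a dual camera and a big screen. "
       "It ships next week.")
summary = extractive_summary(doc, n_sentences=1)
print(summary)
```

Every sentence an extractive method returns is copied from the input, whereas the abstractive model trained below is free to produce wording that never occurs in the article.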

#### DATA

- **Data**:
	- We are using the News Summary dataset available at [Kaggle](https://www.kaggle.com/sunnysai12345/news-summary)
	- This dataset is a collection of articles from newspapers published in India, with the fields listed below extracted for each article. We use only the first CSV file from the data dump: `news_summary.csv`
	- There are `4514` rows of data. Each row has the following fields:
		- **author**: Author of the article
		- **date**: Date the article was published
		- **headlines**: Headline for the published article
		- **read_more**: URL for the article to follow online
		- **text**: Summary of the article
		- **ctext**: Complete article


- **Language Model Used**: 
    - This notebook uses the transformer model ***T5***. [Research Paper](https://arxiv.org/abs/1910.10683)    
    - ***T5*** gives state-of-the-art results on many NLP tasks and takes an unusually uniform approach to them.
    - **Text-2-Text** - As illustrated in the T5 paper, every NLP task is converted to a **text-to-text** problem. Translation, classification, summarization, and question answering are all treated as converting input text to output text, rather than as separate problem statements.
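As a small illustration of the text-to-text idea, the sketch below turns different tasks into plain strings by prepending a task prefix. The prefixes follow the style used in the T5 paper; the helper name `to_text_to_text` and the dictionary keys are hypothetical and exist only for this example.

```python
# Task prefixes in the style of the T5 paper; dictionary keys are illustrative
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}

def to_text_to_text(task, text):
    """Build the input string for a task: prefix + whitespace-normalized text."""
    return TASK_PREFIXES[task] + ' '.join(text.split())

example = to_text_to_text("summarization", "Samsung   launches a new\nphone.")
print(example)
```

The notebook applies the same idea later, when it prepends `summarize: ` to every article before tokenization.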

### Importing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd 'drive/My Drive/Bridgei2i/Development_Data'

/content/drive/My Drive/Bridgei2i/Development_Data


In [None]:
!pip install transformers==2.9.0 
!pip install pytorch_lightning==0.7.5

Collecting transformers==2.9.0
  Downloading https://files.pythonhosted.org/packages/cd/38/c9527aa055241c66c4d785381eaf6f80a28c224cae97daa1f8b183b5fabb/transformers-2.9.0-py3-none-any.whl (635kB)

In [None]:
!pip install wandb -q

# Code for TPU packages install
# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev

  Building wheel for pathtools (setup.py) ... done
  Building wheel for subprocess32 (setup.py) ... done


In [None]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# WandB – Import the wandb library
import wandb

In [None]:
# Checking out the GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Thu Mar 18 14:59:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()

In [None]:
# Login to wandb to log the model run and all the parameters
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# Custom dataset that reads the dataframe and tokenizes each row, so that a DataLoader can feed batches to the network for fine-tuning and, later, for prediction

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.headlines = self.data.headlines
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.headlines)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        headlines = str(self.headlines[index])
        headlines = ' '.join(headlines.split())

        source = self.tokenizer.batch_encode_plus([ctext], max_length= self.source_len, pad_to_max_length=True,return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([headlines], max_length= self.summ_len, pad_to_max_length=True,return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
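The fixed-length encoding that `CustomDataset` requests from the tokenizer (`max_length` together with `pad_to_max_length=True`) can be pictured with a plain-Python sketch. This is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, it assumes T5's pad token id of 0, and it ignores details such as how the real tokenizer treats the end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    # Pad on the right; the mask is 0 wherever padding was added
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4], max_length=2)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item then has the same length, the `DataLoader` can stack items into a batch tensor, and the attention mask tells the model which positions to ignore.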

In [None]:
df =  pd.read_csv('news_summary.csv',encoding='latin-1')
df1 = df[['headlines', 'ctext']]

In [None]:
df1['ctext'][0]

'The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,? the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.The two notifications ? one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (right) ? were issued by the Dama

In [None]:
# Use assign() to build a new frame; setting columns directly on the slice `df1` raises a SettingWithCopyWarning
df1 = df1.assign(
    length_headline=df1['headlines'].astype(str).map(len),
    length_text=df1['ctext'].astype(str).map(len),
)


In [None]:
df1.head()

Unnamed: 0,headlines,ctext,length_headline,length_text
0,Daman & Diu revokes mandatory Rakshabandhan in...,The Daman and Diu administration on Wednesday ...,60,2313
1,Malaika slams user who trolled her for 'divorc...,"From her special numbers to TV?appearances, Bo...",60,2382
2,'Virgin' now corrected to 'Unmarried' in IGIMS...,The Indira Gandhi Institute of Medical Science...,52,2114
3,Aaj aapne pakad liya: LeT man Dujana before be...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,56,2384
4,Hotel staff to get training to spot signs of s...,Hotels in Mumbai and other Indian cities are t...,60,3249


In [None]:
df1.describe()

Unnamed: 0,length_headline,length_text
count,4514.0,4514.0
mean,55.948383,2033.922242
std,4.604138,2180.007774
min,31.0,3.0
25%,54.0,1079.25
50%,58.0,1696.5
75%,59.0,2463.0
max,62.0,76045.0


In [None]:
# Creating the training function. It is called once per epoch from the main cell.
# The model is put into train mode, then we enumerate over the training loader and pass each batch to the network

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for _,data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
        loss = outputs[0]
        
        if _%10 == 0:
            wandb.log({"Training Loss": loss.item()})

        if _%500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # xm.optimizer_step(optimizer)
        # xm.mark_step()
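The target handling inside `train` (`y[:, :-1]` as decoder input, `y[:, 1:]` as labels, padding replaced by `-100`) is easier to follow on a single sequence. The sketch below mirrors those three lines with plain lists; the helper name `shift_targets` is hypothetical, and the ids are made up except for the pad id 0, which matches T5.

```python
PAD_ID = 0            # T5's pad token id
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips

def shift_targets(target_ids):
    """Split one padded target sequence into decoder inputs and labels."""
    decoder_input_ids = target_ids[:-1]     # everything except the last token
    # Labels are the same sequence shifted left by one; padding is masked out
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels

dec_in, labels = shift_targets([11, 12, 13, 1, 0, 0])
print(dec_in)
print(labels)
```

At position `i` the decoder sees the target up to token `i` and is trained to predict token `i + 1`, while padded positions contribute nothing to the loss.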

<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage we pass the unseen data (the testing dataset), the trained model, the tokenizer, and the device to the function to perform the validation run. This step generates new summaries for data the model did not see during training. 

This function is called from the main section of the notebook.

The unseen data is the 20% of `news_summary.csv` that was separated during the dataset creation stage. 
During the validation stage the weights of the model are not updated. We use the `generate` method to produce new text for the summary. 

It relies on beam search decoding, a method for sequence generation in models with an LM head. 

The generated text and the original summary are decoded from tokens to text and returned to the caller.
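Beam search itself can be sketched in a few lines. The example below is a toy, not the `generate` implementation: `toy_model` is a hypothetical lookup table standing in for the network, and the search runs for a fixed number of steps without end-of-sequence handling or length penalties.

```python
import math

def toy_model(prefix):
    """Hypothetical next-token log-probabilities for a given prefix."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
        ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
    }
    return table[tuple(prefix)]

def beam_search(next_log_probs, num_steps, num_beams=2):
    """Keep the num_beams best partial sequences at every step."""
    beams = [((), 0.0)]                       # (sequence, summed log-probability)
    for _ in range(num_steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]        # prune to the best hypotheses
    return beams

best_seq, best_score = beam_search(toy_model, num_steps=2, num_beams=2)[0]
greedy_seq, greedy_score = beam_search(toy_model, num_steps=2, num_beams=1)[0]
print(best_seq, round(math.exp(best_score), 2))
print(greedy_seq, round(math.exp(greedy_score), 2))
```

With one beam the search commits to `a` because it is the likelier first token, and ends with a lower-probability sequence than the two-beam search, which keeps `b` alive long enough to find `b x`.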

In [None]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
            if _%100==0:
                print(f'Completed {_}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals
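The `repetition_penalty=2.5` argument passed to `generate` discourages tokens that were already produced. As far as I understand the rule it follows (the one proposed for CTRL), the logit of each previously generated token is divided by the penalty when positive and multiplied by it when negative. The sketch below shows that rule on a plain list; `apply_repetition_penalty` is a hypothetical helper, not a library function.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that already appear in the output."""
    out = list(logits)
    for tok in set(generated_ids):
        # Both branches make the token less likely when penalty > 1
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.5)
print(penalized)
```

Tokens that have not been generated yet (index 2 here) keep their original score.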

In [None]:
!pip install sentencepiece



In [None]:
# WandB – Initialize a new run
wandb.init(project="transformers_tutorials_summarization")

# WandB – Config is a variable that holds and saves hyperparameters and inputs
# Defining some key variables that will be used later on in the training  
config = wandb.config          # Initialize config
config.TRAIN_BATCH_SIZE = 2    # input batch size for training (default: 64)
config.VALID_BATCH_SIZE = 2    # input batch size for testing (default: 1000)
config.TRAIN_EPOCHS = 2        # number of epochs to train (default: 10)
config.VAL_EPOCHS = 1 
config.LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
config.SEED = 42               # random seed (default: 42)
config.MAX_LEN = 512
config.SUMMARY_LEN = 150 

# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(config.SEED) # pytorch random seed
np.random.seed(config.SEED) # numpy random seed
torch.backends.cudnn.deterministic = True

# Tokenizer for encoding the text
tokenizer = T5Tokenizer.from_pretrained("t5-small")


# Importing and Pre-Processing the domain data
# Selecting the needed columns only. 
# Adding the `summarize: ` prefix in front of the text. This formats the dataset the way T5 was trained for the summarization task. 
df = pd.read_csv('news_summary.csv',encoding='latin-1')
df = df[['headlines','ctext']]
df.ctext = 'summarize: ' + df.ctext
print(df.head())


# Creation of Dataset and Dataloader
# Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state = config.SEED)
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(val_dataset.shape))


# Creating the Training and Validation dataset for further creation of Dataloader
training_set = CustomDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)

# Defining the parameters for creation of dataloaders
train_params = {
    'batch_size': config.TRAIN_BATCH_SIZE,
    'shuffle': True,
    'num_workers': 0
    }

val_params = {
    'batch_size': config.VALID_BATCH_SIZE,
    'shuffle': False,
    'num_workers': 0
    }

# Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)



# Defining the model. We are using the t5-small model, which includes a language-model head on top for generating the summary. 
# Further this model is sent to device (GPU/TPU) for using the hardware.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model = model.to(device)

# Defining the optimizer that will be used to tune the weights of the network in the training session. 
optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

# Log metrics with wandb
wandb.watch(model, log="all")
# Training loop
print('Initiating Fine-Tuning for the model on our dataset')

for epoch in range(config.TRAIN_EPOCHS):
    train(epoch, tokenizer, model, device, training_loader, optimizer)

[34m[1mwandb[0m: Currently logged in as: [33maaaaaaa[0m (use `wandb login --relogin` to force relogin)




                                           headlines                                              ctext
0  Daman & Diu revokes mandatory Rakshabandhan in...  summarize: The Daman and Diu administration on...
1  Malaika slams user who trolled her for 'divorc...  summarize: From her special numbers to TV?appe...
2  'Virgin' now corrected to 'Unmarried' in IGIMS...  summarize: The Indira Gandhi Institute of Medi...
3  Aaj aapne pakad liya: LeT man Dujana before be...  summarize: Lashkar-e-Taiba's Kashmir commander...
4  Hotel staff to get training to spot signs of s...  summarize: Hotels in Mumbai and other Indian c...
FULL Dataset: (4514, 2)
TRAIN Dataset: (3611, 2)
TEST Dataset: (903, 2)








Initiating Fine-Tuning for the model on our dataset




Epoch: 0, Loss:  4.874608516693115
Epoch: 0, Loss:  3.1677329540252686
Epoch: 0, Loss:  4.309917449951172
Epoch: 0, Loss:  2.5353498458862305
Epoch: 1, Loss:  3.286834239959717
Epoch: 1, Loss:  1.6629774570465088
Epoch: 1, Loss:  2.8557727336883545
Epoch: 1, Loss:  2.0700879096984863
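The 80/20 split reported in the output above (3611 training rows, 903 held-out rows) can be reproduced in size with the standard library. This sketch only illustrates the idea of a seeded random split; it uses Python's `random` module rather than pandas, so it will not select the same rows as `df.sample`, and the helper `train_val_split` is hypothetical.

```python
import random

def train_val_split(n_rows, train_frac=0.8, seed=42):
    """Return (train_indices, val_indices) for a seeded random split."""
    indices = list(range(n_rows))
    rng = random.Random(seed)                      # fixed seed for repeatability
    train_idx = sorted(rng.sample(indices, round(n_rows * train_frac)))
    train_set = set(train_idx)
    # Everything not drawn for training becomes the held-out set
    val_idx = [i for i in indices if i not in train_set]
    return train_idx, val_idx

train_idx, val_idx = train_val_split(4514)
print(len(train_idx), len(val_idx))
```

The two index lists are disjoint and together cover every row, which is the property the `drop(train_dataset.index)` call provides in the notebook.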


In [None]:
%ls

cleaned_articles.csv   dev_data_tweet_cleaned_v2.xlsx  news_summary.csv
cleaned.csv            dev_data_tweet_cleaned_v4.xlsx  predictions.csv
dev_data_article.xlsx  dev_data_tweet.xlsx             wandb/


In [None]:
torch.save(model.state_dict(),'/content/drive/MyDrive/Bridgei2i/Development_Data/news.pth')

In [None]:
df = pd.read_csv('eng_articles.csv')

In [None]:
df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Text,Headline,Mobile_Tech_Flag,Language
0,0,3001,GlobeNewswire\n\nRenowned Companies covered in...,NewsBytes Briefing: Google pays for systemic r...,1,en
1,1,3002,Samsung Galaxy M02 expands the company’s entry...,Samsung launches Galaxy M02 smartphone with 50...,1,en
2,2,3003,Samsung Galaxy M02 expands the company’s entry...,Samsung launches Galaxy M02 smartphone with 50...,1,en
3,3,3004,"After months of wait, Poco's has finally launc...",Home\n Poco M3 Launched in India- S...,1,en
4,4,3005,MediaTek Unveils New M80 5G Modem with Support...,MediaTek Unveils New M80 5G Modem with Support...,1,en
...,...,...,...,...,...,...
373,720,3961,Google AMP Kya hai |⚡ Accelerated Mobile Pages...,Google AMP Kya hai |⚡ Accelerated Mobile Pages...,1,en
374,723,3964,Jio Free Recharge App | Jio Free Mobile Data T...,Jio Free Recharge App | Jio Free Mobile Data T...,1,en
375,724,3965,Jio ki Internet Speed Kaise Badate hai | 50mbp...,Jio ki Internet Speed Kaise Badate hai | 50mbp...,1,en
376,725,3966,Kisi Bhi Android Mobile ko Root kare in 5 Apps...,Kisi Bhi Android Mobile ko Root kare in 5 Apps...,1,en


In [None]:
df['Text'][2]

'Samsung Galaxy M02 expands the company’s entry-level smartphone lineup in India. The new Galaxy M-series device is “designed to cater to accelerating digital needs of consumers, be it work, play or content streaming,” claims Samsung.Samsung Galaxy M02 smartphone boasts of big battery and dual rear camera setup.Samsung Galaxy M02 will be available in two variants -- 2GB + 32GB and 3GB + 32GB. The former is priced at Rs 6,999, while the latter costs Rs 7,499.The smartphone will be available in both online and offline retail stores including Amazon India website, Samung.com and other retail channels.As an introductory offer, consumers can avail a special discount of Rs 200 on Amazon.in for limited time.Samsung Galaxy M02 features a 6.5-inch screen with HD+ Infinity V Display and is powered by MediaTek 6739 processor.In terms of camera specifications, the Galaxy M02 boasts of dual-rear camera setup of 13MP main lens and 2MP macro sensor.There’s a 5MP front-facing camera as well.The smartp

In [None]:
df['Headline'][2]

'Samsung launches Galaxy M02 smartphone with 5000mAh battery, 6.5-inch display; price starts at Rs 6,999'

In [None]:
df1['Generated Text'][2]

'<extra_id_0> Galaxy M02 smartphone boasts big battery and dual rear camera setup: Samsung Galaxy M02 smartphone to be available in two variant versions: 2GB + 32GB, 3GB + 32GB & 3GB + 32GB.Samson Galaxy M02 smartphone will be available in two variant sizes: 2GB + 32GB, 3GB + 32GB, 3GB + 32GB, 3GB + 32GB for limited time: Samsung Galaxy M02 smartphone to be available in India: Samsung Galaxy M02 smartphone with big battery, dual rear camera setup'

In [None]:
device

In [None]:
df = df.rename(columns={'Text': 'ctext', 'Headline': 'headlines'})

In [None]:
val_dataset=df
val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
val_loader = DataLoader(val_set, **val_params)

In [None]:
# Validation loop: generate predictions and collect them, together with the actuals, in a dataframe.
# Saving the dataframe as predictions.csv
print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
for epoch in range(config.VAL_EPOCHS):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv('predictions.csv')
    print('Output Files generated for review')

Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Output Files generated for review


In [None]:
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [None]:
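# Fall back to the CPU when no GPU is available instead of hard-coding the CUDA device
device = "cuda:0" if torch.cuda.is_available() else "cpu"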
model = model.to(device)

In [None]:
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512)
inputs = inputs.to(device)
outputs = model.generate(inputs, max_length=150, min_length=10, length_penalty=2.0, num_beams=1, early_stopping=True)

In [None]:
print(tokenizer.decode(outputs[0]))

Liana Barrientos married in New York, says she's her first marriage: Prosecutors: Barrientos' false statements on marriage license application: Reports of her false statements on marriage license application: Reports of her marriages were illegal: Prosecutors: Barrientos's false statements on marriage license application: Reports of her marriages were illegal: Prosecutors: Barrientos's false statements on marriage license application: Reports of her husbands in 2010: Liana Barrientos got married in New York, says she's first and only marriage: Prosecutors: Barrientos on immigration scam: Pro


In [None]:
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)


In [None]:
# Load the saved predictions; the cells below refer to this frame as `df`
df = pd.read_csv('predictions.csv')
df

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
0,0,<extra_id_0>: Specialty Carbon Black Market fo...,NewsBytes Briefing: Google pays for systemic r...
1,1,<extra_id_0> Galaxy M02 expands Indian smartph...,Samsung launches Galaxy M02 smartphone with 50...
2,2,<extra_id_0> Galaxy M02 smartphone boasts big ...,Samsung launches Galaxy M02 smartphone with 50...
3,3,<extra_id_0> the Poco M3 comes in three colors...,Homen Poco M3 Launched in India- Snapdragon 66...
4,4,"<extra_id_0>m to combines mmWave, sub-6GHz 5G ...",MediaTek Unveils New M80 5G Modem with Support...
...,...,...,...
373,373,<extra_id_0>: Google AMP kya hai | Accelerated...,Google AMP Kya hai | Accelerated Mobile Pages ...
374,374,<extra_id_0> karke Unlimited Mobile Data pane ...,Jio Free Recharge App | Jio Free Mobile Data T...
375,375,<extra_id_0> ke karan Speed slow ho jati hai. ...,Jio ki Internet Speed Kaise Badate hai | 50mbp...
376,376,<extra_id_0>: Rooting ek process hai jiski mad...,Kisi Bhi Android Mobile ko Root kare in 5 Apps...


In [None]:
df['Generated Text'][1]

'<extra_id_0> Galaxy M02 expands Indian smartphone lineup in India: Samsung Galaxy M02 smartphone to be available in two variant versions: 2GB + 32GB, 3GB + 32GB and 3GB + 32GB. Galaxy M02: Price and availability Samsung Galaxy M02: Price and availability Samsung Galaxy M02 smartphone to be available in two variant sizes: 2GB + 32GB with dual rear camera setup: Samsung Galaxy M02: Price and availability Samsung Galaxy M02: Price and availability Samsung Galaxy M02 smartphone to be available in two variant options: 2'

In [None]:
df = df.drop(columns=['Unnamed: 0'])

In [None]:
!pip install -U sentence-transformers
!pip install -U rouge

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: f

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [None]:
import scipy
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from rouge import Rouge
from statistics import mean
import nltk
nltk.download('punkt')

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
df['Actual Text'][0]

'NewsBytes Briefing: Google pays for systemic racism, misogyny, and more'

In [None]:
df['Generated Text'][0]

'<extra_id_0>: Specialty Carbon Black Market forecasts to acquire market value of USD 2,330.9 million by 2025 from 2018 to 2025. Specialties carbon black is a pure form of carbon with low content of ash, metal and sulfur in plastic parts: Reports on Specializable carbon black market report predicts global market to grow by 2025 after COVID-19 in 2018: Reports on Specialite Carbon Black Market Overview: Reports on Specialtenance Carbon Black Market Forecasting Global Specialtality Carbon Black Market expected for'

In [None]:
def convert(text):
   # Keep only the text before the first colon. str.split always returns
   # at least one element, so no length check is needed.
   return text.split(':')[0]

In [None]:
convert(df['Generated Text'][1])

'<extra_id_0> Galaxy M02 expands Indian smartphone lineup in India'
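The output above still carries the `<extra_id_0>` sentinel token, and for a generation that starts with `<extra_id_0>:` the `convert` function would return only the sentinel. A possible refinement, offered as an assumption rather than something the notebook does, is to strip sentinel tokens first and then keep the first non-empty segment. The helper name `clean_headline` is hypothetical.

```python
import re

def clean_headline(generated):
    """Strip T5 sentinel tokens, then keep the first non-empty ':' segment."""
    # Sentinel tokens look like <extra_id_0>, <extra_id_1>, ...
    text = re.sub(r'<extra_id_\d+>', '', generated)
    for segment in text.split(':'):
        segment = segment.strip()
        if segment:
            return segment
    return ''

first = clean_headline('<extra_id_0> Galaxy M02 expands Indian smartphone lineup in India: Samsung Galaxy M02')
second = clean_headline('<extra_id_0>: Specialty Carbon Black Market forecasts: Reports')
print(first)
print(second)
```

Applying a cleaner like this before scoring would change the similarity, ROUGE, and BLEU numbers reported below, so those figures should be read as belonging to `convert` as written.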

In [None]:
# Get average document similarity using BERT
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Get a vector for each headlines
actual_headline = df['Actual Text'][1]
predicted_headline = convert(df['Generated Text'][1])
actual_headline_embeddings = model.encode(actual_headline)
predicted_headline_embeddings = model.encode(predicted_headline)

# cdist returns a 1x1 matrix of cosine distances; index it down to a scalar before formatting
distance = scipy.spatial.distance.cdist([actual_headline_embeddings], [predicted_headline_embeddings], "cosine")[0][0]
print("Similarity Score: %.4f" % (1-distance))

Similarity Score: 0.6912
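The score printed above is one minus the cosine distance, which is the cosine similarity of the two embedding vectors. A minimal plain-Python version of that quantity, assuming non-zero vectors of equal length, looks like this (the function name is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

The value ranges from -1 to 1, which is why slightly negative similarity scores can appear when two headlines are unrelated.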


In [None]:
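# Rouge Scores
# Note: Rouge.get_scores takes (hypotheses, references). The reference is passed first here,
# so the reported 'p' and 'r' are swapped relative to the usual convention; 'f' is unaffected.
rouge = Rouge()

rouge_score = rouge.get_scores(actual_headline, predicted_headline)
rouge_scores = rouge_score[0]['rouge-l']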

rouge_scores

{'f': 0.2399999953920001, 'p': 0.1875, 'r': 0.3333333333333333}
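ROUGE-L is built on the longest common subsequence (LCS) between the generated and reference token sequences. The sketch below computes precision, recall, and a plain F1 from the LCS length; it is a simplified illustration under the assumption of whitespace tokenization, and it will not match the `rouge` package digit for digit.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(hypothesis, reference):
    """Precision, recall and F1 based on the LCS of whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return {'p': 0.0, 'r': 0.0, 'f': 0.0}
    p, r = lcs / len(hyp), lcs / len(ref)   # precision over hypothesis, recall over reference
    return {'p': p, 'r': r, 'f': 2 * p * r / (p + r)}

scores = rouge_l("samsung launches galaxy phone", "samsung launches new galaxy m02 phone")
print(scores)
```

Precision is measured against the generated headline and recall against the reference, so the order in which the two strings are supplied determines which number is which.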

In [None]:
# BLEU Scores

hypothesis = predicted_headline.split()
reference = actual_headline.split()
references = [reference] # list of references for 1 sentence.
list_of_references = [references] # list of references for all sentences in corpus.
list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
bleu_score = corpus_bleu(list_of_references, list_of_hypotheses)

bleu_score

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


0.2319623687227222
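The warning above arises because BLEU combines modified n-gram precisions for several n-gram orders, and short headlines often share no 3-grams or 4-grams with the reference. The sketch below computes those per-order precisions for one hypothetical pair of headlines; it illustrates the clipped-count idea only and is not NLTK's implementation (no brevity penalty, no smoothing).

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

hyp = "galaxy m02 expands lineup in india".split()
ref = "samsung launches galaxy m02 in india".split()
precisions = [modified_precision(hyp, ref, n) for n in (1, 2, 3, 4)]
print(precisions)
```

Here the unigram and bigram precisions are non-zero while the 3-gram and 4-gram precisions are zero, which is the situation the `SmoothingFunction()` hint in the warning is meant to address.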

In [None]:
df['Generated Text'] = df['Generated Text'].apply(convert)

In [None]:
# Load the embedding model once; reloading it inside the function repeats the work on every call
st_model = SentenceTransformer('bert-base-nli-mean-tokens')

def similarity_score(actual_headline,predicted_headline):
  actual_headline_embeddings = st_model.encode(actual_headline)
  predicted_headline_embeddings = st_model.encode(predicted_headline)

  distance = scipy.spatial.distance.cdist([actual_headline_embeddings], [predicted_headline_embeddings], "cosine")[0]
  return np.round(1-distance,4)

In [None]:
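def rouge_sc(actual_headline,predicted_headline):
  # Note: Rouge.get_scores takes (hypotheses, references). The reference is passed first here,
  # so the returned 'p' and 'r' are swapped relative to the usual convention; 'f' is unaffected.
  rouge = Rouge()

  rouge_score = rouge.get_scores(actual_headline, predicted_headline)
  rouge_scores = rouge_score[0]['rouge-l']

  return rouge_scores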

In [None]:
def bleu_sc(actual_headline,predicted_headline):
  hypothesis = predicted_headline.split()
  reference = actual_headline.split()
  references = [reference] # list of references for 1 sentence.
  list_of_references = [references] # list of references for all sentences in corpus.
  list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
  bleu_score = corpus_bleu(list_of_references, list_of_hypotheses)

  return bleu_score

In [None]:
df = df[:300]

In [None]:
similarity_sc = []
rouge_score = []
bleu_score = []

In [None]:
for i in range(len(df)):
  similarity_sc.append(similarity_score(df['Actual Text'][i],df['Generated Text'][i]))
  rouge_score.append(rouge_sc(df['Actual Text'][i],df['Generated Text'][i]))
  bleu_score.append(bleu_sc(df['Actual Text'][i],df['Generated Text'][i]))
  print(i)

0


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


1


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


2
3


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


[... indices 4 through 278, printed one per line; the output is truncated after that ...]

TypeError: ignored

In [None]:
df1 = pd.DataFrame({'similarity scores':similarity_sc,'bleu_score':bleu_score,'rouge_score':rouge_score})

In [None]:
df1

Unnamed: 0,similarity scores,bleu_score,rouge_score
0,[0.8],0.497792,"{'f': 0.35714285255102046, 'p': 0.5, 'r': 0.27..."
1,[0.8],0.497792,"{'f': 0.35714285255102046, 'p': 0.5, 'r': 0.27..."
2,[0.9632],0.638943,"{'f': 0.7777777728395062, 'p': 0.875, 'r': 0.7}"
3,[0.7008],0.537285,"{'f': 0.09523809034013632, 'p': 0.111111111111..."
4,[0.6943],0.210430,"{'f': 0.36363635867768596, 'p': 0.4, 'r': 0.33..."
...,...,...,...
296,[0.8686],0.334370,"{'f': 0.2399999953920001, 'p': 0.3333333333333..."
297,[0.6532],0.258371,"{'f': 0.157894732963989, 'p': 0.3, 'r': 0.1071..."
298,[0.7386],0.500000,"{'f': 0.08333332888888913, 'p': 0.125, 'r': 0...."
299,[0.4735],0.537285,"{'f': 0.09523809024943337, 'p': 0.1, 'r': 0.09..."


In [None]:
from __future__ import division
max_value = max(similarity_sc)
min_value = min(similarity_sc)
avg_value = 0 if len(similarity_sc) == 0 else sum(similarity_sc)/len(similarity_sc)

In [None]:
max_value

array([0.9953])

In [None]:
min_value

array([-0.0138])

In [None]:
avg_value

array([0.56909671])
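The minimum, maximum, and mean above come back as one-element arrays because each similarity score was stored as an array. A small helper can flatten the scores to floats before aggregating; the sketch below is an assumption about how one might tidy this step, with the hypothetical name `summarize_scores`, and it uses plain lists in place of NumPy arrays.

```python
from statistics import mean

def summarize_scores(scores):
    """Flatten one-element containers to floats and report min, max and mean."""
    flat = [float(s[0]) if hasattr(s, '__len__') else float(s) for s in scores]
    if not flat:
        # Mirror the notebook's guard against an empty score list
        return {'min': 0.0, 'max': 0.0, 'mean': 0.0}
    return {'min': min(flat), 'max': max(flat), 'mean': mean(flat)}

summary_stats = summarize_scores([[0.8], [0.4], 0.6])
print(summary_stats)
```

Returning plain floats makes the values easier to log or format than one-element arrays.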