# Title Generator Using T5

Reference: [Generating arXiv paper titles from abstracts (shivanandroy.com)](https://shivanandroy.com/transformers-generating-arxiv-papers-title-from-abstracts/)

## Load Data

In [2]:
import json

data = '../dataset/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(data, 'r') as f:
        for line in f:
            yield line

In [3]:
metadata = get_metadata()

for paper in metadata:
    paper_dict = json.loads(paper)
    print('Title: {}\n\nAbstract: {}\nRef: {}'.format(paper_dict.get('title'), paper_dict.get('abstract'), paper_dict.get('journal-ref')))
    break

Title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies

Abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
tha

Because of limited computational resources, only papers whose `journal-ref` ends in a year from 2017 to 2020 are kept.

In [4]:
titles = []
abstracts = []
years = []
metadata = get_metadata()

for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    try:
        year = int(ref[-4:])
        if 2016 < year < 2021:
            years.append(year)
            titles.append(paper_dict.get('title'))
            abstracts.append(paper_dict.get('abstract'))
    except (TypeError, ValueError):
        # journal-ref is missing (None) or does not end in a four-digit year
        pass

len(titles), len(abstracts), len(years)

(18566, 18566, 18566)

About 18.6K papers have a `journal-ref` that ends in a year from 2017 to 2020.
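Slicing the last four characters only works when `journal-ref` ends with the year. A reference that ends in a closing parenthesis or a page number fails the `int()` conversion and is skipped. The sketch below shows one possible alternative that searches the whole string for a four-digit year. The helper name `extract_year` and the pattern are assumptions, not part of the original notebook, and using it would change the counts reported above.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches standalone four-digit numbers starting with 19 or 20.
YEAR_PATTERN = re.compile(r'\b(?:19|20)\d{2}\b')

def extract_year(ref):
    """Return the last year-like number in a journal-ref string, or None."""
    if not ref:
        return None
    matches = YEAR_PATTERN.findall(ref)
    return int(matches[-1]) if matches else None

for ref in ['Phys.Rev.D76:013009,2007', 'J. Math. Phys. 58 (2017) 041701', None]:
    print(ref, '->', extract_year(ref))
```

This is a heuristic: a volume or page number that happens to look like a year would also match, so the results should be spot-checked before relying on them.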

### Convert Into DataFrames

In [5]:
import pandas as pd

papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'year': years
})

papers.head()

| | title | abstract | year |
|---|---|---|---|
| 0 | On the Cohomological Derivation of Yang-Mills ... | We present a brief review of the cohomologic... | 2017 |
| 1 | Regularity of solutions of the isoperimetric p... | In this work we consider a question in the c... | 2018 |
| 2 | Asymptotic theory of least squares estimators ... | This paper considers the effect of least squ... | 2017 |
| 3 | Teichm\"uller Structures and Dual Geometric Gi... | The Gibbs measure theory for smooth potentia... | 2020 |
| 4 | Distributional Schwarzschild Geometry from non... | In this paper we leave the neighborhood of t... | 2018 |


### Split Train-Test Data

In [6]:
# Splitting data using the 80:20 training-testing ratio
eval_df = papers.sample(frac=0.2, random_state=673)
train_df = papers.drop(eval_df.index)

train_df.shape, eval_df.shape

((14853, 3), (3713, 3))
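The `sample` / `drop` pattern used for the split can be checked on a small frame. The following is a minimal sketch with made-up rows standing in for `papers`; it confirms that the two parts are disjoint and together cover every row.

```python
import pandas as pd

# Toy frame standing in for `papers`; the contents are made up for illustration.
toy = pd.DataFrame({
    'title': [f'title {i}' for i in range(10)],
    'abstract': [f'abstract {i}' for i in range(10)],
})

eval_df = toy.sample(frac=0.2, random_state=673)  # 20% of the rows, reproducible
train_df = toy.drop(eval_df.index)                # every row that was not sampled

# The two parts should share no index labels and together cover the frame.
overlap = set(train_df.index) & set(eval_df.index)
print(len(train_df), len(eval_df), len(overlap))
```

Dropping by index label relies on the index being unique, which holds for the default integer index created above.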

In [7]:
from datasets import Dataset

train_data = Dataset.from_pandas(train_df)
eval_data = Dataset.from_pandas(eval_df)

In [8]:
train_data, eval_data

(Dataset({
     features: ['title', 'abstract', 'year', '__index_level_0__'],
     num_rows: 14853
 }),
 Dataset({
     features: ['title', 'abstract', 'year', '__index_level_0__'],
     num_rows: 3713
 }))

In [9]:
train_data = train_data.remove_columns('__index_level_0__')
eval_data = eval_data.remove_columns('__index_level_0__')

train_data, eval_data

(Dataset({
     features: ['title', 'abstract', 'year'],
     num_rows: 14853
 }),
 Dataset({
     features: ['title', 'abstract', 'year'],
     num_rows: 3713
 }))

In [10]:
train_data.shape, eval_data.shape

((14853, 3), (3713, 3))

## Preprocessing

### Tokenize Features

In [11]:
from transformers import AutoTokenizer

path = 'sshleifer/distilbart-cnn-12-6'

tokenizer = AutoTokenizer.from_pretrained(path)

2022-01-05 09:34:19.900121: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-05 09:34:19.900175: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [12]:
train_tok = tokenizer(train_data['abstract'], padding='max_length', truncation=True)
eval_tok = tokenizer(eval_data['abstract'], padding='max_length', truncation=True)

### Pair Label and Feature

In [13]:
import torch

class ArXiVDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

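# Tokenize the titles as well: torch.tensor() cannot convert raw strings.
# max_length=64 is an assumed cap on title length, not a tuned value.
train_labels = tokenizer(train_data['title'], padding='max_length', truncation=True, max_length=64)['input_ids']
eval_labels = tokenizer(eval_data['title'], padding='max_length', truncation=True, max_length=64)['input_ids']

train_dataset = ArXiVDataset(train_tok, train_labels)
eval_dataset = ArXiVDataset(eval_tok, eval_labels)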
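If the titles are tokenized into padded ID lists to serve as labels, the padding positions are commonly replaced with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the training loss. A minimal sketch follows. The helper name `mask_padding` is hypothetical, and the pad ID of `1` in the example is an assumption; in practice it would be read from `tokenizer.pad_token_id`.

```python
def mask_padding(label_ids, pad_id, ignore_index=-100):
    """Replace padding token IDs with ignore_index in every label sequence."""
    return [
        [tok if tok != pad_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Two made-up label sequences padded to length 5 with an assumed pad ID of 1.
padded = [[0, 713, 2, 1, 1], [0, 2, 1, 1, 1]]
masked = mask_padding(padded, pad_id=1)
print(masked)
```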

## Model

In [14]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(path)

## Training

In [16]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='trainer/' + path,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4
)

NVIDIA GeForce RTX 3050 Ti Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3050 Ti Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



In [17]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
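The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly, and the training log mentions `WANDB_DISABLED` as the switch for Weights & Biases logging. One way to apply both, sketched here under the assumption that this runs at the top of the notebook before the tokenizer is first used:

```python
import os

# Set before the tokenizer is first used so the fork warning is not triggered.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Turns off automatic Weights & Biases logging, as named in the training log.
os.environ["WANDB_DISABLED"] = "true"
```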


In [18]:
trainer.train()

***** Running training *****
  Num examples = 14853
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3716
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: You can find your API key in your browser here: https://wandb.ai/authorize


## Training with simpletransformers

As an alternative, the `simpletransformers` library can be used to train a T5 model (`t5-small`). The cells below are commented out.

In [11]:
# # Adding input and target columns
# papers = papers[['title', 'abstract']]
# papers.columns = ['target_text', 'input_text']

# # Adding prefix columns
# papers['prefix'] = 'summarize'

# # Splitting data using the 80:20 training-testing ratio
# eval_df = papers.sample(frac=0.2, random_state=673)
# train_df = papers.drop(eval_df.index)

# train_df.shape, eval_df.shape

In [12]:
# import logging

# from simpletransformers.t5 import T5Model

# # Setting logging information
# logging.basicConfig(level=logging.INFO)
# transformers_logger = logging.getLogger('transformers')
# transformers_logger.setLevel(logging.WARNING)

# # Training parameters
# params = {
#     'reprocess_input_data': True,
#     'overwrite_output_dir': True,
#     'max_seq_length': 512,
#     'train_batch_size': 16,
#     'num_train_epochs': 4,
#     'best_model_dir': '../models/title-generator',
#     'fp16': False
# }

# # Creating model
# model = T5Model(model_type='t5', model_name='t5-small', args=params, use_cuda=False)

# # Training model
# model.train_model(train_df)

# # Evaluating results
# results = model.eval_model(eval_df)

# print(results)