# Title Generator Using T5

Reference: [Shivanand Roy's tutorial on generating arXiv paper titles from abstracts](https://shivanandroy.com/transformers-generating-arxiv-papers-title-from-abstracts/)

## Load Data

In [1]:
import json

data = '../dataset/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(data, 'r') as f:
        for line in f:
            yield line

In [3]:
metadata = get_metadata()

for paper in metadata:
    paper_dict = json.loads(paper)
    print('Title: {}\n\nAbstract: {}\nRef: {}'.format(paper_dict.get('title'), paper_dict.get('abstract'), paper_dict.get('journal-ref')))
    break

Title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies

Abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
tha
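The snapshot is read as JSON Lines: one JSON object per line, yielded lazily so the whole file never has to fit in memory. As a minimal sketch of the same pattern, the block below parses each line inside the generator and uses `itertools.islice` to peek at the first record instead of a `for`/`break` loop. The helper name `iter_records` and the two-record sample file are assumptions made for illustration; they are not part of the original notebook.

```python
import itertools
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, 'r') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical two-record sample standing in for the real snapshot file.
sample_lines = [
    json.dumps({'title': 'Paper A', 'abstract': 'First abstract.', 'journal-ref': None}),
    json.dumps({'title': 'Paper B', 'abstract': 'Second abstract.', 'journal-ref': 'Example J. 1 (2019)'}),
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('\n'.join(sample_lines) + '\n')
    sample_path = tmp.name

# islice stops after one record, so the rest of the file is never read.
first_records = list(itertools.islice(iter_records(sample_path), 1))
print(first_records[0]['title'])

os.remove(sample_path)
```

Pointing `iter_records` at the real snapshot path would behave the same way, assuming every line of that file is a valid JSON object.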

Because of limited compute, keep only papers whose `journal-ref` field ends in a year from 2017 to 2020.

In [5]:
titles = []
abstracts = []
years = []
metadata = get_metadata()

for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    try:
        year = int(ref[-4:])
    except (TypeError, ValueError):
        # No journal reference, or its last four characters are not a year
        continue
    if 2016 < year < 2021:
        years.append(year)
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))

len(titles), len(abstracts), len(years)

(18566, 18566, 18566)

About 18.6K papers have a journal reference ending in a year from 2017 to 2020.
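The filter above reads the year from the last four characters of `journal-ref`, so it silently skips any reference where the year sits elsewhere, for example before a page number. A possible alternative, sketched here under the assumption that the year appears as a standalone four-digit number starting with 19 or 20, is to search the whole string with a regular expression. The helper `extract_year` and the sample strings are hypothetical, and the pattern can still be fooled by page ranges that look like years.

```python
import re

# A four-digit number starting with 19 or 20 that is not part of a longer digit run.
YEAR_PATTERN = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')


def extract_year(journal_ref):
    """Return the last year-like number in journal_ref, or None if there is none."""
    if not journal_ref:
        return None
    matches = YEAR_PATTERN.findall(journal_ref)
    return int(matches[-1]) if matches else None


# Hypothetical journal references, not taken from the dataset.
print(extract_year('Example J. 76:013009,2017'))
print(extract_year('Example J. 12 (2019) 345'))
print(extract_year(None))
```

Swapping this in for `int(ref[-4:])` would change which papers pass the filter, so the counts reported above would no longer apply.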

In [7]:
import pandas as pd

papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'year': years
})

papers.head()

| | title | abstract | year |
|---|---|---|---|
| 0 | On the Cohomological Derivation of Yang-Mills ... | We present a brief review of the cohomologic... | 2017 |
| 1 | Regularity of solutions of the isoperimetric p... | In this work we consider a question in the c... | 2018 |
| 2 | Asymptotic theory of least squares estimators ... | This paper considers the effect of least squ... | 2017 |
| 3 | Teichm\"uller Structures and Dual Geometric Gi... | The Gibbs measure theory for smooth potentia... | 2020 |
| 4 | Distributional Schwarzschild Geometry from non... | In this paper we leave the neighborhood of t... | 2018 |


## Training

The `simpletransformers` library is used to fine-tune a pretrained T5 model (`t5-small`).

In [8]:
# Keep the title (target) and abstract (input), copying to avoid modifying a view
papers = papers[['title', 'abstract']].copy()
papers.columns = ['target_text', 'input_text']

# Add the task prefix column
papers['prefix'] = 'summarize'

# Split the data 80:20 into training and evaluation sets
eval_df = papers.sample(frac=0.2, random_state=673)
train_df = papers.drop(eval_df.index)

train_df.shape, eval_df.shape

((14853, 3), (3713, 3))
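The `prefix` column tells T5 which task each row belongs to. To my understanding, `simpletransformers` joins the prefix and the input as `prefix: input_text` during training, and expects the same format in the strings passed to `predict`. The sketch below builds such strings in plain Python; the helper `to_model_inputs` and the sample rows are hypothetical, and the exact joining behaviour should be checked against the installed library version.

```python
def to_model_inputs(rows):
    """Join each row's prefix and input text as 'prefix: input_text'."""
    return [f"{row['prefix']}: {row['input_text']}" for row in rows]


# Hypothetical rows mirroring the prefix / input_text columns used above.
demo_rows = [
    {'prefix': 'summarize', 'input_text': 'We study X.'},
    {'prefix': 'summarize', 'input_text': 'We propose Y.'},
]

inputs = to_model_inputs(demo_rows)
print(inputs[0])

# With a trained model, these strings could then be passed on, for example:
# predicted_titles = model.predict(inputs)
```

Keeping the prefix identical between training and prediction matters, since the model only ever saw abstracts introduced by `summarize`.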

In [13]:
import logging

from simpletransformers.t5 import T5Model

# Setting logging information
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# Training parameters
params = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'max_seq_length': 512,
    'train_batch_size': 16,
    'num_train_epochs': 4,
    'best_model_dir': '../models/title-generator'
}

# Creating model
model = T5Model(model_type='t5', model_name='t5-small', args=params, use_cuda=True)

# Training model
model.train_model(train_df)

# Evaluating results
results = model.eval_model(eval_df)

print(results)

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/14853 [00:00<?, ?it/s]

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.
