### Domain Controlled 

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import logging
import wandb
import random
import pandas as pd
import torch
from simpletransformers.t5 import T5Model
# from pytorch_lightning.metrics.nlp import BLEUScore


In [3]:
def load_dataset(include_domain=False):
    
    train_df = pd.read_csv("./papers_train_small.csv")
    eval_df = pd.read_csv("./papers_eval_small.csv")
    
    # dropna() returns a new frame, so the result has to be assigned back
    train_df = train_df.dropna()
    eval_df = eval_df.dropna()
  
    # add domain tokens
    if include_domain:
        train_df.abstract = train_df.abstract + " @domain: " + train_df.categories
        eval_df.abstract = eval_df.abstract + " @domain: " + eval_df.categories
        
    # copy() so the prefix column below is added to an independent frame
    train_df = train_df[['title','abstract']].copy()
    eval_df = eval_df[['title','abstract']].copy()
    
    train_df.columns = ['target_text', 'input_text']
    eval_df.columns = ['target_text', 'input_text']
    
    
    # task tokens
    train_df['prefix'] = "summarize"
    eval_df['prefix'] = "summarize"
    
    return train_df, eval_df
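
The effect of `include_domain=True` is easier to see on a single string. The following is a minimal sketch, not part of the original notebook: `build_input_text` and `build_prediction_input` are hypothetical helper names, and the sample abstract and category are made up. It mirrors the `" @domain: "` concatenation in `load_dataset` and the `"summarize: "` prefix that the prediction cells prepend by hand.

```python
def build_input_text(abstract, categories=None):
    """Append the domain tag to an abstract when categories are given."""
    if categories is None:
        return abstract
    return abstract + " @domain: " + categories


def build_prediction_input(input_text, prefix="summarize"):
    """Prepend the task prefix the same way the prediction cells do."""
    return prefix + ": " + input_text


# Made-up sample values, used only to show the resulting string.
tagged = build_input_text("We study a toy problem.", "hep-th")
print(build_prediction_input(tagged))
```

With the domain tag enabled, the category string ends up at the tail of the input, so it may be cut off when an abstract is longer than `max_seq_length` tokens.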

    

In [4]:
%%time
train_df, eval_df = load_dataset(include_domain=False)

CPU times: user 864 ms, sys: 101 ms, total: 965 ms
Wall time: 2.98 s


In [5]:
print(train_df.shape, eval_df.shape)

(100213, 3) (10520, 3)


In [11]:
model_args = {
    "max_seq_length": 512,
    "truncation":True,
    "longest_first":True,
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "num_train_epochs": 5,
    "evaluate_during_training": False,
    "evaluate_during_training_steps": 1000,
    "evaluate_during_training_verbose": True,
    
    "use_multiprocessing": False,
    "fp16": False,

    "save_steps": -1,
    "save_eval_checkpoints": True,
    "save_model_every_epoch": True,

    "reprocess_input_data": True,
    "overwrite_output_dir": True,

    "wandb_project": "title-generation",
    
}
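
To get a feel for what these settings imply, the batch counts can be worked out from the dataset shapes printed earlier. This is a rough sketch under stated assumptions (a single device and no gradient accumulation, so one batch is one optimizer step); the variable names are illustrative and not part of the notebook.

```python
import math

# Figures taken from the notebook: printed dataset shapes and model_args.
train_rows, eval_rows = 100213, 10520
train_batch_size, eval_batch_size = 8, 8
num_train_epochs = 5

# Assumption: one device and no gradient accumulation, so one batch = one step.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_train_steps = steps_per_epoch * num_train_epochs
eval_batches = math.ceil(eval_rows / eval_batch_size)

print(steps_per_epoch, total_train_steps, eval_batches)
```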


In [8]:
# Load the T5 model stored under ./data/results/outputs/
model = T5Model("./data/results/outputs/", args=model_args, use_cuda=True)

In [None]:
# Evaluate T5 Model on new task
results = model.eval_model(eval_df)

In [None]:
print(results)

In [9]:
random_num = 351
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Generating outputs: 100%|██████████| 1/1 [00:04<00:00,  4.62s/it]
Decoding outputs: 100%|██████████| 1/1 [00:01<00:00,  1.06s/it]


Actual Title: Hydrodynamics and beyond in the strongly coupled N=4 plasma
Predicted Title: ['Hydrodynamic and higher quasinormal modes in AdS black hole background']
Actual Abstract: ['summarize:   We continue our investigations on the relation between hydrodynamic and\nhigher quasinormal modes in the AdS black hole background started in\narXiv:0710.4458 [hep-th]. As is well known, the quasinormal modes can be\ninterpreted as the poles of the retarded Green functions of the dual N=4 gauge\ntheory at finite temperature. The response to a generic perturbation is\ndetermined by the residues of the poles. We compute these residues numerically\nfor energy-momentum and R-charge correlators. We find that the diffusion modes\nbehave in a similar way: at small wavelengths the residues go over into a form\nof a damped oscillation and therefore these modes decouple at short distances.\nThe sound mode behaves differently: its residue does not decay and at short\nwavelengths this mode behaves as th
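
The spot-check cells all follow the same pattern, and `model.predict` returns a list, which is why the predicted title prints inside brackets. A small wrapper can make both points explicit. This is only a sketch: `predict_title` is a hypothetical helper, and `EchoModel` is a made-up stand-in so the example runs without the trained weights; it is not how `T5Model` generates text.

```python
def predict_title(model, row, prefix="summarize"):
    """Return (actual_title, predicted_title) for one evaluation row."""
    model_input = f"{prefix}: {row['input_text']}"
    predictions = model.predict([model_input])  # one input, one-element list
    return row["target_text"], predictions[0]


class EchoModel:
    """Hypothetical stand-in for T5Model so the sketch runs without weights."""

    def predict(self, inputs):
        # Return the first five words after the prefix as a fake "title".
        return [" ".join(text.split(": ", 1)[1].split()[:5]) for text in inputs]


row = {
    "target_text": "A made-up title",
    "input_text": "We study a made-up problem in detail.",
}
actual, predicted = predict_title(EchoModel(), row)
print(f"Actual Title: {actual}")
print(f"Predicted Title: {predicted}")
```

With the notebook's own objects the call would look like `predict_title(model, eval_df.iloc[351])`, since a pandas row supports the same `row['input_text']` lookup as the dictionary used here.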

In [10]:
random_num = 478
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]
Generating outputs: 100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
Decoding outputs: 100%|██████████| 1/1 [00:00<00:00,  1.59it/s]

Actual Title: Sources of Pressure in Titan's Plasma Environment
Predicted Title: ['Magnetic pressure and oscillation phase of the plasmasheet in the near-Titan space']
Actual Abstract: ["summarize:   In order to analyze varying plasma conditions upstream of Titan, we have\ncombined a physical model of Saturn's plasmadisk with a geometrical model of\nthe oscillating current sheet. During modeled oscillation phases where Titan is\nfurthest from the current sheet, the main sources of plasma pressure in the\nnear-Titan space are the magnetic pressure and, for disturbed conditions, the\nhot plasma pressure. When Titan is at the center of the sheet, the main source\nis the dynamic pressure associated with Saturn's cold, subcorotating plasma.\nTotal pressure at Titan (dynamic plus thermal plus magnetic) typically\nincreases by a factor of five as the current sheet center is approached. The\npredicted incident plasma flow direction deviates from the orbital plane of\nTitan by < 10 deg. These r




In [12]:
random_num = 187
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]
Generating outputs: 100%|██████████| 1/1 [00:00<00:00,  6.67it/s]
Decoding outputs: 100%|██████████| 1/1 [00:00<00:00,  3.89it/s]

Actual Title: D=5 M-theory radion supermultiplet dynamics
Predicted Title: ['Radion Supermultiplet and the Cosmological Model']
Actual Abstract: ['summarize:   We show how the bosonic sector of the radion supermultiplet plus d=4, N=1\nsupergravity emerge from a consistent braneworld Kaluza-Klein reduction of D=5\nM--theory. The radion and its associated pseudoscalar form an SL(2,R)/U(1)\nnonlinear sigma model. This braneworld system admits its own brane solution in\nthe form of a 2-supercharge supersymmetric string. Requiring this to be free of\nsingularities leads to an SL(2,Z) identification of the sigma model target\nspace. The resulting radion mode has a minimum length; we suggest that this\ncould be used to avoid the occurrence of singularities in brane-brane\ncollisions. We discuss possible supersymmetric potentials for the radion\nsupermultiplet and their relation to cosmological models such as the cyclic\nuniverse or hybrid inflation.\n']
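
Eyeballing three examples only goes so far. The commented-out `BLEUScore` import suggests automatic scoring was considered; the sketch below is not BLEU but a simpler, swapped-in measure, a case-insensitive token-overlap F1 between predicted and actual titles. The function name `token_f1` is hypothetical, and the title pairs are copied from the outputs above.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Case-insensitive token-overlap F1 between two whitespace-split strings."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# (actual title, predicted title) pairs from the three spot checks above.
pairs = [
    ("Hydrodynamics and beyond in the strongly coupled N=4 plasma",
     "Hydrodynamic and higher quasinormal modes in AdS black hole background"),
    ("Sources of Pressure in Titan's Plasma Environment",
     "Magnetic pressure and oscillation phase of the plasmasheet in the near-Titan space"),
    ("D=5 M-theory radion supermultiplet dynamics",
     "Radion Supermultiplet and the Cosmological Model"),
]

scores = [token_f1(predicted, actual) for actual, predicted in pairs]
for (actual, _), score in zip(pairs, scores):
    print(f"{score:.3f}  {actual}")
```

Exact-token matching is strict: "Hydrodynamic" and "Hydrodynamics" count as different tokens, so scores like these understate how close some predictions are.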



