<a href="https://colab.research.google.com/github/limyansky/GPT2_ArXiv_SnarXiv/blob/main/ArXiv_Any.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning GPT2 to Generate Paper Titles
Please read the accompanying webpage _here_ for more information.  

Of particular note:  
Due to usage restrictions in the free version of Google Colaboratory, I could not tune the model as thoroughly as I would have liked.

# Environment Setup 
- Install needed packages
- Import needed libraries

In [1]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import tensorflow as tf
import json # read downloaded data file
import re   # regular expressions
from datasets import load_dataset, Dataset, load_from_disk
from transformers import (GPT2TokenizerFast, TFGPT2LMHeadModel, AutoConfig,
                          DataCollatorForLanguageModeling, pipeline)
from tqdm import tqdm


# Data
I select subsets of paper titles belonging to specific arXiv categories. From over two million papers, this leaves the following number of papers per category:  

Tissues and Organs: 1,992  
Condensed Matter - Materials Science: 82,412  
High Energy Astrophysical Phenomena: 50,477  
High Energy Physics - Experiment: 49,692

Filtering the data takes ~3 minutes per category, so saving the filtered datasets to disk and reloading them is preferable when debugging.
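
As a rough illustration of the filtering idea, the sketch below applies it to a few made-up records in plain Python. The records and the helper names `has_category` and `titles_in_category` are hypothetical, and the sketch assumes the `categories` field is a space-separated string of arXiv tags. Unlike the substring test used in the notebook cell, it compares whole tags.

```python
# Toy stand-ins for arXiv metadata records (titles and tags are made up).
records = [
    {"title": "A toy title about pulsars", "categories": "astro-ph.HE gr-qc"},
    {"title": "A toy title about alloys", "categories": "cond-mat.mtrl-sci"},
    {"title": "A toy title about jets", "categories": "hep-ex astro-ph.HE"},
]

def has_category(record, tag):
    """Return True if `tag` is one of the record's space-separated categories."""
    return tag in record["categories"].split()

def titles_in_category(records, tag):
    """Keep only the titles of records that carry `tag`."""
    return [r["title"] for r in records if has_category(r, tag)]

astro_titles = titles_in_category(records, "astro-ph.HE")
print(astro_titles)
```

Matching whole tags avoids accidental hits when one tag is a prefix of another, which may or may not matter for the four categories used here.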

## Filtering Data


In [3]:
json_file = '/content/drive/MyDrive/arxiv-metadata-oai-snapshot.json'

# Note - need to set "split", but all the data is loaded
data_all = load_dataset('json', data_files=json_file, split='train')

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-75b479113055f308/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-75b479113055f308/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


In [None]:
tissue_tag = 'q-bio.TO'
materials_science_tag = 'cond-mat.mtrl-sci'
astro_tag = 'astro-ph.HE'
hepEX_tag = 'hep-ex'

def search_cat(in_data, tag):
  orig_keys = list(in_data.features.keys())
  remove_keys = [to_delete for to_delete in orig_keys if to_delete != 'title']

  out_data = in_data.filter(lambda item: tag in item['categories'])
  out_data = out_data.map(remove_columns=remove_keys)
  
  return out_data

tissues_organs   = search_cat(data_all, tissue_tag)
condensed_matter = search_cat(data_all, materials_science_tag)
astrophysics     = search_cat(data_all, astro_tag)
hepEX            = search_cat(data_all, hepEX_tag)


In [None]:
path = "/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Data/"
tissues_organs.save_to_disk(path + 'tissues_organs.hf')
condensed_matter.save_to_disk(path + 'condensed_matter.hf')
hepEX.save_to_disk(path + 'hepEX.hf')
astrophysics.save_to_disk(path + 'astrophysics.hf')

## Loading Filtered Data

In [3]:
path = "/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Data/"
tissues_organs   = load_from_disk(path + 'tissues_organs.hf')
condensed_matter = load_from_disk(path + 'condensed_matter.hf')
hepEX            = load_from_disk(path + 'hepEX.hf')
astrophysics     = load_from_disk(path + 'astrophysics.hf')

In [4]:
def print_length(name, dataset):
  string = '{}: length {}'
  print(string.format(name, len(dataset)))

print_length('Tissues and Organs', tissues_organs)
print_length('Materials Science', condensed_matter)
print_length('High Energy Physics - Experiment', hepEX)
print_length('High Energy Astrophysical Phenomena', astrophysics)

Tissues and Organs: length 1992
Materials Science: length 82412
High Energy Physics - Experiment: length 49692
High Energy Astrophysical Phenomena: length 50477


# SELECT A DATASET
Other than "Tissues and Organs", these datasets are large enough to produce reasonable results with only a single training epoch. 
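
The split sizes can be checked with a little arithmetic. The sketch below assumes the test fraction is rounded up and the training fraction is rounded down; that rounding rule is an assumption, but it reproduces the 45,429 training and 5,048 test examples that appear in the mapping progress bars for the astrophysics dataset. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional train/test split: test rounded up, train rounded down."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test

# High Energy Astrophysical Phenomena: 50,477 titles with 10% held out.
astro_train, astro_test = split_sizes(50477, 0.1)
print(astro_train, astro_test)
```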

In [6]:
# working_data = condensed_matter
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/condensed_matter'

# working_data = hepEX
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/hepEX'

working_data = astrophysics
save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/astrophysics'


# This shuffles data as well. 
train_test_set = working_data.train_test_split(test_size=0.1, 
                                                seed=42)
train = train_test_set['train']
test = train_test_set['test']

# Data Processing Pipeline
- Remove special characters from titles
- Generate statements of the form: 
```<|startoftext|> Sparsity-certifying Graph Decompositions <|endoftext|>``` 
- Tokenize Strings
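
The first two steps can be sketched on a single title as follows. The regular expressions and the template are the ones used in the pipeline functions; the example title and the helper name `clean_title` are made up for illustration.

```python
import re

TEMPLATE = '<|startoftext|> {} <|endoftext|>'

def clean_title(title):
    """Keep letters, digits, spaces, and dashes, then collapse repeated spaces."""
    title = re.sub(r'[^a-zA-Z0-9 -]+', '', title)
    return re.sub(r' +', ' ', title)

raw_title = 'Dark matter & neutron stars: a toy example'  # made-up title
training_sentence = TEMPLATE.format(clean_title(raw_title))
print(training_sentence)
```

Removing the `&` leaves two adjacent spaces, which is why the second substitution is needed.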

### Tokenizer
I load the tokenizer and add "start", "end", and "pad" tokens. In particular, setting the "pad" token to the "end" token causes GPT2 to keep generating text until the maximum length is reached. We want titles that come to a natural end and vary in length, hence the separate padding token. However, GPT2 was only trained with an "end" token, so its token embeddings must be resized to accommodate the extra tokens. **Our tuned GPT2 models will need to be matched with a similarly defined tokenizer to work.** 
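
One plausible way to see why a separate "pad" token helps is to look at how labels are typically built for causal language modeling: positions holding the pad token are replaced with -100 so the loss ignores them. The pure-Python sketch below mimics that idea. The token ids and the helper `make_labels` are illustrative assumptions, not the real GPT2 vocabulary or the library's code.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss

def make_labels(input_ids, pad_id):
    """Copy input ids to labels, masking every position that holds the pad id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Illustrative ids only: 1 = start, 2 = end, 3 = pad, others = title tokens.
START, END, PAD = 1, 2, 3
padded_title = [START, 11, 12, 13, END, PAD, PAD]

# Separate pad token: the end token stays in the labels.
labels_separate = make_labels(padded_title, pad_id=PAD)

# Pad reused as end: padding is written with END, so the real end is masked too.
shared = [END if tok == PAD else tok for tok in padded_title]
labels_shared = make_labels(shared, pad_id=END)

print(labels_separate)
print(labels_shared)
```

In the second case the end-of-title position contributes nothing to the loss, so the model gets no training signal for when to stop.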

In [5]:
# Load Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2',
                                              eos_token='<|endoftext|>',
                                              bos_token='<|startoftext|>',
                                              pad_token='<pad>')

def tokenize_entry(input):
  output = tokenizer(input['training_sentence'], padding=False)
  return output

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Pipeline Functions

In [6]:
def clean_txt(input_obj):
  """ Removes special characters from a batch of strings.

  Note that a batch of strings is required, and this will not work on single
  dataset items.

  Args: 
    input_obj (dict): Dictionary of strings to filter.

  Returns:
    List of strings keeping only letters, numbers, (single) spaces, and dashes.
  """
  output = []
  for string in input_obj['title']:
    string = re.sub(r'[^a-zA-Z0-9 -]+', '', string)
    string = re.sub(r' +', ' ', string)
    output.append(string)
  return output

def training_string(titles):
  """ Generates a string used to tune the natural language model.

  Args: 
    
    titles (list(str)): Titles of papers as elements in list. 

  Returns: 
    Lists of strings like: ['<|startoftext|> title <|endoftext|>']
  """

  training_template = '<|startoftext|> {} <|endoftext|>'
  output = []
  for title in titles:
    output.append(training_template.format(title))
  
  return output

def process_entry(input):
  """ Adds 'training_sentence' keys/items to dataset.

  Args:
  input (dict): A batch of data from a dataset.

  Returns:
  input with 'training_sentence' keys/items added.

  """
  clean_title = clean_txt(input)
  final_string = training_string(clean_title)
  input['training_sentence'] = final_string
  return input

def proc_token(input):
  """ Map this function to a dataset to tokenize strings.
  
  Args: 
  input (dict): A batch of data from a dataset.

  Returns:
  Tokenized string encodings, including attention masks.

  """
  input = process_entry(input)
  tokenized = tokenize_entry(input)
  return tokenized

## Creating and running the Pipeline

In [9]:
def map_to_dataset(dataset):
  """ Fully performs tokenization and removes extra keys
  """
  dataset = dataset.map(lambda x: proc_token(x),
                  remove_columns=['title'],
                  batched=True,
                  batch_size=64)
  dataset = dataset.map(remove_columns=['training_sentence'])

  return dataset

train = map_to_dataset(train)
test = map_to_dataset(test)

Map:   0%|          | 0/45429 [00:00<?, ? examples/s]

Map:   0%|          | 0/45429 [00:00<?, ? examples/s]

Map:   0%|          | 0/5048 [00:00<?, ? examples/s]

Map:   0%|          | 0/5048 [00:00<?, ? examples/s]

# Fine Tune the Model

## Load the Model
This cell also sets up the optimizer and compiles the model. The data collator is created in the next section.

In [10]:
#https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow

config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

# The optimizer to use for training.
# These settings come from: 
# https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow
# and work well.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5,
                                     epsilon=1e-08, clipnorm=1.0)

# Compile the model
gpt2_model.compile(optimizer)

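# Request mixed-precision float16 for speed.
# Note: Keras applies the global policy to layers created afterwards, so this
# call likely needs to run before the model is loaded to take effect.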
tf.keras.mixed_precision.set_global_policy("mixed_float16")

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


## Convert HuggingFace dataset to TensorFlow dataset

In [11]:
# Prepares batches for training, such as by adding padding.
# Using this also takes care of creating the training labels.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False,
                                                return_tensors='tf')

# Convert the Hugging Face datasets to something TensorFlow can work with
tf_train = gpt2_model.prepare_tf_dataset(
    train,
    collate_fn=data_collator,
    batch_size=64
)

tf_test = gpt2_model.prepare_tf_dataset(
    test,
    collate_fn=data_collator,
    batch_size=64
)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


## Perform the Fit
One epoch takes about 10-15 minutes.
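
For a sense of scale, the number of training steps per epoch follows from the training set size and the batch size. The sketch below uses the astrophysics numbers from this notebook; whether the final partial batch is included depends on the dataset settings, so the per-step time is only a rough estimate based on full batches.

```python
train_examples = 45429  # astrophysics training split
batch_size = 64

full_batches, leftover = divmod(train_examples, batch_size)
print(full_batches, leftover)

# Rough per-step time for a 10-15 minute epoch, ignoring any partial batch.
seconds_per_step = {m: round(m * 60 / full_batches, 2) for m in (10, 15)}
print(seconds_per_step)
```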

In [12]:
gpt2_model.fit(tf_train, epochs=1)



<keras.callbacks.History at 0x7efd61668790>

## Evaluate and Save Model
  
Condensed Matter - Materials Science Loss: 3.42  
High Energy Physics - Experiment Loss: 3.16  
High Energy Astrophysical Phenomena Loss: 3.37  
Tissues and Organs Loss: 4.16
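
These losses can be easier to compare as perplexities. The sketch below assumes each reported loss is a mean per-token cross-entropy in nats, in which case the perplexity is simply its exponential.

```python
import math

# Test losses reported above (assumed to be mean cross-entropy in nats).
test_loss = {
    'Condensed Matter - Materials Science': 3.42,
    'High Energy Physics - Experiment': 3.16,
    'High Energy Astrophysical Phenomena': 3.37,
    'Tissues and Organs': 4.16,
}

perplexity = {name: math.exp(loss) for name, loss in test_loss.items()}
for name, value in perplexity.items():
    print(f'{name}: {value:.1f}')
```

Under that assumption, the three large categories land at a perplexity of roughly 24 to 31, while the much smaller "Tissues and Organs" set is about twice as high.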

In [13]:
gpt2_model.evaluate(tf_test)



3.37039852142334

In [14]:
# Save the model
gpt2_model.save_weights(save_loc)

# Special Case: Tissues and Organs
Because this category has only 1,992 entries, training for multiple epochs is both necessary and practical. The following code block essentially repeats "Select a Dataset" and everything below it, but adds early stopping with a separate validation dataset. 
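
The early stopping rule can be illustrated in plain Python. This is a simplified sketch of the idea, not the Keras implementation: an epoch counts as an improvement only if the validation loss drops by more than `min_delta` below the best value so far, and training stops once `patience` consecutive epochs fail to improve. The function name and the validation losses are hypothetical; the defaults mirror the callback settings in the cell below.

```python
def early_stopping_epoch(val_losses, min_delta=0.1, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of validation losses."""
    best_loss = float('inf')
    best_epoch = 0
    wait = 0  # consecutive epochs without a large enough improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, not taken from this notebook's run.
toy_losses = [5.0, 4.5, 4.3, 4.25, 4.24, 4.23]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model would then be rolled back to the weights from the best epoch rather than keeping those from the final one.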

In [7]:
working_data = tissues_organs
save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/tissues_organs'

# A larger percentage is held out for evaluation
# This shuffles data as well. 
train_test_set = working_data.train_test_split(test_size=0.2, 
                                                seed=42)
train = train_test_set['train']
test = train_test_set['test']

# A validation set
train_valid_set = train.train_test_split(test_size=0.1,
                                         seed=42)

train = train_valid_set['train']
valid = train_valid_set['test']

print("Training Set: " + str(len(train)))
print("Validation Set: " + str(len(valid)))
print("Test Set: " + str(len(test)))

def map_to_dataset(dataset):
  """ Fully performs tokenization and removes extra keys
  """
  dataset = dataset.map(lambda x: proc_token(x),
                  remove_columns=['title'],
                  batched=True,
                  batch_size=64)
  dataset = dataset.map(remove_columns=['training_sentence'])

  return dataset

train = map_to_dataset(train)
valid = map_to_dataset(valid)
test = map_to_dataset(test)

config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

# The optimizer to use for training.
# These settings come from: 
# https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow
# and work well.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5,
                                     epsilon=1e-08, clipnorm=1.0)

# Compile the model
gpt2_model.compile(optimizer)

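# Request mixed-precision float16 for speed.
# Note: Keras applies the global policy to layers created afterwards, so this
# call likely needs to run before the model is loaded to take effect.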
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Prepares batches for training, such as by adding padding.
# Using this also takes care of creating the training labels.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False,
                                                return_tensors='tf')

# Convert the Hugging Face datasets to something TensorFlow can work with
tf_train = gpt2_model.prepare_tf_dataset(
    train,
    collate_fn=data_collator,
    batch_size=64
)

tf_valid = gpt2_model.prepare_tf_dataset(
    valid,
    collate_fn=data_collator,
    batch_size=64
)

tf_test = gpt2_model.prepare_tf_dataset(
    test,
    collate_fn=data_collator,
    batch_size=64
)

# Early Stopping Callback
cb_EarlyStopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    restore_best_weights=True,
    min_delta=0.1,
    patience=2

)

# Fine-Tune the Model
gpt2_model.fit(tf_train, validation_data=tf_valid,
               epochs=100, callbacks=[cb_EarlyStopping])



Training Set: 1433
Validation Set: 160
Test Set: 399


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<keras.callbacks.History at 0x7f296c5c6490>

In [8]:
gpt2_model.evaluate(tf_test)



4.160545825958252

In [9]:
# Save the model
gpt2_model.save_weights(save_loc)

# Generate some fake paper titles!

In [3]:


# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/tissues_organs'
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/condensed_matter'
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/hepEX'
save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/astrophysics'

# Load model and tokenizer

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2',
                                              eos_token='<|endoftext|>',
                                              bos_token='<|startoftext|>',
                                              pad_token='<pad>')


config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

gpt2_model.load_weights(save_loc)

pipe = pipeline(
    "text-generation", model=gpt2_model, tokenizer=tokenizer, device=0,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=100,
    max_length = 300,
    top_p=0.95
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________


In [4]:
txt='<|startoftext|>'

fake_titles = []

print('Fake Titles')
for ii in tqdm(range(100)):
  title = pipe(txt, num_return_sequences=1)[0]['generated_text'][16:-1]
  fake_titles.append(title)

print(fake_titles)

Fake Titles


100%|██████████| 100/100 [14:11<00:00,  8.51s/it]

['A short time scale model for gamma-ray bursts', 'Relativistic optical radiation emission from ultra-high-energy cosmic rays', 'Two-dimensional modeling for accretion flows', 'A nonlinear gravity model for short-term pulsar mass accretion in a hot AGN', 'The Role of the Rotating Pulsar Magnetar in the Timing Spectrum of Galactic Blazars with Gravitational Waves', 'J16581333A an X-ray pulsar near the heart of an advanced hyperthermal magnetorotational cluster', 'The LMC of a pulsar -- pulsars and high-energy spectra', 'Transients from Gamma Rays From the Black Hole X-ray Binary SSX 339-462', 'The long-term variability of X-ray emission from the blazar ANTARES', 'A spectral look at magnetized relativistic interactions in the galactic nucleus', 'The New Emission Type Ia Supernova Bursts and the Emergence of LIGO-LAT', 'The Evolution of Gamma Ray Emission during the 2005 X-ray Burst', 'The Optical and Spectral Variability of OJ 287 and IV Afterglow', 'Rotation-Driven Neutrino Detection us


