# Fine-Tuning GPT2 to Generate Paper Titles
Please read the accompanying webpage _here_ for more information.  

Of particular note:  
Due to usage restrictions in the free version of Google Colaboratory, I could not tune the model as thoroughly as I would have liked.

# Environment Setup 
- Import needed libraries

In [1]:
import tensorflow as tf
import json # read downloaded data file
import re   # regular expressions
from datasets import load_dataset, Dataset, load_from_disk
from transformers import (GPT2TokenizerFast, TFGPT2LMHeadModel, AutoConfig,
                          DataCollatorForLanguageModeling, pipeline)
from tqdm import tqdm




# Data
I select subsets of paper titles belonging to particular arXiv categories. From over two million papers, this reduces the data to the following number of papers per category:  

Tissues and Organs: 1992  
Condensed Matter - Materials Science: 82,412  
High Energy Astrophysical Phenomena: 49,692  
High Energy Physics - Experiment: 50,477

Filtering takes about 3 minutes per category, so saving the filtered datasets to disk and reloading them is preferable when debugging.

## Filtering Data
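
To make the filtering criterion concrete, here is a minimal sketch of the category test on its own. It assumes each record's `categories` field is a single string of space-separated category codes; the helper name `has_category` and the sample records are hypothetical. The `search_cat` function below uses a substring test (`tag in item['categories']`), whereas this sketch matches whole category codes, which is a slightly stricter variant.

```python
def has_category(categories: str, tag: str) -> bool:
    """Return True if `tag` is one of the space-separated category codes."""
    return tag in categories.split()

# Hypothetical records shaped like entries in the metadata file.
records = [
    {"title": "A", "categories": "cond-mat.mtrl-sci physics.comp-ph"},
    {"title": "B", "categories": "hep-ex"},
    {"title": "C", "categories": "astro-ph.HE hep-ex"},
]

hep_ex_titles = [r["title"] for r in records
                 if has_category(r["categories"], "hep-ex")]
print(hep_ex_titles)  # ['B', 'C']
```

Matching whole codes avoids accidental hits if one code ever happens to be a prefix of another; for the four tags used here the two tests should select the same papers.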


In [2]:
json_file = './data/raw/arxiv-metadata-oai-snapshot.json'

# Note - need to set "split", but all the data is loaded
data_all = load_dataset('json', data_files=json_file, split='train')



In [3]:
tissue_tag = 'q-bio.TO'
materials_science_tag = 'cond-mat.mtrl-sci'
astro_tag = 'astro-ph.HE'
hepEX_tag = 'hep-ex'

def search_cat(in_data, tag):
  """Keep only the titles of papers whose category string contains `tag`."""
  orig_keys = list(in_data.features.keys())
  remove_keys = [key for key in orig_keys if key != 'title']

  out_data = in_data.filter(lambda item: tag in item['categories'])
  out_data = out_data.remove_columns(remove_keys)

  return out_data

# All four categories must be filtered before they can be saved below.
tissues_organs   = search_cat(data_all, tissue_tag)
condensed_matter = search_cat(data_all, materials_science_tag)
astrophysics     = search_cat(data_all, astro_tag)
hepEX            = search_cat(data_all, hepEX_tag)


In [4]:
path = "./data/interim/"
tissues_organs.save_to_disk(path + 'tissues_organs.hf')
condensed_matter.save_to_disk(path + 'condensed_matter.hf')
hepEX.save_to_disk(path + 'hepEX.hf')
astrophysics.save_to_disk(path + 'astrophysics.hf')

## Loading Filtered Data

In [5]:
path = "./data/interim/"
tissues_organs   = load_from_disk(path + 'tissues_organs.hf')
condensed_matter = load_from_disk(path + 'condensed_matter.hf')
hepEX            = load_from_disk(path + 'hepEX.hf')
astrophysics     = load_from_disk(path + 'astrophysics.hf')

In [6]:
def print_length(name, dataset):
  string = '{}: length {}'
  print(string.format(name, len(dataset)))

print_length('Tissues and Organs', tissues_organs)
print_length('Materials Science', condensed_matter)
print_length('High Energy Physics - Experiment', hepEX)
print_length('High Energy Astrophysical Phenomena', astrophysics)

Tissues and Organs: length 2141
Materials Science: length 87760
High Energy Physics - Experiment: length 51803
High Energy Astrophysical Phenomena: length 54510


# Select a Dataset
Other than "Tissues and Organs", these datasets are large enough to produce reasonable results with only a single training epoch. 

In [7]:
working_data = condensed_matter
save_loc = './model/interim/condensed_matter/2023_11_16_v0'

# working_data = hepEX
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/hepEX'

# working_data = astrophysics
# save_loc = './model/interim/astrophysics/2023_11_15_v0'


# This shuffles data as well. 
train_test_set = working_data.train_test_split(test_size=0.1, 
                                                seed=42)
train = train_test_set['train']
test = train_test_set['test']
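
As a rough check on what this split produces, the sizes can be worked out by hand. The sketch below assumes the test share is rounded up to a whole number of rows and the remainder goes to training; `split_sizes` is a hypothetical helper, not part of the `datasets` API. For the 87,760 Materials Science titles it gives the same 78,984 / 8,776 example counts that appear in the `Map` progress output further down.

```python
def split_sizes(n_rows: int, test_percent: int):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = -(-n_rows * test_percent // 100)  # integer ceiling division
    n_train = n_rows - n_test
    return n_train, n_test

print(split_sizes(87760, 10))  # (78984, 8776)
```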

# Data Processing Pipeline
- Remove special characters from titles
- Generate statements of the form: 
```<|startoftext|> Sparsity-certifying Graph Decompositions <|endoftext|>``` 
- Tokenize Strings
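
As a quick illustration of the first two steps on a single title, here is a minimal sketch. The helper name `make_training_sentence` is hypothetical; the regular expressions and template are the same ones used by the batched pipeline functions defined further down.

```python
import re

TEMPLATE = '<|startoftext|> {} <|endoftext|>'

def make_training_sentence(title: str) -> str:
    """Clean one title and wrap it in the start/end markers."""
    # Keep only letters, digits, spaces, and dashes.
    cleaned = re.sub(r'[^a-zA-Z0-9 -]+', '', title)
    # Collapse runs of spaces left behind by the removed characters.
    cleaned = re.sub(r' +', ' ', cleaned)
    return TEMPLATE.format(cleaned)

print(make_training_sentence('Sparsity-certifying Graph Decompositions'))
# <|startoftext|> Sparsity-certifying Graph Decompositions <|endoftext|>

print(make_training_sentence('Measurement of $B^0$ decays:\n  a review'))
# <|startoftext|> Measurement of B0 decays a review <|endoftext|>
```

Note that LaTeX markup and line breaks inside a title are simply dropped, so `$B^0$` becomes `B0`.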

### Tokenizer
I load the tokenizer and add "start", "end", and "pad" tokens. In particular, setting the "pad" token to the "end" token causes GPT2 to keep generating text until the maximum length is reached. We want titles that come to a natural end and vary in length, hence the separate padding token. However, GPT2 must be resized to accommodate the extra tokens (it was only trained with an "end" token). **Our tuned GPT2 models must be paired with an identically defined tokenizer to work.** 
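
One plausible mechanism behind this, sketched in plain Python: when batches are padded for causal language modeling, label positions equal to the pad ID are typically replaced with an ignore value (-100) so they do not contribute to the loss. If the pad ID is the same as the end ID, the genuine end token is ignored as well, and the model gets no training signal for when to stop. The token IDs and the helper `pad_and_label` below are hypothetical and only mimic that behavior; they are not the library's implementation.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss

def pad_and_label(sequences, pad_id):
    """Pad to the longest sequence and mask pad positions in the labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in row]
              for row in input_ids]
    return input_ids, labels

# Hypothetical IDs: 1 = start, 2 = end, 3 = dedicated pad, 10+ = words.
BOS, EOS, PAD = 1, 2, 3
batch = [[BOS, 10, 11, 12, EOS], [BOS, 13, EOS]]

_, labels_separate_pad = pad_and_label(batch, pad_id=PAD)
_, labels_shared_pad = pad_and_label(batch, pad_id=EOS)

print(labels_separate_pad[1])  # [1, 13, 2, -100, -100]
print(labels_shared_pad[1])    # [1, 13, -100, -100, -100]
```

With a dedicated pad token the end marker (ID 2) survives in the labels; with a shared ID it is masked out along with the padding.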

In [8]:
# Load Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2',
                                              eos_token='<|endoftext|>',
                                              bos_token='<|startoftext|>',
                                              pad_token='<pad>')

def tokenize_entry(input):
  output = tokenizer(input['training_sentence'], padding=False)
  return output

## Pipeline Functions

In [9]:
def clean_txt(input_obj):
  """ Removes special characters from a batch of strings.

  Note that a batch of strings is required, and this will not work on single
  dataset items.

  Args: 
    input_obj (dict): Dictionary of strings to filter.

  Returns:
    List of strings keeping only letters, numbers, (single) spaces, and dashes.
  """
  output = []
  for string in input_obj['title']:
    string = re.sub(r'[^a-zA-Z0-9 -]+', '', string)
    string = re.sub(r' +', ' ', string)
    output.append(string)
  return output

def training_string(titles):
  """ Generates a string used to tune the natural language model.

  Args: 
    
    titles (list(str)): Titles of papers as elements in list. 

  Returns: 
    Lists of strings like: ['<|startoftext|> title <|endoftext|>']
  """

  training_template = '<|startoftext|> {} <|endoftext|>'
  output = []
  for title in titles:
    output.append(training_template.format(title))
  
  return output

def process_entry(input):
  """ Adds 'training_sentence' keys/items to dataset.

  Args:
  input (dict): A batch of data from a dataset.

  Returns:
  input with 'training_sentence' keys/items added.

  """
  clean_title = clean_txt(input)
  final_string = training_string(clean_title)
  input['training_sentence'] = final_string
  return input

def proc_token(input):
  """ Map this function to a dataset to tokenize strings.
  
  Args: 
  input (dict): A batch of data from a dataset.

  Returns:
  Tokenized string encodings, including attention masks.

  """
  input = process_entry(input)
  tokenized = tokenize_entry(input)
  return tokenized

## Creating and running the Pipeline

In [10]:
def map_to_dataset(dataset):
  """ Fully performs tokenization and removes extra keys
  """
  dataset = dataset.map(lambda x: proc_token(x),
                  remove_columns=['title'],
                  batched=True,
                  batch_size=64)
  dataset = dataset.map(remove_columns=['training_sentence'])

  return dataset

train = map_to_dataset(train)
test = map_to_dataset(test)

Map: 100%|██████████| 78984/78984 [00:01<00:00, 44719.95 examples/s]
Map: 100%|██████████| 78984/78984 [00:01<00:00, 66391.83 examples/s]
Map: 100%|██████████| 8776/8776 [00:00<00:00, 48039.25 examples/s]
Map: 100%|██████████| 8776/8776 [00:00<00:00, 73377.16 examples/s]


# Fine Tune the Model

## Load the Model
Load the pretrained model, resize its embeddings to match the tokenizer, and compile it with the optimizer.

In [11]:
#https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow

config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

# The optimizer to use for training.
# These settings come from: 
# https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow
# and work well.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5,
                                     epsilon=1e-08, clipnorm=1.0)

# Compile the model
gpt2_model.compile(optimizer)

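# Train in mixed-precision float16 for speed.
# Note: Keras applies a dtype policy only to layers created after it is set,
# so this call may need to precede model creation to take effect.
tf.keras.mixed_precision.set_global_policy("mixed_float16")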


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 3060, compute capability 8.6




## Convert HuggingFace dataset to TensorFlow dataset

In [12]:
# Prepares batches for training, such as by adding padding.
# Using this also takes care of needing to create training labels.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False,
                                                return_tensors='tf')

# Convert the Hugging Face datasets to something TensorFlow can work with
tf_train = gpt2_model.prepare_tf_dataset(
    train,
    collate_fn=data_collator,
    batch_size=32
)

tf_test = gpt2_model.prepare_tf_dataset(
    test,
    collate_fn=data_collator,
    batch_size=32
)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


## Perform the Fit
One epoch takes about 10-15 minutes.

In [13]:
gpt2_model.fit(tf_train, epochs=10)

Epoch 1/10


2023-11-16 01:03:18.878214: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f8d73888d70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-16 01:03:18.878231: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2023-11-16 01:03:18.880657: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-16 01:03:18.886979: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8906
2023-11-16 01:03:18.933834: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8eff50d950>

## Evaluate and Save Model
  
Condensed Matter - Materials Science Loss: 3.42  
High Energy Physics - Experiment Loss: 3.16  
High Energy Astrophysical Phenomena Loss: 3.37  
Tissues and Organs Loss: 4.16

In [14]:
gpt2_model.evaluate(tf_test)



2.9799633026123047

In [15]:
# Save the model
gpt2_model.save_weights(save_loc)

# Special Case: Tissues and Organs
Because this category has only 1992 papers, it is both necessary and practical to train for multiple epochs. The following code block essentially repeats "Select a Dataset" and everything below it, but adds early stopping with a separate validation dataset. 
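
To show how the `min_delta=0.1` and `patience=2` settings used below interact, here is a minimal sketch of the stopping rule in plain Python. It is a simplified stand-in for the Keras callback, not its actual implementation, and the validation losses are made-up values chosen only for illustration.

```python
def early_stopping_epoch(val_losses, min_delta=0.1, patience=2):
    """Return (index of the epoch where training stops, index of the best epoch)."""
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improved by more than min_delta: record it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
val_losses = [4.9, 4.4, 4.2, 4.15, 4.18]
stopped_at, best_at = early_stopping_epoch(val_losses)
print(stopped_at + 1, best_at + 1)  # 5 3
```

In this made-up run the small gain in epoch 4 does not count as an improvement because it is below `min_delta`, so training stops after epoch 5 and, with `restore_best_weights=True`, the weights from epoch 3 would be kept.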

In [7]:
working_data = tissues_organs
save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/tissues_organs'

# A larger percentage is held out for evaluation
# This shuffles data as well. 
train_test_set = working_data.train_test_split(test_size=0.2, 
                                                seed=42)
train = train_test_set['train']
test = train_test_set['test']

# A validation set
train_valid_set = train.train_test_split(test_size=0.1,
                                         seed=42)

train = train_valid_set['train']
valid = train_valid_set['test']

print("Training Set: " + str(len(train)))
print("Validation Set: " + str(len(valid)))
print("Test Set: " + str(len(test)))

def map_to_dataset(dataset):
  """ Fully performs tokenization and removes extra keys
  """
  dataset = dataset.map(lambda x: proc_token(x),
                  remove_columns=['title'],
                  batched=True,
                  batch_size=64)
  dataset = dataset.map(remove_columns=['training_sentence'])

  return dataset

train = map_to_dataset(train)
valid = map_to_dataset(valid)
test = map_to_dataset(test)

config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

# The optimizer to use for training.
# These settings come from: 
# https://www.kaggle.com/code/vimalpillai/finetuning-gpt2-model-tensorflow
# and work well.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5,
                                     epsilon=1e-08, clipnorm=1.0)

# Compile the model
gpt2_model.compile(optimizer)

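# Train in mixed-precision float16 for speed.
# Note: Keras applies a dtype policy only to layers created after it is set,
# so this call may need to precede model creation to take effect.
tf.keras.mixed_precision.set_global_policy("mixed_float16")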

# Prepares batches for training, such as by adding padding.
# Using this also takes care of needing to create training labels.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False,
                                                return_tensors='tf')

# Convert the Hugging Face datasets to something TensorFlow can work with
tf_train = gpt2_model.prepare_tf_dataset(
    train,
    collate_fn=data_collator,
    batch_size=64
)

tf_valid = gpt2_model.prepare_tf_dataset(
    valid,
    collate_fn=data_collator,
    batch_size=64
)

tf_test = gpt2_model.prepare_tf_dataset(
    test,
    collate_fn=data_collator,
    batch_size=64
)

# Early Stopping Callback
cb_EarlyStopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    restore_best_weights=True,
    min_delta=0.1,
    patience=2

)

# Fine-Tune the Model
gpt2_model.fit(tf_train, validation_data=tf_valid,
               epochs=100, callbacks=[cb_EarlyStopping])



Training Set: 1433
Validation Set: 160
Test Set: 399


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<keras.callbacks.History at 0x7f296c5c6490>

In [8]:
gpt2_model.evaluate(tf_test)



4.160545825958252

In [9]:
# Save the model
gpt2_model.save_weights(save_loc)

# Generate some fake paper titles!

In [2]:


# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/tissues_organs'
save_loc = './model/interim/condensed_matter/2023_11_16_v0'
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/hepEX'
# save_loc = '/content/drive/MyDrive/Colab Notebooks/ArXiv/Single_Cat/Models/astrophysics'

# Load model and tokenizer

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2',
                                              eos_token='<|endoftext|>',
                                              bos_token='<|startoftext|>',
                                              pad_token='<pad>')


config = AutoConfig.from_pretrained(
    'gpt2',
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    output_hidden_states=False
)

# Load pretrained model
gpt2_model = TFGPT2LMHeadModel.from_pretrained('gpt2', config=config)

# Tell the model we changed the tokenizer
gpt2_model.resize_token_embeddings(len(tokenizer))

gpt2_model(gpt2_model.dummy_inputs) # Builds model
gpt2_model.summary()

gpt2_model.load_weights(save_loc)

pipe = pipeline(
    "text-generation", model=gpt2_model, tokenizer=tokenizer, device=0,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=100,
    max_length = 300,
    top_p=0.95
)

2023-11-16 11:27:17.581980: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2023-11-16 11:27:17.582033: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: BrentArch
2023-11-16 11:27:17.582042: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: BrentArch
2023-11-16 11:27:17.582247: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 545.29.2
2023-11-16 11:27:17.582274: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 545.29.2
2023-11-16 11:27:17.582281: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:309] kernel version seems to match DSO: 545.29.2
2023-11-16 11:27:17.588341: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune usin

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124441344 
 r)                                                              
                                                                 
Total params: 124,441,344
Trainable params: 124,441,344
Non-trainable params: 0
_________________________________________________________________


In [3]:
pipe = pipeline(
    "text-generation", model=gpt2_model, tokenizer=tokenizer, device=0,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=100,
    max_length = 300,
    top_p=0.95
)

txt='<|startoftext|>'

fake_titles = []

print('Fake Titles')
for ii in tqdm(range(10)):
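  # [16:-1] drops the 15-character start marker, the space after it,
  # and the final character of the generated text.
  title = pipe(txt, num_return_sequences=1)[0]['generated_text'][16:-1]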
  fake_titles.append(title)

print(fake_titles)

Fake Titles


100%|██████████| 10/10 [00:14<00:00,  1.41s/it]

['On the origin of the high concentration of interstitial H in the MgB2 superconductor under pressure', 'Effects of particle size and particle velocity on the thermoelectric properties of two-dimensional ferroelectrics', 'Superconductivity due to surface-mediated biaxial strain in Bi2Se3 films', 'Theoretical studies of lithium-oxide as cathode materials for lithium batteries', 'Theory of phonon softening in a metal', 'Topological Dirac semimetal in MnBi2Te4', 'Machine Learning of Density Functional Theory for Molecular Dynamics', 'Ferroelectric-free terahertz control of topological electronic properties in two-dimensional CrGeTe5', 'Quantum Confinement in a 2D Metal Matrix', 'Superconductivity induced by a charge density wave order in the insulating skutterudite Li15ZrO3']
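
The fixed `[16:-1]` slice in the loop above depends on the exact length of the start marker. A slightly more explicit alternative is sketched below; `extract_title` is a hypothetical helper, and the sample string is only an assumption about what the generated text looks like (start marker, a space, the title, and a trailing space).

```python
START, END = '<|startoftext|>', '<|endoftext|>'

def extract_title(generated_text: str) -> str:
    """Remove the start marker, anything from the end marker on, and outer spaces."""
    text = generated_text
    if text.startswith(START):
        text = text[len(START):]
    text = text.split(END, 1)[0]
    return text.strip()

# Hypothetical generated text, shaped as the slice in the loop implies.
sample = '<|startoftext|> Theory of phonon softening in a metal '
print(extract_title(sample))  # Theory of phonon softening in a metal
print(sample[16:-1])          # Theory of phonon softening in a metal
```

Both approaches give the same title for this sample; the helper also copes with an explicit end marker or a missing trailing space.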



