## Train a word-level GPT on ML arXiv abstracts

The arXiv dataset on [Kaggle](https://www.kaggle.com/Cornell-University/arxiv) provides metadata on thousands of papers published over the past decades. In this post, we take all the abstracts from papers in the field of Machine Learning (or related fields) and train a GPT on them. We use Andrej Karpathy's [minGPT](https://github.com/karpathy/minGPT), a PyTorch re-implementation of OpenAI's [GPT](https://github.com/openai/gpt-3) that "tries to be small, clean, interpretable and educational" (it is).

We train our model on a single GPU available on Google Colab, then feed it some prompts and get it to predict an entire Machine Learning abstract!

In [1]:
from google.colab import drive # import drive from google colab

In [2]:
ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)

drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Mounted at /content/drive


In [None]:
# This is necessary to ensure that paths are correct for importing data from the google drive folder
# insert correct root for minGPT code
minGPT_DIR = '/minGPT/'
%cd $minGPT_DIR

In [4]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [5]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [6]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

import re
import pandas as pd
import os
import json
pd.set_option('float_format', '{:f}'.format)

Let's load the data, using `yield` below to avoid memory problems with the huge json file.

In [7]:
file_path = 'arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(file_path, 'r') as f:
        for line in f:
            yield line

We'll just look at papers from the past 10 years and select those tagged with at least one of the three categories arXiv uses for AI papers:
- 'cs.AI': 'Artificial Intelligence'
- 'cs.LG': 'Machine Learning'
- 'stat.ML': 'Machine Learning'

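Since the year is read from the `journal-ref` field, papers without a journal reference are skipped, which leaves us with 4673 abstracts to work with!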

In [8]:
ai_list = ['cs.AI','cs.LG','stat.ML']
abstracts = []

metadata = get_metadata()
# loop over all papers
for paper in metadata:
    # each line of the file is one paper encoded as JSON
    paper_dict = json.loads(paper)
    category = paper_dict.get('categories')
    try:
        try:
            year = int(paper_dict.get('journal-ref')[-4:])    # e.g. "Phys.Rev.D76:013009,2007"
        except ValueError:
            year = int(paper_dict.get('journal-ref')[-5:-1])    # e.g. "Phys.Rev.D76:013009,(2007)"

        # keep AI/ML papers whose journal-ref year is between 2011 and 2020 inclusive
        if any(ele in category for ele in ai_list) and 2010 < year < 2021:
            abstracts.append(paper_dict.get('abstract'))
    except (TypeError, ValueError):
        # no journal-ref, or no parsable year at the end of it
        pass

In [9]:
len(abstracts)

4673
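
The year filter above relies on slicing the end of the `journal-ref` string. As a minimal sketch, the same slicing logic can be pulled into a standalone helper so it is easy to try on individual strings. The name `extract_year` is hypothetical (it is not part of the notebook), and it only covers the two formats shown in the comments above.

```python
def extract_year(journal_ref):
    """Return the trailing year of a journal-ref string, or None if it cannot be parsed.

    Hypothetical helper: mirrors the two slices used in the loop above.
    """
    if not journal_ref:
        return None
    # Try "...,2007" first, then "...,(2007)"
    for candidate in (journal_ref[-4:], journal_ref[-5:-1]):
        try:
            return int(candidate)
        except ValueError:
            continue
    return None


examples = ["Phys.Rev.D76:013009,2007", "Phys.Rev.D76:013009,(2007)", None]
years = [extract_year(ref) for ref in examples]
print(years)  # [2007, 2007, None]
```

Any reference whose last characters are not a year (a page range, for instance) comes back as `None`, which matches how the loop above silently drops such papers.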

Next we need to preprocess the abstracts as follows, after which we get a corpus of 857,479 words.

In [10]:
# strip surrounding whitespace, replace new lines by spaces and append an '#EOS' end-of-abstract marker
f = lambda x: x.strip().replace("\n"," ") + " #EOS"
abstracts = [f(x) for x in abstracts]
# separate all words and punctuation (the pattern drops '#', so the marker becomes the token 'EOS')
abstracts = [re.findall(r"[\w']+|[.,!?;]", x) for x in abstracts]
# flatten the list of lists into a single list of tokens
abstracts = [j for i in abstracts for j in i]

In [11]:
len(abstracts)

857479
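
To see what the preprocessing does to a single abstract, here is a small sketch that applies the same two steps to a made-up string (the sample text is an assumption for illustration, not taken from the dataset).

```python
import re

# Made-up two-line "abstract" for illustration only
sample = "We propose a multi-modal model.\nIt doesn't overfit (much)!"

# Same steps as above: strip, join lines, append the end marker, then tokenize
text = sample.strip().replace("\n", " ") + " #EOS"
tokens = re.findall(r"[\w']+|[.,!?;]", text)
print(tokens)
# ['We', 'propose', 'a', 'multi', 'modal', 'model', '.', 'It', "doesn't",
#  'overfit', 'much', '!', 'EOS']
```

Note that anything outside the pattern (hyphens, brackets, the `#` of the end marker) is discarded, which is why the generated text further down contains `EOS`, `multi modal` and no parentheses.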

In [12]:
import math
from torch.utils.data import Dataset

class WordDataset(Dataset):

    def __init__(self, data, block_size):
        words = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(words)
        print('data has %d words, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(words) }
        self.itos = { i:ch for i,ch in enumerate(words) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) words from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every word to an integer
        dix = [self.stoi[s] for s in chunk]
        # See https://github.com/karpathy/minGPT/blob/master/play_char.ipynb for
        # an explainer of the Dataset construction: x is the input window and
        # y is the same window shifted one word ahead (the prediction targets)
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
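
The input/target construction in `__getitem__` is easier to see on a toy example. The sketch below reproduces the same indexing with plain Python lists instead of tensors; the sentence and the tiny `block_size` are assumptions chosen for readability.

```python
# Toy corpus and a tiny context length, for illustration only
words = "we train a small model on abstracts".split()
block_size = 3

# Same vocabulary construction as WordDataset
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}

# Same slicing as __getitem__ for idx = 0
idx = 0
chunk = words[idx:idx + block_size + 1]   # ['we', 'train', 'a', 'small']
ids = [stoi[w] for w in chunk]
x, y = ids[:-1], ids[1:]

print(x)  # [6, 5, 0]
print(y)  # [5, 0, 4]
print(len(words) - block_size)  # 4 training examples, as in __len__
```

Each position in `y` is the word that follows the corresponding position in `x`, so one chunk gives the model `block_size` next-word predictions to learn from.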

With our Dataset object defined, we can load the data with a block size of 128. That is a reasonable context length, since the average abstract on arXiv is 122 words long (see the previous [post](https://kushmadlani.github.io/arxiv-eda/)):

In [14]:
block_size = 128 # sets spatial extent of the model for its context
train_dataset = WordDataset(abstracts, block_size) 

data has 857479 words, 25921 unique.


Let's load a GPT! In his character-level transformer example, Karpathy built a 'GPT-1' with 8 layers and 8 heads. Here we halve that to 4 layers and 4 attention heads so that we are able to train it on a Colab GPU (I guess we call this 'GPT-0.5').

In [16]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=4, n_head=4, n_embd=256)
model = GPT(mconf)

09/27/2020 19:49:54 - INFO - mingpt.model -   number of parameters: 1.646387e+07
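
The parameter count in the log can be reproduced by hand. The sketch below is back-of-the-envelope arithmetic under an assumed layout for minGPT at the time of writing: a token embedding, a learned positional embedding, four blocks (two LayerNorms, four attention linear layers with bias, a 4x-wide MLP), a final LayerNorm, and an untied output head without bias. The `linear` helper is hypothetical and only counts weights.

```python
vocab_size, block_size = 25921, 128
n_layer, n_embd = 4, 256


def linear(n_in, n_out, bias=True):
    """Number of weights in a dense layer (hypothetical counting helper)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * n_embd                      # scale and shift
attention = 4 * linear(n_embd, n_embd)       # key, query, value, output projection
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = 2 * layer_norm + attention + mlp

total = (
    vocab_size * n_embd                      # token embedding
    + block_size * n_embd                    # positional embedding
    + n_layer * block                        # transformer blocks
    + layer_norm                             # final LayerNorm
    + linear(n_embd, vocab_size, bias=False) # output head (assumed untied, no bias)
)
print("%e" % total)  # 1.646387e+07
```

Under these assumptions roughly 80% of the parameters sit in the token embedding and the output head, a direct consequence of the 25,921-word vocabulary.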


In [17]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=2, batch_size=128, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=256*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 6698: train loss 1.35257. lr 3.000110e-04: 100%|██████████| 6699/6699 [24:41<00:00,  4.52it/s]
epoch 2 iter 6698: train loss 0.94379. lr 6.000000e-05: 100%|██████████| 6699/6699 [24:45<00:00,  4.51it/s]
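
The learning rates in the progress bars (about 3e-4 after the first epoch, 6e-5 at the end) follow from the `lr_decay`, `warmup_tokens` and `final_tokens` settings. The following is a sketch of the schedule as I understand minGPT's trainer to have implemented it at the time: a linear warmup followed by cosine decay with a floor at 10% of the base rate. Treat the exact formula as an assumption; `lr_at` is a hypothetical helper, not a minGPT function.

```python
import math


def lr_at(tokens, base_lr=6e-4, warmup_tokens=256 * 20,
          final_tokens=2 * 857351 * 128):
    """Learning rate after `tokens` processed tokens (assumed minGPT-style schedule)."""
    if tokens < warmup_tokens:
        # linear warmup from 0 to base_lr
        mult = tokens / max(1, warmup_tokens)
    else:
        # cosine decay, never dropping below 10% of base_lr
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult


# len(train_dataset) = 857479 - 128 = 857351 examples of 128 tokens each
tokens_per_epoch = 857351 * 128
print("%e" % lr_at(tokens_per_epoch))      # end of epoch 1
print("%e" % lr_at(2 * tokens_per_epoch))  # end of epoch 2
```

With these settings the schedule is halfway through its decay at the end of the first epoch and reaches the 10% floor at the end of the second, which is consistent with the two values logged above.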


Model trained! Let's generate some Machine Learning abstracts...

In [20]:
# alright, let's sample some word-level abstracts
from mingpt.utils import sample

context = ['This', 'paper', 'discusses']
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 150, temperature=1.0, sample=True, top_k=10)[0]
completion = ' '.join([train_dataset.itos[int(i)] for i in y])
print(completion)

This paper discusses the effect of the design and implementation of a case study . EOS Graph Neural Networks GNNs achieve remarkable performance in graph data classification tasks . In graph classification , each node of node information from labeled nodes measured nodes in a graph are connected by many , each graph represents the goal of node embedding space . Multiple graph embedding aims to create a similarity graph by representing the different graph each path graph in each graph . This information represents the embedding by learning a knowledge graph by node as the network . The goal is to design a similarity graph embedding that represents a set of entities and the entities in the graph . The nodes are generated using graph embedding techniques , which represent graph embedding methods with embedding methods , on nodes using graphs . graphs are based on the space of nodes in a
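
The `temperature` and `top_k` arguments control how each next word is drawn. As a rough illustration of the idea (not minGPT's actual `sample` code, which works on tensors), the sketch below picks one index from a list of scores using only the standard library; `sample_top_k` and the example scores are assumptions made for this illustration.

```python
import math
import random


def sample_top_k(logits, top_k, temperature=1.0, rng=random):
    """Draw one index from the top_k highest logits after temperature scaling."""
    scaled = [value / temperature for value in logits]
    # indices of the top_k largest scores; everything else gets zero probability
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # softmax over the kept scores (subtract the max for numerical stability)
    top = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - top) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.0]   # made-up scores for a 5-word vocabulary
rng = random.Random(0)
draws = [sample_top_k(logits, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))             # only the two best indices can appear
```

With `top_k=10`, as used in this post, the model can only ever pick among its ten most likely next words, which keeps the text locally plausible while still allowing variety between runs.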


In [22]:
context = ['We', 'introduce', 'the']
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 150, temperature=1.0, sample=True, top_k=10)[0]
completion = ' '.join([train_dataset.itos[int(i)] for i in y])
print(completion)

We introduce the first time algorithm to compute activity of a given graph in a graph and then using graph clustering . We give examples that in our algorithm enjoys a faster convergence rate than previous methods . We furthermore show that it is a way to address problems with different edges that are important to be used for graph clustering . We define new rank six block models , low rank , low rank , and we are able to provide an improved numerical implementation of our model . These findings are extremely sensible and useful matrix factorization NMF . EOS This paper describes a recent , much discussion about the concepts of the SP theory of intelligence , with its realisation in the SP machine both outlined in the article may help to simplify and work in the design of autonomous robots that may assist in the design of autonomous robots


In [24]:
context = ['Our', 'work', 'has', 'focused', 'on']
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 200, temperature=1.0, sample=True, top_k=10)[0]
completion = ' '.join([train_dataset.itos[int(i)] for i in y])
print(completion)

Our work has focused on the first large class of relevant problems , where the problem is to identify , and evaluate the single , rank , and all algorithms simultaneously satisfy the two properties and time constraints , respectively . The experimental result shows the interest of the proposed solving high dimensional G odel's original paper illustrates the global analysis of analysed , in the context of a link with the whole paper . EOS Many studies claim that an object can be an object , e . g . , geographic query answer set in order to , when instances in a database of interest . Many classes of problems are given . More recently , some attack scenarios rely on node graph based keyword , which aims to select a set of contextual features that are most similar for each context e . g . , for that for every node ,


In [25]:
context = ['Our', 'work', 'has', 'focused', 'on']
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 200, temperature=1.0, sample=True, top_k=10)[0]
completion = ' '.join([train_dataset.itos[int(i)] for i in y])
print(completion)

Our work has focused on the use of multi modal social networks and web recommender systems , in which contain heterogeneous information and items . In this paper , we propose a multi modal data embedding framework to detect matches semantically similar contexts in order to their opinions . We show that both methods can be successfully applied to Web and document clustering tasks . EOS In this paper we study the problem of finding the rating of two and , the rating score for a given time . In particular , we use the following the following questions 1 The given answer is a certain item such that the set of at any a certain item we choose one , and use the rating , combined with the answer to answer based . We review the characteristics and compare the baselines in detail to these questions . To this end , we built a deep ranking approach for general and general and statistical analysis of some recent QA methods . EOS We consider the problem of learning a probabilistic domain , agent usi

In [26]:
context = ['This', 'paper', 'considers']
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 150, temperature=1.0, sample=True, top_k=10)[0]
completion = ' '.join([train_dataset.itos[int(i)] for i in y])
print(completion)

This paper considers the problem of finding a single optimal clustering that minimizes a specific number of disagreements i . e . , the sum of the number of observed missing edges within clusters . The objective of most promising intelligent algorithms appear to be evaluated on the basis of similarity matrix . However , most of the problems have with high probability , that they are designed for the pair of clusters are distinct from observational data . The optimal clustering must pass through a grid like time varying quality . We develop a new algorithm to learn K coordinate dictionaries , with dimensions m_k times p_k up to estimation error varepsilon_k is shown to be max_ k in K mathcal O m_kp_k 3 varepsilon_k 2 . EOS Understanding the causes of crime is a longstanding issue in researcher's agenda . While it is a hard task to extract causality from data
