Project Name: **Classification of Abstracts from arXiv publications into their most relevant category**

Course: **CIS 545**

Project Members: **Arvind Balaji Narayan, Bharathrushab Manthripragada, Gopik Anand**

**Model Used: GPT2**

GPT-2 (Generative Pre-trained Transformer 2) is trained with a causal language-modeling objective: it predicts the next token in a sequence. Because it is pretrained on a large text corpus, it can be fine-tuned for sequence classification tasks such as assigning an abstract to its arXiv category.
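
To make the classification step concrete: Hugging Face's `GPT2ForSequenceClassification` scores a sequence from the hidden state of its last non-padding token. The sketch below only illustrates how that position can be located in plain Python. The helper `last_token_index`, the toy ids, and the choice of `50256` (GPT-2's end-of-text id) as the padding id are illustrative assumptions, not code from this notebook.

```python
def last_token_index(input_ids, pad_token_id):
    """Return the index of the last non-padding token, or -1 if every token is padding."""
    for idx in range(len(input_ids) - 1, -1, -1):
        if input_ids[idx] != pad_token_id:
            return idx
    return -1

PAD = 50256  # assumption: GPT-2's end-of-text id reused as the padding id
toy_ids = [464, 3290, 318, PAD, PAD]  # made-up token ids followed by padding
print(last_token_index(toy_ids, PAD))
```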

Package Installations

In [None]:
!pip install transformers



In [None]:
!pip install kaggle



Imports

In [None]:
import numpy as np
import pandas as pd
import os, json, gc, re, random
import tensorflow as tf
import torch
from tqdm.notebook import tqdm
import transformers
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, AdamW
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from sklearn.preprocessing import LabelEncoder

In [None]:
seed_val = 500
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

Loading the arXiv Dataset

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Downloading arxiv.zip to /content
100% 1.03G/1.03G [00:28<00:00, 51.3MB/s]
100% 1.03G/1.03G [00:28<00:00, 39.1MB/s]


In [None]:
!ls

arxiv.zip  kaggle.json	sample_data


In [None]:
!unzip /content/arxiv.zip

Archive:  /content/arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [None]:
data_file = '/content/arxiv-metadata-oai-snapshot.json'

In [None]:
def get_metadata():
    with open(data_file, 'r') as f:
        for line in f:
            yield line
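
The snapshot stores one JSON object per line, which is why a generator can stream it without loading the whole file into memory. A minimal sketch of that pattern on an in-memory stream follows; the helper `iter_records` and the two sample records are made up for illustration.

```python
import io
import json

def iter_records(handle):
    """Yield one parsed JSON object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records standing in for the real snapshot file.
sample = io.StringIO(
    '{"id": "0001", "categories": "quant-ph"}\n'
    '{"id": "0002", "categories": "math.NT cs.IT"}\n'
)
records = list(iter_records(sample))
print([record["categories"] for record in records])
```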

Listing all Categories in cat_map

In [None]:
cat_map =      {'astro-ph': 'Astrophysics',
                'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
                'astro-ph.EP': 'Earth and Planetary Astrophysics',
                'astro-ph.GA': 'Astrophysics of Galaxies',
                'astro-ph.HE': 'High Energy Astrophysical Phenomena',
                'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
                'astro-ph.SR': 'Solar and Stellar Astrophysics',
                'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
                'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
                'cond-mat.mtrl-sci': 'Materials Science',
                'cond-mat.other': 'Other Condensed Matter',
                'cond-mat.quant-gas': 'Quantum Gases',
                'cond-mat.soft': 'Soft Condensed Matter',
                'cond-mat.stat-mech': 'Statistical Mechanics',
                'cond-mat.str-el': 'Strongly Correlated Electrons',
                'cond-mat.supr-con': 'Superconductivity',
                'cs.AI': 'Artificial Intelligence',
                'cs.AR': 'Hardware Architecture',
                'cs.CC': 'Computational Complexity',
                'cs.CE': 'Computational Engineering, Finance, and Science',
                'cs.CG': 'Computational Geometry',
                'cs.CL': 'Computation and Language',
                'cs.CR': 'Cryptography and Security',
                'cs.CV': 'Computer Vision and Pattern Recognition',
                'cs.CY': 'Computers and Society',
                'cs.DB': 'Databases',
                'cs.DC': 'Distributed, Parallel, and Cluster Computing',
                'cs.DL': 'Digital Libraries',
                'cs.DM': 'Discrete Mathematics',
                'cs.DS': 'Data Structures and Algorithms',
                'cs.ET': 'Emerging Technologies',
                'cs.FL': 'Formal Languages and Automata Theory',
                'cs.GL': 'General Literature',
                'cs.GR': 'Graphics',
                'cs.GT': 'Computer Science and Game Theory',
                'cs.HC': 'Human-Computer Interaction',
                'cs.IR': 'Information Retrieval',
                'cs.IT': 'Information Theory',
                'cs.LG': 'Machine Learning',
                'cs.LO': 'Logic in Computer Science',
                'cs.MA': 'Multiagent Systems',
                'cs.MM': 'Multimedia',
                'cs.MS': 'Mathematical Software',
                'cs.NA': 'Numerical Analysis',
                'cs.NE': 'Neural and Evolutionary Computing',
                'cs.NI': 'Networking and Internet Architecture',
                'cs.OH': 'Other Computer Science',
                'cs.OS': 'Operating Systems',
                'cs.PF': 'Performance',
                'cs.PL': 'Programming Languages',
                'cs.RO': 'Robotics',
                'cs.SC': 'Symbolic Computation',
                'cs.SD': 'Sound',
                'cs.SE': 'Software Engineering',
                'cs.SI': 'Social and Information Networks',
                'cs.SY': 'Systems and Control',
                'econ.EM': 'Econometrics',
                'eess.AS': 'Audio and Speech Processing',
                'eess.IV': 'Image and Video Processing',
                'eess.SP': 'Signal Processing',
                'gr-qc': 'General Relativity and Quantum Cosmology',
                'hep-ex': 'High Energy Physics - Experiment',
                'hep-lat': 'High Energy Physics - Lattice',
                'hep-ph': 'High Energy Physics - Phenomenology',
                'hep-th': 'High Energy Physics - Theory',
                'math.AC': 'Commutative Algebra',
                'math.AG': 'Algebraic Geometry',
                'math.AP': 'Analysis of PDEs',
                'math.AT': 'Algebraic Topology',
                'math.CA': 'Classical Analysis and ODEs',
                'math.CO': 'Combinatorics',
                'math.CT': 'Category Theory',
                'math.CV': 'Complex Variables',
                'math.DG': 'Differential Geometry',
                'math.DS': 'Dynamical Systems',
                'math.FA': 'Functional Analysis',
                'math.GM': 'General Mathematics',
                'math.GN': 'General Topology',
                'math.GR': 'Group Theory',
                'math.GT': 'Geometric Topology',
                'math.HO': 'History and Overview',
                'math.IT': 'Information Theory',
                'math.KT': 'K-Theory and Homology',
                'math.LO': 'Logic',
                'math.MG': 'Metric Geometry',
                'math.MP': 'Mathematical Physics',
                'math.NA': 'Numerical Analysis',
                'math.NT': 'Number Theory',
                'math.OA': 'Operator Algebras',
                'math.OC': 'Optimization and Control',
                'math.PR': 'Probability',
                'math.QA': 'Quantum Algebra',
                'math.RA': 'Rings and Algebras',
                'math.RT': 'Representation Theory',
                'math.SG': 'Symplectic Geometry',
                'math.SP': 'Spectral Theory',
                'math.ST': 'Statistics Theory',
                'math-ph': 'Mathematical Physics',
                'nlin.AO': 'Adaptation and Self-Organizing Systems',
                'nlin.CD': 'Chaotic Dynamics',
                'nlin.CG': 'Cellular Automata and Lattice Gases',
                'nlin.PS': 'Pattern Formation and Solitons',
                'nlin.SI': 'Exactly Solvable and Integrable Systems',
                'nucl-ex': 'Nuclear Experiment',
                'nucl-th': 'Nuclear Theory',
                'physics.acc-ph': 'Accelerator Physics',
                'physics.ao-ph': 'Atmospheric and Oceanic Physics',
                'physics.app-ph': 'Applied Physics',
                'physics.atm-clus': 'Atomic and Molecular Clusters',
                'physics.atom-ph': 'Atomic Physics',
                'physics.bio-ph': 'Biological Physics',
                'physics.chem-ph': 'Chemical Physics',
                'physics.class-ph': 'Classical Physics',
                'physics.comp-ph': 'Computational Physics',
                'physics.data-an': 'Data Analysis, Statistics and Probability',
                'physics.ed-ph': 'Physics Education',
                'physics.flu-dyn': 'Fluid Dynamics',
                'physics.gen-ph': 'General Physics',
                'physics.geo-ph': 'Geophysics',
                'physics.hist-ph': 'History and Philosophy of Physics',
                'physics.ins-det': 'Instrumentation and Detectors',
                'physics.med-ph': 'Medical Physics',
                'physics.optics': 'Optics',
                'physics.plasm-ph': 'Plasma Physics',
                'physics.pop-ph': 'Popular Physics',
                'physics.soc-ph': 'Physics and Society',
                'physics.space-ph': 'Space Physics',
                'q-bio.BM': 'Biomolecules',
                'q-bio.CB': 'Cell Behavior',
                'q-bio.GN': 'Genomics',
                'q-bio.MN': 'Molecular Networks',
                'q-bio.NC': 'Neurons and Cognition',
                'q-bio.OT': 'Other Quantitative Biology',
                'q-bio.PE': 'Populations and Evolution',
                'q-bio.QM': 'Quantitative Methods',
                'q-bio.SC': 'Subcellular Processes',
                'q-bio.TO': 'Tissues and Organs',
                'q-fin.CP': 'Computational Finance',
                'q-fin.EC': 'Economics',
                'q-fin.GN': 'General Finance',
                'q-fin.MF': 'Mathematical Finance',
                'q-fin.PM': 'Portfolio Management',
                'q-fin.PR': 'Pricing of Securities',
                'q-fin.RM': 'Risk Management',
                'q-fin.ST': 'Statistical Finance',
                'q-fin.TR': 'Trading and Market Microstructure',
                'quant-ph': 'Quantum Physics',
                'stat.AP': 'Applications',
                'stat.CO': 'Computation',
                'stat.ME': 'Methodology',
                'stat.ML': 'Machine Learning',
                'stat.OT': 'Other Statistics',
                'stat.TH': 'Statistics Theory'}

Data Wrangling 

In [None]:
titles = []
abstracts = []
categories = []

# Consider all categories in `cat_map` during training and prediction
paper_categories = set(cat_map)

metadata = get_metadata()
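for paper in tqdm(metadata):
    paper_dict = json.loads(paper)
    # `categories` is a space-separated string, so the membership test below
    # keeps only papers that list exactly one category from `cat_map`.
    category = paper_dict.get('categories')
    journal_ref = paper_dict.get('journal-ref')
    if not journal_ref:
        continue    # No journal reference, so no publication year to parse
    try:
        year = int(journal_ref[-4:])    # Example format: "Phys.Rev.D76:013009,2007"
    except ValueError:
        try:
            year = int(journal_ref[-5:-1])    # Example format: "Phys.Rev.D76:013009,(2007)"
        except ValueError:
            continue    # Skip references that do not end in a year

    if category in paper_categories and 2018 <= year <= 2022:
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))
        categories.append(category)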

len(titles), len(abstracts), len(categories)

0it [00:00, ?it/s]

(40981, 40981, 40981)
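
The slice-based parsing above recognises only references that end in `2007` or `(2007)`. As a hedged alternative, a regular expression can pick the last standalone four-digit year anywhere in the string. The helper `extract_year` is hypothetical, and it would select a different set of papers than the cell above, so the counts shown would change. It can also mistake a page or volume number such as `2019` for a year.

```python
import re

# A 4-digit number starting with 19 or 20 that is not part of a longer digit run.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")

def extract_year(journal_ref):
    """Return the last standalone 4-digit year in a journal reference, or None."""
    if not journal_ref:
        return None
    matches = YEAR_PATTERN.findall(journal_ref)
    return int(matches[-1]) if matches else None

for ref in ["Phys.Rev.D76:013009,2007", "Phys.Rev.D76:013009,(2007)", None]:
    print(ref, extract_year(ref))
```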

In [None]:
papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'categories': categories
})
papers.head(5)

Unnamed: 0,title,abstract,categories
0,Bohmian Mechanics at Space-Time Singularities....,We develop an extension of Bohmian mechanics...,quant-ph
1,On the derivation of exact eigenstates of the ...,We construct the states that are invariant u...,quant-ph
2,Weight Reduction for Mod l Bianchi Modular Forms,Let K be an imaginary quadratic field with c...,math.NT
3,Lawson Method for Obtaining Wave Functions and...,Lawson has shown that one can obtain sensibl...,nucl-th
4,Exact results for the Wigner transform phase s...,Closed form analytical expressions are obtai...,physics.atom-ph


In [None]:
# Replace line breaks with spaces so that words on adjacent lines are not fused together
papers['abstract'] = papers['abstract'].apply(lambda x: x.replace("\n", " ").strip())
papers['text'] = papers['title'] + '. ' + papers['abstract']
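
A slightly more thorough cleaning step, sketched here as an assumption rather than as part of the pipeline, collapses every run of whitespace (spaces, tabs, line breaks) into a single space. The helper `normalize_whitespace` and the sample string are illustrative.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and line breaks into single spaces."""
    return " ".join(text.split())

raw = "  We develop an\nextension  of Bohmian mechanics\n"
print(normalize_whitespace(raw))
```

It could be applied to a column with `papers['abstract'].apply(normalize_whitespace)`.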

In [None]:
papers.head(5)

Unnamed: 0,title,abstract,categories,text
0,Bohmian Mechanics at Space-Time Singularities....,We develop an extension of Bohmian mechanics t...,quant-ph,Bohmian Mechanics at Space-Time Singularities....
1,On the derivation of exact eigenstates of the ...,We construct the states that are invariant und...,quant-ph,On the derivation of exact eigenstates of the ...
2,Weight Reduction for Mod l Bianchi Modular Forms,Let K be an imaginary quadratic field with cla...,math.NT,Weight Reduction for Mod l Bianchi Modular For...
3,Lawson Method for Obtaining Wave Functions and...,Lawson has shown that one can obtain sensible ...,nucl-th,Lawson Method for Obtaining Wave Functions and...
4,Exact results for the Wigner transform phase s...,Closed form analytical expressions are obtaine...,physics.atom-ph,Exact results for the Wigner transform phase s...


In [None]:
df = papers[["text","categories"]].copy()
df

Unnamed: 0,text,categories
0,Bohmian Mechanics at Space-Time Singularities....,quant-ph
1,On the derivation of exact eigenstates of the ...,quant-ph
2,Weight Reduction for Mod l Bianchi Modular For...,math.NT
3,Lawson Method for Obtaining Wave Functions and...,nucl-th
4,Exact results for the Wigner transform phase s...,physics.atom-ph
...,...,...
40976,Constant of Motion for several one-dimensional...,physics.class-ph
40977,Activity ageing in growing networks. We presen...,physics.soc-ph
40978,Simple computer model for the quantum Zeno eff...,quant-ph
40979,Alternative Derivation of the Hu-Paz-Zhang Mas...,quant-ph


In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(df['categories'])

LabelEncoder()

In [None]:
# Encode the whole column in one vectorized call instead of once per row
df['categories_encoded'] = label_encoder.transform(df['categories'])
df

Unnamed: 0,text,categories,categories_encoded
0,Bohmian Mechanics at Space-Time Singularities....,quant-ph,140
1,On the derivation of exact eigenstates of the ...,quant-ph,140
2,Weight Reduction for Mod l Bianchi Modular For...,math.NT,83
3,Lawson Method for Obtaining Wave Functions and...,nucl-th,98
4,Exact results for the Wigner transform phase s...,physics.atom-ph,103
...,...,...,...
40976,Constant of Motion for several one-dimensional...,physics.class-ph,106
40977,Activity ageing in growing networks. We presen...,physics.soc-ph,119
40978,Simple computer model for the quantum Zeno eff...,quant-ph,140
40979,Alternative Derivation of the Hu-Paz-Zhang Mas...,quant-ph,140
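
`LabelEncoder` assigns integer ids to the sorted unique labels, and the same table is needed later to turn predicted ids back into category names. A minimal pure-Python sketch of that mapping, using a hypothetical helper `fit_label_index` and a toy label list:

```python
def fit_label_index(labels):
    """Return the sorted unique classes and a label -> integer id lookup."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}

sample_labels = ["quant-ph", "math.NT", "nucl-th", "quant-ph"]
classes, to_id = fit_label_index(sample_labels)

encoded = [to_id[label] for label in sample_labels]  # labels -> ids
decoded = [classes[idx] for idx in encoded]          # ids -> labels
print(encoded, decoded)
```

With the fitted encoder from this notebook, the reverse step corresponds to `label_encoder.inverse_transform`.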


Tokenization

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Report the GPU name when one is available; fall back to the device type on CPU
torch.cuda.get_device_name(0) if device.type == "cuda" else device.type

'Tesla T4'

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [None]:
text = df.text.values
labels = df.categories_encoded.values

In [None]:
#Complete Tokenization
inputIds = [tokenizer.encode(element, add_special_tokens= True) for element in text]

In [None]:
print('Max text length: ', max([len(ele) for ele in inputIds]))

Max text length:  907


In [None]:
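# Truncate or right-pad every sequence to MAX_LEN tokens.
# GPT-2 has no dedicated padding token; the pad value 0 used here is also the id of the token "!".
MAX_LEN = 512
print("\nTruncate sentences to %d values\n"%MAX_LEN)
inputIdsTrunc = pad_sequences(inputIds, maxlen=MAX_LEN, dtype="long", value=0, truncating="post", padding="post")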


Truncate sentences to 512 values



In [None]:
print('Max text length after Truncating: ', max([len(ele) for ele in inputIdsTrunc]))

Max text length after Truncating:  512


In [None]:
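# Attention masks: 1 for real tokens, 0 for padding. The mask is built from each
# sequence's length because the pad id 0 is also a real GPT-2 token ("!").
attentionMasks = [[1] * min(len(ele), MAX_LEN) + [0] * (MAX_LEN - min(len(ele), MAX_LEN)) for ele in inputIds]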
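
Padding and masking can be sketched without any library: truncate or right-pad each id list to a fixed length, and mark real tokens with 1 based on the sequence length rather than the id values, since the pad id may also be a legitimate token id. The helper `pad_and_mask` and the toy ids are illustrative assumptions.

```python
def pad_and_mask(sequences, max_len, pad_id=0):
    """Post-truncate and post-pad id lists, returning padded ids and attention masks."""
    padded, masks = [], []
    for ids in sequences:
        kept = ids[:max_len]                       # drop tokens beyond max_len
        n_pad = max_len - len(kept)                # number of padding slots needed
        padded.append(kept + [pad_id] * n_pad)
        masks.append([1] * len(kept) + [0] * n_pad)
    return padded, masks

# The first sequence contains a real token with id 0; it must stay unmasked.
toy_ids, toy_masks = pad_and_mask([[5, 0, 7], [1, 2, 3, 4, 5, 6]], max_len=4)
print(toy_ids)
print(toy_masks)
```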

In [None]:
# Train/validation split: splitting ids, masks, and labels in one call keeps them aligned
trainInputs, validationInputs, trainMasks, validationMasks, trainLabels, validationLabels = train_test_split(
    inputIdsTrunc, attentionMasks, labels, random_state=2018, test_size=0.2)
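
What matters in this step is that ids, masks, and labels are all cut by the same shuffled positions. A minimal sketch of that idea with the standard library is shown below. The helper `split_indices` and the toy arrays are hypothetical, and the shuffle does not reproduce scikit-learn's exact ordering.

```python
import math
import random

def split_indices(n_items, test_size=0.2, seed=2018):
    """Shuffle item positions once and cut them into train and validation parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_items * test_size)
    return indices[n_val:], indices[:n_val]

# Toy stand-ins for the id, mask, and label arrays.
toy_inputs = [[i, i + 1] for i in range(10)]
toy_masks = [[1, 1] for _ in range(10)]
toy_labels = [i % 3 for i in range(10)]

train_idx, val_idx = split_indices(len(toy_inputs))
# Index every array with the same positions so rows stay aligned.
train_inputs = [toy_inputs[i] for i in train_idx]
train_masks = [toy_masks[i] for i in train_idx]
train_labels = [toy_labels[i] for i in train_idx]
print(len(train_idx), len(val_idx))
```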

In [None]:
#Data-Type Conversion to Torch Tensor
trainInputs = torch.tensor(trainInputs)
validationInputs = torch.tensor(validationInputs)

trainLabels = torch.tensor(trainLabels)
validationLabels = torch.tensor(validationLabels)

trainMasks = torch.tensor(trainMasks)
validationMasks = torch.tensor(validationMasks)

In [None]:
batch_size = 1
# Training DataLoader: shuffle the examples each epoch
train_data = TensorDataset(trainInputs, trainMasks, trainLabels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Validation DataLoader: order does not affect the metrics, so iterate sequentially
val_data = TensorDataset(validationInputs, validationMasks, validationLabels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

Model Definition - GPT2

In [None]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels = len(set(labels)))
model.to(device)

Downloading:   0%|          | 0.00/523M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid

In [None]:
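# Optimizer: PyTorch's AdamW (the copy in `transformers` is deprecated).
# weight_decay=0.0 matches the default of the `transformers` implementation.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8, weight_decay=0.0)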



In [None]:
epochs = 2
train_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = train_steps)
scheduler

<torch.optim.lr_scheduler.LambdaLR at 0x7fd585c53110>
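
The linear schedule scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero at the last training step. The sketch below reproduces that factor in plain Python. The function name `linear_schedule_multiplier` is illustrative, and the step count uses the 32784 batches per epoch reported by the training progress bar.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate factor: linear warmup to 1.0, then linear decay to 0.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 32784 * 2  # batches per epoch times epochs
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))
```

With zero warmup steps, as configured here, the rate starts at its full value and halves by the midpoint of training.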

Training

In [None]:
loss_values = []
for i in range(epochs):
    print("")
    print('Epoch: {}'.format(i + 1))
    print('Training...')
    # Reset the total loss for this epoch.
    total_loss = 0

    model.train()

    # For each batch of training data...
    for step, batch in enumerate(tqdm(train_dataloader)):
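        # `batch` contains three PyTorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        batch_input_ids = batch[0].to(device)
        batch_input_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)

        # Clear gradients left over from the previous step
        model.zero_grad()
        # Passing `labels` makes the model return the cross-entropy loss as its first output
        outputs = model(batch_input_ids, attention_mask=batch_input_mask, labels=batch_labels)
        loss = outputs[0]
        total_loss += loss.item()

        # Backpropagate, then clip the gradient norm to 1.0 to limit exploding gradients
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the weights and advance the learning-rate schedule
        optimizer.step()
        scheduler.step()

    # Record the average training loss for this epoch
    loss_values.append(total_loss / len(train_dataloader))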

    
print("")
print("Training complete!")


Epoch: 1
Training...


  0%|          | 0/32784 [00:00<?, ?it/s]


Epoch: 2
Training...


  0%|          | 0/32784 [00:00<?, ?it/s]


Training complete!
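
The training loop clips gradients by their global norm: if the combined L2 norm of all gradients exceeds the limit, every gradient is scaled down by the same factor. A minimal sketch on a flat list of numbers is shown below; the helper `clip_by_global_norm` is illustrative, while the real call operates on the model's parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```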


Testing and Predictions

In [None]:
predictions = []
true_labels = []
model.eval()
for batch in val_dataloader:
    # Move the batch (input ids, attention mask, labels) to the device
    batch = tuple(t.to(device) for t in batch)
    batchInputIds, batchInput_mask, batchLabels = batch
    # No labels are passed, so the first output holds the logits; gradients are not needed
    with torch.no_grad():
        outputs = model(batchInputIds, attention_mask=batchInput_mask)

    logits = outputs[0].detach().cpu().numpy()
    labelIds = batchLabels.to('cpu').numpy()

    predictions.append(logits)
    true_labels.append(labelIds)

In [None]:
finalPredictions = [ele for predList in predictions for ele in predList]
finalPredictions = np.argmax(finalPredictions, axis=1).flatten()
finalTrueLabels = [ele for trueList in true_labels for ele in trueList]

Performance Metrics

In [None]:
# scikit-learn expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(finalTrueLabels, finalPredictions)

0.7671099182627791

In [None]:
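# scikit-learn expects (y_true, y_pred). The weighted average uses true-label counts as
# class weights, so the score recorded below, which was computed with the arguments
# reversed, may differ slightly from what this call returns.
f1_score(finalTrueLabels, finalPredictions, average="weighted")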

0.7859698507144183
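
The weighted F1 score averages the per-class F1 values, using each class's share of the true labels as its weight. This is why the order of the two arguments to `f1_score` matters even though per-class F1 itself is symmetric. A minimal pure-Python sketch follows; the helper `weighted_f1` and the toy labels are illustrative, not part of the notebook.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to true-label support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score

toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 1, 1]
print(weighted_f1(toy_true, toy_pred))  # weights taken from toy_true
print(weighted_f1(toy_pred, toy_true))  # reversed arguments change the weights
```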

In [None]:
torch.save(model.state_dict(), 'gpt2_model_weights_5.pth')

In [None]:
torch.save(model, 'gpt2_model_5.pth')