Before running, choose Runtime > Change runtime type and select GPU.


# Task description

The goal is to classify whether an ArXiv paper is AI-relevant or not. There’s a train, dev, and test set, each with 500 entries. The test set doesn’t have labels. We provide all these data points to help you evaluate your solution, but as described above the solution should work with only 20 labeled examples.'

Test how well GPT-2 performs on it when applied in a straightforward way (few-shot learning with examples in prompt).
Experiment with changes that may improve it (e.g. adjustments to the prompt, using GPT-2 as part of more complex schemes, other models and training methods).
Don’t fine-tune GPT-2 or another model on the full training set, since in practice you will only have 20 labeled data points.

Deliverables:
Share a writeup with you findings on:
* How well was GPT-2 able to perform on this task?
* What tweaks that you tried worked vs. didn’t work?
* What would you recommend based on these results? 
* What would be good next steps?
Classify the test set and share a jsonl with the classifications.
Share your code.

We’ll evaluate your project using these criteria, in approximate order of importance:
* Code quality (readability, extensibility)
* Optimization process for improving performance
* Communication regarding results and recommendations
* Performance on the classification task

# Setup



In [1]:
!pip install transformers
!pip install torch
!pip install sentencepiece # to fix errors in T0pp section

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from transformers import (set_seed,
                          GPT2Config,
                          GPT2Tokenizer,
                          GPT2LMHeadModel,
                          pipeline)
from sklearn.metrics import classification_report, accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
import json
import numpy as np
import pandas as pd
import random
from tqdm.notebook import tqdm
import collections
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords

# Data processing

Here I define a class for parsing the datasets and read in each data file.

In [3]:
class Dataset():
    '''
    Class for parsing in the data and retrieving info about it
    '''
    def __init__(self, path):
        self.texts = []
        self.labels = []

        def load_jsonl(filename):
            ''' Load JSONL file and read it with split lines '''
            f = open(filename)
            return [json.loads(line) for line in f.read().splitlines()]

        self.data = load_jsonl(path) # load training data as list of dicts

        # number of examples 
        self.n_examples = len(self.data)
        return

    def __len__(self):
        ''' Return number of examples '''
        return self.n_examples

    def __getitem__(self, item):
        ''' 
        Given an index return an example from the position.
        :param item (int): Index position to pick an example to return.
        :return Dict[str, str]: dictionary of inputs that contain text w/ labels
        '''
        return {'text':self.texts[item],
                'label':self.labels[item]}

    def get_examples(self):
        '''
        Get positive (AI-relevant) and negative (non-AI) examples from training data
        '''
        true_AI = [x for x in self.data if x['label'] == 'True']
        false_AI = [x for x in self.data if x['label'] == 'False']
        return true_AI, false_AI

    def get_rand_examples(self, amount_pos, amount_neg):
        '''
        Choose a random sample of positive & negative examples
        '''
        # get AI-relevant and AI-irrelevant texts from method of this class
        true_AI, false_AI = self.get_examples()

        # get random sample of examples
        pos_examples = random.sample(true_AI, amount_pos)
        neg_examples = random.sample(false_AI, amount_neg)

        # merge the examples together and return the result
        return pos_examples + neg_examples

    def get_sample(self, n_samples=2):
        '''
        Get uniform randomly sampled examples from the data (with both true and false examples)
        '''
        if n_samples == 1:
            return random.choice(self.data)
        elif (n_samples % 2) != 0:
            div_odd = lambda n: (n//2, n//2 + 1)
            pos, neg = div_odd(n_samples)
        else:
            pos = n_samples // 2; neg = n_samples // 2

        return self.get_rand_examples(pos, neg)

In [7]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# save data folder path 
data_path = "/content/drive/MyDrive/Colab/OughtProject/data"

# save filepath variables
train_path = f"{data_path}/train.jsonl"
test_path = f"{data_path}/test_no_labels.jsonl"
dev_path = f"{data_path}/dev.jsonl"

# create Dataset objects for each data file
train_data = Dataset(path=train_path)
test_data = Dataset(path=test_path)
dev_data = Dataset(path=dev_path)

# Applying GPT-2 with Few-Shot Learning


## GPT-2 Language Model Initialization

Resources:

1.   https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel
2.   https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-generation/run_generation.py




The following class refactors the original utility functions into a single GPT-2 language model class. 

In [None]:
class GPT2LM:
    '''
    Superclass for GPT-2 language models. 
    '''
    def __init__(self):
        # set seed for simple reproduction of these results
        set_seed(42)

        # truncate text sequences to specific (tokenized) length (set to gpt2-xl)
        self.max_length = 1024 # if None use max length allowed by model

        # set model and default device
        self.model_name = 'gpt2'
        
        # create GPT language model from HuggingFace pretrained models
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = GPT2LMHeadModel.from_pretrained('gpt2',pad_token_id=self.tokenizer.eos_token_id)
        self.model.eval().cuda()

    def generate(self, prompt, max_length=5, stop_token=None):
        '''
        Generate a list of words to autocomplete a prompt. 
        :param prompt: The prompt to complete.
        :param max_length: The maximum number of words to generate.
        :param stop_token: A token to stop generating words.
        :return: A list of words to complete the prompt.
        '''
        # tokenize the prompt text and save input_ids [0], token_type_id [1], and attention_mask [2]
        input_ids = self.tokenizer.encode(prompt,
                                          truncation=True,
                                          return_tensors="pt")
        
        # generate input text IDs for the model
        generated_text_ids = self.model.generate(input_ids=input_ids.cuda(), 
                                            max_length=self.max_length, 
                                            do_sample=False)
        
        # decode the generated text from the tokenizer
        generated_text = self.tokenizer.decode(generated_text_ids[0], 
                                          clean_up_tokenization_spaces=True)
        post_prompt_text = generated_text[len(self.tokenizer.decode(input_ids[0], 
                                                                    clean_up_tokenization_spaces=True)):]
        # return the prompt plus the generated text
        return prompt + post_prompt_text[:post_prompt_text.find(stop_token) if stop_token else None]
    

    def get_logits_and_tokens(self, text):
        '''
        Returns the logits and tokens for the given text
        '''
        input_ids = self.tokenizer.encode(text, return_tensors="pt")
        tokens = [self.tokenizer.decode([input_id]) for input_id in input_ids[0]]
        output = self.model(input_ids.cuda())
        return output.logits[0][:-1], tokens

## Define GPT-2 Language Model Classifier

In [None]:
class GPTClassifier(GPT2LM):
    '''
    Classify papers as AI-relevant or not with GPT-2 few-shot learning
    '''
    def __init__(self, dataset, num_examples=3, default_examples=None):
        self.instructions = '''"Classify each example paper as 'AI' or 'Not AI':\n--'''

        # inherit the GPT2LM superclass 
        super(GPTClassifier, self).__init__()

        # save data and dataset as class variables
        self.dataset = dataset
        self.data = dataset.data

        if default_examples is None:
            # get a number of examples from the Dataset object passed to this method
            self.examples = dataset.get_sample(num_examples)
        else:
            # set the examples to the default examples (used for test data)
            self.examples = default_examples
            
    def render_example(self, example):
        ''' Render an example of a paper '''
        # take the title & abstract texts and strip them 
        title = example["text"].split(".")[0].strip()
        abstract = example["text"][len(title)+1:].strip()
        # return the title, abstract, and label of paper (if AI-relevant or not)
        label = "AI" if example["label"] == "True" else "Not AI"
        rendered = f"""Title: {title}\nAbstract: {abstract}\nLabel: {label}"""
        return rendered

    def render_end_example(self, example):
        ''' Render end example, with predicted label '''
        title = example["text"].split(".")[0].strip()
        abstract = example["text"][len(title)+1:].strip()
        rendered = f"""Title: {title}\nAbstract: {abstract}\nLabel:"""
        return rendered
    
    def make_prompt(self, instructions, examples, end_example):
        ''' Print a series of examples to make a prompt for GPT-2 '''
        rendered_examples = "\n\n--\n\n".join([self.render_example(example) for example in examples])
        return f"""{instructions}
        {rendered_examples}
        --
        {self.render_end_example(end_example)}"""
    
    def classify(self, text):
        ''' Classify a given text as AI relevant or not '''
        prompt = self.make_prompt(self.instructions, self.examples, text)
        gen_text = self.generate(prompt, stop_token="\n")

        # get the prediction from the generated text
        pred = gen_text.split('\n')[-1].strip("Label: ")
        return pred
    
    def evaluate(self, samples=50):
        ''' Evaluate the model's accuracy based on a given number of samples '''
        samples = self.dataset.get_sample(samples)

        hits = [] # list to keep track of prediction accuracy
        for sample in samples:
            response = self.classify(sample)
            # create dictionary to convert predictions to True/False
            response_dict = {
                'AI': 'True',
                'Not AI': 'False'
            }
            pred = response_dict.get(response, 'Invalid response')
            real = sample['label']
            hits.append(pred == real)
    
        # calculate accuracy and return
        accuracy = np.array(hits).sum() / len(hits)
        print(f"Model accuracy: {accuracy * 100}%")
        return accuracy

    def classify_dataset(self, num = None):
        ''' 
        Classify all the papers in the dataset and save as a dictionary 
        :param num (int): number of papers to classify (default to length of dataset when None)
        '''
        # if no number specified, classify entire dataset; else, classify up to the number
        paper_list = self.data if num is None else self.data[:num] 
            
        results = [] # create empty list to store all papers (with predictions)
        for paper in paper_list:
            pred = self.classify(paper)
            print(f"Classification result for paper '{paper['text'].split(' ')[0]}': {pred}")
            paper["label"] = pred # add prediction to paper 
            results.append(paper)
            
        return results 

## Initialize and evaluate classifier 

In [None]:
# initialize the classifier with the training data
classifier = GPTClassifier(train_data, num_examples=3)

In [None]:
# example classification - choose a paper to classify and predict it 
text = train_data.data[5] # should be False ('Not AI') for 5
pred = classifier.classify(text)
print(pred)

Not AI


In [None]:
# evaluation - used 50 samples for actual evaluation, shortened to 5 to reduce runtime)
evaluated = classifier.evaluate(samples=5)
evaluated

Input length of input_ids is 1024, but ``max_length`` is set to 1024. This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


Model accuracy: 60.0%


0.6

Depending on which samples are selected, this evaluation returns an accuracy between about 40% and 50%. 

In [None]:
# initialize a classifier for predicting the test (no labels) dataset 
# use examples from the training dataset for the GPT-2 prompt (2 AI-relevant, 2 AI-irrelevant examples)
pos, neg = train_data.get_examples()
train_examples = pos[:1] + neg[:1]
classifier = GPTClassifier(test_data, num_examples=3, default_examples=train_examples)

In [None]:
# classify part of the test predictions dataset with GPT-2 few-shot learning
# (takes too long to classify the entire 500-line dataset with GPT-2, couldn't finish run in kaggle/colab)
test_predictions = classifier.classify_dataset(num=5)

Classification result for paper 'out': Not AI
Classification result for paper 'level': Not AI
Classification result for paper 'dbscan': Not AI
Classification result for paper 'spherical': Not AI
Classification result for paper 'rdec': Not AI


In [None]:
def export_jsonl(predictions, filename='predictions.jsonl'):
    '''Function to convert a classified predictions dictionary to a jsonl file '''
    # convert to jsonl and save
    with open(filename, 'w') as f:
        for item in predictions:
            f.write(json.dumps(item))
            f.write('\n')
    print(f"Saved predictions to file {filename}")
    return filename

# Keyword-Based Classification

This simple approach uses domain-specific knowledge to classify papers as AI-relevant based on the presence of keywords. This simulates how a human might classify these papers - by scanning for keywords they recognize in the paper titles and abstracts. 

This approach can also serve as a baseline, as it is much more lightweight than a full-scale language mode and will run faster. If other models cannot outperform this baseline, it indicates that a simple dictionary of keywords can 
Along with the manually added keywords, I scan all the papers in the dataset that are AI-relevant, and add some of the most common keywords to the keyword list.

However, classification with keywords runs into the problem of generalizability - these keywords may be over-represented in this dataset compared to the full corpus of all AI-relevant papers. Further, there may be papers not relevant to AI that use these keywords, resulting in false positives. (Although arguably any paper that mentions ML or AI methods could be considered "AI-relevant"). 

In [None]:
class KeywordClassifier:
    def __init__(self, dataset, auto_keywords=False):
        '''
        Classify papers as AI-relevant or not with domain-specific keyword list
        :param auto_keywords - if True, will use automated keyword selection using add_keywords()
        '''
        self.auto_keywords = auto_keywords
        self.dataset = dataset
        self.data = dataset.data
        # this list is manually compiled based on expected words & by looking at top words from add_keywords()
        self.keyword_list = ["machine learning", "artificial intelligence", "gradient descent",
                             "neural network", "deep learning", "transformer", "natural language processing",
                             "classification", "convolutional", "thresholding", "corpus", "convolutional",
                             "embedding", "pooling", "reinforcement learning"]
        
    def classify(self, text):
        ''' 
        Classify a given text as AI relevant or not with the keyword list
        '''
        if self.auto_keywords:
            self.keyword_list = [word[0] for word in self.add_keywords(num_keywords = 20)]
        # check if the text contains any of the keywords
        text_contains_keyword = any(keyword in text for keyword in self.keyword_list)
        
        # return 'AI' if text contains keyword, return 'Not AI' if not
        return 'AI' if text_contains_keyword else 'Not AI'
    
    def add_keywords(self, num_keywords=10): 
        '''
        Add keywords to the keyword list based on their frequency in the AI-relevant papers
        Warning: does not work with unlabeled data, will throw KeyError
        '''
        pos, neg = self.dataset.get_examples()
        keyword_list = [] # initialize list for storing all keywords from the text
        
        # add all words in each paper to the keyword list 
        for paper in pos:
            # put all words in the text into a list (removing periods)
            word_list = paper["text"].replace(".", "").split(" ")
            keyword_list.extend(word_list)
        
        # remove stopwords from the list using ntlk 
        stop_words = set(stopwords.words('english'))
        keyword_list = [word for word in keyword_list if word not in stop_words]
            
        # create Counter of the frequency of each keyword
        frequency = collections.Counter(keyword_list)
        self.freq_dict = dict(frequency) # convert the Counter to a dictionary and save it as class variable
        
        # get the most common N items in the frequency dict
        most_common = frequency.most_common(num_keywords)
        
        return most_common
        
    def evaluate(self, samples=None):
        ''' 
        Evaluate the model's accuracy based on a given number of samples (or the full dataset)
        :param samples: if None, evaluate based on full dataset; if number, evaluate w/ that # of samples 
        '''
        if samples is None: 
            samples = self.dataset.data
        else: 
            samples = self.dataset.get_sample(samples)

        hits = [] # list to keep track of prediction accuracy
        for sample in samples:
            response = self.classify(sample['text'])
            # create dictionary to convert predictions to True/False
            response_dict = {
                'AI': 'True',
                'Not AI': 'False'
            }
            pred = response_dict.get(response, 'Invalid response')
            real = sample['label']
            hits.append(pred == real)
    
        # calculate accuracy and return
        accuracy = np.array(hits).sum() / len(hits)
        print(f"Model accuracy: {accuracy * 100}%")
        return accuracy

    def classify_dataset(self, num = None):
        '''
        Classify all the papers in the dataset and save as a dictionary 
        :param num (int): number of papers to classify (default to length of dataset when None)
        '''
        # if no number specified, classify entire dataset; else, classify up to the number
        paper_list = self.data if num is None else self.data[:num]
            
        results = [] # create empty list to store all papers (with predictions)
        for paper in paper_list:
            pred = self.classify(paper['text'])
            paper["label"] = "True" if pred == "AI" else "False" # add prediction to paper 
            results.append(paper)
            
        return results

In [None]:
keyword_classifier = KeywordClassifier(train_data)
keyword_classifier.evaluate()

Model accuracy: 91.60000000000001%


0.916

In [None]:
dev_keyword_classifier = KeywordClassifier(dev_data)
dev_keyword_classifier.evaluate()

Model accuracy: 92.0%


0.92

In [None]:
dev_keyword_classifier = KeywordClassifier(test_data)
keyword_predictions = dev_keyword_classifier.classify_dataset()
# export_jsonl(keyword_predictions, filename="keyword_predictions.jsonl")

This demonstrates the effectiveness of simple keyword-based approach to text classification that leverages domain-specific knowledge about the field of artifical intelligence. The keyword classifier achieves a 90-92% accuracy on both the training and the dev data (these datasets combined contain 1000 total scientific papers). For comparison, humans only achieve about a 73% accuracy on the few-shot text classification tasks in the RAFT dataset ([source](https://paperswithcode.com/sota/few-shot-text-classification-on-raft)). While generalizability might be a concern for this approach, it seems reasonable to assume that most papers that contain these technical and domain-specific keywords are in fact relevant to artificial intelligence.  

# Zero-Shot Topic Classification with BART

Finally, I try a few zero-shot topic classification approaches. These models do not require any examples at all, and achieve a surprisingly high classifiction accuracy for this dataset.

Helpful resources used for this section: 


* [Joe Davidson - Zero-Shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)
* [HuggingFace Implementation of Bart-Large-MNLI](https://huggingface.co/facebook/bart-large-mnli)


Resources for possible future work: 
* [BERT with HuggingFace Transformers](https://www.kaggle.com/code/tuckerarrants/bert-with-huggingface-transformers#III.-BERT)
* [Topic Modeling in Python with BERTopic](https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9) 
* [Tutorial - Topic Modeling with BERTopic](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=QI6vwelqnTL-)
* [cuBERT-topic-modelling](https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling)
* [Summarizing topic models with Transformers](https://www.kaggle.com/code/donkeys/summarizing-topic-models-with-transformers/notebook)
* [Low-resource NLP Bootcamp](https://github.com/neubig/lowresource-nlp-bootcamp-2020)

In [None]:
# install bertopic for topic modeling with BERT
# !pip install bertopic

In [None]:
# imports for this section
from transformers import BartForSequenceClassification, BartTokenizer
import requests
from bs4 import BeautifulSoup

In [None]:
class ZeroShotClassifier:
    def __init__(self, dataset, classify_method):
        '''
        Classify papers as AI-relevant or not using zero-shot learning. 
        :param classify_type: The type of classification to perform
        '''
        self.dataset = dataset
        self.data = dataset.data
        self.classify_method = classify_method
        
    def classify(self, text):
        '''
        Classify a given text as AI relevant or not, using specified method
        '''
        if self.classify_method == "hypothesis": 
            print("Classifying text with BART hypothesis prediction method")
            return self.classify_hypothesis(text)
        else:
            print("Classifying text with BART topic classification method")
            return self.classify_topic(text)

    def classify_hypothesis(self, text, threshold=60):
        '''
        Classify the text using sequence-hypothesis prediction with BART
        This treats the paper as a premise, and the hypothesis as an assertion that 
        the paper is about artificial intelligence. 
        :arg threshold: if probability is above threshold, paper is AI-relevant
        :return: True if paper is AI-relevant, False if not 
        '''
        # load model pretrained on MNLI
        tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
        model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

        # pose sequence as a NLI premise and label (politics) as a hypothesis
        premise = f"{text}?"
        hypothesis = 'This text is relevant to artificial intelligence or machine learning.'

        # encode the premise & hypothesis, then run model and retrieve logits
        input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
        logits = model(input_ids)[0]

        # probability of "entailment" (2) is the probability of label being true 
        entail_contradiction_logits = logits[:,[0,2]]
        probs = entail_contradiction_logits.softmax(dim=1)
        true_prob = probs[:,1].item() * 100

        print(f"""Probability that the label is true for paper: {true_prob:0.2f}%""")
        
        return 'AI' if true_prob >= threshold else 'Not AI'

    def classify_topic(self, text):
        '''
        Classify the text with BART-Large-MNLI zero-shot topic classification 
        '''
        # initialize the BART model and set the labels to AI or Not AI
        classifier = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli")
        
        # get topics from the arxiv taxonomy
        topic_list = self.scrape_topics()

        # classify the text with multi-label false (only one label is correct)
        result = classifier(text, topic_list, multi_label=False)

        # extract scores and labels and add them to dict 
        labels = result['labels']
        scores = result['scores'] 
        res_dict = {label : score for label, score in zip(labels, scores)}

        # get the label and value of the highest-scored item in the dict
        max_value = max(res_dict.values())
        max_label = max(res_dict, key=res_dict.get)
        print(f"Highest-scored label = {max_label}:{max_value}")

        return 'AI' if max_label == 'Artificial Intelligence' else 'Not AI'

    def scrape_topics(self, url="https://arxiv.org/category_taxonomy"): 
        '''
        Scrape list of paper topics from url
        Made for arxiv category_taxonomy page, will likely break with other URLs
        '''
        # get the html page and parse it with BeautifulSoup
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        span_list = soup.findAll("span")[1:] # index past the title span

        # get text and remove parantheses
        span_list_cln = [span.text.strip('()') for span in span_list]

        # some spans are not topics, remove them - they all start lowercase
        topic_list = [word for word in span_list_cln if not word.islower()]

        # this results in a list of 155 topics, too large for classification in colab 
        # the full topic_list could be useful to improve performance with more compute
        # based on this list, manually truncate to just 10 overarching topics
        topic_list = ["Artificial Intelligence", "Computer Science", "Economics",
                      "Electrical Engineering", "Mathematics", "Physics", 
                      "Biology", "Neuroscience", "Finance", "Statistics"]

        return topic_list
          
    def evaluate(self, samples=10):
        ''' 
        Evaluate the model's accuracy based on a given number of samples (or the full dataset)
        :param samples: if None, evaluate based on full dataset; if number, evaluate w/ that # of samples 
        '''
        samples = self.dataset.get_sample(samples)

        hits = [] # list to keep track of prediction accuracy
        for sample in samples:
            response = self.classify(sample['text'])
            # create dictionary to convert predictions to True/False
            response_dict = {
                'AI': 'True',
                'Not AI': 'False'
            }
            pred = response_dict.get(response, 'Invalid response')
            real = sample['label']
            hits.append(pred == real)
    
        # calculate accuracy and return
        accuracy = np.array(hits).sum() / len(hits)
        print(f"Model accuracy: {accuracy * 100}%")
        return accuracy

    def classify_dataset(self, num = None):
        '''
        Classify all the papers in the dataset and save as a dictionary 
        :param num (int): number of papers to classify (default to length of dataset when None)
        '''
        # if no number specified, classify entire dataset; else, classify up to the number
        paper_list = self.data if num is None else self.data[:num] 
            
        results = [] # create empty list to store all papers (with predictions)
        for paper in paper_list:
            pred = self.classify(paper)
            print(f"Classification result for paper '{paper['text'].split(' ')[0]}': {pred}")
            paper["label"] = pred # add prediction to paper 
            results.append(paper)
        
        return results

In [None]:
hypothesis_classifier = ZeroShotClassifier(train_data, classify_method="hypothesis")

The line below takes a long time to run, and has been shortened to just 5 samples. However, the ZeroShotClassifier model using the hypothesis method achieved a classification accuracy of 90.0% on a sample of 50 papers.

In [None]:
evaluated = hypothesis_classifier.evaluate(samples=5)

Classifying text with BART hypothesis prediction method
Probability that the label is true for paper: 99.95%
Classifying text with BART hypothesis prediction method
Probability that the label is true for paper: 99.55%
Classifying text with BART hypothesis prediction method
Probability that the label is true for paper: 94.08%
Classifying text with BART hypothesis prediction method
Probability that the label is true for paper: 99.78%
Classifying text with BART hypothesis prediction method
Probability that the label is true for paper: 16.89%
Model accuracy: 60.0%


In [None]:
# topic classification with BART-Large-MLNI, using the arxiv topic categories
topic_classifier = ZeroShotClassifier(train_data, classify_method="topic")

In [None]:
# classify a single example using the topic classification method
example = train_data.get_sample(1)['text']

classified = topic_classifier.classify(example)

Classifying text with BART topic classification method
Highest-scored label = Mathematics:0.2151394635438919


In [None]:
evaluated = topic_classifier.evaluate(samples = 1)

Classifying text with BART topic classification method
Highest-scored label = Artificial Intelligence:0.21841001510620117
Classifying text with BART topic classification method
Highest-scored label = Computer Science:0.21647033095359802
Classifying text with BART topic classification method
Highest-scored label = Computer Science:0.2620323896408081
Classifying text with BART topic classification method
Highest-scored label = Mathematics:0.3561461269855499
Classifying text with BART topic classification method
Highest-scored label = Statistics:0.18978750705718994
Classifying text with BART topic classification method
Highest-scored label = Neuroscience:0.25566476583480835
Classifying text with BART topic classification method
Highest-scored label = Artificial Intelligence:0.25085097551345825
Classifying text with BART topic classification method
Highest-scored label = Artificial Intelligence:0.21848642826080322
Classifying text with BART topic classification method
Highest-scored label 

The BART-Large_MNLI topic classification model achieves an accuracy of x when evaluated with a uniform random sample of 50 papers.

# Classification with T0pp (T-Zero Plus Plus)

From the Big Science project - "T0 shows zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks, while being 16x smaller." 

In this section, I try out T0pp for the AI-related paper classification task. 

(This was added about 3 days after I submitted the final version of the project - consider it an extra I added just for fun/testing). 

NOTE: Thsi 

In [5]:
# imports for this section
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [6]:

# tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
# model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

# prompt = "Is this paper about artificial intelligence or not?"

# review = train_data.data[0]

# print(review)

Downloading:   0%|          | 0.00/41.5G [00:00<?, ?B/s]

OSError: ignored

In [None]:

inputs = tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))