# Scaling UP: Large Language Models


Finally! Let's turn to recent events, the advent of Large Language Models (LLMs).

![finally](https://media.giphy.com/media/hZj44bR9FVI3K/giphy.gif)

Most of the materials in this Notebook are based on
- a recent [survey article](https://arxiv.org/abs/2303.18223) on Large Language Models
- the [Stanford Course CS324](https://stanford-cs324.github.io/winter2023/assignment/) on Advances in Foundation Models

## Focus 
- Context, large, larger, largest? (Theory)
- Accessing LLMs (Practical)
- Interacting with LLMs (Practical); How to talk LLM?

## Large Language Models
- Scaling pretrained language models improves performance*
- Scaling refers to increasing model size, data and compute 
 <img src="https://s10251.pcdn.co/wp-content/uploads/2023/03/2023-Alan-D-Thompson-AI-Bubbles-Rev-7b.png" alt="model_size" width="500">

*performance on tasks the ML/NLP community cares about ("benchmarking")

### Scaling leads to qualitatively different (i.e. better?) models

Three differences between PLMs and LLMs (from the survey paper):
- LLMs **might** display emergent abilities that are not observed in smaller PLMs.
- LLMs would revolutionize the way we use AI algorithms through prompting, i.e. formulating a task so that LLMs can "understand" or at least follow it
- "Development of LLMs no longer draws a clear distinction between research and engineering."

### We might need fewer data points to create models or systems that work well!
<img src="./imgs/dldata.jpg" alt="fewerdata" width="500">


### LLMs are general-purpose language task solvers

####  Why is this so exciting?
Imagine you want to automatically classify documents by emotion (i.e. estimate P(positive | text)) or build a translation system.
- **Pre-LLM**: machine learning models (based on PLMs) are **task-specific**
    - get training data (annotations)
    - train a model that only performs well on this task with this specific data (strong limitations)
    - risk of overfitting to the data (not learning the concept of emotion)
        - spurious correlations
       
- **LLM Age**: Design and evaluate prompt (to be discussed later)

# The LLM workflow: Prompting

### "Emergent" Abilities

Question: 
- How predictable is the behaviour of LLMs? Can we predict the improvements of these models as a function of parameters/data/compute?
- Or does scaling up lead to qualitatively different models? 

[Scaling Law](https://arxiv.org/abs/2001.08361)
- with respect to some tasks such as language modelling, LLMs tend to behave in predictable ways
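
As a rough illustration of what "predictable" means here: the linked scaling-law paper fits the language-modelling loss as a power law in model size, with analogous forms for data and compute. The sketch below assumes that paper's notation; the constants are fitted empirically and are left out.

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
$$

Here $L$ is the test loss, $N$ the number of model parameters, and $N_c$ and $\alpha_N$ are fitted constants. On log-log axes this is a straight line, which is why the loss of a larger model can be extrapolated from smaller ones.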

However, other research pointed out that some abilities are not present in PLMs but unexpectedly "emerge" in LLMs.
- A notion taken from Physics: "Emergence is when quantitative changes in a system result in qualitative changes in behavior." (Anderson, 1972)
- Which abilities are we referring to:
    - In-Context Learning (zero or few-shot classification): LLMs can classify data based solely on a natural language description or a few task demonstrations.
    - Instruction-following: LLMs can handle new tasks described as instructions in natural language
    - Step-by-step reasoning: LLMs can follow intermediate reasoning steps in the process of answering a question

But this remains a topic of ongoing discussion: are emergent abilities a ["mirage"](https://arxiv.org/abs/2304.15004)? The paper disputes the following two claims about 'emergence':

1. Sharpness, transitioning seemingly instantaneously from not present to present
2. Unpredictability, transitioning at seemingly unforeseeable model scales

The debate now also has an ideological dimension. Visualisation taken from a [Washington Post](https://www.washingtonpost.com/technology/2023/04/09/ai-safety-openai/) article:
<img src="https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/44S26VMACJD2PBYDP3ODHIABWM.jpg&w=1440&impolicy=high_res" alt="ai_debate" width="500" height="600">


## Accessing LLMs: from checkpoints or via API
- A rich 'ecology' of LLMs
- Should LLMs be open-source? (See this recent [article in Nature](https://www.nature.com/articles/d41586-023-01295-4).)
- Difference between 'checkpoint' and 'API' access:
     - Checkpoint: download the model and do with it whatever you want (retrain, adapt, destroy). (NB: only if you have the computing power)
     - API access: query the model, but you cannot download or adapt it (unless you pay OpenAI, and even then you won't get to "see" the model)

# Checkpoint

Hugging Face and [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)
![bloom](https://assets.website-files.com/6139f3cdcbbff3a68486761d/62cce3c835539c54f31329b1_image1.png)
From the webpage:
"Large language models (LLMs) have made a significant impact on AI research. These powerful, general models can take on a wide variety of new language tasks from a user’s instructions. However, academia, nonprofits and smaller companies' research labs find it difficult to create, study, or even use LLMs as only a few industrial labs with the necessary resources and exclusive rights can fully access them. Today, we release BLOOM, the first multilingual LLM trained in complete transparency, to change this status quo — the result of the largest collaboration of AI researchers ever involved in a single research project."

BLOOM is one of the many open-source LLMs. For an overview of the state of the art, you can peruse the Hugging Face LLM [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

In theory we could download the 176B-parameter model. However, a Colab runtime does not have enough memory to load it. For this reason, we will use smaller models as examples.

Again, working with LLMs requires new engineering skills and $$, which few (including yours truly) have.

#### IMPORTANT: CHANGE RUNTIME (select a GPU runtime in Colab)

In [25]:
!pip install transformers torch datasets  accelerate bitsandbytes xformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.39.0-py3-none-any.whl (92.2 MB)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.19.0 bitsandbytes-0.39.0


In [26]:
import transformers
import torch
from datasets import load_dataset
from transformers import pipeline

In [28]:
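from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU if one is available (8-bit loading below needs one)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "bigscience/bloom-1b7"  # the 1.7-billion-parameter BLOOM model
# load_in_8bit quantises the weights with bitsandbytes; device_map="auto" lets accelerate place them.
# If this raises "NameError: name 'init_empty_weights' is not defined", transformers was probably
# imported before accelerate was installed: restart the runtime and run the cells again.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)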

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]



Downloading pytorch_model.bin:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

NameError: name 'init_empty_weights' is not defined

## Zero and few-shot Learning

Both are examples of "In-context" learning.

Taken from the Stanford course:

“In zero-shot prompting, an instruction for the task is usually specified in natural language. The model is expected to following the specification and output a correct response, without any examples (hence “zero shots”).

In few-shot prompting, we provide a few examples in the prompt, optionally including task instructions as well (all as natural language). Even without said instructions, our hope is that the LLM can use the examples to autoregressively complete what comes next to solve the desired task.”
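
The two styles differ only in how the prompt string is put together. A minimal sketch, using the review examples from this notebook; the helper `build_prompt` is hypothetical and not part of any library:

```python
def build_prompt(examples, query, instruction=None):
    """Assemble a zero- or few-shot sentiment prompt.

    examples: list of (review, sentiment) pairs; an empty list gives a zero-shot prompt.
    """
    parts = []
    if instruction:
        parts.append(instruction)
    for review, sentiment in examples:
        parts.append(f"Review: {review}\nSentiment: {sentiment}")
    # The final block leaves the label open for the model to complete
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)


zero_shot = build_prompt(
    [],
    "I really love this movie",
    instruction="Classify the following movie review as positive or negative",
)
few_shot = build_prompt(
    [
        ("The movie was horrible", "Negative"),
        ("The movie was the best movie I have watched all year!!!", "Positive"),
    ],
    "An awful film!",
)
print(zero_shot)
print()
print(few_shot)
```

Keeping the examples in a list makes it easy to vary the number of shots and see how the output changes.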


In [27]:
# A zero-shot prompt

prompt = f"""Classify the following movie review as positive or negative

Review: I really love this movie
Sentiment:"""

print(prompt)

Classify the following movie review as positive or negative

Review: I really love this movie
Sentiment:


In [None]:
# Feed prompt to model to generate an output
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
output = generator(prompt, max_new_tokens=20)
print(output[0]['generated_text'])

Adding examples usually improves performance.

In [None]:
# A few-shot prompt: we include a few in-context examples
# that demonstrate to the model how to complete the task
sample_review = 'An awful film!'

prompt = f"""Review: The movie was horrible
Sentiment: Negative

Review: The movie was the best movie I have watched all year!!!
Sentiment: Positive

Review: {sample_review}
Sentiment:"""

print(prompt)

In [None]:
# Feed prompt to model to generate an output
output = generator(prompt, max_new_tokens=1)
print(output[0]['generated_text'])
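
By default the text-generation pipeline returns the prompt followed by the continuation, so the predicted label still has to be cut out of `generated_text`. A minimal sketch of that post-processing step; the helper names and the example string are hypothetical, and with `max_new_tokens=1` the model may return only the first part of a label, which is why prefixes are accepted:

```python
def extract_completion(generated_text, prompt):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def normalise_label(completion, labels=("Positive", "Negative")):
    """Map a raw completion to one of the known labels, or None if nothing matches."""
    cleaned = completion.strip().lower()
    for label in labels:
        target = label.lower()
        # Accept the full label or a non-empty prefix of it (e.g. a single sub-word token)
        if cleaned and (cleaned.startswith(target) or target.startswith(cleaned)):
            return label
    return None


# Hypothetical string, shaped like output[0]['generated_text']
example_prompt = "Review: An awful film!\nSentiment:"
example_output = example_prompt + " Negative"

completion = extract_completion(example_output, example_prompt)
print(normalise_label(completion))
```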

# A more difficult task: The Living Machine

In [None]:
target_sentence = "When the ***machine*** has been let down into the sea, and the coral is thought sufficiently"
prompt = f"""We want to know if the word ***machine*** in the following sentences is animate.
With animacy we mean the property of being alive

Sentence: Immured in a convent, debarred from life-giving air and light, and the beauty of life, we cease to be living, feeling, thinking girls and women, we become mere ***machines*** who blindly obey the head that directs us.'
Animacy: Animate

Sentence: Now that we were free from all fear of encountering bad cha racters in the house, the boom-boom of the little man's big voice went on unintermittingly, like a ***machine*** at work in the neigh bourhood
Animacy: Animate

Sentence: He led his ***machine*** to the side of thi_ footpath. 
Animacy: Inanimate

Sentence: The drawing shows the ***machine*** ready to begin its forward stroke.'
Animacy: Inanimate

Sentence: {target_sentence}
Animacy: 
"""

print(prompt)

In [None]:
# Feed prompt to model to generate an output
output = generator(prompt, max_new_tokens=2)
print(output[0]['generated_text'])

In [None]:
def prompt_template(target_sentence):
    # The string is left-aligned on purpose: indentation inside a triple-quoted
    # string would end up in the prompt that is sent to the model
    return f"""We want to know if the word ***machine*** in the following sentences is animate.
With animacy we mean the property of being alive

Sentence: Immured in a convent, debarred from life-giving air and light, and the beauty of life, we cease to be living, feeling, thinking girls and women, we become mere ***machines*** who blindly obey the head that directs us.'
Animacy: Animate

Sentence: Now that we were free from all fear of encountering bad cha racters in the house, the boom-boom of the little man's big voice went on unintermittingly, like a ***machine*** at work in the neigh bourhood
Animacy: Animate

Sentence: He led his ***machine*** to the side of thi_ footpath.
Animacy: Inanimate

Sentence: The drawing shows the ***machine*** ready to begin its forward stroke.'
Animacy: Inanimate

Sentence: {target_sentence}
Animacy:"""

# API: Accessing OpenAI's GPT-3
## Text Completion

**TO DO**: create a file `openai.txt` and put your API key in there.

We use Python but there is a simple GUI [here](https://platform.openai.com/playground).

Full documentation is available [here](https://platform.openai.com/docs/api-reference/completions/create).


In [None]:
# Ask GPT-3 a question through the completions endpoint
# (this is the interface of the pre-1.0 `openai` Python package)
import openai

# Set up your OpenAI API credentials (strip the trailing newline from the key file)
with open('openai.txt', 'r') as f:
    openai.api_key = f.read().strip()

# Define the function to ask a question
def ask_question(question):
    prompt = f"Question: {question}\nAnswer:"

    # Generate a response from GPT-3
    response = openai.Completion.create(
        engine='text-davinci-003',  # Select the model you want to use
        prompt=prompt,  # Your query as a prompt
        max_tokens=50,  # Upper bound on the number of generated tokens
        n=1,  # Number of completions to generate
        stop=None,  # Optional stop sequence(s) at which generation halts
        temperature=0.0,  # Regulates randomness: lower values give more deterministic, more similar responses
        # top_p=0.1,  # Nucleus sampling: 0.1 means only tokens within the top 10% probability mass are considered
        # logprobs=None,  # Integer: also return the log probabilities of the n most likely tokens
        # presence_penalty=0,  # Between -2.0 and 2.0: positive values penalize tokens that already appear in the text, encouraging new topics
        # frequency_penalty=0,  # Between -2.0 and 2.0: positive values penalize frequent tokens, reducing verbatim repetition
    )

    # Extract the answer, keeping only the first line of the completion
    answer = response.choices[0].text.strip().split('\n')[0]
    return answer
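
To make `temperature` and `top_p` less abstract, here is a small sketch of what the two parameters do to a next-token distribution. It is a toy calculation in plain Python, not the API's actual implementation; the tokens and scores are invented for illustration. A temperature of 0 cannot be plugged into the formula (division by zero) and corresponds to the limit in which the most likely token is always chosen.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


# Hypothetical next-token scores for three candidate tokens
tokens = ["Paris", "Lyon", "London"]
logits = [2.0, 1.0, 0.0]

for t in (1.0, 0.5, 0.1):
    probs_t = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs_t])

probs = softmax_with_temperature(logits, 1.0)
print([tokens[i] for i in nucleus(probs, 0.1)])
print([tokens[i] for i in nucleus(probs, 0.9)])
```

As the temperature drops, more of the probability mass moves to the top token, which is why low temperatures produce near-identical answers across runs.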


In [None]:

question = "What is the capital of France?"
answer = ask_question(question)
print(answer)

In [None]:

question = "Translate from English to French: Hello I am Kaspar."
answer = ask_question(question)
print(answer)

LLMs have a hard time diverging from their training data

In [None]:

question = """Classify the sentences as negative or positive:
Sentence: I am so happy!
Answer: Negative

Sentence: This is such a beautiful day :-)
Answer: Negative

Sentence: I am so sad :-()
Answer: Positive

Sentence: Life is awful, I want to cry.
Answer: Positive

Sentence: I feel great!
Answer:
"""
answer = ask_question(question)
print(answer)

# Chain-of-thought prompting

In [None]:
question = """Is the machine in the following sentence Animate or Inanimate: The Russian never learns, for he is nothing but a machine."""
answer = ask_question(question)
print(answer)

In [None]:
question = """
Question: Under this point of view, Maret, who was a true official machine was the very man whom the Emperor wanted.
Reply: Animate

Question: He led his machine to the side of the footpath.
Reply: Inanimate

Question: The Russian never learns, for he is nothing but a machine.
Reply:
"""
answer = ask_question(question)
print(answer)

In [None]:
question = """The sentence contains the word machine. Categorize the sentence as Animate if:
- The sentence directly likens a human to a machine
- The sentence directly likens a machine to a human
- The sentence represents the machine as thinking or speaking

Otherwise categorize the sentence as Inanimate.


Example:
Question: Under this point of view, Maret, who was a true official machine was the very man whom the Emperor wanted.
Reasoning: The human Maret is likened to a machine.
Reply: Animate, human is likened to machine

Question: He led his machine to the side of the footpath.
Reasoning: Human is not likened to a machine
           Machine is not likened to a human
           Machine is not represented as speaking or thinking
Reply: Inanimate


Question: The Russian never learns, for he is nothing but a machine.
"""

answer = ask_question(question)
print(answer)

In [None]:
question = """The sentence contains the word machine. Categorize the sentence as Animate if:
- The sentence directly likens a human to a machine
- The sentence directly likens a machine to a human
- The sentence represents the machine as thinking or speaking

Otherwise categorize the sentence as Inanimate.


Example:
Question: Under this point of view, Maret, who was a true official machine was the very man whom the Emperor wanted.
Reasoning: The human Maret is likened to a machine.
Reply: Animate, human is likened to machine

Question: He led his machine to the side of the footpath.
Reasoning: Human is not likened to a machine
           Machine is not likened to a human
           Machine is not represented as speaking or thinking
Reply: Inanimate


Question: The machine thinks it is smarter than us.
"""

answer = ask_question(question)
print(answer)

In [None]:
question = """The sentence contains the word machine. Categorize the sentence as Animate if:
- The sentence directly likens a human to a machine
- The sentence directly likens a machine to a human
- The sentence represents the machine as thinking or speaking

Otherwise categorize the sentence as Inanimate.


Example:
Question: Under this point of view, Maret, who was a true official machine was the very man whom the Emperor wanted.
Reasoning: The human Maret is likened to a machine.
Reply: Animate, human is likened to machine

Question: He led his machine to the side of the footpath.
Reasoning: Human is not likened to a machine
           Machine is not likened to a human
           Machine is not represented as speaking or thinking
Reply: Inanimate


Question: The machine assumes it is smarter than us.
"""

answer = ask_question(question)
print(answer)
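
One practical consequence of asking for the reasoning first: `ask_question` keeps only the first line of the completion and caps it at 50 tokens, so a reply that comes after a `Reasoning:` line can get cut off. A minimal sketch of parsing the full completion instead, assuming the model mirrors the `Reasoning:` / `Reply:` layout of the examples; `parse_cot` and the sample completion are hypothetical:

```python
def parse_cot(completion):
    """Split a chain-of-thought completion into (reasoning, reply)."""
    reasoning_lines, reply = [], None
    for line in completion.strip().splitlines():
        stripped = line.strip()
        if stripped.startswith("Reply:"):
            reply = stripped[len("Reply:"):].strip()
            break  # anything after the reply is ignored
        if stripped.startswith("Reasoning:"):
            stripped = stripped[len("Reasoning:"):].strip()
        if stripped:
            reasoning_lines.append(stripped)
    return " ".join(reasoning_lines), reply


# Hypothetical completion in the format of the in-context examples
sample = """Reasoning: The human (the Russian) is likened to a machine.
Reply: Animate, human is likened to machine"""

reasoning, reply = parse_cot(sample)
print(reasoning)
print(reply)
```

If the model does not produce a `Reply:` line at all, the function returns `None` for the reply, which is a useful signal that the prompt or the token limit needs adjusting.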

More documentation on the OpenAI API is available [here](https://platform.openai.com/docs/api-reference/completions/create).

## Prompting ChatGPT

Documentation available [here](https://platform.openai.com/docs/api-reference/chat)

In [38]:
# Ask ChatGPT a question through the chat completions endpoint
import openai

# Set up your OpenAI API credentials (strip the trailing newline from the key file)
with open('openai.txt', 'r') as f:
    openai.api_key = f.read().strip()

# Define the function to ask a question
def ask_chatgpt_question(question):
    # Generate a response from ChatGPT
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Select the model you want to use
        messages=[
            # Uncomment the system message to steer the assistant's persona and language
            #{"role": "system", "content": "You are a helpful AI who always responds in French and is funny!"},
            {"role": "user", "content": question},
        ],
        # temperature=0.0
    )

    # Extract and return the answer (a message with 'role' and 'content') from the response
    answer = response.choices[0].message
    return answer


question = "How to teach a dog the 'sit' command?"
answer = ask_chatgpt_question(question)


{
  "content": "Merci beaucoup! Je suis heureux de vous \u00eatre utile. \n\nPour apprendre \u00e0 votre chien le commandement 'assis', suivez ces \u00e9tapes simples:\n\n1. Tenez une friandise devant le nez de votre chien et tirez-la vers le haut pour qu'il regarde vers le haut et se mette \u00e0 lever la t\u00eate.\n\n2. En m\u00eame temps, poussez doucement son arri\u00e8re-train vers l'arri\u00e8re avec votre autre main et dites \"assis\" d'une voix ferme et claire.\n\n3. Si votre chien s'assoit correctement, r\u00e9compensez-le avec la friandise et faites beaucoup d'\u00e9loges pour renforcer le comportement souhait\u00e9.\n\n4. Si votre chien ne s'assoit pas, recommencez les \u00e9tapes 1-3 jusqu'\u00e0 ce que vous obtenez le r\u00e9sultat souhait\u00e9.\n\n5. Pratiquez la commande 'assis' r\u00e9guli\u00e8rement avec votre chien jusqu'\u00e0 ce qu'il la ma\u00eetrise parfaitement dans toutes les situations.\n\nBon apprentissage \u00e0 vous et votre ami \u00e0 quatre pattes!",
  

In [None]:
print(answer['content'])
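
The chat endpoint takes a list of messages rather than a single prompt string, each with a role (`system`, `user` or `assistant`). A minimal sketch of assembling that list, including earlier turns of a conversation; `build_messages` is a hypothetical helper, and no API call is made here:

```python
def build_messages(question, system=None, history=None):
    """Assemble the `messages` list for a chat request."""
    messages = []
    if system:
        # The system message sets the overall behaviour and conventionally comes first
        messages.append({"role": "system", "content": system})
    # Earlier turns, given as (user_text, assistant_text) pairs
    for user_text, assistant_text in (history or []):
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


messages = build_messages(
    "How to teach a dog the 'sit' command?",
    system="You are a helpful AI who always responds in French and is funny!",
)
for m in messages:
    print(m["role"], "->", m["content"])
```

Because the model itself keeps no memory between calls, a follow-up question only has context if the previous turns are passed along in `history`.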

## Some more prompting tips

# Appendix: The PLM workflow for supervised classification

## Get training examples and annotate them

In [24]:
import numpy as np
from sklearn.metrics import f1_score, classification_report, accuracy_score
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

In [20]:
%%bash
wget "https://bl.iro.bl.uk/downloads/59a8c52f-e0a5-4432-9897-0db8c067627c?locale=en" -O animacy.zip
unzip animacy.zip

--2023-05-26 13:14:04--  https://bl.iro.bl.uk/downloads/59a8c52f-e0a5-4432-9897-0db8c067627c?locale=en
Resolving bl.iro.bl.uk (bl.iro.bl.uk)... 63.35.13.6, 34.250.15.96, 52.213.146.223
Connecting to bl.iro.bl.uk (bl.iro.bl.uk)|63.35.13.6|:443... connected.


HTTP request sent, awaiting response... 200 OK
Length: 144694 (141K) [application/zip]
Saving to: ‘animacy.zip’

     0K .......... .......... .......... .......... .......... 35% 1.01M 0s
    50K .......... .......... .......... .......... .......... 70% 3.00M 0s
   100K .......... .......... .......... .......... .         100% 5.30M=0.07s

2023-05-26 13:14:04 (1.92 MB/s) - ‘animacy.zip’ saved [144694/144694]



Archive:  animacy.zip
  inflating: LwM-nlp-animacy-annotations-machines19thC.tsv  
  inflating: read-me                 


In [21]:
dataset = load_dataset("csv", data_files="LwM-nlp-animacy-annotations-machines19thC.tsv",sep='\t')
dataset = dataset['train']
lab2code = {label:i for i,label in enumerate(dataset.unique('animacy'))}
num_labels = len(lab2code)
dataset = dataset.map(lambda x: {'label': lab2code[x['animacy']]})

Downloading and preparing dataset csv/default to /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-b6501ae6ef6834b2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-b6501ae6ef6834b2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/594 [00:00<?, ? examples/s]

## Divide data into training, validation and test splits

In [2]:
# Hold out 30% of the data for testing
test_size = int(len(dataset)*.3)
train_test = dataset.train_test_split(test_size=test_size, seed=42)
test_set = train_test['test']
# Set aside 5% of the remaining data for validation
val_size = int(len(train_test['train'])*.05)
train_val = train_test['train'].train_test_split(test_size=val_size, seed=42)

Loading cached split indices for dataset at /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-35bfe5b52d3d2487/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-a72f9c0614f0d6c9.arrow and /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-35bfe5b52d3d2487/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-469104c02e0d42ef.arrow
Loading cached split indices for dataset at /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-35bfe5b52d3d2487/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-e3da498dffb524db.arrow and /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-35bfe5b52d3d2487/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-f5a589e0070fd7e3.arrow
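
To see how many examples end up in each split, the arithmetic of the cell above can be replayed by hand. This sketch assumes the 594 annotated sentences reported by the `Map` progress bar earlier and that integer split sizes are used as absolute counts:

```python
n_total = 594  # number of annotated sentences in the dataset

test_size = int(n_total * 0.3)     # 30% held out for testing
remaining = n_total - test_size
val_size = int(remaining * 0.05)   # 5% of the rest for validation
train_size = remaining - val_size

print(test_size, val_size, train_size)
```

The test and validation sizes agree with the 178 and 20 examples shown in the progress bars of the later cells.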


## Load a Pretrained Language Model

In [3]:
checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Reuse num_labels computed from the dataset instead of hard-coding 2
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

## Preprocess data for classification (tokenization)

In [4]:
def preprocess_function(examples, target_col):
    return tokenizer(examples[target_col], truncation=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

sent_col = 'Sentence'
train_val = train_val.map(preprocess_function,fn_kwargs={'target_col': sent_col})

Loading cached processed dataset at /Users/kasparbeelen/.cache/huggingface/datasets/csv/default-35bfe5b52d3d2487/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-b69ae946ebe286ea.arrow


Map:   0%|          | 0/20 [00:00<?, ? examples/s]

## Instantiate a training routine and train model on examples

In [5]:
training_args = TrainingArguments(
    output_dir="../results",
    seed=42,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_val["train"],
    eval_dataset=train_val["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


TrainOutput(global_step=250, training_loss=0.39019122314453125, metrics={'train_runtime': 131.0142, 'train_samples_per_second': 15.113, 'train_steps_per_second': 1.908, 'total_flos': 49649401957200.0, 'train_loss': 0.39019122314453125, 'epoch': 5.0})
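
The `global_step=250` in the output can be checked against the training settings. A small sketch, assuming a single device, no gradient accumulation, and a training set of 396 sentences (594 minus the 178 test and 20 validation examples):

```python
import math

train_size = 396   # sentences left for training after the two splits
batch_size = 8     # per_device_train_batch_size
epochs = 5         # num_train_epochs

# The last, smaller batch of each epoch still counts as one optimisation step
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```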

## Evaluate on test examples

In [22]:
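test_set = test_set.map(preprocess_function,fn_kwargs={'target_col': sent_col})
predictions = trainer.predict(test_set)
# Predicted class = index of the largest logit
preds = np.argmax(predictions.predictions, axis=-1)
# scikit-learn expects the true labels first, then the predictions
f1_score(predictions.label_ids, preds, average='binary')
f1_score(predictions.label_ids, preds, average='macro')
f1_score(predictions.label_ids, preds, average='micro')
# In a notebook only the value of the last expression (the accuracy) is displayed
accuracy_score(predictions.label_ids, preds)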

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

0.8370786516853933

# The model only returns logits by class
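
Logits are unnormalised scores. To read them as class probabilities they can be passed through a softmax, and the predicted class is the position of the largest value. A minimal sketch in plain Python, using the first two rows of the `PredictionOutput` shown below:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First two rows of logits from the prediction output
rows = [[-0.79532534, 0.807309], [1.6490573, -1.8407761]]

predicted_classes = []
for logits in rows:
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # same result as np.argmax on the logits
    predicted_classes.append(predicted)
    print(predicted, [round(p, 3) for p in probs])
```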

In [18]:
predictions

PredictionOutput(predictions=array([[-0.79532534,  0.807309  ],
       [ 1.6490573 , -1.8407761 ],
       [ 1.3956069 , -1.5528322 ],
       [ 1.6038284 , -1.8307273 ],
       [ 0.35651925, -0.50323427],
       [ 1.578744  , -1.7292355 ],
       [ 1.4680144 , -1.7162293 ],
       [-0.9894928 ,  0.96960855],
       [-0.9246854 ,  0.91035235],
       [ 0.881013  , -1.0820259 ],
       [-0.91229635,  0.93462545],
       [ 1.5856853 , -1.8486041 ],
       [ 0.24181026, -0.37124705],
       [ 1.4725273 , -1.6652948 ],
       [ 1.4842608 , -1.6536903 ],
       [ 1.6254208 , -1.9031657 ],
       [ 1.4828341 , -1.7547784 ],
       [ 1.5872903 , -1.8092145 ],
       [ 1.5104885 , -1.6733677 ],
       [ 1.5685087 , -1.8136338 ],
       [-1.0376576 ,  0.9543897 ],
       [ 1.7100545 , -1.8320441 ],
       [ 1.5658672 , -1.807062  ],
       [-0.837896  ,  0.8632832 ],
       [ 1.6709071 , -1.9060299 ],
       [-0.34241217,  0.33353895],
       [-0.8912124 ,  0.8893115 ],
       [-0.85946274,  0.91