# Z3 - Basque GPT-2 Classification via Prompting

## Description
Explore the use of a generative model (GPT-2) for text classification
through prompting techniques (few-shot learning).

## Approach
- **Model**: ClassCat/gpt2-small-basque-v2
- **Technique**: Few-shot prompting
- **Tokenizer**: BPE with 50,000 token vocabulary

## Method
1. Build a prompt with examples from the training dataset
2. Append the text to classify
3. Predict the label based on the next token probability
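
The last step can be illustrated without loading the model. The following is a minimal sketch, assuming made-up logits over a six-token toy vocabulary and hypothetical label names (`label_a`, `label_b`, `label_c`); it only shows how restricting the next-token distribution to the label tokens yields a prediction, and is not the notebook's actual scoring code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(next_token_logits, label_first_token_ids):
    """Return the label whose first token has the highest next-token probability.

    label_first_token_ids maps each label to the vocabulary id of its first token.
    """
    probs = softmax(next_token_logits)
    return max(label_first_token_ids, key=lambda label: probs[label_first_token_ids[label]])

# Toy logits for a 6-token vocabulary (made-up numbers, not model output).
toy_logits = [2.0, 0.5, -1.0, 1.5, 0.1, 3.0]
# Hypothetical labels and the ids of their first tokens.
toy_label_ids = {"label_a": 1, "label_b": 3, "label_c": 4}

predicted = pick_label(toy_logits, toy_label_ids)
print(predicted)
```

Token 5 has the highest logit overall, but it is not the first token of any label, so it is ignored; only the relative ranking among the label tokens matters.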

**Expected result**: weighted F1 ≈ 0.0936, which is very low and suggests that this model and prompting setup are not suitable for the task.

In [None]:
!pip install transformers
!pip install datasets==2.13.1



In [None]:
# Only the names used in this notebook are imported.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
from sklearn.metrics import f1_score
import json
import random

In [None]:
tokenizer = AutoTokenizer.from_pretrained("ClassCat/gpt2-small-basque-v2")
model = AutoModelForCausalLM.from_pretrained("ClassCat/gpt2-small-basque-v2")

In [None]:
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/train.jsonl -O train.jsonl
!wget https://raw.githubusercontent.com/orai-nlp/BasqueGLUE/refs/heads/main/bhtc/test.jsonl -O test.jsonl


## 1. Classification Function
The prediction is the label whose first token receives the highest next-token probability after the prompt.
Only the first token of each label is scored.
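
Because only the first token is scored, two labels that begin with the same token would be indistinguishable. The sketch below shows one way to check for this. It assumes a hypothetical stand-in tokenizer (`toy_encode`, which splits text into 2-character chunks) and made-up label names, so that it runs without downloading the model.

```python
from collections import defaultdict

def first_token_collisions(labels, encode):
    """Group labels by their first token and return the groups with more than one label."""
    groups = defaultdict(list)
    for label in labels:
        groups[encode(label)[0]].append(label)
    return {tok: names for tok, names in groups.items() if len(names) > 1}

def toy_encode(text):
    """Hypothetical stand-in tokenizer: splits text into 2-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]

# Made-up labels, chosen so that two of them share a first chunk.
collisions = first_token_collisions(["Kultura", "Kutxa", "Politika"], toy_encode)
print(collisions)
```

With the real tokenizer, passing `tokenizer.encode` in place of `toy_encode` should perform the same check on the dataset's labels; an empty result would mean first-token scoring can at least separate all labels.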

In [None]:
def classify_with_prompt(prompt, labels):
    """Return the label whose first token is the most probable next token after `prompt`."""
    inputs = tokenizer.encode(prompt, return_tensors='pt')
    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        next_token_logits = model(inputs).logits[0, -1, :]
    probabilities = F.softmax(next_token_logits, dim=-1)

    # Only the first token of each label is used.
    label_ids = [tokenizer.encode(label)[0] for label in labels]

    # Index of the label with the highest probability among the candidates.
    best = torch.argmax(probabilities[label_ids]).item()
    return labels[best]

## 2. Prompt Construction (Few-Shot)
Build a prompt with random examples from the training dataset,
respecting the context token limit.

Prompt format:
```
prompt: '<example text 1>'
label: '<label 1>'.
prompt: '<example text 2>'
label: '<label 2>'.
...
prompt: '<text to classify>'
```
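
As an illustration of the format and of the token budget, here is a minimal sketch of greedy prompt packing. It assumes a hypothetical token counter (`count_words`, which counts whitespace-separated pieces rather than BPE tokens) and toy examples with made-up labels, so the numbers will not match what the real tokenizer reports.

```python
def format_example(text, label):
    """Render one labeled example in the prompt format shown above."""
    return f"prompt: '{text}'\nlabel: '{label}'.\n"

def pack_examples(examples, max_tokens, count_tokens):
    """Greedily add examples in order until the next one would exceed max_tokens."""
    prompt, used = "", 0
    for text, label in examples:
        block = format_example(text, label)
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break
        prompt += block
        used += cost
    return prompt

def count_words(s):
    """Hypothetical token counter: whitespace-separated pieces, not BPE tokens."""
    return len(s.split())

# Toy examples with made-up labels "A" and "B".
examples = [("kaixo mundua", "A"), ("egun on", "B"), ("gabon", "A")]
few_shot = pack_examples(examples, max_tokens=8, count_tokens=count_words)

# The text to classify is appended last, without a label line.
full_prompt = few_shot + "prompt: 'arratsalde on'\n"
print(full_prompt)
```

Each toy example costs five "tokens" under this counter, so a budget of eight admits only the first one. A tight budget therefore leaves room for very few demonstrations.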

In [None]:
def build_prompt_and_extract_labels(filepath, max_tokens):
    """Build a few-shot prompt from randomly ordered training examples within `max_tokens`."""
    prompt = ""

    with open(filepath, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    random.shuffle(lines)

    current_tokens = 0
    for line in lines:
        data = json.loads(line)
        text = data['text']
        label = data['label']

        example = f"prompt: '{text}'\nlabel: '{label}'.\n"
        num_tokens = len(tokenizer.encode(example))

        if current_tokens + num_tokens <= max_tokens:
            prompt += example
            current_tokens += num_tokens
        else:
            if current_tokens == 0:
                # Even the first example exceeds the budget: keep a truncated
                # version of it so the prompt is never empty.
                truncated_line = tokenizer.decode(tokenizer.encode(example, truncation=True, max_length=max_tokens))

                # Drop as many trailing characters as the label suffix is long,
                # then re-append the complete label line.
                prompt += truncated_line[:-len(tokenizer.decode(tokenizer.encode(f"label: '{label}'.\n")))]
                prompt += f"\nlabel: '{label}'.\n"
            # Stop at the first example that does not fit.
            break

    return prompt

## 3. Evaluation
For each test example:
1. Generate a random few-shot prompt
2. Append the text to classify
3. Predict the label

**Note**: The low score (weighted F1 ≈ 0.0936, i.e. 9.36%) suggests that Basque GPT-2
is poorly suited to classification when used this way, and that this prompting setup
(a 45-token budget and first-token label scoring) is not effective for this type of problem.
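
To put the score in context, it may help to see how a support-weighted F1 behaves on a tiny case. The sketch below is a plain-Python version of the weighted average (it does not use scikit-learn) applied to made-up labels; it shows that a constant predictor can still score well above zero, so a very low weighted F1 is worth comparing against such a trivial baseline.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (n / total) * f1
    return score

# Made-up toy labels; the predictor always answers "A".
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "A"]
print(round(weighted_f1(y_true, y_pred), 4))
```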

In [None]:
def evaluate_with_prompt(test_filepath, labels, max_tokens=45):
    """Classify each test example with a fresh random few-shot prompt and print the weighted F1."""
    y_true = []
    y_pred = []

    with open(test_filepath, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            text = data['text']
            true_label = data['label']

            # A new random few-shot prompt is drawn for every test example.
            training_prompt = build_prompt_and_extract_labels("train.jsonl", max_tokens)

            query = f"prompt: '{text}'\n"
            num_tokens_test = len(tokenizer.encode(query))

            # Shrink the few-shot part if examples plus query exceed the budget
            # (at least one token of it is always kept).
            truncated_length = max(1, max_tokens - num_tokens_test)
            if len(tokenizer.encode(training_prompt)) + num_tokens_test > max_tokens:
                training_prompt = tokenizer.decode(tokenizer.encode(training_prompt, truncation=True, max_length=truncated_length))

            full_prompt = training_prompt + query

            predicted_label = classify_with_prompt(full_prompt, labels)

            y_true.append(true_label)
            y_pred.append(predicted_label)

    # Weighted average: each class counts in proportion to its support.
    f1 = f1_score(y_true, y_pred, average='weighted')
    print(f"F1 Score: {f1}")



In [None]:
def get_labels(filepath):
    """Collect the distinct labels in a JSONL file, sorted for a reproducible order."""
    labels = set()
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            labels.add(data['label'])
    return sorted(labels)

labels = get_labels('train.jsonl')

In [None]:
evaluate_with_prompt('test.jsonl', labels)

F1 Score: 0.09367418923153888
