<center>

<h1>Final Assessment - Advanced Natural Language Processing</h1>

<i>

Course: 22DM015 Advanced Methods in Natural Language Processing <br>

Author(s): Ferran Boada Bergadà, Julián Romero, Lucia Sauer, Moritz Peist<br>

Programme: DSDM

<hr>

....

</i>

</center>

<hr>

## Setup

In [1]:
# Imports
import pandas as pd
import os

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from tqdm.notebook import tqdm
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import StratifiedShuffleSplit
import torch
from torch.utils.data import Dataset
import numpy as np

In [2]:
# Global constants
DATA_PATH = "../data/"
SPLITS = {
    "train": "patent/train-00000-of-00001.parquet",
    "validation": "patent/validation-00000-of-00001.parquet",
    "test": "patent/test-00000-of-00001.parquet",
}
RANDOM_SEED = 42

LABELS = {
    0: "Human Necessities",
    1: "Performing Operations; Transporting",
    2: "Chemistry; Metallurgy",
    3: "Textiles; Paper",
    4: "Fixed Constructions",
    5: "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    6: "Physics",
    7: "Electricity",
    8: "General tagging of new or cross-sectional technology",
}

## Data Load

In [3]:
# Data loading and persistence


def load_split(
    split_name, split_path, data_path=DATA_PATH, dataset="ccdv/patent-classification"
):
    """
    Load a specific split of the dataset, checking for local cache first.
    If the split is not cached locally, it will be downloaded and saved.
    Args:
        split_name (str): Name of the split (e.g., 'train', 'validation', 'test').
        split_path (str): Path to the split file in the dataset.
        data_path (str): Local path where the dataset is cached.
        dataset (str): Name of the dataset on Hugging Face Hub.
    Returns:
        pd.DataFrame: DataFrame containing the split data.
    """
    local_path = os.path.join(data_path, split_path)
    if os.path.exists(local_path):
        return pd.read_parquet(local_path)

    # Download and cache
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    df = pd.read_parquet(f"hf://datasets/{dataset}/{split_path}")
    df.to_parquet(local_path, index=False)
    return df

In [4]:
df_train = load_split("train", SPLITS["train"])
df_validation = load_split("validation", SPLITS["validation"])
df_test = load_split("test", SPLITS["test"])

## Part 1

## Part 2

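Here we load `anferico/bert-for-patents`, a BERT model trained by Google on patent text. This first cell loads it with its masked-language-modelling head; the classification experiments below reload the same checkpoint with a sequence-classification head.

In [5]:
# Load the BERT checkpoint with its masked-LM head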
model = AutoModelForMaskedLM.from_pretrained("anferico/bert-for-patents")

Some weights of the model checkpoint at anferico/bert-for-patents were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### a. BERT Model with Limited Data (0.5 points):
Train a BERT-based model using only 32 labeled examples and assess its performance.

In [6]:
# Draw a stratified 32-example subsample of the training set
sss = StratifiedShuffleSplit(n_splits=1, train_size=32, random_state=RANDOM_SEED)
train_idx, _ = next(sss.split(df_train, df_train["label"]))
df_train32 = df_train.iloc[train_idx]

#### a.1 Tokenization

In [7]:
# Load tokenizer for the patent BERT model
tokenizer = AutoTokenizer.from_pretrained("anferico/bert-for-patents")

#### a.2 Special Tokens

In [8]:
# The tokenizer already has the special tokens configured
print("Special tokens:")
print(f"[CLS]: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"[SEP]: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"[PAD]: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"[UNK]: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

Special tokens:
[CLS]: [CLS] (ID: 2)
[SEP]: [SEP] (ID: 3)
[PAD]: [PAD] (ID: 0)
[UNK]: [UNK] (ID: 1)


#### a.3 Tokens to IDs

In [9]:
def tokenize_function(examples):
    """Tokenize the text data"""
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt",
    )


# Tokenize the 32 training examples
train_encodings = tokenizer(
    df_train32["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt",
)

#### a.4 Padding and Truncation

In [10]:
# Create a custom dataset class
class PatentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {}
        for key, val in self.encodings.items():
            if torch.is_tensor(val[idx]):
                item[key] = val[idx].detach().clone()
            else:
                item[key] = torch.tensor(val[idx])
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)


# Create dataset
train_dataset = PatentDataset(train_encodings, df_train32["label"].tolist())

#### a.5 Model Setup and Training

In [11]:
# Load model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "anferico/bert-for-patents",
    num_labels=len(LABELS),  # 9 patent classes
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results_32_samples",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    eval_strategy="no",  # Skip evaluation during training; the validation set is scored once afterwards
    seed=RANDOM_SEED,
    report_to=[],  # Disable external reporting integrations (e.g. wandb)
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at anferico/bert-for-patents and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Define metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted", zero_division=0
    )
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

In [13]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
)

# Train the model
print("Training BERT model with 32 labeled examples...")
trainer.train()

Training BERT model with 32 labeled examples...


Step,Training Loss
10,2.3455


TrainOutput(global_step=12, training_loss=2.265301247437795, metrics={'train_runtime': 25.4749, 'train_samples_per_second': 3.768, 'train_steps_per_second': 0.471, 'total_flos': 89467526873088.0, 'train_loss': 2.265301247437795, 'epoch': 3.0})
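The `TrainOutput` above reports 12 optimizer steps (32 examples at batch size 8 gives 4 steps per epoch, times 3 epochs), while `warmup_steps=10`. The sketch below reproduces a linear warmup-then-decay multiplier in plain Python to show how much of the run is spent warming up. It assumes the `Trainer` defaults (linear scheduler, base learning rate 5e-5), since the notebook does not override them; `linear_schedule_multiplier` is an illustrative helper, not a library function.

```python
import math


def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values taken from the notebook; BASE_LR is the assumed Trainer default.
N_EXAMPLES, BATCH_SIZE, EPOCHS, WARMUP_STEPS = 32, 8, 3, 10
BASE_LR = 5e-5

total_steps = math.ceil(N_EXAMPLES / BATCH_SIZE) * EPOCHS
multipliers = [
    linear_schedule_multiplier(s, WARMUP_STEPS, total_steps)
    for s in range(total_steps)
]
mean_multiplier = sum(multipliers) / total_steps

for step, m in enumerate(multipliers):
    print(f"step {step:2d}: lr = {BASE_LR * m:.2e}")
print(f"total steps: {total_steps}, mean multiplier: {mean_multiplier:.2f}")
```

Under these assumptions only one step runs at the full learning rate, and the average step is taken at about half of it. Lowering `warmup_steps` or training for more epochs may therefore be worth trying in such a short run.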

#### a.6 Model Evaluation

In [14]:
# Prepare validation data for evaluation
val_encodings = tokenizer(
    df_validation["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt",
)

val_dataset = PatentDataset(val_encodings, df_validation["label"].tolist())

# Evaluate on validation set
print("Evaluating model on validation set...")
eval_results = trainer.evaluate(val_dataset)

print("\nValidation Results:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

Evaluating model on validation set...



Validation Results:
eval_loss: 1.9537
eval_accuracy: 0.2380
eval_f1: 0.1513
eval_precision: 0.2685
eval_recall: 0.2380
eval_runtime: 226.3325
eval_samples_per_second: 22.0910
eval_steps_per_second: 2.7610
epoch: 3.0000




#### a.7 Performance Assessment

In [None]:
# Get predictions for analysis
predictions = trainer.predict(val_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
true_labels = df_validation["label"].tolist()

In [None]:
# Calculate per-class metrics
print("\nDetailed Classification Report:")
print(classification_report(true_labels, predicted_labels, target_names=list(LABELS.values()), zero_division=0))


Detailed Classification Report:
                                                              precision    recall  f1-score   support

                                           Human Necessities       0.41      0.05      0.08       703
                         Performing Operations; Transporting       0.50      0.00      0.01       705
                                       Chemistry; Metallurgy       0.52      0.28      0.36       421
                                             Textiles; Paper       0.00      0.00      0.00        40
                                         Fixed Constructions       0.00      0.00      0.00       146
Mechanical Engineering; Lighting; Heating; Weapons; Blasting       0.00      0.00      0.00       347
                                                     Physics       0.23      0.12      0.16      1092
                                                 Electricity       0.22      0.86      0.35      1049
        General tagging of new or cross-sectiona



In [25]:
print(f"\nConfusion Matrix:")
cm = confusion_matrix(true_labels, predicted_labels)
cm


Confusion Matrix:


array([[ 33,   1,  26,   0,   0,   0,  73, 570,   0],
       [  1,   2,  21,   0,   0,   0,  79, 602,   0],
       [  7,   0, 118,   0,   0,   0,  50, 246,   0],
       [  1,   0,   0,   0,   0,   0,   4,  35,   0],
       [  0,   0,   0,   0,   0,   0,   9, 137,   0],
       [  0,   0,   1,   0,   0,   0,  30, 316,   0],
       [ 17,   0,  25,   0,   0,   0, 131, 919,   0],
       [ 13,   1,  23,   0,   0,   0, 106, 906,   0],
       [  8,   0,  13,   0,   0,   0,  89, 387,   0]])
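The confusion matrix can be summarised by two quantities per class: recall (the diagonal entry divided by its row sum) and the share of all predictions that land in that class (the column sum divided by the total). The sketch below computes both in plain Python from the matrix printed above; the variable names are illustrative.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [33, 1, 26, 0, 0, 0, 73, 570, 0],
    [1, 2, 21, 0, 0, 0, 79, 602, 0],
    [7, 0, 118, 0, 0, 0, 50, 246, 0],
    [1, 0, 0, 0, 0, 0, 4, 35, 0],
    [0, 0, 0, 0, 0, 0, 9, 137, 0],
    [0, 0, 1, 0, 0, 0, 30, 316, 0],
    [17, 0, 25, 0, 0, 0, 131, 919, 0],
    [13, 1, 23, 0, 0, 0, 106, 906, 0],
    [8, 0, 13, 0, 0, 0, 89, 387, 0],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
predicted_share = [sum(row[j] for row in cm) / total for j in range(n_classes)]

print(f"accuracy: {accuracy:.4f}")
for k, (r, share) in enumerate(zip(recall, predicted_share)):
    print(f"class {k}: recall = {r:.2f}, share of predictions = {share:.2%}")
```

Roughly 82% of the validation examples are assigned to class 7 (Electricity) and four classes are never predicted, which is consistent with the weighted F1 sitting well below the accuracy.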

In [16]:
# Calculate baseline comparison (random classifier)
n_classes = len(LABELS)
random_accuracy = 1.0 / n_classes
print(f"\nRandom Baseline Accuracy: {random_accuracy:.4f}")
print(f"Model Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Improvement over random: {eval_results['eval_accuracy'] - random_accuracy:.4f}")

# Performance summary
print(f"\n=== BERT Model with 32 Labeled Examples - Performance Summary ===")
print(f"Training samples: 32")
print(f"Validation accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Validation F1-score: {eval_results['eval_f1']:.4f}")
print(f"Validation precision: {eval_results['eval_precision']:.4f}")
print(f"Validation recall: {eval_results['eval_recall']:.4f}")


Random Baseline Accuracy: 0.1111
Model Accuracy: 0.2380
Improvement over random: 0.1269

=== BERT Model with 32 Labeled Examples - Performance Summary ===
Training samples: 32
Validation accuracy: 0.2380
Validation F1-score: 0.1513
Validation precision: 0.2685
Validation recall: 0.2380
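Because the validation classes are imbalanced, a uniform random guess (1/9) is a weak reference point. A majority-class baseline, which always predicts the most frequent class, is a stricter one. The sketch below computes it from the per-class supports: the first eight values come from the classification report, and the ninth (497) is inferred from the last row of the confusion matrix because that report line is truncated.

```python
# Validation support per class (list index = label id).
supports = [703, 705, 421, 40, 146, 347, 1092, 1049, 497]
model_accuracy = 0.2380  # eval_accuracy reported above

total = sum(supports)
majority_class = max(range(len(supports)), key=supports.__getitem__)
majority_accuracy = supports[majority_class] / total
uniform_accuracy = 1 / len(supports)

print(f"Uniform random baseline:  {uniform_accuracy:.4f}")
print(f"Majority-class baseline:  {majority_accuracy:.4f} (class {majority_class})")
print(f"Model minus majority:     {model_accuracy - majority_accuracy:+.4f}")
```

On these numbers the 32-example model sits about 2 percentage points above the majority-class baseline, a much smaller margin than the 12.7 points over uniform guessing.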


#### Interpretation of Results


...

In [17]:
# Save baseline results for later comparison
baseline_results = {
    "accuracy": eval_results["eval_accuracy"],
    "f1": eval_results["eval_f1"],
    "precision": eval_results["eval_precision"],
    "recall": eval_results["eval_recall"],
}


### b. Dataset Augmentation (1 point):
Experiment with an automated technique to increase your dataset size without using LLMs (ChatGPT, Mistral, Gemini, etc.). Evaluate the impact on model performance.

### b.1 Back-Translation Setup with MarianMT

In [18]:
import time

from transformers import MarianMTModel, MarianTokenizer

# Load MarianMT models for back-translation (English <-> Spanish)
print("Loading MarianMT models for back-translation...")

# English to Spanish model
en_es_model_name = "Helsinki-NLP/opus-mt-en-es"
en_es_tokenizer = MarianTokenizer.from_pretrained(en_es_model_name)
en_es_model = MarianMTModel.from_pretrained(en_es_model_name)

# Spanish to English model
es_en_model_name = "Helsinki-NLP/opus-mt-es-en"
es_en_tokenizer = MarianTokenizer.from_pretrained(es_en_model_name)
es_en_model = MarianMTModel.from_pretrained(es_en_model_name)

print("MarianMT models loaded successfully!")


Loading MarianMT models for back-translation...


MarianMT models loaded successfully!


### b.2 Back-Translation Functions

In [19]:
def translate_to_spanish(texts, batch_size=8):
    """Translate English texts to Spanish"""
    translated_texts = []

    # Calculate total number of batches for progress bar
    total_batches = (len(texts) + batch_size - 1) // batch_size

    for i in tqdm(
        range(0, len(texts), batch_size),
        desc="Translating EN→ES",
        total=total_batches,
        unit="batch",
    ):
        batch = texts[i : i + batch_size]

        # Tokenize and translate
        inputs = en_es_tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True, max_length=512
        )
        translated = en_es_model.generate(
            **inputs, max_length=512, num_beams=4, early_stopping=True
        )

        # Decode translations
        batch_translations = en_es_tokenizer.batch_decode(
            translated, skip_special_tokens=True
        )
        translated_texts.extend(batch_translations)

    return translated_texts


def translate_to_english(texts, batch_size=8):
    """Translate Spanish texts back to English"""
    translated_texts = []

    # Calculate total number of batches for progress bar
    total_batches = (len(texts) + batch_size - 1) // batch_size

    for i in tqdm(
        range(0, len(texts), batch_size),
        desc="Translating ES→EN",
        total=total_batches,
        unit="batch",
    ):
        batch = texts[i : i + batch_size]

        # Tokenize and translate
        inputs = es_en_tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True, max_length=512
        )
        translated = es_en_model.generate(
            **inputs, max_length=512, num_beams=4, early_stopping=True
        )

        # Decode translations
        batch_translations = es_en_tokenizer.batch_decode(
            translated, skip_special_tokens=True
        )
        translated_texts.extend(batch_translations)

    return translated_texts


def back_translate(texts, target_lang="es"):
    """Perform back-translation: en -> target_lang -> en"""
    print(f"Starting back-translation for {len(texts)} texts...")
    start_time = time.time()

    # Step 1: Translate to target language
    if target_lang == "es":
        intermediate = translate_to_spanish(texts)
        # Step 2: Translate back to English
        back_translated = translate_to_english(intermediate)
    else:
        raise ValueError(f"Language {target_lang} not supported. Use 'es' for Spanish.")

    end_time = time.time()
    print(f"Back-translation completed in {end_time - start_time:.2f} seconds")

    return back_translated

### b.3 Generate Augmented Dataset

In [20]:
# Extract original texts and labels from 32-sample training set
original_texts = df_train32["text"].tolist()
original_labels = df_train32["label"].tolist()

print(f"Original training set size: {len(original_texts)}")

# Perform back-translation to create augmented samples
print("\nGenerating augmented data through back-translation...")
augmented_texts = back_translate(original_texts, target_lang="es")

# Quality check: show examples of original vs augmented texts
print("\n=== Back-Translation Examples ===")
for i in range(min(3, len(original_texts))):
    print(f"\nExample {i + 1}:")
    print(f"Original:  {original_texts[i][:200]}...")
    print(f"Augmented: {augmented_texts[i][:200]}...")

# Filter augmented texts (remove identical ones)
filtered_augmented = []
filtered_labels = []

for orig, aug, label in zip(original_texts, augmented_texts, original_labels):
    # Only keep augmented text if it's different from original
    if orig.strip().lower() != aug.strip().lower():
        filtered_augmented.append(aug)
        filtered_labels.append(label)

print(
    f"\nValid augmented samples: {len(filtered_augmented)} out of {len(augmented_texts)}"
)

# Combine original and augmented data
combined_texts = original_texts + filtered_augmented
combined_labels = original_labels + filtered_labels

print(f"Total training samples after augmentation: {len(combined_texts)}")
print(f"Data expansion factor: {len(combined_texts) / len(original_texts):.2f}x")


Original training set size: 32

Generating augmented data through back-translation...
Starting back-translation for 32 texts...


Back-translation completed in 318.24 seconds

=== Back-Translation Examples ===

Example 1:
Original:  reference will now be made in detail to the present preferred embodiments of the invention , examples of which are illustrated in the accompanying drawings . wherever possible , the same reference num...
Augmented: the reference shall now be made in detail to the current preferred incarnations of the invention , examples of which are illustrated in the attached drawings . Wherever possible , the same reference n...

Example 2:
Original:  all terms as used herein in this specification , unless otherwise stated , shall be understood in their ordinary meaning as known in the art . other more specific definitions are as follows : the term...
Augmented: all the terms used in this specification , unless otherwise stated , shall be understood in their ordinary sense as being known in art . Other more specific definitions are the following : the term “ ...

Example 3:
Original:  in the follow
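The filter above only discards back-translations that are identical to the source after lowercasing. To get a rough sense of how different the kept samples are, one option is a token-level Jaccard overlap between each original and its back-translation. This is a sketch rather than part of the original pipeline: `jaccard_overlap` is a hypothetical helper, and the two strings are short excerpts from Example 1 above.

```python
def jaccard_overlap(text_a, text_b):
    """Share of distinct lowercase whitespace tokens common to both texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Excerpts from Example 1 (original vs. back-translated).
original = (
    "reference will now be made in detail to the present preferred "
    "embodiments of the invention"
)
augmented = (
    "the reference shall now be made in detail to the current preferred "
    "incarnations of the invention"
)

overlap = jaccard_overlap(original, augmented)
print(f"Jaccard overlap: {overlap:.2f}")
```

Values close to 1 point to near-duplicates that add little variety, while very low values may signal a translation that drifted from the source (here, for example, "embodiments" became "incarnations").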

### b.4 Train Model on Augmented Dataset

In [21]:
# Prepare augmented dataset for training
augmented_encodings = tokenizer(
    combined_texts, truncation=True, padding=True, max_length=512, return_tensors="pt"
)

augmented_train_dataset = PatentDataset(augmented_encodings, combined_labels)

# Setup training for augmented model
augmented_model = AutoModelForSequenceClassification.from_pretrained(
    "anferico/bert-for-patents", num_labels=9
)

augmented_training_args = TrainingArguments(
    output_dir="./results_augmented",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=20,  # Slightly more warmup for larger dataset
    weight_decay=0.01,
    logging_dir="./logs_augmented",
    logging_steps=10,
    save_steps=500,
    eval_strategy="no",
    seed=RANDOM_SEED,
    report_to=[],  # Disable external reporting integrations (e.g. wandb)
)

# Train augmented model
augmented_trainer = Trainer(
    model=augmented_model,
    args=augmented_training_args,
    train_dataset=augmented_train_dataset,
    compute_metrics=compute_metrics,
)

print("Training BERT model with back-translation augmented data...")
augmented_trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at anferico/bert-for-patents and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training BERT model with back-translation augmented data...


Step,Training Loss
10,2.2092
20,1.6391


TrainOutput(global_step=24, training_loss=1.7903659343719482, metrics={'train_runtime': 41.0908, 'train_samples_per_second': 4.673, 'train_steps_per_second': 0.584, 'total_flos': 178935053746176.0, 'train_loss': 1.7903659343719482, 'epoch': 3.0})

### b.5 Evaluate Augmented Model

In [22]:
# Evaluate augmented model on validation set
print("Evaluating augmented model on validation set...")
augmented_eval_results = augmented_trainer.evaluate(val_dataset)

print("\nAugmented Model Validation Results:")
for key, value in augmented_eval_results.items():
    print(f"{key}: {value:.4f}")


Evaluating augmented model on validation set...



Augmented Model Validation Results:
eval_loss: 1.8474
eval_accuracy: 0.3206
eval_f1: 0.2797
eval_precision: 0.3746
eval_recall: 0.3206
eval_runtime: 230.4330
eval_samples_per_second: 21.6980
eval_steps_per_second: 2.7120
epoch: 3.0000




### b.6 Compare Results: Baseline vs Augmented

In [None]:
# Performance comparison
print("\n" + "=" * 60)
print("DATASET AUGMENTATION IMPACT ANALYSIS")
print("=" * 60)

print("\n📊 TRAINING DATA COMPARISON:")
print("  Baseline (32 samples):     32 samples")
print(f"  Augmented dataset:         {len(combined_texts)} samples")
print(f"  Expansion factor:          {len(combined_texts) / 32:.2f}x")

print("\n📈 PERFORMANCE COMPARISON:")
metrics = ["accuracy", "f1", "precision", "recall"]
for metric in metrics:
    baseline_val = baseline_results[metric]
    augmented_val = augmented_eval_results[f"eval_{metric}"]
    improvement = augmented_val - baseline_val
    improvement_pct = (improvement / baseline_val) * 100

    print(f"  {metric.upper()}:")
    print(f"    Baseline:     {baseline_val:.4f}")
    print(f"    Augmented:    {augmented_val:.4f}")
    print(f"    Improvement:  {improvement:+.4f} ({improvement_pct:+.1f}%)")
    print()

# Rule-of-thumb check on the size of the accuracy gain
# (a fixed threshold, not a statistical significance test)
accuracy_improvement = (
    augmented_eval_results["eval_accuracy"] - baseline_results["accuracy"]
)
print("🎯 KEY FINDINGS:")
print(f"  • Accuracy improvement: {accuracy_improvement:+.4f}")
if accuracy_improvement > 0.01:  # threshold of 1 percentage point
    print("  • Result: NOTABLE improvement with back-translation")
elif accuracy_improvement > 0:
    print("  • Result: MARGINAL improvement with back-translation")
else:
    print("  • Result: NO improvement with back-translation")

print("\n✅ AUGMENTATION TECHNIQUE ASSESSMENT:")
print("  • Back-translation with MarianMT (en→es→en)")
print(f"  • Generated {len(filtered_augmented)} valid augmented samples")
print(
    f"  • Quality: {len(filtered_augmented) / len(original_texts) * 100:.1f}% of attempts were unique"
)
print(f"  • Impact: {accuracy_improvement:+.4f} accuracy change")
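The threshold above compares two point estimates and does not account for sampling noise. Since both models are scored on the same validation examples, one common option is McNemar's exact test on the examples where exactly one model is correct. The sketch below implements it with the standard library; `mcnemar_exact` is an illustrative helper, and the toy labels stand in for the true labels and the two models' predicted labels.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same examples.

    Returns (only_a_correct, only_b_correct, p_value).
    """
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree on correctness
    # Under the null hypothesis the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2**n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data (hypothetical): model A is right on 2 examples, model B on 8.
y_true = [0] * 10
pred_a = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
pred_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"only A correct: {only_a}, only B correct: {only_b}, p = {p_value:.4f}")
```

In this notebook the call would take the validation labels and the predicted labels of the baseline and augmented models, once both prediction arrays exist. A single training seed per model is a further source of variance that this test does not capture.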

In [None]:
# Detailed analysis per class
augmented_predictions = augmented_trainer.predict(val_dataset)
augmented_pred_labels = np.argmax(augmented_predictions.predictions, axis=1)

In [None]:
print("\n📋 DETAILED CLASSIFICATION REPORT (Augmented Model):")
print(classification_report(
      true_labels, augmented_pred_labels, target_names=list(LABELS.values()), zero_division=0
  ))


📋 DETAILED CLASSIFICATION REPORT (Augmented Model):
                                                              precision    recall  f1-score   support

                                           Human Necessities       0.28      0.30      0.29       703
                         Performing Operations; Transporting       0.35      0.23      0.27       705
                                       Chemistry; Metallurgy       0.56      0.44      0.49       421
                                             Textiles; Paper       0.00      0.00      0.00        40
                                         Fixed Constructions       0.00      0.00      0.00       146
Mechanical Engineering; Lighting; Heating; Weapons; Blasting       1.00      0.01      0.01       347
                                                     Physics       0.26      0.70      0.38      1092
                                                 Electricity       0.50      0.27      0.35      1049
        General tagging of n



### c. Zero-Shot Learning with LLM (0.5 points):
Apply an LLM (ChatGPT, Claude, Mistral, Gemini, ...) in a zero-shot learning setup. Document the performance.

In [22]:
# Define the system prompt for the model
system_prompt = f"""
System: You are an expert patent classifier. You will be given a patent text and you need to classify it into one of the following categories. /no_think

Task: Classify this patent text into one of the following 9 categories:
{LABELS}

Response format: {{"classification": 0, "confidence": 0.95, "reasoning": "..."}}
"""

print(system_prompt)


System: You are an expert patent classifier. You will be given a patent text and you need to classify it into one of the following categories. /no_think

Task: Classify this patent text into one of the following 9 categories:
{0: 'Human Necessities', 1: 'Performing Operations; Transporting', 2: 'Chemistry; Metallurgy', 3: 'Textiles; Paper', 4: 'Fixed Constructions', 5: 'Mechanical Engineering; Lighting; Heating; Weapons; Blasting', 6: 'Physics', 7: 'Electricity', 8: 'General tagging of new or cross-sectional technology'}

Response format: {"classification": 0, "confidence": 0.95, "reasoning": "..."}



In [23]:
# Here we use an open-source model (Qwen3 8B), a lightweight LLM for smaller devices, in combination with LM Studio and Pydantic

from pydantic import BaseModel
import lmstudio as lms

# Connect to the Qwen3 8B model served by LM Studio. For this to work, first run in a terminal:
#   (1) `lms load qwen3-8b --context-length 8096`
#   (2) `lms server start`
model = lms.llm("qwen/qwen3-8b")  # Load the Qwen3 8B model
print(lms.list_loaded_models()[0])  # Print model we are using


# Instead of describing the JSON structure by hand, we enforce a schema with Pydantic, which we consider the more Pythonic approach
# A class based schema for a classification response
class PatentClassifierSchema(BaseModel):
    classification: int
    confidence: float
    reasoning: str


# Here we define a method to repeatedly ask the model for a response
def respond(model, system_prompt, prompt, response_format):
    """
    Ask the model a question and get a response in the specified format.
    """

    # Create a fresh chat seeded with the system prompt, so every call starts without context from earlier requests
    chat = lms.Chat(system_prompt)
    # Build the chat context by adding messages of relevant types.
    chat.add_user_message(prompt)
    # Get the response from the model, specifying the response format.
    response = model.respond(chat, response_format=response_format)

    return response.parsed

LLM(identifier='qwen/qwen3-8b')


In [None]:
# Classify each of the 32 sampled training examples and store the results

llm_results = []

for text in tqdm(df_train32["text"].tolist(), desc="Classifying patent"):
    classification_result = respond(
        model=model,
        system_prompt=system_prompt,
        prompt=text,
        response_format=PatentClassifierSchema,
    )
    llm_results.append(classification_result)

# Unload model to free up memory
model.unload()


Classifying example 1/32:

Classifying example 2/32:

Classifying example 3/32:

Classifying example 4/32:
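The loop above collects the model's answers but does not score them. A minimal way to document the zero-shot performance is to compare each returned `classification` with the true label, counting out-of-range answers as errors. The sketch assumes each entry of `llm_results` behaves like a dict with a `classification` key; that shape, the helper `score_zero_shot`, and the toy data are assumptions for illustration.

```python
def score_zero_shot(results, true_labels, n_classes=9):
    """Return (accuracy, n_invalid) for LLM answers against the true labels."""
    if not true_labels or len(results) != len(true_labels):
        raise ValueError("results and true_labels must be non-empty and equally long")
    correct = 0
    invalid = 0
    for result, label in zip(results, true_labels):
        predicted = result.get("classification")
        if not isinstance(predicted, int) or not 0 <= predicted < n_classes:
            invalid += 1  # unparsable or out-of-range answers count as errors
            continue
        correct += int(predicted == label)
    return correct / len(true_labels), invalid


# Toy stand-ins (hypothetical) for the collected answers and the true labels.
toy_results = [
    {"classification": 7, "confidence": 0.9, "reasoning": "..."},
    {"classification": 2, "confidence": 0.6, "reasoning": "..."},
    {"classification": 12, "confidence": 0.5, "reasoning": "..."},
    {"classification": 0, "confidence": 0.8, "reasoning": "..."},
]
toy_labels = [7, 6, 3, 0]

accuracy, n_invalid = score_zero_shot(toy_results, toy_labels)
print(f"zero-shot accuracy: {accuracy:.2f} ({n_invalid} invalid answer(s))")
```

Because the loop classifies the same 32 examples used for fine-tuning, the resulting accuracy is not directly comparable with the validation metrics of parts (a) and (b); scoring a sample of the validation set instead would make the comparison cleaner.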


### d. Data Generation with LLM (1 point):
Use an LLM (ChatGPT, Claude, Mistral, Gemini, ...) to generate new, labeled dataset points. Train your BERT model on them together with the 32 labeled examples. Analyze how this impacts model metrics.

### e. Optimal Technique Application (0.5 points):
Based on the previous experiments, apply the most effective technique(s) to further improve your model's performance. Comment on your results and propose improvements.

## Part 3

## Part 4