# Text classifier training (1-6)
### Jakub Łubkowski, Marcin Mikuła

1. Use the FIQA-PL dataset that was used in lab 1 **and** lab 2 (so we need the passages, the questions and their
   relations).
2. Create a dataset of positive and negative sentence pairs.
   - In each pair the first element is a question and the second element is a passage, i.e. "{question} {separator} {passage}",
      where `separator` should be a separator taken from the model's tokenizer.
   - Use the relations to mark the positive pairs (i.e. pairs where the question is answered
      by the passage).
   - Use your own strategy to mark negative pairs (i.e. you can draw the negative examples, but there are
      better strategies to define the negative examples). The number of negative examples should be larger than the
      number of positive examples.
3. The dataset from point 2 should be split into training, evaluation and testing subsets.
4. Train a text classifier using the Transformers library that distinguishes between the positive and the negative
   pairs. To make the process manageable use models of size `base` and a runtime providing GPU/TPU acceleration.
   Consult the discussions related to fine-tuning Transformer models to select a sensible set of parameters.
   You can also run several trainings with different hyper-parameters, if you have access to large computing resources.
5. Make sure you monitor the relevant metrics on the validation set during training. The last saved model might not be the
   one with the best performance.
6. Report the results you have obtained for the model. Use appropriate measures, since the dataset is not balanced.


In [8]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("clarin-knext/fiqa-pl", "corpus")
corpus_df = pd.DataFrame(ds['corpus'])

q_data = load_dataset("clarin-knext/fiqa-pl", "queries")
q_df = pd.DataFrame(q_data['queries'])

qa_data = load_dataset("clarin-knext/fiqa-pl-qrels")['test']
qa_df = pd.DataFrame(qa_data)

q_df = q_df.rename(columns={'text': 'text_query'})

print(corpus_df.columns)
print(q_df.columns)
print(qa_df.columns)

Index(['_id', 'title', 'text'], dtype='object')
Index(['_id', 'title', 'text_query'], dtype='object')
Index(['query-id', 'corpus-id', 'score'], dtype='object')


In [9]:
qa_df['query-id'] = qa_df['query-id'].astype(str)
q_df['_id'] = q_df['_id'].astype(str)
qa_df['corpus-id'] = qa_df['corpus-id'].astype(str)
corpus_df['_id'] = corpus_df['_id'].astype(str)

In [10]:
positive_pairs = pd.merge(
    qa_df, 
    q_df[['_id', 'text_query']], 
    left_on='query-id', 
    right_on='_id'
)

In [11]:
positive_pairs = pd.merge(
    positive_pairs,
    corpus_df[['_id', 'text']],
    left_on='corpus-id',
    right_on='_id'
)

In [12]:
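# Negative pairs: for each positive pair, sample random passages from the whole corpus,
# excluding every passage that is marked as relevant to the same question.
# Note: this is plain random sampling, so most negatives are easy to tell apart;
# restricting the pool to passages relevant to other questions would give harder negatives.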
def create_negative_pairs(positive_pairs, corpus_df, n_negative_per_positive=2):
    negative_pairs = []
    
    for _, row in positive_pairs.iterrows():
        # Get passages that weren't paired with this question
        negative_passages = corpus_df[
            ~corpus_df['_id'].isin(
                positive_pairs[positive_pairs['query-id'] == row['query-id']]['corpus-id']
            )
        ]
        
        # Sample n random negative passages
        negative_samples = negative_passages.sample(
            n=min(n_negative_per_positive, len(negative_passages))
        )
        
        for _, neg_row in negative_samples.iterrows():
            negative_pairs.append({
                'query-id': row['query-id'],
                'corpus-id': neg_row['_id'],
                'text_query': row['text_query'],
                'text': neg_row['text'],
                'is_positive': 0
            })
    
    return pd.DataFrame(negative_pairs)

negative_pairs_df = create_negative_pairs(positive_pairs, corpus_df)
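
The task notes that random drawing is not the strongest way to pick negatives. One alternative, sketched below in plain Python, draws negatives only from passages that are relevant to some *other* question, so every negative is at least a plausible answer within the domain. The function name `sample_hard_negatives`, the `(query_id, corpus_id)` tuple format and the toy identifiers are illustrative assumptions, not part of the notebook.

```python
import random
from collections import defaultdict


def sample_hard_negatives(qrels, n_per_positive=2, seed=42):
    """Sample negatives from passages that answer a different question.

    qrels: list of (query_id, corpus_id) tuples describing positive pairs.
    Returns a list of (query_id, corpus_id) negative pairs.
    """
    rng = random.Random(seed)

    # Collect the set of relevant passages for every question.
    positives = defaultdict(set)
    for query_id, corpus_id in qrels:
        positives[query_id].add(corpus_id)

    # Candidate pool: passages that are relevant to at least one question.
    pool = sorted({corpus_id for _, corpus_id in qrels})

    negatives = []
    for query_id, relevant in positives.items():
        candidates = [cid for cid in pool if cid not in relevant]
        k = min(n_per_positive * len(relevant), len(candidates))
        for corpus_id in rng.sample(candidates, k):
            negatives.append((query_id, corpus_id))
    return negatives


# Toy relations (hypothetical ids, not taken from FIQA-PL)
toy_qrels = [("q1", "p1"), ("q1", "p2"), ("q2", "p3"),
             ("q3", "p4"), ("q3", "p5"), ("q4", "p6")]
toy_negatives = sample_hard_negatives(toy_qrels)
print(len(toy_negatives))  # 12
```

Sampling without replacement per question also avoids duplicate negative pairs, which the row-by-row sampling above can produce for questions with several positive passages.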

In [13]:
# Add positive label to positive pairs
positive_pairs['is_positive'] = 1

In [14]:
# Select only needed columns and combine positive and negative pairs
final_pairs = pd.concat([
    positive_pairs[['query-id', 'corpus-id', 'text_query', 'text', 'is_positive']],
    negative_pairs_df
], ignore_index=True)

In [15]:
# Let's add the formatted text pairs
# We'll use [SEP] as a placeholder - we'll replace it with the actual model separator later
final_pairs['text_pair'] = final_pairs['text_query'] + ' [SEP] ' + final_pairs['text']
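
An alternative to concatenating the texts around a placeholder is to pass the question and the passage to the tokenizer as two separate arguments. Hugging Face tokenizers then insert the model's own separator and, for BERT-style models, assign segment ids to the two parts. A minimal sketch, assuming a tokenizer object loaded elsewhere (for example with `AutoTokenizer.from_pretrained`); the helper name `encode_pairs` is hypothetical:

```python
def encode_pairs(tokenizer, questions, passages, max_length=512):
    """Encode (question, passage) pairs using the tokenizer's own separator.

    `tokenizer` is assumed to be a Hugging Face tokenizer instance. Passing the
    texts as two arguments lets it place the special tokens itself.
    """
    return tokenizer(
        list(questions),
        list(passages),
        truncation="only_second",  # shorten the passage, keep the question intact
        max_length=max_length,
    )
```

Once a tokenizer is available, this could be called as `encode_pairs(tokenizer, final_pairs['text_query'], final_pairs['text'])`.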

In [16]:
# Show some statistics
print(f"Total pairs: {len(final_pairs)}")
print(f"Positive pairs: {len(final_pairs[final_pairs['is_positive'] == 1])}")
print(f"Negative pairs: {len(final_pairs[final_pairs['is_positive'] == 0])}")

Total pairs: 5118
Positive pairs: 1706
Negative pairs: 3412


In [17]:
final_pairs.head()

| | query-id | corpus-id | text_query | text | is_positive | text_pair |
|---|---|---|---|---|---|---|
| 0 | 8 | 566392 | Jak zdeponować czek wystawiony na współpracown... | Poproś o ponowne wystawienie czeku właściwemu ... | 1 | Jak zdeponować czek wystawiony na współpracown... |
| 1 | 8 | 65404 | Jak zdeponować czek wystawiony na współpracown... | Po prostu poproś współpracownika o podpisanie ... | 1 | Jak zdeponować czek wystawiony na współpracown... |
| 2 | 15 | 325273 | Czy mogę wysłać przekaz pieniężny z USPS jako ... | Oczywiście że możesz. W sekcji Od przekazu pie... | 1 | Czy mogę wysłać przekaz pieniężny z USPS jako ... |
| 3 | 18 | 88124 | 1 EIN prowadzący działalność pod wieloma nazwa... | Mylisz tutaj wiele rzeczy. Spółka B LLC będzie... | 1 | 1 EIN prowadzący działalność pod wieloma nazwa... |
| 4 | 26 | 285255 | Ubieganie się o kredyt biznesowy i otrzymywani... | „Obawiam się, że wielkim mitem spółek z ograni... | 1 | Ubieganie się o kredyt biznesowy i otrzymywani... |


In [18]:
from sklearn.model_selection import train_test_split

# First split: 80% train+val, 20% test
train_val_df, test_df = train_test_split(
    final_pairs,
    test_size=0.2,
    random_state=42,
    stratify=final_pairs['is_positive']  # Maintain the same positive/negative ratio
)

# Second split: Split remaining 80% into 80% train, 20% validation (64% and 16% of total)
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.2,
    random_state=42,
    stratify=train_val_df['is_positive']
)
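
Because each question appears in several pairs, a row-level split can place the same question in both the training and the test subset. If that matters for the evaluation, the split can be made per question instead. The sketch below is a plain-Python illustration with a hypothetical helper `split_by_query` and made-up toy rows; scikit-learn's `GroupShuffleSplit` offers the same idea for arrays and DataFrames.

```python
import random


def split_by_query(pairs, test_fraction=0.2, seed=42):
    """Split pair records so that no question id occurs on both sides.

    pairs: list of dicts that contain at least a 'query-id' key.
    """
    query_ids = sorted({p["query-id"] for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(query_ids)

    n_test = max(1, round(test_fraction * len(query_ids)))
    test_ids = set(query_ids[:n_test])

    train = [p for p in pairs if p["query-id"] not in test_ids]
    test = [p for p in pairs if p["query-id"] in test_ids]
    return train, test


# Toy data: 10 questions with 3 pairs each (1 positive, 2 negative)
toy_pairs = [
    {"query-id": str(q), "corpus-id": f"{q}-{i}", "is_positive": int(i == 0)}
    for q in range(10)
    for i in range(3)
]
toy_train, toy_test = split_by_query(toy_pairs)
print(len(toy_train), len(toy_test))  # 24 6
```

A per-question split keeps the class ratio only approximately, so the label distribution of each subset should still be checked afterwards.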

In [19]:
# Print statistics for each split
def print_split_stats(split_df, name):
    total = len(split_df)
    positives = len(split_df[split_df['is_positive'] == 1])
    negatives = len(split_df[split_df['is_positive'] == 0])
    print(f"\n{name} split statistics:")
    print(f"Total samples: {total}")
    print(f"Positive samples: {positives} ({positives/total*100:.1f}%)")
    print(f"Negative samples: {negatives} ({negatives/total*100:.1f}%)")

print_split_stats(train_df, "Training")
print_split_stats(val_df, "Validation")
print_split_stats(test_df, "Test")


Training split statistics:
Total samples: 3275
Positive samples: 1092 (33.3%)
Negative samples: 2183 (66.7%)

Validation split statistics:
Total samples: 819
Positive samples: 273 (33.3%)
Negative samples: 546 (66.7%)

Test split statistics:
Total samples: 1024
Positive samples: 341 (33.3%)
Negative samples: 683 (66.7%)
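
The subset sizes follow directly from the two chained 80/20 splits. A short check, assuming that `train_test_split` rounds a fractional `test_size` up to the next whole sample; `chained_split_sizes` is a helper written only for this illustration:

```python
import math


def chained_split_sizes(n_total, test_size=0.2, val_size=0.2):
    """Sizes of train/validation/test after two consecutive splits."""
    n_test = math.ceil(test_size * n_total)      # first split: hold out the test set
    n_train_val = n_total - n_test
    n_val = math.ceil(val_size * n_train_val)    # second split: hold out validation
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


sizes = chained_split_sizes(5118)
print(sizes)  # (3275, 819, 1024)
```

This corresponds to roughly 64%, 16% and 20% of the 5118 pairs.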


In [20]:
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer
)
from datasets import Dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [21]:
# Let's use a Polish BERT model
model_name = "dkleczek/bert-base-polish-uncased-v1"

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [22]:
def prepare_dataset(df):
    # Convert to huggingface dataset
    dataset = Dataset.from_pandas(df)
    
    # Tokenize function
    def tokenize(batch):
        # Replace placeholder [SEP] with the model's actual separator
        texts = [
            text.replace("[SEP]", tokenizer.sep_token) 
            for text in batch['text_pair']
        ]
        
        return tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
    
    # Tokenize the dataset
    dataset = dataset.map(
        tokenize, 
        batched=True, 
        remove_columns=dataset.column_names
    )
    
    # Add labels
    dataset = dataset.add_column("labels", df['is_positive'].values)
    
    return dataset
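
`padding=True` inside `map` pads every example to the longest sequence of its mapping batch, which for long passages can approach `max_length`. Padding per training batch instead (which is what `DataCollatorWithPadding` from Transformers does when it is passed to `Trainer` as `data_collator`) usually means fewer padding tokens are processed. The toy calculation below only illustrates the difference; the token lengths are made up.

```python
def padded_tokens(lengths, batch_size):
    """Total tokens processed when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight question-passage pairs
lengths = [40, 512, 60, 75, 120, 90, 300, 55]

static = padded_tokens(lengths, batch_size=len(lengths))  # one large padding group
dynamic = padded_tokens(lengths, batch_size=4)            # padded per batch of 4
print(static, dynamic)  # 4096 3248
```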

In [23]:
train_dataset = prepare_dataset(train_df)
val_dataset = prepare_dataset(val_df)
test_dataset = prepare_dataset(test_df)

Map: 100%|██████████| 3275/3275 [00:00<00:00, 3917.59 examples/s]
Map: 100%|██████████| 819/819 [00:00<00:00, 3541.26 examples/s]
Map: 100%|██████████| 1024/1024 [00:00<00:00, 4013.72 examples/s]


In [24]:
# Define metrics computation
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, 
        preds, 
        average='binary'
    )
    acc = accuracy_score(labels, preds)
    
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
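
Accuracy alone is misleading here because two thirds of the pairs are negative. As a reference point, the sketch below computes the same four metrics in plain Python for a trivial classifier that always predicts the negative class, using the test-split class counts reported above (341 positive, 683 negative). The helper `binary_metrics` is written only for this illustration.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


labels = [1] * 341 + [0] * 683
always_negative = [0] * len(labels)

baseline = binary_metrics(labels, always_negative)
print(round(baseline["accuracy"], 3), baseline["f1"])  # 0.667 0.0
```

A model therefore has to be judged mainly by F1, precision and recall on the positive class; an accuracy near 0.67 can be reached without finding a single relevant passage.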

In [25]:
# Initialize model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dkleczek/bert-base-polish-uncased-v1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    # Enable fp16 if you have GPU
    fp16=torch.cuda.is_available(),
)
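
With these settings the 500 warm-up steps cover most of the run, and `eval_steps=500` leaves room for only one evaluation during training. The sketch below reproduces the arithmetic, assuming the default learning rate of 5e-5, the default linear scheduler, a single device and no gradient accumulation; `linear_schedule_lr` is an illustrative helper, not a Transformers function.

```python
import math

TRAIN_SAMPLES = 3275
BATCH_SIZE = 16
EPOCHS = 3
WARMUP_STEPS = 500
EVAL_STEPS = 500

# Optimisation steps: 205 batches per epoch over 3 epochs
total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE) * EPOCHS
n_evaluations = total_steps // EVAL_STEPS


def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=WARMUP_STEPS, total=total_steps):
    """Linear warm-up to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total - step) / (total - warmup_steps))


print(total_steps, n_evaluations)  # 615 1
for step in (100, 500, 600):
    print(step, linear_schedule_lr(step))
```

The computed values agree with the `learning_rate` entries in the training log (1e-05 at step 100, 5e-05 at step 500, about 6.52e-06 at step 600). A shorter warm-up and more frequent evaluation, for example once per epoch, would give `load_best_model_at_end` more than one checkpoint to choose from.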



In [26]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

 16%|█▋        | 100/615 [02:00<09:28,  1.10s/it]

{'loss': 0.6158, 'grad_norm': 8.54408073425293, 'learning_rate': 1e-05, 'epoch': 0.49}


 33%|███▎      | 200/615 [03:51<07:39,  1.11s/it]

{'loss': 0.4638, 'grad_norm': 7.661561012268066, 'learning_rate': 2e-05, 'epoch': 0.98}


 49%|████▉     | 300/615 [11:02<20:31,  3.91s/it]

{'loss': 0.3419, 'grad_norm': 3.0003573894500732, 'learning_rate': 3e-05, 'epoch': 1.46}


 65%|██████▌   | 400/615 [17:53<14:59,  4.18s/it]

{'loss': 0.3447, 'grad_norm': 3.99796986579895, 'learning_rate': 4e-05, 'epoch': 1.95}


 81%|████████▏ | 500/615 [24:44<09:29,  4.95s/it]

{'loss': 0.227, 'grad_norm': 5.7888312339782715, 'learning_rate': 5e-05, 'epoch': 2.44}


                                                 

{'eval_loss': 0.4238881766796112, 'eval_accuracy': 0.8791208791208791, 'eval_f1': 0.8241563055062167, 'eval_precision': 0.8, 'eval_recall': 0.8498168498168498, 'eval_runtime': 20.0137, 'eval_samples_per_second': 40.922, 'eval_steps_per_second': 2.598, 'epoch': 2.44}


 98%|█████████▊| 600/615 [33:12<01:18,  5.24s/it]

{'loss': 0.2375, 'grad_norm': 3.6855883598327637, 'learning_rate': 6.521739130434783e-06, 'epoch': 2.93}


100%|██████████| 615/615 [34:36<00:00,  3.38s/it]

{'train_runtime': 2076.9372, 'train_samples_per_second': 4.731, 'train_steps_per_second': 0.296, 'train_loss': 0.36545832273436757, 'epoch': 3.0}





TrainOutput(global_step=615, training_loss=0.36545832273436757, metrics={'train_runtime': 2076.9372, 'train_samples_per_second': 4.731, 'train_steps_per_second': 0.296, 'total_flos': 2585066118912000.0, 'train_loss': 0.36545832273436757, 'epoch': 3.0})

In [27]:
# Evaluate on test set
test_results = trainer.evaluate(test_dataset)
print("\nTest set results:", test_results)

100%|██████████| 64/64 [00:23<00:00,  2.69it/s]


Test set results: {'eval_loss': 0.4298664927482605, 'eval_accuracy': 0.8701171875, 'eval_f1': 0.8113475177304964, 'eval_precision': 0.7857142857142857, 'eval_recall': 0.8387096774193549, 'eval_runtime': 24.6396, 'eval_samples_per_second': 41.559, 'eval_steps_per_second': 2.597, 'epoch': 3.0}





```json
{
   "eval_loss":0.4298664927482605,
   "eval_accuracy":0.8701171875,
   "eval_f1":0.8113475177304964,
   "eval_precision":0.7857142857142857,
   "eval_recall":0.8387096774193549,
   "eval_runtime":24.6396,
   "eval_samples_per_second":41.559,
   "eval_steps_per_second":2.597,
   "epoch":3.0
}
```
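
The reported precision and recall, together with the 341 positive and 683 negative test pairs, imply a specific confusion matrix. The sketch below derives those counts (they are inferred from the metrics, not read from a separate log) and checks that they reproduce the reported accuracy and F1.

```python
POSITIVES, NEGATIVES = 341, 683
PRECISION = 0.7857142857142857
RECALL = 0.8387096774193549

tp = round(RECALL * POSITIVES)         # relevant pairs that were found
fn = POSITIVES - tp                    # relevant pairs that were missed
predicted_pos = round(tp / PRECISION)  # all pairs predicted as positive
fp = predicted_pos - tp                # irrelevant pairs predicted as positive
tn = NEGATIVES - fp

accuracy = (tp + tn) / (POSITIVES + NEGATIVES)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn)  # 286 78 55 605
print(accuracy)        # 0.8701171875
```

In other words, 78 of the 364 pairs predicted as positive are false positives, and 55 of the 341 relevant pairs are missed.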

In [28]:
model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')