# Text pair classification

In [1]:
from pathlib import Path

OUTPUT_DIR = Path("./output")

### Tasks 1, 2 - Load and prepare data

Load the datasets

In [2]:
from datasets import load_dataset
from datasets import load_dataset

fiqa_corpus = load_dataset("clarin-knext/fiqa-pl", "corpus")["corpus"]
fiqa_queries = load_dataset("clarin-knext/fiqa-pl", "queries")["queries"]
fiqa_qa = load_dataset("clarin-knext/fiqa-pl-qrels")["train"]

Create a map query ID -> set of matching corpus IDs

In [3]:
qc_map = {}
for q_id, c_id in zip(fiqa_qa["query-id"], fiqa_qa["corpus-id"]):
    if q_id not in qc_map:
        qc_map[q_id] = set()
    qc_map[q_id].add(c_id)

Create a map of ID -> text for easier processing

In [4]:
corpus_map = {int(idx): q for idx, q in zip(fiqa_corpus["_id"], fiqa_corpus["text"])}
query_map = {int(idx): q for idx, q in zip(fiqa_queries["_id"], fiqa_queries["text"])}

---

Create positive pairs

In [5]:
NUM_POSITIVES = 400
NEG_POS_RATIO = 2
NUM_NEGATIVES = int(NUM_POSITIVES * NEG_POS_RATIO)

In [6]:
positive_queries = list(qc_map.keys())[:NUM_POSITIVES]
positive_answers = [list(qc_map[q_id])[0] for q_id in positive_queries]

data_pos = list(zip(positive_queries, positive_answers))

In [7]:
print(query_map[data_pos[0][0]], end="\n\n")
print(corpus_map[data_pos[0][1]])

Co jest uważane za wydatek służbowy w podróży służbowej?

Wytyczne IRS dotyczące tematu. Ogólnie rzecz biorąc, najlepsze, co mogę powiedzieć, to to, że Twój wydatek biznesowy może podlegać odliczeniu. Ale to zależy od okoliczności i tego, co chcesz odliczyć. Podróże Podatnicy, którzy wyjeżdżają z domu w celach służbowych, mogą odliczyć związane z tym wydatki, w tym koszty dotarcia do miejsca docelowego, koszty zakwaterowania i wyżywienia oraz inne zwykłe i niezbędne wydatki. Podatnicy są uważani za „wyjeżdżających poza dom”, jeśli ich obowiązki wymagają od nich przebywania poza domem znacznie dłużej niż zwykły dzień pracy i muszą spać lub odpoczywać, aby sprostać wymogom pracy. Można odliczyć rzeczywisty koszt posiłków i nieprzewidziane wydatki lub skorzystać ze standardowej diety żywieniowej i obniżonych wymogów ewidencji. Niezależnie od zastosowanej metody odliczenia posiłków są zazwyczaj ograniczone do 50 procent, jak wspomniano wcześniej. Jako koszt można zgłaszać tylko rzeczywiste

---

Create negative pairs

In [8]:
from random import sample
from random import seed

seed(42)

# don't use the same queries as in positive pairs
negative_queries = list(qc_map.keys() - positive_queries)[:NUM_NEGATIVES]
negative_answers = []
for q_id in negative_queries:
    # sample one answer not present in "correct answers" or in positive answers defined above
    potential_negative_answers = sorted(
        corpus_map.keys() - (qc_map[q_id] | set(positive_answers))
    )
    negative_answers.append(sample(potential_negative_answers, 1)[0])

data_neg = list(zip(negative_queries, negative_answers))

In [9]:
print(query_map[data_neg[0][0]], end="\n\n")
print(corpus_map[data_neg[0][1]])

Precedens i modele 100% kapitału dostępne w ramach pierwszej oferty?

W przypadku eToro, tak jak w przypadku każdej innej firmy maklerskiej, możesz stracić cały swój kapitał. Proponuję zainwestować w jeden lub więcej funduszy giełdowych, które śledzą główne indeksy. Jeśli nie, po prostu włóż pieniądze na stałe konta depozytowe; zyskaj trochę zainteresowania i załóż najpierw fundusz awaryjny, zanim zainwestujesz pieniądze, które czujesz, że jesteś w stanie stracić.


---

### Task 3 - Split dataset into train, valid, test

In [10]:
from sklearn.model_selection import train_test_split

train_pos, test_pos = train_test_split(data_pos, train_size=0.7, random_state=42)
valid_pos, test_pos = train_test_split(test_pos, train_size=0.5, random_state=42)

train_neg, test_neg = train_test_split(data_neg, train_size=0.7, random_state=42)
valid_neg, test_neg = train_test_split(test_neg, train_size=0.5, random_state=42)

---

Convert to dataset

In [11]:
def convert(q, a, sep):
    return query_map[q] + f" {sep} " + corpus_map[a]

In [12]:
train_pos_text = [{"text": convert(q, a, "[SEP]"), "labels": 1} for q, a in train_pos]
train_neg_text = [{"text": convert(q, a, "[SEP]"), "labels": 0} for q, a in train_neg]

valid_pos_text = [{"text": convert(q, a, "[SEP]"), "labels": 1} for q, a in valid_pos]
valid_neg_text = [{"text": convert(q, a, "[SEP]"), "labels": 0} for q, a in valid_neg]

test_pos_text = [{"text": convert(q, a, "[SEP]"), "labels": 1} for q, a in test_pos]
test_neg_text = [{"text": convert(q, a, "[SEP]"), "labels": 0} for q, a in test_neg]

In [13]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_list(train_pos_text + train_neg_text)
valid_dataset = Dataset.from_list(valid_pos_text + valid_neg_text)
test_dataset = Dataset.from_list(test_pos_text + test_neg_text)

dataset = DatasetDict(
    {
        "train": train_dataset,
        "validation": valid_dataset,
        "test": test_dataset,
    }
)

---

### Task 4 - Train a model

Preprocess the dataset

In [14]:
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

'cuda'

In [15]:
torch.cuda.get_device_name(0)

'NVIDIA GeForce RTX 3090'

In [16]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def preprocess_function(data):
    return tokenizer(
        data["text"], padding="max_length", truncation=True, max_length=128
    )


tokenized_datasets = dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets.set_format("torch")

Map:   0%|          | 0/840 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

In [17]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
).to(DEVICE)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
from transformers import TrainingArguments

output_dir = OUTPUT_DIR / "results"
log_dir = OUTPUT_DIR / "logs"
output_dir.mkdir(exist_ok=True, parents=True)
log_dir.mkdir(exist_ok=True, parents=True)

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1.2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=50,
    weight_decay=0.02,
    logging_dir=log_dir,
    logging_strategy="steps",
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="auroc",
    greater_is_better=True,
    report_to="none",
)

---

### Task 5 - Monitor the metrics

In [19]:
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)


def compute_metrics(pred):
    labels = pred.label_ids
    predictions = pred.predictions.argmax(-1)
    probabilities = pred.predictions

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary", zero_division=0
    )
    acc = accuracy_score(labels, predictions)

    try:
        auroc = roc_auc_score(labels, probabilities[:, 1])
    except ValueError:
        auroc = float("nan")

    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "auroc": auroc,
    }

In [20]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Auroc
1,0.6424,0.63319,0.666667,0.0,0.0,0.0,0.594306
2,0.6253,0.615185,0.666667,0.0,0.0,0.0,0.641944
3,0.5977,0.608354,0.677778,0.121212,0.666667,0.066667,0.685972
4,0.5469,0.539692,0.755556,0.645161,0.625,0.666667,0.78625
5,0.5315,0.508035,0.727778,0.631579,0.575342,0.7,0.820833
6,0.3552,0.466881,0.794444,0.647619,0.755556,0.566667,0.846528
7,0.396,0.51354,0.75,0.671533,0.597403,0.766667,0.842222
8,0.2995,0.549685,0.744444,0.671429,0.5875,0.783333,0.842778
9,0.1685,0.570313,0.738889,0.666667,0.580247,0.783333,0.845139
10,0.1429,0.700745,0.716667,0.638298,0.555556,0.75,0.827639


TrainOutput(global_step=1350, training_loss=0.10609806701609933, metrics={'train_runtime': 294.4473, 'train_samples_per_second': 142.64, 'train_steps_per_second': 4.585, 'total_flos': 2762666081280000.0, 'train_loss': 0.10609806701609933, 'epoch': 50.0})

### Task 6 - display results

In [21]:
results = trainer.evaluate(tokenized_datasets["test"])
print("Test Set Results:", results)

Test Set Results: {'eval_loss': 0.4890971779823303, 'eval_accuracy': 0.7833333333333333, 'eval_f1': 0.6355140186915887, 'eval_precision': 0.723404255319149, 'eval_recall': 0.5666666666666667, 'eval_auroc': 0.8154166666666667, 'eval_runtime': 0.449, 'eval_samples_per_second': 400.891, 'eval_steps_per_second': 13.363, 'epoch': 50.0}


---

### Task 7 - Rerank results from elasticsearch

Input data into elasticsearch

(This is the point where I should run elasticsearch...)

In [22]:
!curl -X DELETE "localhost:9200/pol"

{"acknowledged":true}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    21  100    21    0     0    380      0 --:--:-- --:--:-- --:--:--   381


In [23]:
import requests
import json

analyzer_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "polish": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "morfologik_stem",
                        "lowercase",
                    ],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "polish",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

In [24]:
elastic_url = "http://localhost:9200/pol"
elasticsearch_headers = {"Content-Type": "application/json"}

# send the analyzer settings and mappings to elasticsearch
response = requests.put(
    elastic_url,
    headers=elasticsearch_headers,
    data=json.dumps(analyzer_settings),
)

response.json()

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'pol'}

In [25]:
bulk_url = "http://localhost:9200/pol/_bulk"


# create data index
data = []
for _id, text in corpus_map.items():
    id_head = json.dumps(
        {"index": {"_index": "pol", "_id": str(_id)}}, ensure_ascii=False
    )
    content = json.dumps({"text": text}, ensure_ascii=False)
    data.append(id_head)
    data.append(content)

# join the bul data
bulk_data = "\n".join([item for item in data]) + "\n"

response = requests.post(
    bulk_url, headers=elasticsearch_headers, data=bulk_data.encode("utf-8")
)

print(response.status_code)
print(response.json()["errors"])

200
False


---

Compute ndcg@5 using the method from fts exercise

In [26]:
import numpy as np


def ndcg_at_k(query_corpus_map, query_map, k, search_url):
    logs = np.log2(np.arange(2, 2 + k))
    ndcg_list = []
    query_dict = {"query": {"match": {"text": {}}}, "size": k}

    for query_id, corpus_id_list in query_corpus_map.items():
        query_text = query_map[query_id]

        query_dict["query"]["match"]["text"]["query"] = query_text

        query_request = json.dumps(query_dict)
        response = requests.post(
            search_url, headers=elasticsearch_headers, data=query_request
        )
        data = response.json()
        hits = [int(h["_id"]) for h in data["hits"]["hits"]]

        # sometimes the list of correct matches is shorter than k
        # in  those cases we pad with 0s
        idcg = [1 if i < len(corpus_id_list) else 0 for i in range(k)]
        dcg = [1 if h in corpus_id_list else 0 for h in hits]

        idcg = np.array(idcg) / logs
        dcg = np.array(dcg) / logs

        ndcg_list.append(dcg.sum() / idcg.sum())

    return np.mean(ndcg_list)

In [27]:
fiqa_qa_test = load_dataset("clarin-knext/fiqa-pl-qrels")["test"]
qc_map_test = {}
for q_id, c_id in zip(fiqa_qa_test["query-id"], fiqa_qa_test["corpus-id"]):
    if q_id not in qc_map:
        qc_map_test[q_id] = set()
    qc_map_test[q_id].add(c_id)

In [28]:
ndcg_at_k(qc_map_test, query_map, 10, "http://localhost:9200/pol/_search?pretty")

np.float64(0.15957850679487598)

I don't know why this score is suddenly lower than in previous lab exercises. I use the same data and I'm pretty sure the functionality is the same.

---

Compute NDCG@5 with reordering using our model

In [29]:
from tqdm import tqdm


def ndcg_at_k_reordered(query_corpus_map, query_map, corpus_map, k, search_url):
    logs = np.log2(np.arange(2, 2 + k))
    ndcg_list = []
    query_dict = {"query": {"match": {"text": {}}}, "size": 4 * k}

    for query_id, corpus_id_list in tqdm(
        query_corpus_map.items(), total=len(query_corpus_map)
    ):
        query_text = query_map[query_id]

        query_dict["query"]["match"]["text"]["query"] = query_text

        query_request = json.dumps(query_dict)
        response = requests.post(
            search_url, headers=elasticsearch_headers, data=query_request
        )
        data = response.json()
        hits = [int(h["_id"]) for h in data["hits"]["hits"]]

        # this loop is the key. We order the data by probability of belonging to the positive class
        data_to_reorder = []
        for h in hits:
            inp = query_text + f" [SEP] " + corpus_map[h]
            inp = tokenizer(
                inp, return_tensors="pt", padding=True, truncation=True, max_length=512
            ).to(DEVICE)
            with torch.no_grad():
                outputs = model(**inp)
            score = outputs.logits[0, 1].item()
            data_to_reorder.append((h, score))
        reordered_data = sorted(data_to_reorder, key=lambda x: x[1], reverse=True)[:k]
        hits = [d[0] for d in reordered_data]

        # sometimes the list of correct matches is shorter than k
        # in  those cases we pad with 0s
        idcg = [1 if i < len(corpus_id_list) else 0 for i in range(k)]
        dcg = [1 if h in corpus_id_list else 0 for h in hits]

        idcg = np.array(idcg) / logs
        dcg = np.array(dcg) / logs

        ndcg_list.append(dcg.sum() / idcg.sum())

    return np.mean(ndcg_list)

In [30]:
ndcg_at_k_reordered(
    qc_map_test, query_map, corpus_map, 10, "http://localhost:9200/pol/_search?pretty"
)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 648/648 [05:16<00:00,  2.04it/s]


np.float64(0.049053581569495804)

We see a very low score. It's hardly a surprise since the model has been trained on a small fraction of the full dataset. There are so many different queries and potential answers that the reordering given by our model looks practically random. 

It's also possible that the order of the data in the query dataset groups it by categories. If that's the case then my model was probably trained on answers from the first category (probably something regarding finances) as the "correct answers" and random answers from all other categories as "negative answers". I don't know. But debugging the dataset is not the point of the laboratory. I learned how to train huggingface models for classification tasks in nlp

---


- Do you think simpler methods, like Bayesian bag-of-words model, would work for sentence-pair classification? Justify your answer.
    - Bayesian models don't have the understanding of language and work purely on the word occurences. Classifying sentence pairs like we did in this exercise could be problematic. The model doesn't account for context and can't tell which words come before and which come after the separator token. If we wanted to do that we'd have to encode a word from before the separater differently than the word coming after the separator.
- What hyper-parameters you have selected for the training? What resources (papers, tutorial) you have consulted to select these hyper-parameters?
    - No studies. Just trial and error. I found out that in this specific case lowering the batch-size and decreasing the learning rate helped me get the best results.
- Think about pros and cons of the neural-network models with respect to natural language processing. Provide at least 2 pros and 2 cons.
    - Pros:
        - Model's understanding of language. Possibility of better classification(if we manage to train the model properly).
        - Great flexibility. It's easy to taylor a deep learning model to a specific task.
    - Cons: 
        - Heavy training and inference - Scoring multiple data elements is much slower than with FTS. 
        - Need for a large amounts of training data. Bad for small tasks