## Introduction to Comparative Preference Classification (CPC)

Comparative Preference Classification (CPC) is a key stage in analyzing comparative relations: it determines which preference a question expresses between entities with respect to a set of attributes. Following the RoBERTa-pair-NLI-M setup, CPC is treated as a sentence-pair classification task in which each question is paired with an auxiliary sentence that gives the model extra context. The task assigns each question to one of 14 preference classes, and its output feeds the rest of the comparative-relation pipeline.


## Data Retrieval and Extraction

This cell downloads the dataset archive from a fixed URL with the `requests` library and unzips it with the `zipfile` module, so that the data files are available for preprocessing in the Comparative Preference Classification (CPC) analysis.



In [1]:
import requests
import zipfile
import os

data_url = "https://github.com/mahsamb/SCRQD/raw/main/Dataset.zip"
zip_filename = "Dataset.zip"

# Download the archive and stop early if the request failed,
# so that a stale or missing zip file is never extracted
response = requests.get(data_url, timeout=60)
if response.status_code == 200:
    with open(zip_filename, "wb") as f:
        f.write(response.content)
else:
    raise RuntimeError(f"Failed to retrieve the data: {response.status_code}: {response.text}")

# Unzip the dataset into the "data" directory
try:
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall("data")
    print("Files extracted:")
    print(os.listdir("data"))
except zipfile.BadZipFile:
    print("Error: The file doesn't appear to be a valid zip file")

Files extracted:
['Relations.pkl', 'Questions.pkl', 'EntityRoles.pkl', 'Elements.pkl', 'ComparativePreferences.pkl']


## Loading and Preparing Data

This cell imports `numpy`, `pandas`, and `pickle`, then loads the questions and the comparative preferences from serialized `.pkl` files into Python dictionaries. Having both dictionaries in memory lets the later cells join questions with their CPC annotations.

In [2]:
import numpy as np
import pickle
import re
import pandas as pd
from IPython.display import display, HTML
import random

# Load the question texts (looked up later by the same keys as CPCDict)
with open(r"/kaggle/working/data/Questions.pkl", "rb") as input_file:
    QuestionDict = pickle.load(input_file)

# Load the CPC annotations; the `with` block closes the file automatically
with open(r"/kaggle/working/data/ComparativePreferences.pkl", "rb") as input_file:
    CPCDict = pickle.load(input_file)

The table above shows an example sentence annotated with our enhanced BIO scheme. Each token (word) from the sentence is classified according to whether it signifies the beginning (B), inside (I), or outside (O) of the three categories: Entity (E), Aspect (A), and Constraint (C).


## Comparative Preference Classification (CPC)

### Overview of the CPC Task
The Comparative Preference Classification (CPC) task is a pivotal component of our study in understanding subjective comparative questions. This task involves classifying the nature of preferences expressed in comparative questions, such as determining whether a subjective comparison implies a preference for one entity over another.

Next, we examine sample entries and their classifications to see how preferences are articulated and categorized in our dataset.

## Data Exploration and Example Selection

This cell randomly samples entries from the Comparative Preference Classification (CPC) dataset to illustrate the variety of the data. For each sampled key it collects the question, the pseudo-sentence, and the preference type, and assembles them into a DataFrame. Looking at a few concrete rows makes the dataset's structure and the range of preference types easier to grasp before modeling.

In [3]:
# Sample 10 keys from the dictionary
random_keys = random.sample(list(CPCDict.keys()), 10)

# Build the example_data structure based on the sampled keys
example_data = {
    'ID': [],
    'Question': [],
    'Pseudo-Sentence': [],
    'Preference Type': []

}

# Loop through the sampled keys to retrieve entries
for key in random_keys:
    question = QuestionDict.get(key, 'not found')
    entry = CPCDict[key]
    example_data['ID'].append(key)
    example_data['Question'].append(question)
    example_data['Pseudo-Sentence'].append(entry[0][0])
    example_data['Preference Type'].append(entry[0][1])


# Create a DataFrame
df_examples = pd.DataFrame(example_data)

# Function to format the display of lists within the DataFrame
def format_list_for_display(series):
    return series.apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Apply formatting function to the DataFrame
df_examples['Question'] = format_list_for_display(df_examples['Question'])
df_examples['Pseudo-Sentence'] = format_list_for_display(df_examples['Pseudo-Sentence'])
df_examples['Preference Type'] = format_list_for_display(df_examples['Preference Type'])


# Display the DataFrame
display(HTML(df_examples.to_html(index=False)))


ID,Question,Pseudo-Sentence,Preference Type
36,Would you prefer to buy the Motorola Moto X or the LG Optimus G Pro ?,Motorola Moto X versus LG Optimus G Pro,XorB
107,"Overall , in terms of service , are Samsung phones better than OnePlus in India ?",Samsung phones service versus OnePlus service,B
786,Why is my Samsung Galaxy S 4 better than a Samsung Galaxy S 5 in the AnTuTu benchmark ?,Samsung Galaxy S4 versus Samsung Galaxy S5,B
356,"Which one is better in terms of optics , the Huawei P 40 Pro or the Samsung S 20 Ultra ?",Huawei P40 Pro optics versus Samsung S20 Ultra optics,XorB
982,"I wanted to buy an smartphone , but I am in a dilemma between iPhone 7 or iPhone 7 Plus . Should I buy an iPhone 7 with 128 GB memory or a 7 Plus with 32 GB memory ? They were the same price .",iPhone 7 versus iPhone 7 Plus,XorB
941,"Why do Samsung phones have low specs and poor performance , unlike the Chinese brands ?",Samsung phones specs versus Chinese brands specs,W
1097,Which phone is the most undesirable between the Samsung J 3 and the Honor 7 s ?,Samsung J3 versus Honor 7s,XorSW
57,Which smartphone has the most user - friendly interface between a Samsung Galaxy Note 10 and a Note 9 ?,Samsung Galaxy Note 10 interface and Note 9 interface versus All interface,SB
41,Which is the best smartphone with good noise cancellation and sound quality ?,X noise cancellation versus All noise cancellation,SB
299,Which phone is preferable between the iPhone 6 S and the Samsung J 4 + ?,iPhone 6S versus Samsung J4+,XorB


### Utilizing NLI-M for Pseudo-Sentence Generation
To facilitate the CPC task, we adopt Natural Language Inference with Multiple Output (NLI-M). This approach involves generating pseudo-sentences that represent the core comparison in each question.

- **Pseudo-Sentence Formation**: We create pseudo-sentences by pairing entities and aspects mentioned in the question. For instance, a comparison between two products based on a specific aspect is represented as "(entity i-aspect j versus entity z-aspect k)". In cases where a question lacks explicit mention of an entity or aspect, we use placeholders:
  - "X" is used when an entity is not specified.
  - "All" is used for unspecified aspects.
- **Example**: Given a question comparing the camera quality of 'iPhone 10' and 'iPhone XS', the corresponding pseudo-sentence would be "iPhone 10-camera versus iPhone XS-camera".

This method allows us to convert complex comparative questions into a format that is more readily analyzable by our classification models.
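
The formation rule above can be sketched as a small function. This is an illustration under stated assumptions, not the procedure used to build the dataset: the name `build_pseudo_sentence` and its parameters are hypothetical, and the placeholders follow the prose rule ("X" for a missing entity, "All" for a missing aspect). The sampled rows shown earlier join entity and aspect with a space rather than a hyphen, so the separator is left as a parameter.

```python
from typing import Optional


def build_pseudo_sentence(
    entity_a: Optional[str],
    aspect_a: Optional[str],
    entity_b: Optional[str],
    aspect_b: Optional[str],
    sep: str = "-",
) -> str:
    """Join two entity/aspect pairs into '<a> versus <b>'."""

    def side(entity: Optional[str], aspect: Optional[str]) -> str:
        # "X" stands in for a missing entity, "All" for a missing aspect.
        return f"{entity or 'X'}{sep}{aspect or 'All'}"

    return f"{side(entity_a, aspect_a)} versus {side(entity_b, aspect_b)}"


# The example from the text: camera quality of two iPhones.
example = build_pseudo_sentence("iPhone 10", "camera", "iPhone XS", "camera")
```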


### Comparative Preference Categories
We define 14 preference categories for subjective comparative questions: `B`, `SB`, `W`, `SW`, `E`, `XOR-B`, `XOR-SB`, `XOR-E`, `XOR-W`, `XOR-SW`, `X`, `X-SB`, `X-SW`, `Non-Grad`. The output label for the CPC task is therefore one of these 14 classes. In the dataset itself the labels are stored in compact spellings such as `XorB` and `XorSW`, as the sampled rows above show. The table below expands each abbreviation:


| Preference Type | Description      |
|-----------------|------------------|
| B               | Better           |
| SB              | Strong Better    |
| E               | Equal            |
| W               | Worse            |
| SW              | Strong Worse     |
| XOR-B           | XOR-Better       |
| XOR-SB          | XOR-Strong Better|
| XOR-E           | XOR-Equal        |
| XOR-W           | XOR-Worse        |
| XOR-SW          | XOR-Strong Worse |
| X-SB            | X-Strong Better  |
| _X              | X                |
| X-SW            | X-Strong Worse   |
| Non-Grad        | Non-Gradable     |
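
Since a classifier works on integer class IDs, the label strings need a stable mapping. The sketch below builds one from the 14 label spellings as they appear in the overall classification report at the end of this notebook. The names `DATASET_LABELS`, `id_to_label`, and `decode` are illustrative.

```python
# Label spellings as printed in the overall classification report.
DATASET_LABELS = [
    "B", "E", "NonGrad", "SB", "SB_X", "SW", "SW_X", "W",
    "XorB", "XorE", "XorSB", "XorSW", "XorW", "_X",
]

# Sorting first makes the mapping independent of the order of the input rows.
label_to_id = {label: idx for idx, label in enumerate(sorted(DATASET_LABELS))}
id_to_label = {idx: label for label, idx in label_to_id.items()}


def decode(predicted_ids):
    """Map predicted integer IDs back to label strings."""
    return [id_to_label[i] for i in predicted_ids]
```

Keeping the inverse mapping around makes predictions and confusion-matrix rows readable again after training.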


## Comprehensive Data Compilation for CPC

In this segment, we systematically compile a comprehensive dataset from all available entries in the Comparative Preference Classification (CPC) dictionary, as opposed to randomly sampling a subset. By iterating over each key, we gather corresponding questions, pseudo-sentences, and preference types, thus forming a detailed DataFrame. This approach ensures a thorough overview of the entire dataset, offering insights into the full spectrum of comparative preferences present. This methodical compilation is instrumental for in-depth analysis and modeling, providing a solid foundation for exploring the nuances and patterns within the CPC task.

In [4]:
# Use every key in the dictionary (no sampling this time)
CPC_keys = list(CPCDict.keys())

# Build the example_data structure for the full dataset
example_data = {
    'ID': [],
    'Question': [],
    'Pseudo-Sentence': [],
    'Preference Type': []
}

# Loop through all keys to retrieve entries
for key in CPC_keys:
    question = QuestionDict.get(key, 'not found')
    entry = CPCDict[key]
    example_data['ID'].append(key)
    example_data['Question'].append(question)
    example_data['Pseudo-Sentence'].append(entry[0][0])
    example_data['Preference Type'].append(entry[0][1])


# Create a DataFrame
df_examples = pd.DataFrame(example_data)

print(df_examples)

        ID                                           Question  \
0        1  What are the best smartphones with a built in ...   
1        2  Is the OnePlus 8 T appearance similar to the O...   
2        3  Does OnePlus 6 has been able to beat its ZenFo...   
3        4  Samsung Galaxy M 32 , Realme Narzo 30 , or Viv...   
4        5  Why is the Samsung Galaxy J 3 phone so awkward...   
...    ...                                                ...   
1270  1271  how awful is the user experience of the iPhone...   
1271  1272  How bad is the resolution of the Asus Zenfone ...   
1272  1273  How bad is the image quality of the Apple iPho...   
1273  1274  How much do you think the LG G 6 performance i...   
1274  1275  How bad is the iPhone 13 mini battery than iPh...   

                                        Pseudo-Sentence Preference Type  
0                          X display versus All display              SB  
1     OnePlus 8T appearance versus OnePlus 7T phone ...               E
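
With the full DataFrame in place, it is worth checking how the classes are distributed before cross-validation. The sketch below uses the per-class support counts from the overall classification report at the end of this notebook. The names `SUPPORT` and `class_shares` are illustrative, and in the notebook the same counts could presumably be obtained with `df_examples['Preference Type'].value_counts()`.

```python
# Per-class support, copied from the overall classification report.
SUPPORT = {
    "B": 154, "E": 72, "NonGrad": 50, "SB": 236, "SB_X": 50,
    "SW": 106, "SW_X": 50, "W": 119, "XorB": 178, "XorE": 47,
    "XorSB": 71, "XorSW": 47, "XorW": 47, "_X": 48,
}


def class_shares(support):
    """Return (label, share of all samples) pairs, largest class first."""
    total = sum(support.values())
    ranked = sorted(support.items(), key=lambda item: item[1], reverse=True)
    return [(label, count / total) for label, count in ranked]


total_samples = sum(SUPPORT.values())
shares = class_shares(SUPPORT)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(SUPPORT.values()) / min(SUPPORT.values())
```

The largest class has roughly five times as many samples as the smallest ones, which is one reason to keep class proportions fixed across folds.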

In [6]:
!pip install adapters

Collecting adapters
  Downloading adapters-0.1.2-py3-none-any.whl.metadata (15 kB)
Collecting transformers~=4.36.0 (from adapters)
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Downloading adapters-0.1.2-py3-none-any.whl (256 kB)
Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers, adapters
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed adapters

## Training and Evaluating the Comparative Preference Classification Model

This comprehensive code block orchestrates the training and evaluation of a Comparative Preference Classification (CPC) model using the RoBERTa architecture. Initially, it transforms preference types into numeric labels to facilitate model processing. Leveraging a `StratifiedKFold` strategy, the dataset is divided into training and testing sets across multiple folds to ensure a thorough evaluation. A custom `Dataset` class prepares sentence pairs for RoBERTa, incorporating both questions and pseudo-sentences as inputs. Throughout the training phases, the model optimizes its parameters to discern the subtle distinctions among various preference types. Finally, after training across all folds, precision, recall, and F1-score metrics are calculated, providing a detailed insight into the model's performance. The use of a classification report and confusion matrix offers a granular view of how well the model distinguishes between the nuanced categories of comparative preferences.
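
Two pieces of arithmetic sit behind this evaluation setup, and the sketch below works through them in plain Python. It assumes the values visible in this notebook: 1275 samples in total, 5 folds, a batch size of 8, and the per-class precision and recall from the overall classification report. The function names `fold_batches` and `macro_average` are illustrative.

```python
import math

N_SAMPLES = 1275  # total support in the overall classification report
N_SPLITS = 5      # StratifiedKFold(n_splits=5)
BATCH_SIZE = 8    # DataLoader(batch_size=8)

# Per-class scores copied from the overall classification report, in report
# order: B, E, NonGrad, SB, SB_X, SW, SW_X, W, XorB, XorE, XorSB, XorSW, XorW, _X.
PRECISION = [0.8851, 0.9054, 0.8182, 0.8543, 0.8158, 0.8247, 0.8333,
             0.7739, 0.8807, 0.9535, 0.8077, 1.0000, 0.8036, 0.7600]
RECALL = [0.8506, 0.9306, 0.7200, 0.9195, 0.6200, 0.7547, 1.0000,
          0.7479, 0.8708, 0.8723, 0.8873, 0.8936, 0.9574, 0.7917]


def fold_batches(n_samples, n_splits, batch_size):
    """Approximate (train_batches, test_batches) per fold.

    Assumes each test fold holds n_samples // n_splits items.
    """
    test_size = n_samples // n_splits
    train_size = n_samples - test_size
    return (math.ceil(train_size / batch_size),
            math.ceil(test_size / batch_size))


def macro_average(per_class_scores):
    """Unweighted mean over classes, as used by average='macro'."""
    return sum(per_class_scores) / len(per_class_scores)


train_batches, test_batches = fold_batches(N_SAMPLES, N_SPLITS, BATCH_SIZE)
macro_precision = round(macro_average(PRECISION), 4)
macro_recall = round(macro_average(RECALL), 4)
```

Because the macro average weights every class equally, a weak score on a small class such as `SB_X` pulls the result down as much as a weak score on the largest class would.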

In [12]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, classification_report, confusion_matrix
from transformers import RobertaModel, RobertaConfig, RobertaTokenizer
from tqdm.auto import tqdm
import adapters  # installed above; not referenced directly in this cell

class CustomRobertaForSentencePairClassification(nn.Module):
    def __init__(self, num_labels, device):
        super(CustomRobertaForSentencePairClassification, self).__init__()
        self.device = device
        self.num_labels = num_labels  # Define num_labels as an attribute
        config = RobertaConfig.from_pretrained('roberta-base', num_labels=num_labels)
        self.roberta = RobertaModel.from_pretrained('roberta-base', config=config)
        

        
        self.classifier = nn.Linear(config.hidden_size, num_labels)



    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state[:, 0, :]  # Use the [CLS] token's representation
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        
        return logits if loss is None else (loss, logits)

# Number of preference classes in the CPC task
num_labels = 14




# Convert labels to integer IDs
label_to_id = {label: idx for idx, label in enumerate(sorted(df_examples['Preference Type'].unique()))}
df_examples['label'] = df_examples['Preference Type'].map(label_to_id)

class SentencePairDataset(Dataset):
    def __init__(self, questions, pseudo_sentences, labels, tokenizer, max_length=512):
        self.questions = questions
        self.pseudo_sentences = pseudo_sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        pseudo_sentence = self.pseudo_sentences[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            question,
            pseudo_sentence,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

strat_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

overall_true_labels = []
overall_predictions = []

for fold, (train_index, test_index) in enumerate(strat_kf.split(df_examples, df_examples['label'])):
    print(f"\nStarting fold {fold+1}/{strat_kf.n_splits}")

    train_df = df_examples.iloc[train_index]
    test_df = df_examples.iloc[test_index]

    train_dataset = SentencePairDataset(
        questions=train_df['Question'].tolist(),
        pseudo_sentences=train_df['Pseudo-Sentence'].tolist(),
        labels=train_df['label'].tolist(),
        tokenizer=tokenizer
    )

    test_dataset = SentencePairDataset(
        questions=test_df['Question'].tolist(),
        pseudo_sentences=test_df['Pseudo-Sentence'].tolist(),
        labels=test_df['label'].tolist(),
        tokenizer=tokenizer
    )

    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=8)



    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = CustomRobertaForSentencePairClassification(num_labels=num_labels, device=device)
    model.to(device)

    optimizer = Adam(model.parameters(), lr=3e-5)

    


    for epoch in range(5):  # Adjust the number of epochs if necessary
        model.train()
        print(f"\nTraining Fold {fold+1}, Epoch {epoch+1}")
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()

            # The batch includes labels, so the model returns (loss, logits)
            loss, _ = model(**batch)

            loss.backward()
            optimizer.step()

    # Evaluation
    model.eval()
    true_labels, predictions = [], []


    with torch.no_grad():
        for batch in tqdm(test_loader, desc=f"Evaluating Fold {fold+1}"):
            batch = {k: v.to(device) for k, v in batch.items()}
            # The batch includes labels, so the model returns (loss, logits)
            _, logits = model(**batch)
            # argmax over logits gives the same class as argmax over softmax
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            labels = batch['labels'].cpu().numpy()
            true_labels.extend(labels)
            predictions.extend(preds)


    # Store for overall evaluation
    overall_true_labels.extend(true_labels)
    overall_predictions.extend(predictions)
    
    # Compute and display precision, recall, and F1-score for the current fold
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='macro')
    print(f"\nFold {fold+1} Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

# After all folds are processed, calculate overall metrics and display the classification report
print("\n\nOverall Classification Report:")
print(classification_report(overall_true_labels, overall_predictions, target_names=list(label_to_id), digits=4))
print("Overall Confusion Matrix:")
print(confusion_matrix(overall_true_labels, overall_predictions))



Starting fold 1/5

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training Fold 1, Epoch 1
Training Fold 1, Epoch 2
Training Fold 1, Epoch 3
Training Fold 1, Epoch 4
Training Fold 1, Epoch 5

Fold 1 Precision: 0.8714, Recall: 0.8306, F1-Score: 0.8407

Starting fold 2/5

Training Fold 2, Epoch 1
Training Fold 2, Epoch 2
Training Fold 2, Epoch 3
Training Fold 2, Epoch 4
Training Fold 2, Epoch 5

Fold 2 Precision: 0.8616, Recall: 0.8340, F1-Score: 0.8411

Starting fold 3/5

Training Fold 3, Epoch 1
Training Fold 3, Epoch 2
Training Fold 3, Epoch 3
Training Fold 3, Epoch 4
Training Fold 3, Epoch 5

Fold 3 Precision: 0.8488, Recall: 0.8514, F1-Score: 0.8398

Starting fold 4/5

Training Fold 4, Epoch 1
Training Fold 4, Epoch 2
Training Fold 4, Epoch 3
Training Fold 4, Epoch 4
Training Fold 4, Epoch 5

Fold 4 Precision: 0.8756, Recall: 0.8558, F1-Score: 0.8604

Starting fold 5/5

Training Fold 5, Epoch 1
Training Fold 5, Epoch 2
Training Fold 5, Epoch 3
Training Fold 5, Epoch 4
Training Fold 5, Epoch 5

Fold 5 Precision: 0.8503, Recall: 0.8433, F1-Score: 0.8316


Overall Classification Report:
              precision    recall  f1-score   support

           B     0.8851    0.8506    0.8675       154
           E     0.9054    0.9306    0.9178        72
     NonGrad     0.8182    0.7200    0.7660        50
          SB     0.8543    0.9195    0.8857       236
        SB_X     0.8158    0.6200    0.7045        50
          SW     0.8247    0.7547    0.7882       106
        SW_X     0.8333    1.0000    0.9091        50
           W     0.7739    0.7479    0.7607       119
        XorB     0.8807    0.8708    0.8757       178
        XorE     0.9535    0.8723    0.9111        47
       XorSB     0.8077    0.8873    0.8456        71
       XorSW     1.0000    0.8936    0.9438        47
        XorW     0.8036    0.9574    0.8738        47
          _X     0.7600    0.7917    0.7755        48

    accuracy                         0.8510      1275
   macro avg     0.8512    0.8440    0.84