## Exploring the Compared Elements Identification (CEI) Task

The Compared Elements Identification (CEI) task identifies what is being compared in a question. It works in stages: auxiliary sentences are generated from each question and its identified entities and aspects, which turns Entity Role Identification (ERI) into a sentence-pair classification problem. A RoBERTa model fine-tuned for this task classifies each entity's role as Subject, Object, or None, and the classified entities and aspects are then consolidated into 4-tuple comparative sets. This notebook walks through the data, the auxiliary-sentence format, and the fine-tuning and evaluation of the role classifier.


## Dataset Instructions

We begin by loading various tables essential for our analysis. These include Questions, Relations, Elements, Entity Roles, and Comparative Preference categorizations.
1. Download the dataset from [this link](https://github.com/mahsamb/SCRQD/blob/main/Dataset.zip)
2. Unzip the downloaded file to access the datasets.



In [1]:
import requests
import zipfile
import os

data_url = "https://github.com/mahsamb/SCRQD/raw/main/Dataset.zip"
zip_filename = "Dataset.zip"

# Download the archive; raise_for_status() stops here on an HTTP error
# instead of continuing to the unzip step with a missing or partial file
response = requests.get(data_url, timeout=60)
response.raise_for_status()

with open(zip_filename, "wb") as f:
    f.write(response.content)

# Unzipping the dataset
try:
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall("data")
    print("Files extracted:")
    print(os.listdir("data"))
except zipfile.BadZipFile:
    print("Error: The file doesn’t appear to be a valid zip file")


Files extracted:
['ComparativePreferences.pkl', 'Elements.pkl', 'EntityRoles.pkl', 'Questions.pkl', 'Relations.pkl']


In [2]:
import os
import pickle
import random
import re

import numpy as np
import pandas as pd
from IPython.display import display, HTML

# Directory created by the extraction step above (relative to the working directory)
data_dir = "data"

# The `with` block closes each file automatically, so no explicit close() is needed
with open(os.path.join(data_dir, "Questions.pkl"), "rb") as input_file:
    QuestionDict = pickle.load(input_file)

with open(os.path.join(data_dir, "EntityRoles.pkl"), "rb") as input_file:
    EntityRoleIdentificaionDict = pickle.load(input_file)

## Sample Questions

In this section, we'll explore a selection of questions from the SCRQD dataset. This dataset consists of unique IDs and their corresponding subjective comparative questions, providing a glimpse into the nature and variety of queries within the smartphone domain. Let's take a look at some sample questions to understand the dataset's content better.


In [3]:
import pandas as pd
from IPython.display import display, HTML
import random


# Sample 5 random keys (question IDs) from the dictionary
sampled_keys = random.sample(list(QuestionDict.keys()), 5)

# Prepare data for DataFrame
data_for_df = [{'ID': key, 'Question': QuestionDict[key]} for key in sampled_keys]

# Create DataFrame
df_questions = pd.DataFrame(data_for_df)

# Display the DataFrame
display(HTML(df_questions.to_html(index=False)))


ID,Question
713,Which has a stronger screen among Realme 3 Pro and Redmi Note 8 ?
531,Which smartphone is the worst for gaming ?
347,Samsung Galaxy Note 10 or Microsoft Surface for Note Taking ?
49,"What are the best cell phones in a range of around 10 k to 13 k ? Preferring LG , Lenovo , Motorola ."
387,Is Realme 5 the finest budget smartphone in India ?


## Entity Role Identification (ERI)

### Overview of the ERI Task

We define the Entity Role Identification (ERI) task as a crucial process in natural language processing that involves identifying and classifying the roles of entities within a given text. In the context of subjective comparative questions, this task focuses on discerning the specific roles that entities play—such as being the subject of comparison, the object of comparison, or other relevant roles.


Let's now demonstrate the application of the ERI task on a sample of our dataset and then explore the structure and content of the Entity Role table.

In [4]:
import pandas as pd
from IPython.display import display, HTML
import random

# Function to format the display of lists within the DataFrame
def format_list_for_display(series):
    return series.apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Return the keys whose value holds at least three (pseudo-sentence, role) entries
def find_keys_with_value_length_greater_than_three(data):
    keys_with_more_than_three = [key for key, value in data.items() if len(value) >= 3]
    return keys_with_more_than_three

# Get the keys with at least three entries
keys_with_values_longer_than_three = find_keys_with_value_length_greater_than_three(EntityRoleIdentificaionDict)

# Randomly select up to 3 of those keys
keys_to_display = random.sample(keys_with_values_longer_than_three, min(len(keys_with_values_longer_than_three), 3))

# Function to display the DataFrame in HTML format for the given keys and their values
def display_sampled_keys_df(keys_with_values, category_name):
    print(f"--- {category_name} ---")

    for key in keys_with_values:
        values = EntityRoleIdentificaionDict[key]
        question = QuestionDict.get(key, 'not found')
        # Assuming the values are structured as a list of lists
        for value in values:
            example_data = {
                'ID': key,
                'Question': question,  # Looked up from QuestionDict by ID
                'Pseudo-Sentence': value[0],
                'Entity Role': value[1]
            }

            df_example = pd.DataFrame([example_data])  # Create a DataFrame from a list of dicts

            # Apply formatting function to the DataFrame
            for column in df_example.columns:
                if isinstance(df_example[column].iloc[0], list):
                    df_example[column] = format_list_for_display(df_example[column])

            # Display the DataFrame in HTML
            display(HTML(df_example.to_html(index=False)))

# Call the function to display the sampled keys
display_sampled_keys_df(keys_to_display, "Selected Keys with Values Longer Than Three")


--- Selected Keys with Values Longer Than Three ---


ID,Question,Pseudo-Sentence,Entity Role
1076,"In reality , which is the worst , the Poco F 3 GT , OnePlus Nord , or Vivo V 21 5 G ?",Poco F3 GT - features,1


ID,Question,Pseudo-Sentence,Entity Role
1076,"In reality , which is the worst , the Poco F 3 GT , OnePlus Nord , or Vivo V 21 5 G ?",OnePlus Nord - features,1


ID,Question,Pseudo-Sentence,Entity Role
1076,"In reality , which is the worst , the Poco F 3 GT , OnePlus Nord , or Vivo V 21 5 G ?",Vivo V21e 5G - features,1


ID,Question,Pseudo-Sentence,Entity Role
1271,how awful is the user experience of the iPhone 6 versus Google Pixel 4 a when you re playing music when you get a call ?,iPhone 6 - playing music,1


ID,Question,Pseudo-Sentence,Entity Role
1271,how awful is the user experience of the iPhone 6 versus Google Pixel 4 a when you re playing music when you get a call ?,Google Pixel 4a - playing music,2


ID,Question,Pseudo-Sentence,Entity Role
1271,how awful is the user experience of the iPhone 6 versus Google Pixel 4 a when you re playing music when you get a call ?,Google Pixel 4a - get a call,0


ID,Question,Pseudo-Sentence,Entity Role
867,Why do iPhone users who use Google Maps everyday stay with iPhone rather than switching to Android ?,iPhone - Google Maps,1


ID,Question,Pseudo-Sentence,Entity Role
867,Why do iPhone users who use Google Maps everyday stay with iPhone rather than switching to Android ?,iPhone - Google Maps,1


ID,Question,Pseudo-Sentence,Entity Role
867,Why do iPhone users who use Google Maps everyday stay with iPhone rather than switching to Android ?,Android - Google Maps,2


### Utilizing Sentence-Pair Classification for ERI

In the Entity Role Identification (ERI) process within the SCRQD dataset, we employ the sentence-pair classification method, an essential technique in Natural Language Inference (NLI) tasks. This method is particularly effective for analyzing the structure and intent of subjective comparative questions.

#### Utilizing NLI-M for Pseudo-Sentence Generation

To prepare our data for the ERI task using sentence-pair classification, we transform the original comparative questions into pseudo-sentences, adhering to a specific method:

- **Combining Entities with Aspects**: In each question, we identify and pair each mentioned entity with an aspect. If no specific aspect is mentioned, we use "features" as a placeholder. This is essential for maintaining the focus of the comparison within the question.
- **Use of a Hyphen ("-")**: We employ a hyphen ("-") to separate each entity-aspect pair, creating clear and structured pseudo-sentences. This formatting not only simplifies the original text but also preserves the core elements of comparison, enhancing its suitability for computational analysis.
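
The two rules above can be sketched as a small helper. This is a minimal illustration, not the code that produced the dataset: the function name `make_pseudo_sentences` and its input format (a list of entity/aspect pairs where the aspect may be `None`) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

DEFAULT_ASPECT = "features"  # placeholder used when a question names no aspect


def make_pseudo_sentences(pairs: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Join each (entity, aspect) pair with ' - ', falling back to 'features'."""
    sentences = []
    for entity, aspect in pairs:
        # Treat None and whitespace-only aspects as "no aspect mentioned"
        aspect_text = aspect.strip() if aspect and aspect.strip() else DEFAULT_ASPECT
        sentences.append(f"{entity.strip()} - {aspect_text}")
    return sentences


# Hypothetical input: one entity with an explicit aspect, one without
print(make_pseudo_sentences([("OnePlus 8T", "appearance"), ("OnePlus 6", None)]))
```

The second pair has no aspect, so it falls back to the "features" placeholder, mirroring pseudo-sentences such as `OnePlus 6 - features` in the dataset.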

#### Pairing with Labels

Each pseudo-sentence is subsequently paired with a label that signifies the role of entities within the question:
  - Label "1" for the "Subject" role.
  - Label "2" for the "Object" role.
  - Label "0" for "None", indicating either no specific role or a neutral aspect in the question's context.

These labels are pivotal in guiding our classification model to accurately identify and categorize the roles of entities in the questions.
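
As a rough sketch of what one labeled training example looks like, the snippet below bundles a question, a pseudo-sentence, and its label into a single record. The names `ROLE_NAMES` and `build_example` and the `text_a`/`text_b` keys are hypothetical; only the label-to-role mapping comes from the list above.

```python
# Label IDs and role names as described in the list above
ROLE_NAMES = {0: "None", 1: "Subject", 2: "Object"}


def build_example(question: str, pseudo_sentence: str, label: int) -> dict:
    """Bundle one sentence pair with its integer label and readable role name."""
    if label not in ROLE_NAMES:
        raise ValueError(f"Unknown entity-role label: {label}")
    return {
        "text_a": question,         # first sentence of the pair
        "text_b": pseudo_sentence,  # second sentence of the pair
        "label": label,
        "role": ROLE_NAMES[label],
    }


example = build_example(
    "how awful is the user experience of the iPhone 6 versus Google Pixel 4 a "
    "when you re playing music when you get a call ?",
    "Google Pixel 4a - playing music",
    2,
)
print(example["text_b"], "->", example["role"])
```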


## Compiling Data for CEI Task Analysis

In this section, we compile the dataset for the Compared Elements Identification (CEI) task from `EntityRoleIdentificaionDict`. For each question ID, we collect the question text, a pseudo-sentence, and its entity-role label into a DataFrame. Note that the loop keeps only the first pseudo-sentence stored for each ID (`entry[0]`), so the DataFrame has one row per question. The entity roles are the labels stored in the dataset, not model predictions. This DataFrame is the input for training and evaluating the role classifier in the sections that follow.

In [5]:

CEI_keys = list(EntityRoleIdentificaionDict.keys())

# Build the example_data structure, one list per DataFrame column
example_data = {
    'ID': [],
    'Question': [],
    'Pseudo-Sentence': [],
    'Entity Role': []

}

# Loop through all keys; only the first (pseudo-sentence, role) entry per key is kept
for key in CEI_keys:
    question = QuestionDict.get(key, 'not found')
    entry = EntityRoleIdentificaionDict[key]
    example_data['ID'].append(key)
    example_data['Question'].append(question)
    example_data['Pseudo-Sentence'].append(entry[0][0])
    example_data['Entity Role'].append(entry[0][1])


# Create a DataFrame
df_examples = pd.DataFrame(example_data)

print(df_examples)


        ID                                           Question  \
0        1  What are the best smartphones with a built in ...   
1        2  Is the OnePlus 8 T appearance similar to the O...   
2        3  Does OnePlus 6 has been able to beat its ZenFo...   
3        4  Samsung Galaxy M 32 , Realme Narzo 30 , or Viv...   
4        5  Why is the Samsung Galaxy J 3 phone so awkward...   
...    ...                                                ...   
1155  1271  how awful is the user experience of the iPhone...   
1156  1272  How bad is the resolution of the Asus Zenfone ...   
1157  1273  How bad is the image quality of the Apple iPho...   
1158  1274  How much do you think the LG G 6 performance i...   
1159  1275  How bad is the iPhone 13 mini battery than iPh...   

                        Pseudo-Sentence  Entity Role  
0     Samsung - built in stylus feature            0  
1               OnePlus 8T - appearance            1  
2                  OnePlus 6 - features            1  

## Training and Evaluation of the CEI Model with RoBERTa

To train the CEI model, we first analyze the distribution of entity roles, then use Focal Loss to counter the class imbalance, and finally fine-tune and evaluate a RoBERTa-based classifier with stratified k-fold cross-validation. The goal is a model that distinguishes the Subject, Object, and None roles reliably, including the rare ones.


## Analyzing Distribution of Entity Roles

To gain insights into the distribution of entity roles within our dataset, we utilize Python's `Counter` class from the collections module. This approach efficiently tallies the occurrences of each unique entity role, such as Subject, Object, or None, providing us with a clear view of the dataset's composition. By converting these counts into a readable format, we can understand the prevalence of each role, which is crucial for assessing the dataset's balance and diversity. This analysis helps in identifying potential biases or imbalances that could influence the model's training and performance, guiding further data preprocessing or augmentation efforts to ensure a well-rounded training process.

In [6]:
from collections import Counter
import pandas as pd

# Use Counter to count occurrences of each unique value
counter = Counter(df_examples['Entity Role'])

# Display the counts
print(counter)

Counter({1: 1067, 0: 85, 2: 8})
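
To make the imbalance concrete, the sketch below turns the counts printed above into class shares and into inverse-frequency weights. The weighting formula `total / (num_classes * count)` is a common heuristic shown only for comparison; it is an assumption of this example and is not how the alpha values used later in this notebook were chosen.

```python
from collections import Counter

# Counts copied from the output above
role_counts = Counter({1: 1067, 0: 85, 2: 8})
total = sum(role_counts.values())

# Fraction of the dataset that each role accounts for
shares = {role: count / total for role, count in role_counts.items()}

# Inverse-frequency ("balanced") weights: rarer classes get larger weights
balanced_weights = {
    role: total / (len(role_counts) * count) for role, count in role_counts.items()
}

for role in sorted(role_counts):
    print(f"role {role}: share={shares[role]:.4f}, balanced weight={balanced_weights[role]:.2f}")
```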


### Focal Loss for Handling Class Imbalance

In dealing with the challenging issue of class imbalance in our dataset, we implement a custom `FocalLoss` class. Class imbalance often leads to models that perform well on majority classes but poorly on minority classes. `FocalLoss` is particularly designed to address this by modifying the standard cross-entropy loss such that it puts more focus on hard, misclassified examples and less on easy, correctly classified examples. The gamma parameter controls the rate at which easy examples are down-weighted, while the alpha parameter offers a way to give different weights to different classes, further helping to balance the training process. This custom loss function is crucial for improving our model's performance on under-represented classes.
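
A small numeric sketch may help show the effect of gamma. For a single example whose true class receives probability `p_t`, focal loss takes the form `FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)`. The helper name `focal_loss_single` is hypothetical and works on plain floats; it is only meant to illustrate the scaling, not to replace the `FocalLoss` class.

```python
import math


def focal_loss_single(p_t: float, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """Focal loss for one example, given the probability assigned to its true class."""
    cross_entropy = -math.log(p_t)
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy


# Compare an easy (p_t = 0.9), an uncertain (0.5), and a hard (0.1) example
for p_t in (0.9, 0.5, 0.1):
    ce = -math.log(p_t)
    fl = focal_loss_single(p_t)
    print(f"p_t={p_t:.1f}  cross-entropy={ce:.4f}  focal={fl:.4f}  scale={fl / ce:.2f}")
```

With gamma = 2, the easy example keeps about 1% of its cross-entropy loss, while the hard example keeps about 81%, which is how the loss shifts attention toward misclassified examples.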

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import precision_recall_fscore_support, classification_report, confusion_matrix
from tqdm.auto import tqdm
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha  # Optional 1-D tensor of per-class weights, indexed by class ID
        self.reduction = reduction

    def forward(self, input, target):
        # Assume input is the raw logits and target is the true labels
        ce_loss = F.cross_entropy(input, target, reduction='none')  # Cross entropy loss
        pt = torch.exp(-ce_loss)  # Probability of being correctly classified
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss  # Focal loss calculation

        if self.alpha is not None:
            # alpha must be a 1-D tensor on the same device as target; select each sample's class weight
            alpha_t = self.alpha[target]
            focal_loss = alpha_t * focal_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:  # 'none'
            return focal_loss

### Training and Evaluation Process

The core of our model training and evaluation lies within a stratified k-fold cross-validation loop. Given the imbalanced nature of our dataset, stratified sampling ensures that each fold is a good representative of the whole. For each fold, we:

1. **Prepare the Dataset**: A custom `SentencePairDataset` class is used to tokenize and prepare our data for the RoBERTa model, ensuring inputs are correctly formatted and aligned with expected input dimensions.

2. **Model Setup**: We wrap the pretrained `RobertaModel` from the Transformers library in a custom module that feeds the hidden state of the first token (`<s>`) into a linear layer whose output size matches the number of labels in our dataset. The model is transferred to the GPU for efficient training if available.

3. **Loss Function and Optimizer**: Focal Loss is employed as our loss function to tackle the class imbalance problem, with alpha values adjusted to give more weight to the minority class. The Adam optimizer is chosen for its effectiveness in handling sparse gradients on noisy problems.

4. **Training Loop**: For each epoch, we train the model on our training set, using a DataLoader to batch our data and tqdm to visualize progress. Backpropagation is applied to update the model weights based on the computed loss.

5. **Evaluation**: Post training, the model's performance is evaluated on the test set. True labels and predictions are collected to calculate precision, recall, and F1-score, providing insights into how well the model performs across different classes, especially focusing on the minority class. This step is crucial for iteratively improving our model's ability to generalize across all classes.

The process is repeated across all folds, and the true labels and predictions from every fold are accumulated to compute overall metrics. Focal Loss is used to improve recognition of the minority class (class 2), whose precision, recall, and F1-score appear in the final classification report.
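
To illustrate why stratification matters with so few minority examples, here is a simplified sketch that assigns examples to folds class by class in round-robin order. It is not scikit-learn's `StratifiedKFold` algorithm; the function name `stratified_fold_ids` is hypothetical, and the label counts are the ones reported in the distribution analysis.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(labels, n_splits=5, seed=42):
    """Assign each example a fold ID so every class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn
        for position, index in enumerate(indices):
            fold_of[index] = position % n_splits
    return fold_of


# Label counts from the dataset: 1067 Subject, 85 None, 8 Object
labels = [1] * 1067 + [0] * 85 + [2] * 8
fold_of = stratified_fold_ids(labels)

for fold in range(5):
    held_out = [label for label, f in zip(labels, fold_of) if f == fold]
    print(f"fold {fold + 1}: {dict(sorted(Counter(held_out).items()))}")
```

With only 8 Object examples, each held-out fold contains just one or two of them, so per-fold scores for that class can swing widely and the accumulated metrics are the more informative figure.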

In [9]:
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaConfig, RobertaTokenizer
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sklearn.metrics import precision_recall_fscore_support, classification_report, confusion_matrix
from torch.optim import Adam


# Assuming df_examples is prepared and FocalLoss is defined as previously discussed
# Convert labels to integer IDs
label_to_id = {label: idx for idx, label in enumerate(sorted(df_examples['Entity Role'].unique()))}
df_examples['label'] = df_examples['Entity Role'].map(label_to_id)


class CustomRobertaForSentencePairClassification(nn.Module):
    def __init__(self, num_labels, device):
        super().__init__()
        config = RobertaConfig.from_pretrained('roberta-base', num_labels=num_labels)
        self.roberta = RobertaModel.from_pretrained('roberta-base', config=config)
        
             
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.device = device

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(sequence_output)
        return logits
    
    

class SentencePairDataset(Dataset):
    def __init__(self, questions, pseudo_sentences, labels, tokenizer, max_length=512):
        self.questions = questions
        self.pseudo_sentences = pseudo_sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        pseudo_sentence = self.pseudo_sentences[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            question,
            pseudo_sentence,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

    
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')


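device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Note: the model (and the optimizer below) is created once, outside the fold loop,
# so weights learned in earlier folds carry over into later folds
model = CustomRobertaForSentencePairClassification(num_labels=len(label_to_id), device=device).to(device)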



alpha = torch.tensor([1, 1, 2], dtype=torch.float).to(device)  # Increase weight for class 2
criterion = FocalLoss(alpha=alpha, gamma=2.0).to(device)

optimizer = Adam(model.parameters(), lr=3e-5)

strat_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

overall_true_labels = []
overall_predictions = []

for fold, (train_index, test_index) in enumerate(strat_kf.split(df_examples, df_examples['label'])):
    print(f"\nStarting fold {fold+1}/{strat_kf.n_splits}")

    train_df = df_examples.iloc[train_index]
    test_df = df_examples.iloc[test_index]

    train_dataset = SentencePairDataset(
        questions=train_df['Question'].tolist(),
        pseudo_sentences=train_df['Pseudo-Sentence'].tolist(),
        labels=train_df['label'].tolist(),
        tokenizer=tokenizer,
        max_length=100
    )

    test_dataset = SentencePairDataset(
        questions=test_df['Question'].tolist(),
        pseudo_sentences=test_df['Pseudo-Sentence'].tolist(),
        labels=test_df['label'].tolist(),
        tokenizer=tokenizer,
        max_length=100
    )

    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=8)

    for epoch in range(5):  # Adjust the number of epochs if necessary
        model.train()
        for batch in tqdm(train_loader, desc=f"Training Fold {fold+1} - Epoch {epoch+1}"):
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['labels'])
            loss = criterion(outputs, batch['labels'])
            loss.backward()
            optimizer.step()

    # Evaluation for each fold
    model.eval()
    true_labels_fold, predictions_fold = [], []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc=f"Evaluating Fold {fold+1}"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
            preds = torch.argmax(outputs, dim=1)
            true_labels_fold.extend(batch['labels'].cpu().numpy())
            predictions_fold.extend(preds.cpu().numpy())

    # Calculate and print metrics for the current fold
    precision, recall, fscore, _ = precision_recall_fscore_support(true_labels_fold, predictions_fold, average='weighted', zero_division=0)
    print(f"Fold {fold+1} - Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {fscore:.4f}")

    # Accumulate overall metrics
    overall_true_labels.extend(true_labels_fold)
    overall_predictions.extend(predictions_fold)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Starting fold 1/5


Fold 1 - Precision: 0.9673, Recall: 0.9569, F1-Score: 0.9611

Starting fold 2/5


Fold 2 - Precision: 0.9591, Recall: 0.9655, F1-Score: 0.9612

Starting fold 3/5


Fold 3 - Precision: 0.9959, Recall: 0.9957, F1-Score: 0.9957

Starting fold 4/5


Fold 4 - Precision: 1.0000, Recall: 1.0000, F1-Score: 1.0000

Starting fold 5/5


Fold 5 - Precision: 0.9957, Recall: 0.9957, F1-Score: 0.9956


## Final Analysis and Insights on CEI Model Performance

Upon completing the training and evaluation across all folds, we compile the overarching metrics to assess the Compared Elements Identification (CEI) model's performance comprehensively. The classification report provides a detailed account of precision, recall, and F1-score for each entity role category, offering a granular view of the model's accuracy in distinguishing between Subject, Object, and None roles. This analysis is crucial for understanding the model's strengths and areas for improvement, particularly in handling complex comparative relations.

Additionally, the confusion matrix tabulates the model's predictions against the actual labels (rows are true labels, columns are predictions), enabling us to identify patterns in misclassifications, such as None-role examples (class 0) predicted as Subject (class 1).

In [11]:
# After all folds are processed, calculate overall metrics and display the classification report
print("\n\nOverall Classification Report:")

# Directly use labels as integers
# No need for 'target_names' if labels are straightforward integers like 0, 1, 2
print(classification_report(overall_true_labels, overall_predictions, labels=[0, 1, 2], digits=4, zero_division=0))

print("Overall Confusion Matrix:")
print(confusion_matrix(overall_true_labels, overall_predictions))



Overall Classification Report:
              precision    recall  f1-score   support

           0     0.9351    0.8471    0.8889        85
           1     0.9897    0.9953    0.9925      1067
           2     0.6000    0.7500    0.6667         8

    accuracy                         0.9828      1160
   macro avg     0.8416    0.8641    0.8494      1160
weighted avg     0.9831    0.9828    0.9827      1160

Overall Confusion Matrix:
[[  72   10    3]
 [   4 1062    1]
 [   1    1    6]]
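
As a cross-check, the per-class figures in the report can be recomputed directly from the confusion matrix above. The sketch assumes the scikit-learn convention that rows are true labels and columns are predictions; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the output above (rows: true labels, columns: predictions)
confusion = [
    [72, 10, 3],
    [4, 1062, 1],
    [1, 1, 6],
]


def per_class_metrics(matrix):
    """Return {class_id: (precision, recall, f1)} computed from a square confusion matrix."""
    size = len(matrix)
    metrics = {}
    for c in range(size):
        true_positives = matrix[c][c]
        predicted_as_c = sum(matrix[row][c] for row in range(size))  # column sum
        actually_c = sum(matrix[c])                                  # row sum
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(map(sum, confusion))
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for class_id, (precision, recall, f1) in metrics.items():
    print(f"class {class_id}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} macro-F1={macro_f1:.4f}")
```

For class 2, 6 of the 10 examples predicted as Object are correct and 6 of the 8 true Object examples are found, which gives the precision and recall shown in the report.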
