## Data Collection 🛠

The SubjQA dataset is constructed from publicly available review datasets. Specifically, the movies, books, electronics, and grocery categories are built from reviews in the Amazon Review dataset. The TripAdvisor category, as the name suggests, is built from TripAdvisor reviews, which can be found [here](link). Finally, the restaurants category is built from the Yelp Dataset, which is also publicly available.

The process of constructing SubjQA is discussed in detail in our paper. In a nutshell, the dataset construction consists of the following steps:

1. First, all opinions expressed in reviews are extracted. In the pipeline, each opinion is modeled as a (modifier, aspect) pair, that is, a pair of spans where the former describes the latter *(e.g., "good, hotel" and "terrible, acting" are extracted opinions)*.
2. Using matrix factorization techniques, implication relationships between the expressed opinions are mined. For instance, the system mines that "responsive keys" implies "good keyboard". In our pipeline, we refer to the conclusion of an implication (i.e., "good keyboard" in this example) as the query opinion, and we refer to the premise (i.e., "responsive keys") as its neighboring opinion.
3. Annotators are then asked to write a question based on query opinions. For instance, given "good keyboard" as the query opinion, they might write "Is this keyboard any good?"
4. Each question written based on a query opinion is then paired with a review that mentions its neighboring opinion. In our example, that would be a review that mentions "responsive keys".
5. The question and review pairs are presented to annotators to select the correct answer span, and rate the subjectivity level of the question as well as the subjectivity level of the highlighted answer span.
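
The pairing logic of steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration under assumed names: `Opinion`, `implications`, `questions`, `pair_questions_with_reviews`, and the toy reviews are hypothetical and are not taken from the actual SubjQA pipeline.

```python
from typing import NamedTuple


class Opinion(NamedTuple):
    modifier: str
    aspect: str


# Hypothetical mined implication: neighboring opinion (premise) -> query opinion (conclusion).
implications = {Opinion("responsive", "keys"): Opinion("good", "keyboard")}

# Hypothetical annotator-written questions, keyed by query opinion.
questions = {Opinion("good", "keyboard"): "Is this keyboard any good?"}

# Toy reviews together with the opinions extracted from them.
reviews = [
    ("The keys are responsive and quiet.", [Opinion("responsive", "keys")]),
    ("Battery life is short.", [Opinion("short", "battery life")]),
]


def pair_questions_with_reviews(implications, questions, reviews):
    """Pair each question with the reviews that mention its neighboring opinion."""
    pairs = []
    for review_text, opinions in reviews:
        for neighbor in opinions:
            query = implications.get(neighbor)
            if query is not None and query in questions:
                pairs.append((questions[query], review_text))
    return pairs


pairs = pair_questions_with_reviews(implications, questions, reviews)
```

Only the first toy review is paired with the question, because it is the only one that mentions the neighboring opinion "responsive keys".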

## Data Format 📊

All files are in standard CSV format, and they consist of the following columns:

- **domain**: The category/domain of the review (e.g., hotels, books, ...).
- **question**: The question (written based on a query opinion).
- **review**: The review (that mentions the neighboring opinion).
- **human_ans_spans**: The span labeled by annotators as the answer.
- **human_ans_indices**: The (character-level) start and end indices of the answer span highlighted by annotators.
- **question_subj_level**: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
- **ques_subj_score**: The subjectivity score of the question computed using the TextBlob package.
- **is_ques_subjective**: A boolean subjectivity label derived from question_subj_level (i.e., scores below 4 are considered subjective).
- **answer_subj_level**: The subjectivity level of the answer span (on a 1 to 5 scale with 1 being the most subjective).
- **ans_subj_score**: The subjectivity score of the answer span computed using the TextBlob package.
- **is_ans_subjective**: A boolean subjectivity label derived from answer_subj_level (i.e., scores below 4 are considered subjective).
- **nn_mod**: The modifier of the neighboring opinion (which appears in the review).
- **nn_asp**: The aspect of the neighboring opinion (which appears in the review).
- **query_mod**: The modifier of the query opinion (around which a question is manually written).
- **query_asp**: The aspect of the query opinion (around which a question is manually written).
- **item_id**: The id of the item/business discussed in the review.
- **review_id**: A unique id associated with the review.
- **q_review_id**: A unique id assigned to the question-review pair.
- **q_reviews_id**: A unique id assigned to all question-review pairs with a shared question.
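
As a rough illustration of working with this format, the sketch below reads a made-up two-row CSV that contains only a subset of the columns and converts the integer and boolean fields from strings. The sample rows, the `load_rows` helper, and the `has_answer` field are assumptions for the example; `ANSWERNOTFOUND` is the marker that appears in the sample rows printed later in this notebook.

```python
import csv
import io

# A made-up two-row file with a subset of the columns described above.
SAMPLE_CSV = """domain,question,question_subj_level,is_ques_subjective,human_ans_spans
movies,How is the costume design?,1,False,The costume design is great
books,Is the plot good?,2,True,ANSWERNOTFOUND
"""


def load_rows(text):
    """Read CSV text and convert the integer and boolean columns from strings."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["question_subj_level"] = int(row["question_subj_level"])
        row["is_ques_subjective"] = row["is_ques_subjective"] == "True"
        # Hypothetical convenience flag: False when no answer span was found.
        row["has_answer"] = row["human_ans_spans"] != "ANSWERNOTFOUND"
        rows.append(row)
    return rows


rows = load_rows(SAMPLE_CSV)
```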

### Citation
Johannes Bjerva, Nikita Bhutani, Behzad Golshan, Wang-Chiew Tan, and Isabelle Augenstein. (2020). SubjQA: A Dataset for Subjectivity and Review Comprehension. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics.

In [1]:
# Standard Libraries
import collections
import numpy as np
import os

# Visualization Libraries
import plotly.express as px

# Deep Learning and Computation
import torch
from tqdm.auto import tqdm

# Machine Learning Metrics
from sklearn.metrics import accuracy_score

# Data Loading and Acceleration Utilities
from datasets import load_dataset
import datasets
from accelerate import Accelerator
import evaluate

# Transformers Library for NLP
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    Trainer,
    TrainingArguments,
    TrainerCallback
)

In [2]:
model = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model)

In [3]:
# Fast tokenizers are backed by the Rust `tokenizers` library: they are quicker on large
# inputs and provide the offset mappings that the preprocessing functions below rely on.
tokenizer.is_fast

True

In [4]:
import pandas as pd
df_train = pd.read_csv('subjqa-train.csv')
df_test = pd.read_csv('subjqa-test.csv')

In [5]:
# Define the maximum length and stride parameters for tokenization
max_length = 384  # Maximum length of tokenized sequences, commonly used for a balance between context and memory usage
stride = 128  # Stride determines overlap between tokenized sequences, providing context while avoiding redundancy
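
To make the interaction of the two parameters concrete, here is a minimal sketch of how a long context is cut into overlapping windows. It assumes that `stride` counts the tokens shared by consecutive chunks; `chunk_windows` and the `window` budget (the part of `max_length` left for the context once question and special tokens are accounted for) are hypothetical names for illustration, not tokenizer API.

```python
def chunk_windows(n_tokens, window, overlap):
    """Return (start, end) token ranges that cover n_tokens with the given overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows


# A 10-token context, 4 context tokens per chunk, 2 tokens of overlap.
small = chunk_windows(10, window=4, overlap=2)

# A 1000-token review with a hypothetical budget of 370 context tokens per chunk.
large = chunk_windows(1000, window=370, overlap=128)
```

A review that fits into one window yields a single chunk, while longer reviews yield several, which is why the number of tokenized features can exceed the number of examples.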

In [6]:
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
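
The label computation in the function above can be tried out in isolation. The sketch below repeats the same two scans on a toy list of character offsets; `char_span_to_token_span` is a hypothetical helper, and the whitespace "tokens" are an assumption made to keep the example free of a real tokenizer.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, or (0, 0) if it is not fully covered."""
    first, last = 0, len(offsets) - 1
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0
    # Scan forward to the last token that starts at or before the answer start.
    idx = first
    while idx <= last and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Scan backward to the first token that ends at or after the answer end.
    idx = last
    while idx >= first and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "the show is great"
# Whitespace "tokens" with (start, end) character offsets; a real tokenizer would
# return sub-word offsets and would also include question and special tokens.
offsets = [(0, 3), (4, 8), (9, 11), (12, 17)]
start_char = context.index("is great")
end_char = start_char + len("is great")
span = char_span_to_token_span(offsets, start_char, end_char)
```

Here the answer "is great" maps to tokens 2 through 3, and a span that lies outside the chunk falls back to `(0, 0)`.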

In [7]:
df_train.head()

Unnamed: 0,item_id,domain,nn_mod,nn_asp,query_mod,query_asp,q_review_id,q_reviews_id,question,question_subj_level,ques_subj_score,is_ques_subjective,review_id,review,human_ans_spans,human_ans_indices,answer_subj_level,ans_subj_score,is_ans_subjective
0,B00BVMXBDO,movies,addictive,show,full,series,d9a9615d45df2f6e6108db4ca46bfded,399f1046fe6bd97990107f9d7aa86f4a,Who is the author of this series?,1,0.0,False,090671369dddfeb02db9bf7125a47c79,Whether it be in her portrayal of a nerdy lesb...,ANSWERNOTFOUND,"(251, 265)",1,0.0,False
1,1404918051,movies,enough simple,film,charming,movie,06ffe37a8023636a3ce00b020a517e87,42d9dd5b0c67150cac1e13308811cbb5,Can we enjoy the movie along with our family ?,1,0.5,False,a29821121e74d319cb93f77101e99c88,"An outstanding romantic comedy, 13 Going on 30...",ANSWERNOTFOUND,"(1195, 1209)",1,0.0,False
2,B0000633ZP,movies,weak,plot,bad,one,3b625c68e91b9e6987a08b84a9a9d234,32d06ccf2132cda644aea791fa688c53,Does this one good?,5,0.6,True,12a1b821f761bd19a75be7b16cef4a7c,"To let the truth be known, I watched this movi...",ANSWERNOTFOUND,"(1476, 1490)",5,0.0,False
3,B0000AQS0F,movies,outstanding,show,wonderful,series,f3abfa98b011127e7cb49bcd07f8deeb,e546636f0bb9f93d5f24b4ade9ebab45,Is this series good and excelent?,1,0.6,True,cd0f92322e67cc9d70de6674caace78c,"At the time of my review, there had been 910 c...",this show is OUTSTANDING,"(296, 320)",1,0.875,True
4,B003Y5H5FG,movies,great,production design,great,costume design,1b03744e764b257592c2c768345c14bc,a0a97e460a194bcb3286fe68d20aadc2,How is the costume design?,1,0.0,False,f6b5024393ebc70287befdaf47a50b75,"""Fright Night"" is great! This is how the story...",The costume design by Susan Matheson is great,"(1254, 1299)",1,0.75,True


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2501 entries, 0 to 2500
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   item_id              2501 non-null   object 
 1   domain               2501 non-null   object 
 2   nn_mod               2501 non-null   object 
 3   nn_asp               2501 non-null   object 
 4   query_mod            2501 non-null   object 
 5   query_asp            2501 non-null   object 
 6   q_review_id          2501 non-null   object 
 7   q_reviews_id         2501 non-null   object 
 8   question             2501 non-null   object 
 9   question_subj_level  2501 non-null   int64  
 10  ques_subj_score      2501 non-null   float64
 11  is_ques_subjective   2501 non-null   bool   
 12  review_id            2501 non-null   object 
 13  review               2501 non-null   object 
 14  human_ans_spans      2501 non-null   object 
 15  human_ans_indices    2501 non-null   object 

In [9]:
df_train.columns

Index(['item_id', 'domain', 'nn_mod', 'nn_asp', 'query_mod', 'query_asp',
       'q_review_id', 'q_reviews_id', 'question', 'question_subj_level',
       'ques_subj_score', 'is_ques_subjective', 'review_id', 'review',
       'human_ans_spans', 'human_ans_indices', 'answer_subj_level',
       'ans_subj_score', 'is_ans_subjective'],
      dtype='object')

## Checking the questions and answers
- Let's check that 'human_ans_indices' points at the answer span inside the review.

In [11]:
df_train.iloc[0].question

'Who is the author of this series?'

In [12]:
df_train.iloc[0].review

"Whether it be in her portrayal of a nerdy lesbian or a punk rock rebel, Maslany's plural personalities, (though very stereotypical), are entertaining eye-candy. Combined with a complex and unpredictable plot line, this show is surprisingly addictive. ANSWERNOTFOUND"

In [13]:
df_train.iloc[0].human_ans_indices

'(251, 265)'

In [14]:
df_train.iloc[0].review[251:265]

'ANSWERNOTFOUND'

In [15]:
# Keep only the columns needed for further analysis
# (.copy() avoids SettingWithCopyWarning when new columns are assigned later)
df_train=df_train[['question','human_ans_indices','review','human_ans_spans']].copy()
df_test=df_test[['question','human_ans_indices','review','human_ans_spans']].copy()

In [16]:
# Assign a sequential integer id (0, 1, 2, ...) to every row
df_train['id']=np.arange(len(df_train))
df_test['id']=np.arange(len(df_test))

In [17]:
df_train['id']=df_train['id'].astype(str)
df_test['id']=df_test['id'].astype(str)

In [18]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2501 entries, 0 to 2500
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   question           2501 non-null   object
 1   human_ans_indices  2501 non-null   object
 2   review             2501 non-null   object
 3   human_ans_spans    2501 non-null   object
 4   id                 2501 non-null   object
dtypes: object(5)
memory usage: 97.8+ KB


In [19]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 582 entries, 0 to 581
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   question           582 non-null    object
 1   human_ans_indices  582 non-null    object
 2   review             582 non-null    object
 3   human_ans_spans    582 non-null    object
 4   id                 582 non-null    object
dtypes: object(5)
memory usage: 22.9+ KB


In [20]:
int(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[0])

251

In [21]:
float(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])

265.0

In [22]:
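# Create the 'answers' column; for the training set it is overwritten below with SQuAD-style dicts
df_train['answers']=df_train['human_ans_spans']
# For the test set, 'answers' keeps the raw answer text, which compute_metrics uses as the reference
df_test['answers']=df_test['human_ans_spans']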

In [23]:
# Build a SQuAD-style answer dict (answer text and start offset) for each training row and store it in 'answers'
for i in range(0,len(df_train)):
  answer1={}
  si=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_train.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_train.at[i, 'answers']=answer1
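
A less brittle way to parse the `'(start, end)'` strings than chained `split` calls is `ast.literal_eval`. The helper below is a sketch with a hypothetical name (`build_squad_answer`) and a made-up review. It produces the same dict layout as the loop above.

```python
import ast


def build_squad_answer(review, indices):
    """Turn a '(start, end)' string into a SQuAD-style answer dict."""
    start, end = ast.literal_eval(indices)  # safely parses the tuple literal
    return {"text": [review[start:end]], "answer_start": [int(start)]}


# Made-up review in which the answer span starts at character 12.
review = "Great plot. this show is OUTSTANDING"
answer = build_squad_answer(review, "(12, 36)")
```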

In [24]:
# i still holds the last index from the loop above, so this prints the final training row
print(df_train.iloc[i].answers,df_train.iloc[i].human_ans_spans)

{'text': ['ANSWERNOTFOUND'], 'answer_start': [801]} ANSWERNOTFOUND


In [25]:
df_train.columns

Index(['question', 'human_ans_indices', 'review', 'human_ans_spans', 'id',
       'answers'],
      dtype='object')

In [26]:
df_train.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans', 'id',
       'answers']

df_test.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans','id',
       'answers']

In [27]:
val_dataset2 = datasets.Dataset.from_pandas(df_test)
train_dataset2 = datasets.Dataset.from_pandas(df_train)

In [28]:
# Apply the preprocessing function to the training set in batches with .map()
train_dataset = train_dataset2.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=train_dataset2.column_names,
)
# Compare the number of original examples with the number of tokenized features
len(train_dataset2), len(train_dataset)

Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

(2501, 4862)

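All 2501 examples were processed in 10 seconds, at a speed of 260.48 examples per second. The resulting dataset has 4862 features rather than 2501, because every review that does not fit into `max_length` tokens is split into several overlapping chunks.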

In [29]:
def preprocess_validation_examples(examples):
    # Cleaning the questions by stripping leading and trailing whitespace for consistency
    questions = [q.strip() for q in examples["question"]]

    # Tokenization; converting questions and contexts into numerical IDs, enabling the model to understand
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length, # Total length of the input sequence
        truncation="only_second", # If the total length exceeds max_length, only the context will be truncated
        stride=stride, # Overlap between the chunks
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Extracting overflow_to_sample_mapping
    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    # Looping over the tokenized inputs
    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx]) # Retrieving example IDs

        # Adjusting offset mapping based on sequence IDs
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    # Adding example IDs to the tokenized inputs
    inputs["example_id"] = example_ids
    return inputs
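
The offset masking step is easier to see on a toy feature. The sketch below uses a hypothetical helper, `mask_non_context_offsets`, and hand-written sequence ids and offsets. It keeps only the offsets of context tokens, so that post-processing can later skip any predicted position that falls in the question or on a special token.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep offsets of context tokens (sequence id 1) and replace the rest with None."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Toy feature: special token, one question token, special token,
# two context tokens, special token.
sequence_ids = [None, 0, None, 1, 1, None]
offsets = [(0, 0), (0, 3), (0, 0), (0, 4), (5, 9), (0, 0)]
masked = mask_non_context_offsets(offsets, sequence_ids)
```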

In [30]:
# Preprocess the validation dataset by applying the preprocess_validation_examples function to each example
validation_dataset = val_dataset2.map(
    preprocess_validation_examples,  # Function to preprocess each example
    batched=True,  # Process examples in batches for efficiency
    remove_columns=val_dataset2.column_names,  # Remove unnecessary columns from the dataset
)

# Calculate the length of the preprocessed validation dataset
len(validation_dataset)

Map:   0%|          | 0/582 [00:00<?, ? examples/s]

1104

In [31]:
validation_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 1104
})

In [32]:
tokenizer = AutoTokenizer.from_pretrained(model)

In [33]:
metric = evaluate.load("squad")
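
The `squad` metric reports exact match and token-level F1. The sketch below is a simplified re-implementation for intuition only, assuming the usual SQuAD normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); it is not the code that `evaluate` runs, and the function names are chosen for this example.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = f1_score("this show is OUTSTANDING", "the show is outstanding")
```

In this example three of the four predicted tokens match the three reference tokens, giving precision 3/4, recall 1, and an F1 of 6/7.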

In [34]:
def compute_metrics(start_logits, end_logits, features, examples):
    # Initialize a defaultdict to map example IDs to their corresponding feature indices
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    # List to store the formatted predicted answers
    predicted_answers = []

    # Number of top start/end candidates to consider, and the maximum answer length in tokens
    n_best = 20
    max_answer_length = 30

    # Process each example to generate predictions
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Iterate through all features linked to the current example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            # Determine top n_best start and end positions
            start_indexes = np.argsort(start_logit)[-1: -n_best - 1: -1].tolist()
            end_indexes = np.argsort(end_logit)[-1: -n_best - 1: -1].tolist()

            # Generate candidate answers based on top start/end positions
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Validate answer positions
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    # Formulate the answer and score
                    answer = {
                        "text": context[offsets[start_index][0]: offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Choose the best answer for the current example
        if answers:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    # Correctly format the references from the examples dataset
    references_corrected = []
    for ex in examples:
        # Split answers if needed and create the correct format
        individual_answers = [{'text': ans, 'answer_start': 0} for ans in ex['answers'].split('|')]
        references_corrected.append({
            'id': ex['id'],
            'answers': individual_answers
        })

    # Compute the evaluation metric using the formatted predictions and references
    return metric.compute(predictions=predicted_answers, references=references_corrected)
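
The candidate search at the heart of `compute_metrics` can be isolated as follows. This is a sketch on plain Python lists with hypothetical helper names (`top_indices`, `best_span`). It leaves out the offset lookup and the handling of several features per example.

```python
def top_indices(scores, n):
    """Indices of the n highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (start, end, score) of the best valid span, or None if there is none."""
    best = None
    for s in top_indices(start_logits, n_best):
        for e in top_indices(end_logits, n_best):
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


start_logits = [0.1, 5.0, 0.2, 0.3]
end_logits = [0.0, 0.1, 4.0, 6.0]
short = best_span(start_logits, end_logits, n_best=2, max_answer_length=2)
longer = best_span(start_logits, end_logits, n_best=2, max_answer_length=3)
```

With a maximum length of 2 tokens, the highest-scoring pair (tokens 1 to 3) is rejected as too long and the span from token 1 to token 2 wins. Raising the limit to 3 admits the higher-scoring span.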

In [35]:
model1 = AutoModelForQuestionAnswering.from_pretrained(model)

In [36]:
class DataFrameMetricsLogger(TrainerCallback):
    def __init__(self):
        # Initialize an empty list to store metrics dictionaries
        self.metrics_list = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # This method is called whenever logs are emitted by the Trainer
        if logs is not None:
            # Append the log metrics directly to the list
            self.metrics_list.append(logs.copy())  # Copy to ensure no overwriting

    def get_dataframe(self):
        # Convert the list of dictionaries to a DataFrame
        return pd.DataFrame(self.metrics_list)

In [37]:
args = TrainingArguments(
    output_dir="roberta-finetuned-subjqa-for-rag",
    evaluation_strategy="epoch",                    # Evaluate at the end of each epoch
    logging_strategy="steps",                       # Log every specified number of steps
    logging_steps=9,                                # Log every 9 steps
    logging_dir='./logs',                           # Directory where logs will be saved
    save_strategy="epoch",                          # Save the model at the end of each epoch
    learning_rate=2e-5,                             # Learning rate
    num_train_epochs=7,                             # Number of training epochs
    weight_decay=0.01,                              # Weight decay for regularization
    push_to_hub=False,                              # Whether to push the model to the Hugging Face Hub
    report_to="all",                                # Reporting to all integrations
    fp16=False                                      # Disable mixed precision training
)


In [38]:
# Initialize the DataFrame logger
dataframe_logger = DataFrameMetricsLogger()

In [39]:
# Setup the trainer with the logger
trainer = Trainer(
    model=model1,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    callbacks=[dataframe_logger]
)

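# DataLoaderConfiguration is not imported in the first cell, so import it here.
# Note that this object is created but never passed to the Trainer or an Accelerator.
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)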


In [40]:
# Baseline: score the pretrained checkpoint on the validation set before fine-tuning
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, val_dataset2)

  0%|          | 0/138 [00:00<?, ?it/s]

In [None]:
trainer.train()

In [None]:
# Convert logged metrics to DataFrame
metrics_df = dataframe_logger.get_dataframe()

In [None]:
metrics_df = metrics_df[['loss', 'grad_norm', 'learning_rate', 'epoch']]

In [None]:
metrics_df.head()

- ***loss:*** The training loss reported at each logging step (every 9 steps here, as set by `logging_steps`).
- ***grad_norm:*** The norm (magnitude) of the gradient vector. It is an important signal for the stability of training: high gradient norms can indicate problems such as exploding gradients.
- ***learning_rate:*** The learning rate at each logging step. This parameter controls the size of the steps the optimizer takes when updating the weights, so changes in it can significantly affect training dynamics.
- ***epoch:*** The progress of training in epochs, reported as a fractional value between epoch boundaries. One epoch is one complete pass through the entire training dataset.
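
As a rough guide to what the `learning_rate` column should look like, the sketch below assumes the Trainer's default linear schedule without warmup, since no scheduler or warmup option is set in the training arguments. `linear_decay_lr` and the step counts are hypothetical and only illustrate the shape of the curve.

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5):
    """Learning rate after `step` optimizer updates under linear decay to zero."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Hypothetical run of 100 optimizer steps, sampled every 25 steps.
schedule = [linear_decay_lr(step, total_steps=100) for step in range(0, 101, 25)]
```

Under this assumption the plotted learning rate is a straight line from `2e-5` at the first step down to zero at the last one.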

In [None]:
fig = px.line(metrics_df, x='epoch', y='loss', 
              title='Loss over Epochs', 
              labels={'epoch': 'Epoch', 'loss': 'Loss'},
              markers=True)  # Enable markers on the line

# Update layout for more customization
fig.update_layout(
    xaxis_title='Epoch',  # Title for the X-axis
    yaxis_title='Loss',   # Title for the Y-axis
    font=dict(family="Courier New, monospace", size=12, color="RebeccaPurple"),
    xaxis=dict(tickmode='auto', nticks=20),  # Auto mode for ticks, adjust as needed
    showlegend=True
)

# Show the plot
fig.show()


In [None]:
fig = px.line(metrics_df, x='epoch', y='grad_norm', 
              title='Trend of Gradient Norms Across Epochs', 
              labels={'epoch': 'Epoch', 'grad_norm': 'Gradient Norm'},
              markers=True)  # Enable markers on the line

# Update layout for more customization
fig.update_layout(
    xaxis_title='Epoch',  # Title for the X-axis
    yaxis_title='Gradient Norm',   # Title for the Y-axis
    font=dict(family="Courier New, monospace", size=12, color="RebeccaPurple"),
    xaxis=dict(tickmode='auto', nticks=20),  # Auto mode for ticks, adjust as needed
    showlegend=True
)

# Show the plot
fig.show()

- In a healthy run the gradient norm typically settles or decreases as the loss falls, although it does not have to decrease monotonically.
- Stability in the gradient norm across epochs, without sudden spikes, indicates consistent learning. If training is stable, the gradient norms should not fluctuate excessively.
- If the gradient norm becomes very small and stops changing, the model has either converged or is stuck on a plateau where it is no longer learning effectively.
- A very small gradient norm early in training, by contrast, may indicate "vanishing gradients": the gradients are so small that weight updates become negligible, which slows training down or stalls it.
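
For reference, the quantity being plotted can be computed by hand. The sketch below assumes that `grad_norm` is the global L2 norm over all parameter gradients; `global_grad_norm` and the toy gradient values are hypothetical.

```python
import math


def global_grad_norm(gradients):
    """L2 norm over all gradient values, treating every parameter's gradient as one flat vector."""
    return math.sqrt(sum(g * g for grad in gradients for g in grad))


# Two hypothetical parameter tensors, flattened to lists of gradient values.
grads = [[3.0, 0.0], [4.0]]
norm = global_grad_norm(grads)
```

Because every parameter contributes to the sum of squares, a spike in a single layer's gradients is enough to produce a visible spike in this curve.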

In [None]:
fig = px.line(metrics_df, x='epoch', y='learning_rate', 
              title='Trend of Learning Rate Across Epochs', 
              labels={'epoch': 'Epoch', 'learning_rate': 'Learning Rate'},
              markers=True)  # Enable markers on the line

# Update layout for more customization
fig.update_layout(
    xaxis_title='Epoch',  # Title for the X-axis
    yaxis_title='Learning Rate',   # Title for the Y-axis
    font=dict(family="Courier New, monospace", size=12, color="RebeccaPurple"),
    xaxis=dict(tickmode='auto', nticks=20),  # Auto mode for ticks, adjust as needed
    showlegend=True
)

# Show the plot
fig.show()

In [None]:
# Evaluate the model
evaluation_results = trainer.evaluate()

# Print the evaluation results
print(evaluation_results)

In [None]:
# Save the model
model_path = './results/Roberta-Squad2-Subjqa'
trainer.save_model(model_path)