# Introduction to the SCRQD Dataset

In this notebook, we explore the SCRQD dataset, comprising subjective comparative questions from the smartphone domain. Our focus is to analyze and extract meaningful insights from these questions, enhancing our understanding of subjective comparisons in Automated Question Answering systems.


The SCRQD Dataset comprises five key tables, each serving a distinct purpose in the analysis and processing of comparative questions:
1. **Questions Table**: Contains the subjective comparative questions.
2. **Relations Table**: Details the subjective comparative relations between various entities in the questions.
3. **Elements Table**: Focuses on extracting specific elements (Products, Aspects, and Constraints) from the questions.
4. **Entity Roles Table**: Categorizes the roles (subject/object) of different entities within the questions.
5. **Comparative Preferences Table**: Classifies the preferences expressed in the questions.

In the following sections, we will explore each of these tables in detail to understand their structure, content, and the role they play in analyzing subjective comparative questions.


## Data Loading
We begin by loading various tables essential for our analysis. These include Questions, Relations, Elements, Entity Roles, and Comparative Preference categorizations.


In [1]:
import numpy as np
import pickle
import re
import pandas as pd
from IPython.display import display, HTML
import random


# Each table is stored as a pickled dictionary.
# The `with` statement closes the file automatically, so no explicit close() is needed.
with open("/kaggle/input/scrqd-dataset/Questions.pkl", "rb") as input_file:
    QuestionDict = pickle.load(input_file)

with open("/kaggle/input/scrqd-dataset/EntityRoles.pkl", "rb") as input_file:
    EntityRoleIdentificaionDict = pickle.load(input_file)

with open("/kaggle/input/scrqd-dataset/Relations.pkl", "rb") as input_file:
    RelationDict = pickle.load(input_file)

with open("/kaggle/input/scrqd-dataset/Elements.pkl", "rb") as input_file:
    Product_Aspect_Contraint_dict = pickle.load(input_file)

with open("/kaggle/input/scrqd-dataset/ComparativePreferences.pkl", "rb") as input_file:
    CPCDict = pickle.load(input_file)
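
Before moving on, it can help to check how many entries each table holds and whether its keys line up with the Questions table. The following is a minimal sketch rather than part of the original notebook: `summarize_tables` is a hypothetical helper, and the toy dictionaries only stand in for the loaded objects.

```python
def summarize_tables(tables):
    """Map each table name to (number of keys, keys absent from the first table)."""
    names = list(tables)
    reference_keys = set(tables[names[0]])  # the first table acts as the reference
    summary = {}
    for name in names:
        keys = set(tables[name])
        summary[name] = (len(keys), len(keys - reference_keys))
    return summary


# Toy stand-ins for the loaded dictionaries (illustrative values only)
toy_questions = {1: "Which is better , A or B ?", 2: "Is A worse than B ?"}
toy_relations = {1: [["A", "features", "B", "XorB", ""]], 3: [[]]}

summary = summarize_tables({"Questions": toy_questions, "Relations": toy_relations})
print(summary)
```

In the notebook itself, the same call could be made with `{"Questions": QuestionDict, "Relations": RelationDict, ...}`, assuming every table is a dictionary keyed by question ID.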
    



## Sample Questions 

In this section, we'll explore a selection of questions from the SCRQD dataset. This dataset consists of unique IDs and their corresponding subjective comparative questions, providing a glimpse into the nature and variety of queries within the smartphone domain. Let's take a look at some sample questions to understand the dataset's content better.


In [2]:
import pandas as pd
from IPython.display import display, HTML
import random


# Sample 5 random keys (question IDs) from the dictionary
sampled_keys = random.sample(list(QuestionDict.keys()), 5)

# Prepare data for DataFrame
data_for_df = [{'ID': key, 'Question': QuestionDict[key]} for key in sampled_keys]

# Create DataFrame
df_questions = pd.DataFrame(data_for_df)

# Display the DataFrame
display(HTML(df_questions.to_html(index=False)))


ID,Question
522,Is the Samsung S 10 a better value proposition than the iPhone XR ?
243,Is the iPhone 11 an excellent smartphone compared to iPhone 10 ?
657,Is the Google Pixel 3 XL better than Samsung Galaxy Note 9 ?
143,Why should I buy Samsung phones and not Xiaomi phones which offer much greater specs in the price range of Rs 15000 ?
308,What is the worst phone ever created ?


In [3]:
###################################################################################################

## Relation Extraction 

Expanding upon the first table, we'll now delve into the Relations table, which provides a structured representation of questions and their associated comparative relations. Each question is identified by a unique key, and the corresponding subjective comparative relations, if any, are represented as quintuples encompassing the subject entity, object entity, compared aspect, preference category, and constraint.

Let's look at a few sampled entries from the Relations table to understand its structure. The cell below selects questions that carry more than one relation and displays each relation with the following columns: Subject Entity, Compared Aspect, Object Entity, Preference Category, and Constraint.

In [4]:
import pandas as pd
from IPython.display import display, HTML
import random


# Function to find keys where the value has more than one non-empty list
def find_keys_with_multiple_non_empty_lists(data):
    keys_with_multiple_non_empty = []
    for key, value in data.items():
        if len([item for item in value if item]) > 1:
            keys_with_multiple_non_empty.append(key)
    return keys_with_multiple_non_empty

# Find keys with more than one non-empty list
keys_with_multiple_non_empty = find_keys_with_multiple_non_empty_lists(RelationDict)

# Display function
def display_sampled_keys_df(keys, category_name):
    print(f"--- {category_name} ---")
    keys = random.sample(keys, min(len(keys), 3))  # Randomly select up to 3 keys
    for key in keys:
        values = RelationDict[key]
        question = QuestionDict.get(key, 'not found')
        for value in values:
            if not value:  # Skip empty lists
                continue
            example_data = {
                'ID': key,
                'Question': question,
                'Subject Entity': value[0],
                'Compared Aspect': value[1],
                'Object Entity': value[2],
                'Preference Category': value[3],
                'Constraint': value[4]
            }
            df_example = pd.DataFrame([example_data])
            for column in df_example.columns:
                df_example[column] = df_example[column].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
            display(HTML(df_example.to_html(index=False)))

# Call the display function
display_sampled_keys_df(keys_with_multiple_non_empty, "Keys with Multiple Non-Empty Values")


--- Keys with Multiple Non-Empty Values ---


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
1050,"Which do you suppose has the most wretched body design and performance , Samsung Galaxy J 710 or Samsung Galaxy A 12 ?","Samsung Galaxy J710, Samsung Galaxy A12",body design,All,XorSW,


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
1050,"Which do you suppose has the most wretched body design and performance , Samsung Galaxy J 710 or Samsung Galaxy A 12 ?","Samsung Galaxy J710, Samsung Galaxy A12",performance,All,XorSW,


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
1048,"Which do you believe has the poorest selfie camera and network connectivity , OnePlus 6 or Samsung Galaxy S 9 + ?","OnePlus 6, Samsung Galaxy S9+",selfie camera,All,XorSW,


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
1048,"Which do you believe has the poorest selfie camera and network connectivity , OnePlus 6 or Samsung Galaxy S 9 + ?","OnePlus 6, Samsung Galaxy S9+",network connectivity,All,XorSW,


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
261,I want to buy a phone like an Apple iPhone 6 or a Samsung Galaxy S 6 Edge . Which one would be better in the sense of having a great battery life and seamless performance ?,"iPhone 6, Samsung Galaxy S6 Edge",battery life,All,XorSB,


ID,Question,Subject Entity,Compared Aspect,Object Entity,Preference Category,Constraint
261,I want to buy a phone like an Apple iPhone 6 or a Samsung Galaxy S 6 Edge . Which one would be better in the sense of having a great battery life and seamless performance ?,"iPhone 6, Samsung Galaxy S6 Edge",performance,All,XorB,


As can be seen, the Relations table consists of subjective comparative relations. Each displayed row combines the question's ID and text with one relation and contains the following fields:

- **ID**: A unique identifier for the question.
- **Question**: The subjective comparative question text.
- **Subject Entity**: The main entity that the question is about.
- **Compared Aspect**: The specific features or aspects being compared.
- **Object Entity**: Other entities that are being compared with the subject.
- **Preference Category**: The type of preference expressed in the question.
- **Constraint**: Any constraints or conditions applied to the comparative question.
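
To make these fields easier to work with, a relation can be wrapped in a named structure. The sketch below is an illustration under assumptions: `ComparativeRelation` and `parse_relations` are hypothetical names, the five-item order (subject, aspect, object, preference, constraint) follows the indexing used in the display cell above, and the toy entry is made up.

```python
from typing import NamedTuple


class ComparativeRelation(NamedTuple):
    """One relation quintuple, in the order used by the display cell."""
    subject_entity: object
    compared_aspect: object
    object_entity: object
    preference_category: object
    constraint: object


def parse_relations(raw_relations):
    """Convert raw five-item lists into named tuples, skipping empty entries."""
    return [ComparativeRelation(*item) for item in raw_relations if item]


# Toy entry shaped like a value of RelationDict (illustrative only)
toy_entry = [
    [["Phone A", "Phone B"], "battery life", "All", "XorSB", ""],
    [],
    [["Phone A", "Phone B"], "performance", "All", "XorB", ""],
]

relations = parse_relations(toy_entry)
for relation in relations:
    print(relation.compared_aspect, "->", relation.preference_category)
```

Named fields avoid positional indexing such as `value[3]`, which is easy to misread when the column order in a table differs from the storage order.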


In [5]:
###################################################################################################

## Element Extraction
In this section, we will explore the Elements table, focusing on the process of Element Extraction. This process is critical for understanding subjective comparative questions, as it involves identifying their key elements: Products, Aspects, and Constraints. Understanding these elements is fundamental to analyzing the table's content and structure.

Our dataset analysis involves:
- **Product Extraction**: Identifying the main entities or items being discussed or compared in the questions.
- **Aspect Extraction**: Pinpointing specific features or characteristics of the products that are the focus of comparison.
- **Constraint Extraction**: Recognizing conditions or limitations that frame the nature of the comparison.

To achieve a detailed and nuanced extraction of these elements, we utilize an Enhanced BIO Labeling Scheme. This scheme is crucial for accurately tagging and categorizing each part of the questions.

### Enhanced BIO Labeling Scheme

To effectively extract the subjective comparative relation quintuples from the dataset, we enhance the traditional BIO tagging scheme with specialized labels for each category of interest: Entities, Aspects, and Constraints. The scheme uses a set of labels defined as follows:

- `B-E`: Beginning of an Entity
- `I-E`: Inside an Entity
- `O-E`: Outside an Entity
- `B-A`: Beginning of an Aspect
- `I-A`: Inside an Aspect
- `O-A`: Outside an Aspect
- `B-C`: Beginning of a Constraint
- `I-C`: Inside a Constraint
- `O-C`: Outside a Constraint

This allows us to more precisely annotate our dataset for training machine learning models. Let's delve into the specifics of our dataset and see how these elements manifest in the subjective comparative questions.

Below is an illustration of this scheme:

In [6]:
import pandas as pd
from IPython.display import display, HTML

# Look up the question with key 1 and its element labels
key = 1

question = QuestionDict.get(key, 'not found')
values = Product_Aspect_Contraint_dict.get(key, 'not found')


# Function to prepare and display the data in a table format
def display_enhanced_bio_table(question, values):
    tokens = question.split()
    # Each label sequence is stored as a space-separated string of B/I/O tags
    entity_labels = values[0][0].split()
    aspect_labels = values[1][0].split()
    constraint_labels = values[2][0].split()

    # Check if the lengths match
    if not (len(tokens) == len(entity_labels) == len(aspect_labels) == len(constraint_labels)):
        raise ValueError("The number of tokens and labels must be the same")

    # Create a DataFrame, appending the category suffix to each tag
    df_example = pd.DataFrame({
        'Token': tokens,
        'Product': ['B-P' if label == 'B' else ('I-P' if label == 'I' else 'O-P') for label in entity_labels],
        'Aspect': ['B-A' if label == 'B' else ('I-A' if label == 'I' else 'O-A') for label in aspect_labels],
        'Constraint': ['B-C' if label == 'B' else ('I-C' if label == 'I' else 'O-C') for label in constraint_labels],
    })

    # Display the DataFrame as an HTML table
    display(HTML(df_example.to_html(index=False)))

# Call the function with the sample data
display_enhanced_bio_table(question, values)


Token,Product,Aspect,Constraint
What,O-P,O-A,O-C
are,O-P,O-A,O-C
the,O-P,O-A,O-C
best,O-P,O-A,O-C
smartphones,O-P,O-A,O-C
with,O-P,O-A,O-C
a,O-P,O-A,O-C
built,O-P,O-A,O-C
in,O-P,O-A,O-C
stylus,O-P,O-A,O-C


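The table above shows an example sentence annotated with our enhanced BIO scheme. Each token (word) from the sentence is classified according to whether it signifies the beginning (B), inside (I), or outside (O) of the three categories: Entity (E), Aspect (A), and Constraint (C). In the displayed table, the entity labels appear in the Product column with the suffix `P` (`B-P`, `I-P`, `O-P`). In this particular example, every token is labeled as outside all three categories.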
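
Once a question is tagged, the labeled spans can be recovered by grouping each `B` token with the `I` tokens that follow it. The following is a minimal sketch, assuming one bare `B`/`I`/`O` tag per token as in the label strings split by the cell above; `extract_spans` is a hypothetical helper and the tag sequence is invented for illustration, not taken from the dataset.

```python
def extract_spans(tokens, labels):
    """Collect token spans marked by B (begin) and I (inside) labels."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new span starts; close any span that is still open
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # An O label (or an I with no open span) ends the current span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Illustrative tags for one category (entities); not actual dataset labels
tokens = "Is the iPhone 11 better than the iPhone 7 ?".split()
labels = "O O B I O O O B I O".split()
print(extract_spans(tokens, labels))
```

The same function can be applied separately to the entity, aspect, and constraint label sequences, since each category carries its own tag sequence.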


In [7]:
###################################################################################################

## Entity Role Identification (ERI) 

### Overview of the ERI Task

We define the Entity Role Identification (ERI) task as a crucial process in natural language processing that involves identifying and classifying the roles of entities within a given text. In the context of subjective comparative questions, this task focuses on discerning the specific roles that entities play—such as being the subject of comparison, the object of comparison, or other relevant roles.


Let's now demonstrate the application of the ERI task on a sample of our dataset and then explore the structure and content of the Entity Role table.


In [8]:
import pandas as pd
from IPython.display import display, HTML
import random

# Function to format the display of lists within the DataFrame
def format_list_for_display(series):
    return series.apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Function to get keys whose value has at least three items
def find_keys_with_value_length_greater_than_three(data):
    keys_with_more_than_three = [key for key, value in data.items() if len(value) >= 3]
    return keys_with_more_than_three

# Get the keys whose values have at least three items
keys_with_values_longer_than_three = find_keys_with_value_length_greater_than_three(EntityRoleIdentificaionDict)

# Randomly select up to 3 keys from the list
keys_to_display = random.sample(keys_with_values_longer_than_three, min(len(keys_with_values_longer_than_three), 3))

# Function to display the DataFrame in HTML format for the given keys and their values
def display_sampled_keys_df(keys_with_values, category_name):
    print(f"--- {category_name} ---")
    
    for key in keys_with_values:
        values = EntityRoleIdentificaionDict[key]
        question = QuestionDict.get(key, 'not found')
        # Assuming the values are structured as a list of lists
        for value in values:
            example_data = {
                'ID': key,
                'Question': question,  # Question text looked up from QuestionDict
                'Pseudo-Sentence': value[0],
                'Entity Role': value[1]
            }

            df_example = pd.DataFrame([example_data])  # Create a DataFrame from a list of dicts

            # Apply formatting function to the DataFrame
            for column in df_example.columns:
                if isinstance(df_example[column].iloc[0], list):
                    df_example[column] = format_list_for_display(df_example[column])

            # Display the DataFrame in HTML
            display(HTML(df_example.to_html(index=False)))

# Call the function to display the sampled keys
display_sampled_keys_df(keys_to_display, "Selected Keys with Values Longer Than Three")


--- Selected Keys with Values Longer Than Three ---


ID,Question,Pseudo-Sentence,Entity Role
1015,Why do Asus Zenfone 8 not have a stunning interface and appearance compared to Google Pixel 6 ?,Asus Zenfone 8 - interface,1


ID,Question,Pseudo-Sentence,Entity Role
1015,Why do Asus Zenfone 8 not have a stunning interface and appearance compared to Google Pixel 6 ?,Google Pixel 6 - interface,2


ID,Question,Pseudo-Sentence,Entity Role
1015,Why do Asus Zenfone 8 not have a stunning interface and appearance compared to Google Pixel 6 ?,Asus Zenfone 8 - appearance,1


ID,Question,Pseudo-Sentence,Entity Role
1015,Why do Asus Zenfone 8 not have a stunning interface and appearance compared to Google Pixel 6 ?,Google Pixel 6 - appearance,2


ID,Question,Pseudo-Sentence,Entity Role
524,"Would you choose Motorola , LG , or Samsung ?",Motorola - features,1


ID,Question,Pseudo-Sentence,Entity Role
524,"Would you choose Motorola , LG , or Samsung ?",LG - features,1


ID,Question,Pseudo-Sentence,Entity Role
524,"Would you choose Motorola , LG , or Samsung ?",Samsung - features,1


ID,Question,Pseudo-Sentence,Entity Role
21,Which phones are the best alternatives for a Nokia 6 among Xiaomi Redmi Note 10 T and Samsung Galaxy M 32 ?,Nokia 6 - features,1


ID,Question,Pseudo-Sentence,Entity Role
21,Which phones are the best alternatives for a Nokia 6 among Xiaomi Redmi Note 10 T and Samsung Galaxy M 32 ?,Xiaomi Redmi Note 10T - features,2


ID,Question,Pseudo-Sentence,Entity Role
21,Which phones are the best alternatives for a Nokia 6 among Xiaomi Redmi Note 10 T and Samsung Galaxy M 32 ?,Samsung Galaxy M32 - features,2


### Utilizing Sentence-Pair Classification for ERI

In the Entity Role Identification (ERI) process within the SCRQD dataset, we employ the sentence-pair classification method, an essential technique in Natural Language Inference (NLI) tasks. This method is particularly effective for analyzing the structure and intent of subjective comparative questions.

#### Utilizing NLI-M for Pseudo-Sentence Generation

To prepare our data for the ERI task using sentence-pair classification, we transform the original comparative questions into pseudo-sentences, adhering to a specific method:

- **Combining Entities with Aspects**: In each question, we identify and pair each mentioned entity with an aspect. If no specific aspect is mentioned, we use "features" as a placeholder. This is essential for maintaining the focus of the comparison within the question.
- **Use of a Hyphen ("-")**: We employ a hyphen ("-") to separate each entity-aspect pair, creating clear and structured pseudo-sentences. This formatting not only simplifies the original text but also preserves the core elements of comparison, enhancing its suitability for computational analysis.
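
The pairing rule above can be sketched in a few lines. This is an illustration only: `build_eri_pseudo_sentences` is a hypothetical helper, and the spacing around the hyphen and the aspect-by-aspect ordering are assumptions modeled on the sample output shown earlier.

```python
def build_eri_pseudo_sentences(entities, aspects=None, placeholder="features"):
    """Pair every entity with every aspect as 'entity - aspect' strings."""
    # Fall back to the placeholder when the question names no aspect
    aspects = aspects or [placeholder]
    return [f"{entity} - {aspect}" for aspect in aspects for entity in entities]


# A question with two explicit aspects
print(build_eri_pseudo_sentences(
    ["Asus Zenfone 8", "Google Pixel 6"], ["interface", "appearance"]
))

# A question with no explicit aspect
print(build_eri_pseudo_sentences(["Motorola", "LG", "Samsung"]))
```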

#### Pairing with Labels

Each pseudo-sentence is subsequently paired with a label that signifies the role of entities within the question:
  - Label "1" for the "Subject" role.
  - Label "2" for the "Object" role.
  - Label "0" for "None", indicating either no specific role or a neutral aspect in the question's context.

These labels are pivotal in guiding our classification model to accurately identify and categorize the roles of entities in the questions.
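
For inspection, the numeric labels can be mapped back to role names. The sketch below assumes each entry is a (pseudo-sentence, label) pair as displayed earlier and that labels may be stored as integers or numeric strings; `ROLE_NAMES` and `group_entities_by_role` are hypothetical names.

```python
ROLE_NAMES = {0: "None", 1: "Subject", 2: "Object"}


def group_entities_by_role(labelled_pairs):
    """Group pseudo-sentences under the role name given by their label."""
    grouped = {name: [] for name in ROLE_NAMES.values()}
    for pseudo_sentence, label in labelled_pairs:
        # int() accepts both 1 and "1", whichever way the label is stored
        grouped[ROLE_NAMES[int(label)]].append(pseudo_sentence)
    return grouped


# Pairs copied from the sample output for question 21
pairs = [
    ("Nokia 6 - features", 1),
    ("Xiaomi Redmi Note 10T - features", 2),
    ("Samsung Galaxy M32 - features", 2),
]
grouped = group_entities_by_role(pairs)
print(grouped)
```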


In [9]:
###################################################################################################

## Comparative Preference Classification (CPC)

### Overview of the CPC Task
The Comparative Preference Classification (CPC) task is a pivotal component of our study in understanding subjective comparative questions. This task involves classifying the nature of preferences expressed in comparative questions, such as determining whether a subjective comparison implies a preference for one entity over another.

Next, we will explore the Comparative Preferences table, examining sample entries and their classifications to better understand how preferences are articulated and categorized in our dataset.

In [10]:


# Sample 10 keys from the dictionary
random_keys = random.sample(list(CPCDict.keys()), 10)

# Build the example_data structure based on the sampled keys
example_data = {
    'ID': [],
    'Question': [],
    'Pseudo-Sentence': [],
    'Preference Type': []

}

# Loop through the sampled keys to retrieve entries
for key in random_keys:
    question = QuestionDict.get(key, 'not found')
    entry = CPCDict[key]
    example_data['ID'].append(key)
    example_data['Question'].append(question)
    example_data['Pseudo-Sentence'].append(entry[0][0])
    example_data['Preference Type'].append(entry[0][1])


# Create a DataFrame
df_examples = pd.DataFrame(example_data)

# Function to format the display of lists within the DataFrame
def format_list_for_display(series):
    return series.apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Apply formatting function to the DataFrame
df_examples['Question'] = format_list_for_display(df_examples['Question'])
df_examples['Pseudo-Sentence'] = format_list_for_display(df_examples['Pseudo-Sentence'])
df_examples['Preference Type'] = format_list_for_display(df_examples['Preference Type'])


# Display the DataFrame
display(HTML(df_examples.to_html(index=False)))


ID,Question,Pseudo-Sentence,Preference Type
983,"Which smartphone will I not regret buying , LeEco Le 2 or Sony Xperia XA ?",LeEco Le2 and Sony Xperia XA versus All,B
428,"Which is better , Samsung Galaxy J 5 or Samsung Galaxy J 7 ?",Samsung Galaxy J5 versus Samsung Galaxy J7,XorB
1086,Which one is worse among the Xiaomi Mi 4 or the Asus Zenfone 2015 ?,Xiaomi Mi 4 versus Asus Zenfone 2015,XorW
568,Which mobile phone is better in terms of beautiful appearance and good camera in between Samsung Galaxy S 20 and S 21 ?,Samsung Galaxy S20 appearance versus S21 appearance,XorB
303,Which features of Samsung M 31 are more exceptional than vivo iQOO Z 3 ?,Samsung M31 features versus iQOO Z3 features,SB
137,Why do Nokia phones not have an attractive UI compared to Samsung phones ?,Nokia phones UI versus Samsung phones UI,W
641,"Which phone is better , the Xiaomi Mi 4 or the Asus Zenfone 2015 4 GB version ?",Xiaomi Mi 4 versus Asus Zenfone 2015 4GB version,XorB
1206,"Lumia 920 vs . iPhone 5 , What is the equivalent to the HTC One X in terms of camera quality ?",Lumia 920 camera quality and iPhone 5 camera quality versus HTC One X camera quality,XorE
501,"Which has the better display screen , iPhone 6 s or iPhone SE ?",iPhone 6s display screen versus iPhone SE display screen,XorB
124,Is the iPhone 11 that much superior to the iPhone 7 ?,iPhone 11 versus iPhone 7,SB


### Utilizing NLI-M for Pseudo-Sentence Generation
To facilitate the CPC task, we adopt Natural Language Inference with Multiple Output (NLI-M). This approach involves generating pseudo-sentences that represent the core comparison in each question. 

- **Pseudo-Sentence Formation**: We create pseudo-sentences by pairing entities and aspects mentioned in the question. For instance, a comparison between two products based on a specific aspect is represented as "(entity i-aspect j versus entity z-aspect k)". In cases where a question lacks explicit mention of an entity or aspect, we use placeholders:
  - "X" is used when an entity is not specified.
  - "All" is used for unspecified aspects.
- **Example**: Given a question comparing the camera quality of 'iPhone 10' and 'iPhone XS', the corresponding pseudo-sentence would be "iPhone 10-camera versus iPhone XS-camera".

This method allows us to convert complex comparative questions into a format that is more readily analyzable by our classification models.
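
The formation rule can be sketched as a small string builder. This is a hedged illustration, not the dataset's generation code: `build_cpc_pseudo_sentence` is a hypothetical helper, the default hyphen separator follows the example in the text, and the `separator` parameter exists because the sample rows above join entity and aspect with a space instead. Placeholder handling ("X", "All") is deliberately left out.

```python
def build_cpc_pseudo_sentence(subject, obj, subject_aspect=None,
                              object_aspect=None, separator="-"):
    """Join two entity(-aspect) sides with the keyword 'versus'."""
    def side(entity, aspect):
        # Leave the entity on its own when no aspect is supplied
        return f"{entity}{separator}{aspect}" if aspect else entity

    return f"{side(subject, subject_aspect)} versus {side(obj, object_aspect)}"


# Hyphenated form, as in the example in the text
print(build_cpc_pseudo_sentence("iPhone 10", "iPhone XS", "camera", "camera"))

# Space-separated form, as in the sample rows
print(build_cpc_pseudo_sentence("iPhone 6s", "iPhone SE",
                                "display screen", "display screen", separator=" "))

# No aspect on either side
print(build_cpc_pseudo_sentence("iPhone 11", "iPhone 7"))
```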


### Comparative Preference Categories 
We have outlined 14 potential preference categories for subjective comparative questions: `B`, `SB`, `W`, `SW`, `E`, `XOR-B`, `XOR-SB`, `XOR-E`, `XOR-W`, `XOR-SW`, `X`, `X-SB`, `X-SW`, `Non-Grad`. Consequently, the output label for the CPC task will fall into one of these 14 classifications. The table below expands each abbreviation:


| Preference Type | Description      |
|-----------------|------------------|
| B               | Better           |
| SB              | Strong Better    |
| E               | Equal            |
| W               | Worse            |
| SW              | Strong Worse     |
| XOR-B           | XOR-Better       |
| XOR-SB          | XOR-Strong Better|
| XOR-E           | XOR-Equal        |
| XOR-W           | XOR-Worse        |
| XOR-SW          | XOR-Strong Worse |
| X-SB            | X-Strong Better  |
| X               | X                |
| X-SW            | X-Strong Worse   |
| Non-Grad        | Non-Gradable     |


In [11]:
###################################################################################################

## Conclusion

This notebook has provided a basic introduction to the SCRQD dataset. Users are encouraged to perform further analysis and explore the dataset in more depth. Feedback, questions, and contributions are welcome via the GitHub repository's Issues and Pull Requests.
