
# Introduction to Element Extraction using Multi-Task Learning with Adapter

In the realm of Natural Language Processing (NLP), understanding and extracting meaningful elements from textual data is paramount. This notebook explores the complexities of Element Extraction (EE), an essential process that identifies and classifies entities, aspects, and constraints within texts. We adopt a Multi-Task Learning (MTL) framework, utilizing the BIO tagging scheme for systematic annotation and extraction of these elements.

Our approach integrates the advanced capabilities of RoBERTa with the strategic addition of a pre-trained "AdapterHub/roberta-base-pf-qnli" adapter. This adapter, specifically tailored for question-answering tasks, enhances RoBERTa's ability to discern nuanced information within questions, making it particularly effective for the EE task in comparative analysis contexts.

The MTL framework employs a robust architecture that not only leverages RoBERTa's transformer-based model but also fine-tunes it with the QNLI adapter to improve task-specific performance. Through a series of Python code snippets, we will detail the implementation of this enhanced model, providing insights and explanations at each step. This exploration aims to illuminate the technical underpinnings of Element Extraction while showcasing the practical applications and benefits of integrating specialized adapters in NLP tasks.


## Data Acquisition for Element Extraction

Before diving into the intricacies of our Element Extraction model, the first crucial step is to acquire the relevant dataset. In the following code snippet, we utilize the `requests` library to download a pre-compiled dataset from a specified URL. Upon successful download, we proceed to unzip the dataset using the `zipfile` library, preparing our data for the upcoming processing stages. This initial setup ensures we have the necessary data in an accessible format, paving the way for the element extraction tasks.

In [2]:
import requests
import zipfile
import os

data_url = "https://github.com/mahsamb/SCRQD/raw/main/Dataset.zip"
zip_filename = "Dataset.zip"

# Downloading using requests
response = requests.get(data_url)

# Check if the request was successful (status_code 200)
if response.status_code == 200:
    with open(zip_filename, "wb") as f:
        f.write(response.content)
else:
    print(f"Failed to retrieve the data: {response.status_code}: {response.text}")
    # Add additional error handling here

# Unzipping the dataset
try:
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall("data")
    print("Files extracted:")
    print(os.listdir("data"))
except zipfile.BadZipFile:
    print("Error: The file doesn’t appear to be a valid zip file")


Files extracted:
['Relations.pkl', 'EntityRoles.pkl', 'ComparativePreferences.pkl', 'Elements.pkl', 'Questions.pkl']


## Preparing the Environment and Loading Data

To ensure our Element Extraction model functions effectively, we start by setting up our environment with essential libraries such as `numpy` and `pandas` for data manipulation, and `re` for regular expressions, which are critical for processing text data. Additionally, we use `pickle` for loading our

In [3]:
import numpy as np
import pickle as cPickle
import pickle
import re
import pandas as pd
from IPython.display import display, HTML
import random


with open(r"/kaggle/working/data/Questions.pkl", "rb") as input_file:
    QuestionDict = pickle.load(input_file)
    input_file.close()

with open(r"/kaggle/working/data/Elements.pkl", "rb") as input_file:
    Product_Aspect_Contraint_dict = pickle.load(input_file)
    input_file.close()

In [4]:
max1=100

## Initial Exploration of Loaded Datasets

With our datasets loaded into Python dictionaries, it's essential to begin with an initial exploration to understand the structure and type of data we'll be working with. In the code snippets provided, we iterate through both `Product_Aspect_Contraint_dict` and `QuestionDict`, printing the first key-value pair of each to get a glimpse into the data. This preliminary step is crucial for ensuring the integrity of our data and to familiarize ourselves with the dataset's format, which will inform our strategy for the Element Extraction process.

In [5]:
for key, value in Product_Aspect_Contraint_dict.items():
  print(key)
  print(value)
  break


for key, value in QuestionDict.items():
  print(key)
  print(value)
  break

1
[['O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P B-P O-P'], ['O-A O-A O-A O-A O-A O-A O-A B-A I-A I-A I-A O-A O-A O-A B-A I-A O-A B-A O-A O-A O-A O-A O-A'], ['O-C O-C O-C O-C B-C B-C I-C I-C I-C I-C I-C O-C O-C O-C O-C O-C O-C O-C O-C B-C I-C I-C O-C']]
1
What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?


## Constructing the Data Structure for Model Input

After our initial exploration, the next step involves structuring our data into a format suitable for our Element Extraction model. The provided code snippet accomplishes this by iterating over the `QuestionDict` and corresponding `Product_Aspect_Contraint_dict` to create a comprehensive list of dictionaries, each containing a question and its associated entity, aspect, and constraint labels. This structured approach facilitates the easy manipulation and analysis of our dataset, preparing it for the model training phase. By aligning our questions with their respective labels in a unified data structure, we ensure a smooth transition into the model's training and evaluation stages.

In [6]:
data = []

for key, value in QuestionDict.items():
    question_text = value  # The question text from QuestionDict
    label_info = Product_Aspect_Contraint_dict[key]  # The corresponding labels from Product_Aspect_Constraint_dict

    # Extract label lists directly from label_info without trying to split them
    entity_labels, aspect_labels, constraint_labels = label_info

    # Since label_info items are already lists, we can directly use them
    # Adjust the structure as needed based on your exact format

    data_entry = {
        "text": question_text,
        "entity_labels": entity_labels,  # No need to wrap in another list or call split
        "aspect_labels": aspect_labels,
        "constraint_labels": constraint_labels
    }

    data.append(data_entry)


## Previewing Structured Data Entries

To verify the integrity and structure of our newly constructed data entries, we print the first item from our prepared data list. This step is crucial for ensuring that our data is correctly formatted and contains all necessary information for the Element Extraction tasks. By examining this sample entry, we can confirm the successful preparation of our data, setting the stage for the subsequent model training and evaluation processes.

In [7]:
for item in data:
    print(item)
    #print(item.type)
    break

{'text': 'What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?', 'entity_labels': ['O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P O-P B-P O-P'], 'aspect_labels': ['O-A O-A O-A O-A O-A O-A O-A B-A I-A I-A I-A O-A O-A O-A B-A I-A O-A B-A O-A O-A O-A O-A O-A'], 'constraint_labels': ['O-C O-C O-C O-C B-C B-C I-C I-C I-C I-C I-C O-C O-C O-C O-C O-C O-C O-C O-C B-C I-C I-C O-C']}


## Refining Label Formats for Analysis

Before we proceed with training our Element Extraction model, it's imperative to refine the format of our labels to ensure compatibility with our processing algorithms. This code snippet updates each data entry by splitting the label strings into lists of individual labels. This step transforms our previously structured label information into a more granular and model-friendly format, enabling precise tagging and classification during the training phase. By adjusting the label format, we enhance the model's ability to learn from and accurately predict the elements within our dataset.

In [8]:
# Iterate through each dictionary in the list
for item in data:
    # Splitting the labels based on space ' ' and updating the values
    item['entity_labels'] = item['entity_labels'][0].split()
    item['aspect_labels'] = item['aspect_labels'][0].split()
    item['constraint_labels'] = item['constraint_labels'][0].split()

#print(data)
for item in data:
    print(item)
    break

{'text': 'What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?', 'entity_labels': ['O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'O-P', 'B-P', 'O-P'], 'aspect_labels': ['O-A', 'O-A', 'O-A', 'O-A', 'O-A', 'O-A', 'O-A', 'B-A', 'I-A', 'I-A', 'I-A', 'O-A', 'O-A', 'O-A', 'B-A', 'I-A', 'O-A', 'B-A', 'O-A', 'O-A', 'O-A', 'O-A', 'O-A'], 'constraint_labels': ['O-C', 'O-C', 'O-C', 'O-C', 'B-C', 'B-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'O-C', 'O-C', 'O-C', 'O-C', 'O-C', 'O-C', 'O-C', 'O-C', 'B-C', 'I-C', 'I-C', 'O-C']}


## Applying Label Encoding for Machine Learning

The final preprocessing step involves converting our textual labels into numerical representations, a process known as label encoding. This transformation is crucial for machine learning models, which require numerical input. Using a predefined mapping dictionary, we replace the textual labels in our dataset with their corresponding numerical codes. This encoding not only simplifies the model's input but also ensures that the training process is optimized for efficiency and accuracy. By examining a sample entry after this transformation, we can confirm the readiness of our data for the upcoming machine learning tasks.

In [9]:
# Mapping dictionary
label_map = {"O-P": 0, "B-P": 1, "I-P": 2, "O-A": 0, "B-A": 1, "I-A": 2, "O-C": 0, "B-C": 1, "I-C": 2}

# Iterate through each dictionary in the list
for item in data:
    # Replace entity_labels
    item['entity_labels'] = [label_map[label] for label in item['entity_labels']]
    # Replace aspect_labels
    item['aspect_labels'] = [label_map[label] for label in item['aspect_labels']]
    # Replace constraint_labels
    item['constraint_labels'] = [label_map[label] for label in item['constraint_labels']]

    #print(data)
for item in data:
    print(item)
    break

{'text': 'What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?', 'entity_labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], 'aspect_labels': [0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0], 'constraint_labels': [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0]}


## Displaying Processed Data and Labels

To confirm that our data is correctly processed and ready for the next steps, we display the text, entity labels, aspect labels, and constraint labels of the first entry in our prepared dataset. This visualization allows us to ensure that the textual data aligns with its corresponding numerical labels, indicating successful label encoding and data structuring. It's a vital checkpoint to verify the data's readiness for model training and analysis.

In [10]:
print(data[0]['text'])
print(data[0]['entity_labels'])
print(data[0]['aspect_labels'])
print(data[0]['constraint_labels'])

What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0]


## Organizing Data for Easy Access

As we move forward, organizing our data in a manner that facilitates easy access and manipulation becomes crucial. By categorizing the processed data into separate dictionaries for entities, aspects, and constraints, we enhance our ability to swiftly retrieve and analyze specific segments of our dataset. This organization not only aids in data management but also in performing focused analyses on different elements of the text. Displaying the structured data for the first entry further solidifies our understanding of the dataset's organization and prepares us for detailed element extraction tasks.

In [11]:
# Assuming "data" is a list of dictionaries, each structured like the example you provided

# Initialize three empty dictionaries
entity_dict, aspect_dict, constraint_dict = {}, {}, {}

# Iterate through each item in the data list
for item in data:
    # Extract and assign the relevant information to each dictionary
    entity_dict[item['text']] = item['entity_labels']
    aspect_dict[item['text']] = item['aspect_labels']
    constraint_dict[item['text']] = item['constraint_labels']

# Example to show the result for the first item in the list
print("Text:", data[0]['text'])
print("Entity Labels:", entity_dict[data[0]['text']])
print("Aspect Labels:", aspect_dict[data[0]['text']])
print("Constraint Labels:", constraint_dict[data[0]['text']])


Text: What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
Entity Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
Aspect Labels: [0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0]
Constraint Labels: [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0]


## Quick Access to Data Entries

To demonstrate the utility of our organized data structure, we print the first key-value pair from the `entity_dict`. This example showcases how our organized data facilitates quick access to specific information, in this case, entity labels associated with a particular text. It exemplifies the efficient retrieval of data for analysis or further processing, an essential feature for handling complex NLP tasks like Element Extraction.

In [12]:
for key, value in entity_dict.items():
    print(key)
    print(value)
    break

What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]


## Setting Up the Environment for Model Training

Before we can train our RoBERTa models for entity, aspect, and constraint classification, it is essential to set up our Python environment with the necessary libraries. This involves installing the `transformers` library, which provides the RoBERTa model and tokenizer, as well as `torch` for working with tensors, and `sklearn` for model evaluation and data manipulation utilities. By ensuring these libraries are installed, we equip our environment with the tools required for efficient model training and evaluation, paving the way for the successful application of our element extraction methodology.

In [13]:
!pip install transformers torch sklearn

Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  Preparing metadata (setup.py) ... [?25lerror
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[15 lines of output][0m
  [31m   [0m The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
  [31m   [0m rather than 'sklearn' for pip commands.
  [31m   [0m 
  [31m   [0m Here is how to fix this error in the main use cases:
  [31m   [0m - use 'pip install scikit-learn' rather than 'pip install sklearn'
  [31m   [0m - replace 'sklearn' by 'scikit-learn' in your pip requirements files
  [31m   [0m   (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
  [31m   [0m - if the 'sklearn' package is used by one of your dependencies,
  [31m   [0m   it would be great if you take some time to track which package uses
  [31m   [0m   'sklearn' instead of 'scikit-lea

## Extracting and Demonstrating Key-Value Pair Relationships

The function `extract_values` is designed to map keys from one list to values in another, based on a reference dictionary. This is particularly useful in scenarios where we need to align different elements of our dataset, such as texts and their corresponding labels, for detailed analysis or processing. Following this, we showcase the utility of this function by preparing to align texts with their labels for entities, aspects, and constraints separately. By extracting and printing the first elements of these lists, we prepare to demonstrate how texts and labels can be effectively paired, providing a clear foundation for any subsequent operations that require a direct association between textual data and their annotated labels.

In [14]:
def extract_values(dict1, list1, list2):
    result = {}
    for key, value in zip(list1, list2):
        result[key] = dict1.get(value, None)
    return result


entity_text_ = list(entity_dict.keys())
entity_labels_ = list(entity_dict.values())

print("list1:", entity_text_[0])
print("list2:", entity_labels_[0])


aspect_text_ = list(aspect_dict.keys())
aspect_labels_ = list(aspect_dict.values())

print("list1:", aspect_text_[0])
print("list2:", aspect_labels_[0])


constraint_text_ = list(constraint_dict.keys())
constraint_labels_ = list(constraint_dict.values())

print("list1:", constraint_text_[0])
print("list2:", constraint_labels_[0])


list1: What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
list2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
list1: What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
list2: [0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0]
list1: What are the best smartphones with a built in stylus feature with a good quality display and RAM , other than Samsung ?
list2: [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0]


## Preparing Data for RoBERTa Token Classification

To fine-tune the RoBERTa model for token classification, we first import necessary modules and initialize the RoBERTa tokenizer and model with the appropriate number of labels for our classification task. We then proceed to preprocess our entity texts and labels for model input. This involves tokenizing the texts, creating attention masks for handling padding, and adjusting label lengths to match the tokenized inputs. Padding labels with -100 ensures that these positions are ignored during the loss calculation, aligning with the model's requirements. By converting these lists into tensors, we prepare our dataset for training with RoBERTa, setting the stage for effective learning and classification of entities, aspects, and constraints within our text data.

In [15]:
import torch
from transformers import RobertaTokenizer, RobertaForTokenClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
import numpy as np

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained('roberta-base', num_labels=3)


max_length = max1  # Adjust as needed

input_ids = []
attention_masks = []
entity_labels = []

texts=[]
labels=[]

texts = entity_text_
labels = entity_labels_

for text, label in zip(texts, labels):
    encoded_dict = tokenizer.encode_plus(
                        text,
                        add_special_tokens=True,
                        max_length=max_length,
                        padding='max_length',
                        truncation=True,
                        return_attention_mask=True,
                        return_tensors='pt',
                    )

    # Create a mask for the labels to ignore padding in the loss computation
    label_mask = [1] * len(label) + [0] * (max_length - len(label))
    label = label + [-100] * (max_length - len(label))  # Padding labels with -100, which is ignored by the loss function

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
    entity_labels.append(label)

entity_input_ids = torch.cat(input_ids, dim=0)
entity_attention_masks = torch.cat(attention_masks, dim=0)
entity_labels = torch.tensor(entity_labels)

# Proceed with dataset creation, DataLoader setup, and cross-validation as before


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Preparing Aspect Data for RoBERTa Token Classification

For the aspect classification task, we proceed to tokenize the aspect-related texts using the RoBERTa tokenizer, similarly to how we handled entity data. This process involves converting texts into input IDs, generating attention masks to handle different sequence lengths, and padding the labels to match the tokenized input lengths. Special care is taken to pad labels with -100, a value indicating positions that should be ignored during the model's loss calculation, ensuring accurate training and classification. The resulting tensors for input IDs, attention masks, and aspect labels are then ready to be utilized in creating a dataset specifically designed for aspect classification with RoBERTa, setting a clear path towards identifying and categorizing aspect-related elements within texts.

In [16]:
import torch
from transformers import RobertaTokenizer, RobertaForTokenClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
import numpy as np

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained('roberta-base', num_labels=3)


max_length = max1  # Adjust as needed

input_ids = []
attention_masks = []
aspect_labels = []

texts=[]
labels=[]

texts = aspect_text_
labels = aspect_labels_

for text, label in zip(texts, labels):
    encoded_dict = tokenizer.encode_plus(
                        text,
                        add_special_tokens=True,
                        max_length=max_length,
                        padding='max_length',
                        truncation=True,
                        return_attention_mask=True,
                        return_tensors='pt',
                    )

    # Create a mask for the labels to ignore padding in the loss computation
    label_mask = [1] * len(label) + [0] * (max_length - len(label))
    label = label + [-100] * (max_length - len(label))  # Padding labels with -100, which is ignored by the loss function

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
    aspect_labels.append(label)

aspect_input_ids = torch.cat(input_ids, dim=0)
aspect_attention_masks = torch.cat(attention_masks, dim=0)
aspect_labels = torch.tensor(aspect_labels)

# Proceed with dataset creation, DataLoader setup, and cross-validation as before


Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Preparing Constraint Data for RoBERTa Token Classification

In parallel with entities and aspects, the preparation of constraint-related data follows a structured approach to ensure the RoBERTa model can effectively learn and classify constraint information. Through the process of tokenizing constraint texts, we generate input IDs and attention masks, alongside adjusting label lengths through padding to align with the tokenized inputs. Padding the labels with -100 is crucial for correctly informing the model which positions to disregard in its loss calculations. This meticulous preparation results in tensors for constraint input IDs, attention masks, and labels, all set for integration into a dataset tailored for the constraint classification task. By systematically preparing our data for each specific element type, we enable the RoBERTa model to finely distinguish and accurately extract constraint elements from our texts, enhancing the overall capability of our element extraction system.

In [17]:
import torch
from transformers import RobertaTokenizer, RobertaForTokenClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
import numpy as np

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained('roberta-base', num_labels=3)


max_length = max1  # Adjust as needed

input_ids = []
attention_masks = []
constraint_labels = []

texts=[]
labels=[]

texts = constraint_text_
labels = constraint_labels_

for text, label in zip(texts, labels):
    encoded_dict = tokenizer.encode_plus(
                        text,
                        add_special_tokens=True,
                        max_length=max_length,
                        padding='max_length',
                        truncation=True,
                        return_attention_mask=True,
                        return_tensors='pt',
                    )

    # Create a mask for the labels to ignore padding in the loss computation
    label_mask = [1] * len(label) + [0] * (max_length - len(label))
    label = label + [-100] * (max_length - len(label))  # Padding labels with -100, which is ignored by the loss function

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
    constraint_labels.append(label)

constraint_input_ids = torch.cat(input_ids, dim=0)
constraint_attention_masks = torch.cat(attention_masks, dim=0)
constraint_labels = torch.tensor(constraint_labels)

# Proceed with dataset creation, DataLoader setup, and cross-validation as before


Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Inspecting the Prepared Datasets for RoBERTa Classification

After preprocessing our data for entity, aspect, and constraint classification, we examine the first entries of our prepared datasets to ensure their readiness for model training. This includes reviewing the input IDs and attention masks for each classification type, which are essential for informing the RoBERTa model about the text to focus on and the padding to ignore. Additionally, we inspect the labels for entities, aspects, and constraints, confirming that they have been correctly padded and are in the appropriate format for the classification tasks. This verification step is crucial for ensuring the integrity and correctness of our datasets before proceeding with the model training process, setting a solid foundation for successful element extraction.

In [18]:
print(entity_input_ids[0])
print(entity_attention_masks[0])
print(aspect_input_ids[0])
print(aspect_attention_masks[0])
print(constraint_input_ids[0])
print(constraint_attention_masks[0])
print(entity_labels[0])
print(aspect_labels[0])
print(constraint_labels[0])

tensor([    0,  2264,    32,     5,   275,  7466,    19,    10,  1490,    11,
        15240,   687,  1905,    19,    10,   205,  1318,  2332,     8, 10646,
         2156,    97,    87,  3797, 17487,     2,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## Setting Up the Environment

Before diving into the model training and evaluation, it's essential to ensure that our environment is equipped with the necessary libraries and frameworks. The installation commands provided here are for installing PyTorch (`torch`, `torchvision`, `torchaudio`) along with `transformers` and `scikit-learn`. PyTorch serves as the backbone for building and training neural network models, `transformers` provides access to pre-trained models like RoBERTa and utilities for NLP tasks, and `scikit-learn` offers tools for data processing and evaluation metrics. Executing these commands prepares the environment with the required dependencies, setting the stage for the development and evaluation of our Multi-Task Learning model for Element Extraction.

In [19]:
!pip install torch torchvision torchaudio
!pip install transformers
!pip install scikit-learn



## Preparing Inputs for Model Training

In this step, we prepare the `input_ids` and `attention_masks` for the model training process. Notably, while we have separate variables for entities, aspects, and constraints (`entity_input_ids`, `aspect_input_ids`, `constraint_input_ids` and their corresponding attention masks), the data preparation process ensures that these variables are identical in structure and content. This uniformity allows

In [20]:
input_ids = constraint_input_ids
attention_masks = constraint_attention_masks

## Implementing a Multi-Task Learning Framework with Adapter for Element Extraction

This code segment introduces a Multi-Task Learning (MTL) framework specifically designed for Element Extraction, integrating classifications for Entity, Aspect, and Constraint within a unified model architecture. The `MultiTaskRoberta` class extends PyTorch's `Module`, incorporating a RoBERTa-based transformer model enhanced with the "AdapterHub/roberta-base-pf-qnli" adapter. This adapter is meticulously engineered to improve performance on question-answering tasks by fine-tuning the model to adeptly handle the intricacies of natural language inference.

The adapter's integration allows the model to leverage deeper contextual insights from the pre-trained RoBERTa model, specifically tuned to meet the unique requirements of the Element Extraction tasks. By incorporating the adapter, the model gains enhanced capabilities to detect subtle nuances and interrelationships within the data, which are crucial for precise classification of entities, aspects, and constraints.

Following this model setup, the `TokenClassificationDataset` class efficiently organizes token sequences and their corresponding labels, ensuring the data is optimally structured for effective training. The subsequent application of K-Fold cross-validation not only tests the robustness of the model across different data subsets but also bolsters its generalization capabilities. This iterative training and evaluation process across each fold offers detailed insights into the model's performance, showcasing the significant benefits of integrating the adapter within the MTL framework to tackle complex NLP challenges such as Element Extraction.


In [21]:
!pip install adapters

Collecting adapters
  Downloading adapters-0.1.2-py3-none-any.whl.metadata (15 kB)
Collecting transformers~=4.36.0 (from adapters)
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.8/126.8 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
Downloading adapters-0.1.2-py3-none-any.whl (256 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m256.0/256.0 kB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m92.6 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hInstalling collected packages: transformers, adapters
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed adapters-0.1.2 transformers-4.36.2


In [23]:
import torch
import torch.nn as nn
import numpy as np
from tqdm.auto import tqdm
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset, RandomSampler
from transformers import RobertaModel
from torch.optim import AdamW,Adam
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from transformers import RobertaModel, RobertaConfig
import adapters 


In [24]:
import torch
import torch.nn as nn
import numpy as np
from tqdm.auto import tqdm
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset, RandomSampler
from transformers import RobertaModel
from torch.optim import AdamW,Adam
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from transformers import RobertaModel, RobertaConfig


# Define the model
class MultiTaskRoberta(torch.nn.Module):
    def __init__(self, num_labels_entity, num_labels_aspect, num_labels_constraint):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained('roberta-base')
                  
        adapters.init(self.roberta)
        adapter_name = self.roberta.load_adapter("AdapterHub/roberta-base-pf-qnli", source="hf")
        self.roberta.set_active_adapters(adapter_name)
        
        self.classifier_entity = torch.nn.Linear(self.roberta.config.hidden_size, num_labels_entity)
        self.classifier_aspect = torch.nn.Linear(self.roberta.config.hidden_size, num_labels_aspect)
        self.classifier_constraint = torch.nn.Linear(self.roberta.config.hidden_size, num_labels_constraint)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state
        logits_entity = self.classifier_entity(sequence_output)
        logits_aspect = self.classifier_aspect(sequence_output)
        logits_constraint = self.classifier_constraint(sequence_output)
        return logits_entity, logits_aspect, logits_constraint

# Define your dataset class
class TokenClassificationDataset(Dataset):
    def __init__(self, input_ids, attention_masks, entity_labels, aspect_labels, constraint_labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.entity_labels = entity_labels
        self.aspect_labels = aspect_labels
        self.constraint_labels = constraint_labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx].clone().detach(),
            'attention_mask': self.attention_masks[idx].clone().detach(),
            'labels_entity': self.entity_labels[idx].clone().detach(),
            'labels_aspect': self.aspect_labels[idx].clone().detach(),
            'labels_constraint': self.constraint_labels[idx].clone().detach(),
        }



num_samples = 1257
seq_length = 100
vocab_size = 30522
num_labels_entity = 3
num_labels_aspect = 3
num_labels_constraint = 3


labels_entity = entity_labels
labels_aspect = aspect_labels
labels_constraint = constraint_labels

# Instantiate the dataset
dataset = TokenClassificationDataset(input_ids, attention_masks, labels_entity, labels_aspect, labels_constraint)

# Training and evaluation setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MultiTaskRoberta(num_labels_entity, num_labels_aspect, num_labels_constraint).to(device)
criterion = torch.nn.CrossEntropyLoss(ignore_index=-100)  # Use -100 for padding token labels

# Prepare for K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_results = {}

for fold, (train_ids, test_ids) in enumerate(kf.split(dataset)):
    # DataLoaders for the current fold
    train_dataset = Subset(dataset, train_ids)
    test_dataset = Subset(dataset, test_ids)
    train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)  # No shuffling for test data

    # Reinitialize the model and optimizer at the start of each fold
    model = MultiTaskRoberta(num_labels_entity, num_labels_aspect, num_labels_constraint).to(device)
    #optimizer = Adam(model.parameters(), lr=3e-5)
    optimizer = AdamW(model.parameters(), lr=5e-5)


    # Training loop
    for epoch in range(5):  # Number of epochs
        model.train()
        for batch in tqdm(train_dataloader, desc=f"Training Fold {fold+1} - Epoch {epoch+1}"):
          input_ids = batch['input_ids'].to(device)
          attention_mask = batch['attention_mask'].to(device)
          labels_entity = batch['labels_entity'].to(device)
          labels_aspect = batch['labels_aspect'].to(device)
          labels_constraint = batch['labels_constraint'].to(device)

          optimizer.zero_grad()
          logits_entity, logits_aspect, logits_constraint = model(input_ids, attention_mask)

          # Active loss mask for ignoring -100 labels for entity
          active_loss_entity = labels_entity.view(-1) != -100
          active_logits_entity = logits_entity.view(-1, num_labels_entity)[active_loss_entity]
          active_labels_entity = labels_entity.view(-1)[active_loss_entity]
          loss_entity = criterion(active_logits_entity, active_labels_entity)

          # Active loss mask for ignoring -100 labels for aspect
          active_loss_aspect = labels_aspect.view(-1) != -100
          active_logits_aspect = logits_aspect.view(-1, num_labels_aspect)[active_loss_aspect]
          active_labels_aspect = labels_aspect.view(-1)[active_loss_aspect]
          loss_aspect = criterion(active_logits_aspect, active_labels_aspect)

          # Active loss mask for ignoring -100 labels for constraint
          active_loss_constraint = labels_constraint.view(-1) != -100
          active_logits_constraint = logits_constraint.view(-1, num_labels_constraint)[active_loss_constraint]
          active_labels_constraint = labels_constraint.view(-1)[active_loss_constraint]
          loss_constraint = criterion(active_logits_constraint, active_labels_constraint)

          # Aggregate losses and perform a backward pass
          total_loss = loss_entity + loss_aspect + loss_constraint
          total_loss.backward()
          optimizer.step()


     



    # Evaluation loop
    model.eval()
    with torch.no_grad():
        all_labels_entity, all_preds_entity = [], []
        all_labels_aspect, all_preds_aspect = [], []
        all_labels_constraint, all_preds_constraint = [], []

        for batch in test_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_entity = batch['labels_entity'].to(device)
            labels_aspect = batch['labels_aspect'].to(device)
            labels_constraint = batch['labels_constraint'].to(device)

            logits_entity, logits_aspect, logits_constraint = model(input_ids, attention_mask)

            preds_entity = torch.argmax(logits_entity, dim=2).detach().cpu().numpy()
            preds_aspect = torch.argmax(logits_aspect, dim=2).detach().cpu().numpy()
            preds_constraint = torch.argmax(logits_constraint, dim=2).detach().cpu().numpy()

            labels_entity = labels_entity.cpu().numpy()
            labels_aspect = labels_aspect.cpu().numpy()
            labels_constraint = labels_constraint.cpu().numpy()

            # Filter out '-100' used for padding in labels
            active_labels_entity = labels_entity != -100
            active_labels_aspect = labels_aspect != -100
            active_labels_constraint = labels_constraint != -100

            preds_entity = preds_entity[active_labels_entity]
            preds_aspect = preds_aspect[active_labels_aspect]
            preds_constraint = preds_constraint[active_labels_constraint]

            labels_entity = labels_entity[active_labels_entity]
            labels_aspect = labels_aspect[active_labels_aspect]
            labels_constraint = labels_constraint[active_labels_constraint]

            precision_entity, recall_entity, f1_entity, _ = precision_recall_fscore_support(labels_entity, preds_entity, average='macro', zero_division=0)
            precision_aspect, recall_aspect, f1_aspect, _ = precision_recall_fscore_support(labels_aspect, preds_aspect, average='macro', zero_division=0)
            precision_constraint, recall_constraint, f1_constraint, _ = precision_recall_fscore_support(labels_constraint, preds_constraint, average='macro', zero_division=0)

            fold_results[fold+1] = {
                'Entity': {'Precision': precision_entity, 'Recall': recall_entity, 'F1': f1_entity},
                'Aspect': {'Precision': precision_aspect, 'Recall': recall_aspect, 'F1': f1_aspect},
                'Constraint': {'Precision': precision_constraint, 'Recall': recall_constraint, 'F1': f1_constraint}
            }

# Print out fold results
for fold, metrics in fold_results.items():
    print(f"\nFold {fold} Results:")
    for task, scores in metrics.items():
        print(f"{task} - Precision: {scores['Precision']:.4f}, Recall: {scores['Recall']:.4f}, F1-Score: {scores['F1']:.4f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

pytorch_model_head.bin:   0%|          | 0.00/2.37M [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/575 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/2.22k [00:00<?, ?B/s]

pytorch_adapter.bin:   0%|          | 0.00/3.59M [00:00<?, ?B/s]

head_config.json:   0%|          | 0.00/391 [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Training Fold 1 - Epoch 1:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 1 - Epoch 2:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 1 - Epoch 3:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 1 - Epoch 4:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 1 - Epoch 5:   0%|          | 0/128 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Training Fold 2 - Epoch 1:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 2 - Epoch 2:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 2 - Epoch 3:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 2 - Epoch 4:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 2 - Epoch 5:   0%|          | 0/128 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Training Fold 3 - Epoch 1:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 3 - Epoch 2:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 3 - Epoch 3:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 3 - Epoch 4:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 3 - Epoch 5:   0%|          | 0/128 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Training Fold 4 - Epoch 1:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 4 - Epoch 2:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 4 - Epoch 3:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 4 - Epoch 4:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 4 - Epoch 5:   0%|          | 0/128 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Training Fold 5 - Epoch 1:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 5 - Epoch 2:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 5 - Epoch 3:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 5 - Epoch 4:   0%|          | 0/128 [00:00<?, ?it/s]

Training Fold 5 - Epoch 5:   0%|          | 0/128 [00:00<?, ?it/s]


Fold 1 Results:
Entity - Precision: 0.9792, Recall: 0.9963, F1-Score: 0.9874
Aspect - Precision: 0.9524, Recall: 0.9974, F1-Score: 0.9730
Constraint - Precision: 1.0000, Recall: 1.0000, F1-Score: 1.0000

Fold 2 Results:
Entity - Precision: 0.9815, Recall: 0.9560, F1-Score: 0.9682
Aspect - Precision: 1.0000, Recall: 1.0000, F1-Score: 1.0000
Constraint - Precision: 0.5556, Recall: 0.5972, F1-Score: 0.5744

Fold 3 Results:
Entity - Precision: 0.9117, Recall: 0.9047, F1-Score: 0.9081
Aspect - Precision: 0.7514, Recall: 0.7514, F1-Score: 0.7514
Constraint - Precision: 0.6667, Recall: 0.9856, F1-Score: 0.7704

Fold 4 Results:
Entity - Precision: 0.9220, Recall: 0.9172, F1-Score: 0.9185
Aspect - Precision: 0.7798, Recall: 0.9854, F1-Score: 0.8600
Constraint - Precision: 0.9858, Recall: 0.7000, F1-Score: 0.7983

Fold 5 Results:
Entity - Precision: 0.9372, Recall: 0.9674, F1-Score: 0.9511
Aspect - Precision: 0.7556, Recall: 0.9878, F1-Score: 0.8438
Constraint - Precision: 0.8333, Recall: 0.997

In [28]:
# Initialize dictionaries to store total metrics for calculating averages
total_metrics = {
    'Entity': {'Precision': 0, 'Recall': 0, 'F1': 0},
    'Aspect': {'Precision': 0, 'Recall': 0, 'F1': 0},
    'Constraint': {'Precision': 0, 'Recall': 0, 'F1': 0}
}

# Sum up metrics for each task across all folds
for fold in fold_results.values():
    for task, metrics in fold.items():
        for metric, value in metrics.items():
            total_metrics[task][metric] += value
a
# Calculate the average for each metric for each task
average_metrics = {
    task: {metric: value / len(fold_results) for metric, value in metrics.items()}
    for task, metrics in total_metrics.items()
}

print(average_metrics)

{'Entity': {'Precision': 0.9463052376922583, 'Recall': 0.9483129690680914, 'F1': 0.9466488365771555}, 'Aspect': {'Precision': 0.8478252484794542, 'Recall': 0.9444035649240515, 'F1': 0.885664204020611}, 'Constraint': {'Precision': 0.8082621082621081, 'Recall': 0.8561059890262064, 'F1': 0.8061597303058671}}
