### Data processing

This notebook processes the ObliQA dataset, stored in JSON format, and saves Hugging Face datasets for training, evaluation, and testing, plus a pickled corpus of passages. Each JSON file contains data samples that map a question to the passages that answer it accurately and completely. For instance, this is a sample from the testing dataset:

```json
{
    "QuestionID": "777e7a14-fea3-4c37-a0e6-9ffb50024d5c",
    "Question": "Can the ADGM provide clarity on the level of detail and documentation that should accompany a report of suspicious activity to ensure it meets regulatory standards?",
    "Passages": [
        {
            "DocumentID": 1,
            "PassageID": "14.2.3.Guidance.10.",
            "Passage": "Relevant Persons should comply with guidance issued by the EOCN with regard to identifying and reporting suspicious activity and Transactions relating to money laundering, terrorist financing and proliferation financing."
        }
    ],
    "Group": 2
}
```
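
As a minimal sketch of how this structure can be read, the snippet below parses an abridged copy of the sample (the question and passage texts are shortened here for brevity) and lists the `(DocumentID, PassageID)` pair of each relevant passage. The variable names are illustrative and not part of the notebook.

```python
import json

# Abridged copy of the sample above; the long texts are shortened for brevity.
sample_json = """
{
    "QuestionID": "777e7a14-fea3-4c37-a0e6-9ffb50024d5c",
    "Question": "Can the ADGM provide clarity on ...?",
    "Passages": [
        {
            "DocumentID": 1,
            "PassageID": "14.2.3.Guidance.10.",
            "Passage": "Relevant Persons should comply with guidance ..."
        }
    ],
    "Group": 2
}
"""

sample = json.loads(sample_json)

# A question can map to one or more passages, so "Passages" is always a list.
passage_keys = [(p["DocumentID"], p["PassageID"]) for p in sample["Passages"]]
print(sample["QuestionID"], passage_keys)
```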

In [12]:
# Import libraries

# Data handling
from datasets import Dataset
from pandas import to_pickle

# Other utils
import json
from re import compile
import os

In [2]:
# Function to perform some basic clean up of the query
def simple_cleaning(query: str) -> str:
    pattern_newline = compile(r'[\n\t\u200e]')  # Newlines, tabs, and left-to-right marks (U+200E), replaced with a space
    pattern_multiple_spaces = compile(r' +')  # Runs of contiguous spaces, collapsed into a single space

    cln_query = pattern_newline.sub(' ', query)
    cln_query = pattern_multiple_spaces.sub(' ', cln_query).strip()
    return cln_query
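
To illustrate what the cleaning step does, here is a self-contained sketch that reproduces the same two substitutions and applies them to a made-up string. Compiling the patterns once at module level is a stylistic choice in this sketch, not something the notebook requires.

```python
import re

# Same two patterns as in the notebook, compiled once at import time.
_CONTROL_CHARS = re.compile(r'[\n\t\u200e]')
_MULTIPLE_SPACES = re.compile(r' +')

def simple_cleaning(query: str) -> str:
    # Replace newlines, tabs, and U+200E with a space, then collapse spaces.
    cleaned = _CONTROL_CHARS.sub(' ', query)
    return _MULTIPLE_SPACES.sub(' ', cleaned).strip()

raw = "  What is\tRule 7.3.2?\n\u200e "  # made-up input for illustration
cleaned = simple_cleaning(raw)
print(repr(cleaned))  # 'What is Rule 7.3.2?'
```

Note that a left-to-right mark in the middle of a word becomes a space, so `"a\u200eb"` is cleaned to `"a b"` rather than `"ab"`.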

In [3]:
# Load the data in memory

# Training dataset
with open('./ObliQADataset/ObliQA_train.json', encoding='utf-8') as f:
    data_train = json.load(f)

# Evaluation dataset
with open('./ObliQADataset/ObliQA_dev.json', encoding='utf-8') as f:
    data_eval = json.load(f)

# Testing dataset
with open('./ObliQADataset/ObliQA_test.json', encoding='utf-8') as f:
    data_test = json.load(f)

In [4]:
len(data_train), len(data_eval), len(data_test)

(22295, 2788, 2786)

### Training dataset

In [None]:
# Create training dataset with 4 features
train_set = []
for q in data_train:
    q_id = q['QuestionID']
    for rel_doc in q['Passages']:
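        # Each row pairs one question with one relevant passage and has 4 features:
        # - anchor_id: the question ID
        # - anchor: the cleaned question text
        # - positive: the cleaned passage text, prefixed with its passage ID
        # - positive_id: the document ID and passage ID joined as '<DocumentID>-<PassageID>'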
        train_set.append({
            'anchor_id': q_id,
            'anchor': simple_cleaning(q['Question']),
            'positive': simple_cleaning(f"{rel_doc['PassageID']} {rel_doc['Passage']}"),
            'positive_id': f"{rel_doc['DocumentID']}-{rel_doc['PassageID']}",
        })
        
train_dataset = Dataset.from_list(train_set)

train_dataset.save_to_disk('./data/train_dataset')

train_dataset

Saving the dataset (1/1 shards): 100%|██████████| 29547/29547 [00:00<00:00, 749409.50 examples/s]


Dataset({
    features: ['anchor_id', 'anchor', 'positive', 'positive_id'],
    num_rows: 29547
})

In [6]:
train_dataset[0]

{'anchor_id': 'a10724b5-ad0e-4b69-8b5e-792aef214f86',
 'anchor': 'Under Rules 7.3.2 and 7.3.3, what are the two specific conditions related to the maturity of a financial instrument that would trigger a disclosure requirement?',
 'positive': '7.3.4 Events that trigger a disclosure. For the purposes of Rules 7.3.2 and 7.3.3, a Person is taken to hold Financial Instruments in or relating to a Reporting Entity, if the Person holds a Financial Instrument that on its maturity will confer on him: (1) an unconditional right to acquire the Financial Instrument; or (2) the discretion as to his right to acquire the Financial Instrument.',
 'positive_id': '11-7.3.4'}
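
The training set has more rows (29547) than questions (22295) because the loop emits one row per (question, passage) pair, and some questions have several relevant passages. A minimal sketch of that count, using a hypothetical `count_pairs` helper and toy data rather than the real files:

```python
def count_pairs(samples):
    """Number of (question, passage) rows the loop above would produce."""
    return sum(len(q['Passages']) for q in samples)

# Toy data with the same shape as the JSON samples (texts omitted).
toy = [
    {'QuestionID': 'q1', 'Passages': [{'PassageID': '1.1'}, {'PassageID': '1.2'}]},
    {'QuestionID': 'q2', 'Passages': [{'PassageID': '2.1'}]},
]

n_questions, n_pairs = len(toy), count_pairs(toy)
print(n_questions, n_pairs)  # 2 3
```

Applied to `data_train`, the same function should return the `num_rows` value reported for `train_dataset`.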

### Evaluation dataset

In [7]:
# Same features as the training dataset
eval_set = []
for q in data_eval:
    q_id = q['QuestionID']
    for rel_doc in q['Passages']:
        eval_set.append({
            'anchor_id': q_id,
            'anchor': simple_cleaning(q['Question']),
            'positive': simple_cleaning(f"{rel_doc['PassageID']} {rel_doc['Passage']}"),
            'positive_id': f"{rel_doc['DocumentID']}-{rel_doc['PassageID']}",
        })
        
eval_dataset = Dataset.from_list(eval_set)

eval_dataset.save_to_disk('./data/eval_dataset')

eval_dataset

Saving the dataset (1/1 shards): 100%|██████████| 3677/3677 [00:00<00:00, 477771.25 examples/s]


Dataset({
    features: ['anchor_id', 'anchor', 'positive', 'positive_id'],
    num_rows: 3677
})

### Testing dataset

In [8]:
# Same features as the training dataset
test_set = []
for q in data_test:
    q_id = q['QuestionID']
    for rel_doc in q['Passages']:
        test_set.append({
            'anchor_id': q_id,
            'anchor': simple_cleaning(q['Question']),
            'positive': simple_cleaning(f"{rel_doc['PassageID']} {rel_doc['Passage']}"),
            'positive_id': f"{rel_doc['DocumentID']}-{rel_doc['PassageID']}",
        })
        
test_dataset = Dataset.from_list(test_set)

test_dataset.save_to_disk('./data/test_dataset')

test_dataset

Saving the dataset (1/1 shards): 100%|██████████| 3666/3666 [00:00<00:00, 475634.70 examples/s]


Dataset({
    features: ['anchor_id', 'anchor', 'positive', 'positive_id'],
    num_rows: 3666
})
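
The three cells above repeat the same loop. One possible refactor, sketched here with a hypothetical `build_pairs` helper that is not part of the notebook, is to factor the loop into a function and call it once per split. The sketch redefines `simple_cleaning` so that it runs on its own.

```python
import re

def simple_cleaning(text: str) -> str:
    # Same behaviour as the notebook's helper.
    text = re.sub(r'[\n\t\u200e]', ' ', text)
    return re.sub(r' +', ' ', text).strip()

def build_pairs(samples):
    """Hypothetical helper: one row per (question, relevant passage) pair."""
    rows = []
    for q in samples:
        anchor = simple_cleaning(q['Question'])  # cleaned once per question
        for rel_doc in q['Passages']:
            rows.append({
                'anchor_id': q['QuestionID'],
                'anchor': anchor,
                'positive': simple_cleaning(f"{rel_doc['PassageID']} {rel_doc['Passage']}"),
                'positive_id': f"{rel_doc['DocumentID']}-{rel_doc['PassageID']}",
            })
    return rows

# Toy input with the same shape as the JSON samples.
toy = [{
    'QuestionID': 'q1',
    'Question': 'What  triggers\na disclosure?',
    'Passages': [{'DocumentID': 11, 'PassageID': '7.3.4', 'Passage': 'Events that trigger a disclosure.'}],
}]
rows = build_pairs(toy)
print(rows[0])
```

Each split would then reduce to a call such as `Dataset.from_list(build_pairs(data_eval))` followed by `save_to_disk`.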

### Corpus

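This section builds a corpus from the 40 regulatory documents, keeping only passages whose text (passage ID plus passage) is longer than 100 characters, and saves it to disk as a pickle. The corpus maps each `DocumentID-PassageID` key to the corresponding passage text.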

In [13]:
# Load the collection of passages
ndocs = 40  # Number of documents to process
collection = []  # Collection of documents

# Read each document and extract the relevant passages
for i in range(1, ndocs + 1):
    with open(os.path.join("ObliQADataset/StructuredRegulatoryDocuments", f"{i}.json")) as f:
        doc = json.load(f)  # Load the contents of the json file
        for psg in doc:  # For each passage in the document
            # Only add a passage to the collection if its text is longer than 100 characters.
            # This assumes that shorter passages may be irrelevant or empty, as they may
            # simply refer to sections of the document
            if len(psg["PassageID"] + " " + psg["Passage"]) > 100:
                collection.append(
                    dict(
                        text=psg["PassageID"] + " " + psg["Passage"],  # Joins the passageId and the passage itself
                        ID=psg["ID"],  # entry ID
                        DocumentId=psg['DocumentID'],  # document ID
                        PassageId=psg['PassageID'],  # passage ID
                    )
                )
                
corpus = {f"{doc['DocumentId']}-{doc['PassageId']}": doc["text"] for doc in collection}
# Save the corpus to disk
to_pickle(corpus, './data/corpus.pkl')
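
Because the corpus drops short passages, a `positive_id` from the datasets could in principle have no matching corpus key. Whether this actually happens in ObliQA is not established here; the sketch below only shows how such a check might look, using a hypothetical `missing_positives` helper and toy data.

```python
def missing_positives(rows, corpus):
    """Hypothetical helper: positive_ids in `rows` that are not corpus keys.

    Returns each missing ID once, in order of first appearance.
    """
    missing = []
    for row in rows:
        pid = row['positive_id']
        if pid not in corpus and pid not in missing:
            missing.append(pid)
    return missing

# Toy corpus and rows; keys follow the '<DocumentID>-<PassageID>' format used above.
toy_corpus = {'1-2.1': '2.1 ' + 'x' * 120}
toy_rows = [
    {'anchor_id': 'q1', 'positive_id': '1-2.1'},
    {'anchor_id': 'q2', 'positive_id': '1-2.2'},
    {'anchor_id': 'q3', 'positive_id': '1-2.2'},
]

missing = missing_positives(toy_rows, toy_corpus)
print(missing)  # ['1-2.2']
```

Running the same check with `test_set` and `corpus` before evaluation would show whether any relevant passage was filtered out by the length threshold.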