# Versatility of CUAD models

The aim of this notebook is to create more questions for the CUAD dataset, both to increase the size of the dataset and, potentially, to make models trained on it more versatile.

It also adds a set of questions that is used only in a separate test set, so that models can be evaluated on unseen questions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [51]:
df_desc = pd.read_csv('../data/category_descriptions.csv')
df_desc.head(50)

| Unnamed: 0 | Category (incl. context and answer) | Description | Answer Format | Group |
|---|---|---|---|---|
| 0 | Category: Document Name | Description: The name of the contract | Answer Format: Contract Name | Group: - |
| 1 | Category: Parties | Description: The two or more parties who signe... | Answer Format: Entity or individual names | Group: - |
| 2 | Category: Agreement Date | Description: The date of the contract | Answer Format: Date (mm/dd/yyyy) | Group: 1 |
| 3 | Category: Effective Date | Description: The date when the contract is eff... | Answer Format: Date (mm/dd/yyyy) | Group: 1 |
| 4 | Category: Expiration Date | Description: On what date will the contract's ... | Answer Format: Date (mm/dd/yyyy) / Perpetual | Group: 1 |
| 5 | Category: Renewal Term | Description: What is the renewal term after th... | Answer Format: [Successive] number of years/mo... | Group: 1 |
| 6 | Category: Notice Period to Terminate Renewal | Description: What is the notice period require... | Answer Format: Number of days/months/year(s) | Group: 1 |
| 7 | Category: Governing Law | Description: Which state/country's law governs... | Answer Format: Name of a US State / non-US Pro... | Group: - |
| 8 | Category: Most Favored Nation | Description: Is there a clause that if a third... | Answer Format: Yes/No | Group: - |
| 9 | Category: Non-Compete | Description: Is there a restriction on the abi... | Answer Format: Yes/No | Group: 2 |
9,Category: Non-Compete,Description: Is there a restriction on the abi...,Answer Format: Yes/No,Group: 2


The default formulation of the questions is:

"Highlight the parts (if any) of this contract related to {<span style="color:red">Category</span>} that should be reviewed by a lawyer.
Details: {<span style="color:red">Description</span> (minus the "Description:")}"

We therefore aim to formulate the new questions in a way that lets us generate them automatically.

In [94]:
q = set()
category_col = 'Category (incl. context and answer)'
for _, row in df_desc.iterrows():
    # Keep only the text after the "Category: " and "Description: " prefixes
    category = row[category_col].split(': ')[-1]
    description = row['Description'].split('Description: ')[-1]
    x = f"Highlight the parts (if any) of this contract related to \"{category}\" that should be reviewed by a lawyer. Details: {description}"
    #x=x.replace("Notice Period to Terminate Renewal", "Notice Period To Terminate Renewal")
    #x=x.replace("No Solicitation Of Customers", "No-Solicit Of Customers")
    print(x)
    q.add(x)

Highlight the parts (if any) of this contract related to "Document Name" that should be reviewed by a lawyer. Details: The name of the contract
Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract
Highlight the parts (if any) of this contract related to "Agreement Date" that should be reviewed by a lawyer. Details: The date of the contract
Highlight the parts (if any) of this contract related to "Effective Date" that should be reviewed by a lawyer. Details: The date when the contract is effective 
Highlight the parts (if any) of this contract related to "Expiration Date" that should be reviewed by a lawyer. Details: On what date will the contract's initial term expire?
Highlight the parts (if any) of this contract related to "Renewal Term" that should be reviewed by a lawyer. Details: What is the renewal term after the initial term expires? This includes automatic extensions and 

This is exactly the structure used in the dataset. We will now create additions to it in various degrees of difficulty. The goal is for the models to become better at handling "normal" human wording.

For example, someone should be able to ask "What is the document name?" or "Who are the parties in this contract?". Furthermore, simple rearrangements of the words should also work.

## A new test set with different wordings
We will use four different degrees of difficulty:

1. Remove casing
2. Rearrangement of the question and answer
3. Different wording but close to the original
4. Completely different natural wording

All experiments reformulate the questions in the test.json dataset and save the result as a new version of the file.

In [112]:
import json
# Open the json file and load the data
def load_json(file):
    with open(file) as f:
        data = json.load(f)
    return data

def dump_json(file, data):
    with open(file, 'w') as f:
        json.dump(data, f)

In [116]:
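def change_question(file, question_map):
    """Load a SQuAD-style json file and replace every question via question_map.

    Keys of question_map must be the lowercased original questions.
    """
    data = load_json(file)
    for contract in data['data']:
        for paragraph in contract['paragraphs']:
            for qa in paragraph['qas']:
                # Fail loudly if a question has no replacement
                if qa['question'].lower() not in question_map:
                    raise Exception(f"{qa['question']} not in question_map")
                qa['question'] = question_map[qa['question'].lower()]
    return data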
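
To make the nested structure that `change_question` walks through easier to see, here is a minimal sketch that applies the same replacement logic to an in-memory dictionary instead of a file. The helper `remap_questions`, the toy contract and the toy question map are assumptions made for illustration only; they are not part of the CUAD data or of this notebook.

```python
import copy

def remap_questions(data, question_map):
    """Hypothetical in-memory variant of change_question (illustration only)."""
    data = copy.deepcopy(data)  # leave the caller's dictionary untouched
    for contract in data["data"]:
        for paragraph in contract["paragraphs"]:
            for qa in paragraph["qas"]:
                key = qa["question"].lower()
                if key not in question_map:
                    raise KeyError(f"{qa['question']} not in question_map")
                qa["question"] = question_map[key]
    return data

# Toy data with only the keys that change_question actually reads
toy = {
    "data": [
        {"paragraphs": [{"qas": [{"question": "Which Law Governs?"}]}]}
    ]
}

# Keys must be lowercased, because the lookup uses qa["question"].lower()
toy_map = {"which law governs?": "What is the governing law?"}

remapped = remap_questions(toy, toy_map)
print(remapped["data"][0]["paragraphs"][0]["qas"][0]["question"])
```

A question that is missing from the map raises an error instead of being silently kept, which mirrors the behaviour of `change_question`.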


**1. To lower**

This script just converts the questions to lowercase.

In [117]:
data = load_json('../data/test.json')
question_map = {k.lower(): k.lower() for k in q}
data_lower = change_question('../data/test.json', question_map)

In [118]:
dump_json('../data/test_lower.json', data_lower)
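
A quick sanity check can confirm that every question in the new dataset is in fact lowercased. The sketch below is an assumption about how such a check could look; `collect_questions` is a hypothetical helper and `toy_lower` is stand-in data, not the real contents of `test_lower.json`.

```python
def collect_questions(data):
    """Return the set of distinct question strings in a SQuAD-style dict."""
    return {
        qa["question"]
        for contract in data["data"]
        for paragraph in contract["paragraphs"]
        for qa in paragraph["qas"]
    }

# Toy stand-in for the lowercased dataset (hypothetical data)
toy_lower = {"data": [{"paragraphs": [{"qas": [
    {"question": 'highlight the parts (if any) of this contract related to '
                 '"parties" that should be reviewed by a lawyer. '
                 'details: the two or more parties who signed the contract'},
]}]}]}

questions = collect_questions(toy_lower)
# Any question that still contains an uppercase character ends up here
not_lowercase = {question for question in questions if question != question.lower()}
print(len(questions), "distinct questions,", len(not_lowercase), "not lowercased")
```

In the notebook itself, the same check could be run by calling `collect_questions(data_lower)`.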

**2. Rearrangement of the questions and answers**

This first script simply reverses the order of the question and the details.

In [121]:
q_map_reverse_order = {}
for _, row in df_desc.iterrows():
    category = row['Category (incl. context and answer)'].split(': ')[-1]
    description = row['Description'].split('Description: ')[-1]
    highlight = f"Highlight the parts (if any) of this contract related to \"{category}\" that should be reviewed by a lawyer."
    original = f"{highlight} Details: {description}"
    reversed_ = f"Details: {description}. {highlight}"
    # Keys are lowercased because change_question looks questions up via .lower()
    q_map_reverse_order[original.lower()] = reversed_

# Create dataset with new map
data_lower_reversed = change_question('../data/test_lower.json', q_map_reverse_order)
dump_json('../data/test_questions_reversed.json', data_lower_reversed)
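
To illustrate what the reversal produces, the following sketch builds the original and the reversed question for a single category. The helper `build_questions` is hypothetical and only mirrors the f-strings used above; the category and description are taken from the first row of the table shown earlier.

```python
def build_questions(category_field, description_field):
    """Hypothetical helper: return (original, reversed) question strings."""
    category = category_field.split(': ')[-1]
    description = description_field.split('Description: ')[-1]
    highlight = (
        f'Highlight the parts (if any) of this contract related to '
        f'"{category}" that should be reviewed by a lawyer.'
    )
    original = f"{highlight} Details: {description}"
    reversed_ = f"Details: {description}. {highlight}"
    return original, reversed_

original, reversed_ = build_questions(
    "Category: Document Name",
    "Description: The name of the contract",
)
print(original)
print(reversed_)
```

Note that the reversed form always appends a period after the description, so a description that already ends in a question mark (such as the one for "Expiration Date") yields "?." in the reversed question.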

**3. Different wording but close to the original**

This script changes the wording of the questions, but keeps it very close to the original.