# Processing "raw" data from the datasets

## Imports and other setup

In [1]:
import json
import os

from find_dataset import locate

DATA = locate("Datasets")
SINK = DATA / "processed"

os.makedirs(SINK, exist_ok=True)

print(f"Datasets: {DATA}")
print(f"Sink: {SINK}")

Datasets: c:\Users\yip\Documents\GitHub\542-LegalContract-AI\Datasets
Sink: c:\Users\yip\Documents\GitHub\542-LegalContract-AI\Datasets\processed


## Dataset directory layout

    Datasets
    ├── ContractNLI
    │   └── raw
    ├── CUAD
    │   ├── full_contract_pdf
    │   ├── full_contract_txt
    │   └── label_group_xlsx
    └── processed

Make sure that your directory tree looks like this, with `dev.json`, `test.json` and `train.json` at the same level as `raw` (under ContractNLI).
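
The layout described above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original pipeline: `EXPECTED` and `missing_paths` are hypothetical names, and the list of entries is assumed from the tree and the note above (`processed` is left out because the setup cell creates it).

```python
from pathlib import Path

# Relative paths the notebook relies on (assumed from the tree and note above).
EXPECTED = [
    "ContractNLI/raw",
    "ContractNLI/dev.json",
    "ContractNLI/test.json",
    "ContractNLI/train.json",
    "CUAD/full_contract_pdf",
    "CUAD/full_contract_txt",
    "CUAD/label_group_xlsx",
]


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_paths(DATA)` should return an empty list when everything is in place; any entries it returns are the ones to fix.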

## ContractNLI Processing

The non-raw data is stored in the `dev.json`, `test.json` and `train.json` files, which broadly follow this structure:

```json
{
    "documents": [
        {
            "id": "",
            "file_name": "",
            "spans": [
            ],
            "annotation_sets": [
                {
                    "annotations": {
                    }
                }
            ],
            "document_type": "",
            "url": ""
        }
    ],
    "labels": {
    }
}
```
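
To make the effect of the processing step concrete, here is a small sketch that trims a document of this shape down to its non-annotation keys. The `sample` data is made up for illustration (it is not real ContractNLI content), and `DROP_KEYS` and `strip_document` are hypothetical names; unlike an in-place `pop`, this version returns a copy and leaves the input untouched.

```python
# Annotation-related keys from the structure above; every other key is kept.
DROP_KEYS = ("spans", "annotation_sets", "document_type", "url")


def strip_document(doc, drop=DROP_KEYS):
    """Return a copy of `doc` without the keys listed in `drop`."""
    return {key: value for key, value in doc.items() if key not in drop}


# Made-up sample in the shape sketched above (not real ContractNLI data).
sample = {
    "documents": [
        {
            "id": "0",
            "file_name": "example.pdf",
            "spans": [[0, 10]],
            "annotation_sets": [{"annotations": {}}],
            "document_type": "",
            "url": "",
        }
    ],
    "labels": {},
}

trimmed = [strip_document(doc) for doc in sample["documents"]]
print(trimmed)
```

For this sample only `id` and `file_name` survive; with real files, any key not listed in `DROP_KEYS` is carried over as well.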

In [2]:
nli = []
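# Collect every document from the JSON files directly under ContractNLI.
for file in (DATA / "ContractNLI").glob("*.json"):
    with open(file, encoding="utf-8") as f:
        data = json.load(f)
    for doc in data["documents"]:
        # Drop the annotation fields; all remaining keys are kept.
        for key in ("spans", "annotation_sets", "document_type", "url"):
            doc.pop(key, None)
        nli.append(doc)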

nli = sorted(nli, key=lambda x: x["id"])
nli

[]

In [None]:
# ensure_ascii=False writes non-ASCII characters as-is, so pin the encoding.
with open(SINK / "contractnli.json", "w", encoding="utf-8") as f:
    json.dump(nli, f, indent=4, ensure_ascii=False)

## CUAD Processing

In [None]:
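cuad = []

# Sort the files first so that ids are assigned in a reproducible order.
files = sorted((DATA / "CUAD" / "full_contract_txt").glob("*.txt"))
for idx, file in enumerate(files):
    # Assumes the contract .txt files are UTF-8 encoded.
    with open(file, encoding="utf-8") as f:
        text = f.read()
    cuad.append(
        {
            "id": f"{idx}",
            "file_name": file.name,
            "text": text,
        }
    )

# ids are strings, so sort numerically ("10" would otherwise precede "2").
cuad = sorted(cuad, key=lambda x: int(x["id"]))
cuad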

In [None]:
# ensure_ascii=False writes non-ASCII characters as-is, so pin the encoding.
with open(SINK / "cuad.json", "w", encoding="utf-8") as f:
    json.dump(cuad, f, indent=4, ensure_ascii=False)
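
Because `ensure_ascii=False` writes non-ASCII characters verbatim, it can be worth confirming that records survive a write and read cycle. The sketch below is one way to do that under stated assumptions: `roundtrip_ok` is a hypothetical helper, the records are made up, and the file goes to a temporary directory rather than to `SINK`.

```python
import json
import tempfile
from pathlib import Path


def roundtrip_ok(records, path):
    """Write `records` as JSON to `path`, read them back, and compare."""
    path = Path(path)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        return json.load(f) == records


# Made-up records containing non-ASCII text, written to a throwaway directory.
records = [{"id": "0", "file_name": "a.txt", "text": "Section 1 § “Term”"}]
with tempfile.TemporaryDirectory() as tmp:
    ok = roundtrip_ok(records, Path(tmp) / "check.json")
print(ok)
```

Reading and writing with the same explicit encoding keeps the result independent of the platform default, which differs between Windows and most Unix systems.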