April 2022

# LAW3027 - Tutorial 2: Contract Automation

Contract Automation can be described as the process of generating, managing, and storing contracts digitally to create a more efficient contract workflow.

### Intended learning outcomes

By the end of this notebook, you will know:

* how to extract key contract elements in a structured way;
* how to visualize these elements in a contract;
* how to store the extracted elements in a dataframe;
* how to query the dataframe to retrieve elements that satisfy given conditions.


### GitHub page of the course Advanced Legal Analytics

The GitHub page of the course is available here: https://github.com/maastrichtlawtech/law3027-advanced-legal-analytics

The GitHub page contains all the tutorial notebooks, datasets, environment files, and configuration files. We recommend that you download the latest version of the repository as a zip file and extract it to your local drive, keeping the files organized in the same way as in the repository. Advanced users can use git (follow any online tutorial) to stay up to date with the latest version of the repository.


### Create a new conda environment from the yml file

Go to the `environment.yml` file on the GitHub page of the course (https://github.com/maastrichtlawtech/law3027-advanced-legal-analytics/blob/main/environment.yml) and save it in the local directory where you run Jupyter notebooks.

- On Windows, open the “Anaconda Prompt”; on macOS, open the terminal. Navigate to the folder where you stored the `environment.yml` file using the `cd` command. For instance, if you are in the `/home/user` folder and it contains a `Downloads` folder where you stored the `environment.yml` file, run `cd Downloads`. Then run the commands below to create and activate a new environment called `ala`. The name `ala` is taken from the first line of the `environment.yml` file.

Create the environment: `conda env create -f environment.yml`

Activate the environment: `conda activate ala`

Then run `jupyter notebook` and open the `tutorial2.ipynb` file.

### Library to be used in today's tutorial: LexNLP
Read more about LexNLP here: https://lexpredict-lexnlp.readthedocs.io/en/latest/about.html  or https://github.com/LexPredict/lexpredict-lexnlp/blob/master/documentation/docs/source/about.rst. The documentation of the library can be found here: https://lexpredict-lexnlp.readthedocs.io/en/latest/modules/extract/extract.html

<hr/>

## 1. Extraction of key contract elements with LexNLP

[LexNLP](https://github.com/LexPredict/lexpredict-lexnlp) is a Python library for information extraction from real, unstructured legal text (including contracts, plans, policies, procedures, and other material). It contains various methods for extracting the following data types from unstructured legal text:

* *acts*, e.g., “section 1 of the Advancing Hope Act, 1986”
* *amounts*, e.g., “ten pounds” or “5.8 megawatts”
* *companies*, e.g., “Lexpredict LLC”
* *conditions*, e.g., “subject to …” or “unless and until …”
* *constraints*, e.g., “no more than”
* *copyright*, e.g., “(C) Copyright 2000 Acme”
* *courts*, e.g., “Supreme Court of New York”
* *CUSIP*, e.g., “392690QT3”
* *dates*, e.g., “June 1, 2017” or “2018-01-01”
* *definitions*, e.g., “Term shall mean …”
* *distances*, e.g., “fifteen miles”
* *durations*, e.g., “ten years” or “thirty days”
* *money*, e.g., “$5” or “10 Euro”
* *percents and rates*, e.g., “10%” or “50 bps”
* *PII*, e.g., “212-212-2121” or “999-999-9999”
* *ratios*, e.g., “3:1” or “four to three”
* *regulations*, e.g., “32 CFR 170”
* *trademarks*, e.g., “MyApp (TM)”
* *URLs*, e.g., “http://acme.com/”

Normally, extraction of elements using LexNLP works as follows:

1. Load a text to extract elements from:
    ```python
    with open('contract.txt', encoding='utf-8') as f:
        contract = f.read()  # read the whole file as one string
    ```
2. Select **one** type of element (among those mentioned above) to extract:
    * e.g., amounts
3. Import the LexNLP module corresponding to that type of element:
    ```python
    import lexnlp.extract.en.amounts
    ```
4. Extract those elements using the dedicated function from the imported module:
    ```python
    # get_amounts() returns a generator, so wrap it in list() to see the values
    amounts = list(lexnlp.extract.en.amounts.get_amounts(text=contract))
    ```


#### 1.1 Extracting Copyright

In [None]:
import lexnlp.extract.en.copyright
text = "(C) Copyright 1993-1996 Hughes Information Systems Company"
print(list(lexnlp.extract.en.copyright.get_copyright(text)))

#### 1.2 Extract the regulation and acts from the following text

In [None]:
#Extract the regulations
text = """
Pub. L. 107–207, §1, Aug. 5, 2002, 116 Stat. 926, provided that: "This Act [enacting section 8 of this title] may be cited as the 'Born-Alive Infants Protection Act of 2002'."
"""


In [None]:
# Extract the acts

## 2. New utility module: we (the course team) provide a utility function to extract several types of elements at once in a given contract

One bottleneck of LexNLP's approach is that you can't extract **multiple** types of elements at once from a given contract. For each type of element, you must import the corresponding module and call the proper function. That is why, to make our lives easier (the reason for living of any computer scientist), we have implemented a small utility function on top of LexNLP that lets you extract one **or** several types of elements at once from a given contract. The utility file is available on the GitHub page of the course here: https://github.com/maastrichtlawtech/law3027-advanced-legal-analytics/blob/main/notebooks/utils.py

Make sure that the utility file is in the same folder as your notebook (keep your local directory organized in the same way as the GitHub page of the course).

1. Import the utility function:
    ```python
    from utils import extract
    ```
2. Extract one or several types of elements from a contract using that function:
    ```python
    elements = extract(text=contract, element_types=['amounts', 'dates', 'copyright'])
    ```

Let's now have a look at the format in which each type of extracted element is returned by our function.


In [None]:
# import the extract function
from utils import extract

In [None]:
# Extract the Acts
example = "test section 12 of the VERY Important Act of 1954."


In [None]:
# Extract the Amounts
example = "There are ten cows in the dozen acre pasture."


In [None]:
# Extract the Companies
example = "This is Deutsche Bank Securities Inc."


In [None]:
#Extract the Conditions
example = "This will occur unless something else happens."


In [None]:
#Extract the Constraints
example = "The rate shall be no less than 50 bps."


In [None]:
#Extract the Copyrights
example = "(C) Copyright 1993-1996 Hughes Information Systems Company"


In [None]:
#Extract the Courts
example = "To be heard in either SCOTUS or D.C. Cir."


In [None]:
#Extract the CUSIP
example = "This is 39298#QT5 code."


In [None]:
#Extract the Dates
example = "This agreement shall terminate on the 15th day of March, 2020."


In [None]:
#Extract the Definitions
example = "'Advance' means a Revolving Credit Advance"


In [None]:
#Extract the Distances
example = "Within 50 miles of office."


In [None]:
#Extract the Durations
example = "This Agreement shall terminate in nine (9) months."


In [None]:
#Extract the Money
example = "The cost is estimated to 10000 USD."


In [None]:
#Extract the Percents
example = "At a discount of 1%"


In [None]:
#Extract the Personally-identifiable information (PII)
example = "Mary Doe (212-123-4567)"


In [None]:
#Extract the US regulatory references
example = "Pursuant to 123 CFR 456, Provider shall"


In [None]:
#Extract the Ratios
example = "At a leverage ratio of no more than ten to one."


In [None]:
#Extract the Trademarks
example = "Customer agrees to license HAL(TM)"


#### 2.1

So, we have seen that the output of the `extract()` function is a list of dictionaries, where each dictionary element represents an extracted entity that comes in the following format:

```python
{
    'type': '...',  #the type of the extracted entity.
    'element': '...',  #the value of the extracted entity.
    'location': (starting_char, ending_char),  #the location in the text of the extracted entity.
    'details': {'x': '...', 'y': '...', 'z': '...'}  #additional details on the extracted entity.
}
```
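
To get a feel for this format, the following minimal sketch builds two entities by hand and inspects them. The sample text, the `make_entity` helper, and the empty `details` are illustrative assumptions, not actual output of `extract()`. The sketch also assumes that the end offset in `location` is exclusive, so that slicing the text with it returns the element.

```python
from collections import Counter

text = "At a discount of 1% until June 1, 2017."


def make_entity(text, entity_type, element):
    """Build a hand-made entity dict shaped like the format above (hypothetical helper)."""
    start = text.index(element)
    return {
        "type": entity_type,
        "element": element,
        "location": (start, start + len(element)),  # assumption: end offset is exclusive
        "details": {},
    }


entities = [
    make_entity(text, "percents", "1%"),
    make_entity(text, "dates", "June 1, 2017"),
]

# Count how many entities of each type are present.
type_counts = Counter(entity["type"] for entity in entities)

# Recover each element from the text using its location.
recovered = [text[start:end] for start, end in (e["location"] for e in entities)]

print(type_counts)
print(recovered)
```

Checking that the slice given by `location` matches `element` is a quick sanity test before passing the offsets to another tool.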

Now, let's try to extract the entities in a contract. To do so, let's load a few contracts from the [Contract Understanding Atticus Dataset (CUAD)](https://www.atticusprojectai.org/cuad), a publicly available dataset of 510 commercial legal contracts.

* Let us first create a variable that stores the path to the folder containing the contracts: `"../data/CUAD_v1/full_contract_txt/"`. The dataset is already provided on the GitHub page of the course.
* Let us now create a list that contains the name of each file in that folder (hint: use the `os.listdir()` function). Only keep the first 10 file names.
* Create the relative path of each of these ten files by joining the folder path and the file name (hint: iterate over each file name and use the `os.path.join()` function to create the relative path).
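
The steps above can be sketched as follows. So that the example runs anywhere, it uses a temporary folder with made-up file names as a stand-in for the CUAD folder; in the notebook, `folder` would instead hold the path to the contracts. Sorting the file names is an extra step (an assumption on our side) that makes the selection of the first 10 reproducible.

```python
import os
import tempfile

# Stand-in for the CUAD folder: a temporary directory with a few empty .txt files
# (hypothetical file names, created only so that the sketch runs anywhere).
folder = tempfile.mkdtemp()
for i in range(12):
    with open(os.path.join(folder, f"contract_{i:02d}.txt"), "w", encoding="utf-8") as f:
        f.write("")

# os.listdir() returns the names in arbitrary order, so sort them before slicing.
filenames = sorted(os.listdir(folder))[:10]

# Join the folder path and each file name to get the path of each file.
filepaths = [os.path.join(folder, name) for name in filenames]

print(filenames)
```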


In [None]:
#Extract the Urls
example = "A copy of the terms can be found at www.acme.com/terms"


In [None]:
# Get the data paths of the first 10 contracts from CUAD.


#### 2.2

Let's now define a function `read_contract()` to read our text file contracts. The function should take one argument `filepath` (the path to the text file).

* First, load the file by using the Python `open()` function. Make sure to set the `encoding` parameter to `"utf-8"`, or you may run into decoding errors.
* Save the content of the file in a new variable using the `readlines()` method. It will save each line of the file into a list.
* Join all these lines together and save the result in a new variable.
* Make sure to collapse consecutive whitespace characters (spaces, tabs, newlines) into a single space to get a nice-looking text.
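
One possible way to write such a function is sketched below. The throwaway file and its content are made up for the demonstration, and collapsing whitespace with `split()` followed by `join()` is just one of several options.

```python
import os
import tempfile


def read_contract(filepath):
    """Read a text file and return its content as one whitespace-normalized string."""
    with open(filepath, encoding="utf-8") as f:
        lines = f.readlines()  # one list item per line of the file
    text = "".join(lines)
    # split() with no argument cuts on any run of whitespace, so re-joining
    # with a single space collapses consecutive spaces, tabs, and newlines.
    return " ".join(text.split())


# Try it on a small throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("SUPPLY   AGREEMENT\n\n  This Agreement is made on\tJune 1, 2017.\n")
    sample = read_contract(path)

print(sample)
```

Note that the whitespace is normalized before any extraction takes place, so character offsets computed later refer to the cleaned text.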

In [None]:
# Create a function that loads a contract.


#### 2.3 Let's use that function to load our 10 contracts.

* Create a new list variable `contracts` that will store the content of the contracts.
* Iterate over each contract path and load the corresponding contract using our `read_contract()` function.
* Append the loaded contract to our list variable.

In [None]:
# Load our 10 contracts.

#### 2.4 Let's create a new variable `contract1` that saves the first contract from our loaded contracts. Print its content.

In [None]:
# Save the first contract in a new variable and print its content.


#### 2.5 Let's extract entities from our 10 contracts.

* Create a new list variable `to_extract` that contains the following elements: `['acts', 'companies',  'copyright', 'courts', 'cusip', 'conditions', 'dates', 'distances', 'durations', 'money', 'percents', 'pii', 'regulations', 'trademarks', 'urls']`
* Create another list variable `outputs` that will store the extracted entities of each contract.
* Iterate over each contract and extract their entities using the `extract()` function.

In [None]:
# Extract entities from the contracts.


#### 2.6 Let's focus on the extracted entities of the first contract. 

* Create a new variable `output1` that stores the extracted entities from our first contract.
* Print all extracted elements of that contract by iterating over `output1`.

In [None]:
# Print the extracted elements of the first contract.


## 3. Visualization of the elements with spaCy

[spaCy](https://spacy.io/) is a free open-source library for Natural Language Processing in Python. 

It supports 64+ languages, provides pre-trained word vectors, and lets you build custom components for various NLP tasks such as named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking, and more.

In this tutorial, we are interested in another feature of the library, namely its **displacy** visualizer for named entity recognition, which lets us see our extracted entities highlighted directly in the text.

In [None]:
# Import spacy and displacy.
import spacy
from spacy import displacy

With spaCy, once you’ve downloaded and installed a trained pipeline, you can load it via `spacy.load()`. This will return a *Language* object containing all components and data needed to process text. We usually call it `nlp`. If you don't need a specific pipeline, it is also possible to load an empty one in the language of your choice. You can do so by calling `spacy.blank()` with the language of interest (in our case, `"en"`) as parameter.

In [None]:
# Load an empty spaCy pipeline in English called 'nlp'.
nlp = spacy.blank("en")

The second step with spaCy is always to call the `nlp` object on a string of text. This will return a processed *Doc* object that we can work with.

In [None]:
# Call the nlp object on our first contract 'contract1'. Save the result in a variable called 'doc'.
doc = nlp(contract1)

#### 3.1

Now, let's add our extracted entities to our newly created spaCy *Doc* object. This will allow us to visualize them with displacy.

* Create a new list variable that will store each spaCy *Span* (i.e., entity) object.
* Iterate over the extracted entities from `output1` and create a new *Span* object for each of them. To do so, use the `doc.char_span()` method, which takes three arguments as inputs:

    1. the index of the first character of the entity;
    2. the index of the first character after the entity (the end index is exclusive);
    3. the label of the entity (here, its type).

Refer to the documentation of spaCy's `Doc.char_span` here: https://spacy.io/api/doc#char_span

* Append the newly created *Span* object to your list variable **after making sure that the output of `char_span()` is not None** (`char_span()` returns None when the character offsets do not line up with token boundaries).
* Finally, save your list of *Span* objects in `doc.ents`.

In [None]:
# Create the spaCy Span objects from the extracted entities of the first contract.
ents = [] # we want to append the entities returned by spacy's char_span() function in this list
for element in output1:
    ent = #complete code
    #complete code

doc.ents = ents
    

#### 3.2
It's now time to display the contract with its entities highlighted.

* First, import the `ENTITIES_CONFIG` constant from our script `utils`.
* Then, display the entities on the contract by running:

```python
displacy.render(doc, style="ent", jupyter=True, options=ENTITIES_CONFIG)
```

In [None]:
# Render the contract with extracted entities highlighted.


## 4. Saving the contract elements with Pandas

In [None]:
# Import the pandas library
import pandas as pd

#### 4.1
Let's create a function `create_row_data()` that takes as inputs three parameters:

1. `extracted_entities`: the list of extracted entities (each of the form `{'type': ..., 'element': ..., 'location': ...}`) for a given contract.
2. `entity_names`: the list of types of entity that were extracted (NB: you can reuse the value of our previously created variable `to_extract`).
3. `contract_name`: the name of the given contract.

* First, create an empty dictionary called `row_data` whose keys are the values from `entity_names`. To do so, use the `dict.fromkeys()` function.
* Then, add a new key-value pair to that dictionary corresponding to the contract name (i.e., the key should be `name` and the associated value should correspond to the value of our passed argument `contract_name`).
* Next, iterate over each extracted entity and save its type and element.
* Finally, append the element to the previously saved value of the dictionary at the corresponding key (aka type). Separate the different string elements using the '[SEP]' token.
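
A minimal sketch of these steps is given below, as one possible solution rather than the reference one. The sample entities are hand-written for illustration, and two choices are assumptions on our side: elements are converted with `str()` in case an extractor returns non-string values, and entity types that are not listed in `entity_names` are skipped.

```python
def create_row_data(extracted_entities, entity_names, contract_name):
    """Collect the elements of one contract into a single dict, one key per entity type."""
    row_data = dict.fromkeys(entity_names, "")
    row_data["name"] = contract_name
    for entity in extracted_entities:
        entity_type = entity["type"]
        element = str(entity["element"])  # assumption: elements may not be strings
        if entity_type not in entity_names:
            continue  # ignore types that were not requested
        if row_data[entity_type]:
            # A value is already stored for this type: separate with [SEP].
            row_data[entity_type] += "[SEP]" + element
        else:
            row_data[entity_type] = element
    return row_data


# Hand-written entities for illustration (not real extraction output).
sample_entities = [
    {"type": "dates", "element": "June 1, 2017", "location": (0, 12)},
    {"type": "money", "element": "$5", "location": (20, 22)},
    {"type": "dates", "element": "2018-01-01", "location": (30, 40)},
]

row = create_row_data(sample_entities, ["dates", "money", "urls"], "sample.txt")
print(row)
```

Types with no extracted element keep the empty string as their value.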

In [None]:
# Create the function create_row_data().
def create_row_data(extracted_entities, entity_names, contract_name):
    row_data = dict.fromkeys(entity_names, "")
    #complete code here
    
    
    
    return row_data

#### 4.2
Let's now use our `create_row_data()` function on each of the extracted outputs (for each contract).

* Create an empty list variable `data` that will store the different row data.
* Iterate over the different `outputs` and `filenames` at the same time. You can do so by using the `zip()` function.
* For each output, use `create_row_data()` to create the contract dictionary.
* Append that dictionary to our newly created list variable `data`.

In [None]:
# Use the create_row_data() function to create a list of row data for our contracts.
data = []
for out, name in zip(outputs, filenames):
    #complete code
    #complete code

#### 4.3
Let's create a pandas dataframe `df`. You can do so by passing `data` to `pd.DataFrame()`.

In [None]:
# Create a pandas dataframe.


Replace the empty strings in the dataframe with `NaN` values. You can do so by using the `replace()` method with the following regular expression: `r'^\s*$'` (the `r` prefix makes it a raw string, so that backslashes are passed to the regex engine unchanged; `^` matches the start of the string, `\s` matches a whitespace character, `*` means zero or more of them, and `$` matches the end of the string).

NB: make sure to set the `regex` parameter to True.
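
To see which cell values this pattern matches, here is a small sketch using Python's built-in `re` module instead of pandas. The sample values are made up for illustration.

```python
import re

# Same pattern as above: a string made only of whitespace (or nothing at all).
EMPTY_CELL = re.compile(r"^\s*$")

# Hypothetical cell values.
cells = ["", "   ", "\t\n", "21 U.S.C.", " June 1, 2017 "]

is_empty = [bool(EMPTY_CELL.match(cell)) for cell in cells]
print(is_empty)
```

Only the first three values match: a cell with text in it is left alone, even if it has leading or trailing spaces.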

In [None]:
# Replace empty strings by NaN.


## 5. Querying the dataframe

Let's finish the tutorial by querying our new dataframe. For example, let's get all the contracts that deal with the **"21 U.S.C."** regulation. 

Remember, values in a specific column are not unique entities, but the concatenation of multiple ones separated by the [SEP] token. Therefore, to check whether or not one specific value appears in a column, we first have to get each value in a cell by splitting according to that [SEP] token.

#### 5.1
First, we need to drop all rows that have missing values in the `regulations` column (hint: filter using the `notna()` function). The new dataframe after dropping the rows should be called `filtered_df`.

In [None]:
# Drop rows where 'regulations' column has no value.


#### 5.2
Let's now create a condition where we'll split each 'regulations' value according to the [SEP] token and check for each split element if it contains the "21 U.S.C." string.

* Use a *lambda* function of the form: `lambda x: ...`, where x represents a 'regulations' value.
* First, split the 'regulations' value according to the [SEP] token. Use the `split()` method.
* Then, check for each of the split values whether it contains "21 U.S.C.". Use the `str.contains()` method. Since `str.contains()` is a method of pandas Series, wrap the list returned by `split()` in a `pd.Series` first, and pass `regex=False` so that the dots are matched literally. Refer to the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
* Save your lambda function in a variable called `by_value_condition`.
* Finally, apply your condition (lambda function) to the 'regulations' column of your dataframe. Use the `apply()` method.

In [None]:
# Create a by-value condition with a lambda function and filter the dataframe accordingly.


#### 5.3
Create a new condition called `overall_condition` that reuses the previous one but adds a call to `.any()` at the end, so that it returns a single True/False value per contract. Apply that condition (lambda function) to the 'regulations' column.
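
The logic behind the two conditions can be illustrated without pandas. The sketch below is a plain-Python stand-in, not the pandas solution asked for in the exercise: the `regulations` dictionary, its values, and the `by_value` and `overall` helpers are all hypothetical.

```python
SEP = "[SEP]"
TARGET = "21 U.S.C."

# Hypothetical 'regulations' cells, one per contract.
regulations = {
    "contract_a.txt": "21 U.S.C. 355[SEP]32 CFR 170",
    "contract_b.txt": "123 CFR 456",
}


def by_value(cell):
    """Return one True/False per element of the cell (plain substring test)."""
    return [TARGET in part for part in cell.split(SEP)]


def overall(cell):
    """Return True if at least one element of the cell contains the target."""
    return any(by_value(cell))


per_value = by_value(regulations["contract_a.txt"])
matching = [name for name, cell in regulations.items() if overall(cell)]

print(per_value)
print(matching)
```

The by-value version yields one result per element, which cannot be used directly to select rows; reducing it with `any()` gives the single Boolean per contract that row selection needs.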

In [None]:
# Create an overall condition with a lambda function and filter the dataframe accordingly.


#### 5.4
Next, select the rows from the dataframe that satisfy that overall condition.

In [None]:
# Select the rows where the condition holds.


#### 5.5
Finally, print the names of the corresponding contracts.

In [None]:
# Print the corresponding contract names.
