## Step 1: Split contract text into individual articles

Split each contract into articles using either the regex method or Elliott's method.
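The splitters themselves live in the local `pipeline` module and are not shown in this notebook. As a rough idea of what a regex-based splitter does, here is a minimal sketch. The heading pattern (`ARTICLE <number>` at the start of a line) and the `split_articles` helper are assumptions for illustration, not the actual implementation.

```python
import re

# Assumed heading format: a line starting with "ARTICLE <number>" (hypothetical;
# the real pattern used by pipeline.Pipeline is not shown in this notebook)
ARTICLE_HEADING = re.compile(r"^[ \t]*ARTICLE[ \t]+\d+\b.*$", re.IGNORECASE | re.MULTILINE)

def split_articles(contract_text):
    """Split a contract into chunks, each starting at an article heading."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(contract_text)]
    if not starts:
        # No headings found: treat the whole contract as a single chunk
        stripped = contract_text.strip()
        return [stripped] if stripped else []
    bounds = starts + [len(contract_text)]
    articles = [contract_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Keep any text before the first heading as its own chunk
    preamble = contract_text[:starts[0]].strip()
    return ([preamble] if preamble else []) + articles

sample = "Preamble.\nARTICLE 1 - Purpose\nBody one.\nARTICLE 2 - Wages\nBody two.\n"
for chunk in split_articles(sample):
    print(repr(chunk))
```

Real contracts (especially OCR'd ones) will have messier headings, so a pattern like this would likely need tuning against the corpus.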

In [1]:
import os
import pipeline

In [2]:
# Old: Load from disk
canadian_pt_path = "/home/research/corpora/contracts/canadian/txt"
len(os.listdir(canadian_pt_path)) # 44,589 contracts *total*

44589

In [3]:
# New: Stream from S3
import boto3

In [5]:
# Retrieve the list of existing buckets
s3 = boto3.client('s3')
response = s3.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

Existing buckets:
  cuecon-textlab
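Once the bucket is visible, the contracts can be streamed instead of read from local disk. The following is a sketch only: `iter_s3_texts` is a hypothetical helper, and it assumes the contracts are stored as UTF-8 `.txt` objects under some key prefix (the actual prefix is not given in this notebook). It relies on the `list_objects_v2` paginator and `get_object`, both part of the boto3 S3 client.

```python
def iter_s3_texts(s3_client, bucket, prefix=""):
    """Yield (key, text) for each .txt object under `prefix` in `bucket`.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".txt"):
                continue
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
            # Assumed encoding: UTF-8; undecodable bytes are replaced, not fatal
            yield key, body.read().decode("utf-8", errors="replace")
```

With the client from the cell above, this would be called as `iter_s3_texts(s3, "cuecon-textlab", prefix=...)`, passing whatever prefix the contracts actually live under.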


In [3]:
# Pick out only English-language contracts
#len(txt_fpaths) #35,931 *English-language* contracts

In [4]:
#pl = pipeline.Pipeline("canadian", canadian_pt_path, lang_list=["eng"],
#                       sample_N=100, splitter="elliott")
pl = pipeline.Pipeline("canadian", canadian_pt_path, lang_list=["eng"],
                       splitter="elliott")

In [5]:
# done!
#pl.split_contracts()

In [6]:
#import json
#fpath = "../canadian_output/01_artsplit_elliott_json/0000102a.json"
#with open(fpath, 'r') as f:
#    data = json.load(f)

## Step 2: Parse the articles using spaCy

In [77]:
# Python imports
import functools
import glob
import json
import logging
import os

# 3rd party imports
import joblib

# Local imports
import pipeline
import main02_parse_articles

In [78]:
# Set up logging
logger = logging.getLogger()
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

In [79]:
# And set it to work with spacy's use of multiprocessing
import multiprocessing_logging
multiprocessing_logging.install_mp_handler()

In [80]:
canadian_pt_path = "/home/research/corpora/contracts/canadian/txt"
len(os.listdir(canadian_pt_path)) # 44,589 contracts *total*
pl = pipeline.Pipeline("canadian", canadian_pt_path, lang_list=["eng"],
                       splitter="elliott")

Load the actual spaCy NLP object (`nlp_eng`), and extend it to include neuralcoref annotations

In [81]:
import spacy
import neuralcoref

In [82]:
print("Loading spaCy core model")
nlp_eng = spacy.load('en_core_web_md', disable=["ner"])
print("Loading spaCy coref model. May take a while...")
neuralcoref.add_to_pipe(nlp_eng);

Loading spaCy core model
Loading spaCy coref model. May take a while...


Annoying but necessary additional step: adding "contract_id" and "article_num" attributes (plus a "coref_list" attribute for the coref data) to spaCy's Doc class, so that we can serialize and deserialize without headaches [x__x]

See https://spacy.io/usage/processing-pipelines#custom-components-attributes

In [121]:
# force=True lets us re-run this cell (e.g. after changing the names or default values) and overwrite
# the extensions (otherwise re-registering an existing extension would raise an Exception)
spacy.tokens.Doc.set_extension("contract_id", default=None, force=True)
spacy.tokens.Doc.set_extension("article_num", default=None, force=True)
spacy.tokens.Doc.set_extension("coref_list", default=[], force=True)

Aaand yet another necessary workaround...

See https://github.com/huggingface/neuralcoref/issues/82#issuecomment-569431503

[update: holding off on this one actually, since I need the coref data... ugh]

In [122]:
cr_test_doc = nlp_eng(u'My sister has a dog. She loves him.')

In [123]:
mentions = [
    {
        "start": mention.start_char,
        "end": mention.end_char,
        "text": mention.text,
        "resolved": cluster.main.text,
    }
    for cluster in cr_test_doc._.coref_clusters
    for mention in cluster.mentions
]
#clusters = list(
#    list(span.text for span in cluster)
#    for cluster in cr_test_doc._.coref_clusters
#)
#resolved = cr_test_doc._.coref_resolved
#response = {}
#response["mentions"] = mentions
#response["clusters"] = clusters
#response["resolved"] = resolved
mentions

[{'start': 0, 'end': 9, 'text': 'My sister', 'resolved': 'My sister'},
 {'start': 21, 'end': 24, 'text': 'She', 'resolved': 'My sister'},
 {'start': 14, 'end': 19, 'text': 'a dog', 'resolved': 'a dog'},
 {'start': 31, 'end': 34, 'text': 'him', 'resolved': 'a dog'}]

K I guess we'll use this representation to avoid the serialization errors :|
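Since the mentions are plain dicts with character offsets, nothing is lost for downstream use: the resolved text can be rebuilt from them later without neuralcoref. A minimal sketch, assuming mentions do not overlap (`resolve_text` is a hypothetical helper, not part of neuralcoref):

```python
def resolve_text(text, mentions):
    """Rebuild coref-resolved text from the plain mention dicts."""
    resolved = text
    # Apply replacements right to left so that earlier offsets stay valid
    for m in sorted(mentions, key=lambda m: m["start"], reverse=True):
        resolved = resolved[:m["start"]] + m["resolved"] + resolved[m["end"]:]
    return resolved

text = "My sister has a dog. She loves him."
mentions = [
    {"start": 0, "end": 9, "text": "My sister", "resolved": "My sister"},
    {"start": 21, "end": 24, "text": "She", "resolved": "My sister"},
    {"start": 14, "end": 19, "text": "a dog", "resolved": "a dog"},
    {"start": 31, "end": 34, "text": "him", "resolved": "a dog"},
]
print(resolve_text(text, mentions))
# My sister has a dog. My sister loves a dog.
```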

In [124]:
# All the neuralcoref attributes for a doc, for future reference:
#cr_test_doc._.has_coref
#cr_test_doc._.coref_resolved
#cr_test_doc._.coref_clusters
#cr_test_doc._.coref_scores

In [125]:
#for cluster in cr_test_doc._.coref_clusters:
#    print(f"===== #{cluster.i}")
#    print(cluster)
#    print(f"main: '{cluster.main}'")
#    print(cluster.mentions)
#    for mention in cluster.mentions:
#        print(mention)
#        print(mention.start)
#        print(mention.end)

In [69]:
def stream_art_data(test_N=None):
    """
    test_N: If set to an int, the function will only yield article data for the first `test_N` contracts.
            Otherwise, if set to None, article data for all contracts is yielded.
    """
    art_data_fpaths = glob.glob("../canadian_output/01_artsplit_elliott_json/*.json")
    # Loop over contracts
    for fnum, fpath in enumerate(art_data_fpaths):
        if test_N is not None and fnum >= test_N:
            # We've already yielded the first `test_N` contracts, so terminate
            break
        with open(fpath, 'r') as f:
            all_articles = json.load(f)
        # Now loop over the articles
        for cur_article in all_articles:
            # We want to yield tuples of (string, {contract_id, article_num})
            art_str = cur_article['text']
            art_data = {'contract_id':cur_article['contract_id'],
                        'article_num':cur_article['section_num']}
            yield (art_str, art_data)
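For reference, the generator above expects each JSON file to hold a list of article records with `text`, `contract_id` and `section_num` keys. The sketch below writes one such file to a temporary directory and reads it back into the `(string, metadata)` tuples that `nlp.pipe(..., as_tuples=True)` consumes. The record values are made up for illustration, and `iter_article_tuples` is a hypothetical stand-in that sorts the file names (plain `glob` order is not guaranteed).

```python
import json
import os
import tempfile

# Made-up records; only the key names come from stream_art_data()
example_articles = [
    {"contract_id": "0000102a", "section_num": 1, "text": "First article text."},
    {"contract_id": "0000102a", "section_num": 2, "text": "Second article text."},
]

def iter_article_tuples(json_dir):
    """Yield (text, {contract_id, article_num}) for every article in json_dir."""
    for fname in sorted(os.listdir(json_dir)):  # sorted for a reproducible order
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(json_dir, fname), "r") as f:
            for art in json.load(f):
                yield art["text"], {"contract_id": art["contract_id"],
                                    "article_num": art["section_num"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "0000102a.json"), "w") as f:
        json.dump(example_articles, f)
    tuples = list(iter_article_tuples(tmp_dir))

print(tuples[0])
```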

In [70]:
#art_data_fpaths = glob.glob("../canadian_output/01_artsplit_elliott_json/*.json")
#first_fpath = art_data_fpaths[0]
#with open(first_fpath, 'r') as f:
#    data = json.load(f)

In [None]:
def remove_unserializable_results(doc):
    doc.user_data = {}
    for x in dir(doc._):
        if x in ['get', 'set', 'has']: continue
        setattr(doc._, x, None)
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has']: continue
            setattr(token._, x, None)
    return doc

In [120]:
def get_coref_data(doc_obj):
    mentions = [
        {
            "start": mention.start_char,
            "end": mention.end_char,
            "text": mention.text,
            "resolved": cluster.main.text,
        }
        for cluster in doc_obj._.coref_clusters
        for mention in cluster.mentions
    ]
    return mentions

In [133]:
def transform_texts(nlp, batch_id, batch_tuples, output_dir):
    # Using spacy's "DocBin" functionality: see https://spacy.io/usage/saving-loading#docs
    batch_bin = spacy.tokens.DocBin(store_user_data=True)
    #print(nlp.pipe_names)
    output_fpath = os.path.join(output_dir, f"{batch_id}.bin")
    if os.path.isfile(output_fpath):  # return None in case same batch is called again
        return None
    print("Processing batch", batch_id)
    for art_doc, art_meta in nlp.pipe(batch_tuples, as_tuples=True):
        # First get a serializable representation of the detected corefs, while the
        # original neuralcoref attributes are still attached to the Doc
        coref_list = get_coref_data(art_doc)
        # Now clear the extension attributes that break serialization. This resets *every*
        # `._` attribute to None, so it has to happen before we set our own attributes below
        art_doc = remove_unserializable_results(art_doc)
        # This is the weird part where we now have to change contract_id and article_num
        # from being metadata to being attributes of the spaCy Doc objects themselves
        art_doc._.contract_id = art_meta["contract_id"]
        art_doc._.article_num = art_meta["article_num"]
        art_doc._.coref_list = coref_list
        # We don't need the meta object anymore, since it's encoded in the Doc itself
        batch_bin.add(art_doc)
    # Now we can use spacy's serialization methods [joblib basically fails at serializing
    # spacy Docs for various reasons]
    # [see https://spacy.io/usage/saving-loading#docs]
    batch_bytes = batch_bin.to_bytes()
    # And save the bytes object to file
    with open(output_fpath, "wb") as f:
        f.write(batch_bytes)
    print("Saved {} texts to {}.bin".format(len(batch_tuples), batch_id))

In [134]:
# Trying to use multiprocessing like in
# https://spacy.io/usage/examples#multi-processing
#output_dir = "./mp_test"
output_dir = "./mp_full"
#art_tuple_stream = stream_art_data(test_N=50)
art_tuple_stream = stream_art_data()

print("Processing texts...")
batch_size = 1000
#batch_size = 200
n_jobs = 16
art_partitions = spacy.util.minibatch(art_tuple_stream, size=batch_size)
executor = joblib.Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes")
do = joblib.delayed(functools.partial(transform_texts, nlp_eng))
tasks = (do(i, batch_tuples, output_dir) for i, batch_tuples in enumerate(art_partitions))
executor(tasks);

Processing texts...
Processing batch 0
Processing batch 1
Processing batch 2
Processing batch 3
Saved 200 texts to 0.bin
Processing batch 4
Saved 200 texts to 3.bin
Processing batch 5
Saved 200 texts to 1.bin
Processing batch 6
Saved 200 texts to 4.bin
Saved 200 texts to 2.bin
Saved 200 texts to 5.bin
Saved 77 texts to 6.bin


[None, None, None, None, None, None, None]
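The seven `None`s are just the return values of `transform_texts`, one per batch. The log also shows how the stream was partitioned: six batches of 200 plus a final batch of 77, i.e. 1,277 articles in that run. A plain-Python sketch of the same lazy batching idea follows; `minibatch` here is a stand-in written for illustration, not spaCy's implementation of `spacy.util.minibatch`.

```python
import itertools

def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable (generator-friendly)."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

# 1,277 dummy items at size 200, matching the batch counts in the log above
batch_sizes = [len(b) for b in minibatch(range(1277), 200)]
print(batch_sizes)
# [200, 200, 200, 200, 200, 200, 77]
```

Because batches are pulled from the generator one at a time, the full set of article texts never has to sit in memory at once.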

Test that it worked by loading one of the saved batches back from disk

In [136]:
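# Load a batch written by the earlier test run (output_dir="./mp_test")
bin_fpath = "./mp_test/6.bin"
with open(bin_fpath, "rb") as f:
    loaded_bytes = f.read()
# store_user_data=True so that the custom `._` attributes are restored along with the docs
loaded_bin = spacy.tokens.DocBin(store_user_data=True).from_bytes(loaded_bytes)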

In [143]:
doc_iter = loaded_bin.get_docs(nlp_eng.vocab)
doc_list = list(doc_iter)

In [144]:
doc_list[5]

Arbitration. 16.Ol The parties have agreed to the concept of pursuing alternative means of dispute resolution. This may include grievance mediation, "referee" systems, or other similar systems. 16.02 a) When either party decides to submit a grievance to arbitration as per Article 1504 b), then the other party shall be so notified in writing by registered mail.b) The parties have agreed on a panel of arbitrators to be used, if required, during the term of this agreement. c) If the parties fail to agree on the selection of a singlearbitrator from among the panel, they shall request the honourable Minister of Labour of the Province of Manitoba to make the appointment from among the said panel.d) In the event that the arbitrators provided for in this section are not available to preside as arbitrator, the parties agree that they will request the Honourable Minister of Labour of the Province of Manitoba to appoint a temporary replacement.16.03 No person shall be appointed as an arbitrator w

In [29]:
processed_arts = []

In [135]:
#for art_nlp, art_meta in nlp_eng.pipe(stream_art_data(test_N=1), as_tuples=True):
#    logger.info(f"Finished processing: {art_meta}")
#    processed_arts.append((art_nlp, art_meta))

In [31]:
len(processed_arts)

52

In [34]:
type(processed_arts[0][0])

spacy.tokens.doc.Doc

In [16]:
statement_list = main02_parse_articles.parallel_parse(pl, nlp_eng, stream_art_data)

Parsing articles in parallel via parallel_parse()


KeyboardInterrupt: 