In [2]:
import os
import re
import pandas as pd
import numpy as np

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

## Open Curated Corpus

Tilinghast & Shearman on New York Pleading (1865), Volume 1: <http://galenet.galegroup.com.ezproxy.cul.columbia.edu/servlet/MOML?af=RN&ae=F3703252199&srchtp=a&ste=14>

Gale Document Number: F3703252199 

And Volume 2: <http://galenet.galegroup.com.ezproxy.cul.columbia.edu/servlet/MOML?af=RN&ae=F3703253018&srchtp=a&ste=14>

Gale Document Number: F3703253018

Whittaker on New York Pleading (1852): <http://galenet.galegroup.com.ezproxy.cul.columbia.edu/servlet/MOML?af=RN&ae=F3705343875&srchtp=a&ste=14>

Gale Document Number: F3705343875

Van Santvoord on New York Pleading (1852): <http://galenet.galegroup.com.ezproxy.cul.columbia.edu/servlet/MOML?af=RN&ae=F3703633577&srchtp=a&ste=14>

Gale Document Number: F3703633577
 
Pomeroy on Remedies (1876): <http://galenet.galegroup.com.ezproxy.cul.columbia.edu/servlet/MOML?af=RN&ae=F3705788545&srchtp=a&ste=14>

Gale Document Number: F3705788545

In [3]:
"""Tilinghast & Shearman on New York Pleading (1865)"""
# filepath = os.path.join('../curated_corpus/', '19004049201_PageText.xml')

"""Tilinghast & Shearman on New York Pleading (1865) Volume 2"""
filepath = os.path.join('../curated_corpus/', '19004049202_PageText.xml')

"""Whittaker on New York Pleading (1852)"""
# filepath = os.path.join('../curated_corpus/', '19006691600_PageText.xml')

"""Van Santvoord on New York Pleading (1852)"""
# filepath = os.path.join('../curated_corpus/', '19004543800_PageText.xml')

"""Pomeroy on Remedies (1876)"""
# filepath = os.path.join('../curated_corpus/', '19007263500_PageText.xml')

root = ET.parse(filepath)

Different types of cirations

Legislations
Other treatise literature
Book of statutes (Code, .. 160)

Some citations may not exist (matching problem) - cross reference

Main thing are cases

Can be non standard (conventions... vary over different treatises)

Case citations tend to be more standard

Volume, abbr, page number (Always have)

May or may not have party names

Following citations change the format (page numbers change, Barb ... 82 lets say)

Volume numbers usually 2 or 3 digits (never 4) 

[square citations] different names for the same reporter

16 N.Y. 284; or , 23 P. 381 same case but 2 different volumes, 

seldon N. Y.   U.S. Wheaton

4 US. 31


Try to find corresponding citations in CAP Harvard

Link CAP and MOML

Beginning there are Tables of Cases (follow through) (All cases should be cited here but OCR Text might be messy)

pg.......XX (remove the dots using tokenizer), look at footnotes

Frequency measure

In [21]:
# Try a few pages first
selected_pages = ['00410', '00420', '00430', '00440', '00450', '00460', '00470', '00480', '00490']
selected_text = []
for elem in root.iter('page'):
    if elem.attrib['id'] in selected_pages:
        selected_text.append(elem.find('ocrText').text)

# Example of a text
print(selected_text[4])

# Example of a text
print(selected_text[5])

# Example of a text
print(selected_text[6])


			Intelligibility. Particularity. Time.
			CHAPTER XXXV. HOW THE FACr S SHOULD BE PLEADED. ABTICLE 1. Intelligibilityand particularity.
			2. Conciseness. 3. Positiveness. ART. 1. Intelligibility and particularity. Every pleading must be sufficiently definite and cer- tain in its statements to make the precise nature of the charge or defense apparent to the court and to the adverse party. (Code, § 160.) Intelligibilitly.]-Pleadings must therefore be intelligi- ble, as much under the new system as under the former. (Boye v. Brown, sp. t., 3 How. 391; asf'd, 7 Barb. 80.) They are expressly required by the Code to be couched in plain and ordinary language. (Code, Q§ 142, 149, 153.) Particularity.]-Pleadings must, in order to be "cer- tain," give sufficient particulars of the transactions stated to enable the adverse parties to identify the circum- stances. For this purpose, several descriptive allegations are required, which, though they cannot be made the subject of an issue, and are n

## Regular Expressions

In [28]:
# Try Regex
citations = []
for text in selected_text:
    matches = re.findall("\((.*?)\)", text)
    citations.append(matches)

# Change to DataFrame
df = pd.DataFrame(list(zip(selected_text, citations)), columns=['text', 'citations'])

for _, row in df.iterrows():
    print(row['text'])
    
    for citation in row['citations']:
        print("("+citation+")")
    break


			No other facts to be pleaded. Evidence not to be pleaded. strictness unnecessary. (See Lanning v. Carpenter, 20 N. y. 447, 458; M'Kyring v. Bull, 16 N. Y. 297.) None but material and issuable facts may be pleaded. (Man!l v. Morcwood, 5 Sands. 557; Rensselaer & Wlash. p. S. Co. v. Wetsel, sp. t., 6 How. 68; Williams v. HIayes, sp.t., 5 How. 470.) And though a pleading may state any facts bearing upon the final judgment in the action (Howard v. Tiffany, 3 Sands. 695), it need not (Corwin v. Freland, 6 N. Y. [2 Seld.] 560; rev'g S. C. 6 How. 241; Clleney v. Garbutt, sp. t., 5 H-ow. 467) and must not (Lee v. Elias, 3 Sands. 737; Code Rep. N. S. 116; Sellar v. Sage, sp. t., 13 How. 231; Field v. Morse, sp. t., 8 How. 47; Pitnlam? v. Piitnam, sp. t., 2 Code Rep. 64) contain any allegations which affect only the right of a party to a provisional remedy. Evidence not to be p)leatlded.]-Tle material facts only, and not the circumstances which tend to prove those facts, are to be pleaded. Ev

In [6]:
example_text = df['text'][4]
print(example_text)


			Intelligibility. Particularity. Time.
			CHAPTER XXXV. HOW THE FACr S SHOULD BE PLEADED. ABTICLE 1. Intelligibilityand particularity.
			2. Conciseness. 3. Positiveness. ART. 1. Intelligibility and particularity. Every pleading must be sufficiently definite and cer- tain in its statements to make the precise nature of the charge or defense apparent to the court and to the adverse party. (Code, § 160.) Intelligibilitly.]-Pleadings must therefore be intelligi- ble, as much under the new system as under the former. (Boye v. Brown, sp. t., 3 How. 391; asf'd, 7 Barb. 80.) They are expressly required by the Code to be couched in plain and ordinary language. (Code, Q§ 142, 149, 153.) Particularity.]-Pleadings must, in order to be "cer- tain," give sufficient particulars of the transactions stated to enable the adverse parties to identify the circum- stances. For this purpose, several descriptive allegations are required, which, though they cannot be made the subject of an issue, and are n

## NLTK Named Entity Recognition

In [7]:
# https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
import nltk

# Run these to install
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(example_text)
                  
# Get pattern for NER
# TODO: Change this to ( )
# Find out what is a chunk pattern
pattern = "NP: {<DT>?<JJ>*<NN>}"
     
# Create Chunk Parser
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

# ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))
# print(ne_tree)

[('Intelligibility', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('Particularity', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('Time', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('CHAPTER', 'NN', 'B-NP'),
 ('XXXV', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('HOW', 'NNP', 'O'),
 ('THE', 'NNP', 'O'),
 ('FACr', 'NNP', 'O'),
 ('S', 'NNP', 'O'),
 ('SHOULD', 'NNP', 'O'),
 ('BE', 'NNP', 'O'),
 ('PLEADED', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('ABTICLE', 'NNP', 'O'),
 ('1', 'CD', 'O'),
 ('.', '.', 'O'),
 ('Intelligibilityand', 'NNP', 'O'),
 ('particularity', 'NN', 'B-NP'),
 ('.', '.', 'O'),
 ('2', 'CD', 'O'),
 ('.', '.', 'O'),
 ('Conciseness', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('3', 'CD', 'O'),
 ('.', '.', 'O'),
 ('Positiveness', 'NN', 'B-NP'),
 ('.', '.', 'O'),
 ('ART', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('1', 'CD', 'O'),
 ('.', '.', 'O'),
 ('Intelligibility', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('particularity', 'NN', 'B-NP'),
 ('.', '.', 'O'),
 ('Every', 'DT', 'B-NP'),
 ('pleading', 'NN', 'I-NP'),
 ('must', 'MD', 'O'),
 ('be', 'VB', 'O'),
 ('

## spaCy Named Entity Recognition + Phrase Matching

In [8]:
# Try spaCy NER
# https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

# https://spacy.io/usage/spacy-101#annotations-pos-deps
# Check out Rule based matching using NER and LSTMs!!

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(example_text)
pprint([(X.text, X.label_) for X in doc.ents])

[('ABTICLE 1', 'ORG'),
 ('2', 'CARDINAL'),
 ('3', 'CARDINAL'),
 ('1', 'CARDINAL'),
 ('160', 'CARDINAL'),
 ('Boye v. Brown', 'PERSON'),
 ('3', 'CARDINAL'),
 ('391', 'CARDINAL'),
 ('Barb', 'PERSON'),
 ('80', 'CARDINAL'),
 ('142', 'CARDINAL'),
 ('149', 'CARDINAL'),
 ('153', 'CARDINAL'),
 ('sp', 'GPE'),
 ('26', 'CARDINAL'),
 ('Barb', 'PERSON'),
 ('9', 'CARDINAL'),
 ('13', 'CARDINAL'),
 ('557', 'CARDINAL'),
 ('BlanlMa rd v. Strait', 'GPE'),
 ('sp', 'GPE'),
 ('8', 'CARDINAL'),
 ('83', 'CARDINAL'),
 ("1'arcy v. Lee", 'PERSON'),
 ('10 Abb', 'DATE'),
 ('143', 'CARDINAL'),
 ('Lester v. Jewvett', 'PERSON'),
 ('11', 'CARDINAL'),
 ('N. Y.', 'PERSON'),
 ('1', 'CARDINAL'),
 ('460', 'CARDINAL'),
 ('8', 'DATE'),
 ('Y.', 'PERSON'),
 ('157', 'CARDINAL'),
 ('Thompson', 'ORG'),
 ('22', 'CARDINAL'),
 ('Barb', 'PERSON'),
 ('89', 'CARDINAL'),
 ('Walden v. Crafts', 'PERSON'),
 ('2', 'CARDINAL'),
 ('302', 'CARDINAL'),
 ('4', 'CARDINAL'),
 ('E. D. Smith', 'PERSON'),
 ('490', 'CARDINAL'),
 ('21', 'CARDINAL')]


In [9]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(
			, 'O', ''),
 (Intelligibility, 'O', ''),
 (., 'O', ''),
 (Particularity, 'O', ''),
 (., 'O', ''),
 (Time, 'O', ''),
 (., 'O', ''),
 (
			, 'O', ''),
 (CHAPTER, 'O', ''),
 (XXXV, 'O', ''),
 (., 'O', ''),
 (HOW, 'O', ''),
 (THE, 'O', ''),
 (FACr, 'O', ''),
 (S, 'O', ''),
 (SHOULD, 'O', ''),
 (BE, 'O', ''),
 (PLEADED, 'O', ''),
 (., 'O', ''),
 (ABTICLE, 'B', 'ORG'),
 (1, 'I', 'ORG'),
 (., 'O', ''),
 (Intelligibilityand, 'O', ''),
 (particularity, 'O', ''),
 (., 'O', ''),
 (
			, 'O', ''),
 (2, 'B', 'CARDINAL'),
 (., 'O', ''),
 (Conciseness, 'O', ''),
 (., 'O', ''),
 (3, 'B', 'CARDINAL'),
 (., 'O', ''),
 (Positiveness, 'O', ''),
 (., 'O', ''),
 (ART, 'O', ''),
 (., 'O', ''),
 (1, 'B', 'CARDINAL'),
 (., 'O', ''),
 (Intelligibility, 'O', ''),
 (and, 'O', ''),
 (particularity, 'O', ''),
 (., 'O', ''),
 (Every, 'O', ''),
 (pleading, 'O', ''),
 (must, 'O', ''),
 (be, 'O', ''),
 (sufficiently, 'O', ''),
 (definite, 'O', ''),
 (and, 'O', ''),
 (cer-, 'O', ''),
 (tain, 'O', ''),
 (in, 'O', '

In [10]:
labels = [x.label_ for x in doc.ents]
Counter(labels)

Counter({'ORG': 2, 'CARDINAL': 28, 'PERSON': 10, 'GPE': 3, 'DATE': 2})

In [11]:
displacy.render(nlp(str(doc)), jupyter=True, style='ent')

In [24]:
displacy.render(nlp(str(selected_text[5])), jupyter=True, style='ent')

In [25]:
displacy.render(nlp(str(selected_text[6])), jupyter=True, style='ent')

In [12]:
sentences = [x for x in doc.sents]
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

In [13]:
[(x.orth_,x.pos_, x.lemma_) for x in [y 
                                      for y
                                      in nlp(str(doc)) 
                                      if not y.is_stop and y.pos_ != 'PUNCT']]

[('\n\t\t\t', 'SPACE', '\n\t\t\t'),
 ('Intelligibility', 'PROPN', 'Intelligibility'),
 ('Particularity', 'NOUN', 'particularity'),
 ('Time', 'PROPN', 'Time'),
 ('\n\t\t\t', 'SPACE', '\n\t\t\t'),
 ('CHAPTER', 'NOUN', 'chapter'),
 ('XXXV', 'PROPN', 'XXXV'),
 ('FACr', 'NOUN', 'FACr'),
 ('S', 'PROPN', 'S'),
 ('PLEADED', 'PROPN', 'PLEADED'),
 ('ABTICLE', 'NOUN', 'abticle'),
 ('1', 'NUM', '1'),
 ('Intelligibilityand', 'PROPN', 'Intelligibilityand'),
 ('particularity', 'NOUN', 'particularity'),
 ('\n\t\t\t', 'SPACE', '\n\t\t\t'),
 ('2', 'NUM', '2'),
 ('Conciseness', 'PROPN', 'Conciseness'),
 ('3', 'X', '3'),
 ('Positiveness', 'ADJ', 'positiveness'),
 ('ART', 'PROPN', 'ART'),
 ('1', 'X', '1'),
 ('Intelligibility', 'NOUN', 'intelligibility'),
 ('particularity', 'NOUN', 'particularity'),
 ('pleading', 'NOUN', 'pleading'),
 ('sufficiently', 'ADV', 'sufficiently'),
 ('definite', 'ADJ', 'definite'),
 ('cer-', 'ADJ', 'cer-'),
 ('tain', 'NOUN', 'tain'),
 ('statements', 'NOUN', 'statement'),
 ('precis

In [14]:
dict([(str(x), x.label_) for x in nlp(str(doc)).ents])

{'ABTICLE 1': 'ORG',
 '2': 'CARDINAL',
 '3': 'CARDINAL',
 '1': 'CARDINAL',
 '160': 'CARDINAL',
 'Boye v. Brown': 'PERSON',
 '391': 'CARDINAL',
 'Barb': 'PERSON',
 '80': 'CARDINAL',
 '142': 'CARDINAL',
 '149': 'CARDINAL',
 '153': 'CARDINAL',
 'sp': 'GPE',
 '26': 'CARDINAL',
 '9': 'CARDINAL',
 '13': 'CARDINAL',
 '557': 'CARDINAL',
 'BlanlMa rd v. Strait': 'GPE',
 '8': 'DATE',
 '83': 'CARDINAL',
 "1'arcy v. Lee": 'PERSON',
 '10 Abb': 'DATE',
 '143': 'CARDINAL',
 'Lester v. Jewvett': 'PERSON',
 '11': 'CARDINAL',
 'N. Y.': 'PERSON',
 '460': 'CARDINAL',
 'Y.': 'PERSON',
 '157': 'CARDINAL',
 'Thompson': 'ORG',
 '22': 'CARDINAL',
 '89': 'CARDINAL',
 'Walden v. Crafts': 'PERSON',
 '302': 'CARDINAL',
 '4': 'CARDINAL',
 'E. D. Smith': 'PERSON',
 '490': 'CARDINAL',
 '21': 'CARDINAL'}

## LSTM and Word Embeddings

In [15]:
# Try LSTM and Word or Character Embeddings
# https://www.depends-on-the-definition.com/lstm-with-char-embeddings-for-ner/
# Another useful tutorial
# https://dashayushman.github.io/tutorials/2017/08/19/neural-language-model.html

# Then finally try ELMO NER LSTMs

# Get word, tags and entity
word_tag_ent = [(X, X.tag_, X.ent_iob_) for X in doc]

# Get number of words and tags
words = list(set([(X) for X in doc]))
n_words = len(words);
tags = list(set([(X.tag_) for X in doc]))
n_tags = len(tags); 

max_len = 75
max_len_char = 10

word2idx = {str(w): i + 2 for i, w in enumerate(words)}
word2idx["UNK"] = 1  # unknown values
word2idx["PAD"] = 0  # padding
idx2word = {i: w for w, i in word2idx.items()}
tag2idx = {t: i + 1 for i, t in enumerate(tags)}
tag2idx["PAD"] = 0  # padding
idx2tag = {i: w for w, i in tag2idx.items()}

word2idx

{'apparent': 2,
 '[': 327,
 'as': 67,
 'Strait': 5,
 'necessary': 6,
 '302': 7,
 'Barb': 151,
 'in': 261,
 'stated': 124,
 '.': 400,
 ',': 393,
 'particularity': 121,
 'sufficient': 15,
 'the': 386,
 'they': 276,
 'Y.': 120,
 'intelligi-': 19,
 ';': 396,
 'material': 21,
 '7': 22,
 'included': 23,
 'time': 125,
 '\n\t\t\t': 359,
 'XXXV': 26,
 'couched': 28,
 'and': 283,
 'order': 30,
 'not': 362,
 'charge': 32,
 'several': 33,
 '460': 34,
 'subject': 37,
 't.': 286,
 'or': 265,
 'period': 43,
 'sp': 377,
 '3': 255,
 'will': 47,
 'its': 48,
 'identify': 226,
 'Code': 184,
 '22': 52,
 'former': 53,
 '0': 56,
 'what': 58,
 'v.': 331,
 'at': 324,
 'Seld': 66,
 'cause': 69,
 '80': 70,
 'statements': 280,
 'Chesbrotugl': 72,
 'THE': 73,
 'Every': 76,
 'of': 303,
 'occurred': 336,
 'are': 376,
 'How': 332,
 'issue': 83,
 'a': 318,
 'which': 126,
 '\n\t\t': 87,
 'HOW': 88,
 'be': 392,
 'bound': 93,
 'defense': 333,
 'stances': 95,
 '160': 97,
 '89': 98,
 'an': 99,
 'Boye': 100,
 '8': 127,
 '4'

In [17]:
# from keras.preprocessing.sequence import pad_sequences
# Just use sequence from spacy hash
X_word = [[token.norm] for token in doc]

# X_word = pad_sequences(maxlen=max_len, sequences=X_word, value=word2idx["PAD"], padding='post', truncating='post')

# X_word
# X_word = [word2idx[X]  for X in doc]
# # X_word = [word2idx[w] for w in words]
# # X_word = [[word2idx[w[0]] for w in s] for s in sentences]


In [15]:
# Find out the original source of the Citation, then MATCH IT using similarity using spaCy