# Exploring the eyecite library


The purpose of this notebook is to explore the eyecite module ([documentation](https://freelawproject.github.io/eyecite/)), which is designed to find citations in legal documents.


In [93]:
import eyecite

We know the module was created and used by CAP, so it stands to reason that what is true of CAP's front end will also be true of the library itself. One problem with CAP is that it does not recognize antique reporters. For instance, we know that Kelly is an early version of the Georgia Reports. So let's create some dummy text and see if we can find the citations.


In [94]:
text = """
    This principle was established in 3 Kelly 234. That citation is equivalent to 3 Ga. 234. 
    It was also established in 12 U.S. 534. 
"""

cites = eyecite.get_citations(text)
print(f"We found {len(cites)} citations.")

We found 2 citations.


So as expected, we found 2 citations rather than 3. Let's look at each of the citations.


In [95]:
print(cites[0])
print(cites[1])

FullCaseCitation('3 Ga. 234', groups={'volume': '3', 'reporter': 'Ga.', 'page': '234'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court=None, plaintiff=None, defendant=None, extra=None))
FullCaseCitation('12 U.S. 534', groups={'volume': '12', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant=None, extra=None))


So we found two citations: the Georgia one (not the Kelly one) and the U.S. one. Note that eyecite figured out that `U.S.` meant that the court is `scotus`.

Note that it also knows that the reporter is "Georgia Reports," that it is a state reporter from Georgia, and that the start date for that reporter is 1846.


In [96]:
ga = cites[0]
ga.all_editions

(Edition(reporter=Reporter(short_name='Ga.', name='Georgia Reports', cite_type='state', source='reporters', is_scotus=False), short_name='Ga.', start=datetime.datetime(1846, 1, 1, 0, 0), end=None),)

This might imply that if we teach eyecite about Kelly, it will be able to pick it up.
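Until eyecite itself knows about Kelly, one possible workaround is to normalize the text before passing it to `eyecite.get_citations`. The sketch below is not part of eyecite: the `NOMINATIVE_TO_MODERN` mapping and the `normalize_reporters` helper are hypothetical names, and the approach assumes the old and new volume numbers coincide, as they do in the `3 Kelly 234` / `3 Ga. 234` example above.

```python
import re

# Hypothetical mapping from an antique reporter name to the abbreviation
# eyecite already recognizes. Only the Kelly -> Ga. pair comes from the
# notebook; anything else would need to be verified before being added.
NOMINATIVE_TO_MODERN = {"Kelly": "Ga."}


def normalize_reporters(text, mapping=NOMINATIVE_TO_MODERN):
    """Rewrite '<volume> <old reporter> <page>' with the modern abbreviation."""
    for old, new in mapping.items():
        pattern = re.compile(rf"\b(\d+)\s+{re.escape(old)}\s+(\d+)\b")
        # Bind the replacement abbreviation as a default argument so the
        # callback does not depend on the loop variable.
        text = pattern.sub(
            lambda m, new=new: f"{m.group(1)} {new} {m.group(2)}", text
        )
    return text


sample = "This principle was established in 3 Kelly 234."
normalized = normalize_reporters(sample)
print(normalized)
```

The normalized string can then be handed to `eyecite.get_citations` as before. The trade-off is that the matched text no longer reflects the original wording, so this is only a stopgap compared with extending the reporter data.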

We can also try to resolve the citations. Let's do it for these.


In [97]:
res = eyecite.resolve_citations(cites)
res

defaultdict(list,
            {Resource(citation=FullCaseCitation('3 Ga. 234', groups={'volume': '3', 'reporter': 'Ga.', 'page': '234'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court=None, plaintiff=None, defendant=None, extra=None))): [FullCaseCitation('3 Ga. 234', groups={'volume': '3', 'reporter': 'Ga.', 'page': '234'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court=None, plaintiff=None, defendant=None, extra=None))],
             Resource(citation=FullCaseCitation('12 U.S. 534', groups={'volume': '12', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant=None, extra=None))): [FullCaseCitation('12 U.S. 534', groups={'volume': '12', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant=None, extra

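As the output shows, each full citation simply resolves to a `Resource` that wraps the citation itself. Linking citations to actual case records is apparently something you have to extend yourself.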


## Exploring the module itself


We can see the various regexes that are used. For example:


In [98]:
print(eyecite.regexes.POST_FULL_CITATION_REGEX)


    (?:  # handle a full cite with a valid year paren:
        # content before year paren:
        (?:
            # pin cite with comma and extra:
            
    (?P<pin_cite>
        # optional comma, space, "at" before pin cite
        ,?\ ?(?:at\ )?
        # first mandatory page number
        
    # optional label (longest to shortest):
    (?:
        (?:
            (?:&\ )?note|       # note, & note
            (?:&\ )?nn?\.?|     # n., nn., & nn.
            (?:&\ )?fn?\.?|     # fn., & fn.
            ¶{1,2}|             # ¶
            §{1,2}|             # §
            \*{1,4}|            # *
            pg\.?|              # pg.
            pp?\.?              # p., pp.
        )\ ?  # optional space after label
    )?
    (?:
        # page:paragraph cite, like 123:24-25 or 123:24-124:25:
        \d+:\d+(?:-\d+(?::\d+)?)?|
        # page range, like 12 or 12-13:
        \d+(?:-\d+)?
    )

        # optional additional page numbers
        (?:,\ ?
    # optional lab
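To get a feel for how one piece of this pattern behaves, the page-number alternation from the output above can be compiled on its own. This is a minimal sketch using only the standard `re` module; the name `PAGE_REF` is made up for illustration, and the fragment is copied from the printed regex rather than imported from eyecite.

```python
import re

# Fragment copied from the printed regex above. re.VERBOSE lets the
# pattern keep its whitespace and inline comments.
PAGE_REF = re.compile(
    r"""
    (?:
        # page:paragraph cite, like 123:24-25 or 123:24-124:25:
        \d+:\d+(?:-\d+(?::\d+)?)?|
        # page range, like 12 or 12-13:
        \d+(?:-\d+)?
    )
    """,
    re.VERBOSE,
)

for candidate in ["12", "12-13", "123:24-25", "123:24-124:25", "xii"]:
    matched = PAGE_REF.fullmatch(candidate) is not None
    print(f"{candidate!r}: {'matches' if matched else 'no match'}")
```

The last candidate is a Roman numeral, which this fragment is not written to accept.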

The eyecite package itself depends on the reporters_db and courts_db packages, both of which hold a lot of data. Here is the [reporters_db data](https://github.com/freelawproject/reporters-db/tree/main/reporters_db/data). Here is the [courts-db data](https://github.com/freelawproject/courts-db/tree/main/courts_db/data). Unsurprisingly, Kelly (our example above) is not in the reporters database. But it could be.


## Trying to do something real


We have some sample treatises. So let's see if we can extract citations from one of those.


In [99]:
with open('../data/sample-treatises/Pomeroy, Remedies, 1876.txt') as f:
    treatise = f.read()

print(treatise[:3000])

Disclaimer: This file is generated using OCR (optical character recognition) and/or HTR (handwritten text recognition), which are technologies that convert images of text into text. While the technologies are good at deciphering legible text, there are limitations and some text may not have been extracted correctly.
Remedies and remedial rights by the civil action, according to the reformed American procedure : a treatise adapted to use in all the states and territories where that system prevails
c REME D IES,AND REMEDIAL RIGHTS,DY THE CIVIL ACTION,,ACCOI1DINO TO THE REFORMED AMERICAN PROCEDURE.,A TREATISE ADAPTED TO USE IN ALL THE STATES AND TERRITORIES,WHERE TIIAT SYSTEM PREVAILS.,BY JOHN NORTON POMEROY, LL.D., AUTROR OP "AN INTIODUCTrIOT TO MUNICIPAL LAIV," ' AN INTRODUCnOTI,TO CONBTITUTIONAL LAW," ETC., ETC. B,5 /Osrp BOSTON: LITTLE, BROWN, AND COMPANY.,1876.,1
*,;,)/~. . . Entorole according to Act of Congress, In thq year 1870, by,JOIlt0 NOnTON 1'OlIF.OY, In tle Omce of the Libra

Now that we have the text, we will clean it, find the citations, and reduce the citations down to unique instances. The [eyecite tutorial](https://github.com/freelawproject/eyecite/blob/main/TUTORIAL.ipynb) provides a guide.


In [100]:
cleaned_text = eyecite.clean_text(treatise, ['underscores', 'all_whitespace'])
tokenizer = eyecite.tokenizers.HyperscanTokenizer(cache_dir='/tmp/test_cache')
citations = eyecite.get_citations(cleaned_text, tokenizer=tokenizer)
resolutions = eyecite.resolve_citations(citations)
print(f'Found {len(citations)} citations and resolved them into {len(resolutions)} groups.')

Found 7030 citations and resolved them into 3276 groups.


That seems like a reasonable result, keeping in mind that citations to reporters eyecite does not recognize (such as Kelly) will have been missed.

Now let's just print out a list of the citations.


In [101]:
for resource in resolutions.keys():
    print(resource.citation.corrected_citation())

32 Cal. 172
31 Barb. 288
36 N.Y. 613
31 N.Y. 664
28 Barb. 382
4 Lans. 164
2 Vt. 9
19 Ind. 339
19 Ind. 418
37 Mo. 141
81 Ind. 20
43 Ind. 167
48 Ind. 12
39 Cal. 292
17 Abb. Pr. 169
39 Mo. 145
21 Ind. 137
3 Lans. 116
2 Barb. 258
37 Iowa, 454
16 Ired. 89
2 Duv. 800
46 Cal. 482
1 N.Y. 305
44 N.Y. 228
88 N.Y. 280
11 How. Pr. 218
11 Ohio St. 874
7 N.Y. 476
48 Ind. 496
44 Mo. 263
16 Ohio St. 146
16 N.Y. 416
46 N.Y. 688
41 Ind. 164
41 Mo. 484
48 Mo. 819
9 Kan. 401
45 Ind. 156
26 Ind. 461
68 Barb. 288
28 N.Y. 600
2 Duv. 480
16 Barb. 633
21 Ala. 487
47 N.Y. 487
44 Cal. 809
28 How. Pr. 324
8 How. Pr. 434
20 Iowa, 66
18 Iowa, 241
4 Kan. 211
10 Cal. 150
4 Wend. 406
17 How. Pr. 76
8 Minn. 254
32 Ind. 408
26 Mo. 210
29 N.Y. 494
3 Abb. Pr. 184
9 Abb. Pr. 353
29 Mo. 429
23 Ohio St. 543
2 Kan. 135
47 N.Y. 860
44 N.Y. 63
31 Ind. 270
18 Barb. 264
63 Barb. 454
51 Barb. 466
1 Kan. 803
4 Abb. Pr. 176
44 Barb. 577
2 Duer, 160
2 Paige Ch., 278
36 Me. 50
29 Iowa, 631
16 Barb. 64
7 Cal. 661
1 Daly, 469
6 Mass. 46

Here's what the "resolutions" look like.

In [102]:
sample_res = list(resolutions.keys())[6]
print("This is a deduplicated case.")
print(sample_res)
print("\nAnd these are all the examples of raw text that were matched to that case.")
for cite in resolutions[sample_res]:
    print(cite.matched_text())

This is a deduplicated case.
Resource(citation=FullCaseCitation('2 Vt. 9', groups={'volume': '2', 'reporter': 'Vt.', 'page': '9'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court=None, plaintiff='277', defendant='Hall', extra=None)))

And these are all the examples of raw text that were matched to that case.
2 Vt. 9
2 Vt. 9
Ibid.
2 Vt. 9


Let's find the unique reporters.

In [113]:
# Collect the corrected reporter abbreviation of every resolved case,
# deduplicate with a set, and sort alphabetically.
reporters = sorted({case.citation.corrected_reporter() for case in resolutions})
for reporter in reporters:
    print(reporter)

A.
Abb.
Abb. Pr.
Ala.
Alaska
Am. Law Reg.
Ark.
B. Mon.
Barb.
Barb. Ch.
Bosw.
Bush
Cai. Cas.
Cal.
Call
Code Rep.
Conn.
Cow.
Cush.
Daly
Dana
Denio
Dev. & Bat. Eq.
Duer
Duv.
Fla.
Ga.
Gratt.
Gray
Hill
How.
How. Pr.
Ill.
Ind.
Iowa
Ired.
Johns.
Johns. Cas.
Johns. Ch.
Jones Eq.
Kan.
La. Ann.
Lans.
Litt.
Mart.
Mass.
McCord Eq.
Md.
Me.
Metc.
Mich.
Mills Surr.
Minn.
Mo.
Mont.
Munf.
N.Y.
N.Y. Sup. Ct.
Neb.
Nev.
Ohio
Ohio St.
Paige Ch.
Phil.
Pick.
Rob.
Scam.
Smith & H.
Story
Vt.
W. Va.
Wall.
Watts
Wend.
Wheat.
Wis.
Yer.
