# Testing citation detection in eyecite

This notebook sets up a few tests of eyecite to see which citations it detects, as a step toward a more formal test suite. It is also a chance to test pre-processing steps such as correcting OCR errors.

In [130]:
from typing import Callable
from csv import DictReader
import eyecite

## Cleanup

Set up some functions to do the data cleaning and pre-processing.

In [131]:
# We need to generate the cleaning function, because we want to read the
# corrections from a CSV, but eyecite expects a cleaning function that takes
# a string and returns a string. So, we capture the corrections in a closure.
def generate_correct_ocr(corrections_file: str) -> Callable[[str], str]:
    corrections = dict()
    with open(corrections_file, 'r') as file:
        csv = DictReader(file)
        for row in csv:
            corrections[row['mistake']] = row['correction']
    
    def cleaning_func(text: str) -> str:
        for mistake, replacement in corrections.items():
            text = text.replace(mistake, replacement)
        return text
    
    return cleaning_func
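
The corrections file itself is not shown in this notebook. Judging from the column names the function reads, it needs a header row with `mistake` and `correction` columns. The sketch below writes a small sample file in that assumed layout and loads it into a dictionary. The helper `load_corrections`, the file name, and the second row are illustrative assumptions, not the contents of the real file.

```python
import csv
import tempfile
from pathlib import Path

# Assumed layout: a header row with `mistake` and `correction` columns.
# The rows are illustrative, not the contents of the real corrections file.
SAMPLE_CSV = "mistake,correction\nWise.,Wis.\nBai1.,Bail.\n"

def load_corrections(path: str) -> dict[str, str]:
    """Read mistake -> correction pairs from a CSV file (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as file:
        return {row["mistake"]: row["correction"] for row in csv.DictReader(file)}

# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "ocr-errors-sample.csv"
    sample_path.write_text(SAMPLE_CSV, encoding="utf-8")
    corrections = load_corrections(str(sample_path))

print(corrections)
```

Because the replacements are plain substring substitutions, a row such as `Wise.` would also rewrite that string wherever it appears in ordinary prose, so the entries probably need to be as specific as possible.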

Load the list of OCR corrections from a CSV file in this repo.

In [132]:
correct_ocr = generate_correct_ocr("../data/ocr-errors.csv")

Check that it corrects a simple example.

In [133]:
input = "This is a citation to 12 Wise. 345, a case."
output = "This is a citation to 12 Wis. 345, a case."

print(correct_ocr(input))

assert correct_ocr(input) == output

This is a citation to 12 Wis. 345, a case.


## Eyecite pipeline

eyecite has some levers and knobs, which we want to use consistently. Create a function that takes a text and returns the citations found in it, along with their resolutions.

In [134]:
def find_citations(text: str):
    text = eyecite.clean_text(text, ['underscores', 'all_whitespace', 'inline_whitespace', correct_ocr])
    tokenizer = eyecite.tokenizers.HyperscanTokenizer(cache_dir='/tmp/hyperscan_cache')
    citations = eyecite.get_citations(text, tokenizer=tokenizer)
    resolutions = eyecite.resolve_citations(citations)
    return (citations, resolutions)

def print_results(citations, resolutions) -> None:
    print(f'Found {len(citations)} citations and resolved them into {len(resolutions)} groups.')
    
    print('\nThese are the citations:')
    for citation in citations:
        print("- " + citation.corrected_citation())
    
    print('\nThese are the resolutions:')
    for resource in resolutions.keys():
        print("- " + resource.citation.corrected_citation())

Now let's test that on the simple example from above, where we know an OCR correction is needed.

In [135]:
print('Input: ' + input + '\n')
citations, resolutions = find_citations(input)
print_results(citations, resolutions)

Input: This is a citation to 12 Wise. 345, a case.

Found 1 citations and resolved them into 1 groups.

These are the citations:
- 12 Wis. 345

These are the resolutions:
- 12 Wis. 345


## Sample document test

Let's try this on a dummy document for testing purposes.

In [136]:
with open("../data/pretend-document.txt") as file:
    sample = file.read()
citations, resolutions = find_citations(sample)
print_results(citations, resolutions)

Found 4 citations and resolved them into 2 groups.

These are the citations:
- 18 S.C.L. 104
- 2 Bail. 104
- 2 Bail., 104
- 2 Bail. 104

These are the resolutions:
- 18 S.C.L. 104
- 2 Bail. 104


## Test on more formalized examples

Let's test this on our more formalized examples.

Here is a function that can be used to run through our test data.

In [140]:
def test_citation_detection(input: str, expected: str, purpose: str) -> None:
    citations, resolutions = find_citations(input)
    if len(citations) > 0:
        actual = citations[0].corrected_citation()
    else:
        actual = ""
    result = "SUCCESS!" if actual == expected else "FAILURE!"
    output = f"""
    Purpose: {purpose}
    ------------------------
    Input:    {input}
    Expected: {expected}
    Actual:   {actual}
    {result}
    """
    print(output)

test_citation_detection("12 Wise. 345", "12 Wis. 345", "OCR correction")
test_citation_detection("12 wis. 345", "12 Wis. 345", "Case sensitivity")



    Purpose: OCR correction
    ------------------------
    Input:    12 Wise. 345
    Expected: 12 Wis. 345
    Actual:   12 Wis. 345
    SUCCESS!
    

    Purpose: Case sensitivity
    ------------------------
    Input:    12 wis. 345
    Expected: 12 Wis. 345
    Actual:   
    FAILURE!
    


Let's run this on the test suite.

In [138]:
with open("../data/citation-detection-tests.csv", 'r') as file:
    csv = DictReader(file)
    for row in csv:
        test_citation_detection(row['raw'], row['normalized'], row['test_purpose'])


    Purpose: Raw identical to normalized
    ------------------------
    Input:    Caston v. Perry, 18 S.C.L. 104 (1831). 
    Expected: Caston v. Perry, 18 S.C.L. 104 (1831)
    Actual:   18 S.C.L. 104
    FAILURE!
    

    Purpose: Irregular spacing
    ------------------------
    Input:    Caston v. Perry, 18 S. C.L. 104 (1831)
    Expected: Caston v. Perry, 18 S.C.L. 104 (1831)
    Actual:   
    FAILURE!
    

    Purpose: Irregular spacing
    ------------------------
    Input:    Caston v. Perry, 18 S.  C. L. 104 (1831)
    Expected: Caston v. Perry, 18 S.C.L. 104 (1831)
    Actual:   
    FAILURE!
    

    Purpose: Irregular punctuation
    ------------------------
    Input:    Caston v. Perry, 18 S C.L. 104 (1831)
    Expected: Caston v. Perry, 18 S.C.L. 104 (1831)
    Actual:   
    FAILURE!
    

    Purpose: Parallel citation
    ------------------------
    Input:    Caston v. Perry, 2 Bail. 104 (1831)
    Expected: Caston v. Perry, 18 S.C.L. 104 (1831)
    Actual: 

It does appear, though, that our approach to fixing whitespace does not work for reporter abbreviations. Let's check what the cleaning steps do to a citation with extra spaces.

In [139]:
too_many_spaces = 'Caston v. Perry, 18   S.   C.   L.   104 (1831)'
text = eyecite.clean_text(too_many_spaces, ['underscores', 'all_whitespace', 'inline_whitespace', correct_ocr])
print(text)

Caston v. Perry, 18 S. C. L. 104 (1831)


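Presumably eyecite only recognizes `S.C.L.`, not `S. C. L.`: the cleaning steps collapse each run of spaces to a single space, but they leave the spaces between the letters of the abbreviation.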
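
If the spaces inside the abbreviation are the problem, one option may be an extra cleaning step that closes them up before eyecite sees the text. The sketch below is an assumption, not an eyecite feature: `collapse_reporter_spaces` is a hypothetical helper that removes whitespace between consecutive "capital letter plus period" tokens. It is a heuristic, so it would also join personal initials such as `J. R. R.`, and it does nothing for a missing period as in `S C.L.`.

```python
import re

# Whitespace that sits between two "capital letter + period" tokens,
# e.g. the gaps in "S. C. L." (a heuristic, not an eyecite rule).
INITIALS_GAP = re.compile(r"(?<=\b[A-Z]\.)\s+(?=[A-Z]\.)")

def collapse_reporter_spaces(text: str) -> str:
    """Hypothetical cleaning step: join spaced-out initials like 'S. C. L.'."""
    return INITIALS_GAP.sub("", text)

print(collapse_reporter_spaces("Caston v. Perry, 18 S. C. L. 104 (1831)"))
```

Since the notebook already passes `correct_ocr` as a callable in the list of cleaning steps, a function with this signature could presumably be appended to that list in the same way.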

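Most of these tests fail, but that's something to fix later. Note that the first test fails only because the expected value includes the case name and year, while `corrected_citation()` returns just the volume, reporter, and page.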
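
To track progress as fixes land, a pass/fail tally may be easier to read than a printed block per test. The sketch below shows one way to do that. It reuses the `raw`, `normalized`, and `test_purpose` column names from the test CSV, but the two rows are made up for illustration, and `stub_detector` is a hypothetical stand-in for the eyecite pipeline so that the example runs on its own.

```python
from csv import DictReader
from io import StringIO
from typing import Callable, Iterable

# Two illustrative rows in the same column layout as the test CSV.
SAMPLE_ROWS = """raw,normalized,test_purpose
12 Wis. 345,12 Wis. 345,Raw identical to normalized
12 wis. 345,12 Wis. 345,Case sensitivity
"""

def stub_detector(text: str) -> str:
    """Hypothetical stand-in for the eyecite pipeline: echoes the input."""
    return text

def run_suite(rows: Iterable[dict], detect: Callable[[str], str]) -> list[tuple[str, bool]]:
    """Return (purpose, passed) for each row instead of printing as we go."""
    return [(row["test_purpose"], detect(row["raw"]) == row["normalized"]) for row in rows]

results = run_suite(DictReader(StringIO(SAMPLE_ROWS)), stub_detector)
passed = sum(ok for _, ok in results)

print(f"{passed} of {len(results)} tests passed")
for purpose, ok in results:
    if not ok:
        print(f"- failed: {purpose}")
```

In the notebook, `detect` would be a small wrapper around `find_citations` that returns the first citation's `corrected_citation()`, or an empty string when nothing is found.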