In [1]:
from utils.loader import load_pdfs, extract_text_from_pdf, clean_ocr_text
import config
from pathlib import Path
from difflib import SequenceMatcher

### Why are duplicates scoring low on Page-window · min page ROUGE-1 F1 score?
This notebook was created in a temporary environment with just two duplicates to test the differences between them.

We compare just these two files, which are exact duplicates yet receive a page-window ROUGE-1 F1 score of 0.00.
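For reference, ROUGE-1 F1 is the harmonic mean of unigram precision and recall, so a score of exactly 0.00 means the two compared windows share no tokens at all. The sketch below is a minimal illustration, assuming whitespace tokenization and lowercasing; `rouge1_f1` is a hypothetical helper, and the project's actual scorer and page-window logic are not shown in this notebook and may differ.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two strings (illustrative only)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two short texts that share half of their tokens.
print(rouge1_f1("golder associates ltd burnaby", "golder associates site file"))
```

Under this definition, even a handful of shared tokens yields a non-zero score, which is why 0.00 on two scans of the same report is suspicious.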

In [2]:
# This folder does not exist in the repo; it was created locally for ease of experimentation

files = load_pdfs(Path("dupes"))
files

[PosixPath('dupes/2535-REPORT K-V_1.1_2000-06-01 Electromagnetic Terrain Conductivity Mapping - Okanagan University College.pdf'),
 PosixPath('dupes/2535-REPORT K-V_1.2_1999-10-21 Electromagnetic Terrain Conductivity Mapping - Okanagan University College.pdf')]

In [3]:
text1 = extract_text_from_pdf(files[0])
text2 = extract_text_from_pdf(files[1])

We can observe that the two files have slightly different OCR contents (e.g. `text2` appears to have captured some additional content/noise in the header).

In [4]:
print(text1)

Golder Associates Ltd.
500 - 4260 Still Creek Drive
Burnaby, British Columbia, Canada V5C 6C6
Telephone (604) 298-6623
Fax (604) 298-5253
rGolder 
Associates
REPORT ON
ELECTROMAGNETIC TERRAIN 
CONDUCTrVITY MAPPING: 
OKANAGAN UNIVERSITY COLLEGE, 
KELOWNA, B.C.
56350 -2g/s-53^ K-V
Submitted to:
Reid Crowther and Partners Ltd. 
Suite 201 - 3275 Lakeshore Dr.
Kelowna, B.C.
V1W 3S9
Attention: Mr. Al Gartner
MINISTRY OF 
ENVIRONMENT, LANDS & P*^®
JUN 0 1 2000
POLLUTION PREVENTION 
AND REMEDIATION BRANCH
DISTRIBUTION:
10 Copies - Reid Crowther and Partners Ltd., Kelowna, B.C.
1 
Copy Reid Crowther and Partners Ltd., Victoria, B.C.
2 
Copies- Golder Associates Ltd., Burnaby, B.C.
October 1999
992-1897
OFFICES IN AUSTRALIA, CANADA, GERMANY, HUNGARY, ITALY, SWEDEN, UNITED KINGDOM, UNITED STATES
October 21, 1999-i-992-1897
TABLE OF CONTENTS
Table of Contentsi 
List of Figuresi
SECTION 
PAGE
1 .0 SCOPE OF WORK.........................................................................................

In [None]:
print(text2)

Victoria File #
WSsrWW^’3S^i
I
 ' “iSKSS^'^’^SSM^^ 
te^s^j^ii^wi^-^- 'ZU'S
^i>^u9aMa«(Hm
26 
- ZO /2^35
Golder Associates Ltd.
500 - 4260 Still Creek Drive
Burnaby, British Columbia, Canada V5C 6C6
Telephone (604) 298-6623
Fax (604) 298-5253
rGolder 
Associates
REPORT ON
ENTERED ON THE SITE 
INFORMATION SYSfuM
Victoria File #
26250-20/. '^3^ Ki/
ELECTROMAGNETIC TERRAIN 
CONDUCTIVITY MAPPING: 
OKANAGAN UNIVERSITY COLLEGE, 
KELOWNA, B.C.
Submitted to:
Reid Crowther and Partners Ltd. 
Suite 201 - 3275 Lakeshore Dr. 
Kelowna, B.C.
V1W 3S9
Attention: Mr. Al Gartner
DISTRIBUTION:
10 Copies- Reid Crowther and Partners Ltd., Kelowna, B.C.
1 
Copy Reid Crowther and Partners Ltd., Victoria, B.C.
2 
Copies - Golder Associates Ltd., Burnaby, B.C.
October 1999
992-1897
OFFICES IN AUSTRALIA, CANADA, GERMANY, HUNGARY, ITALY, SWEDEN, UNITED KINGDOM, UNITED STATES
October 21,1999 
- i- 
992-1897
TABLE OF CONTENTS
Table of Contentsi
List of Figuresi
SECTION 
PAGE
1 .0 SCOPE OF WORK......................

The same holds true after calling our `clean_ocr_text` function.

In [5]:
print(clean_ocr_text(text1))

Golder Associates Ltd. 500 - 4260 Still Creek Drive Burnaby, British Columbia, Canada V5C 6C6 Telephone 604 298-6623 Fax 604 298-5253 rGolder Associates REPORT ON ELECTROMAGNETIC TERRAIN CONDUCTrVITY MAPPING: OKANAGAN UNIVERSITY COLLEGE, KELOWNA, B.C. 56350 -2g/s-53 K-V Submitted to: Reid Crowther and Partners Ltd. Suite 201 - 3275 Lakeshore Dr. Kelowna, B.C. V1W 3S9 Attention: Mr. Al Gartner MINISTRY OF ENVIRONMENT, LANDS P JUN 0 1 2000 POLLUTION PREVENTION AND REMEDIATION BRANCH DISTRIBUTION: 10 Copies - Reid Crowther and Partners Ltd., Kelowna, B.C. 1 Copy Reid Crowther and Partners Ltd., Victoria, B.C. 2 Copies- Golder Associates Ltd., Burnaby, B.C. October 1999 992-1897 OFFICES IN AUSTRALIA, CANADA, GERMANY, HUNGARY, ITALY, SWEDEN, UNITED KINGDOM, UNITED STATES October 21, 1999-i-992-1897 TABLE OF CONTENTS Table of Contentsi List of Figuresi SECTION PAGE 1 .0 SCOPE OF WORK................................................................................................ 1 2 .0 METHOD

In [6]:
print(clean_ocr_text(text2))

Victoria File WSsrWW3Si I iSKSSSSM tesjiiwi-- ZUS iu9aMaHm 26 - ZO /235 Golder Associates Ltd. 500 - 4260 Still Creek Drive Burnaby, British Columbia, Canada V5C 6C6 Telephone 604 298-6623 Fax 604 298-5253 rGolder Associates REPORT ON ENTERED ON THE SITE INFORMATION SYSfuM Victoria File 26250-20/. 3 Ki/ ELECTROMAGNETIC TERRAIN CONDUCTIVITY MAPPING: OKANAGAN UNIVERSITY COLLEGE, KELOWNA, B.C. Submitted to: Reid Crowther and Partners Ltd. Suite 201 - 3275 Lakeshore Dr. Kelowna, B.C. V1W 3S9 Attention: Mr. Al Gartner DISTRIBUTION: 10 Copies- Reid Crowther and Partners Ltd., Kelowna, B.C. 1 Copy Reid Crowther and Partners Ltd., Victoria, B.C. 2 Copies - Golder Associates Ltd., Burnaby, B.C. October 1999 992-1897 OFFICES IN AUSTRALIA, CANADA, GERMANY, HUNGARY, ITALY, SWEDEN, UNITED KINGDOM, UNITED STATES October 21,1999 - i- 992-1897 TABLE OF CONTENTS Table of Contentsi List of Figuresi SECTION PAGE 1 .0 SCOPE OF WORK...........................................................................

### Problem: common subsequences present in ground-truth document are not being correctly identified.

We call `SequenceMatcher` below as a low-tech demonstration of how few common subsequences are being caught in our data.

In [7]:
match = SequenceMatcher(None, clean_ocr_text(text1), clean_ocr_text(text2)).find_longest_match()

# SequenceMatcher finds only a very short portion (3 tokens!) as a common substring.
print(match)
print(clean_ocr_text(text1)[match.a:match.a + match.size])

Match(a=162, b=304, size=32)
 ELECTROMAGNETIC TERRAIN CONDUCT
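One possible confound worth ruling out, which this notebook has not tested: `difflib.SequenceMatcher` defaults to `autojunk=True`. When the second sequence has at least 200 items, any item that makes up more than roughly 1% of it is treated as "popular" and ignored when anchoring matches. At character level on long text, that covers most letters and the space character, which can shorten the longest match considerably. The sketch below uses synthetic strings (not our OCR text) to show the effect; `longest` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Synthetic stand-in strings, long enough (>= 200 characters) for the
# autojunk heuristic to apply. These are NOT the notebook's OCR texts.
shared = "the quick brown fox jumps over the lazy dog " * 6
a = "AAAA " + shared + "ZZZZ"
b = "BBBBBBBB " + shared + "YYYY"


def longest(seq_a, seq_b, autojunk):
    """Longest common block between seq_a and seq_b."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=autojunk)
    return matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))


with_autojunk = longest(a, b, autojunk=True)
without_autojunk = longest(a, b, autojunk=False)

print("autojunk=True :", with_autojunk.size)
print("autojunk=False:", without_autojunk.size)
```

If passing `autojunk=False` lengthens the match on our real texts, part of the problem may lie in the matcher settings rather than the OCR output. Note that disabling the heuristic can be noticeably slower on long documents.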


### First hypothesis: encoding issues.

Encoding and text normalization issues are possible culprits when two OCR-processed documents look visually similar but perform poorly under automated similarity measures like ROUGE or `SequenceMatcher`.

This is a long shot, but we explore it here.

### Approach: aggressively normalize text and encoding by updating `clean_ocr_text`

In [8]:
import unicodedata
import re

def clean_ocr_text(text):
    # Normalize Unicode (NFKC is stronger than NFC)
    text = unicodedata.normalize('NFKC', text)
    
    # Remove zero-width characters and common OCR junk, as our current function does
    text = re.sub(r'[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]', '', text)

    # Normalize whitespace (spaces, newlines, etc.)
    text = re.sub(r'\s+', ' ', text)

    # Strip leading/trailing whitespace
    return text.strip()

In [9]:
match = SequenceMatcher(None, clean_ocr_text(text1), clean_ocr_text(text2)).find_longest_match()

# No change here.
print(match)
print(clean_ocr_text(text1)[match.a:match.a + match.size])

# Finding: normalizing to NFKC does not solve the issue. We still only find a subsequence of 3 tokens...

Match(a=166, b=342, size=32)
 ELECTROMAGNETIC TERRAIN CONDUCT


In [10]:
encoded1 = clean_ocr_text(text1).encode('utf-8')
encoded2 = clean_ocr_text(text2).encode('utf-8')

In [12]:
match = SequenceMatcher(None, encoded1, encoded2).find_longest_match()

# No change here.
print(match)
print(encoded1[match.a:match.a + match.size])

# Finding: manual encoding with .encode does not solve the issue either

Match(a=166, b=349, size=32)
b' ELECTROMAGNETIC TERRAIN CONDUCT'


Key finding: changing the Unicode normalization form or the encoding does not change the result; long common substrings are still not being found.

Splitting on whitespace and comparing lists of tokens rather than characters captures a somewhat longer common subsequence, though the improvement is modest (see below).

In [13]:
text1_clean = clean_ocr_text(text1)
text2_clean = clean_ocr_text(text2)

words1 = text1_clean.split()
words2 = text2_clean.split()

matcher = SequenceMatcher(None, words1, words2)
match = matcher.find_longest_match(0, len(words1), 0, len(words2))

# Result - we get a slightly longer sequence, but not much
print(f"Match: {match}")
print(" ".join(words1[match.a:match.a + match.size]))

Match: Match(a=0, b=14, size=25)
Golder Associates Ltd. 500 - 4260 Still Creek Drive Burnaby, British Columbia, Canada V5C 6C6 Telephone (604) 298-6623 Fax (604) 298-5253 rGolder Associates REPORT ON
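`find_longest_match` reports only a single run, so one misread token (such as `CONDUCTrVITY`) is enough to cut it short. Summing all matching blocks gives a fuller picture of how much the two token lists share. The sketch below is a minimal illustration on toy strings modelled on the excerpts above; `token_overlap` is a hypothetical helper, and `autojunk=False` is our own choice here rather than something used elsewhere in this notebook.

```python
from difflib import SequenceMatcher


def token_overlap(text_a: str, text_b: str):
    """Return (matched_token_count, ratio) for two whitespace-tokenized texts."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    # get_matching_blocks() returns every aligned run, not just the longest;
    # the final block is a zero-length sentinel and adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched, matcher.ratio()


# Toy inputs: one misread token in the first, two junk tokens in the second.
sample_a = "Golder Associates Ltd. REPORT ON ELECTROMAGNETIC TERRAIN CONDUCTrVITY MAPPING"
sample_b = "Victoria File Golder Associates Ltd. REPORT ON ELECTROMAGNETIC TERRAIN CONDUCTIVITY MAPPING"

matched, ratio = token_overlap(sample_a, sample_b)
print(f"matched tokens: {matched}, ratio: {ratio:.2f}")
```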


---

### The likely culprit: OCR junk corrupting the tokens themselves

Compare the printed excerpts above to see how tokens, and even the order of tokens, are represented differently across OCR attempts.

### Next approach: `rapidfuzz`

`rapidfuzz` is a fast string matching library for Python.

### `fuzz.ratio`
Fuzz Ratio is useful when you need to compare two strings and determine their overall similarity. It’s effective for identifying similar strings that may have minor differences due to typos or spelling variations.

### `fuzz.token_sort_ratio`
Tokenizes the strings, sorts the tokens alphabetically, and then calculates the ratio on the re-joined strings, so token order is ignored.

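Note that both scores range from 0 to 100. For `fuzz.ratio`, `100.00` indicates an exact match; for `fuzz.token_sort_ratio`, it indicates that both strings contain the same tokens, regardless of order.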
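To make the two scores concrete: to our understanding, `fuzz.ratio` is a normalized Indel similarity, i.e. `100 * 2 * LCS(a, b) / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. The pure-Python sketch below illustrates that formula and the token-sort variant. It is not `rapidfuzz`'s implementation (which is far faster), and `lcs_length`, `indel_ratio` and `token_sort_indel_ratio` are hypothetical names used only for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity on a 0-100 scale."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def token_sort_indel_ratio(a: str, b: str) -> float:
    """Sort whitespace tokens, re-join them, then score with indel_ratio."""
    return indel_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# A single misread character lowers the score only slightly.
print(indel_ratio("CONDUCTIVITY", "CONDUCTrVITY"))
# Reordered tokens are penalized by the plain ratio but not by the sorted one.
print(indel_ratio("Crowther Reid", "Reid Crowther"))
print(token_sort_indel_ratio("Crowther Reid", "Reid Crowther"))
```

Because the score depends on the total number of matching characters rather than on one unbroken run, isolated OCR errors have a small, proportional effect.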

In [14]:
from rapidfuzz import fuzz

# Success! Fuzzy ratio is correctly finding that these are extremely similar documents.

score = fuzz.ratio(text1_clean, text2_clean)
print(f"fuzz.ratio similarity: {score}")

fuzz.ratio similarity: 96.7181328545781


In [None]:
# Ignoring token order improves the score slightly; note, however, that this may also inflate scores for non-duplicate documents.

score = fuzz.token_sort_ratio(text1_clean, text2_clean)
print(f"fuzz.token_sort_ratio similarity: {score}")

fuzz.token_sort_ratio similarity: 97.42190305206464


### Initial conclusions: `rapidfuzz` may be useful to flag documents for manual review

If a document receives a low ROUGE score but a very high fuzzy similarity score, it may be a good candidate for manual review. We will explore this going forward.
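The rule above could be sketched as a simple filter. Everything here is an assumption for illustration: the thresholds `rouge_max` and `fuzzy_min` are placeholders that have not been tuned on our data, `needs_manual_review` is a hypothetical helper, and the example pairs are made up apart from the scores observed in this notebook.

```python
def needs_manual_review(rouge_f1: float, fuzzy_score: float,
                        rouge_max: float = 0.2, fuzzy_min: float = 90.0) -> bool:
    """Flag a pair whose ROUGE score is low but whose fuzzy score is very high.

    rouge_f1 is on a 0-1 scale and fuzzy_score on a 0-100 scale.
    Both thresholds are placeholder values, not tuned ones.
    """
    return rouge_f1 <= rouge_max and fuzzy_score >= fuzzy_min


# (label, ROUGE-1 F1, fuzzy score); only the first row uses scores from this notebook.
pairs = [
    ("K-V_1.1 vs K-V_1.2", 0.00, 96.7),
    ("hypothetical true duplicate", 0.95, 98.0),
    ("hypothetical unrelated pair", 0.05, 35.0),
]

flagged = [label for label, rouge, fuzzy in pairs if needs_manual_review(rouge, fuzzy)]
print(flagged)
```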