# Rule-Based Pipeline for Intertextuality Detection

This notebook demonstrates the rule-based pipeline for detecting intertextuality in Latin texts.

The pipeline implements a multi-stage approach:
1. **Text preprocessing** - Orthographic normalization (v→u, j→i), prefix assimilation
2. **Text matching** - Finding shared non-stopword tokens
3. **Distance criterion** - Shared words must appear close together
4. **Scissa filter** - Punctuation agreement check
5. **HTRG filter** - Part-of-Speech analysis (optional)
6. **Similarity filter** - Word embedding similarity (optional)

Based on work by Michael Wittweiler, Franziska Schropp, and Marie Revellio.
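
As a rough illustration of stages 1 to 3, the sketch below normalizes orthography, collects shared non-stopword tokens, and applies a distance check. It is a minimal sketch under stated assumptions, not the package's implementation: the function names, the tiny `STOPWORDS` set, and the reading of the distance criterion (two distinct shared words within `max_distance` tokens of each other in the query) are all hypothetical and chosen for illustration only.

```python
import re
from itertools import combinations

# Hypothetical mini stoplist for illustration; the package ships its own list.
STOPWORDS = {"et", "in", "non", "est", "ut", "cum", "ad", "sed"}


def normalize(text: str) -> list[str]:
    """Lowercase, map v->u and j->i, and split into alphabetic tokens."""
    text = text.lower().replace("v", "u").replace("j", "i")
    return re.findall(r"[a-z]+", text)


def shared_tokens(query_tokens: list[str], source_tokens: list[str]) -> set[str]:
    """Return tokens present in both texts, excluding stopwords."""
    return (set(query_tokens) & set(source_tokens)) - STOPWORDS


def within_distance(tokens: list[str], shared: set[str], max_distance: int) -> bool:
    """True if two distinct shared words occur within max_distance tokens."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in shared]
    return any(
        tok_a != tok_b and abs(i - j) <= max_distance
        for (i, tok_a), (j, tok_b) in combinations(positions, 2)
    )


# Toy query (invented) against the opening line of the Aeneid.
query = normalize("Et illud Vergilianum: Arma virumque cano")
source = normalize("Arma uirumque cano, Troiae qui primus ab oris")
shared = shared_tokens(query, source)
print(sorted(shared))                                  # ['arma', 'cano', 'uirumque']
print(within_distance(query, shared, max_distance=3))  # True
```

Normalizing both texts before matching is what lets `virumque` and `uirumque` count as the same token.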

In [1]:
# Reinstall the package from local source to get the latest fixes
%pip install -e ..[rule-based]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///Users/julianschelb/Repositories/locisimiles
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: locisimiles
  Building editable for locisimiles (pyproject.toml) ... done
  Created wheel for locisimiles: filename=locisimiles-0.3.4-py3-none-any.whl size=2789 sha256=46d2ed965d3d1b11bce5fafe7f3437badb20fc7d8b4674e075b7878a4af7cf90
  Stored in directory: /private/var/folders/0l/f_tn4q4x449826sdsnd1zv6c0000gn/T/pip-ephem-wheel-cache-7wwm0xt8/wheels/8c/99/d3/dbf812ed67b5b569bf8284b817cd0533e9824f7c51c2b05455
Successfully built locisimiles
Installing collected packages: locisimiles
  Attempting uninstall: locisimiles
    Found existing installation: locisimil

In [None]:
# # Install Latin spaCy model from LatinCy (HuggingFace)
# # The wheel needs to be renamed with a valid version number
# import os
# import urllib.request

# wheel_path = "/tmp/la_core_web_lg-3.8.0-py3-none-any.whl"
# if not os.path.exists(wheel_path):
#     print("Downloading la_core_web_lg model (~230MB)...")
#     urllib.request.urlretrieve(
#         "https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl",
#         wheel_path
#     )
#     print("Download complete!")

# # Install the renamed wheel
# %pip install /tmp/la_core_web_lg-3.8.0-py3-none-any.whl


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[31mERROR: Invalid wheel filename (invalid version): 'la_core_web_lg-any-py3-none-any'[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [3]:
from locisimiles.pipeline import RuleBasedPipeline, pretty_print
from locisimiles.document import Document
from locisimiles.evaluator import IntertextEvaluator



## Load Example Documents

Load the query document (Hieronymus) and source document (Vergil) from the example CSV files.

In [4]:
# Load example query and source documents
query_doc = Document("./hieronymus_samples.csv", author="Hieronymus")
source_doc = Document("./vergil_samples.csv", author="Vergil")

print("Loaded query and source documents:")
print(f"Query Document: {query_doc}")
print(f"Source Document: {source_doc}")
print("=" * 70)

Loaded query and source documents:
Query Document: Document('hieronymus_samples.csv', segments=11, author='Hieronymus', meta={})
Source Document: Document('vergil_samples.csv', segments=10, author='Vergil', meta={})


## Basic Rule-Based Pipeline

Run the rule-based pipeline with default settings. The pipeline uses lexical matching with stopword filtering and applies various linguistic filters.

In [5]:
# Initialize the rule-based pipeline with default settings
pipeline = RuleBasedPipeline(
    min_shared_words=2,    # Require at least 2 shared non-stopwords
    max_distance=3,        # Shared words must be within 3 tokens of each other
)

print("Pipeline configuration:")
print(f"  Minimum shared words: {pipeline.min_shared_words}")
print(f"  Maximum distance: {pipeline.max_distance}")
print(f"  Number of stopwords: {len(pipeline.stopwords)}")

Pipeline configuration:
  Minimum shared words: 2
  Maximum distance: 3
  Number of stopwords: 89


In [6]:
# Run the pipeline
# query = Hieronymus (the text containing potential quotes)
# source = Vergil (the source text being quoted)
results = pipeline.run(
    query=query_doc,
    source=source_doc,
    query_genre="prose",     # Hieronymus writes prose
    source_genre="poetry",   # Vergil's Aeneid is poetry
)

print("Results of the rule-based pipeline:")
pretty_print(results)

Results of the rule-based pipeline:

▶ Query segment 'hier. adv. iovin. 1.1':
  verg. aen. 10.636          sim=+0.600  P(pos)=1.000

▶ Query segment 'hier. adv. iovin. 1.41':
  verg. aen. 11.508          sim=+0.600  P(pos)=1.000

▶ Query segment 'hier. adv. iovin. 2.36':
  verg. aen. 4.172           sim=+1.000  P(pos)=1.000

▶ Query segment 'hier. adv. pelag. 1.23':
  verg. ecl. 8.62            sim=+0.600  P(pos)=1.000

▶ Query segment 'hier. adv. pelag. 3.11':
  verg. ecl. 3.49            sim=+1.000  P(pos)=1.000

▶ Query segment 'hier. adv. pelag. 3.4':
  verg. georg. 1.197         sim=+0.800  P(pos)=1.000

▶ Query segment 'hier. adv. rufin. 1.17':
  verg. ecl. 3.26            sim=+1.000  P(pos)=1.000

▶ Query segment 'hier. adv. rufin. 1.5':
  verg. aen. 10.875          sim=+0.600  P(pos)=1.000

▶ Query segment 'hier. adv. rufin. 1.6':
  verg. aen. 1.177           sim=+1.000  P(pos)=1.000

▶ Query segment 'hier. adv. rufin. 3.28':
  verg. georg. 2.475         sim=+1.000  P(pos)=1.00

## Evaluate Pipeline Performance

Use the `IntertextEvaluator` to compare the pipeline's predictions against ground truth annotations.

In [7]:
# Create evaluator for the rule-based pipeline
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="./ground_truth.csv",
    pipeline=pipeline,
    top_k=10,
    threshold=0.5,  # Rule-based matches all get probability 1.0, so 0.5 keeps every match
)

print("Evaluation Results:")
print("\nMicro-averaged scores:")
print(evaluator.evaluate(average="micro"))
print("\nMacro-averaged scores:")
print(evaluator.evaluate(average="macro"))

Evaluation Results:

Micro-averaged scores:
   precision  recall   f1  accuracy  fpr  fnr  smr  tp  fp  fn   tn
0        1.0     1.0  1.0       1.0  0.0  0.0  0.0  10   0   0  100

Macro-averaged scores:
   precision  recall   f1  accuracy  fpr  fnr  smr  tp  fp  fn   tn
0        1.0     1.0  1.0       1.0  0.0  0.0  0.0  10   0   0  100

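The columns in the tables above follow from the confusion counts (`tp`, `fp`, `fn`, `tn`). The sketch below recomputes them using the standard textbook definitions; it is an assumption that `IntertextEvaluator` uses exactly these formulas, and `smr` is left out because its definition is not given here. The helper names and the two-query toy counts are hypothetical.

```python
def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics derived from confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
    }


def micro_average(per_query: list[dict]) -> dict:
    """Sum the counts over all queries, then compute the metrics once."""
    totals = {k: sum(q[k] for q in per_query) for k in ("tp", "fp", "fn", "tn")}
    return confusion_metrics(**totals)


def macro_average(per_query: list[dict]) -> dict:
    """Compute the metrics per query, then take the unweighted mean."""
    scores = [confusion_metrics(**q) for q in per_query]
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}


# Counts from the micro-averaged table above.
print(confusion_metrics(tp=10, fp=0, fn=0, tn=100)["f1"])  # 1.0

# Hypothetical counts where micro and macro averages differ.
toy = [
    {"tp": 1, "fp": 1, "fn": 0, "tn": 8},
    {"tp": 1, "fp": 0, "fn": 0, "tn": 9},
]
print(round(micro_average(toy)["precision"], 3))  # 0.667
print(round(macro_average(toy)["precision"], 3))  # 0.75
```

On the sample data both averages coincide because every query is classified perfectly; they diverge only when error rates differ between queries.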

In [8]:
# Evaluate a single query sentence
print("Single sentence evaluation:")
print(evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))

Single sentence evaluation:
{'query_id': 'hier. adv. iovin. 1.41', 'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'accuracy': 1.0, 'errors': 0, 'tp': 1, 'fp': 0, 'fn': 0, 'tn': 9, 'fpr': 0.0, 'fnr': 0.0, 'smr': 0.0}


In [9]:
# Per-sentence detailed results
print("Per-sentence results (first 10):")
print(evaluator.evaluate_all_queries().head(10))

Per-sentence results (first 10):
                 query_id  precision  recall   f1  accuracy  errors  tp  fp  \
0   hier. adv. iovin. 1.1        1.0     1.0  1.0       1.0       0   1   0   
1  hier. adv. iovin. 1.41        1.0     1.0  1.0       1.0       0   1   0   
2  hier. adv. iovin. 2.36        1.0     1.0  1.0       1.0       0   1   0   
3  hier. adv. pelag. 1.23        1.0     1.0  1.0       1.0       0   1   0   
4  hier. adv. pelag. 3.11        1.0     1.0  1.0       1.0       0   1   0   
5   hier. adv. pelag. 3.4        1.0     1.0  1.0       1.0       0   1   0   
6  hier. adv. rufin. 1.17        1.0     1.0  1.0       1.0       0   1   0   
7   hier. adv. rufin. 1.5        1.0     1.0  1.0       1.0       0   1   0   
8   hier. adv. rufin. 1.6        1.0     1.0  1.0       1.0       0   1   0   
9  hier. adv. rufin. 3.28        1.0     1.0  1.0       1.0       0   1   0   

   fn  tn  fpr  fnr  smr  
0   0   9  0.0  0.0  0.0  
1   0   9  0.0  0.0  0.0  
2   0   9  0.0  

## Custom Stopwords

You can customize the stopwords used by the pipeline: pass a custom set via the `stopwords` argument, or load a stoplist from a file with `load_stopwords`.

In [10]:
from pathlib import Path

# Initialize with custom stopwords
custom_stopwords = {"et", "in", "non", "est", "ut", "cum", "ad", "que", "sed"}

pipeline_custom = RuleBasedPipeline(
    min_shared_words=2,
    stopwords=custom_stopwords,  # Use custom stopword set
)

print(f"Custom pipeline with {len(pipeline_custom.stopwords)} stopwords")

Custom pipeline with 9 stopwords


In [11]:
# Or load additional stopwords from a file
# The package includes a comprehensive Latin stoplist
stoplist_path = Path("../src/locisimiles/data/stoplist.txt")

if stoplist_path.exists():
    pipeline_with_stoplist = RuleBasedPipeline(min_shared_words=2)
    pipeline_with_stoplist.load_stopwords(stoplist_path)
    print(f"Loaded stoplist: {len(pipeline_with_stoplist.stopwords)} stopwords total")
else:
    print(f"Stoplist not found at {stoplist_path}")

Loaded stoplist: 666 stopwords total


## Pipeline Parameters

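The rule-based pipeline has several tunable parameters, including `min_shared_words`, `min_complura`, and `max_distance`. The cells below compare a stricter and a more permissive configuration.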

In [12]:
# More restrictive settings - require more shared words
pipeline_strict = RuleBasedPipeline(
    min_shared_words=3,    # Require at least 3 shared non-stopwords
    min_complura=5,        # Minimum 5 adjacent tokens for complura matches
    max_distance=2,        # Stricter distance criterion
)

results_strict = pipeline_strict.run(
    query=query_doc,
    source=source_doc,
)

# Count total matches
total_matches = sum(len(pairs) for pairs in results_strict.values())
queries_with_matches = sum(1 for pairs in results_strict.values() if pairs)

print("Strict pipeline results:")
print(f"  Total matches found: {total_matches}")
print(f"  Queries with matches: {queries_with_matches}")

Strict pipeline results:
  Total matches found: 10
  Queries with matches: 10


In [13]:
# More permissive settings - fewer requirements
pipeline_permissive = RuleBasedPipeline(
    min_shared_words=1,    # Even single word matches
    max_distance=5,        # Allow more spread-out shared words
)

results_permissive = pipeline_permissive.run(
    query=query_doc,
    source=source_doc,
)

total_matches = sum(len(pairs) for pairs in results_permissive.values())
queries_with_matches = sum(1 for pairs in results_permissive.values() if pairs)

print("Permissive pipeline results:")
print(f"  Total matches found: {total_matches}")
print(f"  Queries with matches: {queries_with_matches}")

Permissive pipeline results:
  Total matches found: 10
  Queries with matches: 10

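The match counting repeated in the cells above can be factored into a small helper. This is a sketch that assumes only what those cells already rely on: the results object behaves like a mapping from query ID to a list of matches. The name `summarize_matches` and the toy data are hypothetical.

```python
from typing import Mapping, Sequence


def summarize_matches(results: Mapping[str, Sequence]) -> dict:
    """Count total matches and queries with at least one match."""
    total = sum(len(pairs) for pairs in results.values())
    with_matches = sum(1 for pairs in results.values() if pairs)
    return {
        "total_matches": total,
        "queries_with_matches": with_matches,
        # Guard against an empty results mapping.
        "average_per_query": total / len(results) if results else 0.0,
    }


# Toy stand-in for pipeline output: query ID -> list of matches.
toy_results = {"q1": ["m1"], "q2": [], "q3": ["m2", "m3"]}
print(summarize_matches(toy_results))
# {'total_matches': 3, 'queries_with_matches': 2, 'average_per_query': 1.0}
```

With such a helper, comparing configurations reduces to calling it on each results object, for example `summarize_matches(results_strict)`.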

## Advanced Filters (Optional)

The pipeline supports optional advanced filters:
- **HTRG filter**: Uses Part-of-Speech tagging to ensure grammatical agreement
- **Similarity filter**: Uses word embeddings to verify semantic similarity

These require additional dependencies:
- HTRG: `torch`, `transformers`
- Similarity: `spacy` with the `la_core_web_lg` model
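
To make the idea behind the similarity filter concrete, the sketch below compares two passages by the cosine similarity of their averaged word vectors and keeps the pair only if it reaches a threshold. This is an assumption about how such a filter could work, not a description of the package's code: the function names are hypothetical, the vectors are toy values rather than `la_core_web_lg` embeddings, and the default of 0.3 is borrowed from the `similarity_threshold` setting shown later in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def passes_similarity(query_vecs, source_vecs, threshold: float = 0.3) -> bool:
    """Keep a candidate pair only if the averaged vectors are similar enough."""
    score = cosine_similarity(mean_vector(query_vecs), mean_vector(source_vecs))
    return score >= threshold


# Toy 2-dimensional "embeddings", one vector per word.
similar = passes_similarity([[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]])
dissimilar = passes_similarity([[1.0, 0.0]], [[0.0, 1.0]])
print(similar, dissimilar)  # True False
```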

In [None]:
# Check available optional features
try:
    import torch
    print("✓ PyTorch available - HTRG filter can be used")
except ImportError:
    print("✗ PyTorch not available - HTRG filter disabled")

try:
    import spacy
    print("✓ spaCy available - Similarity filter can be used")
    # Check for Latin model
    try:
        nlp = spacy.load("la_core_web_lg")
        print("  ✓ Latin model (la_core_web_lg) loaded")
    except OSError:
        print("  ✗ Latin model not installed. Run the LatinCy install cell above to install it.")
except ImportError:
    print("✗ spaCy not available - Similarity filter disabled")

✓ PyTorch available - HTRG filter can be used
✓ spaCy available - Similarity filter can be used
  ✗ Latin model not installed. Install via:
    pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl


In [15]:
# Enable optional filters (uncomment to use)
# Note: These require additional dependencies and models

# pipeline_advanced = RuleBasedPipeline(
#     min_shared_words=2,
#     use_htrg=True,           # Enable POS-based filter
#     use_similarity=True,     # Enable embedding similarity filter
#     similarity_threshold=0.3,
#     device="cpu",
# )
# 
# results_advanced = pipeline_advanced.run(
#     query=query_doc,
#     source=source_doc,
# )

## Inspecting Results

Let's examine the structure of the results returned by the pipeline.

In [16]:
# Look at a specific query's results
example_query_id = list(results.keys())[0]
matches = results[example_query_id]

print(f"Query: {example_query_id}")
print(f"Query text: {query_doc.get_text(example_query_id)[:100]}...")
print(f"\nNumber of matches: {len(matches)}")

if matches:
    print("\nTop matches:")
    for j in matches[:5]:
        print(f"  {j.segment.id}:")
        print(f"    Text: {j.segment.text[:80]}...")
        print(f"    Score: {j.candidate_score:.3f}, Probability: {j.judgment_score:.3f}")

Query: hier. adv. iovin. 1.1
Query text: Furiosas Apollinis uates legimus; et illud Uirgilianum: Dat sine mente sonum....

Number of matches: 1

Top matches:
  verg. aen. 10.636:
    Text: tum dea nube caua tenuem sine uiribus umbram in faciem Aeneae uisu mirabile mons...
    Score: 0.600, Probability: 1.000

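For further analysis it can help to flatten the nested results into one row per match. The sketch below relies only on the attributes used in the cell above (`segment.id`, `candidate_score`, `judgment_score`); the function name `flatten_results` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real match objects so the example runs without the package.

```python
from types import SimpleNamespace


def flatten_results(results) -> list[dict]:
    """Turn {query_id: [match, ...]} into a flat list of row dictionaries."""
    rows = []
    for query_id, matches in results.items():
        for match in matches:
            rows.append({
                "query_id": query_id,
                "source_id": match.segment.id,
                "candidate_score": match.candidate_score,
                "judgment_score": match.judgment_score,
            })
    return rows


# Stand-in for one match, using the values printed above.
fake_match = SimpleNamespace(
    segment=SimpleNamespace(id="verg. aen. 10.636"),
    candidate_score=0.6,
    judgment_score=1.0,
)
rows = flatten_results({"hier. adv. iovin. 1.1": [fake_match]})
print(rows[0]["source_id"])  # verg. aen. 10.636
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame` if tabular output is preferred.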

In [17]:
# Summary statistics
total_matches = sum(len(pairs) for pairs in results.values())
queries_with_matches = sum(1 for pairs in results.values() if pairs)

print("Summary Statistics:")
print(f"  Total query segments: {len(results)}")
print(f"  Queries with matches: {queries_with_matches}")
print(f"  Total matches found: {total_matches}")
print(f"  Average matches per query: {total_matches / len(results):.2f}")

Summary Statistics:
  Total query segments: 11
  Queries with matches: 10
  Total matches found: 10
  Average matches per query: 0.91
