# Automatic Annotation of Environmental Sentences

## 1. Introduction

### 1.1 Background and Purpose

This notebook applies a rule-based method to annotate environmental text with named entities. It is the third stage in a pipeline for building a domain-specific Named Entity Recognition (NER) dataset. Earlier stages involved sentence segmentation and vocabulary construction.

The goal here is to automatically assign entity labels to pre-segmented sentences using curated vocabulary lists. Each vocabulary term is matched directly in text, and the matching span is annotated with its corresponding category. This produces a labelled dataset that can be used to train downstream NER models.

The method prioritises speed, scale, and reproducibility. It requires no manual annotation or machine learning during this stage.

### 1.2 Objectives

The main objectives of this notebook are:

- Apply each vocabulary category (TAXONOMY, HABITAT, ENV_PROCESS, POLLUTANT, MEASUREMENT) to a large sentence corpus
- Construct an Aho-Corasick trie per category for efficient multi-pattern matching
- Handle overlapping and nested matches through precedence and merging rules
- Apply custom logic for MEASUREMENT entities, ensuring numbers and units are annotated together
- Generate output in a spaCy-compatible `.jsonl` format for downstream model training
- Validate annotations using programmatic checks and spot-checking of random samples

### 1.3 Challenges in Rule-Based Annotation

Rule-based annotation offers simplicity and transparency but presents several challenges that must be addressed carefully:

**Overlapping entities across categories**  
Certain terms may appear in multiple categories or as part of longer expressions. For example, the word "forest" may be listed as a HABITAT but also occur in the species name "African forest elephant" (TAXONOMY). Overlapping matches must be resolved by choosing the longest span or prioritising specific categories.

**Lack of context awareness**  
Exact term matching does not account for semantic context. This can result in false positives where a term has different meanings (e.g. “Amazon” referring to a company vs. a rainforest).

**Span boundaries and formatting**  
Care must be taken to ensure that only the correct characters are included in each entity span. Matches must be aligned to full words and should not include punctuation, whitespace, or extraneous tokens.

**Special treatment for measurement expressions**  
Measurement entities such as “20 kg” or “<10 µg/L” require combined annotation of numbers and units. The number and the unit must be identified and merged into a single span, even when that exact number-unit combination does not appear in the vocabulary list.
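
One possible way to annotate a number together with its unit is a regular expression that accepts an optional comparator, a number, and a unit from a known list. The sketch below is only an illustration under assumptions: `UNITS`, `MEASUREMENT_RE`, and `find_measurements` are hypothetical names, and the short unit list stands in for whatever the MEASUREMENT vocabulary actually contains.

```python
import re

# Hypothetical unit list; a real run would take units from the MEASUREMENT vocabulary.
UNITS = ["µg/L", "mg/L", "kg", "km", "°C", "%"]

# Try longer units first so a short unit never shadows a longer one.
_unit_alt = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

MEASUREMENT_RE = re.compile(
    r"(?<![\w.])"           # not glued to a preceding word character or dot
    r"(?:[<>≤≥~]\s?)?"      # optional comparator such as "<"
    r"\d+(?:[.,]\d+)?"      # integer or decimal number
    r"\s?"                  # optional single space between number and unit
    rf"(?:{_unit_alt})"     # one of the known units
    r"(?![\w/])"            # the unit must not run into a longer token
)

def find_measurements(text):
    """Return [start, end, 'MEASUREMENT'] for each number+unit expression."""
    return [[m.start(), m.end(), "MEASUREMENT"] for m in MEASUREMENT_RE.finditer(text)]

text = "Samples weighed 20 kg and contained <10 µg/L of lead."
for start, end, label in find_measurements(text):
    print(start, end, label, repr(text[start:end]))
```

Spans produced this way could be added to the dictionary matches before overlap resolution, so that a bare unit match is absorbed by the longer number-plus-unit span.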

**Vocabulary coverage and noise**  
Some terms may not be present in the vocabulary and will be missed. Others may be too generic and match unintended text. This stage must balance recall with precision and include mechanisms for iterative refinement of the vocab lists.

## 2. Method Overview

### 2.1 What Is Aho-Corasick Matching?
The Aho-Corasick algorithm matches many string patterns at once. It builds a trie of the vocabulary terms and adds failure links between nodes, so that all terms can be found in a sentence in a single left-to-right pass.

This makes it well-suited to Named Entity Recognition tasks where the goal is to find known phrases (e.g. "climate change", "acid rain") across a large body of text. Unlike running one substring search per term, Aho-Corasick scans the input in time linear in its length plus the number of matches reported, regardless of how many terms the vocabulary contains.

In this notebook, one Aho-Corasick matcher is created for each entity category (e.g. TAXONOMY, HABITAT). Each match includes the matched text, its character span, and its associated category label.
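
To illustrate what such a matcher does internally, here is a minimal pure-Python sketch of the technique. It is not the implementation used in this notebook (which relies on the `ahocorasick` package); `build_tables` and `find_all` are hypothetical names chosen for the example.

```python
from collections import deque

def build_tables(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state

    # Step 1: insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Step 2: breadth-first pass that sets failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports every pattern its fallback state reports.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, tables):
    """Return (start, end, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = tables
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, i + 1, pat))
    return hits

tables = build_tables(["rain", "acid rain", "forest", "rainforest"])
for hit in find_all("acid rain harms the rainforest", tables):
    print(hit)
```

Note that the matcher reports nested and overlapping hits (for example "rain" inside "acid rain"), which is why a separate overlap-resolution step is needed afterwards.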

### 2.2 Why Weak Labelling?
Weak labelling refers to the automatic assignment of labels using predefined rules or resources, rather than manual annotation. In this case, the labels are derived from vocabulary lists matched against the text.

This approach is commonly used when:
- Manual annotation is too costly or time-consuming
- Domain expertise is required to label data accurately
- A large amount of unlabelled text is available

Weak labelling allows researchers to generate useful training data without manual annotation, using only heuristics or dictionaries. It is particularly effective in structured domains like environmental science, where many terms are standardised.

### 2.3 Benefits and Limitations
#### Benefits:
- Fast and scalable to large corpora
- Easy to understand and reproduce
- High recall for known vocabulary terms
- Suitable for domains with well-defined terminology

#### Limitations:
- No understanding of context or meaning
- Fails to detect entities not in the vocabulary
- Can produce false positives (e.g. "Amazon" as rainforest vs company)
- Requires careful span handling to avoid overlap or misalignment

This method is not a replacement for human-labelled data, but it provides a strong starting point. The resulting annotations can be used to train statistical models or refine vocabularies through iteration and validation.

## 3. Implementation

In [44]:
import os
import json
import time
from pathlib import Path
from collections import defaultdict

import ahocorasick

SEGMENTED_PATH = Path("../data/segmented_text")
VOCAB_PATH = Path("../vocabs/final")
OUTPUT_PATH = Path("../data/json")
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [45]:
all_sentences = []

for file_path in SEGMENTED_PATH.rglob("*.txt"):
    with open(file_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
        all_sentences.extend(lines)

print(f"Loaded {len(all_sentences):,} sentences")

Loaded 2,748,655 sentences


In [46]:
matchers = {}

for vocab_file in VOCAB_PATH.glob("*.txt"):
    label = vocab_file.stem.upper()

    with open(vocab_file, encoding="utf-8") as f:
        terms = [line.strip().lower() for line in f if line.strip()]
    
    automaton = ahocorasick.Automaton()
    for term in terms:
        automaton.add_word(term, (term, label))
    automaton.make_automaton()

    matchers[label] = automaton
    print(f"Loaded {len(terms):,} terms for {label}")

Loaded 1,250 terms for POLLUTANT
Loaded 1,303 terms for ENV_PROCESS
Loaded 1,170 terms for MEASUREMENT
Loaded 574 terms for HABITAT
Loaded 129,452 terms for TAXONOMY


In [47]:
# -------- Hyphenated-word boundary check --------
def is_inside_hyphenated_word(text: str, start: int, end: int) -> bool:
    """True if the span touches a '-' that is part of a larger token."""
    return (start > 0 and text[start-1] == '-') or (end < len(text) and text[end] == '-')


# -------- Whole-word Aho-Corasick matching with filtering --------
def annotate_text_with_vocab(text: str, automaton, label: str):
    """
    Return a list of [start, end, label] spans matched in *text*
    using a case-insensitive automaton, respecting word boundaries.
    """
    lowered = text.lower()
    text_len = len(text)
    spans = []

    for end_idx, (term, _) in automaton.iter(lowered):
        start_idx = end_idx - len(term) + 1

        # Word-boundary guards
        before_ok = start_idx == 0 or not text[start_idx-1].isalnum()
        after_ok  = end_idx + 1 == text_len or not text[end_idx+1].isalnum()

        if before_ok and after_ok and not is_inside_hyphenated_word(text, start_idx, end_idx+1):
            spans.append([start_idx, end_idx+1, label])

    # Sort by start position, then by length (shorter first); overlaps are resolved in a later step
    spans.sort(key=lambda s: (s[0], s[1]-s[0]))
    return spans
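
The word-boundary and hyphen guards can be tried in isolation with a small sketch. The helper names `is_whole_word` and `find_whole_words` are hypothetical, and the naive `str.find` loop stands in for the automaton purely for illustration.

```python
def is_whole_word(text, start, end):
    """True if text[start:end] is not glued to a letter, digit or hyphen on either side."""
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    return not (before.isalnum() or before == "-" or after.isalnum() or after == "-")

def find_whole_words(text, term):
    """Case-insensitive whole-word search for a single lowercase term."""
    lowered = text.lower()
    spans = []
    pos = lowered.find(term)
    while pos != -1:
        if is_whole_word(text, pos, pos + len(term)):
            spans.append([pos, pos + len(term)])
        pos = lowered.find(term, pos + 1)
    return spans

# "forest" inside "Deforestation" is rejected; the standalone word is kept.
print(find_whole_words("Deforestation near the forest.", "forest"))
# "river" in "river-bed" is rejected because it touches a hyphen.
print(find_whole_words("river-bed sediment", "river"))
```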

In [48]:
# -------- Rebuild raw annotations using strict matching --------
raw_annotations = defaultdict(list)

for i, sentence in enumerate(all_sentences):
    for label, automaton in matchers.items():
        matches = annotate_text_with_vocab(sentence, automaton, label)
        if matches:
            raw_annotations[sentence].extend(matches)

    if (i + 1) % 100000 == 0:
        print(f"Annotated {i + 1:,}/{len(all_sentences):,} sentences")

print(f"Annotated {len(raw_annotations):,} sentences with at least one entity")

Annotated 100,000/2,748,655 sentences
Annotated 200,000/2,748,655 sentences
Annotated 300,000/2,748,655 sentences
Annotated 400,000/2,748,655 sentences
Annotated 500,000/2,748,655 sentences
Annotated 600,000/2,748,655 sentences
Annotated 700,000/2,748,655 sentences
Annotated 800,000/2,748,655 sentences
Annotated 900,000/2,748,655 sentences
Annotated 1,000,000/2,748,655 sentences
Annotated 1,100,000/2,748,655 sentences
Annotated 1,200,000/2,748,655 sentences
Annotated 1,300,000/2,748,655 sentences
Annotated 1,400,000/2,748,655 sentences
Annotated 1,500,000/2,748,655 sentences
Annotated 1,600,000/2,748,655 sentences
Annotated 1,700,000/2,748,655 sentences
Annotated 1,800,000/2,748,655 sentences
Annotated 1,900,000/2,748,655 sentences
Annotated 2,000,000/2,748,655 sentences
Annotated 2,100,000/2,748,655 sentences
Annotated 2,200,000/2,748,655 sentences
Annotated 2,300,000/2,748,655 sentences
Annotated 2,400,000/2,748,655 sentences
Annotated 2,500,000/2,748,655 sentences
Annotated 2,600,00

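In total, 2,037,619 sentences contain at least one entity; the remaining roughly 711,000 have no vocabulary match and are discarded. This is a trade-off: the output stays focused on sentences that carry a label, but a model trained only on it will see no entity-free sentences as negative examples.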

In [49]:
def has_overlap(spans):
    spans = sorted(spans, key=lambda x: x[0])
    for i in range(len(spans)-1):
        if spans[i][1] > spans[i+1][0]:
            return True
    return False

print("\nSentences with overlapping entity spans:\n")
count = 0
for sent, spans in raw_annotations.items():
    if has_overlap(spans):
        print(f"Sentence: {sent}")
        print(f"Spans: {spans}\n")
        count += 1
        if count == 5:
            break

if count == 0:
    print("No overlapping spans found.")


Sentences with overlapping entity spans:

Sentence: During sediment collection samplers were removed from the metal uprights secured to the river bed and the contents emptied into 5-L containers.
Spans: [[88, 93, 'HABITAT'], [88, 97, 'HABITAT']]

Sentence: All samples were lightly hand ground, pressed and then measured under a He atmosphere under combined Pd and Co excitation radiation and using a high resolution, low spectral interference silicon drift detector.
Spans: [[122, 131, 'POLLUTANT'], [122, 131, 'POLLUTANT']]

Sentence: Results are the average of three repeats after elimination of outliers, a process that minimises intra-sample noise in the laser granulometry.
Spans: [[110, 115, 'POLLUTANT'], [110, 115, 'POLLUTANT']]

Sentence: Earth Surface Processes and Landforms 26: 1237 1248.
Spans: [[28, 37, 'HABITAT'], [28, 37, 'HABITAT']]

Sentence: Springer; 373 390 Folk RL, Ward WC. 1957.
Spans: [[0, 8, 'TAXONOMY'], [0, 8, 'TAXONOMY']]



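When two entity spans overlap, only one is kept. Spans are visited in order of start position, and among spans that start at the same character the longer one wins; any later span that overlaps an already accepted span is dropped. This is intentional.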

Longer spans often provide more precise meaning in context. For example:

- We prefer “acidic rain” over “rain”
- We keep “coastal salt marsh” instead of just “salt” or “marsh”

Shorter spans are typically part of larger phrases and can lead to misleading annotations if extracted alone. By keeping the longest available match, we reduce ambiguity and improve the quality of weak labels.
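
As a worked illustration of longest-match-wins, the sketch below ranks spans purely by length before checking for overlap. `resolve_longest_first` is a hypothetical variant written for this example, and the labels attached to the toy spans are illustrative only.

```python
def resolve_longest_first(spans):
    """Keep the longest spans first, then drop anything that overlaps a kept span."""
    kept = []
    # Longest first; ties are broken by the earlier start position.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append([start, end, label])
    return sorted(kept)

# Candidate spans for the text "coastal salt marsh" (labels are illustrative).
candidates = [
    [8, 12, "POLLUTANT"],   # "salt"
    [13, 18, "HABITAT"],    # "marsh"
    [8, 18, "HABITAT"],     # "salt marsh"
    [0, 18, "HABITAT"],     # "coastal salt marsh"
]
print(resolve_longest_first(candidates))
```

One design point to be aware of: a resolver that orders by start position first will keep an earlier, shorter span over a later, longer one that overlaps it (for example `[0, 5]` over `[3, 12]`), whereas the length-first ordering above keeps `[3, 12]`.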

In [50]:
def resolve_overlaps(spans):
    """
    Resolve overlaps greedily. Spans are visited by start position, longest
    first at each start; a span is kept only if none of its characters is taken.
    Spans must be [start, end, label].
    """
    spans = sorted(spans, key=lambda s: (s[0], -(s[1]-s[0])))
    resolved, occupied = [], set()
    for s, e, lbl in spans:
        if not any(pos in occupied for pos in range(s, e)):
            resolved.append([s, e, lbl])
            occupied.update(range(s, e))
    return resolved

In [51]:
clean_annotations = {}

for sent, spans in raw_annotations.items():
    deduped = list({(s, e, l) for s, e, l in spans})  # remove exact duplicates
    resolved = resolve_overlaps(deduped)
    if resolved:
        clean_annotations[sent] = resolved

print(f"Cleaned annotations for {len(clean_annotations):,} sentences")

Cleaned annotations for 735,542 sentences


In [52]:
overlap_count = 0
for spans in clean_annotations.values():
    if has_overlap(spans):
        overlap_count += 1

print(f"Sentences with overlapping spans after resolution: {overlap_count}")

Sentences with overlapping spans after resolution: 0


In [53]:
import random

def mark_entities(text, spans):
    spans = sorted(spans, key=lambda x: x[0])
    marked = ""
    last = 0
    for start, end, label in spans:
        marked += text[last:start]
        marked += f"[{text[start:end]}|{label}]"
        last = end
    marked += text[last:]
    return marked

samples = random.sample(list(clean_annotations.items()), 5)

for sent, spans in samples:
    print(mark_entities(sent, spans))
    print()

He has also spread [red squirrels|TAXONOMY] across the Highlands and established [goldeneye|TAXONOMY] [ducks|TAXONOMY] as a breeding species.

In 2022, a unit of the CDC reported that out of 2,310 [urine|POLLUTANT] samples collected, more than 80% were laced with detectable traces of [glyphosate|POLLUTANT].

Marine [protected areas|HABITAT] ([MPAs|MEASUREMENT]) require ecologically meaningful designs capable of taking into account the particularities of the species under consideration, the dynamic nature of the [marine environment|HABITAT], and the multiplicity of [anthropogenic|ENV_PROCESS] impacts.

'War on plastic' could strand oil industry's £300bn investment | Major oil firms plan to grow plastic supply to counter impact of shift against fossil fuels | The war on plastic waste could scupper the oil industry’s multi-billion dollar bet that the world will continue to need more fossil fuels to help make the petrochemicals used in [plastics|POLLUTANT], according to a new report.

Coun

In [54]:
jsonl_path = OUTPUT_PATH / "training_data.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for sent, spans in clean_annotations.items():
        json.dump({"text": sent, "label": spans}, f, ensure_ascii=False)
        f.write("\n")

print(f"\nSaved → {jsonl_path}  ({len(clean_annotations):,} lines)")


Saved → ../data/json/training_data.jsonl  (735,542 lines)
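
Before training on the saved file, each record can be checked programmatically. The following is a minimal sketch, assuming records shaped like the ones written above (`{"text": ..., "label": [[start, end, label], ...]}`); `check_record` is a hypothetical helper, not part of the notebook.

```python
import json

def check_record(record):
    """Return a list of problems found in one annotated record."""
    text, problems = record["text"], []
    prev_end = 0
    for start, end, label in sorted(record["label"], key=lambda s: (s[0], s[1])):
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"out of bounds: {start}-{end}")
            continue
        # Entity text should not begin or end with whitespace.
        if text[start:end] != text[start:end].strip():
            problems.append(f"whitespace at edge: {start}-{end}")
        # Spans must not overlap an earlier span.
        if start < prev_end:
            problems.append(f"overlap at {start}")
        prev_end = max(prev_end, end)
    return problems

line = '{"text": "acid rain", "label": [[0, 9, "ENV_PROCESS"], [5, 9, "ENV_PROCESS"]]}'
print(check_record(json.loads(line)))
```

Running a check like this over every line of the `.jsonl` file, and counting the records that return a non-empty list, gives a quick sanity check alongside the random spot-checks.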


In [55]:
# ----------------- Count entities per category -----------------
from collections import Counter

label_counter = Counter()

for spans in clean_annotations.values():
    for _, _, lbl in spans:
        label_counter[lbl] += 1

print("Entity counts by category\n")
for lbl, n in label_counter.most_common():
    print(f"{lbl:<15} : {n:,}")

Entity counts by category

HABITAT         : 373,060
ENV_PROCESS     : 365,599
TAXONOMY        : 258,948
MEASUREMENT     : 111,187
POLLUTANT       : 96,938


In [56]:
from collections import Counter

# Top 30 frequent entity texts per label (lowercased)
for lbl in label_counter:
    c = Counter(
        text[start:end].lower()
        for text, spans in clean_annotations.items()
        for start, end, l in spans
        if l == lbl
    )
    print(f"\n── {lbl} ──")
    for token, n in c.most_common(30):
        print(f"{token:<30} {n}")


── HABITAT ──
habitat                        32782
forest                         25871
ecosystem                      21801
habitats                       21766
river                          20572
ecosystems                     14783
landscape                      12460
forests                        9274
lake                           8154
island                         8084
rivers                         7568
coast                          7220
wood                           6070
grassland                      5635
islands                        4730
landscapes                     4280
wetland                        4225
garden                         4171
reef                           4166
wetlands                       4118
lakes                          3976
bay                            3952
beach                          3886
valley                         3793
national park                  3591
mountain                       3502
hill                           2882
territ

In [2]:
import os
import json
from pathlib import Path
from collections import defaultdict
import re
import spacy
from spacy.training.example import Example
from spacy.tokens import DocBin
import ahocorasick
import time
import random

# Set paths
BASE_DIR = Path("..") / "data" / "sentences"
OUTPUT_DIR = Path("..") / "data" / "json"
VOCAB_DIR = Path("..") / "vocabularies"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Load the input text data
preprocessed_texts = []
for file in BASE_DIR.rglob("*.txt"):
    with open(file, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
        preprocessed_texts.extend(lines)
print(f"📄 Loaded {len(preprocessed_texts):,} total sentences from {BASE_DIR}")


📄 Loaded 0 total sentences from ../data/sentences


In [20]:
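def fix_missing_spaces(text):
    # Insert a space after sentence punctuation that is directly followed by a letter
    # (e.g. "end.next" -> "end. next"); decimals such as "3.5" are left untouched.
    return re.sub(r'(?<=[a-zA-Z0-9])([.?!])(?=[a-zA-Z])', r'\1 ', text)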

with open(BASE_DIR / "env_data.txt", encoding="utf-8") as f:
    preprocessed_texts = [
        fix_missing_spaces(line.strip().lower())
        for line in f if line.strip()
    ]

In [4]:
def build_automaton(vocab_terms):
    A = ahocorasick.Automaton()
    for term in vocab_terms:
        A.add_word(term, term)
    A.make_automaton()
    return A

def is_inside_hyphenated_word(text, start, end):
    # Check if the match is attached to another token via a hyphen
    if start > 0 and text[start - 1] == '-':
        return True
    if end < len(text) and text[end] == '-':
        return True
    return False

def annotate_text_with_vocab(text, automaton, label):
    text_length = len(text)
    matches = []

    # Iterate using the automaton
    for end_index, term in automaton.iter(text):
        start_index = end_index - len(term) + 1

        # Whole word check
        if (start_index == 0 or not text[start_index - 1].isalnum()) and (
            end_index + 1 == text_length or not text[end_index + 1].isalnum()
        ):
            if not is_inside_hyphenated_word(text, start_index, end_index + 1):
                matches.append([start_index, end_index + 1, label])

    # Sort by start index, then shorter spans first; overlaps are kept here
    # and resolved later by resolve_overlaps
    matches.sort(key=lambda x: (x[0], x[1] - x[0]))

    return matches


In [6]:
theme_files = [
    "measurement.txt", 
    "pollutant.txt", 
    "env_process.txt", 
    "habitat.txt", 
    "taxonomy.txt"
]


In [5]:
def resolve_overlaps(entities):
    entities = sorted(entities, key=lambda x: (x[0], -(x[1] - x[0])))
    resolved = []
    occupied = set()
    for start, end, label in entities:
        if not any(pos in occupied for pos in range(start, end)):
            resolved.append([start, end, label])
            occupied.update(range(start, end))
    return sorted(resolved, key=lambda x: x[0])


In [9]:


start_time = time.time()
text_to_annotations = defaultdict(list)

for fname in theme_files:
    theme_name = fname.replace(".txt", "")
    label = theme_name.upper()
    print(f"🔍 Annotating {label}")

    with open(VOCAB_DIR / fname, encoding="utf-8") as f:
        vocab_terms = [line.strip().lower() for line in f if line.strip()]

    automaton = build_automaton(vocab_terms)

    for i, text in enumerate(preprocessed_texts):
        annotations = annotate_text_with_vocab(text, automaton, label)
        if annotations:
            text_to_annotations[text].extend(annotations)
            text_to_annotations[text] = resolve_overlaps(text_to_annotations[text])
        if (i + 1) % 100000 == 0:
            print(f"Processed {i + 1:,}/{len(preprocessed_texts):,} texts...")

# Save final output
annotated_path = OUTPUT_DIR / "training_data.jsonl"
with open(annotated_path, "w", encoding="utf-8") as f:
    for text, annotations in text_to_annotations.items():
        json.dump({"text": text, "label": annotations}, f, ensure_ascii=False)
        f.write("\n")

end_time = time.time()
print(f"✅ Done! Annotated {len(text_to_annotations):,} unique texts.")
print(f"📁 Saved to: {annotated_path}")
print(f"⏱ Total time: {end_time - start_time:.2f} seconds")


🔍 Annotating MEASUREMENT
Processed 100,000/2,910,834 texts...
Processed 200,000/2,910,834 texts...
Processed 300,000/2,910,834 texts...
Processed 400,000/2,910,834 texts...
Processed 500,000/2,910,834 texts...
Processed 600,000/2,910,834 texts...
Processed 700,000/2,910,834 texts...
Processed 800,000/2,910,834 texts...
Processed 900,000/2,910,834 texts...
Processed 1,000,000/2,910,834 texts...
Processed 1,100,000/2,910,834 texts...
Processed 1,200,000/2,910,834 texts...
Processed 1,300,000/2,910,834 texts...
Processed 1,400,000/2,910,834 texts...
Processed 1,500,000/2,910,834 texts...
Processed 1,600,000/2,910,834 texts...
Processed 1,700,000/2,910,834 texts...
Processed 1,800,000/2,910,834 texts...
Processed 1,900,000/2,910,834 texts...
Processed 2,000,000/2,910,834 texts...
Processed 2,100,000/2,910,834 texts...
Processed 2,200,000/2,910,834 texts...
Processed 2,300,000/2,910,834 texts...
Processed 2,400,000/2,910,834 texts...
Processed 2,500,000/2,910,834 texts...
Processed 2,600,00

In [10]:
import random
import json
from pathlib import Path

INPUT_FILE = Path("..") / "data" / "json" / "training_data.jsonl"
OUTPUT_FILE = Path("..") / "data" / "json" / "sample_for_manual_testing.jsonl"

all_data = []

# Load safely, skip blank or bad lines
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
            all_data.append(data)
        except json.JSONDecodeError:
            print("⚠️ Skipped malformed line")

# Sample
sampled = random.sample(all_data, min(1000, len(all_data)))

# Save sample
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    for item in sampled:
        json.dump(item, f, ensure_ascii=False)
        f.write("\n")

print(f"✅ Sampled {len(sampled)} items to {OUTPUT_FILE}")


✅ Sampled 1000 items to ../data/json/sample_for_manual_testing.jsonl
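
Because the sample above is drawn without a fixed seed, each run selects different items. If a stable sample is wanted for manual review, one option is a privately seeded generator, sketched here under assumptions: `reproducible_sample` is a hypothetical helper and the seed value is arbitrary.

```python
import random

def reproducible_sample(items, k, seed=13):
    """Draw up to k items with a private, seeded generator so reruns return the same sample."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(items, min(k, len(items)))

records = [{"id": i} for i in range(100)]
first = reproducible_sample(records, 10)
second = reproducible_sample(records, 10)
print(first == second)
```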


In [None]:


import json
from pathlib import Path

import spacy
from spacy.training.example import Example

# --- Directory Setup ---
BASE_DIR = Path("data") / "raw_data"
OUTPUT_DIR = Path("data") / "processed_data"
annotated_path = OUTPUT_DIR / "training_data.jsonl"
cleaned_path = OUTPUT_DIR / "cleaned_training_data.jsonl"

# --- Helper Functions ---
def has_overlapping_entities(entities):
    sorted_entities = sorted(entities, key=lambda x: x[0])
    for i in range(len(sorted_entities) - 1):
        current_start, current_end, _ = sorted_entities[i]
        next_start, _, _ = sorted_entities[i + 1]
        if current_end > next_start:
            return True
    return False

def resolve_overlaps(entities):
    entities = sorted(entities, key=lambda x: (x[0], -(x[1] - x[0])))
    resolved = []
    occupied = set()
    for start, end, label in entities:
        if not any(pos in occupied for pos in range(start, end)):
            resolved.append([start, end, label])
            occupied.update(range(start, end))
    return sorted(resolved, key=lambda x: x[0])

# --- SpaCy Setup ---
nlp = spacy.blank("en")
nlp.max_length = 5_000_000

# --- Load Annotated Data ---
with open(annotated_path, "r", encoding="utf-8") as f:
    raw_data = [json.loads(line) for line in f if line.strip()]  # skip blank lines

valid_data = []
invalid_data = []

for i, example in enumerate(raw_data):
    text = example["text"]
    annotations = example["label"]

    if has_overlapping_entities(annotations):
        annotations = resolve_overlaps(annotations)

    doc = nlp(text)
    try:
        Example.from_dict(doc, {"entities": annotations})
        valid_data.append({"text": text, "label": annotations})
    except Exception as e:
        invalid_data.append({
            "index": i,
            "error": str(e),
            "text": text,
            "label": annotations
        })

# --- Save Cleaned Data ---
with open(cleaned_path, "w", encoding="utf-8") as f:
    for item in valid_data:
        json.dump(item, f, ensure_ascii=False)
        f.write("\n")

print(f"Cleaned {len(valid_data)} valid samples.")
print(f"Skipped {len(invalid_data)} invalid samples.")
print(f"Saved cleaned annotations to: {cleaned_path}")


In [13]:
import spacy
from pathlib import Path

# Paths
BASE_DIR = Path("..") / "data" / "raw_data"
OUTPUT_DIR = Path("..") / "data" / "processed_data"
VOCAB_DIR = Path("..") / "vocabularies"

input_path = VOCAB_DIR / "taxonomy.txt"
output_path = VOCAB_DIR / "taxonomy_lemmatized.txt"

# Load SpaCy English model
nlp = spacy.load("en_core_web_sm")

# Read original taxonomy terms
with open(input_path, "r", encoding="utf-8") as f:
    terms = [line.strip().lower() for line in f if line.strip()]

lemmatised_terms = set()

for term in terms:
    doc = nlp(term)
    lemma = " ".join([token.lemma_ for token in doc])
    lemmatised_terms.add(lemma)

# Sort and save
lemmatised_sorted = sorted(lemmatised_terms)

with open(output_path, "w", encoding="utf-8") as f:
    for term in lemmatised_sorted:
        f.write(term + "\n")

print(f"Lemmatized {len(terms)} terms down to {len(lemmatised_terms)} unique ones.")
print(f"Saved to: {output_path}")


Lemmatized 4961 terms down to 4554 unique ones.
Saved to: ..\vocabularies\taxonomy_lemmatized.txt
