# Vocabulary Curation for Environmental Science NER

## 1. Introduction

### 1.1 Background and Purpose

This notebook focuses on building a vocabulary of environmental terms to support downstream Named Entity Recognition (NER). It is the second step in a pipeline designed to develop a domain-specific NER dataset, following on from the raw text preparation carried out in `01_data_collection.ipynb`.

The aim is to identify and organise key domain-specific terms that appear in environmental texts. These include species names, habitat types, ecological processes, pollutants, and scientific measurements. General-purpose NER models often fail to capture such terms accurately, as they are either underrepresented or entirely absent in mainstream corpora.

The vocabulary lists created here will be used in the next step to apply rule-based annotation. This pre-annotation will enable the creation of weakly labelled data, which will then be used to train statistical and machine learning models for environmental NER. As a result, this stage is critical in determining the coverage and relevance of named entities in the training set.

Vocabulary terms are gathered from a range of publicly available sources, including domain-specific glossaries, classification systems, and structured datasets. Some are extracted using scripts, while others are compiled manually. The final outputs will be cleaned, deduplicated, and stored in a structured format ready for annotation.

### 1.2 Objectives

The main objectives of this notebook are:

- To define a set of named entity categories that reflect the types of concepts commonly found in environmental science
- To identify and extract vocabulary terms for each category from trusted sources, including ontologies, glossaries, and environmental datasets
- To standardise and clean the collected terms, removing duplicates and formatting them consistently
- To save the resulting vocabulary files in a format suitable for rule-based matching and future model training

Each category is handled in a separate section, with a clear explanation of sources used, any filtering applied, and a final cleaned list. These outputs will be used to annotate the sentence-level dataset produced earlier in the pipeline.

### 1.3 Challenges in Vocabulary Development

Developing vocabulary lists for environmental NER presents several challenges that need to be addressed carefully:

- **Choosing meaningful categories**  
  There is no single standard for environmental entity types. Categories must be defined based on project needs, typical text content, and the structure of existing ontologies. Each category should be well defined and distinct from the others.

- **Finding reliable and relevant sources**  
  Domain-specific terms are spread across many different resources. These resources vary in structure and accessibility. Some are easy to extract from; others require manual effort or custom scripts. Some of these are easy to extract; others require manual effort or custom scripts.

- **Managing ambiguity**  
  Certain terms have multiple meanings in different contexts. For example, “lead” may refer to a pollutant or act as a verb. These terms need to be flagged and treated cautiously during annotation.

- **Balancing coverage with precision**  
  A broad vocabulary may capture more entities but also introduces noise. It is important to focus on terms that are likely to appear in the corpus and are useful for the task. Rare, overly general, or ambiguous entries may be excluded or reviewed manually.

- **Handling formatting inconsistencies**  
  Terms from different sources may vary in spelling, case, pluralisation, or punctuation. A standard cleaning process is needed to normalise these variations and ensure consistent matching later.

These challenges underline the need for a clear and reproducible method for vocabulary collection. Each step must be documented and carefully structured to ensure high-quality results in the next phase of the pipeline.


## 2. Defining Entity Categories

### 2.1 Choosing the Entity Categories

Named Entity Recognition in environmental science requires clearly defined entity types that reflect the domain’s specialised vocabulary. General NER categories, such as “Organisation” or “Location,” are not specific enough to capture the kinds of entities that appear in environmental text. Terms such as species names, habitat descriptors, or chemical pollutants need to be grouped meaningfully to support accurate tagging.

The categories chosen for this notebook aim to balance three key criteria: specificity, clarity, and practical relevance. Each category should represent a well-understood environmental concept, be distinct from the others, and be useful for annotating terms that frequently appear in environmental documents. This includes both scientific abstracts and semi-structured metadata.

The category design also draws on patterns seen in domain vocabularies, classification systems, and earlier review of the text corpus collected earlier. These sources reveal recurring term types such as biological taxa, ecosystem names, pollution indicators, and measurement units. Grouping these into separate categories improves both annotation consistency and future model training.

### 2.2 Category Review Process
The selection of entity categories was based on a review of environmental materials and the text data collected earlier. The aim was to identify recurring patterns in how key concepts are mentioned and described across different sources.

To guide this process, a range of domain-specific materials were considered, including glossaries, classification systems, and datasets. These were not used to define categories directly, but to observe common term types that appear in environmental science. In parallel, the collected corpus was manually reviewed to spot frequent entities and assess their natural groupings.

The process was iterative. Some categories, such as species names and pollutants, were clearly distinguishable from the start. Others, like environmental processes or measurements, required refinement to avoid overlapping meanings or inconsistent boundaries. Wherever possible, categories were defined to be specific, distinct, and useful for labelling terms that appear regularly in environmental text.

### 2.3 Final Entity Categories

Based on the review process above, five named entity categories were selected. Each category is designed to capture a specific type of domain-relevant term found in environmental science text.

| Entity Category | Description | Example Terms |
|------------------|-------------|----------------|
| TAXONOMY         | Names of species, genera, families, or other taxonomic units | *Salmo salar*, *Panthera leo* |
| HABITAT          | Names of ecosystems, land cover types, or habitat descriptors | estuary, peatland, saltmarsh |
| ENV_PROCESS      | Environmental or ecological processes, including both natural and human-driven events | erosion, eutrophication, acidification |
| POLLUTANT        | Chemical or physical substances known to cause pollution or environmental harm | mercury, nitrate, microplastics |
| MEASUREMENT      | Units, quantities, or indicators used to measure environmental variables | pH, mg/L, temperature, CO₂ levels |

These categories are used to group vocabulary terms in the next sections. Each term list is built and processed separately, then saved in a structured format for annotation.


## 3. Collecting Vocabulary Terms
This section focuses on building the vocabulary lists for each of the entity categories defined earlier. For each category, relevant terms are collected from domain-specific resources such as glossaries, classification systems, and downloadable datasets. Where possible, structured sources are processed programmatically. Others are manually reviewed or extracted depending on format and accessibility.

The goal is to compile targeted, category-specific lists that reflect the kinds of entities typically found in environmental science texts. Each vocabulary is cleaned, standardised, and saved in a structured format for later use in annotation and model development.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random
from pathlib import Path

BASE_PATH = Path("../vocabs")
RAW_PATH = BASE_PATH / "raw"
EXTRACTED_PATH = BASE_PATH / "extracted"

RAW_PATH.mkdir(parents=True, exist_ok=True)
EXTRACTED_PATH.mkdir(parents=True, exist_ok=True)

### 3.1 Taxonomy

The taxonomy vocabulary is based on the GBIF Backbone Taxonomy dataset [link](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c), which provides a large set of biological names across all major taxonomic groups. The raw file was downloaded as a TSV and includes a wide range of vernacular terms.

In [2]:
df = pd.read_csv(RAW_PATH / "VernacularName.tsv", sep='\t', dtype=str)
df.head()

Unnamed: 0,taxonID,vernacularName,language,country,countryCode,sex,lifeStage,source
0,5371864,sels næp,no,,,,,Nordic plant uses from Gunnerus and Høeg
1,5371864,spreng-rood,no,,,,,Nordic plant uses from Gunnerus and Høeg
2,5371864,syle-næbber,no,,,,,Nordic plant uses from Gunnerus and Høeg
3,3034225,angelsrot,,,,,,Nordic plant uses from Gunnerus and Høeg
4,3034225,mjølkerot,,,,,,Nordic plant uses from Gunnerus and Høeg


An initial preview of the file revealed that the entries span multiple languages, stored under a `language` column. Since the corpus being annotated is in English, only entries marked as English (`language == 'en'`) are retained for this stage.

In [78]:
df = pd.read_csv(RAW_PATH / "VernacularName.tsv", sep='\t', usecols=["vernacularName", "language"], dtype=str)

df = df[df["language"] == "en"]
df = df.dropna(subset=["vernacularName"])
df["vernacularName"] = df["vernacularName"].str.strip().str.lower()
df = df[df["vernacularName"] != ""]

taxonomy_terms = df["vernacularName"].drop_duplicates().sort_values()

taxonomy_path = EXTRACTED_PATH / "taxonomy.txt"
taxonomy_path.parent.mkdir(parents=True, exist_ok=True)
taxonomy_terms.to_csv(taxonomy_path, index=False, header=False)

In [79]:
print(len(taxonomy_terms))

229559


In [76]:
with open(output_path, "r", encoding="utf-8") as f:
    for i in range(10):
        print(f.readline().strip())

abandoned industrial site
accidental release of organisms
afforestation
afforestations
agricultural landscape
alluvial plain
altitude
altitudes
amusement park
anaerobic lagoon


The extracted list contains approximately 229,000 unique English-language names. From a brief manual inspection, the terms appear accurate, well-formed, and highly relevant to biological taxonomy. Many include common species and group-level names, making this a strong starting point.

While the number of terms is large, it is not expected to negatively impact annotation. Any names not found in the corpus will simply not be matched. Basic cleaning will be applied later to remove duplicates, overly generic names, or problematic entries where necessary.

Due to the size of the list, full manual review is not feasible. However, spot checks indicate the data quality is high. A standardised version of the vocabulary will be prepared during the cleaning stage.

### 3.2 Habitat
The habitat vocabulary is based on the GEMET thesaurus, which includes a wide range of environmental and ecological terms. It provides structured terminology for describing natural environments, land cover types, and habitat-related concepts. The terms were downloaded in structured format and filtered for habitat-relevant entries.

In [57]:
def scrape_gemet_theme(theme_id: int):
    """Scrapes GEMET terms under a given theme and writes them to a text file"""
    
    base_url = f"https://www.eionet.europa.eu/gemet/en/theme/{theme_id}/concepts/?page={{}}&letter=0"
    
    page = 1
    terms = set()

    print(f"Scraping theme ID {theme_id}")
    
    while True:
        url = base_url.format(page)
        print(f"Page {page}")
        response = requests.get(url)
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.content, "html.parser")
        term_elements = soup.select("ul.listing.columns.split-20 li a")
        if not term_elements:
            break

        new_terms = {el.get_text(strip=True) for el in term_elements}
        if new_terms.issubset(terms):
            break

        terms.update(new_terms)
        page += 1
        time.sleep(0.5)

    return sorted(terms)

In [59]:
habitat_terms = scrape_gemet_theme(theme_id=23)

output_path = EXTRACTED_PATH / "habitat.txt"
with open(output_path, "w", encoding="utf-8") as f:
    for term in habitat_terms:
        f.write(term + "\n")

print(f"Saved habitat terms to {output_path.name}")

Scraping theme ID 23
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Saved habitat terms to habitat.txt


In [62]:
print(len(habitat_terms))

467


In [64]:
with open(output_path, "r", encoding="utf-8") as f:
    terms = [line.strip() for line in f if line.strip()]

sampled = random.sample(terms, 20)
for term in sampled:
    print(term)

temperate woodland
wildlife sanctuary
dam draining
coast protection
rivers
periurban space
site rehabilitation
area under stress
resource scarcity
bays
forest fire
cultural ecosystem services
coastal environment
fens
nesting area
forest
forest reserve
integrated environmental assessment
bocages
abandoned industrial site


The extracted list for habitat terms includes 467 unique entries from the GEMET thesaurus under the "natural areas" theme. While this number is smaller compared to the taxonomy list, it is expected since habitats are a more constrained and well-defined concept, with fewer distinct variations than species names.

The list covers a broad set of environments, including terms such as *marshes*, *coastal environment*, *catchment area*, *bog*, *sand dune fixation*, *mountain forest*, and *terraced landscape*. These are well-formed, domain-relevant, and reflect both natural and managed habitats.

Initial inspection confirms the quality and relevance of the terms. The vocabulary will be further cleaned and standardised before annotation, but no immediate expansion is needed unless specific gaps are identified later.


### 3.3 Environmental Processes
This category includes vocabulary terms that describe natural, physical, or chemical processes occurring within the environment. These may include terms related to climate, hydrology, soil dynamics, pollution cycles, and other system-level interactions commonly found in environmental science texts.

The goal is to capture processes that are relevant to ecological modelling, environmental monitoring, and scientific descriptions of system change.

In [65]:
env_process_path = EXTRACTED_PATH / "env_process.txt"

climate_terms = scrape_gemet_theme(theme_id=7)

with open(env_process_path, "w", encoding="utf-8") as f:
    for term in climate_terms:
        f.write(term + "\n")

print(f"Saved env_process terms to {env_process_path.name}")

Scraping theme ID 7
Page 1
Page 2
Page 3
Page 4
Page 5
Saved env_process terms to env_process.txt


In [66]:
print(len(climate_terms))

128


In [67]:
with open(env_process_path, "r", encoding="utf-8") as f:
    terms = [line.strip() for line in f if line.strip()]

sampled = random.sample(terms, 20)
for term in sampled:
    print(term)

precipitation enhancement
season
meteorological disaster
oceanic climate
flood forecast
haze
atmospheric structure
climate regulation
tornado
weather modification
snow
storm damage
air conditioning
troposphere
meteorological research
mountain climate
ozone layer
atmospheric precipitation
climate protection
atmospheric composition


The environmental process vocabulary was initially sourced from the GEMET thesaurus under the "climate" theme, which produced 128 terms. This list includes core concepts such as *global warming*, *ozone layer*, *atmospheric composition*, *feedback loop*, and *water scarcity*.

While these are relevant and well-formed, the coverage is narrow and heavily climate-focused. Environmental processes span a wider range of concepts including soil dynamics, chemical pollution cycles, hydrology, and ecosystem-level change. These are not well represented in the current list.

Additional sources will be explored to expand this vocabulary, including manually compiled lists and process-focused glossaries or ontologies. The GEMET terms will still be included as part of the final vocabulary after cleaning and standardisation.


In [68]:
theme_ids = {
    "climate": 5,
    "pollution": 26,
    "natural_dynamics": 36,
}

all_terms = set()

for theme, theme_id in theme_ids.items():
    print(f"Scraping: {theme}")
    
    terms = scrape_gemet_theme(theme_id)
    all_terms.update(term.lower() for term in terms if term.strip())
    
    with open(env_process_path, "r", encoding="utf-8") as f:
        for line in f:
            term = line.strip().lower()
            if term:
                all_terms.add(term)

with open(env_process_path, "w", encoding="utf-8") as f:
    for term in sorted(all_terms):
        f.write(term + "\n")

Scraping: climate
Scraping theme ID 5
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Scraping: pollution
Scraping theme ID 26
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Page 14
Page 15
Page 16
Scraping: natural_dynamics
Scraping theme ID 36
Page 1
Page 2


In [69]:
print(len(all_terms))

1003


In [70]:
with open(env_process_path, "r", encoding="utf-8") as f:
    terms = [line.strip() for line in f if line.strip()]

sampled = random.sample(terms, 20)
for term in sampled:
    print(term)

microclimate
land restoration
combined cycle-power station
tundra
bleaching process
anaerobic treatment
advection
building service
chemical decontamination
environmental impact of agriculture
hail
salination
polluter-pays principle
olfactory pollution
nursery garden
water pollution
pollution monitoring
building site
physical alteration
sea level rise


The expanded list for environmental processes now includes 1,003 unique terms aggregated from multiple GEMET themes, including climate, pollution, and natural dynamics. This broader coverage captures a wide range of relevant concepts such as microclimate, chemical decontamination, salination, environmental impact of agriculture, and pollution monitoring. While some entries are loosely defined or overlap with other categories, the overall list is diverse and strongly aligned with the types of system-level processes found in environmental science literature. The vocabulary will be reviewed and refined in the next stage to remove ambiguous or overly general terms, but the current coverage is considered sufficient for initial annotation.

### 3.4 Pollutants
The pollutants vocabulary includes chemical substances and other environmental stressors that are commonly referenced in environmental science. These may be individual compounds such as benzene or lead, or broader classes such as pesticides, microplastics, or particulate matter. These terms are often found in regulatory reports, monitoring programmes, and scientific publications that focus on pollution, exposure, and environmental impact.

To build this vocabulary, the Toxics Release Inventory (TRI) Chemical List was selected as the source. This is a publicly available list of substances that are tracked under the US EPA’s TRI reporting programme, which requires facilities to report on the management of certain toxic chemicals. The list is maintained as part of the EPA’s environmental transparency efforts and includes both individual chemicals and chemical categories.

The full list was downloaded in spreadsheet format from the following source:
https://www.epa.gov/toxics-release-inventory-tri-program/tri-listed-chemicals

In [33]:
df = pd.read_excel(RAW_PATH / "2024_tri_chemical_list.xlsx", engine="openpyxl")
df.head()

  warn(msg)


Unnamed: 0,CASRN,TRI Chemical or Chemical Category Name,Chemical Structure,De Minimis Limit %,"M,P/OU Thresholds (lb)",Category Description,Category Member,Additional Information
0,71751-41-2,Abamectin,,1.0,"25,000/10,000",,,
1,30560-19-1,Acephate,,1.0,"25,000/10,000",,,
2,75-07-0,Acetaldehyde,,0.1,"25,000/10,000",,,
3,60-35-5,Acetamide,,0.1,"25,000/10,000",,,
4,75-05-8,Acetonitrile,,1.0,"25,000/10,000",,,


The TRI chemical file includes a structured list of pollutant names, along with metadata such as CAS numbers, thresholds, and classification notes. A preview of the dataset shows chemical entries like Abamectin, Acetaldehyde, and Acetonitrile, which reflect a balance of technical specificity and practical relevance.

This initial inspection confirms that the file contains suitable pollutant terms for vocabulary extraction. Further processing will focus on selecting and standardising these names into a plain text list for use in annotation.

In [34]:
df.columns = df.columns.str.strip()

pollutant_terms = (
    df["TRI Chemical or Chemical Category Name"]
    .dropna()
    .astype(str)
    .str.strip()
    .str.lower()
    .drop_duplicates()
    .sort_values()
)

pollutant_path = EXTRACTED_PATH / "pollutant.txt"
pollutant_path.parent.mkdir(parents=True, exist_ok=True)
pollutant_terms.to_csv(pollutant_path, index=False, header=False)

In [35]:
len(pollutant_terms)

728

In [36]:
with open(pollutant_path, "r", encoding="utf-8") as f:
    terms = [line.strip() for line in f if line.strip()]

sampled = random.sample(terms, 20)
for term in sampled:
    print(term)

dichlorobromomethane
nickel
paraldehyde
acrolein
bromine
dihydrosafrole
cycloate
"3,3'-dimethylbenzidine dihydrofluoride"
"3,3'-dimethylbenzidine dihydrochloride"
triforine
diaminotoluene (mixed isomers) (toluenediamine)
c.i. direct blue 218
"hexamethylene-1,6-diisocyanate"
"1,2-dichloro-1,1,3,3,3-pentafluoropropane (hcfc-225da)"
"1,2,3,7,8‑pentachlorodibenzo-p-dioxin"
chlorothalonil
2-nitrophenol (o-nitrophenol)
perchloromethyl mercaptan
sodium azide
"1,1,2,2-tetrachloroethane"


A total of 728 unique pollutant terms were extracted from the TRI list. The sample includes a mix of well-formed names and more complex entries, such as those wrapped in quotation marks or containing grouped chemicals.

Basic cleaning will be applied to remove entries with formatting issues or compound categories that are too broad for direct annotation. Since the list is relatively small, this will be done manually.

Some common terms like pesticide, ozone, or microplastic are not included in the TRI list. These will be added separately to improve coverage across general environmental texts.

### 3.5 Measurements
This category includes vocabulary related to units, scales, and quantities commonly used in environmental datasets and reports. These may appear alongside named entities or as standalone references in measurements, thresholds, or limits.

Examples include standard units like mg/L, ppm, µg/m³, and domain-specific terms such as emission factor, exceedance, or threshold concentration. These are important for tasks such as span-based annotation, context interpretation, and distinguishing between entity mentions and surrounding descriptors.

Relevant terms will be collected from environmental glossaries, standards documentation, and precompiled unit lists. A final cleaned vocabulary will be prepared to support annotation and model training.

## 2. Cleaning and Filtering Terms
Lemmatise all GEMET terms.

Remove:

* Words that are too generic (e.g. "animal", "life")

* Words that are < 3 characters

* Anything that doesn't fit your NER context.

## 3. Lemmatization Script (with spaCy)

one of the below is correct and should be used

In [None]:
import spacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
VOCAB_DIR = Path("gemet_terms")
CLEAN_DIR = Path("../vocabularies")
CLEAN_DIR.mkdir(exist_ok=True)

for vocab_file in VOCAB_DIR.glob("*.txt"):
    with open(vocab_file, "r", encoding="utf-8") as f:
        terms = {line.strip() for line in f if line.strip()}

    lemmatised = set()
    for term in terms:
        doc = nlp(term)
        lemma = " ".join([token.lemma_ for token in doc])
        lemmatised.add(lemma.lower())

    # Filter out generic or short terms (e.g., less than 3 chars)
    filtered = sorted({term for term in lemmatised if len(term) > 2 and not term.isnumeric()})
    
    out_path = CLEAN_DIR / vocab_file.name
    with open(out_path, "w", encoding="utf-8") as f:
        for term in filtered:
            f.write(term + "\n")
    
    print(f"{vocab_file.name}: {len(filtered)} terms saved.")


In [None]:
import spacy
from pathlib import Path

# Paths
BASE_DIR = Path("..") / "data" / "raw_data"
OUTPUT_DIR = Path("..") / "data" / "processed_data"
VOCAB_DIR = Path("..") / "vocabularies"

input_path = VOCAB_DIR / "taxonomy.txt"
output_path = VOCAB_DIR / "taxonomy_lemmatized.txt"

# Load SpaCy English model
nlp = spacy.load("en_core_web_sm")

# Read original taxonomy terms
with open(input_path, "r", encoding="utf-8") as f:
    terms = [line.strip().lower() for line in f if line.strip()]

lemmatised_terms = set()

for term in terms:
    doc = nlp(term)
    lemma = " ".join([token.lemma_ for token in doc])
    lemmatised_terms.add(lemma)

# Sort and save
lemmatised_sorted = sorted(lemmatised_terms)

with open(output_path, "w", encoding="utf-8") as f:
    for term in lemmatised_sorted:
        f.write(term + "\n")

print(f"Lemmatized {len(terms)} terms down to {len(lemmatised_terms)} unique ones.")
print(f"Saved to: {output_path}")


This step uses SpaCy to lemmatise all terms to their base forms (e.g., "habitats" → "habitat"). Generic terms and words shorter than 3 characters were removed.


We now have cleaned and curated vocabulary lists for each NER category:

- `taxonomy.txt` (from BHL)
- `habitat.txt`
- `pollutant.txt`
- `env_process.txt`
- `measurement.txt`

These vocabulary files are stored in the `vocabularies/` directory and are ready to be used for Aho-Corasick-based annotation. If additional terms are discovered later, they can be appended and recompiled on the fly.
