# Data Collection and Preparation

## 1. Introduction
This notebook handles the first stage of the pipeline: collecting domain-specific textual data for use in a Named Entity Recognition (NER) system for environmental science. A diverse and well-curated corpus is critical for downstream performance, especially when the goal is to identify entities like species names, habitats, pollutants, processes, and measurements from unstructured text.

The dataset must be large enough to cover a variety of sentence structures, vocabulary styles, and topic domains. Since environmental language appears in both scientific and non-scientific formats, the collected data should reflect this range.

Data is obtained using a combination of:

* **Web scraping**: used where structured access to public datasets exists.
* **Manual export**: used for sources with downloadable content (e.g. scientific abstracts).
* **Programmatic parsing**: used to extract text from complex formats like PDFs, DOCX files, and zipped document archives.

During collection itself, no cleaning, annotation, or sentence segmentation is performed. The goal is to preserve the original data as faithfully as possible and organise it in a reproducible format; light cleaning and sentence segmentation follow in later sections of this notebook.

## 2. Data Collection Overview
To support robust training of a domain-specific NER model, data was collected from multiple environmental sources. Each source offers different linguistic characteristics and entity coverage, improving the model’s ability to generalise to varied real-world texts. The three core sources are:

* **UKCEH Catalogue**: A structured data archive containing environmental metadata and reports in the form of supporting documents (PDFs, DOCX files, etc.). These are scraped directly from the UKCEH online catalogue.
* **PubMed Abstracts**: Scientific paper abstracts retrieved from PubMed, grouped by category (e.g. habitat, taxonomy, pollutants). These abstracts reflect formal scientific language and frequently include complex environmental terms.
* **Environmental News Articles**: News-style articles exported from a Kaggle dataset, representing more informal, public-facing language and terminology.


Each of these sources contributes to one or more of the five key NER categories: TAXONOMY, HABITAT, ENV_PROCESS, POLLUTANT, and MEASUREMENT. By combining these sources, the resulting corpus is better suited to capture both scientific and layperson contexts.

The collected data is stored in the `data/raw_data/` directory, with one subfolder per source.

In [None]:
from pathlib import Path

# Set up base directory for raw data
BASE_DIR = Path("..") / "data" / "raw_data"
BASE_DIR.mkdir(parents=True, exist_ok=True)

### 2.1. UKCEH Data Collection
The UK Centre for Ecology & Hydrology (UKCEH) provides an online catalogue that publishes environmental datasets across various domains such as land use, biodiversity, and water quality. Each dataset page includes a title, description, and often a ZIP archive containing supporting documents. These documents may include project reports, data collection methodologies, and technical notes.

The goal here is to extract both the structured metadata (titles and descriptions) and unstructured text content from supporting files. This data will later be used for vocabulary creation and entity recognition.

To collect this data, a web scraping approach is used. For each catalogue page, a script identifies all dataset links, then visits each one to extract available metadata. If a downloadable ZIP is found, it is retrieved and unpacked so that the contents (PDFs, DOCX, TXT files, etc.) can be converted into plain text. This makes it possible to mine large amounts of raw language data for further annotation and model training.

#### Setup and Imports

In [1]:
from pathlib import Path
import os
import time
import requests
import tempfile
import zipfile
import re
import unicodedata
import contextlib
import io

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from docx import Document
from pdfminer.high_level import extract_text as extract_pdf_text
import filetype
import fitz
import pytesseract
from PIL import Image

# Target directory for UKCEH data
UKCEH_DIR = BASE_DIR / "ukceh"
UKCEH_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_FILE = UKCEH_DIR / "ukceh_data.txt"

# Base URL for CEH catalogue
BASE_URL = "https://catalogue.ceh.ac.uk"

#### Retrieving Dataset Links
This function loads a single page of the UKCEH catalogue and returns all dataset URLs listed on that page. Selenium is used here to dynamically render content that would otherwise be hidden from static scraping tools like requests.

In [31]:
def get_links_from_page(page_num):
    """Return the dataset URLs listed on one page of the UKCEH catalogue."""
    url = f"{BASE_URL}/?page={page_num}"
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        # Always close the browser, even if the page fails to load
        driver.quit()

    results_list = soup.find("div", class_="results__list")
    if not results_list:
        return []

    # Keep only links that point at dataset detail pages
    dataset_links = [
        BASE_URL + a['href']
        for a in results_list.find_all("a", href=True)
        if a['href'].startswith("/documents/")
    ]
    return dataset_links

#### Extracting Dataset Metadata
Once the dataset URLs have been collected, the next step is to visit each individual dataset page and extract three core elements: the dataset title, its summary description, and the link to any downloadable supporting documents. These documents often contain the actual scientific content of interest — e.g. technical reports, sampling protocols, or field notes.

All of this information is pulled directly from the HTML of the dataset detail page using BeautifulSoup. If a supporting ZIP file is found, its link is returned so it can be downloaded later. This is important because, while the description gives us a short summary, the real value is often inside those attached files.


In [32]:
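def get_dataset_info(dataset_url):
    """Return (title, description, supporting_link) for one dataset page."""
    response = requests.get(dataset_url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    # Guard against pages that lack a title or description
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else ""
    desc_div = soup.find("div", class_="description-text")
    description = desc_div.get_text(strip=True) if desc_div else ""

    # Link to the supporting-documents download, if the page offers one
    supporting_tag = soup.find("a", class_="btn btn-access")
    supporting_link = supporting_tag.get("href") if supporting_tag else None

    return title, description, supporting_link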
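
The `href` on the supporting-documents button may be a relative path rather than a full URL, and the scraper passes it on to `requests.get` unchanged. A minimal sketch of normalising it with the standard library's `urljoin` follows. `resolve_link` is a hypothetical helper, not part of the original notebook, and the example paths are invented for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://catalogue.ceh.ac.uk"


def resolve_link(href, base_url=BASE_URL):
    """Return an absolute URL for href, or None when no link was found."""
    if not href:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against base_url
    return urljoin(base_url, href)


# Illustrative paths only; not taken from the real catalogue
print(resolve_link("/example/path.zip"))
print(resolve_link("https://example.org/files/docs.zip"))
print(resolve_link(None))
```

Calling this on `supporting_link` before downloading would make the download step indifferent to how the page writes its links.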

#### Extracting Text from Individual Files
The supporting documents come in several formats, mostly PDF, DOCX, and TXT. The aim is to convert them to plain text wherever possible. PDFs go through a two-step process:

* First, try a text-based extractor (pdfminer).
* If that returns no text (common here, because many of the PDFs are scanned images), render each page with PyMuPDF and run OCR on it with pytesseract.

This fallback was added after early runs of the scraper showed zero content being extracted from several files, which turned out to be scanned PDFs. Without OCR, those would be unusable. DOCX and TXT files are handled more straightforwardly.

Files that can’t be parsed or yield no text are silently skipped.

In [33]:
def read_file_text(file_path):
    try:
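        kind = filetype.guess(file_path)
        # Fall back to the file extension (without the leading dot) when the
        # type cannot be detected from content, as is the case for plain text
        ext = kind.extension if kind else os.path.splitext(file_path)[-1].lower().lstrip(".")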

        if ext == "pdf":
            with contextlib.redirect_stderr(io.StringIO()):
                text = extract_pdf_text(file_path).replace("\n", " ").strip()
            
            if text:
                return text
            
            text_parts = []
            doc = fitz.open(file_path)
            for page in doc:
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                ocr_text = pytesseract.image_to_string(img).strip()
                if ocr_text:
                    text_parts.append(ocr_text.replace("\n", " "))
            return " ".join(text_parts).strip()

        elif ext == "docx":
            doc = Document(file_path)
            return " ".join(p.text for p in doc.paragraphs).strip()

        elif ext == "txt":
            with open(file_path, encoding="utf-8", errors="ignore") as f:
                return f.read().replace("\n", " ").strip()

        else:
            return ""

    except Exception:
        # Unreadable or corrupt files are skipped by returning no text
        return ""
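
The two-step PDF handling is an instance of a more general pattern: try extractors in order and keep the first non-empty result. The sketch below shows that pattern in isolation. `extract_with_fallback`, `text_layer`, and `ocr` are hypothetical names, and the two stand-in functions only simulate the pdfminer and OCR steps.

```python
def extract_with_fallback(path, extractors):
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing extractor should not stop the next one from running
            continue
        if text and text.strip():
            return text.strip()
    return ""


# Hypothetical stand-ins for the pdfminer and OCR steps
def text_layer(path):
    return ""  # simulates a scanned PDF with no embedded text


def ocr(path):
    return " recognised text "


print(extract_with_fallback("scan.pdf", [text_layer, ocr]))
```

Unlike `read_file_text`, this variant also moves on to the next extractor when one raises an exception, which may or may not be desirable depending on how slow the OCR step is.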

#### Downloading and Reading ZIP Archives
Each dataset’s supporting documents are stored inside a downloadable ZIP file. Once a link is found, the ZIP is downloaded into a temporary folder, extracted, and all valid text files are parsed using the logic above.

This was necessary because the CEH catalogue does not expose individual documents directly; everything is bundled. Some ZIPs also contain metadata files (JSON, HTML, RTF) that are not useful for this task. `read_file_text` returns an empty string for these, so only content from PDF, DOCX, and TXT files is included.

In the end, the full dataset (title + description + extracted content from supporting files) is written to a single output file, one line per dataset. This makes it easier to process later.

In [34]:
def extract_text_from_zip(zip_url, dataset_id):
    try:
        response = requests.get(zip_url, timeout=60)
        if not response.ok:
            # Skip datasets whose archive cannot be downloaded
            return ""

        with tempfile.TemporaryDirectory() as tmpdir:
            zip_path = os.path.join(tmpdir, f"{dataset_id}.zip")
            with open(zip_path, "wb") as f:
                f.write(response.content)

            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(tmpdir)

            text_chunks = []
            for root, _, files in os.walk(tmpdir):
                for name in files:
                    file_path = os.path.join(root, name)
                    try:
                        text = read_file_text(file_path)
                        if text:
                            text_chunks.append(text)
                    except Exception as e:
                        print(f"Failed to read file {name}: {e}")

            return " ".join(text_chunks).strip()
    except Exception as e:
        print(f"Exception while processing zip: {e}")
        return ""
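
The function above extracts every archive member to disk and relies on `read_file_text` to ignore unsupported types. An alternative is to filter member names before extraction, sketched below. `SKIPPED_SUFFIXES` and `list_useful_members` are hypothetical, and the skip list simply mirrors the file types named in the text.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed skip list, based on the metadata file types mentioned in the text
SKIPPED_SUFFIXES = {".json", ".html", ".rtf"}


def list_useful_members(zip_bytes, skipped=SKIPPED_SUFFIXES):
    """Return archive member names whose extension is not in the skip list."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() not in skipped
        ]


# Build a small in-memory archive to exercise the filter
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.pdf", b"%PDF-")
    archive.writestr("metadata.json", "{}")
    archive.writestr("docs/notes.txt", "field notes")

print(list_useful_members(buffer.getvalue()))
```

Working from `response.content` in memory like this would also avoid writing the ZIP itself to the temporary folder, at the cost of holding large archives in RAM.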


#### Writing Output to File
Each dataset is saved as a single line in a raw text file. This includes the title, description, and any extracted content from supporting documents. The three fields are separated by a pipe (|) character for clarity.

At this point, no preprocessing or text cleaning is applied — this is intentional. The idea is to preserve the raw source content in full and defer any normalisation or filtering to a later preprocessing stage.

In [36]:
def write_to_file(title, url, description, extracted_text, output_file):
    line = f"{title} | {description} | {extracted_text}".strip()
    
    with open(output_file, "a", encoding="utf-8") as f:
        f.write(line + "\n")
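
For later stages that need the fields back, a line in this format can be split on the first two separators. The sketch below assumes the title and description never contain ` | ` themselves, which the scraper does not guarantee. `parse_record` is a hypothetical helper and the sample line is invented.

```python
def parse_record(line):
    """Split one output line into (title, description, extracted_text)."""
    # write_to_file strips the line, so restore the trailing space that an
    # empty final field would otherwise lose before splitting
    padded = line.rstrip("\n") + " "
    # maxsplit=2 keeps any later " | " inside the extracted text
    parts = padded.split(" | ", 2)
    # Pad with empty strings when trailing fields are missing
    parts += [""] * (3 - len(parts))
    return tuple(part.strip() for part in parts)


record = parse_record("Soil survey | Topsoil carbon data | Methods: cores | depth 15 cm\n")
print(record)
print(parse_record("Soil survey | Topsoil carbon data |"))
```

If the separator turns out to occur in real titles, a tab-separated or JSON Lines format would be a safer choice for the raw file.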

#### Scraping All Pages
This loop goes through every page of the CEH catalogue (1 to 114) and processes each dataset on the page. For each entry, it:

1. Extracts metadata (title, description, document link),
2. Downloads and extracts any ZIP content,
3. Saves everything into a single output file (ukceh_data.txt).

If a page has no results, the loop exits early. Any datasets that fail (e.g. broken links, unreadable ZIPs) are skipped with an error message.

In [37]:
for page in range(1, 115):
    print(f"Scraping page {page}")
    urls = get_links_from_page(page)

    if not urls:
        print(f"No results on page {page}")
        break

    for url in urls:
        try:
            title, desc, zip_link = get_dataset_info(url)
            dataset_id = url.split("/")[-1]
            extracted = ""

            if zip_link:
                extracted = extract_text_from_zip(zip_link, dataset_id)

            write_to_file(title, url, desc, extracted, OUTPUT_FILE)
        except Exception as e:
            print(f"Error on dataset: {e}")
            continue

Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
Scraping page 11
Scraping page 12
Scraping page 13
Scraping page 14
Scraping page 15
Scraping page 16
Scraping page 17
Scraping page 18
Scraping page 19
Scraping page 20
Scraping page 21
Scraping page 22
Scraping page 23
Scraping page 24
Scraping page 25
Scraping page 26
Scraping page 27
Scraping page 28
Scraping page 29
Scraping page 30
Scraping page 31
Scraping page 32
Scraping page 33
Scraping page 34
Scraping page 35
Scraping page 36
Exception while processing zip: Response ended prematurely
Scraping page 37
Scraping page 38
Scraping page 39
Scraping page 40
Scraping page 41
Scraping page 42
Scraping page 43
Scraping page 44
Scraping page 45
Scraping page 46
Scraping page 47
Scraping page 48
Scraping page 49
Scraping page 50
Scraping page 51
Scraping page 52
Scraping page 53
Scraping page 54
Scraping page 55
Scraping page 5
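
The log shows one ZIP download failing with "Response ended prematurely", which may be a transient network error. One possible mitigation, not part of the original run, is to retry the download a few times with an increasing delay. In the sketch below, `with_retries` and `flaky_download` are hypothetical; the latter only simulates a download that fails twice.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up and let the caller's handler log the failure
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky download that fails twice before succeeding
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Response ended prematurely")
    return b"zip-bytes"


# The sleep function is replaced here so the demonstration runs instantly
print(with_retries(flaky_download, sleep=lambda seconds: None))
```

In the scraper, the wrapped call would be the `requests.get` for the ZIP, so that a single dropped connection does not cost a dataset its supporting text.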

### 2.2. Environmental News Articles (Kaggle)

The second source comes from a Kaggle dataset of approximately 30,000 environment-related news articles originally published by The Guardian. The dataset includes metadata and full text for each article. It is publicly available as a CSV file, downloadable from:

https://www.kaggle.com/datasets/beridzeg45/guardian-environment-related-news

The CSV is manually downloaded and saved as `env_news_data.csv` in `data/raw_data/environment_news/`. Since this notebook focuses on collecting all raw text into unified .txt files, the CSV is loaded and the relevant columns are concatenated into a single line per article. Each line contains the title, introduction, and article body, separated by a pipe (|). Rows missing any of these three fields are dropped.

This text format aligns with how other sources are being collected and makes downstream processing (like annotation or splitting) simpler.

In [None]:
import pandas as pd
from pathlib import Path

ENV_NEWS_DIR = BASE_DIR / "environment_news"
ENV_NEWS_DIR.mkdir(parents=True, exist_ok=True)

csv_path = ENV_NEWS_DIR / "env_news_data.csv"
output_path = ENV_NEWS_DIR / "env_news_data.txt"

df = pd.read_csv(csv_path, engine="python", on_bad_lines="skip")

# Flatten line breaks and combine the Title, Intro Text, and Article Text columns
combined_lines = df[["Title", "Intro Text", "Article Text"]].dropna().astype(str).apply(
    lambda row: " | ".join([
        row["Title"].replace("\n", " ").replace("\r", " ").strip(),
        row["Intro Text"].replace("\n", " ").replace("\r", " ").strip(),
        row["Article Text"].replace("\n", " ").replace("\r", " ").strip()
    ]),
    axis=1
)

# Write one article per line to env_news_data.txt (output_path defined above)
with open(output_path, "w", encoding="utf-8") as f:
    for line in combined_lines:
        f.write(line + "\n")

### 2.3. PubMed Abstracts
PubMed provides access to millions of scientific abstracts in medicine and life sciences, many of which intersect with environmental topics. While its full-text articles are often behind paywalls, abstracts are openly accessible and contain rich descriptions of research objectives, findings, and terminology.

To supplement our dataset with high-quality scientific language, we downloaded abstracts related to environmental science concepts. For each of our core entity categories (e.g. TAXONOMY, POLLUTANT, HABITAT), we used broad search terms like biodiversity, pollutants, and climate to retrieve results directly through PubMed’s web interface. Up to 10,000 abstracts were downloaded per category using the built-in “Send to File” feature.

Each export was saved as a plain .txt file under: `data/raw_data/pubmed`

Each line in the file corresponds to a single abstract, and no further metadata (such as title or authors) was included. No code was used for collection — this step was completed manually to comply with PubMed’s terms.

While programmatic access via the PubMed API (Entrez) is possible, it comes with rate limits and technical constraints that made bulk collection impractical for this project. In contrast, abstracts are freely accessible for research and can be downloaded in bulk via the UI. As a result, our choice to use abstracts is both practical and legally sound.

It is also worth noting that including additional data that may not contain annotated entities is not a concern at this stage. During the annotation phase, sentences that do not contain any known entities will be excluded from the final training data. This allows us to prioritise breadth of coverage and domain relevance during data collection without needing perfect alignment up front.

## 3. Preprocessing Raw Text
Raw textual data often contains non-informative lines, broken encodings, formatting artifacts, and low-content text (e.g. bullets or section headers like “Table 1”). While it's important not to over-clean, a lightweight standardisation step improves downstream tokenisation and annotation.

This stage does not remove domain-specific content or alter sentence boundaries. Instead, it:

* Normalises Unicode characters
* Removes broken characters, bullet symbols, and URLs
* Filters out junk lines (e.g. lines with fewer than three words)

PubMed abstracts require one additional fix: merging broken lines caused by formatting errors during export. We identify blocks separated by blank lines and join them into single coherent abstracts.


### 3.1. Cleaning and Merging Functions

In [66]:
import re
import unicodedata
from pathlib import Path


RAW_BASE = Path("../data/raw_data")
CLEAN_BASE = Path("../data/processed")

def clean_line(line):
    line = line.strip()
    if not line:
        return None  # skip empty lines

    line = unicodedata.normalize("NFKD", line)  # expand ligatures and other compatibility characters
    line = re.sub(r"[•▪●‣–—·]", " ", line)  # bullet symbols, middle dots, and en/em dashes
    line = re.sub(r"\.{3,}", "...", line)  # collapse dot chains
    line = re.sub(r"https?://\S+", "", line)  # remove URLs
    line = re.sub(r"[\x00-\x1F\x7F-\x9F]", " ", line)  # replace control chars (incl. tabs) with a space so words do not merge
    line = re.sub(r"\s{2,}", " ", line).strip()  # collapse runs of whitespace

    if len(line.split()) < 3:  # junk (like "Table 1", "Appendix", etc)
        return None

    return line
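
One side effect of NFKD is worth knowing about: besides expanding ligatures, it splits accented letters into a base letter plus a combining mark. The short comparison below illustrates the difference from NFKC, which expands ligatures but keeps accented letters composed. The sample string is invented, and whether NFKC suits the later annotation stage better is an open choice rather than something the notebook settles.

```python
import unicodedata

# "\ufb01" is the single-character "fi" ligature; "\u00e9" is a precomposed "é"
sample = "\ufb01eld caf\u00e9"

nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFKD yields more code points because "é" becomes "e" + U+0301
print(len(sample), len(nfkd), len(nfkc))
print(nfkc == "field caf\u00e9")
print("\u0301" in nfkd)
```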


In [63]:
def merge_pubmed_lines(input_path):
    """Merge broken abstract lines using blank lines as separators."""
    with open(input_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f]

    abstracts = []
    current = []
    for line in lines:
        if line:
            current.append(line)
        else:
            if current:
                abstracts.append(" ".join(current))
                current = []
    if current:
        abstracts.append(" ".join(current))
    return abstracts
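
To see the merging behaviour without a PubMed export on disk, the same logic can be run on an in-memory string. `merge_blocks` below is a hypothetical variant of `merge_pubmed_lines` that takes a list of lines instead of a path, and the sample text is invented.

```python
def merge_blocks(lines):
    """Join consecutive non-empty lines; blank lines mark block boundaries."""
    abstracts, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            abstracts.append(" ".join(current))
            current = []
    # Flush the final block when the input does not end with a blank line
    if current:
        abstracts.append(" ".join(current))
    return abstracts


# Made-up export fragment: two blocks, the first wrapped over two lines
raw = "Peatland drainage increases\ncarbon loss.\n\n\nNitrate levels rose downstream.\n"
print(merge_blocks(raw.splitlines()))
```

Repeated blank lines are harmless here because a boundary is only recorded when the current block is non-empty.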


In [65]:
def clean_file(input_path, output_path, is_pubmed=False):
    if is_pubmed:
        raw_lines = merge_pubmed_lines(input_path)
    else:
        with open(input_path, "r", encoding="utf-8") as infile:
            raw_lines = infile.readlines()

    cleaned_lines = []
    for line in raw_lines:
        cleaned = clean_line(line)
        if cleaned:
            cleaned_lines.append(cleaned)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as outfile:
        for line in cleaned_lines:
            outfile.write(line + "\n")

    print(f"Saved cleaned file: {output_path.name} ({len(cleaned_lines)} lines)")


In [67]:
for source in ["ukceh", "environment_news", "pubmed"]:
    raw_dir = RAW_BASE / source
    clean_dir = CLEAN_BASE / source
    clean_dir.mkdir(parents=True, exist_ok=True)

    for txt_file in raw_dir.glob("*.txt"):
        is_pubmed = (source == "pubmed")
        clean_file(txt_file, clean_dir / txt_file.name, is_pubmed=is_pubmed)


Saved cleaned file: ukceh_data.txt (3948 lines)
Saved cleaned file: env_news_data.txt (28669 lines)
Saved cleaned file: abstract-habitat.txt (136427 lines)
Saved cleaned file: abstract-env_process.txt (66957 lines)
Saved cleaned file: abstract-environment.txt (66376 lines)
Saved cleaned file: abstract-taxonomy.txt (45741 lines)
Saved cleaned file: abstract-pollutants.txt (61937 lines)
Saved cleaned file: abstract-measurement.txt (58694 lines)


## 4. Sentence Segmentation

After cleaning the raw documents, the next step involves sentence segmentation. This is necessary for two main reasons:

1. Named Entity Recognition (NER) models work best when input is divided into grammatically coherent units (sentences).
2. Many downstream annotation and evaluation tools expect one sentence per line.

We use `spaCy` for this step, with the `en_core_web_sm` pipeline, which derives sentence boundaries from its statistical dependency parser. Each line of a cleaned file is processed as one document, and the segmented sentences are stored under `../data/sentences/`, preserving the original filenames.


In [69]:
import spacy
from pathlib import Path

# Load the small English pipeline and raise the per-text character limit
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2_000_000  # spaCy's default is 1,000,000; raise further if a line exceeds this

INPUT_DIR = Path("../data/processed")
OUTPUT_DIR = Path("../data/sentences")

def segment_sentences_streaming(input_path, output_path):
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(input_path, "r", encoding="utf-8") as infile, open(output_path, "w", encoding="utf-8") as outfile:
        for block in infile:
            block = block.strip()
            if not block:
                continue

            doc = nlp(block)
            for sent in doc.sents:
                sentence = sent.text.strip()
                if sentence:
                    outfile.write(sentence + "\n")

    print(f"Done: {output_path.name}")

# Apply to all files
for source in ["ukceh", "environment_news", "pubmed"]:
    input_folder = INPUT_DIR / source
    output_folder = OUTPUT_DIR / source
    output_folder.mkdir(parents=True, exist_ok=True)

    for file_path in input_folder.glob("*.txt"):
        output_path = output_folder / file_path.name
        segment_sentences_streaming(file_path, output_path)


Done: ukceh_data.txt
Done: env_news_data.txt
Done: abstract-habitat.txt
Done: abstract-env_process.txt
Done: abstract-environment.txt
Done: abstract-taxonomy.txt
Done: abstract-pollutants.txt
Done: abstract-measurement.txt
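
spaCy refuses to process a text longer than `nlp.max_length`, and a single UKCEH line can hold the text of several documents. Raising the limit is one option; another is to split oversized lines before they reach the pipeline. The sketch below shows a simple whitespace-based splitter. `split_long_block` is a hypothetical helper, and a cut may still fall in the middle of a sentence, so the limit should stay generous.

```python
def split_long_block(text, max_chars):
    """Split text into pieces of at most max_chars, breaking at spaces where possible."""
    pieces = []
    text = text.strip()
    while len(text) > max_chars:
        # Prefer the last space inside the window; fall back to a hard cut
        cut = text.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        pieces.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        pieces.append(text)
    return pieces


print(split_long_block("alpha beta gamma delta", max_chars=11))
```

Inside `segment_sentences_streaming`, each piece returned for a block would be passed to `nlp` in turn instead of the whole block.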

