# SpaCy CNN Model for Environmental NER

## 1. Introduction

### 1.1 Background and Purpose
This notebook continues the model training phase of the environmental Named Entity Recognition (NER) pipeline. Following the baseline evaluation using Conditional Random Fields (CRFs), we now transition to training a neural model using SpaCy’s convolutional neural network (CNN) architecture.

SpaCy’s built-in `tok2vec` + CNN pipeline offers a fast and compact neural alternative to classical sequence models. While less complex than transformer-based architectures, it retains the ability to model contextual features using a combination of static embeddings and convolutional filters. This makes it a practical next step in assessing how well a lightweight neural approach can generalise from rule-based annotations to unseen text.

The same annotated dataset is used here, comprising 735,542 sentences and over 1.2 million entity spans, labelled with five domain-specific entity types: TAXONOMY, HABITAT, ENV_PROCESS, POLLUTANT, and MEASUREMENT. The labels were generated using exact string matching against curated environmental vocabularies and follow the SpaCy `doc.ents` format rather than BIO tagging.

This stage will help evaluate whether a neural network trained end-to-end on the raw text and entity spans can learn contextual patterns beyond surface string matches. It also establishes a reference point for comparing more advanced transformer-based models later.

### 1.2 Objectives
This notebook aims to train and evaluate a series of SpaCy CNN-based NER models, each with progressively stronger hyperparameters. These models are intended to:

- Serve as neural baselines that are more expressive than CRFs but faster than transformers.
- Test whether convolutional features can generalise entity recognition beyond the rule-based matches.
- Explore the impact of hyperparameters such as batch size, dropout, and training steps.
- Identify how early SpaCy models perform on different entity types.
- Prepare for subsequent experiments involving transformer-based architectures.


## 2. Preparing the Dataset for SpaCy CNN

The SpaCy CNN model expects training and evaluation data in its native binary format (`.spacy`), where each example contains raw text and associated entity spans stored in `doc.ents`. The dataset generated from the rule-based annotation pipeline was originally stored in `.jsonl` format, with each entry containing a sentence and a list of character-offset entity spans under the keys `"text"` and `"label"`.

To prepare this data for SpaCy’s training pipeline, the following steps are performed:

1. Load the annotated `.jsonl` file into memory.
2. Split the dataset into training, validation, and test sets using a 70/15/15 ratio.
3. Convert each list of records into `DocBin` objects using `nlp.make_doc(...)` and character-span alignment.
4. Save the converted data to `.spacy` files, which are required for use with SpaCy’s CLI training interface.

Unlike the CRF model, there is no need to convert to BIO tags explicitly, as SpaCy internally handles alignment between raw text and annotated entity spans during training. This makes the pipeline more streamlined while maintaining compatibility with downstream neural models.

This step ensures the data is formatted efficiently for high-throughput model training and consistent evaluation.

In [1]:
from pathlib import Path
import os
import json
import random
from collections import Counter

import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import minibatch
from spacy.scorer import Scorer

from tqdm import trange

TRAINING_DATA_PATH = Path("../data/json/training_data.jsonl")
SPACY_MODEL_PATH = Path("../models/spaCy")
SPACY_DATA_PATH = Path("../data/spaCy")
CONFIG_PATH = Path("./spaCy_configs/cnn")

SPACY_MODEL_PATH.mkdir(parents=True, exist_ok=True)
SPACY_DATA_PATH.mkdir(parents=True, exist_ok=True)

### 2.1 Load and Inspect Annotated Data
The annotated dataset is stored in `.jsonl` format, with each line containing a sentence `text` and a list of entity spans under the key `label`. Each span is defined using character offsets and a corresponding entity label, reflecting the results of the earlier rule-based annotation stage.


In [2]:
def load_jsonl(path: Path):
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

In [3]:
training_data = load_jsonl(TRAINING_DATA_PATH)

len(training_data)

735542

In [4]:
from random import sample

for ex in sample(training_data, 3):
    text = ex["text"]
    entities = ex["label"]
    print("\nText:", text)
    for start, end, label in entities:
        print(f" → {text[start:end]} [{label}]")


Text: Biodiverse, multitrophic communities are increasingly recognised as important promoters of species persistence and resilience under environmental change.
 → environmental change [ENV_PROCESS]

Text: Environmental groups have long campaigned to protect the area, which sustains millions of migrating birds and is home to a major population of endangered Iberian lynxes, pointing out that the illegal wells sunk to feed the region’s numerous soft fruit farms are stressing the aquifer.
 → birds [TAXONOMY]
 → Iberian lynxes [TAXONOMY]

Text: world is not on track to prevent catastrophic warming, to keep temperatures from increasing more than 2C (3.6F), Patricia Espinosa, the executive secretary of the United Nations framework convention on climate change, said earlier this month.
 → temperatures [MEASUREMENT]
 → climate change [ENV_PROCESS]


A random sample of entries confirms that the spans have been accurately mapped to their corresponding substrings in the text. Entities such as *environmental change* and *climate change* are correctly identified as `ENV_PROCESS`, while biological mentions like *birds* and *Iberian lynxes* are classified as `TAXONOMY`. Quantitative references such as *temperatures* are assigned the `MEASUREMENT` label.

This inspection demonstrates both the structural integrity of the span annotations and the diversity of entity types captured by the rule-based method. The presence of both single-token and multi-token entities, along with varying sentence structures, supports the suitability of the data for training a generalisable NER model.
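The visual inspection above covers only a handful of sentences. A minimal sketch of a programmatic check follows, assuming each record keeps the `"text"` / `"label"` layout shown earlier; the helper name `find_invalid_spans` and the toy records are hypothetical and not part of the original pipeline.

```python
def find_invalid_spans(record):
    """Return spans that are out of range, empty, or overlap an earlier span."""
    text = record["text"]
    problems = []
    last_end = 0
    # Sort by start offset so overlaps can be detected in a single pass
    for start, end, label in sorted(record["label"]):
        if not (0 <= start < end <= len(text)):
            problems.append((start, end, label, "out of range"))
        elif start < last_end:
            problems.append((start, end, label, "overlaps previous span"))
        else:
            last_end = end
    return problems


# Toy records for illustration only (not taken from the dataset)
good = {"text": "Iberian lynxes hunt birds.",
        "label": [[0, 14, "TAXONOMY"], [20, 25, "TAXONOMY"]]}
bad = {"text": "climate change",
       "label": [[0, 14, "ENV_PROCESS"], [8, 14, "ENV_PROCESS"], [10, 40, "POLLUTANT"]]}

print(find_invalid_spans(good))
print(find_invalid_spans(bad))
```

Running a check like this over the full dataset before conversion would surface overlapping spans early, since `doc.ents` does not accept overlapping entities.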

### 2.2 Train–Validation–Test Split
Before converting the data into SpaCy’s training format, the dataset must be partitioned into separate subsets for training, validation, and testing. This allows for fair evaluation of the model’s ability to generalise beyond its training data.

The dataset is randomly split into 70% for training, 15% for validation, and 15% for testing. The validation set is used during model training to monitor performance and prevent overfitting, while the test set is held out entirely for final evaluation.

Unlike standard classification tasks, stratified splitting is not used here because the entity labels exist as spans within text rather than as discrete document-level categories. With a dataset of this size, a random split is expected to give each entity type sufficient representation across all three subsets.
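The representation of each entity type can be checked directly rather than assumed. The sketch below computes the share of spans per label for any list of records; `label_shares` and the toy records are hypothetical names used for illustration, and the record layout is assumed to match the `.jsonl` format described above.

```python
from collections import Counter


def label_shares(records):
    """Return the fraction of entity spans carrying each label."""
    counts = Counter(label for rec in records for _, _, label in rec["label"])
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy stand-ins for train_data / val_data / test_data
toy_train = [
    {"text": "birds in wetlands", "label": [[0, 5, "TAXONOMY"], [9, 17, "HABITAT"]]},
    {"text": "lynxes", "label": [[0, 6, "TAXONOMY"]]},
]
toy_val = [{"text": "wetlands", "label": [[0, 8, "HABITAT"]]}]

for name, split in [("train", toy_train), ("val", toy_val)]:
    print(name, label_shares(split))
```

Comparing the resulting dictionaries across the three real subsets would show whether any label is noticeably under-represented in validation or test.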

In [4]:
from sklearn.model_selection import train_test_split

# First split: 70% train, 30% temp
train_data, temp_data = train_test_split(training_data, test_size=0.3, random_state=42)

# Second split: 15% val, 15% test
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

print(f"Train size: {len(train_data)}")
print(f"Validation size: {len(val_data)}")
print(f"Test size: {len(test_data)}")

Train size: 514879
Validation size: 110331
Test size: 110332


The split results in 514,879 training examples, 110,331 validation examples, and 110,332 test examples. These proportions provide a strong foundation for both model optimisation and evaluation. Because the sampling is random, entity span diversity should be preserved across subsets, making the split suitable for training a general-purpose neural NER model.

### 2.3 Convert Records to SpaCy DocBin Format

SpaCy's training pipeline requires data to be serialised into its binary `.spacy` format, which is optimised for speed and memory efficiency. To achieve this, each sentence and its associated entity spans are converted into a `Doc` object and then stored in a `DocBin` container.

The `char_span` method is used to align each entity span (stored as character offsets) with the correct token boundaries within the `Doc`. This is necessary because entity spans must match valid token boundaries for SpaCy to process them correctly. In cases where the span cannot be aligned (due to tokenisation mismatch or annotation error), the resulting `None` values are filtered out before assigning to `doc.ents`.

This process ensures that only well-formed entity spans are used in training, maintaining data integrity and preventing runtime errors during model fitting.


In [5]:
def convert_json_records_to_docbin(records, nlp):
    doc_bin = DocBin()
    for record in records:
        text = record["text"]
        entities = record["label"]
        doc = nlp.make_doc(text)
        spans = [doc.char_span(start, end, label=label) for start, end, label in entities]
        spans = [span for span in spans if span is not None]
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin

In [6]:
nlp_blank = spacy.blank("en")

train_docbin = convert_json_records_to_docbin(train_data, nlp_blank)
val_docbin = convert_json_records_to_docbin(val_data, nlp_blank)
test_docbin = convert_json_records_to_docbin(test_data, nlp_blank)

print(f"Train: {len(list(train_docbin.get_docs(nlp_blank.vocab)))}")
print(f"Val: {len(list(val_docbin.get_docs(nlp_blank.vocab)))}")
print(f"Test: {len(list(test_docbin.get_docs(nlp_blank.vocab)))}")

Train: 514879
Val: 110331
Test: 110332


The output confirms that each subset has been converted into a `DocBin` object. The number of `Doc` objects in each bin matches the counts from the previous data split, so no sentences were lost during conversion. These counts are per document, however: any individual span that `char_span` could not align is dropped silently and does not show up in them.

By filtering out `None` spans, the function ensures that only valid, token-aligned entity annotations are retained, which is essential for error-free training using SpaCy’s NER component.
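Because misaligned spans are filtered out without any report, it can be useful to count how many annotations are affected. The following is a minimal sketch, assuming the same record layout and a blank English tokenizer; `count_misaligned_spans` and the toy record are hypothetical and were not part of the original notebook.

```python
import spacy


def count_misaligned_spans(records, nlp):
    """Count spans for which Doc.char_span returns None (default strict alignment)."""
    total = dropped = 0
    for record in records:
        doc = nlp.make_doc(record["text"])
        for start, end, label in record["label"]:
            total += 1
            if doc.char_span(start, end, label=label) is None:
                dropped += 1
    return total, dropped


nlp = spacy.blank("en")
# Toy record: the second span ends in the middle of the token "Iberian"
toy = [{"text": "Iberian lynxes roam the wetlands.",
        "label": [[0, 14, "TAXONOMY"], [0, 5, "TAXONOMY"]]}]

total, dropped = count_misaligned_spans(toy, nlp)
print(f"{dropped} of {total} spans could not be aligned")
```

If the dropped share turned out to be significant, `char_span` also accepts an `alignment_mode` argument (`"contract"` or `"expand"`) that snaps offsets to token boundaries instead of returning `None`; whether that is appropriate depends on the annotation scheme.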


### 2.4 Prepare `Example` Objects for Evaluation and Fine-Grained Control
While SpaCy's CLI training interface operates directly on `.spacy` files, evaluation and diagnostics often require the use of `Example` objects. These allow more granular inspection of model predictions and facilitate custom scoring, visualisation, and error analysis.

Each `Example` pairs a `Doc` object with its annotated entity spans, allowing functions like `evaluate_ner_model()` to compare predictions against the ground truth at a span level.

This step converts the `DocBin` datasets into in-memory `Example` lists for the training, validation, and test sets. Although not strictly necessary for model training via the CLI, these objects provide flexibility for analysis and are essential for consistent evaluation across all models.

In [7]:
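from spacy.tokens import Doc

def prepare_examples_from_docbin(docbin, vocab):
    examples = []
    for doc in docbin.get_docs(vocab):
        # Build an unannotated copy for the predicted side, so the gold
        # entities live only on the reference Doc of each Example
        predicted = Doc(vocab, words=[t.text for t in doc], spaces=[bool(t.whitespace_) for t in doc])
        entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        examples.append(Example.from_dict(predicted, {"entities": entities}))
    return examples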

In [8]:
train_examples = prepare_examples_from_docbin(train_docbin, nlp_blank.vocab)
val_examples = prepare_examples_from_docbin(val_docbin, nlp_blank.vocab)
test_examples = prepare_examples_from_docbin(test_docbin, nlp_blank.vocab)

In [10]:
sample_example = random.choice(train_examples)
print("Text:", sample_example.reference.text)
for ent in sample_example.reference.ents:
    print(f" → {ent.text} [{ent.label_}]")

Text: Inland waters are unique ecosystems offering services and habitat resources upon which many species depend.
 → Inland waters [HABITAT]
 → ecosystems [HABITAT]
 → habitat [HABITAT]


The sample output confirms that entity spans from the original annotations have been correctly mapped to their corresponding substrings in the `Doc` objects. Terms such as *Inland waters*, *ecosystems*, and *habitat* are properly recognised as `HABITAT`, demonstrating that the conversion to `Example` objects retains both the structure and semantics needed for reliable evaluation.

### 2.5 Save Dataset in `.spacy` Format
To enable efficient training using SpaCy’s CLI interface, the processed datasets must be saved in SpaCy’s binary `.spacy` format. This format stores tokenised `Doc` objects and their associated entity spans in a compact, serialised form optimised for fast loading during training.

The `save_examples_to_spacy_file` function writes each list of `Example` objects to a file by serialising the underlying `Doc` objects (i.e., `example.reference`). These files are then referenced by the training configuration to load the data in a format that is fully compatible with SpaCy's pipeline.


In [11]:
from spacy.tokens import DocBin
from pathlib import Path

def save_examples_to_spacy_file(examples, nlp, output_path):
    doc_bin = DocBin()
    for example in examples:
        doc_bin.add(example.reference)
    doc_bin.to_disk(output_path)

In [12]:
save_examples_to_spacy_file(train_examples, nlp_blank, SPACY_DATA_PATH / "train.spacy")
save_examples_to_spacy_file(val_examples, nlp_blank, SPACY_DATA_PATH / "val.spacy")
save_examples_to_spacy_file(test_examples, nlp_blank, SPACY_DATA_PATH / "test.spacy")

## 3. Training SpaCy CNN Models
This section introduces the training of neural models using SpaCy’s built-in convolutional neural network (CNN) pipeline. Unlike classical models like CRF, SpaCy models operate directly on raw text and automatically learn feature representations using token-to-vector embeddings and convolutional layers.

The training is conducted using SpaCy’s CLI interface, which requires a configuration file defining all aspects of the pipeline. This includes component definitions (such as `tok2vec`, `ner`), model architecture, training hyperparameters, optimiser settings, and paths to input/output data.

Multiple models will be trained using different configurations. This section documents each configuration, its rationale, and how it was applied to train the model.

### 3.1 Training the Baseline Model
The first model is intended as a baseline for performance comparison. A starter configuration file is generated using the command below. It sets up a SpaCy pipeline optimised for accuracy using a standard NER architecture with no pre-trained embeddings.

In [13]:
!python -m spacy init config ./spaCy_configs/cnn/config0.cfg --lang en --pipeline ner --optimize accuracy

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
spaCy_configs/cnn/config0.cfg
You can now add your data and train your pipeline:
python -m spacy train config0.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


This generates a default `config0.cfg` file, which can be further customised. For this project, the file is manually adjusted to define the baseline settings for the first CNN model. The following table summarises the key parameters modified to initialise the baseline run.

| Parameter                     | Value       | Purpose                                                                 |
|------------------------------|-------------|-------------------------------------------------------------------------|
| `vectors`                    | `null`      | No pre-trained word vectors are used; training starts from scratch.    |
| `init_tok2vec`               | `null`      | No initial weights; the tok2vec layer is learned from the training data.|
| `batch_size`    | `1000`      | Number of words per batch; high value speeds up training.              |
| `encoder architecture`       | `MaxoutWindowEncoder.v2` | CNN-style encoder used by SpaCy’s tok2vec layer.     |
| `encoder width`              | `128`       | Number of output channels from the encoder (model capacity).           |
| `encoder depth`              | `4`         | Number of convolutional layers in the encoder.                         |
| `dropout`                    | `0.5`       | Regularisation to prevent overfitting.                                 |
| `max_steps`                  | `2000`     | Maximum number of update steps before training stops.                  |
| `patience`                   | `1000`      | Early stopping patience if validation score does not improve.          |
| `learn_rate`                 | `0.001`     | Learning rate for optimiser.                                           |
| `eval_frequency`             | `500`       | Number of steps between validation evaluations.                        |


With the configuration file prepared and training data saved in SpaCy’s `.spacy` format, the model is now trained using the CLI interface. The command below specifies the model configuration, input/output paths, and GPU usage.

Training progress is printed after every evaluation step and includes loss values (`LOSS TOK2VEC`, `LOSS NER`) and entity-level metrics (`ENTS_P`, `ENTS_R`, `ENTS_F`, `SCORE`). These metrics reflect the model’s performance on the validation set and are used to monitor convergence and early stopping.

The best-performing model is saved under the `model-best` directory inside the specified output folder.

In [15]:
!python -m spacy train {CONFIG_PATH / "config0.cfg"} \
  --output {SPACY_MODEL_PATH / "cnn_0"} \
  --paths.train {SPACY_DATA_PATH / "train.spacy"} \
  --paths.dev {SPACY_DATA_PATH / "val.spacy"} \
  --gpu-id 0 --verbose

[2025-06-30 21:41:31,308] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
✔ Created output directory: ../models/spaCy/cnn_0
ℹ Saving to output directory: ../models/spaCy/cnn_0
ℹ Using GPU: 0
[2025-06-30 21:41:31,657] [INFO] Set up nlp object from config
[2025-06-30 21:41:31,666] [DEBUG] Loading corpus from path: ../data/spaCy/val.spacy
[2025-06-30 21:41:31,668] [DEBUG] Loading corpus from path: ../data/spaCy/train.spacy
[2025-06-30 21:41:31,668] [INFO] Pipeline: ['tok2vec', 'ner']
[2025-06-30 21:41:31,671] [INFO] Created vocabulary
[2025-06-30 21:41:31,671] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2025-06-30 21:48:09,060] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2025-06-30 21:48:09,069] [DEBUG] Loading corpus from path: ../data/s

### Model Statistics

| Model   | Dropout | Batch Size | Max Steps | Learn Rate | Loss NER | F1 |
|---------|---------|------------|-----------|------------|----------|----|
| cnn_0   | 0.5     | 1000       | 2000      |  0.001     | 8088.55  | 0.86 |

The first CNN model was trained using a high batch size of 1000 and a dropout rate of 0.5 to encourage regularisation. The model was trained for 2000 steps with an initial learning rate of 0.001, and no pretrained vectors or tok2vec weights were used.

The validation F1-score improved rapidly over the first few hundred steps, rising from 0.07 to 0.86. Loss NER also steadily increased as the model learned to capture more patterns from the data. At step 2000, the final validation F1-score reached 86.03%, with precision at 89.68% and recall at 82.66%.
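As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_score` is introduced here only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for cnn_0 at step 2000, in percent
print(round(f1_score(89.68, 82.66), 2))  # 86.03
```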

This model serves as a strong baseline for evaluating future configurations. The results suggest that the CNN pipeline can learn meaningful span boundaries from weakly labelled environmental data, even without external embeddings or pretrained features.

### 3.2 SpaCy Model 2: Smaller NER Head, Smaller Batch Size, and Longer Training
The second model `cnn_1` is designed to explore how a simpler NER classification head performs when given more training time. It retains the same CNN encoder as the baseline (`MaxoutWindowEncoder.v2` with width 128 and depth 4), but lowers the hidden width of the NER transition model to `32`. This reduces the number of parameters in the classification layer, helping test the model’s efficiency and generalisation when under tighter representational constraints.

Unlike the baseline, this configuration also reduces `maxout_pieces` to `2`, meaning fewer activation combinations are computed at each layer. These changes together make the model lighter and more interpretable while preserving most of the upstream embedding logic.

Dropout remains high at `0.5` to keep regularisation strong over the much longer training run. The total number of training steps is increased to `20000` (up from 2,000 in the baseline), which gives the lower-capacity architecture a fair opportunity to converge fully.

Early stopping is still enabled with a `patience` of 1,000 steps, which means training stops if the validation score has not improved within the last 1,000 training steps (five consecutive evaluations at an `eval_frequency` of 200). This allows for longer training without wasting compute once the model has plateaued.
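The interaction between `patience`, `eval_frequency`, and `max_steps` can be illustrated with a small simulation. This is a simplified sketch of step-based patience, not SpaCy's exact implementation; the function `stopping_step` and the validation scores are made up for illustration.

```python
def stopping_step(eval_scores, eval_frequency, patience, max_steps):
    """Return the training step at which a step-based patience rule would stop."""
    best_score, best_step = float("-inf"), 0
    for i, score in enumerate(eval_scores, start=1):
        step = i * eval_frequency
        if score > best_score:
            best_score, best_step = score, step
        # Stop once no improvement has been seen for `patience` steps,
        # or once the step budget is exhausted
        if step - best_step >= patience or step >= max_steps:
            return step
    return min(len(eval_scores) * eval_frequency, max_steps)


# Hypothetical validation F1 values, one per evaluation (every 200 steps)
scores = [0.80, 0.85, 0.86, 0.855, 0.85, 0.858, 0.85, 0.859, 0.85]
print(stopping_step(scores, eval_frequency=200, patience=1000, max_steps=20000))  # 1600
```

In this toy run the best score occurs at step 600, so training stops at step 1600, well before the 20,000-step budget.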

Other values such as learning rate (0.001) are kept constant, but the batch size is reduced to 500. This allows for more frequent model updates, which can be beneficial when training smaller architectures. Smaller batches introduce additional noise into the gradient updates, acting as an implicit regulariser and helping the model escape sharp minima. This setup gives the model more chances to refine its weights across a longer training horizon, especially useful when early stopping is enabled.

| Parameter                     | Value       |
|------------------------------|-------------|
| `vectors`                    | `null`      |
| `init_tok2vec`               | `null`      |
| `batch_size`    | `500`       |
| `encoder architecture`       | `MaxoutWindowEncoder.v2` |
| `encoder width`              | `128`       |
| `encoder depth`              | `4`         |
| `dropout`                    | `0.5`       |
| `max_steps`                  | `20000`     |
| `patience`                   | `1000`      |
| `learn_rate`                 | `0.001`     |
| `eval_frequency`             | `200`       |


In [17]:
!python -m spacy init config ./spaCy_configs/cnn/config1.cfg --lang en --pipeline ner --optimize accuracy

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
spaCy_configs/cnn/config1.cfg
You can now add your data and train your pipeline:
python -m spacy train config1.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [18]:
!python -m spacy train {CONFIG_PATH / "config1.cfg"} \
  --output {SPACY_MODEL_PATH / "cnn_1"} \
  --paths.train {SPACY_DATA_PATH / "train.spacy"} \
  --paths.dev {SPACY_DATA_PATH / "val.spacy"} \
  --gpu-id 0  --verbose

[2025-06-30 22:40:35,084] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: ../models/spaCy/cnn_1
ℹ Using GPU: 0
[2025-06-30 22:40:35,418] [INFO] Set up nlp object from config
[2025-06-30 22:40:35,428] [DEBUG] Loading corpus from path: ../data/spaCy/val.spacy
[2025-06-30 22:40:35,429] [DEBUG] Loading corpus from path: ../data/spaCy/train.spacy
[2025-06-30 22:40:35,429] [INFO] Pipeline: ['tok2vec', 'ner']
[2025-06-30 22:40:35,432] [INFO] Created vocabulary
[2025-06-30 22:40:35,432] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2025-06-30 22:46:51,709] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2025-06-30 22:46:51,718] [DEBUG] Loading corpus from path: ../data/spaCy/val.spacy
[2025-06-30 22:46:51,719] [DEBUG] Loading corpus

### 3.3 SpaCy Model 3: Shallower Encoder with Lower Dropout
This third model (`cnn_2`) explores the effect of simplifying the architecture by reducing both the encoder depth and the dropout rate.

The encoder depth is reduced from 4 to 2. This limits how many convolutional layers the model uses to extract contextual features. It makes the model faster to train and potentially more generalisable on simpler patterns.
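The effect of depth on context can be made concrete with a short calculation. Each convolutional layer mixes a token with `window_size` neighbours on either side, so stacking layers widens the span of tokens that can influence one output vector. The sketch below assumes a `window_size` of 1, which is not stated in this notebook and should be checked in the config file; the function name is illustrative.

```python
def receptive_field(depth, window_size=1):
    """Number of tokens that can influence one output vector after `depth` layers."""
    return 1 + 2 * window_size * depth


print(receptive_field(4))  # 9 tokens for the baseline encoder
print(receptive_field(2))  # 5 tokens for the shallower encoder
```

Under that assumption, halving the depth narrows the visible context from nine tokens to five, which is the main trade-off this configuration tests.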

Dropout is lowered from 0.5 to 0.3. Because the encoder is already smaller, using high dropout could cause underfitting. A moderate value maintains regularisation while allowing the model to learn from more of the training signal.

All other values stay the same as the previous model. These include a batch size of 500, a learning rate of 0.001, and a total of 20,000 training steps with early stopping patience set to 1,000. This lets the model train long enough while stopping early if the validation score does not improve.

This setup helps assess how well a shallower network can perform with milder regularisation.

| Parameter                     | Value       |
|------------------------------|-------------|
| `vectors`                    | `null`      |
| `init_tok2vec`               | `null`      |
| `batch_size`                 | `500`       |
| `encoder architecture`       | `MaxoutWindowEncoder.v2` |
| `encoder width`              | `128`       |
| `encoder depth`              | `2`         |
| `dropout`                    | `0.3`       |
| `max_steps`                  | `20000`     |
| `patience`                   | `1000`      |
| `learn_rate` (Adam)          | `0.001`     |
| `eval_frequency`             | `200`       |


In [19]:
!python -m spacy init config ./spaCy_configs/cnn/config2.cfg --lang en --pipeline ner --optimize accuracy

✘ The provided output file already exists. To force overwriting the
config file, set the --force or -F flag.



In [20]:
!python -m spacy train {CONFIG_PATH / "config2.cfg"} \
  --output {SPACY_MODEL_PATH / "cnn_2"} \
  --paths.train {SPACY_DATA_PATH / "train.spacy"} \
  --paths.dev {SPACY_DATA_PATH / "val.spacy"} \
  --gpu-id 0  --verbose

[2025-07-01 01:16:19,564] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
✔ Created output directory: ../models/spaCy/cnn_2
ℹ Saving to output directory: ../models/spaCy/cnn_2
ℹ Using GPU: 0
[2025-07-01 01:16:19,874] [INFO] Set up nlp object from config
[2025-07-01 01:16:19,882] [DEBUG] Loading corpus from path: ../data/spaCy/val.spacy
[2025-07-01 01:16:19,883] [DEBUG] Loading corpus from path: ../data/spaCy/train.spacy
[2025-07-01 01:16:19,883] [INFO] Pipeline: ['tok2vec', 'ner']
[2025-07-01 01:16:19,886] [INFO] Created vocabulary
[2025-07-01 01:16:19,886] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2025-07-01 01:22:21,709] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2025-07-01 01:22:21,718] [DEBUG] Loading corpus from path: ../data/s

In [10]:
# def evaluate_f1(nlp, examples):
#     scorer = Scorer()
#     example_preds = []
#     for example in examples:
#         pred = nlp(example.text)
#         example_preds.append(Example(pred, example.reference))
#     scores = scorer.score(example_preds)
#     return scores["ents_f"]

In [14]:
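def evaluate_f1(nlp, examples):
    """Return the entity-level F1 of `nlp` on `examples` (0.0 if it cannot be computed)."""
    scorer = Scorer()
    texts = [ex.text for ex in examples]
    pred_docs = nlp.pipe(texts)  # batch prediction is faster than calling nlp() per text
    # Pair each predicted Doc with the gold-standard reference Doc
    example_preds = [Example(pred, ex.reference) for pred, ex in zip(pred_docs, examples)]
    scores = scorer.score(example_preds)
    # ents_f can be None when there are no gold entities, so fall back to 0.0
    return scores.get("ents_f") or 0.0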

In [15]:
def train_spacy_model(
    model_name,
    train_examples,
    val_examples,
    n_iter=50,
    batch_size=32,
    dropout=0.1,
    patience=5
):
    # Force SpaCy to use GPU via Thinc + CuPy backend
    spacy.require_gpu()

    # Create a blank English pipeline and add NER component
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")

    # Register all entity labels from training examples
    for example in train_examples:
        for ent in example.reference.ents:
            ner.add_label(ent.label_)

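    # Initialise pipeline weights; initialize() returns the optimizer for a fresh
    # pipeline (resume_training() is meant for continuing an already trained one)
    optimizer = nlp.initialize(get_examples=lambda: train_examples)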

    # Set up early stopping and output directory
    best_f1 = 0.0
    patience_counter = 0
    model_path = SPACY_MODEL_PATH / model_name
    model_path.mkdir(parents=True, exist_ok=True)

    # Training loop
    for epoch in trange(n_iter):
        random.shuffle(train_examples)
        losses = {}
        batches = minibatch(train_examples, size=batch_size)

        for batch in batches:
            nlp.update(batch, sgd=optimizer, drop=dropout, losses=losses)

        # Evaluate on validation set
        f1 = evaluate_f1(nlp, val_examples)
        print(f"Epoch {epoch+1}: Loss = {losses['ner']:.4f}, F1 = {f1:.4f}")

        # Save model if F1 improves
        if f1 > best_f1:
            best_f1 = f1
            patience_counter = 0
            nlp.to_disk(model_path)
            print(f"New best model saved: {model_name} (F1={f1:.4f})")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping: no improvement in {patience} epochs.")
                break

    return nlp

In [16]:
def evaluate_ner_model(nlp, examples):
    tp = 0
    fp = 0
    fn = 0
    label_counter = Counter()

    for example in examples:
        # Predict using trained model
        pred_doc = nlp(example.text)

        # Ground truth entities (called 'gold' in NLP convention)
        gold_ents = {(ent.start_char, ent.end_char, ent.label_) for ent in example.reference.ents}
        pred_ents = {(ent.start_char, ent.end_char, ent.label_) for ent in pred_doc.ents}

        # Print comparison for debugging and review
        print(f"\nText: {example.text}")
        print("Gold:", gold_ents)
        print("Pred:", pred_ents)

        tp += len(gold_ents & pred_ents)
        fp += len(pred_ents - gold_ents)
        fn += len(gold_ents - pred_ents)

        for _, _, label in gold_ents:
            label_counter[label] += 1

    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)

    print(f"\nEvaluation summary:")
    print(f"True Positives: {tp}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1 Score:  {f1:.4f}")

    print("\nEntity label distribution (ground truth):")
    for label, count in label_counter.items():
        print(f"{label}: {count}")

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "label_counts": dict(label_counter)
    }
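`evaluate_ner_model` reports aggregate counts only, while one of the stated objectives is to see how the models perform on each entity type. A minimal sketch of a per-label breakdown follows, using the same `(start_char, end_char, label)` set representation; the function `per_label_prf` and the toy spans are hypothetical additions, not part of the original notebook.

```python
from collections import defaultdict


def per_label_prf(gold_pred_pairs):
    """Compute precision, recall and F1 per label from (gold_set, pred_set) pairs."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in gold_pred_pairs:
        for _, _, label in gold & pred:
            counts[label]["tp"] += 1
        for _, _, label in pred - gold:
            counts[label]["fp"] += 1
        for _, _, label in gold - pred:
            counts[label]["fn"] += 1

    report = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        report[label] = {"precision": p, "recall": r, "f1": f}
    return report


# Toy example: one correct TAXONOMY span, one missed, one spurious HABITAT span
gold = {(0, 14, "TAXONOMY"), (20, 25, "TAXONOMY")}
pred = {(0, 14, "TAXONOMY"), (15, 19, "HABITAT")}
report = per_label_prf([(gold, pred)])
for label, metrics in sorted(report.items()):
    print(label, metrics)
```

The `gold_ents` and `pred_ents` sets already built inside `evaluate_ner_model` could be collected into a list and passed to a helper like this one.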


In [18]:
cnn_model_1 = train_spacy_model("cnn_1", train_examples, val_examples, n_iter=25, batch_size=128, dropout=0.0)

  4%| 1/25 [17:08<6:51:30, 1028.77s/it]

Epoch 1: Loss = 118134.8166, F1 = 0.9832
New best model saved: cnn_1 (F1=0.9832)

  8%| 2/25 [34:23<6:35:41, 1032.25s/it]

Epoch 2: Loss = 26347.9396, F1 = 0.9880
New best model saved: cnn_1 (F1=0.9880)

 12%| 3/25 [51:20<6:15:52, 1025.12s/it]

Epoch 3: Loss = 19234.6683, F1 = 0.9899
New best model saved: cnn_1 (F1=0.9899)

 16%| 4/25 [1:08:23<5:58:29, 1024.28s/it]

Epoch 4: Loss = 15891.4237, F1 = 0.9906
New best model saved: cnn_1 (F1=0.9906)

 20%| 5/25 [1:25:31<5:41:56, 1025.81s/it]

Epoch 5: Loss = 13654.0269, F1 = 0.9913
New best model saved: cnn_1 (F1=0.9913)

 24%| 6/25 [1:42:40<5:25:07, 1026.72s/it]

Epoch 6: Loss = 12437.5488, F1 = 0.9923
New best model saved: cnn_1 (F1=0.9923)

 28%| 7/25 [1:59:56<5:08:56, 1029.78s/it]

Epoch 7: Loss = 11180.7606, F1 = 0.9922

 32%| 8/25 [2:17:11<4:52:13, 1031.39s/it]

Epoch 8: Loss = 10628.8827, F1 = 0.9923
New best model saved: cnn_1 (F1=0.9923)

 36%| 9/25 [2:34:25<4:35:16, 1032.25s/it]

Epoch 9: Loss = 10121.0306, F1 = 0.9920

 40%| 10/25 [2:51:40<4:18:19, 1033.28s/it]

Epoch 10: Loss = 9594.1881, F1 = 0.9925
New best model saved: cnn_1 (F1=0.9925)


 44%|████████████████████████████████████████████████████████████████▋                                                                                  | 11/25 [3:08:48<4:00:44, 1031.72s/it]

Epoch 11: Loss = 9108.4780, F1 = 0.9925


 48%|██████████████████████████████████████████████████████████████████████▌                                                                            | 12/25 [3:25:51<3:42:54, 1028.80s/it]

Epoch 12: Loss = 8761.9833, F1 = 0.9932
New best model saved: cnn_1 (F1=0.9932)


 52%|████████████████████████████████████████████████████████████████████████████▍                                                                      | 13/25 [3:42:54<3:25:26, 1027.21s/it]

Epoch 13: Loss = 8427.8264, F1 = 0.9928


 56%|██████████████████████████████████████████████████████████████████████████████████▎                                                                | 14/25 [4:00:11<3:08:50, 1030.05s/it]

Epoch 14: Loss = 7937.2117, F1 = 0.9923


 60%|████████████████████████████████████████████████████████████████████████████████████████▏                                                          | 15/25 [4:17:37<2:52:28, 1034.82s/it]

Epoch 15: Loss = 7971.9553, F1 = 0.9929


 64%|██████████████████████████████████████████████████████████████████████████████████████████████                                                     | 16/25 [4:35:14<2:36:13, 1041.52s/it]

Epoch 16: Loss = 7494.5822, F1 = 0.9930


 64%|██████████████████████████████████████████████████████████████████████████████████████████████                                                     | 16/25 [4:52:49<2:44:43, 1098.11s/it]

Epoch 17: Loss = 7200.9827, F1 = 0.9929
Early stopping: no improvement in 5 epochs.
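
The log above is consistent with a patience-based stopping rule: the best validation F1 is reached at epoch 12, and training halts after five further epochs without improvement. The sketch below replays the logged F1 values through such a rule. It is an assumption about how the stopping criterion behaves, not the code of `train_spacy_model`; the function name `simulate_early_stopping` is hypothetical. Because the values are rounded to four decimals, ties such as epochs 6 and 8 are treated as no improvement here, whereas the log shows epoch 8 being saved as a new best.

```python
def simulate_early_stopping(f1_history, patience=5):
    """Replay a list of per-epoch F1 scores through a patience rule.

    Returns (stop_epoch, best_epoch); stop_epoch is None if the rule
    never triggers. Hypothetical helper for illustration only.
    """
    best_f1 = float("-inf")
    best_epoch = None
    stale = 0  # consecutive epochs without a strictly better F1

    for epoch, f1 in enumerate(f1_history, start=1):
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation F1 per epoch, copied from the cnn_1 training log above.
logged_f1 = [
    0.9832, 0.9880, 0.9899, 0.9906, 0.9913, 0.9923, 0.9922, 0.9923, 0.9920,
    0.9925, 0.9925, 0.9932, 0.9928, 0.9923, 0.9929, 0.9930, 0.9929,
]

stop_epoch, best_epoch = simulate_early_stopping(logged_f1, patience=5)
print(stop_epoch, best_epoch)  # 17 12
```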





In [None]:
results_1 = evaluate_ner_model(cnn_model_1, test_examples)


Text: Once the project development areas have been agreed, they will be offered to businesses through a tender process, which is due to be launched in mid-2023.The crown estate hopes these areas will deliver 4 gigawatts of floating offshore wind power by 2035, fuelling almost 4m homes.
Gold: {(235, 239, 'ENV_PROCESS')}
Pred: {(235, 239, 'ENV_PROCESS')}

Text: In the present study, we use Maximum Entropy (MaxEnt) modelling approach to predict the potential of distribution of eleven IAPS under future climatic conditions under RCP 2.6 and RCP 8.5 in part of Kailash sacred landscape region in Western Himalaya.
Gold: {(214, 223, 'HABITAT')}
Pred: {(214, 223, 'HABITAT')}

Text: The report also finds: There is a lack of coordination between departments, state and federal governments on threatened species activities There has been no critical habitat declared in Queensland Policies are missing in key areas of species protection, including stopping species from reaching threatened status, prio

In [None]:
cnn_model_2 = train_spacy_model("cnn_2", train_examples, val_examples, n_iter=100, batch_size=32, dropout=0.0)


In [None]:
results_2 = evaluate_ner_model(cnn_model_2, test_examples)

In [None]:
cnn_model_3 = train_spacy_model("cnn_3", train_examples, val_examples, n_iter=100, batch_size=32, dropout=0.5)

In [None]:
results_3 = evaluate_ner_model(cnn_model_3, test_examples)

In [None]:
cnn_model_4 = train_spacy_model("cnn_4", train_examples, val_examples, n_iter=100, batch_size=64, dropout=0.5)

In [None]:
results_4 = evaluate_ner_model(cnn_model_4, test_examples)

In [None]:
cnn_model_5 = train_spacy_model("cnn_5", train_examples, val_examples, n_iter=100, batch_size=64, dropout=0.2)

In [None]:
results_5 = evaluate_ner_model(cnn_model_5, test_examples)

In [None]:
cnn_model_6 = train_spacy_model("cnn_6", train_examples, val_examples, n_iter=200, batch_size=128, dropout=0.1)

In [None]:
results_6 = evaluate_ner_model(cnn_model_6, test_examples)
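
Once the runs above have finished, the dictionaries returned by `evaluate_ner_model` can be compared side by side. The following is a minimal sketch that assumes each dictionary has the `precision`, `recall` and `f1` keys shown in the function's return value; `rank_runs` and `format_table` are hypothetical helper names, and no scores are implied for the runs that have not yet been executed.

```python
def rank_runs(results_by_name):
    """Sort runs by F1, best first.

    `results_by_name` maps a run name to a dict with at least the keys
    "precision", "recall" and "f1". Hypothetical helper.
    """
    rows = [
        (name, r["precision"], r["recall"], r["f1"])
        for name, r in results_by_name.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


def format_table(rows):
    """Render ranked rows as a fixed-width text table."""
    lines = [f"{'model':<8} {'precision':>9} {'recall':>9} {'f1':>9}"]
    for name, precision, recall, f1 in rows:
        lines.append(f"{name:<8} {precision:>9.4f} {recall:>9.4f} {f1:>9.4f}")
    return "\n".join(lines)


# Intended use in the notebook, once the evaluation cells have been run:
# print(format_table(rank_runs({"cnn_1": results_1, "cnn_2": results_2})))
```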