# **NER Dataset Creation Notebook**

This notebook builds a small synthetic dataset for the **Named Entity Recognition (NER)** task.  
We will:
1. Define a list of mountain names and generate text samples that contain them.
2. Annotate these texts with character-offset NER labels (here, a single `MOUNTAIN` label).
3. Convert the annotated data into spaCy's `DocBin` format for training, plus a CSV file for inspection.

Let's get started!

In [2]:
# ---
# CELL 1: Import Libraries
# ---
import random
import re
import pandas as pd
import spacy
from spacy.tokens import DocBin

random.seed(42)


## 1. Define Mountain Names

Below, we list some well-known mountain names. In a real project, you could expand this list significantly or replace it with names specific to your domain.


In [3]:
# ---
# CELL 2: Define a list of mountain names
# ---
mountain_names = [
    "Mount Everest", "K2", "Kangchenjunga", "Lhotse", "Makalu",
    "Cho Oyu", "Dhaulagiri", "Manaslu", "Annapurna", "Nanga Parbat"
]


## 2. Generate Synthetic Sentences

We can create synthetic sentences that mention these mountains by filling fixed templates with randomly chosen names.  
In a real-world scenario, you might instead collect text from articles, books, or other sources and then annotate it manually or with a semi-automated approach.


In [4]:
# ---
# CELL 3: Generate Synthetic Sentences
# ---
def generate_sentences(mountains, num_samples=20):
    """
    Generate synthetic sentences mentioning random mountains from the given list.
    
    Args:
        mountains (list): A list of mountain names.
        num_samples (int): Number of synthetic sentences to generate.
    
    Returns:
        list: A list of synthetic sentences.
    """
    sentences = []
    for _ in range(num_samples):
        num_mtns = random.randint(1, 2)  # Randomly choose how many mountains to mention
        chosen = random.sample(mountains, num_mtns)
        
        if len(chosen) == 1:
            sentence = f"I have always dreamed of climbing {chosen[0]}."
        else:
            sentence = f"{chosen[0]} and {chosen[1]} are both on my bucket list."
        
        sentences.append(sentence)
    return sentences

synthetic_sentences = generate_sentences(mountain_names, num_samples=30)

# Let's preview a few sentences
for s in synthetic_sentences[:5]:
    print(s)


I have always dreamed of climbing Mount Everest.
Lhotse and Nanga Parbat are both on my bucket list.
I have always dreamed of climbing K2.
I have always dreamed of climbing Nanga Parbat.
Mount Everest and Nanga Parbat are both on my bucket list.
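
The generator above uses only two templates, so every entity starts at character 0, 18-ish, or 34. A model trained on such data may learn the position rather than the name. The following is a minimal sketch of one way to add variety; the `TEMPLATES` list and the `generate_varied_sentences` helper are hypothetical and not part of the original notebook.

```python
import random

# Hypothetical templates; {m} marks where a mountain name is substituted.
TEMPLATES = [
    "I have always dreamed of climbing {m}.",
    "The expedition to {m} was postponed because of bad weather.",
    "Our guide pointed out {m} on the horizon.",
    "Have you ever seen {m} at sunrise?",
]

def generate_varied_sentences(mountains, num_samples=20, seed=42):
    """Fill randomly chosen templates with randomly chosen mountain names."""
    # A local generator keeps the result reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    return [
        rng.choice(TEMPLATES).format(m=rng.choice(mountains))
        for _ in range(num_samples)
    ]

names = ["K2", "Lhotse", "Nanga Parbat"]
varied = generate_varied_sentences(names, num_samples=5)
for s in varied:
    print(s)
```

Because the names are inserted at different positions and in different contexts, the annotation step that follows works on these sentences unchanged.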


## 3. Annotate the Text for NER

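We need to annotate each mention of a mountain with the label `MOUNTAIN`.  
We'll do this by searching each generated sentence for exact, case-sensitive matches of the mountain names and recording their character offsets. This simple approach works here because the sentences are built from the same list of names.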


In [5]:
# ---
# CELL 4: Annotate the Text
# ---
def annotate_text(sentences, mountains):
    """
    Annotate each sentence by identifying mountain mentions and 
    storing the character start, end indices, and label.
    
    Args:
        sentences (list): List of text strings.
        mountains (list): List of mountain names.
        
    Returns:
        list: A list of tuples (sentence, {"entities": [...]})
    """
    training_data = []
    for sentence in sentences:
        entities = []
        for mountain in mountains:
            # re.finditer yields every occurrence of the name in the sentence.
            # re.escape makes the name a literal pattern, so regex metacharacters
            # in a name cannot change its meaning.
            # Note: this is a case-sensitive substring match without word boundaries.
            for match in re.finditer(re.escape(mountain), sentence):
                start, end = match.span()
                entities.append((start, end, "MOUNTAIN"))
        
        # Sort entities by start position so spans appear in reading order
        entities = sorted(entities, key=lambda x: x[0])
        
        training_data.append((sentence, {"entities": entities}))
    return training_data

training_data = annotate_text(synthetic_sentences, mountain_names)

# Let's see how a few annotated samples look
for item in training_data[:5]:
    print(item)


('I have always dreamed of climbing Mount Everest.', {'entities': [(34, 47, 'MOUNTAIN')]})
('Lhotse and Nanga Parbat are both on my bucket list.', {'entities': [(0, 6, 'MOUNTAIN'), (11, 23, 'MOUNTAIN')]})
('I have always dreamed of climbing K2.', {'entities': [(34, 36, 'MOUNTAIN')]})
('I have always dreamed of climbing Nanga Parbat.', {'entities': [(34, 46, 'MOUNTAIN')]})
('Mount Everest and Nanga Parbat are both on my bucket list.', {'entities': [(0, 13, 'MOUNTAIN'), (18, 30, 'MOUNTAIN')]})
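
Plain substring search can mislabel text once the name list grows: a short name may match inside a longer word, and two names may overlap (for example a list containing both "Mount Everest" and "Everest"). Below is a minimal sketch of a stricter matcher, assuming whole-word, longest-name-first matching is what you want; `find_mountain_spans` is a hypothetical helper, not part of the original notebook.

```python
import re

def find_mountain_spans(sentence, mountains):
    """Return non-overlapping (start, end, label) spans for whole-word matches."""
    if not mountains:
        return []
    # Put longer names first so "Mount Everest" wins over "Everest".
    names = sorted(mountains, key=len, reverse=True)
    # (?<!\w) and (?!\w) reject matches embedded in a longer word,
    # e.g. "K2" inside "K2000".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    # finditer scans left to right and never yields overlapping matches.
    return [(m.start(), m.end(), "MOUNTAIN") for m in pattern.finditer(sentence)]

names = ["Mount Everest", "Everest", "K2"]
text = "Mount Everest and K2 are taller than the K2000 tower."
spans = find_mountain_spans(text, names)
for start, end, label in spans:
    print(start, end, label, repr(text[start:end]))
```

Slicing the sentence with each span, as the loop does, is also a cheap sanity check that the offsets point at the intended text.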


## 4. Converting to spaCy's `DocBin` Format

spaCy v3 trains from binary `.spacy` files, which are serialized `DocBin` objects. This section shows how to:
1. Create a blank spaCy pipeline (its tokenizer is all we need to build `Doc` objects).
2. Convert the annotated data into `Doc` objects.
3. Save them as a `.spacy` file for easy loading during training.


In [6]:
# ---
# CELL 5: Convert the Training Data to DocBin
# ---
def create_spacy_docbin(nlp, data):
    """
    Convert annotated data into a spaCy DocBin object.
    
    Args:
        nlp (Language): A loaded spaCy language model (to create Doc objects).
        data (list): The annotated data in the format [(text, {"entities": [...]})].
        
    Returns:
        DocBin: A DocBin object containing the docs with entity annotations.
    """
    doc_bin = DocBin()
    for text, annot in data:
        doc = nlp.make_doc(text)
        ents = []
        
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
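            if span is None:
                # char_span returns None when the character offsets do not
                # align with token boundaries; skip such spans.
                # (Overlapping spans are not caught here: assigning them to
                # doc.ents raises a ValueError.)
                continue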
            ents.append(span)
        
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin

# A blank English pipeline is enough here: only its tokenizer is needed
# to create Doc objects, so no trained model has to be downloaded.
nlp_blank = spacy.blank("en")

doc_bin = create_spacy_docbin(nlp_blank, training_data)
doc_bin.to_disk("ner_mountain_dataset.spacy")

print("spaCy DocBin dataset saved to: ner_mountain_dataset.spacy")


spaCy DocBin dataset saved to: ner_mountain_dataset.spacy
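
Since spans that fail to align are skipped silently, it is worth reading the file back and checking that the entities survived. The sketch below assumes spaCy v3 and uses a temporary file with a single hand-built example instead of `ner_mountain_dataset.spacy`, so it can run on its own.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Build one annotated Doc the same way create_spacy_docbin does.
text = "I have always dreamed of climbing Mount Everest."
doc = nlp.make_doc(text)
doc.ents = [doc.char_span(34, 47, label="MOUNTAIN")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.spacy"
    DocBin(docs=[doc]).to_disk(path)

    # Reload the file and rebuild Doc objects against the pipeline's vocab.
    loaded = DocBin().from_disk(path)
    restored = list(loaded.get_docs(nlp.vocab))

entities = [
    (ent.text, ent.start_char, ent.end_char, ent.label_)
    for ent in restored[0].ents
]
print(entities)
```

Comparing the number of restored entities with the number of annotated spans is a quick way to detect spans that were dropped during conversion.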


## 5. (Optional) Convert to CSV or Any Other Format

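Depending on your workflow, you might also want to export the dataset to CSV or JSON for inspection. Note that the `entities` column is written to the CSV as the string form of a Python list, so this file suits manual inspection better than reloading.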


In [7]:
# ---
# CELL 6: (Optional) Save to CSV
# ---
df = pd.DataFrame([
    {
        "text": text,
        "entities": annot["entities"]
    }
    for text, annot in training_data
])
df.to_csv("ner_mountain_dataset.csv", index=False)

print("CSV dataset saved to: ner_mountain_dataset.csv")

# Quick preview
df.head()


CSV dataset saved to: ner_mountain_dataset.csv


|   | text | entities |
|---|------|----------|
| 0 | I have always dreamed of climbing Mount Everest. | [(34, 47, MOUNTAIN)] |
| 1 | Lhotse and Nanga Parbat are both on my bucket ... | [(0, 6, MOUNTAIN), (11, 23, MOUNTAIN)] |
| 2 | I have always dreamed of climbing K2. | [(34, 36, MOUNTAIN)] |
| 3 | I have always dreamed of climbing Nanga Parbat. | [(34, 46, MOUNTAIN)] |
| 4 | Mount Everest and Nanga Parbat are both on my ... | [(0, 13, MOUNTAIN), (18, 30, MOUNTAIN)] |
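
For the JSON option mentioned above, one possibility is JSON Lines (one object per line), which keeps the entity offsets as real numbers instead of a stringified list. This is a minimal sketch using only the standard library; `to_jsonl`, `from_jsonl`, and `sample_data` are hypothetical names, and the two records simply mirror the shape of `training_data`.

```python
import json

# Two records in the same (text, {"entities": [...]}) shape as training_data.
sample_data = [
    ("I have always dreamed of climbing K2.",
     {"entities": [(34, 36, "MOUNTAIN")]}),
    ("Lhotse and Nanga Parbat are both on my bucket list.",
     {"entities": [(0, 6, "MOUNTAIN"), (11, 23, "MOUNTAIN")]}),
]

def to_jsonl(data):
    """Serialize records to a JSON Lines string, one JSON object per line."""
    return "\n".join(
        json.dumps({"text": text, "entities": [list(e) for e in annot["entities"]]})
        for text, annot in data
    )

def from_jsonl(payload):
    """Parse a JSON Lines string back into (text, {"entities": [...]}) tuples."""
    records = []
    for line in payload.splitlines():
        obj = json.loads(line)
        # JSON has no tuple type, so convert the lists back to tuples.
        entities = [(start, end, label) for start, end, label in obj["entities"]]
        records.append((obj["text"], {"entities": entities}))
    return records

jsonl = to_jsonl(sample_data)
roundtrip = from_jsonl(jsonl)
print(jsonl.splitlines()[0])
```

Writing `jsonl` to a file and reading it back later recovers the annotations without any string parsing of the `entities` field.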


## 6. Conclusion

- We **generated synthetic sentences** mentioning mountain names.
- We **annotated** them with a `MOUNTAIN` label.
- We **converted** the annotated data into:
  - A **spaCy `DocBin`** file (`.spacy`) for easy integration with spaCy training scripts.
  - A **CSV file** for quick inspection.

This concludes the dataset creation process for our **NER** task.
