# 🧠 spaCy NER Pipeline Setup - Terminal Log View

✔️ Importing required libraries: spaCy, tqdm, pandas, and utility modules  
✔️ Loading IPython display tools for visualizing HTML content  
✔️ Enabling warning suppression for cleaner logs  
✔️ Initializing logging system with timestamped messages  
✔️ Logger ready to track all major operations in the pipeline  

📦 All core dependencies are in place.  
🛠️ Ready for: data loading, annotation conversion, model initialization, and NER training.  


In [1]:
import spacy
from tqdm import tqdm
from spacy import displacy
from spacy.training import Example
from spacy.util import minibatch, compounding
import pandas as pd
from IPython.display import display, HTML  # IPython.core.display import path is deprecated
import random
import warnings
import logging
import ast
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger(__name__)

# 📂 Loading Dataset for NER Training

✔️ Reading annotated NER data from CSV file  
📍 File path: /kaggle/input/custom-image-dataset/01_03_data_annotation_for_named_entity_recognition.csv  
🔄 Using pandas to parse structured entity data  
📊 DataFrame successfully created — ready for inspection and preprocessing  


In [2]:
df = pd.read_csv("/kaggle/input/custom-image-dataset/01_03_data_annotation_for_named_entity_recognition.csv")

# 👀 Previewing the First Few Rows of the Dataset

✔️ Displaying the top 5 rows of the NER dataset  
🔍 Verifying column structure, entity tags, and sentence formatting  
📑 Confirmed: DataFrame includes necessary fields for NER processing  
➡️ Next Step: Clean and convert annotations for spaCy model training


In [3]:
df.head()

Unnamed: 0,text,annotations
0,The company was founded in December 2002 by R...,(' The company was founded in December 2002 by...
1,"In June 2008, Sequoia Capital, Greylock Partne...","('In June 2008, Sequoia Capital, Greylock Part..."
2,LinkedIn office building at 222 Second Street ...,('LinkedIn office building at 222 Second Stree...
3,LinkedIn office in Toronto inside the Toronto ...,('LinkedIn office in Toronto inside the Toront...
4,LinkedIn filed for an initial public offering ...,('LinkedIn filed for an initial public offerin...


# 🔄 Converting Annotation Strings to Python Objects

✔️ Parsing 'annotations' column from string to native Python data structures  
🧠 Used ast.literal_eval for safe and accurate conversion  
📦 Each annotation is now a Python tuple that starts with the sentence text (see the preview below), ready for entity extraction  
✅ Annotation format validated for compatibility with spaCy training pipeline
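
💡 A minimal sketch of what this conversion does to a single cell. It assumes each CSV cell holds the string form of a `(text, {'entities': [...]})` tuple, which is what the preview suggests but does not fully show; the sample string below is made up for illustration.

```python
import ast

# Hypothetical cell value, shaped like the rows previewed in this notebook.
raw = "('LinkedIn was founded in 2002', {'entities': [(0, 8, 'COMPANY'), (24, 28, 'DATE')]})"

# literal_eval only evaluates Python literals, so no arbitrary code is executed.
parsed = ast.literal_eval(raw)
text, annotations = parsed

print(type(parsed).__name__)  # tuple
for start, end, label in annotations["entities"]:
    # Slicing the text with the stored offsets should give back the entity string.
    print(label, repr(text[start:end]))
```

If a cell is not a valid literal, `ast.literal_eval` raises `ValueError` or `SyntaxError`, so malformed rows surface immediately instead of being silently kept as strings.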


In [4]:
df['annotations'] = df['annotations'].apply(ast.literal_eval)

In [5]:
df.head()

Unnamed: 0,text,annotations
0,The company was founded in December 2002 by R...,( The company was founded in December 2002 by ...
1,"In June 2008, Sequoia Capital, Greylock Partne...","(In June 2008, Sequoia Capital, Greylock Partn..."
2,LinkedIn office building at 222 Second Street ...,(LinkedIn office building at 222 Second Street...
3,LinkedIn office in Toronto inside the Toronto ...,(LinkedIn office in Toronto inside the Toronto...
4,LinkedIn filed for an initial public offering ...,(LinkedIn filed for an initial public offering...


In [6]:
df.shape

(15, 2)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   text         15 non-null     object
 1   annotations  15 non-null     object
dtypes: object(2)
memory usage: 372.0+ bytes


# 🧠 Defining Custom Entity Vocabulary

✔️ Initialized entity list: companies (36 tech brands)  
✔️ Initialized entity list: persons (30 notable founders, CEOs, and entrepreneurs)  
✔️ Initialized entity list: dates (20 structured time references across years)  
✔️ Initialized entity list: places (31 global tech hubs and major cities)  
✔️ Initialized entity list: amounts (20 financial figures for investment/funding events)  

📦 These lists will support rule-based NER tagging or custom pattern-based preprocessing  
✅ Ready for integration with the spaCy pipeline or external annotation augmentation


In [8]:
companies = [
    "LinkedIn", "Google", "Microsoft", "Amazon", "Facebook", "Netflix", "Tesla", "Apple", "Nvidia", "Adobe",
    "Intel", "Oracle", "Samsung", "IBM", "Zoom", "Salesforce", "Spotify", "Dropbox", "Snapchat", "Twitter",
    "Airbnb", "Uber", "Lyft", "PayPal", "Reddit", "Shopify", "Stripe", "GitHub", "Slack", "TikTok", "Tencent",
    "Alibaba", "WeWork", "Dell", "HP", "SpaceX"
]

persons = [
    "Jeff Bezos", "Satya Nadella", "Sundar Pichai", "Mark Zuckerberg", "Elon Musk", "Tim Cook", "Reid Hoffman",
    "Bill Gates", "Sheryl Sandberg", "Susan Wojcicki", "Jack Dorsey", "Travis Kalanick", "Dara Khosrowshahi",
    "Daniel Ek", "Drew Houston", "Marc Benioff", "Jensen Huang", "Marissa Mayer", "Kevin Systrom",
    "Brian Chesky", "Jan Koum", "Evan Spiegel", "Palmer Luckey", "Sam Altman", "Demis Hassabis",
    "Patrick Collison", "Melinda Gates", "Vinod Khosla", "Peter Thiel", "Garrett Camp"
]

dates = [
    "January 2021", "March 2020", "July 2019", "December 2022", "October 2018", "May 2017", "June 2023",
    "February 2016", "April 2015", "August 2014", "September 2013", "November 2012", "March 2011",
    "June 2010", "January 2009", "July 2008", "May 2007", "December 2006", "February 2005", "April 2004"
]


places = [
    "New York", "San Francisco", "Dublin", "Toronto", "Seattle", "London", "Berlin", "Paris", "Tokyo",
    "Singapore", "Sydney", "Bangalore", "Beijing", "Shanghai", "Los Angeles", "Boston", "Austin", "Zurich",
    "Amsterdam", "Chicago", "Vancouver", "Seoul", "Munich", "Jakarta", "Dubai", "Hong Kong", "Melbourne",
    "Delhi", "Nairobi", "São Paulo", "Mexico City"
]

amounts = [
    "$10 million", "$500 million", "$1 billion", "$2.5 million", "$75 million", "$6 billion", "$3.2 million",
    "$900,000", "$25 billion", "$450 million", "$120 million", "$15.4 million", "$100 million", "$30 million",
    "$7 billion", "$2 billion", "$8.8 million", "$60 million", "$4.7 billion", "$210 million"
]
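
💡 One way the lists could drive rule-based tagging, sketched with plain regular expressions. The function name `tag_with_lists` and the small `pools` dictionary are illustrative assumptions, not part of the notebook; the full lists above would replace the stand-ins.

```python
import re

# Small illustrative pools; the notebook's full lists would be used in practice.
pools = {
    "COMPANY": ["LinkedIn", "HP", "Stripe"],
    "PLACE": ["New York", "Dublin"],
    "MONEY": ["$10 million", "$210 million"],
}

def tag_with_lists(text, pools):
    """Return sorted (start, end, label) spans for every listed phrase found in text."""
    spans = []
    for label, phrases in pools.items():
        # Longest phrases first so a longer entry wins over a shorter one at the same position.
        ordered = sorted(phrases, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(p) for p in ordered))
        spans.extend((m.start(), m.end(), label) for m in pattern.finditer(text))
    return sorted(spans)

sample = "Stripe opened an office in Dublin and raised $210 million."
for start, end, label in tag_with_lists(sample, pools):
    print(label, repr(sample[start:end]))
```

This sketch matches raw substrings and ignores word boundaries, so a short entry such as "HP" could also match inside a longer word; a real pipeline would need to guard against that.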



# 🧪 Generating Synthetic NER Training Sample

✔️ Randomly selecting a COMPANY, PERSON, DATE, PLACE, and MONEY value from predefined entity pools  
🧱 Constructing a natural language sentence embedding all selected entities  
🧾 Example format:
   "[COMPANY] was founded by [PERSON] in [DATE] and opened a new office in [PLACE]. It raised [MONEY] in funding."

🗂️ Creating annotation dictionary with entity spans and labels:
   • "COMPANY" for organizations
   • "PERSON" for individuals
   • "DATE" for time references
   • "PLACE" for geographic locations
   • "MONEY" for monetary amounts

✅ Output: 
   • `text` — the synthetic sentence  
   • `(text, annotations)` — tuple formatted for NER training with spaCy  

📦 Function ready for data generation loops or batch processing for NER model development


In [9]:
def generate_sample():
    company = random.choice(companies)
    person = random.choice(persons)
    date = random.choice(dates)
    place = random.choice(places)
    amount = random.choice(amounts)

    text = f"{company} was founded by {person} in {date} and opened a new office in {place}. It raised {amount} in funding."

    annotations = {
        "entities": [
            (text.find(company), text.find(company) + len(company), "COMPANY"),
            (text.find(person), text.find(person) + len(person), "PERSON"),
            (text.find(date), text.find(date) + len(date), "DATE"),
            (text.find(place), text.find(place) + len(place), "PLACE"),
            (text.find(amount), text.find(amount) + len(amount), "MONEY"),
        ]
    }

    return text, (text, annotations)
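
💡 `text.find(...)` returns the first occurrence of a string, so the offsets above are only correct while no entity string also appears earlier in the sentence. A hedged alternative is to record offsets while the sentence is assembled. `build_sample` is a hypothetical helper written for this note, using the same sentence template.

```python
def build_sample(company, person, date, place, amount):
    """Assemble the sentence piece by piece, recording each entity's offsets as it is appended."""
    parts = [
        (company, "COMPANY"), (" was founded by ", None),
        (person, "PERSON"), (" in ", None),
        (date, "DATE"), (" and opened a new office in ", None),
        (place, "PLACE"), (". It raised ", None),
        (amount, "MONEY"), (" in funding.", None),
    ]
    text, entities = "", []
    for piece, label in parts:
        if label is not None:
            # The entity starts at the current end of the text.
            entities.append((len(text), len(text) + len(piece), label))
        text += piece
    return text, {"entities": entities}

text, ann = build_sample("HP", "Jan Koum", "April 2004", "Los Angeles", "$450 million")
for start, end, label in ann["entities"]:
    print(label, repr(text[start:end]))
```

Because each span is computed from the running length of the text, the offsets stay correct even if two entity strings overlap or repeat.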


# 🔄 Generating Synthetic NER Dataset

🧠 Looping 1,000 times to generate annotated text samples using generate_sample(); draws are random, so repeated sentences are possible  
🧾 Each sample includes:
   • A natural language sentence with embedded entities  
   • A corresponding annotation dictionary with labeled entity spans

📦 Aggregating all samples into a list named `data`  
📊 Converting data into a pandas DataFrame with two columns:
   • "text" — the generated sentence  
   • "annotations" — the entity label mappings

✅ Synthetic NER dataset created successfully — ready for training, export, or visualization


In [10]:
data = [generate_sample() for _ in range(1000)]
df = pd.DataFrame(data, columns=["text", "annotations"])

In [11]:
df.head()

Unnamed: 0,text,annotations
0,Intel was founded by Satya Nadella in February...,(Intel was founded by Satya Nadella in Februar...
1,Tesla was founded by Travis Kalanick in July 2...,(Tesla was founded by Travis Kalanick in July ...
2,WeWork was founded by Patrick Collison in Febr...,(WeWork was founded by Patrick Collison in Feb...
3,LinkedIn was founded by Daniel Ek in July 2008...,(LinkedIn was founded by Daniel Ek in July 200...
4,Spotify was founded by Patrick Collison in Dec...,(Spotify was founded by Patrick Collison in De...


In [12]:
df.shape

(1000, 2)
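
💡 Since the samples are random draws, it may be worth removing exact duplicates and holding some sentences back for evaluation. The sketch below is one way to do that; `split_train_dev`, the 20% dev fraction, and the seed are assumptions chosen for illustration, and the `samples` list is a stand-in for the generated data.

```python
import random

def split_train_dev(samples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of samples with a fixed seed and split off a dev portion."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in for the 1,000 generated samples.
samples = [f"sentence {i}" for i in range(1000)]
unique = list(dict.fromkeys(samples))  # drop exact duplicates, keep first-seen order
train, dev = split_train_dev(unique)
print(len(train), len(dev))  # 800 200
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without changing the global random state used elsewhere in the notebook.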

# 🧠 Loading spaCy Language Model

✔️ Loaded pre-trained pipeline: 'en_core_web_sm'  
📦 Model includes components for:
   • Tokenization
   • Part-of-speech tagging
   • Dependency parsing
   • Named Entity Recognition (NER)

✅ Ready to use for text processing, entity extraction, and further customization  
💡 Note: This is the small English model — lightweight, fast, but less accurate for complex tasks


In [13]:
nlp=spacy.load("en_core_web_sm")

# 🏷️ Mapping spaCy Entity Labels to Custom Tags

✔️ Created a dictionary to translate default spaCy NER labels to custom-defined labels

🔁 Mappings:
   • 'ORG'    → 'company'
   • 'PERSON' → 'person'
   • 'MONEY'  → 'money'
   • 'GPE'    → 'place'
   • 'DATE'   → 'date'

🎯 Purpose:
   • Normalize entity labels to match project-specific schema
   • Enable consistent downstream processing and evaluation
✅ Mapping ready for use in label transformation, evaluation, or custom visualization tasks


In [14]:
label_map = {'ORG': 'company','PERSON': 'person','MONEY': 'money','GPE': 'place',
    'DATE': 'date'}
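
💡 A small sketch of how such a mapping is typically applied to a list of spans: labels in the map are renamed, everything else is dropped. `remap_entities` is a hypothetical helper and the `predicted` spans are invented for illustration.

```python
label_map = {"ORG": "company", "PERSON": "person", "MONEY": "money",
             "GPE": "place", "DATE": "date"}

def remap_entities(entities, label_map):
    """Keep only spans whose label is in label_map, renamed to the custom tag."""
    return [(start, end, label_map[label])
            for start, end, label in entities
            if label in label_map]

# Hypothetical spans as a pre-trained pipeline might label them.
predicted = [(0, 6, "ORG"), (21, 30, "PERSON"), (34, 38, "CARDINAL")]
print(remap_entities(predicted, label_map))
# [(0, 6, 'company'), (21, 30, 'person')]
```

Note that the custom tags here are lowercase, while the synthetic annotations generated earlier use uppercase labels such as "COMPANY"; the two schemes would need to be aligned before comparing them.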

In [15]:
# print("Original Text:")
# for idx, text in enumerate(df['text']):
#     print(f"{idx}: {text}")

# 🔁 Converting Text Data to spaCy-Compatible Training Format

📥 Looping through each sentence in the DataFrame
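🔍 Using spaCy's pre-trained model to extract named entities; its predictions, not the generated spans in the "annotations" column, become the training labels
🗂️ Keeping only entities whose label appears in label_map and renaming them to the custom tags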

✅ Appending filtered entity spans into TRAIN_DATA as tuples: (text, {'entities': [...]})
✔️ Final count: [X] valid training examples with at least one labeled entity

# 🧪 Logging Sample Training Examples
📄 Displaying first 2 examples with associated entity spans for verification

# 🧱 Initializing a New spaCy Blank Model
📦 Creating a fresh English pipeline with no pre-loaded components

# 🔌 Adding 'ner' Component
✔️ If not present, added a new Named Entity Recognizer (NER) to the pipeline

# 🏷️ Registering Custom Entity Labels
🔁 Iterating through all annotations in TRAIN_DATA
➕ Adding unique labels to the NER component dynamically

✅ NER pipeline is now aware of all custom entity types
❌ Any failure during label addition is logged and raised for debugging


In [16]:
TRAIN_DATA = []
for text in df['text']:
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        if ent.label_ in label_map:
            entities.append((ent.start_char, ent.end_char, label_map[ent.label_]))
    if entities:  # Only include texts with entities
        TRAIN_DATA.append((text, {'entities': entities}))

# Log training data summary
logger.info(f"Created {len(TRAIN_DATA)} training examples")
for text, annot in TRAIN_DATA[:2]:
    logger.info(f"Text: {text}")
    logger.info(f"Entities: {annot['entities']}")

# Initialize blank spaCy model for training
nlp = spacy.blank('en')
if 'ner' not in nlp.pipe_names:
    ner = nlp.add_pipe('ner')
else:
    ner = nlp.get_pipe('ner')

# Add custom labels
try:
    for _, annotations in TRAIN_DATA:
        for _, _, label in annotations['entities']:
            ner.add_label(label)
except Exception as e:
    logger.error(f"Error adding labels: {e}")
    raise
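
💡 The label loop above calls `add_label` once per entity span. A hedged alternative is to collect the distinct labels first, which also gives a quick count per label for sanity checking. The two `TRAIN_DATA` entries below are invented examples in the same shape as the real list.

```python
from collections import Counter

# Hypothetical TRAIN_DATA entries in the (text, {'entities': [...]}) shape used above.
TRAIN_DATA = [
    ("HP raised $75 million.", {"entities": [(0, 2, "company"), (10, 21, "money")]}),
    ("Lyft opened in Dubai.", {"entities": [(0, 4, "company"), (15, 20, "place")]}),
]

# Count how often each label occurs across all annotated spans.
label_counts = Counter(
    label
    for _, annotations in TRAIN_DATA
    for _, _, label in annotations["entities"]
)
labels = sorted(label_counts)
print(labels)                   # ['company', 'money', 'place']
print(label_counts["company"])  # 2
# Each entry in `labels` would then be passed once to ner.add_label(...).
```

A label with a very low count in `label_counts` is a hint that the pre-trained model rarely produced it, which is useful to know before training.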

# 🏋️‍♂️ Training the Custom spaCy NER Model

🔧 Disabling all pipes except 'ner' to focus training solely on entity recognition  
✔️ Optimizer initialized for training  
🗓️ Training runs for 30 iterations with batch sizes drawn from a compounding schedule

📦 Batches prepared using `minibatch` with a compounding size schedule (start 4.0, cap 32.0, growth factor 1.001 per batch)  
🔢 Total calculated batches: [total_batches] — [batches_per_iteration] per iteration  

# 📊 Real-time Progress Monitoring
🕹️ Single `tqdm` progress bar tracks training progress by batch  
🔁 For each iteration:
   • TRAIN_DATA is shuffled for variation  
   • Losses are tracked and logged periodically  
   • Each batch is converted to `Example` objects and passed to `nlp.update`

📉 Logging:
   • Loss values logged every 100 batches or at the final batch
   • Errors during batch updates are caught and logged without halting training

✅ Each iteration logs completion and total loss for tracking training curve

# 💾 Saving the Trained Model
✔️ Training completed — attempting to save model to disk as `custom_ner_model`  
📁 Model directory created successfully and contents written  
❌ Any saving failure is logged and raised for debugging

📦 Ready for model loading, evaluation, or inference


In [17]:
# Train model with a single tqdm progress bar
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    try:
        optimizer = nlp.initialize()  # spaCy v3 name; begin_training() is the deprecated alias
        n_iter = 30
        logger.info(f"Starting training for {n_iter} iterations")

        # Calculate total number of batches (optimized)
        batch_size = compounding(4.0, 32.0, 1.001)
        batches = list(minibatch(TRAIN_DATA, size=batch_size))
        batches_per_iteration = len(batches)
        total_batches = batches_per_iteration * n_iter
        logger.info(f"Total batches: {total_batches} ({batches_per_iteration} per iteration)")

        # Single progress bar
        with tqdm(total=total_batches, desc="Training NER Model", unit="batch", leave=True) as pbar:
            for itn in range(n_iter):
                random.shuffle(TRAIN_DATA)
                losses = {}
                batches = list(minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)))

                for batch in batches:
                    try:
                        # Convert the batch to Example objects and update once per batch
                        examples = [
                            Example.from_dict(nlp.make_doc(text), annotations)
                            for text, annotations in batch
                        ]
                        nlp.update(examples, sgd=optimizer, drop=0.5, losses=losses)
                        pbar.update(1)
                        # Log every 100 batches or at the end
                        if pbar.n % 100 == 0 or pbar.n == total_batches:
                            logger.info(f"Iteration {itn + 1}, Batch {pbar.n}/{total_batches}, Loss: {losses['ner']:.4f}")
                    except Exception as e:
                        logger.error(f"Error processing batch {pbar.n}: {e}")
                        continue
                logger.info(f"Iteration {itn + 1} completed, Total Loss: {losses['ner']:.4f}")
    except Exception as e:
        logger.error(f"Training failed: {e}")
        raise

# Save model
try:
    nlp.to_disk('custom_ner_model')
    logger.info("Model saved to 'custom_ner_model'")
except Exception as e:
    logger.error(f"Failed to save model: {e}")
    raise

[2025-05-18 10:55:46,787] [INFO] Created vocabulary
[2025-05-18 10:55:46,789] [INFO] Finished initializing nlp object
Training NER Model: 100%|██████████| 7350/7350 [14:32<00:00,  8.43batch/s]
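
💡 To see what the compounding schedule does at this data size, here is a plain-Python re-implementation of the idea (a sketch written for this note, not spaCy's own code). It assumes 1,000 training items and the same parameters as the training cell: start 4.0, cap 32.0, factor 1.001.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

def make_batches(items, sizes):
    """Cut items into consecutive batches, taking int(next size) items each time."""
    it = iter(items)
    while True:
        batch = list(islice(it, int(next(sizes))))
        if not batch:
            return
        yield batch

batches = list(make_batches(range(1000), compounding_sizes(4.0, 32.0, 1.001)))
print(len(batches))                                   # 245
print(len(batches[0]), max(len(b) for b in batches))  # 4 5
```

Under these assumptions there are 245 batches per pass, which is consistent with the 7350 total in the progress bar above (245 × 30). Because the schedule is recreated at the start of every iteration and grows by only 0.1% per batch, the batch size stays at 4 or 5 and never approaches the cap of 32.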


# 🎨 Rendering Named Entities in Sample Texts

🧠 Defining a function to visualize entity recognition results
📍 Accepts:
   • Raw text
   • Loaded spaCy model
   • Rendering mode: Jupyter (inline) or HTML (external)

🖼️ If running in Jupyter:
   • Uses displacy to render entities with visual highlights in notebook cells

🌐 If not in Jupyter:
   • Renders as raw HTML string — printed in console or redirected to HTML handler

❌ If rendering fails:
   • Falls back to basic text-based output with entity-label pairs
   • Logs the rendering error using `logger`

# ✅ Testing the Trained NER Model

✔️ Loads saved model from 'custom_ner_model' directory  
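🧪 Selects the first 20 sentences from the synthetic DataFrame; these sentences were also used to build the training data, so this is a sanity check rather than a held-out evaluation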

🔍 For each test sentence:
   • Prints the raw text
   • Visualizes the detected entities with labels via `render_entities()`

📦 This step verifies that:
   • The model can be successfully reloaded
   • Named entities are correctly identified and visually interpretable

❌ Any failure during model loading or entity rendering is caught and logged
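
💡 displaCy can also render spans that did not come from a model, via its manual mode, which takes a plain dictionary. The sketch below only builds that dictionary, so it runs without spaCy; `to_displacy_manual` is a hypothetical helper and the sentence is invented.

```python
def to_displacy_manual(text, entities, title=None):
    """Convert (start, end, label) tuples into displaCy's manual 'ent' input dict."""
    ents = [{"start": s, "end": e, "label": label} for s, e, label in sorted(entities)]
    return {"text": text, "ents": ents, "title": title}

example = to_displacy_manual(
    "Lyft opened in Dubai.",
    [(15, 20, "place"), (0, 4, "company")],
)
print(example["ents"][0])  # {'start': 0, 'end': 4, 'label': 'company'}
# In a session with spaCy installed, this dict could be rendered with:
#   displacy.render(example, style="ent", manual=True)
```

Rendering stored annotations this way makes it possible to compare them visually with the model's predictions for the same sentence.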


In [18]:
# Rendering function
def render_entities(text, nlp, jupyter=True):
    doc = nlp(text)
    try:
        if jupyter:
            displacy.render(doc, style="ent", jupyter=True)
        else:
            html = displacy.render(doc, style="ent", jupyter=False)
            print("HTML Output (non-Jupyter):")
            print(html)
    except Exception as e:
        logger.error(f"Error rendering entities: {e}. Falling back to text output.")
        for ent in doc.ents:
            print(f"{ent.text} -> {ent.label_}")

# Test model
try:
    nlp = spacy.load('custom_ner_model')
    logger.info("Testing trained model")
    for text in df['text'][: 20]:
        print(f"\nText: {text}")
        render_entities(text, nlp, jupyter=True)  # Set jupyter=False if not in Jupyter
except Exception as e:
    logger.error(f"Model loading or testing failed: {e}")


Text: Intel was founded by Satya Nadella in February 2016 and opened a new office in São Paulo. It raised $75 million in funding.



Text: Tesla was founded by Travis Kalanick in July 2019 and opened a new office in Mexico City. It raised $210 million in funding.



Text: WeWork was founded by Patrick Collison in February 2016 and opened a new office in Melbourne. It raised $30 million in funding.



Text: LinkedIn was founded by Daniel Ek in July 2008 and opened a new office in Boston. It raised $100 million in funding.



Text: Spotify was founded by Patrick Collison in December 2006 and opened a new office in Los Angeles. It raised $10 million in funding.



Text: Spotify was founded by Daniel Ek in June 2023 and opened a new office in Delhi. It raised $450 million in funding.



Text: Google was founded by Jack Dorsey in July 2008 and opened a new office in Dubai. It raised $25 billion in funding.



Text: Stripe was founded by Melinda Gates in September 2013 and opened a new office in Boston. It raised $2 billion in funding.



Text: Stripe was founded by Melinda Gates in October 2018 and opened a new office in Seoul. It raised $450 million in funding.



Text: HP was founded by Jan Koum in April 2004 and opened a new office in Los Angeles. It raised $450 million in funding.



Text: Lyft was founded by Jan Koum in January 2009 and opened a new office in Dubai. It raised $75 million in funding.



Text: Reddit was founded by Kevin Systrom in February 2016 and opened a new office in Nairobi. It raised $75 million in funding.



Text: Dell was founded by Patrick Collison in June 2010 and opened a new office in Jakarta. It raised $450 million in funding.



Text: Samsung was founded by Satya Nadella in April 2015 and opened a new office in New York. It raised $75 million in funding.



Text: WeWork was founded by Bill Gates in March 2011 and opened a new office in Dubai. It raised $75 million in funding.



Text: Dropbox was founded by Susan Wojcicki in February 2005 and opened a new office in London. It raised $25 billion in funding.



Text: Airbnb was founded by Peter Thiel in January 2021 and opened a new office in São Paulo. It raised $60 million in funding.



Text: GitHub was founded by Travis Kalanick in March 2020 and opened a new office in Dublin. It raised $3.2 million in funding.



Text: Airbnb was founded by Melinda Gates in August 2014 and opened a new office in Shanghai. It raised $210 million in funding.



Text: Dropbox was founded by Kevin Systrom in January 2021 and opened a new office in Bangalore. It raised $500 million in funding.
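
💡 Visual inspection can be complemented with a simple score. The sketch below computes exact-match precision, recall, and F1 over `(start, end, label)` tuples; `span_prf` is a hypothetical helper and the gold and predicted spans are invented for illustration.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans where offsets and label both match
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = [(0, 4, "company"), (15, 20, "place")]
pred = [(0, 4, "company"), (15, 20, "person")]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

A span counts as correct only when both the offsets and the label match, so the second prediction above is scored as an error even though its boundaries are right.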
