# Named Entity Recognition with spaCy

This notebook demonstrates how to:

1. **Extract named entities** from text using spaCy's pretrained models
2. **Visualize entities** inline with displaCy
3. **Analyze entity frequencies** across a document
4. **Process files** and export results to CSV

spaCy's pretrained English models recognize entity types including PERSON, ORG, GPE (countries, cities, states), LOC, DATE, MONEY, EVENT, WORK_OF_ART, and [more](https://spacy.io/models/en#en_core_web_sm-labels).

## Setup

In [None]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd

nlp = spacy.load("en_core_web_sm")
print(f"spaCy {spacy.__version__}, model: en_core_web_sm")

---
## Part 1: Extract Entities from Text

In [None]:
text = """Barack Obama was born in Honolulu, Hawaii on August 4, 1961.
He graduated from Columbia University and Harvard Law School.
Obama served as the 44th President of the United States from 2009 to 2017.
His memoir "A Promised Land" was published by Crown in November 2020."""

doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.label_:10} {ent.text}")

### Visualize Entities

displaCy renders entities inline with color-coded labels.

In [None]:
displacy.render(doc, style="ent", jupyter=True)

### Filter by Entity Type

In [None]:
target_types = ["PERSON", "ORG", "GPE"]

for ent in doc.ents:
    if ent.label_ in target_types:
        print(f"{ent.label_:10} {ent.text}")
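
The filter above prints matches one at a time. When the results are needed later in the notebook, it can help to collect them into a dictionary keyed by label. The following is a minimal sketch, not part of spaCy: `group_entities_by_label` is a hypothetical helper, and `FakeEnt` is a stand-in used only so the example runs without a loaded model. In the notebook, `doc.ents` could be passed instead.

```python
from collections import defaultdict, namedtuple


def group_entities_by_label(ents, labels=None):
    """Group entity texts by label, keeping first-seen order and dropping repeats.

    `ents` is any iterable of objects exposing `.text` and `.label_`
    (for example `doc.ents`). `labels` optionally restricts the result.
    """
    wanted = set(labels) if labels is not None else None
    grouped = defaultdict(list)
    for ent in ents:
        # Skip labels the caller did not ask for
        if wanted is not None and ent.label_ not in wanted:
            continue
        # Keep each surface form once per label
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in entities (assumption: illustrative values, not model output)
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample_ents = [
    FakeEnt("Barack Obama", "PERSON"),
    FakeEnt("Honolulu", "GPE"),
    FakeEnt("Hawaii", "GPE"),
    FakeEnt("Obama", "PERSON"),
    FakeEnt("August 4, 1961", "DATE"),
    FakeEnt("Honolulu", "GPE"),
]

grouped = group_entities_by_label(sample_ents, labels=["PERSON", "GPE"])
print(grouped)
```

Passing `labels=None` keeps every label, which is a convenient way to get an overview before narrowing to `target_types`.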

---
## Part 2: Entity Frequency Analysis

In [None]:
longer_text = """Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne
in Cupertino, California in 1976. Jobs later returned to Apple in 1997 after
being ousted in 1985. Under his leadership, Apple introduced the iMac, iPod,
iPhone, and iPad. Tim Cook became CEO of Apple after Jobs passed away in
October 2011. Apple is now headquartered in Apple Park, Cupertino. The company
reported $394 billion in revenue for fiscal year 2022. Apple competes with
Microsoft, Google, and Samsung in various markets."""

doc2 = nlp(longer_text)

entity_counts = Counter((ent.text, ent.label_) for ent in doc2.ents)

df = pd.DataFrame(
    # ent_text avoids confusion with the `text` variable defined in Part 1
    [(ent_text, label, count) for (ent_text, label), count in entity_counts.most_common()],
    columns=["Entity", "Type", "Count"]
)
df
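
Because `longer_text` is hard-wrapped, an entity that spans a line break may keep the newline in `ent.text`, so the same name could be counted under two different keys. The sketch below collapses whitespace before counting. `normalize_entity_text` and `count_entities` are hypothetical helpers, and the `pairs` list is made-up illustration data rather than actual model output.

```python
from collections import Counter


def normalize_entity_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def count_entities(pairs):
    """Count (text, label) pairs after whitespace normalization."""
    return Counter((normalize_entity_text(text), label) for text, label in pairs)


# Illustrative pairs (assumption); in the notebook this could be
# ((ent.text, ent.label_) for ent in doc2.ents)
pairs = [
    ("Apple", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Apple\nPark", "FAC"),
    ("Apple Park", "FAC"),
    ("Apple", "ORG"),
]

counts = count_entities(pairs)
for (ent_text, label), n in counts.most_common():
    print(f"{label:10} {ent_text:15} {n}")
```

This only merges variants that differ in whitespace. Linking "Jobs" to "Steve Jobs" is a separate coreference problem that this sketch does not attempt.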

In [None]:
displacy.render(doc2, style="ent", jupyter=True)

---
## Part 3: Process a Text File

Set `FILE_PATH` to a `.txt` file to extract entities and save results as CSV.

In [None]:
def extract_entities_from_file(file_path, model=None):
    """Extract named entities from a text file and return a DataFrame."""
    if model is None:
        model = nlp

    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    doc = model(text)
    entities = [
        {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
    ]
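    # Explicit columns keep the schema stable even when no entities are found
    return pd.DataFrame(entities, columns=["text", "label", "start", "end"])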

In [None]:
# Example usage (uncomment and set your file path):
# FILE_PATH = "your_text.txt"
# entities_df = extract_entities_from_file(FILE_PATH)
# print(f"Found {len(entities_df)} entities")
# entities_df.head(10)

In [None]:
# Save to CSV:
# entities_df.to_csv("entities.csv", index=False)
# print("Saved to entities.csv")
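
`extract_entities_from_file` passes the whole file to the model in one call. For long files this can use a lot of memory, and spaCy rejects texts longer than `nlp.max_length` (1,000,000 characters by default, as far as I know). One option is to split the text into paragraphs and remember each paragraph's character offset so entity positions can be mapped back to the original file. The helper `iter_paragraphs` below is a hypothetical sketch that assumes paragraphs are separated by blank lines.

```python
def iter_paragraphs(text):
    """Yield (char_offset, paragraph) for each non-empty block separated by blank lines.

    `char_offset` is the position of the paragraph's first character in `text`.
    """
    offset = 0
    for block in text.split("\n\n"):
        stripped = block.strip()
        if stripped:
            # Account for any leading whitespace removed by strip()
            start = offset + block.index(stripped)
            yield start, stripped
        # Advance past this block and the two-newline separator
        offset += len(block) + 2


sample = "First paragraph.\n\nSecond paragraph\nwith two lines.\n\n\n\nThird."
chunks = list(iter_paragraphs(sample))
for start, paragraph in chunks:
    print(start, repr(paragraph))

# In the notebook, the paragraphs could then be streamed through the pipeline
# and the offsets added back (sketch, assuming `nlp` is loaded as in Setup):
# offsets, paragraphs = zip(*iter_paragraphs(text))
# for base, doc in zip(offsets, nlp.pipe(paragraphs)):
#     for ent in doc.ents:
#         print(ent.text, ent.label_, base + ent.start_char, base + ent.end_char)
```

Splitting on paragraphs means the model never sees context across a paragraph boundary, which is usually acceptable for entity recognition but is a trade-off worth keeping in mind.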

---
## Part 4: Dependency Visualization

displaCy can also render syntactic dependencies, showing how words relate grammatically.

In [None]:
short_doc = nlp("Obama served as President of the United States.")
displacy.render(short_doc, style="dep", jupyter=True, options={"distance": 100})