# spaCy NER: single-pass entity extraction on a transcript string


**Purpose.** Add named-entity extraction to transcripts using spaCy.

| Component | Role                                   | Why it matters                                         |
| --------- | -------------------------------------- | ------------------------------------------------------ |
| spaCy     | NLP toolkit                            | Pretrained NER behind a single `nlp(text)` call        |
| Entity    | Labeled span naming a real-world thing | Pulls people, products, places, dates from text        |
| Pipeline  | Ordered processors                     | Control what runs and when (tagger, parser, ner, etc.) |

**Setup.** Install spaCy and a language model.

| Step           | Command                                   | Note                                |
| -------------- | ----------------------------------------- | ----------------------------------- |
| Install        | `pip install spacy`                       | Once per environment                |
| Download model | `python -m spacy download en_core_web_sm` | Use `*_md`/`*_lg` for larger models |

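A quick way to confirm the setup is to check for the model package before loading it. The sketch below assumes spaCy v3 and uses `spacy.util.is_package`; falling back to a blank English pipeline is an illustrative choice rather than part of the setup above, and a blank pipeline only tokenizes, so it finds no entities.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model downloaded in the setup step

if is_package(MODEL_NAME):
    # Model package is installed: load the full pretrained pipeline
    nlp = spacy.load(MODEL_NAME)
else:
    # Assumption for this sketch: fall back to tokenizer-only English
    nlp = spacy.blank("en")
    print(f"{MODEL_NAME} not found; run: python -m spacy download {MODEL_NAME}")

print(nlp.pipe_names)  # an empty list means no tagger, parser, or ner is loaded
```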
**Core objects.** How spaCy structures text.

| Object  | Meaning                                  | Accessors                                                  |
| ------- | ---------------------------------------- | ---------------------------------------------------------- |
| `Doc`   | Processed text container                 | `doc.text`, `doc.ents`, `doc.sents`, iterate tokens        |
| `Token` | Single lexical unit                      | `token.text`, `token.idx` (character offset in `doc.text`) |
| `Span`  | Slice of tokens (e.g., sentence, entity) | `span.text`, `span.start`, `span.end` (token indices)      |


In [1]:
# Minimal: load model, build a Doc, view tokens and sentences
import spacy
nlp = spacy.load("en_core_web_sm") # load model
doc = nlp("Acme shipped my smartphone to Sydney on 3 Oct. Thank you.")

print([(t.text, t.idx) for t in doc])  # tokens with their character offsets
print([s.text for s in doc.sents])     # sentence spans (set by the parser)

[('Acme', 0), ('shipped', 5), ('my', 13), ('smartphone', 16), ('to', 27), ('Sydney', 30), ('on', 37), ('3', 40), ('Oct.', 42), ('Thank', 47), ('you', 53), ('.', 56)]
['Acme shipped my smartphone to Sydney on 3 Oct. Thank you.']



**Built-in NER.** Extract entities and inspect labels.

| Access                   | Returns                  | Example labels                         |
| ------------------------ | ------------------------ | -------------------------------------- |
| `doc.ents`               | tuple of entity `Span`s  | `PERSON, ORG, GPE, DATE, PRODUCT, ...` |
| `ent.text`, `ent.label_` | surface form and label   | `"Sydney" → GPE`, `"3 Oct." → DATE`    |

In [3]:
# Named entities from a Doc
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: Sydney, Label: GPE
Entity: 3 Oct., Label: DATE

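Entities are `Span` objects, so each one also carries character offsets back into the transcript. The following is a minimal sketch of turning entities into plain records. To stay independent of any model's predictions, it sets the two entities from the output above by hand on a blank pipeline; with a loaded model, `doc = nlp(text)` would replace that manual step. The `records` name and the dictionary layout are illustrative assumptions.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; stands in for a loaded model
text = "Acme shipped my smartphone to Sydney on 3 Oct. Thank you."
doc = nlp(text)

# Hand-set entities mirroring the output above (a trained ner would predict these)
doc.ents = [
    doc.char_span(30, 36, label="GPE"),   # "Sydney"
    doc.char_span(40, 46, label="DATE"),  # "3 Oct."
]

# One record per entity: surface text, label, and character span
records = [
    {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
]

for rec in records:
    # The offsets slice the original string back to the entity text
    assert text[rec["start"]:rec["end"]] == rec["text"]
    print(rec)
```

Records like these can be stored alongside the transcript and used later to highlight or slice the original text.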

**Custom entities via rules.** Add domain terms with `EntityRuler`.

| Step             | API                                          | Purpose                              |
| ---------------- | -------------------------------------------- | ------------------------------------ |
| Create ruler     | `nlp.add_pipe("entity_ruler", before="ner")` | Insert before statistical NER        |
| Add patterns     | `ruler.add_patterns([...])`                  | Rule labels for products, SKUs, etc. |
| Inspect pipeline | `nlp.pipe_names`                             | Confirm `entity_ruler` is active     |

In [None]:
# Add a rule that labels "smartphone" as PRODUCT before ner runs
# (the statistical ner component can miss domain-specific terms)
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Each pattern maps a label to a phrase or token pattern; here, one product term
ruler.add_patterns([{"label": "PRODUCT", "pattern": "smartphone"}])

print(nlp.pipe_names)  # confirm entity_ruler sits immediately before ner

doc = nlp("The smartphone arrived late.")
[(ent.text, ent.label_) for ent in doc.ents]  # -> [('smartphone', 'PRODUCT')]

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']


[('smartphone', 'PRODUCT')]

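A plain string pattern such as `"smartphone"` matches the exact text only, so `"Smartphone"` at the start of a sentence would be missed. Token patterns can relax that. The sketch below assumes spaCy v3 and uses a blank English pipeline so the ruler's effect is visible on its own; the `ORG` rule for "acme" and the sample sentence are hypothetical additions for illustration.

```python
import spacy

nlp = spacy.blank("en")  # no statistical ner; only the ruler sets entities
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # Token pattern: match on the lowercased token, so casing does not matter
    {"label": "PRODUCT", "pattern": [{"LOWER": "smartphone"}]},
    # Hypothetical brand rule, also case-insensitive
    {"label": "ORG", "pattern": [{"LOWER": "acme"}]},
])

doc = nlp("ACME sent the Smartphone late.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)  # [('ACME', 'ORG'), ('Smartphone', 'PRODUCT')]
```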


**Pipeline control.** Component order decides which annotation wins when a rule and the model overlap.

| Desired effect         | Placement                        | Rationale                                                       |
| ---------------------- | -------------------------------- | --------------------------------------------------------------- |
| Rules take precedence  | `entity_ruler` **before** `ner`  | `ner` keeps the ruler's spans and predicts around them          |
| Model takes precedence | `entity_ruler` with `after="ner"` | By default the ruler skips matches that overlap model entities |
| Pure model behavior    | No ruler                         | Use statistical NER only                                        |

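The effect of ordering can be seen without a trained model. In the sketch below, which assumes spaCy v3, a first `entity_ruler` named `upstream` stands in for whatever component ran earlier (such as `ner`), and a second one named `downstream` is configured with the ruler's `overwrite_ents` setting. The component names, the `build` helper, and the sample sentence are hypothetical; only the precedence behavior is the point.

```python
import spacy

def build(overwrite: bool):
    """Blank pipeline with two rulers; the first plays the role of an earlier component."""
    nlp = spacy.blank("en")
    upstream = nlp.add_pipe("entity_ruler", name="upstream")
    upstream.add_patterns([{"label": "GPE", "pattern": "Sydney"}])
    downstream = nlp.add_pipe(
        "entity_ruler", name="downstream", config={"overwrite_ents": overwrite}
    )
    downstream.add_patterns([{"label": "FAC", "pattern": "Sydney Opera House"}])
    return nlp

text = "We met at the Sydney Opera House."

# Default (overwrite_ents=False): the earlier entity is kept, the overlapping rule is skipped
kept = [(e.text, e.label_) for e in build(False)(text).ents]
# overwrite_ents=True: the later ruler replaces the overlapping earlier entity
replaced = [(e.text, e.label_) for e in build(True)(text).ents]

print(kept)      # [('Sydney', 'GPE')]
print(replaced)  # [('Sydney Opera House', 'FAC')]
```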
**Usage in your project.** Apply to transcripts post-ASR.

| Stage       | Action                                   | Output                      |
| ----------- | ---------------------------------------- | --------------------------- |
| Transcripts | Feed text to `nlp`                       | `Doc`                       |
| Extract     | `[(e.text, e.label_) for e in doc.ents]` | Entities for analytics      |
| Customize   | Add product/brand rules                  | Better recall on Acme terms |

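The stages above can be sketched end to end with `nlp.pipe`, which processes many texts as a stream, and a `Counter` for the analytics step. This is a minimal sketch under stated assumptions: the transcripts are made up, and a blank pipeline with two rules stands in for `spacy.load("en_core_web_sm")` plus your own product and brand rules.

```python
from collections import Counter

import spacy

# Stand-in pipeline: in a real project, load the trained model and add the ruler before ner
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": [{"LOWER": "acme"}]},
    {"label": "PRODUCT", "pattern": [{"LOWER": "smartphone"}]},
])

# Hypothetical post-ASR transcripts
transcripts = [
    "Acme shipped my smartphone late.",
    "The smartphone from Acme works fine.",
    "Thank you for calling.",
]

# Count (normalized text, label) pairs across all transcripts
counts = Counter()
for doc in nlp.pipe(transcripts):
    counts.update((ent.text.lower(), ent.label_) for ent in doc.ents)

for (entity_text, label), n in counts.most_common():
    print(f"{entity_text}\t{label}\t{n}")
```

Lowercasing the entity text before counting is one simple normalization choice; it merges "Acme" and "ACME" but would need refinement for aliases or misspellings introduced by ASR.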