<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Named Entity Recognition</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Introduction to Natural Language Processing in Python)</span></div>

## Table of Contents

1. [What is Named Entity Recognition?](#section-1)
2. [NLTK and the Stanford CoreNLP Library](#section-2)
3. [Introduction to SpaCy](#section-3)
4. [Multilingual NER with Polyglot](#section-4)
5. [Conclusion](#section-5)

***

<a id="section-1"></a>
<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. What is Named Entity Recognition?</span><br>

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP). Its primary goal is to identify and classify key information (entities) within unstructured text into predefined categories.

### Key Concepts
NER answers the questions: **Who? What? When? Where?**

It involves identifying categories such as:
*   **People**: Names of individuals (e.g., "Ruth Reichl", "Einstein").
*   **Places**: Locations, cities, countries (e.g., "New York", "Germany").
*   **Organizations**: Companies, institutions (e.g., "MOMA", "Google").
*   **Dates & States**: Temporal expressions and geopolitical states.
*   **Works of Art**: Books, movies, paintings.
*   ...and many other categories!

### Usage
NER is versatile and can be deployed in various contexts:
1.  **Alongside Topic Identification**: To understand not just *what* a text is about, but *who* is involved.
2.  **On its own**: To extract structured data from unstructured documents (e.g., extracting dates and locations from news articles).

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In visual examples (like those from Europeana Newspapers), NER systems often highlight text with color-coded tags such as <code>LOCATION</code>, <code>TIME</code>, <code>PERSON</code>, <code>ORGANIZATION</code>, <code>MONEY</code>, and <code>DATE</code>. </div>
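
One way to picture NER output is as a list of labeled character spans over the original text. The sketch below is a plain-Python illustration of that idea: the spans are hand-written rather than produced by a model, `annotate` is a hypothetical helper name, and the `[text](LABEL)` notation is only a stand-in for the color-coded highlighting described above. It assumes the spans do not overlap.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span of `text` in [..](LABEL) markers."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])             # untouched text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the tagged entity itself
        cursor = end
    pieces.append(text[cursor:])                      # remaining text after the last entity
    return "".join(pieces)


text = "Ruth Reichl visited MOMA in New York."

# Hand-written entities for illustration; a real NER model would supply these.
entities = [("Ruth Reichl", "PERSON"), ("MOMA", "ORGANIZATION"), ("New York", "LOCATION")]
spans = [(text.index(name), text.index(name) + len(name), label) for name, label in entities]

print(annotate(text, spans))
```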

***

<a id="section-2"></a>
<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. NLTK and the Stanford CoreNLP Library</span><br>

The Natural Language Toolkit (NLTK) is a long-established Python library for NLP. It can interface with the **Stanford CoreNLP** library for entity recognition, and it also ships its own built-in chunker, `ne_chunk`, which the examples in this section use.

### The Stanford CoreNLP Library
*   **Integration**: It is integrated into Python via `nltk`.
*   **Architecture**: It is Java-based.
*   **Capabilities**:
    *   Named Entity Recognition (NER).
    *   Coreference resolution.
    *   Dependency tree parsing.

### Using NLTK for NER
To perform NER in NLTK, a standard pipeline is usually followed:
1.  **Tokenize**: Split the sentence into words.
2.  **POS Tag**: Assign Part-of-Speech tags (e.g., Noun, Verb) to each token.
3.  **Chunk**: Use `ne_chunk` to identify named entities based on the tags.

#### Step 1: Tokenization and POS Tagging

**Original Code (from PDF):**


```python
import nltk

sentence = '''In New York, I like to ride the Metro to
visit MOMA and some restaurants rated
well by Ruth Reichl.'''

tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]
```

```
[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]
```


**Enhanced Executable Code:**


```python
import nltk

# Ensure necessary NLTK models are downloaded.
# Newer NLTK releases (3.9+) may instead ask for 'punkt_tab',
# 'averaged_perceptron_tagger_eng', and 'maxent_ne_chunker_tab';
# download whichever resource names the LookupError reports.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Define the sentence
sentence = '''In New York, I like to ride the Metro to
visit MOMA and some restaurants rated
well by Ruth Reichl.'''

# 1. Tokenize the sentence
tokenized_sent = nltk.word_tokenize(sentence)

# 2. Part-of-Speech (POS) Tagging
tagged_sent = nltk.pos_tag(tokenized_sent)

# Display the first 3 tagged tokens
# Expected Output: [('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]
print(tagged_sent[:3])
```

```
[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]
```





#### Step 2: Named Entity Chunking
Once the text is tagged, we use `nltk.ne_chunk` to classify the entities.

**Original Code (from PDF):**


```python
print(nltk.ne_chunk(tagged_sent))
```

```
(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)
```



**Enhanced Executable Code:**


```python
# 3. Perform named entity chunking
ne_tree = nltk.ne_chunk(tagged_sent)

# Print the tree structure
print(ne_tree)

# Explanation of Output Tags:
# GPE: Geo-Political Entity (e.g., New York)
# ORGANIZATION: Organizations (e.g., Metro, MOMA)
# PERSON: People (e.g., Ruth Reichl)
```

```
(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)
```



**Expected Output Analysis:**
The output is a tree structure where entities are grouped.
*   `(GPE New/NNP York/NNP)` indicates "New York" is a Geo-Political Entity.
*   `(ORGANIZATION Metro/NNP)` indicates "Metro" is an Organization.
*   `(PERSON Ruth/NNP Reichl/NNP)` indicates a Person.
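
To work with the chunker's result as data rather than reading the printed tree, one option is to walk the top level of the tree and collect the labeled subtrees. The sketch below assumes the object returned by `nltk.ne_chunk` is an `nltk.Tree` in which entity chunks expose `label()` and `leaves()`, while ordinary tokens are plain `(word, tag)` tuples. `extract_entities` is a hypothetical helper name, not part of NLTK.

```python
def extract_entities(tree):
    """Return (entity_text, label) pairs from a chunk tree such as nltk.ne_chunk's result."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees with a label(); plain tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            words = [word for word, _tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities
```

For the `ne_tree` printed above, `extract_entities(ne_tree)` would be expected to return the four chunks shown in the tree: `('New York', 'GPE')`, `('Metro', 'ORGANIZATION')`, `('MOMA', 'ORGANIZATION')`, and `('Ruth Reichl', 'PERSON')`.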

***

<a id="section-3"></a>
<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Introduction to SpaCy</span><br>

SpaCy is a modern, industrial-strength NLP library. Like `gensim`, it handles NLP tasks, but its implementation and focus differ significantly.

### What is SpaCy?
*   **Pipeline Focus**: SpaCy focuses on creating NLP pipelines to generate models and corpora efficiently.
*   **Open-source**: It is open-source and comes with extra libraries and tools.
*   **Visualization**: Includes **displaCy**, a built-in entity recognition visualizer.
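
As a rough sketch of what displaCy produces, the example below renders two hand-written entity spans to HTML. It uses displaCy's manual mode, so no trained model is needed. The sentence, the `span` helper, and the entity labels are illustrative assumptions rather than model output.

```python
from spacy import displacy

text = "Berlin is the capital of Germany."


def span(name, label):
    """Build a displaCy entity dict for the first occurrence of `name` in `text`."""
    start = text.index(name)
    return {"start": start, "end": start + len(name), "label": label}


# Hand-written spans for illustration; normally these come from a loaded model's doc.ents.
example = {
    "text": text,
    "ents": [span("Berlin", "GPE"), span("Germany", "GPE")],
    "title": None,
}

# manual=True renders the dict as given; jupyter=False returns the HTML as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:200])
```

In a notebook, passing a processed `doc` directly, as in `displacy.render(doc, style="ent")`, displays the highlighted text inline.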

### Why use SpaCy for NER?
<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> SpaCy is particularly effective for informal language corpora, such as Tweets and chat messages. </div>

1.  **Easy pipeline creation**: Streamlined API for processing text.
2.  **Different entity types**: Offers a different set of entity labels compared to NLTK.
3.  **Informal Language**: Robust performance on social media text.
4.  **Quickly growing**: Active community and frequent updates.

### Using SpaCy for NER

**Original Code (from PDF):**


```python
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("""Berlin is the capital of Germany;
and the residence of Chancellor Angela Merkel.""")
doc.ents
print(doc.ents[0], doc.ents[0].label_)
```

```
Berlin GPE
```



**Enhanced Executable Code:**


```python
import spacy

# NOTE: You must download the model first in your terminal:
# python -m spacy download en_core_web_sm

# Load the pre-trained English model
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print("Model not found. Please run: python -m spacy download en_core_web_sm")
    # Re-raise so the cell stops here instead of failing later with a NameError on `nlp`
    raise

# Process the text
doc = nlp("""Berlin is the capital of Germany;
and the residence of Chancellor Angela Merkel.""")

# Accessing Entities
# doc.ents returns a tuple of named entities found in the document
print("Entities found:", doc.ents)

# Inspecting the first entity
# Expected: Berlin GPE
if doc.ents:
    print(f"First Entity: {doc.ents[0]}")
    print(f"Label: {doc.ents[0].label_}")

# Iterating through all entities to see their labels
print("\n--- All Entities ---")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

```
Entities found: (Berlin, Germany, Angela Merkel)
First Entity: Berlin
Label: GPE

--- All Entities ---
Berlin: GPE
Germany: GPE
Angela Merkel: PERSON
```



**Output Explanation:**
*   **Berlin**: `GPE` (Geo-Political Entity)
*   **Germany**: `GPE`
*   **Angela Merkel**: `PERSON`

***

<a id="section-4"></a>
<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Multilingual NER with Polyglot</span><br>

Polyglot is an NLP library that relies heavily on word vectors and is designed for multilingual applications.

### What is Polyglot?
*   **Word Vectors**: Uses word embeddings to understand context and meaning.
*   **Language Support**: It has vectors for **more than 130 languages**.
*   **Why use it?**: It is the go-to library when working with non-English text or when you need to support many different languages simultaneously.

### Spanish NER Example
The following example demonstrates extracting entities from a Spanish text about Carles Puigdemont and Manuela Carmena.

**Original Code (from PDF):**


```python
from polyglot.text import Text

text = """El presidente de la Generalitat de Cataluña,
Carles Puigdemont, ha afirmado hoy a la alcaldesa
de Madrid, Manuela Carmena, que en su etapa de
alcalde de Girona (de julio de 2011 a enero de 2016)
hizo una gran promoción de Madrid."""

ptext = Text(text)
ptext.entities
```

```python
# Alternative to Polyglot: Using spaCy for Spanish NER
# polyglot is not compatible with Python 3.13/Windows

import spacy

# Load the Spanish model
nlp_es = spacy.load('es_core_news_sm')

text = """El presidente de la Generalitat de Cataluña,
Carles Puigdemont, ha afirmado hoy a la alcaldesa
de Madrid, Manuela Carmena, que en su etapa de
alcalde de Girona (de julio de 2011 a enero de 2016)
hizo una gran promoción de Madrid."""

doc = nlp_es(text)

print("Entities found:")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

```
Entities found:
Generalitat de Cataluña: LOC
Carles Puigdemont: PER
Madrid: LOC
Manuela Carmena: PER
Girona: LOC
Madrid: LOC
```



**Enhanced Executable Code:**


```python
# NOTE: Polyglot requires 'libicu' and specific model downloads.
# Installation: pip install polyglot pyicu pycld2 Morfessor
# Model download: polyglot download embeddings2.es ner2.es

try:
    from polyglot.text import Text
    
    # Spanish text input
    text = """El presidente de la Generalitat de Cataluña,
    Carles Puigdemont, ha afirmado hoy a la alcaldesa
    de Madrid, Manuela Carmena, que en su etapa de
    alcalde de Girona (de julio de 2011 a enero de 2016)
    hizo una gran promoción de Madrid."""

    # Create Polyglot Text object
    ptext = Text(text)

    # Extract entities
    print("Entities found:")
    for entity in ptext.entities:
        print(entity)

except ImportError:
    print("Polyglot is not installed or dependencies (libicu) are missing.")
    print("This code block requires a specific environment setup.")

# Expected Output Structure based on PDF:
# I-ORG(['Generalitat', 'de'])
# I-LOC(['Generalitat', 'de', 'Cataluña'])
# I-PER(['Carles', 'Puigdemont'])
# I-LOC(['Madrid'])
# I-PER(['Manuela', 'Carmena'])
# I-LOC(['Girona'])
# I-LOC(['Madrid'])
```

```
Polyglot is not installed or dependencies (libicu) are missing.
This code block requires a specific environment setup.
```
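
Each item in `ptext.entities` prints as a tag followed by a word list, as in the expected output listed in the cell above. A small sketch for turning those items into plain `(tag, text)` pairs follows. It assumes each entity object exposes a `.tag` attribute and iterates over its words, which matches the printed form but should be checked against the installed Polyglot version. `entities_to_pairs` is a hypothetical helper name.

```python
def entities_to_pairs(entities):
    """Convert chunk-like entities (a .tag plus an iterable of words) into (tag, text) pairs."""
    # Assumption: each entity has a `.tag` string and yields its words when iterated.
    return [(entity.tag, " ".join(entity)) for entity in entities]
```

Called as `entities_to_pairs(ptext.entities)`, this would be expected to give pairs such as `('I-PER', 'Carles Puigdemont')` and `('I-LOC', 'Madrid')` for the Spanish text.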



***

<a id="section-5"></a>
<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Conclusion</span><br>

This notebook covered the essentials of Named Entity Recognition (NER) in Python using three distinct libraries, each with its own strengths:

1.  **NLTK**:
    *   **Best for**: Education, understanding the underlying mechanics of NLP (tokenization -> tagging -> chunking).
    *   **Pros**: Highly granular control, integrates with Stanford CoreNLP.
    *   **Cons**: Can be slower and more verbose than modern alternatives.

2.  **SpaCy**:
    *   **Best for**: Production environments, building efficient pipelines.
    *   **Pros**: Fast, easy to use, excellent visualization (displaCy), supports informal language.
    *   **Cons**: Less transparent than NLTK regarding internal model decisions.

3.  **Polyglot**:
    *   **Best for**: Multilingual applications.
    *   **Pros**: Massive language support (130+), relies on word vectors.
    *   **Cons**: Installation can be complex due to system dependencies (`libicu`).
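
The outputs in this notebook also show that the libraries name similar entity types differently: `GPE`, `ORGANIZATION`, and `PERSON` from NLTK, `LOC` and `PER` from the Spanish spaCy model, and `I-LOC`, `I-ORG`, and `I-PER` in the Polyglot expected output. When combining results from several libraries, one possible approach is a small lookup table. The mapping below is a hypothetical sketch, not a standard scheme, and folding `GPE` into `LOCATION` is a simplification made only for this example.

```python
# Hypothetical mapping from the labels seen in this notebook to one shared scheme.
# Folding GPE into LOCATION is a simplification made for this sketch.
COMMON_LABELS = {
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "I-LOC": "LOCATION",
    "PERSON": "PERSON",
    "PER": "PERSON",
    "I-PER": "PERSON",
    "ORGANIZATION": "ORGANIZATION",
    "I-ORG": "ORGANIZATION",
}


def normalize_label(label):
    """Map a library-specific label to the shared scheme; unknown labels pass through unchanged."""
    return COMMON_LABELS.get(label, label)


for raw in ["GPE", "PER", "I-ORG", "DATE"]:
    print(raw, "->", normalize_label(raw))
```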

### Next Steps
*   **Practice**: Try running the SpaCy code on your own social media data (e.g., a tweet history).
*   **Explore**: Use the `displacy` visualizer to see how the model parses complex sentences.
*   **Expand**: Attempt to train a custom NER model if the pre-trained models do not recognize specific entities relevant to your domain (e.g., medical drugs or specific product names).
