### Named Entity Recognition (NER) & Information Extraction

#### 1. What problem does the practical solve?
News articles are unstructured text.

Ex. Apple CEO Tim Cook visited India in April 2024 to meet Prime Minister Narendra Modi.

For humans:
- Person -> Tim Cook, Narendra Modi
- Organization -> Apple 
- Location -> India
- Date -> April 2024

For machines:
- Just a sequence of words with no attached meaning

Goal:
Convert unstructured text into structured information.

Threat intelligence platforms, search engines, chatbots, and analytics systems all rely on this kind of conversion.

#### 2. Core Concept
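2.1 A named entity is a real-world object that can be identified by name, such as a specific person, organization, or place. NER label sets usually also include dates, times, monetary amounts, and percentages.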

Types:
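- PERSON -> names of people
- ORG -> organizations
- GPE / LOC -> places (GPE: countries, cities, states; LOC: other locations such as mountain ranges and bodies of water)
- DATE / TIME -> temporal expressions
- MONEY, PERCENT, etc.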

2.2 Named entity recognition: 

NER is the process of locating and classifying named entities in text into predefined categories such as Person, Organization, Location, Date, etc.
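To make "locating and classifying" concrete, the sketch below represents each entity as a character span plus a label. The annotations are written by hand for the example sentence (no model is involved), and the names `sentence`, `annotations`, and `locate` are illustrative assumptions.

```python
# Hand-written annotations for the example sentence (assumed, not model output).
sentence = "Apple CEO Tim Cook visited India in April 2024."
annotations = [
    ("Apple", "ORG"),
    ("Tim Cook", "PERSON"),
    ("India", "GPE"),
    ("April 2024", "DATE"),
]


def locate(text, pairs):
    """Return (start, end, label) character spans for each annotated phrase."""
    spans = []
    for phrase, label in pairs:
        start = text.find(phrase)  # first occurrence only; -1 if absent
        if start == -1:
            continue  # phrase not present in the text, so nothing to locate
        spans.append((start, start + len(phrase), label))
    return spans


spans = locate(sentence, annotations)
for start, end, label in spans:
    print(f"{sentence[start:end]!r:14} [{start}:{end}] -> {label}")
```

"Locating" corresponds to the `start` and `end` offsets, and "classifying" corresponds to the label attached to each span.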

2.3 Information Extraction (IE):

NER alone only finds entities; information extraction goes one step further.

IE: 
- Extracts structured facts from text
- Converts text -> database-like records

Ex. Text: "Apple CEO Tim Cook visited India in April 2024."
- Organization: Apple
- Person: Tim Cook
- Location: India
- Date: April 2024

This is machine-usable intelligence
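One way to picture a "database-like record" is a dictionary keyed by entity type. The sketch below groups hand-written (text, label) pairs for the sentence above; `to_record` is a hypothetical helper for illustration, not part of any library.

```python
from collections import defaultdict


def to_record(entities):
    """Group (text, label) pairs into a {label: [texts, ...]} dictionary."""
    record = defaultdict(list)
    for text, label in entities:
        record[label].append(text)
    return dict(record)


# Hand-written pairs matching the example above (assumed, not model output).
entities = [
    ("Apple", "Organization"),
    ("Tim Cook", "Person"),
    ("India", "Location"),
    ("April 2024", "Date"),
]

record = to_record(entities)
print(record)
```

A record in this shape can be queried by field (for example `record["Person"]`), which is not possible with the raw sentence.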

#### 3. High-Level Algorithm
Input text -> Preprocessing -> Tokenization -> Contextual Analysis -> NER Model Prediction -> Entity Label Assignment -> Information Extraction -> Frequency Analysis -> Visualization

#### 4. Practical Implementation

##### Aim: To identify named entities such as Person, ORG, Location, Date from unstructured text and convert them into structured data.

Why This Practical Matters:
- Real-world data (news, blogs, reports) is unstructured 
- Machines need structured information 
- NER is the foundation of information extraction systems 


##### Step 1: Input Text (News Article)
- We take a realistic news-style paragraph
- This mimics real-world data used in NLP applications

In [2]:
text = """
Apple CEO Tim Cook visited India in April 2024.
He met Prime Minister Narendra Modi in New Delhi to discuss technology investments.
"""

print("RAW NEWS TEXT:\n", text)

RAW NEWS TEXT:
 
Apple CEO Tim Cook visited India in April 2024.
He met Prime Minister Narendra Modi in New Delhi to discuss technology investments.



##### Step 2: Why Preprocessing is Minimal in NER 
Important Theory:

Unlike sentiment analysis or classification, NER depends heavily on:
- Capitalization 
- Proper noun structure 

So we do not:
- lowercase text
- remove punctuation aggressively 

Because:
- "Apple" ≠ "apple"
- Capital letters help to identify named entities 
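The effect of lowercasing can be seen with a deliberately crude heuristic: collect the words that start with an uppercase letter, before and after lowercasing. This is only an illustration of the lost signal, not how a trained model uses capitalization; `capitalized_words` is a hypothetical helper.

```python
sentence = "Apple CEO Tim Cook visited India in April 2024."


def capitalized_words(text):
    """Return words whose first character is uppercase (a crude proper-noun cue)."""
    return [word.strip(".,") for word in text.split() if word[:1].isupper()]


original = capitalized_words(sentence)
lowered = capitalized_words(sentence.lower())

print("Original :", original)
print("Lowercase:", lowered)
```

After lowercasing, the heuristic finds nothing, which is why the text is passed to the NER model with its original casing.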

##### Step 3: Load NLP Model (spaCy)
spaCy provides pre-trained NER models.

How It Works Internally:
- Uses word embeddings (vector representation)
- Uses context windows 
- Uses neural networks (a CNN in `en_core_web_sm`; transformer-based in `en_core_web_trf`)

Model learns patterns like:
- "Mr. X" -> Person
- "Ltd.", "Inc." -> ORG
- Capitalized nouns -> possible entities 
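The patterns listed above can be imitated with hand-written rules, which helps show what kind of cues are involved. The sketch below is an assumption-laden toy: the regular expressions, the sample sentence, and `rule_based_entities` are all made up for illustration, and a statistical model learns much softer, context-dependent versions of such cues from annotated data rather than applying fixed rules.

```python
import re

# "Mr. X" style cue: a title followed by one or more capitalized words.
PERSON_TITLE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")
# "Ltd." / "Inc." cue: capitalized words followed by a company suffix.
ORG_SUFFIX = re.compile(r"\b((?:[A-Z][A-Za-z]+\s+)+(?:Ltd|Inc)\.)")


def rule_based_entities(text):
    """Return (entity_text, label) pairs found by the two toy rules."""
    found = [(m.group(1), "PERSON") for m in PERSON_TITLE.finditer(text)]
    found += [(m.group(1), "ORG") for m in ORG_SUFFIX.finditer(text)]
    return found


# Invented sample sentence, chosen so that both rules fire.
sample = "Mr. Smith joined Acme Widgets Inc. after leaving Globex Ltd. last year."
print(rule_based_entities(sample))
```

Rules like these break quickly on real text (no title, no suffix, unusual casing), which is the main reason to prefer a trained model.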

In [1]:
import spacy 
nlp = spacy.load("en_core_web_sm")

##### Step 4: Tokenization + Contextual Analysis 
spaCy processes text and creates a Doc object.

Internally:
- Sentence segmentation
- Tokenization
- POS tagging 
- Dependency parsing 

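These steps all run when `nlp(text)` is called. Note that in `en_core_web_sm` the NER component does not consume the POS tags or the dependency parse; it computes its own contextual token representations from the tokenized text.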
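As a rough picture of the first two steps, the sketch below splits the news text into sentences and tokens with regular expressions. It is a simplified stand-in written for illustration: spaCy's tokenizer and sentence segmenter use language-specific rules and a trained parser and will behave differently on harder input. The helper names are hypothetical.

```python
import re

text = (
    "Apple CEO Tim Cook visited India in April 2024. "
    "He met Prime Minister Narendra Modi in New Delhi to discuss technology investments."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
for sentence_tokens in tokens:
    print(sentence_tokens)
```

Note that the tokens keep their original casing and that punctuation becomes its own token, both of which matter for the NER step that follows.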

In [3]:
doc = nlp(text)

##### Step 5: Named Entity Recognition (NER)
What happens here:

The model looks at:
- Word itself
- Neighbouring words 
- Capitalization 
- Sentence structure 

And predicts:
- entity span + entity label 
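One common way to express "entity span + entity label" at the token level is BIO tagging: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below collapses hand-written BIO tags into spans. It illustrates the output format only and is not a description of how spaCy decodes entities internally; `bio_to_spans` and the tag sequence are assumptions.

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open entity):
            # close the open entity, if any, and treat this token as outside.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written tags for the first sentence (assumed, not model output).
tokens = ["Apple", "CEO", "Tim", "Cook", "visited", "India", "in", "April", "2024", "."]
tags = ["B-ORG", "O", "B-PERSON", "I-PERSON", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]

entity_spans = bio_to_spans(tokens, tags)
print(entity_spans)
```

Multi-token entities such as "Tim Cook" come out as a single span because the `I-PERSON` tag extends the span opened by `B-PERSON`.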

In [4]:
print("\nNamed Entities Found:\n")

for ent in doc.ents:
    print(
        "Entity Text:", ent.text,
        "| Entity Label:", ent.label_
    )


Named Entities Found:

Entity Text: Apple | Entity Label: ORG
Entity Text: Tim Cook | Entity Label: PERSON
Entity Text: India | Entity Label: GPE
Entity Text: April 2024 | Entity Label: DATE
Entity Text: Narendra Modi | Entity Label: PERSON
Entity Text: New Delhi | Entity Label: GPE


##### Step 6: Information Extraction (Structured Data)
Now we convert the entities into a structured format.

This is the core of information extraction.

In [5]:
structured_data = []

for ent in doc.ents:
    structured_data.append({
        "entity": ent.text,
        "label": ent.label_
    })

print("\nStructured Information: ")
for item in structured_data:
    print(item)


Structured Information: 
{'entity': 'Apple', 'label': 'ORG'}
{'entity': 'Tim Cook', 'label': 'PERSON'}
{'entity': 'India', 'label': 'GPE'}
{'entity': 'April 2024', 'label': 'DATE'}
{'entity': 'Narendra Modi', 'label': 'PERSON'}
{'entity': 'New Delhi', 'label': 'GPE'}
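Once the entities are a list of dictionaries, they can be handed to other systems in standard formats. The sketch below serializes the records to JSON and CSV using only the standard library. The `structured_data` list is copied from the output above so that the example runs without a model; writing to an in-memory buffer instead of a file is a choice made for the example.

```python
import csv
import io
import json

# Copied from the printed output above so the sketch runs on its own.
structured_data = [
    {"entity": "Apple", "label": "ORG"},
    {"entity": "Tim Cook", "label": "PERSON"},
    {"entity": "India", "label": "GPE"},
    {"entity": "April 2024", "label": "DATE"},
    {"entity": "Narendra Modi", "label": "PERSON"},
    {"entity": "New Delhi", "label": "GPE"},
]

# JSON: convenient for APIs and document stores.
json_text = json.dumps(structured_data, indent=2)

# CSV: convenient for spreadsheets and relational tables.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["entity", "label"], lineterminator="\n")
writer.writeheader()
writer.writerows(structured_data)
csv_text = buffer.getvalue()

print(csv_text)
```

To write to disk instead, the same `DictWriter` can be pointed at a file opened with `newline=""`.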


##### Step 7: Count Frequency Of Each Entity Type
Why:
- To understand dominant entity types 
- Useful in analytics and trend detection 

In [6]:
from collections import Counter 

entity_labels = [ent.label_ for ent in doc.ents]
freq = Counter(entity_labels)

print("\nEntity Type Frequency:")
for label, count in freq.items():
    print(label, ":", count)


Entity Type Frequency:
ORG : 1
PERSON : 2
GPE : 2
DATE : 1
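For trend detection it often helps to rank the entity types and look at their share of the total. The sketch below does this with `Counter.most_common`; the `entity_labels` list is copied from the entities printed earlier so that the example runs without a model.

```python
from collections import Counter

# Labels in the order the entities were printed above.
entity_labels = ["ORG", "PERSON", "GPE", "DATE", "PERSON", "GPE"]

freq = Counter(entity_labels)
total = sum(freq.values())

# most_common() returns (label, count) pairs sorted by count, highest first.
ranked = freq.most_common()
for label, count in ranked:
    print(f"{label:<7} {count}  ({count / total:.0%})")
```

On a single short article the counts are too small to mean much; the same code becomes informative when the labels are collected across many documents.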


##### Step 8: Visualization (NER)
Why Visualization:
- Helps humans verify model output 
- Useful in demos and reports 


In [7]:
from spacy import displacy

displacy.render(doc, style="ent")
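Inside a notebook, `displacy.render` draws the text with highlighted entities. Where HTML output is inconvenient (a terminal or a log file, for example), a plain-text rendering can serve the same verification purpose. The sketch below is a hand-rolled stand-in, not part of displaCy: `annotate` is a hypothetical helper, and the character offsets are written by hand for the first sentence. With a real `Doc`, they would come from `ent.start_char` and `ent.end_char`.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span as [text|LABEL]; spans must not overlap."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # untouched text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")   # the marked-up entity
        cursor = end
    pieces.append(text[cursor:])                        # remainder after the last entity
    return "".join(pieces)


sentence = "Apple CEO Tim Cook visited India in April 2024."
# Hand-computed character offsets (assumed, not model output).
spans = [(0, 5, "ORG"), (10, 18, "PERSON"), (27, 32, "GPE"), (36, 46, "DATE")]

marked = annotate(sentence, spans)
print(marked)
```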

##### Step 9: Final Understanding 
Input: Unstructured news article 

Process:
- spaCy NLP pipeline -> NER model 

Output:
- Structured named entities 
- Entity labels 
- Frequency counts 