# 📘 Day 2: Advanced NLP Techniques

**🎯 Goal:** Master NER, POS tagging, and deep learning for text classification

**⏱️ Time:** 90-120 minutes

**🌟 Why This Matters for AI:**
- Named Entity Recognition powers chatbots, search engines, and information extraction
- POS tagging exposes grammar and sentence structure, which LLMs now learn implicitly from raw text
- Deep learning text classification is used in content moderation, spam detection, and intent recognition
- These techniques are conceptual precursors of ChatGPT, Claude, and other modern AI systems
- Real-world applications: news categorization, sentiment analysis, customer support automation

---

## 🧠 What are Advanced NLP Techniques?

Beyond simple text preprocessing, we need to **understand** the structure and meaning of text!

### 🎯 Today's Focus:

#### 1️⃣ **Named Entity Recognition (NER)**
- Extract names, locations, organizations, dates
- **Example**: "Apple CEO Tim Cook visited Paris" → [Apple: ORG, Tim Cook: PERSON, Paris: GPE]
- **Used in**: Chatbots, search engines, knowledge graphs

#### 2️⃣ **Part-of-Speech (POS) Tagging**
- Identify grammatical roles (noun, verb, adjective)
- **Example**: "I love AI" → [I: PRON, love: VERB, AI: NOUN]
- **Used in**: Grammar checking, text-to-speech, linguistic feature extraction

#### 3️⃣ **Deep Learning Text Classification**
- Use neural networks, which often beat bag-of-words models when enough training data is available
- **Methods**: Embedding layers, LSTMs, CNNs for text
- **Used in**: Content moderation, intent classification, sentiment analysis

### 🔑 Why This Matters for 2024-2025 AI:

**🤖 ChatGPT & Claude:**
- Pick out entities in conversations (learned end to end, not with a separate NER module)
- Track sentence structure implicitly, without explicit POS tags
- Infer user intent to choose an appropriate response

**🔍 RAG Systems:**
- Extract entities from documents for better indexing
- Classify document types for retrieval
- Understand query intent

**📰 Real Applications:**
- News aggregation and categorization
- Customer review sentiment analysis
- Social media content moderation
- Email routing and prioritization

In [None]:
# Install required libraries
import sys
!{sys.executable} -m pip install nltk spacy scikit-learn tensorflow numpy pandas matplotlib seaborn --quiet
# Download spaCy model
!{sys.executable} -m spacy download en_core_web_sm --quiet

print("✅ Libraries installed!")

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# NLTK
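import nltk
# Resource names changed in newer NLTK releases (3.9+ looks for the '_tab' / '_eng'
# variants), so request both sets; a name your version doesn't know is simply not installed
for resource in ['punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                 'maxent_ne_chunker', 'maxent_ne_chunker_tab',
                 'words', 'stopwords']:
    nltk.download(resource, quiet=True)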

from nltk import pos_tag, ne_chunk, word_tokenize
from nltk.corpus import stopwords

# spaCy (modern NLP library)
import spacy
nlp = spacy.load('en_core_web_sm')

# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# TensorFlow/Keras for deep learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, GlobalMaxPooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print("📚 Libraries loaded!")
print(f"TensorFlow version: {tf.__version__}")
print(f"spaCy version: {spacy.__version__}")

## 🏷️ Part 1: Named Entity Recognition (NER)

**Named Entity Recognition** = Identifying and classifying named entities in text

### 🎯 Common Entity Types:

- **PERSON**: Names of people (Tim Cook, Elon Musk)
- **ORG**: Organizations (Apple, Google, Microsoft)
- **GPE**: Geo-political entities (New York, USA, Paris)
- **DATE**: Dates and times (January 2024, next week)
- **MONEY**: Monetary values ($100, €50)
- **PRODUCT**: Products and objects (iPhone, ChatGPT)

### 🎯 Why NER Matters:

**🔍 Information Extraction:**
- Extract structured data from unstructured text
- Build knowledge graphs (Google Knowledge Graph)
- Populate databases automatically

**💬 Chatbots & Virtual Assistants:**
- "Book a flight to Paris" → Extract: Paris (GPE)
- "Set meeting with John tomorrow" → Extract: John (PERSON), tomorrow (DATE)

**📰 News & Content Analysis:**
- Track mentions of companies, people, locations
- Generate article summaries
- Create topic clusters

**🔒 Privacy & Compliance:**
- Identify and mask PII (personally identifiable information)
- GDPR compliance
- Data anonymization
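
The PII-masking use case can be sketched without any NLP library, assuming the entity spans are already available as character offsets (spaCy exposes these as `ent.start_char`, `ent.end_char`, and `ent.label_`). The `mask_entities` helper and the hand-written spans below are hypothetical and only illustrate the idea:

```python
def mask_entities(text, spans, labels_to_mask=("PERSON", "GPE")):
    """Replace each (start, end, label) span whose label should be masked with [LABEL]."""
    masked = text
    # Work right-to-left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse=True):
        if label in labels_to_mask:
            masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "Set meeting with John in Paris tomorrow"
# Hand-written spans standing in for NER output (assumption: a real model would supply these)
spans = [(17, 21, "PERSON"), (25, 30, "GPE"), (31, 39, "DATE")]

print(mask_entities(text, spans))
```

Only the labels listed in `labels_to_mask` are replaced, so the DATE span is left readable. A production system would also need to handle entities the model misses.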

### NER with NLTK (Basic)

In [None]:
# Example text about tech companies
text_ner = """
Apple CEO Tim Cook announced the new iPhone 15 will be released in September 2023.
The company, based in Cupertino, California, expects to sell 50 million units.
Meanwhile, Google and OpenAI are competing in the AI race with Gemini and ChatGPT.
Elon Musk's xAI raised $6 billion in funding from investors in Silicon Valley.
"""

print("📝 Original Text:")
print(text_ner)
print("\n" + "="*80 + "\n")

# Tokenize and tag
tokens = word_tokenize(text_ner)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)

print("🏷️ Named Entities (NLTK):")
for entity in named_entities:
    if hasattr(entity, 'label'):
        print(f"{entity.label()}: {' '.join([word for word, tag in entity])}")

### NER with spaCy (Modern & Better!)

In [None]:
# Process text with spaCy
doc = nlp(text_ner)

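print("🏷️ Named Entities (spaCy):\n")
print(f"{'Entity':<20} {'Type':<15} {'Description':<30}")
print("="*70)

for ent in doc.ents:
    # spacy.explain returns None for labels it doesn't know, so fall back to 'N/A'
    description = spacy.explain(ent.label_) or 'N/A'
    print(f"{ent.text:<20} {ent.label_:<15} {description:<30}")

print("\n💡 spaCy's model usually recognizes more entity types than NLTK's chunker, but check the output for mistakes!")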

In [None]:
# Visualize entities
from spacy import displacy

print("🎨 Visual NER (Entity Highlighting):\n")

# Colors for the entity types we want to highlight
colors = {'PERSON': '#aa9cfc', 'ORG': '#7aecec', 'GPE': '#feca74', 'DATE': '#ff9561', 'MONEY': '#9cc9cc'}
options = {'ents': ['PERSON', 'ORG', 'GPE', 'DATE', 'MONEY', 'PRODUCT'], 'colors': colors}

# jupyter=False returns the markup as a string instead of displaying it;
# in a notebook, pass jupyter=True (or omit the argument) to render the highlights inline
html = displacy.render(doc, style='ent', options=options, jupyter=False)
print(f"Generated {len(html)} characters of HTML for entity highlighting.")

# Extract entities by type
entities_by_type = {}
for ent in doc.ents:
    entities_by_type.setdefault(ent.label_, []).append(ent.text)

print("\n📊 Entities Grouped by Type:\n")
for entity_type, entities in entities_by_type.items():
    # dict.fromkeys removes duplicates while keeping first-seen order
    print(f"{entity_type}: {', '.join(dict.fromkeys(entities))}")

## 🔤 Part 2: Part-of-Speech (POS) Tagging

**POS Tagging** = Labeling each word with its grammatical role

### 🎯 Common POS Tags:

These are Penn Treebank tags, as returned by NLTK's `pos_tag` and spaCy's `token.tag_`. spaCy's `token.pos_` uses the coarser Universal POS set instead (NOUN, VERB, ADJ, ...).

- **NN**: Noun (dog, computer, AI)
- **VB**: Verb (run, learn, process)
- **JJ**: Adjective (beautiful, fast, intelligent)
- **RB**: Adverb (quickly, very, extremely)
- **PRP**: Pronoun (I, you, he, she)
- **DT**: Determiner (the, a, an)
- **IN**: Preposition or subordinating conjunction (in, on, at, with)

### 🎯 Why POS Tagging Matters:

**📝 Text Understanding:**
- Understand sentence structure
- Disambiguate word meanings
- Example: "I fish" (verb) vs "a fish" (noun)

**🔍 Information Extraction:**
- Extract noun phrases (products, features)
- Find adjectives (sentiment indicators)
- Identify actions (verbs)

**🤖 LLM Training:**
- Modern LLMs are pre-trained on raw text without POS labels
- They pick up grammar implicitly from next-token prediction
- POS tagging is still used to probe what such models have learned

**💬 Grammar Checking:**
- Grammarly, Microsoft Word
- Detect grammatical errors
- Suggest corrections
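
To make the "I fish" vs "a fish" disambiguation concrete, here is a deliberately tiny sketch: a lexicon lookup plus one context rule. The `LEXICON` and `toy_pos_tag` names are made up for this example. Real taggers such as NLTK's and spaCy's are statistical models trained on annotated corpora, not hand-written rules like these.

```python
# Toy lexicon: ambiguous words list every part of speech they can take
LEXICON = {
    "i": ["PRON"],
    "a": ["DET"],
    "the": ["DET"],
    "fish": ["NOUN", "VERB"],
    "love": ["VERB", "NOUN"],
}


def toy_pos_tag(words):
    """Tag words with a lexicon lookup plus one context rule for ambiguous words."""
    tags = []
    for word in words:
        options = LEXICON.get(word.lower(), ["NOUN"])  # unknown words default to NOUN
        if len(options) == 1:
            tags.append(options[0])
            continue
        previous = tags[-1] if tags else None
        if previous == "DET" and "NOUN" in options:
            tags.append("NOUN")   # "a fish": a determiner is followed by a noun
        elif previous == "PRON" and "VERB" in options:
            tags.append("VERB")   # "I fish": a subject pronoun is followed by a verb
        else:
            tags.append(options[0])
    return list(zip(words, tags))


print(toy_pos_tag(["I", "fish"]))
print(toy_pos_tag(["a", "fish"]))
```

The same word gets a different tag depending on its left neighbor, which is the core idea behind context-aware tagging.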

In [None]:
# Example sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "ChatGPT revolutionized natural language processing in 2024.",
    "I love building AI applications with Python."
]

print("🔤 Part-of-Speech Tagging:\n")
print("="*80)

for sentence in sentences:
    print(f"\n📝 Sentence: {sentence}\n")

    print(f"{'Word':<20} {'POS':<10} {'Tag (description)':<30}")
    print("-"*60)

    # spaCy gives a coarse tag (pos_) and a fine-grained Penn Treebank tag (tag_)
    doc = nlp(sentence)
    for token in doc:
        print(f"{token.text:<20} {token.pos_:<10} {token.tag_} ({spacy.explain(token.tag_) or 'N/A'})")

    print("="*80)

In [None]:
# Extract specific POS patterns
text_pos = "The amazing new iPhone 15 features incredible AI-powered camera capabilities that revolutionize mobile photography."

doc = nlp(text_pos)

# Extract nouns
nouns = [token.text for token in doc if token.pos_ == 'NOUN']
# Extract adjectives
adjectives = [token.text for token in doc if token.pos_ == 'ADJ']
# Extract verbs
verbs = [token.text for token in doc if token.pos_ == 'VERB']

print("📊 Extracting Words by POS:\n")
print(f"Original: {text_pos}\n")
print(f"Nouns (NOUN): {', '.join(nouns)}")
print(f"Adjectives (ADJ): {', '.join(adjectives)}")
print(f"Verbs (VERB): {', '.join(verbs)}")

print("\n💡 Use case: Extract product features (nouns) and sentiment indicators (adjectives)!")
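
The feature-and-sentiment use case can go one step further by pairing each adjective with the noun it directly precedes. This sketch works on hand-written `(word, tag)` pairs so it runs without spaCy; with spaCy you could build the same list from `(token.text, token.pos_)`. The `adjective_noun_pairs` helper is hypothetical.

```python
def adjective_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs where the adjective directly precedes the noun."""
    pairs = []
    for (word, pos), (next_word, next_pos) in zip(tagged_tokens, tagged_tokens[1:]):
        if pos == "ADJ" and next_pos == "NOUN":
            pairs.append((word, next_word))
    return pairs


# Hand-tagged example sentence (assumption: a tagger would normally produce these tags)
tagged = [
    ("The", "DET"), ("amazing", "ADJ"), ("camera", "NOUN"),
    ("takes", "VERB"), ("incredible", "ADJ"), ("photos", "NOUN"),
]

print(adjective_noun_pairs(tagged))
```

Adjacent-pair matching misses adjectives separated from their noun (for example "amazing new camera"), so a dependency parse would be the more robust choice for real reviews.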

## 🧠 Part 3: Deep Learning for Text Classification

**Deep Learning** takes text classification to the next level!

### 🎯 Why Deep Learning for NLP?

**Traditional ML (Day 1):**
- TF-IDF + Naive Bayes
- Simple, fast, interpretable
- Limited accuracy on complex tasks

**Deep Learning:**
- Learns representations automatically
- Captures context and semantics
- Higher accuracy on complex tasks, given enough training data
- Powers ChatGPT, Claude, BERT

### 🏗️ Text Classification Architecture:

```
Input Text: "I love this product!"
     ↓
1. Tokenization → [I, love, this, product]
     ↓
2. Integer Encoding → [34, 892, 12, 456]
     ↓
3. Embedding Layer → Dense vectors
     ↓
4. LSTM/CNN Layer → Learn patterns
     ↓
5. Dense Layer → Classification
     ↓
Output: POSITIVE (98% confidence)
```
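
Steps 1 and 2 of the pipeline, plus the padding that makes every input the same length, can be sketched in plain Python. This is a simplified stand-in for what Keras's `Tokenizer` and `pad_sequences` do later in this lesson: the hypothetical `build_vocab` and `encode` helpers assign indices in first-seen order and split on whitespace, whereas the Keras tokenizer orders words by frequency and strips punctuation.

```python
def build_vocab(texts, oov_token="<OOV>"):
    """Map each word to an integer; 0 is reserved for padding, 1 for unknown words."""
    vocab = {oov_token: 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(text, vocab, max_len, oov_token="<OOV>"):
    """Turn text into a fixed-length list of integers (truncate, then right-pad with 0)."""
    ids = [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = build_vocab(["i love this product", "i hate this product"])
print(vocab)

# "gadget" was never seen while building the vocabulary, so it maps to the OOV index
print(encode("i love this gadget", vocab, max_len=6))
```

Building the vocabulary from training texts only, as the later cells do with `fit_on_texts(X_train_news)`, keeps test-set words from leaking into the model.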

### 🎯 Key Components:

**1️⃣ Embedding Layer**
- Converts words to dense vectors
- Learned during training
- Similar to Word2Vec but task-specific

**2️⃣ LSTM Layer**
- Processes sequences (remembers context)
- Handles variable-length text
- Captures long-range dependencies

**3️⃣ Dense Layer**
- Final classification
- Outputs probabilities for each class
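
The three components can be traced by hand with a toy forward pass. Everything here is an illustration under stated assumptions: the weights are hand-picked, not learned, and mean pooling stands in for the LSTM to keep the sketch short, so this is not how an LSTM actually processes a sequence.

```python
import math

# Tiny hand-picked weights (assumption: a trained model would learn these)
EMBEDDINGS = {
    0: [0.0, 0.0],    # padding
    1: [0.0, 0.0],    # <OOV>
    2: [0.9, -0.2],   # "love"
    3: [-0.8, 0.1],   # "hate"
    4: [0.1, 0.0],    # "product"
}
DENSE_WEIGHTS = [[-2.0, 0.0], [2.0, 0.0]]  # one row per class: [negative, positive]
DENSE_BIAS = [0.0, 0.0]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def predict(token_ids):
    # 1. Embedding lookup, skipping padding (id 0)
    vectors = [EMBEDDINGS[i] for i in token_ids if i != 0]
    # 2. Mean pooling stands in for the LSTM here (a simplification)
    pooled = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    # 3. Dense layer: one weighted sum per class
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(DENSE_WEIGHTS, DENSE_BIAS)]
    # 4. Softmax turns the scores into class probabilities
    return softmax(logits)


probs = predict([2, 4, 0, 0])   # "love product" plus two padding tokens
print([round(p, 3) for p in probs])
```

The second probability (positive) comes out larger because the "love" embedding pushes the pooled vector toward the positive class's weights.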

## 📰 Real AI Application #1: News Article Categorization

**Scenario**: Build a system to automatically categorize news articles!

**Categories**:
- Technology
- Sports
- Politics
- Entertainment

**Why this matters**:
- News aggregators (Google News, Apple News)
- Content recommendation systems
- Automated content organization
- RAG systems for domain-specific retrieval

In [None]:
# Create news article dataset
news_articles = [
    # Technology
    "Apple unveils new iPhone 15 with advanced AI features and improved camera system",
    "Google releases Gemini AI model competing with ChatGPT and Claude",
    "Microsoft Azure expands cloud computing services for enterprise customers",
    "Tesla introduces new autonomous driving features powered by neural networks",
    "OpenAI announces GPT-4 Turbo with improved performance and lower pricing",
    "Meta launches new VR headset with enhanced graphics and processing power",
    "Amazon Web Services introduces new machine learning tools for developers",
    "NVIDIA releases powerful new GPU designed for AI training and inference",
    
    # Sports
    "Lakers defeat Warriors in overtime thriller at Staples Center",
    "Lionel Messi scores hat-trick in Champions League final victory",
    "Serena Williams wins her 24th Grand Slam title at Wimbledon",
    "Tom Brady announces retirement after legendary NFL career",
    "Manchester United signs promising young striker from Barcelona",
    "Usain Bolt breaks 100m world record at Olympic Games",
    "Tiger Woods makes comeback at Masters tournament after injury",
    "Cristiano Ronaldo transfers to Al-Nassr for record contract",
    
    # Politics
    "President announces new economic policy to combat inflation",
    "Senate passes bipartisan infrastructure bill after months of debate",
    "Supreme Court rules on major constitutional amendment case",
    "United Nations summit addresses global climate change initiatives",
    "Governor signs executive order on healthcare reform legislation",
    "Congressional committee investigates cybersecurity threats",
    "White House announces diplomatic mission to strengthen international relations",
    "Parliament debates new taxation and fiscal responsibility measures",
    
    # Entertainment
    "New Marvel movie breaks box office records on opening weekend",
    "Taylor Swift releases surprise album to critical acclaim",
    "Netflix announces original series starring Hollywood A-listers",
    "Academy Awards ceremony honors best films of the year",
    "Beyonce embarks on world tour with sold-out stadium shows",
    "Disney announces new theme park attractions and experiences",
    "Grammy Awards celebrate best music and artists of the year",
    "Christopher Nolan's latest film receives rave reviews from critics",
]

# Labels (0: Tech, 1: Sports, 2: Politics, 3: Entertainment)
news_labels = [0]*8 + [1]*8 + [2]*8 + [3]*8
category_names = ['Technology', 'Sports', 'Politics', 'Entertainment']

# Create DataFrame
df_news = pd.DataFrame({
    'article': news_articles,
    'label': news_labels,
    'category': [category_names[label] for label in news_labels]
})

print("📰 News Article Dataset:")
print(f"Total articles: {len(df_news)}")
print(f"\nClass distribution:")
print(df_news['category'].value_counts())
print("\n📊 Sample articles:")
print(df_news.sample(5))

In [None]:
# Prepare data for deep learning
X_news = df_news['article'].values
y_news = df_news['label'].values

# Split data
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(
    X_news, y_news, test_size=0.25, random_state=42, stratify=y_news
)

print("🔀 Data Split:")
print(f"Training samples: {len(X_train_news)}")
print(f"Test samples: {len(X_test_news)}")

In [None]:
# Tokenization and padding
max_words = 1000  # Vocabulary size
max_len = 20      # Maximum sequence length

# Create tokenizer
tokenizer_news = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer_news.fit_on_texts(X_train_news)

# Convert texts to sequences
X_train_seq = tokenizer_news.texts_to_sequences(X_train_news)
X_test_seq = tokenizer_news.texts_to_sequences(X_test_news)

# Pad sequences to same length
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')

print("📊 Text Preprocessing for Deep Learning:")
print(f"Vocabulary size: {len(tokenizer_news.word_index)}")
print(f"Training data shape: {X_train_pad.shape}")
print(f"Test data shape: {X_test_pad.shape}")
print(f"\nExample conversion:")
print(f"Original: {X_train_news[0]}")
print(f"Sequence: {X_train_seq[0]}")
print(f"Padded: {X_train_pad[0]}")

In [None]:
# Build LSTM model for news classification
embedding_dim = 32
num_classes = 4

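model_news = Sequential([
    # Declare the input shape up front so summary() can report output shapes
    # (Embedding's input_length argument is deprecated in recent Keras releases)
    keras.Input(shape=(max_len,)),

    # Embedding layer - converts word indices to dense vectors;
    # mask_zero=True makes the LSTM skip the 0s added by pad_sequences
    Embedding(input_dim=max_words, output_dim=embedding_dim, mask_zero=True),

    # LSTM layer - processes sequences
    LSTM(64, return_sequences=False),

    # Dropout for regularization
    Dropout(0.5),

    # Dense layer for classification
    Dense(32, activation='relu'),
    Dropout(0.3),

    # Output layer - 4 classes
    Dense(num_classes, activation='softmax')
])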

# Compile model
model_news.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("🧠 News Classification Model Architecture:\n")
model_news.summary()

In [None]:
# Train model
print("🚀 Training News Classifier...\n")

# validation_split holds out the last 20% of the training rows (taken before shuffling),
# so with 24 articles the validation curve rests on only about 5 examples
history_news = model_news.fit(
    X_train_pad, y_train_news,
    epochs=20,
    batch_size=8,
    validation_split=0.2,
    verbose=0
)

print("✅ Training complete!\n")

# Evaluate on test set
test_loss, test_acc = model_news.evaluate(X_test_pad, y_test_news, verbose=0)
print(f"📊 Test Accuracy: {test_acc:.2%}")
print(f"📉 Test Loss: {test_loss:.4f}")

In [None]:
# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
ax1.plot(history_news.history['accuracy'], label='Training Accuracy', linewidth=2)
ax1.plot(history_news.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
ax1.set_title('Model Accuracy', fontsize=14, fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Loss
ax2.plot(history_news.history['loss'], label='Training Loss', linewidth=2)
ax2.plot(history_news.history['val_loss'], label='Validation Loss', linewidth=2)
ax2.set_title('Model Loss', fontsize=14, fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📈 Compare the curves: validation loss rising while training loss keeps falling signals overfitting, which is likely on a dataset this small.")
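
One common response to overfitting is early stopping: keep the weights from the epoch with the lowest validation loss and stop once it has not improved for a few epochs. Keras provides this as the `EarlyStopping` callback (with `monitor='val_loss'`, `patience`, and `restore_best_weights=True`). The plain-Python sketch below only illustrates the decision logic; the `early_stopping_epoch` helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights early stopping would keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row: stop training
    return best_epoch


# Made-up validation losses: improving at first, then rising as the model overfits
val_losses = [1.40, 1.21, 1.05, 0.98, 1.02, 1.10, 1.25, 1.31]
print(early_stopping_epoch(val_losses, patience=3))
```

With these numbers, training would stop after three epochs without improvement and keep the weights from epoch index 3, the lowest point of the validation curve.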

In [None]:
# Make predictions
y_pred_news = model_news.predict(X_test_pad, verbose=0)
y_pred_classes = np.argmax(y_pred_news, axis=1)

# Classification report
print("📋 Classification Report:\n")
print(classification_report(y_test_news, y_pred_classes, target_names=category_names))

# Confusion matrix
cm = confusion_matrix(y_test_news, y_pred_classes)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=category_names,
            yticklabels=category_names)
plt.title('Confusion Matrix - News Categorization', fontsize=16, fontweight='bold')
plt.ylabel('True Category', fontsize=12)
plt.xlabel('Predicted Category', fontsize=12)
plt.tight_layout()
plt.show()
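
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and uses a made-up 2x2 matrix, not output from the model above. The `per_class_metrics` helper is hypothetical.

```python
def per_class_metrics(cm, class_names):
    """Compute precision, recall and F1 per class from a confusion matrix.

    cm[i][j] = number of samples with true class i that were predicted as class j.
    """
    metrics = {}
    for k, name in enumerate(class_names):
        true_positive = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
        actual_k = sum(cm[k])                     # row sum: everything that truly is k
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up example: one Sports article was misclassified as Technology
cm_example = [[2, 0],
              [1, 1]]
for name, m in per_class_metrics(cm_example, ["Technology", "Sports"]).items():
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Precision penalizes the class that received the wrong prediction (Technology), while recall penalizes the class that lost an article (Sports).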

In [None]:
# Test on new articles
new_articles = [
    "Amazon introduces breakthrough quantum computing chips for data centers",
    "LeBron James leads team to championship victory in playoff final",
    "Congress debates new legislation on renewable energy standards",
    "Streaming platform announces exclusive series with top Hollywood actors",
]

print("🔮 Testing News Categorizer on New Articles:\n")
print("="*80)

# Preprocess and predict
new_seq = tokenizer_news.texts_to_sequences(new_articles)
new_pad = pad_sequences(new_seq, maxlen=max_len, padding='post', truncating='post')
predictions = model_news.predict(new_pad, verbose=0)

for i, article in enumerate(new_articles):
    pred_class = np.argmax(predictions[i])
    confidence = predictions[i][pred_class]
    
    print(f"\n📰 Article: \"{article}\"")
    print(f"🎯 Category: {category_names[pred_class]}")
    print(f"📊 Confidence: {confidence:.2%}")
    print(f"\n   Probabilities:")
    for j, cat in enumerate(category_names):
        print(f"      {cat}: {predictions[i][j]:.2%}")
    print("="*80)

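print("\n✅ Predictions complete! With so little training data, expect some wrong or low-confidence categories.")
print("💡 News aggregators such as Google News and Apple News tackle this categorization task at far larger scale!")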

## ⭐ Real AI Application #2: Customer Review Sentiment Analysis

**Scenario**: Analyze customer reviews to determine sentiment (Positive/Negative/Neutral)

**Why this matters**:
- E-commerce platforms (Amazon, eBay)
- Product feedback analysis
- Customer satisfaction monitoring
- Brand reputation management
- Automated support ticket routing

In [None]:
# Create customer review dataset
reviews = [
    # Positive reviews
    "This product exceeded all my expectations! Absolutely love it!",
    "Outstanding quality and fast shipping. Highly recommend!",
    "Best purchase I've made this year. 5 stars!",
    "Amazing product, works perfectly and looks great!",
    "Excellent value for money. Very satisfied with this purchase.",
    "Fantastic quality, exactly as described. Will buy again!",
    "Love this! Great features and easy to use.",
    "Superb product! Better than I expected.",
    "Very impressed with the quality and performance!",
    "Wonderful product, absolutely worth the price!",
    
    # Negative reviews
    "Terrible quality, broke after one week. Very disappointed.",
    "Worst purchase ever. Complete waste of money!",
    "Poor quality and doesn't work as advertised. Avoid!",
    "Awful product, returned immediately. Don't buy!",
    "Very disappointed. Low quality and overpriced.",
    "Horrible experience. Product failed on first use.",
    "Not as described. Cheap materials and poor construction.",
    "Terrible customer service and defective product.",
    "Complete disaster. Nothing works properly.",
    "Waste of money. Returned for refund immediately.",
    
    # Neutral reviews
    "It's okay. Does what it's supposed to do.",
    "Average product. Nothing special but works fine.",
    "Decent quality for the price. Could be better.",
    "It's alright. Met my basic expectations.",
    "Product is okay. Some good features, some not so good.",
    "Acceptable quality. Not great, not terrible.",
    "Fair product for the price. Average performance.",
    "It works. Nothing to complain about, nothing to praise.",
    "Meets basic requirements. Could use improvements.",
    "Standard product. Does the job adequately.",
]

# Labels (0: Negative, 1: Neutral, 2: Positive)
sentiment_labels = [2]*10 + [0]*10 + [1]*10
sentiment_names = ['Negative', 'Neutral', 'Positive']

# Create DataFrame
df_reviews = pd.DataFrame({
    'review': reviews,
    'label': sentiment_labels,
    'sentiment': [sentiment_names[label] for label in sentiment_labels]
})

print("⭐ Customer Review Dataset:")
print(f"Total reviews: {len(df_reviews)}")
print(f"\nSentiment distribution:")
print(df_reviews['sentiment'].value_counts())
print("\n📊 Sample reviews:")
print(df_reviews.sample(6))

In [None]:
# Prepare data
X_reviews = df_reviews['review'].values
y_reviews = df_reviews['label'].values

# Split data
X_train_rev, X_test_rev, y_train_rev, y_test_rev = train_test_split(
    X_reviews, y_reviews, test_size=0.25, random_state=42, stratify=y_reviews
)

# Tokenization
max_words_rev = 500
max_len_rev = 15

tokenizer_rev = Tokenizer(num_words=max_words_rev, oov_token='<OOV>')
tokenizer_rev.fit_on_texts(X_train_rev)

X_train_rev_seq = tokenizer_rev.texts_to_sequences(X_train_rev)
X_test_rev_seq = tokenizer_rev.texts_to_sequences(X_test_rev)

X_train_rev_pad = pad_sequences(X_train_rev_seq, maxlen=max_len_rev, padding='post')
X_test_rev_pad = pad_sequences(X_test_rev_seq, maxlen=max_len_rev, padding='post')

print("📊 Review Data Prepared:")
print(f"Training samples: {len(X_train_rev)}")
print(f"Test samples: {len(X_test_rev)}")
print(f"Vocabulary size: {len(tokenizer_rev.word_index)}")

In [None]:
# Build sentiment analysis model
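model_sentiment = Sequential([
    # Declares the input shape (Embedding's input_length argument is deprecated in recent Keras)
    keras.Input(shape=(max_len_rev,)),
    # mask_zero=True lets the LSTMs ignore the 0s added by pad_sequences
    Embedding(input_dim=max_words_rev, output_dim=32, mask_zero=True),
    LSTM(64, return_sequences=True),  # returns the full sequence so the next LSTM can read it
    LSTM(32),
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(3, activation='softmax')  # 3 classes: Negative, Neutral, Positive
])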

model_sentiment.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("🧠 Sentiment Analysis Model:\n")
model_sentiment.summary()

In [None]:
# Train sentiment model
print("🚀 Training Sentiment Analyzer...\n")

history_sentiment = model_sentiment.fit(
    X_train_rev_pad, y_train_rev,
    epochs=25,
    batch_size=4,
    validation_split=0.2,
    verbose=0
)

print("✅ Training complete!\n")

# Evaluate
test_loss_sent, test_acc_sent = model_sentiment.evaluate(X_test_rev_pad, y_test_rev, verbose=0)
print(f"📊 Test Accuracy: {test_acc_sent:.2%}")
print(f"📉 Test Loss: {test_loss_sent:.4f}")

In [None]:
# Visualize training
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(history_sentiment.history['accuracy'], label='Training', linewidth=2)
ax1.plot(history_sentiment.history['val_accuracy'], label='Validation', linewidth=2)
ax1.set_title('Sentiment Model Accuracy', fontsize=14, fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(history_sentiment.history['loss'], label='Training', linewidth=2)
ax2.plot(history_sentiment.history['val_loss'], label='Validation', linewidth=2)
ax2.set_title('Sentiment Model Loss', fontsize=14, fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test on new reviews
new_reviews = [
    "This is absolutely amazing! Best product ever!",
    "Terrible quality, completely useless and broken.",
    "It's fine. Does the job, nothing more.",
    "Love this so much! Exceeded expectations!",
    "Worst purchase I've ever made. Total waste.",
    "Decent product. Works as expected.",
]

print("🔮 Testing Sentiment Analyzer on New Reviews:\n")
print("="*80)

# Predict
new_rev_seq = tokenizer_rev.texts_to_sequences(new_reviews)
new_rev_pad = pad_sequences(new_rev_seq, maxlen=max_len_rev, padding='post')
predictions_sent = model_sentiment.predict(new_rev_pad, verbose=0)

for i, review in enumerate(new_reviews):
    pred_class = np.argmax(predictions_sent[i])
    confidence = predictions_sent[i][pred_class]
    sentiment = sentiment_names[pred_class]
    
    # Choose emoji based on sentiment
    emoji = "😊" if pred_class == 2 else "😞" if pred_class == 0 else "😐"

    print(f"\n⭐ Review: \"{review}\"")
    print(f"🎯 Sentiment: {sentiment} {emoji}")
    print(f"📊 Confidence: {confidence:.2%}")
    print(f"\n   Probabilities:")
    for j, sent in enumerate(sentiment_names):
        print(f"      {sent}: {predictions_sent[i][j]:.2%}")
    print("="*80)

print("\n✅ Predictions complete! Check the probabilities: a model trained on 30 reviews will often be unsure.")
print("💡 Review platforms apply this kind of sentiment analysis at much larger scale!")
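
Because a small model's softmax output is often close to a coin flip, it can help to abstain when the top probability is low instead of always reporting the argmax. The sketch below shows one way to do that. The `label_with_threshold` helper and the 0.6 cutoff are assumptions for illustration; in practice a threshold would be tuned on validation data.

```python
def label_with_threshold(probabilities, class_names, threshold=0.6):
    """Return (label, confidence), or ('Uncertain', confidence) below the threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    if confidence < threshold:
        return "Uncertain", confidence
    return class_names[best], confidence


names = ["Negative", "Neutral", "Positive"]

# Made-up probability vectors standing in for model_sentiment.predict output
print(label_with_threshold([0.10, 0.20, 0.70], names))
print(label_with_threshold([0.40, 0.35, 0.25], names))
```

Reviews flagged as uncertain could then be routed to a human or to a stronger model.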

## 🎯 Why This Matters for Modern AI

### 🔍 **NER & POS in Production AI**

**🤖 ChatGPT & Claude:**
- Recognize entities in user messages (names, dates, places) without a separate NER step
- Model sentence structure implicitly to give better responses
- Generate grammatical text from patterns learned in training, not from explicit POS rules

**🔍 RAG Systems:**
- Index documents by extracted entities
- Improve search relevance using NER
- Filter results by entity types

**📊 Analytics & Business Intelligence:**
- Track brand mentions across social media
- Extract product features from reviews
- Monitor competitor activity

### 🧠 **Deep Learning Text Classification**

**📰 Content Platforms:**
- Google News, Apple News categorization
- Reddit post classification
- YouTube content categorization

**💬 Customer Support:**
- Automatic ticket routing
- Priority classification
- Sentiment monitoring

**🔒 Content Moderation:**
- Detect toxic content
- Spam filtering
- Policy violation detection

### 🎯 **Evolution to Transformers:**

What you learned today (LSTMs for text) was succeeded by:
- **Transformers**: Replace LSTM recurrence with attention
- **BERT**: A Transformer encoder for better context understanding
- **GPT**: A Transformer decoder for better text generation

**We'll explore this in Day 3!**

## 🎯 Interactive Exercise

**Challenge**: Build an email intent classifier!

**Scenario**: Classify customer support emails into categories

**Categories**:
- Billing Question
- Technical Support
- Product Inquiry
- Complaint

**Tasks**:
1. Create a dataset of customer emails (at least 20 total)
2. Preprocess and tokenize
3. Build an LSTM classifier
4. Train and evaluate
5. Test on new emails

**Bonus**: Extract entities (product names, amounts) using NER!

In [None]:
# YOUR CODE HERE!

# TODO 1: Create customer email dataset
customer_emails = [
    # Add emails for each category
]

# TODO 2: Create labels
email_labels = []

# TODO 3: Build LSTM model

# TODO 4: Train and evaluate

# TODO 5: Test on new emails

print("Complete the TODOs above!")

### ✅ Solution (Try on your own first!)

In [None]:
# SOLUTION - Email Intent Classifier

customer_emails = [
    # Billing Question (0)
    "Why was I charged twice for my subscription this month?",
    "Can you explain the charges on my latest invoice?",
    "I need a refund for the duplicate payment.",
    "What is the billing cycle for my account?",
    "How do I update my payment method?",
    
    # Technical Support (1)
    "The application keeps crashing when I try to login.",
    "I cannot access my account, please help.",
    "The software is not working properly on my computer.",
    "Error message appears when I try to upload files.",
    "Having trouble connecting to the server.",
    
    # Product Inquiry (2)
    "What features are included in the premium plan?",
    "Do you offer enterprise pricing for large teams?",
    "Can you tell me more about the new product release?",
    "Is there a mobile app available for iOS?",
    "What are the system requirements for this software?",
    
    # Complaint (3)
    "Very disappointed with the customer service quality.",
    "This product does not work as advertised.",
    "I want to cancel my subscription immediately.",
    "Poor quality and terrible user experience.",
    "Your support team never responds to my emails.",
]

email_labels = [0]*5 + [1]*5 + [2]*5 + [3]*5
intent_names = ['Billing', 'Technical Support', 'Product Inquiry', 'Complaint']

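# Split and prepare
X_train_em, X_test_em, y_train_em, y_test_em = train_test_split(
    customer_emails, email_labels, test_size=0.25, random_state=42, stratify=email_labels
)
# train_test_split returns plain Python lists here; Keras expects label arrays
y_train_em, y_test_em = np.array(y_train_em), np.array(y_test_em)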

# Tokenize
tokenizer_em = Tokenizer(num_words=300, oov_token='<OOV>')
tokenizer_em.fit_on_texts(X_train_em)

X_train_em_seq = tokenizer_em.texts_to_sequences(X_train_em)
X_test_em_seq = tokenizer_em.texts_to_sequences(X_test_em)

X_train_em_pad = pad_sequences(X_train_em_seq, maxlen=15, padding='post')
X_test_em_pad = pad_sequences(X_test_em_seq, maxlen=15, padding='post')

# Build model
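model_email = Sequential([
    # Declares the input shape (Embedding's input_length argument is deprecated in recent Keras)
    keras.Input(shape=(15,)),
    # mask_zero=True lets the LSTM ignore the 0s added by pad_sequences
    Embedding(300, 32, mask_zero=True),
    LSTM(32),
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dense(4, activation='softmax')
])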

model_email.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train
print("🚀 Training Email Intent Classifier...\n")
model_email.fit(X_train_em_pad, y_train_em, epochs=30, batch_size=2, verbose=0)

# Evaluate
_, acc_em = model_email.evaluate(X_test_em_pad, y_test_em, verbose=0)
print(f"✅ Test Accuracy: {acc_em:.2%}\n")

# Test on new emails
new_emails = [
    "My credit card was charged but I didn't receive the invoice.",
    "The app crashes every time I try to open it.",
    "Do you have a student discount available?",
    "Very unhappy with the service, considering switching to competitor.",
]

print("🔮 Testing Email Intent Classifier:\n")
print("="*80)

new_em_seq = tokenizer_em.texts_to_sequences(new_emails)
new_em_pad = pad_sequences(new_em_seq, maxlen=15, padding='post')
predictions_em = model_email.predict(new_em_pad, verbose=0)

for i, email in enumerate(new_emails):
    pred_class = np.argmax(predictions_em[i])
    intent = intent_names[pred_class]
    confidence = predictions_em[i][pred_class]
    
    print(f"\n📧 Email: \"{email}\"")
    print(f"🎯 Intent: {intent}")
    print(f"📊 Confidence: {confidence:.2%}")
    print("="*80)

print("\n✅ Email intent classifier built!")
print("💡 Used for automatic ticket routing in customer support systems!")

## 🎉 Congratulations!

**You just learned:**
- ✅ Named Entity Recognition (NER) for extracting entities from text
- ✅ Part-of-Speech (POS) tagging for understanding grammar
- ✅ Deep learning text classification with LSTMs
- ✅ How to build a news article categorization system
- ✅ How to build a customer review sentiment analyzer
- ✅ How these techniques relate to modern AI systems

### 🎯 Key Takeaways:

1. **NER extracts structured information**
   - Identifies entities (people, places, organizations)
   - Powers chatbots, search engines, knowledge graphs
   - Essential for information extraction

2. **POS tagging understands grammar**
   - Labels words by grammatical role
   - Makes explicit the structure that LLMs learn implicitly
   - Enables better text analysis

3. **Deep learning improves text classification**
   - LSTMs capture context and sequences
   - Embedding layers learn semantic representations
   - Often higher accuracy than traditional methods, given enough training data

4. **Real-world applications are everywhere**
   - News categorization, sentiment analysis
   - Content moderation, spam detection
   - Customer support automation

---

**🎯 Practice Exercise (Before Day 3):**

Build a multi-label text classifier:
1. Create documents that can belong to multiple categories
2. Use sigmoid activation instead of softmax
3. Train LSTM model with binary cross-entropy loss
4. Evaluate with multi-label metrics
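
As a starting point for steps 2 and 3, this plain-Python sketch shows why multi-label output uses sigmoid instead of softmax: each label gets its own independent probability, so several can be "on" at once. The logits are made up; in Keras the equivalent setup would be a final `Dense(n_labels, activation='sigmoid')` layer trained with `loss='binary_crossentropy'`.

```python
import math


def sigmoid(x):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


labels = ["Technology", "Sports", "Politics", "Entertainment"]
logits = [2.0, -1.5, 0.3, -3.0]   # made-up raw scores for one document
y_true = [1, 0, 1, 0]             # the document is about technology AND politics

probs = [sigmoid(z) for z in logits]   # one independent probability per label
predicted = [name for name, p in zip(labels, probs) if p >= 0.5]

print([round(p, 3) for p in probs])
print(predicted)
print(round(binary_cross_entropy(y_true, probs), 4))
```

Unlike softmax, the probabilities here do not sum to 1, and the 0.5 cutoff is applied to each label separately.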

---

**📚 Next Lesson:** Day 3 - Modern NLP with Transformers
- Introduction to Transformer architecture
- BERT and GPT foundations
- Using HuggingFace transformers library
- Sentiment analysis with pre-trained BERT
- Text summarization with T5
- Fine-tuning for custom tasks

---

**💬 Remember:**

*"The techniques you learned today - NER, POS tagging, and LSTM classification - paved the way for modern Transformers. Understanding these fundamentals helps you grasp what ChatGPT, Claude, and BERT learn to do implicitly. Every time you interact with an AI assistant, these concepts are at play!"* 🚀

---

**🔗 Connections to Modern AI:**
- **ChatGPT/Claude**: Perform entity recognition and classification implicitly, without separate modules
- **RAG Systems**: Leverage NER for entity-based retrieval
- **Content Moderation**: Classification at massive scale
- **Search Engines**: Entity extraction for knowledge graphs
- **Transformers**: Evolution from LSTM-based models