# Day 7: Fake News Detection
**The AI Engineer Course 2026 - Section 27**

**Student:** Natruja

**Date:** Wednesday, February 18, 2026

---

## Learning Objectives
1. Build a fake news detection system
2. Work with real datasets
3. Use advanced classifiers (Logistic Regression)
4. Handle imbalanced data
5. Evaluate with multiple metrics and confusion matrices

## Setup: Install and Import Required Libraries

In [None]:
import subprocess
import sys

# Install required libraries
subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn", "pandas", "nltk", "-q"])

# Download NLTK data
import nltk
nltk.download('stopwords', quiet=True)

print("✓ Libraries installed successfully!")

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.pipeline import Pipeline
import nltk

print("✓ All imports successful!")

## The Fake News Problem

### What is Fake News?
- Deliberately false or misleading information
- Spread for political, commercial, or social gain
- Difficult to distinguish from real news

### Real-World Impact:
- Influences elections
- Spreads health misinformation
- Creates public distrust
- Can cause real harm

### NLP Solution:
- Analyze text patterns
- Identify suspicious language
- Flag unreliable sources
- Help fact-checkers prioritize

### Challenges:
- Fake news is evolving
- Context matters
- Satire vs. actual misinformation
- Language bias

## Sample Dataset: Create Your Own Fake/Real News Data

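Since you may not have a labelled fake news CSV file on hand, we'll create a small synthetic dataset of fake-style and real-style headlines.
This lets you practice the full workflow without external files. With only 24 hand-written examples, the scores it produces are illustrative, not a measure of real-world performance.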

In [None]:
# Create sample fake and real news datasets
# These are representative examples of common patterns in fake vs real news

fake_texts = [
    "SHOCKING: Celebrities secretly aliens confirmed by leaked documents! You won't believe what we found!",
    "UNBELIEVABLE: Free money from government! This will CHANGE YOUR LIFE! Click here before it's deleted!",
    "You won't BELIEVE what this senator is hiding! Mainstream media refuses to cover this BOMBSHELL!",
    "Doctors HATE this simple trick that cures cancer! Big pharma trying to HIDE the truth!",
    "BREAKING: This will DESTROY the mainstream narrative! The government DOESN'T want you knowing!",
    "SHOCKING SCANDAL: Billionaire caught in bizarre conspiracy! This is the BIGGEST story of the year!",
    "This SHOCKING evidence proves the election was RIGGED! The mainstream media is LYING to you!",
    "UNBELIEVABLE: This celebrity just did something SHOCKING that left everyone speechless!",
    "INCREDIBLE: Scientists discover FOUNTAIN OF YOUTH! Big pharma trying to hide this discovery!",
    "You won't BELIEVE what happened next! This viral story will leave you in SHOCK!",
    "EXPOSED: Government conspiracy revealed! They don't want you seeing this video!",
    "BREAKING NEWS: This will CHANGE EVERYTHING! World leaders meeting in SECRET revealed!"
]

real_texts = [
    "Reuters: President signs new economic policy bill. Officials explain implementation timeline starting next month.",
    "AP: Researchers publish study in Nature journal. Peer review confirms findings on climate change impacts.",
    "In a groundbreaking study, scientists found a promising new treatment. Clinical trials showed positive results.",
    "Government health agency announces vaccine safety data. Comprehensive review of millions of doses administered.",
    "Leading researchers at Stanford University published findings in peer-reviewed journal this week.",
    "Official statement from Ministry of Health regarding new policy. Implementation begins in Q2 2026.",
    "Scientists at MIT conducted research on renewable energy. Results published in quarterly scientific report.",
    "International health organization releases guidance based on evidence. Multiple countries adopt recommendations.",
    "University research team announces discovery. Methodology and data now available for peer review.",
    "Press release from government agency. Officials confirmed findings after months of investigation.",
    "Academic journal publishes peer-reviewed research. Authors describe their methodology in detail.",
    "Health authorities provide guidance based on clinical evidence. Public consultation period begins Friday."
]

# Create DataFrame
data = {
    'text': fake_texts + real_texts,
    'label': [1] * len(fake_texts) + [0] * len(real_texts)  # 1=fake, 0=real
}

df = pd.DataFrame(data)

# Shuffle the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("Sample Fake News Detection Dataset Created!")
print(f"Total samples: {len(df)}")
print(f"\nDataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['label'].value_counts())
print(f"\nFirst few rows:")
print(df.head())
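
If you later get a labelled CSV file, loading it could look roughly like the sketch below. The column names `text` and `label` and the two rows are assumptions made for illustration; an in-memory string stands in for a real file so the cell runs without external data.

```python
import io
import pandas as pd

# Hypothetical CSV content; with a real file you would pass its path to pd.read_csv instead.
csv_text = io.StringIO(
    "text,label\n"
    '"Officials confirm budget figures after audit.",0\n'
    '"SHOCKING secret they do not want you to see!",1\n'
)

news_df = pd.read_csv(csv_text)

# Basic sanity checks before training: expected columns present, no missing text.
assert {"text", "label"}.issubset(news_df.columns)
news_df = news_df.dropna(subset=["text"])

print(news_df.shape)
print(news_df["label"].value_counts())
```

The same `shape` and `value_counts()` checks used on the sample data apply unchanged to a loaded file.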

## Example: Exploratory Data Analysis

In [None]:
# Analyze features that distinguish fake from real news
df['text_length'] = df['text'].str.len()
df['exclamation_marks'] = df['text'].str.count('!')
df['caps_ratio'] = df['text'].str.count(r'[A-Z]') / df['text'].str.len()
df['all_caps_words'] = df['text'].str.count(r'\b[A-Z]+\b')

print("Feature Analysis: Real News vs Fake News")
print("="*60)

real_news = df[df['label'] == 0]
fake_news = df[df['label'] == 1]

print(f"\nText Length (characters):")
print(f"  Real: {real_news['text_length'].mean():.1f} avg")
print(f"  Fake: {fake_news['text_length'].mean():.1f} avg")

print(f"\nExclamation Marks:")
print(f"  Real: {real_news['exclamation_marks'].mean():.2f} avg")
print(f"  Fake: {fake_news['exclamation_marks'].mean():.2f} avg")

print(f"\nAll-CAPS Words:")
print(f"  Real: {real_news['all_caps_words'].mean():.2f} avg")
print(f"  Fake: {fake_news['all_caps_words'].mean():.2f} avg")

print(f"\nCapital Letters Ratio:")
print(f"  Real: {real_news['caps_ratio'].mean():.3f} avg")
print(f"  Fake: {fake_news['caps_ratio'].mean():.3f} avg")
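
To see what the three `str.count` patterns in the cell above actually match, here is a minimal sketch on a single made-up headline (the sentence is an assumption, not part of the dataset).

```python
import pandas as pd

sample = pd.Series(["BREAKING: You won't BELIEVE this!"])

exclamations = sample.str.count("!")              # literal '!' characters
capital_letters = sample.str.count(r"[A-Z]")      # every uppercase ASCII letter
all_caps_words = sample.str.count(r"\b[A-Z]+\b")  # whole words made only of capitals

print(exclamations[0], capital_letters[0], all_caps_words[0])
```

"You" adds one capital letter but is not an all-caps word, so only "BREAKING" and "BELIEVE" are counted by the last pattern. Note that `\b[A-Z]+\b` also matches one-letter words such as "I"; `\b[A-Z]{2,}\b` would exclude them.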

## Example: Building a Fake News Detector with Pipeline

In [None]:
# Prepare data
texts = df['text'].values
labels = df['label'].values

# Split data (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
for u, c in zip(unique, counts):
    label_name = "Real" if u == 0 else "Fake"
    print(f"  {label_name}: {c} ({100*c/len(y_train):.1f}%)")
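
The sample data is balanced (12 fake, 12 real), so the training split above stays close to 50/50. For the "handle imbalanced data" objective, one common option in scikit-learn is `class_weight='balanced'`. The sketch below uses a hypothetical 8-to-2 label array to show the weights that option implies; it is an illustration, not a tuned recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 real (0) vs 2 fake (1).
y_imbalanced = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imbalanced
)
print(dict(zip([0, 1], weights)))

# The same weighting can be requested directly on the classifier.
weighted_clf = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
```

Here the rare class gets weight 10 / (2 × 2) = 2.5 and the common class 10 / (2 × 8) = 0.625, so mistakes on fake items cost more during training.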

## Example: Logistic Regression with TfidfVectorizer Pipeline

In [None]:
# Create pipeline with Logistic Regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english', max_df=0.7)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Train
print("Training classifier...")
pipeline.fit(X_train, y_train)
print("✓ Training complete!")

# Make predictions
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)

print(f"\nSample predictions on test set:")
for i, (text, pred, proba) in enumerate(zip(X_test[:3], y_pred[:3], y_pred_proba[:3])):
    label = "REAL" if pred == 0 else "FAKE"
    confidence = proba[pred]
    print(f"\nSample {i+1}: {label} (Confidence: {confidence:.2f})")
    print(f"  Text: {text[:60]}...")

## Example: Comprehensive Model Evaluation

In [None]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])

print("FAKE NEWS DETECTOR - EVALUATION RESULTS")
print("="*60)
print(f"\nMetrics:")
print(f"  Accuracy:  {accuracy:.2f}")
print(f"  Precision: {precision:.2f}")
print(f"  Recall:    {recall:.2f}")
print(f"  F1-Score:  {f1:.2f}")
print(f"  ROC-AUC:   {roc_auc:.2f}")

print(f"\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"  True Negatives (Correct Real):  {cm[0][0]}")
print(f"  False Positives (Real→Fake):    {cm[0][1]}")
print(f"  False Negatives (Fake→Real):    {cm[1][0]}")
print(f"  True Positives (Correct Fake):  {cm[1][1]}")

print(f"\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Real', 'Fake']))
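
The metrics printed above all derive from the four confusion-matrix counts. As a minimal sketch on made-up labels (not the notebook's test set), precision, recall, and F1 can be recomputed by hand and compared with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels for illustration only (1=fake, 0=real).
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # of everything flagged fake, how much was fake
recall_manual = tp / (tp + fn)     # of all fake items, how many were flagged
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision_manual:.2f} recall={recall_manual:.2f} f1={f1_manual:.2f}")
```

In a fake news setting, a false positive labels a real article as fake and a false negative lets a fake article through, so which metric matters more depends on which error is costlier.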

---
# EXERCISES: 15 Progressive Challenges

Complete all 5 exercises in each difficulty level. Each exercise has TODO comments and blanks (___) for you to fill in.

**Key Concepts to Use:**
- pd.read_csv(), df.shape, df['label'].value_counts()
- LogisticRegression, TfidfVectorizer(max_df=0.7)
- Pipeline, train_test_split, model.score()
- confusion_matrix, classification_report, predict_proba
- feature importance (coef_), random_state for reproducibility

## ⭐ EASY: Exercise 1 - Load and Explore Dataset

In [None]:
# TODO: Load the dataframe we created earlier (it's already in memory as 'df')
# TODO: Print the shape using df.shape
# TODO: Print total number of samples
# TODO: Print class distribution using df['label'].value_counts()

print(f"Dataset shape: {___}")
print(f"Total samples: {___}")
print(f"\nClass distribution:")
print___

## ⭐ EASY: Exercise 2 - Check Column Names and Data Types

In [None]:
# TODO: Print column names
# TODO: Print data types using df.dtypes
# TODO: Print first 3 rows using df.head(3)

print("Column names:")
print___
print(f"\nData types:")
print___
print(f"\nFirst 3 rows:")
print___

## ⭐ EASY: Exercise 3 - Basic Data Splitting

In [None]:
# TODO: Split df['text'] and df['label'] into 80-20 train/test
# TODO: Use train_test_split with test_size=0.2 and random_state=42
# TODO: Assign to X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = ______

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Total: {len(X_train) + len(X_test)}")

## ⭐ EASY: Exercise 4 - Train Simple Logistic Regression

In [None]:
# TODO: Create a Pipeline with:
#   1. TfidfVectorizer(max_features=1000, stop_words='english')
#   2. LogisticRegression(max_iter=1000, random_state=42)
# TODO: Fit the pipeline on X_train, y_train
# TODO: Get accuracy score on test set using model.score()

simple_pipeline = Pipeline([
    ('tfidf', ___),
    ('classifier', ___)
])

# Train
___

# Evaluate
accuracy = ___
print(f"Model Accuracy: {accuracy:.4f}")

## ⭐ EASY: Exercise 5 - Make Predictions on New Text

In [None]:
# TODO: Use the pipeline from Exercise 4 to predict on a new article
# TODO: Make a prediction on this text: "Scientists publish findings in peer-reviewed journal"
# TODO: Use pipeline.predict() to get the class
# TODO: Print the result (0=real, 1=fake)

new_text = "Scientists publish findings in peer-reviewed journal"

prediction = ___
label = "REAL" if prediction[0] == 0 else "FAKE"

print(f"Text: {new_text}")
print(f"Prediction: {label}")

---
## ⭐⭐ MEDIUM: Exercise 6 - Build Pipeline with max_df Parameter

In [None]:
# TODO: Create a Pipeline with TfidfVectorizer that has max_df=0.7
# max_df=0.7 removes words that appear in more than 70% of documents
# TODO: Use LogisticRegression with max_iter=1000 and random_state=42
# TODO: Train on X_train, y_train
# TODO: Get accuracy on test set

pipeline_with_maxdf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english', max_df=___)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

___  # Train the pipeline

accuracy_maxdf = ___  # Get accuracy score
print(f"Accuracy with max_df=0.7: {accuracy_maxdf:.4f}")

## ⭐⭐ MEDIUM: Exercise 7 - Generate Confusion Matrix

In [None]:
# TODO: Make predictions on X_test using pipeline_with_maxdf
# TODO: Calculate confusion matrix using confusion_matrix(y_test, y_pred)
# TODO: Print TN, FP, FN, TP values

y_pred_medium = ___  # Get predictions

cm = ___  # Calculate confusion matrix

tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]

print("Confusion Matrix Analysis:")
print(f"  True Negatives (correctly identified REAL):  {___}")
print(f"  False Positives (real classified as FAKE):   {___}")
print(f"  False Negatives (fake classified as REAL):   {___}")
print(f"  True Positives (correctly identified FAKE):  {___}")
print(f"\nInterpretation:")
print(f"  Correctly classified: {tn + tp} out of {len(y_test)}")

## ⭐⭐ MEDIUM: Exercise 8 - Compare Train vs Test Accuracy (Overfitting Check)

In [None]:
# TODO: Get accuracy on TRAINING data using model.score(X_train, y_train)
# TODO: Get accuracy on TEST data using model.score(X_test, y_test)
# TODO: Calculate the gap (overfitting indicator)
# TODO: Print both and compare

train_acc = ___  # Score on training data
test_acc = ___   # Score on test data

gap = train_acc - test_acc

print("Overfitting Analysis:")
print(f"  Training Accuracy: {train_acc:.4f}")
print(f"  Test Accuracy:     {test_acc:.4f}")
print(f"  Gap (overfitting): {gap:.4f}")
print(f"\nInterpretation:")
if gap < 0.05:
    print("  ✓ Good fit - low overfitting")
elif gap < 0.15:
    print("  ~ Moderate fit - some overfitting")
else:
    print("  ✗ Poor fit - significant overfitting")

## ⭐⭐ MEDIUM: Exercise 9 - Use random_state for Reproducibility

In [None]:
# TODO: Create two pipelines with random_state=42
# TODO: Split data TWICE with the same random_state
# TODO: Train both models and verify they have identical performance
# TODO: This proves random_state makes results reproducible

# First split
X_train_1, X_test_1, y_train_1, y_test_1 = ___

# Second split (identical conditions)
X_train_2, X_test_2, y_train_2, y_test_2 = ___

# Train first model
pipe_1 = Pipeline([
    # TfidfVectorizer is deterministic and does not accept random_state
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
___  # Fit pipe_1 on split 1
acc_1 = ___  # Get score

# Train second model
pipe_2 = Pipeline([
    # TfidfVectorizer is deterministic and does not accept random_state
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
___  # Fit pipe_2 on split 2
acc_2 = ___  # Get score

print("Reproducibility Test:")
print(f"  Model 1 Accuracy: {acc_1:.4f}")
print(f"  Model 2 Accuracy: {acc_2:.4f}")
print(f"  Difference: {abs(acc_1 - acc_2):.4f}")
if abs(acc_1 - acc_2) == 0:
    print("  ✓ Perfect reproducibility with random_state=42")
else:
    print("  Note: Any difference is due to non-deterministic operations")
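
The randomness in this workflow comes mainly from the shuffle inside `train_test_split`. A minimal sketch on a toy integer array (an assumption for illustration) shows that the same seed reproduces the same test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

items = np.arange(20)

_, test_a = train_test_split(items, test_size=0.2, random_state=42)
_, test_b = train_test_split(items, test_size=0.2, random_state=42)
_, test_c = train_test_split(items, test_size=0.2, random_state=7)

# Same seed, same shuffle, same test set.
same_seed_matches = np.array_equal(test_a, test_b)
print(same_seed_matches)
print(test_a, test_c)
```

A different seed will usually select different items, which is why scores measured on a small test set can shift from one seed to the next.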

## ⭐⭐ MEDIUM: Exercise 10 - Get Probability Predictions

In [None]:
# TODO: Use pipeline_with_maxdf to get predictions AND probabilities
# TODO: Use predict_proba() to get confidence scores
# TODO: Show predictions with confidence levels

test_examples = X_test[:5]  # First 5 test samples
true_labels = y_test[:5]

predictions = ___  # Use predict()
probabilities = ___  # Use predict_proba()

print("Predictions with Confidence Scores:")
print("="*70)
for i, (text, pred, proba, true_label) in enumerate(zip(test_examples, predictions, probabilities, true_labels)):
    pred_label = "FAKE" if pred == 1 else "REAL"
    true_name = "FAKE" if true_label == 1 else "REAL"
    confidence = proba[pred]
    
    print(f"\n{i+1}. Text: {text[:50]}...")
    print(f"   Predicted: {pred_label} (confidence: {confidence:.3f})")
    print(f"   True Label: {true_name}")
    print(f"   Probabilities - Real: {proba[0]:.3f}, Fake: {proba[1]:.3f}")
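
The indexing `proba[pred]` works because the columns of `predict_proba` follow the order of the model's `classes_`, which is `[0, 1]` here. The sketch below checks this on a tiny made-up corpus; the texts and the name `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_texts = [
    "officials publish peer reviewed report",
    "agency confirms findings after review",
    "shocking secret they want hidden",
    "unbelievable trick doctors hate",
]
demo_labels = [0, 0, 1, 1]  # 0=real, 1=fake

demo_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
])
demo_model.fit(demo_texts, demo_labels)

query = ["shocking secret trick"]
proba = demo_model.predict_proba(query)
classes = demo_model.classes_

# Each row sums to 1, and predict() returns the class with the largest probability.
predicted = classes[np.argmax(proba, axis=1)]
print(classes, proba.round(3), predicted)
```

If the labels were strings such as `'real'` and `'fake'`, you would need to look up the column through `classes_` instead of using the predicted label as an index.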

---
## ⭐⭐⭐ HARD: Exercise 11 - Build Complete Fake News Detector Function

In [None]:
# TODO: Create a function that encapsulates the entire pipeline
# TODO: It should:
#   1. Take texts and labels as input
#   2. Split data (80-20)
#   3. Create and train pipeline
#   4. Return the trained model

def build_fake_news_detector(texts, labels, test_size=0.2, random_state=42):
    """
    Build a complete fake news detection model.
    
    Args:
        texts: Array of text samples
        labels: Array of labels (0=real, 1=fake)
        test_size: Proportion of test set
        random_state: For reproducibility
    
    Returns:
        model: Trained pipeline
        X_test: Test texts
        y_test: Test labels
    """
    # TODO: Split data
    X_train, X_test, y_train, y_test = ______
    
    # TODO: Create pipeline with TfidfVectorizer and LogisticRegression
    model = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english', max_df=0.7)),
        ('classifier', LogisticRegression(max_iter=1000, random_state=random_state))
    ])
    
    # TODO: Train the model
    ___
    
    return model, X_test, y_test

# Test your function
detector_model, X_test_hard, y_test_hard = build_fake_news_detector(df['text'].values, df['label'].values)
print(f"✓ Detector built successfully")
print(f"  Test set size: {len(X_test_hard)}")
print(f"  Model accuracy: {detector_model.score(X_test_hard, y_test_hard):.4f}")

## ⭐⭐⭐ HARD: Exercise 12 - Extract Feature Importance (Top Words)

In [None]:
# TODO: Extract feature importance from the trained model
# TODO: Get the top 10 words associated with FAKE news (positive coefficients)
# TODO: Get the top 10 words associated with REAL news (negative coefficients)
# TODO: Use model.named_steps to access the components

# Get the Logistic Regression model and TfidfVectorizer
lr_model = detector_model.named_steps['classifier']
tfidf = detector_model.named_steps['tfidf']

# Get feature names
feature_names = ___  # Use tfidf.get_feature_names_out()

# Get coefficients (importance scores)
coefficients = ___  # Use lr_model.coef_[0]

# Get top 10 FAKE news words (highest positive coefficients)
fake_indices = np.argsort(coefficients)[-10:][::-1]
fake_words = feature_names[fake_indices]
fake_scores = coefficients[fake_indices]

# Get top 10 REAL news words (lowest coefficients)
real_indices = np.argsort(coefficients)[:10]
real_words = feature_names[real_indices]
real_scores = coefficients[real_indices]

print("Feature Importance Analysis")
print("="*60)
print(f"\nTop 10 words strongly associated with FAKE news:")
for word, score in zip(fake_words, fake_scores):
    print(f"  {word:20s} (score: {score:7.3f})")

print(f"\nTop 10 words strongly associated with REAL news:")
for word, score in zip(real_words, real_scores):
    print(f"  {word:20s} (score: {score:7.3f})")

## ⭐⭐⭐ HARD: Exercise 13 - Confidence Scores with predict_proba

In [None]:
# TODO: Use predict_proba to get confidence scores
# TODO: Identify predictions with LOW confidence (close to 0.5)
# TODO: Identify predictions with HIGH confidence (close to 0 or 1)
# TODO: Show examples of uncertain vs confident predictions

y_pred_hard = detector_model.predict(X_test_hard)
y_pred_proba_hard = ___  # Use predict_proba()

# Extract confidence for each prediction
confidence = np.max(y_pred_proba_hard, axis=1)
# Margin between the two class probabilities: 0 = toss-up, 1 = fully certain
margin = np.abs(y_pred_proba_hard[:, 0] - y_pred_proba_hard[:, 1])

# A margin below 0.2 means confidence below 0.6; a margin of 0.8 or more means confidence of 0.9 or more
uncertain_indices = np.where(margin < 0.2)[0]
confident_indices = np.where(margin >= 0.8)[0]

print("Confidence Analysis")
print("="*70)
print(f"\nTotal predictions: {len(X_test_hard)}")
print(f"Uncertain predictions (confidence 0.5-0.6): {len(uncertain_indices)}")
print(f"Confident predictions (confidence >0.9): {len(confident_indices)}")

print(f"\nUp to 3 UNCERTAIN predictions:")
for idx in uncertain_indices[:3]:
    text = X_test_hard[idx]
    pred = "FAKE" if y_pred_hard[idx] == 1 else "REAL"
    conf_real, conf_fake = y_pred_proba_hard[idx]
    print(f"\n  Text: {text[:50]}...")
    print(f"  Prediction: {pred} | Real: {conf_real:.3f} | Fake: {conf_fake:.3f}")

print(f"\nUp to 3 CONFIDENT predictions:")
for idx in confident_indices[:3]:
    text = X_test_hard[idx]
    pred = "FAKE" if y_pred_hard[idx] == 1 else "REAL"
    conf_real, conf_fake = y_pred_proba_hard[idx]
    print(f"\n  Text: {text[:50]}...")
    print(f"  Prediction: {pred} | Real: {conf_real:.3f} | Fake: {conf_fake:.3f}")
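
The thresholds in this exercise are applied to the absolute difference between the two class probabilities, while the printed labels talk about confidence. Because the two probabilities sum to 1, the difference equals 2 × confidence − 1. A short sketch on made-up probabilities (assumed values, not model output) shows how the 0.2 and 0.8 cut-offs map to confidences of 0.6 and 0.9.

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(real), P(fake).
proba_demo = np.array([
    [0.55, 0.45],
    [0.95, 0.05],
    [0.30, 0.70],
])

confidence_demo = proba_demo.max(axis=1)
margin_demo = np.abs(proba_demo[:, 0] - proba_demo[:, 1])

# Same cut-offs as the exercise: below 0.2 is uncertain, 0.8 or more is confident.
uncertain_demo = np.where(margin_demo < 0.2)[0]
confident_demo = np.where(margin_demo >= 0.8)[0]

print(confidence_demo, margin_demo)
print(uncertain_demo, confident_demo)
```

The third row (confidence 0.70) lands in neither group, which is expected: the two thresholds leave a middle band unlabelled.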

## ⭐⭐⭐ HARD: Exercise 14 - Parameter Tuning and Comparison

In [None]:
# TODO: Train multiple models with different max_df values
# TODO: Compare their accuracy scores
# TODO: See how max_df affects model performance

max_df_values = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
results = []

for max_df in max_df_values:
    # TODO: Create pipeline with this max_df value
    pipeline_temp = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english', max_df=___)),
        ('classifier', LogisticRegression(max_iter=1000, random_state=42))
    ])
    
    # TODO: Train on X_train_hard, y_train_hard
    # build_fake_news_detector() returns only the test portion, so recreate the
    # Exercise 11 split here with the same test_size and random_state
    X_train_hard, _, y_train_hard, _ = ___
    
    ___  # Fit the pipeline
    
    # TODO: Get test accuracy
    acc = ___
    results.append({'max_df': max_df, 'accuracy': acc})

print("Parameter Tuning Results (max_df effect):")
print("="*50)
print(f"{'max_df':<10} {'Accuracy':<10} {'Status'}")
print("-"*50)

best_accuracy = 0
best_max_df = 0

for result in results:
    max_df = result['max_df']
    acc = result['accuracy']
    status = "Best" if acc == max(r['accuracy'] for r in results) else ""
    print(f"{max_df:<10.1f} {acc:<10.4f} {status}")
    
    if acc > best_accuracy:
        best_accuracy = acc
        best_max_df = max_df

print("-"*50)
print(f"Best max_df: {best_max_df} with accuracy {best_accuracy:.4f}")

## ⭐⭐⭐ HARD: Exercise 15 - Full Evaluation Report

In [None]:
# TODO: Build a complete evaluation function that:
#   1. Gets predictions and probabilities
#   2. Calculates all metrics (accuracy, precision, recall, f1, roc_auc)
#   3. Generates confusion matrix
#   4. Generates classification report
#   5. Returns comprehensive dictionary

def evaluate_fake_news_detector(model, X_test, y_test):
    """
    Generate complete evaluation report for fake news detector.
    
    Args:
        model: Trained pipeline
        X_test: Test texts
        y_test: Test labels
    
    Returns:
        Dictionary with all evaluation metrics
    """
    # TODO: Get predictions
    y_pred = ___
    
    # TODO: Get probability predictions
    y_proba = ___
    
    # TODO: Calculate metrics
    metrics = {
        'accuracy': ___,
        'precision': ___,
        'recall': ___,
        'f1': ___,
        'roc_auc': ___
    }
    
    # TODO: Get confusion matrix
    cm = ___
    
    # TODO: Get classification report
    report = ___
    
    return {
        'metrics': metrics,
        'confusion_matrix': cm,
        'classification_report': report
    }

# Test your evaluation function
final_report = ___

print("\n" + "="*70)
print("COMPLETE FAKE NEWS DETECTOR - FINAL EVALUATION REPORT")
print("="*70)

print(f"\nKey Metrics:")
for metric, value in final_report['metrics'].items():
    print(f"  {metric.upper():12s}: {value:.4f}")

print(f"\nConfusion Matrix:")
cm = final_report['confusion_matrix']
print(f"  TN (Correct Real):    {cm[0][0]:3d}")
print(f"  FP (Real→Fake):       {cm[0][1]:3d}")
print(f"  FN (Fake→Real):       {cm[1][0]:3d}")
print(f"  TP (Correct Fake):    {cm[1][1]:3d}")

print(f"\nClassification Report:")
print(final_report['classification_report'])

print("="*70)
print("✓ Evaluation Complete!")

---
# Summary

## Congratulations! You've Completed All 15 Exercises

### What You Learned:

**EASY (Exercises 1-5):** Foundational skills
- Loading and exploring datasets with pandas
- Understanding data shape and class distribution
- Splitting data into training and testing sets
- Training basic Logistic Regression models
- Making predictions on new text

**MEDIUM (Exercises 6-10):** Intermediate NLP pipelines
- Building sophisticated pipelines with TfidfVectorizer
- Using max_df parameter to remove common words
- Generating and interpreting confusion matrices
- Detecting overfitting by comparing train/test accuracy
- Ensuring reproducibility with random_state
- Understanding confidence scores with predict_proba()

**HARD (Exercises 11-15):** Advanced techniques
- Encapsulating complex pipelines in reusable functions
- Extracting feature importance to understand what the model learned
- Analyzing prediction confidence and uncertainty
- Hyperparameter tuning and model comparison
- Building comprehensive evaluation reports

### Key Concepts Mastered:
- **pd.read_csv()** - Loading CSV data
- **df.shape, df['label'].value_counts()** - Data exploration
- **train_test_split()** - Creating train/test splits
- **TfidfVectorizer(max_df=0.7)** - Text feature extraction
- **Pipeline** - Building reusable ML workflows
- **LogisticRegression** - Binary classification
- **model.score()** - Quick accuracy assessment
- **predict_proba()** - Confidence scores
- **confusion_matrix, classification_report** - Comprehensive metrics
- **Feature importance (coef_)** - Model interpretability
- **random_state** - Reproducibility

### Real-World Applications:
- Detecting misinformation on social media
- Filtering spam and scam messages
- Identifying unreliable news sources
- Prioritizing content for human fact-checkers
- Building content moderation systems

### What's Next:
Tomorrow is **Day 8: Final Review** - we'll recap all concepts and explore modern NLP with Transformers and LLMs!

---
*Created for Natruja's NLP study plan*