# Day 6: Building a Text Classifier
**The AI Engineer Course 2026 - Section 26**

**Student:** Natruja

**Date:** Tuesday, February 17, 2026

---

## Learning Objectives
1. Build a complete text classification pipeline
2. Learn Naive Bayes for text classification
3. Use sklearn Pipeline for streamlined workflows
4. Evaluate classifiers with metrics
5. Handle multi-class classification problems

## Setup: Install and Import Required Libraries

In [None]:
import subprocess
import sys

# Install scikit-learn and NLTK
subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn", "nltk", "-q"])

# Download NLTK data
import nltk
nltk.download('stopwords', quiet=True)

print("✓ Libraries installed successfully!")

In [None]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
import numpy as np

print("✓ All imports successful!")

## Text Classification Workflow

Building a text classifier involves these steps:

### 1. Data Preparation
- Collect labeled examples
- Split into train/test sets
- Prepare features

### 2. Feature Extraction
- Convert text to vectors (TF-IDF, BoW)
- Select important features

### 3. Model Training
- Choose algorithm (Naive Bayes, SVM, etc.)
- Train on training data

### 4. Evaluation
- Test on test set
- Calculate metrics (accuracy, precision, recall)

### 5. Optimization
- Fine-tune parameters
- Improve performance

### 6. Deployment
- Use on new, unseen data

## Naive Bayes for Text Classification

**Multinomial Naive Bayes** is one of the most widely used baseline algorithms for text classification.

### Why It Works Well:
- Fast and efficient
- Works well with high-dimensional sparse data (text)
- Requires relatively little training data
- Probabilistic: outputs class probabilities (often poorly calibrated, so treat them as rough confidence scores)

### How It Works:
- Uses Bayes' theorem: P(Class|Text) = P(Text|Class) × P(Class) / P(Text)
- Assumes words are conditionally independent given the class (the "naive" assumption)
- Calculates probability of each class
- Picks class with highest probability
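
The steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it internally: the toy training set is made up, and Laplace smoothing with `alpha=1.0` (the same default value `MultinomialNB` uses) is added so that a word never seen with a class does not force the probability to zero. Log-probabilities are summed instead of multiplying probabilities to avoid numeric underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label)
train = [
    (["great", "movie"], "pos"),
    (["loved", "great", "film"], "pos"),
    (["terrible", "movie"], "neg"),
    (["boring", "terrible", "plot"], "neg"),
]

vocab = {w for tokens, _ in train for w in tokens}
classes = sorted({label for _, label in train})

# Count words per class and documents per class
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def log_posterior(tokens, c, alpha=1.0):
    """Unnormalized log P(c | tokens) = log P(c) + sum of log P(word | c)."""
    total = sum(word_counts[c].values())
    score = math.log(doc_counts[c] / len(train))  # log prior, P(Class)
    for w in tokens:
        if w not in vocab:
            continue  # skip words never seen during training
        # Smoothed likelihood, P(word | Class)
        score += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # P(Text) is identical for every class, so it can be dropped when comparing
    return max(classes, key=lambda c: log_posterior(tokens, c))

print(predict(["great", "film"]))     # pos
print(predict(["terrible", "plot"]))  # neg
```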

### Applications:
- Spam detection
- Sentiment analysis
- Topic classification
- Language detection

## EXAMPLE: Simple Text Classifier

In [None]:
# Sample training data (texts and labels)
texts = [
    "This movie is great and entertaining",
    "Loved this film, highly recommend",
    "Amazing performances and cinematography",
    "Best movie I have seen all year",
    "Terrible movie, waste of time",
    "Horrible acting and boring plot",
    "Did not enjoy this film at all",
    "Worst movie ever made"
]

labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1=positive, 0=negative

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"\nTraining examples:")
for text, label in zip(X_train, y_train):
    sentiment = "Positive" if label == 1 else "Negative"
    print(f"  [{sentiment}] {text}")
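
With only eight examples, a random split can leave the test set with a lopsided class mix. One option, sketched here on made-up placeholder data with separate variable names so the notebook's own `X_train`/`X_test` are untouched, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 4 positive, 4 negative
demo_texts = [f"review {i}" for i in range(8)]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_texts, demo_labels,
    test_size=0.25, random_state=42,
    stratify=demo_labels,  # preserve the 50/50 class ratio in train and test
)

print(Counter(y_tr))  # 3 positive, 3 negative
print(Counter(y_te))  # 1 positive, 1 negative
```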

## EXAMPLE: Training a Naive Bayes Classifier

In [None]:
# Create vectorizer
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Transform training data
X_train_vec = vectorizer.fit_transform(X_train)
print(f"Training data shape: {X_train_vec.shape}")
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

# Create and train classifier
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train)

print("\n✓ Classifier trained successfully!")
print(f"Classes: {classifier.classes_}")

## EXAMPLE: Making Predictions

In [None]:
# Transform test data
X_test_vec = vectorizer.transform(X_test)

# Make predictions
y_pred = classifier.predict(X_test_vec)
y_pred_proba = classifier.predict_proba(X_test_vec)

print("Predictions on Test Set:")
print("="*80)
print(f"{'Text':<40} | {'Actual':<8} | {'Predicted':<10} | {'Confidence'}")
print("-"*80)

for text, actual, pred, proba in zip(X_test, y_test, y_pred, y_pred_proba):
    actual_label = "Positive" if actual == 1 else "Negative"
    pred_label = "Positive" if pred == 1 else "Negative"
    confidence = max(proba)
    
    display_text = text[:37] + "..." if len(text) > 40 else text
    print(f"{display_text:<40} | {actual_label:<8} | {pred_label:<10} | {confidence:.2f}")

## EXAMPLE: Using Pipeline for Streamlined Workflow

In [None]:
# Create a pipeline combining vectorizer and classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train on original texts (no need to vectorize separately!)
pipeline.fit(X_train, y_train)

# Make predictions directly on texts
y_pred_pipeline = pipeline.predict(X_test)

# Test on new texts
new_texts = [
    "This is a fantastic movie!",
    "Absolutely terrible, don't watch it"
]

print("Pipeline Predictions on New Texts:")
print("="*50)
for text in new_texts:
    pred = pipeline.predict([text])[0]
    proba = pipeline.predict_proba([text])[0]
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"Text: {text}")
    print(f"Prediction: {sentiment} (Confidence: {max(proba):.2f})\n")
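
To see that a `Pipeline` is only a convenience wrapper, the sketch below fits the same two steps manually and through a pipeline, then checks that the probabilities agree. The four training texts and the `demo_` names are invented for this illustration. The fitted steps stay accessible through `named_steps`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data
demo_texts = ["great film", "loved it", "terrible film", "boring plot"]
demo_labels = [1, 1, 0, 0]

# Manual route: vectorize first, then classify
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(demo_texts), demo_labels)
manual_proba = clf.predict_proba(vec.transform(["great plot"]))

# Pipeline route: both steps behind a single fit/predict interface
demo_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
demo_pipe.fit(demo_texts, demo_labels)
pipe_proba = demo_pipe.predict_proba(["great plot"])

print(np.allclose(manual_proba, pipe_proba))  # True

# Inspect the fitted vectorizer inside the pipeline
print(sorted(demo_pipe.named_steps["tfidf"].vocabulary_))
```

The practical benefit shows up with cross-validation: passing the whole pipeline to `cross_val_score` refits the vectorizer inside each fold, so test-fold vocabulary never leaks into training.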

## Model Evaluation Metrics

### Accuracy
- Percentage of correct predictions
- Formula: (TP + TN) / (TP + TN + FP + FN)
- Reliable mainly when classes are balanced; can be misleading on imbalanced datasets

### Precision
- Of predicted positives, how many are actually positive?
- Formula: TP / (TP + FP)
- Important when false positives are costly

### Recall
- Of actual positives, how many did we find?
- Formula: TP / (TP + FN)
- Important when false negatives are costly

### F1-Score
- Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- More informative than accuracy on imbalanced datasets
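
The four formulas can be checked by hand with a small worked example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts (100 predictions in total)
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 85 / 100
precision = tp / (tp + fp)                          # 40 / 45
recall = tp / (tp + fn)                             # 40 / 50
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1-Score:  {f1:.3f}")         # 0.842
```

Note that F1 (0.842) sits between precision and recall but closer to the lower of the two, which is the point of using a harmonic mean.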

## EXAMPLE: Model Evaluation

In [None]:
# Get predictions
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Model Evaluation Results:")
print("="*60)
print(f"\nAccuracy: {accuracy:.2f}")

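print(f"\nClassification Report:")
# labels=[0, 1] keeps both classes in the output even if the tiny test set
# or the predictions happen to contain only one of them
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['Negative', 'Positive'], zero_division=0))

print(f"\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)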
print(f"\nInterpretation:")
print(f"  True Negatives: {cm[0][0]}")
print(f"  False Positives: {cm[0][1]}")
print(f"  False Negatives: {cm[1][0]}")
print(f"  True Positives: {cm[1][1]}")

---
# EXERCISES: Text Classification

Complete all 15 exercises below to master text classification!

## ⭐ EASY: Exercise 1 - Split Data into Train/Test Sets

**Goal:** Learn how to split data using `train_test_split` with proper parameters.

**Concepts:** train_test_split, test_size, random_state

In [None]:
# Sample sentiment data
sample_texts = [
    "I love this product",
    "This is amazing",
    "I hate this",
    "Terrible quality",
    "Excellent service",
    "Very disappointed"
]
sample_labels = [1, 1, 0, 0, 1, 0]  # 1=positive, 0=negative

# TODO: Use train_test_split to split the data
# - Use test_size=0.3 (30% test, 70% train)
# - Use random_state=42 (for reproducibility)
X_train, X_test, y_train, y_test = ___

# TODO: Print the sizes of train and test sets
print(f"Training set size: ___")
print(f"Test set size: ___")

# TODO: Print one example from training set
print(f"\nExample from training set: {___}")

## ⭐ EASY: Exercise 2 - Train a MultinomialNB Model

**Goal:** Train a basic Naive Bayes classifier on vectorized text data.

**Concepts:** TfidfVectorizer, fit_transform, MultinomialNB, fit()

In [None]:
texts = [
    "Python is awesome", "I love Python", "Python rocks",
    "Java is boring", "I hate Java", "Java is slow"
]
labels = [1, 1, 1, 0, 0, 0]

# Split data
X_train, X_test, y_train, y_test = ___

# TODO: Create a TfidfVectorizer with stop_words='english'
vectorizer = ___

# TODO: Fit and transform the training data
X_train_vec = ___

# TODO: Create a MultinomialNB classifier
classifier = ___

# TODO: Train the classifier on vectorized training data
___

print("✓ Model trained successfully!")
print(f"Classifier classes: {classifier.classes_}")

## ⭐ EASY: Exercise 3 - Make Predictions on Test Data

**Goal:** Use the trained model to predict labels for test data.

**Concepts:** transform(), predict(), predict_proba()

In [None]:
# TODO: Transform test data using the trained vectorizer
X_test_vec = ___

# TODO: Make predictions on test data
y_pred = ___

# TODO: Get probability scores for predictions
y_pred_proba = ___

# Print results
print("Predictions:")
for text, true_label, pred_label, proba in zip(X_test, y_test, y_pred, y_pred_proba):
    true_sentiment = "Positive" if true_label == 1 else "Negative"
    pred_sentiment = "Positive" if pred_label == 1 else "Negative"
    confidence = max(proba)
    print(f"Text: {text}")
    print(f"  True: {true_sentiment}, Predicted: {pred_sentiment}, Confidence: {confidence:.2f}\n")

## ⭐ EASY: Exercise 4 - Calculate Accuracy Score

**Goal:** Evaluate model performance using accuracy metric.

**Concepts:** accuracy_score, score()

In [None]:
# Use the predictions from Exercise 3
# (y_pred and y_test should still be in memory)

# TODO: Calculate accuracy using accuracy_score
acc = ___

# TODO: Also calculate accuracy using the model's score() method
acc_score = ___

print(f"Accuracy (using accuracy_score): {acc:.4f}")
print(f"Accuracy (using model.score()): {acc_score:.4f}")
print(f"\nAccuracy percentage: {acc * 100:.1f}%")

## ⭐ EASY: Exercise 5 - Print Confusion Matrix

**Goal:** Understand model errors using the confusion matrix.

**Concepts:** confusion_matrix, True Positives, False Positives, True Negatives, False Negatives

In [None]:
# TODO: Calculate confusion matrix
cm = ___

# Print the matrix
print("Confusion Matrix:")
print(cm)

# TODO: Extract values from confusion matrix
# cm[i][j] where i is actual class, j is predicted class
tn = ___  # True Negatives: predicted 0, actually 0
fp = ___  # False Positives: predicted 1, actually 0
fn = ___  # False Negatives: predicted 0, actually 1
tp = ___  # True Positives: predicted 1, actually 1

print(f"\nBreakdown:")
print(f"  True Negatives: {tn}")
print(f"  False Positives: {fp}")
print(f"  False Negatives: {fn}")
print(f"  True Positives: {tp}")

## ⭐⭐ MEDIUM: Exercise 6 - Build a Pipeline with TfidfVectorizer + MultinomialNB

**Goal:** Create a complete pipeline combining vectorization and classification.

**Concepts:** Pipeline, combining multiple steps

In [None]:
texts = [
    "Great movie", "Love it", "Amazing",
    "Bad film", "Hate it", "Terrible"
]
labels = [1, 1, 1, 0, 0, 0]

# Split data
X_train, X_test, y_train, y_test = ___

# TODO: Create a Pipeline with two steps:
# 1. ('tfidf', TfidfVectorizer(stop_words='english'))
# 2. ('classifier', MultinomialNB())
pipe = Pipeline([
    ___,
    ___
])

# TODO: Train the pipeline on X_train and y_train
___

# TODO: Get accuracy on test set using pipe.score()
accuracy = ___

print(f"Pipeline Accuracy: {accuracy:.2f}")

## ⭐⭐ MEDIUM: Exercise 7 - Use Pipeline to Predict New Text

**Goal:** Use the trained pipeline to classify new, unseen text examples.

**Concepts:** Pipeline prediction, predict_proba()

In [None]:
# Use the pipeline from Exercise 6
# (pipe should still be trained in memory)

new_texts = [
    "This is fantastic!",
    "I absolutely hate it",
    "Not bad, pretty good"
]

# TODO: Make predictions using the pipeline
predictions = ___

# TODO: Get probability scores
probabilities = ___

# Print results
print("Predictions on New Texts:")
print("="*60)
for text, pred, proba in zip(new_texts, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = max(proba)
    print(f"Text: {text}")
    print(f"  Prediction: {sentiment} | Confidence: {confidence:.3f}\n")

## ⭐⭐ MEDIUM: Exercise 8 - Print Classification Report and Explain Metrics

**Goal:** Understand precision, recall, and F1-score from the classification report.

**Concepts:** classification_report, precision, recall, F1-score

In [None]:
# Get predictions from the pipeline (Exercise 6)
y_pred = ___

# TODO: Print classification report with target_names=['Negative', 'Positive']
print("Classification Report:")
print(___)

# TODO: Calculate and print these metrics manually:
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)
# F1 = 2 * (Precision * Recall) / (Precision + Recall)

cm = ___
tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]

precision = ___
recall = ___
f1 = ___

print(f"\nManual Calculations:")
print(f"Precision: {precision:.3f} (of predicted positives, how many are actually positive?)")
print(f"Recall: {recall:.3f} (of actual positives, how many did we find?)")
print(f"F1-Score: {f1:.3f} (harmonic mean of precision and recall)")

## ⭐⭐ MEDIUM: Exercise 9 - Compare Train Accuracy vs Test Accuracy

**Goal:** Detect overfitting by comparing training and test accuracy.

**Concepts:** overfitting, generalization

In [None]:
texts = [
    "Love Python", "Python is great", "I enjoy Python", "Python rocks",
    "Hate Java", "Java is bad", "I dislike Java", "Java is slow",
    "Love C++", "C++ is fast", "I like C++", "C++ is good",
    "Hate JavaScript", "JavaScript is confusing", "JS is bad", "JS is frustrating"
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

# Split data
X_train, X_test, y_train, y_test = ___

# Create and train pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
pipe.fit(X_train, y_train)

# TODO: Calculate accuracy on training set
train_accuracy = ___

# TODO: Calculate accuracy on test set
test_accuracy = ___

print(f"Training Accuracy: {train_accuracy:.3f}")
print(f"Test Accuracy: {test_accuracy:.3f}")
print(f"Difference: {(train_accuracy - test_accuracy):.3f}")

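# A small gap is normal; a large gap suggests overfitting
# (0.1 is an arbitrary rule-of-thumb threshold)
if train_accuracy - test_accuracy > 0.1:
    print("\n⚠️  Model shows signs of overfitting (train accuracy well above test accuracy)")
else:
    print("\n✓ Model generalizes well (similar train and test accuracy)")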

## ⭐⭐ MEDIUM: Exercise 10 - Try LogisticRegression Instead of NB

**Goal:** Compare different algorithms by replacing MultinomialNB with LogisticRegression.

**Concepts:** LogisticRegression, algorithm comparison

In [None]:
# Use the same data from Exercise 9
# X_train, X_test, y_train, y_test should be in memory

# Create pipeline with LogisticRegression
# TODO: Replace MultinomialNB() with LogisticRegression(max_iter=200)
pipe_lr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', ___)
])

# Train the new pipeline
pipe_lr.fit(X_train, y_train)

# TODO: Calculate accuracies for LogisticRegression
lr_train_acc = ___
lr_test_acc = ___

# TODO: Calculate accuracies for MultinomialNB (from previous exercise)
nb_train_acc = ___
nb_test_acc = ___

# Compare results
print("Algorithm Comparison:")
print("="*50)
print(f"{'Algorithm':<20} {'Train':<12} {'Test':<12}")
print("-"*50)
print(f"{'MultinomialNB':<20} {nb_train_acc:<12.3f} {nb_test_acc:<12.3f}")
print(f"{'LogisticRegression':<20} {lr_train_acc:<12.3f} {lr_test_acc:<12.3f}")
print("="*50)

if lr_test_acc > nb_test_acc:
    print(f"\n✓ LogisticRegression performs better (+{(lr_test_acc - nb_test_acc):.3f})")
elif nb_test_acc > lr_test_acc:
    print(f"\n✓ MultinomialNB performs better (+{(nb_test_acc - lr_test_acc):.3f})")
else:
    print("\n= Both algorithms reach the same test accuracy")

## ⭐⭐⭐ HARD: Exercise 11 - Build Complete Spam Classifier Pipeline from Scratch

**Goal:** Build an end-to-end spam classifier with data preprocessing, training, and evaluation.

**Concepts:** Complete pipeline workflow, data splitting, training, evaluation

In [None]:
# Spam classifier data
emails = [
    # Spam (1)
    "Click here to win free money now!",
    "You have won the lottery! Claim your prize",
    "Limited time offer: 99% off everything",
    "Congratulations! You are selected for a prize",
    "Act now! Special deal expires today",
    "Get rich quick with our secret method",
    # Ham (0)
    "Meeting scheduled for tomorrow at 3 PM",
    "Can you review the attached document?",
    "Project update: Phase 1 is complete",
    "Please confirm your attendance for the event",
    "Here are the quarterly results",
    "Thanks for your email, I will respond soon"
]

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1=spam, 0=ham

# TODO: Step 1 - Split data (80-20 split, random_state=42)
X_train, X_test, y_train, y_test = ___

# TODO: Step 2 - Create pipeline with TfidfVectorizer and MultinomialNB
spam_pipe = Pipeline([
    ___,
    ___
])

# TODO: Step 3 - Train the pipeline
___

# TODO: Step 4 - Make predictions
y_pred = ___

# TODO: Step 5 - Calculate accuracy
accuracy = ___

# TODO: Step 6 - Print confusion matrix and classification report
print(f"Spam Classifier Accuracy: {accuracy:.2f}")
print(f"\nConfusion Matrix:")
print(___)
print(f"\nClassification Report:")
print(___)

# Test on new emails
new_emails = [
    "You have been selected to receive $1000",
    "Can you help me with the project?"
]

# TODO: Predict on new emails
new_preds = ___

print(f"\nNew Email Classification:")
for email, pred in zip(new_emails, new_preds):
    label = "Spam" if pred == 1 else "Ham"
    print(f"  [{label}] {email}")

## ⭐⭐⭐ HARD: Exercise 12 - Use cross_val_score for Model Evaluation

**Goal:** Use cross-validation to get more reliable performance estimates.

**Concepts:** cross_val_score, k-fold cross-validation, model reliability

In [None]:
texts = [
    "Excellent product", "Very satisfied", "Love it", "Highly recommend", "Amazing quality",
    "Terrible", "Disappointed", "Waste of money", "Very bad", "Horrible"
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Create pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])

# TODO: Use cross_val_score with cv=5 folds
# This will train 5 different models and return 5 accuracy scores
cv_scores = ___

print(f"Cross-Validation Results (5-fold):")
print(f"Fold scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f}")
print(f"Std deviation: {cv_scores.std():.3f}")

# Interpretation:
print(f"\nInterpretation:")
print(f"  - Mean shows average performance: {cv_scores.mean():.1%}")
print(f"  - Std shows consistency (lower is better): {cv_scores.std():.3f}")
if cv_scores.std() < 0.1:
    print(f"  ✓ Model is consistent across all folds")
else:
    print(f"  ⚠️  Model performance varies across folds")

## ⭐⭐⭐ HARD: Exercise 13 - Compare MultinomialNB vs LogisticRegression Performance

**Goal:** Use cross-validation to fairly compare two different algorithms.

**Concepts:** algorithm comparison, cross-validation

In [None]:
# Use the same data from Exercise 12
# texts and labels should be in memory

# Create pipeline with MultinomialNB
nb_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])

# Create pipeline with LogisticRegression
lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=200))
])

# TODO: Get cross-validation scores for both pipelines (cv=5)
nb_scores = ___
lr_scores = ___

# Print comparison
print("Algorithm Comparison (5-fold Cross-Validation):")
print("="*60)
print(f"{'Algorithm':<25} {'Mean CV':<15} {'Std':<15}")
print("-"*60)
print(f"{'MultinomialNB':<25} {nb_scores.mean():<15.3f} {nb_scores.std():<15.3f}")
print(f"{'LogisticRegression':<25} {lr_scores.mean():<15.3f} {lr_scores.std():<15.3f}")
print("="*60)

# Determine winner (a tie is reported explicitly)
if nb_scores.mean() > lr_scores.mean():
    winner = "MultinomialNB"
elif lr_scores.mean() > nb_scores.mean():
    winner = "LogisticRegression"
else:
    winner = "Tie"
diff = abs(nb_scores.mean() - lr_scores.mean())

print(f"\nWinner: {winner} (+{diff:.3f})")

## ⭐⭐⭐ HARD: Exercise 14 - Tune TfidfVectorizer Parameters

**Goal:** Optimize model performance by tuning vectorizer parameters.

**Concepts:** max_features, ngram_range, hyperparameter tuning

In [None]:
texts = [
    "This is an amazing product", "Love the quality and service", "Excellent choice",
    "Terrible experience", "Very disappointed", "Do not recommend",
    "Great value for money", "Outstanding performance", "Best purchase ever",
    "Worst decision ever", "Complete waste of time", "Very poor quality"
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# Split data
X_train, X_test, y_train, y_test = ___

# Test different configurations
configs = [
    {'max_features': 50, 'ngram_range': (1, 1)},
    {'max_features': 100, 'ngram_range': (1, 1)},
    {'max_features': 50, 'ngram_range': (1, 2)},
    {'max_features': 100, 'ngram_range': (1, 2)}
]

results = []

for config in configs:
    # TODO: Create pipeline with the given config parameters
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(
            stop_words='english',
            max_features=___,
            ngram_range=___
        )),
        ('classifier', MultinomialNB())
    ])
    
    # Train and evaluate
    pipe.fit(X_train, y_train)
    # TODO: Calculate test accuracy
    accuracy = ___
    
    results.append({
        'max_features': config['max_features'],
        'ngram_range': config['ngram_range'],
        'accuracy': accuracy
    })

# Print results
print("Hyperparameter Tuning Results:")
print("="*60)
for res in results:
    print(f"max_features={res['max_features']:3d}, ngram={res['ngram_range']} → Accuracy: {res['accuracy']:.3f}")

# Find best configuration
best = ___
print(f"\nBest configuration: max_features={best['max_features']}, ngram={best['ngram_range']}")
print(f"Best accuracy: {best['accuracy']:.3f}")

## ⭐⭐⭐ HARD: Exercise 15 - Build Prediction Function with Confidence

**Goal:** Create a reusable function that predicts labels and returns confidence scores.

**Concepts:** Functions, predict_proba(), confidence scores

In [None]:
# Create training data
texts = [
    "I love this", "Amazing product", "Excellent service", "Very satisfied",
    "I hate this", "Terrible product", "Bad experience", "Very unhappy"
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Create and train pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
pipe.fit(texts, labels)

# TODO: Create a function called 'predict_with_confidence' that:
# 1. Takes a text string as input
# 2. Returns a dictionary with:
#    - 'label': predicted class (0 or 1)
#    - 'sentiment': 'Positive' or 'Negative' (mapped from label)
#    - 'confidence': probability of the predicted class (0.0 to 1.0)
# 3. Handles confidence values properly

def predict_with_confidence(text):
    """
    Predict sentiment with confidence score.
    
    Args:
        text (str): Input text to classify
        
    Returns:
        dict: {'label': int, 'sentiment': str, 'confidence': float}
    """
    # TODO: Get prediction
    pred = ___
    
    # TODO: Get probability scores
    proba = ___
    
    # TODO: Extract confidence (maximum probability)
    confidence = ___
    
    # TODO: Map label to sentiment
    sentiment = ___  # 'Positive' if pred == 1 else 'Negative'
    
    return {
        'label': pred,
        'sentiment': sentiment,
        'confidence': confidence
    }

# Test the function
test_texts = [
    "This product is amazing!",
    "I really hate this",
    "It's okay, not great"
]

print("Predictions with Confidence:")
print("="*70)
for text in test_texts:
    result = predict_with_confidence(text)
    print(f"Text: {text}")
    print(f"  Prediction: {result['sentiment']}")
    print(f"  Confidence: {result['confidence']:.3f}")
    print()

---

## Summary

### Key Takeaways:
- **Text Classification** assigns documents to categories using machine learning
- **Naive Bayes** is fast and effective for text classification tasks
- **Pipelines** combine feature extraction and classification into a single workflow
- **Evaluation metrics** (accuracy, precision, recall, F1) measure classifier performance
- **Train/test split** does not prevent overfitting, but it reveals it by measuring performance on data the model has not seen
- **Cross-validation** gives more reliable performance estimates than single splits
- **Hyperparameter tuning** (e.g. `max_features`, `ngram_range`) can improve model performance, though the gain depends on the data
- **Confidence scores** help assess model certainty in predictions

### Common Use Cases:
- Spam detection
- Sentiment analysis
- Topic categorization
- Intent classification
- Document routing
- Fake news detection

### What's Next:
Tomorrow we'll apply these skills to **Fake News Detection** using real-world data!

---

*Created for Natruja's NLP study plan*