# Text Classification System - Exploration Notebook

This notebook walks through the text classification system interactively, from loading a dataset to visualizing the results of the trained models.

## Table of Contents
1. [Data Loading and Exploration](#data-loading)
2. [Text Preprocessing](#preprocessing)
3. [Feature Extraction](#features)
4. [Model Training](#training)
5. [Model Evaluation](#evaluation)
6. [Results Visualization](#visualization)

In [None]:
# Import required libraries
import sys

# Make the project root importable (assumes the notebook lives one directory below it)
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Import our custom modules
from src.preprocessing import TextPreprocessor, FeatureExtractor
from src.models import TextClassifier, ModelTuner
from src.evaluation import ModelEvaluator
from src.visualization import TextVisualization
from data.dataset_loader import load_dataset, get_available_datasets

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("Libraries imported successfully!")

## 1. Data Loading and Exploration <a id="data-loading"></a>

In [None]:
# Show available datasets
datasets = get_available_datasets()
print("Available datasets:")
for name, info in datasets.items():
    print(f"- {name}: {info['description']}")

In [None]:
# Load dataset (change this to try different datasets)
dataset_name = '20newsgroups'  # Options: 'sms_spam', '20newsgroups', 'movie_reviews'

texts, labels, label_names = load_dataset(dataset_name)

print(f"Dataset: {dataset_name}")
print(f"Total documents: {len(texts)}")
print(f"Number of classes: {len(label_names)}")
print(f"Classes: {list(label_names.values())}")

In [None]:
# Initialize visualization component
visualizer = TextVisualization()

# Visualize class distribution
class_counts = visualizer.plot_class_distribution(
    [label_names[label] for label in labels],
    title=f"{dataset_name.title()} Dataset - Class Distribution"
)
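
`plot_class_distribution` is a project-specific helper, so its internals are not shown in this notebook. As a rough sketch of the counting step it presumably performs, class frequencies can be computed with `collections.Counter`. The names `toy_labels`, `toy_label_names`, and `toy_class_counts` and their values are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data standing in for `labels` and `label_names`
toy_labels = [0, 1, 1, 2, 1, 0]
toy_label_names = {0: "sci.space", 1: "rec.autos", 2: "talk.politics"}

# Count documents per class name
toy_class_counts = Counter(toy_label_names[label] for label in toy_labels)

# Report counts and relative frequencies, most frequent class first
total = sum(toy_class_counts.values())
for name, count in toy_class_counts.most_common():
    print(f"{name}: {count} ({count / total:.1%})")
```

Checking these counts before training is a quick way to spot class imbalance, which matters later when choosing between accuracy and F1 as the comparison metric.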

In [None]:
# Show sample documents
print("Sample documents from each class:")
print("=" * 50)

for class_idx, class_name in label_names.items():
    # Find the first document of this class (list() also handles array-like labels)
    sample_idx = list(labels).index(class_idx)
    sample_text = texts[sample_idx]
    if len(sample_text) > 300:
        sample_text = sample_text[:300] + "..."
    
    print(f"\nClass: {class_name}")
    print("-" * 30)
    print(sample_text)
    print()

## 2. Text Preprocessing <a id="preprocessing"></a>

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor()

# Show preprocessing example
sample_text = texts[0]
processed_text = preprocessor.preprocess_text(sample_text)

print("Preprocessing Example:")
print("=" * 50)
print("Original:")
print(sample_text[:500] + "..." if len(sample_text) > 500 else sample_text)
print("\nProcessed:")
print(processed_text[:500] + "..." if len(processed_text) > 500 else processed_text)
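
`TextPreprocessor.preprocess_text` lives in `src.preprocessing` and is not reproduced here. The sketch below shows one plausible minimal version of such a step (lowercasing, removing non-letter characters, dropping stop words). The function `simple_preprocess` and the tiny `STOP_WORDS` set are assumptions for illustration; the project's class may do more or work differently.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def simple_preprocess(text: str) -> str:
    """Lowercase, keep only letters, and drop stop words."""
    text = text.lower()
    # Replace digits and punctuation with spaces so words stay separated
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(simple_preprocess("The Shuttle is scheduled to launch in 2 days!"))
# prints: shuttle scheduled launch days
```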

In [None]:
# Preprocess all texts
print("Preprocessing all texts...")
processed_texts = [preprocessor.preprocess_text(text) for text in texts]
print("Preprocessing completed!")

# Analyze text length distribution
visualizer.plot_text_length_distribution(
    processed_texts, 
    [label_names[label] for label in labels]
)

## 3. Feature Extraction <a id="features"></a>

In [None]:
# Split data
X_train_text, X_test_text, y_train, y_test = train_test_split(
    processed_texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Training set size: {len(X_train_text)}")
print(f"Test set size: {len(X_test_text)}")
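
The `stratify=labels` argument in the split above keeps the class proportions the same in the training and test sets. A small sketch with made-up, imbalanced toy data (`toy_texts`, `toy_labels`) shows the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 documents of class 0, 20 of class 1
toy_texts = [f"doc {i}" for i in range(100)]
toy_labels = [0] * 80 + [1] * 20

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Both splits keep the original 80/20 class ratio
print("train:", Counter(toy_y_train))
print("test: ", Counter(toy_y_test))
```

Without stratification, a rare class could end up under-represented in (or missing from) the test set, which would make per-class metrics unreliable.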

In [None]:
# Initialize feature extractor
feature_extractor = FeatureExtractor()

# Extract TF-IDF features
print("Extracting TF-IDF features...")
X_train_tfidf = feature_extractor.extract_tfidf_features(X_train_text, max_features=5000)
X_test_tfidf = feature_extractor.transform_tfidf(X_test_text)

print(f"TF-IDF feature matrix shape: {X_train_tfidf.shape}")
print(f"Feature density: {X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1]):.4f}")
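
The cell above fits the TF-IDF vocabulary on the training texts and only transforms the test texts. Assuming `FeatureExtractor` wraps scikit-learn's `TfidfVectorizer` (the later use of `tfidf_vectorizer.get_feature_names_out()` suggests this, but it is an assumption), the same pattern looks roughly like this on made-up toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents
toy_train = ["rocket launch orbit", "engine car wheel", "orbit satellite launch"]
toy_test = ["car engine rocket", "unseen words only"]

vectorizer = TfidfVectorizer(max_features=5000)
# Learn vocabulary and IDF weights from the training data only
X_toy_train = vectorizer.fit_transform(toy_train)
# Reuse the fitted vocabulary; words not seen in training are ignored
X_toy_test = vectorizer.transform(toy_test)

# Density = fraction of non-zero entries in the sparse matrix
density = X_toy_train.nnz / (X_toy_train.shape[0] * X_toy_train.shape[1])
print(X_toy_train.shape, X_toy_test.shape, f"density={density:.4f}")
```

Fitting on the training split alone avoids leaking test-set vocabulary statistics into the features. The second test document contains only unseen words, so its row is entirely zero.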

In [None]:
# Extract Bag of Words features
print("Extracting Bag of Words features...")
X_train_bow = feature_extractor.extract_bow_features(X_train_text, max_features=5000)
X_test_bow = feature_extractor.transform_bow(X_test_text)

print(f"BoW feature matrix shape: {X_train_bow.shape}")
print(f"Feature density: {X_train_bow.nnz / (X_train_bow.shape[0] * X_train_bow.shape[1]):.4f}")

## 4. Model Training <a id="training"></a>

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Define models and feature sets to test
models_to_test = ['naive_bayes', 'logistic_regression']
feature_sets = [
    ('TF-IDF', X_train_tfidf, X_test_tfidf),
    ('BoW', X_train_bow, X_test_bow)
]

print("Training and evaluating models...")
print("=" * 50)

In [None]:
# Train and evaluate all model combinations
results = {}

for feature_name, X_train_feat, X_test_feat in feature_sets:
    print(f"\n--- Testing with {feature_name} features ---")
    
    for model_type in models_to_test:
        model_name = f"{model_type}_{feature_name}"
        print(f"\nTraining {model_name}...")
        
        # Train model
        classifier = TextClassifier(model_type)
        classifier.train(X_train_feat, y_train)
        
        # Evaluate model
        metrics = evaluator.evaluate_model(
            classifier, X_test_feat, y_test, model_name
        )
        
        results[model_name] = {
            'classifier': classifier,
            'metrics': metrics
        }
        
        print(f"Accuracy: {metrics['accuracy']:.4f}")
        print(f"F1-Score: {metrics['f1_score']:.4f}")

print("\nTraining completed!")
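
`TextClassifier` and `ModelEvaluator` are project classes, so the loop above hides the underlying estimators. The sketch below assumes that `'naive_bayes'` corresponds to `MultinomialNB` and `'logistic_regression'` to `LogisticRegression`, and that the reported F1 is a weighted average; all three are assumptions. The toy documents and the names `candidates` and `toy_results` are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam-like, 0 = regular message
train_docs = [
    "free prize win cash", "win free money now", "meeting at noon today",
    "lunch meeting tomorrow", "claim free cash prize", "project meeting notes",
]
train_y = [1, 1, 0, 0, 1, 0]
test_docs = ["free cash now", "notes from the meeting"]
test_y = [1, 0]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)
X_te = vec.transform(test_docs)

# Assumed mapping from the notebook's model names to scikit-learn estimators
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

toy_results = {}
for name, model in candidates.items():
    model.fit(X_tr, train_y)
    pred = model.predict(X_te)
    toy_results[name] = {
        "accuracy": accuracy_score(test_y, pred),
        "f1_score": f1_score(test_y, pred, average="weighted"),
    }
    print(name, toy_results[name])
```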

## 5. Model Evaluation <a id="evaluation"></a>

In [None]:
# Compare all models
comparison_df = evaluator.compare_models()
print("Model Comparison:")
print("=" * 50)
print(comparison_df.round(4))

In [None]:
# Plot model comparison
evaluator.plot_model_comparison()

In [None]:
# Get best model
best_model_name, best_score = evaluator.get_best_model('f1_score')
print(f"Best Model: {best_model_name}")
print(f"Best F1-Score: {best_score:.4f}")

# Detailed evaluation report
evaluator.print_evaluation_report(best_model_name)
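
`get_best_model('f1_score')` presumably returns the model with the highest value of the named metric. A minimal sketch of that selection, using a hypothetical helper `pick_best` and made-up placeholder scores (these are not results from this notebook):

```python
# Made-up placeholder scores for illustration only
toy_scores = {
    "model_a": {"accuracy": 0.90, "f1_score": 0.88},
    "model_b": {"accuracy": 0.89, "f1_score": 0.91},
    "model_c": {"accuracy": 0.85, "f1_score": 0.84},
}


def pick_best(scores, metric):
    """Return (name, value) of the model with the highest value for `metric`."""
    best_name = max(scores, key=lambda name: scores[name][metric])
    return best_name, scores[best_name][metric]


print(pick_best(toy_scores, "f1_score"))
print(pick_best(toy_scores, "accuracy"))
```

The winner can change with the metric, so it is worth deciding up front which one reflects the goal; F1 is usually the safer choice when classes are imbalanced.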

In [None]:
# Plot confusion matrix for best model
evaluator.plot_confusion_matrix(
    best_model_name, 
    class_names=list(label_names.values())
)
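
To read the plotted matrix, it helps to see the raw numbers. The sketch below uses `sklearn.metrics.confusion_matrix` (already imported at the top of this notebook) on made-up toy labels and derives per-class recall from it; `y_true_toy` and `y_pred_toy` are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Recall per class = correct predictions (diagonal) / number of true samples (row sum)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

Off-diagonal cells show which classes are confused with each other; here one class-0 document was predicted as class 1 and one class-2 document as class 0.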

In [None]:
# Plot ROC curve (binary classification only; skipped for multi-class datasets such as 20newsgroups)
if len(label_names) == 2:
    evaluator.plot_roc_curve(best_model_name)
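
`plot_roc_curve` is a project helper; for a binary problem, the underlying computation can be sketched with scikit-learn's `roc_curve` and `roc_auc_score`. The labels and scores below are made-up toy values.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary labels and predicted scores for the positive class
y_true_bin = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true_bin, y_score)
auc_value = roc_auc_score(y_true_bin, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print(f"AUC: {auc_value:.2f}")
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score; in this toy case that is 3 out of 4 pairs.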

## 6. Results Visualization <a id="visualization"></a>

In [None]:
# Feature importance analysis
best_classifier = results[best_model_name]['classifier']

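# Get feature names. Model names are built as f"{model_type}_{feature_name}",
# so TF-IDF models contain "tf-idf" (not "tfidf") once lowercased.
if 'tf-idf' in best_model_name.lower():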
    feature_names = feature_extractor.tfidf_vectorizer.get_feature_names_out()
else:
    feature_names = feature_extractor.count_vectorizer.get_feature_names_out()

# Get feature importance
feature_importance = best_classifier.get_feature_importance(feature_names)
if feature_importance:
    visualizer.plot_feature_importance(feature_importance, top_n=20)

In [None]:
# Create word clouds for each class
print("Generating word clouds...")
for label_idx, class_name in label_names.items():
    visualizer.create_wordcloud(
        processed_texts, 
        labels, 
        class_name=label_idx,
        figsize=(12, 6)
    )

In [None]:
# Prediction confidence analysis
best_result = evaluator.results[best_model_name]
visualizer.plot_prediction_confidence(
    best_result['y_pred_proba'], 
    best_result['y_true']
)
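
A prediction's confidence is commonly taken as the highest class probability for that sample. Assuming `y_pred_proba` is an array with one row per sample and one column per class (an assumption about the evaluator's output), confidence for correct and incorrect predictions can be compared as follows. The probabilities are made-up toy values.

```python
import numpy as np

# Hypothetical predicted probabilities (rows: samples, columns: classes)
proba = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
y_true_toy = np.array([0, 1, 1, 0])

y_pred = proba.argmax(axis=1)    # predicted class = most probable class
confidence = proba.max(axis=1)   # confidence = probability of the predicted class
correct = y_pred == y_true_toy

print("mean confidence (correct):  ", confidence[correct].mean())
print("mean confidence (incorrect):", confidence[~correct].mean())
```

If the model is reasonably calibrated, incorrect predictions should tend to have lower confidence than correct ones.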

## Summary and Conclusions

This notebook demonstrated a complete text classification pipeline including:

1. **Data Loading**: Multiple dataset options with easy switching
2. **Preprocessing**: Comprehensive text cleaning and normalization
3. **Feature Extraction**: TF-IDF and Bag of Words vectorization
4. **Model Training**: Multiple algorithms with systematic evaluation
5. **Evaluation**: Comprehensive metrics and visualizations
6. **Analysis**: Feature importance and prediction confidence

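### Key Findings:
The results depend on the dataset selected in Section 1; after a run, record:
- The best-performing model (from `evaluator.get_best_model('f1_score')`) and its characteristics
- How TF-IDF and Bag of Words features compare across models (from `evaluator.compare_models()`)
- The most important features for classification (from the feature importance plot)
- Model strengths and weaknesses (from the confusion matrix and prediction confidence plots)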

### Next Steps:
- Try additional models (SVM, Random Forest)
- Experiment with different preprocessing techniques
- Use word embeddings (Word2Vec, GloVe)
- Implement cross-validation
- Deploy the best model in production
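
As a starting point for the cross-validation step listed above, the sketch below combines vectorization and classification in a scikit-learn pipeline, so the vocabulary is re-fitted inside each fold. It uses scikit-learn estimators directly rather than the project's `TextClassifier`, and the toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam-like, 0 = regular message
cv_docs = [
    "win free cash prize now", "free money claim today", "claim your free prize",
    "cash prize winner free", "win money now free", "free cash offer today",
    "project meeting at noon", "lunch meeting tomorrow", "meeting notes attached",
    "schedule the project review", "team lunch on friday", "review the meeting notes",
]
cv_labels = [1] * 6 + [0] * 6

# The vectorizer is part of the pipeline, so it is fitted on each training fold only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

scores = cross_val_score(pipeline, cv_docs, cv_labels, cv=cv, scoring="f1_macro")
print("F1 per fold:", scores)
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over folds gives a more stable estimate than the single 80/20 split used earlier in this notebook.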