# Text Classification - Exploratory Data Analysis (EDA)

**Course:** Artificial Intelligence  
**Instructor:** Dr. Pishgoo  
**Project Supervisor:** Eng. Alireza Ghorbani

---

## Table of Contents
1. [Introduction](#introduction)
2. [Load Data](#load-data)
3. [Basic Statistics](#basic-statistics)
4. [Text Analysis](#text-analysis)
5. [Class Distribution](#class-distribution)
6. [Text Preprocessing](#text-preprocessing)
7. [Feature Extraction](#feature-extraction)
8. [Conclusions](#conclusions)

## 1. Introduction

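This notebook performs exploratory data analysis (EDA) for a text classification task. It loads the data, summarizes basic and text-level statistics, examines the class distribution, and previews text preprocessing and TF-IDF feature extraction.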

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set style (the 'seaborn-v0_8' style name requires Matplotlib >= 3.6)
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

# Import custom modules
import sys
sys.path.append('../')

from src.preprocessing.text_processor import TextPreprocessor, FeatureExtractor
from src.utils.helpers import (
    set_seed, plot_word_cloud, plot_text_length_distribution,
    plot_class_distribution
)

# Set random seed
set_seed(42)

## 2. Load Data

In [None]:
# Load dataset
# TODO: Replace with your actual data path
# df = pd.read_csv('../data/raw/your_dataset.csv')

# For demonstration, create sample data
sample_data = {
    'text': [
        'This product is absolutely amazing! Best purchase ever.',
        'Terrible experience. Would not recommend to anyone.',
        'Pretty good overall, meets expectations.',
        'Worst service I have ever received. Very disappointed.',
        'Excellent quality! Highly satisfied with this product.'
    ] * 20,
    'label': [1, 0, 1, 0, 1] * 20
}

df = pd.DataFrame(sample_data)

print(f"Dataset shape: {df.shape}")
df.head()

## 3. Basic Statistics

In [None]:
# Dataset info
print("Dataset Information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\nBasic Statistics:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check for duplicates
print(f"\nNumber of duplicates: {df.duplicated().sum()}")
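
The duplicate count above can also be reproduced without pandas, which helps when checking what the number means. The sketch below is illustrative only: `count_duplicates` and the shortened rows are hypothetical stand-ins for the demonstration data, and every repeat after the first occurrence is counted, matching the default `keep='first'` behavior of `df.duplicated()`.

```python
from collections import Counter

def count_duplicates(rows):
    """Return (n_duplicates, most_common) for a list of hashable rows.

    Every repeat after the first occurrence counts as a duplicate.
    """
    counts = Counter(rows)
    n_duplicates = len(rows) - len(counts)
    return n_duplicates, counts.most_common(3)

# Hypothetical rows mirroring the demonstration data: 5 distinct pairs repeated 20 times
rows = [
    ("amazing product", 1),
    ("terrible experience", 0),
    ("pretty good", 1),
    ("worst service", 0),
    ("excellent quality", 1),
] * 20

n_dup, top = count_duplicates(rows)
print(f"Duplicates: {n_dup} of {len(rows)} rows")
```

Because the demonstration data repeats five rows, almost every row is a duplicate; with a real dataset, a high count here is usually worth investigating before any train/test split.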

## 4. Text Analysis

In [None]:
# Calculate text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
# Guard against empty texts: np.mean([]) would return NaN
df['avg_word_length'] = df['text'].apply(
    lambda x: np.mean([len(word) for word in x.split()]) if x.split() else 0.0
)

print("Text Statistics:")
print(df[['text_length', 'word_count', 'avg_word_length']].describe())
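
To make the three statistics concrete, here is a minimal dependency-free sketch for a single text. The helper `text_stats` is hypothetical (it is not part of `src.utils.helpers`), and it splits on whitespace only, so punctuation attached to a word counts toward that word's length.

```python
def text_stats(text):
    """Return character length, word count, and mean word length for one text."""
    words = text.split()
    # Empty or whitespace-only texts have no words; report 0.0 instead of dividing by zero
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "text_length": len(text),
        "word_count": len(words),
        "avg_word_length": avg_word_length,
    }

stats = text_stats("Pretty good overall, meets expectations.")
empty_stats = text_stats("")
print(stats)
print(empty_stats)
```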

In [None]:
# Plot text length distribution
plot_text_length_distribution(
    df['text'].tolist(),
    labels=df['label'].tolist(),
    figsize=(12, 6)
)

In [None]:
# Generate word cloud for entire dataset
plot_word_cloud(
    df['text'].tolist(),
    title="Word Cloud - All Data",
    max_words=100,
    figsize=(14, 8)
)

## 5. Class Distribution

In [None]:
# Class distribution
class_counts = df['label'].value_counts()
print("Class Distribution:")
print(class_counts)
print(f"\nClass balance ratio (min/max): {class_counts.min() / class_counts.max():.2f}")

# Plot class distribution
plot_class_distribution(
    df['label'].values,
    class_names=['Negative', 'Positive'],
    figsize=(10, 6)
)
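
The balance ratio printed above is the smallest class count divided by the largest, so 1.0 means perfectly balanced. When the ratio is well below 1, one common response is to weight classes inversely to their frequency. The sketch below is an illustration rather than part of this project's code: `class_balance` is a hypothetical helper, and the weight formula `n_samples / (n_classes * count)` is the same heuristic scikit-learn uses for `class_weight='balanced'`.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts, the min/max balance ratio, and inverse-frequency weights."""
    counts = Counter(labels)
    ratio = min(counts.values()) / max(counts.values())
    n_samples, n_classes = len(labels), len(counts)
    # Rarer classes receive larger weights
    weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
    return counts, ratio, weights

# Same label pattern as the demonstration data
labels = [1, 0, 1, 0, 1] * 20
counts, ratio, weights = class_balance(labels)
print(f"Counts: {dict(counts)}")
print(f"Balance ratio: {ratio:.2f}")
print(f"Weights: {weights}")
```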

In [None]:
# Word clouds by class
for label in df['label'].unique():
    class_texts = df[df['label'] == label]['text'].tolist()
    class_name = 'Positive' if label == 1 else 'Negative'
    
    plot_word_cloud(
        class_texts,
        title=f"Word Cloud - {class_name}",
        max_words=50,
        figsize=(12, 6)
    )

## 6. Text Preprocessing

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor()

# Preprocess sample texts
print("Original vs Cleaned Text Examples:\n")
for i in range(3):
    original = df['text'].iloc[i]
    cleaned = preprocessor.preprocess(original)
    
    print(f"Example {i+1}:")
    print(f"Original: {original}")
    print(f"Cleaned:  {cleaned}")
    print()

In [None]:
# Preprocess entire dataset
df_processed = preprocessor.preprocess_dataframe(df, 'text', 'label')

print(f"Original dataset size: {len(df)}")
print(f"Processed dataset size: {len(df_processed)}")
print(f"Removed samples: {len(df) - len(df_processed)}")

df_processed.head()
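
The internals of `TextPreprocessor` are not shown in this notebook, so the following is only a sketch of what a typical cleaning step might do, under stated assumptions: lowercasing, punctuation removal, and filtering against a tiny hypothetical stopword list. The function `simple_preprocess` and the `STOPWORDS` set are illustrative names, not the project's API.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
# Negations such as "not" are kept because they often carry sentiment.
STOPWORDS = {"a", "an", "the", "is", "to", "i", "this", "with", "have"}

# Translation table that deletes all ASCII punctuation characters
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords; return a space-joined string."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

texts = [
    "This product is absolutely amazing! Best purchase ever.",
    "!!!",
]
cleaned = [simple_preprocess(t) for t in texts]
# Texts that become empty after cleaning are dropped, which is one way samples can be removed
kept = [c for c in cleaned if c]
print(cleaned)
print(f"Removed samples: {len(cleaned) - len(kept)}")
```

A step like this would explain a non-zero "Removed samples" count: a text consisting only of punctuation or stopwords is empty after cleaning.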

## 7. Feature Extraction

In [None]:
# TF-IDF feature extraction
extractor = FeatureExtractor(method='tfidf', max_features=100)
X = extractor.fit_transform(df_processed['cleaned_text'])

feature_names = extractor.get_feature_names()

print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(feature_names)}")

# First 20 entries in vocabulary order; these are not necessarily ranked by score
# (ranking by mean TF-IDF is done in the next cell)
print("\nFirst 20 features:")
print(feature_names[:20])

In [None]:
# Rank features by their mean TF-IDF score across documents
# (a frequency-based summary, not a measure of predictive importance)
mean_tfidf = np.asarray(X.mean(axis=0)).flatten()
top_indices = mean_tfidf.argsort()[-20:][::-1]
top_features = [feature_names[i] for i in top_indices]
top_scores = mean_tfidf[top_indices]

# Plot
plt.figure(figsize=(12, 6))
plt.barh(range(len(top_features)), top_scores, color='steelblue')
plt.yticks(range(len(top_features)), top_features)
plt.xlabel('Average TF-IDF Score', fontsize=12)
plt.title('Top 20 Features by TF-IDF Score', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
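
For readers who want to see what a mean TF-IDF score is made of, here is a small dependency-free sketch. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document row, which is scikit-learn's default; whether `FeatureExtractor` uses the same settings is an assumption, and `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        # L2-normalize each row; an all-zero row is left unchanged
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["good product", "bad product", "good good service"]
vocab, rows = tfidf(docs)

# Column means correspond to X.mean(axis=0) in the cell above
mean_scores = [sum(col) / len(rows) for col in zip(*rows)]
ranked = sorted(zip(vocab, mean_scores), key=lambda pair: pair[1], reverse=True)
for term, score in ranked:
    print(f"{term:10s} {score:.3f}")
```

Terms that appear in many documents get a lower IDF, but they can still rank high on the mean because they contribute a non-zero weight to more rows.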

## 8. Conclusions

### Key Findings:

1. **Dataset Size**: [Describe dataset size]
2. **Class Balance**: [Describe class distribution]
3. **Text Characteristics**: [Describe text length, word count, etc.]
4. **Data Quality**: [Describe missing values, duplicates, etc.]
5. **Preprocessing Impact**: [Describe preprocessing results]

### Next Steps:

1. Train baseline models (Logistic Regression, Naive Bayes, etc.)
2. Implement deep learning models (CNN, LSTM, BERT)
3. Hyperparameter tuning
4. Model evaluation and comparison
5. Error analysis
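
Before training the baseline models in step 1, it can help to record the accuracy of always predicting the most frequent class, since any real model should beat that floor. The sketch below is a hypothetical illustration using the demonstration labels and a simple positional split; `majority_baseline_accuracy` is not part of this project's code.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test sample."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# Same label pattern as the demonstration data, split 80/20 by position
labels = [1, 0, 1, 0, 1] * 20
train_labels, test_labels = labels[:80], labels[80:]

majority, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(f"Majority class: {majority}, baseline accuracy: {accuracy:.2f}")
```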

In [None]:
# Save processed data
# df_processed.to_csv('../data/processed/processed_data.csv', index=False)
# print("Processed data saved!")