# Data Exploration and Visualization

This notebook explores the dialog dataset used to train custom chatbot models. We analyze text patterns, visualize length distributions, and check data quality ahead of model training.

## Objectives:
- Connect to data sources (CSV/SQL) 
- Explore dialog data structure and quality
- Visualize text patterns and distributions
- Log insights with WandB for experiment tracking
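
The cells below load data through the project's `get_dataset_manager` helper, whose internals are not shown in this notebook. As a rough illustration of what connecting to a CSV or SQL source can look like with plain pandas, here is a minimal sketch. The function name `load_dialog_frame`, the default table name `dialogs`, and the choice of SQLite are assumptions made for this example, not the project's actual loader.

```python
import sqlite3
from typing import Optional

import pandas as pd

# Columns the later cells of this notebook rely on.
REQUIRED_COLUMNS = ("instruction", "output")


def load_dialog_frame(csv_path: Optional[str] = None,
                      sqlite_path: Optional[str] = None,
                      table: str = "dialogs") -> pd.DataFrame:
    """Hypothetical loader: read dialog pairs from a CSV file or a SQLite table."""
    if (csv_path is None) == (sqlite_path is None):
        raise ValueError("Pass exactly one of csv_path or sqlite_path")

    if csv_path is not None:
        frame = pd.read_csv(csv_path)
    else:
        conn = sqlite3.connect(sqlite_path)
        try:
            # The table name is interpolated into the query, so only pass trusted values.
            frame = pd.read_sql_query(f"SELECT * FROM {table}", conn)
        finally:
            conn.close()

    # Fail early if the source lacks the columns the analysis expects.
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame
```

Calling `load_dialog_frame(csv_path="dialogs.csv")` or `load_dialog_frame(sqlite_path="dialogs.db")` (both paths are placeholders) returns a DataFrame with at least the `instruction` and `output` columns.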

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import sys
import os

# WandB for experiment tracking
import wandb

# Add src to path for imports
sys.path.append('../')
from src.data.loaders import get_dataset_manager

# Setup plotting
# 'seaborn-v0_8' requires matplotlib >= 3.6 (older releases name this style 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Initialize WandB for experiment tracking
wandb.init(
    project="dialog-model-training",
    name="data-exploration",
    tags=["data-analysis", "exploration"]
)

print("WandB initialized for data exploration tracking")

## 1. Data Loading and Basic Information

Load the dialog dataset and examine its basic structure.

In [None]:
# Load the dataset
dataset_manager = get_dataset_manager()
df = dataset_manager.load_dataset()

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()

In [None]:
# Get detailed dataset info
info = dataset_manager.get_dataset_info(df)
dataset_manager.print_dataset_summary(df)

# Log basic stats to WandB
wandb.log({
    "dataset_size": len(df),
    "num_columns": len(df.columns),
    "memory_usage_mb": info['memory_usage'] / 1024**2
})

print(f"\nDataset successfully loaded with {len(df):,} examples")
print(f"Memory usage: {info['memory_usage'] / 1024**2:.2f} MB")

## 2. Text Analysis and Statistics

Analyze the length and patterns of instructions and outputs in the dialog data.

In [None]:
# Calculate text lengths
df['instruction_length'] = df['instruction'].str.len()
df['output_length'] = df['output'].str.len()
df['total_length'] = df['instruction_length'] + df['output_length']

# Calculate word counts
df['instruction_words'] = df['instruction'].str.split().str.len()
df['output_words'] = df['output'].str.split().str.len()

# Basic statistics
print("Text Length Statistics:")
print("=" * 40)
print("\nInstruction lengths:")
print(df['instruction_length'].describe())
print("\nOutput lengths:")
print(df['output_length'].describe())
print("\nWord counts:")
print(f"Avg instruction words: {df['instruction_words'].mean():.1f}")
print(f"Avg output words: {df['output_words'].mean():.1f}")

In [None]:
# Visualize text length distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Instruction length distribution
axes[0,0].hist(df['instruction_length'], bins=50, alpha=0.7, color='skyblue')
axes[0,0].set_title('Instruction Length Distribution')
axes[0,0].set_xlabel('Characters')
axes[0,0].set_ylabel('Frequency')

# Output length distribution  
axes[0,1].hist(df['output_length'], bins=50, alpha=0.7, color='lightcoral')
axes[0,1].set_title('Output Length Distribution')
axes[0,1].set_xlabel('Characters')
axes[0,1].set_ylabel('Frequency')

# Word count distributions
axes[1,0].hist(df['instruction_words'], bins=30, alpha=0.7, color='lightgreen')
axes[1,0].set_title('Instruction Word Count')
axes[1,0].set_xlabel('Words')
axes[1,0].set_ylabel('Frequency')

axes[1,1].hist(df['output_words'], bins=50, alpha=0.7, color='gold')
axes[1,1].set_title('Output Word Count')
axes[1,1].set_xlabel('Words')
axes[1,1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Log to WandB
wandb.log({
    "avg_instruction_length": df['instruction_length'].mean(),
    "avg_output_length": df['output_length'].mean(),
    "avg_instruction_words": df['instruction_words'].mean(),
    "avg_output_words": df['output_words'].mean()
})
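
Averages can hide a long tail in the length distributions. One common way to turn these distributions into a concrete decision is to look at upper percentiles when picking a maximum sequence length. The sketch below assumes whitespace word counts are an acceptable rough proxy for token counts; the helper names `length_percentiles` and `coverage_at` and the toy data are hypothetical.

```python
import pandas as pd


def length_percentiles(word_counts: pd.Series,
                       quantiles=(0.5, 0.9, 0.95, 0.99)) -> dict:
    """Return selected quantiles of a word-count column, ignoring missing values."""
    q = word_counts.dropna().quantile(list(quantiles))
    return {f"p{int(round(level * 100))}": float(value) for level, value in q.items()}


def coverage_at(word_counts: pd.Series, max_words: int) -> float:
    """Fraction of non-missing examples that fit within max_words without truncation."""
    counts = word_counts.dropna()
    return float((counts <= max_words).mean())


# Toy word counts with one long outlier.
toy_counts = pd.Series([3, 5, 8, 12, 20, 40, 400])
print(length_percentiles(toy_counts))
print(coverage_at(toy_counts, max_words=40))
```

With the notebook's data, the same helpers could be applied to `df['output_words']` to see how many examples a candidate length limit would truncate.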

## 3. Data Quality Analysis

Check for missing values, duplicates, and data quality issues.

In [None]:
# Check for missing values
print("Missing Values:")
print("=" * 20)
missing_counts = df.isnull().sum()
print(missing_counts)

# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicate_count}")

# Check for empty strings
empty_instructions = (df['instruction'].str.strip() == '').sum()
empty_outputs = (df['output'].str.strip() == '').sum()
print(f"Empty instructions: {empty_instructions}")
print(f"Empty outputs: {empty_outputs}")

# Identify very short or very long examples
very_short_instructions = (df['instruction_words'] < 3).sum()
very_long_outputs = (df['output_words'] > 500).sum()
print(f"Very short instructions (<3 words): {very_short_instructions}")
print(f"Very long outputs (>500 words): {very_long_outputs}")

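# Flag each row at most once so the score stays within [0, 1].
# Summing the individual counts would count a row several times, since a missing
# instruction also produces missing values in the derived length columns.
text_is_empty = (df[['instruction', 'output']].apply(lambda s: s.str.strip()) == '').any(axis=1)
problem_rows = df.isnull().any(axis=1) | df.duplicated() | text_is_empty

# Log quality metrics to WandB
quality_metrics = {
    "missing_values": missing_counts.sum(),
    "duplicate_rows": duplicate_count,
    "empty_instructions": empty_instructions,
    "empty_outputs": empty_outputs,
    "very_short_instructions": very_short_instructions,
    "very_long_outputs": very_long_outputs,
    "data_quality_score": 1 - problem_rows.mean()  # share of rows with no flagged issue
}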

wandb.log(quality_metrics)
print(f"\nData quality score: {quality_metrics['data_quality_score']:.3f}")
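
The checks above only report issues. If flagged rows should be excluded before training, one possible cleanup is sketched below. The helper name `drop_flagged_rows` and the toy frame are hypothetical, and whether to drop or repair such rows is a project decision this notebook does not make.

```python
import pandas as pd


def drop_flagged_rows(frame: pd.DataFrame,
                      text_cols=("instruction", "output")) -> pd.DataFrame:
    """Hypothetical cleanup: drop rows with missing or blank text, then exact duplicates."""
    cols = list(text_cols)
    cleaned = frame.dropna(subset=cols)
    # Keep only rows where every text column has non-whitespace content.
    has_text = cleaned[cols].apply(lambda s: s.astype(str).str.strip() != "").all(axis=1)
    cleaned = cleaned[has_text]
    # Treat rows with identical text columns as duplicates and keep the first.
    return cleaned.drop_duplicates(subset=cols).reset_index(drop=True)


# Toy frame: one duplicate, one blank instruction, one missing instruction.
toy = pd.DataFrame({
    "instruction": ["Say hi", "Say hi", "   ", None, "Add 2 and 2"],
    "output": ["Hi!", "Hi!", "ok", "ok", "4"],
})
cleaned = drop_flagged_rows(toy)
print(len(toy), "->", len(cleaned))
```

Returning a new frame rather than modifying the input keeps the original data available for comparison with the logged quality metrics.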

## 4. Sample Data Exploration

Examine a few examples to understand the dialog patterns and quality.

In [None]:
# Display random samples
print("Sample Dialog Examples:")
print("=" * 50)

# Cap the sample size so small datasets do not raise a ValueError
sample_indices = np.random.choice(len(df), min(3, len(df)), replace=False)

for i, idx in enumerate(sample_indices):
    row = df.iloc[idx]
    print(f"\nExample {i+1}:")
    print(f"Instruction: {row['instruction']}")
    print(f"Output: {row['output']}")
    print(f"Lengths: {row['instruction_words']} words → {row['output_words']} words")
    print("-" * 50)

## 5. Interactive Visualizations with Plotly

Create interactive plots for deeper data exploration.

In [None]:
# Interactive scatter plot: instruction vs output length
fig = px.scatter(
    df.sample(min(1000, len(df)), random_state=42),  # Sample for performance; capped for small datasets
    x='instruction_words',
    y='output_words',
    title='Instruction vs Output Length (Word Count)',
    labels={'instruction_words': 'Instruction Words', 'output_words': 'Output Words'},
    opacity=0.6
)
fig.show()

# Box plots for length distributions
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=['Instruction Word Counts', 'Output Word Counts']
)

fig.add_trace(
    go.Box(y=df['instruction_words'], name='Instructions'),
    row=1, col=1
)

fig.add_trace(
    go.Box(y=df['output_words'], name='Outputs'),
    row=1, col=2
)

fig.update_layout(title_text="Text Length Distributions", showlegend=False)
fig.show()

## 6. Conclusions and Next Steps

### Key Findings:
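- The dataset consists of instruction-response pairs stored in the `instruction` and `output` columns
- Text lengths vary significantly, so the length distributions from Section 2 should inform the tokenization strategy and maximum sequence length
- Data quality is high when the missing, duplicate, and empty-string counts from Section 3 are minimal; the data quality score summarizes these checks
- Once those checks pass, the data is ready for model training experiments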

### Next Steps:
1. **Preprocessing**: Implement tokenization and sequence padding
2. **Model Architecture**: Design custom dialog models for comparison  
3. **Training Pipeline**: Set up training loop with WandB logging
4. **Evaluation**: Define metrics for dialog quality assessment
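
To make the preprocessing step more concrete, here is a minimal sketch of whitespace tokenization with truncation and padding to a fixed length. Real pipelines typically use a subword tokenizer instead; the reserved `<pad>` and `<unk>` ids and the function names `build_vocab` and `encode` are assumptions for illustration only.

```python
from typing import Dict, List

PAD_ID, UNK_ID = 0, 1  # assumed reserved ids for padding and unknown tokens


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every lowercase whitespace token, in order of first use."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(token, UNK_ID) for token in text.lower().split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab(["How are you", "I am fine"])
print(encode("How are they", vocab, max_len=5))  # [2, 3, 1, 0, 0]
```

The `max_len` value is where the length distributions from Section 2 come in: a limit that is too small truncates many outputs, while one that is too large wastes computation on padding.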

The summary statistics and quality metrics computed above are logged to WandB for experiment tracking.

In [None]:
# Finish WandB run
wandb.finish()
print("Data exploration complete! Check WandB dashboard for logged metrics.")