# Dataset Overview Analysis

This notebook provides a comprehensive overview of the image scraping dataset, including:
- High-level statistics and metrics
- Class and category distributions
- Success rate analysis
- Format distribution overview

## Setup and Data Loading

In [None]:
import sys
import os

# Add paths for imports
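# Use the real __file__ when run as a script; notebooks do not define it, so fall back to the notebook name
notebook_dir = os.path.dirname(os.path.abspath(__file__ if '__file__' in globals() else 'dataset_overview.ipynb'))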
visualizations_dir = os.path.dirname(notebook_dir)
sys.path.append(os.path.join(visualizations_dir, 'visualizers'))
sys.path.append(os.path.join(visualizations_dir, 'utils'))

from dataset_stats import create_combined_overview
from data_loader import load_report_data
from plot_helpers import apply_global_style, display_config

apply_global_style()

# Load data - adjust path as needed
data = load_report_data('../../sample_report.json')
print("Data loaded successfully!")
print(f"Generated at: {data.get('generated_at', 'Unknown')}")
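
Before plotting, it can help to confirm what the loaded report actually contains. The following is a minimal sketch, assuming `data` is a plain dictionary as the `data.get(...)` call above suggests. The helper `summarize_report` is hypothetical and is not part of `data_loader`.

```python
from typing import Any


def summarize_report(report: dict[str, Any]) -> dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in report.items():
        if isinstance(value, (dict, list)):
            # Containers: report the type and how many entries they hold
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            # Scalars: report only the type name
            summary[key] = type(value).__name__
    return summary


# Hypothetical usage with the `data` object loaded above:
# for key, description in summarize_report(data).items():
#     print(f"{key}: {description}")
```

Listing the top-level keys this way makes it easier to spot a missing section before a chart silently renders empty.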

## Dataset Overview Dashboard

Key metrics at a glance:

In [None]:
charts = create_combined_overview(data)

overview_fig = charts['overview']
overview_fig.show(config=display_config())

## Class Distribution Analysis

Distribution of images across different classes:

In [None]:
class_fig = charts['class_distribution']
class_fig.show(config=display_config())
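
To read the chart numerically, per-class counts can be turned into percentage shares and a simple imbalance ratio. This is a sketch only: it assumes the counts are available as a mapping from class name to image count, which the report may or may not expose in that form. Both helper names are hypothetical.

```python
def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's share of the total as a percentage."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when there are no images at all
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def imbalance_ratio(counts: dict[str, int]) -> float:
    """Ratio of the largest class to the smallest non-empty class."""
    non_empty = [n for n in counts.values() if n > 0]
    if not non_empty:
        return 0.0
    return max(non_empty) / min(non_empty)
```

A ratio close to 1 indicates evenly sized classes; larger values mean the biggest class outweighs the smallest by that factor.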

## Category Distribution

Breakdown by major categories (Grow, Glow, Go):

In [None]:
category_fig = charts['category_distribution']
category_fig.show(config=display_config())

## Download Success Rate

Overall success rate of the scraping operation:

In [None]:
success_fig = charts['success_rate']
success_fig.show(config=display_config())
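
As a cross-check on the gauge, the rate can be recomputed from raw counts. The sketch below assumes the success rate is the number of downloaded images divided by the number of URLs found, expressed as a percentage; the value the notebook reads from `metrics['success_rate']` may be defined differently. The function name is hypothetical, and its parameters mirror the `total_urls_found` and `total_downloaded` keys used later in this notebook.

```python
def success_rate(total_urls_found: int, total_downloaded: int) -> float:
    """Percentage of found URLs that were downloaded; 0.0 when none were found."""
    if total_urls_found <= 0:
        # No URLs found: report 0.0 rather than dividing by zero
        return 0.0
    return 100.0 * total_downloaded / total_urls_found
```

If the recomputed value disagrees with the gauge, the two are likely using different denominators, for example URLs attempted rather than URLs found.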

## Summary Statistics

Print the headline metrics and download statistics from the report:

In [None]:
from data_loader import get_overview_metrics, extract_quality_stats

metrics = get_overview_metrics(data)
quality = extract_quality_stats(data)

print("=== DATASET SUMMARY ===")
print(f"Total Images: {metrics['total_images']:,}")
print(f"Total Classes: {metrics['total_classes']}")
print(f"Total Categories: {metrics['total_categories']}")
print(f"Success Rate: {metrics['success_rate']:.2f}%")
print(f"Average File Size: {metrics['avg_file_size_mb']:.2f} MB")
print("\n=== DOWNLOAD STATS ===")
print(f"URLs Found: {quality.get('total_urls_found', 0):,}")
print(f"Successfully Downloaded: {quality.get('total_downloaded', 0):,}")
print(f"URLs Found but Missing Metadata: {quality.get('urls_found_but_missing_metadata', 0):,}")

## Export Options

Save charts for reports or presentations:

In [None]:
# Uncomment to save charts
# from plot_helpers import save_plot
# 
# save_plot(overview_fig, 'dataset_overview_dashboard', 'html')
# save_plot(class_fig, 'class_distribution', 'png', width=1000, height=600)
# save_plot(category_fig, 'category_distribution', 'png', width=800, height=600)
# save_plot(success_fig, 'success_rate_gauge', 'png', width=600, height=400)
# 
# print("Charts saved to visualizations/output/")