# Image Scraper Overview Dashboard

This notebook provides a comprehensive overview of the image scraping results.
All data is loaded directly from the metadata JSON files.

In [1]:
# Import required modules
import sys
import os
sys.path.append('..')

from visualizations.data_loader import (
    load_all_metadata,
    create_summary_dataframe,
    create_image_details_dataframe,
    get_duplicate_analysis,
    get_category_stats,
    get_temporal_stats
)

from visualizations.plotters import (
    create_summary_dashboard,
    plot_category_distribution
)

print("✅ Modules imported successfully")

✅ Modules imported successfully


In [2]:
# Load all metadata
metadata_list = load_all_metadata()

if len(metadata_list) == 0:
    print("❌ No metadata files found. Please ensure you have run the image scraper.")
else:
    print(f"✅ Loaded {len(metadata_list)} metadata files")

Found 100 metadata files
✅ Successfully loaded 100 metadata files
✅ Loaded 100 metadata files
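`load_all_metadata` lives in `visualizations/data_loader.py` and is not shown in this notebook. As a rough sketch of what such a loader might do, the snippet below walks a directory tree and parses every JSON file it finds. The function name `load_metadata_sketch`, the `*.json` pattern, and the returned record shape are assumptions for illustration, not the project's actual code.

```python
import json
from pathlib import Path


def load_metadata_sketch(root):
    """Hypothetical loader: parse every *.json file under `root`.

    Returns a list of {"file_path": str, "data": parsed JSON} records.
    """
    records = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except (OSError, ValueError) as exc:
            # Malformed JSON raises a ValueError subclass; skip the file
            # instead of aborting the whole load.
            print(f"Skipping {path}: {exc}")
            continue
        records.append({"file_path": str(path), "data": data})
    return records
```

Skipping unreadable files keeps one corrupt metadata file from blocking the rest of the analysis, at the cost of needing to watch the printed warnings.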


In [3]:
# Create DataFrames
summary_df = create_summary_dataframe(metadata_list)
image_df = create_image_details_dataframe(metadata_list)

print(f"📊 Summary: {len(summary_df)} classes")
print(f"📷 Images: {len(image_df)} total image entries")
print(f"✅ Downloaded: {image_df['has_download_data'].eq(True).sum()} images")

# Display first few rows of summary
print("\n📋 Sample Summary Data:")
display(summary_df.head())

📊 Summary: 100 classes
📷 Images: 49150 total image entries
✅ Downloaded: 49017 images

📋 Sample Summary Data:


| | category | class_name | search_key | urls_requested | urls_found | total_images | downloaded_images | download_success_rate | file_path |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Glow | Adobong Kangkong | Adobong Kangkong | 500 | 500 | 500 | 499 | 99.8 | `../output\metadata\Glow\Adobong Kangkong\Adobo...` |
| 1 | Glow | Adobong Sitaw | Adobong Sitaw | 500 | 500 | 500 | 500 | 100.0 | `../output\metadata\Glow\Adobong Sitaw\AdobongS...` |
| 2 | Glow | Adobong Talong | Adobong Talong | 500 | 500 | 500 | 497 | 99.4 | `../output\metadata\Glow\Adobong Talong\Adobong...` |
| 3 | Glow | Ampalaya con Itlog | Ampalaya con Itlog | 500 | 500 | 500 | 500 | 100.0 | `../output\metadata\Glow\Ampalaya con Itlog\Amp...` |
| 4 | Glow | Bulanglang | Bulanglang | 500 | 500 | 500 | 500 | 100.0 | `../output\metadata\Glow\Bulanglang\Bulanglang_...` |


In [4]:
# Analyze duplicates
duplicate_stats = get_duplicate_analysis(image_df)

print("🔄 Duplicate Analysis:")
print(f"   Total Images: {duplicate_stats['total_images']}")
print(f"   Unique Hashes: {duplicate_stats['unique_hashes']}")
print(f"   Duplicates: {duplicate_stats['duplicate_count']}")
print(f"   Duplicate Rate: {duplicate_stats['duplicate_rate']:.1f}%")
print(f"   Inter-class Duplicates: {len(duplicate_stats['inter_class_duplicates'])}")
print(f"   Intra-class Duplicates: {len(duplicate_stats['intra_class_duplicates'])}")

🔄 Duplicate Analysis:
   Total Images: 49017
   Unique Hashes: 44677
   Duplicates: 4340
   Duplicate Rate: 8.9%
   Inter-class Duplicates: 3334
   Intra-class Duplicates: 26
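The printed figures are consistent with counting duplicates as total images minus unique hashes (49017 - 44677 = 4340). The sketch below shows one way such an analysis could be computed from plain records. The key names `hash` and `class_name`, the function name, and the inter/intra-class definitions are assumptions; `get_duplicate_analysis` itself is not shown here and may define these groups differently.

```python
from collections import defaultdict


def duplicate_analysis_sketch(records, hash_key="hash", class_key="class_name"):
    """Hypothetical duplicate analysis over a list of image records (dicts)."""
    classes_by_hash = defaultdict(list)
    for rec in records:
        classes_by_hash[rec[hash_key]].append(rec[class_key])

    total = len(records)
    unique = len(classes_by_hash)
    duplicate_count = total - unique  # every copy beyond the first of each hash

    inter, intra = [], []
    for image_hash, classes in classes_by_hash.items():
        if len(classes) < 2:
            continue  # hash appears once, so it is not a duplicate
        # Assumed definitions: a hash seen in more than one class is
        # inter-class; a hash repeated within a single class is intra-class.
        if len(set(classes)) > 1:
            inter.append(image_hash)
        else:
            intra.append(image_hash)

    return {
        "total_images": total,
        "unique_hashes": unique,
        "duplicate_count": duplicate_count,
        "duplicate_rate": 100.0 * duplicate_count / total if total else 0.0,
        "inter_class_duplicates": inter,
        "intra_class_duplicates": intra,
    }
```

If the hashes live in a DataFrame, `image_df.to_dict("records")` produces a list of dicts in the shape this function expects.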


In [5]:
# Get category statistics
category_stats = get_category_stats(summary_df)

print("📊 Category Statistics:")
for category, stats in category_stats.items():
    print(f"\n{category}:")
    print(f"  Classes: {stats['total_classes']}")
    print(f"  Images: {stats['total_images']}")
    print(f"  Avg per class: {stats['avg_images_per_class']:.1f}")
    print(f"  Success rate: {stats['avg_success_rate']:.1f}%")

📊 Category Statistics:

Glow:
  Classes: 34
  Images: 16147
  Avg per class: 474.9
  Success rate: 99.6%

Go:
  Classes: 33
  Images: 16486
  Avg per class: 499.6
  Success rate: 99.9%

Grow:
  Classes: 33
  Images: 16384
  Avg per class: 496.5
  Success rate: 99.7%
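The per-category image totals above sum to 49017, which matches the downloaded-image count, and each "Avg per class" equals images divided by classes. A minimal sketch of that aggregation over summary rows follows. The column names come from the sample summary table; the function name and the choice to average per-class success rates without weighting are assumptions about how `get_category_stats` works.

```python
from collections import defaultdict


def category_stats_sketch(rows):
    """Hypothetical per-category aggregation over summary rows (dicts)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(row)

    stats = {}
    for category, members in sorted(grouped.items()):
        n_classes = len(members)
        images = sum(r["downloaded_images"] for r in members)
        stats[category] = {
            "total_classes": n_classes,
            "total_images": images,
            "avg_images_per_class": images / n_classes,
            # Unweighted mean of the per-class success rates (an assumption).
            "avg_success_rate": sum(r["download_success_rate"] for r in members) / n_classes,
        }
    return stats
```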


In [6]:
# Get temporal statistics
temporal_stats = get_temporal_stats(image_df)

if temporal_stats['has_temporal_data']:
    print("⏱️ Temporal Statistics:")
    print(f"  Images with timestamps: {temporal_stats['total_images']}")
    print(f"  Earliest download: {temporal_stats['earliest_download']}")
    print(f"  Latest download: {temporal_stats['latest_download']}")
    print(f"  Duration: {temporal_stats['duration_hours']:.1f} hours")
    print(f"  Avg interval: {temporal_stats['avg_interval_seconds']:.1f} seconds")
else:
    print("⏱️ No temporal data available")

⏱️ Temporal Statistics:
  Images with timestamps: 49017
  Earliest download: 2025-05-24 09:57:21.152048
  Latest download: 2025-05-30 08:45:46.744266
  Duration: 142.8 hours
  Avg interval: 10.5 seconds
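The duration and interval figures can be reproduced from the earliest and latest timestamps alone. The sketch below is one plausible way to derive them. The function name is hypothetical, and treating the average interval as the span divided by the number of gaps between downloads is an assumption about `get_temporal_stats`.

```python
from datetime import datetime


def temporal_stats_sketch(timestamps):
    """Hypothetical temporal summary for a list of datetime objects."""
    if not timestamps:
        return {"has_temporal_data": False}

    earliest, latest = min(timestamps), max(timestamps)
    span_seconds = (latest - earliest).total_seconds()
    gaps = len(timestamps) - 1  # number of intervals between downloads
    return {
        "has_temporal_data": True,
        "total_images": len(timestamps),
        "earliest_download": earliest,
        "latest_download": latest,
        "duration_hours": span_seconds / 3600,
        "avg_interval_seconds": span_seconds / gaps if gaps else 0.0,
    }


# The two endpoints reported above give the same 142.8 hour duration.
endpoints = [
    datetime.fromisoformat("2025-05-24 09:57:21.152048"),
    datetime.fromisoformat("2025-05-30 08:45:46.744266"),
]
print(f"{temporal_stats_sketch(endpoints)['duration_hours']:.1f} hours")
```

Note that the average covers idle time between scraping sessions as well, so it overstates the delay between consecutive downloads within a session.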


In [7]:
# Create comprehensive dashboard
create_summary_dashboard(summary_df, image_df, duplicate_stats)

📊 DATASET SUMMARY
📁 Total Classes: 100
📸 Total Images: 49,017
✅ Average Success Rate: 99.7%
💾 Average File Size: 0.2 MB
📦 Total Storage: 10730.9 MB
🔄 Duplicate Rate: 8.9%


In [8]:
# Category distribution analysis
plot_category_distribution(summary_df, use_plotly=True)

## Key Insights Summary

Based on the analysis above, here are the key insights from your image scraping operation:

### Dataset Overview
- **Total Classes**: 100, split across Glow (34), Go (33), and Grow (33)
- **Total Images**: 49,017 downloaded out of 49,150 image entries
- **Success Rate**: 99.7% average download success across all classes

### Quality Assessment
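- **Duplicate Rate**: 8.9% (4,340 redundant copies among 49,017 images, leaving 44,677 unique hashes)
- **File Quality**: Average file size is 0.2 MB; see the dashboard for the distribution of image formats and resolutions
- **Storage Efficiency**: 10,730.9 MB in total; removing duplicates would free roughly 950 MB if duplicates are of average size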

### Recommendations
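1. **Classes with low success rates** should be reviewed for URL quality; with a 99.7% average, any class well below that stands out
2. **The 8.9% duplicate rate** suggests a need for better deduplication, particularly across classes (3,334 inter-class versus 26 intra-class duplicates)
3. **Storage optimization** can be achieved by removing duplicates
4. **Category balance** is nearly even by class count, but Glow averages 474.9 images per class versus 499.6 for Go and 496.5 for Grow, so Glow classes that fall short should be reviewed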
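To act on the first recommendation, the classes below a chosen success rate can be listed directly from the summary rows. This is a minimal sketch: the column names come from the sample summary table, while the function name and the 99.5% threshold are illustrative assumptions rather than project settings.

```python
def flag_low_success(rows, threshold=99.5):
    """Return (class_name, success_rate) pairs below `threshold`, worst first."""
    flagged = [
        (row["class_name"], row["download_success_rate"])
        for row in rows
        if row["download_success_rate"] < threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1])


# Values taken from the sample summary rows shown earlier in the notebook.
sample_rows = [
    {"class_name": "Adobong Kangkong", "download_success_rate": 99.8},
    {"class_name": "Adobong Sitaw", "download_success_rate": 100.0},
    {"class_name": "Adobong Talong", "download_success_rate": 99.4},
]
for name, rate in flag_low_success(sample_rows):
    print(f"{name}: {rate:.1f}%")
```

With the full dataset, `summary_df.to_dict("records")` would supply the rows.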