# Duplicate Analysis

This notebook focuses on identifying and analyzing duplicate images:
- Inter-class duplicates (same image in different classes)
- Intra-class duplicates (same image multiple times in one class)
- Storage impact and optimization opportunities
- Detailed duplicate reports
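
The two duplicate types above can be illustrated with a small sketch. The internals of `get_duplicate_analysis` are not shown in this notebook, so the helper below (`classify_duplicates`), the toy data, and the `class_name` column are assumptions made for illustration only. A hash that occurs in more than one class is treated as inter-class, even if it also repeats inside one of those classes.

```python
import pandas as pd

# Hypothetical toy data; 'hash' and 'class_name' are assumed column names.
toy_df = pd.DataFrame({
    "hash": ["aaa", "aaa", "bbb", "bbb", "ccc"],
    "class_name": ["Adobo", "Sinigang", "Okoy", "Okoy", "Biko"],
})

def classify_duplicates(df):
    """Split duplicated hashes into inter-class and intra-class groups."""
    inter, intra = {}, {}
    for hash_val, group in df.groupby("hash"):
        if len(group) < 2:
            continue  # a hash seen only once is not a duplicate
        classes = sorted(group["class_name"].unique())
        if len(classes) > 1:
            # Same image stored under several classes
            inter[hash_val] = {"classes": classes, "count": len(group)}
        else:
            # Same image stored several times under one class
            intra[hash_val] = {"class": classes[0], "count": len(group)}
    return inter, intra

inter, intra = classify_duplicates(toy_df)
print(inter)
print(intra)
```

Here `aaa` lands in the inter-class group, `bbb` in the intra-class group, and `ccc` is ignored because it occurs once.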

In [1]:
# Import required modules
import sys
import os
sys.path.append('..')

from visualizations.data_loader import (
    load_all_metadata,
    create_image_details_dataframe,
    get_duplicate_analysis
)

from visualizations.plotters import (
    plot_duplicate_analysis
)

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print("✅ Modules imported successfully")

✅ Modules imported successfully


In [2]:
# Load data
metadata_list = load_all_metadata()
image_df = create_image_details_dataframe(metadata_list)

print(f"📊 Analyzing {len(image_df)} total image entries")

# Filter to downloaded images with hashes
downloaded_df = image_df[
    (image_df['has_download_data'] == True) & 
    (image_df['hash'].notna()) & 
    (image_df['hash'] != '')
].copy()

print(f"📷 {len(downloaded_df)} images available for duplicate analysis")

Found 100 metadata files
✅ Successfully loaded 100 metadata files
📊 Analyzing 49188 total image entries
📷 49041 images available for duplicate analysis


In [3]:
# Perform duplicate analysis
duplicate_stats = get_duplicate_analysis(image_df)

print("🔄 DUPLICATE ANALYSIS RESULTS:")
print("=" * 40)
print(f"Total Images: {duplicate_stats['total_images']:,}")
print(f"Unique Images: {duplicate_stats['unique_hashes']:,}")
print(f"Duplicate Images: {duplicate_stats['duplicate_count']:,}")
print(f"Duplicate Rate: {duplicate_stats['duplicate_rate']:.1f}%")
print(f"Inter-class Duplicates: {len(duplicate_stats['inter_class_duplicates'])} unique hashes")
print(f"Intra-class Duplicates: {len(duplicate_stats['intra_class_duplicates'])} unique hashes")

🔄 DUPLICATE ANALYSIS RESULTS:
Total Images: 49,041
Unique Images: 44,701
Duplicate Images: 4,340
Duplicate Rate: 8.8%
Inter-class Duplicates: 3334 unique hashes
Intra-class Duplicates: 26 unique hashes
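
The reported figures are consistent with a duplicate count defined as total images minus unique hashes. That definition is an assumption, since `get_duplicate_analysis` is not shown here, but the arithmetic can be checked directly from the printed values:

```python
# Values copied from the output above
total_images = 49_041
unique_hashes = 44_701

# Assumed definition: every copy beyond the first of each hash is a duplicate
duplicate_count = total_images - unique_hashes
duplicate_rate = duplicate_count / total_images * 100

print(f"Duplicate Images: {duplicate_count:,}")
print(f"Duplicate Rate: {duplicate_rate:.1f}%")
```

This reproduces the 4,340 duplicate images and the 8.8% duplicate rate shown above.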


In [4]:
# Visualize duplicate analysis
if duplicate_stats['total_images'] > 0:
    plot_duplicate_analysis(duplicate_stats, use_plotly=True)

In [5]:
# Detailed inter-class duplicate analysis
def analyze_inter_class_duplicates(duplicate_stats, image_df):
    inter_class = duplicate_stats['inter_class_duplicates']
    
    if not inter_class:
        print("✅ No inter-class duplicates found!")
        return
    
    print(f"🔀 INTER-CLASS DUPLICATE ANALYSIS:")
    print("=" * 45)
    print(f"Found {len(inter_class)} unique images appearing in multiple classes\n")
    
    # Create detailed analysis
    inter_class_details = []
    
    for hash_val, info in inter_class.items():
        classes = info['classes']
        files = info['files']
        count = info['count']
        
        # Get additional details from image_df
        hash_images = image_df[image_df['hash'] == hash_val]
        if len(hash_images) > 0:
            sample_image = hash_images.iloc[0]
            file_size_mb = sample_image['bytes'] / 1048576 if sample_image['bytes'] > 0 else 0
            dimensions = f"{sample_image['width']}x{sample_image['height']}"
        else:
            file_size_mb = 0
            dimensions = "Unknown"
        
        inter_class_details.append({
            'hash': hash_val[:12] + '...',  # Truncated hash for display
            'classes_affected': ', '.join(classes),
            'total_copies': count,
            'file_size_mb': file_size_mb,
            'dimensions': dimensions,
            'wasted_storage_mb': file_size_mb * (count - 1)  # Storage that could be saved
        })
    
    inter_class_df = pd.DataFrame(inter_class_details)
    inter_class_df = inter_class_df.sort_values('wasted_storage_mb', ascending=False)
    
    print("Top 10 inter-class duplicates by storage waste:")
    display(inter_class_df.head(10))
    
    # Summary statistics
    total_wasted_storage = inter_class_df['wasted_storage_mb'].sum()
    avg_copies_per_duplicate = inter_class_df['total_copies'].mean()
    
    print(f"\n📊 Inter-class Duplicate Summary:")
    print(f"   💾 Total wasted storage: {total_wasted_storage:.1f} MB")
    print(f"   📈 Average copies per duplicate: {avg_copies_per_duplicate:.1f}")
    print(f"   🏆 Largest duplicate: {inter_class_df.iloc[0]['file_size_mb']:.1f} MB ({inter_class_df.iloc[0]['total_copies']} copies)")
    
    return inter_class_df

inter_class_df = analyze_inter_class_duplicates(duplicate_stats, downloaded_df)

🔀 INTER-CLASS DUPLICATE ANALYSIS:
Found 3334 unique images appearing in multiple classes

Top 10 inter-class duplicates by storage waste:


| index | hash | classes_affected | total_copies | file_size_mb | dimensions | wasted_storage_mb |
|---|---|---|---|---|---|---|
| 16 | 222b9745d76b... | Bangsilog, Cornsilog, Hotsilog, Spamsilog, Tap... | 5 | 14.204645 | 5184.0x3456.0 | 56.818581 |
| 498 | f638252f8337... | Pakbet Ilokano, Pakbet Tagalog, Pinakbet | 3 | 10.227165 | 5472.0x3648.0 | 20.45433 |
| 63 | 8c00b8da312a... | Adobong Sitaw, Ginataang Kalabasa at Sitaw, Gi... | 4 | 6.215278 | 3000.0x2000.0 | 18.645833 |
| 562 | 7ad6b796b472... | Adobong Baboy, Adobong Dilaw, Adobong Pula | 3 | 9.117071 | 6000.0x4000.0 | 18.234142 |
| 264 | d57f67f481c7... | Adobong Sitaw, Ginisang Sitaw, Adobong Baboy | 3 | 8.106813 | 2500.0x2500.0 | 16.213627 |
| 518 | a91b5224f875... | Champorado, Pancit Malabon, Pancit Palabok | 3 | 6.934529 | 4608.0x3456.0 | 13.869059 |
| 3 | a49a09269354... | Adobong Sitaw, Chopsuey, Ginataang Kalabasa at... | 18 | 0.577042 | 1640.0x850.0 | 9.809708 |
| 2070 | 9ec97490815d... | Bulanglang, Sinigang na Baka | 2 | 9.223146 | 3286.0x4929.0 | 9.223146 |
| 2504 | 8cc27bf02d75... | Adobong Pula, Sinigang na Bangus | 2 | 6.545164 | 4608.0x3456.0 | 6.545164 |
| 1914 | f3c56b5e74c3... | Bibingka, Puto Bumbong | 2 | 5.550042 | 4608.0x3456.0 | 5.550042 |



📊 Inter-class Duplicate Summary:
   💾 Total wasted storage: 975.1 MB
   📈 Average copies per duplicate: 2.3
   🏆 Largest duplicate: 14.2 MB (5 copies)


In [6]:
# Detailed intra-class duplicate analysis
def analyze_intra_class_duplicates(duplicate_stats, image_df):
    intra_class = duplicate_stats['intra_class_duplicates']
    
    if not intra_class:
        print("✅ No intra-class duplicates found!")
        return
    
    print(f"🔁 INTRA-CLASS DUPLICATE ANALYSIS:")
    print("=" * 45)
    print(f"Found {len(intra_class)} unique images duplicated within the same class\n")
    
    # Create detailed analysis
    intra_class_details = []
    
    for hash_val, info in intra_class.items():
        class_name = info['class']
        files = info['files']
        count = info['count']
        
        # Get additional details from image_df
        hash_images = image_df[image_df['hash'] == hash_val]
        if len(hash_images) > 0:
            sample_image = hash_images.iloc[0]
            file_size_mb = sample_image['bytes'] / 1048576 if sample_image['bytes'] > 0 else 0
            dimensions = f"{sample_image['width']}x{sample_image['height']}"
            category = sample_image['category']
        else:
            file_size_mb = 0
            dimensions = "Unknown"
            category = "Unknown"
        
        intra_class_details.append({
            'hash': hash_val[:12] + '...',  # Truncated hash for display
            'category': category,
            'class_name': class_name,
            'total_copies': count,
            'file_size_mb': file_size_mb,
            'dimensions': dimensions,
            'wasted_storage_mb': file_size_mb * (count - 1)  # Storage that could be saved
        })
    
    intra_class_df = pd.DataFrame(intra_class_details)
    intra_class_df = intra_class_df.sort_values('wasted_storage_mb', ascending=False)
    
    print("Top 10 intra-class duplicates by storage waste:")
    display(intra_class_df.head(10))
    
    # Summary statistics
    total_wasted_storage = intra_class_df['wasted_storage_mb'].sum()
    avg_copies_per_duplicate = intra_class_df['total_copies'].mean()
    
    # Classes with most duplicates
    classes_with_duplicates = intra_class_df.groupby('class_name').agg({
        'total_copies': 'sum',
        'wasted_storage_mb': 'sum'
    }).sort_values('wasted_storage_mb', ascending=False)
    
    print(f"\n📊 Intra-class Duplicate Summary:")
    print(f"   💾 Total wasted storage: {total_wasted_storage:.1f} MB")
    print(f"   📈 Average copies per duplicate: {avg_copies_per_duplicate:.1f}")
    print(f"   🔥 Classes with most duplicates:")
    for class_name, row in classes_with_duplicates.head(5).iterrows():
        print(f"      {class_name}: {row['wasted_storage_mb']:.1f} MB wasted")
    
    return intra_class_df

intra_class_df = analyze_intra_class_duplicates(duplicate_stats, downloaded_df)

🔁 INTRA-CLASS DUPLICATE ANALYSIS:
Found 26 unique images duplicated within the same class

Top 10 intra-class duplicates by storage waste:


| index | hash | category | class_name | total_copies | file_size_mb | dimensions | wasted_storage_mb |
|---|---|---|---|---|---|---|---|
| 4 | a615d59eb752... | Glow | Okoy | 2 | 0.583369 | 800.0x800.0 | 0.583369 |
| 24 | 4382b59014d6... | Grow | Kalderetang Kambing | 2 | 0.208222 | 736.0x1564.0 | 0.208222 |
| 7 | 5eeb11e0e465... | Glow | Ensaladang Ampalaya | 2 | 0.188663 | 1600.0x1067.0 | 0.188663 |
| 0 | 4994491e1e4a... | Go | Espasol | 2 | 0.183537 | 640.0x567.0 | 0.183537 |
| 22 | 8a77650685a0... | Grow | Inihaw na Liempo | 2 | 0.178895 | 1280.0x720.0 | 0.178895 |
| 21 | a4162e06dcc0... | Grow | Inasal na Isol | 2 | 0.157674 | 1200.0x754.0 | 0.157674 |
| 19 | 87913144227c... | Grow | Adobong Baboy | 2 | 0.148787 | 1124.0x1124.0 | 0.148787 |
| 2 | b1bff126d11b... | Go | Biko | 2 | 0.095818 | 496.0x1033.0 | 0.095818 |
| 6 | c2014e04b5a5... | Glow | Okoy | 2 | 0.083466 | 612.0x612.0 | 0.083466 |
| 8 | 0d6b824d61f9... | Go | Arroz Caldo | 2 | 0.079165 | 474.0x948.0 | 0.079165 |



📊 Intra-class Duplicate Summary:
   💾 Total wasted storage: 2.7 MB
   📈 Average copies per duplicate: 2.0
   🔥 Classes with most duplicates:
      Okoy: 0.7 MB wasted
      Kalderetang Kambing: 0.2 MB wasted
      Espasol: 0.2 MB wasted
      Ensaladang Ampalaya: 0.2 MB wasted
      Inihaw na Liempo: 0.2 MB wasted


In [7]:
# Storage impact visualization
def visualize_storage_impact(duplicate_stats, image_df):
    if duplicate_stats['total_images'] == 0:
        return
    
    # Calculate storage metrics
    total_storage = image_df[image_df['has_download_data'] == True]['bytes'].sum() / 1048576  # MB
    
    # Estimate duplicate storage
    inter_storage = 0
    intra_storage = 0
    
    for hash_val, info in duplicate_stats['inter_class_duplicates'].items():
        hash_images = image_df[image_df['hash'] == hash_val]
        if len(hash_images) > 0:
            file_size = hash_images.iloc[0]['bytes'] / 1048576
            inter_storage += file_size * (info['count'] - 1)
    
    for hash_val, info in duplicate_stats['intra_class_duplicates'].items():
        hash_images = image_df[image_df['hash'] == hash_val]
        if len(hash_images) > 0:
            file_size = hash_images.iloc[0]['bytes'] / 1048576
            intra_storage += file_size * (info['count'] - 1)
    
    total_duplicate_storage = inter_storage + intra_storage
    unique_storage = total_storage - total_duplicate_storage
    
    # Create storage impact visualization
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'Storage Breakdown',
            'Potential Savings by Type',
            'Duplicate Storage Impact',
            'Optimization Opportunity'
        ),
        specs=[[{"type": "pie"}, {"type": "bar"}],
               [{"type": "indicator"}, {"type": "bar"}]]
    )
    
    # Storage breakdown pie chart
    fig.add_trace(
        go.Pie(
            labels=['Unique Content', 'Duplicate Content'],
            values=[unique_storage, total_duplicate_storage],
            hole=0.4,
            marker_colors=['#2E8B57', '#DC143C']
        ),
        row=1, col=1
    )
    
    # Potential savings by type
    fig.add_trace(
        go.Bar(
            x=['Inter-class', 'Intra-class'],
            y=[inter_storage, intra_storage],
            text=[f"{inter_storage:.1f} MB", f"{intra_storage:.1f} MB"],
            textposition='auto',
            marker_color=['#FF6B6B', '#4ECDC4']
        ),
        row=1, col=2
    )
    
    # Storage impact indicator
    savings_percentage = (total_duplicate_storage / total_storage * 100) if total_storage > 0 else 0
    
    fig.add_trace(
        go.Indicator(
            mode="gauge+number+delta",
            value=savings_percentage,
            title={'text': "Potential Savings (%)"},
            gauge={
                'axis': {'range': [0, 50]},
                'bar': {'color': "darkred"},
                'steps': [
                    {'range': [0, 10], 'color': "lightgreen"},
                    {'range': [10, 25], 'color': "yellow"},
                    {'range': [25, 50], 'color': "orange"}
                ],
                'threshold': {
                    'line': {'color': "red", 'width': 4},
                    'thickness': 0.75,
                    'value': 30
                }
            },
            delta={'reference': 10, 'valueformat': ".1f"}
        ),
        row=2, col=1
    )
    
    # Optimization comparison
    fig.add_trace(
        go.Bar(
            x=['Current Storage', 'After Deduplication'],
            y=[total_storage, unique_storage],
            text=[f"{total_storage:.1f} MB", f"{unique_storage:.1f} MB"],
            textposition='auto',
            marker_color=['#FF7675', '#74B9FF']
        ),
        row=2, col=2
    )
    
    fig.update_layout(
        title_text="💾 Storage Impact Analysis",
        title_x=0.5,
        height=700,
        showlegend=False
    )
    
    fig.show()
    
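    # Guard against division by zero, mirroring the savings_percentage check above
    unique_percentage = (unique_storage / total_storage * 100) if total_storage > 0 else 0

    print("💾 STORAGE IMPACT SUMMARY:")
    print(f"   📦 Total Storage: {total_storage:.1f} MB")
    print(f"   ✅ Unique Content: {unique_storage:.1f} MB ({unique_percentage:.1f}%)")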
    print(f"   🔄 Duplicate Content: {total_duplicate_storage:.1f} MB ({savings_percentage:.1f}%)")
    print(f"   💰 Potential Savings: {total_duplicate_storage:.1f} MB")
    print(f"   🔀 Inter-class waste: {inter_storage:.1f} MB")
    print(f"   🔁 Intra-class waste: {intra_storage:.1f} MB")

if duplicate_stats['total_images'] > 0:
    visualize_storage_impact(duplicate_stats, downloaded_df)

💾 STORAGE IMPACT SUMMARY:
   📦 Total Storage: 10737.2 MB
   ✅ Unique Content: 9759.3 MB (90.9%)
   🔄 Duplicate Content: 977.9 MB (9.1%)
   💰 Potential Savings: 977.9 MB
   🔀 Inter-class waste: 975.1 MB
   🔁 Intra-class waste: 2.7 MB
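
The per-hash loops in the cell above filter the full DataFrame once for every duplicated hash, which can become slow on larger datasets. A single `groupby` could compute the same "size of one copy times extra copies" figure in one pass. The sketch below uses hypothetical toy data and a hypothetical helper name (`wasted_storage_mb`). It assumes every copy of a hash has the same byte size and does not separate inter-class from intra-class waste.

```python
import pandas as pd

# Hypothetical toy frame with the same 'hash' and 'bytes' columns used above
toy_df = pd.DataFrame({
    "hash": ["aaa", "aaa", "aaa", "bbb", "bbb", "ccc"],
    "bytes": [2_097_152, 2_097_152, 2_097_152, 1_048_576, 1_048_576, 524_288],
})

def wasted_storage_mb(df):
    """Wasted MB per duplicated hash: one copy's size times the extra copies."""
    per_hash = df.groupby("hash")["bytes"].agg(one_copy="first", copies="count")
    per_hash = per_hash[per_hash["copies"] > 1]  # keep duplicated hashes only
    return per_hash["one_copy"] * (per_hash["copies"] - 1) / 1_048_576

waste = wasted_storage_mb(toy_df)
print(waste)
print(f"Total wasted: {waste.sum():.1f} MB")
```

In the toy data, `aaa` wastes two extra 2 MB copies and `bbb` one extra 1 MB copy, for 5.0 MB in total.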


In [8]:
# Generate deduplication recommendations
def generate_deduplication_recommendations(duplicate_stats, image_df):
    if duplicate_stats['total_images'] == 0:
        print("No data available for recommendations")
        return
    
    recommendations = []
    
    # High duplicate rate
    if duplicate_stats['duplicate_rate'] > 20:
        recommendations.append({
            'priority': 'HIGH',
            'issue': f"High duplicate rate ({duplicate_stats['duplicate_rate']:.1f}%)",
            'action': 'Implement automated deduplication process',
            'impact': 'Significant storage savings and improved data quality'
        })
    
    # Inter-class duplicates
    if len(duplicate_stats['inter_class_duplicates']) > 0:
        recommendations.append({
            'priority': 'HIGH',
            'issue': f"{len(duplicate_stats['inter_class_duplicates'])} inter-class duplicates found",
            'action': 'Review class definitions and search terms for overlap',
            'impact': 'Reduces confusion between classes and improves model training'
        })
    
    # Intra-class duplicates
    if len(duplicate_stats['intra_class_duplicates']) > 0:
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': f"{len(duplicate_stats['intra_class_duplicates'])} intra-class duplicates found",
            'action': 'Implement hash-based filtering during download process',
            'impact': 'Saves storage space and processing time'
        })
    
    # Storage impact
    total_storage = image_df[image_df['has_download_data'] == True]['bytes'].sum() / 1048576
    if total_storage > 1000:  # > 1GB
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': f"Large dataset ({total_storage:.1f} MB) with duplicates",
            'action': 'Consider implementing tiered storage or compression',
            'impact': 'Reduces storage costs and improves access speed'
        })
    
    print("🎯 DEDUPLICATION RECOMMENDATIONS:")
    print("=" * 45)
    
    if not recommendations:
        print("✅ Low duplicate rate detected. Current deduplication strategy is working well!")
    else:
        for i, rec in enumerate(recommendations, 1):
            print(f"\n{i}. [{rec['priority']}] {rec['issue']}")
            print(f"   Action: {rec['action']}")
            print(f"   Impact: {rec['impact']}")
    
    # Specific action items, numbered dynamically so the list always starts at 1
    print("\n📋 IMMEDIATE ACTION ITEMS:")
    print("=" * 30)
    
    action_items = []
    
    if len(duplicate_stats['inter_class_duplicates']) > 0:
        action_items.append("Review inter-class duplicates to identify search term overlap")
        action_items.append("Consider moving duplicate images to a common 'shared' category")
    
    if len(duplicate_stats['intra_class_duplicates']) > 0:
        action_items.append("Run deduplication script to remove intra-class duplicates")
        action_items.append("Add hash checking to prevent future duplicates")
    
    if duplicate_stats['duplicate_rate'] > 10:
        action_items.append("Implement pre-download duplicate checking")
        action_items.append("Consider using perceptual hashing for similar (not identical) images")
    
    for i, item in enumerate(action_items, 1):
        print(f"{i}. {item}")
    
    # Calculate ROI of deduplication
    if duplicate_stats['duplicate_count'] > 0:
        avg_file_size = image_df[image_df['has_download_data'] == True]['bytes'].mean() / 1048576
        storage_savings = duplicate_stats['duplicate_count'] * avg_file_size
        
        print(f"\n💰 ESTIMATED BENEFITS:")
        print(f"   Storage savings: ~{storage_savings:.1f} MB")
        print(f"   Processing time reduction: ~{duplicate_stats['duplicate_rate']:.1f}%")
        print(f"   Improved data quality: Reduced redundancy")

generate_deduplication_recommendations(duplicate_stats, downloaded_df)

🎯 DEDUPLICATION RECOMMENDATIONS:

1. [HIGH] 3334 inter-class duplicates found
   Action: Review class definitions and search terms for overlap
   Impact: Reduces confusion between classes and improves model training

2. [MEDIUM] 26 intra-class duplicates found
   Action: Implement hash-based filtering during download process
   Impact: Saves storage space and processing time

3. [MEDIUM] Large dataset (10737.2 MB) with duplicates
   Action: Consider implementing tiered storage or compression
   Impact: Reduces storage costs and improves access speed

📋 IMMEDIATE ACTION ITEMS:
1. Review inter-class duplicates to identify search term overlap
2. Consider moving duplicate images to a common 'shared' category
3. Run deduplication script to remove intra-class duplicates
4. Add hash checking to prevent future duplicates

💰 ESTIMATED BENEFITS:
   Storage savings: ~950.2 MB
   Processing time reduction: ~8.8%
   Improved data quality: Reduced redundancy


## Duplicate Analysis Summary

This analysis quantifies duplicate images in the dataset. The figures below come from the run shown above:

### Key Findings
- **Overall Duplicate Rate**: 8.8% (4,340 of 49,041 hashed images are duplicates)
- **Inter-class Duplicates**: 3,334 unique hashes appear in more than one class
- **Intra-class Duplicates**: 26 unique hashes are repeated within a single class
- **Storage Impact**: About 977.9 MB of 10,737.2 MB (9.1%) is duplicate content, nearly all of it (975.1 MB) inter-class

### Types of Duplicates

#### Inter-class Duplicates
- **Problem**: The same image appears in multiple classes
- **Impact**: One image carries several conflicting labels, which can confuse machine learning models during training
- **Solution**: Review search terms and class definitions for overlap

#### Intra-class Duplicates
- **Problem**: Same image downloaded multiple times for one class
- **Impact**: Wastes storage and processing time
- **Solution**: Implement hash-based filtering
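
Hash-based filtering at download time could look roughly like the sketch below. The hash algorithm behind this notebook's `hash` column is not stated, so SHA-256 is an assumption, and `DuplicateFilter` is a hypothetical helper rather than part of the project's code. It rejects only intra-class repeats. Keying on the digest alone would also reject inter-class repeats.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Content hash of the raw image bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()

class DuplicateFilter:
    """Tracks the content hashes already accepted for each class."""

    def __init__(self):
        self._seen = {}  # class name -> set of hex digests

    def is_new(self, class_name: str, data: bytes) -> bool:
        digest = sha256_hex(data)
        seen = self._seen.setdefault(class_name, set())
        if digest in seen:
            return False  # identical bytes already stored for this class
        seen.add(digest)
        return True

dedup = DuplicateFilter()
results = [
    dedup.is_new("Okoy", b"image-bytes-1"),
    dedup.is_new("Okoy", b"image-bytes-1"),  # same bytes, same class: rejected
    dedup.is_new("Biko", b"image-bytes-1"),  # same bytes, other class: accepted
]
print(results)  # → [True, False, True]
```

A downloader would call `is_new` before writing each file and skip the write when it returns `False`.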

### Optimization Opportunities
1. **Immediate**: Remove duplicate files to save storage
2. **Short-term**: Implement duplicate checking in download process
3. **Long-term**: Use perceptual hashing for similar image detection
4. **Ongoing**: Regular deduplication maintenance
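
Perceptual hashing, mentioned as the long-term option, compares images by visual content rather than exact bytes. The sketch below is a simplified average hash written with NumPy on synthetic grayscale arrays. It is an illustration under those assumptions, not the method used by this project or by any particular library, and the function names are hypothetical.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Simplified average hash of a 2-D grayscale array (flat boolean array)."""
    h, w = gray.shape
    # Crop so both dimensions divide evenly into hash_size blocks
    gray = gray[: h - h % hash_size, : w - w % hash_size].astype(float)
    bh, bw = gray.shape[0] // hash_size, gray.shape[1] // hash_size
    # Block-average down to a hash_size x hash_size thumbnail
    small = gray.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # Each bit records whether a block is brighter than the thumbnail mean
    return (small > small.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest similar images."""
    return int(np.count_nonzero(hash_a != hash_b))

rng = np.random.default_rng(0)
original = rng.integers(0, 200, size=(64, 64))
brighter = original + 10  # near-duplicate: uniform brightness shift
unrelated = rng.integers(0, 200, size=(64, 64))

d_near = hamming_distance(average_hash(original), average_hash(brighter))
d_far = hamming_distance(average_hash(original), average_hash(unrelated))
print(d_near, d_far)
```

A uniform brightness shift leaves every bit unchanged, so the shifted copy has distance 0 even though its bytes (and exact hash) differ. In practice a small distance threshold would be chosen empirically to flag near-duplicates.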

### Next Steps
1. **Review Details**: Examine specific duplicate files listed above
2. **Implement Deduplication**: Run automated deduplication tools
3. **Prevent Future Duplicates**: Add hash checking to scraping process
4. **Monitor Progress**: Track duplicate rates over time