# Filoma ML Split Visualization Demo

This notebook demonstrates the new visualization capabilities for analyzing ML data splits created with `filoma.ml.split_data`.

The plot module provides comprehensive analysis and visualization for:
- **Balance Analysis**: Split size distributions and balance assessment
- **Feature Distribution**: Analysis of feature distributions across splits
- **Leakage Detection**: Identification of potential data leakage between splits
- **File Characteristics**: Analysis of file metadata (sizes, extensions, etc.)

## Key Features:
- 🎨 **Dual Mode**: Interactive plots (with matplotlib/seaborn) or rich text summaries
- 🚀 **Lazy Loading**: Fast imports with optional visualization dependencies
- 🔍 **Data-Centric**: Focus on data quality and split validation
- 🔧 **Non-invasive**: Wrapper pattern that doesn't modify existing ML functionality

In [None]:
# Standard imports
import sys

import polars as pl

# Add filoma to path (if running from repo)
sys.path.insert(0, "src")

# Import filoma
import filoma.plot as plot

## Check Visualization Availability

The plot module gracefully handles missing visualization dependencies:

In [None]:
# Check if visualization is available
status = plot.check_plotting_available()
print(f"Plotting available: {status['available']}")
if not status["available"]:
    print(f"Missing packages: {status['missing']}")
    print("Install with: pip install 'filoma[viz]'")
    print("\n📋 Text summaries will be provided instead of interactive plots")
else:
    print("\n📊 Interactive plots available!")
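
One way such an availability check can be built is with `importlib.util.find_spec`, which looks a package up without importing it. The sketch below is an illustration only: `check_optional_packages` is a hypothetical helper, not filoma's implementation, and its return keys simply mirror the `available` and `missing` keys read in the cell above.

```python
from importlib.util import find_spec


def check_optional_packages(packages):
    """Report which of the given top-level packages can be imported.

    Hypothetical helper for illustration; not part of filoma.
    """
    # find_spec returns None when a top-level package is not installed
    missing = [name for name in packages if find_spec(name) is None]
    return {"available": not missing, "missing": missing}


# "json" ships with Python; the second name is deliberately bogus
status_demo = check_optional_packages(["json", "not_a_real_package_xyz"])
print(status_demo)
```

Because nothing is imported during the check, this style keeps the import of the main package fast even when heavy plotting libraries are installed.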

## Create Sample Data

Let's create sample file data that mimics what filoma would produce:

In [None]:
# Create sample file data
sample_files = [
    {"path": "/data/images/train/cat_001.jpg", "size": 1024000, "extension": "jpg", "depth": 4},
    {"path": "/data/images/train/dog_001.jpg", "size": 1536000, "extension": "jpg", "depth": 4},
    {"path": "/data/images/train/bird_001.png", "size": 2048000, "extension": "png", "depth": 4},
    {"path": "/data/docs/train/readme.txt", "size": 5000, "extension": "txt", "depth": 4},
    {"path": "/data/images/val/cat_002.jpg", "size": 1200000, "extension": "jpg", "depth": 4},
    {"path": "/data/docs/val/info.pdf", "size": 150000, "extension": "pdf", "depth": 4},
    {"path": "/data/images/test/dog_002.png", "size": 1800000, "extension": "png", "depth": 4},
    {"path": "/data/images/test/bird_002.jpg", "size": 1300000, "extension": "jpg", "depth": 4},
    {"path": "/data/docs/test/results.csv", "size": 75000, "extension": "csv", "depth": 4},
]

# Convert to Polars DataFrame
df = pl.DataFrame(sample_files)

# Add some derived features for analysis
df = df.with_columns(
    [
        (pl.col("size") / 1_000_000).alias("size_mb"),  # Size in MB (10^6 bytes)
        pl.when(pl.col("extension").is_in(["jpg", "png"]))
        .then(pl.lit("image"))
        .otherwise(pl.lit("document"))
        .alias("file_type"),
        # Parent directory name; for these paths it is the split folder (train/val/test)
        pl.col("path").str.extract(r"/(\w+)/[^/]+$", 1).alias("category"),
    ]
)

print("📁 Sample dataset created:")
print(df)

## Simulate ML Split Creation

In a real workflow, you would use `filoma.ml.split_data()`. Here we simulate the result:

In [None]:
# Split the data (simulating ml.split_data output)
train_df = df.filter(pl.col("path").str.contains("/train/"))
val_df = df.filter(pl.col("path").str.contains("/val/"))
test_df = df.filter(pl.col("path").str.contains("/test/"))

# Create the splits tuple (as returned by ml.split_data)
splits = (train_df, val_df, test_df)
split_names = ["train", "validation", "test"]

print(f"📊 Created splits with sizes: {[len(s) for s in splits]}")
print(f"🏷️  Split names: {split_names}")
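
Filtering on folder names works here because the sample paths already encode the split. When they do not, a common technique for reproducible splits is to hash a stable key (such as the parent folder) and map the hash to a split, so that files sharing a key always land together. The sketch below shows that generic technique using only the standard library; `assign_split` and its ratios are hypothetical, and this is not a description of how `filoma.ml.split_data` works internally.

```python
import hashlib
from pathlib import PurePosixPath


def assign_split(key, ratios=(0.8, 0.1, 0.1), names=("train", "validation", "test")):
    """Map a string key to a split name deterministically (hypothetical helper)."""
    # Use a stable hash; Python's built-in hash() is salted per process
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Scale the first 8 hex digits to a fraction in [0, 1)
    fraction = int(digest[:8], 16) / 16**8
    cumulative = 0.0
    for name, ratio in zip(names, ratios):
        cumulative += ratio
        if fraction < cumulative:
            return name
    return names[-1]  # Guard against floating-point rounding in the ratios


demo_paths = [
    "/data/project_a/img_001.jpg",
    "/data/project_a/img_002.jpg",
    "/data/project_b/img_001.jpg",
]
# Group by parent folder so files from one folder never straddle two splits
assignments = {p: assign_split(str(PurePosixPath(p).parent)) for p in demo_paths}
print(assignments)
```

Hashing the group key rather than the file path is the design choice that matters here: it trades exact split ratios for the guarantee that related files stay in the same split.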

## Initialize Split Analyzer

Create the analyzer to begin visualization and analysis:

In [None]:
# Create the split analyzer
analyzer = plot.analyze_splits(
    splits=splits,
    split_names=split_names,
    feature="size_mb",  # Feature used for analysis
    original_data=df,
)

print("✅ Split analyzer created successfully!")
print(f"📊 Analyzing {len(analyzer.splits)} splits: {analyzer.split_names}")

## 1. Balance Analysis

Analyze the distribution of samples across splits:

In [None]:
# Analyze split balance
balance_result = analyzer.balance()

print("\n📊 Balance Analysis Results:")
for split, count in balance_result.items():
    print(f"  {split}: {count} files")
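
Beyond raw counts, balance is often judged by comparing each split's share of the data with a target ratio. The sketch below does this for the sample splits (4, 2 and 3 files); the 80/10/10 target and the `split_fractions` helper are assumptions made for illustration, not filoma defaults.

```python
def split_fractions(counts):
    """Return each split's share of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Counts taken from the sample splits above
counts = {"train": 4, "validation": 2, "test": 3}
# Assumed target ratios; choose whatever your project calls for
target = {"train": 0.8, "validation": 0.1, "test": 0.1}

fractions = split_fractions(counts)
deviation = {name: fractions[name] - target[name] for name in counts}
for name in counts:
    print(f"{name}: {fractions[name]:.1%} (target {target[name]:.0%}, off by {deviation[name]:+.1%})")
```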

## 2. Feature Distribution Analysis

Analyze how features are distributed across splits:

In [None]:
# Analyze feature distribution
feature_result = analyzer.feature_distribution()

print("\n📈 Feature Distribution Analysis Complete")
print(f"Feature analyzed: {analyzer.feature}")
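
A quick numerical companion to distribution plots is a per-split summary of the feature. The sketch below computes the mean and median of `size_mb` for the sample splits with the standard library's `statistics` module. The values are copied from the sample data, and this is meant as a cross-check, not as a description of what `feature_distribution()` computes internally.

```python
from statistics import mean, median

# size_mb values per split, copied from the sample data above
size_mb = {
    "train": [1.024, 1.536, 2.048, 0.005],
    "validation": [1.2, 0.15],
    "test": [1.8, 1.3, 0.075],
}

summary = {
    name: {"mean": mean(values), "median": median(values)}
    for name, values in size_mb.items()
}
for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.3f} MB, median={stats['median']:.3f} MB")
```

With only a handful of files per split, sizeable gaps between splits are expected; on a real dataset, a large gap would be a cue to look at how the splits were formed.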

## 3. Data Leakage Detection

Check for potential data leakage between splits:

In [None]:
# Check for data leakage
leakage_result = analyzer.leakage_check()

print("\n🔍 Leakage Analysis Complete")
print("Check the output above for potential issues")
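
The simplest form of leakage is the same file appearing in more than one split. The sketch below checks every pair of splits for shared paths using set intersection; `find_overlaps` is a hypothetical helper, and `leakage_check()` may apply different or additional criteria.

```python
from itertools import combinations


def find_overlaps(split_paths):
    """Return the paths shared by each pair of splits (hypothetical helper)."""
    overlaps = {}
    for (name_a, paths_a), (name_b, paths_b) in combinations(split_paths.items(), 2):
        overlaps[(name_a, name_b)] = set(paths_a) & set(paths_b)
    return overlaps


split_paths = {
    "train": ["/data/images/train/cat_001.jpg", "/data/images/train/dog_001.jpg"],
    "validation": ["/data/images/val/cat_002.jpg"],
    # The first test entry deliberately duplicates a training file
    "test": ["/data/images/train/cat_001.jpg", "/data/images/test/dog_002.png"],
}

overlaps = find_overlaps(split_paths)
for pair, shared in overlaps.items():
    print(pair, sorted(shared))
```

Exact path matches catch only the most direct leaks. Near-duplicates or files from the same source need a coarser key, such as a content hash or a parent folder.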

## 4. File Characteristics Analysis

Analyze file metadata characteristics across splits:

In [None]:
# Analyze file characteristics
char_result = analyzer.characteristics(["size", "extension", "file_type"])

print("\n📋 Characteristics Analysis Complete")
print("Analyzed: file sizes, extensions, and file types")
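
For categorical characteristics such as the extension, a per-split frequency table is often enough to spot a skew. The sketch below builds one with `collections.Counter` from the sample data. It only illustrates the idea and is not the output format of `characteristics()`.

```python
from collections import Counter

# Extensions per split, copied from the sample data above
extensions = {
    "train": ["jpg", "jpg", "png", "txt"],
    "validation": ["jpg", "pdf"],
    "test": ["png", "jpg", "csv"],
}

ext_counts = {name: Counter(values) for name, values in extensions.items()}
# Extensions that occur in some splits but not in all of them
all_exts = set().union(*extensions.values())
uneven = sorted(e for e in all_exts if any(e not in c for c in ext_counts.values()))

for name, counter in ext_counts.items():
    print(name, dict(counter))
print("Not present in every split:", uneven)
```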

## 5. Complete Validation Report

Get a comprehensive validation summary:

In [None]:
# Get complete validation report
validation_result = analyzer.validate()

print("\n🎯 Complete Validation Summary:")
for check, status in validation_result.items():
    emoji = "✅" if status.get("passed", False) else "⚠️"
    print(f"  {emoji} {check}: {status.get('message', 'Unknown')}")
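
The loop above reads a `passed` flag and a `message` from each entry. As a rough sketch of how a report with that shape could be assembled from individual checks, the example below combines a balance check and a leakage check. `run_checks` and its `min_fraction` threshold are hypothetical and are not taken from filoma.

```python
def run_checks(counts, overlap_count, min_fraction=0.05):
    """Build a report of {check_name: {"passed": bool, "message": str}}.

    Hypothetical helper; min_fraction is an arbitrary illustrative threshold.
    """
    total = sum(counts.values())
    smallest = min(counts, key=counts.get)
    smallest_share = counts[smallest] / total
    return {
        "balance": {
            "passed": smallest_share >= min_fraction,
            "message": f"smallest split '{smallest}' holds {smallest_share:.1%} of files",
        },
        "leakage": {
            "passed": overlap_count == 0,
            "message": f"{overlap_count} path(s) shared between splits",
        },
    }


report = run_checks({"train": 4, "validation": 2, "test": 3}, overlap_count=0)
for check, status in report.items():
    marker = "PASS" if status.get("passed", False) else "WARN"
    print(f"  [{marker}] {check}: {status.get('message', 'Unknown')}")
```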

## Summary

This notebook demonstrated the key capabilities of filoma's new plot module:

### ✅ **Completed Features:**
- **Balance Analysis**: Visual and statistical assessment of split distributions
- **Feature Distribution**: Cross-split feature analysis with distribution plots
- **Leakage Detection**: Comprehensive data leakage analysis with heatmaps
- **Characteristics Analysis**: File metadata analysis with flexible visualization
- **Validation Pipeline**: Automated quality checks and reporting

### 🎨 **Visualization Modes:**
- **Interactive**: Full matplotlib/seaborn plots when dependencies are available
- **Text Mode**: Rich formatted summaries when plotting libraries are missing
- **Graceful Fallback**: Seamless experience regardless of environment

### 🚀 **Integration:**
- **Non-invasive**: Works as a wrapper around existing `filoma.ml.split_data` results
- **Lazy Loading**: Fast imports with optional visualization dependencies
- **Data-Centric**: Focus on ML data quality and validation workflows

The plot module extends filoma's ML capabilities with tools for evaluating and validating data splits, a critical component of robust ML pipelines.