# ChunkFlow: Visualization and Analysis

This notebook walks through ChunkFlow's visualization and analysis tools: it evaluates four chunking strategies on one sample document, then ranks, plots, and exports the results.

## What You'll Learn

1. How to use ResultsDataFrame for data analysis
2. How to create publication-quality visualizations
3. How to export results in various formats
4. How to generate a combined comparison dashboard

## Prerequisites

```bash
# Quote the extras so shells such as zsh do not expand the brackets
pip install "chunk-flow[huggingface,viz]"
```

In [None]:
# Import required libraries
import asyncio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline
from chunk_flow.analysis import ResultsDataFrame, StrategyVisualizer

# Set matplotlib style
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ All imports successful!")

## 1. Prepare Data

Let's evaluate multiple strategies on a sample document.

In [None]:
# Sample document about renewable energy
document = """
# Renewable Energy: The Future of Power

Renewable energy comes from natural sources that are constantly replenished. Unlike fossil 
fuels, which take millions of years to form, renewable sources regenerate naturally and 
provide sustainable alternatives to conventional energy.

## Solar Energy

Solar power harnesses energy from the sun using photovoltaic cells or concentrated solar 
power systems. Solar panels convert sunlight directly into electricity through the 
photovoltaic effect. As of 2024, solar energy costs have dropped by over 90% in the past 
decade, making it one of the cheapest energy sources available.

The efficiency of solar panels continues to improve, with modern panels achieving 20-23% 
efficiency. New technologies like perovskite solar cells promise even higher efficiencies 
above 30%. Solar farms now span thousands of acres, generating gigawatts of clean power.

## Wind Energy

Wind turbines convert kinetic energy from wind into electrical power. Modern wind turbines 
can be massive, with blades over 80 meters long and towers exceeding 100 meters in height. 
Offshore wind farms take advantage of stronger, more consistent ocean winds.

Wind power has become highly competitive with fossil fuels. In many regions, new wind 
installations are cheaper than coal or natural gas plants. The global wind energy capacity 
exceeded 1,000 GW in 2023, with continued rapid growth expected.

## Hydroelectric Power

Hydroelectric dams generate electricity by channeling water through turbines. Hydropower 
is the world's largest source of renewable electricity, accounting for about 16% of global 
electricity production. Large dams can generate thousands of megawatts, powering entire cities.

However, large dams can have significant environmental impacts, including ecosystem 
disruption and greenhouse gas emissions from reservoirs. Small-scale hydropower and 
run-of-river systems offer more environmentally friendly alternatives.

## Geothermal Energy

Geothermal power taps into Earth's internal heat. Geothermal plants work best in regions 
with volcanic activity or hot springs. Iceland generates over 25% of its electricity from 
geothermal sources, while Kenya produces 45% of its power geothermally.

Enhanced geothermal systems (EGS) could unlock vast geothermal potential in areas without 
natural hot water sources. This technology involves drilling deep wells and fracturing hot 
rocks to create artificial reservoirs.

## Biomass Energy

Biomass energy comes from organic materials like wood, crops, and waste. Burning biomass 
releases energy, but it also produces carbon dioxide. However, if biomass is sustainably 
managed, new plant growth reabsorbs the CO2, creating a carbon-neutral cycle.

Advanced biofuels derived from algae or agricultural waste show promise. These second and 
third-generation biofuels avoid competing with food crops for land, addressing a key 
criticism of first-generation biofuels like corn ethanol.

## Energy Storage

A major challenge for renewables is intermittency - the sun doesn't always shine, and the 
wind doesn't always blow. Energy storage technologies are crucial for reliability. Lithium-ion 
batteries dominate current storage, but alternatives like flow batteries, compressed air 
storage, and hydrogen are developing rapidly.

Grid-scale battery installations can store hundreds of megawatt-hours, smoothing out supply 
fluctuations. Tesla's Megapack and similar systems are being deployed worldwide, enabling 
higher renewable penetration in electricity grids.

## Smart Grids

Smart grids use digital technology to manage electricity distribution efficiently. They can 
balance supply and demand in real-time, integrate distributed renewable sources, and respond 
to outages automatically. Advanced metering and demand response programs help shift 
consumption to times when renewable generation is abundant.

## Economic Impact

The renewable energy sector employs millions globally. Solar and wind jobs have grown 
rapidly, offering new employment opportunities. The transition to renewables is projected 
to create more jobs than are lost in fossil fuel industries.

Investment in renewables reached $500 billion in 2023. Governments worldwide offer 
incentives like tax credits, feed-in tariffs, and renewable energy certificates to 
accelerate adoption.

## Future Outlook

By 2050, renewables could provide 80% or more of global electricity if current trends 
continue. Continued cost reductions, improved storage, and supportive policies will drive 
this transformation. The age of fossil fuels is ending, and the renewable energy era has begun.
"""

print(f"Document length: {len(document)} characters")

In [None]:
# Create multiple strategies
strategies = {
    "fixed_300": StrategyRegistry.create("fixed_size", {"chunk_size": 300, "overlap": 50}),
    "fixed_600": StrategyRegistry.create("fixed_size", {"chunk_size": 600, "overlap": 100}),
    "recursive": StrategyRegistry.create("recursive", {"chunk_size": 500, "overlap": 80}),
    "markdown": StrategyRegistry.create("markdown", {"respect_headers": True}),
}

# Chunk document with each strategy
chunk_results = {}
for name, strategy in strategies.items():
    result = await strategy.chunk(document, doc_id="renewable_energy")
    chunk_results[name] = result

print(f"Chunked document with {len(strategies)} strategies")

In [None]:
# Generate embeddings
embedder = EmbeddingProviderFactory.create(
    "huggingface",
    {"model": "sentence-transformers/all-MiniLM-L6-v2", "normalize": True}
)

embedding_results = {}
for name, chunk_result in chunk_results.items():
    emb_result = await embedder.embed_texts(chunk_result.chunks)
    embedding_results[name] = emb_result

print("Generated embeddings for all strategies")

In [None]:
# Evaluate all strategies
pipeline = EvaluationPipeline(
    metrics=[
        "semantic_coherence",
        "boundary_quality",
        "chunk_stickiness",
        "topic_diversity",
    ]
)

evaluation_results = {}
for name in strategies:
    eval_result = await pipeline.evaluate(
        chunks=chunk_results[name].chunks,
        embeddings=embedding_results[name].embeddings,
    )
    evaluation_results[name] = eval_result

print("Evaluation complete for all strategies")

## 2. ResultsDataFrame Analysis

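`ResultsDataFrame` wraps the evaluation results in a pandas DataFrame (exposed as `.df`) and adds helpers for summarizing, ranking, filtering, and correlating metrics.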

In [None]:
# Create ResultsDataFrame
results_df = ResultsDataFrame.from_evaluation_results(
    evaluation_results,
    strategy_names=list(strategies.keys())
)

# Display raw data
print("Raw Results:\n")
print(results_df.to_string())

In [None]:
# Get summary statistics
summary = results_df.summary_statistics()

print("\nSummary Statistics:\n")
print(summary)

In [None]:
# Rank strategies by each metric
print("\nRanking by Semantic Coherence:\n")
ranked_coherence = results_df.rank_strategies(by="semantic_coherence", ascending=False)
print(ranked_coherence[["strategy", "semantic_coherence"]])

In [None]:
# Weighted ranking
print("\nWeighted Ranking (emphasizing coherence and boundaries):\n")
ranked_weighted = results_df.rank_strategies(
    weights={
        "semantic_coherence": 2.0,
        "boundary_quality": 2.0,
        "chunk_stickiness": 1.0,
        "topic_diversity": 1.0,
    },
    ascending=False
)
print(ranked_weighted[["strategy", "weighted_score", "semantic_coherence", "boundary_quality"]])
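
The notebook does not show how `rank_strategies` combines the weights into `weighted_score`. One plausible reading is a weighted mean of the metric values. The sketch below illustrates that assumption in plain Python so the arithmetic is visible; the `weighted_score` function and the sample numbers are hypothetical and are not taken from ChunkFlow's implementation.

```python
def weighted_score(scores, weights):
    """Weighted mean of metric values; metrics without a weight are ignored."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight

# Hypothetical metric values, for illustration only
sample_scores = {
    "fixed_300": {"semantic_coherence": 0.60, "boundary_quality": 0.40},
    "markdown": {"semantic_coherence": 0.50, "boundary_quality": 0.70},
}
sample_weights = {"semantic_coherence": 2.0, "boundary_quality": 1.0}

# Rank strategy names from highest to lowest weighted score
ranking = sorted(
    sample_scores,
    key=lambda name: weighted_score(sample_scores[name], sample_weights),
    reverse=True,
)
print(ranking)
```

Under this reading, the weights used in the cell above (2.0, 2.0, 1.0, 1.0) would give coherence and boundary quality one third of the score each, and the other two metrics one sixth each.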

In [None]:
# Filter high-performing strategies
print("\nStrategies with coherence > 0.5:\n")
high_coherence = results_df.filter_by_metric("semantic_coherence", min_value=0.5)
print(high_coherence[["strategy", "semantic_coherence"]])

In [None]:
# Correlation analysis
print("\nMetric Correlations:\n")
correlations = results_df.correlation_analysis()
print(correlations)

## 3. Visualization: Heatmap

Visualize strategy performance across all metrics.

In [None]:
# Create performance heatmap
plt.figure(figsize=(10, 6))

StrategyVisualizer.plot_heatmap(
    data=results_df.df[["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"]].values,
    strategies=list(strategies.keys()),
    metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"],
    title="Strategy Performance Heatmap",
    figsize=(10, 6),
    cmap="RdYlGn",
)

plt.tight_layout()
plt.savefig("heatmap.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Heatmap saved as heatmap.png")

## 4. Visualization: Bar Chart Comparison

Compare strategies side-by-side.

In [None]:
# Create bar chart comparison
plt.figure(figsize=(12, 6))

StrategyVisualizer.plot_bar_comparison(
    data=results_df.df,
    strategies=list(strategies.keys()),
    metrics=["semantic_coherence", "boundary_quality", "topic_diversity"],
    title="Strategy Comparison: Key Metrics",
    figsize=(12, 6),
)

plt.tight_layout()
plt.savefig("bar_comparison.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Bar chart saved as bar_comparison.png")

## 5. Visualization: Radar Chart

Multi-dimensional comparison with radar plots.

In [None]:
# Create radar chart
plt.figure(figsize=(10, 10))

StrategyVisualizer.plot_radar_chart(
    data=results_df.df,
    strategies=list(strategies.keys()),
    metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"],
    title="Multi-Metric Strategy Comparison",
    figsize=(10, 10),
)

plt.tight_layout()
plt.savefig("radar_chart.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Radar chart saved as radar_chart.png")

## 6. Visualization: Box Plot

Show metric distributions across strategies.

In [None]:
# Create box plot
plt.figure(figsize=(10, 6))

StrategyVisualizer.plot_box_plot(
    data=results_df.df,
    metric="semantic_coherence",
    title="Semantic Coherence Distribution",
    figsize=(10, 6),
)

plt.tight_layout()
plt.savefig("box_plot.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Box plot saved as box_plot.png")

## 7. Visualization: Correlation Matrix

Understand relationships between metrics.
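
Each cell of a correlation matrix is a correlation coefficient between two metric columns. Assuming the Pearson coefficient is used (the default for pandas `DataFrame.corr`; the notebook does not say which method ChunkFlow applies), a single entry can be computed by hand as sketched below. The `pearson` helper and the values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for four strategies: stickiness falls linearly as coherence rises
coherence = [0.50, 0.60, 0.70, 0.80]
stickiness = [0.40, 0.35, 0.30, 0.25]

r = pearson(coherence, stickiness)
print(round(r, 4))
```

Because the made-up stickiness values are an exact linear function of coherence, the coefficient here is -1 up to floating-point rounding. With only four strategies, each coefficient in the real matrix rests on four points, so read it as a description of this run rather than a general relationship.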

In [None]:
# Create correlation matrix
plt.figure(figsize=(8, 6))

StrategyVisualizer.plot_correlation_matrix(
    data=results_df.df[["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"]],
    title="Metric Correlation Matrix",
    figsize=(8, 6),
    cmap="coolwarm",
)

plt.tight_layout()
plt.savefig("correlation_matrix.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Correlation matrix saved as correlation_matrix.png")

## 8. Visualization: Scatter Plot

Compare two metrics directly.

In [None]:
# Create scatter plot
plt.figure(figsize=(8, 6))

StrategyVisualizer.plot_scatter(
    data=results_df.df,
    x_metric="semantic_coherence",
    y_metric="boundary_quality",
    strategies=list(strategies.keys()),
    title="Coherence vs Boundary Quality",
    figsize=(8, 6),
)

plt.tight_layout()
plt.savefig("scatter_plot.png", dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Scatter plot saved as scatter_plot.png")

## 9. Generate Dashboard

Create a comprehensive dashboard with all visualizations.

In [None]:
# Generate comprehensive dashboard
StrategyVisualizer.create_dashboard(
    data=results_df.df,
    strategies=list(strategies.keys()),
    metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"],
    output_path="strategy_dashboard.png",
    title="ChunkFlow Strategy Analysis Dashboard",
)

print("\n✓ Dashboard saved as strategy_dashboard.png")

## 10. Export Results

Export results in various formats for sharing and further analysis.

In [None]:
# Export to CSV
results_df.to_csv("results.csv")
print("✓ Exported to results.csv")

# Export to JSON
results_df.to_json("results.json")
print("✓ Exported to results.json")

# Export to Parquet (efficient binary format)
results_df.to_parquet("results.parquet")
print("✓ Exported to results.parquet")

# Export to Excel (if openpyxl is installed)
try:
    results_df.to_excel("results.xlsx")
    print("✓ Exported to results.xlsx")
except ImportError:
    print("⚠ Excel export requires openpyxl: pip install openpyxl")
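
Before sharing an export, it can help to read the file back and confirm that nothing was lost. The sketch below shows the idea with the standard-library `csv` module and an in-memory buffer; the rows and column names are hypothetical stand-ins, not the actual layout that `results_df.to_csv` writes.

```python
import csv
import io

# Hypothetical rows standing in for exported results
rows = [
    {"strategy": "fixed_300", "semantic_coherence": "0.6123"},
    {"strategy": "markdown", "semantic_coherence": "0.7011"},
]

# Write the rows to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["strategy", "semantic_coherence"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back and compare with what was written
buffer.seek(0)
restored = list(csv.DictReader(buffer))
assert restored == rows  # every value comes back as a string
```

CSV stores everything as text, so numeric columns have to be parsed again on load; Parquet keeps column types, which is one reason to prefer it for further analysis.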

## 11. Custom Analysis

Perform custom pandas operations on the results.

In [None]:
# Access the underlying pandas DataFrame; copy it so the new column
# added below does not leak into results_df
df = results_df.df.copy()

# Calculate composite score (custom formula)
df['composite_score'] = (
    df['semantic_coherence'] * 0.3 + 
    df['boundary_quality'] * 0.3 + 
    (1 - df['chunk_stickiness']) * 0.2 +  # Inverted: lower stickiness is better (assumes values in [0, 1])
    df['topic_diversity'] * 0.2
)

# Sort by composite score
df_sorted = df.sort_values('composite_score', ascending=False)

print("\nCustom Composite Score Ranking:\n")
print(df_sorted[['strategy', 'composite_score', 'semantic_coherence', 'boundary_quality']].to_string(index=False))
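
The `(1 - chunk_stickiness)` term only works as an inversion if stickiness lies in [0, 1], which the notebook does not state. The sketch below makes that assumption explicit: it orients every metric so that higher is better and fails loudly on out-of-range values. The `orient` and `composite` helpers and the example scores are hypothetical.

```python
LOWER_IS_BETTER = {"chunk_stickiness"}  # per the notebook's note on stickiness

def orient(metric, value):
    """Return a higher-is-better version of a metric assumed to lie in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{metric}={value} is outside the assumed [0, 1] range")
    return 1.0 - value if metric in LOWER_IS_BETTER else value

def composite(scores, weights):
    """Weighted sum of oriented metric values."""
    return sum(orient(m, scores[m]) * w for m, w in weights.items())

# Same weights as the composite formula in the cell above
weights = {
    "semantic_coherence": 0.3,
    "boundary_quality": 0.3,
    "chunk_stickiness": 0.2,
    "topic_diversity": 0.2,
}
# Hypothetical scores for one strategy
example = {
    "semantic_coherence": 0.7,
    "boundary_quality": 0.6,
    "chunk_stickiness": 0.25,
    "topic_diversity": 0.5,
}
score = composite(example, weights)
print(f"{score:.2f}")
```

If a metric turns out not to be bounded by 1, rescale it (for example with min-max scaling across strategies) before inverting it.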

In [None]:
# Find best strategy for each metric
print("\nBest Strategy per Metric:\n")
metrics = ['semantic_coherence', 'boundary_quality', 'topic_diversity']

for metric in metrics:
    best_strategy = df.loc[df[metric].idxmax(), 'strategy']
    best_score = df[metric].max()
    print(f"{metric:25s} → {best_strategy:15s} ({best_score:.4f})")

# Chunk stickiness: lower is better
best_stickiness_strategy = df.loc[df['chunk_stickiness'].idxmin(), 'strategy']
best_stickiness_score = df['chunk_stickiness'].min()
print(f"{'chunk_stickiness':25s} → {best_stickiness_strategy:15s} ({best_stickiness_score:.4f})")

In [None]:
# Statistical comparison
print("\nStatistical Summary:\n")
print(df[['semantic_coherence', 'boundary_quality', 'chunk_stickiness', 'topic_diversity']].describe())

## 12. Interactive Exploration

Tips for interactive exploration in Jupyter.

In [None]:
# Use pandas styling for better visualization
styled = results_df.df.style.background_gradient(cmap='RdYlGn', subset=['semantic_coherence', 'boundary_quality'])
styled = styled.background_gradient(cmap='RdYlGn_r', subset=['chunk_stickiness'])  # Inverted for stickiness
styled = styled.format({
    'semantic_coherence': '{:.4f}',
    'boundary_quality': '{:.4f}',
    'chunk_stickiness': '{:.4f}',
    'topic_diversity': '{:.4f}',
})

styled

## Insights from Visualizations

### Key Takeaways:

1. **Heatmap** shows overall performance patterns
   - Identify strategies that excel across multiple metrics
   - Spot weaknesses (red cells) that need attention; for `chunk_stickiness`, where lower is better, the red (low) cells are the good ones

2. **Bar Chart** enables direct comparison
   - See which strategy wins for each metric
   - Understand trade-offs between strategies

3. **Radar Chart** reveals multi-dimensional performance
   - Larger areas indicate better overall performance, provided lower-is-better metrics such as `chunk_stickiness` are inverted first
   - Asymmetric shapes show specialized strengths

4. **Correlation Matrix** uncovers metric relationships
   - Positive correlations: metrics tend to move together
   - Negative correlations: trade-offs exist

5. **Scatter Plot** shows pairwise relationships
   - Ideal strategies appear in top-right (high on both)
   - Outliers may indicate unique characteristics
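
The "larger area" reading of a radar chart can be made concrete. For n equally spaced axes with values r_0 ... r_(n-1), each pair of neighboring axes spans a triangle of area 0.5 * r_i * r_(i+1) * sin(2π/n), and the polygon area is the sum of those triangles. The sketch below computes it for made-up values; `radar_area` is a hypothetical helper, not part of ChunkFlow.

```python
import math

def radar_area(radii):
    """Area of the polygon traced on a radar chart with equally spaced axes."""
    n = len(radii)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    angle = 2 * math.pi / n
    # Sum the triangles spanned by each pair of neighboring axes
    return 0.5 * math.sin(angle) * sum(
        radii[i] * radii[(i + 1) % n] for i in range(n)
    )

# The same four hypothetical values, plotted in two different axis orders
order_a = [0.9, 0.9, 0.2, 0.2]
order_b = [0.9, 0.2, 0.9, 0.2]
print(radar_area(order_a), radar_area(order_b))
```

The two orders contain identical values but enclose different areas, so compare radar shapes only when every strategy is drawn with the same axis order.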

### Best Practices:

- **Compare apples to apples**: Use the same document for all strategies
- **Consider context**: Different use cases may prioritize different metrics
- **Look for trade-offs**: No strategy is perfect for everything
- **Validate on multiple documents**: A single document may not be representative
- **Export and share**: Use visualizations to communicate findings
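
To act on the multiple-documents advice, one option is to collect one score per document for each strategy and compare means alongside the spread. The sketch below does this with the standard-library `statistics` module; the scores are made up, and how you gather them from ChunkFlow is left open.

```python
from statistics import mean, stdev

# Hypothetical semantic_coherence scores: strategy -> one score per document
scores_by_strategy = {
    "fixed_300": [0.50, 0.70, 0.60],
    "markdown": [0.62, 0.60, 0.64],
}

# Mean, sample standard deviation, and document count per strategy
summary = {
    name: {"mean": mean(values), "stdev": stdev(values), "n": len(values)}
    for name, values in scores_by_strategy.items()
}

# Print strategies from highest to lowest mean score
for name, stats in sorted(summary.items(), key=lambda kv: kv[1]["mean"], reverse=True):
    print(f"{name:10s} mean={stats['mean']:.3f} stdev={stats['stdev']:.3f} n={stats['n']}")
```

In this made-up data the two means are close while the spreads differ by a factor of five, which is the kind of difference a single-document comparison cannot show.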

## Summary

In this notebook, you learned:

✅ How to use ResultsDataFrame for analysis
✅ How to create 6 types of charts (heatmap, bar, radar, box plot, correlation matrix, scatter)
✅ How to generate comprehensive dashboards
✅ How to export results in multiple formats (CSV, JSON, Parquet, Excel)
✅ How to perform custom pandas analysis
✅ How to interpret visualizations for insights

## Next Steps

- **Notebook 05**: Using the ChunkFlow REST API
- Try with your own documents and strategies
- Experiment with different metric combinations
- Share visualizations with your team