# Stats Compass Core - Multi-DataFrame Operations

This notebook demonstrates how stats-compass-core manages and combines multiple named DataFrames within a single session, which is the core of its stateful approach to data orchestration.

**Features demonstrated:**
- Managing multiple DataFrames in a single session
- SQL-style merges (inner, left, right, outer joins)
- Concatenating DataFrames (vertical and horizontal)
- Chaining operations across DataFrames

In [1]:
# Setup
from stats_compass_core import DataFrameState, registry
import pandas as pd
import numpy as np

# Initialize state
state = DataFrameState()
registry.auto_discover()
print("Stats Compass Core ready!")

Stats Compass Core ready!


---
## Part 1: Creating Multiple DataFrames

Let's simulate a realistic scenario: an e-commerce database with customers, orders, and products.

In [2]:
# Create Customers table
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['London', 'Paris', 'Berlin', 'Madrid', 'Rome'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-01'])
})
state.set_dataframe(customers, name='customers', operation='create')
print("Customers:")
customers

Customers:


,customer_id,name,city,signup_date
0,1,Alice,London,2023-01-15
1,2,Bob,Paris,2023-02-20
2,3,Charlie,Berlin,2023-03-10
3,4,Diana,Madrid,2023-04-05
4,5,Eve,Rome,2023-05-01


In [3]:
# Create Orders table
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106, 107],
    'customer_id': [1, 2, 1, 3, 2, 6, 1],  # Note: customer 6 doesn't exist!
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P001', 'P002', 'P003'],
    'quantity': [2, 1, 3, 1, 2, 1, 1],
    'order_date': pd.to_datetime(['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04', '2023-06-05', '2023-06-06', '2023-06-07'])
})
state.set_dataframe(orders, name='orders', operation='create')
print("Orders:")
orders

Orders:


,order_id,customer_id,product_id,quantity,order_date
0,101,1,P001,2,2023-06-01
1,102,2,P002,1,2023-06-02
2,103,1,P001,3,2023-06-03
3,104,3,P003,1,2023-06-04
4,105,2,P001,2,2023-06-05
5,106,6,P002,1,2023-06-06
6,107,1,P003,1,2023-06-07


In [4]:
# Create Products table
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004'],
    'product_name': ['Widget', 'Gadget', 'Gizmo', 'Doohickey'],
    'price': [29.99, 49.99, 19.99, 99.99],
    'category': ['Electronics', 'Electronics', 'Tools', 'Premium']
})
state.set_dataframe(products, name='products', operation='create')
print("Products:")
products

Products:


,product_id,product_name,price,category
0,P001,Widget,29.99,Electronics
1,P002,Gadget,49.99,Electronics
2,P003,Gizmo,19.99,Tools
3,P004,Doohickey,99.99,Premium


In [5]:
# List all DataFrames in state
from stats_compass_core.data.list_dataframes import list_dataframes, ListDataFramesInput

result = list_dataframes(state, ListDataFramesInput())
print(f"DataFrames in state: {result.total_count}\n")
for df_info in result.dataframes:
    print(f"  - {df_info['name']}: {df_info['shape'][0]} rows × {df_info['shape'][1]} cols")

DataFrames in state: 3

  - customers: 5 rows × 4 cols
  - orders: 7 rows × 5 cols
  - products: 4 rows × 4 cols


---
## Part 2: SQL-Style Merges

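Use `merge_dataframes` to join two DataFrames held in state on a shared key column, much like a SQL JOIN. The `how` argument selects the join type, and `save_as` stores the result in state under a new name.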
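
For readers who think in pandas, the same joins can be sketched with `pandas.merge`. This is plain pandas, not the Stats Compass API, and it assumes (without confirmation from this notebook) that `merge_dataframes` follows pandas join semantics. The `indicator=True` flag is a quick way to audit keys before choosing a join type, for example to find orders whose `customer_id` has no matching customer.

```python
import pandas as pd

# Trimmed copies of the notebook's tables (plain pandas, no Stats Compass state).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106, 107],
    "customer_id": [1, 2, 1, 3, 2, 6, 1],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
audit = orders.merge(customers, on="customer_id", how="outer", indicator=True)

# Orders whose customer_id has no match in customers.
orphan_orders = (
    audit.loc[audit["_merge"] == "left_only", "order_id"].astype(int).tolist()
)

# Customers who never placed an order.
idle_customers = audit.loc[audit["_merge"] == "right_only", "name"].tolist()

print(orphan_orders)   # [106]
print(idle_customers)
```

Rows flagged `left_only` are the ones an inner join would drop, and rows flagged `right_only` appear only in a right or outer join.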

In [6]:
from stats_compass_core.data.merge_dataframes import merge_dataframes, MergeDataFramesInput

In [7]:
# INNER JOIN: Orders with customer names (only matching records)
result = merge_dataframes(state, MergeDataFramesInput(
    left_dataframe='orders',
    right_dataframe='customers',
    how='inner',
    on='customer_id',
    save_as='orders_with_customers'
))

print(result.message)
print("\nNote: Order 106 (customer_id=6) was excluded - no matching customer!")
state.get_dataframe('orders_with_customers')

Merged 'orders' (7 rows) with 'customers' (5 rows) using INNER JOIN on customer_id. Result: 6 rows.

Note: Order 106 (customer_id=6) was excluded - no matching customer!


,order_id,customer_id,product_id,quantity,order_date,name,city,signup_date
0,101,1,P001,2,2023-06-01,Alice,London,2023-01-15
1,102,2,P002,1,2023-06-02,Bob,Paris,2023-02-20
2,103,1,P001,3,2023-06-03,Alice,London,2023-01-15
3,104,3,P003,1,2023-06-04,Charlie,Berlin,2023-03-10
4,105,2,P001,2,2023-06-05,Bob,Paris,2023-02-20
5,107,1,P003,1,2023-06-07,Alice,London,2023-01-15


In [8]:
# LEFT JOIN: All orders, with customer info where available
result = merge_dataframes(state, MergeDataFramesInput(
    left_dataframe='orders',
    right_dataframe='customers',
    how='left',
    on='customer_id',
    save_as='orders_left_join'
))

print(result.message)
print("\nNote: Order 106 is included, but customer info is NaN:")
state.get_dataframe('orders_left_join')

Merged 'orders' (7 rows) with 'customers' (5 rows) using LEFT JOIN on customer_id. Result: 7 rows.

Note: Order 106 is included, but customer info is NaN:


,order_id,customer_id,product_id,quantity,order_date,name,city,signup_date
0,101,1,P001,2,2023-06-01,Alice,London,2023-01-15
1,102,2,P002,1,2023-06-02,Bob,Paris,2023-02-20
2,103,1,P001,3,2023-06-03,Alice,London,2023-01-15
3,104,3,P003,1,2023-06-04,Charlie,Berlin,2023-03-10
4,105,2,P001,2,2023-06-05,Bob,Paris,2023-02-20
5,106,6,P002,1,2023-06-06,,,NaT
6,107,1,P003,1,2023-06-07,Alice,London,2023-01-15
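
Right joins are listed among the features but are not run in this notebook. In pandas terms, a right join keeps every row of the right-hand table, so swapping the inputs of the left join above and using `how='right'` should return the same seven orders. The sketch below is plain pandas and assumes, without confirmation from this notebook, that `merge_dataframes` passes `how` through to `pandas.merge`.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106, 107],
    "customer_id": [1, 2, 1, 3, 2, 6, 1],
})

# how="right" keeps all rows of the right table (orders) and fills the
# customer columns with NaN where no customer_id matches.
right_join = customers.merge(orders, on="customer_id", how="right")

print(len(right_join))
print(right_join[right_join["name"].isna()])
```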


In [9]:
# OUTER JOIN: All customers and all orders
result = merge_dataframes(state, MergeDataFramesInput(
    left_dataframe='customers',
    right_dataframe='orders',
    how='outer',
    on='customer_id',
    save_as='full_outer'
))

print(result.message)
print("\nCustomers 4 & 5 have no orders, customer 6 has no profile:")
state.get_dataframe('full_outer')

Merged 'customers' (5 rows) with 'orders' (7 rows) using OUTER JOIN on customer_id. Result: 9 rows.

Customers 4 & 5 have no orders, customer 6 has no profile:


,customer_id,name,city,signup_date,order_id,product_id,quantity,order_date
0,1,Alice,London,2023-01-15,101.0,P001,2.0,2023-06-01
1,1,Alice,London,2023-01-15,103.0,P001,3.0,2023-06-03
2,1,Alice,London,2023-01-15,107.0,P003,1.0,2023-06-07
3,2,Bob,Paris,2023-02-20,102.0,P002,1.0,2023-06-02
4,2,Bob,Paris,2023-02-20,105.0,P001,2.0,2023-06-05
5,3,Charlie,Berlin,2023-03-10,104.0,P003,1.0,2023-06-04
6,4,Diana,Madrid,2023-04-05,,,,NaT
7,5,Eve,Rome,2023-05-01,,,,NaT
8,6,,,NaT,106.0,P002,1.0,2023-06-06
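
The `.0` suffixes on `order_id` and `quantity` in this output are standard pandas behavior: an integer column that gains missing values is upcast to `float64`. The pandas-only sketch below reproduces the effect on a two-row example and shows one way to keep whole numbers, by converting to the nullable `Int64` dtype. Whether Stats Compass offers its own option for this is not covered here.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 4], "name": ["Alice", "Diana"]})
orders = pd.DataFrame({"order_id": [101], "customer_id": [1], "quantity": [2]})

merged = customers.merge(orders, on="customer_id", how="outer")

# Diana has no order, so order_id and quantity contain NaN and become float64.
print(merged.dtypes[["order_id", "quantity"]])

# Nullable integers keep 101 as 101 and represent the gap as <NA>.
fixed = merged.astype({"order_id": "Int64", "quantity": "Int64"})
print(fixed)
```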


In [10]:
# Chain merge: Add product details to orders
# First, merge orders with customers, then with products

# Step 1: Orders + Customers
merge_dataframes(state, MergeDataFramesInput(
    left_dataframe='orders',
    right_dataframe='customers',
    how='left',
    on='customer_id',
    save_as='temp_orders_customers'
))

# Step 2: (Orders + Customers) + Products
result = merge_dataframes(state, MergeDataFramesInput(
    left_dataframe='temp_orders_customers',
    right_dataframe='products',
    how='left',
    on='product_id',
    save_as='complete_orders'
))

print("Complete order details with customer and product info:")
df = state.get_dataframe('complete_orders')
df[['order_id', 'name', 'product_name', 'quantity', 'price']]

Complete order details with customer and product info:


,order_id,name,product_name,quantity,price
0,101,Alice,Widget,2,29.99
1,102,Bob,Gadget,1,49.99
2,103,Alice,Widget,3,29.99
3,104,Charlie,Gizmo,1,19.99
4,105,Bob,Widget,2,29.99
5,106,,Gadget,1,49.99
6,107,Alice,Gizmo,1,19.99


In [11]:
# Calculate order totals on a copy, then write the result back to state
df = state.get_dataframe('complete_orders').copy()
df['total'] = df['quantity'] * df['price']
state.set_dataframe(df, name='complete_orders', operation='calculate_totals')

print("Order totals:")
df[['order_id', 'name', 'product_name', 'quantity', 'price', 'total']]

Order totals:


,order_id,name,product_name,quantity,price,total
0,101,Alice,Widget,2,29.99,59.98
1,102,Bob,Gadget,1,49.99,49.99
2,103,Alice,Widget,3,29.99,89.97
3,104,Charlie,Gizmo,1,19.99,19.99
4,105,Bob,Widget,2,29.99,59.98
5,106,,Gadget,1,49.99,49.99
6,107,Alice,Gizmo,1,19.99,19.99


---
## Part 3: Concatenating DataFrames

Use `concat_dataframes` to stack DataFrames vertically (add rows) or horizontally (add columns).
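
In pandas terms, the two directions correspond to the `axis` argument of `pandas.concat`. The sketch below is plain pandas, not the Stats Compass API. Whether `concat_dataframes` renumbers rows the way `ignore_index=True` does is an assumption.

```python
import pandas as pd

jan = pd.DataFrame({"revenue": [1200, 1500], "region": ["North", "South"]})
feb = pd.DataFrame({"revenue": [1400, 1600], "region": ["North", "South"]})

# axis=0 stacks rows; ignore_index=True renumbers them 0..n-1
# instead of repeating each input's own 0, 1 labels.
stacked = pd.concat([jan, feb], axis=0, ignore_index=True)

# axis=1 places frames side by side, matching rows by index label.
targets = pd.DataFrame({"target": [1300, 1550]})
side_by_side = pd.concat([jan, targets], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 3)
```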

In [12]:
from stats_compass_core.data.concat_dataframes import concat_dataframes, ConcatDataFramesInput

In [13]:
# Scenario: Monthly sales data arriving in batches
jan_sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-15', '2024-01-25']),
    'revenue': [1200, 1500, 1800],
    'region': ['North', 'South', 'East']
})
state.set_dataframe(jan_sales, name='jan_sales', operation='create')

feb_sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-02-05', '2024-02-15', '2024-02-25']),
    'revenue': [1400, 1600, 2000],
    'region': ['North', 'South', 'East']
})
state.set_dataframe(feb_sales, name='feb_sales', operation='create')

mar_sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-03-05', '2024-03-15', '2024-03-25']),
    'revenue': [1700, 1900, 2200],
    'region': ['North', 'South', 'East']
})
state.set_dataframe(mar_sales, name='mar_sales', operation='create')

print("Created 3 monthly sales DataFrames")

Created 3 monthly sales DataFrames


In [14]:
# Vertical concatenation: Stack all months together
result = concat_dataframes(state, ConcatDataFramesInput(
    dataframes=['jan_sales', 'feb_sales', 'mar_sales'],
    axis=0,  # Stack rows
    save_as='q1_sales'
))

print(result.message)
print("\nQ1 Sales (all months combined):")
state.get_dataframe('q1_sales')

Concatenated 3 DataFrames vertically (rows): 'jan_sales' (3×3), 'feb_sales' (3×3), 'mar_sales' (3×3). Result: 9 rows × 3 columns.

Q1 Sales (all months combined):


,date,revenue,region
0,2024-01-05,1200,North
1,2024-01-15,1500,South
2,2024-01-25,1800,East
3,2024-02-05,1400,North
4,2024-02-15,1600,South
5,2024-02-25,2000,East
6,2024-03-05,1700,North
7,2024-03-15,1900,South
8,2024-03-25,2200,East


In [15]:
# Horizontal concatenation: Combine different feature sets
demographics = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'income': [50000, 75000, 90000, 120000]
})
state.set_dataframe(demographics, name='demographics', operation='create')

behavior = pd.DataFrame({
    'visits': [10, 25, 5, 15],
    'purchases': [2, 8, 1, 4]
})
state.set_dataframe(behavior, name='behavior', operation='create')

result = concat_dataframes(state, ConcatDataFramesInput(
    dataframes=['demographics', 'behavior'],
    axis=1,  # Add columns
    save_as='customer_features'
))

print(result.message)
print("\nCombined customer features:")
state.get_dataframe('customer_features')

Concatenated 2 DataFrames horizontally (columns): 'demographics' (4×2), 'behavior' (4×2). Result: 4 rows × 4 columns.

Combined customer features:


,age,income,visits,purchases
0,25,50000,10,2
1,35,75000,25,8
2,45,90000,5,1
3,55,120000,15,4
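
Horizontal concatenation in pandas pairs rows by index label, not by position. It lines up cleanly here because `demographics` and `behavior` share the same default index (0 to 3). Assuming `concat_dataframes` delegates to `pandas.concat`, which this notebook does not confirm, inputs with different indexes would produce NaN-padded rows. A pandas-only sketch of the pitfall and a common fix:

```python
import pandas as pd

demographics = pd.DataFrame({"age": [25, 35, 45, 55]})

# Same four customers in the same order, but with a shifted index
# (for example after slicing a larger table).
behavior = pd.DataFrame({"visits": [10, 25, 5, 15]}, index=[4, 5, 6, 7])

# No labels in common: the result has 8 rows, each half empty.
misaligned = pd.concat([demographics, behavior], axis=1)

# Resetting the index pairs rows by position. Only do this when the
# two inputs are known to be in the same row order.
aligned = pd.concat([demographics, behavior.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (8, 2)
print(aligned.shape)     # (4, 2)
```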


In [16]:
# Handling mismatched columns with outer join (default)
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
state.set_dataframe(df_a, name='df_a', operation='create')

df_b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})  # Has 'z' instead of 'y'
state.set_dataframe(df_b, name='df_b', operation='create')

# Outer join - keeps all columns
result = concat_dataframes(state, ConcatDataFramesInput(
    dataframes=['df_a', 'df_b'],
    axis=0,
    join='outer',
    save_as='outer_concat'
))

print("Outer concat (keeps all columns, fills NaN):")
state.get_dataframe('outer_concat')

Outer concat (keeps all columns, fills NaN):


,x,y,z
0,1,3.0,
1,2,4.0,
2,5,,7.0
3,6,,8.0


In [17]:
# Inner join - keeps only common columns
result = concat_dataframes(state, ConcatDataFramesInput(
    dataframes=['df_a', 'df_b'],
    axis=0,
    join='inner',
    save_as='inner_concat'
))

print("Inner concat (keeps only common columns):")
state.get_dataframe('inner_concat')

Inner concat (keeps only common columns):


,x
0,1
1,2
2,5
3,6


---
## Part 4: Real-World Workflow

Build on the merged results with a group-by aggregation, as a realistic data pipeline would.

In [19]:
# Scenario: Analyze order revenue by product category

# Start from the complete_orders DataFrame built in Part 2
orders_df = state.get_dataframe('complete_orders')
print(f"Working with {len(orders_df)} complete orders")

# Group by category and calculate totals using transforms
from stats_compass_core.transforms.groupby_aggregate import groupby_aggregate, GroupByAggregateInput

result = groupby_aggregate(state, GroupByAggregateInput(
    dataframe_name='complete_orders',
    by=['category'],
    agg_func={'total': ['sum', 'mean', 'count']},
    save_as='sales_by_category'
))

print("\nSales by Category:")
state.get_dataframe('sales_by_category')

Working with 7 complete orders

Sales by Category:


,category,total_sum,total_mean,total_count
0,Electronics,309.91,61.982,5
1,Tools,39.98,19.99,2
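
The column names `total_sum`, `total_mean`, and `total_count` look like a flattened form of the two-level columns that pandas produces for a dict-of-lists aggregation. The pandas-only sketch below reproduces the table above from the order totals computed earlier. The flattening step is a guess at what `groupby_aggregate` does, not a description of its implementation.

```python
import pandas as pd

# Category and total for the seven orders in complete_orders.
complete_orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106, 107],
    "category": ["Electronics", "Electronics", "Electronics", "Tools",
                 "Electronics", "Electronics", "Tools"],
    "total": [59.98, 49.99, 89.97, 19.99, 59.98, 49.99, 19.99],
})

# A dict of lists yields MultiIndex columns such as ("total", "sum").
grouped = complete_orders.groupby("category").agg(
    {"total": ["sum", "mean", "count"]}
)

# Join the two levels with "_" and turn the group key back into a column.
grouped.columns = ["_".join(col) for col in grouped.columns]
sales_by_category = grouped.reset_index()

print(sales_by_category.round(3))
```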


In [27]:
# Final state summary
summary = state.get_state_summary()

print("=" * 60)
print("SESSION SUMMARY")
print("=" * 60)
print(f"\nTotal DataFrames managed: {len(summary['dataframes'])}")
print(f"Memory used: {summary['memory']['used_mb']:.2f} MB")
print("\nDataFrames created:")
for df_info in summary['dataframes']:
    df = state.get_dataframe(df_info['name'])
    print(f"  - {df_info['name']}: {df.shape[0]} rows × {df.shape[1]} cols")

SESSION SUMMARY

Total DataFrames managed: 20
Memory used: 0.01 MB

DataFrames created:
  - customers: 5 rows × 4 cols
  - orders: 7 rows × 5 cols
  - products: 4 rows × 4 cols
  - orders_with_customers: 6 rows × 8 cols
  - orders_left_join: 7 rows × 8 cols
  - full_outer: 9 rows × 8 cols
  - temp_orders_customers: 7 rows × 8 cols
  - complete_orders: 7 rows × 12 cols
  - jan_sales: 3 rows × 3 cols
  - feb_sales: 3 rows × 3 cols
  - mar_sales: 3 rows × 3 cols
  - q1_sales: 9 rows × 3 cols
  - demographics: 4 rows × 2 cols
  - behavior: 4 rows × 2 cols
  - customer_features: 4 rows × 4 cols
  - df_a: 2 rows × 2 cols
  - df_b: 2 rows × 2 cols
  - outer_concat: 4 rows × 3 cols
  - inner_concat: 4 rows × 1 cols
  - sales_by_category: 2 rows × 4 cols


---
## Summary

Stats Compass Core's stateful architecture enables:

1. **Multi-DataFrame Management** - Hold multiple datasets in memory with named references
2. **SQL-Style Merges** - Join tables with inner/left/right/outer joins
3. **Flexible Concatenation** - Stack data vertically or horizontally
4. **Chainable Operations** - Build complex pipelines by referencing previous results

This makes it ideal for:
- Data integration workflows
- Multi-source analytics
- Building AI-powered data agents that need to orchestrate across datasets