# Use Cases and Examples for maxent_disaggregation

The `maxent_disaggregation` package enables statistically sound disaggregation of aggregate data with uncertainty propagation. When a total is split into components, the components are correlated because they must sum to that total. An uncertainty analysis that ignores these correlations can over- or under-estimate the uncertainty of downstream calculations.
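
To make the correlation argument concrete, the sketch below splits a fixed total into three components using randomly drawn shares. It uses only NumPy, not `maxent_disaggregation`. The Dirichlet distribution and the concentration factor of 50 are illustrative assumptions, not the package's method.

```python
import numpy as np

rng = np.random.default_rng(0)

total = 100.0  # fixed total, so all variation comes from the shares
# Hypothetical concentration factor (50) controlling how uncertain the shares are
alpha = np.array([0.8, 0.19, 0.01]) * 50.0

shares = rng.dirichlet(alpha, size=10_000)  # each row sums to 1
components = shares * total                 # each row sums to `total`

# Pairwise correlations between components (columns are variables)
corr = np.corrcoef(components, rowvar=False)

# Variance of the sum versus the sum of the individual variances
var_of_sum = components.sum(axis=1).var()
sum_of_vars = components.var(axis=0).sum()

print("Correlation between first two components:", corr[0, 1])
print("Variance of the sum:", var_of_sum)
print("Sum of component variances:", sum_of_vars)
```

Because every row sums to the same total, the variance of the sum is essentially zero, while the component variances add up to a much larger number. Treating the components as independent would therefore overstate the uncertainty of the total in this setting.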

This notebook demonstrates how to use `maxent_disaggregation` through a practical example in Industrial Ecology.

## Example: Carbon Footprint of Steel in Vehicle Manufacturing

### Problem Statement

An Industrial Ecology researcher has data on total steel consumption for vehicle manufacturing but needs to disaggregate this figure by vehicle type (ICE, BEV, and HBEV) based on production volume proxies. A second researcher will use these disaggregated figures for Life Cycle Assessment (LCA).

![Figure 1: A diagram of the example](images/example_steel.png)

Properly accounting for correlations between the disaggregated values is critical for accurate uncertainty estimation in the LCA.

### Available Information

The researcher has:

1. A best estimate for total steel consumption: 100 tonnes/year
2. An uncertainty estimate for that total: standard deviation of 3 tonnes/year
3. A natural lower bound of 0 (no negative consumption)
4. Proxy-based estimates of shares by vehicle type: ICE (80%), BEV (19%), HBEV (1%)
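
Items 1 to 3 (a mean, a standard deviation, and a lower bound of 0) can be encoded as a log-normal distribution matched to those two moments. The sketch below applies the standard moment-matching formulas with NumPy. The helper `lognormal_params` is hypothetical, and whether `maxent_disagg` uses exactly this parameterization internally is an assumption.

```python
import numpy as np


def lognormal_params(mean, sd):
    """Return (mu, sigma) of the underlying normal for a log-normal
    distribution with the given mean and standard deviation."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)


rng = np.random.default_rng(0)
mu, sigma = lognormal_params(mean=100.0, sd=3.0)
aggregate = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

print("Sample mean:", aggregate.mean())
print("Sample SD:", aggregate.std())
print("All samples positive:", bool((aggregate > 0).all()))
```

A log-normal distribution respects the lower bound of 0 by construction, which a plain normal distribution does not.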

### Generating Correlated Samples with maxent_disaggregation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from maxent_disaggregation import maxent_disagg
import seaborn as sns

# Set parameters
n_samples = 10000
mean_aggregate = 100  # Best estimate for total
sd_aggregate = 3      # Standard deviation for total
shares = [0.8, 0.19, 0.01]  # Proxy-based estimates of shares
vehicle_types = ["ICE", "BEV", "HBEV"]

In [None]:
# Generate correlated samples
samples, _, _, _ = maxent_disagg(
    n=n_samples,
    mean_0=mean_aggregate,
    sd_0=sd_aggregate,
    min_0=0,  # Lower bound
    shares=shares,
    log=True  # Use log-normal distribution for sampling
)

# Convert samples to a DataFrame for easier analysis
samples_df = pd.DataFrame(samples, columns=vehicle_types)
samples_df.head()

### Visualizing the Distribution of Each Component

In [None]:
# Plot histograms for each vehicle type
samples_df.melt(var_name="Vehicle Type", value_name="Steel Consumption").pipe(
    (sns.histplot, "data"),
    x="Steel Consumption",
    hue="Vehicle Type",
    kde=True,
    bins=100,
    alpha=0.3
).set(
    xlabel="Steel Consumption (tonnes)",
    ylabel="Frequency",
    title="Distribution of Steel Consumption by Vehicle Type"
);

### Validating the Samples

In [None]:
# Check if the total matches our specified mean and SD
sample_total = samples_df.sum(axis=1)
print("Mean of the sampled total:", sample_total.mean())
print("SD of the sampled total:", sample_total.std())

# Check if shares match our specified values
sample_shares = samples_df.div(sample_total, axis=0)
print("Means of the sampled shares:", sample_shares.mean())

### Understanding Correlations in the Sample

In [None]:
# Visualize correlations
sns.pairplot(samples_df, kind="reg", diag_kind="kde", plot_kws={"scatter_kws": {"alpha": 0.1}})
plt.suptitle("Correlations Between Vehicle Type Steel Consumption", y=1.02);

### Downstream Analysis: LCA of Carbon Footprint

In [None]:
# Set emission factor (tonnes CO₂ per tonne of steel, matching the units printed below)
emission_factor_steel = 2.5

# Calculate emissions using correlated samples
sample_emissions_full = samples_df * emission_factor_steel
total_emissions_full = sample_emissions_full.sum(axis=1)

print("Mean emissions:", total_emissions_full.mean(), "tonnes CO₂")
print("SD emissions:", total_emissions_full.std(), "tonnes CO₂")
print("CV emissions:", total_emissions_full.std() / total_emissions_full.mean())

### Comparing Results with Independent Sampling

In [None]:
# Independent sampling ignoring correlations: each component is drawn on its own,
# with both its mean and its SD scaled by its share of the aggregate
rng = np.random.default_rng(42)  # fixed seed so the comparison is reproducible
independent_samples = pd.DataFrame({
    name: rng.normal(loc=share * mean_aggregate, scale=share * sd_aggregate, size=n_samples)
    for name, share in zip(vehicle_types, shares)
})

# Calculate emissions using independent samples
sample_emissions_independent = independent_samples * emission_factor_steel
total_emissions_independent = sample_emissions_independent.sum(axis=1)

print("Mean emissions (independent):", total_emissions_independent.mean(), "tonnes CO₂")
print("SD emissions (independent):", total_emissions_independent.std(), "tonnes CO₂")
print("CV emissions (independent):", total_emissions_independent.std() / total_emissions_independent.mean())
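
The size of the gap between the two approaches can also be worked out by hand. For independent draws the variances add, so the SD of the total is the root of the summed squared component SDs. For samples constrained to sum to the aggregate, the SD of the total is the SD of the aggregate itself. The sketch below repeats this arithmetic with the values used in this notebook. It assumes the constrained samples reproduce the specified aggregate SD.

```python
import math

sd_aggregate = 3.0
shares = [0.8, 0.19, 0.01]
emission_factor_steel = 2.5

# Independent draws: Var(sum) = sum of Var, with component SD = share * sd_aggregate
sd_total_independent = math.sqrt(sum((share * sd_aggregate) ** 2 for share in shares))

# Constrained draws: components sum to the aggregate, so SD(sum) = sd_aggregate
sd_total_constrained = sd_aggregate

# A constant emission factor scales the SD linearly
sd_emissions_independent = emission_factor_steel * sd_total_independent
sd_emissions_constrained = emission_factor_steel * sd_total_constrained

print("SD of total, independent:", round(sd_total_independent, 3))          # 2.467
print("SD of emissions, independent:", round(sd_emissions_independent, 3))  # 6.167
print("SD of emissions, constrained:", sd_emissions_constrained)            # 7.5
```

Under these assumptions, the independent baseline understates the SD of total emissions by roughly 18%.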

### Comparing Distributions

In [None]:
# Combine results for comparison
comparison_df = pd.DataFrame({
    "Emissions": np.concatenate([total_emissions_full, total_emissions_independent]),
    "Approach": ["With Correlations"] * len(total_emissions_full) + ["Independent Sampling"] * len(total_emissions_independent)
})

# Plot comparison
sns.histplot(data=comparison_df, x="Emissions", hue="Approach", kde=True, bins=50, alpha=0.5)
plt.xlabel("Total Emissions (tonnes CO₂)")
plt.ylabel("Frequency")
plt.title("Impact of Correlations on Uncertainty Estimation");

## Conclusion

The `maxent_disaggregation` package provides a simple interface for generating statistically valid samples of disaggregated data while properly accounting for correlations. This is crucial for accurate uncertainty propagation in subsequent analyses.