# Uncertainty Propagation in Generalized Systems

> Our statistical models are not built on raw data, but on a foundation of previously modeled and imputed data. Core variables like revenue and employee counts are themselves estimates from complex allocation models. This creates a 'model-on-model' problem, where errors and uncertainties from the foundational layers can cascade and become amplified in our final results, making it difficult to assess their true accuracy.

We face a challenge of patchwork data processing in our statistical production. The auxiliary data we use for sampling, imputation, and analysis undergoes a multi-stage pipeline of modification before we even use it. This pipeline can include:

1. Rule-based editing of raw administrative data.
2. Manual edits.
3. Probabilistic imputation for missing or "outlier" values.
4. **Reallocation**, where enterprise-level totals are distributed to company/establishment/location-level units.


The result is that our foundational data frame is a composite of observed values, deterministic edits, and multiple layers of model-based estimates. When we then build our own analytical or imputation models, we are effectively stacking a new model on top of this complex, pre-modeled base.

---
Our objective is to analyze these scenarios rigorously. Let's consider a simplified (but representative) pipeline.

Our goal is to estimate a parameter $\theta$ from a final analytical model. This could be a regression coefficient, a population mean, or a Gini coefficient. The parameter is defined by a model involving the true, unobserved variables for a population of units, say $(Y_{true}, X_{true})$.

The process we described above can be formalized as a sequence of transformations or operators applied to the data:

### The Truth ($D_{true}$)

The set of true values $\{Y_i, X_i\}_{i=1}^N$ for the entire population. This is a Platonic ideal: we never observe it.

### Administrative Data ($D_{admin}$)

We observe administrative records, which are a flawed measurement of the truth.

$$X_{admin} = X_{true} + \epsilon_{obs}$$

where $\epsilon_{obs}$ represents measurement error, reporting errors, etc. For categorical data like NAICS, this is a misclassification operator.

### Editing & Cleaning ($f_{edit}$)

A deterministic function is applied to fix known errors.
$$X_{clean} = f_{edit}(X_{admin})$$

### Imputation ($f_{imp}$)

A stochastic function fills in missing values. It does so by drawing from a predictive distribution, which is itself a model: $P(X_{missing} \mid X_{observed}, \phi)$, where $\phi$ are the parameters of the imputation model.
$$X_{imp} = f_{imp}(X_{clean})$$

Crucially, a single imputation produces one realization, $X_{imp}^{(1)}$. A second run would produce $X_{imp}^{(2)}$.
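
To make the point about realizations concrete, here is a minimal sketch of a stochastic imputation step. It assumes a deliberately simple predictive distribution (a normal fitted to the observed values) and holds the imputation parameters $\phi$ fixed at their estimates; the function name `impute_once`, the toy values, and the seed are all hypothetical.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill None entries with draws from a normal fitted to the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)       # part of phi (assumed model)
    sigma = statistics.stdev(observed)    # part of phi (assumed model)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

# Toy X_clean with three missing entries (hypothetical values).
x_clean = [10.0, 12.5, None, 9.0, None, 11.0, 13.5, None]

rng = random.Random(0)                    # seed chosen only for reproducibility
x_imp_1 = impute_once(x_clean, rng)       # first realization
x_imp_2 = impute_once(x_clean, rng)       # second realization

missing = [i for i, v in enumerate(x_clean) if v is None]
for i in missing:
    print(i, round(x_imp_1[i], 2), round(x_imp_2[i], 2))
```

The observed entries are identical in both runs, while every imputed entry differs. An analyst who receives only `x_imp_1` has no way to see that spread.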

### Allocation ($f_{alloc}$)

Another model-based transformation is applied. For example, enterprise-level data $Z_{ent}$ is used to allocate values to establishments. This function has its own parameters, $\psi$.
$$X_{proc} = f_{alloc}(X_{imp}, Z_{ent}, \psi)$$

This gives us the processed data, $D_{proc} = \{Y_i, X_{proc,i}\}$.
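
One common form of allocation is a proportional split, sketched below under stated assumptions: an enterprise-level total (playing the role of $Z_{ent}$) is distributed across establishments in proportion to an establishment-level driver (playing the role of $X_{imp}$). The function name `allocate_proportionally` and all numbers are hypothetical; the choice of driver stands in for $\psi$.

```python
def allocate_proportionally(enterprise_total, driver):
    """Split an enterprise-level total across establishments by driver share."""
    driver_sum = sum(driver)
    if driver_sum <= 0:
        raise ValueError("driver must have a positive sum")
    return [enterprise_total * d / driver_sum for d in driver]

# Hypothetical enterprise with three establishments.
revenue_ent = 1_200_000.0             # enterprise-level total to distribute
payroll_est = [300.0, 500.0, 200.0]   # establishment-level driver

revenue_est = allocate_proportionally(revenue_ent, payroll_est)
ratios = [r / p for r, p in zip(revenue_est, payroll_est)]
print(revenue_est)
print(ratios)
```

The allocated values add up to the enterprise total, but the ratio of allocated revenue to the driver is the same for every establishment. That constant ratio is a property of the rule, not of the establishments.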

### Final Analysis ($f_{analysis}$)

You receive $D_{proc}$ and estimate your parameter $\theta$. You compute an estimator $\hat{\theta}$ and its variance, $\text{Var}(\hat{\theta})$, by treating $D_{proc}$ **as if it were the true fixed data**.

The core of the problem lies in the final step. The analyst's inference is conditioned on the processed data: $P(Y \mid X_{proc})$. However, the true data generating process involves all the preceding layers of uncertainty. A fully correct inference would need to integrate over the uncertainty from all prior stages. The variance you calculate, $\text{Var}(\hat{\theta} \mid X_{proc})$, is an underestimate of the true total variance, which would need to account for the variance from imputation and allocation:

$$\text{Var}_{total}(\hat{\theta}) = E[\text{Var}(\hat{\theta} \mid X_{proc})] + \text{Var}(E[\hat{\theta} \mid X_{proc}])$$

The second term, $\text{Var}(E[\hat{\theta} \mid X_{proc}])$, represents the variance introduced by the upstream processing ($f_{imp}$, $f_{alloc}$, etc.). Treating $X_{proc}$ as fixed implicitly sets this term to zero.
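
The decomposition can be approximated by Monte Carlo if the upstream step can be re-run. The sketch below is an illustration under simple assumptions, not a description of any production pipeline: $\theta$ is a population mean, 80 of 200 values are missing, and imputation draws from a normal fitted to the observed values. The "within" term is the variance the analyst would report for a single dataset; the "between" term is the spread of $\hat{\theta}$ across re-runs, used here as a stand-in for the second term. All names and numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(42)
n, n_missing, runs = 200, 80, 500

# Hypothetical observed part of X_clean; the other n_missing values are missing.
x_observed = [rng.gauss(100.0, 15.0) for _ in range(n - n_missing)]
mu = statistics.fmean(x_observed)
sigma = statistics.stdev(x_observed)

estimates, naive_variances = [], []
for _ in range(runs):
    # One realization of X_proc: observed values plus stochastic imputations.
    x_proc = x_observed + [rng.gauss(mu, sigma) for _ in range(n_missing)]
    estimates.append(statistics.fmean(x_proc))                # theta_hat
    naive_variances.append(statistics.variance(x_proc) / n)   # analyst's s^2 / n

within = statistics.fmean(naive_variances)   # average reported variance
between = statistics.variance(estimates)     # spread caused by re-imputation
total = within + between

print("within :", within)
print("between:", between)
print("share of total variance ignored:", between / total)
```

In this setup the between term is a substantial fraction of the within term, so the variance reported from any single processed dataset noticeably understates the total.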

***



#### Scenario A: Contamination

This occurs when an imputation model for a survey leverages auxiliary data from a business frame that is itself a modeled product.

1.  **The Frame's Artifact:** An allocation model ($f_{BR}$) on the Business Register (BR) creates processed auxiliary data, $X_{proc}$. For a subset of records, this model imposes a simple linear relationship to generate $X_{proc}$ from another variable, $Z$.
    $$ \text{For some } i, \quad X_{proc, i} = f_{BR}(Z_i) = \beta_0 + \beta_1 Z_i $$

2.  **The Survey Model's Input:** A new model ($f_{survey}$) is trained to impute a missing survey variable, $Y_{missing}$, using $X_{proc}$ as a key feature.
    $$ Y_{imp, i} = f_{survey}(X_{proc, i}, ...) $$

3.  **The Consequence:** The model $f_{survey}$ learns the strong, artificial correlation between $X_{proc}$ and $Z$ embedded by $f_{BR}$. It learns the conditional relationship $P(Y \mid X_{proc})$, which is a biased and distorted version of the true relationship, $P(Y \mid X_{true})$. The assumptions of the frame-building model are thus injected directly into the survey's imputed values.

---

#### Scenario B: Historical (Temporal) Contamination

This occurs when models are updated over time by training on the output of previous models, creating a legacy of bias.

1.  **The Past Artifact:** At time $t-1$, missing survey data was imputed using a simple linear model, $f_{imp, t-1}$. This introduced an artificial linear trend into the historical dataset, $D_{proc, t-1}$.
    $$ \text{For imputed records in } D_{proc, t-1}, \quad Y_{proc, i, t-1} = \alpha_0 + \alpha_1 X_{proc, i, t-1} $$

2.  **Training the Current Model:** At time $t$, a new, more sophisticated imputation model, $f_{imp, t}$, is trained on the now-complete historical dataset, $D_{proc, t-1}$.
    $$ \text{Train } f_{imp, t} \text{ using } D_{proc, t-1} \text{ as ground truth.} $$

3.  **The Consequence (Bias Propagation):** The new model, $f_{imp, t}$, diligently learns the patterns from the historical data, including the artificial linear trend represented by the coefficient $\alpha_1$. The model has no way to know this trend was an artifact. When $f_{imp, t}$ is used to impute missing values in the current survey year ($t$), it reproduces this pattern.
    $$ Y_{proc, i, t} = f_{imp, t}(...) \quad \text{(this function now implicitly contains } \alpha_1 \text{)} $$

The bias from the model at $t-1$ has now propagated to the data at time $t$, reinforcing a potentially false trend.
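
The three steps above can be sketched as a toy simulation. It assumes a truth in which $Y$ does not depend on $X$ at all, a legacy rule with hypothetical coefficients `alpha_0` and `alpha_1`, and half of the records imputed at $t-1$. Ordinary least squares stands in for the "more sophisticated" model $f_{imp, t}$ purely to keep the example short.

```python
import random
import statistics

def ols(x, y):
    """Return (intercept, slope) of a simple least-squares fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(7)
n = 4000
alpha_0, alpha_1 = 0.0, 1.0   # hypothetical legacy imputation rule at t-1

# Time t-1: in this toy truth, Y is unrelated to X.
x_prev = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_true_prev = [rng.gauss(0.0, 1.0) for _ in range(n)]
was_missing = [i % 2 == 0 for i in range(n)]   # half the records were imputed

# D_proc,t-1: imputed records sit exactly on the legacy line.
y_proc_prev = [alpha_0 + alpha_1 * x if m else y
               for x, y, m in zip(x_prev, y_true_prev, was_missing)]

# Slope among genuinely observed records vs. slope in the "complete" history.
x_obs = [x for x, m in zip(x_prev, was_missing) if not m]
y_obs = [y for y, m in zip(y_true_prev, was_missing) if not m]
_, slope_observed = ols(x_obs, y_obs)
intercept_t, slope_t = ols(x_prev, y_proc_prev)   # f_imp,t trained on D_proc,t-1

# Time t: the new model imputes along the inherited slope.
x_now = [rng.gauss(0.0, 1.0) for _ in range(5)]
y_imp_now = [intercept_t + slope_t * x for x in x_now]

print("slope in observed records:", round(slope_observed, 3))
print("slope learned at time t  :", round(slope_t, 3))
```

The observed records show essentially no slope, while the model trained on the processed history learns a slope of roughly half of `alpha_1`, reflecting the share of imputed records. Every value it imputes at time $t$ then carries that inherited trend.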

---
### The Core Risk

In both scenarios, the analytical models are no longer learning exclusively about the real world. They are, in part, learning the "ghosts" of previous models. This can lead to the **institutionalization of bias**, where artificial correlations become stable, recognized "facts" within the data ecosystem, leading to misguided analysis and decision-making.

---

### Scenario: Allocation-Induced Spurious Correlation

The allocation step $f_{alloc}$ embeds a structural assumption (here a simple linear rule):

- Let the unobserved truth be $(Y_{true}, X_{true})$.
- Allocation uses an auxiliary driver $Z$ (e.g., NAICS or regional payroll totals).
- The designer hard-codes a linear trend:
$$X_{proc} \;=\; \alpha + \gamma\,Z \;+\; \eta,$$
with $\eta$ a small prediction error.

**Mechanism:** Because $\eta$ is small, nearly every processed value sits very close to the line $\alpha + \gamma Z$. Two distortions emerge:

1. **Over-tight coupling of $X_{proc}$ and $Z$.** The correlation $\operatorname{Corr}(X_{proc}, Z)$ is artificially inflated, often to near 1, even if the true correlation $\operatorname{Corr}(X_{true}, Z)$ is modest or zero.
2. **Induced correlation with any $Y$ that depends on $Z$.** If, in the real economy, $Y_{true}$ depends on $Z$ (e.g., high-tech NAICS codes pay higher wages), the manufactured link $X_{proc} \leftrightarrow Z$ leaks into your target regression $Y \sim X$. You cannot tell whether $\hat\beta_1$ captures a genuine labour-productivity effect or just the baked-in allocation rule.

**Consequence:** In the regression
$$Y_i \;=\; \beta_0 + \beta_1 X_{proc,i} + \nu_i,$$
$\hat\beta_1$ can be biased away from zero (the opposite of attenuation) because $X_{proc}$ inherits part of the signal that properly belongs to $Z$. Standard errors are also understated: repeated allocations would reproduce almost exactly the same $X_{proc}$, so the analyst sees very little variation and reports misleadingly tight confidence intervals.
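
A small simulation illustrates both distortions. It is a sketch under assumptions chosen to make the effect stark: $X_{true}$ is unrelated to $Z$, $Y$ depends on $Z$ only (so the true effect of $X$ is zero), and the allocation rule uses hypothetical values for $\alpha$, $\gamma$, and the spread of $\eta$.

```python
import math
import random
import statistics

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
n = 5000
alpha, gamma, eta_sd = 10.0, 1.0, 0.1   # hypothetical allocation rule

z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]            # unrelated to Z
y = [1.0 + 2.0 * zi + rng.gauss(0.0, 1.0) for zi in z]      # Y depends on Z only
x_proc = [alpha + gamma * zi + rng.gauss(0.0, eta_sd) for zi in z]

corr_true = corr(x_true, z)
corr_proc = corr(x_proc, z)
slope_true = ols_slope(x_true, y)   # regression on the unobserved truth
slope_proc = ols_slope(x_proc, y)   # regression the analyst actually runs

print("Corr(X_true, Z):", round(corr_true, 3))
print("Corr(X_proc, Z):", round(corr_proc, 3))
print("slope on X_true:", round(slope_true, 3))
print("slope on X_proc:", round(slope_proc, 3))
```

With these settings the processed variable is almost perfectly correlated with $Z$, and the regression on $X_{proc}$ reports a large slope even though the true effect of $X$ on $Y$ is zero by construction.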

--- 


### Scenario: A Hidden, Problematic Situation

**Condition:** The processing model introduces a systematic bias or a spurious correlation that the final analytical model cannot distinguish from a real-world effect.

This is the most dangerous scenario.

**Example:** Suppose the true model is the regression $Y_i = \beta_0 + \beta_1 X_{true,i} + \epsilon_i$, but the analyst fits $Y_i = \beta_0 + \beta_1 X_{proc,i} + \nu_i$.

* **The Allocation Model:** Suppose our allocation model ($f_{alloc}$) for the number of employees ($X$) tends to be conservative: it shrinks extreme values towards the mean (a common feature of prediction models). For very large establishments, $X_{proc} < X_{true}$, and for very small ones, $X_{proc} > X_{true}$.

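* **The Consequence:** This is an errors-in-variables (EIV) problem, but not the classical one: the measurement error ($X_{proc} - X_{true}$) is negatively correlated with the true value $X_{true}$. Your estimate $\hat{\beta}_1$ will be biased, and the direction depends on the error structure. Purely random error, uncorrelated with $X_{true}$, biases $\hat{\beta}_1$ towards zero (*attenuation bias*), so you would conclude that the number of employees has a weaker effect on $Y$ than it actually does. Pure shrinkage, $X_{proc} = \mu + \lambda (X_{true} - \mu)$ with $0 < \lambda < 1$, compresses the regressor and inflates $\hat{\beta}_1$ by a factor of $1/\lambda$. A real allocation model typically mixes both, so the net bias cannot be signed from $D_{proc}$ alone.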

* **The Deeper Problem:** What if the allocation model ($f_{alloc}$) used the establishment's NAICS code as a predictor? And you, the analyst, are trying to estimate the effect of the NAICS code on Y? The allocation model has already created a relationship between the processed employee count ($X_{proc}$) and NAICS. Your final model, which includes both variables, will misattribute the effects, as the variables' relationship is partly a mathematical artifact of the upstream processing, not a pure economic reality.
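
The direction of the bias in $\hat{\beta}_1$ depends on how the processing error relates to $X_{true}$, which a short simulation can make concrete. The sketch below compares two stylized error structures under assumed parameters: classical additive noise that is independent of $X_{true}$, and pure shrinkage towards the mean with a hypothetical factor `lam`. The values of `beta_0`, `beta_1`, and the noise scales are illustrative only.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
n = 5000
beta_0, beta_1 = 1.0, 2.0     # hypothetical true regression
lam = 0.5                     # hypothetical shrinkage factor

x_true = [rng.gauss(50.0, 10.0) for _ in range(n)]
y = [beta_0 + beta_1 * x + rng.gauss(0.0, 5.0) for x in x_true]
mean_x = statistics.fmean(x_true)

# Classical error: noise added independently of the true value.
x_classical = [x + rng.gauss(0.0, 10.0) for x in x_true]
# Shrinkage: extreme values pulled towards the mean, no added noise.
x_shrunk = [mean_x + lam * (x - mean_x) for x in x_true]

slope_true = ols_slope(x_true, y)
slope_classical = ols_slope(x_classical, y)
slope_shrunk = ols_slope(x_shrunk, y)

print("slope on X_true      :", round(slope_true, 3))
print("slope with noise     :", round(slope_classical, 3))
print("slope with shrinkage :", round(slope_shrunk, 3))
```

Under these assumptions the noisy regressor gives a slope of about half of `beta_1`, while the shrunken regressor gives about twice `beta_1` (a factor of `1 / lam`). Which way a given allocation model pushes the estimate therefore depends on how much of its error is noise and how much is shrinkage.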

### Scenario 4: Chained Model-Based Steps

**Condition:** Multiple stochastic, model-based steps are chained together, with the output of one being treated as fixed input for the next, violating the principles of Scenario 2 at each step.

**Example:** The agency first imputes missing revenue ($X_{imp} = f_{imp}(...)$). Then, treating this now-complete revenue data as "truth," it uses it as a predictor in a model to allocate the total number of employees ($Z_{proc} = f_{alloc}(X_{imp}, ...)$). You receive the final dataset with imputed revenue and allocated employee counts and use both as predictors in a model to analyze workplace safety (Y).

**Why it's a problem:** You have a chain of uncertainty that is ignored at every step.

* The uncertainty in the revenue imputation is not propagated into the employee allocation model.
* The combined uncertainty from both revenue imputation and employee allocation is not propagated into your final workplace safety analysis.

**The result:** The standard errors you report for your model will be too small, leading to confidence intervals that are too narrow and p-values that are too low. You will have a false sense of precision and a high risk of Type I errors (finding statistically significant effects that are not really there).
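
The size of the gap can be gauged by re-running the whole chain. The sketch below is a simplified illustration, not the agency's pipeline: revenue is imputed stochastically, employee counts are allocated from the imputed revenue with a hypothetical rule, and the analyst regresses $Y$ on the allocated employee count alone (one predictor rather than two, to keep the code short). It compares the conventional standard error with the spread of the slope across re-runs of the two upstream stages. All coefficients and noise scales are assumptions.

```python
import math
import random
import statistics

def slope_and_se(x, y):
    """Simple OLS slope of y on x and its conventional standard error."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)

rng = random.Random(11)
n, runs = 300, 400

# Hypothetical truth: revenue drives employees, employees drive the outcome Y.
revenue = [rng.gauss(100.0, 20.0) for _ in range(n)]
employees = [0.5 * r + rng.gauss(0.0, 3.0) for r in revenue]
y = [1.0 + 0.3 * e + rng.gauss(0.0, 2.0) for e in employees]
missing = [i % 3 == 0 for i in range(n)]   # a third of revenue is missing

obs = [r for r, m in zip(revenue, missing) if not m]
mu, sigma = statistics.fmean(obs), statistics.stdev(obs)

slopes, naive_ses = [], []
for _ in range(runs):
    # Stage 1: stochastic revenue imputation.
    x_imp = [rng.gauss(mu, sigma) if m else r for r, m in zip(revenue, missing)]
    # Stage 2: employee allocation that treats x_imp as truth.
    z_proc = [0.5 * r + rng.gauss(0.0, 3.0) for r in x_imp]
    # Stage 3: the analyst's regression, treating z_proc as fixed.
    slope, se = slope_and_se(z_proc, y)
    slopes.append(slope)
    naive_ses.append(se)

naive_se = statistics.fmean(naive_ses)
pipeline_sd = statistics.stdev(slopes)   # spread from re-running stages 1 and 2
total_se = math.sqrt(naive_se ** 2 + pipeline_sd ** 2)

print("naive SE   :", naive_se)
print("pipeline SD:", pipeline_sd)
print("total SE   :", total_se)
```

In this toy chain the spread introduced by the upstream stages is of the same order as the standard error the analyst reports, so intervals built from the naive standard error alone would be too narrow.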

***
