# Synthetic control

In this notebook, we simulate a panel dataset, then demonstrate the use of synthetic control to estimate a causal effect.

## Data generating process

### Dynamic factor model

For the data-generating process, we employ a dynamic factor model.  In this framework, we posit the existence of a small number of latent factors that drive the evolution of the outcomes across time.  This makes it a natural choice for simulating panel data: it allows us to generate time series data for each unit in our sample, while letting each unit respond differently to the common underlying trends.


The outcome for unit $i$ at time $t$ is specified as:
$$ 
\begin{align*}
Y_{it} = \sum_{k=1}^{K} \lambda_{ik} f_{kt} + \epsilon_{it}
\end{align*}
$$
where:
- $f_{kt}$ denote the time-varying latent factors. There are a total of $K$ latent factors.
- $\lambda_{ik}$ denote the unit-specific factor loadings.
- $\epsilon_{it}$ is idiosyncratic noise.

For a survey on dynamic factor models, see Stock and Watson, "Dynamic factor models", (2010): https://swh.princeton.edu/~mwatson/papers/dfm_oup_4.pdf
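
Stacking the loadings into a $J \times K$ matrix and the factors into a $K \times T$ matrix, the common component $\sum_k \lambda_{ik} f_{kt}$ is a plain matrix product.  The sketch below checks this equivalence on a tiny example; the dimensions, seed, and variable names (`n_units`, `toy_loadings`, and so on) are illustrative assumptions and are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)               # illustrative seed
n_units, n_factors, n_periods = 3, 2, 4      # small hypothetical dimensions

toy_loadings = rng.standard_normal((n_units, n_factors))    # lambda_ik
toy_factors = rng.standard_normal((n_factors, n_periods))   # f_kt

# Element-wise definition: sum over k of lambda_ik * f_kt
common_loop = np.zeros((n_units, n_periods))
for i in range(n_units):
    for t in range(n_periods):
        common_loop[i, t] = sum(
            toy_loadings[i, k] * toy_factors[k, t] for k in range(n_factors)
        )

# Equivalent matrix product, with shape (n_units, n_periods)
common_matrix = toy_loadings @ toy_factors

print(np.allclose(common_loop, common_matrix))
```

Adding a noise matrix of the same shape to `common_matrix` would then give the full outcome matrix.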



### Generating panel dataset

For concreteness, let's assume we are interested in measuring a certain outcome at the city level across time.  Perhaps our marketing team ran a billboard campaign in a selected city, and we want to measure the effect of that campaign on sales.  We will generate a panel dataset of sales at the city-month level.

We can consider adding a few city-specific observables such as population and average household income.  For now, let's keep things simple and focus only on generating the outcome variable.

Note that this setting can be applied to other business cases.  For example, a digital e-commerce company may decide to open physical stores in a few locations, and want to estimate the impact (if any) of those physical stores.  A two-sided platform might implement a new matching algorithm in a certain geographic market, and want to examine the algorithm's effect.

In [13]:
import numpy as np
import pandas as pd

# Set parameters
np.random.seed(42)
J = 40            # number of cities
T = 30            # total months
K = 2             # number of factors
T0 = 20           # first treated month (months 0..T0-1 are pre-treatment)
sigma_noise = 0.5

# Create base factors (latent variables that influence outcomes)
months = np.arange(T)
factors = np.array([
    np.sin(months / 5),
    np.cos(months / 7)
])

# Create the loadings (size J by K)
mean_loading = np.random.standard_normal(size=(J,K))

# Calculate each city's response to the latent factors over time (size J by T), then add random noise
response = mean_loading @ factors
noise = np.random.normal(loc=0, scale=sigma_noise, size=response.shape)
response += noise

# Add treatment effect to city_0 from month T0 onward
treated_city_idx = 0
response[treated_city_idx, T0:] += 2.0  # constant additive effect (example value)

We now have the response matrix; all that's left is to reshape it into a long-format dataframe.

In [14]:
# Create city and month labels
cities = [f'city_{i}' for i in range(J)]

# Flatten response matrix into long format
df = pd.DataFrame({
    'city': np.repeat(cities, T),
    'month': np.tile(months, J),
    'outcome': response.flatten()
})

# Add treatment indicators
df['treated'] = (df['city'] == cities[treated_city_idx]).astype(int)
df['post_treatment'] = (df['month'] >= T0).astype(int)
df['treatment'] = df['treated'] * df['post_treatment']
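
The reshaping relies on `flatten()` walking the matrix row by row, so each city's months stay contiguous and line up with `np.repeat` and `np.tile`.  One way to check this is to pivot the long frame back to wide format and compare it with the original matrix.  The sketch below does this on a toy 2-by-3 matrix; the `toy_*` names and values are hypothetical and are not part of the simulated dataset.

```python
import numpy as np
import pandas as pd

# Toy matrix: 2 cities (rows) by 3 months (columns)
toy_response = np.arange(6, dtype=float).reshape(2, 3)
toy_cities = [f'city_{i}' for i in range(2)]
toy_months = np.arange(3)

# Same long-format construction as above
toy_df = pd.DataFrame({
    'city': np.repeat(toy_cities, 3),
    'month': np.tile(toy_months, 2),
    'outcome': toy_response.flatten(),
})

# Pivot back to wide format; reindex explicitly because pivot sorts labels
wide = toy_df.pivot(index='city', columns='month', values='outcome')
wide = wide.loc[toy_cities, toy_months]

print(np.array_equal(wide.to_numpy(), toy_response))
```

The explicit `.loc` reindexing matters with more than ten cities, since `pivot` sorts string labels lexicographically (`city_10` comes before `city_2`).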



We can proceed to graph the outcome trends by city, then fit a synthetic control model.