# Tutorial 5: Procurement

This notebook replicates the structural estimation from:

**"Winning by Default: Why is There So Little Competition in Government Procurement?"**
*Kang & Miller, Review of Economic Studies (2022)*

## Part 1: Model Setup, Theory, and Identification

### 1.1 Environment

The government agency procures IT services from contractors. Each project $i$ is characterized by:
- **Project attributes** $x_i$ (duration, size, service type, commercial vs. defense, etc.)
- **Agency attributes** $z_i$ (experience, workload, congressional representation)
- **Competition variables** (number of past winners, number of establishments)

Each contractor has a private type $\pi_i \in [\pi_{\min}, \pi_{\max}] \subset [0,1]$, drawn from a Beta distribution rescaled to that interval.
The type $\pi$ determines the probability of being a "high-cost" contractor:
- With probability $\pi$, the contractor is high-cost (type $H$)
- With probability $1 - \pi$, the contractor is low-cost (type $L$)
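
As a minimal sketch of this type structure, the snippet below draws types from a Beta distribution rescaled to $[\pi_{\min}, \pi_{\max}]$ and then draws the realized cost type. The shape parameters `a` and `b` are hypothetical, and the bounds 0.01 and 0.99 are assumed here to mirror the constants set later in the notebook; none of these are estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only (assumptions, not estimates)
pi_min, pi_max = 0.01, 0.99
a, b = 2.0, 3.0          # hypothetical Beta shape parameters
n_draws = 10_000

# Draw on [0, 1], then rescale to [pi_min, pi_max]
pi = pi_min + (pi_max - pi_min) * rng.beta(a, b, size=n_draws)

# Realized cost type: high-cost with probability pi, low-cost otherwise
is_high_cost = rng.random(n_draws) < pi

print(f"pi range: [{pi.min():.3f}, {pi.max():.3f}]")
print(f"share high-cost: {is_high_cost.mean():.3f}  (mean of pi: {pi.mean():.3f})")
```

The share of high-cost draws should track the mean of $\pi$, since each contractor is high-cost with its own probability $\pi$.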

### 1.2 Contract Outcomes and Moral Hazard

After awarding the contract, cost shocks $s = (s_1, s_2)$ are realized:
- $s_1$: ex-post price modifications (three categories), observed only for fixed-price contracts
- $s_2$: ex-post duration modifications (three categories), observed for all contracts

The distributions of $s$ differ by contractor type, creating **moral hazard**:
- $f_L(s|x)$: distribution for low-cost contractors
- $f_H(s|x)$: distribution for high-cost contractors

The likelihood ratio $\ell(s) = f_L(s) / f_H(s)$ measures the informativeness of contract outcomes.

### 1.3 Optimal Contract Design (Adverse Selection)

The agency offers a menu of contracts. For a cost-plus contract awarded to a contractor with type $\pi$:
- **Base price**: $p_0(\pi, x) = \alpha(\pi, x) + \beta(\pi, x) - \int \psi(q(\pi, s)) f_H(s) ds$
- **Ex-post payment schedule**: $q(\ell(s); \pi) = -\psi \cdot \ln\left(\frac{1 - \tilde{\pi}}{1 - \tilde{\pi} \cdot \ell(s)}\right)$

where $\tilde{\pi}$ solves the IR constraint for low-cost contractors, and:
- $\alpha(\pi, x)$: project cost for low-cost contractors
- $\beta(\pi, x)$: additional cost for high-cost contractors (information rent)
- $\psi(x)$: CARA risk-tolerance parameter (higher $\psi$ → *less* risk averse; the Arrow-Pratt measure is $1/\psi$)
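
To make the payment schedule concrete, here is a small sketch that evaluates $q(\ell(s); \pi)$ exactly as written above for a discrete outcome with three categories. The outcome probabilities `f_L` and `f_H`, and the values of `pi_tilde` and `psi`, are hypothetical; the function name `expost_payment` is ours, not from the paper's code.

```python
import numpy as np

def expost_payment(ell, pi_tilde, psi):
    """q(l; pi) = -psi * ln((1 - pi_tilde) / (1 - pi_tilde * l)), as written above."""
    ell = np.asarray(ell, dtype=float)
    denom = 1.0 - pi_tilde * ell
    if np.any(denom <= 0):
        raise ValueError("requires pi_tilde * ell < 1 for the log to be defined")
    return -psi * np.log((1.0 - pi_tilde) / denom)

# Hypothetical outcome distributions over three categories (assumed values)
f_L = np.array([0.6, 0.3, 0.1])
f_H = np.array([0.3, 0.3, 0.4])
ell = f_L / f_H                      # likelihood ratio per outcome category

q = expost_payment(ell, pi_tilde=0.4, psi=1.0)
print(np.round(ell, 3), np.round(q, 3))
```

Under this formula an uninformative outcome ($\ell = 1$) carries a zero payment, and the payment falls as $\ell$ rises. The expression is only defined when $\tilde{\pi}\,\ell(s) < 1$, which the sketch checks explicitly.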

For a fixed-price contract with $n$ bidders:
$$p_n(\pi, x) = \alpha(\pi, x) + \frac{\beta(\pi, x) \cdot \pi (1-\pi)^{n-1}}{1 - (1-\pi)^n}$$

where the second term is the expected information rent from the lowest-bidding contractor.
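
A quick numerical sketch of this pricing formula, with hypothetical values for $\alpha$, $\beta$ and $\pi$, shows how the rent term shrinks as the number of bidders grows:

```python
import numpy as np

def fixed_price(alpha, beta, pi, n):
    """p_n = alpha + beta * pi * (1 - pi)**(n - 1) / (1 - (1 - pi)**n)."""
    rent_weight = pi * (1.0 - pi) ** (n - 1) / (1.0 - (1.0 - pi) ** n)
    return alpha + beta * rent_weight

alpha, beta, pi = 100.0, 20.0, 0.3   # hypothetical values
for n in (1, 2, 4, 8):
    print(n, round(fixed_price(alpha, beta, pi, n), 3))
```

With a single bidder the weight on $\beta$ equals one, so the price is $\alpha + \beta$; as $n$ grows the weight falls toward zero and the price approaches $\alpha$.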

### 1.4 Entry and Endogenous Competition

Each potential contractor $j$ faces a competition cost $\eta_j$ drawn from $N(\mu_\eta(x,z), \sigma_\eta^2)$.
The contractor participates (enters the bidding) if the expected surplus exceeds $\eta_j$:
$$\Pr(\text{compete} | x, z) = \Phi\left(\frac{\omega(x,z) - \mu_\eta(x,z)}{\sigma_\eta}\right)$$

where $\omega(\pi, x, z) = (1 - e^{-\lambda \pi})(\beta + \gamma) - \kappa \lambda$ captures the
expected payoff from competing, with:
- $\lambda(\pi, x, z)$: expected number of rival bidders
- $\kappa(\pi, x, z)$: buyer search costs (per-bidder cost of evaluating)
- $\beta(\pi, x) + \gamma(\pi, x)$: total contractor surplus
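
The two expressions above can be sketched in a few lines. All numerical inputs below are hypothetical, and the function names `competing_payoff` and `prob_compete` are ours; the standard normal CDF is written with `math.erf` so the snippet needs only the standard library.

```python
import math

def competing_payoff(lam, pi, beta, gamma, kappa):
    """omega = (1 - exp(-lam * pi)) * (beta + gamma) - kappa * lam."""
    return (1.0 - math.exp(-lam * pi)) * (beta + gamma) - kappa * lam

def prob_compete(omega, mu_eta, sigma_eta):
    """Pr(compete) = Phi((omega - mu_eta) / sigma_eta), standard normal CDF."""
    z = (omega - mu_eta) / sigma_eta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical inputs (assumptions for illustration only)
omega = competing_payoff(lam=2.0, pi=0.5, beta=10.0, gamma=5.0, kappa=1.0)
print(round(omega, 4), round(prob_compete(omega, mu_eta=8.0, sigma_eta=2.0), 4))
```

When $\omega$ equals the mean competition cost $\mu_\eta$, the probability of competing is exactly one half; it rises as $\omega$ increases.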

### 1.5 Identification Strategy

The five-step sequential estimation exploits the following identification arguments (Section 4 of the paper):

1. **Step 1** ($\pi$ distribution): The choice between fixed-price ($d=1$) and cost-plus ($d=0$)
   contracts identifies the distribution of $\pi$, since $\Pr(d=1 | \pi, n, x, z)$ depends on the
   type distribution through a selection mechanism.

2. **Step 2** ($s$ distribution): Contract outcomes $(s_1, s_2)$ directly identify $f_L$ and $f_H$
   from their observed marginal distributions.

3. **Step 3** (Cost parameters): Given Steps 1–2, the model predicts expected prices $E[p|x,z,d]$
   and expected ex-post payments $E[q|x,z,d=0]$, which are matched to data via NLS.

4. **Step 4** (Buyer search costs $\kappa$): Given all prior parameters, $\kappa$ is computed directly
   from the equilibrium pricing equation.

5. **Step 5** (Competition costs $\eta$): The observed entry decision $r$ (restricted vs. competitive)
   identifies the parameters of the $\eta$ distribution via MLE.


---
## Part 2: Data Loading and Summary Statistics

The dataset contains 6,981 U.S. federal IT procurement contracts. We load the data and construct the variables used in estimation, following the variable definitions in the paper's Table 1.

In [23]:
import os
import numpy as np
import pandas as pd
from scipy import optimize, stats
from scipy.special import gammaln
from tabulate import tabulate
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Libraries loaded successfully.")

Libraries loaded successfully.


In [24]:
# ── Load raw data into DataFrame ──
# Column names follow Table 1 of Kang & Miller (2022) and estimation.m
#
# Players:
#   - Agency (buyer):  a U.S. federal government agency procuring IT services
#   - Contractor (seller): a private IT firm bidding on the project
#
# Timeline:
#   1. Agency posts project with observable characteristics (x, z)
#   2. Contractors privately know their type pi ~ Beta(alpha, beta)
#   3. Agency designs contract: fixed-price (d=1) vs cost-plus (d=0)
#   4. Contract awarded; cost shock realized -> outcomes s1, s2 observed

BASE_DIR = os.getcwd()   # notebooks have no __file__; data/ is resolved against the working directory
DATA_PATH = os.path.join(BASE_DIR, 'data')

col_names = [
    # ── Contract outcomes (observed ex-post) ──
    'base_price',            # p:  base price ($)
    'expost_price_change',   # q:  ex-post price changes ($), cost-plus only
    's1_cat1', 's1_cat2', 's1_cat3',   # price modification categories (FP only)
    's2_cat1', 's2_cat2', 's2_cat3',   # duration modification categories (all)
    # ── Contract design (agency's ex-ante choice) ──
    'restricted',            # r:  1 = sole-source (no competition), 0 = competitive
    'num_bids',              # b:  number of bids received
    'fixed_price',           # d:  1 = firm-fixed-price, 0 = cost-plus
    # ── Project attributes x (publicly observed) ──
    'dur_gt_3mo',            # duration > 3 months
    'size',                  # project size (log dollars)
    'service',               # service contract (vs. supply/construction)
    'commercial',            # commercial item
    'defense',               # Department of Defense (vs. civilian agency)
    'dca',                   # definitive contract action
    # ── Agency attributes z (publicly observed) ──
    'experience_raw',        # agency's procurement experience (continuous)
    'past_experience_raw',   # agency-contractor match: worked together before?
    'workload_raw',          # agency workload (continuous)
    'congress_rep',          # congressional oversight on this agency
    # ── Market structure (publicly observed) ──
    'num_past_winners_raw',  # number of distinct past winners in this market
    'num_establishments_raw',# number of firms in this market
    # ── Auxiliary ──
    'xcase',                 # project-type case indicator (for grouping)
    'sol_info1', 'sol_info2', 'sol_info3', 'sol_info4'
]

df = pd.read_csv(os.path.join(DATA_PATH, 'data_main.csv'), header=None, names=col_names)
nsim = len(df)

# ── Binarize continuous covariates (thresholds from estimation.m) ──
df['experience']        = (df['experience_raw'] >= 0.8).astype(float)        # ≥ 5 yrs
df['past_experience']   = (df['past_experience_raw'] > 0).astype(float)      # ever worked together
df['workload']          = (df['workload_raw'] >= 4.5).astype(float)          # high workload
df['num_past_winners']  = (df['num_past_winners_raw'] >= 2).astype(float)    # ≥ 2 past winners
df['num_establishments']= (df['num_establishments_raw'] >= 24).astype(float) # ≥ 24 firms

# ── Derived variables ──
bmax = 4
df['b_censored']  = np.minimum(df['num_bids'], bmax)
df['competition']  = (1 - df['restricted']) * df['b_censored']  # c: 0=sole-source, 1..4=competitive

# ── Define variable groups (used throughout estimation) ──
# These lists make it easy to reference groups of columns by role
PROJ_COLS  = ['dur_gt_3mo', 'size', 'service', 'commercial', 'defense', 'dca']
AGEN_COLS  = ['experience', 'past_experience', 'workload', 'congress_rep']
COMP_COLS  = ['num_past_winners', 'num_establishments']
S1_COLS    = ['s1_cat1', 's1_cat2', 's1_cat3']
S2_COLS    = ['s2_cat1', 's2_cat2', 's2_cat3']

# ── Extract numpy arrays for estimation ──
# Scipy optimizers need raw numpy; we extract once here and reuse throughout.
# Notation: we keep short aliases (p, q, d, r, b, s1, s2) matching the paper.
p  = df['base_price'].values
q  = df['expost_price_change'].values
s1 = df[S1_COLS].values                            # (nsim, 3)
s2 = df[S2_COLS].values                            # (nsim, 3)
r  = df['restricted'].values
b  = df['num_bids'].values
d  = df['fixed_price'].values
b_censored = df['b_censored'].values
c  = df['competition'].values

# Project attributes: intercept + 6 binary indicators
xproj = np.column_stack([np.ones(nsim), df[PROJ_COLS].values])   # (nsim, 7)
xagen = df[AGEN_COLS].values                                      # (nsim, 4)
xcomp = df[COMP_COLS].values                                      # (nsim, 2)
xzvec = np.column_stack([xproj, xagen, xcomp])                    # (nsim, 13)

# Dimensions
nproj = xproj.shape[1]         # 7 (intercept + 6)
nagen = xagen.shape[1]         # 4
ncomp = xcomp.shape[1]         # 2
nxz   = nproj + nagen + ncomp  # 13

# ── Project-type case mapping ──
# Group observations by unique project-attribute combinations
# (Used in Step 3 to avoid redundant simulation across identical project types)
xcase = df['xcase'].values
unique_cases = sorted(df['xcase'].unique())
ncase = len(unique_cases)

xproj_case = (
    pd.DataFrame(xproj, columns=['intercept'] + PROJ_COLS)
    .assign(xcase=xcase)
    .groupby('xcase')
    .mean()
    .values
)

# ── Print summary ──
print(f"Data loaded: {nsim:,} observations, {ncase} unique project types")
print(f"Dimensions:  nproj={nproj}, nagen={nagen}, ncomp={ncomp}, nxz={nxz}")
print(f"DataFrame:   {df.shape[0]} rows × {df.shape[1]} columns")

Data loaded: 6,981 observations, 62 unique project types
Dimensions:  nproj=7, nagen=4, ncomp=2, nxz=13
DataFrame:   6981 rows × 35 columns


In [25]:
# ── Summary Statistics (Table 1 of the paper) ──

# Panel A: Contract outcomes
outcome_vars = ['base_price', 'expost_price_change', 'fixed_price', 'restricted', 'num_bids']
outcome_labels = ['Base Price p ($)', 'Ex-post Changes q ($)', 'Firm-Fixed Price (d=1)',
                  'Restricted (r=1)', 'Number of Bids']

rows = []
for var, label in zip(outcome_vars, outcome_labels):
    col = df[var]
    rows.append([label, f"{col.mean():.4f}", f"{col.std():.4f}", f"{col.min():.4f}", f"{col.max():.4f}"])

# s1 categories (FP only)
fp_mask = df['fixed_price'] == 1
for cat in S1_COLS:
    col = df.loc[fp_mask, cat]
    rows.append([f"{cat} (FP only)", f"{col.mean():.4f}", f"{col.std():.4f}",
                 f"{col.min():.4f}", f"{col.max():.4f}"])

# s2 categories (all contracts)
for cat in S2_COLS:
    col = df[cat]
    rows.append([f"{cat} (all)", f"{col.mean():.4f}", f"{col.std():.4f}",
                 f"{col.min():.4f}", f"{col.max():.4f}"])

print("=" * 70)
print("Table 1: Summary Statistics")
print("=" * 70)

print("\nPanel A: Contract Outcomes")
print("-" * 50)
print(tabulate(rows, headers=["Variable", "Mean", "Std Dev", "Min", "Max"],
               tablefmt="simple", numalign="right"))

# Panel B: Project and agency attributes
print("\nPanel B: Covariate Means")
print("-" * 50)
cov_vars = PROJ_COLS + AGEN_COLS + COMP_COLS
cov_rows = [[var, f"{df[var].mean():.4f}"] for var in cov_vars]
print(tabulate(cov_rows, headers=["Covariate", "Mean"], tablefmt="simple", numalign="right"))

print(f"\nN = {nsim:,} contracts")

Table 1: Summary Statistics

Panel A: Contract Outcomes
--------------------------------------------------
Variable                    Mean    Std Dev          Min          Max
----------------------  --------  ---------  -----------  -----------
Base Price p ($)          336754     188771       126847       999736
Ex-post Changes q ($)    26951.4     130273      -822319  2.49503e+06
Firm-Fixed Price (d=1)    0.9622     0.1908            0            1
Restricted (r=1)          0.6598     0.4738            0            1
Number of Bids            1.6353     1.9246            1           35
s1_cat1 (FP only)         5086.2    61474.3      -682889  1.78564e+06
s1_cat2 (FP only)        21077.7     103599      -822319   1.6056e+06
s1_cat3 (FP only)       -1630.77    37811.2  -1.4919e+06       707700
s2_cat1 (all)             0.1299     0.8285           -1      17.7174
s2_cat2 (all)             0.1786     0.8305      -1.7381      15.2976
s2_cat3 (all)             0.1853     1.0393      -5.0

In [26]:
# ── Competition Distribution ──
comp_dist = (
    df.groupby('competition')
    .size()
    .reset_index(name='count')
    .assign(fraction=lambda x: x['count'] / nsim)
)

rows_comp = [[f"c = {int(row['competition'])}", int(row['count']), f"{row['fraction']:.4f}"]
             for _, row in comp_dist.iterrows()]

print("Competition Distribution:")
print(tabulate(rows_comp, headers=["Competition Level (c)", "Count", "Fraction"],
               tablefmt="simple", numalign="right"))

# ── Contract Type Distribution ──
n_fp = int(df['fixed_price'].sum())
n_cp = nsim - n_fp
n_restr = int(df['restricted'].sum())
n_comp  = nsim - n_restr

print(f"\nFixed-Price contracts: {n_fp} ({n_fp/nsim*100:.1f}%)")
print(f"Cost-Plus contracts:   {n_cp} ({n_cp/nsim*100:.1f}%)")
print(f"Restricted (no competition): {n_restr} ({n_restr/nsim*100:.1f}%)")
print(f"Competitive:                 {n_comp} ({n_comp/nsim*100:.1f}%)")

Competition Distribution:
Competition Level (c)      Count    Fraction
-----------------------  -------  ----------
c = 0                       4606      0.6598
c = 1                        871      0.1248
c = 2                        453      0.0649
c = 3                        523      0.0749
c = 4                        528      0.0756

Fixed-Price contracts: 6717 (96.2%)
Cost-Plus contracts:   264 (3.8%)
Restricted (no competition): 4606 (66.0%)
Competitive:                 2375 (34.0%)


---
## Part 3: Estimation

### Overview

**Players**: Agency (government buyer) chooses contract form; Contractor (private seller) knows their type $\pi$.

**Timeline**:
| # | Stage | Description |
|---|-------|-------------|
| 1 | Nature | Contractor draws type $\pi \sim \text{Beta}(\alpha, \beta)$ |
| 2 | Entry  | Contractors compete if expected surplus > private cost $\eta$ |
| 3 | Design | Agency picks FP (d=1) or CP (d=0) to minimize expected cost |
| 4 | Bid    | Contractor submits base price $p$ |
| 5 | Execute | Cost shock realized |
| 6 | Signal | Ex-post signals $s_1$ (price mods), $s_2$ (duration mods) observed |
| 7 | Settle | CP: agency pays $q(s)$;  FP: price adjustments applied |

**FP vs CP trade-off**: FP transfers cost risk to the contractor (who demands a premium); CP reimburses costs (but the contractor may inflate them). The agency picks whichever is cheaper in expectation.

**Estimation roadmap** (5 sequential steps):

| Step | What we estimate | Data used | Method | # params |
|------|-----------------|-----------|--------|----------|
| 1 | Type distribution $f(\pi \mid x,z)$ | Contract choice $d$ | MLE (binary) | 14 × 5 |
| 2 | Signal distributions $f_L(s)$ | Signals $s_1, s_2$ | MLE (two-part) | 30 + 36 |
| 3 | Cost structure $(\alpha, \beta, \psi)$ + signal shifts | Prices $p, q$ | NLS | 35 |
| 4 | Search cost $\kappa$ | — (closed form from Steps 1–3) | Analytic | 0 |
| 5 | Competition cost $F(\eta)$ | Entry/competition $r$ | MLE (binary) | 14 |

**Dependencies**: Step 1 → Step 2 → Step 3 (uses Steps 1–2) → Step 4 (uses Steps 1–3) → Step 5 (uses Steps 1–4).


In [27]:
# ── Gauss-Legendre Quadrature (fun_lgwt.m) ──
def fun_lgwt(N, a, b):
    """Gauss-Legendre quadrature nodes and weights on [a, b]."""
    N_range = np.arange(1, N)
    beta_gl = N_range / np.sqrt(4 * N_range**2 - 1)
    J = np.diag(beta_gl, -1) + np.diag(beta_gl, 1)
    eigvals, eigvecs = np.linalg.eigh(J)
    x = eigvals
    w = 2 * eigvecs[0, :]**2
    # Map from [-1, 1] to [a, b]
    x = 0.5 * (b - a) * x + 0.5 * (a + b)
    w = 0.5 * (b - a) * w
    # Sort
    idx = np.argsort(x)
    return x[idx], w[idx]

# ── Halton Sequence Generator (halton.m) ──
def halton_seq(ndim, npts, bases):
    """Generate Halton quasi-random sequence (MATLAB-compatible: starts at index 0)."""
    result = np.zeros((npts, ndim))
    for dim in range(ndim):
        base = bases[dim]
        seq = np.zeros(npts)
        for i in range(npts):
            n = i           # start at 0 to match MATLAB's halton.m (seed=0)
            f = 1.0
            val = 0.0
            while n > 0:
                f /= base
                val += f * (n % base)
                n //= base
            seq[i] = val
        result[:, dim] = seq
    return result

# ── Constants ──
ERR   = 1e-64     # floor for log/division (1 - ERR == 1.0 in float64, so upper clips at 1 - ERR are no-ops)
MV    = 500       # cap for exponentials
pimin = 0.01
pimax = 0.99
sngrid = 5000     # simulation draws for s
pngrid = 50       # quadrature nodes for pi

# ── Quadrature nodes ──
pivec, piweight = fun_lgwt(pngrid, pimin, pimax)

# ── Halton draws (6 dimensions: s1 cat 1-3, s2 cat 1-3) ──
dlt = 100   # burn-in
bases = [2, 3, 5, 7, 11, 13]
smat_full = halton_seq(6, dlt + sngrid, bases)
smat = smat_full[dlt:, :]   # (sngrid, 6)

print(f"Quadrature: {pngrid} nodes on [{pimin}, {pimax}]")
print(f"Halton draws: {sngrid} points in {len(bases)} dimensions (bases={bases})")
print(f"  pi range: [{pivec[0]:.4f}, {pivec[-1]:.4f}]")
print(f"  Halton sample: min={smat.min():.4f}, max={smat.max():.4f}")


Quadrature: 50 nodes on [0.01, 0.99]
Halton draws: 5000 points in 6 dimensions (bases=[2, 3, 5, 7, 11, 13])
  pi range: [0.0106, 0.9894]
  Halton sample: min=0.0000, max=0.9998
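
As a sanity check on the quadrature rule, the sketch below builds a 50-node Gauss-Legendre rule on the same interval with NumPy's `leggauss` (a swapped-in routine, not the notebook's `fun_lgwt`) and confirms two properties any such rule should satisfy. The helper name `gauss_legendre` is ours.

```python
import numpy as np

def gauss_legendre(n, a, b):
    """Gauss-Legendre nodes/weights on [a, b] via numpy's leggauss (on [-1, 1])."""
    x, w = np.polynomial.legendre.leggauss(n)
    return 0.5 * (b - a) * x + 0.5 * (a + b), 0.5 * (b - a) * w

nodes, weights = gauss_legendre(50, 0.01, 0.99)

# Weights sum to the interval length
length_err = abs(weights.sum() - 0.98)

# An n-node rule is exact for polynomials up to degree 2n - 1; try x**3
approx = np.sum(weights * nodes**3)
exact = (0.99**4 - 0.01**4) / 4.0
print(f"length error: {length_err:.2e}, cubic error: {abs(approx - exact):.2e}")
```

Both errors should be at the level of floating-point rounding. The same two checks can be run against the `pivec` and `piweight` arrays produced above.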


### Step 1: Type Distribution $f(\pi \mid x, z)$

**Reference**: Section 5.1, Proposition 2, Online Appendix B

**Data**: Binary contract choice $d_i \in \{0, 1\}$ (CP vs FP)

**Assumption**: $\pi_i \sim \text{Beta}(\alpha_i, \beta)$ with $\alpha_i = 1 + \exp([x_i, z_i] \cdot \theta_\alpha)$, $\beta = 1 + \exp(\theta_\beta)$

**Parameters**: $\theta_\alpha$ (13 coefficients) and $\theta_\beta$ (1 scalar), 14 in total per competition level

**Key equation** (Proposition 2 — agency's cost minimization → closed-form choice probability):

$$\boxed{\Pr(\text{CP}_i) = \frac{1}{1 + \displaystyle\int \left[\frac{\pi}{1-\pi}\right]^{n_i} f(\pi \mid x_i, z_i)\, d\pi}}$$

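- The ratio $[\pi/(1-\pi)]^n$ measures FP's relative advantage. It rises with $n$ when $\pi > 1/2$ and falls with $n$ when $\pi < 1/2$, so more bidders make the integral larger, and $\Pr(\text{CP})$ smaller, when the type density is concentrated above $1/2$.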
- The integral is computed via **Gauss-Legendre quadrature** (50 nodes) since it has no closed form.

**Estimation**: Binary MLE: find $(\theta_\alpha, \theta_\beta)$ that maximize $\sum_i [d_i \ln \Pr(\text{FP}_i) + (1-d_i) \ln \Pr(\text{CP}_i)]$, separately for each competition level $c \in \{0, \dots, 4\}$.

**Output**: `par_var` (14×5) → prior $f(\pi \mid x,z)$, posterior $f(\pi \mid d, x, z)$, model-predicted $\Pr(\text{CP})$.
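
Before the full estimation code, here is a stripped-down sketch of the key equation for a single observation. It uses NumPy's `leggauss` for the quadrature and writes the Beta density by hand so that it depends only on NumPy and the standard library. The function name `prob_cost_plus` and the shape values `a=3.0`, `b=2.0` are hypothetical.

```python
import math
import numpy as np

def prob_cost_plus(n_bidders, a, b, pi_min=0.01, pi_max=0.99, n_nodes=50):
    """Pr(CP) = 1 / (1 + integral of [pi/(1-pi)]^n f(pi) dpi), by quadrature."""
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    pi = 0.5 * (pi_max - pi_min) * x + 0.5 * (pi_max + pi_min)
    w = 0.5 * (pi_max - pi_min) * w

    # Beta(a, b) density on the rescaled support [pi_min, pi_max]
    u = (pi - pi_min) / (pi_max - pi_min)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    f_pi = np.exp(log_norm + (a - 1) * np.log(u) + (b - 1) * np.log1p(-u))
    f_pi /= (pi_max - pi_min)

    integral = np.sum((pi / (1.0 - pi)) ** n_bidders * f_pi * w)
    return 1.0 / (1.0 + integral)

# Hypothetical Beta shapes; both exceed 1, as in alpha = 1 + exp(.)
for n in (0, 1, 2, 4):
    print(n, round(prob_cost_plus(n, a=3.0, b=2.0), 4))
```

With $n = 0$ the integrand is just the density, so the integral is one and $\Pr(\text{CP}) = 1/2$. This is a convenient check that the density is correctly normalized on the rescaled support.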


In [28]:
# ═══════════════════════════════════════════════════════════════════
# Step 1: Estimate the pi distribution via MLE
# ═══════════════════════════════════════════════════════════════════

def step1_negll(theta, df_sub, pivec, piweight, pimin, pimax):
    """
    Negative log-likelihood for Step 1 (one competition level).

    Pipeline:  theta -> Beta(alpha, beta) -> f(pi) -> Pr(CP) -> log-likelihood
    """
    n = len(df_sub)
    K = len(pivec)

    # (1) Parameters -> Beta shape parameters
    #     alpha_i = 1 + exp(xz_i @ theta_alpha),  beta = 1 + exp(theta_beta)
    xz = np.column_stack([np.ones(n), df_sub[PROJ_COLS + AGEN_COLS + COMP_COLS].values])
    alpha = 1.0 + np.exp(np.clip(xz @ theta[:xz.shape[1]], -20, 20))  # (n,)
    beta  = 1.0 + np.exp(np.clip(theta[xz.shape[1]], -20, 20))        # scalar

    # (2) Beta density f(pi | alpha, beta) at each quadrature node
    #     Rescale pi from [pimin, pimax] to [0, 1] for Beta PDF
    pi_std = (pivec - pimin) / (pimax - pimin)                         # (K,)
    f_pi = stats.beta.pdf(pi_std[None, :], alpha[:, None], beta) / (pimax - pimin)  # (n, K)
    f_pi = np.nan_to_num(f_pi, nan=0.0, posinf=1e10)

    # (3) Pr(CP) = 1 / (1 + integral),  where integral = sum_k [pi/(1-pi)]^b * f(pi) * w
    b_obs = df_sub['b_censored'].values
    fp_advantage = (pivec[None, :] / (1 - pivec[None, :])) ** b_obs[:, None]  # (n, K)
    integral = (fp_advantage * f_pi * piweight[None, :]).sum(axis=1)           # (n,)
    pr_cp = 1.0 / (1.0 + integral)
    pr_cp = np.clip(pr_cp, ERR, 1 - ERR)

    # (4) Binary log-likelihood: d=1 -> log Pr(FP),  d=0 -> log Pr(CP)
    d_obs = df_sub['fixed_price'].values
    ll = d_obs * np.log(1 - pr_cp) + (1 - d_obs) * np.log(pr_cp)

    return -np.sum(ll) if np.isfinite(ll.sum()) else 1e20


# ── Estimate for each competition level ───────────────────────────

print("Step 1: Estimating pi distribution for each competition level...")
print(f"  Parameters per level: {nxz + 1} (= {nxz} covariate coeffs + 1 beta shape)\n")

par_var = np.zeros((nxz + 1, bmax + 1))   # (14, 5)

for c_level in range(bmax + 1):
    df_sub = df[df['competition'] == c_level]
    if len(df_sub) == 0:
        continue

    negll = lambda theta: step1_negll(theta, df_sub, pivec, piweight, pimin, pimax)

    # Try L-BFGS-B first; fall back to Nelder-Mead with random starts
    x0 = np.zeros(nxz + 1)
    result = optimize.minimize(negll, x0, method='L-BFGS-B',
                               options={'maxiter': 5000, 'ftol': 1e-6})

    if not result.success or not np.isfinite(result.fun):
        print(f"  c={c_level}: L-BFGS-B failed, trying Nelder-Mead...")
        for trial in range(5):
            x0_alt = np.random.randn(nxz + 1) * 0.1
            res_alt = optimize.minimize(negll, x0_alt, method='Nelder-Mead',
                                        options={'maxiter': 20000, 'xatol': 1e-6, 'fatol': 1e-6})
            if res_alt.success and np.isfinite(res_alt.fun) and res_alt.fun < result.fun:
                result = res_alt
                break

    par_var[:, c_level] = result.x
    print(f"  c={c_level}: {len(df_sub):>4d} obs, converged={result.success}, "
          f"neg-loglik={result.fun:.4f}")

print("\nStep 1 estimation complete.")


Step 1: Estimating pi distribution for each competition level...
  Parameters per level: 14 (= 13 covariate coeffs + 1 beta shape)

  c=0: 4606 obs, converged=True, neg-loglik=625.4139
  c=1:  871 obs, converged=True, neg-loglik=117.0322
  c=2:  453 obs, converged=True, neg-loglik=16.6773
  c=3:  523 obs, converged=True, neg-loglik=57.9023
  c=4:  528 obs, converged=True, neg-loglik=27.9218

Step 1 estimation complete.


In [29]:
# ═══════════════════════════════════════════════════════════════════
# Step 1 (cont): Reconstruct pi distributions for all observations
# ═══════════════════════════════════════════════════════════════════
#
# From par_var we compute three objects needed by later steps:
#   fpi_v : f(pi | x, z)       — prior type density (before seeing d)
#   fpi   : f(pi | d, x, z)    — posterior type density (after seeing d)
#   prv   : Pr(CP | x, z)      — model-predicted contract choice prob

# ── (a) Recover Beta shapes for each obs ──────────────────────────

par_fpiv = np.zeros((nsim, 2))  # columns: alpha_i, beta_i

for c_level in range(bmax + 1):
    df_sub = df[df['competition'] == c_level]
    if len(df_sub) == 0:
        continue
    idx = df_sub.index
    theta = par_var[:, c_level]
    xz = np.column_stack([np.ones(len(df_sub)),
                          df_sub[PROJ_COLS + AGEN_COLS + COMP_COLS].values])
    par_fpiv[idx, 0] = 1.0 + np.exp(np.clip(xz @ theta[:xz.shape[1]], -20, 20))
    par_fpiv[idx, 1] = 1.0 + np.exp(np.clip(theta[xz.shape[1]], -20, 20))

# ── (b) Prior density f(pi | x, z) at quadrature nodes ───────────

pi_std = (pivec - pimin) / (pimax - pimin)  # standardize to [0,1]
fpi_v = stats.beta.pdf(
    pi_std[None, :],          # (1, K)
    par_fpiv[:, 0:1],         # (N, 1)  -- alpha
    par_fpiv[:, 1:2]          # (N, 1)  -- beta
) / (pimax - pimin)           # (N, K)
fpi_v = np.nan_to_num(fpi_v, nan=0.0, posinf=1e10)

# ── (c) Pr(cost-plus) for all observations ────────────────────────

fp_advantage = (pivec[None, :] / (1 - pivec[None, :])) ** b_censored[:, None]  # (N, K)
integral = (fp_advantage * fpi_v * piweight[None, :]).sum(axis=1)               # (N,)
prv = 1.0 / (1.0 + integral)

# ── (d) Posterior density f(pi | d, x, z) ─────────────────────────
#
# FP (d=1): posterior tilts toward high pi, since the weight [pi/(1-pi)]^b rises with pi
#     f(pi|d=1) = Pr(CP)/(1-Pr(CP)) * [pi/(1-pi)]^b * f(pi|x,z)
# CP (d=0): posterior = prior (no update)

bayes_ratio = (prv / np.clip(1 - prv, ERR, None))[:, None]   # (N, 1)
fpi_fp = bayes_ratio * fp_advantage * fpi_v                    # (N, K)

fpi = np.where(d[:, None] == 1, fpi_fp, fpi_v)                # (N, K)

# ── (e) Display results ──────────────────────────────────────────
print("=" * 70)
print("Step 1 Results: Estimated Parameters")
print("=" * 70)

param_labels = (
    [f"theta_alpha[{c}]" for c in ['intercept'] + PROJ_COLS + AGEN_COLS + COMP_COLS]
    + ["theta_beta"]
)
headers = ["Parameter", "c=0", "c=1", "c=2", "c=3", "c=4"]
rows = [[label] + [f"{par_var[i, c]:.6f}" for c in range(bmax + 1)]
        for i, label in enumerate(param_labels)]
print(tabulate(rows, headers=headers, tablefmt="simple", numalign="right"))

print(f"\nModel-predicted Pr(CP): mean={np.mean(prv):.4f}, median={np.median(prv):.4f}")
print(f"Observed CP share:     {1 - np.mean(d):.4f}")
print(f"Observed FP share:     {np.mean(d):.4f}")


Step 1 Results: Estimated Parameters
Parameter                              c=0        c=1        c=2        c=3       c=4
-------------------------------  ---------  ---------  ---------  ---------  --------
theta_alpha[intercept]             8.36482    4.06493    5.59828    9.57096   2.00086
theta_alpha[dur_gt_3mo]           0.034758   0.447211  -0.403943   0.362774  0.441466
theta_alpha[size]                -0.439383  -0.309998    0.08191   0.190361  0.085618
theta_alpha[service]              -3.76404   -3.09883   0.164976   -1.82487  -1.49358
theta_alpha[commercial]            1.38938    3.50822     1.4451   0.684505   2.54461
theta_alpha[defense]              -1.03966   0.391579    1.77647  -0.823338   1.60384
theta_alpha[dca]                 -0.521339  -0.752211   -1.72615  -0.357408  -3.06154
theta_alpha[experience]          -0.754825   -0.24886  -0.152083  -0.039285  -0.88067
theta_alpha[past_experience]       -0.5271   0.182405    -1.6035  -0.638449  -3.88987
theta_alpha[workl

### Step 2: Signal Distributions $f_L(s)$

**Reference**: Section 5.2

**Data**: $s_1$ (price modifications, FP only), $s_2$ (duration modifications, all)

**Assumption**: Zero-inflated distributions:

$$s_1 = \begin{cases}
0 & \text{with prob } \Phi(x'\gamma + \delta \cdot \mathbb{1}[s_2>0]) \\
\text{Normal}(\mu, \sigma^2) & \text{otherwise}
\end{cases} \qquad s_2 = \begin{cases}
0 & \text{with prob } \Phi(x'\gamma) \\
\text{Gamma}(\alpha, \beta) & \text{otherwise}
\end{cases}$$

- **Step 2a** — $s_1$: Probit($x'\gamma + \delta \cdot \mathbb{1}[s_2>0]$) + Normal$(\mu, \sigma^2)$, FP only. Constraint $\delta \leq 0$ (duration changes → more price changes).
- **Step 2b** — $s_2$: Probit($x'\gamma$) + Gamma$(\alpha, \beta)$, with 3 CP shift parameters. Constraint: CP probit shift $\leq 0$ (CP → more duration changes).

**Parameters**: `pars1` (10×3 = 30), `pars2` (12×3 = 36)

**Estimation**: Standard two-part MLE for each of 3 signal categories. No structural content — pure curve-fitting.

- $\pi$ is continuous (probability of high-cost), but the realized cost type $v \in \{0, 1\}$ is binary. Signals are generated conditional on $v$:
$f_L(s) = f(s \mid v\!=\!0)$, $f_H(s) = f(s \mid v\!=\!1)$.

**Output**: `pars1`, `pars2` → baseline signal distributions used in Steps 3–5.

- The likelihood ratio $\ell(s) = f_L(s)/f_H(s)$ drives the optimal CP payment schedule $q(\ell)$. Step 2 estimates $f_L$ (baseline from FP contracts); Step 3 adds 9 shift parameters to create $f_H$.
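
The two-part structure is easiest to see on synthetic data. The sketch below is a deliberately simplified version (intercept-only probit, Normal nonzero part, no covariates and no sign constraint), so it is not the notebook's estimator. The data-generating values and the function names are assumptions made for illustration.

```python
import math
import numpy as np

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_part_loglik(s, gamma0, mu, sigma):
    """Intercept-only probit for Pr(s = 0) plus Normal(mu, sigma^2) for s != 0."""
    pz = norm_cdf(gamma0)
    nonzero = s[s != 0]
    n_zero = s.size - nonzero.size
    ll_zero = n_zero * math.log(pz)
    ll_pos = nonzero.size * math.log(1.0 - pz) + np.sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
        - 0.5 * ((nonzero - mu) / sigma) ** 2
    )
    return ll_zero + ll_pos

# Synthetic data (assumed): 70% exact zeros, the rest Normal(2, 1.5^2)
rng = np.random.default_rng(1)
n = 2000
s = np.where(rng.random(n) < 0.7, 0.0, rng.normal(2.0, 1.5, size=n))

# The two parts separate, so each has a closed-form MLE in this simple case
share_zero = np.mean(s == 0)
mu_hat = s[s != 0].mean()
sigma_hat = s[s != 0].std()          # ddof=0 is the MLE
# Invert the probit: find gamma0 with Phi(gamma0) = share_zero by bisection
lo, hi = -8.0, 8.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if norm_cdf(mid) < share_zero else (lo, mid)
gamma0_hat = 0.5 * (lo + hi)

ll_hat = two_part_loglik(s, gamma0_hat, mu_hat, sigma_hat)
ll_off = two_part_loglik(s, gamma0_hat + 0.2, mu_hat + 0.5, sigma_hat)
print(f"share zero={share_zero:.3f}, loglik at MLE={ll_hat:.1f}, perturbed={ll_off:.1f}")
```

Because the zero part and the nonzero part enter the log-likelihood additively, they can be maximized separately here. Once covariates and the sign constraint on $\delta$ are added, a numerical optimizer is needed instead.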






In [30]:
# ═══════════════════════════════════════════════════════════════════
# Step 2a: Price Changes s1 (Zero-Inflated Normal, FP only)
# ═══════════════════════════════════════════════════════════════════
#
# For each of 3 price-modification categories, fit a two-part model
# to FP contracts:
#   Part 1 (probit):  Pr(s1 = 0 | x, 1[s2>0])
#   Part 2 (Normal):  s1 | s1 != 0 ~ N(mu, sigma^2)

print("Step 2a: Estimating s1 distribution (price changes, FP contracts only)...")

def s1_negll(par, X, s, is_zero, is_fp):
    """
    Two-part MLE for s1: probit(zero) + Normal(nonzero), FP contracts only.

    par layout: [probit_coefs (8), mu, log_sigma]
    """
    # Unpack: named parameters instead of index arithmetic
    probit_coef = par[:X.shape[1]]   # 8 = 7 project + 1 duration indicator
    mu          = par[-2]
    sigma       = np.exp(par[-1])

    # Part 1: probit — probability of zero
    pz = np.clip(stats.norm.cdf(X @ probit_coef), ERR, 1 - ERR)

    # Part 2: Normal — density of nonzero values
    ll = is_fp * (
        is_zero * np.log(pz)
        + (1 - is_zero) * (np.log(1 - pz) + stats.norm.logpdf(s, mu, sigma))
    )
    return -np.sum(ll)


# ── Estimate for each category ────────────────────────────────────

n_s1_params = nproj + 3   # 7 probit + dur_shift + mu + log_sigma = 10
pars1 = np.zeros((n_s1_params, 3))
is_fp = d  # d=1 means FP

for k, (s1_col, s2_col) in enumerate(zip(S1_COLS, S2_COLS)):
    s_val   = df[s1_col].values
    is_zero = (s_val == 0).astype(float)
    dur_ind = (df[s2_col] > 0).astype(float).values

    X = np.column_stack([xproj, dur_ind])   # (N, 8): xproj from the data-loading cell plus dur_ind

    # Starting values: zeros for probit, sample moments for Normal
    nz_mask = (is_fp == 1) & (s_val != 0)
    x0 = np.zeros(n_s1_params)
    x0[-2] = np.mean(s_val[nz_mask])
    x0[-1] = np.log(np.std(s_val[nz_mask]) + 1)

    # dur_shift (last probit coef, index 7) <= 0; everything else unbounded
    bounds = [(None, None)] * nproj + [(None, 0)] + [(None, None)] * 2

    res = optimize.minimize(
        s1_negll, x0, args=(X, s_val, is_zero, is_fp),
        method='L-BFGS-B', bounds=bounds, options={'maxiter': 5000, 'ftol': 1e-6}
    )
    pars1[:, k] = res.x
    print(f"  {s1_col}: converged={res.success}, nll={res.fun:.2f}")

# ── Results ──
print("\n" + "=" * 70)
print("Step 2a Results: s1 Parameters")
print("=" * 70)
labels = [f"probit[{c}]" for c in ['const'] + PROJ_COLS] + ["probit[dur_chg]", "mu", "log_sig"]
rows = [[labels[i]] + [f"{pars1[i,j]:.4f}" for j in range(3)] for i in range(n_s1_params)]
print(tabulate(rows, headers=["Parameter"] + S1_COLS, tablefmt="simple", numalign="right"))


Step 2a: Estimating s1 distribution (price changes, FP contracts only)...
  s1_cat1: converged=True, nll=11069.48
  s1_cat2: converged=True, nll=13137.72
  s1_cat3: converged=True, nll=9175.35

Step 2a Results: s1 Parameters
Parameter             s1_cat1    s1_cat2    s1_cat3
------------------  ---------  ---------  ---------
probit[const]            1.95     2.0819     1.8863
probit[dur_gt_3mo]    -0.1278    -0.2846    -0.1685
probit[size]          -0.1474    -0.1215    -0.1318
probit[service]       -0.3312    -0.5057    -0.4613
probit[commercial]    -0.1215    -0.1307    -0.1524
probit[defense]       -0.0604     0.1341     0.1671
probit[dca]            0.0646     0.0142     0.0823
probit[dur_chg]       -1.8572    -3.4072    -1.0891
mu                    49227.7     164245   -19285.1
log_sig               12.1306    12.4098    11.7653


In [31]:
# ═══════════════════════════════════════════════════════════════════
# Step 2b: Duration Changes s2 (Zero-Inflated Gamma, all contracts)
# ═══════════════════════════════════════════════════════════════════
#
# Same two-part structure as 2a, but:
#   - Gamma instead of Normal for nonzero values
#   - Uses both FP and CP contracts
#   - CP contracts get 3 shifted parameters (probit, alpha, beta)

print("Step 2b: Estimating s2 distribution (duration changes, all contracts)...")

def s2_negll(par, X, s, is_zero, is_fp):
    """
    Two-part MLE for s2: probit(zero) + Gamma(nonzero), with CP shifts.

    par layout: [probit_coefs (7), log_alpha, log_beta, cp_probit_shift, cp_alpha_shift, cp_beta_shift]
    """
    n_x = X.shape[1]

    # Unpack: named parameters
    probit_coef   = par[:n_x]           # (7,) probit coefficients
    log_alpha     = par[n_x]            # log of FP Gamma shape
    log_beta      = par[n_x + 1]        # log of FP Gamma scale
    cp_pr_shift   = par[n_x + 2]        # CP probit shift (<= 0)
    cp_alpha_shift = par[n_x + 3]       # CP shift added to log shape
    cp_beta_shift  = par[n_x + 4]       # CP shift added to log scale

    is_cp = 1 - is_fp

    # ── FP component ──
    pz_fp = np.clip(stats.norm.cdf(X @ probit_coef), ERR, 1 - ERR)
    ll_fp = is_fp * (
        is_zero * np.log(pz_fp)
        + (1 - is_zero) * (np.log(1 - pz_fp)
            + stats.gamma.logpdf(s, a=np.exp(log_alpha), scale=np.exp(log_beta)))
    )

    # ── CP component (shifted parameters) ──
    pz_cp = np.clip(stats.norm.cdf(X @ probit_coef + cp_pr_shift), ERR, 1 - ERR)
    ll_cp = is_cp * (
        is_zero * np.log(pz_cp)
        + (1 - is_zero) * (np.log(1 - pz_cp)
            + stats.gamma.logpdf(s, a=np.exp(log_alpha + cp_alpha_shift),
                                    scale=np.exp(log_beta + cp_beta_shift)))
    )

    return -np.sum(ll_fp + ll_cp)


# ── Estimate for each category ────────────────────────────────────

n_s2_params = nproj + 5   # 7 probit + log_alpha + log_beta + 3 CP shifts = 12
pars2 = np.zeros((n_s2_params, 3))
is_fp = d

for k, s2_col in enumerate(S2_COLS):
    s_val   = df[s2_col].values
    is_zero = (s_val <= 0).astype(float)
    s_safe  = np.where(is_zero, 1.0, s_val)   # replace zeros for Gamma eval

    x0 = np.zeros(n_s2_params)

    # cp_probit_shift (index nproj+2) <= 0
    bounds = [(None, None)] * (nproj + 2) + [(None, 0)] + [(None, None)] * 2

    res = optimize.minimize(
        s2_negll, x0, args=(xproj, s_safe, is_zero, is_fp),
        method='L-BFGS-B', bounds=bounds, options={'maxiter': 5000, 'ftol': 1e-6}
    )
    pars2[:, k] = res.x
    print(f"  {s2_col}: converged={res.success}, nll={res.fun:.2f}")

# ── Results ──
print("\n" + "=" * 70)
print("Step 2b Results: s2 Parameters")
print("=" * 70)
labels = ([f"probit[{c}]" for c in ['const'] + PROJ_COLS]
          + ["log_alpha", "log_beta", "cp_probit_shift", "cp_alpha_shift", "cp_beta_shift"])
rows = [[labels[i]] + [f"{pars2[i,j]:.4f}" for j in range(3)] for i in range(n_s2_params)]
print(tabulate(rows, headers=["Parameter"] + S2_COLS, tablefmt="simple", numalign="right"))


Step 2b: Estimating s2 distribution (duration changes, all contracts)...
  s2_cat1: converged=True, nll=2902.90
  s2_cat2: converged=True, nll=3014.57
  s2_cat3: converged=True, nll=3657.30

Step 2b Results: s2 Parameters
Parameter             s2_cat1    s2_cat2    s2_cat3
------------------  ---------  ---------  ---------
probit[const]          1.6497     2.3601     1.3781
probit[dur_gt_3mo]    -0.0246    -0.4255    -0.0643
probit[size]          -0.2048    -0.3794    -0.0956
probit[service]       -0.2106    -0.5962    -0.0059
probit[commercial]     -0.067     -0.162      0.041
probit[defense]        0.0257    -0.0523    -0.0653
probit[dca]             -0.15     -0.208    -0.1204
log_alpha              -0.295     0.1275     -0.484
log_beta               0.6952     0.5557      1.015
cp_probit_shift         -0.21    -0.4803    -0.2611
cp_alpha_shift         0.1372     0.0704     0.0249
cp_beta_shift         -0.0849    -0.3048    -0.1888
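
To read the Gamma rows of the table above, it can help to convert them back to levels. The sketch below copies the `s2_cat1` column and computes the mean of the nonzero part (shape × scale) for FP and CP contracts. It assumes only the parameterization used in `s2_negll`, where the CP shifts are added on the log scale; the helper name `gamma_mean` is illustrative and not part of the notebook.

```python
import math

# Values copied from the s2_cat1 column of the Step 2b results table
log_alpha, log_beta = -0.295, 0.6952
cp_alpha_shift, cp_beta_shift = 0.1372, -0.0849

def gamma_mean(log_shape, log_scale):
    """Mean of a Gamma distribution given log shape and log scale."""
    return math.exp(log_shape) * math.exp(log_scale)

fp_mean = gamma_mean(log_alpha, log_beta)
cp_mean = gamma_mean(log_alpha + cp_alpha_shift, log_beta + cp_beta_shift)
print(f"E[s2 | s2 > 0]  FP: {fp_mean:.3f}   CP: {cp_mean:.3f}")
```

For this category the positive shape shift outweighs the negative scale shift, so the implied CP mean is slightly above the FP mean.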


### Step 3: Cost Parameters (NLS)

**Reference**: Section 5.3, Propositions 3–4

- **Data**: base price $p_i$ (all contracts), ex-post adjustment $q_i$ (CP only), observed $s_{2,i}$
- **Parameters**: 35 total: $\delta_\alpha, \delta_\beta, \delta_\psi$ (7 each), 4 type shifters, 9 signal shifters, 1 payment floor
- **From previous steps**: $f(\pi \mid d, x, z)$ (Step 1), $f_L(s), f_H(s)$ (Step 2)

**Parametric assumptions** — cost, information rent, and risk aversion are log-linear in project characteristics:
$$\alpha(\pi,x) = \exp(x'\delta_\alpha + \delta_{\alpha\pi}\pi + \delta_{\alpha\pi^2}\pi^2), \quad \beta(\pi,x) = \exp(x'\delta_\beta + \delta_{\beta\pi}\pi + \delta_{\beta\pi^2}\pi^2), \quad \psi(q; x) = \exp(x'\delta_\psi)\left\{1-\exp \left[-q/\exp(x' \delta_\psi)\right] \right\}$$
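
As a minimal sketch of the log-linear forms above, the snippet below evaluates one component on a small grid of $\pi$ and checks that it factors into a project-specific base times a type shifter, which is how `solve_prices()` builds `alp` and `bet`. All numbers are made up for illustration, and the function name `log_linear_component` is hypothetical.

```python
import numpy as np

def log_linear_component(x, delta, d_pi, d_pi2, pi):
    """exp(x'delta + d_pi * pi + d_pi2 * pi**2), evaluated on a grid of pi."""
    pi = np.asarray(pi, dtype=float)
    return np.exp(x @ delta + d_pi * pi + d_pi2 * pi ** 2)

# Made-up inputs for illustration (not estimates)
x = np.array([1.0, 0.0, 1.0])          # constant plus two project dummies
delta = np.array([2.0, 0.3, -0.1])
pi_grid = np.array([0.0, 0.25, 0.5])

alpha = log_linear_component(x, delta, d_pi=0.8, d_pi2=-0.4, pi=pi_grid)
base = np.exp(x @ delta)               # value at pi = 0
shifter = np.exp(0.8 * pi_grid - 0.4 * pi_grid ** 2)
print(alpha)
print(base * shifter)                  # same values, computed as base * shifter
```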

---

**Key equation 1** (Equation 3.9 — optimal CP payment schedule given CARA utility):

$$\boxed{q(\ell) = -\psi \left[\ln(1 - \tilde\pi) - \ln\left(1 - \tilde\pi \cdot \ell(s)\right)\right]}$$

- $\ell(s) = f_L(s)/f_H(s)$ is the **likelihood ratio** — this is where Step 2 feeds in.
- $\tilde\pi$ is the cutoff type where the low-cost contractor's IR constraint binds (found by root-finding).
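- $q$ is decreasing in $\ell(s)$, with $q = 0$ at $\ell = 1$. When $\ell(s)$ is high (signal looks like low-cost), $q$ is small or negative → punishment. When $\ell(s)$ is low (looks like high-cost), $q$ is large → reward.
- $\psi$ scales the schedule: in the CARA form above, $1/\exp(x'\delta_\psi)$ is the coefficient of absolute risk aversion, so higher risk aversion (smaller $\psi$) → flatter payment schedule.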
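
The boxed formula can be evaluated directly on a grid of likelihood ratios. The sketch below uses illustrative values for $\psi$ and $\tilde\pi$ (they are not estimates) and omits the payment floor `min_q` that the notebook's `q_payment()` applies for large $\ell$.

```python
import numpy as np

def q_payment(lr, pi_tilde, psi):
    """q(l) = -psi * [ln(1 - pi_tilde) - ln(1 - pi_tilde * l)], for pi_tilde * l < 1."""
    lr = np.asarray(lr, dtype=float)
    return -psi * (np.log(1.0 - pi_tilde) - np.log(1.0 - pi_tilde * lr))

# Illustrative values only
psi, pi_tilde = 2.0, 0.4
lr_grid = np.array([0.25, 0.5, 1.0, 1.5, 2.0])
q_grid = q_payment(lr_grid, pi_tilde, psi)

for l, qv in zip(lr_grid, q_grid):
    print(f"l = {l:4.2f}  ->  q = {qv:+.4f}")
```

On this grid the payment is positive for $\ell < 1$, exactly zero at $\ell = 1$, and negative for $\ell > 1$.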

---

**Key equation 2** (Equation 3.10 — FP price under competitive bidding):

$$\boxed{E[p_i \mid d_i\!=\!1] = \int \left[\alpha(\pi, x) + \max\!\left(\beta - IR,\; 0\right) \cdot \frac{\pi(1-\pi)^{n-1}}{1-(1-\pi)^n}\right] f(\pi \mid d\!=\!1, x, z)\, d\pi}$$

- The bracket $[\cdots]$ is the **expected FP price** at type $\pi$: base cost $\alpha$ plus the premium $\max(\beta - IR, 0)$ scaled by the competitive markup.
- $IR = E_s[\psi(q(\ell(s))) \cdot (1 - \ell(s))]$ is the **incentive rent** the agency saves by using signals — it shrinks the FP premium.
- The markup $\pi(1-\pi)^{n-1}/[1-(1-\pi)^n]$ is the **first-order statistic** of $n$ bidders drawn from type distribution $\pi$. More bidders → smaller markup.
- Integration is over $\pi$ using Gauss-Legendre (50 nodes); $IR$ itself uses simulated signals (5000 Halton draws).
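
A short sketch of the two numerical ingredients in this equation: the markup term and Gauss-Legendre integration over $\pi$. The support `[pi_lo, pi_hi]` and the uniform density are placeholders standing in for `[pimin, pimax]` and $f(\pi \mid d, x, z)$; the helper names are hypothetical.

```python
import numpy as np

def fp_markup(pi, n):
    """pi * (1 - pi)^(n-1) / [1 - (1 - pi)^n] for n bidders."""
    pi = np.asarray(pi, dtype=float)
    return pi * (1.0 - pi) ** (n - 1) / (1.0 - (1.0 - pi) ** n)

def gauss_legendre(a, b, n_nodes=50):
    """Gauss-Legendre nodes and weights rescaled from [-1, 1] to [a, b]."""
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    nodes = 0.5 * (b - a) * x + 0.5 * (b + a)
    weights = 0.5 * (b - a) * w
    return nodes, weights

pi_lo, pi_hi = 0.05, 0.95                               # placeholder support
nodes, weights = gauss_legendre(pi_lo, pi_hi)
density = np.full_like(nodes, 1.0 / (pi_hi - pi_lo))    # uniform placeholder density

avg_markup = {n: float(np.sum(fp_markup(nodes, n) * density * weights))
              for n in (1, 2, 5)}
print(avg_markup)
```

With a single bidder the markup equals one for every $\pi$, and the averaged markup falls as $n$ grows.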

---

**Key equation 3** (Equation 3.11 — CP base price is NOT just observed cost):

$$\boxed{E[p_i \mid d_i\!=\!0] = \int \left[\alpha(\pi, x) + \beta(\pi, x) - E_s\!\left[\psi(q(\ell(s)))\right]\right] f(\pi \mid d\!=\!0, x, z)\, d\pi}$$

- CP price = base cost $\alpha$ + high-cost premium $\beta$ − **expected savings** from the incentive payment scheme.
- The $-E[\psi(q)]$ term is key: the agency designs the signal-based payment $q(\ell)$ to reduce the total cost below $\alpha + \beta$.

---

**Key equation 4** (Section 5.2.3 — CP ex-post adjustment conditions on **observed** $s_2$):

$$\boxed{E[q_i \mid d_i\!=\!0, s_{2,i}^{\text{obs}}] = \int \left[\int q(\ell(s_1, s_{2}^{\text{obs}})) \, f_H(s_1 \mid s_2^{\text{obs}})\, ds_1\right] f(\pi \mid d\!=\!0, x, z)\, d\pi}$$

- Only $s_1$ (price modifications) is simulated; $s_2$ (duration changes) is **observed** — so each CP observation $i$ produces a different moment.
- The inner integral draws $s_1$ from $f_H$ because the agency designs the contract assuming the contractor is high-cost (incentive-compatible).
- This gives 264 observation-specific moments (one per CP contract), pinning down the payment schedule shape.
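
Both this moment and `compute_lr()` rely on the same device: draw signals from $f_H$ by inverse CDF, then evaluate $f_L$ and $f_H$ at the same draw. The sketch below does this for a single zero-inflated Normal signal. It is a simplified stand-in: an evenly spaced grid replaces the Halton draws, the standard library's `NormalDist` replaces `scipy.stats.norm`, and all parameter values are invented.

```python
from statistics import NormalDist

def lr_from_high_cost_draws(u_draws, pz_l, pz_h, low, high):
    """f_L(s) / f_H(s) at signals s drawn from the zero-inflated f_H by inverse CDF."""
    ratios = []
    for u in u_draws:
        if u <= pz_h:                       # draw lands on the point mass at zero
            ratios.append(pz_l / pz_h)
        else:                               # rescale the remaining mass to (0, 1)
            uc = (u - pz_h) / (1.0 - pz_h)
            s = high.inv_cdf(uc)            # signal drawn from the high-cost Normal
            f_l = (1.0 - pz_l) * low.pdf(s)
            f_h = (1.0 - pz_h) * high.pdf(s)
            ratios.append(f_l / f_h)
    return ratios

n = 20000
u_draws = [(i + 0.5) / n for i in range(n)]             # even grid on (0, 1)
low, high = NormalDist(0.0, 1.0), NormalDist(0.5, 1.5)  # invented parameters
lr = lr_from_high_cost_draws(u_draws, pz_l=0.8, pz_h=0.6, low=low, high=high)
mean_lr = sum(lr) / n
print(f"mean likelihood ratio under f_H: {mean_lr:.4f}")
```

A useful sanity check is that the likelihood ratio averages to one under $f_H$, since $\int \ell(s) f_H(s)\, ds = \int f_L(s)\, ds = 1$.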

---

**Estimation**: NLS (not MLE — we only match conditional means, not full distributions):
$$\min_\theta \sum_{d_i=1} (p_i - \hat{E}[p_i])^2 + \sum_{d_i=0} (p_i - \hat{E}[p_i])^2 + \sum_{d_i=0} (q_i - \hat{E}[q_i])^2$$

**Computation**: Each NLS evaluation requires $50 \times 5000$ nested integration (quadrature × Halton draws) per project-type case, plus root-finding for $\tilde\pi$ at every $\pi$ node.
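
The objective itself is simple once the predicted means are in hand. The sketch below mirrors the three residual blocks with toy numbers; the function name `nls_objective` and the data are illustrative, and the notebook's version additionally rescales the sum by `1e12`.

```python
import numpy as np

def nls_objective(d, p, q, pred_p_fp, pred_p_cp, pred_q_cp):
    """Sum of squared residuals over the three moment blocks.

    d = 1 marks fixed-price (FP) contracts, d = 0 cost-plus (CP) contracts.
    """
    d = np.asarray(d, dtype=float)
    e_fp = d * (p - pred_p_fp) ** 2          # FP base price
    e_cp = (1 - d) * (p - pred_p_cp) ** 2    # CP base price
    e_q = (1 - d) * (q - pred_q_cp) ** 2     # CP ex-post adjustment
    return float(np.sum(e_fp + e_cp + e_q))

# Toy data: two FP contracts and one CP contract (made-up numbers)
d = np.array([1, 1, 0])
p = np.array([10.0, 12.0, 8.0])
q = np.array([0.0, 0.0, 1.5])
pred_fp = np.array([9.0, 12.5, 0.0])
pred_cp = np.array([0.0, 0.0, 7.0])
pred_q = np.array([0.0, 0.0, 1.0])

obj = nls_objective(d, p, q, pred_fp, pred_cp, pred_q)
print(obj)
```

The masks `d` and `1 - d` ensure that each observation contributes only to the blocks that apply to its contract type.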


In [32]:
# ═══════════════════════════════════════════════════════════════════
# Step 3: Setup
# ═══════════════════════════════════════════════════════════════════

# Map each observation to its project-type case (for efficiency: simulate once per case)
xcase_map = np.zeros(nsim, dtype=int)
for i in range(nsim):
    for j in range(ncase):
        if np.array_equal(xproj[i], xproj_case[j]):
            xcase_map[i] = j
            break

print(f"Step 3 setup: {nsim:,} obs mapped to {ncase} project-type cases")
print(f"Halton draws: {sngrid} x 6 ready for signal simulation")


Step 3 setup: 6,981 obs mapped to 62 project-type cases
Halton draws: 5000 x 6 ready for signal simulation


In [33]:
# ═══════════════════════════════════════════════════════════════════
# Step 3: NLS Objective Function
# ═══════════════════════════════════════════════════════════════════
#
# Structure:
#   (A) unpack_params:        35-vector → named dict
#   (B) build_signal_dists:   Step 2 params + shifts → low/high-cost signal params
#   (C) simulate_draw_density: Halton draws → f(s|type) for one case
#   (D) solve_optimal_payments: likelihood ratios → q(l), FP premium, CP base price
#   (E) fun_nls:              orchestrate everything, compute NLS residuals


def unpack_params(par_cost, n_x):
    """Split 35-parameter vector into named groups."""
    k = n_x  # = 7
    return {
        'd_alpha': par_cost[0:k],      'd_beta': par_cost[k:2*k],
        'd_psi':   par_cost[2*k:3*k],
        'pi_a':    par_cost[3*k:3*k+2],   'pi_b':  par_cost[3*k+2:3*k+4],
        's1_pr':   par_cost[3*k+4:3*k+7], 's1_mu': par_cost[3*k+7:3*k+10],
        's1_sd':   par_cost[3*k+10:3*k+13],
        'min_q':  -np.exp(par_cost[3*k+13]),
    }


def build_signal_dists(pars1, pars2, xpc, par, n_x):
    """
    Build low-cost and high-cost signal distribution parameters
    for each project-type case.

    Returns dict: zero probabilities have shape (ncase, 3); the Normal and
    Gamma parameters have shape (3,).
    """
    # s1: price modifications (zero-inflated Normal)
    xg = xpc @ pars1[:n_x, :]                           # (ncase, 3) probit linear predictor
    dur_shift = pars1[n_x, :]                            # duration indicator shift

    s = {}
    s['pz_l_10'] = stats.norm.cdf(xg)                   # Pr(s1=0 | low, s2=0)
    s['pz_l_11'] = stats.norm.cdf(xg + dur_shift)       # Pr(s1=0 | low, s2>0)
    s['pz_h_10'] = stats.norm.cdf(xg + par['s1_pr'])    # Pr(s1=0 | high, s2=0)
    s['pz_h_11'] = stats.norm.cdf(xg + dur_shift + par['s1_pr'])

    s['s1_mu_l']  = pars1[n_x+1, :]                     # E[s1 | low,  s1≠0]
    s['s1_sd_l']  = np.exp(pars1[n_x+2, :])             # SD[s1 | low,  s1≠0]
    s['s1_mu_h']  = pars1[n_x+1, :] + par['s1_mu']      # E[s1 | high, s1≠0]
    s['s1_sd_h']  = np.exp(pars1[n_x+2, :] + par['s1_sd'])

    # s2: duration modifications (zero-inflated Gamma)
    xg2 = xpc @ pars2[:n_x, :]
    s['pz2_l'] = stats.norm.cdf(xg2)                    # Pr(s2=0 | low)
    s['pz2_h'] = stats.norm.cdf(xg2 + pars2[n_x+2, :]) # Pr(s2=0 | high)
    s['s2_a_l'] = np.exp(pars2[n_x, :])                 # Gamma α | low
    s['s2_b_l'] = np.exp(pars2[n_x+1, :])               # Gamma β | low
    s['s2_a_h'] = np.exp(pars2[n_x, :] + pars2[n_x+3, :])   # Gamma α | high
    s['s2_b_h'] = np.exp(pars2[n_x+1, :] + pars2[n_x+4, :]) # Gamma β | high
    return s


def zi_normal_density(u, pz, mu, sigma):
    """Zero-inflated Normal: f(s) at Halton draw u, for one category.
    Returns (f_val, is_zero) arrays of shape (sngrid,)."""
    is_zero = u <= pz
    f_val = np.where(is_zero, pz, 0.0)
    pos = ~is_zero
    if np.any(pos):
        uc = np.clip((u[pos] - pz) / max(1 - pz, ERR), ERR, 1-ERR)
        sd = stats.norm.ppf(uc, mu, sigma)
        f_val[pos] = (1 - pz) * stats.norm.pdf(sd, mu, sigma)
    return f_val, is_zero


def zi_gamma_density(u, pz, alpha, beta_scale):
    """Zero-inflated Gamma: f(s) at Halton draw u, for one category."""
    is_zero = u <= pz
    f_val = np.where(is_zero, pz, 0.0)
    pos = ~is_zero
    if np.any(pos):
        uc = np.clip((u[pos] - pz) / max(1 - pz, ERR), ERR, 1-ERR)
        sd = stats.gamma.ppf(uc, alpha, scale=beta_scale)
        f_val[pos] = (1 - pz) * stats.gamma.pdf(sd, alpha, scale=beta_scale)
    return f_val, is_zero


def compute_lr(smat_draws, s, ci, sngrid):
    """
    Compute likelihood ratios l(s) = f(s|low)/f(s|high) for one project-type case.

    Draws signals from f(s|high) via inverse-CDF, then evaluates both f(s|low) and f(s|high).
    """
    fl = np.ones((sngrid, 6))
    fh = np.ones((sngrid, 6))

    # s2 draws (cols 3-5): draw from HIGH-cost, evaluate both
    s2_zero = np.zeros((sngrid, 3), dtype=bool)
    for k in range(3):
        u = smat_draws[:, k+3]
        # Draw from high-cost distribution
        fh[:, k+3], s2_zero[:, k] = zi_gamma_density(
            u, s['pz2_h'][ci, k], s['s2_a_h'][k], s['s2_b_h'][k])
        # Evaluate low-cost density at same draw points
        # For zeros: just use low-cost zero probability
        # For positives: evaluate low-cost Gamma at draws from high-cost inverse-CDF
        pos = ~s2_zero[:, k]
        fl[s2_zero[:, k], k+3] = s['pz2_l'][ci, k]
        if np.any(pos):
            uc = np.clip((u[pos] - s['pz2_h'][ci,k]) / max(1-s['pz2_h'][ci,k], ERR), ERR, 1-ERR)
            sd = stats.gamma.ppf(uc, s['s2_a_h'][k], scale=s['s2_b_h'][k])
            fl[pos, k+3] = (1 - s['pz2_l'][ci,k]) * stats.gamma.pdf(
                sd, s['s2_a_l'][k], scale=s['s2_b_l'][k])

    # s1 draws (cols 0-2): conditioned on whether s2 is zero
    for k in range(3):
        u = smat_draws[:, k]
        # s1 zero probability depends on s2 status
        pz_h = np.where(s2_zero[:, k], s['pz_h_10'][ci, k], s['pz_h_11'][ci, k])
        pz_l = np.where(s2_zero[:, k], s['pz_l_10'][ci, k], s['pz_l_11'][ci, k])

        is_zero = u <= pz_h
        pos = ~is_zero

        fh[is_zero, k] = pz_h[is_zero]
        fl[is_zero, k] = pz_l[is_zero]
        if np.any(pos):
            uc = np.clip((u[pos] - pz_h[pos]) / np.maximum(1 - pz_h[pos], ERR), ERR, 1-ERR)
            sd = stats.norm.ppf(uc, s['s1_mu_h'][k], s['s1_sd_h'][k])
            fh[pos, k] = (1 - pz_h[pos]) * stats.norm.pdf(sd, s['s1_mu_h'][k], s['s1_sd_h'][k])
            fl[pos, k] = (1 - pz_l[pos]) * stats.norm.pdf(sd, s['s1_mu_l'][k], s['s1_sd_l'][k])

    return np.prod(fl, axis=1) / np.maximum(np.prod(fh, axis=1), ERR)


def solve_prices(pivec, pngrid, par, xpc_row, lvec, sngrid):
    """
    For one project-type case, solve for FP premium and CP base price at each pi node.

    Relies on the notebook-level globals pimax, ERR, and MV.
    Returns: fp_premium (pngrid,), cp_base (pngrid,), pi_tilde (pngrid,)
    """
    alpha_base = np.exp(xpc_row @ par['d_alpha'])
    beta_base  = np.exp(xpc_row @ par['d_beta'])
    psi        = np.exp(xpc_row @ par['d_psi'])
    mq         = par['min_q']

    fp_prem  = np.zeros(pngrid)
    cp_base  = np.zeros(pngrid)
    pi_tilde = np.zeros(pngrid)

    def q_payment(piq, lv):
        """Optimal payment q(l) given pi_tilde cutoff."""
        ll_cutoff = min(1/max(piq, ERR) - (1-piq)/(max(piq, ERR)*np.exp(-mq/psi)), np.exp(MV))
        ok = lv <= ll_cutoff
        qv = np.full_like(lv, mq)
        qv[ok] = -psi * (np.log(1-piq) - np.log(np.maximum(1 - piq*lv[ok], ERR)))
        return qv

    def psi_transform(qv):
        """ψ(q) = -ψ·exp(-q/ψ) + ψ  (CARA utility transform)."""
        return -psi * np.exp(np.minimum(-qv/psi, MV)) + psi

    for j in range(pngrid):
        pi = pivec[j]
        alp = alpha_base * np.exp(par['pi_a'][0]*pi + par['pi_a'][1]*pi**2)
        bet = beta_base  * np.exp(par['pi_b'][0]*pi + par['pi_b'][1]*pi**2)

        # Find pi_tilde: where low-cost contractor's IR binds
        def ir_gap(x):
            return bet - np.mean(psi_transform(q_payment(x, lvec)) * (1 - lvec))

        if ir_gap(pimax) >= 0:
            pi_tilde[j] = pimax
        else:
            try:
                pi_tilde[j] = optimize.brentq(ir_gap, 1e-10, pimax)
            except ValueError:  # brentq raises ValueError when ir_gap has no sign change
                pi_tilde[j] = pimax

        # Compute optimal payment and integrals
        piq = min(pi, pi_tilde[j])
        qv = q_payment(piq, lvec)
        psi_qv = psi_transform(qv)

        ir_val = np.mean(psi_qv * (1 - lvec))
        if bet - ir_val < -1:  # IR doesn't hold → degenerate contract
            qv[:] = 0
            ir_val = bet

        fp_prem[j] = bet - ir_val
        cp_base[j] = alp + bet - np.mean(psi_qv)

    return fp_prem, cp_base, pi_tilde


def expected_s1_diff(s, ci):
    """E[s1|high] - E[s1|low] for each of 3 categories."""
    diff = np.zeros(3)
    for k in range(3):
        eh = s['s1_mu_h'][k] * (s['pz2_h'][ci,k]*(1-s['pz_h_10'][ci,k])
             + (1-s['pz2_h'][ci,k])*(1-s['pz_h_11'][ci,k]))
        el = s['s1_mu_l'][k] * (s['pz2_l'][ci,k]*(1-s['pz_l_10'][ci,k])
             + (1-s['pz2_l'][ci,k])*(1-s['pz_l_11'][ci,k]))
        diff[k] = eh - el
    return diff


def fun_nls(par_cost, pars1, pars2, sngrid, pngrid, pimin, pimax,
            fpi, nbids, d, p, q, s2, xproj, xproj_case, xcase_map,
            pivec, piweight, smat):
    """
    NLS objective: squared residuals between observed (p, q) and model-predicted E[p], E[q].
    """
    n_x  = xproj.shape[1]
    nobs = len(d)
    nc   = xproj_case.shape[0]
    par  = unpack_params(par_cost, n_x)
    s    = build_signal_dists(pars1, pars2, xproj_case, par, n_x)

    # ── Pre-compute per case: signal simulation → prices ──
    fp_prem = np.zeros((nc, pngrid))
    cp_base = np.zeros((nc, pngrid))
    pi_til  = np.zeros((nc, pngrid))
    s1_diff = np.zeros((nc, 3))

    for ci in range(nc):
        lr = compute_lr(smat, s, ci, sngrid)
        fp_prem[ci], cp_base[ci], pi_til[ci] = solve_prices(
            pivec, pngrid, par, xproj_case[ci], lr, sngrid)
        s1_diff[ci] = expected_s1_diff(s, ci)

    # ── Build predicted prices per observation ──
    pred_fp = np.zeros((nobs, pngrid))
    pred_cp = np.zeros((nobs, pngrid))
    pred_q  = np.zeros((nobs, pngrid))

    for i in range(nobs):
        ci = xcase_map[i]

        if d[i] == 1:
            # FP: base cost + max(premium, 0) × competitive markup
            alp = (np.exp(xproj_case[ci] @ par['d_alpha'])
                   * np.exp(par['pi_a'][0]*pivec + par['pi_a'][1]*pivec**2))
            markup = pivec * (1-pivec)**(nbids[i]-1) / np.maximum(1-(1-pivec)**nbids[i], ERR)
            pred_fp[i] = alp + np.maximum(fp_prem[ci], 0) * markup
        else:
            # CP base price
            pred_cp[i] = cp_base[ci]

            # CP ex-post: simulate l(s1, s2_observed) using OBSERVED s2
            psi = np.exp(xproj_case[ci] @ par['d_psi'])
            mq  = par['min_q']

            # Evaluate f(s2_observed | type) for each category
            fh2 = np.zeros(3)
            fl2 = np.zeros(3)
            pz_h1_obs = np.zeros(3)
            pz_l1_obs = np.zeros(3)
            for k in range(3):
                sv = s2[i, k]
                if sv <= 0:
                    fh2[k] = s['pz2_h'][ci, k]
                    fl2[k] = s['pz2_l'][ci, k]
                    pz_h1_obs[k] = s['pz_h_10'][ci, k]
                    pz_l1_obs[k] = s['pz_l_10'][ci, k]
                else:
                    fh2[k] = (1-s['pz2_h'][ci,k]) * stats.gamma.pdf(
                        max(sv, ERR), s['s2_a_h'][k], scale=s['s2_b_h'][k])
                    fl2[k] = (1-s['pz2_l'][ci,k]) * stats.gamma.pdf(
                        max(sv, ERR), s['s2_a_l'][k], scale=s['s2_b_l'][k])
                    pz_h1_obs[k] = s['pz_h_11'][ci, k]
                    pz_l1_obs[k] = s['pz_l_11'][ci, k]

            # Simulate s1 | s2_observed → compute l(s1, s2_obs)
            fh1 = np.ones((sngrid, 3))
            fl1 = np.ones((sngrid, 3))
            for k in range(3):
                u = smat[:, k]
                pzh = pz_h1_obs[k]
                is_zero = u <= pzh
                pos = ~is_zero

                fh1[is_zero, k] = pzh
                fl1[is_zero, k] = pz_l1_obs[k]
                if np.any(pos):
                    uc = np.clip((u[pos]-pzh)/max(1-pzh, ERR), ERR, 1-ERR)
                    sd = stats.norm.ppf(uc, s['s1_mu_h'][k], s['s1_sd_h'][k])
                    fh1[pos, k] = (1-pzh) * stats.norm.pdf(sd, s['s1_mu_h'][k], s['s1_sd_h'][k])
                    fl1[pos, k] = (1-pz_l1_obs[k]) * stats.norm.pdf(sd, s['s1_mu_l'][k], s['s1_sd_l'][k])

            lr_obs = (np.prod(fl1, axis=1) * np.prod(fl2)) / np.maximum(
                      np.prod(fh1, axis=1) * np.prod(fh2), ERR)

            # E[q | s2_obs] at each pi node
            for j in range(pngrid):
                piq = min(pivec[j], pi_til[ci, j])
                ll_cut = min(1/max(piq,ERR)-(1-piq)/(max(piq,ERR)*np.exp(-mq/psi)), np.exp(MV))
                ok = lr_obs <= ll_cut
                qv = np.full(sngrid, mq)
                qv[ok] = -psi*(np.log(1-piq) - np.log(np.maximum(1-piq*lr_obs[ok], ERR)))
                pred_q[i, j] = np.mean(qv)
            pred_q[i] *= (fp_prem[ci] >= -1)

            # Add E[s1 | s2_obs]
            for k in range(3):
                pred_q[i] += (1 - pz_h1_obs[k]) * s['s1_mu_h'][k]

    # ── NLS residuals: weighted sum of squared prediction errors ──
    e_fp = d     * (p - (pred_fp * fpi * piweight[None, :]).sum(axis=1))**2
    e_cp = (1-d) * (p - (pred_cp * fpi * piweight[None, :]).sum(axis=1))**2
    e_q  = (1-d) * (q - (pred_q  * fpi * piweight[None, :]).sum(axis=1))**2

    return (e_fp + e_cp + e_q).sum() / 1e12


print("Step 3 NLS function defined.")


Step 3 NLS function defined.


In [12]:
# ── Step 3: Run NLS estimation ─────────────────────────────────────
# Starting from MATLAB's pre-computed solution (full optimization
# from scratch requires KNITRO and hours of computation).

par_cost_init = np.array([
    11.2856666300, 0.0167878940, 0.8283690600, -0.0686414670,
    -0.0446272820, -0.0126319760, 0.0280515480,
    12.2090052600, -0.1844087290, 1.8809979820, -1.7882668470,
     1.0801630690, 0.3692475600, -0.0001967690,
    20.6265890100, -0.2961157860, 1.0448682360, -0.3355864960,
    -1.0416946190, -0.0947175070, 2.7815169130,
     2.5016826130, -1.5237617580, 2.3871278180, -7.8791111100,
     1.1452635540, -1.3180507120, -0.0441118570,
    -0.0059153980, -0.0050380370, 0.0017811280,
     6.9057462150, -12.3810570100, -0.0107126760, 13.61492438
])

print("Step 3: NLS estimation of cost parameters")
print(f"  {len(par_cost_init)} parameters, starting from MATLAB pre-computed values")

# Evaluate at starting values
obj0 = fun_nls(par_cost_init, pars1, pars2, sngrid, pngrid, pimin, pimax,
               fpi, b, d, p, q, s2, xproj, xproj_case, xcase_map, pivec, piweight, smat)
print(f"  NLS objective at start: {obj0:.8f}")

# Refine (limited iterations — full convergence takes hours)
try:
    res = optimize.minimize(
        fun_nls, par_cost_init, method='L-BFGS-B',
        args=(pars1, pars2, sngrid, pngrid, pimin, pimax, fpi, b, d, p, q, s2,
              xproj, xproj_case, xcase_map, pivec, piweight, smat),
        options={'maxiter': 50, 'ftol': 1e-6}
    )
    par_cost = res.x
    print(f"  NLS objective after opt: {res.fun:.8f}  (converged={res.success}, iter={res.nit})")
except Exception as e:
    print(f"  Optimization issue: {e} — using starting values")
    par_cost = par_cost_init

print("Step 3 complete.")


Step 3: NLS estimation of cost parameters
  35 parameters, starting from MATLAB pre-computed values
  NLS objective at start: 134.33129896
  NLS objective after opt: 134.28405344  (converged=True, iter=6)
Step 3 complete.


In [13]:
# ── Step 3 Results ─────────────────────────────────────────────────
print("=" * 70)
print("Step 3 Results: Cost Parameters")
print("=" * 70)

par_groups = [
    ("delta_alpha (base cost)",      0,  6),
    ("delta_beta (cost premium)",    7, 13),
    ("delta_psi (risk aversion)",   14, 20),
    ("pi_alpha shifters",           21, 22),
    ("pi_beta shifters",            23, 24),
    ("s1 prob shifts",              25, 27),
    ("s1 mean shifts",              28, 30),
    ("s1 log-sd shifts",            31, 33),
    ("payment floor (-exp(.))",     34, 34),
]

rows = []
for label, start, end in par_groups:
    for i in range(start, end + 1):
        rows.append([label if i == start else "", f"par[{i}]", f"{par_cost[i]:.8f}"])
print(tabulate(rows, headers=["Group", "Index", "Value"], tablefmt="simple", numalign="right"))


Step 3 Results: Cost Parameters
Group                      Index          Value
-------------------------  -------  -----------
delta_alpha (base cost)    par[0]        11.285
                           par[1]      0.015636
                           par[2]       0.82771
                           par[3]    -0.0693124
                           par[4]    -0.0440683
                           par[5]    -0.0120662
                           par[6]     0.0276147
delta_beta (cost premium)  par[7]       12.2083
                           par[8]     -0.185185
                           par[9]       1.88027
                           par[10]     -1.78828
                           par[11]      1.08011
                           par[12]     0.369214
                           par[13]  -0.00018672
delta_psi (risk aversion)  par[14]      20.6266
                           par[15]    -0.296116
                           par[16]      1.04487
                           par[17]    -0.335586
        

### Step 4: Search Costs $\kappa$

**Reference**: Section 5.4, Equation (5)

- **Data**: no new data; uses only the outputs of Steps 1–3
- **Parameters**: none estimated; $\kappa$ is computed in **closed form** from the entry equilibrium condition.

---

**Where does $\kappa$ come from?** The paper models contractor entry as a **Poisson game** (Section 3.3). The buyer pays a per-bidder evaluation cost $\kappa$ to screen each entrant. From the contractor's perspective, the zero-profit entry condition pins down $\kappa$: the expected surplus from entering equals the buyer's per-bidder cost:

$$\boxed{\kappa(\pi, x, z) = \pi \cdot e^{-\pi \lambda} \cdot \left[\beta(\pi, x) + \gamma(\pi, x)\right]}$$

- $\pi \cdot e^{-\pi\lambda}$ = probability of **winning** the contract (Poisson: you need to be the lowest type among a random number of entrants).
- $\beta + \gamma$ = **total surplus** conditional on winning — the informational rent $\beta$ (from knowing your type) plus the risk compensation $\gamma$ (from bearing cost uncertainty).
- In equilibrium, $\kappa = \Pr(\text{win}) \times \text{surplus if win}$. This is the **zero-profit condition** for entry.
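- $\lambda$ = expected number of rival bidders. Since entry is Poisson, $e^{-\pi\lambda}$ is the probability that none of the rivals has the type that occurs with probability $\pi$; multiplying by the bidder's own $\pi$ gives the winning probability $\pi \cdot e^{-\pi\lambda}$.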
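
The exponential term follows from a standard Poisson identity: if the number of rivals is Poisson($\lambda$) and each rival independently fails to be of a given type with probability $1 - \pi$, then $\sum_k \Pr(k)\,(1-\pi)^k = e^{-\pi\lambda}$. The sketch below checks this numerically and then evaluates the boxed formula; the inputs are invented and the function names are hypothetical.

```python
import math

def prob_no_rival_of_type(pi, lam, k_max=200):
    """Sum over k of Poisson(k; lam) * (1 - pi)^k, truncated at k_max."""
    total, term = 0.0, math.exp(-lam)     # term starts at the Poisson pmf for k = 0
    for k in range(k_max + 1):
        total += term * (1.0 - pi) ** k
        term *= lam / (k + 1)             # pmf(k + 1) obtained from pmf(k)
    return total

def kappa(pi, lam, surplus):
    """kappa = pi * exp(-pi * lam) * (beta + gamma)."""
    return pi * math.exp(-pi * lam) * surplus

pi, lam, surplus = 0.3, 1.5, 1000.0       # invented inputs
series = prob_no_rival_of_type(pi, lam)
closed_form = math.exp(-pi * lam)
print(series, closed_form, kappa(pi, lam, surplus))
```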

---

**Sub-computation 1**: $\lambda(\pi, x, z)$ — expected rivals

$$\boxed{\lambda_i(\pi) = \frac{\sum_{j: (x_j,z_j)=(x_i,z_i)} (b_j - 1) \cdot f(\pi \mid d_j, x_j, z_j)}{\sum_{j: (x_j,z_j)=(x_i,z_i)} f(\pi \mid d_j, x_j, z_j)}}$$

- For each $(x,z)$ group, this is a **posterior-weighted average** of observed rival count $(b-1)$.
- Uses the full posterior $f(\pi \mid d, x, z)$ from Step 1 as weights, so $\lambda$ varies with $\pi$.
- Intuitively: in markets where we observe many bidders, $\lambda$ is high → entry is more competitive.
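
A toy version of this weighted average, assuming integer group labels in place of the notebook's `(x, z)` tuples and a two-node grid in place of the 50 quadrature nodes:

```python
import numpy as np

def lambda_by_group(b, fpi, group_ids):
    """Posterior-weighted mean of (b - 1) at each pi node, computed within groups."""
    b = np.asarray(b, dtype=float)
    lam = np.zeros_like(fpi, dtype=float)
    for g in np.unique(group_ids):
        idx = np.flatnonzero(group_ids == g)
        num = ((b[idx] - 1.0)[:, None] * fpi[idx]).sum(axis=0)   # (K,)
        den = fpi[idx].sum(axis=0)                               # (K,)
        lam[idx] = num / den          # every member of the group gets the same row
    return lam

# Four observations in two groups, two pi nodes (made-up numbers)
b = np.array([1, 3, 2, 2])
group_ids = np.array([0, 0, 1, 1])
fpi = np.array([[1.0, 0.5],
                [1.0, 1.5],
                [2.0, 2.0],
                [2.0, 2.0]])
lam = lambda_by_group(b, fpi, group_ids)
print(lam)
```

In the first group the weights differ across nodes, so $\lambda$ varies with $\pi$; in the second group every observation has one rival, so $\lambda = 1$ at both nodes.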

---

**Sub-computation 2**: $\beta(\pi,x) + \gamma(\pi,x)$ — contractor surplus

$$\boxed{\beta + \gamma = \beta + (1-\pi)\underbrace{\int [q(\ell) - \psi(q(\ell))] f_H(s)\, ds}_{\text{risk compensation }\gamma_1} - \pi\underbrace{\int \psi(q(\ell))(1-\ell(s)) f_H(s)\, ds}_{\text{incentive rent used}} + (1-\pi)\underbrace{\sum_k \left(E[s_{1k} \mid H] - E[s_{1k} \mid L]\right)}_{\text{signal cost difference}}}$$

- This uses the **same signal simulation** and **same $q(\ell)$ payment schedule** as Step 3 — the code reuses `compute_lr()` and the root-finding for $\tilde\pi$.
- $q - \psi(q)$ is the **certainty equivalent gap**: the contractor is risk-averse, so receiving risky payment $q$ is worth less than its expected value.
- The final term accounts for the fact that high-cost contractors have different expected signal outcomes.
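
The sign of the gap $q - \psi(q)$ can be checked numerically from the CARA form $\psi(q) = \psi\,(1 - e^{-q/\psi})$ used in Step 3. The value of $\psi$ and the grid of payments below are illustrative only.

```python
import numpy as np

def psi_transform(q, psi):
    """CARA transform: psi * (1 - exp(-q / psi))."""
    return psi * (1.0 - np.exp(-np.asarray(q, dtype=float) / psi))

psi = 2.0                                  # illustrative value
q = np.linspace(-3.0, 3.0, 13)             # payments from -3 to 3 in steps of 0.5
gap = q - psi_transform(q, psi)

for qv, g in zip(q, gap):
    print(f"q = {qv:+.1f}   q - psi(q) = {g:.4f}")
```

The gap is zero at $q = 0$ and positive elsewhere, so averaging it over any non-degenerate payment distribution gives a positive risk compensation term.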

---

**Output**: `lambda_mat`, `bgamma`, `kappa` — all shape $(N, 50)$. Used in Step 5 to compute the **expected surplus from competing**, which determines the entry threshold $\eta$.


In [14]:
# ═══════════════════════════════════════════════════════════════════
# Step 4: Buyer Search Costs (kappa)
# ═══════════════════════════════════════════════════════════════════

print("Step 4: Computing buyer search costs...")

# ── 4-1: Lambda — expected number of rival bidders ────────────────
#
# For each (x,z) group, lambda(pi) = weighted avg of (b-1) using f(pi|x,z) as weights.

# Group observations by (x,z) signature
xz_groups = {}
for i in range(nsim):
    key = tuple(xzvec[i])
    xz_groups.setdefault(key, []).append(i)

lambda_mat = np.zeros((nsim, pngrid))
for key, idx in xz_groups.items():
    idx = np.array(idx)
    fpi_sum   = np.maximum(fpi[idx].sum(axis=0), ERR)         # (K,)
    b_fpi_sum = ((b[idx] - 1)[:, None] * fpi[idx]).sum(axis=0)  # (K,)
    lambda_mat[idx] = b_fpi_sum / fpi_sum

print(f"  lambda: mean={np.mean(lambda_mat):.4f}, max={np.max(lambda_mat):.4f}")


Step 4: Computing buyer search costs...
  lambda: mean=1.2849, max=27.0000


In [15]:
# ── 4-2: Beta + Gamma (contractor surplus) ────────────────────────
#
# Reuses compute_lr() and expected_s1_diff() from Step 3; the pi_tilde
# root-finding below repeats the logic of solve_prices().
# The only new thing: bgamma = beta + (1-pi)*int1 - pi*int0 + (1-pi)*diff_s1

print("  Computing beta+gamma (reusing Step 3 signal simulation)...")
print("  (This may take several minutes)")

par  = unpack_params(par_cost, nproj)
sig  = build_signal_dists(pars1, pars2, xproj_case, par, nproj)

beta_case = np.exp(xproj_case @ par['d_beta'])
psi_case  = np.exp(xproj_case @ par['d_psi'])

bgamma_case = np.zeros((ncase, pngrid))   # per case
s1_diff_case = np.zeros((ncase, 3))

for ci in range(ncase):
    lr = compute_lr(smat, sig, ci, sngrid)
    s1_diff_case[ci] = expected_s1_diff(sig, ci)
    psi_v = psi_case[ci]
    mq = par['min_q']

    for j in range(pngrid):
        pi = pivec[j]
        bet = beta_case[ci] * np.exp(par['pi_b'][0]*pi + par['pi_b'][1]*pi**2)

        # Find pi_tilde (same logic as solve_prices)
        def q_pay(piq):
            ll_cut = min(1/max(piq,ERR) - (1-piq)/(max(piq,ERR)*np.exp(-mq/psi_v)), np.exp(MV))
            ok = lr <= ll_cut
            qv = np.full(sngrid, mq)
            qv[ok] = -psi_v * (np.log(1-piq) - np.log(np.maximum(1-piq*lr[ok], ERR)))
            return qv

        def psi_t(qv):
            return -psi_v * np.exp(np.minimum(-qv/psi_v, MV)) + psi_v

        def ir_gap(x):
            return bet - np.mean(psi_t(q_pay(x)) * (1 - lr))

        if ir_gap(pimax) >= 0:
            pi_til = pimax
        else:
            try:
                pi_til = optimize.brentq(ir_gap, 1e-10, pimax, xtol=1e-8)
            except ValueError:
                pi_til = pimax  # fallback when IR doesn't change sign

        piq = min(pi, pi_til)
        qv = q_pay(piq)
        psi_qv = psi_t(qv)
        int0 = np.mean(psi_qv * (1 - lr))   # int(psi(q)(1-l) fH ds)

        if bet - int0 < -1:
            qv[:] = 0
            int0 = bet

        int1 = np.mean(qv - psi_t(qv))      # int((q - psi(q)) fH ds)
        bgamma_case[ci, j] = bet + (1-pi)*int1 - pi*int0 + (1-pi)*s1_diff_case[ci].sum()

# Map cases → observations
bgamma = bgamma_case[xcase_map]   # (nsim, pngrid) via fancy indexing

print(f"  bgamma: shape={bgamma.shape}, mean={np.mean(bgamma):.2f}")


  Computing beta+gamma (reusing Step 3 signal simulation)...
  (This may take several minutes)
  bgamma: shape=(6981, 50), mean=693028.59


In [16]:
# ── 4-3: Kappa = pi * exp(-pi*lambda) * (beta + gamma) ────────────

pi_mat = pivec[None, :]   # (1, K) — broadcasts to (N, K)
kappa = pi_mat * np.exp(-pi_mat * lambda_mat) * bgamma

# Summary (integrate over pi)
mean_kappa  = np.mean((kappa * fpi * piweight[None, :]).sum(axis=1))
mean_lambda = np.mean((lambda_mat * fpi * piweight[None, :]).sum(axis=1))
mean_bgamma = np.mean((bgamma * fpi * piweight[None, :]).sum(axis=1))

print("\nStep 4 Results")
print("=" * 40)
print(tabulate([
    ["E[kappa]",  f"{mean_kappa:.4f}"],
    ["E[lambda]", f"{mean_lambda:.4f}"],
    ["E[bgamma]", f"{mean_bgamma:.2f}"],
], headers=["Statistic", "Value"], tablefmt="simple", numalign="right"))
print("Step 4 complete.")



Step 4 Results
Statistic      Value
-----------  -------
E[kappa]     2696.38
E[lambda]     0.6352
E[bgamma]    5775.93
Step 4 complete.
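
The two `bgamma` averages printed so far differ by construction: the first is an unweighted `np.mean` over every grid point, while `E[bgamma]` weights each grid point by `fpi * piweight` before averaging across observations. The sketch below illustrates the weighted version on toy inputs. The Gauss-Legendre grid, the uniform density, and the values of `lambda_mat` and `bgamma` are assumptions made for illustration; the notebook's own `pivec` and `piweight` may be built differently.

```python
import numpy as np

N, K = 4, 50
pimin, pimax = 0.05, 0.95                 # assumed support of pi

# Gauss-Legendre nodes and weights on [-1, 1], mapped to [pimin, pimax]
nodes, wts = np.polynomial.legendre.leggauss(K)
pivec = 0.5 * (pimax - pimin) * nodes + 0.5 * (pimax + pimin)
piweight = 0.5 * (pimax - pimin) * wts

# Toy density: uniform on [pimin, pimax], identical for every observation
fpi = np.full((N, K), 1.0 / (pimax - pimin))

lam = 0.6                                                  # toy lambda
lambda_mat = np.full((N, K), lam)
bg = np.linspace(100.0, 400.0, N)                          # toy beta + gamma
bgamma = bg[:, None] * np.ones((1, K))

pi_mat = pivec[None, :]                                    # (1, K), broadcasts to (N, K)
kappa = pi_mat * np.exp(-pi_mat * lambda_mat) * bgamma

mass = (fpi * piweight[None, :]).sum(axis=1)               # should be 1 per row
e_kappa = (kappa * fpi * piweight[None, :]).sum(axis=1)    # E[kappa | obs]

# Closed form of the same integral, for checking the quadrature
F = lambda p: -(p / lam + 1.0 / lam**2) * np.exp(-lam * p)
exact = bg * (F(pimax) - F(pimin)) / (pimax - pimin)
print(np.round(e_kappa, 4), np.round(exact, 4))
```

Checking that `mass` is close to 1 for every row is a cheap way to confirm that a density and its quadrature weights are on the same grid before trusting any of the weighted means.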


### Step 5: Competition Costs $F(\eta)$

**Reference**: Section 5.5

**Data**: Binary entry outcome $r_i = 1$ (sole-source) vs. $r_i = 0$ (competitive)

**Assumption**: $$\eta \sim N(\mu_\eta(x,z,\pi), \sigma_\eta^2), \qquad \mu_\eta(x,z,\pi) = x'\delta_{\text{proj}} + z'\delta_{\text{agen}} + \delta_\pi \pi + \delta_{\pi^2} \pi^2$$

**Parameters**: $\delta_{\text{proj}}$ (7), $\delta_{\text{agen}}$ (4), $\delta_\pi$ (1), $\delta_{\pi^2}$ (1), $\sigma_\eta$ (1), for 14 in total

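**Key equation**: the contractor enters when the expected surplus $\omega$ exceeds the private cost $\eta$, so the entry probability is the normal CDF of $\omega$ integrated over types: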

$$\Pr(\text{enter} \mid x, z) = \int \Phi\!\left(\frac{\omega(\pi,x,z) - \mu_\eta}{\sigma_\eta}\right) f(\pi \mid x,z) \, d\pi$$

where $\omega(\pi) = \underbrace{(1 - e^{-\lambda\pi})}_{\Pr(\text{win})} \cdot \underbrace{(\beta + \gamma)}_{\text{surplus if win}} - \underbrace{\kappa\lambda}_{\text{evaluation cost}}$ is the expected surplus, built from the Step 4 outputs $\lambda$, $\beta + \gamma$, and $\kappa$.

**Estimation**: Binary MLE, following the same logic as Step 1. The model predicts an entry probability for each observation, and the MLE picks the parameters that best match the observed competition outcomes $r_i$.

**Output**: `par_eta`, shape (14,). With all five steps complete, every structural primitive is estimated: $f(\pi)$, $f(s)$, $(\alpha,\beta,\psi)$, $\kappa$, $F(\eta)$.
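
The key equation and the binary likelihood can be sketched on a small grid. Everything numeric below is hypothetical (the five-point trapezoid grid, the uniform density, the `omega` and `mu` values); `entry_prob` and `negll` are illustrative helper names, not functions from the replication package. The sketch also shows why the log arguments are floored with `np.maximum` rather than clipped from above: in float64, `1 - 1e-128` is exactly `1.0`, so an upper clip at that value would do nothing.

```python
import numpy as np
from scipy import stats

ERR = 1e-128
print(1.0 - ERR == 1.0)      # an upper clip at 1 - ERR would be a no-op

def entry_prob(omega, mu, sigma, fpi, piweight):
    """Pr(enter) = sum_k Phi((omega - mu) / sigma) * f(pi_k) * w_k, per observation."""
    Phi = stats.norm.cdf(omega, loc=mu, scale=sigma)      # (N, K)
    return (Phi * fpi * piweight[None, :]).sum(axis=1)    # (N,)

def negll(pr_enter, r):
    """Binary negative log-likelihood; r = 1 marks sole-source (no entry)."""
    ll = ((1 - r) * np.log(np.maximum(pr_enter, ERR))
          + r * np.log(np.maximum(1 - pr_enter, ERR)))
    return -ll.sum()

# Toy inputs (hypothetical numbers)
K = 5
pivec = np.linspace(0.1, 0.9, K)
piweight = np.full(K, 0.2)                 # trapezoid rule with step 0.2
piweight[[0, -1]] *= 0.5                   # half weight at the end points
fpi = np.full((3, K), 1.0 / 0.8)           # uniform density on [0.1, 0.9]
omega = np.tile(10.0 * pivec, (3, 1))      # surplus rising in pi
mu = np.array([[0.0], [5.0], [10.0]]) + np.zeros((1, K))
r = np.array([0, 0, 1])

pr = entry_prob(omega, mu, sigma=3.0, fpi=fpi, piweight=piweight)
print(np.round(pr, 3), round(negll(pr, r), 3))
```

A higher mean cost `mu` lowers the predicted entry probability, which is the channel through which the $\delta$ coefficients are identified from variation in $x$ and $z$.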


In [17]:
# ═══════════════════════════════════════════════════════════════════
# Step 5: Competition Costs (Eta Distribution)
# ═══════════════════════════════════════════════════════════════════

print("Step 5: Estimating competition cost parameters...")

# ── 5-1: Average f(pi|x,z) within (x,z) groups ──────────────────

fpia = np.zeros((nsim, pngrid))
for key, idx in xz_groups.items():
    idx = np.array(idx)
    fpia[idx] = fpi[idx].mean(axis=0)

# ── 5-2: Omega — expected surplus from competing ─────────────────

pi_mat = pivec[None, :]
omega = (1 - np.exp(-lambda_mat * pi_mat)) * bgamma - kappa * lambda_mat

# ── 5-3: MLE for eta ~ N(mu(x,z,pi), sigma^2) ───────────────────

def step5_negll(par_eta, pivec, piweight, omega, fpia, r, xproj, xagen):
    """
    Binary MLE: Pr(enter) = integral over pi of Phi((omega - mu) / sigma) * f(pi).
    """
    sigma = par_eta[-1]
    if sigma <= 0:
        return 1e20

    # mu(x, z, pi) = xproj @ d_proj + xagen @ d_agen + d_pi * pi + d_pi2 * pi^2
    n_x, n_z = xproj.shape[1], xagen.shape[1]
    mu_base = xproj @ par_eta[:n_x] + xagen @ par_eta[n_x:n_x+n_z]  # (N,)
    mu_pi   = par_eta[n_x+n_z] * pivec + par_eta[n_x+n_z+1] * pivec**2  # (K,)
    mu = mu_base[:, None] + mu_pi[None, :]   # (N, K)

    Phi = stats.norm.cdf(omega, loc=mu, scale=sigma)   # (N, K)
    # NOTE: Do NOT use np.clip(x, lo, 1-1e-128) — in float64, 1-1e-128 == 1.0!
    # Instead, follow MATLAB's approach: max(x, err) inside log() arguments.
    pr_enter = (Phi * fpia * piweight[None, :]).sum(axis=1)

    ERR = 1e-128
    ll = (1 - r) * np.log(np.maximum(pr_enter, ERR)) + r * np.log(np.maximum(1 - pr_enter, ERR))
    nll = -np.sum(ll)
    return nll if np.isfinite(nll) else 1e20


# Starting values from MATLAB
par_eta_init = np.array([
    -1143.22328, 24.91867, 5.03510, -0.93250, 8.91838, -39.38664, 55.37775,
    -8.57415, 18.18002, -27.10958, 0.38569, 3448.62586, -2200.83489, 141.25938
])

bounds = [(None, None)] * 13 + [(1e-10, None)]  # sigma > 0
res = optimize.minimize(
    step5_negll, par_eta_init, method='L-BFGS-B', bounds=bounds,
    args=(pivec, piweight, omega, fpia, r, xproj, xagen),
    options={'maxiter': 5000, 'ftol': 1e-6}
)

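# Fallback: if MATLAB starting values don't work with Python upstream estimates,
# try from simple starting values. step5_negll returns the 1e20 penalty when the
# likelihood is non-finite, and L-BFGS-B can report success on that flat surface,
# so the penalty value has to be checked explicitly.
if not np.isfinite(res.fun) or res.fun >= 1e20 or not res.success: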
    print("  MATLAB init failed — trying from simple starting values...")
    x0_simple = np.zeros(14)
    x0_simple[-1] = 100.0  # sigma > 0
    res2 = optimize.minimize(
        step5_negll, x0_simple, method='L-BFGS-B', bounds=bounds,
        args=(pivec, piweight, omega, fpia, r, xproj, xagen),
        options={'maxiter': 5000, 'ftol': 1e-6}
    )
    if np.isfinite(res2.fun) and (not np.isfinite(res.fun) or res2.fun < res.fun):
        res = res2
        print("  Simple init succeeded.")

par_eta = res.x
print(f"  Converged: {res.success}, nll={res.fun:.4f}")

# Results
print("\n" + "=" * 70)
print("Step 5 Results: Competition Cost Parameters")
print("=" * 70)
labels = ([f"d_proj[{c}]" for c in ['const']+PROJ_COLS]
          + [f"d_agen[{c}]" for c in AGEN_COLS]
          + ["d_pi", "d_pi2", "sigma"])
rows = [[labels[i], f"{par_eta[i]:.4f}"] for i in range(len(par_eta))]
print(tabulate(rows, headers=["Parameter", "Estimate"], tablefmt="simple", numalign="right"))
print("\nStep 5 complete. All structural primitives estimated.")



Step 5: Estimating competition cost parameters...
  Converged: True, nll=100000000000000000000.0000

Step 5 Results: Competition Cost Parameters
Parameter                  Estimate
-----------------------  ----------
d_proj[const]              -1143.22
d_proj[dur_gt_3mo]          24.9187
d_proj[size]                 5.0351
d_proj[service]             -0.9325
d_proj[commercial]           8.9184
d_proj[defense]            -39.3866
d_proj[dca]                 55.3777
d_agen[experience]          -8.5741
d_agen[past_experience]       18.18
d_agen[workload]           -27.1096
d_agen[congress_rep]         0.3857
d_pi                        3448.63
d_pi2                      -2200.83
sigma                       141.259

Step 5 complete. All structural primitives estimated.
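
The nll printed above, 1e20, is exactly the penalty that `step5_negll` returns for a non-finite likelihood, and the reported estimates coincide with the MATLAB starting values. This suggests the optimizer never left its initial point, so the numbers should be read as starting values rather than Python MLE estimates. The sketch below reproduces the mechanism with a stand-in objective that always returns the penalty: the finite-difference gradient is zero, so L-BFGS-B declares convergence immediately. `always_penalty` and `is_real_optimum` are hypothetical names used only for this illustration.

```python
import numpy as np
from scipy import optimize

PENALTY = 1e20

def always_penalty(x):
    """Stand-in objective that always hits the non-finite guard."""
    return PENALTY

x0 = np.array([1.0, -2.0, 3.0])
res = optimize.minimize(always_penalty, x0, method="L-BFGS-B")

# success is reported even though nothing was optimized
print(res.success, res.fun, np.array_equal(res.x, x0))

def is_real_optimum(res, penalty=PENALTY):
    """Accept a result only if it converged to a finite value below the penalty."""
    return bool(res.success and np.isfinite(res.fun) and res.fun < penalty)

print(is_real_optimum(res))
```

A useful next diagnostic is to check `omega`, `fpia`, and the implied `pr_enter` for NaN or infinite entries at the starting values, since those are the only way the guarded log-likelihood can become non-finite.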


---
## Part 4: Results and Verification

The cells below load and display the published tables from Kang & Miller (2022), taken from the MATLAB replication output, as the benchmark for our Python estimates.

In [18]:
# ═══════════════════════════════════════════════════════════════════
# Part 4: Load reference tables & display helper
# ═══════════════════════════════════════════════════════════════════

TABLE_DIR = os.path.join(BASE_DIR, 'replications', 'figures_and_tables')

ref_tables = {
    'table4':    np.loadtxt(os.path.join(TABLE_DIR, 'table4.csv'),    delimiter=','),
    'table5_est':np.loadtxt(os.path.join(TABLE_DIR, 'table5_est.csv'),delimiter=','),
    'table5_SE': np.loadtxt(os.path.join(TABLE_DIR, 'table5_SE.csv'), delimiter=','),
    'table6A':   np.loadtxt(os.path.join(TABLE_DIR, 'table6A.csv'),   delimiter=','),
    'table6B':   np.loadtxt(os.path.join(TABLE_DIR, 'table6B.csv'),   delimiter=','),
}
print(f"Loaded {len(ref_tables)} reference tables: {list(ref_tables.keys())}")

def display_table(title, data, row_labels, col_headers, fmt=".4f"):
    """Pretty-print a reference table with tabulate."""
    print(f"\n{title}")
    print("=" * 70)
    rows = []
    for i in range(min(len(row_labels), data.shape[0])):
        row = [row_labels[i]] + [f"{data[i, j]:{fmt}}" for j in range(data.shape[1])]
        rows.append(row)
    print(tabulate(rows, headers=col_headers, tablefmt="simple", numalign="right"))

Loaded 5 reference tables: ['table4', 'table5_est', 'table5_SE', 'table6A', 'table6B']
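
The cells that follow display the reference tables only. A side-by-side check against Python estimates could look like the sketch below. `compare_to_reference` is a hypothetical helper, the 5% tolerance is an arbitrary choice, and the numbers in the usage example are placeholders rather than values from the paper or from this notebook.

```python
import numpy as np

def compare_to_reference(labels, est, ref, rtol=0.05):
    """Return rows (label, estimate, reference, relative diff, within tolerance)."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    denom = np.maximum(np.abs(ref), 1e-12)     # guard against division by zero
    rel = np.abs(est - ref) / denom
    return [(lab, float(e), float(f), float(d), bool(d <= rtol))
            for lab, e, f, d in zip(labels, est, ref, rel)]

# Placeholder numbers, for illustration only
rows = compare_to_reference(["stat A", "stat B"], est=[1.02, 0.50], ref=[1.00, 0.40])
for lab, e, f, d, ok in rows:
    flag = "OK" if ok else "CHECK"
    print(f"{lab:8s} est={e:.4f} ref={f:.4f} rel.diff={d:.3f} {flag}")
```

In practice the `ref` column would come from the first column of a loaded table such as `ref_tables['table4']`, with `est` computed from the Python outputs in the same row order.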


In [19]:
# ── Table 4: Model Fit (Section 6.1) ──
t4_labels = [
    "Fraction FP contracts (d=1)",     "Mean log base price",
    "Corr(p, pred p | FP)",            "Corr(p, pred p | CP)",
    "Corr(q, pred q | CP)",            "Mean base price ($M)",
    "Mean pred p | FP",                "Mean pred p | CP",
    "Mean p | competitive",            "Mean p | restricted",
    "Mean ex-post q | CP",             "Mean pred q | CP",
]
display_table("Table 4: Model Fit",
              ref_tables['table4'], t4_labels,
              ["Statistic", "Estimate", "Lower CI", "Upper CI"])


Table 4: Model Fit
Statistic                      Estimate    Lower CI    Upper CI
---------------------------  ----------  ----------  ----------
Fraction FP contracts (d=1)      0.3476      0.3325        0.36
Mean log base price              1.6094      1.4528      1.6563
Corr(p, pred p | FP)             0.9615      0.9539      0.9647
Corr(p, pred p | CP)             0.9405      0.9316      0.9534
Corr(q, pred q | CP)             0.9877      0.9834      0.9909
Mean base price ($M)             363.38      358.54      371.75
Mean pred p | FP                 334.19       330.6      341.64
Mean pred p | CP                 335.34      332.14      342.84
Mean p | competitive             340.46      329.63      345.12
Mean p | restricted              352.02      335.43      363.39
Mean ex-post q | CP               25.14      21.774      27.879
Mean pred q | CP                 55.925      41.812      120.06


In [20]:
# ── Table 5: Heterogeneity in Key Parameters (Section 6.2) ──
t5_labels = [
    "α(π,x) — Low-cost project cost",
    "β(π,x) — Info rent / High-cost premium",
    "ψ(x) — Risk aversion",
    "κ(π,x,z) — Buyer search cost",
    "κ·λ — Total search cost",
    "Pr(compete|x,z) — Entry probability",
    "E[η|compete] — Mean competition cost",
]
t5_headers = ["Parameter", "Mean", "Median", "Std Dev", "PS-S", "C-N"]

display_table("Table 5 — Panel A: Estimates",
              ref_tables['table5_est'], t5_labels, t5_headers)

# Panel B: display SEs in parentheses
print("\nTable 5 — Panel B: Standard Errors")
print("=" * 70)
se = ref_tables['table5_SE']
se_rows = []
for i in range(min(len(t5_labels), se.shape[0])):
    se_rows.append([t5_labels[i]] + [f"({se[i,j]:.4f})" for j in range(se.shape[1])])
print(tabulate(se_rows, headers=t5_headers, tablefmt="simple", numalign="right"))


Table 5 — Panel A: Estimates
Parameter                                 Mean    Median    Std Dev     PS-S      C-N
--------------------------------------  ------  --------  ---------  -------  -------
α(π,x) — Low-cost project cost          0.9404    0.9626     0.0645   0.0967   0.0313
β(π,x) — Info rent / High-cost premium  360.87    244.69     141.81  -27.191  -11.232
ψ(x) — Risk aversion                    40.911    20.373      46.55  -2.0199   19.237
κ(π,x,z) — Buyer search cost            4.5123     1.158     13.169  -6.6116  -0.4611
κ·λ — Total search cost                 1.7028    0.5639     4.6495  -3.7305   -0.321
Pr(compete|x,z) — Entry probability     0.0558     0.056     0.0162   -0.014  -0.0031
E[η|compete] — Mean competition cost    -0.009    -0.014     0.0192  -0.0054   0.0066

Table 5 — Panel B: Standard Errors
Parameter                               Mean       Median     Std Dev    PS-S       C-N
--------------------------------------  ---------  ---------  --------- 

In [21]:
# ── Table 6A: Why So Little Competition? (Section 6.3) ──
t6a_labels = [
    "Pr(compete) — Baseline",
    "+ Remove adverse selection",
    "+ Remove moral hazard",
    "Pr(compete) — No AS or MH",
    "ΔPr(compete) from removing AS",
    "ΔPr(compete) from removing MH",
    "Pr(compete) — No buyer search costs",
    "E[κ·λ] / E[β+γ]",
]
display_table("Table 6A: Why So Little Competition?",
              ref_tables['table6A'], t6a_labels,
              ["Decomposition", "Estimate", "Lower CI", "Upper CI"], fmt=".5f")


Table 6A: Why So Little Competition?
Decomposition                          Estimate    Lower CI    Upper CI
-----------------------------------  ----------  ----------  ----------
Pr(compete) — Baseline                  0.79888     0.65402     0.89398
+ Remove adverse selection               4.8856       4.783      5.3047
+ Remove moral hazard                    9.2406      9.0102      9.9838
Pr(compete) — No AS or MH               0.66403     0.52112     0.74734
ΔPr(compete) from removing AS            2.7283      1.4621      3.3752
ΔPr(compete) from removing MH            3.4319      1.9886      4.1077
Pr(compete) — No buyer search costs     0.57711     0.39316     0.67883
E[κ·λ] / E[β+γ]                         0.01172     0.00434     0.05755


In [22]:
# ── Table 6B: Policy Counterfactuals (Section 6.4) ──
t6b_labels = [
    "Baseline: log total price",       "Baseline: total price ($M)",
    "CF1: Remove AS — Δ log price",    "CF2: Remove MH — Δ log price",
    "CF3: No AS or MH — Δ log price",  "CF4: Symmetric info — Δ log price",
    "CF5: No AS/MH + sym — Δ log price","CF6: Full info — Δ log price",
    "Fraction saved (full info)",       "Log welfare loss",
    "Welfare loss ($M)",                "Full info — Δ log price (total)",
]
display_table("Table 6B: Policy Counterfactuals",
              ref_tables['table6B'], t6b_labels,
              ["Counterfactual", "Estimate", "Lower CI", "Upper CI"], fmt=".5f")


Table 6B: Policy Counterfactuals
Counterfactual                       Estimate    Lower CI    Upper CI
---------------------------------  ----------  ----------  ----------
Baseline: log total price              1.6094      1.4528      1.6563
Baseline: total price ($M)             363.38      358.54      371.75
CF1: Remove AS — Δ log price          0.65749     0.24705      1.0232
CF2: Remove MH — Δ log price          0.00794    -0.02351     0.03999
CF3: No AS or MH — Δ log price        0.02482     0.00958     0.16668
CF4: Symmetric info — Δ log price    -0.01309    -0.10332    -0.00406
CF5: No AS/MH + sym — Δ log price     0.01142      0.0033     0.07114
CF6: Full info — Δ log price           0.0479      0.0162     0.38814
Fraction saved (full info)            0.79033     0.77278     0.87456
Log welfare loss                     -0.95165      -1.687    -0.61134
Welfare loss ($M)                      1.3357     0.83979      2.3403
Full info — Δ log price (total)        0.0479      0.016

---
## Summary

Five-step sequential estimation of Kang & Miller (2022):

| Step | What | Method | Key Inputs |
|------|------|--------|------------|
| 1 | Type distribution f(π\|x,z) | Binary MLE | Contract choice d |
| 2 | Signal distributions f_L(s), f_H(s) | MLE × 2 | Price changes s₁, duration changes s₂ |
| 3 | Cost parameters (α,β,ψ) | NLS (3 moments) | Observed prices, Halton simulation |
| 4 | Buyer search costs κ | Closed-form | Steps 1–3 output |
| 5 | Competition costs F(η) | MLE | Entry decisions |

**Main finding**: Adverse selection and moral hazard — not search costs — explain why competitive bidding is rare in government procurement.