# Demand Estimation for Differentiated Products in Python  

---

## Learning Objectives

By the end of this notebook, you should be able to:

1. Explain why demand estimation is central in Industrial Organization and competition policy.
2. Connect demand estimation to **measurement of market power** (Lerner index, elasticities).
3. Distinguish between **Cournot** and **Bertrand** competition and see why Bertrand is the natural benchmark with differentiated products.
4. Understand the basic structure of:
   - The simple **Multinomial Logit** model,
   - The **Nested Logit** model,
   - The **BLP (Berry–Levinsohn–Pakes)** random-coefficients logit approach at a conceptual level.
5. Implement, in Python:
   - A **logit demand estimation** with aggregate market shares,
   - A **nested logit demand estimation** with segments as nests,
   - Construction and use of **BLP-style instruments** for price and nest share endogeneity.
6. Compute **own- and cross-price elasticities** and **implied marginal costs** from the estimated model.

We’ll do this through a **hands-on example** using the European car market, closely following the structure of problem sets used in empirical IO courses, but treated here as a guided lecture rather than an assignment.

In [2]:
# 0. Setup: packages and display options

import pandas as pd
import numpy as np

import statsmodels.api as sm # for econometric modeling
from statsmodels.iolib.summary2 import summary_col # for regression tables

# Display options
pd.set_option("display.max_columns", 80)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")
pd.set_option("display.max_rows", 100)

## 1. Why do we estimate demand?

In IO and antitrust, **demand** is not just "how much people buy". It is a **function** that links quantities (or market shares) to:
  - Prices,
  - Product characteristics (quality, horsepower, fuel efficiency, etc.),
  - Income and other demand shifters,
  - Unobserved factors (brand reputation, design, etc.).

Why we care:

1. **Pricing and markups**
   - If we know the **own-price elasticity** of a product, and we have some assumptions on competition (e.g. Bertrand pricing), we can infer **markups** and **marginal costs**.
   - This is key for measuring **market power**.

2. **Merger analysis**
   - Competition authorities want to know:
     - If two firms merge, how will they change prices?
     - Which products are close substitutes?
   - Demand estimation allows you to simulate how **prices and quantities** move after a merger.

3. **Policy simulations**
   - "What if" questions:
     - What if we impose a tax/subsidy on certain products (e.g., fuel-inefficient cars)?
     - What if a new product enters or an existing product exits?

4. **Product design and differentiation**
   - Firms can use demand estimates to decide:
     - Which segment to target,
     - Whether to launch a new variant (e.g., SUV vs compact).

In short: estimating demand is essential for **measuring market power** and for **counterfactual analysis** in IO and antitrust.

## 2. Measuring market power

### 2.1 The Lerner Index (homogeneous products, Cournot)

In a simple homogeneous goods model with **Cournot** competition:

- Many firms choose quantities $q_j$ to maximize profits:
  $$
  \pi_j = P(Q) q_j - C_j(q_j), \quad Q = \sum_j q_j
  $$
- The first-order condition implies:
  $$
  \frac{P - MC_j}{P} = \frac{s_j}{\eta}
  $$  
  where:
  - $MC_j$ is marginal cost of firm $j$,
  - $s_j = q_j / Q$ is firm $j$'s **market share**,
  - $\eta$ is the **market demand elasticity** (negative).

The left-hand side is the **Lerner index**:
$$
L_j = \frac{P - MC_j}{P}
$$

Under Cournot with homogeneous products:
- Markups are higher when:
  - Market demand is inelastic (small $|\eta|$),
  - Firm $j$ has a large market share $s_j$.

### 2.2 Differentiated products and Bertrand

In most real-world consumer markets (cars, phones, shampoo, etc.), **products are differentiated**.

- Firms choose **prices** (Bertrand competition),
- Each firm may sell **multiple products**,
- Demand is at the **product level**, not aggregated to one Q.

For a single-product firm $j$, under Bertrand:

$$
\pi_j = (p_j - c_j) q_j(p_1, \dots, p_J)
$$

The first-order condition w.r.t. $p_j$ gives:

$$
q_j + (p_j - c_j)\frac{\partial q_j}{\partial p_j} = 0
\;\Rightarrow\;
\frac{p_j - c_j}{p_j} = - \frac{1}{\eta_{jj}}
$$

where $ \eta_{jj} $ is the **own-price elasticity** of demand for product $ j $:

$$
\eta_{jj} = \frac{\partial q_j}{\partial p_j} \cdot \frac{p_j}{q_j}
$$

So **again**:
- Markups are inversely related to own-price elasticity:
  $$
  L_j = -\frac{1}{\eta_{jj}}
  $$

In differentiated products markets, we need **product-level demand elasticities**, not just a single market elasticity.
That’s why discrete choice models (logit, nested logit, BLP) are so important.


## 3. Cournot vs Bertrand: which one for differentiated products?

Very quick digression:

- **Cournot**:
  - Firms choose quantities.
  - Most natural when capacity/quantity decisions are key.
  - Fits homogeneous goods (cement, some energy markets).

- **Bertrand**:
  - Firms choose prices.
  - Most natural for differentiated products where firms post prices and consumers buy.

For **differentiated products** (cars):

- Products differ in **horsepower, fuel efficiency, brand, size, origin**, etc.
- Firms typically set **prices** (list prices, discounts).
- Consumers then choose **which car to buy**, if any.

So in this notebook, we will:
- Use **Bertrand** as the equilibrium concept.
- Use **discrete choice demand models** to estimate how demand responds to prices and characteristics.
- Later, we will connect those estimated elasticities back to **markups and marginal costs** as in the Bertrand FOCs.

This is the conceptual foundation behind BLP (1995) and Goldberg & Verboven (2001).


## 4. Seminal Papers: BLP (1995) and Goldberg–Verboven (2001)

### 4.1 Berry, Levinsohn and Pakes (1995)

BLP introduce a **random coefficients logit model** that simultaneously addresses three big problems in empirical IO demand estimation:

1. **Many products**
   - Markets with dozens or hundreds of differentiated products (e.g. cars).
   - They derive a way to invert market shares to recover mean utilities.

2. **Price endogeneity**
   - Prices are set by firms and correlated with unobserved quality.
   - BLP propose **instruments** based on:
     - Characteristics of other products sold by the same firm,
     - Characteristics of rivals’ products.

3. **Rich consumer heterogeneity**
   - Consumers differ in how they value characteristics (e.g., some care more about horsepower, others about fuel economy).
   - Random coefficients allow **flexible substitution patterns** and avoid the restrictive IIA property of simple logit.
   - IIA or the **Independence of Irrelevant Alternatives** problem refers to the unrealistic assumption that the relative odds of choosing between two options are unaffected by the presence or characteristics of other alternatives. Random coefficients help mitigate this issue by allowing for more realistic substitution patterns among products.

This has become the **workhorse framework** in merger analysis and antitrust.

---

### 4.2 Goldberg & Verboven (2001)

Goldberg & Verboven use car registration data from European markets to:

- Study **price discrimination** and **tax incidence** in the European car market,
- Combine **nested logit** and **random coefficients** to capture:
  - Segment-level grouping (e.g., compact, midsize, luxury),
  - Consumer heterogeneity in tastes.

The dataset we use (`cars1.dta`) is a **reduced version** of their data, focusing on:

- Market-level sales,
- Prices and characteristics,
- Country and year identifiers,
- Segments and brands.

In this notebook, we’ll:

- Start from **simple logit and nested logit** with this data,
- Introduce **BLP-style instruments**,
- Discuss conceptually how a full BLP model would extend what we are doing.


## 5. Hands-on example: European car market

We will now work through a **fully guided example** using the European car market data.

### Data description

The dataset `cars1.dta` contains:

- **Market identifiers**
  - `country`: market (country),
  - `year`: year,
  - `pop`: population,
- **Product identifiers**
  - `co`: model,
  - `brd`: brand,
  - `frm`: firm,
  - `cla`: segment (to be used as nest in nested logit),
  - `loc`: origin of the car
- **Outcomes and price**
  - `qu`: number of cars sold,
  - `princ`: price relative to average income (we’ll use this as price),
- **Characteristics**
  - `horsepower`: horsepower (kW),
  - `fuel`: fuel consumption (liters per 100 km),
  - `width`, `height`: width and height (cm)

We will:

1. Construct **market shares** and **outside good share**.
2. Estimate a **logit demand model**.
3. Extend to a **nested logit** model with segments as nests.
4. Construct **BLP-style instruments** and do **IV estimation**.
5. Compute **elasticities** and **marginal costs**.

In [5]:
path = "/Users/moxballo/Documents/GitHub/ds4upse-2526-s1/"
data_path = path + "03_data/"
cars = pd.read_stata( data_path + "cars1.dta")

# Quick look
cars.head()

Unnamed: 0,year,country,co,type,segment,domestic,firm,brand,loc,qu,pr,princ,price,horsepower,fuel,width,height,weight,pop,ngdp,ngdpe,country1,country2,country3,country4,country5,yearsquared
0,1983,Belgium,1,alfa 33,compact,0,AlfaRomeo,AlfaRomeo,Italy,729.0,336250.0,0.7915,18.1757,58.0,5.8,161.0,130.5,890,9860000.0,4188800024576.0,234000000.0,1,0,0,0,0,3932289.0
1,1984,Belgium,1,alfa 33,compact,0,AlfaRomeo,AlfaRomeo,Italy,1860.0,348750.0,0.762,17.4987,58.0,5.8,161.0,130.5,890,9860000.0,4512599769088.0,234000000.0,1,0,0,0,0,3936256.0
2,1985,Belgium,1,alfa 33,compact,0,AlfaRomeo,AlfaRomeo,Italy,1771.0,361000.0,0.7363,16.9076,58.0,5.8,161.0,130.5,890,9860000.0,4834399879168.0,234000000.0,1,0,0,0,0,3940225.0
3,1986,Belgium,1,alfa 33,compact,0,AlfaRomeo,AlfaRomeo,Italy,2047.0,339900.0,0.6591,15.1352,58.0,5.8,161.0,134.5,890,9860000.0,5084899966976.0,234000000.0,1,0,0,0,0,3944196.0
4,1987,Belgium,1,alfa 33,compact,0,Fiat,AlfaRomeo,Italy,2147.0,349900.0,0.6493,14.9107,58.0,5.8,161.0,134.5,910,9870000.0,5318699909120.0,234000000.0,1,0,0,0,0,3948169.0


In [6]:
# Check variable names and basic info
cars.info()
cars.describe(include="all").head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11483 entries, 0 to 11482
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   year         11483 non-null  int16   
 1   country      11483 non-null  category
 2   co           11483 non-null  int16   
 3   type         11483 non-null  object  
 4   segment      11483 non-null  category
 5   domestic     11483 non-null  int8    
 6   firm         11483 non-null  category
 7   brand        11483 non-null  category
 8   loc          11483 non-null  category
 9   qu           11483 non-null  float32 
 10  pr           11483 non-null  float32 
 11  princ        11483 non-null  float32 
 12  price        11483 non-null  float32 
 13  horsepower   11483 non-null  float32 
 14  fuel         11483 non-null  float32 
 15  width        11483 non-null  float32 
 16  height       11483 non-null  float32 
 17  weight       11483 non-null  int16   
 18  pop          11483 non-nul

Unnamed: 0,year,country,co,type,segment,domestic,firm,brand,loc,qu,pr,princ,price,horsepower,fuel,width,height,weight,pop,ngdp,ngdpe,country1,country2,country3,country4,country5,yearsquared
count,11483.0,11483,11483.0,11483,11483,11483.0,11483,11483,11483,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0,11483.0
unique,,5,,403,5,,26,38,14,,,,,,,,,,,,,,,,,,
top,,Belgium,,ford escort,subcompact,,Fiat,Renault,France,,,,,,,,,,,,,,,,,,
freq,,2641,,146,3200,,1691,890,2593,,,,,,,,,,,,,,,,,,
mean,1985.4298,,223.0364,,,0.1886,,,,19911.4414,2857566.25,0.8274,18.4968,57.2639,6.7289,164.4574,140.4433,980.231,48057600.0,175568330948608.0,1170979200.0,0.23,0.1961,0.1986,0.1759,0.1993,3942004.25


### 5.1.1 Restrict to the relevant years

The original instructions for this dataset typically restrict the analysis to **1976–1999**.

We will:

- Drop early years (1970–1975),
- Keep a working copy of the data.

We’ll also create a **market identifier** as a combination of market (`ma`) and year (`ye`).


In [9]:
# Restrict to 1976-1999
cars = cars[(cars["year"] >= 1976) & (cars["year"] <= 1999)].copy()

# Create a market_id (country-year or market-year combination)
cars["market_id"] = cars["country"].astype(str) + "_" + cars["year"].astype(str)

cars[["country", "year", "market_id"]].drop_duplicates().head()

Unnamed: 0,country,year,market_id
0,Belgium,1983,Belgium_1983
1,Belgium,1984,Belgium_1984
2,Belgium,1985,Belgium_1985
3,Belgium,1986,Belgium_1986
4,Belgium,1987,Belgium_1987


## 5.2 Constructing market size and market shares

We follow the standard discrete choice logic (Berry (1994, 1995)):

- There are $L_{mt}$ **potential consumers** in market $m$, year $t$.
- Each consumer buys at most **one car** (or chooses the outside option).
- Let:
  - $q_{jmt}$: sales of product $j$ in market-year $(m,t)$.
  - $L_{mt}$: potential market size (number of households).
  - $s_{jmt} = q_{jmt} / L_{mt}$: **market share** of product $j$ relative to potential consumers.
  - $s_{0,mt} = 1 - \sum_j s_{jmt}$: **outside good share** (“no car purchase”).

### 5.2.1 Market size

We approximate **number of households** as:

$$
L_{mt} = \frac{\text{population}_{mt}}{\text{average household size}}
$$

We need to assume an average household size (e.g., 2.5 or 3). We’ll start with:

- `avg_hh_size = 2.5` (you can vary this later in a sensitivity analysis).

We then compute:

1. $L_{mt}$,
2. Total sales $Q_{mt} = \sum_j q_{jmt}$,
3. $s_{jmt} = q_{jmt} / L_{mt}$,
4. $s_{0,mt} = 1 - \sum_j s_{jmt}$.

In [10]:
# Assumed average household size
avg_hh_size = 2.5  # you can play with this later

# Verify that population is constant within a market-year (it should be)
pop_check = cars.groupby("market_id")["pop"].nunique()
print("Markets with more than one population value:", (pop_check > 1).sum())

# Market-level population (take the first within each market)
cars["pop_market"] = cars.groupby("market_id")["pop"].transform("first")

# Potential market size L = households
cars["L"] = cars["pop_market"] / avg_hh_size

# Total sales Q per market-year
cars["Q"] = cars.groupby("market_id")["qu"].transform("sum")

# Inside good market share for each product j
cars["s_j"] = cars["qu"] / cars["L"]

# Outside good share per market
sum_sj_by_market = cars.groupby("market_id")["s_j"].transform("sum")
cars["s_0"] = 1.0 - sum_sj_by_market

cars[["market_id", "qu", "L", "s_j", "s_0"]].head()


Markets with more than one population value: 0


Unnamed: 0,market_id,qu,L,s_j,s_0
0,Belgium_1983,729.0,3944000.0,0.0002,0.9231
1,Belgium_1984,1860.0,3944000.0,0.0005,0.916
2,Belgium_1985,1771.0,3944000.0,0.0004,0.9142
3,Belgium_1986,2047.0,3944000.0,0.0005,0.9107
4,Belgium_1987,2147.0,3948000.0,0.0005,0.9043


### 5.2.2 Segment (nest) shares

For the nested logit, we will use `segment` as the **nest** (e.g., compact, midsize, etc.).

We need, for each product $j$ in segment $g$:

- Segment-level share:
  $$
  s_{g,mt} = \sum_{j \in g} s_{jmt}
  $$
- Product’s share **within its segment**:
  $$
  s_{j|g,mt} = \frac{s_{jmt}}{s_{g,mt}}
  $$

We will also compute $ \ln(s_{j|g}) $ for the nested logit regression.


In [12]:
# Segment share within each market (sum of s_j within each segment)
cars["s_g"] = cars.groupby(["market_id", "segment"])["s_j"].transform("sum")

# Share of product j within its segment
cars["s_jg"] = cars["s_j"] / cars["s_g"]

# Log of within-segment share (for nested logit)
cars["log_sj_g"] = np.log(cars["s_jg"])

cars[["market_id", "segment", "s_j", "s_g", "s_jg", "log_sj_g"]].head()

  cars["s_g"] = cars.groupby(["market_id", "segment"])["s_j"].transform("sum")


Unnamed: 0,market_id,segment,s_j,s_g,s_jg,log_sj_g
0,Belgium_1983,compact,0.0002,0.0298,0.0062,-5.0834
1,Belgium_1984,compact,0.0005,0.0305,0.0155,-4.1683
2,Belgium_1985,compact,0.0004,0.0305,0.0147,-4.2178
3,Belgium_1986,compact,0.0005,0.0327,0.0159,-4.1437
4,Belgium_1987,compact,0.0005,0.0352,0.0155,-4.1698


## 5.3 Logit transformation: $\ln(s_j / s_0)$

From the aggregate logit model:

$$
\ln\left(\frac{s_j}{s_0}\right) = x_j \beta - \alpha p_j + \xi_j
$$

where:
- $ s_j $: share of product \( j \) (relative to potential consumers),
- $ s_0 $: share of the outside good,
- $ x_j $: observed characteristics,
- $ p_j $: price,
- $ \xi_j $: unobserved quality term (error term).
So our **dependent variable** in the logit and nested logit regressions is:

$$
y_j = \ln\left(\frac{s_j}{s_0}\right)
$$


In [13]:
# Construct the logit dependent variable ln(s_j / s_0)
cars["log_sj_s0"] = np.log(cars["s_j"] / cars["s_0"])

cars[["market_id", "s_j", "s_0", "log_sj_s0"]].head()


Unnamed: 0,market_id,s_j,s_0,log_sj_s0
0,Belgium_1983,0.0002,0.9231,-8.516
1,Belgium_1984,0.0005,0.916,-7.5717
2,Belgium_1985,0.0004,0.9142,-7.6187
3,Belgium_1986,0.0005,0.9107,-7.47
4,Belgium_1987,0.0005,0.9043,-7.4163


## 5.4 Specifying the logit and nested logit models

We now specify:

### 5.4.1 Random utility logit (simple logit)

Utility for product $j$ in market $m,t$:

$$
u_{ijmt} = x_{jmt} \beta - \alpha p_{jmt} + \xi_{jmt} + \varepsilon_{ijmt}
$$

where:
- $ x_{jmt} $ includes:
  - `hp` (horsepower),
  - `li` (fuel consumption),
  - `wi` (width),
  - `home` (domestic origin),
  - Time trend (`year`),
  - Country dummies,
  - Brand fixed effects.

The estimation equation:

$$
\ln\left(\frac{s_j}{s_0}\right) = \beta_0 + x_{jmt} \beta - \alpha p_{jmt} + \xi_{jmt}
$$

### 5.4.2 Nested logit (one-level, nests = segments)

We extend the logit by adding the log of within-segment share:

$$
\ln\left(\frac{s_j}{s_0}\right) = \beta_0 + x_{jmt} \beta - \alpha p_{jmt}
+ \sigma \ln(s_{j|g,mt}) + \xi_{jmt}
$$

where:
- $ \sigma $ measures correlation of unobserved utility within a segment/nest.
- When $ \sigma = 0 $, we are back to simple logit.
- When $ \sigma $ is large (close to 1), **within-segment substitution** is strong.

We start with **OLS** (treating price as exogenous), then discuss endogeneity and IV.


In [15]:
cars.columns

Index(['year', 'country', 'co', 'type', 'segment', 'domestic', 'firm', 'brand',
       'loc', 'qu', 'pr', 'princ', 'price', 'horsepower', 'fuel', 'width',
       'height', 'weight', 'pop', 'ngdp', 'ngdpe', 'country1', 'country2',
       'country3', 'country4', 'country5', 'yearsquared', 'market_id',
       'pop_market', 'L', 'Q', 's_j', 's_0', 's_g', 's_jg', 'log_sj_g',
       'log_sj_s0'],
      dtype='object')

In [22]:
# 5.4.3 Construct regressors

# Choose price variable
price_var = "princ"   # price relative to income

# Create dummy if same country of origin as market country
# Convert categoricals to strings before comparing to avoid TypeError
cars["domestic"] = (cars["country"].astype(str) == cars["loc"].astype(str)).astype(int)

# Core characteristics
x_vars = ["horsepower", "fuel", "weight", "domestic", "year"]

# Country, origin, and brand dummies and convert to integers
country_dummies = pd.get_dummies(cars['country'], prefix="cty", drop_first=True).astype(int)
# origin_dummies = pd.get_dummies(cars["loc"], prefix="loc", drop_first=True).astype(int)
brand_dummies = pd.get_dummies(cars["brand"], prefix="brand", drop_first=True).astype(int)

# Assemble design matrix for simple logit
X_logit = pd.concat([cars[x_vars + [price_var]], country_dummies, brand_dummies], axis=1)
X_logit = sm.add_constant(X_logit)

y = cars["log_sj_s0"]

X_logit.shape, y.shape


((9591, 48), (9591,))

In [24]:
# 5.4.4 OLS estimation: simple logit

logit_ols = sm.OLS(y, X_logit).fit()
print(logit_ols.summary())


                            OLS Regression Results                            
Dep. Variable:              log_sj_s0   R-squared:                       0.558
Model:                            OLS   Adj. R-squared:                  0.556
Method:                 Least Squares   F-statistic:                     267.9
Date:                Tue, 02 Dec 2025   Prob (F-statistic):               0.00
Time:                        13:57:51   Log-Likelihood:                -13547.
No. Observations:                9591   AIC:                         2.719e+04
Df Residuals:                    9545   BIC:                         2.752e+04
Df Model:                          45                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

**Interpretation tips for students:**

- The coefficient on `princ` is $-\alpha$ (it should be **negative**).
- Coefficients on:
  - `horsepower` (horsepower): positive if consumers value power,
  - `fuel` (fuel consumption per 100 km): likely **negative** if high fuel consumption is disliked,
  - `weight` (weight): could be positive (bigger cars),
  - `domestic`: positive if consumers prefer domestic brands.
- Year trend (`year`): captures changes over time (e.g., growing market size not captured in L, evolving preferences).

But: **Price is endogenous**. Cars with high unobserved quality (large $\xi_j$) will have both **higher price** and **higher shares**, biasing the price coefficient.

This is why we need **BLP-style instruments**.


In [25]:
# 5.4.5 OLS estimation: nested logit

# Add log_sj_g as additional regressor
X_nested_ols = pd.concat([X_logit, cars[["log_sj_g"]]], axis=1)

nested_ols = sm.OLS(y, X_nested_ols).fit()
print(nested_ols.summary())


                            OLS Regression Results                            
Dep. Variable:              log_sj_s0   R-squared:                       0.919
Model:                            OLS   Adj. R-squared:                  0.919
Method:                 Least Squares   F-statistic:                     2366.
Date:                Tue, 02 Dec 2025   Prob (F-statistic):               0.00
Time:                        14:00:16   Log-Likelihood:                -5388.5
No. Observations:                9591   AIC:                         1.087e+04
Df Residuals:                    9544   BIC:                         1.121e+04
Df Model:                          46                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

**Interpreting the nested logit OLS output:**

- Coefficient on `princ` is still $-\alpha$ (price sensitivity).
- Coefficient on `log_sj_g` is $ \sigma $:
  - If $ 0 < \sigma < 1 $:
    - There is positive correlation in unobserved utility within segments.
    - Products in the same segment are **closer substitutes**.
  - If $ \sigma $ is near 0:
    - We are close to the simple logit (no nesting structure).
- Again, though, price (and potentially `log_sj_g`) is endogenous, so OLS is not reliable.

Next: we construct **BLP instruments** and use **IV/2SLS**.


## 5.5 Endogeneity and BLP instruments

### 5.5.1 Why is price endogenous?

Recall the estimation equation:

$$
\ln\left(\frac{s_j}{s_0}\right) = x_j \beta - \alpha p_j + \xi_j
$$

-  $\xi_j$ is **unobserved quality** (to the econometrician).
- Firms **observe** $\xi_j$ when setting prices.
- So **price and $\xi_j$ are correlated**.
- OLS will **overestimate** quality of high-price, high-share products and bias the price coefficient.

### 5.5.2 BLP-style instruments

BLP propose instruments based on **product characteristics**:

For each product $j$ produced by firm $f$:

1. **IV1: Sums of characteristics of other products of the same firm** in the same market:
   - e.g., sum of horsepower of other models of the same firm.
2. **IV2: Sums of characteristics of products of rival firms** in the same market.
3. For nested logit, we also need instruments for $ \ln(s_{j|g}) $:
   - **IV3: Sums of characteristics of other products in the same segment**.

Intuition:
- Characteristics of other products shift the **competitive environment** and incentive to set higher prices,
- But conditional on product $j$'s own characteristics, they should not be directly correlated with $ \xi_j $.

We will implement these instruments in Python.


In [26]:
# 5.5.3 Construct BLP instruments

# For convenience, define some helper columns
cars["const_one"] = 1  # for counting products

# List of characteristics to use for BLP instruments
char_list = ["horsepower", "fuel", "weight", "domestic", "const_one"]

# Group objects
g_market = cars.groupby("market_id")
g_market_firm = cars.groupby(["market_id", "firm"])
g_market_seg = cars.groupby(["market_id", "segment"])

# Sum of characteristics over ALL products in the market
for v in char_list:
    cars[f"sum_{v}_all"] = g_market[v].transform("sum")

# Sum of characteristics over products of the SAME FIRM in the market
for v in char_list:
    cars[f"sum_{v}_firm"] = g_market_firm[v].transform("sum")

# Instrument type 1: other products of the same firm (within market)
for v in char_list:
    cars[f"iv1_{v}"] = cars[f"sum_{v}_firm"] - cars[v]

# Instrument type 2: products of rival firms (within market)
for v in char_list:
    cars[f"iv2_{v}"] = cars[f"sum_{v}_all"] - cars[f"sum_{v}_firm"]

# Instrument type 3: other products in the same segment (for nested logit)
for v in char_list:
    # sum of characteristics in the segment (market x segment)
    cars[f"sum_{v}_seg"] = g_market_seg[v].transform("sum")
    cars[f"iv3_{v}"] = cars[f"sum_{v}_seg"] - cars[v]

# Take a look at a few IVs
iv_cols_preview = [c for c in cars.columns if c.startswith("iv1_horsepower") or c.startswith("iv2_horsepower") or c.startswith("iv3_horsepower")]
cars[["market_id", "firm", "segment", "horsepower"] + iv_cols_preview].head()

  g_market_firm = cars.groupby(["market_id", "firm"])
  g_market_seg = cars.groupby(["market_id", "segment"])


Unnamed: 0,market_id,firm,segment,horsepower,iv1_horsepower,iv2_horsepower,iv3_horsepower
0,Belgium_1983,AlfaRomeo,compact,58.0,234.0,4858.0,1026.0
1,Belgium_1984,AlfaRomeo,compact,58.0,246.5,4765.0,1114.5
2,Belgium_1985,AlfaRomeo,compact,58.0,270.0,4798.0,940.0
3,Belgium_1986,AlfaRomeo,compact,58.0,278.0,4615.5,948.0
4,Belgium_1987,Fiat,compact,58.0,673.5,3933.0,1016.5


We now have a rich set of instruments:

- `iv1_horsepower`, `iv1_fuel`, `iv1_weight`, `iv1_domestic`, `iv1_const_one`:
  - Sums of other products of the **same firm** in the same market.
- `iv2_*`:
  - Sums of characteristics of **rival firms**’ products.
- `iv3_*`:
  - Sums of characteristics of **other products in the same segment** (useful for nested logit).

Next, we will perform **manual 2SLS (two-stage least squares)**:

1. First stage: regress price (and log_sj_g) on exogenous variables and instruments.
2. Second stage: use the **predicted values** in the demand equation.


In [28]:
# 5.6.1 Prepare matrices for IV estimation

# Dependent variable
y = cars["log_sj_s0"]

# Exogenous regressors (excluding price and log_sj_g)
exog_vars = ["horsepower", "fuel", "weight", "domestic", "year"]
X_exog = pd.concat([cars[exog_vars], country_dummies, brand_dummies], axis=1)
X_exog = sm.add_constant(X_exog)

# Endogenous regressors
price = cars[price_var]

# Instruments: choose a subset to avoid huge dimensionality
iv_vars = []

# Use iv1 and iv2 for horsepower, fuel, weight, domestic, const_one
for prefix in ["iv1_", "iv2_"]:
    for v in ["horsepower", "fuel", "weight", "domestic", "const_one"]:
        iv_vars.append(prefix + v)

Z = pd.concat([X_exog, cars[iv_vars]], axis=1)  # full instrument matrix (including exog)

X_exog.shape, price.shape, Z.shape


((9591, 47), (9591,), (9591, 57))

In [29]:
# 5.6.2 First stage: price on instruments and exogenous variables

first_stage_price = sm.OLS(price, Z).fit()
print(first_stage_price.summary())

# Predicted price
cars["princ_hat"] = first_stage_price.fittedvalues


                            OLS Regression Results                            
Dep. Variable:                  princ   R-squared:                       0.879
Model:                            OLS   Adj. R-squared:                  0.878
Method:                 Least Squares   F-statistic:                     1282.
Date:                Tue, 02 Dec 2025   Prob (F-statistic):               0.00
Time:                        14:06:02   Log-Likelihood:                 5634.7
No. Observations:                9591   AIC:                        -1.116e+04
Df Residuals:                    9536   BIC:                        -1.077e+04
Df Model:                          54                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

In [30]:
# 5.6.3 Second stage: Logit IV (2SLS) with predicted price

X_logit_iv = pd.concat([X_exog, cars[["princ_hat"]]], axis=1)
logit_iv = sm.OLS(y, X_logit_iv).fit()

print(logit_iv.summary())


                            OLS Regression Results                            
Dep. Variable:              log_sj_s0   R-squared:                       0.541
Model:                            OLS   Adj. R-squared:                  0.539
Method:                 Least Squares   F-statistic:                     250.5
Date:                Tue, 02 Dec 2025   Prob (F-statistic):               0.00
Time:                        14:06:38   Log-Likelihood:                -13724.
No. Observations:                9591   AIC:                         2.754e+04
Df Residuals:                    9545   BIC:                         2.787e+04
Df Model:                          45                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

**Comparison: OLS vs IV logit**

- Look at the coefficient on `princ` (price) from:
  - `logit_ols` (simple OLS),
  - `logit_iv` (2SLS using BLP instruments).

What you typically see:

- OLS price coefficient may be **too small in magnitude** (e.g., less negative) because:
  - High-quality products have high price and high shares.
- IV estimation corrects for this:
  - The IV price coefficient is usually **more negative** (more elastic demand),
  - Leading to **lower implied markups**.

We can present both estimates side by side.


In [31]:
# 5.6.4 Simple table comparing OLS and IV logit estimates (price and a few characteristics)

results_to_compare = {
    "Logit OLS": logit_ols,
    "Logit IV": logit_iv,
}

summary = summary_col(
    list(results_to_compare.values()),
    stars=True,
    model_names=list(results_to_compare.keys()),
    info_dict={"N": lambda x: f"{int(x.nobs)}"},
)

print(summary)



                            Logit OLS   Logit IV 
-------------------------------------------------
const                       -10.9896** -15.7426  
                            (5.0367)   (13.1021) 
horsepower                  -0.0130*** -0.0147***
                            (0.0014)   (0.0044)  
fuel                        -0.1182*** -0.1163***
                            (0.0125)   (0.0136)  
weight                      0.0010***  0.0009*** 
                            (0.0001)   (0.0002)  
domestic                    1.6736***  1.6751*** 
                            (0.0293)   (0.0301)  
year                        0.0024     0.0048    
                            (0.0025)   (0.0066)  
princ                       -1.4287***           
                            (0.0742)             
cty_France                  -0.7749*** -0.7798***
                            (0.0321)   (0.0350)  
cty_Germany                 -0.6996*** -0.6890***
                            (0.0324)   (0.0425)  

## 5.7 Nested logit with IV

For nested logit, we have **two** endogenous variables:

1. `princ` (price),
2. `log_sj_g` (log share within segment).

We will:

1. Create **two first stages**:
   - One for `princ`,
   - One for `log_sj_g`.
2. Use instruments:
   - Same BLP instruments for price,
   - Plus `iv3_*` instruments (within-segment sums) for `log_sj_g`.

Then:

$$
\ln\left(\frac{s_j}{s_0}\right) = \beta_0 + x_j \beta - \alpha \cdot \widehat{p}_j
+ \sigma \cdot \widehat{\ln(s_{j|g})} + \text{error}
$$


In [32]:
# 5.7.1 Instruments for nested logit

# Additional IVs for log_sj_g: iv3_*
iv3_vars = [c for c in cars.columns if c.startswith("iv3_")]

Z_nested = pd.concat([X_exog, cars[iv_vars + iv3_vars]], axis=1)

log_sj_g = cars["log_sj_g"]

# First stage for price (again, but now with extended Z_nested if you want)
fs_price_nested = sm.OLS(price, Z_nested).fit()
cars["princ_hat_nested"] = fs_price_nested.fittedvalues

# First stage for log_sj_g
fs_log_sj_g = sm.OLS(log_sj_g, Z_nested).fit()
cars["log_sj_g_hat"] = fs_log_sj_g.fittedvalues

print("First stage for price:")
print(fs_price_nested.summary().tables[1])

print("\nFirst stage for log_sj_g:")
print(fs_log_sj_g.summary().tables[1])


First stage for price:
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                          47.2221      1.953     24.183      0.000      43.394      51.050
horsepower                      0.0104      0.000     63.471      0.000       0.010       0.011
fuel                            0.0029      0.002      1.461      0.144      -0.001       0.007
weight                          0.0004   2.01e-05     21.105      0.000       0.000       0.000
domestic                       -0.0275      0.006     -4.543      0.000      -0.039      -0.016
year                           -0.0239      0.001    -24.213      0.000      -0.026      -0.022
cty_France                      0.0467      0.012      3.844      0.000       0.023       0.071
cty_Germany                    -0.0744      0.012     -6.125      0.000      -0.098      -0.051
cty_Italy        

In [33]:
# 5.7.2 Second stage: Nested logit IV

X_nested_iv = pd.concat([X_exog, cars[["princ_hat_nested", "log_sj_g_hat"]]], axis=1)

nested_iv = sm.OLS(y, X_nested_iv).fit()
print(nested_iv.summary())


                            OLS Regression Results                            
Dep. Variable:              log_sj_s0   R-squared:                       0.551
Model:                            OLS   Adj. R-squared:                  0.549
Method:                 Least Squares   F-statistic:                     254.8
Date:                Tue, 02 Dec 2025   Prob (F-statistic):               0.00
Time:                        14:11:11   Log-Likelihood:                -13621.
No. Observations:                9591   AIC:                         2.734e+04
Df Residuals:                    9544   BIC:                         2.767e+04
Df Model:                          46                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const             

**Interpretation:**

- Coefficient on `princ_hat_nested` is the IV estimate of $-\alpha$ in the nested logit.
- Coefficient on `log_sj_g_hat` is the IV estimate of $\sigma$ (nesting parameter).
- Compare:
  - Nested OLS vs nested IV:
    - Is the nesting parameter $\sigma$ still between 0 and 1?
    - Is the price coefficient more negative under IV?
- Check signs and significance of the characteristics (`horsepower`, `fuel`, `weight`, `home`).

At this point, we have:

- **Logit OLS**,
- **Nested logit OLS**,
- **Logit IV**,
- **Nested logit IV**.

Next, we’ll compute **elasticities** and **marginal costs** using the IV nested logit estimates.

## 5.8 Elasticities and marginal costs

### 5.8.1 Logit elasticities (for comparison)

For the simple logit model with aggregate data, the **own-price elasticity** for product $j$ is:

$$
\epsilon_{jj} = \frac{\partial s_j}{\partial p_j} \cdot \frac{p_j}{s_j}
= -\alpha p_j (1 - s_j)
$$

and the **cross-price elasticity** between $j$ and $k$ is:

$$
\epsilon_{jk} = \alpha p_j s_k
$$

where $ \alpha > 0 $ is the absolute value of the price coefficient ($ -\alpha $ is the coefficient in the regression).

We will compute these using our **logit IV** estimate as a simple benchmark.

> Note: For nested logit, the formulas are more involved. We’ll show them after the logit case.


In [34]:
# 5.8.2 Logit elasticities using IV estimates

# Get price coefficient (should be negative in regression, so we take alpha = -coef)
alpha_logit_iv = -logit_iv.params["princ_hat"]

print("Estimated alpha (logit IV):", alpha_logit_iv)

# Compute logit own-price elasticity and marginal cost for a subset (just for illustration)
cars["eps_jj_logit_iv"] = -alpha_logit_iv * cars[price_var] * (1 - cars["s_j"])

# Implied marginal cost under single-product Bertrand:
# (p - c)/p = -1 / eps_jj  => c = p * (1 + 1/eps_jj)
cars["mc_logit_iv"] = cars[price_var] * (1 + 1 / cars["eps_jj_logit_iv"])

cars[["market_id", "co", price_var, "s_j", "eps_jj_logit_iv", "mc_logit_iv"]].head()

Estimated alpha (logit IV): 1.2792800258593362


Unnamed: 0,market_id,co,princ,s_j,eps_jj_logit_iv,mc_logit_iv
0,Belgium_1983,1,0.7915,0.0002,-1.0124,0.0097
1,Belgium_1984,1,0.762,0.0005,-0.9744,-0.02
2,Belgium_1985,1,0.7363,0.0004,-0.9415,-0.0458
3,Belgium_1986,1,0.6591,0.0005,-0.8427,-0.123
4,Belgium_1987,1,0.6493,0.0005,-0.8302,-0.1328


### 5.8.3 Nested logit elasticities (outline)

From the nested logit formulas, own- and cross-price elasticities depend on:

-  $\alpha$: price coefficient,
- $\sigma$: nesting parameter,
- $p_j$, $s_j$, and $s_{j|g}$ (within-segment share).

For example (using the formulas often shown in class):

- Own-price elasticity:
  $$
  \epsilon_{jj} = -\alpha p_j \left[
    \frac{1}{1 - \sigma} -
    \frac{\sigma}{1 - \sigma} (s_{j|g} - s_j)
  \right]
  $$
- Cross-price elasticity (same segment):
  $$
  \epsilon_{jk} =
  \alpha p_j \left[
    \frac{\sigma}{1 - \sigma} s_{j|g} + s_j
  \right]
  $$
- Cross-price elasticity (different segment) reduces to the logit-like term in many formulations.

We will implement a simple version using our **nested IV** estimates.


In [35]:
# 5.8.4 Nested logit elasticities using IV estimates

alpha_nested_iv = -nested_iv.params["princ_hat_nested"]
sigma_nested_iv = nested_iv.params["log_sj_g_hat"]

print("Estimated alpha (nested IV):", alpha_nested_iv)
print("Estimated sigma (nested IV):", sigma_nested_iv)

# Own-price elasticity (nested logit)
cars["eps_jj_nested_iv"] = -alpha_nested_iv * (
    (1 / (1 - sigma_nested_iv))
    - (sigma_nested_iv / (1 - sigma_nested_iv)) * (cars["s_jg"] - cars["s_j"])
) * cars[price_var]

# Compute implied marginal costs from nested logit
cars["mc_nested_iv"] = cars[price_var] * (1 + 1 / cars["eps_jj_nested_iv"])

cars[["market_id", "co", price_var, "s_j", "s_jg", "eps_jj_nested_iv", "mc_nested_iv"]].head()


Estimated alpha (nested IV): 3.4402805850937805
Estimated sigma (nested IV): 0.3353947306122548


Unnamed: 0,market_id,co,princ,s_j,s_jg,eps_jj_nested_iv,mc_nested_iv
0,Belgium_1983,1,0.7915,0.0002,0.0062,-4.0889,0.5979
1,Belgium_1984,1,0.762,0.0005,0.0155,-3.9247,0.5679
2,Belgium_1985,1,0.7363,0.0004,0.0147,-3.793,0.5422
3,Belgium_1986,1,0.6591,0.0005,0.0159,-3.3942,0.4649
4,Belgium_1987,1,0.6493,0.0005,0.0155,-3.3443,0.4552


**Sanity checks:**

- Own-price elasticities should be **negative**.
- More popular products (higher \( s_j \)) often have **larger absolute elasticities**.
- Implied marginal costs should be:
  - Less than prices (positive markups),
  - Reasonable given typical cost structures for cars.

You can also compute **summary statistics**:

- Mean and distribution of own-price elasticities,
- Mean and distribution of markups \((p - c)/p\),
- Compare logit IV vs nested IV.

In [36]:
# 5.8.5 Summary statistics for elasticities and markups under nested IV

cars["markup_nested_iv"] = (cars[price_var] - cars["mc_nested_iv"]) / cars[price_var]

summary_elast = cars["eps_jj_nested_iv"].describe()
summary_markup = cars["markup_nested_iv"].describe()

print("Nested IV own-price elasticities (summary):")
print(summary_elast)

print("\nNested IV markups (summary):")
print(summary_markup)


Nested IV own-price elasticities (summary):
count   9,591.0000
mean       -4.1051
std         1.9367
min       -17.9281
25%        -5.0253
50%        -3.6549
75%        -2.7031
max        -1.2508
Name: eps_jj_nested_iv, dtype: float64

Nested IV markups (summary):
count   9,591.0000
mean        0.2937
std         0.1233
min         0.0558
25%         0.1990
50%         0.2736
75%         0.3699
max         0.7995
Name: markup_nested_iv, dtype: float64


## 6. Connecting back to BLP and antitrust applications

What we did:

1. Started from the **structural utility model** underlying discrete choice:
   - Utility depends on price and characteristics.
2. Used the **Berry (1994, 1995)** insight:
   - Aggregate logit can be written as a **linear regression** in \( \ln(s_j / s_0) \).
3. Extended to **nested logit** to relax the IIA assumption and capture within-segment substitution.
4. Addressed **price endogeneity** (and nest-share endogeneity) using **BLP-style instruments**.
5. Computed **elasticities** and **marginal costs** under **Bertrand pricing**.

Conceptually, **BLP** adds one more layer:

- Instead of assuming all consumers share the same β and α (representative consumer),
- BLP allows these parameters to vary across individuals (random coefficients),
- This leads to **richer substitution patterns** and more realistic simulations.

In practice:

- Competition authorities (like PCC, EC, US-DOJ, CMA) use variants of these models to:
  - Evaluate **mergers**,
  - Analyze **remedies**,
  - Assess **dominance** and **abuse of dominance**,
  - Design **sectoral regulations**.

For teaching undergrads, this notebook gives:

- A **concrete bridge** from econometrics to modern IO demand estimation,
- With **Python code** that mirrors the Stata-based exercises often used in advanced IO courses.