## Exercise 1

In [30]:
import pyblp
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

pyblp.options.digits = 3
pyblp.options.verbose = False
pd.options.display.precision = 3
pd.options.display.max_columns = 50

import IPython.display
IPython.display.display(IPython.display.HTML('<style>pre { white-space: pre !important; }</style>'))

### 1. Describe the data

Let's load the data and look at a random sample. It's good practice to set your seed whenever you do something with a random number generator.

In [31]:
product_data = pd.read_csv('https://github.com/Mixtape-Sessions/Demand-Estimation/raw/main/Exercises/Data/products.csv')
product_data.sample(n=5, random_state=0)

Unnamed: 0,market,product,mushy,servings_sold,city_population,price_per_serving,price_instrument
1695,C47Q1,F2B28,1,249000.0,183521,0.205,0.177
1294,C36Q2,F4B12,0,140100.0,176664,0.082,0.04
672,C23Q1,F1B04,1,274900.0,2783726,0.094,0.072
1190,C34Q2,F2B26,0,782400.0,369365,0.108,0.076
98,C04Q1,F1B07,1,1151000.0,1585577,0.117,0.095


### 2. Compute market shares

Let's compute the market size and market shares.

In [32]:
product_data['market_size'] = product_data['city_population'] * 90
product_data['market_share'] = product_data['servings_sold'] / product_data['market_size']
product_data['outside_share'] = 1 - product_data.groupby('market')['market_share'].transform('sum')
product_data[['market_share', 'outside_share']].describe()

Unnamed: 0,market_share,outside_share
count,2256.0,2256.0
mean,0.01983,0.524
std,0.0256,0.11
min,0.0001818,0.305
25%,0.005183,0.439
50%,0.01114,0.536
75%,0.02465,0.604
max,0.4469,0.815
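
To make the share arithmetic concrete, here is a minimal sketch on a made-up market. The population and servings figures are hypothetical, not rows from `products.csv`. The factor of 90 is the one used above; presumably it stands for one potential serving per person per day over a roughly 90-day quarter, but that reading is an assumption.

```python
# Hypothetical toy market: one city-quarter with two products.
city_population = 1_000
servings_sold = {'A': 9_000.0, 'B': 4_500.0}

# Market size: potential servings, using the same factor of 90 as above.
market_size = city_population * 90

# Inside shares are servings sold divided by market size.
shares = {product: sold / market_size for product, sold in servings_sold.items()}

# The outside share is whatever is left after summing the inside shares.
outside_share = 1 - sum(shares.values())

print(shares, outside_share)
```

A larger assumed market size mechanically raises the outside share, so it is worth checking how sensitive later results are to this choice.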


### 3. Estimate the pure logit model with OLS

Let's use the R-style formula interface to statsmodels to estimate the pure logit model with an OLS regression. We'll use HC0 standard errors to align with PyBLP's default, which is to adjust for heteroskedasticity.

In [33]:
product_data['logit_delta'] = np.log(product_data['market_share'] / product_data['outside_share'])
statsmodels_ols = smf.ols('logit_delta ~ 1 + mushy + price_per_serving', product_data)
statsmodels_results = statsmodels_ols.fit(cov_type='HC0')
statsmodels_results.summary2().tables[1]

Unnamed: 0,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-2.935,0.108,-27.201,6.345e-163,-3.146,-2.723
mushy,0.075,0.054,1.382,0.1669,-0.031,0.181
price_per_serving,-7.48,0.84,-8.91,5.11e-19,-9.126,-5.835


The coefficient on price is negative, which means estimated demand slopes downward. We'll compute elasticities later, which are more interpretable than the magnitude of the price coefficient here. To interpret the coefficient on mushy, we can divide it by the absolute value of the coefficient on price: 0.075 / 7.480 ≈ $0.01 per serving is consumers' willingness to pay for a cereal being "mushy." Given its standard error, the mushy coefficient is not statistically distinguishable from zero. There are likely a number of other characteristics that mushy is correlated with, which we're not including in this regression.
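
As a rough illustration of the two calculations in this step, the sketch below inverts logit shares into mean utilities and repeats the willingness-to-pay division. The shares are hypothetical; the two coefficients are the rounded OLS estimates from the table above.

```python
import math

# Hypothetical inside shares for a two-product market.
shares = [0.10, 0.05]
outside_share = 1 - sum(shares)

# Logit inversion: delta_j = log(s_j) - log(s_0).
deltas = [math.log(s / outside_share) for s in shares]

# Sanity check: plugging delta back into the logit formula recovers the shares.
denominator = 1 + sum(math.exp(d) for d in deltas)
recovered = [math.exp(d) / denominator for d in deltas]

# Willingness to pay for "mushy": its coefficient over the absolute price coefficient.
beta_mushy, alpha = 0.075, -7.480
wtp_mushy = beta_mushy / abs(alpha)

print(recovered, round(wtp_mushy, 3))
```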

### 4. Run the same regression with PyBLP

Let's prep our data for use by PyBLP. We need to rename some columns and then use a similar R-style formula to set up our problem.

In [34]:
product_data = product_data.rename(columns={
    'market': 'market_ids',
    'product': 'product_ids',
    'market_share': 'shares',
    'price_per_serving': 'prices',
})
product_data['demand_instruments0'] = product_data['prices']
ols_problem = pyblp.Problem(pyblp.Formulation('1 + mushy + prices'), product_data)
ols_problem

Dimensions:
 T    N     K1    MD 
---  ----  ----  ----
94   2256   3     3  

Formulations:
     Column Indices:         0     1      2   
--------------------------  ---  -----  ------
X1: Linear Characteristics   1   mushy  prices

Let's double-check that PyBLP's instruments matrix is as we expect: a constant, mushy, and prices. The columns are ordered differently (prices come first), but the contents are the same.

In [35]:
pd.DataFrame(ols_problem.products.ZD).sample(n=5, random_state=0)

Unnamed: 0,0,1,2
1695,0.205,1.0,1.0
1294,0.082,1.0,0.0
672,0.094,1.0,1.0
1190,0.108,1.0,0.0
98,0.117,1.0,1.0


Now let's run the same OLS regression.

In [36]:
ols_results = ols_problem.solve(method='1s')
ols_results

Problem Results Summary:
GMM   Objective  Clipped  Weighting Matrix  Covariance Matrix
Step    Value    Shares   Condition Number  Condition Number 
----  ---------  -------  ----------------  -----------------
 1    +5.91E-25     0        +1.40E+03          +1.30E+03    

Cumulative Statistics:
Computation   Objective 
   Time      Evaluations
-----------  -----------
 00:00:00         1     

Beta Estimates (Robust SEs in Parentheses):
     1          mushy       prices   
-----------  -----------  -----------
 -2.93E+00    +7.48E-02    -7.48E+00 
(+1.08E-01)  (+5.41E-02)  (+8.40E-01)

We can create a quick dataframe to nicely format the estimates in this notebook.

In [37]:
pd.DataFrame(index=ols_results.beta_labels, data={
    ("Estimates", "Statsmodels"): statsmodels_results.params.values,
    ("Estimates", "PyBLP"): ols_results.beta.flat,
    ("SEs", "Statsmodels"): statsmodels_results.bse.values,
    ("SEs", "PyBLP"): ols_results.beta_se.flat,
})

Unnamed: 0_level_0,Estimates,Estimates,SEs,SEs
Unnamed: 0_level_1,Statsmodels,PyBLP,Statsmodels,PyBLP
1,-2.935,-2.935,0.108,0.108
mushy,0.075,0.075,0.054,0.054
prices,-7.48,-7.48,0.84,0.84


We get the same estimates and the same standard errors.

### 5. Add market and product fixed effects

It's easiest to add fixed effects by absorbing them. This is done under the hood with iterative de-meaning. We'll drop the constant and the mushy dummy because these are collinear with the fixed effects.

In [38]:
fe_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)
fe_problem

Dimensions:
 T    N     K1    MD    ED 
---  ----  ----  ----  ----
94   2256   1     1     2  

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

In [39]:
fe_results = fe_problem.solve(method='1s')
fe_results

Problem Results Summary:
GMM   Objective  Clipped  Weighting Matrix
Step    Value    Shares   Condition Number
----  ---------  -------  ----------------
 1    +9.21E-29     0        +1.00E+00    

Cumulative Statistics:
Computation   Objective 
   Time      Evaluations
-----------  -----------
 00:00:00         1     

Beta Estimates (Robust SEs in Parentheses):
  prices   
-----------
 -2.86E+01 
(+8.92E-01)

We get a more negative coefficient on price, suggesting that the OLS coefficient was biased upwards. This suggests that price was positively correlated with product/market-specific components of unobserved quality.
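
The iterative de-meaning mentioned above can be sketched on a tiny, unbalanced panel. Everything here is hypothetical (the rows, the fixed effects, the slope), and this is only an illustration of the idea, not PyBLP's actual absorption routine. Because the toy outcome has no noise, the within estimator should recover the slope essentially exactly.

```python
# Hypothetical unbalanced panel: (market, product, x); values are made up.
rows = [
    ('m1', 'a', 1.0), ('m1', 'b', 2.5), ('m1', 'c', 0.5),
    ('m2', 'a', 3.0), ('m2', 'b', 1.5),
    ('m3', 'b', 4.0), ('m3', 'c', 2.0),
]
market_fe = {'m1': 0.3, 'm2': -1.0, 'm3': 2.0}
product_fe = {'a': 0.0, 'b': 1.2, 'c': -0.7}
true_slope = -3.0

x = [r[2] for r in rows]
y = [market_fe[m] + product_fe[p] + true_slope * v for m, p, v in rows]


def demean(values, keys):
    """Subtract the group mean (groups given by keys) from each value."""
    totals, counts = {}, {}
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return [v - totals[k] / counts[k] for k, v in zip(keys, values)]


def absorb(values, tol=1e-13, max_iter=10_000):
    """Alternate market and product de-meaning until values stop changing."""
    markets = [r[0] for r in rows]
    products = [r[1] for r in rows]
    for _ in range(max_iter):
        updated = demean(demean(values, markets), products)
        if max(abs(u - v) for u, v in zip(updated, values)) < tol:
            return updated
        values = updated
    return values


# Regress de-meaned y on de-meaned x (one regressor, no constant).
x_tilde, y_tilde = absorb(x), absorb(y)
slope = sum(a * b for a, b in zip(x_tilde, y_tilde)) / sum(a * a for a in x_tilde)
print(slope)
```

In a balanced panel a single pass of de-meaning would suffice; the loop matters only when some market-product combinations are missing, as in this toy example.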

### 6. Add an instrument for price

First, let's run a first-stage regression of price on the price instrument in the data to make sure it's relevant.

In [40]:
first_stage = smf.ols('prices ~ 0 + price_instrument + C(market_ids) + C(product_ids)', product_data)
first_stage_results = first_stage.fit(cov_type='HC0')
first_stage_results.summary2().tables[1].sort_index(ascending=False)

Unnamed: 0,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
price_instrument,0.877,6.583e-03,133.232,0.000e+00,0.864,0.890
C(product_ids)[T.F6B18],0.007,8.780e-04,8.339,7.466e-17,0.006,0.009
C(product_ids)[T.F4B12],0.006,8.285e-04,6.971,3.155e-12,0.004,0.007
C(product_ids)[T.F4B10],0.003,7.755e-04,4.412,1.024e-05,0.002,0.005
C(product_ids)[T.F4B02],0.013,9.324e-04,14.266,3.557e-46,0.011,0.015
...,...,...,...,...,...,...
C(market_ids)[C04Q1],0.039,1.082e-03,36.416,2.354e-290,0.037,0.042
C(market_ids)[C03Q2],0.038,1.222e-03,30.743,1.530e-207,0.035,0.040
C(market_ids)[C03Q1],0.041,1.108e-03,36.728,2.650e-295,0.039,0.043
C(market_ids)[C01Q2],0.039,1.401e-03,27.778,7.919e-170,0.036,0.042


It seems relevant, being strongly positively correlated with price even after adjusting for market and product fixed effects. Now we'll use it to instrument for price.

In [41]:
product_data = product_data.drop(columns='demand_instruments0').rename(columns={'price_instrument': 'demand_instruments0'})
iv_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)
iv_problem

Dimensions:
 T    N     K1    MD    ED 
---  ----  ----  ----  ----
94   2256   1     1     2  

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

In [42]:
iv_results = iv_problem.solve(method='1s')
iv_results

Problem Results Summary:
GMM   Objective  Clipped  Weighting Matrix
Step    Value    Shares   Condition Number
----  ---------  -------  ----------------
 1    +1.02E-29     0        +1.00E+00    

Cumulative Statistics:
Computation   Objective 
   Time      Evaluations
-----------  -----------
 00:00:00         1     

Beta Estimates (Robust SEs in Parentheses):
  prices   
-----------
 -3.06E+01 
(+9.68E-01)

In [43]:
pd.DataFrame(index=fe_results.beta_labels, data={
    ("Estimates", "OLS"): ols_results.beta[-1:].flat,
    ("Estimates", "+FE"): fe_results.beta.flat,
    ("Estimates", "+IV"): iv_results.beta.flat,
    ("SEs", "OLS"): ols_results.beta_se[-1:].flat,
    ("SEs", "+FE"): fe_results.beta_se.flat,
    ("SEs", "+IV"): iv_results.beta_se.flat,
})

Unnamed: 0_level_0,Estimates,Estimates,Estimates,SEs,SEs,SEs
Unnamed: 0_level_1,OLS,+FE,+IV,OLS,+FE,+IV
prices,-7.48,-28.618,-30.6,0.84,0.892,0.968


Our estimate gets even more negative with an IV, suggesting that the within product *and* market component of unobserved quality was still positively correlated with price.
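
To see why instrumenting can push the estimate further from zero, here is a small simulation under an assumed data-generating process (all names and numbers are hypothetical). Unobserved quality raises both price and demand, so OLS is biased toward zero, while a cost-side shifter that is unrelated to quality recovers the true coefficient.

```python
import random

random.seed(0)
n = 5_000
true_alpha = -2.0

# Hypothetical data-generating process: z shifts price but not demand,
# while unobserved quality xi raises both price and demand.
z = [random.gauss(0, 1) for _ in range(n)]
xi = [random.gauss(0, 1) for _ in range(n)]
price = [zi + 0.5 * q for zi, q in zip(z, xi)]
delta = [true_alpha * p + q for p, q in zip(price, xi)]


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


ols_alpha = cov(price, delta) / cov(price, price)  # biased toward zero
iv_alpha = cov(z, delta) / cov(z, price)           # consistent for true_alpha
print(round(ols_alpha, 2), round(iv_alpha, 2))
```

With a single instrument and a single endogenous regressor, the IV estimator reduces to this ratio of covariances, which is why no matrix algebra is needed here.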

### 7. Cut a price in half and see what happens

Let's select the market in which we'll run the counterfactual and see what choices are available to consumers.

In [44]:
counterfactual_market = 'C01Q2'
counterfactual_data = product_data.loc[product_data['market_ids'] == counterfactual_market, ['product_ids', 'mushy', 'prices', 'shares']]
counterfactual_data

Unnamed: 0,product_ids,mushy,prices,shares
24,F1B04,1,0.078,0.006443
25,F1B06,1,0.141,0.1413
26,F1B07,1,0.073,0.08789
27,F1B09,0,0.077,0.006621
28,F1B11,0,0.167,0.05427
29,F1B13,0,0.092,0.02198
30,F1B17,1,0.154,0.01055
31,F1B30,0,0.15,0.00131
32,F1B45,0,0.147,0.01052
33,F2B05,0,0.099,0.05907


Let's cut the price of the first product in half and use our estimated model to predict how market shares of all products in the market will change.

In [45]:
counterfactual_data['new_prices'] = counterfactual_data['prices']
counterfactual_data.loc[counterfactual_data['product_ids'] == 'F1B04', 'new_prices'] /= 2
counterfactual_data['new_shares'] = iv_results.compute_shares(market_id=counterfactual_market, prices=counterfactual_data['new_prices'])
counterfactual_data['iv_change'] = 100 * (counterfactual_data['new_shares'] - counterfactual_data['shares']) / counterfactual_data['shares']
counterfactual_data

Unnamed: 0,product_ids,mushy,prices,shares,new_prices,new_shares,iv_change
24,F1B04,1,0.078,0.006443,0.039,0.02085,223.638
25,F1B06,1,0.141,0.1413,0.141,0.1392,-1.45
26,F1B07,1,0.073,0.08789,0.073,0.08662,-1.45
27,F1B09,0,0.077,0.006621,0.077,0.006525,-1.45
28,F1B11,0,0.167,0.05427,0.167,0.05349,-1.45
29,F1B13,0,0.092,0.02198,0.092,0.02166,-1.45
30,F1B17,1,0.154,0.01055,0.154,0.01039,-1.45
31,F1B30,0,0.15,0.00131,0.15,0.001291,-1.45
32,F1B45,0,0.147,0.01052,0.147,0.01037,-1.45
33,F2B05,0,0.099,0.05907,0.099,0.05821,-1.45


The market share of the product whose price we halved increased by more than 200%, suggesting that consumers are fairly responsive to price changes. The market shares of the other products all decreased, which makes sense (the substitution has to come from somewhere), but they all decreased by the same percentage, which seems unrealistic. We would expect more substitution from more similar products. The cannibalization estimates don't seem reasonable either: we'd expect more cannibalization from firm one's other products that are more similar to the product whose price is being cut.
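
The equal percentage drop follows directly from the logit share formula: a price change for one product only rescales the common denominator. The sketch below redoes the counterfactual by hand using the rounded price, share, and coefficient shown above, so it should match the PyBLP output only approximately.

```python
import math

alpha = -30.6            # IV price coefficient from above (rounded)
old_price = 0.078        # F1B04's price (rounded)
old_share = 0.006443     # F1B04's share in market C01Q2

# Change in F1B04's mean utility from halving its price.
utility_gain = alpha * (old_price / 2 - old_price)

# Every share has the same logit denominator, which scales by this factor.
denominator_scale = 1 + old_share * (math.exp(utility_gain) - 1)

# Rivals' utilities are unchanged, so each rival share shrinks by the same factor.
rival_change = 100 * (1 / denominator_scale - 1)
own_change = 100 * (math.exp(utility_gain) / denominator_scale - 1)
print(round(rival_change, 2), round(own_change, 1))
```

Note that `rival_change` depends only on the product whose price changed, never on the rival itself, which is exactly the pattern in the `iv_change` column.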

### 8. Compute demand elasticities

To get a sense for what's going on, we can compute demand elasticities.

In [46]:
iv_elasticities = iv_results.compute_elasticities(market_id=counterfactual_market)
pd.DataFrame(iv_elasticities)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0,-2.363,0.61,0.195,0.016,0.277,0.062,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
1,0.015,-3.706,0.195,0.016,0.277,0.062,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
2,0.015,0.61,-2.029,0.016,0.277,0.062,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
3,0.015,0.61,0.195,-2.34,0.277,0.062,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
4,0.015,0.61,0.195,0.016,-4.833,0.062,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
5,0.015,0.61,0.195,0.016,0.277,-2.763,0.05,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
6,0.015,0.61,0.195,0.016,0.277,0.062,-4.661,0.006,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
7,0.015,0.61,0.195,0.016,0.277,0.062,0.05,-4.596,0.047,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
8,0.015,0.61,0.195,0.016,0.277,0.062,0.05,0.006,-4.459,0.18,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014
9,0.015,0.61,0.195,0.016,0.277,0.062,0.05,0.006,0.047,-2.863,0.048,0.017,0.076,0.018,0.026,0.074,0.006,0.003,0.03,0.025,0.019,0.023,0.026,0.014


The diagonal elements are own-price elasticities, which are useful statistics to report (perhaps as a quantity-weighted average or median) instead of the raw price coefficient, which is a bit hard to interpret on its own. They suggest that consumers are pretty elastic. The off-diagonal elements are cross-price elasticities, which are all fairly small. The unrealistic substitution patterns we saw in our counterfactual also show up here: all cross-price elasticities in each column are the same, even though we'd expect some differences for more similar products.
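
For the pure logit model these elasticities have closed forms: the own-price elasticity is $\alpha p_j (1 - s_j)$ and the elasticity of product $j$'s share with respect to product $k$'s price is $-\alpha p_k s_k$. A minimal sketch for the first three products, using the rounded values displayed above (so results agree with the matrix only up to rounding):

```python
alpha = -30.6
# First three products in market C01Q2 (rounded values from the table above).
prices = [0.078, 0.141, 0.073]
shares = [0.006443, 0.1413, 0.08789]


def logit_elasticity(j, k):
    """Elasticity of product j's share with respect to product k's price."""
    if j == k:
        return alpha * prices[j] * (1 - shares[j])
    return -alpha * prices[k] * shares[k]


elasticities = [[logit_elasticity(j, k) for k in range(3)] for j in range(3)]
for row in elasticities:
    print([round(e, 3) for e in row])
```

The cross-price formula contains nothing specific to product $j$, which is why every off-diagonal entry in a column is identical.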

## Supplemental Questions

### 1. Try different standard errors

It is likely that there are many market-varying unobserved product characteristics, such as advertising, so it may be important to cluster by product. First we'll define clusters.

In [47]:
product_data['clustering_ids'] = product_data['product_ids']
cluster_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)

Then we'll solve the problem, telling PyBLP to use the clusters when computing standard errors.

In [48]:
cluster_results = cluster_problem.solve(method='1s', se_type='clustered')
pd.DataFrame(index=fe_results.beta_labels, data={
    ("Estimates", "Unclustered"): iv_results.beta.flat,
    ("SEs", "Unclustered"): iv_results.beta_se.flat,
    ("Estimates", "Clustered"): cluster_results.beta.flat,
    ("SEs", "Clustered"): cluster_results.beta_se.flat,
})

Unnamed: 0_level_0,Estimates,SEs,Estimates,SEs
Unnamed: 0_level_1,Unclustered,Unclustered,Clustered,Clustered
prices,-30.6,0.968,-30.6,1.182


Standard errors are somewhat larger, as expected. Ignoring correlation in the errors typically biases standard errors downward.
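
A toy calculation shows where the difference comes from. The data below are hypothetical, and the formulas are the textbook sandwich estimators without any finite-sample adjustment that a library might apply. When residuals are positively correlated within a cluster, summing scores before squaring gives a larger variance than treating observations as independent.

```python
# Hypothetical toy regression of y on a constant (no other regressors).
y = [1.0, 1.0, -1.0, -1.0]
x = [1.0, 1.0, 1.0, 1.0]
clusters = ['a', 'a', 'b', 'b']  # errors are perfectly correlated within cluster

beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
scores = [xi * (yi - beta * xi) for xi, yi in zip(x, y)]  # x_i * residual_i
bread = sum(xi * xi for xi in x)

# HC0: treat each observation's score as independent.
var_hc0 = sum(s * s for s in scores) / bread ** 2

# Clustered: sum scores within each cluster before squaring.
cluster_sums = {}
for c, s in zip(clusters, scores):
    cluster_sums[c] = cluster_sums.get(c, 0.0) + s
var_clustered = sum(s * s for s in cluster_sums.values()) / bread ** 2

print(var_hc0 ** 0.5, var_clustered ** 0.5)
```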

### 2. Compute confidence intervals for your counterfactual

For speed in this exercise, let's just use 100 draws. In practice, you may want to use more.

In [49]:
bootstrap_results = cluster_results.bootstrap(draws=100, seed=0)
bootstrap_results

Bootstrapped Results Summary:
Computation  Bootstrap
   Time        Draws  
-----------  ---------
 00:00:01       100   

Let's get the bootstrapped shares for the counterfactual market. Their first axis indexes draws.

In [50]:
bootstrap_shares = bootstrap_results.bootstrapped_shares[:, product_data['market_ids'] == counterfactual_market]
bootstrap_shares.shape

(100, 24, 1)

Let's also replicate the counterfactual prices, once for each draw, and bootstrap the counterfactual.

In [51]:
bootstrap_new_prices = np.tile(counterfactual_data['new_prices'].values, (100, 1))
bootstrap_new_shares = bootstrap_results.compute_shares(market_id=counterfactual_market, prices=bootstrap_new_prices)
bootstrap_changes = 100 * (bootstrap_new_shares - bootstrap_shares) / bootstrap_shares

Now let's compute 95% confidence intervals for each change.

In [52]:
counterfactual_data['iv_change_lb'] = np.squeeze(np.percentile(bootstrap_changes, 2.5, axis=0))
counterfactual_data['iv_change_ub'] = np.squeeze(np.percentile(bootstrap_changes, 97.5, axis=0))
counterfactual_data

Unnamed: 0,product_ids,mushy,prices,shares,new_prices,new_shares,iv_change,iv_change_lb,iv_change_ub
24,F1B04,1,0.078,0.006443,0.039,0.02085,223.638,196.667,249.841
25,F1B06,1,0.141,0.1413,0.141,0.1392,-1.45,-1.552,-1.31
26,F1B07,1,0.073,0.08789,0.073,0.08662,-1.45,-1.552,-1.31
27,F1B09,0,0.077,0.006621,0.077,0.006525,-1.45,-1.552,-1.31
28,F1B11,0,0.167,0.05427,0.167,0.05349,-1.45,-1.552,-1.31
29,F1B13,0,0.092,0.02198,0.092,0.02166,-1.45,-1.552,-1.31
30,F1B17,1,0.154,0.01055,0.154,0.01039,-1.45,-1.552,-1.31
31,F1B30,0,0.15,0.00131,0.15,0.001291,-1.45,-1.552,-1.31
32,F1B45,0,0.147,0.01052,0.147,0.01037,-1.45,-1.552,-1.31
33,F2B05,0,0.099,0.05907,0.099,0.05821,-1.45,-1.552,-1.31


The confidence intervals are fairly tight. This is because the price coefficient estimate has a fairly low standard error, even after clustering our standard errors.
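
To build intuition for how the tight standard error translates into tight intervals, here is a simplified stand-in for the bootstrap, not a reproduction of it. It assumes the price coefficient is approximately normal around the clustered estimate, holds the observed share fixed, and uses rounded inputs, so the resulting interval will differ somewhat from the one above.

```python
import math
import random
import statistics

random.seed(0)
alpha_hat, alpha_se = -30.6, 1.182      # clustered IV estimate and SE from above
old_price, old_share = 0.078, 0.006443  # F1B04 in market C01Q2 (rounded)


def rival_change(alpha):
    """Percent change in every rival's share when F1B04's price is halved."""
    utility_gain = alpha * (old_price / 2 - old_price)
    return 100 * (1 / (1 + old_share * (math.exp(utility_gain) - 1)) - 1)


# Draw price coefficients from an approximating normal distribution.
draws = [rival_change(random.gauss(alpha_hat, alpha_se)) for _ in range(1_000)]

# statistics.quantiles with n=40 returns cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(draws, n=40)
lower, upper = cuts[0], cuts[-1]
print(round(lower, 2), round(rival_change(alpha_hat), 2), round(upper, 2))
```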

### 3. Impute marginal costs from pricing optimality

Let's add firm identifiers to the data and re-solve the problem.

In [53]:
product_data['firm_ids'] = product_data['product_ids'].str[:2]
firm_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)
firm_results = firm_problem.solve(method='1s', se_type='clustered')

Now let's impute marginal costs from pricing optimality and compare them with prices. We can also compute markups and profits.

In [54]:
product_data['costs'] = firm_results.compute_costs()
product_data['profit_per_serving'] = product_data['prices'] - product_data['costs']
product_data['markups'] = product_data['profit_per_serving'] / product_data['costs']
product_data[['prices', 'costs', 'profit_per_serving', 'markups']].describe()

Unnamed: 0,prices,costs,profit_per_serving,markups
count,2256.0,2256.0,2256.0,2256.0
mean,0.126,0.08703,0.039,0.681
std,0.029,0.02998,0.005,6.658
min,0.045,0.0001536,0.033,0.2
25%,0.105,0.06673,0.035,0.35
50%,0.124,0.08513,0.038,0.449
75%,0.143,0.1052,0.041,0.592
max,0.226,0.1881,0.073,315.98


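These marginal costs look somewhat reasonable. They are all positive, and most markups seem reasonable: the interquartile range runs from about 35% to 59%. The maximum markup of roughly 316 is an outlier that arises because the imputed cost for that product is very close to zero. Based on these estimates, firms seem to be enjoying a few cents of profit per serving sold.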

Of course, in practice pricing may be more complicated than the simple model here. For example, prices may be bargained over with retailers instead of set by the producer. And pricing aside, given that our demand model is particularly simple right now, we may not want to put much trust in our marginal cost estimates because they depend crucially on our estimates of demand elasticities, which are currently not very realistic.
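
The cost imputation can be sketched by hand for the pure logit case, assuming Bertrand-Nash pricing by multi-product firms. Under that assumption every product of a firm carries the same absolute margin, $1 / (|\alpha| (1 - s_f))$, where $s_f$ is the firm's total share. The market below is hypothetical; only the price coefficient is taken from the estimates above.

```python
alpha = -30.6
# Hypothetical market: three products, the first two owned by the same firm.
prices = [0.12, 0.15, 0.10]
shares = [0.10, 0.05, 0.20]
firms = ['F1', 'F1', 'F2']

# Under plain logit, every product of a firm earns the same absolute margin,
# 1 / (|alpha| * (1 - total share of the firm)).
firm_share = {f: sum(s for s, g in zip(shares, firms) if g == f) for f in set(firms)}
margins = [1 / (abs(alpha) * (1 - firm_share[f])) for f in firms]
costs = [p - m for p, m in zip(prices, margins)]


def share_derivative(k, j):
    """Logit derivative of product k's share with respect to product j's price."""
    if k == j:
        return alpha * shares[k] * (1 - shares[k])
    return -alpha * shares[k] * shares[j]


# First-order condition for product j: s_j plus the sum over the firm's
# products k of (p_k - c_k) * ds_k/dp_j should equal zero.
foc = [
    shares[j] + sum(margins[k] * share_derivative(k, j)
                    for k in range(3) if firms[k] == firms[j])
    for j in range(3)
]
print([round(c, 4) for c in costs], foc)
```

Because the margin depends only on $\alpha$ and firm shares, any error in the estimated price coefficient passes straight through to the imputed costs.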

### 4. Check your code by simulating data

Let's simulate some new prices and shares under a price coefficient that is somewhat smaller in magnitude (implying less elastic demand), say $\alpha = -20$. To calibrate the simulation to our setting, we'll use the same fixed effects and unobserved quality estimated by PyBLP.

In [55]:
simulation = pyblp.Simulation(
    product_formulations=pyblp.Formulation('0 + prices'),
    product_data=product_data,
    beta=-20,
    xi=iv_results.xi_fe + iv_results.xi,
)
simulation

Dimensions:
 T    N     F    K1 
---  ----  ---  ----
94   2256   5    1  

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

Beta True Values:
 prices  
---------
-2.00E+01

Next, we'll solve for equilibrium prices and shares, using the marginal costs computed above.

In [56]:
simulation_results = simulation.replace_endogenous(costs=product_data['costs'])
simulation_results

Simulation Results Summary:
Computation  Fixed Point  Fixed Point  Contraction  Profit Gradients  Profit Hessians  Profit Hessians
   Time       Failures    Iterations   Evaluations      Max Norm      Min Eigenvalue   Max Eigenvalue 
-----------  -----------  -----------  -----------  ----------------  ---------------  ---------------
 00:00:00         0          1267         1267         +1.40E-13         -9.96E+00        -5.28E-03   

Solving for equilibrium prices seems to have been successful. First-order conditions (profit gradient norms) are all close to zero and second-order conditions (profit Hessian eigenvalues) are all negative, indicating that prices are indeed profit-maximizing. In practice we may want to try multiple different starting values for prices since this can technically be a nonconvex optimization problem. We'll learn more about best practices for nonlinear optimization on the second day.
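
The idea of solving for equilibrium prices and then checking profit gradients can be illustrated with a naive fixed-point iteration for two single-product firms under pure logit. This is a sketch with hypothetical costs and qualities, and it is not the routine that `replace_endogenous` uses; the naive iteration happens to converge here because shares are small.

```python
import math

alpha = -20.0
costs = [0.08, 0.10]   # hypothetical marginal costs
xi = [0.5, 1.0]        # hypothetical mean utilities net of price


def logit_shares(prices):
    """Logit shares with an outside good whose utility is normalized to zero."""
    exp_utils = [math.exp(x + alpha * p) for x, p in zip(xi, prices)]
    denominator = 1 + sum(exp_utils)
    return [e / denominator for e in exp_utils]


# Naive iteration on the single-product first-order condition
# p_j = c_j + 1 / (|alpha| * (1 - s_j(p))), starting from marginal cost.
prices = list(costs)
for _ in range(200):
    shares = logit_shares(prices)
    prices = [c + 1 / (abs(alpha) * (1 - s)) for c, s in zip(costs, shares)]

# Profit gradient for each firm at the final prices: s_j + (p_j - c_j) * ds_j/dp_j.
shares = logit_shares(prices)
gradients = [s + (p - c) * alpha * s * (1 - s) for s, p, c in zip(shares, prices, costs)]
print([round(p, 4) for p in prices], gradients)
```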

Now let's see if we can recover the above true price coefficient. We'll use our costs as our cost-shifter.

In [57]:
simulation_data = product_data.copy()
simulation_data['shares'] = simulation_results.product_data.shares
simulation_data['prices'] = simulation_results.product_data.prices
simulation_data['demand_instruments0'] = simulation_data['costs']
simulation_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), simulation_data)
simulation_problem_results = simulation_problem.solve(method='1s', se_type='clustered')
simulation_problem_results

Problem Results Summary:
GMM   Objective  Clipped  Weighting Matrix
Step    Value    Shares   Condition Number
----  ---------  -------  ----------------
 1    +4.26E-29     0        +1.00E+00    

Cumulative Statistics:
Computation   Objective 
   Time      Evaluations
-----------  -----------
 00:00:00         1     

Beta Estimates (Robust SEs Adjusted for 24 Clusters in Parentheses):
  prices   
-----------
 -1.94E+01 
(+1.05E+00)

Our estimate does not seem to be significantly different from the true $\alpha = -20$, as we'd hope.
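
A quick back-of-the-envelope check of that claim, using the rounded estimate and standard error printed above. The normal critical value is only an approximation here, particularly with just 24 clusters.

```python
estimate, standard_error = -19.4, 1.05  # rounded values from the output above
true_alpha = -20.0

# Distance from the truth in standard-error units.
z_stat = (estimate - true_alpha) / standard_error

# Compare against the usual two-sided 5% normal critical value.
reject = abs(z_stat) > 1.96
print(round(z_stat, 2), reject)
```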