# The Automobile Problem

In [16]:
import pyblp
import numpy as np

pyblp.options.digits = 3
pyblp.options.verbose = False
np.set_printoptions(precision=2, threshold=10, linewidth=100)

In this tutorial, we'll use data from :ref:`references:Berry, Levinsohn, and Pakes (1995)` to solve the paper's automobile problem.


## Loading the Automobile Data

We'll use NumPy to read the data.

In [3]:
product_data = np.recfromcsv(pyblp.data.BLP_PRODUCTS_LOCATION, encoding='utf-8')
product_data.dtype.names

('market_ids',
 'clustering_ids',
 'car_ids',
 'firm_ids0',
 'firm_ids1',
 'domestic',
 'japan',
 'european',
 'shares',
 'prices',
 'hpwt',
 'air',
 'mpd',
 'mpg',
 'space',
 'trend',
 'demand_instruments0',
 'demand_instruments1',
 'demand_instruments2',
 'demand_instruments3',
 'demand_instruments4',
 'demand_instruments5',
 'supply_instruments0',
 'supply_instruments1',
 'supply_instruments2',
 'supply_instruments3',
 'supply_instruments4',
 'supply_instruments5')

The product data contains market IDs, product IDs, two sets of firm IDs (the second set holds the IDs after a simple merger and is used later), shares, prices, a number of product characteristics, and some pre-computed excluded instruments. The product IDs are stored as `clustering_ids` because they will be used to compute clustered standard errors. For more information about the instruments and the example data as a whole, refer to the :mod:`data` module.
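`np.recfromcsv` is a convenience wrapper around `np.genfromtxt`, and recent NumPy releases may no longer expose it at the top level. As a hedged alternative, the sketch below reads a tiny in-memory CSV with `np.genfromtxt` and gets the same kind of structured array with named fields. The rows are made up for illustration and are not values from the automobile data.

```python
import io
import numpy as np

# Made-up rows for illustration only; these are not values from the automobile data.
csv_text = """market_ids,shares,prices
1971,0.01,5.0
1971,0.02,7.5
1972,0.03,6.1
"""

toy_data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    names=True,        # take field names from the header row
    dtype=None,        # infer a type for each column
    encoding='utf-8',
)

print(toy_data.dtype.names)
print(toy_data['prices'].mean())
```

Fields are then accessed by name, as in `toy_data['prices']`, which is the kind of access a structured array-like object provides.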

The `agent_data` argument of :class:`Problem` should also be a structured array-like object.

In [4]:
agent_data = np.recfromcsv(pyblp.data.BLP_AGENTS_LOCATION, encoding='utf-8')
agent_data.dtype.names

('market_ids',
 'weights',
 'nodes0',
 'nodes1',
 'nodes2',
 'nodes3',
 'nodes4',
 'nodes5',
 'income')

The agent data contains market IDs, integration weights, integration nodes, and demographics. In non-example problems, it is usually a better idea to use many more draws, or a more sophisticated :class:`Integration` configuration such as sparse grid quadrature.
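To make the role of nodes and weights concrete, here is a minimal sketch that approximates a one-dimensional expectation under a standard normal distribution in two ways: with equally weighted pseudo-random draws, and with a Gauss-Hermite rule. Gauss-Hermite quadrature is used here only as a simple stand-in for more sophisticated rules such as sparse grids. The integrand, the number of draws, and the rule size are arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def f(z):
    # Toy integrand with a known expectation under N(0, 1): E[z^2] = 1.
    return z ** 2

# Monte Carlo: pseudo-random nodes with equal weights.
rng = np.random.default_rng(0)
mc_nodes = rng.standard_normal(200)
mc_weights = np.full(mc_nodes.size, 1 / mc_nodes.size)
mc_estimate = np.sum(mc_weights * f(mc_nodes))

# Gauss-Hermite quadrature: a handful of deterministic nodes and weights.
gh_nodes, gh_weights = hermegauss(5)
gh_weights = gh_weights / np.sqrt(2 * np.pi)  # normalize so the weights sum to one
gh_estimate = np.sum(gh_weights * f(gh_nodes))

print(mc_estimate, gh_estimate)
```

In both cases the approximation is a weighted sum over nodes, which is why the agent data only needs to supply a `weights` column and a set of `nodes` columns.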


## Solving the Automobile Problem

Unlike the fake cereal problem, we won't absorb any fixed effects in the automobile problem. However, we'll estimate the supply side, so we need to formulate $X_3$ in addition to the three other matrices.

In [6]:
product_formulations = (
   pyblp.Formulation('1 + hpwt + air + mpd + space'),
   pyblp.Formulation('1 + prices + hpwt + air + mpd + space'),
   pyblp.Formulation('1 + log(hpwt) + air + log(mpg) + log(space) + trend')
)
product_formulations

(1 + hpwt + air + mpd + space,
 1 + prices + hpwt + air + mpd + space,
 1 + log(hpwt) + air + log(mpg) + log(space) + trend)

In [7]:
agent_formulation = pyblp.Formulation('0 + I(1 / income)')
agent_formulation

I(1 / income)

The original specification for the automobile problem includes the term $\log(y_i - p_j)$, in which $y_i$ is income and $p_j$ is price. Instead of including this term, which gives rise to a host of numerical problems, we'll follow :ref:`references:Berry, Levinsohn, and Pakes (1999)` and use its first-order linear approximation, $p_j / y_i$. The above formulation for $d$ includes a column of $1 / y_i$ values, which we'll interact with $p_j$.
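The approximation comes from expanding $\log(y_i - p_j) = \log y_i + \log(1 - p_j / y_i) \approx \log y_i - p_j / y_i$, where the $\log y_i$ term does not vary across products $j$. The quick numerical check below, with hypothetical income and price values, suggests how the approximation error grows as prices become a larger share of income.

```python
import numpy as np

income = 50.0                        # hypothetical y_i
prices = np.array([0.5, 2.0, 5.0])   # hypothetical p_j, in the same units as income

exact = np.log(income - prices)
approximation = np.log(income) - prices / income
errors = np.abs(exact - approximation)

print(errors)
```

The sign of the $p_j / y_i$ term is absorbed by the coefficient, which is one reason to expect a negative parameter in $\Pi$.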

In [9]:
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
problem

Dimensions:
   N:       T:              K1:                       K2:                     K3:                D:              MD:                 MS:             ED:         ES:            H:      
Products  Markets  Linear Characteristics  Nonlinear Characteristics  Cost Characteristics  Demographics  Demand Instruments  Supply Instruments  Demand FEs  Supply FEs  Nesting Groups
--------  -------  ----------------------  -------------------------  --------------------  ------------  ------------------  ------------------  ----------  ----------  --------------
  2217      20               5                         6                       6                 1                11                  12              0           0             0       

Formulations:
       Column Indices:            0          1        2       3          4         5  
-----------------------------  --------  ---------  -----  --------  ----------  -----
 X1: Linear Characteristics       1        hpwt      air   

We'll use published estimates as our starting values for $\Sigma$. For $\Pi$, we'll start from a column vector that is zero everywhere except for a negative entry in the row for prices. This entry is the coefficient on the $p_j / y_i$ term that stands in for $\log(y_i - p_j)$, so we're choosing to interact the inverse of income only with prices.

We'll also bound our parameters. When using a routine that supports bounds, :class:`Problem` chooses some default bounds to reduce the chance of numerical overflow that happens, for example, when optimization routines try out large parameter values. However, these default bounds are not quite restrictive enough to prevent overflow in the automobile problem, so we'll set our own bounds. In addition to overflow concerns, we'll also bound the diagonal of $\Sigma$ from below by zero for realism, and we'll make sure that demand is sloping downward by bounding the parameter in $\Pi$ from above (specifically, we'll use a bound that's slightly smaller than zero because when the parameter is exactly zero, there are matrix inversion problems with computing $\eta$). Choosing reasonable bounds can be very important.

In [11]:
initial_sigma = np.diag([3.612, 0, 4.628, 1.818, 1.050, 2.056])
initial_pi = np.c_[[0, -10, 0, 0, 0, 0]]
sigma_bounds = (
   np.zeros_like(initial_sigma),
   np.diag([100, 0, 100, 100, 50, 100])
)
pi_bounds = (
   np.c_[[0, -50, 0, 0, 0, 0]],
   np.c_[[0, -0.1, 0, 0, 0, 0]]
)
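As a rough illustration of the overflow concern, the sketch below exponentiates a random-coefficient term of the form `sigma * x * nu` for a very large trial value of `sigma` and for a value capped at an upper bound. All numbers, including the bound of 20, are hypothetical and are not the bounds used above.

```python
import numpy as np

x = 10.0    # hypothetical product characteristic
nu = 3.0    # hypothetical taste draw for one agent

def exp_utility(sigma):
    # exp() of the random-coefficient part of utility, sigma * x * nu
    return np.exp(sigma * x * nu)

with np.errstate(over='ignore'):
    unbounded = exp_utility(1e3)        # exp(30000) overflows to inf

bounded = exp_utility(min(1e3, 20.0))   # trial value capped at a hypothetical bound of 20

print(unbounded, np.isfinite(bounded))
```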

A linear marginal cost specification is the default, so we'll need to use the `costs_type` argument to employ the log-linear specification used by :ref:`references:Berry, Levinsohn, and Pakes (1995)`. A downside of this specification is that nonpositive estimated marginal costs can create problems for the optimization routine when computing $\tilde{c}(\hat{\theta}) = \log c(\hat{\theta})$. We'll use the `costs_bounds` argument to bound marginal costs from below by a small number. 
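The sketch below shows why nonpositive marginal costs are a problem for a log-linear specification and how a small lower bound avoids it. It is only an illustration of the idea with hypothetical cost values and a hypothetical bound, not a description of how the `costs_bounds` argument is implemented.

```python
import numpy as np

costs = np.array([2.5, 0.8, -0.1, 0.0])   # hypothetical marginal costs, two nonpositive
lower_bound = 0.001                        # hypothetical small lower bound

with np.errstate(divide='ignore', invalid='ignore'):
    raw_logs = np.log(costs)               # nan for the negative cost, -inf for zero

clipped_costs = np.clip(costs, lower_bound, None)
clipped_logs = np.log(clipped_costs)       # finite everywhere
number_clipped = int(np.sum(costs < lower_bound))

print(raw_logs)
print(clipped_logs)
print(number_clipped)
```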

Finally, as in the original paper, we'll use the `W_type` and `se_type` arguments to cluster by product IDs, which were specified as `clustering_ids` in the product data.
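To give a sense of what clustering changes, here is a minimal sketch of one common form of clustered moment covariance: moment contributions are first summed within each cluster, and the outer products of those sums are then averaged. The function name, the scaling by the number of observations, and the toy inputs are assumptions for illustration; the exact expressions used by :class:`Problem` may differ.

```python
import numpy as np

def clustered_moment_covariance(g, cluster_ids):
    """Outer products of within-cluster moment sums, divided by the number of rows."""
    g = np.asarray(g, dtype=float)
    unique_ids, inverse = np.unique(cluster_ids, return_inverse=True)
    sums = np.zeros((unique_ids.size, g.shape[1]))
    np.add.at(sums, inverse.ravel(), g)   # sum the rows of g within each cluster
    return sums.T @ sums / g.shape[0]

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 2))              # hypothetical moment contributions
cluster_ids = np.array([1, 1, 2, 3, 3, 3])   # hypothetical clustering IDs

S_clustered = clustered_moment_covariance(g, cluster_ids)
S_unclustered = clustered_moment_covariance(g, np.arange(6))

print(S_clustered)
print(S_unclustered)
```

When every observation is its own cluster, the expression reduces to the usual average of outer products, which is why the second call serves as an unclustered baseline.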

In [14]:
results = problem.solve(
   initial_sigma,
   initial_pi,
   sigma_bounds=sigma_bounds,
   pi_bounds=pi_bounds,
   costs_type='log',
   costs_bounds=(0.001, None),
   W_type='clustered',
   se_type='clustered'
)
results

Problem Results Summary:
Cumulative  GMM   Optimization   Objective   Total Fixed Point  Total Contraction  Objective    Gradient        Clipped    
Total Time  Step   Iterations   Evaluations     Iterations         Evaluations       Value    Infinity Norm  Marginal Costs
----------  ----  ------------  -----------  -----------------  -----------------  ---------  -------------  --------------
 0:01:51     2         8            12             34214             102883        +2.38E+05    +6.19E+03          0       

Linear Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
Beta:        1          hpwt          air          mpd         space                
------  -----------  -----------  -----------  -----------  -----------             
         -7.81E+00    +2.39E+00    +1.30E+00    -3.39E+00    +2.27E+00              
        (+1.69E+00)  (+2.42E+00)  (+3.39E+00)  (+2.29E+00)  (+6.07E-01)             
Gamma:       1        log(hpwt)       air       log(mpg)    log(sp

Here, the results are less similar to those in the original paper.