# MaxEnt Mixed Logit

### Model



**Model specification**

$$
P(y_{it} = j \mid X_i, \alpha_{ji}, \boldsymbol{\beta}_i) = \frac{e^{\alpha_{ji} + \boldsymbol{\beta}_i' \boldsymbol{x}_{jit}}}{\sum_{l \in C} e^{\alpha_{li} + \boldsymbol{\beta}_i' \boldsymbol{x}_{lit}}}
$$


- $P(y_{it} = j \mid X_i, \alpha_{ji}, \boldsymbol{\beta}_i)$: Probability of individual $i$ choosing alternative $j$ at time $t$, conditional on the parameters $\boldsymbol{\beta}_i$ and $\alpha_{ji}$.
- $y_{it}$: Choice outcome for individual $i$ at time $t$.
- $\boldsymbol{x}_{jit}$: Vector of attributes of choice $j$ for individual $i$ at time $t$.
- $\alpha_{ji}$: Intercept term for choice $j$ and individual $i$, assumed to be a random variable varying across choices and individuals.
- $\boldsymbol{\beta}_i$: Coefficient vector over the $k$ attributes, assumed to be a random vector varying across individuals.
- $C$: Set of available choices.
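The choice probability above is a softmax over the utilities $\alpha_{ji} + \boldsymbol{\beta}_i' \boldsymbol{x}_{jit}$. A minimal sketch for one individual at one time follows; the function name `choice_probabilities` and the input values are made up for illustration.

```python
import numpy as np

def choice_probabilities(alpha_i, beta_i, x_it):
    """Logit choice probabilities for one individual at one time.

    alpha_i: (J,) intercepts, beta_i: (K,) coefficients, x_it: (J, K) attributes.
    """
    utility = alpha_i + x_it @ beta_i      # alpha_ji + beta_i' x_jit for each j
    utility = utility - utility.max()      # stabilise the exponentials
    expu = np.exp(utility)
    return expu / expu.sum()

# Made-up inputs: 3 alternatives, 2 attributes
alpha_i = np.array([0.0, 0.5, -0.2])
beta_i = np.array([-1.0, 0.3])
x_it = np.array([[1.0, 2.0], [1.5, 1.0], [0.8, 3.0]])
probs = choice_probabilities(alpha_i, beta_i, x_it)
```

Subtracting the maximum utility does not change the result, because a constant added to every utility cancels between numerator and denominator.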

**Model parameters**

In Bayesian terms, the problem is to infer the joint posterior distribution of the model's unknown parameters.

$$P(\theta | \mathcal{D}) = \dfrac{P(\mathcal{D} | \theta) P(\theta)}{P(\mathcal{D})}$$

where $\theta =  \{ \alpha_{ji}, \boldsymbol{\beta}_i \}$ and $\mathcal{D} = \{ x_{jit}, y_{it} \}$

### Model complexity


The model can be more or less complex, depending on how flexible $\alpha$ and $\beta$ are.

**Multinomial logit models**

The key feature of MNL models is that $\beta$ does not vary over the customers $i$, making it a standard deterministic vector with one entry per attribute.

$$
\beta = \begin{bmatrix}
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_k
\end{bmatrix}
$$

In the parameter counts, $j$, $k$, $i$, and $t$ denote the number of choices, attributes, customers, and time periods.

- In the least complex case, $\alpha$ and $\beta$ are fixed for all customers, times, and choices. This is the simplest multinomial logit model, with $1 + k$ parameters.
- Getting more complex, $\alpha$ varies by choice, while $\beta$ stays fixed for all customers, times, and choices: $j + k$ parameters.
- Getting more complex again, $\alpha$ varies by choice, while $\beta$ is fixed for all customers and times but varies by choice: $j + kj$ parameters.


**Mixed logit models**

The key feature of mixed logit models is that $\beta$ does vary over the customers, making it a random vector. We denote this vector $\phi$ to distinguish it from the deterministic case.

$$
\phi = \begin{bmatrix}
\beta_1 \sim \mathcal{N}(\mu_1, \sigma_1^2) \\
\beta_2 \sim \mathcal{N}(\mu_2, \sigma_2^2) \\
\vdots \\
\beta_k \sim \mathcal{N}(\mu_k, \sigma_k^2)
\end{bmatrix}
$$


- In the least complex case, $\alpha$ varies by choice, while $\beta$ varies by customer but, for each customer, stays the same across all times and choices: $j + ki$ parameters.
- Getting more complex, $\alpha$ varies by choice and customer, while $\beta$ varies by customer but stays the same across all times and choices: $ji + ki$ parameters.
- Getting more complex again, $\alpha$ varies by choice and customer, while $\beta$ varies by customer and by choice but stays the same across all times: $ji + kij$ parameters.
- In the most complex case, $\alpha$ varies by choice and customer, while $\beta$ varies by customer, time, and choice: $ji + kijt$ parameters.
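To get a feel for how quickly these counts grow, the formulas from the two lists above can be tabulated for a given problem size. This is only a bookkeeping sketch; the function name, the dictionary keys, and the example sizes are hypothetical.

```python
def parameter_counts(n_choices, n_attributes, n_customers, n_periods):
    """Parameter counts for each specification, in the order listed above."""
    j, k, i, t = n_choices, n_attributes, n_customers, n_periods
    return {
        "mnl_common_alpha": 1 + k,
        "mnl_alpha_by_choice": j + k,
        "mnl_beta_by_choice": j + k * j,
        "mixed_beta_by_customer": j + k * i,
        "mixed_alpha_by_customer": j * i + k * i,
        "mixed_beta_by_customer_choice": j * i + k * i * j,
        "mixed_beta_by_customer_choice_time": j * i + k * i * j * t,
    }

counts = parameter_counts(n_choices=3, n_attributes=2, n_customers=100, n_periods=10)
for name, n in counts.items():
    print(f"{name}: {n}")
```

With 3 choices, 2 attributes, 100 customers, and 10 periods, the count rises from 3 in the simplest multinomial logit to 6300 in the most flexible mixed logit.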



### Stochastic $\beta$ in mixed logit

For an individual customer $i$, the coefficient on attribute $k$ is the population mean $\beta_k$ plus a random deviation with standard deviation $\sigma_k$. Equivalently, the deviation is the fixed $\sigma_k$ scaled by an individual-specific standard normal draw $v_{ik}$. For a given attribute $k$, this looks like

$$ \beta_{ik} = \beta_k + \sigma_k v_{ik}, \qquad v_{ik} \sim \mathcal{N}(0, 1) $$


Returning to $\phi$ to take into account all of the attributes $k$: because it is a random vector with normal components, it has a multivariate normal joint distribution.

$$\phi_i = \phi + \Gamma v_i$$

where $\Gamma$ is the standard deviation matrix with the sigmas on the main diagonal.

$$
\Gamma = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_k
\end{bmatrix}
$$
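A minimal simulation of $\phi_i = \phi + \Gamma v_i$ with a diagonal $\Gamma$ is sketched below. The means, standard deviations, and sample size are made-up values, and $v_i$ is assumed to be standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([-1.0, 0.5, 2.0])       # population means, one per attribute (made up)
sigma = np.array([0.3, 0.1, 0.8])      # standard deviations, one per attribute (made up)
Gamma = np.diag(sigma)                 # diagonal standard deviation matrix

n_customers = 50_000
v = rng.standard_normal((n_customers, phi.size))   # v_i ~ N(0, I), one row per customer
phi_i = phi + v @ Gamma.T                           # row i is phi + Gamma v_i

print(phi_i.mean(axis=0))   # close to phi
print(phi_i.std(axis=0))    # close to sigma
```

Because $\Gamma$ is diagonal here, the coefficients are drawn independently across attributes.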

**Correlated attributes**

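With correlation between different attributes, $\Gamma$ is no longer diagonal. It becomes a lower-triangular factor (for example, the Cholesky factor) of the covariance matrix, so that

$$ \Sigma = \Gamma \Gamma' $$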

$$
\Sigma = \begin{bmatrix}
\sigma_1^2 & \text{Cov}(1, 2) & \cdots & \text{Cov}(1, k) \\
\text{Cov}(2, 1) & \sigma_2^2 & \cdots & \text{Cov}(2, k) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(k, 1) & \text{Cov}(k, 2) & \cdots & \sigma_k^2
\end{bmatrix}
$$
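One common way to obtain a $\Gamma$ with $\Sigma = \Gamma \Gamma'$ when the off-diagonal covariances are non-zero is the Cholesky factor. The sketch below assumes a made-up $2 \times 2$ covariance matrix and checks that draws of $\phi + \Gamma v_i$ reproduce it.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[0.25, 0.10],
                  [0.10, 0.64]])       # made-up covariance of two coefficients
Gamma = np.linalg.cholesky(Sigma)      # lower triangular, Sigma = Gamma Gamma'

phi = np.array([-1.0, 0.5])            # made-up population means
v = rng.standard_normal((100_000, 2))  # v_i ~ N(0, I)
phi_i = phi + v @ Gamma.T              # correlated individual coefficients
sample_cov = np.cov(phi_i, rowvar=False)

print(np.round(sample_cov, 3))         # close to Sigma
```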


# Maximum entropy setup

Our ground-truth choice data $Y$ is an $i \times j$ matrix with individual observations $i$ on the rows and choice outcomes $j$ on the columns.

$$
Y = \begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1j} \\
y_{21} & y_{22} & \cdots & y_{2j} \\
\vdots & \vdots & \ddots & \vdots \\
y_{i1} & y_{i2} & \cdots & y_{ij}
\end{bmatrix}
$$

This makes $y_{ij}$ some link function $F$ of the observed attributes $x_{ij}$ and coefficients $\beta_j$, plus an error term.

$$ y_{ij} = F(x_{ij}' \beta_j) + \epsilon_{ij}$$

$F(x_{ij}' \beta_j)$ is just the predicted probability $p_{ij}$, so

$$ y_{ij} = p_{ij} + \epsilon_{ij}$$

In other words, the ground truth choice is the predicted probability plus some error.

Because $0 < p_{ij} < 1$ and $y_{ij}$ is either $0$ or $1$, the error term $\epsilon_{ij} = y_{ij} - p_{ij}$ lies strictly between $-1$ and $1$, so it is bounded by the interval $[-1,1]$.
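The bound on the error term can be checked on simulated data. The sketch below assumes a logit link with a single attribute and a made-up coefficient `beta`; none of these values come from real data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_individuals, n_choices = 5, 3
x = rng.uniform(1.0, 5.0, size=(n_individuals, n_choices))   # one attribute per choice
beta = -0.8                                                  # hypothetical coefficient

utility = beta * x
p = np.exp(utility) / np.exp(utility).sum(axis=1, keepdims=True)   # p_ij under a logit link
chosen = np.array([rng.choice(n_choices, p=row) for row in p])     # simulate one choice each
y = np.eye(n_choices)[chosen]                                      # one-hot y_ij
eps = y - p                                                        # epsilon_ij

print(eps.min(), eps.max())
```

When each row of $y$ is one-hot and each row of $p$ sums to one, the errors for an individual also sum to zero across the choices.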


### Revenue constraint

Assuming price is our only attribute (i.e. $k = 1$), revenue is price ($x$) multiplied by volume ($y$).

$$
R = \sum_{ij} y_{ij} \times x_{ij}
$$

$$
R = \sum_{ij} \left( p_{ij} + \epsilon_{ij} \right) \times x_{ij}
$$

$$
R = \sum_{ij} p_{ij} \times x_{ij} + \sum_{ij} \epsilon_{ij} \times x_{ij}
$$

Therefore, total revenue equals predicted revenue plus revenue error.
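A small numerical check of this decomposition, using made-up prices, probabilities, and choices for two individuals and three alternatives:

```python
import numpy as np

x = np.array([[2.0, 3.0, 4.0],
              [2.5, 3.5, 1.0]])    # prices x_ij (made up)
p = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6]])    # predicted probabilities p_ij (made up)
y = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])    # observed choices y_ij (made up)
eps = y - p                        # epsilon_ij

total_revenue = np.sum(y * x)
predicted_revenue = np.sum(p * x)
revenue_error = np.sum(eps * x)

print(total_revenue, predicted_revenue, revenue_error)
```

Here total revenue is 4.0, predicted revenue is 4.5, and the revenue error is -0.5, so the two components add up to the total.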



### Reformulating $p_{ij}$ and $\epsilon_{ij}$ as random variables

Let $p_{ij}$ be the expected value of a discrete random variable $s_{ij}$ that takes values on the support points $s_m \in [0,1]$ with probabilities $P(s_{ij} = s_m) = \pi_{ijm}$.

$$ \langle s_{ij} \rangle  = p_{ij} $$

$$ \langle s_{ij} \rangle  = \sum_m s_m \pi_{ijm} $$



Let $\epsilon_{ij}$ be the expected value of a discrete random variable $u_{ij}$ that takes values on the support points $u_h \in [-1,1]$ with probabilities $P(u_{ij} = u_h) = w_{ijh}$.

$$  \langle u_{ij} \rangle = \epsilon_{ij} $$

$$ \langle u_{ij} \rangle  = \sum_h u_h w_{ijh} $$
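As an illustration for a single $(i, j)$ pair, the sketch below assumes five equally spaced support points for each variable and made-up weights, then recovers $p_{ij}$ and $\epsilon_{ij}$ as expected values.

```python
import numpy as np

s = np.linspace(0.0, 1.0, 5)     # support points s_m: 0, 0.25, 0.5, 0.75, 1
u = np.linspace(-1.0, 1.0, 5)    # support points u_h: -1, -0.5, 0, 0.5, 1

pi_ij = np.array([0.1, 0.4, 0.3, 0.15, 0.05])   # weights pi_ijm for one (i, j), made up
w_ij = np.array([0.05, 0.2, 0.5, 0.2, 0.05])    # weights w_ijh for one (i, j), made up

p_ij = s @ pi_ij       # <s_ij> = sum_m s_m pi_ijm
eps_ij = u @ w_ij      # <u_ij> = sum_h u_h w_ijh

print(p_ij, eps_ij)
```

With these weights $p_{ij} = 0.4125$, and the symmetric $w$ gives $\epsilon_{ij} = 0$. The number and spacing of support points is a modelling choice, not something fixed by the setup above.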


### Reformulated revenue constraint

$$
R = \sum_{ij} y_{ij} \times x_{ij}
$$

Using $y_{ij} = p_{ij} + \epsilon_{ij}$, we split revenue into its predicted and error components.

$$
R = \sum_{ij} p_{ij} \times x_{ij} + \sum_{ij} \epsilon_{ij} \times x_{ij}
$$

Substituting in their expected value representation

$$
R = \sum_{ij} \langle s_{ij} \rangle \times x_{ij} + \sum_{ij} \langle u_{ij} \rangle \times x_{ij}
$$

Putting the expected values in explicit form

$$
R = \sum_{ijm} s_m \pi_{ijm} \times x_{ij} + \sum_{ijh} u_h w_{ijh} \times x_{ij}
$$
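The explicit triple sums map directly onto `np.einsum`. The sketch below uses randomly generated weights and prices (all made up) and checks that the explicit form agrees with first computing the expectations and then weighting by price.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, M, H = 4, 3, 5, 5
s = np.linspace(0.0, 1.0, M)                   # support points s_m
u = np.linspace(-1.0, 1.0, H)                  # support points u_h
x = rng.uniform(1.0, 5.0, size=(I, J))         # prices x_ij
pi = rng.dirichlet(np.ones(M), size=(I, J))    # shape (I, J, M), each pi[i, j] sums to 1
w = rng.dirichlet(np.ones(H), size=(I, J))     # shape (I, J, H), each w[i, j] sums to 1

# Explicit form: sum_ijm s_m pi_ijm x_ij + sum_ijh u_h w_ijh x_ij
explicit = np.einsum("m,ijm,ij->", s, pi, x) + np.einsum("h,ijh,ij->", u, w, x)

# Expectation form: sum_ij <s_ij> x_ij + sum_ij <u_ij> x_ij
via_expectations = np.sum((pi @ s) * x) + np.sum((w @ u) * x)

print(explicit, via_expectations)
```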



# Maximum entropy model

Our job is to infer the values of $\pi_{ijm}$ and $w_{ijh}$ by maximising an entropy (objective) function.

We assume the model is correctly specified, such that $p$ and $\epsilon$ (and therefore $\pi$ and $w$) are independent. The joint entropy is then the sum of the marginal entropies, so maximising it amounts to maximising that sum.

### Objective function

$$\max_{\pi, w} H(\pi, w) =  \left( - \sum_{ijm} \pi_{ijm} \log \pi_{ijm} \right) +  \left( - \sum_{ijh} w_{ijh} \log w_{ijh} \right) $$
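A direct implementation of this objective might look as follows. The function name `joint_entropy` and the array shapes `(I, J, M)` and `(I, J, H)` are assumptions for illustration; zero weights are treated as contributing nothing, following the convention $0 \log 0 = 0$.

```python
import numpy as np

def joint_entropy(pi, w):
    """H(pi, w) = -sum pi log pi - sum w log w, treating 0 log 0 as 0."""
    def h(q):
        q = np.asarray(q, dtype=float)
        safe = np.where(q > 0, q, 1.0)     # log(1) = 0, so zero weights contribute nothing
        return -np.sum(q * np.log(safe))
    return h(pi) + h(w)

I, J, M, H = 4, 3, 5, 5
uniform_pi = np.full((I, J, M), 1.0 / M)
uniform_w = np.full((I, J, H), 1.0 / H)
h_uniform = joint_entropy(uniform_pi, uniform_w)

print(h_uniform, I * J * (np.log(M) + np.log(H)))
```

Without any data constraints the maximum is attained at uniform weights, where the entropy equals $IJ(\log M + \log H)$. The constraints are what pull the solution away from uniform.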

### Constraints

**Revenue constraint**
$$ R = \sum_{ij} y_{ij} \times x_{ij} = \sum_{ijm} s_m \pi_{ijm} \times x_{ij} + \sum_{ijh} u_h w_{ijh} \times x_{ij} $$

**$\pi$ and $w$ must be legitimate probability distributions over the values of $p$ and $\epsilon$**
$$\sum_m \pi_{ijm} = 1$$

$$\sum_h w_{ijh} = 1$$

**For each individual $i$, $p_{ij}$ must be a legitimate probability distribution over the choices $j$**
$$ \sum_j \sum_m s_m \pi_{ijm} = 1$$


In principle, we can formulate this as a Lagrangean and solve it analytically, but in practice we will probably use numerical methods. Each family of constraints gets its own multiplier: $\lambda$ for revenue, $\eta_{ij}$ and $\nu_{ij}$ for the normalisation of $\pi$ and $w$, and $\gamma_i$ for the choice probabilities.

$$
\mathcal{L} = - \sum_{ijm} \pi_{ijm} \log \pi_{ijm} - \sum_{ijh} w_{ijh} \log w_{ijh} + \lambda \left( R - \sum_{ijm} s_m \pi_{ijm} \times x_{ij} - \sum_{ijh} u_h w_{ijh} \times x_{ij} \right) + \sum_{ij} \eta_{ij} \left( 1 - \sum_m \pi_{ijm} \right) + \sum_{ij} \nu_{ij} \left( 1 - \sum_h w_{ijh} \right) + \sum_{i} \gamma_{i} \left( 1 - \sum_j \sum_m s_m \pi_{ijm} \right)
$$
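As a sketch of what the analytical route yields, consider only the error weights $w_{ijh}$. They appear in the entropy term, in the revenue constraint (multiplier $\lambda$), and in their own normalisation constraint. Setting the derivative of $\mathcal{L}$ with respect to $w_{ijh}$ to zero, with $c_{ij}$ standing in for that normalisation multiplier (a label introduced here for illustration), gives

$$ -\log w_{ijh} - 1 - \lambda u_h x_{ij} - c_{ij} = 0 $$

Solving for $w_{ijh}$ and using $\sum_h w_{ijh} = 1$ to eliminate $c_{ij}$:

$$ w_{ijh} = \frac{e^{-\lambda u_h x_{ij}}}{\sum_{h'} e^{-\lambda u_{h'} x_{ij}}} $$

So the error weights take a logit-like form over the support points, governed by the single multiplier $\lambda$. The same steps apply to $\pi_{ijm}$, with additional terms in the exponent from the other constraints that involve $\pi_{ijm}$.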




### Marginal effects



### Information gain

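### Numerical solution

A sketch of the numerical approach with `scipy.optimize.minimize`. The data, support points, and problem sizes are small made-up values so that the example runs end to end.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data for illustration only: I individuals, J choices, price as the one attribute
rng = np.random.default_rng(0)
I, J, M, H = 4, 3, 3, 3
x = rng.uniform(1.0, 5.0, size=(I, J))       # prices x_ij
y = np.eye(J)[rng.integers(0, J, size=I)]    # one-hot observed choices y_ij
R = np.sum(y * x)                            # observed revenue
s = np.linspace(0.0, 1.0, M)                 # support points s_m for p_ij
u = np.linspace(-1.0, 1.0, H)                # support points u_h for epsilon_ij
n_pi, n_w = I * J * M, I * J * H
num_variables = n_pi + n_w


def unpack(params):
    # params is a flattened array of all the pi and w values
    pi = params[:n_pi].reshape(I, J, M)
    w = params[n_pi:].reshape(I, J, H)
    return pi, w


# Objective function: negative of the entropy function
def objective_function(params):
    pi, w = unpack(params)
    entropy = -np.sum(pi * np.log(pi)) - np.sum(w * np.log(w))
    return -entropy


# Constraint functions (each returns zero when the constraint holds)
def revenue_constraint(params):
    pi, w = unpack(params)
    calculated_revenue = np.sum((pi @ s + w @ u) * x)
    return calculated_revenue - R


def probability_constraint_pi(params):
    # Sum of pi over m must be 1 for each i, j
    pi, _ = unpack(params)
    return pi.sum(axis=2).ravel() - 1


def probability_constraint_w(params):
    # Sum of w over h must be 1 for each i, j
    _, w = unpack(params)
    return w.sum(axis=2).ravel() - 1


def choice_constraint(params):
    # Predicted probabilities p_ij must sum to 1 over the choices j for each i
    pi, _ = unpack(params)
    return (pi @ s).sum(axis=1) - 1


# Initial guess: uniform weights, which satisfy both normalisation constraints
initial_guess = np.concatenate([np.full(n_pi, 1.0 / M), np.full(n_w, 1.0 / H)])

# Keep every weight strictly positive so that log() stays finite
bounds = [(1e-9, 1.0)] * num_variables

# Constraints setup
constraints = [{'type': 'eq', 'fun': revenue_constraint},
               {'type': 'eq', 'fun': probability_constraint_pi},
               {'type': 'eq', 'fun': probability_constraint_w},
               {'type': 'eq', 'fun': choice_constraint}]

# Optimization (SLSQP handles bounds together with equality constraints)
result = minimize(objective_function, initial_guess, method='SLSQP',
                  bounds=bounds, constraints=constraints)

# Check if the solver found a solution
if result.success:
    pi_hat, w_hat = unpack(result.x)
    p_hat = pi_hat @ s       # predicted probabilities p_ij
    eps_hat = w_hat @ u      # estimated errors epsilon_ij
else:
    print("Optimization failed:", result.message)
```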

