# ...do not forget about convex optimization

# Part 2. Discrete choice interpretation of Logit

### Logistic regression lies in the intersection of classification models (CS) and discrete choice models (Econ). I will talk first about the second interpretation

An agent chooses between goods 1, 2 and a default option (0), and the utilities are
- $U_0 = V_0 + \varepsilon_0$ if nothing is bought, $V_0 = 0$
- $U_1 = V_1(X_1,Z) + \varepsilon_1$ if the first good is bought, $V_1(X_1,Z) = \alpha' Z + \beta' X_1$
- $U_2 = V_2(X_2,Z) + \varepsilon_2$ if the second good is bought, $V_2(X_2,Z) = \alpha' Z + \beta' X_2$

where $Z$ are agent characteristics (income, region) and $X_j$ are the characteristics of product $j$ (price, volume)

Here the $\varepsilon_i$ are i.i.d. errors with a Type I Extreme Value (Gumbel) distribution, chosen so that the difference $\varepsilon_i - \varepsilon_j$ is, in fact, logistic.

Before, we had only one good.

### We can compute the probabilities of buying each good

- $P_0 = Prob(U_0 = \max U_i) = Prob(\varepsilon_j - \varepsilon_0 < V_0 - V_j, \forall j \neq 0) = e^{V_0}/(e^{V_0} + e^{V_1} + e^{V_2})$
- $P_1 = Prob(U_1 = \max U_i) = Prob(\varepsilon_j - \varepsilon_1 < V_1 - V_j, \forall j \neq 1) = e^{V_1}/(e^{V_0} + e^{V_1} + e^{V_2})$
- $P_2 = Prob(U_2 = \max U_i) = Prob(\varepsilon_j - \varepsilon_2 < V_2 - V_j, \forall j \neq 2) = e^{V_2}/(e^{V_0} + e^{V_1} + e^{V_2})$

### Notice that they add up to one
### Probabilities can be interpreted as market shares (Sony vs Xbox)
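
The closed-form shares above can be checked numerically. The sketch below is only an illustration: the utility values, the seed and the helper name `choice_probabilities` are made up for this example. It computes the logit shares and compares them with the frequency with which each option wins when Gumbel errors are simulated.

```python
import numpy as np

def choice_probabilities(v):
    """Logit shares exp(V_j) / sum_k exp(V_k) for a vector of systematic utilities."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - v.max())  # subtracting the max only improves numerical stability
    return e / e.sum()

# Hypothetical systematic utilities; V_0 = 0 is the default option.
v = np.array([0.0, 0.5, -0.3])
p = choice_probabilities(v)

# Monte Carlo check: draw i.i.d. Gumbel errors and record which option has the largest utility.
rng = np.random.default_rng(0)
eps = rng.gumbel(size=(200_000, 3))
winners = np.argmax(v + eps, axis=1)
simulated_shares = np.bincount(winners, minlength=3) / len(winners)

print("closed form:", p)
print("simulated:  ", simulated_shares)
```

The two printed vectors should agree up to simulation noise, and the closed-form shares sum to one by construction.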

### This is your typical logit: $$ \text{Share of good 1}\mid Z = \frac{e^{\alpha' Z + \beta' p_1}}{1 + e^{\alpha' Z+ \beta' p_{1}} + e^{\alpha' Z+ \beta' p_{2}}}$$
- we can average over $Z$ and take logs, in whichever order you like (average log-share or log of the average share); note that the two orders give different numbers
- we can differentiate to compute own-price and cross-price elasticities of market shares
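
As a worked example of the elasticity computation, assume (for illustration only) that each good has a scalar price $p_j$ with a scalar coefficient $\beta$, so that $V_j = \alpha' Z + \beta p_j$ and $P_j$ is the share of good $j$ conditional on $Z$. Differentiating the share formula gives

$$ \frac{\partial P_1}{\partial p_1} = \beta P_1 (1 - P_1), \qquad \frac{\partial P_1}{\partial p_2} = -\beta P_1 P_2 $$

and therefore the elasticities

$$ \frac{\partial \log P_1}{\partial \log p_1} = \beta p_1 (1 - P_1), \qquad \frac{\partial \log P_1}{\partial \log p_2} = -\beta p_2 P_2 $$

The cross-price elasticity with respect to $p_2$ is the same for every other good, which is the independence of irrelevant alternatives property of the plain logit. These expressions hold conditional on $Z$; after averaging shares over $Z$ the elasticities have to be recomputed from the averaged shares.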

### This machinery turned out to be so effective that many extensions were developed
- ordered logit
- nested logit
- random coefficient logit
- latent group logit

### All this can be found in Train, Discrete Choice Methods with Simulation, 2003 or 2009

## 1. Ordered Logit 
### is just like logit but the alternatives are ordered. Consider data on schooling
$$ U = V(X,Z) + \varepsilon$$
where $X$ are agent characteristics, $Z$ are school characteristics, and $\varepsilon$ is a logistic error

- if utility $U<r_1$, finish school only $\to$ good 0
- if utility $r_1<U<r_2$, go on to a bachelor's degree $\to$ good 1
- if utility $r_2<U<r_3$, go on to a master's degree $\to$ good 2
- if utility $U>r_3$, go on to a PhD $\to$ good 3

We want to estimate the thresholds $r_1, r_2, r_3$ as well as all other coefficients

### All we need to do is to honestly derive shares of these "goods", 
$$ \text{Share of good } 1 = Pr(r_1 < U < r_2) = Pr(U < r_2) - Pr(U < r_1)=$$ $$= Pr(\varepsilon < r_2 - V(X,Z)) - Pr(\varepsilon < r_1 - V(X,Z)) = \frac{1}{1+\exp(V(X,Z)-r_2)} - \frac{1}{1+\exp(V(X,Z)-r_1)}$$

### Then patiently derive the whole likelihood
$$ \mathcal{LL} = \sum_{y_i=0} \log\left[\frac{1}{1+\exp(V(x_i,z_i)-r_1)}\right]+ \sum_{y_i=1}\log\left[\frac{1}{1+\exp(V(x_i,z_i)-r_2)}-\frac{1}{1+\exp(V(x_i,z_i)-r_1)}\right] + \ldots + \sum_{y_i = 3}\log \left[1-\frac{1}{1+\exp(V(x_i,z_i)-r_3)}\right]$$

### Add linear specification $V(X,Z) = \alpha X + \beta Z$
### Is the log-likelihood concave in $\alpha, \beta, r$? Actually it is.
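
The log-likelihood above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the stacked `features` matrix (agent and school characteristics side by side), the simulated data and the "true" parameter values are all assumptions made for the example.

```python
import numpy as np

def logistic_cdf(t):
    return 1.0 / (1.0 + np.exp(-t))

def ordered_logit_loglik(coef, thresholds, features, y):
    """Ordered logit log-likelihood with linear index V = features @ coef.

    thresholds must be increasing (r_1 < r_2 < ...); y takes values 0..len(thresholds).
    """
    v = features @ coef
    # Pad with -inf and +inf so that category k corresponds to cuts[k] < U < cuts[k + 1].
    cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
    upper = logistic_cdf(cuts[y + 1] - v)
    lower = logistic_cdf(cuts[y] - v)
    return np.sum(np.log(upper - lower))

# Hypothetical simulated data: two features, three thresholds, four ordered outcomes.
rng = np.random.default_rng(1)
features = rng.normal(size=(500, 2))
true_coef = np.array([1.0, -0.5])
true_thresholds = np.array([-1.0, 0.0, 1.5])
u = features @ true_coef + rng.logistic(size=500)
y = np.searchsorted(true_thresholds, u)  # number of thresholds lying below U

ll_true = ordered_logit_loglik(true_coef, true_thresholds, features, y)
ll_zero = ordered_logit_loglik(np.zeros(2), true_thresholds, features, y)
print(ll_true, ll_zero)
```

Passing the negative of this function to any general-purpose optimizer, with the thresholds kept in increasing order, gives the maximum likelihood estimates.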

## 2. Latent group Logit
### Sometimes we would like the coefficients to be correlated across certain groups. Say, for one part of the population $\beta_1$ and $\beta_2$ are both small and for another part $\beta_1$ and $\beta_2$ are both large. The groups are latent: we cannot observe them, and there is no feature in the data that recovers the group.

### To model this we create two copies of the coefficients, $\beta^1$ and $\beta^2$, and an additional parameter $\rho$ that measures the share of the first group. The model randomly assigns each individual to one of the two groups, and the individual then uses that group's coefficients.

$$\mathcal{L}=\prod_i G(x_i)^{y_i} \cdot (1-G(x_i))^{1-y_i}, \quad G(x_i) = \rho F(\beta^1 x_i) + (1-\rho)F(\beta^2 x_i)$$
where $F$ is the logistic CDF.

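### Is $\mathcal{LL}$ concave in $\beta^1, \beta^2, \rho$? Not jointly, in general. It is concave in $\rho$ for fixed $\beta^1, \beta^2$, but a mixture of logits need not be log-concave in the coefficients, and swapping the group labels, $(\beta^1, \beta^2, \rho) \to (\beta^2, \beta^1, 1-\rho)$, leaves $\mathcal{LL}$ unchanged. So the ordered logit was the last of the concave ones.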
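
A minimal sketch of this mixture log-likelihood, assuming $F$ is the logistic CDF. The function name, the simulated data and the parameter values are hypothetical. The last lines illustrate that relabeling the two groups gives exactly the same value.

```python
import numpy as np

def latent_group_loglik(beta1, beta2, rho, x, y):
    """Log-likelihood of a two-group logit mixture; rho is the share of group 1."""
    f1 = 1.0 / (1.0 + np.exp(-(x @ beta1)))
    f2 = 1.0 / (1.0 + np.exp(-(x @ beta2)))
    g = rho * f1 + (1.0 - rho) * f2  # G(x_i): mixture probability that y_i = 1
    return np.sum(y * np.log(g) + (1 - y) * np.log(1.0 - g))

# Hypothetical data: 30% of individuals use b_small, the rest use b_large.
rng = np.random.default_rng(3)
x = rng.normal(size=(400, 2))
b_small, b_large = np.array([0.2, 0.1]), np.array([2.0, 1.5])
group1 = rng.random(400) < 0.3
index = np.where(group1, x @ b_small, x @ b_large)
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-index))).astype(float)

ll = latent_group_loglik(b_small, b_large, 0.3, x, y)
ll_swapped = latent_group_loglik(b_large, b_small, 0.7, x, y)  # same model, labels swapped
print(ll, ll_swapped)
```

Because of this symmetry, estimates are only identified up to the labeling of the groups.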

## 3. Random Coefficients Logit
### When there are too many groups, we would like a continuous version of this correlation

$$\mathcal{L}=\prod_i G(x_i)^{y_i} \cdot (1-G(x_i))^{1-y_i}, \quad G(x_i) = \int F(\beta x_i) d H(\beta)$$
where $H(\beta)$ is a distribution with an unknown mean and covariance matrix.

### Is $\mathcal{LL}$ concave? Not in general. $G$ is a (continuous) sum, but its weights depend on the parameters of the distribution in a convoluted way. For example, the Gaussian density is not concave. In fact, very few densities are.

### Even worse, $G(x)$ has no closed form and has to be simulated. For example, suppose $\beta$ is normally distributed with mean $\gamma$ and variance $\alpha^2$:
$$\beta = \alpha \mathcal{N}(0,1) + \gamma, \quad \mathcal{LL} \to \max_{\alpha, \gamma}$$

### We first simulate (outside the loop) a large number of standard normal draws, then transform them (inside the loop) and use them for numerical integration. That is because random number generation is expensive, while linear transformations are very, very cheap. Reusing the same draws also means that every iteration evaluates the same simulated objective.
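
The draw-once, transform-inside pattern can be sketched as follows. This is an illustration under simplifying assumptions: a single scalar regressor, a scalar random coefficient, $F$ the logistic CDF, and made-up data. The names `base_draws` and `simulated_loglik` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 1000
base_draws = rng.standard_normal(n_draws)  # simulated once, outside any optimization loop

def simulated_loglik(alpha, gamma, x, y, draws=base_draws):
    """Simulated log-likelihood for beta = alpha * N(0, 1) + gamma with scalar x_i."""
    beta = alpha * draws + gamma                    # cheap linear transform of the fixed draws
    f = 1.0 / (1.0 + np.exp(-np.outer(x, beta)))    # F(beta_r * x_i), shape (n_obs, n_draws)
    g = f.mean(axis=1)                              # Monte Carlo estimate of G(x_i)
    return np.sum(y * np.log(g) + (1 - y) * np.log(1.0 - g))

# Hypothetical data generated with gamma = 1.0 and alpha = 0.5.
x = rng.normal(size=300)
beta_i = 1.0 + 0.5 * rng.standard_normal(300)
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-beta_i * x))).astype(float)

# The loop only re-transforms base_draws; no new random numbers are generated inside it.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, simulated_loglik(alpha, 1.0, x, y))
```

With `alpha = 0` every transformed draw equals `gamma`, so the simulated likelihood collapses to the plain logit likelihood, which is a convenient sanity check.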

## P.S. You can add regularization to any of those