<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/WK_04_multinomial_logit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 4: Multinomial logit

We are introducing the multinomial logit, the most popular model for choice modelling. It can be considered as the extension of the logit to more than two alternatives. 

On top of the logit we will add another axiom that leads us to the mathematical formulation of the multinomial logit. The axiom is 'needed', or at least helps us deal with potential ambiguities, when we have more than two alternatives.

---
---



# Recap

* Based on utility decomposition: observed utility + random component creates choice probabilities.

* Studied binary choice to simplify the concepts:
  * The view as predicting the choice probability directly
  * The view as predicting the log-odds with the logistic squashing.

* Linking back to utility
  * logit: The view of observed utility as a linear function of the variables + random component having Gumbel distribution.


---
---

# Multinomial

The multi indicates that we have more than two alternatives.

**Question:**
When we have 2 alternatives, we only need to determine the choice probability of 1 of them; the other is completely determined because probabilities have to sum to 1. **How many of the alternatives have to be determined when we have $J$ alternatives?**

---
---



# Independence of Irrelevant Alternatives (IIA)

When talking about utility, we have imposed some 'axioms' in the decision-making process such as transitivity, completeness (some of them are questionable).
If we accept these axioms, then we can show mathematically that any decision-making process can be represented by utility functions that assign a number to each alternative, and the individuals then act as if they were maximizing utility: they choose the alternative that gives them the highest utility. This makes analysis easier and enables precise quantitative descriptions.

Then we mentioned that we can impose additional restrictions on the decision-making process, for example, a very popular one is that there is a linear relationship between
 some variables in the process and the utility (attributes of the alternatives and characteristics of the individuals).

There is another important axiom that comes into play when we have to deal with more than two alternatives, and we will be able to almost completely derive the multinomial logit model from this axiom:  **Independence of irrelevant alternatives**


---
---



# Informal description of IIA

 Informally, IIA means that **the preference between any two alternatives should only depend on those two alternatives, not on the rest of the choice set.**

 For example, we have alternatives A and B, say latte and espresso in the coffee shop example. Imagine that the individuals prefer latte to espresso.
 We now change the choice set, adding a third alternative C, long black. The independence of irrelevant alternatives states that the preference between latte and espresso should not change, 'it cannot happen' that adding long black modifies the preference and makes the individuals prefer espresso to latte.

 This axiom seems quite reasonable, and it appears in some form in many behavioral sciences, not only choice modelling (e.g. game theory). The IIA axiom, in addition to seeming reasonable at first glance, has a powerful consequence:

 * It allows us to compare preferences by pairs, without needing to consider the full choice set. For us, as analysts, this means that we can subdivide the problem down to the level of pairs of alternatives. When we have many alternatives, think hundreds, **we can present subsets of those alternatives at a time and still recover the equivalent result as if the subjects were presented with the full choice set**. This in fact might be the reason why it was proposed for choice modelling.


The specific definition of the IIA axiom varies depending on the subfield. In choice modelling, it is the constant probability ratio: the ratio of the choice probabilities of any two alternatives is not affected by the presence of any other alternative. Note that the absolute probabilities might be affected, but not their ratio. If we recall our definition of odds, it means that the odds are kept constant.

---
---




# Formal definition of IIA in choice modelling

It is formally defined as:

For any choice set $X$, measured attributes $s$ and alternatives $a, b$ of $X$

$$P(a | s, \{a,b\})P(b | s, X) = P(b | s, \{a,b\})P(a | s, X)$$

The notation $P(a | s, \{a,b\})$ stands for: the choice probability of alternative $a$ with attributes $s$ when the choice set is $\{a,b\}$.

When the choice probabilities are nonzero, we can write it a bit more informally as:

$$ \frac{P( a | \{a,b\})}{P(b | \{a,b\})} = \frac{P(a | X)}{P(b | X)}$$

And here it is clear why it is called the **'constant odds'**, the odds of any two alternatives do not change when we modify the choice set.
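
As a quick numerical check of the definition, the sketch below uses made-up choice probabilities for the coffee shop example. The numbers and the helper `restrict_choice_set` are assumptions for illustration, not data from the lecture. The pair probabilities are built by renormalising the full-set probabilities, which is what IIA requires, and then both forms of the axiom are verified.

```python
import math

# Hypothetical choice probabilities over the full choice set X
full_set = {"latte": 0.5, "espresso": 0.3, "long_black": 0.2}

def restrict_choice_set(probabilities, subset):
    """Renormalise probabilities to a smaller choice set, as IIA requires."""
    total = sum(probabilities[alt] for alt in subset)
    return {alt: probabilities[alt] / total for alt in subset}

pair = restrict_choice_set(full_set, ["latte", "espresso"])

# Product form: P(a | {a,b}) P(b | X) = P(b | {a,b}) P(a | X)
lhs = pair["latte"] * full_set["espresso"]
rhs = pair["espresso"] * full_set["latte"]

# Ratio form: the odds of latte against espresso in both choice sets
odds_pair = pair["latte"] / pair["espresso"]
odds_full = full_set["latte"] / full_set["espresso"]

print(math.isclose(lhs, rhs), math.isclose(odds_pair, odds_full))
```

The absolute probability of latte changes between the two choice sets, but its odds against espresso do not.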

---
---



# Deriving the multinomial logit from IIA

From the IIA and the properties of probability we can derive the multinomial logit. We will use the following ingredients:

 * Probabilities are positive (even if very small)
 * Probabilities have to sum to one
 * IIA 
 * Log odds are linear in the variables (logistic regression)

0. **Initial statement: IIA**

$$ \frac{P( a | \{a,b\})}{P(b | \{a,b\})} = \frac{P(a | X)}{P(b | X)}$$

1. From **assuming positive probabilities**, we can rearrange:
 $$P(b | X) = \frac{P( b | \{a,b\})}{P(a | \{a,b\})} P(a | X)$$

2. **From the axiom that the probabilities over the full choice set have to sum to 1.** We apply the same formula to every alternative in the choice set, and the sum has to be 1.

  *  We also assume that the probability of choosing one of two identical alternatives is 0.5, $P( a | \{a,a\}) = 0.5$, so the term for $b = a$ in the sum equals 1.

$$ 1 = \sum_{b \in X}P(b | X) = \sum_{b \in X} \left(\frac{P( b | \{a,b\})}{P(a | \{a,b\})} \right)P(a | X)$$


3. **'Solve for** $P(a | X) $'. We can now see that the probability in the full choice set can be written in terms of probability ratios, or odds, among pairs of alternatives:

$$ P(a | X) = \frac{1}{\sum_{b \in X} \left(\frac{P( b | \{a,b\})}{P(a | \{a,b\})} \right)}$$

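4. **We want to rewrite everything in terms of the odds against a fixed reference alternative** $c$. Multiply both the numerator and the denominator of expression (3) by the binary odds $\frac{P( a | \{a,c\})}{P( c | \{a,c\}) }$. By IIA applied to the choice set $\{a,b,c\}$, each term of the sum simplifies:

$$ \frac{P( b | \{a,b\})}{P(a | \{a,b\})} \cdot \frac{P( a | \{a,c\})}{P( c | \{a,c\}) } = \frac{P( b | \{b,c\})}{P(c | \{b,c\})}$$

so that: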


$$ P(a | X) = \frac{ \frac{P( a | \{a,c\})}{P( c | \{a,c\}) }}{\sum_{b \in X} \left(\frac{P( b | \{b,c\})}{P(c | \{b,c\})} \right)}$$

5. If we consider that the log-odds are linear, or that pairs can be modelled by the binary logit, recall: 

$$ \log \left( \frac{P( b | \{b,c\})}{P(c | \{b,c\})}\right) = V_b$$

In this case $V_b$ is the notation for observed utility, a linear function of the variables $s$ measured for the binary choice problem involving $b$ and $c$. Just as we saw in the last lecture, for example: $\alpha \text{Price} + \beta \text{Age}$

So if we now assume that the pairwise choices can be modelled by the logit, each pairwise odds term becomes $e^{V_b}$ and we arrive at the expression of the multinomial logit:

$$ P(a | X) = \frac{ e^{V_a}}{\sum_{b \in X} e^{V_b}}$$
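
The derivation can be checked numerically. The sketch below assumes some hypothetical observed utilities (with the reference alternative `c` at utility 0) and that every pairwise choice follows the binary logit. It then compares the probability obtained from the pairwise odds in step 3 with the final multinomial logit expression. The function names are illustrative only.

```python
import math

# Hypothetical observed utilities; "c" plays the role of the reference alternative
V = {"a": 0.8, "b": -0.3, "c": 0.0}

def binary_prob(first, second):
    """P(first | {first, second}) under the binary logit."""
    return math.exp(V[first]) / (math.exp(V[first]) + math.exp(V[second]))

def prob_from_pairwise_odds(a):
    """Step 3: one over the sum of the pairwise odds of every b against a."""
    return 1.0 / sum(binary_prob(b, a) / binary_prob(a, b) for b in V)

def prob_mnl(a):
    """Final expression: exp(V_a) over the sum of exp(V_b)."""
    return math.exp(V[a]) / sum(math.exp(v) for v in V.values())

for alt in V:
    print(alt, prob_from_pairwise_odds(alt), prob_mnl(alt))
```

Both columns agree for every alternative, and each column sums to 1.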

---
---

# Connection to binary choice, the logit

When $X = \{a,b\}$, we can choose the reference alternative $c$ to be $b$ and show

$$ P(a | X) = \frac{ e^{V_a}}{ e^{V_a} + e^{V_b}}$$

By the definition of $V_b$ when $b$ is the reference alternative:

$$ \log \left( \frac{P( b | \{b,b\})}{P(b | \{b,b\})}\right) = V_b = \log(1) = 0$$

so $e^{V_b} = 1$ and we get

$$ P(a | X) = \frac{ e^{V_a}}{ e^{V_a} + 1}$$

which is the definition of the logistic $S(x) = \frac{e^x}{1 + e^x}$ from lecture 2.
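
A small check of this reduction, assuming a hypothetical utility for $a$ and a reference utility of 0 for $b$ (so that its exponential is 1):

```python
import math

def logistic(x):
    """The logistic function S(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def mnl_probability(v_chosen, all_utilities):
    """Multinomial logit probability of the alternative with utility v_chosen."""
    return math.exp(v_chosen) / sum(math.exp(v) for v in all_utilities)

v_a = 0.7  # hypothetical observed utility of alternative a
v_b = 0.0  # the reference alternative has utility 0

p_mnl = mnl_probability(v_a, [v_a, v_b])
p_logistic = logistic(v_a)
print(p_mnl, p_logistic)
```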




# Softmax view

We saw that the logit squashes a linear model that can produce arbitrary numbers into the range [0,1] so the output can be interpreted as probabilities.
There is an alternative view of the multinomial logit in similar spirit:
we start from a linear model and we want to transform or squash its output into something that can be interpreted as probabilities.

What the multinomial is doing is similar to squashing: **we want to force all individual predictions to be between 0 and 1. But now, we also want them to sum to 1.** In binary choice, this is not a problem, because we can model one of the two alternatives and set the probability of the other to (1 - modeled probability). When we have more than 2 alternatives, we need to think a bit more.

A natural way of getting there is through the **softmax** function.
The softmax takes as input a vector of arbitrary numbers and produces as output a vector of the same dimension, but all outputs are between 0 and 1 and sum to 1. In mathematical notation, supposing we have a vector of size $J$:

$$ \text{softmax}: \mathbb{R}^J \to [0,1]^J $$

The definition of softmax can be written separately for each element of the output vector.
For the element $i$ of the input vector $x$:

$$ \text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{J}e^{x_j}}$$

Literally, we calculate the exponential of all input numbers, then the softmax for each element of the input is its exponential divided by the sum of all exponentials.

The function is called softmax because it tends to produce an output close to 1 for the largest element of the input, like the indicator function of which element of the input vector is the maximum. So it is like computing the indicator function 'which element is the maximum' but more smoothly or 'fuzzily', with values in between 0 and 1.

The softmax is used in Machine Learning for classification, so the two concepts, multinomial logit and softmax classifiers, are deeply connected.
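
A minimal sketch of the softmax in plain Python follows. Subtracting the maximum before exponentiating is a common numerical precaution that is not part of the definition above; it does not change the result because the common factor cancels in the ratio. The input vectors are arbitrary examples.

```python
import math

def softmax(x):
    """Map a list of arbitrary numbers to positive numbers that sum to 1."""
    shift = max(x)  # avoids overflow in exp; cancels out in the ratio
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])    # moderate differences: a 'fuzzy' maximum
sharp = softmax([20.0, 10.0, 1.0])  # large differences: close to the indicator
print(probs, sum(probs))
print(sharp)
```

The second call illustrates the 'soft maximum' behaviour: as the gap between the inputs grows, almost all the probability goes to the largest element.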

---
---





# Maximum likelihood

We want to estimate the coefficients in our multinomial logit. We now have a vector of coefficients for each alternative.
We can adapt the method of maximum likelihood we saw in binary choice to the situation when we have more than two options.

Remember that for maximum likelihood we need to know the underlying probability distribution that generates the randomness. In the case of binary choice, it is clear that the Bernoulli is a good solution. When we move to multiple choices, the natural extension is the Multinomial distribution (with the number of trials equal to 1).

For a given observation, the likelihood of the model parameterized by the coefficients $\beta$ (where $\beta$ denotes the set of all coefficients for all alternatives) is:

$$ L( \beta  ) = \prod_{j \in J} ( \text{prob. produced by } \beta \text{ for the alternative } j)^{\text{indicator of alternative } j \text{ was observed in the data}}$$

For the whole dataset, we take the product of the likelihoods of all individual observations. Again, the logarithm transform is usually applied for practical reasons.

And that is why it is called the multinomial logit.

This time, maximum likelihood on the multinomial is connected to the cross-entropy loss in Machine Learning, they are two alternative points of view.
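
To make the likelihood concrete, here is a minimal sketch under strong simplifying assumptions: a single hypothetical coefficient `beta` on one attribute (price), shared by all alternatives, and a tiny invented dataset in which every individual chose the cheapest alternative. Because the indicator exponent is 1 only for the observed alternative, the log-likelihood of one observation is just the log of the predicted probability of the chosen alternative.

```python
import math

def mnl_probs(utilities):
    """Multinomial logit probabilities for a list of observed utilities."""
    exps = [math.exp(v) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(beta, prices, chosen):
    """Sum over observations of log P(chosen alternative), with V_j = beta * price_j."""
    total = 0.0
    for row, choice in zip(prices, chosen):
        probs = mnl_probs([beta * price for price in row])
        total += math.log(probs[choice])
    return total

# Invented data: prices of 3 alternatives for 3 individuals, and the index each one chose
prices = [[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [3.0, 2.0, 1.0]]
chosen = [0, 1, 2]

ll_null = log_likelihood(0.0, prices, chosen)    # all alternatives equally likely
ll_price = log_likelihood(-1.0, prices, chosen)  # higher price lowers utility
cross_entropy = -ll_price / len(chosen)          # average negative log-likelihood
print(ll_null, ll_price, cross_entropy)
```

A negative price coefficient fits these data better than `beta = 0`, so its log-likelihood is higher. Maximum likelihood estimation searches for the `beta` that makes this value as large as possible, which is the same as minimising the cross-entropy.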

---
---





# An example of likelihood where the distribution is not multinomial

We have mentioned that both the Bernoulli and Multinomial distributions are very natural or fundamental candidates for underlying distributions when calculating likelihoods. To give some perspective, we will see an example where the choice is not that obvious.


Imagine that we want to model the choice of the number of episodes of a particular TV show that individuals are going to watch. We can think of a set of alternatives 0, 1, 2, 3... and consider the multinomial, but there might be a better distribution that describes the data in this case. In the multinomial, we are ignoring the fact that numbers have some similarities (e.g. 2 is closer to 3 than to 0). Another distribution could be applied for the likelihood, for example, the Poisson, which is commonly used for counts.

In practice, it is not clear beforehand which one is better without trying both ideas, the Multinomial and the Poisson. There are cases where the multinomial is in fact recommended for count data. For example, Amazon has research on using the multinomial for the likelihood when predicting the sales of products, because it is difficult to find a more natural distribution.

---
---




# Interpretation as Random Utility with a Gumbel distribution on the random component

As we have shown with the logit, the multinomial logit:


$$ P(a | X) = \frac{ e^{V_a}}{\sum_{b \in X} e^{V_b}}$$
can be interpreted under the random utility model as a decomposition:

$$U_{nj} = V_{nj} + \varepsilon_{nj}$$

where the $\varepsilon_{nj}$ are independent and identically distributed Gumbel random variables.

McFadden showed the opposite direction, that under some mild restrictions, 
the Gumbel induces the multinomial logit.
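
The link between Gumbel noise and the multinomial logit can be illustrated by simulation. The sketch below assumes hypothetical observed utilities, standard Gumbel errors drawn by inverse-CDF sampling, and a fixed seed. Each simulated individual picks the alternative with the highest total utility, and the resulting choice shares are compared with the multinomial logit formula. The function names are illustrative only.

```python
import math
import random

def gumbel(rng):
    """Draw from the standard Gumbel distribution by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_shares(V, n_draws, seed=0):
    """Share of draws in which each alternative has the highest utility V_j + eps_j."""
    rng = random.Random(seed)
    counts = [0] * len(V)
    for _ in range(n_draws):
        utilities = [v + gumbel(rng) for v in V]
        counts[utilities.index(max(utilities))] += 1
    return [c / n_draws for c in counts]

def mnl_probs(V):
    """Multinomial logit probabilities for a list of observed utilities."""
    exps = [math.exp(v) for v in V]
    total = sum(exps)
    return [e / total for e in exps]

V = [math.log(2), 0.0, -0.5]  # hypothetical observed utilities
simulated = simulate_shares(V, 100_000)
exact = mnl_probs(V)
print(simulated)
print(exact)
```

With this many draws the simulated shares should agree with the formula to roughly two decimal places; the remaining differences are simulation noise.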

---
---





# Limitations of the multinomial logit

The Independence of Irrelevant Alternatives, that is, the constant odds property, creates an interesting paradox. There is a famous illustration of this, the **Red bus / Blue bus** problem.

Imagine a choice situation for commuting to work. The initial choice set consists of car and bus. Very importantly, the buses are red.

Let's say that the probabilities are approximately 0.66 for car and 0.33 for bus.

We now introduce a new alternative, a blue bus. The variables of the blue bus are the same as those of the red bus, therefore the predicted utility in the model is the same as for the red bus. Under I.I.A. we expect the ratio of probabilities between car and red bus to stay constant when we introduce the blue bus.

Imagine that the utility for car is $\log(2)$ and for bus is $\log(1)$. We use logarithms so that when we exponentiate them in the multinomial logit,
$\frac{ e^{V_a}}{\sum_{b \in X} e^{V_b}}$, they cancel and we get some nice numbers.


$$P(Car | \{Car, RedBus\}) = \frac{2}{2 + 1} \approx 0.66$$
$$P(RedBus  | \{Car, RedBus\}) = \frac{1}{2 + 1} \approx 0.33$$

When we introduce the blue bus in the choice set, because the properties of the blue bus are the same as those of the red bus (the color does not affect the choice), the utility for the blue bus is the same as for the red bus. We now compute the choice probabilities and get:

$$P(Car |  \{Car, RedBus, BlueBus\}) = \frac{2}{2 + 1 + 1} = 0.5$$
$$P(RedBus |  \{Car, RedBus, BlueBus\}) = \frac{1}{2 + 1 + 1} = 0.25$$

We see that the I.I.A. holds: in both choice sets, the odds of car to red bus have stayed constant.
$$\frac{0.66}{0.33} = 2 = \frac{0.5}{0.25}$$

However, this result can seem a bit strange or counterintuitive. Most people would say that introducing a bus of another color should not change the probability of choosing car! We can imagine even more exaggerated examples, with a new alternative that does not differ from the original in any relevant way, such as buses that have been manufactured only on Tuesdays in the production chain. Why should this new alternative draw probability away from car? 

What is commonly said is that these similar alternatives should 'lump together' and share the original choice probability of the bus: Car = 0.66, RedBus = 0.165, BlueBus = 0.165.

Attempts to solve this paradox give rise to alternative models, such as the **nested logit**, which we will study in later lectures.
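
The numbers of the Red bus / Blue bus example can be reproduced with a few lines of Python. This is only a sketch of the calculation above, using the same utilities $\log(2)$ and $\log(1)$; the helper name `mnl_probabilities` is illustrative.

```python
import math

def mnl_probabilities(utilities):
    """Multinomial logit probabilities for a dict {alternative: observed utility}."""
    exp_utilities = {alt: math.exp(v) for alt, v in utilities.items()}
    total = sum(exp_utilities.values())
    return {alt: e / total for alt, e in exp_utilities.items()}

before = mnl_probabilities({"Car": math.log(2), "RedBus": math.log(1)})
after = mnl_probabilities(
    {"Car": math.log(2), "RedBus": math.log(1), "BlueBus": math.log(1)}
)

odds_before = before["Car"] / before["RedBus"]
odds_after = after["Car"] / after["RedBus"]

print(before)  # car about 2/3, red bus about 1/3
print(after)   # car 0.5, each bus 0.25
print(odds_before, odds_after)
```

The odds of car against red bus are 2 in both choice sets, while the probability of car itself drops from about 2/3 to 1/2, which is the counterintuitive part.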

---
---


# Counterpoint to the paradox

Perhaps one would wonder why the multinomial logit is so popular if it exhibits that strange property.

In fact, the counterargument to the problem is often that a 'properly specified model' should exhibit I.I.A. This means that we should capture the sources of similarity among alternatives. There are several ways:

 * We could directly not consider the blue bus as a new alternative
 * Are the blue buses and red buses really identical? For example, maybe they do not have the same schedule (buses cannot occupy the same physical space at the same time!), so if we can capture these differences in our model, we will be able to predict the new situation well.
 * We can consider that the attributes of the alternatives change when we introduce the new one. For example, we can specify the variable: 'number of buses of other colors' that would change when introducing the blue bus. This could keep the choice probabilities as intuitively expected.

Other arguments are:
 * We actually do not know how the decision makers will react. Maybe some car drivers were waiting for the opportunity to ride a blue bus...
 * Other models that do not have this problem actually require knowing the similarities of the alternatives beforehand.

**The bottom line: The multinomial logit is flexible enough, alternative models are 'tools in our toolbox' that can make the modelling easier in some situations.**

---
---




# Validation

With more than two alternatives, we can extend the accuracy metric and look at the confusion matrix, which can give valuable information.

| Predicted \ Actual  | Air  | Train   | Bus  | Total |
|---|---|---|---|--- |
| Air  | **7**  |8   |  2 |  17|
| Train  | 2  | **3**  |  1 |  6|
|  Bus |  3 |  2 | **1** |  6|
|  Total |  12|  13|  4 |  29|
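
The summary numbers behind this table can be computed directly. The sketch below copies the counts from the table (rows are predicted alternatives, columns are actual alternatives) and computes the overall accuracy, plus the per-alternative shares that are usually called recall and precision in Machine Learning; those two names are not used elsewhere in this lecture.

```python
labels = ["Air", "Train", "Bus"]

# Rows are predicted alternatives, columns are actual alternatives (as in the table)
confusion = [
    [7, 8, 2],
    [2, 3, 1],
    [3, 2, 1],
]

n_total = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = n_correct / n_total  # share of observations on the diagonal

# Recall: of the individuals who actually chose j, the share predicted as j (column-wise)
recall = {
    labels[j]: confusion[j][j] / sum(row[j] for row in confusion)
    for j in range(len(labels))
}

# Precision: of the individuals predicted as i, the share who actually chose i (row-wise)
precision = {
    labels[i]: confusion[i][i] / sum(confusion[i])
    for i in range(len(labels))
}

print(accuracy)
print(recall)
print(precision)
```

Only 11 of the 29 observations lie on the diagonal, and the off-diagonal cells show where the errors concentrate: for instance, 8 of the 13 actual Train choices were predicted as Air.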


