# Multinomial choice model as a NN:

Given a choice set $C=\{1,2,\dots,J\}$ with $J$ alternatives, we consider a multinomial choice model in which $X=\{ x_{1}, x_{2}, \dots, x_{K}\}$ are the explanatory variables representing the observed attributes of the choice alternatives and the individual's socio-demographic characteristics.

The utility that individual $n$ associates with alternative $i$ is formally given by:
<br/> <br> 

$U_{i,n}= V_{i,n} + \epsilon_{i,n},$    $\forall i \in C$   $(1)$, where $V_{i,n}$ is the systematic (observed) part of the utility and the error terms $\epsilon_{i,n}$ are independently and identically distributed Type I Extreme Value.
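
To see what $(1)$ implies in practice, the following sketch simulates it directly. The utilities in `V`, the number of draws and the helper names are illustrative assumptions, not values from the study. It draws i.i.d. Type I Extreme Value errors, picks the alternative with the highest utility, and compares the simulated choice shares with the closed-form logit probabilities $e^{V_i}/\sum_j e^{V_j}$ that this error distribution is known to produce.

```python
import math
import random


def simulate_choice_shares(V, draws=100_000, seed=0):
    """Simulate U_i = V_i + eps_i with i.i.d. standard Gumbel eps; return choice shares."""
    rng = random.Random(seed)
    counts = [0] * len(V)
    for _ in range(draws):
        utilities = []
        for v in V:
            # Inverse-CDF draw from the standard Type I Extreme Value distribution.
            u = max(rng.random(), 1e-300)  # guard against log(0)
            utilities.append(v - math.log(-math.log(u)))
        counts[utilities.index(max(utilities))] += 1
    return [c / draws for c in counts]


def logit_probabilities(V):
    """Closed-form logit probabilities exp(V_i) / sum_j exp(V_j)."""
    exp_v = [math.exp(v) for v in V]
    total = sum(exp_v)
    return [e / total for e in exp_v]


V = [1.0, 0.5, -0.2]  # illustrative systematic utilities (assumed)
shares = simulate_choice_shares(V)
probs = logit_probabilities(V)
print("simulated shares:   ", [round(s, 3) for s in shares])
print("logit probabilities:", [round(p, 3) for p in probs])
```

With enough draws the two lists should agree to roughly two decimal places.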


<br/> <br> 
Assuming that the systematic part of the utility is linear in parameters and that a single vector of coefficients applies to all the utility functions, $V_{i,n}$ can be described by the following equation:
<br/> <br> 

$V_{i,n}=BX_{i,n},$   $\forall i \in C$  $(2)$, where $B=\{ \beta_{1}, \beta_{2}, \dots, \beta_{K}\}$ are the preference parameters associated with the explanatory variables and $X_{i,n}$ contains the values of those variables for alternative $i$ and individual $n$.
<br/> <br> 


The current study adopts the implementation of an MNL as a NN using a simple 2D-CNN architecture, as suggested by [Sifringer et al. (2020)](https://www.researchgate.net/publication/344428513_Enhancing_discrete_choice_models_with_representation_learning). Although CNNs are traditionally used to analyse image and signal data with complex architectures that typically include non-linear activation functions, multiple channels and several convolutional layers, their weight-sharing architecture conveniently allows us to use them in a simplified form to retrieve the MNL formulation as defined in $(2)$.


Given a 2-dimensional input space $X$ of shape $(n_h, n_w)$ and a convolutional filter (kernel) $k$ consisting of an array of trainable weights of shape $(k_h, k_w)$, we consider a CNN with a single convolutional layer $L$. The CNN maps $X$ to an output space $v$ by sliding $k$ across $X$ and taking the dot product between $k$ and each region of $X$ (plus a bias term $\alpha$), which yields a single scalar value each time. Thus, in the absence of padding, the shape $(v_h, v_w)$ of the output space $v$ is determined by the shapes of $X$ and $k$ according to the following formula:
<br/> <br> 
<font size="3">

$(v_h, v_w)= \left(\left\lfloor\frac{n_{h}-k_{h}+s_{h}}{s_h}\right\rfloor, \left\lfloor\frac{n_{w}-k_{w}+s_{w}}{s_{w}}\right\rfloor\right)$</font>, where $s_h$ and $s_w$ are the number of rows and columns of $X$ traversed per slide of $k$, also known as the stride $s$ of shape $(s_h,s_w)$.
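
The output-shape formula can be checked with a few lines of Python. This is only a sketch: the function name and the example sizes are assumptions chosen for illustration, and it covers the unpadded case described above.

```python
def conv_output_shape(input_shape, kernel_shape, stride):
    """Output (rows, cols) of an unpadded convolution: floor((n - k + s) / s) per axis."""
    return tuple((n - k + s) // s
                 for n, k, s in zip(input_shape, kernel_shape, stride))


# A kernel and stride that span a full row of a (3, 10) input leave one value per row.
print(conv_output_shape((3, 10), (1, 10), (1, 10)))  # (3, 1)

# A typical image-style setting: 3x3 kernel with unit stride on a 28x28 input.
print(conv_output_shape((28, 28), (3, 3), (1, 1)))   # (26, 26)
```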


The value of a neuron $v_{i} \in v$ stored in layer $L+1$ of the CNN, which follows the convolution, is given by the following equation:

<font size="3">
    
$v_{i}^{(L+1)}=g\left( x_{i}^{(L)} k^{(L)}+\alpha_{i}^{(L)}\right)$ $(3)$</font>,

where $g$ is an activation function (usually non-linear), $x_{i}^{(L)}$ is the region of the input $X$ to which the convolution is applied to produce $v_{i}^{(L+1)}$, and $\alpha_{i}^{(L)}$ is the corresponding bias term.

In order to retrieve the MNL formulation as defined in $(2)$, we exclude the bias term and set $g$ to be the identity function ($g(x) = x$). Additionally, we give the input space $X$ a shape of $(J,K)$, i.e. *(number of choice alternatives, number of exogenous variables)*, and give both the kernel $k$ and the stride $s$ a shape of $(1,K)$. As a result, the shape of the output space $v$ will be $(v_h, v_w)= (J, 1)$, while the value of $v_{i}$ according to $(3)$ will be:

$v_{i}^{(L+1)}= x_{i}^{(L)} k^{(L)}$ $(4)$, which is equivalent to the formulation of the utility functions $V_n = \{V_{1,n},\dots, V_{J,n}\}$ as defined in $(2)$, with the $K$ kernel weights playing the role of the preference parameters $B$. A graphical representation of the convolution process that is used to produce $V_n$ is presented in the figure below:


![title](conv_p.png)
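
The equivalence between $(4)$ and $(2)$ can also be verified numerically. The sketch below is a plain-Python convolution without padding, bias or activation; the attribute values, the coefficients in `beta` and the function name are hypothetical and serve only to illustrate the mechanics.

```python
def conv2d_valid(X, kernel, stride):
    """Slide `kernel` over `X` without padding; no bias, identity activation."""
    n_h, n_w = len(X), len(X[0])
    k_h, k_w = len(kernel), len(kernel[0])
    s_h, s_w = stride
    out = []
    for top in range(0, n_h - k_h + 1, s_h):
        row = []
        for left in range(0, n_w - k_w + 1, s_w):
            # Dot product between the kernel and the current region of X.
            row.append(sum(X[top + a][left + b] * kernel[a][b]
                           for a in range(k_h) for b in range(k_w)))
        out.append(row)
    return out


# Hypothetical individual facing J = 3 alternatives described by K = 2 attributes.
X_n = [[2.0, 30.0],
       [3.5, 20.0],
       [5.0, 10.0]]
beta = [-0.5, -0.1]   # illustrative preference parameters
kernel = [beta]       # kernel of shape (1, K)

v = conv2d_valid(X_n, kernel, stride=(1, len(beta)))            # shape (J, 1)
direct = [sum(b * x for b, x in zip(beta, row)) for row in X_n]  # V_{i,n} from (2)

print(v)
print(direct)
```

Each row of `v` holds the same number as the corresponding entry of `direct`, i.e. the convolution reproduces the linear utilities one alternative at a time while sharing a single set of weights.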

After the convolution takes place, the output $v$, which is stored in layer $L+1$ and represents the utilities $V_n$, is connected to the final activation layer consisting of $J$ neurons. This layer allows the CNN to generate a probability distribution over the $J$ choice alternatives using the softmax activation function $\sigma$, such that:

<font size="3">
$\left(P_n\right)_{i}= \left(\boldsymbol{\sigma}\left(\mathbf{v}_{n}\right)\right)_{i}=\frac{e^{v_{i,n}}}{\sum_{j=1}^{J} e^{v_{j,n}}}$
 </font>, 
 
which is equivalent to the probability that individual $n$ selects choice alternative $i$ within the MNL framework under the standard assumptions. As is usually the case when the output layer activation function of a NN is softmax, cross entropy is used as the loss function to optimize the model's parameters, i.e. the weights of $k$, during training through backpropagation. 
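
A small numerical sketch of the softmax layer and its loss, using made-up utilities and choices (all values and helper names are assumptions). It also shows that the cross entropy summed over observations is exactly the negative log-likelihood of the observed choices; Keras reports the mean over a batch by default, which differs only by the factor $1/N$.

```python
import math


def softmax(v):
    """Numerically stable softmax over one individual's utilities."""
    m = max(v)
    exp_v = [math.exp(x - m) for x in v]
    total = sum(exp_v)
    return [e / total for e in exp_v]


def cross_entropy(one_hot, probs):
    """Categorical cross entropy for a single observation."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


# Hypothetical utilities for two individuals and the alternatives they chose (0-based).
utilities = [[-4.0, -3.75, -3.5], [0.2, 1.1, -0.3]]
chosen = [2, 1]
J = 3

loss = 0.0
log_likelihood = 0.0
for v_n, i in zip(utilities, chosen):
    p_n = softmax(v_n)
    y_n = [1.0 if j == i else 0.0 for j in range(J)]
    loss += cross_entropy(y_n, p_n)
    log_likelihood += math.log(p_n[i])

print(loss, -log_likelihood)  # the two values coincide
```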
 
As noted by [Sifringer et al. (2020)](https://www.researchgate.net/publication/344428513_Enhancing_discrete_choice_models_with_representation_learning), minimizing the cross-entropy loss is equivalent to maximizing the log-likelihood function. This allows us to derive the Hessian matrix of the parameters for the CNN and to compute useful post-estimation indicators, such as the standard errors and confidence intervals of the parameter estimates. The architecture of the MNL implemented as a CNN is shown in the figure below.

![mnl](MNL_as_CNN.png) 
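
For an MNL with linear utilities, the Hessian of the negative log-likelihood has a well-known closed form, $H=\sum_n\sum_i P_{i,n}\,(x_{i,n}-\bar{x}_n)(x_{i,n}-\bar{x}_n)^\top$ with $\bar{x}_n=\sum_i P_{i,n}\,x_{i,n}$. The following NumPy sketch of the post-estimation step assumes synthetic data and a hypothetical `beta_hat` standing in for the trained kernel weights; it is not necessarily how the referenced repository computes these quantities.

```python
import numpy as np


def mnl_probabilities(X, beta):
    """X has shape (N, J, K); returns choice probabilities of shape (N, J)."""
    v = X @ beta
    v = v - v.max(axis=1, keepdims=True)  # stabilise the exponentials
    exp_v = np.exp(v)
    return exp_v / exp_v.sum(axis=1, keepdims=True)


def mnl_hessian(X, beta):
    """Analytic Hessian of the negative log-likelihood with respect to beta."""
    P = mnl_probabilities(X, beta)
    # Probability-weighted mean of the attributes for each individual.
    X_bar = np.einsum('nj,njk->nk', P, X)
    D = X - X_bar[:, None, :]
    return np.einsum('nj,njk,njl->kl', P, D, D)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3, 2))   # synthetic data: N = 500, J = 3, K = 2
beta_hat = np.array([-0.5, 1.0])   # stand-in for the trained kernel weights

H = mnl_hessian(X, beta_hat)
cov = np.linalg.inv(H)             # asymptotic covariance of the estimates
std_err = np.sqrt(np.diag(cov))
ci_95 = np.stack([beta_hat - 1.96 * std_err,
                  beta_hat + 1.96 * std_err], axis=1)

print("standard errors:", std_err)
print("95% confidence intervals:\n", ci_95)
```

The standard errors are only meaningful when the Hessian is evaluated at the parameter values that actually minimise the loss, so in practice `beta_hat` would be read from the fitted convolutional layer.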


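The Keras implementation of an MNL as a CNN, as suggested by [Sifringer et al. (2020)](https://www.researchgate.net/publication/344428513_Enhancing_discrete_choice_models_with_representation_learning) in https://github.com/BSifringer/EnhancedDCM, is given below. Note that the code stores the input transposed relative to the description above, with shape $(K, J, 1)$ *(variables, alternatives, channel)*, so the kernel has shape $(K, 1)$ and slides across the $J$ alternatives. Because the kernel already spans the whole variable axis, a stride of $(1,1)$ yields the same $J$ utilities.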

```python
from keras.models import Model
from keras.layers import Input, Conv2D, Reshape, Activation


def MNL(vars_num, choices_num, logits_activation='softmax'):
    """Build an MNL as a CNN with a single convolutional layer and no bias."""
    # Input of shape (K variables, J alternatives, 1 channel).
    main_input = Input((vars_num, choices_num, 1), name='Features')

    # One filter of shape (K, 1): its K weights are the shared coefficients.
    utilities = Conv2D(filters=1, kernel_size=[vars_num, 1], strides=(1, 1),
                       padding='valid', name='Utilities',
                       use_bias=False, trainable=True)(main_input)

    # Drop the singleton axes so that each individual has J utilities.
    utilitiesR = Reshape([choices_num], name='Flatten_Dim')(utilities)
    # Softmax turns the utilities into choice probabilities.
    logits = Activation(logits_activation, name='Choice')(utilitiesR)

    model = Model(inputs=main_input, outputs=logits, name='Choice')
    model.summary()  # summary() prints the table itself and returns None

    return model


MNL(10, 3)
```

```text
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Features (InputLayer)        (None, 10, 3, 1)          0         
_________________________________________________________________
Utilities (Conv2D)           (None, 1, 3, 1)           10        
_________________________________________________________________
Flatten_Dim (Reshape)        (None, 3)                 0         
_________________________________________________________________
Choice (Activation)          (None, 3)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________

<keras.engine.training.Model at 0x2495447f788>
```
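
Before such a model can be trained, the data have to be arranged in the layout expected by the `Features` input. The sketch below uses NumPy only; the helper names `to_mnl_input` and `one_hot` and the synthetic data are assumptions made for illustration. It starts from an array of shape $(N, J, K)$ and produces the $(N, K, J, 1)$ input together with one-hot encoded choices.

```python
import numpy as np


def to_mnl_input(X):
    """Reorder (N, J, K) attribute data to the (N, K, J, 1) layout of `Features`."""
    return np.transpose(X, (0, 2, 1))[..., np.newaxis]


def one_hot(chosen, choices_num):
    """One-hot encode 0-based chosen alternatives for a softmax output."""
    return np.eye(choices_num)[chosen]


# Synthetic stand-in data: N = 5 individuals, J = 3 alternatives, K = 10 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3, 10))
chosen = np.array([0, 2, 1, 1, 0])

features = to_mnl_input(X)     # shape (5, 10, 3, 1)
targets = one_hot(chosen, 3)   # shape (5, 3)
print(features.shape, targets.shape)
```

Arrays shaped like this could then be passed to the model after compiling it with a categorical cross-entropy loss, e.g. `model.compile(optimizer='adam', loss='categorical_crossentropy')` followed by `model.fit(features, targets)`. The estimated coefficients would be the weights of the `Utilities` layer, available through `model.get_layer('Utilities').get_weights()`.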