<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/WK_04b_tuto_preds_mnl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 4b: Predictions and what-if scenarios

**Note:** This tutorial continues from the previous one, so it reuses its first part. The new material starts at the section 'Predictions of the model'. There is one change with respect to the version for tutorial 3: we remove the individuals with `GA==1` from the database in the preprocessing step. This simplifies the notebook a bit.

"*Biogeme is a open source Python package designed for the maximum likelihood estimation of parametric models in general, with a special emphasis on discrete choice models.*"

# Preparing the environment

The Google Colab environment does not have biogeme installed by default,
so we need to install it in every session. Hopefully, it will take less than one minute. Once installed, it remains available until the session expires (or we reset it).

In [2]:
!pip install biogeme

Collecting biogeme
  Downloading biogeme-3.2.8.tar.gz (1.5 MB)
Collecting unidecode
  Downloading Unidecode-1.3.0-py2.py3-none-any.whl (235 kB)
Building wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... done
  Created wheel for biogeme: filename=biogeme-3.2.8-cp37-cp37m-linux_x86_64.whl size=4030747 sha256=c7cee3b4d6940c0284f42788ced085613e81165dd3f046dedc4d013b94a8e12b
  Stored in directory: /root/.cache/pip/wheels/d4/52/61/de6c73d2bc17603c60e754e260bccb7d4da2503e97015ebd49
Successfully built biogeme
Installing collected packages: unidecode, biogeme
Successfully installed biogeme-3.2.8 unidecode-1.3.0


First we load the usual packages: biogeme and the common ones for Python data analysis.

In [3]:
import pandas  as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp


# The dataset

We will use the example from the biogeme package, Swissmetro dataset:

"*This dataset consists of survey data collected on the trains between St. Gallen and Geneva, Switzerland, during March 1998. The respondents provided information in order to analyze the impact of
the modal innovation in transportation (a new mode of transport), represented by the Swissmetro, a revolutionary mag-lev underground system, against the usual transport modes represented by car and train.*"

# Loading the dataset

Biogeme can interact with the popular pandas package, so we can load the dataset in pandas first and then pass it to biogeme.

This specific dataset uses the format 'tab separated values' instead of the more common 'comma separated values'. We can specify this non-standard separator by the argument `sep` in `pandas.read_csv`.



In [4]:
swissmetro = pd.read_csv('http://transp-or.epfl.ch/data/swissmetro.dat', sep='\t')
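
As a small self-contained illustration of the `sep` argument, the sketch below parses a made-up tab-separated string. The values are invented for the example and are not taken from the Swissmetro file; `raw` and `toy` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a .dat file
raw = "ID\tCHOICE\tTRAIN_TT\n1\t2\t112\n2\t3\t103\n"

# sep='\t' tells pandas that columns are separated by tabs, not commas
toy = pd.read_csv(io.StringIO(raw), sep='\t')
print(toy.shape)
```

Without `sep='\t'`, pandas would look for commas and read each line as a single column.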

We will take a look at it, using the `head()` method to display the first few rows.

There is a detailed description of the dataset 
variables [here](http://transp-or.epfl.ch/documents/technicalReports/CS_SwissmetroDescription.pdf). **You will need to take a look at them later
to understand the data and create your own models.**

In [5]:
swissmetro.head(5)

Unnamed: 0,GROUP,SURVEY,SP,ID,PURPOSE,FIRST,TICKET,WHO,LUGGAGE,AGE,MALE,INCOME,GA,ORIGIN,DEST,TRAIN_AV,CAR_AV,SM_AV,TRAIN_TT,TRAIN_CO,TRAIN_HE,SM_TT,SM_CO,SM_HE,SM_SEATS,CAR_TT,CAR_CO,CHOICE
0,2,0,1,1,1,0,1,1,0,3,0,2,0,2,1,1,1,1,112,48,120,63,52,20,0,117,65,2
1,2,0,1,1,1,0,1,1,0,3,0,2,0,2,1,1,1,1,103,48,30,60,49,10,0,117,84,2
2,2,0,1,1,1,0,1,1,0,3,0,2,0,2,1,1,1,1,130,48,60,67,58,30,0,117,52,2
3,2,0,1,1,1,0,1,1,0,3,0,2,0,2,1,1,1,1,103,40,30,63,52,20,0,72,52,2
4,2,0,1,1,1,0,1,1,0,3,0,2,0,2,1,1,1,1,130,36,60,63,42,20,0,90,84,2


We see some socioeconomic characteristics such as 'AGE', likely encoded by age group, and 'MALE', which we can assume refers to gender. There are also some attributes of the alternatives, such as 'CAR_TT', which would be the travel time by car, and 'TRAIN_CO', which refers to the cost of the train fare.

An important variable is 'CHOICE': it is the result of the choice for each individual, coded as 1 for train, 2 for Swissmetro and 3 for car. The value 0 indicates an invalid response.

# Passing the dataset to biogeme

So far, we have loaded the dataset into a pandas dataframe. We need to transform it into a Biogeme database, the format that biogeme understands.
We pass as first argument the name that we want to give to the database and as second argument the pandas dataframe.

In [6]:
bgm_swissmetro = db.Database('swissmetro', swissmetro)

We can access the dictionary of variables in the biogeme database the following way:

In [7]:
bgm_swissmetro.variables['CHOICE']

CHOICE

Because this is too verbose, we can load them into the global variables
of the Python environment. This makes the symbolic manipulation more concise: we can refer to them just by writing `VARIABLE_NAME` instead of `database.variables['VARIABLE_NAME']`.

In [8]:
globals().update(bgm_swissmetro.variables)
CHOICE

CHOICE

Before we begin, we need to clean the dataset a bit. In this case,
some people did not respond to the survey, and the value assigned
to their choice is 0. The only valid values for the choice are 1, 2 and 3, indicating
the alternatives train, Swissmetro and car.
We can remove them from the biogeme database using the `remove` method with a logical indicator for the rows that have choice 0. In the same step we also remove the individuals with `GA == 1`.

**It is recommended that all database manipulations/cleaning be applied directly on the pandas dataframe before passing it to biogeme.** The reason is that pandas is better designed for that purpose and makes the code more readable. Ideally, we would like our interactions with biogeme to be minimal, doing as much as possible with standard frameworks such as pandas.  

In [9]:
bgm_swissmetro.remove( (CHOICE == 0) )
bgm_swissmetro.remove( (GA == 1) )
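
For comparison, the same cleaning could be done at the pandas level before creating the biogeme database. This is a minimal sketch on a hypothetical five-row data frame (`toy` and `toy_clean` are illustrative names, and the values are invented):

```python
import pandas as pd

# Hypothetical mini data frame with the two columns used for filtering
toy = pd.DataFrame({
    'CHOICE': [0, 1, 2, 3, 2],
    'GA':     [0, 0, 1, 0, 0],
})

# Keep rows with a valid choice and without GA == 1
# (the same rows that the two `remove` calls above would keep)
toy_clean = toy[(toy['CHOICE'] != 0) & (toy['GA'] != 1)].reset_index(drop=True)
print(toy_clean)
```

The cleaned data frame would then be passed to `db.Database` in the same way as in the earlier cell.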

# Creating the model

What we usually need to define in the multinomial logit can be summarized as:
 * Which variables in the database we are going to include in the linear model.
 * Which variable in the database specifies the choice made, that is, the alternative selected by an individual. This is the 'target variable' or dependent variable.
 * Which variables are used in the modelling of each alternative. Remember that we can define a utility function for each alternative, so we can have alternatives that depend on some variables only, and we can also have variables whose coefficient is estimated jointly for all alternatives.

We can connect this back to the utility theory view: we want to specify the functions that produce the observed component of the utility, the $V_{nj}$.

For each alternative $j$ and observation $n$, we consider the vector $x_{nj}$
to be the joint vector of both attributes and characteristics (to simplify things). We try to find the vector of coefficients $\beta_j$ for each alternative. In other words, we try to find the linear relationship between the variables and the utility of each alternative:
  $$V_{nj} = \beta_j x_{nj}$$ 

* Consider that some attributes or characteristics are not relevant for some alternatives: this would be equivalent to fixing some of the values in $\beta_j$ to 0 and not fitting them to data.
* Consider that some attributes or characteristics 'share' the value of the coefficient across alternatives.
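
To make the linear utility above concrete, here is a minimal numeric sketch for a single alternative and a single observation. The coefficient and variable values are invented for illustration; they are not estimates from the model in this tutorial.

```python
import numpy as np

# Hypothetical coefficients: [ASC, beta_time, beta_cost] for one alternative
beta_j = np.array([-1.0, -0.01, -0.007])

# Hypothetical observation: constant 1 for the ASC, travel time, cost
x_nj = np.array([1.0, 112.0, 48.0])

# Observed utility as the inner product of coefficients and variables
V_nj = beta_j @ x_nj
print(V_nj)
```

Fixing an entry of `beta_j` to 0 drops the corresponding variable from that alternative, which is the first bullet point above.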



# The alternative specific constants:
Just as linear models have an intercept, choice models have alternative specific constants (ASCs). An important difference is that we cannot determine all of their values, because, as we saw in the previous tutorial, the absolute level of the utilities cannot be recovered. In practice, we fix the alternative specific constant of one of the alternatives to 0 and do not fit it to the data. This sets a reference point.
Again, which alternative we use as the reference, and the value at which its ASC is fixed, are arbitrary.
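
The reason one ASC has to be fixed can be checked numerically: adding the same constant to every utility leaves the logit probabilities unchanged. A minimal sketch with invented utilities (`logit_probs` is a helper written for this example, not a biogeme function):

```python
import numpy as np

def logit_probs(v):
    """Multinomial logit probabilities for one vector of utilities."""
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical utilities for train, Swissmetro and car
v = np.array([-2.8, -1.2, -1.9])

p_original = logit_probs(v)
p_shifted = logit_probs(v + 5.0)  # same constant added to every alternative

print(np.allclose(p_original, p_shifted))
```

Since only differences in utility matter, the data cannot tell apart two sets of ASCs that differ by a common shift, so one of them is pinned down by convention.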


# Definition of the model in biogeme
We define the parameters of the model through the function `exp.Beta`.
The function `exp.Beta` takes 5 arguments:
1. the name of the parameter.
2. the default value. We can use 0 for the default values unless we know a better starting value, for example when we have prior information.
3. The lower bound, if we want to restrict it to a range, `None` if we do not want to restrict the value of the parameter. For example, sometimes we might know, or would like to force, a parameter to be positive.
4. The upper bound, if we want to restrict it to a range, `None` if we do not want to restrict the value of the parameter.
5. A 0/1 argument, 0 if the parameter must be estimated and 1 if it remains fixed.

We will define a simple model with just the three ASCs and two parameters for two variables of interest, time and cost.
Note that one of the ASCs, `ASC_SM`, is set to not be estimated: notice the 1 in the last argument when it is created. This follows from the explanation above: we fix one of the ASCs arbitrarily to 0 because utility can only be recovered up to an additive constant, so among the many possible solutions we pick the one that makes the ASC for the Swissmetro alternative 0.


In [10]:
ASC_CAR = exp.Beta ( 'ASC_CAR' ,0, None , None ,0)
ASC_TRAIN = exp.Beta ( 'ASC_TRAIN' ,0, None , None ,0)
ASC_SM = exp.Beta ( 'ASC_SM' ,0, None , None ,1)
B_TIME = exp.Beta ( 'B_TIME' ,0, None , None ,0)
B_COST = exp.Beta ( 'B_COST' ,0, None , None ,0)

We will create an artificial variable, the luggage variable squared.
This variable will only be included in the utility of the car alternative, with its own parameter. It is a totally arbitrary variable for the purposes of exposition; it does not mean that it is a good one.

In [11]:
B_LUGGA_SQ = exp.Beta( 'B_LUGGA_SQ', 0, None, None, 0)
LUGGA_SQ = LUGGAGE**2

A warning from the creator of the biogeme package:
when we define the parameters of our model, we store them in Python variables.
The author strongly recommends using the same name for the Python variable
as for the parameter.
For example, while we could have defined the variable as in the next cell, it is not recommended, since it could cause confusion later on.



In [12]:
#doing this is not recommended!
car_cte = exp.Beta( 'ASC_CAR' ,0, None , None ,0)

We now define the utility functions for each alternative, more specifically the linear relationship between the variables and the observed component of the utility.

In [13]:
V1 = ASC_TRAIN + B_TIME * TRAIN_TT + B_COST * TRAIN_CO 

V2 = ASC_SM + B_TIME * SM_TT + B_COST * SM_CO 

V3 = ASC_CAR + B_TIME * CAR_TT + B_COST * CAR_CO + B_LUGGA_SQ*LUGGA_SQ


We have to create a dictionary that maps the utility functions to the numbers that identify the alternatives in the database.
In this case 1 for Train, 2 for Swissmetro, 3 for car.

In [14]:
V = {1: V1 ,
2: V2 ,
3: V3 }

We have to pass the availabilities: these are the indicator variables signaling whether each alternative is available for that individual. Remember that the multinomial logit does not need all alternatives to be present for all individuals; it can recover the model from data even if the full choice set is not available to everyone.

In [15]:


av = {1: TRAIN_AV,
2: SM_AV,
3: CAR_AV }
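
To illustrate what the availabilities do, the sketch below computes logit probabilities by hand with an availability mask. It is a simplified stand-in written for this example (`logit_with_availability` is a hypothetical helper, and the numbers are invented), not biogeme's own implementation.

```python
import numpy as np

def logit_with_availability(V, av):
    """Row-wise logit probabilities; unavailable alternatives get probability 0."""
    expV = np.exp(V) * av          # zero out unavailable alternatives
    return expV / expV.sum(axis=1, keepdims=True)

# Hypothetical utilities (columns: train, Swissmetro, car)
V_toy = np.array([[-2.8, -1.2, -1.9],
                  [-2.8, -1.2, -1.9]])
# Second individual has no car available
av_toy = np.array([[1, 1, 1],
                   [1, 1, 0]])

P_toy = logit_with_availability(V_toy, av_toy)
print(P_toy.round(3))
```

Both rows have the same utilities, but in the second row the probability of car is zero and its share is redistributed between train and Swissmetro.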

This is the definition of the model, in this case the multinomial logit (we will use other models later).

In [16]:
logprob = models.loglogit (V , av , CHOICE )
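
Conceptually, the expression above is the log of the logit probability of the alternative that was actually chosen, and the estimation maximizes the sum of these values over the sample. A hand-computed sketch with invented utilities and choices, assuming all alternatives are available:

```python
import numpy as np

# Hypothetical utilities (columns: alternatives 1, 2, 3), all available
V_toy = np.array([[-2.8, -1.2, -1.9],
                  [-2.7, -1.1, -2.0]])
choice = np.array([2, 3])  # chosen alternative per row, coded 1..3

# Log of the logit probability for every alternative
log_probs = V_toy - np.log(np.exp(V_toy).sum(axis=1, keepdims=True))

# Pick the log probability of the chosen alternative in each row
log_prob_chosen = log_probs[np.arange(len(choice)), choice - 1]

# The sample log likelihood is the sum over observations
log_likelihood = log_prob_chosen.sum()
print(log_likelihood)
```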

And finally we pack everything together in the biogeme object.

In [17]:
bgm_model = bio.BIOGEME ( bgm_swissmetro, logprob )

We can give a name to the model. This helps identify the model when we come back to it later, for example when we save it to a file and want to use it in another report.

In [18]:
bgm_model.modelName = 'my first multinomial logit'

# Estimation of the model

Everything is set; biogeme will kindly do maximum likelihood estimation for us.

In [19]:
results = bgm_model.estimate()

# Results of the model

We can check a basic summary of the estimated model: likelihoods, information
criteria, etc.

In [20]:
results.getGeneralStatistics()

{'Akaike Information Criterion': (14200.492538968734, '.7g'),
 'Bayesian Information Criterion': (14236.131135685173, '.7g'),
 'Excluded observations': (1512, ''),
 'Final gradient norm': (0.0451203565262594, '.4E'),
 'Final log likelihood': (-7095.246269484367, '.7g'),
 'Init log likelihood': (-9742.706372523207, '.7g'),
 'Likelihood ratio test for the init. model': (5294.920206077681, '.7g'),
 'Nbr of threads': (2, ''),
 'Number of estimated parameters': (5, ''),
 'Rho-square for the init. model': (0.27173764679035384, '.3g'),
 'Rho-square-bar for the init. model': (0.2712244423675969, '.3g'),
 'Sample size': (9207, '')}
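
Most of these statistics follow from the two log likelihoods, the number of parameters and the sample size. The sketch below recomputes them with the standard textbook formulas, using the values printed above; it is a cross-check, not biogeme's internal code.

```python
import math

# Values copied from the output above
final_ll = -7095.246269484367
init_ll = -9742.706372523207
n_params = 5
sample_size = 9207

lr_test = -2 * (init_ll - final_ll)               # likelihood ratio test
rho_sq = 1 - final_ll / init_ll                   # rho-square
rho_sq_bar = 1 - (final_ll - n_params) / init_ll  # rho-square-bar
aic = 2 * n_params - 2 * final_ll
bic = n_params * math.log(sample_size) - 2 * final_ll

print(lr_test, rho_sq, rho_sq_bar, aic, bic)
```

The recomputed values agree with the reported ones, which is a quick way to confirm how each statistic is defined.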

The value of the parameters and the p-values for their statistical significance.
Note that ASC_SM is not shown, it was fixed to 0 by us.

In [21]:
results.getEstimatedParameters()

Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_CAR,0.100709,0.037175,2.70905,0.006748,0.044229,2.276975,0.022788
ASC_TRAIN,-1.067158,0.050138,-21.284324,0.0,0.06392,-16.695091,0.0
B_COST,-0.007187,0.000382,-18.800027,0.0,0.000541,-13.293817,0.0
B_LUGGA_SQ,-0.081684,0.024663,-3.311961,0.000926,0.02247,-3.635325,0.000278
B_TIME,-0.012762,0.000446,-28.611135,0.0,0.000703,-18.160019,0.0
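
The t-test column is the estimate divided by its standard error. The sketch below reproduces the `ASC_CAR` row, assuming that the p-value is a two-sided test against the standard normal distribution; that assumption is not stated in the output, but it matches the reported figure.

```python
import math

# Values copied from the ASC_CAR row above
value = 0.100709
std_err = 0.037175

t_test = value / std_err
# Two-sided p-value under a standard normal (assumption; it matches the table)
p_value = math.erfc(abs(t_test) / math.sqrt(2))

print(round(t_test, 3), round(p_value, 4))
```

The robust columns follow the same logic with the robust standard error in place of `std_err`.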


We can recover the values for the parameters in a dictionary.

In [22]:
results.getBetaValues()

{'ASC_CAR': 0.10070859907228755,
 'ASC_TRAIN': -1.0671577468567794,
 'B_COST': -0.007186549330881186,
 'B_LUGGA_SQ': -0.08168395799711564,
 'B_TIME': -0.012762245336960068}

# Predictions of the model

Producing the choice probabilities with biogeme is a little more cumbersome; it is a two-step process:
 1. We create a biogeme object with two arguments:
   * The dataset on which we want to calculate the predictions.
   * The targets to predict. These could be the individual utility functions, if we want to calculate utilities, or the logit transformation of the utility functions, if we want to calculate probabilities.
 2. We call the `simulate()` method of that biogeme object, which calculates the predictions. The argument for `simulate()` is the values of the betas, the coefficients estimated in the model. We can get them from `getBetaValues()` on the `results` object calculated before. Note that we have two biogeme objects, the one we created to estimate the model and the one we use to produce the predictions, so name them appropriately!

We can use as input the same dataset on which we estimated the model; this way we can get all the validation measures. We can also use a different dataset, as long as it has the same structure as the one used to estimate the model. A 'different dataset' can be used for holdout validation measures, for all the what-if scenarios (willingness to pay, elasticities, marginal effects) and for the actual 'predictions' in practice, in real scenarios where we do not know the choices.
 **Important: If we use variable transformations, we need to be careful when inputting a different dataset, because these transformations will not exist in the new dataset. This is why creating the transformations with pandas/numpy in the pandas dataframe and then passing it to biogeme is the recommended approach.** 
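
A minimal sketch of that recommendation, on a hypothetical three-row data frame: the squared luggage column is created in pandas, so any copy made for a what-if scenario carries it along. The names `toy` and `toy_scenario` are illustrative only.

```python
import pandas as pd

# Hypothetical mini data frame; only LUGGAGE matters for this sketch
toy = pd.DataFrame({'LUGGAGE': [0, 1, 3]})

# Create the transformed column in pandas, before building the biogeme database
toy['LUGGA_SQ'] = toy['LUGGAGE'] ** 2

# Any what-if copy made from this frame carries the column along
toy_scenario = toy.copy()
print(toy_scenario.columns.tolist())
```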


## Calculating choice probabilities

The first example is to get the choice probabilities.
We start with the expressions for the utility functions and need to transform them into expressions for choice probabilities. The way to do this is
to apply the multinomial logit 'squashing function'. We indicate this in biogeme using the `models.logit` function. The `models.logit` function requires the dictionary of utility expressions `V`, the availabilities dictionary `av`, and the number that identifies the alternative in the database. We need to create one `models.logit` expression per alternative, so 1 for train, 2 for Swissmetro and 3 for car.

In [23]:
prob_train = models.logit(V, av, 1)
prob_SM = models.logit(V, av, 2)
prob_car = models.logit(V, av, 3)

Next we need to assemble all the `models.logit` into one dictionary, we name the outputs properly to know what is being calculated.

In [24]:

targets_to_simulate = {'Prob. train': prob_train,
                       'Prob. SM': prob_SM,
                       'Prob. car': prob_car}


Then we create the biogeme object with the dataset on which we want to calculate the predictions and the dictionary of expressions for the choice probabilities.

In [25]:

bgm_pred_model = bio.BIOGEME(bgm_swissmetro, targets_to_simulate)
bgm_pred_model.modelName = "swissmetro_logit_test"           


Finally we pass to the predictive biogeme object the coefficients estimated from the other model, in the `results` object in our example. We show the choice probabilities predicted by the model.

In [26]:

betaValues = results.getBetaValues()
simulatedValues = bgm_pred_model.simulate(betaValues)
print(simulatedValues.head())

   Prob. train  Prob. SM  Prob. car
0     0.111749  0.589939   0.298312
1     0.123875  0.618950   0.257175
2     0.093168  0.563247   0.343585
3     0.101789  0.452283   0.445928
4     0.088171  0.577315   0.334514


## Calculating utilities

The utility functions can be recovered from the Python
dictionary that was used to specify the expression for each alternative, the `V` object in our example.

So we will do everything in one line, creating the biogeme object for the predictions and then computing the utilities.


In [27]:
Vpred = bio.BIOGEME(bgm_swissmetro, V).simulate(betaValues)
Vpred.head()

Unnamed: 0,1,2,3
0,-2.841484,-1.177722,-1.8596
1,-2.726623,-1.117876,-1.996144
2,-3.071204,-1.27189,-1.766175
3,-2.669131,-1.177722,-1.191874
4,-2.984965,-1.105857,-1.651564


What `simulate` does is simply evaluate the expressions for the utilities. Here is a manual computation of the utility for the car alternative in the first row of the dataset; it should coincide with the previous result.

In [28]:
swissmetro['CAR_CO'][0]*betaValues['B_COST'] + swissmetro['CAR_TT'][0]*betaValues['B_TIME'] + (swissmetro['LUGGAGE'][0]**2) *betaValues['B_LUGGA_SQ']  + betaValues['ASC_CAR']


-1.8595998118593176

Finally, to check that everything is correct, we can compute the choice probabilities from the utilities by applying the multinomial logit squashing manually with numpy. We should get the same choice probabilities as those computed with the `models.logit` expressions and `simulate()`. Note that this manual version ignores the availabilities, so it agrees with biogeme only on rows where all three alternatives are available.

So the next cell computes the multinomial logit squashing $\frac{e^{V_j}}{\sum_{k=1}^{J} e^{V_k}}$ for each row.

In [29]:
expV = np.exp(Vpred)
expV / expV.sum(axis=1).to_numpy()[:, np.newaxis]

Unnamed: 0,1,2,3
0,0.111749,0.589939,0.298312
1,0.123875,0.618950,0.257175
2,0.093168,0.563247,0.343585
3,0.101789,0.452283,0.445928
4,0.088171,0.577315,0.334514
...,...,...,...
10723,0.113243,0.645399,0.241358
10724,0.094336,0.517452,0.388212
10725,0.086588,0.507879,0.405533
10726,0.068681,0.564505,0.366814


# What-if scenarios

Most of the what-if scenarios can be recovered by manipulating the dataset that is passed to simulate.

## Calculating willingness to pay as a what-if scenario

The definition of willingness to pay is the ratio of the derivatives of the utility w.r.t. a variable of interest and w.r.t. cost:


$$WTP_{\text{variable of interest}} = \frac{ \frac{ \partial V_{nj}} { \partial \text{variable of interest}} } {  \frac{ \partial V_{nj}} { \partial \text{cost}}}$$

We can think of calculating the numerical derivatives as a what-if scenario.
Imagine that we want to calculate the willingness to pay for travel time. Basically, the question is: how much does utility change if the travel times increase by a very small value, relative to the change in utility if the costs increase by a very small value?

This might seem a weird what-if scenario, but it illustrates the power of what-if scenarios in general. Later, we will calculate a more conventional scenario.

Remember that the numerical derivative of a function $f$ can be calculated as $\frac{f(x + \delta) - f(x)}{\delta}$, for $\delta$ a small value.
So the derivative of utility w.r.t. cost is the scenario where the dataset is modified by adding a small value to the cost variables. We do this at the `pandas` level, recovering the dataframe inside the biogeme database (because biogeme modified it when we removed rows). Then we compute the new utilities for the modified dataset.

In [30]:
delta = 0.001
swissmetro_deltacost = bgm_swissmetro.data.copy()
swissmetro_deltacost[ ['TRAIN_CO', 'SM_CO', 'CAR_CO'] ] += delta
swissmetro_deltacost = db.Database('swissmetro_deltacost', swissmetro_deltacost)

Vpred_deltacost = bio.BIOGEME( swissmetro_deltacost, V).simulate(betaValues)
Vpred_deltacost.head()

Unnamed: 0,1,2,3
0,-2.841491,-1.177729,-1.859607
1,-2.726631,-1.117883,-1.996151
2,-3.071211,-1.271897,-1.766182
3,-2.669138,-1.177729,-1.191881
4,-2.984973,-1.105864,-1.651571


We have to do the same for travel time, being careful not to modify the wrong dataset!

In [31]:
swissmetro_deltatime = bgm_swissmetro.data.copy()
swissmetro_deltatime[ ['TRAIN_TT', 'SM_TT', 'CAR_TT'] ] += delta
swissmetro_deltatime = db.Database('swissmetro_deltatime', swissmetro_deltatime)

Vpred_deltatime = bio.BIOGEME(swissmetro_deltatime, V).simulate(betaValues)
Vpred_deltatime.head()

Unnamed: 0,1,2,3
0,-2.841496,-1.177735,-1.859613
1,-2.726636,-1.117888,-1.996157
2,-3.071217,-1.271903,-1.766187
3,-2.669144,-1.177735,-1.191886
4,-2.984978,-1.105869,-1.651576


Finally, we compute the ratio of derivatives. The division by $\delta$ is omitted because it cancels in the ratio.
We average over all individuals and alternatives in this case.

In [32]:
(Vpred_deltatime - Vpred).values.mean() / (Vpred_deltacost - Vpred).values.mean()


1.775851629113709

The result should be very similar to the 'analytic' definition of willingness to pay for linear utility, which is the ratio of the coefficients.

In [33]:
betaValues['B_TIME'] / betaValues['B_COST']

1.7758516291149167

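## What happens to choice probabilities if there are problems on the roads?

Imagine that there is a big snowfall (it is Switzerland after all). It is estimated that this will increase travel time by car by 25% and then add a flat 30 minutes for setting up the chains on the tires. What would happen to the choice probabilities? Note that the scenario code below applies the x1.25 and +30 change to the cost column `CAR_CO`, and the outputs that follow reflect that; applying the same two lines to `CAR_TT` instead would express the scenario in travel time.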


In [34]:
results.getEstimatedParameters()

Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_CAR,0.100709,0.037175,2.70905,0.006748,0.044229,2.276975,0.022788
ASC_TRAIN,-1.067158,0.050138,-21.284324,0.0,0.06392,-16.695091,0.0
B_COST,-0.007187,0.000382,-18.800027,0.0,0.000541,-13.293817,0.0
B_LUGGA_SQ,-0.081684,0.024663,-3.311961,0.000926,0.02247,-3.635325,0.000278
B_TIME,-0.012762,0.000446,-28.611135,0.0,0.000703,-18.160019,0.0


In [35]:
swissmetro_snow = bgm_swissmetro.data.copy()
swissmetro_snow[ 'CAR_CO' ] *= 1.25
swissmetro_snow[ 'CAR_CO' ] += 30

bgm_model_snow = bio.BIOGEME(db.Database('swissmetro_snow', swissmetro_snow),
                             targets_to_simulate)

snow_probs = bgm_model_snow.simulate(betaValues)

In [36]:
snow_probs.head()

Unnamed: 0,Prob. train,Prob. SM,Prob. car
0,0.122045,0.644289,0.233666
1,0.134489,0.671979,0.193532
2,0.102533,0.619864,0.277603
3,0.115478,0.513109,0.371413
4,0.098257,0.643353,0.258391


We can compute the differences

In [37]:
(snow_probs - simulatedValues ).head()

Unnamed: 0,Prob. train,Prob. SM,Prob. car
0,0.010295,0.054351,-0.064646
1,0.010613,0.053029,-0.063642
2,0.009365,0.056616,-0.065981
3,0.013689,0.060826,-0.074515
4,0.010086,0.066038,-0.076124


Let's see how the market shares change:

In [38]:
simulatedValues.mean(axis=0)

Prob. train    0.086131
Prob. SM       0.584990
Prob. car      0.328880
dtype: float64

In [39]:
snow_probs.mean(axis=0)

Prob. train    0.094524
Prob. SM       0.640429
Prob. car      0.265047
dtype: float64

Instead of probabilities, we can calculate the expected number of people choosing each mode of transport in each scenario, by summing the probabilities.

In [40]:
simulatedValues.sum(axis=0)

Prob. train     793.004412
Prob. SM       5385.999584
Prob. car      3027.996005
dtype: float64

In [41]:
snow_probs.sum(axis=0)

Prob. train     870.285327
Prob. SM       5896.428525
Prob. car      2440.286148
dtype: float64
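
The scenario cell above modifies `CAR_CO`. A travel-time version follows the same pattern on `CAR_TT`. The sketch below shows it without biogeme on a hypothetical two-row data frame built from the first two rows of the `head()` output, with coefficients rounded from the estimation results. The `probabilities` helper is written for this example and assumes that all three alternatives are available.

```python
import numpy as np
import pandas as pd

# Coefficients rounded from the estimation results above
beta = {'ASC_TRAIN': -1.067158, 'ASC_CAR': 0.100709,
        'B_TIME': -0.012762, 'B_COST': -0.007187, 'B_LUGGA_SQ': -0.081684}

# Hypothetical mini data frame (first two rows of the head() output above)
toy = pd.DataFrame({
    'TRAIN_TT': [112, 103], 'TRAIN_CO': [48, 48],
    'SM_TT': [63, 60], 'SM_CO': [52, 49],
    'CAR_TT': [117, 117], 'CAR_CO': [65, 84], 'LUGGAGE': [0, 0],
})

def probabilities(df):
    """Logit probabilities assuming all three alternatives are available."""
    V = pd.DataFrame({
        'train': beta['ASC_TRAIN'] + beta['B_TIME'] * df['TRAIN_TT'] + beta['B_COST'] * df['TRAIN_CO'],
        'SM': beta['B_TIME'] * df['SM_TT'] + beta['B_COST'] * df['SM_CO'],
        'car': (beta['ASC_CAR'] + beta['B_TIME'] * df['CAR_TT'] + beta['B_COST'] * df['CAR_CO']
                + beta['B_LUGGA_SQ'] * df['LUGGAGE'] ** 2),
    })
    expV = np.exp(V)
    return expV.div(expV.sum(axis=1), axis=0)

# Travel-time version of the scenario: 25% longer car trips plus 30 minutes
toy_snow = toy.copy()
toy_snow['CAR_TT'] = toy_snow['CAR_TT'] * 1.25 + 30

base = probabilities(toy)
snow = probabilities(toy_snow)
print((snow - base).round(3))
```

As in the cost version, the car probability drops and the other two alternatives gain share; column means of `base` and `snow` would give the corresponding market shares.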

# Accuracies and confusion matrices

In [52]:
which_max = simulatedValues.idxmax(axis=1)
which_max = which_max.replace({'Prob. train': 1, 'Prob. SM': 2, 'Prob. car': 3,})

In [53]:
#accuracy
np.mean(which_max == bgm_swissmetro.data['CHOICE'])

0.673617899424351

Calculate the class proportions in the dataset, to see how much improvement there is over a naive model that always predicts the most popular class.

In [58]:
tab = pd.crosstab(bgm_swissmetro.data['CHOICE'], 'count')
tab / tab.sum()

col_0,count
CHOICE,Unnamed: 1_level_1
1,0.08613
2,0.58499
3,0.32888


Calculate the confusion matrix

In [79]:
data = {'y_Actual':   bgm_swissmetro.data['CHOICE'],
        'y_Predicted': which_max
        }

df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])

confusion_matrix

Predicted,2,3
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
1,655,138
2,4744,642
3,1570,1458


For example, only 65% of the predictions for car turned out to be cars.

In [74]:
confusion_matrix.loc[:,3] / confusion_matrix.loc[:,3].sum()

Actual
1    0.061662
2    0.286863
3    0.651475
Name: 3, dtype: float64

For the predictions of Swissmetro, the figure is similar (68%).
About 23% of the Swissmetro predictions were actually car and about 9% were actually train, so Swissmetro predictions are confused with car more often than with train.

In [78]:
confusion_matrix.loc[:,2] / confusion_matrix.loc[:,2].sum()

Actual
1    0.093988
2    0.680729
3    0.225283
Name: 2, dtype: float64
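
The figures in this section can all be derived from the confusion matrix itself. The sketch below copies the counts shown above into a numpy array and recomputes the accuracy, the naive baseline, and the per-class recall and precision. A column of zeros is added for train because the matrix has no 'predicted train' column.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: actual 1, 2, 3).
# The matrix has no column for predicted train, i.e. train is never the
# most probable alternative, so a column of zeros is added for it.
cm = np.array([[0,  655,  138],
               [0, 4744,  642],
               [0, 1570, 1458]])

accuracy = np.trace(cm) / cm.sum()
naive_accuracy = cm.sum(axis=1).max() / cm.sum()  # always predict the largest class

# Recall: share of each actual class that is predicted correctly
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: share of each predicted class that is correct
# (undefined for train, which is never predicted, so it is left as NaN)
col_totals = cm.sum(axis=0)
precision = np.full(3, np.nan)
predicted = col_totals > 0
precision[predicted] = np.diag(cm)[predicted] / col_totals[predicted]

print(accuracy, naive_accuracy)
print(recall.round(3), precision.round(3))
```

The two ratios computed in the cells above are the precision values for car and Swissmetro; recall answers the complementary question of how many actual users of each mode the model identifies.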

---
---

# Exercises

---
---

# 1) Calculate the choice probabilities according to your model




# 2) What would happen to choice probabilities in the model estimated in Exercise 3 of the previous tutorial if the cost of train and Swissmetro increased by 15%?




# 3) What would happen to the revenue?