<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/WK_10_tuto_efficient_designs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to design efficient experiments

* We will create a fake design matrix for the full experiment
* We will use biogeme to compute the initial choice probabilities
* Compute the d-efficiency for the design
* Show can we can get subsets of the full design and compute their efficiency
* Do a random search for the best subdesign originating from the initial design

In [1]:
!pip install biogeme



In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools

In [3]:
betas = np.matrix(' 1 1; 2 2')

In [4]:
def choice_prob(betas, X):
  V = np.matmul(X, betas)
  P = np.exp(V)
  return P / np.sum(P, axis = 1)

In [5]:
colnames = ['price_apple', 'size_apple', 'os_apple', 'price_android', 'size_android', 'os_android']

# Working example, discrete choice experiment for automobile preferences.

We want to understand population preferences for cars, we will consider the following variables.

`price`, `power`, `engine_type`

Ignoring all realistic values, let's go for:

* Prices: 20000, 30000, 40000 AUD.
* Consider power: 130 hp, 170 hp, 220 hp.
* Engine types, encoded as integer initially: 0=petrol, 1=diesel, 2=hybrid, 3=electric.



#Creating all combinations of variables and values

We can create the full factorial design in python by using the cartesian product
`cartesian` function.

Here is an example of use.

In [6]:
from sklearn.utils.extmath import cartesian

full_fact = pd.DataFrame(cartesian(([20000.0, 30000.0, 40000.0], [130, 170, 220], [0, 1, 2, 3])), columns=['price', 'power', 'engine_type'])
full_fact

Unnamed: 0,price,power,engine_type
0,20000.0,130.0,0.0
1,20000.0,130.0,1.0
2,20000.0,130.0,2.0
3,20000.0,130.0,3.0
4,20000.0,170.0,0.0
5,20000.0,170.0,1.0
6,20000.0,170.0,2.0
7,20000.0,170.0,3.0
8,20000.0,220.0,0.0
9,20000.0,220.0,1.0


These are all possible combinations of price, car and engine type.

In practice the full factorial could be too large to compute. We do not need to
actually compute it *completely* for what we want to do, which is finding the best subset of size $N$ out of the full factorial.

We still need to create the alternatives to compare, imagine that we have two car manufacturers, each will identify one alternative. The two alternatives are then Toyota and Renault. Comparing all possible Toyota values for the attributes vs all possible Renault attributes will render 2025 entries in the full factorial.

Let us create the full experiment comparing two alternatives, we can do that by using the `cartesian` product again, or in this case we `merge` together two copies of the attributes that we computed earlier. `merge` takes two dataframe and a pair of `suffixes` to identify the repeated names of the columns.

In [7]:
design_mat = pd.merge(full_fact, full_fact, how='cross', suffixes=('_toyo', '_rena'))
design_mat

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
0,20000.0,130.0,0.0,20000.0,130.0,0.0
1,20000.0,130.0,0.0,20000.0,130.0,1.0
2,20000.0,130.0,0.0,20000.0,130.0,2.0
3,20000.0,130.0,0.0,20000.0,130.0,3.0
4,20000.0,130.0,0.0,20000.0,170.0,0.0
...,...,...,...,...,...,...
1291,40000.0,220.0,3.0,40000.0,170.0,3.0
1292,40000.0,220.0,3.0,40000.0,220.0,0.0
1293,40000.0,220.0,3.0,40000.0,220.0,1.0
1294,40000.0,220.0,3.0,40000.0,220.0,2.0


#Efficiency

Recall that we want to find the subset of $N$ rows from the full experiment that maximizes the efficiency of the resulting experiment.
There are several concepts of efficiency, relatively similar but we will
focus on D-efficiency. Roughly speaking, D-efficiency want to make the covariance matrix for the coefficients, $\text{covariance}(B)$, as 'small' as possible.
In discrete choice, the formula for the covariance matrix of the coefficients is a bit more complex than for linear regression.


$$\text{covariance}(\beta) = (Z' P Z )^{-1}$$

when working with $J$ alternatives:
*  $P$ is the matrix of choice probabilities computed by the model.
* $Z$ is similar to design matrix, but 'centered' using the choice probabilities. Basically, to each row of observations, we substract the weighted mean of the variables across all alternatives. The weights are the choice probabilities computed by the model.

 $$z_{jn} = x_{jn} - \sum_{i=1}^Jx_{in}P_{in}$$

To compute the $Z$ matrix, **we need some  'choice probabilities'**. In our context, we do not yet know these choice probabilities, so we need to work with an initial guess of them. This initial guess usually comes from an 'initial' value for the coefficients that creates equal choice probs, basically a 'no-information' stating model. In some cases (as in the group assignment!), we might get a good starting guess, fom example, if we have data of a similar problem or from a similar experiment.

#Creating the centered design matrix $Z$

We need some initial model that we can use to compute choice probabilities.
We can use a biogeme model, or some manual computation, since we do not really need to estimate the coefficients from the data.

Lets try biogeme.

Load some auxiliary functions first:

---
---

# Auxiliary functions

The first function takes the dictionary of utilities, a pandas dataframe, and the name of the variable that contains the variable with the results of the choice. It returns the biogeme object with the model and the estimated 'results' object (the one we get the values, likelihoods, etc.)
We have added the dictionary with the utilities to the biogeme object, in case we use it later.

In [8]:
def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
 av = {1: 1,
       2: 1}
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprob = models.loglogit (V , av , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V.copy()
 return bgm_model, bgm_model.estimate()

The next function will calculate the predictions for a given biogeme object that was estimated with `qbus_estimate_bgm`. The output is the array with the choice probabilities. From the choice probabilities, this can be used to calculate accuracies, confusion matrices and the output of what-if scenarios.

In [9]:
def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
  av_auto = qbus_bgm_model.utility_dic.copy()
  for key, value in av_auto.items():
   av_auto[key] = 1

  targets = qbus_bgm_model.utility_dic.copy()
  for key, value in targets.items():
   targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

  bgm_db = db.Database('simul', pred_pd_df)
  globals().update(bgm_db.variables)
  bgm_pred_model = bio.BIOGEME(bgm_db, targets)
  simulatedValues = bgm_pred_model.simulate(betas)
  return simulatedValues

The function `qbus_calc_accu_confusion` calculates the accuracies given the choice probability predictions a pandas dataset and the specification of the name that contains the actual choices in the input dataset.

In [10]:
def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix

The next function calculates the likelihood ratio test having to write a bit less code that the default biogeme function. The arguments are the results objects of the two models to be compared. The first is the more complex and the second is the reference model (**the order is important!**). The third argument is the significance level for the test.

In [11]:
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)

The next function just updates the globals so we can use it

In [12]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

---
---

#'Estimating' the initial biogeme model

We will create a dataset that is the same as the design matrix, to pass to biogeme. We will get a biogeme model with the right structure and use biogeme for the predictions.
We have to give biogeme a 'fake' response variable that does not exist so it can create the model, otherwise it will give us an error.

We can give arbitrarily values to the choice variable, since we are going to impose the coefficients manually afterwards. We set all choices to 1, then, we need to get some of the entries to the other alternative (otherwise biogeme will go crazy), so we set the first row to 2.

In [13]:
init_dset = design_mat.copy()
init_dset['choice'] = 1
qbus_update_globals_bgm(init_dset)

In [14]:
init_dset['choice'][0] = 2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  init_dset['choice'][0] = 2


Now we create the biogeme model that would analyze the data, as usual.
The differences is that, for simplicity, we will set most coefficients to 0 and tell biogeme to **not fit them.** This is because we would like to set them manually to 0 so we have the no-information starting coefficients. If we have other dataset that could be used for the initial guess, then we would fit it for real.

Unfortunately, we cannot set all coefficients to 0 at this stage, we have to fit some coefficient (otherwise biogeme will give an error) so we fill fit one of the ASC (we will ignore the result anyway).

In [15]:
ASC_toyo = exp.Beta ( 'ASC_toyo' ,0, None , None ,1)
ASC_rena = exp.Beta ( 'ASC_rena' ,0, None , None ,0)

B_price = exp.Beta ( 'B_price',0, None , None ,1)
B_power = exp.Beta ( 'B_power',0, None , None ,1)
B_engine_type = exp.Beta('B_engine_type', 0, None, None, 1)

In [16]:
V_toyo = ASC_toyo + B_price*price_toyo + B_power*power_toyo + B_engine_type*engine_type_toyo
V_rena = ASC_rena + B_price*price_rena + B_power*power_rena + B_engine_type*engine_type_rena

V_base = {1: V_toyo,
     2: V_rena}

And we now 'estimate' the biogeme model, remember this estiamtion is so we can get the structure for making the predictions, but we are going to ignore the actual estimated coefficients and set them to 0.

In [17]:
model_base, results_base = qbus_estimate_bgm(V_base, init_dset, 'choice', 'automob')



In [18]:
results_base.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_rena,-7.166266,1.000386,-7.163501,7.86482e-13


**Important step**. We need to compute the choice probabilities of the initial model. We will use biogeme for that, but where we pass the Betas, we will set them manually. Notice the dictionary `{'ACS_rena':0}`, this will tell biogeme that the value of the ASC that we fitted is set to 0, manually by us. The rest of the betas are not needed to set manually to 0, because they were set fixed to zero at the time of the definition.

We calculate the choice probabilties for the design matrix, we should get equal probs. because the Betas are set to 0, so all alternatives haver the same utility for all values of their attributes.

In [19]:
init_choice_probs=qbus_simulate_bgm(model_base, {'ASC_rena':0}, design_mat)
init_choice_probs

Unnamed: 0,1,2
0,0.5,0.5
1,0.5,0.5
2,0.5,0.5
3,0.5,0.5
4,0.5,0.5
...,...,...
1291,0.5,0.5
1292,0.5,0.5
1293,0.5,0.5
1294,0.5,0.5


# Estimating the d-efficiency for choice models.

With the initial choice probabilities and the design matrix we can compute the efficiency following the above for the covariance for the betas, and the formula in the lectures for the d-efficienty based on the determinant of the covariance of the betas.

Programmatically it is a bit tricky, so we have this function `calc_mnl_cov`.

 **Important, you might want to reuse it for the group assignment**. The function gets as arguments, a design matrix, the choice probabilities, and we have also to pass them the number of alternatives and the number of attribures per alternatives.

In [20]:
def calc_mnl_cov(design_m, cprobs, num_alt, attrs_per_alt):
  P_rep = np.repeat(cprobs.to_numpy(), np.repeat(attrs_per_alt, num_alt), axis=1)
  num_cols = num_alt * attrs_per_alt
  XP_rep = np.repeat((design_m.to_numpy()*P_rep).sum(axis=1).T.reshape(-1,1), num_cols, axis=1)
  Z = design_m - XP_rep
  ZPZ = np.matmul(Z.T, P_rep*Z.to_numpy())
  covMNL = np.linalg.pinv(ZPZ)
  if (np.linalg.det(covMNL)):
    return covMNL
  else:
    return np.eye(covMNL.shape[0])*1000
  return covMNL

In [21]:
covMNL = calc_mnl_cov(design_mat, init_choice_probs, 2, 3)

The definition of d-efficiency:

In [22]:
def d_effic(covMAT):
  return np.power( np.linalg.det(covMAT), 1 / (covMAT.shape[0] + 1) )

And we can calculate the efficiency, finally.

In [23]:
d_effic(covMNL)

1.1952521124286206e-06

#Finding good designs by random search

Our actual goal is to find a good subset of the full factorial design matrix.
There are some sophisticated ways of findining it, it is an optimization problem that mathematicians have studied.

We will use random search, we will compare many possible subsets of the full design matrix, and pick the best one.




Set the size of the experiment, say 100 rows.

In [24]:
N = 12

Get a random subset of the design matrix.
The python function `np.random.choice` generates random integers for a given range. We pass as range, the number of rows in the full design matrix.
We use sampling without replacement, but you can use replacement, it could repeat rows of the design, but as long as the efficiency is OK, we are good to go.

In [25]:
np.random.seed(1234)

selected_rows = np.random.choice(design_mat.shape[0], N, replace=False)
selected_rows

array([  35, 1016,  494,  350, 1291,  905,  547,  772,  113,   90,  869,
        783])

Lets look at how this subdesign looks like

In [26]:
sub_design = design_mat.iloc[selected_rows,:]
sub_design

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
35,20000.0,130.0,0.0,40000.0,220.0,3.0
1016,40000.0,170.0,0.0,20000.0,220.0,0.0
494,30000.0,130.0,1.0,40000.0,130.0,2.0
350,20000.0,220.0,1.0,40000.0,130.0,2.0
1291,40000.0,220.0,3.0,40000.0,170.0,3.0
905,40000.0,130.0,1.0,20000.0,170.0,1.0
547,30000.0,130.0,3.0,20000.0,170.0,3.0
772,30000.0,220.0,1.0,30000.0,170.0,0.0
113,20000.0,130.0,3.0,20000.0,170.0,1.0
90,20000.0,130.0,2.0,30000.0,170.0,2.0


Compute the efficiency for that design.

In [27]:
sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':0}, sub_design)
sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
d_effic(sub_covMNL)

8.126159538750309e-05

Put everything in a `for` loop, and run 250 times, pick the best sub design.

In [28]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(250):
  selected_rows = np.random.choice(design_mat.shape[0], N, replace=False)
  sub_design = design_mat.iloc[selected_rows,:]
  sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':0}, sub_design)
  sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
  d_ef = d_effic(sub_covMNL)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic)
    best_rows = selected_rows


New best efficiency found! 8.126159538750309e-05
New best efficiency found! 7.990706870261724e-05
New best efficiency found! 7.88632819977927e-05
New best efficiency found! 6.9417385121184e-05
New best efficiency found! 6.583462212600721e-05
New best efficiency found! 6.262240807063488e-05


Lets print the best design found

In [29]:
design_mat.iloc[best_rows,:]

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
567,30000.0,130.0,3.0,40000.0,130.0,3.0
1157,40000.0,220.0,0.0,20000.0,170.0,1.0
381,20000.0,220.0,2.0,30000.0,220.0,1.0
1287,40000.0,220.0,3.0,40000.0,130.0,3.0
1261,40000.0,220.0,3.0,20000.0,130.0,1.0
1254,40000.0,220.0,2.0,40000.0,170.0,2.0
47,20000.0,130.0,1.0,20000.0,220.0,3.0
604,30000.0,170.0,0.0,40000.0,170.0,0.0
1007,40000.0,130.0,3.0,40000.0,220.0,3.0
178,20000.0,170.0,0.0,40000.0,220.0,2.0


#Exercise 1) Think of some good initial coefficients that are not zero, and run the search again using that model for the choice probabilities. Notice the differences? What are the choice probs of the original subset found (under the new coefficients)  vs the new subset found?

We will start with an initial 'guess'. The coefficients that we will impose are 'small', as to not give us, initially, very extreme choice probabilities.
We will calculate a reference design, and then change our initial guess, giving much more importance to one of the variables (e.g. engine type more positive or more negative). We will compare both 'optimal' designs (the results of the search), for the two levels of importance. Intuitively, the design chosen when engine type is more important should 'sample' engine type more, it should pick more different values compared to when it is not important, the farther away the variables, the easier to estimate the coefficient for that variable.

In [30]:
ASC_toyo = exp.Beta ( 'ASC_toyo' ,0, None , None ,1)
ASC_rena = exp.Beta ( 'ASC_rena' ,0.01, None , None ,0) #not used

B_price = exp.Beta ( 'B_price',-0.0001, None , None ,1)
B_power = exp.Beta ( 'B_power',0.021, None , None ,1)
B_engine_type = exp.Beta('B_engine_type', 0.0000013, None, None, 1)

In [31]:
V_toyo = ASC_toyo + B_price*price_toyo + B_power*power_toyo + B_engine_type*engine_type_toyo
V_rena = ASC_rena + B_price*price_rena + B_power*power_rena + B_engine_type*engine_type_rena

V_base = {1: V_toyo,
     2: V_rena}

In [32]:
model_base, results_base = qbus_estimate_bgm(V_base, init_dset, 'choice', 'automob')



In [33]:
sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':0.01}, sub_design)
np.round(sub_probs,2)

Unnamed: 0,1,2
1038,0.5,0.5
412,0.88,0.12
144,0.7,0.3
590,0.7,0.3
788,0.73,0.27
786,0.88,0.12
1173,0.27,0.73
1048,0.12,0.88
344,0.73,0.27
1226,0.47,0.53


In [34]:
sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
sub_covMNL
np.linalg.det(sub_covMNL)

-4.69274344155968e-35

In [35]:
np.seterr(all="ignore")

{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

In [36]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(250):
  selected_rows = np.random.choice(design_mat.shape[0], N, replace=False)
  sub_design = design_mat.iloc[selected_rows,:]
  sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':2}, sub_design)
  sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
  d_ef = d_effic(sub_covMNL)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic)
    best_rows = selected_rows

New best efficiency found! 1.2452656559872315e-05
New best efficiency found! 1.238046441502408e-05
New best efficiency found! 1.0206153795608703e-05
New best efficiency found! 8.28663200980351e-06
New best efficiency found! 8.13179930000571e-06
New best efficiency found! 8.050097721774853e-06


In [37]:
design_mat.iloc[best_rows,:]

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
1159,40000.0,220.0,0.0,20000.0,170.0,3.0
84,20000.0,130.0,2.0,30000.0,130.0,0.0
745,30000.0,220.0,0.0,40000.0,130.0,1.0
1269,40000.0,220.0,3.0,20000.0,220.0,1.0
956,40000.0,130.0,2.0,30000.0,220.0,0.0
136,20000.0,130.0,3.0,40000.0,170.0,0.0
1000,40000.0,130.0,3.0,40000.0,170.0,0.0
215,20000.0,170.0,1.0,40000.0,220.0,3.0
365,20000.0,220.0,2.0,20000.0,170.0,1.0
1197,40000.0,220.0,1.0,20000.0,220.0,1.0


In [38]:
np.mean(np.abs(design_mat.iloc[best_rows,2] - design_mat.iloc[best_rows,5]))

1.8333333333333333

Now we will change the intitial guess of engine type to a much larger number,
and compare the results.

In [39]:
ASC_toyo = exp.Beta ( 'ASC_toyo' ,0, None , None ,1)
ASC_rena = exp.Beta ( 'ASC_rena' ,0.01, None , None ,0) #not used

B_price = exp.Beta ( 'B_price',-0.0001, None , None ,1)
B_power = exp.Beta ( 'B_power',0.021, None , None ,1)
B_engine_type = exp.Beta('B_engine_type', 10.0000013, None, None, 1)

In [40]:
V_toyo = ASC_toyo + B_price*price_toyo + B_power*power_toyo + B_engine_type*engine_type_toyo
V_rena = ASC_rena + B_price*price_rena + B_power*power_rena + B_engine_type*engine_type_rena

V_base = {1: V_toyo,
     2: V_rena}

In [41]:
model_base, results_base = qbus_estimate_bgm(V_base, init_dset, 'choice', 'automob')



In [42]:
sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':0.01}, sub_design)
np.round(sub_probs,2)

Unnamed: 0,1,2
1038,0.0,1.0
412,1.0,0.0
144,0.7,0.3
590,0.0,1.0
788,1.0,0.0
786,0.0,1.0
1173,0.0,1.0
1048,1.0,0.0
344,1.0,0.0
1226,0.47,0.53


In [43]:
sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
sub_covMNL
np.linalg.det(sub_covMNL)

3.9460181999735294e-36

In [44]:
np.seterr(all="ignore")

{'divide': 'ignore', 'over': 'ignore', 'under': 'ignore', 'invalid': 'ignore'}

In [45]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(250):
  selected_rows = np.random.choice(design_mat.shape[0], N, replace=False)
  sub_design = design_mat.iloc[selected_rows,:]
  sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':2}, sub_design)
  sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 3)
  d_ef = d_effic(sub_covMNL)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic)
    best_rows = selected_rows

New best efficiency found! 7.861204605666597e-06
New best efficiency found! 7.635738251780904e-06
New best efficiency found! 7.502837505660039e-06
New best efficiency found! 6.349930412329488e-06
New best efficiency found! 5.632506619525888e-06


In [46]:
design_mat.iloc[best_rows,:]

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
336,20000.0,220.0,1.0,30000.0,130.0,0.0
309,20000.0,220.0,0.0,30000.0,220.0,1.0
638,30000.0,170.0,1.0,40000.0,130.0,2.0
762,30000.0,220.0,1.0,20000.0,170.0,2.0
431,20000.0,220.0,3.0,40000.0,220.0,3.0
982,40000.0,130.0,3.0,20000.0,220.0,2.0
907,40000.0,130.0,1.0,20000.0,170.0,3.0
413,20000.0,220.0,3.0,30000.0,170.0,1.0
737,30000.0,220.0,0.0,30000.0,170.0,1.0
422,20000.0,220.0,3.0,40000.0,130.0,2.0


In [47]:
np.mean(np.abs(design_mat.iloc[best_rows,2] - design_mat.iloc[best_rows,5]))

1.25

We see how the average distance of across alternatives for engine type is much larger when engine type is more 'imporant' in the choice model.

---
# Exercise 1.B) Create an ordered logit model and practice the selection of rows

We will practice with an scenario of an ordered logit. We will simulate
that the dataset comes with a few characteristics, and we focus our model on those characteristics.
Then we will calculate a covariance matrix and efficiency.

**Step 1: Generating a dataset with characteristics**

We will add two columns that do not change per alternative, some characteristics such as age and gender.
We copy the original full design, and add some columns with random values.
We could also create the new full design, using all combinations, this woudl greatly increase the size of the matrix, so we skip that part (we are doing a random search after all).

In [48]:
ord_dset = init_dset.copy()

In [49]:
np.random.seed(1234)

Adding some ages and genders at random

In [50]:
ord_dset['choice'] = np.random.choice([1,2], ord_dset.shape[0])
ord_dset['age'] = np.random.choice([19, 23, 27, 30, 35, 65], ord_dset.shape[0])
ord_dset['gender'] = np.random.choice([0,1], ord_dset.shape[0])

In [51]:

qbus_update_globals_bgm(ord_dset)

In [52]:
ord_dset

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena,choice,age,gender
0,20000.0,130.0,0.0,20000.0,130.0,0.0,2,30,1
1,20000.0,130.0,0.0,20000.0,130.0,1.0,2,35,0
2,20000.0,130.0,0.0,20000.0,130.0,2.0,1,35,0
3,20000.0,130.0,0.0,20000.0,130.0,3.0,2,19,1
4,20000.0,130.0,0.0,20000.0,170.0,0.0,1,19,1
...,...,...,...,...,...,...,...,...,...
1291,40000.0,220.0,3.0,40000.0,170.0,3.0,1,35,0
1292,40000.0,220.0,3.0,40000.0,220.0,0.0,1,35,0
1293,40000.0,220.0,3.0,40000.0,220.0,1.0,2,27,0
1294,40000.0,220.0,3.0,40000.0,220.0,2.0,1,65,1


We will create a new specification for the ordered logit, following the same principles. We will use the tau as the variable to 'fit' in biogeme, but this will not affect the results.

In [53]:
B_ord_age = exp.Beta ( 'B_ord_price',-0.001, None , None ,1)
B_ord_gender = exp.Beta ( 'B_ord_power',0.01, None , None ,1)


In [54]:
V_ord = B_ord_age*age + B_ord_gender * gender

In [55]:
tau1 = exp.Beta('tau1', -1.5, None, None, 0)



Specification of the ordered logit, manually here.

In [56]:
import biogeme.distributions as dist

In [57]:
alt_probs_map = {
    1: dist.logisticcdf(tau1 - V_ord),
    2: 1- dist.logisticcdf(tau1 - V_ord)}

In [58]:
ord_logprob = exp.log(exp.Elem(alt_probs_map, choice))

In [59]:
ord_db = db.Database('ord_car', ord_dset)

biogeme_ord  = bio.BIOGEME(ord_db, ord_logprob)


results_ord = biogeme_ord.estimate()



Sanity checks, but these will not be relevant since the covariance matrix for the ordered logit, we will use a simplified approach that does not require the predictions (the choice probs.)

In [60]:
results_ord.getEstimatedParameters()

Unnamed: 0,Value,Rob. Std err,Rob. t-test,Rob. p-value
tau1,-0.127663,0.05564,-2.294434,0.021766


In [61]:
biogeme_ord_pred = bio.BIOGEME(ord_db, alt_probs_map)
simulatedValues = biogeme_ord_pred.simulate(results_ord.getBetaValues())
simulatedValues

Unnamed: 0,1,2
0,0.473110,0.526890
1,0.476851,0.523149
2,0.476851,0.523149
3,0.470369,0.529631
4,0.470369,0.529631
...,...,...
1291,0.476851,0.523149
1292,0.476851,0.523149
1293,0.474856,0.525144
1294,0.481842,0.518158


**Important: For the sake of simplicity, we will use the definition of the covariance matrix for linear models, which does not require the predictions**

In [62]:
def calc_ord_cov(design_m):
  Z =design_m.to_numpy()
  ZTZ = np.matmul(Z.T, Z)
  covMNL = np.linalg.pinv(ZTZ)
  if (np.linalg.det(covMNL)):
    return covMNL
  else:
    return np.eye(covMNL.shape[0])*1000

We drop the response variable from the full dset, since this should not be used for the covariance (it is the response variable)

In [63]:
ord_design = ord_dset.drop(['choice'], axis=1)

Our design matrix looks like this

In [64]:
ord_design

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena,age,gender
0,20000.0,130.0,0.0,20000.0,130.0,0.0,30,1
1,20000.0,130.0,0.0,20000.0,130.0,1.0,35,0
2,20000.0,130.0,0.0,20000.0,130.0,2.0,35,0
3,20000.0,130.0,0.0,20000.0,130.0,3.0,19,1
4,20000.0,130.0,0.0,20000.0,170.0,0.0,19,1
...,...,...,...,...,...,...,...,...
1291,40000.0,220.0,3.0,40000.0,170.0,3.0,35,0
1292,40000.0,220.0,3.0,40000.0,220.0,0.0,35,0
1293,40000.0,220.0,3.0,40000.0,220.0,1.0,27,0
1294,40000.0,220.0,3.0,40000.0,220.0,2.0,65,1


We now calculate the covariance matrix. We will use the defintion of covariance that the linear model uses, for simplicity. This only requires the candidate design matrix, not the probabilties of the model, so it will simplify things  a lot.  

**The only adjustment that we would need to do with this simplification is to select the columns that actually affect the design. In this specification, we are ony using the columns age and gender, so we will choose only those for the design matrix. For simplicity, we will ignore that part, in a real situation, all columns in the design matrix are appear in the model, by construction, we would not be posing variables that we will not consider for the model!**

Calculate the covariance and efficiency of the 'full' design

In [65]:
cov_ord = calc_ord_cov(ord_design)

In [66]:
d_effic(cov_ord)

2.343382364721533e-06

Lets pick a small design, of 4 rows, to print the intuition.
Focus on the columns age and gender, see how separated they are, and then we will compared to a more 'optimal' design by random search.

In [67]:
N = 8
np.random.seed(123456)
ord_selected_rows = np.random.choice(ord_design.shape[0], N, replace=False)
d_effic ( calc_ord_cov ( ord_design.iloc[ord_selected_rows,:] ) )

0.00034870895102143736

In [68]:
ord_design.iloc[ord_selected_rows,:]

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena,age,gender
812,30000.0,220.0,2.0,30000.0,220.0,0.0,35,0
692,30000.0,170.0,3.0,20000.0,220.0,0.0,27,1
904,40000.0,130.0,1.0,20000.0,170.0,0.0,35,0
626,30000.0,170.0,1.0,30000.0,130.0,2.0,65,0
527,30000.0,130.0,2.0,30000.0,220.0,3.0,30,0
752,30000.0,220.0,0.0,40000.0,220.0,0.0,19,1
520,30000.0,130.0,2.0,30000.0,170.0,0.0,30,0
828,30000.0,220.0,3.0,20000.0,130.0,0.0,23,0


Do some random search, we will do a more exhaustive search since it is much faster than the MNL version.

In [69]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(22250):
  ord_selected_rows = np.random.choice(ord_design.shape[0], N, replace=False)
  ord_sub_design = ord_design.iloc[ord_selected_rows,:]
  sub_cov_ord = calc_ord_cov(ord_sub_design)
  d_ef = d_effic(sub_cov_ord)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic, i)
    best_rows = ord_selected_rows

New best efficiency found! 0.0007631219682135426 0
New best efficiency found! 0.0006775255983237997 1
New best efficiency found! 0.000642691719566003 2
New best efficiency found! 0.00046330095761941273 3
New best efficiency found! 0.0003839912944025 4
New best efficiency found! 0.00028271541650905067 7
New best efficiency found! 4.3412181021057375e-06 73
New best efficiency found! 3.851382892053248e-06 244
New best efficiency found! 3.468228873501343e-06 6753
New best efficiency found! 3.314087734027942e-06 10331
New best efficiency found! 2.6727535150524496e-06 11936
New best efficiency found! 2.299663002589456e-06 19556


In [70]:
ord_design.iloc[best_rows,:]

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena,age,gender
674,30000.0,170.0,2.0,40000.0,130.0,2.0,27,0
795,30000.0,220.0,2.0,20000.0,130.0,3.0,35,0
860,30000.0,220.0,3.0,40000.0,220.0,0.0,65,1
369,20000.0,220.0,2.0,20000.0,220.0,1.0,30,0
62,20000.0,130.0,1.0,40000.0,130.0,2.0,30,0
1216,40000.0,220.0,1.0,40000.0,170.0,0.0,19,0
580,30000.0,170.0,0.0,20000.0,170.0,0.0,30,1
707,30000.0,170.0,3.0,30000.0,220.0,3.0,19,1


**For the sake of illustration, we will focus on the design matrix that only
considers the columns `age` and `gender`, to exaggerate the effect of the search.
What we would expect is perfect balance (all levels with equal frequency of appearance) and maximum distance within each attribute (levels of age as far away as possible, gender is binary so it does not apply, if gender is balanced, then it is already maximum distance (half are zeros, half are ones).**

In [71]:
ord_design = ord_design[['age', 'gender']]

In [72]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(22250):
  ord_selected_rows = np.random.choice(ord_design.shape[0], N, replace=False)
  ord_sub_design = ord_design.iloc[ord_selected_rows,:]
  sub_cov_ord = calc_ord_cov(ord_sub_design)
  d_ef = d_effic(sub_cov_ord)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic, i)
    best_rows = ord_selected_rows

New best efficiency found! 0.05445146886802793 0
New best efficiency found! 0.03324776629274323 1
New best efficiency found! 0.02972747724080623 2
New best efficiency found! 0.029316353363113997 6
New best efficiency found! 0.02833161303556778 12
New best efficiency found! 0.028137576565958923 35
New best efficiency found! 0.025536839737668116 113
New best efficiency found! 0.02440276074480469 148
New best efficiency found! 0.0239624021402172 1771


Notice how we *almost* get the 'optimal' design, which should be: half of the design age=19, the other half age=65, and then combined with gender (getting also balance and orthogonality).

In [73]:
ord_design.iloc[best_rows,:]

Unnamed: 0,age,gender
1096,65,0
49,27,1
532,19,1
1187,65,0
38,35,1
1142,65,1
200,65,0
458,65,1


#Exercise 2) Encode the engine_type as binary dummy and find a good experiment of 100 rows.

The purpose is to practice with the binary encoding as it tends to appear in many experiments, when we take all variables as categorical (even when they are numeric, we put them into chunks to do piecewise constant approximation).

In [74]:
dset_dum = init_dset.copy()
dset_dum['engine_type_rena'] = dset_dum['engine_type_rena'].astype(int).astype(str)
dset_dum['engine_type_toyo'] = dset_dum['engine_type_toyo'].astype(int).astype(str)

In [75]:
dset_dum

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena,choice
0,20000.0,130.0,0,20000.0,130.0,0,2
1,20000.0,130.0,0,20000.0,130.0,1,1
2,20000.0,130.0,0,20000.0,130.0,2,1
3,20000.0,130.0,0,20000.0,130.0,3,1
4,20000.0,130.0,0,20000.0,170.0,0,1
...,...,...,...,...,...,...,...
1291,40000.0,220.0,3,40000.0,170.0,3,1
1292,40000.0,220.0,3,40000.0,220.0,0,1
1293,40000.0,220.0,3,40000.0,220.0,1,1
1294,40000.0,220.0,3,40000.0,220.0,2,1


In [76]:
dset_dum = pd.get_dummies(dset_dum, ['engine_type_rena', 'engine_type_toyo'])

In [77]:
qbus_update_globals_bgm(dset_dum)

In [78]:
ASC_toyo = exp.Beta ( 'ASC_toyo' ,0, None , None ,1)
ASC_rena = exp.Beta ( 'ASC_rena' ,0, None , None ,0)

B_price = exp.Beta ( 'B_price',0, None , None ,1)
B_power = exp.Beta ( 'B_power',0, None , None ,1)
B_et_pet = exp.Beta('B_et_pet', 0, None, None, 1)
B_et_dies = exp.Beta('B_et_dies', 0, None, None, 1)
B_et_hyb = exp.Beta('B_et_hyb', 0, None, None, 1)
B_et_elec = exp.Beta('B_et_elec', 0, None, None, 1)

In [79]:
V_toyo = ASC_toyo + B_price*price_toyo + B_power*power_toyo + B_et_pet*engine_type_toyo_0 + B_et_dies*engine_type_toyo_1 + B_et_hyb*engine_type_toyo_2 + B_et_elec*engine_type_toyo_3
V_rena = ASC_rena + B_price*price_rena + B_power*power_rena + B_et_pet*engine_type_rena_0 + B_et_dies*engine_type_rena_1 + B_et_hyb*engine_type_rena_2 + B_et_elec*engine_type_rena_3

V_base = {1: V_toyo,
     2: V_rena}

With the new model try finind a good design

In [80]:
model_base, results_base = qbus_estimate_bgm(V_base, dset_dum, 'choice', 'automob')



In [81]:
sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':0.01}, dset_dum)
np.round(sub_probs,2)

Unnamed: 0,1,2
0,0.5,0.5
1,0.5,0.5
2,0.5,0.5
3,0.5,0.5
4,0.5,0.5
...,...,...
1291,0.5,0.5
1292,0.5,0.5
1293,0.5,0.5
1294,0.5,0.5


In [84]:
dset_dum

Unnamed: 0,price_toyo,power_toyo,price_rena,power_rena,choice,engine_type_rena_0,engine_type_rena_1,engine_type_rena_2,engine_type_rena_3,engine_type_toyo_0,engine_type_toyo_1,engine_type_toyo_2,engine_type_toyo_3
0,20000.0,130.0,20000.0,130.0,2,1,0,0,0,1,0,0,0
1,20000.0,130.0,20000.0,130.0,1,1,0,0,0,0,1,0,0
2,20000.0,130.0,20000.0,130.0,1,1,0,0,0,0,0,1,0
3,20000.0,130.0,20000.0,130.0,1,1,0,0,0,0,0,0,1
4,20000.0,130.0,20000.0,170.0,1,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1291,40000.0,220.0,40000.0,170.0,1,0,0,0,1,0,0,0,1
1292,40000.0,220.0,40000.0,220.0,1,0,0,0,1,1,0,0,0
1293,40000.0,220.0,40000.0,220.0,1,0,0,0,1,0,1,0,0
1294,40000.0,220.0,40000.0,220.0,1,0,0,0,1,0,0,1,0


In [85]:
dset_dum = dset_dum.drop(['choice'], axis=1)

In [86]:
sub_covMNL = calc_mnl_cov(dset_dum, sub_probs, 2, 6)
sub_covMNL
np.linalg.det(sub_covMNL)

9.000577358808503e-70

In [87]:
np.seterr(all="ignore")

{'divide': 'ignore', 'over': 'ignore', 'under': 'ignore', 'invalid': 'ignore'}

In [103]:
N = 100

In [104]:
np.random.seed(1234)

best_rows = None
best_effic = 9999
for i in range(1250):
  selected_rows = np.random.choice(dset_dum.shape[0], N, replace=False)
  sub_design = dset_dum.iloc[selected_rows,:]
  sub_probs = qbus_simulate_bgm(model_base, {'ASC_rena':2}, sub_design)
  sub_covMNL = calc_mnl_cov(sub_design, sub_probs, 2, 6)
  d_ef = d_effic(sub_covMNL)
  if (d_ef < best_effic):
    best_effic = d_ef
    print('New best efficiency found!', best_effic)
    best_rows = selected_rows

New best efficiency found! 8.437019438625361e-05
New best efficiency found! 7.949248192788179e-05
New best efficiency found! 7.40392797043412e-05
New best efficiency found! 7.354615636130814e-05
New best efficiency found! 7.341118247591412e-05
New best efficiency found! 7.297042054644157e-05
New best efficiency found! 7.180647223462893e-05
New best efficiency found! 7.094856042115314e-05
New best efficiency found! 6.209755835016761e-05
New best efficiency found! 5.9470419038266256e-05


In [105]:
dset_dum.iloc[best_rows,:]

Unnamed: 0,price_toyo,power_toyo,price_rena,power_rena,engine_type_rena_0,engine_type_rena_1,engine_type_rena_2,engine_type_rena_3,engine_type_toyo_0,engine_type_toyo_1,engine_type_toyo_2,engine_type_toyo_3
1006,40000.0,130.0,40000.0,220.0,0,0,0,1,0,0,1,0
478,30000.0,130.0,20000.0,220.0,0,1,0,0,0,0,1,0
689,30000.0,170.0,20000.0,170.0,0,0,0,1,0,1,0,0
771,30000.0,220.0,30000.0,130.0,0,1,0,0,0,0,0,1
908,40000.0,130.0,20000.0,220.0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
805,30000.0,220.0,30000.0,130.0,0,0,1,0,0,1,0,0
863,30000.0,220.0,40000.0,220.0,0,0,0,1,0,0,0,1
600,30000.0,170.0,40000.0,130.0,1,0,0,0,1,0,0,0
48,20000.0,130.0,30000.0,130.0,0,1,0,0,1,0,0,0


In [106]:
dset_dum.iloc[best_rows, 4:]

Unnamed: 0,engine_type_rena_0,engine_type_rena_1,engine_type_rena_2,engine_type_rena_3,engine_type_toyo_0,engine_type_toyo_1,engine_type_toyo_2,engine_type_toyo_3
1006,0,0,0,1,0,0,1,0
478,0,1,0,0,0,0,1,0
689,0,0,0,1,0,1,0,0
771,0,1,0,0,0,0,0,1
908,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...
805,0,0,1,0,0,1,0,0
863,0,0,0,1,0,0,0,1
600,1,0,0,0,1,0,0,0
48,0,1,0,0,1,0,0,0


In [108]:
dset_dum.iloc[selected_rows, 4:]

Unnamed: 0,engine_type_rena_0,engine_type_rena_1,engine_type_rena_2,engine_type_rena_3,engine_type_toyo_0,engine_type_toyo_1,engine_type_toyo_2,engine_type_toyo_3
48,0,1,0,0,1,0,0,0
692,0,0,0,1,1,0,0,0
360,0,0,1,0,1,0,0,0
585,1,0,0,0,0,1,0,0
361,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...
410,0,0,0,1,0,0,1,0
407,0,0,0,1,0,0,0,1
1000,0,0,0,1,1,0,0,0
541,0,0,0,1,0,1,0,0


In [114]:
np.mean(np.abs(dset_dum.iloc[best_rows, 4:].mean(axis=0) - 0.25))

0.04

In [115]:
np.mean(np.abs( dset_dum.iloc[selected_rows, 4:].mean(axis=0) - 0.25))

0.05