<a href="https://colab.research.google.com/github/pmontman/tmp_choicemodels/blob/main/nb/tutorials/WK_10_tuto_efficient_designs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to design efficient experiments

In this lecture we are going to

* Properly define what an efficient design is
* Use the mathematical definition to create good designs
* Give some final guidelines on design of experiments for choice modelling

In [58]:
!pip install biogeme

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biogeme
  Downloading biogeme-3.2.10.tar.gz (1.8 MB)
Building wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... done
  Created wheel for biogeme: filename=biogeme-3.2.10-cp37-cp37m-linux_x86_64.whl size=4253311 sha256=0d0caa53c1a05a6aef751ebadebd8b47b6b0aa63b9bc21886536207049f26f45
  Stored in directory: /root/.cache/pip/wheels/5b/92/9b/63caa7ad9b2cd582de77d3701d10f7e8d041466f4a9d07d554
Successfully built biogeme
Installing collected packages: biogeme
Successfully installed biogeme-3.2.10


In [59]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools

In [60]:
betas = np.matrix(' 1 1; 2 2')

In [61]:
def choice_prob(betas, X):
  V = np.matmul(X, betas)
  P = np.exp(V)
  return P / np.sum(P, axis = 1)

In [62]:
colnames = ['price_apple', 'size_apple', 'os_apple', 'price_android', 'size_android', 'os_android']

# Working example: discrete choice experiment for automobile preferences

We want to understand the population's preferences for cars. We will consider the following variables:

`price`, `power`, `engine_type`

Ignoring all realistic values, let's go for:

* Prices: 20000, 30000, 40000 AUD.
* Power: 130 hp, 170 hp, 220 hp.
* Engine types, encoded as integers initially: 0=petrol, 1=diesel, 2=hybrid, 3=electric. The code below uses five engine codes, 0 to 4.



# Creating all combinations of variables and values

We can create the full factorial design in Python by taking the Cartesian product of the attribute levels with scikit-learn's `cartesian` function.

Here is an example of its use.

In [63]:
from sklearn.utils.extmath import cartesian

full_fact = pd.DataFrame(cartesian(([20000.0, 30000.0, 40000.0], [130, 170, 220], [0, 1, 2, 3, 4])), columns=['price', 'power', 'engine_type'])
full_fact

Unnamed: 0,price,power,engine_type
0,20000.0,130.0,0.0
1,20000.0,130.0,1.0
2,20000.0,130.0,2.0
3,20000.0,130.0,3.0
4,20000.0,130.0,4.0
5,20000.0,170.0,0.0
6,20000.0,170.0,1.0
7,20000.0,170.0,2.0
8,20000.0,170.0,3.0
9,20000.0,170.0,4.0


These are all possible combinations of price, power and engine type (45 rows; only the first 10 are shown).

In practice the full factorial could be too large to compute. We do not need to
actually compute it *completely* for what we want to do, which is finding the best subset of size $N$ out of the factorial.
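Because each row of the full factorial corresponds to exactly one index in a mixed-radix numbering of the levels, a candidate subset can be drawn without building the whole table. The sketch below is one way to do this with the standard library only; the helper name `row_from_index` is an illustrative assumption and is not part of the notebook.

```python
import random

# Levels of each attribute (the values used in the code cells: five engine codes)
levels = [
    [20000.0, 30000.0, 40000.0],  # price
    [130, 170, 220],              # power
    [0, 1, 2, 3, 4],              # engine_type
]

def row_from_index(index, levels):
    """Decode a row number of the full factorial into one level per attribute.

    The last attribute varies fastest, as in the table printed above.
    """
    row = []
    for attr_levels in reversed(levels):
        index, pos = divmod(index, len(attr_levels))
        row.append(attr_levels[pos])
    return row[::-1]

# Size of the full factorial: product of the number of levels
n_rows = 1
for attr_levels in levels:
    n_rows *= len(attr_levels)

random.seed(0)
picked = random.sample(range(n_rows), 8)            # 8 distinct row numbers
subset = [row_from_index(i, levels) for i in picked]
```

Only the 8 sampled rows are ever built, so the same idea scales to factorials that are too large to hold in memory.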

For example, imagine that we have two car manufacturers, Toyota and Renault, and each one identifies one alternative. Comparing all possible attribute combinations for Toyota against all possible combinations for Renault gives $45 \times 45 = 2025$ entries in the full factorial.

Let us create the full experiment comparing the two alternatives. We could use the `cartesian` product again, but in this case we `merge` together two copies of the attribute table that we computed earlier. `merge` takes two dataframes and a pair of `suffixes` to tell apart the repeated column names.

In [64]:
design_mat = pd.merge(full_fact, full_fact, how='cross', suffixes=('_toyo', '_rena'))
design_mat

Unnamed: 0,price_toyo,power_toyo,engine_type_toyo,price_rena,power_rena,engine_type_rena
0,20000.0,130.0,0.0,20000.0,130.0,0.0
1,20000.0,130.0,0.0,20000.0,130.0,1.0
2,20000.0,130.0,0.0,20000.0,130.0,2.0
3,20000.0,130.0,0.0,20000.0,130.0,3.0
4,20000.0,130.0,0.0,20000.0,130.0,4.0
...,...,...,...,...,...,...
2020,40000.0,220.0,4.0,40000.0,220.0,0.0
2021,40000.0,220.0,4.0,40000.0,220.0,1.0
2022,40000.0,220.0,4.0,40000.0,220.0,2.0
2023,40000.0,220.0,4.0,40000.0,220.0,3.0


# Efficiency

Recall that we want to find the subset of $N$ rows from the full experiment that maximizes the efficiency of the resulting experiment.
There are several concepts of efficiency, all relatively similar, but we will
focus on D-efficiency. Roughly speaking, D-efficiency tries to make the covariance matrix of the coefficients, $\text{covariance}(\beta)$, as 'small' as possible, where size is measured by its determinant.
In discrete choice, the formula for the covariance matrix of the coefficients is a bit more complex than for linear regression.


$$\text{covariance}(\beta) = (Z' P Z )^{-1}$$

when working with $J$ alternatives:
* $P$ is the matrix of choice probabilities computed by the model.
* $Z$ is similar to the design matrix, but 'centered' using the choice probabilities. Basically, from each row of observations we subtract the weighted mean of the variables across all alternatives. The weights are the choice probabilities computed by the model.

 $$z_{jn} = x_{jn} - \sum_{i=1}^Jx_{in}P_{in}$$

To compute the $Z$ matrix, we need the choice probabilities. In our context we do not know them yet, so we need to work with an initial guess. This initial guess usually comes from 'initial' values for the coefficients that create equal choice probabilities, basically a 'no-information' starting model. In some cases we might get a better starting guess, for example if we have data from a similar problem or from a similar experiment.
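To make the centering concrete, here is a minimal sketch that evaluates the formulas above for a tiny two-alternative design. The function name `mnl_covariance`, the array layout (one array of attributes per alternative) and the example numbers are assumptions made for illustration. It sums $P_{jn} z_{jn} z_{jn}'$ over alternatives and rows, which is the $Z'PZ$ product written out, and uses all-zero coefficients as the 'no-information' starting guess.

```python
import numpy as np

def mnl_covariance(X_alts, beta):
    """Covariance of the MNL coefficients for a design.

    X_alts: list with one (N, K) array of attributes per alternative.
    beta:   length-K vector of assumed (prior) coefficients.
    """
    X = np.stack([np.asarray(Xj, dtype=float) for Xj in X_alts])  # (J, N, K)
    V = X @ np.asarray(beta, dtype=float)                         # (J, N) utilities
    V = V - V.max(axis=0)                                         # avoid overflow in exp
    P = np.exp(V) / np.exp(V).sum(axis=0)                         # (J, N) choice probabilities
    xbar = (P[:, :, None] * X).sum(axis=0)                        # (N, K) weighted mean per row
    Z = X - xbar                                                  # centered attributes z_jn
    info = np.einsum('jn,jnk,jnl->kl', P, Z, Z)                   # Z'PZ
    return np.linalg.inv(info), P, Z

# Three choice situations, two attributes (price in thousands, power in hundreds of hp)
X_a = np.array([[20, 1.3], [30, 2.2], [40, 1.7]])
X_b = np.array([[30, 1.7], [20, 1.3], [30, 2.2]])
cov, P, Z = mnl_covariance([X_a, X_b], beta=[0.0, 0.0])
```

With zero coefficients every probability is 0.5, so each centered row is plus or minus half the difference between the two alternatives.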

# Creating the centered design matrix $Z$

We need some initial model that we can use to compute choice probabilities.
We can use a biogeme model, or some manual computation.
Let's try biogeme.

Load some auxiliary functions first:

---
---

# Auxiliary functions

The first function takes the dictionary of utilities, a pandas dataframe, and the name of the column that contains the observed choices. It returns the biogeme object with the model and the estimated 'results' object (the one from which we get the coefficient values, likelihoods, etc.).
We have added the dictionary with the utilities to the biogeme object, in case we use it later.

In [89]:
def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
 av = {1: 1,
       2: 1}
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprob = models.loglogit (V , av , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V.copy()
 return bgm_model, bgm_model.estimate()

The next function will calculate the predictions for a given biogeme object that was estimated with `qbus_estimate_bgm`. The output is the array with the choice probabilities, which can be used to calculate accuracies, confusion matrices and the output of what-if scenarios.

In [90]:
def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
  av_auto = qbus_bgm_model.utility_dic.copy()
  for key, value in av_auto.items():
   av_auto[key] = 1

  targets = qbus_bgm_model.utility_dic.copy()
  for key, value in targets.items():
   targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

  bgm_db = db.Database('simul', pred_pd_df)
  globals().update(bgm_db.variables)
  bgm_pred_model = bio.BIOGEME(bgm_db, targets)
  simulatedValues = bgm_pred_model.simulate(betas)
  return simulatedValues

The function `qbus_calc_accu_confusion` calculates the accuracy and the confusion matrix given the predicted choice probabilities, a pandas dataset, and the name of the column that contains the actual choices in that dataset.

In [91]:
def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix 

The next function runs the likelihood ratio test with a bit less code than the default biogeme function. The arguments are the results objects of the two models to be compared. The first is the more complex model and the second is the reference model (**the order is important!**). The third argument is the significance level for the test.

In [92]:
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)
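For reference, the statistic behind this test is $-2(\ell_{\text{reference}} - \ell_{\text{complex}})$, compared with a chi-square distribution whose degrees of freedom equal the difference in the number of parameters. The sketch below reproduces that calculation with the standard library only; the function names and the log-likelihood values are hypothetical, and it is not biogeme's implementation.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution with integer df >= 1."""
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(x; nu + 2) = Q(x; nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    if df % 2:
        q, nu = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, nu = math.exp(-x / 2.0), 2
    while nu < df:
        q += (x / 2.0) ** (nu / 2.0) * math.exp(-x / 2.0) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def likelihood_ratio(ll_complex, k_complex, ll_reference, k_reference, signif_level=0.05):
    """Return the LR statistic, its p-value and whether the reference model is rejected."""
    stat = -2.0 * (ll_reference - ll_complex)
    p_value = chi2_sf(stat, k_complex - k_reference)
    return stat, p_value, p_value < signif_level

# Hypothetical log-likelihoods: the complex model adds 2 parameters
stat, p_value, reject = likelihood_ratio(-1200.0, 5, -1210.0, 3)
```

Swapping the two models flips the sign of the statistic, which is why the order of the arguments matters.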

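The next function just updates the globals with the biogeme variables of a dataframe, so that we can use the column names directly when writing utility expressions.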

In [93]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

---
---

# Estimating an initial model

We will create a dataset for the biogeme model.

In [94]:
init_dset = design_mat.copy()
init_dset['choice'] = 1
qbus_update_globals_bgm(init_dset)

In [95]:
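# Note: this creates a new column named 0 filled with 2; it does not modify 'choice'.
# To set the choice of the first row instead, use init_dset.loc[0, 'choice'] = 2.
init_dset[0] = 2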


In [99]:
ASC_toyo = exp.Beta ( 'ASC_toyo' ,0, None , None ,1)
ASC_rena = exp.Beta ( 'ASC_rena' ,0, None , None ,0)

B_price = exp.Beta ( 'B_price',0, None , None ,1)
B_power = exp.Beta ( 'B_power',0, None , None ,1)
B_engine_type = exp.Beta('B_engine_type', 0, None, None, 1)

In [100]:
V_toyo = ASC_toyo + B_price*price_toyo + B_power*power_toyo + B_engine_type*engine_type_toyo
V_rena = ASC_rena + B_price*price_rena + B_power*power_rena + B_engine_type*engine_type_rena

V_base = {1: V_toyo,
     2: V_rena}

In [113]:
model_base, results_base = qbus_estimate_bgm(V_base, init_dset, 'choice', 'automob')



In [121]:
# Only ASC_rena is estimated; the other coefficients are fixed at 0
results_base.getBetaValues()

{'ASC_rena': -15.5105542762932}

In [146]:
init_choice_probs=qbus_simulate_bgm(model_base, {'ASC_rena':0.01}, init_dset)
init_choice_probs

Unnamed: 0,1,2
0,0.4975,0.5025
1,0.4975,0.5025
2,0.4975,0.5025
3,0.4975,0.5025
4,0.4975,0.5025
...,...,...
2020,0.4975,0.5025
2021,0.4975,0.5025
2022,0.4975,0.5025
2023,0.4975,0.5025


In [150]:
# Repeat each alternative's probability once per attribute: columns are [P1, P1, P1, P2, P2, P2]
P_rep = np.repeat(init_choice_probs.to_numpy(), [3,3], axis=1)

In [171]:
# Probability-weighted mean of each attribute across the two alternatives,
# repeated so that it lines up with the six columns of design_mat
XP = design_mat.to_numpy()*P_rep
XP_rep = np.tile(XP[:, :3] + XP[:, 3:], 2)

In [173]:
# Centered design: e.g. in row 1 only engine_type differs (0 vs 1),
# so its entries are about -0.5025 (Toyota) and 0.4975 (Renault) and the rest are 0
Z = design_mat - XP_rep
Z

In [197]:
# Z'PZ summed over the two alternatives: a 3x3 matrix, one row per coefficient
Zn = Z.to_numpy()
ZPZ = np.matmul(Zn[:, :3].T, P_rep[:, :3]*Zn[:, :3]) + np.matmul(Zn[:, 3:].T, P_rep[:, 3:]*Zn[:, 3:])
covMNL = np.linalg.inv(ZPZ)
np.linalg.det(covMNL)

In [196]:
def d_effic(covMAT):
  # D-error: determinant of the covariance matrix, normalised by the number of coefficients K
  return np.power( np.linalg.det(covMAT), 1 / covMAT.shape[0] )


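With the choice probabilities, we can compute the covariance matrix of a candidate design. The function below expects the attributes of the two alternatives side by side (as in `design_mat`) and a list whose first element holds the coefficients.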

In [None]:
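def cov_mnl(Xj, J, betas):
  # Split the design into the attribute columns of each of the two alternatives (J is not used)
  Xj = np.hsplit(np.array(Xj), 2)
  # Utilities of both alternatives, using the same coefficients betas[0]
  P = np.hstack( [np.matmul(Xj[0], betas[0].T ), np.matmul(Xj[1], betas[0].T )])
  P = np.exp(P)
  PP = P / np.sum(P, axis = 1)
  # Diagonal matrix with the choice probabilities of the first alternative
  P0D = np.diag(np.asarray(PP[:,0]).ravel())
  # Simplification: uses the uncentered attributes of the first alternative only
  return np.linalg.inv(np.matmul( np.matmul(Xj[0].T, P0D), Xj[0]))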

And now we calculate it for a random subset of 10 rows of the full experiment.

In [None]:
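# full_factorial is not defined earlier; it is assumed here to be the two-alternative experiment design_mat
full_factorial = design_mat
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 10, replace=False), :]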

In [None]:
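# Assumed prior coefficients for (price, power, engine_type).
# With prices in the tens of thousands, 0.5*price overflows np.exp, so rescale price before using these values.
betas = [ np.matrix('0.5 0.1 1.1')]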
betas[0]

In [None]:
pd.DataFrame(cov_mnl(sub_fact, 2, betas))

In [None]:
def deffic_mnl(X, J, betas):
  covX = cov_mnl(X, J, betas)
  # D-error: determinant of the covariance matrix, normalised by the number of coefficients K
  return np.power( np.linalg.det(covX), 1 / covX.shape[0] )

In [None]:
deffic_mnl(sub_fact, 2, betas)

In [None]:
deffic_mnl(full_factorial, 2, betas)

# Relationship to the principles of design of experiments

Recall the four principles

1. Level balance
2. Orthogonality
3. Minimal level overlap
4. Utility balance


These principles are all summarized in the D-efficiency, meaning that they are 'rules of thumb' to create designs with good efficiency. Nowadays we can just put the computer to work to find a good design. Before that, we used to pick the design manually by following the principles, so it is still important to get an intuition of how they work.
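As a small illustration of the first principle, level balance can be checked by counting how often each level of an attribute appears in a candidate design. The design below and the helpers `level_counts` and `is_balanced` are made up for the example.

```python
from collections import Counter

def level_counts(design, column):
    """Count how many rows of the design use each level of one attribute column."""
    return Counter(row[column] for row in design)

def is_balanced(design, column):
    """A column is level balanced when every level it uses appears equally often."""
    counts = level_counts(design, column).values()
    return len(set(counts)) == 1

# Hypothetical 6-row design with columns (price, power)
design = [
    (20000, 130), (20000, 170),
    (30000, 170), (30000, 220),
    (40000, 220), (40000, 220),
]

price_counts = level_counts(design, 0)   # each price appears twice
power_counts = level_counts(design, 1)   # 130 once, 170 twice, 220 three times
```

Here the price column is balanced while the power column is not, so this design would carry less information about the lowest power level than about the highest one.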


# Example: Level balance and overlap



In [None]:
np.random.seed(1234) 
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 20, replace=False), :]
sub_fact

In [None]:

sub_fact = sub_fact[[ 0, 5, 8, 9, 13, 15,  17, 2],:]


In [None]:
sub_fact

In [None]:
deffic_mnl(sub_fact, 2, betas)

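Compare with random experiments of the same size: the cell below evaluates 20 random experiments of 8 rows. Because `deffic_mnl` is based on the determinant of the covariance matrix, lower values indicate a more efficient design.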

In [None]:
np.random.seed(1234) 
[deffic_mnl(np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 8, replace=False), :], 2, betas) for i in range(20)]
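The random search above can be turned into a simple procedure that keeps the best candidate found. The following is a sketch under stated assumptions, not the notebook's method: it uses a small hypothetical candidate set with attributes rescaled to a similar range, all-zero prior coefficients (equal choice probabilities), and the D-error $\det(\text{covariance})^{1/K}$, for which lower is better.

```python
import numpy as np

def d_error_equal_probs(X_a, X_b):
    """D-error of a two-alternative design when both choice probabilities are 0.5.

    With equal probabilities z = +/- 0.5 * (x_a - x_b), so Z'PZ = 0.25 * D'D with D = X_a - X_b.
    Returns np.inf when the information matrix is singular.
    """
    D = X_a - X_b
    info = 0.25 * D.T @ D
    det = np.linalg.det(info)
    if det <= 1e-12:
        return np.inf
    return (1.0 / det) ** (1.0 / info.shape[0])

rng = np.random.default_rng(1234)

# Hypothetical candidate set: all pairs of profiles with rescaled price and power
profiles = np.array([[p, w] for p in (2.0, 3.0, 4.0) for w in (1.3, 1.7, 2.2)])
pairs = np.array([(i, j) for i in range(len(profiles)) for j in range(len(profiles))])

best_err, best_rows = np.inf, None
for _ in range(200):
    rows = rng.choice(len(pairs), size=8, replace=False)   # one random 8-row design
    err = d_error_equal_probs(profiles[pairs[rows, 0]], profiles[pairs[rows, 1]])
    if err < best_err:
        best_err, best_rows = err, rows
```

A random search like this is the simplest option; more iterations, or swapping one row at a time, would usually find better designs at a higher cost.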

# Orthogonality

We pick rows in which attributes (columns) 2 and 3 cannot be told apart.

In [None]:
np.random.seed(1234) 
sub_fact = np.array(full_factorial)[np.random.choice(full_factorial.shape[0], 20, replace=False), :]
sub_fact
sub_fact_orth = sub_fact[[ 0, 1, 3, 7, 9, 11, 12, 15],:]
sub_fact_orth

In [None]:

deffic_mnl(sub_fact_orth, 2, betas)
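One simple way to see whether two attribute columns can be told apart is to look at their correlation within the chosen rows: a correlation of plus or minus 1 means the design cannot separate their effects, and values near 0 indicate orthogonal columns. The two designs below are invented for illustration.

```python
import numpy as np

# Hypothetical design in which the third column is an exact multiple of the second
confounded = np.array([
    [20.0, 1.0, 2.0],
    [30.0, 2.0, 4.0],
    [40.0, 3.0, 6.0],
    [20.0, 3.0, 6.0],
])

# Hypothetical design in which all three columns are mutually orthogonal
orthogonal = np.array([
    [20.0, 1.0, 2.0],
    [30.0, 1.0, 4.0],
    [30.0, 3.0, 2.0],
    [20.0, 3.0, 4.0],
])

# Correlation matrices between columns (rowvar=False treats columns as variables)
corr_confounded = np.corrcoef(confounded, rowvar=False)
corr_orthogonal = np.corrcoef(orthogonal, rowvar=False)

# With confounded columns the centered design is rank deficient,
# so the three coefficients cannot all be identified
rank_confounded = np.linalg.matrix_rank(confounded - confounded.mean(axis=0))
```

In the confounded design the second and third columns have correlation 1 and the rank drops to 2, which is the situation that makes the covariance matrix blow up.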

# The workflow

1) Define attributes and levels


2) Pilot Study

3) Design of the Experiment

4) Design the Survey

5) Conduct the survey and data analysis

# Recommendations


* **Which variables should we choose?**
 Create an exhaustive list of attributes, then reduce it to between 3 and 7 by discarding some and merging others (important combinations of a pair of attributes). For example, screen size and speed can be merged if they do not really vary independently (no small fast smartphones): just create a new categorical attribute with a few levels for the realistic combinations.

* **How do we choose the levels?**
 Try a large range and pick the best subset using a computer.

* **How many alternatives?**
 People can handle 2 to 3 alternatives per choice before decision fatigue sets in.