# Practice QBUS3840 Test using the Heater dataset

You will analyze a classic dataset, from Kenneth Train, about heating system preferences.
Households face the decision of which heating system to install; for example, should we install a gas central heating system or a heat pump instead (among other options)?

Several variables that affect the decision are measured, such as the installation costs, operation costs for one year and characteristics such as the number of rooms in the house or the income of the household.
The dataset was gathered in California.



## Description of the dataset

Each row represents a different household, so the household is the 'decision-maker' in this scenario. Households are 'independent' of each other.

The variables in the dataset are:

**idcase:** The identifier of each individual, decision-maker.

**depvar**: A categorical variable indicating the choice of heating system. It is encoded in text *(we will turn it into numbers for biogeme in the preprocessing step)*. We have 5 alternatives.

 * 'gc': Gas central
 * 'gr': Gas room
 * 'ec': Electric central
 * 'er': Electric room
 * 'hp': Heat pump

**Installation cost variables**: The cost of installing each system. The variable names follow the pattern `ic_xx`, with xx being the code of the alternative, as in the `depvar` variable. For example, the column `ic_gc` holds the installation cost for the gas central alternative, and `ic_er` holds the installation cost for the electric room alternative.

**Operation costs**: Operation costs of each heating system for one year. The variable names are encoded in a similar fashion to the installation costs, so the column `oc_gr` holds the operation cost for the gas room alternative.

**rooms**: The number of rooms in the house, a numeric variable.

**agehed**: Age of the decision maker, considered as the 'household head'.

**income**: Yearly income of the household, in dollars.

**region**: A categorical variable encoding the location of the household within California. Four levels encoded with text (will be turned into numbers for processing in biogeme).
 * 'ncostl': Northern coastal region
 * 'scostl': Southern coastal region
 * 'mountn': Mountain region
 * 'valley': Valley region

---
---

# Preparing the environment


In [1]:
!pip install biogeme

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biogeme
  Downloading biogeme-3.2.10.tar.gz (1.8 MB)
Building wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... done
  Created wheel for biogeme: filename=biogeme-3.2.10-cp37-cp37m-linux_x86_64.whl size=4253805 sha256=fa2075f2b949602b08dcd3515d91c3b6008de4bdbe9f42d7156f13b8b014bfe8
  Stored in directory: /root/.cache/pip/wheels/5b/92/9b/63caa7ad9b2cd582de77d3701d10f7e8d041466f4a9d07d554
Successfully built biogeme
Installing collected packages: biogeme
Successfully installed biogeme-3.2.10


Load the packages; feel free to change the aliases.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools

# Load the dataset

In [3]:

heater_pd = pd.read_csv('https://github.com/pmontman/tmp_choicemodels/raw/main/data/heating.csv')


A simple look at the dataset.

In [4]:
heater_pd.head(5)

Unnamed: 0,idcase,depvar,ic_gc,ic_gr,ic_ec,ic_er,ic_hp,oc_gc,oc_gr,oc_ec,oc_er,oc_hp,income,agehed,rooms,region
0,1,gc,866.0,962.64,859.9,995.76,1135.5,199.69,151.72,553.34,505.6,237.88,7,25,6,ncostl
1,2,gc,727.93,758.89,796.82,894.69,968.9,168.66,168.66,520.24,486.49,199.19,5,60,5,scostl
2,3,gc,599.48,783.05,719.86,900.11,1048.3,165.58,137.8,439.06,404.74,171.47,4,65,2,ncostl
3,4,er,835.17,793.06,761.25,831.04,1048.7,180.88,147.14,483.0,425.22,222.95,2,50,4,scostl
4,5,er,755.59,846.29,858.86,985.64,883.05,174.91,138.9,404.41,389.52,178.49,2,25,6,valley


Data cleaning (not needed in an exam): The `depvar` variable stores the chosen alternative as a string, but biogeme needs integers (starting at 1). So we re-encode the `depvar` variable as an integer using the pandas `factorize` function.

**Be careful with the encoding! According to `factorize`, in this dataset, the corresponding numbers will be:**
 1. gas central
 2. electric room
 3. gas room
 4. heat pump
 5. electric central

---



In [5]:
depvar_factor = pd.factorize(heater_pd['depvar'])

heater_pd['depvar'] = depvar_factor[0] + 1
depvar_factor[1]

Index(['gc', 'er', 'gr', 'hp', 'ec'], dtype='object')
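
The ordering above follows from how `factorize` works: it numbers the labels in order of first appearance, not alphabetically. A minimal sketch on toy data illustrates this (the first five values mirror the first five rows shown earlier; the remaining values and the names `toy_choices` and `code_to_label` are made up for illustration):

```python
import pandas as pd

# Toy choices: first five match the rows printed above, the rest are made up.
toy_choices = pd.Series(['gc', 'gc', 'gc', 'er', 'er', 'gr', 'hp', 'ec'])

# factorize returns integer codes (starting at 0) and the unique labels,
# both ordered by first appearance in the data.
codes, labels = pd.factorize(toy_choices)
codes = codes + 1  # shift so that the codes start at 1, as biogeme expects

# Lookup table from integer code back to the original text label.
code_to_label = {i + 1: label for i, label in enumerate(labels)}
print(code_to_label)
```

Keeping a lookup table like this around can help when reading the estimation results later on.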

We will encode the `region` variable as numbers via binary (dummy) encoding, using the `get_dummies` function from pandas. We can use the efficient binary encoding, which treats one of the levels of region as the baseline (saving one variable), or the explicit encoding, which creates one variable per level.
Let's go for the explicit encoding, since it is easier to interpret.

We will also print a snapshot of the resulting dataset, already clean and ready for analysis.

In [6]:
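# Name the column explicitly and request integer (0/1) dummies,
# since newer pandas versions return booleans by default.
heater_pd = pd.get_dummies(heater_pd, columns=['region'], dtype=int)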

heater_pd.head(5)

Unnamed: 0,idcase,depvar,ic_gc,ic_gr,ic_ec,ic_er,ic_hp,oc_gc,oc_gr,oc_ec,oc_er,oc_hp,income,agehed,rooms,region_mountn,region_ncostl,region_scostl,region_valley
0,1,1,866.0,962.64,859.9,995.76,1135.5,199.69,151.72,553.34,505.6,237.88,7,25,6,0,1,0,0
1,2,1,727.93,758.89,796.82,894.69,968.9,168.66,168.66,520.24,486.49,199.19,5,60,5,0,0,1,0
2,3,1,599.48,783.05,719.86,900.11,1048.3,165.58,137.8,439.06,404.74,171.47,4,65,2,0,1,0,0
3,4,2,835.17,793.06,761.25,831.04,1048.7,180.88,147.14,483.0,425.22,222.95,2,50,4,0,0,1,0
4,5,2,755.59,846.29,858.86,985.64,883.05,174.91,138.9,404.41,389.52,178.49,2,25,6,0,0,0,1
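
The difference between the explicit and the efficient encoding discussed above can be seen in a small sketch. The toy data frame below is made up for illustration; only the four region labels come from the dataset.

```python
import pandas as pd

# Toy data frame with one household per region (illustrative only).
toy = pd.DataFrame({'region': ['ncostl', 'scostl', 'mountn', 'valley']})

# Explicit encoding: one 0/1 column per level of region.
explicit = pd.get_dummies(toy, columns=['region'], dtype=int)

# Efficient encoding: the first level in sorted order ('mountn') is dropped
# and acts as the baseline, saving one column.
efficient = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)

print(list(explicit.columns))
print(list(efficient.columns))
```

With the explicit encoding, remember that the four region columns always sum to one, so a model cannot include all four together with a constant for the same alternative.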


---
---

# 1) Fit a model with alternative-specific constants and shared parameters for installation cost and operation cost. Select one of the alternatives as the reference (pick the one that you prefer). Comment on the results: signs of the variables and of the alternative-specific constants.

---
---

# 2) Calculate the willingness to pay for reducing operating cost.
* *In this case, we have two 'price' variables (installation and operation): Operating cost is the variable we want to understand, installation cost is the price variable in the WTP formula.*
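
As a reminder of the arithmetic, the willingness to pay is the ratio of the coefficient of the variable of interest to the coefficient of the price variable. The sketch below assumes a linear utility with one shared coefficient per cost; the function name and the coefficient values are hypothetical placeholders, not estimates from this dataset.

```python
def willingness_to_pay(beta_target, beta_price):
    """Ratio of the target coefficient to the price coefficient."""
    return beta_target / beta_price


# Hypothetical placeholder coefficients, NOT estimated from the data.
beta_oc = -0.008  # operation cost (variable of interest)
beta_ic = -0.002  # installation cost (price variable)

wtp = willingness_to_pay(beta_oc, beta_ic)
print(wtp)
```

Under these placeholder values, the ratio would be read as the extra installation cost a household would accept in exchange for a one-dollar reduction in yearly operation cost. Replace the placeholders with the estimates from your own model.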



---
---

# 3) Do big houses with many rooms (5, 6, 7) have different preferences than the rest (4 rooms or fewer)?
* *Might or might not need to fit another model.*
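
One possible starting point, offered as an assumption rather than the intended answer, is a 0/1 indicator for big houses that can later be used to split the sample or to interact with other variables. The toy values and the column name `big_house` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the `rooms` column (values are made up).
toy = pd.DataFrame({'rooms': [2, 4, 5, 6, 7]})

# 1 for houses with 5 or more rooms, 0 for houses with 4 rooms or fewer.
toy['big_house'] = (toy['rooms'] >= 5).astype(int)

print(toy)
```

The same expression can be applied to `heater_pd['rooms']` before the data frame is passed to biogeme.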

---
---

# 4) Create a more complex model that includes at least one *interaction* variable between an attribute and a characteristic (product of two variables). Comment on the interpretation of the model. Comment on the per-alternative Willingness To Pay for operating cost, and how it compares to the answer in Question 2.
# Is the model a better fit than the one created in Question 1?


---
---

# 5) Do the people of the 'valley' region have a significantly different utility relationship for installation cost and operation cost, compared to the other regions?
*You might need to fit one model (or two) to answer this question.*


---
---

# 6) Due to a 'Special Operation', it is expected that the supply of gas will be completely cut, that is, the two alternatives that use gas, 'gas central' and 'gas room', will not be available. The households that *chose one of those alternatives* will have to move to another heating system. Calculate the installation cost that will be incurred due to this change for the population. Use the model fitted in Question 4.