# Predicting Airbnb Prices



In [1]:
import pandas as pd
import numpy as np
from gurobipy import *

In [2]:
data = pd.read_csv('AirbnbTrain.csv')
data

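| | latitude | longitude | Entire home | accommodates | bathrooms | bedrooms | beds | cleaning_fee | minimum_nights | number_of_reviews | review_scores_rating | instant_bookable | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.103701 | -118.332241 | 1 | 13 | 2.0 | 3 | 2 | 150 | 2 | 1 | 100 | 1 | 350 |
| 1 | 34.099484 | -118.331645 | 1 | 8 | 2.0 | 2 | 4 | 150 | 1 | 11 | 96 | 1 | 190 |
| 2 | 34.104321 | -118.329662 | 1 | 4 | 1.0 | 0 | 1 | 55 | 1 | 1 | 80 | 0 | 85 |
| 3 | 34.101028 | -118.317848 | 0 | 2 | 1.0 | 1 | 1 | 20 | 1 | 8 | 98 | 0 | 75 |
| 4 | 34.098292 | -118.324980 | 1 | 2 | 1.0 | 1 | 1 | 20 | 1 | 11 | 96 | 0 | 130 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | 34.104790 | -118.311298 | 1 | 2 | 1.0 | 1 | 1 | 65 | 2 | 4 | 100 | 1 | 125 |
| 1696 | 34.104793 | -118.334680 | 1 | 6 | 2.0 | 2 | 3 | 70 | 1 | 56 | 91 | 1 | 149 |
| 1697 | 34.104810 | -118.313040 | 0 | 4 | 1.0 | 1 | 2 | 20 | 1 | 21 | 96 | 1 | 90 |
| 1698 | 34.103203 | -118.342882 | 1 | 3 | 1.0 | 1 | 1 | 60 | 1 | 23 | 90 | 1 | 130 |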


## Model 1
Formulate the least absolute deviations regression problem as a linear program. Solve
the linear program using the data given in the file AirbnbTrain.csv. What is the prediction
error, in $/night, of your model on the test set (provided in AirbnbTest.csv)?

\begin{align}
\text{Decision variables: } & \beta_{j},\ j = 1, \dots, d \quad \text{(regression coefficients)} \\
& z_{i},\ i = 1, \dots, n \quad \text{(absolute prediction error for listing $i$)}
\end{align}

\begin{align}
\underset{\boldsymbol{\beta},\, \mathbf{z}}{\text{min}} \;\; & \frac{1}{n} \sum_{i=1}^{n} z_{i} \\
\text{s.t.} \;\; & z_{i} \geq y_{i} - \sum_{j=1}^{d} \beta_{j} x_{ij}, \quad i = 1, \dots, n \\
& z_{i} \geq \sum_{j=1}^{d} \beta_{j} x_{ij} - y_{i}, \quad i = 1, \dots, n
\end{align}
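
The pair of inequalities is what makes the absolute value linear: for a fixed prediction, the smallest `z_i` that satisfies both constraints is the larger of the two residual signs, which equals the absolute error, and minimizing the objective pushes `z_i` down to that value. A minimal sketch of this step, where the helper `smallest_feasible_z` and the numbers are illustrative assumptions rather than part of the notebook:

```python
def smallest_feasible_z(y_i, pred_i):
    """Smallest z with z >= y_i - pred_i and z >= pred_i - y_i."""
    return max(y_i - pred_i, pred_i - y_i)


# Made-up (price, prediction) pairs, not taken from the dataset
examples = [(350.0, 300.0), (85.0, 120.0), (75.0, 75.0)]
for y_i, pred_i in examples:
    z_i = smallest_feasible_z(y_i, pred_i)
    # The tightest feasible z coincides with the absolute error
    print(y_i, pred_i, z_i, z_i == abs(y_i - pred_i))
```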

In [3]:
x = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

In [4]:
# number of price listings
n = len(y)
#number of features
d = len(x[0])

In [5]:
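# Initializing Model 1
mod_1 = Model()

# Defining the decision variables: z[i] are absolute errors, b[j] are coefficients.
# Note: addVars defaults to a lower bound of 0, so b[j] is restricted to be
# non-negative here; pass lb=-GRB.INFINITY to leave the coefficients unrestricted.
z = mod_1.addVars(n)
b = mod_1.addVars(d)

# Defining the constraints
cons_1 = mod_1.addConstrs(z[i] >= y[i] - sum(b[j] * x[i][j] for j in range(d)) for i in range(n))
cons_2 = mod_1.addConstrs(z[i] >= sum(b[j] * x[i][j] for j in range(d)) - y[i] for i in range(n))

# Objective function: mean absolute error on the training set
mod_1.setObjective((1/n) * sum(z[i] for i in range(n)), GRB.MINIMIZE)

mod_1.update()
mod_1.optimize()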

Set parameter Username
Academic license - for non-commercial use only - expires 2025-01-14
Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 23.3.0 23D56)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 3400 rows, 1712 columns and 41372 nonzeros
Model fingerprint: 0x49c47e75
Coefficient statistics:
  Matrix range     [5e-01, 5e+02]
  Objective range  [6e-04, 6e-04]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+01, 2e+03]
Presolve time: 0.01s
Presolved: 3400 rows, 1712 columns, 41372 nonzeros

Concurrent LP optimizer: primal simplex, dual simplex, and barrier
Showing barrier log only...

Ordering time: 0.00s

Barrier statistics:
 Dense cols : 12
 AA' NZ     : 2.995e+04
 Factor NZ  : 3.260e+04 (roughly 2 MB of memory)
 Factor Ops : 4.141e+05 (less than 1 second per iteration)
 Threads    : 1

                  Objective                Residual
Iter       Primal          Dual         Pri

In [6]:
opt = mod_1.objval
print(opt)

36.426247398213434


In [7]:
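# Fitted absolute error for each training listing (kept separate from the objective value in opt)
train_abs_errors = [z[i].x for i in range(n)]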

In [8]:
optimal_coefficients = [b[j].x for j in range(d)]
optimal_coefficients

[290.2514663518523,
 84.03092762130778,
 36.78323755553825,
 9.936817142125033,
 31.69433118878754,
 19.703957114478886,
 0.0,
 0.31061628141023223,
 0.0,
 0.0,
 0.2681158824601272,
 5.167454373349817]

In [9]:
test = pd.read_csv('AirbnbTest.csv')
test_y = test['price']
test_x = test.iloc[:,:-1].values

### 1. What is the prediction error, in $/night, of your model on the test set?

In [10]:
# Now, you can use these coefficients to make predictions on the test set
# Use the AirbnbTest.csv data to obtain the test set features

# Calculate prediction errors on the test set
prediction_errors = []
for i in range(len(test_y)):
    prediction = sum(optimal_coefficients[j] * test_x[i][j] for j in range(d))
    error = abs(prediction - test_y[i])
    prediction_errors.append(error)

# Calculate average prediction error
average_prediction_error = sum(prediction_errors) / len(test_y)
print(f'Model 1 Prediction Error: {average_prediction_error}')

Model 1 Prediction Error: 35.604535030377846
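
The same error calculation is repeated for each model, so it could be factored into a small helper. The sketch below is one possible version; the function name `mean_absolute_error` and the toy numbers are assumptions for illustration, and it uses plain Python so that it runs without the Airbnb files.

```python
def mean_absolute_error(coefficients, features, targets):
    """Average of |prediction - target| for a linear model without an intercept."""
    total = 0.0
    for row, target in zip(features, targets):
        prediction = sum(c * v for c, v in zip(coefficients, row))
        total += abs(prediction - target)
    return total / len(targets)


# Toy data (hypothetical): two features, three listings
toy_coefficients = [50.0, 10.0]
toy_features = [[1, 2], [0, 4], [1, 6]]
toy_targets = [80.0, 30.0, 100.0]

# Each listing is off by 10, so the mean absolute error is 10.0
print(mean_absolute_error(toy_coefficients, toy_features, toy_targets))
```

With the notebook's variables, the call would presumably look like `mean_absolute_error(optimal_coefficients, test_x, test_y)`.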


## Model 2
Suppose that to improve interpretability, you wish to build a model that predicts
Airbnb prices using only the three most important variables. Modify Model 1 by including a
constraint that allows at most three variables to have non-zero coefficients. 

a) List the names and coefficients of the three variables selected by the optimization model.  
b) What is the new prediction error, in $/night, of Model 2?


\begin{align}
\text{Decision variables: } & \beta_{j},\ j = 1, \dots, d \quad \text{(regression coefficients)} \\
& z_{i},\ i = 1, \dots, n \quad \text{(absolute prediction error for listing $i$)} \\
& w_{j},\ j = 1, \dots, d \quad \text{(1 if variable $j$ may have a non-zero coefficient, 0 otherwise)}
\end{align}

\begin{align}
\underset{\boldsymbol{\beta},\, \mathbf{z},\, \mathbf{w}}{\text{min}} \;\; & \frac{1}{n} \sum_{i=1}^{n} z_{i} \\
\text{s.t.} \;\; & z_{i} \geq y_{i} - \sum_{j=1}^{d} \beta_{j} x_{ij}, \quad i = 1, \dots, n \\
& z_{i} \geq \sum_{j=1}^{d} \beta_{j} x_{ij} - y_{i}, \quad i = 1, \dots, n \\
& \beta_{j} \leq 1000\, w_{j}, \quad j = 1, \dots, d \\
& \beta_{j} \geq -1000\, w_{j}, \quad j = 1, \dots, d \\
& \sum_{j=1}^{d} w_{j} \leq 3 \\
& w_{j} \in \{0, 1\}, \quad j = 1, \dots, d
\end{align}
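
The big-M constraints link each coefficient to its binary switch: when `w_j = 0` the bounds collapse to zero and force the coefficient to zero, and when `w_j = 1` the coefficient may range between -1000 and 1000. The following sketch checks these conditions for a candidate solution; the helper `is_feasible_selection` and the candidate values are hypothetical and only illustrate the logic.

```python
def is_feasible_selection(beta, w, big_m=1000.0, max_selected=3):
    """Check the big-M and cardinality constraints for a candidate (beta, w)."""
    binary = all(w_j in (0, 1) for w_j in w)
    within_bounds = all(
        -big_m * w_j <= b_j <= big_m * w_j for b_j, w_j in zip(beta, w)
    )
    return binary and within_bounds and sum(w) <= max_selected


# Three variables switched on, the unselected coefficient is zero
print(is_feasible_selection([40.0, 0.0, 12.5, -3.0], [1, 0, 1, 1]))
# A non-zero coefficient on a switched-off variable violates the big-M bound
print(is_feasible_selection([40.0, 5.0, 12.5, -3.0], [1, 0, 1, 1]))
# Four variables switched on violates the cardinality limit
print(is_feasible_selection([1.0, 1.0, 1.0, 1.0], [1, 1, 1, 1]))
```

The constant 1000 has to be at least as large as any coefficient the model might need. The largest coefficient found in Model 1 was about 290, so 1000 leaves some headroom here, although that is not a guarantee for every subset of variables.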

In [11]:
# Initializing Model 2
mod_2 = Model()

# Defining the decision variables
# Note: addVars defaults to a lower bound of 0, so b[j] >= 0 already holds and
# cons_4 only matters if b is created with lb=-GRB.INFINITY.
b = mod_2.addVars(d)
z = mod_2.addVars(n)
w = mod_2.addVars(d, vtype = GRB.BINARY)

# Defining the constraints
for i in range(n):
    cons_1 = mod_2.addConstr(z[i] >= y[i] - sum(b[j]*x[i][j] for j in range(d)))
    cons_2 = mod_2.addConstr(z[i] >= sum(b[j]*x[i][j] for j in range(d)) - y[i])

# Big-M constraints: b[j] must be zero unless w[j] = 1
cons_3 = mod_2.addConstrs(b[j] <= 1000 * w[j] for j in range(d))
cons_4 = mod_2.addConstrs(b[j] >= - 1000 * w[j] for j in range(d))
# At most three variables may be selected
cons_5 = mod_2.addConstr(sum(w[j] for j in range(d)) <= 3)

# Objective function
mod_2.setObjective((1/n) * sum(z[i] for i in range(n)), GRB.MINIMIZE)

mod_2.update()
mod_2.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 23.3.0 23D56)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 3425 rows, 1724 columns and 41432 nonzeros
Model fingerprint: 0xfc339209
Variable types: 1712 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [5e-01, 1e+03]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [3e+00, 2e+03]
Found heuristic solution: objective 144.9682353
Presolve removed 840 rows and 414 columns
Presolve time: 0.02s
Presolved: 2585 rows, 1310 columns, 31274 nonzeros
Variable types: 1298 continuous, 12 integer (12 binary)

Root relaxation: objective 3.642625e+01, 1462 iterations, 0.10 seconds (0.33 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0   36.42625    0    7  144.96824   

### a) List the names and coefficients of the three variables selected by the optimization model.

In [12]:
# Selected variables (compare against 0.5 because solvers return binaries within a tolerance)
beta_values = {j: b[j].X for j in range(d)}
selected_variables = [j for j in range(d) if w[j].X > 0.5]

print("Selected Variables and Coefficients:")
for j in selected_variables:
    print(f"Variable {data.columns[j]}: Coefficient {beta_values[j]}")

Selected Variables and Coefficients:
Variable Entire home: Coefficient 52.0
Variable accommodates: Coefficient 14.0
Variable bedrooms: Coefficient 32.0


### b) What is the new prediction error, in $/night, of Model 2?

In [13]:
# new prediction error
# Calculate prediction errors on the test set
prediction_errors = []
for i in range(len(test_y)):
    prediction = sum(beta_values[j] * test_x[i][j] for j in range(d))
    error = abs(prediction - test_y[i])
    prediction_errors.append(error)

# Calculate average prediction error
average_prediction_error = sum(prediction_errors) / len(test_y)
print(f'Model 2 Prediction Error: {average_prediction_error}')

Model 2 Prediction Error: 37.73676680972818


## Model 3
Suppose now you wish to build a model that predicts Airbnb listing price using only
three variables, where one of the variables is the number of beds.  

a) List the names and coefficients of the two other variables selected by the optimization model.  
b) Which variable was in Model 2 but is no longer in Model 3? Briefly explain in 1-2 sentences why this variable might have been dropped.  
c) What is the new prediction error, in $/night, of Model 3?


\begin{align}
\text{Decision variables: } & \beta_{j},\ j = 1, \dots, d \quad \text{(regression coefficients)} \\
& z_{i},\ i = 1, \dots, n \quad \text{(absolute prediction error for listing $i$)} \\
& w_{j},\ j = 1, \dots, d \quad \text{(1 if variable $j$ may have a non-zero coefficient, 0 otherwise)}
\end{align}

\begin{align}
\underset{\boldsymbol{\beta},\, \mathbf{z},\, \mathbf{w}}{\text{min}} \;\; & \frac{1}{n} \sum_{i=1}^{n} z_{i} \\
\text{s.t.} \;\; & z_{i} \geq y_{i} - \sum_{j=1}^{d} \beta_{j} x_{ij}, \quad i = 1, \dots, n \\
& z_{i} \geq \sum_{j=1}^{d} \beta_{j} x_{ij} - y_{i}, \quad i = 1, \dots, n \\
& \beta_{j} \leq 1000\, w_{j}, \quad j = 1, \dots, d \\
& \beta_{j} \geq -1000\, w_{j}, \quad j = 1, \dots, d \\
& \sum_{j=1}^{d} w_{j} = 3 \\
& w_{7} = 1 \quad \text{(beds is the 7th feature, i.e. index 6 in the zero-based code)} \\
& w_{j} \in \{0, 1\}, \quad j = 1, \dots, d
\end{align}
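
Forcing `beds` into the model and requiring exactly three selected variables shrinks the set of variable combinations the solver can choose from. The short count below is a sketch of that arithmetic, assuming the 12 candidate features of the training data. Every combination allowed in Model 3 is also allowed in Model 2, so the optimal training objective of Model 3 cannot be lower than that of Model 2.

```python
from math import comb

d = 12  # number of candidate features in the training data

# Model 2: any subset with at most three selected variables
supports_model_2 = sum(comb(d, k) for k in range(4))

# Model 3: exactly three variables with beds fixed, so choose 2 of the other 11
supports_model_3 = comb(d - 1, 2)

print(supports_model_2, supports_model_3)  # 299 55
```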

In [14]:
# confirm the index of number of beds
data.iloc[:,6]

0       2
1       4
2       1
3       1
4       1
       ..
1695    1
1696    3
1697    2
1698    1
1699    1
Name: beds, Length: 1700, dtype: int64

In [15]:
# Initializing Model 3
mod_3 = Model()

# Defining the decision variables
z = mod_3.addVars(n)
b = mod_3.addVars(d)
w = mod_3.addVars(d, vtype = GRB.BINARY)

# Defining the constraints
# Note: multiplying by w[j] makes these constraints bilinear (Gurobi reports them as
# quadratic constraints). The big-M constraints below already force b[j] = 0 when
# w[j] = 0, so the extra factor is redundant and could be dropped to keep the model linear.
for i in range(n):
    cons_1 = mod_3.addConstr(z[i] >= y[i] - sum(b[j]*x[i][j]*w[j] for j in range(d)))
    cons_2 = mod_3.addConstr(z[i] >= sum(b[j]*x[i][j]*w[j] for j in range(d)) - y[i])

cons_3 = mod_3.addConstrs(b[j] <= 1000 * w[j] for j in range(d))
cons_4 = mod_3.addConstrs(b[j] >= - 1000 * w[j] for j in range(d))
# Exactly three variables, one of which must be beds (column index 6)
cons_5 = mod_3.addConstr(sum(w[j] for j in range(d)) == 3)
cons_6 = mod_3.addConstr(w[6] == 1)

# Objective function
mod_3.setObjective((1/n) * sum(z[i] for i in range(n)), GRB.MINIMIZE)

mod_3.update()
mod_3.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 23.3.0 23D56)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 26 rows, 1724 columns and 61 nonzeros
Model fingerprint: 0xad4ed5ea
Model has 3400 quadratic constraints
Variable types: 1712 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+03]
  QMatrix range    [5e-01, 5e+02]
  QLMatrix range   [1e+00, 1e+00]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 3e+00]
  QRHS range       [1e+01, 2e+03]
Presolve added 810 rows and 409 columns
Presolve time: 0.03s
Presolved: 3441 rows, 2144 columns, 32996 nonzeros
Variable types: 2133 continuous, 11 integer (11 binary)
Found heuristic solution: objective 48.6558384
Found heuristic solution: objective 48.5975270

Root relaxation: objective 3.642625e+01, 2314 iterations, 0.14 seconds (0.48 work units)

    Nodes    |    Cu

### a) List the names and coefficients of the two other variables selected by the optimization model.

In [16]:
# Selected variables (compare against 0.5 because solvers return binaries within a tolerance)
beta_values_3 = {j: b[j].X for j in range(d)}
selected_variables = [j for j in range(d) if w[j].X > 0.5]

print("Selected Variables and Coefficients:")
for j in selected_variables:
    print(f"Variable {data.columns[j]}: Coefficient {beta_values_3[j]}")

Selected Variables and Coefficients:
Variable Entire home: Coefficient 67.875
Variable bedrooms: Coefficient 47.375
Variable beds: Coefficient 12.125


### b) Which variable was in Model 2 but is no longer in Model 3? Briefly explain in 1-2 sentences why this variable might have been dropped.

The variable `accommodates` was in Model 2 but is no longer in Model 3. Once `beds` is forced into the model, `accommodates` carries largely redundant information, because the two are strongly correlated (about 0.71 in the training data): listings with more beds generally accommodate more guests. With only two slots left, the optimizer keeps `Entire home` and `bedrooms`, which evidently reduce the training error more than `accommodates` does once `beds` is included.

In [17]:
data[['beds', 'accommodates']].corr()

| | beds | accommodates |
|---|---|---|
| beds | 1.0 | 0.714887 |
| accommodates | 0.714887 | 1.0 |
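
For reference, the Pearson correlation reported by `corr()` can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below does this in plain Python for the first five training rows shown in the data preview; the helper name `pearson_correlation` is an assumption, and five rows will not reproduce the full-sample value of 0.714887.

```python
from math import sqrt


def pearson_correlation(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


# beds and accommodates for the first five listings in the preview
first_beds = [2, 4, 1, 1, 1]
first_accommodates = [13, 8, 4, 2, 2]
print(round(pearson_correlation(first_beds, first_accommodates), 3))
```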


### c) What is the new prediction error, in $/night, of Model 3?

In [18]:
# new prediction error
# Calculate prediction errors on the test set
prediction_errors = []
for i in range(len(test_y)):
    prediction = sum(beta_values_3[j] * test_x[i][j] for j in range(d))
    error = abs(prediction - test_y[i])
    prediction_errors.append(error)

# Calculate average prediction error
average_prediction_error = sum(prediction_errors) / len(test_y)
print(f'Model 3 Prediction Error: {average_prediction_error}')

Model 3 Prediction Error: 38.59960658082976
