# MGMTMSA 403 Assignment 3: Predicting Airbnb Prices


* Jiacheng (George) Zheng UID: 005856460
* Jiaxin (Frank) Zheng UID: 806310846
* Joy Lin UID: 806304435

## Import libraries

In [5]:
# Import gurobi and numpy
from gurobipy import *
import numpy as np
from numpy import genfromtxt
import csv
import pandas as pd

## Import data

In [6]:
train = pd.read_csv('AirbnbTrain.csv')
test = pd.read_csv('AirbnbTest.csv')
X_train = train.drop('price',axis=1).to_numpy()
y_train = train['price']
X_test = test.drop('price',axis=1).to_numpy()
y_test = test['price']

In [7]:
n_train = X_train.shape[0]
d_train = X_train.shape[1]

n_test = X_test.shape[0]
d_test = X_test.shape[1]

# Model 1

* **Parameters**

\begin{align*}
i &: \text{row number } \\
j &: \text{column number } \\
x_{ij} &: \text{element at row i and column j of n d-dimensional vector of features} \\
y_i &: \text{price corresponding to } x_{ij}
\end{align*}

* **Decision Variables**

\begin{align*}
\beta_j &: \text{real-valued dependent variable corresponding to j } \\
z_i &: |y_i - \sum_{j=1}^{d} \beta_j x_{ij}| \\
ypred_i&: \text{prediction of y }
\end{align*}

* **Objective Function**

\begin{equation*}
\min_{\beta_1, ..., \beta_d} \frac{1}{n} \sum_{i=1}^{n} |y_i - \sum_{j=1}^{d} \beta_j x_{ij}|
\end{equation*}

* **Constraints**

\begin{equation*}
z_i \geq y_i - ypred_i, \quad i = 1,...,n
\end{equation*}

\begin{equation*}
z_i \geq ypred_i - y_i, \quad i = 1,...,n
\end{equation*}

> At most 3 variables (k = 3)

\begin{equation*}
\sum_{j=1}^{d} 1 (\beta_j \neq 0) \leq 3
\end{equation*}

In [56]:
mod1 = Model()

b = mod1.addVars(d_train)
z = mod1.addVars(n_train)
y_pred = mod1.addVars(n_train)

# Objective function
mod1.setObjective(1/n_train * sum(z[i]for i in range(n_train)), GRB.MINIMIZE)

# Constraints
for i in range(n_train):
    mod1.addConstr(y_pred[i] == sum(b[j] * X_train[i][j] for j in range(d_train)))
    
    mod1.addConstr(z[i] >= y_train[i] - y_pred[i])
    
    mod1.addConstr(z[i] >= y_pred[i] - y_train[i])

# Solve the model
mod1.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[x86] - Darwin 23.2.0 23C71)

CPU model: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 5100 rows, 3412 columns and 27486 nonzeros
Model fingerprint: 0x07c6caa8
Coefficient statistics:
  Matrix range     [5e-01, 5e+02]
  Objective range  [6e-04, 6e-04]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+01, 2e+03]
Presolve removed 2 rows and 1 columns
Presolve time: 0.02s
Presolved: 5098 rows, 3411 columns, 27482 nonzeros

Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    0.0000000e+00   1.849080e+05   0.000000e+00      0s
    1499    3.6426247e+01   0.000000e+00   0.000000e+00      0s

Use crossover to convert LP symmetric solution to basic solution...
Crossover log...

       0 DPushes remaining with DInf 0.0000000e+00                 0s

       5 PPushes remaining with PInf 0.0000000e+00                 0

In [58]:
if mod1.status == GRB.OPTIMAL:
    residuals = []
    for i in range(n_test):
        y_pred = sum(b[j].x * X_test[i][j] for j in range(d_test))
        residuals.append(abs(y_test[i] - y_pred))

print('The prediction error of my model on the test set is $',
      sum(residuals) / n_test,
     '/night.')

The prediction error of my model on the test set is $ 35.60453503037783 /night.


# Model 2

* **Parameters**

\begin{align*}
i &: \text{row number } \\
j &: \text{column number } \\
x_{ij} &: \text{element at row i and column j of n d-dimensional vector of features} \\
y_i &: \text{price corresponding to } x_{ij}
\end{align*}

* **Decision Variables**

\begin{align*}
\beta_j &: \text{real-valued dependent variable corresponding to j } \\
z_i &: |y_i - \sum_{j=1}^{d} \beta_j x_{ij}| \\
ypred_i&: \text{prediction of y } \\
c_j &: \text{= 1 if jth variable is used, 0 otherwise }
\end{align*}

* **Objective Function**

\begin{equation*}
\min_{\beta_1, ..., \beta_d} \frac{1}{n} \sum_{i=1}^{n} |y_i - \sum_{j=1}^{d} \beta_j x_{ij}|
\end{equation*}

* **Constraints**

\begin{equation*}
ypred_i = \sum_{j=1}^{d} \beta_{i} x_{ij}, \quad i = 1,...,n
\end{equation*}

\begin{equation*}
z_i \geq y_i - ypred_i, \quad i = 1,...,n
\end{equation*}

\begin{equation*}
z_i \geq ypred_i - y_i, \quad i = 1,...,n
\end{equation*}

> At most 3 variables (k = 3)

\begin{equation*}
\sum_{j=1}^{d} 1 (\beta_j \neq 0) \leq 3
\end{equation*}

> Variables to have non-zero coefficients

\begin{equation*}
\beta_j \leq 100 * c_j, \quad j = 1,...,d
\end{equation*}

## a)

In [11]:
mod2 = Model()

b = mod2.addVars(d_train)
z = mod2.addVars(n_train)
c = mod2.addVars(d_train, vtype=GRB.BINARY)
y_pred = mod2.addVars(n_train)

# Objective function
mod2.setObjective(1/n_train * sum(z[i]for i in range(n_train)), GRB.MINIMIZE)

# Constraints
for i in range(n_train):
    mod2.addConstr(sum(c[j] for j in range(d_train)) <= 3)
    
    mod2.addConstr(y_pred[i] == sum(b[j] * X_train[i][j] for j in range(d_train)))
    
    mod2.addConstr(z[i] >= y_train[i] - y_pred[i])
    
    mod2.addConstr(z[i] >= y_pred[i] - y_train[i])

for j in range(d_train):
    mod2.addConstr(b[j] <= 100 * c[j])
    
# Solve the model
mod2.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[x86] - Darwin 23.2.0 23C71)

CPU model: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 6812 rows, 3424 columns and 47910 nonzeros
Model fingerprint: 0x7a655b87
Variable types: 3412 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [5e-01, 5e+02]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [3e+00, 2e+03]
Found heuristic solution: objective 144.9682353
Presolve removed 2943 rows and 829 columns
Presolve time: 0.06s
Presolved: 3869 rows, 2595 columns, 20795 nonzeros
Variable types: 2583 continuous, 12 integer (12 binary)

Root relaxation: objective 3.644035e+01, 1512 iterations, 0.29 seconds (0.40 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0

In [12]:
if mod2.status == GRB.OPTIMAL:
    print("Name\t\tCoefficient")
    for j in range(d_train):
        if b[j].x != 0:
            print(train.columns[j], '\t\t', j)

Name		Coefficient
Entire home 		 2
accommodates 		 3
bedrooms 		 5


## b)

In [68]:
if mod2.status == GRB.OPTIMAL:
    residuals = []
    for i in range(n_test):
        y_pred = sum(b[j].x * X_test[i][j] for j in range(d_test))
        residuals.append(abs(y_test[i] - y_pred))

print('The new prediction error is $',
      sum(residuals) / n_test,
     '/night.')

The new prediction error is $ 37.73676680972818 /night.


# Model 3

* **Parameters**

\begin{align*}
i &: \text{row number } \\
j &: \text{column number } \\
x_{ij} &: \text{element at row i and column j of n d-dimensional vector of features} \\
y_i &: \text{price corresponding to } x_{ij}
\end{align*}

* **Decision Variables**

\begin{align*}
\beta_j &: \text{real-valued dependent variable corresponding to j } \\
z_i &: |y_i - \sum_{j=1}^{d} \beta_j x_{ij}| \\
ypred_i&: \text{prediction of y } \\
c_j &: \text{= 1 if jth variable is used, 0 otherwise }
\end{align*}

* **Objective Function**

\begin{equation*}
\min_{\beta_1, ..., \beta_d} \frac{1}{n} \sum_{i=1}^{n} |y_i - \sum_{j=1}^{d} \beta_j x_{ij}|
\end{equation*}

* **Constraints**

\begin{equation*}
ypred_i = \sum_{j=1}^{d} \beta_{i} x_{ij}, \quad i = 1,...,n
\end{equation*}

\begin{equation*}
z_i \geq y_i - ypred_i, \quad i = 1,...,n
\end{equation*}

\begin{equation*}
z_i \geq ypred_i - y_i, \quad i = 1,...,n
\end{equation*}

> At most 3 variables (k = 3)

\begin{equation*}
\sum_{j=1}^{d} 1 (\beta_j \neq 0) \leq 3
\end{equation*}

> Variables to have non-zero coefficients

\begin{equation*}
\beta_j \leq 100 * c_j, \quad j = 1,...,d
\end{equation*}

> One of the variables is the number of beds

\begin{equation*}
c_6 \geq 1
\end{equation*}

## a)

In [8]:
mod3 = Model()

b = mod3.addVars(d_train)
z = mod3.addVars(n_train)
c = mod3.addVars(d_train, vtype=GRB.BINARY)
y_pred = mod3.addVars(n_train)

# Objective function
mod3.setObjective(1/n_train * sum(z[i]for i in range(n_train)), GRB.MINIMIZE)

# Constraints

for i in range(n_train):
    mod3.addConstr(sum(c[j] for j in range(d_train)) <= 3)

    mod3.addConstr(y_pred[i] == sum(b[j] * X_train[i][j] for j in range(d_train)))
    
    mod3.addConstr(z[i] >= y_train[i] - y_pred[i])
    
    mod3.addConstr(z[i] >= y_pred[i] - y_train[i])

for j in range(d_train):
    mod3.addConstr(b[j] <= 100 * c[j])

mod3.addConstr(c[6] >= 1)
    
# Solve the model
mod3.optimize()

Set parameter Username
Academic license - for non-commercial use only - expires 2025-01-16
Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[x86] - Darwin 23.2.0 23C71)

CPU model: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 6813 rows, 3424 columns and 47911 nonzeros
Model fingerprint: 0xa7dfb2d7
Variable types: 3412 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [5e-01, 5e+02]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 2e+03]
Found heuristic solution: objective 144.9682353
Presolve removed 2945 rows and 830 columns
Presolve time: 0.07s
Presolved: 3868 rows, 2594 columns, 20792 nonzeros
Variable types: 2583 continuous, 11 integer (11 binary)

Root relaxation: objective 3.644433e+01, 1711 iterations, 0.35 seconds (0.48 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work

In [9]:
if mod3.status == GRB.OPTIMAL:
    print("Name\t\tCoefficient")
    for j in range(d_train):
        if b[j].x != 0:
            print(train.columns[j], '\t\t', j)

Name		Coefficient
Entire home 		 2
bedrooms 		 5
beds 		 6


## b)

Variable accommodates was in Model 2 but is no longer in Model 3. 
This variable has been dropped possibly because of its multicolinearity with the feature beds.

## c)

In [10]:
if mod3.status == GRB.OPTIMAL:
    residuals = []
    for i in range(n_test):
        y_pred = sum(b[j].x * X_test[i][j] for j in range(d_test))
        residuals.append(abs(y_test[i] - y_pred))

print('The new prediction error is $',
      sum(residuals) / n_test,
     '/night.')

The new prediction error is $ 38.59960658082976 /night.
