# Gradient Descent Models

We will predict the price (`price` column) of the listings in the AirBNB dataset that we used last week.

**Therefore, our unit of analysis is an AIRBNB LISTING**

## 1. Setup

In [1]:
# Common imports
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor 
from sklearn.dummy import DummyRegressor

np.random.seed(1)

## 2. Load the data

We will use the AirBNB data that we cleaned in the last class (the original, not the version that you altered for last week's exercise).

In [2]:
X_train = pd.read_csv("./data/airbnb_train_X_price.csv")
X_test = pd.read_csv("./data/airbnb_test_X_price.csv")
y_train = pd.read_csv("./data/airbnb_train_y_price.csv")
y_test = pd.read_csv("./data/airbnb_test_y_price.csv")

## 3. Model the data

First, we will create a dataframe to hold all the results of our models.

In [3]:
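# Flatten y_train so the "actual" column holds the observed prices, whatever the CSV column is named
results = pd.DataFrame({"actual": np.ravel(y_train)})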

rmses = pd.DataFrame({"model": [], "rmse": []})

## 4. Polynomial Regression

Polynomial regression creates polynomial "variables" (powers and interaction terms) from the existing variables and then fits them with a regular linear regression model.

In [4]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms and interaction terms
poly_features = PolynomialFeatures(degree=2).fit(X_train)
X_train_poly = poly_features.transform(X_train)
X_test_poly = poly_features.transform(X_test)

# This also creates polynomial terms for the categorical variables (since they are encoded as numbers)

# If degree=3, it creates every term of total degree 3 or less (plus a constant column): a, b, a^2, a.b, b^2, a^3, a^2.b, a.b^2, b^3
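
To make the list of generated terms concrete, the sketch below enumerates the monomials of a polynomial expansion by hand for two features named `a` and `b`. The helper `polynomial_terms` is a hypothetical illustration written for this note; it is not part of scikit-learn and it omits the constant column that `PolynomialFeatures` adds by default.

```python
from itertools import combinations_with_replacement


def polynomial_terms(names, degree):
    """List the monomials of total degree 1..degree for the given feature names."""
    terms = []
    for d in range(1, degree + 1):
        # Each combination of d names (with repetition) is one monomial of degree d
        for combo in combinations_with_replacement(names, d):
            parts = []
            for name in names:
                power = combo.count(name)  # exponent of this feature in the monomial
                if power == 1:
                    parts.append(name)
                elif power > 1:
                    parts.append(f"{name}^{power}")
            terms.append(" ".join(parts))
    return terms


terms = polynomial_terms(["a", "b"], 3)
print(terms)
```

Every term has a total degree of at most 3, so a term such as `a^2 b^2` (total degree 4) is not generated.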

In [5]:
# We still fit a linear regression model, now on the polynomial features, using SGD

poly_lin_reg = SGDRegressor(max_iter=1000, penalty=None, eta0=0.01) 
poly_lin_reg.fit(X_train_poly, np.ravel(y_train))

print(f"Number of iterations = {poly_lin_reg.n_iter_}")

results["SGD_preds_ using polynomial"] = poly_lin_reg.predict(X_train_poly)

Number of iterations = 17


In [6]:
# Test RMSE
# SGD with polynomial input
poly_test_pred = poly_lin_reg.predict(X_test_poly)
poly_test_rmse = np.sqrt(mean_squared_error(y_test, poly_test_pred))

rmses = pd.concat([rmses, pd.DataFrame({'model':"SGD Poly", 'rmse': poly_test_rmse}, index=[0])])

print(f"SGD wt Polynomial input Test RMSE: {poly_test_rmse:.3f}")

SGD wt Polynomial input Test RMSE: 336317650987.969


The RMSE from the polynomial model is very large, a strong indicator that this is not a good model. The problem is most likely related to having many coefficients that are not significant. We can use Lasso (l1 regularization) to shrink some of the coefficients, or reduce the degree of the polynomial.
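
Another possible contributor, which this notebook does not test, is feature scale: squared and interaction terms can be much larger than the original variables, and SGD is sensitive to that. The sketch below is an assumption-laden illustration on synthetic data (not the AirBNB data): it standardizes the polynomial terms with `StandardScaler` inside a pipeline before fitting `SGDRegressor`. The data-generating formula and variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the AirBNB data): two features on a 0-10 scale
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(600, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, size=600)
X_tr, X_te, y_tr, y_te = X[:450], X[450:], y[:450], y[450:]

# Expand to degree-2 terms, standardize them, then fit with SGD
scaled_sgd = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # SGDRegressor fits its own intercept
    StandardScaler(),
    SGDRegressor(max_iter=1000, penalty=None, eta0=0.01, random_state=1),
)
scaled_sgd.fit(X_tr, y_tr)

scaled_rmse = np.sqrt(mean_squared_error(y_te, scaled_sgd.predict(X_te)))
print(f"Scaled polynomial SGD test RMSE: {scaled_rmse:.3f}")
print(f"Standard deviation of y_test:    {np.std(y_te):.3f}")
```

Whether scaling would help on the AirBNB data depends on how the features were cleaned in the previous class, so this is something to try rather than a confirmed fix.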

In [7]:
poly_lin_reg_l1 = SGDRegressor(max_iter=1000, penalty='l1', alpha=0.5,  eta0=0.01) 
poly_lin_reg_l1.fit(X_train_poly, np.ravel(y_train))

print(f"Number of iterations = {poly_lin_reg_l1.n_iter_}")

results["SGD_preds_ using polynomial with l1"] = poly_lin_reg_l1.predict(X_train_poly)

poly_test_pred_l1 = poly_lin_reg_l1.predict(X_test_poly)
poly_test_rmse_l1 = np.sqrt(mean_squared_error(y_test, poly_test_pred_l1))

rmses = pd.concat([rmses, pd.DataFrame({'model':"SGD Poly l1", 'rmse': poly_test_rmse_l1}, index=[0])])

print(f"SGD wt Polynomial input l1 regularization Test RMSE: {poly_test_rmse_l1:.3f}")

Number of iterations = 17
SGD wt Polynomial input l1 regularization Test RMSE: 1265935928719.893


In [8]:
poly_lin_reg_l2 = SGDRegressor(max_iter=1000, penalty='l2', alpha=0.5,  eta0=0.01) 
poly_lin_reg_l2.fit(X_train_poly, np.ravel(y_train))

print(f"Number of iterations = {poly_lin_reg_l2.n_iter_}")

results["SGD_preds_ using polynomial with l2"] = poly_lin_reg_l2.predict(X_train_poly)

poly_test_pred_l2 = poly_lin_reg_l2.predict(X_test_poly)
poly_test_rmse_l2 = np.sqrt(mean_squared_error(y_test, poly_test_pred_l2))

rmses = pd.concat([rmses, pd.DataFrame({'model':"SGD Poly l2", 'rmse': poly_test_rmse_l2}, index=[0])])

print(f"SGD wt Polynomial input l2 regularization Test RMSE: {poly_test_rmse_l2:.3f}")

Number of iterations = 6
SGD wt Polynomial input l2 regularization Test RMSE: 1326427286336.276


In [9]:
poly_lin_reg_elastic = SGDRegressor(max_iter=1000, penalty='elasticnet', l1_ratio=.5, alpha=0.5,  eta0=0.01) 
poly_lin_reg_elastic.fit(X_train_poly, np.ravel(y_train))

print(f"Number of iterations = {poly_lin_reg_elastic.n_iter_}")

results["SGD_preds_ using polynomial with elastic net"] = poly_lin_reg_elastic.predict(X_train_poly)

poly_test_pred_elastic = poly_lin_reg_elastic.predict(X_test_poly)
poly_test_rmse_elastic = np.sqrt(mean_squared_error(y_test, poly_test_pred_elastic))

rmses = pd.concat([rmses, pd.DataFrame({'model':"SGD Poly elastic", 'rmse': poly_test_rmse_elastic}, index=[0])])

print(f"SGD wt Polynomial input elastic net regularization Test RMSE: {poly_test_rmse_elastic:.3f}")

Number of iterations = 6
SGD wt Polynomial input elastic net regularization Test RMSE: 772683365718.655
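
For reference, the three regularized models above differ only in the penalty term that is added to the loss. As described in the scikit-learn user guide for SGD, the objective is the average loss plus a penalty $R(w)$ scaled by `alpha`. The symbols $w$, $b$, $R$, and $\rho$ follow that guide and are not taken from this notebook:

$$
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i,\; w^\top x_i + b\big) + \alpha\, R(w)
$$

$$
R_{\text{l2}}(w) = \frac{1}{2} \sum_j w_j^2, \qquad
R_{\text{l1}}(w) = \sum_j |w_j|, \qquad
R_{\text{elasticnet}}(w) = \frac{\rho}{2} \sum_j w_j^2 + (1 - \rho) \sum_j |w_j|
$$

Here $\rho = 1 - \texttt{l1\_ratio}$, so `l1_ratio=.5` weights the two penalties equally, and `alpha=0.5` sets the overall strength of the penalty in all three models.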


## 5. Summary

In [10]:
rmses.sort_values(by=['rmse'])

| | model | rmse |
|---|---|---|
| 0 | SGD Poly | 336317700000.0 |
| 0 | SGD Poly elastic | 772683400000.0 |
| 0 | SGD Poly l1 | 1265936000000.0 |
| 0 | SGD Poly l2 | 1326427000000.0 |


Since the RMSE of the basic SGD Poly model (without any regularization) is the lowest, it is the best-performing of these four models. However, all four RMSE values are extremely large, so none of them is a good model of price.
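
One way to judge how large an RMSE is would be to compare it with a naive baseline. The setup cell imports `DummyRegressor` but the notebook never uses it, so the following is only a sketch of how such a baseline could be computed. It runs on made-up stand-in data (not the AirBNB prices), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in data (NOT the AirBNB prices)
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(200, 3)), rng.normal(size=(50, 3))
y_tr, y_te = rng.uniform(50, 500, size=200), rng.uniform(50, 500, size=50)

# The "mean" strategy ignores X and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
print(f"Baseline (mean) test RMSE: {baseline_rmse:.3f}")
```

A model whose test RMSE is higher than this kind of baseline is doing worse than always predicting the average price.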