# Alternative Models
To make sure the model used to make predictions for the analysis was the best available option, I also trained & tested several other models that were good candidates (based on the characteristics of our data).

Specifically, we also tested the following regression models:
1. Linear (Lasso Regularization)
2. Linear (Ridge Regularization)
3. SGD
4. Decision Tree

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

# Import & preview input variables
X = pd.read_csv('./output/model_X.csv')
X.head()

Unnamed: 0,host_is_superhost,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,guests_included,minimum_nights,maximum_nights,review_scores_rating,...,property_type_Townhouse,property_type_Treehouse,property_type_Villa,room_type_Private room,room_type_Shared room,bed_type_Couch,bed_type_Futon,bed_type_Pull-out Sofa,bed_type_Real Bed,annual_booked
0,0.0,6.0,2,1.0,1.0,1.0,1,1,730,98.0,...,0,0,0,1,0,0,0,0,1,5.0
1,0.0,5.0,2,1.0,0.0,1.0,2,1,1125,95.0,...,0,0,0,0,0,0,0,0,1,81.0
2,0.0,1.0,2,1.0,1.0,1.0,2,3,7,93.939368,...,0,0,0,1,0,0,0,1,0,0.0
3,0.0,1.0,3,1.0,1.0,4.0,1,1,730,90.0,...,0,0,0,0,0,0,0,0,1,299.0
4,0.0,1.0,2,1.0,1.0,2.0,1,4,90,89.0,...,0,0,0,0,0,0,0,0,1,104.0


In [2]:
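# Import & preview output variable
# (read_csv's squeeze=True argument was removed in pandas 2.0; DataFrame.squeeze does the same job)
y = pd.read_csv('./output/model_y.csv', header=None).squeeze("columns")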
y.head()

0     2.041096
1    49.931507
2     0.000000
3    72.906849
4    29.917808
Name: 0, dtype: float64

In [3]:
# Split data into training & testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Linear (Lasso Regularization)
We'll start with the simplest option on the list: adding _Lasso regularization_ to the Linear Regression. The hope is that penalizing model complexity will reduce overfitting and give us noticeably better scores, particularly on the testing set.

We're going to test the Lasso model with various alpha values to spot the config with optimal scores.
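
For reference, this is the objective that scikit-learn's `Lasso` minimizes, where $n$ is the number of training samples, $w$ is the coefficient vector, and $\alpha$ is the penalty strength varied in the tests below:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The larger $\alpha$ is, the more coefficients are driven to exactly zero, which is why the number of non-zero coefficients is tracked alongside the scores.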

In [20]:
# Function to run Lasso model

def runLasso(alpha=1.0):
    """
    Compute the training & testing scores (R^2) of the Linear Regression (with Lasso regularization)
    along with the number of non-zero coefficients used.
    
    Input:
        alpha: the degree of penalization for model complexity
        
    Output:
        alpha: the degree of penalization for model complexity
        train_scoreL: Training score
        test_scoreL: Testing score
        coeff_used: Number of non-zero coefficients in the model
    """
    
    # Instantiate & train
    lasso_reg = Lasso(alpha=alpha)
    lasso_reg.fit(X_train, y_train)

    # Predict testing data
    pred_train = lasso_reg.predict(X_train)
    pred_test = lasso_reg.predict(X_test)

    # Score
    train_scoreL = lasso_reg.score(X_train,y_train)
    test_scoreL = lasso_reg.score(X_test,y_test)
    coeff_used = np.sum(lasso_reg.coef_!=0)
    
    print("Lasso Score (" + str(alpha) + "):")
    print(train_scoreL)
    print(test_scoreL)
    print('-------')
    print("Coefficients Used:")
    print(coeff_used)
    
    return (alpha, train_scoreL, test_scoreL, coeff_used)

runLasso()

Lasso Score (1.0):
0.23223211976073035
0.18862779654313222
-------
Coefficients Used:
20


(1.0, 0.23223211976073035, 0.18862779654313222, 20)

In [7]:
# Test the Lasso regularization for a range of alpha values

alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

for alpha in alpha_lasso:
    runLasso(alpha)



Lasso Score (1e-15):
0.2559366327386795
0.20189974847694037
-------
Coefficients Used:
279




Lasso Score (1e-10):
0.2559366327390432
0.20189974871655747
-------
Coefficients Used:
279




Lasso Score (1e-08):
0.2559366327750352
0.20189977241934653
-------
Coefficients Used:
279




Lasso Score (0.0001):
0.25593536548472673
0.2019226658696961
-------
Coefficients Used:
276




Lasso Score (0.001):
0.25587513878367596
0.20203614035738615
-------
Coefficients Used:
241
Lasso Score (0.01):
0.25495352910683877
0.20238889968783336
-------
Coefficients Used:
124
Lasso Score (1):
0.23223211976073035
0.18862779654313222
-------
Coefficients Used:
20
Lasso Score (5):
0.20547594338644537
0.16504046703963937
-------
Coefficients Used:
12
Lasso Score (10):
0.19236671314805076
0.15451447757549708
-------
Coefficients Used:
8
Lasso Score (20):
0.18615436751503778
0.14947038592763684
-------
Coefficients Used:
8


### Linear Regression (Lasso) Conclusion

The Lasso linear model does not seem to surpass the simple Linear Regression model trained (see "Airbnb NYC Data Exploration" notebook for details), which had scored 25.6% (training) and 20.2% (testing).

**Therefore, we will rule out Lasso regularization as a superior modelling choice.**
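
As a side note, the manual sweep over alpha values could also be done with cross-validation, so that the testing set is not used to pick alpha. The following is a minimal sketch under stated assumptions, not what was run in this notebook: the data is synthetic (`make_regression`), and the alpha grid and the `*_demo` / `Xd_*` names are hypothetical stand-ins for `X_train`, `y_train`, and so on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the notebook's X / y to reuse this
X_demo, y_demo = make_regression(n_samples=300, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Hypothetical alpha grid, similar in spirit to alpha_lasso above
alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 20]

# 5-fold cross-validation on the training data picks the alpha
# with the lowest mean validation error
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso_cv.fit(Xd_train, yd_train)

best_alpha = lasso_cv.alpha_
n_nonzero = int(np.sum(lasso_cv.coef_ != 0))   # coefficients not shrunk to zero
test_r2 = lasso_cv.score(Xd_test, yd_test)     # R^2 on held-out data

print("Best alpha:", best_alpha)
print("Non-zero coefficients:", n_nonzero)
print("Test R^2:", test_r2)
```

The testing score is then only computed once, for the alpha chosen on the training folds.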

## Linear (Ridge Regularization)
Similarly, we're going to test a Linear Regression with _Ridge regularization_. Since the dataset is non-sparse, the hypothesis is that L2 regularization, which shrinks coefficients without zeroing them out, will correct for complexity better than Lasso's L1 regularization did.

We're going to test the Ridge model with various alpha values to spot the config with optimal scores.

In [9]:
# Function to run Ridge model

def runRidge(alpha=1.0):
    """
    Compute the training & testing scores (R^2) of the Linear Regression (with Ridge regularization)
    along with the number of non-zero coefficients used.
    
    Input:
        alpha: the degree of penalization for model complexity
        
    Output:
        alpha: the degree of penalization for model complexity
        train_scoreL: Training score
        test_scoreL: Testing score
        coeff_used: Number of non-zero coefficients in the model
    """
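    # Instantiate & train
    # Note: the `normalize` parameter was removed in scikit-learn 1.2. On newer versions,
    # scale the features beforehand instead; that will not reproduce the scores below exactly.
    rid_reg = Ridge(alpha=alpha, normalize=True)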
    rid_reg.fit(X_train, y_train)

    # Predict testing data
    pred_train = rid_reg.predict(X_train)
    pred_test = rid_reg.predict(X_test)

    # Score
    train_score = rid_reg.score(X_train,y_train)
    test_score = rid_reg.score(X_test,y_test)
    coeff_used = np.sum(rid_reg.coef_!=0)
    
    print("Ridge Score (" + str(alpha) + "):")
    print(train_score)
    print(test_score)
    print('-------')
    print("Coefficients Used:")
    print(coeff_used)
    
    return (alpha, train_score, test_score, coeff_used)

runRidge()

Ridge Score (1.0):
0.20902323915862486
0.16550505004900973
-------
Coefficients Used:
279


In [10]:
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

for alpha in alpha_ridge:
    runRidge(alpha)



Ridge Score (1e-15):
0.2559366870637817
0.20189484648856693
-------
Coefficients Used:
279
Ridge Score (1e-10):
0.2559366870637817
0.20190767294147383
-------
Coefficients Used:
279
Ridge Score (1e-08):
0.2559366870637807
0.20190767275041943
-------
Coefficients Used:
279
Ridge Score (0.0001):
0.25593659625822296
0.20190549320829965
-------
Coefficients Used:
279
Ridge Score (0.001):
0.2559322297145006
0.20188920584965975
-------
Coefficients Used:
279
Ridge Score (0.01):
0.255889202023326
0.20184044553149974
-------
Coefficients Used:
279
Ridge Score (1):
0.20902323915862486
0.16550505004900973
-------
Coefficients Used:
279
Ridge Score (5):
0.11072430763295538
0.08812228834875402
-------
Coefficients Used:
279
Ridge Score (10):
0.0704876562860427
0.05617677247752917
-------
Coefficients Used:
279
Ridge Score (20):
0.040971903570563684
0.0326704448518893
-------
Coefficients Used:
279


### Linear Regression (Ridge) Conclusion

The Ridge linear model also does not seem to surpass the simple Linear Regression model (see "Airbnb NYC Data Exploration" notebook for details), which had scored 25.6% (training) and 20.2% (testing).

**Therefore, we will rule out Ridge regularization as a superior modelling choice.**
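
The `normalize=True` argument used in `runRidge` is not available in recent scikit-learn releases. A minimal sketch of one common alternative is to standardize the features in a pipeline before the Ridge step. This is an assumption about how the test could be rerun, not the original method: `StandardScaler` divides by the standard deviation, whereas `normalize=True` divided by the L2 norm, so the scores would differ from those above. The synthetic data and the `run_ridge_scaled` helper are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in the notebook's X / y to reuse this
X_demo, y_demo = make_regression(n_samples=300, n_features=20,
                                 noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

def run_ridge_scaled(alpha):
    """Fit StandardScaler + Ridge and return (training R^2, testing R^2)."""
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(Xd_train, yd_train)
    return model.score(Xd_train, yd_train), model.score(Xd_test, yd_test)

# Hypothetical alpha values, chosen only for illustration
ridge_scores = {alpha: run_ridge_scaled(alpha) for alpha in [0.01, 1, 10]}

for alpha, (train_r2, test_r2) in ridge_scores.items():
    print("Ridge (scaled) alpha=" + str(alpha) + ":", train_r2, test_r2)
```

Because the scaler is fitted inside the pipeline, it only ever sees the training data, which avoids leaking information from the testing set.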

## Stochastic Gradient Descent (SGD) Regression
The SGD regression differs from the former two (Lasso, Ridge) mainly in how it is fitted. It is still a linear model, but instead of solving for all the weights using the full dataset at once, SGD applies the squared trick one point at a time, nudging the weights after each sample (vs. Batch gradient descent, which uses every point for each update). Since the underlying model is the same, I don't expect scores to differ too much when compared to the previous 2.

In [11]:
# Function to run SGD model

def runSGD():
    """
    Compute the training & testing scores (R^2) of the SGD regression
    along with the number of non-zero coefficients used.
        
    Output:
        train_score: Training score
        test_score: Testing score
        coeff_used: Number of non-zero coefficients in the model
    """
    # Instantiate & train
    # "squared_loss" was renamed to "squared_error" in scikit-learn 1.0 (same loss)
    sgd_reg = SGDRegressor(loss="squared_error", penalty=None)
    sgd_reg.fit(X_train, y_train)

    # Predict testing data
    pred_train = sgd_reg.predict(X_train)
    pred_test = sgd_reg.predict(X_test)

    # Score
    train_score = sgd_reg.score(X_train,y_train)
    test_score = sgd_reg.score(X_test,y_test)
    coeff_used = np.sum(sgd_reg.coef_!=0)
    
    print("SGD Score:")
    print(train_score)
    print(test_score)
    print('-------')
    print("Coefficients Used:")
    print(coeff_used)
    
    return (train_score, test_score, coeff_used)

runSGD()



SGD Score:
-8.783751868376809e+45
-3.2951885575502357e+37
-------
Coefficients Used:
279


### Stochastic Gradient Descent (SGD) Conclusion

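The SGD model also does not surpass the simple Linear Regression model (see "Airbnb NYC Data Exploration" notebook for details), which had scored 25.6% (training) and 20.2% (testing). In fact, the training & testing scores are hugely negative (on the order of -1e45 and -1e37), which points to the optimizer diverging rather than to an ordinary poor fit. SGD is sensitive to feature scale, and the inputs here were not standardized (e.g. maximum_nights runs into the hundreds or more, while the dummy columns are 0/1).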

**Therefore, we will rule out SGD (as configured here) as a superior modelling choice.**
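
Scores this extreme are typical of SGD run on unscaled features. A minimal sketch of the usual remedy, standardizing the inputs before the SGD step, is shown below. It was not run on the Airbnb data, so whether it would beat the Linear Regression baseline here is unknown; the synthetic data, the inflated column, and the `*_demo` / `Xd_*` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in the notebook's X / y to reuse this
X_demo, y_demo = make_regression(n_samples=500, n_features=10,
                                 noise=5.0, random_state=42)

# Inflate one column to mimic features on very different scales
# (hypothetical, similar to maximum_nights next to 0/1 dummy columns)
X_demo[:, 0] *= 1000.0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Standardize each feature to zero mean / unit variance, then fit SGD
# (squared-error loss is the SGDRegressor default)
sgd_scaled = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty=None, random_state=42),
)
sgd_scaled.fit(Xd_train, yd_train)

scaled_train_r2 = sgd_scaled.score(Xd_train, yd_train)
scaled_test_r2 = sgd_scaled.score(Xd_test, yd_test)

print("SGD (scaled) Score:")
print(scaled_train_r2)
print(scaled_test_r2)
```

With standardized inputs the gradient steps stay on a comparable scale across features, which is what keeps the weights from blowing up.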

## Decision Trees
Unlike the former models, Decision Trees have a very different model structure. That is, the model generates a series of nodes & branches that maximize information gain (for a regression tree, each split is chosen to reduce the squared error the most). Thus, it is naturally also the model most prone to overfitting.

To remedy the overfitting challenge, we'll run the Decision Trees model with the below parameters:
- max_depth
- min_samples_leaf
- min_samples_split

To isolate the effect of these parameters on scores, we'll change one at a time (i.e. keeping other parameters constant)

In [12]:
# Function to run Decision Trees
def runTree(max_depth=None, min_samples_leaf=1, min_samples_split=2):
    """
    Compute and print the training & testing scores (R^2) of a Decision Tree regressor.
    
    Input:
        max_depth: maximum allowed depth of trees ("distance" between root & leaf)
        min_samples_leaf: minimum samples to contain per leaf
        min_samples_split: minimum samples to split a node
        
    Prints:
        the parameters used (max_depth, min_samples_leaf, min_samples_split),
        followed by the training score and the testing score
    """
    # Instantiate & train
    # criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0 (same criterion)
    tree_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=max_depth, min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split)
    tree_reg.fit(X_train, y_train)

    # Predict testing data
    pred_train = tree_reg.predict(X_train)
    pred_test = tree_reg.predict(X_test)

    # Score
    train_score = tree_reg.score(X_train,y_train)
    test_score = tree_reg.score(X_test,y_test)
    
    print("Tree Score (" + str(max_depth) + ', ' + str(min_samples_leaf) + ', ' + str(min_samples_split) + "):")
    print(train_score)
    print(test_score)
    print('-------')

runTree()

Tree Score (None, 1, 2):
0.9878965346502062
-0.03756831319523579
-------


In [13]:
depths = [2, 5, 6, 7, 8]
for dep in depths:
    runTree(dep)

Tree Score (2, 1, 2):
0.15479934586567123
0.12183263202028384
-------
Tree Score (5, 1, 2):
0.4744294887573631
0.07241405400320378
-------
Tree Score (6, 1, 2):
0.525310295933926
0.08305001615068186
-------
Tree Score (7, 1, 2):
0.5638299452344031
0.07131265727236702
-------
Tree Score (8, 1, 2):
0.6307552187994618
-0.028863898962306234
-------


In [14]:
min_leafs = [2, 4, 6, 8, 10, 12, 14, 16]
for lfs in min_leafs:
    runTree(7, lfs)

Tree Score (7, 2, 2):
0.44084600423305964
0.14480057994068019
-------
Tree Score (7, 4, 2):
0.3857547145581244
0.17718732285841887
-------
Tree Score (7, 6, 2):
0.3421063058328305
0.19366673593944272
-------
Tree Score (7, 8, 2):
0.33583295299107185
0.1936338179864181
-------
Tree Score (7, 10, 2):
0.3438241646559581
0.20280465059455732
-------
Tree Score (7, 12, 2):
0.32942818941899776
0.2207759709711551
-------
Tree Score (7, 14, 2):
0.3283815370863621
0.22152969537300318
-------
Tree Score (7, 16, 2):
0.32610147391348787
0.22092665015994173
-------


In [15]:
min_splits = [2, 4, 6, 8, 10]
for splt in min_splits:
    runTree(7, 14, splt)

Tree Score (7, 14, 2):
0.3283815370863621
0.22152969537300318
-------
Tree Score (7, 14, 4):
0.3283815370863621
0.22152969537300318
-------
Tree Score (7, 14, 6):
0.328381537086362
0.22152969537300318
-------
Tree Score (7, 14, 8):
0.328381537086362
0.22152969537300318
-------
Tree Score (7, 14, 10):
0.328381537086362
0.22152969537300318
-------


In [16]:
for dep in depths:
    for lfs in min_leafs:
        runTree(dep, lfs)

Tree Score (2, 2, 2):
0.15479934586567123
0.12183263202028359
-------
Tree Score (2, 4, 2):
0.15479934586567123
0.1218326320202835
-------
Tree Score (2, 6, 2):
0.15479934586567123
0.12183263202028384
-------
Tree Score (2, 8, 2):
0.15479934586567112
0.12183263202028384
-------
Tree Score (2, 10, 2):
0.15479934586567112
0.12183263202028384
-------
Tree Score (2, 12, 2):
0.15479934586567123
0.12183263202028384
-------
Tree Score (2, 14, 2):
0.15479934586567123
0.12183263202028359
-------
Tree Score (2, 16, 2):
0.15479934586567112
0.12183263202028359
-------
Tree Score (5, 2, 2):
0.36661942537925574
0.12127467104405443
-------
Tree Score (5, 4, 2):
0.31980895350015925
0.17135495073230267
-------
Tree Score (5, 6, 2):
0.2850110739746403
0.1735525930023777
-------
Tree Score (5, 8, 2):
0.2832835369825508
0.17283090183811545
-------
Tree Score (5, 10, 2):
0.2863249780699517
0.1918975165540071
-------
Tree Score (5, 12, 2):
0.2762523394476941
0.20328371328514327
-------
Tree Score (5, 14, 2)

### Decision Tree Conclusion

Unlike the rest, the Decision Tree model does seem to surpass the simple Linear Regression model (see "Airbnb NYC Data Exploration" notebook for details), which had scored 25.6% (training) and 20.2% (testing).

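Based on the tests, there appears to be a sweet spot where the testing score peaks (about 22.2% at max_depth=7, min_samples_leaf=14), while min_samples_split had no visible effect once min_samples_leaf was set to 14. Also notable is the fact that the training & testing scores tend to move in opposite directions: as the tree is constrained (shallower depth, larger leaves), the training score falls while the testing score rises.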
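
The one-parameter-at-a-time search above can miss combinations that only work well together. A minimal sketch of a joint search with cross-validation follows. It is an illustration under stated assumptions, not the procedure used in this notebook: the data is synthetic and the grid values are hypothetical, picked from the ranges tested above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): swap in the notebook's X / y to reuse this
X_demo, y_demo = make_regression(n_samples=400, n_features=10,
                                 noise=20.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Hypothetical grid: every combination of these values is tried
param_grid = {
    "max_depth": [2, 5, 7],
    "min_samples_leaf": [2, 8, 14],
}

# 5-fold cross-validation on the training data, scored by R^2
grid = GridSearchCV(DecisionTreeRegressor(random_state=42),
                    param_grid, cv=5, scoring="r2")
grid.fit(Xd_train, yd_train)

best_params = grid.best_params_
test_r2 = grid.score(Xd_test, yd_test)   # R^2 of the refitted best tree on held-out data

print("Best parameters:", best_params)
print("Test R^2:", test_r2)
```

Selecting parameters on cross-validation folds also keeps the testing set out of the tuning loop, so the final testing score is a fairer estimate.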

In [17]:
# criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0 (same criterion)
tree_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=8, min_samples_leaf=16, min_samples_split=2)
tree_reg.fit(X_train, y_train)

# Predict testing data
pred_train = tree_reg.predict(X_train)
pred_test = tree_reg.predict(X_test)

In [18]:
# Import prediction input
df_nei_Manhattan_EV = pd.read_csv('./data/input/pred_input_Manhattan_EV.csv')
df_nei_Manhattan_HA = pd.read_csv('./data/input/pred_input_Manhattan_HA.csv')
df_nei_Manhattan_HK = pd.read_csv('./data/input/pred_input_Manhattan_HK.csv')
df_nei_Manhattan_UWS = pd.read_csv('./data/input/pred_input_Manhattan_UWS.csv')

df_nei_Brooklyn_BS = pd.read_csv('./data/input/pred_input_Brooklyn_BS.csv')
df_nei_Brooklyn_BU = pd.read_csv('./data/input/pred_input_Brooklyn_BU.csv')
df_nei_Brooklyn_WI = pd.read_csv('./data/input/pred_input_Brooklyn_WI.csv')

df_nei_Queens_AS = pd.read_csv('./data/input/pred_input_Queens_AS.csv')
df_nei_Queens_LI = pd.read_csv('./data/input/pred_input_Queens_LI.csv')

avgRev_Manhattan_EV = round(tree_reg.predict(df_nei_Manhattan_EV)[0],2)
avgRev_Manhattan_HA = round(tree_reg.predict(df_nei_Manhattan_HA)[0],2)
avgRev_Manhattan_HK = round(tree_reg.predict(df_nei_Manhattan_HK)[0],2)
avgRev_Manhattan_UWS = round(tree_reg.predict(df_nei_Manhattan_UWS)[0],2)

avgRev_Brooklyn_BS = round(tree_reg.predict(df_nei_Brooklyn_BS)[0],2)
avgRev_Brooklyn_BU = round(tree_reg.predict(df_nei_Brooklyn_BU)[0],2)
avgRev_Brooklyn_WI = round(tree_reg.predict(df_nei_Brooklyn_WI)[0],2)

avgRev_Queens_AS = round(tree_reg.predict(df_nei_Queens_AS)[0],2)
avgRev_Queens_LI = round(tree_reg.predict(df_nei_Queens_LI)[0],2)

print("--------Manhattan---------")
print(avgRev_Manhattan_EV)
print(avgRev_Manhattan_HA)
print(avgRev_Manhattan_HK)
print(avgRev_Manhattan_UWS)
print("")
print("--------Brooklyn---------")
print(avgRev_Brooklyn_BS)
print(avgRev_Brooklyn_BU)
print(avgRev_Brooklyn_WI)
print("")
print("--------Queens---------")
print(avgRev_Queens_AS)
print(avgRev_Queens_LI)

--------Manhattan---------
44.15
44.15
44.15
44.15

--------Brooklyn---------
30.37
30.37
30.37

--------Queens---------
30.37
30.37


In [19]:
# Import prediction input
df_nei_1_1 = pd.read_csv('./data/input/pred_input_Manhattan_EV.csv')
df_nei_2_2 = pd.read_csv('./data/input/pred_input_Manhattan_EV_2bed_2_bath.csv')
df_nei_2_1 = pd.read_csv('./data/input/pred_input_Manhattan_EV_2bed_1_bath.csv')

avgRev_1_1 = tree_reg.predict(df_nei_1_1)[0]
avgRev_2_2 = tree_reg.predict(df_nei_2_2)[0]
avgRev_2_1 = tree_reg.predict(df_nei_2_1)[0]

print(round(avgRev_1_1,2))
print(round(avgRev_2_2,2))
print(round(avgRev_2_1,2))

44.15
44.15
44.15


# Conclusion
With these parameters, the best-scoring model (Decision Tree) turned out to be too coarse: it makes the same prediction for inputs that differ in meaningful ways (e.g. 1 bedroom vs. 2 bedrooms), and for every neighbourhood within the same borough.
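
One way to check why a tree returns identical predictions is to look at which leaf each input lands in and which features the tree actually splits on. The sketch below uses hypothetical toy data, not the Airbnb inputs, so it only illustrates the mechanism: a feature the tree never splits on cannot change the prediction. The same calls (`apply`, `feature_importances_`) could be pointed at `tree_reg` and the prediction inputs above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)

# Hypothetical toy data: column 0 drives the target, column 1 is irrelevant
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=200)

# A heavily constrained tree, in the spirit of the settings used above
tree_demo = DecisionTreeRegressor(max_depth=2, min_samples_leaf=16, random_state=42)
tree_demo.fit(X_demo, y_demo)

# Two inputs that differ only in column 1
a = np.array([[3.0, 1.0]])
b = np.array([[3.0, 9.0]])

# apply() returns the index of the leaf each sample ends up in
leaf_a = tree_demo.apply(a)[0]
leaf_b = tree_demo.apply(b)[0]
pred_a = tree_demo.predict(a)[0]
pred_b = tree_demo.predict(b)[0]

# Features with zero importance never appear in any split
unused_features = np.where(tree_demo.feature_importances_ == 0)[0]

print("Leaves:", leaf_a, leaf_b)
print("Predictions:", pred_a, pred_b)
print("Unused feature columns:", unused_features)
```

If bedrooms, bathrooms, and the neighbourhood columns show up as unused in the fitted tree, that would explain the repeated 44.15 and 30.37 predictions.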