In [5]:
import pandas as pd
import os
import seaborn as sns
from time import time
path = '../data/'

df = pd.read_csv(os.path.join(path,'airbnb_listings_usa_cycle_1.csv'))

# df=df[:10000]

In [2]:
pd.set_option('display.max_rows', 40)
pd.set_option('display.max_columns', 100)

Strategy

Baseline: mean
Jasim's baseline: multiple regression model with selected independent variables (features).

1. linear regression
2. decision tree
3. ensemble

-- go into pricing models

4. neural network

stretch goal: the google airline algo recommendation!!!



Test train split

split by time -> not perfect but proxy perhaps
df['last_review'] = pd.to_datetime(df.last_review)

Splitting by time is not appropriate because the majority of our dataset is spread throughout 2020, covering the period both before and after the start of the pandemic, which we know has affected prices.
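For reference, this is roughly what a time-based split would look like. It is a minimal sketch: the toy frame and the cutoff date are invented for illustration, and only the `last_review` column name comes from the notebook.

```python
import pandas as pd

# Toy data standing in for the listings frame (values are invented).
toy = pd.DataFrame({
    "last_review": ["2019-11-02", "2020-01-15", "2020-03-20", "2020-06-01", "2020-09-12"],
    "price": [120, 95, 80, 60, 110],
})
toy["last_review"] = pd.to_datetime(toy["last_review"])

# Hypothetical cutoff: rows reviewed before it go to train, the rest to test.
cutoff = pd.Timestamp("2020-03-01")
train = toy[toy["last_review"] < cutoff]
test = toy[toy["last_review"] >= cutoff]

print(len(train), len(test))
```

With a cutoff like this, the test set would contain only pandemic-period prices, which is the concern raised above.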

In [8]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

In [9]:
# 2. Instantiate this class
model = LinearRegression()

In [10]:
# 3. Arrange X features matrix & y target vector

target = ['price']

features = ['host_response_time','host_response_rate','host_acceptance_rate',
'street','neighbourhood','neighbourhood_cleansed','neighbourhood_group_cleansed','city','state','zipcode','market','smart_location','latitude','longitude','property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','square_feet','minimum_nights','maximum_nights','instant_bookable','is_business_travel_ready','cancellation_policy','require_guest_profile_picture','require_guest_phone_verification','notes_len','transit_len','access_len','interaction_len','house_rules_len','host_about_len','metro_area','bedrooms_str','beds_str']

# Wrangle and pre-process

# Removing sparse features from features (sparse features are <90% populated)
sparse_features = ['square_feet','neighbourhood_group_cleansed','host_response_rate','host_response_time','neighbourhood','host_acceptance_rate']

unusable_features = ['amenities']

duplicative_location_features = ['street','neighbourhood','neighbourhood_cleansed','neighbourhood_group_cleansed','city','state','zipcode','market','smart_location','metro_area']

numeric_columns = df.select_dtypes(include='number').columns.tolist()  # ints and floats; ==int alone skips float columns
nonnumeric_columns = df.dtypes[df.dtypes==object].index.tolist()

df[nonnumeric_columns] = df[nonnumeric_columns].astype(str)

selected_features = list(set(features) - set(sparse_features))
selected_features = list(set(selected_features) - set(unusable_features))
selected_features = list(set(selected_features) - set(duplicative_location_features))

df = df[target + selected_features]

df.dropna(inplace=True)

In [11]:
y = df[target]
X = df[selected_features]
y.shape, X.shape


((230950, 1), (230950, 24))

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=42)



In [13]:
# We need to encode features!

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error
import category_encoders as ce


In [14]:
st = time()
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
et = time()
et-st

5.915673732757568

In [16]:
print(X_train.shape)

(184760, 107)


Ok, the simple linear regression was taking too long. Looking at the shape of our one-hot encoded X_train, something does not seem right. Does X_train have 507645 columns? Is this possible?

Let's look at the cardinality of our categorical features

In [19]:
df.nunique().sum()

297260

Explainable cardinality: 
'amenities' is a dictionary of values.

Unexplainable cardinality:
'latitude' and 'longitude'
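`df.nunique().sum()` only gives one total, so it may help to look at cardinality column by column. A minimal sketch on a toy frame (column names mirror the notebook, values are invented). It assumes, as is the default behaviour of most one-hot encoders, that only non-numeric columns get expanded, so a high unique count on numeric columns such as `latitude` is expected and does not add columns.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "latitude": [39.95069, 47.66101, 40.82495, 41.64919],
    "bed_type": ["Real Bed", "Real Bed", "Real Bed", "Real Bed"],
})

# Unique values per column, highest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# Non-numeric columns are the ones a one-hot encoder would expand.
categorical_cardinality = toy.select_dtypes(exclude="number").nunique()
print(categorical_cardinality.sum())
```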

(Side track: going into experiment mode. When unsure whether the logic is behaving as you expect, test with a small sample, both row-wise and column-wise.)

In [20]:
# Fit the model / "train the model and estimate the MAE"
st = time()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()

mae $179


In [95]:
# The MAE of the baseline
st = time()
y_pred = [y_train.mean()] * len(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()

mae $210
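As a sanity check on what the mean baseline measures, here is the same calculation worked by hand on a few made-up prices (the numbers are invented, not taken from the listings data):

```python
import numpy as np

# Invented prices for illustration.
y_train = np.array([100.0, 150.0, 200.0, 350.0])
y_test = np.array([120.0, 300.0])

baseline = y_train.mean()                 # 800 / 4 = 200
y_pred = np.full_like(y_test, baseline)   # same prediction for every test row
mae = np.abs(y_test - y_pred).mean()      # (80 + 100) / 2 = 90
print(f"mae ${mae:,.0f}")
```

Any model worth keeping should beat this number on the same test rows.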


Jasim's excel principle:

Sometimes it's just easy to excel it out!

sometimes a GUI just helps improve the thought process


selected some features:


Research point:
Leaky features??
Some features, such as number of reviews and ratings, are an outcome of being successful as a host, which is itself an outcome.

Independence of features:
"The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design."

Q1: does this also apply when we use decision trees, random forests or ensemble methods? 

Q2: Do we apply the chi-square test?
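The effect described in the quote on independence of features can be seen on synthetic data. A minimal NumPy sketch, with all data randomly generated for illustration: when one column is nearly a copy of another, the condition number of the design matrix blows up, which is what makes the least-squares estimate unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)

# An independent second feature vs. a near copy of the first.
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 1e-4 * rng.normal(size=n)

X_ok = np.column_stack([x1, x2_independent])
X_bad = np.column_stack([x1, x2_collinear])

# A large condition number means the design matrix is close to singular.
cond_ok = np.linalg.cond(X_ok)
cond_bad = np.linalg.cond(X_bad)
print(cond_ok, cond_bad)
```

In this notebook, pairs such as `bedrooms` and `bedrooms_str`, or `beds` and `beds_str`, look like candidates for this kind of dependence.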





Appendix: additional pre-processing

In [96]:
# Display % of nulls
df.isnull().mean().sort_values(ascending=False)[:5]

require_guest_phone_verification    0.0
longitude                           0.0
bedrooms                            0.0
beds                                0.0
transit_len                         0.0
dtype: float64

Drop features that are too sparse! Imputing them might distort the results. Let's call them "sparse_features"; a feature counts as sparse when it is less than 90% populated.
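Rather than hard-coding the list, the sparse features could be derived from the data. A minimal sketch on a toy frame (column names mirror the notebook, values are invented; the 0.9 threshold is the one stated above):

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, 800.0, np.nan, np.nan,
                    np.nan, np.nan, np.nan, np.nan, np.nan],
    "bedrooms": [1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 4.0, 1.0],
})

threshold = 0.9  # a column must be at least 90% populated to be kept
populated = toy.notnull().mean()
sparse_features = populated[populated < threshold].index.tolist()
print(sparse_features)
```

This check needs to run before any `astype(str)` conversion, because converting missing values to strings hides them from `notnull()`.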


Ridge regression:

the best model is in the "sweet spot" in the tradeoff between bias and variance.
bias --> overweigh 
variance --> 

"bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated."
"tendency to overestimate or underestimate a parameter"

one way to look at it: your "model" has a preconceived view. the data shows otherwise, thus you are biased. bias is a tendency to hold on to a preconceived view despite data showing otherwise.

variance. the way the concept is phrased makes it seem as if it is the opposite.

...

one way to reduce overfitting to the training set is to regularize.

Ridge regression introduces bias into how a regression line fits the data. So we trade variance for bias, which reduces the fit on the training set but may improve the fit on the test set, which means better long-term predictions.
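To make the trade concrete, here is a minimal NumPy sketch of the closed-form ridge solution on made-up data. It assumes no intercept and unscaled features, and `ridge_fit` is a helper defined here for illustration, not part of scikit-learn. A larger alpha pulls the coefficients toward zero, which is the bias being added.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# alpha = 0 is ordinary least squares; larger alpha shrinks the coefficients.
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, alpha)
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```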


In [21]:
from sklearn.linear_model import Ridge

In [27]:
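st = time()
# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, scale the features beforehand instead (e.g. with a scaler in a pipeline).
ridge = Ridge(alpha = .5, normalize=True)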
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()


mae $169


In [24]:
st = time()
# ridge = Ridge(alpha = 0, normalize=True) # mae $2,173,408,237
# ridge = Ridge(alpha = .1, normalize=True) #mae $175
# ridge = Ridge(alpha = .2, normalize=True) #mae $172
# ridge = Ridge(alpha = .4, normalize=True) # mae $170
# ridge = Ridge(alpha = .6, normalize=True) #mae $168
# ridge = Ridge(alpha = .8, normalize=True) #mae $168
# ridge = Ridge(alpha = 1, normalize=True) #mae $168
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()


mae $169


The optimal alpha is probably around 0.6. Adding bias has diminishing returns past that point.
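The manual comment-and-rerun sweep could be written as a loop. A minimal sketch on synthetic data (not the listings data), which scales features with `StandardScaler` rather than the `normalize=True` flag, so the alpha values are not necessarily comparable with the runs above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded listings data.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=2.0, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one model per alpha and record its MAE.
results = {}
for alpha in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    results[alpha] = mean_absolute_error(y_test, model.predict(X_test))

best_alpha = min(results, key=results.get)
for alpha, mae in results.items():
    print(f"alpha={alpha}: mae {mae:.3f}")
print("best alpha:", best_alpha)
```

Choosing alpha by test-set MAE, as in this sweep, lets the test set leak into model selection; a held-out validation split or `RidgeCV` would be a cleaner option.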

How else could we improve predictive accuracy? 

Feature engineering and feature selection - using domain knowledge

Or we can use KBest



In [25]:
# Currently we have 107 features, post 1-hot encoding. Will 1-hot encoding distort KBest? 
X_train.shape

(184760, 107)

In [28]:
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=15) #going down from 107 features to 15???

In [29]:
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
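One thing worth checking here: `SelectKBest` defaults to `score_func=f_classif`, which is meant for classification targets. For a continuous target such as price, `f_regression` is the usual choice. A minimal sketch on synthetic data (not the listings data), so the columns selected here say nothing about the notebook's results:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 3 drive the target in this toy setup.
y = 5.0 * X[:, 0] - 4.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# Indices of the columns that were kept.
selected_idx = np.flatnonzero(selector.get_support())
print(selected_idx, X_selected.shape)
```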

In [30]:
X_train_selected

array([[2. , 1. , 0. , ..., 1. , 0. , 0. ],
       [1. , 1. , 0. , ..., 1. , 0. , 0. ],
       [2. , 1.5, 0. , ..., 1. , 0. , 0. ],
       ...,
       [1. , 1. , 0. , ..., 1. , 0. , 0. ],
       [1. , 1. , 0. , ..., 1. , 0. , 0. ],
       [1. , 1. , 0. , ..., 1. , 0. , 0. ]])

In [31]:
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print ('Features selected:')
for name in selected_names:
  print (name)

print()

print ('Features excluded:')
for name in unselected_names:
  print (name)

Features selected:
bedrooms
bathrooms
property_type_Villa
beds_str_1
beds
bedrooms_str_1
bedrooms_str_7+
accommodates
cancellation_policy_super_strict_60
cancellation_policy_luxury_moderate
cancellation_policy_luxury_super_strict_95
cancellation_policy_luxury_no_refund
room_type_Entire home/apt
room_type_Private room
room_type_Shared room

Features excluded:
longitude
is_business_travel_ready_f
require_guest_phone_verification_f
require_guest_phone_verification_t
maximum_nights
latitude
instant_bookable_f
instant_bookable_t
property_type_Condominium
property_type_Serviced apartment
property_type_Apartment
property_type_House
property_type_Guesthouse
property_type_Townhouse
property_type_Loft
property_type_Boutique hotel
property_type_Guest suite
property_type_Cottage
property_type_Bungalow
property_type_Resort
property_type_Hostel
property_type_Bed and breakfast
property_type_Nature lodge
property_type_Hotel
property_type_Camper/RV
property_type_Tiny house
property_type_Other
property_

In [32]:
# Ridge alpha .1 mae insensitive to change in alpha
st = time()
ridge = Ridge(alpha = .1, normalize=True) #mae $175
ridge.fit(X_train_selected, y_train)
y_pred = ridge.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()


mae $167


In [36]:
# Fit the model / "train the model and estimate the MAE"
st = time()
model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')
et = time()

mae $167


#### Pickle the model

* save the model to a file.
* then reconstitute the model from the file.
* avoids having to re-train the model after the first time. "once you save your model as a pickle, you can load it later while making the prediction"

* aka serialization, marshalling or flattening

> for deployability, we will use a linear model since it has a faster run-time than, say, a tree-based model. Simpler models also come with the added benefit of interpretability (e.g. it can be more important to see the correlation between a feature and a problem than to increase predictive accuracy)

> **definition**:
Pickling - is the process whereby a Python object hierarchy is converted into a byte stream, and Unpickling - is the inverse operation, whereby a byte stream is converted back into an object hierarchy.

> Pickling (and unpickling) is alternatively known as serialization, marshalling, or flattening.

> byte stream: an ordered sequence of bytes
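The definition can be tried out in memory, without a file. A minimal sketch using only the standard library, where a plain dict stands in for a fitted model (the values are invented):

```python
import pickle

# Any picklable Python object works; a dict stands in for a fitted model here.
toy_model = {"coef": [1.5, -0.25], "intercept": 10.0}

byte_stream = pickle.dumps(toy_model)   # pickling: object -> bytes
restored = pickle.loads(byte_stream)    # unpickling: bytes -> object

print(type(byte_stream).__name__, len(byte_stream))
print(restored == toy_model)
```

`pickle.dump` and `pickle.load` do the same thing against an open file. Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code.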


#### Pickling a linear model

776 bytes

In [37]:
# Save a model using pickle
import pickle
import os

# MODEL_FILEPATH = os.path.join(os.path.dirname(__file__),"model.pkl")
with open('linear_model.pkl','wb') as model_file:
    pickle.dump(model, model_file)

# linear_model: 776bytes


In [38]:
# Load the model and do a predict

with open('linear_model.pkl','rb') as model_file:
    loaded_model = pickle.load(model_file)

In [39]:
loaded_model.predict(X_train_selected[:2,:])

## Note this loads the last model

array([[217.11297934],
       [178.66686964]])

#### Pickling the ridge regression

621 bytes




In [40]:
# save model
with open('ridge_reg_model.pkl','wb') as model_file:
    pickle.dump(ridge, model_file)

In [41]:
# load model
with open('ridge_reg_model.pkl','rb') as model_file:
    loaded_model = pickle.load(model_file)

In [42]:
# testing prediction
loaded_model.predict(X_train_selected[:2,:])

array([[227.2735363],
       [175.3132717]])

In [113]:
import numpy as np
np.savetxt("x_train.csv",X_train_selected,delimiter=",")

In [114]:
k_best = ['bedrooms','beds','bathrooms','bedrooms_str_1','bedrooms_str_7+','room_type_Entire home/apt','room_type_Private room','room_type_Shared room','beds_str_1','property_type_Villa','accommodates','cancellation_policy_super_strict_60','cancellation_policy_luxury_moderate','cancellation_policy_luxury_super_strict_95','cancellation_policy_luxury_no_refund']
X_train[k_best]

Unnamed: 0,bedrooms,beds,bathrooms,bedrooms_str_1,bedrooms_str_7+,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,beds_str_1,property_type_Villa,accommodates,cancellation_policy_super_strict_60,cancellation_policy_luxury_moderate,cancellation_policy_luxury_super_strict_95,cancellation_policy_luxury_no_refund
39931,2.0,2.0,1.0,0,0,1,0,0,0,0,5,0,0,0,0
217369,1.0,1.0,1.0,1,0,1,0,0,1,0,3,0,0,0,0
153008,2.0,2.0,1.5,0,0,1,0,0,0,0,4,0,0,0,0
180362,5.0,5.0,2.5,0,0,1,0,0,0,0,10,0,0,0,0
86132,1.0,2.0,1.0,1,0,1,0,0,0,0,3,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120860,1.0,1.0,1.0,1,0,1,0,0,1,0,2,0,0,0,0
104426,1.0,1.0,1.0,1,0,1,0,0,1,0,4,0,0,0,0
132975,1.0,1.0,1.0,1,0,1,0,0,1,0,2,0,0,0,0
147923,1.0,0.0,1.0,1,0,1,0,0,0,0,1,0,0,0,0


Optional: you can try other selection methods
Also: try polynomial features

_hello_

_ * italic and bold
#### header -> six levels

links: 
inline links example
[Search for it.](www.google.com)

reference links (so that multiple links to the same place only need to be updated once):
Do you want to [see something fun][a fun place]?

Well, do I have [the website for you][another fun place]!

[a fun place]: www.zombo.com
[another fun place]: www.stumbleupon.com


images:
inline image link

reference image link
![Black cat][Black]

![Orange cat][Orange]

[Black]: https://upload.wikimedia.org/wikipedia/commons/a/a3/81_INF_DIV_SSI.jpg
[Orange]:https://upload.wikimedia.org/wikipedia/commons/a/a3/81_INF_DIV_SSI.jpg



Blockquotes: "carets" >


lists:

unordered * bla
ordered 1. bla
space to indent! but stop at 3 levels

soft breaks (vs hard break) >> use double spaces at end of line




#### Using pipelines for linear regression

In [44]:
y = df[target]
X = df[selected_features]
y.shape, X.shape

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=42)

In [45]:
# Use pipelines for linear regression
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    LinearRegression()
)




In [46]:
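pipeline.fit(X_train, y_train)

# .score() on a regressor returns R^2, not accuracy.
print('Train Accuracy', pipeline.score(X_train, y_train))
# Bug: y_pred here is left over from an earlier cell, so this compares the
# pipeline against another model's predictions rather than against y_test.
# The intended call is pipeline.score(X_test, y_test).
print('Test Accuracy', pipeline.score(X_test, y_pred),"does not seem right")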

y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'mae ${mae:,.0f}')

Train Accuracy 0.15518025396037005
Test Accuracy 0.7990731248446383 does not seem right
mae $179


In [47]:
# Save pipeline 
with open('linear_model_pipeline.pkl','wb') as model_file:
    pickle.dump(pipeline, model_file)


In [48]:
# load model
with open('linear_model_pipeline.pkl','rb') as model_file:
    loaded_model = pickle.load(model_file)

In [49]:
# Run prediction
loaded_model.predict(X_train.iloc[:2,:])

# array([[134.01877682],[194.9625043 ]])

array([[134.00902037],
       [194.95762565]])

In [126]:
X_train

Unnamed: 0,bedrooms,beds,transit_len,access_len,bathrooms,bedrooms_str,require_guest_profile_picture,room_type,latitude,beds_str,longitude,minimum_nights,property_type,accommodates,notes_len,interaction_len,instant_bookable,bed_type,host_about_len,house_rules_len,is_business_travel_ready,maximum_nights,cancellation_policy,require_guest_phone_verification
39931,2.0,2.0,110,171,1.0,2,f,Entire home/apt,39.95069,2,-82.98802,1,Condominium,5,3,120,f,Real Bed,3,67,f,1125,strict_14_with_grace_period,f
217369,1.0,1.0,287,317,1.0,1,f,Entire home/apt,47.66101,1,-122.34084,30,Serviced apartment,3,455,270,t,Real Bed,3,376,f,1125,strict_14_with_grace_period,f
153008,2.0,2.0,3,3,1.5,2,f,Entire home/apt,40.82495,2,-73.94023,1,Apartment,4,3,3,t,Real Bed,3,3,f,30,flexible,f
180362,5.0,5.0,63,132,2.5,5,f,Entire home/apt,41.64919,5,-71.22206,3,House,10,127,383,f,Real Bed,316,114,f,1125,moderate,f
86132,1.0,2.0,345,67,1.0,1,f,Entire home/apt,34.03090,2,-118.26353,31,Apartment,3,3,212,t,Real Bed,3,3,f,365,strict_14_with_grace_period,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120860,1.0,1.0,3,3,1.0,1,f,Entire home/apt,40.79323,1,-73.97267,5,Apartment,2,3,3,f,Real Bed,25,3,f,60,moderate,f
104426,1.0,1.0,204,3,1.0,1,f,Entire home/apt,34.06525,1,-118.39569,31,Apartment,4,3,122,f,Real Bed,467,58,f,1125,strict_14_with_grace_period,f
132975,1.0,1.0,567,168,1.0,1,f,Entire home/apt,40.71780,1,-73.95595,4,Apartment,2,3,177,f,Real Bed,376,137,f,10,strict_14_with_grace_period,f
147923,1.0,0.0,172,12,1.0,1,f,Entire home/apt,40.76457,0,-73.97518,30,Apartment,1,40,3,f,Real Bed,3,3,f,1125,flexible,f


#### Linear model + Kbest in a pipeline

In [121]:
# from sklearn.feature_selection import SelectKBest
# selector = SelectKBest(k=15) #going down from 107 features to 15???
# X_train_selected = selector.fit_transform(X_train, y_train)
# X_test_selected = selector.transform(X_test)

In [122]:
# # User pipelines for linear regression
# import category_encoders as ce
# from sklearn.pipeline import make_pipeline
# from sklearn.linear_model import LinearRegression
# from sklearn.impute import SimpleImputer
# from sklearn.feature_selection import SelectKBest
# selector = SelectKBest(k=15) #going down from 107 features to 15???

# pipeline = make_pipeline(
#     ce.OneHotEncoder(use_cat_names=True),
#     SimpleImputer(strategy='mean'),
#     LinearRegression()
# )

# X_train_selected = selector.fit_transform(X_train, y_train)
# pipeline.fit(X_train_selected,)

# # pipeline.fit(X_train, y_train)
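The commented-out attempt above fits the selector outside the pipeline and on un-encoded data. One way this could work is to make `SelectKBest` a pipeline step after encoding. A minimal sketch under several assumptions: the toy frame is invented (only the column names mirror the notebook), scikit-learn's own `OneHotEncoder` is swapped in for `category_encoders`, and `f_regression` replaces the default score function.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame; column names mirror the notebook, values are invented.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n).astype(float),
    "accommodates": rng.integers(1, 9, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
})
y = 50 * X["bedrooms"] + 10 * X["accommodates"] + rng.normal(scale=5, size=n)

encode = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    remainder="passthrough",   # numeric columns pass through unchanged
    sparse_threshold=0,        # always return a dense array
)

pipeline = make_pipeline(
    encode,
    SimpleImputer(strategy="mean"),
    SelectKBest(score_func=f_regression, k=3),  # selection runs on encoded columns
    LinearRegression(),
)
pipeline.fit(X, y)
print(pipeline.predict(X.iloc[:2]))
```

With the selector inside the pipeline, the pickled object accepts raw rows and repeats encoding, selection and prediction in one call.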



In [123]:
# X_train_selected

In [124]:
X_train.shape

(184760, 24)

**OOPS**!

We've run into a problem. Latitude and longitude are the only location attributes in our prediction model. However, these coordinates would not be meaningful to users, who are unlikely to know which locations they correspond to. We could solve this by
>(a) translating the coordinates to their corresponding neighborhoods in the app. 
>(b) Or we could pick a different location variable. 

We'll pick (b) which we'll implement in a separate notebook.
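For comparison, option (a) could be sketched as a nearest-centroid lookup. Everything in this example is hypothetical: the area labels and centroid coordinates are placeholders, and a real app would more likely use neighborhood boundary polygons or a geocoding service.

```python
import math

# Hypothetical lookup table: area label -> (latitude, longitude) centroid.
CENTROIDS = {
    "area_a": (40.80, -73.95),
    "area_b": (34.05, -118.30),
    "area_c": (47.65, -122.35),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_area(lat, lon):
    """Return the label of the closest centroid."""
    return min(CENTROIDS, key=lambda name: haversine_km(lat, lon, *CENTROIDS[name]))

print(nearest_area(40.79323, -73.97267))
print(nearest_area(34.06525, -118.39569))
```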

