### Step 3: predict the property's price

**(1) Data Understanding**

**(2) Data Transformation and Formatting -- ETL pipeline**

**(3) ML Data Preparation - extract features and obtain training and validation data**
- **Check for missing values**:

- **Missing value imputation**
    - Impute missing values using the `continuous_imputation()` function

- **Handle categorical variables**:
    - Encode the "amenities" column of the listings data using the `encode_amenities()` function
    - Encode the other categorical variables with ordinal encoding (`_one_hot_encoding()`, which applies `OrdinalEncoder`) or target encoding (`_target_encoding()`)
    
- **Finally, `ml_dataprep()` combines these steps to produce the training and validation data**.

**(4) Data Modeling and Evaluate the Results**

In [2]:
# statistics
import pandas as pd
import numpy as np
import math as mt

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Preprocessing - Standardization, Encoding, Imputation
from sklearn.preprocessing import StandardScaler # Standardization
from sklearn.preprocessing import Normalizer # Normalization
from sklearn.preprocessing import OneHotEncoder # One-hot Encoding
from sklearn.preprocessing import OrdinalEncoder # Ordinal Encoding
from category_encoders import MEstimateEncoder # Target Encoding
from sklearn.preprocessing import PolynomialFeatures # Create Polynomial Features
from sklearn.impute import SimpleImputer # Imputation

# Exploratory Data Analysis - Feature Engineering
from sklearn.feature_selection import mutual_info_regression
from sklearn.decomposition import PCA

# Modeling - ML Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# Modeling - Algorithms
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# ML - Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


### (1) Data Understanding

**Load dataset**

In [3]:
import os

for dirname, _, filenames in os.walk("../data"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
listings = pd.read_csv("../data/listings_cs.csv") 
reviews = pd.read_csv("../data/reviews.csv")
calendar = pd.read_csv("../data/calendar.csv")

../data/seattle-parks-and-recreation-park-addresses.csv
../data/reviews.csv
../data/tourist attractions_clean.csv
../data/listings_cs_kfold.csv
../data/listings.csv
../data/listings_cs.csv
../data/calendar.csv
../data/seattle_top55 tourist attractions.csv




Here, we import three files included in the Seattle dataset:
- *Listings*, including full descriptions and the average review score.
    - Note: `listings_cs.csv` is not identical to the original file provided on Kaggle; it adds convenience-score features computed in `convenience_score.ipynb`.
    
- *Reviews*, including a unique id for each reviewer and detailed comments.
- *Calendar*, including the listing id and the price and availability for each day.

In [4]:
listings.shape, calendar.shape, reviews.shape

((3818, 595), (1393570, 4), (84849, 6))

**Cross-Validation KFold**

In [22]:
def generate_listings_kfold():
    # Assign each listing to one of 5 folds, stored in the "kfold" column
    listings = pd.read_csv("../data/listings_cs.csv")
    
    if os.path.exists("../data/listings_cs_kfold.csv"):
        os.remove("../data/listings_cs_kfold.csv")
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, valid_idx) in enumerate(kf.split(X=listings)):
        listings.loc[valid_idx, "kfold"] = fold

    listings.to_csv('../data/listings_cs_kfold.csv', index=False)

# After assigning kfold
generate_listings_kfold()
listings = pd.read_csv("../data/listings_cs_kfold.csv")
listings.loc[:, ['id', 'kfold']].head()



Unnamed: 0,id,kfold
0,241032,0.0
1,953595,4.0
2,3308979,2.0
3,7421966,3.0
4,278830,4.0
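
To make the fold assignment concrete, here is a minimal sketch on a hypothetical ten-row table (the `toy` DataFrame and its ids are made up for illustration). Each row receives the index of the single fold in which it serves as validation data. Initialising the column as an integer also avoids the float labels (`0.0`, `4.0`) seen above, which appear because a column created through partial assignment starts out as NaN.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy table standing in for the listings data
toy = pd.DataFrame({"id": range(100, 110)})
toy["kfold"] = -1  # integer placeholder, overwritten below

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf.split(toy)):
    # valid_idx holds positional indices; toy has a default RangeIndex,
    # so positions and labels coincide here
    toy.loc[valid_idx, "kfold"] = fold

fold_sizes = toy["kfold"].value_counts().sort_index()
print(fold_sizes)
```

With ten rows and five splits, every fold ends up with two validation rows, and each row belongs to exactly one fold.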


### (2) Data Transformation and Formatting -- ETL pipeline
- Note: the "ETL pipeline" code is borrowed from Zacks Shen: https://github.com/ZacksAmber/Kaggle-Seattle-Airbnb. 

His project builds a price prediction model on the same dataset, but our research question differs: this project investigates whether the convenience scores created in Step 2 improve price prediction. 

In [4]:
class ETL_pipeline:
    def __init__(self, data_frame):
        self.df = data_frame
    
    # Data type transformation
    def _transformation(self, data_frame):
        df = data_frame
        # Convert dollar columns from object to float
        # Remove '$' and ','
        dollar_cols = ['price', 'weekly_price', 'monthly_price', 'extra_people', 'security_deposit', 'cleaning_fee']
        for dollar_col in dollar_cols:
            df[dollar_col] = df[dollar_col].replace(r'[\$,]', '', regex=True).astype(float)
        
        # Convert rate columns from object to float
        # Remove '%'
        percent_cols = ['host_response_rate', 'host_acceptance_rate']
        for percent_col in percent_cols:
            df[percent_col] = df[percent_col].replace('%', '', regex=True).astype(float)

        # Group the following rare "property_type" values into "Unique space" because of their small sample sizes
        unique_space = ["Barn", "Boat","Bus","Camper/RV", "Treehouse","Campsite","Castle","Cave", "Dome House",
        "Earth house","Farm stay","Holiday park", "Houseboat", "Hut","Igloo", "Island", "Lighthouse", "Plane",
        "Ranch",  "Religious building","Shepherd’s hut", "Shipping container", "Tent", "Tiny house", "Tipi",
        "Tower", "Train","Windmill", "Yurt","Riad","Pension","Dorm", "Chalet"]            
        df.property_type = df.property_type.replace(unique_space, "Unique space", regex=True)

        # Convert 't'(true), 'f'(false) to 1, 0
        tf_cols = ['host_is_superhost', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification']
        for tf_col in tf_cols:
            df[tf_col] = df[tf_col].replace('f', 0, regex=True)
            df[tf_col] = df[tf_col].replace('t', 1, regex=True)
        
        return df
    
    # Parse listings
    def parse_listings(self):
        """Parse listings.
        """
        df = self.df
        df = self._transformation(df)
        return df
    
    def parse_reviews(self):
        """Parse reviews.
        """
        df = self.df
        df.date = pd.to_datetime(df.date)
        return df
    
    # Parse calendar
    def parse_calender(self):
        """Parse calendar.
        """
        df = self.df
        # Convert date from object to datetime
        df.date = pd.to_datetime(df.date)
        
        # Convert price from object to float
        # Convert '$' and ',' to ''
        df.price = df.price.replace(r'[\$,]', '', regex=True).astype(float)
        
        # Convert 't', 'f' to 1, 0
        df['available'] = df['available'].replace('f', 0, regex=True)
        df['available'] = df['available'].replace('t', 1, regex=True)
        return df

listings = pd.read_csv("../data/listings_cs_kfold.csv")
listings = ETL_pipeline(listings).parse_listings()
reviews = ETL_pipeline(reviews).parse_reviews()
calendar = ETL_pipeline(calendar).parse_calender()
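
The string-to-number conversions in `_transformation()` can be checked in isolation. The sketch below uses hypothetical raw values (the `raw` DataFrame is made up) and the same regular expressions as the pipeline, written as raw strings so that `\$` is not treated as a string escape.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values in the style of the listings columns
raw = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", np.nan],
    "host_response_rate": ["96%", np.nan, "100%"],
})

clean = pd.DataFrame({
    # Strip '$' and ',' before casting; missing values stay NaN
    "price": raw["price"].replace(r"[\$,]", "", regex=True).astype(float),
    # Strip '%' so the rate becomes a plain number between 0 and 100
    "host_response_rate": raw["host_response_rate"].replace("%", "", regex=True).astype(float),
})
print(clean)
```

Missing entries pass through the conversion untouched, which is why the imputation step later in the notebook is still needed.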


In [5]:
# The calendar records the availability of each listing for the next 365 days
calendar.groupby('listing_id').size().head()


listing_id
3335    365
4291    365
5682    365
6606    365
7369    365
dtype: int64

### (3) ML_dataprep pipeline

- **Check for missing values**:

- **Missing value imputation**
    - Impute missing values using the `continuous_imputation()` function

- **Handle categorical variables**:
    - Encode the "amenities" column of the listings data using the `encode_amenities()` function
    - Encode the other categorical variables with ordinal encoding (`_one_hot_encoding()`, which applies `OrdinalEncoder`) or target encoding (`_target_encoding()`)
    
- **Finally, `ml_dataprep()` combines these steps to produce the training and validation data**.

**Check for missing values**

- Based on Zacks Shen's analysis, the features with missing values fall into three groups: "zero_features" (filled with 0), "mean_features" (filled with the column mean), and "mode_features" (filled with the most frequent value, which also works for categorical features such as 'property_type'). Let's first check the percentage of missing values for each selected feature.


In [25]:
## Based on Zacks Shen's analysis, the features with missing values fall into three groups
zero_features = ['reviews_per_month', 'host_response_rate', 'host_is_superhost', 'security_deposit', 'cleaning_fee']
mean_features = ['host_acceptance_rate', 'review_scores_accuracy', 'review_scores_checkin', 
                         'review_scores_value', 'review_scores_location', 'review_scores_cleanliness', 
                         'review_scores_communication', 'review_scores_rating']
mode_features = ['bathrooms', 'bedrooms', 'beds', 'property_type']


Let's check the missing proportion for each set of features.

In [27]:
print('For the "zero-feature set",\n\tthe percentage of missing values for each column is shown below\n', round( 
    listings[zero_features].isnull().sum(axis = 0)/len(listings), 3))
listings[zero_features].head(2)

For the "zero-feature set",
	the percentage of missing values for each column is shown below
 reviews_per_month     0.164
host_response_rate    0.137
host_is_superhost     0.001
security_deposit      0.511
cleaning_fee          0.270
dtype: float64


Unnamed: 0,reviews_per_month,host_response_rate,host_is_superhost,security_deposit,cleaning_fee
0,4.07,96%,f,,
1,1.48,98%,t,$100.00,$40.00


In [28]:
print('For the "mean-feature set",\n\tthe percentage of missing values for each column is shown below\n', round( 
    listings[mean_features].isnull().sum(axis = 0)/len(listings), 3))
listings[mean_features].head(2)

For the "mean-feature set",
	the percentage of missing values for each column is shown below
 host_acceptance_rate           0.202
review_scores_accuracy         0.172
review_scores_checkin          0.172
review_scores_value            0.172
review_scores_location         0.172
review_scores_cleanliness      0.171
review_scores_communication    0.171
review_scores_rating           0.169
dtype: float64


Unnamed: 0,host_acceptance_rate,review_scores_accuracy,review_scores_checkin,review_scores_value,review_scores_location,review_scores_cleanliness,review_scores_communication,review_scores_rating
0,100%,10.0,10.0,10.0,9.0,10.0,10.0,95.0
1,100%,10.0,10.0,10.0,10.0,10.0,10.0,96.0


In [29]:
print('For the "mode-feature set",\n\tthe percentage of missing values for each column is shown below\n', round( 
    listings[mode_features].isnull().sum(axis = 0)/len(listings), 3))
listings[mode_features].head(2)

For the "mode-feature set",
	the percentage of missing values for each column is shown below
 bathrooms        0.004
bedrooms         0.002
beds             0.000
property_type    0.000
dtype: float64


Unnamed: 0,bathrooms,bedrooms,beds,property_type
0,1.0,1.0,1.0,Apartment
1,1.0,1.0,1.0,Apartment



**Missing value imputation**
- Impute missing values using the `continuous_imputation()` function
    - Based on Zacks Shen's analysis, the features with missing values fall into three groups: "zero_features" (filled with 0), "mean_features" (filled with the column mean), and "mode_features" (filled with the most frequent value, which also works for categorical features such as 'property_type'). Each imputer is fitted on the training split only and then applied to the validation split.
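
The three fill strategies can be sketched on a small example. The toy `train` and `valid` frames below are hypothetical; only the column names are taken from the feature lists. The point to notice is that the fill value for the validation row comes from statistics fitted on the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits; column names mirror three of the listed features
train = pd.DataFrame({
    "cleaning_fee": [40.0, np.nan, 25.0, 10.0],
    "review_scores_rating": [90.0, np.nan, 100.0, 95.0],
    "bedrooms": [1.0, 1.0, 2.0, np.nan],
})
valid = pd.DataFrame({
    "cleaning_fee": [np.nan],
    "review_scores_rating": [np.nan],
    "bedrooms": [np.nan],
})

strategies = {
    "cleaning_fee": SimpleImputer(strategy="constant", fill_value=0),
    "review_scores_rating": SimpleImputer(strategy="mean"),
    "bedrooms": SimpleImputer(strategy="most_frequent"),
}

train_imp, valid_imp = train.copy(), valid.copy()
for col, imputer in strategies.items():
    # Fit on the training split only, then reuse the fitted statistic
    train_imp[col] = imputer.fit_transform(train[[col]]).ravel()
    valid_imp[col] = imputer.transform(valid[[col]]).ravel()

print(valid_imp)
```

Fitting on the training split alone keeps information from the validation fold out of the fill values.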
    

In [8]:

def continuous_imputation(X_train, X_valid, y_train, y_valid): 
    
    zero_features = ['reviews_per_month', 'host_response_rate', 'host_is_superhost', 'security_deposit', 'cleaning_fee']
    mean_features = ['host_acceptance_rate', 'review_scores_accuracy', 'review_scores_checkin', 
                             'review_scores_value', 'review_scores_location', 'review_scores_cleanliness', 
                             'review_scores_communication', 'review_scores_rating']
    mode_features = ['bathrooms', 'bedrooms', 'beds', 'property_type']

    X_train, X_valid, y_train, y_valid = X_train.copy(), X_valid.copy(), y_train.copy(), y_valid.copy()

    # Zero imputation
    X_train, X_valid = simple_imputer(zero_features, 'constant', X_train, X_valid, 'float')
    
    # Mean imputation
    X_train, X_valid = simple_imputer(mean_features, 'mean', X_train, X_valid, 'float')
    
    # Mode imputation
    X_train, X_valid = simple_imputer(mode_features, 'most_frequent', X_train, X_valid, 'int')
    
    return X_train, X_valid, y_train, y_valid

def simple_imputer(imp_features, strategy, X_train, X_valid, feature_type):
    if strategy == 'constant':
        imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
    else:
        imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
        
    X_train_imp = pd.DataFrame(imp.fit_transform(X_train[imp_features]))
    X_valid_imp = pd.DataFrame(imp.transform(X_valid[imp_features]))
    X_train_imp.columns = imp_features
    X_valid_imp.columns = imp_features

    X_train_imp.index = X_train.index
    X_valid_imp.index = X_valid.index
    
    if strategy == 'most_frequent':
        X_train_imp[['bathrooms', 'bedrooms', 'beds']] = X_train_imp[['bathrooms', 'bedrooms', 'beds']].astype(feature_type)
        X_valid_imp[['bathrooms', 'bedrooms', 'beds']] = X_valid_imp[['bathrooms', 'bedrooms', 'beds']].astype(feature_type)
    else: 
        X_train_imp = X_train_imp.astype( feature_type )
        X_valid_imp = X_valid_imp.astype( feature_type )

    # Overwrite the original columns with their imputed values
    for feature in imp_features:
        X_train.loc[:, feature] = X_train_imp.loc[:, feature]
        X_valid.loc[:, feature] = X_valid_imp.loc[:, feature]
    
    return X_train, X_valid


# Split train and valid, an example.

# tt = listings.copy()
# kfold = 1

# X_train = tt[tt.kfold != kfold]
# X_valid = tt[tt.kfold == kfold]
# y_train = X_train.pop('price')
# y_valid = X_valid.pop('price')
# X_train, X_valid, y_train, y_valid = continuous_imputation(X_train, X_valid, y_train, y_valid)


**Handle categorical variables**:
- Encode the "amenities" column of the listings data using the `encode_amenities()` function
- Encode the other categorical variables with ordinal encoding (`_one_hot_encoding()`, which applies `OrdinalEncoder`) or target encoding (`_target_encoding()`)

- **Encoding the "amenities" categorical variable** using the `encode_amenities()` function
    

In [6]:
print( listings.amenities.head() )

0    {TV,"Cable TV",Internet,"Wireless Internet","A...
1    {TV,Internet,"Wireless Internet",Kitchen,"Free...
2    {TV,"Cable TV",Internet,"Wireless Internet","A...
3    {Internet,"Wireless Internet",Kitchen,"Indoor ...
4    {TV,"Cable TV",Internet,"Wireless Internet",Ki...
Name: amenities, dtype: object


In [7]:
def encode_amenities(df):
    # Strip braces and quotes, then split each string into a list of amenities
    df.amenities = df.amenities.replace('[{}"]', "", regex=True)
    df.amenities = df.amenities.str.split(",")
    
    ## get a unique set of amenities 
    uniq_amenities = set().union(*df.amenities) 
    
    # Remove '' and "Washer / Dryer" (due to low frequency); discard() does not fail if a value is absent
    uniq_amenities.discard('')
    uniq_amenities.discard('Washer / Dryer')
    
    # encoding amenities
    amenities_encod = pd.DataFrame()
    for item in uniq_amenities:
        amenities_encod[f"amenity_encod_{item}"] = df.amenities.str.contains(item, regex=False)
    
    # Concatenate the encoded amenities with the input DataFrame
    df = pd.concat([df, amenities_encod], axis=1)
    # Drop the original "amenities" column
    df.pop('amenities')
    
    return df

print(listings.shape)
listings_encode = encode_amenities(listings) ## --- apply "encode_amenities()" function
print(listings_encode.shape)

listings_encode.loc[1:3, listings_encode.columns[ listings_encode.columns.str.contains("amenity_encod") ]]

(3818, 595)
(3818, 634)


Unnamed: 0,amenity_encod_Breakfast,amenity_encod_Elevator in Building,amenity_encod_Cat(s),amenity_encod_Dryer,amenity_encod_Essentials,amenity_encod_Internet,amenity_encod_Free Parking on Premises,amenity_encod_Safety Card,amenity_encod_Smoking Allowed,amenity_encod_First Aid Kit,...,amenity_encod_Laptop Friendly Workspace,amenity_encod_Buzzer/Wireless Intercom,amenity_encod_Cable TV,amenity_encod_Air Conditioning,amenity_encod_Carbon Monoxide Detector,amenity_encod_Doorman,amenity_encod_Fire Extinguisher,amenity_encod_Hangers,amenity_encod_Heating,amenity_encod_24-Hour Check-in
1,False,False,False,True,True,True,True,True,False,True,...,False,True,False,False,True,False,True,False,True,False
2,False,False,True,True,True,True,True,False,False,False,...,False,False,True,True,True,False,False,False,True,False
3,False,False,False,True,True,True,False,True,False,False,...,False,False,False,False,True,False,True,False,True,False


After encoding, the "amenities" column is replaced by 40 indicator columns, so the column count goes from 595 to 634. Each new column records whether the property provides a specific amenity, such as Breakfast or Elevator in Building shown above. 
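
As a cross-check on a small example, the same kind of indicator table can be produced with `Series.str.get_dummies`. This is a swapped-in technique, not the loop used in `encode_amenities()`, and the three amenity strings are hypothetical.

```python
import pandas as pd

# Hypothetical amenities strings in the same brace-and-quote format
amenities = pd.Series([
    '{TV,"Cable TV",Internet}',
    '{Internet,Kitchen}',
    '{}',
])

# Strip braces and quotes, then build one indicator column per amenity
cleaned = amenities.str.replace(r'[{}"]', "", regex=True)
indicators = (
    cleaned.str.get_dummies(sep=",")
    .astype(bool)
    .add_prefix("amenity_encod_")
)
print(indicators)
```

Because `get_dummies` splits on the separator before matching, "TV" and "Cable TV" are treated as distinct amenities, and the empty listing gets a row of `False` values.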

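- **Encoding other categorical variables**, using ordinal encoding or target encoding. Note that the function named `_one_hot_encoding()` applies `OrdinalEncoder` (integer codes), not one-hot encoding.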
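
Since the helper named `_one_hot_encoding()` in this notebook calls `OrdinalEncoder`, it is worth seeing how the two scikit-learn encoders differ. The sketch below uses hypothetical values for a single feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical values for one categorical feature
X = pd.DataFrame({"cancellation_policy": ["strict", "flexible", "moderate", "strict"]})

# Ordinal encoding: one column of integer codes (categories sorted alphabetically)
ordinal = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one 0/1 column per category
onehot = OneHotEncoder().fit_transform(X).toarray()

print(ordinal.ravel())
print(onehot)
```

Ordinal codes impose an order on the categories that a linear model will treat as numeric, which is worth keeping in mind when reading the results for `target_encoding=False`.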

In [9]:

def _one_hot_encoding(X_train, X_valid, y_train, y_valid):
    X_train, X_valid, y_train, y_valid = X_train.copy(), X_valid.copy(), y_train.copy(), y_valid.copy()

    oe_enc_features = ['cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 
                           'neighbourhood_group_cleansed', 'property_type', 'instant_bookable', 'room_type', 'bed_type']

    oe = OrdinalEncoder()
    X_train[oe_enc_features] = oe.fit_transform(X_train[oe_enc_features])
    X_valid[oe_enc_features] = oe.transform(X_valid[oe_enc_features])

    return X_train, X_valid, y_train, y_valid

def _target_encoding(X_train, X_valid, y_train, y_valid):
    X_train, X_valid, y_train, y_valid = X_train.copy(), X_valid.copy(), y_train.copy(), y_valid.copy()

    target_enc_features = ['cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 
                           'neighbourhood_group_cleansed', 'property_type', 'instant_bookable', 'room_type', 'bed_type']

    # Create the encoder instance. Choose m to control noise.
    target_enc = MEstimateEncoder(cols=target_enc_features, m=5.0)
    X_train = target_enc.fit_transform(X_train, y_train)
    X_valid = target_enc.transform(X_valid)

    return X_train, X_valid, y_train, y_valid
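
The idea behind m-estimate target encoding can be sketched without the library: each category is replaced by a blend of its own mean target and the global mean, with `m` controlling how much weight the global mean gets. The helper `m_estimate_encode` and the toy data are hypothetical; this illustrates the general formula and is not a description of `category_encoders` internals.

```python
import pandas as pd

def m_estimate_encode(categories, target, m=5.0):
    """Map each category to (sum of its targets + m * global mean) / (count + m)."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)

# Hypothetical training data
room_type = pd.Series(["Entire home", "Entire home", "Private room", "Shared room"])
price = pd.Series([200.0, 100.0, 60.0, 40.0])

mapping = m_estimate_encode(room_type, price, m=5.0)
print(mapping)
```

Categories with few rows are pulled strongly toward the global mean of 100, and a larger `m` strengthens that shrinkage.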
    

**Finally, we can construct a function `"ml_dataprep()"` for ml_dataprep.**

In [12]:

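def ml_dataprep(df, features, target, kfold, target_encoding=True):
    """
    Args:
        df (Pandas DataFrame): "listings" data, already restricted to the model columns and amenity-encoded.
        features (list): The ML features (currently unused; columns are selected before the call).
        target (str): 'price'
        kfold (int): Fold held out for validation, 0-4.
        target_encoding (bool): True for target encoding, False for ordinal encoding.
    """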
    import warnings
    warnings.filterwarnings("ignore") # ignore target encoding warnings

#     features.append(target)
#     df = df[features]

#     # Encode amenities --- using "encode_amenities()" function
#     df = encode_amenities(df) 

    # Split train and valid; copy so that pop() does not act on a view of df
    X_train = df[df.kfold != kfold].copy()
    X_valid = df[df.kfold == kfold].copy()
    y_train = X_train.pop(target)
    y_valid = X_valid.pop(target) # 'price'

    # Imputation                  --- using "continuous_imputation()" function
    X_train, X_valid, y_train, y_valid = continuous_imputation(X_train, X_valid, y_train, y_valid)

    # Target Encoding             --- using "_target_encoding()" function
    if target_encoding:
        X_train, X_valid, y_train, y_valid = _target_encoding(X_train, X_valid, y_train, y_valid)
        
    else:  # ordinal encoding     --- using "_one_hot_encoding()" function (applies OrdinalEncoder)
        X_train, X_valid, y_train, y_valid = _one_hot_encoding(X_train, X_valid, y_train, y_valid)

    return X_train, X_valid, y_train, y_valid


# listings = pd.read_csv("listings_cs_kfold.csv")
# listings = ETL_pipeline(listings).parse_listings()

# base_features = ['host_acceptance_rate', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
#             'bathrooms', 'bedrooms', 'beds', 'bed_type', 'number_of_reviews', 'review_scores_rating',
#             'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
#             'review_scores_location', 'review_scores_value', 'reviews_per_month', 'host_response_rate', 'host_is_superhost', 
#             'accommodates', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 
#             'maximum_nights', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 
#             'require_guest_phone_verification', 'amenities', 'kfold']

# kfold = 1

# X_train, X_valid, y_train, y_valid = ml_dataprep(df=listings, features=base_features, target='price', \
#                                                  kfold=kfold, target_encoding=True)

    

### (4) Data Modeling and Evaluate the Results
- **Two model settings** (comparing two sets of features)
    - model 1: m1_features
    - model 2: m1_features + convenience_score-related features
    
- **Two machine learning algorithms**:
    - [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
    - [RandomForests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
    
- **Evaluation metric**:
    - [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
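
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the target (dollars). A minimal sketch with hypothetical numbers, where the helper `rmse` is introduced only for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical nightly prices and predictions, in dollars
actual = [100.0, 150.0, 80.0]
predicted = [110.0, 130.0, 80.0]
print(rmse(actual, predicted))
```

Squaring the errors means one large miss (the 20-dollar error here) weighs more than several small ones.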

In [38]:

listings = pd.read_csv("../data/listings_cs_kfold.csv")
listings = ETL_pipeline(listings).parse_listings()

### input of "ml_dataprep()"....
m1_features_input = ['host_acceptance_rate', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
            'bathrooms', 'bedrooms', 'beds', 'bed_type', 'number_of_reviews', 'review_scores_rating',
            'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
            'review_scores_location', 'review_scores_value', 'reviews_per_month', 'host_response_rate', 'host_is_superhost', 
            'accommodates', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 
            'maximum_nights', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 
            'require_guest_phone_verification', 'amenities', 'kfold']


####################################
#####  Model 1-- m1_features..
m1_features = m1_features_input.copy()
m1_features.remove('amenities')

m1_features_input.append('price')
df = listings[m1_features_input].copy()
# Encode amenities --- using "encode_amenities()" function
df = encode_amenities(df) 
    
for target_encoding in [False, True]:
    
    md1_RMSE = []
    
    for kfold in range(5):

        X_train, X_test, y_train, y_test = ml_dataprep( df=df, features=m1_features_input, target='price', \
                                                       kfold=kfold, target_encoding=target_encoding)
        ## LinearRegression function
        lr_model = LinearRegression()
        lr_parameters = { "fit_intercept": [True, False]}
        lr_grid = GridSearchCV(estimator=lr_model, 
                            param_grid = lr_parameters, 
                            cv = 3, 
                            n_jobs=-1,
    #                         random_state=kfold,
                            )
        lr_grid.fit(X_train[m1_features], y_train)
        test_preds = lr_grid.predict(X_test[m1_features])

        # Take the root of the MSE explicitly; the squared=False argument is deprecated in recent scikit-learn
        RMSE = np.sqrt(mean_squared_error(y_test, test_preds))
        print(f"ML1, kfold: {kfold}. RMSE: {RMSE}")
        md1_RMSE.append(RMSE)
    
    print(f"Model 1 (LinearRegression with target_encoding={target_encoding}). Average RMSE: {np.mean(md1_RMSE)}\n")


ML1, kfold: 0. RMSE: 56.70855885709866
ML1, kfold: 1. RMSE: 66.50284923910483
ML1, kfold: 2. RMSE: 60.81836541597313
ML1, kfold: 3. RMSE: 62.798432234822165
ML1, kfold: 4. RMSE: 86.31317047062942
Model 1 (LinearRegression with target_encoding=False). Average RMSE: 66.62827524352565

ML1, kfold: 0. RMSE: 55.47840623642598
ML1, kfold: 1. RMSE: 65.2719817746783
ML1, kfold: 2. RMSE: 59.29221420761654
ML1, kfold: 3. RMSE: 62.10852061619225
ML1, kfold: 4. RMSE: 86.9957031795546
Model 1 (LinearRegression with target_encoding=True). Average RMSE: 65.82936520289353



In [39]:


####################################
#####  Model 2-- m2_features..

### input of "ml_dataprep()"....
m1_features_input = ['host_acceptance_rate', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
            'bathrooms', 'bedrooms', 'beds', 'bed_type', 'number_of_reviews', 'review_scores_rating',
            'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
            'review_scores_location', 'review_scores_value', 'reviews_per_month', 'host_response_rate', 'host_is_superhost', 
            'accommodates', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 
            'maximum_nights', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 
            'require_guest_phone_verification', 'amenities', 'kfold']

m2_features_input = m1_features_input.copy()
m2_features_input.extend(["cs_attr_5_counts", "cs_park_5_counts", \
                                                   "cs_attr_10_avgdist", "cs_park_10_avgdist"])

m2_features = m2_features_input.copy()
m2_features.remove('amenities')


m2_features_input.append('price')
df = listings[m2_features_input].copy()
# Encode amenities --- using "encode_amenities()" function
df = encode_amenities(df) 
    
for target_encoding in [False, True]:
    
    md2_RMSE = []
    
    for kfold in range(5):

        X_train, X_test, y_train, y_test = ml_dataprep( df=df, features=m2_features_input, target='price', \
                                                       kfold=kfold, target_encoding=target_encoding)
        ## LinearRegression function
        lr_model = LinearRegression()
        lr_parameters = { "fit_intercept": [True, False]}
        lr_grid = GridSearchCV(estimator=lr_model, 
                            param_grid = lr_parameters, 
                            cv = 3, 
                            n_jobs=-1,
    #                         random_state=kfold,
                            )
        lr_grid.fit(X_train[m2_features], y_train)
        test_preds = lr_grid.predict(X_test[m2_features])

        # Take the root of the MSE explicitly; the squared=False argument is deprecated in recent scikit-learn
        RMSE = np.sqrt(mean_squared_error(y_test, test_preds))
        print(f"ML2, kfold: {kfold}. RMSE: {RMSE}")
        md2_RMSE.append(RMSE)

    print(f"Model 2 (LinearRegression with target_encoding={target_encoding}). Average RMSE: {np.mean(md2_RMSE)}\n")


ML2, kfold: 0. RMSE: 55.31544479857226
ML2, kfold: 1. RMSE: 64.44421460776245
ML2, kfold: 2. RMSE: 58.77542158192223
ML2, kfold: 3. RMSE: 61.59597900497572
ML2, kfold: 4. RMSE: 82.6865979719371
Model 2 (LinearRegression with target_encoding=False). Average RMSE: 64.56353159303396

ML2, kfold: 0. RMSE: 55.279150838411404
ML2, kfold: 1. RMSE: 64.29133372829887
ML2, kfold: 2. RMSE: 58.872667001227185
ML2, kfold: 3. RMSE: 61.67166260732547
ML2, kfold: 4. RMSE: 86.02107090313123
Model 2 (LinearRegression with target_encoding=True). Average RMSE: 65.22717701567883



#### Using the "Linear Regression" model, we find that:
<img src="../result/result_LinearReg.png" width=600 height=420 />

- with target_encoding = False (that is, the `_one_hot_encoding()` path, which applies `OrdinalEncoder`), 
    - Model 1 with the baseline features obtains an average RMSE of 66.628.
    - Model 2 with the baseline features and the four convenience_score-related features obtains an average RMSE of 64.564.
    
    Thus, the four convenience_score-related features I generated lower the average RMSE when the categorical features are encoded with this simpler technique.
    
    
- with target_encoding = True (that is, we use `target-encoding`), 
    - Model 1 with the baseline features obtains an average RMSE of 65.829.
    - Model 2 with the baseline features and the four convenience_score-related features obtains an average RMSE of 65.227.
    
    Thus, the four convenience_score-related features also lower the average RMSE under the more advanced "target-encoding" technique, although the gain is smaller.
    

#### I also implement the "Random Forests" algorithm to check whether the four convenience-score-related features improve price prediction performance. 
<img src="../result/result_RandomForest.png" width=500 height=320 />

- with target_encoding = False (that is, the `_one_hot_encoding()` path, which applies `OrdinalEncoder`), 
    - Model 1 with the baseline features obtains an average RMSE of 56.836.
    - Model 2 with the baseline features and the four convenience_score-related features obtains an average RMSE of 55.400.
    
    Thus, the four convenience_score-related features I generated lower the average RMSE when the categorical features are encoded with this simpler technique.
    
    
- with target_encoding = True (that is, we use `target-encoding`), 
    - Model 1 with the baseline features obtains an average RMSE of 55.989.
    - Model 2 with the baseline features and the four convenience_score-related features obtains an average RMSE of 55.269.
    
    Thus, the four convenience_score-related features also lower the average RMSE under the more advanced "target-encoding" technique, although the gain is again smaller.
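
To put the comparisons above on one scale, the sketch below recomputes the gap between Model 1 and Model 2 from the reported average RMSE values, both in absolute terms and relative to Model 1. The dictionary simply restates the numbers quoted in this section; the variable names are made up for illustration.

```python
# Average RMSE values as reported above: (Model 1, Model 2)
reported = {
    ("LinearRegression", False): (66.628, 64.564),
    ("LinearRegression", True): (65.829, 65.227),
    ("RandomForest", False): (56.836, 55.400),
    ("RandomForest", True): (55.989, 55.269),
}

improvements = {}
for (algorithm, target_encoding), (m1, m2) in reported.items():
    drop = round(m1 - m2, 3)              # absolute RMSE reduction
    pct = round(100 * (m1 - m2) / m1, 1)  # reduction relative to Model 1
    improvements[(algorithm, target_encoding)] = (drop, pct)
    print(f"{algorithm:16s} target_encoding={target_encoding!s:5s} drop={drop:.3f} ({pct:.1f}%)")
```

The relative reductions are a few percent at most, while the per-fold RMSE values printed earlier differ by far more than that, so a per-fold comparison would be a sensible further check before drawing strong conclusions.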
   

In [40]:

listings = pd.read_csv("../data/listings_cs_kfold.csv")
listings = ETL_pipeline(listings).parse_listings()

### input of "ml_dataprep()"....
m1_features_input = ['host_acceptance_rate', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
            'bathrooms', 'bedrooms', 'beds', 'bed_type', 'number_of_reviews', 'review_scores_rating',
            'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
            'review_scores_location', 'review_scores_value', 'reviews_per_month', 'host_response_rate', 'host_is_superhost', 
            'accommodates', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 
            'maximum_nights', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 
            'require_guest_phone_verification', 'amenities', 'kfold']


####################################
#####  Model 1-- m1_features..
m1_features = m1_features_input.copy()
m1_features.remove('amenities')

m1_features_input.append('price')
df = listings[m1_features_input].copy()
# Encode amenities --- using "encode_amenities()" function
df = encode_amenities(df) 
    
for target_encoding in [False, True]:
    
    md1_RMSE = []
    
    for kfold in range(5):

        X_train, X_test, y_train, y_test = ml_dataprep( df=df, features=m1_features_input, target='price', \
                                                       kfold=kfold, target_encoding=target_encoding)
    
        ## RandomForest
        rf_model = RandomForestRegressor(random_state=kfold)
        rf_parameters = {
                        'max_depth': [80, 100],
                        'max_features': [2, 3],
#                         'min_samples_leaf': [3, 4, 5],
#                         'min_samples_split': [8, 10, 12],
#                         'n_estimators': [100, 200, 300]
                        }
        rf_grid = GridSearchCV(estimator = rf_model, param_grid = rf_parameters, 
                                  cv = 3, n_jobs = -1, verbose = 2)
        rf_grid.fit(X_train[m1_features], y_train)
        test_preds = rf_grid.predict(X_test[m1_features])

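        # RMSE = sqrt(MSE); this avoids the `squared=False` argument, which newer scikit-learn releases deprecate
        RMSE = np.sqrt(mean_squared_error(y_test, test_preds))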
        print(f"ML1, kfold: {kfold}. RMSE: {RMSE}")
        md1_RMSE.append(RMSE)
    
    print(f"Model 1 (RandomForest with target_encoding={target_encoding}). Average RMSE: {np.mean(md1_RMSE)}\n")



Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 0. RMSE: 54.422594651049046
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 1. RMSE: 63.100122356555644
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 2. RMSE: 57.432182418377955
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 3. RMSE: 59.69225249036197
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 4. RMSE: 49.532044227493905
Model 1 (RandomForest with target_encoding=False). Average RMSE: 56.8358392287677

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 0. RMSE: 52.66218590472293
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 1. RMSE: 61.86024764124715
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 2. RMSE: 56.92278624687547
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 3. RMSE: 59.43959069635754
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.1s finished


ML1, kfold: 4. RMSE: 49.06027695692371
Model 1 (RandomForest with target_encoding=True). Average RMSE: 55.989017489225354



In [41]:



####################################
#####  Model 2-- m2_features..

# Rebuild the baseline input list for ml_dataprep(): the previous cell appended 'price' to m1_features_input in place
m1_features_input = ['host_acceptance_rate', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
            'bathrooms', 'bedrooms', 'beds', 'bed_type', 'number_of_reviews', 'review_scores_rating',
            'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
            'review_scores_location', 'review_scores_value', 'reviews_per_month', 'host_response_rate', 'host_is_superhost', 
            'accommodates', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 
            'maximum_nights', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 
            'require_guest_phone_verification', 'amenities', 'kfold']

m2_features_input = m1_features_input.copy()
m2_features_input.extend(["cs_attr_5_counts", "cs_park_5_counts",
                          "cs_attr_10_avgdist", "cs_park_10_avgdist"])

m2_features = m2_features_input.copy()
m2_features.remove('amenities')


m2_features_input.append('price')
df = listings[m2_features_input].copy()
# Encode amenities --- using "encode_amenities()" function
df = encode_amenities(df) 
    
import warnings
warnings.filterwarnings("ignore")  # silence library warnings once, before the loops

for target_encoding in [False, True]:
    
    md2_RMSE = []
    
    for kfold in range(5):

        X_train, X_test, y_train, y_test = ml_dataprep(df=df, features=m2_features_input, target='price',
                                                       kfold=kfold, target_encoding=target_encoding)
    
        ## RandomForest
        rf_model = RandomForestRegressor(random_state=kfold)
        rf_parameters = {
                        'max_depth': [80, 100],
                        'max_features': [2, 3],
#                         'min_samples_leaf': [3, 4, 5],
#                         'min_samples_split': [8, 10, 12],
#                         'n_estimators': [100, 200, 300]
                        }
        rf_grid = GridSearchCV(estimator = rf_model, param_grid = rf_parameters, 
                                  cv = 3, n_jobs = -1, verbose = 2)
        rf_grid.fit(X_train[m2_features], y_train)
        test_preds = rf_grid.predict(X_test[m2_features])

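        # RMSE = sqrt(MSE); this avoids the `squared=False` argument, which newer scikit-learn releases deprecate
        RMSE = np.sqrt(mean_squared_error(y_test, test_preds))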
        print(f"ML2, kfold: {kfold}. RMSE: {RMSE}")
        md2_RMSE.append(RMSE)

    print(f"Model 2 (RandomForest with target_encoding={target_encoding}). Average RMSE: {np.mean(md2_RMSE)}\n")

    

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 0. RMSE: 52.21765947324597
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 1. RMSE: 61.71314719661439
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 2. RMSE: 56.392582674441485
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 3. RMSE: 58.78185250986093
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 4. RMSE: 47.894710480440764
Model 2 (RandomForest with target_encoding=False). Average RMSE: 55.39999046692071

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 0. RMSE: 52.33677863861145
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 1. RMSE: 61.11901138984703
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 2. RMSE: 56.451680169673054
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 3. RMSE: 58.926668218998046
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    1.2s finished


ML2, kfold: 4. RMSE: 47.51225096308627
Model 2 (RandomForest with target_encoding=True). Average RMSE: 55.26927787604317



    
## Summary
- **Both Linear Regression and Random Forest favor including the four convenience_score-related features (`"cs_attr_5_counts", "cs_park_5_counts", "cs_attr_10_avgdist", "cs_park_10_avgdist"`) in the price prediction model for the Seattle datasets: adding them reduces the average RMSE relative to the model without them.**
    - In particular, these four features add more value when the categorical variables are encoded with simple one-hot encoding.
- **It is not surprising that Random Forest outperforms the simple linear regression model.**
- **If we choose Random Forest as our final model, target encoding of the categorical variables outperforms one-hot encoding.**
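
The Random Forest comparisons in this summary can be checked directly against the RMSE values printed above. The sketch below copies the per-fold RMSEs from the Model 1 and Model 2 outputs (rounded to three decimals) and computes the gain from the four convenience_score-related features under each encoding. The dictionary layout and the helper name `fold_gains` are illustrative, and the sketch assumes that `target_encoding=False` corresponds to one-hot encoding.

```python
from statistics import mean

# Per-fold Random Forest RMSEs copied from the cell outputs above (rounded to 3 decimals).
# Assumption: target_encoding=False means one-hot encoding.
rmse = {
    ("model1", "one_hot"): [54.423, 63.100, 57.432, 59.692, 49.532],
    ("model1", "target"):  [52.662, 61.860, 56.923, 59.440, 49.060],
    ("model2", "one_hot"): [52.218, 61.713, 56.393, 58.782, 47.895],
    ("model2", "target"):  [52.337, 61.119, 56.452, 58.927, 47.512],
}


def fold_gains(encoding):
    """Per-fold RMSE reduction from adding the convenience_score features."""
    base = rmse[("model1", encoding)]
    extended = rmse[("model2", encoding)]
    return [b - e for b, e in zip(base, extended)]


for encoding in ("one_hot", "target"):
    gains = fold_gains(encoding)
    improved = sum(g > 0 for g in gains)
    print(f"{encoding}: mean gain = {mean(gains):.3f}, folds improved = {improved}/{len(gains)}")
```

With these numbers, the added features lower the RMSE in every fold under both encodings. The average gain is about 1.44 under one-hot encoding and about 0.72 under target encoding, consistent with the first bullet and its sub-point.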