# Notebook 3: Modeling 

_For USD-599 Capstone Project by Hunter Blum, Kyle Esteban Dalope, and Nicholas Lee (Summer 2023)_

***

**Content Overview:**
1. Pipeline Creation
2. Data Splitting - Split by property_type_binary, then a train-test split for each subset
3. Modeling - Two sets of models based on property_type_binary that will be evaluated separately
4. Results

In [71]:
# Library Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import make_pipeline, Pipeline 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn import linear_model

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read in data from the previous notebook
clean_data = pd.read_csv("../Data/model_ready.csv.gz", compression = "gzip")

In [3]:
clean_data.dtypes

host_listings_count                             float64
property_type                                    object
room_type                                        object
bathrooms                                       float64
bedrooms                                        float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_maximum_nights                          float64
has_availability                                 object
availability_30                                   int64
availability_365                                  int64
instant_bookable                                 object
calculated_host_listings_count                    int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               

In [4]:
# Change some features to the proper data types
# i.e. zipcode to categorical
clean_data["zipcode"] = clean_data["zipcode"].astype("category")

**Do we need to drop _property_type_ and _room_type_?**

In [5]:
# Split the data by property type

house_df = clean_data.loc[clean_data["property_type_binary"] == "house"]
room_df = clean_data.loc[clean_data["property_type_binary"] == "room"]

## Pipeline Setup
- Establish separate pipelines for the numerical and categorical columns
- Partition the dataset into training and test sets (75:25 split)
- Fit the transformer on the training set and apply it
- Apply the fitted transformer to the test set
- Return the preprocessed, model-ready training and test sets

**Maybe we go back and keep _review_scores_average_ and, in the pipeline, multiply the two features (avg. score and # of monthly reviews) together to create a weighted score, and then drop it?**

I'm thinking that if this app is for users, this information will be readily accessible to them and they can enter it themselves. If we take this approach, we'll need to keep the original features so the vector exists for them in the model training data.
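
The weighted-score idea above could look roughly like the sketch below. It assumes the two columns are named `review_scores_average` and `reviews_per_month`; the toy data, the helper `add_weighted_review_score`, and the new column name `weighted_review_score` are hypothetical and only for illustration. The original columns are kept, as discussed.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy stand-in for the listings data (values are made up for illustration)
toy_listings = pd.DataFrame({
    "review_scores_average": [4.8, 4.2, 5.0],
    "reviews_per_month": [2.0, 0.5, 1.0],
})

def add_weighted_review_score(df):
    """Return a copy of df with avg. score * monthly reviews as a new column."""
    out = df.copy()
    out["weighted_review_score"] = (
        out["review_scores_average"] * out["reviews_per_month"]
    )
    return out

# Wrapping the helper lets it run as an early pipeline step
weighted_score_step = FunctionTransformer(add_weighted_review_score)
weighted = weighted_score_step.fit_transform(toy_listings)
print(weighted)
```

Because the step is a plain function wrapped in `FunctionTransformer`, the same computation would be applied to user-entered values at prediction time.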

In [6]:
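# Separate numerical and categorical features
# Note: price is numeric, so it is included in num_cols and scaled with the features.
# Note: select_dtypes("object") does not match the "category" dtype, so zipcode
# (cast to category above) is not picked up here; pass ["object", "category"]
# to include it in the categorical pipeline.
num_cols = clean_data.select_dtypes(["int64", "float64"]).columns.tolist()
categorical_cols = clean_data.select_dtypes("object").columns.tolist()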

**Should we add an imputer to the numerical data in the event that a user decides to leave a field empty, downstream?**
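
One way the imputer question could be answered is sketched below, assuming median imputation is acceptable for the numerical fields. The pipeline name `num_pipeline_with_imputer` and the toy frames are hypothetical; `SimpleImputer` is scikit-learn's standard imputer, and placing it before the scaler means a blank field is filled with the training median before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Numerical pipeline with an imputation step ahead of the scaler
num_pipeline_with_imputer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardscaler", StandardScaler()),
])

# Toy training data and a toy "user input" row with one field left empty
toy_train = pd.DataFrame({"bedrooms": [1.0, 2.0, 3.0],
                          "bathrooms": [1.0, 1.0, 2.0]})
toy_user_input = pd.DataFrame({"bedrooms": [np.nan], "bathrooms": [2.0]})

# The median is learned from the training data only
num_pipeline_with_imputer.fit(toy_train)
filled = np.asarray(num_pipeline_with_imputer.transform(toy_user_input))
print(filled)
```

The missing `bedrooms` value is replaced by the training median before scaling, so the downstream model never sees a NaN.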

In [7]:
# Set up separate pipelines for different datatypes

# Set transformer output as a pandas dataframe
set_config(transform_output="pandas")

# Numerical Pipeline
num_pipeline = Pipeline([
    ("standardscaler", StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    # Handle_unknown = "ignore" to deal with one off values in categorical features
    ("encoder", OneHotEncoder(
        sparse_output = False, drop = "if_binary", handle_unknown = "ignore"
        )
    )
])

# Global Data Pipeline
data_transformer = ColumnTransformer(
    transformers = [
        ("numerical", num_pipeline, num_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ]
)

In [119]:
# A function to output preprocessed, model-ready data using the data_transformer
def preprocess_data(data_set, pipeline = data_transformer,
                    num_cols = num_cols, categorical_cols = categorical_cols):
    
    # Data partitioning 75:25 Train-Test Split
    training_data, test_data = train_test_split(
        data_set, test_size = 0.25, random_state = 2023
        )

    # Fit and transform the training data partition
    trained_data = pipeline.fit_transform(training_data)        

    # Transform the test data set based on the training data
    test_data = pipeline.transform(test_data)

    # Remove whitespace in col names
    trained_data.columns = trained_data.columns.str.replace(' ', '_')
    test_data.columns = test_data.columns.str.replace(' ', '_')

    # Remove slashes in col names
    trained_data.columns = trained_data.columns.str.replace('/', '_')
    test_data.columns = test_data.columns.str.replace('/', '_')

    return trained_data, test_data

In [120]:
# Preprocess the house-type and room-type data sets
house_train, house_test = preprocess_data(house_df)
room_train, room_test = preprocess_data(room_df)

In [121]:
house_train

Unnamed: 0,numerical__host_listings_count,numerical__bathrooms,numerical__bedrooms,numerical__price,numerical__minimum_nights,numerical__maximum_nights,numerical__minimum_minimum_nights,numerical__maximum_maximum_nights,numerical__availability_30,numerical__availability_365,...,categorical__property_type_Entire_villa,categorical__property_type_Farm_stay,categorical__property_type_Houseboat,categorical__property_type_Island,categorical__property_type_Tent,categorical__property_type_Tiny_home,categorical__room_type_Entire_home_apt,categorical__has_availability_t,categorical__instant_bookable_t,categorical__property_type_binary_house
742,-0.187327,1.510348,0.730554,0.380433,-0.379007,-1.090379,-0.365692,0.816143,-1.191077,-1.482026,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
10066,-0.229448,-0.665010,-0.833981,-0.266470,-0.254668,-1.037455,-0.365692,-1.619710,-0.621686,-1.339599,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
5031,1.687064,0.422669,1.512822,0.315743,-0.254668,-0.294311,-0.236694,-0.871412,1.560979,1.398166,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
13683,-0.159247,-0.665010,-0.051713,-0.244077,-0.379007,-0.294311,-0.301193,0.816143,-1.096179,-1.442463,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
18031,-0.239979,-0.665010,-0.833981,-0.025125,-0.379007,1.381620,-0.365692,0.816143,1.655877,1.406079,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6909,-0.239979,0.422669,-0.051713,0.297082,-0.316837,1.381620,-0.301193,0.816143,-0.336991,-1.410813,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
2960,-0.057454,-0.665010,-0.833981,-0.268958,-0.316837,-0.294311,-0.365692,-0.871412,0.801791,1.208263,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
17836,-0.159247,-0.665010,-0.051713,-0.205512,-0.379007,1.381620,-0.365692,0.816143,0.042603,-1.110133,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
6451,-0.134676,-0.665010,-0.833981,-0.369725,-0.379007,-0.294311,-0.365692,0.816143,1.371182,-0.809454,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0


In [122]:
# Separate out the X and y
house_train_y = house_train[['numerical__price']]
house_train_X = house_train.drop(columns=['numerical__price'])
house_test_y = house_test[['numerical__price']]
house_test_X = house_test.drop(columns=['numerical__price'])

room_train_y = room_train['numerical__price']
room_train_X = room_train.drop(columns=['numerical__price'])
room_test_y = room_test[['numerical__price']]
room_test_X = room_test.drop(columns=['numerical__price'])
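
Because `price` passes through the numerical pipeline, the target `numerical__price` is in standardized units. The sketch below shows one way to map scaled prices back to the original units using the fitted scaler's `mean_` and `scale_`. It uses a toy transformer (`toy_transformer`, `toy_num_cols` are hypothetical names) that mirrors the `"numerical"` / `"standardscaler"` step names used above; it is an illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the listings (values are made up)
toy = pd.DataFrame({"bedrooms": [1.0, 2.0, 3.0, 4.0],
                    "price": [100.0, 150.0, 250.0, 300.0]})
toy_num_cols = ["bedrooms", "price"]

toy_transformer = ColumnTransformer(
    transformers=[
        ("numerical",
         Pipeline([("standardscaler", StandardScaler())]),
         toy_num_cols),
    ]
)
scaled = np.asarray(toy_transformer.fit_transform(toy))

# Look up the fitted scaler and the position of price among its columns
scaler = (toy_transformer
          .named_transformers_["numerical"]
          .named_steps["standardscaler"])
price_idx = toy_num_cols.index("price")

# Undo the standardization: x = z * std + mean
recovered_price = scaled[:, price_idx] * scaler.scale_[price_idx] + scaler.mean_[price_idx]
print(recovered_price)
```

The same two numbers (mean and standard deviation of the training prices) would be needed to report model predictions or errors in the original price units.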

## Baseline Model - Backwards Stepwise Regression

Because the number of training features is modest, backward stepwise regression is practical as a baseline: it starts from the full feature set and removes features one at a time, so every feature is considered.

### Entire House Model

### Room Model
First we'll do backward selection with sklearn, then we'll fit the final model with statsmodels for improved diagnostic tools.

In [123]:
# Train the sklearn backward selected model
room_back_reg = SequentialFeatureSelector(linear_model.LinearRegression(),
                                          n_features_to_select = 'auto',
                                          direction='backward',
                                          n_jobs = 8).fit(room_train_X, room_train_y)

room_train_X_back = room_back_reg.transform(room_train_X)


In [124]:
import statsmodels.formula.api as smf

# Combine the dataframe back into one for the model
room_train_back = pd.concat([room_train_X_back, room_train_y], axis = 1)

# Build the right-hand side of the formula from the selected columns
cols = list(room_train_X_back.columns)
cols_str = " + ".join(cols)

# Fit the model
room_back_reg_fin = smf.ols(formula= 'numerical__price ~' + cols_str, 
                            data = room_train_back).fit()

# Model summary
print(room_back_reg_fin.summary())

                            OLS Regression Results                            
Dep. Variable:       numerical__price   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     19.66
Date:                Tue, 25 Jul 2023   Prob (F-statistic):          4.56e-101
Time:                        21:30:50   Log-Likelihood:                -3245.5
No. Observations:                2493   AIC:                             6559.
Df Residuals:                    2459   BIC:                             6757.
Df Model:                          33                                         
Covariance Type:            nonrobust                                         
                                                                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------