# Notebook 3: Modeling 

_For USD-599 Capstone Project by Hunter Blum, Kyle Esteban Dalope, and Nicholas Lee (Summer 2023)_

***

**Content Overview:**
1. Pipeline Creation
2. Data Splitting - Split by property_type_binary, then a train-test split for each subset
3. Modeling - Two sets of models based on property_type_binary that will be evaluated separately
4. Results

In [1]:
# Library Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import make_pipeline, Pipeline 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn import linear_model

#import warnings
#warnings.filterwarnings("ignore")

In [2]:
# Read in data from the previous notebook
clean_data = pd.read_csv("../Data/model_ready.csv.gz", compression = "gzip")

In [3]:
clean_data.dtypes

host_listings_count                             float64
property_type                                    object
room_type                                        object
bathrooms                                       float64
bedrooms                                        float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_maximum_nights                          float64
has_availability                                 object
availability_30                                   int64
availability_365                                  int64
instant_bookable                                 object
calculated_host_listings_count                    int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               

In [4]:
# Change some features to the proper data types
# i.e. zipcode to categorical
clean_data["zipcode"] = clean_data["zipcode"].astype("category")

**Do we need to drop _property_type_ and _room_type_?**

In [5]:
# Split the data by property type

house_df = clean_data.loc[clean_data["property_type_binary"] == "house"]
room_df = clean_data.loc[clean_data["property_type_binary"] == "room"]

## Pipeline Setup
- Establish separate pipelines for the numerical and categorical columns
- Partition the dataset into training and test sets (75:25 split)
- Fit the transformer on the training set and apply it
- Apply the trained transformer to the test set
- Return the preprocessed, model-ready training and test sets

**Maybe we go back and keep _review_scores_average_ and, in the pipeline, multiply the two features (avg. score and # of monthly reviews) together to create a weighted score, and then drop it?**

I'm thinking that if this app is for users, this information will be readily accessible to them and they can enter it themselves. If we take this approach, we'll need to keep the original features so the vector exists for them in the model training data.
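
A minimal sketch of how the weighted score could be built inside the pipeline, assuming `review_scores_average` is kept in the data alongside `reviews_per_month`. The helper name `add_weighted_review_score`, the new column name `weighted_review_score`, and the toy values are hypothetical, and dropping both source columns after combining them is only one reading of the idea above.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_weighted_review_score(df):
    """Return a copy of df with avg. score x monthly reviews as a new column."""
    out = df.copy()
    out["weighted_review_score"] = (
        out["review_scores_average"] * out["reviews_per_month"]
    )
    # Drop the two source columns once they have been combined
    return out.drop(columns=["review_scores_average", "reviews_per_month"])

# Wrapped so it could serve as the first step of a Pipeline
weighted_score_step = FunctionTransformer(add_weighted_review_score)

# Toy data standing in for the real listings
toy = pd.DataFrame({
    "review_scores_average": [4.5, 5.0, 3.0],
    "reviews_per_month": [2.0, 0.5, 1.0],
    "bedrooms": [1.0, 2.0, 1.0],
})
toy_out = weighted_score_step.fit_transform(toy)
print(toy_out)
```

Because the raw inputs go in and the combined feature comes out, a user could still enter the two original values, and the training data would keep both original columns upstream of the pipeline.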

In [14]:
# Separate numerical and categorical features
num_cols = clean_data.select_dtypes(["int64", "float64"]).columns.tolist()
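# Note: zipcode was cast to "category" above, so it matches neither selection here
# and the ColumnTransformer (default remainder="drop") will leave it out
categorical_cols = clean_data.select_dtypes("object").columns.tolist()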

# Separate out the target
num_cols.remove('price')

**Should we add an imputer to the numerical data in the event that a user decides to leave a field empty, downstream?**
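
If we do add one, a sketch of the numerical pipeline with an imputer might look like the following. The median strategy, the name `num_pipeline_imputed`, and the toy frames are assumptions for illustration, not a decision we have made.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Numerical pipeline with a median imputer ahead of the scaler
num_pipeline_imputed = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardscaler", StandardScaler()),
])

# Toy training data (no gaps) and a "user" row with one empty field
train = pd.DataFrame({"bedrooms": [1.0, 2.0, 3.0], "bathrooms": [1.0, 1.0, 2.0]})
user_row = pd.DataFrame({"bedrooms": [np.nan], "bathrooms": [2.0]})

# The medians are learned from the training data only
num_pipeline_imputed.fit(train)

# The missing bedrooms value is filled with the training median before scaling
user_ready = num_pipeline_imputed.transform(user_row)
print(user_ready)
```

The imputer learns its fill values during `fit`, so it would work downstream even though the training data itself has no missing values.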

In [7]:
# Set up separate pipelines for different datatypes

# Set transformer output as a pandas dataframe
set_config(transform_output="pandas")

# Numerical Pipeline
num_pipeline = Pipeline([
    ("standardscaler", StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    # Handle_unknown = "ignore" to deal with one off values in categorical features
    ("encoder", OneHotEncoder(
        sparse_output = False, drop = "if_binary", handle_unknown = "ignore"
        )
    )
])

# Global Data Pipeline
data_transformer = ColumnTransformer(
    transformers = [
        ("numerical", num_pipeline, num_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ]
)

In [16]:
# A function to output preprocessed, model-ready data using the data_transformer
def preprocess_data(data_set, pipeline = data_transformer,
                    num_cols = num_cols, categorical_cols = categorical_cols):
    
    # Data partitioning 75:25 Train-Test Split
    training_data, testing_data = train_test_split(
        data_set, test_size = 0.25, random_state = 2023
        )
    
    # Separate target from df
    training_data_X = training_data.drop(columns = ['price'])
    train_data_y = training_data['price']

    testing_data_X = testing_data.drop(columns = ['price'])
    test_data_y = testing_data['price']

    # Fit and transform the training data partition
    train_data_X = pipeline.fit_transform(training_data_X)        

    # Transform the test data set based on the training data
    test_data_X = pipeline.transform(testing_data_X)

    # Remove whitespace in col names
    train_data_X.columns = train_data_X.columns.str.replace(' ', '_')
    test_data_X.columns = test_data_X.columns.str.replace(' ', '_')

    # Remove slashes in col names
    train_data_X.columns = train_data_X.columns.str.replace('/', '_')
    test_data_X.columns = test_data_X.columns.str.replace('/', '_')

    return train_data_X, train_data_y, test_data_X, test_data_y

In [17]:
# Preprocess the house-type and room-type data sets
house_train_X, house_train_y, house_test_X, house_test_y = preprocess_data(house_df)
room_train_X, room_train_y, room_test_X, room_test_y = preprocess_data(room_df)



## Baseline Model - Backwards Stepwise Regression

Since we do not have too many training features, backward stepwise regression is feasible: it starts from the full feature set and removes features one at a time, so every feature gets tested.

### Entire House Model

### Room Model
First we'll do backward selection with sklearn, then we'll fit the final model with statsmodels for improved diagnostic tools.

In [21]:
# Train the sklearn backward selected model
room_back_reg = SequentialFeatureSelector(linear_model.LinearRegression(),
                                          n_features_to_select = 'auto',
                                          direction='backward',
                                          n_jobs = -1).fit(room_train_X, room_train_y)

room_train_X_back = room_back_reg.transform(room_train_X)

In [26]:
import statsmodels.formula.api as smf
from statsmodels.nonparametric.smoothers_lowess import lowess

# Combine the dataframe back into one for the model
room_train_back = pd.concat([room_train_X_back, room_train_y], axis = 1)

# Build the right-hand side of the formula from the selected columns
cols = list(room_train_X_back.columns)
cols_str = " + ".join(cols)

# Fit the model
room_back_reg_fin = smf.ols(formula= 'price ~' + cols_str, 
                            data = room_train_back).fit()

# Model summary
print(room_back_reg_fin.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.197
Method:                 Least Squares   F-statistic:                     20.16
Date:                Wed, 26 Jul 2023   Prob (F-statistic):          4.59e-101
Time:                        12:49:47   Log-Likelihood:                -18212.
No. Observations:                2493   AIC:                         3.649e+04
Df Residuals:                    2460   BIC:                         3.668e+04
Df Model:                          32                                         
Covariance Type:            nonrobust                                         
                                                                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------

### Residuals vs. Fitted Plot

In [25]:
residuals = room_back_reg_fin.resid
fitted = room_back_reg_fin.fittedvalues



16749     -2.435547
6568     238.208984
6135      -6.466797
16098    -67.335938
7457    -344.105469
            ...    
12956    -69.892578
16506     -9.441406
15788   -235.166016
11420    -41.162109
6822       2.517578
Length: 2493, dtype: float64
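
A residuals vs. fitted plot with a LOWESS trend line could be sketched as follows. The arrays here are synthetic stand-ins for `fitted` and `residuals`, and the `frac` value and styling are assumptions rather than settings we have chosen.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic stand-ins for room_back_reg_fin.fittedvalues and .resid
rng = np.random.default_rng(2023)
fitted_toy = rng.uniform(50, 500, size=300)
residuals_toy = rng.normal(0, 40, size=300)

# LOWESS of residuals on fitted values; returns (x, smoothed y) pairs sorted by x
smoothed = lowess(residuals_toy, fitted_toy, frac=0.6)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(fitted_toy, residuals_toy, alpha=0.4, s=12)
ax.plot(smoothed[:, 0], smoothed[:, 1], color="red", linewidth=2)
ax.axhline(0, color="black", linestyle="--", linewidth=1)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residuals vs. Fitted")
```

A trend line that stays close to zero across the fitted range would suggest the linear form is adequate; a curve or a funnel shape would point to nonlinearity or non-constant variance.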