# Notebook 3: Modeling 

_For USD-599 Capstone Project by Hunter Blum, Kyle Esteban Dalope, and Nicholas Lee (Summer 2023)_

***

**Content Overview:**
1. Pipeline Creation
2. Data Splitting - Split by property_type_binary, then apply a train-test split to each subset
3. Modeling - Two sets of models, one per property_type_binary value, evaluated separately
4. Results

In [58]:
# Library Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import make_pipeline, Pipeline 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn import set_config

import warnings
warnings.filterwarnings("ignore")

In [4]:
# Read in data from the previous notebook
clean_data = pd.read_csv("../Data/model_ready.csv.gz", compression = "gzip")

In [7]:
clean_data.dtypes

host_listings_count                             float64
property_type                                    object
room_type                                        object
bathrooms                                       float64
bedrooms                                        float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_maximum_nights                          float64
has_availability                                 object
availability_30                                   int64
availability_365                                  int64
instant_bookable                                 object
calculated_host_listings_count                    int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               

In [8]:
# Change some features to the proper data types
# i.e. zipcode to categorical
clean_data["zipcode"] = clean_data["zipcode"].astype("category")
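One thing worth checking after this cast: `select_dtypes("object")` does not match the `category` dtype, so a column converted this way can silently fall out of any later dtype-based column selection. The sketch below shows the behavior on a toy frame; the values are made up for illustration and are not taken from `model_ready.csv.gz`.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are hypothetical
toy = pd.DataFrame({
    "zipcode": [92101, 92103, 92101],
    "room_type": pd.Series(
        ["Entire home/apt", "Private room", "Private room"], dtype="object"
    ),
    "bedrooms": [2.0, 1.0, 1.0],
})
toy["zipcode"] = toy["zipcode"].astype("category")

# "object" alone misses the category column
object_only = toy.select_dtypes("object").columns.tolist()

# Listing both dtypes picks it up
object_and_category = toy.select_dtypes(["object", "category"]).columns.tolist()

# The category column is not treated as numeric either
numeric = toy.select_dtypes(["int64", "float64"]).columns.tolist()

print(object_only, object_and_category, numeric)
```

If zipcode is meant to be one-hot encoded, the selection that feeds the categorical pipeline would need to include the `category` dtype as well.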

**Do we need to drop _property_type_ and _room_type_?**
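One way to inform that decision is to count how many distinct values each column takes within each property_type_binary subset: a column with a single value inside a subset carries no information for that subset's model. The sketch below uses a toy frame with hypothetical values; how property_type_binary was actually derived is an assumption here.

```python
import pandas as pd

# Toy data; the mapping from property_type / room_type to the binary
# label is assumed for illustration only
toy = pd.DataFrame({
    "property_type": [
        "Entire home", "Entire condo",
        "Private room in home", "Private room in condo",
    ],
    "room_type": [
        "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
    ],
    "property_type_binary": ["house", "house", "room", "room"],
})

# Distinct values per column within each subset
unique_counts = (
    toy.groupby("property_type_binary")[["property_type", "room_type"]].nunique()
)
print(unique_counts)
```

Columns whose count is 1 in a subset are candidates for dropping from that subset; columns that still vary may be worth keeping.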

In [15]:
# Split the data by property type

house_df = clean_data.loc[clean_data["property_type_binary"] == "house"]
room_df = clean_data.loc[clean_data["property_type_binary"] == "room"]

## Pipeline Setup
- Establish separate pipelines for the numerical and categorical columns
- Partition the dataset into training and test sets (75:25 split)
- Fit the transformer on the training set and apply it to the training set
- Apply the fitted transformer to the test set
- Return the preprocessed, model-ready training and test sets

**Maybe we go back and keep _review_scores_average_ and, in the pipeline, multiply the two features (average score and number of monthly reviews) together to create a weighted score, and then drop it?**

I'm thinking that if this app is for users, this information will be readily accessible to them and they can enter it themselves. If we take this approach, we'll need to keep the original features so that they exist as columns in the model training data.
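A minimal sketch of the weighted-score idea, assuming both _review_scores_average_ and _reviews_per_month_ are present in the data. The function name and the new column name `weighted_review_score` are hypothetical, and whether to drop any columns afterwards is left open.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_weighted_review_score(df):
    """Hypothetical helper: add average score x monthly reviews as a new column."""
    out = df.copy()
    out["weighted_review_score"] = (
        out["review_scores_average"] * out["reviews_per_month"]
    )
    return out


# Wrapping the function lets it sit as a step inside an sklearn Pipeline
weighted_score_step = FunctionTransformer(add_weighted_review_score)

# Toy input with made-up values
toy = pd.DataFrame({
    "review_scores_average": [4.5, 5.0],
    "reviews_per_month": [2.0, 0.5],
})
result = weighted_score_step.fit_transform(toy)
print(result)
```

Because the step only adds a column, the original features stay available for anything downstream that expects them.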

In [22]:
# Separate numerical and categorical features
num_cols = clean_data.select_dtypes(["int64", "float64"]).columns.tolist()
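# Include "category" so columns cast with astype("category") (e.g. zipcode) are not dropped
categorical_cols = clean_data.select_dtypes(["object", "category"]).columns.tolist()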

**Should we add an imputer to the numerical data in the event that a user decides to leave a field empty, downstream?**
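If an imputer is added, one option is to place a `SimpleImputer` ahead of the scaler in the numerical pipeline. The sketch below assumes median imputation, which is a choice made for illustration rather than something decided in this notebook; the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical variant of the numerical pipeline with an imputation step
num_pipeline_with_imputer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardscaler", StandardScaler()),
])

# Fit on complete toy training data
train = pd.DataFrame({
    "bedrooms": [1.0, 2.0, 3.0],
    "bathrooms": [1.0, 1.0, 2.0],
})
num_pipeline_with_imputer.fit(train)

# A downstream row where the user left "bedrooms" empty
user_input = pd.DataFrame({"bedrooms": [np.nan], "bathrooms": [2.0]})
filled = np.asarray(num_pipeline_with_imputer.transform(user_input))
print(filled)
```

The missing value is replaced with the training median before scaling, so the transformed row contains no NaN values.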

In [52]:
# Set up separate pipelines for different datatypes

# Set transformer output as a pandas dataframe
set_config(transform_output="pandas")

# Numerical Pipeline
num_pipeline = Pipeline([
    ("standardscaler", StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    # handle_unknown = "ignore" encodes categories unseen during fit as all zeros instead of raising an error
    ("encoder", OneHotEncoder(
        sparse_output = False, drop = "if_binary", handle_unknown = "ignore"
        )
    )
])

# Global Data Pipeline
data_transformer = ColumnTransformer(
    transformers = [
        ("numerical", num_pipeline, num_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ]
)

In [53]:
# Output preprocessed, model-ready data using the data_transformer
def preprocess_data(data_set, pipeline = data_transformer):
    """Split data_set 75:25, fit the pipeline on the training partition,
    and return the transformed training and test sets."""

    # Data partitioning: 75:25 train-test split
    training_data, test_data = train_test_split(
        data_set, test_size = 0.25, random_state = 2023
        )

    # Fit the pipeline on the training partition only, then transform it
    train_processed = pipeline.fit_transform(training_data)

    # Transform the test partition with the pipeline fitted on the training data
    test_processed = pipeline.transform(test_data)

    return train_processed, test_processed

In [59]:
# Preprocess the house-type and room-type data sets
house_train, house_test = preprocess_data(house_df)
room_train, room_test = preprocess_data(room_df)
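Both calls above pass the same `data_transformer` object, so the second `fit_transform` replaces the fit from the first. If a fitted transformer is needed later for each subset (for example, to transform user input for the house models), one option is to clone the transformer per call and return it. The sketch below is a hypothetical variant: the function name is invented, and a bare `StandardScaler` on toy data stands in for the full `ColumnTransformer`.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_with_own_transformer(data_set, transformer):
    """Hypothetical variant: fit a fresh clone and return it with the data."""
    fitted = clone(transformer)  # unfitted copy with the same parameters
    train_part, test_part = train_test_split(
        data_set, test_size=0.25, random_state=2023
    )
    train_out = fitted.fit_transform(train_part)
    test_out = fitted.transform(test_part)
    return train_out, test_out, fitted


# Stand-in transformer and toy subsets with made-up values
base = StandardScaler()
house_toy = pd.DataFrame({"bedrooms": [1.0, 2.0, 3.0, 4.0]})
room_toy = pd.DataFrame({"bedrooms": [10.0, 20.0, 30.0, 40.0]})

_, _, house_tf = preprocess_with_own_transformer(house_toy, base)
_, _, room_tf = preprocess_with_own_transformer(room_toy, base)
print(house_tf.mean_, room_tf.mean_)
```

Each subset ends up with its own fitted transformer, and the original object stays unfitted.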