## Data Prep for Modeling

We want to make the data "model ready." Many listings have "N/A" values for square feet. We want to use this feature, and dropping the rows that lack it does not drastically reduce the number of records available for modeling. Thus, we drop those rows here.
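
One way to back up the claim that the drop is cheap is to measure it before dropping anything. The sketch below is only an illustration: `null_drop_impact` is a hypothetical helper that is not part of this notebook, and the toy frame stands in for the real listings.

```python
import numpy as np
import pandas as pd


def null_drop_impact(frame: pd.DataFrame, column: str) -> dict:
    """Summarize how many rows a dropna on `column` would remove.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    total = len(frame)
    missing = int(frame[column].isna().sum())
    return {
        "rows_before": total,
        "rows_dropped": missing,
        "rows_after": total - missing,
        # Guard against an empty frame to avoid dividing by zero
        "share_dropped": missing / total if total else 0.0,
    }


# Toy data (assumption): four listings, one with a missing square_feet value
toy = pd.DataFrame({
    "price": [1500, 2100, 1800, 2500],
    "square_feet": [650.0, np.nan, 900.0, 1100.0],
})

impact = null_drop_impact(toy, "square_feet")
print(impact)
```

Running the same helper on the real DataFrame would give the share of listings lost, which is the number that decides whether dropping is preferable to imputing.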

We do not want to keep editing the modeling data (updated as of 04/13/2024) because the commentary is based on a single data pull. Thus, we add a conditional that checks whether the modeling data file already exists. If it does, we process the daily refreshed data instead and leave the modeling data untouched. The file does already exist, so the conditional is here mainly to show the work; the modeling data is not meant to be edited during the modeling phase. Again, this keeps the commentary and results consistent.

In [1]:
import numpy as np
import pandas as pd
import os

# Set the max columns to infinite so that we may view all of them
pd.set_option('display.max_columns', None)

#### Import the cleaned data. 

This is the same data that is explored in the EDA file.

In [2]:
# File path to check for modeling data.
modeling_data_path = '../Data/modeling_data.csv'

In [3]:
def check_if_file_exists():
    """Load the daily cleaned data if the modeling file exists, else the EDA data."""

    # Check if the modeling data file exists
    if os.path.exists(modeling_data_path):

        # The modeling data is frozen, so work from the daily refreshed file
        df = pd.read_csv('../Data/Primary_Data_Cleaned.csv')

        # Return the primary data file df
        return df

    else:

        # No modeling data yet, so load the cleaned EDA data to create it
        df = pd.read_csv('../Data/Cleaned_Data_For_EDA.csv')

        # Return the cleaned EDA df
        return df

In [4]:
# Create the df using the conditional function
df = check_if_file_exists()

## Handling Null Values

#### We show and remove null values for modeling purposes.

We remove the rows with null values for square feet and the rows that do not have an address listed.

The rows without an address contain the string "None listed" in full_address, so they do not register as nulls in that column. Their null values show up in the location-based columns instead (latitude, longitude, nearest_budget_grocery_store_distance, etc.). Once we remove the rows with missing addresses, the null counts in those columns drop to zero.
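
The reasoning above can be checked directly rather than assumed. The following is a minimal sketch on a toy frame (made-up addresses and coordinates, not the real listings). It confirms that every row containing a null is also a "None listed" row, and that filtering those rows out leaves no nulls behind.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): one listing has no address and therefore no coordinates
toy = pd.DataFrame({
    "full_address": ["123 Main St", "None listed", "45 Oak Ave"],
    "latitude": [37.77, np.nan, 37.80],
    "longitude": [-122.42, np.nan, -122.27],
})

# Boolean masks: rows lacking an address, and rows with a null in any column
missing_address = toy["full_address"] == "None listed"
rows_with_nulls = toy.isna().any(axis=1)

# True when no row has a null while still having a real address
nulls_explained_by_address = bool((rows_with_nulls & ~missing_address).sum() == 0)

# Same filter the notebook applies, written with the mask
filtered = toy[~missing_address].copy()
remaining_nulls = int(filtered.isna().sum().sum())

print(nulls_explained_by_address, remaining_nulls)
```

On the real data this check would only be meaningful after the null square_feet rows are dropped, since those nulls are unrelated to the address.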

In [5]:
# Check Datatypes
df.dtypes

title                                      object
price                                       int64
Bedrooms                                   object
square_feet                               float64
full_address                               object
monthly                                     int64
apartment                                   int64
cats_allowed                                int64
dogs_allowed                                int64
laundry_on_site                             int64
air_conditioning                            int64
off_street_parking                          int64
EV_charging                                 int64
washer_dryer_in_unit                        int64
carport                                     int64
no_smoking                                  int64
attached_garage                             int64
detached_garage                             int64
laundry_in_bldg                             int64
fee_needed_to_apply                         int64


In [6]:
# Show which features contain null values
df.isna().sum()

title                                       0
price                                       0
Bedrooms                                    0
square_feet                               229
full_address                                0
monthly                                     0
apartment                                   0
cats_allowed                                0
dogs_allowed                                0
laundry_on_site                             0
air_conditioning                            0
off_street_parking                          0
EV_charging                                 0
washer_dryer_in_unit                        0
carport                                     0
no_smoking                                  0
attached_garage                             0
detached_garage                             0
laundry_in_bldg                             0
fee_needed_to_apply                         0
wheelchair_accessible                       0
no_parking                        

In [7]:
# Drop the rows with null square_feet measures
df.dropna(subset=['square_feet'], inplace=True)

In [8]:
# Create a copy of the DataFrame with the missing address rows filtered out
filtered_listings_df = df[df['full_address'] != 'None listed'].copy()

In [9]:
# Check to see that there are no null values now
filtered_listings_df.isnull().sum()

title                                     0
price                                     0
Bedrooms                                  0
square_feet                               0
full_address                              0
monthly                                   0
apartment                                 0
cats_allowed                              0
dogs_allowed                              0
laundry_on_site                           0
air_conditioning                          0
off_street_parking                        0
EV_charging                               0
washer_dryer_in_unit                      0
carport                                   0
no_smoking                                0
attached_garage                           0
detached_garage                           0
laundry_in_bldg                           0
fee_needed_to_apply                       0
wheelchair_accessible                     0
no_parking                                0
furnished                       

#### Finally, we export the data.

If the modeling data already exists (which should be the case), this exports the daily refreshed data so that it is ready to be loaded into the database. Otherwise, the modeling data is created.

In [10]:
if os.path.exists(modeling_data_path):
    # Modeling data is frozen, so export the daily data for storage in the database
    filtered_listings_df.to_csv('../Data/load_ready_data.csv', index=False)
    print('Updated the load ready data.')
else:
    # No modeling data yet, so create it at the path defined earlier
    filtered_listings_df.to_csv(modeling_data_path, index=False)
    print('Created the modeling data.')

Updated the load ready data.
