# Data Cleaning

This data cleaning file serves two purposes: preparing the data for EDA, and performing an initial clean before further cleaning is done post-EDA (based on insights gleaned during EDA).

There is a condition at the beginning to check whether this has already been done. It is kept for record-keeping purposes, as the data should already be there if this is downloaded from GitHub. In practice, it should always be the daily refreshed data that is cleaned here.

In [1]:
import pandas as pd
import os

# Set pandas to display all columns 
pd.set_option('display.max_columns', None)

#### First, we indicate the two file paths we want to use.

In [2]:
# File paths
original_file_path = '../Data/Locations_Data_Added_For_EDA.csv'
daily_refresh_file_path = '../Data/Daily_Refresh_With_Locations_Data.csv'

#### Next, we write a function that checks if the file exists. 

This should use the daily refresh file, though the conditional is kept to show how the EDA file was originally prepped.

In [3]:
def check_if_file_exists():

    # Check if the file exists
    if os.path.exists(original_file_path):
        
        # If the file already exists, load data from the daily file
        df = pd.read_csv(daily_refresh_file_path)

        # Return the file path name
        return daily_refresh_file_path, df
        
    else:
        
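        # If the file does not exist, load data from the existing file path.
        # NOTE: existing_file_path is not defined in the cells above, so this
        # branch raises a NameError unless it is defined elsewhere first.
        df = pd.read_csv(existing_file_path)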

        # Return the file path name
        return existing_file_path, df
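
As a rough sketch of the same "check, then load" idea, the helper below returns the path it used together with the loaded DataFrame. It is a simplified variant rather than the function above: it loads the same file it checks for, and the names `load_preferred`, `primary_path`, and `fallback_path` are hypothetical. The throwaway CSV exists only to make the example runnable.

```python
import os
import tempfile

import pandas as pd


def load_preferred(primary_path, fallback_path):
    """Return (path_used, DataFrame), preferring primary_path when it exists."""
    path_used = primary_path if os.path.exists(primary_path) else fallback_path
    return path_used, pd.read_csv(path_used)


# Demonstration with temporary files (hypothetical names, not the project's data).
with tempfile.TemporaryDirectory() as tmp:
    fallback = os.path.join(tmp, "fallback.csv")
    pd.DataFrame({"Title": ["A", "B"]}).to_csv(fallback, index=False)
    missing = os.path.join(tmp, "missing.csv")  # never created

    # A single call returns both values, so the CSV is read only once.
    path_used, demo_df = load_preferred(missing, fallback)
    used_fallback = path_used == fallback
    demo_rows = len(demo_df)

print(used_fallback, demo_rows)
```

Unpacking the tuple in one call avoids reading the file a second time just to get the second return value.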

In [4]:
# Execute the conditional function once and unpack the file path and DataFrame
file_used, df = check_if_file_exists()
print(f'The file used is: {file_used}')

The file used is: ../Data/Daily_Refresh_With_Locations_Data.csv


In [5]:
df.head()

Unnamed: 0,Title,Price,Bedrooms,Square Feet,Full Address,monthly,apartment,cats are OK - purrr,dogs are OK - wooof,laundry on site,air conditioning,off-street parking,EV charging,w/d in unit,carport,no smoking,attached garage,detached garage,laundry in bldg,Fee Needed To Apply,wheelchair accessible,no parking,furnished,street parking,no laundry on site,house,w/d hookups,date_added,latitude,longitude,nearest_budget_grocery_store_distance,nearest_budget_grocery_store,nearest_midTier_grocery_store_distance,nearest_midTier_grocery_store,nearest_premium_grocery_store_distance,nearest_premium_grocery_store
0,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,
1,1 Bedroom in the Heart of Venice* Plank Floors...,"$2,895",1br,750,"237 Fourth Avenue, Venice, CA 90291",1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3/20/24,33.99809,-118.475972,0.69602,"Smart & Final Extra! - 604 Lincoln Blvd, Venice",0.843992,"Ralphs - 910 Lincoln Blvd, Venice",0.408345,"Whole Foods Market - 225 Lincoln Blvd, Venice"
2,"SPECIALS, Rooftop Sky Deck, Brand New 1+1 Bren...","$3,438",1br,711,"11916 West Pico Boulevard, Los Angeles, CA 90064",1,1,1,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,3/20/24,34.029804,-118.448669,0.790421,"Smart & Final Extra! - 11221 W Pico Blvd, Los ...",0.267695,"Trader Joe's - 11755 W Olympic Blvd, Los Angeles",0.794601,"Whole Foods Market - 11666 National Blvd, Los ..."
3,1 Bedroom 1 Bath Westwood Apartment in Westwoo...,"$3,175",1br,,"972 Hilgard Ave, Los Angeles, CA 90024",1,1,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,3/20/24,34.061724,-118.441019,1.744201,"Smart & Final Extra! - 11221 W Pico Blvd, Los ...",0.162153,"Trader Joe's - 1000 Glendon Ave, Los Angeles",0.340418,"Whole Foods Market - 1050 Gayley Ave, Los Angeles"
4,"Dishwasher, Efficient Appliances, 1 Bed","$2,895",1br,575,"1720 Pacific Ave, Venice, CA 90291",1,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3/20/24,33.987065,-118.470748,1.009629,"Smart & Final Extra! - 604 Lincoln Blvd, Venice",0.997125,"Ralphs - 910 Lincoln Blvd, Venice",0.528287,"Erewhon - 585 Venice Blvd., Venice"


#### Let's start by checking for listings without titles.

In [6]:
# Check for listings without titles
no_title_df = df[df['Title'] == 'Title Not Found']

In [7]:
# Check first five rows to confirm there are results.
no_title_df.head()

Unnamed: 0,Title,Price,Bedrooms,Square Feet,Full Address,monthly,apartment,cats are OK - purrr,dogs are OK - wooof,laundry on site,air conditioning,off-street parking,EV charging,w/d in unit,carport,no smoking,attached garage,detached garage,laundry in bldg,Fee Needed To Apply,wheelchair accessible,no parking,furnished,street parking,no laundry on site,house,w/d hookups,date_added,latitude,longitude,nearest_budget_grocery_store_distance,nearest_budget_grocery_store,nearest_midTier_grocery_store_distance,nearest_midTier_grocery_store,nearest_premium_grocery_store_distance,nearest_premium_grocery_store
0,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,
427,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,
555,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,
561,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,
564,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,,,,,,,,


#### We see that occasionally there are listings without titles, perhaps uploaded in error. 

If that is the case, we will remove them here.

In [8]:
# Drop the rows from the DataFrame
df = df.drop(df[df['Title'] == 'Title Not Found'].index)
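
A minimal sketch of the same removal on a toy DataFrame, written with a boolean mask so the number of dropped rows can be reported. The `sample` data is made up for illustration; only the `'Title Not Found'` placeholder comes from the data above.

```python
import pandas as pd

# Toy data (hypothetical) that mimics the placeholder rows seen above.
sample = pd.DataFrame({
    "Title": ["Title Not Found", "Sunny 1br", "Title Not Found", "Studio"],
    "Price": ["Price Not Found", "$2,895", "Price Not Found", "$1,950"],
})

# True for rows whose title is the scraper's placeholder value.
missing_title = sample["Title"] == "Title Not Found"
n_removed = int(missing_title.sum())

# Keep only the rows where the mask is False.
cleaned = sample[~missing_title]
print(f"Removed {n_removed} of {len(sample)} rows")
```

Filtering with `~missing_title` keeps the original index labels, just as `df.drop(...)` does.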

In [9]:
df.head()

Unnamed: 0,Title,Price,Bedrooms,Square Feet,Full Address,monthly,apartment,cats are OK - purrr,dogs are OK - wooof,laundry on site,air conditioning,off-street parking,EV charging,w/d in unit,carport,no smoking,attached garage,detached garage,laundry in bldg,Fee Needed To Apply,wheelchair accessible,no parking,furnished,street parking,no laundry on site,house,w/d hookups,date_added,latitude,longitude,nearest_budget_grocery_store_distance,nearest_budget_grocery_store,nearest_midTier_grocery_store_distance,nearest_midTier_grocery_store,nearest_premium_grocery_store_distance,nearest_premium_grocery_store
1,1 Bedroom in the Heart of Venice* Plank Floors...,"$2,895",1br,750.0,"237 Fourth Avenue, Venice, CA 90291",1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3/20/24,33.99809,-118.475972,0.69602,"Smart & Final Extra! - 604 Lincoln Blvd, Venice",0.843992,"Ralphs - 910 Lincoln Blvd, Venice",0.408345,"Whole Foods Market - 225 Lincoln Blvd, Venice"
2,"SPECIALS, Rooftop Sky Deck, Brand New 1+1 Bren...","$3,438",1br,711.0,"11916 West Pico Boulevard, Los Angeles, CA 90064",1,1,1,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,3/20/24,34.029804,-118.448669,0.790421,"Smart & Final Extra! - 11221 W Pico Blvd, Los ...",0.267695,"Trader Joe's - 11755 W Olympic Blvd, Los Angeles",0.794601,"Whole Foods Market - 11666 National Blvd, Los ..."
3,1 Bedroom 1 Bath Westwood Apartment in Westwoo...,"$3,175",1br,,"972 Hilgard Ave, Los Angeles, CA 90024",1,1,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,3/20/24,34.061724,-118.441019,1.744201,"Smart & Final Extra! - 11221 W Pico Blvd, Los ...",0.162153,"Trader Joe's - 1000 Glendon Ave, Los Angeles",0.340418,"Whole Foods Market - 1050 Gayley Ave, Los Angeles"
4,"Dishwasher, Efficient Appliances, 1 Bed","$2,895",1br,575.0,"1720 Pacific Ave, Venice, CA 90291",1,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3/20/24,33.987065,-118.470748,1.009629,"Smart & Final Extra! - 604 Lincoln Blvd, Venice",0.997125,"Ralphs - 910 Lincoln Blvd, Venice",0.528287,"Erewhon - 585 Venice Blvd., Venice"
5,Newly Renovated 1 Bedroom 1 BA in Westwood ^ C...,"$3,095",1br,644.0,"520 Kelton Avenue, Los Angeles, CA 90024",1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3/20/24,34.06768,-118.453252,2.029118,"Smart & Final - 12210 Santa Monica Blvd W, Los...",0.597086,"Ralphs Fresh Fare - 10861 Weyburn Ave, Los Ang...",0.576295,"Whole Foods Market - 1050 Gayley Ave, Los Angeles"


#### Next, we convert the price column to a numeric field. 

Currently, it is a string in the form '$2,385'. For EDA purposes (as well as modeling, since this is going to be our dependent variable), we want to convert it to an integer.

In [10]:
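# Strip the dollar sign and thousands separators as literal characters, then convert to numeric
df['Price'] = pd.to_numeric(df['Price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False), errors='coerce')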
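
The sketch below runs the same conversion on a few made-up price strings. It passes `regex=False` so that `'$'` is always treated as a literal character, and it shows what `errors='coerce'` does to a value that cannot be parsed: it becomes `NaN`, which makes the result a float column. The final cast to pandas' nullable `Int64` type is one optional way to keep integers alongside missing values; it is an illustration, not a step the notebook performs.

```python
import pandas as pd

# Made-up examples, including one unparseable placeholder.
prices = pd.Series(["$2,895", "$3,438", "Price Not Found"])

numeric = pd.to_numeric(
    prices.str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",  # unparseable strings become NaN instead of raising
)
print(numeric.dtype)  # float64, because NaN forces a float column

# Optional: nullable integer dtype keeps whole numbers and marks gaps as <NA>.
as_int = numeric.astype("Int64")
print(as_int.dtype)
```

When every value parses cleanly, as in the notebook once the placeholder rows are dropped, `pd.to_numeric` returns an integer column directly.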

#### Next, we rename columns to remove white spaces and standardize capitalization.

In [11]:
# Rename columns
df.rename(columns={'Title': 'title',
                   'Price': 'price',
                   'Square Feet': 'square_feet',
                   'Full Address': 'full_address',
                   'cats are OK - purrr': 'cats_allowed',
                   'dogs are OK - wooof': 'dogs_allowed',
                   'laundry on site': 'laundry_on_site',
                   'air conditioning': 'air_conditioning',
                   'off-street parking': 'off_street_parking',
                   'EV charging': 'EV_charging',
                   'w/d in unit': 'washer_dryer_in_unit',
                   'no smoking': 'no_smoking',
                   'attached garage': 'attached_garage',
                   'detached garage': 'detached_garage',
                   'laundry in bldg': 'laundry_in_bldg',
                   'Fee Needed To Apply': 'fee_needed_to_apply',
                   'wheelchair accessible': 'wheelchair_accessible',
                   'no parking': 'no_parking',
                   'street parking': 'street_parking',
                   'no laundry on site': 'no_laundry_on_site',
                   'w/d hookups': 'washer_dryer_hookups'}, inplace=True)
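
For columns whose names only need whitespace and capitalization fixed, a small helper can do the renaming programmatically. This is a sketch under that assumption; `to_snake_case` is a hypothetical helper, not part of the notebook. Names such as `'cats are OK - purrr'` or `'w/d in unit'` still benefit from the explicit mapping above, since a mechanical conversion would yield `cats_are_ok_purrr` and `w_d_in_unit`.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Lowercase a label and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


# Empty frame with a few of the raw column names, for illustration only.
demo = pd.DataFrame(
    columns=["Square Feet", "Full Address", "off-street parking", "Bedrooms"]
)

# rename accepts a callable that is applied to every column label.
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```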

#### Let's check the data types.

In [12]:
# Check data types 
df.dtypes

title                                      object
price                                       int64
Bedrooms                                   object
square_feet                                object
full_address                               object
monthly                                     int64
apartment                                   int64
cats_allowed                                int64
dogs_allowed                                int64
laundry_on_site                             int64
air_conditioning                            int64
off_street_parking                          int64
EV_charging                                 int64
washer_dryer_in_unit                        int64
carport                                     int64
no_smoking                                  int64
attached_garage                             int64
detached_garage                             int64
laundry_in_bldg                             int64
fee_needed_to_apply                         int64


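#### We need the boolean columns stored as integers, not floats, to use them in the model.

Let's write a function that does this. Any boolean column that loads as a float is cast to an integer here; for columns that are already `int64`, as in the `df.dtypes` output above, the cast changes nothing. We'll define the columns to convert outside of the function. That way, if more boolean float columns appear in the future, we can simply add them to the list and the function will take care of them.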

In [13]:
# List of boolean column names to convert
boolean_float_columns = [
    'monthly', 'apartment', 'cats_allowed', 'dogs_allowed',
    'laundry_on_site', 'air_conditioning', 'off_street_parking', 'EV_charging',
    'washer_dryer_in_unit', 'carport', 'no_smoking', 'attached_garage',
    'detached_garage', 'laundry_in_bldg', 'fee_needed_to_apply',
    'wheelchair_accessible', 'no_parking', 'furnished', 'street_parking',
    'no_laundry_on_site', 'house', 'washer_dryer_hookups'
]

In [14]:
# Define the function to convert float boolean types to integer
def convert_float_booleans_to_int(df, columns_to_convert):
    """Cast each listed column to int, modifying df in place."""

    # Convert each column in the list to 'int' datatype
    for column in columns_to_convert:
        df[column] = df[column].astype(int)

In [15]:
# Execute the convert_float_booleans_to_int function
convert_float_booleans_to_int(df, boolean_float_columns)
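
One caveat worth illustrating: `astype(int)` raises an error if a column contains missing values. The variant below is a sketch, not the notebook's function. `to_int_flags` is a hypothetical name, and it differs in two ways: it checks for missing values up front with a clearer message, and it returns a converted copy instead of modifying the input.

```python
import pandas as pd


def to_int_flags(frame, columns):
    """Return a copy with the given 0/1 columns cast to int.

    Raises ValueError if any listed column contains missing values.
    """
    out = frame.copy()
    for column in columns:
        if out[column].isna().any():
            raise ValueError(f"Column {column!r} contains missing values")
        out[column] = out[column].astype(int)
    return out


# Toy float flags (hypothetical values) using two of the column names above.
flags = pd.DataFrame({"furnished": [1.0, 0.0], "house": [0.0, 1.0]})
converted = to_int_flags(flags, ["furnished", "house"])
print(converted.dtypes)
```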

#### Next, let's check for duplicates.

In [16]:
# Check for duplicates and return 'True' if any exist
duplicates = df.duplicated().any()
print("Duplicates exist:", duplicates) 

Duplicates exist: True


#### We see that there are indeed duplicates (as of this writing). 

We will remove these, as they would add unwanted noise to our EDA and models. First, we check the current length of the DataFrame. Then we remove duplicates and check that they were dropped.

In [17]:
len(df)

2525

In [18]:
# Remove duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

In [19]:
len(df)

2092
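
In the run above, the length fell from 2525 to 2092, so 433 duplicate rows were removed. The sketch below shows, on made-up data, how `duplicated().sum()` gives that count directly and agrees with the before-and-after difference.

```python
import pandas as pd

# Toy listings (hypothetical) with two exact repeats.
listings = pd.DataFrame({
    "title": ["A", "B", "A", "C", "B"],
    "price": [2895, 3438, 2895, 3175, 3438],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(listings.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = listings.drop_duplicates()

print(f"{n_dupes} duplicates; {len(listings)} rows -> {len(deduped)} rows")
```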

#### Finally, we export this data to be used for EDA or to update the daily data.

In [20]:
if file_used == daily_refresh_file_path:

    # Export the cleaned data to our 'Data' folder.
    df.to_csv('../Data/Primary_Data_Cleaned.csv', index=False)
    
else: 
    
    # Export the data for EDA to our 'Data' folder.
    df.to_csv('../Data/Cleaned_Data_For_EDA.csv', index=False)