## Milestone 2 US Eastern Region Data Cleaning

This notebook cleans the Airbnb listings data for the following Eastern US cities:
* Broward County (FL)
* Jersey City (NJ)
* New York City (NY)
* Cambridge (MA)
* Washington DC

This is one of several Milestone 2 data cleaning notebooks for the team's Airbnb project.

After team discussion about which features/columns matter for later feature engineering, the cleaning focuses only on the listings_detailed table for each city.

Each city's table goes through the same cleaning procedure. The cleaned dataframes are then concatenated into east_coast_cities_cleaned.csv, which holds all of the Eastern US cities' data in its current cleaned format. After additional EDA, the team will decide whether further data processing is needed.
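
As a rough illustration of the final concatenation step, the sketch below stacks two tiny stand-in tables with `pd.concat`. The names `broward_cleaned` and `cambridge_cleaned` and their contents are hypothetical placeholders, not the notebook's actual cleaned dataframes, and the CSV write is commented out so the sketch has no side effects.

```python
import pandas as pd

# Hypothetical stand-ins for two cleaned per-city tables (made-up values)
broward_cleaned = pd.DataFrame(
    {"id": [1, 2], "price": [222.0, 500.0], "city": "Broward County"}
)
cambridge_cleaned = pd.DataFrame(
    {"id": [3], "price": [120.0], "city": "Cambridge"}
)

# Stack the city tables row-wise; ignore_index=True gives the combined
# table a fresh 0..n-1 index instead of repeating each city's own index
east_coast_cities_cleaned = pd.concat(
    [broward_cleaned, cambridge_cleaned], ignore_index=True
)

# Assumed output step, left commented out in this sketch:
# east_coast_cities_cleaned.to_csv("east_coast_cities_cleaned.csv", index=False)
print(east_coast_cities_cleaned)
```

Because every table carries the same columns plus a `city` label, the combined table can still be grouped or filtered per city afterwards.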

In [61]:
import pandas as pd
import spacy
import numpy as np
import os
import matplotlib.pyplot as plt
from scipy.stats import zscore
from scipy.stats import mannwhitneyu
from statistics import mean
from scipy.stats import norm
import scipy.stats as stats
import seaborn as sns
from matplotlib.colors import ListedColormap
from collections import Counter
import geopandas as gpd
import datetime
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter
import math
import ast

#### Directory Paths Set-up for each City

In [62]:
# Define the directory path for each city
broward_directory_path = r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5460 - Big Data Scaling\Final Project\data\usa\Broward County'
jersey_directory_path = r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5460 - Big Data Scaling\Final Project\data\usa\Jersey City'
nyc_directory_path = r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5460 - Big Data Scaling\Final Project\data\usa\New York City'
cambridge_directory_path = r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5460 - Big Data Scaling\Final Project\data\usa\Cambridge'
dc_directory_path = r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5460 - Big Data Scaling\Final Project\data\usa\Washington DC'


# File names
listings = 'listings.csv'
listings_detailed = 'listings_detailed.csv'
reviews = 'reviews.csv'
reviews_detailed = 'reviews_detailed.csv'
calendar = 'calendar.csv'
neighbourhoods = 'neighbourhoods.csv'
neighbourhoods_json = 'neighbourhoods.geojson'

# Full paths to read in later for each city
broward_calendar_path = broward_directory_path + '\\' + calendar
broward_listings_detailed_path = broward_directory_path + '\\' + listings_detailed
broward_listings_path = broward_directory_path + '\\' + listings
broward_reviews_path = broward_directory_path + '\\' + reviews
broward_reviews_detailed_path = broward_directory_path + '\\' + reviews_detailed
broward_neighbourhoods_path = broward_directory_path + '\\' + neighbourhoods
broward_neighbourhoods_json_path = broward_directory_path + '\\' + neighbourhoods_json

jersey_calendar_path = jersey_directory_path + '\\' + calendar
jersey_listings_detailed_path = jersey_directory_path + '\\' + listings_detailed
jersey_listings_path = jersey_directory_path + '\\' + listings
jersey_reviews_path = jersey_directory_path + '\\' + reviews
jersey_reviews_detailed_path = jersey_directory_path + '\\' + reviews_detailed
jersey_neighbourhoods_path = jersey_directory_path + '\\' + neighbourhoods
jersey_neighbourhoods_json_path = jersey_directory_path + '\\' + neighbourhoods_json

nyc_calendar_path = nyc_directory_path + '\\' + calendar
nyc_listings_detailed_path = nyc_directory_path + '\\' + listings_detailed
nyc_listings_path = nyc_directory_path + '\\' + listings
nyc_reviews_path = nyc_directory_path + '\\' + reviews
nyc_reviews_detailed_path = nyc_directory_path + '\\' + reviews_detailed
nyc_neighbourhoods_path = nyc_directory_path + '\\' + neighbourhoods
nyc_neighbourhoods_json_path = nyc_directory_path + '\\' + neighbourhoods_json

cambridge_calendar_path = cambridge_directory_path + '\\' + calendar
cambridge_listings_detailed_path = cambridge_directory_path + '\\' + listings_detailed
cambridge_listings_path = cambridge_directory_path + '\\' + listings
cambridge_reviews_path = cambridge_directory_path + '\\' + reviews
cambridge_reviews_detailed_path = cambridge_directory_path + '\\' + reviews_detailed
cambridge_neighbourhoods_path = cambridge_directory_path + '\\' + neighbourhoods
cambridge_neighbourhoods_json_path = cambridge_directory_path + '\\' + neighbourhoods_json

dc_calendar_path = dc_directory_path + '\\' + calendar
dc_listings_detailed_path = dc_directory_path + '\\' + listings_detailed
dc_listings_path = dc_directory_path + '\\' + listings
dc_reviews_path = dc_directory_path + '\\' + reviews
dc_reviews_detailed_path = dc_directory_path + '\\' + reviews_detailed
dc_neighbourhoods_path = dc_directory_path + '\\' + neighbourhoods
dc_neighbourhoods_json_path = dc_directory_path + '\\' + neighbourhoods_json
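
The path variables above all follow one pattern (city folder plus file name), so they could also be generated rather than written out by hand. The following is only a sketch of that alternative using `pathlib`: `data_root`, `city_folders`, `file_names`, and `paths` are hypothetical names introduced for illustration, and `data_root` would need to point at the local data folder.

```python
from pathlib import Path

# Hypothetical root folder; replace with the local data directory
data_root = Path("data") / "usa"

# Short key -> folder name for each city
city_folders = {
    "broward": "Broward County",
    "jersey": "Jersey City",
    "nyc": "New York City",
    "cambridge": "Cambridge",
    "dc": "Washington DC",
}

# Table key -> file name (same files for every city)
file_names = {
    "listings": "listings.csv",
    "listings_detailed": "listings_detailed.csv",
    "reviews": "reviews.csv",
    "reviews_detailed": "reviews_detailed.csv",
    "calendar": "calendar.csv",
    "neighbourhoods": "neighbourhoods.csv",
    "neighbourhoods_json": "neighbourhoods.geojson",
}

# paths[city][table] is the full path for that city's file
paths = {
    city: {table: data_root / folder / name for table, name in file_names.items()}
    for city, folder in city_folders.items()
}

print(paths["nyc"]["listings_detailed"])
```

`Path` objects use the correct separator for the operating system, so the same code works outside Windows, and `pd.read_csv` accepts them directly.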

In [63]:
# Data cleaning is only focused on listings_detailed dataframe
broward_listings_detailed_df = pd.read_csv(broward_listings_detailed_path,  na_filter=False)
jersey_listings_detailed_df = pd.read_csv(jersey_listings_detailed_path,  na_filter=False)
nyc_listings_detailed_df = pd.read_csv(nyc_listings_detailed_path,  na_filter=False)
cambridge_listings_detailed_df = pd.read_csv(cambridge_listings_detailed_path,  na_filter=False)
dc_listings_detailed_df = pd.read_csv(dc_listings_detailed_path,  na_filter=False)



In [64]:
# Add in a city column to each dataframe so that we can use this as part of the primary key/use it to conduct groupby for
# future EDA or additional analysis as final table will contain all of our listings data
broward_listings_detailed_df['city'] = 'Broward County'
cambridge_listings_detailed_df['city'] = 'Cambridge'
jersey_listings_detailed_df['city'] = 'Jersey City'
nyc_listings_detailed_df['city'] = 'New York City'
dc_listings_detailed_df['city'] = 'Washington DC'

In [65]:
# Identify what columns to keep and what columns to remove based on team discussion
# Note some additional processing on retaining columns will be conducted at the end of each city depending on final data size
def retain_columns(df, columns):
    """
    Retains only the specified columns in a DataFrame.

    Parameters:
    - df: A pandas DataFrame object.
    - columns: A list of column names to retain in the DataFrame.

    Returns:
    A pandas DataFrame object with only the specified columns retained.
    """
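    # Retain only the columns that exist in the DataFrame; .copy() returns an independent
    # DataFrame so later column assignments do not trigger SettingWithCopyWarning
    retained_columns = [col for col in columns if col in df.columns]
    transformed_df = df[retained_columns].copy()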
    
    return transformed_df

# List of columns to retain
columns_to_keep = [
    'id', 'name', 'description', 'neighborhood_overview', 'host_about', 'host_id',
    'host_name', 'host_since', 'host_location', 'host_response_time', 'host_response_rate',
    'host_acceptance_rate', 'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
    'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed',
    'latitude', 'longitude', 'room_type', 'accommodates', 'bathrooms_text', 'bedrooms', 'beds',
    'amenities', 'price', 'number_of_reviews', 'review_scores_value', 'calculated_host_listings_count',
    'city'
]

broward_listings_transformed = retain_columns(broward_listings_detailed_df, columns_to_keep)
cambridge_listings_transformed = retain_columns(cambridge_listings_detailed_df, columns_to_keep)
jersey_listings_transformed = retain_columns(jersey_listings_detailed_df, columns_to_keep)
nyc_listings_transformed = retain_columns(nyc_listings_detailed_df, columns_to_keep)
dc_listings_transformed = retain_columns(dc_listings_detailed_df, columns_to_keep)

In [66]:
# Check what an example looks like
broward_listings_transformed.head()

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,,475630606,Sean,2022-08-18,,within an hour,...,2,1 bath,1,1,"[""Air conditioning"", ""Free parking on premises...",$222.00,0,,1,Broward County
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",,"Hello, \r\nMy name is Ryan! I really love to ...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,...,2,2 shared baths,2,4,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",$500.00,0,,5,Broward County
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,,,382318476,Maggie,2020-12-30,,within an hour,...,8,3 baths,4,6,"[""Air conditioning"", ""Free parking on premises...",$500.00,2,5.0,1,Broward County
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,Pompano world-famous coastline is a very popul...,We’re a happily married couple who has travell...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,...,6,2 baths,2,4,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",$186.00,129,4.68,3,Broward County
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,,Hope you have a nice time in our apartments :)...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,...,7,2 baths,2,5,"[""Air conditioning"", ""Hangers"", ""Free parking ...",$297.00,27,4.44,6,Broward County


In [67]:
# Apply the function to convert the data types properly - this is the same function used when conducting EDA
def convert_listings_detailed_data(df):
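    # Convert price to numeric after removing '$' and ','; regex=False treats '$' as a
    # literal character rather than the regex end-of-string anchor
    df['price'] = pd.to_numeric(df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False), errors='coerce')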
    
    # Convert date column to datetime and extract date part
    df['host_since'] = pd.to_datetime(df['host_since']).dt.date

    columns_to_convert = [
        'host_total_listings_count',
        'bedrooms',
        'beds',
        'review_scores_value'
    ]

    for column in columns_to_convert:
        df[column] = pd.to_numeric(df[column], errors='coerce')

    
    return df

broward_listings_transformed_wip = convert_listings_detailed_data(broward_listings_transformed)
jersey_listings_transformed_wip = convert_listings_detailed_data(jersey_listings_transformed)
nyc_listings_transformed_wip = convert_listings_detailed_data(nyc_listings_transformed)
cambridge_listings_detailed_df_wip = convert_listings_detailed_data(cambridge_listings_transformed)
dc_listings_transformed_wip = convert_listings_detailed_data(dc_listings_transformed)


In [68]:
# Check what an example looks like
broward_listings_transformed_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,,475630606,Sean,2022-08-18,,within an hour,...,2,1 bath,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",,"Hello, \r\nMy name is Ryan! I really love to ...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,...,2,2 shared baths,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,,,382318476,Maggie,2020-12-30,,within an hour,...,8,3 baths,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.00,1,Broward County
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,Pompano world-famous coastline is a very popul...,We’re a happily married couple who has travell...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,...,6,2 baths,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,,Hope you have a nice time in our apartments :)...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,...,7,2 baths,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16894,673349748522183791,Lovely 2 bedroom Apartment for a Family Getaway!,Our lovely 2 bed 2 bath apartment is a great h...,,We are a professional Vacation Rental property...,106641930,SouthFloridaBNB,2016-12-08,"Hollywood, FL",within an hour,...,4,2 baths,2.0,2.0,"[""Microwave"", ""Stainless steel oven"", ""Mosquit...",135.0,8,3.88,32,Broward County
16895,4729595,Sunny Private Room w/ terrace,Cozy two-story house located in the heart of F...,Beautiful neighborhood with plenty outdoor act...,"Nos encanta la música, viajar y conocer gente ...",24423385,Fernan&Alina,2014-12-02,"Fort Lauderdale, FL",within an hour,...,2,1 private bath,1.0,1.0,"[""TV"", ""Bathtub"", ""Garden view"", ""Hair dryer"",...",92.0,11,4.73,1,Broward County
16896,578864807997181407,Modern 1 Bed Apartment w/Amenities- Hallandale,This stylish place to stay is perfect for grou...,,,432365961,Eugenia,2021-11-17,"Miami, FL",within an hour,...,5,1.5 baths,1.0,3.0,"[""Patio or balcony"", ""Free parking on premises...",251.0,0,,8,Broward County
16897,684108385561377222,Lovely 1 bedroom apartment in a quiet neighbor...,Spacious apartment with one bedroom and a full...,Hallandale Beach is a city located in Broward ...,Hey! We love Miami and want to be a part of yo...,459788234,Hey Miami,2022-05-18,"Hollywood, FL",within an hour,...,4,1 bath,1.0,2.0,"[""Air conditioning"", ""Free parking on premises...",173.0,0,,4,Broward County


In [69]:
# Sanity check: report the missing values and the data type of each column to guide the cleaning
# This function was used in a previous EDA notebook
def analyze_dataframes_listings_detailed(dataframes, threshold=20):
    for df_name, df in dataframes.items():
        print(f"Analyzing '{df_name}' DataFrame:")
        print(f"Total Rows: {df.shape[0]}, Total Columns: {df.shape[1]}")
        
        # Missing values
        missing_values_count = df.isnull().sum()
        total_missing = missing_values_count.sum()
        print(f"Total Missing Values: {total_missing}")
        if total_missing > 0:
            print("Missing Values by Column:")
            for column, missing_count in missing_values_count.items():
                if missing_count > 0:
                    print(f" - {column}: {missing_count} missing values")
        
        # Handling columns by data type
        for column in df.columns:
            if pd.api.types.is_numeric_dtype(df[column]):
                # Calculate statistics
                min_value = df[column].min()
                median_value = df[column].median()
                mean_value = df[column].mean()
                std_deviation = df[column].std()
                max_value = df[column].max()
                # Print statistics
                print(f"{column} (Numerical): Min = {min_value}, Median = {median_value}, Mean = {mean_value}, Std Dev = {std_deviation}, Max = {max_value}")
            elif pd.api.types.is_object_dtype(df[column]) and all(isinstance(x, (datetime.date, type(pd.NaT))) for x in df[column].dropna()):
                # Handle date columns
                non_na_values = df[column].dropna()
                if non_na_values.empty:
                    min_date = max_date = "No Dates Available"
                else:
                    min_date = non_na_values.min()
                    max_date = non_na_values.max()
                print(f"{column} (Date): Range = {min_date} to {max_date}")
            else:
                # Handle categorical columns
                unique_values = df[column].unique()
                if len(unique_values) <= threshold:
                    print(f"{column} (Categorical): Categories = {unique_values}")
                else:
                    print(f"{column} (Categorical): {len(unique_values)} unique categories")
        print("------\n")

dataframes = {
    'Broward County': broward_listings_transformed_wip,
    'Jersey City': jersey_listings_transformed_wip,
    'New York City': nyc_listings_transformed_wip,
    'Cambridge': cambridge_listings_detailed_df_wip,
    'Washington DC': dc_listings_transformed_wip
}

analyze_dataframes_listings_detailed(dataframes)

Analyzing 'Broward County' DataFrame:
Total Rows: 16899, Total Columns: 33
Total Missing Values: 4407
Missing Values by Column:
 - host_since: 1 missing values
 - host_total_listings_count: 1 missing values
 - bedrooms: 1332 missing values
 - beds: 197 missing values
 - review_scores_value: 2876 missing values
id (Numerical): Min = 57818, Median = 53655816.0, Mean = 3.334714145876098e+17, Std Dev = 3.612008167622543e+17, Max = 855979244654886532
name (Categorical): 16068 unique categories
description (Categorical): 14606 unique categories
neighborhood_overview (Categorical): 6998 unique categories
host_about (Categorical): 2937 unique categories
host_id (Numerical): Min = 5146, Median = 151641107.0, Mean = 200355516.94946447, Std Dev = 165627062.2669479, Max = 506759471
host_name (Categorical): 3524 unique categories
host_since (Date): Range = 2008-12-13 to 2023-03-23
host_location (Categorical): 615 unique categories
host_response_time (Categorical): Categories = ['within an hour' 'wi

amenities (Categorical): 36300 unique categories
price (Numerical): Min = 0.0, Median = 125.0, Mean = 200.30716731499382, Std Dev = 895.0829114730037, Max = 99000.0
number_of_reviews (Numerical): Min = 0, Median = 5.0, Mean = 25.85600149076425, Std Dev = 56.616343792003846, Max = 1842
review_scores_value (Numerical): Min = 0.0, Median = 4.77, Mean = 4.641795556936504, Std Dev = 0.49880967979345053, Max = 5.0
calculated_host_listings_count (Numerical): Min = 1, Median = 1.0, Mean = 24.05480887936456, Std Dev = 80.86795810974542, Max = 526
city (Categorical): Categories = ['New York City']
------

Analyzing 'Cambridge' DataFrame:
Total Rows: 1026, Total Columns: 33
Total Missing Values: 330
Missing Values by Column:
 - bedrooms: 46 missing values
 - beds: 16 missing values
 - review_scores_value: 268 missing values
id (Numerical): Min = 8521, Median = 46186400.0, Mean = 2.165761778080382e+17, Std Dev = 3.2613721048534195e+17, Max = 856071734230683199
name (Categorical): 937 unique catego

#### Examine how many rows are going to get dropped due to price or accommodates being missing or 0

We decided to drop rows where price or accommodates is missing or 0, because such a listing is incomplete and is not a reasonable Airbnb offering.
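
To make the rule concrete, here is a minimal sketch on made-up data. `drop_unusable_listings` and the `toy` table are hypothetical names used only for illustration; the notebook itself applies the same condition inline for each city.

```python
import pandas as pd


def drop_unusable_listings(df):
    """Return a copy of df without rows whose price or accommodates is missing or 0."""
    unusable = (
        df["price"].isna() | (df["price"] == 0)
        | df["accommodates"].isna() | (df["accommodates"] == 0)
    )
    # .copy() so the result can be modified without touching the input
    return df.loc[~unusable].copy()


# Made-up rows: only the first one has a usable price and capacity
toy = pd.DataFrame(
    {
        "price": [120.0, 0.0, float("nan"), 85.0],
        "accommodates": [2, 4, 3, 0],
    }
)

kept = drop_unusable_listings(toy)
print(f"Dropped {len(toy) - len(kept)} of {len(toy)} rows")
```

Wrapping the filter in one function is a design choice that keeps the rule identical across cities; it is not required for the steps that follow.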

In [70]:
# Double check the general range of values for price and accommodates
dataframe_names = ['Broward County', 'Cambridge', 'Jersey City', 'New York City', 'Washington DC']
dataframes = [broward_listings_transformed_wip, cambridge_listings_detailed_df_wip, jersey_listings_transformed_wip, nyc_listings_transformed_wip, dc_listings_transformed_wip]

# Get an idea of what the values are for both columns for each dataframe
for name, df in zip(dataframe_names, dataframes):
    print(f"{name} - Price unique value counts:")
    print(df['price'].value_counts())
    print(f"{name} - Accommodates unique value counts:")
    print(df['accommodates'].value_counts())
    print("\n")

Broward County - Price unique value counts:
150.0     212
200.0     154
100.0     153
250.0     150
120.0     131
         ... 
1515.0      1
1610.0      1
1650.0      1
1245.0      1
985.0       1
Name: price, Length: 1228, dtype: int64
Broward County - Accommodates unique value counts:
4     3802
2     3637
6     3079
8     1689
3      986
5      867
10     781
1      544
12     417
7      388
16     313
9      160
14     115
11      59
13      34
15      25
0        3
Name: accommodates, dtype: int64


Cambridge - Price unique value counts:
228.0    21
225.0    20
120.0    20
80.0     15
100.0    15
         ..
94.0      1
126.0     1
528.0     1
101.0     1
636.0     1
Name: price, Length: 316, dtype: int64
Cambridge - Accommodates unique value counts:
2     421
4     200
1     127
3      98
5      59
6      56
8      36
7      12
10      8
9       6
14      2
12      1
Name: accommodates, dtype: int64


Jersey City - Price unique value counts:
150.0      25
40.0       24
100.0    

In [71]:
# Verifying how many rows for each dataframe need to get dropped due to price or accommodates being missing or 0 
dataframe_dict = dict(zip(dataframe_names, dataframes))

for city, df in dataframe_dict.items():
    # Conditions for 'price' being 0 or missing, and 'accommodates' being 0 or missing
    condition = (
        (df['price'].isna() | (df['price'] == 0)) |
        (df['accommodates'].isna() | (df['accommodates'] == 0))
    )
    
    # Counting the number of rows that match the condition
    count = df[condition].shape[0]
    
    print(f"{city}: Rows with price 0 or missing and/or accommodates 0 or missing = {count}")

# Only a handful of rows per city need to be dropped, so dropping them is acceptable.
# The rows are dropped later, when each city is cleaned

Broward County: Rows with price 0 or missing and/or accommodates 0 or missing = 4
Cambridge: Rows with price 0 or missing and/or accommodates 0 or missing = 0
Jersey City: Rows with price 0 or missing and/or accommodates 0 or missing = 0
New York City: Rows with price 0 or missing and/or accommodates 0 or missing = 27
Washington DC: Rows with price 0 or missing and/or accommodates 0 or missing = 2


### Data Cleaning on Broward County

In [72]:
# The dataframe we should be using for clean-up
broward_listings_transformed_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,,475630606,Sean,2022-08-18,,within an hour,...,2,1 bath,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",,"Hello, \r\nMy name is Ryan! I really love to ...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,...,2,2 shared baths,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,,,382318476,Maggie,2020-12-30,,within an hour,...,8,3 baths,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.00,1,Broward County
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,Pompano world-famous coastline is a very popul...,We’re a happily married couple who has travell...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,...,6,2 baths,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,,Hope you have a nice time in our apartments :)...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,...,7,2 baths,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16894,673349748522183791,Lovely 2 bedroom Apartment for a Family Getaway!,Our lovely 2 bed 2 bath apartment is a great h...,,We are a professional Vacation Rental property...,106641930,SouthFloridaBNB,2016-12-08,"Hollywood, FL",within an hour,...,4,2 baths,2.0,2.0,"[""Microwave"", ""Stainless steel oven"", ""Mosquit...",135.0,8,3.88,32,Broward County
16895,4729595,Sunny Private Room w/ terrace,Cozy two-story house located in the heart of F...,Beautiful neighborhood with plenty outdoor act...,"Nos encanta la música, viajar y conocer gente ...",24423385,Fernan&Alina,2014-12-02,"Fort Lauderdale, FL",within an hour,...,2,1 private bath,1.0,1.0,"[""TV"", ""Bathtub"", ""Garden view"", ""Hair dryer"",...",92.0,11,4.73,1,Broward County
16896,578864807997181407,Modern 1 Bed Apartment w/Amenities- Hallandale,This stylish place to stay is perfect for grou...,,,432365961,Eugenia,2021-11-17,"Miami, FL",within an hour,...,5,1.5 baths,1.0,3.0,"[""Patio or balcony"", ""Free parking on premises...",251.0,0,,8,Broward County
16897,684108385561377222,Lovely 1 bedroom apartment in a quiet neighbor...,Spacious apartment with one bedroom and a full...,Hallandale Beach is a city located in Broward ...,Hey! We love Miami and want to be a part of yo...,459788234,Hey Miami,2022-05-18,"Hollywood, FL",within an hour,...,4,1 bath,1.0,2.0,"[""Air conditioning"", ""Free parking on premises...",173.0,0,,4,Broward County


In [73]:
# Make a copy; broward_listings_detailed_cleaned is used going forward
broward_listings_detailed_cleaned = broward_listings_transformed_wip.copy()

##### Dropping rows where price is 0 or missing or accommodates is 0 or missing

In [74]:
# Conduct initial row drop conditions
condition = (
    (broward_listings_detailed_cleaned['price'].isna() | (broward_listings_detailed_cleaned['price'] == 0)) |
    (broward_listings_detailed_cleaned['accommodates'].isna() | (broward_listings_detailed_cleaned['accommodates'] == 0))
)

# Invert the condition to keep only rows where neither price nor accommodates is 0 or missing
# .copy() makes the result an independent DataFrame so new columns can be added safely
broward_listings_detailed_cleaned = broward_listings_detailed_cleaned[~condition].copy()

# Verify with shape - the drop succeeded, as the number of rows is expected to decrease by 4 for Broward
broward_listings_detailed_cleaned.shape

(16895, 33)

##### Clean-up bathroom text
Extract the leading number from the bathroom text.

- If the bathroom text says 'Half-bath', 'Shared half-bath', or 'Private half-bath', we assume 0.5.
- If the bathroom text is missing, use the number of bedrooms.
- If bedrooms is also missing, use the number of beds / 2.
- If beds is also missing, impute 0.
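
The rules above can be sketched for a single listing as a plain function. This is an illustrative helper only: `parse_bathrooms` and `_is_missing` are hypothetical names, and the notebook applies the equivalent logic to whole columns with pandas rather than value by value.

```python
import math
import re


def _is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def parse_bathrooms(bathrooms_text, bedrooms=None, beds=None):
    """Apply the bathroom rules to one listing and return a float."""
    text = (bathrooms_text or "").strip()

    # Any half-bath wording (plain, shared, or private) counts as 0.5
    if "half-bath" in text.lower():
        return 0.5

    # Otherwise take the first number in the text, e.g. "1.5 baths" -> 1.5
    match = re.search(r"\d+\.?\d*", text)
    if match:
        return float(match.group())

    # No usable text: fall back to bedrooms, then beds / 2, then 0
    if not _is_missing(bedrooms):
        return float(bedrooms)
    if not _is_missing(beds):
        return float(beds) / 2
    return 0.0


examples = [
    parse_bathrooms("1.5 baths"),
    parse_bathrooms("Shared half-bath"),
    parse_bathrooms("", bedrooms=3),
    parse_bathrooms("", bedrooms=float("nan"), beds=4),
    parse_bathrooms(""),
]
print(examples)
```

Checking the half-bath wording before the number mirrors the intent of the rules: those texts contain no leading digit, so they would otherwise fall through to the bedroom-based fallback.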

In [75]:
# Cleaning bathroom text - the result is stored in a new bathrooms column. Towards the end of the notebook the column is renamed to num_bath

# Use regex to extract the first number found in the bathrooms_text column
# expand=False returns a Series rather than a one-column DataFrame
broward_listings_detailed_cleaned['bathrooms'] = broward_listings_detailed_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)

# Handle specific wording cases to see if it is 0.5
half_bath_keywords = ['Half-bath', 'Shared half-bath', 'Private half-bath']
for keyword in half_bath_keywords:
    broward_listings_detailed_cleaned.loc[broward_listings_detailed_cleaned['bathrooms_text'].str.contains(keyword, case=False, na=False), 'bathrooms'] = 0.5

# Impute missing bathrooms in priority order: bedrooms, then beds / 2, then 0
# Each fillna only touches values that are still NaN after the previous step,
# so this is equivalent to checking bedrooms and beds row by row
broward_listings_detailed_cleaned['bathrooms'] = (
    broward_listings_detailed_cleaned['bathrooms']
    .fillna(broward_listings_detailed_cleaned['bedrooms'])
    .fillna(broward_listings_detailed_cleaned['beds'] / 2)
    .fillna(0)
)


In [76]:
# Double check that no missing values remain in bathrooms after cleaning
# (confirmed: the count printed below is 0)
missing_bathrooms_count = broward_listings_detailed_cleaned['bathrooms'].isna().sum()
print(f"Missing values in bathrooms: {missing_bathrooms_count}")

Missing values in bathrooms: 0


In [77]:
# Check the shape of the dataframe again (one extra column: bathrooms)
broward_listings_detailed_cleaned.shape

(16895, 34)

In [78]:
# Reorganizing the column order for bathrooms
# Drop the 'bathrooms_text' column but first remember its position
bathrooms_text_index = broward_listings_detailed_cleaned.columns.get_loc('bathrooms_text')
# Drop the column
broward_listings_detailed_cleaned.drop(columns=['bathrooms_text'], inplace=True)

# Reorder columns to insert 'bathrooms' into the position where 'bathrooms_text' was
columns = list(broward_listings_detailed_cleaned.columns)
# Remove 'bathrooms' from its current position
columns.remove('bathrooms')
# Insert 'bathrooms' at the position where 'bathrooms_text' used to be
columns.insert(bathrooms_text_index, 'bathrooms')

# Reassign reordered columns to broward_listings_detailed_cleaned
broward_listings_detailed_cleaned = broward_listings_detailed_cleaned[columns]

# Verify the shape again
broward_listings_detailed_cleaned.shape

(16895, 33)

In [79]:
broward_listings_detailed_cleaned

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,,475630606,Sean,2022-08-18,,within an hour,...,2,1.0,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",,"Hello, \r\nMy name is Ryan! I really love to ...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,...,2,2.0,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,,,382318476,Maggie,2020-12-30,,within an hour,...,8,3.0,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.00,1,Broward County
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,Pompano world-famous coastline is a very popul...,We’re a happily married couple who has travell...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,...,6,2.0,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,,Hope you have a nice time in our apartments :)...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,...,7,2.0,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16894,673349748522183791,Lovely 2 bedroom Apartment for a Family Getaway!,Our lovely 2 bed 2 bath apartment is a great h...,,We are a professional Vacation Rental property...,106641930,SouthFloridaBNB,2016-12-08,"Hollywood, FL",within an hour,...,4,2.0,2.0,2.0,"[""Microwave"", ""Stainless steel oven"", ""Mosquit...",135.0,8,3.88,32,Broward County
16895,4729595,Sunny Private Room w/ terrace,Cozy two-story house located in the heart of F...,Beautiful neighborhood with plenty outdoor act...,"Nos encanta la música, viajar y conocer gente ...",24423385,Fernan&Alina,2014-12-02,"Fort Lauderdale, FL",within an hour,...,2,1.0,1.0,1.0,"[""TV"", ""Bathtub"", ""Garden view"", ""Hair dryer"",...",92.0,11,4.73,1,Broward County
16896,578864807997181407,Modern 1 Bed Apartment w/Amenities- Hallandale,This stylish place to stay is perfect for grou...,,,432365961,Eugenia,2021-11-17,"Miami, FL",within an hour,...,5,1.5,1.0,3.0,"[""Patio or balcony"", ""Free parking on premises...",251.0,0,,8,Broward County
16897,684108385561377222,Lovely 1 bedroom apartment in a quiet neighbor...,Spacious apartment with one bedroom and a full...,Hallandale Beach is a city located in Broward ...,Hey! We love Miami and want to be a part of yo...,459788234,Hey Miami,2022-05-18,"Hollywood, FL",within an hour,...,4,1.0,1.0,2.0,"[""Air conditioning"", ""Free parking on premises...",173.0,0,,4,Broward County


In [80]:
# bedrooms and beds imputation (the order of the code is important for the 
# imputation logic):
# Impute missing values in 'bedrooms' based on 'bathrooms'
broward_listings_detailed_cleaned['bedrooms'] = broward_listings_detailed_cleaned.apply(
    lambda row: math.ceil(row['bathrooms']) if pd.isnull(row['bedrooms']) else row['bedrooms'],
    axis=1
)

# If beds has missing values then we need to impute using bedrooms 
broward_listings_detailed_cleaned['beds'] = broward_listings_detailed_cleaned.apply(
    lambda row: row['bedrooms'] if pd.isnull(row['beds']) else row['beds'],
    axis=1
)

In [81]:
# Verify that bedrooms and beds no longer have any missing values
missing_bedrooms_count = broward_listings_detailed_cleaned['bedrooms'].isna().sum()
print(f"Missing values in bedrooms: {missing_bedrooms_count}")
missing_beds_count = broward_listings_detailed_cleaned['beds'].isna().sum()
print(f"Missing values in beds: {missing_beds_count}")


Missing values in bedrooms: 0
Missing values in beds: 0
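
The row-wise `apply` calls above can also be written with `fillna`, which tends to be faster on large tables. This is a minimal sketch on a toy dataframe (the data is made up for illustration); it keeps the same ordering, filling `bedrooms` first so that `beds` can fall back on it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two listings lack bedrooms, two lack beds
toy = pd.DataFrame({
    "bathrooms": [1.5, 2.0, 1.0],
    "bedrooms": [np.nan, 3.0, np.nan],
    "beds": [np.nan, np.nan, 2.0],
})

# Step 1: missing bedrooms take the bathrooms value rounded up
toy["bedrooms"] = toy["bedrooms"].fillna(np.ceil(toy["bathrooms"]))

# Step 2: missing beds take the (now complete) bedrooms value
toy["beds"] = toy["beds"].fillna(toy["bedrooms"])

print(toy)
```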


In [82]:
# Are there any host_since missing values?
missing_host_since = broward_listings_detailed_cleaned['host_since'].isna().sum()

missing_host_since
# We found that there is 1 value missing

1

In [83]:
# Clean the host_since column based on what is the most recent web-scrape
# if the data is missing

id_to_last_scraped = broward_listings_detailed_df.set_index('id')['last_scraped']

broward_listings_detailed_cleaned['host_since'] = broward_listings_detailed_cleaned.apply(
    lambda row: id_to_last_scraped[row['id']] if pd.isnull(row['host_since']) else row['host_since'],
    axis=1
)

# Verified that host_since no longer has any missing values
missing_host_since = broward_listings_detailed_cleaned['host_since'].isna().sum()

print(f"Missing values in host_since: {missing_host_since}")

Missing values in host_since: 0
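
The same lookup-based fill can be sketched without a row-wise `apply` by mapping `id` through the lookup series and passing the result to `fillna`. The two small dataframes below are hypothetical stand-ins for the raw and cleaned listings tables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the raw and cleaned listings tables
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "last_scraped": ["2023-03-27", "2023-03-27", "2023-03-28"],
})
cleaned = pd.DataFrame({
    "id": [1, 3],
    "host_since": ["2016-04-09", np.nan],
})

# Lookup from listing id to the date the listing was last scraped
id_to_last_scraped = raw.set_index("id")["last_scraped"]

# Use the scrape date only where host_since is missing
fallback = cleaned["id"].map(id_to_last_scraped)
cleaned["host_since"] = cleaned["host_since"].fillna(fallback)

print(cleaned)
```

This assumes `id` is unique in the raw table, which the lookup in the cell above relies on as well.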


In [84]:
# Check if there are any missing data for host_location
broward_listings_detailed_cleaned['host_location'] = broward_listings_detailed_cleaned['host_location'].str.strip()
broward_listings_detailed_cleaned['host_location'].replace('', np.nan, inplace=True)
broward_listings_detailed_cleaned['host_location'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_location = broward_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 4769


In [85]:
# Impute host_location as Unknown if it was initially missing
broward_listings_detailed_cleaned['host_location'].fillna('Unknown', inplace=True)

# Verified that the host_location has no missing values
missing_host_location = broward_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 0
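
The strip, blank-to-NaN, and fill steps recur for several text columns, so they could be collected into one function. The helper name `fill_blank_strings` and the sample values are assumptions made for this sketch; it returns a new series instead of modifying a column in place.

```python
import pandas as pd


def fill_blank_strings(series, fill_value):
    """Strip whitespace, treat empty strings as missing, then fill the gaps."""
    stripped = series.str.strip()
    # mask() swaps in NaN wherever the stripped value is an empty string
    as_missing = stripped.mask(stripped == "")
    return as_missing.fillna(fill_value)


# Toy data (hypothetical): a padded value, two blanks, and a true missing value
toy_locations = pd.Series(["Miami, FL ", "", "   ", None])
filled = fill_blank_strings(toy_locations, "Unknown")
print(filled.tolist())
```

Assigning the result back, for example `df['host_location'] = fill_blank_strings(df['host_location'], 'Unknown')`, avoids calling `replace(..., inplace=True)` on a selected column, which newer pandas versions warn about.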


In [86]:
broward_listings_detailed_cleaned['host_is_superhost'] = broward_listings_detailed_cleaned['host_is_superhost'].str.strip()
broward_listings_detailed_cleaned['host_is_superhost'].replace('', np.nan, inplace=True)
broward_listings_detailed_cleaned['host_is_superhost'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_is_superhost = broward_listings_detailed_cleaned['host_is_superhost'].isna().sum()

# Check that host_is_superhost has 5 missing values
print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 5


In [87]:
# Imputed host_is_superhost as false if missing
broward_listings_detailed_cleaned['host_is_superhost'].fillna('f', inplace=True)

missing_host_is_superhost = broward_listings_detailed_cleaned['host_is_superhost'].isna().sum()

# Verified that the host_is_superhost has no missing values after imputing
print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 0


In [88]:
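# Strip whitespace and treat blank strings as missing
# Caution: .str.strip() returns NaN for non-string entries, so on a mixed-type column
# the count below can include values that were numeric rather than truly missing
broward_listings_detailed_cleaned['host_listings_count'] = broward_listings_detailed_cleaned['host_listings_count'].str.strip()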
broward_listings_detailed_cleaned['host_listings_count'].replace('', np.nan, inplace=True)
broward_listings_detailed_cleaned['host_listings_count'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_listings_count = broward_listings_detailed_cleaned['host_listings_count'].isna().sum()

# Check the number of missing values in host_listings_count
print(f"Missing values in host_listings_count: {missing_host_listings_count}")

Missing values in host_listings_count: 8704


In [89]:
# Imputed host_listings_count to be 1 if missing
broward_listings_detailed_cleaned['host_listings_count'].fillna(1, inplace=True)

missing_host_listings_count = broward_listings_detailed_cleaned['host_listings_count'].isna().sum()
# Verified that the host_listings_count has no missing values after imputing
print(f"Missing values in host_listings_count: {missing_host_listings_count}")

Missing values in host_listings_count: 0
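
One thing worth checking for a count column like `host_listings_count`: on an object column that mixes strings and numbers, pandas `.str` methods return NaN for every non-string entry. The notebook does not confirm whether that affected the 8704 figure above, so the sketch below only illustrates the behavior on made-up data and shows one possible alternative based on `pd.to_numeric`.

```python
import pandas as pd

# Toy mixed-type column (hypothetical): strings, numbers, a blank, and a missing value
mixed = pd.Series(["3 ", 2, 5.0, "", None], dtype=object)

# The .str accessor turns the numeric entries 2 and 5.0 into NaN
via_str_accessor = mixed.str.strip()
missing_via_str = int(via_str_accessor.mask(via_str_accessor == "").isna().sum())

# Alternative: strip only real strings, then coerce everything to numbers
stripped = mixed.map(lambda v: v.strip() if isinstance(v, str) else v)
as_numeric = pd.to_numeric(stripped, errors="coerce")
missing_safe = int(as_numeric.isna().sum())

print(missing_via_str, missing_safe)
```

Comparing the two counts on the real column would show whether any numeric values were being counted as missing before the fill with 1.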


In [90]:
# Check host_total_listings_count for missing values and impute with 1 if any are found
missing_host_total_listings_count = broward_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

broward_listings_detailed_cleaned['host_total_listings_count'].fillna(1, inplace=True)

print("Imputed the data")

missing_host_total_listings_count = broward_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

Missing values in host_total_listings_count: 1
Imputed the data
Missing values in host_total_listings_count: 0


In [91]:
# Get counts of host verifications and see that there is a None category
broward_listings_detailed_cleaned['host_verifications'] = broward_listings_detailed_cleaned['host_verifications'].apply(lambda x: "None" if x == '[]' else x)

print(broward_listings_detailed_cleaned['host_verifications'].value_counts())

['email', 'phone']                  13214
['email', 'phone', 'work_email']     2061
['phone']                            1516
['phone', 'work_email']                95
['email']                               5
None                                    4
Name: host_verifications, dtype: int64


In [92]:
# See if there are any missing values
broward_listings_detailed_cleaned['host_identity_verified'] = broward_listings_detailed_cleaned['host_identity_verified'].str.strip()
broward_listings_detailed_cleaned['host_identity_verified'].replace('', np.nan, inplace=True)
broward_listings_detailed_cleaned['host_identity_verified'].replace(' ', np.nan, inplace=True) 

missing_host_identity_verified = broward_listings_detailed_cleaned['host_identity_verified'].isna().sum()

# We see there is 1
print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")

Missing values in host_identity_verified: 1


In [93]:
# Impute the host_identity_verified
broward_listings_detailed_cleaned['host_identity_verified'].fillna('f', inplace=True)

missing_host_identity_verified = broward_listings_detailed_cleaned['host_identity_verified'].isna().sum()

# Verified that all missing values for host_identity_verified were resolved
print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")

Missing values in host_identity_verified: 0


In [94]:
missing_calculated_host_listings_count = broward_listings_detailed_cleaned['calculated_host_listings_count'].isna().sum()

# Checked that calculated_host_listings_count had no missing values
print(f"Missing values in calculated_host_listings_count: {missing_calculated_host_listings_count}")

Missing values in calculated_host_listings_count: 0


In [95]:
broward_listings_detailed_cleaned

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,,475630606,Sean,2022-08-18,Unknown,within an hour,...,2,1.0,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",,"Hello, \r\nMy name is Ryan! I really love to ...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,...,2,2.0,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,,,382318476,Maggie,2020-12-30,Unknown,within an hour,...,8,3.0,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.00,1,Broward County
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,Pompano world-famous coastline is a very popul...,We’re a happily married couple who has travell...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,...,6,2.0,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,,Hope you have a nice time in our apartments :)...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,...,7,2.0,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16894,673349748522183791,Lovely 2 bedroom Apartment for a Family Getaway!,Our lovely 2 bed 2 bath apartment is a great h...,,We are a professional Vacation Rental property...,106641930,SouthFloridaBNB,2016-12-08,"Hollywood, FL",within an hour,...,4,2.0,2.0,2.0,"[""Microwave"", ""Stainless steel oven"", ""Mosquit...",135.0,8,3.88,32,Broward County
16895,4729595,Sunny Private Room w/ terrace,Cozy two-story house located in the heart of F...,Beautiful neighborhood with plenty outdoor act...,"Nos encanta la música, viajar y conocer gente ...",24423385,Fernan&Alina,2014-12-02,"Fort Lauderdale, FL",within an hour,...,2,1.0,1.0,1.0,"[""TV"", ""Bathtub"", ""Garden view"", ""Hair dryer"",...",92.0,11,4.73,1,Broward County
16896,578864807997181407,Modern 1 Bed Apartment w/Amenities- Hallandale,This stylish place to stay is perfect for grou...,,,432365961,Eugenia,2021-11-17,"Miami, FL",within an hour,...,5,1.5,1.0,3.0,"[""Patio or balcony"", ""Free parking on premises...",251.0,0,,8,Broward County
16897,684108385561377222,Lovely 1 bedroom apartment in a quiet neighbor...,Spacious apartment with one bedroom and a full...,Hallandale Beach is a city located in Broward ...,Hey! We love Miami and want to be a part of yo...,459788234,Hey Miami,2022-05-18,"Hollywood, FL",within an hour,...,4,1.0,1.0,2.0,"[""Air conditioning"", ""Free parking on premises...",173.0,0,,4,Broward County


#### Counting amenities and resolving the neighborhood columns

In [96]:
def count_amenities(amenities_str):
    try:
        # Convert the string representation of the list back into a list
        amenities_list = ast.literal_eval(amenities_str)
        # Return the count of items in the list
        return len(amenities_list)
    except (ValueError, SyntaxError):
        # In case of any error during conversion, return 0
        return 0

# Apply the function to each row in the 'amenities' column and create a new column 'amenities_count'
broward_listings_detailed_cleaned['amenities_count'] = broward_listings_detailed_cleaned['amenities'].apply(count_amenities)
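
`count_amenities` returns 0 when the string cannot be parsed, but a value that parses to something other than a list (for example the string `'12'`) would reach `len()` and raise a `TypeError`. The variant below is a hedged sketch of a slightly stricter version; the name `count_amenities_safe` is hypothetical and it is not used elsewhere in this notebook.

```python
import ast


def count_amenities_safe(amenities_str):
    """Count items in a stringified list, returning 0 for anything else."""
    try:
        parsed = ast.literal_eval(amenities_str)
    except (ValueError, SyntaxError, TypeError):
        # Unparseable text, NaN, or other non-string input
        return 0
    # Only lists and tuples are counted; numbers, dicts, etc. yield 0
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0


print(count_amenities_safe('["TV", "Wifi", "Kitchen"]'))
print(count_amenities_safe('12'))
print(count_amenities_safe(float('nan')))
```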


In [97]:
# Drop the neighbourhood column
broward_listings_detailed_cleaned.drop(columns=['neighbourhood'], inplace=True)

broward_listings_detailed_cleaned.rename(columns={'neighbourhood_cleansed': 'neighborhood'}, inplace=True)

In [98]:
# Drop additional columns
broward_listings_detailed_cleaned.drop(columns=['neighborhood_overview', 'host_about'], inplace=True)

In [99]:
# Rename bathrooms to num_bath
broward_listings_detailed_cleaned.rename(columns={'bathrooms': 'num_bath'}, inplace=True)

In [100]:
broward_listings_detailed_cleaned

Unnamed: 0,id,name,description,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,827736378366911479,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,475630606,Sean,2022-08-18,Unknown,within an hour,100%,94%,...,1.0,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County,10
1,592589963829194972,Club Wyndham Royal Vista,"Located directly on the beach, the property si...",66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,98%,16%,...,2.0,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County,29
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,Relax with the whole family at this peaceful p...,382318476,Maggie,2020-12-30,Unknown,within an hour,100%,89%,...,3.0,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.00,1,Broward County,14
3,33271346,Beach Escape – One Block from the Beach!,Newly constructed and beautifully renovated Ke...,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,100%,100%,...,2.0,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County,22
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,<b>The space</b><br />The apartment is located...,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,95%,26%,...,2.0,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16894,673349748522183791,Lovely 2 bedroom Apartment for a Family Getaway!,Our lovely 2 bed 2 bath apartment is a great h...,106641930,SouthFloridaBNB,2016-12-08,"Hollywood, FL",within an hour,100%,100%,...,2.0,2.0,2.0,"[""Microwave"", ""Stainless steel oven"", ""Mosquit...",135.0,8,3.88,32,Broward County,58
16895,4729595,Sunny Private Room w/ terrace,Cozy two-story house located in the heart of F...,24423385,Fernan&Alina,2014-12-02,"Fort Lauderdale, FL",within an hour,100%,100%,...,1.0,1.0,1.0,"[""TV"", ""Bathtub"", ""Garden view"", ""Hair dryer"",...",92.0,11,4.73,1,Broward County,42
16896,578864807997181407,Modern 1 Bed Apartment w/Amenities- Hallandale,This stylish place to stay is perfect for grou...,432365961,Eugenia,2021-11-17,"Miami, FL",within an hour,100%,97%,...,1.5,1.0,3.0,"[""Patio or balcony"", ""Free parking on premises...",251.0,0,,8,Broward County,16
16897,684108385561377222,Lovely 1 bedroom apartment in a quiet neighbor...,Spacious apartment with one bedroom and a full...,459788234,Hey Miami,2022-05-18,"Hollywood, FL",within an hour,100%,81%,...,1.0,1.0,2.0,"[""Air conditioning"", ""Free parking on premises...",173.0,0,,4,Broward County,11


In [101]:
broward_listings_detailed_df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,city
0,827736378366911479,https://www.airbnb.com/rooms/827736378366911479,20230327145536,2023-03-27,city scrape,Legion 1BR/1BA,Take it easy at this unique and tranquil getaway.,,https://a0.muscache.com/pictures/miso/Hosting-...,475630606,...,,,,t,1,1,0,0,,Broward County


### Data Cleaning on Jersey City

In [102]:
# The dataframe we should be using for clean-up
jersey_listings_transformed_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,40669,Skyy’s Lounge / Cozy,<b>The space</b><br />Skyy’s Lounge ....Everyt...,The neighborhood is very diverse & friendly sh...,I am the owner of a high end Nail Salon in the...,175412,Skyy,2010-07-20,"Jersey City, NJ",,...,2,1 shared bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""G...",128.0,10,5.00,1,Jersey City
1,215768,Minutes to Manhattan & Jersey Shore,Walking to distance to Statue of Liberty and E...,"Such close proximity to NYC, 7 minutes on the ...",Hello and thank you for taking the time to rea...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,...,4,1 bath,1.0,2.0,"[""Kitchen"", ""Blender"", ""Dryer"", ""Fire extingui...",134.0,159,4.79,2,Jersey City
2,254245,Minutes to Manhattan and NJ Shore,Walking to distance to Statue of Liberty and E...,"Such close proximity to NYC, 7 minutes on the ...",Hello and thank you for taking the time to rea...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,...,2,1 bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""F...",139.0,121,4.73,2,Jersey City
3,269266,Modern private 2 bedrooms apt minutes to NYC,Enjoy that private luxury two bedrooms apartme...,Our House is located in the Liberty State Park...,I am living in New York City area for several...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,...,5,1 bath,2.0,3.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",119.0,408,4.45,11,Jersey City
4,270245,Private room with own bathroom close to NYC,Just for you small bedroom with private bathro...,,I am living in New York City area for several...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,...,1,1 private bath,1.0,1.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",44.0,283,4.59,11,Jersey City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,853639185161545946,Visit Statue of Liberty! Easy City Access!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,,within an hour,...,4,1 bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",299.0,0,,7,Jersey City
1310,853643364839804604,Contemporary Living! Near Liberty Science Center!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,,within an hour,...,2,1 bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City
1311,853644015910009065,Stylish Suite! Parking Available! Enjoy City V...,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,,within an hour,...,2,1 bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City
1312,853646181107758913,Elegant Suite! Near Hudson River Waterfront!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,,within an hour,...,2,1 bath,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City


In [103]:
# Make a copy of jersey_listings_transformed_wip and use the copy going forward
jersey_listings_detailed_cleaned = jersey_listings_transformed_wip.copy()

In [104]:
# Conduct initial drop conditions
condition = (
    (jersey_listings_detailed_cleaned['price'].isna() | (jersey_listings_detailed_cleaned['price'] == 0)) |
    (jersey_listings_detailed_cleaned['accommodates'].isna() | (jersey_listings_detailed_cleaned['accommodates'] == 0))
)

# Inverting the condition to keep rows that do not meet the condition
# This gives us all the rows where price is not 0 or missing and accommodates is not 0 or missing
jersey_listings_detailed_cleaned = jersey_listings_detailed_cleaned[~condition]

# Verify with shape: the row count is as expected because no listing met the drop conditions
jersey_listings_detailed_cleaned.shape

(1314, 33)
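
Since the same drop conditions are applied to each city, they could be expressed once as a function. This is a minimal sketch under that assumption; `drop_unusable_listings` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_unusable_listings(df):
    """Keep only rows where price and accommodates are both present and non-zero."""
    bad_price = df["price"].isna() | (df["price"] == 0)
    bad_capacity = df["accommodates"].isna() | (df["accommodates"] == 0)
    # .copy() gives an independent frame, so later column assignments
    # do not trigger SettingWithCopyWarning
    return df[~(bad_price | bad_capacity)].copy()


# Toy data (hypothetical): only the first row passes both checks
toy = pd.DataFrame({
    "price": [128.0, 0.0, np.nan, 44.0],
    "accommodates": [2, 4, 1, 0],
})
kept = drop_unusable_listings(toy)
print(kept.shape)
```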

In [105]:
# Cleaning bathroom text - this will get renamed as bathrooms. Towards end of the notebook the column is renamed to num_bath

# Use regex to extract the first number found in the bathrooms_text column
jersey_listings_detailed_cleaned['bathrooms'] = jersey_listings_detailed_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

# Handle specific wording cases to see if it is 0.5
half_bath_keywords = ['Half-bath', 'Shared half-bath', 'Private half-bath']
for keyword in half_bath_keywords:
    jersey_listings_detailed_cleaned.loc[jersey_listings_detailed_cleaned['bathrooms_text'].str.contains(keyword, case=False, na=False), 'bathrooms'] = 0.5

# Impute missing data based on bedrooms or beds
# Find indices where bathrooms is NaN
missing_bathrooms_idx = jersey_listings_detailed_cleaned[jersey_listings_detailed_cleaned['bathrooms'].isna()].index

# For rows with missing bathrooms_text, impute with bedrooms value; if bedrooms is also missing/NaN,
# then use beds/2, if beds is also missing then just put the value as 0
for idx in missing_bathrooms_idx:
    if pd.notna(jersey_listings_detailed_cleaned.loc[idx, 'bedrooms']):
        jersey_listings_detailed_cleaned.loc[idx, 'bathrooms'] = jersey_listings_detailed_cleaned.loc[idx, 'bedrooms']
    elif pd.notna(jersey_listings_detailed_cleaned.loc[idx, 'beds']):
        jersey_listings_detailed_cleaned.loc[idx, 'bathrooms'] = jersey_listings_detailed_cleaned.loc[idx, 'beds'] / 2
    else:
        # If both bedrooms and beds are missing, impute the value as 0
        jersey_listings_detailed_cleaned.loc[idx, 'bathrooms'] = 0 


In [106]:
# Double-check that bathrooms has 0 missing values after cleaning,
# which confirms that the imputation succeeded
missing_bathrooms_count = jersey_listings_detailed_cleaned['bathrooms'].isna().sum()
print(f"Missing values in bathrooms: {missing_bathrooms_count}")

Missing values in bathrooms: 0
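
To make the text-parsing step easier to test in isolation, the regex extraction and the half-bath rule can be sketched as a standalone function. `parse_bathrooms_text` and the sample strings are assumptions for illustration; the fallback imputation from bedrooms and beds is deliberately left out.

```python
import pandas as pd


def parse_bathrooms_text(text_series):
    """Turn strings such as '1.5 baths' or 'Shared half-bath' into floats."""
    # First number in the text; rows without a number become NaN
    numbers = text_series.str.extract(r"(\d+\.?\d*)", expand=False).astype(float)
    # Any wording that contains 'half-bath' counts as 0.5, regardless of case
    is_half = text_series.str.contains("half-bath", case=False, na=False)
    return numbers.mask(is_half, 0.5)


# Toy data (hypothetical) covering the main wording variants
toy_text = pd.Series(["1 shared bath", "1.5 baths", "Shared half-bath", "Half-bath", None])
parsed = parse_bathrooms_text(toy_text)
print(parsed.tolist())
```

A single case-insensitive match on `half-bath` covers the three keywords listed in the cell above, since each of them contains that substring.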


In [107]:
# Reorganizing the column order for bathrooms
# Drop the 'bathrooms_text' column but first remember its position
bathrooms_text_index = jersey_listings_detailed_cleaned.columns.get_loc('bathrooms_text')
# Drop the column
jersey_listings_detailed_cleaned.drop(columns=['bathrooms_text'], inplace=True)

# Reorder columns to insert 'bathrooms' into the position where 'bathrooms_text' was
columns = list(jersey_listings_detailed_cleaned.columns)
# Remove 'bathrooms' from its current position
columns.remove('bathrooms')
# Insert 'bathrooms' at the position where 'bathrooms_text' used to be
columns.insert(bathrooms_text_index, 'bathrooms')

# Reassign reordered columns to DataFrame
jersey_listings_detailed_cleaned = jersey_listings_detailed_cleaned[columns]

# Verify the shape again
jersey_listings_detailed_cleaned.shape

(1314, 33)

In [108]:
# bedrooms and beds imputation (the order of the code is important for the imputation logic):
# Impute missing values in 'bedrooms' based on 'bathrooms'
jersey_listings_detailed_cleaned['bedrooms'] = jersey_listings_detailed_cleaned.apply(
    lambda row: math.ceil(row['bathrooms']) if pd.isnull(row['bedrooms']) else row['bedrooms'],
    axis=1
)

# If beds has missing values then we need to impute using bedrooms 
jersey_listings_detailed_cleaned['beds'] = jersey_listings_detailed_cleaned.apply(
    lambda row: row['bedrooms'] if pd.isnull(row['beds']) else row['beds'],
    axis=1
)


missing_bedrooms_count = jersey_listings_detailed_cleaned['bedrooms'].isna().sum()
print(f"Missing values in bedrooms: {missing_bedrooms_count}")
missing_beds_count = jersey_listings_detailed_cleaned['beds'].isna().sum()
print(f"Missing values in beds: {missing_beds_count}")



Missing values in bedrooms: 0
Missing values in beds: 0


In [109]:
# Are there any host_since missing values?
missing_host_since = jersey_listings_detailed_cleaned['host_since'].isna().sum()

missing_host_since
# No host_since imputation is needed for Jersey City when this count is 0

0

In [110]:
# Check if there are any missing data for host_location
jersey_listings_detailed_cleaned['host_location'] = jersey_listings_detailed_cleaned['host_location'].str.strip()
jersey_listings_detailed_cleaned['host_location'].replace('', np.nan, inplace=True)
jersey_listings_detailed_cleaned['host_location'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_location = jersey_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 319


In [111]:
# Impute host_location as Unknown if it was initially missing
jersey_listings_detailed_cleaned['host_location'].fillna('Unknown', inplace=True)

missing_host_location = jersey_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 0


In [112]:
# Check if there are any missing values for host_is_superhost
jersey_listings_detailed_cleaned['host_is_superhost'] = jersey_listings_detailed_cleaned['host_is_superhost'].str.strip()
jersey_listings_detailed_cleaned['host_is_superhost'].replace('', np.nan, inplace=True)
jersey_listings_detailed_cleaned['host_is_superhost'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_is_superhost = jersey_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

# No imputation needed because host_is_superhost has 0 missing values


Missing values in host_is_superhost: 0


In [113]:
missing_host_listings_count = jersey_listings_detailed_cleaned['host_listings_count'].isna().sum()

print(f"Missing values in host_listings_count: {missing_host_listings_count}")

# No imputation needed because host_listings_count has 0 missing values

Missing values in host_listings_count: 0


In [114]:
# Check host_total_listings_count for missing values and impute with 1 if any are found
missing_host_total_listings_count = jersey_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

jersey_listings_detailed_cleaned['host_total_listings_count'].fillna(1, inplace=True)

print("Imputed the data")

missing_host_total_listings_count = jersey_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

Missing values in host_total_listings_count: 0
Imputed the data
Missing values in host_total_listings_count: 0


In [115]:
# Get counts of host verifications, relabeling any empty list as "None"
jersey_listings_detailed_cleaned['host_verifications'] = jersey_listings_detailed_cleaned['host_verifications'].apply(lambda x: "None" if x == '[]' else x)

print(jersey_listings_detailed_cleaned['host_verifications'].value_counts())

# No imputation needed

['email', 'phone']                  932
['email', 'phone', 'work_email']    263
['phone']                           117
['phone', 'work_email']               2
Name: host_verifications, dtype: int64


In [116]:
# See if there are any missing values
jersey_listings_detailed_cleaned['host_identity_verified'] = jersey_listings_detailed_cleaned['host_identity_verified'].str.strip()
jersey_listings_detailed_cleaned['host_identity_verified'].replace('', np.nan, inplace=True)
jersey_listings_detailed_cleaned['host_identity_verified'].replace(' ', np.nan, inplace=True) 

missing_host_identity_verified = jersey_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")

Missing values in host_identity_verified: 0


In [117]:
missing_calculated_host_listings_count = jersey_listings_detailed_cleaned['calculated_host_listings_count'].isna().sum()
# Checked that calculated_host_listings_count had no missing values
print(f"Missing values in calculated_host_listings_count: {missing_calculated_host_listings_count}")

Missing values in calculated_host_listings_count: 0


In [118]:
jersey_listings_detailed_cleaned

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,40669,Skyy’s Lounge / Cozy,<b>The space</b><br />Skyy’s Lounge ....Everyt...,The neighborhood is very diverse & friendly sh...,I am the owner of a high end Nail Salon in the...,175412,Skyy,2010-07-20,"Jersey City, NJ",,...,2,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""G...",128.0,10,5.00,1,Jersey City
1,215768,Minutes to Manhattan & Jersey Shore,Walking to distance to Statue of Liberty and E...,"Such close proximity to NYC, 7 minutes on the ...",Hello and thank you for taking the time to rea...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,...,4,1.0,1.0,2.0,"[""Kitchen"", ""Blender"", ""Dryer"", ""Fire extingui...",134.0,159,4.79,2,Jersey City
2,254245,Minutes to Manhattan and NJ Shore,Walking to distance to Statue of Liberty and E...,"Such close proximity to NYC, 7 minutes on the ...",Hello and thank you for taking the time to rea...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,...,2,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""F...",139.0,121,4.73,2,Jersey City
3,269266,Modern private 2 bedrooms apt minutes to NYC,Enjoy that private luxury two bedrooms apartme...,Our House is located in the Liberty State Park...,I am living in New York City area for several...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,...,5,1.0,2.0,3.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",119.0,408,4.45,11,Jersey City
4,270245,Private room with own bathroom close to NYC,Just for you small bedroom with private bathro...,,I am living in New York City area for several...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,...,1,1.0,1.0,1.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",44.0,283,4.59,11,Jersey City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,853639185161545946,Visit Statue of Liberty! Easy City Access!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,Unknown,within an hour,...,4,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",299.0,0,,7,Jersey City
1310,853643364839804604,Contemporary Living! Near Liberty Science Center!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,Unknown,within an hour,...,2,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City
1311,853644015910009065,Stylish Suite! Parking Available! Enjoy City V...,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,Unknown,within an hour,...,2,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City
1312,853646181107758913,Elegant Suite! Near Hudson River Waterfront!,"Experience NYC at its finest! Spacious suites,...",Newport Centre - 0.5 miles; <br />J. Owen Grun...,,501999514,RoomPicks,2023-02-20,Unknown,within an hour,...,2,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City


In [119]:
def count_amenities(amenities_str):
    try:
        # Convert the string representation of the list back into a list
        amenities_list = ast.literal_eval(amenities_str)
        # Return the count of items in the list
        return len(amenities_list)
    except (ValueError, SyntaxError):
        # In case of any error during conversion, return 0
        return 0

# Apply the function to each row in the 'amenities' column and create a new column 'amenities_count'
jersey_listings_detailed_cleaned['amenities_count'] = jersey_listings_detailed_cleaned['amenities'].apply(count_amenities)
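
As a quick illustration of how `count_amenities` behaves, the sketch below applies the same logic to a few made-up amenities strings. The sample values are hypothetical and not taken from the dataset.

```python
import ast

import numpy as np
import pandas as pd


def count_amenities(amenities_str):
    """Count the items in a stringified list; return 0 if it cannot be parsed."""
    try:
        return len(ast.literal_eval(amenities_str))
    except (ValueError, SyntaxError):
        return 0


# Hypothetical sample values for illustration only:
# a valid list, an empty list, a malformed string, and a missing value
sample = pd.Series(['["Kitchen", "Dryer", "Gym"]', '[]', '["Kitchen"', np.nan])
sample_counts = sample.apply(count_amenities)
print(sample_counts.tolist())
```

Malformed strings and missing values both come back as 0, so a count of 0 does not distinguish "no amenities listed" from "could not parse".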


In [120]:
# Drop the neighbourhood column
jersey_listings_detailed_cleaned.drop(columns=['neighbourhood'], inplace=True)

jersey_listings_detailed_cleaned.rename(columns={'neighbourhood_cleansed': 'neighborhood'}, inplace=True)

In [121]:
# Drop additional columns
jersey_listings_detailed_cleaned.drop(columns=['neighborhood_overview', 'host_about'], inplace=True)

In [122]:
# Rename bathrooms to num_bath
jersey_listings_detailed_cleaned.rename(columns={'bathrooms': 'num_bath'}, inplace=True)

In [123]:
jersey_listings_detailed_cleaned

Unnamed: 0,id,name,description,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,40669,Skyy’s Lounge / Cozy,<b>The space</b><br />Skyy’s Lounge ....Everyt...,175412,Skyy,2010-07-20,"Jersey City, NJ",,,33%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""G...",128.0,10,5.00,1,Jersey City,44
1,215768,Minutes to Manhattan & Jersey Shore,Walking to distance to Statue of Liberty and E...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,100%,98%,...,1.0,1.0,2.0,"[""Kitchen"", ""Blender"", ""Dryer"", ""Fire extingui...",134.0,159,4.79,2,Jersey City,42
2,254245,Minutes to Manhattan and NJ Shore,Walking to distance to Statue of Liberty and E...,846837,Charlaine,2011-07-20,"Jersey City, NJ",within an hour,100%,98%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Dryer"", ""F...",139.0,121,4.73,2,Jersey City,32
3,269266,Modern private 2 bedrooms apt minutes to NYC,Enjoy that private luxury two bedrooms apartme...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,100%,100%,...,1.0,2.0,3.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",119.0,408,4.45,11,Jersey City,47
4,270245,Private room with own bathroom close to NYC,Just for you small bedroom with private bathro...,1410590,Magda,2011-11-15,"Jersey City, NJ",within an hour,100%,100%,...,1.0,1.0,1.0,"[""Kitchen"", ""Fire extinguisher"", ""Books and re...",44.0,283,4.59,11,Jersey City,41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,853639185161545946,Visit Statue of Liberty! Easy City Access!,"Experience NYC at its finest! Spacious suites,...",501999514,RoomPicks,2023-02-20,Unknown,within an hour,98%,100%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",299.0,0,,7,Jersey City,31
1310,853643364839804604,Contemporary Living! Near Liberty Science Center!,"Experience NYC at its finest! Spacious suites,...",501999514,RoomPicks,2023-02-20,Unknown,within an hour,98%,100%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City,31
1311,853644015910009065,Stylish Suite! Parking Available! Enjoy City V...,"Experience NYC at its finest! Spacious suites,...",501999514,RoomPicks,2023-02-20,Unknown,within an hour,98%,100%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City,31
1312,853646181107758913,Elegant Suite! Near Hudson River Waterfront!,"Experience NYC at its finest! Spacious suites,...",501999514,RoomPicks,2023-02-20,Unknown,within an hour,98%,100%,...,1.0,1.0,1.0,"[""Kitchen"", ""Dedicated workspace"", ""Gym"", ""Pai...",257.0,0,,7,Jersey City,31


### Data Cleaning on Cambridge

In [124]:
# The dataframe we should be using for clean-up
cambridge_listings_detailed_df_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,8521,SunsplashedSerenity walk to Harvard & Fresh Pond,"An elegant, sun-splashed, 2 bedroom (+2offices...",Huron Village is known for its charm. We have...,I'm a professor at one of the local universiti...,306681,Janet,2010-12-01,"Cambridge, MA",within a few hours,...,5,1 bath,2.0,2.0,"[""Coffee maker"", ""Baking sheet"", ""Beach essent...",225.0,50,4.68,2,Cambridge
1,11169,Lovely Studio Room: Available for long w/ends,Large sunny room w kitchenette & bath. Foam ma...,The neighborhood is quiet and friendly and our...,"Friendly, politically progressive, resourceful...",40965,Judy,2009-09-24,"Cambridge, MA",within a few hours,...,3,1 private bath,1.0,,"[""Coffee maker"", ""Coffee"", ""Lockbox"", ""Dining ...",121.0,165,4.76,3,Cambridge
2,19581,"Furnished suite, Windsor","Welcome to Area IV! We are located, convenient...",,We are adventure seekers. Both Patty and I lo...,74249,Marc And Patty,2010-01-27,"Cambridge, MA",within an hour,...,1,1 private bath,1.0,1.0,"[""Dining table"", ""Dishwasher"", ""Smoke alarm"", ...",205.0,8,4.17,3,Cambridge
3,27498,Furnished suite 2 @ the Windsor,"Welcome to Area IV! We are located, convenient...",,We are adventure seekers. Both Patty and I lo...,74249,Marc And Patty,2010-01-27,"Cambridge, MA",within an hour,...,2,1 private bath,1.0,1.0,"[""Baking sheet"", ""Dining table"", ""Dishwasher"",...",225.0,20,4.50,3,Cambridge
4,79762,Cambridge Getaway @ Harvard & MIT,Charming 2-bedroom apartment on the third floo...,Annmarie and I have lived in this area for ove...,Annmarie and I were both born here in Cambridg...,430015,Kevin,2011-03-08,"Cambridge, MA",within a day,...,4,1 bath,2.0,2.0,"[""Coffee maker"", ""Pack \u2019n play/Travel cri...",300.0,385,4.76,1,Cambridge
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1021,847711482579431825,Free Parking 2 Bed 1 Bth Harvard,"Come and stay in this centrally-located, quiet...",,,43450256,Steve,2015-09-05,"Cambridge, MA",within an hour,...,4,1 bath,2.0,3.0,"[""Coffee maker"", ""Dining table"", ""Shower gel"",...",121.0,0,,26,Cambridge
1022,847779091034141950,Walk Around Cambridge or Boston,Your family will be close to everything when y...,"East Cambridge is close to MIT/Kendall, all bi...",,505533166,Karolyn,2023-03-15,,within an hour,...,6,1 bath,3.0,4.0,"[""Shower gel"", ""Single oven"", ""Clothing storag...",166.0,0,,1,Cambridge
1023,848589438104752605,Queen Bedroom Near Harvard,"Come and stay in this centrally-located, quiet...",,,43450256,Steve,2015-09-05,"Cambridge, MA",within an hour,...,2,1 shared bath,1.0,1.0,"[""Coffee maker"", ""Dining table"", ""Shower gel"",...",130.0,0,,26,Cambridge
1024,849427951863814148,MIT/Havard~King bed-washer/dryer,"This special place is close to everything, mak...",,,163848078,Adam,2017-12-23,"Boston, MA",within an hour,...,4,1 bath,1.0,1.0,"[""Pool table"", ""Clothing storage: dresser and ...",225.0,0,,8,Cambridge


In [125]:
# Make a copy of cambridge_listings_detailed_df_wip; cambridge_listings_detailed_cleaned is used going forward
cambridge_listings_detailed_cleaned = cambridge_listings_detailed_df_wip.copy()

In [126]:
# Conduct initial drop conditions
condition = (
    (cambridge_listings_detailed_cleaned['price'].isna() | (cambridge_listings_detailed_cleaned['price'] == 0)) |
    (cambridge_listings_detailed_cleaned['accommodates'].isna() | (cambridge_listings_detailed_cleaned['accommodates'] == 0))
)

# Inverting the condition to keep rows that do not meet the condition
# This gives us all the rows where price is not 0 or missing and accommodates is not 0 or missing
cambridge_listings_detailed_cleaned = cambridge_listings_detailed_cleaned[~condition]

# Verify with shape: the row count is expected to stay the same because
# no rows matched the drop condition
cambridge_listings_detailed_cleaned.shape


(1026, 33)
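
The drop condition above can be sketched on a toy frame to show which rows it removes. The data below is hypothetical, and the `.copy()` call is an addition that is not in the cell above; it makes the filtered result an independent DataFrame so that later column assignments do not raise `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; only the two columns used by the filter
toy = pd.DataFrame({
    "price": [120.0, 0.0, np.nan, 85.0],
    "accommodates": [2, 4, 1, 0],
})

# A row is invalid if price or accommodates is missing or zero
invalid = (
    toy["price"].isna() | (toy["price"] == 0)
    | toy["accommodates"].isna() | (toy["accommodates"] == 0)
)

# Keep only the valid rows as an independent copy
toy_cleaned = toy[~invalid].copy()
print(f"Dropped {invalid.sum()} of {len(toy)} rows")
```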

In [127]:
# Clean bathrooms_text into a numeric bathrooms column. Towards the end of the notebook the column is renamed to num_bath

# Use regex to extract the first number found in the bathrooms_text column
cambridge_listings_detailed_cleaned['bathrooms'] = cambridge_listings_detailed_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

# Handle specific wording cases to see if it is 0.5
half_bath_keywords = ['Half-bath', 'Shared half-bath', 'Private half-bath']
for keyword in half_bath_keywords:
    cambridge_listings_detailed_cleaned.loc[cambridge_listings_detailed_cleaned['bathrooms_text'].str.contains(keyword, case=False, na=False), 'bathrooms'] = 0.5

# Impute missing data based on bedrooms or beds
# Find indices where bathrooms is NaN
missing_bathrooms_idx = cambridge_listings_detailed_cleaned[cambridge_listings_detailed_cleaned['bathrooms'].isna()].index

# For rows with missing bathrooms_text, impute with bedrooms value; if bedrooms is also missing/NaN,
# then use beds/2, if beds is also missing then just put the value as 0
for idx in missing_bathrooms_idx:
    if pd.notna(cambridge_listings_detailed_cleaned.loc[idx, 'bedrooms']):
        cambridge_listings_detailed_cleaned.loc[idx, 'bathrooms'] = cambridge_listings_detailed_cleaned.loc[idx, 'bedrooms']
    elif pd.notna(cambridge_listings_detailed_cleaned.loc[idx, 'beds']):
        cambridge_listings_detailed_cleaned.loc[idx, 'bathrooms'] = cambridge_listings_detailed_cleaned.loc[idx, 'beds'] / 2
    else:
        # If both bedrooms and beds are missing, impute the value as 0
        cambridge_listings_detailed_cleaned.loc[idx, 'bathrooms'] = 0

# Double-check that bathrooms has 0 missing values after cleaning,
# which confirms that the imputation worked
missing_bathrooms_count = cambridge_listings_detailed_cleaned['bathrooms'].isna().sum()
print(f"Missing values in bathrooms: {missing_bathrooms_count}")


Missing values in bathrooms: 0
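
For reference, the same bathrooms logic can be written without a row-by-row loop. This is a minimal sketch on hypothetical rows, one per branch of the logic, and not the code that produced the output above. It assumes the same fallback order: `bedrooms`, then `beds / 2`, then 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows covering each branch of the logic
toy = pd.DataFrame({
    "bathrooms_text": ["1 bath", "1.5 shared baths", "Shared half-bath", np.nan, np.nan, np.nan],
    "bedrooms": [1.0, 2.0, 1.0, 2.0, np.nan, np.nan],
    "beds": [1.0, 2.0, 1.0, 3.0, 3.0, np.nan],
})

# First number in the text, e.g. "1.5 shared baths" -> 1.5
baths = toy["bathrooms_text"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

# Any "half-bath" wording (case-insensitive) counts as 0.5
is_half = toy["bathrooms_text"].str.contains("half-bath", case=False, na=False)
baths = baths.mask(is_half, 0.5)

# Fallback chain for values that are still missing: bedrooms, then beds / 2, then 0
toy["bathrooms"] = baths.fillna(toy["bedrooms"]).fillna(toy["beds"] / 2).fillna(0)
print(toy["bathrooms"].tolist())
```

Because the match is case-insensitive, the single pattern `half-bath` also covers the "Shared half-bath" and "Private half-bath" wordings.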


In [128]:
# Reorganizing the column order for bathrooms
# Drop the 'bathrooms_text' column but first remember its position
bathrooms_text_index = cambridge_listings_detailed_cleaned.columns.get_loc('bathrooms_text')
# Drop the column
cambridge_listings_detailed_cleaned.drop(columns=['bathrooms_text'], inplace=True)

# Reorder columns to insert 'bathrooms' into the position where 'bathrooms_text' was
columns = list(cambridge_listings_detailed_cleaned.columns)
# Remove 'bathrooms' from its current position
columns.remove('bathrooms')
# Insert 'bathrooms' at the position where 'bathrooms_text' used to be
columns.insert(bathrooms_text_index, 'bathrooms')

# Reassign reordered columns to DataFrame
cambridge_listings_detailed_cleaned = cambridge_listings_detailed_cleaned[columns]

# Verify the shape again
cambridge_listings_detailed_cleaned.shape

(1026, 33)
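
The column reordering can also be expressed with `DataFrame.pop` and `DataFrame.insert`. The sketch below uses a hypothetical four-column frame to show that `bathrooms` ends up where `bathrooms_text` used to be.

```python
import pandas as pd

# Hypothetical toy frame with the derived column appended at the end
toy = pd.DataFrame({
    "accommodates": [2],
    "bathrooms_text": ["1 bath"],
    "bedrooms": [1.0],
    "bathrooms": [1.0],
})

# Remember where the text column sits, then drop it
position = toy.columns.get_loc("bathrooms_text")
toy = toy.drop(columns=["bathrooms_text"])

# pop() removes 'bathrooms' and returns it; insert() places it at the saved position
toy.insert(position, "bathrooms", toy.pop("bathrooms"))
print(list(toy.columns))
```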

In [129]:
# bedrooms and beds imputation (the order of the code is important for the 
# imputation logic):
# Impute missing values in 'bedrooms' based on 'bathrooms'
cambridge_listings_detailed_cleaned['bedrooms'] = cambridge_listings_detailed_cleaned.apply(
    lambda row: math.ceil(row['bathrooms']) if pd.isnull(row['bedrooms']) else row['bedrooms'],
    axis=1
)

# If beds has missing values then we need to impute using bedrooms 
cambridge_listings_detailed_cleaned['beds'] = cambridge_listings_detailed_cleaned.apply(
    lambda row: row['bedrooms'] if pd.isnull(row['beds']) else row['beds'],
    axis=1
)



missing_bedrooms_count = cambridge_listings_detailed_cleaned['bedrooms'].isna().sum()
print(f"Missing values in bedrooms: {missing_bedrooms_count}")
missing_beds_count = cambridge_listings_detailed_cleaned['beds'].isna().sum()
print(f"Missing values in beds: {missing_beds_count}")

Missing values in bedrooms: 0
Missing values in beds: 0
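
The order dependence of the two imputation steps is easier to see on a small example. The sketch below uses hypothetical values and `fillna` instead of a row-wise `apply`; it is meant as an equivalent formulation, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: bathrooms is already complete at this point
toy = pd.DataFrame({
    "bathrooms": [1.5, 1.0, 2.0],
    "bedrooms": [np.nan, 2.0, np.nan],
    "beds": [np.nan, np.nan, 3.0],
})

# Step 1: missing bedrooms <- bathrooms rounded up
toy["bedrooms"] = toy["bedrooms"].fillna(np.ceil(toy["bathrooms"]))

# Step 2: missing beds <- bedrooms, which is complete after step 1
toy["beds"] = toy["beds"].fillna(toy["bedrooms"])

print(toy[["bedrooms", "beds"]])
```

If step 2 ran first, the first row would still have a missing `beds` value, because its `bedrooms` value is only filled in by step 1.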


In [130]:
# Check whether host_since has any missing values
missing_host_since = cambridge_listings_detailed_cleaned['host_since'].isna().sum()

missing_host_since

# No host_since imputation is needed because the missing count is 0

0

In [131]:
# Check whether host_location has any missing data (blank strings count as missing)
cambridge_listings_detailed_cleaned['host_location'] = cambridge_listings_detailed_cleaned['host_location'].str.strip()
cambridge_listings_detailed_cleaned['host_location'].replace('', np.nan, inplace=True)
cambridge_listings_detailed_cleaned['host_location'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_location = cambridge_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")
# Number of listings with a missing host_location

Missing values in host_location: 147


In [132]:
# Impute host_location as Unknown if it was initially missing
cambridge_listings_detailed_cleaned['host_location'].fillna('Unknown', inplace=True)

missing_host_location = cambridge_listings_detailed_cleaned['host_location'].isna().sum()
# Verified that the host_location has no missing values
print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 0
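
The strip, replace, and fill steps for `host_location` can be chained into one assignment. This sketch uses hypothetical values and assigns the result back to the column rather than calling `replace(..., inplace=True)` or `fillna(..., inplace=True)` on a selected column, since that chained in-place pattern may not update the DataFrame in newer pandas versions with copy-on-write.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a normal entry, whitespace only, empty, missing, padded
toy = pd.DataFrame({"host_location": ["Cambridge, MA", "  ", "", np.nan, " Boston, MA "]})

toy["host_location"] = (
    toy["host_location"]
    .str.strip()              # trim surrounding whitespace
    .replace("", np.nan)      # blank strings become missing
    .fillna("Unknown")        # label missing values explicitly
)
print(toy["host_location"].tolist())
```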


In [133]:
cambridge_listings_detailed_cleaned['host_is_superhost'] = cambridge_listings_detailed_cleaned['host_is_superhost'].str.strip()
cambridge_listings_detailed_cleaned['host_is_superhost'].replace('', np.nan, inplace=True)
cambridge_listings_detailed_cleaned['host_is_superhost'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_is_superhost = cambridge_listings_detailed_cleaned['host_is_superhost'].isna().sum()
# Check that host_is_superhost has 0 missing values
print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 0


In [134]:
# Imputed host_is_superhost as false if missing
cambridge_listings_detailed_cleaned['host_is_superhost'].fillna('f', inplace=True)

missing_host_is_superhost = cambridge_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 0


In [135]:
missing_host_listings_count = cambridge_listings_detailed_cleaned['host_listings_count'].isna().sum()

print(f"Missing values in host_listings_count: {missing_host_listings_count}")

Missing values in host_listings_count: 0


In [136]:
# Check whether host_total_listings_count has missing values and impute any with 1
missing_host_total_listings_count = cambridge_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

cambridge_listings_detailed_cleaned['host_total_listings_count'].fillna(1, inplace=True)

print("Imputed the data")

missing_host_total_listings_count = cambridge_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

Missing values in host_total_listings_count: 0
Imputed the data
Missing values in host_total_listings_count: 0


In [137]:
# Replace empty verification lists ('[]') with "None", then inspect the category counts
cambridge_listings_detailed_cleaned['host_verifications'] = cambridge_listings_detailed_cleaned['host_verifications'].apply(lambda x: "None" if x == '[]' else x)

print(cambridge_listings_detailed_cleaned['host_verifications'].value_counts())

['email', 'phone']                  643
['email', 'phone', 'work_email']    341
['phone']                            30
['phone', 'work_email']               7
['email']                             5
Name: host_verifications, dtype: int64


In [138]:
# Check on host_identity_verified if has missing values
cambridge_listings_detailed_cleaned['host_identity_verified'] = cambridge_listings_detailed_cleaned['host_identity_verified'].str.strip()
cambridge_listings_detailed_cleaned['host_identity_verified'].replace('', np.nan, inplace=True)
cambridge_listings_detailed_cleaned['host_identity_verified'].replace(' ', np.nan, inplace=True) 

missing_host_identity_verified = cambridge_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")


Missing values in host_identity_verified: 0


In [139]:
cambridge_listings_detailed_cleaned['host_identity_verified'].fillna('f', inplace=True)

missing_host_identity_verified = cambridge_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")

Missing values in host_identity_verified: 0


In [140]:
missing_calculated_host_listings_count = cambridge_listings_detailed_cleaned['calculated_host_listings_count'].isna().sum()

print(f"Missing values in calculated_host_listings_count: {missing_calculated_host_listings_count}")

Missing values in calculated_host_listings_count: 0


In [141]:
def count_amenities(amenities_str):
    try:
        # Convert the string representation of the list back into a list
        amenities_list = ast.literal_eval(amenities_str)
        # Return the count of items in the list
        return len(amenities_list)
    except (ValueError, SyntaxError):
        # In case of any error during conversion, return 0 (or you may choose to return NaN)
        return 0

# Apply the function to each row in the 'amenities' column and create a new column 'amenities_count'
cambridge_listings_detailed_cleaned['amenities_count'] = cambridge_listings_detailed_cleaned['amenities'].apply(count_amenities)

In [142]:
# Drop the neighbourhood column
cambridge_listings_detailed_cleaned.drop(columns=['neighbourhood'], inplace=True)

cambridge_listings_detailed_cleaned.rename(columns={'neighbourhood_cleansed': 'neighborhood'}, inplace=True)


In [143]:
# Dropped additional columns
cambridge_listings_detailed_cleaned.drop(columns=['neighborhood_overview', 'host_about'], inplace=True)
# Rename bathrooms to num_bath
cambridge_listings_detailed_cleaned.rename(columns={'bathrooms': 'num_bath'}, inplace=True)

In [144]:
cambridge_listings_detailed_cleaned.head()

Unnamed: 0,id,name,description,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,8521,SunsplashedSerenity walk to Harvard & Fresh Pond,"An elegant, sun-splashed, 2 bedroom (+2offices...",306681,Janet,2010-12-01,"Cambridge, MA",within a few hours,100%,71%,...,1.0,2.0,2.0,"[""Coffee maker"", ""Baking sheet"", ""Beach essent...",225.0,50,4.68,2,Cambridge,45
1,11169,Lovely Studio Room: Available for long w/ends,Large sunny room w kitchenette & bath. Foam ma...,40965,Judy,2009-09-24,"Cambridge, MA",within a few hours,100%,64%,...,1.0,1.0,1.0,"[""Coffee maker"", ""Coffee"", ""Lockbox"", ""Dining ...",121.0,165,4.76,3,Cambridge,26
2,19581,"Furnished suite, Windsor","Welcome to Area IV! We are located, convenient...",74249,Marc And Patty,2010-01-27,"Cambridge, MA",within an hour,100%,100%,...,1.0,1.0,1.0,"[""Dining table"", ""Dishwasher"", ""Smoke alarm"", ...",205.0,8,4.17,3,Cambridge,46
3,27498,Furnished suite 2 @ the Windsor,"Welcome to Area IV! We are located, convenient...",74249,Marc And Patty,2010-01-27,"Cambridge, MA",within an hour,100%,100%,...,1.0,1.0,1.0,"[""Baking sheet"", ""Dining table"", ""Dishwasher"",...",225.0,20,4.5,3,Cambridge,48
4,79762,Cambridge Getaway @ Harvard & MIT,Charming 2-bedroom apartment on the third floo...,430015,Kevin,2011-03-08,"Cambridge, MA",within a day,100%,60%,...,1.0,2.0,2.0,"[""Coffee maker"", ""Pack \u2019n play/Travel cri...",300.0,385,4.76,1,Cambridge,30


### Data Cleaning on New York City

In [145]:
# The dataframe we should be using for clean-up
nyc_listings_transformed_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,801749842377802394,A home away from home,The whole group will be comfortable in this sp...,,,495455523,Michael,2023-01-10,,,...,2,1 bath,1.0,1.0,"[""50\"" TV"", ""Bathtub"", ""Microwave"", ""Free driv...",143.0,0,,1,New York City
1,765948794133787266,Brooklyn Refuge,Take a break and unwind at this peaceful oasis.,,,488760226,Eric,2022-11-22,,within an hour,...,1,1 shared bath,1.0,1.0,"[""Free parking on premises"", ""Carbon monoxide ...",30.0,13,4.92,2,New York City
2,636274456676328779,Villa Masino.,Close to beach Peaceful walk to park & beach...,,,461263600,Tommaso,2022-05-27,,,...,6,2 baths,2.0,2.0,"[""BBQ grill"", ""Security cameras on property"", ...",157.0,0,,1,New York City
3,768125251187660469,1-Bedroom Private Room with King Size Bed,Private room with king size bedroom near Sheep...,,,475699129,Suliman,2022-08-18,,within an hour,...,2,2 baths,3.0,1.0,"[""Security cameras on property"", ""Keypad"", ""Ca...",89.0,15,5.00,7,New York City
4,49248255,Get the best of both worlds in Riverdale!,Welcome to the greatest location if you desire...,You will find within walking distance the Metr...,,397288055,Katherine,2021-04-16,,within an hour,...,3,1 bath,2.0,2.0,"[""Hangers"", ""Clothing storage: closet"", ""Secur...",125.0,25,4.64,1,New York City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42926,40342902,Cozy room in prime area,,,"The sun, a beach, great books and grilling :)",90429772,Hanna,2016-08-17,"New York, NY",,...,1,1 shared bath,1.0,1.0,"[""Hangers"", ""Kitchen"", ""Dishes and silverware""...",85.0,0,,2,New York City
42927,35257699,Hell's Kitchen /Times Sq - Comfortable 2 BDR Flat,"Location Location. Hell's Kitchen, Times Squar...",,Young professional living in this amazing city..,264962468,Mili,2019-05-29,,within an hour,...,6,1 bath,2.0,3.0,"[""Hangers"", ""Microwave"", ""Dishwasher"", ""Dishes...",208.0,217,4.51,1,New York City
42928,52491515,Cozy 3 bedroom apt in the heart of Lower East ...,This comfortable apartment in the heart of Low...,,,305489297,Stavros,2019-10-30,,within an hour,...,3,1 bath,3.0,3.0,"[""Hangers"", ""Dishes and silverware"", ""Lockbox""...",125.0,7,4.14,6,New York City
42929,48158801,Spacious Loft Space / Photo studio in Bushwick,Looking for a two month subleter. LGBTQ+ and a...,,"Hey everyone, I'm Quentin, a 30 year old filmm...",6600525,Quentin,2013-05-27,"New York, NY",,...,1,2 shared baths,6.0,1.0,"[""Lock on bedroom door"", ""Fire extinguisher"", ...",50.0,0,,1,New York City


In [146]:
# Make a copy of nyc_listings_transformed_wip; nyc_listings_detailed_cleaned is used going forward
nyc_listings_detailed_cleaned = nyc_listings_transformed_wip.copy()

In [147]:
# Conduct initial drop conditions
condition = (
    (nyc_listings_detailed_cleaned['price'].isna() | (nyc_listings_detailed_cleaned['price'] == 0)) |
    (nyc_listings_detailed_cleaned['accommodates'].isna() | (nyc_listings_detailed_cleaned['accommodates'] == 0))
)

# Inverting the condition to keep rows that do not meet the condition
# This gives us all the rows where price is not 0 or missing and accommodates is not 0 or missing
nyc_listings_detailed_cleaned = nyc_listings_detailed_cleaned[~condition]

# Verify with shape: the row count is expected to decrease because some rows matched the drop conditions
nyc_listings_detailed_cleaned.shape

(42904, 33)

In [148]:
# Cleaning bathroom text - this will get renamed as bathrooms. Towards end of the notebook the column is renamed to num_bath

# Use regex to extract the first number found in the bathrooms_text column
nyc_listings_detailed_cleaned['bathrooms'] = nyc_listings_detailed_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

# Handle specific wording cases to see if it is 0.5
half_bath_keywords = ['Half-bath', 'Shared half-bath', 'Private half-bath']
for keyword in half_bath_keywords:
    nyc_listings_detailed_cleaned.loc[nyc_listings_detailed_cleaned['bathrooms_text'].str.contains(keyword, case=False, na=False), 'bathrooms'] = 0.5

# Impute missing data based on bedrooms or beds
# Find indices where bathrooms is NaN
missing_bathrooms_idx = nyc_listings_detailed_cleaned[nyc_listings_detailed_cleaned['bathrooms'].isna()].index

# For rows with missing bathrooms_text, impute with bedrooms value; if bedrooms is also missing/NaN,
# then use beds/2, if beds is also missing then just put the value as 0
for idx in missing_bathrooms_idx:
    if pd.notna(nyc_listings_detailed_cleaned.loc[idx, 'bedrooms']):
        nyc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = nyc_listings_detailed_cleaned.loc[idx, 'bedrooms']
    elif pd.notna(nyc_listings_detailed_cleaned.loc[idx, 'beds']):
        nyc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = nyc_listings_detailed_cleaned.loc[idx, 'beds'] / 2
    else:
        # If both bedrooms and beds are missing, impute the value as 0
        nyc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = 0



# Double-check that bathrooms has 0 missing values after cleaning,
# which confirms that the imputation worked
missing_bathrooms_count = nyc_listings_detailed_cleaned['bathrooms'].isna().sum()
print(f"Missing values in bathrooms: {missing_bathrooms_count}")


Missing values in bathrooms: 0


In [149]:
# Reorganizing the column order for bathrooms
# Drop the 'bathrooms_text' column but first remember its position
bathrooms_text_index = nyc_listings_detailed_cleaned.columns.get_loc('bathrooms_text')
# Drop the column
nyc_listings_detailed_cleaned.drop(columns=['bathrooms_text'], inplace=True)

# Reorder columns to insert 'bathrooms' into the position where 'bathrooms_text' was
columns = list(nyc_listings_detailed_cleaned.columns)
# Remove 'bathrooms' from its current position
columns.remove('bathrooms')
# Insert 'bathrooms' at the position where 'bathrooms_text' used to be
columns.insert(bathrooms_text_index, 'bathrooms')

# Reassign reordered columns to DataFrame
nyc_listings_detailed_cleaned = nyc_listings_detailed_cleaned[columns]

# Verify the shape again
nyc_listings_detailed_cleaned.shape


(42904, 33)

In [150]:
# bedrooms and beds imputation (the order of the code is important for the imputation logic):
# Impute missing values in 'bedrooms' based on 'bathrooms'
nyc_listings_detailed_cleaned['bedrooms'] = nyc_listings_detailed_cleaned.apply(
    lambda row: math.ceil(row['bathrooms']) if pd.isnull(row['bedrooms']) else row['bedrooms'],
    axis=1
)

# Impute missing values in 'beds' based on 'bedrooms'
nyc_listings_detailed_cleaned['beds'] = nyc_listings_detailed_cleaned.apply(
    lambda row: row['bedrooms'] if pd.isnull(row['beds']) else row['beds'],
    axis=1
)



# Verify that bedrooms and beds no longer have any missing values
missing_bedrooms_count = nyc_listings_detailed_cleaned['bedrooms'].isna().sum()
print(f"Missing values in bedrooms: {missing_bedrooms_count}")
missing_beds_count = nyc_listings_detailed_cleaned['beds'].isna().sum()
print(f"Missing values in beds: {missing_beds_count}")





Missing values in bedrooms: 0
Missing values in beds: 0


In [151]:
# Are there any host_since missing values?
missing_host_since = nyc_listings_detailed_cleaned['host_since'].isna().sum()

missing_host_since

5

In [152]:
# Impute missing host_since values with the listing's last_scraped date from the raw table
id_to_last_scraped = nyc_listings_detailed_df.set_index('id')['last_scraped']

nyc_listings_detailed_cleaned['host_since'] = nyc_listings_detailed_cleaned.apply(
    lambda row: id_to_last_scraped[row['id']] if pd.isnull(row['host_since']) else row['host_since'],
    axis=1
)

# Verify that host_since has been imputed and has no missing values
missing_host_since = nyc_listings_detailed_cleaned['host_since'].isna().sum()

print(f"Missing values in host_since: {missing_host_since}")

Missing values in host_since: 0
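
The `host_since` lookup can be sketched with `Series.map` and `fillna`. The two frames below are hypothetical stand-ins for the raw and cleaned NYC tables, and the sketch assumes that `id` is unique in the raw table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the raw and cleaned NYC frames
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "last_scraped": ["2023-03-06", "2023-03-06", "2023-03-07"],
})
cleaned = pd.DataFrame({"id": [1, 3], "host_since": ["2015-01-01", np.nan]})

# Look up each listing's scrape date by id (assumes id is unique in raw)
id_to_last_scraped = raw.set_index("id")["last_scraped"]

# Use the scrape date only where host_since is missing
cleaned["host_since"] = cleaned["host_since"].fillna(cleaned["id"].map(id_to_last_scraped))
print(cleaned["host_since"].tolist())
```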


In [153]:
# Check if there are any missing data for host_location
nyc_listings_detailed_cleaned['host_location'] = nyc_listings_detailed_cleaned['host_location'].str.strip()
nyc_listings_detailed_cleaned['host_location'].replace('', np.nan, inplace=True)
nyc_listings_detailed_cleaned['host_location'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_location = nyc_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")


Missing values in host_location: 9070


In [154]:
# Impute host_location as Unknown if it was initially missing
nyc_listings_detailed_cleaned['host_location'].fillna('Unknown', inplace=True)

missing_host_location = nyc_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 0


In [155]:
# Check whether host_is_superhost has missing values (it does not)
nyc_listings_detailed_cleaned['host_is_superhost'] = nyc_listings_detailed_cleaned['host_is_superhost'].str.strip()
nyc_listings_detailed_cleaned['host_is_superhost'].replace('', np.nan, inplace=True)
nyc_listings_detailed_cleaned['host_is_superhost'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_is_superhost = nyc_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 0


In [156]:
nyc_listings_detailed_cleaned['host_is_superhost'].fillna('f', inplace=True)

missing_host_is_superhost = nyc_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")

Missing values in host_is_superhost: 0


In [157]:
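# Check if host_listings_count is missing
# Note: .str.strip() returns NaN for non-string entries, so in a mixed-type column
# numeric values are also counted as missing below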
nyc_listings_detailed_cleaned['host_listings_count'] = nyc_listings_detailed_cleaned['host_listings_count'].str.strip()
nyc_listings_detailed_cleaned['host_listings_count'].replace('', np.nan, inplace=True)
nyc_listings_detailed_cleaned['host_listings_count'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_listings_count = nyc_listings_detailed_cleaned['host_listings_count'].isna().sum()

print(f"Missing values in host_listings_count: {missing_host_listings_count}")

Missing values in host_listings_count: 10161


In [158]:
# Clean host_listings_count by imputing missing values with 1
nyc_listings_detailed_cleaned['host_listings_count'].fillna(1, inplace=True)

missing_host_listings_count = nyc_listings_detailed_cleaned['host_listings_count'].isna().sum()

# Verified that host_listings_count missingness is now 0
print(f"Missing values in host_listings_count: {missing_host_listings_count}")


Missing values in host_listings_count: 0
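
One caveat about the check above: on an object-dtype column, `.str.strip()` returns NaN for every element that is not a string, so values already stored as numbers are counted as missing. The missing count reported for `host_listings_count` may therefore include valid numeric entries. This is a possibility, not something verified against the data. The sketch below uses made-up values to compare that behaviour with `pd.to_numeric(..., errors="coerce")`, one possible alternative for count columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: an object column mixing numbers, numeric strings and blanks
counts = pd.Series([3, " 2 ", "", None, 1.0], dtype="object")

# The .str accessor yields NaN for the non-string entries 3 and 1.0
via_str = counts.str.strip().replace("", np.nan)
n_missing_str = int(via_str.isna().sum())

# Strip only real strings, then coerce everything to numbers
stripped = counts.map(lambda v: v.strip() if isinstance(v, str) else v)
via_numeric = pd.to_numeric(stripped, errors="coerce")
n_missing_numeric = int(via_numeric.isna().sum())

print(f"Missing after .str.strip(): {n_missing_str}")
print(f"Missing after pd.to_numeric: {n_missing_numeric}")
```

With this approach only blank or absent entries are imputed with 1, and existing counts are kept.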


In [159]:
# Check host_total_listings_count for missing values and impute any with 1
missing_host_total_listings_count = nyc_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

nyc_listings_detailed_cleaned['host_total_listings_count'].fillna(1, inplace=True)

print("Imputed the data")

missing_host_total_listings_count = nyc_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

Missing values in host_total_listings_count: 5
Imputed the data
Missing values in host_total_listings_count: 0


In [160]:
# Relabel empty verification lists ('[]') as "None", then count each host_verifications category
nyc_listings_detailed_cleaned['host_verifications'] = nyc_listings_detailed_cleaned['host_verifications'].apply(lambda x: "None" if x == '[]' else x)

print(nyc_listings_detailed_cleaned['host_verifications'].value_counts())

['email', 'phone']                  33373
['email', 'phone', 'work_email']     5270
['phone']                            4054
['phone', 'work_email']               101
['email']                              69
None                                   31
['email', 'work_email']                 6
Name: host_verifications, dtype: int64


In [161]:
# See if there are any missing values
nyc_listings_detailed_cleaned['host_identity_verified'] = nyc_listings_detailed_cleaned['host_identity_verified'].str.strip()
nyc_listings_detailed_cleaned['host_identity_verified'].replace('', np.nan, inplace=True)
nyc_listings_detailed_cleaned['host_identity_verified'].replace(' ', np.nan, inplace=True) 

missing_host_identity_verified = nyc_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")


# Impute the host_identity_verified
nyc_listings_detailed_cleaned['host_identity_verified'].fillna('f', inplace=True)

missing_host_identity_verified = nyc_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")


Missing values in host_identity_verified: 5
Missing values in host_identity_verified: 0


In [162]:
missing_calculated_host_listings_count = nyc_listings_detailed_cleaned['calculated_host_listings_count'].isna().sum()
# Checked that calculated_host_listings_count had no missing values
print(f"Missing values in calculated_host_listings_count: {missing_calculated_host_listings_count}")

Missing values in calculated_host_listings_count: 0


In [163]:
def count_amenities(amenities_str):
    try:
        # Convert the string representation of the list back into a list
        amenities_list = ast.literal_eval(amenities_str)
        # Return the count of items in the list
        return len(amenities_list)
    except (ValueError, SyntaxError):
        return 0

# Apply the function to each row in the 'amenities' column and create a new column 'amenities_count'
nyc_listings_detailed_cleaned['amenities_count'] = nyc_listings_detailed_cleaned['amenities'].apply(count_amenities)
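
As a quick illustration of how `count_amenities` behaves, the sketch below repeats the function in compact form and runs it on three made-up inputs: a well-formed list string, a string that is not a Python literal, and a missing value. The sample strings are assumptions for demonstration only.

```python
import ast


def count_amenities(amenities_str):
    """Count the items in a stringified list; return 0 if it cannot be parsed."""
    try:
        return len(ast.literal_eval(amenities_str))
    except (ValueError, SyntaxError):
        return 0


# A well-formed list string, including an escaped quote as in '50" TV'
print(count_amenities('["Wifi", "Kitchen", "50\\" TV"]'))

# Unparseable text raises SyntaxError inside literal_eval and is counted as 0
print(count_amenities("not a list"))

# A missing value (NaN) raises ValueError inside literal_eval and is counted as 0
print(count_amenities(float("nan")))
```

Because parse failures and genuinely empty lists both yield 0, the two cases cannot be told apart in `amenities_count`.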



In [164]:
# Drop the neighbourhood column
nyc_listings_detailed_cleaned.drop(columns=['neighbourhood'], inplace=True)

nyc_listings_detailed_cleaned.rename(columns={'neighbourhood_cleansed': 'neighborhood'}, inplace=True)


# Drop additional columns
nyc_listings_detailed_cleaned.drop(columns=['neighborhood_overview', 'host_about'], inplace=True)
# Rename bathrooms to num_bath
nyc_listings_detailed_cleaned.rename(columns={'bathrooms': 'num_bath'}, inplace=True)

In [165]:
nyc_listings_detailed_cleaned.head()

Unnamed: 0,id,name,description,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,801749842377802394,A home away from home,The whole group will be comfortable in this sp...,495455523,Michael,2023-01-10,Unknown,,,,...,1.0,1.0,1.0,"[""50\"" TV"", ""Bathtub"", ""Microwave"", ""Free driv...",143.0,0,,1,New York City,25
1,765948794133787266,Brooklyn Refuge,Take a break and unwind at this peaceful oasis.,488760226,Eric,2022-11-22,Unknown,within an hour,100%,100%,...,1.0,1.0,1.0,"[""Free parking on premises"", ""Carbon monoxide ...",30.0,13,4.92,2,New York City,7
2,636274456676328779,Villa Masino.,Close to beach Peaceful walk to park & beach...,461263600,Tommaso,2022-05-27,Unknown,,,,...,2.0,2.0,2.0,"[""BBQ grill"", ""Security cameras on property"", ...",157.0,0,,1,New York City,3
3,768125251187660469,1-Bedroom Private Room with King Size Bed,Private room with king size bedroom near Sheep...,475699129,Suliman,2022-08-18,Unknown,within an hour,99%,98%,...,2.0,3.0,1.0,"[""Security cameras on property"", ""Keypad"", ""Ca...",89.0,15,5.0,7,New York City,7
4,49248255,Get the best of both worlds in Riverdale!,Welcome to the greatest location if you desire...,397288055,Katherine,2021-04-16,Unknown,within an hour,75%,68%,...,1.0,2.0,2.0,"[""Hangers"", ""Clothing storage: closet"", ""Secur...",125.0,25,4.64,1,New York City,43


### Data Cleaning on Washington DC

In [166]:
# The dataframe we should be using for clean-up
dc_listings_transformed_wip

Unnamed: 0,id,name,description,neighborhood_overview,host_about,host_id,host_name,host_since,host_location,host_response_time,...,accommodates,bathrooms_text,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city
0,22229408,"Explorer's Paradise: Near Train, Bus & Bike Share",Welcome to your perfect urban escape in the Na...,My home is located in the Edgewood neighborhoo...,,120875011,J. P.,2017-03-15,"Washington, DC",,...,2,1 private bath,1.0,1.0,"[""Essentials"", ""Shampoo"", ""Air conditioning"", ...",20.0,20,4.75,1,Washington DC
1,46951758,Boutique Style Home - Rooftop w/Breathtaking V...,Explore or get settled in this artsy modern ro...,"Food & Drinks: Starbucks, Mama's Pizza, Busboy...",,55133178,Quinton,2016-01-18,"Washington, DC",within an hour,...,8,2.5 baths,3.0,3.0,"[""Bathtub"", ""Ethernet connection"", ""Hammock"", ...",185.0,51,4.80,1,Washington DC
2,580379638076900630,Sojourn | Penthouse | Private Outdoor Space | ...,Boutique building in one of DC's best neighbor...,Dupont Circle stands out as a cosmopolitan jew...,"Hi, I'm Nicole, the co-founder of Sojourn. At...",39930655,Team,2015-07-29,"Washington, DC",within an hour,...,4,2 baths,2.0,2.0,"[""Bathtub"", ""Air conditioning"", ""Private entra...",221.0,0,,173,Washington DC
3,594971943284098653,Quaint 1-bedroom apartment with outdoor Patio,Welcome to the center of DC. Half-way between ...,,,351398058,Ladi,2020-06-22,"Washington, DC",within an hour,...,3,1 bath,1.0,2.0,"[""Fire extinguisher"", ""Kitchen"", ""Air conditio...",142.0,11,4.73,3,Washington DC
4,54371126,Waterfront Two Bedroom Apartment In a Brand Ne...,Located in Washington in the District of Colum...,,Hosted by RedAwning Vacation Rentals\n\nWelcom...,395672427,RedAwning,2021-04-05,"Emeryville, CA",,...,5,2 baths,2.0,3.0,"[""Bathtub"", ""Pool"", ""Air conditioning"", ""Priva...",398.0,1,5.00,7,Washington DC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6450,849293688249908869,Top-floor Glover Park apartment,Enjoy this centrally located Washington DC Apa...,,My name is Olena and I live in the upper Georg...,9286163,Olena,2013-10-07,"Washington, DC",within an hour,...,4,1 bath,1.0,1.0,"[""BBQ grill"", ""Fire extinguisher"", ""Dedicated ...",172.0,0,,5,Washington DC
6451,849304474004984024,5 Bedroom 3 Story Townhome in DC,Welcome to your next memorable Airbnb experien...,,I love to travel.,140334352,Beko,2017-07-13,"Washington, DC",within a few hours,...,14,3.5 baths,5.0,7.0,"[""Fire extinguisher"", ""Dedicated workspace"", ""...",480.0,0,,7,Washington DC
6452,849310766288741827,2 br U st Condo with rooftop,This spacious two bedroom condo is walking dis...,,,177084854,NIna,2018-03-07,,,...,4,2 baths,2.0,3.0,"[""BBQ grill"", ""Fire extinguisher"", ""Kitchen"", ...",173.0,0,,1,Washington DC
6453,849479534479786096,cozy Suite on Rhode Island IIII,Come enjoy a private room inside of a two bedr...,,,390256204,Collin,2021-02-26,,within an hour,...,2,1 shared bath,1.0,1.0,"[""Dedicated workspace"", ""Kitchen"", ""Air condit...",82.0,0,,45,Washington DC


In [167]:
# Make a copy of dc_listings_transformed_wip and use the copy going forward
dc_listings_detailed_cleaned = dc_listings_transformed_wip.copy()

In [168]:
# Conduct initial drop conditions
condition = (
    (dc_listings_detailed_cleaned['price'].isna() | (dc_listings_detailed_cleaned['price'] == 0)) |
    (dc_listings_detailed_cleaned['accommodates'].isna() | (dc_listings_detailed_cleaned['accommodates'] == 0))
)

# Inverting the condition to keep rows that do not meet the condition
# This gives us all the rows where price is not 0 or missing and accommodates is not 0 or missing
dc_listings_detailed_cleaned = dc_listings_detailed_cleaned[~condition]

# Verify with shape: the row count should decrease if any rows met the drop conditions
dc_listings_detailed_cleaned.shape


(6453, 33)
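
The drop rule can be tried on a small frame to confirm which rows it removes. The sketch below uses made-up values. It also takes an explicit `.copy()` of the filtered frame, an optional step that keeps later column assignments from triggering pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Toy data: only the first row has a usable price and accommodates value
demo = pd.DataFrame({
    "price": [120.0, 0.0, np.nan, 95.0],
    "accommodates": [2, 4, 1, 0],
})

# A row is invalid if price or accommodates is missing or zero
invalid = (
    demo["price"].isna() | (demo["price"] == 0)
    | demo["accommodates"].isna() | (demo["accommodates"] == 0)
)

# Keep the valid rows as an independent frame
kept = demo[~invalid].copy()
print(f"Dropped {int(invalid.sum())} of {len(demo)} rows")
```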

In [169]:
# Clean bathrooms_text into a numeric bathrooms column (renamed to num_bath near the end of this city's section)

# Use regex to extract the first number found in the bathrooms_text column
dc_listings_detailed_cleaned['bathrooms'] = dc_listings_detailed_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)').astype(float)

# Text-only half-bath descriptions contain no digit, so set those rows to 0.5
half_bath_keywords = ['Half-bath', 'Shared half-bath', 'Private half-bath']
for keyword in half_bath_keywords:
    dc_listings_detailed_cleaned.loc[dc_listings_detailed_cleaned['bathrooms_text'].str.contains(keyword, case=False, na=False), 'bathrooms'] = 0.5

# Impute missing data based on bedrooms or beds
# Find indices where bathrooms is NaN
missing_bathrooms_idx = dc_listings_detailed_cleaned[dc_listings_detailed_cleaned['bathrooms'].isna()].index

# For rows with missing bathrooms_text, impute with bedrooms value; if bedrooms is also missing/NaN,
# then use beds/2, if beds is also missing then just put the value as 0
for idx in missing_bathrooms_idx:
    if pd.notna(dc_listings_detailed_cleaned.loc[idx, 'bedrooms']):
        dc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = dc_listings_detailed_cleaned.loc[idx, 'bedrooms']
    elif pd.notna(dc_listings_detailed_cleaned.loc[idx, 'beds']):
        dc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = dc_listings_detailed_cleaned.loc[idx, 'beds'] / 2
    else:
        # If both bedrooms and beds are missing, impute the value as 0
        dc_listings_detailed_cleaned.loc[idx, 'bathrooms'] = 0




# Confirm that bathrooms has no missing values after cleaning
missing_bathrooms_count = dc_listings_detailed_cleaned['bathrooms'].isna().sum()
print(f"Missing values in bathrooms: {missing_bathrooms_count}")


Missing values in bathrooms: 0
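
The same bathrooms logic (extract the first number, set half-bath wording to 0.5, then fall back to bedrooms, beds / 2, and finally 0) can also be written without a row-by-row loop. The sketch below shows one way to do this with chained `fillna` calls on made-up rows. It is an alternative formulation, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

# Toy rows covering each branch of the fallback logic
demo = pd.DataFrame({
    "bathrooms_text": ["1 private bath", "2.5 baths", "Shared half-bath", None, None, None],
    "bedrooms": [1.0, 3.0, 1.0, 2.0, np.nan, np.nan],
    "beds": [1.0, 3.0, 1.0, 2.0, 3.0, np.nan],
})

text = demo["bathrooms_text"]

# First number in the text, or NaN when there is none
baths = text.str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

# Any wording containing "half-bath" (case-insensitive) counts as 0.5
baths = baths.mask(text.str.contains("half-bath", case=False, na=False), 0.5)

# Fallback chain: bedrooms, then beds / 2, then 0
demo["bathrooms"] = baths.fillna(demo["bedrooms"]).fillna(demo["beds"] / 2).fillna(0)
print(demo["bathrooms"].tolist())
```

Each `fillna` only touches values that are still missing, so the order of the chain mirrors the `if` / `elif` / `else` order of the loop.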


In [170]:
# Reorganizing the column order for bathrooms
# Drop the 'bathrooms_text' column but first remember its position
bathrooms_text_index = dc_listings_detailed_cleaned.columns.get_loc('bathrooms_text')
# Drop the column
dc_listings_detailed_cleaned.drop(columns=['bathrooms_text'], inplace=True)

# Reorder columns to insert 'bathrooms' into the position where 'bathrooms_text' was
columns = list(dc_listings_detailed_cleaned.columns)
# Remove 'bathrooms' from its current position
columns.remove('bathrooms')
# Insert 'bathrooms' at the position where 'bathrooms_text' used to be
columns.insert(bathrooms_text_index, 'bathrooms')

# Reassign reordered columns to DataFrame
dc_listings_detailed_cleaned = dc_listings_detailed_cleaned[columns]

# Verify the shape again
dc_listings_detailed_cleaned.shape


(6453, 33)
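
A shorter way to put `bathrooms` where `bathrooms_text` used to be is `DataFrame.insert` combined with `DataFrame.pop`. The sketch below demonstrates this on a toy frame. The column names match the notebook but the data is made up.

```python
import pandas as pd

# Toy frame with bathrooms appended as the last column
demo = pd.DataFrame({
    "id": [1, 2],
    "bathrooms_text": ["1 bath", "2 baths"],
    "bedrooms": [1.0, 2.0],
    "bathrooms": [1.0, 2.0],
})

# Remember where bathrooms_text sits, then drop it
position = demo.columns.get_loc("bathrooms_text")
demo = demo.drop(columns=["bathrooms_text"])

# pop() removes bathrooms from the end; insert() places it at the saved position
demo.insert(position, "bathrooms", demo.pop("bathrooms"))
print(list(demo.columns))
```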

In [171]:
# bedrooms and beds imputation (the order of the code is important for the imputation logic):
# Impute missing values in 'bedrooms' based on 'bathrooms'
dc_listings_detailed_cleaned['bedrooms'] = dc_listings_detailed_cleaned.apply(
    lambda row: math.ceil(row['bathrooms']) if pd.isnull(row['bedrooms']) else row['bedrooms'],
    axis=1
)

# Impute missing values in 'beds' based on 'bedrooms'
dc_listings_detailed_cleaned['beds'] = dc_listings_detailed_cleaned.apply(
    lambda row: row['bedrooms'] if pd.isnull(row['beds']) else row['beds'],
    axis=1
)



# Verify that bedrooms and beds no longer have any missing values
missing_bedrooms_count = dc_listings_detailed_cleaned['bedrooms'].isna().sum()
print(f"Missing values in bedrooms: {missing_bedrooms_count}")
missing_beds_count = dc_listings_detailed_cleaned['beds'].isna().sum()
print(f"Missing values in beds: {missing_beds_count}")





Missing values in bedrooms: 0
Missing values in beds: 0
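
The two row-wise `apply` calls can be expressed as column operations. The sketch below reproduces the same rules on made-up data, assuming `bathrooms` has already been filled: missing `bedrooms` take the ceiling of `bathrooms`, and missing `beds` then take the `bedrooms` value.

```python
import numpy as np
import pandas as pd

# Toy data; bathrooms is assumed to be complete at this point
demo = pd.DataFrame({
    "bathrooms": [1.5, 2.0, 0.5],
    "bedrooms": [np.nan, 3.0, np.nan],
    "beds": [np.nan, np.nan, 2.0],
})

# Order matters: bedrooms must be filled before beds can fall back to it
demo["bedrooms"] = demo["bedrooms"].fillna(np.ceil(demo["bathrooms"]))
demo["beds"] = demo["beds"].fillna(demo["bedrooms"])

print(demo[["bedrooms", "beds"]].values.tolist())
```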


In [172]:
# Count missing host_since values; any gaps would be imputed from the most recent web-scrape date
missing_host_since = dc_listings_detailed_cleaned['host_since'].isna().sum()

missing_host_since

## Skip code for host_since as missingness is 0

0

In [173]:
# Check if there are any missing data for host_location
dc_listings_detailed_cleaned['host_location'] = dc_listings_detailed_cleaned['host_location'].str.strip()
dc_listings_detailed_cleaned['host_location'].replace('', np.nan, inplace=True)
dc_listings_detailed_cleaned['host_location'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_location = dc_listings_detailed_cleaned['host_location'].isna().sum()

print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 1139


In [174]:
# Impute host_location as Unknown if it was initially missing

dc_listings_detailed_cleaned['host_location'].fillna('Unknown', inplace=True)

missing_host_location = dc_listings_detailed_cleaned['host_location'].isna().sum()
# Verified that the host_location has no missing values
print(f"Missing values in host_location: {missing_host_location}")

Missing values in host_location: 0


In [175]:


dc_listings_detailed_cleaned['host_is_superhost'] = dc_listings_detailed_cleaned['host_is_superhost'].str.strip()
dc_listings_detailed_cleaned['host_is_superhost'].replace('', np.nan, inplace=True)
dc_listings_detailed_cleaned['host_is_superhost'].replace(' ', np.nan, inplace=True)  # In case of single space strings

missing_host_is_superhost = dc_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")


# Verified that host_is_superhost had no missing values

dc_listings_detailed_cleaned['host_is_superhost'].fillna('f', inplace=True)

missing_host_is_superhost = dc_listings_detailed_cleaned['host_is_superhost'].isna().sum()

print(f"Missing values in host_is_superhost: {missing_host_is_superhost}")


Missing values in host_is_superhost: 0
Missing values in host_is_superhost: 0


In [176]:

missing_host_listings_count = dc_listings_detailed_cleaned['host_listings_count'].isna().sum()
# Check the number of missing values in host_listings_count
print(f"Missing values in host_listings_count: {missing_host_listings_count}")

Missing values in host_listings_count: 0


In [177]:
# Check host_total_listings_count for missing values and impute any with 1
missing_host_total_listings_count = dc_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")

dc_listings_detailed_cleaned['host_total_listings_count'].fillna(1, inplace=True)

print("Imputed the data")

missing_host_total_listings_count = dc_listings_detailed_cleaned['host_total_listings_count'].isna().sum()

print(f"Missing values in host_total_listings_count: {missing_host_total_listings_count}")


Missing values in host_total_listings_count: 0
Imputed the data
Missing values in host_total_listings_count: 0


In [178]:

# Relabel empty verification lists ('[]') as "None", then count each category (DC has no empty lists)
dc_listings_detailed_cleaned['host_verifications'] = dc_listings_detailed_cleaned['host_verifications'].apply(lambda x: "None" if x == '[]' else x)

print(dc_listings_detailed_cleaned['host_verifications'].value_counts())


['email', 'phone']                  4511
['email', 'phone', 'work_email']    1534
['phone']                            387
['phone', 'work_email']               20
['email']                              1
Name: host_verifications, dtype: int64


In [179]:

# See if there are any missing values
dc_listings_detailed_cleaned['host_identity_verified'] = dc_listings_detailed_cleaned['host_identity_verified'].str.strip()
dc_listings_detailed_cleaned['host_identity_verified'].replace('', np.nan, inplace=True)
dc_listings_detailed_cleaned['host_identity_verified'].replace(' ', np.nan, inplace=True) 

missing_host_identity_verified = dc_listings_detailed_cleaned['host_identity_verified'].isna().sum()

print(f"Missing values in host_identity_verified: {missing_host_identity_verified}")



missing_calculated_host_listings_count = dc_listings_detailed_cleaned['calculated_host_listings_count'].isna().sum()

print(f"Missing values in calculated_host_listings_count: {missing_calculated_host_listings_count}")

Missing values in host_identity_verified: 0
Missing values in calculated_host_listings_count: 0


In [180]:


def count_amenities(amenities_str):
    try:
        # Convert the string representation of the list back into a list
        amenities_list = ast.literal_eval(amenities_str)
        # Return the count of items in the list
        return len(amenities_list)
    except (ValueError, SyntaxError):
        return 0

# Apply the function to each row in the 'amenities' column and create a new column 'amenities_count'
dc_listings_detailed_cleaned['amenities_count'] = dc_listings_detailed_cleaned['amenities'].apply(count_amenities)


In [181]:

# Drop the neighbourhood column
dc_listings_detailed_cleaned.drop(columns=['neighbourhood'], inplace=True)

dc_listings_detailed_cleaned.rename(columns={'neighbourhood_cleansed': 'neighborhood'}, inplace=True)

In [182]:
# Drop additional columns
dc_listings_detailed_cleaned.drop(columns=['neighborhood_overview', 'host_about'], inplace=True)
# Rename bathrooms to num_bath
dc_listings_detailed_cleaned.rename(columns={'bathrooms': 'num_bath'}, inplace=True)

In [183]:
dc_listings_detailed_cleaned.head()

Unnamed: 0,id,name,description,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,22229408,"Explorer's Paradise: Near Train, Bus & Bike Share",Welcome to your perfect urban escape in the Na...,120875011,J. P.,2017-03-15,"Washington, DC",,,,...,1.0,1.0,1.0,"[""Essentials"", ""Shampoo"", ""Air conditioning"", ...",20.0,20,4.75,1,Washington DC,13
1,46951758,Boutique Style Home - Rooftop w/Breathtaking V...,Explore or get settled in this artsy modern ro...,55133178,Quinton,2016-01-18,"Washington, DC",within an hour,100%,100%,...,2.5,3.0,3.0,"[""Bathtub"", ""Ethernet connection"", ""Hammock"", ...",185.0,51,4.8,1,Washington DC,57
2,580379638076900630,Sojourn | Penthouse | Private Outdoor Space | ...,Boutique building in one of DC's best neighbor...,39930655,Team,2015-07-29,"Washington, DC",within an hour,99%,100%,...,2.0,2.0,2.0,"[""Bathtub"", ""Air conditioning"", ""Private entra...",221.0,0,,173,Washington DC,39
3,594971943284098653,Quaint 1-bedroom apartment with outdoor Patio,Welcome to the center of DC. Half-way between ...,351398058,Ladi,2020-06-22,"Washington, DC",within an hour,100%,100%,...,1.0,1.0,2.0,"[""Fire extinguisher"", ""Kitchen"", ""Air conditio...",142.0,11,4.73,3,Washington DC,8
4,54371126,Waterfront Two Bedroom Apartment In a Brand Ne...,Located in Washington in the District of Colum...,395672427,RedAwning,2021-04-05,"Emeryville, CA",,,,...,2.0,2.0,3.0,"[""Bathtub"", ""Pool"", ""Air conditioning"", ""Priva...",398.0,1,5.0,7,Washington DC,31


#### Stack all of the dataframes together to make Eastern US Cities Cleaned Dataframe

In [184]:

dfs = [
    broward_listings_detailed_cleaned,
    jersey_listings_detailed_cleaned,
    cambridge_listings_detailed_cleaned,
    nyc_listings_detailed_cleaned,
    dc_listings_detailed_cleaned
]

# Concatenate all DataFrames
all_listings_detailed_cleaned = pd.concat(dfs, ignore_index=True)


In [185]:
all_listings_detailed_cleaned.shape

(68592, 31)
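
Before stacking, it can help to confirm that every city's frame has the same columns, since `pd.concat` fills any column missing from a frame with NaN. The sketch below shows such a check on two made-up frames. The dictionary keys and sample values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the per-city cleaned frames
frames = {
    "city_a": pd.DataFrame({"id": [1, 2], "price": [100.0, 80.0], "city": ["A", "A"]}),
    "city_b": pd.DataFrame({"id": [3], "price": [120.0], "city": ["B"]}),
}

# Every frame should have the same columns in the same order
reference_cols = list(next(iter(frames.values())).columns)
for name, frame in frames.items():
    assert list(frame.columns) == reference_cols, f"{name} has different columns"

# Stack the frames and reset the index
stacked = pd.concat(list(frames.values()), ignore_index=True)

# The stacked row count should equal the sum of the parts
assert len(stacked) == sum(len(frame) for frame in frames.values())
print(stacked["city"].value_counts().to_dict())
```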

In [186]:
all_listings_detailed_cleaned.drop(columns=['description'], inplace=True)

In [187]:
all_listings_detailed_cleaned.head()

Unnamed: 0,id,name,host_id,host_name,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,num_bath,bedrooms,beds,amenities,price,number_of_reviews,review_scores_value,calculated_host_listings_count,city,amenities_count
0,827736378366911479,Legion 1BR/1BA,475630606,Sean,2022-08-18,Unknown,within an hour,100%,94%,f,...,1.0,1.0,1.0,"[""Air conditioning"", ""Free parking on premises...",222.0,0,,1,Broward County,10
1,592589963829194972,Club Wyndham Royal Vista,66506549,Ryan,2016-04-09,"Alpharetta, GA",within an hour,98%,16%,f,...,2.0,2.0,4.0,"[""TV"", ""Paid parking on premises"", ""Indoor fir...",500.0,0,,5,Broward County,29
2,772438920837360569,Relaxing 5 Acre Ranch home with private pond!,382318476,Maggie,2020-12-30,Unknown,within an hour,100%,89%,t,...,3.0,4.0,6.0,"[""Air conditioning"", ""Free parking on premises...",500.0,2,5.0,1,Broward County,14
3,33271346,Beach Escape – One Block from the Beach!,118856968,Steve And Jo,2017-03-02,"Fort Lauderdale, FL",within an hour,100%,100%,f,...,2.0,2.0,4.0,"[""TV"", ""Hair dryer"", ""Essentials"", ""Wifi"", ""Ha...",186.0,129,4.68,3,Broward County,22
4,484515,MIAMI- AMAZING APARTMENT OVER BEACH,637272,Bianca,2011-05-28,"Buenos Aires, Argentina",within an hour,95%,26%,f,...,2.0,2.0,5.0,"[""Air conditioning"", ""Hangers"", ""Free parking ...",297.0,27,4.44,6,Broward County,17


#### Final Count on Missingness for all Eastern US Cities

In [188]:
# Final Count:
all_listings_detailed_cleaned.isnull().sum()

id                                    0
name                                  0
host_id                               0
host_name                             0
host_since                            0
host_location                         0
host_response_time                    0
host_response_rate                    0
host_acceptance_rate                  0
host_is_superhost                     0
host_listings_count                   0
host_total_listings_count             0
host_verifications                    0
host_has_profile_pic                  0
host_identity_verified                0
neighborhood                          0
latitude                              0
longitude                             0
room_type                             0
accommodates                          0
num_bath                              0
bedrooms                              0
beds                                  0
amenities                             0
price                                 0


In [183]:
# # Save a copy of the joined data
# all_listings_detailed_cleaned.to_csv("east_coast_cleaned.csv", index=False)

# print("DataFrame has been successfully written to CSV file.")

DataFrame has been successfully written to CSV file.
