## INFO 2950 Final Project
Larrisa Chen (lc949), Michelle Li (myl39), Christina Jin (cej65), Jade Eggleston (jce76)

### Research question
***Do the criteria for a successful Airbnb differ in U.S. regions?***

Where success is defined as:
- high number of bookings combined with high listing rating

and criteria are defined by:
- Price
- Number of Beds
- Number of Baths
- Host ratings
- Number of reviews 
- Private or Public
- Proximity to urban center (most popular neighborhood)
- Neighborhood
- Keywords in names
- Keywords in description
- Host response rate / time
- Room type
- Amenities (binary)
    - varies by region

The list of amenities is scraped from each of the cities listed below and aggregated to find the top 15 in each region. 

Ultimately, we want to deduce the attributes that contribute to a successful Airbnb listing and compare these listings across different regions in the U.S. (Northeast, Southeast, West, Northwest, and Midwest).

**Revised question: How do US regions differ in the demographics and lifestyles of the consumers they attract?**


### Data Origin/Description

Our datasets cover all recorded Airbnb listings within the largest cities of 9 US regions. We compiled Airbnb rental data for Boston, NYC, Chicago, the Twin Cities, DC, Nashville, Dallas, Las Vegas (represented by Clark County), and Los Angeles, measured across 26 criteria that US consumers consider when choosing an Airbnb. By analyzing the average values for each criterion, we hope to identify the demographics and lifestyles of the Airbnb consumers that each US region attracts and to determine key differences among them. We believe this data could be useful for creating target consumer profiles that Airbnb hosts can account for when providing their services; our analyses could also have further implications for tourism services in each major city.

Our datasets come from the Inside Airbnb database, an independent project run by data activists to provide information about Airbnb’s impact on residential communities. Because this information is collected directly from the Airbnb platform, it is directly observable on the website. It also holds some identifiable information about Airbnb hosts, who have consented to publicly displaying their data. For our project, however, we took a smaller sample of data from the database and cleaned/merged it according to our needs (explained in the Data Collection & Cleaning section).

We cleaned/merged multiple datasets so that each city has a single dataset containing all relevant information (a total of 9 datasets for our project). Every value within each dataset is a quantifiable number, price, percentage, date, review score, word description, or boolean: 8 criteria hold quantifiable numbers, 1 holds a price, 2 hold percentages, 2 hold dates, 1 holds a review score, 7 hold word descriptions, and 4 hold booleans. The data of each instance is raw data?? As further explained in the Data Collection & Cleaning section, some values were missing under certain criteria due to lack of available data. This occurs because Airbnb listings vary in whether they include optional information (such as host descriptions or amenities).

Each of our datasets is defined by a city and includes all the available listings within that city and their criteria. Each listing is identified by its unique numeric ID, display name, and the informational description shown on the website’s listing. The first few columns describe the Airbnb host: when they became a host (host_since); their personal background (host_about); their response rate and response time to customer queries; their acceptance rate of new tenants (host_acceptance_rate); whether they are a host with outstanding experience (host_is_superhost); the number of listings they own (host_listings_count); whether they have a public profile picture (host_has_profile_pic); and whether they are verified on Airbnb (host_identity_verified). The next columns describe the logistical aspects of the unit: the neighborhood it is located in (neighbourhood_cleansed); the type of room (room_type); how many people it can house (accommodates); the number of bathrooms (bathrooms_text), bedrooms, and beds; the amenities available; the price per night (price); and the minimum and maximum number of nights it can be booked (minimum_nights and maximum_nights). The next columns describe customer reviews for the unit: the number of reviews, the date of the most recently posted review (last_review), and the review rating out of 5 (review_scores_rating). The final criterion indicates whether the listing can be booked without the host’s approval (instant_bookable).

Since our datasets could include identifiable information about the Airbnb hosts (within the listing description and host description), there is a risk of this information being used to draw unfounded analyses that discriminate against minority-identifying Airbnb hosts. Hence, our dataset should not be used to profile Airbnb hosts based on their demographics.


### Data Collection & Cleaning
**Collection**:
We began the data collection process by identifying the cities with the largest inbound tourism within each US region.  After compiling the set list of cities we wanted to analyze, we found a large database containing detailed quarterly datasets of Airbnb listings in cities around the globe. We ended up having to replace a few of the cities on our list with alternatives (discussed in Data Limitations section) due to lack of available data. Using the Airbnb database, we downloaded each selected city’s relevant datasets that contained information about the Airbnb listings, calendar data, reviews, and neighborhood data.  

**Cleaning**:
We began the data cleaning process by tackling the listings datasets, as they contained most of the information we needed. Since many of the cells had missing values and were of type object, we used the notna function to keep only the rows with non-missing values in each of our selected columns (host_about, host_response_time, host_response_rate, host_acceptance_rate, last_review, review_scores_rating, description, beds, bedrooms, bathrooms_text). So, any row that was missing information in any of the above columns was removed.
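
The chained notna filters used for each city can also be expressed as a single dropna call over the same columns. The following is a minimal sketch, assuming pandas is installed; the toy DataFrame and the shortened column list are made up for illustration and are not part of our data.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads listings/<city>_listings.csv.
toy_df = pd.DataFrame({
    "host_about": ["I love hosting", None, "Local guide"],
    "beds": [1.0, 2.0, None],
    "price": ["$100.00", "$80.00", "$95.00"],
})

# Columns that must be present for a row to be kept
# (a shortened stand-in for the ten columns listed above).
required_columns = ["host_about", "beds"]

# Keeps the same rows as filtering with notna() on each column in turn.
cleaned_df = toy_df.dropna(subset=required_columns)
print(cleaned_df.shape)
```

Only the first toy row survives, because each of the other two rows is missing a value in one of the required columns.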

Many of the Airbnbs’ names, host descriptions, and listing descriptions are written in various decorative fonts and contain emoji. However, we needed names and descriptions to be usable when scraping for common words. Hence, we created the is_string helper function, which runs through the characters of each string in these columns and returns True as soon as it finds one within the ASCII ranges ‘A’-‘Z’, ‘a’-‘z’, or ‘0’-‘9’. Rows whose name, host description, or listing description contains no such character (for example, text made up entirely of emoji or special font characters) were removed.
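
As written in the code cell below, is_string returns True as soon as it meets one ASCII letter or digit, so it tests for "at least one alphanumeric character" rather than "every character is alphanumeric". The sketch below contrasts the two checks on a few made-up strings; both function names are hypothetical and are not used elsewhere in the notebook.

```python
def _is_ascii_alnum(ch):
    # Same character ranges as the notebook's is_string helper.
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z") or ("0" <= ch <= "9")

def has_ascii_alphanumeric(text):
    """True if at least one character is an ASCII letter or digit."""
    return any(_is_ascii_alnum(ch) for ch in text)

def is_all_ascii_alphanumeric(text):
    """Stricter variant: every character must be an ASCII letter or digit."""
    return text != "" and all(_is_ascii_alnum(ch) for ch in text)

# Made-up sample names for illustration.
samples = ["Cozy loft", "\U0001F3E0\u2728", "Loft2"]
for s in samples:
    print(repr(s), has_ascii_alphanumeric(s), is_all_ascii_alphanumeric(s))
```

The stricter variant rejects "Cozy loft" because of the space, which is why a check of that kind would discard almost every real listing name.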

For the following columns, we needed to convert the object types in our data into more exact data types that are useful for our data analysis and exploration. Since host_response_rate and host_acceptance_rate were formatted as percentage strings, they were read in as type object. Hence, we created the percent_to_float helper function to remove the percentage symbol and convert the values in these two columns into float fractions (for example, 95% becomes 0.95). Similarly, the price column included a dollar sign and thousands separators in its values, so it was also read in as type object. Hence, we created the dollar_to_float helper function to remove the dollar symbol and commas and convert the values in this column into floats.
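
To make the two conversions concrete, here is a small standalone sketch that applies the same logic as percent_to_float and dollar_to_float to made-up input strings (the sample values are illustrative, not taken from our data).

```python
def percent_to_float(value):
    # Strip the trailing percent sign and rescale to a fraction.
    return float(value.strip("%")) / 100.0

def dollar_to_float(value):
    # Drop thousands separators and the dollar sign before converting.
    return float(value.replace(",", "").replace("$", ""))

print(percent_to_float("100%"))
print(dollar_to_float("$1,250.00"))
```

Both helpers expect a string, so they are applied only after rows with missing values in these columns have been removed.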

Moreover, the values under columns instant_bookable, host_identity_verified, host_has_profile_pic, and host_is_superhost were either “t” or “f”, which were all read in as type object. Hence, we created the str_to_bool helper function to convert these values into booleans: “t” becomes True and “f” becomes False.
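
Note that str_to_bool, as defined in the code cell below, treats every value other than “f” as True. If unexpected or missing values should be kept distinguishable, one possible alternative is an explicit lookup table. This is only a sketch; FLAG_TO_BOOL and flag_to_bool are hypothetical names that are not part of our notebook.

```python
# Explicit mapping from the flag strings to booleans.
FLAG_TO_BOOL = {"t": True, "f": False}

def flag_to_bool(value):
    # Returns None for anything other than "t" or "f" (e.g. an empty string).
    return FLAG_TO_BOOL.get(value)

print(flag_to_bool("t"), flag_to_bool("f"), flag_to_bool(""))
```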

The host_response_time column held the categorical values “within an hour,” “within a few hours,” “within a day,” “a few days or more,” or “None”. To convert these values into an exact type, we assigned each category an int value in the range 0-4, where 0 indicates the shortest response time and 4 the longest. We created the host_response_time_to_int helper function to change “within an hour” to 0, “within a few hours” to 1, “within a day” to 2, “a few days or more” to 3, and “None” to 4. This makes it easier to quantify and compare how long the host response takes.

Each value in the amenities column was a single string, but we needed to turn these strings into lists of strings in order to scrape and identify common amenities among the listings. Hence, we created the amenities_to_list helper function, which uses Python’s json library to parse each JSON-formatted string into a list, and applied it to the amenities column.
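
Once the amenities are parsed into lists, counting the most common ones is straightforward. The sketch below shows one way this could be done with the standard library; the three sample strings are made up, and the top 2 stands in for the top 15 we compute per region.

```python
import json
from collections import Counter

# Made-up amenities strings in the same JSON-list format as the raw column.
raw_amenities = [
    '["Wifi", "Kitchen", "Heating"]',
    '["Wifi", "Free parking on premises"]',
    '["Wifi", "Kitchen"]',
]

# Parse each string into a Python list of amenity names.
parsed = [json.loads(s) for s in raw_amenities]

# Count how many listings offer each amenity and keep the most frequent ones.
counts = Counter(amenity for amenities in parsed for amenity in amenities)
top_two = counts.most_common(2)
print(top_two)
```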

The values under columns host_since and last_review were of type object. We wanted to convert these values into datetime format to help with our visualizations, so we used the pandas to_datetime function to convert them to type datetime.

Next, we wanted to create a privacy criterion for each listing based on whether guests have to use a shared space. Although the room_type column labels some listings as private rooms, a listing could still have a shared bathroom, making it not fully private. To resolve this, we decided that if bathrooms_text contains the word “shared,” the listing is considered a shared space. To implement this, we first created the room_type_to_bool helper function, which returns False if the room_type value is “Shared room” and True otherwise, and stored its results in a new column called is_private_room. Next, we followed the same process for bathrooms_text, creating the helper function bathrooms_text_to_bool, which returns False if the value contains “shared” and True otherwise, and stored its results in a new column called is_private_bath. Finally, we created a new column called is_private_overall, which is True only when both is_private_room and is_private_bath are True.
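
The sketch below walks through the privacy rule on a few made-up listings, following the behavior of the helper functions in the code cell below (a shared room or a shared bathroom makes the listing non-private). The function names room_is_private and bath_is_private are hypothetical, and the lowercase comparison is a small simplification of the original check.

```python
def room_is_private(room_type):
    # Only listings of type "Shared room" count as a non-private room.
    return room_type != "Shared room"

def bath_is_private(bathrooms_text):
    # Any mention of "shared" in the bathroom description counts as non-private.
    return "shared" not in str(bathrooms_text).lower()

# Made-up (room_type, bathrooms_text) pairs for illustration.
listings = [
    ("Entire home/apt", "1 bath"),
    ("Private room", "1 shared bath"),
    ("Shared room", "1.5 shared baths"),
    ("Private room", "1 private bath"),
]

results = []
for room_type, bathrooms_text in listings:
    # A listing is private overall only if both the room and the bath are private.
    overall = room_is_private(room_type) and bath_is_private(bathrooms_text)
    results.append(overall)
    print(room_type, "|", bathrooms_text, "->", overall)
```

The second listing shows the case that motivated the rule: a room labeled private that still comes with a shared bathroom.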

In addition, we wanted to turn the bathrooms_text column values (1 bathroom, 1.5 bathrooms, or half-bath) into numeric values to make visualizations easier. We created the bathroom_text_to_float helper function, which returns 0.5 if the bathrooms_text value contains “half-bath”; otherwise, it uses a regular expression to extract and return only the number. The resulting values were then added to a new column called baths.
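
A minimal sketch of this parsing step is shown below, using a single regular expression that accepts both whole and decimal bathroom counts. The name parse_bath_count and the sample strings are assumptions for illustration; unlike the notebook's helper, this version always returns a float (or None when no number is found).

```python
import re

def parse_bath_count(bathrooms_text):
    text = bathrooms_text.lower()
    # Half-baths carry no digit, so they are handled before the regex.
    if "half-bath" in text:
        return 0.5
    # Match an integer with an optional decimal part, e.g. "1" or "1.5".
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

for sample in ["1 bath", "1.5 shared baths", "Shared half-bath"]:
    print(sample, "->", parse_bath_count(sample))
```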

After finishing the cleaning in the listings data sets, we began cleaning the calendar data sets. We used the drop function to remove the price and adjusted_price columns since we already had the same information in our listings data sets. 

Additionally, we converted the date column values to type datetime and the available column values to boolean using the str_to_bool helper function.

Because the data sets for each city were similar in data format in terms of the columns and the types we wanted the columns to be, we used the same process for data cleaning for every city. 
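
Because the steps are identical for every city, they could also be wrapped in one function and applied in a loop instead of being repeated per city. The sketch below is a reduced illustration of that idea, assuming pandas is installed; clean_listings, raw_by_city, and the toy data are hypothetical, and only three of the cleaning steps are shown.

```python
import pandas as pd

def clean_listings(df):
    """Apply a reduced version of the shared cleaning steps to one city."""
    # Drop rows with missing values in the columns we convert below.
    df = df.dropna(subset=["host_response_rate", "price"]).copy()
    # "95%" -> 0.95
    df["host_response_rate"] = (
        df["host_response_rate"].str.rstrip("%").astype(float) / 100.0
    )
    # "$1,200.00" -> 1200.0
    df["price"] = df["price"].str.replace("[$,]", "", regex=True).astype(float)
    # "t"/"f" -> True/False
    df["instant_bookable"] = df["instant_bookable"].map({"t": True, "f": False})
    return df

# Toy stand-ins; in the notebook each frame would instead come from
# pd.read_csv("listings/<city>_listings.csv").
raw_by_city = {
    "nyc": pd.DataFrame({
        "host_response_rate": ["100%", None],
        "price": ["$1,200.00", "$80.00"],
        "instant_bookable": ["t", "f"],
    }),
    "bos": pd.DataFrame({
        "host_response_rate": ["90%", "50%"],
        "price": ["$75.00", "$60.00"],
        "instant_bookable": ["f", "t"],
    }),
}

# Apply the same cleaning function to every city.
cleaned_by_city = {city: clean_listings(df) for city, df in raw_by_city.items()}
for city, df in cleaned_by_city.items():
    print(city, df.shape)
```

Keeping the steps in one function means a fix to any helper only has to be made once rather than in every city's cell.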

In [1]:
import numpy as np
import pandas as pd
import re
import json

In [2]:
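def percent_to_float(x):
    # "95%" -> 0.95
    return float(x.strip("%"))/100.0

def dollar_to_float(x):
    # "$1,250.00" -> 1250.0
    x = x.replace(",", "")
    x = x.replace("$", "")
    return float(x)

def str_to_bool(x):
    # "f" -> False; every other value (including "t") -> True
    if(x == "f"):
        return False
    else:
        return True

def room_type_to_bool(x):
    # True unless the listing is a shared room
    if(x == "Shared room"):
        return False
    return True

def bathrooms_text_to_bool(x):
    # True unless the bathroom description mentions "shared"
    if(str(x).find("shared") != -1 or str(x).find("Shared") != -1):
        return False
    return True

def amenities_to_list(x):
    # Parse the JSON-formatted amenities string into a list of strings
    return json.loads(x)

def host_response_time_to_int(x):
    # Ordinal encoding: 0 is the fastest response, 4 the slowest
    if(x == "within an hour"):
        return 0
    elif(x == "within a few hours"):
        return 1
    elif(x == "within a day"):
        return 2
    elif(x == "a few days or more"):
        return 3
    elif(x == "None"):
        return 4

def bathroom_text_to_float(x):
    # Half-baths contain no digit, so handle them first
    if(x.find("half-bath") != -1 or x.find("Half-bath") != -1):
        return 0.5
    # Decimal counts such as "1.5 baths" (the pattern needs \d, not a literal d)
    match = re.search(r"\d+\.\d+", x)
    if(match != None):
        return float(match.group())
    # Whole-number counts such as "2 baths"
    return int(re.search(r"\d+", x).group())

def is_string(x):
    # True as soon as one ASCII letter or digit is found, False otherwise
    for letter in x:
        if((letter >= "A" and letter <= "Z") or (letter >= "a" and letter <="z") or (letter >= "0" and letter <= "9")):
            return True
    return False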

In [3]:
# NEW YORK CLEANING

nyc_listings_df = pd.read_csv("listings/nyc_listings.csv")

nyc_listings_df = nyc_listings_df[nyc_listings_df["host_about"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_response_time"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_response_rate"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_acceptance_rate"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["last_review"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["review_scores_rating"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["description"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["beds"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["bedrooms"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["bathrooms_text"].notna()]

nyc_listings_df = nyc_listings_df[nyc_listings_df["host_about"].apply(is_string)]
nyc_listings_df = nyc_listings_df[nyc_listings_df["name"].apply(is_string)]
nyc_listings_df = nyc_listings_df[nyc_listings_df["description"].apply(is_string)]

nyc_listings_df["host_response_rate"] = nyc_listings_df["host_response_rate"].apply(percent_to_float)
nyc_listings_df["host_acceptance_rate"] = nyc_listings_df["host_acceptance_rate"].apply(percent_to_float)
nyc_listings_df["price"] = nyc_listings_df["price"].apply(dollar_to_float)
nyc_listings_df["instant_bookable"] = nyc_listings_df["instant_bookable"].apply(str_to_bool)
nyc_listings_df["host_identity_verified"] = nyc_listings_df["host_identity_verified"].apply(str_to_bool)
nyc_listings_df["host_has_profile_pic"] = nyc_listings_df["host_has_profile_pic"].apply(str_to_bool)
nyc_listings_df["host_is_superhost"] = nyc_listings_df["host_is_superhost"].apply(str_to_bool)
nyc_listings_df["host_response_time"] = nyc_listings_df["host_response_time"].apply(host_response_time_to_int)
nyc_listings_df["amenities"] = nyc_listings_df["amenities"].apply(amenities_to_list)

nyc_listings_df["is_private_room"] = nyc_listings_df["room_type"].apply(room_type_to_bool)
nyc_listings_df["is_private_bath"] = nyc_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
nyc_listings_df["baths"] = nyc_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
nyc_listings_df["is_private_overall"] = nyc_listings_df["is_private_room"] & nyc_listings_df["is_private_bath"]

nyc_listings_df["host_since"] = pd.to_datetime(nyc_listings_df["host_since"])
nyc_listings_df["last_review"] = pd.to_datetime(nyc_listings_df["last_review"])

print(nyc_listings_df.dtypes)
print(nyc_listings_df.shape)
print(nyc_listings_df)

nyc_calendar_df = pd.read_csv("calendars/nyc_calendar.csv")
nyc_calendar_df = nyc_calendar_df.drop("price", axis=1)
nyc_calendar_df = nyc_calendar_df.drop("adjusted_price", axis=1)
nyc_calendar_df["date"] = pd.to_datetime(nyc_calendar_df["date"])
nyc_calendar_df["available"] = nyc_calendar_df["available"].apply(str_to_bool)

print(nyc_calendar_df)
print(nyc_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [4]:
# BOSTON CLEANING

bos_listings_df = pd.read_csv("listings/bos_listings.csv")

bos_listings_df = bos_listings_df[bos_listings_df["host_about"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["host_response_time"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["host_response_rate"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["host_acceptance_rate"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["last_review"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["review_scores_rating"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["description"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["beds"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["bedrooms"].notna()]
bos_listings_df = bos_listings_df[bos_listings_df["bathrooms_text"].notna()]

bos_listings_df = bos_listings_df[bos_listings_df["host_about"].apply(is_string)]
bos_listings_df = bos_listings_df[bos_listings_df["name"].apply(is_string)]
bos_listings_df = bos_listings_df[bos_listings_df["description"].apply(is_string)]

bos_listings_df["host_response_rate"] = bos_listings_df["host_response_rate"].apply(percent_to_float)
bos_listings_df["host_acceptance_rate"] = bos_listings_df["host_acceptance_rate"].apply(percent_to_float)
bos_listings_df["price"] = bos_listings_df["price"].apply(dollar_to_float)
bos_listings_df["instant_bookable"] = bos_listings_df["instant_bookable"].apply(str_to_bool)
bos_listings_df["host_identity_verified"] = bos_listings_df["host_identity_verified"].apply(str_to_bool)
bos_listings_df["host_has_profile_pic"] = bos_listings_df["host_has_profile_pic"].apply(str_to_bool)
bos_listings_df["host_is_superhost"] = bos_listings_df["host_is_superhost"].apply(str_to_bool)
bos_listings_df["host_response_time"] = bos_listings_df["host_response_time"].apply(host_response_time_to_int)
bos_listings_df["amenities"] = bos_listings_df["amenities"].apply(amenities_to_list)

bos_listings_df["is_private_room"] = bos_listings_df["room_type"].apply(room_type_to_bool)
bos_listings_df["is_private_bath"] = bos_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
bos_listings_df["baths"] = bos_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
bos_listings_df["is_private_overall"] = bos_listings_df["is_private_room"] & bos_listings_df["is_private_bath"]

bos_listings_df["host_since"] = pd.to_datetime(bos_listings_df["host_since"])
bos_listings_df["last_review"] = pd.to_datetime(bos_listings_df["last_review"])

print(bos_listings_df.dtypes)
print(bos_listings_df.shape)
print(bos_listings_df)

bos_calendar_df = pd.read_csv("calendars/bos_calendar.csv")
bos_calendar_df = bos_calendar_df.drop("price", axis=1)
bos_calendar_df = bos_calendar_df.drop("adjusted_price", axis=1)
bos_calendar_df["date"] = pd.to_datetime(bos_calendar_df["date"])
bos_calendar_df["available"] = bos_calendar_df["available"].apply(str_to_bool)

print(bos_calendar_df)
print(bos_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count                int64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [5]:
# CHICAGO CLEANING

chi_listings_df = pd.read_csv("listings/chi_listings.csv")

chi_listings_df = chi_listings_df[chi_listings_df["host_about"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["host_response_time"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["host_response_rate"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["host_acceptance_rate"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["last_review"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["review_scores_rating"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["description"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["beds"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["bedrooms"].notna()]
chi_listings_df = chi_listings_df[chi_listings_df["bathrooms_text"].notna()]

chi_listings_df = chi_listings_df[chi_listings_df["host_about"].apply(is_string)]
chi_listings_df = chi_listings_df[chi_listings_df["name"].apply(is_string)]
chi_listings_df = chi_listings_df[chi_listings_df["description"].apply(is_string)]

chi_listings_df["host_response_rate"] = chi_listings_df["host_response_rate"].apply(percent_to_float)
chi_listings_df["host_acceptance_rate"] = chi_listings_df["host_acceptance_rate"].apply(percent_to_float)
chi_listings_df["price"] = chi_listings_df["price"].apply(dollar_to_float)
chi_listings_df["instant_bookable"] = chi_listings_df["instant_bookable"].apply(str_to_bool)
chi_listings_df["host_identity_verified"] = chi_listings_df["host_identity_verified"].apply(str_to_bool)
chi_listings_df["host_has_profile_pic"] = chi_listings_df["host_has_profile_pic"].apply(str_to_bool)
chi_listings_df["host_is_superhost"] = chi_listings_df["host_is_superhost"].apply(str_to_bool)
chi_listings_df["host_response_time"] = chi_listings_df["host_response_time"].apply(host_response_time_to_int)
chi_listings_df["amenities"] = chi_listings_df["amenities"].apply(amenities_to_list)

chi_listings_df["is_private_room"] = chi_listings_df["room_type"].apply(room_type_to_bool)
chi_listings_df["is_private_bath"] = chi_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
chi_listings_df["baths"] = chi_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
chi_listings_df["is_private_overall"] = chi_listings_df["is_private_room"] & chi_listings_df["is_private_bath"]

chi_listings_df["host_since"] = pd.to_datetime(chi_listings_df["host_since"])
chi_listings_df["last_review"] = pd.to_datetime(chi_listings_df["last_review"])

print(chi_listings_df.dtypes)
print(chi_listings_df.shape)
print(chi_listings_df)

chi_calendar_df = pd.read_csv("calendars/chi_calendar.csv")
chi_calendar_df = chi_calendar_df.drop("price", axis=1)
chi_calendar_df = chi_calendar_df.drop("adjusted_price", axis=1)
chi_calendar_df["date"] = pd.to_datetime(chi_calendar_df["date"])
chi_calendar_df["available"] = chi_calendar_df["available"].apply(str_to_bool)

print(chi_calendar_df)
print(chi_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count                int64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [6]:
# WASHINGTON DC CLEANING

dc_listings_df = pd.read_csv("listings/dc_listings.csv")

dc_listings_df = dc_listings_df[dc_listings_df["host_about"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["host_response_time"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["host_response_rate"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["host_acceptance_rate"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["last_review"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["review_scores_rating"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["description"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["beds"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["bedrooms"].notna()]
dc_listings_df = dc_listings_df[dc_listings_df["bathrooms_text"].notna()]

dc_listings_df = dc_listings_df[dc_listings_df["host_about"].apply(is_string)]
dc_listings_df = dc_listings_df[dc_listings_df["name"].apply(is_string)]
dc_listings_df = dc_listings_df[dc_listings_df["description"].apply(is_string)]

dc_listings_df["host_response_rate"] = dc_listings_df["host_response_rate"].apply(percent_to_float)
dc_listings_df["host_acceptance_rate"] = dc_listings_df["host_acceptance_rate"].apply(percent_to_float)
dc_listings_df["price"] = dc_listings_df["price"].apply(dollar_to_float)
dc_listings_df["instant_bookable"] = dc_listings_df["instant_bookable"].apply(str_to_bool)
dc_listings_df["host_identity_verified"] = dc_listings_df["host_identity_verified"].apply(str_to_bool)
dc_listings_df["host_has_profile_pic"] = dc_listings_df["host_has_profile_pic"].apply(str_to_bool)
dc_listings_df["host_is_superhost"] = dc_listings_df["host_is_superhost"].apply(str_to_bool)
dc_listings_df["host_response_time"] = dc_listings_df["host_response_time"].apply(host_response_time_to_int)
dc_listings_df["amenities"] = dc_listings_df["amenities"].apply(amenities_to_list)

dc_listings_df["is_private_room"] = dc_listings_df["room_type"].apply(room_type_to_bool)
dc_listings_df["is_private_bath"] = dc_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
dc_listings_df["baths"] = dc_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
dc_listings_df["is_private_overall"] = dc_listings_df["is_private_room"] & dc_listings_df["is_private_bath"]

dc_listings_df["host_since"] = pd.to_datetime(dc_listings_df["host_since"])
dc_listings_df["last_review"] = pd.to_datetime(dc_listings_df["last_review"])

print(dc_listings_df.dtypes)
print(dc_listings_df.shape)
print(dc_listings_df)

dc_calendar_df = pd.read_csv("calendars/dc_calendar.csv")
dc_calendar_df = dc_calendar_df.drop("price", axis=1)
dc_calendar_df = dc_calendar_df.drop("adjusted_price", axis=1)
dc_calendar_df["date"] = pd.to_datetime(dc_calendar_df["date"])
dc_calendar_df["available"] = dc_calendar_df["available"].apply(str_to_bool)

print(dc_calendar_df)
print(dc_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [7]:
# DALLAS CLEANING

dal_listings_df = pd.read_csv("listings/dal_listings.csv")

dal_listings_df = dal_listings_df[dal_listings_df["host_about"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["host_response_time"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["host_response_rate"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["host_acceptance_rate"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["last_review"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["review_scores_rating"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["description"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["beds"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["bedrooms"].notna()]
dal_listings_df = dal_listings_df[dal_listings_df["bathrooms_text"].notna()]

dal_listings_df = dal_listings_df[dal_listings_df["host_about"].apply(is_string)]
dal_listings_df = dal_listings_df[dal_listings_df["name"].apply(is_string)]
dal_listings_df = dal_listings_df[dal_listings_df["description"].apply(is_string)]

dal_listings_df["host_response_rate"] = dal_listings_df["host_response_rate"].apply(percent_to_float)
dal_listings_df["host_acceptance_rate"] = dal_listings_df["host_acceptance_rate"].apply(percent_to_float)
dal_listings_df["price"] = dal_listings_df["price"].apply(dollar_to_float)
dal_listings_df["instant_bookable"] = dal_listings_df["instant_bookable"].apply(str_to_bool)
dal_listings_df["host_identity_verified"] = dal_listings_df["host_identity_verified"].apply(str_to_bool)
dal_listings_df["host_has_profile_pic"] = dal_listings_df["host_has_profile_pic"].apply(str_to_bool)
dal_listings_df["host_is_superhost"] = dal_listings_df["host_is_superhost"].apply(str_to_bool)
dal_listings_df["host_response_time"] = dal_listings_df["host_response_time"].apply(host_response_time_to_int)
dal_listings_df["amenities"] = dal_listings_df["amenities"].apply(amenities_to_list)

dal_listings_df["is_private_room"] = dal_listings_df["room_type"].apply(room_type_to_bool)
dal_listings_df["is_private_bath"] = dal_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
dal_listings_df["baths"] = dal_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
dal_listings_df["is_private_overall"] = dal_listings_df["is_private_room"] & dal_listings_df["is_private_bath"]

dal_listings_df["host_since"] = pd.to_datetime(dal_listings_df["host_since"])
dal_listings_df["last_review"] = pd.to_datetime(dal_listings_df["last_review"])

print(dal_listings_df.dtypes)
print(dal_listings_df.shape)
print(dal_listings_df)

dal_calendar_df = pd.read_csv("calendars/dal_calendar.csv")
dal_calendar_df = dal_calendar_df.drop("price", axis=1)
dal_calendar_df = dal_calendar_df.drop("adjusted_price", axis=1)
dal_calendar_df["date"] = pd.to_datetime(dal_calendar_df["date"])
dal_calendar_df["available"] = dal_calendar_df["available"].apply(str_to_bool)

print(dal_calendar_df)
print(dal_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count                int64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [8]:
# LAS VEGAS CLEANING

lv_listings_df = pd.read_csv("listings/lv_listings.csv")

lv_listings_df = lv_listings_df[lv_listings_df["host_about"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["host_response_time"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["host_response_rate"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["host_acceptance_rate"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["last_review"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["review_scores_rating"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["description"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["beds"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["bedrooms"].notna()]
lv_listings_df = lv_listings_df[lv_listings_df["bathrooms_text"].notna()]

lv_listings_df = lv_listings_df[lv_listings_df["host_about"].apply(is_string)]
lv_listings_df = lv_listings_df[lv_listings_df["name"].apply(is_string)]
lv_listings_df = lv_listings_df[lv_listings_df["description"].apply(is_string)]

lv_listings_df["host_response_rate"] = lv_listings_df["host_response_rate"].apply(percent_to_float)
lv_listings_df["host_acceptance_rate"] = lv_listings_df["host_acceptance_rate"].apply(percent_to_float)
lv_listings_df["price"] = lv_listings_df["price"].apply(dollar_to_float)
lv_listings_df["instant_bookable"] = lv_listings_df["instant_bookable"].apply(str_to_bool)
lv_listings_df["host_identity_verified"] = lv_listings_df["host_identity_verified"].apply(str_to_bool)
lv_listings_df["host_has_profile_pic"] = lv_listings_df["host_has_profile_pic"].apply(str_to_bool)
lv_listings_df["host_is_superhost"] = lv_listings_df["host_is_superhost"].apply(str_to_bool)
lv_listings_df["host_response_time"] = lv_listings_df["host_response_time"].apply(host_response_time_to_int)
lv_listings_df["amenities"] = lv_listings_df["amenities"].apply(amenities_to_list)

lv_listings_df["is_private_room"] = lv_listings_df["room_type"].apply(room_type_to_bool)
lv_listings_df["is_private_bath"] = lv_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
lv_listings_df["baths"] = lv_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
lv_listings_df["is_private_overall"] = lv_listings_df["is_private_room"] & lv_listings_df["is_private_bath"]

lv_listings_df["host_since"] = pd.to_datetime(lv_listings_df["host_since"])
lv_listings_df["last_review"] = pd.to_datetime(lv_listings_df["last_review"])

print(lv_listings_df.dtypes)
print(lv_listings_df.shape)
print(lv_listings_df)

lv_calendar_df = pd.read_csv("calendars/lv_calendar.csv")
lv_calendar_df = lv_calendar_df.drop("price", axis=1)
lv_calendar_df = lv_calendar_df.drop("adjusted_price", axis=1)
lv_calendar_df["date"] = pd.to_datetime(lv_calendar_df["date"])
lv_calendar_df["available"] = lv_calendar_df["available"].apply(str_to_bool)

print(lv_calendar_df)
print(lv_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [9]:
# LOS ANGELES CLEANING

la_listings_df = pd.read_csv("listings/la_listings.csv")

la_listings_df = la_listings_df[la_listings_df["host_about"].notna()]
la_listings_df = la_listings_df[la_listings_df["host_response_time"].notna()]
la_listings_df = la_listings_df[la_listings_df["host_response_rate"].notna()]
la_listings_df = la_listings_df[la_listings_df["host_acceptance_rate"].notna()]
la_listings_df = la_listings_df[la_listings_df["last_review"].notna()]
la_listings_df = la_listings_df[la_listings_df["review_scores_rating"].notna()]
la_listings_df = la_listings_df[la_listings_df["description"].notna()]
la_listings_df = la_listings_df[la_listings_df["beds"].notna()]
la_listings_df = la_listings_df[la_listings_df["bedrooms"].notna()]
la_listings_df = la_listings_df[la_listings_df["bathrooms_text"].notna()]

la_listings_df = la_listings_df[la_listings_df["host_about"].apply(is_string)]
la_listings_df = la_listings_df[la_listings_df["name"].apply(is_string)]
la_listings_df = la_listings_df[la_listings_df["description"].apply(is_string)]

la_listings_df["host_response_rate"] = la_listings_df["host_response_rate"].apply(percent_to_float)
la_listings_df["host_acceptance_rate"] = la_listings_df["host_acceptance_rate"].apply(percent_to_float)
la_listings_df["price"] = la_listings_df["price"].apply(dollar_to_float)
la_listings_df["instant_bookable"] = la_listings_df["instant_bookable"].apply(str_to_bool)
la_listings_df["host_identity_verified"] = la_listings_df["host_identity_verified"].apply(str_to_bool)
la_listings_df["host_has_profile_pic"] = la_listings_df["host_has_profile_pic"].apply(str_to_bool)
la_listings_df["host_is_superhost"] = la_listings_df["host_is_superhost"].apply(str_to_bool)
la_listings_df["host_response_time"] = la_listings_df["host_response_time"].apply(host_response_time_to_int)
la_listings_df["amenities"] = la_listings_df["amenities"].apply(amenities_to_list)

la_listings_df["is_private_room"] = la_listings_df["room_type"].apply(room_type_to_bool)
la_listings_df["is_private_bath"] = la_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
la_listings_df["baths"] = la_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
la_listings_df["is_private_overall"] = la_listings_df["is_private_room"] & la_listings_df["is_private_bath"]

la_listings_df["host_since"] = pd.to_datetime(la_listings_df["host_since"])
la_listings_df["last_review"] = pd.to_datetime(la_listings_df["last_review"])

print(la_listings_df.dtypes)
print(la_listings_df.shape)
print(la_listings_df)

la_calendar_df = pd.read_csv("calendars/la_calendar.csv")
la_calendar_df = la_calendar_df.drop("price", axis=1)
la_calendar_df = la_calendar_df.drop("adjusted_price", axis=1)
la_calendar_df["date"] = pd.to_datetime(la_calendar_df["date"])
la_calendar_df["available"] = la_calendar_df["available"].apply(str_to_bool)

print(la_calendar_df)
print(la_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [10]:
# NASHVILLE CLEANING

nsh_listings_df = pd.read_csv("listings/nsh_listings.csv")

nsh_listings_df = nsh_listings_df[nsh_listings_df["host_about"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["host_response_time"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["host_response_rate"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["host_acceptance_rate"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["last_review"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["review_scores_rating"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["description"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["beds"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["bedrooms"].notna()]
nsh_listings_df = nsh_listings_df[nsh_listings_df["bathrooms_text"].notna()]

nsh_listings_df = nsh_listings_df[nsh_listings_df["host_about"].apply(is_string)]
nsh_listings_df = nsh_listings_df[nsh_listings_df["name"].apply(is_string)]
nsh_listings_df = nsh_listings_df[nsh_listings_df["description"].apply(is_string)]

nsh_listings_df["host_response_rate"] = nsh_listings_df["host_response_rate"].apply(percent_to_float)
nsh_listings_df["host_acceptance_rate"] = nsh_listings_df["host_acceptance_rate"].apply(percent_to_float)
nsh_listings_df["price"] = nsh_listings_df["price"].apply(dollar_to_float)
nsh_listings_df["instant_bookable"] = nsh_listings_df["instant_bookable"].apply(str_to_bool)
nsh_listings_df["host_identity_verified"] = nsh_listings_df["host_identity_verified"].apply(str_to_bool)
nsh_listings_df["host_has_profile_pic"] = nsh_listings_df["host_has_profile_pic"].apply(str_to_bool)
nsh_listings_df["host_is_superhost"] = nsh_listings_df["host_is_superhost"].apply(str_to_bool)
nsh_listings_df["host_response_time"] = nsh_listings_df["host_response_time"].apply(host_response_time_to_int)
nsh_listings_df["amenities"] = nsh_listings_df["amenities"].apply(amenities_to_list)

nsh_listings_df["is_private_room"] = nsh_listings_df["room_type"].apply(room_type_to_bool)
nsh_listings_df["is_private_bath"] = nsh_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
nsh_listings_df["baths"] = nsh_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
nsh_listings_df["is_private_overall"] = nsh_listings_df["is_private_room"] & nsh_listings_df["is_private_bath"]

nsh_listings_df["host_since"] = pd.to_datetime(nsh_listings_df["host_since"])
nsh_listings_df["last_review"] = pd.to_datetime(nsh_listings_df["last_review"])

print(nsh_listings_df.dtypes)
print(nsh_listings_df.shape)
print(nsh_listings_df)

nsh_calendar_df = pd.read_csv("calendars/nsh_calendar.csv")
nsh_calendar_df = nsh_calendar_df.drop("price", axis=1)
nsh_calendar_df = nsh_calendar_df.drop("adjusted_price", axis=1)
nsh_calendar_df["date"] = pd.to_datetime(nsh_calendar_df["date"])
nsh_calendar_df["available"] = nsh_calendar_df["available"].apply(str_to_bool)

print(nsh_calendar_df)
print(nsh_calendar_df.dtypes)

id                                 int64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

In [11]:
# TWIN CITIES CLEANING

msp_listings_df = pd.read_csv("listings/msp_listings.csv")

msp_listings_df = msp_listings_df[msp_listings_df["host_about"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["host_response_time"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["host_response_rate"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["host_acceptance_rate"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["last_review"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["review_scores_rating"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["description"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["beds"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["bedrooms"].notna()]
msp_listings_df = msp_listings_df[msp_listings_df["bathrooms_text"].notna()]

msp_listings_df = msp_listings_df[msp_listings_df["host_about"].apply(is_string)]
msp_listings_df = msp_listings_df[msp_listings_df["name"].apply(is_string)]
msp_listings_df = msp_listings_df[msp_listings_df["description"].apply(is_string)]

msp_listings_df["host_response_rate"] = msp_listings_df["host_response_rate"].apply(percent_to_float)
msp_listings_df["host_acceptance_rate"] = msp_listings_df["host_acceptance_rate"].apply(percent_to_float)
msp_listings_df["price"] = msp_listings_df["price"].apply(dollar_to_float)
msp_listings_df["instant_bookable"] = msp_listings_df["instant_bookable"].apply(str_to_bool)
msp_listings_df["host_identity_verified"] = msp_listings_df["host_identity_verified"].apply(str_to_bool)
msp_listings_df["host_has_profile_pic"] = msp_listings_df["host_has_profile_pic"].apply(str_to_bool)
msp_listings_df["host_is_superhost"] = msp_listings_df["host_is_superhost"].apply(str_to_bool)
msp_listings_df["host_response_time"] = msp_listings_df["host_response_time"].apply(host_response_time_to_int)
msp_listings_df["amenities"] = msp_listings_df["amenities"].apply(amenities_to_list)

msp_listings_df["is_private_room"] = msp_listings_df["room_type"].apply(room_type_to_bool)
msp_listings_df["is_private_bath"] = msp_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
msp_listings_df["baths"] = msp_listings_df["bathrooms_text"].apply(bathroom_text_to_float)
msp_listings_df["is_private_overall"] = msp_listings_df["is_private_room"] & msp_listings_df["is_private_bath"]

msp_listings_df["host_since"] = pd.to_datetime(msp_listings_df["host_since"])
msp_listings_df["last_review"] = pd.to_datetime(msp_listings_df["last_review"])

print(msp_listings_df.dtypes)
print(msp_listings_df.shape)
print(msp_listings_df)

msp_calendar_df = pd.read_csv("calendars/msp_calendar.csv")
msp_calendar_df = msp_calendar_df.drop("price", axis=1)
msp_calendar_df = msp_calendar_df.drop("adjusted_price", axis=1)
msp_calendar_df["date"] = pd.to_datetime(msp_calendar_df["date"])
msp_calendar_df["available"] = msp_calendar_df["available"].apply(str_to_bool)

print(msp_calendar_df)
print(msp_calendar_df.dtypes)

id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count                int64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
room_type                         object
accommodates                       int64
bathrooms_text                    object
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra
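
The cleaning cells above repeat the same missing-value filters for each city. As a minimal sketch (not the notebook's actual code), those filters could be factored into one function. `REQUIRED_COLUMNS` and `drop_incomplete_rows` are hypothetical names, and the toy DataFrame only stands in for a real listings file.

```python
import pandas as pd

# Columns that the cleaning cells above require to be non-null.
REQUIRED_COLUMNS = [
    "host_about", "host_response_time", "host_response_rate",
    "host_acceptance_rate", "last_review", "review_scores_rating",
    "description", "beds", "bedrooms", "bathrooms_text",
]


def drop_incomplete_rows(listings_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that have a value in every required column (hypothetical helper)."""
    return listings_df.dropna(subset=REQUIRED_COLUMNS)


# Toy stand-in for a listings file: three rows, one with a missing "beds" value.
toy_df = pd.DataFrame({col: [1.0, 2.0, 3.0] for col in REQUIRED_COLUMNS})
toy_df.loc[1, "beds"] = float("nan")

cleaned_df = drop_incomplete_rows(toy_df)
print(cleaned_df.shape)
```

Calling the same function on each city's DataFrame would keep the filters identical across cities, which matters when the cities are compared later.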

### Data Limitations


**Limitation 1:** The cities we selected do not capture the full picture of the U.S. domestic tourism market. We originally planned to select, from each of the nine U.S. geographic regions defined by the CDC, the city with the most inbound tourism. These cities would have been:
   - New England: Boston, Massachusetts
   - Middle Atlantic: New York, New York
   - East North Central: Chicago, Illinois
   - West North Central: The Twin Cities (Minneapolis and St. Paul), Minnesota
   - South Atlantic: Orlando, Florida
   - East South Central: Nashville, Tennessee
   - West South Central: San Antonio, Texas
   - Mountain: Las Vegas, Nevada
   - Pacific: Los Angeles, California

However, after checking the Inside Airbnb database, we found no data for two of these cities: Orlando and San Antonio. As a result, we substituted the next-largest cities with available data: Washington, DC and Dallas. Additionally, instead of Las Vegas itself, the database covers Clark County, Nevada, which contains Las Vegas, so this dataset may include some listings outside the target metropolitan area. Overall, these gaps mean that we cannot analyze some of the target cities for our research question.

**Limitation 2:** We had to delete many columns from the datasets solely because they were inconsistent across cities. For instance, the New York City dataset had a column called “neighborhood group (cleansed)”, which gave the borough in which each listing is located. This column would have been helpful for our analysis of NYC, but it was empty for the remaining cities. We also had to delete several other columns, such as “bathrooms” and “license”, because they were empty for many cities. Had these columns been populated for every city, we could have used them in our analysis.
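
One way to identify such columns is to check, for every column, whether it holds at least one value in each city. The following is a minimal sketch under stated assumptions, not the procedure we actually ran: `city_frames` and `usable_columns` are hypothetical names, and the toy data (including its column names) is illustrative only.

```python
import pandas as pd


def usable_columns(city_frames: dict) -> list:
    """Return the columns that exist and are not entirely empty in every city."""
    usable = None
    for frame in city_frames.values():
        # Columns with at least one non-null value in this city.
        populated = {col for col in frame.columns if frame[col].notna().any()}
        usable = populated if usable is None else usable & populated
    return sorted(usable or [])


# Toy stand-ins for two cities' listings (illustrative values only).
city_frames = {
    "nyc": pd.DataFrame({
        "price": [100.0],
        "neighbourhood_group_cleansed": ["Brooklyn"],
        "bathrooms": [None],
    }),
    "dal": pd.DataFrame({
        "price": [80.0],
        "neighbourhood_group_cleansed": [None],
        "bathrooms": [None],
    }),
}

print(usable_columns(city_frames))
```

In this toy case only `price` survives, because the borough column is populated for one city only and `bathrooms` is empty everywhere.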

**Limitation 3:** A final limitation is that we had to define what makes a listing “private”. The raw data has two columns that indicate a listing’s degree of privacy: 1) “room_type”, which indicates whether the listing is a private room, shared room, or entire home/apartment, and 2) “bathrooms_text”, which gives the number of bathrooms and, occasionally, whether they are shared or private. Because many listings do not state whether the bathrooms are shared or private, we created a single variable, “is_private_overall”, which combines both columns and is true only when both the room and the bathroom are flagged as private. This is a limitation because it would have been helpful to know how the two factors, room privacy and bathroom privacy, separately influence users’ decisions.
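
To illustrate how the two columns combine, here is a minimal sketch. The notebook's own helpers (`room_type_to_bool` and `bathrooms_text_to_bool`) are defined earlier and are not reproduced here; the `sketch_*` functions below are hypothetical stand-ins, and their rules (a room is private unless it is a "Shared room"; a bathroom is private unless its text mentions "shared") are assumptions rather than our exact definitions.

```python
def sketch_room_is_private(room_type: str) -> bool:
    # Assumed rule: only a shared room counts as non-private.
    return room_type.strip().lower() != "shared room"


def sketch_bath_is_private(bathrooms_text: str) -> bool:
    # Assumed rule: the bathroom is private unless the text says "shared".
    return "shared" not in bathrooms_text.lower()


def sketch_is_private_overall(room_type: str, bathrooms_text: str) -> bool:
    # Mirrors the notebook's is_private_room & is_private_bath combination:
    # the listing is private only when both parts are private.
    return sketch_room_is_private(room_type) and sketch_bath_is_private(bathrooms_text)


examples = [
    ("Entire home/apt", "1 bath"),
    ("Private room", "1 shared bath"),
    ("Shared room", "1.5 baths"),
]
for room_type, bathrooms_text in examples:
    flag = sketch_is_private_overall(room_type, bathrooms_text)
    print(room_type, "|", bathrooms_text, "->", flag)
```

Because the combined flag is a logical AND, a private room with a shared bathroom and a shared room with a private bathroom both end up as non-private, which is exactly the information the single variable cannot separate.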

### Exploratory Data Analysis

In [12]:
import numpy as np
import seaborn as sns
import pandas as pd

import matplotlib.pyplot as plt

In [13]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

In [19]:
%sql SELECT * FROM nyc_listings_df WHERE review_scores_rating IS NOT NULL ORDER BY review_scores_rating ASC 

Unnamed: 0,id,name,description,host_since,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,minimum_nights,maximum_nights,number_of_reviews,last_review,review_scores_rating,instant_bookable,is_private_room,is_private_bath,baths,is_private_overall
0,2.632320e+05,Cozy&Clean in a great neighborhood,"My place is good for couples, solo adventurers...",2011-11-07,I work in NYC as a Fashion Designer. I am from...,2,0.50,0.00,False,1.0,...,30,30,1,2019-01-27,0.0,False,True,True,1.0,True
1,6.672758e+06,Cool Large 3 bedrooms in the heart of East vil...,3 MONTHS MINIMUM.<br /><br /><br />This furnis...,2012-04-22,my favorite cities in the world\nnew york\nPar...,3,0.05,0.17,False,5.0,...,30,30,1,2018-03-10,0.0,True,True,True,1.0,True
2,1.597414e+07,100$,I am renting my super cozy and very spacious 1...,2015-09-29,A happy Brazilian girl leaving in the capitol ...,2,1.00,0.67,False,1.0,...,30,30,1,2018-01-07,0.0,False,True,False,1.0,False
3,2.590499e+07,Bedstuy spacious townhouse w. amazing backyard,"Beautiful duplex townhouse on a safe, friendly...",2010-07-20,We are a young (at least we feel young ;) and ...,2,0.75,0.50,False,1.0,...,10,75,2,2019-12-16,0.0,False,True,True,3.0,True
4,2.758348e+07,River View!,"Spacious, neat, beautiful, and quiet one-bedro...",2013-01-02,"cool easy going,",1,1.00,0.00,False,1.0,...,30,120,1,2018-08-08,0.0,False,True,True,1.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11092,2.453984e+07,Luxury Furnished 2 Bedroom Apartment Jersey City,"Soaring 50 stories above the Hudson River, thi...",2013-10-14,We have been providing vacation rental apartme...,0,0.99,0.62,False,327.0,...,27,1125,1,2020-01-03,5.0,True,True,True,2.0,True
11093,5.584635e+17,"Jersey City 2BR w/ Elevator, Gym & W/D, nr PATH",Show up and start living from day one in New J...,2016-12-16,"We’re Blueground, a global proptech company wi...",0,1.00,0.97,False,4022.0,...,31,1125,1,2022-06-02,5.0,True,True,True,2.0,True
11094,6.786148e+17,One Bedroom Apartment With A Manhattan View,1 Bedroom Apartment overlooking NYC. The view ...,2021-01-20,Host looking to provide you with a comfortable...,0,0.81,0.90,False,1.0,...,5,365,2,2022-08-30,5.0,True,True,True,1.0,True
11095,5.426634e+07,Luxe Industrial Loft in Heart of City near NYC,Experience Jersey City in a luxe industrial lo...,2020-07-31,Hey! My name is Jeff and I am a traveling seri...,2,0.67,1.00,False,2.0,...,30,365,5,2022-07-06,5.0,True,True,True,1.0,True


In [20]:
nyc_calendar_df.head()

Unnamed: 0,listing_id,date,available,minimum_nights,maximum_nights
0,2539,2022-09-07,False,30.0,730.0
1,2539,2022-09-08,False,30.0,730.0
2,2539,2022-09-09,False,30.0,730.0
3,2539,2022-09-10,False,30.0,730.0
4,2539,2022-09-11,False,30.0,730.0


In [21]:
# Drop minimum_nights and maximum_nights (redundant; both already exist in the listings data)
nyc_calendar_df = nyc_calendar_df.drop(columns=["minimum_nights", "maximum_nights"])
nyc_calendar_df.head()

Unnamed: 0,listing_id,date,available
0,2539,2022-09-07,False
1,2539,2022-09-08,False
2,2539,2022-09-09,False
3,2539,2022-09-10,False
4,2539,2022-09-11,False


In [22]:
%sql SELECT listing_id, COUNT(available) AS days_booked FROM nyc_calendar_df WHERE available = False GROUP BY listing_id ORDER BY days_booked DESC

Unnamed: 0,listing_id,days_booked
0,64015,366
1,95883,365
2,170420,365
3,189787,365
4,3251014,365
...,...,...
39075,46929299,1
39076,14408114,1
39077,24561311,1
39078,1755844,1


In [23]:
# Count the unavailable days per listing, then left-join the counts onto the listings
%sql nyc_bookings_df << (SELECT listing_id, COUNT(available) AS days_booked FROM nyc_calendar_df WHERE available = False GROUP BY listing_id)
%sql nyc_combined_df << (SELECT * FROM nyc_listings_df LEFT JOIN nyc_bookings_df ON nyc_listings_df.id = nyc_bookings_df.listing_id)

Returning data to local variable nyc_bookings_df
Returning data to local variable nyc_combined_df
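
For readers less familiar with SQL, the same two steps can be sketched in pandas. This is an illustrative equivalent on toy data, not the code we ran; `calendar_df`, `listings_df`, `bookings_df`, and `combined_df` are hypothetical stand-ins for the NYC DataFrames.

```python
import pandas as pd

# Toy stand-ins for the calendar and listings tables.
calendar_df = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "available": [False, False, True, True, True],
})
listings_df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})

# Count unavailable days per listing
# (same idea as COUNT(available) ... WHERE available = False GROUP BY listing_id).
bookings_df = (
    calendar_df.loc[~calendar_df["available"]]
    .groupby("listing_id")
    .size()
    .reset_index(name="days_booked")
)

# The left join keeps every listing; listings without unavailable days get NaN.
combined_df = listings_df.merge(
    bookings_df, how="left", left_on="id", right_on="listing_id"
).drop(columns="listing_id")

print(combined_df)
```

Listings with no unavailable days come out with a missing `days_booked` rather than 0, so filling those values (for example with `fillna(0)`) may be preferable depending on the analysis. It may also be worth checking the join key: `id` is stored as float64 in the cleaned listings, and float64 cannot represent every integer above 2^53 exactly, which includes IDs as large as 5.584635e+17.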


In [24]:
nyc_combined_df = nyc_combined_df.drop(columns = 'listing_id')

In [25]:
nyc_combined_df.head()

Unnamed: 0,id,name,description,host_since,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,maximum_nights,number_of_reviews,last_review,review_scores_rating,instant_bookable,is_private_room,is_private_bath,baths,is_private_overall,days_booked
0,281851.0,Beautiful Private Bedroom - Downstairs,Cozy bedroom in a 100 year old Brownstone loca...,2011-12-04,We are very easy going and family oriented. W...,2,1.0,0.63,True,1.0,...,1125,121,2022-06-08,4.93,False,True,False,0.0,False,107.0
1,287845.0,Carroll Gardens Gem-2BD with Garden,"Clean, quiet, comfortable beautifully renovate...",2011-12-12,We love Brooklyn!,0,1.0,0.6,False,1.0,...,365,127,2022-07-09,4.98,False,True,True,1.0,True,77.0
2,294259.0,Loft Suite,"This loft unit features 16ft ceilings, a kitch...",2011-03-01,Thank you for your interest in the Box House H...,0,0.99,0.98,True,30.0,...,180,135,2022-08-28,4.84,False,True,True,1.0,True,159.0
3,322604.0,Artist Loft-McCarren Park-Williamsburg-Brookly...,"** Please send me a message , before trying t...",2012-01-25,Artist/Photographer\nwith a Sweet Bedroom in A...,0,1.0,0.97,True,2.0,...,5,223,2022-08-20,4.95,False,True,False,1.0,False,351.0
4,352651.0,COMFORTABLE LARGE ROOM,"This is a very large room with queen size bed,...",2012-02-21,"Out going, easy to get a long with.",2,0.5,0.67,False,3.0,...,180,11,2019-06-23,4.44,False,True,False,1.0,False,145.0


### Questions for Reviewers