## INFO 2950 Final Project
Larrisa Chen (lc949), Michelle Li (myl39), Christina Jin (cej65), Jade Eggleston (jce76)

### Research question
***Do the criteria for a successful Airbnb listing differ across U.S. regions?***

Where success is defined as:
- high number of bookings combined with high listing rating

and the criteria are defined by:
- Price
- Number of Beds
- Number of Baths
- Host ratings
- Number of reviews 
- Private or Public
- Proximity to urban center (most popular neighborhood)
- Neighborhood
- Keywords in names
- Keywords in description
- Host response rate / time
- Room type
- Amenities (binary)
    - varies by region

The list of amenities is scraped from each of the cities listed below and aggregated to find the top 15 in each region. 
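The aggregation step could look roughly like the sketch below. It assumes each listing's `amenities` value is a JSON-encoded list of strings (the cleaning code in this notebook parses the column with `json.loads`); `top_amenities` and the sample strings are hypothetical and only illustrate the counting.

```python
import json
from collections import Counter

def top_amenities(amenity_strings, n=15):
    """Return the n most frequent amenities across a collection of listings.

    Each element is assumed to be a JSON-encoded list, e.g. '["Wifi", "Kitchen"]'.
    """
    counts = Counter()
    for raw in amenity_strings:
        # set() so an amenity repeated within one listing is counted once
        counts.update(set(json.loads(raw)))
    return [name for name, _ in counts.most_common(n)]

# Made-up sample data, not taken from the real files.
sample = [
    '["Wifi", "Kitchen"]',
    '["Wifi", "Heating"]',
    '["Wifi", "Kitchen", "Washer"]',
]
top_two = top_amenities(sample, n=2)
```

Each name in the result can then become one binary column per listing (amenity present or not).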

Ultimately, we want to identify the attributes that contribute to a successful Airbnb listing and compare them across U.S. regions (Northeast, Southeast, Southwest, West, Northwest, and Midwest).
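One way to make "high number of bookings combined with high listing rating" concrete is a quantile cutoff on both measures. The sketch below assumes a per-listing `days_booked` count has already been joined onto the listing data. The 0.75 default cutoff and the names `label_success` and `is_successful` are illustrative choices, not decisions we have finalized.

```python
import pandas as pd

def label_success(df, quantile=0.75):
    """Flag listings whose bookings and rating are both at or above the given quantile."""
    booked_cut = df["days_booked"].quantile(quantile)
    rating_cut = df["review_scores_rating"].quantile(quantile)
    out = df.copy()
    out["is_successful"] = (out["days_booked"] >= booked_cut) & (
        out["review_scores_rating"] >= rating_cut
    )
    return out

# Made-up values for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "days_booked": [300, 120, 310, 40],
    "review_scores_rating": [4.9, 4.95, 4.2, 3.8],
})
labeled = label_success(toy, quantile=0.5)
```

Computing the cutoffs within each city, rather than over the pooled data, would keep a high-demand city from dominating the "successful" group.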


### Data Origin/Description

Our data is sourced from http://insideairbnb.com/get-the-data/.

We aggregated data from the cities that receive the most inbound tourism in each region, according to *link*.
The cities being analyzed are as follows:
- New England: Boston
- Middle Atlantic: New York City
- East North Central: Chicago
- West North Central: Minneapolis
- South Atlantic: Miami
- East South Central: Nashville
- West South Central: San Antonio
- Mountain: Las Vegas
- Pacific: Los Angeles
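To compare regions later, each city's rows need a region label before the files are combined. A minimal sketch, assuming one DataFrame per city has already been loaded; `REGION_BY_CITY` mirrors the list above, while `tag_and_combine` and the toy frames are hypothetical.

```python
import pandas as pd

# Region labels exactly as listed above.
REGION_BY_CITY = {
    "Boston": "New England",
    "New York City": "Middle Atlantic",
    "Chicago": "East North Central",
    "Minneapolis": "West North Central",
    "Miami": "South Atlantic",
    "Nashville": "East South Central",
    "San Antonio": "West South Central",
    "Las Vegas": "Mountain",
    "Los Angeles": "Pacific",
}

def tag_and_combine(frames_by_city):
    """Concatenate per-city DataFrames, adding `city` and `region` columns."""
    tagged = [
        df.assign(city=city, region=REGION_BY_CITY[city])
        for city, df in frames_by_city.items()
    ]
    return pd.concat(tagged, ignore_index=True)

# Toy frames standing in for the real per-city listing files.
combined = tag_and_combine({
    "Boston": pd.DataFrame({"id": [1, 2]}),
    "Miami": pd.DataFrame({"id": [3]}),
})
```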



For each city, we want to analyze:
- Listing data with columns:
    - id
    - name
    - description
    - host_since
    - host_about
    - host_response_time
    - host_response_rate
    - host_acceptance_rate
    - host_is_superhost
    - host_listings_count
    - host_has_profile_pic
    - host_identity_verified
    - neighbourhood_cleansed
    - neighbourhood_group_cleansed
    - room_type
    - accommodates
    - bathrooms_text
    - bedrooms
    - beds
    - amenities
    - price
    - minimum_nights
    - maximum_nights
    - number_of_reviews
    - last_review
    - review_scores_rating
    - instant_bookable

- and booking data with columns:
    - list columns here


### Data Collection & Cleaning

We begin by removing all rows that contain NaN in the columns we analyze, so that every value in a column has the same type.
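The row filter can be expressed as a single `dropna` call restricted to the columns that must be present. A sketch on a made-up frame; `REQUIRED` lists only a few of the listing columns for brevity.

```python
import pandas as pd

# Columns that must be non-null for a row to be kept (a subset, for illustration).
REQUIRED = ["host_response_rate", "review_scores_rating", "beds"]

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "host_response_rate": ["100%", None, "90%"],
    "review_scores_rating": [4.8, 4.5, None],
    "beds": [1.0, 2.0, 1.0],
})

# Keep only rows where every required column has a value.
complete = toy.dropna(subset=REQUIRED).reset_index(drop=True)
```

Restricting the check with `subset` avoids discarding rows because of gaps in columns that are dropped later anyway.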

Then, we analyze the columns and manually delete the following:
- listing_url
    - Repetitive data
- maximum_maximum_nights
    - Repetitive data
- minimum_nights_avg_ntm
    - Repetitive data
- maximum_nights_avg_ntm
    - Repetitive data
- calendar_updated
    - Only contains empty data
- has_availability
    - All true, redundant
- availability_30
    - Assuming users only evaluate criteria for listings that are available
- availability_60
    - Assuming users only evaluate criteria for listings that are available
- availability_90
    - Assuming users only evaluate criteria for listings that are available
- availability_365
    - Assuming users only evaluate criteria for listings that are available
- calendar_last_scraped
- number_of_reviews_ltm
    - Correlated to number_of_reviews, redundant
- number_of_reviews_l30d
    - Correlated to number_of_reviews, redundant
- first_review
    - Redundant information because we have host_since
- review_scores_accuracy
    - Correlated to review_scores_rating
- review_scores_cleanliness
    - Correlated to review_scores_rating
- review_scores_checkin
    - Correlated to review_scores_rating
- review_scores_communication
    - Correlated to review_scores_rating
- review_scores_location
    - Correlated to review_scores_rating
- review_scores_value
    - Correlated to review_scores_rating
- license
    - Only contains empty data
- calculated_host_listings_count
- calculated_host_listings_count_entire_homes
    - Information about the host’s other listings is not relevant to this listing
- calculated_host_listings_count_private_rooms
    - Information about the host’s other listings is not relevant to this listing
- calculated_host_listings_count_shared_rooms
    - Information about the host’s other listings is not relevant to this listing
- reviews_per_month
    - Too dependent on other people’s stay time, irrelevant metric
- neighbourhood_group_cleansed
    - Inconsistent across different cities
- house_availability
    - If a listing is unavailable, users will not view it, so by default it cannot be the best listing
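Dropping these columns can be done in one call. The sketch below assumes the lowercase spellings of the column names and shows only part of the list; `drop_unused` and the toy frame are hypothetical.

```python
import pandas as pd

# A few of the columns listed above; extend with the rest as needed.
COLUMNS_TO_DROP = [
    "listing_url",
    "calendar_updated",
    "has_availability",
    "availability_30",
    "license",
]

def drop_unused(df, columns=COLUMNS_TO_DROP):
    # errors="ignore" skips names that a given city's file does not contain
    return df.drop(columns=columns, errors="ignore")

# Made-up frame: only `listing_url` is present among the columns to drop.
toy = pd.DataFrame({"id": [1], "listing_url": ["u"], "price": [10.0]})
trimmed = drop_unused(toy)
```

Using `errors="ignore"` lets the same list be reused across cities even if a file is missing one of the columns.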


In [22]:
import numpy as np
import pandas as pd
import re
import json

In [61]:
def percent_to_float(x):
    return float(x.strip("%"))/100.0

def dollar_to_float(x):
    x = x.replace(",", "")
    x = x.replace("$", "")
    return float(x)

def str_to_bool(x):
    if(x == "f"):
        return False
    else:
        return True

# is_private_overall -> is_private boolean type if private room and private bathroom else false
# if room_type is entire home/apt = private and bath bed = private
def room_type_to_bool(x):
    if(x == "Shared room"):
        return False
    return True

def bathrooms_text_to_bool(x):
    if(str(x).find("shared") != -1 or str(x).find("Shared") != -1):
        return False
    return True
    
def amenities_to_list(x):
    return json.loads(x)
    
def host_response_time_to_int(x):
    if(x == "within an hour"):
        return 0
    elif(x == "within a few hours"):
        return 1
    elif(x == "within a day"):
        return 2
    elif(x == "a few days or more"):
        return 3
    elif(x == "None"):
        return 4
    
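def bathroom_text_to_int(x):
    # Entries such as "Half-bath" carry no digit; treat them as 0.5 baths
    if "half-bath" in x.lower():
        return 0.5
    # Fractional counts such as "1.5 baths"
    decimal = re.search(r"\d+\.\d+", x)
    if decimal is not None:
        return float(decimal.group())
    # Whole-number counts such as "2 baths"
    return int(re.search(r"\d+", x).group())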

nyc_listings_df = pd.read_csv("nyc_listings.csv")

nyc_listings_df = nyc_listings_df[nyc_listings_df["host_about"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_response_time"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_response_rate"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["host_acceptance_rate"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["last_review"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["review_scores_rating"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["description"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["beds"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["bedrooms"].notna()]
nyc_listings_df = nyc_listings_df[nyc_listings_df["bathrooms_text"].notna()]

nyc_listings_df["host_response_rate"] = nyc_listings_df["host_response_rate"].apply(percent_to_float)
nyc_listings_df["host_acceptance_rate"] = nyc_listings_df["host_acceptance_rate"].apply(percent_to_float)
nyc_listings_df["price"] = nyc_listings_df["price"].apply(dollar_to_float)
nyc_listings_df["instant_bookable"] = nyc_listings_df["instant_bookable"].apply(str_to_bool)
nyc_listings_df["host_identity_verified"] = nyc_listings_df["host_identity_verified"].apply(str_to_bool)
nyc_listings_df["host_has_profile_pic"] = nyc_listings_df["host_has_profile_pic"].apply(str_to_bool)
nyc_listings_df["host_is_superhost"] = nyc_listings_df["host_is_superhost"].apply(str_to_bool)
nyc_listings_df["host_response_time"] = nyc_listings_df["host_response_time"].apply(host_response_time_to_int)
nyc_listings_df["amenities"] = nyc_listings_df["amenities"].apply(amenities_to_list)

nyc_listings_df = nyc_listings_df.rename(columns={"room_type": "is_private_room"})
nyc_listings_df["is_private_room"] = nyc_listings_df["is_private_room"].apply(room_type_to_bool)
nyc_listings_df["is_private_bath"] = nyc_listings_df["bathrooms_text"].apply(bathrooms_text_to_bool)
nyc_listings_df = nyc_listings_df.rename(columns={"bathrooms_text": "baths"})
nyc_listings_df["baths"] = nyc_listings_df["baths"].apply(bathroom_text_to_int)
nyc_listings_df["is_private_overall"] = nyc_listings_df["is_private_room"] & nyc_listings_df["is_private_bath"]

nyc_listings_df["host_since"] = pd.to_datetime(nyc_listings_df["host_since"])
nyc_listings_df["last_review"] = pd.to_datetime(nyc_listings_df["last_review"])

print(nyc_listings_df.dtypes)
print(nyc_listings_df.shape)
print(nyc_listings_df)


id                               float64
name                              object
description                       object
host_since                datetime64[ns]
host_about                        object
host_response_time                 int64
host_response_rate               float64
host_acceptance_rate             float64
host_is_superhost                   bool
host_listings_count              float64
host_has_profile_pic                bool
host_identity_verified              bool
neighbourhood_cleansed            object
is_private_room                     bool
accommodates                       int64
baths                            float64
bedrooms                         float64
beds                             float64
amenities                         object
price                            float64
minimum_nights                     int64
maximum_nights                     int64
number_of_reviews                  int64
last_review               datetime64[ns]
review_scores_ra

### Data Limitations


### Exploratory Data Analysis

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

In [None]:
nyc_listings_df = pd.read_csv("nyc_listings.csv")
nyc_calendar_df = pd.read_csv("nyc_calendar.csv")

In [None]:
%sql SELECT DISTINCT host_response_time FROM nyc_listings_df 

In [None]:
%sql SELECT DISTINCT room_type FROM nyc_listings_df

In [None]:
%sql SELECT DISTINCT bathrooms_text FROM nyc_listings_df

In [None]:
%sql SELECT COUNT(bathrooms_text) FROM nyc_listings_df WHERE bathrooms_text = '0 baths' 

In [None]:
%sql SELECT * FROM nyc_listings_df WHERE review_scores_rating IS NOT NULL ORDER BY review_scores_rating ASC 

In [None]:
nyc_calendar_df.head()

In [None]:
#dropping minimum_nights and maximum_nights (redundant & already exists in listings)
nyc_calendar_df = nyc_calendar_df.drop(columns = 'minimum_nights')
nyc_calendar_df = nyc_calendar_df.drop(columns = 'maximum_nights')
nyc_calendar_df.head()

In [None]:
%sql SELECT listing_id, COUNT(available) AS days_booked FROM nyc_calendar_df WHERE available = 'f' GROUP BY listing_id ORDER BY days_booked DESC
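The same per-listing count can be computed directly in pandas, which makes it easier to join back onto the listing data. A sketch on a made-up calendar frame with the same `listing_id` and `available` columns. Note that `available = 'f'` may also cover dates a host blocked, so `days_booked` is a proxy for bookings rather than an exact count.

```python
import pandas as pd

# Toy calendar rows; 'f' marks a date that is not available.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "available": ["f", "f", "t", "f", "t"],
})

days_booked = (
    toy_calendar[toy_calendar["available"] == "f"]
    .groupby("listing_id")
    .size()                        # number of unavailable dates per listing
    .sort_values(ascending=False)
    .rename("days_booked")
    .reset_index()
)
```

A follow-up `merge` on `listing_id` against the listing `id` column would attach this count to each listing's rating.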

### Questions for Reviewers