## INFO 2950 Final Project
Larrisa Chen (lc949), Michelle Li (myl39), Christina Jin (cej65), Jade Eggleston (jce76)

### Research question
***Do the criteria for a successful Airbnb differ in U.S. regions?***

Where success is defined as:
- high number of bookings combined with high listing rating

and the criteria are defined by:
- Price
- Number of Beds
- Number of Baths
- Host ratings
- Number of reviews 
- Private or Public
- Proximity to urban center (most popular neighborhood)
- Neighborhood
- Keywords in names
- Keywords in description
- Host response rate / time
- Room type
- Amenities (binary)
    - varies by region

The list of amenities is scraped from each of the cities listed below and aggregated to find the top 15 in each region. 
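The aggregation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the `amenities` column holds a JSON-encoded list of strings and that a `region` column has already been attached to each listing (that column is our own addition, not part of the source files).

```python
import json
import pandas as pd

def top_amenities(listings: pd.DataFrame, n: int = 15) -> pd.DataFrame:
    """Return the n most common amenities within each region."""
    exploded = (
        listings.assign(amenity=listings["amenities"].apply(json.loads))
        .explode("amenity")           # one row per (listing, amenity)
        .dropna(subset=["amenity"])   # listings with an empty amenity list
    )
    counts = (
        exploded.groupby(["region", "amenity"])
        .size()
        .reset_index(name="listings")
        .sort_values(["region", "listings", "amenity"],
                     ascending=[True, False, True])
    )
    return counts.groupby("region").head(n).reset_index(drop=True)

# Toy rows standing in for the real listings files.
toy = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain"],
    "amenities": ['["Wifi", "Kitchen"]', '["Wifi"]', '["Pool", "Wifi"]'],
})
top = top_amenities(toy, n=1)
```

Each of the resulting top amenities can then become a binary column per listing, which matches the "Amenities (binary)" criterion.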

Ultimately, we want to deduce the attributes that contribute to a successful Airbnb listing and compare these listings across different regions in the U.S. (Northeast, Southeast, Southwest, West, Northwest, and Midwest).
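One possible way to make the success definition concrete is to flag listings that are high on both measures within their own region. The sketch below is only an assumption about how the two could be combined: the `booked_days` and `region` columns and the 0.75 quantile cutoff are hypothetical choices, not something fixed by the project.

```python
import pandas as pd

def flag_successful(listings: pd.DataFrame, quantile: float = 0.75) -> pd.Series:
    """Mark listings at or above the regional cutoff on both measures."""
    by_region = listings.groupby("region")
    # Per-region cutoffs, mapped back onto each listing's row.
    booked_cut = listings["region"].map(by_region["booked_days"].quantile(quantile))
    rating_cut = listings["region"].map(
        by_region["review_scores_rating"].quantile(quantile)
    )
    return (listings["booked_days"] >= booked_cut) & (
        listings["review_scores_rating"] >= rating_cut
    )

# Toy data: four listings in one region.
toy = pd.DataFrame({
    "region": ["Pacific"] * 4,
    "booked_days": [100, 200, 300, 365],
    "review_scores_rating": [4.0, 4.5, 4.9, 5.0],
})
flags = flag_successful(toy, quantile=0.5)
```

Computing the cutoffs per region keeps a listing from being judged against markets with very different booking volumes.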


### Data Origin/Description

Our data is sourced from http://insideairbnb.com/get-the-data/.

We aggregated data from the cities that receive the most inbound tourism in each region, according to *link*.
The cities being analyzed are as follows:
- New England: Boston
- Middle Atlantic: New York City
- East North Central: Chicago
- West North Central: Minneapolis
- South Atlantic: Miami
- East South Central: Nashville
- West South Central: San Antonio
- Mountain: Las Vegas
- Pacific: Los Angeles
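A small sketch of how the per-city data could be stacked into one table while keeping track of where each row came from. The `CITY_BY_REGION` mapping mirrors the list above; the helper name `combine_cities` and the added `city` and `region` columns are illustrative assumptions.

```python
import pandas as pd

CITY_BY_REGION = {
    "New England": "Boston",
    "Middle Atlantic": "New York City",
    "East North Central": "Chicago",
    "West North Central": "Minneapolis",
    "South Atlantic": "Miami",
    "East South Central": "Nashville",
    "West South Central": "San Antonio",
    "Mountain": "Las Vegas",
    "Pacific": "Los Angeles",
}

def combine_cities(frames: dict) -> pd.DataFrame:
    """Stack per-city frames, labelling each row with its city and region."""
    region_by_city = {city: region for region, city in CITY_BY_REGION.items()}
    tagged = [
        frame.assign(city=city, region=region_by_city[city])
        for city, frame in frames.items()
    ]
    return pd.concat(tagged, ignore_index=True)

# Toy frames standing in for the real listings files.
combined = combine_cities({
    "Boston": pd.DataFrame({"id": [1]}),
    "Miami": pd.DataFrame({"id": [2, 3]}),
})
```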



For each city, we want to analyze:
- Listing data with columns:
    - id
    - name
    - description
    - host_since
    - host_about
    - host_response_time
    - host_response_rate
    - host_acceptance_rate
    - host_is_superhost
    - host_listings_count
    - host_has_profile_pic
    - host_identity_verified
    - neighbourhood_cleansed
    - neighbourhood_group_cleansed
    - room_type
    - accommodates
    - bathrooms_text
    - bedrooms
    - beds
    - amenities
    - price
    - minimum_nights
    - maximum_nights
    - number_of_reviews
    - last_review
    - review_scores_rating
    - instant_bookable

- and booking data with columns:
    - list columns here
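The listing columns named above could be selected with a sketch like the following. The helper `select_listing_columns` is hypothetical; it reports any expected column that a given city's file lacks instead of raising an error, on the assumption that not every file shares exactly the same schema.

```python
import pandas as pd

LISTING_COLUMNS = [
    "id", "name", "description", "host_since", "host_about",
    "host_response_time", "host_response_rate", "host_acceptance_rate",
    "host_is_superhost", "host_listings_count", "host_has_profile_pic",
    "host_identity_verified", "neighbourhood_cleansed",
    "neighbourhood_group_cleansed", "room_type", "accommodates",
    "bathrooms_text", "bedrooms", "beds", "amenities", "price",
    "minimum_nights", "maximum_nights", "number_of_reviews",
    "last_review", "review_scores_rating", "instant_bookable",
]

def select_listing_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the listing columns of interest, in the order listed."""
    missing = [col for col in LISTING_COLUMNS if col not in df.columns]
    if missing:
        print(f"Columns not found in this file: {missing}")
    return df[[col for col in LISTING_COLUMNS if col in df.columns]]

# Toy frame with one wanted column, one unwanted column.
toy = pd.DataFrame({"id": [1], "listing_url": ["x"], "price": ["$10.00"]})
selected = select_listing_columns(toy)
```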


### Data Collection & Cleaning

We begin by removing all rows that contain NaN, so that every column holds a single, consistent type during analysis.
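A minimal sketch of this step with pandas. It assumes that columns containing no data at all are dropped before the row filter, because otherwise every row would count as containing NaN and nothing would survive. The function name is illustrative.

```python
import pandas as pd

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop all-empty columns, then drop any row that still has a NaN."""
    non_empty = df.dropna(axis="columns", how="all")
    return non_empty.dropna(axis="index", how="any").reset_index(drop=True)

# Toy data: "license" is entirely empty; two rows have one missing value each.
toy = pd.DataFrame({
    "price": [100.0, None, 80.0],
    "beds": [1.0, 2.0, None],
    "license": [None, None, None],
})
cleaned = drop_incomplete_rows(toy)
```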

Then, we analyze the columns and manually delete the following:
- listing_url
    - Repetitive data 
- maximum_maximum_nights
    - Repetitive data
- minimum_nights_avg_ntm
    - Repetitive data
- maximum_nights_avg_ntm
    - Repetitive data
- calendar_updated
    - Only contains empty data
- has_availability
    - All true, redundant
- availability_30
    - Assuming users only evaluate criteria for listings that are available
- availability_60
    - Assuming users only evaluate criteria for listings that are available
- availability_90
    - Assuming users only evaluate criteria for listings that are available
- availability_365
    - Assuming users only evaluate criteria for listings that are available
- calendar_last_scraped
- number_of_reviews_ltm
    - Correlated to number_of_reviews, redundant
- number_of_reviews_l30d
    - Correlated to number_of_reviews, redundant
- first_review
    - Redundant information because we have host_since
- review_scores_accuracy
    - Correlated to review_scores_rating
- review_scores_cleanliness
    - Correlated to review_scores_rating
- review_scores_checkin
    - Correlated to review_scores_rating
- review_scores_communication
    - Correlated to review_scores_rating
- review_scores_location
    - Correlated to review_scores_rating
- review_scores_value
    - Correlated to review_scores_rating
- license
    - Only contains empty data
- calculated_host_listings_count
- calculated_host_listings_count_entire_homes
    - Information about the host’s other listings is not relevant to this listing
- calculated_host_listings_count_private_rooms
    - Information about the host’s other listings is not relevant to this listing
- calculated_host_listings_count_shared_rooms
    - Information about the host’s other listings is not relevant to this listing
- reviews_per_month
    - Depends too heavily on how long other guests stay; not a relevant metric
- neighbourhood_group_cleansed
    - Inconsistent across different cities
- house_availability
    - If a listing is unavailable, users will not view it, so by default it cannot be the best listing
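Several of the columns above are dropped because they are described as correlated with `review_scores_rating`. A quick check along these lines could confirm that before the columns are removed. This is a sketch under the assumption that the review score columns are numeric; the helper name is hypothetical.

```python
import pandas as pd

SUB_SCORES = [
    "review_scores_accuracy", "review_scores_cleanliness",
    "review_scores_checkin", "review_scores_communication",
    "review_scores_location", "review_scores_value",
]

def correlation_with_rating(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each sub-score with the overall rating."""
    present = [col for col in SUB_SCORES if col in df.columns]
    return (
        df[present]
        .corrwith(df["review_scores_rating"])
        .sort_values(ascending=False)
    )

# Toy data with one strongly and one weakly related sub-score.
toy = pd.DataFrame({
    "review_scores_rating": [4.0, 4.5, 5.0],
    "review_scores_accuracy": [4.1, 4.6, 5.0],
    "review_scores_value": [5.0, 4.0, 4.5],
})
corr = correlation_with_rating(toy)
# Drop the sub-scores; errors="ignore" skips any that are absent.
reduced = toy.drop(columns=SUB_SCORES, errors="ignore")
```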


### Data Limitations


### Exploratory Data Analysis

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

In [3]:
nyc_listings_df = pd.read_csv("nyc_listings.csv")
nyc_calendar_df = pd.read_csv("nyc_calendar.csv")

In [4]:
%sql SELECT DISTINCT host_response_time FROM nyc_listings_df 

Unnamed: 0,host_response_time
0,within an hour
1,within a day
2,
3,within a few hours
4,a few days or more
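Since the distinct values above have a natural order, one option is to store `host_response_time` as an ordered categorical so it can be compared or converted to integer codes. The ordering below (fastest first) is our assumption about how the values should rank, and missing values keep the code -1.

```python
import pandas as pd

# Fastest to slowest, using the distinct values shown in the output above.
RESPONSE_ORDER = [
    "within an hour",
    "within a few hours",
    "within a day",
    "a few days or more",
]

def encode_response_time(values: pd.Series) -> pd.Series:
    """Convert response-time strings to an ordered categorical."""
    dtype = pd.CategoricalDtype(categories=RESPONSE_ORDER, ordered=True)
    return values.astype(dtype)

# Toy series in the same order as the query result, including the missing value.
toy = pd.Series([
    "within an hour", "within a day", None,
    "within a few hours", "a few days or more",
])
encoded = encode_response_time(toy)
codes = encoded.cat.codes  # -1 marks a missing response time
```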


In [5]:
nyc_calendar_df.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2539,2022-09-07,f,$299.00,$299.00,30.0,730.0
1,2539,2022-09-08,f,$299.00,$299.00,30.0,730.0
2,2539,2022-09-09,f,$299.00,$299.00,30.0,730.0
3,2539,2022-09-10,f,$299.00,$299.00,30.0,730.0
4,2539,2022-09-11,f,$299.00,$299.00,30.0,730.0


In [6]:
# Drop minimum_nights and maximum_nights (redundant; already present in listings)
nyc_calendar_df = nyc_calendar_df.drop(columns=["minimum_nights", "maximum_nights"])
nyc_calendar_df.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price
0,2539,2022-09-07,f,$299.00,$299.00
1,2539,2022-09-08,f,$299.00,$299.00
2,2539,2022-09-09,f,$299.00,$299.00
3,2539,2022-09-10,f,$299.00,$299.00
4,2539,2022-09-11,f,$299.00,$299.00
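The calendar columns shown above are all strings, so they would need converting before any numeric work. The sketch below is one way to do it; it assumes prices may contain thousands separators and that available nights are marked `t` (only `f` appears in the rows shown), and the function name is illustrative.

```python
import pandas as pd

def clean_calendar(calendar: pd.DataFrame) -> pd.DataFrame:
    """Convert price strings to floats, flags to booleans, dates to datetimes."""
    cleaned = calendar.copy()
    for col in ("price", "adjusted_price"):
        # "$1,299.00" -> 1299.0
        cleaned[col] = (
            cleaned[col].str.replace(r"[$,]", "", regex=True).astype(float)
        )
    # Assumption: "t" means available, "f" means not available.
    cleaned["available"] = cleaned["available"].map({"t": True, "f": False})
    cleaned["date"] = pd.to_datetime(cleaned["date"])
    return cleaned

# Toy rows in the same shape as nyc_calendar_df.
toy = pd.DataFrame({
    "listing_id": [2539, 2539],
    "date": ["2022-09-07", "2022-09-08"],
    "available": ["f", "t"],
    "price": ["$299.00", "$1,299.00"],
    "adjusted_price": ["$299.00", "$1,299.00"],
})
cleaned = clean_calendar(toy)
```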


In [7]:
%sql SELECT listing_id, COUNT(available) FROM nyc_calendar_df WHERE available = 'f' GROUP BY listing_id

Unnamed: 0,listing_id,count(available)
0,573671,365
1,590903,365
2,597624,365
3,600775,365
4,606269,365
...,...,...
39862,46783109,365
39863,14024257,365
39864,38238758,365
39865,38461254,365
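For comparison, a pandas version of the same count is sketched below. Unlike the SQL query, it keeps listings with zero unavailable days and also reports the share of calendar days that are unavailable. The output column names are our own, and treating an unavailable night as a booking is an assumption.

```python
import pandas as pd

def unavailable_days(calendar: pd.DataFrame) -> pd.DataFrame:
    """Count unavailable calendar days per listing and their share of all days."""
    summary = (
        calendar.assign(unavailable=calendar["available"].eq("f"))
        .groupby("listing_id")["unavailable"]
        .agg(unavailable_days="sum", calendar_days="size")
    )
    summary["unavailable_share"] = (
        summary["unavailable_days"] / summary["calendar_days"]
    )
    return summary.reset_index()

# Toy calendar: listing 1 is unavailable 2 of 3 days, listing 2 never.
toy = pd.DataFrame({
    "listing_id": [1, 1, 1, 2],
    "available": ["f", "f", "t", "t"],
})
summary = unavailable_days(toy)
```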


### Questions for Reviewers