## INFO 2950 Final Project
Larrisa Chen (lc949), Michelle Li (myl39), Christina Jin (cej65), Jade Eggleston (jce76)

### Research question
***Do the criteria for a successful Airbnb differ across U.S. regions?***

Success is defined as:
- a high number of bookings combined with a high listing rating

The criteria considered are:
- Price
- Number of Beds
- Number of Baths
- Host ratings
- Number of reviews 
- Private or Public
- Proximity to urban center (most popular neighborhood)
- Neighborhood
- Keywords in names
- Keywords in description
- Host response rate / time
- Room type
- Amenities (binary)
    - varies by region

The list of amenities is scraped from each of the cities listed below and aggregated to find the top 15 in each region. 

Ultimately, we want to identify the attributes that contribute to a successful Airbnb listing and compare these listings across different regions in the U.S. (Northeast, Southeast, Southwest, West, Northwest, and Midwest).

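The top-15 amenity aggregation described in this section could be sketched roughly as follows. It assumes the `amenities` column parses as a JSON list of strings (as the listings preview suggests). The helper names `top_amenities` and `amenity_flags` and the sample rows are hypothetical and only illustrate the idea.

```python
import json
from collections import Counter

def top_amenities(amenity_strings, n=15):
    """Return the n most common amenities across JSON-encoded amenity lists."""
    counts = Counter()
    for raw in amenity_strings:
        # set() so a listing that repeats an amenity is only counted once
        counts.update(set(json.loads(raw)))
    return [name for name, _ in counts.most_common(n)]

def amenity_flags(raw, top):
    """Binary indicators (1/0) for whether one listing has each top amenity."""
    present = set(json.loads(raw))
    return {name: int(name in present) for name in top}

# Toy rows for illustration only; they are not taken from the real data.
sample = [
    '["Heating", "Kitchen", "Wifi"]',
    '["Heating", "Wifi"]',
    '["Heating", "Elevator"]',
]
top = top_amenities(sample, n=2)
print(top)
print(amenity_flags(sample[2], top))
```

Running `top_amenities` once per region and then `amenity_flags` per listing would give the binary amenity columns that vary by region.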

### Data Origin/Description

Our data is sourced from http://insideairbnb.com/get-the-data/.

We aggregated data for the city that receives the most inbound tourism in each region, according to *link*.
The cities being analyzed are as follows:
- New England: Boston
- Middle Atlantic: New York City
- East North Central: Chicago
- West North Central: Minneapolis
- South Atlantic: Miami
- East South Central: Nashville
- West South Central: San Antonio
- Mountain: Las Vegas
- Pacific: Los Angeles

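One way to keep the division labels attached to each listing is to tag every city's frame before concatenating. The sketch below assumes one listings DataFrame per city. The function name `tag_division` and the toy frames are hypothetical stand-ins for the real per-city CSV files.

```python
import pandas as pd

# Division -> city, exactly as listed above.
CITY_BY_DIVISION = {
    "New England": "Boston",
    "Middle Atlantic": "New York City",
    "East North Central": "Chicago",
    "West North Central": "Minneapolis",
    "South Atlantic": "Miami",
    "East South Central": "Nashville",
    "West South Central": "San Antonio",
    "Mountain": "Las Vegas",
    "Pacific": "Los Angeles",
}

def tag_division(frames_by_city):
    """Concatenate per-city listing frames, adding city and division columns."""
    division_by_city = {city: div for div, city in CITY_BY_DIVISION.items()}
    tagged = [
        df.assign(city=city, division=division_by_city[city])
        for city, df in frames_by_city.items()
    ]
    return pd.concat(tagged, ignore_index=True)

# Toy frames for illustration; real frames would come from pd.read_csv.
frames = {
    "Boston": pd.DataFrame({"id": [1, 2]}),
    "Miami": pd.DataFrame({"id": [3]}),
}
all_listings = tag_division(frames)
print(all_listings)
```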


For each city, we want to analyze:
- Listing data with columns:
    - id
    - name
    - description
    - host_since
    - host_about
    - host_response_time
    - host_response_rate
    - host_acceptance_rate
    - host_is_superhost
    - host_listings_count
    - host_has_profile_pic
    - host_identity_verified
    - neighbourhood_cleansed
    - neighbourhood_group_cleansed
    - room_type
    - accommodates
    - bathrooms_text
    - bedrooms
    - beds
    - amenities
    - price
    - minimum_nights
    - maximum_nights
    - number_of_reviews
    - last_review
    - review_scores_rating
    - instant_bookable

- and booking data with columns:
    - list columns here


### Data Collection & Cleaning

We begin by removing all rows that contain NaN, so that every column holds a single type during analysis.

Then, we analyze the columns and manually delete the following:
- listing_url
    - Repetitive data 
- Maximum_maximum_nights
    - Repetitive data
- Minimum_nights_avg_ntm
    - Repetitive data
- Maximum_nights_avg_ntm
    - Repetitive data
- Calendar_updated
    - Only contains empty data
- Has_availability
    - All true, redundant
- availability_30
    - Assuming users only evaluate criteria for listings that are available
- availability_60
    - Assuming users only evaluate criteria for listings that are available
- availability_90
    - Assuming users only evaluate criteria for listings that are available
- Availability_365
    - Assuming users only evaluate criteria for listings that are available
- calendar_last_scraped
- Number_of_reviews_ltm
    - Correlated to number_of_reviews, redundant
- Number_of_reviews_l30d
    - Correlated to number_of_reviews, redundant
- First_review
    - Redundant information because we have host_since
- Review_scores_accuracy
    - Correlated to review_scores_rating
- Review_scores_cleanliness
    - Correlated to review_scores_rating
- Review_scores_checkin
    - Correlated to review_scores_rating
- Review_scores_communication
    - Correlated to review_scores_rating
- Review_scores_location
    - Correlated to review_scores_rating
- Review_scores_value
    - Correlated to review_scores_rating
- License
    - Only contains empty data
- calculated_host_listings_count
- Calculated_host_listings_count_entire_homes
    - Information about the host’s other listings is not relevant to this listing
- Calculated_host_listings_count_private_rooms
    - Information about the host’s other listings is not relevant to this listing
- Calculated_host_listings_count_shared_rooms
    - Information about the host’s other listings is not relevant to this listing
- Reviews_per_month
    - Depends too heavily on guests’ length of stay; not a relevant metric
- Neighbourhood_group_cleansed
    - Inconsistent across different cities
- House_availability
    - If a listing is unavailable, users will not view it, so by default it cannot be the best listing

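A minimal sketch of the cleaning steps in this section, using pandas. It differs from the order described above in one respect: columns are dropped before rows, because a column that is entirely empty (such as those marked "Only contains empty data") would otherwise make `dropna()` remove every row. The lowercase column names, the shortened `DROP_COLUMNS` list, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A few of the columns listed above; lowercase names are assumed to match the CSV headers.
DROP_COLUMNS = ["listing_url", "calendar_updated", "license", "availability_30"]

def clean_listings(df, drop_columns=DROP_COLUMNS):
    """Drop unwanted columns first, then drop rows that still contain NaN."""
    # errors="ignore" skips names that are absent from this particular file
    trimmed = df.drop(columns=drop_columns, errors="ignore")
    return trimmed.dropna().reset_index(drop=True)

# Toy frame for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "license": [np.nan, np.nan, np.nan],  # entirely empty column
    "beds": [1.0, np.nan, 2.0],
})

print(len(toy.dropna()))  # the empty column alone removes every row
cleaned = clean_listings(toy)
print(cleaned)
```

If dropping every row with any NaN proves too aggressive, `dropna(subset=[...])` restricted to the columns used in the analysis is a gentler alternative.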

### Data Limitations


### Exploratory Data Analysis

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

In [16]:
nyc_listings_df = pd.read_csv("nyc_listings.csv")
nyc_calendar_df = pd.read_csv("nyc_calendar.csv")

In [4]:
%sql SELECT DISTINCT host_response_time FROM nyc_listings_df 

| | host_response_time |
| --- | --- |
| 0 | within an hour |
| 1 | within a day |
| 2 | |
| 3 | within a few hours |
| 4 | a few days or more |

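Since the four response-time labels have a natural order, one option is to encode them as an ordered category. The sketch below is only an illustration. The fastest-to-slowest ordering is our reading of the labels, and `encode_response_time` is a hypothetical helper name.

```python
import pandas as pd

# Fastest to slowest; the four labels come from the query output above.
RESPONSE_ORDER = [
    "within an hour",
    "within a few hours",
    "within a day",
    "a few days or more",
]

def encode_response_time(series):
    """Map response-time labels to 0..3; missing values become -1."""
    cat = pd.Categorical(series, categories=RESPONSE_ORDER, ordered=True)
    return pd.Series(cat.codes, index=series.index)

sample = pd.Series(["within a day", None, "within an hour"])
print(encode_response_time(sample).tolist())
```

The `-1` code marks the empty value seen in row 2 of the output, so missing responses stay distinguishable from real categories.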

In [15]:
%sql SELECT DISTINCT room_type FROM nyc_listings_df

| | room_type |
| --- | --- |
| 0 | Private room |
| 1 | Entire home/apt |
| 2 | Hotel room |
| 3 | Shared room |

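Because `room_type` is a small set of unordered labels, it could be turned into indicator columns before modeling. This is a sketch using `pd.get_dummies`; the toy frame and the `room` prefix are illustrative choices, not part of the project code.

```python
import pandas as pd

# Toy frame for illustration; the labels are taken from the query output above.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# One 0/1 column per room type
encoded = pd.get_dummies(listings, columns=["room_type"], prefix="room", dtype=int)
print(encoded.columns.tolist())
print(encoded["room_Private room"].tolist())
```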

In [17]:
%sql SELECT DISTINCT bathrooms_text FROM nyc_listings_df

| | bathrooms_text |
| --- | --- |
| 0 | 1 shared bath |
| 1 | 1 bath |
| 2 | |
| 3 | 2.5 baths |
| 4 | 1.5 baths |
| 5 | 1 private bath |
| 6 | 1.5 shared baths |
| 7 | 2 baths |
| 8 | 2 shared baths |
| 9 | Shared half-bath |

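The `bathrooms_text` values mix a count with a shared/private qualifier, so a numeric "Number of Baths" criterion needs some parsing. The sketch below is one possible approach. `parse_bathrooms` is a hypothetical helper, and treating "half-bath" as 0.5 is our assumption.

```python
import re

def parse_bathrooms(text):
    """Split a bathrooms_text value into (count, is_shared)."""
    # Missing values (None, NaN, empty strings) carry no information
    if not isinstance(text, str) or not text.strip():
        return None, None
    lowered = text.lower()
    is_shared = "shared" in lowered
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    if match:
        count = float(match.group())
    elif "half" in lowered:
        count = 0.5  # assumption: e.g. "Shared half-bath" has no digits
    else:
        count = None
    return count, is_shared

for value in ["1 shared bath", "2.5 baths", "Shared half-bath", None]:
    print(value, "->", parse_bathrooms(value))
```

With pandas this could be applied as `nyc_listings_df["bathrooms_text"].apply(parse_bathrooms)` and the tuples split into two columns.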

In [20]:
%sql SELECT COUNT(bathrooms_text) FROM nyc_listings_df WHERE bathrooms_text = '0 baths' 

| | count(bathrooms_text) |
| --- | --- |
| 0 | 52 |


In [22]:
%sql SELECT * FROM nyc_listings_df WHERE review_scores_rating IS NOT NULL ORDER BY review_scores_rating ASC 

Unnamed: 0,id,name,description,host_since,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,last_review,review_scores_rating,instant_bookable
0,201514.0,Litas New York Apartments - Analita's Suite,You stay in a lovely historical building from ...,2011-08-17,I am the owner of this beautifully restored la...,,,,f,1.0,...,1.0,1.0,"[""First aid kit"", ""Body soap"", ""Microwave"", ""S...",$150.00,30,60,1,2017-10-05,0.0,f
1,263232.0,Cozy&Clean in a great neighborhood,"My place is good for couples, solo adventurers...",2011-11-07,I work in NYC as a Fashion Designer. I am from...,within a day,50%,0%,f,1.0,...,1.0,2.0,"[""Elevator"", ""Heating"", ""TV"", ""Security camera...",$157.00,30,30,1,2019-01-27,0.0,f
2,278876.0,"Large, furnished room in a 2 bedroom!",This is a large room in a first floor apartmen...,2011-01-30,I'm a late twenties graduate student working t...,,,,f,1.0,...,1.0,1.0,"[""Hair dryer"", ""Essentials"", ""Iron"", ""Heating""...",$60.00,30,30,1,2017-03-18,0.0,f
3,499249.0,"WILLIAMSBURG FOR 25 DAYS, CHEAP!",<b>The space</b><br />I'm going to be out of B...,2012-05-25,,,,,f,1.0,...,1.0,,"[""Long term stays allowed""]",$190.00,30,218,1,2012-06-28,0.0,f
4,655472.0,Lovely 2-room Studio in Crown Hghts,Two room sun-lit apartment with large private ...,2010-12-24,I have been living in Brooklyn 7 years. Musici...,,,,f,1.0,...,1.0,1.0,"[""Heating"", ""Kitchen"", ""Air conditioning"", ""Wi...",$60.00,30,60,1,2012-08-25,0.0,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31514,10905293.0,Hoboken 1BR | Desk+WiFi | Near Hospitals | by GLS,Guests will experience the stay of a lifetime ...,2015-03-30,,within an hour,100%,92%,f,437.0,...,1.0,2.0,"[""Body soap"", ""Microwave"", ""Stove"", ""Coffee ma...",$325.00,9,1125,2,2018-02-16,5.0,t
31515,20583441.0,Classic Hoboken 1BR | Work Desk + WiFi | near NYC,Guests will experience the stay of a lifetime ...,2015-03-30,,within an hour,100%,92%,f,437.0,...,1.0,2.0,"[""Body soap"", ""Microwave"", ""Stove"", ""Coffee ma...",$325.00,9,1125,2,2021-05-31,5.0,t
31516,38361585.0,Home Suite Away-Min From JFK-Walking to UBS Arena,,2019-08-31,I'm a hospitality professional within the F&B ...,within an hour,100%,100%,f,1.0,...,2.0,3.0,"[""Microwave"", ""Coffee maker"", ""Smart lock"", ""L...",$125.00,27,30,6,2022-08-30,5.0,f
31517,35372621.0,Huge Room minutes away to NYC. Easy commute!,,2016-05-22,Wether you think you can or can’t...you’re rig...,within an hour,80%,67%,f,1.0,...,,1.0,"[""Essentials"", ""Lock on bedroom door"", ""Hot wa...",$69.00,2,5,3,2019-08-24,5.0,t

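The preview shows that `price` is stored as text like `$150.00` and the host rates as text like `50%`, so both need converting before they can serve as numeric criteria. A possible sketch follows. The helper names are hypothetical, and the `$1,250.00` row is an invented example of a thousands separator.

```python
import pandas as pd

def money_to_float(series):
    """Convert strings like "$1,250.00" to floats."""
    # Strip the dollar sign and thousands separators before parsing
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

def percent_to_fraction(series):
    """Convert strings like "50%" to fractions; missing values stay NaN."""
    return pd.to_numeric(series.str.rstrip("%"), errors="coerce") / 100

sample = pd.DataFrame({
    "price": ["$150.00", "$1,250.00"],  # second value is hypothetical
    "host_response_rate": ["50%", None],
})
sample["price"] = money_to_float(sample["price"])
sample["host_response_rate"] = percent_to_fraction(sample["host_response_rate"])
print(sample)
```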

In [5]:
nyc_calendar_df.head()

| | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539 | 2022-09-07 | f | $299.00 | $299.00 | 30.0 | 730.0 |
| 1 | 2539 | 2022-09-08 | f | $299.00 | $299.00 | 30.0 | 730.0 |
| 2 | 2539 | 2022-09-09 | f | $299.00 | $299.00 | 30.0 | 730.0 |
| 3 | 2539 | 2022-09-10 | f | $299.00 | $299.00 | 30.0 | 730.0 |
| 4 | 2539 | 2022-09-11 | f | $299.00 | $299.00 | 30.0 | 730.0 |


In [6]:
# Drop minimum_nights and maximum_nights (redundant; they already exist in listings)
nyc_calendar_df = nyc_calendar_df.drop(columns=['minimum_nights', 'maximum_nights'])
nyc_calendar_df.head()

| | listing_id | date | available | price | adjusted_price |
| --- | --- | --- | --- | --- | --- |
| 0 | 2539 | 2022-09-07 | f | $299.00 | $299.00 |
| 1 | 2539 | 2022-09-08 | f | $299.00 | $299.00 |
| 2 | 2539 | 2022-09-09 | f | $299.00 | $299.00 |
| 3 | 2539 | 2022-09-10 | f | $299.00 | $299.00 |
| 4 | 2539 | 2022-09-11 | f | $299.00 | $299.00 |


In [14]:
%sql SELECT listing_id, COUNT(available) AS days_booked FROM nyc_calendar_df WHERE available = 'f' GROUP BY listing_id ORDER BY days_booked DESC

| | listing_id | days_booked |
| --- | --- | --- |
| 0 | 64015 | 366 |
| 1 | 712136 | 365 |
| 2 | 769175 | 365 |
| 3 | 776257 | 365 |
| 4 | 785097 | 365 |
| ... | ... | ... |
| 39075 | 41494959 | 1 |
| 39076 | 96471 | 1 |
| 39077 | 10384214 | 1 |
| 39078 | 14408114 | 1 |

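Combining these per-listing counts with `review_scores_rating` gives one possible way to operationalize "success" as defined in the research question. The sketch below is an assumption-heavy illustration: `label_success` is a hypothetical helper, the median cut-offs are arbitrary defaults, and the toy frames are invented. Note also that `available = 'f'` may include dates the host blocked, so `days_booked` is best read as an upper bound on actual bookings.

```python
import pandas as pd

def label_success(listings, calendar, booked_q=0.5, rating_q=0.5):
    """Flag listings at or above the chosen quantiles of bookings and rating."""
    # Count unavailable days per listing, mirroring the SQL query above
    days_booked = (
        calendar.loc[calendar["available"] == "f"]
        .groupby("listing_id")
        .size()
        .rename("days_booked")
        .reset_index()
    )
    merged = listings.merge(days_booked, left_on="id", right_on="listing_id", how="left")
    # Listings with no unavailable days get a count of 0
    merged["days_booked"] = merged["days_booked"].fillna(0).astype(int)
    booked_cut = merged["days_booked"].quantile(booked_q)
    rating_cut = merged["review_scores_rating"].quantile(rating_q)
    merged["successful"] = (merged["days_booked"] >= booked_cut) & (
        merged["review_scores_rating"] >= rating_cut
    )
    return merged.drop(columns="listing_id")

# Toy frames for illustration only.
listings = pd.DataFrame({"id": [1, 2, 3], "review_scores_rating": [5.0, 3.0, 4.5]})
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 3, 3],
    "available": ["f", "f", "f", "f", "f", "t"],
})
result = label_success(listings, calendar)
print(result)
```

Raising `booked_q` and `rating_q` (for example to 0.75) would make the definition of "high" stricter.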

### Questions for Reviewers