# Seattle Airbnb Data

The aim is to examine the Seattle Airbnb data and derive insights on what factors impact Airbnb listing prices. We explore 3 main questions:
- Do missing descriptions impact prices?
- How does ease of booking impact prices? (cancellation policy, instant bookability, requirements for booking, etc.)
- How do reviews impact prices? (number of reviews, scores, etc.)


### Install libraries

In [5]:
pip install -U scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp38-cp38-win_amd64.whl (9.2 MB)
Collecting joblib>=1.1.1
  Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
Installing collected packages: joblib, scikit-learn
  Attempting uninstall: joblib
    Found existing installation: joblib 1.0.1
    Uninstalling joblib-1.0.1:
      Successfully uninstalled joblib-1.0.1
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.24.1
    Uninstalling scikit-learn-0.24.1:
      Successfully uninstalled scikit-learn-0.24.1
Successfully installed joblib-1.3.2 scikit-learn-1.3.0
Note: you may need to restart the kernel to use updated packages.


In [23]:
import pandas as pd
import sklearn as sk
import numpy as np

try:
    %load_ext autotime
except ImportError:
    # autotime is not installed yet: install it, then load it
    !pip install ipython-autotime
    %load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 1.99 s (started: 2025-04-11 11:41:04 +01:00)


### Read in data (3 csvs)

In [33]:
#Read in calendar csv

calendar = pd.read_csv('calendar.csv')
calendar.head()

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00
1,241032,2016-01-05,t,$85.00
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


time: 641 ms (started: 2025-04-11 11:55:41 +01:00)


In [3]:
#Read in listings csv

listings = pd.read_csv('listings.csv')
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


time: 391 ms (started: 2025-04-11 11:22:43 +01:00)


In [4]:
#Read in reviews csv

reviews = pd.read_csv('reviews.csv')
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


time: 641 ms (started: 2025-04-11 11:22:46 +01:00)


### Exploratory analysis 

In [7]:
# Get number of rows of data (calendar)
calendar.shape[0]

1393570

time: 15 ms (started: 2025-04-11 11:24:13 +01:00)


In [8]:
# Get number of rows of data (listings)
listings.shape[0]

3818

time: 0 ns (started: 2025-04-11 11:24:14 +01:00)


In [6]:
# Get number of rows of data (reviews)
reviews.shape[0]

84849

time: 0 ns (started: 2025-04-11 11:23:49 +01:00)


In [9]:
# What is the timeframe for this data (max/min dates in calendar)

print(calendar['date'].max())
print(calendar['date'].min())

# Jan 2016 - Jan 2017

2017-01-02
2016-01-04
time: 188 ms (started: 2025-04-11 11:24:31 +01:00)


In [10]:
# How many listings are there in total?
# calendar['listing_id'].count()
print(str(calendar['listing_id'].agg('nunique')) + ' listings in calendar')
print(str(listings['id'].agg('nunique')) + ' listings in listings')
print(str(reviews['listing_id'].agg('nunique')) + ' listings in reviews')

# 3,818 listings

3818 listings in calendar
3818 listings in listings
3191 listings in reviews
time: 63 ms (started: 2025-04-11 11:25:12 +01:00)


In [11]:
# Is the min/max date the same for all listings? Or does it show the dates for when it came to market?
minmaxdate = calendar.groupby('listing_id').agg(max_date = ('date','max'), min_date = ('date','min'))
minmaxdate.head()

listing_id,max_date,min_date
3335,2017-01-02,2016-01-04
4291,2017-01-02,2016-01-04
5682,2017-01-02,2016-01-04
6606,2017-01-02,2016-01-04
7369,2017-01-02,2016-01-04


time: 1.28 s (started: 2025-04-11 11:25:29 +01:00)
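
The `head()` above shows only five listings, so it cannot confirm that every listing covers the same window. A minimal sketch of one way to check this, using a toy frame (`toy_calendar` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy calendar with the same key columns as calendar.csv; values are made up.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3],
    "date": ["2016-01-04", "2017-01-02", "2016-01-04", "2017-01-02", "2016-03-01"],
})

# First and last date per listing, as in the cell above.
date_range = toy_calendar.groupby("listing_id").agg(
    min_date=("date", "min"),
    max_date=("date", "max"),
)

# Number of distinct (min_date, max_date) pairs; 1 would mean all listings share one window.
n_windows = len(date_range.drop_duplicates())
print(n_windows)
```

Applied to the real `minmaxdate` frame, a result of 1 would suggest the calendar is a fixed scrape window rather than a record of when each listing came to market.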


In [20]:
#what is the type of the price column
calendar.dtypes

listing_id     int64
date          object
available     object
price         object
dtype: object

time: 0 ns (started: 2025-04-11 11:38:48 +01:00)


In [50]:
# Convert price to numeric: strip the literal dollar sign first.
# regex=False treats '$' as a plain character rather than the regex end-of-string anchor.
# Note: any value with a thousands separator (e.g. '$1,250.00') is coerced to NaN here.
calendar['price_avg'] = pd.to_numeric(calendar['price'].str.replace('$', '', regex=False), errors='coerce')
calendar.dtypes


listing_id      int64
date           object
available      object
price          object
price2        float64
price_avg     float64
dtype: object

time: 781 ms (started: 2025-04-11 12:06:00 +01:00)
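
Because `errors='coerce'` silently turns unparseable strings into NaN, a price containing a comma would drop out of the per-listing average. The sketch below shows one possible way to strip both characters before converting. The series `raw_price` is an illustrative stand-in for `calendar['price']`, not data from the file:

```python
import pandas as pd

# Toy stand-in for calendar['price']; values are illustrative only.
raw_price = pd.Series(["$85.00", "$1,250.00", None, "$99.50"])

# Remove the dollar sign and the thousands separator in one pass.
# Inside a character class, '$' is a literal character, not an anchor.
clean_price = pd.to_numeric(
    raw_price.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Missing inputs stay NaN; everything else becomes a float.
print(clean_price.tolist())
```

Whether this changes the averages in this notebook depends on whether the calendar actually contains prices of 1,000 or more, which the outputs above do not show.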


In [51]:
# Get average price for each listing
listing_price = calendar.groupby('listing_id')['price_avg'].mean()
listing_price.head()

listing_id
3335    120.000000
4291     82.000000
5682     53.944984
6606     92.849315
7369     85.000000
Name: price_avg, dtype: float64

time: 47 ms (started: 2025-04-11 12:06:02 +01:00)


In [52]:
#check no duplication
print(str(calendar['listing_id'].agg('nunique')) + ' listings in calendar')
print(listing_price.shape[0])

3818 listings in calendar
3818
time: 31 ms (started: 2025-04-11 12:06:06 +01:00)


In [53]:
# Join average listing price to listing df

# rename columns to match
listings = listings.rename(columns={"id": "listing_id"})

# join with price data
listings2 = listings.join(listing_price, on='listing_id', how='left')
listings2.shape[0]

3818

time: 32 ms (started: 2025-04-11 12:06:07 +01:00)
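
The row count staying at 3818 suggests the join did not duplicate listings. As an alternative sketch, `merge` can enforce that assumption and report unmatched rows directly. The frames `toy_listings` and `toy_price` are illustrative assumptions:

```python
import pandas as pd

# Toy stand-ins for the listings frame and the per-listing average price series.
toy_listings = pd.DataFrame({"listing_id": [10, 20, 30], "name": ["a", "b", "c"]})
toy_price = pd.Series(
    [85.0, 120.0],
    index=pd.Index([10, 20], name="listing_id"),
    name="price_avg",
)

merged = toy_listings.merge(
    toy_price.reset_index(),   # turn the index into a 'listing_id' column
    on="listing_id",
    how="left",
    validate="one_to_one",     # raises MergeError if either side has duplicate keys
    indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
)

# Listings that found no matching price row.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```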


In [57]:
listings2.head()

Unnamed: 0,listing_id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,price_avg
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,f,,WASHINGTON,f,moderate,f,f,2,4.07,85.0
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,f,,WASHINGTON,f,strict,t,t,6,1.48,170.931271
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,f,,WASHINGTON,f,strict,f,f,2,1.15,894.186047
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,f,,WASHINGTON,f,flexible,f,f,1,,100.0
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,f,,WASHINGTON,f,strict,f,f,1,0.89,462.739726


time: 47 ms (started: 2025-04-11 12:19:19 +01:00)


### 1) Do missing descriptions impact prices?

In [80]:
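# Create a helper that flags whether a text field is populated.
# Note: despite the '_NA' suffix, the flag is 1 when the field is present and 0 when it is missing.

def na_flag(df, col):
    df[(col + '_NA')] = np.where(df[col].isna(), 0, 1)
    # Mean price, then listing count, for missing (0) vs. populated (1)
    print(df.groupby((col + '_NA'))['price_avg'].mean())
    print(df.groupby((col + '_NA'))['listing_id'].count())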

time: 0 ns (started: 2025-04-11 13:10:43 +01:00)


In [81]:
na_flag(listings2, 'summary')

summary_NA
0    136.156921
1    135.341474
Name: price_avg, dtype: float64
summary_NA
0     177
1    3641
Name: listing_id, dtype: int64
time: 0 ns (started: 2025-04-11 13:10:45 +01:00)


In [82]:
na_flag(listings2, 'neighborhood_overview')

neighborhood_overview_NA
0    138.747845
1    134.136001
Name: price_avg, dtype: float64
neighborhood_overview_NA
0    1032
1    2786
Name: listing_id, dtype: int64
time: 16 ms (started: 2025-04-11 13:11:04 +01:00)


In [83]:
na_flag(listings2, 'space')

space_NA
0    129.993938
1    136.321921
Name: price_avg, dtype: float64
space_NA
0     569
1    3249
Name: listing_id, dtype: int64
time: 0 ns (started: 2025-04-11 13:11:12 +01:00)


In [84]:
na_flag(listings2, 'description')

description_NA
1    135.380034
Name: price_avg, dtype: float64
description_NA
1    3818
Name: listing_id, dtype: int64
time: 0 ns (started: 2025-04-11 13:11:29 +01:00)


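Having more information listed (e.g. a summary or neighborhood overview) does not necessarily go with higher prices. In the outputs above (flag 0 = field missing, 1 = field populated), listings without a `summary` average about 136.16 versus 135.34 for those with one, and listings without a `neighborhood_overview` average about 138.75 versus 134.14. `description` is never missing, so no comparison is possible there. The only field that may have an impact is `space`: listings that include it average about 136.32 versus 129.99 for those that do not.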
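
The group sizes here are quite unequal, so a gap of a few dollars in the mean could be noise. One way to gauge this is a permutation test on the difference in means. The sketch below is an assumption about how such a check could look, not part of the original analysis. The helper `permutation_mean_diff` is hypothetical and the data is synthetic:

```python
import numpy as np
import pandas as pd

def permutation_mean_diff(values, flag, n_perm=2000, seed=0):
    """Two-sided permutation test for mean(values | flag == 1) - mean(values | flag == 0)."""
    values = np.asarray(values, dtype=float)
    flag = np.asarray(flag).astype(bool)
    keep = ~np.isnan(values)               # ignore listings with no price
    values, flag = values[keep], flag[keep]
    observed = values[flag].mean() - values[~flag].mean()
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(flag)   # reassign labels, keeping group sizes
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic example: prices drawn independently of the flag.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "price_avg": rng.normal(130, 40, size=500),
    "space_NA": rng.integers(0, 2, size=500),
})
observed, p_value = permutation_mean_diff(toy["price_avg"], toy["space_NA"])
print(round(observed, 2), round(p_value, 3))
```

On the real data, the call would be `permutation_mean_diff(listings2['price_avg'], listings2['space_NA'])`. A small p-value would suggest the gap is unlikely to be due to chance alone.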

In [75]:
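# Does the amount of missing info relate to lower prices? E.g. if a listing is missing 1 vs 2 vs 3 of these fields
# Note: each flag is 1 when the field is populated, so count_NA is the number of populated fields (3 = nothing missing)

listings2['count_NA'] = listings2['summary_NA'] + listings2['neighborhood_overview_NA'] + listings2['space_NA']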
listings2.groupby(['count_NA'])['price_avg'].mean()

count_NA
1    132.952066
2    141.334069
3    134.621072
Name: price_avg, dtype: float64

time: 16 ms (started: 2025-04-11 13:08:15 +01:00)


In [77]:
listings2.groupby(['count_NA'])['listing_id'].count()

count_NA
1     597
2     584
3    2637
Name: listing_id, dtype: int64

time: 0 ns (started: 2025-04-11 13:09:17 +01:00)


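There is no clear pattern in average price by the number of populated fields: listings with two of the three fields populated average about 141.33, above both those with one (132.95) and those with all three (134.62).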

In [86]:
listings2.groupby(['count_NA', 'space_NA'])['price_avg'].mean()

count_NA  space_NA
1         0           131.020785
          1           139.565242
2         0           125.443594
          1           144.827233
3         1           134.621072
Name: price_avg, dtype: float64

time: 16 ms (started: 2025-04-11 13:14:11 +01:00)


In [85]:
listings2.groupby(['count_NA', 'space_NA'])['listing_id'].count()

count_NA  space_NA
1         0            464
          1            133
2         0            105
          1            479
3         1           2637
Name: listing_id, dtype: int64

time: 15 ms (started: 2025-04-11 13:13:46 +01:00)


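The `space` description seems to have an impact: among listings with the same number of populated fields, those that include `space` average more than those that do not (about 139.57 vs. 131.02 with one field populated, and 144.83 vs. 125.44 with two). These are raw differences in means and are not controlled for other listing attributes.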
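
Means can be pulled around by a handful of very expensive listings, so it may help to look at the median and the group size alongside the mean in a single table. A minimal sketch on made-up data (the frame `toy` and its prices are assumptions for illustration):

```python
import pandas as pd

# Toy data: one very expensive listing in the 'space populated' group.
toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 4, 5, 6],
    "space_NA": [0, 0, 1, 1, 1, 1],   # 1 = 'space' populated, 0 = missing
    "price_avg": [100.0, 120.0, 90.0, 110.0, 130.0, 900.0],
})

# Mean, median and count per group in one call.
summary = toy.groupby("space_NA")["price_avg"].agg(["mean", "median", "count"])
print(summary)
```

In this toy case the populated group's mean (307.5) is far above its median (120.0), which is the kind of gap that would signal an outlier-driven difference. The same `agg` call can be applied to `listings2` grouped by `['count_NA', 'space_NA']`.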

### 2) How does ease of booking impact prices?

### 3) How do reviews impact prices?