# EDA and data cleaning

In this notebook, we go over the data we have and perform exploratory data analysis (EDA) and data cleaning.

In [22]:
# imports
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns 


## Dropping unnecessary columns
We start the cleaning process by dropping the columns we are sure we don't need.

In [23]:
# read the preprocessed data
df = pd.read_csv('./../data/austin_listings_processed.csv')
print(f'the size of our data is {df.shape}')
df.head(2)

the size of our data is (47037, 81)




Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,occ_rate_calendar,active_duration_days,occ_rate_70,occ_rate_50,occ_rate_30,time_quarter
0,5456,https://www.airbnb.com/rooms/5456,20231215200307,2023-12-16,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,https://a0.muscache.com/pictures/14084884/b5a3...,8028,...,1,0,0,3.71,0.3,5390.0,0.7,0.7,0.7,Q4
1,5769,https://www.airbnb.com/rooms/5769,20231215200307,2023-12-16,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,https://a0.muscache.com/pictures/23822033/ac94...,8186,...,0,1,0,1.76,0.7,5404.0,0.388601,0.544041,0.7,Q4


In [24]:
# print list of the columns
print(list(df.columns))


['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availabil

Let us retain the columns we might want to use later and drop only those that certainly won't be used.

In [25]:

columns_to_keep = ['id', 'source', 'name', 'description','neighborhood_overview',
                   'host_is_superhost', 'neighbourhood_cleansed', 'latitude',
                   'longitude', 'property_type', 'room_type', 'accommodates',
                   'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
                   'minimum_nights', 'maximum_nights', 'number_of_reviews',
                   'review_scores_rating','occ_rate_50', 'time_quarter']
# keep the selected columns and give two of them clearer names;
# chaining rename() returns a new frame instead of renaming a slice in place
df = df[columns_to_keep].rename(columns={'neighbourhood_cleansed': 'zipcode',
                                         'occ_rate_50': 'occupancy_rate'})
df.head(3)


Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
0,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,,2.0,[],$101.00,2,90,668,4.84,0.7,Q4
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],,1,14,294,4.91,0.544041,Q4
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,30.24885,-97.73587,Entire guesthouse,...,,1.0,[],,30,90,120,4.97,0.7,Q4


Let us start our analysis by looking at the missing values (NaNs).

In [18]:
df.isna().sum().sort_values(ascending=False)

bedrooms                  17307
neighborhood_overview     16070
description               12515
price                      2135
host_is_superhost          2058
beds                        363
bathrooms_text               34
id                            0
occ_rate_50                   0
review_scores_rating          0
number_of_reviews             0
maximum_nights                0
minimum_nights                0
amenities                     0
accommodates                  0
source                        0
room_type                     0
property_type                 0
longitude                     0
latitude                      0
neighbourhood_cleansed        0
name                          0
time_quarter                  0
dtype: int64
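
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both; the frame `toy` and its values are made up for illustration and are not taken from the listings data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$101.00", None, None, "$80.00"],
    "bedrooms": [np.nan, np.nan, np.nan, 2.0],
})

# Missing count and missing share per column, most incomplete column first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Applied to `df`, the same two expressions would show at a glance which columns are too sparse to rely on.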

Since price is a very important feature in our data, let us dive deeper into why it has 2,135 missing values.

## Missing prices

In [21]:
df[df['price'].isna()].head(5)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occ_rate_50,time_quarter
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],,1,14,294,4.91,0.544041,Q4
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,30.24885,-97.73587,Entire guesthouse,...,,1.0,[],,30,90,120,4.97,0.7,Q4
14,69810,previous scrape,Guesthouse in Austin · ★4.98 · 1 bedroom · 1 b...,,"Located in the cool Dawson area, about two mil...",f,78704,30.2309,-97.76619,Entire guesthouse,...,,1.0,[],,2,730,445,4.98,0.7,Q4
18,219168,previous scrape,Rental unit in Austin · ★4.88 · Studio · 1 bed...,,Despite its proximity to South Congress and th...,f,78704,30.24519,-97.74486,Entire rental unit,...,,1.0,[],,2,60,8,4.88,0.04308,Q4
19,72833,previous scrape,Guesthouse in Austin · ★4.91 · 1 bedroom · 1 b...,,Peaceful neighborhood street in central Austin...,t,78731,30.313,-97.75066,Entire guesthouse,...,,1.0,[],,3,60,413,4.91,0.7,Q4


Let us see if we can find further insight into which listings have missing prices.

In [30]:
df[df['price'].isna()].describe(include='object')

Unnamed: 0,source,name,description,neighborhood_overview,host_is_superhost,property_type,room_type,bathrooms_text,amenities,price,time_quarter
count,2135,2135,0.0,1333,2134,2135,2135,2132,2135,0.0,2135
unique,2,1151,0.0,1276,2,40,4,18,1,0.0,1
top,previous scrape,Rental unit in Austin · 1 bedroom · 1 bed · 1 ...,,"Great location, close to everything Austin has...",f,Entire home,Entire home/apt,1 bath,[],,Q4
freq,2115,110,,11,1828,631,1555,970,2135,,2135


In [32]:
df[df['price'].isna()].describe()


Unnamed: 0,id,zipcode,latitude,longitude,accommodates,bedrooms,beds,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate
count,2135.0,2135.0,2135.0,2135.0,2135.0,0.0,2122.0,2135.0,2135.0,2135.0,2135.0,2135.0
mean,1.153142e+17,78723.140984,30.275237,-97.743172,3.998126,,2.052309,5.290867,542.600937,23.096956,4.794993,0.189055
std,2.717539e+17,20.665209,0.058512,0.050071,2.380016,,1.447688,29.952152,527.063859,50.841213,0.444607,0.230274
min,5769.0,78701.0,30.13013,-98.00946,1.0,,1.0,1.0,1.0,1.0,0.0,0.00185
25%,13169790.0,78704.0,30.239023,-97.764025,2.0,,1.0,1.0,15.0,2.0,4.775,0.024379
50%,23680260.0,78722.0,30.26535,-97.73799,4.0,,2.0,2.0,365.0,6.0,4.97,0.076225
75%,47038120.0,78744.0,30.29657,-97.716055,5.5,,3.0,3.0,1125.0,21.5,5.0,0.266056
max,1.029898e+18,78759.0,30.50632,-97.575828,16.0,,14.0,1100.0,1125.0,687.0,5.0,0.7


In [33]:
df[df['price'].isna()]['id'].nunique()

2135

As we can see, all of the missing prices come from the Q4 scrape, and 2,115 of the 2,135 affected rows are marked "previous scrape", so there may have been an issue with scraping at that time (each of these rows also has a unique id). Looking further, none of the listings with a missing price has a description either, but this may not be a significant finding, because the description column has many missing values (12,515 in total) beyond the ones with no price. The same holds for bedrooms (17,307 missing in total).
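
The concentration of missing prices in one quarter and one source can also be checked directly by counting missing prices per group. This is a minimal sketch on a made-up frame `toy`; the column names match the notebook, but the rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the listings data (hypothetical values).
toy = pd.DataFrame({
    "source": ["city scrape", "previous scrape", "previous scrape",
               "city scrape", "previous scrape"],
    "time_quarter": ["Q4", "Q4", "Q4", "Q1", "Q1"],
    "price": ["$101.00", None, None, "$90.00", "$75.00"],
})

# Boolean flag per row: True where the price is missing.
price_missing = toy["price"].isna()

# Summing the flag within each group counts the missing prices per group.
missing_by_quarter = price_missing.groupby(toy["time_quarter"]).sum()
missing_by_source = price_missing.groupby(toy["source"]).sum()

print(missing_by_quarter)
print(missing_by_source)
```

Replacing `.sum()` with `.mean()` would give the share of missing prices per group instead of the count.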

Let us see how many unique ids our listings have.

In [29]:
df['id'].nunique()

15175

With 47,037 rows but only 15,175 unique ids, many listings appear in more than one scrape. We might be able to recover a missing price for a property by looking at the prices it was listed for at other times.
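
One possible way to act on this idea is sketched below: parse the price strings into numbers, then fill each missing price with the median of the same listing's prices from other quarters. The choice of the median, the helper columns `price_num` and `price_filled`, and the rows in `toy` are assumptions made for illustration, not the approach this notebook settles on. The regex also assumes prices may contain thousands separators such as `$1,105.00`.

```python
import pandas as pd

# Toy rows: the same listing id appears in several quarters (hypothetical values).
toy = pd.DataFrame({
    "id": [5769, 5769, 5769, 6413, 6413, 9999],
    "time_quarter": ["Q1", "Q2", "Q4", "Q3", "Q4", "Q4"],
    "price": ["$95.00", "$1,105.00", None, "$150.00", None, None],
})

# Strip the dollar sign and thousands separators, then convert to float.
toy["price_num"] = pd.to_numeric(
    toy["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Median known price per listing, broadcast back to every row of that listing.
listing_median = toy.groupby("id")["price_num"].transform("median")

# Fill only the missing prices; listings with no known price stay missing.
toy["price_filled"] = toy["price_num"].fillna(listing_median)

print(toy[["id", "time_quarter", "price_num", "price_filled"]])
```

Listings that never have a recorded price (id 9999 in the toy data) remain missing and would need a different strategy, such as dropping them or imputing from similar listings.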

In [31]:
df['amenities'].nunique()

33350