## Data Collection

In [1]:
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Data obtained from http://insideairbnb.com/get-the-data

df = pd.read_csv("./data/listings.csv")
df.shape



(41533, 18)

In [3]:
# Inspecting the data
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,275,21,3,2022-08-10,0.03,1,267,1,
1,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.8038,-73.96751,Private room,75,2,118,2017-07-21,0.73,1,0,0,
2,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,60,30,50,2019-12-02,0.3,2,322,0,
3,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Midtown,40.76457,-73.98317,Private room,68,2,559,2022-11-20,3.38,1,79,50,
4,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,175,30,49,2022-06-21,0.31,3,365,1,


## Data Definition

In [4]:
# Generating a df of label descriptions
labels_temp = pd.read_excel("./data/Inside Airbnb Data Dictionary.xlsx")
labels_clean = labels_temp.iloc[6:]
labels_clean.columns = labels_clean.iloc[0]
labels = labels_clean.iloc[1:]
labels.head()

6,Field,Type,Calculated,Description,Reference
7,id,integer,,Airbnb's unique identifier for the listing,
8,listing_url,text,y,,
9,scrape_id,bigint,y,"Inside Airbnb ""Scrape"" this was part of",
10,last_scraped,datetime,y,"UTC. The date and time this listing was ""scrap...",
11,source,text,,"One of ""neighbourhood search"" or ""previous scr...",


In [5]:
# dropping unnecessary features
columns_to_drop = ['Calculated', 'Reference']
labels = labels.drop(labels=columns_to_drop, axis=1)
labels = labels.reset_index(drop=True).rename_axis(None, axis=1)

In [6]:
# filtering for columns of interest
features = list(df.columns)
labels.loc[labels['Field'].isin(features)]

Unnamed: 0,Field,Type,Description
0,id,integer,Airbnb's unique identifier for the listing
5,name,text,Name of the listing
9,host_id,integer,Airbnb's unique identifier for the host/user
11,host_name,text,Name of the host. Usually just the first name(s).
27,neighbourhood,text,
30,latitude,numeric,Uses the World Geodetic System (WGS84) project...
31,longitude,numeric,Uses the World Geodetic System (WGS84) project...
33,room_type,text,[Entire home/apt|Private room|Shared room|Hote...
40,price,currency,daily price in local currency
41,minimum_nights,integer,minimum number of night stay for the listing (...


In [7]:
# Ensure the Dtypes match up
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41533 entries, 0 to 41532
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              41533 non-null  int64  
 1   name                            41520 non-null  object 
 2   host_id                         41533 non-null  int64  
 3   host_name                       41528 non-null  object 
 4   neighbourhood_group             41533 non-null  object 
 5   neighbourhood                   41533 non-null  object 
 6   latitude                        41533 non-null  float64
 7   longitude                       41533 non-null  float64
 8   room_type                       41533 non-null  object 
 9   price                           41533 non-null  int64  
 10  minimum_nights                  41533 non-null  int64  
 11  number_of_reviews               41533 non-null  int64  
 12  last_review                     

## Data Cleaning



I'll start by inspecting the nulls.

In [8]:
df.isnull().sum().sort_values(ascending=False)

license                           41532
reviews_per_month                  9393
last_review                        9393
name                                 13
host_name                             5
minimum_nights                        0
number_of_reviews_ltm                 0
availability_365                      0
calculated_host_listings_count        0
number_of_reviews                     0
id                                    0
room_type                             0
longitude                             0
latitude                              0
neighbourhood                         0
neighbourhood_group                   0
host_id                               0
price                                 0
dtype: int64

The `license` column is empty for all but one of the 41533 listings and has no bearing on this project, so it can be dropped.
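As a minimal sketch of that drop, the snippet below uses a small made-up frame (`df_demo`, with invented values) rather than the real listings data. It reports the share of missing values per column and then removes `license` without mutating the original frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings DataFrame (values are made up for illustration).
df_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [275, 75, 60],
    "license": [np.nan, np.nan, "hypothetical-license-id"],
})

# Share of missing values per column; `license` is mostly empty here.
missing_share = df_demo.isnull().mean()
print(missing_share)

# Drop the column and keep the result in a new frame.
df_demo_clean = df_demo.drop(columns=["license"])
print(df_demo_clean.columns.tolist())
```

In the notebook itself the same call would be applied to `df`, assuming no later cell relies on the `license` column.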

Thirteen listings are missing a `name`, and 9393 values are missing from each of the review fields (`last_review` and `reviews_per_month`). Let's inspect the rows with a missing `name` first.

In [9]:
df[df.name.isna()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1827,2232600,,11395220,Anna,Manhattan,East Village,40.73192,-73.98819,Entire home/apt,700,60,28,2015-06-08,0.27,1,359,0,
2831,4209595,,20700823,Jesse,Manhattan,Greenwich Village,40.73323,-73.99294,Entire home/apt,225,30,1,2015-01-01,0.01,1,0,0,
2982,4370230,,22686810,Michaël,Manhattan,Nolita,40.721,-73.99536,Entire home/apt,215,30,5,2016-01-02,0.05,1,0,0,
3099,4581788,,21600904,Lucie,Brooklyn,Williamsburg,40.7137,-73.94378,Private room,150,30,0,,,1,0,0,
3296,4774658,,24625694,Josh,Manhattan,Washington Heights,40.85111,-73.93009,Private room,40,30,0,,,1,0,0,
4395,6782407,,31147528,Huei-Yin,Brooklyn,Williamsburg,40.71354,-73.93882,Private room,45,30,0,,,1,0,0,
5916,9325951,,33377685,Jonathan,Manhattan,Hell's Kitchen,40.76617,-73.98435,Entire home/apt,190,30,1,2016-01-05,0.01,1,0,0,
6343,9787590,,50448556,Miguel,Manhattan,Harlem,40.80551,-73.95069,Entire home/apt,300,30,0,,,5,0,0,
6675,10116081,,51913270,Andrew,Manhattan,Midtown,40.75939,-73.96949,Entire home/apt,200,30,0,,,1,0,0,
6711,10052289,,49522403,Vanessa,Brooklyn,Brownsville,40.66409,-73.92314,Private room,80,30,3,2016-08-18,0.04,1,0,0,


There doesn't appear to be any obvious pattern among these listings in terms of location or host. Most of them have no reviews, and without a name to highlight amenities, we won't be able to include them in the comparisons. For now, I will name them 'Unnamed Listing' and decide during the EDA step whether any value can be extracted from them.

In [10]:
# Assign the result back; fillna(inplace=True) on df.name may act on a temporary copy
df['name'] = df['name'].fillna('Unnamed Listing')

In [16]:
df.price.describe()

count    41533.000000
mean       221.978282
std        919.502236
min          0.000000
25%         80.000000
50%        131.000000
75%        220.000000
max      98159.000000
Name: price, dtype: float64

In [41]:
# The minimum value indicates there are listings for $0. 
(df.price==0).value_counts()

False    41503
True        30
Name: price, dtype: int64

In [38]:
(df.room_type == 'Hotel room').value_counts()

False    41345
True       188
Name: room_type, dtype: int64

In [12]:
reviews = df.loc[:, ['id', 'last_review', 'reviews_per_month']].set_index('id')
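`last_review` and `reviews_per_month` have the same null count, which suggests the gaps belong to listings that have never been reviewed. That is an assumption worth checking rather than a confirmed fact. The sketch below runs the check on a small made-up frame (`df_demo` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy data mirroring the three review-related columns.
df_demo = pd.DataFrame({
    "id": [10, 11, 12],
    "number_of_reviews": [3, 0, 0],
    "last_review": ["2022-08-10", np.nan, np.nan],
    "reviews_per_month": [0.03, np.nan, np.nan],
})

no_reviews = df_demo["number_of_reviews"] == 0
missing_review_fields = (
    df_demo["last_review"].isna() & df_demo["reviews_per_month"].isna()
)

# True when the review fields are missing exactly for the zero-review listings.
nulls_match_zero_reviews = bool((no_reviews == missing_review_fields).all())
print(nulls_match_zero_reviews)
```

If the check holds on the real data, filling `reviews_per_month` with 0 would be a defensible choice, while `last_review` has no natural fill value and could stay missing.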


Some features (`neighbourhood_group` and `room_type`) can be encoded as categorical variables.
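A minimal sketch of that encoding, using pandas' `category` dtype on a made-up frame (`df_demo` is hypothetical). Other encodings, such as one-hot columns, are equally possible. This only shows the dtype conversion and the integer codes it exposes.

```python
import pandas as pd

# Toy data with the two columns mentioned above.
df_demo = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# Convert both columns to the pandas categorical dtype.
df_demo = df_demo.astype({
    "neighbourhood_group": "category",
    "room_type": "category",
})

# Integer codes per row; categories are ordered by sorted label.
room_type_codes = df_demo["room_type"].cat.codes
print(df_demo.dtypes)
print(room_type_codes.tolist())
```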

## EDA

What are the different `room_type` values?

In [13]:
df.room_type.unique()

array(['Entire home/apt', 'Private room', 'Hotel room', 'Shared room'],
      dtype=object)

Which `room_type` has the highest price?

In [14]:
fig, ax = plt.subplots(figsize=(10, 7))
# Pass the axes explicitly so the plot uses the figure sized above
df.boxplot(column='price', by='room_type', rot=90, ax=ax)
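The `price` summary earlier shows a maximum of 98159 against a 75th percentile of 220, so a few extreme listings can dominate a plot or a mean. A numeric comparison based on the median is one hedge against that. The sketch below uses made-up prices (`df_demo` is hypothetical) and does not describe the real dataset.

```python
import pandas as pd

# Toy data; in the notebook the same groupby would run on `df`.
df_demo = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
        "Shared room",
    ],
    "price": [175, 275, 60, 68, 40],
})

# Count, median and mean price per room type, highest median first.
price_by_room = (
    df_demo.groupby("room_type")["price"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(price_by_room)
```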

What are the neighborhoods?

In [15]:
# print(sorted(df.neighbourhood.unique()))