Source: [https://towardsdatascience.com/data-exploration-on-airbnb-singapore-01-40698c54cac3](https://towardsdatascience.com/data-exploration-on-airbnb-singapore-01-40698c54cac3)

# Data Exploration on Airbnb Singapore

# Acquiring and Loading Data

## Load Python libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import seaborn as sns

## Load dataset

In [None]:
airbnb = pd.read_csv('listings_sum_sg.csv')
airbnb.head()

## Understanding data

In [None]:
airbnb.shape

In [None]:
airbnb.dtypes

_Let's look at the unique values of the 'neighbourhood_group' column, which lists the regions of Singapore._

In [None]:
airbnb['neighbourhood_group'].unique()

_The Urban Redevelopment Authority (URA) further divides the regions into 55 planning areas for urban planning purposes. We will use the ‘neighbourhood’ column to see which planning areas have Airbnb listings._

In [None]:
airbnb['neighbourhood'].unique()

_Let's look at the ‘room_type’ column to see the room type of each listing._

In [None]:
airbnb['room_type'].unique()

# Cleaning dataset

## Checking columns with missing values

In [None]:
airbnb.isnull().sum()
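
Raw counts are easier to judge when read as a share of all rows. The following is a minimal sketch on a small made-up frame (the column values are hypothetical and not taken from the dataset) that shows the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings data; values are invented for illustration
sample = pd.DataFrame({
    'name': ['Cozy room', None, 'City loft', 'Studio'],
    'last_review': ['2019-08-01', None, None, '2019-07-15'],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
    'price': [80, 120, 95, 150],
})

# isnull().mean() gives the fraction of missing rows in each column
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call would be `airbnb.isnull().mean()`.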

## Removing redundant variables

In our case, the missing values we observe do not need much treatment. Looking at the dataset, the ‘id’, ‘host_name’ and ‘last_review’ columns are irrelevant to further exploration, and analysing host names would be unethical. Therefore, we drop those columns. The ‘name’ column is kept because we use it later to analyse the words in listing names.

In [None]:
airbnb.drop(['id','host_name','last_review'],axis=1,inplace=True)
airbnb.head()

## Replacing all the missing values

We need to replace all the missing values in the ‘reviews_per_month’ column with 0 (zero) so that they do not interfere with our analysis.

In [None]:
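#Assign the result back; in-place fillna on a single column may not update the DataFrame in recent pandas versions
airbnb['reviews_per_month'] = airbnb['reviews_per_month'].fillna(0)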
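
Filling with zero is only safe if the missing rates belong to listings that have no reviews at all. A minimal sketch of that check, run on a hypothetical sample rather than the real data (we assume ‘number_of_reviews’ counts all reviews of a listing):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be the `airbnb` DataFrame
sample = pd.DataFrame({
    'number_of_reviews': [12, 0, 0, 3],
    'reviews_per_month': [0.8, np.nan, np.nan, 0.1],
})

# Rows where the monthly review rate is missing
missing = sample['reviews_per_month'].isnull()

# True if every missing rate belongs to a listing with zero reviews
only_unreviewed = bool((sample.loc[missing, 'number_of_reviews'] == 0).all())
print(only_unreviewed)

# Fill the gaps and assign the result back to the column
sample['reviews_per_month'] = sample['reviews_per_month'].fillna(0)
```

If the check printed `False` on the real data, a zero would misrepresent some listings and the gaps would deserve a closer look.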

# Exploring and visualizing data

## Top listing counts
First, we skip the ‘name’ column for now and begin with the ‘host_id’ column, taking the top 10 hosts by listing count.

In [None]:
top_host_id = airbnb['host_id'].value_counts().head(10)

Next, we set the figure size and plot the result as a bar chart.

In [None]:
sns.set(rc={'figure.figsize':(10,8)})
viz_bar = top_host_id.plot(kind='bar')
viz_bar.set_title('Hosts with the most listings in Singapore')
viz_bar.set_xlabel('Host IDs')
viz_bar.set_ylabel('Count of listings')
viz_bar.set_xticklabels(viz_bar.get_xticklabels(), rotation=45)

## Top region area
Next, we visualize the share of listings in each region using the ‘neighbourhood_group’ column.

In [None]:
labels = airbnb.neighbourhood_group.value_counts().index
colors = ['#008fd5','#fc4f30','#e5ae38','#6d904f','#8b8b8b']
explode = (0.1,0,0,0,0)
shape = airbnb.neighbourhood_group.value_counts().values
plt.figure(figsize=(12,12))
plt.pie(shape, explode = explode, labels=shape, colors= colors, autopct = '%1.1f%%', startangle=90)
plt.legend(labels)
plt.title('Neighbourhood Group')
plt.show()
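
The percentages drawn on the pie can also be read as a table. A minimal sketch, using invented region labels in place of the real ‘neighbourhood_group’ column:

```python
import pandas as pd

# Hypothetical region labels; the real column is airbnb['neighbourhood_group']
regions = pd.Series(
    ['Central Region'] * 6 + ['East Region'] * 2 + ['West Region'] * 2,
    name='neighbourhood_group',
)

# normalize=True returns fractions instead of counts
region_share = (regions.value_counts(normalize=True) * 100).round(1)
print(region_share)
```

A table like this is easier to quote in text than a chart, and it makes small regions visible that a pie slice can hide.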

## Top planning area
Next, we look up the top 10 planning areas that have the highest number of listings

In [None]:
airbnb.neighbourhood.value_counts().head(10)

## Listing map
To create a map of the listing locations, we will use the ‘longitude’ and ‘latitude’ columns. But first, we need to check the values within these columns.

In [None]:
coord = airbnb.loc[:,['longitude','latitude']]
coord.describe()

From the summary above, the min and max rows show the extreme values of longitude and latitude.
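
One way to act on those extremes is to flag coordinates that fall outside a bounding box around Singapore. This is only a sketch: the box limits below are rough approximations chosen for illustration, and the coordinates are invented.

```python
import pandas as pd

# Hypothetical coordinates; the last row is deliberately wrong (latitude and longitude swapped)
coords = pd.DataFrame({
    'latitude': [1.3039, 1.3521, 103.85],
    'longitude': [103.8318, 103.8198, 1.29],
})

# Rough bounding box for Singapore (an approximation assumed for this sketch)
LAT_MIN, LAT_MAX = 1.15, 1.50
LON_MIN, LON_MAX = 103.55, 104.10

# A row is plausible only if both coordinates fall inside the box
inside = (
    coords['latitude'].between(LAT_MIN, LAT_MAX)
    & coords['longitude'].between(LON_MIN, LON_MAX)
)
suspect_rows = coords[~inside]
print(suspect_rows)
```

Rows flagged this way are candidates for inspection, not automatic removal.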

For a better understanding of the listing density, we can use a folium heat map.

In [None]:
import folium
from folium.plugins import HeatMap
map_folium = folium.Map([1.35255,103.82580],zoom_start=11.4)
HeatMap(airbnb[['latitude','longitude']].dropna(),radius=8,gradient={0.2:'blue',0.4:'purple',0.6:'orange',1.0:'red'}).add_to(map_folium)
display(map_folium)

## Price Distribution
Before we visualize prices, we need to remove some outliers, as some listings have prices far outside the IQR (interquartile range). Here we keep only listings priced below S$ 300.

We then visualize the price distribution with a box plot, grouped by ‘neighbourhood_group’ (region), to better understand the listing price range in each region.

In [None]:
airbnb_1 = airbnb[airbnb.price < 300]

plt.style.use('fivethirtyeight')
plt.figure(figsize=(14,12))
sns.boxplot(y='price',x='neighbourhood_group',data = airbnb_1)
plt.title('Neighbourhood Group Price Distribution < S$ 300')
plt.show()
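
The S$ 300 cutoff above is a fixed threshold. An alternative that follows the IQR idea more directly is to compute the conventional 1.5 × IQR fences from the data. A minimal sketch on invented prices (not the notebook's method, just an illustration of the rule):

```python
import pandas as pd

# Hypothetical nightly prices in S$, with one extreme value
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 5000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Conventional Tukey fences: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the prices that fall between the two fences
filtered = prices[prices.between(lower_fence, upper_fence)]
print(upper_fence, len(filtered))
```

On the real data, the resulting upper fence may differ noticeably from S$ 300, so it is worth comparing both before choosing a cutoff.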

## Top listing words
Next, we explore the property details by finding the most used words in the listing names. The most used words could indicate what hosts present as the selling points of their properties to prospective guests. First, we collect the words from every listing name.

In [None]:
#Create an empty list to hold the name strings
names=[]
#Get each name string from the 'name' column and append it to the list
for name in airbnb.name:
    names.append(name)
#Define a function that splits a name string into separate words
def split_name(name):
    s = str(name).split()
    return s
#Create an empty list to hold every word from every name
names_count = []
#Split each name, lowercase its words and append them to names_count
for n in names:
    for word in split_name(n):
        word = word.lower()
        names_count.append(word)

We import Counter from the collections module to count the words and extract the 25 most used by hosts.

In [None]:
from collections import Counter
top_25 = Counter(names_count).most_common(25)
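
The same count can be sketched with pandas string methods alone. The example below uses invented listing names and drops missing names first, since converting a missing value with `str()` would otherwise add the word 'nan' to the counts:

```python
import pandas as pd

# Hypothetical listing names; in the notebook this would be airbnb['name']
names = pd.Series(['Cozy Room near MRT', 'Spacious room near city', None])

word_counts = (
    names.dropna()      # skip missing names instead of counting the string 'nan'
    .str.lower()        # make the count case-insensitive
    .str.split()        # split each name on whitespace
    .explode()          # one row per word
    .value_counts()
)
top_words = word_counts.head(25)
print(top_words)
```

Both approaches split on whitespace only, so punctuation stays attached to words.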

Then, we convert the data into a DataFrame and visualize our findings.

In [None]:
word_count_data = pd.DataFrame(top_25)
word_count_data.rename(columns={0:'Words',1:'Counts'},inplace=True)
viz_count = sns.barplot(x='Words',y='Counts', data = word_count_data)
viz_count.set_title('Top 25 used words for listing names')
viz_count.set_ylabel('Count of words')
viz_count.set_xlabel('Words')
viz_count.set_xticklabels(viz_count.get_xticklabels(),rotation = 90)

From the chart above, we see the top 25 words used in the listing names. A word cloud can give a more visual sense of the same information.

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
text = ' '.join(str(n).lower() for n in airbnb.name)
#Generate wordcloud image
wordcloud = WordCloud(max_words=200, background_color = 'white').generate(text)
plt.figure(figsize=(25,20))
#Display the image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## Room type details
Next, we visualize the room type proportions of the listings in each region using the Plotly library.

In [None]:
import plotly.graph_objs as go
#Setting up the color palette
color_dict = {'Private room': '#cc5a49', 'Entire home/apt' : '#4586ac', 'Shared room' : '#21908d', 'Hotel room' : '#C0C0C0' }
#Count the listings per region and room type
airbnb_types=airbnb.groupby(['neighbourhood_group', 'room_type']).size()
#Plot one donut chart of room type proportions per region
#(Plotly renders its own figures, so no matplotlib figure or subplot is needed)
for region in airbnb.neighbourhood_group.unique():
    airbnb_reg=airbnb_types[region]
    labels = airbnb_reg.index
    sizes = airbnb_reg.values
    #Match each room type to its color
    colors = [color_dict[x] for x in labels]
    reg_ch = go.Figure(data = [go.Pie(labels = labels, values = sizes, hole = 0.6)])
    reg_ch.update_traces(title = region, marker=dict(colors=colors))
    reg_ch.show()
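
The proportions behind the charts can also be tabulated directly. A minimal sketch with `pd.crosstab` on an invented sample (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the `airbnb` DataFrame
sample = pd.DataFrame({
    'neighbourhood_group': ['Central Region'] * 4 + ['East Region'] * 2,
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt',
                  'Shared room', 'Private room', 'Private room'],
})

# normalize='index' makes each region's row sum to 1
room_share = pd.crosstab(
    sample['neighbourhood_group'], sample['room_type'], normalize='index'
)
print((room_share * 100).round(1))
```

A single table makes the regions easier to compare side by side than separate charts.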

## Top 10 most reviewed listings
We will find out the top 10 listings based on their number of reviews to know the most popular Airbnb listings in Singapore.

In [None]:
airbnb.nlargest(10, 'number_of_reviews')

## Average price per night
Lastly, we will calculate the average price per night of the 10 most popular listings

In [None]:
top_review = airbnb.nlargest(10, 'number_of_reviews')
price_avg = top_review.price.mean()
#Round the printed average to two decimal places
print('Average price per night: S$ {:.2f}'.format(price_avg))
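
With only 10 listings, a single expensive one can pull the mean up, so the median is a useful companion figure. A minimal sketch on invented numbers, using the top 3 instead of the top 10 to keep the sample small:

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the `airbnb` DataFrame
sample = pd.DataFrame({
    'number_of_reviews': [300, 250, 200, 10, 5],
    'price': [60, 80, 400, 90, 70],
})

# Select the most reviewed listings, then compare both summary statistics
top_reviewed = sample.nlargest(3, 'number_of_reviews')
mean_price = top_reviewed['price'].mean()
median_price = top_reviewed['price'].median()
print('Mean: S$ {:.2f}, median: S$ {:.2f}'.format(mean_price, median_price))
```

A large gap between the two values suggests that one or two listings dominate the average.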