
# What features determine the price of an Airbnb rental?

In [None]:
import numpy                 as np
import pandas                as pd
import matplotlib.pyplot     as plt
import seaborn               as sns
from   collections           import Counter
%matplotlib inline
sns.set()

## Introduction

**Business Context**. Airbnb is an enormous online marketplace for everyday people to rent places to stay. It is a large and lucrative market, but many vendors are simply individuals who are renting their own primary residence for short visits. Even larger vendors are typically small businesses with only a small number of places to rent. As a result, they have limited ability to assess large-scale trends and set optimal prices.

Airbnb has rolled out a new service to help listers set prices. Airbnb makes a percentage commission off of the listings, so they are incentivized to help listers price optimally; that is, at the maximum possible point where they will still close a deal. You are an Airbnb consultant helping with this new pricing service.

**Business Problem**. Your initial task is to explore the data with the goal of answering the question: <b>"What features are most relevant to the price of an Airbnb listing?"</b>

**Analytical Context**. We will use the publicly available and well-maintained dataset created by the Inside Airbnb advocacy group. We will focus on listings in New York City within the last year, taking advantage of larger datasets when there are important details to explore.

The case is structured as follows: we will (1) do basic data exploration by plotting distributions of key quantities; (2) introduce the concept of correlation to find the key features; (3) introduce the idea of interaction effects to correct for the effects of key features; (4) discuss how to iteratively generate hypotheses and choose data visualizations to support your conclusions; (5) look at one very specific type of interaction effect, the temporal effect, and how to correct for it; and finally (6) pull everything together to identify the key factors that affect the price.

## Some basic data exploration

We begin by loading the data and looking at its basic shape:

In [None]:
listings = pd.read_csv('airbnb_NYC.csv', delimiter=',')
listings.shape

Let's also look at the columns of the dataset:

In [None]:
listings.columns 

In [None]:
# We display the basic listings data.
pd.options.display.max_columns = 100
listings.head(3)

The following are details about some of the important columns here:

1. ```neighbourhood```:  which neighborhood the property is in
2. ```longitude```, ```latitude```: longitude and latitude
3. ```property_type```: type of property, such as apartment, condo etc.
4. ```bathrooms```: number of bathrooms
5. ```bedrooms```: number of bedrooms
6. ```price```:  price of the listing
7. ```number_of_reviews```: number of reviews given by customers who stayed there
8. ```parking```: 1 means there is parking available, -1 means there is not

For other categorical variables, such as ```outdoor_friendly```, ```gym```, etc., the 1,-1 should be interpreted similarly to ```parking``` as explained above.

### Plotting the marginal distributions of key quantities of interest

It is good to first develop an idea of how the values of a few key quantities of interest are distributed. Let's start by doing so for some numeric variables, such as ```price```, ```bedrooms```, ```bathrooms```, ```number_of_reviews```:

### Exercise 1:

Use the ```describe()``` command to compute some important summary statistics for the above variables.

In [None]:
# possible solution


```plotly.py``` enables Python users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML files, or served as part of pure Python-built web applications.

Let's use the ```plt.hist()``` function to plot the histogram of the above variables. What are their basic distribution shapes (e.g. normal, skewed, multi-modal, etc.)?

In [None]:
plt.figure(figsize=(12,10))
vars_to_plot = ['price', 'bedrooms','bathrooms','number_of_reviews']
for i, var in enumerate(vars_to_plot):
    plt.subplot(2,2,i+1)
    plt.hist(listings[var],50)
    title_string = "Histogram of " + var
    plt.title(title_string)

**Answer.** All look somewhat skewed to the right, though the ```bathrooms``` variable is so concentrated at a single value that it is hard to tell.

Are the distributions fairly smooth, or do they exhibit "spiky" or "discontinuous" behavior? If the latter, can you explain where it might come from?

**Answer.** The ```price``` variable is noticeably spiky. There is a nice bulk of prices between about 25 and 300 dollars, with very obvious spikes at nice, round numbers such as 50, 100, 150, 200, 250, and 300. This probably reflects the fact that people enter in the prices that they wish to list at, and so tend to choose round numbers (or numbers just below round numbers).
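The round-number explanation can also be checked numerically by measuring the share of prices that are exact multiples of 50. The sketch below uses a small made-up series (`toy_prices` and `share_of_multiples` are hypothetical names, not part of the case data); with the real data one would pass `listings['price']` instead.

```python
import pandas as pd

# Hypothetical toy prices, standing in for listings['price'].
toy_prices = pd.Series([50, 75, 99, 100, 100, 120, 150, 150, 199, 200])

def share_of_multiples(prices: pd.Series, step: int) -> float:
    """Return the fraction of prices that are exact multiples of `step`."""
    return float((prices % step == 0).mean())

round_share = share_of_multiples(toy_prices, 50)
print(f"Share of prices that are multiples of 50: {round_share:.0%}")
```

If integer prices were spread evenly, only about 1 in 50 would land on a multiple of 50, so a much larger share would support the idea that listers favor round numbers.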

Can you detect any outliers from these histograms? If so, do they suggest (i) data error; or (ii) data that should be omitted from our future analysis?

**Answer.** Very few places had prices of more than $320, and so we might think of these as "outliers". Some of these may represent error, but we guess that most of them are correct – hotels in NYC certainly often go for over 400 dollars per night, and so it is not unreasonable to expect some Airbnb listings of this price. The question as to whether we should omit these outliers is a little more difficult, but we lean towards omitting them for <b>most</b> clients. Even if these prices are correct, we suspect that they are governed by idiosyncratic factors that are not as relevant to the listings that most of our clients are interested in analyzing. Thus, they will tend to give us misleading (or "biased") results. 
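If we do decide to set the expensive listings aside, a simple filter is enough. The following is a minimal sketch on made-up data; `toy_listings` and `PRICE_CAP` are hypothetical names, and the cap of 320 is simply the figure mentioned above rather than a recommended value.

```python
import pandas as pd

# Hypothetical toy listings; the real data would be the `listings` DataFrame.
toy_listings = pd.DataFrame({
    "price": [60, 95, 150, 210, 320, 450, 999],
    "bedrooms": [1, 1, 2, 2, 3, 3, 4],
})

PRICE_CAP = 320  # threshold taken from the discussion above; adjust per client

# Flag listings above the cap, then keep the rest in a separate copy
# so that the original data stays available for later checks.
is_outlier = toy_listings["price"] > PRICE_CAP
trimmed = toy_listings.loc[~is_outlier].copy()

print(f"Dropped {is_outlier.sum()} of {len(toy_listings)} listings")
print(trimmed)
```

Keeping the untrimmed frame around makes it easy to rerun any later analysis with and without the outliers and see whether the conclusions change.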

### Another way to look at the histogram of number of bedrooms

Sometimes, it is better to look at a histogram which plots the <i>relative</i> percentages of values across categories:

In [None]:
# How many bedrooms
# A Counter is a dict subclass for counting hashable objects:
# elements are stored as dictionary keys and their counts as dictionary values.
bedrooms_counts = Counter(listings.bedrooms)
tdf = pd.DataFrame.from_dict(bedrooms_counts, orient = 'index').sort_values(by = 0)
tdf = (tdf.iloc[-10:, :] / len(listings)) * 100

# Sort bedroom dataframe by number
tdf.sort_index(axis = 0, ascending = True, inplace = True)

# Plot percent of listings by bedroom number
ax = tdf.plot(kind = 'bar', figsize = (12, 7.5))
ax.set_xlabel("# Bedrooms")
ax.set_ylabel("% Listings")
ax.set_title('% Listings by Bedrooms')
ax.legend_.remove()

plt.show()

print("Percent of 1 Bedroom Listings: {0:.2f}%".format(tdf[0][1]))
# The format spec {0:.2f} prints the value with two decimal places.
# Change it to {0:.3f} to see what happens.
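
The same percentages can be obtained more directly with pandas. This is a sketch of an alternative, not the approach used in the cell above; `toy_bedrooms` is a hypothetical stand-in for `listings['bedrooms']`.

```python
import pandas as pd

# Hypothetical toy column standing in for listings['bedrooms'].
toy_bedrooms = pd.Series([1, 1, 1, 2, 2, 0, 3, 1], name="bedrooms")

# normalize=True returns fractions; multiply by 100 for percentages,
# then sort by the number of bedrooms rather than by frequency.
bedroom_pct = (
    toy_bedrooms.value_counts(normalize=True)
    .mul(100)
    .sort_index()
)
print(bedroom_pct)
```

Unlike the `Counter` version, this sketch keeps every bedroom count instead of only the ten most common ones, and by default `value_counts` ignores missing values.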

## Inspecting price against variables of interest

Now that we have looked at the variables of interest in isolation, it makes sense to look at them in relation to price.

### Exercise 2:

#### 2.1
Write code for making a boxplot of ```price``` vs. ```bedrooms```, ```bathrooms```, ```number_of_reviews```, ```review_scores_cleanliness```.

In [None]:
# Code here, hint: use the seaborn library function as sns.boxplot
plt.figure(figsize=(12,10))
vars_to_plot = ['bedrooms','bathrooms','number_of_reviews','review_scores_cleanliness']
for i, var in enumerate(vars_to_plot):
    pass  # code here


#### 2.2

Comment on the relationship between price and the respective variable in each of the above plots.

**Answer.**

1. As expected, the median price increases with the number of bedrooms. This relationship also seems linear.
2. Again as expected, the price on average seems to increase with the number of bathrooms. There seem to be some outliers which defy this trend.
3. The number of reviews does not seem to affect the median price.
4. There seems to be a slight increase in median price with increase in cleanliness review scores.

### Investigating correlations

Although plotting the relationship between price and a few other variables is a good first step, overall there are too many variables to individually plot and manually inspect. We need a more systematic method. How do we proceed? An easy way to get a quick overview of the key variables that affect the price is via <b>correlation</b>.

Let's look at the ```price``` vs. ```bedrooms``` plot again:

In [None]:
plt.figure(figsize=(10,8))
plt.subplot(121)
sns.scatterplot(x='bedrooms',y = 'price', data = listings)
plt.ylabel("price")
plt.title("Scatterplot of Price vs. Bedrooms")
plt.subplot(122)
sns.boxplot(x='bedrooms',y= 'price', data = listings)
plt.ylabel("price")
plt.title("Boxplot of Price vs. Bedrooms")

We see that as the number of bedrooms increases, the price on average increases. The quantity <b>correlation</b> is one way to capture this relationship. The correlation of two quantities is a measurement of how much they tend to increase together, measured on a scale going from -1 to 1. A positive correlation between price and number of bedrooms would indicate that higher-priced listings tend to have <i>more</i> bedrooms. Similarly, a negative correlation between price and number of bedrooms would indicate that higher-priced listings tend to have <i>fewer</i> bedrooms.  In our case, we can easily see that price is positively correlated with bedrooms.

Since correlation is just a single number summarizing an entire joint distribution, it can be misleading and does not eliminate the need to plot and visually inspect the key variables that it suggests are important. Nonetheless, it is quite helpful when quickly scanning for very strong relationships in the data and whittling down a much larger list of potential factors.
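To make the definition concrete, the sketch below computes the Pearson correlation by hand (covariance divided by the product of the standard deviations) and compares it with `np.corrcoef`. The helper `pearson` and the toy values are hypothetical and only serve as an illustration.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy values, not taken from the case data.
toy_bedrooms = [1, 1, 2, 2, 3, 4]
toy_price = [80, 95, 140, 130, 210, 260]

r_manual = pearson(toy_bedrooms, toy_price)
r_numpy = np.corrcoef(toy_bedrooms, toy_price)[0, 1]
print(r_manual, r_numpy)
```

The two numbers should agree up to floating-point error, which confirms that `np.corrcoef` reports this same quantity.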

In [None]:
np.corrcoef(listings['price'],listings['bedrooms'])[0,1]

The <b>correlation matrix</b> then gives all of the pairwise correlations between all of the variables. We can get a quick overview of the key variables that affect the price by looking at its row in the correlation matrix.
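As a rough illustration of reading the price row of a correlation matrix, here is a sketch on made-up data. The frame `toy` and its values are hypothetical; only the column names mirror those used in this case.

```python
import pandas as pd

# Hypothetical toy data; column names mirror a few used in this case.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C", "C"],  # text column, excluded below
    "price": [80, 95, 140, 130, 210, 260],
    "bedrooms": [1, 1, 2, 2, 3, 4],
    "parking": [1, 1, -1, 1, -1, -1],
    "number_of_reviews": [12, 3, 40, 7, 7, 15],
})

# Correlations are only defined for numeric columns, so select those first.
toy_corr = toy.select_dtypes("number").corr()

# Take the price row, drop price's correlation with itself, and sort ascending.
price_corr = toy_corr["price"].drop("price").sort_values()
print(price_corr)
```

Sorting the row puts the most negative correlations at the top and the most positive at the bottom, which makes it easy to scan both ends.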

### Exercise 3:

#### 3.1 Write code to compute the correlation matrix between the price and other quantities.
(use .corr() function). 

#### 3.2 Print the columns which are positively correlated, in increasing order of the correlation.
#### 3.3 Print the columns which are negatively correlated, in increasing order of the magnitude of the correlation. 

In [None]:
# code here


#### 3.4:

From the table above, what factors are most correlated with price? Which correlations are surprising?

**Answer.** Many of these are unsurprising – for example, the largest correlations are with measures of size (```accommodates```, ```bedrooms```, ```beds```, etc.). Review scores are only slightly related to price. Looking at the location-related scores.

We also notice a few correlations that seem a bit surprising. For example:

1. Parking is negatively correlated with price. This correlation is very suspicious: why would parking be bad? We suspect that it is "spurious", caused by the fact that parking is more common in less expensive neighborhoods. Let's investigate this by looking at parking in a region-by-region manner.
2. Being a superhost is negatively correlated with price; we don't follow up on it here.
3. The total number of listings is positively correlated with price. This seems counterintuitive, as one would expect large-scale listers to be able to rent more cheaply due to economies of scale.

Let's use another method of checking correlations: a heatmap. This is a useful tool to aid feature selection when building models.

In [None]:
plt.figure(figsize=(30,30))
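# Assumes `corr` is the correlation matrix computed in Exercise 3.
# np.triu keeps the upper triangle (including the diagonal) and zeros out everything below it;
# used as a mask, it hides the redundant upper-triangle cells of the symmetric matrix.
sns.heatmap(corr, cmap="RdYlBu", 
    annot=True, square=True,
    vmin=-1.0, vmax=1.0, fmt="+.1f", mask=np.triu(corr), cbar=False)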
plt.title("Correlations between features")

#### Write code to make a scatterplot between price and longitude, with number of bedrooms categorized by color.

Plot ```longitude``` vs. ```price```:

In [None]:
sns.scatterplot(x= listings['longitude'], y = listings['price'], hue = listings['bedrooms'])

When looking at the list of correlations, ```parking``` stood out as having a surprisingly negative correlation with price. We've seen that location has a strong influence on price; let's see if it can help explain the negative correlation exhibited by ```parking```.

## Interaction effects and iterative hypotheses

Now that we have explored some of the factors that are expected to affect price, let's focus on understanding the unexpected correlations. We start with the negative correlation with parking:

In [None]:
# First, plot parking vs. non-parking prices.
sns.kdeplot(listings.loc[listings['parking'] == 1,'price'],shade = True, label="Parking",color="g")
sns.kdeplot(listings.loc[listings['parking'] == -1,'price'],shade = True, label="No Parking",color="r")
plt.title("Density plot of Price for Parking vs. No Parking");

We saw before that the correlation between price and parking is -0.019383. Since parking is desirable, we expect the price to increase with parking. When we see a pattern like this, we should suspect the existence of **interaction effects** that are complicating the parking vs. price relationship. Interaction effects are when the relationship between two variables is **conditional**, or depends on the value of a third, hidden variable.

What could this third variable potentially be? Well, we have seen that location has a huge impact on prices. Perhaps high-price areas don't have many parking spots, whereas low-price areas do? We don't know this for sure, but it's a worthwhile guess.

More formally, we hypothesize that this observed negative correlation is the result of interaction effects arising from location. In order to investigate this hypothesis, we ought to break down the locations by neighborhood and see if this negative correlation between price and parking still holds within neighborhoods. The neighborhoods are discrete and there are many listings per neighborhood, so we can simply compute the correlation for every neighborhood individually. Mathematically, this is exactly the same thing as conditioning on the neighborhood and computing the conditional correlation.

### Exercise 4:

#### 4.1
Write code to make a dictionary in which the keys are the `neighbourhoods` in the dataset and the values are the correlation between price and parking for that neighborhood.


#### 4.2 
Next plot a histogram of these correlations. 

In [None]:
neighbourhoods = listings.neighbourhood.unique()
cvec = dict()

for x in neighbourhoods:
    temp = listings[listings['neighbourhood'] == x]
    # Correlate the two columns directly instead of building a full correlation matrix.
    cvec[x] = temp['price'].corr(temp['parking'])


# Drop neighbourhoods where the correlation is undefined (NaN).
res = [c for c in cvec.values() if not pd.isna(c)]
res.sort()


plt.hist(res, bins=20)
plt.xlabel('Correlation')
plt.ylabel('Number of neighbourhoods')
plt.show()


print('Average correlation: ', sum(res)/len(res))
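
A grouped variant of this computation can also skip neighbourhoods with too few listings to give a meaningful correlation. The sketch below runs on made-up data; `toy`, `MIN_LISTINGS`, and `price_parking_corr` are hypothetical, and the cutoff of 3 is arbitrary.

```python
import pandas as pd

MIN_LISTINGS = 3  # hypothetical cutoff; choose a value that suits the data

# Hypothetical toy data with the same column names as `listings`.
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "parking": [1, -1, 1, -1, 1, 1, -1, -1, 1, -1],
    "price": [120, 100, 130, 90, 60, 70, 50, 55, 300, 200],
})

def price_parking_corr(group: pd.DataFrame) -> float:
    """Correlation between price and parking within one neighbourhood."""
    return group["price"].corr(group["parking"])

# Look up each row's neighbourhood size and keep only the large enough ones.
sizes = toy["neighbourhood"].map(toy["neighbourhood"].value_counts())
large_enough = toy[sizes >= MIN_LISTINGS]

by_neigh = {
    name: price_parking_corr(group)
    for name, group in large_enough.groupby("neighbourhood")
}
print(by_neigh)
```

Here neighbourhood `C` is dropped because it has only two listings. With two listings that differ in both price and parking, the correlation is always exactly 1 or -1, which says nothing about the underlying relationship.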


#### 4.3 Explain the relationship between the histogram and our finding that parking is negatively correlated with price.

**Answer.** Our original correlation of about $-0.02$ was the correlation between price and parking <i>for all listings in NYC</i> – that is, the conditional correlation between price and parking <i>given that you are in NYC</i>. A dictionary entry such as ```cvec['Brooklyn']``` would be the correlation between price and parking <i>for all listings in Brooklyn</i> – that is, the conditional correlation between price and parking <i>given that you are in Brooklyn</i>.

The histogram shows us that most of the conditional correlations within neighborhoods are positive, even though the correlation across all of NYC is negative. Roughly speaking, this means that the following are all occurring:

1. Within neighborhoods, parking is positively associated with price.
2. Different neighborhoods have very different typical prices (as we saw last section).
3. Parking tends to be concentrated in cheaper neighborhoods.

The correlation values of 1 and -1 are presumably due largely to neighborhoods with very few listings, and should essentially be ignored. Viewing the histogram, however, we can see that a clear majority of correlations are at least slightly positive, for an average correlation of 0.08.
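The three conditions listed above are enough to produce a sign flip on their own. The sketch below builds a tiny made-up dataset (all values hypothetical) in which parking raises the price inside each area, yet the pooled correlation is negative because parking is concentrated in the cheap area.

```python
import pandas as pd

# Hypothetical toy data: one cheap area where parking is common and one
# expensive area where it is rare. Within each area, parking adds to the price.
toy = pd.DataFrame({
    "neighbourhood": ["cheap"] * 5 + ["pricey"] * 5,
    "parking": [1, 1, 1, 1, -1, -1, -1, -1, -1, 1],
    "price": [70, 75, 80, 72, 60, 300, 310, 290, 305, 330],
})

# Correlation over all rows, ignoring neighbourhood.
pooled = toy["price"].corr(toy["parking"])

# Correlation computed separately inside each neighbourhood.
within = {
    name: group["price"].corr(group["parking"])
    for name, group in toy.groupby("neighbourhood")
}

print("pooled:", round(pooled, 2))
print("within:", {k: round(v, 2) for k, v in within.items()})
```

This pattern, where a pooled association reverses once the data is split by group, is commonly known as Simpson's paradox.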

#### 4.4
Plot the histogram that overlays the distribution of price for parking and non-parking (use sns.kdeplot) for the neighborhoods: `St. George`,`Greenwood Heights`,`Rego Park`,`Brooklyn Navy Yard`.

If we plot this by neighborhood for a few neighborhoods, we can see this somewhat positive correlation of parking vs. no parking visually: 

In [None]:
plt.figure(figsize=(12,10))
neigh_to_look = ['St. George','Greenwood Heights','Rego Park','Brooklyn Navy Yard']
for i, neigh in enumerate(neigh_to_look):
    plt.subplot(2,2,i+1)
    sns.kdeplot(listings.loc[(listings['parking'] == 1) & (listings['neighbourhood'] == neigh),'price'],shade = True, label="Parking",color="g")
    sns.kdeplot(listings.loc[(listings['parking'] == -1) & (listings['neighbourhood'] == neigh),'price'],shade = True, label="No Parking",color="r")
    plt.title("Parking vs. No Parking for neighbourhood = " + str(neigh));

As we have seen, the existence of unexpected correlations should spur investigation into potential interaction effects, which lead to potentially interesting hypotheses. Thus, one good way of generating iterative hypotheses is to find and think about potential interaction effects.

### Finding more interactions: how does price vary by property type?

We saw that finding conditional correlations or interactions is a good way to generate further hypotheses, as many interesting lines of investigation arise from investigating these **confounding variables**. Here is another example: let's now look at how price varies with property type. The following code plots the price of a one bedroom listing broken down by the property type:

### Exercise 5:

#### 5.1
Write code to make a boxplot of price of one bedroom property across all property types.

In [None]:
# code here


#### 5.2

What can you conclude about the variation in price of a one bedroom by the property type?

**Answer.** There is significant variation in price according to the property type; a room in a house or a loft is the cheapest, while cabins, boutique hotels, and boats are very expensive! It is also interesting to see huge variations in hotel prices.
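A numeric summary can complement the boxplot, since some property types may have very few listings. The following is a sketch on made-up data (`toy` and its values are hypothetical) that reports the count and median price of one bedroom listings per property type.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use the `listings` DataFrame.
toy = pd.DataFrame({
    "property_type": ["Apartment", "Apartment", "House", "House",
                      "Loft", "Boat", "Apartment"],
    "bedrooms": [1, 1, 1, 1, 1, 1, 2],
    "price": [120, 140, 80, 90, 85, 400, 250],
})

# Restrict to one bedroom listings, then summarize price per property type.
one_bed = toy[toy["bedrooms"] == 1]
summary = (
    one_bed.groupby("property_type")["price"]
    .agg(["count", "median"])
    .sort_values("median")
)
print(summary)
```

Reporting the count next to the median shows which medians rest on only one or two listings and should be read with caution.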

#### 5.3

Do the same price vs. property type plot for two bedroom listings.

In [None]:
# code here

## Exploring temporal effects: summer in Rio and winter in Moscow

We have seen that conditional plots can be a useful way to "correct" comparisons by taking into account interaction effects.

Time is a very common interaction effect that appears across lots of datasets. For Airbnb data, this is especially important, as Airbnb is often more expensive near holidays, and so reasonable price estimates must take this into account. In practice this is one of the most important corrections offered by Airbnb pricing consultancy firms, and corrections usually take advantage of data pooled from many somewhat similar cities. This is vital to achieving good corrections, but it is easy to make mistakes by failing to account for important city-to-city differences.

We begin by opening up the calendar data and counting (i) the number of rentals per day; and (ii) their total prices: 

In [None]:
cal = pd.read_csv('scal.csv', delimiter=',')
cal.head()
# Count rentals and total price on each date.

In [None]:
cal.shape

In [None]:
rcount = dict()
rprice = dict()

for row in cal.itertuples(index=True, name='Pandas'): #The itertuples() function is used to iterate over DataFrame rows as namedtuples.
    rcount[str(row[1])] = rcount.get(str(row[1]), 0) + 1
    rprice[str(row[1])] = rprice.get(str(row[1]), 0) + row[2] 
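
The same per-day totals can be computed with a pandas `groupby`, which is usually faster than looping over rows. This is a sketch under assumptions: the column names `date` and `price` are guesses, so check `cal.columns` before adapting it, and `toy_cal` is made-up data.

```python
import pandas as pd

# Hypothetical toy calendar; the column names `date` and `price` are assumed.
toy_cal = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-02"],
    "price": [100.0, 200.0, 90.0, 120.0, 150.0],
})
toy_cal["date"] = pd.to_datetime(toy_cal["date"])

# One row per date: number of rentals, total price, and average price.
daily = toy_cal.groupby("date")["price"].agg(count="count", total="sum", mean="mean")
print(daily)
```

One difference from the loop above: `count` skips rows whose price is missing, whereas the loop counts every row, so the two can disagree if the calendar contains missing prices.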

In [None]:
# Next, plot the results. 
tempcount = sorted(rcount.items())
x, y = zip(*tempcount) #The purpose of zip is to map the similar index of multiple containers so that they can be used just using as single entity
tempprice = sorted(rprice.items())
u,v = zip(*tempprice)

In [None]:
# Next, we look at average price

tempprice = sorted(rprice.items())
u,v = zip(*tempprice)

ratio = lambda a,b: float(a)/float(b) 

avgprice = list(map(ratio, v,y))

xd = pd.to_datetime(x)


In [None]:
plt.figure(figsize=(12,10))
plt.plot(xd,avgprice)
plt.xticks(rotation = 'vertical')
plt.ylabel('Average price')
plt.title("Average price vs. Date")
plt.show()

In [None]:
#Let us also plot a smaller time interval
plt.figure(figsize=(12,10))
plt.plot(xd[0:28],avgprice[0:28])
plt.xticks(rotation = 'vertical')
plt.ylabel('Average price')
plt.title("Average price vs. Date (first 28 days)")
plt.show()

When analyzing time series data like this, it is common to view it as a sum of several contributing effects over time plus noise. The two common types of summands in such a representation are:

1. **Seasonal effects**: this is a summand that is periodic, often with period corresponding to the calendar (week, month or year).
2. **Trend effects**: this is a smooth summand that goes up or down slowly over an entire series, representing long-term trends such as price inflation.
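
To illustrate how these two summands can be separated, here is a rough sketch on a synthetic series (all numbers hypothetical). It smooths out the weekly pattern with a centered 7-day rolling mean and then averages what is left by day of the week. This is one simple approach among many, not a full decomposition method.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic series: slow upward trend plus a weekend bump.
dates = pd.date_range("2019-01-07", periods=56, freq="D")  # starts on a Monday
trend_true = 100 + 0.5 * np.arange(len(dates))
weekend_bump = np.where(dates.dayofweek >= 4, 20.0, 0.0)  # Fri, Sat, Sun
toy_price = pd.Series(trend_true + weekend_bump, index=dates)

# A centered 7-day rolling mean averages over one full week, so the weekly
# oscillation cancels out and a smooth estimate of the price level remains.
trend_est = toy_price.rolling(window=7, center=True).mean()

# Average the detrended values by day of week (Monday=0) to estimate
# how each weekday deviates from the smooth level.
detrended = (toy_price - trend_est).dropna()
weekly_effect = detrended.groupby(detrended.index.dayofweek).mean()
print(weekly_effect.round(2))
```

In this toy series the estimated weekend days sit 20 units above the weekdays, matching the bump that was built in. On real data the window length should match the period of the seasonal effect being removed.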

### Exercise 6:

#### 6.1

Visually, can you see any strong seasonal or trend components? What do they mean?

**Answer.**

1. There is an extremely strong cyclical component that repeats every week. This corresponds to the fact that weekend travel is very different from weekday travel.
2. There is a trend of increasing prices over time.
3. Calendar components: this is a component with sharp "spikes" that is designed to correct for any idiosyncratic elements of our calendar. This might include: (i) a monthly time series with a dip in February (since it is the shortest month); (ii) spikes in months that contain five Saturdays (since there may be more spending on weekends); or (iii) a daily time series with a dip on Labor Day (when stores are closed).

#### 6.2

What is the enormous spike that you see in this chart? Is it real, and how would you describe what is going on in layman's terms?

**Answer.** This spike occurs at Christmas, the busiest holiday season. We expect it every year and must incorporate it in any reasonable model.

#### 6.3

Can we guess the busiest season (excluding Christmas) from this raw chart?

**Answer.** This would be difficult. Notice that this chart covers about a year, but there is a clear discontinuity if you try to "wrap" the data (i.e. the difference between the first and last day on this chart is significant). This is caused by an underlying trend of increasing prices every year. To figure out the best season, you would need to extract out this trend, which is difficult to do from a single year's data in a single city.

This brings us to an important topic: bringing in auxiliary datasets! The Inside Airbnb website includes calendar data for many cities, and we can use these to adjust for the trend component. To get some diversity, we should make sure to source some data from: (i) a city close to NYC; (ii) a city in the US with very different weather; and (iii) some cities very far away.

## Conclusions

In this case, we saw that Airbnb prices are influenced by many factors. Some of the main ones include location, date, number of bedrooms, number of guests, and property type.

Any future model we build should feature these factors. Incorporating some of these factors, such as the number of bedrooms, should be straightforward, as this has a large and nearly linear relationship to price. But others, such as location, exhibit very non-linear relationships.

We also found some surprising correlations, such as the negative correlation between price and parking. However, after breaking the data down by neighborhoods and incorporating the interaction effect of location, most of the within-neighborhood correlations turned out to be slightly positive.

Temporal effects are a very specific type of interaction effect which must be dealt with separately. Our exploration tells us that any model of Airbnb pricing should take into account strong seasonal components as well as strong spikes around major holidays.

## Takeaways
In this case, you learned the following exploration process:

1. Start by looking at marginal distributions of quantities of interest to look for interesting patterns and/or outliers.
2. A correlation matrix can quickly reveal the most promising candidate variables for further investigation.
3. Investigate each of these candidate variables in turn. Note which ones exhibit interesting and unexpected correlations.
4. Explore potential interaction effects for the variables with unexpected correlations. Suspected important interactions should be looked at directly with further plotting.
5. Finally, take some time to carefully plot any interactions that you know to be important from domain knowledge. In our case, we looked at two features that are common to many datasets: location data and temporal data. Both of these contained very important signals that were immediately visually apparent, but which were strongly non-linear and could not easily be reduced to correlations or other simple summaries.

This process can be a bit daunting at first, but it is widely used by veteran data analysts and scientists and is extremely effective in most situations. By iteratively generating hypotheses throughout this process and investigating them, you can uncover great insight about what is going on without building a single formal model. Formal modelling will be discussed in future cases.

## Extended read

### Distribution shapes

https://mathbitsnotebook.com/Algebra1/StatisticsData/STShapes.html

### Understanding and Interpreting boxplots

https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots

### Pandas display options

https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

### Python seaborn plotting tutorial

https://elitedatascience.com/python-seaborn-tutorial