
# What features determine the price of an Airbnb rental?

In [None]:
import numpy                 as np
import pandas                as pd
import matplotlib.pyplot     as plt
import seaborn               as sns
import folium  #needed for interactive map
from folium.plugins import HeatMap
%matplotlib inline
sns.set()

## Some basic data exploration

We begin by loading the data and looking at its basic shape:

In [None]:
listings = pd.read_csv('data/airbnb_nyc.csv', delimiter=',')
listings.shape

We display the basic listings data:

In [None]:
pd.options.display.max_columns = 100
listings.head(3)

### Plotting the marginal distributions of key quantities of interest

As you have seen in the Python cases, it is good practice to start by getting an overview of how a few key quantities of interest are distributed. Let's do so for some numeric variables: `price`, `bedrooms`, `bathrooms`, and `number_of_reviews`.

### Example 1

Use the [describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) command to compute some important summary statistics for the above variables.

**Answer.** One possible solution is given below:

In [None]:
listings[['price','bedrooms','bathrooms','number_of_reviews']].describe()

In [None]:
listings[['price','bedrooms','bathrooms','number_of_reviews']].quantile([0.9,0.95,0.99])

### Exercise 1

Plot the histograms of the above variables.

**Answer.**

-------

## Inspecting price against variables of interest

Using `seaborn`, we can create box plots in which the data are grouped by a second column. For instance:

In [None]:
sns.boxplot(x = "bedrooms", y='price', data = listings)
plt.title("Boxplot of Price vs. bedrooms")

### Exercise 2

Create box plots of `price` vs. `bathrooms`, `price` vs. `number_of_reviews`, and `price` vs. `review_scores_cleanliness`.

**Answer.**

-------

## Investigating correlations

To calculate correlation coefficients, we use [**`.corr()`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) from `pandas`. For instance, to find the correlation $r$ between `price` and `bedrooms`, we can run this line:

In [None]:
listings[["price", "bedrooms"]].corr()

This gives you a **correlation matrix** that tells you that

$$
\begin{aligned}
r_{price,price} &= 1\\
r_{price,bedrooms} &= r_{bedrooms,price} = 0.454539
\end{aligned}
$$
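
For reference, `.corr()` computes the Pearson correlation coefficient by default. For two columns $x$ and $y$ with $n$ paired observations and means $\bar{x}$ and $\bar{y}$ (generic symbols introduced here for illustration), it is defined as

$$
r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

so $r$ always lies between $-1$ and $1$, and $r_{x,y} = r_{y,x}$.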

If you want to extract only $r_{price,bedrooms}$, you can index the resulting object by row and column label, for example with `.loc["price", "bedrooms"]`.
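
A minimal sketch of extracting a single coefficient, using a small made-up DataFrame (`toy` and its values are purely illustrative and are not taken from the listings data). It shows two options that should agree: indexing the correlation matrix with `.loc`, and calling `Series.corr()` on the two columns directly.

```python
import pandas as pd

# Hypothetical toy data; not taken from the Airbnb dataset
toy = pd.DataFrame({
    "price":    [100, 150, 200, 260, 300],
    "bedrooms": [1, 1, 2, 3, 3],
})

corr_matrix = toy[["price", "bedrooms"]].corr()

# Option 1: index the correlation matrix by row and column label
r_from_matrix = corr_matrix.loc["price", "bedrooms"]

# Option 2: correlate the two columns directly, skipping the matrix
r_direct = toy["price"].corr(toy["bedrooms"])

print(r_from_matrix, r_direct)
```

The second option is convenient when only one pair of columns is of interest.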

Calling `.corr()` on the whole DataFrame gives the correlation matrix for all of its numeric variables:

In [None]:
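# Restrict to numeric columns; recent pandas versions raise an error on non-numeric ones
corrm = listings.select_dtypes(include="number").corr()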
corrm

### Exercise 3

Write code to print the columns which are positively correlated with `price`, from most positive to least positive. Similarly, print the columns which are negatively correlated, from most negative to least negative.

**Answer.**

-------

## Location, location, location!

Let's create an interactive map of New York. This can be easily done with the `folium` package:

In [None]:
folium_map = folium.Map()
folium_map

This is certainly a nice map, but it is not a map of New York yet. We know that New York's coordinates are latitude 40.738 (Northern hemisphere) and longitude -73.98 (Western hemisphere), so let's set that as the center of our map:

In [None]:
ny_coords = [40.738, -73.98] # lat, long
folium_map = folium.Map(location=ny_coords)
folium_map

This looks much better. We can set a default zoom that gives a closer view of the city:

In [None]:
ny_coords = [40.738, -73.98] # lat, long
folium_map = folium.Map(location=ny_coords, zoom_start=13)
folium_map

We can also change the tile style with the `tiles` argument (the default is `OpenStreetMap`):

In [None]:
ny_coords = [40.738, -73.98] # lat, long
folium_map = folium.Map(location=ny_coords, zoom_start=13, tiles="OpenStreetMap")
folium_map

### Exercise 4 (optional)

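There are other styles available for the `tiles` argument (which ones are built in depends on your `folium` version; recent releases no longer bundle the Stamen styles):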

* `Stamen Toner`
* `Stamen Terrain`
* `Stamen Watercolor`
* `CartoDB positron`
* `CartoDB dark_matter`

Experiment with each one of these styles and take a screenshot of your favorite one. Then share it with the class.

### Using heat maps to understand the price distribution with location

Next, we create a heat map of the price of apartments in NYC. This will give us a sense of where the important locations are. First, the canvas:

In [None]:
folium_hmap = folium.Map(location=ny_coords, zoom_start=13, tiles="OpenStreetMap")

Now we prepare the data. `folium` needs a list in which each element contains the `latitude`, the `longitude`, and the `price` of a listing. We can use Python's handy [**`zip()`**](https://www.w3schools.com/python/ref_func_zip.asp) function, which takes any number of iterables and groups their elements position by position into tuples, like this:

![The zip function](data/images/zip_function.png)

**Note:** In order to inspect the elements inside a `zip` object, we first need to convert it into a list.

In [None]:
my_zip = zip(listings['latitude'], listings['longitude'], listings['price'])
list_of_my_zip = list(my_zip)
list_of_my_zip[0:15]
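
To see the behavior of `zip()` in isolation, here is a small sketch with three short, made-up lists (the values are hypothetical and only stand in for the listing columns). It also illustrates two details that are easy to miss: `zip()` stops at the shortest input, and a `zip` object can only be consumed once.

```python
# Hypothetical values standing in for the listing columns
latitudes = [40.71, 40.73, 40.75]
longitudes = [-74.00, -73.99, -73.98]
prices = [120, 95, 210]

# Three iterables give tuples of length three
triples = list(zip(latitudes, longitudes, prices))
print(triples)

# zip() stops as soon as the shortest iterable runs out
short = list(zip(latitudes, prices[:2]))

# A zip object is an iterator: a second pass over it yields nothing
z = zip(latitudes, longitudes)
first_pass = list(z)
second_pass = list(z)
```

This is why the cell above converts the `zip` object into a list once and reuses that list afterwards.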

The next step is to create a `HeatMap` layer with the data:

In [None]:
hm_layer = HeatMap(list_of_my_zip,
                   # These are parameters that we tweak manually to adjust color
                   # See folium docs for more information
                   min_opacity=0.2,
                   radius=8,
                   blur=6, 
                 )

We can finally add this layer to our map and see the result:

In [None]:
folium_hmap.add_child(hm_layer)
folium_hmap

Let's save the map as HTML, so that we can share it later with people who don't have Jupyter on their computers. As HTML files, Folium maps can be viewed in any modern browser.

In [None]:
folium_hmap.save("hmap.html")

To test that everything worked correctly, go to your folder and look for the `hmap.html` file. Then open it with your browser.

### Exercise 5 (optional)

Make a heat map using `folium` like the one we just made, only this time make the temperature of the map dependent on `review_scores_rating` rather than on `price`.

**Hint:** You will need to remove null values from your DataFrame. To avoid discarding rows that contain useful data for the analyses that come after this exercise, don't overwrite `listings` - rather, create a new DataFrame that does not contain nulls and make your map with that.
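
The pattern described in the hint can be sketched on a toy DataFrame. The column name `score` and all values below are hypothetical; the point is only that `dropna(subset=...)` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in the weight column
toy = pd.DataFrame({
    "latitude":  [40.71, 40.73, 40.75, 40.77],
    "longitude": [-74.00, -73.99, -73.98, -73.97],
    "score":     [95.0, np.nan, 88.0, np.nan],
})

# Drop rows with nulls in the columns the map needs; `toy` itself is not modified
toy_clean = toy.dropna(subset=["latitude", "longitude", "score"])

# List of [latitude, longitude, weight] rows, the same shape as the data used for the price map
heat_data = toy_clean[["latitude", "longitude", "score"]].values.tolist()
print(len(toy), len(toy_clean))
```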

**Answer.**

-------

When looking at the list of correlations, ```parking``` stood out as having a surprisingly negative correlation with price. We've seen that location has a strong influence on price; let's see if it can help explain the negative correlation exhibited by ```parking```.

### Example 2

Write code to plot, on the map, the first 1,000 locations where parking is available in blue and the first 1,000 locations where parking is not available in red.

**Hint:** You can use the commands `color = "blue"` and `color = "red"` respectively.

**Answer.** One possible solution is given below:

In [None]:
# Coordinates of listings with parking (1.0) and without parking (-1.0)
lat_long_parking_yes = listings.loc[listings["parking"] == 1.0, ["latitude", "longitude"]]
lat_long_parking_no = listings.loc[listings["parking"] == -1.0, ["latitude", "longitude"]]

folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=13,
                        tiles="OpenStreetMap")

# Plot at most the first 1,000 listings of each group;
# .head() avoids an IndexError if a group has fewer than 1,000 rows
for coords, color in [(lat_long_parking_yes, "blue"), (lat_long_parking_no, "red")]:
    for lat, lon in coords.head(1000).itertuples(index=False):
        marker = folium.CircleMarker(location=[lat, lon], radius=5, color=color, fill=True)
        marker.add_to(folium_map)

folium_map

## Interaction effects

Let's find the correlation between `price` and `parking` within each neighborhood (the `neighbourhood` column). This can be done with the `.groupby()` method:

In [None]:
cbn = listings.groupby("neighbourhood")[["price", "parking"]].corr()
cbn

Let's filter out redundant information:

In [None]:
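# Flatten the (neighbourhood, variable) MultiIndex into ordinary columns
cbn = cbn.reset_index()
# Keep only the "price" column; in the "parking" rows it holds r(parking, price)
cbn = cbn.drop(columns=["parking"])
cbn.columns = ["neighbourhood", "variable", "r_parking_price"]
# Discard the "price" rows, which hold the self-correlation of price
cbn = cbn[cbn["variable"]=="parking"]
cbn = cbn.drop(columns=["variable"])
cbn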
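
As a possible alternative to filtering the grouped correlation matrix, the per-group coefficient can also be computed directly. The sketch below uses a made-up DataFrame with two hypothetical neighbourhoods, `A` and `B`, and assumes parking is coded as 1 and -1 as in the cells above.

```python
import pandas as pd

# Hypothetical toy data: price rises with parking in A and falls with it in B
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 4 + ["B"] * 4,
    "price":   [100, 120, 140, 160, 200, 180, 160, 140],
    "parking": [-1, -1, 1, 1, -1, -1, 1, 1],
})

# One Pearson coefficient per neighbourhood, returned as a tidy two-column DataFrame
r_by_group = (
    toy.groupby("neighbourhood")[["price", "parking"]]
       .apply(lambda g: g["price"].corr(g["parking"]))
       .rename("r_parking_price")
       .reset_index()
)
print(r_by_group)
```

Note that a group in which `parking` takes a single value has no defined correlation, so its coefficient comes out as `NaN`.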

### Exercise 6

Find out how many neighborhoods present a strongly negative, mildly negative, mildly positive, and strongly positive correlation between `price` and `parking`. Specifically, we want to know how many neighborhoods show a correlation between -1 and -0.5, between -0.5 and 0, between 0 and 0.5 and between 0.5 and 1.

**Hint:** For this, you can use the [`.plot.hist()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hist.html) method again, this time specifying the bins you want the data to be split into.
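
To illustrate how explicit bin edges work, here is a small sketch on made-up correlation values (the numbers in `r_values` are hypothetical). It counts the values per bin with `np.histogram`, which gives the numbers that a histogram with the same edges would display as bar heights.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation values, one per neighborhood
r_values = pd.Series([-0.8, -0.3, -0.1, 0.2, 0.4, 0.7])

# Edges of the four bins: [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1]
bin_edges = [-1, -0.5, 0, 0.5, 1]

counts, edges = np.histogram(r_values, bins=bin_edges)
print(counts)
```

Passing the same list as `bins=bin_edges` to `.plot.hist()` should draw the corresponding bars.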

**Answer.**

-------

### Exercise 7

Create four density plots that overlay the distribution of price for parking and non-parking, for each of the following neighborhoods: `St. George`, `Greenwood Heights`, `Rego Park`, and `Brooklyn Navy Yard`.

**Hint:** Use the [`sns.kdeplot()`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function and the `hue` argument.

**Answer.**

-------

### Exercise 8

Plot average property price across all locations as a time series. The relevant dataset is `data/scal.csv`.

**Hint:** Use the `pd.to_datetime()` function with the `format="%Y%m%d"` argument to process the dates.
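
The date handling in the hint can be sketched on a toy DataFrame. The column names `date` and `price` and all values are assumptions made for illustration; the actual layout of `data/scal.csv` may differ.

```python
import pandas as pd

# Hypothetical toy data with dates stored as YYYYMMDD integers
toy = pd.DataFrame({
    "date":  [20190101, 20190101, 20190102, 20190102],
    "price": [100.0, 200.0, 110.0, 190.0],
})

# Convert to strings first so the format string is applied to the digits as written
toy["date"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

# Average price per day, indexed by date and ready to plot as a time series
daily_mean = toy.groupby("date")["price"].mean()
print(daily_mean)
```

Calling `daily_mean.plot()` on the result would then draw the series with dates on the horizontal axis.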

**Answer.**

-------

## Attribution

"New York", Inside Airbnb, [Public Domain](http://creativecommons.org/publicdomain/zero/1.0/), http://insideairbnb.com/get-the-data.html