# Giving geospatial context to a dataset

This is a short analysis to illustrate whether the distance to an art gallery, museum or other cultural center might influence the price you pay per night for your Airbnb. The Airbnb dataset used for this exercise was downloaded from the website [Inside Airbnb](http://insideairbnb.com/) and the inventory of cultural sites was downloaded from [Seattle's Open Data Portal](https://data.seattle.gov/). Using these publicly accessible datasets, we will show how to give some spatial context to a dataset while trying to answer the following questions:

1. Which are the most expensive neighborhoods in Seattle and what is their average distance to cultural sites?
2. Is there a correlation between price per night and the proximity to the city's cultural sites?
3. Which are the most influential variables for predicting price per night?

## Data exploration and manipulation

In [None]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import seaborn as sns
import plotly.express as px

In [None]:
# Import the Airbnb dataset
listings = pd.read_csv('data/listings.csv')
# listings.head(3)

In [None]:
# Data cleaning
# drop all rows with nulls in columns ['price', 'latitude', 'longitude']
listings.dropna(subset=['price', 'latitude', 'longitude'], inplace=True)
# clean up 'price' column
tmp_price = listings['price'].str.split('$', expand=True)
listings['price_cleansed'] = tmp_price[1].str.replace(',', '').astype('float')
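
The split on `$` above assumes every price string starts with a dollar sign. As a minimal sketch of an alternative, the snippet below strips the `$` and the thousands separators with a single regular expression. The sample values and the name `raw_prices` are illustrative assumptions, not rows from the Inside Airbnb file.

```python
import pandas as pd

# Hypothetical sample of price strings in the '$1,250.00' style (not real rows)
raw_prices = pd.Series(['$85.00', '$1,250.00', '$300.00'])

# Remove the dollar sign and thousands separators, then convert to float
price_cleansed = raw_prices.str.replace(r'[$,]', '', regex=True).astype('float')

print(price_cleansed.tolist())
```

Applied to the notebook, the same pattern could replace the two-step split and replace, but that is a design choice rather than a required fix.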

In [None]:
# Import the cultural sites dataset
cultural = pd.read_csv('data/Seattle_Cultural_Space_Inventory.csv')
# cultural.head(3)

In [None]:
# Data cleaning
# drop all rows with null in columns ['Latitude', 'Longitude']
cultural.dropna(subset=['Latitude', 'Longitude'], inplace=True)

## Create geospatial variable

In [None]:
# Create geodataframes and reproject them to UTM Zone 10N
gpd_airbnb = gpd.GeoDataFrame(listings, geometry=gpd.points_from_xy(listings.longitude, listings.latitude), crs='EPSG:4326').to_crs('EPSG:32610')
gpd_cultural = gpd.GeoDataFrame(cultural, geometry=gpd.points_from_xy(cultural.Longitude, cultural.Latitude), crs='EPSG:4326').to_crs('EPSG:32610')
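
Reprojecting matters because coordinates in EPSG:4326 are expressed in degrees, so planar distances computed on them are not in meters. As a rough illustration, the sketch below uses the haversine formula (a spherical approximation, not what GeoPandas does internally) to show that one degree of longitude covers noticeably less ground than one degree of latitude near Seattle. The reference point (47.6, -122.3) and the Earth radius of 6,371 km are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Approximate location in Seattle (assumed for illustration)
lat, lon = 47.6, -122.3

one_deg_lat_m = haversine_m(lat, lon, lat + 1, lon)
one_deg_lon_m = haversine_m(lat, lon, lat, lon + 1)

print(f"1 degree of latitude  ~ {one_deg_lat_m / 1000:.1f} km")
print(f"1 degree of longitude ~ {one_deg_lon_m / 1000:.1f} km")
```

Because the two axes scale differently, a distance measured directly in degrees would be distorted; a projected CRS such as UTM Zone 10N gives coordinates in meters on both axes.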

In [None]:
# Join both geodataframes based on the closest cultural point
gpd_airbnb_cultural = gpd_airbnb.sjoin_nearest(gpd_cultural, distance_col="distance")
gpd_airbnb_cultural[['price_cleansed', 'distance']]
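
Conceptually, the nearest join pairs every listing with its closest cultural site and records the separation between them. The brute-force sketch below illustrates that idea on made-up projected coordinates in meters. It is not how `sjoin_nearest` is implemented, and the names `listings_xy`, `sites_xy` and `nearest_site` are hypothetical.

```python
import math

# Hypothetical projected coordinates in meters (e.g. UTM eastings/northings)
listings_xy = {'listing_a': (0.0, 0.0), 'listing_b': (1000.0, 1000.0)}
sites_xy = {'museum': (300.0, 400.0), 'gallery': (1000.0, 400.0)}

def nearest_site(point, sites):
    """Return (site_name, distance_m) for the site closest to `point`."""
    return min(
        ((name, math.dist(point, xy)) for name, xy in sites.items()),
        key=lambda pair: pair[1],
    )

# Map each listing to its nearest site and the distance to it
nearest = {name: nearest_site(xy, sites_xy) for name, xy in listings_xy.items()}
for name, (site, dist_m) in nearest.items():
    print(f"{name}: nearest site = {site}, distance = {dist_m:.0f} m")
```

One difference worth keeping in mind: when two sites are equidistant from a listing, `sjoin_nearest` returns a row for each of them, so the joined frame can contain more rows than there are listings.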

In [None]:
# Visualization
base = gpd_airbnb.plot(markersize=2, color="blue", figsize=(10,8))
gpd_cultural.plot(ax=base, markersize=2, color='red');

## Which are the most expensive neighborhoods in Seattle and what is their average distance to cultural sites?

In [None]:
# Number of listings by neighborhood (each row of the joined frame is one listing)
print(gpd_airbnb_cultural['neighbourhood_group_cleansed'].value_counts()[:3])
gpd_airbnb_cultural['neighbourhood_group_cleansed'].value_counts().plot.bar(figsize=(10,8), xlabel='', ylabel='# of listings');

In [None]:
# Average price by neighbourhood
print(gpd_airbnb_cultural.groupby('neighbourhood_group_cleansed')['price_cleansed'].mean().sort_values(ascending=False)[:3])
gpd_airbnb_cultural.groupby('neighbourhood_group_cleansed')['price_cleansed'].mean().sort_values(ascending=False).plot.bar(figsize=(10,8), xlabel='', ylabel='Price per night (USD)');

In [None]:
# Average distance to cultural sites by neighbourhood
print(gpd_airbnb_cultural.groupby('neighbourhood_group_cleansed')['distance'].mean().sort_values()[:3])
gpd_airbnb_cultural.groupby('neighbourhood_group_cleansed')['distance'].mean().sort_values().plot.bar(figsize=(10,8), xlabel='', ylabel='Distance to cultural sites (meters)');
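
To answer the question in one view, the average price and the average distance can be placed side by side. The sketch below does this with named aggregation on a small made-up frame. The neighborhood names are placeholders and the numbers are illustrative, not results from the Seattle data.

```python
import pandas as pd

# Made-up rows standing in for gpd_airbnb_cultural (illustrative values only)
toy = pd.DataFrame({
    'neighbourhood_group_cleansed': ['A', 'A', 'B', 'B', 'C'],
    'price_cleansed': [200.0, 100.0, 90.0, 110.0, 60.0],
    'distance': [100.0, 300.0, 500.0, 700.0, 900.0],
})

# One row per neighborhood: mean price, mean distance and listing count,
# ordered from most to least expensive
summary = (
    toy.groupby('neighbourhood_group_cleansed')
       .agg(mean_price=('price_cleansed', 'mean'),
            mean_distance_m=('distance', 'mean'),
            n_listings=('price_cleansed', 'size'))
       .sort_values('mean_price', ascending=False)
)
print(summary)
```

Including the listing count helps when reading the averages, since a neighborhood with very few listings can have an unstable mean.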

## Is there a correlation between price per night and the proximity to the city's cultural sites?

In [None]:
# Scatter plot of price and distance to cultural sites
gpd_airbnb_cultural.plot.scatter('distance', 'price_cleansed', figsize=(10,8),xlabel='Distance to cultural sites (meters)', ylabel='Price per night (USD)');

In [None]:
# correlation between price and distance (<1000m) to cultural sites
# gpd_airbnb_cultural[gpd_airbnb_cultural['distance'] < 1000][['price_cleansed', 'distance']].corr()
# correlation between price and distance to cultural sites
gpd_airbnb_cultural[['price_cleansed', 'distance']].corr()
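
By default, `DataFrame.corr()` returns the Pearson correlation coefficient, which measures only linear association. As a sketch of what that number means, the function below computes it from first principles on made-up values. `pearson_r` is a hypothetical helper, not part of pandas.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up distances (m) and prices (USD), for illustration only
distance = [100, 200, 300, 400, 500]
price = [150, 140, 130, 120, 110]  # price falls linearly with distance

r = pearson_r(distance, price)
print(round(r, 3))
```

A coefficient close to zero indicates little linear relationship, but a non-linear one could still exist. A rank-based measure such as `corr(method='spearman')` may be worth comparing.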

In [None]:
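# correlation between numeric variables using a heatmap
# numeric_only=True skips text and geometry columns, which newer pandas versions reject
corr = gpd_airbnb_cultural.corr(numeric_only=True)
# boolean mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style('white'):
    f, ax = plt.subplots(figsize=(12, 10))
    ax = sns.heatmap(corr, mask=mask, vmax=0.3, square=True, cmap='YlGnBu');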

In [None]:
# Average price by number of guests accommodated
gpd_airbnb_cultural.groupby('accommodates')['price_cleansed'].mean().plot(kind='bar');

In [None]:
# Average price by number of bathrooms
gpd_airbnb_cultural.groupby('bathrooms')['price_cleansed'].mean().plot(kind='bar');

## Which are the most influential variables for predicting price per night?

In [None]:
# Define variables of interest based on the data exploration and cleaning
variables_of_interest = [
    'neighbourhood_group_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'beds',
    'amenities',
    'price_cleansed',
    'availability_30',
    'availability_60',
    'availability_90',
    'availability_365',
    'number_of_reviews',
    'review_scores_rating',
    'cancellation_policy',
    'reviews_per_month']