<a href="https://colab.research.google.com/github/neo4j-contrib/training/blob/master/ml_ai/02_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Short Term Rentals - Exploratory Data Analysis

Now we're going to see what we've imported. As with the previous notebook, let's install and import py2neo and pandas. We'll also install matplotlib to create some charts showing us the shape of the data.

In [None]:
!pip install py2neo pandas matplotlib

In [2]:
%matplotlib notebook

from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [1]:
# Change the line of code below to use the IP Address, Bolt Port, and Password of your Sandbox.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>")) 

graph = Graph("bolt://18.234.168.45:33679", auth=("neo4j", "daybreak-cosal-rumbles")) 

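
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `sandbox_settings` are hypothetical, not something the Sandbox sets for you.

```python
import os


def sandbox_settings(env=None):
    """Collect connection settings from environment variables.

    The variable names are hypothetical; set them yourself before
    starting the notebook.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "auth": (env.get("NEO4J_USER", "neo4j"), env.get("NEO4J_PASSWORD", "")),
    }


# A plain dict stands in for os.environ so the example runs anywhere.
settings = sandbox_settings({"NEO4J_URI": "bolt://<IP Address>:<Bolt Port>",
                             "NEO4J_PASSWORD": "<Password>"})
print(settings["uri"])
# With py2neo imported, the settings could then be passed on as:
# graph = Graph(settings["uri"], auth=settings["auth"])
```

Passing a dictionary makes the helper easy to try out; calling `sandbox_settings()` with no argument reads from the real environment instead.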

Now we can run the following query to inspect the schema of our database:

In [None]:
# On Neo4j 4.0 and later, use db.schema.visualization() instead of db.schema().
query = """
CALL db.schema() 
"""

graph.run(query).data()

This query returns all the labels in the database, any constraints they have, as well as relationship types.

We could also run this query in the Neo4j Browser if we want to see a visual representation:

<img align="left" src="images/airbnb-graph.svg" width="500px" />

Next, let's count how many nodes the database contains:

In [None]:
query = """
MATCH () 
RETURN COUNT(*) AS nodeCount
"""

graph.run(query).to_data_frame()

Let's drill down a bit. What types of nodes do we have?

In [None]:
result = {"label": [], "count": []}
for label in graph.run("CALL db.labels()").to_series():
    query = f"MATCH (:`{label}`) RETURN count(*) as count"
    count = graph.run(query).to_data_frame().iloc[0]['count']
    result["label"].append(label)
    result["count"].append(count)
nodes_df = pd.DataFrame(data=result)
nodes_df.sort_values("count")
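
The f-string above wraps each label in backticks so that labels containing spaces or other special characters still parse. A small helper makes that step explicit. This is a sketch: `count_query_for_label` is a hypothetical name, and it assumes that a backtick inside a quoted Cypher name is escaped by doubling it.

```python
def count_query_for_label(label):
    """Build a Cypher query that counts the nodes carrying `label`."""
    # Double any backtick inside the label so it stays a single quoted name.
    escaped = label.replace("`", "``")
    return f"MATCH (:`{escaped}`) RETURN count(*) AS count"


print(count_query_for_label("Listing"))
print(count_query_for_label("Short Term Rental"))
```

The returned string can be passed to `graph.run` in place of the inline f-string in the loop.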

We can visualize these counts using matplotlib with the following code:

In [None]:
nodes_df.plot(kind='bar', x='label', y='count', legend=None, title="Node Cardinalities")
plt.yscale("log")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

And what types of relationships?

In [None]:
result = {"relType": [], "count": []}
for relationship_type in graph.run("CALL db.relationshipTypes()").to_series():
    query = f"MATCH ()-[:`{relationship_type}`]->() RETURN count(*) as count"
    count = graph.run(query).to_data_frame().iloc[0]['count']
    result["relType"].append(relationship_type)
    result["count"].append(count)
rels_df = pd.DataFrame(data=result)
rels_df.sort_values("count")

We can visualize these counts in the same way:

In [None]:
rels_df.plot(kind='bar', x='relType', y='count', legend=None, title="Relationship Cardinalities")
plt.yscale("log")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Now let's explore the neighborhood data:

In [None]:
exploratory_query = """
MATCH (n:Neighborhood)<-[:IN_NEIGHBORHOOD]-(l:Listing)-[:HAS]->(a:Amenity) 
RETURN n.name AS neighborhood, l.name AS name, collect(a.name) AS amenities, l.price AS price 
LIMIT 25
"""

graph.run(exploratory_query).to_data_frame()

What does the distribution of prices in the dataset look like? We can use the DataFrame's `describe` method to find out:

In [None]:
query = """
MATCH (l:Listing)
RETURN l.price AS price
"""

price_df = graph.run(query).to_data_frame()
price_df.describe()

This returns some descriptive statistics that allow us to get an understanding of how prices are distributed. We can see that the average price is 139 per night, but the maximum price is 999 - there's clearly a lot of variation in prices!
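
To see why a large gap between the mean and the maximum points to a skewed distribution, here is a small sketch on made-up prices. The values are invented for illustration and are not taken from the dataset. A single very expensive listing pulls the mean well above the median.

```python
import pandas as pd

# Hypothetical nightly prices; the last value plays the role of an outlier.
sample_prices = pd.Series([50, 75, 80, 95, 120, 150, 999], name="price")

summary = sample_prices.describe()
print(summary[["mean", "50%", "max"]])

# The median (the "50%" row) is far less sensitive to the outlier than the mean.
gap = summary["mean"] - summary["50%"]
print(f"mean exceeds median by {gap:.2f}")
```

Comparing the `mean` and `50%` rows of the real `price_df.describe()` output gives the same kind of signal.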

In [None]:
fig1, ax1 = plt.subplots()
ax1.hist(price_df['price'].dropna(), bins=20, density=True, facecolor='g', alpha=0.75)
plt.tight_layout()
plt.show()

We have a very long tail going on here - the majority of listings are priced at under 200, but then there are a few properties that cost much more than this.
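
One way to put a number on "the majority" is to compute the share of listings below a threshold. The sketch below uses a small made-up `price` column in place of the real query result, and the helper name `share_below` is hypothetical.

```python
import pandas as pd


def share_below(prices, threshold):
    """Return the fraction of non-missing prices strictly below `threshold`."""
    prices = prices.dropna()
    return float((prices < threshold).mean())


# Invented values standing in for price_df; None mimics a listing without a price.
sample_df = pd.DataFrame({"price": [60, 85, 110, 150, 180, 250, 999, None]})
print(share_below(sample_df["price"], 200))
```

Applied to the real data, the call would be `share_below(price_df["price"], 200)`.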

Which neighborhoods are the most expensive to stay in?

In [None]:
query = """
MATCH (l:Listing)-[:IN_NEIGHBORHOOD]->(n:Neighborhood)
WITH n, avg(l.price) AS averagePrice
RETURN n.id AS zip, n.name AS neighborhood, averagePrice
"""

price_df = graph.run(query).to_data_frame().sort_values("averagePrice", ascending=False)
price_df.head(10)
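
The `avg` aggregation in the query groups listings by neighborhood before averaging. For readers more familiar with pandas, the same idea can be sketched with `groupby`. The rows below are invented and the neighborhood names are placeholders.

```python
import pandas as pd

# Hypothetical listings, one row per listing with its neighborhood.
listings = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "C"],
    "price": [100, 200, 80, 120, 300],
})

# Group by neighborhood, average the prices, and put the most expensive first.
average_price = (
    listings.groupby("neighborhood")["price"]
    .mean()
    .sort_values(ascending=False)
    .rename("averagePrice")
    .reset_index()
)
print(average_price)
```

Doing the aggregation in Cypher, as the notebook does, has the advantage that only one row per neighborhood is sent back to the notebook.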

The variation in average price by neighborhood is easier to see in a chart:

In [None]:
price_df.head(30).plot(kind='bar', x='zip', y='averagePrice', legend=None, title="Average price")
plt.tight_layout()
plt.show()

## Exercise 

* Can you create a similar chart showing the neighborhoods that offer the largest number of bedrooms?
* What about bathrooms?
* What about the number of listings per neighborhood?