# Sydney House Prices Geospatial Analysis
***

![hello](../images/Sydney.jpg)

# Motivation

House prices in Sydney have attracted great attention in Australia and globally, specifically for being extraordinarily high. As a Sydney resident, I was interested in the relative prices across the suburbs around where I live, and I wanted a way to visualise these geospatial relationships myself.

> A choropleth map (from Greek χῶρος choros 'area/region' and πλῆθος plethos 'multitude') is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income.



In [1]:
%reload_ext autoreload

# general modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import plotly.express as px
import os
import requests
import json
import geojson
import yaml

from box import Box
from urllib.request import urlopen

# utility module
from sydneyhouseprices.data import remoteGeoJSONToGDF

# display options
pd.options.display.float_format = '{:,.2f}'.format

## Gathering Data
The data I used is the [Sydney House Prices dataset on Kaggle](https://www.kaggle.com/mihirhalai/sydney-house-prices#__sid=js0).

In [2]:
# import config files
with open("config.yml", "r") as ymlfile:
  cfg = Box(yaml.safe_load(ymlfile))

# import data
house_prices_syd = pd.read_csv(os.path.join(cfg.files.data,"SydneyHousePrices.csv"),index_col=0,parse_dates=True)

## Inspecting the Data

A quick glance at the first few rows of the data:

In [5]:
# Inspect Data
house_prices_syd.head()

Date,Id,suburb,postalCode,sellPrice,bed,bath,car,propType
2019-06-19,1,Avalon Beach,2107,1210000,4.0,2,2.0,house
2019-06-13,2,Avalon Beach,2107,2250000,4.0,3,4.0,house
2019-06-07,3,Whale Beach,2107,2920000,3.0,3,2.0,house
2019-05-28,4,Avalon Beach,2107,1530000,3.0,1,2.0,house
2019-05-22,5,Whale Beach,2107,8000000,5.0,4,4.0,house


In [3]:
house_prices_syd.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 199504 entries, 2019-06-19 to 2011-04-16
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Id          199504 non-null  int64  
 1   suburb      199504 non-null  object 
 2   postalCode  199504 non-null  int64  
 3   sellPrice   199504 non-null  int64  
 4   bed         199350 non-null  float64
 5   bath        199504 non-null  int64  
 6   car         181353 non-null  float64
 7   propType    199504 non-null  object 
dtypes: float64(2), int64(4), object(2)
memory usage: 13.7+ MB
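
The `info()` output shows that `bed` has 154 missing entries (199,504 − 199,350) and `car` has 18,151 (199,504 − 181,353). A minimal sketch of how such gaps could be counted directly is shown here; the `toy` frame and its values are invented stand-ins for `house_prices_syd`, not the real data.

```python
import pandas as pd

# Toy frame standing in for house_prices_syd; the values are invented.
toy = pd.DataFrame(
    {
        "sellPrice": [1_210_000, 2_250_000, 2_920_000, 1_530_000],
        "bed": [4.0, None, 3.0, 3.0],
        "car": [2.0, None, None, 2.0],
    }
)

# Number of missing entries per column.
missing = toy.isna().sum()

# Fraction of rows that are missing per column.
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```

Because the map later relies only on the median `sellPrice`, which has no missing values, these gaps should not affect it; they would matter if `bed` or `car` were mapped instead.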


## Data Cleaning

A few columns are redundant for our analysis. We can remove the `Id` and `postalCode` columns.

In [None]:
clean_house_prices_syd = house_prices_syd.drop(axis=1,labels=["Id","postalCode"])
clean_house_prices_syd.head()

To get a general idea of the data, we calculate summary statistics for each of our numeric features.

In [7]:
# summary stats
clean_house_prices_syd.describe().T

,count,mean,std,min,25%,50%,75%,max
sellPrice,199504.0,1269776.3,6948239.27,1.0,720000.0,985000.0,1475000.0,2147483647.0
bed,199350.0,3.52,1.07,1.0,3.0,3.0,4.0,99.0
bath,199504.0,1.89,0.93,1.0,1.0,2.0,2.0,99.0
car,181353.0,1.94,1.06,1.0,1.0,2.0,2.0,41.0
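
The summary contains some suspicious extremes: the maximum `sellPrice` of 2,147,483,647 is exactly 2³¹ − 1 (the largest signed 32-bit integer), the minimum price is 1, and `bed` and `bath` both peak at 99. These may be placeholder or data-entry values rather than real sales, although that is an assumption. The sketch below shows one way such rows could be screened out; the bounds `MIN_PRICE`, `MAX_PRICE` and `MAX_BEDS` are hypothetical and the `toy` frame is invented.

```python
import pandas as pd

INT32_MAX = 2**31 - 1  # 2,147,483,647, the maximum sellPrice in the summary

# Toy frame with invented rows that mimic the extremes seen in describe().
toy = pd.DataFrame(
    {
        "suburb": ["A", "A", "B", "B", "B"],
        "sellPrice": [1, 985_000, 1_475_000, INT32_MAX, 720_000],
        "bed": [3.0, 3.0, 99.0, 4.0, 2.0],
    }
)

# Hypothetical plausibility bounds (assumptions, not taken from the dataset).
MIN_PRICE = 10_000
MAX_PRICE = 100_000_000
MAX_BEDS = 20

# Keep rows with a plausible price and a plausible (or missing) bedroom count.
price_ok = toy["sellPrice"].between(MIN_PRICE, MAX_PRICE)
beds_ok = toy["bed"].isna() | (toy["bed"] <= MAX_BEDS)
plausible = toy[price_ok & beds_ok]

print(plausible)
```

A per-suburb median is fairly robust to a handful of such values, so this step is optional for the map, but it would matter for any statistic based on the mean.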


For our choropleth map, we want to map each suburb to its median price.

In [8]:
# calculate median stats for each suburb
# numeric_only=True skips the text column propType, which has no median

median_statistics = clean_house_prices_syd.groupby("suburb").median(numeric_only=True)
median_statistics.head()

suburb,sellPrice,bed,bath,car
Abbotsbury,975000.0,4.0,3.0,2.0
Abbotsford,1287500.0,3.0,2.0,2.0
Agnes Banks,715000.0,4.0,2.0,2.0
Airds,505000.0,4.0,1.0,2.0
Alexandria,1027500.0,3.0,1.0,1.0
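
To illustrate why the median is a sensible choice here, the following sketch compares the mean and median per suburb on a small invented sample. The suburb names come from the data above, but the prices, including the 80,000,000 outlier, are made up for the example.

```python
import pandas as pd

# Invented prices; the last Whale Beach sale is a deliberate outlier.
toy = pd.DataFrame(
    {
        "suburb": ["Avalon Beach"] * 3 + ["Whale Beach"] * 3,
        "sellPrice": [
            1_210_000, 1_530_000, 2_250_000,
            2_920_000, 3_000_000, 80_000_000,
        ],
    }
)

# Compare the two statistics side by side for each suburb.
by_suburb = toy.groupby("suburb")["sellPrice"]
summary = pd.DataFrame({"mean": by_suburb.mean(), "median": by_suburb.median()})

print(summary)
```

In this toy sample a single extreme sale pulls the Whale Beach mean far above every other sale in the suburb, while the median stays at a typical price.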


In [9]:
# clean sydney dataframe
sydney = remoteGeoJSONToGDF(cfg.files.sydneyurl)


# rename returns a new frame, so the later assignment does not act on a slice of the original
sydney = sydney[["geometry","nsw_loca_2"]].rename(columns={"nsw_loca_2":"suburb"})
sydney["suburb"] = sydney["suburb"].str.title()
sydney.head()

,geometry,suburb
0,"MULTIPOLYGON (((151.10074 -33.84457, 151.10082...",Concord
1,"MULTIPOLYGON (((151.19808 -33.82566, 151.19816...",Wollstonecraft
2,"MULTIPOLYGON (((151.10398 -33.81987, 151.10406...",Putney
3,"MULTIPOLYGON (((151.08348 -33.30938, 151.09335...",Ten Mile Hollow
4,"MULTIPOLYGON (((151.16649 -33.75486, 151.16677...",Killara
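
The `remoteGeoJSONToGDF` helper lives in the project's utility module and is not shown in this notebook. As a rough idea of what a helper like it might do, here is a hypothetical sketch: `parse_feature_collection` and `remote_geojson_to_gdf` are names invented for illustration, the sample feature is made up, and the upper-case suburb name is an assumption suggested by the `.str.title()` call above. The actual implementation may differ.

```python
import json
from urllib.request import urlopen


def parse_feature_collection(text):
    """Parse GeoJSON text and return its list of features."""
    data = json.loads(text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    return data["features"]


def remote_geojson_to_gdf(url, crs="EPSG:4326"):
    """Hypothetical stand-in: download GeoJSON and build a GeoDataFrame."""
    # Imported here so the parsing step can be used without geopandas installed.
    import geopandas as gpd

    with urlopen(url) as response:
        features = parse_feature_collection(response.read().decode("utf-8"))
    # GeoJSON coordinates are WGS 84 longitude/latitude, hence EPSG:4326.
    return gpd.GeoDataFrame.from_features(features, crs=crs)


# Invented single-feature sample to exercise the parsing step offline.
sample = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"nsw_loca_2": "CONCORD"},
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [
                        [[151.10, -33.84], [151.11, -33.84],
                         [151.11, -33.85], [151.10, -33.84]]
                    ],
                },
            }
        ],
    }
)

features = parse_feature_collection(sample)
print(features[0]["properties"]["nsw_loca_2"])
```

Since `geopandas.read_file` also accepts a URL, a helper like this is mostly useful when the response needs validating or inspecting before it becomes a GeoDataFrame.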


In [10]:
# join each suburb's geometry to its median statistics
geo_house_prices = pd.merge(sydney,median_statistics,left_on="suburb",right_on=median_statistics.index,how="inner")
geo_house_prices.set_index("suburb",inplace=True)
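
An inner join silently drops any suburb whose name does not match exactly on both sides, for example because of spelling or capitalisation differences. The sketch below shows one way to audit this with an outer join and pandas' `indicator=True` option. The two small frames are invented stand-ins for `sydney` and `median_statistics`, and the prices are made up.

```python
import pandas as pd

# Invented stand-in for the suburb shapes (geometry omitted for brevity).
shapes = pd.DataFrame({"suburb": ["Concord", "Putney", "Ten Mile Hollow"]})

# Invented stand-in for median_statistics, indexed by suburb.
medians = pd.DataFrame(
    {"sellPrice": [1_500_000.0, 1_800_000.0, 2_000_000.0]},
    index=pd.Index(["Concord", "Putney", "Avalon Beach"], name="suburb"),
)

# The outer join keeps unmatched rows; _merge records which side each row came from.
audit = shapes.merge(medians.reset_index(), on="suburb", how="outer", indicator=True)

no_price = audit.loc[audit["_merge"] == "left_only", "suburb"].tolist()
no_shape = audit.loc[audit["_merge"] == "right_only", "suburb"].tolist()

print("Shapes without a price:", no_price)
print("Prices without a shape:", no_shape)
```

Suburbs listed in `no_price` would appear as holes in the map, and those in `no_shape` would never be drawn, so both lists are worth a glance before plotting.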

Next, we configure the map parameters and test whether rendering improves when the GeoJSON is provided from a URL instead of a local file.

In [5]:
fig = px.choropleth_mapbox(geo_house_prices, geojson=geo_house_prices.geometry, locations=geo_house_prices.index, color='sellPrice',
                           color_continuous_scale="Viridis",
                           center = {"lat": cfg.map.lat, "lon": cfg.map.lon},
                           mapbox_style=cfg.map.style,
                           range_color=(0, 4000000), 
                           opacity=0.5,
                           title="Median Selling Prices of Properties in Sydney Suburbs"
                        
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
