# IBM Data Science Capstone Project: Settling in Sydney, Australia

## 1. Introduction

In this final project I chose to analyse the city of Sydney, Australia. Sydney is the largest city in Australia, with more than 5 million inhabitants. The city is spread over many neighbourhoods, each with its own atmosphere: some are close to the beach, others are closer to the city centre or the business district.
This analysis should help you find the best place to settle in Sydney.
It will show you:
 - Which locations are best in terms of infrastructure
 - What types of venues are nearby: schools, restaurants, parks, gyms, coffee shops

Depending on your personal preferences, you will be able to choose the neighbourhood that suits you best.

## 2. Data

As the analysis focuses on Sydney, we need data for Sydney and its suburbs. They can be found on the [GeoNames postal code page for New South Wales](https://www.geonames.org/postal-codes/AU/NSW/new-south-wales.html). To get the data we need to scrape the webpage. We must be careful to select the postal-code table rather than the search table that precedes it on the page. Once we have the relevant table, we need to clean it: drop the irrelevant and empty columns, rename the remaining columns, and reset the index.

In a second step we will add the latitude and longitude data to the table. As you will see, the table from the webpage already contains the latitude and longitude of each place. In an appendix I will add the code I used to transform the table and extract only the latitude and longitude from it. The final geographical data are stored in a .csv file, which is imported and combined with the main table.
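The extraction code itself is not shown in this section. Purely as an illustration, the sketch below shows one way the interleaved rows could be split, assuming every second row repeats a `latitude/longitude` string in each column, as in the table displayed in section 2.2. The function name `extract_coordinates` and the toy rows are hypothetical; this is not necessarily the code used to build `sydney_geocode.csv`.

```python
import pandas as pd

def extract_coordinates(raw: pd.DataFrame) -> pd.DataFrame:
    """Pair each place row (even position) with the coordinate row below it."""
    places = raw.iloc[::2].reset_index(drop=True)
    coords = raw.iloc[1::2].reset_index(drop=True)
    # The coordinate string is repeated in every column; read it from 'Place'.
    lat_lng = coords["Place"].str.split("/", expand=True).astype(float)
    return pd.DataFrame({
        "Postal Code": places["Code"].astype(int),
        "Lattitude": lat_lng[0],  # spelling matches the column name in the CSV
        "Longitude": lat_lng[1],
    })

# Toy input mimicking the first rows of the scraped table (hypothetical sample).
raw = pd.DataFrame({
    "Place": ["Haymarket", "-33.88/151.205", "Ultimo", "-33.881/151.198"],
    "Code": ["2000", "-33.88/151.205", "2007", "-33.881/151.198"],
})
geocode = extract_coordinates(raw)
print(geocode)
```

Splitting by position keeps each coordinate pair attached to the place listed directly above it, which is the only link between the two kinds of rows in the scraped table.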

The final table will then contain every neighbourhood in Sydney, together with its borough, postal code, latitude and longitude.

The steps below lead to the final table, which contains all the data needed to pursue our analysis.

### 2.1 Let's load the required libraries

In [1]:
# importing libraries
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the notebook
%matplotlib inline

from sklearn.cluster import KMeans # k-means clustering
# make_blobs is imported from sklearn.datasets; the samples_generator module was removed from recent scikit-learn releases
from sklearn.datasets import make_blobs

### 2.2 Creation of the dataframe by scraping the relevant webpage

In [2]:
# importing more libraries
# urllib.request is the library we use to open URLs

import urllib.request

In [3]:
# opening the webpage that contains the table

url = 'https://www.geonames.org/postal-codes/AU/NSW/new-south-wales.html'
page = urllib.request.urlopen(url)

In [4]:
pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/02/fb/1c65691a9aeb7bd6ac2aa505b84cb8b49ac29c976411c6ab3659425e045f/soupsieve-2.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.3 soupsieve-2.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
from bs4 import BeautifulSoup

In [6]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/bd/78/56a7c88a57d0d14945472535d0df9fb4bbad7d34ede658ec7961635c790e/lxml-4.6.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.2
Note: you may need to restart the kernel to use updated packages.


In [7]:
# read every HTML table on the page below into a list of dataframes
url = 'https://www.geonames.org/postal-codes/AU/NSW/new-south-wales.html'
tables = pd.read_html(url)

In [8]:
# select the postal-code table, which sits at index 2 of the list
sydney = tables[2]

In [9]:
# list the column names
sydney.columns

Index(['Unnamed: 0', 'Place', 'Code', 'Country', 'Admin1', 'Admin2', 'Admin3'], dtype='object')
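Selecting the table by position breaks as soon as the page layout changes. As a hedged alternative, the sketch below picks the table by the column names listed above instead. The helper `pick_table` and the toy frames standing in for the list returned by `pd.read_html` are hypothetical.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table contains columns {sorted(required)}")

# Stand-ins for the list returned by pd.read_html (hypothetical toy frames).
tables = [
    pd.DataFrame({0: ["search form"]}),
    pd.DataFrame({"Place": ["Haymarket"], "Code": ["2000"], "Admin2": ["SYDNEY STREETS"]}),
]
selected = pick_table(tables, ["Place", "Code", "Admin2"])
print(selected.columns.tolist())
```

If no table matches, the helper raises an error instead of silently returning the wrong table.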

In [10]:
#display table sydney
sydney

| | Unnamed: 0 | Place | Code | Country | Admin1 | Admin2 | Admin3 |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Haymarket | 2000 | Australia | New South Wales | SYDNEY STREETS | |
| 1 | | -33.88/151.205 | -33.88/151.205 | -33.88/151.205 | -33.88/151.205 | -33.88/151.205 | -33.88/151.205 |
| 2 | 2.0 | Ultimo | 2007 | Australia | New South Wales | SYDNEY STREETS | |
| 3 | | -33.881/151.198 | -33.881/151.198 | -33.881/151.198 | -33.881/151.198 | -33.881/151.198 | -33.881/151.198 |
| 4 | 3.0 | Chippendale | 2008 | Australia | New South Wales | SYDNEY STREETS | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 396 | 199.0 | St Ives Chase | 2075 | Australia | New South Wales | | |
| 397 | | -33.709/151.162 | -33.709/151.162 | -33.709/151.162 | -33.709/151.162 | -33.709/151.162 | -33.709/151.162 |
| 398 | 200.0 | Normanhurst | 2076 | Australia | New South Wales | GOSFORD | |
| 399 | | -33.723/151.097 | -33.723/151.097 | -33.723/151.097 | -33.723/151.097 | -33.723/151.097 | -33.723/151.097 |


### 2.3 Cleaning of the dataframe

In [11]:
# dropping the columns we do not need
sydney.drop(columns=['Unnamed: 0', 'Country', 'Admin1', 'Admin3'], inplace=True)
# renaming the columns: Admin2 becomes Borough, Place becomes Neighbourhood, Code becomes Postal Code
sydney.rename(columns={"Admin2": "Borough", "Place": "Neighbourhood", "Code": "Postal Code"}, inplace=True)
# display table sydney
sydney

| | Neighbourhood | Postal Code | Borough |
|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS |
| 1 | -33.88/151.205 | -33.88/151.205 | -33.88/151.205 |
| 2 | Ultimo | 2007 | SYDNEY STREETS |
| 3 | -33.881/151.198 | -33.881/151.198 | -33.881/151.198 |
| 4 | Chippendale | 2008 | SYDNEY STREETS |
| ... | ... | ... | ... |
| 396 | St Ives Chase | 2075 | |
| 397 | -33.709/151.162 | -33.709/151.162 | -33.709/151.162 |
| 398 | Normanhurst | 2076 | GOSFORD |
| 399 | -33.723/151.097 | -33.723/151.097 | -33.723/151.097 |


In [12]:
# keep the even-indexed rows (the places); the odd-indexed rows only hold the latitude/longitude strings
sydney2 = sydney[sydney.index % 2 != 1].reset_index()  

In [13]:
#display new df sydney2
sydney2

| | index | Neighbourhood | Postal Code | Borough |
|---|---|---|---|---|
| 0 | 0 | Haymarket | 2000 | SYDNEY STREETS |
| 1 | 2 | Ultimo | 2007 | SYDNEY STREETS |
| 2 | 4 | Chippendale | 2008 | SYDNEY STREETS |
| 3 | 6 | Pyrmont | 2009 | SYDNEY STREETS |
| 4 | 8 | Surry Hills | 2010 | SYDNEY STREETS |
| ... | ... | ... | ... | ... |
| 196 | 392 | South Turramurra | 2074 | GOSFORD |
| 197 | 394 | Warrawee | 2074 | GOSFORD |
| 198 | 396 | St Ives Chase | 2075 | |
| 199 | 398 | Normanhurst | 2076 | GOSFORD |


In [14]:
sydney2.columns

Index(['index', 'Neighbourhood', 'Postal Code', 'Borough'], dtype='object')

In [15]:
# dropping the extra index column
sydney2.drop(columns=['index'], inplace=True)

In [16]:
#display new df sydney2
sydney2

| | Neighbourhood | Postal Code | Borough |
|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS |
| 1 | Ultimo | 2007 | SYDNEY STREETS |
| 2 | Chippendale | 2008 | SYDNEY STREETS |
| 3 | Pyrmont | 2009 | SYDNEY STREETS |
| 4 | Surry Hills | 2010 | SYDNEY STREETS |
| ... | ... | ... | ... |
| 196 | South Turramurra | 2074 | GOSFORD |
| 197 | Warrawee | 2074 | GOSFORD |
| 198 | St Ives Chase | 2075 | |
| 199 | Normanhurst | 2076 | GOSFORD |


In [17]:
# drop rows where Borough is NaN
sydney2.dropna(subset=['Borough'], inplace=True)
#display new df sydney2
sydney2

| | Neighbourhood | Postal Code | Borough |
|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS |
| 1 | Ultimo | 2007 | SYDNEY STREETS |
| 2 | Chippendale | 2008 | SYDNEY STREETS |
| 3 | Pyrmont | 2009 | SYDNEY STREETS |
| 4 | Surry Hills | 2010 | SYDNEY STREETS |
| ... | ... | ... | ... |
| 194 | Turramurra | 2074 | GOSFORD |
| 195 | North Turramurra | 2074 | GOSFORD |
| 196 | South Turramurra | 2074 | GOSFORD |
| 197 | Warrawee | 2074 | GOSFORD |


In [18]:
#reset index
sydney3= sydney2.reset_index()

In [19]:
# dropping the extra index column
sydney3.drop(columns=['index'], inplace=True)

In [20]:
#display new df sydney3
sydney3

| | Neighbourhood | Postal Code | Borough |
|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS |
| 1 | Ultimo | 2007 | SYDNEY STREETS |
| 2 | Chippendale | 2008 | SYDNEY STREETS |
| 3 | Pyrmont | 2009 | SYDNEY STREETS |
| 4 | Surry Hills | 2010 | SYDNEY STREETS |
| ... | ... | ... | ... |
| 189 | Turramurra | 2074 | GOSFORD |
| 190 | North Turramurra | 2074 | GOSFORD |
| 191 | South Turramurra | 2074 | GOSFORD |
| 192 | Warrawee | 2074 | GOSFORD |


The dataframe sydney3 is the cleaned dataframe we will use for the rest of the exercise.
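The cleaning steps of this section can also be written as a single method chain. The sketch below is one possible condensed form, run on a hypothetical toy table with the same columns as the scraped one. It uses `iloc[::2]`, which matches the `index % 2` filter only when the dataframe has a default integer index, and `reset_index(drop=True)`, which avoids creating the extra `index` column in the first place.

```python
import numpy as np
import pandas as pd

def clean_postcode_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Condensed version of the cleaning steps (illustrative sketch)."""
    return (
        raw.iloc[::2]  # keep place rows, skip the interleaved coordinate rows
        .drop(columns=["Unnamed: 0", "Country", "Admin1", "Admin3"])
        .rename(columns={"Place": "Neighbourhood", "Code": "Postal Code", "Admin2": "Borough"})
        .dropna(subset=["Borough"])  # remove places without a borough
        .reset_index(drop=True)  # drop=True discards the old index
    )

# Toy table shaped like the scraped one (hypothetical sample rows).
c1, c2, c3 = "-33.88/151.205", "-33.709/151.162", "-33.881/151.198"
raw = pd.DataFrame({
    "Unnamed: 0": [1.0, np.nan, 2.0, np.nan, 3.0, np.nan],
    "Place": ["Haymarket", c1, "St Ives Chase", c2, "Ultimo", c3],
    "Code": ["2000", c1, "2075", c2, "2007", c3],
    "Country": ["Australia", c1, "Australia", c2, "Australia", c3],
    "Admin1": ["New South Wales", c1, "New South Wales", c2, "New South Wales", c3],
    "Admin2": ["SYDNEY STREETS", c1, np.nan, c2, "SYDNEY STREETS", c3],
    "Admin3": [np.nan, c1, np.nan, c2, np.nan, c3],
})
cleaned = clean_postcode_table(raw)
print(cleaned)
```

The chain leaves the input dataframe untouched, so the raw table stays available for later checks.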

### 2.4 Adding the latitude and longitude data

In [21]:
# installing the geocoder package
!pip install geocoder
print("geocoder is now installed!")

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
geocoder is now installed!


In [33]:
# load the geocode data for Sydney
syd_geocode = pd.read_csv("sydney_geocode.csv")
print(syd_geocode)

     Postal Code  Lattitude  Longitude
0           2000    -33.880    151.205
1           2007    -33.881    151.198
2           2008    -33.886    151.199
3           2009    -33.870    151.194
4           2010    -33.885    151.212
..           ...        ...        ...
195         2074    -33.704    151.149
196         2074    -33.753    151.111
197         2074    -33.729    151.123
198         2075    -33.709    151.162
199         2076    -33.723    151.097

[200 rows x 3 columns]


In [38]:
sydney3.dtypes

Neighbourhood    object
Postal Code      object
Borough          object
dtype: object

In [39]:
syd_geocode.dtypes

Postal Code      int64
Lattitude      float64
Longitude      float64
dtype: object

In [40]:
sydney3['Postal Code'] = sydney3['Postal Code'].astype(float)
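Casting the key to float makes the two dtypes compatible for the join, but the postal codes then display as `2000.0`. A possible alternative, sketched here on toy data, is to convert the scraped strings to integers so that both tables share the `int64` dtype that `syd_geocode` already uses. The frame `left` is a hypothetical stand-in for `sydney3`.

```python
import pandas as pd

# Hypothetical stand-in for sydney3, with postal codes stored as strings.
left = pd.DataFrame({
    "Neighbourhood": ["Haymarket", "Ultimo"],
    "Postal Code": ["2000", "2007"],
})

# Parse the strings as numbers, then make the integer dtype explicit.
left["Postal Code"] = pd.to_numeric(left["Postal Code"]).astype("int64")
print(left.dtypes)
```

With integer keys on both sides, no cast back to another dtype is needed after the join.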

In [42]:
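# join the two tables on postal code; a postal code shared by several neighbourhoods matches several coordinate rows, so the result has more rows than sydney3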
syd_data = sydney3.join(syd_geocode.set_index('Postal Code'), on='Postal Code').reset_index()
syd_data

| | index | Neighbourhood | Postal Code | Borough | Lattitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 0 | Haymarket | 2000.0 | SYDNEY STREETS | -33.880 | 151.205 |
| 1 | 0 | Haymarket | 2000.0 | SYDNEY STREETS | -33.861 | 151.204 |
| 2 | 0 | Haymarket | 2000.0 | SYDNEY STREETS | -33.861 | 151.207 |
| 3 | 1 | Ultimo | 2007.0 | SYDNEY STREETS | -33.881 | 151.198 |
| 4 | 2 | Chippendale | 2008.0 | SYDNEY STREETS | -33.886 | 151.199 |
| ... | ... | ... | ... | ... | ... | ... |
| 331 | 192 | Warrawee | 2074.0 | GOSFORD | -33.732 | 151.130 |
| 332 | 192 | Warrawee | 2074.0 | GOSFORD | -33.704 | 151.149 |
| 333 | 192 | Warrawee | 2074.0 | GOSFORD | -33.753 | 151.111 |
| 334 | 192 | Warrawee | 2074.0 | GOSFORD | -33.729 | 151.123 |
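The joined table has 335 rows although sydney3 has only 193, because the postal code is not a unique key: Haymarket, for example, is matched with every coordinate pair listed for postal code 2000. The sketch below reproduces the effect on toy data and shows one possible remedy, which keeps the first coordinate pair per postal code. The frames are hypothetical, and keeping the first pair is an assumption rather than the method used in this notebook.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Neighbourhood": ["Haymarket", "Ultimo"],
    "Postal Code": [2000, 2007],
})
geocode = pd.DataFrame({
    "Postal Code": [2000, 2000, 2007],
    "Lattitude": [-33.880, -33.861, -33.881],
    "Longitude": [151.205, 151.204, 151.198],
})

# Joining on a non-unique key yields one row per matching coordinate pair.
expanded = neighbourhoods.merge(geocode, on="Postal Code", how="left")

# One possible remedy: keep the first coordinate pair listed for each postal code.
unique_geocode = geocode.drop_duplicates(subset="Postal Code", keep="first")
one_to_one = neighbourhoods.merge(
    unique_geocode, on="Postal Code", how="left", validate="many_to_one"
)
print(len(expanded), len(one_to_one))
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand keys are not unique, which guards against silent row duplication. The trade-off is that neighbourhoods sharing a postal code receive identical coordinates. Aligning the two tables by row position would be more precise, provided both keep the same row order.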


In [43]:
# dropping the extra index column
syd_data.drop(columns=['index'], inplace=True)
syd_data

| | Neighbourhood | Postal Code | Borough | Lattitude | Longitude |
|---|---|---|---|---|---|
| 0 | Haymarket | 2000.0 | SYDNEY STREETS | -33.880 | 151.205 |
| 1 | Haymarket | 2000.0 | SYDNEY STREETS | -33.861 | 151.204 |
| 2 | Haymarket | 2000.0 | SYDNEY STREETS | -33.861 | 151.207 |
| 3 | Ultimo | 2007.0 | SYDNEY STREETS | -33.881 | 151.198 |
| 4 | Chippendale | 2008.0 | SYDNEY STREETS | -33.886 | 151.199 |
| ... | ... | ... | ... | ... | ... |
| 331 | Warrawee | 2074.0 | GOSFORD | -33.732 | 151.130 |
| 332 | Warrawee | 2074.0 | GOSFORD | -33.704 | 151.149 |
| 333 | Warrawee | 2074.0 | GOSFORD | -33.753 | 151.111 |
| 334 | Warrawee | 2074.0 | GOSFORD | -33.729 | 151.123 |


In [44]:
syd_data['Postal Code'] = syd_data['Postal Code'].astype(object)
syd_data

| | Neighbourhood | Postal Code | Borough | Lattitude | Longitude |
|---|---|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS | -33.880 | 151.205 |
| 1 | Haymarket | 2000 | SYDNEY STREETS | -33.861 | 151.204 |
| 2 | Haymarket | 2000 | SYDNEY STREETS | -33.861 | 151.207 |
| 3 | Ultimo | 2007 | SYDNEY STREETS | -33.881 | 151.198 |
| 4 | Chippendale | 2008 | SYDNEY STREETS | -33.886 | 151.199 |
| ... | ... | ... | ... | ... | ... |
| 331 | Warrawee | 2074 | GOSFORD | -33.732 | 151.130 |
| 332 | Warrawee | 2074 | GOSFORD | -33.704 | 151.149 |
| 333 | Warrawee | 2074 | GOSFORD | -33.753 | 151.111 |
| 334 | Warrawee | 2074 | GOSFORD | -33.729 | 151.123 |


In [47]:
syd_data.to_csv('syd_data.csv', index=False)  
print("Table saved as csv file!")

Table saved as csv file!


In [49]:
# WeasyPrint lets us save the table as a PDF
!pip install WeasyPrint

Collecting WeasyPrint
  Downloading https://files.pythonhosted.org/packages/ef/5b/58e85042758718f7ea5f6b3927675dc3aa25138884f0eef988a4b6653a53/WeasyPrint-52.2-py3-none-any.whl (363kB)
Collecting cairocffi>=0.9.0 (from WeasyPrint)
  Downloading https://files.pythonhosted.org/packages/84/ca/0bffed5116d21251469df200448667e90acaa5131edea869b44a3fbc73d0/cairocffi-1.2.0.tar.gz (70kB)
Collecting tinycss2>=1.0.0 (from WeasyPrint)
  Downloading https://files.pythonhosted.org/packages/65/f7/63bf697a7c7257d304269b49f1be3dfe429856889e93963d6f5790d77d82/tinycss2-1.1.0-py3-none-any.whl
Collecting html5lib>=0.999999999 (from WeasyPrint)
  Downloading https://files.pythonhosted.org/packages/6c/dd/a834df6482147d48e225a49515aabc28974ad5a4ca3215c18a882565b028/html5lib-1.1-py2.py3-none-any.whl (112kB)

NameError: name 'weasyprint' is not defined

In [51]:
#save as pdf
from weasyprint import HTML
HTML(string=pd.read_csv('syd_data.csv').to_html()).write_pdf("syd_data.pdf")

In the next step we will use the Foursquare API to get the venues for each neighbourhood.