<H1> Data Preparation </H1>

<h2>Crime Rates</h2>

We want to find a suburb with a low crime rate, so we need crime statistics for the suburbs of Melbourne.

We can obtain crime rate statistics for Melbourne up to 2016 from the following link:

http://www.jonvy.co.nz/index.php?route=product/product&path=69_110&product_id=6131&sort=p.price&order=ASC

I downloaded the Excel file and kept only the data relevant to this work, i.e., the number of offenses in each suburb over the last five reporting years (April 2011 to March 2016). The statistics can be loaded from the file as follows:


In [137]:
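# Load the crime statistics; the table will look like this
import pandas as pd
# A forward slash avoids the ambiguous backslash escape in the path
MelbourneCrimeStatsDf = pd.read_excel("Data/CrimeStatsMelbourne.xlsx")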
MelbourneCrimeStatsDf.head()

Unnamed: 0,Postcode,Apr 2011 - Mar 2012,Apr 2012 - Mar 2013,Apr 2013 - Mar 2014,Apr 2014 - Mar 2015,Apr 2015 - Mar 2016
0,3000,21467,24164,22369,22666,22337
1,3002,1022,756,1029,719,852
2,3003,436,659,557,487,610
3,3006,2070,2100,2262,2437,3114
4,3008,850,942,931,1575,1450


Next, we compute the mean number of offenses per suburb over the five years. Sorting by this mean lets us start with the suburbs that have the fewest recorded offenses.

In [139]:
# Take the mean over the five yearly columns to find the safest suburbs
# The table will look like this
MelbourneCrimeStatsDf.columns=['PostalCode',"2012","2013","2014","2015","2016"]
MelbourneCrimeStatsDf['MeanCrimes'] = MelbourneCrimeStatsDf.iloc[:,1:6].mean(axis=1)
MelbourneCrimeStatsDf.sort_values(by='MeanCrimes', ascending=True, inplace=True)
MelbourneCrimeStatsDf.reset_index(drop= True, inplace= True)
MelbourneCrimeStatsDf.head()

Unnamed: 0,PostalCode,2012,2013,2014,2015,2016,MeanCrimes
0,3852,0,0,0,1,0,0.2
1,3708,0,1,0,1,1,0.6
2,3576,1,0,0,1,1,0.6
3,3647,0,0,2,2,0,0.8
4,3724,1,0,0,0,4,1.0


<h2> Scraping Melbourne Suburb Table from Wikipedia </h2>

In [161]:
# Extracting information from html web page (table)
import requests
from bs4 import BeautifulSoup

webPageLink= "https://en.wikipedia.org/wiki/List_of_Melbourne_suburbs"
responseObject = requests.get(webPageLink).text
soup = BeautifulSoup(responseObject, 'lxml')
suburbTable =soup.find('table',{'class':'wikitable sortable'})

I read the suburb names from the Wikipedia page and created a data frame that stores each suburb name with its corresponding postal code. The data frame looks like this:

In [162]:
# Read suburb names and postal codes
import pandas as pd

rowData = soup.find_all("tr")

suburbInformation = []
for item in rowData:
    # Ignore header rows
    if item.find("th"):
        continue

    tags = item.find_all('td')
    # Skip rows that do not have both a name cell and a postal code cell
    if len(tags) < 2:
        continue
    suburbName = tags[0].text.strip()
    postalCode = tags[1].text.strip()
    suburbInformation.append([suburbName, postalCode])

suburbMelbourneDf = pd.DataFrame(suburbInformation, columns=['SuburbName', "PostalCode"])
suburbMelbourneDf.head()


Unnamed: 0,SuburbName,PostalCode
0,Bellfield,3081
1,Briar Hill,3088
2,Bundoora,3083
3,Eaglemont,3084
4,Eltham,3095
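
Since both tables carry a postal code column, one way to bring the mean crime figures and the suburb names together is a join on that column. The following is only a sketch under stated assumptions: the two small frames are made-up stand-ins for `MelbourneCrimeStatsDf` and `suburbMelbourneDf`, the `MeanCrimes` values are placeholders rather than real statistics, and it assumes the scraped postal codes arrive as strings while the Excel ones are integers.

```python
import pandas as pd

# Made-up stand-ins for the two tables built above (values are placeholders)
crime_df = pd.DataFrame({
    "PostalCode": [3081, 3088, 3083],
    "MeanCrimes": [1.0, 2.0, 3.0],
})
suburb_df = pd.DataFrame({
    "SuburbName": ["Bellfield", "Briar Hill", "Bundoora"],
    "PostalCode": ["3081", "3088", "3083"],  # scraped text, so strings
})

# Align the key type before joining; unparseable codes become NaN
suburb_df["PostalCode"] = pd.to_numeric(suburb_df["PostalCode"], errors="coerce")

# A left join keeps every suburb, even those without crime statistics
merged = suburb_df.merge(crime_df, on="PostalCode", how="left")
print(merged)
```

A left join is used here so that suburbs missing from the crime file remain visible with a `NaN` mean instead of silently disappearing.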


<h3> Geolocator to extract longitudes and latitudes for all suburbs </H3>

I used the Nominatim geocoder from geopy to extract the latitude and longitude of each suburb. I also extracted the distance in km (using the Haversine formula) of each suburb from the city center. All this information is stored in a data frame, which looks like this:

In [165]:
from geopy.geocoders import Nominatim
import numpy as np
import math

suburbMelbourneDf["Latitude"] = np.nan
suburbMelbourneDf["Longitude"] = np.nan
geolocator = Nominatim(user_agent="explorer")

latCol = suburbMelbourneDf.columns.get_loc('Latitude')
lonCol = suburbMelbourneDf.columns.get_loc('Longitude')

for i, suburbName in enumerate(suburbMelbourneDf["SuburbName"]):
    # Only the first 11 suburbs are geocoded in this run
    if i > 10:
        break
    address = "%s,Victoria,Australia" % suburbName
    try:
        # Geocode only the rows that do not have coordinates yet
        if math.isnan(suburbMelbourneDf.iloc[i, latCol]):
            loc = geolocator.geocode(address, timeout=None)
            if loc is None:
                # The geocoder found no match; leave the coordinates as NaN
                continue
            suburbMelbourneDf.iloc[i, latCol] = loc.latitude
            suburbMelbourneDf.iloc[i, lonCol] = loc.longitude
    except Exception as e:
        # Stop on service errors (for example, network failures)
        print(str(e))
        break

suburbMelbourneDf.head()



Unnamed: 0,SuburbName,PostalCode,Latitude,Longitude
0,Bellfield,3081,-37.753107,145.038478
1,Briar Hill,3088,-37.70637,145.121648
2,Bundoora,3083,-37.697306,145.066254
3,Eaglemont,3084,-37.765144,145.063331
4,Eltham,3095,-37.71787,145.15669
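
The Haversine distance mentioned above is not part of the geocoding cell, so here is a minimal sketch of how it could be computed. The function name `haversine_km`, the Earth radius of 6371 km, and the city center coordinates (roughly the Melbourne CBD) are assumptions made for illustration and are not taken from the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

# Assumed coordinates for the Melbourne city center (illustrative only)
CITY_CENTER = (-37.8136, 144.9631)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Distance from Bellfield (coordinates from the table above) to the assumed center
bellfield_km = haversine_km(-37.753107, 145.038478, *CITY_CENTER)
print(round(bellfield_km, 1))
```

With the data frame above, the same function could be applied row by row to fill a distance column once the latitude and longitude values are present.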


<h3> Foursquare API</H3>

We will use the Foursquare API to extract information about venues. The following features will be extracted:

<H4> Feature # 1 - School Information </H4>

We will retrieve information about schools within 3 km of a neighborhood.

In [8]:
# Dummy Data at the moment
radius = 3000
CLIENT_ID=""
CLIENT_SECRET=""
Latitude=""
Longitude=""
VERSION=""
LIMIT=100


In [9]:
import requests
search_query = 'School'
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
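
The feature queries in this section all reuse the same URL template and differ only in the search term. As a hedged sketch, the URL construction could be factored into one helper. `build_search_url` is a hypothetical name introduced here, not a Foursquare or `requests` function, and the credentials and version string in the usage line are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"

def build_search_url(client_id, client_secret, latitude, longitude,
                     version, radius, limit, query=None):
    """Build a venue search URL; the query term is optional."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(latitude, longitude),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    if query is not None:
        params["query"] = query
    # urlencode escapes spaces and other special characters in the values;
    # the comma is kept readable inside the ll parameter
    return BASE_URL + "?" + urlencode(params, safe=",")

# Placeholder credentials and version, for illustration only
url = build_search_url("ID", "SECRET", -37.753107, 145.038478,
                       "20180605", 3000, 100, query="Train Stations")
print(url)
```

The resulting string can be passed to `requests.get(url).json()` exactly as in the cells of this section. Leaving `query` unset gives the unfiltered search used for the general venue count.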

<H4> Feature # 2 - Park Information </H4>

We will retrieve information about parks within 3 km of a neighborhood.

In [10]:
search_query = 'Park'
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()

<H4> Feature # 3 - Train Station </H4>

We will retrieve information about train stations within 3 km of a neighborhood.

In [11]:
search_query = 'Train Stations'
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()

<H4> Feature # 4 - Restaurants </H4>

We will retrieve information about restaurants within 3 km of a neighborhood.

In [12]:
search_query = 'Restaurants'
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()

<H4> Feature # 5 - Other Venues </H4>
We want to have at least 50 venues around a suburb. That indicates that the suburb is well developed and has a lot of shops.

In [15]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, radius, LIMIT)
results = requests.get(url).json()

# This should be over 30 otherwise we will not consider this suburb
#total_number_of_venues = len(results['response']['venues']) 
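
The venue count check that is commented out above can be sketched as follows. The response layout (`results['response']['venues']`) is taken from that commented line. The helper names `count_venues` and `has_enough_venues` are hypothetical, and the threshold is left as a parameter because the text mentions 50 venues while the code comment mentions 30. The sample response is made up for illustration.

```python
def count_venues(results):
    """Number of venues in a search response; 0 if the expected keys are missing."""
    return len(results.get("response", {}).get("venues", []))

def has_enough_venues(results, min_venues):
    """True if the response lists at least min_venues venues."""
    return count_venues(results) >= min_venues

# Made-up response with three venues, for illustration only
sample_results = {"response": {"venues": [{"name": "A"}, {"name": "B"}, {"name": "C"}]}}

print(count_venues(sample_results))          # 3
print(has_enough_venues(sample_results, 3))  # True
print(has_enough_venues(sample_results, 50)) # False
```

Using `.get` with defaults means an error response without a `venues` list is counted as zero venues instead of raising a `KeyError`.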


We can use all of the above features in the analysis for our project.