# Capstone Project 
## Safest Neighborhood in Toronto for opening a commercial establishment
### Applied Data Science Capstone by IBM/Coursera


# Introduction
Toronto is a great place to live: the shopping is excellent, thousands of restaurants and cafes serve fantastic meals, and there is something to do at any hour, from strolling through parks to catching a movie or concert or watching live sports. Opening a business in Toronto is not always as easy, however, especially once crime is taken into account. This project looks at crime in Toronto to identify the best neighborhood in which to open a business.


# Business Problem
The purpose of this project is to determine which neighborhood in Toronto is the best place to open a commercial business, and which type of business to open there. The first task is to identify the safest neighborhood by analyzing crime data; the second is to analyze the 10 most common venues in that neighborhood. We will use data science techniques to carry out this analysis.

# Data
Based on the definition of our problem, the factors that will influence our decision are:
* finding the safest borough based on crime statistics
* finding the most common venues.
* choosing the right neighbourhood within the borough

We will use the geographical coordinates of Toronto to plot the neighbourhoods of the safest borough, and then we will show the 10 most common venues.

The following data sources will be needed to extract or generate the required information:

- [**Part 1**: Using data set from Kaggle containing the Toronto Crimes from 2014 to 2019](#part1):  
A dataset listing the crimes recorded in each Toronto neighborhood, along with the type of crime. We will use it to find the safest neighborhood to work with.
The crime data is the real data published in this Kaggle dataset: https://www.kaggle.com/kapastor/toronto-police-data-crime-rates-by-neighbourhood

- [**Part 2**: List of officially categorized boroughs in Toronto from Wikipedia.](#part2): 
For the Toronto neighborhood data, we will use the Wikipedia source:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and then complete the information with latitude and longitude.



- [**Part 3**: Finding the most common venues of the safest Neighborhood along with co-ordinates.](#part4): 
This data will be fetched using the Foursquare API to explore the neighbourhood venues, and a machine learning algorithm will then be applied to cluster the neighbourhoods.
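The crime-ranking step described in Part 1 could be sketched roughly as follows. The records and the field names (`neighbourhood`, `offence`) are hypothetical stand-ins, not the actual columns of the Kaggle file; the idea is simply to count incidents per neighbourhood and sort in ascending order.

```python
from collections import Counter

# Hypothetical toy records; the real Kaggle columns may be named differently.
crimes = [
    {"neighbourhood": "A", "offence": "Assault"},
    {"neighbourhood": "A", "offence": "Robbery"},
    {"neighbourhood": "B", "offence": "Auto Theft"},
    {"neighbourhood": "C", "offence": "Assault"},
    {"neighbourhood": "C", "offence": "Break and Enter"},
    {"neighbourhood": "C", "offence": "Robbery"},
]

def rank_by_crime(records):
    """Return (neighbourhood, incident_count) pairs, fewest incidents first."""
    counts = Counter(r["neighbourhood"] for r in records)
    # Sort by count, then by name, so that ties are resolved deterministically.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))

ranking = rank_by_crime(crimes)
safest = ranking[0][0]
print(ranking)
```

Raw counts tend to favour small areas, so dividing by population would probably give a fairer comparison if population figures are available.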

## Importing all necessary libraries

In [99]:
from bs4 import BeautifulSoup # library for pulling data out of HTML
import requests # library to handle requests
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

# Read table with Wikipedia Link <a id="2"></a>

In [100]:
List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text

# Create table with the conditions described <a id="2"></a>

In [101]:
soup = BeautifulSoup(source, 'html.parser')  # parse the HTML text downloaded in the previous cell
table = soup.find('table')
trs = table.find_all('tr')
rows = []
for tr in trs:
    i = tr.find_all('td')
    if i:
        rows.append(i)
        
lst = []
for row in rows:
    postalcode = row[0].text.rstrip()
    borough = row[1].text.rstrip()
    neighborhood = row[2].text.rstrip()
    if borough != 'Not assigned':
        if neighborhood == 'Not assigned':
            neighborhood = borough
        lst.append([postalcode, borough, neighborhood])
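
The two filtering rules in the loop above can also be written as a small function, which makes them easy to check without downloading the page. This is only a sketch; the sample rows are hypothetical and merely imitate the trailing newlines found in scraped cell text.

```python
def clean_row(postalcode, borough, neighborhood):
    """Apply the two rules from the loop above to a single table row.

    Returns None when the borough is 'Not assigned' (the row is dropped);
    otherwise returns [postalcode, borough, neighborhood], with an
    unassigned neighborhood replaced by the borough name.
    """
    postalcode = postalcode.rstrip()
    borough = borough.rstrip()
    neighborhood = neighborhood.rstrip()
    if borough == "Not assigned":
        return None
    if neighborhood == "Not assigned":
        neighborhood = borough
    return [postalcode, borough, neighborhood]

# Hypothetical sample rows, not copied from the live page.
samples = [
    ("M1A\n", "Not assigned\n", "Not assigned\n"),
    ("M7A\n", "Queen's Park\n", "Not assigned\n"),
    ("M3A\n", "North York\n", "Parkwoods\n"),
]
cleaned = [row for row in (clean_row(*s) for s in samples) if row is not None]
print(cleaned)
```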

In [102]:
cols = ['Postcode', 'Borough', 'Neighborhood']
df = pd.DataFrame(lst, columns=cols)
print(df.shape)


(103, 3)


In [103]:
df = df.groupby(['Postcode', 'Borough'], as_index=False).agg(lambda neighborhoods: ', '.join(neighborhoods))
df

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
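
To make the effect of the `groupby` step easier to see, here is a minimal sketch on a toy frame. The three rows are hypothetical and chosen only so that one postcode appears twice.

```python
import pandas as pd

# Toy rows (hypothetical): one postcode listed on two separate rows.
toy = pd.DataFrame(
    [["M5A", "Downtown Toronto", "Harbourfront"],
     ["M5A", "Downtown Toronto", "Regent Park"],
     ["M4A", "North York", "Victoria Village"]],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Rows sharing a postcode and borough collapse into one row whose
# Neighborhood value is the comma-separated list of the originals.
combined = (
    toy.groupby(["Postcode", "Borough"], as_index=False)
       .agg({"Neighborhood": ", ".join})
)
print(combined)
```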


# Read Dataset with Latitude and Longitude<a id="2"></a>

In [128]:
geo = pd.read_csv("/Users/fvazquez/Coursera/Coursera_Capstone/Scripts/TorontoPostalCode.csv")  # coordinates per postal code, merged in the next cell


# Merge Datasets <a id="2"></a>

In [130]:
df_geo = pd.merge(df, geo, on="Postcode", how='left')

In [131]:
df_geo.dtypes

Postcode          object
Borough_x         object
Neighborhood      object
Unnamed: 0         int64
Borough_y         object
Neighbourhood     object
Latitude         float64
Longitude        float64
dtype: object
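
The `_x`/`_y` suffixes in the listing above appear because both tables carry a `Borough` column. One possible way to avoid them, sketched here on toy frames that reuse two rows of this notebook's data, is to pass only the key and the new columns from the right-hand table into the merge.

```python
import pandas as pd

# Toy frames mimicking the two tables being merged.
left = pd.DataFrame({"Postcode": ["M1B", "M1C"],
                     "Borough": ["Scarborough", "Scarborough"],
                     "Neighborhood": ["Malvern, Rouge", "Rouge Hill"]})
right = pd.DataFrame({"Postcode": ["M1B", "M1C"],
                      "Borough": ["Scarborough", "Scarborough"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

# Keep only the key and the new columns from the right-hand table, so no
# overlapping column names remain and pandas adds no _x/_y suffixes.
# validate="one_to_one" raises if a postcode is duplicated on either side.
merged = left.merge(right[["Postcode", "Latitude", "Longitude"]],
                    on="Postcode", how="left", validate="one_to_one")
print(merged.columns.tolist())
```

With this variant the later `drop` and `rename` calls would no longer be needed, although the notebook's own approach gives the same final table.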

# Complete Dataset <a id="2"></a>

In [133]:
# clean up the dataset by removing the duplicated columns introduced by the merge
df_geo.drop(['Unnamed: 0','Borough_y','Neighbourhood'], axis = 1, inplace = True)
# rename the remaining borough column so that it makes sense
df_geo.rename(columns = {'Borough_x':'Borough'}, inplace = True)
df_geo

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


# Use geopy library to get the latitude and longitude values of TORONTO.

In [134]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [137]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_geo['Latitude'], df_geo['Longitude'], df_geo['Borough'], df_geo['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

# Define Foursquare Credentials and Version

In [152]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (placeholder, never commit the real value)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare secret (placeholder, never commit the real value)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


# I decided to work with York Borough

In [145]:
york_data = df_geo[df_geo['Borough'] == 'York'].reset_index(drop=True)
york_data.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
1,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512
2,M6M,York,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.691116,-79.476013
3,M6N,York,"Runnymede, The Junction North",43.673185,-79.487262
4,M9N,York,Weston,43.706876,-79.518188


In [146]:
address = 'York, Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of York are 43.6896191, -79.479188.


In [147]:
# create map of York using latitude and longitude values
map_york = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(york_data['Latitude'], york_data['Longitude'], york_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_york)  
    
map_york

# Explore neighborhoods in York with Foursquare and segment them

In [153]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
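
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A more defensive parser might look like the sketch below. The helper name `extract_venues` and the hand-made payload are hypothetical; the payload mirrors only the keys that the function above already reads and is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an explore-style response into rows, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing when no category is present.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-made payload (hypothetical), used only to exercise the parser.
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Some Park", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]}]}}

rows = extract_venues(fake, "Weston", 43.706876, -79.518188)
empty = extract_venues({"response": {}}, "Weston", 43.706876, -79.518188)
print(rows)
```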

In [154]:

york_venues = getNearbyVenues(names=york_data['Neighborhood'],
                                   latitudes=york_data['Latitude'],
                                   longitudes=york_data['Longitude']
                                  )

Humewood-Cedarvale
Caledonia-Fairbanks
Del Ray, Mount Dennis, Keelsdale and Silverthorn
Runnymede, The Junction North
Weston


In [159]:
print(york_venues.shape)
york_venues

(20, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Park,43.692535,-79.428705,Field
1,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Ravine,43.690188,-79.426106,Trail
2,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Dog Park,43.692036,-79.429491,Dog Run
3,Humewood-Cedarvale,43.693781,-79.428191,Cedarvale Tennis Courts,43.692744,-79.432244,Tennis Court
4,Humewood-Cedarvale,43.693781,-79.428191,Phil White Arena,43.691303,-79.431761,Hockey Arena
5,Caledonia-Fairbanks,43.689026,-79.453512,Nairn Park,43.690654,-79.4563,Park
6,Caledonia-Fairbanks,43.689026,-79.453512,Maximum Woman,43.690651,-79.456333,Women's Store
7,Caledonia-Fairbanks,43.689026,-79.453512,Fairbanks Pool,43.691959,-79.448922,Pool
8,Caledonia-Fairbanks,43.689026,-79.453512,Fairbank Memorial Park,43.692028,-79.448924,Park
9,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.691116,-79.476013,Subway,43.690218,-79.47405,Sandwich Place


In [157]:
york_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Caledonia-Fairbanks,4,4,4,4,4,4
"Del Ray, Mount Dennis, Keelsdale and Silverthorn",5,5,5,5,5,5
Humewood-Cedarvale,5,5,5,5,5,5
"Runnymede, The Junction North",4,4,4,4,4,4
Weston,2,2,2,2,2,2


In [158]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 17 unique categories.


# Analyze each Neighborhood

In [161]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot

Unnamed: 0,Neighborhood,Bar,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Dog Run,Field,Hockey Arena,Park,Pool,Restaurant,Sandwich Place,Tennis Court,Trail,Turkish Restaurant,Women's Store
0,Humewood-Cedarvale,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,Humewood-Cedarvale,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Humewood-Cedarvale,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Humewood-Cedarvale,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,Humewood-Cedarvale,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
5,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
6,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
8,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [162]:
york_onehot.shape

(20, 18)

In [163]:
# Next, let's group rows by neighborhood and take the mean of each one-hot column, i.e. the frequency of occurrence of each category
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped

Unnamed: 0,Neighborhood,Bar,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Dog Run,Field,Hockey Arena,Park,Pool,Restaurant,Sandwich Place,Tennis Court,Trail,Turkish Restaurant,Women's Store
0,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.25,0.0,0.0,0.0,0.0,0.0,0.25
1,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0.2,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.2,0.0
2,Humewood-Cedarvale,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0
3,"Runnymede, The Junction North",0.0,0.25,0.25,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Weston,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
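
Taking the mean of a one-hot column within a neighborhood is the same as computing the share of that neighborhood's venues that fall into the category. The short check below uses the four Caledonia-Fairbanks categories from the `york_venues` listing shown earlier and should reproduce the corresponding row of `york_grouped`.

```python
from collections import Counter

# Venue categories for Caledonia-Fairbanks, as listed in york_venues.
categories = ["Park", "Women's Store", "Pool", "Park"]

counts = Counter(categories)
# Share of venues per category = count / total number of venues.
freq = {cat: n / len(categories) for cat, n in counts.items()}
print(freq)
```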


In [164]:
num_top_venues = 5

for hood in york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Caledonia-Fairbanks----
                venue  freq
0                Park  0.50
1       Women's Store  0.25
2                Pool  0.25
3  Turkish Restaurant  0.00
4               Trail  0.00


----Del Ray, Mount Dennis, Keelsdale and Silverthorn----
                venue  freq
0                 Bar   0.2
1  Turkish Restaurant   0.2
2      Discount Store   0.2
3      Sandwich Place   0.2
4          Restaurant   0.2


----Humewood-Cedarvale----
          venue  freq
0  Hockey Arena   0.2
1         Trail   0.2
2  Tennis Court   0.2
3       Dog Run   0.2
4         Field   0.2


----Runnymede, The Junction North----
               venue  freq
0            Brewery  0.25
1           Bus Line  0.25
2  Convenience Store  0.25
3     Breakfast Spot  0.25
4                Bar  0.00


----Weston----
                venue  freq
0                Park   1.0
1                 Bar   0.0
2  Turkish Restaurant   0.0
3               Trail   0.0
4        Tennis Court   0.0




In [167]:
# create the new dataframe and display the top 10 venues for each neighborhood.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix defined beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Caledonia-Fairbanks,Park,Women's Store,Pool,Dog Run,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Hockey Arena
1,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",Bar,Sandwich Place,Restaurant,Turkish Restaurant,Discount Store,Dog Run,Breakfast Spot,Brewery,Bus Line,Convenience Store
2,Humewood-Cedarvale,Hockey Arena,Dog Run,Trail,Tennis Court,Field,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store
3,"Runnymede, The Junction North",Breakfast Spot,Brewery,Bus Line,Convenience Store,Women's Store,Field,Discount Store,Dog Run,Hockey Arena,Turkish Restaurant
4,Weston,Park,Women's Store,Field,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Dog Run,Hockey Arena
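
Note that the table above pads each row to ten entries, so categories with a frequency of 0 can appear as a "most common venue" (for Weston, everything after Park). If that is not wanted, the ranking could be limited to categories that actually occur, as in this sketch. The helper name `top_categories` is hypothetical, and the dictionaries hold only a subset of the `york_grouped` columns.

```python
def top_categories(freqs, n):
    """Return up to n category names with non-zero frequency, highest first."""
    present = [(cat, f) for cat, f in freqs.items() if f > 0]
    # list.sort is stable, so equal frequencies keep their original order.
    present.sort(key=lambda cf: cf[1], reverse=True)
    return [cat for cat, _ in present[:n]]

# Frequencies for two neighbourhoods, copied from york_grouped (subset of columns).
weston = {"Bar": 0.0, "Park": 1.0, "Pool": 0.0, "Women's Store": 0.0}
caledonia = {"Bar": 0.0, "Park": 0.5, "Pool": 0.25, "Women's Store": 0.25}

print(top_categories(weston, 10))
print(top_categories(caledonia, 10))
```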


# Cluster Neighborhood

In [172]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop('Neighborhood', axis=1)  # keyword axis: the positional form is rejected by recent pandas versions

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 4, 1, 2, 0], dtype=int32)
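
The label array above contains five different values for five neighborhoods. A quick check of cluster sizes, sketched below with the printed labels, suggests that every neighborhood ends up alone in its own cluster, so with `kclusters = 5` the grouping adds little information. Trying a smaller value such as 2 or 3 may be worthwhile.

```python
from collections import Counter

labels = [3, 4, 1, 2, 0]  # kmeans.labels_ as printed above

sizes = Counter(labels)  # cluster label -> number of neighborhoods in it
all_singletons = all(n == 1 for n in sizes.values())
print(sizes, all_singletons)
```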

In [175]:
# add clustering labels (guarded so that re-running the cell does not insert the column twice)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = york_data

# join york_data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

york_merged # check the last columns!

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M6C,York,Humewood-Cedarvale,43.693781,-79.428191,1,Hockey Arena,Dog Run,Trail,Tennis Court,Field,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store
1,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512,3,Park,Women's Store,Pool,Dog Run,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Hockey Arena
2,M6M,York,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.691116,-79.476013,4,Bar,Sandwich Place,Restaurant,Turkish Restaurant,Discount Store,Dog Run,Breakfast Spot,Brewery,Bus Line,Convenience Store
3,M6N,York,"Runnymede, The Junction North",43.673185,-79.487262,2,Breakfast Spot,Brewery,Bus Line,Convenience Store,Women's Store,Field,Discount Store,Dog Run,Hockey Arena,Turkish Restaurant
4,M9N,York,Weston,43.706876,-79.518188,0,Park,Women's Store,Field,Breakfast Spot,Brewery,Bus Line,Convenience Store,Discount Store,Dog Run,Hockey Arena


In [181]:
# create map
import folium
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
#x = np.arange(kclusters)
#ys = [i + x + (i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=rainbow[cluster-1],
        fill=True,
        #fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
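
The colour scheme in the cell above is commented out, so all markers are drawn in the default colour. One way to give each cluster its own colour without extra dependencies is sketched below using the standard-library `colorsys` module. The helper name `cluster_colors` and the saturation and brightness values are arbitrary choices, not part of the original notebook.

```python
import colorsys

def cluster_colors(k):
    """Return k evenly spaced hex colours, one per cluster label 0..k-1."""
    colors = []
    for i in range(k):
        # Spread the hue evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        colors.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return colors

palette = cluster_colors(5)
print(palette)
```

The resulting list could then be passed to the markers as `color=palette[cluster]` and `fill_color=palette[cluster]`.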