# Data exploration, IBM's capstone project

## Introduction

Where is the ideal place to open a restaurant in Toronto? In this project I answer this question using three sources: the postal code dataset from 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', the population and average income of each neighbourhood from 'https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods', and venue information from the Foursquare API.
To answer it, some assumptions are made. The main one concerns average income: where it is high, a luxurious restaurant is most likely to succeed.


## The problem

Examining the data gives us a better idea of how to proceed. After collecting venue data from the Foursquare API, we can use the folium library to visualize, for example, how restaurants are distributed across the city, which helps anyone looking to open one choose a location.
By the end, we'll be able to determine the best location for opening both a luxurious restaurant and a simple one.

## Downloading libraries
Here I import all the libraries I'm going to use in this project.

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from pandas.io.html import read_html

In [2]:
print ('Hello Capstone Project Course!')

Hello Capstone Project Course!


# Getting the data
In the following steps I get the tables from the Wikipedia pages and turn them into pandas DataFrames.

In [3]:
#getting the table

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Transforming the table into a dataframe

table = read_html(url,  attrs={"class":"wikitable"})
Postal_codes = pd.DataFrame(data=table[0])
Postal_codes = Postal_codes.rename(columns={'Neighbourhood':'Neighborhood'})

In [4]:
url2 = 'https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods'

table2 = read_html(url2,  attrs={"class":"wikitable"})
population_df = pd.DataFrame(data=table2[0])

In [5]:
population_df = pd.DataFrame(data=(population_df[['Name','Population','Average Income']]))

In [6]:
population_df.shape

(175, 3)

In [7]:
population_df=population_df.rename(columns={"Name": "Neighborhood"})

In [8]:
#dropping all 'Not assigned' Boroughs

Postal_codes= Postal_codes.set_index('Borough')
Postal_codes = Postal_codes.drop('Not assigned', axis = 0)
Postal_codes = Postal_codes.reset_index(drop=False)
Postal_codes.shape

(210, 3)
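
The same filtering can also be written with a boolean mask. This is a minimal sketch on made-up sample rows shaped like the postal code table (the `sample` frame and its values are illustrative, not the real data). Unlike the `set_index`/`drop`/`reset_index` approach above, it leaves the column order unchanged.

```python
import pandas as pd

# Hypothetical sample rows shaped like the Wikipedia postal code table.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned.shape)
```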

### Merging both data sets

In [9]:
df = pd.merge(Postal_codes, population_df, on="Neighborhood",how='inner')
df.shape

(80, 5)
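
The inner join keeps only neighborhoods whose names match exactly in both tables, which is why the row count drops from 210 to 80. One way to see which rows fail to match is an outer merge with `indicator=True`. The sketch below uses small illustrative frames; the two "Hypothetical" rows and their values are invented for the example.

```python
import pandas as pd

# Illustrative frames; the "Hypothetical" rows are made up for this example.
postal = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M9Z"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical A"],
})
population = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical B"],
    "Population": [26533, 17047, 1000],
})

# An outer merge keeps every row; the _merge column records where each came from.
check = pd.merge(postal, population, on="Neighborhood", how="outer", indicator=True)

# Rows present in only one table are the ones an inner merge would drop.
unmatched = check[check["_merge"] != "both"]
matched = check[check["_merge"] == "both"].drop(columns="_merge")
print(unmatched[["Neighborhood", "_merge"]])
```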

# Data normalization
Here I normalize the dataset before beginning to use it.

In [10]:
# Combining rows with the same postal code
#df.set_index(['Postcode','Borough'],inplace=True)
#df = df.groupby(level=['Postcode','Borough'], sort=False).agg( ', '.join)
#df = df.reset_index(drop=False)
#df.head()
df.shape

(80, 5)

In [11]:
# Replacing every 'Not assigned' neighborhood with its borough

for index, row in df.iterrows():
    if df.loc[index, 'Neighborhood'] =='Not assigned':
        df.loc [index,'Neighborhood'] = df.loc[index,'Borough']
df.head()

Unnamed: 0,Borough,Postcode,Neighborhood,Population,Average Income
0,North York,M3A,Parkwoods,26533,34811
1,North York,M4A,Victoria Village,17047,29657
2,North York,M6A,Lawrence Heights,3769,29867
3,North York,M6A,Lawrence Manor,13750,36361
4,Scarborough,M1B,Rouge,22724,29230
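
The row-by-row loop above can also be expressed as a single vectorized assignment. This is a minimal sketch on a small sample frame (the "Queen's Park" row is an illustrative case of an unassigned neighborhood):

```python
import pandas as pd

# Small illustrative frame with one unassigned neighborhood.
sample = pd.DataFrame({
    "Borough": ["North York", "Queen's Park", "Scarborough"],
    "Neighborhood": ["Parkwoods", "Not assigned", "Rouge"],
})

# Select the unassigned rows and copy the borough name into them in one step.
mask = sample["Neighborhood"] == "Not assigned"
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
print(sample)
```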


In [12]:
df.shape

(80, 5)

# Getting the geodata


Here I load the geospatial coordinates file and rename its postal code column so it matches the first dataframe; then I can use the merge function to combine them.

In [13]:

import types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_988167ed0bdd415c8d4dea1d1cc0335e = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='SECRET',
    ibm_auth_endpoint="SECRET",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_988167ed0bdd415c8d4dea1d1cc0335e.get_object(Bucket='finalprojectcapstone-donotdelete-pr-umou4ypnnndznf',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df2 = pd.read_csv(body)
df2.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
df2 = df2.rename(columns={'Postal Code':'Postcode'})

In [15]:
df3 = pd.merge(df, df2, on="Postcode")
df3.head()

Unnamed: 0,Borough,Postcode,Neighborhood,Population,Average Income,Latitude,Longitude
0,North York,M3A,Parkwoods,26533,34811,43.753259,-79.329656
1,North York,M4A,Victoria Village,17047,29657,43.725882,-79.315572
2,North York,M6A,Lawrence Heights,3769,29867,43.718518,-79.464763
3,North York,M6A,Lawrence Manor,13750,36361,43.718518,-79.464763
4,Scarborough,M1B,Rouge,22724,29230,43.806686,-79.194353


In [16]:
df3.shape

(80, 7)
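
As an alternative to renaming the column first, `pd.merge` can join on differently named keys, and its `validate` argument can confirm that each postal code appears only once in the coordinates table. The sketch below uses a few rows taken from the outputs above; the variable names are illustrative.

```python
import pandas as pd

# A few rows from the tables above; two neighborhoods share postal code M6A.
neighborhoods = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M1B"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Rouge"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M6A"],
    "Latitude": [43.806686, 43.718518],
    "Longitude": [-79.194353, -79.464763],
})

# validate="many_to_one" raises an error if a postal code is duplicated in coords.
merged = pd.merge(
    neighborhoods,
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="many_to_one",
).drop(columns="Postal Code")

# With a left join, rows with no coordinates show up as NaN instead of vanishing.
missing = int(merged["Latitude"].isna().sum())
print(merged.shape, missing)
```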

# Creating a map to visualize the data points
Here I'll create a map of Toronto so we can visualize the dataset.

In [20]:
!conda install -c conda-forge folium=0.5.0 --yes


In [21]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors


In [22]:
from geopy.geocoders import Nominatim

In [23]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [24]:
# creating a map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df3['Latitude'], df3['Longitude'], df3['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

This map shows the neighborhood points in Toronto.

In [25]:
from folium.plugins import HeatMap

### Here I create a heat map showing where income is highest.

In [26]:

# creating a heatmap of Toronto using latitude and longitude values
base_heatmap = folium.Map(location=[latitude, longitude], zoom_start=10)
# Adding a marker at the center of Toronto
folium.Marker([latitude, longitude], popup='Toronto').add_to(base_heatmap)
HeatMap(data=df3[['Latitude', 'Longitude', 'Average Income']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist(), radius=15, max_zoom=4).add_to(base_heatmap)
# To show the map
base_heatmap
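
`HeatMap` takes a list of `[latitude, longitude, weight]` entries. The sketch below shows one way to prepare that list with weights scaled to the 0 to 1 range; the helper name `heatmap_points` and the scaling step are my assumptions, not part of the original cell. Summing follows the cell above, but note that neighborhoods sharing a postal code also share coordinates, so their incomes are added together; using the mean instead may be worth considering.

```python
import pandas as pd

def heatmap_points(frame, value_col):
    """Return [lat, lng, weight] rows with weights scaled to the range 0..1."""
    # Combine rows that sit on the same coordinates, as the cell above does.
    grouped = frame.groupby(["Latitude", "Longitude"], as_index=False)[value_col].sum()
    # Scale by the largest value so the strongest point has weight 1.0.
    grouped["weight"] = grouped[value_col] / grouped[value_col].max()
    return grouped[["Latitude", "Longitude", "weight"]].values.tolist()

# Three rows from the merged table above; the first two share postal code M6A.
sample = pd.DataFrame({
    "Latitude": [43.718518, 43.718518, 43.806686],
    "Longitude": [-79.464763, -79.464763, -79.194353],
    "Average Income": [29867, 36361, 29230],
})
points = heatmap_points(sample, "Average Income")
print(points)
```

The resulting list could then be passed to folium, for example `HeatMap(points, radius=15).add_to(base_heatmap)`.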

In [27]:
df3.shape

(80, 7)

# Getting nearby Venues

In [28]:
LIMIT = 500 # limit of number of venues returned by Foursquare API
radius = 1500 # search radius in meters (the getNearbyVenues call below does not pass it, so its default of 500 is used)


CLIENT_ID = 'SECRET' # Foursquare ID
CLIENT_SECRET = 'SECRET' # Foursquare Secret
VERSION = 'SECRET' # Foursquare API version

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
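
The function above assumes every response contains at least one group and that every venue has at least one category; otherwise the indexing raises an error. Below is a hedged sketch of a more defensive parsing step. The helper `parse_explore_items` is hypothetical, and the JSON shape is assumed from the keys used in `getNearbyVenues`, not taken from the Foursquare documentation.

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one explore response into rows, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to an empty category when the list is missing or empty.
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows

# Fake payload: the first venue comes from the output below, the second is made up.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Hypothetical Venue",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}

rows = parse_explore_items("Parkwoods", 43.753259, -79.329656, fake_payload)
print(rows)
```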

In [30]:
toronto_venues = getNearbyVenues(names=df3['Neighborhood'],
                                   latitudes=df3['Latitude'],
                                   longitudes=df3['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Heights
Lawrence Manor
Rouge
Malvern
Garden District
Princess Gardens
West Deane Park
Highland Creek
Rouge Hill
Port Union
Flemingdon Park
St. James Town
St. James Town
Cabbagetown
Eringate
Markland Wood
Guildwood
Morningside
West Hill
The Beaches
Woburn
Leaside
Bathurst Manor
Wilson Heights
Thorncliffe Park
Scarborough Village
Henry Farm
Toronto Islands
Little Portugal
Ionview
Bayview Village
Riverdale
Brockton
Clairlea
Oakridge
York Mills
Downsview
Humber Summit
Cliffcrest
Cliffside
Newtonbrook
Willowdale
Bedford Park
Mount Dennis
Silverthorn
Humberlea
Birch Cliff
Lawrence Park
Runnymede
Runnymede
Swansea
Weston
Dorset Park
Westmount
Maryvale
Wexford
The Annex
Yorkville
Parkdale
Roncesvalles
Kingsview Village
Agincourt
Davisville
Moore Park
Grange Park
Kensington Market
Milliken
Deer Park
South Hill
Humber Bay Shores
New Toronto
Thistletown
Rosedale
Alderwood
Long Branch
The Kingsway
Church and Wellesley
Sunnylea


In [31]:
toronto_venues.head(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park


In [32]:
toronto_restaurants  = pd.DataFrame(columns=['Restaurants'])
toronto_restaurants  = toronto_restaurants.fillna(0) # with 0s rather than NaNs
toronto_restaurants

Unnamed: 0,Restaurants


In [33]:
#for index, row in df.iterrows():
#    if 'Restaurant' in toronto_venues.loc[index, 'Venue Category'] == False:
 #       toronto_restaurants.loc[index] = toronto_venues.loc[index, 'Venue Category']

    #toronto_venues.Venue Category= toronto_venues.Venue Category.apply(lambda x: 'ball sport' if 'ball' in x else x)
    
toronto_restaurants = toronto_venues[toronto_venues['Venue Category'].str.contains(r'Restaurant')]
toronto_restaurants.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
9,Lawrence Heights,43.718518,-79.464763,Lac Vien Vietnamese Restaurant,43.721259,-79.468472,Vietnamese Restaurant
21,Lawrence Manor,43.718518,-79.464763,Lac Vien Vietnamese Restaurant,43.721259,-79.468472,Vietnamese Restaurant
31,Rouge,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
33,Malvern,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant


In [34]:
toronto_restaurants.shape

(339, 7)

In [50]:
number_of_restaurants=toronto_restaurants['Neighborhood'].value_counts()
number_of_restaurants = pd.DataFrame(data=number_of_restaurants)

In [51]:
number_of_restaurants.reset_index(inplace=True)
number_of_restaurants = number_of_restaurants.rename(columns={"index": "Neighborhood", "Neighborhood": "count"})
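
The rename above relies on the column names that `value_counts` produced in the pandas version used for this notebook. A sketch that avoids that dependency is to count with `groupby(...).size()` and name the result column directly. The sample rows below are illustrative only.

```python
import pandas as pd

# Illustrative venue rows; only the category text matters for the filter.
venues = pd.DataFrame({
    "Neighborhood": ["Rouge", "Malvern", "Riverdale", "Riverdale"],
    "Venue Category": ["Fast Food Restaurant", "Fast Food Restaurant",
                       "Greek Restaurant", "Park"],
})

# Keep restaurant categories; na=False treats missing categories as non-matches.
restaurants = venues[venues["Venue Category"].str.contains("Restaurant", na=False)]

# Count rows per neighborhood and name the result column explicitly.
counts = restaurants.groupby("Neighborhood").size().reset_index(name="count")
print(counts)
```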


In [86]:
df4 = df3
df4 = pd.merge(df3, number_of_restaurants, on="Neighborhood",how='inner')

## Heat map that shows the number of restaurants

In [87]:
# creating a heatmap of Toronto using latitude and longitude values
base_heatmap3 = folium.Map(location=[latitude, longitude], zoom_start=10)
# Adding a marker at the center of Toronto
folium.Marker([latitude, longitude], popup='Toronto').add_to(base_heatmap3)
HeatMap(data=df4[['Latitude', 'Longitude','count']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist(), radius=15, max_zoom=4).add_to(base_heatmap3)
# To show the map
base_heatmap3

## The results

In [99]:
df4 = df4.sort_values(by=['Average Income','count'],ascending=False)
df4.head(10)

Unnamed: 0,Borough,Postcode,Neighborhood,Population,Average Income,Latitude,Longitude,count
44,Central Toronto,M4T,Moore Park,4474,154825,43.689574,-79.38316,1
48,Central Toronto,M4V,South Hill,6218,120453,43.686412,-79.400049,4
39,Central Toronto,M5R,Yorkville,6045,105239,43.67271,-79.405678,4
14,East York,M4G,Leaside,13876,82670,43.70906,-79.363452,4
28,North York,M5M,Bedford Park,13749,80827,43.733283,-79.41975,11
47,Central Toronto,M4V,Deer Park,15165,80704,43.686412,-79.400049,4
38,Central Toronto,M5R,The Annex,15602,63636,43.67271,-79.405678,4
33,West Toronto,M6S,Swansea,11133,58681,43.651571,-79.48445,10
18,North York,M2J,Henry Farm,2790,56395,43.778517,-79.346556,12
43,Central Toronto,M4S,Davisville,23727,55735,43.704324,-79.38879,10


As the table above shows, the best place to open a luxurious restaurant would be Moore Park: it has the highest average income (154,825) and, with only one restaurant found nearby, the least competition among the ten highest-income neighborhoods.
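
Because the goal is high income together with low competition, the two columns can be sorted in opposite directions. This is a minimal sketch using four rows from the table above; the income cutoff of 100,000 is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Four rows copied from the results table above.
sample = pd.DataFrame({
    "Neighborhood": ["Moore Park", "South Hill", "Yorkville", "Bedford Park"],
    "Average Income": [154825, 120453, 105239, 80827],
    "count": [1, 4, 4, 11],
})

# Hypothetical cutoff for "high income"; adjust it to taste.
affluent = sample[sample["Average Income"] >= 100000]

# Fewest restaurants first, then highest income among equal counts.
ranked = affluent.sort_values(by=["count", "Average Income"], ascending=[True, False])
print(ranked)
```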

In [101]:
df4 = df4.sort_values(by=['Population','count'],ascending=False)
df4.head(10)

Unnamed: 0,Borough,Postcode,Neighborhood,Population,Average Income,Latitude,Longitude,count
13,Scarborough,M1G,Woburn,48507,26190,43.770992,-79.216917,1
42,Scarborough,M1S,Agincourt,44577,25750,43.7942,-79.262029,1
4,Scarborough,M1B,Malvern,44324,25677,43.806686,-79.194353,1
23,East Toronto,M4K,Riverdale,31007,40139,43.679557,-79.352188,17
40,West Toronto,M6R,Parkdale,28367,26314,43.64896,-79.456325,4
12,Scarborough,M1E,West Hill,25632,27936,43.763573,-79.188711,1
43,Central Toronto,M4S,Davisville,23727,55735,43.704324,-79.38879,10
3,Scarborough,M1B,Rouge,22724,29230,43.806686,-79.194353,1
6,North York,M3C,Flemingdon Park,21287,23471,43.7259,-79.340923,8
29,York,M6M,Mount Dennis,21284,23910,43.691116,-79.476013,1


Here we see that Scarborough would be the best place to open a simple, low-cost restaurant: the three most populous neighborhoods in the table (Woburn, Agincourt and Malvern) are all in Scarborough, and only one restaurant was found near each of them.
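
To compare boroughs and not just individual neighborhoods, population and restaurant counts can be aggregated per borough. The sketch below uses only the first five rows of the table above, so its numbers are not borough totals; the `people_per_restaurant` ratio is my own illustrative measure. Neighborhoods that share a postal code may also count the same venues twice.

```python
import pandas as pd

# First five rows of the population-sorted table above.
top_rows = pd.DataFrame({
    "Borough": ["Scarborough", "Scarborough", "Scarborough",
                "East Toronto", "West Toronto"],
    "Neighborhood": ["Woburn", "Agincourt", "Malvern", "Riverdale", "Parkdale"],
    "Population": [48507, 44577, 44324, 31007, 28367],
    "count": [1, 1, 1, 17, 4],
})

# Sum population and restaurant counts within each borough.
by_borough = top_rows.groupby("Borough", as_index=False).agg(
    population=("Population", "sum"),
    restaurants=("count", "sum"),
)

# Illustrative measure: residents per restaurant, highest first.
by_borough["people_per_restaurant"] = by_borough["population"] / by_borough["restaurants"]
by_borough = by_borough.sort_values("people_per_restaurant", ascending=False)
print(by_borough)
```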