<a href="https://colab.research.google.com/github/madiltalay/IBM-Data-Science-Capstone/blob/master/IBM_Data_Science_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background
Mr. Arif, a novice businessperson, plans to open a restaurant in a health-conscious neighborhood of Toronto, close to the most visited venues so as to attract more customers. He also wants to avoid competition, which in his opinion can be achieved either by keeping a good distance from other restaurants or by choosing a type of restaurant that is popular in other neighborhoods but absent from the chosen one.

This project aims to help him find the most suitable location and type of restaurant for the business.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Importing required libraries

In [0]:
import pandas as pd
import folium
import numpy as np
import requests
import geopy.distance
import matplotlib.cm as cm
import matplotlib.colors as colors

In [3]:
%cd gdrive/My\ Drive/Google\ Colab/Coursera/IBM\ DS\ Capstone

/content/gdrive/My Drive/Google Colab/Coursera/IBM DS Capstone


## First Step - Shortlisting the health-conscious neighborhoods
We take neighborhood-level figures from the Toronto wellbeing data and use the Healthy Food Index (HFI) as a measure of how health-conscious each neighborhood is.

In [4]:
toronto_wellbeing = pd.read_csv('wellbeing_toronto.csv')
toronto_wellbeing.head()

Unnamed: 0,Neighbourhood,Total Population,Healthy Food Index
0,West Humber-Clairville,33312,23.82
1,Mount Olive-Silverstone-Jamestown,32954,37.57
2,Thistletown-Beaumond Heights,10360,42.26
3,Rexdale-Kipling,10529,23.31
4,Elms-Old Rexdale,9456,24.71


After loading the data, we sort it according to the HFI in descending order to get the most health conscious neighborhoods at the top.

In [5]:
toronto_wellbeing_sorted = toronto_wellbeing.sort_values('Healthy Food Index', ascending=False)
toronto_wellbeing_sorted.shape

(140, 3)

From the sorted wellbeing data, we select only those neighborhoods with HFI>45

In [9]:
toronto_wellbeing_filtered = toronto_wellbeing_sorted[toronto_wellbeing_sorted['Healthy Food Index']>45]
toronto_wellbeing_filtered

Unnamed: 0,Neighbourhood,Total Population,Healthy Food Index
108,Caledonia-Fairbank,9955,53.48
90,Weston-Pellam Park,11098,52.57
104,Lawrence Park North,14607,52.03
98,Mount Pleasant East,16775,50.4
47,Hillcrest Village,16934,48.46
38,Bedford Park-Nortown,23236,47.74
67,North Riverdale,11916,47.17
124,Ionview,13641,47.14
26,York University Heights,27593,47.0
86,High Park-Swansea,23925,46.53


After filtering, we get 14 neighborhoods with HFI>45

In [10]:
toronto_wellbeing_filtered.shape

(14, 3)

Let's look at the names of these neighborhoods

In [11]:
neighborhoods = toronto_wellbeing_filtered['Neighbourhood'].to_list()
neighborhoods

['Caledonia-Fairbank',
 'Weston-Pellam Park',
 'Lawrence Park North',
 'Mount Pleasant East',
 'Hillcrest Village',
 'Bedford Park-Nortown',
 'North Riverdale',
 'Ionview',
 'York University Heights',
 'High Park-Swansea',
 'Edenbridge-Humber Valley',
 'South Riverdale',
 'Palmerston-Little Italy',
 'Runnymede-Bloor West Village']

We tried the geocoder library, but it was too slow, so we looked up the coordinates of these neighborhoods on Google instead and entered them by hand into the dataframe 'lon_lat'.

Note: North Riverdale and South Riverdale have the same geocoordinates, so the 14 neighborhoods correspond to only 13 distinct locations. The duplicate has not been removed from the data that follows.

In [12]:
lon_lat = pd.DataFrame(columns=['Neighborhood', 'Latitude', 'Longitude'],
                       data = [['Caledonia-Fairbank', 43.6899, -79.4552],
                               ['Weston-Pellam Park', 43.6716, -79.4577],
                               ['Lawrence Park North', 43.7238, -79.3886],
                               ['Mount Pleasant East', 43.7051, -79.3848],
                               ['Hillcrest Village', 43.8049, -79.3547],
                               ['Bedford Park-Nortown', 43.7303, -79.4114],
                               ['North Riverdale', 43.6698, -79.3554],
                               ['Ionview', 43.7308, -79.2739],
                               ['York University Heights', 43.7664, -79.4774],
                               ['High Park-Swansea', 43.6536, -79.4653],
                               ['Edenbridge-Humber Valley', 43.6671, -79.5280],
                               ['South Riverdale', 43.6698, -79.3554],
                               ['Palmerston-Little Italy', 43.6600, -79.4175],
                               ['Runnymede-Bloor West Village', 43.6593, -79.4838],
                               ])
lon_lat

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Caledonia-Fairbank,43.6899,-79.4552
1,Weston-Pellam Park,43.6716,-79.4577
2,Lawrence Park North,43.7238,-79.3886
3,Mount Pleasant East,43.7051,-79.3848
4,Hillcrest Village,43.8049,-79.3547
5,Bedford Park-Nortown,43.7303,-79.4114
6,North Riverdale,43.6698,-79.3554
7,Ionview,43.7308,-79.2739
8,York University Heights,43.7664,-79.4774
9,High Park-Swansea,43.6536,-79.4653
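The duplicate coordinates mentioned in the note can also be found programmatically. The following is a minimal sketch, assuming a small stand-in frame (`sample`, a hypothetical name) with the same columns as `lon_lat`; it flags every row whose coordinates also appear in another row and counts the distinct locations.

```python
import pandas as pd

# Stand-in for lon_lat: three of the hand-entered rows (sample is a hypothetical name)
sample = pd.DataFrame(
    columns=['Neighborhood', 'Latitude', 'Longitude'],
    data=[['North Riverdale', 43.6698, -79.3554],
          ['Ionview', 43.7308, -79.2739],
          ['South Riverdale', 43.6698, -79.3554]])

# keep=False marks every member of a duplicated coordinate pair, not just the repeats
same_spot = sample[sample.duplicated(subset=['Latitude', 'Longitude'], keep=False)]
shared_names = same_spot['Neighborhood'].tolist()

# Number of distinct locations once identical coordinates are collapsed
n_locations = len(sample.drop_duplicates(subset=['Latitude', 'Longitude']))
print(shared_names, n_locations)
```

Run against the full `lon_lat` frame, the same two calls should report the Riverdale pair and 13 distinct locations.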


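Now we combine the two dataframes into a new one. Because 'lon_lat' was entered in the same order as the filtered wellbeing data, the population and HFI columns can be attached by position.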

In [13]:
new_data = lon_lat.copy()
new_data = new_data.assign(Total_Population=toronto_wellbeing_filtered['Total Population'].tolist())
new_data = new_data.assign(Healthy_Food_Index=toronto_wellbeing_filtered['Healthy Food Index'].tolist())
new_data

Unnamed: 0,Neighborhood,Latitude,Longitude,Total_Population,Healthy_Food_Index
0,Caledonia-Fairbank,43.6899,-79.4552,9955,53.48
1,Weston-Pellam Park,43.6716,-79.4577,11098,52.57
2,Lawrence Park North,43.7238,-79.3886,14607,52.03
3,Mount Pleasant East,43.7051,-79.3848,16775,50.4
4,Hillcrest Village,43.8049,-79.3547,16934,48.46
5,Bedford Park-Nortown,43.7303,-79.4114,23236,47.74
6,North Riverdale,43.6698,-79.3554,11916,47.17
7,Ionview,43.7308,-79.2739,13641,47.14
8,York University Heights,43.7664,-79.4774,27593,47.0
9,High Park-Swansea,43.6536,-79.4653,23925,46.53
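Attaching columns by position only works while both frames stay in the same row order. As a hedged alternative, the sketch below joins on the neighborhood name instead. The frames `coords` and `wellbeing` are hypothetical stand-ins for `lon_lat` and `toronto_wellbeing_filtered`, deliberately listed in different orders.

```python
import pandas as pd

# Stand-ins for lon_lat and toronto_wellbeing_filtered, in different row order
coords = pd.DataFrame({
    'Neighborhood': ['Weston-Pellam Park', 'Caledonia-Fairbank'],
    'Latitude': [43.6716, 43.6899],
    'Longitude': [-79.4577, -79.4552]})
wellbeing = pd.DataFrame({
    'Neighbourhood': ['Caledonia-Fairbank', 'Weston-Pellam Park'],
    'Total Population': [9955, 11098],
    'Healthy Food Index': [53.48, 52.57]})

# Join on the neighborhood name; the key column is spelled differently in each frame.
# validate='one_to_one' raises if a name occurs more than once on either side.
merged = coords.merge(wellbeing, left_on='Neighborhood', right_on='Neighbourhood',
                      how='left', validate='one_to_one').drop(columns='Neighbourhood')
print(merged)
```

A key-based join also surfaces spelling mismatches between the two sources, since unmatched names show up as missing values rather than silently misaligned rows.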


Visualizing the neighborhoods on the map

In [14]:
latitude = 43.6529
longitude = -79.3849

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(new_data['Latitude'], new_data['Longitude'], new_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Here we notice that, of the 13 distinct locations, four are fairly isolated.
In contrast, the remaining nine form two clusters of neighborhoods that lie close together.

In the first cluster, there are three neighborhoods:
1. Bedford Park-Nortown
2. Lawrence Park North
3. Mount Pleasant East

In the second cluster, there are six neighborhoods:
1. Caledonia-Fairbank
2. Weston-Pellam Park
3. High Park-Swansea
4. Runnymede-Bloor West Village
5. Palmerston-Little Italy
6. Edenbridge-Humber Valley

We measure the distance between the neighborhoods in the same cluster to get an idea of how far they are.

In [63]:
Bedford_Park = (43.7303, -79.4114)
Lawrence_Park = (43.7238, -79.3886)
Mount_Pleasant = (43.7051, -79.3848)

print('Distance between Lawrence_Park and Bedford_Park: {} km'.format(geopy.distance.geodesic(Lawrence_Park, Bedford_Park).km))
print('Distance between Lawrence_Park and Mount_Pleasant: {} km'.format(geopy.distance.geodesic(Lawrence_Park, Mount_Pleasant).km))
print('Distance between Bedford_Park and Mount_Pleasant: {} km'.format(geopy.distance.geodesic(Bedford_Park, Mount_Pleasant).km))

Distance between Lawrence_Park and Bedford_Park: 1.9739222619052499 km
Distance between Lawrence_Park and Mount_Pleasant: 2.1001426333907793 km
Distance between Bedford_Park and Mount_Pleasant: 3.5262293329838568 km


We see that Lawrence Park lies roughly midway between Bedford Park and Mount Pleasant.

Let's look at the other cluster.

For brevity, we only measure the distance from the Weston_Pellam neighborhood to others in the cluster

In [64]:
Weston_Pellam = (43.6716, -79.4577)
High_Park_Swansea = (43.6536, -79.4653)
Caledonia_Fairbank = (43.6899, -79.4552)
Runnymede_Bloor = (43.6593, -79.4838)
Palmerston_Little = (43.6600, -79.4175)
Edenbridge_Humber = (43.6671, -79.5280)

print('Distance between Weston_Pellam and High_Park_Swansea: {} km'.format(geopy.distance.geodesic(Weston_Pellam, High_Park_Swansea).km))
print('Distance between Weston_Pellam and Caledonia_Fairbank: {} km'.format(geopy.distance.geodesic(Weston_Pellam, Caledonia_Fairbank).km))
print('Distance between Weston_Pellam and Runnymede_Bloor: {} km'.format(geopy.distance.geodesic(Weston_Pellam, Runnymede_Bloor).km))
print('Distance between Weston_Pellam and Palmerston_Little: {} km'.format(geopy.distance.geodesic(Weston_Pellam, Palmerston_Little).km))
print('Distance between Weston_Pellam and Edenbridge_Humber: {} km'.format(geopy.distance.geodesic(Weston_Pellam, Edenbridge_Humber).km))

Distance between Weston_Pellam and High_Park_Swansea: 2.0917431665893904 km
Distance between Weston_Pellam and Caledonia_Fairbank: 2.043208846976358 km
Distance between Weston_Pellam and Runnymede_Bloor: 2.50979640145787 km
Distance between Weston_Pellam and Palmerston_Little: 3.4890960190633016 km
Distance between Weston_Pellam and Edenbridge_Humber: 5.691716419725631 km


We notice that, apart from Edenbridge-Humber Valley, every neighborhood in either cluster lies within 4 km of its cluster's central neighborhood (Lawrence Park North and Weston-Pellam Park, respectively). This may help with later decisions.
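The 4 km check can be written as a loop rather than one print statement per pair. The sketch below is an illustration under a stated assumption: it uses the haversine formula on a spherical Earth instead of the `geopy.distance.geodesic` call used in the cells above, so its results differ slightly from the ellipsoidal distances printed there (by a few metres at this scale). The names `haversine_km`, `center`, and `others` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b, earth_radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(h))

center = (43.6716, -79.4577)  # Weston-Pellam Park
others = {
    'High Park-Swansea': (43.6536, -79.4653),
    'Caledonia-Fairbank': (43.6899, -79.4552),
    'Runnymede-Bloor West Village': (43.6593, -79.4838),
    'Palmerston-Little Italy': (43.6600, -79.4175),
    'Edenbridge-Humber Valley': (43.6671, -79.5280),
}

# Spherical approximation of each distance from the cluster's central neighborhood
distances = {name: haversine_km(center, point) for name, point in others.items()}
within_4km = sorted(name for name, d in distances.items() if d <= 4)

for name, d in sorted(distances.items(), key=lambda item: item[1]):
    print('{}: {:.2f} km'.format(name, d))
```

Keeping the coordinates in a dictionary makes it easy to repeat the check for the other cluster by swapping `center` and `others`.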

Using Foursquare API

In [18]:
import os

# Read the credentials from environment variables instead of hard-coding them
# (the variable names below are an assumption; use whatever names you have set)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 40
print('Foursquare credentials loaded.')

Foursquare credentials loaded.


Because adjacent neighborhoods in the two clusters lie about 2 km apart, we search for venues within a 1 km radius of each neighborhood's coordinates.

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
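The function above indexes straight into the JSON response, so a reply with no groups or a venue with no category would raise an exception. The sketch below shows one possible more defensive variant of the same two steps: building the query parameters and turning a decoded response into rows. The helper names `build_explore_params` and `extract_venues` are hypothetical, and `fake_payload` is a hand-written structure in the shape the function above expects, not real API output.

```python
def build_explore_params(lat, lng, radius, limit, client_id, client_secret, version):
    """Query parameters for the explore call, suitable for requests.get(url, params=...)."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }

def extract_venues(payload, name, lat, lng):
    """Turn one decoded explore response into rows for the venues dataframe."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # A venue without a category gets None instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written payload; the second venue is invented to exercise the missing-category path
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Pho Xua',
               'location': {'lat': 43.674056, 'lng': -79.461112},
               'categories': [{'name': 'Vietnamese Restaurant'}]}},
    {'venue': {'name': 'Uncategorised Venue',
               'location': {'lat': 43.67, 'lng': -79.46},
               'categories': []}},
]}]}}

params = build_explore_params(43.6716, -79.4577, 1000, 40, 'ID', 'SECRET', '20180605')
rows = extract_venues(fake_payload, 'Weston-Pellam Park', 43.6716, -79.4577)
empty = extract_venues({'response': {}}, 'Weston-Pellam Park', 43.6716, -79.4577)
print(rows)
```

Passing the parameters as a dictionary lets `requests` handle URL encoding, and separating the parsing step makes it testable without a network call.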

In [36]:
toronto_venues = getNearbyVenues(names=new_data['Neighborhood'],
                                   latitudes=new_data['Latitude'],
                                   longitudes=new_data['Longitude']
                                  )

Caledonia-Fairbank
Weston-Pellam Park
Lawrence Park North
Mount Pleasant East
Hillcrest Village
Bedford Park-Nortown
North Riverdale
Ionview
York University Heights
High Park-Swansea
Edenbridge-Humber Valley
South Riverdale
Palmerston-Little Italy
Runnymede-Bloor West Village


In [37]:
toronto_venues.columns

Index(['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
       'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'],
      dtype='object')

The API search returned 430 venues across the 14 neighborhood queries (13 distinct locations).

In [38]:
toronto_venues.shape

(430, 7)

In [39]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bedford Park-Nortown,40,40,40,40,40,40
Caledonia-Fairbank,27,27,27,27,27,27
Edenbridge-Humber Valley,21,21,21,21,21,21
High Park-Swansea,40,40,40,40,40,40
Hillcrest Village,19,19,19,19,19,19
Ionview,36,36,36,36,36,36
Lawrence Park North,13,13,13,13,13,13
Mount Pleasant East,40,40,40,40,40,40
North Riverdale,40,40,40,40,40,40
Palmerston-Little Italy,40,40,40,40,40,40


Let's count the unique categories to which the venues belong.

In [40]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 127 unique categories.


In [0]:
unique_categories = toronto_venues['Venue Category'].unique()

Let's check which types of restaurants appear among the unique categories.

In [0]:
restaurant_types = [i for i in unique_categories if 'Restaurant' in i]

In [43]:
restaurant_types

['Japanese Restaurant',
 'Falafel Restaurant',
 'Mexican Restaurant',
 'Fast Food Restaurant',
 'Portuguese Restaurant',
 'Brazilian Restaurant',
 'Seafood Restaurant',
 'Vietnamese Restaurant',
 'Restaurant',
 'Italian Restaurant',
 'Thai Restaurant',
 'Indian Restaurant',
 'Sushi Restaurant',
 'Asian Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Ramen Restaurant',
 'Cantonese Restaurant',
 'Chinese Restaurant',
 'Middle Eastern Restaurant',
 'American Restaurant',
 'French Restaurant',
 'Tapas Restaurant',
 'Greek Restaurant',
 'Cuban Restaurant',
 'Turkish Restaurant',
 'Mediterranean Restaurant',
 'Korean Restaurant',
 'South American Restaurant']

We group the restaurants into four main categories:
1. Far-Eastern: Asian, Vietnamese, Sushi, Thai, Chinese, Korean, Japanese, Ramen, Cantonese, Seafood, Vegan
2. Eastern: Falafel, Middle-Eastern, Indian, Turkish
3. American: Brazilian, New American, American, Mexican, Latin American, South American, Fast Food, Cuban
4. European: Portuguese, Italian, Greek, Tapas, Mediterranean, French

After discussing this with our client, we found that he is most interested in opening an Eastern restaurant, so we limit our search to four types of restaurants:
1. Indian
2. Middle-Eastern
3. Falafel
4. Turkish
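The four restaurant types can also be selected in a single step rather than one query per type. The sketch below assumes a small stand-in frame (`venues`, a hypothetical name) built from a few rows that appear in this notebook's tables; it filters with `isin` and counts the matches per neighborhood.

```python
import pandas as pd

# A few rows copied from this notebook's tables; stand-in for toronto_venues
venues = pd.DataFrame(
    columns=['Neighborhood', 'Venue', 'Venue Category'],
    data=[['Mount Pleasant East', 'Marigold Indian Bistro', 'Indian Restaurant'],
          ['Caledonia-Fairbank', 'Babos Dönerpoint', 'Falafel Restaurant'],
          ['York University Heights', 'Chaihana', 'Middle Eastern Restaurant'],
          ['Ionview', "Baran's Turkish Restaurant & Bar", 'Turkish Restaurant'],
          ['Weston-Pellam Park', 'Pho Xua', 'Vietnamese Restaurant'],
          ['Weston-Pellam Park', 'Caledonia Bakery & Pastry', 'Bakery']])

eastern_types = ['Indian Restaurant', 'Middle Eastern Restaurant',
                 'Falafel Restaurant', 'Turkish Restaurant']

# Keep only venues whose category is one of the four Eastern types
eastern = venues[venues['Venue Category'].isin(eastern_types)]

# Number of Eastern restaurants found in each neighborhood
counts = eastern.groupby('Neighborhood')['Venue'].count()
print(counts)
```

Neighborhoods with no Eastern restaurant, such as Weston-Pellam Park in this sample, simply do not appear in `counts`.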

In [47]:
toronto_venues.loc[toronto_venues['Venue Category']=='Indian Restaurant']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
82,Mount Pleasant East,43.7051,-79.3848,Marigold Indian Bistro,43.702881,-79.388008,Indian Restaurant
96,Mount Pleasant East,43.7051,-79.3848,Kamasutra,43.703991,-79.374597,Indian Restaurant
177,Bedford Park-Nortown,43.7303,-79.4114,The Copper Chimney,43.736195,-79.420271,Indian Restaurant
307,High Park-Swansea,43.6536,-79.4653,Bukhara indian cuisine,43.651105,-79.477104,Indian Restaurant
383,Palmerston-Little Italy,43.66,-79.4175,Banjara Indian Cuisine,43.662916,-79.421911,Indian Restaurant
414,Palmerston-Little Italy,43.66,-79.4175,Madras Masala,43.662959,-79.421646,Indian Restaurant


In [48]:
toronto_venues.loc[toronto_venues['Venue Category']=='Falafel Restaurant']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,Caledonia-Fairbank,43.6899,-79.4552,Babos Dönerpoint,43.693249,-79.461851,Falafel Restaurant
79,Lawrence Park North,43.7238,-79.3886,Extreme Pita,43.721814,-79.37652,Falafel Restaurant
308,High Park-Swansea,43.6536,-79.4653,Ali Baba's,43.65101,-79.477179,Falafel Restaurant


In [49]:
toronto_venues.loc[toronto_venues['Venue Category']=='Middle Eastern Restaurant']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
135,Hillcrest Village,43.8049,-79.3547,"Lazeez BBQ Kebabs, Shawarma, Falafel",43.812754,-79.358048,Middle Eastern Restaurant
259,York University Heights,43.7664,-79.4774,Chaihana,43.768936,-79.468502,Middle Eastern Restaurant


In [66]:
toronto_venues.loc[toronto_venues['Venue Category']=='Turkish Restaurant']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
219,Ionview,43.7308,-79.2739,Baran's Turkish Restaurant & Bar,43.728978,-79.280811,Turkish Restaurant


The restaurants found above are located in the following neighborhoods:
1. Mount Pleasant East
2. Bedford Park-Nortown
3. Lawrence Park North
4. Palmerston-Little Italy
5. High Park-Swansea
6. Caledonia-Fairbank
7. Hillcrest Village
8. York University Heights
9. Ionview

After discussing the locations with our client again, we find that he is most interested in the cluster centered on Weston-Pellam Park, so we limit our search to that area.

In [51]:
toronto_venues.loc[toronto_venues['Neighborhood']=='Weston-Pellam Park']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
27,Weston-Pellam Park,43.6716,-79.4577,"The Den Toronto : Workshops, Bench Rentals and...",43.670638,-79.456615,Jewelry Store
28,Weston-Pellam Park,43.6716,-79.4577,Sabor Brasil Restaurant,43.674445,-79.459228,Brazilian Restaurant
29,Weston-Pellam Park,43.6716,-79.4577,Caledonia Bakery & Pastry,43.675441,-79.454658,Bakery
30,Weston-Pellam Park,43.6716,-79.4577,Honest Weight,43.665389,-79.461335,Seafood Restaurant
31,Weston-Pellam Park,43.6716,-79.4577,TuckShop Kitchen,43.66505,-79.455624,Snack Place
32,Weston-Pellam Park,43.6716,-79.4577,Pho Xua,43.674056,-79.461112,Vietnamese Restaurant
33,Weston-Pellam Park,43.6716,-79.4577,Mattachioni,43.66496,-79.454912,Café
34,Weston-Pellam Park,43.6716,-79.4577,Tavora Foods,43.674868,-79.456559,Fish Market
35,Weston-Pellam Park,43.6716,-79.4577,Baguette & Co,43.664744,-79.455766,Café
36,Weston-Pellam Park,43.6716,-79.4577,Love Chix,43.66523,-79.45407,Restaurant


After discussing the potential locations, we shortlist five candidate locations and measure the distance from each to the nearby Eastern restaurants.

In [0]:
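# Shortlisted locations in the Weston-Pellam Park area
Grocery_Store = (43.673537, -79.468748)
Bakery = (43.675441, -79.454658)  # Caledonia Bakery & Pastry
Fish_Market = (43.674868, -79.456559)  # Tavora Foods
Shopping_Mall = (43.672681, -79.469688)
Organic_Grocery = (43.667743, -79.463271)

# Eastern restaurants in the neighboring neighborhoods of the cluster
R1 = (43.651105, -79.477104)  # Bukhara indian cuisine (Indian, High Park-Swansea)
R2 = (43.662916, -79.421911)  # Banjara Indian Cuisine (Indian, Palmerston-Little Italy)
R3 = (43.662959, -79.421646)  # Madras Masala (Indian, Palmerston-Little Italy)
R4 = (43.693249, -79.461851)  # Babos Dönerpoint (Falafel, Caledonia-Fairbank)
R5 = (43.651010, -79.477179)  # Ali Baba's (Falafel, High Park-Swansea)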

In [0]:
Weston_Pellam_locations = [Grocery_Store, Bakery, Fish_Market, Shopping_Mall, Organic_Grocery]
Nearby_Restaurants = [R1, R2, R3, R4, R5]

In [0]:
location_distances = []
for location in Weston_Pellam_locations:
  distances = []
  for restaurant in Nearby_Restaurants:
    d = geopy.distance.geodesic(location, restaurant).km
    distances.append(d)
  location_distances.append(distances)
location_distances

In [58]:
distance_matrix = np.array(location_distances)
distance_matrix

array([[2.58184709, 3.9575168 , 3.97650643, 2.25962421, 2.59361731],
       [3.2540222 , 2.985257  , 3.00197079, 2.0618283 , 3.26615932],
       [3.11717384, 3.09387157, 3.11115299, 2.08634074, 3.12933048],
       [2.47071983, 4.00315424, 4.02244446, 2.37098871, 2.48242779],
       [2.15923165, 3.37875342, 3.39910635, 2.83618206, 2.17139487]])
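To make the matrix easier to compare, each location can be summarised by its nearest restaurant and its mean distance. The sketch below assumes, as the nested loop that fills `location_distances` implies, that each row corresponds to one shortlisted location and each column to one of the restaurants R1 to R5. The values are copied from the output above, and `summary` is a hypothetical name.

```python
import pandas as pd

locations = ['Grocery_Store', 'Bakery', 'Fish_Market', 'Shopping_Mall', 'Organic_Grocery']
restaurants = ['R1', 'R2', 'R3', 'R4', 'R5']

# Distances in km, copied from distance_matrix (rows: locations, columns: restaurants)
values = [[2.58184709, 3.9575168, 3.97650643, 2.25962421, 2.59361731],
          [3.2540222, 2.985257, 3.00197079, 2.0618283, 3.26615932],
          [3.11717384, 3.09387157, 3.11115299, 2.08634074, 3.12933048],
          [2.47071983, 4.00315424, 4.02244446, 2.37098871, 2.48242779],
          [2.15923165, 3.37875342, 3.39910635, 2.83618206, 2.17139487]]
distances = pd.DataFrame(values, index=locations, columns=restaurants)

# Per-location summary: closest restaurant, its distance, and the mean over all five
summary = pd.DataFrame({
    'nearest_restaurant': distances.idxmin(axis=1),
    'nearest_km': distances.min(axis=1),
    'mean_km': distances.mean(axis=1)})
print(summary.round(2))
```

Every entry is above 2 km, so the shortlisted locations differ mainly in which restaurants they are closest to rather than in whether a competitor is next door.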

In [62]:
restaurants = ['High Park-Swansea (Indian)', 'Palmerston-Little Italy (Indian)', 'Palmerston-Little Italy (Indian)',
               'Caledonia-Fairbank (Falafel)', 'High Park-Swansea (Falafel)']
# Rows of distance_matrix are the shortlisted locations; columns are the restaurants R1 to R5
final_report = pd.DataFrame(distance_matrix,
                            index=['Grocery_Store', 'Bakery', 'Fish_Market', 'Shopping_Mall', 'Organic_Grocery'],
                            columns=restaurants)
final_report

Unnamed: 0,High Park-Swansea (Indian),Palmerston-Little Italy (Indian),Palmerston-Little Italy (Indian),Caledonia-Fairbank (Falafel),High Park-Swansea (Falafel)
Grocery_Store,2.581847,3.957517,3.976506,2.259624,2.593617
Bakery,3.254022,2.985257,3.001971,2.061828,3.266159
Fish_Market,3.117174,3.093872,3.111153,2.086341,3.12933
Shopping_Mall,2.47072,4.003154,4.022444,2.370989,2.482428
Organic_Grocery,2.159232,3.378753,3.399106,2.836182,2.171395


Analyzing the distances (in km) between the shortlisted locations and the nearby Eastern restaurants, we suggest two options to Mr. Arif:
1. If he wants to be nearer to the existing Eastern restaurants, he may look for a place near the Grocery_Store, Shopping_Mall or Organic_Grocery, each of which is within about 2.2 to 2.8 km of three of the five restaurants (the two in High Park-Swansea and the one in Caledonia-Fairbank).
2. Otherwise, if he wants to keep a greater distance from other Eastern restaurants, he may look for a place near the Bakery or the Fish_Market, which are about 3 km or more from four of the five restaurants; only the Caledonia-Fairbank falafel restaurant is closer, at about 2.1 km.

Since he does not want to be too far from the areas where people already favor Eastern food, he decides to buy a place near the Grocery_Store.

Thanks for viewing my assignment!

Mr. Arif is, of course, a fictitious character created for this assignment.