<h1 style="color:black;text-align:center">Capstone Project</h1>
<h1 style="color:slateblue;text-align:center">IBM Professional Certificate in Data Science</h1><br><br>
<h2 style="color:gray;text-align:center"><i>Opening an African Restaurant in Downtown Toronto, Canada</i></h2><br>
<img align="center"  src="https://www.senecacollege.ca/content/dam/projects/seneca/homepage-assets/homepage_intl.jpg" width=700 height=400></img><br></br>
<p style="color:slateblue;text-align:center;font-size:24px">By Isaac Nyamekye<br><br><i>April 2020</i></p>

---
# Table of contents
* [Background and Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)

# Background and Business Problem <a name="introduction"></a>

<p style="font-size:24px">Toronto is the most populous city in Canada and the provincial capital of Ontario. The city is very diverse and home to a large immigrant population. Because of the cultural diversity of its inhabitants, the city has a dynamic and diverse culinary scene.<br><br>
However, there are few Ghanaian (West African) restaurants, especially in downtown Toronto. An African restaurant chain would like to change this by opening a couple of Ghanaian restaurants in downtown Toronto. <br><br>
To reduce competition with other African restaurants, the company would like to open its Ghanaian restaurants in neighbourhoods that have no or very few African restaurants. The purpose of this project is to identify such neighbourhoods in downtown Toronto.<br><br>
The target audience of this project is mainly the senior executives of the African restaurant chain. The report will enable them to make evidence-based decisions on the best neighbourhoods in which to open their restaurants.
</p>

# Data <a name="data"></a>

<p style="font-size:24px">This project used data from:<br><ul style="list-style-type:circle;">
    <li style="font-size:24px"><a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"><b>Toronto Wikipedia page</b></a>: I scraped postal codes, boroughs and neighbourhood names from this page. Since the focus of my project is downtown Toronto, I only considered neighbourhoods in that borough.</li><br>
    <li style="font-size:24px"><b>pgeocode package:</b> This package was used to generate geolocation data (latitude and longitude) for the downtown Toronto neighbourhoods.</li><br>
    <li style="font-size:24px"><b>Foursquare API:</b> Using the location data obtained from web scraping and geocoding, I queried the Foursquare API for African restaurant venue details in each neighbourhood.</li>
</ul>

# Methodology <a name="methodology"></a>

<p style="font-size:24px">This section details the process I used to solve the business problem. The process comprised:<br><ul style="list-style-type:circle;">
    <li style="font-size:24px"><b>Data Acquisition and Cleaning:</b> This subsection outlines how I scraped and cleaned the neighbourhood data from the Toronto Wikipedia page. I also show how I used the pgeocode library to generate the geolocation data needed for my analysis.</li><br>
    <li style="font-size:24px"><b>Exploratory Data Analysis:</b> In this subsection, I explore the neighbourhoods in downtown Toronto. I used the folium library to generate maps of the neighbourhoods and the Foursquare API to analyze the most common venues in each neighbourhood.</li><br>
    <li style="font-size:24px"><b>Model Development:</b> This subsection shows the modeling process. I used K-means clustering to identify neighbourhoods with no or few African restaurants.</li>
</ul>

## Data Acquisition and Cleaning

### Scraping Toronto Neighbourhood Data from Wikipedia

In [11]:
#import libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [12]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(website_url,'lxml')
My_table = soup.find('table',{'class':'wikitable'})

In [13]:
# Build the list of column headers from the <th> cells
ths = My_table.find_all('th')
headings = [th.text.strip() for th in ths]

table_rows = My_table.find_all('tr')

rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Drop empty cells and replace '/' with ',' so that neighbourhoods
    # sharing a postal code are comma-separated
    row = [cell.text.strip().replace('/', ',') for cell in cells if cell.text.strip()]
    if row:  # the header row has no <td> cells, so it is skipped
        rows.append(row)

df = pd.DataFrame(rows, columns=headings)
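
The row-parsing loop can be checked offline on a small inline table. The sketch below is only an illustration, not the notebook's actual run: the HTML snippet and the names `sample_html`, `sample_table`, `sample_rows` and `sample_df` are made up for the example, and it uses the built-in `html.parser` backend instead of `lxml` to avoid an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the layout of the Wikipedia page.
sample_html = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable"})

# Column names come from the <th> cells.
sample_headings = [th.text.strip() for th in sample_table.find_all("th")]

sample_rows = []
for tr in sample_table.find_all("tr"):
    # Empty cells are dropped, so a row with a blank last cell is shorter.
    cells = [td.text.strip().replace("/", ",")
             for td in tr.find_all("td") if td.text.strip()]
    if cells:  # the header row has no <td> cells and is skipped
        sample_rows.append(cells)

# pandas pads the shorter row with a missing value in the last column.
sample_df = pd.DataFrame(sample_rows, columns=sample_headings)
print(sample_df)
```

Because the blank neighbourhood cell is dropped rather than kept as an empty string, it shows up as a missing value in the frame, which is what the later fill step relies on.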

In [15]:
# Drop rows whose Borough is 'Not assigned'.
# The original row labels are kept, so the index below is non-contiguous.
df = df[df['Borough'] != 'Not assigned'].copy()

# Replace missing Neighborhood values with their respective Borough
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
print(df.shape)
df.head()

(103, 3)


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 5 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
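
To make the two cleaning steps concrete, here is a small self-contained sketch on a toy frame. The rows and the name `toy` are invented for illustration and are not taken from the scraped table.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one unassigned borough and one missing neighbourhood.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": [np.nan, "Regent Park , Harbourfront", np.nan],
})

# Step 1: keep only rows whose borough is assigned.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighbourhood is missing, fall back to the borough name.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])

print(toy)
```

The surviving rows keep their original labels (1 and 2 here), which is why the tables in this notebook show non-contiguous row numbers.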


Since my analysis focuses on neighbourhoods in Downtown Toronto, I keep only the rows for that borough.

In [18]:
# Selecting data for Downtown Toronto
Toronto_data = df[df['Borough']=="Downtown Toronto"]
print(Toronto_data.shape)
Toronto_data.head()

(19, 3)


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 6 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |
| 22 | M5C | Downtown Toronto | St. James Town |
| 31 | M5E | Downtown Toronto | Berczy Park |


### Generate Latitude and Longitude for the Neighborhoods 

In this part, I generate latitude and longitude information for the downtown neighbourhoods using the pgeocode library. These coordinates make the data usable in Foursquare queries.

In [20]:
import pgeocode

# Work on an explicit copy so the column assignments below do not trigger
# pandas' SettingWithCopyWarning (Toronto_data was created by filtering df)
Toronto_data = Toronto_data.copy()

nomi = pgeocode.Nominatim('ca')  # build the Canadian postal-code lookup once

list_lat = []
list_long = []

# Look up the coordinates of each downtown postal code
for index, row in Toronto_data.iterrows():
    results = nomi.query_postal_code(row['Postal code'])
    list_lat.append(results['latitude'])
    list_long.append(results['longitude'])

Toronto_data['Latitude'] = list_lat
Toronto_data['Longitude'] = list_long

print(Toronto_data.shape)
Toronto_data.head()

(19, 5)


| | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.6555 | -79.3626 |
| 6 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.6641 | -79.3889 |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 |
| 22 | M5C | Downtown Toronto | St. James Town | 43.6513 | -79.3756 |
| 31 | M5E | Downtown Toronto | Berczy Park | 43.6456 | -79.3754 |
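
The geocoding cell queries one postal code at a time. As a hedged alternative, the sketch below adds both coordinate columns with a single lookup call. It assumes that the lookup accepts a list of codes and returns a frame with `latitude` and `longitude` columns, one row per code (pgeocode's `query_postal_code` appears to behave this way for list input, but treat that as an assumption to verify). The helper `add_coordinates` and the stand-in `fake_lookup` are hypothetical names; the stand-in only echoes two coordinates already shown in this notebook so the example runs offline.

```python
import pandas as pd


def add_coordinates(frame, lookup):
    """Return a copy of `frame` with Latitude and Longitude columns added.

    `lookup` is any callable that takes a list of postal codes and returns a
    DataFrame with 'latitude' and 'longitude' columns, one row per code.
    """
    coords = lookup(frame["Postal code"].tolist())
    out = frame.copy()  # avoid modifying a filtered slice in place
    # .to_numpy() assigns values by position rather than by index label.
    out["Latitude"] = coords["latitude"].to_numpy()
    out["Longitude"] = coords["longitude"].to_numpy()
    return out


def fake_lookup(codes):
    """Hypothetical offline stand-in for a real geocoder."""
    known = {"M5A": (43.6555, -79.3626), "M7A": (43.6641, -79.3889)}
    return pd.DataFrame([known[c] for c in codes],
                        columns=["latitude", "longitude"])


sample = pd.DataFrame(
    {"Postal code": ["M5A", "M7A"], "Borough": ["Downtown Toronto"] * 2},
    index=[4, 6],  # non-contiguous labels, as in the filtered data
)
located = add_coordinates(sample, fake_lookup)
print(located)
```

With pgeocode installed, the call would presumably look like `add_coordinates(Toronto_data, pgeocode.Nominatim('ca').query_postal_code)`. Returning a new frame also sidesteps the copy-of-a-slice warning that in-place column assignment on a filtered frame can raise.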


## Exploratory Data Analysis

### Explore the Neighbourhoods

In [4]:
import pgeocode

nomi = pgeocode.Nominatim('ca')  # build the Canadian postal-code lookup once

list_lat = []
list_long = []

# Look up the coordinates of every postal code in the full table
for index, row in df.iterrows():
    results = nomi.query_postal_code(row['Postal code'])
    list_lat.append(results['latitude'])
    list_long.append(results['longitude'])

df['Latitude'] = list_lat
df['Longitude'] = list_long

df

| | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M3A | North York | Parkwoods | 43.7545 | -79.3300 |
| 3 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.6555 | -79.3626 |
| 5 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.7223 | -79.4504 |
| 6 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.6641 | -79.3889 |
| ... | ... | ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway , Montgomery Road , Old Mill North | 43.6518 | -79.5076 |
| 165 | M4Y | Downtown Toronto | Church and Wellesley | 43.6656 | -79.3830 |
| 168 | M7Y | East Toronto | Business reply mail Processing CentrE | 43.7804 | -79.2505 |
| 169 | M8Y | Etobicoke | Old Mill South , King's Mill Park , Sunnylea ,... | 43.6325 | -79.4939 |
