## 1. Introduction

My client is looking to invest in the restaurant business in the province of Alberta, Canada. The client is new to the country and wants a recommendation for the best location to build a restaurant. The location of interest is a densely populated area with few existing restaurants, to minimize competition.

In this project, we compare the two most populated cities/boroughs in Alberta using data science methodology. The advantages and disadvantages of each city will be outlined, and the neighborhoods in each city will be explored to determine the best location for a new restaurant. Each city will be analysed separately.

As they say in the food business, location is KEY. This project therefore looks for the best location to open a restaurant in the province.


## Business Problem/Challenge:

### The challenge is to find a suitable neighborhood in any borough in Alberta where the restaurant will thrive. The location of interest is a densely populated area with few or no restaurants.
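
One possible way to make "densely populated with few or no restaurants" measurable is to rank neighborhoods by residents per existing restaurant. The sketch below is only an illustration under stated assumptions: the neighborhood names, the column names `Population` and `Restaurants`, and every figure are hypothetical placeholders, not results from this project.

```python
import pandas as pd

# Hypothetical figures for illustration only; not taken from the project data.
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Population": [60000, 45000, 30000],
    "Restaurants": [30, 5, 0],
})

# Residents per restaurant; adding 1 avoids division by zero when an area has none.
toy["ResidentsPerRestaurant"] = toy["Population"] / (toy["Restaurants"] + 1)

# Higher values suggest more potential customers per existing competitor.
ranked = toy.sort_values("ResidentsPerRestaurant", ascending=False).reset_index(drop=True)
print(ranked)
```

A ratio like this is only a first filter; it says nothing about income, foot traffic or rent, which would need separate data.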

## 2. Data

To provide my client with the necessary information, I will look at:
1.	The top 10 neighborhoods by population in the top 2 boroughs.
2.	The boroughs with a population large enough to support the business.
3.	The top 5 restaurants in each selected neighborhood, to assess whether any neighborhood offers an opportunity for my client.
4.	Several data sets from the web, which will be combined.
5.	Alberta postal code data: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T
6.	Canada census data: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&SR=1&S=22&O=A&RPP=9999&PR=0
7.	The Foursquare API will be used to explore neighborhoods, and K-means to segment and cluster them.




## 3. Methodology


•	Use pandas to scrape and explore the Alberta dataset from the website above: an HTML page that contains postal codes, boroughs, neighborhoods and latitude/longitude.

•	Bring in the Canadian population HTML table and merge it with the Alberta data frame.


•	Extract the two most populated boroughs/cities from the data frame and use the Foursquare API to explore the neighborhoods of each one separately.

•	Use K-means to segment the neighborhoods into 4 clusters and visualize the clusters on a map using the Folium library.

•	Explore the top 10 neighborhoods with the highest number of restaurants using the Foursquare API and visualize them as well.

•	Assess the top 5 restaurants in the neighborhoods and present my findings based on the results.
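
The clustering step in the list above can be sketched as follows. This is a minimal illustration, assuming a feature matrix with one row per neighborhood and one column per venue category; here the matrix is filled with random placeholder values rather than real Foursquare results, and the names `features` and `labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per neighborhood, one column per venue
# category (for example the share of restaurants, cafes, parks). Random placeholders.
rng = np.random.default_rng(0)
features = rng.random((20, 3))

# Four clusters, matching the number chosen in the methodology.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)

# One integer label (0 to 3) per neighborhood; these can later color map markers.
labels = kmeans.labels_
print(labels)
```

Fixing `random_state` makes the cluster assignment reproducible between runs, which helps when the labels are later used to color markers on a map.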


### Importing necessary Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
# import folium libraries
! pip install folium
import folium

import json # library to handle JSON files

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



In [2]:
import pandas as pd
!conda install --yes lxml


### Exploring the Alberta dataset: an HTML page that contains Alberta postal codes, boroughs, neighborhoods and latitude/longitude

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T'
df = pd.read_html(url)
df = pd.DataFrame(df[1]) 
df.head(10)   

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | T1A | Medicine Hat | Central Medicine Hat | 50.03646 | -110.67925 |
| 1 | T2A | Calgary | Penbrooke Meadows, Marlborough | 51.04968 | -113.96432 |
| 2 | T3A | Calgary | Dalhousie, Edgemont, Hamptons, Hidden Valley | 51.12606 | -114.143158 |
| 3 | T4A | Airdrie | East Airdrie | 51.27245 | -113.98698 |
| 4 | T5A | Edmonton | West Clareview, East Londonderry | 53.5899 | -113.4413 |
| 5 | T6A | Edmonton | North Capilano | 53.5483 | -113.408 |
| 6 | T7A | Drayton Valley | Not assigned | 53.2165 | -114.9893 |
| 7 | T8A | Sherwood Park | West Sherwood Park | 53.519 | -113.3216 |
| 8 | T9A | Wetaskiwin | Not assigned | 52.9741 | -113.3646 |
| 9 | T1B | Medicine Hat | South Medicine Hat | 50.0172 | -110.651 |



### Data cleanup: drop rows whose Neighborhood is "Not assigned", then reset the index


In [4]:
df = df[df.Neighborhood != 'Not assigned'].reset_index(drop=True)
df.head(10)

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | T1A | Medicine Hat | Central Medicine Hat | 50.03646 | -110.67925 |
| 1 | T2A | Calgary | Penbrooke Meadows, Marlborough | 51.04968 | -113.96432 |
| 2 | T3A | Calgary | Dalhousie, Edgemont, Hamptons, Hidden Valley | 51.12606 | -114.143158 |
| 3 | T4A | Airdrie | East Airdrie | 51.27245 | -113.98698 |
| 4 | T5A | Edmonton | West Clareview, East Londonderry | 53.5899 | -113.4413 |
| 5 | T6A | Edmonton | North Capilano | 53.5483 | -113.408 |
| 6 | T8A | Sherwood Park | West Sherwood Park | 53.519 | -113.3216 |
| 7 | T1B | Medicine Hat | South Medicine Hat | 50.0172 | -110.651 |
| 8 | T2B | Calgary | Forest Lawn, Dover, Erin Woods | 51.0318 | -113.9786 |
| 9 | T3B | Calgary | Montgomery, Bowness, Silver Springs, Greenwood | 51.0809 | -114.1616 |


###### As seen above, the rows with a "Not assigned" Neighborhood have been dropped
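
A quick way to check a filter like the one above is to count rows before and after it and confirm that no placeholder value survives. The sketch below is an illustration on a small toy frame that mirrors three rows of the scraped table; the names `toy`, `cleaned` and `dropped` are hypothetical.

```python
import pandas as pd

# Toy rows mirroring the scraped table's columns; three rows for illustration.
toy = pd.DataFrame({
    "Postal Code": ["T1A", "T7A", "T9A"],
    "Borough": ["Medicine Hat", "Drayton Valley", "Wetaskiwin"],
    "Neighborhood": ["Central Medicine Hat", "Not assigned", "Not assigned"],
})

before = len(toy)
cleaned = toy[toy["Neighborhood"] != "Not assigned"].reset_index(drop=True)
dropped = before - len(cleaned)

# Confirm that no placeholder rows survive the filter.
assert (cleaned["Neighborhood"] != "Not assigned").all()
print(f"dropped {dropped} of {before} rows")
```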

### Canadian Population Information

In [5]:
html ="https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&SR=1&S=22&O=A&RPP=9999&PR=0"
df_Canada_Population = pd.read_html(html, header=0)[0]
df_Canada_Population.head()

| | Geographic name | Population, 2016 | Total private dwellings, 2016 | Private dwellings occupied by usual residents, 2016 |
|---|---|---|---|---|
| 0 | | | | |
| 1 | CanadaFootnote 1 | 35151728.0 | 15412443.0 | 14072079.0 |
| 2 | A0A | 46587.0 | 26155.0 | 19426.0 |
| 3 | A0B | 19792.0 | 13658.0 | 8792.0 |
| 4 | A0C | 12587.0 | 8010.0 | 5606.0 |



#### Merge the population data frame with the Alberta neighborhoods


In [6]:
df_Alberta=df.merge(df_Canada_Population, left_on='Postal Code',right_on='Geographic name')
df_Alberta.drop(columns = ['Geographic name','Total private dwellings, 2016','Private dwellings occupied by usual residents, 2016'], inplace=True)
df_Alberta.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Population, 2016 |
|---|---|---|---|---|---|---|
| 0 | T1A | Medicine Hat | Central Medicine Hat | 50.03646 | -110.67925 | 25409.0 |
| 1 | T2A | Calgary | Penbrooke Meadows, Marlborough | 51.04968 | -113.96432 | 59641.0 |
| 2 | T3A | Calgary | Dalhousie, Edgemont, Hamptons, Hidden Valley | 51.12606 | -114.143158 | 53224.0 |
| 3 | T4A | Airdrie | East Airdrie | 51.27245 | -113.98698 | 16054.0 |
| 4 | T5A | Edmonton | West Clareview, East Londonderry | 53.5899 | -113.4413 | 35049.0 |
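
The merge above uses pandas' default inner join, so any postal code without a matching census row is dropped silently. One way to see which codes would be lost is a left join with `indicator=True`. The sketch below is an illustration on toy data; the postal code `T0X` is made up to stand in for an unmatched row, and the names `postal`, `census` and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two tables; "T0X" is a made-up code with no census row.
postal = pd.DataFrame({"Postal Code": ["T1A", "T2A", "T0X"]})
census = pd.DataFrame({
    "Geographic name": ["T1A", "T2A"],
    "Population, 2016": [25409.0, 59641.0],
})

# how="left" keeps every postal code; indicator=True adds a "_merge" column
# recording whether each row found a match in the census table.
merged = postal.merge(
    census, left_on="Postal Code", right_on="Geographic name",
    how="left", indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If the list of unmatched codes is empty on the real data, the inner join loses nothing and the simpler call is sufficient.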


#### Rename the last column

In [11]:
df_Alberta=df_Alberta.rename({'Population, 2016':'Population'}, axis=1)
df_Alberta.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Population |
|---|---|---|---|---|---|---|
| 0 | T1A | Medicine Hat | Central Medicine Hat | 50.03646 | -110.67925 | 25409.0 |
| 1 | T2A | Calgary | Penbrooke Meadows, Marlborough | 51.04968 | -113.96432 | 59641.0 |
| 2 | T3A | Calgary | Dalhousie, Edgemont, Hamptons, Hidden Valley | 51.12606 | -114.143158 | 53224.0 |
| 3 | T4A | Airdrie | East Airdrie | 51.27245 | -113.98698 | 16054.0 |
| 4 | T5A | Edmonton | West Clareview, East Londonderry | 53.5899 | -113.4413 | 35049.0 |


#### Drop the Postal Code column and sort by population

In [22]:
df_ALBERTA=df_Alberta.drop(columns=['Postal Code']).sort_values('Population', ascending=False).reset_index(drop=True)

df_ALBERTA.head()

| | Borough | Neighborhood | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Calgary | Sandstone, MacEwan Glen, Beddington, Harvest H... | 51.127 | -114.0787 | 80792.0 |
| 1 | Calgary | Martindale, Taradale, Falconridge, Saddle Ridge | 51.0999 | -113.9422 | 77605.0 |
| 2 | Calgary | Discovery Ridge, Signal Hill, West Springs, Ch... | 51.0566 | -114.1815 | 71251.0 |
| 3 | Edmonton | West Jasper Place, West Edmonton Mall | 53.5157 | -113.6339 | 69914.0 |
| 4 | Calgary | Douglas Glen, McKenzie Lake, Copperfield, East... | 50.9023 | -113.9873 | 68438.0 |


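In [32]:
# Sum the population per borough; selecting only Population keeps the text
# and coordinate columns out of the sum
df_A = df_ALBERTA.groupby('Borough')[['Population']].sum()
df_A.head()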

| Borough | Population |
|---|---|
| Airdrie | 64602.0 |
| Calgary | 1249824.0 |
| Edmonton | 932650.0 |
| Fort McMurray | 69667.0 |
| Grande Prairie | 72646.0 |


### As seen above, Calgary and Edmonton are the two biggest boroughs by population

### Therefore, we will explore each major city separately.
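
Reading the two largest boroughs off a table works for a handful of rows; a programmatic check is one alternative. The sketch below is an illustration that covers only the five borough totals shown in the grouped output above, copied into a small Series; the names `totals` and `top_two` are hypothetical.

```python
import pandas as pd

# Borough totals copied from the five rows of grouped output shown above.
totals = pd.Series(
    {
        "Airdrie": 64602.0,
        "Calgary": 1249824.0,
        "Edmonton": 932650.0,
        "Fort McMurray": 69667.0,
        "Grande Prairie": 72646.0,
    },
    name="Population",
)

# nlargest returns the n largest values, sorted in descending order.
top_two = totals.nlargest(2).index.tolist()
print(top_two)
```

On the full grouped frame, the same idea would apply to its `Population` column.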

##### Exploring Calgary's top 10 neighborhoods by population: generating the dataset whose Borough is 'Calgary'

In [33]:
df_ALBERTA.shape

(98, 5)

In [35]:
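# Boolean mask instead of a loop over a hard-coded range(98);
# str.contains keeps the original substring match on the Borough column
is_calgary = df_ALBERTA['Borough'].str.contains('Calgary', regex=False, na=False)
Calgary_Neigh = df_ALBERTA[is_calgary].reset_index(drop=True)
Calgary_Neigh.head(10)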

| | Borough | Neighborhood | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Calgary | Sandstone, MacEwan Glen, Beddington, Harvest H... | 51.127 | -114.0787 | 80792.0 |
| 1 | Calgary | Martindale, Taradale, Falconridge, Saddle Ridge | 51.0999 | -113.9422 | 77605.0 |
| 2 | Calgary | Discovery Ridge, Signal Hill, West Springs, Ch... | 51.0566 | -114.1815 | 71251.0 |
| 3 | Calgary | Douglas Glen, McKenzie Lake, Copperfield, East... | 50.9023 | -113.9873 | 68438.0 |
| 4 | Calgary | Millrise, Somerset, Bridlewood, Evergreen | 50.9093 | -114.0721 | 61344.0 |
| 5 | Calgary | Penbrooke Meadows, Marlborough | 51.04968 | -113.96432 | 59641.0 |
| 6 | Calgary | Hawkwood, Arbour Lake, Citadel, Ranchlands, Ro... | 51.1147 | -114.1796 | 59025.0 |
| 7 | Calgary | Rundle, Whitehorn, Monterey Park | 51.0759 | -114.0015 | 57237.0 |
| 8 | Calgary | Dalhousie, Edgemont, Hamptons, Hidden Valley | 51.12606 | -114.143158 | 53224.0 |
| 9 | Calgary | Queensland, Lake Bonavista, Willow Park, Acadia | 50.9693 | -114.0514 | 46394.0 |


##### Exploring Edmonton's top 10 neighborhoods by population: generating the dataset whose Borough is 'Edmonton'

In [53]:
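# Boolean mask instead of a loop over a hard-coded range(98);
# str.contains keeps the original substring match on the Borough column
is_edmonton = df_ALBERTA['Borough'].str.contains('Edmonton', regex=False, na=False)
Edmonton_Neigh = df_ALBERTA[is_edmonton].reset_index(drop=True)
Edmonton_Neigh.head(10)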

| | Borough | Neighborhood | Latitude | Longitude | Population |
|---|---|---|---|---|---|
| 0 | Edmonton | West Jasper Place, West Edmonton Mall | 53.5157 | -113.6339 | 69914.0 |
| 1 | Edmonton | East Mill Woods | 53.4681 | -113.4339 | 51562.0 |
| 2 | Edmonton | Heritage Valley | 53.4129 | -113.4957 | 50904.0 |
| 3 | Edmonton | Kaskitayo, Aspen Gardens | 53.4822 | -113.5269 | 47109.0 |
| 4 | Edmonton | Horse Hill, East Lake District | 53.6026 | -113.3837 | 45947.0 |
| 5 | Edmonton | Southgate, North Riverbend | 53.4839 | -113.5227 | 45145.0 |
| 6 | Edmonton | East Castledowns | 53.6072 | -113.5183 | 41912.0 |
| 7 | Edmonton | The Meadows | 53.4768 | -113.3662 | 37150.0 |
| 8 | Edmonton | Ellerslie | 53.4154 | -113.4917 | 35649.0 |
| 9 | Edmonton | West Clareview, East Londonderry | 53.5899 | -113.4413 | 35049.0 |
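
Since the Calgary and Edmonton steps are the same apart from the borough name, they could be folded into one small function. The sketch below is an illustration only: `top_neighborhoods` is a hypothetical helper, not part of the notebook, it runs on a toy frame with four rows copied from the tables above, and it uses exact equality on `Borough` rather than the substring check used in the cells above.

```python
import pandas as pd

def top_neighborhoods(frame: pd.DataFrame, borough: str, n: int = 10) -> pd.DataFrame:
    """Return the n most populous neighborhoods of one borough (hypothetical helper)."""
    subset = frame[frame["Borough"] == borough]
    return subset.nlargest(n, "Population").reset_index(drop=True)

# Toy frame with a few rows copied from the tables above.
toy = pd.DataFrame({
    "Borough": ["Calgary", "Edmonton", "Calgary", "Edmonton"],
    "Neighborhood": [
        "Penbrooke Meadows, Marlborough",
        "East Mill Woods",
        "Rundle, Whitehorn, Monterey Park",
        "Heritage Valley",
    ],
    "Population": [59641.0, 51562.0, 57237.0, 50904.0],
})

edmonton_top = top_neighborhoods(toy, "Edmonton", n=1)
print(edmonton_top)
```

Because `nlargest` sorts by population itself, the helper does not rely on the input frame having been sorted beforehand.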
