## Table of contents
* [Introduction and Business Problem](#introduction)
* [Data Description](#data)

## 1. Introduction and Business Problem <a id="introduction"></a>


Vancouver, located on the west coast of Canada in the province of British Columbia, anchors the country's third-largest metropolitan area and has a rapidly growing population. The city hosts the largest port in Canada and is a major industrial center. Other economic activities include lumber and tourism.

Our stakeholder wants to open a new restaurant in the city of Vancouver.
The city has a huge variety of restaurants for every taste, so starting a restaurant business here is not an easy task. Choosing a location is one of the most stressful and contested decisions, since many criteria have to be satisfied to achieve the highest revenue. Any new business venture in the city needs to be reviewed carefully and strategically so that the return on investment is sustainable and the investment risk is low. To meet this business target, the proposed location must have enough customers. That means the population and population density of the neighborhood should be relatively high, the number of restaurants per 1,000 residents should be low, and preferably there should be few restaurants in the immediate proximity of the location.

In this project, we carry out a basic analysis and try to find the best neighborhood in which to open the restaurant according to the criteria above. There are of course many additional factors, such as the distance from parking and from main streets, or the rent for the location. These can be analyzed separately once the neighborhood has been chosen, and are therefore outside the scope of this project.
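
To make the criteria above concrete, here is a minimal sketch of how restaurants per 1,000 residents could be computed and used to rank neighborhoods. The three rows reuse figures that appear in the tables later in this notebook. The helper name `restaurants_per_1000` and the ranking rule (fewer restaurants per resident first, then higher density) are assumptions made for illustration, not the method used in the analysis. Plain Python is used so the snippet runs on its own.

```python
# Figures copied from the tables later in this notebook; the ranking rule is an assumption.
neighborhoods = [
    {"name": "Yaletown", "population": 62030, "density": 16764, "restaurants": 42},
    {"name": "Fairview", "population": 33620, "density": 10281, "restaurants": 13},
    {"name": "Dunbar-Southlands", "population": 21425, "density": 2503, "restaurants": 5},
]


def restaurants_per_1000(row):
    """Restaurants per 1,000 residents (hypothetical helper)."""
    return 1000 * row["restaurants"] / row["population"]


for row in neighborhoods:
    row["per_1000"] = restaurants_per_1000(row)

# Fewer restaurants per 1,000 residents first; ties broken by higher density.
ranked = sorted(neighborhoods, key=lambda r: (r["per_1000"], -r["density"]))
for row in ranked:
    print(f'{row["name"]}: {row["per_1000"]:.2f} restaurants per 1,000 residents')
```

A single sort key like this is only one way to combine the criteria. A weighted score or a clustering step would be equally reasonable.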


## 2. Data Description <a id="data"></a>

We need to explore, segment, and cluster the neighborhoods of the city of Vancouver, so the neighborhood data is the key to this project. Unfortunately, this data is not widely available on the internet in a structured format, so we need to scrape it from an existing Wikipedia page that lists the population and population density of each neighborhood in the city. We use the geopy.geocoders package to get the geographical coordinates of each neighborhood and add them to our dataframe. Further information about venues can then be obtained for these coordinates using the Foursquare API.

At the end of this part, we have a cleaned and structured dataframe with the following information about each neighborhood in the city of Vancouver:

1. Neighborhood name
2. Neighborhood latitude
3. Neighborhood longitude
4. Population of residents in each neighborhood
5. Population density of each neighborhood (population / neighborhood area in km^2)

References: (https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Vancouver)
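
As a rough illustration of the scraping step described above, the sketch below pulls the rows of an HTML table into a list of records using only the standard library. The `TableParser` class and the `sample_html` fragment are hypothetical. The real Wikipedia markup will differ, and in practice `pandas.read_html` or BeautifulSoup may be more convenient.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical fragment for illustration; the values match the data used in this notebook.
sample_html = """
<table>
  <tr><th>Neighborhood</th><th>Population</th><th>Population Density</th></tr>
  <tr><td>Arbutus-Ridge</td><td>15,295</td><td>4,134</td></tr>
  <tr><td>Fairview</td><td>33,620</td><td>10,281</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
header, *body = parser.rows
records = [
    {
        "Neighborhood": name,
        "Population": int(pop.replace(",", "")),
        "Population Density": int(dens.replace(",", "")),
    }
    for name, pop, dens in body
]
print(records)
```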


First, let's import all the libraries that we will need.

In [204]:
# Importing the libraries that are useful for this project
import random
import numpy as np
import pandas as pd
import requests
import bs4 as bs
import urllib.request
import matplotlib.pyplot as plt
import csv
import folium
from pandas import json_normalize  # pandas >= 1.0; replaces the deprecated pandas.io.json import
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# Import k-means for the clustering stage
from sklearn.cluster import KMeans
%matplotlib inline
print('Required Libraries are Imported.')


Required Libraries are Imported.


### Data Preparation

In [205]:
from collections import defaultdict
# Neighborhood names with their population and population density (residents per km^2)
Vancouver_population_info = defaultdict(list)
Vancouver_population_info['Neighborhood'] = [
    'Arbutus-Ridge', 'Yaletown', 'Dunbar-Southlands', 'Fairview', 'Grandview-Woodland', 'Hastings-Sunrise',
    'Kensington-Cedar Cottage', 'Kerrisdale', 'Killarney', 'Kitsilano', 'Marpole', 'Mount Pleasant', 'Oakridge',
    'Renfrew-Collingwood', 'Riley Park', 'Shaughnessy', 'South Cambie', 'Strathcona', 'Sunset',
    'Victoria-Fraserview', 'West-End', 'Point Grey']
Vancouver_population_info['Population'] = [
    15295, 62030, 21425, 33620, 29175, 33045, 49165, 13975, 22325, 43045, 24460,
    53986, 13030, 51530, 21794, 8430, 6995, 12585, 36500, 31065, 47200, 13065]
Vancouver_population_info['Population Density'] = [
    4134, 16764, 2503, 10281, 6556, 4069, 6791, 2215, 4416, 7884, 4376,
    5600, 3249, 6401, 4843, 1890, 3224, 3244, 5831, 5850, 23838, 2935]

df1 = pd.DataFrame(Vancouver_population_info, columns=['Neighborhood', 'Population', 'Population Density'])
df1.head()

| | Neighborhood | Population | Population Density |
|---|---|---|---|
| 0 | Arbutus-Ridge | 15295 | 4134 |
| 1 | Yaletown | 62030 | 16764 |
| 2 | Dunbar-Southlands | 21425 | 2503 |
| 3 | Fairview | 33620 | 10281 |
| 4 | Grandview-Woodland | 29175 | 6556 |


#### Adding the latitude and longitude values of all Neighborhoods in Vancouver to the dataframe using geopy 


In [206]:
from geopy.geocoders import Nominatim

latitudes = []   # latitude of each neighborhood
longitudes = []  # longitude of each neighborhood
# Recent geopy versions require a user_agent; the name used here is an arbitrary placeholder
geolocator = Nominatim(user_agent="vancouver_restaurant_project")

# Use the geopy library to get the latitude and longitude of each neighborhood
for nbd in df1["Neighborhood"]:
    address = nbd + ', Vancouver, British Columbia, Canada'  # formats the place name
    location = None
    while location is None:  # keep retrying until the geocoder returns a result
        try:
            location = geolocator.geocode(address)
        except Exception:
            pass  # transient geocoder error; try again
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)

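
The retry loop in the cell above never gives up, so it would hang if a place name cannot be resolved. Below is a minimal sketch of a bounded retry, assuming the geocoding call is passed in as a function. `geocode_with_retry` and `flaky_geocode` are hypothetical names, and the stand-in function simulates the geocoder so the example runs without network access.

```python
import time


def geocode_with_retry(geocode, address, attempts=3, pause=1.0):
    """Call geocode(address) up to `attempts` times; return None if every try fails."""
    for attempt in range(attempts):
        try:
            location = geocode(address)
        except Exception:
            location = None  # treat a failed request like an empty result
        if location is not None:
            return location
        if attempt < attempts - 1:
            time.sleep(pause)  # wait before the next attempt
    return None


# Stand-in for geolocator.geocode: fails once, then succeeds (illustration only).
calls = {"n": 0}


def flaky_geocode(address):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return (49.240968, -123.167001)


result = geocode_with_retry(flaky_geocode, "Arbutus-Ridge, Vancouver", pause=0.0)
print(result, "after", calls["n"], "calls")
```

With a real geolocator, `geolocator.geocode` would be passed as the first argument, and a `None` result would signal a neighborhood that needs a manual coordinate.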


In [207]:
df1['Latitude'] = latitudes
df1['Longitude'] = longitudes
df1.head()

| | Neighborhood | Population | Population Density | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 15295 | 4134 | 49.240968 | -123.167001 |
| 1 | Yaletown | 62030 | 16764 | 49.276322 | -123.120956 |
| 2 | Dunbar-Southlands | 21425 | 2503 | 49.25346 | -123.185044 |
| 3 | Fairview | 33620 | 10281 | 49.264113 | -123.126835 |
| 4 | Grandview-Woodland | 29175 | 6556 | 49.270559 | -123.067942 |


##### Checking the geocoding results by counting collisions (duplicate coordinates)

In [208]:
# Manually set the coordinates of Oakridge
df1.loc[df1['Neighborhood'] == 'Oakridge', ["Latitude", "Longitude"]] = 49.226100, -123.1166

# Count collisions: neighborhoods whose (latitude, longitude) pair was already seen
col = 0
explored_lat_lng = set()
for lat, lng in zip(df1['Latitude'], df1['Longitude']):
    if (lat, lng) in explored_lat_lng:
        col += 1
    else:
        explored_lat_lng.add((lat, lng))

print("Collisions : ", col)

Collisions :  0
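
A collision here means that two neighborhoods received exactly the same coordinates, which usually indicates that the geocoder fell back to a more general location. The sketch below shows one way to also list which neighborhoods share a point, using `collections.Counter`. The third row is a made-up duplicate added only to trigger a collision.

```python
from collections import Counter

# (name, latitude, longitude); the last row is a hypothetical duplicate for illustration.
coords = [
    ("Arbutus-Ridge", 49.240968, -123.167001),
    ("Yaletown", 49.276322, -123.120956),
    ("Hypothetical duplicate", 49.276322, -123.120956),
]

# Count how often each coordinate pair occurs.
counts = Counter((lat, lng) for _, lat, lng in coords)

# Every occurrence beyond the first one is a collision.
collisions = sum(n - 1 for n in counts.values())
shared = [name for name, lat, lng in coords if counts[(lat, lng)] > 1]

print("Collisions:", collisions)
print("Neighborhoods sharing a point:", shared)
```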


Saving the dataframe to a csv file

In [209]:
# Write to the working directory so that the read in the next cell finds the file
df1.to_csv('dataVancouver.csv', index=False)

Reading the csv file

In [244]:
data_Vancouver = pd.read_csv('dataVancouver.csv')
print(data_Vancouver.shape)
data_Vancouver

(22, 5)


| | Neighborhood | Population | Population Density | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 15295 | 4134 | 49.240968 | -123.167001 |
| 1 | Yaletown | 62030 | 16764 | 49.276322 | -123.120956 |
| 2 | Dunbar-Southlands | 21425 | 2503 | 49.25346 | -123.185044 |
| 3 | Fairview | 33620 | 10281 | 49.264113 | -123.126835 |
| 4 | Grandview-Woodland | 29175 | 6556 | 49.270559 | -123.067942 |
| 5 | Hastings-Sunrise | 33045 | 4069 | 49.277594 | -123.04392 |
| 6 | Kensington-Cedar Cottage | 49165 | 6791 | 49.247632 | -123.084207 |
| 7 | Kerrisdale | 13975 | 2215 | 49.234673 | -123.155389 |
| 8 | Killarney | 22325 | 4416 | 49.224274 | -123.04625 |
| 9 | Kitsilano | 43045 | 7884 | 49.26941 | -123.155267 |


#### Use geopy library to get the latitude and longitude values of Vancouver BC

In [245]:
# This part gets the latitude and longitude values of Vancouver.
from geopy.geocoders import Nominatim

# Use the geopy library to get the latitude and longitude values of Vancouver, BC.
address = 'Vancouver, British Columbia, Canada'
geolocator = Nominatim(user_agent="vancouver_restaurant_project")  # arbitrary placeholder name
location = geolocator.geocode(address)
latitude_v = location.latitude
longitude_v = location.longitude
print('The geographical coordinates of Vancouver BC are {}, {}.'.format(latitude_v, longitude_v))


The geographical coordinates of Vancouver BC are 49.2608724, -123.1139529.


#### Using Foursquare API to get information about venues around each neighborhood in Vancouver

In [246]:
# Defining Foursquare credentials and version
import os

# Credentials are read from environment variables (the variable names are placeholders)
# so that the secret is never stored or printed in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Foursquare credentials loaded.')

Foursquare credentials loaded.


#### Create a function to explore the venues for all the neighborhoods in Vancouver

In [247]:
# Explore Neighborhoods in Vancouver
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)  # LIMIT is a global that must be defined before this function is called
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
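
The function above indexes directly into the JSON response, so a response with no groups or a venue with no category would raise an error. The sketch below shows a more defensive way to flatten one response into rows. The nesting of `sample_payload` is inferred from the keys that `getNearbyVenues` reads and is an assumption about the response, not an authoritative description of the Foursquare schema. The helper name `flatten_explore_response` is hypothetical.

```python
def flatten_explore_response(payload, name, lat, lng):
    """Turn one explore-style response into rows of neighborhood and venue fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a None category when the venue has no categories listed.
        categories = venue.get("categories") or [{"name": None}]
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows


# Hand-built payload mirroring the keys used in getNearbyVenues (structure assumed).
sample_payload = {
    "response": {
        "groups": [{
            "items": [{
                "venue": {
                    "name": "Butter Baked Goods",
                    "location": {"lat": 49.242209, "lng": -123.170381},
                    "categories": [{"name": "Bakery"}],
                }
            }]
        }]
    }
}

rows = flatten_explore_response(sample_payload, "Arbutus-Ridge", 49.240968, -123.167001)
print(rows)
print(flatten_explore_response({}, "Arbutus-Ridge", 49.240968, -123.167001))
```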

#### Create a new dataframe called Vancouver_venues.

In [248]:
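# getting Vancouver venues
LIMIT = 100  # maximum number of venues per request (read as a global by getNearbyVenues)
# No radius is passed here, so getNearbyVenues uses its default radius of 500
Vancouver_venues = getNearbyVenues(names=data_Vancouver['Neighborhood'],
                                   latitudes=data_Vancouver['Latitude'],
                                   longitudes=data_Vancouver['Longitude']
                                  )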

print(Vancouver_venues.shape)


Arbutus-Ridge
Yaletown
Dunbar-Southlands
Fairview
Grandview-Woodland
Hastings-Sunrise
Kensington-Cedar Cottage
Kerrisdale
Killarney
Kitsilano
Marpole
Mount Pleasant
Oakridge
Renfrew-Collingwood
Riley Park
Shaughnessy
South Cambie
Strathcona
Sunset
Victoria-Fraserview
West-End
Point Grey
(613, 7)


In [249]:
Vancouver_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 49.240968 | -123.167001 | Butter Baked Goods | 49.242209 | -123.170381 | Bakery |
| 1 | Arbutus-Ridge | 49.240968 | -123.167001 | The Haven | 49.241377 | -123.166331 | Spa |
| 2 | Arbutus-Ridge | 49.240968 | -123.167001 | Barktholomews Pet Supplies | 49.242746 | -123.170193 | Pet Store |
| 3 | Arbutus-Ridge | 49.240968 | -123.167001 | The Dragon's Layer | 49.238518 | -123.169029 | Nightlife Spot |
| 4 | Arbutus-Ridge | 49.240968 | -123.167001 | The Heights Market | 49.237902 | -123.170949 | Grocery Store |
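
Each row pairs a neighborhood point with a venue point, so the distance between them can be checked against the search radius. The following sketch uses the haversine formula for this, taking the coordinates of the first row above. The function name `haversine_m` and the use of a mean Earth radius of 6,371 km are choices made for this illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Arbutus-Ridge neighborhood point to Butter Baked Goods (first row of the table).
distance = haversine_m(49.240968, -123.167001, 49.242209, -123.170381)
print(f"{distance:.0f} m from the neighborhood point")
```

The same function could be used to count venues within a smaller radius of a candidate location, which relates to the "immediate proximity" criterion from the introduction.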


In [250]:
# check out how many venues were returned for each neighborhood
Vancouver_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Arbutus-Ridge | 5 | 5 | 5 | 5 | 5 | 5 |
| Dunbar-Southlands | 9 | 9 | 9 | 9 | 9 | 9 |
| Fairview | 26 | 26 | 26 | 26 | 26 | 26 |
| Grandview-Woodland | 70 | 70 | 70 | 70 | 70 | 70 |
| Hastings-Sunrise | 13 | 13 | 13 | 13 | 13 | 13 |
| Kensington-Cedar Cottage | 20 | 20 | 20 | 20 | 20 | 20 |
| Kerrisdale | 38 | 38 | 38 | 38 | 38 | 38 |
| Killarney | 4 | 4 | 4 | 4 | 4 | 4 |
| Kitsilano | 46 | 46 | 46 | 46 | 46 | 46 |
| Marpole | 34 | 34 | 34 | 34 | 34 | 34 |


In [251]:
# find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(Vancouver_venues['Venue Category'].nunique()))

There are 148 unique categories.


In [252]:
Vancouver_venues['Venue Category'].value_counts()

Coffee Shop            41
Sushi Restaurant       26
Japanese Restaurant    21
Café                   17
Park                   16
                       ..
Soccer Field            1
Food                    1
Shopping Mall           1
Pet Store               1
Art Gallery             1
Name: Venue Category, Length: 148, dtype: int64

#### Counting the restaurants among the venues in each neighborhood and adding the count to our dataframe

#### Data cleaning with fillna

In [253]:
# Categories counted as restaurants; note that bars and beer venues are included as well
restaurant_pattern = 'Beer|Restaurant|Bar|Steakhouse|Taverna'
is_restaurant = Vancouver_venues['Venue Category'].str.contains(restaurant_pattern)

# Number of matching venues per neighborhood
Vancouver_restaurant = (Vancouver_venues[is_restaurant]
                        .groupby('Neighborhood')['Venue Category']
                        .count()
                        .to_frame('Number of restaurants'))

# join keeps neighborhoods without any match (as NaN), whereas merge would drop them
data_Vancouver = data_Vancouver.join(Vancouver_restaurant, on='Neighborhood')
# Neighborhoods without a matching venue get a count of 0
data_Vancouver['Number of restaurants'] = data_Vancouver['Number of restaurants'].fillna(0).astype('int64')
data_Vancouver.head()

| | Neighborhood | Population | Population Density | Latitude | Longitude | Number of restaurants |
|---|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 15295 | 4134 | 49.240968 | -123.167001 | 0 |
| 1 | Yaletown | 62030 | 16764 | 49.276322 | -123.120956 | 42 |
| 2 | Dunbar-Southlands | 21425 | 2503 | 49.25346 | -123.185044 | 5 |
| 3 | Fairview | 33620 | 10281 | 49.264113 | -123.126835 | 13 |
| 4 | Grandview-Woodland | 29175 | 6556 | 49.270559 | -123.067942 | 28 |
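
The counting step relies on substring matching and on filling in zero for neighborhoods without any match. A small plain-Python sketch of that logic follows. The Arbutus-Ridge categories come from the venue table above, while the Fairview rows are illustrative assignments, and `is_restaurant` is a hypothetical helper.

```python
from collections import Counter

# Keywords taken from the filter in the cell above.
KEYWORDS = ("Beer", "Restaurant", "Bar", "Steakhouse", "Taverna")


def is_restaurant(category):
    """True when any keyword occurs as a substring of the category name."""
    return any(keyword in category for keyword in KEYWORDS)


# (neighborhood, venue category); the Fairview rows are illustrative only.
venues = [
    ("Arbutus-Ridge", "Bakery"),
    ("Arbutus-Ridge", "Spa"),
    ("Fairview", "Sushi Restaurant"),
    ("Fairview", "Japanese Restaurant"),
    ("Fairview", "Coffee Shop"),
]

counts = Counter(name for name, category in venues if is_restaurant(category))

# Equivalent of a left join followed by fillna(0): missing neighborhoods get 0.
all_neighborhoods = ["Arbutus-Ridge", "Fairview"]
restaurant_counts = {name: counts.get(name, 0) for name in all_neighborhoods}
print(restaurant_counts)
```

Because the match is a plain substring test, the keyword "Bar" would also match an unrelated category name such as "Barbershop", so the resulting counts are best read as an approximation.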


