# Capstone Project - The Battle of the Neighborhoods (Vancouver Fitness Center)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Methodology/Analysis](#methodology)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction

The goal of this project is to find the ideal location to open a gym/fitness center in the **city of Vancouver**, Canada.

This report is aimed at stakeholders looking to open a gym/fitness center in a neighbourhood that has a high population, a high income, a low crime rate, and few surrounding gyms.


## Data

The data that will be used to compare neighbourhoods are as follows:
* Population per neighbourhood
* Median Income per neighbourhood
* Crime rate per neighbourhood
* Number of total gyms per neighbourhood

In [2]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation


!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a json file into a pandas dataframe
# (json_normalize is exposed at the top level of pandas since version 1.0)
from pandas import json_normalize


! pip install folium==0.5.0
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


Can the Foursquare API be used to obtain information on venue categories? Let's try it for Vancouver.

In [5]:
CLIENT_ID = 'CA1GOIVDKEDC04S5O5CKXE3TPG0QFJSJSAPRPGD4DL5DUJM4' # your Foursquare ID
CLIENT_SECRET = 'Y1SBHZM4Q3LKGSGRSIQWRBETXD4FYP14TMKNJQ2MTDKTNNJY' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 950
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CA1GOIVDKEDC04S5O5CKXE3TPG0QFJSJSAPRPGD4DL5DUJM4
CLIENT_SECRET:Y1SBHZM4Q3LKGSGRSIQWRBETXD4FYP14TMKNJQ2MTDKTNNJY
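Printing API credentials in a shared notebook exposes them to every reader. As a hedged alternative, the sketch below reads them from environment variables instead. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for illustration, not part of the original notebook or of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; use whatever names your deployment defines.
    """
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('Credentials loaded:', bool(demo_id and demo_secret))
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would then replace the hardcoded strings.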


#### The location below is used as the center point for the Vancouver map

In [6]:
address = 'Victoria Square, BC'

geolocator = Nominatim(user_agent="vancity_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Vancouver are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Vancouver are 49.23327555, -123.06628001152762.


For the Foursquare API queries, the table below lists the latitude and longitude of each of the city of Vancouver's 23 neighbourhoods. It also includes statistical data from the City of Vancouver and Vancouver city police websites, giving the median income, population, and share of crime for each neighbourhood.

In [7]:
vancity_stats = pd.read_csv('https://raw.githubusercontent.com/kairulsa/Coursera_Capstone/main/Vancity%20Stats.csv')
vancity_stats = vancity_stats.rename(columns={'Portion of Crime Rate of Vancouver(%)' : 'Neighbourhood Crime Rate (%)'})
vancity_stats

| | Neighbourhood | Latitude | Longitude | Population (2016) | Median household income (2016 CAD) | Neighbourhood Crime Rate (%) |
|---|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 49.246305 | -123.159636 | 15295 | 71008 | 1.05 |
| 1 | Downtown | 49.283393 | -123.117456 | 62030 | 64234 | 31.66 |
| 2 | Dunbar-Southlands | 49.237864 | -123.184354 | 21745 | 104450 | 1.02 |
| 3 | Fairview | 49.261956 | -123.130408 | 33620 | 69337 | 5.44 |
| 4 | Grandview-Woodland | 49.275849 | -123.066934 | 29175 | 55141 | 5.54 |
| 5 | Hastings-Sunrise | 49.27783 | -123.040005 | 34575 | 68506 | 3.1 |
| 6 | Kensington-Cedar Cottage | 49.247632 | -123.084207 | 49325 | 70815 | 4.4 |
| 7 | Kerrisdale | 49.220985 | -123.159548 | 13975 | 75419 | 1.15 |
| 8 | Killarney | 49.218012 | -123.037115 | 29325 | 71559 | 1.44 |
| 9 | Kitsilano | 49.26941 | -123.155267 | 43045 | 72839 | 4.73 |


In [8]:
# create map of Vancouver using latitude and longitude values
map_vancity = folium.Map(location=[latitude, longitude], zoom_start=12.4)

# add markers to map
for lat, lng, neighborhood in zip(vancity_stats['Latitude'], vancity_stats['Longitude'], vancity_stats['Neighbourhood']):
    label = '{}'.format(neighborhood)  # popup text: the name of the neighbourhood being plotted
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=10, popup=label, color='black',fill=True, fill_color='pink', fill_opacity=0.7,parse_html=False).add_to(map_vancity)  

map_vancity

Now that we have statistics for the city of Vancouver, the next step is to find out how many gyms the Foursquare API can locate in the city.

In [6]:
search_list = ['Gymnasium', 'Gym and Fitness', 'Treadmill', 'Exercise', 'Workout', 'Fitness Studio', 'Running', 'Weight lifting', 'Goodlife',
               'Planet Fitness', 'Fit4less', 'Crunch Fitness', 'Fitness Center', 'Strength Training', 'Yoga',
              'Physical Activity', 'Boxing', 'Soccer'] # popular fitness terms and gym names used in the search

# function that extracts the category of the venue (defined once, outside the loops)
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

radius = 100000 # search radius in metres
frames = [] # one dataframe per (search term, neighbourhood) request

for search_query in search_list:
    for i in range(len(vancity_stats)):
        latitude = vancity_stats['Latitude'][i]
        longitude = vancity_stats['Longitude'][i]

        # let requests build and URL-encode the query string (some search terms contain spaces)
        params = {'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET, 'll': '{},{}'.format(latitude, longitude),
                  'v': VERSION, 'query': search_query, 'radius': radius, 'limit': LIMIT}
        results = requests.get('https://api.foursquare.com/v2/venues/search', params=params).json()

        # assign relevant part of JSON to venues
        venues = results['response']['venues']
        if not venues:
            continue # nothing found for this term and location

        # transform venues into a dataframe
        dataframe = json_normalize(venues)

        # keep only columns that include venue name, and anything that is associated with location
        filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
        dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so the edits below do not act on a slice

        # filter the category for each row
        dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

        # clean column names by keeping only last term
        dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

        frames.append(dataframe_filtered)

# combine all results into one table (DataFrame.append was removed in pandas 2.0)
gym_list = pd.concat(frames, ignore_index=True)
gym_list = gym_list.drop(['labeledLatLngs', 'distance', 'cc', 'state', 'country', 'postalCode', 'crossStreet'], axis=1)
gym_list



| | name | categories | address | lat | lng | city | formattedAddress | id | neighborhood |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Arbutus Club Gymnasium | Basketball Court | 2001 Nanton Ave | 49.248237 | -123.152619 | Vancouver | [2001 Nanton Ave (at Arbutus St), Vancouver BC... | 4f03b15bc512408c65058c18 | |
| 1 | Arbutus Club Gymnasium | Basketball Court | 2000 Nanton Avenue | 49.248966 | -123.152498 | Vancouver | [2000 Nanton Avenue (Arbutus street), Vancouve... | 58026e9f38fa42d68d2bae77 | |
| 2 | Hillcrest Gymnasium | Athletics & Sports | | 49.244698 | -123.108214 | Vancouver | [Vancouver BC, Canada] | 5143b596e4b0853be3b36dc6 | |
| 3 | West Point Grey Gymnasium | College Gym | West 2nd Avenue | 49.271465 | -123.203964 | Vancouver | [West 2nd Avenue (NW Marine Drive), Vancouver ... | 4d08456043b36ea834432cef | |
| 4 | War Memorial Gymnasium | Gym | 6081 University Boulevard | 49.266741 | -123.247656 | Vancouver | [6081 University Boulevard (at Wesbrook Mall),... | 5d1b7a463f9ff700234aca3f | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13894 | Soccer city/advantage sports | Sporting Goods Shop | 210 Nooksack Ave | 48.947503 | -122.443235 | Lynden | [210 Nooksack Ave (Grover), Lynden, WA 98264, ... | 4c8fb076d2aea093fcb9d869 | |
| 13895 | Soccer Fields @ Pioneer Park | Soccer Field | | 48.839108 | -122.596400 | Ferndale | [Ferndale, WA, United States] | 51788148e4b0b4b473fb3e9d | |
| 13896 | Soccer West | Sporting Goods Shop | 32630 George Ferguson Way | 49.054918 | -122.319774 | Abbotsford | [32630 George Ferguson Way, Abbotsford BC, Can... | 4f6cac2ee4b0ad1af3c2df89 | |
| 13897 | Sherman Park Soccer Field | Soccer Field | | 48.792954 | -123.734190 | Duncan | [Duncan BC, Canada] | 4e7e2a6477c8a872a2916e07 | |


The Foursquare API returned a large number of fitness venues using the latitude and longitude of the 23 neighbourhoods. However, many gyms are returned again by each search, and some are not in the city of Vancouver but in nearby cities. We therefore remove gyms that are listed more than once, and those outside Vancouver, to find the number of unique gyms.

In [7]:
gym_list.drop_duplicates(subset ="id", keep = 'first', inplace = True)  # using id instead of name in case different gym locations share the same name
gym_list = gym_list[gym_list.city == 'Vancouver']
gym_list = gym_list.reset_index()
gym_list

| | index | name | categories | address | lat | lng | city | formattedAddress | id | neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Arbutus Club Gymnasium | Basketball Court | 2001 Nanton Ave | 49.248237 | -123.152619 | Vancouver | [2001 Nanton Ave (at Arbutus St), Vancouver BC... | 4f03b15bc512408c65058c18 | |
| 1 | 1 | Arbutus Club Gymnasium | Basketball Court | 2000 Nanton Avenue | 49.248966 | -123.152498 | Vancouver | [2000 Nanton Avenue (Arbutus street), Vancouve... | 58026e9f38fa42d68d2bae77 | |
| 2 | 2 | Hillcrest Gymnasium | Athletics & Sports | | 49.244698 | -123.108214 | Vancouver | [Vancouver BC, Canada] | 5143b596e4b0853be3b36dc6 | |
| 3 | 3 | West Point Grey Gymnasium | College Gym | West 2nd Avenue | 49.271465 | -123.203964 | Vancouver | [West 2nd Avenue (NW Marine Drive), Vancouver ... | 4d08456043b36ea834432cef | |
| 4 | 4 | War Memorial Gymnasium | Gym | 6081 University Boulevard | 49.266741 | -123.247656 | Vancouver | [6081 University Boulevard (at Wesbrook Mall),... | 5d1b7a463f9ff700234aca3f | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 469 | 11280 | Sport + Spinal Physical Therapy | Medical Center | 206 - 990 Homer St | 49.277285 | -123.119749 | Vancouver | [206 - 990 Homer St (at Nelson St), Vancouver ... | 4b50b1e3f964a520e02d27e3 | |
| 470 | 12429 | Eastside Boxing Club | Athletics & Sports | 238 Keefer St | 49.280402 | -123.070900 | Vancouver | [238 Keefer St (at Main St), Vancouver BC V6A ... | 5192ebe5498ee4a38597200e | |
| 471 | 12751 | Soccer West | Sporting Goods Shop | | 49.263226 | -123.106660 | Vancouver | [Vancouver BC, Canada] | 4e652acad164ddd5e6ee46b6 | |
| 472 | 12753 | Musqueam Soccer fields | Field | | 49.230404 | -123.205582 | Vancouver | [Vancouver BC, Canada] | 51eae75c498e5672e4a1b059 | |
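To illustrate why the deduplication keys on `id` rather than `name`, here is a small sketch on made-up rows (the venue names and ids are invented for the example). Two venues can share a name but have different ids, as the two "Arbutus Club Gymnasium" rows above do, so dropping by name would merge them while dropping by id keeps both.

```python
import pandas as pd

# Made-up search results: the first two rows are the same venue returned twice,
# the third shares the name but has its own id, the fourth is outside Vancouver.
raw = pd.DataFrame({
    'name': ['Gym A', 'Gym A', 'Gym A', 'Gym B'],
    'id': ['id1', 'id1', 'id2', 'id3'],
    'city': ['Vancouver', 'Vancouver', 'Vancouver', 'Burnaby'],
})

# Keep the first occurrence of each id, then keep only venues in Vancouver
unique = raw.drop_duplicates(subset='id', keep='first')
unique = unique[unique['city'] == 'Vancouver'].reset_index(drop=True)

print(unique)
```

The result keeps two rows, both named "Gym A" but with distinct ids, and drops the repeated row and the venue outside Vancouver.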


Now that we have a list of gyms, each gym is assigned to one of the 23 neighbourhoods in the city of Vancouver using the Euclidean distance, because most rows in the table above have no neighbourhood classification. The distance is computed between the latitude and longitude of the gym and those of each neighbourhood, and the gym is assigned to the neighbourhood with the smallest distance. This results in the following table.

In [8]:
nearest = [] # closest neighbourhood for each gym, in row order

for w in range(len(gym_list)):
    # Euclidean distance (in degrees) from gym w to every neighbourhood centre
    neigh_dis = [((vancity_stats['Latitude'][p] - gym_list['lat'][w])**2 + (vancity_stats['Longitude'][p] - gym_list['lng'][w])**2)**0.5
                 for p in range(len(vancity_stats))]

    low_index = neigh_dis.index(min(neigh_dis))
    nearest.append(vancity_stats['Neighbourhood'][low_index])

# assign the whole column at once instead of cell by cell, which avoids chained assignment
gym_list['neighborhood'] = nearest

true_list = gym_list.drop(['index','categories', 'address', 'lat', 'lng', 'city', 'formattedAddress', 'id'], axis=1)
true_list


| | name | neighborhood |
|---|---|---|
| 0 | Arbutus Club Gymnasium | Arbutus-Ridge |
| 1 | Arbutus Club Gymnasium | Arbutus-Ridge |
| 2 | Hillcrest Gymnasium | Riley Park |
| 3 | West Point Grey Gymnasium | West Point Grey |
| 4 | War Memorial Gymnasium | University Lands |
| ... | ... | ... |
| 469 | Sport + Spinal Physical Therapy | Downtown |
| 470 | Eastside Boxing Club | Grandview-Woodland |
| 471 | Soccer West | Mount Pleasant |
| 472 | Musqueam Soccer fields | Dunbar-Southlands |
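The nearest-neighbourhood assignment can also be written without explicit loops. The sketch below is one possible vectorised version using NumPy broadcasting; the helper name `nearest_centre` is an assumption for illustration, and the coordinates are a three-neighbourhood, two-gym subset copied from the tables above. Like the notebook, it measures distance directly in degrees, which is only an approximation of ground distance.

```python
import numpy as np


def nearest_centre(points, centres):
    """Return, for each point, the index of the closest centre (Euclidean distance)."""
    points = np.asarray(points, dtype=float)    # shape (n_points, 2)
    centres = np.asarray(centres, dtype=float)  # shape (n_centres, 2)
    # Broadcasting builds an (n_points, n_centres, 2) array of coordinate differences
    diff = points[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.argmin(axis=1)


names = ['Arbutus-Ridge', 'Downtown', 'Kerrisdale']
centres = [(49.246305, -123.159636), (49.283393, -123.117456), (49.220985, -123.159548)]

# Arbutus Club Gymnasium and Sport + Spinal Physical Therapy
gyms = [(49.248237, -123.152619), (49.277285, -123.119749)]

labels = [names[i] for i in nearest_centre(gyms, centres)]
print(labels)
```

For these two gyms the sketch gives Arbutus-Ridge and Downtown, which agrees with the classification in the table above.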


Now we can get a full picture of the number of gyms/fitness centers per neighbourhood.

In [9]:
# number of gyms assigned to each neighbourhood
gym_counts = true_list['neighborhood'].value_counts()

# look up each neighbourhood's count; neighbourhoods with no gyms get 0
vancity_stats['Number of Gyms found in the Neighbourhood'] = vancity_stats['Neighbourhood'].map(gym_counts).fillna(0).astype(int)

vancity_stats


| | Neighbourhood | Latitude | Longitude | Population (2016) | Median household income (2016 CAD) | Neighbourhood Crime Rate (%) | Number of Gyms found in the Neighbourhood |
|---|---|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 49.246305 | -123.159636 | 15295 | 71008 | 1.05 | 5 |
| 1 | Downtown | 49.283393 | -123.117456 | 62030 | 64234 | 31.66 | 94 |
| 2 | Dunbar-Southlands | 49.237864 | -123.184354 | 21745 | 104450 | 1.02 | 11 |
| 3 | Fairview | 49.261956 | -123.130408 | 33620 | 69337 | 5.44 | 54 |
| 4 | Grandview-Woodland | 49.275849 | -123.066934 | 29175 | 55141 | 5.54 | 30 |
| 5 | Hastings-Sunrise | 49.27783 | -123.040005 | 34575 | 68506 | 3.1 | 13 |
| 6 | Kensington-Cedar Cottage | 49.247632 | -123.084207 | 49325 | 70815 | 4.4 | 14 |
| 7 | Kerrisdale | 49.220985 | -123.159548 | 13975 | 75419 | 1.15 | 0 |
| 8 | Killarney | 49.218012 | -123.037115 | 29325 | 71559 | 1.44 | 5 |
| 9 | Kitsilano | 49.26941 | -123.155267 | 43045 | 72839 | 4.73 | 49 |


## Methodology/Analysis

Now that the desired data has been collected, a ranking system is built in order to recommend the best neighbourhood in which to open a gym in the city of Vancouver. Ideally, the stakeholder would open a gym in a neighbourhood with a high income, a high population, a low crime rate, and few surrounding gyms.

**The Ranking system will therefore work as follows:** 
* Population - highest # = 1 (best rank), lowest # = 23 (worst rank)
* Income - highest # = 1 (best rank), lowest # = 23 (worst rank)
* Crime - lowest # = 1 (best rank), highest # = 23 (worst rank)
* Gym - lowest # = 1 (best rank), highest # = 23 (worst rank)
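The ranking rules above can be sketched on made-up numbers with pandas' built-in `Series.rank`. This is an alternative to the sort-and-number approach used in this notebook, not the notebook's own method: the three neighbourhoods and their values are invented, and `method='min'` gives tied values the same rank, whereas sorting and numbering breaks ties arbitrarily.

```python
import pandas as pd

# Invented values for three hypothetical neighbourhoods
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Population': [30000, 10000, 20000],
    'Income': [60000, 90000, 75000],
    'Crime': [5.0, 1.0, 3.0],
    'Gyms': [12, 2, 7],
})

# True: the highest value gets rank 1; False: the lowest value gets rank 1
higher_is_better = {'Population': True, 'Income': True, 'Crime': False, 'Gyms': False}

ranks = pd.DataFrame({'Neighbourhood': toy['Neighbourhood']})
for column, high_best in higher_is_better.items():
    ranks[column + ' Rank'] = toy[column].rank(ascending=not high_best, method='min').astype(int)

print(ranks)
```

Here neighbourhood A ranks first on population but last on income, crime, and gym count, while B is the reverse.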

In [10]:
rank = vancity_stats.drop(['Latitude', 'Longitude'], axis=1)
rank = rank.rename(columns={'Population (2016)' : 'Population Rank', 'Median household income (2016 CAD) ' : 'Income Rank', 
                     'Neighbourhood Crime Rate (%)' : 'Crime Rank', 'Number of Gyms found in the Neighbourhood' : 'Gym Rank' })

cool = ['Population Rank', 'Income Rank', 'Crime Rank', 'Gym Rank']
# population and income: highest value ranks first; crime and gyms: lowest value ranks first
ascending_order = [False, False, True, True]

for column, ascending in zip(cool, ascending_order):
    # sort by the raw values, then overwrite each value with its position (1 = best)
    rank = rank.sort_values(column, ascending=ascending).reset_index(drop=True)
    for z in range(len(rank)):
        rank.loc[z, column] = z + 1 # .loc avoids the chained-assignment warning

final_rank = rank.sort_values('Neighbourhood', ascending=True).reset_index(drop=True)
final_rank


| | Neighbourhood | Population Rank | Income Rank | Crime Rank | Gym Rank |
|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 16 | 10 | 5.0 | 4 |
| 1 | Downtown | 1 | 17 | 22.0 | 23 |
| 2 | Dunbar-Southlands | 15 | 2 | 3.0 | 10 |
| 3 | Fairview | 8 | 12 | 17.0 | 22 |
| 4 | Grandview-Woodland | 12 | 20 | 18.0 | 17 |
| 5 | Hastings-Sunrise | 7 | 14 | 13.0 | 12 |
| 6 | Kensington-Cedar Cottage | 3 | 11 | 14.0 | 14 |
| 7 | Kerrisdale | 17 | 7 | 6.0 | 1 |
| 8 | Killarney | 11 | 9 | 7.0 | 5 |
| 9 | Kitsilano | 5 | 8 | 16.0 | 21 |


### Results

In [11]:
# add up the four ranks for each neighbourhood (lower is better)
final_rank['Sum of Rank'] = final_rank[['Population Rank', 'Income Rank', 'Crime Rank', 'Gym Rank']].sum(axis=1).astype(int)

final_rank = final_rank.sort_values('Sum of Rank', ascending=True).reset_index(drop=True)

final_rank


| | Neighbourhood | Population Rank | Income Rank | Crime Rank | Gym Rank | Sum of Rank |
|---|---|---|---|---|---|---|
| 0 | Shaughnessy | 21 | 1 | 1.0 | 6 | 29 |
| 1 | Dunbar-Southlands | 15 | 2 | 3.0 | 10 | 30 |
| 2 | Kerrisdale | 17 | 7 | 6.0 | 1 | 31 |
| 3 | Killarney | 11 | 9 | 7.0 | 5 | 32 |
| 4 | South Cambie | 22 | 6 | 2.0 | 3 | 33 |
| 5 | Arbutus-Ridge | 16 | 10 | 5.0 | 4 | 35 |
| 6 | Victoria-Fraserview | 10 | 15 | 9.0 | 2 | 36 |
| 7 | Riley Park | 14 | 5 | 11.0 | 8 | 38 |
| 8 | West Point Grey | 18 | 4 | 4.0 | 15 | 41 |
| 9 | Renfrew-Collingwood | 2 | 18 | 15.0 | 7 | 42 |


## Discussion

The **ranking table** shows which neighbourhoods are the most and the least suitable. More detail on what the stakeholder wants from their gym/fitness center would affect which neighbourhood is chosen. However, if each column of the final rank table is equally important to the stakeholder (which is the assumption here), the table gives them a sense of where to place their new facility.

In [13]:
print('Therefore the most ideal neighbourhood to open up a gym/fitness center in the city of Vancouver is in', final_rank['Neighbourhood'][0])

Therefore the most ideal neighbourhood to open up a gym/fitness center in the city of Vancouver is in Shaughnessy
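The equal-importance assumption can be relaxed by weighting the ranks before summing them. The sketch below uses the top three rows of the final rank table; the weights are purely hypothetical (population counted twice as heavily as the other criteria) and do not reflect any stated stakeholder preference.

```python
import pandas as pd

# Top three rows of the final rank table above
ranks = pd.DataFrame({
    'Neighbourhood': ['Shaughnessy', 'Dunbar-Southlands', 'Kerrisdale'],
    'Population Rank': [21, 15, 17],
    'Income Rank': [1, 2, 7],
    'Crime Rank': [1, 3, 6],
    'Gym Rank': [6, 10, 1],
})

# Hypothetical weights: a larger weight makes that criterion matter more
weights = pd.Series({'Population Rank': 2.0, 'Income Rank': 1.0, 'Crime Rank': 1.0, 'Gym Rank': 1.0})

# Multiply each rank column by its weight and add the results (lower is better)
ranks['Weighted Score'] = ranks[weights.index].mul(weights).sum(axis=1)

best = ranks.sort_values('Weighted Score').iloc[0]['Neighbourhood']
print(ranks[['Neighbourhood', 'Weighted Score']])
print('Best under these weights:', best)
```

With these weights the scores are 50, 45, and 48, so Dunbar-Southlands moves ahead of Shaughnessy among these three rows, which shows how sensitive the recommendation is to the stakeholder's priorities.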


## Conclusion

Overall, the ranking table gives the stakeholder insight into which neighbourhood they should build their fitness center/gym in. If they value one of the columns over another, they can choose the neighbourhood that best fits their needs.