<h1 style="color:#990033"> Week 5 Exercise : The Battle of the Neighborhoods (Week 2) </h1>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

<p>
Our client, Superduper Express, is a transportation company that runs electric scooters, electric bikes, conventional pedal bikes and car-sharing systems in various cities around the world. They would like to launch in Toronto, Canada, and want to know which neighborhood area of the city is the best place to deploy their equipment first.

Recent market research revealed that there are 208 neighborhoods scattered across 10 areas that Superduper Express could explore, with the potential to become one of the major players in one of those areas.

As data scientists, we would like to conduct an analysis to find out which cluster of Toronto's neighborhoods is the best area for our client's new business territory. One of our client's criteria is that the business should start in the most convenient and popular area in terms of the variety of services and the accessibility of venues.</p>

## Data <a name="data"></a>

Based on the client's request and the given criteria, our analysis will be based on the following factors:
* The number of venues in each area, to find out which area is the most convenient.
* The density of venues, since potential customers are likely to hop from one venue to another if the venues in a neighborhood are close together.
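
One way to make "close and dense" concrete is to measure the great-circle distance between coordinates and count the venues within walking distance of a point. The sketch below is only an illustration: it assumes the haversine formula on a spherical Earth, the helper names `haversine_m` and `venues_within` are hypothetical, and the venue coordinates are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def venues_within(center, venues, radius_m=500):
    """Count the venues lying within radius_m metres of the centre point."""
    return sum(haversine_m(*center, *venue) <= radius_m for venue in venues)


# hypothetical sample: a centre point and three made-up venue coordinates
center = (43.6532, -79.3832)
venues = [(43.6540, -79.3840), (43.6600, -79.3900), (43.7000, -79.4000)]
print(venues_within(center, venues))
```

A higher count within the same radius suggests a denser area, which is the property the second factor is after.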

We use Toronto's postal code areas, each represented by a single latitude/longitude point, to define our neighborhood areas.

The following data sources are needed to extract and generate the required information:
* The number and type of venues, obtained by calling the **Foursquare API**
* The postal codes, boroughs and neighborhoods of Toronto, obtained by scraping the Wikipedia page at **https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M**
* The coordinates of the Toronto city center, obtained using Google Search
* Geospatial data with the latitude and longitude of all Toronto postal codes, provided by Cognitive Class at http://cocl.us/Geospatial_data

### Step 1: Preparing neighborhood candidates

First, we submit a request and scrape the content of the Wikipedia page that contains the list of Toronto's neighborhoods, boroughs and postal codes. The scraping is done with Beautiful Soup, an HTML parser. 
The result is stored in a dataframe, which shows that there are 208 neighborhoods spread over 103 postal codes in Toronto.


In [1]:
import numpy as np  # useful for many scientific computing in Python
import pandas as pd # primary data structure library
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

In [2]:
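res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res.raise_for_status()  # fail early if the page could not be fetched
soup = BeautifulSoup(res.content, 'html.parser')

# the first table on the page holds the postal code list
table = soup.find('table')
table_rows = table.find_all('tr')

rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    if not cells:  # the header row has no <td> cells, so skip it
        continue
    rows.append([cell.text.replace('\n', '') for cell in cells])

df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighbourhood"])
# drop the postal codes that are not assigned to any borough
df = df[df.Borough != 'Not assigned']

df.groupby('Neighbourhood').nunique()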

| Neighbourhood (index) | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| Adelaide | 1 | 1 | 1 |
| Agincourt | 1 | 1 | 1 |
| Agincourt North | 1 | 1 | 1 |
| Albion Gardens | 1 | 1 | 1 |
| Alderwood | 1 | 1 | 1 |
| Bathurst Manor | 1 | 1 | 1 |
| Bathurst Quay | 1 | 1 | 1 |
| Bayview Village | 1 | 1 | 1 |
| Beaumond Heights | 1 | 1 | 1 |
| Bedford Park | 1 | 1 | 1 |


There are 208 neighborhoods.

#### Grouping neighborhoods by postal code, we find 103 postal codes, some of which are shared by several neighborhoods.

In [3]:
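# join the neighbourhoods that share a postal code into one comma-separated string
df = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()
# a postal code with a borough but no neighbourhood takes the borough name
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']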
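
To illustrate what the grouping step does, here is a minimal sketch on a three-row toy dataframe. The rows are taken from the first postal codes shown in this notebook; the variable names `toy` and `grouped` are only for this example.

```python
import pandas as pd

# one row per neighbourhood, as scraped; M1B appears twice
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# one row per postal code, with its neighbourhoods joined by ", "
grouped = (
    toy.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The two M1B rows collapse into a single row whose neighbourhood field lists both names, which is why 208 neighborhoods end up in fewer postal code rows.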


#### With the data grouped by postal code, we can match each neighborhood area to its geospatial data point and enrich our dataframe with latitude and longitude.
#### This is the neighborhood candidates dataframe

In [4]:
geo_data = pd.read_csv('http://cocl.us/Geospatial_data')

# attach latitude and longitude to each postal code
result = pd.merge(df, geo_data, on='Postal Code')
result

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
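
An inner merge silently drops any postal code that has no match in the geospatial file. As a hedged sketch, a left merge with `indicator=True` can list such codes before they disappear. The postal code `M9Z` below is a made-up value used only to trigger the mismatch, and the frame names are hypothetical.

```python
import pandas as pd

# "M9Z" is a hypothetical code with no coordinates
codes = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M9Z"]})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# keep every postal code and record where each row came from in the "_merge" column
checked = pd.merge(codes, coords, on="Postal Code", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty list would confirm that every postal code received a latitude and longitude.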


In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium


In [None]:
# marker coordinates as [latitude, longitude] pairs
locationlist = result[['Latitude', 'Longitude']].values.tolist()

# centre the map on downtown Toronto; the name avoids shadowing the built-in map()
toronto_map = folium.Map(location=[43.6532, -79.3832], zoom_start=12)
for location in locationlist:
    folium.Marker(location).add_to(toronto_map)
toronto_map


<h4>
    Our observation is that the neighborhood candidates are dense near the central area of Toronto and cover the urban area, which lets us move on to the next step of the analysis.
    
 </h4>
    

### Step 2: Getting venues for each area with the Foursquare API

For each neighborhood area in the step 1 dataframe, with its latitude and longitude, we continue our analysis by fetching the venues through the Foursquare API.
We now prepare the parameters for the Foursquare API call.

In [None]:
# making requests to fetch data from Foursquare
import requests
CLIENT_ID = 'L5NI5QOJYRFTPG052OFVEBW4MPPAEURWY0RRELOXHF51DKUC' # your Foursquare ID
CLIENT_SECRET = 'OGYGZT3LUHTFAJYISMKN4PADF25R2E4NAMJCQNACBAHDZKA1' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

LIMIT = 100
radius = 500
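
Credentials written directly in a cell are visible to anyone who can read the notebook. A common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined by Foursquare; the helper name is also only illustrative.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    try:
        # both variable names are assumptions made for this sketch
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable {}".format(missing)) from None


# usage with an explicit mapping, so the sketch runs without real credentials
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
print(client_id)
```

Called without arguments, the helper reads the real environment, so the notebook itself never contains the secret.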


Define a function that returns a dataframe of the venues around the given neighborhood areas, based on their latitude and longitude.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe with one row per venue found around each neighborhood area."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # the venues are nested under response -> groups[0] -> items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep the neighborhood name and coordinates next to every venue
        venues_list.append([(
            name,
            lat,
            lng,
            i['venue']['name'],
            i['venue']['location']['lat'],
            i['venue']['location']['lng'],
            i['venue']['categories'][0]['name']) for i in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
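
The function above indexes straight into the response, so an error reply or a venue without categories raises an exception. The sketch below shows one possible defensive parser. It assumes only the payload fields the function already reads (`response`, `groups`, `items`, `venue`, `location`, `categories`); the helper name `extract_venues`, the sample payload and the fallback label `"Uncategorized"` are hypothetical.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to a placeholder label when a venue has no category
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


# made-up payload: one venue with a category, one without
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({"meta": {"code": 400}}))  # a reply without venues gives an empty list
```

Returning an empty list for a failed call lets the loop over neighborhoods continue instead of stopping at the first problematic area.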

In [None]:
venues = getNearbyVenues(names=result['Neighbourhood'],
                                 latitudes=result['Latitude'],
                                 longitudes=result['Longitude']
                                  )

Examining the venues dataframe

In [None]:
print(venues.shape)  # printed explicitly, since only the last expression of a cell is displayed
venues.head()

Count the number of Venues in each neighborhood area

In [None]:
venues.groupby('Neighborhood').count()

In [None]:
print('There are {} unique categories.'.format(venues['Venue Category'].nunique()))

In [None]:
# Encoding the categorical values for the clustering algorithm
# one hot encoding
toronto_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# a venue category that is itself named 'Neighborhood' would clash with the column added below, so drop it if present
toronto_onehot = toronto_onehot.drop(columns='Neighborhood', errors='ignore')

# add the neighborhood column as the first column
toronto_onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

In [None]:
toronto_onehot.shape

In [None]:
# the mean of each one-hot column is the share of that venue category in the neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
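
To see why the mean of one-hot columns gives a category frequency, here is a small worked sketch. The neighborhoods `A` and `B` and their venue categories are made up for the example.

```python
import pandas as pd

# made-up venues: four in neighborhood A, two in neighborhood B
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Park", "Park"],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# averaging the indicators per neighborhood gives each category's share
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Neighborhood A has two cafes out of four venues, so its `Cafe` frequency is 0.5, and every row sums to 1. These row vectors are what the clustering step compares.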


<h4>Now let's go through each neighborhood area and look at its top 5 venue categories, sorted by the mean frequency computed from the one-hot encoding</h4>

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp =toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']        
    temp = temp.iloc[1:]   # slice off the first row, which is the neighborhood name
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Define a function that sorts a row's venue categories from most to least frequent and returns the top ones.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

import numpy as np

In [None]:
#Showing the top 10 venues for each neighborhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
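
The try/except approach above yields correct column names for the top 10, because only the first three ranks need a special suffix. As an alternative sketch, a small helper (the name `ordinal` is hypothetical) also handles larger ranks such as 11th or 21st:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th and 13th are exceptions to the last-digit rule
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# column names for the top 10 venues
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```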

## Methodology <a name="methodology"></a>

We have now prepared the data for our neighborhood candidates and their corresponding venues. The next step is to apply k-means clustering to partition the neighborhood areas into 10 zones.
For these zones, we compare the number of venues and propose the best area for the client to start the business in. 
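
The number of zones is fixed at 10 in this analysis. As a hedged sketch of how the choice could be checked, the example below compares silhouette scores for several values of k. It runs on synthetic data standing in for the venue-frequency matrix, so the resulting k says nothing about Toronto; the blob centres and sizes are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# three well-separated synthetic blobs with four features each
X = np.vstack([
    rng.normal(loc=centre, scale=0.05, size=(20, 4))
    for centre in (0.0, 0.5, 1.0)
])

# silhouette score for each candidate number of clusters (higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Running the same loop on the real frequency matrix would show whether 10 clusters is a reasonable choice or whether many of the zones are artifacts of a k that is too large.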

<h3> Running k-means clustering on neighborhoods</h3>

In [None]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 10

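# keep only the numeric frequency columns; the positional axis argument is not accepted by recent pandas
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')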

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add the cluster label of each neighborhood area
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = result

# join the cluster labels and top venues onto result, which already holds latitude/longitude;
# result spells its column 'Neighbourhood', while the venues table uses 'Neighborhood'
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis on our clusters. For every candidate zone, we count the neighborhoods and look at the variety of their venues:
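
The zone sizes can be obtained in one step by counting rows per cluster label. The sketch below uses a made-up five-row frame in place of `toronto_merged`. An area for which no venues were returned has no label after the join, which is represented here by NaN.

```python
import pandas as pd

# made-up stand-in for toronto_merged; area E has no cluster label
toy_merged = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0.0, 0.0, 0.0, 2.0, float("nan")],
})

# number of neighborhood areas per zone; NaN labels are excluded by default
sizes = toy_merged["Cluster Labels"].value_counts().sort_index()
# areas that could not be assigned to any zone
unlabelled = toy_merged["Cluster Labels"].isna().sum()
print(sizes)
print(unlabelled)
```

Applied to the real dataframe, the same count gives the per-zone totals quoted in the zone descriptions.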

### Zone 0
<p>This cluster has 83 neighborhood areas with a wide variety of venues. It is a convenient area with many accessible venues. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 1
<p>This cluster has only one neighborhood. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 2
<p>This cluster has only a few neighborhood postal code areas with a small selection of venues. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 3
<p>This cluster has only a few neighborhood postal code areas with a small selection of venues. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 4
<p>This cluster has only a few neighborhood postal code areas with a small selection of venues. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 5
<p>This cluster has only one neighborhood postal code area. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 6
<p>This cluster has only one neighborhood postal code area. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 7
<p>This cluster has only a few neighborhoods. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 8
<p>This cluster has only a few neighborhoods. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 8,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 9
<p>This cluster has only one neighborhood. Not a good area to start a business. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 9,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Zone 10
<p>This cluster has no neighborhoods, because k-means with 10 clusters only assigns the labels 0 to 9. It is not a candidate area. </p>

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 10,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

## Results and Discussion <a name="results"></a>

The result of the clustering is 10 zones. Our analysis shows that Zone 0 has by far the largest number of neighborhood areas (83), with a wide variety of venues, so Zone 0 is the best choice. Our stakeholders should consider deploying a good percentage of their devices in this area to offer an easy commute among venues.
The second best area is Zone 8. The recommended zones should be considered only as a starting point for a more detailed analysis, which could eventually identify a location where not only nearby competition but also other factors have been taken into account and all other relevant conditions are met.


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify Toronto areas with a high number of venues in order to help stakeholders narrow down the search for the optimal location for a new business territory. Using Foursquare data, we first profiled each neighborhood area by the venue categories found around it. These areas were then clustered to create major zones of interest, and the zone containing the greatest number of neighborhood areas was proposed as the starting point for final exploration by stakeholders.

The final decision on the optimal business location will be made by stakeholders based on the specific characteristics of the neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water) and the resulting opportunities for installing devices like electric scooters and bikes, real estate availability, and the social and economic dynamics of every neighborhood.