## __Toronto Data Science Capstone Project__
 

The data source is the Wikipedia page listing Toronto postal codes (those beginning with M). I will scrape its table to collect the neighborhoods and postal codes, then use the Foursquare API to identify venues close to each neighborhood. A detailed step-by-step overview of the process follows.

In [1]:
#Import the libraries needed to request and scrape the page

import numpy as np
import lxml.html as lh
import requests
import pandas as pd 

url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')


#Create empty list
col=[]
i=0
#For each cell of the header row, store its text and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postcode"
2:"Borough"
3:"Neighbourhood
"


Next, I load the table rows into a dataframe by looping over the rows and appending each cell to its column's list.

In [2]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Only attempt the conversion for columns after the first
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
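The row-by-row logic above can be tried offline on a small table. This is only a minimal sketch: it swaps in the standard library's `xml.etree.ElementTree` for `lxml`, and the `SAMPLE` markup and the `table_to_columns` helper are hypothetical stand-ins for the scraped page, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the scraped table.
SAMPLE = """<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M4E</td><td>East Toronto</td><td>The Beaches</td></tr>
</table>"""


def table_to_columns(markup):
    """Return {header: [cell, ...]} built from the <tr> rows of a table."""
    rows = ET.fromstring(markup).findall(".//tr")
    headers = ["".join(cell.itertext()).strip() for cell in rows[0]]
    columns = {header: [] for header in headers}
    for row in rows[1:]:
        cells = list(row)
        # Skip rows whose width does not match the header row.
        if len(cells) != len(headers):
            continue
        for header, cell in zip(headers, cells):
            columns[header].append("".join(cell.itertext()).strip())
    return columns


columns = table_to_columns(SAMPLE)
print(columns)
```

Stripping whitespace while reading the cells avoids the trailing newline that shows up in the `Neighbourhood` header above.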

Format and clean the data, then group neighborhoods that share a postal code.

In [3]:
df.rename(columns ={'Neighbourhood\n':'Neighborhood','Postcode':'PostalCode'
                          }, 
                 inplace=True)
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df['Neighborhood']=df['Neighborhood'].str.replace("\n", "")
df_1 = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
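To see what the grouping step does, here is a small sketch on three hand-written rows. The rows are illustrative samples chosen for this example, not the scraped data.

```python
import pandas as pd

# Illustrative rows: two neighborhoods share postal code M5A.
rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4E"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "East Toronto"],
    "Neighborhood": ["Harbourfront", "Regent Park", "The Beaches"],
})

# Join the neighborhood names of each (PostalCode, Borough) pair into one string.
grouped = (
    rows.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Each postal code ends up on a single line, with its neighborhoods separated by commas.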

This dataframe now contains the postal code and borough, with all neighborhoods that share a postal code on one line. To use the Foursquare location API, I need latitude and longitude coordinates for each postal code. I load them from a CSV file and merge the two dataframes on the postal code.

In [4]:
coord = pd.read_csv('Geospatial_Coordinates.csv')
coord.rename(columns ={'Postal Code':'PostalCode'
                          }, 
                 inplace=True)
df = pd.merge(df_1, coord, on='PostalCode')
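An inner merge silently drops postal codes that have no coordinates. As a hedged sketch, a left merge with `indicator=True` makes such rows visible before they are discarded. The frames below are small made-up samples, and `M9Z` is a deliberately fictitious code.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M4E", "M4K", "M9Z"],  # M9Z is made up for illustration
    "Borough": ["East Toronto", "East Toronto", "Example Borough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M4E", "M4K"],
    "Latitude": [43.676357, 43.679557],
    "Longitude": [-79.293031, -79.352188],
})

# Keep every neighborhood row and record where each row came from.
checked = pd.merge(neighborhoods, coords, on="PostalCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")

print("Postal codes without coordinates:", missing)
```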

I only want to look at neighborhoods in Toronto, so I will only use boroughs with "Toronto" in the name.

In [5]:
df['Borough'].unique()
toronto_boroughs = ['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']
central_df = df[df['Borough'].isin(toronto_boroughs)].reset_index(drop=True)
print(central_df.shape)
central_df.head()

(38, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
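Instead of listing the borough names by hand, the same selection can be expressed as a substring match. This is a sketch of an alternative, not the notebook's method, and the sample rows are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M4E", "M1B", "M5A", "M3A"],
    "Borough": ["East Toronto", "Scarborough", "Downtown Toronto", "North York"],
})

# Literal substring match; na=False treats missing boroughs as non-matches.
mask = sample["Borough"].str.contains("Toronto", regex=False, na=False)
toronto_only = sample[mask].reset_index(drop=True)
print(toronto_only)
```

The explicit list used above is easier to audit, while the substring match picks up any borough containing "Toronto" without further edits.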


I will now leverage the Foursquare API to match venues within these latitudes and longitudes to my dataset. 

In [6]:
#import necessary libraries 
import json 

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
import folium # map rendering library
import geopy

Declare the variables for the API:

In [7]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


CLIENT_ID = 'EJF2T3C2DPZ4SXFOBWV1RMCYIEEDT13KS3R213TGTZ1ELUCS' # your Foursquare ID
CLIENT_SECRET = 'OIGCO5H2A0W44AUJJDAGP2CM34LWGQZ2OPKQ3CPHLJTCDFI2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
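Credentials written into a notebook are visible to anyone who can read it. One common alternative, sketched here under assumptions, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical choices for this example, not names required by Foursquare.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET before running."
        )
    return client_id, client_secret


# Demonstration with an explicit mapping instead of the real environment.
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would then replace the two literal assignments.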

Get the nearby venues for each neighborhood, collecting them in a list and requesting at most 100 venues per neighborhood.

In [8]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
def getNearbyVenues(names, latitudes, longitudes, radius=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
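The function above indexes straight into the response, so an empty result or a venue without categories would raise an error. Below is a minimal sketch of a more defensive parser. The payload shape is inferred only from the keys the function already reads, and `extract_venues` and the sample venues are hypothetical.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one explore-style response dict into a list of venue tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None when a venue carries no category at all.
        category = categories[0].get("name") if categories else None
        rows.append((
            name, lat, lng,
            venue.get("name"), location.get("lat"), location.get("lng"),
            category,
        ))
    return rows


# Made-up payload that mimics the keys read by getNearbyVenues.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Sample Cafe",
               "location": {"lat": 43.67, "lng": -79.29},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabeled Spot",
               "location": {"lat": 43.68, "lng": -79.30},
               "categories": []}},
]}]}}

rows = extract_venues(sample_payload, "The Beaches", 43.676357, -79.293031)
empty = extract_venues({"response": {}}, "Nowhere", 0.0, 0.0)
```

Separating the parsing from the HTTP request also makes it possible to test this step without calling the API.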

Now I am able to list all the nearby venues in each neighborhood. 

In [9]:
toronto_venues = getNearbyVenues(names=central_df['Neighborhood'],
                                   latitudes=central_df['Latitude'],
                                   longitudes=central_df['Longitude']
                                  )


The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

Next, I will create dummy variables for each venue category so that I can cluster the neighborhoods later. I will then group the rows by neighborhood and take the mean of each dummy column, which gives the share of each category among a neighborhood's venues.

In [10]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
Neighborhood = toronto_onehot['Neighborhood']
# move neighborhood column to the first column
toronto_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
toronto_onehot.insert(0, 'Neighborhood', Neighborhood)


toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
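A toy example may make the one-hot and group-by-mean steps easier to follow. The four venues below are invented for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Gym"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Neighborhood A has two coffee shops out of three venues, so its `Coffee Shop` value is about 0.67, and each row sums to 1.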

Now I will write a function that finds the most common venue categories in each neighborhood.

In [11]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [12]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | Greek Restaurant | Sushi Restaurant | Concert Hall | Japanese Restaurant | Steakhouse | Bar | Coffee Shop | Tea Room | Food Court | Vegetarian / Vegan Restaurant |
| 1 | CN Tower, Bathurst Quay, Island airport, Harbo... | Performing Arts Venue | Yoga Studio | Gym | Gluten-free Restaurant | Gastropub | Garden | Food Court | Fast Food Restaurant | Farmers Market | Dessert Shop |
| 2 | Cabbagetown, St. James Town | Italian Restaurant | Yoga Studio | Gym | Gluten-free Restaurant | Gastropub | Garden | Food Court | Fast Food Restaurant | Farmers Market | Dessert Shop |
| 3 | Central Bay Street | Coffee Shop | Pharmacy | Sandwich Place | Yoga Studio | Cocktail Bar | Gastropub | Garden | Food Court | Fast Food Restaurant | Farmers Market |
| 4 | Chinatown, Grange Park, Kensington Market | Cocktail Bar | Bistro | Liquor Store | Farmers Market | Café | Coffee Shop | Gluten-free Restaurant | Gastropub | Garden | Food Court |
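The column labels rely on a three-item suffix list plus an exception handler for everything else. That works for ten columns but would produce "11st" style labels if the suffix list were ever extended naively. A small helper, sketched here with the hypothetical name `ordinal`, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```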


__I will then be able to use these data for k-means clustering and analysis to determine which neighborhoods best match the venues the store owner wants.__
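As a rough sketch of that next step, k-means can be run on the per-neighborhood category frequencies. The matrix below is invented toy data. In the notebook the input would presumably be `toronto_grouped` without its `Neighborhood` column, and the number of clusters is an assumption that needs tuning.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency matrix: rows are neighborhoods, columns are venue categories.
frequencies = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_clusters=2 is chosen only to suit this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frequencies)
labels = kmeans.labels_
print(labels)
```

The first two rows fall into one cluster and the last two into another. The cluster numbers themselves are arbitrary, so only the grouping is meaningful.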