We first install Beautiful Soup to our Jupyter Notebook instance

# Segmenting and Clustering Neighborhoods in Toronto

In this assignment within the Applied Data Science Capstone we will be exploring and clustering the neighborhoods in Toronto. There are three parts to the assignment, which are as follows:    

Part 1: Scraping postal codes, boroughs and neighborhoods from Wikipedia  
Part 2: Pulling in geospatial data - i.e. lat/long - and merging with our data that we scraped from Wikipedia  
Part 3: Exploring our data and visualizing it on a map  

### Part 1: Scraping Wikipedia and storing in a data frame

Since we are using a Jupyter Notebook on the Skills Network Lab, we need to do a little fancy footwork to bring in BeautifulSoup. We do this first. 

In [4]:
import pip

if int(pip.__version__.split('.')[0])>9:
    from pip._internal import main
else:
    from pip import main
def install(package):
    main(['install', package])
install('BeautifulSoup4')

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.0 soupsieve-2.0


Next we bring in other libraries for reading and parsing data, including BeautifulSoup.

In [5]:
from bs4 import BeautifulSoup
import csv
import requests
import pandas as pd

We now parse the Wikipedia page and put the results into a list. We will have three columns: Postal code, Borough and Neighborhood

In [6]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'html')

table = soup.find('table')
list_table = []
for line in table.find_all('tr'):
    if "Not assigned" not in line.contents[3].text:
        list_table.append([line.contents[1].text.replace('\n',''), line.contents[3].text.replace('\n',''), line.contents[5].text[:-1].replace('\n','').replace(' / ',', ')])

Now we put the contents of our list with the three columns into a dataframe 

In [7]:
wiki_df = pd.DataFrame(list_table[1:], columns=list_table[0])
wiki_df.rename(columns = {'Postal code':'ID'}, inplace = True) 
wiki_df

Unnamed: 0,ID,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Lastly, show the shape of the dataframe

In [8]:
print('Shape: ') 
print(wiki_df.shape)

Shape: 
(103, 3)


### Part 2: Pulling geospatial data and merging with scraped Wikipedia data into a dataframe

Next we load in the geospatial data so that we can bring in long/lat

In [10]:
geospatial_df = pd.read_csv("http://cocl.us/Geospatial_data") 
geospatial_df.rename(columns = {'Postal Code':'ID'}, inplace = True) 
geospatial_df

Unnamed: 0,ID,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [13]:
merged_df = pd.merge(wiki_df, geospatial_df, left_index=False, right_index=False, how='inner')
merged_df.rename(columns = {'ID':'Postal Code'}, inplace = True) 
merged_df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road , Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


### Part 3: Exploring, creating a map and displaying our results from Part 1 and Part 2

We first connect to the Foursquare API and pull results based on our merged data frame

In [46]:
# import the necessary libraries
from pandas.io.json import json_normalize
import folium
import requests

# Set the Foursquare client id, secret, API version, limit, and radius 
CLIENT_ID = 'ALLFWAOB3NHAEMKEOFNHBA5NOQ4021AVH1T5OAGZLZYKTQSE' 
CLIENT_SECRET = '0YAIIKKTSKJVZK02YVJU50ESWCWZQSLBESNREAKTQQ4WCUCD' 
VERSION = '20200411' 
LIMIT = 100
radius = 500 

#print('Your credentails:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

# Select first building (which in our case is BedokVille):
#neighborhood_name = merged_df.loc[1, 'Neighborhood']
neighborhood_latitude = merged_df.loc[0, 'Latitude'] 
neighborhood_longitude = merged_df.loc[0, 'Longitude'] 

# limit of number of venues returned by Foursquare API
LIMIT = 100 
radius = 500

# Connect to the Foursquare API and pull results
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()

In [87]:
venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues)

In [88]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(lambda x: nearby_venues['venue.categories'][0][0]['name'], axis=1)
#print(nearby_venues['venue.categories'][0][0]['name'])
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Park,43.751974,-79.333114


Now we create our map and set parameters. 

In [89]:
venues_map = folium.Map(location=[neighborhood_latitude, neighborhood_longitude], zoom_start=15)

# add a red circle marker to represent the Brookbanks Park
folium.features.CircleMarker(
    [neighborhood_latitude, neighborhood_longitude],
    radius=10,
    color='red',
    popup='Brookbanks Park',
    fill = True,
    fill_color = 'green',
    fill_opacity = 0.6
).add_to(venues_map)


<folium.features.CircleMarker at 0x7f5502f0de48>

Display the map

In [90]:
venues_map
