# PART 1
## Webscraping Toronto's Wikipedia page
In this section, we download the content of the Wikipedia page about Toronto neighborhoods and postal codes ([link](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) to populate a DataFrame.

In [None]:
#imports used in this stage
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

We use `requests.get()` to fetch the page and parse it with BeautifulSoup. We also print the page title to check that we have accessed the correct page:

In [None]:
tor_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
htmldata = requests.get(url=tor_url).text
soup = BeautifulSoup(htmldata, 'lxml')
print(soup.title)

We count the number of tables, and print the beginning of each table to see which of them is the one that we want:

In [None]:
tables = soup.find_all('table')
print('There are {:} tables in the document.'.format(len(tables)))
# printing the beginning of each table to see which one is the right one
for table in tables:
    print(table.prettify()[:250])

From the sample, we can see that the correct table is found in `tables[0]`.

Once the table is identified, we parse its contents, while also getting rid of undesired data:
- Postal codes that were not assigned yet
- Extraneous characters such as **`( ) /`** and extra whitespace

In [None]:
neigh_tbl = tables[0]
neigh_lst = []
for val in (neigh_tbl.find_all('td'))[:-1]:
    strtmp = val.get_text(strip=True, separator=';')
    strlist = strtmp.replace('/', ',').split(';')
    # restrict the parsing only to the postal codes already assigned, otherwise skip
    if strlist[1] != 'Not assigned':
        post = strlist[0]
        bor = strlist[1]
        # merge contents of the other cells (neighborhoods)
        nei = ' '.join(strlist[2:])
        # parsing to remove extraneous symbols
        nei = nei.replace('( ', '').replace('(', '').replace(' ,', ',').replace(')', '')
        # removing spaces at the beginning and the end of the string
        nei = nei.strip()
        strtmp = [post, bor, nei]
        neigh_lst.append(strtmp)
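
The cleaning steps above can be checked in isolation on hand-written strings. The sketch below wraps the same replacements in a hypothetical helper, `parse_cell` (not part of the original notebook); the sample strings are assumptions about the cell format, and the length check is an extra guard that the loop above does not have.

```python
def parse_cell(raw):
    """Split one ';'-separated table cell into (postal code, borough, neighborhoods).

    Returns None when the postal code has no borough assigned.
    """
    parts = raw.replace('/', ',').split(';')
    # extra guard (assumption): also skip cells with fewer than two fields
    if len(parts) < 2 or parts[1] == 'Not assigned':
        return None
    # join the remaining fragments and drop parentheses and stray spaces
    nei = ' '.join(parts[2:])
    nei = nei.replace('( ', '').replace('(', '').replace(' ,', ',').replace(')', '')
    return parts[0], parts[1], nei.strip()

# hand-written sample strings (assumed format, not copied from the page)
print(parse_cell('M5A;Downtown Toronto;(Regent Park / Harbourfront)'))
print(parse_cell('M1A;Not assigned'))
```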

Once the list is correctly parsed, we convert it to a pandas DataFrame:

In [None]:
# convert the resulting list to a pandas DataFrame
df = pd.DataFrame(neigh_lst, columns=['Postal code', 'Borough', 'Neighborhoods'])
df.head()

Checking DataFrame size

In [None]:
df.shape

# PART 2
## Getting Latitude and Longitude of the Postal Codes

We obtain the latitude and longitude values from ArcGIS using `geocoder`.

In [None]:
#imports
!pip install geocoder
import geocoder
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm  # Color palettes
from sklearn.cluster import KMeans  # ML library

In [None]:
# Open previously saved DataFrame, or create a new one and save a local copy
fname = 'Toronto Neighborhoods.csv'

if os.path.exists(fname):
    print("File '{}' already exists. Loading from cache...".format(fname))
    df = pd.read_csv(fname, index_col=0)
else:
    # create a copy of the DataFrame using Postal Code as index
    df2 = df.set_index('Postal code')
    # start from NaN so that failed lookups are counted as missing below
    df2['Latitude'], df2['Longitude'] = np.nan, np.nan

    # add latitude and longitude using geocoder
    for p in df2.index:
        reply = geocoder.arcgis('{} Toronto ON, Canada'.format(p))
        print(reply, reply.latlng)
        # skip empty replies so that the row keeps its NaN values
        if reply.latlng:
            df2.loc[p, ['Latitude', 'Longitude']] = reply.latlng

    print("There are {} missing coordinates after the query".format(
        df2[['Latitude', 'Longitude']].isna().sum().sum()))
    # restore numerical index and Postal Code column
    df = df2.reset_index()
    # save results to a file
    df.to_csv(fname)
    print("DataFrame saved as '{}'.".format(fname))
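
Geocoding services occasionally return an empty reply, so a bounded retry can help. The following is a minimal sketch, assuming a hypothetical helper `geocode_with_retry` that takes the lookup as a callable; the `flaky_lookup` stub and its coordinates are made up for illustration and do not call ArcGIS.

```python
import time

def geocode_with_retry(query, lookup, max_tries=3, wait=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for attempt in range(max_tries):
        latlng = lookup(query)
        if latlng:
            return latlng
        time.sleep(wait)  # pause before the next attempt
    return None

# stand-in for a geocoding call that fails on the first attempt
calls = {'n': 0}
def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 2 else [43.65, -79.38]

print(geocode_with_retry('M5A Toronto ON, Canada', flaky_lookup))
```

In the notebook, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`.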


Checking borough names and shape of the resulting DataFrame

In [None]:
print(df.Borough.unique())
df.shape

# PART 3
## Clustering the neighborhoods using data from the Foursquare API
We kept the same parameters as in the Manhattan sample exercise and reused its `getNearbyVenues` function.

In [None]:
#foursquare credentials
CLIENT_ID = 'L2LRCN30Z5RRWFLVVZVWHPL1JTUF05IZ3IAMMRZX40MU0TIF' # your Foursquare ID
CLIENT_SECRET = '01SEKO1V4WEHUQSEX0QJJ0FVKCIWBOEWFXXWU4OBUWLW5WQU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [None]:
# function to get nearby venues and export to a DataFrame
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append(
            [(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], 
            v['venue']['location']['lng'], v['venue']['categories'][0]['name'])
            for v in results]
            )
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhoods', 'Neighborhood Latitude', 'Neighborhood Longitude',
                             'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return nearby_venues
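
The flattening step inside the function can be tested without calling the API. The sketch below uses a hypothetical helper, `venue_rows`, and a hand-made payload that only mimics the fields read above; skipping venues with an empty `categories` list is an added assumption, since indexing `[0]` on an empty list would raise an error.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # nothing to classify this venue by
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     categories[0]['name']))
    return rows

# made-up payload with the same nesting as the fields used in getNearbyVenues
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'No Category', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = venue_rows('Regent Park', 43.654, -79.361, sample_items)
print(rows)
```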

To save time and avoid repeated Foursquare API queries, we save the results to a file. If the file already exists, it is loaded from disk instead of running the query again.

In [None]:
fname = 'Toronto Venues.csv'
if os.path.exists(fname):
    print('File "{}" exists, loading from cache'.format(fname))
    venues = pd.read_csv(fname, index_col=0)
else:
    venues = getNearbyVenues(df.Neighborhoods, df.Latitude, df.Longitude)
    venues.to_csv(fname)

print('Dataframe shape:', venues.shape)
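
The same load-or-build pattern appears twice in this notebook, so it could be factored into one function. This is a sketch only: `load_or_build`, `build_demo`, and the temporary file name are illustrative names, not part of the original code.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build):
    """Return the DataFrame cached at path, building and saving it on a miss."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0)
    frame = build()
    frame.to_csv(path)
    return frame

build_calls = []
def build_demo():
    build_calls.append(1)  # record every time the expensive step runs
    return pd.DataFrame({'Venue': ['A', 'B'], 'Venue Category': ['Park', 'Pub']})

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'demo.csv')
    first = load_or_build(cache_path, build_demo)   # builds and writes the file
    second = load_or_build(cache_path, build_demo)  # reads the file back
print(len(build_calls), second.shape)
```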

Some of the venues returned by Foursquare are actually neighborhood names, so we drop them:

In [None]:
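# .copy() gives an independent DataFrame, so later in-place edits do not raise SettingWithCopyWarning
venues = venues[venues['Venue Category'] != 'Neighborhood'].copy()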
venues.shape

There are several near-duplicate categories (for example, **Café** vs **Coffee Shop**, **History Museum** vs **Museum**). We merged the obvious redundancies to reduce the number of categories; this pass was not exhaustive. Some other categories could probably be dropped altogether (**Intersection** and **Bridge**, for example).

In [None]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

replacements = [['Café', 'Coffee Shop'], ['History Museum', 'Museum'], ['Golf Driving Range', 'Golf Course'],
                ['Gym', 'Gym & Fitness Center'], ['Gym / Fitness Center', 'Gym & Fitness Center'], ['Gym Pool', 'Pool'],
                ['Art Gallery', 'Art Gallery / Art Museum'],
                ['Art Museum', 'Art Gallery / Art Museum'], ['Opera House', 'Concert Hall'], ['Jazz Club', 'Music Venue'],
                ['Korean BBQ Restaurant', 'Korean Restaurant'], ['Basketball Stadium', 'Stadium'], ['College Stadium','Stadium'],
                ['Bus Line', 'Public Transp. (bus/rail/metro)'], ['Bus Station', 'Public Transp. (bus/rail/metro)'],
                ['Metro Station', 'Public Transp. (bus/rail/metro)'], ['Light Rail Station', 'Public Transp. (bus/rail/metro)'],
                ['Bar', 'Pubs and Bars'], ['Beer Bar', 'Pubs and Bars'], ['Cocktail Bar', 'Pubs and Bars'],
                ['Gastropub', 'Pubs and Bars'], ['Hotel Bar', 'Pubs and Bars'], ['Irish Pub', 'Pubs and Bars'],
                ['Pub', 'Pubs and Bars'], ['Sake Bar', 'Pubs and Bars'], ['Sports Bar', 'Pubs and Bars'],
                ['Wine Bar', 'Pubs and Bars'], ['Food Court', 'Street Food'], ['Food Truck', 'Street Food'],
                ['Garden', 'Gardens'], ['Sculpture Garden', 'Gardens'], ['Sushi Restaurant', 'Japanese Restaurant']]

for r in replacements:
    venues.loc[venues['Venue Category'] == r[0], 'Venue Category'] = r[1]
print('Number of unique categories after grouping: {}'.format(len(venues['Venue Category'].unique())))

# for r in venues['Venue Category'].unique():
#     if 'restaurant' in r.lower():
#         print(r)
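
As an alternative to the loop, the merge can be written as a single mapping applied with `Series.replace`. The sketch below uses a three-entry subset of the pairs and a made-up Series, purely for illustration.

```python
import pandas as pd

# small subset of the old -> new pairs used above
category_map = {'Café': 'Coffee Shop', 'History Museum': 'Museum', 'Pub': 'Pubs and Bars'}
categories = pd.Series(['Café', 'Coffee Shop', 'Pub', 'Park'])

# whole-value replacement: entries not in the mapping are left untouched
merged = categories.replace(category_map)
print(merged.tolist())
```

In the notebook, `dict(replacements)` would build the full mapping from the list of pairs.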

In [None]:
# split the venues with boolean masks (DataFrame.append was removed in pandas 2.0)
contractor_categories = ['Construction & Landscaping', 'Business Service', 'Home Service']
is_contractor = venues['Venue Category'].isin(contractor_categories)
is_restaurant = venues['Venue Category'].str.lower().str.contains('restaurant', regex=False)
contractors = venues[is_contractor]
restaurants = venues[is_restaurant & ~is_contractor]

In [None]:
# group the neighborhood by restaurant counts
restaurant_counts = restaurants.groupby('Neighborhoods')['Venue Category'].count().rename('count')

# group the restaurants by type
onehot = pd.get_dummies(restaurants[['Venue Category']], prefix='', prefix_sep='', dtype=int)
rest_onehot = pd.concat([restaurants[['Neighborhoods']], onehot], axis=1)
# after the groupby, 'Neighborhoods' is the index
rest_onehot = rest_onehot.groupby('Neighborhoods').sum()
print(rest_onehot.shape)

# label the generic 'Restaurant' category, then drop the trailing word from every name
cols = ['Unspecified Restaurant' if c == 'Restaurant' else c for c in rest_onehot.columns]
rest_onehot.columns = [' '.join(c.split(' ')[:-1]).rstrip() for c in cols]
rest_onehot.head()
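
The one-hot encoding followed by a grouped sum produces a neighborhood-by-category count table. As a cross-check, `pd.crosstab` builds the same kind of table directly. The sketch below runs on a made-up four-row DataFrame.

```python
import pandas as pd

# made-up venues: neighborhood A has three restaurants, B has one
demo = pd.DataFrame({
    'Neighborhoods': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Thai Restaurant', 'Thai Restaurant',
                       'Italian Restaurant', 'Thai Restaurant'],
})

# rows are neighborhoods, columns are categories, cells are venue counts
counts = pd.crosstab(demo['Neighborhoods'], demo['Venue Category'])
print(counts)
```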

In [None]:
print('There are {} different restaurant categories in Toronto'.format(len(rest_onehot.columns)))
top_restaurant_counts = rest_onehot.sum()
top_restaurant_counts.head()

In [None]:
# total number of restaurants per neighborhood, summed over all categories
top_counts_neigh = rest_onehot.sum(axis=1)
top_counts_neigh.head()

Create a list with the top 10 most common restaurants in a neighborhood (if there are that many categories)

In [None]:
# preview the ranking for the first neighborhood
a = rest_onehot.iloc[0].sort_values(ascending=False)
a.values

Ranking and counts of each type of restaurant in the neighborhoods

In [None]:
top10list = []
for _, row in rest_onehot.iterrows():
    ranked = row.sort_values()[::-1]
    ranked = ranked[ranked > 0]
    # format as 'Type (count)' and pad with empty strings up to 10 entries
    entries = ['{} ({})'.format(name, count) for name, count in ranked.items()]
    top10list.append((entries + 10 * [''])[:10])

top10list = pd.DataFrame(top10list, index=rest_onehot.index, columns=range(1, 11))
top10list.head()
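
The ranking and padding logic is easier to follow on a plain dictionary. The sketch below uses a hypothetical helper, `top_n_labels`, with made-up counts; it mirrors the steps above (drop zero counts, sort descending, pad with empty strings) but is not the notebook's own code.

```python
def top_n_labels(counts, n=10):
    """Return up to n 'name (count)' strings, highest count first, padded with ''."""
    ranked = sorted(((c, name) for name, c in counts.items() if c > 0), reverse=True)
    labels = ['{} ({})'.format(name, c) for c, name in ranked]
    return (labels + n * [''])[:n]

print(top_n_labels({'Thai': 3, 'Italian': 5, 'Greek': 0}, n=4))
```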

## Clustering the neighborhoods by similarity

Now we use *K-means* to group Toronto neighborhoods with similar restaurant counts.

In [None]:
#imports
import numpy as np
import folium  # Plotting maps with overlays
import branca   # Fancy HTML text inside bubbles

In [None]:
n_clusters = 4
cl = KMeans(init='k-means++', n_clusters = n_clusters, random_state=188, n_init=100)
counts_neigh = np.array([top_counts_neigh.values, top_counts_neigh.values]).T
cl.fit(counts_neigh)
print(cl.labels_)
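
Because the feature matrix above stacks the same count twice, the clustering is effectively one-dimensional. A minimal sketch of that idea follows, using made-up counts reshaped into a single column; the values, `n_clusters=3`, and `random_state=0` are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up restaurant counts forming three well-separated groups
counts = np.array([1, 2, 3, 20, 21, 22, 50, 51])
X = counts.reshape(-1, 1)  # one feature: the restaurant count

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_)
```

The cluster ids themselves are arbitrary; only the grouping is meaningful, which is why the next cell orders the clusters by their counts.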

In [None]:
# create a labeled index for the classes
class_names = ['Low', 'Medium', 'High', 'Very High']
class_counts = np.zeros((n_clusters, 3)).astype('object')

for i in range(n_clusters):
    class_counts[i, :] = [i, top_counts_neigh[cl.labels_ == i].max(), top_counts_neigh[cl.labels_ == i].min()]

class_labels = np.zeros_like(cl.labels_).astype('object')
reverse_index = class_counts[:, 1].argsort().argsort()

for pos, i in enumerate(reverse_index):
    class_counts[pos, 0] = class_names[i]

for i in range(len(class_labels)):
    # .iloc gives positional access, since top_counts_neigh is indexed by neighborhood name
    class_labels[i] = '{} ({})'.format(class_names[reverse_index[cl.labels_[i]]], top_counts_neigh.iloc[i])
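
The double `argsort()` turns the per-cluster maxima into ranks, which is what maps arbitrary K-means ids to ordered names. A small sketch with hypothetical maxima:

```python
import numpy as np

cluster_max = np.array([12, 3, 40, 25])      # hypothetical largest count per cluster id
rank = cluster_max.argsort().argsort()       # 0 = smallest maximum, 3 = largest
names = ['Low', 'Medium', 'High', 'Very High']
print([names[r] for r in rank])
# → ['Medium', 'Low', 'Very High', 'High']
```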


Restore the latitude and longitude values


In [None]:
# restore latitude and longitude data
# look up one coordinate pair per neighborhood name (the first match if a name repeats)
coords = df.drop_duplicates('Neighborhoods').set_index('Neighborhoods')
top10list['Latitude'] = coords.loc[top10list.index, 'Latitude'].values
top10list['Longitude'] = coords.loc[top10list.index, 'Longitude'].values
top10list.head()

Here we plot the neighborhood markers, colored by cluster. The colors represent the number of restaurants in each neighborhood.

In [None]:
colors = cm.nipy_spectral(np.linspace(0, 1, n_clusters))

reply = geocoder.arcgis('Toronto, ON, Canada')
# we added a small correction in latlng to allow for greater zoom
map_toronto = folium.Map(location=list(reply.latlng + np.array([.075, 0])), zoom_start=11)

# add markers to map
for lat, lng, label, grplbl, cidx, top in zip(top10list['Latitude'], top10list['Longitude'], top10list.index.values,
                                              class_labels, cl.labels_, top10list[top10list.columns[:10]].values):
    cv = (255*colors[cidx]).astype('int')
    chex = '#{:02x}{:02x}{:02x}'.format(cv[0], cv[1], cv[2])
    # popup text: class label, neighborhood name, then the ranked restaurant types
    l = '<b>{}:</b><br><i>{}:</i><br>{}'.format(grplbl, label, ', '.join(top))
    lbl = branca.element.IFrame(l, width=250, height=130)
    popup = folium.Popup(lbl)
    folium.CircleMarker([lat, lng], radius=6, popup=popup, color='black', weight=1, fill=True,
                        fill_color=chex, fill_opacity=1).add_to(map_toronto)
map_toronto
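
The color conversion used for the markers can be isolated in a small function. This sketch assumes a hypothetical helper, `rgba_to_hex`, that truncates each channel the same way `astype('int')` does above; `matplotlib.colors.to_hex` offers a similar conversion, although its result may differ in the last digit.

```python
def rgba_to_hex(rgba):
    """Convert an RGBA tuple of floats in [0, 1] to '#rrggbb', truncating each channel."""
    r, g, b = (int(255 * channel) for channel in rgba[:3])
    return '#{:02x}{:02x}{:02x}'.format(r, g, b)

print(rgba_to_hex((1.0, 0.0, 0.5, 1.0)))
# → #ff007f
```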

The map shows that the more central regions (Downtown Toronto) have the highest concentration of restaurants.

In [None]:
top10list[[('Very High' in c) for c in class_labels]]

In [None]:
print(geocoder.arcgis('M4L, Toronto, ON, Canada').latlng)
print(geocoder.arcgis('M5E, Toronto, ON, Canada').latlng)

In [None]:
reply.latlng

In [None]:
df[20:]