# CAPSTONE PROJECT WEEK 3  
**Part 1**  
_In this part of the assignment, we scrape the postal codes, boroughs, and neighborhoods in Canada from a Wikipedia page.
We keep only postal codes whose borough is assigned. To keep the data easy to work with, each postal code appears once 
and serves as the key of the dataframe. A neighborhood with no assigned value is replaced by its borough, on the assumption 
that the borough has no separate neighborhood or that the information is missing_
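
The clean-up rules described above can be tried on a small hand-made table before they are applied to the scraped data. The sketch below is illustrative only: the rows in `sample` are invented for the example, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Invented sample rows; the real rows come from the Wikipedia table.
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postalcode", "Borough", "Neighborhood"],
)

# 1. Keep only rows whose borough is assigned.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# 2. A missing neighborhood falls back to the borough name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

# 3. One row per postal code, neighborhoods joined with commas.
cleaned = (
    cleaned.groupby("Postalcode")
    .agg({"Borough": "first", "Neighborhood": ", ".join})
    .reset_index()
)

print(cleaned)
```

Doing the fallback before the grouping means an unassigned neighborhood never ends up inside a joined string.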

In [61]:
# Import all needed libraries
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

In [62]:
# Scrape the needed data from Wikipedia
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=890001695"
html = requests.get(url).text
soup = bs(html, 'html.parser')
ta=soup.find('table',{'class':'wikitable'})
table_headers = ta.find_all('th')
table_rows = ta.find_all('tr')

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]  # avoid reusing the outer loop variable tr
    if row:
        row[-1] = row[-1].strip('\n')
        l.append(row)
df = pd.DataFrame(l, columns=["Postalcode", "Borough", "Neighborhood"])
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
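
The scraping cell strips the trailing newline from the last cell of each row only. As a hedged alternative, the sketch below trims every cell with `get_text(strip=True)`. The HTML string is an invented two-row table in a similar layout, not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Invented table in a layout similar to the Wikipedia one.
html = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table", {"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) trims whitespace from every cell, not only the last.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```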


In [63]:
# Drop rows whose Borough is not assigned
df.drop(df.loc[df['Borough']== 'Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [64]:
# Group by PostalCode
df = df.groupby('Postalcode').agg({'Borough':'first', 
                             'Neighborhood': ', '.join}).reset_index()
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [65]:
# Replace a 'Not assigned' neighborhood with its borough
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [66]:
df.shape

(103, 3)

**Part 2**  
_Because the geocoder library is unreliable and requesting the coordinates takes too long, we import the needed information from 
the given .csv file. The imported dataframe contains each postal code with its latitude and longitude. To attach the 
latitude and longitude to the postal codes in the dataframe from Part 1, we merge the two dataframes on their common column, 
the postal code_

In [67]:
!pip install geocoder



In [68]:
# Read the .csv file into a dataframe
file_name='http://cocl.us/Geospatial_data'
coordinates=pd.read_csv(file_name)
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [69]:
# Rename a column in coordinates
coordinates.rename({'Postal Code': 'Postalcode'}, axis=1, inplace=True)
coordinates.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [70]:
coordinates.shape

(103, 3)

In [71]:
# Merge two dataframes
data = pd.merge(df, coordinates,how='left', on='Postalcode')
data.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
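
A left merge silently leaves `NaN` coordinates for any postal code missing from the coordinates file. The sketch below shows one way to check for that, using the `validate` and `indicator` options of `pd.merge`. The two small frames are invented stand-ins (including the unmatched code `M9Z`), not the notebook's data.

```python
import pandas as pd

# Invented stand-ins for the two tables being merged.
left = pd.DataFrame({"Postalcode": ["M1B", "M1C", "M9Z"],
                     "Borough": ["Scarborough", "Scarborough", "Etobicoke"]})
right = pd.DataFrame({"Postalcode": ["M1B", "M1C"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a _merge column recording where each row was found.
checked = pd.merge(left, right, how="left", on="Postalcode",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.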


**Part 3**  
_In this part, we work only with boroughs whose name contains the word Toronto, so we create a new dataframe that holds just those rows_

In [72]:
# Keep only information for boroughs that contain the word Toronto
toronto_df = data[data['Borough'].str.contains("Toronto")]
toronto_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [73]:
# Request the coordinates of Toronto
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [74]:
!pip install folium



In [75]:
# Credentials
CLIENT_ID = '2BHHDRJ3ZDKLQ4AWD5HEZPR1ENWU1SUALFHI3FONNA1XKDGF'
CLIENT_SECRET = 'DBUBPZZYF1E4XRNGEBB3GCRIRGQ22P53NWU0KDDX4GR15ORX' 
VERSION = '20200323'
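
Hard-coding API keys in a shared notebook exposes them to anyone who reads it. A minimal sketch of one alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `credentials_ready` are assumptions made for this example, not part of any Foursquare tooling.

```python
import os

# Hypothetical variable names; set them in the shell before starting Jupyter.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")
VERSION = "20200323"


def credentials_ready(client_id, client_secret):
    """Return True only when both values are non-empty strings."""
    return bool(client_id) and bool(client_secret)


if not credentials_ready(CLIENT_ID, CLIENT_SECRET):
    print("Foursquare credentials are not set; API calls will fail.")
```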

In [76]:
radius=200
LIMIT=50
def getNearbyVenues(names, latitudes, longitudes, radius=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
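
`getNearbyVenues` indexes straight into the JSON response, so a failed request or a venue without a category raises `KeyError` or `IndexError`. The sketch below shows a more tolerant way to flatten one response. The helper name `extract_venues` and the `fake` payload are invented for illustration; the payload only imitates the keys that the function above reads.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into row tuples.

    Returns an empty list when the expected keys are absent, and labels a
    venue 'Unknown' when its category list is empty.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Invented payload in the shape that getNearbyVenues indexes into.
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

print(extract_venues(fake, "Demo", 43.65, -79.38))
print(extract_venues({"meta": {"code": 429}}, "Demo", 43.65, -79.38))
```

With a helper like this, one bad response yields no rows for that neighborhood instead of stopping the whole loop.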

In [77]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

In [78]:
# Analyze Each Neighborhood
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bakery,Bank,Bar,Beer Bar,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
toronto_onehot.shape

(419, 132)

In [80]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bakery,Bank,Bar,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.074074,0.0,0.0,0.0,...,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0
1,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.027778,0.0,0.027778,0.0,0.083333,...,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.027778,0.083333,0.027778
7,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Church and Wellesley,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.047619,0.0
9,"Commerce Court, Victoria Hotel",0.0,0.0,0.021739,0.021739,0.0,0.0,0.021739,0.021739,0.0,...,0.021739,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739


In [81]:
toronto_grouped.shape

(34, 132)
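
The grouped table holds, for each neighborhood, the share of its venues that fall in each category, because the mean of a 0/1 column equals the fraction of rows set to 1. The toy example below illustrates this with invented venues; `dtype=int` is passed so the dummy columns are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Invented venues: three in one neighborhood, one in another.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Trail", "Brewery"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of rows equal to 1.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

A neighborhood with a single venue, like `B` here, gets a frequency of 1.0 for that one category, which is worth keeping in mind when reading the per-neighborhood summaries.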

In [82]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0       Coffee Shop  0.07
1        Restaurant  0.07
2  Asian Restaurant  0.07
3        Steakhouse  0.07
4       Opera House  0.04


----Brockton, Exhibition Place, Parkdale Village----
                   venue  freq
0             Playground   1.0
1            Yoga Studio   0.0
2     Mexican Restaurant   0.0
3  Performing Arts Venue   0.0
4                   Park   0.0


----Business Reply Mail Processing Centre 969 Eastern----
                       venue  freq
0                    Brewery   1.0
1                Yoga Studio   0.0
2                  Nightclub   0.0
3  Middle Eastern Restaurant   0.0
4        Moroccan Restaurant   0.0


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                   venue  freq
0  Performing Arts Venue   1.0
1            Yoga Studio   0.0
2               Pharmacy   0.0
3                   Park   0.0
4      Outdoor Sculpture   0
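
The loop above transposes each row and sorts it to find the top categories. As a hedged alternative sketch, `Series.nlargest` gives the same kind of result directly on one row of frequencies. The values in `freq_row` are invented for the example.

```python
import pandas as pd

# Invented frequencies for a single neighborhood.
freq_row = pd.Series({
    "Coffee Shop": 0.07,
    "Opera House": 0.04,
    "Steakhouse": 0.07,
    "Yoga Studio": 0.0,
})

# Keep the three categories with the highest frequency.
top = freq_row.nlargest(3)
print(top)
```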

In [83]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [84]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Neighborhood_venues_sorted = pd.DataFrame(columns=columns)
Neighborhood_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    Neighborhood_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

Neighborhood_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Asian Restaurant,Restaurant,Steakhouse,General Travel,Hotel,Concert Hall,Noodle House,Opera House,Park
1,"Brockton, Exhibition Place, Parkdale Village",Playground,Wine Bar,Flower Shop,Concert Hall,Convenience Store,Creperie,Cuban Restaurant,Deli / Bodega,Department Store,Dessert Shop
2,Business Reply Mail Processing Centre 969 Eastern,Brewery,Food Court,Concert Hall,Convenience Store,Creperie,Cuban Restaurant,Deli / Bodega,Department Store,Dessert Shop,Diner
3,"CN Tower, Bathurst Quay, Island airport, Harbo...",Performing Arts Venue,Flower Shop,Comic Shop,Concert Hall,Convenience Store,Creperie,Cuban Restaurant,Deli / Bodega,Department Store,Dessert Shop
4,"Cabbagetown, St. James Town",Café,Outdoor Sculpture,Pizza Place,Restaurant,Diner,Italian Restaurant,Beer Store,Coffee Shop,Indian Restaurant,Bakery
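
The column names above rely on a three-element suffix list plus an exception handler, which works for ten columns but would produce "11st"-style names if the list were extended naively. A small sketch of a general ordinal helper follows; the function name `ordinal` is an assumption made for this example.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], columns[3], columns[10], sep=" | ")
```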


In [85]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 3, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [86]:
# add clustering labels
Neighborhood_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# join toronto_df with the sorted venues to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(Neighborhood_venues_sorted.set_index('Neighborhood'), on='Neighborhood', how='inner')
toronto_merged.head() # check the last columns!

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Park,Trail,Other Great Outdoors,Department Store,Farmers Market,Ethiopian Restaurant,Dumpling Restaurant,Diner,Dessert Shop,Cuban Restaurant
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,4,Greek Restaurant,Wine Bar,Flower Shop,Concert Hall,Convenience Store,Creperie,Cuban Restaurant,Deli / Bodega,Department Store,Dessert Shop
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Park,Fish & Chips Shop,Wine Bar,Flower Shop,Concert Hall,Convenience Store,Creperie,Cuban Restaurant,Deli / Bodega,Department Store
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Café,Coffee Shop,Bar,Cheese Shop,Convenience Store,Seafood Restaurant,Clothing Store,Bookstore,Gastropub,Pet Store
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1,Lawyer,Wine Bar,Diner,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Ethiopian Restaurant,Dumpling Restaurant,Dessert Shop,Food Court


In [87]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

In [88]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
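
K-means labels run from 0 to `kclusters - 1`, so `rainbow[cluster-1]` looks up index -1 for cluster 0, which Python resolves to the last color in the list. Each cluster still gets its own color, but the order is rotated by one. The sketch below illustrates this with placeholder strings standing in for the hex codes; `palette`, `shifted`, and `direct` are names invented for the example.

```python
kclusters = 5

# Placeholder names standing in for the hex codes stored in `rainbow`.
palette = ["c0", "c1", "c2", "c3", "c4"]

# Indexing used in the map cell: label 0 wraps around to the last color.
shifted = {label: palette[label - 1] for label in range(kclusters)}

# Direct indexing keeps labels and colors in the same order.
direct = {label: palette[label] for label in range(kclusters)}

print(shifted)
print(direct)
```

Either mapping assigns five distinct colors; direct indexing may simply be easier to reason about when matching a map legend to cluster numbers.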