# Scrape Wikipedia Page 
Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform that data into a pandas dataframe.

In [1]:
import requests
import pandas as pd
import numpy as np

In [2]:
from bs4 import BeautifulSoup
import geocoder
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [3]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
from scipy.spatial.distance import cdist

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [5]:
wikipage = requests.get(url)

In [6]:
wikipage.status_code

200

In [7]:
wikisoup = BeautifulSoup(wikipage.text, 'html.parser')

In [8]:
postlist = wikisoup.find_all(name='table', attrs={'class':'wikitable sortable'})

In [9]:
trlist = postlist[0].find_all(name='tr')

In [10]:
trlist[0:5]

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>]

In [11]:
postaltable = []
for atr in trlist:
    tdlist = atr.find_all(name='td')
    if len(tdlist)==3:
        apostal = [tdlist[0].get_text(strip=True), tdlist[1].get_text(strip=True), tdlist[2].get_text(strip=True)]
        postaltable.append(apostal)

In [12]:
postaltable[0:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]

In [13]:
torontopd = pd.DataFrame(data=postaltable, columns=['PostalCode', 'Borough', 'Neighborhood'])

In [14]:
torontopd = torontopd[(torontopd['Borough'] != 'Not assigned')]

In [15]:
torontopd["Neighborhood"] = torontopd.apply(lambda x: x["Neighborhood"] if x["Neighborhood"]!='Not assigned' else x["Borough"], axis=1)

In [16]:
torontopd.sort_values(by="PostalCode", inplace=True)

In [17]:
torontopd.shape

(211, 3)

In [18]:
torontopd.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
29,M1C,Scarborough,Port Union
28,M1C,Scarborough,Rouge Hill
27,M1C,Scarborough,Highland Creek


In [19]:
torontopd[torontopd['Borough'] == "Queen's Park"]

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Queen's Park


In [20]:
torontopdun = torontopd.copy().drop_duplicates(subset=["PostalCode"])

In [21]:
torontopdun.shape

(103, 3)

In [22]:
torontoarr = []

for postcode, agroup in torontopd.groupby("PostalCode"):
    neiglist = ",".join(agroup["Neighborhood"].values)
    torontoarr.append(neiglist)

In [23]:
len(torontoarr)

103

In [24]:
torontopdun["Neighborhood"] = np.array(torontoarr)

In [25]:
torontopdun.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
11,M1B,Scarborough,"Rouge,Malvern"
29,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek"
42,M1E,Scarborough,"Guildwood,Morningside,West Hill"
53,M1G,Scarborough,Woburn
62,M1H,Scarborough,Cedarbrae


# Check points
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. - Done: the columns are set when the dataframe is created.
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. - Done: those rows are filtered out.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table. - Done: group by PostalCode, join the neighborhoods within each group, then write the joined list back into the de-duplicated table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park. - Done: an apply with a lambda replaces the value.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making. - Done: this cell.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe. - Done: see the next cell.
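
The grouping step described in the check points can also be written as a single aggregation. The following is a minimal sketch on a small hand-made sample (the `sample` frame is illustrative, not the scraped data), and it assumes each postal code belongs to exactly one borough.

```python
import pandas as pd

# Illustrative sample in the same shape as the scraped table (not the real data).
sample = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M1B", "M5A", "M7A", "M1B"],
        "Borough": ["Downtown Toronto", "Scarborough", "Downtown Toronto",
                    "Queen's Park", "Scarborough"],
        "Neighborhood": ["Harbourfront", "Rouge", "Regent Park",
                         "Queen's Park", "Malvern"],
    }
)

# One row per postal code: keep the first borough, join the neighborhoods.
combined = (
    sample.groupby("PostalCode", as_index=False, sort=True)
    .agg({"Borough": "first", "Neighborhood": ",".join})
)
print(combined)
```

Doing the join inside the aggregation means the result does not depend on `drop_duplicates` and `groupby` returning postal codes in the same order.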

In [26]:
torontopdun.shape

(103, 3)

In [27]:
def getlatlng(postalcode):
    # initialize your variable to None
    lat_lng_coords = None
    
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postalcode))
        # update inside the loop; otherwise the condition never changes
        lat_lng_coords = g.latlng
    
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

In [28]:
#torontopdun[["Latitude", "Longitude"]] = torontopdun["PostalCode"].apply(getlatlng)

In [29]:
#getlatlng("M1B")

In [30]:
#g = geocoder.google('Mountain View, CA')

In [31]:
#print(g.latlng)

* The geocoder package is not returning results, so the coordinates are loaded from a CSV file instead.
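
The retry loop in `getlatlng` has no upper bound, so a service that never answers keeps it spinning. A minimal sketch of a bounded retry follows. Here `lookup` is a hypothetical callable standing in for whatever geocoding call is used, and `fake_lookup` exists only to make the example runnable offline.

```python
import time

def get_latlng_with_retry(postalcode, lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup(postalcode) until it returns coordinates or attempts run out.

    lookup is a hypothetical callable returning [lat, lng] or None.
    """
    for _ in range(max_attempts):
        coords = lookup(postalcode)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay_seconds)  # pause before the next attempt
    return None, None  # give up; the caller can fall back to the CSV


# Stand-in for a flaky geocoder: fails twice, then succeeds.
calls = {"count": 0}

def fake_lookup(postalcode):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.806686, -79.194353]

result = get_latlng_with_retry("M1B", fake_lookup)
print(result)
```

Returning `(None, None)` after the last attempt lets the notebook continue and makes the fallback to the CSV an explicit decision.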

In [32]:
canandapostaldb = pd.read_csv("Geospatial_Coordinates.csv")

In [33]:
canandapostaldb.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [34]:
torontopdun = torontopdun.merge(right=canandapostaldb, how="left", left_on="PostalCode", right_on="Postal Code")

In [35]:
torontopdun.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
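
A left merge silently leaves `NaN` coordinates for any postal code missing from the CSV, and it keeps both key columns (`PostalCode` and `Postal Code`). As a sketch on made-up frames (the `M9Z` row is invented for illustration), the `validate` and `indicator` options of `DataFrame.merge` can surface both issues:

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"],
                     "Borough": ["Scarborough", "Scarborough", "Example"]})
right = pd.DataFrame({"Postal Code": ["M1B", "M1C"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

merged = left.merge(
    right,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",  # raise if either key column contains duplicates
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
).drop(columns=["Postal Code"])  # drop the duplicated key column

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```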


# Check points:
* Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository. - Done

In [36]:
"toronto" in (torontopdun["Borough"].str.lower())

False
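
The `False` above is worth a note: for a pandas Series, the `in` operator tests the index labels, not the values, so this check says nothing about the borough names. A small sketch with an illustrative Series shows the difference:

```python
import pandas as pd

boroughs = pd.Series(["Scarborough", "Downtown Toronto", "North York"])
lowered = boroughs.str.lower()

in_index = "toronto" in lowered                    # tests the labels 0, 1, 2
in_values = "downtown toronto" in lowered.values   # exact match against values
has_substring = bool(boroughs.str.contains("toronto", case=False).any())  # substring match
print(in_index, in_values, has_substring)
```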

In [104]:
#torontopdunsmall = torontopdun[torontopdun["Borough"].str.contains("Toronto")]
torontopdunsmall = torontopdun

In [105]:
torontopdunsmall.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476


In [106]:
torontopdunsmall.shape

(103, 6)

In [107]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.653963, -79.387207.


In [108]:
# create map of Toronto using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(torontopdunsmall['Latitude'], torontopdunsmall['Longitude'], torontopdunsmall['Borough'], torontopdunsmall['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [109]:
CLIENT_ID = 'MRVUKBJTLBOTD4045LWKGE0LVH3AVX00OMUAT3CCOUNWPCED' # your Foursquare ID
CLIENT_SECRET = 'PE1U4COCVVAZD4GJDCQHPUY4OGVSYFC3R2KXORV1HT0TO2QF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT=100
radius=500

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MRVUKBJTLBOTD4045LWKGE0LVH3AVX00OMUAT3CCOUNWPCED
CLIENT_SECRET:PE1U4COCVVAZD4GJDCQHPUY4OGVSYFC3R2KXORV1HT0TO2QF


In [124]:
def getNearbyVenus(row):
    
    
    neighborhood_latitude = row['Latitude'] # neighborhood latitude value
    neighborhood_longitude = row['Longitude'] # neighborhood longitude value
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    print("processing " + row["Neighborhood"] + " venue: " + str(len(results)))
    
    venues_list=[]
    for v in results:
        venues_list.append([
            row["PostalCode"],
            row["Neighborhood"], 
            neighborhood_latitude, 
            neighborhood_longitude, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']])
        
    return(venues_list)

In [125]:
venues_list=[]
for i in range(len(torontopdunsmall)):
    venues_list.extend(getNearbyVenus(torontopdunsmall.iloc[i]))

processing Rouge,Malvern venue: 1
processing Port Union,Rouge Hill,Highland Creek venue: 2
processing Guildwood,Morningside,West Hill venue: 8
processing Woburn venue: 3
processing Cedarbrae venue: 7
processing Scarborough Village venue: 1
processing East Birchmount Park,Ionview,Kennedy Park venue: 5
processing Golden Mile,Oakridge,Clairlea venue: 9
processing Cliffcrest,Scarborough Village West,Cliffside venue: 2
processing Cliffside West,Birch Cliff venue: 4
processing Wexford Heights,Dorset Park,Scarborough Town Centre venue: 8
processing Maryvale,Wexford venue: 8
processing Agincourt venue: 5
processing Sullivan,Clarks Corners,Tam O'Shanter venue: 13
processing Milliken,Agincourt North,L'Amoreaux East,Steeles East venue: 3
processing L'Amoreaux West venue: 11
processing Upper Rouge venue: 0
processing Hillcrest Village venue: 5
processing Fairview,Oriole,Henry Farm venue: 59
processing Bayview Village venue: 4
processing Silver Hills,York Mills venue: 1
processing Willowdale,Newton

In [126]:
torontopdunsmall_venues = pd.DataFrame(data=venues_list, columns = ['PostalCode','Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])

In [127]:
torontopdunsmall_venues.shape

(2235, 8)

In [128]:
torontopdunsmall_venues.head()

Unnamed: 0,PostalCode,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,"Rouge,Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,"Port Union,Rouge Hill,Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,"Port Union,Rouge Hill,Highland Creek",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,M1E,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,M1E,"Guildwood,Morningside,West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [129]:
print('There are {} unique categories.'.format(len(torontopdunsmall_venues['Venue Category'].unique())))

There are 271 unique categories.


In [130]:
# one hot encoding
torontopdunsmall_onehot = pd.get_dummies(torontopdunsmall_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
torontopdunsmall_onehot['PostalCode'] = torontopdunsmall_venues['PostalCode']
#torontopdunsmall_onehot['NeighborhoodName'] = torontopdunsmall_venues['Neighborhood'] 


# move postal code column to the first column
columnlist = torontopdunsmall_onehot.columns.values
fixed_columns = columnlist[-1:] 
fixed_columns = np.hstack((fixed_columns, columnlist[:-1]))
torontopdunsmall_onehot = torontopdunsmall_onehot[fixed_columns]

torontopdunsmall_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [131]:
torontopdunsmall_onehot.shape

(2235, 272)

In [132]:
torontopdunsmall_grouped = torontopdunsmall_onehot.groupby('PostalCode').mean().reset_index()

In [133]:
torontopdunsmall_grouped.shape

(100, 272)
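
Grouping the one-hot rows by postal code and taking the mean turns the 0/1 indicators into the share of each category among that postal code's venues. The drop from 103 postal codes to 100 rows is presumably because postal codes that returned no venues (for example Upper Rouge, which logged 0 venues above) contribute no rows to the groupby. A toy sketch with made-up venues:

```python
import pandas as pd

venues = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1C", "M1E", "M1E", "M1E", "M1E"],
    "Venue Category": ["Fast Food Restaurant", "Bar", "Park",
                       "Pizza Place", "Pizza Place", "Park", "Bar"],
})

# One indicator column per category, stored as floats.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "PostalCode", venues["PostalCode"])

# Mean of 0/1 indicators = fraction of venues in each category.
freq = onehot.groupby("PostalCode").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, so postal codes with many venues and postal codes with few are compared on the same scale.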

In [134]:
torontopdunsmall_grouped.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [135]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [181]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = torontopdunsmall_grouped['PostalCode']

for ind in np.arange(torontopdunsmall_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(torontopdunsmall_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Dim Sum Restaurant
1,M1C,Moving Target,Bar,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Diner
2,M1E,Electronics Store,Pizza Place,Intersection,Breakfast Spot,Medical Center,Rental Car Location,Mexican Restaurant,Yoga Studio,Donut Shop,Dog Run
3,M1G,Coffee Shop,Korean Restaurant,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,M1H,Hakka Restaurant,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Fried Chicken Joint,Discount Store,Doner Restaurant,Donut Shop
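
The column labels above come from an index lookup that falls back to "th". That works for 1 to 10 but would produce "21th" if `num_top_venues` were raised past 20. A sketch of a more general helper follows (`ordinal` is a name introduced here, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["PostalCode"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns)
```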


In [182]:
torontopdunsmall_clustering = torontopdunsmall_grouped.drop('PostalCode', axis=1)
for k in range(2, 50):
    kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
    kmeans.fit(torontopdunsmall_clustering)
    print(" groups " + str(k) + " score: " + str(sum(np.min(cdist(torontopdunsmall_clustering, kmeans.cluster_centers_, 'euclidean'), axis=1))))


 groups 2 score: 36.95507469031955
 groups 3 score: 35.79064808779616
 groups 4 score: 34.897823512376924
 groups 5 score: 34.46868608410717
 groups 6 score: 33.10117878669059
 groups 7 score: 32.1782535315446
 groups 8 score: 31.835484426119454
 groups 9 score: 30.97164189471442
 groups 10 score: 30.287818361910396
 groups 11 score: 29.64589606250419
 groups 12 score: 28.946241219522047
 groups 13 score: 28.279527340774237
 groups 14 score: 27.662916203637618
 groups 15 score: 27.17815838384257
 groups 16 score: 26.39867979343503
 groups 17 score: 25.79841311250387
 groups 18 score: 25.052998451644367
 groups 19 score: 24.659232407846076
 groups 20 score: 23.922397886130767
 groups 21 score: 24.281962318563565
 groups 22 score: 23.304794880448462
 groups 23 score: 22.808120300934842
 groups 24 score: 22.2904999257575
 groups 25 score: 21.482192653512584
 groups 26 score: 21.08341969792803
 groups 27 score: 20.742092374888173
 groups 28 score: 20.631416025103867
 groups 29 score: 19.71
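
The score printed for each k is the sum, over all postal codes, of the Euclidean distance to the nearest cluster centre, so lower is better and it generally shrinks as k grows. A sketch of the same quantity with plain NumPy on made-up points (the `points` and `centers` arrays are illustrative):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 3.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])

# Pairwise Euclidean distances, shape (n_points, n_centers).
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Distance from each point to its nearest centre, summed over all points.
score = distances.min(axis=1).sum()
print(score)
```

Note that k-means itself minimises the sum of squared distances (exposed as `inertia_` in scikit-learn), so this score is a related but different quantity.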

In [183]:
# set number of clusters
kclusters = 7

torontopdunsmall_clustering = torontopdunsmall_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,n_init=20, random_state=0).fit(torontopdunsmall_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 1, 1, 1, 5, 1, 1, 1, 1])

In [184]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

torontopdunsmall_merged = torontopdunsmall

# join the per-postal-code venue rankings onto the coordinates table (the right join keeps only postal codes that have venues)
torontopdunsmall_merged = torontopdunsmall_merged.join(neighborhoods_venues_sorted.set_index('PostalCode'), on='PostalCode', how='right')

torontopdunsmall_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353,4,Fast Food Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Dim Sum Restaurant
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",M1C,43.784535,-79.160497,1,Moving Target,Bar,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Diner
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711,1,Electronics Store,Pizza Place,Intersection,Breakfast Spot,Medical Center,Rental Car Location,Mexican Restaurant,Yoga Studio,Donut Shop,Dog Run
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917,1,Coffee Shop,Korean Restaurant,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476,1,Hakka Restaurant,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Fried Chicken Joint,Discount Store,Doner Restaurant,Donut Shop


In [185]:
torontopdunsmall_merged.shape

(100, 17)

In [186]:
set(torontopdunsmall_merged["Cluster Labels"])

{0, 1, 2, 3, 4, 5, 6}

In [187]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(torontopdunsmall_merged['Latitude'], torontopdunsmall_merged['Longitude'], torontopdunsmall_merged['Neighborhood'], torontopdunsmall_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [189]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 0, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,Central Toronto,-79.38316,0,Trail,Playground,Yoga Studio,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant


In [190]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 1, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,-79.160497,1,Moving Target,Bar,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Diner
2,Scarborough,-79.188711,1,Electronics Store,Pizza Place,Intersection,Breakfast Spot,Medical Center,Rental Car Location,Mexican Restaurant,Yoga Studio,Donut Shop,Dog Run
3,Scarborough,-79.216917,1,Coffee Shop,Korean Restaurant,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,Scarborough,-79.239476,1,Hakka Restaurant,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Fried Chicken Joint,Discount Store,Doner Restaurant,Donut Shop
6,Scarborough,-79.262029,1,Department Store,Discount Store,Hobby Shop,Coffee Shop,Chinese Restaurant,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
7,Scarborough,-79.284577,1,Bus Line,Bakery,Fast Food Restaurant,Intersection,Bus Station,Soccer Field,Park,Gluten-free Restaurant,Dim Sum Restaurant,Ethiopian Restaurant
8,Scarborough,-79.239476,1,Motel,American Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio,Diner
9,Scarborough,-79.264848,1,Café,General Entertainment,Skating Rink,College Stadium,Concert Hall,Diner,Fabric Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant
10,Scarborough,-79.273304,1,Indian Restaurant,Light Rail Station,Furniture / Home Store,Chinese Restaurant,Thrift / Vintage Store,Pet Store,Vietnamese Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
11,Scarborough,-79.295849,1,Sandwich Place,Smoke Shop,Breakfast Spot,Auto Garage,Vietnamese Restaurant,Bakery,Shopping Mall,Middle Eastern Restaurant,Yoga Studio,Doner Restaurant


In [191]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 2, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,North York,-79.408493,2,Home Service,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio,Dim Sum Restaurant
32,North York,-79.495697,2,Home Service,Food Truck,Baseball Field,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant


In [192]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 3, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,-79.284577,3,Playground,Park,Coffee Shop,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
23,North York,-79.400049,3,Park,Bank,Convenience Store,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
25,North York,-79.329656,3,Food & Drink Shop,Park,Yoga Studio,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
30,North York,-79.464763,3,Airport,Park,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
40,East York,-79.338106,3,Park,Convenience Store,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
50,Downtown Toronto,-79.377529,3,Park,Playground,Trail,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
74,York,-79.453512,3,Park,Fast Food Restaurant,Women's Store,Market,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
90,Etobicoke,-79.506944,3,Park,River,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
98,York,-79.518188,3,Park,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store


In [193]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 4, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,-79.194353,4,Fast Food Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Dim Sum Restaurant


In [194]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 5, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,-79.239476,5,Playground,Yoga Studio,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store


In [195]:
torontopdunsmall_merged.loc[torontopdunsmall_merged['Cluster Labels'] == 6, torontopdunsmall_merged.columns[[1] + list(range(5, torontopdunsmall_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,North York,-79.374714,6,Cafeteria,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,College Stadium


## Check points
* Add enough Markdown cells to explain what you decided to do and to report any observations you make.
* What I decided:
* 1. Use the full data set rather than only the boroughs whose name contains "Toronto".
* 2. Work out, for each postal code, how frequently each venue category occurs. This shows which types of venue characterize the postal area.
* 3. Use k-means to cluster the postal areas.
* 4. Analyze the k-means clustering result.
* Observations:
* 1. The majority of areas contain many restaurants, cafés, and similar venues.
* 2. Some areas are dominated by parks and playgrounds.
* 3. This method treats every category as equally distant from every other (that is, as independent features). This is not actually true: Coffee Shop and Café are effectively the same thing, and Pizza Place is closer to Restaurant than to Park or River, but the method does not take that into account. A possible improvement is to model or encode the categories further (for example with TF-IDF or embeddings), using a description of each category taken from Wikipedia. That should help extract better features.
* Generate maps to visualize your neighborhoods and how they cluster together. - Done
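
One low-effort way to act on the last observation (that Coffee Shop and Café are treated as unrelated categories) is to collapse near-synonymous categories before one-hot encoding. A sketch follows; the `CATEGORY_ALIASES` mapping is a hand-written assumption, not something provided by Foursquare, and the sample venues are made up.

```python
import pandas as pd

# Hypothetical, hand-maintained mapping of near-duplicate categories.
CATEGORY_ALIASES = {"Café": "Coffee Shop"}

venues = pd.DataFrame({
    "PostalCode": ["M1G", "M1G", "M1M", "M1M"],
    "Venue Category": ["Coffee Shop", "Korean Restaurant", "Café", "Skating Rink"],
})

# Replace aliases with their canonical name; unmapped categories are unchanged.
venues["Venue Category"] = venues["Venue Category"].replace(CATEGORY_ALIASES)

onehot = pd.get_dummies(venues["Venue Category"])
print(sorted(onehot.columns))
```

This only merges categories that are listed by hand. It does not capture graded similarity such as Pizza Place being closer to Restaurant than to Park, which is where an embedding-based encoding would come in.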