[View in Colaboratory](https://colab.research.google.com/github/nsp24/CityMap-Capstone/blob/master/Neighbourhood_clustering.ipynb)

**This notebook performs the K-means clustering of the neighborhood data**

> Importing the libraries

In [19]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as BSoup
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium # installs folium if it is not already available in this runtime
import folium # map rendering library



> (i) Scraping the Neighborhood data from the Wikipedia page and the geospatial data from the CSV file **(The geodata library isn't fetching the data)**

> (ii) Merging the Neighborhood data with geospatial data

In [0]:
BASE_URI = 'https://en.wikipedia.org'
page = requests.get(BASE_URI+'/wiki/List_of_postal_codes_of_Canada:_M')

# Parsing the fetched page with BeautifulSoup
soup = BSoup(page.content, 'html.parser')

# Picking the target elements
neighbourhood_on_page = soup.select('table.wikitable tbody tr')

# Extracting values from target elements:
neighbourhood_data = list()
for neighbourhood in neighbourhood_on_page:
  neighbour_dict = dict()
  element_chain = list()
  for child in neighbourhood.children:
    if 'get_text' in dir(child):
      element = child.get_text()
    else:
      element = child.string
    element_chain.append(element)

  # Cleaning: converting each HTML leaf to a string and stripping newline characters
  element_chain = [str(chain).replace('\n', '') for chain in element_chain]

  # Cleaning: dropping the empty strings left behind by whitespace-only nodes
  elements = [element for element in element_chain if element]
  if elements[1] != 'Not assigned': 
    # Filtering "Not assigned" Boroughs
    if elements[2] == 'Not assigned':
      # Cleaning "Not assigned" Neighbourhoods
      elements[2] = elements[1]
    neighbourhood_data.append(elements)

df_cols = neighbourhood_data.pop(0)

neighbourhood_df = pd.DataFrame(neighbourhood_data, columns=df_cols)

grouped_df = pd.DataFrame({'Neighbourhood':neighbourhood_df.groupby('Postcode').apply(lambda x: ','.join(x.Neighbourhood))})
grouped_df.reset_index(inplace=True)

# Fetching Geospatial Data from URL:
data_csv = requests.get('https://cocl.us/Geospatial_data').text.split('\r\n')
data = [row.split(',') for row in data_csv if row]  # skip blank lines such as a trailing newline
cols = data.pop(0)
cols[0] = df_cols[0]
geodata_df = pd.DataFrame(data, columns=cols)

# Unifying the repeated Neighbourhood values of the same Postalcode
merged_df = pd.merge(neighbourhood_df[['Postcode', 'Borough']], grouped_df, how="inner", on="Postcode")
merged_df.drop_duplicates(inplace=True)

# Merging the neighbourhood data with their geospatial values
final_df = pd.merge(merged_df, geodata_df, how="inner", on="Postcode")
# print(final_df)
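
> Splitting the CSV text by hand leaves the coordinates as strings. As a minimal sketch of an alternative, `pd.read_csv` can parse the same kind of text directly. The sample rows and the `Postal Code` header below are assumptions for illustration, not data fetched from the URL above.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the "Postal Code" header is an assumption about the real file.
sample_csv = (
    "Postal Code,Latitude,Longitude\r\n"
    "M1B,43.8066863,-79.1943534\r\n"
    "M1C,43.7845351,-79.1604971\r\n"
)

# read_csv handles the line endings and the trailing newline, and infers float columns.
geodata_sample = pd.read_csv(io.StringIO(sample_csv))

# Rename the first column by position so it matches the merge key used above.
geodata_sample = geodata_sample.rename(columns={geodata_sample.columns[0]: "Postcode"})

print(geodata_sample.dtypes)
```

> With numeric coordinates from the start, later conversions between strings and floats may become unnecessary.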

> Initializing the basic Foursquare API parameters and fetching nearby venues in the neighborhood
 *Please apply your own Foursquare API credentials here:* 

In [0]:
CLIENT_ID = '***********'
CLIENT_SECRET = '***********'
VERSION = '20181010'
LIMIT = 100

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

  venues_list=[]
  for name, lat, lng in zip(names, latitudes, longitudes):
    name = name.split(',')[0]
#     print(name)

    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng, 
        radius, 
        LIMIT)

    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']

    # return only relevant information for each nearby venue
    venues_list.append([(
        name, 
        lat, 
        lng, 
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        v['venue']['categories'][0]['name']) for v in results])

  nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
  nearby_venues.columns = ['Neighborhood', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']

  return(nearby_venues)
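
> The request URL above is assembled with string formatting. A minimal sketch of the same query built from a dictionary follows; `build_explore_params` is a hypothetical helper and the credentials are placeholders. Only the standard library is used here so the query string can be inspected without calling the API.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper: collect the query parameters for one explore request."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


# Placeholder credentials and Toronto's approximate centre
params = build_explore_params("MY_ID", "MY_SECRET", "20181010", 43.7, -79.4)
query_string = urlencode(params)
print(EXPLORE_ENDPOINT + "?" + query_string)

# With requests, the same dictionary can be passed directly:
# requests.get(EXPLORE_ENDPOINT, params=params)
```

> Passing a dictionary lets the library handle URL encoding, which avoids stray separators in a hand-written template.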

In [0]:
nearby_toronto_venues = getNearbyVenues(names=final_df['Neighbourhood'], latitudes=final_df['Latitude'], longitudes=final_df['Longitude'])

> Analyzing each neighbourhood

In [0]:
# onehot-encoding the venues-data:
toronto_venues_onehot = pd.get_dummies(nearby_toronto_venues[['Venue Category']], prefix='', prefix_sep='')

# adding Neighbourhood and geolocation to the onehot-encoded DF

# coordinates are stored as strings here so that they are not averaged like the venue columns
toronto_venues_onehot['Latitude'] = nearby_toronto_venues['Neighborhood Latitude'].astype(str)
toronto_venues_onehot['Longitude'] = nearby_toronto_venues['Neighborhood Longitude'].astype(str)
toronto_venues_onehot['Neighborhood'] = nearby_toronto_venues['Neighborhood']

# shifting Neighborhood column to the front 
fixed_columns = [toronto_venues_onehot.columns[-1]] + list(toronto_venues_onehot.columns[:-1])
toronto_venues_onehot = toronto_venues_onehot[fixed_columns]

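# calculating the mean frequency of each venue category, grouped by Neighborhood
# (numeric_only=True skips the string-valued coordinate columns)
toronto_grouped = toronto_venues_onehot.groupby('Neighborhood').mean(numeric_only=True).reset_index()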
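
> To illustrate what the one-hot encoding and the grouped mean produce, here is a small sketch on made-up data. The neighbourhood names and categories are invented; the point is that each resulting value is the share of a neighbourhood's venues that fall in a category.

```python
import pandas as pd

# Made-up venues: four in neighbourhood "A", two in neighbourhood "B"
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Park", "Park"],
})

# One indicator column per category (1.0 where the venue belongs to it)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the indicators is the relative frequency of each category
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies)
```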

> Grouping and Merging the geospatial data with the neighborhood-grouped data of Toronto 

> > (so that the latitude and longitude of each neighborhood can be passed to the Folium map)

In [25]:
# Grouping the Latitude and Longitude values according to each Neighborhood
grouped_coordinates = toronto_venues_onehot.groupby('Neighborhood').agg('max')[['Latitude', 'Longitude']].reset_index()

# Setting Latitude and Longitude on the previously grouped data
toronto_grouped['Latitude'] = grouped_coordinates['Latitude'].astype(float)
toronto_grouped['Longitude'] = grouped_coordinates['Longitude'].astype(float)

print(toronto_grouped.head(3))

      Neighborhood  Accessories Store  Adult Boutique  Afghan Restaurant  \
0         Adelaide               0.01             0.0                0.0   
1        Agincourt               0.00             0.0                0.0   
2  Agincourt North               0.00             0.0                0.0   

   Airport  Airport Food Court  Airport Gate  Airport Lounge  Airport Service  \
0      0.0                 0.0           0.0             0.0              0.0   
1      0.0                 0.0           0.0             0.0              0.0   
2      0.0                 0.0           0.0             0.0              0.0   

   Airport Terminal    ...      Video Game Store  Video Store  \
0               0.0    ...                   0.0          0.0   
1               0.0    ...                   0.0          0.0   
2               0.0    ...                   0.0          0.0   

   Vietnamese Restaurant  Warehouse Store  Wine Bar  Wings Joint  \
0                    0.0              0.0

> Taking out the top 10 venues for each neighborhood

In [0]:
def return_most_common_venues(row, num_top_venues):
  row_categories = row.iloc[1:]
  row_categories_sorted = row_categories.sort_values(ascending=False)

  return row_categories_sorted.index.values[0: num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
  try:
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
  except IndexError:
    columns.append('{}th Most Common Venue'.format(ind+1))

neighbourhoods_sorted = pd.DataFrame(columns=columns)
neighbourhoods_sorted[columns[0]] = toronto_grouped[columns[0]]

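# Ranking must read the venue frequencies in toronto_grouped, not the still-empty result frame;
# Latitude and Longitude are not venue categories, so they are left out of the ranking
venue_frequencies = toronto_grouped.drop(columns=['Latitude', 'Longitude'])
for index in np.arange(venue_frequencies.shape[0]):
  neighbourhoods_sorted.iloc[index, 1:] = return_most_common_venues(venue_frequencies.iloc[index, :], num_top_venues)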
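
> As a small illustration of the ranking step, the sketch below picks the most frequent categories from a made-up frequency table. `top_categories` is a hypothetical stand-in for the function above, and the values are invented.

```python
import pandas as pd

# Made-up frequency table in the same layout: name first, then one column per category
frequencies = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bar": [0.3, 0.1],
    "Cafe": [0.5, 0.2],
    "Park": [0.2, 0.7],
})


def top_categories(row, n):
    """Return the n category names with the highest frequency in one row."""
    values = row.iloc[1:].astype(float)  # skip the Neighborhood name
    return values.sort_values(ascending=False).index[:n].tolist()


top_two = {
    row["Neighborhood"]: top_categories(row, 2)
    for _, row in frequencies.iterrows()
}
print(top_two)
```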


> Performing **k-means clustering**

In [0]:
# setting number of clusters:
k_cluster_num = 5
# note: Latitude and Longitude are still part of the feature matrix at this point
temp_toronto_grouped = toronto_grouped.drop('Neighborhood', axis=1)

# Perform k-means clustering
k_means = KMeans(n_clusters=k_cluster_num, random_state=0).fit(temp_toronto_grouped)
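
> The feature matrix above keeps the coordinate columns alongside the venue frequencies. As a variation that may be worth trying, and not the approach used in this notebook, the sketch below clusters made-up data on the venue frequencies only. All names and values are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up neighbourhoods: A and B are cafe-heavy, C and D are park-heavy
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Latitude": [43.65, 43.80, 43.66, 43.79],
    "Longitude": [-79.38, -79.20, -79.39, -79.21],
    "Cafe": [0.9, 0.8, 0.1, 0.2],
    "Park": [0.1, 0.2, 0.9, 0.8],
})

# Keep only the venue-frequency columns as clustering features
features = toy.drop(columns=["Neighborhood", "Latitude", "Longitude"])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)
labels = model.labels_.tolist()
print(dict(zip(toy["Neighborhood"], labels)))
```

> In this toy setup the groups follow venue profile rather than location; whether that is preferable depends on what the clusters are meant to capture.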

In [0]:
# Creating a DataFrame with both Cluster-values and top-10 venues of each neighborhood
toronto_merged = toronto_grouped.copy()  # copy so that toronto_grouped is not modified below
toronto_merged['Cluster Labels'] = k_means.labels_
toronto_merged = toronto_merged.join(neighbourhoods_sorted.set_index('Neighborhood'), on='Neighborhood')

> Displaying everything on a geographical map using Folium

In [52]:
""" 
Creating map of Toronto's Neighborhood using Folium
"""

# Initializing Toronto's Geographical parameters
toronto_lat = float('43.7')
toronto_lng = float('-79.4')

# defining the Folium Map in a function
def create_map(toronto_merged):
  map_clusters = folium.Map(location=[toronto_lat, toronto_lng], zoom_start=11)

  # set color scheme for the clusters
  x = np.arange(k_cluster_num)
  ys = [i+x+(i*x)**2 for i in range(k_cluster_num)]
  colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
  rainbow = [colors.rgb2hex(i) for i in colors_array]

  # add markers to the map
  markers_colors = []
  for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
  return map_clusters

cluster_map = create_map(toronto_merged)
cluster_map
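
> The colour scheme in `create_map` samples the rainbow colormap once per cluster. A minimal sketch of that step in isolation is shown below. `color_for` is a hypothetical helper, and it indexes the palette directly by label, whereas the function above uses `cluster-1`, so the label-to-colour assignment differs.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, matching k_cluster_num above

# Sample k evenly spaced colours from the rainbow colormap and convert them to hex
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]


def color_for(cluster_label):
    """Hypothetical helper: map a cluster label (0..k-1) to its hex colour."""
    return palette[cluster_label]


print(palette)
```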

> **Exploring the Clusters**

---



> (i) Exploring the **first** cluster values of the Toronto data
>> The first cluster contains 82 points of interest clustered around Toronto, as can be seen in the map here:

In [54]:
first_cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(1, toronto_merged.shape[1]))]]
# print(len(first_cluster))
create_map(first_cluster)

> (ii) Analysis of the **second** cluster of the neighborhood:
>> The second cluster, as can be seen from the map here, contains 15 points of interest scattered around Toronto

In [55]:
second_cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(1, toronto_merged.shape[1]))]]
# print(len(second_cluster))
create_map(second_cluster)

> (iii) Exploring the **third** cluster
>> The third cluster contains only one point of interest in the Silver Hills neighborhood 

>> *Note: the output changed on the second iteration; the first iteration displayed a single point of interest in Scarborough Village*

In [62]:
third_cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(1, toronto_merged.shape[1]))]]
print(third_cluster['Neighborhood'])
create_map(third_cluster)

80    Silver Hills
Name: Neighborhood, dtype: object


> (iv) Analysing the fourth cluster
>> The fourth cluster also has a single point of interest, located in Scarborough Village

In [60]:
fourth_cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(1, toronto_merged.shape[1]))]]
print(fourth_cluster['Neighborhood'])
create_map(fourth_cluster)

79    Scarborough Village
Name: Neighborhood, dtype: object


> (v) Exploring the fifth cluster
>> The fifth and final cluster contains a single point in Highland Creek

In [61]:
fifth_cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(1, toronto_merged.shape[1]))]]
print(fifth_cluster['Neighborhood'])
create_map(fifth_cluster)

52    Highland Creek
Name: Neighborhood, dtype: object
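
> The cluster sizes quoted above can be read off in one step instead of filtering each cluster separately. A minimal sketch on made-up labels follows; on the real data the same call would be applied to `toronto_merged['Cluster Labels']`.

```python
import pandas as pd

# Made-up cluster labels standing in for the 'Cluster Labels' column
labels = pd.Series([0, 0, 1, 0, 2, 1], name="Cluster Labels")

# Count the members of each cluster, ordered by cluster label
cluster_sizes = labels.value_counts().sort_index()
print(cluster_sizes)
```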
