
###Import dependencies


In [20]:
import pandas as pd
import numpy as np
!pip install geocoder
import geocoder
import folium
import requests
import time
from sklearn.cluster import KMeans


### Scrape Web Data
Using the built-in pandas `read_html` function, I read the first table (index 0) from the web page and then remove all rows where Borough is 'Not assigned'.

In [21]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]
df = df[df['Borough']!= 'Not assigned']
df.reset_index(drop=True, inplace=True)
df.shape

(103, 3)

###Initial Observations
The DataFrame contains three columns: the postal code, the borough it covers, and the neighbourhoods within it.

Each borough has multiple neighbourhoods and postal codes assigned to it.

In [22]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
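
To check the observation above, here is a minimal sketch on the five rows shown. The `sample` frame is a hypothetical hand-typed copy of those rows, not the full scraped table, so the counts only describe this preview.

```python
import pandas as pd

# Hypothetical sample: the five rows displayed above, re-typed by hand.
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village',
                      'Regent Park, Harbourfront',
                      'Lawrence Manor, Lawrence Heights',
                      "Queen's Park, Ontario Provincial Government"],
})

# Number of distinct postal codes assigned to each borough.
codes_per_borough = sample.groupby('Borough')['Postal Code'].nunique()
print(codes_per_borough)
```

Running the same `groupby` on the full `df` would give the per-borough counts for all 103 rows.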


###Lat & Long
Now we iterate through the DataFrame, using geocoder to look up the latitude and longitude of each postal code. When I stored the values as floats, the last few digits were not shown, so I store them as strings truncated to six decimal places.

In [23]:
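df['Latitude'] = ""
df['Longitude'] = ""

for index, row in df.iterrows():
  latlong = None
  # Retry until ArcGIS returns coordinates (note: this loop has no timeout).
  while latlong is None:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(row['Postal Code']))
    latlong = g.latlng
  lat = str(latlong[0])
  long = str(latlong[1])
  # Write through df.at: assigning to `row` may only change a copy.
  df.at[index, 'Longitude'] = long[:10]  # sign + 2 digits + '.' + 6 decimals
  df.at[index, 'Latitude'] = lat[:9]     # 2 digits + '.' + 6 decimals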
 
df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75188 | -79.330359 |
| 1 | M4A | North York | Victoria Village | 43.73042 | -79.312819 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65514 | -79.362649 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72321 | -79.451409 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66449 | -79.393019 |
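
The digits that seemed to disappear when using floats are probably a display effect rather than lost data: pandas prints floats with a limited display precision, while the stored value keeps its full precision. The sketch below illustrates this with a made-up latitude (the value is an assumption, not a geocoder result) and shows `round` as a numeric alternative to string slicing.

```python
import pandas as pd

lat = 43.751880123456  # hypothetical value, for illustration only

s = pd.Series([lat])
print(s)                   # printed using pandas' display precision
stored = s.iloc[0]         # the stored float itself is unchanged

rounded = round(lat, 6)    # numeric alternative: keep six decimal places
truncated = str(lat)[:9]   # the string-slicing approach used in the cell above
print(rounded, truncated)
```

Keeping the coordinates numeric avoids the later `float(...)` conversions when plotting.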


###Splitting into Neighborhoods and Getting Fresh Lat and Long

The DataFrame above holds the latitude and longitude for each postal code; now we need the same for each neighborhood. First we create a new DataFrame with one row per neighborhood, then look up its latitude and longitude, and afterwards search for venues around it. I added a 10-second timeout so the loop does not hang if ArcGIS cannot find a location.
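
The one-row-per-neighborhood split described above can also be sketched without an explicit loop, using pandas' `str.split` and `explode`. The `sample` frame is a hypothetical two-row stand-in for `df`; this is an alternative illustration, not the approach the notebook itself takes.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as df.
sample = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park, Harbourfront', 'Parkwoods'],
})

# Split the comma-separated names into lists, then emit one row per list item.
per_neighbourhood = (
    sample.assign(Neighbourhood=sample['Neighbourhood'].str.split(', '))
          .explode('Neighbourhood')
          .drop_duplicates()
          .reset_index(drop=True)
)
print(per_neighbourhood)
```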

In [24]:
# Build one row per (borough, neighborhood) pair.
rows = []
for index, row in df.iterrows():
  n_str = row['Neighbourhood']
  # A cell may list several neighborhoods separated by ", ";
  # non-string values (e.g. NaN) are kept as a single entry.
  n_list = n_str.split(", ") if isinstance(n_str, str) else [n_str]
  for n in n_list:
    rows.append({'Borough': row['Borough'], 'Neighborhood': n})

n_df = pd.DataFrame(rows, columns=['Borough', 'Neighborhood'])
n_df = n_df.drop_duplicates()
n_df['Latitude'] = ""
n_df['Longitude'] = ""

for index, row in n_df.iterrows():
  start = time.time()
  latlong = None
  # Retry for at most 10 seconds per neighborhood.
  while (time.time() - start) < 10.0 and latlong is None:
    g = geocoder.arcgis('{}, {}, Toronto, Ontario'.format(row['Neighborhood'], row['Borough']))
    latlong = g.latlng
  if latlong is not None:
    lat = str(latlong[0])
    long = str(latlong[1])
    # Write through n_df.at: assigning to `row` may only change a copy.
    n_df.at[index, 'Longitude'] = long[:10]
    n_df.at[index, 'Latitude'] = lat[:9]

n_df.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | North York | Parkwoods | 43.758895 | -79.320322 |
| 1 | North York | Victoria Village | 43.73154 | -79.314279 |
| 2 | Downtown Toronto | Regent Park | 43.66069 | -79.360309 |
| 3 | Downtown Toronto | Harbourfront | 43.63923 | -79.383069 |
| 4 | North York | Lawrence Manor | 43.72294 | -79.431159 |
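
The timeout idea used in the lookup loop can be pulled into a small reusable helper. This is a sketch under assumptions: `retry_until` and `fake_lookup` are hypothetical names introduced here, and the fake lookup stands in for a real geocoder call so the example runs offline.

```python
import time

def retry_until(fetch, timeout=10.0, pause=0.5):
    """Call fetch() until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None or time.monotonic() >= deadline:
            return result
        time.sleep(pause)  # short pause so the service is not hammered

# Stand-in for a geocoder call: fails twice, then returns coordinates.
attempts = []
def fake_lookup():
    attempts.append(1)
    return [43.66069, -79.360309] if len(attempts) >= 3 else None

coords = retry_until(fake_lookup, timeout=5.0, pause=0.01)
never = retry_until(lambda: None, timeout=0.05, pause=0.01)
print(coords, never)
```

In this notebook, `fetch` could be something like `lambda: geocoder.arcgis(query).latlng`, with `None` signalling that no location was found in time.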


###Get Venues from FourSquare
Using the Foursquare API, I will now get the venues within 1 km of each set of coordinates, which is easily within walking distance. Then I will count how many venues of each category fall within each area. This shows how much access each area has to the different venue categories.

As the results below show (if somewhat messily), we now have the number of venues of each type.

In [25]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100
radius = 1000
 
# create URL
def create_url(lat, long):
  url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
      CLIENT_ID, 
      CLIENT_SECRET, 
      VERSION, 
      lat, 
      long, 
      radius, 
      LIMIT)
  return url
 
 
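rows = []
for index, row in n_df.iterrows():
  temp = row.copy()
  url = create_url(row['Latitude'], row['Longitude'])
  try:
    results = requests.get(url).json()["response"]['groups'][0]['items']
    for res in results:
      name = str(res['venue']['categories'][0]['name'])
      try:
        # Increment the running count for this category.
        temp[name] = temp[name] + 1
      except (KeyError, TypeError):
        # First venue of this category. A category that shares its name with
        # an existing column (e.g. 'Neighborhood') also lands here and
        # overwrites that column with 1.
        temp[name] = 1
    rows.append(temp)
  except Exception:
    # The request failed or the response had an unexpected shape.
    print("Skipping {}".format(index))
# One row per neighborhood; categories not seen near a neighborhood count as 0.
res_df = pd.DataFrame(rows).reset_index(drop=True)
res_df.fillna(value=0, inplace=True)
# Drop rows whose name was overwritten by a category called 'Neighborhood'.
clean_df = res_df[res_df['Neighborhood'] != 1]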
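
The counting step can be illustrated offline with `collections.Counter`. The `items` list below is hypothetical data shaped like the `items` list the cell above reads from the API response; `count_categories` is a helper name introduced only for this sketch.

```python
from collections import Counter
import pandas as pd

# Hypothetical items, shaped like the 'items' list read in the cell above.
items = [
    {'venue': {'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'categories': [{'name': 'Park'}]}},
    {'venue': {'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'categories': []}},  # venue with no category
]

def count_categories(items):
    """Count venues per primary category, skipping venues without one."""
    return Counter(
        item['venue']['categories'][0]['name']
        for item in items
        if item['venue']['categories']
    )

counts = count_categories(items)

# One row per neighborhood; categories never seen there are filled with 0.
venue_counts = pd.DataFrame(
    [dict(counts), {'Pharmacy': 1}],
    index=['Regent Park', 'Harbourfront'],
).fillna(0).astype(int)
print(venue_counts)
```

Keeping the counts in their own table, indexed by neighborhood, means a category name can never collide with a column such as `Neighborhood` or `Borough`.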

###Cleaning Data & Analysing
Now that we have the venue counts for each neighborhood, we use k-means to cluster the data into 6 clusters. We then look at the returned DataFrame for an initial analysis.
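
As a rough illustration of what k-means does with venue counts, here is a toy sketch with two clusters. The `counts` matrix is invented for the example, and `n_init` is passed explicitly only because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue counts: rows are neighborhoods,
# columns are (coffee shops, parks, bars).
counts = np.array([
    [9, 0, 8],
    [8, 1, 9],
    [0, 7, 0],
    [1, 8, 1],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts)
labels = model.labels_  # one cluster label per row of `counts`
print(labels)
```

Rows with similar venue profiles receive the same label; the label numbers themselves carry no meaning beyond identifying a group.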

In [26]:
pd.set_option('display.max_columns', None)
meta_cols = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Copy the filtered frame so later edits do not trigger SettingWithCopyWarning.
clean_df = res_df[res_df['Neighborhood'] != 1].copy()

# Cluster on the venue-category counts only.
cluster_df = clean_df.drop(columns=meta_cols)
kmeans = KMeans(n_clusters=6, random_state=0)
kmeans = kmeans.fit(cluster_df)

# Result table: location columns plus the five most common venue categories.
venue_cat = 5
top_cols = [str(n) for n in range(1, venue_cat + 1)]
processed_df = clean_df[meta_cols].copy()
for col in top_cols:
  processed_df[col] = ""

for index, row in cluster_df.iterrows():
  venues = row.sort_values(ascending=False)
  processed_df.loc[index, top_cols] = list(venues.index[:venue_cat])

# Attach the cluster label generated for each row and preview each cluster.
processed_df.insert(4, 'Cluster Labels', kmeans.labels_)
for n in range(0, 6):
  print(processed_df[processed_df['Cluster Labels'] == n].head())


       Borough        Neighborhood   Latitude   Longitude  Cluster Labels  \
0   North York           Parkwoods  43.758895  -79.320322               0   
12  North York           Don Mills  43.705685  -79.333856               0   
17  North York           Glencairn  43.708432  -79.446448               0   
28        York  Humewood-Cedarvale  43.689420  -79.426979               0   
40   East York             Leaside  43.700236  -79.351065               0   

                1            2                     3                   4  \
0     Coffee Shop  Pizza Place           Supermarket            Pharmacy   
12           Park  Coffee Shop          Skating Rink  Turkish Restaurant   
17  Grocery Store  Coffee Shop  Fast Food Restaurant  Italian Restaurant   
28    Pizza Place  Coffee Shop    Italian Restaurant          Restaurant   
40    Coffee Shop         Park              Bus Line   Indian Restaurant   

                    5  
0      Discount Store  
12  Afghan Restaurant  
17      

As we can see from the above data, cluster 2 may appeal more to the 30+ age range, having the most bars and restaurants, whereas cluster 4 has more shopping available.
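
One way to back up this kind of reading is to average the venue counts per cluster. The sketch below uses an invented `toy` table (the numbers are not the notebook's results) to show the pattern.

```python
import pandas as pd

# Hypothetical counts and cluster labels, for illustration only.
toy = pd.DataFrame({
    'Cluster Labels': [2, 2, 4, 4],
    'Bar': [6, 4, 1, 0],
    'Restaurant': [7, 9, 2, 2],
    'Clothing Store': [1, 0, 5, 7],
})

# Average number of venues of each category per cluster.
profile = toy.groupby('Cluster Labels').mean()
# Category with the highest average count in each cluster.
dominant = profile.idxmax(axis=1)
print(profile)
print(dominant)
```

Applied to the real count table with `kmeans.labels_` attached, the same `groupby` would show which categories characterise each of the six clusters.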

###Visualization

Now, using folium, we will visualise these neighborhood clusters. We can see from the map below that cluster 2 is right in the centre of the city, with clusters 3 and 4 to the west and north of it respectively. In the suburbs we mostly find clusters 3 and 0, with a few neighborhoods from cluster 1.

In [57]:
# Map centre: downtown Toronto.
toronto_lat = 43.6532
toronto_long = -79.3832

cluster_map = folium.Map(location=[toronto_lat, toronto_long], zoom_start=11)
# One marker colour per cluster label (0-5).
col_list = ['red', 'cadetblue', 'darkpurple', 'pink', 'black', 'green']
for index, row in processed_df.iterrows():
  label_str = "Cluster {}, {}".format(row['Cluster Labels'], row['Neighborhood'])
  icon_data = folium.Icon(color=col_list[int(row['Cluster Labels'])])
  lat, long = float(row['Latitude']), float(row['Longitude'])
  folium.Marker(location=[lat, long], popup=label_str, icon=icon_data).add_to(cluster_map)

cluster_map