# Week 3 Peer Graded Assignment

## Matthew Sullivan

This notebook contains the assigned code for my week 3 assignment.

## Overview
(Compiled from excerpts of the assignment description)

In this notebook, I will explore, segment, and cluster the neighborhoods in the city of Toronto as assigned in the IBM . However, unlike New York, the neighborhood data is not readily available on the internet. The intent of the assignment is to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. I will scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset used in the week 3 lab.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.




# Part 1 - Reproducing Table 1

## Instructions
For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new Notebook for this assignment (which is this notebook).

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below.

To create the above dataframe:

1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

7. Submit a link to your Notebook on your Github repository. (This notebook was accessed via my GitHub)


## Installing Required Libraries
This cell installs the additional libraries used in this notebook. It sits in its own cell so that it does not have to be re-run when other code is executed. Because installing libraries takes a long time, this makes it quick to re-run the rest of the code as necessary.

In [77]:
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed (e.g. from the Foursquare API lab)
!python -m pip install requests BeautifulSoup4
!conda install -c conda-forge folium=0.5.0 --yes

print("Done installing required libraries")

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Done installing required libraries


## Importing Modules

I find it cleaner to import modules at the top of a notebook in a separate cell. This avoids importing the same library more than once, and makes for cleaner code.



In [78]:
#this is my library of functions for scraping from wikipedia

import json
import pickle
import re
import uuid
from contextlib import closing
from decimal import Decimal

import requests
from requests import get
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

# show every column and row when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# I place a print statement at the end of cells that do not produce output to make it easy to know that the code execution has completed
print ("done with importing libraries")

done with importing libraries


## Defining functions

I like to use functions a lot when writing Python, and as with importing libraries I like to place them early in the notebook. This might seem like over-engineering, and some of the functions have barely any actual code, but I create functions for two purposes:

1. To avoid duplication of code. Duplicated code is harder to maintain because when you improve or update the code, you have to remember to do it in multiple places.

2. To make procedural code easier to read. When a data scientist is examining my code to determine the procedure that was used to perform a statistical analysis, having that procedure be more "step by step" lets them focus on the analysis process rather than the Python syntax.

### Note
This is **not** the prettiest code I have ever written. If I were to spend more time on this, I would give the functions better names and make them less implementation specific. But I still wanted to achieve those two objectives.

Also, there is a lot of code included for performing the New York data analysis, and also for scraping using a different Wiki page. I include this because I may be using it in future assignments.

**I know this is a lot of code to scroll through**, especially if the audience is not interested in the code. When I am reporting for information consumption, it would not matter because **in a report I would hide code cells like this**.

In a notebook shared for data analysis reporting purposes, and not for an assignment, I would use Jupyter's functions for hiding code, i.e.

{ "tags": [ "hide_input" ] }

In [79]:
def get_raw_xhtml(url):
    try:
        with closing(get(url, stream=True)) as resp:
            if is_xhtml(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_xhtml(resp):
    # a missing Content-Type header is treated as "not HTML" instead of raising a KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and content_type.find('html') > -1)

def log_error(e):
    print(e)

def get_wikipedia_link_tuple(link):
    base_url = 'https://en.wikipedia.org'
    return (base_url + link['href'],link.text)

def get_wikipedia_url_from_link(raw_link):
    base_url = 'https://en.wikipedia.org'
    return (base_url + raw_link['href'])


def get_wiki_body_div(page_url):
    base_url = 'https://en.wikipedia.org/'
    full_page_url = base_url + page_url
    page_html = get_raw_xhtml(full_page_url)
    page_parsed_html = BeautifulSoup(page_html, 'html.parser')
    body_div = page_parsed_html.find('div',{"class":"mw-body"})
    return body_div

def create_neighborhood_feature(section, name, latitude, longitude):
    top_left = latitude
    top_right = longitude
    bottom_left = latitude * -1
    bottom_right = longitude * -1
    feature_id = uuid.uuid4().int
    stacked = 1
    annoangle = Decimal("0.00000000000")
    # lists keep their order (a set does not); GeoJSON stores a point as [longitude, latitude],
    # which is the order get_neighborhoods_dataframe() reads the values back in
    coordinates = [longitude, latitude]
    bbox = [top_left, top_right, bottom_left, bottom_right]
    geometry = {"type": "Point", "coordinates": coordinates}
    properties = {"name": name, "stacked": stacked, "annoline1": name, "annoline2": None, "annoline3": None, "annoangle": annoangle, "section": section, "bbox": bbox}
    feature = {"type": "feature", "id": feature_id, "geometry": geometry, "geometry_name": "geom", "properties": properties}
    return feature

# every city will need its own parse function because the neighborhoods pages are not consistent

def return_toronto_json():
    data_object = scrape_toronto()
    with open('toronto.pickle', 'wb') as handle:
        pickle.dump(data_object, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('toronto.pickle', 'rb') as handle:
        data_obj_json = pickle.load(handle)
    return data_obj_json

def scrape_toronto():
    body_div = get_wiki_body_div('wiki/List_of_neighbourhoods_in_Toronto')
    #get the city sections - there are four of them
    link_tables = body_div.find_all("table",{"class":"multicol"})    
    section_h4s = body_div.find_all("h4")
    features = []
    for section_number in range(3):
        section_name = section_h4s[section_number].find('span',{'class':'mw-headline'}).text
        link_table = link_tables[section_number]
        links = link_table.find_all('a')
        for link in links:
            neighborhood_name = neighborhood_url = neighborhood_latitude = neighborhood_longitude = None
            neighborhood_name = link.text
            neighborhood_url = get_wikipedia_url_from_link(link)
            brief_link = link['href']
            neighborhood_div = get_wiki_body_div(brief_link)
            #It turns out simply searching geolocator by neighborhood is more reliable than parsing the coordinates from wikipedia, and a lot less work.
            neighborhood_location_obj = get_location_object(neighborhood_name + ", Toronto, ON")
            if (neighborhood_location_obj is not None):
                neighborhood_latitude = neighborhood_location_obj.latitude
                neighborhood_longitude = neighborhood_location_obj.longitude
            #skip adding features with missing data. Make this safe in case there is something wrong with geolocate
            if None not in (section_name, neighborhood_name, neighborhood_location_obj, neighborhood_latitude, neighborhood_longitude) and "" not in (section_name, neighborhood_name, neighborhood_location_obj, neighborhood_latitude, neighborhood_longitude):
                this_feature = create_neighborhood_feature(section_name, neighborhood_name, neighborhood_latitude, neighborhood_longitude)
                features.append(this_feature)
    feature_count = len(features)
    data_object = {"type": "FeatureCollection","totalFeatures": feature_count,"features": features}
    return data_object

#the point of this is it allows script code to always get Toronto data in the same manner, but allows the library to change the implementation should better data become available, or should we choose to put the JSON on a server
def get_toronto_data():
    toronto_data = return_toronto_json()
    return toronto_data

#this is made into a function to make the 
def download_new_york_json():
    !wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
    with open('newyork_data.json') as json_data:
        newyork_data = json.load(json_data)
        #replace key 'borough' with 'section'
        for neighborhood_data in newyork_data['features']:
            neighborhood_data['properties']['section'] = neighborhood_data['properties'].pop('borough')
    return newyork_data

# the point here is to keep the interface the same, but allow the implementation to change if a better datasource is available
def get_new_york_data():
    new_york_data = download_new_york_json()
    return new_york_data


def get_nearby_venues(names, latitudes, longitudes, radius=500, limit=100):
    # CLIENT_ID and CLIENT_SECRET must be defined before this function is called
    #CLIENT_ID = 'REDACTED' # your Foursquare ID
    #CLIENT_SECRET = 'REDACTED' # your Foursquare Secret
    VERSION = '20180605' # Foursquare API version
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

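# build the geocoder once so the rate limiter can space out successive lookups
geolocator = Nominatim(user_agent="matthewgsullivan_week3_assignment")
rate_limited_geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

#this needs an address
def get_location_object(place):
    location = None
    try:
        # call the rate-limited wrapper (the original created it but called geolocator.geocode directly)
        location = rate_limited_geocode(place)
    except Exception:
        log_error("error encountered looking up {}".format(place))
    return location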

def get_neighborhoods_dataframe(data_object):
    # define the dataframe columns
    column_names = ['Section', 'Neighborhood', 'Latitude', 'Longitude']
    rows = []
    for data in data_object:
        section = data['properties']['section']
        neighborhood_name = data['properties']['name']
        # GeoJSON order: index 0 is read as longitude, index 1 as latitude
        neighborhood_latlon = list(data['geometry']['coordinates'])
        neighborhood_lat = neighborhood_latlon[1]
        neighborhood_lon = neighborhood_latlon[0]
        rows.append({'Section': section,
                     'Neighborhood': neighborhood_name,
                     'Latitude': neighborhood_lat,
                     'Longitude': neighborhood_lon})
    # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
    neighborhoods = pd.DataFrame(rows, columns=column_names)
    return neighborhoods

def get_map(location, zoom_start=10):
    #this expects a location object. Use get_location_object()
    # pass the zoom_start argument through instead of hard-coding 10
    map_object = folium.Map(location=[location.latitude, location.longitude], zoom_start=zoom_start)
    return map_object

def add_neighborhood_markers_to_map(this_map, this_neighborhoods_dataframe):
    # add markers to map
    for lat, lng, borough, neighborhood in zip(this_neighborhoods_dataframe['Latitude'], this_neighborhoods_dataframe['Longitude'], this_neighborhoods_dataframe['Section'], this_neighborhoods_dataframe['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(this_map)  

def get_sorted_venues_dataframe(df_grouped, num_top_venues = 10):
    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']

    for ind in np.arange(df_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

    return neighborhoods_venues_sorted

def get_clustering_df(grouped_df):
    if('Neighborhood' in grouped_df.columns):
        grouped_clustering_df = grouped_df.drop('Neighborhood', axis=1)
    else:
        grouped_clustering_df = grouped_df
    return grouped_clustering_df
        


def get_clusters(grouped_clustering_df, number_of_clusters = 5):
    kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(grouped_clustering_df)
    return kmeans

def get_clustered_neighborhood_df_by_kmeans(kmeans, sorted_venue_df, neighborhoods_df):
    if 'Cluster Labels' not in sorted_venue_df.columns:
        sorted_venue_df.insert(0, 'Cluster Labels', kmeans.labels_)
    neighborhoods_merged = neighborhoods_df
    neighborhoods_merged = neighborhoods_merged.join(sorted_venue_df.set_index('Neighborhood'), on='Neighborhood')
    return neighborhoods_merged

def get_clustered_neighborhood_df(grouped_df, sorted_venue_df, neighborhoods_df, number_of_clusters=5):
    if('Neighborhood' in grouped_df.columns):
        grouped_clustering_df = grouped_df.drop('Neighborhood', axis=1)
    else:
        grouped_clustering_df = grouped_df
    kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(grouped_clustering_df)
    if 'Cluster Labels' not in sorted_venue_df.columns:
        sorted_venue_df.insert(0, 'Cluster Labels', kmeans.labels_)
    neighborhoods_merged = neighborhoods_df
    neighborhoods_merged = neighborhoods_merged.join(sorted_venue_df.set_index('Neighborhood'), on='Neighborhood')
    return neighborhoods_merged

def create_clustered_map(location,kclusters, clustered_neighborhood_df):
    # create map
    map_clusters = folium.Map(location=[location.latitude, location.longitude], zoom_start=11)

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(clustered_neighborhood_df['Latitude'], clustered_neighborhood_df['Longitude'], clustered_neighborhood_df['Neighborhood'], clustered_neighborhood_df['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters



def scrape_toronto_by_zip():
    body_div = get_wiki_body_div('wiki/List_of_postal_codes_of_Canada:_M')
    #get the postal code table. There is only one
    link_table = body_div.find("table",{"class":"wikitable sortable"})
    link_rows = link_table.find_all('tr')
    postal_codes_column_names = ['PostalCode','Borough','Neighborhood']
    postalcodes_combined_neighborhoods_columns = ['PostalCode','Borough','Neighborhood','Latitude','Longitude']
    postalcode_rows = []
    for this_row in link_rows:
        this_row_cells = this_row.find_all('td')
        if len(this_row_cells) != 0:
            this_postcode = this_row_cells[0].text
            #is the borough assigned
            this_borough = this_row_cells[1].text
            this_neighbourhood = this_row_cells[2].text.replace("\n","")
            if this_borough != "Not assigned":
                if "Not assigned" in this_neighbourhood:
                    this_neighbourhood = this_borough
                postalcode_rows.append({'PostalCode':this_postcode,'Borough':this_borough,'Neighborhood':this_neighbourhood})
    # build each dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
    postalcodes_df = pd.DataFrame(postalcode_rows, columns=postal_codes_column_names)
    combined_rows = []
    for unique_postal_code in postalcodes_df['PostalCode'].unique():
        unique_postal_code_location = get_location_object("{}, Toronto, ON".format(unique_postal_code))
        neighborhoods_in_this_postal_code = postalcodes_df.loc[postalcodes_df['PostalCode'] == unique_postal_code]
        neighborhoods_combined = ','.join(neighborhoods_in_this_postal_code['Neighborhood'].unique().tolist())
        #this should not be needed, but just in case
        boroughs_combined = ','.join(neighborhoods_in_this_postal_code['Borough'].unique().tolist())
        #the lookup can return None, so guard before reading the coordinates
        latitude = longitude = None
        if unique_postal_code_location is not None:
            latitude = unique_postal_code_location.latitude
            longitude = unique_postal_code_location.longitude
        combined_rows.append({'PostalCode':unique_postal_code,'Borough':boroughs_combined,'Neighborhood':neighborhoods_combined,'Latitude':latitude,'Longitude':longitude})
    return pd.DataFrame(combined_rows, columns=postalcodes_combined_neighborhoods_columns)

print('finished loading functions')

finished loading functions


## Building the dataframe 

This is procedural code for generating the first table. I also include this code in my function declarations, but I am placing it here as a procedure to demonstrate the code which completed the objectives of part 1 of this assignment.


In [90]:
body_div = get_wiki_body_div('wiki/List_of_postal_codes_of_Canada:_M')
#get the postal code table. There is only one
link_table = body_div.find("table",{"class":"wikitable sortable"})
link_rows = link_table.find_all('tr')
postal_codes_column_names = ['PostalCode','Borough','Neighborhood']
postalcode_rows = []
for this_row in link_rows:
    this_row_cells = this_row.find_all('td')
    if len(this_row_cells) != 0:
        this_postcode = this_row_cells[0].text
        #is the borough assigned
        this_borough = this_row_cells[1].text
        this_neighbourhood = this_row_cells[2].text.replace("\n","")
        if this_borough != "Not assigned":
            if "Not assigned" in this_neighbourhood:
                this_neighbourhood = this_borough
            postalcode_rows.append({'PostalCode':this_postcode,'Borough':this_borough,'Neighborhood':this_neighbourhood})
# build each dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
postalcodes_df = pd.DataFrame(postalcode_rows, columns=postal_codes_column_names)
combined_rows = []
for unique_postal_code in postalcodes_df['PostalCode'].unique():
    neighborhoods_in_this_postal_code = postalcodes_df.loc[postalcodes_df['PostalCode'] == unique_postal_code]
    neighborhoods_combined = ','.join(neighborhoods_in_this_postal_code['Neighborhood'].unique().tolist())
    #this should not be needed, but just in case
    boroughs_combined = ','.join(neighborhoods_in_this_postal_code['Borough'].unique().tolist())
    combined_rows.append({'PostalCode':unique_postal_code,'Borough':boroughs_combined,'Neighborhood':neighborhoods_combined})
postalcodes_combined_df = pd.DataFrame(combined_rows, columns=postal_codes_column_names)
postalcodes_combined_df.set_index('PostalCode', inplace=True)
postalcodes_combined_df.head()

| PostalCode | Borough | Neighborhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Harbourfront,Regent Park |
| M6A | North York | Lawrence Heights,Lawrence Manor |
| M7A | Queen's Park | Queen's Park |
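
As a possible alternative to the row-by-row loop, the three cleaning rules from the instructions (skip unassigned boroughs, fall back to the borough name, combine neighborhoods per postal code) can also be expressed with pandas filtering and `groupby`. This is only a sketch: the four sample rows are made up to mimic the scraped table, and the variable names `raw`, `assigned`, and `combined` are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows mimicking the scraped table (not the full Wikipedia data)
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 2: drop rows whose borough is not assigned
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 4: an unassigned neighborhood takes the borough name
unassigned = assigned["Neighborhood"] == "Not assigned"
assigned.loc[unassigned, "Neighborhood"] = assigned.loc[unassigned, "Borough"]

# Rule 3: one row per postal code, neighborhoods joined with commas
combined = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` assumes each postal code belongs to a single borough, which is what the loop's "just in case" comment also expects.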


## Using the .shape method to print the number of rows of your dataframe.

Below is the shape of the dataframe. The first item in the tuple is the number of rows, so I have referenced it explicitly.

In [89]:
print("There are {} rows in table 1".format(postalcodes_combined_df.shape[0]))

There are 103 rows in table 1


# Part 2 - Creating Table 2

Part two of this assignment is to generate a pandas dataframe like below (see the image from the assignment details).


### Code to create the table using GeoLocate

As predicted in the assignment details, the geolocate function from geopy was problematic for me. I found that looking up latitude and longitude by postalcode was unreliable.

Here is the original code in my function library that I use to look up addresses:

>def get_location_object(place):  
geolocator = Nominatim(user_agent="matthewgsullivan_week3_assignment")  
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  
location = geolocator.geocode(place)  
return location


Oftentimes, that would return a None type object (i.e. null). Here is the code that the instructor recommended I use to work around this.

>location = None  
while location == None:  
location = geolocator.geocode(place)

The problem with that approach is that it never stops retrying when geopy returns None intentionally, for example because it cannot find the address. With repeated calls, I also found that my rate limit would eventually be reached. So I limited the total lookups to 10 attempts.

>def get_location_object(place):  
geolocator = Nominatim(user_agent="matthewgsullivan_week3_assignment")  
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  
location = None  
attempts = 0  
while (location == None) and (attempts < 10):  
location = geolocator.geocode(place)  
attempts = attempts + 1  
return location
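
The bounded-retry idea can be sketched with proper indentation as follows. This is an illustration rather than the code in my function library: `geocode_with_retries` and `fake_geocode` are hypothetical names, and the geocoder is passed in as a plain callable so the example runs without a network connection.

```python
import time


def geocode_with_retries(geocode, place, max_attempts=10, delay_seconds=1.0):
    """Call geocode(place) until it returns a result or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            location = geocode(place)
        except Exception as error:  # e.g. a timeout raised by the geocoding service
            print("lookup of {} failed: {}".format(place, error))
            location = None
        if location is not None:
            return location
        # pause between attempts, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None


# Stand-in geocoder for illustration: returns None twice, then a fake result
calls = {"count": 0}


def fake_geocode(place):
    calls["count"] += 1
    return None if calls["count"] < 3 else (place, (43.65, -79.38))


result = geocode_with_retries(fake_geocode, "M5A, Toronto, ON", delay_seconds=0)
print(result, calls["count"])
```

With geopy, the callable passed in could be the `RateLimiter(geolocator.geocode, min_delay_seconds=1)` wrapper, so that retries also respect the delay between requests.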

I think this problem is inherent to the assigned approach of looking up coordinates by PostalCode. When I look up by neighborhood name, the geocoder returns data more reliably.

Nevertheless, here is the code for creating the assigned dataframe by looking up the longitude / latitude by postalcode.

**As you can see, the code results in an error, and if that was the only approach available, I would use a python "try" block to capture and handle this error.**










In [18]:
postalcodes_combined_neighborhoods_columns = ['PostalCode','Borough','Neighborhood','Latitude','Longitude']
combined_neighborhood_rows = []
for unique_postal_code in postalcodes_df['PostalCode'].unique():
    unique_postal_code_location = get_location_object("{}, Toronto, ON".format(unique_postal_code))
    neighborhoods_in_this_postal_code = postalcodes_df.loc[postalcodes_df['PostalCode'] == unique_postal_code]
    neighborhoods_combined = ','.join(neighborhoods_in_this_postal_code['Neighborhood'].unique().tolist())
    #this should not be needed, but just in case
    boroughs_combined = ','.join(neighborhoods_in_this_postal_code['Borough'].unique().tolist())
    unique_postal_code_latitude = unique_postal_code_longitude = None
    if unique_postal_code_location is not None:
        unique_postal_code_latitude = unique_postal_code_location.latitude
        unique_postal_code_longitude = unique_postal_code_location.longitude
    unique_postalcode_df_row = {'PostalCode':unique_postal_code,'Borough':boroughs_combined,'Neighborhood':neighborhoods_combined,'Latitude':unique_postal_code_latitude,'Longitude':unique_postal_code_longitude}
    combined_neighborhood_rows.append(unique_postalcode_df_row)
# build the dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
postalcodes_combined_neighborhoods_df = pd.DataFrame(combined_neighborhood_rows, columns=postalcodes_combined_neighborhoods_columns)
postalcodes_combined_neighborhoods_df.set_index('PostalCode', inplace=True)

## Retrieving the data from the CSV

As you can see in the error above, the code has problems because the geolocate function of geopy is flaky, which the instructor warned might happen.

The workaround the instructor provided is a CSV with the data. Here is the code to consume that CSV.


In [60]:
# import csv
# !wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
# print('Data downloaded!')
postalcodes_csv_df = pd.read_csv("Geospatial_Coordinates.csv") 
postalcodes_csv_df.set_index('Postal Code', inplace=True)    
for index, row in postalcodes_csv_df.iterrows():
    postalcodes_combined_neighborhoods_df.loc[index,'Latitude'] = row['Latitude']
    postalcodes_combined_neighborhoods_df.loc[index,'Longitude'] = row['Longitude']
postalcodes_combined_neighborhoods_df

| PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| M5A | Downtown Toronto | Harbourfront,Regent Park | 43.65426 | -79.360636 |
| M6A | North York | Lawrence Heights,Lawrence Manor | 43.718518 | -79.464763 |
| M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| M4B | East York | Woodbine Gardens,Parkview Hill | 43.706397 | -79.309937 |
| M5B | Downtown Toronto | Ryerson,Garden District | 43.657162 | -79.378937 |
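
Filling the coordinates from the CSV row by row works, but the same result can probably be had with an index-aligned join. The sketch below assumes the CSV has the columns `Postal Code`, `Latitude`, and `Longitude` (as used in the cell above); the two-row CSV excerpt and the `neighborhoods` frame are made-up stand-ins so the example runs without the downloaded file.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the same layout as Geospatial_Coordinates.csv
csv_text = """Postal Code,Latitude,Longitude
M3A,43.753259,-79.329656
M5A,43.65426,-79.360636
"""
coordinates = pd.read_csv(io.StringIO(csv_text)).set_index("Postal Code")

# Stand-in for the combined neighborhoods dataframe, indexed by postal code
neighborhoods = pd.DataFrame(
    {
        "Borough": ["North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Harbourfront,Regent Park"],
    },
    index=pd.Index(["M3A", "M5A"], name="PostalCode"),
)

# Left join on the index: every postal code keeps its row and picks up its coordinates
with_coordinates = neighborhoods.join(coordinates, how="left")
print(with_coordinates)
```

A left join keeps postal codes that are missing from the CSV (their coordinates become NaN) instead of silently dropping them.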


# Part 3 - Replicating the New York Data Analysis

In this section, I reproduce the k-means clustering done in the previous lab. The one thing I opted to do differently is to use a different data source.

I found that geopy does not work very well with postal codes, for whatever reason. To demonstrate an alternative approach, I scraped the data from a "Neighborhoods of Toronto" wiki page, looked up the latitude and longitude by neighborhood name, and used that data to perform the analysis. 

I believe this is not only more reliable for looking up the coordinates, but also a better analysis, because neighborhoods in the same postal code no longer share a single set of coordinates.
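
The clustering in this part runs on per-neighborhood venue-category frequencies: venue categories are one-hot encoded and then averaged by neighborhood. A minimal sketch of that preparation step, using four made-up venue rows in place of the Foursquare results, might look like this (the k-means step itself is left out):

```python
import pandas as pd

# Hypothetical venue rows; the real data comes from the Foursquare lookup
venues = pd.DataFrame(
    {
        "Neighborhood": ["Parkwoods", "Parkwoods", "Parkwoods", "Regent Park"],
        "Venue Category": ["Park", "Park", "Cafe", "Bakery"],
    }
)

# One-hot encode the category, then average per neighborhood to get frequencies
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
frequencies = onehot.groupby("Neighborhood").mean()

# Most common category per neighborhood
top_category = frequencies.idxmax(axis=1)
print(frequencies)
print(top_category)
```

Each row of `frequencies` sums to 1, so neighborhoods with many venues and neighborhoods with few are compared on the mix of categories rather than on raw counts.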



In [76]:
print("Loading toronto data")
toronto_data = get_toronto_data()
toronto_neighborhoods_data = toronto_data['features']
toronto_neighborhoods = get_neighborhoods_dataframe(toronto_neighborhoods_data)
toronto_location = get_location_object('Toronto, Ontario')
toronto_neighborhoods = toronto_neighborhoods[toronto_neighborhoods.Neighborhood != ""]
print("Toronto data loaded. Reference variable toronto_neighborhoods to view it")
print("generating Toronto map")
toronto_map = get_map(toronto_location)
print("Adding Toronto neighborhood markers to map")
add_neighborhood_markers_to_map(toronto_map, toronto_neighborhoods)
print("Toronto map generated, reference variable toronto_map to view it")
print("Getting venues for each neighborhood from FourSquare")
toronto_venues = get_nearby_venues(names=toronto_neighborhoods['Neighborhood'], latitudes=toronto_neighborhoods['Latitude'], longitudes=toronto_neighborhoods['Longitude'], radius=500, limit=100)
print("Venues have been retrieved. reference variable toronto_venues to view it.")

toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_venues_sorted = get_sorted_venues_dataframe(toronto_grouped)
kclusters = 5
print("Getting {} clusters".format(kclusters))
toronto_clustering_df = get_clustering_df(toronto_grouped)
toronto_kmeans = get_clusters(toronto_clustering_df, kclusters)
toronto_clustered_neighborhood_df = get_clustered_neighborhood_df_by_kmeans(toronto_kmeans, toronto_venues_sorted, toronto_neighborhoods)
print("Clusters have been retrieved. reference variable toronto_clustered_neighborhood_df to view it.")
print("Generating map")
toronto_clustered_map = create_clustered_map(toronto_location, kclusters, toronto_clustered_neighborhood_df)
print("Map has been retrieved, reference variable toronto_clustered_map to view it.")

print("finished with Toronto")

Loading toronto data
Toronto data loaded. Reference variable toronto_neighborhoods to view it
generating Toronto map
Adding Toronto neighborhood markers to map
Toronto map generated, reference variable toronto_map to view it
Getting venues for each neighborhood from FourSquare
Venues have been retrieved. reference variable toronto_venues to view it.
Getting 5 clusters
Clusters have been retrieved. reference variable toronto_clustered_neighborhood_df to view it.
Generating map
Map has been retrieved, reference variable toronto_clustered_map to view it.
finished with Toronto


In [73]:
toronto_clustered_map 