# TORONTO NEIGHBORHOOD SCRAPE, SEGMENT, CLUSTER CAPSTONE
### Explore, segment, and cluster the neighborhoods in the city of Toronto.
1. Create structured dataframe of Toronto data
   * To explore and cluster the neighborhoods in Toronto, scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
        * Ignore neighborhoods/postal codes that are not assigned a borough.
        * If a postal code has a borough but no assigned neighborhood, the neighborhood takes the borough's name.
   * Read the data into a pandas dataframe with column keys: Postal Code, Borough, Neighborhood.
        * Postal Code serves as the primary key.
        * Neighborhoods with the same Postal Code are concatenated into a single value: "N1,N2"
2. Comment code and provide Markdown titles.
3. The last line should display the dataframe and report the shape of the cleaned dataframe.
4. Make this the first notebook. Publish it to GitHub in an open-access directory and provide a link to the GitHub ipynb file for the first submission.
5. *Change name of notebook to xxx-v1.ipynb
6. Use the Geocoder Python package (https://geocoder.readthedocs.io/index.html) to get mapping data.
    * Final output should be a table with column keys: postal code, borough, neighborhood, latitude, longitude
    * May want to incorporate this code:
                import geocoder # import geocoder
                lat_lng_coords = None
                while(lat_lng_coords is None):
                  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
                  lat_lng_coords = g.latlng
                latitude = lat_lng_coords[0]
                longitude = lat_lng_coords[1] 
7. Verify latitude and longitude against http://cocl.us/Geospatial_data
8. Publish to GitHub in an open-access directory and provide a link to the GitHub ipynb file for final submission.
9. *Change name to xxx-v2.ipynb
10. Explore and cluster the neighborhoods in Toronto. Use only boroughs that contain the word Toronto and replicate the analysis done with the New York City data.
    * Use Foursquare to identify categories of venues.
    * Use folium to generate maps that visualize the neighborhoods and how they cluster together.
11. Publish to GitHub in an open-access directory and provide a link to the GitHub ipynb file for final submission.
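
The loop suggested in step 6 retries until the geocoder returns coordinates, so it never terminates if a postal code cannot be resolved. The following is a minimal sketch of a bounded retry, not part of the assignment. It assumes a hypothetical `lookup` callable that stands in for `geocoder.google(...).latlng` and returns either `[lat, lng]` or `None`. The names `get_lat_lng`, `max_tries`, and `pause` are illustrative and do not come from the Geocoder package.

```python
import time


def get_lat_lng(postal_code, lookup, max_tries=5, pause=0.0):
    """Return (latitude, longitude) for postal_code, or None after max_tries failures.

    `lookup` is a hypothetical callable that takes an address string and
    returns [lat, lng] or None, e.g. lambda addr: geocoder.google(addr).latlng
    """
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = lookup(address)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause)  # wait before the next attempt
    return None


# Stand-in lookup that fails twice and then succeeds (no network access needed)
_calls = {'n': 0}


def fake_lookup(address):
    _calls['n'] += 1
    return None if _calls['n'] < 3 else [43.806686, -79.194353]


result = get_lat_lng('M1B', fake_lookup)
print(result)
```

Returning `None` after a fixed number of attempts lets the caller decide whether to skip the postal code or fall back to another data source.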

## OVERVIEW
1. Load Relevant Libraries
2. Scrape Wikipedia page.
3. Clean Toronto data.
4. Get postal code latitude and longitude.
5. Display geolocated pandas data.
6. Explore and cluster Toronto neighborhoods using the NYC analysis as a template.

### 1. LOAD RELEVANT LIBRARIES
1. General processing
2. Scraping
3. HTML and image display
4. geolocation
5. mapping
6. clustering

In [1]:
#1. load General processing libraries
import pandas as pd 
import numpy as np 
import random 

#2. load libraries for scraping
from bs4 import BeautifulSoup
import requests # library to handle requests

#3. libraries for displaying HTML and images
from IPython.display import Image, display, HTML

#4. loading libraries for geolocation
# module to convert an address into latitude and longitude values#
#try:
#    import geocoder
#except:
#    !conda install -c conda-forge geocoder --yes 
#    import geocoder
#
try:
    from geopy.geocoders import Nominatim
except ImportError:
    !conda install -c conda-forge geopy --yes
    from geopy.geocoders import Nominatim
from geopy.exc import GeopyError

#5. loading libraries for Mapping
try:
    import folium # plotting library
except ImportError:
    !conda install -c conda-forge folium=0.5.0 --yes
    import folium # plotting library

#6. Loading libraries for clustering
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize




### 2. SCRAPING WIKIPEDIA DATA
1. Load wikipedia page
2. Extract table
3. Convert the HTML table to a string and read it into a pd.DataFrame
4. Display wrangled data
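
To show what "extract table" amounts to without network access or third-party parsers, here is a dependency-free sketch built on the standard-library `html.parser`. It is a swapped-in technique for illustration only; the notebook itself uses BeautifulSoup and `pd.read_html`. The class name `FirstTableParser` and the sample HTML are made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the cell text of the first <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_table = False
        self._done = False   # set once the first table has been closed
        self._cell = None    # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == 'table':
            self._in_table = True
        elif self._in_table and tag == 'tr':
            self.rows.append([])
        elif self._in_table and tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ('td', 'th') and self._cell is not None:
            self.rows[-1].append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'table' and self._in_table:
            self._done = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Made-up sample page: only the first table should be read
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table><tr><td>ignored</td></tr></table>
"""

parser = FirstTableParser()
parser.feed(html)
header, *records = parser.rows
print(header)
print(records)
```

The resulting `header` and `records` could be passed to `pd.DataFrame(records, columns=header)` to get the same kind of raw dataframe the notebook builds.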

In [2]:
#1. Load wikipedia page
# use requests.get to read in wikipedia list of toronto postal codes.
html_url =  'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_file = requests.get(html_url)

#2. Extract table
#structure website using beautiful soup
file = BeautifulSoup(html_file.content, 'lxml')
#Extract postal data table (review website coding to identify data is located in a table structure)
table = file.table

#3. Convert HTML table to a string and read it into a DataFrame
#cast postal data table html as string so it can be read by pandas read_html.
toronto_raw_postal_wiki_df = pd.read_html(str(table))[0]

#4. Display wrangled data
#verify raw postal data table
print("Raw dataframe contains {} rows".format(toronto_raw_postal_wiki_df.shape[0]))
toronto_raw_postal_wiki_df.head()

Raw dataframe contains 288 rows


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### 3. CLEAN TORONTO DATA
1. Identify and eliminate rows with Borough value "Not assigned".
2. Assign the Borough value to Neighborhood where Neighborhood is "Not assigned".
3. Group by postcode. If a postcode has multiple neighborhoods, join them into a comma-delimited string.
4. Display cleaned table, show count of rows and columns.
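
The steps above can be sketched on a small toy dataframe. The rows below are made up for illustration and only mimic the shape of the scraped table. The variable names `raw`, `clean`, and `grouped` are not the notebook's.

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values only)
raw = pd.DataFrame({
    'Postcode':      ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':       ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough; .copy() avoids SettingWithCopyWarning on later assignment
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Where the neighbourhood is missing, fall back to the borough name
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# 3. One row per postcode, with its neighbourhoods joined into a single string
grouped = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
                .agg(', '.join)
                .reset_index())

# 4. Display the cleaned table and its shape
print(grouped)
print(grouped.shape)
```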

In [3]:
#1. identify and eliminate rows with Borough value "Not assigned"
#Make logical index of rows where borough is not assigned
Bad_Borough_Index = toronto_raw_postal_wiki_df.loc[:,'Borough'].isin(['Not assigned'])
#Use logical negation to drop bad rows. Take an explicit copy so later assignments do not raise SettingWithCopyWarning.
toronto_postal_df = toronto_raw_postal_wiki_df.loc[~Bad_Borough_Index,:].copy()

#2. assign Borough value to Neighbourhood if Neighbourhood is "Not assigned"
#Find rows where Neighborhood is not assigned.
Bad_Neighborhood_Index = toronto_postal_df.loc[:,'Neighbourhood'].isin(['Not assigned'])
print(" {} rows were found to have missing Neighborhood values".format(sum(Bad_Neighborhood_Index)))
print()
#Assign Borough value to Bad Neighborhood values.
toronto_postal_df.loc[Bad_Neighborhood_Index,'Neighbourhood']=toronto_postal_df.loc[Bad_Neighborhood_Index,'Borough']
#check data frame for Missing Neighborhood values.
Bad_Neighborhood_Index2 = toronto_postal_df.loc[:,'Neighbourhood'].isin(['Not assigned'])
print("After correction, {} rows were found to have missing Neighborhood values".format(sum(Bad_Neighborhood_Index2)))
toronto_postal_df = toronto_postal_df.sort_values(by=['Postcode']).reset_index(drop=True)
toronto_postal_df.head()

 1 rows were found to have missing Neighborhood values

After correction, 0 rows were found to have missing Neighborhood values


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge |
| 1 | M1B | Scarborough | Malvern |
| 2 | M1C | Scarborough | Port Union |
| 3 | M1C | Scarborough | Rouge Hill |
| 4 | M1C | Scarborough | Highland Creek |


In [4]:
#3. group by postcode. If there are multiple neighborhoods, replace with comma delimited string.
#group by postcode and borough, joining each group's neighborhoods into one ", "-separated string.
toronto_clean_df = (toronto_postal_df.groupby(['Postcode','Borough'])['Neighbourhood']
                    .agg(', '.join)
                    .reset_index())

#4. Display cleaned table, show count of rows and columns.
#print("The shape is {} rows (Postcodes) and {} columns (Pstcd-Brgh-Nghbrhd)".format(toronto_clean_df.shape[0],toronto_clean_df.shape[1]))
print(toronto_clean_df.shape)
print()
#HTML(toronto_clean_df.to_html())
toronto_clean_df.head()

(103, 3)



| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Port Union, Rouge Hill, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


### 4. GET LATITUDE AND LONGITUDE FROM POSTAL CODE
1. Download the data file with latitude and longitude.
2. Merge it by postal code with the clean Toronto borough table (the CSV is used here instead of geopy, which proved too time-consuming).
3. Clean and display the Toronto map data table.
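
A minimal sketch of the merge step on toy data, assuming the coordinate file has the columns `Postal Code`, `Latitude`, and `Longitude`. Unlike the notebook's `how='outer'`, this sketch uses `how='left'` so only postal codes from the cleaned table are kept. `validate` and `indicator` are standard `DataFrame.merge` options used here to check that every postal code found exactly one match. The `M9Z` row and its zero coordinates are made-up placeholders.

```python
import pandas as pd

# Toy stand-ins for the cleaned borough table and the coordinate CSV
boroughs = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough':  ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],          # 'M9Z' is a placeholder extra row
    'Latitude':    [43.806686, 43.784535, 0.0],
    'Longitude':   [-79.194353, -79.160497, 0.0],
})

# Align the key column name with the cleaned table
coords = coords.rename(columns={'Postal Code': 'Postcode'})

# Left merge keeps only postal codes present in the cleaned table;
# validate raises if a postal code appears more than once on either side
merged = boroughs.merge(coords, how='left', on='Postcode',
                        validate='one_to_one', indicator=True)

# Rows that found no coordinates would be flagged 'left_only'
unmatched = merged[merged['_merge'] != 'both']
print(merged)
print("Unmatched postal codes:", len(unmatched))
```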

In [53]:
#1. Download data file with latitude and longitude.
# read the downloaded CSV of postal code coordinates and rename its key column to match the cleaned table.
CA_df = pd.read_csv("Geospatial_Coordinates.csv")
CA_df.rename({'Postal Code':'Postcode'},axis=1, inplace=True)

#2. Merge by postal code with clean toronto borough table
toronto_map_df = toronto_clean_df.merge(CA_df, how='outer', on='Postcode')

#3. clean and display toronto map data table.
HTML(toronto_map_df.to_html())

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Port Union, Rouge Hill, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile, Oakridge, Clairlea | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Scarborough Village West, Cliffside | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Cliffside West, Birch Cliff | 43.692657 | -79.264848 |
