# TORONTO NEIGHBORHOOD SCRAPE, SEGMENT, CLUSTER CAPSTONE
### Explore, segment, and cluster the neighborhoods in the city of Toronto.
1. Create structured dataframe of Toronto data
   * To explore and cluster the neighborhoods in Toronto, scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
        * Ignore postal codes that are not assigned a borough.
        * If a postal code has a borough but no assigned neighborhood, the neighborhood takes the borough's name.
   * Read the data into a pandas dataframe with column keys: Postal Code, Borough, Neighborhood.
        * Postal Code serves as the primary key.
        * Neighborhoods with the same Postal Code are concatenated into a single comma-separated value: "N1,N2"
2. Comment code and provide Markdown titles.
3. The last cell should display the cleaned dataframe and report its shape.
4. Make this the first notebook. Publish it to GitHub in an open-access directory and provide a link to the GitHub ipynb file for the first submission.
5. *Change name of notebook to xxx-v1.ipynb
6. Use the Geocoder Python package (https://geocoder.readthedocs.io/index.html) to get the latitude and longitude of each postal code.
    * The final output should be a table with column keys: Postal Code, Borough, Neighborhood, Latitude, Longitude
    * May want to incorporate this code:
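                import geocoder  # geocoding client library

                postal_code = 'M5A'  # example value; in practice iterate over the dataframe
                lat_lng_coords = None
                attempts = 0
                # retry a bounded number of times, since the service may keep returning None
                while lat_lng_coords is None and attempts < 10:
                    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
                    lat_lng_coords = g.latlng
                    attempts += 1
                latitude, longitude = lat_lng_coords if lat_lng_coords else (None, None)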
7. Verify the latitude and longitude values against http://cocl.us/Geospatial_data
8. Publish to GitHub in an open-access directory and provide a link to the GitHub ipynb file for the final submission.
9. *Change name to xxx-v2.ipynb
10. Explore and cluster the neighborhoods in Toronto. Use only boroughs that contain the word Toronto and replicate the analysis done with the New York City data.
    * Use Foursquare to identify categories of venues.
    * Use folium to generate maps that visualize the neighborhoods and how they cluster together.
11. Publish to GitHub in an open-access directory and provide a link to the GitHub ipynb file for the final submission.
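
The cleaning rules in step 1 can be sketched on a handful of rows before running them on the full scrape. This is a minimal sketch, not the notebook's final cleaning code: the first four rows come from the scraped table, while the M7A row's "Not assigned" neighbourhood is an assumption added only to exercise the borough-inheritance rule. The names `raw`, `assigned`, and `clean` are illustrative.

```python
import pandas as pd

# Small sample in the shape of the raw Wikipedia table.
# The last row's neighbourhood value is an assumption for illustration.
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# Rule 1: ignore postal codes that are not assigned a borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: a missing neighbourhood inherits the borough name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

# Rule 3: neighbourhoods sharing a postal code become one "N1,N2" value.
clean = (
    assigned.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(",".join)
    .reset_index()
    .rename(columns={"Postcode": "Postal Code",
                     "Neighbourhood": "Neighborhood"})
)

print(clean)
print(clean.shape)
```

Grouping on both the postal code and the borough keeps the borough column in the result without a second merge; it assumes each postal code maps to exactly one borough.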

## OVERVIEW
1. Scrape the Wikipedia page. Find the table of postal codes and convert it to a pandas dataframe. Display the raw data.
2. Clean the raw dataframe. Drop rows whose borough is "Not assigned"; where only the neighborhood is "Not assigned", inherit the borough name. Combine neighborhoods that share a postal code.
3. Display the cleaned dataframe and verify that it matches the assignment expectations.
4. Get the latitude and longitude of each postal code. Display the geolocated dataframe.
5. Explore and cluster the Toronto neighborhoods using the NYC analysis as a template.
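
Steps 4 and 5 can be sketched as a join followed by a borough filter. This is a hedged sketch under stated assumptions: the coordinate table is assumed to have the columns Postal Code, Latitude, and Longitude, and the coordinate values below are obvious placeholders, not real locations. In practice the coordinates would come from the geocoder or from the file referenced in assignment step 7.

```python
import pandas as pd

# Cleaned postal table in the assignment's format (rows from the scrape output).
postal_df = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Harbourfront"],
})

# Placeholder coordinates (NOT real values); column names are an assumption.
coords_df = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [0.0, 1.0, 2.0],
    "Longitude": [0.0, -1.0, -2.0],
})

# Left join keeps every postal code; validate guards against duplicate keys.
geo_df = postal_df.merge(
    coords_df, on="Postal Code", how="left", validate="one_to_one"
)

# Rows that failed to geolocate would show up here with NaN coordinates.
not_located = geo_df[geo_df["Latitude"].isna()]

# Keep only boroughs whose name contains the word "Toronto".
toronto_df = geo_df[geo_df["Borough"].str.contains("Toronto")].reset_index(drop=True)

print(geo_df)
print(toronto_df)
```

A left join is used so that a postal code without coordinates stays visible as a NaN row instead of silently disappearing.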

### LOAD RELEVANT LIBRARIES
1. General processing
2. Scraping
3. HTML and image display
4. Geolocation
5. Mapping
6. Clustering

In [54]:
# load general-purpose libraries
import pandas as pd
import numpy as np
import random

# libraries for scraping
from bs4 import BeautifulSoup
import requests  # library to handle HTTP requests

# libraries for displaying images and HTML
from IPython.display import Image, display, HTML

# module to convert an address into latitude and longitude values
try:
    from geopy.geocoders import Nominatim
except ImportError:
    !conda install -c conda-forge geopy --yes
    from geopy.geocoders import Nominatim

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

try:
    import folium  # plotting library
except ImportError:
    !conda install -c conda-forge folium=0.5.0 --yes
    import folium  # plotting library




### SCRAPING WIKIPEDIA DATA

In [52]:
# use requests.get to read in the Wikipedia list of Toronto postal codes
from io import StringIO

html_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_file = requests.get(html_url)
html_file.raise_for_status()  # stop early if the page could not be fetched

# parse the page with BeautifulSoup
file = BeautifulSoup(html_file.content, 'lxml')

# extract the postal data table (file.table returns the first <table> on the page)
table = file.table

# wrap the table HTML in StringIO so pandas read_html can parse it
toronto_raw_postal_wiki_df = pd.read_html(StringIO(str(table)))[0]

# verify the raw table (a postal code appears once per listed neighbourhood, so codes repeat)
print("Raw dataframe contains {} postal codes".format(toronto_raw_postal_wiki_df.shape[0]))
toronto_raw_postal_wiki_df.head()

Raw dataframe contains 288 postal codes


  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront


### CLEAN TORONTO DATA

In [38]:
from IPython.display import display, HTML
# inspect the full list of dataframes that read_html parses from the table
pd.read_html(str(table))

[    Postcode           Borough  \
 0        M1A      Not assigned   
 1        M2A      Not assigned   
 2        M3A        North York   
 3        M4A        North York   
 4        M5A  Downtown Toronto   
 5        M5A  Downtown Toronto   
 6        M6A        North York   
 7        M6A        North York   
 8        M7A      Queen's Park   
 9        M8A      Not assigned   
 10       M9A         Etobicoke   
 11       M1B       Scarborough   
 12       M1B       Scarborough   
 13       M2B      Not assigned   
 14       M3B        North York   
 15       M4B         East York   
 16       M4B         East York   
 17       M5B  Downtown Toronto   
 18       M5B  Downtown Toronto   
 19       M6B        North York   
 20       M7B      Not assigned   
 21       M8B      Not assigned   
 22       M9B         Etobicoke   
 23       M9B         Etobicoke   
 24       M9B         Etobicoke   
 25       M9B         Etobicoke   
 26       M9B         Etobicoke   
 27       M1C       

In [34]:
str(table)

'<table class="wikitable sortable">\n<tbody><tr>\n<th>Postcode</th>\n<th>Borough</th>\n<th>Neighbourhood\n</th></tr>\n<tr>\n<td>M1A</td>\n<td>Not assigned</td>\n<td>Not assigned\n</td></tr>\n<tr>\n<td>M2A</td>\n<td>Not assigned</td>\n<td>Not assigned\n</td></tr>\n<tr>\n<td>M3A</td>\n<td><a href="/wiki/North_York" title="North York">North York</a></td>\n<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>\n</td></tr>\n<tr>\n<td>M4A</td>\n<td><a href="/wiki/North_York" title="North York">North York</a></td>\n<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>\n</td></tr>\n<tr>\n<td>M5A</td>\n<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>\n<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>\n</td></tr>\n<tr>\n<td>M5A</td>\n<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>\n<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Pa

In [5]:
import requests
# requests is a module, not a function; the working call is requests.get(html_url)
requests(html_url)

TypeError: 'module' object is not callable

In [8]:
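# requests.request needs the HTTP method as its first argument, e.g. requests.request('GET', html_url)
requests.request(url=html_url)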

TypeError: request() missing 1 required positional argument: 'method'
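
The two errors above come from calling the module itself and from omitting the HTTP method. A minimal sketch of the working forms follows. It builds a request without sending it, so it runs offline; `fetch` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests

html_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# The module object is not callable, but its helper functions are.
print(callable(requests))      # False
print(callable(requests.get))  # True

# requests.request takes the method first; Request(...).prepare() builds
# the same GET request without sending anything over the network.
prepared = requests.Request(method="GET", url=html_url).prepare()
print(prepared.method, prepared.url)


def fetch(url, timeout=10):
    """Hypothetical helper: GET a page and fail loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```

Calling `fetch(html_url)` would perform the actual download; `requests.get(url)` is shorthand for `requests.request('GET', url)`.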

In [7]:
html_url

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'