# Segment and Cluster Toronto Neighborhoods

## Part 1: Scraping Wikipedia Data

In [None]:
!conda install -c anaconda xlrd --yes
!conda install -c anaconda beautifulsoup4 --yes

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Fetch the page content from Wikipedia

In [None]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page.raise_for_status()  # stop early if the request failed
contents = page.content

NOTE:

In the code below I make the following assumptions:

1. The table I'm looking for has the class 'wikitable', and it is the only table on the page with that class.
2. The columns in the table appear in the order Postcode, Borough, Neighborhood.
3. Every table cell has a value (including "Not assigned").

Having inspected the HTML of the page, I confirmed that the above assumptions were true at the time of creation (March 16th, 2019).
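
The three assumptions above could also be checked in code rather than by eye. The following is only a minimal sketch: `SAMPLE_HTML`, `EXPECTED_HEADERS` and `check_table_assumptions` are hypothetical names introduced here, and the sample table is a hand-written stand-in for the real page, whose header text may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for the Wikipedia page content.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# Assumed header text; adjust to whatever the live page actually uses.
EXPECTED_HEADERS = ['Postcode', 'Borough', 'Neighborhood']


def check_table_assumptions(html, expected_headers=EXPECTED_HEADERS):
    """Return a list of violated assumptions (empty if all of them hold)."""
    soup = BeautifulSoup(html, 'html.parser')
    problems = []

    # Assumption 1: exactly one table with class 'wikitable'.
    tables = soup.find_all('table', {'class': 'wikitable'})
    if len(tables) != 1:
        problems.append('expected 1 wikitable, found {}'.format(len(tables)))
        return problems

    rows = tables[0].find_all('tr')
    if not rows:
        problems.append('table has no rows')
        return problems

    # Assumption 2: columns appear in the expected order.
    header_cells = [th.text.strip() for th in rows[0].find_all('th')]
    if header_cells != expected_headers:
        problems.append('unexpected headers: {}'.format(header_cells))

    # Assumption 3: every data row has a non-empty value in each column.
    for number, row in enumerate(rows[1:], start=1):
        cells = [td.text.strip() for td in row.find_all('td')]
        if len(cells) != len(expected_headers) or not all(cells):
            problems.append('row {} has missing or empty cells'.format(number))

    return problems


print(check_table_assumptions(SAMPLE_HTML))
```

Against the real page, the same function could be called with `contents` in place of `SAMPLE_HTML`; a non-empty result would be a sign that the page layout has changed since the assumptions were checked.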

----

Use BeautifulSoup to help scrape data from the returned Wikipedia page content

Docs: https://www.crummy.com/software/BeautifulSoup/

In [None]:
soup = BeautifulSoup(contents, 'html.parser')

headers = ['Postcode', 'Borough', 'Neighborhood']

# Assumption 1: the only table with class 'wikitable' holds the data
table = soup.find('table', {'class': 'wikitable'})
table_rows = table.find_all('tr')[1:]  # skip the header row

df_rows = []

for row in table_rows:
    # Assumption 2: cells appear in the order Postcode, Borough, Neighborhood
    items = [cell.text.strip() for cell in row.find_all('td')]
    # Skip malformed rows and rows whose Borough is "Not assigned"
    if len(items) >= 3 and items[1] != 'Not assigned':
        df_rows.append(items[:3])


In [None]:
init_df = pd.DataFrame(data=df_rows, columns=headers)
init_df.head()

Now that we have the initial dataframe, we need to clean up the data by doing the following:

1. Combine Neighborhoods with the same Postcode
2. Set any Neighborhood with the value of "Not assigned" to be the same as the Borough

In the code below I loop over the rows of the dataframe and build a dictionary keyed by Postcode, so that each Postcode appears exactly once. Along the way I concatenate the Neighborhoods, so that each Postcode ends up with a single comma-separated string containing all the Neighborhoods associated with it.

In [None]:
c_data = {}  # cleaned data: Postcode -> [Postcode, Borough, Neighborhoods]

for _, row in init_df.iterrows():
    postcode, borough, neighborhood = row['Postcode'], row['Borough'], row['Neighborhood']
    # An unassigned Neighborhood takes the name of its Borough
    if neighborhood == 'Not assigned':
        neighborhood = borough
    if postcode not in c_data:
        c_data[postcode] = [postcode, borough, neighborhood]
    # Compare whole names rather than substrings, so a Neighborhood whose
    # name is contained in another one's name is not silently dropped
    elif neighborhood not in c_data[postcode][2].split(', '):
        c_data[postcode][2] += ', ' + neighborhood

In [None]:
tor_df = pd.DataFrame(list(c_data.values()), columns=headers)

In [None]:
tor_df.head()

In [None]:
tor_df.shape
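
The two cleaning steps in Part 1 can also be expressed with pandas operations instead of an explicit loop. This is only a sketch of one alternative, not the approach used in this notebook: `combine_neighborhoods` is a hypothetical helper, and the three sample rows are illustrative. Note that it groups on both Postcode and Borough, so a Postcode listed under two different Boroughs would produce two rows, whereas the loop keeps the first Borough it sees.

```python
import pandas as pd

# Illustrative sample rows in the same shape as init_df.
sample = pd.DataFrame(
    [
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['Postcode', 'Borough', 'Neighborhood'],
)


def combine_neighborhoods(df):
    """Return one row per Postcode with its Neighborhoods joined by ', '."""
    out = df.copy()

    # Step 2: an unassigned Neighborhood takes the name of its Borough.
    unassigned = out['Neighborhood'] == 'Not assigned'
    out.loc[unassigned, 'Neighborhood'] = out.loc[unassigned, 'Borough']

    # Drop exact duplicate rows so no Neighborhood is listed twice.
    out = out.drop_duplicates()

    # Step 1: combine Neighborhoods that share a Postcode,
    # keeping the Postcodes in order of first appearance.
    return (
        out.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
        .agg(', '.join)
        .reset_index()
    )


combined = combine_neighborhoods(sample)
print(combined)
```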

## Part 2: Getting Geolocations

NOTE: I tried to use the geocoder package but could not get it to work, so the coordinates below come from a pre-built CSV file instead.

In [None]:
!wget -O geospacial.csv https://cocl.us/Geospatial_data

In [41]:
geo_df = pd.read_csv('geospacial.csv')
# Rename the key column so it matches tor_df
geo_df = geo_df.rename(columns={'Postal Code': 'Postcode'})

Now that we have the geospatial data for the Postcodes, we need to join the two dataframes together.

In [42]:
tor_df.head()

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5P | Central Toronto | Forest Hill North, Forest Hill West |
| 1 | M3B | North York | Don Mills North |
| 2 | M4Y | Downtown Toronto | Church and Wellesley |
| 3 | M4X | Downtown Toronto | Cabbagetown, St. James Town |
| 4 | M6L | North York | Maple Leaf Park, North Park, Upwood Park |


In [43]:
geo_df.head()

|   | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [46]:
tor_geo_df = pd.merge(tor_df, geo_df, on='Postcode', how='left')

In [47]:
tor_geo_df.head()

|   | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5P | Central Toronto | Forest Hill North, Forest Hill West | 43.696948 | -79.411307 |
| 1 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 2 | M4Y | Downtown Toronto | Church and Wellesley | 43.66586 | -79.38316 |
| 3 | M4X | Downtown Toronto | Cabbagetown, St. James Town | 43.667967 | -79.367675 |
| 4 | M6L | North York | Maple Leaf Park, North Park, Upwood Park | 43.713756 | -79.490074 |
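
Because the join uses `how='left'`, any Postcode that is missing from the geospatial file would end up with NaN coordinates instead of raising an error. A quick check along the following lines can surface such rows. This is a sketch only: `find_unmatched` is a hypothetical helper, the Borough values are illustrative, and `M9Z` is a made-up Postcode added to trigger a mismatch.

```python
import pandas as pd

# Small illustrative frames in the same shape as tor_df and geo_df.
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is made up for this example
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
right = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})


def find_unmatched(left_df, right_df, key='Postcode'):
    """Left-join the frames and return (merged, keys with no match on the right)."""
    merged = pd.merge(
        left_df,
        right_df,
        on=key,
        how='left',
        indicator=True,          # adds a '_merge' column describing each row
        validate='one_to_one',   # raises if a key is duplicated on either side
    )
    unmatched = merged.loc[merged['_merge'] == 'left_only', key].tolist()
    return merged.drop(columns='_merge'), unmatched


merged, unmatched = find_unmatched(left, right)
print(unmatched)
```

Calling `find_unmatched(tor_df, geo_df)` on the real dataframes would be expected to return an empty list if every Postcode has coordinates.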
