# Assignment - Segmenting and clustering neighborhoods in Toronto

-  Scrape the following [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
-  Obtain the data in the table of postal codes
-  Transform the data into a pandas dataframe

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.


In [1]:
import pandas as pd
import numpy as np
#Beautiful Soup is a Python library for pulling data out of HTML and XML files.


# Setting these options to None removes the limit on displayed columns and rows.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [2]:
!pip install BeautifulSoup4
from bs4 import BeautifulSoup
#from urllib import urlopen

Requirement not upgraded as not directly required: BeautifulSoup4 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [3]:
#columns = ['PostalCode', 'Borough', 'Neighborhood']
#df_Toronto = pd.DataFrame(columns = columns)
#df_Toronto

In [4]:
url_postalcodes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
!wget -O 'toronto_postal_codes.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

print('Downloaded file')

with open('toronto_postal_codes.html', encoding='utf-8') as postalcodes:
    soup = BeautifulSoup(postalcodes, 'html.parser')

#soup = BeautifulSoup(urlopen(url_postalcodes), "html5lib")

--2018-12-11 18:59:21--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80297 (78K) [text/html]
Saving to: ‘toronto_postal_codes.html’


2018-12-11 18:59:27 (1.07 MB/s) - ‘toronto_postal_codes.html’ saved [80297/80297]

Downloaded file


Beautiful Soup

Reference code for parsing td and tr tags:
[Using Python and QGIS for geospatial visualizations - a Case Study](https://www.airpair.com/python/posts/using-python-and-qgis-for-geospatial-visualization#rmxtgga1m5PJ9Vp8.30)
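
As a rough illustration of the td/tr parsing idea, here is a minimal sketch on a small inline HTML string. The HTML below is a made-up miniature, not the actual Wikipedia markup, and the variable names (html, mini_soup, rows) are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a postal-code table; NOT the real Wikipedia markup.
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

mini_soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in mini_soup.find("table", class_="wikitable").find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # header rows contain only <th> cells
    # get_text(strip=True) drops surrounding whitespace, including newlines
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Each inner list holds the text of one table row, which is the shape needed before building a dataframe.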

In [6]:

soup.prettify()
# A common way to create a DataFrame is from a list of dictionaries.
# Each dictionary key becomes a column heading, and a default integer index is created automatically:

data = []
#print(df_Toronto)
# Only parse the first table on the page if it is a 'wikitable'
if 'wikitable' in soup.table['class']:
    for row in soup.table.tbody('tr'):
        if row('th'):
            continue  # header row
        row_data = row('td')
        if row_data[1].text == 'Not assigned':
            continue  # skip postal codes with no borough assigned
        data.append({'PostalCode': row_data[0].text,
                     'Borough': row_data[1].text,
                     'Neighborhood': row_data[2].text})
        
df_Toronto = pd.DataFrame(data)
df_Toronto.shape


(212, 3)

In [7]:
df_Toronto.columns

Index(['Borough', 'Neighborhood', 'PostalCode'], dtype='object')
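
The columns above come out in alphabetical order rather than in the order of the dictionary keys. This is most likely because older pandas releases, such as the one in this Python 3.5 environment, sort the keys when building a dataframe from a list of dictionaries. A minimal sketch, on toy records rather than the scraped data, of fixing the order with the columns argument:

```python
import pandas as pd

# Toy records standing in for the scraped rows (illustrative only)
records = [
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
]

# Passing columns= sets the column order explicitly
df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(list(df.columns))
```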

In [9]:
df_Toronto.head(10)

|   | Borough | Neighborhood | PostalCode |
|---|---|---|---|
| 0 | North York | Parkwoods\n | M3A |
| 1 | North York | Victoria Village\n | M4A |
| 2 | Downtown Toronto | Harbourfront\n | M5A |
| 3 | Downtown Toronto | Regent Park\n | M5A |
| 4 | North York | Lawrence Heights\n | M6A |
| 5 | North York | Lawrence Manor\n | M6A |
| 6 | Queen's Park | Not assigned\n | M7A |
| 7 | Etobicoke | Islington Avenue\n | M9A |
| 8 | Scarborough | Rouge\n | M1B |
| 9 | Scarborough | Malvern\n | M1B |


More than one neighborhood can exist in one postal code area. For example, M5A is listed twice in the table on the Wikipedia page (rows 2 and 3 in the output above) and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.

A good [tutorial](https://www.youtube.com/watch?v=Wb2Tp35dZ-I) on pandas groupby (split, apply, combine).

Perform a computation on the grouped data: for each group, combine the values of the Neighborhood column (aggregation). Before grouping, strip the trailing newline from each neighborhood name.

In [10]:
df_Toronto['Neighborhood'] = df_Toronto['Neighborhood'].map(lambda x:x.rstrip("\n"))
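
An equivalent way to strip the trailing newline is the vectorized string accessor, which avoids the lambda. A small sketch on a toy Series (the values are illustrative, not the full scraped column):

```python
import pandas as pd

# Toy Series with and without a trailing newline
s = pd.Series(['Parkwoods\n', 'Victoria Village\n', 'Harbourfront'])

# .str.rstrip applies str.rstrip element-wise
cleaned = s.str.rstrip('\n')
print(cleaned.tolist())
```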

In [11]:
df_Toronto.head(10)

|   | Borough | Neighborhood | PostalCode |
|---|---|---|---|
| 0 | North York | Parkwoods | M3A |
| 1 | North York | Victoria Village | M4A |
| 2 | Downtown Toronto | Harbourfront | M5A |
| 3 | Downtown Toronto | Regent Park | M5A |
| 4 | North York | Lawrence Heights | M6A |
| 5 | North York | Lawrence Manor | M6A |
| 6 | Queen's Park | Not assigned | M7A |
| 7 | Etobicoke | Islington Avenue | M9A |
| 8 | Scarborough | Rouge | M1B |
| 9 | Scarborough | Malvern | M1B |


Groupby is applied to the columns 'PostalCode' and 'Borough'. The agg function accepts a dictionary, so a different function can be applied to each column. Here ', '.join concatenates the values in column 'Neighborhood' within each group, inserting ', ' between them.

The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

In [12]:
df_Toronto = df_Toronto.groupby(['PostalCode', 'Borough'], as_index = False).agg({'Neighborhood' : ', '.join})
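
To see what as_index changes, here is a minimal sketch on a toy frame (three hand-typed rows, not the scraped data). The names toy, indexed and flat are only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Default: the group keys become a MultiIndex on the result
indexed = toy.groupby(['PostalCode', 'Borough']).agg({'Neighborhood': ', '.join})

# as_index=False: the group keys stay as ordinary columns
flat = toy.groupby(['PostalCode', 'Borough'], as_index=False).agg({'Neighborhood': ', '.join})

print(indexed.index.names)
print(flat)
```

With as_index=False the result keeps PostalCode and Borough as regular columns, which matches the three-column layout the assignment asks for.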

If a cell has a borough but its neighborhood is 'Not assigned', then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns will be Queen's Park.

The condition df_Toronto['Neighborhood'] == 'Not assigned' returns a boolean pd.Series that is True for every row whose 'Neighborhood' is 'Not assigned'.
Using it with .loc, we assign the value of df_Toronto['Borough'] to column 'Neighborhood' in exactly those rows; pandas aligns the assigned values on the row index.

In [13]:
df_Toronto.loc[df_Toronto['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_Toronto['Borough']
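
The same masked assignment can be sketched on a two-row toy frame, which makes the effect easy to check by eye. The frame and the name mask are illustrative assumptions, not part of the scraped data.

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Boolean mask: True where the neighborhood is missing
mask = toy['Neighborhood'] == 'Not assigned'

# Copy the borough into the neighborhood for the masked rows only
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Only the masked row changes; rows that already have a neighborhood are left untouched.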

In [14]:
df_Toronto.shape

(103, 3)

In [15]:
df_Toronto

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
