<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

<h1 align=center><font size = 4>Part One</font></h1>

### Scraping Postal Codes from Wikipedia

In [5]:
import requests # library to handle requests
import lxml.html as lh
import bs4 as bs
import urllib.request

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

#### URL Source

In [6]:
url   = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

### Functions to Scrape HTML Tables with BeautifulSoup and lxml

In [8]:
# -----------------------------------------------------
# Using BS4 as suggested in Assignment.
# -----------------------------------------------------

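def scrape_table(cname,cols):
    """Return the first table with CSS class `cname` as a DataFrame, keeping rows with `cols` cells."""
    page  = urllib.request.urlopen(url).read()
    soup  = bs.BeautifulSoup(page,'lxml')
    table = soup.find("table",class_=cname)
    # find_all/string= are the current names for the older findAll/text= spellings.
    header = [head.find_all(string=True)[0].strip() for head in table.find_all("th")]
    data   = [[td.find_all(string=True)[0].strip() for td in tr.find_all("td")]
              for tr in table.find_all("tr")]
    # The header row has no <td> cells, so this filter also drops it.
    data    = [row for row in data if len(row) == cols]
    # Store data to this temporary dataframe
    raw_df = pd.DataFrame(data,columns=header)
    return raw_df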

# -----------------------------------------------------
# Parsing using xpath.
# -----------------------------------------------------
def scrape_table_lxml(XPATH,cols):
    page = requests.get(url)
    doc = lh.fromstring(page.content)
    table_content = doc.xpath(XPATH)
    for table in table_content:
        # './/' keeps each query relative to this table; a bare '//' searches the whole document.
        headers = [th.text_content().strip() for th in table.xpath('.//th')]
        headers = headers[:cols]
        data    = [[td.text_content().strip() for td in tr.xpath('td')]
                   for tr in table.xpath('.//tr')]
        data    = [row for row in data if len(row) == cols]
        raw_df = pd.DataFrame(data,columns=headers)
        # Only the first table matched by XPATH is returned.
        return raw_df
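
The BeautifulSoup approach above can be tried without network access. The following is a minimal sketch, assuming a small hand-written HTML snippet (`SAMPLE_HTML`) in place of the Wikipedia page and a hypothetical helper `parse_table` that takes the HTML as an argument instead of reading the global `url`. It uses `get_text(strip=True)` and the built-in `html.parser`, which are swapped in for illustration and are not the calls used in the notebook.

```python
import bs4 as bs
import pandas as pd

# Hypothetical stand-in for the Wikipedia page, so no network access is needed.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

def parse_table(html, cname, cols):
    """Return the first table with CSS class `cname` as a DataFrame."""
    soup = bs.BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=cname)
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # The header row has no <td> cells, so the length filter drops it.
    rows = [row for row in rows if len(row) == cols]
    return pd.DataFrame(rows, columns=header)

sample_df = parse_table(SAMPLE_HTML, "wikitable", 3)
print(sample_df)
```

Passing the HTML in as a parameter makes the parsing step easy to check on known input before pointing it at the live page.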


In [13]:
# Scrape the table with BeautifulSoup
tron_postal_codes_table = scrape_table("wikitable",3)

# Alternative: XPath-based extraction with lxml
#raw_TorontoPostalCodes = scrape_table_lxml("/html/body/div[3]/div[3]/div[4]/div/table[1]",3)

print("# Toronto Postal codes stored in data")
print(tron_postal_codes_table.info(verbose=True))

# Toronto Postal codes stored in data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
Postcode         287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: object(3)
memory usage: 6.9+ KB
None


### Cleaning and Grouping the Data
The scraped Wikipedia table contains some unwanted entries and needs cleanup.
The following tasks will be performed:
* Drop rows whose borough is 'Not assigned'.
* If a row has a borough but its neighborhood is 'Not assigned', use the borough name as the neighborhood.
* Group the table by Postcode and Borough; neighborhoods that share the same postal code and borough are combined in the 'Neighbourhood' column, separated by commas.
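
The three steps can be sketched on a small hand-made table before applying them to the scraped data. The rows in `toy` are hypothetical and only mimic the shape of the scraped table; the variable names `cleaned`, `missing`, and `grouped` are chosen for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Keep only rows with an assigned borough.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Fall back to the borough name where the neighbourhood is unassigned.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# 3. One row per postcode/borough, neighbourhoods joined with ", ".
grouped = (cleaned.groupby(["Postcode", "Borough"])["Neighbourhood"]
           .apply(", ".join)
           .reset_index())
print(grouped)
```

In this toy input the two `M5A` rows collapse into one, and the `M7A` row takes its borough name as its neighbourhood.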

In [14]:
# Keep only the rows with an assigned borough; rows whose borough is 'Not assigned' are dropped.
postal_codes=tron_postal_codes_table[~tron_postal_codes_table['Borough'].isin(['Not assigned'])]

# Sort and Reset index.
postal_codes=postal_codes.sort_values(by=['Postcode','Borough','Neighbourhood'], ascending=[1,1,1]).reset_index(drop=True)

# If a row has a borough but an unassigned neighborhood, use the borough name as the neighborhood.
unassigned = postal_codes['Neighbourhood'] == 'Not assigned'
postal_codes.loc[unassigned, 'Neighbourhood'] = postal_codes.loc[unassigned, 'Borough']
# Sample rows kept for spot-checking the fill above.
check_unassigned_post_state_sample = postal_codes.loc[postal_codes['Borough'] == 'Queen\'s Park']


Some postal codes appear in more than one row, one row per neighborhood. These rows are combined into a single row, with the neighborhoods separated by commas.

In [16]:
postal_codes = postal_codes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
postal_codes

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Highland Creek, Port Union, Rouge Hill |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |


In [17]:
postal_codes.shape

(103, 3)

Exporting data to CSV file

In [18]:
postal_codes.to_csv('Postal_Codes-Part-1.csv',index=False)
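
A quick way to check the export is a round trip through an in-memory buffer. This is a minimal sketch, assuming a two-row hypothetical frame (`sample`) rather than the full `postal_codes` table; `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame in the same shape as the grouped table.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Malvern, Rouge", "Woburn"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; fields containing commas are quoted, so they survive intact.
restored = pd.read_csv(io.StringIO(csv_text))
```

With `index=False` the row index is not written. Without it, the index is saved as an unnamed first column, which `read_csv` labels `Unnamed: 0` when the file is loaded again.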

##### End of Part One