# Segmenting and Clustering Neighborhoods in Toronto

### Objective
This notebook builds the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M,
extract the data in its table of postal codes, and load that data into a pandas dataframe.

### Steps

**1. Importing the necessary Python libraries**

In [2]:
import requests
import pandas as pd
import lxml.html as lh
import bs4 as bs
import urllib.request

**2. Scraping the website using BS4 (Beautiful Soup 4), a Python library for pulling data out of HTML and XML files**

In [3]:
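url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

def scrape_table(cname, cols):
    # Download the page and parse it with the lxml parser
    page  = urllib.request.urlopen(url).read()
    soup  = bs.BeautifulSoup(page, 'lxml')
    # Locate the first table carrying the given CSS class
    table = soup.find("table", class_=cname)
    # Column names come from the <th> cells; keep only the first `cols`
    header = [th.get_text(strip=True) for th in table.find_all("th")][:cols]
    # One list of cell texts per <tr>; get_text() also copes with empty cells
    data = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # Drop rows (such as the header row) that do not have exactly `cols` cells
    data = [row for row in data if len(row) == cols]
    return pd.DataFrame(data, columns=header)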
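
The same parsing pattern can be tried without a network connection. The sketch below is only an illustration: `SAMPLE_HTML` and `table_to_frame` are hypothetical names that do not appear elsewhere in this notebook, the three data rows are made-up stand-ins for the real page, and the built-in `html.parser` is used instead of `lxml` so that only `bs4` and `pandas` are assumed to be installed.

```python
import bs4 as bs
import pandas as pd

# Hypothetical inline HTML standing in for the Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td><a href="#">Downtown Toronto</a></td><td>Harbourfront</td></tr>
</table>
"""

def table_to_frame(html, cname, cols):
    """Parse the first table with CSS class `cname` into a DataFrame."""
    soup = bs.BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=cname)
    # Header names come from the <th> cells.
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    # Each <tr> becomes a list of its <td> texts; the header row has no <td>
    # cells, so it yields an empty list and is filtered out below.
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    rows = [row for row in rows if len(row) == cols]
    return pd.DataFrame(rows, columns=header)

sample_df = table_to_frame(SAMPLE_HTML, "wikitable", 3)
print(sample_df)
```

Filtering on the number of cells is what removes the header row here, which is why the function takes the expected column count as an argument.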

**3. Parsing the data using lxml, a Python library that supports scraping with XPath**

In [4]:
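def scrape_table2(XPATH, cols):
    # Fetch the page with requests and build an lxml element tree
    page = requests.get(url)
    doc = lh.fromstring(page.content)
    table_content = doc.xpath(XPATH)
    for table in table_content:
        # './/' keeps the search inside this table; a bare '//' scans the whole document
        headers = [th.text_content().strip() for th in table.xpath('.//th')]
        headers = headers[:cols]
        data = [[td.text_content().strip() for td in tr.xpath('td')]
                for tr in table.xpath('.//tr')]
        # Keep only rows with exactly `cols` cells
        data = [row for row in data if len(row) == cols]
        # Return the first matching table as a dataframe
        return pd.DataFrame(data, columns=headers)

# Build the raw dataframe with the Beautiful Soup version
raw_data = scrape_table("wikitable", 3)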
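
One XPath detail matters when working on a single table with lxml: an expression that starts with `//` is evaluated from the document root even when it is called on a sub-element, whereas `.//` searches only below that element. The sketch below illustrates the difference on a hypothetical two-table snippet (`SAMPLE_HTML` and the variable names are assumptions made for this example, not part of the notebook).

```python
import lxml.html as lh

# Hypothetical page with an unrelated table placed before the one we want.
SAMPLE_HTML = """
<html><body>
<table class="other"><tr><th>Unrelated</th></tr><tr><td>x</td></tr></table>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(SAMPLE_HTML)
table = doc.xpath('//table[contains(@class, "wikitable")]')[0]

# '//th' ignores the context node and collects headers from every table.
absolute = [th.text_content().strip() for th in table.xpath('//th')]
# './/th' stays inside the selected table.
relative = [th.text_content().strip() for th in table.xpath('.//th')]

# Rows of the selected table only; the header row has no <td> cells.
rows = [[td.text_content().strip() for td in tr.xpath('td')]
        for tr in table.xpath('.//tr')]
rows = [row for row in rows if len(row) == 3]

print(absolute)
print(relative)
print(rows)
```

With the absolute form, slicing the header list to the first three entries only gives the right names when the target table happens to be the first one on the page.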

**4. Cleaning up and re-grouping the data**

In [14]:
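# Drop postal codes that have no borough assigned
df = raw_data[raw_data['Borough'] != 'Not assigned']
df = df.sort_values(by=['Postcode', 'Borough', 'Neighbourhood']).reset_index(drop=True)
# Where the neighbourhood is unassigned, fall back to the borough name
unassigned = df['Neighbourhood'] == 'Not assigned'
df.loc[unassigned, 'Neighbourhood'] = df.loc[unassigned, 'Borough']
# Spot check on one borough to inspect the result of the fallback
check_unassigned_sample = df.loc[df['Borough'] == "Queen's Park"]
# Merge neighbourhoods that share a postal code into one comma-separated string
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)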
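
To see what each cleaning step does, the same sequence can be run on a tiny hand-made frame. This is a sketch under stated assumptions: `raw` and `clean` are hypothetical names and the four rows are illustrative sample data, not the scraped table.

```python
import pandas as pd

# Illustrative sample covering the three cases handled by the cleaning step.
raw = pd.DataFrame({
    'Postcode':      ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':       ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Remove rows without a borough.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Use the borough name where the neighbourhood is unassigned.
mask = clean['Neighbourhood'] == 'Not assigned'
clean.loc[mask, 'Neighbourhood'] = clean.loc[mask, 'Borough']

# 3. Collapse rows sharing a postal code into one comma-separated entry.
clean = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(', '.join)
              .reset_index()
              .rename(columns={'Postcode': 'PostalCode'}))

print(clean)
```

The M1A row disappears, the two M5A rows merge into a single row, and the M7A row takes its borough name as its neighbourhood.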

**5. Checking the dataframe**

In [15]:
## First ten rows of the dataframe
df.head(10)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Highland Creek, Port Union, Rouge Hill |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [16]:
## Size of the dataframe (rows, columns)
df.shape

(103, 3)

**6. Saving the dataframe**

In [17]:
df.to_csv('Toronto1.csv', index = False)
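
The `index = False` argument affects what a later `pd.read_csv` returns. The sketch below compares both settings using in-memory buffers rather than a file on disk; `df_small` and the buffer names are hypothetical and exist only for this illustration.

```python
import io
import pandas as pd

# Two rows in the same layout as the cleaned dataframe.
df_small = pd.DataFrame({
    'PostalCode':    ['M1B', 'M1C'],
    'Borough':       ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern, Rouge', 'Highland Creek, Port Union, Rouge Hill'],
})

# Default behaviour: the row index is written as an extra, unnamed column.
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(cols_with_index)
print(list(reloaded.columns))
```

Neighbourhood values that contain commas are quoted on writing, so they come back as single fields when the file is read again.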