# Capstone Project - Part 1: Data sourcing and processing
## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### Source and process the data, then export it to CSV for use in Part 2. This keeps the notebooks more legible and avoids re-running the data-generating code multiple times.

In [1]:
import requests # URL handler
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup # HTML parser
from tabulate import tabulate # nice way to view larger DFs
from itertools import chain # useful for exploding lists

#### Using Beautiful Soup to extract the data and Requests to handle the URL. The code below simply iterates over the tags generated by BS. I've loaded Tabulate to help view the DataFrame, then saved the final DataFrame as a CSV to use later (as indicated at the beginning, mostly to keep the notebooks cleaner).
#### In the final cell I've listed the DataFrame shapes as required by the questions. Note that the increased size is expected, given that the original table collapsed neighborhoods in the same Postal Code into one row.
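
As a compact illustration of the tag iteration described above, here is a minimal sketch that runs on a small inline HTML string rather than the live Wikipedia page. The `sample_html` contents and the helper name `table_to_frame` are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

def table_to_frame(html):
    """Collect header cells and data cells from the first table in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # <th> tags give the column names
    header = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        # rows without <td> tags (the header row) are skipped
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows, columns=header)

sample_df = table_to_frame(sample_html)
print(sample_df)
```

Building a list of rows first and constructing the DataFrame once avoids pre-allocating an empty frame and filling it cell by cell, though both approaches should yield the same table.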

In [2]:
# load the URL, scrape the data and pass it into a DataFrame
wikipage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(wikipage.text, 'html.parser') # name the parser explicitly so bs4 does not have to guess one
soup_table = soup.find_all("table")[0] # there is only 1 table so just take the first one

n_columns = 0
n_rows = 0
column_names = []

# use "soup" tags to retrieve the table data and construct data frame
for row in soup_table.find_all('tr'):

    td_tags = row.find_all('td')
    if len(td_tags) > 0:
        n_rows+=1
        if n_columns == 0:
            n_columns = len(td_tags)


    th_tags = row.find_all('th') 
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text().strip())

columns = column_names if len(column_names) > 0 else range(0,n_columns)
df = pd.DataFrame(columns = columns,
                  index= range(0,n_rows))
row_marker = 0
for row in soup_table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        df.iat[row_marker,column_marker] = column.get_text().strip()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1

# strip out Boroughs with "Not assigned" and reset index
# (chaining reset_index avoids an in-place change on a filtered slice)
df_filtered = df[df['Borough'] != "Not assigned"].reset_index(drop=True)

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df_filtered['Neighbourhood'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate - df_filtered_complete
df_fc = pd.DataFrame({'Postal Code': np.repeat(df_filtered['Postal Code'], lens),
                    'Borough': np.repeat(df_filtered['Borough'], lens),
                    'Neighbourhood': chainer(df_filtered['Neighbourhood'])})


# final result is a table with 1 neighborhood per row, excluding those with no borough
print(tabulate(df_fc.head(20), headers=df_fc.columns))

    Postal Code    Borough           Neighbourhood
--  -------------  ----------------  -----------------------------
 0  M3A            North York        Parkwoods
 1  M4A            North York        Victoria Village
 2  M5A            Downtown Toronto  Regent Park
 2  M5A            Downtown Toronto  Harbourfront
 3  M6A            North York        Lawrence Manor
 3  M6A            North York        Lawrence Heights
 4  M7A            Downtown Toronto  Queen's Park
 4  M7A            Downtown Toronto  Ontario Provincial Government
 5  M9A            Etobicoke         Islington Avenue
 5  M9A            Etobicoke         Humber Valley Village
 6  M1B            Scarborough       Malvern
 6  M1B            Scarborough       Rouge
 7  M3B            North York        Don Mills
 8  M4B            East York         Parkview Hill
 8  M4B            East York         Woodbine Gardens
 9  M5B            Downtown Toronto  Garden District
 9  M5B            Downtown Toronto  Ryerson
10  M6B 
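
The repeated index values in the output above (2, 2, 3, 3, ...) come from `np.repeat` carrying each original row label along with the repeated values. For comparison, here is a minimal sketch of the same one-neighborhood-per-row reshaping using `DataFrame.explode` (available from pandas 0.25). This is an alternative technique, not the method used in the notebook, and the small `collapsed` frame is illustrative sample data.

```python
import pandas as pd

# Illustrative sample: one row per postal code, neighborhoods comma-separated
collapsed = pd.DataFrame({
    "Postal Code": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

exploded = (
    collapsed
    # turn each comma-separated string into a list of names
    .assign(Neighbourhood=collapsed["Neighbourhood"].str.split(","))
    # one output row per list element; the other columns are repeated
    .explode("Neighbourhood")
    # drop the space left behind after each comma
    .assign(Neighbourhood=lambda d: d["Neighbourhood"].str.strip())
)
print(exploded)
```

Like the `np.repeat` version, `explode` keeps the original row label on every generated row; appending `.reset_index(drop=True)` would give a unique index if that is preferred.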

In [3]:
df_fc.to_csv('Toronto_Neighborhoods_Cleaned', sep = ',', header=df_fc.columns)
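
Because `to_csv` writes the DataFrame index as an unnamed first column by default, the file needs to be read back with that in mind. Below is a minimal sketch of the round trip using an in-memory buffer instead of a file; the sample frame and the choice of `index_col=0` are assumptions for illustration, and Part 2 may load the file differently.

```python
import io
import pandas as pd

# Illustrative frame with a repeated index, mimicking the exploded table
frame = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M5A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": ["Regent Park", "Harbourfront"],
    },
    index=[2, 2],
)

buffer = io.StringIO()
frame.to_csv(buffer, sep=",")  # the index is written as the first column
buffer.seek(0)

# index_col=0 restores that first column as the index instead of a data column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` would be another option if the repeated index is not needed later.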

In [4]:
print('rows in original DF: ', df.shape,
      '\nrows in final DF: ', df_fc.shape)


rows in original DF:  (180, 3) 
rows in final DF:  (217, 3)
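
The growth in row count can be sanity-checked: each filtered row should contribute one output row per comma-separated neighborhood, that is, its comma count plus one. A minimal sketch on illustrative sample strings (not the full scraped table):

```python
import pandas as pd

# Illustrative sample of collapsed neighborhood strings
neighbourhoods = pd.Series(["Parkwoods", "Regent Park, Harbourfront", "Malvern, Rouge"])

# one output row per name: number of commas plus one, summed over all rows
expected_rows = int((neighbourhoods.str.count(",") + 1).sum())
print(expected_rows)
```

Applied to `df_filtered['Neighbourhood']`, the same expression should match the first element of `df_fc.shape`, assuming commas appear only as separators between neighborhood names.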
