# Capstone Project: Segmenting and Clustering Neighborhoods in Toronto

## Part 1 - Data Sourcing / Processing

### Elements required:
1. [ ] Scrape Wiki page for data
2. [ ] Convert data scrape into a DataFrame
3. [ ] Generate final table for use in part 2 and save as CSV
4. [ ] Print required values for review (DataFrame size/shape)

---
#### import required modules:

In [1]:
import requests # URL handler
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup # HTML parser
from itertools import chain # useful for exploding lists into dfs with repeated cells

---

1. [x] Scrape Wiki page for data
2. [x] Convert data scrape into a DataFrame
3. [ ] Generate final table for use in part 2 and save as CSV
4. [ ] Print required values for review (DataFrame size/shape)
   - Use Requests to manage URL call
   - Use Beautiful Soup to extract the tables using html tags

In [2]:
# use requests to handle the url and return the html file
wikipage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# use beautiful soup to extract the table
soup = BeautifulSoup(wikipage.text, 'html.parser') # name the parser explicitly so bs4 does not have to guess one
soup_table = soup.find_all("table")[0] # there is only 1 table so just take the first one

n_columns = 0
n_rows = 0
column_names = []

for row in soup_table.find_all('tr'):

    td_tags = row.find_all('td')
    if len(td_tags) > 0:
        n_rows+=1
        if n_columns == 0:
            n_columns = len(td_tags)


    th_tags = row.find_all('th') 
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text().strip())

columns = column_names if len(column_names) > 0 else range(0,n_columns)
df = pd.DataFrame(columns = columns,
                  index= range(0,n_rows))
row_marker = 0
for row in soup_table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        # store the cell text with surrounding whitespace removed
        df.iat[row_marker, column_marker] = column.get_text().strip()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1
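
The two-pass loop above can also be written as a single pass that collects each row into a list before building the DataFrame. The following is only a minimal sketch under assumptions, not the method used in this notebook: `sample_html` is a made-up three-row table standing in for the Wikipedia page, and `table_to_dataframe` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real table is much larger.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def table_to_dataframe(table):
    """Collect header and data cells from a bs4 <table> tag in one pass."""
    header, rows = [], []
    for tr in table.find_all('tr'):
        th_cells = [th.get_text().strip() for th in tr.find_all('th')]
        td_cells = [td.get_text().strip() for td in tr.find_all('td')]
        if th_cells and not header:
            header = th_cells      # the first header row names the columns
        if td_cells:
            rows.append(td_cells)  # one list of strings per data row
    return pd.DataFrame(rows, columns=header or None)


soup_sample = BeautifulSoup(sample_html, 'html.parser')
df_sample = table_to_dataframe(soup_sample.find('table'))
print(df_sample.shape)
```

Building the DataFrame from a list of rows avoids pre-allocating an empty frame and tracking `row_marker` and `column_marker` by hand.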

---
1. [x] Scrape Wiki page for data
2. [x] Convert data scrape into a DataFrame
3. [x] Generate final table for use in part 2 and save as CSV
4. [ ] Print required values for review (DataFrame size/shape)
   - Filter rows where no Borough has been provided

In [3]:
# strip out Boroughs with "Not assigned" and reset index
# (chained so df_filtered is a new DataFrame rather than a slice modified in place)
df_filtered = df[df['Borough'] != "Not assigned"].reset_index(drop=True)

df_filtered.to_csv('Toronto_Neighborhoods_Cleaned.csv', sep = ',', header=df_filtered.columns)
df_filtered.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
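
One thing to watch when the saved file is loaded again in part 2: `to_csv` writes the row index by default, and `pd.read_csv` then reports it as an extra `Unnamed: 0` column. The sketch below assumes the index is not needed downstream and passes `index=False`. The rows are made up, and an in-memory buffer is used in place of a file so the example is self-contained.

```python
import io
import pandas as pd

# Made-up rows in the same shape as the scraped table.
df_small = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Filter and renumber in one chained expression, so the result is a new
# DataFrame rather than a slice of df_small.
df_small_filtered = df_small[df_small['Borough'] != 'Not assigned'].reset_index(drop=True)

# Write to an in-memory buffer; index=False leaves the row index out of the CSV.
buffer = io.StringIO()
df_small_filtered.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the CSV back returns only the three data columns.
df_round_trip = pd.read_csv(buffer)
print(list(df_round_trip.columns))
```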


---
**This is an additional table of unique neighbourhoods not requested in the instructions; you can ignore it if you wish.** I suspect the wiki data was originally provided in this form and later updated, which is why the instructions provided don't match the wiki data. I may use this table later if I decide to use the list of neighbourhoods in a more granular way.

The cell below creates a DataFrame with one neighbourhood per row, repeating the Postal Code and Borough values as needed.

In [4]:
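def chainer(s):
    # split each entry on commas, flatten, and strip the space left after each comma
    return [part.strip() for part in chain.from_iterable(s.str.split(','))]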

# calculate lengths of splits
lens = df_filtered['Neighbourhood'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate - df_filtered_UnpackN
df_filtered_UnpackN = pd.DataFrame({'Postal Code': np.repeat(df_filtered['Postal Code'].str.strip(), lens),
                    'Borough': np.repeat(df_filtered['Borough'].str.strip(), lens),
                    'Neighbourhood': chainer(df_filtered['Neighbourhood'].str.strip())})
df_filtered_UnpackN.to_csv('TNC_UnpackN.csv', sep = ',',
                           header=df_filtered_UnpackN.columns)
df_filtered_UnpackN.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Manor
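
The same one-neighbourhood-per-row table can probably be produced more directly with `DataFrame.explode` (available from pandas 0.25). This is a sketch of that alternative on made-up rows, not the approach used in the cell above; `df_pairs` and `df_pairs_unpacked` are illustrative names.

```python
import pandas as pd

# Made-up rows; the second one holds two comma-separated neighbourhoods.
df_pairs = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

df_pairs_unpacked = (
    df_pairs
    .assign(Neighbourhood=df_pairs['Neighbourhood'].str.split(','))  # string -> list
    .explode('Neighbourhood')                                         # one row per list item
)

# Remove the space left after each comma.
df_pairs_unpacked['Neighbourhood'] = df_pairs_unpacked['Neighbourhood'].str.strip()
print(df_pairs_unpacked)
```

Like the `np.repeat` version, `explode` keeps the original index, so rows that came from the same Postal Code share an index value.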


---
1. [x] Scrape Wiki page for data
2. [x] Convert data scrape into a DataFrame
3. [x] Generate final table for use in part 2 and save as CSV
4. [x] Print required values for review (DataFrame size/shape)

**NOTE -**
I've used a second DataFrame to count the neighbourhoods. Depending on how the question about the "size of the resulting DataFrame" is read, more than one of the numbers below could be the answer. In the format currently on the Wiki you should get 103 rows, one per Postal Code; if you itemise the neighbourhoods you get 217. Since the data has changed, I believe the question was originally designed to check that you combined the rows correctly and ended up with one row per Postal Code holding multiple comma-separated neighbourhoods (103 below).

In [5]:
print('\nThe dataframe has {} Boroughs with {} Postal Codes covering {} Neighborhoods.'.format(
    len(df_filtered['Borough'].unique()),
    len(df_filtered['Postal Code'].unique()),
    df_filtered_UnpackN.shape[0]))



The dataframe has 10 Boroughs with 103 Postal Codes covering 217 Neighborhoods.
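
Since checklist item 4 asks for the DataFrame size/shape, it may be worth reporting `DataFrame.shape` directly alongside the counts above. A minimal sketch on a made-up three-row frame (`df_example` is an illustrative name, not one of the notebook's DataFrames):

```python
import pandas as pd

df_example = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

n_rows, n_cols = df_example.shape                 # (rows, columns)
n_boroughs = df_example['Borough'].nunique()      # distinct boroughs
# Count neighbourhoods without building a second DataFrame:
# split each entry on commas and add up the list lengths.
n_neighbourhoods = int(df_example['Neighbourhood'].str.split(',').str.len().sum())

print(f'{n_rows} rows x {n_cols} columns, '
      f'{n_boroughs} boroughs, {n_neighbourhoods} neighbourhoods')
```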
