# Segmenting and Clustering Neighborhoods in Toronto
# Part I

We need a list of postcodes for Toronto, Canada. Luckily, a table is available on the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Unfortunately, there is no direct download function, so we can't just download the table as a CSV file and import it into our notebook. Instead, we need to extract it from the Wikipedia page.

## Prepare Environment
First, let's install and import the libraries we will need for this task:
+ Requests
+ BeautifulSoup4
+ Pandas
+ (Numpy)
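
Before uncommenting any install command, it can help to see which of these packages are actually missing. The following is a minimal sketch using only the standard library; the names `required` and `missing` are illustrative, and the list assumes the import names `bs4` and `lxml` for BeautifulSoup4 and its parser.

```python
import importlib.util

# Import names of the packages this notebook relies on
# (assumption: BeautifulSoup4 is imported as "bs4", the parser as "lxml")
required = ['requests', 'bs4', 'lxml', 'pandas']

# find_spec() returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

print('Missing packages:', missing)
```

An empty list means the install cell below can stay commented out.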

In [160]:
# Uncomment if the needed libraries are not already installed

#!conda install -c conda-forge beautifulsoup4
#!conda install -c conda-forge lxml
#!conda install -c conda-forge requests

In [161]:
import pandas as pd
#import numpy as np
from bs4 import BeautifulSoup
import requests

## Acquire Data

Now we will download the data and transform it into a pandas dataframe. 

We will fetch the first matching table element from the page by using the [find()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method of our `soup` object, which is an instance of the BeautifulSoup class.

The function [pandas.read_html()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) will come in handy when parsing the table element from the page. It searches for ```table``` HTML tags and returns a list of DataFrames, each containing the data of one table. As we fetched only one table element, the desired DataFrame will be the first element in the list (index 0).
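
To see the list-of-DataFrames behaviour without touching the network, here is a minimal sketch that parses a small hand-written HTML snippet. The snippet and the names `html`, `tables` and `toy_df` are made up for illustration; the two data rows are copied from the Wikipedia table. It assumes `lxml` (or another parser supported by pandas) is installed.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for the Wikipedia table (illustrative only)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html), header=0)

# Only one table is present, so the data sits at index 0
toy_df = tables[0]

print(len(tables))
print(toy_df)
```

Wrapping the HTML in `StringIO` passes it as a file-like object, which is the form newer pandas versions expect for literal HTML.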

In [162]:
# Fetch website
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

# Find table
table_html = soup.find('table', class_='wikitable sortable')

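# Parse table and create pandas dataframe
# (wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings)
from io import StringIO
df_postcodes = pd.read_html(StringIO(str(table_html)), header=0)[0]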

df_postcodes.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


## Prepare Data

That looks good! Let's polish the dataframe a little. Notice that some postcodes belong to more than one neighborhood. We should combine those into one row per postcode and separate the neighborhoods by commas. Also, let's remove all rows where the borough is "Not assigned" and replace any missing neighborhood name with the name of its borough:

In [163]:
# Drop rows where Borough is "Not assigned"
df_postcodes = df_postcodes[df_postcodes['Borough'] != 'Not assigned'].reset_index(drop=True)

# Build a boolean mask of the rows where Neighbourhood is "Not assigned"
neighborhood_missing = df_postcodes['Neighbourhood'] == 'Not assigned'

# For those rows, set Neighbourhood to the name of the Borough
df_postcodes.loc[neighborhood_missing, 'Neighbourhood'] = df_postcodes.loc[neighborhood_missing, 'Borough']

# Group table by Postcode and Borough; join the neighborhoods with commas if there is more than one
df_postcodes = df_postcodes.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood'].agg(', '.join)

df_postcodes.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
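
The three preparation steps are easier to follow on a tiny hand-made frame. The sketch below is illustrative only: the frame `toy` is built from four rows of the raw table shown earlier, and the fill step uses a boolean mask, which is one of several equivalent ways to write it.

```python
import pandas as pd

# Four rows taken from the raw table: one unassigned borough,
# one postcode with two neighbourhoods, one missing neighbourhood
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Step 1: drop rows without a borough
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Step 2: fill missing neighbourhoods with the borough name
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

# Step 3: one row per postcode, neighbourhoods joined by commas
toy = toy.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood'].agg(', '.join)

print(toy)
```

M1A disappears, the two M5A rows collapse into "Harbourfront, Regent Park", and M7A takes "Queen's Park" as its neighbourhood.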


We have finished preparing the postcode data. In the next part of our journey, we will look up the latitude and longitude coordinates of each neighborhood.

Let's save our dataframe to a CSV file and take a last look at its shape. How to save a file from a Jupyter Notebook inside the IBM Cloud environment using Watson Studio is described [here](https://medium.com/ibm-data-science-experience/control-your-dsx-projects-using-python-c69e13880312).

In [164]:
# Create CSV file (comment out if using the alternative below)
df_postcodes.to_csv('torronto_postcodes.csv', index=False)

# To save the CSV file as part of a Watson Studio project on IBM Cloud, use the following command:
#project.save_data(data=df_postcodes.to_csv(index=False), file_name='torronto_postcodes.csv', overwrite=True)

# How big is the dataframe?
df_postcodes.shape

(103, 3)
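
Because several Neighbourhood values now contain commas, it is worth confirming that the CSV format keeps them intact. Here is a minimal sketch, assuming an in-memory buffer in place of the file and three rows copied from the output above; `sample`, `csv_text` and `restored` are illustrative names.

```python
from io import StringIO

import pandas as pd

# Three rows copied from the prepared dataframe shown above
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1G', 'M1N'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Woburn', 'Birch Cliff, Cliffside West'],
})

# With no path given, to_csv() returns the CSV content as a string
csv_text = sample.to_csv(index=False)

# Read the text back as if it were a file
restored = pd.read_csv(StringIO(csv_text))

print(csv_text)
print(restored['Neighbourhood'].tolist() == sample['Neighbourhood'].tolist())
```

Fields that contain a comma are written in double quotes, so each joined neighbourhood list is read back as a single value.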