# Segmenting and Clustering Neighborhoods in Toronto, ON, Canada


## Introduction

In this first part of the week 3 assignment I'll be scraping a Wikipedia page for data on the neighbourhoods of Toronto.

Before we get the data and start exploring it, let's import all the libraries that we will need.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# import libraries for scraping data
import requests # library for making HTTP requests in Python
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files (scraping)

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## Download and Explore Dataset


For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. I will scrape that page, clean and wrangle the data, and then read it into a pandas dataframe so that it is in a structured format.

I will be using the [Beautiful Soup](http://beautiful-soup-4.readthedocs.io/en/latest/) library for scraping the data.

### Scraping data from Wikipedia

Get the contents of the page as text and store them in a variable called ```wiki_table```.

In [2]:
wiki_table = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
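The request above assumes the download succeeds. A slightly more defensive sketch follows; the helper name `fetch_html` and the 10-second timeout are my own illustrative choices, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical, for illustration only.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage (needs network access):
# wiki_table = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```

Failing early here makes it obvious when the page could not be fetched, instead of parsing an error page further down.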

Create a BeautifulSoup object, which represents the page as a nested data structure.

In [3]:
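soup = BeautifulSoup(wiki_table, 'html.parser') # name the parser explicitly so the result does not depend on which parsers are installed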

Inspect the page's HTML to find the tags associated with the table

![title](img/table.png)

Use ```find()``` to extract only the table from the soup.

In [4]:
my_table = soup.find('table', {'class': 'wikitable sortable'})
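To see what `find()` does with a class filter, here is a small self-contained sketch. The HTML string is made up to stand in for the Wikipedia page, and the names `sample_html`, `sample_soup`, `sample_table` and `missing` are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption for illustration)
sample_html = """
<html><body>
<table class="infobox"><tr><td>not the table we want</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})
missing = sample_soup.find('table', {'class': 'no-such-class'})

print(sample_table['class'])  # the class attribute comes back as a list of class names
print(missing)
```

Because `find()` returns `None` when no tag matches, it is worth checking the result before calling methods on it if the page layout might change.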

Extract the header cells, which are within ```<th>``` tags, and store them in a list called ```header```.

In [5]:
header = [th.text.rstrip() for th in my_table.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']

Extract the rows, which are within ```<tr>``` tags, and their cells (```<td>``` tags), and store each column's values in its own list.

In [6]:
c1 = []
c2 = []
c3 = []
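for row in my_table.find_all('tr'):
    cells = row.find_all('td')
    # data rows have exactly three <td> cells; the header row has only <th> cells
    if len(cells) == 3:
        # string=True returns the first text node (text=True is its deprecated alias)
        c1.append(cells[0].find(string=True).rstrip())
        c2.append(cells[1].find(string=True).rstrip())
        c3.append(cells[2].find(string=True).rstrip())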
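The same row-by-row extraction can be sketched on a tiny made-up table. This variant collects whole rows first and then transposes them with `zip`, and it uses `get_text(strip=True)`, which joins all text inside a cell rather than taking only the first text node. The sample HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table with the same three-column layout (an assumption for illustration)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M5A </td><td> Downtown Toronto </td><td> Regent Park, Harbourfront </td></tr>
</table>
"""

table = BeautifulSoup(sample_html, 'html.parser').find('table')

rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) == 3:  # skips the header row, which has <th> cells only
        rows.append([cell.get_text(strip=True) for cell in cells])

# Transpose the list of rows into three column lists
c1, c2, c3 = (list(col) for col in zip(*rows))
print(c1, c2, c3)
```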

Create a ```dict``` that uses the header values as keys and the column lists as values. Then convert it into a DataFrame called ```toronto_boroughs```.

In [7]:
d = dict([(x, 0) for x in header])
d['Postal Code'] = c1
d['Borough'] = c2
d['Neighbourhood'] = c3
toronto_boroughs = pd.DataFrame(d)
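A more compact way to build the same kind of dictionary is to pair each header with its column list using `zip`. This is only a sketch with made-up two-row data; `sample_df` is an illustrative name.

```python
import pandas as pd

header = ['Postal Code', 'Borough', 'Neighbourhood']
# Two sample rows, stored column by column as in the cell above
c1 = ['M3A', 'M4A']
c2 = ['North York', 'North York']
c3 = ['Parkwoods', 'Victoria Village']

# zip pairs each header with the column list in the same position
sample_df = pd.DataFrame(dict(zip(header, [c1, c2, c3])))
print(sample_df)
```

This relies on `header` listing the columns in the same order as the lists, which holds here because both were read left to right from the table.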

The dataframe consists of three columns: ```Postal Code```, ```Borough```, and ```Neighbourhood```.

In [8]:
toronto_boroughs.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Some postal codes cover more than one neighbourhood, so let's split each neighbourhood into its own row.

In [9]:
columns = toronto_boroughs.columns
toronto_boroughs = (
    toronto_boroughs.set_index(['Postal Code', 'Borough'])
    .Neighbourhood.str.split(', ', expand=True).stack() # one neighbourhood per row
    .reset_index(['Postal Code', 'Borough']).reset_index(drop=True)
)
toronto_boroughs.columns = columns
toronto_boroughs.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park |
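For comparison, the same split into one row per neighbourhood can be sketched with `DataFrame.explode`, which I believe is available in pandas 0.25 and later. This is an alternative to the approach above, not what the notebook ran; the sample data and the names `sample` and `exploded` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['M4A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Victoria Village', 'Regent Park, Harbourfront'],
})

exploded = (
    sample
    .assign(Neighbourhood=sample['Neighbourhood'].str.split(', '))  # strings -> lists
    .explode('Neighbourhood')                                       # one row per list item
    .reset_index(drop=True)
)
print(exploded)
```

The other columns are repeated for every neighbourhood, so the column names and order are preserved without resetting them by hand.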


Only process the cells that have an assigned borough; ignore cells whose borough is ```Not assigned```.

In [10]:
toronto_boroughs = toronto_boroughs.drop(toronto_boroughs[toronto_boroughs['Borough'] == 'Not assigned'].index).reset_index(drop=True)
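The same filter can also be written as a boolean mask that keeps the rows we want instead of dropping the ones we don't. A minimal sketch on made-up rows, with `sample` and `assigned` as illustrative names:

```python
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose borough is assigned, then renumber the index from 0
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```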

If a cell has a borough but a ```Not assigned``` neighbourhood, then the neighbourhood will be the same as the borough. First, check whether any such rows exist.

In [11]:
toronto_boroughs[toronto_boroughs['Neighbourhood'] == "Not assigned"]

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|


After dropping the rows with no assigned borough, we see that no ```Not assigned``` neighbourhoods remain.
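Since no such rows exist in this data, the rule of copying the borough into an unassigned neighbourhood is never needed here. If it were, a minimal sketch could look like the following. The first row is deliberately made up (`X1X`, `Example Borough`) to trigger the rule, and the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['X1X', 'M9A'],                      # 'X1X' is a made-up code
    'Borough': ['Example Borough', 'Etobicoke'],         # 'Example Borough' is made up
    'Neighbourhood': ['Not assigned', 'Islington Avenue'],
})

# Rows that have a borough but no neighbourhood
unassigned = sample['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
sample.loc[unassigned, 'Neighbourhood'] = sample.loc[unassigned, 'Borough']
print(sample)
```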

## So, this is our final Toronto neighbourhoods data frame

In [12]:
toronto_boroughs.head(10)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park |
| 3 | M5A | Downtown Toronto | Harbourfront |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park |
| 7 | M7A | Downtown Toronto | Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M9A | Etobicoke | Humber Valley Village |


Use the ```.shape``` attribute to print the number of rows and columns of the dataframe.

In [13]:
toronto_boroughs.shape

(217, 3)

In [14]:
toronto_boroughs.to_csv('toronto_boroughs.csv', index=False)
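The last cell saves the dataframe with `index=False`, so the row index is not written to the file. A quick way to check that such a file reads back cleanly is a round trip through an in-memory buffer. This is a sketch on a single made-up row; `buffer` and `restored` are illustrative names.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'Postal Code': ['M3A'],
    'Borough': ['North York'],
    'Neighbourhood': ['Parkwoods'],
})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the CSV text back into a new dataframe
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
print(restored)
```

With `index=False` the restored frame has exactly the three original columns and no extra index column.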