# The Battle of the Neighborhoods (Week 1)

### Summary

The main goal of the IBM Applied Data Science Capstone project is to compare neighborhoods in Toronto, Ontario, Canada. We explore them using location data, segment them into clusters of similar neighborhoods, and compare them on aspects such as the services they provide, why certain venues are popular, or why people complain about certain venues.

This notebook contains the first 'stage' of the capstone project. It primarily contains the code to obtain the neighborhood and borough data for the city of Toronto. 

### Obtaining Neighborhood Data
Unlike the data for some other cities, such as New York, Toronto's neighborhood data is not readily available on the Internet in a directly consumable format. However, a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) contains data we can use to divide the city into boroughs and neighborhoods.

First, we get our imports in place. We use `Pandas` to hold the data, `Requests` to download the page, and `BeautifulSoup` to parse the resulting HTML.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

To utilize the Wikipedia page, we will need to perform the following steps:
1. Scrape the HTML from the page
2. Parse the page, locate the relevant data in an HTML table and transform it to a Pandas DataFrame
3. Perform some data wrangling to deal with neighborhoods and boroughs that are 'not assigned' in the data and to combine the data in a useful way

The code block below performs these steps and displays the first few rows of the resulting DataFrame; its shape is checked in the cell after that. Note that I separately performed some manual data wrangling in SQL Server to establish the expected final shape of the DataFrame. Also note that the code below makes some key assumptions:
* Wikipedia doesn't change the existence, location, structure or content of the given page
* The HTML page contains exactly one table with class `wikitable sortable`
* Each row in that HTML table contains data
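
The second and third assumptions can be checked explicitly instead of being taken on trust. The sketch below is only an illustration: `sample_html` is a made-up stand-in for the downloaded page (its postal code and names are not real data), and it uses Python's built-in `html.parser` so that it runs without `lxml`. It counts the matching tables with `find_all` and keeps only the rows that actually contain `<td>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not real data)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>X1A</td><td>Borough A</td><td>Alpha</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every match, so we can verify there is exactly one such table
tables = soup.find_all('table', {'class': 'wikitable sortable'})
if len(tables) != 1:
    raise ValueError('Expected exactly one matching table, found {}'.format(len(tables)))

# Keep only rows that contain <td> cells; the header row here has <th> cells only
data_rows = [tr for tr in tables[0].find_all('tr') if tr.find_all('td')]
print('Found {} data row(s)'.format(len(data_rows)))
```

Raising an error when the table count is unexpected makes a change to the page visible immediately, instead of letting the rest of the notebook run on the wrong table.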

In [2]:
# Download the HTML of the Wiki page given in the assignment
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_html = requests.get(wiki_url).text

# Create a BeautifulSoup object out of the HTML using the lxml parser
soup = BeautifulSoup(wiki_html, 'lxml')

# From inspection, the table in the page has a class of 'wikitable sortable'
# and is the only such table currently in the document. Note that find()
# returns only the first match, so if another such table is added ahead of
# this one, the code will silently pick up the wrong table.
pc_table = soup.find('table', {'class':'wikitable sortable'})

# Get all the rows in the table we found
table_rows = pc_table.find_all('tr')

# Go through all the rows and build a list of the cell values in each one. We
# use rstrip() to remove the trailing \n, which from inspection is included on
# the last data item in the row. This code assumes that the text attribute of
# each cell is not None. Rows with no <td> cells (such as a header row) yield
# an empty list and are skipped, as are rows whose borough is 'Not assigned'.
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.rstrip() for cell in cells]
    if len(row) > 0 and row[1] != 'Not assigned':
        rows.append(row)

# Create a Pandas DataFrame from the list we built with appropriately named columns
pc_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

# Group the DataFrame by PostalCode and Borough and use the apply method to join
# Neighborhoods together with a comma in between. Create a new DataFrame from the
# result and reset the index
pc_df = pc_df.groupby(by=['PostalCode','Borough'])['Neighborhood'].apply(','.join).to_frame().reset_index()

# We excluded rows whose Borough is 'Not assigned'. However, from inspection,
# exactly one remaining row, Queen's Park, has 'Not assigned' as its neighborhood.
# For that row we set the neighborhood to the value of its borough.
pc_df.loc[pc_df.Neighborhood == 'Not assigned', 'Neighborhood'] = pc_df.loc[pc_df.Neighborhood == 'Not assigned', 'Borough']

# Display the first few rows
pc_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


The resulting DataFrame, based on manual, external validation in a relational database, should have 103 rows.

In [3]:
# The shape of the dataframe, based on external validation, should be (103,3)
print('The resulting DataFrame has {} rows'.format(pc_df.shape[0]))

The resulting DataFrame has 103 rows
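
The grouping and 'Not assigned' handling used to build `pc_df` can also be tried in isolation. The following is a minimal sketch on a few made-up rows: the postal codes, boroughs and neighborhoods in `toy_rows` are hypothetical placeholders, not values from the Wikipedia table. It shows that neighborhoods sharing a postal code and borough collapse into one comma-separated row, and that an unassigned neighborhood falls back to its borough name.

```python
import pandas as pd

# Hypothetical rows in the same three-column shape as the scraped table
toy_rows = [
    ['X1A', 'Borough A', 'Alpha'],
    ['X1A', 'Borough A', 'Beta'],
    ['X2B', 'Borough B', 'Not assigned'],
]
toy_df = pd.DataFrame(toy_rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

# One row per (PostalCode, Borough), with the neighborhoods joined by commas
toy_df = (
    toy_df.groupby(by=['PostalCode', 'Borough'])['Neighborhood']
    .apply(','.join)
    .to_frame()
    .reset_index()
)

# Where the neighborhood is 'Not assigned', fall back to the borough name
mask = toy_df['Neighborhood'] == 'Not assigned'
toy_df.loc[mask, 'Neighborhood'] = toy_df.loc[mask, 'Borough']

print(toy_df)
```

Computing the boolean mask once and reusing it on both sides of the assignment keeps the fill step easier to read than repeating the comparison.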
