# Scraping Wikipedia for neighbourhood data
In this notebook we'll use BeautifulSoup to scrape a table from a Wikipedia page.
The scraped data will then be loaded into a pandas DataFrame and cleaned up.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from bs4 import BeautifulSoup # web scraping library
import requests # library for grabbing html pages

## Data acquisition and processing
We download the page with requests.get(scrape_target) and read its HTML source from the .text attribute of the response.

Using this HTML and BeautifulSoup, we can extract the data from the table on the page.

In [12]:
scrape_target = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_html = requests.get(scrape_target).text
print(website_html[:154])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
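
The download step can be made a little more defensive. The following is a minimal sketch, not part of the original notebook: the helper name fetch_html and the 10-second default timeout are illustrative assumptions. It only relies on requests.get(), its timeout argument, and Response.raise_for_status().

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url`, raising on network or HTTP errors.

    `fetch_html` and the 10-second default are illustrative choices,
    not something the original notebook defines.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into requests.exceptions.HTTPError
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could read website_html = fetch_html(scrape_target), so that a failed download stops the notebook instead of passing an error page on to the parser.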


BeautifulSoup parses the HTML page and builds a hierarchical data structure that we can navigate programmatically.

In [21]:
soup = BeautifulSoup(website_html, 'lxml')
print(soup.prettify()[:166])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
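
To get a feel for that tree structure, here is a small sketch on a hand-written HTML string (the string and the variable names doc and tree are made up for illustration). It uses the html.parser backend that ships with Python, so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A tiny, hand-written document used only for illustration
doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
tree = BeautifulSoup(doc, 'html.parser')

# Tags are reachable as attributes of their ancestors
print(tree.title.string)       # Demo page

# Every tag knows its parent...
print(tree.title.parent.name)  # head

# ...and its children
print([child.name for child in tree.body.children])  # ['p']
```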


Next, let's use the find() and find_all() methods of the BeautifulSoup object to extract the table and its elements.

In [23]:
postcode_table = soup.find('table', {'class': 'wikitable sortable'})
table_elements = postcode_table.find_all('td')  # find_all() is the current name of the older findAll() alias
print(table_elements[:10])

[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
</td>, <td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
</td>, <td>M3A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td>, <td>M4A</td>]
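
The difference between the two lookups is easier to see on a small input. The sketch below runs them on a hand-written table that only imitates the layout of the Wikipedia page; sample_html and the other names are illustrative. It uses find_all(), of which findAll() is an older alias.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table that imitates the structure of the Wikipedia page
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

# html.parser ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# find_all() returns every matching descendant, in document order
cells = sample_table.find_all('td')

print(len(cells))                          # 6
print([cell.text for cell in cells[:3]])   # ['M1A', 'Not assigned', 'Not assigned']
print(cells[4].text)                       # North York
```

Note that .text returns the visible text even when the cell wraps it in a hyperlink, which is why the header cells (th) are the only part of the table that the td search leaves out.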


Using the above code, we get a list of values in "td" tags. 

Comparing this list to the original HTML, we see that every three consecutive objects in the list represent a single row of the table.

Consequently, we'll split the list into a list of lists, where each sub-list is a line in the table.

In [26]:
split_table_elements = [table_elements[i:i + 3] for i in range(0, len(table_elements), 3)]  # slices of 3 cells = 1 table row
print(split_table_elements[:3])

[[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
</td>], [<td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
</td>], [<td>M3A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td>]]
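
The grouping step silently assumes that the number of cells is a multiple of three. A small helper can make that assumption explicit. This is a sketch on plain strings; the name chunk is hypothetical and not part of the notebook.

```python
def chunk(items, size):
    """Split `items` into consecutive lists of `size` elements.

    Raises ValueError when the length is not a multiple of `size`,
    which would mean the table does not have the layout we assumed.
    """
    if len(items) % size != 0:
        raise ValueError(f"{len(items)} items cannot be split into rows of {size}")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


cells = ['M1A', 'Not assigned', 'Not assigned', 'M3A', 'North York', 'Parkwoods']
print(chunk(cells, 3))
# [['M1A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods']]
```

Failing early here is a design choice: if Wikipedia ever adds or removes a column, the error appears at the grouping step rather than as misaligned rows later on.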


## Extracting data into a DataFrame

A few points that we must consider:
* data in the Postcode column is text
* data in the Bourough & Neighbourhood columns can be either text or a hyperlink 
* if the value in Bourough is "Not assigned" we should skip the line
* if the value in Neighbourhood is "Not assigned" (but Bourough is anything else), we should change it to the name of the Bourough

We'll extract the text from each sub-list and append it as a row to a temporary list, which we'll then turn into a pandas.DataFrame.

In [39]:
data_table = []

# Each element in split_table_elements contains 3 <td> tags that represent a line in the table
# element[0] - data from the Postcode column
# element[1] - data from the Bourough column
# element[2] - data from the Neighbourhood column
for element in split_table_elements:
    # Extract the textual data using the .text accessor
    postcode = element[0].text.strip()
    bourough = element[1].text.strip()
    neighbourhood = element[2].text.strip()
        
    data_table.append([postcode, bourough, neighbourhood])
    
data_table[:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
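
The notebook applies the two "Not assigned" rules later, with pandas. As an alternative, they could be applied while the rows are still plain lists. The sketch below shows that variant under the assumption that each row is a [postcode, borough, neighbourhood] list of strings; clean_row and raw_rows are hypothetical names.

```python
def clean_row(postcode, borough, neighbourhood):
    """Apply the two "Not assigned" rules to one table row.

    Returns None for rows that should be skipped.
    """
    # Rule 1: rows without a borough are dropped
    if borough == 'Not assigned':
        return None
    # Rule 2: a missing neighbourhood falls back to the borough name
    if neighbourhood == 'Not assigned':
        neighbourhood = borough
    return [postcode, borough, neighbourhood]


raw_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],
]

cleaned = [row for row in (clean_row(*r) for r in raw_rows) if row is not None]
print(cleaned)
# [['M3A', 'North York', 'Parkwoods'], ['M7A', "Queen's Park", "Queen's Park"]]
```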

Create the DataFrame:

In [51]:
df = pd.DataFrame(data_table)
df.columns = ['Postcode', 'Bourough', 'Neighbourhood']
df.head()

| | Postcode | Bourough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


## Data cleanup
Keep only rows where the value for Bourough != "Not assigned"

In [52]:
df = df.loc[df.Bourough != 'Not assigned']  # keep only rows where Bourough is not "Not assigned"
df = df.reset_index(drop=True)  # renumber the rows from 0 without keeping the old index as a column
df.head(10)

| | Postcode | Bourough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
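
The filter works because comparing a column to a value produces one boolean per row, and .loc keeps the rows marked True. A minimal sketch on a toy frame (toy, mask and kept are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Bourough': ['Not assigned', 'North York', "Queen's Park"],
})

# The comparison yields one boolean per row
mask = toy['Bourough'] != 'Not assigned'
print(mask.tolist())               # [False, True, True]

# Keep the True rows and renumber them from 0 in a single step
kept = toy.loc[mask].reset_index(drop=True)
print(kept.index.tolist())         # [0, 1]
print(kept['Postcode'].tolist())   # ['M3A', 'M7A']
```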


Now we'll assign every Neighbourhood that is "Not assigned" the same value as its Bourough.

In [53]:
missing_neighbourhoods = df.Neighbourhood == 'Not assigned' # build filter
# perform the assignment with a single .loc call; chained indexing (df.Neighbourhood[...] = ...) may write to a copy
df.loc[missing_neighbourhoods, 'Neighbourhood'] = df.loc[missing_neighbourhoods, 'Bourough']
df.head(10)

| | Postcode | Bourough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


Notice: row 6 of the last table has changed compared to the preceding table. Its Neighbourhood column now contains "Queen's Park" instead of "Not assigned".
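
The same fill can be tried in isolation. The sketch below uses a two-row toy frame (toy and missing are illustrative names) and writes the values through one .loc call, which selects rows and column together so that pandas updates the frame itself rather than a temporary copy.

```python
import pandas as pd

toy = pd.DataFrame({
    'Bourough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# True for every row whose Neighbourhood still needs a value
missing = toy['Neighbourhood'] == 'Not assigned'

# Copy the Bourough into the Neighbourhood column for exactly those rows
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Bourough']

print(toy['Neighbourhood'].tolist())  # ['Parkwoods', "Queen's Park"]
```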

## Building the final DataFrame
Lastly, we'll build a DataFrame in which each Postcode appears only once.

The Neighbourhood value in each row is a comma-separated string of all the Neighbourhoods that share that Postcode.

In [54]:
rows = []

# Loop over unique postcodes
for pc in df.Postcode.unique():
    postcode_rows = df.loc[df.Postcode == pc]
    # Extract the Bourough name
    bourough = postcode_rows.Bourough.values[0]
    # Join the Neighbourhoods for this Postcode into one comma-separated string
    hoods = ', '.join(postcode_rows.Neighbourhood.tolist())

    # Collect the row; DataFrame.append was removed in pandas 2.0
    rows.append({'Postcode': pc,
                 'Bourough': bourough,
                 'Neighbourhood': hoods})

# Build the DataFrame once from the collected rows
final_df = pd.DataFrame(rows, columns=['Postcode', 'Bourough', 'Neighbourhood'])
final_df.head(10)

| | Postcode | Bourough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [55]:
final_df.shape

(103, 3)
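
As an alternative to the explicit loop, pandas can do the same aggregation with groupby. The sketch below shows this on a three-row toy frame; it is a possible variant, not the method used in this notebook, and the names toy and grouped are illustrative. It assumes that each Postcode belongs to exactly one Bourough, as in the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M7A'],
    'Bourough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park', "Queen's Park"],
})

# sort=False keeps the postcodes in order of first appearance
grouped = (
    toy.groupby(['Postcode', 'Bourough'], sort=False)['Neighbourhood']
       .agg(', '.join)   # join the neighbourhood names of each group
       .reset_index()    # turn the group keys back into regular columns
)

print(grouped['Neighbourhood'].tolist())
# ['Harbourfront, Regent Park', "Queen's Park"]
print(grouped.shape)  # (2, 3)
```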