# Week 3 Assignment
1. Fetch Wikipedia's table of Canadian postal codes starting with M.
2. Parse the web page to build a table.

In [7]:
import requests
# !pip install beautifulsoup4
from bs4 import BeautifulSoup
import pandas as pd

In [16]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
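resp = requests.get(wiki_url)
resp.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(resp.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning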

## HTML Scraping
1. The table we're interested in has the class 'wikitable', so a CSS selector search finds it in the HTML page.
1. A 'tr' is a table row and a 'td' is a table data cell. We go through each row of the table, extract its record, and store it in a list.
1. We're only interested in postal codes that have a borough assigned, so rows whose borough is 'Not assigned' are skipped.
1. Similarly, if the neighborhood is 'Not assigned', we fill it with the borough name.
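
The two filtering rules above can be tried out without fetching the page. The following is a minimal sketch, assuming the cell text has already been extracted into tuples; `clean_rows` and the sample rows are hypothetical names and illustrative values, not part of the assignment.

```python
def clean_rows(rows):
    """Apply the borough and neighborhood rules to (postal code, borough, neighborhood) tuples."""
    cleaned = []
    for cells in rows:
        # Ignore rows that do not have exactly three cells (for example, a header row).
        if len(cells) != 3:
            continue
        pcode, bor, nbhood = (c.strip() for c in cells)
        # Rule 1: drop postal codes that have no borough assigned.
        if bor == 'Not assigned':
            continue
        # Rule 2: fall back to the borough name when no neighborhood is given.
        if nbhood == 'Not assigned':
            nbhood = bor
        cleaned.append((pcode, bor, nbhood))
    return cleaned


# Illustrative rows only; the second and third are made up for the example.
sample = [
    ('M3A\n', 'North York\n', 'Parkwoods\n'),
    ('X1X\n', 'Not assigned\n', 'Not assigned\n'),
    ('X2X\n', 'Example Borough\n', 'Not assigned\n'),
]
print(clean_rows(sample))
```

The length check is an extra guard that the notebook's loop does not have; it keeps the unpacking from failing on a row with an unexpected number of cells.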

In [18]:
nbhood_data = []
# skip the header row, then read the three cells (postal code, borough, neighborhood) of each row
for row in soup.select('.wikitable')[0].find_all('tr')[1:]:
    pcode, bor, nbhood = [cell.get_text().strip() for cell in row.find_all('td')]
    if bor != 'Not assigned':
        if nbhood == 'Not assigned':
            nbhood = bor
        nbhood_data.append((pcode, bor, nbhood))

### Checking the list of postal codes shows that there are no duplicates.
Places that share a postal code have already been merged into single entries on the Wikipedia page.

In [None]:
postal_codes = [rec[0] for rec in nbhood_data]
print(len(postal_codes) == len(set(postal_codes)))  # True: no entries in the current table have duplicate postal codes
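
If the check ever printed False, it would help to know which codes repeat. A possible sketch using `collections.Counter` follows; `duplicated_codes` and the toy records are hypothetical and only illustrate the idea.

```python
from collections import Counter


def duplicated_codes(records):
    """Return the postal codes (first tuple field) that occur more than once, sorted."""
    counts = Counter(rec[0] for rec in records)
    return sorted(code for code, n in counts.items() if n > 1)


# Toy records; the repeated 'M3A' row is made up to trigger the check.
toy = [
    ('M3A', 'North York', 'Parkwoods'),
    ('M4A', 'North York', 'Victoria Village'),
    ('M3A', 'North York', 'Example Neighborhood'),
]
print(duplicated_codes(toy))
```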

## Create the data frame
Using the columns listed in the assignment description, we create a new data frame from the scraped records.

In [19]:
# define the dataframe columns
column_names = ['Postal Code', 'Borough', 'Neighborhood'] #, 'Latitude', 'Longitude'] 

# build the dataframe directly from the list of tuples
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow)
neighborhoods = pd.DataFrame(nbhood_data, columns=column_names)
neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


## There are 103 entries in this table.

In [20]:
neighborhoods.shape

(103, 3)

## Adding latitude and longitude
`pd.read_csv()` loads the CSV file of lat-longs into a dataframe, which is then combined with the neighborhoods dataframe to create a new dataframe.

In [21]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
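
Since the loading cell is hidden, here is a rough sketch of how such a CSV could be read and joined on toy data. The in-memory CSV text, the `_demo` names, and the use of `how='left'`, `validate`, and `indicator` are assumptions for illustration, not the notebook's actual code.

```python
import io

import pandas as pd

# Stand-in for the real CSV file, whose path is not shown in the notebook.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M3A,43.753259,-79.329656
"""
latlongs_demo = pd.read_csv(io.StringIO(csv_text))

neighborhoods_demo = pd.DataFrame(
    [('M3A', 'North York', 'Parkwoods'),
     ('M4A', 'North York', 'Victoria Village')],
    columns=['Postal Code', 'Borough', 'Neighborhood'],
)

# A left join keeps every neighborhood row; validate raises if a postal code
# repeats on either side, and indicator adds a '_merge' column per row.
merged = pd.merge(
    neighborhoods_demo,
    latlongs_demo,
    on='Postal Code',
    how='left',
    validate='one_to_one',
    indicator=True,
)

# Postal codes that found no coordinates in the CSV.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

A left join with the indicator column makes missing coordinates visible, whereas the default inner join would silently drop those rows.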


In [22]:
pd.merge(left=neighborhoods, right=latlongs, on='Postal Code')

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
