# This notebook will mainly be used as the Coursera Capstone Notebook

In [7]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

!conda install -c conda-forge geocoder --yes

import geocoder

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    ratelim-0.1.6              |           py35_0           5 KB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    orderedset-2.0             |           py35_0         685 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    geocoder-1.38.1            |             py_0          52 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.1 MB

The following NEW packages will be INSTALLED:

    geocoder:        1.38.1-py_0       conda-forge
    orderedset:

The point of this notebook is to create a pandas dataframe that has the following columns: PostalCode, Borough, Neighborhood, Latitude, and Longitude. The first three columns (PostalCode, Borough, and Neighborhood) will be scraped from the following wiki link (<em>https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</em>) using the BeautifulSoup Python package (<em>http://beautiful-soup-4.readthedocs.io/en/latest/</em>). After cleaning that dataframe up, I'll look up latitude and longitude with the Geocoder Python package and append them to the dataframe.

Once the dataframe is cleaned up and complete, I'll run a clustering analysis and display the results on a map.

First let's do some scraping. The next cell imports the 'requests' Python package, specifies the link of the page I want to scrape, uses requests.get to download the page and keep its HTML as text, then creates a Soup object from that text to work with.

In [33]:
import requests

PageLink = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

PageResponse = requests.get(PageLink, timeout=5).text

PageContent = BeautifulSoup(PageResponse, "lxml")

And let's look at the Soup object. Looks like the table is kept in a table element with the class 'wikitable sortable'.

In [37]:
#print(PageContent.prettify())

In [38]:
PostalCodeTable = PageContent.find('table',{'class':'wikitable sortable'})
PostalCodeTable

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

Course instructions said to only bother with postal codes that have an assigned borough. That is convenient because, in the output above, each assigned borough is wrapped in an '<a href>' tag, while the unassigned ones are plain 'Not assigned' text.
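Before pulling the whole table apart, here is a minimal sketch of what get_text() does to a single row. The row is copied from the output above; the names row_html, cells, and links are mine, and I use the built-in "html.parser" here (rather than "lxml") only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# One row copied from the table output above, as a stand-alone string.
row_html = (
    '<tr>\n'
    '<td>M3A</td>\n'
    '<td><a href="/wiki/North_York" title="North York">North York</a></td>\n'
    '<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>\n'
    '</td></tr>'
)

# "html.parser" ships with Python; the notebook itself uses "lxml".
row = BeautifulSoup(row_html, "html.parser").find("tr")

# get_text() drops all tags (including the links) and keeps the newlines,
# so splitting on '\n' leaves an empty string at each end.
cells = row.get_text().split('\n')

# The link targets are still available on the <a> tags if needed.
links = [a['href'] for a in row.find_all('a')]

print(cells)
print(links)
```

The empty strings at both ends are why the dataframe built from these lists needs its extra columns trimmed afterwards.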

In [159]:
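# Collect every table row as a list of cell strings.
Rows = PostalCodeTable.find_all('tr')

DF = []

for row in Rows:
    DF.append(row.get_text().split('\n'))

# The first row holds the column headers.
DF_title = DF.pop(0)

DF = pd.DataFrame.from_records(DF)

DF.columns = DF_title

# Keep only the three named columns (the split also yields empty-string columns).
DF = DF[['Postcode', 'Borough', 'Neighbourhood']]

DF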

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
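As a variation on the cell above, the empty strings can be filtered out before the dataframe is built, so no column selection is needed afterwards. This is only a sketch under the assumption that no real cell is ever empty (otherwise the columns would shift); raw_rows, cleaned, and frame are illustrative names, and the rows are taken from the output above.

```python
import pandas as pd

# Rows in the shape produced by row.get_text().split('\n').
raw_rows = [
    ['', 'Postcode', 'Borough', 'Neighbourhood', ''],
    ['', 'M3A', 'North York', 'Parkwoods', ''],
    ['', 'M4A', 'North York', 'Victoria Village', ''],
]

# Drop the empty strings left at the start and end of each row.
cleaned = [[cell for cell in row if cell] for row in raw_rows]

# First row is the header, the rest is data.
header, body = cleaned[0], cleaned[1:]
frame = pd.DataFrame(body, columns=header)

print(frame)
```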


Awesome, now we have a pandas dataframe with the full wiki scrape. Now I'll drop the rows whose borough is 'Not assigned'.

In [160]:
DF = DF.loc[DF['Borough'] != 'Not assigned']
DF

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
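The filter above is a plain boolean mask. A small self-contained sketch on three rows from the table (sample, mask, and kept are illustrative names) shows what it keeps; the reset_index call is optional and is not something the cell above does, which is why the real output keeps its original index labels.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# True for every row whose borough is assigned.
mask = sample['Borough'] != 'Not assigned'

# Keep only those rows; renumber the index from 0 (optional).
kept = sample.loc[mask].reset_index(drop=True)

print(kept)
```

Note that M7A survives the filter: its borough is assigned even though its neighbourhood is not.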


OK now I am going to combine the rows that share a postal code and borough but have different neighborhoods into single rows with multivalued neighborhoods (kept as a comma-separated string; I may change the dtype later if needed).

In [161]:
DF = DF.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
DF.shape

(103, 3)
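To make the groupby step easier to follow, here is the same pattern on three rows from the scrape (sample and merged are illustrative names). It is a sketch of the technique, not a replacement for the cell above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group on postal code and borough, then join each group's
# neighbourhood names into one comma-separated string.
merged = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)

print(merged)
```

groupby sorts the group keys by default, so the result comes back ordered by postal code rather than in scrape order.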

OK now I'm going to assign any unassigned neighborhood the corresponding borough value.

In [162]:
# Where the neighbourhood is unassigned, fall back to the borough name.
DF.loc[DF['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = DF.loc[DF['Neighbourhood'] == 'Not assigned', 'Borough']

In [165]:
DF

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
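The borough fallback can also be written with Series.mask, which replaces values wherever a condition is True. This is an alternative sketch on two rows from the scrape, not what the cell above does; sample and unassigned are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# True where the neighbourhood still needs a value.
unassigned = sample['Neighbourhood'] == 'Not assigned'

# mask() swaps in the borough wherever the condition holds
# and leaves every other neighbourhood untouched.
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])

print(sample)
```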


In [166]:
print("Shape of the final dataframe is ", DF.shape)

Shape of the final dataframe is  (103, 3)
