
# <font size=14>Capstone Project Notebook</font>
This notebook is for the <font color='green'>**capstone project**</font> under <font color='teal'>*IBM Applied Data Science Professional Course*</font> hosted on Coursera.

In [8]:
import pandas as pd
import numpy as np

print('Hello Capstone Project Course!')

Hello Capstone Project Course!


# Build Toronto neighborhood dataframe by scraping [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) 

### Our objective in this section is to get the postal code and neighborhood information from Wikipedia and put it into a *pandas* dataframe.

First, we fetch the Wikipedia page with the <b><i>requests</i></b> module.

In [9]:
import requests

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

source = requests.get(url).text

# print the first 6 lines
for line in source.splitlines()[:6]:
    print(line)


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>


Now let's import <b>BeautifulSoup</b>, a widely used library for parsing web pages.

In [0]:
import bs4
from bs4 import BeautifulSoup

We will now parse the Wikipedia page. The postal codes of Toronto are recorded in a table, so to avoid unnecessary memory use we will parse only the portions of the page inside `<table>` tags. For more information on parsing part of a document, see the [Beautiful Soup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/#parsing-only-part-of-a-document).

We will use the <b>lxml</b> parser, which supports partial parsing (of the parsers Beautiful Soup can use, only html5lib does not).

In [0]:
from bs4 import SoupStrainer # this class is needed for partial parsing

only_table_tags = SoupStrainer('table')

soup = BeautifulSoup(source, 'lxml', parse_only=only_table_tags)

print(soup.prettify()) # the output is cleared to save space

# limit height of scrollable output window
from IPython.display import Javascript
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'''))
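
To make the effect of partial parsing concrete, here is a minimal, self-contained sketch. It assumes a small made-up HTML snippet (`sample_html`) in place of the Wikipedia page, and it uses Python's built-in `html.parser` so that it runs without lxml installed. Both are illustrative choices and not part of this notebook's pipeline.

```python
from bs4 import BeautifulSoup, SoupStrainer

# hypothetical page: two tables surrounded by other markup
sample_html = """
<html><body>
<p>Intro paragraph</p>
<table><tr><td>A</td></tr></table>
<div>Some other content</div>
<table><tr><td>B</td></tr></table>
</body></html>
"""

# parse only the <table> elements; everything else is skipped
only_table_tags = SoupStrainer('table')
partial_soup = BeautifulSoup(sample_html, 'html.parser', parse_only=only_table_tags)

print(len(partial_soup.find_all('table')))  # number of tables kept
print(partial_soup.find('p'))               # the <p> tag was never parsed
```

Because the `<p>` and `<div>` elements are never turned into objects, the resulting soup holds only the table markup.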

We now have all the tables in the page, and we can see that there are 3 in total. The first table contains our desired postal codes, so the `find()` method, which returns only the first matching occurrence, gives us exactly what we need.

In [0]:
postal_code_table = soup.find('table')

print(postal_code_table.prettify()) # the output is cleared to save space

# limit height of scrollable output window
from IPython.display import Javascript
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'''))
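
The difference between `find()` and `find_all()` can be sketched on a tiny example. The HTML string and the `id` values below are hypothetical, and the built-in `html.parser` is used only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

# hypothetical document with three tables
sample_html = (
    "<table id='codes'><tr><td>M3A</td></tr></table>"
    "<table id='nav'><tr><td>links</td></tr></table>"
    "<table id='footer'><tr><td>notes</td></tr></table>"
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_table = sample_soup.find('table')     # first match only
all_tables = sample_soup.find_all('table')  # every match, in document order
missing = sample_soup.find('form')          # no match: find() returns None

print(first_table['id'])
print([t['id'] for t in all_tables])
print(missing)
```

Since `find()` returns `None` when nothing matches, a script that must fail gracefully should check the result before calling methods on it.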

Here we can see that the information for each postal code is contained in a separate `<tr>` tag. Each `<tr>` tag contains 3 `<td>` tags: the first holds the **postal code**, the second the **borough**, and the last the **neighborhood** name.

To make the task a bit easier, we will drop the first row, since it contains only the column headers, and store the rest of the table in a separate variable.

In [13]:
# calling a tag like a function, soup_object('name'), works the same as soup_object.find_all('name')
# it's just a shorthand notation
table_data = postal_code_table('tr')[1:] # keep everything after the header row

# print first 3 table rows
for row in table_data[:3]:
    print(row.prettify())

<tr>
 <td>
  M1A
 </td>
 <td>
  Not assigned
 </td>
 <td>
 </td>
</tr>

<tr>
 <td>
  M2A
 </td>
 <td>
  Not assigned
 </td>
 <td>
 </td>
</tr>

<tr>
 <td>
  M3A
 </td>
 <td>
  North York
 </td>
 <td>
  Parkwoods
 </td>
</tr>
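
The call shorthand and the header removal can be checked on a small table. This is only a sketch: the HTML is made up, and it assumes the header row uses `<th>` cells, which is common but has not been verified here for the Wikipedia page.

```python
from bs4 import BeautifulSoup

# hypothetical table: one header row followed by two data rows
sample_html = """
<table>
<tr><th>Postal Code</th><th>Borough</th></tr>
<tr><td>M1A</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# calling the tag is shorthand for find_all()
same = sample_table('tr') == sample_table.find_all('tr')

# option 1: drop the header by position
rows_by_slice = sample_table('tr')[1:]

# option 2: keep only the rows that contain <td> cells
rows_by_cells = [tr for tr in sample_table('tr') if tr.find('td') is not None]

print(same, len(rows_by_slice), len(rows_by_cells))
```

Slicing is shorter, while filtering on `<td>` does not depend on the header being the first row.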



Now that we have all the rows containing a postal code and its corresponding neighborhood, we can iterate over each row and get our desired data. We can access all 3 children, i.e. the `<td>` tags under each `<tr>`, through the `.contents` attribute.

But we have two problems here. First, the newlines (`'\n'`) between the tags are also children of the `<tr>` tag. Second, the text inside each `<td>` also ends with a newline (`'\n'`). So, for ease of further processing, we will remove all the newline children from each `<tr>` and also strip `'\n'` from each string.

In [14]:
print('The children before removing newlines:')
print(table_data[0].contents)
print('The String before:', repr(table_data[0].contents[1].string)) # repr() is used to see the newlines in the output

# remove all the newlines from <tr> children
for tr in table_data:
    tr.contents[:] = [item for item in tr.contents if not item=='\n']

# remove all the newlines from the strings under <td>
for tr in table_data:
    for item in tr.contents:
        if isinstance(item, bs4.element.Tag):
            item.string = item.string.strip('\n')

print('\nThe children after removing newlines:')
print(table_data[0].contents)
print('The String after:', repr(table_data[0].contents[0].string))

The children before removing newlines:
['\n', <td>M1A
</td>, '\n', <td>Not assigned
</td>, '\n', <td>
</td>]
The String before: 'M1A\n'

The children after removing newlines:
[<td>M1A</td>, <td>Not assigned</td>, <td></td>]
The String after: 'M1A'
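
As an alternative to editing the tree in place, the same cleaned values can be read directly with `find_all('td')` and `get_text(strip=True)`. The sketch below is not the approach used in this notebook. It runs on a made-up snippet shaped like the rows printed above and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# hypothetical rows shaped like the ones printed above
sample_html = """
<table>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td>
</tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td>
</tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, 'html.parser')('tr')

# find_all('td') skips the newline strings between cells,
# and get_text(strip=True) trims whitespace inside each cell
cleaned = [[td.get_text(strip=True) for td in tr('td')] for tr in sample_rows]

print(cleaned)
```

With this approach an empty cell yields an empty string, and the parsed tree is left unchanged.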


Now we have a clean list containing our desired data. We just need to extract the information and put it into a dataframe.

First, let's create an empty dataframe with just the column names.

In [15]:
column_names = ['PostalCode', 'Borough', 'Neighborhood']

neigh_df = pd.DataFrame(columns=column_names)

neigh_df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|


Now we will iterate over our refined `table_data` to extract the postal code, borough and neighborhood information. While extracting, we apply the following rules from the assignment requirements:

1.   Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.
2.   If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

The assignment also had a rule for rows that share the same postal code, but the Wikipedia page appears to have been updated since then: every row now has a distinct postal code, so that rule is no longer needed.

In [16]:
# collect the rows in a list first; DataFrame.append was removed in pandas 2.0,
# and building the dataframe once is faster than growing it row by row
rows = []

for tr in table_data:
    postal_code = tr.contents[0].string # first child is postal code
    borough = tr.contents[1].string # second child is borough
    neighborhood = tr.contents[2].string # third child is neighborhood

    # 1st condition
    if borough == 'Not assigned':
        continue

    # 2nd condition
    if neighborhood == 'Not assigned':
        neighborhood = borough

    # store the info for this row
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood})

# build the dataframe in one step
neigh_df = pd.DataFrame(rows, columns=column_names)

neigh_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
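
If the page ever lists the same postal code on more than one row again, the rows could be merged with a `groupby`. The sketch below uses a made-up three-row dataframe (`sample_df`), and joining the neighborhood names with a comma is an assumption about what the assignment's rule required.

```python
import pandas as pd

# hypothetical rows in which one postal code appears twice
sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# check whether any postal code is repeated
has_duplicates = sample_df['PostalCode'].duplicated().any()

# merge repeated postal codes into one row, joining the neighborhood names
combined = (
    sample_df
    .groupby(['PostalCode', 'Borough'], as_index=False, sort=False)['Neighborhood']
    .agg(', '.join)
)

print(has_duplicates)
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance instead of sorting them alphabetically.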


In [18]:
# check values in borough column
neigh_df['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Name: Borough, dtype: int64

Finally, let's get the number of rows in our dataframe using the `.shape` attribute.

In [17]:
print('Total number of rows:', neigh_df.shape[0])

Total number of rows: 103
