# Collecting Neighborhoods in Toronto

## 1. Collecting Neighborhoods

Let's write a web scraping script to collect Toronto neighborhood information from the table at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. We will collect three columns: PostalCode, Borough, and Neighborhood.

Importing the modules needed for web scraping:

In [48]:
import requests, bs4

# Download the webpage
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
res = requests.get(url)
res.raise_for_status()

In [49]:
# Create a BeautifulSoup object; naming the parser explicitly avoids bs4's "no parser specified" warning
toronto_soup = bs4.BeautifulSoup(res.text, 'html.parser')

In [50]:
# Select every table cell (<td>) nested inside a <div>
elements = toronto_soup.select('div table tbody tr td')

In [51]:
# Printing what we found
for i in range(0, len(elements), 3):
    print('{} | {} | {} | {}'.format(str(i//3+1), elements[i].getText(), elements[i+1].getText(), elements[i+2].getText()[:-1]))
    if elements[i].getText() == 'M9Z': # The last postal code in the table
        break

1 | M1A | Not assigned | Not assigned
2 | M2A | Not assigned | Not assigned
3 | M3A | North York | Parkwoods
4 | M4A | North York | Victoria Village
5 | M5A | Downtown Toronto | Harbourfront
6 | M5A | Downtown Toronto | Regent Park
7 | M6A | North York | Lawrence Heights
8 | M6A | North York | Lawrence Manor
9 | M7A | Queen's Park | Not assigned
10 | M8A | Not assigned | Not assigned
11 | M9A | Etobicoke | Islington Avenue
12 | M1B | Scarborough | Rouge
13 | M1B | Scarborough | Malvern
14 | M2B | Not assigned | Not assigned
15 | M3B | North York | Don Mills North
16 | M4B | East York | Woodbine Gardens
17 | M4B | East York | Parkview Hill
18 | M5B | Downtown Toronto | Ryerson
19 | M5B | Downtown Toronto | Garden District
20 | M6B | North York | Glencairn
21 | M7B | Not assigned | Not assigned
22 | M8B | Not assigned | Not assigned
23 | M9B | Etobicoke | Cloverdale
24 | M9B | Etobicoke | Islington
25 | M9B | Etobicoke | Martin Grove
26 | M9B | Etobicoke | Princess Gardens
27 | M9B | Eto
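
The loop above walks the flat list of cells three at a time. As a minimal sketch of the same grouping in reusable form, the helper below turns a flat list of cell strings into rows. It assumes every row has exactly three cells; `chunk_rows` and `sample_cells` are hypothetical names used only for illustration, and the sample values imitate the text of the scraped cells.

```python
def chunk_rows(cells, width=3):
    """Group a flat list of cell strings into rows of `width` cells each."""
    if len(cells) % width != 0:
        raise ValueError('cell count is not a multiple of the row width')
    return [
        # strip() removes surrounding whitespace such as a trailing newline
        tuple(cell.strip() for cell in cells[i:i + width])
        for i in range(0, len(cells), width)
    ]

# Hypothetical sample imitating the text of the scraped <td> cells
sample_cells = ['M1A', 'Not assigned', 'Not assigned\n',
                'M3A', 'North York', 'Parkwoods\n']
rows = chunk_rows(sample_cells)
print(rows)
```

Because the loop above has to stop explicitly at 'M9Z', the real list appears to hold cells from other tables as well, so it would need to be cut at the last postal-code row before being passed to a helper like this.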

## 2. Creating Toronto DataFrame

In [52]:
elements[864] 

<td>M9Z</td>

In the previous step we found 289 rows of data. The last postal code in the table is 'M9Z', and its index in the _elements_ list is 864 (row 289 starts at index 3 × 288 = 864).
Let's transform the list according to two assignment rules:
1. Ignore postal codes whose borough is 'Not assigned'.
2. If a neighborhood is 'Not assigned', set it to the name of its borough.

In [53]:
# Creating a new list of rows
lst = []
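for i in range(0, 865, 3): # 864 is the index of the last postal code, 'M9Z'
    # [:-1] drops the trailing newline from the last cell of each row
    postal_code, borough, neighborhood = elements[i].getText(), elements[i+1].getText(), elements[i+2].getText()[:-1]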
    if borough == 'Not assigned':
        continue
    if neighborhood == 'Not assigned':
        neighborhood = borough      
    lst.append([postal_code, borough, neighborhood])
lst[:10]

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", "Queen's Park"],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern']]

In [54]:
print('Now we have {} rows of relevant data.'.format(len(lst)))

Now we have 212 rows of relevant data.
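
The two rules can also be written as a small function that is easy to check on a handful of rows. This is only a sketch under the assumption that the rows are already available as (postal code, borough, neighborhood) triples; `apply_assignment_rules` and `NOT_ASSIGNED` are hypothetical names, and the sample rows are taken from the output shown earlier.

```python
NOT_ASSIGNED = 'Not assigned'

def apply_assignment_rules(rows):
    """Yield [postal_code, borough, neighborhood] with rules 1 and 2 applied."""
    for postal_code, borough, neighborhood in rows:
        if borough == NOT_ASSIGNED:
            continue  # rule 1: skip postal codes without a borough
        if neighborhood == NOT_ASSIGNED:
            neighborhood = borough  # rule 2: fall back to the borough name
        yield [postal_code, borough, neighborhood]

sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
cleaned = list(apply_assignment_rules(sample_rows))
print(cleaned)
```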


There is also a third rule: neighborhoods that share a postal code and borough should be combined into a single row. Because a row that has already been merged must not be processed again, we track merged rows in a row_to_skip list. We also build a dictionary that will be converted into a pandas DataFrame later.

In [55]:
# Creating a dictionary
toronto_data = {'PostalCode': [], 'Borough': [], 'Neighborhood': []}

# Apply rule 3 and populate the dictionary
row_to_skip = [False]*len(lst)
for i in range(len(lst)):
    if row_to_skip[i]:
        continue
    p, b, h = lst[i][0], lst[i][1], lst[i][2]
    for j in range(i+1, len(lst)): # start from i+1 because rows before i have already been processed
        if row_to_skip[j]:
            continue
        if lst[j][0] == p and lst[j][1] == b:
            h += ', {}'.format(lst[j][2])
            row_to_skip[j] = True
            
    toronto_data['PostalCode'].append(p)
    toronto_data['Borough'].append(b)
    toronto_data['Neighborhood'].append(h)
    
print('toronto_data dictionary has been created!')

toronto_data dictionary has been created!
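
The nested loop compares every row with every later row. As an alternative sketch, not the approach used in this notebook, the same rule can be applied in a single pass by grouping neighborhoods in a dictionary keyed by (postal code, borough). The function name `combine_neighborhoods` and the sample rows are hypothetical; the sketch relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7.

```python
def combine_neighborhoods(rows):
    """Merge rows that share a postal code and borough into one entry."""
    grouped = {}  # (postal_code, borough) -> list of neighborhood names
    for postal_code, borough, neighborhood in rows:
        grouped.setdefault((postal_code, borough), []).append(neighborhood)

    # Build the same column-oriented layout as toronto_data
    data = {'PostalCode': [], 'Borough': [], 'Neighborhood': []}
    for (postal_code, borough), names in grouped.items():
        data['PostalCode'].append(postal_code)
        data['Borough'].append(borough)
        data['Neighborhood'].append(', '.join(names))
    return data

# Hypothetical sample in the same shape as lst
sample = [
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
    ['M5A', 'Downtown Toronto', 'Regent Park'],
]
combined = combine_neighborhoods(sample)
print(combined['Neighborhood'])
```

Rows keep the order in which each postal code first appears, which matches the order produced by the nested loop.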


Now we are ready to achieve our first intermediate goal: the **Toronto DataFrame**.

In [56]:
# Importing the module
import pandas as pd

toronto_df = pd.DataFrame(data=toronto_data)
toronto_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [57]:
print('Toronto DataFrame has {} data rows.'.format(toronto_df.shape[0]))

Toronto DataFrame has 103 data rows.
