# Toronto Neighborhoods Analysis

### Part 1: Getting the names of neighborhoods and boroughs, and the postal code

The libraries will be imported as we need them, not at the beginning.

First, we need to get the list of neighborhoods in Toronto from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
wikipedia_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [30]:
!pip install bs4
!pip install requests
!pip install html5lib  # parser backend passed to BeautifulSoup below

from bs4 import BeautifulSoup  # parses HTML for web scraping
import requests  # downloads the web page

import pandas as pd
import numpy as np


In [21]:
# Get data from page
text_data = requests.get(wikipedia_url).text

# Process data with BeautifulSoup
soup = BeautifulSoup(text_data,"html5lib")

# Find all tables in the page
all_tables = soup.find_all('table')
print("There are {} tables in the page".format(len(all_tables)))

# Inspecting the page shows that the table we want contains the postal code 'M1A'
for index, table in enumerate(all_tables):
    if "M1A" in str(table):
        table_index = index
        break  # stop at the first matching table
print("The table we want is of index {}".format(table_index))

# Store that table in a variable and print it so we can see its structure
raw_table = all_tables[table_index]
print(raw_table.prettify())

There are 3 tables in the page
The table we want is of index 0
<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">
 <tbody>
  <tr>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M1A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M2A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top;">
    <p>
     <b>
      M3A
     </b>
     <br/>
     <span style="font-size:85%;">
      <a href="/wiki/North_York" title="North York">
       North York
      </a>
      <br/>
      (
      <a href="/wiki/Parkwoods" title="Parkwoods">
       Parkwoods
      </a>
      )
     </span>
    </p>
   </td>
   <td style="width:11

Looking at the table, we can see that each data point is a cell (`<td>`), not a row.

It follows this structure:

```html
<td style="width:11%; vertical-align:top;">
    <p>
     <b>
      M5A # This is the Postal Code
     </b>
     <br/>
     <span style="font-size:85%;">
      <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
       Downtown Toronto # The first value is the Borough
      </a>
      <br/>
      (
      <a href="/wiki/Regent_Park" title="Regent Park">
       Regent Park # The following values are the neighborhoods
      </a>
      /
      <a href="/wiki/Harbourfront,_Toronto" title="Harbourfront, Toronto">
       Harbourfront # This is also a neighborhood
      </a>
      )
     </span>
    </p>
   </td>
```
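
To make the mapping from one cell to one record concrete, here is a minimal sketch that works on the plain text of a cell rather than on the live page. It assumes the text of the `<span>` arrives as `Borough(Neighborhood / Neighborhood)`, which is what the structure above suggests. The helper name `split_cell_text` is hypothetical and is not part of the notebook.

```python
def split_cell_text(span_text):
    """Split 'Borough(Neigh A / Neigh B)' into (borough, 'Neigh A, Neigh B').

    Hypothetical helper for illustration. It assumes the borough comes first
    and the neighborhoods sit inside one pair of parentheses, separated by '/'.
    """
    # Everything before the first "(" is the borough
    borough, _, rest = span_text.partition("(")
    # Drop the closing ")" and split the neighborhoods on "/"
    names = [name.strip() for name in rest.rstrip(")").split("/")]
    # Ignore empty pieces, e.g. when there is no "(...)" part at all
    neighborhoods = ", ".join(name for name in names if name)
    return borough.strip(), neighborhoods


print(split_cell_text("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_cell_text("North York(Parkwoods)"))
```

Using `str.partition` avoids an `IndexError` when a cell has no parentheses, which a plain `split('(')[1]` would raise.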

Now, we will create a list to store our data.

In [54]:
# Create empty list
table_contents=[]

# Iterate through table cells ("td")
for row in raw_table.find_all('td'):

    # Create empty dictionary
    cell = {}

    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] # Get the first 3 characters of the text in each cell

        cell['Borough'] = (row.span.text).split('(')[0] # Get everything that is before the "("

        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ') # Get what is after the "(", drop the ")", and replace slashes with commas

        table_contents.append(cell) # Append cell to contents

# print(table_contents)

# Transform list into a dataframe 
toronto_df=pd.DataFrame(table_contents)

# Clean up borough names where the scraped text ran several labels together
toronto_df['Borough'] = toronto_df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})

print("Shape is {}".format(toronto_df.shape))
toronto_df.head(5)

Shape is (103, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


Print the shape again just to be sure.

In [57]:
print("Shape of Toronto neighborhood dataframe is {}".format(toronto_df.shape))

Shape of Toronto neighborhood dataframe is (103, 3)
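
Beyond the shape, a few cheap checks can catch parsing slips. The sketch below runs on a small hand-written sample in the same format as `table_contents`. The function name `check_records` and the pattern (the letter `M`, one digit, one uppercase letter) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: "M", one digit, one uppercase letter (e.g. "M5A")
POSTAL_CODE_PATTERN = re.compile(r"M\d[A-Z]")


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    counts = Counter(record["PostalCode"] for record in records)
    for record in records:
        code = record["PostalCode"]
        if not POSTAL_CODE_PATTERN.fullmatch(code):
            problems.append("bad postal code: {!r}".format(code))
        if not record["Borough"].strip():
            problems.append("empty borough for {}".format(code))
    for code, count in counts.items():
        if count > 1:
            problems.append("duplicate postal code: {}".format(code))
    return problems


sample = [
    {"PostalCode": "M3A", "Borough": "North York", "Neighborhood": "Parkwoods"},
    {"PostalCode": "M4A", "Borough": "North York", "Neighborhood": "Victoria Village"},
]
print(check_records(sample))  # an empty list means no problems were found
```

With the real data, the call would be `check_records(table_contents)`.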


### Part 2: Getting latitude and longitude for each postal code

First, we install and import pgeocode.

In [68]:
!pip install pgeocode
import pgeocode  # postal code geocoding based on GeoNames data

Collecting pgeocode
  Downloading pgeocode-0.3.0-py3-none-any.whl (8.5 kB)
Installing collected packages: pgeocode
Successfully installed pgeocode-0.3.0


Now we get the latitude and longitude of each postal code.

In [84]:

# Convert postal codes to a list
postal_codes = toronto_df['PostalCode'].tolist()

# Define the geolocator
geolocator = pgeocode.Nominatim('ca')

# Create empty lists for lat and long
latitudes = []
longitudes = []

# Go through the postal codes and get the latitude and longitude
for postal_code in postal_codes:

    # Look up this postal code (returns a pandas Series)
    g = geolocator.query_postal_code(postal_code)

    # The Series is never empty: unknown codes come back with NaN fields,
    # so check the latitude itself and store NaN when the lookup fails
    if pd.notna(g.latitude):
        latitudes.append(g.latitude)
        longitudes.append(g.longitude)
    else:
        latitudes.append(np.nan)
        longitudes.append(np.nan)
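
One detail worth spelling out is how a failed lookup shows up. In the output further down, an unknown code comes back with `NaN` fields rather than as an empty result, so a missing latitude has to be detected with a NaN check. The sketch below illustrates that check without calling pgeocode: `fake_lookup` is a hypothetical stand-in dictionary and `coordinates_or_none` is an illustrative helper.

```python
import math

# Hypothetical stand-in for a geocoder: postal code -> (latitude, longitude).
# Unknown codes map to NaN, mirroring how missing values appear in this notebook.
fake_lookup = {
    "M3A": (43.7545, -79.33),
    "M7R": (float("nan"), float("nan")),
}


def coordinates_or_none(postal_code):
    """Return (lat, lon), or None when either value is missing (NaN)."""
    nan = float("nan")
    lat, lon = fake_lookup.get(postal_code, (nan, nan))
    if math.isnan(lat) or math.isnan(lon):
        return None
    return lat, lon


for code in ["M3A", "M7R"]:
    print(code, coordinates_or_none(code))
```

Comparing with `== float("nan")` would not work here, because NaN is not equal to itself.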

Add the latitudes and longitudes we just got to the dataframe of neighborhoods and boroughs.

In [101]:
toronto_df_latlong = toronto_df.copy()  # copy so the original dataframe is left unchanged
toronto_df_latlong['Latitude'] = latitudes
toronto_df_latlong['Longitude'] = longitudes
toronto_df_latlong.head(5)

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.6641 | -79.3889 |


One of the postal codes (row 76) has no latitude or longitude. Let's investigate it.

In [102]:
toronto_df_latlong.iloc[[76]]

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 76 | M7R | Mississauga | Enclave of L4W | NaN | NaN |
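
Row 76 does not appear in `head(5)`, so it helps to locate missing coordinates programmatically rather than by eye. A minimal sketch on a small sample frame follows. The sample values are copied from the tables in this notebook, and `sample_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M7R"],
    "Latitude": [43.7545, 43.7276, np.nan],
    "Longitude": [-79.33, -79.3148, np.nan],
})

# Boolean mask: True where either coordinate is missing
missing = sample_df["Latitude"].isna() | sample_df["Longitude"].isna()

print(sample_df[missing])
print("Rows with missing coordinates:", sample_df.index[missing].tolist())
```

On the real dataframe the same mask would be built from `toronto_df_latlong`.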


In [110]:
g = geolocator.query_postal_code('M7R')
g

postal_code       M7R
country_code      NaN
place_name        NaN
state_name        NaN
state_code        NaN
county_name       NaN
county_code       NaN
community_name    NaN
community_code    NaN
latitude          NaN
longitude         NaN
accuracy          NaN
Name: 0, dtype: object

The lookup indeed returned nothing for this postal code. Let's delete the row, since it won't affect the exercise.

In [119]:
# drop=True discards the old index instead of adding it as an extra "index" column
toronto_df_latlong_fixed = toronto_df_latlong.drop([76]).reset_index(drop=True)
toronto_df_latlong_fixed.head(5)

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.6641 | -79.3889 |
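
Dropping by a hard-coded index works here, but it breaks if the row order changes. An alternative sketch, on an illustrative sample frame rather than the real data, drops every row that lacks coordinates and renumbers the index without keeping the old one:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "PostalCode": ["M3A", "M7R", "M4A"],
    "Latitude": [43.7545, np.nan, 43.7276],
    "Longitude": [-79.33, np.nan, -79.3148],
})

# Drop rows where Latitude or Longitude is NaN, then renumber from 0.
# drop=True discards the old index instead of adding it as a column.
cleaned = (
    sample_df.dropna(subset=["Latitude", "Longitude"])
    .reset_index(drop=True)
)
print(cleaned)
```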

