# **Web Scraping and Preprocessing Toronto Neighbourhoods**
### Source: <a href="https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto">https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto</a>

### Install the required packages: bs4 (which installs beautifulsoup4 and soupsieve)


In [1]:
!pip install bs4
#!pip install requests



### Import the required modules and functions


In [2]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us to download a web page
import pandas as pd

<h2 id="BSO">Beautiful Soup Objects</h2>
Beautiful Soup is a Python library for pulling data out of HTML and XML files; here we focus on HTML files.
It represents the HTML as a set of objects whose methods are used to parse the document.
We can navigate the HTML as a tree and filter out the elements we are looking for.
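
As a minimal sketch of the two ideas above (navigating the tree and filtering it), the cell below parses a small, made-up HTML snippet rather than the Wikipedia page. The names `html`, `demo_soup`, `heading`, and `cells` are illustrative assumptions, and the built-in `"html.parser"` is used only so that the example needs no extra parser package.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (not taken from the Wikipedia page).
html = """
<html><body>
  <h1>Neighbourhoods</h1>
  <table>
    <tr><th>Name</th><th>Borough</th></tr>
    <tr><td>Alderwood</td><td>Etobicoke</td></tr>
    <tr><td>Annex</td><td>Old City of Toronto</td></tr>
  </table>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser package is needed here.
demo_soup = BeautifulSoup(html, "html.parser")

# Navigate the tree: attribute access returns the first matching descendant tag.
heading = demo_soup.body.h1.text

# Filter the tree: find_all returns every matching tag in document order.
cells = [td.text for td in demo_soup.find_all("td")]

print(heading)
print(cells)
```

Attribute access such as `demo_soup.body.h1` is convenient for unique elements, while `find_all` is the better fit when many tags of the same kind are expected, as with the tables on the target page.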

In [3]:
# Store the web page URL in the variable url
url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto"
# Use requests.get to download the page and store its contents as text in the variable data
data = requests.get(url).text
# Create a BeautifulSoup object from data; it represents the document as a nested data structure.
# The "html5lib" parser requires the html5lib package to be installed.
soup = BeautifulSoup(data, "html5lib")
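
The download step above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below; the helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # The timeout stops the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the download line could be written as `data = fetch_html(url)`.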

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of neighbourhoods in Toronto - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"00479fd6-2c0f-4542-a3f3-f4c0c78d2f68","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_neighbourhoods_in_Toronto","wgTitle":"List of neighbourhoods in Toronto","wgCurRevisionId":1030286615,"wgRevisionId":1030286615,"wgArticleId":1150939,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Short description i

In [5]:
# Find all HTML tables in the web page
all_tables = soup.find_all('table')
# Check how many tables were found by taking the length of the list
len(all_tables)


16

In [6]:
# The target table has a column named "City-designated neighbourhood"; find that table's index
table_index = None  # stays None if no table matches
for index, table in enumerate(all_tables):
    if "City-designated neighbourhood" in str(table):
        table_index = index
print(table_index)

10
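
The loop above matches on the full text of each table and keeps the last match. As an alternative sketch, the search can compare only the header cells and stop at the first match. The HTML here is made up, and `demo_tables`, `target_header`, and `demo_index` are illustrative names.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second has the header we want.
html = """
<table><tr><th>Other</th></tr></table>
<table><tr><th>City-designated neighbourhood</th></tr></table>
"""
demo_tables = BeautifulSoup(html, "html.parser").find_all("table")

target_header = "City-designated neighbourhood"

# next() yields the index of the first table whose <th> text matches exactly,
# or None when no table matches.
demo_index = next(
    (
        i
        for i, t in enumerate(demo_tables)
        if any(th.get_text(strip=True) == target_header for th in t.find_all("th"))
    ),
    None,
)
print(demo_index)
```

Comparing header cells avoids false positives when the phrase also appears in a caption or a body cell of another table.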


In [7]:
int_table = all_tables[table_index]
print(int_table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr bgcolor="lightblue">
   <th width="5%">
    CDN number
   </th>
   <th width="20%">
    City-designated neighbourhood
   </th>
   <th width="10%">
    Former city/borough
   </th>
   <th width="50%">
    Neighbourhoods covered
   </th>
   <th width="15%">
    Map
   </th>
  </tr>
  <tr>
   <td>
    129
   </td>
   <td>
    Agincourt North
   </td>
   <td>
    Scarborough
   </td>
   <td>
    <a href="/wiki/Agincourt,_Toronto" title="Agincourt, Toronto">
     Agincourt
    </a>
    and Brimwood
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
    128
   </td>
   <td>
    Agincourt South-Malvern West
   </td>
   <td>
    Scarborough
   </td>
   <td>
    <a href="/wiki/Agincourt,_Toronto" title="Agincourt, Toronto">
     Agincourt
    </a>
    and
    <a href="/wiki/Malvern,_Toronto" title="Malvern, Toronto">
     Malvern
    </a>
   </td>
   <td>
    <a class="image" href="/wiki/File:Malvern_to_locator.gif">
     <img alt="Malvern to loca

# Hint code for reference
```python
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
```

In [8]:
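# Collect one dict per table row, then build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0).
rows = []

for row in int_table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row contains only <th> cells, so it is skipped
        postcode = col[0].text  # CDN number, stored under the "PostalCode" column
        neigh = col[1].text     # City-designated neighbourhood
        borough = col[2].text   # Former city/borough
        rows.append({"PostalCode": postcode, "Borough": borough, "Neighborhood": neigh})

df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
df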

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | 129\n | Scarborough\n | Agincourt North\n |
| 1 | 128\n | Scarborough\n | Agincourt South-Malvern West\n |
| 2 | 20\n | Etobicoke\n | Alderwood\n |
| 3 | 95\n | Old City of Toronto\n | Annex\n |
| 4 | 42\n | North York\n | Banbury-Don Mills\n |
| ... | ... | ... | ... |
| 135 | 94\n | Old City of Toronto\n | Wychwood\n |
| 136 | 100\n | Old City of Toronto\n | Yonge and Eglinton\n |
| 137 | 97\n | Old City of Toronto\n | Yonge-St.Clair\n |
| 138 | 27\n | North York\n | York University Heights\n |
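
The trailing `\n` in every value comes from reading each cell with `.text`. One way to avoid it at extraction time is `get_text(strip=True)`, sketched below on a made-up fragment that mimics two rows of the table. The names `demo_table`, `records`, and `demo_df` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A made-up fragment that mimics the structure of the target table,
# including the newline that follows each cell value.
html = """
<table><tbody>
  <tr><th>CDN number</th><th>City-designated neighbourhood</th><th>Former city/borough</th></tr>
  <tr><td>129\n</td><td>Agincourt North\n</td><td>Scarborough\n</td></tr>
  <tr><td>20\n</td><td>Alderwood\n</td><td>Etobicoke\n</td></tr>
</tbody></table>
"""
demo_table = BeautifulSoup(html, "html.parser").table

records = []
for tr in demo_table.find_all("tr"):
    # get_text(strip=True) removes leading and trailing whitespace from the cell text.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        records.append(
            {"PostalCode": cells[0], "Borough": cells[2], "Neighborhood": cells[1]}
        )

demo_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(demo_df)
```

Stripping during extraction makes a later clean-up pass over the whole DataFrame unnecessary.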


In [9]:
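# Borough name fixes carried over from the hint code. None of these keys appear to occur
# in this table's Borough values, so this step likely leaves df unchanged here.
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})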

In [10]:
# Remove the newline characters left over from the HTML cell text
df = df.replace('\n', '', regex=True)
df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | 129 | Scarborough | Agincourt North |
| 1 | 128 | Scarborough | Agincourt South-Malvern West |
| 2 | 20 | Etobicoke | Alderwood |
| 3 | 95 | Old City of Toronto | Annex |
| 4 | 42 | North York | Banbury-Don Mills |
| ... | ... | ... | ... |
| 135 | 94 | Old City of Toronto | Wychwood |
| 136 | 100 | Old City of Toronto | Yonge and Eglinton |
| 137 | 97 | Old City of Toronto | Yonge-St.Clair |
| 138 | 27 | North York | York University Heights |
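
The regex replacement removes every `\n` but leaves other stray whitespace in place. As a rough comparison on made-up values (the leading space in `" Etobicoke\n"` is an assumption added only to show the difference), stripping each string column handles both cases. `demo`, `replaced`, and `stripped` are illustrative names.

```python
import pandas as pd

# Made-up values; the leading space is there only to show the difference.
demo = pd.DataFrame(
    {
        "PostalCode": ["129\n", "20\n"],
        "Borough": ["Scarborough\n", " Etobicoke\n"],
    }
)

# Approach used in the notebook: delete newline characters anywhere in the strings.
replaced = demo.replace("\n", "", regex=True)

# Alternative: strip leading and trailing whitespace from every string column.
stripped = demo.apply(lambda col: col.str.strip())

print(replaced["Borough"].tolist())
print(stripped["Borough"].tolist())
```

For this table either approach should give the same result, since the only extra whitespace shown in the output is the trailing newline.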


In [11]:
df.groupby('PostalCode')['Neighborhood'].nunique()

PostalCode
1      1
10     1
100    1
101    1
102    1
      ..
95     1
96     1
97     1
98     1
99     1
Name: Neighborhood, Length: 140, dtype: int64
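
The index above runs 1, 10, 100, 101, ... because `PostalCode` holds strings, which sort lexicographically. A small sketch on made-up rows shows how converting the column to integers restores numeric order, and how `is_unique` can confirm a one-to-one mapping directly. The neighbourhood names `"A"` to `"D"` are placeholders, not real data.

```python
import pandas as pd

# Made-up rows; the neighbourhood names are placeholders.
demo = pd.DataFrame(
    {
        "PostalCode": ["100", "95", "10", "1"],
        "Neighborhood": ["A", "B", "C", "D"],
    }
)

# Grouping on strings sorts the keys lexicographically.
string_order = list(demo.groupby("PostalCode")["Neighborhood"].nunique().index)

# Converting to integers makes the group keys sort numerically.
demo["PostalCode"] = demo["PostalCode"].astype(int)
numeric_order = list(demo.groupby("PostalCode")["Neighborhood"].nunique().index)

# A direct check that each code and each neighbourhood appears exactly once.
one_to_one = demo["PostalCode"].is_unique and demo["Neighborhood"].is_unique

print(string_order)
print(numeric_order)
print(one_to_one)
```

The conversion assumes every value is a clean digit string, which holds only after the newline characters have been removed.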

In [22]:
df.shape

(140, 3)