## Toronto Neighborhood Classification

This notebook scrapes Wikipedia for the postal codes, boroughs and neighborhoods of Toronto.
The scraped data is first stored in a comma-separated values (csv) file and then loaded into a Pandas Dataframe that is cleaned of NaN records and checked for duplicate entries.

### Installing necessary packages
For geocoding, mapping, file download and webpage parsing, we install the geopy, geocoder, folium, beautifulsoup4, lxml, html5lib, wget and requests libraries.

In [1]:
import sys
!{sys.executable} -m pip install geopy
!{sys.executable} -m pip install beautifulsoup4
!{sys.executable} -m pip install lxml
!{sys.executable} -m pip install html5lib
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install geocoder
!{sys.executable} -m pip install folium
!{sys.executable} -m pip install wget



### Importing the necessary packages
After installing the geopy, geocoder, folium, beautifulsoup4, lxml, html5lib, wget and requests libraries, we import them into the notebook. We also import the csv and pandas libraries.


In [1]:
import geopy
import geocoder
from bs4 import BeautifulSoup
import lxml
import html5lib
import requests
import csv
import pandas as pd
import folium
import wget

### Creation of the csv file
Using Python's built-in "open" function together with the csv library, we create a csv file called "toronto.csv" whose column titles are "Postal code", "Borough" and "Neighborhood".

In [2]:
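# newline='' lets the csv module manage line endings (avoids blank rows on Windows)
toronto_csv = open('toronto.csv', 'w', newline='')
toro_csv_w = csv.writer(toronto_csv)
# Header row; the cell output is the number of characters written
toro_csv_w.writerow(['Postal code', 'Borough', 'Neighborhood'])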


34
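As a complement to the cell above, here is a minimal sketch of the same header-plus-rows write using a `with` block, which closes the file automatically even if an error occurs. The file name "toronto_demo.csv", the temporary directory and the two sample rows are assumptions made for illustration only; they are not part of the notebook's workflow.

```python
import csv
import os
import tempfile

# Hypothetical demo rows; the notebook fills the real rows from the Wikipedia table.
rows = [
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Write to a temporary directory so the sketch does not overwrite toronto.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "toronto_demo.csv")

# newline="" lets the csv module control line endings itself.
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Postal code", "Borough", "Neighborhood"])
    writer.writerows(rows)

# Read the file back to confirm the header and both rows were written.
with open(demo_path, newline="", encoding="utf-8") as handle:
    written = list(csv.reader(handle))

print(written)
```

With this pattern there is no separate `close()` call to forget, at the cost of keeping all writes inside one block.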

### Parsing the webpage and the resulting markup file

Here, we use the "requests" library to fetch, as text, the Wikipedia page listing the postal codes and neighborhoods of Toronto. We assign the result to a variable called "origin".

We then use the "BeautifulSoup" class with the "lxml" parser to turn that text into a navigable parse tree. The result is assigned to a variable called "or_text".

A close look at the parsed page reveals that the information we are looking for sits in the first "table" tag. The "table" tag has children tagged "tr" (the rows), which in turn have children tagged "td" (the cells).

The information we are looking for is in those "td" cells.

We then use a for loop to write the postal code, borough and neighborhood of each row to our "toronto.csv" file.

In [3]:
origin = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
or_text = BeautifulSoup(origin, 'lxml')

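for tt in or_text.table.tbody.find_all('tr'):
    temp = []
    for st in tt.find_all('td'):
        entry_text = st.text
        # Drop the trailing newline at the end of the cell text, if any
        if entry_text.endswith('\n'):
            entry_text = entry_text[:-1]
        temp.append(entry_text)

    # Rows without 'td' cells yield an empty row, which pd.read_csv skips by default
    toro_csv_w.writerow(temp)

# Close the file so that all rows are flushed to disk
toronto_csv.close()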
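To make the "table" / "tr" / "td" traversal easier to follow without a network call, the sketch below runs the same idea on a tiny inline table. The HTML snippet is a hypothetical miniature of the Wikipedia table, and it uses Python's built-in "html.parser" backend and `get_text(strip=True)` instead of the notebook's "lxml" parser and manual newline removal; both are substitutions made for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, inlined so no network is needed.
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
  <tr><td>M4A\n</td><td>North York\n</td><td>Victoria Village\n</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in soup.table.find_all("tr"):
    # get_text(strip=True) removes surrounding whitespace, including newlines.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # The header row uses <th>, so it yields no <td> cells and is skipped.
    if cells:
        rows.append(cells)

print(rows)
```

Skipping empty rows explicitly keeps blank lines out of the output, whereas the notebook relies on pandas ignoring them when the csv file is read.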

### Writing data into Pandas Dataframe


In the cell below, we load the content of the toronto.csv file into a pandas dataframe, which we name "df".

As part of the cleanup, we drop the NaN records and then list any rows whose postal code is duplicated; an empty result means there are no duplicate entries.

In [4]:
df = pd.read_csv('toronto.csv')
df.dropna(inplace = True)
df.reset_index(drop = True, inplace = True)
df_dupli = df[df.duplicated(['Postal code'])]
df_dupli.head()

Unnamed: 0,Postal code,Borough,Neighborhood
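The cleanup steps can be tried on a small frame where the outcome is easy to check by eye. This is only a sketch: the four rows are invented for illustration, and the `drop_duplicates` call is an extra step that the notebook itself does not need, since its duplicate check comes back empty.

```python
import pandas as pd

# Hypothetical rows: one record with missing values and one repeated postal code.
demo = pd.DataFrame(
    {
        "Postal code": ["M1A", "M3A", "M4A", "M4A"],
        "Borough": [None, "North York", "North York", "North York"],
        "Neighborhood": [None, "Parkwoods", "Victoria Village", "Victoria Village"],
    }
)

clean = (
    demo.dropna()                               # drop rows with any missing value
        .drop_duplicates(subset="Postal code")  # keep the first row per postal code
        .reset_index(drop=True)                 # renumber the rows from 0
)

# Same check as in the notebook: an empty frame means no repeated postal codes.
remaining_duplicates = clean[clean.duplicated(["Postal code"])]

print(clean)
print(remaining_duplicates.empty)
```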


### The first fifteen rows of the final Pandas dataframe

We check the status of the "df" dataframe by taking a look at the first 15 rows.

In [5]:
df.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


### A Look at the Shape of the Dataframe

Using the "shape" attribute of the pandas dataframe, we can see that "df" has 103 rows and three columns, not counting the index.

In [6]:
df.shape

(103, 3)

### Getting the latitude and longitude of the postal codes

Because the "df" dataframe lacks latitude and longitude data for the postal codes, we obtain it from a csv file hosted at 'https://cocl.us/Geospatial_data'. Using the "wget" library, we download that file; the name of the local copy is stored in the variable "file_scv".

In [7]:
file_scv = wget.download('https://cocl.us/Geospatial_data')

We then load the data into the pandas dataframe "geo_pd" and check the first 11 rows.

In [8]:
geo_pd = pd.read_csv(file_scv)
geo_pd.head(11)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


### Updating the "df" dataframe with data on latitude and longitude

Upon checking the "geo_pd" dataframe, we see that its postal code entries are not in the same order as those of "df".

Therefore, we create the list "post_list", which holds the postal codes of "geo_pd", and use it to look up the matching rows of "df".

Using a for loop, we create the "Latitude" and "Longitude" columns of "df" and fill them with the corresponding values from "geo_pd".

In [9]:
post_list = list(geo_pd['Postal Code'])

for i, post in enumerate(post_list):
    # Rows of df whose postal code matches the i-th row of geo_pd
    ind = df.index[df['Postal code'] == post]
    df.loc[ind, 'Latitude'] = geo_pd.loc[i, 'Latitude']
    df.loc[ind, 'Longitude'] = geo_pd.loc[i, 'Longitude']

df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
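An alternative to filling the columns row by row is a join on the postal code. The sketch below shows the idea with `DataFrame.merge` on two-row extracts built from values shown in the outputs above; the names "df_demo", "geo_demo" and "merged" are hypothetical, and the notebook itself uses the loop.

```python
import pandas as pd

# Hypothetical two-row extracts of the two frames, using values shown above.
df_demo = pd.DataFrame(
    {
        "Postal code": ["M3A", "M1B"],
        "Borough": ["North York", "Scarborough"],
        "Neighborhood": ["Parkwoods", "Malvern / Rouge"],
    }
)
geo_demo = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M3A"],  # deliberately in a different order
        "Latitude": [43.806686, 43.753259],
        "Longitude": [-79.194353, -79.329656],
    }
)

# Align the key column name, then left-join so df_demo keeps its row order.
merged = df_demo.merge(
    geo_demo.rename(columns={"Postal Code": "Postal code"}),
    on="Postal code",
    how="left",
)

print(merged)
```

A left join keeps every row of the left frame, so a postal code missing from the coordinate file would show up as NaN rather than silently disappear.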


### Creating the Toronto Map
To create the Toronto map, showing the geographical location of all postal codes, we scrape the internet for the latitude and longitude of Toronto, around which the map is centered.

In [10]:
lat_long = requests.get('https://tools.wmflabs.org/geohack/geohack.php?pagename=Toronto&params=43_44_30_N_79_22_24_W_region:CA-ON_type:city').text
beau = BeautifulSoup(lat_long, 'lxml')

### Accessing the Latitude and Longitude of Toronto

In [11]:
beau_b = beau.find('body', class_= 'mediawiki skin-modern')
# Both coordinates sit in the first cell of the same table, so locate it once
coord_cell = beau_b.find('div', class_= 'toccolours plainlinks').table.tbody.td
Latitude = float(coord_cell.find('span', class_='latitude').text)
Longitude = float(coord_cell.find('span', class_='longitude').text)
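The URL requested above already encodes the coordinates in degrees, minutes and seconds ("43_44_30_N_79_22_24_W"), so the scraped values can be sanity-checked by hand. The helper below is a hypothetical function written for this check, not part of any library used in the notebook; it applies the usual conversion degrees + minutes/60 + seconds/3600, with a negative sign for the southern and western hemispheres.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# Coordinates as encoded in the URL parameters: 43_44_30_N_79_22_24_W
latitude_check = dms_to_decimal(43, 44, 30, "N")
longitude_check = dms_to_decimal(79, 22, 24, "W")

print(round(latitude_check, 6), round(longitude_check, 6))
```

If the scraped "Latitude" and "Longitude" differ noticeably from these values, the page layout has probably changed and the selectors need revisiting.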

### Toronto Map

By using the "folium" library, we create the Toronto Map.

Each circle in the map is at the geographical location of the respective postal code.

Clicking on any circle displays the corresponding postal code and borough.

In [12]:
toronto_map = folium.Map(location=[Latitude, Longitude], zoom_start=10.5) # generate map centred on Toronto


for lati, longi, code, boro in zip(df.Latitude, df.Longitude, df['Postal code'], df['Borough']):
    folium.CircleMarker([lati, longi], 
                        radius=12, 
                        color="#007849", 
                        popup = f"Borough: {boro} Postal Code: {code}",
                        fill = True, 
                        fill_color='blue', 
                        fill_opacity=0.6).add_to(toronto_map)


# display map
toronto_map