# Week 3 Assignment

## Part 1

### Web Scraping
The first step is to scrape the Canadian postal codes for each Borough and Neighborhood from the Wikipedia page.

In [2]:
import requests
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_wikipedia_page=requests.get(wikipedia_link)
#print(raw_wikipedia_page.text)

By examining the page HTML we can identify the tree path of our target.
We use BeautifulSoup to parse the HTML page.

In [10]:
#!conda install -c conda-forge beautifulsoup4 --yes

In [3]:
from bs4 import BeautifulSoup

page = BeautifulSoup(raw_wikipedia_page.text, 'html.parser')
#print(page.prettify())

First we find the table in our HTML, then we extract all of its rows. From each row we take the **Postcode**, **Borough** and **Neighborhood**.

Note that we need to skip the first row of our table, the header, and strip the **\n** characters from our values.

In [4]:
import pandas as pd

# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighborhood']

# locate the table and collect its rows
table = page.find('table', {'class': 'wikitable'})
rows = table.find_all('tr')

# build one record per row, skipping the header row
records = []
for row in rows[1:]:
    data = row.find_all('td')
    records.append({
        'Postcode': data[0].get_text().replace('\n', ''),
        'Borough': data[1].get_text().replace('\n', ''),
        'Neighborhood': data[2].get_text().replace('\n', '')
    })

# instantiate the dataframe in one step; DataFrame.append was removed in pandas 2.0
neighborhoods_df = pd.DataFrame(records, columns=column_names)

In [5]:
neighborhoods_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
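
To see the row-extraction logic in isolation, the sketch below runs it on a small hand-written HTML table rather than the live Wikipedia page. The `sample_html` string and the `parse_rows` helper are illustrative assumptions, not part of the assignment, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_rows(html_text):
    """Return one dict per data row of the first 'wikitable' in html_text."""
    soup = BeautifulSoup(html_text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable'})
    records = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) < 3:
            # Header rows use <th>, so they have no <td> cells and are skipped.
            continue
        postcode, borough, neighborhood = (
            cell.get_text().strip() for cell in cells[:3]
        )
        records.append({
            'Postcode': postcode,
            'Borough': borough,
            'Neighborhood': neighborhood,
        })
    return records


sample_records = parse_rows(sample_html)
print(sample_records)
```

Checking the number of `<td>` cells, instead of always dropping the first row, is one way to stay robust if the table gains extra header or footer rows.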


The next step is to drop the rows whose Borough is **Not assigned** and, where a Neighborhood is **Not assigned**, to use the Borough name as the Neighborhood.

In [6]:
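# keep only rows with an assigned Borough; .copy() avoids writing to a view
neighborhoods_df = neighborhoods_df[neighborhoods_df['Borough'] != 'Not assigned'].copy()
# fall back to the Borough name where the Neighborhood is not assigned
unassigned = neighborhoods_df['Neighborhood'] == 'Not assigned'
neighborhoods_df.loc[unassigned, 'Neighborhood'] = neighborhoods_df.loc[unassigned, 'Borough']
neighborhoods_df.head(10)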

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
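
The same fallback can also be written without indexed assignment. The sketch below uses `Series.mask` on a small hand-made sample; the `sample_df` and `cleaned_df` names are illustrative assumptions.

```python
import pandas as pd

# Small hand-made sample mirroring rows from the scraped table.
sample_df = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with a real Borough; .copy() gives an independent frame.
cleaned_df = sample_df[sample_df['Borough'] != 'Not assigned'].copy()

# mask() replaces values where the condition is True with the aligned
# value from the second argument, here the Borough of the same row.
missing = cleaned_df['Neighborhood'] == 'Not assigned'
cleaned_df['Neighborhood'] = cleaned_df['Neighborhood'].mask(missing, cleaned_df['Borough'])

print(cleaned_df)
```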


Finally, we aggregate by **Postcode** and **Borough**, joining all the corresponding **Neighborhood** values into a single comma-separated string.

In [7]:
new_df = neighborhoods_df.groupby(['Postcode','Borough'],as_index=False).agg(lambda x: ','.join(x))
new_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
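
As a small worked example of the aggregation, the sketch below groups three hand-made rows. It names the aggregated column explicitly, so any extra columns added later would not be passed to `join`; the sample data and the `grouped_df` name are assumptions for illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# groupby sorts the group keys by default, so M4A comes before M5A;
# within each group the original row order is preserved.
grouped_df = (
    sample_df
    .groupby(['Postcode', 'Borough'], as_index=False)
    .agg({'Neighborhood': ','.join})
)

print(grouped_df)
```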


## Part 2

### Geocoding

The second part of the assignment consists of geocoding our postal codes, that is, translating each code into a Latitude and Longitude.
Since the geocoder library did not work reliably in my case, I imported the geolocation CSV with postal codes directly.
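
A lookup loop that retries until coordinates come back never terminates if the service keeps returning nothing. The sketch below shows one way to bound the retries. The `geocode_with_retry` helper and the `lookup` callable are hypothetical stand-ins for whichever geocoding service is used, and the stub exists only to exercise the retry path.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None.
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before trying again
    return None  # give up instead of looping forever


# Stub that fails twice, then succeeds.
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]


result = geocode_with_retry(flaky_lookup, 'M1B, Scarborough, Canada')
print(result)
```

Returning `None` after the last attempt lets the caller decide whether to skip the postal code or fall back to another data source, such as a CSV file.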

In [8]:
#!conda install -c conda-forge geocoder --yes
#import geocoder # import geocoder

In [1]:
# Disabled: the geocoder lookup did not return results reliably.
# for index, row in new_df.iterrows():
#     # initialize your variable to None
#     lat_lng_coords = None
#
#     # loop until you get the coordinates
#     while lat_lng_coords is None:
#         g = geocoder.google('{}, {}, Canada'.format(row['Postcode'], row['Borough']))
#         lat_lng_coords = g.latlng
#
#     latitude = lat_lng_coords[0]
#     longitude = lat_lng_coords[1]
#     new_df.at[index, 'Latitude'] = latitude
#     new_df.at[index, 'Longitude'] = longitude
#
# new_df.head()

Next we download the geocoded data and read it into a DataFrame.

In [9]:
!wget -q -O 'geo.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [10]:
geocodes = pd.read_csv('geo.csv')
geocodes.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now that we have our geocoded data, we can join it with our dataframe to obtain the Latitude and Longitude of each postal code.

In [11]:
geocoded_df = new_df.join(geocodes.set_index('Postal Code'), on='Postcode')
geocoded_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
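
A join on postal codes silently leaves `NaN` coordinates for any code missing from the CSV. The sketch below shows one way to surface such rows, using `DataFrame.merge` with `validate` and `indicator` on hand-made data; the `M9Z` code and the frame names are made up for illustration.

```python
import pandas as pd

codes_df = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Example'],
})
coords_df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged_df = codes_df.merge(
    coords_df,
    how='left',               # keep every postal code from the left frame
    left_on='Postcode',
    right_on='Postal Code',
    validate='one_to_one',    # raise if a code is duplicated on either side
    indicator=True,           # add a '_merge' column describing each row's match
)

# Rows marked 'left_only' found no coordinates in coords_df.
unmatched = merged_df.loc[merged_df['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```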
