# Scrape Toronto's Postal Codes from Wikipedia

## Section 1 - Scraping

### 1. Import the required libraries and set the Wikipedia URL to scrape

In [2]:
import requests
import lxml.html as lh
import pandas as pd

wikipedia_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### 2. Fetch the page, parse the response into an HTML document, and select the table using XPath

In [79]:
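response = requests.get(wikipedia_url)
response.raise_for_status()  # fail early on an HTTP error status

# Parse the raw bytes into an lxml HTML tree and select the postal code table by its class
doc = lh.fromstring(response.content)
table = doc.xpath("//table[@class='wikitable sortable']")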

### 3. Convert the table into a pandas DataFrame

Note that the table on the Wikipedia page has no `id` attribute, so the loop includes a small workaround.
It stops at the first row whose first cell is empty, which indicates that it has moved past the postal code rows into a later table that we don't need.

In [98]:
# './/tr' is relative to the selected table; a bare '//tr' matches every row in the document
rows = table[0].xpath('.//tr')
columns = []
records = []

for i, r in enumerate(rows):
    cells = [c.text_content().strip() for c in r]
    if i == 0:
        columns = cells[:3]  # header row: Postal Code, Borough, Neighbourhood
        continue
    if not cells or cells[0] == '':
        break  # guard: an empty first cell means we are past the postal code rows
    # Store 'Not assigned' boroughs and neighbourhoods as missing values
    records.append([cells[0]] + [c if c != 'Not assigned' else None for c in cells[1:3]])

# Build the DataFrame in one step; DataFrame.append was removed in pandas 2.0
toronto_df = pd.DataFrame(records, columns=columns)

toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,,
176,M6Z,,
177,M7Z,,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
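
To see why the row selection needs care, here is a minimal sketch on a made-up HTML fragment. The second table and its `navbox` class are assumptions for illustration, not the real page markup, and `to_record` is a hypothetical helper. In lxml, an XPath that starts with `//` is evaluated from the document root even when it is called on an element, while `.//` stays inside that element.

```python
import lxml.html as lh

# Hypothetical fragment: one postal code table followed by an unrelated table
html = """
<html><body>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td></td><td>unrelated</td><td>navigation</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(html)
table = doc.xpath("//table[@class='wikitable sortable']")[0]

all_rows = table.xpath('//tr')   # absolute path: every <tr> in the document
own_rows = table.xpath('.//tr')  # relative path: only rows inside this table


def to_record(row):
    """Turn one <tr> into [code, borough, neighbourhood], mapping 'Not assigned' to None."""
    cells = [c.text_content().strip() for c in row]
    return [cells[0]] + [c if c != 'Not assigned' else None for c in cells[1:3]]


header = [c.text_content().strip() for c in own_rows[0]]
records = [to_record(r) for r in own_rows[1:]]

print(len(all_rows), len(own_rows))
print(header)
print(records)
```

With the relative path, the rows of the unrelated table never enter the loop, so the empty-first-cell check only acts as a safety net.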


## Section 2 - Geocoder

### 1. Download the geospatial data and load it into a DataFrame

In [100]:
!wget -O geospatial_data.csv https://cocl.us/Geospatial_data

--2020-12-24 11:01:59--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.63.96.194, 169.63.96.176
Connecting to cocl.us (cocl.us)|169.63.96.194|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-24 11:02:00--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-24 11:02:00--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following

In [102]:
toronto_geo_df = pd.read_csv('geospatial_data.csv')
toronto_geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
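
Before joining, it can help to confirm that the loaded file has the expected shape. The following is a rough sketch, not part of the original notebook: `check_geo_frame` is a hypothetical helper, and the inline CSV reuses the first rows shown above in place of the downloaded file.

```python
import io

import pandas as pd

# Inline sample standing in for geospatial_data.csv (first rows of the output above)
sample_csv = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""


def check_geo_frame(df):
    """Raise ValueError if the frame is not usable as a lookup table for a join."""
    expected = {'Postal Code', 'Latitude', 'Longitude'}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f'missing columns: {sorted(missing)}')
    if df['Postal Code'].duplicated().any():
        # A duplicated key would produce extra rows in a left join
        raise ValueError('duplicate postal codes found')
    return df


geo = check_geo_frame(pd.read_csv(io.StringIO(sample_csv)))
print(geo.shape)
```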


### 2. Cross-reference the geospatial data with the postal code data and clean up the result

In [122]:
complete_toronto_df = toronto_df.copy()
complete_toronto_df = complete_toronto_df.join(toronto_geo_df.set_index('Postal Code'), on='Postal Code', how='left')
# Drop rows that lack coordinates or a neighbourhood.
# dropna detects None/NaN; an elementwise comparison with `!= None` would not filter them.
complete_toronto_df = complete_toronto_df.dropna(subset=['Latitude', 'Neighbourhood'])
complete_toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
165,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
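
The cross-reference and clean-up can be illustrated on a tiny example. This is a minimal sketch using three postal codes and the coordinates shown in the output above; the frame names `codes` and `geo` are assumptions for illustration. A left join keeps every postal code, leaves `NaN` coordinates where the lookup table has no match, and `dropna` with `notna`-style semantics then removes the incomplete rows.

```python
import pandas as pd

# Scraped postal codes: M1A is unassigned, so its borough and neighbourhood are missing
codes = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': [None, 'North York', 'North York'],
    'Neighbourhood': [None, 'Parkwoods', 'Victoria Village'],
})

# Coordinate lookup table: only assigned postal codes appear here
geo = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Left join on the postal code; M1A has no match and gets NaN coordinates
joined = codes.join(geo.set_index('Postal Code'), on='Postal Code', how='left')

# notna() marks the rows that actually have a neighbourhood
has_neighbourhood = joined['Neighbourhood'].notna()

# Keep only rows with both coordinates and a neighbourhood
cleaned = joined.dropna(subset=['Latitude', 'Neighbourhood'])

print(has_neighbourhood.tolist())
print(cleaned)
```

Note that the original row labels survive the clean-up, which is why the index in the output above has gaps; calling `reset_index(drop=True)` would renumber the rows if that is preferred.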


## Section 3 - Segmenting and Clustering

In [None]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

print('Folium and Geopy are installed')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | 