### Part 1 - A: Notebook Setup
Start a new notebook for this assignment and import the basic libraries.

In [2]:
#Import pandas & numpy
import pandas as pd
import numpy as np

### Part 1 - B: Scrape Wiki
Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform that data into a pandas dataframe.

In [4]:
#Install Beautiful Soup & Requests
!pip install beautifulsoup4
!pip install requests

from bs4 import BeautifulSoup
import requests



In [94]:
#Collect HTML from Wiki page and create a soup object
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(source.text, 'lxml')

### Part 1 - C: Set Up Dataframe

Once the data is scraped, it must be cleaned and organised into a dataframe.

In [95]:
#Set up table
data = []
table = soup.find(class_='wikitable')
for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th', 'td']):
        section.append(td.text.rstrip())
    if (index == 0):
        columns = section
    else:
        data.append(section)
        
#Convert to dataframe
toronto_df1 = pd.DataFrame(data = data, columns = ['PostalCode', 'Borough', 'Neighbourhood'])
toronto_df1.head()

,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
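
The row-by-row extraction above can be tried offline on a small table. The following is a minimal sketch, assuming a made-up three-row HTML snippet (`sample_html`) in place of the live Wikipedia page and the built-in `html.parser` instead of `lxml`; the `sample_*` names are illustrative only.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the Wikipedia table, used only for illustration
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# html.parser ships with Python, so no extra parser needs to be installed
sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = sample_soup.find("table", class_="wikitable").find_all("tr")

# First row holds the header cells, the remaining rows hold the data cells
header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
records = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[1:]
]

# Use the notebook's column names rather than the scraped header
sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighbourhood"])
print(sample_df)
```

`get_text(strip=True)` removes the trailing newline that the table cells carry, which is what `td.text.rstrip()` does in the cell above.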


In [96]:
#Data clean up, start with df info
toronto_df1.info()
toronto_df1.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
PostalCode       287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


(287, 3)

In [97]:
#Remove rows where the Borough is 'Not assigned'
toronto_df1.drop(toronto_df1[toronto_df1['Borough']=="Not assigned"].index, axis=0, inplace=True)
toronto_df1.info()
toronto_df1.shape
toronto_df1.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 2 to 285
Data columns (total 3 columns):
PostalCode       210 non-null object
Borough          210 non-null object
Neighbourhood    210 non-null object
dtypes: object(3)
memory usage: 6.6+ KB


,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [98]:
#Reindex
toronto_df2 = toronto_df1.reset_index()
toronto_df2.info()
toronto_df2.shape
toronto_df2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 4 columns):
index            210 non-null int64
PostalCode       210 non-null object
Borough          210 non-null object
Neighbourhood    210 non-null object
dtypes: int64(1), object(3)
memory usage: 6.6+ KB


,index,PostalCode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M6A,North York,Lawrence Heights
4,6,M6A,North York,Lawrence Manor
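
The drop and reindex steps can also be combined with a boolean mask. This is a minimal sketch on a made-up three-row dataframe (`sample` and `assigned` are illustrative names, not part of the notebook); passing `drop=True` to `reset_index` discards the old index instead of keeping it as the extra `index` column seen above.

```python
import pandas as pd

# Illustrative rows only; the real dataframe comes from the scraped table
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber from 0
# drop=True prevents the old index from being added as a column
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```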


In [99]:
#Several neighbourhoods can share one postal code, so group by PostalCode and join their names
toronto_df2 = toronto_df1.groupby("PostalCode").agg(lambda x:','.join(set(x)))
toronto_df2.info()
toronto_df2.shape
toronto_df2.head()

<class 'pandas.core.frame.DataFrame'>
Index: 103 entries, M1B to M9W
Data columns (total 2 columns):
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(2)
memory usage: 2.4+ KB


PostalCode,Borough,Neighbourhood
M1B,Scarborough,"Malvern,Rouge"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Morningside,West Hill,Guildwood"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
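
Because `set` does not keep insertion order, the joined neighbourhood names can come out in a different order from one run to the next. The sketch below shows one possible alternative, assuming a small made-up dataframe and a hypothetical helper `join_unique`, that removes duplicates while keeping the order in which values first appear.

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Harbourfront"],
})

def join_unique(values):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ",".join(dict.fromkeys(values))

# as_index=False keeps PostalCode as a regular column, so no reindex is needed
grouped = sample.groupby("PostalCode", as_index=False).agg(
    {"Borough": join_unique, "Neighbourhood": join_unique}
)
print(grouped)
```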


In [83]:
#Where the neighbourhood is 'Not assigned', use the borough name instead
toronto_df2.loc[toronto_df2['Neighbourhood']=="Not assigned",'Neighbourhood']=toronto_df2.loc[toronto_df2['Neighbourhood']=="Not assigned",'Borough']
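
The same replacement can be written with `Series.mask`, which avoids repeating the condition on both sides of the assignment. This is a minimal sketch on made-up rows; `sample` and `missing` are illustrative names.

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M1G"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Woburn"],
})

# Where the neighbourhood is missing, take the value from the Borough column
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])
print(sample)
```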

In [89]:
#Reindex
toronto_df3 = toronto_df2.reset_index()
toronto_df3.info()
toronto_df3.shape
toronto_df3.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,West Hill,Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [91]:
#Remove duplicate Boroughs within each postal code; split on commas only so names such as "North York" keep their spaces
toronto_df3['Borough'] = toronto_df3['Borough'].str.split(',').apply(lambda parts: ','.join(sorted(set(parts))))
toronto_df3.info()
toronto_df3.shape
toronto_df3.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,West Hill,Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Cleaning and grouping has reduced the original 287 rows to 103, one row per postal code.

### Part 2: Add GeoData
Using the provided geospatial data, add the Latitude and Longitude to the dataframe.

In [103]:
#Import the GeoData
geo_data=pd.read_csv("https://cocl.us/Geospatial_data")
geo_data.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join the two tables into a single dataframe.

In [174]:
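#Join the data to the original dataframe and inspect details of updated dataframe
#Note: the columns are assigned by row position, so both dataframes must list the postal codes in the same order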
toronto_df3['Latitude'] = geo_data['Latitude'].values
toronto_df3['Longitude'] = geo_data['Longitude'].values
toronto_df3.info()
toronto_df3.shape
toronto_df3.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
PostalCode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
Latitude         103 non-null float64
Longitude        103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.1+ KB


,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside,West Hill,Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
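
Assigning the coordinate columns by position only works while both tables are sorted identically. A key-based join does not depend on row order. The following is a minimal sketch, assuming two small made-up frames (`codes` and `coords`) whose rows are deliberately in different orders; the column names match those shown above.

```python
import pandas as pd

# Illustrative rows only, deliberately in a different order in each frame
codes = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Highland Creek,Rouge Hill,Port Union", "Malvern,Rouge"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Align the key name, then join on it; validate raises if a code is duplicated
merged = codes.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With `how="left"`, any postal code that has no match in the coordinate table keeps its row and gets `NaN` coordinates, which makes missing matches easy to spot with `merged.isna().sum()`.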


### Part 3: Visualize the Data

In [111]:
#Install libraries & packages
!conda install -c conda-forge folium=0.5.0 --yes
import folium

!conda install -c conda-forge geopy --yes


In [182]:
#Find center of Toronto 
c_lat = toronto_df3.Latitude.mean()
c_long = toronto_df3.Longitude.mean()
print("center_lat", c_lat)
print("center_long", c_long)

center_lat 43.70460773398059
center_long -79.39715291165048


In [189]:
#Create the map, centred on the mean latitude and longitude
toronto_map = folium.Map(location = [c_lat, c_long], zoom_start = 10)

In [None]:
#Add one circle marker per postal code, with a popup showing its neighbourhoods and borough
for lat, lng, borough, neighborhood in zip(toronto_df3['Latitude'], toronto_df3['Longitude'], toronto_df3['Borough'], toronto_df3['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7
        ).add_to(toronto_map) 
toronto_map