# Week 3 Assignment

Importing the necessary modules for the exercise.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import HTML  # public import path for the HTML display helper
!pip install geopy
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
from geopy.geocoders import Nominatim

## Part 1 

#### Creating a new dataframe with Postal Code, Borough and Neighbourhood information

Make a request to the Wikipedia page using the `requests.get()` method, and store the text of the page in a local variable.

In [None]:
req = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page = req.text

Use BeautifulSoup with Python's built-in `html.parser` to parse the page content.

In [None]:
soup = BeautifulSoup(page, 'html.parser')

Find all the tables in the page. There might be many tables and we are interested in only one of the tables.

In [None]:
can_table = soup.find_all("table")
print(len(can_table))

In [None]:
HTML(str(can_table[0]))

We see that there are three tables on the page, and the first of those appears to be the one we want.

Let us look at the table we extracted to see what other HTML elements need to be filtered out.

In [None]:
can_table[0]

We see that the column names are enclosed within `<th>` tags, and the values of the different rows and columns (the names of the Boroughs, Neighbourhoods and the postal codes) are enclosed within `<td>` tags.

Let us filter these one by one.

In [None]:
columns = [c for c in can_table[0].find_all('th')]
columns

Let us write a lambda function to remove newline characters (each one is replaced with an empty string).

In [None]:
rm_newline = lambda s: s.replace("\n", "")

In [None]:
columns = [rm_newline(c.get_text()) for c in columns]
columns
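
As a small illustration of what this cleaning step does, the sketch below applies the same `str.replace` call to a made-up header string. It also shows `str.strip`, which is a possible alternative and not what this notebook uses; the sample text is an assumption about the shape of the scraped values.

```python
# Made-up header text, similar in shape to what get_text() may return for a <th> cell.
raw_header = "Postal Code\n"

# Same operation as rm_newline: delete every newline character.
removed = raw_header.replace("\n", "")

# Alternative (not used in the notebook): trim whitespace at both ends only.
stripped = raw_header.strip()

print(repr(removed), repr(stripped))
```

For a string with only a trailing newline both approaches give the same result; they differ when a newline sits in the middle of the text, which `strip` would leave in place.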

Let us do the same for the row values (which are in `<td>` tags).

In [None]:
rows = [r for r in can_table[0].find_all('td')]
rows

Let us remove the newline characters from these values as well.

In [None]:
rows = [rm_newline(r.get_text()) for r in rows]
rows

Now we have a flattened list of all the cell contents in the table, and we want to organize them under the columns. The cells of each row appear consecutively, one per column, so we need to split the list into groups of three (three = number of columns).

In [None]:
n_cols = len(columns)  # group size follows the header instead of a hard-coded 3
rows = list(zip(*[rows[i::n_cols] for i in range(n_cols)]))
rows
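
To make the grouping idiom easier to follow, here is a minimal sketch on a toy list. The cell values and the names `flat_cells` and `grouped` are illustrative assumptions, not the scraped data.

```python
# Toy data standing in for the flattened <td> contents (illustrative values).
flat_cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]
n_cols = 3  # number of columns in the table header

# Slice i takes every n_cols-th cell starting at offset i, i.e. one whole column.
column_slices = [flat_cells[i::n_cols] for i in range(n_cols)]

# zip(*...) transposes the column lists back into one tuple per row.
grouped = list(zip(*column_slices))
print(grouped)
```

Note that `zip` stops at the shortest column, so a trailing incomplete row would be dropped silently rather than raising an error.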

We now have all the ingredients in the form of Python lists. We can use pandas to convert these lists into a dataframe.

In [None]:
df = pd.DataFrame(rows, columns=columns)
df.head()

Now we have a dataframe to work with, and we can use pandas to filter it further. First, let us drop all the rows in which Borough is "Not assigned".

In [None]:
not_assigned_indexes = df[df['Borough']=='Not assigned'].index
df.drop(not_assigned_indexes, inplace=True)
df.head()
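
The same filtering can also be written as a boolean mask, which returns a new dataframe instead of modifying one in place. The sketch below uses a toy dataframe with illustrative values; `toy` and `assigned` are names introduced only for this example.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough is assigned, then renumber the index.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```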

Now, let us merge the rows that share a postal code into a single row. We group by postal code, keep the first Borough value of each group, and combine the group's neighbourhoods into one comma-separated string (using `', '.join`).

In [None]:
df = df.groupby(df['Postal Code']).agg({'Borough': 'first', 'Neighbourhood' : ', '.join}).reset_index()

In [None]:
df.head()
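
To see what this aggregation does on a small scale, consider the following sketch. It assumes a toy dataframe in which one postal code appears twice; the rows are made up for illustration.

```python
import pandas as pd

# Toy rows: "M5A" appears twice with different neighbourhoods (illustrative values).
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Queen's Park"],
})

# One row per postal code: first Borough, neighbourhoods joined with ', '.
merged = (
    toy.groupby("Postal Code")
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
       .reset_index()
)
print(merged)
```

The two "M5A" rows collapse into one row whose Neighbourhood value is a single string, not a Python list.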

How many rows are there?

In [None]:
df.shape

So, our dataset has 103 rows and 3 columns.

## Part 2

#### Gathering geo-spatial data and augmenting our dataframe with this geo-spatial data

The geospatial coordinates CSV file has already been downloaded. Let us read it into a pandas dataframe for use in the next step.

In [None]:
can_geo_data = pd.read_csv('Geospatial_Coordinates.csv')
can_geo_data.head()

Let us merge this dataframe containing geo-spatial data with the earlier dataframe, joining on the common field "Postal Code".

In [None]:
# Both frames share the key column name, so a single `on` is enough
can_geo_df = pd.merge(df, can_geo_data, on='Postal Code', how='right')
can_geo_df
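
Because the merge uses `how='right'`, every row of the coordinates table is kept, even if its postal code has no match in the first dataframe. The sketch below, on made-up data with placeholder coordinates, uses the `indicator=True` option of `pd.merge` to show how such unmatched rows could be spotted. Whether any exist in the real files is not something this example can tell.

```python
import pandas as pd

# Toy frames; "M9Z" and all coordinate values are placeholders for illustration.
left = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
})
right = pd.DataFrame({
    "Postal Code": ["M5A", "M7A", "M9Z"],
    "Latitude": [43.65, 43.66, 43.70],
    "Longitude": [-79.36, -79.39, -79.50],
})

# indicator=True adds a "_merge" column recording where each row came from.
joined = pd.merge(left, right, on="Postal Code", how="right", indicator=True)

# Rows marked "right_only" have coordinates but a missing (NaN) Borough.
unmatched = joined[joined["_merge"] == "right_only"]
print(unmatched)
```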

Let's save this dataframe as a CSV file in case it is needed for further analysis.

In [None]:
can_geo_df.to_csv('Canada_GeoSpatial.csv')

## Part 3

#### Cluster the neighbourhoods in the Toronto boroughs and plot them on a map

Let us keep only the rows that have the string "Toronto" in the Borough name.

In [None]:
# na=False treats a missing Borough as a non-match, so the mask stays boolean
toronto_df = can_geo_df[can_geo_df['Borough'].str.contains("Toronto", na=False)]
toronto_df.head()
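
The substring filter can be tried on a small Series first. This sketch assumes made-up borough names with one missing value, and shows how the `na=False` argument of `str.contains` keeps the resulting mask purely boolean.

```python
import pandas as pd

# Illustrative borough names with one missing value.
boroughs = pd.Series(["Downtown Toronto", "North York", None, "East Toronto"])

# na=False makes missing entries count as "no match" instead of NaN.
mask = boroughs.str.contains("Toronto", na=False)

print(boroughs[mask].tolist())
```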

Let us get the geo-spatial coordinates for Toronto using geopy's Nominatim

In [None]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_lat = location.latitude
toronto_long = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_lat, toronto_long))
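
geopy's `geocode` returns `None` when it finds no match, in which case reading `.latitude` would fail. The following is a minimal sketch of one way to guard against that. The helper `coords_or_default` is hypothetical, and the stand-in objects and numbers are placeholders, not real geocoder results.

```python
from types import SimpleNamespace


def coords_or_default(location, default):
    """Return (latitude, longitude) from a geocoder result, or `default` if it is None."""
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stand-in for a successful lookup; the values are placeholders.
found = SimpleNamespace(latitude=43.65, longitude=-79.38)
fallback = (0.0, 0.0)

print(coords_or_default(found, fallback))
print(coords_or_default(None, fallback))
```

In the notebook, the result of `geolocator.geocode(address)` would take the place of `found`.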

Let us now use Folium to get a map of Toronto, using the geo-spatial coordinates identified above.

In [None]:
map_toronto = folium.Map(location=[toronto_lat, toronto_long], zoom_start=11)
map_toronto

Now, let us loop through the rows of our dataframe. For each row we take the coordinates, the postal code, and the names of the borough and neighbourhoods, and add a marker for that point on the map.

In [None]:
for lat, lng, pincode, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], 
                                                    toronto_df['Postal Code'], toronto_df['Borough'], 
                                                    toronto_df['Neighbourhood']):
    label = '{}({}): [{}]'.format(borough.upper(), pincode, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
map_toronto

The popup label of each marker is formatted as follows:
* It shows the Borough name in uppercase
* It shows the Postal Code within parentheses (next to the Borough name)
* It shows all the neighbourhoods within square brackets (as a comma-separated list)
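
The label format described in this list can be checked in isolation. The helper name `make_label` and the sample values below are assumptions made for this sketch; the format string is the one used in the loop.

```python
def make_label(borough, postal_code, neighbourhoods):
    """Build the popup text: BOROUGH(postal code): [neighbourhoods]."""
    return "{}({}): [{}]".format(borough.upper(), postal_code, neighbourhoods)


# Sample values for illustration only.
example = make_label("Downtown Toronto", "M5A", "Regent Park, Harbourfront")
print(example)
```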