<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="400" align="center"></a>

<h1 align="center"><font size="5">Data Science Capstone Assignment</font></h1>

In this notebook I build the code to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The project is linked to this [GitHub repo](https://github.com/krishnatejaperannagari/Coursera_Capstone).

# Part 1 (10 marks)
### Guidelines for creating the dataframe

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
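
The cleaning rules above can also be sketched with pandas operations on a small in-memory table. This is a minimal illustration, not the scraping code used in this notebook: the sample rows are made up for the example (only the M5A pair comes from the guidelines), and the names `raw`, `assigned`, and `clean` are hypothetical.

```python
import pandas as pd

# Illustrative rows only: the M5A pair mirrors the example in the guidelines,
# the other rows are invented to exercise each rule.
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep only rows that have an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# A "Not assigned" neighborhood falls back to the borough name.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# One row per postal code, with the neighborhoods joined by ", ".
clean = (
    assigned.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)

print(clean)
print(clean.shape)
```

With these four sample rows, the result has one row for M5A and one for M7A, so `clean.shape` is `(2, 3)`.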

In [1]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import urllib.request

# Download the Wikipedia page that lists the postal codes
wiki_page = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Parse the HTML with the lxml parser
parse_tree = BeautifulSoup(wiki_page, "lxml")

# Locate the postal code table by its inline style attribute
table = parse_tree.find("table", style="width:100%; border-collapse:collapse; border:1px solid #ccc;")

### Extracting information from tables

In [2]:
loc_df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])

i = 0  # next free row label in loc_df

for row in table.findAll('tr'):
    cells = row.findAll('td')
    for cell in cells:

        PostalCode = cell.find('b').find(text=True)
        links = cell.findAll('a')

        # Ignore cells that do not have an assigned borough
        if len(links) > 0 and links[0].find(text=True) is not None:
            Borough = links[0].find(text=True)
            Neighbourhood = ', '.join(str(neigh.find(text=True)) for neigh in links[1:])

            # Use the borough as the neighborhood when no neighborhood is assigned
            if Neighbourhood == '':
                Neighbourhood = Borough

            # Combine rows when a postal code appears more than once
            rep_neigh = loc_df[loc_df['PostalCode'] == PostalCode].index
            if len(rep_neigh) > 0:
                print("Repeated value at " + str(rep_neigh[0]))
                # A single .loc call avoids chained assignment; ', ' keeps the names separated
                loc_df.loc[rep_neigh[0], 'Neighborhood'] += ', ' + Neighbourhood
            else:
                loc_df.loc[i] = [PostalCode, Borough, Neighbourhood]
                i += 1

# Sort by postal code in ascending order
loc_df.sort_values(["PostalCode"], axis=0, ascending=True, inplace=True)

loc_df

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
32,M1J,Scarborough,Scarborough Village
38,M1K,Scarborough,"Kennedy Park, Ionview, Birchmount Park"
44,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
51,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village"
57,M1N,Scarborough,"Birch Cliff, Cliffside"


In [3]:
loc_df.shape

(101, 3)

# Part 2 (2 marks)

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The drawback of this package is that a lookup can fail intermittently: a call for the latitude and longitude of a given postal code may return None, and the same call made again may return the coordinates. So, to make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
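
The retry idea described above can be sketched as a small helper. This is an illustration under assumptions, not the Geocoder package's API: `get_coordinates` and `flaky_lookup` are hypothetical names, and `lookup` stands for any function that returns a `(latitude, longitude)` pair or `None`. The sketch uses a bounded number of attempts instead of an open-ended while loop, so a postal code that never resolves cannot hang the notebook; when the helper returns `None`, the CSV file is the fallback.

```python
import time


def get_coordinates(postal_code, lookup, max_attempts=5, pause_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning a (lat, lng) pair or None.
    """
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # optional pause before the next attempt
    return None  # the caller can fall back to the CSV file


# Stand-in for an unreliable geocoder: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)  # M1B coordinates as listed in the CSV file


result = get_coordinates("M1B", flaky_lookup)
print(result, "after", calls["count"], "calls")
```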


In [4]:
geo_data = pd.read_csv('https://cocl.us/Geospatial_data')
geo_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [5]:
geo_cor_df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude' ])
for index in loc_df.index:
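    # Find the coordinates row whose postal code matches this row of loc_df
    coord = geo_data.loc[geo_data['Postal Code'] == loc_df["PostalCode"][index]]
    # Read the coordinates by column name rather than by position
    geo_cor_df.loc[index] = [loc_df["PostalCode"][index], loc_df["Borough"][index], loc_df["Neighborhood"][index], coord.iloc[0]['Latitude'], coord.iloc[0]['Longitude']]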
geo_cor_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
22,M1G,Scarborough,Woburn,43.770992,-79.216917
26,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
32,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
38,M1K,Scarborough,"Kennedy Park, Ionview, Birchmount Park",43.727929,-79.262029
44,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
51,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village",43.716316,-79.239476
57,M1N,Scarborough,"Birch Cliff, Cliffside",43.692657,-79.264848
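
The row-by-row lookup above can also be written as a single join. A minimal sketch, assuming the column names shown in the outputs (`PostalCode` in the scraped dataframe, `Postal Code` in the CSV); `loc_sample` and `geo_sample` are hypothetical stand-ins that reuse two rows from those outputs.

```python
import pandas as pd

# Small stand-ins for loc_df and geo_data, built from rows shown in the outputs.
loc_sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})
geo_sample = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# A left join keeps every row of loc_sample; validate raises an error
# if a postal code appears more than once on either side.
merged = loc_sample.merge(
    geo_sample,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

Note that `merge` returns a fresh 0-based index, whereas the loop keeps the index labels of `loc_df`.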


# Part 3 (3 marks)

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure: 
1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together. 

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

### Extracting the boroughs that contain the word Toronto

In [6]:
# Case-insensitive match on the borough name
tor_bor = geo_cor_df[geo_cor_df['Borough'].str.contains("toronto", case=False)]
tor_bor

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
71,M4R,North Toronto,North Toronto,43.715383,-79.405678
89,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
94,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
97,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
20,M5E,Downtown Toronto,Downtown Toronto,43.644771,-79.373306
24,M5G,Downtown Toronto,Bay Street,43.657952,-79.387383
30,M5H,Downtown Toronto,"Richmond, King",43.650571,-79.384568


### Create map of Toronto using latitude and longitude values extracted above

In [7]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab


from geopy.geocoders import Nominatim
import folium

address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [8]:
map_toronto

### Adding the markers to the map

In [9]:
# Add one circle marker per postal code, labelled with its neighborhood and borough
for lat, lng, borough, neighborhood in zip(tor_bor['Latitude'], tor_bor['Longitude'], tor_bor['Borough'], tor_bor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

# The End