# Applied data science capstone

This notebook is part of the Coursera course for Applied data science capstone (week 3).

The notebook is organized in two parts:

Part 1 was presented in the previous notebook and consists of the following steps:

1. Import the necessary modules.
2. Get the data provided in the lab section.
3. Check the data.
4. Cleaning the data I (if the value of column "Borough" is "Not assigned", the row is dropped).
5. Checking the dataframe (verifying that every row without an assigned value in column "Borough" was excluded).
6. Cleaning the data II (if the value of "Neighbourhood" is "Not assigned", it is set to the value of "Borough").
7. Checking the dataframe (verifying that postal code "M7A", Queen's Park, was updated).
8. Joining the neighbourhoods that share the same postal code.
9. Checking the shape of the dataframe.

Part 2 is the main goal of this notebook and consists of the following steps:

1. Import the extra modules for Part 2.
2. Getting the coordinates using geocoder (since the package is unreliable, step 2 was not performed; steps 3 to 5 import the information instead).
3. Retrieving the geodata using the url provided in the lab.
4. Using the geo dataframe to create two support lists to complete the first dataframe.
5. Creating the new columns (latitude and longitude) and updating the dataframe with the extra information.


1 - First, let's import the modules

In [10]:
import pandas as pd
from pandas.io.html import read_html
import numpy as np

2 - Get the data from the url provided in the lab and assign it to a variable

In [2]:
url_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Searching in url for wikitables
table = read_html(url_page,  attrs={"class":"wikitable"})

#Check how many tables were imported
print ("Extracted {num} table from url".format(num=len(table)))

Extracted 1 table from url


3 - Since only one table was imported, the data can be accessed through "table[0]"

In [7]:
table[0].head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


4 - Cleaning the rows that don't have an assigned Borough

In [3]:
#Identify the indexes where Borough is "Not assigned"
indexNames = table[0][ table[0]['Borough'] == 'Not assigned' ].index

# Delete these row indexes from dataFrame
table[0].drop(indexNames , inplace=True)
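
The same cleaning step can also be written as a boolean filter that keeps the rows with an assigned borough, instead of dropping rows by index. The sketch below is only an illustration: `sample` is a small stand-in built from three rows of the `head()` output above, not the full table.

```python
import pandas as pd

# Stand-in for table[0], built from three rows shown in the head() output
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Keep only the rows whose Borough is assigned, then renumber the index
cleaned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```

Filtering returns a new frame, so the original stays untouched until the result is assigned back.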

5 - Checking the table to verify that all the rows that met the criteria were excluded

In [4]:
indexNames = table[0][ table[0]['Borough'] == 'Not assigned' ].index
len(indexNames)

0

Since no rows remain with "Not assigned" in the "Borough" column, the column is clean.

6 - If the neighbourhood is not assigned, the neighbourhood will be equal to the Borough

In [5]:
#First, identify all indexes of neighbourhoods that don't have an assigned value
indexNames = table[0][ table[0]['Neighbourhood'] == 'Not assigned' ].index

#Assigning the value of "Borough" to "Neighbourhood" when "Neighbourhood" = "Not assigned"
#(.loc writes to the frame itself; chained indexing may write to a temporary copy)
for i in indexNames:
    table[0].loc[i, 'Neighbourhood'] = table[0].loc[i, 'Borough']
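
The row-by-row loop can also be expressed as a single vectorized assignment. This is a minimal sketch on a two-row stand-in frame (`sample` is illustrative), using `Series.mask` to replace values where a condition holds.

```python
import pandas as pd

# Stand-in rows: one with a missing neighbourhood, one complete
sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where Neighbourhood is "Not assigned", take the value from Borough instead
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])
print(sample)
```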

7 - Verifying that M7A "Queen's Park" was updated, since it was the only row with a Borough that was missing the Neighbourhood

In [6]:
table[0].loc[table[0]['Postcode'] == 'M7A']

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park


8 - Joining the neighbourhoods with the same postal code area

In [7]:
#Grouping the information by Postcode and joining the neighbourhoods that share a postal code, separated by a comma
table[0] = table[0].groupby(['Postcode', 'Borough'],as_index=False)['Neighbourhood'].agg(lambda x:', '.join(x))
table[0]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
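
To see what the grouping does in isolation, here is a small sketch on a three-row stand-in frame (`sample` is illustrative and assumes one row per neighbourhood before grouping). Two rows share postal code M1B and collapse into one comma-separated entry.

```python
import pandas as pd

# Stand-in frame with one row per neighbourhood; M1B appears twice
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per (Postcode, Borough); neighbourhood names joined with ", "
grouped = sample.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"].agg(
    lambda names: ", ".join(names)
)
print(grouped)
```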


9 - Verifying the frame's dimensions

In [8]:
table[0].shape

(103, 3)

# Part 2 of the lab section

From this step forward, the notebook adds the latitude and longitude of each postal code to the dataframe.

1 - Installing geocoder (if not installed) and importing geocoder for Part 2

In [25]:
!conda install -c conda-forge geocoder --yes
import geocoder

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    ratelim-0.1.6              |             py_2           6 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          59 KB

The following NEW packages will be INSTALLED:

    geocoder: 1.38.1-py_1 conda-forge
    ratelim:  0.1.6-py_2  conda-forge


Downloading and Extracting Packages
geocoder-1.38.1      | 53 KB     | ##################################### | 100% 
ratelim-0.1.6        | 6 KB      | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


2 - Getting the coordinates based on postal code using geocoder

In [None]:
# initialize an empty list, one slot per postal code
lat_lng_coords = [None]*len(table[0])

# create the target columns before writing into them
table[0]['Latitude'] = np.nan
table[0]['Longitude'] = np.nan

# loop until you get the coordinates (only rows 0 and 2 are tried here)
for i in [0,2]:
    while(lat_lng_coords[i] is None):
        print(table[0]['Postcode'][i])
        g = geocoder.google('{}, Toronto, Ontario'.format(table[0]['Postcode'][i]))
        lat_lng_coords[i] = g.latlng
        print(g)
    table[0].loc[i, 'Latitude'] = lat_lng_coords[i][0]
    table[0].loc[i, 'Longitude'] = lat_lng_coords[i][1]
    
table[0]

As explained in the lab section, geocoder can be unreliable: the command geocoder.google might not return any coordinates.
For that reason step 2 was not run, and steps 3 to 5 replace it.
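
Because a lookup that never answers would keep a `while` loop running forever, one option is to cap the number of attempts. The sketch below is an assumption-laden illustration, not part of the lab: `lookup_with_retry` and `fake_lookup` are hypothetical names, and the fake lookup stands in for a real geocoding call so the example runs without network access.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a value or the attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # pause between attempts; 0 keeps the demo fast
    return None

# Hypothetical stand-in for a geocoding call: fails twice, then answers
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return [43.806686, -79.194353] if calls["n"] >= 3 else None

coords = lookup_with_retry(fake_lookup, "M1B, Toronto, Ontario")
print(coords, calls["n"])
```

With the real package, the `lookup` argument could be something like `lambda q: geocoder.google(q).latlng`, the same call used in the cell above.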

3 - Retrieving the geodata from the csv file provided in the lab.

In [11]:
#Saving the url of the csv file
url = 'http://cocl.us/Geospatial_data'

geo = pd.read_csv(url)
geo

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


The data retrieved from the url gives the latitude and longitude of each postal code. Therefore, by comparing the "Postcode" column from table[0] with the "Postal Code" column from geo, it is possible to update table[0] and add the latitude and longitude.
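
Matching two frames on a shared key is what `DataFrame.merge` does, so the comparison described above can also be sketched as a join. The frames below (`codes`, `coords`) are small illustrative stand-ins for table[0] and geo, with the coordinate rows deliberately in a different order to show that the match is by postal code, not by position.

```python
import pandas as pd

# Stand-in for table[0]
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Stand-in for geo, listed in a different row order
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Left join keeps every row of codes; validate raises if a key is duplicated
merged = codes.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")
print(merged)
```

A left join leaves `NaN` in the coordinate columns for any postal code that has no match, which makes gaps easy to spot.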

4 - Use the postal code from the geo dataframe to complete table[0].

In [18]:
#Create two empty lists to hold the latitude and longitude in the same order as the rows of table[0]
latitude = []
longitude = []

#Index geo by postal code so each lookup matches on the code, not on the row position
geo_by_code = geo.set_index("Postal Code")

#All codes from table[0] should be filled, so a for loop is needed
for code in table[0]["Postcode"]:
    latitude.append(geo_by_code.loc[code, 'Latitude'])
    longitude.append(geo_by_code.loc[code, 'Longitude'])

5 - Create the new columns on the dataframe table[0].

In [21]:
table[0]['Latitude'] = latitude
table[0]['Longitude'] = longitude
table[0]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
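
A quick follow-up check is to confirm that no row ended up without coordinates. This is a minimal sketch on an illustrative frame (`result`); the postal code "M9Z" is made up here to simulate a row with no match.

```python
import pandas as pd

# Illustrative rows; the last one simulates a postal code with no match
result = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Latitude": [43.806686, 43.784535, None],
    "Longitude": [-79.194353, -79.160497, None],
})

# Rows where either coordinate is missing
missing = result[result[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing["Postcode"].tolist())
```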
