### Coursera/IBM Applied Data Science Capstone Course
#### Week 3 assignment: Toronto Neighborhoods 

---

##### Part 1: Scraping neighborhood postal codes and names from a Wikipedia page

First, we import the necessary libraries:

* *Requests* to grab HTML site data

* *BeautifulSoup* to scrape HTML data

* *NumPy* to handle data in a vectorized manner

* *pandas* for data analysis and dataframes

In [1]:
import requests # library to grab html data
from bs4 import BeautifulSoup # library to scrape html data

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

In the next steps we:
* 1- Define the URL link
* 2- Use *requests.get* to download the data from the Wikipedia page and assign the response to the variable *wikipedia_data*
* 3- Use the response attribute *text* to get the HTML as a text string, parse it with *BeautifulSoup*, and assign the result to the variable *soup*

In [2]:
#1
wikipedia_link="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
#2
wikipedia_data= requests.get(wikipedia_link)
#3
soup = BeautifulSoup(wikipedia_data.text, 'html.parser')

Next we define the dataframe column names by
* 4- Finding the postal code table body with the attribute *.tbody* (the first *tbody* element in the page) and collecting its *th* header elements with the BeautifulSoup method *find_all*
* 5- Extracting the text contained in each *th* element and adding it to an array of the column names
* 6- Creating a new, empty dataframe with those columns
* 7- Renaming some of the columns to fit the assignment description

In [3]:
#4
column_name_array = soup.tbody.find_all('th')
#5
column_names = [column_name_array[0].string , column_name_array[1].string , column_name_array[2].string.strip('\n')]
#6
toronto_neighborhoods = pd.DataFrame(columns=column_names)
#7
toronto_neighborhoods = toronto_neighborhoods.rename(columns={'Postcode':'PostalCode' , 'Neighbourhood': 'Neighborhood'})
toronto_neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


---

Now we fill in the dataframe columns with the data from the Wikipedia postal code table:
* 8- Using the BeautifulSoup method *find_all*, we collect the table rows into an array variable *table_data*
* 9- Looping through the array (skipping the first element, which holds the headers used for the column names), where each element corresponds to a row in the table
* 10- Extracting the row cells with *find_all* on the HTML tag *td*, which results in an array with the three values of interest
* 11- Assigning the values to the corresponding columns of the dataframe

In [4]:
#8
table_data = soup.tbody.find_all('tr')

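#9
rows = []  # collect the rows first; DataFrame.append was removed in pandas 2.0
for row in table_data[1:]:
    #10
    row_entries = row.find_all('td')
    #11
    postcode = row_entries[0].get_text()
    borough = row_entries[1].get_text()
    neighborhood = row_entries[2].get_text().strip('\n')
    rows.append({'PostalCode': postcode,
                 'Borough': borough,
                 'Neighborhood': neighborhood})

# build the dataframe in one step, keeping the column order defined above
toronto_neighborhoods = pd.DataFrame(rows, columns=toronto_neighborhoods.columns)
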
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
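
The header and row extraction in steps 4 to 11 can be tried without a network connection. The following is only a minimal sketch: `sample_html` is a hypothetical three-row miniature of the table, not the real page content, and the names `sample_soup`, `headers`, `records`, and `sample_df` are introduced here for illustration. It assumes the table has an explicit *tbody* element, as the code above does.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (not the real page content).
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
    <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_soup.tbody.find_all("tr")

# The first row holds the headers; strip whitespace such as trailing newlines.
headers = [th.get_text().strip() for th in sample_rows[0].find_all("th")]

# Collect each data row as a dict, then build the dataframe in one step.
records = []
for row in sample_rows[1:]:
    cells = [td.get_text().strip() for td in row.find_all("td")]
    records.append(dict(zip(headers, cells)))

sample_df = pd.DataFrame(records, columns=headers)
print(sample_df)
```

Stripping every cell, rather than only the last one, is a defensive choice in this sketch; it makes no difference when only the last cell carries a trailing newline.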


---

Now we clean the dataframe by eliminating all rows without an assigned borough, that is, those with the value *'Not assigned'* in the *Borough* column.
* 12- First convert every *'Not assigned'* value in the column *Borough* into *NaN*
* 13- Then drop all the rows containing *NaN*

In [5]:
#12
toronto_neighborhoods.loc[toronto_neighborhoods['Borough'] == 'Not assigned','Borough'] = np.nan
#13
toronto_neighborhoods = toronto_neighborhoods.dropna()
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
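
As an aside, the same filtering could probably be written as a single boolean mask. This is a sketch on toy data, not the notebook's method: the dataframe `demo` and its values are illustrative. One difference worth noting is that *dropna()* without arguments drops rows with *NaN* in any column, whereas the mask below inspects only *Borough*.

```python
import pandas as pd

# Toy data in the same shape as the scraped table (values are illustrative).
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned; Neighborhood is not inspected.
assigned = demo[demo["Borough"] != "Not assigned"]
print(assigned)
```

The original index labels are kept after filtering, which matches the gaps visible in the index of the output above.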


---

* 14- Next, find all rows where the *Neighborhood* value is *'Not assigned'* and replace that value with the one in the *Borough* column using the *numpy.where* function

In [6]:
#14
toronto_neighborhoods['Neighborhood'] = np.where(toronto_neighborhoods['Neighborhood'] == 'Not assigned', toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood'])
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
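
To make the element-wise behavior of *numpy.where* easier to see, here is a small sketch on a two-row toy dataframe. The names `demo` and `missing` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one assigned neighborhood and one placeholder value.
demo = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# np.where(condition, value_if_true, value_if_false), evaluated element-wise:
# where the mask is True take the Borough, otherwise keep the Neighborhood.
missing = demo["Neighborhood"] == "Not assigned"
demo["Neighborhood"] = np.where(missing, demo["Borough"], demo["Neighborhood"])
print(demo)
```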


---

* 15- Finally, we group all neighborhoods with the same postal code and borough name into a single row, aggregating the neighborhood names into a list separated by commas
* 16- Then we reset the index so that *PostalCode* and *Borough* become regular columns again

In [7]:
#15
toronto_neighborhoods = toronto_neighborhoods.groupby(['PostalCode', 'Borough']).agg(lambda x: ','.join(x.values))
#16
toronto_neighborhoods.reset_index(inplace = True)
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
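
The grouping step can be illustrated on a few rows. This is a sketch with toy data, and selecting the *Neighborhood* column before aggregating is a choice made here, not something the notebook does. Note that *groupby* sorts by the group keys by default, which would explain why the result above starts at *M1B* rather than *M3A*.

```python
import pandas as pd

# Toy data: two neighborhoods share the postal code M5A.
demo = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighborhood names of each (PostalCode, Borough) group with commas,
# then turn the group keys back into ordinary columns.
grouped = (
    demo.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(grouped)
```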


---

* 17- Display the dataframe size/shape

In [8]:
#17
toronto_neighborhoods.shape

(103, 3)

---
---
##### Part 2: Acquiring longitude and latitude coordinates for each postal code


After repeated unsuccessful attempts to get the coordinates through the Geocoder Python package, I fell back on plan B and loaded the coordinates from a CSV file:
* 1- Load the data from the CSV file
* 2- Reset the index
* 3- Rename the postal code column to match the *'toronto_neighborhoods'* dataframe


In [9]:

#1
LatLong_data = pd.read_csv('http://cocl.us/Geospatial_data', header=0, index_col=0)
#2
LatLong_data = LatLong_data.reset_index()
#3
LatLong_data = LatLong_data.rename(columns={'Postal Code' : 'PostalCode'})

LatLong_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
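
Reading with *index_col=0* and then resetting the index should give the same frame as reading the file with the default integer index. The sketch below shows the shorter form on an in-memory stand-in for the file; `csv_text` and `coords` are hypothetical names, and the column name *'Postal Code'* is assumed from the rename above.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file; the rows are copied from the output above.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

# Default integer index, so no reset_index is needed afterwards.
coords = pd.read_csv(io.StringIO(csv_text))
coords = coords.rename(columns={"Postal Code": "PostalCode"})
print(coords)
```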


* 4- Next we merge both dataframes using the *PostalCode* column as the key

In [10]:
#4
toronto_neighborhoods = pd.merge(toronto_neighborhoods, LatLong_data, on='PostalCode')
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
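
By default *pd.merge* performs an inner join, so postal codes without coordinates would be dropped silently. One way to check for that, sketched here on toy data, is a left join with *indicator=True*. The frames `left` and `right` and the postal code *M9Z* are made up for the illustration.

```python
import pandas as pd

# Toy neighborhood table; "M9Z" is a made-up code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighborhood row, indicator=True adds a "_merge"
# column, and validate raises an error if a postal code is duplicated.
merged = pd.merge(left, right, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In the notebook itself the row count is 103 both before and after the merge, which suggests that no postal code was lost.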


---

Display the dataframe size/shape

In [11]:
toronto_neighborhoods.shape

(103, 5)