# Applied Data Science Capstone
## Assignments
This notebook would be available on Github under the following link (Github handle @mukulbiswas)

In [2]:
import pandas as pd
import numpy as np

print('Hello Capstone Project Course!')

Hello Capstone Project Course!


# Part 1: Scraping and Preparation of Toronto Neighbourhood Data
The notebook scrapes a suggested Wikipedia pages for Toronto neighbourhood details in it. It contains the postcode details in a tabular format. Following steps are required as per the assignment given -


## Broad approach:
1. Scrape the table from the Wikipedia page
2. Preprocess the data in using pandas


_*Using the notebook from an earlier assignment.*_

In [2]:
import pandas as pd
import numpy as np
import requests as req

In [4]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [7]:
from lxml import html

myWikiPage = req.get(wikipedia_link)
page = myWikiPage.text

# Load the page (string) in a DOM structure
dom_html = html.fromstring(page)

# Use XPath to find the first instance of <title> as a DOM element
dom_rows = dom_html.xpath('/html/body/div[3]/div[3]/div[4]/div/table[1]/tbody/tr')

print('No of rows in the table are', len(dom_rows))

No of rows in the table are 290


_*290 rows including the header. There are 289 data rows.*_

In [23]:
# Create a dataframe with the required columns
data_columns = ['PostCode','Borough','Neighbourhood']
df = pd.DataFrame(columns=data_columns)

rowcount = len(dom_rows)

# for all the rows, iterate. Skip the header row.
for x1 in range(1, rowcount):
    tr_node = dom_rows[x1]
    new_row = {}

    # for each column, iterate
    for y1 in range(0,3):
        # if the content is text, store in the dict
        if (type(tr_node[y1].text) is str):
            new_row[data_columns[y1]] = tr_node[y1].text.strip()
        
        # else if the content is another tag (anchor-tag), then read the inner text
        else:
            new_row[data_columns[y1]] = tr_node[y1][0].text.strip()
    
    # insert the dict of 3 items into the df as a new row. x1 is the index
    df.loc[x1] = new_row

print('No of rows processed and columns are', df.shape)
df.head()

No of rows processed and columns are (289, 3)


Unnamed: 0,PostCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


Remove the rows where Borough is *Not assigned*.

In [24]:
# Remove all the items where both borough and neighbourhood are "Not assigned"
df_2 = df[(df['Borough']!='Not assigned')]

print('The new shape of the dataframe is', df_2.shape)
df_2.head()

The new shape of the dataframe is (212, 3)


Unnamed: 0,PostCode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


77 rows are dropped.

In [25]:
# Find the indexes of the rows where the neighbourhood are not assigned

indexes = df_2[df_2['Neighbourhood']=='Not assigned'].index.values

# For those indexes, replace the name of the neighbourhood with that of the Borough
df_2.loc[indexes,'Neighbourhood'] = df_2.loc[indexes,'Borough']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [27]:
# Check if there is any row left with neighbourhood == 'Not assigned'

len(df_2[df_2['Neighbourhood']=='Not assigned'])

0

In [29]:
# Merge rows of same Post Codes; with comma-separated Borough names
df_3 = pd.DataFrame({'Neighborhood' : df_2.groupby([ 'PostCode','Borough'])['Neighbourhood'].apply(','.join)})

print('The new shape of the dataframe is', df_3.shape)
df_3.head(10)

The new shape of the dataframe is (103, 1)


Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood
PostCode,Borough,Unnamed: 2_level_1
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
M1J,Scarborough,Scarborough Village
M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
M1N,Scarborough,"Birch Cliff,Cliffside West"


## Summary and Assumptions
The above dataframe contains the scraped and preprocessed data from Wikipedia that is indexed by PostCode with multiple neighbourhood names captured against it.

It is assumed while processing the data that -
1. The term "Not assigned" has the fixed pattern. A variation with addition whitespaces would not result the same way.
2. There are no blank cells in the wikipedia page. If existed, it should have been replaced with "Not assigned" for pre-processing.
3. The wikipedia page format does not changes. The XPath of the data-table has been obtained using a browser feature availabile in developer-mode.

The final shape of the dataframe is captured below-

In [30]:
print('The final shape of the dataframe is', df_3.shape)

The final shape of the dataframe is (103, 1)


---------------------------

# Part 2: Determination of the Coordinates of the Postcodes
This is the 2nd part of the week-3 assignment (2 marks) where lat-long is determined using geocoder for each post-code that has been scraped in the part 1.

## Approach
- Starting from the output dataframe of the part 1
- Download the geocoder data available online and read into a dataframe
- Merge the geocoder dataframe with dataframe from part 1.


In [33]:
!wget http://cocl.us/Geospatial_data

--2018-12-11 14:29:29--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 159.8.72.228
Connecting to cocl.us (cocl.us)|159.8.72.228|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2018-12-11 14:29:29--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|159.8.72.228|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2018-12-11 14:29:32--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2018-12-11 14:29:33--  https://ibm.ent.box.com/shared/static

In [35]:
geo = pd.read_csv('Geospatial_data')

In [41]:
geo.columns=['PostCode', 'Latitude', 'Longitude']
geo = geo.set_index('PostCode')
geo.head()

Unnamed: 0_level_0,Latitude,Longitude
PostCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [48]:
df_4 = df_3.merge(geo, left_index =True, right_index =True, how='outer').reset_index()

df_4.head(10)

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


----

# Part 3: Neighbourhood Analysis of Toronto (_*selected Boroughs*_)


In [50]:
!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium


Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  55.63 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  36.06 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  39.85 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  48.18 MB/s


In [51]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.17.0-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  24.19 MB/s
geopy-1.17.0-p 100% |################################| Time: 0:00:00  35.86 MB/s


In [53]:
address = 'Toronto, Canada'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))



The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [55]:
# Short-list the list to a selected few boroughs
df_5 = df_4[df_4['Borough'].str.contains('Toronto')]
print('The new shape of the shortlisted dataframe is', df_5.shape)
df_5.head()

The new shape of the shortlisted dataframe is (38, 5)


Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [56]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_5['Latitude'], df_5['Longitude'], df_5['Borough'], df_5['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [58]:
CLIENT_ID = '22KPZXIQGHJB2NRJ5P1QDWQPPQ0EQSUN1M0ZVRBW1U3K2CMC' # your Foursquare ID
CLIENT_SECRET = '2O1UILVIKZ3WX0XOUWS1VOYTHSWV12EDLCYSS4BJXIZEQHJ2' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 22KPZXIQGHJB2NRJ5P1QDWQPPQ0EQSUN1M0ZVRBW1U3K2CMC
CLIENT_SECRET:2O1UILVIKZ3WX0XOUWS1VOYTHSWV12EDLCYSS4BJXIZEQHJ2


### Explore schools in the selected boroughs

In [66]:
search_query = 'School'
radius = 5000

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=22KPZXIQGHJB2NRJ5P1QDWQPPQ0EQSUN1M0ZVRBW1U3K2CMC&client_secret=2O1UILVIKZ3WX0XOUWS1VOYTHSWV12EDLCYSS4BJXIZEQHJ2&ll=43.653963,-79.387207&v=20180604&query=School&radius=5000&limit=30'

In [67]:
import requests

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c0fd9d96a607102ba1c76bc'},
 'response': {'venues': [{'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/education/academicbuilding_',
       'suffix': '.png'},
      'id': '4bf58dd8d48988d198941735',
      'name': 'College Academic Building',
      'pluralName': 'College Academic Buildings',
      'primary': True,
      'shortName': 'Academic Building'}],
    'hasPerk': False,
    'id': '4ada0d95f964a520d41d21e3',
    'location': {'address': '575 Bay St.',
     'cc': 'CA',
     'city': 'Toronto',
     'country': 'Canada',
     'crossStreet': 'at Dundas St. W',
     'distance': 405,
     'formattedAddress': ['575 Bay St. (at Dundas St. W)',
      'Toronto ON M5G 2C5',
      'Canada'],
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.65564568498175,
       'lng': -79.38273654779792}],
     'lat': 43.65564568498175,
     'lng': -79.38273654779792,
     'postalCode': 'M5G 2C5',
     'state': 'ON'},
    'name': 'Ted 

In [68]:
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

# assign relevant part of JSON to venues
venues = results['response']['venues']

# tranform venues into a dataframe
schools = json_normalize(venues)
schools.head()

Unnamed: 0,categories,hasPerk,id,location.address,location.cc,location.city,location.country,location.crossStreet,location.distance,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.postalCode,location.state,name,referralId,venuePage.id
0,"[{'pluralName': 'College Academic Buildings', ...",False,4ada0d95f964a520d41d21e3,575 Bay St.,CA,Toronto,Canada,at Dundas St. W,405,"[575 Bay St. (at Dundas St. W), Toronto ON M5G...","[{'lng': -79.38273654779792, 'lat': 43.6556456...",43.655646,-79.382737,M5G 2C5,ON,Ted Rogers School of Management,v-1544542681,
1,"[{'pluralName': 'College Academic Buildings', ...",False,4ae0c938f964a520758221e3,105 St. George St.,CA,Toronto,Canada,University of Toronto,1514,"[105 St. George St. (University of Toronto), T...","[{'lng': -79.39857565702448, 'lat': 43.6647972...",43.664797,-79.398576,M5S 3E6,ON,Rotman School of Management,v-1544542681,
2,"[{'pluralName': 'Offices', 'name': 'Office', '...",False,502ba134e4b057455a9a1284,439 University Ave.,CA,Toronto,Canada,at Dundas St. W,80,"[439 University Ave. (at Dundas St. W), Toront...","[{'lng': -79.38783051465907, 'lat': 43.6545329...",43.654533,-79.387831,,ON,Ontario Public School Boards' Association,v-1544542681,
3,"[{'pluralName': 'Schools', 'name': 'School', '...",False,4cbb27d14c60a093e14c4aca,64 Baldwin Street,CA,Toronto,Canada,,678,"[64 Baldwin Street, Toronto ON, Canada]","[{'lng': -79.395314, 'lat': 43.655623, 'label'...",43.655623,-79.395314,,ON,Beverly Junior Public School,v-1544542681,
4,"[{'pluralName': 'College Arts Buildings', 'nam...",False,4cec04a5fe90a35ddb39560e,230 Richmond St E,CA,Toronto,Canada,at George St,1314,"[230 Richmond St E (at George St), Toronto ON,...","[{'lng': -79.37089528079174, 'lat': 43.6535944...",43.653594,-79.370895,,ON,George Brown School of Design,v-1544542681,


In [69]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in schools.columns if col.startswith('location.')] + ['id']
schools_filtered = schools.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
schools_filtered['categories'] = schools_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
schools_filtered.columns = [column.split('.')[-1] for column in schools_filtered.columns]

schools_filtered

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,id
0,Ted Rogers School of Management,College Academic Building,575 Bay St.,CA,Toronto,Canada,at Dundas St. W,405,"[575 Bay St. (at Dundas St. W), Toronto ON M5G...","[{'lng': -79.38273654779792, 'lat': 43.6556456...",43.655646,-79.382737,M5G 2C5,ON,4ada0d95f964a520d41d21e3
1,Rotman School of Management,College Academic Building,105 St. George St.,CA,Toronto,Canada,University of Toronto,1514,"[105 St. George St. (University of Toronto), T...","[{'lng': -79.39857565702448, 'lat': 43.6647972...",43.664797,-79.398576,M5S 3E6,ON,4ae0c938f964a520758221e3
2,Ontario Public School Boards' Association,Office,439 University Ave.,CA,Toronto,Canada,at Dundas St. W,80,"[439 University Ave. (at Dundas St. W), Toront...","[{'lng': -79.38783051465907, 'lat': 43.6545329...",43.654533,-79.387831,,ON,502ba134e4b057455a9a1284
3,Beverly Junior Public School,School,64 Baldwin Street,CA,Toronto,Canada,,678,"[64 Baldwin Street, Toronto ON, Canada]","[{'lng': -79.395314, 'lat': 43.655623, 'label'...",43.655623,-79.395314,,ON,4cbb27d14c60a093e14c4aca
4,George Brown School of Design,College Arts Building,230 Richmond St E,CA,Toronto,Canada,at George St,1314,"[230 Richmond St E (at George St), Toronto ON,...","[{'lng': -79.37089528079174, 'lat': 43.6535944...",43.653594,-79.370895,,ON,4cec04a5fe90a35ddb39560e
5,Ogden Junior Public School,College Academic Building,33 Phoebe St.,CA,Toronto,Canada,,776,"[33 Phoebe St., Toronto ON M5T 1A3, Canada]","[{'lng': -79.3954304253186, 'lat': 43.65031519...",43.650315,-79.39543,M5T 1A3,ON,4cc60bd5b2beb1f7ee38264c
6,George Brown College - School of ESL,College Academic Building,341 King St. E,CA,Toronto,Canada,,1757,"[341 King St. E, Toronto ON M5A 1L1, Canada]","[{'lng': -79.36557973642425, 'lat': 43.6518719...",43.651872,-79.36558,M5A 1L1,ON,4fe873dce4b0ec3d85f23caa
7,Canadian National Ballet School,Dance Studio,400 Jarvis St,CA,Toronto,Canada,btwn Carlton & Maitland St.,1347,"[400 Jarvis St (btwn Carlton & Maitland St.), ...","[{'lng': -79.377237, 'lat': 43.663681, 'label'...",43.663681,-79.377237,M4Y 2G6,ON,4bcb2a13fb84c9b68b391e3e
8,Dalla Lana School of Public Health,College & University,155 College St.,CA,Toronto,Canada,at McCaul St.,762,"[155 College St. (at McCaul St.), Toronto ON M...","[{'lng': -79.39325440855995, 'lat': 43.6592319...",43.659232,-79.393254,M5T 3M7,ON,4db6ee9343a1369cb5f0e21d
9,Keystone International School,High School,23 Toronto St,CA,Toronto,Canada,,964,"[23 Toronto St, Toronto ON M5C 2R1, Canada]","[{'lng': -79.376154, 'lat': 43.65062, 'label':...",43.65062,-79.376154,M5C 2R1,ON,59ff1993b1538e3a0b64c6e8


In [70]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred around the Toronto

# add a red circle marker to represent the Conrad Hotel
folium.features.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Toronto',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the Italian restaurants as blue circle markers
for lat, lng, label in zip(schools_filtered.lat, schools_filtered.lng, schools_filtered.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map