# Applied Data Science Capstone

This notebook hosts the code for the Applied Data Science Capstone project.

### Week 1 - Capstone Project Notebook

In [11]:
import pandas as pd
import numpy as np

In [12]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


### Week 3 - Segmenting and Clustering Neighborhoods in Toronto

**Step 1:** prepare environment

In [13]:
import requests

**Step 2:** get wiki page as text

In [84]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = requests.get(wiki_url)

**Step 3:** extract table by class to dataframe

In [86]:
codes_df_list = pd.read_html(io = wiki_page.text, attrs = {'class': 'wikitable'})
codes_df_0 = codes_df_list[0]

In [96]:
codes_df_0.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


**Step 4:** rename column

In [100]:
codes_df_1 = codes_df_0.rename(columns = {'Postcode': 'PostalCode'})

In [101]:
codes_df_1.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


**Step 5:** remove rows where Borough is 'Not assigned'

In [156]:
codes_df_2 = codes_df_1[codes_df_1['Borough'] != 'Not assigned']

In [157]:
codes_df_2.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Not assigned
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


**Step 6:** fill 'Not assigned' neighborhood with borough

In [158]:
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning
is_missing = codes_df_2['Neighbourhood'] == 'Not assigned'
codes_df_2 = codes_df_2.assign(Neighbourhood=np.where(is_missing, codes_df_2['Borough'], codes_df_2['Neighbourhood']))


In [159]:
codes_df_2.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
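
Steps 5 and 6 can also be combined into one small, self-contained sketch. Assigning into a filtered slice is a common source of pandas' `SettingWithCopyWarning`, so the version below takes an explicit `.copy()` after filtering and then assigns through `.loc`. The rows are a hand-made sample in the same shape as the scraped table, and the names `sample`, `cleaned` and `missing` are illustrative only.

```python
import pandas as pd

# Hand-made sample rows in the same shape as the scraped table (illustrative only).
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M9A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Step 5: drop rows without a borough; .copy() makes the result an independent frame.
cleaned = sample[sample['Borough'] != 'Not assigned'].copy()

# Step 6: where the neighbourhood is missing, reuse the borough name.
missing = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']

print(cleaned)
```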


**Step 7:** combine neighborhoods that share the same postal code

In [165]:
aggr_df = codes_df_2.groupby(['PostalCode', 'Borough'])['Neighbourhood'].apply(lambda v: ', '.join(v)).reset_index()
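
To see what the grouping does, here is a minimal sketch on three hand-made rows (not the full table). It groups on the same two columns and joins the neighbourhood names with `', '`; `sample` and `combined` are illustrative names.

```python
import pandas as pd

# Two rows share postal code M6A, one row stands alone (illustrative sample).
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M6A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', 'Harbourfront'],
})

# One output row per (PostalCode, Borough); neighbourhoods are comma-joined.
combined = (
    sample.groupby(['PostalCode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(combined)
```

After grouping, each postal code appears exactly once, which is why the row count drops compared with the input.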

**Step 8:** check result dataframe

In [186]:
aggr_df.shape

(103, 3)

**Step 9:** extend the environment with geocoder and try to geocode the postal codes

In [172]:
import sys
!{sys.executable} -m pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [173]:
import geocoder

In [183]:
def get_lat_lon(postal_code):
    # Query the geocoder until it returns coordinates.
    # Caution: this never terminates if the service keeps returning no result.
    lat_lng_coords = None
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return [lat_lng_coords[0], lat_lng_coords[1]]

In [184]:
lat_list = []
lon_list = []
pc_list = aggr_df['PostalCode'].tolist()
for pc in pc_list:
    lat_lon = get_lat_lon(pc)
    lat_list.append(lat_lon[0])
    lon_list.append(lat_lon[1])

KeyboardInterrupt: 
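
The cell had to be interrupted because an unbounded retry loop never ends when the service returns no coordinates. One possible alternative, sketched below, is to cap the number of attempts and return `None` on failure. The helper `geocode_with_retry`, its parameters, and the two stand-in lookup functions are hypothetical and exist only so the sketch runs without network access.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) up to max_attempts times; return [lat, lon] or None."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return [coords[0], coords[1]]
        # Pause between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None

# Stand-in lookup that never finds anything (mimics a service that always fails).
def always_fails(query):
    return None

# Stand-in lookup that succeeds on the second call.
calls = {'n': 0}
def second_time_lucky(query):
    calls['n'] += 1
    return (43.806686, -79.194353) if calls['n'] >= 2 else None

failed = geocode_with_retry(always_fails, 'M1B, Toronto, Ontario')
found = geocode_with_retry(second_time_lucky, 'M1B, Toronto, Ontario')
print(failed, found)
```

In the notebook, `lookup` could wrap the call already used in `get_lat_lon`, for example `lambda q: geocoder.google(q).latlng`, and the caller would then need to handle a `None` result.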

**Step 10:** geocoder does not work, so load the coordinates from the provided CSV file instead

In [193]:
geo_df = pd.read_csv('https://cocl.us/Geospatial_data')

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [195]:
geo_df_merge = geo_df.rename(columns = {'Postal Code': 'PostalCode'})

In [197]:
geo_df_merge.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [200]:
data_with_geo = pd.merge(aggr_df, geo_df_merge[['PostalCode', 'Latitude', 'Longitude']], on = 'PostalCode')
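
A default inner merge silently drops postal codes that have no coordinates. As a hedged sketch of how to check for that, the example below uses a left merge with `indicator=True` and `validate='one_to_one'` on two small hand-made frames. `M9Z` is a made-up code added only to produce an unmatched row, and `areas`, `coords`, `merged` and `unmatched` are illustrative names.

```python
import pandas as pd

# Hand-made frames; 'M9Z' is a made-up code with no coordinates.
areas = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every area; validate raises if a postal code is duplicated
# on either side; indicator adds a '_merge' column describing each row's origin.
merged = pd.merge(areas, coords, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes present in 'areas' but missing from 'coords'.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```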

**Step 11:** check data

In [203]:
data_with_geo.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [221]:
data_with_geo['Borough'].unique().tolist()

['Scarborough',
 'North York',
 'East York',
 'East Toronto',
 'Central Toronto',
 'Downtown Toronto',
 'York',
 'West Toronto',
 'Mississauga',
 'Etobicoke',
 "Queen's Park"]

In [208]:
# The code was removed by Watson Studio for sharing.

**Step 12:** extend the environment, part 2 (geopy and folium)

In [209]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1.21.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

In [211]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.0.1               |             py_0         575 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         673 KB

The following NEW packages will be INSTALLED:

    altair:  4.0.1-py_0 conda-forge
    branca:  0.3.1-py_0 conda-forge
    folium:  0.5.0-py_0 conda-forge
    vincent: 0.4.4-py_1 conda-forge


Downloading and Extracting Packages
altair-4.0.1         | 575 KB    | #####

**Step 13:** take a look at the map

In [216]:
geolocator = Nominatim(user_agent='foursquare_agent')
location = geolocator.geocode('Toronto, Canada')
tor_lat = location.latitude
tor_lon = location.longitude
print(tor_lat, tor_lon)

43.653963 -79.387207


In [229]:
bor_col = {
     'East Toronto': 'red',
     'Central Toronto': 'green',
     'Downtown Toronto': 'blue',
     'West Toronto': 'yellow'
}

Select only the boroughs whose name contains 'Toronto'.

In [231]:
tor_bor = data_with_geo[data_with_geo['Borough'].str.contains('Toronto')]

In [232]:
tor_bor.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
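
Because the marker colour is looked up with `bor_col.get(borough)`, a borough without an entry in the dictionary would get `None` as its colour. The sketch below shows one way to confirm that every selected borough has a colour. It repeats the `bor_col` mapping so it can run on its own, and the four-row `sample` frame is hand-made for illustration.

```python
import pandas as pd

# Same colour mapping as in the notebook, repeated so the sketch is self-contained.
bor_col = {
    'East Toronto': 'red',
    'Central Toronto': 'green',
    'Downtown Toronto': 'blue',
    'West Toronto': 'yellow',
}

# Hand-made sample with two Toronto boroughs and two others.
sample = pd.DataFrame({'Borough': ['East Toronto', 'Scarborough', 'West Toronto', 'York']})

# Plain substring match (regex=False), mirroring the filter on 'Toronto'.
tor_only = sample[sample['Borough'].str.contains('Toronto', regex=False)]

# Map each borough to its colour; a missing key would show up as NaN.
colors = tor_only['Borough'].map(bor_col)
print(colors.isna().any())
```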


In [235]:
tor_map = folium.Map(location=[tor_lat, tor_lon], zoom_start=11)

for lat, lng, borough in zip(tor_bor['Latitude'], tor_bor['Longitude'], tor_bor['Borough']):
    label = borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=bor_col.get(borough),
        fill=True,
        fill_color=bor_col.get(borough),
        fill_opacity=0.5,
        parse_html=False).add_to(tor_map)
    
tor_map