<h1 align=center> <font size = 5> Segmenting and Clustering Neighborhoods in Toronto </font></h1>

<h3 align = center> Navaneeth's Capstone project - Web scraping, data preparation, segmenting and clustering </h3>

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and cleans the data for further processing. The steps below extract the table of postal codes and transform it into a pandas dataframe, following these rules:

- The dataframe consists of three columns: PostalCode, Borough, and Neighborhood.
- Only cells with an assigned borough are processed; cells whose borough is "Not assigned" are ignored.
- More than one neighborhood can exist in one postal code area. Such rows are combined into one row, with the neighborhoods separated by commas.
- If a cell has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough.

Finally, we report the size of the cleaned dataframe, that is, its number of rows and columns.

## 1. Web scraping the data from the Wikipedia page

Import the libraries required for web scraping and dataframes.

In [2]:
# Import BeautifulSoup and requests for web scraping
import requests, re
from bs4 import BeautifulSoup

# Import pandas for dataframes
import pandas as pd

Fetch the target URL.

In [3]:
#getting the target scrape page

reqs_wiki=requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
cont_wiki=reqs_wiki.content

The postal code tables sit inside the HTML section with class "mw-parser-output". Scrape the rows of each table on the page into a dictionary called all_tab.

In [4]:
# The target table has class "wikitable sortable jquery-tablesorter"
Bsoup = BeautifulSoup(cont_wiki, "html.parser")

# Collect the rows of every <tbody> on the page, one list of tuples per table
all_tab = {}
for k, body in enumerate(Bsoup.find_all('tbody')):
    all_tab['table' + str(k)] = []
    for tr in body.find_all('tr'):
        tmp = tuple()
        # Keep the first header cell of the row, if there is one
        th = tr.find('th')
        if th:
            tmp += (th.text.strip(),)
        # Then append the text of every data cell
        for td in tr.find_all('td'):
            tmp += (td.text.strip(),)
        all_tab['table' + str(k)].append(tmp)
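
To see what shape of data this loop produces, here is a minimal sketch that applies similar row-to-tuple logic to a small hand-written HTML table. The HTML snippet and the helper name `rows_from_tbody` are made up for illustration. Unlike the cell above, which keeps only the first `<th>` of a row, this sketch keeps every header cell.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative only).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

def rows_from_tbody(body):
    """Return one tuple of stripped cell texts per <tr> (hypothetical helper)."""
    rows = []
    for tr in body.find_all('tr'):
        # Header and data cells are collected in document order.
        cells = tr.find_all(['th', 'td'])
        rows.append(tuple(cell.text.strip() for cell in cells))
    return rows

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_tab = {
    'table' + str(k): rows_from_tbody(body)
    for k, body in enumerate(soup.find_all('tbody'))
}
print(sample_tab['table0'])
```

Each table becomes a list of tuples, which pandas can turn into a dataframe with one row per tuple.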

Of the tables extracted, load only the first one into a dataframe.

In [5]:
# all_tab['table0'] is a list of row tuples, so the plain constructor is enough
data_full = pd.DataFrame(all_tab['table0'])

## 2. Data Cleaning 

Clean the data: remove or replace unnecessary fields and records, and provide the necessary column names.

In [6]:
# Rename the columns to Postcode, Borough, Neighbourhood
data_full.columns = ['Postcode','Borough','Neighbourhood']
data_full.set_index('Postcode', inplace=True)
# Drop the first row, which comes from the table's header row
data_full = data_full.iloc[1:]
data_full.head()

| Postcode | Borough | Neighbourhood |
|---|---|---|
| M1A | Not assigned | Not assigned |
| M2A | Not assigned | Not assigned |
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Harbourfront |


A few fields under Borough are not assigned, so we remove all rows whose Borough holds the value <b> "Not assigned" </b>.

In [7]:
data_full = data_full[data_full.Borough != 'Not assigned']
data_full.reset_index(inplace=True)
data_full.head(8)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |


Replace every Neighbourhood value that is <b> "Not assigned" </b> with the Borough value of the same row.

In [8]:
# Where the neighbourhood is "Not assigned", fall back to the borough name
not_assigned = data_full['Neighbourhood'] == 'Not assigned'
data_full.loc[not_assigned, 'Neighbourhood'] = data_full.loc[not_assigned, 'Borough']
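
To check the fallback rule in isolation, here is a minimal sketch on a toy frame. The two rows are copied from the preview above, `toy` and `assigned` are hypothetical names, and `Series.where` is just one way to express the rule.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', 'Not assigned'],
})

# Keep the neighbourhood where it is assigned; otherwise use the borough.
assigned = toy['Neighbourhood'] != 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].where(assigned, toy['Borough'])
print(toy)
```

After this step the M7A row carries "Queen's Park" in both the Borough and Neighbourhood columns.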

Let's group the dataset by Postcode and Borough, joining the neighbourhoods of each group into one comma-separated value.

In [9]:
# Join the neighbourhoods of each (Postcode, Borough) pair with ", "
Series_stage2 = data_full.groupby(['Postcode','Borough'], as_index=True)['Neighbourhood'].apply(', '.join)
data_stage2 = Series_stage2.to_frame()
data_stage2.reset_index(inplace=True)
data_stage2.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
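
The grouping step can be tried on a toy frame to confirm that rows sharing a postcode collapse into one. This is a minimal sketch; the three rows are copied from the earlier preview, and `toy` and `merged` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# One row per (Postcode, Borough); neighbourhoods joined with ", ".
merged = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(merged)
```

Note that `groupby` sorts by the group keys by default, so M4A comes before M5A in the result even though it appeared last in the input.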


Now let's check the size of the dataframe.

In [10]:
data_stage2.shape

(103, 3)

## 3. Joining the location coordinates

The geocoder package did not work reliably, so we take the alternative approach of using a CSV file with the location details. The file is available at https://cocl.us/Geospatial_data.

In [11]:
!wget -O geo_data.csv https://cocl.us/Geospatial_data

--2019-05-09 07:13:14--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-09 07:13:15--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-09 07:13:15--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-09 

Now let's read the CSV file into a dataframe and set the postal code as its index.

In [18]:
df_geo = pd.read_csv('geo_data.csv')
df_geo.set_index("Postal Code",inplace=True)
df_geo.head()

| Postal Code | Latitude | Longitude |
|---|---|---|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |
| M1G | 43.770992 | -79.216917 |
| M1H | 43.773136 | -79.239476 |


Join the two dataframes on their indexes.

In [14]:
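data_stage2.set_index("Postcode",inplace=True)
# Left join on the index: keep every postcode in data_stage2 and attach its coordinates.
# (The join_axes argument of pd.concat was removed in pandas 1.0.)
result = data_stage2.join(df_geo)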

In [15]:
result.reset_index(inplace=True)

In [16]:
result.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
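
The index-based join can be checked on toy data. This is a minimal sketch, with values copied from the previews above; `left`, `geo`, `joined`, and `missing` are hypothetical names. It uses `DataFrame.join`, which is one way to do a left join on the index, and also lists any postcode that found no coordinates.

```python
import pandas as pd

left = pd.DataFrame(
    {'Borough': ['Scarborough', 'Scarborough']},
    index=pd.Index(['M1B', 'M1C'], name='Postcode'),
)
geo = pd.DataFrame(
    {'Latitude': [43.806686, 43.784535, 43.763573],
     'Longitude': [-79.194353, -79.160497, -79.188711]},
    index=pd.Index(['M1B', 'M1C', 'M1E'], name='Postal Code'),
)

# DataFrame.join defaults to a left join on the index, so only the
# postcodes present in `left` are kept (M1E is dropped here).
joined = left.join(geo)

# Postcodes without a coordinate match would have NaN in Latitude.
missing = joined[joined['Latitude'].isna()].index.tolist()
print(joined)
print(missing)
```

An empty `missing` list means every postcode in the cleaned table received a latitude and longitude.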


## 4. Visualizing the data on a map

In this section we create a map of Toronto using the latitude and longitude values and add markers to it. For this, let's first install the folium package and a geocoding package to get the coordinates of Toronto.

In [20]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.5.0               |             py_0          45 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

Import the packages: folium for the maps and Nominatim for the geolocation lookup.

In [26]:
import folium
from geopy.geocoders import Nominatim

Let's use the geopy library to get the latitude and longitude of Toronto. To define an instance of the geocoder, we need to specify a user_agent; we name ours toronto_explorer.

In [42]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
tor_latitude = location.latitude
tor_longitude = location.longitude
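
`geocode` returns `None` when the lookup finds nothing, in which case `location.latitude` would raise an error. Below is a minimal sketch of a fallback, assuming it is acceptable to center the map on the mean of the postcode coordinates instead. `map_center` is a hypothetical helper, and the toy coordinates are copied from the table above.

```python
import pandas as pd

def map_center(location, coords):
    """Return (lat, lon) from a geocoder result, or the mean of coords if it is None."""
    if location is not None:
        return location.latitude, location.longitude
    # Fallback (an assumption of this sketch): average the known coordinates.
    return coords['Latitude'].mean(), coords['Longitude'].mean()

coords = pd.DataFrame({
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Simulate a failed lookup by passing None as the location.
center = map_center(None, coords)
print(center)
```

In the notebook this could be called as `tor_latitude, tor_longitude = map_center(location, result)`.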

Now, with the folium package, we can iterate through the dataset to create markers on the map. Each marker gets a popup label showing its neighbourhood and borough.

In [49]:
latitude = tor_latitude
longitude = tor_longitude

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, Borough, Neighbourhood in zip(result['Latitude'], result['Longitude'], result['Borough'], result['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],        
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto