# Battle of the Neighborhoods

## IBM Coursera Capstone Project

##### In this notebook we scrape data for neighborhoods in Toronto, Canada.

We will perform the following steps:

1. Get the data from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Convert the wiki table into a pandas DataFrame
3. Ignore the postcodes that do not have a borough assigned
4. For neighborhoods that are not assigned, use the borough name as the neighborhood name
5. For postcodes with multiple neighborhoods, keep the neighborhoods as a comma-separated list instead of multiple rows
6. View the pandas DataFrame created
7. Write out the number of rows

In [1]:
import numpy as np
import pandas as panda
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
import folium
from matplotlib import pyplot as plot
from itertools import chain
from sklearn.cluster import KMeans


In [2]:
wiki_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_data_as_string = requests.get(wiki_link).text

In [3]:
soup = BeautifulSoup(wiki_data_as_string,'lxml')

We inspected the Wikipedia page listing Toronto neighborhoods using Chrome developer tools and found that the data we are looking for sits in a table with the following class attribute:
#### class="wikitable sortable jquery-tablesorter"

In [4]:
toronto_table = soup.find('table',{'class':'wikitable sortable'})
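
The class shown in developer tools includes `jquery-tablesorter`, while the lookup above uses only `wikitable sortable`. This is most likely because that extra class is added by JavaScript in the browser and is not present in the HTML that `requests` downloads. The sketch below illustrates the difference on a small, made-up HTML fragment (`sample_html` is an assumption, not the real page), and also shows a CSS selector as a possible alternative that does not depend on the exact attribute string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page source.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the class attribute string, as in the cell above.
by_attribute = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a table carrying both classes, in any order.
by_selector = sample_soup.select_one("table.wikitable.sortable")

# The browser-only class is absent from the raw markup, so nothing is found.
browser_only = sample_soup.find(
    "table", {"class": "wikitable sortable jquery-tablesorter"}
)

first_row = [cell.get_text(strip=True) for cell in by_selector.find_all("td")]
print(first_row)
```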

Maintain a running list of the postal codes, boroughs and neighborhoods found.
We add them to a dictionary because a dictionary of lists converts easily to a pandas DataFrame.

In [5]:
toronto_neighborhood = defaultdict(list)

for row in toronto_table.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        cell_text = [i.text.strip() for i in cells]
        postcode = cell_text[0]
        borough = cell_text[1]
        neighborhood = cell_text[2]
        
        if borough and borough.lower()!='not assigned':
            toronto_neighborhood['postcode'].append(postcode)
            toronto_neighborhood['borough'].append(borough)
            toronto_neighborhood['neighborhood'].append(neighborhood if neighborhood.lower()!='not assigned' else borough)
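
The filtering rules in the loop above can be tried in isolation, without the network or the HTML parsing. The following is a minimal sketch; `sample_rows` and the helper name `filter_rows` are illustrative assumptions rather than part of the notebook.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample rows (postcode, borough, neighborhood), not scraped data.
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M7A", "Queen's Park", "Not assigned"),
]


def filter_rows(rows):
    """Drop rows without a borough; reuse the borough name for unnamed neighborhoods."""
    columns = defaultdict(list)
    for postcode, borough, neighborhood in rows:
        if not borough or borough.lower() == "not assigned":
            continue  # rule 3: skip postcodes with no borough
        if neighborhood.lower() == "not assigned":
            neighborhood = borough  # rule 4: fall back to the borough name
        columns["postcode"].append(postcode)
        columns["borough"].append(borough)
        columns["neighborhood"].append(neighborhood)
    # A dict of equal-length lists maps directly onto DataFrame columns.
    return pd.DataFrame(columns)


sample_frame = filter_rows(sample_rows)
print(sample_frame)
```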

    

In [6]:
toronto_neighborhood= panda.DataFrame(toronto_neighborhood)


To show all neighborhoods of the same borough as a comma-separated list, we perform the following steps:

1. Group by borough
2. Aggregate on the neighborhood column
3. Use an aggregation function that returns comma-separated values
4. Merge the aggregated table with the original table
5. Remove duplicate columns and duplicate rows

In [7]:
def combine_all_neighborhoods(x):
    hoods=[]
    hoods.extend(x)
    return ','.join(hoods)

In [8]:
grouped_by_borough = toronto_neighborhood.groupby(['borough']).agg({'neighborhood':combine_all_neighborhoods}).reset_index()
grouped_by_borough.head()

| | borough | neighborhood |
|---|---|---|
| 0 | Central Toronto | Lawrence Park,Roselawn,Davisville North,Forest... |
| 1 | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... |
| 2 | East Toronto | The Beaches,The Danforth West,Riverdale,The Be... |
| 3 | East York | Woodbine Gardens,Parkview Hill,Woodbine Height... |
| 4 | Etobicoke | Islington Avenue,Cloverdale,Islington,Martin G... |


In [9]:
grouped_by_borough.shape, toronto_neighborhood.shape

((11, 2), (212, 3))
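
Step 5 of the plan calls for combining neighborhoods that share a postcode, while the aggregation above groups by borough, so every postcode in a borough ends up with the full borough-wide list. A minimal sketch of a per-postcode variant follows; the `sample` frame and its rows are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Hypothetical long-format sample: one row per (postcode, neighborhood) pair.
sample = pd.DataFrame({
    "postcode": ["M5A", "M5A", "M7A"],
    "borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# Group on postcode and borough so each postcode ends up with exactly one row;
# sort=False keeps the groups in order of first appearance.
per_postcode = (
    sample.groupby(["postcode", "borough"], sort=False)["neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(per_postcode)
```

Grouping this way yields one row per postcode directly, so the later merge and `drop_duplicates` steps would not be needed for this variant.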

In [10]:
temp = panda.merge(toronto_neighborhood,grouped_by_borough,how='left', on ='borough')
temp.drop('neighborhood_x', axis=1,inplace=True)
temp.rename(columns = {'postcode':'postcode'.title(),'borough':'borough'.title(),'neighborhood_y':'neighborhood'.title()}, inplace = True)
temp

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 1 | M4A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... |
| 3 | M5A | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... |
| 4 | M6A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 5 | M6A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue,Cloverdale,Islington,Martin G... |
| 8 | M1B | Scarborough | Rouge,Malvern,Highland Creek,Rouge Hill,Port U... |
| 9 | M1B | Scarborough | Rouge,Malvern,Highland Creek,Rouge Hill,Port U... |


In [11]:
temp.drop_duplicates(inplace=True)
temp.head(20)

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 1 | M4A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... |
| 4 | M6A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue,Cloverdale,Islington,Martin G... |
| 8 | M1B | Scarborough | Rouge,Malvern,Highland Creek,Rouge Hill,Port U... |
| 10 | M3B | North York | Parkwoods,Victoria Village,Lawrence Heights,La... |
| 11 | M4B | East York | Woodbine Gardens,Parkview Hill,Woodbine Height... |
| 13 | M5B | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... |


In [12]:
temp.isnull().any()

Postcode        False
Borough         False
Neighborhood    False
dtype: bool

In [13]:
temp.shape

(103, 3)

To add latitude and longitude coordinates to the data we have already scraped, we use the geospatial coordinates file provided.

1. Load the CSV file using pandas
2. Join it on postal code with the table created above

In [23]:
geospatial_coordinates  = 'Geospatial_Coordinates.csv'
geo_data = panda.read_csv(geospatial_coordinates)
geo_data.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [24]:
geo_data.rename(columns={'Postal Code':'Postcode'}, inplace=True)
geo_data.head(1)

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |


In [25]:
toronto_neighborhood_with_coordinates= panda.merge(temp,geo_data, how='left', on='Postcode')
toronto_neighborhood_with_coordinates.shape, temp.shape

((103, 5), (103, 3))
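
Matching row counts show that the left join kept every row, but not that every postcode actually found coordinates. One way to check this is sketched below with the `validate` and `indicator` options of `pandas.merge`; both frames are small illustrative assumptions, and `M0X` is a made-up postcode with no coordinates.

```python
import pandas as pd

# Hypothetical samples; "M0X" is a made-up postcode with no coordinates.
neighborhoods = pd.DataFrame({
    "Postcode": ["M1B", "M0X"],
    "Borough": ["Scarborough", "Example Borough"],
})
coordinates = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = pd.merge(
    neighborhoods,
    coordinates,
    how="left",
    on="Postcode",
    validate="one_to_one",  # raise if a postcode repeats on either side
    indicator=True,         # add a _merge column: "both" or "left_only"
)

# Postcodes from the left table that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would indicate that every postcode received a latitude and longitude.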

In [26]:
toronto_neighborhood_with_coordinates.head()

| | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... | 43.753259 | -79.329656 |
| 1 | M4A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park,Ryerson,Garden Distri... | 43.65426 | -79.360636 |
| 3 | M6A | North York | Parkwoods,Victoria Village,Lawrence Heights,La... | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
