
<h1 align="center"><font size="5">Segmenting and Clustering Neighborhoods in Toronto</font></h1>

### Step 2 - Use the Notebook to build the code to scrape the following Wikipedia page

In [7]:
#Install BeautifulSoup 4, the HTML parsers it can use, and the requests HTTP library
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib
!pip install requests

#Import the necessary libraries
import pandas as pd
import numpy as np



Get the data from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [14]:
from bs4 import BeautifulSoup
import requests

In [62]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [63]:
soup = BeautifulSoup(source,'lxml')
table = soup.find('table')

Clean up the data first
- The dataframe will consist of three columns: Postal Code, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
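
The three rules above can also be written as dataframe operations. The following is a minimal sketch, not the notebook's own method (the notebook applies the rules while looping over the HTML rows). The rows in `raw` are made-up sample values used only for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped table
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postal code", "Borough", "Neighborhood"],
)

# Rule 2: keep only rows whose borough is assigned
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 3: where the neighborhood is unassigned, reuse the borough name
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

clean = clean.reset_index(drop=True)
print(clean)
```

In this sample, the `M1A` row is dropped and the `M7A` row takes its borough name as its neighborhood.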

In [64]:
data = list()

#Loop through the table rows (the 'tr' tags)
for rows in table.find_all('tr'):
    #Collect the data cells (the 'td' tags) in this row; header rows have none
    row = rows.find_all('td')
    #Grab the postal code, borough, and neighborhood, stripping trailing whitespace
    if row:
        postalcode = row[0].text.rstrip()
        borough = row[1].text.rstrip()
        neighborhood = row[2].text.rstrip()
        #If the data is there, append it; only process rows that have a borough assigned
        if borough != 'Not assigned':
            #To deal with the situation where there is not a neighborhood, give the neighborhood the value of the borough
            if neighborhood == 'Not assigned':
                neighborhood = borough
            data.append([postalcode, borough, neighborhood])

#Grab the column headers (this will be used in creating the dataframe)
col_head = list()
for cols in table.tr.find_all('th'):
    col_head.append(cols.text.strip())

#Print out the column headers
col_head

['Postal code', 'Borough', 'Neighborhood']
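To see why the loop checks `if row:` before indexing, it may help to run the same calls on a tiny inline table. This is a sketch under assumptions: the HTML string is a hypothetical stand-in for the Wikipedia table, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra parsers installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table; cell text ends with a newline,
# as is common in scraped wiki tables
html = """
<table>
  <tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
  <tr><td>M4A
</td><td>North York
</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

parsed = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        # The header row contains only <th> cells, so the list is empty
        continue
    parsed.append([cell.text.strip() for cell in cells])

# table.tr is the first <tr> in the table, i.e. the header row
headers = [th.text.strip() for th in table.tr.find_all("th")]

print(headers)
print(parsed)
```

The header row yields an empty list from `find_all('td')`, so indexing `row[0]` on it would raise an `IndexError` without the check.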

Convert this into a dataframe

In [65]:
df = pd.DataFrame(data, columns = col_head)

In [66]:
#Summarize what we have so far
df.describe()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| unique | 103 | 10 | 98 |
| top | M6C | North York | Downsview |
| freq | 1 | 24 | 4 |


Note: More than one neighborhood can exist in one postal code area. When a postal code appears in more than one row, those rows will be combined into a single row with the neighborhoods separated by a comma.

Merge the neighborhoods by using a custom groupby statement

In [72]:
df = df.groupby('Postal code').agg(
    {
        'Borough':'first',
        'Neighborhood': ','.join
    }
    ).reset_index()
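
As an illustration of what this aggregation does when a postal code is repeated, here is a small sketch on made-up rows (the `sample` dataframe and its values are hypothetical, not taken from the scraped page).

```python
import pandas as pd

# Hypothetical rows: the postal code M5A appears twice
sample = pd.DataFrame(
    {
        "Postal code": ["M5A", "M5A", "M6A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
    }
)

# Same aggregation as above: keep the first borough, join the neighborhoods
merged = sample.groupby("Postal code").agg(
    {
        "Borough": "first",
        "Neighborhood": ",".join,
    }
).reset_index()

print(merged)
```

The two `M5A` rows collapse into one row whose neighborhood is `Harbourfront,Regent Park`. When every postal code already appears only once, the aggregation leaves the row count unchanged.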

In [73]:
# df.groupby('Postal code')['Neighborhood'].agg(','.join)

In [74]:
#Look at what we have now
df.describe()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| unique | 103 | 10 | 98 |
| top | M6C | North York | Downsview |
| freq | 1 | 24 | 4 |


In [75]:
#Print out the first 5 rows, to see if the concatenation of neighborhoods is working
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [76]:
#Rename "Postal code" to "Postalcode", so we can join with the latitude/longitude table further down
df.rename(columns={'Postal code':'Postalcode'}, inplace=True)

df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


Use the .shape attribute to print the number of rows and columns of the dataframe.

In [77]:
print(df.shape)

(103, 3)


### Step 3 - Incorporate the Latitude & Longitude Coordinates

In [78]:
#Download the geospatial data
!wget -q -O geospatial_data.csv http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [79]:
#Read the geospatial CSV file into a dataframe
dfgeo = pd.read_csv("geospatial_data.csv")
dfgeo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [80]:
#Rename the Postal Code field, so you can merge it with the neighborhood data table
dfgeo.rename(columns={'Postal Code': 'Postalcode'}, inplace = True)
dfgeo.head()

| | Postalcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [81]:
#Left-join the coordinates onto the neighborhood table, matching on Postalcode
dfmerged = pd.merge(df, dfgeo, on="Postalcode", how='left')
dfmerged.head(11)

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park / Ionview / East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile / Clairlea / Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside / Cliffcrest / Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff / Cliffside West | 43.692657 | -79.264848 |
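
A left join keeps every row of the neighborhood table, so a postal code with no match in the coordinates file would silently end up with missing latitude and longitude. One way to check for that, sketched here on made-up frames, is the `indicator` and `validate` parameters of `pd.merge`. The `left` and `right` dataframes and the postal code `M9Z` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical neighborhood table; M9Z has no coordinates on purpose
left = pd.DataFrame(
    {
        "Postalcode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    }
)

# Hypothetical coordinates table
right = pd.DataFrame(
    {
        "Postalcode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# indicator=True adds a "_merge" column naming where each row came from;
# validate="one_to_one" raises if a key is duplicated on either side
checked = pd.merge(
    left,
    right,
    on="Postalcode",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Rows found only in the left table have no coordinates
missing = checked[checked["_merge"] == "left_only"]
print(missing[["Postalcode", "Borough"]])
```

If `missing` is empty for the real data, every postal code received coordinates.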
