# Segmenting and Clustering Neighborhoods in Toronto, CA

## Author: Laila Linke

## Part 1
In this part of the notebook, we scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to get the postal codes of the neighborhoods in Toronto.

In [1]:
# Import packages
import requests
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup # For web scraping
import pandas as pd

### Webscraping
We use BeautifulSoup to scrape the table from the Wikipedia page. To do this, we first open the URL of the Wikipedia page and then let BeautifulSoup parse the complete HTML page.

In [2]:
# Open the URL of the page that contains the table
url=urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
# Read out website and store in soup object
soup = BeautifulSoup(url,'html.parser')
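
The fetch step above can be made a little more defensive. The following is only a sketch, assuming that a timeout and an explicit User-Agent header are wanted; the helper names `build_request` and `fetch_html`, the constant `PAGE_URL`, and the user agent string are hypothetical and not part of the original notebook.

```python
from urllib.request import Request, urlopen

# URL taken from the notebook; the constant name is an assumption
PAGE_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def build_request(url, user_agent="toronto-neighborhoods-notebook/0.1 (hypothetical)"):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes; an exception is raised on failure."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html does
request = build_request(PAGE_URL)
```

Calling `BeautifulSoup(fetch_html(PAGE_URL), 'html.parser')` would then yield the same kind of soup object as in the cell above.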

We find the table in the HTML page by searching for the first "table" tag.

In [4]:
# Find table in soup object
table = soup.find('table')

We read the table into three lists, one for each column. We find the elements of the table by looking for the tags 'tr' and 'td': 'tr' encloses a row, and 'td' encloses a data cell inside a row. We iterate over all rows and write the value of the first cell into the list "postal_codes", the value of the second cell into the list "boroughs" and the value of the third cell into the list "neighborhoods". Rows without any 'td' cells, such as the header row, contribute nothing to the lists.

In [5]:
#Create empty lists
postal_codes=[]
boroughs=[]
neighborhoods=[]

rows=table.find_all('tr') # Find all rows of the table
for row in rows: #Go through all rows of table
    cells=row.find_all('td') # Find all cells of this row
    for i, cell in enumerate(cells): # Go through all cells
        cell_val=cell.text.strip() # Strip whitespace
        if(i==0):
            postal_codes.append(cell_val) # Append first cell to "postal_codes"
        elif(i==1):
            boroughs.append(cell_val) # Append second cell to "boroughs"
        else:
            neighborhoods.append(cell_val) # Append third cell to "neighborhoods"
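
To see the row and cell logic in isolation, here is a minimal sketch that runs the same idea on a tiny hand-written table instead of the live page. The sample HTML, the helper name `parse_table_rows`, and its `n_columns` parameter are assumptions made for illustration; unlike the loop above, this variant collects one tuple per row and skips any row that does not have exactly three 'td' cells.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative data only)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M3A </td><td> North York </td><td> Parkwoods </td></tr>
</table>
"""


def parse_table_rows(table, n_columns=3):
    """Return one tuple of stripped cell texts per data row of the table."""
    records = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) != n_columns:
            continue  # e.g. the header row, which has <th> cells but no <td> cells
        records.append(tuple(cells))
    return records


soup = BeautifulSoup(sample_html, "html.parser")
records = parse_table_rows(soup.find("table"))

# Transpose the row tuples into the three column lists used in the notebook
postal_codes, boroughs, neighborhoods = (list(column) for column in zip(*records))
```

Keeping each row together as a tuple means the three columns cannot drift out of step if a malformed row appears, which is the main design difference from appending to three separate lists.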

We convert the lists into a pandas DataFrame, which looks similar to the final expected DataFrame.
Note, however, that this DataFrame still contains rows with "Not assigned" boroughs and neighborhoods.

In [6]:
# Create Dataframe
df=pd.DataFrame()
df['PostalCode']=postal_codes
df['Borough']=boroughs
df['Neighborhood']=neighborhoods

df #Print Dataframe for visual inspection

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### Data Cleaning
After scraping the complete table, we need to clean the data. We perform two steps, although the second step is not strictly necessary for the table as scraped.

Step 1: We remove all rows whose Borough is not assigned.

Step 2: For all rows where the Borough is assigned but the Neighborhood is not, we set the Neighborhood to the Borough name. The scraped table contains no such rows, so this step changes nothing here.

In [7]:
# Drop rows whose Borough is "Not assigned"; .copy() makes df_cleaned an independent
# DataFrame, so the assignment below does not trigger a SettingWithCopyWarning
df_cleaned=df[(df.Borough!='Not assigned')].copy()

# Set Neighborhood to Borough name, if Neighborhood is "Not assigned"

def setNeighborhood(neighborhood, borough): # Return the borough name if the neighborhood is not assigned
    if neighborhood=='Not assigned':
        return borough
    else:
        return neighborhood

df_cleaned['Neighborhood']=[setNeighborhood(n, b) for n, b in (zip(df_cleaned['Neighborhood'], df_cleaned['Borough']))]
df_cleaned.reset_index(drop=True, inplace=True) # Renumber rows from 0 and discard the old indices
df_cleaned # Display cleaned DataFrame


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
# Show the shape of the cleaned DataFrame
df_cleaned.shape

(103, 3)
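
The two cleaning steps can also be written with vectorized pandas operations instead of a custom function. The sketch below uses a small made-up DataFrame, because the scraped table has no row where only the Neighborhood is unassigned; the third row (postal code "M0X", "Example Borough") is purely hypothetical and exists only to exercise Step 2.

```python
import pandas as pd

# Toy data; the last row is invented to show Step 2 in action
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M0X"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: keep only rows with an assigned Borough; .copy() gives an independent DataFrame
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the Neighborhood is "Not assigned", take the Borough name instead
assigned["Neighborhood"] = assigned["Neighborhood"].mask(
    assigned["Neighborhood"] == "Not assigned", assigned["Borough"]
)

# Renumber the rows from 0 without keeping the old index as a column
assigned = assigned.reset_index(drop=True)
```

Here `Series.mask` replaces values where the condition is true and aligns the replacement Series on the index, so no row-by-row Python loop is needed.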