<h1>Clustering Toronto's Neighborhoods</h1>

*IBM Data Science Course 9 Week 3 Project*

<h2>PART 1</h2>

Directions: Scrape the table of postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transform the data into a pandas dataframe.

#### Step 1: Import the 'requests' library and download the HTML of the page from which Toronto's neighborhoods will be scraped.

In [1]:
import requests
#Note: 'URL' holds the downloaded HTML text of the page, not the web address itself
URL = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#### Step 2: Import the 'BeautifulSoup' library
This parses the HTML downloaded from the given URL so that individual elements can be searched. Its prettify() method can display the HTML with proper indentation.

In [2]:
from bs4 import BeautifulSoup

#Parses the webpage's HTML and stores the result as 'soup'
soup = BeautifulSoup(URL,'lxml')

#Prints the HTML code; the prettify() method adds indentation for readability.
#print(soup.prettify()) #Commented out to reduce the excessively long printout

#### Step 3: Select the specific table that contains Toronto's postal codes, boroughs, and neighborhoods.

In [3]:
#Calls upon specific table in the webpage
TNTable = soup.find('table',{'class':'wikitable sortable'})

#### Step 4: Establish a list of rows by looping through the table's trs and tds to extract the desired text

In [4]:
longlist = []
for entries in TNTable.find_all('tr'):
    columns = entries.find_all('td')
    #'row' is used instead of 'list' to avoid shadowing Python's built-in list type
    row = []
    for column in columns:
        row.append(column.text)
    longlist.append(row)
#print(longlist)
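
The loop above can be tried on a small inline table without any network access. The sketch below is only an illustration: the HTML string and the names `sample_html`, `sample_soup`, `sample_table`, and `sample_rows` are made up for this example, and it assumes the header cells are `<th>` elements. Under that assumption the header row produces an empty list, which may be why the first row of the dataframe in Step 5 is blank.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table that mimics the layout assumed for the Wikipedia page
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

sample_rows = []
for tr in sample_table.find_all('tr'):
    # A row made only of <th> cells has no <td> cells, so it yields an empty list
    cells = [td.text for td in tr.find_all('td')]
    sample_rows.append(cells)

print(sample_rows)
```

The trailing newline inside the last cell of each row is kept by `.text`, which is the same "\n" that is stripped from the Neighborhood column later on.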


#### Step 5: Create dataframe 'df' with new column labels
New column labels are "Postal Code", "Borough", and "Neighborhood" instead of "0", "1", and "2".

In [5]:
import pandas as pd

df = pd.DataFrame(longlist) 
df.columns = ['Postal Code', 'Borough', 'Neighborhood']
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned\n
2,M2A,Not assigned,Not assigned\n
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n


#### Step 6: Delete the empty ("None") first row of the dataframe

In [6]:
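#Note: df.reset_index() returns a new dataframe and has no effect unless its
#result is assigned, so it is omitted here and the rows keep their original index labels.

#Deletes the first row (index 0), which contains only "None" (blank) entries
df = df.drop(0)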
df.head()


Unnamed: 0,Postal Code,Borough,Neighborhood
1,M1A,Not assigned,Not assigned\n
2,M2A,Not assigned,Not assigned\n
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n
5,M5A,Downtown Toronto,Harbourfront\n


#### Step 7: Deletes 'Not assigned' boroughs from dataframe
Gets names of indexes in column "Borough" that have value "Not assigned" and deletes these row indexes from dataframe.

In [7]:
# Gets names of indexes in column "Borough" that have value "Not assigned"
indexNames = df[ df['Borough'] == 'Not assigned' ].index
 
# Deletes these row indexes from dataFrame
df.drop(indexNames , inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n
5,M5A,Downtown Toronto,Harbourfront\n
6,M5A,Downtown Toronto,Regent Park\n
7,M6A,North York,Lawrence Heights\n
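
As an alternative sketch of the same filtering step, a boolean mask can keep the wanted rows directly instead of collecting index labels and dropping them. The dataframe `toy` and the name `assigned` below are illustrative only; the rows are copied from the outputs shown above.

```python
import pandas as pd

# Small stand-in for the scraped dataframe
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned; .copy() makes the result an
# independent dataframe so later edits do not touch 'toy'
assigned = toy[toy['Borough'] != 'Not assigned'].copy()
print(assigned)
```

Both approaches leave the original index labels in place, so the first remaining row keeps its old label rather than starting again at 0.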


#### Step 8: Remove the trailing newline "\n" from entries in the Neighborhood column

In [8]:
df.Neighborhood = [x.strip('\n') for x in df.Neighborhood]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


#### Step 9: Replace 'Not assigned' Neighborhood with corresponding Borough

In [9]:
#Replace 'Not assigned' Neighborhood with corresponding Borough
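
The cell above contains no code, so the following is only one possible sketch of this step, not necessarily what was originally run. It uses illustrative rows (not taken from the scraped table) in a hypothetical dataframe `toy`, and assumes the "\n" has already been stripped from the Neighborhood column as in Step 8.

```python
import pandas as pd

# Illustrative rows only; the first one has a borough but no neighborhood
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Select the rows whose Neighborhood is unassigned...
mask = toy['Neighborhood'] == 'Not assigned'
# ...and copy the Borough value into the Neighborhood column for those rows only
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Applied to the notebook's data, the same two lines would be run on `df` in place of `toy`.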


#### Step 10: Merge all neighborhoods that share the same Postal Code and Borough.
Separates the merged neighborhoods with commas and re-indexes the table.

In [10]:
df_neigh = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_neigh.head()
     

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
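
To see what the groupby-and-join does in isolation, here is a minimal sketch on three rows taken from the Step 8 output. The names `toy` and `merged` are illustrative.

```python
import pandas as pd

# Two rows share postal code M5A; one row has a postal code of its own
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Group rows with identical Postal Code and Borough, join their neighborhood
# names with ', ', then turn the group keys back into ordinary columns
merged = (
    toy.groupby(['Postal Code', 'Borough'])['Neighborhood']
       .apply(', '.join)
       .reset_index()
)
print(merged)
```

The two M5A rows collapse into a single row whose Neighborhood is "Harbourfront, Regent Park", while the M6A row is left unchanged.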


#### Step 11: Print the number of rows of your dataframe.
Use the **.shape** attribute, which returns a (rows, columns) tuple.

In [42]:
df_neigh.shape


(103, 3)