# Coursera Capstone Week 3: Segmenting and Clustering Neighborhoods in Toronto

## 1. Scraping data from Wikipedia

First, import the necessary packages.

In [1]:
import requests  # for fetching the Wikipedia page
from bs4 import BeautifulSoup  # for parsing the returned HTML
import pandas as pd  # for processing and cleaning the data

Request the page and parse it with BeautifulSoup using the lxml parser.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url)
soup = BeautifulSoup(source.text, 'lxml')
# Uncomment the next line to inspect the parsed page:
# print(soup.prettify())

The information we need is in the table element with the class 'wikitable sortable'.

In [3]:
table = soup.find('table', class_ = 'wikitable sortable')
# Uncomment the next line to inspect the table:
# print(table.prettify())
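
One detail worth knowing here: when `class_` is given a string containing a space, BeautifulSoup compares it against the exact value of the `class` attribute, so the lookup depends on the order of the class names. The sketch below uses a made-up HTML snippet (not the real Wikipedia page) and the built-in `html.parser` to show the difference, along with a CSS selector that does not depend on the order.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same two classes as the Wikipedia table, but reversed.
html = '<table class="sortable wikitable"><tbody><tr><td>M1A</td></tr></tbody></table>'
demo_soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: order matters, so this finds nothing.
exact = demo_soup.find('table', class_='wikitable sortable')

# CSS selector: matches any table carrying both classes, in any order.
by_selector = demo_soup.select_one('table.wikitable.sortable')

print(exact)
print(by_selector is not None)
```

If the page markup ever changes the class order, the selector form may be the more forgiving choice.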

All the row content is in the body of the table, so find all the rows.

In [4]:
trows = table.tbody.find_all('tr')
# Uncomment the next line to see the list of all rows in the table:
# trows

Take a look at the first and second rows.

In [5]:
print(trows[0])
print()
print(trows[1])

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>

<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>


We can see that the column names are in the first row and the data we need is in all the other rows.
We need to create a dictionary to hold this information.

In [6]:
dict_ = {}
dict_['PostalCode'], dict_['Borough'], dict_['Neighborhood'] = [], [], []
dict_

{'PostalCode': [], 'Borough': [], 'Neighborhood': []}
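
To see why the extraction function in the next cell can index into a row by position, the sketch below rebuilds the second row printed above as a standalone snippet and splits its text on newlines. It parses with the built-in `html.parser` rather than `lxml`, an assumption made only to keep the example small; `row_html`, `demo_row` and `fields` are names introduced for illustration.

```python
from bs4 import BeautifulSoup

# The second table row from the output above, copied as a string.
row_html = """<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>"""
demo_row = BeautifulSoup(row_html, 'html.parser').tr

# .text concatenates the cell text; the newlines between cells act as separators.
fields = demo_row.text.strip().split('\n')
print(fields)
```

The three positions correspond to the postcode, borough and neighbourhood columns.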

I defined a function to pull out the information. If the 'Borough' cell of a row is 'Not assigned', the row is skipped. If the 'Neighborhood' cell is 'Not assigned' but the borough is known, the borough name is used as the neighborhood.

In [7]:
def scrape_table(trows):
    # Fill the global dict_ with one entry per usable table row
    for row in trows:
        # Split the row text into [postcode, borough, neighborhood]
        temp_row = row.text.strip().split('\n')
        if temp_row[1] == 'Not assigned':
            # No borough: skip the row
            pass
        elif temp_row[2] == 'Not assigned':
            # No neighborhood: reuse the borough name
            dict_['PostalCode'].append(temp_row[0])
            dict_['Borough'].append(temp_row[1])
            dict_['Neighborhood'].append(temp_row[1])
        else:
            dict_['PostalCode'].append(temp_row[0])
            dict_['Borough'].append(temp_row[1])
            dict_['Neighborhood'].append(temp_row[2])
    print('Done scraping!')
    return None

In [8]:
scrape_table(trows[1:])

Done scraping!
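
The same filtering rules can also be written as a function that returns its results instead of appending to a global dictionary, which makes it easier to try on a few hand-written rows. This is only a sketch: `rows_to_records` is a hypothetical helper, not part of the notebook, and it takes already-split field lists rather than BeautifulSoup rows. The sample rows are written by hand for the example.

```python
def rows_to_records(rows):
    """Apply the notebook's rules to rows given as [postal_code, borough, neighborhood]."""
    records = []
    for postal_code, borough, neighborhood in rows:
        if borough == 'Not assigned':
            continue  # skip postal codes with no borough
        if neighborhood == 'Not assigned':
            neighborhood = borough  # fall back to the borough name
        records.append({'PostalCode': postal_code,
                        'Borough': borough,
                        'Neighborhood': neighborhood})
    return records

# Hand-written sample rows covering the three cases.
sample_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],
]
records = rows_to_records(sample_rows)
print(records)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)`.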


Turn our dictionary into a pandas DataFrame and take a look.

In [9]:
df = pd.DataFrame(dict_)
df.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


We need to join all the neighborhoods that share the same PostalCode and Borough and create a new dataframe.

In [10]:
# Group the neighborhoods by PostalCode and Borough and join them
new_df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x : ', '.join(x))

# Create a new data frame
new_df = pd.DataFrame(new_df)

# Create a new index
new_df = new_df.reset_index()
new_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Check a few rows to confirm the result.

In [11]:
new_df[new_df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"


In [12]:
new_df[new_df['Borough'] == new_df['Neighborhood']]

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park


Save the result to neighborhoods_toronto.csv.

In [None]:
new_df.to_csv('neighborhoods_toronto.csv')
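
Note that `to_csv` writes the row index as an unnamed first column by default, so reading the file back yields an extra `Unnamed: 0` column. Whether that matters depends on how the file is used later. The sketch below writes to an in-memory buffer instead of disk (an assumption made so the example leaves no files behind) and shows the difference `index=False` makes; `demo` is a one-row stand-in for `new_df`.

```python
import io
import pandas as pd

demo = pd.DataFrame({'PostalCode': ['M1B'], 'Borough': ['Scarborough'],
                     'Neighborhood': ['Rouge, Malvern']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# With index=False only the three data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```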

In [13]:
num_rows = new_df.shape[0]
print('Our data set has %s rows.' %(num_rows))

Our data set has 103 rows.
