# Toronto Suburb Clusters

## Part 1
Instruction: Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas dataframe.

### Import relevant packages

In [1]:
import requests
import lxml.html as lh
import pandas as pd

### Scrape the website

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
#Check the number of cells in each of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
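The `//tr` query matches rows from every table on the page, which is why the row lengths are worth checking. A minimal sketch of a narrower query follows, assuming the data table carries a `wikitable` class (an assumption about the page markup, not something the notebook confirms). The inline HTML is a made-up stand-in for `page.content`.

```python
import lxml.html as lh

# Hypothetical stand-in for page.content: two tables, only one holds the data.
html = """
<html><body>
<table class="navbox"><tr><td>nav</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""

doc = lh.fromstring(html)

# Restrict the search to rows inside tables whose class contains "wikitable".
rows = doc.xpath('//table[contains(@class, "wikitable")]//tr')

# Count the cells per row and collect the stripped cell text.
row_lengths = [len(r) for r in rows]
cells = [[c.text_content().strip() for c in r] for r in rows]
print(row_lengths)
print(cells[1])
```

With a query like this, the single-cell navigation row never reaches the parsing loop, so the length check becomes a safeguard rather than the only filter.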

### Convert table to pandas dataframe

1. Parse the table header

In [4]:
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each cell of the first row (the header), store the header text and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postal code
"
2:"Borough
"
3:"Neighborhood
"


2. Fill the columns with the row data

Each header was stored in a tuple along with an empty list; the loop below appends every cell of the remaining rows to the list of its column.

In [5]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Only attempt the conversion for columns after the first
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                #Leave non-numeric text unchanged
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

3. Check length of each column

In [6]:
[len(C) for (title,C) in col]

[181, 181, 181]

4. Create dataframe

In [7]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
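Because `col` is a list of `(title, values)` tuples, the dictionary step can also be sketched with `dict()` directly. The toy values below are illustrative stand-ins for the scraped data, and `df_toy` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for `col`: (header, values) pairs as produced by the scraping loop.
col = [
    ("Postal code\n", ["M3A\n", "M4A\n"]),
    ("Borough\n", ["North York\n", "North York\n"]),
    ("Neighborhood\n", ["Parkwoods\n", "Victoria Village\n"]),
]

# dict() accepts an iterable of key/value pairs, so no comprehension is needed.
df_toy = pd.DataFrame(dict(col))

print(df_toy.shape)
print(list(df_toy.columns))
```

The headers still carry their trailing newline at this point, which is why the column names are cleaned in the next cell.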

In [8]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)
df.head()

| | postal_code | borough | neighborhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | \n |
| 1 | M2A\n | Not assigned\n | \n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park / Harbourfront\n |


5. Remove \n at end of each cell

In [9]:
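#Match the newline character literally; with regex=False a raw string r'\n' would look for a backslash followed by "n"
df['postal_code'] = df['postal_code'].str.replace('\n', '', regex=False)
df['borough'] = df['borough'].str.replace('\n', '', regex=False)
df['neighborhood'] = df['neighborhood'].str.replace('\n', '', regex=False)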
df.head()

| | postal_code | borough | neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
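The same cleanup can be done for every column at once. A minimal sketch on toy data, assuming all columns hold strings; `df_toy` and `df_clean` are hypothetical names.

```python
import pandas as pd

# Toy rows that mimic the scraped cells, including trailing newlines.
df_toy = pd.DataFrame({
    "postal_code": ["M3A\n", "M4A\n"],
    "borough": ["North York\n", "North York\n"],
    "neighborhood": ["Parkwoods\n", " Victoria Village\n"],
})

# Apply Series.str.strip to every column; this removes leading and trailing
# whitespace, including the trailing newline.
df_clean = df_toy.apply(lambda s: s.str.strip())
print(df_clean.iloc[1].tolist())
```

Unlike a targeted replace, `str.strip` also removes stray spaces at either end of a cell, which may or may not be wanted.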


6. Remove postal codes that have no assigned borough.

Also remove the extra row picked up at the end of the table and the PO box and business reply mail entries.

In [51]:
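#Drop unassigned boroughs and the stray row at the end of the table
df = df[df.borough != 'Not assigned']
df = df[df.borough != 'Canadian postal codes']
#Drop PO box and business reply mail entries; copy so the assignments below act on an independent frame
df = df[df.neighborhood != 'Stn A PO Boxes']
df = df[df.neighborhood != 'Business reply mail Processing CentrE'].copy()
#Turn the " / " separators into ", "
df['neighborhood'] = df['neighborhood'].str.replace('/', ',', regex=False)
df['neighborhood'] = df['neighborhood'].str.replace(' ,', ',', regex=False)
df.reset_index(inplace = True, drop = True)
df.head()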

| | postal_code | borough | neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
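The chained filters can also be written as a single boolean mask with `isin`. This is a sketch on invented rows, not the notebook's own code; `drop_boroughs`, `drop_neighborhoods`, and `df_kept` are hypothetical names.

```python
import pandas as pd

# Invented rows covering the cases handled in this step.
df_toy = pd.DataFrame({
    "postal_code": ["M1A", "M5A", "M5W", "M3A"],
    "borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "North York"],
    "neighborhood": ["", "Regent Park / Harbourfront", "Stn A PO Boxes", "Parkwoods"],
})

drop_boroughs = ["Not assigned"]
drop_neighborhoods = ["Stn A PO Boxes"]

# Keep a row only if neither its borough nor its neighborhood is on a drop list.
keep = ~df_toy["borough"].isin(drop_boroughs) & ~df_toy["neighborhood"].isin(drop_neighborhoods)

# Copy before assigning so the original frame is left untouched.
df_kept = df_toy[keep].copy()
df_kept["neighborhood"] = df_kept["neighborhood"].str.replace(" / ", ", ", regex=False)
df_kept = df_kept.reset_index(drop=True)
print(df_kept)
```

Keeping the values to drop in lists makes it easier to extend the filter if further unwanted rows turn up.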


7. Set any unassigned neighborhood to the name of its borough

In [52]:
#df.sort_values('neighborhood', na_position = 'first')
#df.iloc[0:30,]
#df.iloc[30:70,]
#df.iloc[70:102,]
df.describe()

| | postal_code | borough | neighborhood |
|---|---|---|---|
| count | 101 | 101 | 101 |
| unique | 101 | 10 | 96 |
| top | M1V | North York | Downsview |
| freq | 1 | 24 | 4 |


Inspection of the rows shows no unassigned neighborhoods, so no replacement is needed.
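Note that `describe()` counts empty strings as values, so the count alone does not rule out blank neighborhoods. A minimal sketch of an explicit check and fill, using invented rows; `unassigned` and `n_unassigned` are hypothetical names.

```python
import pandas as pd

# Invented rows: the second one has a blank neighborhood.
df_toy = pd.DataFrame({
    "postal_code": ["M3A", "M7A"],
    "borough": ["North York", "Queen's Park"],
    "neighborhood": ["Parkwoods", ""],
})

# Treat a neighborhood as unassigned if it is missing, blank, or "Not assigned".
nbhd = df_toy["neighborhood"]
unassigned = nbhd.isna() | nbhd.str.strip().isin(["", "Not assigned"])
n_unassigned = int(unassigned.sum())

# Where the mask is True, take the borough value instead.
df_toy["neighborhood"] = nbhd.mask(unassigned, df_toy["borough"])
print(n_unassigned)
print(df_toy["neighborhood"].tolist())
```

If `n_unassigned` is zero on the real dataframe, the fill leaves every row unchanged.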

8. Use the .shape attribute to print the number of rows of your dataframe

In [53]:
df.head(15)

| | postal_code | borough | neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [46]:
df.shape

(101, 3)
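Since `shape` is a `(rows, columns)` tuple, the row count can be read from its first element. A small sketch on a toy frame (`df_toy` is a hypothetical name):

```python
import pandas as pd

# Toy frame with three rows and two columns.
df_toy = pd.DataFrame({
    "postal_code": ["M3A", "M4A", "M5A"],
    "borough": ["North York", "North York", "Downtown Toronto"],
})

# Unpack the tuple to get rows and columns separately.
n_rows, n_cols = df_toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```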