# Segmenting and Clustering Neighborhoods in Toronto: Part 1

## Part 1

This notebook covers the first part of the assignment. Before doing anything else, we install lxml and beautifulsoup4, the packages pandas relies on to parse HTML tables.

In [1]:
!conda install -c conda-forge lxml beautifulsoup4 --yes
print('Packages installed successfully!')

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    scikit-learn-0.20.1        |   py36h22eb022_0         5.7 MB
    liblapack-3.8.0            |      11_openblas          10 KB  conda-forge
    numpy-1.18.1               |   py36h95a1406_0         5.2 MB  conda-forge
    liblapacke-3.8.0           |      11_openblas          10 KB  conda-forge
    libopenblas-0.3.6          |       h5a2b251_2         7.7 MB
    scipy-1.4.1                |   py36h921218d_0        18.9 MB  conda-forge
    beautifulsoup4-4.8.2       |     

Now we can begin the assignment. The first step is to collect the Toronto neighborhood data from a Wikipedia page, using the read_html function in pandas. Because this function returns a list of data frames, one for each table found on the page, we also combine them into a single data frame with the concat function.

In [56]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url) 
df = pd.concat(dfs)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,Borough,Neighbourhood,Postcode
0,,,,,,,,,,,...,,,,,,,,Not assigned,Not assigned,M1A
1,,,,,,,,,,,...,,,,,,,,Not assigned,Not assigned,M2A
2,,,,,,,,,,,...,,,,,,,,North York,Parkwoods,M3A
3,,,,,,,,,,,...,,,,,,,,North York,Victoria Village,M4A
4,,,,,,,,,,,...,,,,,,,,Downtown Toronto,Harbourfront,M5A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,...,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,...,ON,MB,SK,AB,BC,NU/NT,YT,,,
3,A,B,C,E,G,H,J,K,L,M,...,P,R,S,T,V,X,Y,,,
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,...,ON,MB,SK,AB,BC,NU/NT,YT,,,
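
To see why the concatenated output is mostly empty and why the row index repeats, here is a minimal sketch using two small, made-up data frames in place of the tables parsed from the page. The names `postcodes` and `legend` and their contents are assumptions for illustration only; they are not the real page data.

```python
import pandas as pd

# Hypothetical stand-ins for two different tables parsed from one page
postcodes = pd.DataFrame({
    "Postcode": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})
legend = pd.DataFrame({0: ["NL", "A"], 1: ["NS", "B"]})

# concat stacks the rows and takes the union of the columns,
# filling every cell a table does not have with NaN
combined = pd.concat([postcodes, legend], sort=False)
print(combined)

# Each table keeps its own row labels, so the index restarts at 0
print(combined.index.tolist())

# One possible alternative: keep only the tables that have the column we need
wanted = [t for t in [postcodes, legend] if "Postcode" in t.columns]
print(len(wanted))
```

Filtering the list before concatenating is one way to avoid the empty columns altogether; selecting the needed columns afterwards, as this notebook does, reaches the same rows.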


The concatenated data frame holds every table from the Wikipedia page, so most of its columns are empty for any given row. The data we need are in the final three columns (Borough, Neighbourhood and Postcode), so we select and view them in the following step. We also take the opportunity to remove rows whose borough is 'Not assigned', as well as rows containing NaN values.

In [62]:
df = df[['Borough','Neighbourhood','Postcode']]
to_drop = ['Not assigned']
df = df[~df['Borough'].isin(to_drop)].dropna()
df.head()

Unnamed: 0,Borough,Neighbourhood,Postcode
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,North York,Lawrence Heights,M6A
6,North York,Lawrence Manor,M6A
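
The two filters applied above can be looked at one at a time. The sketch below uses a small, made-up data frame (`sample` is an assumption, not the real data) to show that the isin mask removes the 'Not assigned' rows, while dropna is what removes the rows of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one unassigned row and one all-NaN row
sample = pd.DataFrame({
    "Borough": ["Not assigned", "North York", np.nan, "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", np.nan, "Harbourfront"],
    "Postcode": ["M1A", "M3A", np.nan, "M5A"],
})

# True for every row whose borough is NOT in the list of values to drop;
# NaN is not in the list either, so the NaN row passes this mask
assigned_mask = ~sample["Borough"].isin(["Not assigned"])
print(assigned_mask.tolist())

# dropna then removes the rows that still contain missing values
cleaned = sample[assigned_mask].dropna()
print(cleaned)

# Optional: renumber the rows from 0 after filtering
cleaned_reset = cleaned.reset_index(drop=True)
print(cleaned_reset.index.tolist())
```

The original row labels survive the filtering, which is why the output above starts at row 2 rather than row 0.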


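Our next step is to group all neighborhoods that share the same postal code. We do this with the groupby function, grouping rows by postal code and borough and collecting the neighborhood names of each group into a list. Note that the duplicated filter with keep=False keeps only postal codes that appear on more than one row, so postal codes with a single neighborhood are not part of the result. We also rename the postal code column to PostalCode, so that it matches the template in the task instructions.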

In [67]:
df2 = df[df.duplicated('Postcode', keep=False)].groupby(['Postcode','Borough'])['Neighbourhood'].apply(list).reset_index()
df2.rename(columns={'Postcode': "PostalCode"}, inplace=True)
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1K,Scarborough,"[East Birchmount Park, Ionview, Kennedy Park]"
4,M1L,Scarborough,"[Clairlea, Golden Mile, Oakridge]"
5,M1M,Scarborough,"[Cliffcrest, Cliffside, Scarborough Village West]"
6,M1N,Scarborough,"[Birch Cliff, Cliffside West]"
7,M1P,Scarborough,"[Dorset Park, Scarborough Town Centre, Wexford..."
8,M1R,Scarborough,"[Maryvale, Wexford]"
9,M1T,Scarborough,"[Clarks Corners, Sullivan, Tam O'Shanter]"
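
For comparison, here is a hedged sketch of a variant of the grouping step on a small, made-up data frame (`sample` is an assumption). It groups without the duplicated filter, so postal codes with a single neighborhood are kept, and it joins the names into one comma-separated string instead of a list. This is an alternative, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical sample: M6A appears twice, the other postal codes once
sample = pd.DataFrame({
    "Postcode": ["M3A", "M6A", "M6A", "M5A"],
    "Borough": ["North York", "North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Harbourfront"],
})

# Group every row, then join each group's names into a single string
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
    .rename(columns={"Postcode": "PostalCode"})
)
print(grouped)

# With the duplicated filter, only postal codes on more than one row remain
shared_only = sample[sample.duplicated("Postcode", keep=False)]
print(shared_only["Postcode"].unique().tolist())
```

Whether single-neighborhood postal codes should be kept depends on what the task template expects, so it is worth checking the row count against it.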


The data frame now looks like the example provided. The final step is to use the shape attribute to look at the dimensions of our data frame, which are reported as (rows, columns).

In [66]:
df2.shape

(56, 3)

You have reached the end of the assignment! Thank you for following along.