# Coursera Capstone
## Segmenting and Clustering Neighborhoods in Toronto

__created by marodatavision__ <br/>
*Python Version:*

In [1]:
!python --version

Python 3.7.4


In [18]:
# all imports
import numpy as np
import pandas as pd
import pickle

Let's explore the Wikipedia page using pandas.

In [3]:
scraped_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

`pd.read_html` returns a list of pandas dataframes, one for each HTML table found on the page.
Now let's take a look at the list:

In [4]:
print("The list contains {} items.".format(len(scraped_list)))

The list contains 3 items.


The first dataframe looks like the table in the submission instructions, so let's focus on that one first.

In [5]:
scraped_list[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Here are the other two dataframes in the list, just for reference.

In [6]:
scraped_list[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


In [7]:
scraped_list[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y
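
Picking the table by position works here, but the position could change if the page layout changes. A minimal sketch, assuming the wanted table is the one whose columns include `Postal Code`, `Borough` and `Neighbourhood`; the helper `pick_table` and the toy frames are hypothetical and only stand in for `scraped_list`:

```python
import pandas as pd

REQUIRED_COLUMNS = {"Postal Code", "Borough", "Neighbourhood"}


def pick_table(tables):
    """Return the first dataframe whose columns include all required names."""
    for table in tables:
        if REQUIRED_COLUMNS.issubset(set(table.columns)):
            return table
    raise ValueError("no table with the expected columns found")


# Toy stand-ins for the scraped tables (not the real Wikipedia data)
toy_tables = [
    # navigation-style table with integer column labels
    pd.DataFrame([["NL", "NS"], ["A", "B"]]),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(toy_tables)
print(list(postal_table.columns))
```

With the real data this would be called as `pick_table(scraped_list)`.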


## Now we should wrangle and clean the data
*Here are the steps we should follow:*
1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
1. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
1. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
1. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
1. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
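
The cleaning steps above can be sketched end to end on a small toy table. This is only an illustration under the assumption that the input has one row per neighbourhood, as the instructions describe; the values are made up and are not the scraped data:

```python
import pandas as pd

# Toy data in the "one row per neighbourhood" layout described in the steps
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Keep only rows with an assigned borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# A "Not assigned" neighbourhood takes the name of its borough
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighbourhood"] = cleaned.loc[unassigned, "Borough"]

# One row per postal code, neighbourhoods joined with a comma
cleaned = (
    cleaned.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(cleaned)
print(cleaned.shape)
```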

### Step 1

In [8]:
# The first step is already done: the scraped table has the three required columns
df = scraped_list[0]
print("Here are the columns of the first dataframe \
(the main one which we will use in the further steps): {}".format(", ".join(df.columns)))

Here are the columns of the first dataframe (the main one which we will use in the further steps): Postal Code, Borough, Neighbourhood


### Step 2

In [9]:
# Let's work on step 2
# get all rows with no assigned borough
df_no_bo = df[df['Borough'] == 'Not assigned']
# list of indices without an assigned borough
list_of_indeces_to_drop = df_no_bo.index.values

In [10]:
# drop the rows without an assigned borough from the main dataframe
df_dropped = df.drop(list_of_indeces_to_drop, axis=0).reset_index()

In [11]:
df_dropped

Unnamed: 0,index,Postal Code,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
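
The extra `index` column in the output above comes from `reset_index()`, which keeps the old index as a column by default. A small sketch on toy data, assuming the three-column layout from step 1 is what is wanted, shows the difference `drop=True` makes and uses a boolean mask instead of a list of indices:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows that have an assigned borough
assigned = toy["Borough"] != "Not assigned"

# Old index becomes a column named "index"
kept_old_index = toy[assigned].reset_index()
# Old index is discarded
fresh_index = toy[assigned].reset_index(drop=True)

print(kept_old_index.shape)  # (2, 4)
print(fresh_index.shape)     # (2, 3)
```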


### Step 3

In [12]:
# Take a look at the entry from step 3 of the Coursera submission instructions:
df_dropped[df_dropped['Postal Code'] == 'M5A']

Unnamed: 0,index,Postal Code,Borough,Neighbourhood
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Looks like this preprocessing step is also already done. But let's group the dataframe __df_dropped__ by postal code and compare the row counts to make sure that all neighbourhoods are collected under a __unique Postal Code__.

In [13]:
df_dropped.groupby('Postal Code')['Neighbourhood'].apply(lambda x: ', '.join(x)).to_frame()

Unnamed: 0_level_0,Neighbourhood
Postal Code,Unnamed: 1_level_1
M1B,"Malvern, Rouge"
M1C,"Rouge Hill, Port Union, Highland Creek"
M1E,"Guildwood, Morningside, West Hill"
M1G,Woburn
M1H,Cedarbrae
...,...
M9N,Weston
M9P,Westmount
M9R,"Kingsview Village, St. Phillips, Martin Grove ..."
M9V,"South Steeles, Silverstone, Humbergate, Jamest..."


Comparing the two dataframes, you can see that the row counts match.
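
The comparison can also be done in code instead of by eye. A minimal sketch on toy data, assuming the column is named `Postal Code` as in the scraped table:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

grouped = toy.groupby("Postal Code")["Neighbourhood"].apply(", ".join).to_frame()

# If grouping does not reduce the row count, every postal code occurs exactly once
same_row_count = len(grouped) == len(toy)
codes_are_unique = toy["Postal Code"].is_unique

print(same_row_count, codes_are_unique)
```

In the notebook, `toy` would be replaced by `df_dropped`.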

### Step 4

In [14]:
# look for not assigned Neighbourhoods
df_dropped[df_dropped['Neighbourhood'].str.contains("assign")]

Unnamed: 0,index,Postal Code,Borough,Neighbourhood


In [15]:
# look for Neighbourhoods which have the same name as the borough
df_dropped[df_dropped['Neighbourhood'] == df_dropped['Borough']]

Unnamed: 0,index,Postal Code,Borough,Neighbourhood


In [16]:
# checking for NaN values (a comparison with == np.nan never matches, so use isna())
df_dropped[df_dropped['Neighbourhood'].isna()]

Unnamed: 0,index,Postal Code,Borough,Neighbourhood


Looks like there are no __not assigned__ neighbourhoods in the dataframe.
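
Two details are worth spelling out for this step. An equality comparison with `np.nan` never matches, so missing values have to be found with `isna()`. And if unassigned neighbourhoods had been present, they could be filled from the borough column. A sketch on toy data (the values are illustrative, not from the scraped table):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Parkwoods", "Not assigned", np.nan],
})

# NaN is not equal to anything, including itself, so this mask is all False
eq_mask = toy["Neighbourhood"] == np.nan
# isna() is the reliable way to find missing values
na_mask = toy["Neighbourhood"].isna()

# Replace missing or "Not assigned" neighbourhoods with the borough name
needs_fill = na_mask | (toy["Neighbourhood"] == "Not assigned")
toy["Neighbourhood"] = toy["Neighbourhood"].mask(needs_fill, toy["Borough"])

print(int(eq_mask.sum()), int(na_mask.sum()))
print(toy)
```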

### Step 5

Now let's dump the dataframe to disk and print its number of rows.

In [17]:
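# pickle.dump needs the object and an open binary file; the file name is an assumption
with open('df_dropped.pkl', 'wb') as f:
    pickle.dump(df_dropped, f)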
print("Here is the shape: {}".format(df_dropped.shape))
print("And here is the row count: {}".format(df_dropped.shape[0]))

Here is the shape: (103, 4)
And here is the row count: 103
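
To confirm that a pickled dataframe can be read back unchanged, a round trip can be sketched as follows. The toy frame and the file name `toy.pkl` are assumptions for illustration; a temporary directory is used so nothing is left on disk:

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

# Write the dataframe, then load it again from the same file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(toy))
```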
