# Coursera Capstone Notebook

## Description

This notebook will be used for the capstone project.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Part 1: Segmenting and Clustering Neighborhoods in Toronto

This part of the notebook is dedicated to the 'Segmenting and Clustering Neighborhoods in Toronto' assignment.

--------------------------------------------------------------------------------------

Read data from Wikipedia and create a dataframe 'df' with it:

In [3]:
# read_html returns a list of DataFrames, one per HTML table on the page; take the first one
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df.head(20)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


Display the number of rows, the columns, and their dtypes:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal Code   180 non-null    object
 1   Borough       180 non-null    object
 2   Neighborhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


Now look for 'Not assigned' values in both the 'Borough' and 'Neighborhood' columns:

In [5]:
print("value_counts() in Neighborhood:\n")
print(df['Neighborhood'].value_counts())
print("\n\n")
print("value_counts() in Borough:\n")
print(df['Borough'].value_counts())

value_counts() in Neighborhood:

Not assigned                                                                                                                              77
Downsview                                                                                                                                  4
Don Mills                                                                                                                                  2
Rosedale                                                                                                                                   1
Humewood-Cedarvale                                                                                                                         1
                                                                                                                                          ..
Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Pa
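
As a complement to `value_counts()`, the placeholder can also be counted in every column at once. The sketch below uses a small illustrative frame (`sample`, not the real scraped table) and assumes the placeholder string is exactly 'Not assigned':

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative values, not the real data).
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M8A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Not assigned"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront", "Not assigned"],
})

# Comparing the whole frame to the placeholder gives a boolean frame;
# summing it counts the matches in each column.
placeholder_counts = (sample == "Not assigned").sum()
print(placeholder_counts)
```

This gives one number per column, which is easier to scan than the long `value_counts()` listing when only the placeholder matters.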

Replace 'Not assigned' values in the 'Borough' column with NaN:

In [6]:
# Assign the result back; inplace=True on a column selection may not modify df in newer pandas
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
print("value_counts() in Neighborhood:\n")
print(df['Neighborhood'].value_counts())
print("\n\n")
print("value_counts() in Borough:\n")
print(df['Borough'].value_counts())

value_counts() in Neighborhood:

Not assigned                                                                                                                              77
Downsview                                                                                                                                  4
Don Mills                                                                                                                                  2
Rosedale                                                                                                                                   1
Humewood-Cedarvale                                                                                                                         1
                                                                                                                                          ..
Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Pa
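
The replacement step can be verified on a toy frame. This is a minimal sketch with illustrative data (`sample` is not the real table); it assigns the result of `replace()` back to the column and then counts the missing values with `isna()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table (illustrative values only).
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
})

# Swap the placeholder string for a real missing value.
sample["Borough"] = sample["Borough"].replace("Not assigned", np.nan)

# isna() flags missing entries; summing the flags counts them.
missing = sample["Borough"].isna().sum()
print(missing)
```

Note that `value_counts()` skips NaN by default, so the replaced rows no longer appear in the Borough counts unless `dropna=False` is passed.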

The 'Borough' column now holds NaN in place of 'Not assigned'. Remove those rows with dropna():

In [7]:
# Number of NaN occurrences in 'Borough':
nan_occur = df['Borough'].isna().sum()
print("Number of NaN occurrences previous to dropna() is: ",nan_occur)
# Drop only the rows whose 'Borough' is NaN:
df = df.dropna(subset=['Borough'])
nan_occur = df['Borough'].isna().sum()
print("Now the number of NaN is: ",nan_occur)
df.head()

Number of NaN occurrences previous to dropna() is:  77
Now the number of NaN is:  0


|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
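
Called without arguments, `dropna()` removes a row if any column is missing. The sketch below, on illustrative data (`sample` is a toy frame), shows how the `subset` parameter restricts the check to 'Borough', which is what this cleaning step is meant to do:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing Borough in one row and a missing Neighborhood in another.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": [np.nan, "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", np.nan],
})

# Drop only rows whose Borough is missing; a NaN in another column does not remove the row.
cleaned = sample.dropna(subset=["Borough"])
print(cleaned["Postal Code"].tolist())

# For comparison: without subset, any NaN in the row removes it.
strict = sample.dropna()
print(strict["Postal Code"].tolist())
```

In the notebook's data only 'Borough' contains NaN at this point, so both forms keep the same rows; the `subset` form simply states the intent more precisely.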


In [8]:
print("value_counts() in Neighborhood:\n")
print(df['Neighborhood'].value_counts())
print("\n\n")
print("value_counts() in Borough:\n")
print(df['Borough'].value_counts())

value_counts() in Neighborhood:

Downsview                                    4
Don Mills                                    2
Lawrence Park                                1
Wexford, Maryvale                            1
Roselawn                                     1
                                            ..
Regent Park, Harbourfront                    1
Kensington Market, Chinatown, Grange Park    1
Parkview Hill, Woodbine Gardens              1
Woburn                                       1
Davisville                                   1
Name: Neighborhood, Length: 99, dtype: int64



value_counts() in Borough:

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
East York            5
York                 5
Mississauga          1
Name: Borough, dtype: int64


Display the number of rows, the columns, and their dtypes again:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 2 to 178
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal Code   103 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
dtypes: object(3)
memory usage: 3.2+ KB


Group the dataframe by Postal Code and join the Neighborhoods that share a Postal Code:

In [10]:
# 'first' keeps one Borough per Postal Code; an identity lambda fails when a group has several rows
grouped = df.groupby('Postal Code').agg(Borough=('Borough','first'),
                              Neighborhood=('Neighborhood',lambda x: ', '.join(x))).reset_index()

agg() followed by reset_index() already returns a regular DataFrame, so assign it back to df:

In [11]:
df = grouped
df

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |
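
In the table used here each Postal Code already occupies a single row, so the join has little visible effect. To illustrate what the grouping does when a code is spread over several rows, here is a minimal sketch on made-up data (`sample` and its split neighborhoods are assumptions for the example):

```python
import pandas as pd

# Toy frame where M5A is split across two rows (illustrative layout only).
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

merged = (
    sample.groupby("Postal Code")
    .agg(
        # Every row of a group shares one Borough, so keep the first.
        Borough=("Borough", "first"),
        # Concatenate the group's neighborhoods into one comma-separated string.
        Neighborhood=("Neighborhood", lambda names: ", ".join(names)),
    )
    .reset_index()
)
print(merged)
```

The result has one row per Postal Code, sorted by code, with the neighborhoods of M5A combined into 'Regent Park, Harbourfront'.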


Shape of the resulting dataframe:

In [12]:
df.shape

(103, 3)
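
Beyond the shape, a few quick checks can confirm that the cleaning behaved as intended. The helper below (`check_cleaned`, a hypothetical name introduced for this sketch) runs on a toy frame with illustrative values; applied to the notebook's `df` it would be expected to report unique postal codes and no missing boroughs:

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize basic sanity checks on the cleaned table."""
    return {
        "rows": frame.shape[0],
        # After grouping, each Postal Code should appear exactly once.
        "unique_postal_codes": bool(frame["Postal Code"].is_unique),
        # After dropna, no Borough should be missing.
        "no_missing_borough": bool(frame["Borough"].notna().all()),
    }

# Toy stand-in for the cleaned table (illustrative values only).
sample = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek", "Woburn"],
})

report = check_cleaned(sample)
print(report)
```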