## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada
#### This notebook is part of the Coursera final capstone project. It explores, segments, and clusters the neighborhoods of Toronto based on postal code and borough information.

### Part 1: Install Required Packages, Scrape the Web Page, Create and Clean the Dataframe

#### Step 1: Install and import required packages and libraries.

In [1]:
# install third-party packages (notebook shell commands)
!pip install beautifulsoup4
!pip install lxml
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

import re # regular expressions for parsing the scraped table cells
import random # library for random number generation

import requests # library to handle HTTP requests
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import lxml.html as lh # HTML parsing
from bs4 import BeautifulSoup # for scraping web page contents
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML in the notebook
from IPython.display import HTML, Image, display_html

# transforming JSON into a pandas dataframe
# (top-level import; the pandas.io.json path is deprecated since pandas 1.0)
from pandas import json_normalize

import folium # plotting library
from sklearn.cluster import KMeans # k-means clustering
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

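
Running the install commands on every execution is slow when the packages are already present. The following is a minimal sketch, not part of the original notebook, that checks which of the libraries imported above are missing from the current environment. The `REQUIRED` list and the `missing_packages` helper are hypothetical names introduced for illustration. Note that the list uses import names, which can differ from the install names (`beautifulsoup4` is imported as `bs4`, `scikit-learn` as `sklearn`).

```python
import importlib.util

# Import names (not pip/conda names) of the libraries used in this notebook.
REQUIRED = ["requests", "pandas", "numpy", "bs4", "lxml", "geopy", "folium", "sklearn"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints a list of the packages that still need to be installed (possibly empty).
print(missing_packages(REQUIRED))
```

Only the packages reported as missing would then need a `!pip install` or `!conda install` line.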

#### Step 2: Scrape the data from the source URL (Wikipedia), then wrangle, clean, and load it into a pandas dataframe so that it is in a structured format.

In [2]:
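url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# each table cell holds "<postcode> <borough> (<neighbourhoods>)" inside a <p> tag
pattern = re.compile(r"^(M\S+)\s*([^(]+)(?:\s*(.*))?")

data = []
for p in soup.select("td > p"):
    text = p.get_text(strip=True, separator=" ")
    match = pattern.search(text)
    if match is None:
        continue  # skip cells that do not start with a postal code
    post_code, borough, neighbourhood = match.groups()
    borough = borough.strip()
    # cells without a parenthesized neighbourhood list get a placeholder
    neighbourhood = (neighbourhood or "Not Assigned").strip("() ")
    # any remaining inner parentheses become "/" separators
    neighbourhood = neighbourhood.replace("(", "/").replace(")", "/")

    data.append((post_code, borough, neighbourhood))

df = pd.DataFrame(data, columns=["Postcode", "Borough", "Neighborhood"])
print(df)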

    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
..       ...               ...   
175      M5Z      Not assigned   
176      M6Z      Not assigned   
177      M7Z      Not assigned   
178      M8Z         Etobicoke   
179      M9Z      Not assigned   

                                          Neighborhood  
0                                         Not Assigned  
1                                         Not Assigned  
2                                            Parkwoods  
3                                     Victoria Village  
4                           Regent Park / Harbourfront  
..                                                 ...  
175                                       Not Assigned  
176                                       Not Assigned  
177                                       Not Assigned  
178  Mimico NW / The 

In [3]:
df.head()

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not Assigned |
| 1 | M2A | Not assigned | Not Assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
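
To see how the regular expression in Step 2 splits a table cell into its three fields, the sketch below applies the same pattern to two sample strings without any network access. The sample strings and the `parse_cell` helper are assumptions made for illustration; the exact text returned by the live page may differ in spacing.

```python
import re

# Same pattern as the scraping cell: postal code, borough, optional "(neighbourhoods)".
PATTERN = re.compile(r"^(M\S+)\s*([^(]+)(?:\s*(.*))?")


def parse_cell(text):
    """Split one table-cell string into (postcode, borough, neighbourhood)."""
    match = PATTERN.search(text)
    if match is None:
        return None  # the text does not start with a postal code
    post_code, borough, neighbourhood = match.groups()
    # An empty third group means the cell has no neighbourhood list.
    neighbourhood = (neighbourhood or "Not Assigned").strip("() ")
    neighbourhood = neighbourhood.replace("(", "/").replace(")", "/")
    return post_code, borough.strip(), neighbourhood


# Hypothetical cell texts modelled on the rows shown above.
samples = [
    "M1A Not assigned",
    "M5A Downtown Toronto (Regent Park / Harbourfront)",
]
for sample in samples:
    print(parse_cell(sample))
```

The borough group `[^(]+` consumes everything up to the first opening parenthesis, which is why cells without parentheses end up with the whole remainder as the borough and the placeholder as the neighbourhood.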


#### Step 3: Remove rows whose borough is 'Not assigned' and reset the dataframe index

In [4]:
# keep only rows with an assigned borough, then renumber the index from 0
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway / Montgomery Road / Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business reply mail Processing Ce... | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South / King's Mill Park / Sunnylea /... |
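
The filtering in Step 3 can be tried on a small dataframe first. This is a minimal sketch on three rows taken from the tables above; the `toy` and `clean` names are hypothetical. It uses a boolean mask with `reset_index(drop=True)`, which is one common way to drop rows and renumber the index in a single expression.

```python
import pandas as pd

# Three rows copied from the scraped data; the first has no assigned borough.
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighborhood": ["Not Assigned", "Parkwoods", "Victoria Village"],
    }
)

# Keep rows with an assigned borough; drop=True discards the old index
# instead of adding it back as a column.
clean = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(clean)
```

Without `reset_index`, the remaining rows would keep their original labels (1 and 2 here), which leaves gaps in the index.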


#### Step 4: Print the number of rows and columns in the dataframe

In [5]:
df.shape

(103, 3)
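
`df.shape` returns a `(rows, columns)` tuple, so the two values can be unpacked and reused in later checks. The sketch below shows this on a toy dataframe built from three rows of the cleaned table; the uniqueness check on `Postcode` is an assumed sanity check, not something the original notebook performs.

```python
import pandas as pd

# Toy rows copied from the cleaned table; the real dataframe has 103 rows.
toy = pd.DataFrame(
    [
        ("M3A", "North York", "Parkwoods"),
        ("M4A", "North York", "Victoria Village"),
        ("M5A", "Downtown Toronto", "Regent Park / Harbourfront"),
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")

# Assumed sanity check: each postal code should appear only once.
print(toy["Postcode"].is_unique)
```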