## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

### This notebook will be used for the 2nd Peer-graded Assignment of the Data Science Capstone course.

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#Question_1a">Q1a - Dataframe creation from Wikipedia web scraping</a></li>
        <li><a href="#Question_1b">Q1b - Cleaning of the dataframe</a></li>
        <li><a href="#Question_2">Q2 - Getting latitude and longitude</a></li>
     </ol>
</div>
<br>
<hr>



Let's load the required libraries.

In [1]:
# importing libraries
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn releases

### 1. Dataframe creation from Wikipedia web scraping

In [2]:
# importing more libraries
# urllib.request is the library we use to open URLs

import urllib.request

In [3]:
# opening the Wikipedia page that holds the table

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)


In [4]:
pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
     |████████████████████████████████| 122kB 9.2MB/s eta 0:00:01
Collecting soupsieve>1.2; python_version >= "3.0" (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/02/fb/1c65691a9aeb7bd6ac2aa505b84cb8b49ac29c976411c6ab3659425e045f/soupsieve-2.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.3 soupsieve-2.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
from bs4 import BeautifulSoup

In [6]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/bd/78/56a7c88a57d0d14945472535d0df9fb4bbad7d34ede658ec7961635c790e/lxml-4.6.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
     |████████████████████████████████| 5.5MB 8.2MB/s eta 0:00:01
Installing collected packages: lxml
Successfully installed lxml-4.6.2
Note: you may need to restart the kernel to use updated packages.


In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(url)

In [8]:
tables[0].head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
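
As a minimal offline sketch of what `pd.read_html` returns, the snippet below parses a small hand-written HTML table instead of fetching the URL. The HTML string is made up for illustration and only mirrors the column names of the Wikipedia page; the sketch assumes `lxml` is installed, as in the cell above.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia page: a single small HTML table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds
sample_tables = pd.read_html(StringIO(html))
sample = sample_tables[0]

print(len(sample_tables))
print(sample.columns.tolist())
```

Because the result is a list, the table of interest has to be picked by position, which is why the notebook works with `tables[0]`.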


### 2. Cleaning of the dataframe

In [9]:
# Dropping rows where Borough = "Not assigned"
df = tables[0]
df = df[df.Borough != "Not assigned"].reset_index()
df.head()

| | index | Postal Code | Borough | Neighbourhood |
|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods |
| 1 | 3 | M4A | North York | Victoria Village |
| 2 | 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


**1st Note:**
Combining rows that share a postal code is no longer required:
the Wikipedia 'List of postal codes of Canada: M' page has no duplicate postal codes.
If it did, the rows could be combined with:

`df.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(','.join).reset_index()`
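
As a minimal sketch of that `groupby` combination, the snippet below runs it on made-up rows in which one postal code appears twice. The `toy` frame is hypothetical, and `', '` is used as the separator to match how the page itself lists several neighbourhoods in one cell.

```python
import pandas as pd

# Made-up rows for illustration: M5A appears twice
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (Postal Code, Borough), neighbourhoods joined into one string
combined = (
    toy.groupby(['Postal Code', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)

print(combined)
```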


**2nd Note:** 
The Neighbourhood column does not contain any "Not assigned" values (see below).

In [10]:
# check whether the Neighbourhood column contains any "Not assigned" value
(df['Neighbourhood'] == "Not assigned").any()

False
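
Had the check returned `True`, one option would be to fall back to the borough name for those rows. The snippet below is a minimal sketch of that idea on made-up rows; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration: the first one has no neighbourhood assigned
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where Neighbourhood is "Not assigned", copy the Borough value instead
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

print(toy)
```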

In [11]:
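# clean dataframe: drop the index column
# note: drop() returns a new dataframe; df keeps the 'index' column unless the result is assigned back (df = df.drop(...))

df.drop(columns=['index'])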

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [12]:
# show the number of rows and columns of the dataframe
df.shape

(103, 4)
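
The fourth column counted here is the `index` column created by `reset_index()`, which is still present in `df`. A minimal sketch, on made-up rows, of two ways to end up without it: pass `drop=True` when resetting, or assign the result of `drop` back to the variable. The `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})
kept = toy[toy['Borough'] != 'Not assigned']

# Option 1: discard the old index while resetting
a = kept.reset_index(drop=True)

# Option 2: reset, then drop the 'index' column and keep the result
b = kept.reset_index()
b = b.drop(columns=['index'])

print(a.shape, b.shape)
```

Both options give the same two-column frame; the first simply avoids creating the extra column.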

### 3. Getting latitude and longitude

In [17]:
# importing pip (the geocoder package itself is installed in the next cell)
import pip

In [22]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
     |████████████████████████████████| 102kB 6.6MB/s eta 0:00:01
Collecting click (from geocoder)
  Using cached https://files.pythonhosted.org/packages/d2/3d/fa76db83bf75c4f8d338c2fd15c8d33fdd7ad23a9b5e57eb6c5de26b430e/click-7.1.2-py2.py3-none-any.whl
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Collecting future (from geocoder)
  Downloading https://files.pythonhosted.org/packages/45/0b/38b06fd9b92dc2b68d58b75f900e97884c45bedd2ff83203d933cf5851c9/future-0.18.2.tar.gz (829kB)
     |████████████████████████████████| 829kB 7.7MB/s eta 0:00:01
Building wheels for collected packages: future
  Building wheel for future (setup.py) ... [?25ld

In [38]:
import geocoder # import geocoder

# geocode one postal code at a time; here, the first one in df
postal_code = df['Postal Code'].iloc[0]

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
# note: without a working Google API key, g.latlng may stay None and this loop will not end
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
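
To get coordinates for every postal code rather than a single one, the lookup can be wrapped in a function with a bounded number of attempts. The snippet below is only a sketch under stated assumptions: `lookup_latlng` and `fake_geocode` are hypothetical names, and the fake geocoder returns made-up coordinates so that the example runs without network access. In the notebook, a function that calls `geocoder.google(address)` and returns `g.latlng` would take its place.

```python
import pandas as pd


def lookup_latlng(postal_code, geocode, max_tries=3):
    """Call geocode(address) up to max_tries times; return (lat, lng) or (None, None)."""
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = geocode(address)
        if coords is not None:
            return coords[0], coords[1]
    return None, None


# Hypothetical stand-in for a real geocoder; the coordinates are made up
def fake_geocode(address):
    made_up = {'M3A, Toronto, Ontario': [43.75, -79.33]}
    return made_up.get(address)


# Made-up frame with one code the fake geocoder knows and one it does not
toy = pd.DataFrame({'Postal Code': ['M3A', 'M4A']})
coords = [lookup_latlng(code, fake_geocode) for code in toy['Postal Code']]
toy['Latitude'] = [lat for lat, _ in coords]
toy['Longitude'] = [lng for _, lng in coords]

print(toy)
```

Bounding the number of attempts means a postal code that cannot be resolved ends up with missing coordinates instead of blocking the whole run.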