# Segmenting and Clustering Neighborhoods in Toronto

##### Index of the notebook.
1. _Information retrieval from Wikipedia and storing into database,_
2. _Add neighbourhood latitude and longitude to the database,_
3. _Explore and cluster the neighborhoods in Toronto._

#### 1. Information retrieval from Wikipedia and storing into database

In [None]:
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# Retrieve the HTML code and create a BeautifulSoup object.
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = str(req.get(wiki_url).text)
soup=BeautifulSoup(wiki_page,'html.parser')

# Create a list with the information contained in the table.
tag=soup.table
text=tag.get_text()
tmp_list=text.split('\n')
tmp_list2=tmp_list[1:-1]
new_list=[]
#print(tmp_list2) # Uncomment to inspect the list that the for loop below walks through.

for i in range(0,len(tmp_list2),5):
    new_list.append([tmp_list2[i+1],tmp_list2[i+2],tmp_list2[i+3]])


# Create the database.
df_tor=pd.DataFrame(new_list[1:])
df_tor.columns=new_list[0]
# Drop rows whose 'Borough' is 'Not assigned'.
df_tor.drop(df_tor[df_tor.Borough == 'Not assigned'].index, inplace=True)
# Where 'Neighbourhood' is 'Not assigned', use the 'Borough' name instead.
df_tor.loc[df_tor['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_tor.loc[df_tor['Neighbourhood'] == 'Not assigned', 'Borough']
# Reset the index to start at 0 after dropping rows.
df_tor.reset_index(drop=True, inplace=True)

df_tor.head(20)  # Show the first 20 rows of the database.

The code above uses BeautifulSoup to extract the text between the `<table>...</table>` tags that the Wikipedia page uses to build its table. The comments in the code explain the individual steps. Where a row has no neighbourhood assigned, the database assumes that the 'Neighbourhood' coincides with the 'Borough'.
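
The cell above splits the table text on newlines and relies on a fixed stride of five entries per row, which depends on the exact whitespace of the page. As a hedged alternative, the sketch below walks the `<tr>` rows and reads each cell directly. The `sample_html` string and the `table_to_rows` helper are hypothetical and written only for illustration; the sample merely mimics the three-column layout of the Wikipedia table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written sample that mimics the layout of the Wikipedia table (assumption).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first table in `html` as a list of rows (lists of cell strings)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.table.find_all('tr'):
        # Collect header and data cells, stripping surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    return rows

rows = table_to_rows(sample_html)
# The first row holds the column names, the remaining rows hold the data.
df_sample = pd.DataFrame(rows[1:], columns=rows[0])
print(df_sample)
```

Reading cell by cell avoids counting newlines, so the result does not change if the page adds or removes blank lines between tags.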

In [2]:
df_tor.shape

(212, 3)

#### 2. Add neighbourhood latitude and longitude to the database

In [None]:
# Merge the postal code table with the latitude/longitude data.

url_coord = 'http://cocl.us/Geospatial_data'
df_tor2 = pd.merge(left=df_tor,right=pd.read_csv(url_coord), how='left', left_on='Postcode', right_on='Postal Code')
df_tor2.drop('Postal Code',axis=1,inplace=True)
df_tor2.rename(columns={'Postcode':'Postal Code'},inplace=True)

The code above adds the latitude and longitude for each postal code by merging two databases. This is done because the geocoder routine (install geocoder first)

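```python
import geocoder

# postal_code is assumed to hold one postal code string from the database.
lat_lng_coords = None
# Keep querying until the geocoder returns coordinates.
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
print(lat_lng_coords)
```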

does not work, as anticipated in the assignment instructions.
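
A left merge keeps every row of the left table, so a postal code that is missing from the coordinates file would silently end up with empty latitude and longitude. The following is a minimal sketch of one way to check for that, using the `indicator` option of `pd.merge`. The two small tables and all their values are made up for illustration and are not real Toronto data.

```python
import pandas as pd

# Made-up toy tables: placeholder codes and coordinates, not real values.
codes = pd.DataFrame({'Postcode': ['A1A', 'B2B', 'C3C'],
                      'Borough': ['East', 'West', 'North']})
coords = pd.DataFrame({'Postal Code': ['A1A', 'B2B'],
                       'Latitude': [1.0, 2.0],
                       'Longitude': [-1.0, -2.0]})

# indicator=True adds a '_merge' column that records where each row came from.
merged = pd.merge(codes, coords, how='left',
                  left_on='Postcode', right_on='Postal Code',
                  indicator=True)

# Rows marked 'left_only' found no coordinates in the right table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty list would mean that every postal code received coordinates.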

In [None]:
df_tor2.head(12)