# Capstone Project: Segmenting and Clustering Neighborhoods in Toronto

## Part 2 - Data Augmentation - adding Latitude and Longitude

**NOTE**
I was unable to retrieve the data using GeoCoder (the while loop hit the daily limit), so in this first part I've loaded the provided CSV file to augment the previous table with the Lat/Lon data. Later on I've used a different service, which produced reasonable results when checked against the supplied CSV file, so I've used those.

---
#### Import required modules:

In [1]:
import pandas as pd
import json
import requests # library to handle requests
import time # the data source has a max calls per second so the loop needs to be slowed

#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

#### Import the Coordinate CSV provided and join to the DataFrame generated in part one

In [2]:
df_tpc=pd.read_csv('Toronto_Neighborhoods_Cleaned.csv').drop('Unnamed: 0', axis=1) # csv created in previous section
df_coord=pd.read_csv('Geospatial_Coordinates.csv') # csv provided with list of coords
df_tpc=df_tpc.merge(df_coord, on='Postal Code') # equiv to sql inner join
df_tpc

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
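
Because `merge` defaults to an inner join, any postal code missing from either file is dropped without warning. A minimal sketch of how that could be checked, using two tiny made-up frames (the `M0X` code and the `left`/`right` names are hypothetical) and the `indicator` option of `merge`:

```python
import pandas as pd

# Toy stand-ins for df_tpc and df_coord; 'M0X' is a made-up code with no match
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Nowhere']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join with indicator=True adds a '_merge' column recording each row's origin
checked = left.merge(right, on='Postal Code', how='left', indicator=True)

# Rows flagged 'left_only' are the ones an inner join would have dropped
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty list here would mean every postal code in the first table found its coordinates.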


---
In the next section I created a free account at LocationIQ and used their API to find the coordinates. The API was mostly stable, but I've used a try/except setup and moved the list reset outside the call, so that I could build up the answer over multiple calls if needed.

**NOTE**
While doing the API calls myself I found a "mistake" in the data: one of the postal codes does not belong to the Toronto Municipality (it's in another township). This may have changed since this was set up. **I have removed this row and hence my dataset has one row fewer than the CSV provided for the course.**

In [3]:
def get_coords(post_code):
    """Return [lat, lon] for a postal code using the LocationIQ search API."""

    url = "https://us1.locationiq.com/v1/search.php"

    data = {
        'key': '0dd3eb043b17af',
        'postalcode': post_code,
        'City': 'Toronto',
        'countrycode': 'CA',
        'format': 'json'
    }

    response = requests.get(url, params=data)

    # decode the JSON once and take the first match
    first_hit = response.json()[0]
    lat = first_hit['lat']
    lon = first_hit['lon']

    return [lat, lon]
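
The indexing in `get_coords` fails with an exception whenever the response is not a non-empty list. As a rough sketch of a more explicit version of that step, the hypothetical helper `extract_lat_lon` below takes the already decoded JSON and returns floats. It assumes the payload is a list of dicts with `lat` and `lon` keys, which is how the function above reads it; the sample payload is invented for illustration and is not real API output.

```python
def extract_lat_lon(results):
    """Return (lat, lon) as floats from a decoded search response.

    `results` is assumed to be a list of dicts with 'lat' and 'lon' keys,
    matching how get_coords indexes response.json().
    """
    if not isinstance(results, list) or not results:
        raise ValueError("no match returned for this query")
    first_hit = results[0]
    # float() accepts both numeric strings and numbers
    return float(first_hit['lat']), float(first_hit['lon'])


# Hypothetical payload for illustration only
sample = [{'lat': '43.7545', 'lon': '-79.33'}]
print(extract_lat_lon(sample))
```

Raising a `ValueError` with a message makes it easier to tell "no result" apart from a network problem when the call sits inside a try/except.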

In [4]:
rows = df_tpc.shape[0]
max_attempts = 1000
lat_list = [0]*rows  # 0 means "not fetched yet"
lon_list = [0]*rows

In [5]:
for n in range(rows):  # cover every row, including the last one
    counter=0
    
    while lat_list[n]==0 and counter<max_attempts:   
        
        try:
            loca = df_tpc['Postal Code'][n]
            lat, lon = get_coords(loca)
            lat_list[n]=lat
            lon_list[n]=lon
        except Exception:
            # mark the row as failed ('R' rows are removed later);
            # this also ends the while loop for this row
            lat_list[n]='R'
            lon_list[n]='R'

        time.sleep(0.7)
        counter+=1
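
The loop above combines a pause between calls with a cap on the number of attempts. A minimal sketch of the same idea as a reusable helper follows; `fetch_with_retry`, its default values, and the `flaky_lookup` stand-in are all hypothetical and only illustrate the pattern without touching the network.

```python
import time


def fetch_with_retry(fetch, arg, max_attempts=3, delay=0.7):
    """Call fetch(arg) until it succeeds or max_attempts is reached.

    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(arg)
        except Exception:
            # Pause before the next attempt to stay under the rate limit
            if attempt < max_attempts - 1:
                time.sleep(delay)
    return None


# Hypothetical flaky lookup: fails twice, then succeeds
calls = {'count': 0}


def flaky_lookup(post_code):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError("temporary failure")
    return ['43.7545', '-79.33']


print(fetch_with_retry(flaky_lookup, 'M3A', delay=0))
```

Returning `None` for a lookup that never succeeds keeps the "failed" marker separate from the "not fetched yet" value.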

In [6]:
df_tpc.drop(['Latitude', 'Longitude'], axis=1, inplace=True)
df_tpc['Latitude']=lat_list
df_tpc['Longitude']=lon_list
df_tpc=df_tpc[df_tpc['Latitude']!='R']

df_tpc.shape

(102, 5)
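
The `'R'` marker mixes text with coordinates in the same column. One alternative, sketched here on made-up values (the `M0X` row is hypothetical), is to convert the columns with `pd.to_numeric(..., errors='coerce')` so that failed lookups become `NaN` and can be dropped with `dropna`:

```python
import pandas as pd

# Made-up rows: coordinates as strings, 'R' marking a failed lookup
df = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0X'],
                   'Latitude': ['43.7545', '43.7276', 'R'],
                   'Longitude': ['-79.33', '-79.3148', 'R']})

for col in ['Latitude', 'Longitude']:
    # Anything that is not a number (such as 'R') becomes NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop rows where either coordinate is missing
df = df.dropna(subset=['Latitude', 'Longitude'])
print(df.shape)
```

This also leaves both columns with a numeric dtype, which later distance calculations or map plotting would need.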

In [7]:
df_tpc.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 |
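
To put a number on how closely the two sources agree, one option is the great-circle (haversine) distance between the two coordinate pairs for the same postal code. The sketch below is only an illustration: `haversine_km` is a hypothetical helper, and the mean Earth radius of 6371 km is an approximation. It compares the two M3A rows from the tables above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


# M3A (Parkwoods): supplied CSV value vs. the LocationIQ value
distance = haversine_km(43.753259, -79.329656, 43.7545, -79.33)
print(round(distance, 3))
```

For this row the two points are well under a kilometre apart; applying the same function to every row would show whether any postal code disagrees badly.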


In [8]:
df_tpc.to_csv('TN_C_nLatLon.csv', sep = ',', header=df_tpc.columns)
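
By default `to_csv` also writes the DataFrame index, which is why an `Unnamed: 0` column had to be dropped when the part-one CSV was loaded. A small sketch of the difference, using an in-memory buffer and a made-up one-row frame instead of a real file:

```python
import io

import pandas as pd

# Made-up single-row frame standing in for df_tpc
df = pd.DataFrame({'Postal Code': ['M3A'],
                   'Latitude': [43.7545],
                   'Longitude': [-79.33]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
print(pd.read_csv(without_index).columns.tolist())
```

Writing with `index=False` (or reading with `index_col=0`) avoids the extra `drop` step in the next notebook.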