<a href="https://colab.research.google.com/github/lugoll/Coursera_Capstone/blob/main/Similarities%20of%20the%20neighborhood%20structure%20-%20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarities of the neighborhood structure of Capital Cities

The full report on this topic can be found [here](https://github.com/lugoll/Coursera_Capstone/blob/main/Report.pdf).

The two cities Toronto and New York City are compared based on their neighborhood data and the venues in each neighborhood. Using the coordinates of each neighborhood, the venues are retrieved from the Foursquare API. This data will then be clustered and drawn on a map to show similarities or dissimilarities between the cities.


In [16]:
import pandas as pd
import numpy as np

!pip install geocoder > /dev/null
import geocoder 

from sklearn.cluster import KMeans

!pip install geopy > /dev/null
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium > /dev/null
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

import json
import requests

print("Imports successful")

Imports successful


First, define some constants:

In [14]:
CLIENT_ID = '4D0X1SORGPPFNYUMHTWPLTDZYXTPZLVAJ55LJQM3VN1A3YOX'
CLIENT_SECRET = 'QP32KGL54O1HJRTZPDPRDJQJC5PPUDSMYSWOZ1PS112HQMRQ'
VERSION = '20201016'
LIMIT = 100
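
The constants above embed the Foursquare credentials directly in the notebook. As a minimal sketch of an alternative, the credentials could be read from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for this example and are not part of the original notebook.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping of environment variables.

    The variable names used here are hypothetical, chosen for this sketch.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID", "")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret


# Demonstration with an explicit mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
}
demo_id, demo_secret = load_credentials(demo_env)
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would then replace the hard-coded strings, provided the variables are set before the kernel starts.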

Define a function for retrieving venues from Foursquare:

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
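
The function above indexes `categories[0]` for every venue, which raises an `IndexError` if a venue comes back with an empty category list. Whether the API ever does this is an assumption. The sketch below shows one way to flatten the response items defensively. The helper `extract_venues` and the sample items are made up for illustration and mirror only the keys that `getNearbyVenues` already reads.

```python
def extract_venues(name, lat, lng, items):
    """Flatten response items into rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing on an empty category list
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items shaped like the entries the function above iterates over
sample_items = [
    {"venue": {"name": "Example Park",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 43.76, "lng": -79.34},
               "categories": []}},
]
sample_rows = extract_venues("Parkwoods", 43.753259, -79.329656, sample_items)
```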

## Toronto Data

Get the postal codes from Wikipedia

In [2]:
postal_codes_df = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
postal_codes_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Remove postal codes that are not assigned to a borough

In [4]:
postal_codes_clean = postal_codes_df[postal_codes_df['Borough'] != 'Not assigned']

Add the coordinates provided by the Coursera course

In [6]:
backup_df = pd.read_csv('http://cocl.us/Geospatial_data')
toronto_df = postal_codes_clean.merge(backup_df, on='Postal Code')

# Postal Code isn't necessary any more
toronto_df = toronto_df.drop(columns='Postal Code')
toronto_df.rename({'Neighbourhood': 'Neighborhood'},axis=1, inplace=True)
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
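
`merge` performs an inner join by default, so any postal code without a matching coordinate row is dropped silently. A minimal sketch of how to check for such rows is shown below, using a left join with `indicator=True`. The toy frames, including the postal code `M9Z` and the neighbourhood name `Made-up Area`, are invented for this example.

```python
import pandas as pd

# Toy stand-ins for the postal code table and the coordinate table
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Made-up Area"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every postal code; _merge marks where each row came from
merged = codes.merge(coords, on="Postal Code", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```

If `unmatched` is empty for the real data, the inner join used in the cell above loses no neighborhoods.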


Now fetch the venues for Toronto

In [17]:
toronto_venues = getNearbyVenues(toronto_df['Neighborhood'],toronto_df['Latitude'],toronto_df['Longitude'])
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


## New York Data

For simplicity, the neighborhood data provided in an earlier Coursera lab will be used.

In [11]:
# Download Data
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

# Read downloaded file
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({
        'Borough': borough,
        'Neighborhood': neighborhood_name,
        'Latitude': neighborhood_lat,
        'Longitude': neighborhood_lon
        })

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
newyork_df = pd.DataFrame(rows, columns=column_names)

newyork_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
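
The parsing step relies on the coordinate order in the file: longitude comes first, latitude second. A small sketch with a single hand-written feature makes this explicit. The helper `feature_to_row` is hypothetical, and the sample feature only reproduces the keys read in the cell above, with the Wakefield values from the output table.

```python
# One hand-written feature with the same keys the parsing cell reads
sample_feature = {
    "geometry": {"coordinates": [-73.847201, 40.894705]},
    "properties": {"borough": "Bronx", "name": "Wakefield"},
}


def feature_to_row(feature):
    """Convert one feature into a flat row with explicit latitude and longitude."""
    # Index 0 is the longitude and index 1 is the latitude
    lon, lat = feature["geometry"]["coordinates"][:2]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


row = feature_to_row(sample_feature)
```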


Now fetch the venues for New York

In [18]:
newyork_venues = getNearbyVenues(newyork_df['Neighborhood'],newyork_df['Latitude'],newyork_df['Longitude'])
newyork_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


## Merge and save Datasets

In [23]:
newyork_venues_marked = newyork_venues.assign(City = 'NewYork')
print(newyork_venues_marked.shape)
newyork_venues_marked.head()

(10112, 8)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,City
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop,NewYork
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy,NewYork
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop,NewYork
3,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy,NewYork
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop,NewYork


In [24]:
toronto_venues_marked = toronto_venues.assign(City = 'Toronto')
print(toronto_venues_marked.shape)
toronto_venues_marked.head()

(2136, 8)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,City
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,Toronto
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,Toronto
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,Toronto
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,Toronto
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop,Toronto


In [25]:
venues_df = pd.concat([newyork_venues_marked, toronto_venues_marked])
venues_df.shape

(12248, 8)
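
Stacking the two frames keeps each frame's original row labels, so the combined index contains duplicates (both cities have a row `0`, for example). The sketch below uses two tiny made-up frames to show the effect, and how `ignore_index=True` produces a fresh index instead. Whether a unique index is needed depends on the later analysis, so this is an optional variation rather than a correction.

```python
import pandas as pd

# Tiny made-up frames standing in for the two venue tables
a = pd.DataFrame({"Venue": ["A", "B"]}).assign(City="NewYork")
b = pd.DataFrame({"Venue": ["C"]}).assign(City="Toronto")

# Default behaviour: original row labels are kept, so 0 appears twice
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows from 0 to n-1
stacked_reset = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())        # [0, 1, 0]
print(stacked_reset.index.tolist())  # [0, 1, 2]
```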

Reorder the columns so that City comes first

In [26]:
fixed_columns = list(venues_df.columns)
fixed_columns.remove('City')
fixed_columns = ['City'] + fixed_columns
venues_df = venues_df[fixed_columns]
venues_df.head()

Unnamed: 0,City,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,NewYork,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,NewYork,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,NewYork,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,NewYork,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
4,NewYork,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


In [27]:
venues_df.to_csv('prepared_venues_newyork_toronto.csv')