# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

Suppose that a person plans to move from their own country to an entirely different country for a new job. Moving can be difficult, especially when the destination is unfamiliar. Let's assume that one way to ease the move is to choose a new neighborhood that resembles the one being left behind.

In this project, I will make a **simple recommendation system** that gives a **ranked list of pseudo neighborhoods** within a specified target location **based on the similarity of venues in the neighborhood** with an origin location.

## Data <a name="data"></a>

For this project, we need only a few types of data:
* **coordinates** of origin location and target location
* venue information: **id, name, coordinates, category name, category id**
* **land area** of target location (optional)

The data comes from the following sources:
* **GeoPy** for the coordinates
* **Foursquare API** for the venue information
* **Wikipedia** for the optional land area

### Origin Location

First, let us pick an example origin location, get its coordinates, and then create a map of it. We will use geopy for geocoding, that is, converting an address into coordinates. In this example, we set _Brooklyn, New York_ as the origin location.

In [1]:
!conda install -c conda-forge geopy --yes # comment out this line if you have already completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from types import SimpleNamespace

origin_address = 'Brooklyn, New York'

geolocator = Nominatim(user_agent="neighborhood_finder")
origin_geo = geolocator.geocode(origin_address)
origin = {
    'address': origin_address,
    'latitude': origin_geo.latitude,
    'longitude': origin_geo.longitude,
}
origin = SimpleNamespace(**origin)
print('The geographical coordinates of {} are {}, {}.'.format(origin.address, origin.latitude, origin.longitude))


Now let's look at the origin location in a map using folium.

In [2]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

folium.Map(location=[origin.latitude, origin.longitude], zoom_start=17)

### Target Location

We will do the same for the target location. Let's choose _Toronto, Canada_.

In [3]:
import numpy as np

target_address = 'Toronto, Canada'

target_geo = geolocator.geocode(target_address)
target = {
    'address': target_address,
    'latitude': target_geo.latitude,
    'longitude': target_geo.longitude
}
target = SimpleNamespace(**target)
print('The geographical coordinates of {} are {}, {}.'.format(target.address, target.latitude, target.longitude))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.


Now let's look at the target location in a map.

In [4]:
folium.Map(location=[target.latitude, target.longitude], zoom_start=11)

### Foursquare API

So far, we have only obtained the geographic coordinates of the locations of interest. We will now obtain the venues using the Foursquare API, since we define a neighborhood by the venues in its vicinity.

In [5]:
import os

# Read the Foursquare credentials from the environment instead of hardcoding them.
# The variable names below are an assumption; use whatever names you export.
fs_id = os.environ['FOURSQUARE_CLIENT_ID']
fs_secret = os.environ['FOURSQUARE_CLIENT_SECRET']

In [6]:
import pandas as pd
import json
import requests
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt 

CLIENT_ID = fs_id # your Foursquare ID
CLIENT_SECRET = fs_secret # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

def getNearbyVenues(location, categoryId="", limit=1000):
    venues_list=[]
    if isinstance(location, tuple):
        location = "&ll=" + str(location[0]) + "," + str(location[1])
    else:
        location = "&near=" + location
    # add the category filter to the query only when one was given
    categoryId = '&categoryId=' + str(categoryId) if categoryId else ""
        
    trueLimit = 100
    totalResults = -1
    if limit < trueLimit:
        trueLimit = limit
    for offset in range(0, limit, 100):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}{}{}&limit={}&offset={}&sortByDistance=1'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            location, 
            categoryId, 
            trueLimit,
            offset
        )
            
        # make the GET request
        response = requests.get(url).json()["response"]
        try:
            results = response['groups'][0]['items']
            if totalResults == -1:
                totalResults = response['totalResults']
#                 print("Max Results:", totalResults)
                if limit > totalResults:
                    limit = totalResults
        except Exception as e:
            print(e)
            return
        
        # return only relevant information for each nearby venue
        venues_list.append(
            [
                (
                    v['venue']['id'],
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['id'],
                    v['venue']['categories'][0]['name']

                ) for v in results
            ]
        )

        
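        # stop after a partial page or once nothing is left to fetch;
        # otherwise shrink the next page so it never asks for more than remains
        remaining = limit - (offset + 100)
        if trueLimit < 100 or remaining <= 0:
            break
        trueLimit = min(100, remaining)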

    # flatten the pages; naming the columns here keeps an empty result valid
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=[
            'Venue ID', 
            'Name', 
            'Latitude', 
            'Longitude', 
            'Category ID', 
            'Category'
        ]
    )
    
    return(nearby_venues)
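
The paging arithmetic inside `getNearbyVenues` can be easier to follow in isolation. The sketch below assumes, as the function above does, that a single request returns at most 100 venues. `page_requests` is a hypothetical helper written only for illustration; it is not part of the notebook or of the Foursquare API.

```python
def page_requests(limit, page_size=100):
    """Return (offset, page_limit) pairs that together cover `limit` results."""
    pages = []
    for offset in range(0, limit, page_size):
        # the last page only asks for what is still missing
        pages.append((offset, min(page_size, limit - offset)))
    return pages


print(page_requests(15))   # [(0, 15)]
print(page_requests(250))  # [(0, 100), (100, 100), (200, 50)]
```

Each pair corresponds to the `offset` and `limit` query parameters of one request.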

Here, we will retrieve the venues within the origin location. In this example, let's obtain the 15 closest venues.

In [7]:
origin_venues = getNearbyVenues(location=(origin.latitude, origin.longitude), limit=15)
venue_per_neighborhood = origin_venues.shape[0]
category_id = origin_venues['Category ID'].values
category_string = ','.join(np.unique(category_id))
origin_bounds = [origin_venues[['Latitude', 'Longitude']].min().to_list(), origin_venues[['Latitude', 'Longitude']].max().to_list()]
print("Total Results:", venue_per_neighborhood)
origin_venues.head()

Total Results: 15


| | Venue ID | Name | Latitude | Longitude | Category ID | Category |
|---|---|---|---|---|---|---|
| 0 | 4c5999242091a59385425dd0 | MaCdonalds Avenue Playground | 40.65001 | -73.94994 | 4bf58dd8d48988d1e7941735 | Playground |
| 1 | 4cde3e8acf9060fc8521d58d | Bad Gym | 40.650007 | -73.95 | 4bf58dd8d48988d175941735 | Gym / Fitness Center |
| 2 | 4dc04a5e81545e1cc7e1591e | Chase Bank | 40.650351 | -73.949864 | 4bf58dd8d48988d10a951735 | Bank |
| 3 | 4bf12f5fb315c9b69daf93ff | The Yoga Studio | 40.65 | -73.95 | 4bf58dd8d48988d102941735 | Yoga Studio |
| 4 | 4c7f9332d860b60ca32a619d | Courts (Furniture, Electronics, & Appliances) | 40.650436 | -73.950608 | 4bf58dd8d48988d1f8941735 | Furniture / Home Store |


Let's view the venues within the origin location in a map.

In [8]:
origin_center = origin_venues[['Latitude', 'Longitude']].mean()
map_origin = folium.Map(location=[origin_center['Latitude'], origin_center['Longitude']])

# add markers to map
for name, category, lat, lng in zip(origin_venues['Name'], origin_venues['Category'], origin_venues['Latitude'], origin_venues['Longitude']):
    label = folium.Popup(name + " (" + category + ")", parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_origin)  
    
map_origin.fit_bounds(origin_bounds)
map_origin

Now we get the venues within the target location. First we search for all types of venues, then we search again with a category filter to get the venues of interest, that is, venues in the same categories as the ones in the origin location.

In [9]:
target_venues = getNearbyVenues(location=target.address)
target_venues_filtered = getNearbyVenues(location=target.address, categoryId=category_string)
target_venues_filtered = target_venues_filtered[target_venues_filtered['Category ID'].isin(category_id)]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
target_venues = pd.concat([target_venues, target_venues_filtered], ignore_index=True).drop_duplicates(subset='Venue ID').reset_index()
target_bounds = [target_venues[['Latitude', 'Longitude']].min().to_list(), target_venues[['Latitude', 'Longitude']].max().to_list()]
total_results = target_venues.shape[0]
print("Total Results:", total_results)
print("Total Results of Interest:", target_venues_filtered.shape[0])
target_venues_filtered.head()

Total Results: 222
Total Results of Interest: 98


| | Venue ID | Name | Latitude | Longitude | Category ID | Category |
|---|---|---|---|---|---|---|
| 7 | 4cc0bf6e31ceef3b87645d0d | Domino's Pizza | 43.70717 | -79.442658 | 4bf58dd8d48988d1ca941735 | Pizza Place |
| 8 | 4aeb7abbf964a52080c221e3 | Ferro Bar Cafe | 43.68108 | -79.42857 | 4bf58dd8d48988d1ca941735 | Pizza Place |
| 10 | 5afa2b9cd8096e002d829e35 | Maker Pizza | 43.723017 | -79.415404 | 4bf58dd8d48988d1ca941735 | Pizza Place |
| 11 | 4c42f944520fa59311f0cbac | Doce Minho Pastry & Bakery | 43.691893 | -79.448191 | 4bf58dd8d48988d16a941735 | Bakery |
| 12 | 4ae73286f964a5204aa921e3 | Moorevale Park | 43.69361 | -79.383465 | 4bf58dd8d48988d1e7941735 | Playground |


In this example, Foursquare returned 222 unique venues within Toronto, Canada, and the filtered search found 98 venues of interest. Let's look at all the venues found on a map. Venues that are not of interest are still shown, but at lower opacity.

In [10]:
target_center = target_venues[['Latitude', 'Longitude']].mean()
map_target = folium.Map(location=[target_center['Latitude'], target_center['Longitude']])

# add markers to map
for name, c_id, category, lat, lng in zip(target_venues['Name'], target_venues['Category ID'], target_venues['Category'], target_venues['Latitude'], target_venues['Longitude']):
    label = name + " (" + category + ")"
    popup = folium.Popup(label, parse_html=True)
    opacity = 1 if c_id in category_id else 0.25
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=popup,
        color='blue',
        opacity=opacity,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7*opacity,
        parse_html=False
    ).add_to(map_target)  
    
map_target.fit_bounds(target_bounds)
map_target

## Methodology <a name="methodology"></a>

To build the recommender system, we define multiple classes (neighborhood profiles) in the target location and then probabilistically classify the origin neighborhood profile against them.

First, we will process the data to be used in the clustering later and define a neighborhood profile.

Second, we will use k-means clustering to define our target neighborhoods. We want to build a recommender system for neighborhoods without relying on physical boundaries. A set of venues of interest might lie between two or more neighborhoods, so a pseudo neighborhood might be a better choice than a physical neighborhood for this system.

Lastly, we will use Gaussian Naive Bayes to probabilistically classify the origin neighborhood against the multiple target neighborhoods.

Define the venues of interest table.

In [11]:
venues = pd.DataFrame(pd.unique(origin_venues['Category']), columns=['Category'])
print("Total Categories:", venues.shape[0])
venues.head()

Total Categories: 11


| | Category |
|---|---|
| 0 | Playground |
| 1 | Gym / Fitness Center |
| 2 | Bank |
| 3 | Yoga Studio |
| 4 | Furniture / Home Store |


Define the origin neighborhood profile, which represents the presence of each venue category of interest within the neighborhood. In this profile, we are not interested in the frequency or distance of venues, only in their presence.

In [12]:
origin_profile = pd.DataFrame([[1] * venues.shape[0]], columns=venues.Category.values)
print("Origin Neighborhood Profile:")
origin_profile

Origin Neighborhood Profile:


| | Playground | Gym / Fitness Center | Bank | Yoga Studio | Furniture / Home Store | Donut Shop | Juice Bar | Caribbean Restaurant | Pharmacy | Bakery | Pizza Place |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


Now, we will define target neighborhoods by clustering using k-means.

For the number of clusters, we want to obtain neighborhoods that have approximately the same area as the origin neighborhood. We approximate the number of clusters by the ratio of the spread (standard deviation) of the venue coordinates in the target location to that in the origin location.

In [13]:
from sklearn.cluster import KMeans

# set number of clusters
n_clusters = int(np.sqrt((target_venues['Latitude'] ** 2) + (target_venues['Longitude'] ** 2)).std() / np.sqrt((origin_venues['Latitude'] ** 2) + (origin_venues['Longitude'] ** 2)).std())

# run k-means clustering
neighborhood = KMeans(n_clusters=n_clusters).fit(target_venues[['Latitude', 'Longitude']])
neighborhood_id = neighborhood.labels_
target_labels = np.unique(neighborhood_id)
print("Number of clusters:", n_clusters)


# update target venues with clustering labels
target_venues.drop('Neighborhood ID', axis=1, inplace=True, errors='ignore')
target_venues.insert(0, 'Neighborhood ID', neighborhood_id)

target_venues.tail()

Number of clusters: 42


| | Neighborhood ID | index | Venue ID | Name | Latitude | Longitude | Category ID | Category |
|---|---|---|---|---|---|---|---|---|
| 217 | 5 | 228 | 4bd444f5462cb71393ebdf07 | Francesca Bakery | 43.787716 | -79.256852 | 4bf58dd8d48988d16a941735 | Bakery |
| 218 | 12 | 229 | 5675fb22498e98ce8baff885 | Crown Pizza | 43.824307 | -79.247393 | 4bf58dd8d48988d1ca941735 | Pizza Place |
| 219 | 12 | 230 | 4aef94c7f964a52060d921e3 | Walmart | 43.833671 | -79.256036 | 4bf58dd8d48988d10f951735 | Pharmacy |
| 220 | 21 | 231 | 4e0b137722713e13018e7117 | Tim Hortons / Esso | 43.801863 | -79.199296 | 4bf58dd8d48988d148941735 | Donut Shop |
| 221 | 2 | 232 | 53669066498eb02e34d50dcd | Yoga Grove - Small Classes. Big Difference. | 43.649227 | -79.506812 | 4bf58dd8d48988d102941735 | Yoga Studio |
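
To make the cluster-count heuristic concrete, here is a small sketch on made-up coordinates. It mirrors the expression in the cell above: each venue is reduced to its distance from the coordinate origin $(0, 0)$, and the number of clusters is the truncated ratio of the two sample standard deviations. The helper name `radial_spread` and the toy points are assumptions for illustration only.

```python
import math
import statistics


def radial_spread(coords):
    """Sample standard deviation of each point's distance from (0, 0)."""
    return statistics.stdev([math.hypot(lat, lon) for lat, lon in coords])


# made-up coordinates, chosen so that the distances are round numbers
origin_coords = [(3, 4), (6, 8)]    # distances 5 and 10
target_coords = [(0, 0), (20, 48)]  # distances 0 and 52

ratio = radial_spread(target_coords) / radial_spread(origin_coords)
n_clusters_toy = int(ratio)  # int() truncates, as in the cell above
print(ratio, n_clusters_toy)
```

Because `int()` truncates, a ratio just below a whole number loses almost one full cluster; rounding instead would be a reasonable alternative.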


Similar to the origin neighborhood profile, we define the target neighborhood profiles by presence of venues of interest.

In [14]:
target_profiles_list = [((venues['Category'].isin(target_venues[target_venues['Neighborhood ID'] == i]['Category'])).astype('int').to_list()) for i in range(n_clusters)]
target_profiles = pd.DataFrame(target_profiles_list, columns=venues.Category.values)
target_profiles.insert(0, 'Neighborhood ID', target_labels)
print("Target Neighborhood Profiles:")
target_profiles.head()

Target Neighborhood Profiles:


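| | Neighborhood ID | Playground | Gym / Fitness Center | Bank | Yoga Studio | Furniture / Home Store | Donut Shop | Juice Bar | Caribbean Restaurant | Pharmacy | Bakery | Pizza Place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |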
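
The presence encoding used for both the origin and the target profiles can be sketched without pandas. The function `presence_profile` and the example neighborhood below are hypothetical and only illustrate the idea: a category contributes a 1 if it appears at least once, no matter how often.

```python
def presence_profile(categories_of_interest, neighborhood_categories):
    """Return 1 for each category of interest found in the neighborhood, else 0."""
    present = set(neighborhood_categories)
    return [1 if category in present else 0 for category in categories_of_interest]


categories_of_interest = ['Playground', 'Bank', 'Bakery', 'Pizza Place']
# hypothetical neighborhood: duplicates and unrelated categories do not change the profile
neighborhood = ['Bakery', 'Bakery', 'Coffee Shop', 'Pizza Place']

print(presence_profile(categories_of_interest, neighborhood))  # [0, 0, 1, 1]
```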


## Analysis <a name="analysis"></a>

Now we calculate neighborhood profile similarities by presence of venues of interest using Gaussian Naive Bayes, a multiclass probabilistic classifier. The predicted probabilities in this setup tend toward the extremes of one and zero, so we rank the neighborhood profiles by log probability instead. We should observe values from negative infinity (zero probability) up to $0$ (equivalent to 100% probability).

In [15]:
from sklearn.naive_bayes import GaussianNB

X = target_profiles.values[:,1:]
Y = target_labels
clf = GaussianNB()
clf.fit(X, Y)

log_proba = clf.predict_log_proba([origin_profile.values[0]])[0]
log_proba_df = pd.DataFrame(log_proba)
log_proba_df.columns = ['Log Probability']
log_proba_df['Neighborhood ID'] = log_proba_df.index
ranked_neighborhood=log_proba_df.sort_values(by='Log Probability', ascending=False, ignore_index=True)

replace_dict = dict(zip(ranked_neighborhood['Neighborhood ID'].values, ranked_neighborhood.index.values))
neighborhood_id = pd.Series(neighborhood_id).replace(replace_dict).values
target_venues['Neighborhood ID'] = neighborhood_id
target_profiles['Neighborhood ID'] = target_profiles['Neighborhood ID'].replace(replace_dict).values
target_profiles.sort_values(by='Neighborhood ID', inplace=True, ignore_index=True)
ranked_neighborhood['Neighborhood ID'] = ranked_neighborhood.index.values
ranked_neighborhood['Mean Latitude'] = [(target_venues[target_venues['Neighborhood ID'] == i]['Latitude'].mean()) for i in target_labels]
ranked_neighborhood['Mean Longitude'] = [(target_venues[target_venues['Neighborhood ID'] == i]['Longitude'].mean()) for i in target_labels]

ranked_neighborhood.head()

| | Log Probability | Neighborhood ID | Mean Latitude | Mean Longitude |
|---|---|---|---|---|
| 0 | 0.0 | 0 | 43.666333 | -79.38671 |
| 1 | -2004545000.0 | 1 | 43.647132 | -79.39539 |
| 2 | -4009091000.0 | 2 | 43.649534 | -79.411497 |
| 3 | -4009091000.0 | 3 | 43.644248 | -79.422093 |
| 4 | -4009091000.0 | 4 | 43.776172 | -79.256715 |
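
A rough calculation suggests why the log probabilities above are so extreme and come in near-multiples of about $-2 \times 10^9$. With a single training row per neighborhood, every per-class variance is zero, and scikit-learn's `GaussianNB` then adds a small smoothing term (`var_smoothing`, default `1e-9`, times the largest feature variance) to keep it positive. The sketch below re-implements the Gaussian log posterior in plain Python under that assumption. The toy profiles and the largest feature variance of 0.25 are made up, and the helper names are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Sum of per-feature Gaussian log densities (naive independence)."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - mi) ** 2 / (2 * var)
        for xi, mi in zip(x, mean)
    )


def log_posteriors(x, class_means, var):
    """Log posterior per class, assuming equal priors and one shared variance."""
    joint = [gaussian_log_likelihood(x, mean, var) for mean in class_means]
    top = max(joint)
    # log-sum-exp keeps the normalization numerically stable
    log_norm = top + math.log(sum(math.exp(j - top) for j in joint))
    return [j - log_norm for j in joint]


# one training row per class, so each class mean is the row itself
profiles = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
origin_toy = [1, 1, 1, 1]
smoothed_var = 1e-9 * 0.25  # assumed: var_smoothing times a largest feature variance of 0.25

scores = log_posteriors(origin_toy, profiles, smoothed_var)
print(scores)
```

In this toy setup the best match gets a log probability of 0, and each additional mismatched category costs about $1 / (2 \cdot 2.5 \times 10^{-10}) = 2 \times 10^9$, which is why only the ordering, not the magnitude, is informative.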


Since k-means clustering is not guaranteed to converge to the global optimum, each run may produce different mean coordinates for the high-ranking neighborhoods. The neighborhoods that best match the origin profile will still rank above the rest.
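
The dependence on initialization can be seen in a tiny example. The sketch below is a plain-Python version of Lloyd's algorithm, not scikit-learn's `KMeans` (which by default uses smarter seeding); the function `lloyd` and the four toy points are assumptions for illustration. The same points converge to two different solutions depending on the starting centers.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def lloyd(points, centers, max_iterations=20):
    """Run Lloyd's k-means iterations from the given initial centers."""
    for _ in range(max_iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its points
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, inertia


# corners of a wide rectangle: the natural split is left pair versus right pair
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, inertia_good = lloyd(points, [(0, 0), (4, 0)])  # starts near the natural split
_, inertia_bad = lloyd(points, [(2, 0), (2, 1)])   # gets stuck on a top/bottom split
print(inertia_good, inertia_bad)  # 1.0 16.0
```

Both runs have converged, but the second one stops at a local optimum with a much larger within-cluster sum of squares.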

Based on the log probability estimates, let's look at the profiles rearranged by rank. We should observe more ones in the higher rows than in the lower rows.

In [16]:
print("Rearranged Target Neighborhood Profiles:")
target_profiles[0:5]

Rearranged Target Neighborhood Profiles:


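| | Neighborhood ID | Playground | Gym / Fitness Center | Bank | Yoga Studio | Furniture / Home Store | Donut Shop | Juice Bar | Caribbean Restaurant | Pharmacy | Bakery | Pizza Place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |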


As desired, higher rows have at least as many venues of interest present (represented by $1$s) as lower rows.

## Results and Discussion <a name="results"></a>

Let's create a map to see the neighborhood ranking colored by rank of log probability.

The neighborhood number in the popup is also the rank of that neighborhood. We use the viridis colormap to color each neighborhood. Higher-ranking neighborhoods tend to be more yellow, while lower-ranking ones tend to be more blue. Venues that are not of interest are still shown, but at lower opacity.

In [17]:
# create map
target_center = target_venues[['Latitude', 'Longitude']].mean()
rank_map = folium.Map(location=[target_center['Latitude'], target_center['Longitude']], zoom_start=11)

# set color scheme for the clusters
rank_array = n_clusters - np.array(ranked_neighborhood.index)
rank_norm = rank_array - np.min(rank_array)
rank_norm = rank_norm / float(np.max(rank_norm))
rank_color = cm.viridis(rank_norm)
rank_color_hex = [colors.rgb2hex(i) for i in rank_color]

# add markers to the map
for cluster, name, c_id, category, lat, lon in zip(target_venues['Neighborhood ID'], target_venues['Name'], target_venues['Category ID'], target_venues['Category'], target_venues['Latitude'], target_venues['Longitude']):
    label = "Neighborhood #" + str(cluster) + ": " + name + " (" + category + ")"
    popup = folium.Popup(label, parse_html=True)
    opacity = 1 if c_id in category_id else 0.33
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=popup,
        color=rank_color_hex[cluster],
        opacity=opacity,
        fill=True,
        fill_color=rank_color_hex[cluster],
        fill_opacity=0.7*opacity
    ).add_to(rank_map)
       
rank_map.fit_bounds(target_bounds)
rank_map

Because we set a large number of clusters, the venues within each neighborhood are close together, as in the origin neighborhood. This is the desired result, since we want each target neighborhood to be as similar as possible to the origin neighborhood.

The colormap distributes colors evenly, which emphasizes ranking and partitioning but says little about the degree of similarity.

Now let's look at the map of neighborhood similarity colored by normalized log probability.

We still use the viridis colormap to color each neighborhood. Higher log probabilities tend to be more yellow, while lower log probabilities tend to be more blue. Venues that are not of interest are still shown, but at lower opacity.

In [18]:
# create map
log_proba_map = folium.Map(location=[target_center['Latitude'], target_center['Longitude']], zoom_start=11)

# set color scheme for the clusters
log_proba_array = np.array(ranked_neighborhood.sort_values(by='Neighborhood ID', ascending=True))[:,0]
log_proba_norm = log_proba_array - np.min(log_proba_array)
log_proba_norm = log_proba_norm / float(np.max(log_proba_norm))
log_proba_color = cm.viridis(log_proba_norm)
log_proba_color_hex = [colors.rgb2hex(i) for i in log_proba_color]

# add markers to the map
for cluster, name, c_id, category, lat, lon in zip(target_venues['Neighborhood ID'], target_venues['Name'], target_venues['Category ID'], target_venues['Category'], target_venues['Latitude'], target_venues['Longitude']):
    label = "Neighborhood #" + str(cluster) + ": " + name + " (" + category + ")"
    popup = folium.Popup(label, parse_html=True)
    opacity = 1 if c_id in category_id else 0.33
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=popup,
        color=log_proba_color_hex[cluster],
        opacity=opacity,
        fill=True,
        fill_color=log_proba_color_hex[cluster],
        fill_opacity=0.7*opacity
    ).add_to(log_proba_map)
       
log_proba_map.fit_bounds(target_bounds)
log_proba_map

This map is more intuitive since it considers ties in the ranking, not just the partitioning. Here, we can focus on neighborhoods with near perfect similarity and ignore the rest.
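
The difference between the two color scales comes down to what gets normalized. The sketch below applies min-max normalization to the five log probabilities listed in the ranked table in the Analysis section and, for comparison, to their ranks. The helper `min_max_normalize` is a hypothetical stand-in for the NumPy expressions in the two map cells, and the actual maps normalize over all neighborhoods, not just these five.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# log probabilities of the five top-ranked neighborhoods
log_proba_top = [0.0, -2004545000.0, -4009091000.0, -4009091000.0, -4009091000.0]

by_value = min_max_normalize(log_proba_top)
# rank-based scale: rank 0 maps to 1.0, the last rank to 0.0
by_rank = min_max_normalize([len(log_proba_top) - rank for rank in range(len(log_proba_top))])

print(by_value)
print(by_rank)
```

On the value-based scale the three tied neighborhoods share one color, whereas the rank-based scale spreads them over three different colors.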

## Conclusion <a name="conclusion"></a>

Interested users and stakeholders can already use this system as is and obtain useful recommendations. Improvements can be made to the neighborhood clustering, by finding the global optimum or otherwise making the clustering consistent, and to the map visualization, by adding neighborhood boundaries. The definition of a neighborhood profile can also be modified further, depending on stakeholders' preferences, for example by adding features other than venues or by using APIs other than Foursquare.

Nevertheless, this system works as intended: it recommends similar neighborhoods within a given location based on a smaller input location. The neighborhood boundaries vary between runs because k-means tends to converge to a local optimum, but the ranking of neighborhoods is consistently correct, informative, and hopefully useful for the target market.



Thank you for taking the time to read this.