# The Battle of Neighborhoods
## Coursera's "Applied Data Science Capstone" Project - Weeks 4 and 5 Assignment

**Student:** Michael Onishi <br>
**Date:** June 2020

https://www.coursera.org/learn/applied-data-science-capstone

## 1 - Introduction

This is the final assignment of Coursera's "Applied Data Science Capstone" course. In this project we are required to:
> come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve.

For this project, I will build a model that recommends neighborhoods of San Diego, California, based on a person's interests. It could be used by someone looking for a place to live in San Diego or by someone visiting as a tourist.

The user will assign a weight to each of the following categories (all of them are top-level venue categories from Foursquare):

 * Arts & Entertainment
 * College & University
 * Event
 * Food
 * Nightlife Spot
 * Outdoors & Recreation
 * Shop & Service
 
Based on this profile, the system will recommend the neighborhoods with the best scores.
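
As a rough illustration of the idea (the scoring rule is not fixed at this point in the notebook), one possible approach is a weighted sum: for each neighborhood, take the share of its venues in each category and multiply it by the user's weight for that category. The function name `score_neighborhoods`, the venue counts, and the weights below are all hypothetical.

```python
def score_neighborhoods(category_counts, weights):
    """Return (neighborhood, score) pairs sorted from best to worst.

    category_counts: {neighborhood: {category: number of venues}}
    weights: {category: user weight}
    """
    scores = {}
    for neighborhood, counts in category_counts.items():
        total = sum(counts.values())
        if total == 0:
            # a neighborhood without venues cannot match any profile
            scores[neighborhood] = 0.0
            continue
        # weighted sum of the share of venues in each category
        scores[neighborhood] = sum(
            weights.get(category, 0) * count / total
            for category, count in counts.items()
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# made-up venue counts, for illustration only
example_counts = {
    'NEIGHBORHOOD A': {'Food': 30, 'Nightlife Spot': 15, 'Shop & Service': 5},
    'NEIGHBORHOOD B': {'Food': 10, 'Outdoors & Recreation': 30},
}
# made-up user profile: cares mostly about outdoor activities
example_weights = {'Food': 1, 'Nightlife Spot': 0, 'Outdoors & Recreation': 3}

ranking = score_neighborhoods(example_counts, example_weights)
print(ranking)
```

Using shares rather than raw counts keeps a neighborhood with many venues overall from dominating every profile; whether that is the right trade-off is a design choice.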

Although I am Brazilian, I chose San Diego because the city is well covered by Foursquare and I fell in love with it when I visited some years ago.

## 2 - Data

To get the neighborhood location data, I will use data from the San Diego Open Data Portal: https://data.sandiego.gov/datasets/pd-neighborhoods/

As mentioned there, _"These boundaries are for law enforcement only and do not represent legal neighborhood or community boundaries."_ However, I think they approximate the official neighborhoods well enough, so I will use them in this project.

Then I will get venue information for every neighborhood using the Foursquare API. The most important piece of data will be the top-level category of each venue.

### Retrieving neighborhood data

In [102]:
import json # library to handle JSON files
import numpy as np # library to handle data in a vectorized manner
import pandas as pd

In [5]:
!curl http://seshat.datasd.org/sde/pd/pd_neighborhoods_datasd.geojson -o sandiego_data.json
print('Data downloaded!')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5861k  100 5861k    0     0  2525k      0  0:00:02  0:00:02 --:--:-- 2524k
Data downloaded!


In [6]:
with open('sandiego_data.json') as json_data:
    sandiego_data = json.load(json_data)

In [117]:
from collections.abc import Iterable

def flatten(l):
    """Used to flatten the nested longitude and latitude arrays"""
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

neighborhoods_data = sandiego_data['features']
names = []
latitudes = []
longitudes = []
for data in neighborhoods_data:
    names.append(data['properties']['name'])
    # turn all the coordinates into a flat list
    all_coordinates = list(flatten(data['geometry']['coordinates']))
    
    # then reshape them into rows of (longitude, latitude) pairs, the GeoJSON coordinate order
    all_coordinates = np.reshape(all_coordinates, (-1,2))
    
    # use the mean of all polygon points as a single position for the neighborhood
    coordinates = np.mean(all_coordinates, axis=0)
    latitudes.append(coordinates[1])
    longitudes.append(coordinates[0])
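
To make the centroid step concrete, here is a minimal sketch on a made-up square polygon. It assumes 2D coordinates (no elevation values), which is what the `(-1, 2)` reshape relies on. GeoJSON stores each point as `[longitude, latitude]`, so index 0 of the mean is the longitude and index 1 is the latitude. The helper names `flatten_nested` and `vertex_mean` are hypothetical.

```python
import numpy as np

def flatten_nested(nested):
    """Yield every number in an arbitrarily nested list of coordinates."""
    for el in nested:
        if isinstance(el, (list, tuple)):
            yield from flatten_nested(el)
        else:
            yield el

def vertex_mean(coordinates):
    """Return (latitude, longitude) as the mean of all polygon vertices."""
    # rows of [longitude, latitude], following the GeoJSON coordinate order
    points = np.reshape(list(flatten_nested(coordinates)), (-1, 2))
    lng, lat = points.mean(axis=0)
    return float(lat), float(lng)

# made-up Polygon with one ring; the first and last vertex are identical,
# as GeoJSON requires for a closed ring
square = [[[-117.2, 32.7], [-117.0, 32.7], [-117.0, 32.9],
           [-117.2, 32.9], [-117.2, 32.7]]]

lat, lng = vertex_mean(square)
print(lat, lng)
```

Because the closing vertex is repeated and real boundaries rarely have evenly spaced vertices, this mean only approximates the geometric centroid. For picking a search point per neighborhood, that approximation is probably close enough.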

Let's transform the `names`, `latitudes`, and `longitudes` lists into a pandas DataFrame.

In [109]:
df = pd.DataFrame(zip(names, latitudes, longitudes), columns =['Neighborhood', 'Latitude', 'Longitude'])

In [113]:
print(df.shape)
df.head()

(124, 3)


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | MIRAMAR RANCH NORTH | 32.92505 | -117.098017 |
| 1 | TORREY HIGHLANDS | 32.969518 | -117.149431 |
| 2 | MISSION BAY | 32.776936 | -117.2218 |
| 3 | NORTH CITY | 32.97082 | -117.217243 |
| 4 | LOMA PORTAL | 32.745151 | -117.221147 |


#### Let's plot the points on a map to check that they look correct

In [146]:
import folium

# create map of San Diego using latitude and longitude values
sandiego_latitude = 32.715736
sandiego_longitude = -117.161087
map_sandiego = folium.Map(location=[sandiego_latitude, sandiego_longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(neighborhood, parse_html=True)
    folium.CircleMarker(
        location=(lat, lng),
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sandiego)
    
map_sandiego

### Retrieving venue data from the neighborhoods

#### Define Foursquare Credentials and Version

In [116]:
import yaml
import requests

## retrieving the Foursquare API credentials
with open("config.yml", 'r') as ymlfile:
    cfg = yaml.safe_load(ymlfile)

CLIENT_ID = cfg['foursquare_api_credentials']['CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = cfg['foursquare_api_credentials']['CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20200618' # Foursquare API version

In [180]:
def getCategories():
    """As defined here: https://developer.foursquare.com/docs/api-reference/venues/categories/"""
    
    # create the API request URL
    url = f'https://api.foursquare.com/v2/venues/categories?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'
    
    # make the GET request
    results = requests.get(url).json()["response"]['categories']
    return results

def getNearbyVenues(names, latitudes, longitudes, category_ids, radius=500, limit=100):
    """Using this API: https://developer.foursquare.com/docs/api-reference/venues/explore/"""
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
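        # create the API request URL
        # the placeholders must match the eight arguments below, in order
        # NOTE: categoryId is documented for venues/search; it is assumed here that venues/explore accepts it too
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(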
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            ','.join(category_ids),
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
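
Positional `str.format` placeholders are easy to misalign with their arguments when a URL has this many parameters. One way to sidestep that is to build the query string from a dict of named parameters. The sketch below uses only the standard library and makes no network call. The helper name `build_explore_url` and the credentials are hypothetical, and the `categoryId` filter on this endpoint is an assumption.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_ids, radius=500, limit=100):
    """Build a venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',
        # assumption: the endpoint accepts a comma-separated categoryId filter
        'categoryId': ','.join(category_ids),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the values, so commas become %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# placeholder credentials, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20200618',
                                32.710986, -117.16007,
                                ['4d4b7104d754a06370d81259'])
print(example_url)
```

With named parameters, each value can only land in the slot it is keyed to, so adding or removing a parameter cannot shift the others.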

#### Filtering the target categories

In [123]:
categories = getCategories()

In [128]:
target_categories_names = ['Arts & Entertainment',
                           'College & University',
                           'Event',
                           'Food',
                           'Nightlife Spot',
                           'Outdoors & Recreation',
                           'Shop & Service']

target_categories = [c for c in categories if c['name'] in target_categories_names]

In [138]:
print(f'There are {len(target_categories)} selected categories')
target_categories[0].keys()

There are 7 selected categories


dict_keys(['id', 'name', 'pluralName', 'shortName', 'icon', 'categories'])

In [141]:
target_category_ids = [c['id'] for c in target_categories]
target_category_ids

['4d4b7104d754a06370d81259',
 '4d4b7105d754a06372d81259',
 '4d4b7105d754a06373d81259',
 '4d4b7105d754a06374d81259',
 '4d4b7105d754a06376d81259',
 '4d4b7105d754a06377d81259',
 '4d4b7105d754a06378d81259']

In [169]:
[c['name'] for c in target_categories]

['Arts & Entertainment',
 'College & University',
 'Event',
 'Food',
 'Nightlife Spot',
 'Outdoors & Recreation',
 'Shop & Service']

In [178]:
','.join(target_category_ids)

'4d4b7104d754a06370d81259,4d4b7105d754a06372d81259,4d4b7105d754a06373d81259,4d4b7105d754a06374d81259,4d4b7105d754a06376d81259,4d4b7105d754a06377d81259,4d4b7105d754a06378d81259'

#### Getting the venues for the neighborhoods

In [144]:
%%time

sandiego_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'],
                                   category_ids=target_category_ids
                                  )

MIRAMAR RANCH NORTH
TORREY HIGHLANDS
MISSION BAY
NORTH CITY
LOMA PORTAL
GASLAMP
HILLCREST
SOUTH PARK
NORTH PARK
SOUTHCREST
TERALTA WEST
AZALEA/HOLLYWOOD PARK
MT HOPE
TERALTA EAST
CHOLLAS VIEW
RIDGEVIEW/WEBSTER
COLINA DEL SOL
PARADISE HILLS
BAY TERRACES
SORRENTO VALLEY
SERRA MESA
RANCHO BERNARDO
OCEAN CREST
LA PLAYA
ROSEVILLE / FLEET RIDGE
HORTON PLAZA
CORE-COLUMBIA
SHERMAN HEIGHTS
BURLINGAME
STOCKTON
NORMAL HEIGHTS
MOUNTAIN VIEW
EMERALD HILLS
REDWOOD VILLAGE/ROLANDO PARK
ENCANTO
QUALCOMM
MIRA MESA
SAN CARLOS
RANCHO PENASQUITOS
SUNSET CLIFFS
WOODED AREA
PACIFIC BEACH
CARMEL VALLEY
MISSION HILLS
CHEROKEE POINT
CORRIDOR
SWAN CANYON
ISLENAIR
COLLEGE WEST
BROADWAY HEIGHTS
ALLIED GARDENS
SKYLINE
LAKE MURRAY
UNIVERSITY CITY
EGGER HIGHLANDS
TIJUANA RIVER VALLEY
SAN YSIDRO
OLD TOWN
MORENA
MIDTOWN
FAIRMONT VILLAGE
LINCOLN PARK
CHOLLAS CREEK
GRANTVILLE
DEL CERRO
CARMEL MOUNTAIN
SAN PASQUAL
OCEAN BEACH
MISSION BAY PARK
LA JOLLA
TORREY PINES
BAY HO
HARBORVIEW
PARK WEST
LINDA VISTA
GOLDEN HILL
SHELL

In [145]:
sandiego_venues

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | GASLAMP | 32.710986 | -117.160070 | Sparks Gallery | 32.711083 | -117.159592 | Art Gallery |
| 1 | GASLAMP | 32.710986 | -117.160070 | The Shout House | 32.712070 | -117.160815 | Piano Bar |
| 2 | GASLAMP | 32.710986 | -117.160070 | Moonshine Flats | 32.708927 | -117.158600 | Music Venue |
| 3 | GASLAMP | 32.710986 | -117.160070 | San Diego Padres Hall of Fame | 32.708381 | -117.157923 | Museum |
| 4 | GASLAMP | 32.710986 | -117.160070 | The Balboa Theatre | 32.714349 | -117.161122 | Theater |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 291 | KEARNY MESA | 32.828999 | -117.144162 | Journey's West Gallery | 32.828594 | -117.146420 | Art Gallery |
| 292 | KEARNY MESA | 32.828999 | -117.144162 | Left Coast Studios | 32.829550 | -117.147026 | Music Venue |
| 293 | KEARNY MESA | 32.828999 | -117.144162 | Prosystems | 32.829094 | -117.147247 | Arts & Entertainment |
| 294 | KEARNY MESA | 32.828999 | -117.144162 | Fred Astaire Dance Studio San Diego | 32.832246 | -117.145511 | Dance Studio |


#### Let's create a new column with the top level category

In [147]:
def flatten_category_descendants(l):
    """Used to flatten the list of descendants of a category"""
    for el in l['categories']:
        yield from flatten_category_descendants(el)
        yield el['name']

In [173]:
parent_category_map = {}
for cat in target_categories:
    parent_category_map[cat['name']] = cat['name']
    for desc in flatten_category_descendants(cat):
        parent_category_map[desc] = cat['name']
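
To see what this mapping looks like, here is a small self-contained sketch on a made-up, heavily trimmed category tree. It follows the same nested `name` / `categories` layout as the API response but is not real Foursquare data; `iter_descendant_names`, `toy_categories`, and `toy_parent_map` are hypothetical names.

```python
def iter_descendant_names(category):
    """Yield the names of all descendants of a Foursquare-style category dict."""
    for child in category['categories']:
        yield child['name']
        yield from iter_descendant_names(child)

# made-up category tree, for illustration only
toy_categories = [
    {'name': 'Food', 'categories': [
        {'name': 'Asian Restaurant', 'categories': [
            {'name': 'Sushi Restaurant', 'categories': []},
        ]},
        {'name': 'Bakery', 'categories': []},
    ]},
    {'name': 'Nightlife Spot', 'categories': [
        {'name': 'Bar', 'categories': []},
    ]},
]

# map every category name, at any depth, to its top-level ancestor
toy_parent_map = {}
for top in toy_categories:
    toy_parent_map[top['name']] = top['name']
    for name in iter_descendant_names(top):
        toy_parent_map[name] = top['name']

print(toy_parent_map['Sushi Restaurant'])
```

Note that the map is keyed by category name, so it relies on names being unique across the tree; keying by category `id` would be a more defensive alternative.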

In [181]:
sandiego_venues['Venue Category'].map(parent_category_map)

0      Arts & Entertainment
1      Arts & Entertainment
2      Arts & Entertainment
3      Arts & Entertainment
4      Arts & Entertainment
               ...         
291    Arts & Entertainment
292    Arts & Entertainment
293    Arts & Entertainment
294    Arts & Entertainment
295    Arts & Entertainment
Name: Venue Category, Length: 296, dtype: object

In [182]:
sandiego_venues['Venue Category']

0                Art Gallery
1                  Piano Bar
2                Music Venue
3                     Museum
4                    Theater
               ...          
291              Art Gallery
292              Music Venue
293     Arts & Entertainment
294             Dance Studio
295    Performing Arts Venue
Name: Venue Category, Length: 296, dtype: object
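
The two cells above only display the mapped and the original series. A minimal sketch of storing the result as a new column and counting venues per neighborhood and top-level category could look like the following. The data is made up, and the column name `Top Level Category` is an assumption, not something defined in this notebook.

```python
import pandas as pd

# made-up rows standing in for the venues table
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Art Gallery', 'Sushi Restaurant', 'Bakery',
                       'Bar', 'Unknown Thing'],
})

# made-up stand-in for the category-to-parent mapping
toy_parent_map = {
    'Art Gallery': 'Arts & Entertainment',
    'Sushi Restaurant': 'Food',
    'Bakery': 'Food',
    'Bar': 'Nightlife Spot',
}

# categories missing from the mapping become NaN in the new column
toy_venues['Top Level Category'] = toy_venues['Venue Category'].map(toy_parent_map)

# one row per neighborhood, one column per top-level category
counts = pd.crosstab(toy_venues['Neighborhood'], toy_venues['Top Level Category'])
print(counts)
```

Checking the new column for missing values (for example with `isna().sum()`) is a quick way to spot venue categories that were not found in the mapping.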

## 3 - Methodology

## 4 - Results

## 5 - Discussion

## 6 - Conclusion