## Business Problem


Suppose the owner of the Hilton Times Square hotel in Manhattan wants to learn about the competition. I will find up to 30 hotels close to this hotel for review.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

Dataset source: https://geo.nyu.edu/catalog/nyu_2451_34572

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data.

In [4]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

In [12]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [6]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [27]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [26]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.

In [42]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
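
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. A minimal sketch of an alternative, assuming the same GeoJSON structure: collect one dictionary per feature and build the dataframe in a single call. The helper name `features_to_dataframe` and the two-item sample list are illustrative only.

```python
import pandas as pd

# Two sample features copied from the GeoJSON shown earlier (trimmed to the keys we use).
sample_features = [
    {'geometry': {'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.82993910812398, 40.87429419303012]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

def features_to_dataframe(features):
    """Hypothetical helper: gather rows in a list, then create the dataframe once."""
    rows = []
    for feature in features:
        # GeoJSON stores coordinates as [longitude, latitude]
        lon, lat = feature['geometry']['coordinates']
        rows.append({
            'Borough': feature['properties']['borough'],
            'Neighborhood': feature['properties']['name'],
            'Latitude': lat,
            'Longitude': lon,
        })
    return pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])

sample_df = features_to_dataframe(sample_features)
```

With the real data, the call would be `features_to_dataframe(neighborhoods_data)`. Building the frame once also avoids copying the whole dataframe on every iteration.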

Quickly examine the resulting dataframe.

In [43]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
8,Bronx,Norwood,40.877224,-73.879391
9,Bronx,Williamsbridge,40.881039,-73.857446


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [30]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [31]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
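
One thing to watch for: geopy's `geocode` returns `None` when no match is found, in which case `location.latitude` raises an `AttributeError`. Here is a small sketch of a guard around that call. The helper `coordinates_for` and the offline stand-in `fake_geocode` are assumptions made for illustration; with geopy you would pass `geolocator.geocode` instead.

```python
from collections import namedtuple

def coordinates_for(geocode, address):
    """Hypothetical helper: return (latitude, longitude) or raise if nothing was found."""
    location = geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude

# Stand-in geocoder so the sketch runs without network access.
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(address):
    known = {'New York City, NY': FakeLocation(40.7127281, -74.0060152)}
    return known.get(address)  # None for unknown addresses, like geopy

nyc_coords = coordinates_for(fake_geocode, 'New York City, NY')
```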


#### Create a map of New York with neighborhoods superimposed on top.

In [32]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

However, for illustration purposes, let's simplify the above map and focus only on the neighborhoods in Manhattan, where the hotel is located. So let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [33]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Let's get the geographical coordinates of Manhattan.

In [34]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


As we did with all of New York City, let's visualize Manhattan and the neighborhoods in it.

In [35]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

Next, we are going to use the Foursquare API to find the hotels near the Hilton Times Square.

#### Define Foursquare Credentials and Version

In [44]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


In [45]:
CLIENT_ID = 'XKMRB50G1A5F2NSAPKAKKRIAYTOMTBDUNCCPEBE5M44QFDZT' # your Foursquare ID
CLIENT_SECRET = '000S5UHQBYXYS0Z12COS4J25JU2EPJB5ASQFBEQOK3AR5TVT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XKMRB50G1A5F2NSAPKAKKRIAYTOMTBDUNCCPEBE5M44QFDZT
CLIENT_SECRET: 000S5UHQBYXYS0Z12COS4J25JU2EPJB5ASQFBEQOK3AR5TVT
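
Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. A minimal sketch of one alternative is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions for illustration, not a Foursquare convention.

```python
import os

def load_foursquare_credentials(environ=os.environ):
    """Hypothetical helper: read the client ID and secret from environment variables."""
    try:
        return environ['FOURSQUARE_CLIENT_ID'], environ['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing))

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demo values so the sketch runs without real credentials.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_SECRET: ' + mask(demo_secret))
```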


In [46]:
# Hilton Times Square Hotel

address = '234 W 42nd St, New York, NY 10036'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.7564587 -73.9884226


In [47]:
search_query = 'Hotel'
radius = 500
print(search_query + ' .... OK!')

Hotel .... OK!


In [49]:
LIMIT = 30
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=XKMRB50G1A5F2NSAPKAKKRIAYTOMTBDUNCCPEBE5M44QFDZT&client_secret=000S5UHQBYXYS0Z12COS4J25JU2EPJB5ASQFBEQOK3AR5TVT&ll=40.7564587,-73.9884226&v=20180605&query=Hotel&radius=500&limit=30'
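
Formatting the URL by hand works for a simple query such as `Hotel`, but a query containing spaces or special characters would need escaping. As a sketch, the same URL can be assembled with the standard library's `urlencode`, which escapes each value. The function name `build_search_url` and the placeholder credentials are illustrative.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Hypothetical helper: assemble the venues/search URL with escaped parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in ll readable instead of encoding it as %2C
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params, safe=',')

demo_url = build_search_url('ID', 'SECRET', 40.7564587, -73.9884226,
                            '20180605', 'Hotel', 500, 30)
```

Alternatively, `requests.get` accepts the same dictionary through its `params` argument and performs the encoding itself.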

In [50]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cc5b9f6351e3d1d4f8fbf70'},
 'response': {'venues': [{'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/travel/hotel_',
       'suffix': '.png'},
      'id': '4bf58dd8d48988d1fa931735',
      'name': 'Hotel',
      'pluralName': 'Hotels',
      'primary': True,
      'shortName': 'Hotel'}],
    'hasPerk': False,
    'id': '4adbaf34f964a520012a21e3',
    'location': {'address': '235 W 46th St',
     'cc': 'US',
     'city': 'New York',
     'country': 'United States',
     'distance': 351,
     'formattedAddress': ['235 W 46th St',
      'New York, NY 10036',
      'United States'],
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.759474944805255,
       'lng': -73.98717696388796}],
     'lat': 40.759474944805255,
     'lng': -73.98717696388796,
     'postalCode': '10036',
     'state': 'NY'},
    'name': 'Paramount Hotel',
    'referralId': 'v-1556462070',
    'venuePage': {'id': '33865901'}},
   {'categories': 

In [51]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.head()

Unnamed: 0,categories,delivery.id,delivery.provider.icon.name,delivery.provider.icon.prefix,delivery.provider.icon.sizes,delivery.provider.name,delivery.url,hasPerk,id,location.address,location.cc,location.city,location.country,location.crossStreet,location.distance,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.neighborhood,location.postalCode,location.state,name,referralId,venuePage.id
0,[{'icon': {'prefix': 'https://ss3.4sqi.net/img...,,,,,,,False,4adbaf34f964a520012a21e3,235 W 46th St,US,New York,United States,,351,"[235 W 46th St, New York, NY 10036, United Sta...","[{'lng': -73.98717696388796, 'lat': 40.7594749...",40.759475,-73.987177,,10036,NY,Paramount Hotel,v-1556462070,33865901.0
1,[{'icon': {'prefix': 'https://ss3.4sqi.net/img...,,,,,,,False,4b4bbe3ff964a52016a626e3,228 W 47th St,US,New York,United States,at 8th Ave,407,"[228 W 47th St (at 8th Ave), New York, NY 1003...","[{'lng': -73.98608770066558, 'lat': 40.7596663...",40.759666,-73.986088,,10036,NY,Hotel Edison,v-1556462070,501560524.0
2,[{'icon': {'prefix': 'https://ss3.4sqi.net/img...,,,,,,,False,5ac67754a9e4026c7d487d95,260 W 40th St,US,New York,United States,40th btw 7th & 8th Ave,182,"[260 W 40th St (40th btw 7th & 8th Ave), New Y...","[{'lng': -73.99031690674605, 'lat': 40.7556572...",40.755657,-73.990317,,10018,NY,AC Hotel Times Square,v-1556462070,
3,[{'icon': {'prefix': 'https://ss3.4sqi.net/img...,,,,,,,False,5bc65d8e1822230025fffcb1,310 W 40th St,US,New York,United States,,267,"[310 W 40th St, New York, NY 10018, United Sta...","[{'lng': -73.99157595205548, 'lat': 40.7562071...",40.756207,-73.991576,,10018,NY,Aliz Hotel,v-1556462070,
4,[{'icon': {'prefix': 'https://ss3.4sqi.net/img...,,,,,,,False,4af0ae0df964a52037de21e3,319 W 48th St,US,New York,United States,btw 8th Ave & 9th Ave,572,"[319 W 48th St (btw 8th Ave & 9th Ave), New Yo...","[{'lng': -73.9880682833074, 'lat': 40.76159900...",40.761599,-73.988068,,10036,NY,The Belvedere Hotel,v-1556462070,
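
Indexing straight into `results['response']['venues']` assumes that the request succeeded. A small sketch of a check on the `meta` code (200 in the response shown earlier) before reading the venues; the helper name `extract_venues` and the failing sample payload are made up for illustration.

```python
def extract_venues(results):
    """Hypothetical helper: return the venue list, or raise if the API reported an error."""
    meta = results.get('meta', {})
    if meta.get('code') != 200:
        raise RuntimeError('Foursquare request failed: {}'.format(meta))
    return results.get('response', {}).get('venues', [])

# Trimmed sample payloads: one successful, one illustrative failure.
ok_results = {'meta': {'code': 200}, 'response': {'venues': [{'name': 'Paramount Hotel'}]}}
failed_results = {'meta': {'code': 400}, 'response': {}}

sample_venues = extract_venues(ok_results)
```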


#### Let's keep only the venue name, category, and location columns.

In [52]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so the column assignment below does not modify a view

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,neighborhood,postalCode,state,id
0,Paramount Hotel,Hotel,235 W 46th St,US,New York,United States,,351,"[235 W 46th St, New York, NY 10036, United Sta...","[{'lng': -73.98717696388796, 'lat': 40.7594749...",40.759475,-73.987177,,10036.0,NY,4adbaf34f964a520012a21e3
1,Hotel Edison,Hotel,228 W 47th St,US,New York,United States,at 8th Ave,407,"[228 W 47th St (at 8th Ave), New York, NY 1003...","[{'lng': -73.98608770066558, 'lat': 40.7596663...",40.759666,-73.986088,,10036.0,NY,4b4bbe3ff964a52016a626e3
2,AC Hotel Times Square,Hotel,260 W 40th St,US,New York,United States,40th btw 7th & 8th Ave,182,"[260 W 40th St (40th btw 7th & 8th Ave), New Y...","[{'lng': -73.99031690674605, 'lat': 40.7556572...",40.755657,-73.990317,,10018.0,NY,5ac67754a9e4026c7d487d95
3,Aliz Hotel,Hotel,310 W 40th St,US,New York,United States,,267,"[310 W 40th St, New York, NY 10018, United Sta...","[{'lng': -73.99157595205548, 'lat': 40.7562071...",40.756207,-73.991576,,10018.0,NY,5bc65d8e1822230025fffcb1
4,The Belvedere Hotel,Hotel,319 W 48th St,US,New York,United States,btw 8th Ave & 9th Ave,572,"[319 W 48th St (btw 8th Ave & 9th Ave), New Yo...","[{'lng': -73.9880682833074, 'lat': 40.76159900...",40.761599,-73.988068,,10036.0,NY,4af0ae0df964a52037de21e3
5,Kimpton Muse Hotel,Hotel,130 W 46th St,US,New York,United States,btwn 6th & 7th Ave,420,"[130 W 46th St (btwn 6th & 7th Ave), New York,...","[{'lng': -73.983764, 'lat': 40.757808, 'label'...",40.757808,-73.983764,,10036.0,NY,4a9f2f6ff964a520d93c20e3
6,Distrikt Hotel New York City,Hotel,453 Blvd of the Allies,US,New York,United States,9th Ave,376,"[453 Blvd of the Allies (9th Ave), New York, N...","[{'lng': -73.99287335653906, 'lat': 40.7567071...",40.756707,-73.992873,,10018.0,NY,4b58b0a5f964a520d66528e3
7,Millennium Broadway Hotel,Hotel,145 W 44th St,US,New York,United States,at Broadway,313,"[145 W 44th St (at Broadway), New York, NY 100...","[{'lng': -73.9848, 'lat': 40.757073, 'label': ...",40.757073,-73.9848,,10036.0,NY,4a0215c6f964a5202a711fe3
8,The Premier Hotel New York,Hotel,133 W 44th St,US,New York,United States,Broadway,334,"[133 W 44th St (Broadway), New York, NY 10036,...","[{'lng': -73.98449, 'lat': 40.75684, 'label': ...",40.75684,-73.98449,,10036.0,NY,53d7e382498e903b9f2e6e9f
9,Cassa Hotel NY 45th Street,Hotel,70 W 45th St,US,New York,United States,,527,"[70 W 45th St, New York, NY 10036, United States]","[{'lng': -73.98217049570087, 'lat': 40.7564104...",40.75641,-73.98217,,10036.0,NY,4c45fdcff9652d7fd2c4132b
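
Notice that some rows report a `distance` larger than the requested 500 m radius (572 and 527 above). If only hotels strictly within the radius should count as competitors, the results can be filtered afterwards. A minimal sketch on a few rows copied from the table; the variable names are illustrative.

```python
RADIUS_M = 500  # same value as the radius used in the search

# A few (name, distance) pairs copied from the table above.
hotels = [
    {'name': 'Paramount Hotel', 'distance': 351},
    {'name': 'AC Hotel Times Square', 'distance': 182},
    {'name': 'The Belvedere Hotel', 'distance': 572},
    {'name': 'Cassa Hotel NY 45th Street', 'distance': 527},
]

# Keep only hotels inside the radius, nearest first.
within_radius = [h for h in hotels if h['distance'] <= RADIUS_M]
nearest_first = sorted(within_radius, key=lambda h: h['distance'])
```

On the full dataframe, the equivalent would be `dataframe_filtered[dataframe_filtered['distance'] <= radius].sort_values('distance')`.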


List the names of the hotels.

In [53]:
dataframe_filtered.name

0                                       Paramount Hotel
1                                          Hotel Edison
2                                 AC Hotel Times Square
3                                            Aliz Hotel
4                                   The Belvedere Hotel
5                                    Kimpton Muse Hotel
6                          Distrikt Hotel New York City
7                             Millennium Broadway Hotel
8                            The Premier Hotel New York
9                            Cassa Hotel NY 45th Street
10    DoubleTree by Hilton Hotel New York - Times Sq...
11                                Archer Hotel New York
12              Renaissance New York Times Square Hotel
13                                       Hotel St James
14                             Night Hotel Times Square
15            The Algonquin Hotel, Autograph Collection
16                                           Hotel MELA
17                   Merrion Row Hotel and Publi

In [57]:

# add a red circle marker to represent the Hilton Times Square hotel
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Hilton Times Square',
    fill=True,
    fill_color='red',
    fill_opacity=0.6
).add_to(map_manhattan)

# add the nearby hotels as blue circle markers
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_manhattan)

# display map
map_manhattan
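
As a sanity check on the `distance` values returned by the search, the great-circle distance between two coordinates can be computed with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` is illustrative, and the coordinates are those of the Hilton Times Square and the Paramount Hotel from the outputs above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

hilton = (40.7564587, -73.9884226)
paramount = (40.759474944805255, -73.98717696388796)
paramount_distance = haversine_m(*hilton, *paramount)
```

The result should land within a few meters of the 351 m that Foursquare reported for the Paramount Hotel.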