# Capstone Project

## This notebook will mainly be used for the capstone project


In [3]:
#Dataframe manipulation library
import pandas as pd
import numpy as np

In [5]:
print ("Hello Capstone Project Course!")

Hello Capstone Project Course!


### Business Problem: 
JS hotel just opened in Manhattan, NY. It is targeting out-of-town visitors who want to explore all that Manhattan has to offer. Unfortunately, due to size limitations, the building does not come equipped with a gym. However, the hotel still wants to provide health and wellness options to its customers.

### Solution: 
Given this, the hotel wants to contact gyms in the area for a potential partnership and to put together a packet detailing where all the gyms in the area are located. The hotel really values high customer satisfaction so they also want this ethos to be reflected in the gyms they partner with.

### Data: 
I will pull location data for the Manhattan borough of NY and use the Foursquare API to find the gyms in the area, then filter to see which ones have the highest ratings. For example, I will use Foursquare's Venue API to search for gyms that are within a certain radius of the hotel.
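
The "within a certain radius" idea can be sketched as a great-circle distance filter. This is only an illustration: the hotel coordinates, the sample venues, and the helper names `haversine_m` and `venues_within_radius` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; lat2/lon2 may be scalars or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def venues_within_radius(venues, hotel_lat, hotel_lon, radius_m):
    """Keep rows of `venues` that lie within radius_m of the hotel."""
    dist = haversine_m(hotel_lat, hotel_lon,
                       venues['Venue Latitude'], venues['Venue Longitude'])
    return venues.assign(distance_m=dist)[dist <= radius_m]


# Hypothetical hotel location and two made-up venues, for illustration only.
sample = pd.DataFrame({
    'Venue': ['Near Gym', 'Far Gym'],
    'Venue Latitude': [40.7590, 40.8000],
    'Venue Longitude': [-73.9850, -73.9500],
})
nearby = venues_within_radius(sample, 40.7580, -73.9855, radius_m=500)
print(nearby[['Venue', 'distance_m']])
```

Filtering locally like this makes the radius explicit and independent of whatever radius the API call itself used.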

# Methodology



In [3]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

NYC has a total of 5 boroughs and 306 neighborhoods. I will focus specifically on the Manhattan borough. In order to segment the neighborhoods of Manhattan, I will use a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

In [1]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


In [4]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

The relevant data is the list of neighborhoods stored under the 'features' key. I place it in neighborhoods_data.

In [7]:
neighborhoods_data = newyork_data['features']

#### Transform the data into a _pandas_ dataframe

In [8]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
    

neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
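
As an alternative to looping, the same table could probably be built with `pd.json_normalize`, which the notebook imports but does not otherwise use. The sketch below runs on two sample features shaped like the entries of `newyork_data['features']`, with values copied from the table above; `sample_features` and `neighborhoods_alt` are illustrative names.

```python
import pandas as pd

# Two features in the same shape as newyork_data['features'].
# The real file has one entry per neighborhood.
sample_features = [
    {'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
     'geometry': {'coordinates': [-73.847201, 40.894705]}},
    {'properties': {'borough': 'Bronx', 'name': 'Co-op City'},
     'geometry': {'coordinates': [-73.829939, 40.874294]}},
]

# Nested dicts become dotted column names; the coordinate list stays a list.
flat = pd.json_normalize(sample_features)

neighborhoods_alt = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON stores points as [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(neighborhoods_alt)
```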


## Zeroing in on Manhattan data

In [9]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Getting the coordinates of Manhattan for the map later on

In [10]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [11]:
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

## Defining Foursquare credentials and starting up

In [12]:
CLIENT_ID = '2QDKVNQMPYFWLOLZ11U1XGDUTSKJGY3CF2RYGKDUZQ5FMFGV' # your Foursquare ID
CLIENT_SECRET = 'FOBGRZ2M0BUPUCWD5DIRNMR23ZX1T3MGJ14R4ZIP54A5XMIU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2QDKVNQMPYFWLOLZ11U1XGDUTSKJGY3CF2RYGKDUZQ5FMFGV
CLIENT_SECRET:FOBGRZ2M0BUPUCWD5DIRNMR23ZX1T3MGJ14R4ZIP54A5XMIU
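
Hard-coding and printing API credentials in a shared notebook exposes them to every reader. One common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined by Foursquare or by this notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(
            'Set {} before running the notebook'.format(missing.args[0])
        ) from None


# Example with an explicit mapping instead of the real environment
client_id, client_secret = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'example-id',
    'FOURSQUARE_CLIENT_SECRET': 'example-secret',
})
print('CLIENT_ID loaded:', bool(client_id))  # confirm presence without echoing the value
```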


## Gathering Venues for Manhattan Neighborhoods

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
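
The chained lookup `["response"]['groups'][0]['items']` raises an error if the response lacks the expected keys or if a venue has no category. A more tolerant parse could look like the sketch below. It assumes the response shape used in `getNearbyVenues`; the function name `parse_explore_items` and the sample payload are hypothetical.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Fall back to an empty category when the list is missing or empty
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0].get('name')))
    return rows


# Made-up payload in the shape the notebook expects
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Gym',
               'location': {'lat': 40.0, 'lng': -73.0},
               'categories': [{'name': 'Gym'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 40.1, 'lng': -73.1},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample_payload, 'Sample Neighborhood', 40.05, -73.05)
```

Rows with a missing category come back with `None` in the last position, so a later `.str.contains('Gym')` filter would need `na=False`.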

In [14]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

manhattan_venues.head()

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Rite Aid,40.875467,-73.908906,Pharmacy
4,Marble Hill,40.876551,-73.91066,Subway,40.874667,-73.909586,Sandwich Place


In [23]:
#filtering for venues that fall into a category with the word gym

manhattan_gyms = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Gym')] 
manhattan_gyms.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
19,Marble Hill,40.876551,-73.91066,Blink Fitness,40.877271,-73.905595,Gym
172,Washington Heights,40.851903,-73.9369,Blink Fitness,40.848562,-73.936941,Gym
191,Washington Heights,40.851903,-73.9369,Planet Fitness,40.847536,-73.937937,Gym / Fitness Center
315,Manhattanville,40.816934,-73.957385,Steep Rock West,40.816668,-73.957969,Climbing Gym
376,Central Harlem,40.815976,-73.943211,Harlem YMCA,40.81479,-73.94291,Gym / Fitness Center
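
Because each neighborhood is searched with its own radius, the same venue could in principle be returned for two nearby neighborhoods, and category names may differ in capitalization. The sketch below shows a case-insensitive filter followed by de-duplication. The toy rows are hypothetical and only mimic the layout of `manhattan_venues`.

```python
import pandas as pd

# Toy rows in the same layout as manhattan_venues (hypothetical values)
venues = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'B', 'B'],
    'Venue': ['Blink Fitness', 'Blink Fitness', 'Corner Yoga', 'Rock Wall'],
    'Venue Latitude': [40.8486, 40.8486, 40.8490, 40.8500],
    'Venue Longitude': [-73.9369, -73.9369, -73.9370, -73.9380],
    'Venue Category': ['Gym', 'Gym', 'Yoga Studio', 'climbing gym'],
})

# Case-insensitive match; na=False treats missing categories as non-matches
is_gym = venues['Venue Category'].str.contains('gym', case=False, na=False)

# Keep the first occurrence of each physical venue
unique_gyms = (venues[is_gym]
               .drop_duplicates(subset=['Venue', 'Venue Latitude', 'Venue Longitude'])
               .reset_index(drop=True))
print(unique_gyms)
```

De-duplicating changes the per-neighborhood counts, so it fits a citywide gym list better than the neighborhood comparison that follows.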


In [24]:
# Calculating the number of gyms returned for each neighborhood
manhattan_gyms.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,3,3,3,3,3,3
Carnegie Hill,6,6,6,6,6,6
Central Harlem,3,3,3,3,3,3
Chelsea,2,2,2,2,2,2
Civic Center,8,8,8,8,8,8
Clinton,7,7,7,7,7,7
Financial District,6,6,6,6,6,6
Flatiron,5,5,5,5,5,5
Greenwich Village,2,2,2,2,2,2
Hudson Yards,6,6,6,6,6,6
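
`count()` repeats the same number in every column. A single count column, sorted from most to fewest gyms, may be easier to read. The sketch below uses `groupby(...).size()` on toy rows; the neighborhood and venue names are placeholders, not the notebook's data.

```python
import pandas as pd

# Toy gym rows (placeholder neighborhoods and venues)
gyms = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Venue': ['g1', 'g2', 'g3', 'g4', 'g5', 'g6'],
})

gym_counts = (gyms.groupby('Neighborhood')
                  .size()                        # one number per neighborhood
                  .sort_values(ascending=False)
                  .rename('Gym Count')
                  .reset_index())
print(gym_counts)
```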


In [27]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_gyms[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_gyms['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Boxing Gym,Climbing Gym,College Gym,Gym,Gym / Fitness Center,Gym Pool
19,Marble Hill,0,0,0,1,0,0
172,Washington Heights,0,0,0,1,0,0
191,Washington Heights,0,0,0,0,1,0
315,Manhattanville,0,1,0,0,0,0
376,Central Harlem,0,0,0,0,1,0


In [28]:
# group rows by neighborhood, taking the mean frequency of occurrence of each category

manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Boxing Gym,Climbing Gym,College Gym,Gym,Gym / Fitness Center,Gym Pool
0,Battery Park City,0.0,0.0,0.0,1.0,0.0,0.0
1,Carnegie Hill,0.0,0.0,0.166667,0.5,0.333333,0.0
2,Central Harlem,0.0,0.0,0.0,0.333333,0.666667,0.0
3,Chelsea,0.0,0.0,0.0,0.5,0.5,0.0
4,Civic Center,0.125,0.0,0.0,0.25,0.625,0.0
5,Clinton,0.0,0.0,0.0,0.428571,0.571429,0.0
6,Financial District,0.0,0.0,0.0,0.333333,0.666667,0.0
7,Flatiron,0.0,0.0,0.0,0.4,0.6,0.0
8,Greenwich Village,0.0,0.0,0.0,1.0,0.0,0.0
9,Hudson Yards,0.0,0.0,0.0,0.333333,0.666667,0.0
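
To see where these fractions come from, take Carnegie Hill: the count table shows 6 gyms, and the frequencies 0.166667, 0.5 and 0.333333 correspond to 1, 3 and 2 venues. The sketch below rebuilds that single row from those implied counts and checks it against `pd.crosstab` with `normalize='index'`, which appears to give the same shares in one call.

```python
import pandas as pd

# Category counts implied by the two tables above for Carnegie Hill:
# 6 gyms in total, split 1 / 3 / 2.
carnegie = pd.DataFrame({
    'Neighborhood': ['Carnegie Hill'] * 6,
    'Venue Category': ['College Gym'] + ['Gym'] * 3 + ['Gym / Fitness Center'] * 2,
})

# One-hot encode, then average per neighborhood (the notebook's approach)
onehot = pd.get_dummies(carnegie['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', carnegie['Neighborhood'])
frequencies = onehot.groupby('Neighborhood').mean()

# Row-normalized cross-tabulation gives the same shares
shares = pd.crosstab(carnegie['Neighborhood'], carnegie['Venue Category'],
                     normalize='index')
print(frequencies.round(6))
```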


In [91]:
# Listing the top 3 most common gym categories for each neighborhood

num_top_gyms = 3

def return_most_common_gyms(row, num_top_gyms):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_gyms]


indicators = ['st', 'nd', 'rd']

# create columns according to number of top gyms
columns = ['Neighborhood']
for ind in np.arange(num_top_gyms):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_gyms_sorted = pd.DataFrame(columns=columns)
neighborhoods_gyms_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_gyms_sorted.iloc[ind, 1:] = return_most_common_gyms(manhattan_grouped.iloc[ind, :], num_top_gyms)

neighborhoods_gyms_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Battery Park City,Gym,Boxing Gym,Climbing Gym
1,Carnegie Hill,Gym,Gym / Fitness Center,College Gym
2,Central Harlem,Gym / Fitness Center,Gym,Boxing Gym
3,Chelsea,Gym,Gym / Fitness Center,Boxing Gym
4,Civic Center,Gym / Fitness Center,Gym,Boxing Gym
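
Note that Battery Park City has only the category "Gym" in the frequency table, yet "Boxing Gym" and "Climbing Gym" appear as its 2nd and 3rd most common venues. These are zero-frequency categories that were ranked anyway. One possible fix is to rank only the categories that are actually present. The helper name `top_present_categories` is an assumption for this sketch.

```python
import pandas as pd


def top_present_categories(freq_row, n):
    """Return up to n category names with non-zero frequency, most frequent first."""
    present = freq_row[freq_row > 0].sort_values(ascending=False)
    names = list(present.index[:n])
    return names + [None] * (n - len(names))  # pad so every row has n slots


# Frequencies copied from the grouped table above
battery_park = pd.Series({'Boxing Gym': 0.0, 'Climbing Gym': 0.0, 'College Gym': 0.0,
                          'Gym': 1.0, 'Gym / Fitness Center': 0.0, 'Gym Pool': 0.0})
civic_center = pd.Series({'Boxing Gym': 0.125, 'Climbing Gym': 0.0, 'College Gym': 0.0,
                          'Gym': 0.25, 'Gym / Fitness Center': 0.625, 'Gym Pool': 0.0})

print(top_present_categories(battery_park, 3))
print(top_present_categories(civic_center, 3))
```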


## K Means Clustering

In [92]:
# set number of clusters
kclusters = 3

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 2, 2, 2, 2, 2, 2, 0, 2], dtype=int32)
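
The choice of 3 clusters is fixed by hand here. A common way to sanity-check it is to compare silhouette scores for several values of k. The sketch below runs on made-up frequency vectors rather than the notebook's data, so its result says nothing about the real neighborhoods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy frequency vectors (columns: Boxing Gym, Gym, Gym / Fitness Center).
# These rows are made up and form three tight, well-separated groups.
X = np.array([
    [0.00, 1.00, 0.00], [0.00, 0.90, 0.10], [0.00, 0.95, 0.05],
    [1.00, 0.00, 0.00], [0.90, 0.10, 0.00], [0.95, 0.00, 0.05],
    [0.00, 0.10, 0.90], [0.00, 0.00, 1.00], [0.05, 0.00, 0.95],
])

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data the same loop would take `manhattan_grouped_clustering` in place of `X`.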

In [93]:
# add clustering labels
neighborhoods_gyms_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_gyms_sorted.set_index('Neighborhood'), on='Neighborhood')


manhattan_merged = manhattan_merged.dropna(subset=['Cluster Labels'])

# converting 'Cluster Labels' from float to int
manhattan_merged['Cluster Labels'] = manhattan_merged['Cluster Labels'].astype(int)

manhattan_merged.head() # check the last columns!


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,0,Gym,Boxing Gym,Climbing Gym
2,Manhattan,Washington Heights,40.851903,-73.9369,2,Gym,Gym / Fitness Center,Boxing Gym
5,Manhattan,Manhattanville,40.816934,-73.957385,1,Climbing Gym,Boxing Gym,College Gym
6,Manhattan,Central Harlem,40.815976,-73.943211,2,Gym / Fitness Center,Gym,Boxing Gym
8,Manhattan,Upper East Side,40.775639,-73.960508,2,Gym / Fitness Center,Boxing Gym,Climbing Gym


In [94]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

In [95]:
# Cluster 1
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Marble Hill,Gym,Boxing Gym,Climbing Gym
9,Yorkville,Gym,Gym / Fitness Center,Boxing Gym
18,Greenwich Village,Gym,Boxing Gym,Climbing Gym
24,West Village,Gym,Boxing Gym,Climbing Gym
28,Battery Park City,Gym,Boxing Gym,Climbing Gym


In [96]:
#Cluster 2
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
5,Manhattanville,Climbing Gym,Boxing Gym,College Gym
15,Midtown,Gym,Boxing Gym,Gym / Fitness Center
22,Little Italy,Boxing Gym,Climbing Gym,College Gym
23,Soho,Boxing Gym,Gym,Climbing Gym
31,Noho,Boxing Gym,Gym,Climbing Gym


In [97]:
#Cluster 3
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
2,Washington Heights,Gym,Gym / Fitness Center,Boxing Gym
6,Central Harlem,Gym / Fitness Center,Gym,Boxing Gym
8,Upper East Side,Gym / Fitness Center,Boxing Gym,Climbing Gym
10,Lenox Hill,Gym,Gym / Fitness Center,Boxing Gym
11,Roosevelt Island,Gym,Gym / Fitness Center,Boxing Gym
12,Upper West Side,Gym,Gym / Fitness Center,Boxing Gym
13,Lincoln Square,Gym,Gym / Fitness Center,Boxing Gym
14,Clinton,Gym / Fitness Center,Gym,Boxing Gym
16,Murray Hill,Gym / Fitness Center,Boxing Gym,Gym
17,Chelsea,Gym,Gym / Fitness Center,Boxing Gym
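
Besides reading the tables row by row, each cluster can be summarized by its size and its average category frequencies. The sketch below uses toy values; the frame `freq` and the list `labels` stand in for `manhattan_grouped` and `kmeans.labels_`.

```python
import pandas as pd

# Toy frequencies and cluster labels (hypothetical values for illustration)
freq = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Gym': [1.0, 1.0, 0.4, 0.5],
    'Gym / Fitness Center': [0.0, 0.0, 0.6, 0.5],
})
labels = [0, 0, 2, 2]

# Mean frequency of each category within each cluster
profile = (freq.drop(columns='Neighborhood')
               .assign(Cluster=labels)
               .groupby('Cluster')
               .mean())

# Number of neighborhoods per cluster
sizes = pd.Series(labels).value_counts().sort_index()
print(profile)
print(sizes)
```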


## Results, Discussion, Conclusion

After exploring the selected gyms in the area, JS hotel learns that there is an overall abundance of gyms in Manhattan that it can recommend to its guests. After clustering the available gyms, the hotel also learns that it should recommend neighborhoods based on the type of experience the guest is looking for. If the guest is more concerned with availability and the widest selection, the hotel should recommend neighborhoods in cluster 3. If the guest is looking for specialty wellness, the hotel should recommend neighborhoods in cluster 2.