# Capstone Project - The Battle of Neighborhoods (Week 2)
## By Coursera and IBM

Table of Contents
1. Introduction/Business Problem
2. Data
  * Description of data
  * Data processing  
3. Methodology  
4. Results  
5. Discussion  
6. Conclusion  

###  1. Introduction / Business Problem

In this project we will try to find the best suited location for opening gyms. This report is targeted and will be a help for stakeholders who want to open a gym in Manhattan, New York.


**Problem:**

The population of Manhattan is significantly high as compared to the other neighborhoods in the city. Hence, demand for gyms has increased significantly. Plus, some areas are completely devoid of gyms. People living in these areas have to travel far just for such a basic facility. Since, gyms are very important from the perspective of staying fit, new gyms need to be constructed. This report will help in identifying areas which have no facility of gym nearby. Hence new gym can be created in this vicinity. 


### 2. Data

### 2.1 Description of Data

This dataset is from New York University Libraries of New York from Spatial Data Repository of NYU. This dataset exists for free on the web. The .json file has coordinates of the city of New York. This is the link to the dataset https://geo.nyu.edu/catalog/nyu_2451_34572.


This file is downloaded and cleaned. The data is placed on the server, so that we can simply use a `wget` command and access the data. The dataset contains 5 boroughs and the neighborhoods that exist in each borough as well as the latitude and longitude coordinates of each neighborhood. We only need 1 borough – Manhattan, hence we need to narrow down the data and only process Manhattan’s neighborhoods.


Based on definition of our problem, factors that will influence our decision are:
  1. Number of existing gyms in the neighbourhood
  2. Number of neighbourhoods devoid of gyms


Following data sources will be needed to extract/generate the required information:
  * Forsquare API will be needed to get the most common venues of Manhattan Borough of New York. The venue that we are interested in is “Gym”.
  * Coordinate of Manhattan centre will be obtained using Google Maps API geocoding of well known location Manhattan.


### 2.2 Data Processing

### Importing important libraries.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


### Downloading the data using wget command.

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


### Opening the data and accessing it.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
newyork_data

{'bbox': [-74.2492599487305,
  40.5033187866211,
  -73.7061614990234,
  40.9105606079102],
 'crs': {'properties': {'name': 'urn:ogc:def:crs:EPSG::4326'}, 'type': 'name'},
 'features': [{'geometry': {'coordinates': [-73.84720052054902,
     40.89470517661],
    'type': 'Point'},
   'geometry_name': 'geom',
   'id': 'nyu_2451_34572.1',
   'properties': {'annoangle': 0.0,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661],
    'borough': 'Bronx',
    'name': 'Wakefield',
    'stacked': 1},
   'type': 'Feature'},
  {'geometry': {'coordinates': [-73.82993910812398, 40.87429419303012],
    'type': 'Point'},
   'geometry_name': 'geom',
   'id': 'nyu_2451_34572.2',
   'properties': {'annoangle': 0.0,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.874294193

### All the relevant data is in the features key (list of the neighborhoods). So defining a new variable to include this data.

In [5]:
neighborhoods_data = newyork_data['features']

### Looking at the first item in this list.

In [6]:
neighborhoods_data[0]

{'geometry': {'coordinates': [-73.84720052054902, 40.89470517661],
  'type': 'Point'},
 'geometry_name': 'geom',
 'id': 'nyu_2451_34572.1',
 'properties': {'annoangle': 0.0,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661],
  'borough': 'Bronx',
  'name': 'Wakefield',
  'stacked': 1},
 'type': 'Feature'}

### Transforming this data of nested Python dictionaries into a pandas dataframe. Starting by creating an empty dataframe.

In [7]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [8]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


### Looping through the data and filling the dataframe one row at a time.

In [9]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

### Let's examine the dataframe.

In [10]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### Just checking the number of boroughs and neighborhoods. :)

In [11]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## Using geopy library to get the latitude and longitude values of New York City.

In [12]:
# We need an instance of the geocoder, therefore we need to define a user_agent. Name of our agent is ny_explorer.
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


### Creating a map of New York with all the neighborhoods

In [13]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### We need only 1 borough- Manhattan. Let's simplify the above map and segment and cluster only the neighborhoods in Manhattan. 

In [14]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688
5,Manhattan,Manhattanville,40.816934,-73.957385
6,Manhattan,Central Harlem,40.815976,-73.943211
7,Manhattan,East Harlem,40.792249,-73.944182
8,Manhattan,Upper East Side,40.775639,-73.960508
9,Manhattan,Yorkville,40.77593,-73.947118


### Getting the geographical coordinates of Manhattan.

In [15]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7900869, -73.9598295.


### Let's visualize Manhattan and the neighborhoods in it.

In [16]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

### Now our data is ready for the next step. :)

### 3. Methodology

First, we utilize the Foursquare API to explore the neighbourhoods in Manhattan and segment them. Our focus is to find gyms in Manhattan. Hence, we need to explore venues in Manhattan. For this we use Foursquare API which returns us venues. We set the Foursquare API so that it returns us the top 100 venues within the radius of 500 meters. Then we create a new dataframe manhattan_venues.

Now, we have 2 goals –  
1.	To find neighbourhoods with very few gyms.  
2.	To find neighbourhoods with no gyms  
Our main goal is to construct gyms in the areas which have no gyms. We need these neighbourhoods because these neighbourhoods are areas where gyms can be constructed. So stakeholders can construct gyms in these areas.


So, we proceed to create two major dataframes.  
1. venues_with_gyms  
2. df_no_gyms  

### Now, we define Foursquare Credentials and Version

In [17]:
CLIENT_ID = 'L0XBWQWQS03SXZSKIKAM3VFX0QXFWGBUDSOMXFZD0BOK0ION' # your Foursquare ID
CLIENT_SECRET = 'V2D4N3NRAHMZ3QXZ4G5WBQNVD5SAMVDJCCOGLNPOBSYJDFA4' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: L0XBWQWQS03SXZSKIKAM3VFX0QXFWGBUDSOMXFZD0BOK0ION
CLIENT_SECRET:V2D4N3NRAHMZ3QXZ4G5WBQNVD5SAMVDJCCOGLNPOBSYJDFA4


### Let's explore the first neighborhood in our dataframe to check our data.

In [18]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

### Checking number of unique rows in Manhattan

In [19]:
len(manhattan_data['Neighborhood'].unique().tolist())

40

#### So the first data in our dataframe is Marble Hill. Now, lets get latitude and longitude values of Marble Hill.

In [20]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


#### Getting the top 100 venues that are in Marble Hill within a radius of 500 meters.

In [21]:
#Creating the GET request URL.
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url


'https://api.foursquare.com/v2/venues/explore?&client_id=L0XBWQWQS03SXZSKIKAM3VFX0QXFWGBUDSOMXFZD0BOK0ION&client_secret=V2D4N3NRAHMZ3QXZ4G5WBQNVD5SAMVDJCCOGLNPOBSYJDFA4&v=20180605&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'

In [22]:
#Sending the GET request and examine the resutls
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cf04d496a607149390bc05e'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4b4429abf964a52037f225e3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/pizza_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d1ca941735',
         'name': 'Pizza Place',
         'pluralName': 'Pizza Places',
         'primary': True,
         'shortName': 'Pizza'}],
       'delivery': {'id': '72548',
        'provider': {'icon': {'name': '/delivery_provider_seamless_20180129.png',
          'prefix': 'https://fastly.4sqi.net/img/general/cap/',
          'sizes': [40, 50]},
         'name': 'seamless'},
        'url': 'https://www.seamless.com/menu/arturos-pizza-5189-broadway-ave-new-york/72548?affiliate=1131&utm_source=foursquare-affiliat

#### All the information is in the *items* key. Therefore, let's use the **get_category_type** function (from the Foursquare lab).

In [23]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Cleaning the json file and structuring it into a *pandas* dataframe.

In [24]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Dunkin',Donut Shop,40.877136,-73.906666
4,Starbucks,Coffee Shop,40.877531,-73.905582


#### Lets check how many different venues came from Foursquare.

In [25]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

25 venues were returned by Foursquare.


### Now, we have checked the data with one neighborhood. Now, lets explore all the Neighborhoods in Manhattan. 

In [26]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

### Now writing the code to run the above function on each neighborhood and creating a new dataframe called **manhattan_venues**.

In [27]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


### Checking the size and few rows of the resulting dataframe.

In [28]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3324, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop


### Now, let's check the actual number of neighborhoods present in Manhattan.

In [29]:
len(manhattan_venues['Neighborhood'].unique().tolist())

40

### Creating a new dataframe **neighborhoods_in_Manhattan** for convenience.

In [30]:
neighorhoods_in_Manhattan = pd.DataFrame(manhattan_venues)

### Now, our main goal is to find neighborhoods which are devoid of gyms. So, let's create another dataframe **df_no_gym**.

### But first, let's filter out rows which have gyms from the data.

In [31]:
df_no_gym = pd.DataFrame(neighorhoods_in_Manhattan)
df_no_gym = neighorhoods_in_Manhattan[neighorhoods_in_Manhattan["Venue Category"] != "Gym"]
df_no_gym.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop


### Now, we will find distict neighborhoods in Manhattan which have no gyms.

In [32]:
df_no_gym = df_no_gym.reset_index().drop_duplicates(subset=['Neighborhood'],keep='first').set_index('index')
df_no_gym.head()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
25,Chinatown,40.715618,-73.994279,Hotel 50 Bowery NYC,40.715936,-73.996789,Hotel
125,Washington Heights,40.851903,-73.9369,The Uptown Garrison,40.851255,-73.939473,Restaurant
211,Inwood,40.867684,-73.92121,PJ Wine,40.867251,-73.922349,Wine Shop
271,Hamilton Heights,40.823604,-73.949688,The Grange Bar & Eatery,40.822554,-73.949532,Cocktail Bar


### Let's reset the index for our convenience

In [33]:
df_no_gym = df_no_gym.reset_index()

In [34]:
df_no_gym

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,24,Chinatown,40.715618,-73.994279,Hotel 50 Bowery NYC,40.715936,-73.996789,Hotel
2,124,Washington Heights,40.851903,-73.9369,The Uptown Garrison,40.851255,-73.939473,Restaurant
3,209,Inwood,40.867684,-73.92121,PJ Wine,40.867251,-73.922349,Wine Shop
4,265,Hamilton Heights,40.823604,-73.949688,The Grange Bar & Eatery,40.822554,-73.949532,Cocktail Bar
5,324,Manhattanville,40.816934,-73.957385,Jin Ramen,40.815406,-73.958547,Ramen Restaurant
6,363,Central Harlem,40.815976,-73.943211,Harlem Cycle,40.817201,-73.942592,Cycle Studio
7,409,East Harlem,40.792249,-73.944182,East Harlem Bottling Co.,40.793024,-73.945727,Beer Bar
8,457,Upper East Side,40.775639,-73.960508,Sant Ambroeus Madison Ave,40.775328,-73.962819,Italian Restaurant
9,557,Yorkville,40.77593,-73.947118,Peng's Noodle Folk,40.777258,-73.94911,Asian Restaurant


### Let's find neighborhoods which actually have gyms.

In [34]:
# Seeing venues which are gyms
venues_with_gym = pd.DataFrame(manhattan_venues)
venues_with_gym = manhattan_venues[manhattan_venues["Venue Category"] == "Gym"]
venues_with_gym.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
5,Marble Hill,40.876551,-73.91066,Blink Fitness Riverdale,40.877147,-73.905837,Gym
69,Chinatown,40.715618,-73.994279,Bowery CrossFit,40.717812,-73.992624,Gym
176,Washington Heights,40.851903,-73.9369,Blink Fitness Washington Heights,40.848489,-73.936794,Gym
203,Washington Heights,40.851903,-73.9369,Lucille Roberts,40.848487,-73.934636,Gym
410,Central Harlem,40.815976,-73.943211,Lt Joseph P Kennedy Jr Community Center Gym,40.812608,-73.939699,Gym


### Let's reset the index of this dataframe too.

In [35]:
venues_with_gym = venues_with_gym.reset_index()

In [36]:
venues_with_gym.head()

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,5,Marble Hill,40.876551,-73.91066,Blink Fitness Riverdale,40.877147,-73.905837,Gym
1,69,Chinatown,40.715618,-73.994279,Bowery CrossFit,40.717812,-73.992624,Gym
2,176,Washington Heights,40.851903,-73.9369,Blink Fitness Washington Heights,40.848489,-73.936794,Gym
3,203,Washington Heights,40.851903,-73.9369,Lucille Roberts,40.848487,-73.934636,Gym
4,410,Central Harlem,40.815976,-73.943211,Lt Joseph P Kennedy Jr Community Center Gym,40.812608,-73.939699,Gym


### Before moving forward, let's analyze how many gyms are present in each neighborhood (neighborhoods which actually have gym).

In [37]:
venues_with_gym['Neighborhood'].value_counts()

Yorkville             6
Flatiron              5
Financial District    4
Murray Hill           3
Battery Park City     3
Carnegie Hill         3
Sutton Place          3
Lenox Hill            3
Tribeca               3
Washington Heights    2
Lincoln Square        2
Midtown               2
Hudson Yards          2
Clinton               2
Greenwich Village     2
Civic Center          2
Chinatown             1
Central Harlem        1
Marble Hill           1
Roosevelt Island      1
Noho                  1
Upper West Side       1
Chelsea               1
West Village          1
East Harlem           1
Midtown South         1
Tudor City            1
Turtle Bay            1
Lower East Side       1
Name: Neighborhood, dtype: int64

### Observing the data we see that following neighborhoods already have high and suffienct amount of gyms constructed.
Yorkville - 6  
Flatiron - 5  
Financial District, Battery Park City - 4  
Tribeca, Murray Hill, Lenox Hill, Carnegie Hill, Sutton Place - 3  
### Hence, these neighborhoods don't need new gyms.

### Now, we will cluster the neighborhoods with respect to high frequency of gyms.

### Creating new dataframe for convenience.

In [38]:
venues_with_gym_clusters = pd.DataFrame(venues_with_gym)
venues_with_gym_clusters.head()

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,5,Marble Hill,40.876551,-73.91066,Blink Fitness Riverdale,40.877147,-73.905837,Gym
1,69,Chinatown,40.715618,-73.994279,Bowery CrossFit,40.717812,-73.992624,Gym
2,176,Washington Heights,40.851903,-73.9369,Blink Fitness Washington Heights,40.848489,-73.936794,Gym
3,203,Washington Heights,40.851903,-73.9369,Lucille Roberts,40.848487,-73.934636,Gym
4,410,Central Harlem,40.815976,-73.943211,Lt Joseph P Kennedy Jr Community Center Gym,40.812608,-73.939699,Gym


### Finding how many gyms are present within single neighborhood.

In [39]:
venues_with_gym_clusters = venues_with_gym_clusters['Neighborhood'].value_counts()

In [40]:
venues_with_gym_clusters

Yorkville             6
Flatiron              5
Financial District    4
Murray Hill           3
Battery Park City     3
Carnegie Hill         3
Sutton Place          3
Lenox Hill            3
Tribeca               3
Washington Heights    2
Lincoln Square        2
Midtown               2
Hudson Yards          2
Clinton               2
Greenwich Village     2
Civic Center          2
Chinatown             1
Central Harlem        1
Marble Hill           1
Roosevelt Island      1
Noho                  1
Upper West Side       1
Chelsea               1
West Village          1
East Harlem           1
Midtown South         1
Tudor City            1
Turtle Bay            1
Lower East Side       1
Name: Neighborhood, dtype: int64

### We cluster the neighborhoods which have high number of gyms.

In [41]:
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(venues_with_gym_clusters.values.reshape(-1,1))
labels = k_means.labels_

print(labels)

[2 2 4 1 1 1 1 1 1 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0]


### Following neigborhoods have high number of gyms

In [42]:
venues_with_gym_clusters["Labels"] = labels
venues_with_gym_clusters.head(5)

Yorkville             6
Flatiron              5
Financial District    4
Murray Hill           3
Battery Park City     3
Name: Neighborhood, dtype: object

### Now, getting back to find neighborhoods with no gyms. We will use **venues_with_gym** dataframe for further processing.

### Let's check number of unique values in the dataframe **venues_with_gym** 

In [43]:
len(venues_with_gym['Neighborhood'].unique().tolist())

29

### Now we'll find only the "neighborhoods" (without the rest of the information) which have gyms. 

In [44]:
venues_with_gym.Neighborhood.unique()

array(['Marble Hill', 'Chinatown', 'Washington Heights', 'Central Harlem',
       'East Harlem', 'Yorkville', 'Lenox Hill', 'Roosevelt Island',
       'Upper West Side', 'Lincoln Square', 'Clinton', 'Midtown',
       'Murray Hill', 'Chelsea', 'Greenwich Village', 'Lower East Side',
       'Tribeca', 'West Village', 'Battery Park City',
       'Financial District', 'Carnegie Hill', 'Noho', 'Civic Center',
       'Midtown South', 'Sutton Place', 'Turtle Bay', 'Tudor City',
       'Flatiron', 'Hudson Yards'], dtype=object)

### To find neigborhoods which do not have any gym, we need to drop all the neighborhoods which have occured in **venues_with_gym** from the dataframe **df_no_gym**.

In [45]:
df_no_gym = df_no_gym.drop(df_no_gym.index[[0, 1, 2, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 24, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39]])

In [46]:
df_no_gym

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
3,211,Inwood,40.867684,-73.92121,PJ Wine,40.867251,-73.922349,Wine Shop
4,271,Hamilton Heights,40.823604,-73.949688,The Grange Bar & Eatery,40.822554,-73.949532,Cocktail Bar
5,330,Manhattanville,40.816934,-73.957385,Jin Ramen,40.815406,-73.958547,Ramen Restaurant
8,466,Upper East Side,40.775639,-73.960508,Sant Ambroeus Madison Ave,40.775328,-73.962819,Italian Restaurant
19,1492,East Village,40.727847,-73.982226,Good Beer NYC,40.727643,-73.983835,Beer Store
20,1592,Lower East Side,40.717807,-73.98089,Spoke Art NYC,40.718395,-73.982844,Art Gallery
22,1752,Little Italy,40.719324,-73.997305,La Compagnie des Vins Surnaturels,40.720448,-73.997969,Wine Bar
23,1852,Soho,40.722184,-74.000657,Sam Brocato Salon,40.722371,-74.002562,Salon / Barbershop
25,2052,Manhattan Valley,40.797307,-73.964286,Saiguette,40.799209,-73.96278,Vietnamese Restaurant
26,2111,Morningside Heights,40.808,-73.963896,Alma Mater Statue,40.807726,-73.962252,Outdoor Sculpture


### Resetting the index

In [47]:
df_no_gym = df_no_gym.reset_index()
df_no_gym

Unnamed: 0,level_0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,3,211,Inwood,40.867684,-73.92121,PJ Wine,40.867251,-73.922349,Wine Shop
1,4,271,Hamilton Heights,40.823604,-73.949688,The Grange Bar & Eatery,40.822554,-73.949532,Cocktail Bar
2,5,330,Manhattanville,40.816934,-73.957385,Jin Ramen,40.815406,-73.958547,Ramen Restaurant
3,8,466,Upper East Side,40.775639,-73.960508,Sant Ambroeus Madison Ave,40.775328,-73.962819,Italian Restaurant
4,19,1492,East Village,40.727847,-73.982226,Good Beer NYC,40.727643,-73.983835,Beer Store
5,20,1592,Lower East Side,40.717807,-73.98089,Spoke Art NYC,40.718395,-73.982844,Art Gallery
6,22,1752,Little Italy,40.719324,-73.997305,La Compagnie des Vins Surnaturels,40.720448,-73.997969,Wine Bar
7,23,1852,Soho,40.722184,-74.000657,Sam Brocato Salon,40.722371,-74.002562,Salon / Barbershop
8,25,2052,Manhattan Valley,40.797307,-73.964286,Saiguette,40.799209,-73.96278,Vietnamese Restaurant
9,26,2111,Morningside Heights,40.808,-73.963896,Alma Mater Statue,40.807726,-73.962252,Outdoor Sculpture


### Finally, these are the neighborhoods which are completely devoid of gyms. Hence, these places should be considered first to construct a gym.

In [48]:
df_no_gym.Neighborhood

0                  Inwood
1        Hamilton Heights
2          Manhattanville
3         Upper East Side
4            East Village
5         Lower East Side
6            Little Italy
7                    Soho
8        Manhattan Valley
9     Morningside Heights
10               Gramercy
11        Stuyvesant Town
Name: Neighborhood, dtype: object

### Let's visualize the places where there is no gym.

### Getting Manhattan latitude and longitude values.

In [49]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7900869, -73.9598295.


### Creating a map.

In [50]:
# create map of Manhattan using latitude and longitude values
map_manhattan_no_gym = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_no_gym['Neighborhood Latitude'], df_no_gym['Neighborhood Longitude'], df_no_gym['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan_no_gym)  
    
map_manhattan_no_gym

### 4. Result

After all the processing, we observe the following results -:   
1.	There are very few neighbourhoods which have sufficient amount of gyms. And these neighbourhoods are :-  
    Yorkville  
    Flatiron  
    Financial District, Battery Park City  
    Tribeca, Murray Hill, Lenox Hill, Carnegie Hill, Sutton Place     
    
2.	Excluding the above places, rest all the neighbourhoods are good options to create gyms.  

3.	The best options for creating gyms are neighbourhoods with no gym. These are –  
•	Inwood  
•	Hamilton Heights   
•	Manhattanville   
•	Upper East Side   
•	East Village  
•	Lower East Side  
•	Little Italy  
•	Soho  
•	Manhattan Valley  
•	Morningside Heights  
•	Gramercy  
•	Stuyvesant Town  



### 5. Discussion

Gyms are very important, especially from the perspective of fitness. All age groups should have access to gyms. Gyms help in increasing bone density if followed by proper diet. It’s a basic facility which is neglected a lot. Stakeholders who are seeking profits in fitness industry should start with creating new gyms or any fitness school. New gyms can open up even employment for those who pursue fitness as their careers, for example gym trainers. This report helps stakeholders to target neighbourhoods in Manhattan through which they can attain maximum profits. Plus new gyms will be constructed for the people and benefit people.  


The best strategy would be start your business related to gym or construction of gyms from the neighbourhoods which do not have gyms. Even in these neighbourhoods we can start from neighbourhoods which are more popular and dense like Manhattan valley. If opting only to promote gyming equipments and stakeholders are not interested in construction of gyms, then opting neighbourhoods which have few gyms would be a good option. Neighbourhood like Chinatown which have only one gym but is a neighbourhood which is fairly popular, there could be a lot of load on a single gym. Hence chances are that they will need renewal of their equipments more.


### 6. Conclusion

Gyms and fitness are very important aspects of human life. Enhancing and increasing quality of them is hence very necessary. This report is beneficial to people who promote fitness or want to construct gyms and who aim to make peoples’ lives fitter and healthier. 