# Capstone Project - The Battle of the Neighborhoods

## Table of Contents

* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

**Problem** : A person wants to open a sports shop that supplies sports equipment in Manhattan, New York. The main customers of this business will be schools and stadiums in the locality.

**Where and how Foursquare is used** : Using Foursquare, we can get the list of stadiums, schools, and similar venues in each neighborhood.



## Data <a name="data"></a>

Based on the definition of our problem, the factor that will influence our decision is:
* Number of potential customer locations around the location under inspection

The following data sources are needed to extract or generate the required information:
* Neighborhood data for New York is obtained from the `newyork_data.json` dataset
* Centers of candidate areas are generated algorithmically, and approximate addresses of those centers are obtained using **LocationIQ API reverse geocoding**
* The number of schools, colleges, rec centers, etc., and their type and location in every neighborhood are obtained using the **Foursquare API**. The IDs of the various categories were taken from the Foursquare website: https://developer.foursquare.com/docs/resources/categories
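
The Foursquare lookup described above can be sketched as follows. The category IDs are the ones used in the cells later in this notebook, and the query parameters mirror the request URL built in `getNearbyVenues`. The helper name `build_explore_url`, the placeholder credentials, and the version date are assumptions made purely for illustration.

```python
from urllib.parse import urlencode

# Category IDs used later in this notebook (taken from the Foursquare category list).
CATEGORY_IDS = {
    'College & University': '4d4b7105d754a06372d81259',
    'Elementary School': '4f4533804b9074f6e4fb0105',
    'High School': '4bf58dd8d48988d13d941735',
    'Middle School': '4f4533814b9074f6e4fb0106',
    'Private School': '52e81612bcbc57f1066b7a46',
    'Stadium': '4bf58dd8d48988d184941735',
    'Rec Centres': '4d4b7105d754a06377d81259',
    'Sports shops': '4bf58dd8d48988d1f2941735',
}


def build_explore_url(lat, lng, category_id, client_id, client_secret,
                      version, radius=500, limit=100):
    """Hypothetical helper: assemble a venues/explore request URL."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials and version date (assumptions, not real values).
url = build_explore_url(40.876551, -73.91066, CATEGORY_IDS['High School'],
                        client_id='YOUR_CLIENT_ID',
                        client_secret='YOUR_CLIENT_SECRET',
                        version='20180605')
print(url)
```

Building the query string with `urlencode` avoids hand-formatting the URL and escapes the comma in the `ll` parameter automatically.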

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')


## Methodology <a name="methodology"></a>



* The first step was to get the data for Manhattan, New York, by creating a DataFrame from the `newyork_data.json` file.

* The second step was to get the coordinates of Manhattan using a geolocator.

* In the third step, the locations of potential customers in Manhattan were retrieved using the Foursquare API and stored in DataFrames. This data was cleaned and compiled into one DataFrame.

* The fourth and final step was to use k-means clustering to group these locations; the centres of the clusters serve as candidate locations for the shop. The addresses of these centres were found using reverse geocoding.


In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
# collect one record per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # the coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [7]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [8]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [9]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: <redacted>
CLIENT_SECRET: <redacted>


In [10]:

def getNearbyVenues(names, latitudes, longitudes, categoryId, radius=500, LIMIT=100):
    """Query Foursquare for venues of one category around each neighborhood centre."""
    columns = ['Neighborhood',
               'Neighborhood Latitude',
               'Neighborhood Longitude',
               'Venue',
               'Venue Latitude',
               'Venue Longitude',
               'Venue Category']

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        # (CLIENT_ID, CLIENT_SECRET and VERSION are expected to be defined in the credentials cell above)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            categoryId,
            radius,
            LIMIT)
        try:
            # make the GET request
            results = requests.get(url).json()['response']['groups'][0]['items']

            # keep only the relevant information for each nearby venue
            venues_list.extend([(
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
        except (requests.RequestException, ValueError, KeyError, IndexError, TypeError):
            # skip neighborhoods whose request fails or whose response is incomplete
            continue

    # passing the column names here also works when no venues were found
    nearby_venues = pd.DataFrame(venues_list, columns=columns)

    return nearby_venues

#### Each category of potential locations had to be looked through individually to see which sub-categories it contained and which ones to keep for the analysis.

In [11]:
#College & University : 4d4b7105d754a06372d81259
manhattancollege = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4d4b7105d754a06372d81259'
                                    )


In [12]:
print(manhattancollege.shape)
manhattancollege.head()

(1391, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Bronx Engineering & Technology Academy | 40.877258 | -73.912575 | High School |
| 1 | Marble Hill | 40.876551 | -73.91066 | Spuyten Duyvil Preschool | 40.879244 | -73.907205 | University |
| 2 | Marble Hill | 40.876551 | -73.91066 | IN-Tech Academy | 40.879101 | -73.911026 | High School |
| 3 | Chinatown | 40.715618 | -73.994279 | IS 131 | 40.716454 | -73.99353 | General College & University |
| 4 | Chinatown | 40.715618 | -73.994279 | PS 42 | 40.715949 | -73.990888 | General College & University |


In [13]:
manhattancollege['Venue Category'].value_counts()

College Academic Building                   189
General College & University                162
College Administrative Building             125
Student Center                              107
College Classroom                            81
Trade School                                 79
University                                   79
College Library                              71
College Lab                                  50
College Arts Building                        46
Medical School                               41
College Residence Hall                       40
College Gym                                  30
College Auditorium                           29
College Cafeteria                            22
College Theater                              21
Law School                                   21
College Quad                                 18
College Science Building                     18
Office                                       15
School                                  

In [14]:
category = ['General College & University', 'University', 'Medical School', 'Law School',
            'Community College', 'College & University']
# keep only the rows whose category is in the list (original index labels are preserved)
manhattancollege = manhattancollege[manhattancollege['Venue Category'].isin(category)]
manhattancollege.shape

(324, 7)

In [15]:
manhattancollege['Venue Category'].value_counts()

General College & University    162
University                       79
Medical School                   41
Law School                       21
College & University             11
Community College                10
Name: Venue Category, dtype: int64

**As seen above, the category was explored and the necessary sub-categories were chosen and kept in the dataframe. The same was done for all subsequent categories.**
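
Because the same explore-then-filter step is repeated for every category, it can be wrapped in a small helper. The sketch below assumes pandas is available; the function name `keep_categories` is hypothetical, and the three sample rows are copied from the college output shown earlier.

```python
import pandas as pd


def keep_categories(venues, wanted):
    """Return only the rows whose 'Venue Category' is in `wanted`."""
    return venues[venues['Venue Category'].isin(wanted)]


# Three sample rows copied from the college results above.
sample = pd.DataFrame({
    'Venue': ['IS 131', 'Spuyten Duyvil Preschool', 'IN-Tech Academy'],
    'Venue Category': ['General College & University', 'University', 'High School'],
})

wanted = ['General College & University', 'University', 'Medical School',
          'Law School', 'Community College', 'College & University']

filtered = keep_categories(sample, wanted)
print(filtered['Venue Category'].value_counts())
```

`isin` performs the membership test for the whole column at once, so no explicit row loop or in-place `drop` is needed, and the original index labels of the surviving rows are preserved.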

In [16]:
#Elementary School : 4f4533804b9074f6e4fb0105
manhattaneleschool = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4f4533804b9074f6e4fb0105'
                                    )

In [17]:
print(manhattaneleschool.shape)
manhattaneleschool.head()

(90, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | PS/MS 37 | 40.880132 | -73.9106 | Elementary School |
| 1 | Marble Hill | 40.876551 | -73.91066 | PS 186X Site 306 | 40.873867 | -73.907091 | Elementary School |
| 2 | Marble Hill | 40.876551 | -73.91066 | Inwood School With Jeff | 40.876125 | -73.916364 | Elementary School |
| 3 | Chinatown | 40.715618 | -73.994279 | CPC Chrystie St | 40.718418 | -73.993738 | Elementary School |
| 4 | Chinatown | 40.715618 | -73.994279 | P.S. 124 | 40.714314 | -73.995694 | School |


In [18]:
category = ['Elementary School']
# keep only the rows whose category is in the list (original index labels are preserved)
manhattaneleschool = manhattaneleschool[manhattaneleschool['Venue Category'].isin(category)]
manhattaneleschool.shape

(86, 7)

In [19]:
manhattaneleschool['Venue Category'].value_counts()

Elementary School    86
Name: Venue Category, dtype: int64

In [20]:
#High School : 4bf58dd8d48988d13d941735
manhattanhighschool = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4bf58dd8d48988d13d941735'
                                    )

In [21]:
print(manhattanhighschool.shape)
manhattanhighschool.head()

(155, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Bronx Theatre High School | 40.877258 | -73.912575 | High School |
| 1 | Marble Hill | 40.876551 | -73.91066 | ELLIS Preparatory Academy | 40.875809 | -73.912597 | High School |
| 2 | Marble Hill | 40.876551 | -73.91066 | marble hill high school | 40.877258 | -73.912575 | High School |
| 3 | Marble Hill | 40.876551 | -73.91066 | John F. Kennedy High School | 40.877266 | -73.912746 | High School |
| 4 | Marble Hill | 40.876551 | -73.91066 | John F. Kennedy High School | 40.876944 | -73.912976 | High School |


In [22]:
manhattanhighschool = manhattanhighschool[manhattanhighschool['Venue Category'] == 'High School'].reset_index(drop=True)

In [23]:
manhattanhighschool['Venue Category'].value_counts()

High School    147
Name: Venue Category, dtype: int64

In [24]:
#Middle School : 4f4533814b9074f6e4fb0106
manhattanmidschool = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4f4533814b9074f6e4fb0106'
                                    )

In [25]:
print(manhattanmidschool.shape)
manhattanmidschool.head()

(52, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Atmosphere Academy | 40.875072 | -73.910164 | Middle School |
| 1 | Chinatown | 40.715618 | -73.994279 | Sun Yat Sen Middle School MS 131 | 40.716486 | -73.993523 | Middle School |
| 2 | Chinatown | 40.715618 | -73.994279 | middle school 131 | 40.715935 | -73.993721 | Middle School |
| 3 | Chinatown | 40.715618 | -73.994279 | Innovate Manhattan Charter School | 40.719572 | -73.992309 | Middle School |
| 4 | Hamilton Heights | 40.823604 | -73.949688 | Hamilton Grange School - M209 | 40.820975 | -73.952983 | Middle School |


In [26]:
manhattanmidschool = manhattanmidschool[manhattanmidschool['Venue Category'] == 'Middle School'].reset_index(drop=True)

In [27]:
manhattanmidschool['Venue Category'].value_counts()

Middle School    48
Name: Venue Category, dtype: int64

In [28]:
#Private School : 52e81612bcbc57f1066b7a46
manhattanprivschool = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '52e81612bcbc57f1066b7a46'
                                    )

In [29]:
print(manhattanprivschool.shape)
manhattanprivschool.head()

(29, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Upper East Side | 40.775639 | -73.960508 | SciTech Kids | 40.775461 | -73.955495 | Private School |
| 1 | Yorkville | 40.77593 | -73.947118 | Kumon Learning Center | 40.77507 | -73.951019 | Private School |
| 2 | Yorkville | 40.77593 | -73.947118 | The Goddard School | 40.778348 | -73.945702 | Daycare |
| 3 | Lincoln Square | 40.773529 | -73.985338 | The Shefa School | 40.776409 | -73.983643 | Private School |
| 4 | Lincoln Square | 40.773529 | -73.985338 | Fusion Academy Upper West Side | 40.774373 | -73.980597 | Private School |


In [30]:
manhattanprivschool = manhattanprivschool[manhattanprivschool['Venue Category'] == 'Private School'].reset_index(drop=True)

In [31]:
manhattanprivschool['Venue Category'].value_counts()

Private School    26
Name: Venue Category, dtype: int64

In [32]:
#Stadium : 4bf58dd8d48988d184941735
manhattanstadium = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4bf58dd8d48988d184941735'
                                    )


In [33]:
print(manhattanstadium.shape)
manhattanstadium.head()

(19, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Midtown | 40.754691 | -73.981669 | Ping Pong - Bryant Park | 40.754299 | -73.983729 | Athletics & Sports |
| 1 | Midtown | 40.754691 | -73.981669 | Vanderbilt Tennis Club | 40.752252 | -73.977477 | Tennis Stadium |
| 2 | Murray Hill | 40.748303 | -73.978332 | Grand Hyatt New York | 40.751786 | -73.976574 | Hotel |
| 3 | Murray Hill | 40.748303 | -73.978332 | Vanderbilt Tennis Club | 40.752252 | -73.977477 | Tennis Stadium |
| 4 | Chelsea | 40.744035 | -74.003116 | Kitchen Stadium | 40.742376 | -74.004985 | Theater |


In [34]:
manhattanstadium['Venue Category'].value_counts()


Shoe Store            4
Stadium               2
Music Venue           2
Tennis Stadium        2
Hotel                 1
Park                  1
Tennis Court          1
Building              1
Theater               1
Basketball Stadium    1
Athletics & Sports    1
Soccer Stadium        1
College Gym           1
Name: Venue Category, dtype: int64

In [35]:
category = ['Tennis Stadium', 'Stadium', 'Tennis Court', 'College Gym',
            'Basketball Stadium', 'Soccer Stadium', 'Athletics & Sports']
# keep only the rows whose category is in the list (original index labels are preserved)
manhattanstadium = manhattanstadium[manhattanstadium['Venue Category'].isin(category)]
manhattanstadium.shape

(9, 7)

In [36]:
#Rec Centres : 4d4b7105d754a06377d81259
manhattanrec = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4d4b7105d754a06377d81259'
                                    )

In [37]:
print(manhattanrec.shape)
manhattanrec.head()

(2451, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Bikram Yoga | 40.876844 | -73.906204 | Yoga Studio |
| 1 | Marble Hill | 40.876551 | -73.91066 | Blink Fitness | 40.877271 | -73.905595 | Gym |
| 2 | Marble Hill | 40.876551 | -73.91066 | Astral Fitness & Wellness Center | 40.876705 | -73.906372 | Gym |
| 3 | Marble Hill | 40.876551 | -73.91066 | Planet Fitness | 40.874088 | -73.909137 | Gym / Fitness Center |
| 4 | Marble Hill | 40.876551 | -73.91066 | Marble Hill Playground | 40.877765 | -73.907994 | Playground |


In [38]:
manhattanrec['Venue Category'].value_counts()

Gym / Fitness Center     483
Gym                      464
Park                     272
Yoga Studio              170
Plaza                    151
Playground               111
Athletics & Sports        92
Martial Arts Dojo         83
Pilates Studio            61
Garden                    57
Basketball Court          49
Dog Run                   34
Boxing Gym                34
Scenic Lookout            34
Cycle Studio              29
Soccer Field              29
Tennis Court              25
Roof Deck                 24
Pool                      23
Weight Loss Center        20
Harbor / Marina           20
Trail                     18
Gym Pool                  17
Pedestrian Plaza          15
Baseball Field            14
Sports Club               13
Skate Park                10
Gymnastics Gym             8
Golf Course                7
Recreation Center          7
Track                      7
Skating Rink               6
Climbing Gym               6
Fountain                   5
Indoor Play Ar

In [39]:
category = ['Gym', 'Athletics & Sports', 'Basketball Court', 'Soccer Field', 'Tennis Court', 'Sports Club']
# keep only the rows whose category is in the list (original index labels are preserved)
manhattanrec = manhattanrec[manhattanrec['Venue Category'].isin(category)]
manhattanrec.shape

(672, 7)

In [40]:
print(manhattanstadium.shape)
print(manhattanprivschool.shape)
print(manhattanmidschool.shape)
print(manhattanhighschool.shape)
print(manhattaneleschool.shape)
print(manhattancollege.shape)
print(manhattanrec.shape)

(9, 7)
(26, 7)
(48, 7)
(147, 7)
(86, 7)
(324, 7)
(672, 7)


#### Total venues available: 1312

Summing the row counts printed above gives a total of 1312 venues. Using all of these data points would not be useful, so some assumptions are made:

1. Of all the schools, we assume that most competitions, and the most serious interest in sports, occur at the high school level. So for this analysis we use only high schools and discard the other school types.

2. More often than not, stadiums do not themselves provide any equipment, so they are unlikely to be regular customers. We therefore discard the stadium data set.

The data points considered therefore come from high schools (147), colleges (324) and recreational facilities (672), for a total of 1143.
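
As a quick cross-check, the row counts printed by the shape summary above can be summed directly. The dictionary below simply copies those printed counts; the labels are informal names chosen for this sketch, not identifiers from the notebook.

```python
# Row counts copied from the shape summary printed above.
venue_counts = {
    'stadiums': 9,
    'private schools': 26,
    'middle schools': 48,
    'high schools': 147,
    'elementary schools': 86,
    'colleges': 324,
    'recreational facilities': 672,
}

# All venues retrieved across the seven categories.
total = sum(venue_counts.values())

# Only these three groups are carried forward into the analysis.
kept = ['high schools', 'colleges', 'recreational facilities']
kept_total = sum(venue_counts[name] for name in kept)

print(total, kept_total)  # 1312 1143
```

Counts may differ from run to run, since the venues returned by Foursquare can change over time.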


In [41]:
df = pd.concat([manhattanhighschool,manhattancollege,manhattanrec], ignore_index= False)

In [42]:
df.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Bronx Theatre High School | 40.877258 | -73.912575 | High School |
| 1 | Marble Hill | 40.876551 | -73.91066 | ELLIS Preparatory Academy | 40.875809 | -73.912597 | High School |
| 2 | Marble Hill | 40.876551 | -73.91066 | marble hill high school | 40.877258 | -73.912575 | High School |
| 3 | Marble Hill | 40.876551 | -73.91066 | John F. Kennedy High School | 40.877266 | -73.912746 | High School |
| 4 | Marble Hill | 40.876551 | -73.91066 | John F. Kennedy High School | 40.876944 | -73.912976 | High School |


In [43]:
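# keep=False drops every row whose venue name occurs more than once, including the first occurrence
df.drop_duplicates(subset="Venue", keep=False, inplace = True)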

In [44]:
df.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Bronx Theatre High School | 40.877258 | -73.912575 | High School |
| 1 | Marble Hill | 40.876551 | -73.91066 | ELLIS Preparatory Academy | 40.875809 | -73.912597 | High School |
| 2 | Marble Hill | 40.876551 | -73.91066 | marble hill high school | 40.877258 | -73.912575 | High School |
| 5 | Marble Hill | 40.876551 | -73.91066 | Bronx Engineering & Technology Academy | 40.877258 | -73.912575 | High School |
| 6 | Marble Hill | 40.876551 | -73.91066 | The New Visions Charter High School for the Hu... | 40.877392 | -73.912769 | High School |


In [45]:
df.shape

(892, 7)

## Analysis <a name="analysis"></a>

Now that the data is ready, these co-ordinates are plotted on a map.

In [46]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [47]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, venue, category in zip(df['Venue Latitude'], df['Venue Longitude'], df['Venue'], df['Venue Category']):
    label = '{}, {}'.format(venue, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

To check whether there are any sports shops around the candidate locations, the locations of sporting goods shops in Manhattan were retrieved using the Foursquare API and then plotted on a map.

In [48]:
#Sports shops : 4bf58dd8d48988d1f2941735
shops = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                    latitudes=manhattan_data['Latitude'],
                                    longitudes=manhattan_data['Longitude'],
                                    categoryId = '4bf58dd8d48988d1f2941735'
                                    )

In [49]:
print(shops.shape)
shops.head()

(188, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Chinatown | 40.715618 | -73.994279 | Labor Skate Shop | 40.714902 | -73.991547 | Sporting Goods Shop |
| 1 | Chinatown | 40.715618 | -73.994279 | Bok Lei Po Trading Inc. | 40.716053 | -73.998197 | Sporting Goods Shop |
| 2 | Chinatown | 40.715618 | -73.994279 | G & S Sporting Goods | 40.716381 | -73.989647 | Sporting Goods Shop |
| 3 | Chinatown | 40.715618 | -73.994279 | G And S Sports | 40.716381 | -73.989647 | Sporting Goods Shop |
| 4 | Chinatown | 40.715618 | -73.994279 | John Jovino Gun Shop | 40.719273 | -73.997692 | Sporting Goods Shop |


In [50]:
shops['Venue Category'].value_counts()

Sporting Goods Shop     178
Shoe Store                6
Boutique                  1
Fishing Store             1
Outdoor Supply Store      1
Clothing Store            1
Name: Venue Category, dtype: int64

In [51]:
shops_df = shops[shops['Venue Category'] == 'Sporting Goods Shop'].reset_index(drop=True)

In [52]:
shops_df.shape


(178, 7)

In [53]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=12)


for lat, lng, venue, category in zip(shops_df['Venue Latitude'], shops_df['Venue Longitude'], shops_df['Venue'], shops_df['Venue Category']):
    label = '{}, {}'.format(venue, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  

map_newyork

Now, using k-means clustering, all these potential customer locations are clustered, as shown in the map below.

In [54]:
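number_of_clusters = 20
# cluster on the venue coordinates as (latitude, longitude) pairs;
# this definition of xy is inferred from how cluster_centers is unpacked in the cells below
xy = df[['Venue Latitude', 'Venue Longitude']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(xy)

cluster_centers = kmeans.cluster_centers_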
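
To make the clustering step less of a black box, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not scikit-learn's implementation (which, among other things, uses a smarter initialisation); the function name `kmeans_2d` and its parameters are hypothetical. Like the `KMeans` call above, it treats latitude and longitude as plain planar coordinates. The six sample points are venue coordinates copied from the outputs earlier in this notebook.

```python
import random


def kmeans_2d(points, k, iterations=50, seed=0):
    """Minimal Lloyd's algorithm for (lat, lng) pairs; returns k centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres: k input points chosen at random
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for lat, lng in points:
            nearest = min(
                range(k),
                key=lambda i: (lat - centres[i][0]) ** 2 + (lng - centres[i][1]) ** 2,
            )
            clusters[nearest].append((lat, lng))
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                new_centres.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centres.append(centre)  # an empty cluster keeps its old centre
        if new_centres == centres:  # no centre moved: converged
            break
        centres = new_centres
    return centres


# Three venues near Marble Hill and three near Chinatown (from the tables above).
points = [
    (40.877258, -73.912575), (40.875809, -73.912597), (40.877266, -73.912746),
    (40.716454, -73.99353), (40.715949, -73.990888), (40.714314, -73.995694),
]
centres = sorted(kmeans_2d(points, k=2))
for lat, lng in centres:
    print(round(lat, 4), round(lng, 4))
```

With two well-separated groups like these, the two centres settle on the mean of each group, which is the same idea the notebook applies with 20 clusters over all venues.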

In [None]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, venue, category in zip(df['Venue Latitude'], df['Venue Longitude'], df['Venue'], df['Venue Category']):
    label = '{}, {}'.format(venue, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
for lat, lng in cluster_centers:
    folium.Circle([lat, lng], radius=1200, color='green', fill=False).add_to(map_newyork)
    
map_newyork

The centres of these clusters, which are the potential locations for the shop, are then plotted on a map as red markers.

Existing sports shops in Manhattan are plotted alongside them as purple markers.

In [None]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, venue, category in zip(shops_df['Venue Latitude'], shops_df['Venue Longitude'], shops_df['Venue'], shops_df['Venue Category']):
    label = '{}, {}'.format(venue, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)   
    
for lat, lng in cluster_centers:
    label = '{}, {}'.format(lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='red', fill=True).add_to(map_newyork)
    
    
map_newyork

Now, using LocationIQ API reverse geocoding, the street addresses of the centres (the potential locations) were obtained.

In [None]:
addresses = []

In [None]:
cluster_centers

In [None]:
# api_key is the LocationIQ access token; its definition is not shown in this notebook
lat = 40.76623438
lon = -73.96429785
url = 'https://us1.locationiq.com/v1/reverse.php?key={}&lat={}&lon={}&format=json'.format(api_key,lat,lon)
response = requests.get(url).json()
results = response['display_name']
print(results)
#addresses.append(results)

**This could be put in a for loop; however, the free version of the API used here restricts the number of calls per second. So the remaining coordinates were reverse geocoded one at a time and the results stored in the list.**
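
One way to stay within a per-second limit while still using a loop is to pause between calls. The sketch below is an illustration only: the function names, the one-second pause, and the use of the standard library's `urllib` instead of `requests` are all assumptions. The endpoint and query parameters are the ones from the cell above. The demo at the end uses a stub lookup so that it runs without a network connection or an API key.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def locationiq_lookup(lat, lon, api_key):
    """Reverse geocode one point with the LocationIQ endpoint used above."""
    query = urlencode({'key': api_key, 'lat': lat, 'lon': lon, 'format': 'json'})
    with urlopen('https://us1.locationiq.com/v1/reverse.php?' + query) as response:
        return json.load(response)['display_name']


def reverse_geocode_all(coords, lookup, pause_seconds=1.0, sleep=time.sleep):
    """Call lookup(lat, lon) for each pair, pausing between consecutive calls."""
    addresses = []
    for i, (lat, lon) in enumerate(coords):
        if i > 0:
            sleep(pause_seconds)  # throttle; the right pause depends on the API plan
        addresses.append(lookup(lat, lon))
    return addresses


# Offline demo: a stub lookup and a recorded (not real) sleep.
pauses = []
demo = reverse_geocode_all(
    [(40.76623438, -73.96429785), (40.7127281, -74.0060152)],
    lookup=lambda lat, lon: 'address for {}, {}'.format(lat, lon),
    pause_seconds=1.0,
    sleep=pauses.append,
)
print(demo)
print(pauses)
```

With a real key, the call would look like `reverse_geocode_all(cluster_centers, lambda lat, lon: locationiq_lookup(lat, lon, api_key))`.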

In [None]:
addresses.append('386, 5th Avenue, Midtown West, New York, New York County, New York, 10018, USA')
addresses.append('60, Greene Street, SoHo, New York, New York County, New York, 10012, USA')
addresses.append('1896, 3rd Avenue, East Harlem, New York, New York County, New York, 10029, USA')
addresses.append('517, W 207 St, Inwood, New York, New York County, New York, 10034, USA')
addresses.append('600, West 182nd Street, Washington Heights, New York, New York County, New York, 10033, USA')
addresses.append('150, East 59th Street, Midtown East, New York, New York County, New York, 10022, USA')
addresses.append('606, West 143rd Street, Hamilton Heights, New York, New York County, New York, 10031, USA')
addresses.append('410, East 89th Street, Upper East Side, New York, New York County, New York, 10128, USA')
addresses.append('30, Horatio St, West Village, New York, New York County, New York, 10014, USA')
addresses.append('Duane Reade, Upper West Side, New York, New York County, New York, 10025, USA')
addresses.append('Silverstein Family Park, Tribeca, New York, New York County, New York, 10007, USA')
addresses.append('252, Broome St, Lower East Side, New York, New York County, New York, 10002, USA')
addresses.append('159, E 71 St, Upper East Side, New York, New York County, New York, 10021, USA')
addresses.append('One Vanderbilt, Midtown East, New York, New York County, New York, 10017, USA')
addresses.append('311, E 10 St, East Village, New York, New York County, New York, 10009, USA')
addresses.append('357, W 14 St, Chelsea, New York, New York County, New York, 10014, USA')
addresses.append('Duane Reade, Upper West Side, New York, New York County, New York, 10025, USA')
addresses.append('Citi Bike - E 43 St & Vanderbilt Ave, Midtown East, New York, New York County, New York, 10017, USA')
addresses.append('2, West 37th Street, Midtown West, New York, New York County, New York, 10018, USA')


In [None]:
print('==============================================================')
print('Addresses of locations recommended for further analysis')
print('==============================================================\n')

for address in addresses:
    print(address.replace(', USA', ''))
    

## Results and Discussion <a name="results"></a>

The analysis in this project shows that there are more than enough potential customers in and around Manhattan, New York, for a sports shop.

In this analysis we first collected all the potential customers, then clustered them and used the centres of these clusters as potential locations for the shop. The addresses of these locations were obtained using reverse geocoding.

As a result of this analysis we have generated 20 potential locations. These are, of course, not guaranteed to be optimal locations for the store: there may be other reasons why no shop currently exists at a given location. This project only takes potential customers into account and does not consider nearby shops. This was deliberate, because a neighboring shop such as SKECHERS can help boost sales. A closer manual inspection is therefore required. These conditions can be changed according to the objectives and requirements of the stakeholder.
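
As a starting point for that manual inspection, one could count the existing shops near each candidate. The sketch below is not part of the original analysis: the helper names are hypothetical, the distance uses the standard haversine formula, and the 1200 m radius is borrowed from the circles drawn on the cluster map. The candidate point and the three shops are copied from cells earlier in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def shops_within(candidate, shops, radius_m):
    """Names of the shops lying within radius_m of the candidate point."""
    lat, lon = candidate
    return [name for name, s_lat, s_lon in shops
            if haversine_m(lat, lon, s_lat, s_lon) <= radius_m]


# Cluster centre used in the reverse geocoding cell.
candidate = (40.76623438, -73.96429785)

# Three sporting goods shops from the Chinatown rows shown earlier.
shops = [
    ('Labor Skate Shop', 40.714902, -73.991547),
    ('Bok Lei Po Trading Inc.', 40.716053, -73.998197),
    ('G & S Sporting Goods', 40.716381, -73.989647),
]

nearby = shops_within(candidate, shops, radius_m=1200)
print(len(nearby), 'of', len(shops), 'sample shops within 1200 m')
```

Whether a high count is good or bad depends on the stakeholder's view of competition, as discussed above, so the number is best read as an input to the manual inspection rather than as a score.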


## Conclusion <a name="conclusion"></a>

The purpose of this project was to generate potential locations for a sports shop based on the number of potential customers in and around each neighborhood. The potential customers and their coordinates were obtained using the Foursquare API. The data was inspected, cleaned, and then clustered using k-means clustering. The centres of these clusters were taken as potential locations, and their addresses were obtained using the LocationIQ reverse geocoding API.

The final decision is up to the stakeholders and may require further manual or on-foot inspection, as they may have other factors in mind, such as the attractiveness of the location or its proximity to main roads.