# Project background
  A contractor is asking where to start a business. To answer that, we first need to know which kinds of business are most popular in a given area. Based on that information, we can plot a map showing the top 5 business kinds in each area.
  The assumption is that the more popular a kind of business is in an area, the higher the chance that a new business of that kind will succeed there.

# Data source and solution
  As I'm new to this area, I will use the New York neighborhood data from this class: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json I will also reuse the clustering algorithm from the class to cluster the neighborhoods.
  Then we can provide a simple algorithm that indicates whether a proposed business is a good idea or not.

# Code for the assignment

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
# (DBSCAN is an alternative that picks the number of clusters itself; left commented out)
from sklearn.cluster import KMeans
#from sklearn.cluster import DBSCAN

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

In [3]:
# download New York neighborhood data
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

In [4]:
# convert to data frame
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
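The same flattening can also be done with `json_normalize`, which the import cell brings in but the notebook never calls. The following is a minimal sketch, assuming pandas 1.0 or later and the `properties` / `geometry.coordinates` layout read by the loop above. The two features and their values are made up for illustration and do not come from `newyork_data.json`.

```python
import pandas as pd

# hypothetical features with the same nesting as newyork_data['features']
toy_features = [
    {'type': 'Feature',
     'properties': {'borough': 'Demo Borough', 'name': 'Demo North'},
     'geometry': {'type': 'Point', 'coordinates': [-74.0, 40.9]}},
    {'type': 'Feature',
     'properties': {'borough': 'Demo Borough', 'name': 'Demo South'},
     'geometry': {'type': 'Point', 'coordinates': [-74.1, 40.6]}},
]

# nested dicts become dotted column names; the coordinate list stays a list
flat = pd.json_normalize(toy_features)

toy_neighborhoods = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON order is [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(toy_neighborhoods)
```

This avoids growing a dataframe row by row, at the cost of depending on the dotted column names that `json_normalize` generates.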

In [5]:
# get New York latitude and longitude for plotting purpose
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [6]:
# Foursquare credentials
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]
        if ('groups' in results):
            results = results['groups'][0]['items']
        else:
            continue
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
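The parsing step of `getNearbyVenues` can be tried without network access or credentials. The following is a minimal sketch, assuming the response has the `groups[0]['items'][...]['venue']` shape that the function above reads. `parse_explore_items` and the sample payload are hypothetical and made up for illustration. Unlike the function above, this helper skips venues whose `categories` list is empty instead of raising an `IndexError`.

```python
def parse_explore_items(name, lat, lng, response):
    """Flatten one 'explore' response body into row tuples (hypothetical helper)."""
    groups = response.get('groups') or []
    if not groups:
        return []  # no venues returned for this neighborhood

    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # skip venues without a category
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows


# made-up payload: one categorized venue and one without a category
sample_response = {
    'groups': [{
        'items': [
            {'venue': {'name': 'Demo Cafe',
                       'location': {'lat': 40.01, 'lng': -74.01},
                       'categories': [{'name': 'Coffee Shop'}]}},
            {'venue': {'name': 'Unlabelled Place',
                       'location': {'lat': 40.02, 'lng': -74.02},
                       'categories': []}},
        ]
    }]
}

sample_rows = parse_explore_items('Demo Neighborhood', 40.0, -74.0, sample_response)
print(sample_rows)
```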

In [None]:
# the full data set caused problems; uncomment the next line to restrict the run to Manhattan as an example
#neighborhoods = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
ny_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                           latitudes=neighborhoods['Latitude'],
                           longitudes=neighborhoods['Longitude']
                          )

# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

In [24]:
ny_onehot.shape

(3177, 338)

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
# group rows by neighborhood, taking the mean frequency of occurrence of each category
ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

#neighborhoods_venues_sorted.head()
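To make the one-hot, group-mean and top-N steps concrete, here is a minimal sketch on a tiny made-up venue table. The `toy_` names and the data are hypothetical and chosen so that no two categories tie. The mean of the one-hot columns per neighborhood is simply each category's share of that neighborhood's venues.

```python
import pandas as pd

# made-up venues: 4 in neighborhood A, 3 in neighborhood B
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Cafe', 'Hotel', 'Park', 'Park', 'Cafe'],
})

# one column per category, 1.0 where the venue has that category
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# mean of the indicator columns = share of each category per neighborhood
toy_grouped = toy_onehot.groupby('Neighborhood').mean()

# top 2 categories per neighborhood, most frequent first
toy_top = {
    hood: list(freqs.sort_values(ascending=False).index[:2])
    for hood, freqs in toy_grouped.iterrows()
}
print(toy_grouped)
print(toy_top)
```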

In [28]:
# clustering similar neighbourhoods
ny_grouped_clustering = ny_grouped.drop('Neighborhood', axis=1)

# run DBSCAN clustering
#epsilon = 0.3
#minimumSamples = 7
#db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kclusters = len(set(db.labels_))
kclusters = 5

# run k-means clustering
db = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

db.labels_[0:10] 


array([1, 1, 3, 1, 3, 1, 1, 4, 1, 1], dtype=int32)
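The value `kclusters = 5` is fixed by hand. One common way to compare candidate values is the silhouette score, which this notebook does not use. The following is only a sketch of that swapped-in technique on synthetic points, not on `ny_grouped_clustering`. The three blob centers, the spread and the range of `k` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    scores[k] = silhouette_score(toy_points, labels)

# a higher silhouette score means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print('best k on the toy data:', best_k)
```

On real venue-frequency data the scores are usually much closer together, so the result is better treated as a hint than as a definitive choice.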

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', db.labels_)

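ny_merged = neighborhoods

# merge neighborhoods_venues_sorted into neighborhoods so each row has latitude/longitude, cluster label and top venues
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods without venues get no cluster label; drop them and restore integer labels
ny_merged = ny_merged.dropna(subset=['Cluster Labels'])
ny_merged['Cluster Labels'] = ny_merged['Cluster Labels'].astype(int)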

#ny_merged.head() # check the last columns!

## Plot map with similar clusters
This gives a view of which neighborhoods have a similar mix of popular businesses.

In [31]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Check whether a given business in a given neighborhood is a good idea
To verify the result, one can select a neighborhood name and a target business idea, and the function below estimates the outcome.
If the business is among the most popular kinds in that neighborhood, it is reported as a good idea.
Otherwise, the most popular kinds of business there are listed as suggestions.

In [32]:
def verify_proposal(neighborhood, business):
    businesses = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Neighborhood']==neighborhood]
    # columns 0 and 1 are 'Cluster Labels' and 'Neighborhood'; the rest are the top venues
    if business in list(businesses.iloc[0][2:]):
        print("It's a good idea to set up {} in {}".format(business, neighborhood))
    else:
        print("Try to set up \n{} \nin \n{}".format(list(businesses.iloc[0][2:]), neighborhood))
    

In [33]:
# testing solution
business = "Hotel"
neighborhood = "Battery Park City"
verify_proposal(neighborhood, business)

It's a good idea to set up Hotel in Battery Park City
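For reuse outside the notebook, the same check can return a value instead of printing. The following is a minimal sketch. `check_proposal` and the `toy_top_venues` table are hypothetical, and the table only mimics the column layout of `neighborhoods_venues_sorted` with made-up rows. It selects the venue columns by name rather than by position and raises a `KeyError` for an unknown neighborhood.

```python
import pandas as pd

def check_proposal(top_table, neighborhood, business):
    """Return (is_popular, top_venues) for one neighborhood (hypothetical helper)."""
    rows = top_table[top_table['Neighborhood'] == neighborhood]
    if rows.empty:
        raise KeyError('unknown neighborhood: {}'.format(neighborhood))
    # keep only the '... Most Common Venue' columns, in table order
    top_venues = [value for column, value in rows.iloc[0].items()
                  if column.endswith('Most Common Venue')]
    return business in top_venues, top_venues


# made-up table with the same column layout as neighborhoods_venues_sorted
toy_top_venues = pd.DataFrame({
    'Cluster Labels': [0, 1],
    'Neighborhood': ['Demo North', 'Demo South'],
    '1st Most Common Venue': ['Cafe', 'Park'],
    '2nd Most Common Venue': ['Hotel', 'Cafe'],
})

is_popular, top_venues = check_proposal(toy_top_venues, 'Demo North', 'Hotel')
print(is_popular, top_venues)
```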


# Discussion
  In the section above, we checked our assumption against existing data on the most popular businesses in New York neighborhoods.
  One may argue that choosing the most popular kind of business leads to more competition, so is it really a good idea?
  Yes, as a first check: the factors behind a successful business are more than we know, so this is only a quick way to screen an idea.
  For food businesses, it may be a good reference.
  For other kinds of business, good practice should also consider other factors, e.g. population, target clients and so forth.

# Conclusion
 Whether a proposal is actually good still needs to be verified in the real world.
 We should continue to tune the approach.