# Capstone Project - The Battle of Neighborhoods

## 1. Introduction/Business Problem

In this project, we shall be considering two locations; Houston, TX, and New York, NY. The idea behind this is to firstly compare the business types using Foursquare location data. 

[The Top 10 Largest Cities in the U.S. by Population](https://www.moving.com/tips/the-top-10-largest-us-cities-by-population/) ranks New York City as number one and ranks Houston as number four, with over 8 million and over 2 million inhabitants respectively. The article also details New York as one of the largest financial hubs in the world and Houston as one of the best places for business.

Due to this, we have decided to compare these two cities to review the top business types by number of businesses under those types. Using that information, we will also be recommending business types to potential investors who want to invest in these choice locations.

## 2. Data

For this project, we will be utilizing the Foursquare API to fetch location data of both Houston and New York. To start with, we will geolocate both cities using the geopy library. These location points (latitude and longitude) will then be combined with other essential parameters to be used for the Foursquare API.

The fetched data will, which comes in JSON format will then be converted to a dataframe (with the location points merged to its respective location) containing just the essential data columns such as id, name, address, city, state, country, latitude, longitude, postal code, and business types.

As mentioned in the Introduction/Business Problem section, these dataframes for both cities will help us determine the top business in both cities and as well help us recommend business types to potential investors.

## 3. Methodology

Using the foursquare API, we will be extracting and reviewing the location data of Houston and New York (our reference locations), which details various location properties. As mentioned in the Data section, we will extract just the essential or needed properties for use in this project. Also, on reviewing the data we ranked the the business types and their count in both locations, to give us an idea of the most common or top ranked businesses in those areas.

After all these are done, we apply K-Means Clustering Technique to cluster the businesses in terms of postal code (which is relative since similar locations mostly share same zip code). On the otherhand, we tried other feasible means of categorization such as business types, which didn't yield reasonable clusters. This is the reason why the postal code clusters was a better option.

In [50]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

import json

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

import requests
from pandas.io.json import json_normalize


import matplotlib.cm as cm
import matplotlib.colors as colors


from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported.')

Libraries imported.


In [51]:
# The code was removed by Watson Studio for sharing.

Your credentials have been stored!


In [161]:
def query(queryname, radius, limit):
    '''
    This function takes in a location name and radius, and uses this location name to generate a central latitude and longitude,
    which is used to extract the business/buildings/addresses in a specified radius to the central point.
    
    data: queryname is a string type input of location name
          radius is an integer type input of radius of coverage from central point
          limit  is an integer type input of the number of locations fetched from the url using Foursquare API
    '''
    print('The location is:', queryname)
    
    
    #Use the location name to extract the location latitude and longitude
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(queryname)
    latitude = location.latitude
    longitude = location.longitude
    print('The Latitude is: ', latitude, 'the Longitude is: ', longitude)
    
    
    #using the latitude and longitude, create the foursquare API url where the location data will be extracted/requested from
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, limit)
    #print(url, '\n')
    
    
    #get the location data in JSON format using Foursquare API
    results_json = requests.get(url).json()
    
    #format the queried json data for the required useful data
    query_json = results_json['response']['venues']
    
    #transform the formatted json data into a dataframe
    df = json_normalize(query_json)

    #filter the dataframe to extract essential location data
    formatted_df = df[['id', 'name', 'location.address', 'location.city', 'location.state', 'location.cc', 'location.lat', 'location.lng', 'location.postalCode']]
    
    #extra location data are found in the categories column, which will be formatted to be included in the final dataframe
    categories = df['categories']
    
    business = []
    pluralBusiness = []

    for each in categories:
        #print(each)
        if len(each) == 0:
            pass
        else:
            business.append(each[0]['name'])
            pluralBusiness.append(each[0]['pluralName'])

    #print(business)
    #print(pluralBusiness)

    names_dict = {'type': business, 'plural': pluralBusiness}
    names_df = pd.DataFrame(names_dict)
    
    #combine the formatted dataframe and the extra location dataframe into the final dataframe to be used for the project
    final_df = formatted_df.join(names_df)
    
    final_df = final_df.rename(columns = {'location.address': 'address', 'location.city': 'city', 'location.state': 'state', 'location.cc': 'country', 'location.lat': 'latitude', 'location.lng': 'longitude', 'location.postalCode': 'postalCode'})
    
    #drop locations with no known addresses
    final_df = final_df.dropna()
    
    #drop locations not matching the specific queryname location i.e. New York or Houston
    final_df = final_df[final_df['city'] == queryname[:-4]]
    
    return final_df, [latitude, longitude]

In [162]:
newyork = query('New York, NY', radius = 1500, limit = 300)[0]

newyork.head()

The location is: New York, NY
The Latitude is:  40.7127281 the Longitude is:  -74.0060152
New York


Unnamed: 0,id,name,address,city,state,country,latitude,longitude,postalCode,type,plural
0,4a676321f964a52051c91fe3,New York City Hall,260 E Broadway,New York,NY,US,40.712659,-74.00588,10002,City Hall,City Halls
1,3fd66200f964a520d8f11ee3,City Hall Park,17 Park Row,New York,NY,US,40.712415,-74.006724,10038,Park,Parks
3,4b79a5e8f964a52037082fe3,NY Gift Shop,234 Canal St,New York,NY,US,40.712733,-74.005978,10013,Gift Shop,Gift Shops
4,51a4bc7c498e469047be66d6,City Hall Council Chambers,City Hall,New York,NY,US,40.712736,-74.005472,10007,City Hall,City Halls
5,5d25d7fd3221cb002471e7c8,City Hall Plaza,Steve Flanders Square,New York,NY,US,40.712641,-74.006131,10007,Plaza,Plazas


In [163]:
houston = query('Houston, TX', radius = 1500, limit = 300)[0]

houston.head()                

The location is: Houston, TX
The Latitude is:  29.7589382 the Longitude is:  -95.3676974
Houston


Unnamed: 0,id,name,address,city,state,country,latitude,longitude,postalCode,type,plural
0,4cbf112b00d837047b8a415c,Baker Botts LLP,910 Louisiana St,Houston,TX,US,29.759671,-95.367728,77002,Conference Room,Conference Rooms
1,5b3b978a56ca62001c94580e,"Nrg Energy, Inc.",910 Louisiana St,Houston,TX,US,29.759544,-95.367348,77002,Office,Offices
2,4b993b14f964a520d76b35e3,One Shell Plaza,910 Louisiana St,Houston,TX,US,29.759283,-95.367858,77002,Office,Offices
3,527096a111d2393d1fc331aa,The Houston Club,920 Louisiana St,Houston,TX,US,29.759112,-95.367565,77002,Restaurant,Restaurants
4,4b9ff072f964a520224c37e3,Starbucks,3801 Cullen Blvd,Houston,TX,US,29.75909,-95.3675,77004,Coffee Shop,Coffee Shops


In [164]:
def businessTypes(dataframe):
    
    dataframe_count = (pd.DataFrame(dataframe['plural'].value_counts())).reset_index()
    dataframe_count = dataframe_count.rename(columns = {'index': 'businessType', 'plural': 'Count'})
    
    dataframe_keys = dataframe_count['businessType'].to_list()
    
    length = dataframe_count.count()[0]

    dataframe_dict = {}

    for i in range(length):
        key = dataframe_count['businessType'].iloc[i]
        value = dataframe_count['Count'].iloc[i]
        dataframe_dict[key] = value
        
    return dataframe_count, dataframe_keys, dataframe_dict

In [None]:
def map_plot(location, dataframe):

    # create map using latitude and longitude values
    map_plot = folium.Map(location, zoom_start=50)

    # add markers to map
    for lat, lng, name, typ in zip(dataframe['latitude'], dataframe['longitude'], dataframe['name'], dataframe['type']):
        label = "{}, ({})".format(name, typ)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker([lat, lng], radius=5, popup=label, color='red', fill=True, fill_color='white', fill_opacity=0.7, parse_html=False).add_to(map_plot)

    return map_plot

In [None]:
def cluster(dataframe):
    grouped = dataframe[['id', 'name', 'type']].groupby('type').count()
    grouped = grouped.sort_values(by = 'name', ascending = False)
    #print(grouped)
    
    dummies = pd.get_dummies(dataframe['name'])
    dummies = dataframe[['postalCode']].join(dummies)
    dummies = dummies.groupby('postalCode').mean().reset_index()
    #print(dummies.head())

    kclusters = len(dataframe['postalCode'].unique())
    
    kmeans_group = dummies.drop('postalCode', axis = 1)
    #print(kmeans_group.head())
    
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kmeans_group)
    print(kmeans.labels_)
    
    dummies.insert(0, 'Cluster Labels', kmeans.labels_)

    cluster_df = dummies[['Cluster Labels', 'postalCode']]
    final_df = dataframe.merge(cluster_df, how = 'right')
    
    return final_df

In [None]:
def cluster_map(location, dataframe):
    map_clusters = folium.Map(location, zoom_start=50)
    
    kclusters = len(dataframe['postalCode'].unique())
    
    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []

    for lat, lon, point, cluster in zip(dataframe['latitude'], dataframe['longitude'], dataframe['postalCode'], dataframe['Cluster Labels']):
        label = folium.Popup(str(point) + ' : Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.9).add_to(map_clusters)


    return map_clusters

## 4. Results

Upon extraction and analysis of both locations (i.e. Houston and New York), firstly we discovered that the extracted data contained closer locations such as San Antonio, or didn't contain an address, as such couldn't be referenced, etc. A lot of these unwanted or wrong data, were dropped so as not to affect the analysis or decision making process. Secondly, in both locations, we discovered that Office buildings (under business type) were the most common businesses, followed by office-related buildings such as Banks, other type of Buildings, Conference rooms, etc.

Thirdly, as mentioned in the methodology, we applied K-Means Clustering, which focused on the postal codes as against the thoughtful business types, which yielded non-ideal clusters. These clusters for both locations are shown in the maps shown.

In [165]:
houston_count = businessTypes(houston)[0] #total number of business types per business
houston_keys = businessTypes(houston)[1] #business types
houston_dict = businessTypes(houston)[2] #business type dictionary, using keys from above

print(houston_dict, '\n')
houston_count.head(10)

{'Offices': 10, 'Bus Lines': 5, 'Buildings': 5, 'Banks': 4, 'Bus Stations': 3, 'Conference Rooms': 3, 'Food Trucks': 3, 'Gyms': 2, 'Delis / Bodegas': 2, 'Parking': 2, 'Coffee Shops': 2, 'Plazas': 1, "Dentist's Offices": 1, 'Art Galleries': 1, 'Gyms or Fitness Centers': 1, 'Restaurants': 1, 'Farmers Markets': 1, 'Optical Shops': 1, 'Event Spaces': 1, 'Lounges': 1, 'Taco Places': 1, 'Sandwich Places': 1, 'Weight Loss Centers': 1, 'Miscellaneous Shops': 1, 'Professional & Other Places': 1, 'Performing Arts Venues': 1, 'Cafés': 1, 'Breakfast Spots': 1, 'Auditoriums': 1, 'Nightclubs': 1, 'Thai Restaurants': 1, 'Libraries': 1, 'Bakeries': 1, 'American Restaurants': 1, 'Parks': 1, 'Fast Food Restaurants': 1} 



Unnamed: 0,businessType,Count
0,Offices,10
1,Bus Lines,5
2,Buildings,5
3,Banks,4
4,Bus Stations,3
5,Conference Rooms,3
6,Food Trucks,3
7,Gyms,2
8,Delis / Bodegas,2
9,Parking,2


In [166]:
newyork_count = businessTypes(newyork)[0] #total number of business types per business
newyork_keys = businessTypes(newyork)[1] #business types
newyork_dict = businessTypes(newyork)[2] #business type dictionary, using keys from above

print(newyork_dict, '\n')
newyork_count.head(10)

{'Offices': 5, 'College Classrooms': 4, 'Buildings': 2, 'Bus Stations': 2, 'Parks': 2, 'Government Buildings': 2, 'Bus Stops': 2, 'College Theaters': 2, 'City Halls': 2, 'Pizza Places': 2, 'Plazas': 2, 'College Bookstores': 1, 'Jewelry Stores': 1, 'Farmers Markets': 1, 'Perfume Shops': 1, 'Bookstores': 1, 'Music Venues': 1, 'Arts & Crafts Stores': 1, 'Dessert Shops': 1, 'Metro Stations': 1, 'Student Centers': 1, 'College Administrative Buildings': 1, 'Professional & Other Places': 1, 'Coworking Spaces': 1, 'Beer Stores': 1, 'Miscellaneous Shops': 1, 'Gift Shops': 1, 'Kofte Places': 1, 'Thai Restaurants': 1, 'Monuments / Landmarks': 1, 'Islands': 1, 'Public Art': 1, 'Ice Cream Shops': 1, 'Strip Clubs': 1, 'Lawyers': 1, 'College Labs': 1, 'Mexican Restaurants': 1, 'Food Trucks': 1, 'Indian Restaurants': 1} 



Unnamed: 0,businessType,Count
0,Offices,5
1,College Classrooms,4
2,Buildings,2
3,Bus Stations,2
4,Parks,2
5,Government Buildings,2
6,Bus Stops,2
7,College Theaters,2
8,City Halls,2
9,Pizza Places,2


In [167]:
unique_keys = list(set(houston_keys + newyork_keys))
print(unique_keys)

['Indian Restaurants', "Dentist's Offices", 'Conference Rooms', 'Taco Places', 'Jewelry Stores', 'Art Galleries', 'Arts & Crafts Stores', 'College Bookstores', 'Optical Shops', 'Event Spaces', 'Perfume Shops', 'Banks', 'Metro Stations', 'Plazas', 'College Theaters', 'College Labs', 'Parking', 'City Halls', 'Coworking Spaces', 'Lounges', 'Performing Arts Venues', 'Cafés', 'Breakfast Spots', 'Gift Shops', 'Thai Restaurants', 'Libraries', 'Student Centers', 'Monuments / Landmarks', 'Public Art', 'Ice Cream Shops', 'Bus Stations', 'Miscellaneous Shops', 'Lawyers', 'American Restaurants', 'Gyms', 'Offices', 'Beer Stores', 'Sandwich Places', 'Bus Lines', 'Music Venues', 'Bookstores', 'Restaurants', 'Farmers Markets', 'Pizza Places', 'College Administrative Buildings', 'College Classrooms', 'Bus Stops', 'Dessert Shops', 'Gyms or Fitness Centers', 'Weight Loss Centers', 'Strip Clubs', 'Coffee Shops', 'Auditoriums', 'Fast Food Restaurants', 'Professional & Other Places', 'Government Buildings',

In [168]:
full_dict = {}

for each in unique_keys:
    if each in houston_keys and each not in newyork_keys:
        value = houston_dict[each]
    elif each in houston_keys and each in newyork_keys:
        value = houston_dict[each] + newyork_dict[each]
    elif each not in houston_keys and each in newyork_keys:
        value = newyork_dict[each]
    
    full_dict[each] = value

print(full_dict)

{'Indian Restaurants': 1, "Dentist's Offices": 1, 'Conference Rooms': 3, 'Taco Places': 1, 'Jewelry Stores': 1, 'Art Galleries': 1, 'Arts & Crafts Stores': 1, 'College Bookstores': 1, 'Optical Shops': 1, 'Event Spaces': 1, 'Perfume Shops': 1, 'Banks': 4, 'Metro Stations': 1, 'Plazas': 3, 'College Theaters': 2, 'College Labs': 1, 'Parking': 2, 'City Halls': 2, 'Coworking Spaces': 1, 'Lounges': 1, 'Performing Arts Venues': 1, 'Cafés': 1, 'Breakfast Spots': 1, 'Gift Shops': 1, 'Thai Restaurants': 2, 'Libraries': 1, 'Student Centers': 1, 'Monuments / Landmarks': 1, 'Public Art': 1, 'Ice Cream Shops': 1, 'Bus Stations': 5, 'Miscellaneous Shops': 2, 'Lawyers': 1, 'American Restaurants': 1, 'Gyms': 2, 'Offices': 15, 'Beer Stores': 1, 'Sandwich Places': 1, 'Bus Lines': 5, 'Music Venues': 1, 'Bookstores': 1, 'Restaurants': 1, 'Farmers Markets': 2, 'Pizza Places': 2, 'College Administrative Buildings': 1, 'College Classrooms': 4, 'Bus Stops': 2, 'Dessert Shops': 1, 'Gyms or Fitness Centers': 1, 

In [169]:
df_dict = {'businessType': list(full_dict.keys()), 'Count': list(full_dict.values())}
business_df = (pd.DataFrame(df_dict).sort_values(by = 'Count', ascending=False)).reset_index()
business_df = business_df.drop(columns = 'index')
business_df.head(10)

Unnamed: 0,businessType,Count
0,Offices,15
1,Buildings,7
2,Bus Stations,5
3,Bus Lines,5
4,Banks,4
5,College Classrooms,4
6,Food Trucks,4
7,Plazas,3
8,Parks,3
9,Conference Rooms,3


In [170]:
ny_location = query('New York, NY', radius = 1500, limit = 300)[1]
hs_location = query('Houston, TX', radius = 1500, limit = 300)[1]

The location is: New York, NY
The Latitude is:  40.7127281 the Longitude is:  -74.0060152
New York
The location is: Houston, TX
The Latitude is:  29.7589382 the Longitude is:  -95.3676974
Houston


In [172]:
map_plot(ny_location, newyork)

In [173]:
map_plot(hs_location, houston)

In [175]:
houston_cluster = cluster(houston)
houston_cluster

[2 5 4 7 6 0 3 1 8]


Unnamed: 0,id,name,address,city,state,country,latitude,longitude,postalCode,type,plural,Cluster Labels
0,4cbf112b00d837047b8a415c,Baker Botts LLP,910 Louisiana St,Houston,TX,US,29.759671,-95.367728,77002,Conference Room,Conference Rooms,2
1,5b3b978a56ca62001c94580e,"Nrg Energy, Inc.",910 Louisiana St,Houston,TX,US,29.759544,-95.367348,77002,Office,Offices,2
2,4b993b14f964a520d76b35e3,One Shell Plaza,910 Louisiana St,Houston,TX,US,29.759283,-95.367858,77002,Office,Offices,2
3,527096a111d2393d1fc331aa,The Houston Club,920 Louisiana St,Houston,TX,US,29.759112,-95.367565,77002,Restaurant,Restaurants,2
4,4e3def4f62e19d6109763188,Hermann Square,900 Smith St,Houston,TX,US,29.7598,-95.368543,77002,Park,Parks,2
5,4b73281af964a520d09e2de3,The Houston Club,910 Louisiana St Ste 4900,Houston,TX,US,29.759041,-95.367777,77002,Office,Offices,2
6,4baa6bd3f964a52089683ae3,Wells Fargo Plaza,1000 Louisiana St,Houston,TX,US,29.758619,-95.368399,77002,Office,Offices,2
7,4ae4e065f964a5200d9f21e3,Hobby Center for the Performing Arts,800 Bagby St,Houston,TX,US,29.761526,-95.369376,77002,Performing Arts Venue,Performing Arts Venues,2
8,58f648e7588e3646b4ca9a0c,Tacos A Go Go,910 Louisiana St,Houston,TX,US,29.759125,-95.367525,77002,Taco Place,Taco Places,2
9,527aca1c11d26f356c4b3840,One Shell Wellness Center,910 Louisiana St,Houston,TX,US,29.759373,-95.367615,77002,Gym,Gyms,2


In [176]:
newyork_cluster = cluster(newyork)
newyork_cluster

[4 7 2 8 3 0 6 1 5]


Unnamed: 0,id,name,address,city,state,country,latitude,longitude,postalCode,type,plural,Cluster Labels
0,4a676321f964a52051c91fe3,New York City Hall,260 E Broadway,New York,NY,US,40.712659,-74.00588,10002,City Hall,City Halls,4
1,3fd66200f964a520d8f11ee3,City Hall Park,17 Park Row,New York,NY,US,40.712415,-74.006724,10038,Park,Parks,1
2,4b57b0dff964a520293c28e3,MTA Subway - City Hall (R/W),Warren St.,New York,NY,US,40.713394,-74.006934,10038,Metro Station,Metro Stations,1
3,4cd03c8ba03a9eb0cc58b103,Pace NYC Center for Academic Excellence,1 Pace Plz,New York,NY,US,40.711887,-74.005789,10038,Park,Parks,1
4,4cd03c85a03a9eb0c158b103,Barns & Noble at Pace University,41 Park Row,New York,NY,US,40.713389,-74.005944,10038,Dessert Shop,Dessert Shops,1
5,5476b736498ec574d6261fb1,Music NY,40 Ann St Fl 1,New York,NY,US,40.710328,-74.007388,10038,College Classroom,College Classrooms,1
6,4bde139e6c1b9521fc4ead0f,Printing House Square,Avenue of the Finest at Park Row,New York,NY,US,40.712134,-74.004722,10038,Arts & Crafts Store,Arts & Crafts Stores,1
7,4e727ee5e4cda03368ae650d,The Magic Costume Shop,41 Park Row,New York,NY,US,40.712463,-74.006728,10038,Bus Station,Bus Stations,1
8,4c9bdfe87c096dcb6190bcd1,MTA - M22 Park Row & Spruce Street,Park Row,New York,NY,US,40.712068,-74.005909,10038,Mexican Restaurant,Mexican Restaurants,1
9,4cd03c87a03a9eb0c658b103,Pace NYC Dyson,41 Park Row,New York,NY,US,40.712767,-74.006937,10038,Jewelry Store,Jewelry Stores,1


In [179]:
cluster_map(ny_location, newyork_cluster)

In [180]:
cluster_map(hs_location, houston_cluster)

## 5. Discussion

On careful review of the results (business types and their rankings, clusters, etc) on a first glance, it is recommended to have more office buildings or office-related buildings, which if provided at a reasonable and relatively better price to others, will be highly patronized and resultingly profitable. However, on a deeper look, we also notice that the number of restaurants in those areas in both locations are very minimal. This means that, even though a lot of people report to those clusters on a daily basis to their respective jobs, they have to either bring their lunch from home or have to drive out away from that radius or cluster to get their lunch, or have it delivered. This shows a better business opportunity in these clusters in terms of a restaurant business. The type of cuisine offered by the restaurant(s) can then be decided upon via surveys or questionnaires, etc.

So based on these results, I'd recommend two business types for my potential investors. These are;
<br>
1. Office Buildings.
2. Restaurants.

As detailed earlier, the Office buildings might be a reasonable option except that it might be a long shot based off the fact that they are already in high numbers (top ranked business type) in both locations. And on the other hand, the restaurant type based on cuisine would have to be determined (and which is outside the scope of this project).

## 6. Conclusion

In conclusion, the techniques, tools, methods, algorithms, libraries, etc, learned or acquired from this course has helped complete a project from start to finish, with specific consideration of introducing business  investment suggestions to potental investors.