# Moving to New York City

## Introduction

This project helps clients evaluate neighborhoods to move to in New York City. The client wants to live near a subway station, so we obtain all stations and their coordinates and estimate how expensive each station's surroundings are, which lets the client narrow the search. Within the selected cost cluster, we check which stations have the minimum required venues nearby, such as gyms or restaurants. We eliminate the stations that do not meet those requirements and run a cluster analysis on the ones that do, based on venues the client finds desirable, such as cafés, parks and sushi restaurants. The client may also exclude specific subway lines and boroughs from the proposals.

## Data Sources

Subway stations: CSV file from the MTA website, http://web.mta.info/developers/data/nyct/subway/Stations.csv
Relevant columns: Stop Name, GTFS Latitude, GTFS Longitude, Daytime Routes, Borough.

Venues, their prices and categories: Foursquare API.

Data usage: the process obtains the New York City subway stations and their coordinates. Then, for each station, it counts nearby restaurants per price tier (1 to 4), computes the price distribution per station and clusters all stations on that distribution. After the client selects the desired price cluster, the process obtains all venues around the remaining stations, cleans the data by grouping category synonyms (such as pharmacy and drugstore), filters venues by the desired categories and performs a second cluster analysis, this time on venue categories.

## Client Requirements

In [998]:
RemoveLines = []          # Subway lines to exclude ###########
RemoveBoroughs = []       # Boroughs to exclude. Available: Q, M, Bk, Bx, SI  ######

# For grouping similar categories to enhance analysis ##########
Synonyms = ["Gym_Gym / Fitness Center_Gym Pool_Pool", "Market_Supermarket", 
            "Laundromat_Dry Cleaner_Laundry Service", "Café_Coffee Shop", 
            "Japanese Restaurant_Sushi Restaurant", "Pharmacy_Drugstore"]

# Must-have venues ##############
DesiredVenues = ["Gym", "Market", "Laundromat"]

# Desired venues for clustering ####################
RelevantVenues = ["Café", "Dance Studio", "Deli / Bodega", 
                  "Gym", "Japanese Restaurant", "Laundromat",
                  "Park","Pharmacy", "Medical Center", "Spa", "Market"]

Import required libraries

In [929]:
#!conda install -c conda-forge folium=0.5.0 --yes     #uncomment to install folium
import folium
import pandas as pd
import numpy as np
import requests
import wget
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import json
import os.path
from os import path

In [930]:
# @hidden_cell
with open('pwd.json') as f:    # close the credentials file after reading
    pwd = json.load(f)
CLIENT_ID = pwd['CLIENT_ID'][0]
CLIENT_SECRET = pwd['CLIENT_SECRET'][0]
VERSION = pwd['VERSION'][0]

Obtain New York City subway stations

In [931]:
!wget -O Stations.csv http://web.mta.info/developers/data/nyct/subway/Stations.csv

"wget" is not recognized as an internal or external command,
operable program or batch file.
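The shell `wget` call above fails on machines where the command is not installed. A minimal sketch of a download step that relies only on the Python standard library is shown here; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        # urlretrieve streams the response body into the file at dest
        urllib.request.urlretrieve(url, dest)
    return dest


# Intended usage (URL from the Data Sources section; availability not guaranteed):
# download_if_missing(
#     "http://web.mta.info/developers/data/nyct/subway/Stations.csv",
#     "Stations.csv")
```

Skipping the download when the file already exists follows the same idea as the persisted CSV files used later for the Foursquare results.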


In [932]:
df_stations = pd.read_csv('Stations.csv', index_col=0)   # file name as written by the download step

#  filtering lines and boroughs ################
df_stations["Daytime Routes"] = " " + df_stations["Daytime Routes"] + " "  
df_stations["Borough"] = " " + df_stations["Borough"] + " "  

for i in RemoveLines:
    remove = " " + i + " "
    df_stations = df_stations[~df_stations["Daytime Routes"].str.contains(remove)]

for i in RemoveBoroughs:
    remove = " " + i + " "
    df_stations = df_stations[~df_stations["Borough"].str.contains(remove)]

# Clean unwanted characters ########
df_stations["Stop Name"] = df_stations["Stop Name"].replace({r'.*!(.*)': r'\1'}, regex=True)
df_stations["Line"] = df_stations["Line"].replace({r'.*!(.*)': r'\1'}, regex=True)
df_stations["Borough"] = df_stations["Borough"].replace({r'.*!(.*)': r'\1'}, regex=True)

df_stations['Stop Name'] = df_stations['Stop Name'].str.strip()
df_stations['Line'] = df_stations['Line'].str.strip()
df_stations['Borough'] = df_stations['Borough'].str.strip()
df_stations["Station"] = df_stations['Line'] + "_" + df_stations["Borough"] + "_" + df_stations['Stop Name']
df_stations = df_stations.rename(columns={"GTFS Latitude": "Lat", "GTFS Longitude": "Lng"})

print(df_stations.shape)
df_stations.head(5)

(496, 13)


Station ID,Complex ID,GTFS Stop ID,Division,Line,Stop Name,Borough,Daytime Routes,Structure,Lat,Lng,North Direction Label,South Direction Label,Station
1,1,R01,BMT,Astoria,Astoria - Ditmars Blvd,Q,N W,Elevated,40.775036,-73.912034,,Manhattan,Astoria_Q_Astoria - Ditmars Blvd
2,2,R03,BMT,Astoria,Astoria Blvd,Q,N W,Elevated,40.770258,-73.917843,Ditmars Blvd,Manhattan,Astoria_Q_Astoria Blvd
3,3,R04,BMT,Astoria,30 Av,Q,N W,Elevated,40.766779,-73.921479,Astoria - Ditmars Blvd,Manhattan,Astoria_Q_30 Av
4,4,R05,BMT,Astoria,Broadway,Q,N W,Elevated,40.76182,-73.925508,Astoria - Ditmars Blvd,Manhattan,Astoria_Q_Broadway
5,5,R06,BMT,Astoria,36 Av,Q,N W,Elevated,40.756804,-73.929575,Astoria - Ditmars Blvd,Manhattan,Astoria_Q_36 Av
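The cell above pads the `Daytime Routes` and `Borough` values with spaces so that a substring search for `" N "` matches the route N as a whole token. The same rule can be written as a set test on the split route string. The sketch below is plain Python; `keep_station` is a hypothetical helper, and only the first sample row is taken from the table above (the other two rows are illustrative).

```python
def keep_station(routes, borough, remove_lines, remove_boroughs):
    """Return True when the station uses none of the removed lines or boroughs."""
    route_tokens = set(routes.split())          # "N W" -> {"N", "W"}
    return (route_tokens.isdisjoint(remove_lines)
            and borough.strip() not in remove_boroughs)


stations = [
    ("Astoria - Ditmars Blvd", "N W", "Q"),   # from the table above
    ("Cortlandt St", "R W", "M"),             # illustrative row
    ("Court St", "R", "Bk"),                  # illustrative row
]

kept = [name for name, routes, borough in stations
        if keep_station(routes, borough, remove_lines={"N"}, remove_boroughs={"M"})]
print(kept)  # ['Court St']
```

Comparing whole tokens avoids any dependence on padding and on how the search pattern is interpreted.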


Estimate neighborhood costs by obtaining the restaurant price distribution per station

In [933]:
def ElementExists(df, field, element):
    # True when any value in df[field] equals element
    return bool((df[field] == element).any())

In [934]:
def getVenuesPriceCategory(latitude, longitude, radius, limit, price, category):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&price={}&categoryId={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, radius, limit,
        price, category)
    res = requests.get(url).json()
    print(res)    # for checking errors, such as hitting the license limit ###########
    results = res["response"]['groups'][0]['items']
    return len(results)
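The request URL above is assembled with a long format string. As a sketch under the assumption that the same query parameters are wanted, the query can instead be built from a dictionary and encoded with the standard library. `build_explore_url` is a hypothetical helper; the parameter names are the ones already used in the URL above.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit,
                      price=None, category_id=None):
    """Return the explore URL with a properly encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # Optional filters are only sent when they are given
    if price is not None:
        params["price"] = price
    if category_id is not None:
        params["categoryId"] = category_id
    return EXPLORE_URL + "?" + urlencode(params)


url = build_explore_url("ID", "SECRET", "20180605", 40.775036, -73.912034,
                        250, 500, price="1",
                        category_id="4d4b7105d754a06374d81259")
```

The credentials and version string in the usage line are placeholders. With `requests`, passing the same dictionary as `requests.get(EXPLORE_URL, params=params)` has the same effect.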

In [935]:
def getPriceDistribution(names, latitudes, longitudes, df):
    radius = 250
    limit = 500
    category = "4d4b7105d754a06374d81259"    # restaurants #############
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        if not ElementExists(df, "Station", name):
            station = [name, lat, lng]
            prices = [0, 0, 0, 0]
            total = 0

            for price in range(0, 4):
                count = getVenuesPriceCategory(lat, lng, radius, limit, str(price + 1), category)
                total += count
                prices[price] = count

            if(total > 0):
                prices = [x / total for x in prices]
                
            station.extend(prices)
            dfTemp = pd.DataFrame([station], columns=PersistedStationPricesCols)
            df = pd.concat([df, dfTemp])    # DataFrame.append was removed in pandas 2.0
            df.columns = PersistedStationPricesCols
            df.to_csv(PersistedStationPricesCsv)
    return(df)

In [936]:
PersistedStationPricesCsv = 'PersistedStationPrices.csv'
PersistedStationPricesCols = ['Station', 'Lat', 'Lng', 'Price1', 'Price2', 'Price3','Price4']

df_PersistedStationPrices = pd.DataFrame(columns=PersistedStationPricesCols)
if (path.exists(PersistedStationPricesCsv)):
    df_PersistedStationPrices = pd.read_csv(PersistedStationPricesCsv, index_col=0)

df_PersistedStationPrices = getPriceDistribution(names=df_stations['Station'],
                     latitudes=df_stations['Lat'],
                     longitudes=df_stations['Lng'], 
                     df=df_PersistedStationPrices)
df_PersistedStationPrices.head(5)

Unnamed: 0,Station,Lat,Lng,Price1,Price2,Price3,Price4
0,Astoria_Q_Astoria - Ditmars Blvd,40.775036,-73.912034,0.508772,0.438596,0.017544,0.035088
0,Astoria_Q_Astoria Blvd,40.770258,-73.917843,0.25,0.75,0.0,0.0
0,Astoria_Q_30 Av,40.766779,-73.921479,0.65625,0.3125,0.0,0.03125
0,Astoria_Q_Broadway,40.76182,-73.925508,0.5,0.5,0.0,0.0
0,Astoria_Q_36 Av,40.756804,-73.929575,0.545455,0.393939,0.060606,0.0
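Each `Price1` to `Price4` value in the table above is the count of restaurants in that price tier divided by the station's total count. The sketch below isolates that step; `price_shares` is a hypothetical helper, and the counts `[29, 25, 1, 2]` are an assumption chosen because they reproduce the first row, not values taken from the API.

```python
def price_shares(counts):
    """Turn per-tier restaurant counts into shares that sum to 1."""
    total = sum(counts)
    if total == 0:
        # No restaurants found: keep zeros, as the notebook does
        return [0.0] * len(counts)
    return [c / total for c in counts]


shares = price_shares([29, 25, 1, 2])
print([round(s, 6) for s in shares])  # [0.508772, 0.438596, 0.017544, 0.035088]
```

Stations with no restaurants keep an all-zero row, which is why the clustering step later discards rows whose shares do not sum to 1.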


Perform K-means clustering on the station price distributions

In [1002]:
Clustering = df_PersistedStationPrices.copy()
Clustering["PriceTot"] = Clustering["Price1"] + Clustering["Price2"] + Clustering["Price3"] + Clustering["Price4"] 
Clustering = Clustering[Clustering["PriceTot"] > 0.99]
Clustering = Clustering.drop(["PriceTot"], axis=1)

kclusters = 2
PriceCluster = Clustering.drop(["Station", "Lat", "Lng"], axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(PriceCluster)

PriceCluster = Clustering.copy()
PriceCluster.insert(0, 'Cluster', kmeans.labels_)
PriceCluster.head(3)

Unnamed: 0,Cluster,Station,Lat,Lng,Price1,Price2,Price3,Price4
0,1,Astoria_Q_Astoria - Ditmars Blvd,40.775036,-73.912034,0.508772,0.438596,0.017544,0.035088
0,1,Astoria_Q_Astoria Blvd,40.770258,-73.917843,0.25,0.75,0.0,0.0
0,0,Astoria_Q_30 Av,40.766779,-73.921479,0.65625,0.3125,0.0,0.03125
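To make the clustering step less of a black box, here is a minimal sketch of the two alternating steps of K-means (Lloyd's algorithm) in plain Python. It is an illustration only: scikit-learn's `KMeans` initializes and iterates differently, the seeding with the first `k` points is a simplification, and the four price distributions are made-up values rather than rows from the data.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up price shares: two mostly cheap stations, two mixed ones
points = [
    (0.80, 0.20, 0.00, 0.00),
    (0.40, 0.45, 0.10, 0.05),
    (0.75, 0.20, 0.05, 0.00),
    (0.45, 0.40, 0.10, 0.05),
]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

Each centroid ends up as the mean price distribution of its members, which is the quantity the next cell reports per cluster.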


In [938]:
PriceCluster.groupby('Cluster').agg({'Price1': ['mean'], 'Price2': ['mean'], 
                                     'Price3': ['mean'], 'Price4': ['mean']})

Cluster,Price1 (mean),Price2 (mean),Price3 (mean),Price4 (mean)
0,0.774299,0.198741,0.020302,0.006658
1,0.437,0.430411,0.0993,0.03329
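One way to compare the two clusters with a single number is the expected price tier, the mean shares weighted by tier 1 to 4. This is a sketch of a possible summary, not something the notebook computes; the shares are copied from the table above and `expected_tier` is a hypothetical helper.

```python
# Mean price shares per cluster, copied from the table above
cluster_means = {
    0: [0.774299, 0.198741, 0.020302, 0.006658],
    1: [0.437, 0.430411, 0.0993, 0.03329],
}


def expected_tier(shares):
    """Weighted average of the price tiers 1..4."""
    return sum(tier * share for tier, share in enumerate(shares, start=1))


tiers = {cluster: expected_tier(s) for cluster, s in cluster_means.items()}
cheapest = min(tiers, key=tiers.get)
print(cheapest)  # 0
```

Cluster 0 comes out at about 1.26 and cluster 1 at about 1.73, so cluster 0 is the cheaper group. K-means label numbers carry no meaning by themselves, so a summary like this can help identify the cheaper cluster when the data or the number of clusters changes.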


Display cost estimates map.

In [939]:
latitude = 40.764416
longitude = -73.9369046
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(PriceCluster['Lat'], PriceCluster['Lng'], PriceCluster['Station'], PriceCluster['Cluster']):
    label = folium.Popup('Cluster ' + str(cluster) + ': ' +str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
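In the cell above, `ys` is only used for its length, and `rainbow[cluster-1]` gives label 0 the last color through negative indexing. As an alternative sketch, one color per label can be generated with the standard library and indexed directly by the label; `cluster_colors` is a hypothetical helper and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_colors(k):
    """Return k evenly spaced hues as '#rrggbb' strings, one per cluster label."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_colors(2)
```

A marker for a station in cluster `c` would then use `color=palette[c]`.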

The client chooses the price cluster they want, in this case the one that excludes expensive areas. We then look for the subway stations whose surroundings have the minimum required venues.

In [940]:
PriceCluster2 = PriceCluster[PriceCluster["Cluster"] == 0]  # <---- filtering cluster ####
PriceCluster2.head(5)

Unnamed: 0,Cluster,Station,Lat,Lng,Price1,Price2,Price3,Price4
0,0,Astoria_Q_30 Av,40.766779,-73.921479,0.65625,0.3125,0.0,0.03125
0,0,Broadway_M_Cortlandt St,40.710668,-74.011029,0.64,0.28,0.04,0.04
0,0,Broadway_Bk_Court St,40.6941,-73.991777,0.666667,0.333333,0.0,0.0
0,0,Broadway_Bk_Jay St - MetroTech,40.69218,-73.985942,0.627907,0.302326,0.069767,0.0
0,0,Broadway - Brighton_Bk_DeKalb Av,40.690635,-73.981824,0.634615,0.307692,0.057692,0.0


In [941]:
def getNearbyVenues(names, latitudes, longitudes, df):
    radius = 700
    LIMIT = 50000
    for name, lat, lng in zip(names, latitudes, longitudes):
        if not ElementExists(df, "Station", name):
            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

            results = requests.get(url).json()["response"]['groups'][0]['items']
            venues = [(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'],
                       v['venue']['location']['lng'],v['venue']['categories'][0]['name']) 
                      for v in results]
            
            dfTemp = pd.DataFrame(venues, columns=PersistedVenuesCols)
            df = pd.concat([df, dfTemp])
            df.columns = PersistedVenuesCols
            df.to_csv(PersistedVenuesCsv)
    return(df)

In [942]:
# Obtain venues, persisting into a csv to avoid querying multiple times #####
PersistedVenuesCsv = 'PersistedVenues.csv'
# Column order matches the tuples built in getNearbyVenues: venue name first, then its coordinates
PersistedVenuesCols = ['Station', 'StationLat', 'StationLng', 'Venue', 'VenueLat', 'VenueLng', 'Category']

df_PersistedVenues = pd.DataFrame(columns=PersistedVenuesCols)
if (path.exists(PersistedVenuesCsv)):
    df_PersistedVenues = pd.read_csv(PersistedVenuesCsv, index_col=0)

df_PersistedVenues = getNearbyVenues(names=PriceCluster2['Station'],
                     latitudes=PriceCluster2['Lat'],
                     longitudes=PriceCluster2['Lng'],
                     df = df_PersistedVenues)

df_PersistedVenues.head(5)

Unnamed: 0,Station,StationLat,StationLng,Venue,VenueLat,VenueLng,Category
0,Broadway_Bk_Court St,40.6941,-73.991777,Heatwise,40.69345,-73.991788,Yoga Studio
1,Broadway_Bk_Court St,40.6941,-73.991777,Brooklyn Historical Society,40.694942,-73.992333,History Museum
2,Broadway_Bk_Court St,40.6941,-73.991777,SoulCycle Brooklyn Heights,40.692253,-73.991042,Cycle Studio
3,Broadway_Bk_Court St,40.6941,-73.991777,Xtend Barre Brooklyn Heights,40.693599,-73.992376,Gym / Fitness Center
4,Broadway_Bk_Court St,40.6941,-73.991777,Orangetheory Fitness,40.693967,-73.991519,Gym


In [980]:
# rename categories using 1st element before "_" for clustering by category #######
df_Venues = df_PersistedVenues.copy()
for s1 in Synonyms: 
    s = s1.split("_")
    for s2 in s: 
        if s2 != s[0]:
            df_Venues.loc[df_Venues["Category"] == s2, "Category"] = s[0]

# filter venues using only Relevant ###########
df_Venues = df_Venues[df_Venues["Category"].isin(RelevantVenues)]

# filter venues using only selected price cluster ###########
df_Venues = df_Venues[df_Venues["Station"].isin(PriceCluster2["Station"])]
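The synonym step above maps every alias in a `Synonyms` entry to the first name in that entry. A minimal sketch of the same rule as a lookup table follows; `build_synonym_map` is a hypothetical helper and the list is a shortened copy of the client's `Synonyms`.

```python
def build_synonym_map(synonyms):
    """Map each alias to the first (canonical) name of its '_'-separated group."""
    mapping = {}
    for group in synonyms:
        canonical, *aliases = group.split("_")
        for alias in aliases:
            mapping[alias] = canonical
    return mapping


Synonyms = ["Gym_Gym / Fitness Center_Gym Pool_Pool", "Café_Coffee Shop",
            "Pharmacy_Drugstore"]
synonym_map = build_synonym_map(Synonyms)

categories = ["Gym / Fitness Center", "Coffee Shop", "Yoga Studio"]
print([synonym_map.get(c, c) for c in categories])  # ['Gym', 'Café', 'Yoga Studio']
```

With pandas, the dictionary could be applied in one call through `df_Venues["Category"].replace(synonym_map)`; categories without an entry are left unchanged.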

In [981]:
for v1 in DesiredVenues: 
    # Obtain venues for current group ########
    df = df_Venues[df_Venues["Category"] == v1]

    # Keep only stations that have at least one venue of this category, so that
    # after the loop every remaining station has all desired venues ##########
    df_Venues = df_Venues[df_Venues["Station"].isin(df["Station"].drop_duplicates())]

Finalists = PriceCluster2[PriceCluster2["Station"].isin(df_Venues["Station"].drop_duplicates())]
Finalists

Unnamed: 0,Cluster,Station,Lat,Lng,Price1,Price2,Price3,Price4
0,0,Broadway - Brighton_Bk_DeKalb Av,40.690635,-73.981824,0.634615,0.307692,0.057692,0.0
0,0,Jamaica_Q_111 St,40.697418,-73.836345,0.714286,0.285714,0.0,0.0
0,0,8th Av - Fulton St_M_163 St - Amsterdam Av,40.836013,-73.939892,0.714286,0.285714,0.0,0.0
0,0,8th Av - Fulton St_M_155 St,40.830518,-73.941514,0.666667,0.333333,0.0,0.0
0,0,Rockaway_Q_Beach 105 St,40.583209,-73.827559,1.0,0.0,0.0,0.0
0,0,Concourse_Bx_Bedford Park Blvd,40.873244,-73.887138,0.809524,0.190476,0.0,0.0
0,0,Queens Blvd_Q_67 Av,40.726523,-73.852719,0.653846,0.307692,0.038462,0.0
0,0,Queens Blvd_Q_Woodhaven Blvd,40.733106,-73.869229,0.684211,0.263158,0.052632,0.0
0,0,Broadway - 7Av_M_157 St,40.834041,-73.94489,0.785714,0.214286,0.0,0.0
0,0,Broadway - 7Av_M_145 St,40.826551,-73.95036,0.761905,0.238095,0.0,0.0
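The loop above narrows the stations one must-have category at a time. The same condition can be stated as a subset test: a station survives when its set of categories contains every desired venue. The sketch below uses plain Python; `stations_with_all` is a hypothetical helper and the station names are placeholders.

```python
from collections import defaultdict


def stations_with_all(venues, required):
    """Return the stations whose venue categories include every required one."""
    categories_by_station = defaultdict(set)
    for station, category in venues:
        categories_by_station[station].add(category)
    return sorted(station for station, cats in categories_by_station.items()
                  if set(required) <= cats)


venues = [
    ("Station A", "Gym"), ("Station A", "Market"), ("Station A", "Laundromat"),
    ("Station B", "Gym"), ("Station B", "Café"),
]
print(stations_with_all(venues, ["Gym", "Market", "Laundromat"]))  # ['Station A']
```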


In [982]:
latitude = 40.764416
longitude = -73.9369046
map_Stations = folium.Map(location=[latitude, longitude], zoom_start=11)

markers_colors = []
for lat, lng, poi in zip(Finalists['Lat'], Finalists['Lng'], Finalists['Station']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Stations)  
map_Stations

We cluster the surviving stations by the categories of their nearby venues. First we count the venues per category and compute the frequency distribution for each station.

In [990]:
# Obtain frequencies of relevant venues only #########
city_onehot = pd.get_dummies(df_Venues[['Category']], prefix="", prefix_sep="")

# add station column back to dataframe
city_onehot['Station'] = df_Venues['Station'] 

# move station column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

# group rows by station; the mean of each one-hot column is that category's share of the station's venues
city_grouped = city_onehot.groupby('Station').mean().reset_index()
city_grouped

Unnamed: 0,Station,Café,Dance Studio,Deli / Bodega,Gym,Japanese Restaurant,Laundromat,Market,Park,Pharmacy,Spa
0,8th Av - Fulton St_M_155 St,0.277778,0.055556,0.277778,0.055556,0.0,0.111111,0.111111,0.111111,0.0,0.0
1,8th Av - Fulton St_M_163 St - Amsterdam Av,0.357143,0.0,0.214286,0.071429,0.0,0.071429,0.071429,0.142857,0.0,0.071429
2,Broadway - 7Av_M_145 St,0.291667,0.041667,0.25,0.083333,0.125,0.041667,0.041667,0.125,0.0,0.0
3,Broadway - 7Av_M_157 St,0.1875,0.0625,0.3125,0.0625,0.0,0.0625,0.0625,0.25,0.0,0.0
4,Broadway - Brighton_Bk_DeKalb Av,0.384615,0.153846,0.0,0.076923,0.076923,0.076923,0.076923,0.076923,0.0,0.076923
5,Concourse_Bx_Bedford Park Blvd,0.117647,0.0,0.117647,0.117647,0.0,0.058824,0.235294,0.176471,0.176471,0.0
6,Jamaica_Q_111 St,0.0,0.0,0.375,0.125,0.0,0.125,0.125,0.125,0.125,0.0
7,Lenox - White Plains Rd_Bx_Gun Hill Rd,0.0,0.0,0.0,0.333333,0.0,0.111111,0.222222,0.0,0.222222,0.111111
8,Pelham_Bx_Whitlock Av,0.0,0.0,0.111111,0.111111,0.0,0.111111,0.111111,0.222222,0.333333,0.0
9,Queens Blvd_Q_67 Av,0.05,0.0,0.05,0.15,0.2,0.1,0.1,0.15,0.1,0.1
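Taking the mean of one-hot columns per station is the same as dividing each category's venue count by the station's total. The sketch below shows that with a counter; `category_shares` is a hypothetical helper, and the 18 venues are an assumption inferred from the fractions in the `8th Av - Fulton St_M_155 St` row (for example 5/18 = 0.277778), not data read from the API.

```python
from collections import Counter


def category_shares(categories):
    """Share of each category among a station's venues."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Assumed venue list that reproduces the 155 St row above
venues_155 = (["Café"] * 5 + ["Deli / Bodega"] * 5 + ["Dance Studio"] + ["Gym"]
              + ["Laundromat"] * 2 + ["Market"] * 2 + ["Park"] * 2)
shares = category_shares(venues_155)
print(round(shares["Café"], 6))  # 0.277778
```

Because these are shares and not counts, a station with few venues and a station with many can look identical to the clustering step.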


Function for obtaining most common venues per station. 

In [991]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Show most common venue types per station. 

In [992]:
# common venues per station #########
indicators = ['st', 'nd', 'rd']
num_top_venues = 7

# create columns according to number of top venues
columns = ['Station']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:    # beyond 'st', 'nd', 'rd' fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Station_venues_sorted = pd.DataFrame(columns=columns)
Station_venues_sorted['Station'] = city_grouped['Station']

for ind in np.arange(city_grouped.shape[0]):
    Station_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

Station_venues_sorted

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,8th Av - Fulton St_M_155 St,Deli / Bodega,Café,Park,Market,Laundromat,Gym,Dance Studio
1,8th Av - Fulton St_M_163 St - Amsterdam Av,Café,Deli / Bodega,Park,Spa,Market,Laundromat,Gym
2,Broadway - 7Av_M_145 St,Café,Deli / Bodega,Park,Japanese Restaurant,Gym,Market,Laundromat
3,Broadway - 7Av_M_157 St,Deli / Bodega,Park,Café,Market,Laundromat,Gym,Dance Studio
4,Broadway - Brighton_Bk_DeKalb Av,Café,Dance Studio,Spa,Park,Market,Laundromat,Japanese Restaurant
5,Concourse_Bx_Bedford Park Blvd,Market,Pharmacy,Park,Gym,Deli / Bodega,Café,Laundromat
6,Jamaica_Q_111 St,Deli / Bodega,Pharmacy,Park,Market,Laundromat,Gym,Spa
7,Lenox - White Plains Rd_Bx_Gun Hill Rd,Gym,Pharmacy,Market,Spa,Laundromat,Park,Japanese Restaurant
8,Pelham_Bx_Whitlock Av,Pharmacy,Park,Market,Laundromat,Gym,Deli / Bodega,Spa
9,Queens Blvd_Q_67 Av,Japanese Restaurant,Park,Gym,Spa,Pharmacy,Market,Laundromat
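The column labels above use a three-item suffix list with a fallback to `th`, which is enough for seven columns. If `num_top_venues` were raised past 20, labels such as 21 would need `st` again. A small sketch of a general helper is given here; `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"          # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


print([ordinal(n) for n in (1, 2, 3, 4, 11, 21)])
# ['1st', '2nd', '3rd', '4th', '11th', '21st']
```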


Cluster stations using K-Means algorithm

In [995]:
kclusters = 4
city_grouped_clustering = city_grouped.drop('Station', axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)
city_grouped2 = city_grouped.copy()
city_grouped2.insert(0, 'Cluster', kmeans.labels_)
city_grouped2

Finalists2 = Finalists.drop('Cluster', axis=1)
Finalists2 = pd.merge(Finalists2, city_grouped2[['Station', 'Cluster']], on='Station')

Station_venues_sorted2 = pd.merge(Station_venues_sorted, city_grouped2[['Station', 'Cluster']], on='Station')
Station_venues_sorted2.sort_values("Cluster")

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,Cluster
7,Lenox - White Plains Rd_Bx_Gun Hill Rd,Gym,Pharmacy,Market,Spa,Laundromat,Park,Japanese Restaurant,0
9,Queens Blvd_Q_67 Av,Japanese Restaurant,Park,Gym,Spa,Pharmacy,Market,Laundromat,0
4,Broadway - Brighton_Bk_DeKalb Av,Café,Dance Studio,Spa,Park,Market,Laundromat,Japanese Restaurant,1
10,Queens Blvd_Q_Woodhaven Blvd,Café,Japanese Restaurant,Gym,Deli / Bodega,Pharmacy,Market,Laundromat,1
0,8th Av - Fulton St_M_155 St,Deli / Bodega,Café,Park,Market,Laundromat,Gym,Dance Studio,2
1,8th Av - Fulton St_M_163 St - Amsterdam Av,Café,Deli / Bodega,Park,Spa,Market,Laundromat,Gym,2
2,Broadway - 7Av_M_145 St,Café,Deli / Bodega,Park,Japanese Restaurant,Gym,Market,Laundromat,2
3,Broadway - 7Av_M_157 St,Deli / Bodega,Park,Café,Market,Laundromat,Gym,Dance Studio,2
5,Concourse_Bx_Bedford Park Blvd,Market,Pharmacy,Park,Gym,Deli / Bodega,Café,Laundromat,3
6,Jamaica_Q_111 St,Deli / Bodega,Pharmacy,Park,Market,Laundromat,Gym,Spa,3


Map Clusters

In [997]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Finalists2['Lat'], Finalists2['Lng'], Finalists2['Station'], Finalists2['Cluster']):
    label = folium.Popup('Cluster ' + str(cluster) + ': ' +str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

Cluster 0: Gym, spa.
Cluster 1: Café.
Cluster 2: Deli, café, park.
Cluster 3: Pharmacy, market, park.

## Further development


Obtain sale and rental prices for available housing. Obtain crime statistics per neighborhood. Estimate rush-hour commuting times from each subway station to the client's workplace by different means of transportation. This work could be included as a component of a real estate broker's website to enhance the customer experience by enabling advanced neighborhood selection.