# Final Project

### Overview:

**Introduction**

A business or individual may want a list of neighborhoods that resemble a given neighborhood, perhaps the one in which they are currently located. For example, a business that is relocating or opening a second location may be satisfied with its current neighborhood and want to replicate it as closely as possible. This project therefore takes neighborhoods within Manhattan as its inputs: the user supplies a neighborhood and a geographic requirement, and the most suitable neighborhood given these parameters is returned. The example developed here focuses on businesses that must locate as close as possible to a target location, but the approach could readily be adjusted.


**Data in Use**

The primary data source is Foursquare. Information on venues in each area is gathered with explore queries and used in the similarity determinations. The website Rent Cafe is also scraped for average rent by neighborhood, which is relevant to a business's affordability concerns. While the values are residential, this listing is the closest fit for the project's needs and should at least provide a basic indicator of commercial rent. Finally, a table of all neighborhoods in Manhattan, together with basic geographic data, is used for illustrative purposes.

**Methodology**

Once the data is loaded, some exploratory data analysis is performed to produce tables of common venues, which are a good way to examine the similarity of certain neighborhoods. Folium maps are also used to visualize the neighborhoods within Manhattan.

Ultimately, machine learning rather than statistical technique was the primary driver of the project. The available information, both typical venues and average rents, was used to produce clusters of similar neighborhoods with the k-means clustering technique. Neighborhoods in the same cluster as the input neighborhood were then compared with the desired geographic coordinates, and the neighborhood closest to that point became the output, along with an overview of its relevant statistics, which in this case were limited to neighborhood name, average rent, and distance from the target.

**Results**

Ultimately, the project produced a function that can take in the parameters of a current neighborhood and desired target area. It then produced the neighborhood that best fits these parameters, and it could, of course, be adjusted to work with other combinations of related parameters to produce different results. For example, another sample function was produced that prioritized price rather than distance in its recommendation.

**Discussion**

One interesting piece of information is the sheer quantity of different venue types that the Foursquare analysis returns. If recommendations were being made for a specific type of business, it would probably be helpful to remove some categories from consideration and combine others in order to produce industry-specific recommendations that are comprehensible to a human being. This type of additional data processing might also be helpful in ensuring that certain additional pieces of information receive due weight; for example, while average rent is fed into the k-means algorithm, it is difficult to tell how much this extra information actually affected the final clusters. Finally, the idea of extending this type of application to broader areas or to inter-city comparison definitely seems worth investigation.

**Conclusion**

In conclusion, this project aimed to provide an aid to businesses or even individuals seeking to switch neighborhoods within the borough of Manhattan. The end result was a pair of slightly different Python functions that utilize machine learning in order to make recommendations based on some combination of rent, geographic, and venue data. The basic findings are also readily extensible in a variety of directions, and it can only be hoped that this project would provide a strong foundation for any such extensions.

### Creating the project

##### Preparing the data

In [2]:
#Imports
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from tabulate import tabulate
import json 
from pandas.io.json import json_normalize

In [3]:
#Now let's get the New York/Manhattan data
!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
print('Data downloaded!')

Data downloaded!


In [4]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [5]:
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)


In [6]:
#Reduce it to just Manhattan
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [7]:
#Getting the rent information
res = requests.get("https://www.rentcafe.com/average-rent-market-trends/us/ny/manhattan/")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[2] 
df_rents = pd.read_html(str(table), skiprows = 1)[0]
df_rents.head()

Unnamed: 0,Neighborhood,All rentals,Studio,1 Bed,2 Beds,3 Beds
0,Washington Heights,"$2,170","$1,678","$2,000","$2,600","$3,219"
1,East Harlem,"$2,528","$1,895","$2,362","$3,521","$5,755"
2,Harlem,"$2,783","$1,975","$2,660","$4,112","$6,765"
3,Ellis Island,"$3,328","$2,929","$3,384","$3,793","$1,424"
4,Tudor City,"$3,389","$2,967","$3,556","$4,908","$5,450"


In [8]:
#Drop extra columns
df_rents.drop(['Studio', '1 Bed', '2 Beds', '3 Beds'], axis = 1, inplace = True)
df_rents.head()

Unnamed: 0,Neighborhood,All rentals
0,Washington Heights,"$2,170"
1,East Harlem,"$2,528"
2,Harlem,"$2,783"
3,Ellis Island,"$3,328"
4,Tudor City,"$3,389"


In [18]:
#Joining the tables
df_manhattan = manhattan_data.set_index('Neighborhood').join(df_rents.set_index('Neighborhood'))

In [19]:
df_manhattan.reset_index(inplace = True)
df_manhattan

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,All rentals
0,Marble Hill,Manhattan,40.876551,-73.91066,
1,Chinatown,Manhattan,40.715618,-73.994279,"$4,864"
2,Washington Heights,Manhattan,40.851903,-73.9369,"$2,170"
3,Inwood,Manhattan,40.867684,-73.92121,
4,Hamilton Heights,Manhattan,40.823604,-73.949688,
5,Manhattanville,Manhattan,40.816934,-73.957385,
6,Central Harlem,Manhattan,40.815976,-73.943211,
7,East Harlem,Manhattan,40.792249,-73.944182,"$2,528"
8,Upper East Side,Manhattan,40.775639,-73.960508,
9,Yorkville,Manhattan,40.77593,-73.947118,"$4,130"


In [20]:
df_manhattan.rename(columns={ df_manhattan.columns[4]: "Average Rent"}, inplace = True)

In [21]:
#Replacing NaN values manually, since in most cases, they result from a simple mismatch
#Unavailable values will simply be dropped; in actual implementation, an effort would be made to estimate or research to replace.
df_manhattan.loc[5, 'Average Rent'] = "$4,553"
df_manhattan.loc[6, 'Average Rent'] = "$2,783"
df_manhattan.loc[8, 'Average Rent'] = "$4,173"
df_manhattan.loc[14, 'Average Rent'] = "$3,872"
df_manhattan.loc[15, 'Average Rent'] = "$3,830"
df_manhattan.loc[21, 'Average Rent'] = "$5,586"
df_manhattan.loc[23, 'Average Rent'] = "$5,066"
df_manhattan.loc[24, 'Average Rent'] = "$4,197"
df_manhattan.loc[27, 'Average Rent'] = "$4,071"
df_manhattan.loc[31, 'Average Rent'] = "$4,206"
df_manhattan.loc[33, 'Average Rent'] = "$4,076"
df_manhattan.loc[33, 'Average Rent'] = "$4,093" #Note: same row as the line above, so this value overwrites "$4,076"
df_manhattan.dropna(inplace = True)
df_manhattan.reset_index(drop=True, inplace = True)
df_manhattan

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Average Rent
0,Chinatown,Manhattan,40.715618,-73.994279,"$4,864"
1,Washington Heights,Manhattan,40.851903,-73.9369,"$2,170"
2,Manhattanville,Manhattan,40.816934,-73.957385,"$4,553"
3,Central Harlem,Manhattan,40.815976,-73.943211,"$2,783"
4,East Harlem,Manhattan,40.792249,-73.944182,"$2,528"
5,Upper East Side,Manhattan,40.775639,-73.960508,"$4,173"
6,Yorkville,Manhattan,40.77593,-73.947118,"$4,130"
7,Lenox Hill,Manhattan,40.768113,-73.95886,"$4,269"
8,Roosevelt Island,Manhattan,40.76216,-73.949168,"$3,430"
9,Upper West Side,Manhattan,40.787658,-73.977059,"$4,536"
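The manual replacements above could alternatively be expressed as a name mapping that is consulted when a neighborhood has no direct match in the rent table. The following is a minimal sketch, assuming the mismatch is purely one of naming. The only alias shown, Central Harlem to Harlem, is inferred from the fact that the value assigned to Central Harlem above equals the Harlem row of the rent table; `NAME_ALIASES` and `lookup_rent` are hypothetical names, not part of the notebook.

```python
# Rent strings copied from the scraped table shown earlier.
rents = {"Washington Heights": "$2,170", "East Harlem": "$2,528", "Harlem": "$2,783"}

# Hypothetical alias table: name in the geographic data -> name in the rent table.
NAME_ALIASES = {"Central Harlem": "Harlem"}

def lookup_rent(neighborhood, rents, aliases):
    """Return the rent string for a neighborhood, trying its alias if needed."""
    if neighborhood in rents:
        return rents[neighborhood]
    # Falls back to the alias; yields None when neither name is present.
    return rents.get(aliases.get(neighborhood))

print(lookup_rent("Central Harlem", rents, NAME_ALIASES))
print(lookup_rent("Marble Hill", rents, NAME_ALIASES))
```

Keeping the aliases in one dictionary makes each substitution traceable to a named source row, which positional `.loc` assignments do not.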


In [22]:
#Converting rent to a numeric to make for easier comparisons
#regex=False treats '$' as a literal character rather than a regular-expression anchor
df_manhattan['Average Rent'] = df_manhattan['Average Rent'].str.replace('$', '', regex=False)
df_manhattan['Average Rent'] = df_manhattan['Average Rent'].str.replace(',', '', regex=False)
df_manhattan['Average Rent'] = pd.to_numeric(df_manhattan['Average Rent'])
df_manhattan.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Average Rent
0,Chinatown,Manhattan,40.715618,-73.994279,4864
1,Washington Heights,Manhattan,40.851903,-73.9369,2170
2,Manhattanville,Manhattan,40.816934,-73.957385,4553
3,Central Harlem,Manhattan,40.815976,-73.943211,2783
4,East Harlem,Manhattan,40.792249,-73.944182,2528
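One detail of the conversion deserves a short illustration: `$` is a regular-expression anchor, so whether a replacement treats it literally matters, and the default handling in pandas' `str.replace` has differed between versions. The sketch below uses only the standard library to show the difference; `rent_to_int` is a hypothetical helper, not a function from the notebook.

```python
import re

def rent_to_int(text):
    """Convert a string such as '$4,864' to the integer 4864."""
    # Plain str.replace always treats its argument literally.
    return int(text.replace("$", "").replace(",", ""))

# As a regular expression, '$' matches the end of the string, not the character,
# so this substitution leaves the dollar sign in place.
regex_result = re.sub("$", "", "$4,864")

print(rent_to_int("$4,864"))
print(regex_result)
```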


In [23]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium
from geopy.geocoders import Nominatim

In [24]:
address = 'Manhattan, NY'

#Recent geopy versions require a user_agent; the name used here is an arbitrary placeholder
geolocator = Nominatim(user_agent="manhattan_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


In [25]:
#Illustrative map
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(df_manhattan['Latitude'], df_manhattan['Longitude'], df_manhattan['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

##### Adding in Foursquare Data

In [26]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = ('https://api.foursquare.com/v2/venues/explore'
               '?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit=50').format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
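The query string for the explore request can also be assembled from a dictionary, which keeps each parameter visible and escapes the values automatically. This is a minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper and the credentials are placeholders, so nothing is actually sent to Foursquare.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=50):
    """Assemble the explore URL; urlencode escapes the parameter values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 40.715618, -73.994279)
print(url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is the standard encoding of that character in a query string.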

In [28]:
manhattan_venues = getNearbyVenues(names=df_manhattan['Neighborhood'],
                                  latitudes=df_manhattan['Latitude'],
                                  longitudes=df_manhattan['Longitude']
                                  )

Chinatown
Washington Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Hudson Yards


In [29]:
print(manhattan_venues.shape)
manhattan_venues.head()

(1719, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Chinatown,40.715618,-73.994279,SKY TING YOGA,40.716469,-73.99502,Yoga Studio
1,Chinatown,40.715618,-73.994279,Spicy Village 大福星 (Spicy Village),40.71701,-73.99353,Chinese Restaurant
2,Chinatown,40.715618,-73.994279,Mission Escape Games,40.716505,-73.99472,General Entertainment
3,Chinatown,40.715618,-73.994279,Bar Belly,40.715135,-73.991802,Cocktail Bar
4,Chinatown,40.715618,-73.994279,Kiki's,40.714476,-73.992036,Greek Restaurant


In [30]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,50,50,50,50,50,50
Carnegie Hill,50,50,50,50,50,50
Central Harlem,43,43,43,43,43,43
Chelsea,50,50,50,50,50,50
Chinatown,50,50,50,50,50,50
Civic Center,50,50,50,50,50,50
Clinton,50,50,50,50,50,50
East Harlem,43,43,43,43,43,43
East Village,50,50,50,50,50,50
Financial District,50,50,50,50,50,50


In [31]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 268 unique categories.
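With this many categories, one option raised in the Discussion is to combine related categories before clustering. The sketch below shows the idea on a handful of category names that appear in the venue data; the grouping itself (`CATEGORY_GROUPS`) and the helper `merge_categories` are hypothetical and would need to be chosen per industry.

```python
from collections import Counter

# Hypothetical grouping; the keys are category names that appear in the venue data.
CATEGORY_GROUPS = {
    "Gym": "Fitness",
    "Gym / Fitness Center": "Fitness",
    "Yoga Studio": "Fitness",
    "Coffee Shop": "Cafe",
    "Café": "Cafe",
}

def merge_categories(categories, groups):
    """Map each category to its broader group, leaving unmapped ones unchanged."""
    return [groups.get(category, category) for category in categories]

sample = ["Gym", "Yoga Studio", "Café", "Coffee Shop", "Chinese Restaurant"]
merged = merge_categories(sample, CATEGORY_GROUPS)
print(Counter(merged))
```

Applied to the `Venue Category` column before one-hot encoding, a mapping like this would shrink the number of feature columns accordingly.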


In [32]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Volleyball Court,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Chinatown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,Chinatown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Chinatown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Chinatown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Chinatown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Volleyball Court,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.04
2,Central Harlem,0.0,0.0,0.069767,0.046512,0.0,0.0,0.0,0.0,0.023256,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0
4,Chinatown,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
5,Civic Center,0.0,0.0,0.0,0.04,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02
6,Clinton,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.04,0.0,0.02,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.06,0.04,0.0,0.0,0.0
9,Financial District,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
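Taking the mean of the one-hot columns per neighborhood amounts to computing each category's share of that neighborhood's venues. A minimal sketch of the same calculation with the standard library, on a toy list of categories (the input and the helper name `category_frequencies` are illustrative assumptions):

```python
from collections import Counter

def category_frequencies(venue_categories):
    """Return each category's share of a neighborhood's venues."""
    counts = Counter(venue_categories)
    total = len(venue_categories)
    return {category: count / total for category, count in counts.items()}

# Toy input: four venues in one hypothetical neighborhood.
freqs = category_frequencies(["Bar", "Bar", "Park", "Yoga Studio"])
print(freqs)
```

This also explains the less round values in the table: Central Harlem returned 43 venues, so its frequencies are multiples of 1/43 (for example 3/43 ≈ 0.069767 for African Restaurant).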


In [34]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    #print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    #print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    #print('\n')
#Removed outputs to shorten the notebook on GitHub.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Coffee Shop,Sandwich Place,Fountain,Plaza,Cupcake Shop,BBQ Joint,Department Store,Performing Arts Venue,Building
1,Carnegie Hill,Spa,Bookstore,Pizza Place,Yoga Studio,Bakery,Café,Coffee Shop,French Restaurant,Gym,Gym / Fitness Center
2,Central Harlem,African Restaurant,Seafood Restaurant,Pizza Place,American Restaurant,Chinese Restaurant,French Restaurant,Gym / Fitness Center,Cosmetics Shop,Public Art,Market
3,Chelsea,Ice Cream Shop,Hotel,Nightclub,Asian Restaurant,Coffee Shop,Seafood Restaurant,Theater,Bakery,Italian Restaurant,Pizza Place
4,Chinatown,Chinese Restaurant,Ice Cream Shop,American Restaurant,Sandwich Place,Bar,Salon / Barbershop,Cocktail Bar,Vietnamese Restaurant,Noodle House,New American Restaurant
5,Civic Center,Gym / Fitness Center,Bakery,Cocktail Bar,Italian Restaurant,Coffee Shop,Sandwich Place,Park,Gym,Sushi Restaurant,French Restaurant
6,Clinton,Theater,American Restaurant,Gym / Fitness Center,Hotel,Wine Shop,Gym,Lounge,Pizza Place,Music School,Mediterranean Restaurant
7,East Harlem,Mexican Restaurant,Bakery,Deli / Bodega,Latin American Restaurant,Thai Restaurant,Pharmacy,Cuban Restaurant,Street Art,Steakhouse,Dance Studio
8,East Village,Bar,Ice Cream Shop,Wine Bar,Coffee Shop,Wine Shop,American Restaurant,Vietnamese Restaurant,Pizza Place,Vegetarian / Vegan Restaurant,Speakeasy
9,Financial District,Coffee Shop,Steakhouse,Pizza Place,Hotel,Jewelry Store,Gym / Fitness Center,Gym,Event Space,Monument / Landmark,Accessories Store


In [37]:
#Adding the rent column back in:
manhattan_grouped = manhattan_grouped.set_index('Neighborhood').join(df_manhattan.set_index('Neighborhood'))
manhattan_grouped.drop(['Borough', 'Latitude', 'Longitude'], axis = 1, inplace = True)
manhattan_grouped.reset_index(inplace = True)
manhattan_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Volleyball Court,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Average Rent
0,Battery Park City,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,5363
1,Carnegie Hill,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.04,4130
2,Central Harlem,0.0,0.0,0.069767,0.046512,0.0,0.0,0.0,0.0,0.023256,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2783
3,Chelsea,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,4181
4,Chinatown,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,4864


##### K-means clustering

In [38]:
# import k-means for clustering stage
from sklearn.cluster import KMeans

In [56]:
# set number of clusters
kclusters = 6

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 4, 1, 4, 0, 0, 2, 1, 4, 2], dtype=int32)
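One caveat about the features is worth making concrete. Average Rent is measured in thousands while the venue frequencies lie between 0 and 1, so the squared Euclidean distances that k-means minimizes will be dominated by rent unless the columns are rescaled. The sketch below standardizes two columns taken from the first five rows of `manhattan_grouped` shown earlier; the helper `standardize` is a hypothetical name, and z-scoring is only one of several reasonable scalings.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]

# American Restaurant frequency and Average Rent for the five rows shown earlier.
american = [0.02, 0.02, 0.046512, 0.02, 0.04]
rent = [5363, 4130, 2783, 4181, 4864]

# Squared difference between the first two rows, per feature, before scaling.
raw_gap = ((american[0] - american[1]) ** 2, (rent[0] - rent[1]) ** 2)

american_z, rent_z = standardize(american), standardize(rent)
print(raw_gap)
print(max(map(abs, american_z)), max(map(abs, rent_z)))
```

After standardization both columns vary on a comparable scale. In the notebook itself, scikit-learn's `StandardScaler` could play the same role before the call to `KMeans`.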

In [57]:
manhattan_merged = df_manhattan

# add clustering labels
# (note: kmeans.labels_ follows the row order of manhattan_grouped, so this positional
# assignment assumes df_manhattan lists the neighborhoods in that same order)
manhattan_merged['Cluster Labels'] = kmeans.labels_

# join the top-venue columns onto manhattan_merged by neighborhood name
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Average Rent,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Chinatown,Manhattan,40.715618,-73.994279,4864,3,Chinese Restaurant,Ice Cream Shop,American Restaurant,Sandwich Place,Bar,Salon / Barbershop,Cocktail Bar,Vietnamese Restaurant,Noodle House,New American Restaurant
1,Washington Heights,Manhattan,40.851903,-73.9369,2170,4,Café,Caribbean Restaurant,Mobile Phone Shop,Park,Deli / Bodega,Tapas Restaurant,Chinese Restaurant,Latin American Restaurant,Bakery,Wine Shop
2,Manhattanville,Manhattan,40.816934,-73.957385,4553,1,Deli / Bodega,Italian Restaurant,Sushi Restaurant,Mexican Restaurant,Seafood Restaurant,Music School,Lounge,Supermarket,Burger Joint,Spanish Restaurant
3,Central Harlem,Manhattan,40.815976,-73.943211,2783,4,African Restaurant,Seafood Restaurant,Pizza Place,American Restaurant,Chinese Restaurant,French Restaurant,Gym / Fitness Center,Cosmetics Shop,Public Art,Market
4,East Harlem,Manhattan,40.792249,-73.944182,2528,0,Mexican Restaurant,Bakery,Deli / Bodega,Latin American Restaurant,Thai Restaurant,Pharmacy,Cuban Restaurant,Street Art,Steakhouse,Dance Studio
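Because `kmeans.labels_` follows the row order of the frame passed to `fit`, attaching labels by neighborhood name rather than by position is the safer pattern whenever two frames might be ordered differently. The sketch below illustrates this with plain dictionaries, pairing the first five label values printed earlier with the first five rows of `manhattan_grouped`; it assumes the rows were clustered in exactly that order.

```python
# Rows in the order they were clustered (first five rows of manhattan_grouped).
clustered_order = ["Battery Park City", "Carnegie Hill", "Central Harlem", "Chelsea", "Chinatown"]
labels = [3, 4, 1, 4, 0]  # first five entries of kmeans.labels_ printed earlier

# Pair each label with the name of the row it was computed for.
label_by_name = dict(zip(clustered_order, labels))

# A frame in a different order can then look its labels up by name.
other_order = ["Chinatown", "Central Harlem"]
aligned = [label_by_name[name] for name in other_order]
print(aligned)
```

Under that assumption, Chinatown would carry label 0 and Central Harlem label 1, which differs from the positional assignment in the table above, so the alignment is worth checking before relying on the recommendations.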


In [58]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

##### Defining the appropriate function

In [50]:
import math

In [51]:
#Prereq functions to determine the distance between two coordinates
def getDistance(lat1,lon1,lat2,lon2):
    R = 6371 #Radius of the earth in km
    dLat = deg2rad(lat2-lat1)  #deg2rad below
    dLon = deg2rad(lon2-lon1)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2)) * math.sin(dLon/2) * math.sin(dLon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c #Distance in km
    return d

def deg2rad(deg):
    return deg * (math.pi/180)
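As a quick sanity check on the distance calculation, the haversine formula can be applied to two neighborhood centroids from the data. This is a self-contained sketch; `haversine_km` is a hypothetical equivalent of `getDistance` written with `math.radians` and the arcsine form of the formula.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Chinatown and Yorkville coordinates from the neighborhood table.
chinatown = (40.715618, -73.994279)
yorkville = (40.77593, -73.947118)
print(round(haversine_km(*chinatown, *yorkville), 2))
```

The result should come out a little under 8 km, which is plausible for two points at opposite ends of the east side of Manhattan.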

In [62]:
def neighborhood_recommender(name, lat, lng):
    cluster_label = manhattan_merged.loc[manhattan_merged['Neighborhood'] == name, 'Cluster Labels'].iloc[0] #returns a single numeric value
    distance = float('inf') #any real distance will be smaller
    best_index = 0
    for i in range(len(manhattan_merged)): #assumes the default 0..n-1 index
        if manhattan_merged.loc[i, 'Neighborhood'] != name:
            if manhattan_merged.loc[i, 'Cluster Labels'] == cluster_label:
                new_lat = manhattan_merged.loc[i, 'Latitude']
                new_lng = manhattan_merged.loc[i, 'Longitude']
                new_distance = getDistance(lat, lng, new_lat, new_lng)
                if new_distance < distance:
                    distance = new_distance
                    best_index = i
    recommended_name = manhattan_merged.loc[best_index, 'Neighborhood']
    recommended_distance = distance
    recommended_rent = manhattan_merged.loc[best_index, 'Average Rent']
    output_string = "The recommended neighborhood is "+str(recommended_name)+", it is "+str(recommended_distance)+" kilometers from the desired location, and the average rent is "+str(recommended_rent)+"."
    print(output_string)


The function is all set and ready to be called!

In [63]:
neighborhood_recommender('Yorkville', 40.5, -74 )
    

The recommended neighborhood is Little Italy, it is 24.388754508902256 kilometers from the desired location, and the average rent is 5690.


Within a terminal setting, we could ask for inputs from the user and pass them as parameters to the function. We could also adjust the function to take a target neighborhood as input instead of latitude and longitude, then extract the geographic coordinates.
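The variant that takes a target neighborhood instead of coordinates can be sketched as follows. The rows are copied from the first rows of `manhattan_merged` shown earlier, including their cluster labels as printed; `distance_km` and `recommend_near` are hypothetical names, and returning `None` when the cluster has no other member is a design choice of this sketch rather than the notebook's behavior.

```python
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# (name, latitude, longitude, cluster label) from the first rows of manhattan_merged.
ROWS = [
    ("Chinatown", 40.715618, -73.994279, 3),
    ("Washington Heights", 40.851903, -73.9369, 4),
    ("Manhattanville", 40.816934, -73.957385, 1),
    ("Central Harlem", 40.815976, -73.943211, 4),
    ("East Harlem", 40.792249, -73.944182, 0),
]

def recommend_near(current, target, rows):
    """Nearest neighborhood to `target` sharing `current`'s cluster, or None."""
    by_name = {name: (lat, lng, label) for name, lat, lng, label in rows}
    t_lat, t_lng, _ = by_name[target]  # extract the target's coordinates by name
    cluster = by_name[current][2]
    candidates = [
        (distance_km(t_lat, t_lng, lat, lng), name)
        for name, lat, lng, label in rows
        if label == cluster and name != current
    ]
    return min(candidates)[1] if candidates else None

print(recommend_near("Washington Heights", "Chinatown", ROWS))
```

With only five rows the answer is necessarily Central Harlem, the sole other member of that cluster; on the full table the same function would choose among all cluster members.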

We could also produce a recommender that prioritizes closeness of rent cost instead if geographic location is not a priority.

In [64]:
def neighborhood_recommender_rent(name):
    cluster_label = manhattan_merged.loc[manhattan_merged['Neighborhood'] == name, 'Cluster Labels'].iloc[0] #returns a single numeric value
    current_rent = manhattan_merged.loc[manhattan_merged['Neighborhood'] == name, 'Average Rent'].iloc[0]
    rent_diff = float('inf') #any real difference will be smaller
    best_index = 0
    for i in range(len(manhattan_merged)): #assumes the default 0..n-1 index
        if manhattan_merged.loc[i, 'Neighborhood'] != name:
            if manhattan_merged.loc[i, 'Cluster Labels'] == cluster_label:
                new_rent = manhattan_merged.loc[i, 'Average Rent']
                new_rent_diff = abs(new_rent - current_rent)
                if new_rent_diff < rent_diff:
                    rent_diff = new_rent_diff
                    best_index = i
    recommended_name = manhattan_merged.loc[best_index, 'Neighborhood']
    recommended_rent = manhattan_merged.loc[best_index, 'Average Rent']
    output_string = "The recommended neighborhood is "+str(recommended_name)+", and the average rent is "+str(recommended_rent)+", which is a difference of "\
    +str(rent_diff)+" from your current average rent of "+str(current_rent)+"."
    print(output_string)

In [65]:
neighborhood_recommender_rent('Yorkville')

The recommended neighborhood is Carnegie Hill, and the average rent is 4130, which is a difference of 0 from your current average rent of 4130.


A whole variety of tweaks could be made to both the function and clustering algorithm depending on the needs of the moment, and an extension to other cities would also be possible. Thanks for reading!