# Housing Sales Prices and Venues Data Analysis of Boston Neighborhoods #
_by Nataliia Romanenko_

This is the Capstone Project notebook for the IBM Applied Data Science Specialization on Coursera. In this project, we analyze venue data for each Boston neighborhood, cluster the neighborhoods by their most common venues, and combine this information with the median condo price in each neighborhood, so that people with different preferences have a fuller picture when choosing where to live in Boston.

In [1]:
import os
import time
import pandas as pd
import numpy as np
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans # import k-means from clustering stage
import folium # map rendering library
from bs4 import BeautifulSoup  # library for web scraping

## 1. Data Wrangling
### Getting Boston Neighborhoods data

In [2]:
with open('Boston_Neighborhoods.geojson') as json_data:
    boston_data = json.load(json_data)

Let's take a look at the data

In [3]:
# number of neighborhoods described in the file
len(boston_data['features'])

26

In [4]:
# let's take a look at the first neighborhood
boston_data['features'][0]

{'type': 'Feature',
 'properties': {'OBJECTID': 27,
  'Name': 'Roslindale',
  'Acres': 1605.5682375,
  'Neighborhood_ID': '15',
  'SqMiles': 2.51,
  'ShapeSTArea': 69938272.92557049,
  'ShapeSTLength': 53563.912597056624},
 'geometry': {'type': 'MultiPolygon',
  'coordinates': [[[[-71.12592717485386, 42.272013107957406],
     [-71.12610933458738, 42.2716219294518],
     [-71.12603188298199, 42.27158985153841],
     [-71.12571713956957, 42.27152070474045],
     [-71.12559042372907, 42.27146017841939],
     [-71.12523676125656, 42.271387313901805],
     [-71.12522437821433, 42.271425073651166],
     [-71.12489533053173, 42.27134458090032],
     [-71.12482468090687, 42.271318140479686],
     [-71.12485155056099, 42.27124753819149],
     [-71.12476329046935, 42.270292339717635],
     [-71.12470249712558, 42.270295367758344],
     [-71.12259088359436, 42.2700534081311],
     [-71.1223931813923, 42.27003085475475],
     [-71.12252039300371, 42.269427196690025],
     [-71.12214745279846, 42.2

From the GeoJSON file we can extract the neighborhood names and areas (the areas help us decide on the radius for venue retrieval).

In [5]:
neighborhoods = []
areas = []
for hood in boston_data['features']:
    neighborhoods.append(hood['properties']['Name'])
    areas.append(hood['properties']['SqMiles'])

# create a dataframe for neighborhoods data
boston = pd.DataFrame({'Neighborhood': neighborhoods, 'Area': areas})
boston

Unnamed: 0,Neighborhood,Area
0,Roslindale,2.51
1,Jamaica Plain,3.94
2,Mission Hill,0.55
3,Longwood,0.29
4,Bay Village,0.04
5,Leather District,0.02
6,Chinatown,0.12
7,North End,0.2
8,Roxbury,3.29
9,South End,0.74
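
To relate the `Area` column to a search radius, one rough heuristic is the radius of a circle with the same area as the neighborhood. This is only an illustrative sketch: the helper name `equivalent_radius_m` is hypothetical, and the notebook itself uses a simpler two-level rule (500 m or 1000 m) later on.

```python
import math

SQ_METERS_PER_SQ_MILE = 1609.344 ** 2  # one square mile in square meters

def equivalent_radius_m(area_sq_miles):
    """Radius (in meters) of a circle whose area equals the given area."""
    return math.sqrt(area_sq_miles * SQ_METERS_PER_SQ_MILE / math.pi)

# areas taken from the table above
for name, area in [('Roslindale', 2.51), ('Leather District', 0.02)]:
    print(name, round(equivalent_radius_m(area)), 'm')
```

Real neighborhoods are far from circular, so this gives an order of magnitude rather than an exact search radius.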


In [6]:
# let's add geo coordinates of neighborhood centers
geolocator = Nominatim(user_agent="boston_explorer")  # create the geocoder once and reuse it
latitudes = []
longitudes = []
for hood in neighborhoods:
    print(hood)
    # Leather District coordinates are not available through the geolocator, so we add them manually
    if hood == 'Leather District':
        latitudes.append(42.351049)
        longitudes.append(-71.057969)
        continue
    address = hood + ' Boston, Massachusetts'
    location = geolocator.geocode(address)
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)
    time.sleep(3)  # pause between requests to respect the geocoder's rate limit

Roslindale
Jamaica Plain
Mission Hill
Longwood
Bay Village
Leather District
Chinatown
North End
Roxbury
South End
Back Bay
East Boston
Charlestown
West End
Beacon Hill
Downtown
Fenway
Brighton
West Roxbury
Hyde Park
Mattapan
Dorchester
South Boston Waterfront
South Boston
Allston
Harbor Islands
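
The geocoder returns `None` when it cannot resolve an address, which is why Leather District needs manual coordinates. A minimal sketch of a more defensive lookup is shown below. The helper `resolve_coordinates`, the `manual_overrides` dictionary, and the stub geocoder are all hypothetical names introduced for illustration; in the notebook, `geolocator.geocode` would take the place of the stub.

```python
from types import SimpleNamespace

def resolve_coordinates(name, geocode, manual_overrides):
    """Return (latitude, longitude), preferring manual overrides over the geocoder."""
    if name in manual_overrides:
        return manual_overrides[name]
    location = geocode(name + ' Boston, Massachusetts')
    if location is None:
        # fail loudly instead of raising AttributeError on location.latitude
        raise ValueError('No geocoding result for ' + name)
    return (location.latitude, location.longitude)

def stub_geocode(address):
    """Stand-in for geolocator.geocode; returns None for unknown addresses."""
    known = {
        'Roslindale Boston, Massachusetts':
            SimpleNamespace(latitude=42.291209, longitude=-71.124497),
    }
    return known.get(address)

manual_overrides = {'Leather District': (42.351049, -71.057969)}
print(resolve_coordinates('Roslindale', stub_geocode, manual_overrides))
print(resolve_coordinates('Leather District', stub_geocode, manual_overrides))
```

Keeping the overrides in a dictionary makes it easy to add further neighborhoods if the geocoder misses them.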


In [7]:
boston['Latitude'] = latitudes
boston['Longitude'] = longitudes
boston

Unnamed: 0,Neighborhood,Area,Latitude,Longitude
0,Roslindale,2.51,42.291209,-71.124497
1,Jamaica Plain,3.94,42.30982,-71.12033
2,Mission Hill,0.55,42.332926,-71.103214
3,Longwood,0.29,42.336168,-71.099527
4,Bay Village,0.04,42.350011,-71.066948
5,Leather District,0.02,42.351049,-71.057969
6,Chinatown,0.12,42.352217,-71.062607
7,North End,0.2,42.365097,-71.054495
8,Roxbury,3.29,42.324843,-71.095016
9,South End,0.74,42.34131,-71.07723


### Let's visualize the neighborhoods

In [8]:
address = 'Boston, Massachusetts'

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Boston are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boston are 42.3602534, -71.0582912.


#### Create a map of Boston with neighborhoods superimposed on top.

In [9]:
# create map of Boston using latitude and longitude values
boston_map = folium.Map(location=[latitude, longitude], zoom_start=11)
geo_json = r'Boston_Neighborhoods.geojson'

# add markers to map
for lat, lng, hood in zip(boston['Latitude'], boston['Longitude'], boston['Neighborhood']):
    label = folium.Popup(hood, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(boston_map)

# display the map
boston_map

### Getting housing prices

In [10]:
headers = {"user-agent": "Webscraper for IBM Data Science Capstone"}
page = requests.get("https://www.bostonmagazine.com/top-places-to-live-2018-condos/", headers = headers)

# stop here with an HTTPError if the request was not successful
page.raise_for_status()
    
# Parse page using BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [11]:
# print scraped page title
print(soup.title.text)

Condo Prices in Boston 2018 - Boston Magazine


Inspecting the page shows that the information we need is inside the table element with id 'tablepress-151'.

In [12]:
# get table data
table = soup.find("table", {"id":"tablepress-151"})
# print first row of the table (column headers)
print(table.find("tr"))

<tr class="row-1 odd">
<th class="column-1">Boston Neighborhoods</th><th class="column-2">Median Price: 2017</th><th class="column-3">Median Price: 2016</th><th class="column-4">Median Price: 2012</th><th class="column-5">Median Price: 2007</th><th class="column-6">Percent Change in Price: One-Year</th><th class="column-7">Percent Change in Price: Five-Year</th><th class="column-8">Percent Change in Price: Ten-Year</th><th class="column-9">Days on Market: 2017</th>
</tr>


In [13]:
# create dataframe for Boston House Prices
df = pd.DataFrame(columns=["Neighborhood", "2017 Median Price"])
# get all table rows
trs = table.find_all("tr")
# process data
for i,tr in enumerate(trs[1:]):
    entry = tr.find_all("td")
    df.loc[i] = entry[0].text.strip(), entry[1].text.strip()

df

Unnamed: 0,Neighborhood,2017 Median Price
0,Allston,"$480,000"
1,Back Bay,"$1,100,000"
2,Bay Village/South End,"$615,000"
3,Beacon Hill,"$952,500"
4,Brighton,"$430,000"
5,Charlestown,"$690,000"
6,Chinatown/Leather Dist.,"$850,000"
7,Dorchester,"$429,950"
8,East Boston,"$454,500"
9,Fenway,"$571,000"


Some entries combine two neighborhoods (for example, "Bay Village/South End"). Let's split them so that each entry holds exactly one neighborhood, and convert the prices to integer values.

In [14]:
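extra_rows = []  # second neighborhoods split out of combined entries
for i, hood in enumerate(df['Neighborhood']):
    # "$480,000" -> 480000
    df.iloc[i, 1] = int(df.iloc[i, 1].lstrip('$').replace(',', ''))
    if '/' in hood:
        hoods = hood.split('/')
        df.iloc[i, 0] = hoods[0]
        extra_rows.append({'Neighborhood': hoods[1], '2017 Median Price': df.iloc[i, 1]})
# DataFrame.append was removed in pandas 2.0, so add the new rows with pd.concat
df = pd.concat([df, pd.DataFrame(extra_rows)], ignore_index=True)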
df

Unnamed: 0,Neighborhood,2017 Median Price
0,Allston,480000
1,Back Bay,1100000
2,Bay Village,615000
3,Beacon Hill,952500
4,Brighton,430000
5,Charlestown,690000
6,Chinatown,850000
7,Dorchester,429950
8,East Boston,454500
9,Fenway,571000
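
The same cleanup can be sketched as two small standalone functions, which makes the parsing rules easier to check in isolation. The names `parse_price` and `split_entry` are hypothetical; the sample values come from the scraped table above.

```python
def parse_price(text):
    """Convert a price string such as '$615,000' to an integer."""
    return int(text.lstrip('$').replace(',', ''))

def split_entry(name, price_text):
    """Return one (neighborhood, price) pair per name in a 'A/B' entry."""
    price = parse_price(price_text)
    return [(part.strip(), price) for part in name.split('/')]

rows = [('Allston', '$480,000'), ('Bay Village/South End', '$615,000')]
cleaned = [pair for name, price in rows for pair in split_entry(name, price)]
print(cleaned)
```

Unlike the notebook cell, this version keeps the split-out neighborhood next to its original row instead of appending it at the end.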


In [15]:
# Let's rename 'Leather Dist.' to 'Leather District' so that the dataframes can be joined
df['Neighborhood'] = df['Neighborhood'].replace('Leather Dist.', 'Leather District')
df

Unnamed: 0,Neighborhood,2017 Median Price
0,Allston,480000
1,Back Bay,1100000
2,Bay Village,615000
3,Beacon Hill,952500
4,Brighton,430000
5,Charlestown,690000
6,Chinatown,850000
7,Dorchester,429950
8,East Boston,454500
9,Fenway,571000


In [48]:
# let's join both dataframes on Neighborhood Name
df_boston = boston.join(df.set_index('Neighborhood'), on='Neighborhood')
df_boston

Unnamed: 0,Neighborhood,Area,Latitude,Longitude,2017 Median Price
0,Roslindale,2.51,42.291209,-71.124497,450000.0
1,Jamaica Plain,3.94,42.30982,-71.12033,534000.0
2,Mission Hill,0.55,42.332926,-71.103214,
3,Longwood,0.29,42.336168,-71.099527,
4,Bay Village,0.04,42.350011,-71.066948,615000.0
5,Leather District,0.02,42.351049,-71.057969,850000.0
6,Chinatown,0.12,42.352217,-71.062607,850000.0
7,North End,0.2,42.365097,-71.054495,570500.0
8,Roxbury,3.29,42.324843,-71.095016,338000.0
9,South End,0.74,42.34131,-71.07723,615000.0


We don't have price data for 5 neighborhoods. For the Downtown and South Boston Waterfront neighborhoods, the median condo price is available in __[the Elliman report](https://www.elliman.com/pdf/cf8dc6172df72a7b8c6817c3feb145a8b3e4ce6c)__, so we add those values manually. For the other neighborhoods (Mission Hill, Longwood, Harbor Islands), the data are not easily accessible, so we leave the median price as a missing value but still perform the venue analysis.

In [49]:
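# median price for Q1-2018, closest to 2017 median price
# select rows by neighborhood name rather than by hard-coded position
df_boston.loc[df_boston['Neighborhood'] == 'Downtown', '2017 Median Price'] = 940000
df_boston.loc[df_boston['Neighborhood'] == 'South Boston Waterfront', '2017 Median Price'] = 857500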
df_boston

Unnamed: 0,Neighborhood,Area,Latitude,Longitude,2017 Median Price
0,Roslindale,2.51,42.291209,-71.124497,450000.0
1,Jamaica Plain,3.94,42.30982,-71.12033,534000.0
2,Mission Hill,0.55,42.332926,-71.103214,
3,Longwood,0.29,42.336168,-71.099527,
4,Bay Village,0.04,42.350011,-71.066948,615000.0
5,Leather District,0.02,42.351049,-71.057969,850000.0
6,Chinatown,0.12,42.352217,-71.062607,850000.0
7,North End,0.2,42.365097,-71.054495,570500.0
8,Roxbury,3.29,42.324843,-71.095016,338000.0
9,South End,0.74,42.34131,-71.07723,615000.0


In [50]:
# save to csv file 
df_boston.to_csv('boston.csv', index=False)

#### Let's visualize the price data 

In [51]:
boston_map = folium.Map(location=[latitude, longitude], zoom_start=11)
# add price data to the map
folium.Choropleth(
    geo_data=geo_json,
    data=df_boston,
    columns=['Neighborhood', '2017 Median Price'],
    key_on='feature.properties.Name',
    fill_color='YlOrRd', 
    fill_opacity=0.6,
    nan_fill_color='gray',
    nan_fill_opacity=0.2,
    line_opacity=0.6,
    legend_name='Condo Median Prices in Boston 2017',
    highlight=True
).add_to(boston_map)

# display map
boston_map

In [52]:
# save map
boston_map.save(os.path.join('', 'Condo Median Prices in Boston 2017.html'))

## 2. Explore Boston neighborhoods using Foursquare API

#### Define Foursquare Credentials and Version
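
The cell that defines the credentials is not included in this notebook, yet `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` are used below. A minimal sketch of one way to define them follows. The environment variable names and the version date are assumptions for illustration, not values from the original project.

```python
import os

# Assumption: credentials live in environment variables with these
# hypothetical names, so they are never stored in the notebook itself.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# Assumption: placeholder API version date in YYYYMMDD form.
VERSION = '20180605'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; API requests will fail.')
```

Reading secrets from the environment keeps them out of version control when the notebook is shared.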

In [54]:
# function to get nearby venues for all the neighborhoods in Boston
def getNearbyVenues(names, latitudes, longitudes, areas):
    
    LIMIT = 100 # limit of number of venues returned by Foursquare API
    venues_list=[]
    for name, lat, lng, area in zip(names, latitudes, longitudes, areas):
        # default
        radius = 500
        if area > 1:  # radius for large neighborhoods
            radius = 1000
        print(name, 'radius:', radius)
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
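
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue with an empty category list. The sketch below shows a more defensive version of the extraction step on a response-shaped dictionary. The helper name `extract_venue_rows` is hypothetical, the first sample item reuses a venue from this notebook's results, and the second item is made up to exercise the empty-category case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

sample_items = [
    {'venue': {'name': 'Peters Hill',
               'location': {'lat': 42.293617, 'lng': -71.128063},
               'categories': [{'name': 'Scenic Lookout'}]}},
    # hypothetical venue with no category information
    {'venue': {'name': 'Uncategorized Venue',
               'location': {'lat': 42.29, 'lng': -71.12},
               'categories': []}},
]
for row in extract_venue_rows('Roslindale', 42.291209, -71.124497, sample_items):
    print(row)
```

Rows with a `None` category can then be dropped or labeled before one-hot encoding.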

In [55]:
boston_venues = getNearbyVenues(names=df_boston['Neighborhood'],
                                   latitudes=df_boston['Latitude'],
                                   longitudes=df_boston['Longitude'],
                                   areas = df_boston['Area']
                                  )

Roslindale radius: 1000
Jamaica Plain radius: 1000
Mission Hill radius: 500
Longwood radius: 500
Bay Village radius: 500
Leather District radius: 500
Chinatown radius: 500
North End radius: 500
Roxbury radius: 1000
South End radius: 500
Back Bay radius: 500
East Boston radius: 1000
Charlestown radius: 1000
West End radius: 500
Beacon Hill radius: 500
Downtown radius: 500
Fenway radius: 500
Brighton radius: 1000
West Roxbury radius: 1000
Hyde Park radius: 1000
Mattapan radius: 1000
Dorchester radius: 1000
South Boston Waterfront radius: 500
South Boston radius: 1000
Allston radius: 1000
Harbor Islands radius: 1000


In [56]:
print(boston_venues.shape)
boston_venues.head()

(1529, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Roslindale,42.291209,-71.124497,Peters Hill,42.293617,-71.128063,Scenic Lookout
1,Roslindale,42.291209,-71.124497,Roslindale House Of Pizza,42.287989,-71.126549,Pizza Place
2,Roslindale,42.291209,-71.124497,Delfino’s,42.287106,-71.12947,Italian Restaurant
3,Roslindale,42.291209,-71.124497,Roslindale Village Farmers Market,42.286534,-71.128509,Farmers Market
4,Roslindale,42.291209,-71.124497,Fornax Bread Company,42.286171,-71.12976,Bakery


In [57]:
# check how many venues were returned for each neighborhood
boston_venues.groupby('Neighborhood').count()[['Venue']]

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Allston,82
Back Bay,100
Bay Village,95
Beacon Hill,64
Brighton,81
Charlestown,66
Chinatown,100
Dorchester,20
Downtown,48
East Boston,42


In [58]:
# how many unique categories are in all the returned venues
print('There are {} unique categories.'.format(len(boston_venues['Venue Category'].unique())))

There are 225 unique categories.


We have neither venue data nor a median condo price for Harbor Islands, so we remove this neighborhood from our analysis.
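
One way to confirm which neighborhoods returned no venues is to compare the full list of names against the names present in the venue results. This is a small sketch on a hand-picked subset of the neighborhoods; the helper name `neighborhoods_without_venues` is hypothetical.

```python
def neighborhoods_without_venues(all_names, names_with_venues):
    """Return names that never appear in the venue results, in original order."""
    seen = set(names_with_venues)
    return [name for name in all_names if name not in seen]

# subset of neighborhoods, for illustration only
all_names = ['Dorchester', 'South Boston', 'Allston', 'Harbor Islands']
# one entry per returned venue, so names can repeat
names_with_venues = ['Dorchester', 'South Boston', 'Allston', 'Allston']

print(neighborhoods_without_venues(all_names, names_with_venues))
```

In the notebook, the two inputs would be `df_boston['Neighborhood']` and `boston_venues['Neighborhood']`.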

In [59]:
df_boston = df_boston.drop(index=25)  # row 25 is Harbor Islands
df_boston

Unnamed: 0,Neighborhood,Area,Latitude,Longitude,2017 Median Price
0,Roslindale,2.51,42.291209,-71.124497,450000.0
1,Jamaica Plain,3.94,42.30982,-71.12033,534000.0
2,Mission Hill,0.55,42.332926,-71.103214,
3,Longwood,0.29,42.336168,-71.099527,
4,Bay Village,0.04,42.350011,-71.066948,615000.0
5,Leather District,0.02,42.351049,-71.057969,850000.0
6,Chinatown,0.12,42.352217,-71.062607,850000.0
7,North End,0.2,42.365097,-71.054495,570500.0
8,Roxbury,3.29,42.324843,-71.095016,338000.0
9,South End,0.74,42.34131,-71.07723,615000.0


## 3. Analyze Each Neighborhood

In [60]:
# one hot encoding
boston_onehot = pd.get_dummies(boston_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
boston_onehot['Neighborhood'] = boston_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [boston_onehot.columns[-1]] + list(boston_onehot.columns[:-1])
boston_onehot = boston_onehot[fixed_columns]

boston_onehot.head()

Unnamed: 0,Yoga Studio,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
# check the size
boston_onehot.shape

(1529, 225)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [62]:
boston_grouped = boston_onehot.groupby('Neighborhood').mean().reset_index()
boston_grouped

Unnamed: 0,Neighborhood,Yoga Studio,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Art Gallery,Art Museum,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Allston,0.012195,0.0,0.0,0.012195,0.0,0.012195,0.0,0.0,0.0,...,0.0,0.0,0.012195,0.0,0.0,0.02439,0.0,0.0,0.0,0.0
1,Back Bay,0.01,0.0,0.02,0.0,0.0,0.05,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
2,Bay Village,0.010526,0.0,0.0,0.0,0.0,0.031579,0.0,0.0,0.0,...,0.0,0.0,0.031579,0.0,0.0,0.021053,0.0,0.010526,0.0,0.0
3,Beacon Hill,0.015625,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Brighton,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.0,0.0,...,0.012346,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,0.012346,0.0
5,Charlestown,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.015152,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0
6,Chinatown,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.02,0.01,0.0,0.0,0.0
7,Dorchester,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,...,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Downtown,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,East Boston,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0


In [63]:
# check the size
boston_grouped.shape

(25, 225)
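
Because each venue contributes a single 1 to its category column, the per-neighborhood mean is simply the share of that neighborhood's venues in each category. The sketch below reproduces this with plain counting; `category_frequencies` is a hypothetical helper, and the five categories are the Roslindale rows shown by `head()` earlier, not the full venue list.

```python
from collections import Counter

def category_frequencies(categories):
    """Share of venues per category, equivalent to the mean of one-hot columns."""
    counts = Counter(categories)
    total = len(categories)
    return {category: n / total for category, n in counts.items()}

sample = ['Scenic Lookout', 'Pizza Place', 'Italian Restaurant',
          'Farmers Market', 'Bakery']
print(category_frequencies(sample))

# Consistency check against the grouped table: Allston returned 82 venues,
# so one venue corresponds to 1/82 and two venues to 2/82.
allston_total = 82
print(round(1 / allston_total, 6), round(2 / allston_total, 6))
```

The last line matches the Allston values in the grouped table (0.012195 for Yoga Studio and 0.02439 for Vietnamese Restaurant).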

#### Let's print each neighborhood along with the top 7 most common venues

In [64]:
num_top_venues = 7

for hood in boston_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = boston_grouped[boston_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allston----
                  venue  freq
0           Pizza Place  0.06
1                Bakery  0.05
2     Korean Restaurant  0.05
3                   Bar  0.04
4  Gym / Fitness Center  0.04
5    Chinese Restaurant  0.04
6           Coffee Shop  0.02


----Back Bay----
                 venue  freq
0          Coffee Shop  0.05
1  American Restaurant  0.05
2       Clothing Store  0.04
3   Italian Restaurant  0.04
4                Hotel  0.04
5   Seafood Restaurant  0.04
6   Salon / Barbershop  0.03


----Bay Village----
                           venue  freq
0                 Sandwich Place  0.05
1                        Theater  0.04
2                          Hotel  0.04
3             Italian Restaurant  0.04
4  Vegetarian / Vegan Restaurant  0.03
5           Gym / Fitness Center  0.03
6                    Coffee Shop  0.03


----Beacon Hill----
                venue  freq
0         Pizza Place  0.06
1           Hotel Bar  0.06
2  Italian Restaurant  0.06
3           Gift Shop  0.

Let's organize these data into a dataframe. First, we write a function to sort the venues in descending order, then create the new dataframe and display the top 10 venues for each neighborhood.

In [65]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = boston_grouped['Neighborhood']

for ind in np.arange(boston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boston_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allston,Pizza Place,Bakery,Korean Restaurant,Chinese Restaurant,Gym / Fitness Center,Bar,Café,Sushi Restaurant,Diner,Coffee Shop
1,Back Bay,American Restaurant,Coffee Shop,Hotel,Italian Restaurant,Clothing Store,Seafood Restaurant,Salon / Barbershop,Spa,Sporting Goods Shop,Cosmetics Shop
2,Bay Village,Sandwich Place,Hotel,Italian Restaurant,Theater,Steakhouse,American Restaurant,Performing Arts Venue,Gym / Fitness Center,Bakery,Vegetarian / Vegan Restaurant
3,Beacon Hill,Hotel Bar,Pizza Place,Italian Restaurant,Gift Shop,American Restaurant,Plaza,Hotel,Sushi Restaurant,Coffee Shop,Gourmet Shop
4,Brighton,Pizza Place,Convenience Store,Café,Coffee Shop,Pub,Bakery,Donut Shop,Chinese Restaurant,Dry Cleaner,Greek Restaurant
5,Charlestown,Café,Park,Gastropub,Pizza Place,Gym,History Museum,Grocery Store,Pub,Donut Shop,Athletics & Sports
6,Chinatown,Chinese Restaurant,Asian Restaurant,Bakery,Sushi Restaurant,Theater,Coffee Shop,Sandwich Place,Pizza Place,Performing Arts Venue,Seafood Restaurant
7,Dorchester,Pharmacy,Pizza Place,Liquor Store,Golf Course,Park,Diner,Discount Store,Sandwich Place,Fast Food Restaurant,Metro Station
8,Downtown,Hotel Bar,Gift Shop,Italian Restaurant,Gourmet Shop,Pizza Place,Hotpot Restaurant,Liquor Store,Ice Cream Shop,Kids Store,Korean Restaurant
9,East Boston,Mexican Restaurant,Park,Italian Restaurant,Latin American Restaurant,Pizza Place,Seafood Restaurant,Café,Bar,Burrito Place,Fast Food Restaurant


## 4. Cluster Neighborhoods

Run k-means to group the neighborhoods into 5 clusters.

In [66]:
# set number of clusters
kclusters = 5

boston_grouped_clustering = boston_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boston_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 4, 2, 4, 1, 1, 2, 3, 4, 1, 1, 3, 2, 2, 1, 3, 1, 0, 3, 1, 1, 1,
       4, 1, 3])
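
To make explicit what the clustering step does with the frequency vectors, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic k-means iteration. It is not scikit-learn's implementation (which, among other things, uses k-means++ initialization by default), and `simple_kmeans` and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def simple_kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points to centroids and updating centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# toy "frequency vectors": two neighborhoods dominated by the first category,
# two dominated by the second
toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = simple_kmeans(toy, k=2)
print(labels)
```

As in the notebook's output, the label numbers themselves are arbitrary; only the grouping of rows carries meaning.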

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [67]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boston_merged = df_boston

# join df_boston with neighborhoods_venues_sorted to add cluster labels and top venues for each neighborhood
boston_merged = boston_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

boston_merged

Unnamed: 0,Neighborhood,Area,Latitude,Longitude,2017 Median Price,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Roslindale,2.51,42.291209,-71.124497,450000.0,3,Pizza Place,American Restaurant,Grocery Store,Bar,Plaza,Italian Restaurant,Sandwich Place,Liquor Store,Rental Car Location,Discount Store
1,Jamaica Plain,3.94,42.30982,-71.12033,534000.0,2,Park,Bakery,Coffee Shop,Pizza Place,Art Gallery,Bookstore,Thrift / Vintage Store,Yoga Studio,American Restaurant,Accessories Store
2,Mission Hill,0.55,42.332926,-71.103214,,1,Pizza Place,Sandwich Place,Sushi Restaurant,Convenience Store,Coffee Shop,Falafel Restaurant,Café,Caribbean Restaurant,Gastropub,Donut Shop
3,Longwood,0.29,42.336168,-71.099527,,1,Sandwich Place,Donut Shop,Italian Restaurant,Platform,Pub,Falafel Restaurant,Café,Bookstore,Liquor Store,Gastropub
4,Bay Village,0.04,42.350011,-71.066948,615000.0,2,Sandwich Place,Hotel,Italian Restaurant,Theater,Steakhouse,American Restaurant,Performing Arts Venue,Gym / Fitness Center,Bakery,Vegetarian / Vegan Restaurant
5,Leather District,0.02,42.351049,-71.057969,850000.0,2,Coffee Shop,Asian Restaurant,Chinese Restaurant,Bakery,Sandwich Place,Sushi Restaurant,Vegetarian / Vegan Restaurant,American Restaurant,Food Truck,Hotpot Restaurant
6,Chinatown,0.12,42.352217,-71.062607,850000.0,2,Chinese Restaurant,Asian Restaurant,Bakery,Sushi Restaurant,Theater,Coffee Shop,Sandwich Place,Pizza Place,Performing Arts Venue,Seafood Restaurant
7,North End,0.2,42.365097,-71.054495,570500.0,0,Italian Restaurant,Bakery,Pizza Place,Seafood Restaurant,Park,Café,Sandwich Place,Market,Playground,Coffee Shop
8,Roxbury,3.29,42.324843,-71.095016,338000.0,1,Donut Shop,Pizza Place,Italian Restaurant,Convenience Store,Skating Rink,Recreation Center,Plaza,Bed & Breakfast,Supermarket,Furniture / Home Store
9,South End,0.74,42.34131,-71.07723,615000.0,4,Italian Restaurant,Coffee Shop,Wine Shop,Wine Bar,Park,Bar,Gift Shop,Bakery,Yoga Studio,Salon / Barbershop


Finally, let's visualize the resulting clusters

In [68]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boston_merged['Latitude'], boston_merged['Longitude'], boston_merged['Neighborhood'], boston_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

#### Cluster 1

In [69]:
boston_merged.loc[boston_merged['Cluster Labels'] == 0, boston_merged.columns[[0] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,North End,0,Italian Restaurant,Bakery,Pizza Place,Seafood Restaurant,Park,Café,Sandwich Place,Market,Playground,Coffee Shop


#### Cluster 2

In [70]:
boston_merged.loc[boston_merged['Cluster Labels'] == 1, boston_merged.columns[[0] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Mission Hill,1,Pizza Place,Sandwich Place,Sushi Restaurant,Convenience Store,Coffee Shop,Falafel Restaurant,Café,Caribbean Restaurant,Gastropub,Donut Shop
3,Longwood,1,Sandwich Place,Donut Shop,Italian Restaurant,Platform,Pub,Falafel Restaurant,Café,Bookstore,Liquor Store,Gastropub
8,Roxbury,1,Donut Shop,Pizza Place,Italian Restaurant,Convenience Store,Skating Rink,Recreation Center,Plaza,Bed & Breakfast,Supermarket,Furniture / Home Store
11,East Boston,1,Mexican Restaurant,Park,Italian Restaurant,Latin American Restaurant,Pizza Place,Seafood Restaurant,Café,Bar,Burrito Place,Fast Food Restaurant
12,Charlestown,1,Café,Park,Gastropub,Pizza Place,Gym,History Museum,Grocery Store,Pub,Donut Shop,Athletics & Sports
13,West End,1,Sandwich Place,Pizza Place,Donut Shop,Coffee Shop,Bar,Hotel,Café,Italian Restaurant,Sports Bar,Gym / Fitness Center
16,Fenway,1,Sports Bar,Pizza Place,Coffee Shop,Lounge,Baseball Field,Restaurant,American Restaurant,Thai Restaurant,Liquor Store,Mexican Restaurant
17,Brighton,1,Pizza Place,Convenience Store,Café,Coffee Shop,Pub,Bakery,Donut Shop,Chinese Restaurant,Dry Cleaner,Greek Restaurant
22,South Boston Waterfront,1,Pizza Place,Italian Restaurant,Bar,Sports Bar,Chinese Restaurant,Liquor Store,Coffee Shop,Donut Shop,Sushi Restaurant,Dog Run
23,South Boston,1,Pizza Place,Bar,Donut Shop,Sandwich Place,Italian Restaurant,Coffee Shop,Gym,Convenience Store,Sports Bar,Beach


#### Cluster 3

In [71]:
boston_merged.loc[boston_merged['Cluster Labels'] == 2, boston_merged.columns[[0] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Jamaica Plain,2,Park,Bakery,Coffee Shop,Pizza Place,Art Gallery,Bookstore,Thrift / Vintage Store,Yoga Studio,American Restaurant,Accessories Store
4,Bay Village,2,Sandwich Place,Hotel,Italian Restaurant,Theater,Steakhouse,American Restaurant,Performing Arts Venue,Gym / Fitness Center,Bakery,Vegetarian / Vegan Restaurant
5,Leather District,2,Coffee Shop,Asian Restaurant,Chinese Restaurant,Bakery,Sandwich Place,Sushi Restaurant,Vegetarian / Vegan Restaurant,American Restaurant,Food Truck,Hotpot Restaurant
6,Chinatown,2,Chinese Restaurant,Asian Restaurant,Bakery,Sushi Restaurant,Theater,Coffee Shop,Sandwich Place,Pizza Place,Performing Arts Venue,Seafood Restaurant
24,Allston,2,Pizza Place,Bakery,Korean Restaurant,Chinese Restaurant,Gym / Fitness Center,Bar,Café,Sushi Restaurant,Diner,Coffee Shop


#### Cluster 4

In [72]:
boston_merged.loc[boston_merged['Cluster Labels'] == 3, boston_merged.columns[[0] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Roslindale,3,Pizza Place,American Restaurant,Grocery Store,Bar,Plaza,Italian Restaurant,Sandwich Place,Liquor Store,Rental Car Location,Discount Store
18,West Roxbury,3,Pizza Place,Pharmacy,Convenience Store,Liquor Store,Bank,Park,Gift Shop,Donut Shop,American Restaurant,Grocery Store
19,Hyde Park,3,Pizza Place,Park,Pharmacy,American Restaurant,Grocery Store,Platform,Theater,Sandwich Place,Donut Shop,Gas Station
20,Mattapan,3,Caribbean Restaurant,Pizza Place,Liquor Store,Scenic Lookout,Hot Dog Joint,Ice Cream Shop,Indian Restaurant,Hardware Store,Gym / Fitness Center,Donut Shop
21,Dorchester,3,Pharmacy,Pizza Place,Liquor Store,Golf Course,Park,Diner,Discount Store,Sandwich Place,Fast Food Restaurant,Metro Station


#### Cluster 5

In [73]:
boston_merged.loc[boston_merged['Cluster Labels'] == 4, boston_merged.columns[[0] + list(range(5, boston_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,South End,4,Italian Restaurant,Coffee Shop,Wine Shop,Wine Bar,Park,Bar,Gift Shop,Bakery,Yoga Studio,Salon / Barbershop
10,Back Bay,4,American Restaurant,Coffee Shop,Hotel,Italian Restaurant,Clothing Store,Seafood Restaurant,Salon / Barbershop,Spa,Sporting Goods Shop,Cosmetics Shop
14,Beacon Hill,4,Hotel Bar,Pizza Place,Italian Restaurant,Gift Shop,American Restaurant,Plaza,Hotel,Sushi Restaurant,Coffee Shop,Gourmet Shop
15,Downtown,4,Hotel Bar,Gift Shop,Italian Restaurant,Gourmet Shop,Pizza Place,Hotpot Restaurant,Liquor Store,Ice Cream Shop,Kids Store,Korean Restaurant


After examining each cluster, we can identify the venue categories that distinguish it from the others.

- The most common venues for cluster 1 are Cafes, Restaurants, Parks, and Playgrounds
- The most common venues for cluster 2 are Fast Food Restaurants, Pubs/Bars, and Parks
- The most common venues for cluster 3 are Asian Cuisine Restaurants and Art venues
- The most common venues for cluster 4 are Grocery / Convenience Stores, Restaurants / Cafes, Liquor Stores
- The most common venues for cluster 5 are Parks, Hotels, and Gift/Gourmet Shops
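
These summaries were made by reading the tables above. A rough way to support them is to tally how often each category appears among the top venues of a cluster's neighborhoods. The sketch below does this for the three most common venues of the cluster 4 neighborhoods, copied from the table above; `tally_top_venues` is a hypothetical helper.

```python
from collections import Counter

def tally_top_venues(top_venues_by_neighborhood):
    """Count how often each category appears across the given top-venue lists."""
    counts = Counter()
    for venues in top_venues_by_neighborhood.values():
        counts.update(venues)
    return counts

# 1st to 3rd most common venues for the cluster 4 neighborhoods
cluster_4 = {
    'Roslindale': ['Pizza Place', 'American Restaurant', 'Grocery Store'],
    'West Roxbury': ['Pizza Place', 'Pharmacy', 'Convenience Store'],
    'Hyde Park': ['Pizza Place', 'Park', 'Pharmacy'],
    'Mattapan': ['Caribbean Restaurant', 'Pizza Place', 'Liquor Store'],
    'Dorchester': ['Pharmacy', 'Pizza Place', 'Liquor Store'],
}
for category, count in tally_top_venues(cluster_4).most_common(3):
    print(category, count)
```

A tally like this ignores rank and frequency values, so it complements the manual reading of the tables rather than replacing it.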

In [74]:
# Let's create descriptive cluster labels for the map popups
cluster_lab = ['Cafes&nbsp;|&nbsp;Restaurants&nbsp;|&nbsp;Parks&nbsp;|&nbsp;Playgrounds',
              'Fast&nbsp;Food&nbsp;Restaurants&nbsp;|&nbsp;Pubs/Bars&nbsp;|&nbsp;Parks',
              'Asian&nbsp;Cuisine&nbsp;Restaurants&nbsp;|&nbsp;Art&nbsp;Venues',
              'Grocery&nbsp;|&nbsp;Convenience&nbsp;Stores&nbsp;|&nbsp;Restaurants&nbsp;|&nbsp;Cafes&nbsp;|&nbsp;Liquor&nbsp;Stores',
              'Parks&nbsp;|&nbsp;Hotels&nbsp;|&nbsp;Gift/Gourmet&nbsp;Shops']

In [75]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
colors = ['red', 'blue', 'green', 'purple', 'black']

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boston_merged['Latitude'], boston_merged['Longitude'], boston_merged['Neighborhood'], boston_merged['Cluster Labels']):
    label = folium.Popup(str(poi).upper() + "\n\n" + cluster_lab[cluster], parse_html=False)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=colors[cluster],
        fill=True,
        fill_color=colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's overlay the venue clusters on the median condo price map.

In [76]:
# add cluster markers to the boston_map
markers_colors = []
for lat, lon, poi, cluster in zip(boston_merged['Latitude'], boston_merged['Longitude'], boston_merged['Neighborhood'], boston_merged['Cluster Labels']):
    label = folium.Popup(str(poi).upper() + "\n\n" + cluster_lab[cluster], parse_html=False)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=colors[cluster],
        fill=True,
        fill_color=colors[cluster],
        fill_opacity=0.7).add_to(boston_map)
    
boston_map

In [77]:
# save map
boston_map.save(os.path.join('', 'Condo Median Prices with venue clusters.html'))

Further discussion of the results is available in the project report.

#### References:
1. __[Boston Neighborhoods Geospatial Dataset](http://bostonopendata-boston.opendata.arcgis.com/datasets/3525b0ee6e6b427f9aab5d0a1d0a1a28_0?geometry=-71.373%2C42.31%2C-70.719%2C42.399)__
2. __[Foursquare API](https://developer.foursquare.com/)__
3. __[Condo Prices in Boston 2018 Report](https://www.bostonmagazine.com/top-places-to-live-2018-condos/)__
4. __[The Elliman Report](https://www.elliman.com/pdf/cf8dc6172df72a7b8c6817c3feb145a8b3e4ce6c)__