# Capstone Project - The Battle of the Neighborhoods (Week 2)
### The Report and Notebook

## Table of contents
* [I. Introduction: Business Problem](#introduction)
* [II. Data](#data)
* [III. Methodology](#methodology)
* [IV. Analysis](#analysis)
* [V. Results and Discussion](#results)
* [VI. Conclusion](#conclusion)

## I. Introduction: Business Problem <a name="introduction"></a>

My friend Wei Li is moving from Beijing to the Washington, DC area because her daughter will start at Georgetown University in DC this coming fall.  Wei wants to live in a DC neighborhood and explore the city's museums and activities.  She is an avid bike rider, loves cooking, and enjoys a variety of cuisines.  She will not have a car for the first several months and will rely mostly on Uber or public transportation.  In this data science project, I will try to provide insights about the DC neighborhoods that help her narrow down which neighborhood(s) to consider.

I will use data science tools to identify several DC neighborhoods that best fit the above criteria, and use the analysis to shed light on the pros and cons of each area so that she can choose the most suitable neighborhood.  This project may also interest anyone else who is looking to move to the DC area.

## II. Data <a name="data"></a>

Based on the definition of the business problem, I think four key factors are important to evaluate:
* Safety of the neighborhood
* Easy access to bike trails
* Close distance to shopping center(s)
* Availability of a variety of venues and restaurants in the neighborhood (any type of restaurant)

I gathered the following data sources to extract and generate the information required for this project:
* Washington DC neighborhood names and geospatial data (source: https://opendata.dc.gov): I will use this information to create a dynamic neighborhood map to help visualize key data
* 2017 - 2019 Washington DC crime reports (source: https://opendata.dc.gov): This will help evaluate the safety of each neighborhood
* Bike trails available in the DC neighborhoods (source: https://opendata.dc.gov): This is important information given Wei's passion for biking
* Shops in the DC neighborhoods (source: https://opendata.dc.gov): Having shops in or near the neighborhood is critical as Wei loves cooking and shopping
* Venues and restaurants in the DC neighborhoods (source: Foursquare API): I will use the location information from Foursquare to discover the key activities and lifestyle of the neighborhoods

I downloaded most of the above data files and uploaded them to the GitHub project (https://github.com/lisawu83/CapstoneProject).

### II 1) Evaluate the safety factor of the neighborhoods

Let's check the recent years' crime reports to find out which neighborhoods in the DC area are relatively safe.

First, let's import the libraries used throughout this notebook.

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
#import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.10.0 --yes
import folium # plotting library

!conda install -c conda-forge geopandas --yes
import geopandas as gpd
import branca


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
import matplotlib.pyplot as plt

print('Folium installed')
print('Libraries imported.')


#### Let's first get the Washington DC neighborhood names and geo data (latitude and longitude), which I sourced from https://opendata.dc.gov.  The source file for the neighborhood names is at the Neighborhood Cluster level, meaning multiple neighborhoods with the same zip code are included in the same Neighborhood Cluster row.  However, the geo data file is at the individual neighborhood level.  To make it easier for this project, I combined the two files into one file named "Neighborhood_Clusters.csv" and uploaded it to GitHub: for each Neighborhood Cluster, I used one of its neighborhoods to look up the geo data (longitude and latitude).

#### The limitation of the combined file is that a Neighborhood Cluster with multiple neighborhoods reflects the geo data of only one of its neighborhoods. Given that all neighborhoods in each Neighborhood Cluster are in the same zip code, I think they are close enough.

In [2]:
# Read Washington DC Neighborhood cluster numbers and names (source: https://opendata.dc.gov)
url_cluster='https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/Neighborhood_Clusters.csv'
dc_neighbordhood = pd.read_csv(url_cluster, error_bad_lines=False, index_col=None, header =0)
dc_neighbordhood.rename(columns = {'NAME':'NEIGHBORHOOD_CLUSTER'}, inplace = True)
dc_neighbordhood.head(1)

Unnamed: 0,OBJECTID,WEB_URL,NEIGHBORHOOD_CLUSTER,NBH_NAMES,Geo_NAME,Latitude,Longitude,Shape_Length,Shape_Area,TYPE
0,1,http://planning.dc.gov/,Cluster 39,"Congress Heights, Bellevue, Washington Highlands",Congress Heights,38.841077,-76.99795,10711.66801,4886462.548,Original


#### Now that we have the neighborhood names and geo data, let's review the neighborhood crime reports and use the crime counts as a criterion to determine the safety of each neighborhood.  I obtained the 2017-2019 crime incident reports from https://opendata.dc.gov.  Later, I will create a neighborhood map that is colored based on the average crime incident count.

In [4]:
# Import DC crime incidents reported in 2019

import pandas as pd
url_2019='https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/DC_Crime_Incidents_in_2019.csv'
c_2019 = pd.read_csv(url_2019, error_bad_lines=False, index_col=None, header =0)
c_2019.head(1)
print(c_2019.shape)


(33920, 25)


In [5]:
# group 2019 crime count by neighborhood cluster number
cluster = c_2019.groupby('NEIGHBORHOOD_CLUSTER')['OFFENSE'].count()
c_2019_cluster = pd.DataFrame(cluster).reset_index()
c_2019_cluster.rename(columns = {'OFFENSE':'Crime_2019'}, inplace = True) 
c_2019_cluster.head(2)

Unnamed: 0,NEIGHBORHOOD_CLUSTER,Crime_2019
0,Cluster 1,709
1,Cluster 10,277


In [6]:
# Import DC crime incidents reported in 2018

import pandas as pd
url_2018='https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/DC_Crime_Incidents_in_2018.csv'
c_2018 = pd.read_csv(url_2018, error_bad_lines=False, index_col=None, header =0)
print(c_2018.shape)

(33772, 25)


In [7]:
# group 2018 crime count by neighborhood cluster number
cluster = c_2018.groupby('NEIGHBORHOOD_CLUSTER')['OFFENSE'].count()
c_2018_cluster = pd.DataFrame(cluster).reset_index()
c_2018_cluster.rename(columns = {'OFFENSE':'Crime_2018'}, inplace = True) 
c_2018_cluster.head(2)

Unnamed: 0,NEIGHBORHOOD_CLUSTER,Crime_2018
0,Cluster 1,652
1,Cluster 10,290


In [8]:
# Import DC crime incidents reported in 2017

import pandas as pd
url_2017='https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/DC_Crime_Incidents_in_2017.csv'
c_2017 = pd.read_csv(url_2017, error_bad_lines=False, index_col=None, header =0)
print(c_2017.shape)

(33113, 25)


In [9]:
# group 2017 crime count by neighborhood cluster number
cluster = c_2017.groupby('NEIGHBORHOOD_CLUSTER')['OFFENSE'].count()
c_2017_cluster = pd.DataFrame(cluster).reset_index()
c_2017_cluster.rename(columns = {'OFFENSE':'Crime_2017'}, inplace = True) 
c_2017_cluster.head(2)

Unnamed: 0,NEIGHBORHOOD_CLUSTER,Crime_2017
0,Cluster 1,713
1,Cluster 10,256


#### Now we have the three years' crime incidents grouped by neighborhood cluster.  Let's merge these crime incident counts with the `dc_neighbordhood` dataframe to get the geo data so we can create a map later.

#### Let's sum up the three years' crime counts for each neighborhood cluster and calculate the average crime count per year.

In [10]:
from functools import reduce
dfs = [dc_neighbordhood[['NEIGHBORHOOD_CLUSTER','NBH_NAMES','Geo_NAME','Latitude','Longitude']], c_2019_cluster, c_2018_cluster, c_2017_cluster]
# merge the neighborhood file with the three yearly crime counts on the cluster number
df_final = reduce(lambda left,right: pd.merge(left,right,on='NEIGHBORHOOD_CLUSTER'), dfs)
columns = ['Crime_2019','Crime_2018','Crime_2017']
df_final['Total_3yrs']=df_final.loc[:, columns].sum(axis=1)
df_final['Avg_3yrs']=round(df_final.loc[:, columns].mean(axis=1),0)
# sort in place, safest first; sort_values(inplace=True) returns None, so do not assign its result
df_final.sort_values(["Total_3yrs", "Avg_3yrs"], axis=0, ascending=True, inplace=True)
df_final_best10 = df_final.head(10) # the ten clusters with the fewest incidents
df_final.head(2)

Unnamed: 0,NEIGHBORHOOD_CLUSTER,NBH_NAMES,Geo_NAME,Latitude,Longitude,Crime_2019,Crime_2018,Crime_2017,Total_3yrs,Avg_3yrs
10,Cluster 29,"Eastland Gardens, Kenilworth",Eastland Gardens,38.905329,-76.94307,56,77,77,210,70.0
19,Cluster 12,"North Cleveland Park, Forest Hills, Van Ness",North Cleveland Park,38.947879,-77.071774,235,231,180,646,215.0
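
The per-year load, group and merge steps above repeat the same logic three times. As a minimal sketch of a more compact alternative, the yearly frames can be stacked and pivoted in one pass. The tiny frames below are made-up stand-ins for `c_2017`, `c_2018` and `c_2019`; only the `NEIGHBORHOOD_CLUSTER` and `OFFENSE` column names are taken from the cells above.

```python
import pandas as pd

# Made-up stand-ins for c_2017, c_2018, c_2019 (one row per incident).
yearly = {
    2017: pd.DataFrame({"NEIGHBORHOOD_CLUSTER": ["Cluster 1", "Cluster 1", "Cluster 10"],
                        "OFFENSE": ["THEFT", "ROBBERY", "THEFT"]}),
    2018: pd.DataFrame({"NEIGHBORHOOD_CLUSTER": ["Cluster 1", "Cluster 10", "Cluster 10"],
                        "OFFENSE": ["THEFT", "THEFT", "BURGLARY"]}),
    2019: pd.DataFrame({"NEIGHBORHOOD_CLUSTER": ["Cluster 1"],
                        "OFFENSE": ["THEFT"]}),
}

# Stack the years, tagging each incident with the year it came from.
incidents = pd.concat(
    [df.assign(YEAR=year) for year, df in yearly.items()], ignore_index=True
)

# One row per cluster, one column per year; a cluster with no incidents
# in a given year gets 0 instead of being dropped.
crime_counts = (
    incidents.groupby(["NEIGHBORHOOD_CLUSTER", "YEAR"]).size()
    .unstack("YEAR", fill_value=0)
    .add_prefix("Crime_")
)
year_cols = list(crime_counts.columns)
crime_counts["Total_3yrs"] = crime_counts[year_cols].sum(axis=1)
crime_counts["Avg_3yrs"] = crime_counts[year_cols].mean(axis=1).round(0)
print(crime_counts)
```

One difference from the chained inner merge is worth noting: a cluster that is absent from one year's report stays in the table with a count of 0, whereas an inner merge would silently drop it.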


#### Now we have the crime incidents and geo data for each neighborhood cluster. Let's create a map of the neighborhoods, with the color of each neighborhood driven by its crime incident count.  I considered using Choropleth to create the map but couldn't get it to display properly, so I used Folium with GeoPandas instead.  The color is driven by the 3-year average crime incident count in each neighborhood: a darker color indicates a higher average crime count.

In [12]:
#Load DC neighborhood geojson file as geopandas frame
url_geo = r'https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/zillow-neighborhoods.geojson' # geojson file
DC_gdf = gpd.read_file(url_geo)
DC_gdf.rename(columns = {'name':'Geo_NAME'}, inplace = True)
DC_gdf.head(1) 

Unnamed: 0,city,Geo_NAME,regionid,county,state,geometry
0,Washington,Catholic University,273159,District of Columbia,DC,"POLYGON ((-77.00433 38.94064, -77.00423 38.940..."


In [13]:
## merge the variables of interest (df_final crime incidents information) into the Geodataframe
DC_gdf_update = pd.merge(DC_gdf, df_final, how='inner', on=['Geo_NAME'])
DC_gdf_update.head(1)

Unnamed: 0,city,Geo_NAME,regionid,county,state,geometry,NEIGHBORHOOD_CLUSTER,NBH_NAMES,Latitude,Longitude,Crime_2019,Crime_2018,Crime_2017,Total_3yrs,Avg_3yrs
0,Washington,Takoma,268831,District of Columbia,DC,"POLYGON ((-77.01381 38.97268, -77.01415 38.972...",Cluster 17,"Takoma, Brightwood, Manor Park",38.976462,-77.021558,878,1043,891,2812,937.0
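
Because the merge above is an inner join on `Geo_NAME`, any neighborhood polygon without a matching crime row (and any crime row without a polygon) disappears without notice. The sketch below shows one way to list the unmatched names with pandas' `indicator` option. The two small frames are illustrative stand-ins for `DC_gdf` and `df_final`, and the second crime value is invented for the example.

```python
import pandas as pd

# Illustrative stand-ins: polygon names (DC_gdf) and crime rows (df_final).
shapes = pd.DataFrame({"Geo_NAME": ["Takoma", "Brightwood", "Manor Park"]})
crime = pd.DataFrame({"Geo_NAME": ["Takoma", "Some Unmapped Name"],
                      "Avg_3yrs": [937.0, 100.0]})  # 100.0 is a made-up value

# An outer merge keeps every row; the _merge column says where each came from.
check = shapes.merge(crime, on="Geo_NAME", how="outer", indicator=True)

no_crime_data = sorted(check.loc[check["_merge"] == "left_only", "Geo_NAME"])
no_polygon = sorted(check.loc[check["_merge"] == "right_only", "Geo_NAME"])

print("Polygons without crime data:", no_crime_data)
print("Crime rows without a polygon:", no_polygon)
```

Running the same check on the real frames would show how much of the map is left uncolored by the one-neighborhood-per-cluster lookup.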


In [14]:
# Create a dynamic map centered on DC
centroid=DC_gdf.geometry.centroid ## computes the center point of each neighborhood shape
m=folium.Map(location=[centroid.y.mean(), centroid.x.mean()], zoom_start=12) ## initializes a map centered on the mean of the centroids


In [15]:
# Use the 3-year average crime count in the neighborhood as the criterion
variable = 'Avg_3yrs' 
name = 'Avg Crime Count'

print(name, "colorscale")
print("Min:",DC_gdf_update[variable].min())
print("Max:",DC_gdf_update[variable].max())
    
colorscale = branca.colormap.linear.YlOrRd_09.scale(DC_gdf_update[variable].min(), DC_gdf_update[variable].max()) 
colorscale

Avg Crime Count colorscale
Min: 70.0
Max: 2513.0


In [23]:
# Find the top 5 safest neighborhoods, indicated by the lowest 3-year average crime count
df_interest1 = DC_gdf_update[['Geo_NAME', variable]].sort_values(by = variable, ascending = True)
df_interest1.head()

Unnamed: 0,Geo_NAME,Avg_3yrs
33,Eastland Gardens,70.0
18,North Cleveland Park,215.0
9,Colonial Village,216.0
21,Spring Valley,238.0
6,Woodlands,270.0


In [16]:
# create df with neighborhood name and variable of interest, sorted from largest to smallest
df_interest = DC_gdf_update[['Geo_NAME', variable]].sort_values(by = variable, ascending = False)  

# reset index so that the largest value corresponds to row 0 and the smallest to the last row
df_interest.reset_index(inplace = True)
# legend breaks: values at selected rank positions (positions beyond the last row are ignored)
leg_brks = list(df_interest[df_interest.index.isin([0,4,9,19,29,49])][variable])

# make the smallest value of the scale be 0
leg_brks.append(0)
leg_brks.sort() # sort from smallest to largest
print("Quantiles:", leg_brks)

print(name, "colorscale")

colorscale = branca.colormap.linear.YlOrRd_09.scale(DC_gdf_update[variable].min(), DC_gdf_update[variable].max()) 
colorscale = colorscale.to_step(index = leg_brks) ## use the break values as the step boundaries
colorscale.caption = name ## adds name for legend
    
colorscale

Quantiles: [0, 380.0, 674.0, 1239.0, 1701.0, 2513.0]
Avg Crime Count colorscale
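
The breaks printed above are the values found at fixed rank positions rather than quantiles in the statistical sense. For comparison, here is a minimal sketch of computing evenly spaced quantile breaks with NumPy. The array is an illustrative sample of average counts, not the full `Avg_3yrs` column.

```python
import numpy as np

# Illustrative sample of 3-year average crime counts.
avg_counts = np.array([70, 215, 216, 238, 270, 380, 674, 1239, 1701, 2513], dtype=float)

# Break points at the 0th, 20th, 40th, 60th, 80th and 100th percentiles.
breaks = np.quantile(avg_counts, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

# Assign each value to one of five bins (0 = lowest counts, 4 = highest)
# by comparing it against the four interior break points.
bins = np.digitize(avg_counts, breaks[1:-1])

print("Breaks:", breaks)
print("Values per bin:", np.bincount(bins))
```

With quantile breaks, each color step covers roughly the same number of neighborhoods, which keeps a few very high counts from washing out the differences among the rest.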


In [17]:
variable = 'Avg_3yrs' #average crime count in Washington, DC neighborhood between 2017 and 2019
name = 'Avg Crime Count(2017-2019)'

folium.GeoJson(DC_gdf_update, ## GeoPandas dataframe
               name="Washington DC",
                   
               ## controls the fill of the geo regions; applying colorscale based on variable
               style_function=lambda x: {"weight":1
                                         , 'color': '#545453'
                                         ## if variable is 0 map is a very light grey
                                         ## else colorscale applies based on variable
                                         , 'fillColor':'#9B9B9B' if x['properties'][variable] == 0 
                                         else colorscale(x['properties'][variable])
                                         ## similarly opacity is reduced if value is 0
                                         , 'fillOpacity': 0.2 if x['properties'][variable] == 0 
                                         else 0.5},
                   
               ## changes styling of geo regions upon hover
               highlight_function=lambda x: {'weight':3, 'color':'black', 'fillOpacity': 1}, 
               
                ## tooltip can include information from any column in the GeoPandas dataframe   
                tooltip=folium.features.GeoJsonTooltip(
                fields=['Geo_NAME', 'Total_3yrs', variable],
                aliases=['Neighborhood:', 'Total Crime Count(2017-2019):', name])
              ).add_to(m)

## add colorscale to map so that it appears as the legend
colorscale.add_to(m)
    
m


#### Based on the safety analysis above, the northwest neighborhoods and some of the east and south neighborhoods in DC are safer than the downtown areas.  Wei will be glad to find out that the neighborhood where Georgetown University is located is one of the safer neighborhoods.

#### As I noted earlier, there is a limitation in mapping the geo data for each Neighborhood Cluster: each cluster in the crime incident reports is matched to the geo data of only one of its neighborhoods, and therefore the map is not populated for every neighborhood in each cluster.  Given that all neighborhoods in each cluster are close to one another, I decided to accept this limitation.
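
One possible way to ease this limitation is to expand each cluster row into one row per neighborhood before merging with the polygon file, so that every neighborhood inherits its cluster's crime figures. The following is only a sketch: it uses two cluster rows copied from the outputs above, and it assumes that the names in `NBH_NAMES` match the `Geo_NAME` values in the geojson file, which has not been verified here.

```python
import pandas as pd

# Two cluster rows taken from the outputs above.
clusters = pd.DataFrame({
    "NEIGHBORHOOD_CLUSTER": ["Cluster 17", "Cluster 29"],
    "NBH_NAMES": ["Takoma, Brightwood, Manor Park", "Eastland Gardens, Kenilworth"],
    "Avg_3yrs": [937.0, 70.0],
})

# Split the comma-separated names and emit one row per neighborhood,
# repeating the cluster-level crime average on each row.
per_neighborhood = (
    clusters.assign(Geo_NAME=clusters["NBH_NAMES"].str.split(", "))
    .explode("Geo_NAME")
    .reset_index(drop=True)
)
print(per_neighborhood[["NEIGHBORHOOD_CLUSTER", "Geo_NAME", "Avg_3yrs"]])
```

The resulting frame could then be merged with the polygon dataframe on `Geo_NAME` in the same way as before, coloring every neighborhood in a cluster with the cluster-level figure.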

### II. 2) Bike trails in the Washington DC neighborhoods

Source: http://opendata.dc.gov/

#### Wei is an avid bike rider.  Now that we have checked the safety information, let's look at the bike trail information.

In [24]:
#Load bike trail geojson file as geopandas frame
url_bike = r'https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/bike-trails.geojson' # geojson file
DC_bike = gpd.read_file(url_bike)
#DC_bike.rename(columns = {'name':'Geo_NAME'}, inplace = True)
DC_bike.head(1)

Unnamed: 0,OBJECTID,LENGTH,NAME,STATUS,MAINTENANC,Shape_Length,MILES,geometry
0,1,336.54,Metropolitan Branch Trail,Open,DDOT,102.577767,0.063739,"LINESTRING (-76.99465 38.93253, -76.99506 38.9..."


In [25]:
# Add the bike trails (blue lines) to the neighborhood map
folium.GeoJson(DC_bike).add_to(m) ## GeoPandas dataframe
m

#### Based on the bike trail information, there are many trails in the northwest, south and central neighborhoods. This is good news!

### II. 3) Shops in the Washington DC neighborhoods

Source: http://opendata.dc.gov/

#### Wei loves home cooking and shopping.  She does not plan to buy a car in the first several months and will need to rely on Uber, so it is convenient to have shopping centers near the neighborhood.  Let's check out the distribution of shopping centers in the DC area.

In [26]:
## Load shopping center geojson as geopandas dataframe
url_shop = r'https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/shopping-centers.geojson'
dc_shop = gpd.read_file(url_shop)    
dc_shop.head(1)

Unnamed: 0,OBJECTID,GIS_ID,AID,NAME,ADDRESS,STATUS,SSL,WARD,ZIPCODE,X_COORD,Y_COORD,geometry
0,1,ShpCenter_1,288047,EAST RIVER PARK SHOPPING CENTER,3939 MINNESOTA AVE NE,ACTIVE,5051N 0015,Ward 7,20019,404409.780500002,136323.600000001,POINT (-76.94917 38.89476)


In [27]:
# Add the shopping centers (markers) to the neighborhood map
folium.GeoJson(dc_shop).add_to(m) ## GeoPandas dataframe
m

#### Based on the shopping center information, all neighborhoods except those at the far north and south ends of DC appear to be within a reasonable distance of one or more shopping centers.
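
"Reasonable distance" is judged by eye on the map. A minimal sketch of putting a number on it is shown below, using the haversine great-circle formula. The coordinates are copied from the outputs above; the function name `haversine_km` and the single-entry `shops` dictionary are illustrative assumptions, and in practice every point in `dc_shop` would be included.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Neighborhood coordinates and one shopping center, copied from the outputs above.
neighborhoods = {
    "Eastland Gardens": (38.905329, -76.94307),
    "North Cleveland Park": (38.947879, -77.071774),
}
shops = {"EAST RIVER PARK SHOPPING CENTER": (38.89476, -76.94917)}

# Distance from each neighborhood to its nearest shopping center.
nearest = {
    hood: min(haversine_km(lat, lon, s_lat, s_lon) for s_lat, s_lon in shops.values())
    for hood, (lat, lon) in neighborhoods.items()
}
for hood, km in nearest.items():
    print("{}: {:.2f} km to the nearest listed shopping center".format(hood, km))
```

A threshold on this distance (for example a comfortable walking or cycling range) would turn the visual impression into a criterion that can be compared across neighborhoods.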

#### After analyzing the neighborhood safety, bike trail and shopping center location data, several neighborhoods in the northwest and southeast of DC satisfy all three criteria (safe, with bike trails and shopping centers nearby).
#### Let's further review the comprehensive venue information for all neighborhoods using the Foursquare API so we can gain insight into the expected lifestyle of each neighborhood.

## III. Methodology <a name="methodology"></a>

In this section, I will use Foursquare to collect venue information for each neighborhood, which will provide further insight into the lifestyle Wei should expect in each neighborhood.

In the second step, I will sort the venues in each neighborhood so we can understand the top venue types.

The final step is to choose a machine learning approach to create clusters of similar neighborhoods based on the top venues in each neighborhood.  K-means clustering is a reasonable choice: it is an unsupervised machine learning approach, so it needs no labeled data, and it groups neighborhoods by the similarity of their venue-category frequencies.


## IV. Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional information from our raw data.

#### Find out the neighborhood names and geo information

In [28]:
# Load Washington DC Neighborhood with Lat and Long fields 
url_label =r'https://raw.githubusercontent.com/lisawu83/CapstoneProject/master/Neighborhood_Labels.csv'
df_LL = pd.read_csv (url_label,  index_col =None, header = 0)
df_LL = df_LL.rename(columns={'X': 'Longitude', 'Y': 'Latitude','NAME': 'Name'})
df_LL['Name'] = df_LL['Name'].map(str)+', '+'DC'
print(df_LL.shape)
df_LL.head()
df_LL1 = df_LL[['Name','LABEL_NAME','GIS_ID','Latitude','Longitude']]
df_LL1.head(3)

(132, 8)


Unnamed: 0,Name,LABEL_NAME,GIS_ID,Latitude,Longitude
0,"Fort Stanton, DC",Fort Stanton,nhood_050,38.855658,-76.980348
1,"Congress Heights, DC",Congress Heights,nhood_031,38.841077,-76.99795
2,"Washington Highlands, DC",Washington Highlands,nhood_123,38.830237,-76.995636


#### Define Foursquare credentials

In [30]:
# @hidden cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
Radius = 500
Limit = 50

In [31]:
# Use Foursquare to get venue information around one neighborhood (row 1: Congress Heights)
latitude = df_LL1.loc[1,'Latitude']
longitude = df_LL1.loc[1,'Longitude']
radius = 500
LIMIT = 100 # limit of number of venues returned by Foursquare API
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&ll=38.84107731,-76.99794993&v=20180605&radius=500&limit=100'
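
Since the request URL carries the client ID and secret, it helps to keep those values out of the notebook itself. The sketch below builds the same kind of URL from a parameter dictionary and reads the credentials from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Build an explore request URL without hard-coding credentials."""
    params = {
        # Hypothetical environment variable names; placeholders are used if unset.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in ll becomes %2C).
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

url = build_explore_url(38.84107731, -76.99794993)
print(url.split("client_secret=")[0] + "client_secret=...")  # avoid echoing the secret
```

Printing only the part of the URL before the secret, as in the last line, also keeps the credentials out of saved cell outputs.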

In [32]:
import requests
results = requests.get(url).json()

In [None]:
'There are {} venues around Congress Heights.'.format(len(results['response']['groups'][0]['items']))

In [33]:
address_DC = 'Washington DC, DC'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address_DC)
latitude_DC = location.latitude
longitude_DC = location.longitude
print('The geographical coordinates of Washington DC are {}, {}.'.format(latitude_DC, longitude_DC))

The geographical coordinates of Washington DC are 38.8948932, -77.0365529.


In [34]:
# add markers to map, centered on the geocoded DC coordinates
map_DC = folium.Map(location=[latitude_DC, longitude_DC], zoom_start=11)
for lat, lng, name in zip(df_LL1['Latitude'], df_LL1['Longitude'], df_LL1['Name']):
    # 'Name' already has the form '<neighborhood>, DC'
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DC)  
    
map_DC

#### Let's create a function to pull venue information for all DC neighborhoods from Foursquare

In [37]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
         # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
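
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned with an empty category list. Whether such venues occur in this data is an assumption, not something observed above. A small defensive sketch of the per-venue extraction, using a hypothetical helper `venue_row` and hand-written sample items in the same shape as the response parsed above:

```python
def venue_row(name, lat, lng, item):
    """Flatten one explore item into a row, tolerating a missing category."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else "Unknown"
    return (
        name, lat, lng,
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )

# Hand-written items mimicking the structure of results[...]['items'].
sample_items = [
    {"venue": {"name": "Anacostia Community Museum",
               "location": {"lat": 38.856728, "lng": -76.976899},
               "categories": [{"name": "Museum"}]}},
    {"venue": {"name": "Uncategorized Place",  # made-up venue with no category
               "location": {"lat": 38.856, "lng": -76.98},
               "categories": []}},
]

rows = [venue_row("Fort Stanton, DC", 38.855658, -76.980348, item) for item in sample_items]
for row in rows:
    print(row)
```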
        

In [36]:
DC_venues = getNearbyVenues(names=df_LL1['Name'], latitudes=df_LL1['Latitude'], longitudes=df_LL1['Longitude'])

Fort Stanton, DC
Congress Heights, DC
Washington Highlands, DC
Bellevue, DC
Knox Hill/Buena Vista, DC
Shipley, DC
Douglass, DC
Woodland, DC
Garfield Heights, DC
Near Southeast, DC
Capitol Hill, DC
Dupont Park, DC
Twining, DC
Randle Highlands, DC
Fairlawn, DC
Penn Branch, DC
Barry Farm, DC
Historic Anacostia, DC
Columbia Heights, DC
Logan Circle/Shaw, DC
Cardozo/Shaw, DC
Van Ness, DC
Forest Hills, DC
Georgetown Reservoir, DC
Foxhall Village, DC
Fort Totten, DC
Pleasant Hill, DC
Kenilworth, DC
Eastland Gardens, DC
Deanwood, DC
Fort Dupont, DC
Greenway, DC
Woodland-Normanstone, DC
Mass. Ave. Heights, DC
Naylor Gardens, DC
Pleasant Plains, DC
Hillsdale, DC
Benning Ridge, DC
Penn Quarter, DC
Chinatown, DC
Stronghold, DC
South Central, DC
Langston, DC
Downtown East, DC
North Portal Estates, DC
Colonial Village, DC
Shepherd Park, DC
Takoma, DC
Lamond Riggs, DC
Petworth, DC
Brightwood Park, DC
Manor Park, DC
Brightwood, DC
Hawthorne, DC
Barnaby Woods, DC
Queens Chapel, DC
Michigan Park, DC
Nor

In [38]:
print(DC_venues.shape)
DC_venues.head(1)

(2764, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Fort Stanton, DC",38.855658,-76.980348,Anacostia Community Museum,38.856728,-76.976899,Museum


In [39]:
DC_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"16th Street Heights, DC",14,14,14,14,14,14
"Adams Morgan, DC",73,73,73,73,73,73
"American University Park, DC",1,1,1,1,1,1
"Arboretum, DC",15,15,15,15,15,15
"Barnaby Woods, DC",4,4,4,4,4,4
"Barry Farm, DC",7,7,7,7,7,7
"Bellevue, DC",5,5,5,5,5,5
"Benning Ridge, DC",4,4,4,4,4,4
"Benning, DC",17,17,17,17,17,17
"Bloomingdale, DC",20,20,20,20,20,20


#### Find out the number of unique venue categories

In [40]:
print('There are {} unique categories.'.format(len(DC_venues['Venue Category'].unique())))

There are 306 unique categories.


#### Analyze each neighborhood in DC

In [41]:
# one hot encoding
DC_onehot = pd.get_dummies(DC_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
DC_onehot['Neighborhood'] = DC_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [DC_onehot.columns[-1]] + list(DC_onehot.columns[:-1])
DC_onehot = DC_onehot[fixed_columns]

DC_onehot.head()

Unnamed: 0,Zoo Exhibit,Afghan Restaurant,African Restaurant,Airport Lounge,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,Art Museum,...,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


#### Let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [42]:
DC_grouped = DC_onehot.groupby('Neighborhood').mean().reset_index()
DC_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,Afghan Restaurant,African Restaurant,Airport Lounge,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,...,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio
0,"16th Street Heights, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
1,"Adams Morgan, DC",0.00,0.013699,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.013699,...,0.0,0.00,0.013699,0.013699,0.000000,0.0,0.000000,0.000000,0.000,0.000000
2,"American University Park, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
3,"Arboretum, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.066667,0.000000,0.0,0.000000,0.000000,0.000,0.000000
4,"Barnaby Woods, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
5,"Barry Farm, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
6,"Bellevue, DC",0.00,0.000000,0.000000,0.0,0.200000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
7,"Benning Ridge, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
8,"Benning, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000,0.000000
9,"Bloomingdale, DC",0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.00,0.000000,0.050000,0.000000,0.0,0.000000,0.000000,0.000,0.050000
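
To make the meaning of these frequencies concrete, here is a toy sketch of the same two steps (one-hot encoding, then a per-neighborhood mean) on five made-up venues. The neighborhood labels "A" and "B" and the venue list are invented for illustration.

```python
import pandas as pd

# Five made-up venues in two hypothetical neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Coffee Shop", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)

# The same table in a single call, as a cross-check.
crosstab = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")
```

Neighborhood "B" has a single venue, so its one category gets a frequency of 1.0. The same effect appears in the real data for neighborhoods where Foursquare returns only one or two venues, so those rows should be read with caution.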


In [None]:
DC_grouped.shape

In [43]:
# Print each neighborhood with the top 5 most common venues
num_top_venues = 5

for hood in DC_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = DC_grouped[DC_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----16th Street Heights, DC----
                   venue  freq
0               Bus Stop  0.07
1  Salvadoran Restaurant  0.07
2            Coffee Shop  0.07
3        Bed & Breakfast  0.07
4         Gymnastics Gym  0.07


----Adams Morgan, DC----
            venue  freq
0             Bar  0.05
1             Spa  0.04
2  Ice Cream Shop  0.04
3     Coffee Shop  0.04
4    Cocktail Bar  0.04


----American University Park, DC----
                venue  freq
0  Italian Restaurant   1.0
1         Zoo Exhibit   0.0
2                Park   0.0
3            Pharmacy   0.0
4           Pet Store   0.0


----Arboretum, DC----
                  venue  freq
0  Fast Food Restaurant  0.13
1             Nightclub  0.07
2                 Hotel  0.07
3    Chinese Restaurant  0.07
4      Botanical Garden  0.07


----Barnaby Woods, DC----
                  venue  freq
0             BBQ Joint  0.25
1                 Field  0.25
2                  Park  0.25
3  Gym / Fitness Center  0.25
4           Zoo Exhibi

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [45]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'st', 'nd', 'rd' every rank here takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = DC_grouped['Neighborhood']

for ind in np.arange(DC_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(DC_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"16th Street Heights, DC",Cosmetics Shop,Pizza Place,Breakfast Spot,Bus Stop,Park,Chinese Restaurant,Bed & Breakfast,Coffee Shop,Greek Restaurant,Salvadoran Restaurant
1,"Adams Morgan, DC",Bar,Ice Cream Shop,Coffee Shop,Cocktail Bar,Spa,New American Restaurant,Ethiopian Restaurant,Asian Restaurant,Mediterranean Restaurant,Diner
2,"American University Park, DC",Italian Restaurant,Yoga Studio,Filipino Restaurant,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field
3,"Arboretum, DC",Fast Food Restaurant,Botanical Garden,Storage Facility,Gas Station,Lake,BBQ Joint,Automotive Shop,Garden,Chinese Restaurant,Brewery
4,"Barnaby Woods, DC",Park,Gym / Fitness Center,BBQ Joint,Field,Yoga Studio,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
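
The column labels above are built by indexing a short list of suffixes and falling back to `th`. That works for the top 10, but it would label an 11th or 21st column incorrectly if `num_top_venues` were ever raised. One possible alternative is a small helper; `ordinal` is a name introduced here for illustration and is not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 10th through 20th all take 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(rank))
    for rank in range(1, num_top_venues + 1)
]
print(columns[:4])
```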


## Use K-means modeling approach to Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 6 clusters.

In [46]:
# set number of clusters
kclusters = 6

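# keep only the venue-frequency columns; `axis` must be passed by keyword in recent pandas versions
DC_grouped_clustering = DC_grouped.drop('Neighborhood', axis=1)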

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(DC_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 5, 4, 4, 3, 4, 2, 4, 3], dtype=int32)
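
Before interpreting the clusters, it can help to see how many neighborhoods fall under each label, since very small clusters are hard to characterize. The sketch below counts labels with `np.bincount`, using only the ten labels printed above as sample input. In the notebook itself the same call could presumably be applied to the full `kmeans.labels_` array.

```python
import numpy as np

# the first ten labels printed by kmeans.labels_[0:10]
labels = np.array([4, 4, 5, 4, 4, 3, 4, 2, 4, 3])
kclusters = 6

# one count per cluster label, including labels that do not occur in the sample
sizes = np.bincount(labels, minlength=kclusters)
for cluster_id, size in enumerate(sizes):
    print('cluster label {}: {} neighborhoods'.format(cluster_id, size))
```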

In [47]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
df_LL1 = df_LL1.assign(Neighborhood=df_LL1['Name'])
df_LL2 = df_LL1[['Neighborhood','Latitude','Longitude']]


In [48]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [49]:
DC_merged = df_LL2

# join the cluster labels and top venues onto the neighborhood coordinates in df_LL2
DC_merged = DC_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

DC_merged.head(1) # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Fort Stanton, DC",38.855658,-76.980348,0.0,Park,Recreation Center,Museum,Art Gallery,Dog Run,Field,Eye Doctor,Falafel Restaurant,Farm,Farmers Market


In [50]:
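# neighborhoods with no venue data have NaN after the join;
# fillna(0) sets their cluster label (and their venue columns) to 0
DC_merged = DC_merged.fillna(0)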
DC_merged['Cluster Labels'] = DC_merged['Cluster Labels'].astype(int)
DC_merged['Cluster Labels'].unique()

array([0, 2, 4, 1, 3, 5])
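
Because `fillna(0)` gives neighborhoods without venue data the label 0, they end up listed alongside the genuine members of that cluster. One possible alternative, sketched here on toy data, is to mark them with a sentinel label instead. The name `NO_DATA`, the toy frames, and the second coordinate pair are assumptions for illustration.

```python
import pandas as pd

coords = pd.DataFrame({
    'Neighborhood': ['Fort Stanton, DC', 'Crestwood, DC'],
    'Latitude': [38.855658, 38.94],    # second coordinate pair is made up
    'Longitude': [-76.980348, -77.04],
})
venues = pd.DataFrame({
    'Neighborhood': ['Fort Stanton, DC'],  # no venue rows for Crestwood
    'Cluster Labels': [0],
    '1st Most Common Venue': ['Park'],
})

merged = coords.join(venues.set_index('Neighborhood'), on='Neighborhood')

NO_DATA = -1  # hypothetical sentinel for "no venues returned"
merged['Cluster Labels'] = merged['Cluster Labels'].fillna(NO_DATA).astype(int)

clustered = merged[merged['Cluster Labels'] != NO_DATA]
unclustered = merged[merged['Cluster Labels'] == NO_DATA]
print(unclustered['Neighborhood'].tolist())
```

With this approach, any later step that indexes by cluster label (such as the map colors) would need to iterate over `clustered` only.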

In [51]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(DC_merged['Latitude'], DC_merged['Longitude'], DC_merged['Neighborhood'], DC_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
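
The map cell looks up colors with `rainbow[cluster-1]`. Since the labels run from 0 to 5, label 0 wraps around to the last color through Python's negative indexing. The sketch below reproduces only that lookup, without folium. `color_for_label` is a name introduced here for illustration.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 6

# one hex color per cluster, sampled evenly from the rainbow colormap
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# same lookup as the map cell: label 0 maps to rainbow[-1], the last color
color_for_label = {label: rainbow[label - 1] for label in range(kclusters)}
for label, hex_color in color_for_label.items():
    print(label, hex_color)
```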

## Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster.
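
As a rough first pass at the "defining categories", one could tally the most frequent top venue within each cluster. The sketch below does this on six rows copied from the cluster tables in this section. The frame name `sample` is an assumption, and in the notebook the same grouping could presumably be run on `DC_merged`.

```python
import pandas as pd

# six rows taken from the cluster tables (cluster labels 0 and 2)
sample = pd.DataFrame({
    'Neighborhood': ['Fort Stanton, DC', 'Woodland, DC', 'Michigan Park, DC',
                     'Congress Heights, DC', 'Shipley, DC', 'Hillsdale, DC'],
    'Cluster Labels': [0, 0, 0, 2, 2, 2],
    '1st Most Common Venue': ['Park', 'Park', 'Mexican Restaurant',
                              'Liquor Store', 'Convenience Store', 'Convenience Store'],
})

# for each cluster, pick the venue category that most often ranks first
dominant = (
    sample.groupby('Cluster Labels')['1st Most Common Venue']
          .agg(lambda s: s.value_counts().idxmax())
)
print(dominant)
```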

### Cluster 1 

In [52]:
DC_merged.loc[DC_merged['Cluster Labels'] == 0, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Fort Stanton, DC",Park,Recreation Center,Museum,Art Gallery,Dog Run,Field,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
7,"Woodland, DC",Park,Recreation Center,Museum,Art Gallery,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
56,"Michigan Park, DC",Mexican Restaurant,Park,Yoga Studio,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field
68,"Langdon, DC",Memorial Site,Park,Dog Run,Yoga Studio,Fast Food Restaurant,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Field
92,"North Cleveland Park, DC",0,0,0,0,0,0,0,0,0,0
95,"Spring Valley, DC",Tennis Court,Athletics & Sports,Park,Yoga Studio,Ethiopian Restaurant,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
114,"Kingman Park, DC",Park,Taco Place,Liquor Store,Intersection,Pool,Yoga Studio,Exhibit,Eye Doctor,Falafel Restaurant,Farm
122,"Grant Park, DC",Park,Yoga Studio,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant
130,"Crestwood, DC",0,0,0,0,0,0,0,0,0,0


##### Cluster #1 (Urban Sub Mix):  This cluster has an urban-suburban mix feel.  There are many parks and other places to relax and visit.  There are also farmers markets in these neighborhoods that provide fresh groceries.


### Cluster 2

In [53]:
DC_merged.loc[DC_merged['Cluster Labels'] == 1, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"Fairlawn, DC",Liquor Store,Sandwich Place,Shop & Service,Deli / Bodega,Other Repair Shop,Yoga Studio,Event Space,Exhibit,Eye Doctor,Falafel Restaurant
60,"Brookland, DC",Stables,Gym,Empanada Restaurant,Event Space,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
67,"Gateway, DC",Liquor Store,Thrift / Vintage Store,Fried Chicken Joint,Gas Station,Business Service,Shipping Store,Wine Shop,Fish & Chips Shop,Filipino Restaurant,Flea Market
120,"NE Boundary, DC",Liquor Store,Yoga Studio,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant


##### Cluster #2 (Sub Feel):  This cluster offers a suburban feel while remaining close to the center of the city.  There are many affordable stores, restaurants, and farm and flea markets in these neighborhoods.

### Cluster 3

In [54]:
DC_merged.loc[DC_merged['Cluster Labels'] == 2, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Congress Heights, DC",Liquor Store,Ice Cream Shop,Convenience Store,American Restaurant,Intersection,Deli / Bodega,Road,Fried Chicken Joint,Tennis Court,Yoga Studio
4,"Knox Hill/Buena Vista, DC",Convenience Store,Grocery Store,Liquor Store,Fast Food Restaurant,Event Space,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
5,"Shipley, DC",Convenience Store,Wings Joint,Dance Studio,Liquor Store,Performing Arts Venue,Chinese Restaurant,Yoga Studio,Fast Food Restaurant,Exhibit,Eye Doctor
8,"Garfield Heights, DC",Chinese Restaurant,Convenience Store,Park,Art Gallery,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
11,"Dupont Park, DC",Liquor Store,Convenience Store,Pharmacy,Intersection,Restaurant,Bike Rental / Bike Share,Mobile Phone Shop,Bank,Seafood Restaurant,Sandwich Place
12,"Twining, DC",Liquor Store,Pharmacy,Restaurant,Convenience Store,Bike Rental / Bike Share,Fast Food Restaurant,Exhibit,Eye Doctor,Falafel Restaurant,Farm
15,"Penn Branch, DC",Bike Rental / Bike Share,Wings Joint,Convenience Store,Laundromat,Yoga Studio,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
34,"Naylor Gardens, DC",Convenience Store,Liquor Store,Playground,Coffee Shop,Bank,Grocery Store,Gym,Sandwich Place,Pizza Place,Cosmetics Shop
36,"Hillsdale, DC",Convenience Store,Gym,Spa,Yoga Studio,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market
37,"Benning Ridge, DC",Electronics Store,Convenience Store,Construction & Landscaping,Burger Joint,Yoga Studio,Field,Eye Doctor,Falafel Restaurant,Farm,Farmers Market


##### Cluster #3 (Dense Urban):  This cluster offers an urban feel, with high population density and relatively small convenience stores and other shops.


### Cluster 4

In [55]:
DC_merged.loc[DC_merged['Cluster Labels'] == 3, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,"Barry Farm, DC",Bus Stop,Basketball Court,Rental Car Location,Intersection,Home Service,Metro Station,Fast Food Restaurant,Event Space,Exhibit,Eye Doctor
23,"Georgetown Reservoir, DC",Tennis Court,Home Service,Deli / Bodega,Trail,Lake,Yoga Studio,Fast Food Restaurant,Exhibit,Eye Doctor,Falafel Restaurant
24,"Foxhall Village, DC",Tennis Court,Home Service,Bus Station,Trail,Sandwich Place,Lake,Yoga Studio,Farm,Event Space,Exhibit
25,"Fort Totten, DC",Miscellaneous Shop,Hospital,Grocery Store,Bus Stop,Memorial Site,Park,Fast Food Restaurant,Eye Doctor,Falafel Restaurant,Farm
26,"Pleasant Hill, DC",Bus Stop,Sandwich Place,Seafood Restaurant,Liquor Store,Chinese Restaurant,Dance Studio,Flower Shop,Flea Market,Fish & Chips Shop,Filipino Restaurant
27,"Kenilworth, DC",Chinese Restaurant,Liquor Store,Coffee Shop,Border Crossing,Park,Fast Food Restaurant,Exhibit,Eye Doctor,Falafel Restaurant,Farm
45,"Colonial Village, DC",Locksmith,Bus Station,Yoga Studio,Field,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
63,"Bloomingdale, DC",Bus Stop,Café,Yoga Studio,Dive Bar,Bus Station,Park,Food & Drink Shop,Grocery Store,Pizza Place,Dog Run
64,"Lincoln Park, DC",Liquor Store,Coffee Shop,Bus Stop,Pizza Place,Park,Yoga Studio,Farmers Market,Exhibit,Eye Doctor,Falafel Restaurant
85,"Burleith/Hillandale, DC",Dog Run,Beer Garden,Deli / Bodega,Bagel Shop,Yoga Studio,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


##### Cluster #4 (Heaven for Young Professionals):  The neighborhoods in this cluster offer many bars, restaurants, coffee shops, and parks. They also have many bus stops, trails, and markets (including farmers markets), which offer great convenience.  Many young professionals live in these neighborhoods. The public schools in Foxhall Village are highly rated.  Home and rent prices in this cluster are around the median for DC neighborhoods.  The neighborhoods in this cluster that also meet the other criteria (safety, bike trails and shopping centers) could be good candidates for Wei to evaluate further.


### Cluster 5

In [56]:
DC_merged.loc[DC_merged['Cluster Labels'] == 4, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Washington Highlands, DC",Grocery Store,Basketball Court,Asian Restaurant,Liquor Store,Snack Place,Seafood Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor
3,"Bellevue, DC",Baseball Field,American Restaurant,Pizza Place,Shoe Repair,Playground,Fast Food Restaurant,Event Space,Exhibit,Eye Doctor,Falafel Restaurant
6,"Douglass, DC",Bank,Breakfast Spot,Pizza Place,Sandwich Place,Video Store,Women's Store,Cosmetics Shop,Spa,Food & Drink Shop,Farmers Market
9,"Near Southeast, DC",Sushi Restaurant,Coffee Shop,Park,Gym / Fitness Center,Brewery,Café,Taco Place,Supermarket,Portuguese Restaurant,Burger Joint
10,"Capitol Hill, DC",Bar,Italian Restaurant,Pizza Place,American Restaurant,Playground,Deli / Bodega,Coffee Shop,Sushi Restaurant,Park,Gym / Fitness Center
13,"Randle Highlands, DC",Intersection,Sandwich Place,Seafood Restaurant,Bank,Mobile Phone Shop,Gym / Fitness Center,Yoga Studio,Event Space,Exhibit,Eye Doctor
17,"Historic Anacostia, DC",American Restaurant,Boutique,Comfort Food Restaurant,Outdoor Sculpture,Market,Convenience Store,History Museum,Art Gallery,Diner,Thrift / Vintage Store
18,"Columbia Heights, DC",Gym,Pizza Place,Bakery,Indian Restaurant,Pharmacy,Soccer Field,Café,Shoe Store,Shipping Store,Caribbean Restaurant
19,"Logan Circle/Shaw, DC",Bar,Wine Bar,Coffee Shop,Mexican Restaurant,Bakery,American Restaurant,Ethiopian Restaurant,Gym / Fitness Center,Cocktail Bar,Sushi Restaurant
20,"Cardozo/Shaw, DC",Bar,Coffee Shop,Pizza Place,Restaurant,Mexican Restaurant,American Restaurant,New American Restaurant,Southern / Soul Food Restaurant,Gay Bar,Thai Restaurant


##### Cluster #5 (Family Friendly):  The neighborhoods in this cluster provide many family-friendly facilities (e.g., baseball fields, playgrounds, trails) and stores.  The variety of restaurants is more limited.

### Cluster 6

In [57]:
DC_merged.loc[DC_merged['Cluster Labels'] == 5, DC_merged.columns[[0] + list(range(4, DC_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,"American University Park, DC",Italian Restaurant,Yoga Studio,Filipino Restaurant,Exhibit,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field


##### Cluster #6 (University Life):  There is only one neighborhood in this cluster.  It has many universities and schools as well as family-friendly facilities (e.g., doctors' offices, farmers markets and shopping centers).

## V. Results and Discussion <a name="results"></a>

##### See more detailed analysis of the data and clusters above.

##### Our analysis shows that Cluster 4 (Heaven for Young Professionals) could be the best candidate for Wei.  The neighborhoods in this cluster offer many bars, restaurants, coffee shops, and parks. They also have many bus stops, trails, and markets (including farmers markets), which offer great convenience.  Home and rent prices in this cluster are around the median for DC neighborhoods.  This cluster seems to match Wei's situation well.




## VI. Conclusion <a name="conclusion"></a>

##### The purpose of this project was to evaluate the top factors (safety, bike trails, shopping centers and venue availability) that Wei will consider when searching for a neighborhood.  Based on our analysis, the neighborhoods in Cluster 4 are good candidates for Wei to evaluate further.

##### Final selection of the neighborhood and an apartment will be made after Wei searches the available apartment listings and schedules onsite visits.  Wei will need to evaluate the actual rent, room condition, community facilities and the location of the apartment to make a final decision.