# ARLINGTON, VA NEIGHBORHOOD ANALYSIS

## Introduction

Several stakeholders are looking to open different restaurants in Arlington, VA.  Using the available location data, we want to recommend the best possible neighborhood for each business to be successful.  This type of analysis could substantially reduce the risk each stakeholder takes in opening a new restaurant and improve their return on investment.  Each restaurant should be in an area where it won't be drowned out by its competition and can draw in plenty of its own customers.

Overall, the goal of this project is to analyze location data received from the Foursquare API with the k-means clustering algorithm to determine the similarities and differences between neighborhoods.

## Data

To solve this problem, we will first need to know what neighborhoods are in Arlington, VA and where they are located (names, latitude, and longitude).  A list of neighborhoods can be found at 'https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Arlington_County,_Virginia', but the location data will have to be determined with Nominatim from the geopy library.  From there, we will need information on the different types of venues located in and around each neighborhood.  This data can be retrieved from the Foursquare API, and will be used to cluster the neighborhoods together and determine their similarities and differences.

## Methodology

### 1.)  Import Packages

In [1]:
!pip install folium==0.5.0
!pip install geopy
!pip install beautifulsoup4



In [2]:
#General
import pandas as pd
import numpy as np
import requests

# Web Scraping
import bs4 as bs

# Geospatial Info - convert address to latitude and longitude
from geopy.geocoders import Nominatim

# Mapping
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# Clustering 
from sklearn.cluster import KMeans

### 2.)  Import Data and Build Dataframe(s)

#### Web Scraping

In [3]:
# Scrape neighborhood names from internet

sauce = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Arlington_County,_Virginia').text
soup = bs.BeautifulSoup(sauce, 'lxml')

neighborhood_names = [neighborhood.text for neighborhood in soup.find('tbody').find_all('li')]

print('There are {} neighborhoods in Arlington, VA.'.format(len(neighborhood_names)))

neighborhood_names[0:15]

There are 73 neighborhoods in Arlington, VA.


['Alcova Heights',
 'Arlington Forest',
 'Arlington Heights',
 'Arlington Ridge',
 "Arlington View / Johnson's Hill",
 'Ashton Heights',
 'Aurora Highlands',
 'Aurora Hills',
 'Ballston',
 'Barcroft',
 'Bellevue Forest',
 'Bluemont',
 'Bon Air',
 'Boulevard Manor',
 'Brandon Village']

In [4]:
# Format neighborhood names as needed and determine latitude/longitude
latitude = []
longitude = []

# Create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent='arlington_explorer')

for i, neighborhood in enumerate(neighborhood_names):
    # Keep only the primary name when an alias or note follows it
    if ' /' in neighborhood:
        neighborhood_names[i] = neighborhood.split(' /')[0]
    elif ' (' in neighborhood:
        neighborhood_names[i] = neighborhood.split(' (')[0]
    
    try:
        # Note: the query uses the original, unformatted name
        location = geolocator.geocode('{}, Arlington, VA'.format(neighborhood))
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    except Exception:
        # geocode() returned None (no match) or the request failed
        latitude.append('N/A')
        longitude.append('N/A')
        
# Neighborhoods with additional names or info (post formatting)
for i in [4, 40, 49, 64]:
    print(neighborhood_names[i])

Arlington View
High View Park
Nauck
Waycroft-Woodlawn
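The name formatting above can be pulled out into a small helper so it is easy to check on its own. This is only a sketch: `clean_neighborhood_name` is a hypothetical name, not something defined in the notebook, and the second input string is invented for illustration.

```python
def clean_neighborhood_name(raw_name):
    """Return the primary name, dropping a ' / alias' or ' (note)' suffix."""
    # Same order of checks as the loop above: ' /' first, then ' ('
    for separator in (' /', ' ('):
        if separator in raw_name:
            return raw_name.split(separator)[0]
    return raw_name


print(clean_neighborhood_name("Arlington View / Johnson's Hill"))  # Arlington View
print(clean_neighborhood_name('Example Park (historic district)'))  # invented input -> Example Park
print(clean_neighborhood_name('Ballston'))  # Ballston
```

Keeping the rule in one function also makes it simple to pass the cleaned name to the geocoder, should that turn out to give better matches.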


In [5]:
ava_neighborhoods = pd.DataFrame(list(zip(neighborhood_names, latitude, longitude)), columns=['Neighborhood', 'Latitude', 'Longitude'])
pd.set_option('display.max_rows', None)

ava_neighborhoods

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alcova Heights,38.8646,-77.0972
1,Arlington Forest,38.8689,-77.1131
2,Arlington Heights,41.0062,-75.2126
3,Arlington Ridge,40.9841,-81.4939
4,Arlington View,,
5,Ashton Heights,,
6,Aurora Highlands,38.8528,-77.0684
7,Aurora Hills,38.8515,-77.0641
8,Ballston,38.882,-77.1115
9,Barcroft,38.8559,-77.1039


In [7]:
# For simplicity, keep only the neighborhoods where Nominatim returned plausible latitudes/longitudes

ava_neighborhoods.drop(ava_neighborhoods[ava_neighborhoods['Latitude'] == "N/A"].index, inplace = True)

m = (ava_neighborhoods['Latitude'].between(38,39)) & (ava_neighborhoods['Longitude'].between(-78,-77))

ava_filtered = ava_neighborhoods[m].reset_index(drop=True)

ava_filtered

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alcova Heights,38.8646,-77.0972
1,Arlington Forest,38.8689,-77.1131
2,Aurora Highlands,38.8528,-77.0684
3,Aurora Hills,38.8515,-77.0641
4,Ballston,38.882,-77.1115
5,Barcroft,38.8559,-77.1039
6,Bellevue Forest,38.9143,-77.1136
7,Bluemont,38.8747,-77.133
8,Bon Air,38.8732,-77.1266
9,Brandon Village,38.8757,-77.1158


In [8]:
print('The dataframe has {} neighborhoods.'.format(ava_filtered.shape[0]))

The dataframe has 50 neighborhoods.
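To illustrate why rows such as Arlington Heights (geocoded to 41.0062, -75.2126) are dropped, here is a plain-Python sketch of the same check: a point is kept only when both values are numeric and fall inside the loose 38 to 39 / -78 to -77 box used above. The helper name `is_plausible_arlington_point` is hypothetical.

```python
def is_plausible_arlington_point(lat, lon):
    """True when both values are numbers inside the notebook's bounding box."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        # 'N/A' placeholders and missing values end up here
        return False
    return 38 <= lat <= 39 and -78 <= lon <= -77


# Rows taken from the table printed earlier in this notebook
rows = [
    ('Alcova Heights', 38.8646, -77.0972),
    ('Arlington Heights', 41.0062, -75.2126),
    ('Arlington View', 'N/A', 'N/A'),
]
kept = [name for name, lat, lon in rows if is_plausible_arlington_point(lat, lon)]
print(kept)  # ['Alcova Heights']
```

The box is deliberately coarse; it only screens out matches that Nominatim placed in a different state.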


#### Foursquare API

In [13]:
# The code was removed by Watson Studio for sharing.
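The credentials cell was stripped before sharing, but later cells expect `CLIENT_ID` and `CLIENT_SECRET` to exist. One common way to supply them without committing secrets is to read them from environment variables. The sketch below assumes variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; those names are our own choice, not something the notebook or Foursquare prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the API credentials from a mapping (the process environment by default)."""
    # Assumed variable names; rename to match your own setup
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret


# In the notebook this would be used as:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```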

In [14]:
VERSION = '20201208'
LIMIT = 100
radius = 500 #meters

In [15]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
        
    return(nearby_venues)
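As a side note on the request URL built inside `getNearbyVenues`, the query string can also be assembled from a dictionary so that values are escaped automatically. This is a sketch using only the standard library; `build_explore_url` is a hypothetical helper, and the credentials passed in the example are placeholders.

```python
from urllib.parse import urlencode

# Endpoint as used in the function above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL with the same parameters the notebook sends."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes special characters such as the comma in 'll'
    return '{}?{}'.format(EXPLORE_ENDPOINT, urlencode(params))


# Placeholder credentials, Ballston coordinates from the table above
example_url = build_explore_url('ID', 'SECRET', '20201208', 38.882, -77.1115, 500, 100)
print(example_url)
```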

In [16]:
ava_venues = getNearbyVenues(names = ava_filtered['Neighborhood'], latitudes = ava_filtered['Latitude'], longitudes = ava_filtered['Longitude'])
print(ava_venues.shape)
ava_venues.head()

KeyError: 'groups'
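The `KeyError` means that at least one response had no `'groups'` entry under `'response'`; the notebook does not record why. A more defensive way to read the payload is sketched below. It assumes the response carries a `meta` block with an optional `errorType` field, and both example payloads are invented for illustration. `extract_explore_items` is a hypothetical helper.

```python
def extract_explore_items(payload):
    """Return (items, problem); problem is None when venues were found."""
    meta = payload.get('meta', {})
    groups = payload.get('response', {}).get('groups')
    if not groups:
        # Surface whatever the API reported instead of raising KeyError
        problem = meta.get('errorType') or 'no groups in response'
        return [], problem
    return groups[0].get('items', []), None


# Invented payloads: one successful, one rejected request
ok_payload = {
    'meta': {'code': 200},
    'response': {'groups': [{'items': [{'venue': {'name': 'Example Cafe'}}]}]},
}
error_payload = {'meta': {'code': 400, 'errorType': 'invalid_auth'}, 'response': {}}

print(extract_explore_items(ok_payload))     # ([{'venue': {'name': 'Example Cafe'}}], None)
print(extract_explore_items(error_payload))  # ([], 'invalid_auth')
```

With a helper like this, a failed request can be logged and skipped so that one bad response does not stop the whole loop.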

In [None]:
ava_restaurant = ava_venues[ava_venues['Venue Category'].str.contains('Restaurant')]

ava_restaurant.shape

In [None]:
# Use neighborhoods where there are 5 or more restaurants
ava_restaurant_count = ava_restaurant.groupby('Neighborhood').count()
temp = ava_restaurant_count[ava_restaurant_count['Venue']>=5]
select_neighborhoods = temp.index.tolist()

ava_restaurant_filtered = ava_restaurant[ava_restaurant['Neighborhood'].isin(select_neighborhoods)]

### 4.) Restructure Data

In [None]:
ava_onehot = pd.get_dummies(ava_restaurant_filtered[['Venue Category']], prefix="", prefix_sep="")

ava_onehot['Neighborhood'] = ava_restaurant_filtered['Neighborhood']

fixed_columns = [ava_onehot.columns[-1]] + list(ava_onehot.columns[:-1])
ava_onehot = ava_onehot[fixed_columns]

ava_onehot.head()

In [None]:
print('There are {} restaurant venues in these Arlington neighborhoods, and a total of {} different restaurant categories.'.format(ava_onehot.shape[0], ava_onehot.shape[1] - 1))

In [None]:
ava_grouped = ava_onehot.groupby('Neighborhood').mean().reset_index()
ava_grouped
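To make clear what the one-hot encoding followed by `groupby('Neighborhood').mean()` produces, here is a plain-Python sketch that computes the same quantity: for each neighborhood, the share of its restaurants that fall into each category. The venue list is made up for illustration and is not Foursquare data.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, category) pairs
venues = [
    ('Ballston', 'Thai Restaurant'),
    ('Ballston', 'Thai Restaurant'),
    ('Ballston', 'Italian Restaurant'),
    ('Ballston', 'Mexican Restaurant'),
    ('Barcroft', 'Italian Restaurant'),
    ('Barcroft', 'Italian Restaurant'),
]

# Count categories per neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Divide by the neighborhood total; this equals the mean of the one-hot columns
categories = sorted({category for _, category in venues})
frequencies = {
    neighborhood: {c: counter[c] / sum(counter.values()) for c in categories}
    for neighborhood, counter in counts.items()
}

print(frequencies['Ballston'])
# {'Italian Restaurant': 0.25, 'Mexican Restaurant': 0.25, 'Thai Restaurant': 0.5}
```

Each row sums to 1, so neighborhoods with many restaurants and neighborhoods with few are compared on their mix of cuisines rather than on raw counts.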

In [None]:
def return_most_common_venues(row, num_top_venues):
    # Skip the leading 'Neighborhood' entry, then rank categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0: num_top_venues]

In [None]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Positions past the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ava_grouped['Neighborhood']

for ind in np.arange(ava_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ava_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted
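The column labels above rely on a three-item suffix list plus an exception for everything else, which is fine for five columns. If `num_top_venues` were ever raised past 20, labels like "21th" would appear. A small sketch of a general ordinal helper follows; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Format a positive integer as an English ordinal, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # covers 11th, 12th, 13th
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 6)]
print(columns[1:])
# ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue', '4th Most Common Venue', '5th Most Common Venue']
```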

### 5.) Clustering

In [None]:
kclusters = 5

ava_grouped_clustering = ava_grouped.drop('Neighborhood', axis=1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ava_grouped_clustering)
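For readers who want to see what the clustering step does, below is a dependency-free sketch of the basic k-means (Lloyd) iteration: assign every point to its nearest centroid, move each centroid to the mean of its members, and repeat until assignments stop changing. This is not scikit-learn's implementation, which by default uses k-means++ initialization among other refinements. The function name `lloyd_kmeans` and the sample rows are made up for illustration.

```python
import math
import random


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Cluster equal-length tuples into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so centroids would not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Made-up frequency rows: (Italian share, Thai share) per neighborhood
rows = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = lloyd_kmeans(rows, k=2)
print(labels)  # the first two rows share one label, the last two share the other
```

Which group receives label 0 depends on the random start, which is also why the notebook fixes `random_state=0` for reproducible cluster numbers.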

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ava_final = ava_filtered[ava_filtered['Neighborhood'].isin(select_neighborhoods)]

ava_final = ava_final.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ava_final.head()

## Discussion

### Map Cluster

In [None]:
# Approximate center of Arlington, VA (lat/long)
latitude = 38.8816
longitude = -77.0910 

map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 13)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0,1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(ava_final['Latitude'], ava_final['Longitude'], ava_final['Neighborhood'], ava_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color = rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

### Examine Clusters

In [None]:
# First Cluster: neighborhood name plus the cluster label and top-venue columns
ava_final.loc[ava_final['Cluster Labels']==0, ava_final.columns[[0] + list(range(3, ava_final.shape[1]))]]