# Virginia Hospital Escapes - A comparison of the venues surrounding Virginia hospitals
## IBM Capstone Project
#### Author Micah C. Gray

## 1. Introduction

If you have ever spent a night at a hospital, either as a visitor or a staff member, you were probably grateful for any contact with the outside world. You might also have appreciated having venues nearby for food, prescriptions, flowers, or just a path with some fresh air. This report compares hospitals in the state of Virginia with respect to their surrounding venues, with the intent that your next hospital visit in Virginia is a little more freeing.

This analysis draws upon location data obtained from Foursquare.com in order to explore the diversity of venues surrounding the 200-plus hospitals in Virginia. I will attempt to discover the ten most common venue types surrounding each hospital. I will also filter out hospitals with few or no options for nearby food, pharmacies, or nature walks.

## 2. Data

### 2.1 Data Sources
Data about the hospitals in Virginia, including the geocoordinates, city, and state, were obtained from http://www.lat-long.com/ while data about the venues surrounding the hospitals were obtained from Foursquare.com using a 1000 meter radius.

### 2.2 Data Cleaning

My first step was to obtain the hospital data. I performed a search of Virginia hospitals on lat-long.com and got results in the form of a table that contained hospital name, feature type (hospital), county, and state. I was able to copy the table and paste it into an Excel spreadsheet. Latitude and longitude for each hospital were obtained one at a time and copied individually to new columns in the Excel spreadsheet. This was a manageable task given the size of my data (about 250 hospitals). Next I saved the spreadsheet and uploaded it to my Jupyter notebook in IBM's Watson Studio as a pandas dataframe.

In [2]:
import types
import pandas as pd

Read the hospital data from the .xlsx file on my local computer

In [3]:
#  ----------------------!!!  Remove/Hide this cell once the code has been run !!!-------------------------
from botocore.client import Config
import ibm_boto3
# -------------- Read the hospital data from the xls file on my local computer --------------------

def __iter__(self): return 0

# @hidden_cell
# !! Credentials removed !!

body = # portions removed (Key='Virginia_Hospital_Locations.xlsx')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_hospitals = pd.read_excel(body)
df_hospitals.head()


Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,A B Adams Convalescent Center,Hospital,Emporia (city),VA,36.685705,-77.537758
1,A D Williams Memorial Clinic,Hospital,Richmond,VA,37.539869,-77.430261
2,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693
3,Albemarle County Health Department,Hospital,Charlottesville (city),VA,38.042083,-78.482789
4,Alexander W Terrell Memorial Infirmary,Hospital,Lynchburg (city),VA,37.438475,-79.172247


In [4]:
# Next I save the dataframe as a .csv file for easy access
df_hospitals.to_csv('hospital_location_data.csv', index = None)

In [5]:
# import pandas as pd
df_hospitals2 = pd.read_csv('hospital_location_data.csv')
df_hospitals2.head()

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,A B Adams Convalescent Center,Hospital,Emporia (city),VA,36.685705,-77.537758
1,A D Williams Memorial Clinic,Hospital,Richmond,VA,37.539869,-77.430261
2,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693
3,Albemarle County Health Department,Hospital,Charlottesville (city),VA,38.042083,-78.482789
4,Alexander W Terrell Memorial Infirmary,Hospital,Lynchburg (city),VA,37.438475,-79.172247


### Cleanup
Now I begin cleaning up the hospital data.

In [6]:
# First I remove historical hospitals by dropping rows that have "historical" in the Name field.
bool_historical = df_hospitals2['Name'].str.contains('historical')
df_hospitals3=df_hospitals2[~bool_historical] # apply the boolean mask to the dataframe and save with a new name
df_hospitals3.head(14) # Let's see if it worked

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,A B Adams Convalescent Center,Hospital,Emporia (city),VA,36.685705,-77.537758
1,A D Williams Memorial Clinic,Hospital,Richmond,VA,37.539869,-77.430261
2,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693
3,Albemarle County Health Department,Hospital,Charlottesville (city),VA,38.042083,-78.482789
4,Alexander W Terrell Memorial Infirmary,Hospital,Lynchburg (city),VA,37.438475,-79.172247
5,Alleghany Memorial Hospital,Hospital,Covington (city),VA,37.794847,-79.999502
6,Alleghany Regional Hospital,Hospital,Alleghany,VA,37.792204,-79.88099
7,Andrew Rader Clinic,Hospital,Arlington,VA,38.87039,-77.07609
8,Arlington Free Clinic,Hospital,Arlington,VA,38.882449,-77.105439
9,Ashland Convalescent Center,Hospital,Hanover,VA,37.767643,-77.49554


You can see that row 13 was dropped. Now I will reset the index.

In [8]:
df_hospitals3.reset_index(inplace=True, drop = True) #Reset the index. Drop the current index.
df_hospitals3.head(14)

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,A B Adams Convalescent Center,Hospital,Emporia (city),VA,36.685705,-77.537758
1,A D Williams Memorial Clinic,Hospital,Richmond,VA,37.539869,-77.430261
2,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693
3,Albemarle County Health Department,Hospital,Charlottesville (city),VA,38.042083,-78.482789
4,Alexander W Terrell Memorial Infirmary,Hospital,Lynchburg (city),VA,37.438475,-79.172247
5,Alleghany Memorial Hospital,Hospital,Covington (city),VA,37.794847,-79.999502
6,Alleghany Regional Hospital,Hospital,Alleghany,VA,37.792204,-79.88099
7,Andrew Rader Clinic,Hospital,Arlington,VA,38.87039,-77.07609
8,Arlington Free Clinic,Hospital,Arlington,VA,38.882449,-77.105439
9,Ashland Convalescent Center,Hospital,Hanover,VA,37.767643,-77.49554


Just in case there are missing values, I'll drop any rows that contain them, such as rows with missing Latitude coordinates.

In [9]:
print('shape prior to dropping missing values:', df_hospitals3.shape) # print the dimensions of the dataframe
df_hospitals3.dropna(inplace=True)
print('shape after dropping missing values:', df_hospitals3.shape) # print the dimensions again to see changes

shape prior to dropping missing values: (247, 6)
shape after dropping missing values: (247, 6)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
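
The warning above typically appears when a dataframe produced by boolean filtering is later modified in place. Below is a minimal sketch of one common way to avoid it: take an explicit copy right after filtering. The three-row dataframe is made-up illustration data, not the hospital data set, and the variable names are assumptions.

```python
import pandas as pd

# Made-up illustration data (not the real hospital table)
df = pd.DataFrame({
    'Name': ['General Hospital', 'Old Infirmary (historical)', 'City Clinic'],
    'Latitude': [37.5, 38.0, None],
})

# Filter, then take an explicit copy so later in-place edits
# act on an independent dataframe rather than a slice of `df`
mask = df['Name'].str.contains('historical')
current = df[~mask].copy()

# In-place operations on the copy do not touch `df`
current.dropna(subset=['Latitude'], inplace=True)
current.reset_index(drop=True, inplace=True)

print(current)
```

The original `df` keeps all three rows, while `current` holds only the rows that survived both filters.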


It appears that my hospital data includes a veterinary hospital, dental facilities, nursing homes and psychiatric and mental hospitals. I'd like to separate out those other facilities so I can just focus on hospitals. I will change the 'Feature Type' in the original dataframe to differentiate between the different types of hospitals. I will also create five separate dataframes for the different hospital types in case I want to investigate any of the other types individually.

In [11]:
# Let's create five dataframes, one for nursing homes, one for psychiatric and
# mental hospitals, one for veterinary clinics, one for dental clinics,
# and one for everything else. 
    
# Create nursing home dataframe
nursing_home_bool = df_hospitals3['Name'].str.contains('Nursing') # create a boolean mask for nursing homes
df_hospitals3.loc[nursing_home_bool, 'Feature Type'] = 'Nursing Home' # update 'Feature Type' in the old dataframe
nursing_homes = df_hospitals3[nursing_home_bool] #create nursing_homes dataframe
nursing_homes.reset_index(drop=True, inplace=True)

# Create dental clinic dataframe
dental_bool = df_hospitals3['Name'].str.contains('Dental')
df_hospitals3.loc[dental_bool, 'Feature Type'] = 'Dental Clinic'
dental_clinics = df_hospitals3[dental_bool]
dental_clinics.reset_index(drop=True, inplace=True)

# Create psychiatric hospital dataframe
bool_psychiatric_hospitals = df_hospitals3['Name'].str.contains('Psychiatric')
df_hospitals3.loc[bool_psychiatric_hospitals, 'Feature Type'] = 'Psychiatric Hospital' #change 'Feature Type' in the old dataframe
df_psychiatric = df_hospitals3[bool_psychiatric_hospitals] 
bool_mental_hospitals = df_hospitals3['Name'].str.contains('Mental')
df_hospitals3.loc[bool_mental_hospitals, 'Feature Type'] = 'Psychiatric Hospital' #change 'Feature Type' in the old dataframe
psychiatric_hospitals = df_hospitals3[bool_mental_hospitals].append(df_psychiatric) # create psych dataframe
psychiatric_hospitals.reset_index(drop=True, inplace=True)

# Create veterinary hospital dataframe
veterinary_hospitals = df_hospitals3[df_hospitals3['Feature Type'].str.contains('Veterinary')]

# Create general hospital dataframe
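# strip white space from before and after the string; str.strip() returns a new Series, so assign it back
df_hospitals3['Feature Type'] = df_hospitals3['Feature Type'].str.strip()
hospitals_only=df_hospitals3[df_hospitals3['Feature Type'].str.startswith('Hospital')]
hospitals_only = hospitals_only.reset_index(drop=True) # reset_index returns a new dataframe unless inplace=True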

print('hospitals_only shape:', hospitals_only.shape)
print('nursing_homes shape:', nursing_homes.shape)
print('psychiatric_hospitals shape:', psychiatric_hospitals.shape)
print('veterinary_hospitals shape:', veterinary_hospitals.shape)
print('dental_clinics shape:', dental_clinics.shape)

df_hospitals3.groupby('Feature Type').count()
#hospitals_only.groupby('Feature Type').count()

hospitals_only shape: (219, 6)
nursing_homes shape: (15, 6)
psychiatric_hospitals shape: (11, 6)
veterinary_hospitals shape: (1, 6)
dental_clinics shape: (1, 6)


Unnamed: 0_level_0,Name,County,State,Latitude,Longitude
Feature Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dental Clinic,1,1,1,1,1
Hospital,219,219,219,219,219
Nursing Home,15,15,15,15,15
Psychiatric Hospital,11,11,11,11,11
Veterinary Hospital,1,1,1,1,1
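
The keyword-based relabeling above repeats the same three steps for each facility type. As a hedged sketch, the same idea can be driven by a single keyword-to-type mapping. The facility names below are made up for illustration, and `keyword_to_type` and `by_type` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Made-up facility names for illustration
df = pd.DataFrame({'Name': [
    'Riverside Nursing Home',
    'Valley Dental Clinic',
    'Central Psychiatric Institute',
    'County Mental Health Center',
    'Memorial Hospital',
]})
df['Feature Type'] = 'Hospital'  # default label before relabeling

# keyword found in Name -> facility type (same keywords as the cell above)
keyword_to_type = {
    'Nursing': 'Nursing Home',
    'Dental': 'Dental Clinic',
    'Psychiatric': 'Psychiatric Hospital',
    'Mental': 'Psychiatric Hospital',
}

for keyword, facility_type in keyword_to_type.items():
    matches = df['Name'].str.contains(keyword, regex=False)
    df.loc[matches, 'Feature Type'] = facility_type

# one dataframe per facility type, keyed by the type name
by_type = {t: g.reset_index(drop=True) for t, g in df.groupby('Feature Type')}
print(df['Feature Type'].value_counts())
```

Adding another facility type then only requires one more entry in the mapping.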


Before continuing with our 'hospitals_only' data, let's plot all of the hospital types on a map of Virginia. We will use folium to plot the map.

In [12]:
# Install folium
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
#df_hospitals3.head(2)


In [115]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

In [14]:
# Plot the different types of hospitals in Virginia on a map of Virginia.

# set the coordinates for Virginia
latitude = 37.9965159
longitude = -79.8305715

# create map of Virginia using folium
virginia_map = folium.Map(location=[latitude, longitude], zoom_start=7)

# 'Feature Types' has five different values (types of hospitals) so we will make 5
# different colors of markers on the map.
k_types = 5 
x = np.arange(k_types)
ys = [i + x + (i*x)**2 for i in range(k_types)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add hospital markers to map
#markers_colors = []
for lat, lng, hospital, f_type in zip(df_hospitals3['Latitude'], df_hospitals3['Longitude'],
                                      df_hospitals3['Name'], df_hospitals3['Feature Type']):
    feature_int = {'Hospital':0, 'Nursing Home':1, 'Psychiatric Hospital':2,
                   'Veterinary Hospital':3, 'Dental Clinic':4}
    label = '{}, {}'.format(hospital, f_type)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[feature_int.get(f_type, 0)],
        fill=True,
        fill_color=rainbow[feature_int.get(f_type, 0)],
        fill_opacity=0.7,
        parse_html=False).add_to(virginia_map)  
    
virginia_map

### Now get the Foursquare data

In [15]:
#  -------------------------------!!!  Remove credentials once the code has been run !!!----------------------------------

# Initialize Foursquare credentials
#CLIENT_ID =  !! credentials removed !!
#CLIENT_SECRET = !! credential removed !!

In [16]:
# Import necessary libraries
# import pandas as pd
# import numpy as np
import requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

In [17]:
# Define some parameters for the call to Foursquare
VERSION = '20180605' # Foursquare API version
RADIUS = 1000 # Include venues within a 1 kilometer radius
INTENT = 'browse'
LIMIT = 100
#search_query = 'Pharmacy' ## Optionally, search for food, garden, parks, walking trails

In [19]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the nested key when 'categories' is absent
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [20]:
# Define a function that returns a dataframe with nearby venue data from Foursquare given a
# name and pair of coordinates
def getNearbyVenues(name, lat, lng, RADIUS=1000):
    url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, RADIUS, LIMIT, INTENT)
    # send the get request
    results = requests.get(url).json()['response']['venues']
    nearby_venues = json_normalize(results) # flatten JSON and save as a dataframe
    
    #troubleshoot
    #print(nearby_venues.head()) # take a peek at the raw dataframe
    
    # filter columns
    filtered_columns = ['name', 'location.lat', 'location.lng', 'location.distance', 'categories']
    nearby_venues =nearby_venues.loc[:, filtered_columns]
    
    #troubleshoot distance
    #print("Distance:", nearby_venues["location.distance"]) # take a peek at the raw dataframe
    
    # filter the category for each row
    nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)
    
    #troubleshoot
    #print('After filtering categories \n', nearby_venues.head()) # take a peek at the raw dataframe
    
    # clean columns.
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    #print('After cleaning columns \n', nearby_venues.head()) # take a peek at the raw dataframe
    
    # label every venue row with the hospital it was found near
    # (assigning a scalar to a column broadcasts it to all rows)
    nearby_venues['Hospital'] = name
    
    #troubleshoot
    #print('After adding hospital name... \n', nearby_venues.head()) # take a peek at the raw dataframe
    
    # return the venue data, now labeled with the hospital name
    return nearby_venues
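
The function above assembles the request URL with one long format string. As a sketch of an alternative, the query string can be built from a dictionary with the standard library, which handles URL encoding and keeps each parameter on its own line. `build_search_url` is a hypothetical helper, not part of the notebook, and the credentials shown are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(client_id, client_secret, lat, lng,
                     radius=1000, limit=100, intent='browse',
                     version='20180605'):
    """Return the venue-search URL for one pair of coordinates."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'intent': intent,
    }
    return BASE_URL + '?' + urlencode(params)

# Placeholder credentials, for illustration only
url = build_search_url('MY_ID', 'MY_SECRET', 38.965666, -77.35693)
print(url)
```

The parameter names and default values mirror the ones used in the cells above; only the way the URL is put together differs.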


In [21]:
# Let's call the function above to get a dataframe with venues for all hospitals in Virginia.
# create a dictionary to initialize my dataframe
dictionary1 = {'name': ['value'], 'lat': ['NaN'],
               'lng': ['NaN'], 'distance': ['NaN'], 'categories': ['value'], 'Hospital': ['value']}
# create a dataframe to contain the combined venue data
all_venues = pd.DataFrame(dictionary1) 
for i, hospital in enumerate(df_hospitals3['Name']):
    # call the function to get nearby venues for each hospital
    all_venues = all_venues.append(getNearbyVenues(hospital, df_hospitals3.iloc[i,4],
                                                   df_hospitals3.iloc[i,5], RADIUS))
    # Append the nearby venue data for each hospital in the dataframe

# Testing on just one hospital    
#venue_data = getNearbyVenues(df_hospitals3.loc[12,'Name'], df_hospitals3.loc[12,'Latitude'], df_hospitals3.loc[12,'Longitude'], radius)
#venue_data.head() # This is the venue data for one hospital
#all_venues = all_venues.append(getNearbyVenues(df_hospitals3.loc[1,'Latitude'], df_hospitals.loc[1,'Longitude'], radius), sort=False)    

print('all_venues shape:', all_venues.shape)
all_venues.head()

all_venues shape: (23867, 6)


Unnamed: 0,name,lat,lng,distance,categories,Hospital
0,value,,,,value,value
0,Richardson Memorial Library,36.6872,-77.5412,352.0,Library,A B Adams Convalescent Center
1,Greensville County Courthouse,36.6858,-77.5426,436.0,Courthouse,A B Adams Convalescent Center
2,New Century Hospice - Emporia,36.6854,-77.5438,537.0,Medical Center,A B Adams Convalescent Center
3,Calvary Baptist Church,36.693,-77.5392,825.0,Church,A B Adams Convalescent Center


In [22]:
#Save the dataframe to a .csv
all_venues.to_csv('all_venues_rough.csv', index=None)

In [23]:
# Get the dataframe from the .csv
import pandas as pd
all_venues = pd.read_csv('all_venues_rough.csv')
all_venues.head()

Unnamed: 0,name,lat,lng,distance,categories,Hospital
0,value,,,,value,value
1,Richardson Memorial Library,36.687238,-77.541208,352.0,Library,A B Adams Convalescent Center
2,Greensville County Courthouse,36.685785,-77.542643,436.0,Courthouse,A B Adams Convalescent Center
3,New Century Hospice - Emporia,36.685387,-77.54377,537.0,Medical Center,A B Adams Convalescent Center
4,Calvary Baptist Church,36.693036,-77.539188,825.0,Church,A B Adams Convalescent Center


In [25]:
# Remove the null row and reset the index
#all_venues.drop('Unnamed: 0', axis = 1, inplace = True)
all_venues.drop(labels = 0,axis = 0, inplace = True)
all_venues.reset_index(drop=True, inplace=True)
#all_venues.head()

In [24]:
# Renaming the columns
all_venues.rename(columns={"name":"Venue","distance":"Meters from Hospital", "lat":"Venue Lat", "lng":"Venue Lng", "categories":"Category"}, inplace=True)

In [26]:
all_venues.head(35)

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
0,Richardson Memorial Library,36.687238,-77.541208,352.0,Library,A B Adams Convalescent Center
1,Greensville County Courthouse,36.685785,-77.542643,436.0,Courthouse,A B Adams Convalescent Center
2,New Century Hospice - Emporia,36.685387,-77.54377,537.0,Medical Center,A B Adams Convalescent Center
3,Calvary Baptist Church,36.693036,-77.539188,825.0,Church,A B Adams Convalescent Center
4,Peggy Malone - State Farm Insurance Agent,36.693969,-77.538062,920.0,Office,A B Adams Convalescent Center
5,Greensville County High School,36.682069,-77.543601,660.0,High School,A B Adams Convalescent Center
6,dr. adams foot care,36.694221,-77.543526,1078.0,Doctor's Office,A B Adams Convalescent Center
7,Veteran's Memorial Park,36.688216,-77.540897,395.0,Park,A B Adams Convalescent Center
8,Bus Stop,36.682384,-77.551267,1261.0,Bus Line,A B Adams Convalescent Center
9,Wkn It,36.697002,-77.541643,1304.0,Gym,A B Adams Convalescent Center
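
The 'Meters from Hospital' values can be sanity-checked against the coordinates with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6,371 km, so it is only an approximation of whatever Foursquare computes internally. `haversine_m` is a hypothetical helper name.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Coordinates taken from the tables above
hospital = (36.685705, -77.537758)   # A B Adams Convalescent Center
library = (36.687238, -77.541208)    # Richardson Memorial Library

d = haversine_m(*hospital, *library)
print(round(d))
```

The result should land close to the 352 meters reported for the library in the table above.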


In [27]:
# Let's see how many venues were returned for each hospital

all_venues.groupby('Hospital').count() # There should be up to 100 venues per hospital

Unnamed: 0_level_0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category
Hospital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A B Adams Convalescent Center,100,100,100,100,96
A D Williams Memorial Clinic,100,100,100,100,98
Access Emergency Hospital,100,100,100,100,96
Albemarle County Health Department,100,100,100,100,98
Alexander W Terrell Memorial Infirmary,100,100,100,100,91
Alleghany Memorial Hospital,100,100,100,100,95
Alleghany Regional Hospital,100,100,100,100,97
Andrew Rader Clinic,100,100,100,100,95
Arlington Free Clinic,100,100,100,100,96
Ashland Convalescent Center,100,100,100,100,90


## 3. Methodology
From the results listed above, it appears that 100 venues were obtained for each hospital. That should be enough to provide some useful insights about the venues for escape near each hospital in Virginia. As we saw from the map, hospitals in Virginia are very spread out, with most hospitals clustered around areas of dense population. Let's narrow our focus to just the hospitals in Northern Virginia. Then let's get the nearest 100 venues for each hospital from Foursquare. I'll just drop the venue data above and start over with selecting only general hospitals in northern Virginia.

In [18]:
# Setting the geographical limits for northern Virginia
west_limit = -78.00000 # longitude (western edge)
east_limit = -76.80000 # longitude (eastern edge)
north_limit = 39.00000 # latitude
south_limit = 38.50000 # latitude
hospitals_only.head(2)

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,A B Adams Convalescent Center,Hospital,Emporia (city),VA,36.685705,-77.537758
1,A D Williams Memorial Clinic,Hospital,Richmond,VA,37.539869,-77.430261


In [19]:
# Initializing a new dataframe to contain hospitals in northern Virginia
col_names = hospitals_only.columns
nova_hospitals = pd.DataFrame(columns = col_names)
nova_hospitals

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude


In [20]:
# Filling the new dataframe with hospitals within the limits of northern Virginia
for i, lat in enumerate(hospitals_only['Latitude']):
    if ((lat < north_limit) and (lat > south_limit)):
        lng = hospitals_only.iloc[i, 5]
        if ((lng > west_limit) and (lng < east_limit)):
            nova_hospitals.loc[len(nova_hospitals)] = hospitals_only.iloc[i] #add the row
            
nova_hospitals

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude
0,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693
1,Andrew Rader Clinic,Hospital,Arlington,VA,38.87039,-77.07609
2,Arlington Free Clinic,Hospital,Arlington,VA,38.882449,-77.105439
3,Burke Medical Center,Hospital,Fairfax,VA,38.78848,-77.29777
4,Circle Terrace Hospital,Hospital,Alexandria (city),VA,38.82678,-77.075533
5,Columbia Fairfax Surgical Center,Hospital,Fairfax (city),VA,38.849709,-77.315847
6,DeWitt Hospital,Hospital,Fairfax,VA,38.700393,-77.136646
7,Dominion Hospital,Hospital,Fairfax,VA,38.870112,-77.158591
8,Fair Oaks Medical Plaza,Hospital,Fairfax,VA,38.883723,-77.381375
9,Fair Oaks Professional Building,Hospital,Fairfax,VA,38.884001,-77.380542


Good! It looks like we have selected just the hospitals in northern Virginia. Now I will search Foursquare for the nearest 100 venues to each hospital.

In [21]:
# Set parameters for Foursquare call
VERSION = '20180605' # Foursquare API version
INTENT = 'browse'
LIMIT = 100
RADIUS = 1000 # Include venues within a 1 kilometer radius

In [22]:
# Let's call Foursquare and search for venues near the hospitals in northern Virginia

# create a list of column names to initialize my dataframe
col_list = ['name','lat','lng', 'distance', 'categories', 'Hospital']
# create a dataframe to contain the combined venue data
nova_venues = pd.DataFrame(columns = col_list) 
for i, hospital in enumerate(nova_hospitals['Name']):
    # call the function to get nearby venues for each hospital
    nova_venues = nova_venues.append(getNearbyVenues(hospital, nova_hospitals.iloc[i,4],
                                                   nova_hospitals.iloc[i,5], RADIUS))
    # Append the nearby venue data for each hospital in the dataframe

print('nova_venues shape:', nova_venues.shape)
nova_venues.head()

nova_venues shape: (2481, 6)


Unnamed: 0,name,lat,lng,distance,categories,Hospital
0,Dr. Michael Joseph Horwath M.D.,38.965462,-77.356857,23,Doctor's Office,Access Emergency Hospital
1,Inova Emergency Room - Reston/Herndon,38.96641,-77.35666,86,Emergency Room,Access Emergency Hospital
2,Inova Endocrinology,38.965462,-77.356857,23,Medical Center,Access Emergency Hospital
3,Harris Teeter,38.965674,-77.354899,175,Supermarket,Access Emergency Hospital
4,Office Depot,38.965969,-77.355501,128,Paper / Office Supplies Store,Access Emergency Hospital


We now have venue data for 2481 venues surrounding hospitals in northern Virginia. Our data includes the distance from the hospital and the category of the venue. Let's take a peek at the data types and statistics to see if there are any irregularities. Then I'll clean up the column names.

In [23]:
nova_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2481 entries, 0 to 99
Data columns (total 6 columns):
name          2481 non-null object
lat           2481 non-null float64
lng           2481 non-null float64
distance      2481 non-null object
categories    2375 non-null object
Hospital      2481 non-null object
dtypes: float64(2), object(4)
memory usage: 135.7+ KB


Great! I see just a couple of issues. 

First, I see that my distance data is of type 'object'. I'd like to convert it to type 'integer'.

Second, I see that about 100 venues have a null value for category (2481 entries, but only 2375 non-null categories in the output above). Since I will mostly be basing my analysis on that field, I should remedy those cells by filling them in or deleting the rows entirely.

In [24]:
# Changing the distance column to type 'int'
nova_venues['distance'] = nova_venues['distance'].astype('int')
nova_venues['distance'].dtypes # Let's check the type now

dtype('int64')

In [25]:
# Let's see which venues have no category listed - I'll use a boolean mask on the dataframe
nova_venues[nova_venues['categories']=='']

Unnamed: 0,name,lat,lng,distance,categories,Hospital


That didn't work. Let me try some other methods to see the rows with nothing in the 'categories' field.

In [28]:
nova_venues[nova_venues['categories']==False]

Unnamed: 0,name,lat,lng,distance,categories,Hospital


In [29]:
nova_venues[nova_venues['categories']==0]

Unnamed: 0,name,lat,lng,distance,categories,Hospital


In [36]:
nova_venues['categories'].hasnans

True

In [41]:
nova_venues['categories'].count()

2386

Hmmm. There are definitely missing values! Somehow I am going to find out which rows have NaN/null values.

In [38]:
nova_venues[nova_venues['categories']==np.nan]

Unnamed: 0,name,lat,lng,distance,categories,Hospital


In [39]:
bool_null_categories = nova_venues['categories']==False
nova_venues.loc[bool_null_categories]

Unnamed: 0,name,lat,lng,distance,categories,Hospital


In [39]:
nova_venues[nova_venues['categories'].isna()] # And we have a winner!!! 

# It also looks like I need to reset the index.

Unnamed: 0,name,lat,lng,distance,categories,Hospital
28,Fairfax Government Center,38.965222,-77.359142,197,,Access Emergency Hospital
63,"Baldino's Lock & Key, Reston",38.967257,-77.354741,259,,Access Emergency Hospital
69,Nv pools,38.961466,-77.356034,473,,Access Emergency Hospital
70,Horsey Headquarters,38.968832,-77.357163,353,,Access Emergency Hospital
76,Barber,38.968155,-77.354648,340,,Access Emergency Hospital
1,Radar clinic pharmacy,38.870095,-77.076140,33,,Andrew Rader Clinic
9,4x4 Canon,38.868331,-77.075859,230,,Andrew Rader Clinic
33,Piano Lessons,38.870925,-77.083549,649,,Andrew Rader Clinic
40,El Pollo Rico,38.868008,-77.077036,277,,Andrew Rader Clinic
97,S. Joyce Street,38.865173,-77.077203,588,,Andrew Rader Clinic


That worked! It also looks like I need to reset the index. Then I'll see if I can assign values to some of the missing categories.
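
The earlier attempts returned nothing because a missing category is stored as NaN, and NaN never compares equal to anything, not even to itself. A minimal sketch of the difference, using a made-up three-value series rather than the venue data:

```python
import numpy as np
import pandas as pd

# Made-up categories with one missing value
categories = pd.Series(['Library', np.nan, 'Park'])

# NaN never compares equal to anything, including itself,
# so an equality test matches no rows
print((categories == np.nan).sum())

# isna() is the dedicated test for missing values
print(categories.isna().sum())

# count() skips NaN, so length minus count gives the same number
print(len(categories) - categories.count())
```

This is also why `count()` reported fewer values than the number of rows: it counts only the non-missing entries.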

In [26]:
nova_venues.reset_index(drop=True, inplace=True) # reset the index
nova_venues[nova_venues['categories'].isna()] # list the rows with nothing in 'categories'

Unnamed: 0,name,lat,lng,distance,categories,Hospital
25,Fairfax Government Center,38.965587,-77.358506,136,,Access Emergency Hospital
62,Nv pools,38.961466,-77.356034,473,,Access Emergency Hospital
63,Horsey Headquarters,38.968832,-77.357163,353,,Access Emergency Hospital
68,Barber,38.968155,-77.354648,340,,Access Emergency Hospital
101,Radar clinic pharmacy,38.870095,-77.076140,33,,Andrew Rader Clinic
110,4x4 Canon,38.868331,-77.075859,230,,Andrew Rader Clinic
131,Piano Lessons,38.870925,-77.083549,649,,Andrew Rader Clinic
135,El Pollo Rico,38.868008,-77.077036,277,,Andrew Rader Clinic
192,S. Joyce Street,38.865173,-77.077203,588,,Andrew Rader Clinic
222,Northern Virginia Endodontic Associates,38.882880,-77.106590,110,,Arlington Free Clinic


In [27]:
# I'm going to manually assign a category for some of these venues, like this:
nova_venues.loc[nova_venues['name'].str.contains('Government'),'categories'] = 'Government Venue'

# This is how to manually change the 'categories' feature using index notation, but beware -
# the indexes will likely change the next time you get the data from Foursquare
#nova_venues.iloc[[2089, 2280, 2328, 2358, 2429, 2430],[4]]='Hospital'

nova_venues[nova_venues['categories'].isna()]

Unnamed: 0,name,lat,lng,distance,categories,Hospital
62,Nv pools,38.961466,-77.356034,473,,Access Emergency Hospital
63,Horsey Headquarters,38.968832,-77.357163,353,,Access Emergency Hospital
68,Barber,38.968155,-77.354648,340,,Access Emergency Hospital
101,Radar clinic pharmacy,38.870095,-77.076140,33,,Andrew Rader Clinic
110,4x4 Canon,38.868331,-77.075859,230,,Andrew Rader Clinic
131,Piano Lessons,38.870925,-77.083549,649,,Andrew Rader Clinic
135,El Pollo Rico,38.868008,-77.077036,277,,Andrew Rader Clinic
192,S. Joyce Street,38.865173,-77.077203,588,,Andrew Rader Clinic
222,Northern Virginia Endodontic Associates,38.882880,-77.106590,110,,Arlington Free Clinic
255,Black Shiny Building,38.882947,-77.103892,145,,Arlington Free Clinic


Good. Row 25 containing 'Fairfax Government Center' is no longer missing its category. Let's change some more.

In [28]:
nova_venues.loc[nova_venues['name'].str.contains('Lock & Key'),'categories'] = 'Locksmith'

nova_venues.loc[nova_venues['name'].str.contains('pool'),'categories'] = 'Pool'

nova_venues.loc[nova_venues['name'].str.contains('Barber'),'categories'] = 'Barbershop'

nova_venues.loc[nova_venues['name'].str.contains('pharmacy'),'categories'] = 'Pharmacy'

nova_venues.loc[nova_venues['name'].str.contains('El Pollo Rico'),'categories'] = 'Peruvian Restaurant'

nova_venues.loc[nova_venues['name'].str.contains('Endodontic'),'categories'] = 'Endodontic Dentist'

nova_venues.loc[nova_venues['name'].str.contains('Family Medicine'),'categories'] = 'Family Medicine'

nova_venues.loc[nova_venues['name'].str.contains('Salon'),'categories'] = 'Salon'

nova_venues.loc[nova_venues['name'].str.contains('Massage & Spa'),'categories'] = 'Massage and Spa'

nova_venues.loc[nova_venues['name'].str.contains('Chapel Next'),'categories'] = 'Christian Church'

nova_venues.loc[nova_venues['name'].str.contains('Church'),'categories'] = 'Christian Church'

nova_venues.loc[nova_venues['name'].str.contains('University'),'categories'] = 'University'

nova_venues.loc[nova_venues['name'].str.contains('Child Development'),'categories'] = 'Child Daycare'

nova_venues.loc[nova_venues['name'].str.contains('SAIC'),'categories'] = 'Corporate Office'

nova_venues.loc[nova_venues['name'].str.contains('Hospice'),'categories'] = 'Nursing Home'

nova_venues.loc[nova_venues['name'].str.contains('store'),'categories'] = 'Store'

nova_venues.loc[nova_venues['name'].str.contains("Doctors Office"),'categories'] = "Doctor's Office"

nova_venues.loc[nova_venues['name'].str.contains('Allergy'),'categories'] = 'Allergy Clinic'

nova_venues.loc[nova_venues['name'].str.contains('William Urology'),'categories'] = 'Urology Clinic'

nova_venues.loc[nova_venues['name'].str.contains('Medical Supply'),'categories'] = 'Medical Supply'

nova_venues.loc[nova_venues['name'].str.contains('Otolaryngology'),'categories'] = 'Otolaryngology Clinic'

nova_venues.loc[nova_venues['name'].str.contains('Gift Shop'),'categories'] = 'Gift Shop'

nova_venues.loc[nova_venues['name'].str.contains('Cardiovascular'),'categories'] = 'Cardiovascular Clinic'

nova_venues.loc[nova_venues['name'].str.contains('DVMS'),'categories'] = 'Veterinarian'

nova_venues[nova_venues['categories'].isna()]

Unnamed: 0,name,lat,lng,distance,categories,Hospital
63,Horsey Headquarters,38.968832,-77.357163,353,,Access Emergency Hospital
110,4x4 Canon,38.868331,-77.075859,230,,Andrew Rader Clinic
131,Piano Lessons,38.870925,-77.083549,649,,Andrew Rader Clinic
192,S. Joyce Street,38.865173,-77.077203,588,,Andrew Rader Clinic
255,Black Shiny Building,38.882947,-77.103892,145,,Arlington Free Clinic
271,Qahtani's Home,38.881889,-77.105887,73,,Arlington Free Clinic
288,The View from Tiffany's,38.882805,-77.104681,76,,Arlington Free Clinic
310,Fed Ex Box,38.787900,-77.297302,76,,Burke Medical Center
389,"Chestnut Woods, Burke, VA",38.789473,-77.291960,516,,Burke Medical Center
403,511 High Street,38.821184,-77.070401,765,,Circle Terrace Hospital
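
The run of near-identical `.loc` assignments above could also be driven from a single keyword-to-category table. The sketch below is one possible way to do that, not the code used in this notebook: the small `demo` dataframe and the helper name `fill_missing_categories` are made up for illustration. Unlike the cell above, it matches case-insensitively and only fills rows whose category is still missing; both choices are assumptions about what is wanted.

```python
import numpy as np
import pandas as pd

# Keyword -> category rules, applied in order (a small subset of the rules above)
RULES = {
    "Lock & Key": "Locksmith",
    "pool": "Pool",
    "Barber": "Barbershop",
    "pharmacy": "Pharmacy",
}

def fill_missing_categories(df, rules):
    """Fill missing 'categories' values for rows whose 'name' contains a keyword."""
    out = df.copy()
    for keyword, category in rules.items():
        # regex=False treats the keyword literally; na=False skips missing names
        hit = out["name"].str.contains(keyword, case=False, regex=False, na=False)
        out.loc[hit & out["categories"].isna(), "categories"] = category
    return out

# Hypothetical rows, not taken from nova_venues
demo = pd.DataFrame({
    "name": ["Ace Lock & Key", "Community Pool", "CVS Pharmacy", "Piano Lessons"],
    "categories": [np.nan, np.nan, "Drugstore", np.nan],
})
filled = fill_missing_categories(demo, RULES)
print(filled)
```

Keeping the rules in one dictionary makes duplicates easy to spot and lets the same rules be reapplied if the venue data is downloaded again.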


In [29]:
nova_venues.loc[nova_venues['name'].str.contains('JP Killeen'),'categories'] = 'Healthcare IT Business'

nova_venues.loc[nova_venues['name'].str.contains('Fed Ex Box'),'categories'] = 'FedEx Box'

nova_venues.loc[nova_venues['name'].str.contains('pool'),'categories'] = 'Pool'

nova_venues.loc[nova_venues['name'].str.contains('Cookie'),'categories'] = 'Cookie Shop'

nova_venues.loc[nova_venues['name'].str.contains("Women's Health"),'categories'] = "Women's Health Clinic"

nova_venues.loc[nova_venues['name'].str.contains('Imaging Center'),'categories'] = 'Hospital Imaging Center'

nova_venues.loc[nova_venues['name'].str.contains('Law Office'),'categories'] = 'Law Office'

nova_venues.loc[nova_venues['name'].str.contains('Chow Hall'),'categories'] = 'Chow Hall'

nova_venues.loc[nova_venues['name'].str.contains('Town hall'),'categories'] = 'Town Hall'

nova_venues.loc[nova_venues['name'].str.contains('Surgery Center'),'categories'] = 'Surgery Center'

nova_venues.loc[nova_venues['name'].str.contains('Grapevine'),'categories'] = 'Wine and Cigar Shop'

nova_venues[nova_venues['categories'].isna()]

Unnamed: 0,name,lat,lng,distance,categories,Hospital
63,Horsey Headquarters,38.968832,-77.357163,353,,Access Emergency Hospital
110,4x4 Canon,38.868331,-77.075859,230,,Andrew Rader Clinic
131,Piano Lessons,38.870925,-77.083549,649,,Andrew Rader Clinic
192,S. Joyce Street,38.865173,-77.077203,588,,Andrew Rader Clinic
255,Black Shiny Building,38.882947,-77.103892,145,,Arlington Free Clinic
271,Qahtani's Home,38.881889,-77.105887,73,,Arlington Free Clinic
288,The View from Tiffany's,38.882805,-77.104681,76,,Arlington Free Clinic
389,"Chestnut Woods, Burke, VA",38.789473,-77.291960,516,,Burke Medical Center
403,511 High Street,38.821184,-77.070401,765,,Circle Terrace Hospital
415,Home2,38.834996,-77.076001,915,,Circle Terrace Hospital


For better or worse, I'm going to have to delete the rest of the venues with no category.

In [30]:
# Now drop the remaining rows where "categories" is missing (NaN)
nova_venues = nova_venues[~(nova_venues['categories'].isna())]
nova_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2419 entries, 0 to 2480
Data columns (total 6 columns):
name          2419 non-null object
lat           2419 non-null float64
lng           2419 non-null float64
distance      2419 non-null int64
categories    2419 non-null object
Hospital      2419 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 132.3+ KB


In [31]:
# I thought I saw some duplicate venues, so let's drop any fully identical rows just to
# make sure we don't have the same venue listed twice for any one hospital.
nova_venues = nova_venues.drop_duplicates()
nova_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2419 entries, 0 to 2480
Data columns (total 6 columns):
name          2419 non-null object
lat           2419 non-null float64
lng           2419 non-null float64
distance      2419 non-null int64
categories    2419 non-null object
Hospital      2419 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 132.3+ KB
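
`drop_duplicates()` with no arguments compares every column, so an unchanged row count (2419 before and after) only rules out exact duplicates. If the worry is the same venue appearing twice for one hospital with slightly different coordinates or distances, checking a subset of columns may be more telling. A minimal sketch on a made-up `demo` dataframe that reuses the column names above:

```python
import pandas as pd

# Hypothetical rows: the first two are the same venue listed twice for one hospital
demo = pd.DataFrame({
    "name": ["Exxon", "Exxon", "Exxon", "Redbox"],
    "distance": [156, 158, 156, 167],
    "Hospital": ["Hospital A", "Hospital A", "Hospital B", "Hospital A"],
})

# Rows that repeat an earlier (name, Hospital) pair, even if other columns differ
repeat = demo.duplicated(subset=["name", "Hospital"], keep="first")
print(demo[repeat])

# Keep only the closest listing of each venue per hospital
deduped = (demo.sort_values("distance")
               .drop_duplicates(subset=["name", "Hospital"], keep="first")
               .sort_index())
print(deduped)
```

Whether two listings with the same name really are the same venue is a judgment call (chains can have several branches near one hospital), so the flagged rows are worth inspecting before dropping them.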


In [32]:
# Now let's see the statistics
nova_venues.describe()
#max(nova_venues['Meters from Hospital'])

Unnamed: 0,lat,lng,distance
count,2419.0,2419.0,2419.0
mean,38.838887,-77.254302,380.169905
std,0.076981,0.168454,1517.897249
min,38.247773,-77.8159,4.0
25%,38.818695,-77.364938,129.0
50%,38.859374,-77.227763,236.0
75%,38.88373,-77.110007,444.0
max,38.979113,-77.065195,68761.0


The max distance value (68,761 meters) shows that some venues were included that are not within our defined radius. Let's keep only the rows (venues) whose distance from a hospital is 1000 meters or less.

In [33]:
nova_venues = nova_venues[nova_venues['distance']<1001] # Keep rows where 'distance' is less than 1001
nova_venues.describe()

Unnamed: 0,lat,lng,distance
count,2376.0,2376.0,2376.0
mean,38.839115,-77.254646,308.046717
std,0.076339,0.168489,237.294274
min,38.630892,-77.8159,4.0
25%,38.818458,-77.365131,126.0
50%,38.85943,-77.227918,234.0
75%,38.883739,-77.109847,430.0
max,38.969783,-77.065195,999.0


That's much better. Now I'll rename the columns.

In [35]:
# Cleaning up column names
#nova_venues.drop(labels = 0,axis = 0, inplace = True)   # If there was a column named '0'
nova_venues.reset_index(drop = True, inplace=True) # resetting the index again
# Renaming the columns
nova_venues.rename(columns={"name":"Venue","distance":"Meters from Hospital", "lat":"Venue Lat", "lng":"Venue Lng", "categories":"Category"}, inplace=True)
nova_venues.head(25)

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
0,Dr. Michael Joseph Horwath M.D.,38.965462,-77.356857,23,Doctor's Office,Access Emergency Hospital
1,Inova Emergency Room - Reston/Herndon,38.96641,-77.35666,86,Emergency Room,Access Emergency Hospital
2,Inova Endocrinology,38.965462,-77.356857,23,Medical Center,Access Emergency Hospital
3,Harris Teeter,38.965674,-77.354899,175,Supermarket,Access Emergency Hospital
4,Office Depot,38.965969,-77.355501,128,Paper / Office Supplies Store,Access Emergency Hospital
5,Exxon,38.966936,-77.356145,156,Gas Station,Access Emergency Hospital
6,Long & Foster,38.967057,-77.357499,162,Building,Access Emergency Hospital
7,Fountain Dr and Baron Cameron,38.96675,-77.355102,199,Road,Access Emergency Hospital
8,Bank of America,38.966636,-77.357936,138,Bank,Access Emergency Hospital
9,Doubletake Salon,38.964868,-77.355648,142,Salon,Access Emergency Hospital


In [109]:
# Let's see how many venues we have for each hospital

nova_venues.groupby('Hospital')['Venue'].count() # Count the number of venues associated with each hospital

Hospital
Access Emergency Hospital                97
Andrew Rader Clinic                      97
Arlington Free Clinic                    96
Burke Medical Center                     95
Circle Terrace Hospital                  89
Columbia Fairfax Surgical Center         97
DeWitt Hospital                         100
Dominion Hospital                        97
Fair Oaks Medical Plaza                  88
Fair Oaks Professional Building          89
Fauquier Hospital                        97
Fort Myer Hospital                       93
Hospice of Northern Virginia             90
Inova Alexandria Hospital                99
Inova Fair Oaks Hospital                 97
Inova Fairfax Hospital                   98
Inova Mount Vernon Hospital              91
Jefferson Memorial Hospital              90
Joseph Willard Health Center             95
Northern Virginia Community Hospital     91
Potomac Hospital                         97
Prince William Hospital                  98
Reston Hospital Center 

Good! It looks like we have 100 or fewer venues for each hospital.

#### Let's find out how many unique categories of venues we have.

In [110]:
print('There are {} unique categories.'.format(len(nova_venues['Category'].unique())))

There are 282 unique categories.


Now that the data is clean and looking good, I want to do an analysis of the venues that are within walking distance of the hospitals and categorize them according to the most common venue types. I am arbitrarily going to set 300 meters as the walking distance. I am not concerned with venues that are actually roads or intersections, or that are a part of the hospital. So I will remove those venues first.

In [111]:
# nova_venues minus roads, intersections, and hospital-related venues (e.g. "Doctor's Office",
# "Medical Center", "Hospital", "Dentist's Office", and "Emergency Room"); see to_drop below.
to_drop = ['Road', 'Intersection', "Doctor's Office", 'Medical Center', 'Hospital', 'Hospital Ward',
           "Dentist's Office",'Allergy Clinic','Hospital Imaging Center','Maternity Clinic','Emergency Room']
nova_venues_no_h = nova_venues[~(nova_venues['Category'].isin(to_drop))]
#nova_venues_no_h.reset_index(drop = True, inplace=True) # reset the index

# Now I will remove any venues farther than 300 meters
nova_venues_no_h = nova_venues_no_h[nova_venues_no_h['Meters from Hospital']<301]
nova_venues_no_h.reset_index(drop=True, inplace=True)

# Let's see what we're left with
nova_venues_no_h.head(30)

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
0,Harris Teeter,38.965674,-77.354899,175,Supermarket,Access Emergency Hospital
1,Office Depot,38.965969,-77.355501,128,Paper / Office Supplies Store,Access Emergency Hospital
2,Exxon,38.966936,-77.356145,156,Gas Station,Access Emergency Hospital
3,Long & Foster,38.967057,-77.357499,162,Building,Access Emergency Hospital
4,Bank of America,38.966636,-77.357936,138,Bank,Access Emergency Hospital
5,Doubletake Salon,38.964868,-77.355648,142,Salon,Access Emergency Hospital
6,Reston Human Services Building,38.965335,-77.35885,170,Government Building,Access Emergency Hospital
7,Redbox,38.966079,-77.355072,167,Video Store,Access Emergency Hospital
8,North County Human Services Building,38.965587,-77.358506,136,Building,Access Emergency Hospital
9,Merchant's Tire & Auto Centers,38.967065,-77.355411,203,Automotive Shop,Access Emergency Hospital


### Venue Frequency Analysis
#### Let's transform the data in order to find out how many of each kind of venue are around each hospital

In [112]:
# To transform the data, I will use Pandas' get_dummies function to create a
# binary value for each venue based on type. This is called one-hot encoding.

# one hot encoding
hospital_onehot = pd.get_dummies(nova_venues_no_h[['Category']], prefix="", prefix_sep="")

# add hospital column back to dataframe
hospital_onehot['Target Hospital'] = nova_venues_no_h['Hospital'] 

# move hospital column to the first column
fixed_columns = [hospital_onehot.columns[-1]] + list(hospital_onehot.columns[:-1])
hospital_onehot = hospital_onehot[fixed_columns]

hospital_onehot.head()

Unnamed: 0,Target Hospital,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
0,Access Emergency Hospital,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Access Emergency Hospital,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Access Emergency Hospital,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Access Emergency Hospital,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Access Emergency Hospital,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Good, now I want to have just one row for each hospital. To consolidate the ones and zeros, I will take the mean of each column per hospital: the number of venues of a given type near the hospital divided by the total number of venues within 300 meters of that hospital. For example, a value of 0.25 in the Automotive Shop column on row 10 (Fauquier Hospital) means that a quarter of the venues surrounding that hospital are automotive shops.

In [113]:
# the mean of each one-hot column is the share of a hospital's venues that fall in that category
hospital_grouped = hospital_onehot.groupby('Target Hospital').mean().reset_index()
hospital_grouped

Unnamed: 0,Target Hospital,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
0,Access Emergency Hospital,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.021739,...,0.0,0.0,0.0,0.021739,0.021739,0.0,0.0,0.0,0.0,0.0
1,Andrew Rader Clinic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arlington Free Clinic,0.0,0.0,0.0,0.0,0.014925,0.0,0.014925,0.0,0.0,...,0.014925,0.014925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Burke Medical Center,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0
4,Circle Terrace Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Columbia Fairfax Surgical Center,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,DeWitt Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Dominion Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Fair Oaks Medical Plaza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Fair Oaks Professional Building,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
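
To make the meaning of these fractions concrete, here is a small worked example on invented data (the hospital names and categories are placeholders, not rows from `nova_venues_no_h`). It mirrors the one-hot encoding and `groupby(...).mean()` steps above and shows that each hospital's row sums to 1.

```python
import pandas as pd

# Hypothetical venues: four near Hospital A, two near Hospital B
demo = pd.DataFrame({
    "Hospital": ["Hospital A"] * 4 + ["Hospital B"] * 2,
    "Category": ["Bank", "Bank", "Diner", "Park", "Park", "Pharmacy"],
})

onehot = pd.get_dummies(demo["Category"], dtype=int)   # one 0/1 column per category
onehot.insert(0, "Target Hospital", demo["Hospital"])  # put the hospital name first

# The mean of a 0/1 column is the share of that hospital's venues in the category
share = onehot.groupby("Target Hospital").mean()
print(share)
print(share.sum(axis=1))  # each hospital's shares add up to 1
```

Here two of Hospital A's four venues are banks, so its `Bank` value is 0.5. Because the values are shares rather than counts, a hospital with few nearby venues can show large fractions from a single venue.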


#### Let's print each hospital along with the top 10 most common venue types around it.

In [74]:
num_top_venues = 10

for hood in hospital_grouped['Target Hospital']:
    print("----"+hood+"----")
    temp = hospital_grouped[hospital_grouped['Target Hospital'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Access Emergency Hospital----
                 venue  freq
0                 Bank  0.15
1             Building  0.07
2  Government Building  0.07
3       Hardware Store  0.07
4               Office  0.04
5          Gas Station  0.02
6   Mexican Restaurant  0.02
7        Grocery Store  0.02
8          Supermarket  0.02
9      Business Center  0.02


----Andrew Rader Clinic----
            venue  freq
0      Barbershop  0.13
1      University  0.07
2          Office  0.07
3  Baseball Field  0.07
4   Child Daycare  0.07
5   Grocery Store  0.07
6             Gym  0.07
7        Cemetery  0.07
8  Clothing Store  0.07
9    Tennis Court  0.07


----Arlington Free Clinic----
                                      venue  freq
0  Residential Building (Apartment / Condo)  0.16
1                                    Office  0.06
2                                  Building  0.06
3                                      Pool  0.06
4                                      Café  0.03
5                    

                   venue  freq
0               Building  0.25
1  Cardiovascular Clinic  0.12
2            Medical Lab  0.12
3     Physical Therapist  0.12
4                Parking  0.12
5                   Café  0.12
6           Nursing Home  0.12
7            Music Store  0.00
8            Music Venue  0.00
9             Nail Salon  0.00


----Virginia Hospital Center----
                   venue  freq
0               Bus Line  0.20
1         Surgery Center  0.10
2                Parking  0.10
3         Massage Studio  0.05
4            Beer Garden  0.05
5             Eye Doctor  0.05
6              Cafeteria  0.05
7  Cardiovascular Clinic  0.05
8                   Park  0.05
9            Bus Station  0.05




### Very interesting. Now I'd like to cluster these hospitals with respect to these values.

I will use K-Means. K-Means is unsupervised, so there are no true labels against which to score predictions, and a train/test accuracy procedure does not carry over to it. For reference, this is how the best k would be chosen for a supervised K-Nearest-Neighbors (KNN) classifier: reserve part of the data for testing, start with k = 1, fit on the training part, and calculate the prediction accuracy on the test set. Repeat this process, increasing k, and keep the k with the best accuracy. For K-Means, K is more commonly chosen by comparing inertia (the "elbow" method) or silhouette scores across candidate values.

# Reference only: choosing k for KNN. X (features) and y (known labels) are placeholders that
# this project does not have, so this cell is not run. KMeans cannot simply be swapped in for
# KNeighborsClassifier, because it has no y to be scored against.

import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Ks = 10
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train the model and predict
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n - 1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

mean_acc
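
One way to choose K for K-Means without labels is to compare inertia and silhouette scores across candidate values. The sketch below is illustrative only: it runs on synthetic points with three obvious groups rather than on `hospital_grouped`, and the candidate range of K is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

inertia = {}
silhouette = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = model.inertia_                         # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, model.labels_)  # higher is better, range [-1, 1]

best_k = max(silhouette, key=silhouette.get)
print("inertia by k:", {k: round(v, 1) for k, v in inertia.items()})
print("best k by silhouette:", best_k)
```

With only 25 hospitals and a few hundred sparse category columns, these scores are likely to be noisy on the real data, so they are better treated as a guide than as a definitive answer.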

### Cluster according to Frequency

In [None]:
# import kmeans library
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5 # a random guess, will optimize later if I have time

hospital_grouped_clustering = hospital_grouped.drop(columns='Target Hospital')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(hospital_grouped_clustering)

# check cluster labels generated for each row/hospital in the dataframe
labels=kmeans.labels_
labels

#### Great! I have grouped the hospitals based on the commonality of their surrounding venues. 

I saved the groupings in a variable called labels, which I will use later to plot these clusters on a map.

Now I would like to compare hospitals based on the distance of the closest venue of each venue type.

### Venue Distance Analysis

####  Let's transform the data in order to find out the distance of various venue types from each hospital. I will only keep the closest venues when two venues near a hospital are of the same type.

In [115]:
min_venue_distances = nova_venues_no_h.groupby(['Hospital','Category'])['Meters from Hospital'].min()
min_venue_distances = min_venue_distances.to_frame()
min_venue_distances

Unnamed: 0_level_0,Unnamed: 1_level_0,Meters from Hospital
Hospital,Category,Unnamed: 2_level_1
Access Emergency Hospital,Assisted Living,288
Access Emergency Hospital,Automotive Shop,203
Access Emergency Hospital,BBQ Joint,268
Access Emergency Hospital,Bank,97
Access Emergency Hospital,Bar,130
Access Emergency Hospital,Breakfast Spot,234
Access Emergency Hospital,Building,136
Access Emergency Hospital,Business Center,175
Access Emergency Hospital,Diner,188
Access Emergency Hospital,Fast Food Restaurant,211


#### Good! This will be useful for determining which hospitals lack a nearby pharmacy, park, or restaurant. 
Now I'll sort by distance.

In [116]:
min_venue_distances.sort_values(by=(['Hospital','Meters from Hospital']), axis = 0, inplace=True)
min_venue_distances_H = min_venue_distances.copy() # Save a copy of this dataframe with the hierarchical index in place

In [117]:
# Let's reset the index to make pivoting the table possible later
min_venue_distances.reset_index(inplace=True)
min_venue_distances.head(30)

Unnamed: 0,Hospital,Category,Meters from Hospital
0,Access Emergency Hospital,Bank,97
1,Access Emergency Hospital,Paper / Office Supplies Store,128
2,Access Emergency Hospital,Bar,130
3,Access Emergency Hospital,Building,136
4,Access Emergency Hospital,Government Venue,136
5,Access Emergency Hospital,Hardware Store,138
6,Access Emergency Hospital,Mobile Phone Shop,141
7,Access Emergency Hospital,Salon,142
8,Access Emergency Hospital,Spa,142
9,Access Emergency Hospital,Government Building,144


Let's substitute integer bin labels from one to five in place of the distances, with 5 being the closest. I will later assign the value 0 to venue types not found within 300 meters. This process is also known as 'binning'. I will add the 'Distance Bin' column to min_venue_distances and to the hierarchically indexed version, min_venue_distances_H.

In [118]:
# Six equally spaced edges from the min to the max distance define five equal-width bins
bins = np.linspace(min(min_venue_distances['Meters from Hospital']), max(min_venue_distances['Meters from Hospital']), 6)
print("Max distance is", max(min_venue_distances['Meters from Hospital']), ". bins=", bins)
value_list = [5, 4, 3, 2, 1] # I want them in reverse order so bin 5 has the closest venues and bin 1 has the farthest
# I use the Pandas function cut() to sort the values into bins
min_venue_distances["Distance Bin"] = pd.cut(min_venue_distances['Meters from Hospital'],
                                            bins, labels=value_list, include_lowest=True)
# Both dataframes are sorted identically, so the bin values can be copied over by position
min_venue_distances_H['Distance Bin'] = min_venue_distances['Distance Bin'].values
print(min_venue_distances.head(10))
print(min_venue_distances_H.head(10))

Max distance is 299 . bins= [  4.  63. 122. 181. 240. 299.]
                    Hospital                       Category  \
0  Access Emergency Hospital                           Bank   
1  Access Emergency Hospital  Paper / Office Supplies Store   
2  Access Emergency Hospital                            Bar   
3  Access Emergency Hospital                       Building   
4  Access Emergency Hospital               Government Venue   
5  Access Emergency Hospital                 Hardware Store   
6  Access Emergency Hospital              Mobile Phone Shop   
7  Access Emergency Hospital                          Salon   
8  Access Emergency Hospital                            Spa   
9  Access Emergency Hospital            Government Building   

   Meters from Hospital Distance Bin  
0                    97            4  
1                   128            3  
2                   130            3  
3                   136            3  
4                   136            3  
5           
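
The binning step can be checked in isolation. The sketch below uses a handful of illustrative distances, chosen so that the minimum (4) and maximum (299) reproduce the bin edges printed above; it is a standalone example, not a rerun of the notebook cell.

```python
import numpy as np
import pandas as pd

# Illustrative distances in meters (not the full min_venue_distances column)
distances = pd.Series([4, 97, 128, 136, 250, 299], name="Meters from Hospital")

# Six edges define five equal-width bins between the min and max distance
edges = np.linspace(distances.min(), distances.max(), 6)

# Labels run 5..1 so that the nearest bin scores highest
scores = pd.cut(distances, bins=edges, labels=[5, 4, 3, 2, 1], include_lowest=True)
print(edges)
print(pd.DataFrame({"meters": distances, "bin": scores}))
```

Each bin is closed on the right, so 97 meters falls in (63, 122] and gets a 4, while `include_lowest=True` is what lets the minimum value of 4 meters land in the first bin instead of becoming NaN.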

#### Let's pivot min_venue_distances and make the Hospital column the index

In [119]:
p_min_venue_distances = min_venue_distances.pivot(index='Hospital',columns='Category', values='Distance Bin')
p_min_venue_distances

Category,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
Hospital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Access Emergency Hospital,,,,,1.0,,,,2.0,1.0,...,,,,1.0,3.0,,,,,
Andrew Rader Clinic,,,,,,,,,,,...,2.0,,,,,,,,,
Arlington Free Clinic,,,,,3.0,,2.0,,,,...,5.0,4.0,,,,,,,,
Burke Medical Center,,4.0,,,,,,,,,...,,,,,1.0,,,,,
Circle Terrace Hospital,,,,,,,,,,,...,,,,,,,,,,
Columbia Fairfax Surgical Center,,,1.0,,,,,,,,...,,,,,,,,,,
DeWitt Hospital,,,,,,,,,,,...,,,,,,,,,,
Dominion Hospital,,,,,,,,,2.0,,...,,,,,,,,,,
Fair Oaks Medical Plaza,,,,,,,,,,,...,,,,,,,,,,
Fair Oaks Professional Building,,,,,,,,,,,...,,,,,,,,,,


Great! Now I need to do something about all the missing (NaN) values. Since these values represent venue categories that are not found in the near surroundings of the hospitals, I will assign a 0 value (representing the distance bin value).

In [120]:
p_min_venue_distances.replace(np.nan, 0, inplace=True) # replace NaN values with zero.
p_min_venue_distances

Category,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
Hospital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Access Emergency Hospital,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,1.0,...,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0
Andrew Rader Clinic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Arlington Free Clinic,0.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,...,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Burke Medical Center,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
Circle Terrace Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Columbia Fairfax Surgical Center,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DeWitt Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dominion Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fair Oaks Medical Plaza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fair Oaks Professional Building,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cluster According to Distance
Let's run K-Means again to see how the categories/labels compare with our previous labels based on frequency.

In [121]:
# run k-means clustering
kmeans_dist = KMeans(n_clusters=kclusters, random_state=0).fit(p_min_venue_distances)

# check cluster labels generated for each row in the dataframe
dist_labels=kmeans_dist.labels_
dist_labels

array([0, 3, 2, 0, 3, 3, 3, 3, 4, 4, 3, 3, 1, 3, 4, 1, 3, 3, 3, 3, 3, 1,
       1, 3, 1], dtype=int32)

Very interesting results! The clusters have changed considerably! Let's compare them side by side. I will add them to my nova_hospitals dataframe.

In [122]:
nova_hospitals['Frequency Class'] = labels
nova_hospitals['Distance Class'] = dist_labels
nova_hospitals

Unnamed: 0,Name,Feature Type,County,State,Latitude,Longitude,Frequency Class,Distance Class
0,Access Emergency Hospital,Hospital,Fairfax,VA,38.965666,-77.35693,0,0
1,Andrew Rader Clinic,Hospital,Arlington,VA,38.87039,-77.07609,0,3
2,Arlington Free Clinic,Hospital,Arlington,VA,38.882449,-77.105439,0,2
3,Burke Medical Center,Hospital,Fairfax,VA,38.78848,-77.29777,0,0
4,Circle Terrace Hospital,Hospital,Alexandria (city),VA,38.82678,-77.075533,2,3
5,Columbia Fairfax Surgical Center,Hospital,Fairfax (city),VA,38.849709,-77.315847,0,3
6,DeWitt Hospital,Hospital,Fairfax,VA,38.700393,-77.136646,3,3
7,Dominion Hospital,Hospital,Fairfax,VA,38.870112,-77.158591,0,3
8,Fair Oaks Medical Plaza,Hospital,Fairfax,VA,38.883723,-77.381375,4,4
9,Fair Oaks Professional Building,Hospital,Fairfax,VA,38.884001,-77.380542,4,4
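
K-Means label numbers are arbitrary, so cluster 3 in one run has no inherent relation to cluster 3 in another, and comparing the two columns row by row can overstate or understate the change. A contingency table or the adjusted Rand index compares the groupings themselves. A minimal sketch using only the ten label pairs shown in the table above (the remaining hospitals are left out, so the numbers are illustrative):

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

freq = [0, 0, 0, 0, 2, 0, 3, 0, 4, 4]   # 'Frequency Class' for the first ten hospitals
dist = [0, 3, 2, 0, 3, 3, 3, 3, 4, 4]   # 'Distance Class' for the same hospitals

# Rows: frequency clusters; columns: distance clusters; cells: number of hospitals
table = pd.crosstab(pd.Series(freq, name="Frequency Class"),
                    pd.Series(dist, name="Distance Class"))
print(table)

# 1.0 means identical groupings (up to relabeling); values near 0 mean chance-level agreement
ari = adjusted_rand_score(freq, dist)
print("adjusted Rand index:", round(ari, 3))
```

In the table, a row whose hospitals all land in a single column marks a group that both clusterings agree on, regardless of which numbers the two runs happened to assign.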


Cool! 
### Evaluating the Results
Let's evaluate each of the labels. It would be nice to know what the different categories represent. I should be able to pull up the venues for each category and see what they have in common.

In [123]:
# Create a 'Frequency Class' column for 'hospital_grouped'
hospital_grouped['Frequency Class'] = labels 
hospital_grouped.head()

Unnamed: 0,Target Hospital,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,...,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio,Frequency Class
0,Access Emergency Hospital,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.021739,...,0.0,0.0,0.021739,0.021739,0.0,0.0,0.0,0.0,0.0,0
1,Andrew Rader Clinic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,Arlington Free Clinic,0.0,0.0,0.0,0.0,0.014925,0.0,0.014925,0.0,0.0,...,0.014925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,Burke Medical Center,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0
4,Circle Terrace Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2


In [125]:
# Create a 'Distance Class' column for 'p_min_venue_distances'
p_min_venue_distances['Distance Class'] = dist_labels 
p_min_venue_distances.head()

Category,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio,Distance Class
Hospital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Access Emergency Hospital,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,1.0,...,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0
Andrew Rader Clinic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
Arlington Free Clinic,0.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
Burke Medical Center,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
Circle Terrace Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3


Perfect! You can see the new columns on the far right in the dataframes above.

In [126]:
# Let's see how many hospitals are assigned to each label
p_min_venue_distances['Distance Class'].value_counts()

3    14
1     5
4     3
0     2
2     1
Name: Distance Class, dtype: int64

In [127]:
hospital_grouped['Frequency Class'].value_counts()

0    16
4     6
3     1
2     1
1     1
Name: Frequency Class, dtype: int64

### Discussion
Interesting. When clustered by distance, 14 of the 25 hospitals fall into one class (class 3), with 5, 3, 2, and 1 hospitals in the other four. When clustered by frequency, 22 of the 25 hospitals fall into two of the five classes (16 in class 0 and 6 in class 4). Each of the other three classes holds a single hospital, which suggests outliers or anomalies.

Let's take a look at these results from another perspective. Let's group by the distance class and see what the distance values are for each of the venue types.
#### Evaluation by Distance

In [129]:
# Now I group by 'Distance Class' and average the distance values. The highest values will represent the closest types of venues.
by_class = p_min_venue_distances.groupby('Distance Class').mean()
by_class

Category,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
Distance Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,2.0,0.0,0.0,0.5,0.0,0.0,0.0,1.0,0.5,...,0.0,0.0,0.0,0.5,2.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.4,...,0.2,0.6,0.4,0.0,0.0,0.0,0.0,0.6,0.6,0.4
2,0.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,...,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.285714,0.0,0.071429,0.071429,0.5,0.071429,0.0,0.5,0.428571,0.0,...,0.357143,0.285714,0.0,0.0,0.142857,0.285714,0.214286,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The output above shows that this table cannot be evaluated by eye. There are too many columns, and most of the values are zero.

In [130]:
# There are too many columns to evaluate individually.
# Let's replace 0 values with NaN, which we can later remove
by_class.replace(float(0), np.nan, inplace=True)
by_class

Distance Class,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
0,,2.0,,,0.5,,,,1.0,0.5,...,,,,0.5,2.0,,,,,
1,1.0,,,,,,,,0.6,0.4,...,0.2,0.6,0.4,,,,,0.6,0.6,0.4
2,,,,,3.0,,2.0,,,,...,5.0,4.0,,,,,,,,
3,0.285714,,0.071429,0.071429,0.5,0.071429,,0.5,0.428571,,...,0.357143,0.285714,,,0.142857,0.285714,0.214286,,,
4,,,,,,,,,,,...,,,,,,,,,,
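Once the zeros are replaced by NaN, the non-empty categories of every class can be ranked in one pass instead of one cell per class. The following is a minimal sketch on a toy table. `toy_by_class` copies a few cells from the output above, and the helper `top_categories` is a hypothetical name, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for by_class: four columns and three classes copied from the table above
toy_by_class = pd.DataFrame(
    {
        "Antique Shop": [2.0, np.nan, np.nan],
        "Assisted Living": [0.5, np.nan, 3.0],
        "BBQ Joint": [0.5, 0.4, np.nan],
        "University": [np.nan, 0.2, 5.0],
    },
    index=pd.Index([0, 1, 2], name="Distance Class"),
)


def top_categories(table, n=3):
    """Map each class label to its n highest-valued, non-missing categories."""
    return {label: row.dropna().nlargest(n) for label, row in table.iterrows()}


ranked = top_categories(toy_by_class, n=2)
for label, series in ranked.items():
    print(label, list(series.index))
```

Applied to the full `by_class` table, the same helper would produce one ranked Series per class.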


In [144]:
# Let's look at class 0 by itself. There are 2 hospitals in this class
class_0= pd.Series(by_class.iloc[0].dropna())  # Drop missing values
class_0.sort_values(ascending=False, inplace=True)  # Sort by descending order
class_0.head(40)

Category
Mobile Phone Shop                4.0
Bank                             3.5
Salon                            3.5
Sandwich Place                   3.0
Building                         3.0
Fast Food Restaurant             3.0
Gas Station                      3.0
Playground                       2.5
Frozen Yogurt Shop               2.0
Gym / Fitness Center             2.0
Grocery Store                    2.0
Video Store                      2.0
FedEx Box                        2.0
Credit Union                     2.0
Camera Store                     2.0
Breakfast Spot                   2.0
Laundry Service                  2.0
Gym                              2.0
School                           2.0
Pharmacy                         2.0
Antique Shop                     2.0
Office                           1.5
Government Venue                 1.5
Baseball Field                   1.5
Supermarket                      1.5
Business Center                  1.5
Spa                          

#### Analysis:
There are 2 hospitals in category 0. These hospitals have a variety of venues within walking distance, including banks, fast food, a playground, and a pharmacy.

In [143]:
# Let's look at class 1 by itself. There are five hospitals in category 1.
class_1= pd.Series(by_class.iloc[1].dropna())  # Drop missing values
class_1.sort_values(ascending=False, inplace=True)  # Sort by descending order
class_1.head(40)

Category
Coffee Shop                    3.8
Building                       3.2
Office                         3.0
Café                           2.6
Cafeteria                      2.4
Surgery Center                 2.4
Coworking Space                2.0
Gift Shop                      2.0
Parking                        1.8
Bus Line                       1.6
Medical Lab                    1.2
Credit Union                   1.2
Cardiovascular Clinic          1.2
Health & Beauty Service        1.2
Gas Station                    1.2
Pool                           1.2
Optical Shop                   1.0
Laundry Service                1.0
Nail Salon                     1.0
Christian Church               1.0
Eye Doctor                     1.0
Corporate Office               1.0
Otolaryngology Clinic          1.0
American Restaurant            1.0
Pharmacy                       1.0
Restaurant                     1.0
Beer Garden                    1.0
Business Service               1.0
Bus Station

#### Analysis:
There are five hospitals in category 1. This class of hospital also has a variety of venues around it, but they are not as close by. Most of the nearby venues seem to be associated with the hospital, including the coffee shop, gift shop, and cafeteria. There is a pharmacy, restaurant, park, and bakery further away.

In [136]:
# Let's look at class 2 by itself. There is one hospital in this class.
class_2= pd.Series(by_class.iloc[2].dropna())  # Drop missing values
class_2.sort_values(ascending=False, inplace=True)  # Sort by descending order
class_2.head(50)

Category
Chiropractor                                5.0
Rafting                                     5.0
Indonesian Restaurant                       5.0
Office                                      5.0
Deli / Bodega                               5.0
Physical Therapist                          5.0
University                                  5.0
Professional & Other Places                 5.0
Juice Bar                                   5.0
Residential Building (Apartment / Condo)    5.0
Burger Joint                                5.0
Thai Restaurant                             5.0
Barbershop                                  5.0
Café                                        4.0
Christian Church                            4.0
Convenience Store                           4.0
Building                                    4.0
Boxing Gym                                  4.0
Endodontic Dentist                          4.0
Eye Doctor                                  4.0
Gym                            

#### Analysis: 
There is just one hospital in category 2, which makes it an outlier. This hospital also has a great variety of nearby venues, including a convenience store, several places to eat, a pool, a park, and a spa.

In [138]:
# Let's look at class 3 by itself. There are fourteen hospitals in class 3.
class_3 = pd.Series(by_class.iloc[3].dropna())  # Drop missing values
class_3.sort_values(ascending=False, inplace=True)  # Sort by descending order
class_3.head(50)

Category
Office                                      2.142857
Building                                    1.285714
Chiropractor                                1.142857
Government Building                         1.142857
Pharmacy                                    0.928571
Medical Lab                                 0.928571
Residential Building (Apartment / Condo)    0.928571
Spa                                         0.928571
School                                      0.857143
Café                                        0.714286
Event Space                                 0.642857
Professional & Other Places                 0.642857
Eye Doctor                                  0.571429
Parking                                     0.571429
Salon                                       0.571429
Physical Therapist                          0.500000
Salon / Barbershop                          0.500000
Auto Dealership                             0.500000
Assisted Living                      

#### Analysis:
There are 14 hospitals in category 3, which makes it difficult to assess whether or not a particular venue type is near a specific hospital. It appears that there is indeed a great variety of venues surrounding these hospitals, but you will likely have to walk 200 meters or more to get to one.

In [142]:
# Let's look at class 4 by itself. There are three hospitals in this class.
class_4 = pd.Series(by_class.iloc[4].dropna())  # Drop missing values
class_4.sort_values(ascending=False, inplace=True)  # Sort by descending order
class_4.head(20)

Category
Café                     4.666667
Pharmacy                 4.333333
Medical Lab              4.333333
Eye Doctor               4.333333
Building                 4.333333
Parking                  3.666667
Nursing Home             3.666667
Surgery Center           3.333333
Physical Therapist       3.333333
Coworking Space          3.000000
Cardiovascular Clinic    3.000000
College Gym              2.000000
Office                   1.333333
Nursery School           1.000000
Daycare                  0.333333
Conference Room          0.333333
Name: 4, dtype: float64

#### Analysis: 
There are 3 hospitals in category 4. They appear to have a smaller variety of nearby venues, and none that look like an escape, but a cafe and pharmacy are close by.

After this initial analysis, I will generalize the distance-based categories with the following names:
#### Distance Categories
    0 - Variety Mid Dist
    1 - Variety Farther Distance
    2 - Variety and Leisure Close
    3 - Pharm with Farther Variety
    4 - Cafe, Pharm and Med
    

In [157]:
# Let's save these category names as a dictionary. Start with a simple list.
dist_categories = ["Variety Mid Dist", "Variety Farther Distance","Variety and Leisure Close",
                   "Pharm with Farther Variety","Cafe, Pharm and Med"]

# Let's call it int_class since it maps
# an integer to the distance class/category.
# enumerate() pairs each name with its position (0-4) in the list.
int_class = dict(enumerate(dist_categories))
int_class

{0: 'Variety Mid Dist',
 1: 'Variety Farther Distance',
 2: 'Variety and Leisure Close',
 3: 'Pharm with Farther Variety',
 4: 'Cafe, Pharm and Med'}
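A dictionary like this can also attach the readable name to each hospital as a new column. The sketch below assumes a dataframe with a 'Distance Class' column. The hospital names and the 'Distance Label' column are made up for illustration.

```python
import pandas as pd

int_class = {
    0: "Variety Mid Dist",
    1: "Variety Farther Distance",
    2: "Variety and Leisure Close",
    3: "Pharm with Farther Variety",
    4: "Cafe, Pharm and Med",
}

# Hypothetical hospitals; only the 'Distance Class' column name comes from the notebook
toy_hospitals = pd.DataFrame(
    {"Name": ["Hospital A", "Hospital B", "Hospital C"], "Distance Class": [3, 0, 4]}
)

# Series.map looks up each integer label in the dictionary
toy_hospitals["Distance Label"] = toy_hospitals["Distance Class"].map(int_class)
print(toy_hospitals)
```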

#### Evaluation by Frequency
Now let's evaluate the classes that were based on the frequency (commonality) of venues near the hospitals.

In [147]:
# Group hospital_grouped by 'Frequency Class'
by_class_f = hospital_grouped.groupby('Frequency Class').mean()
by_class_f

Frequency Class,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
0,0.008319,0.001116,0.001736,0.005208,0.006966,0.001603,0.000933,0.011619,0.022043,0.002661,...,0.009404,0.007695,0.001302,0.001359,0.005279,0.0,0.001202,0.006579,0.001838,0.003289
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.0,0.0


In [148]:
# There are too many columns to evaluate individually.
# Let's replace 0 values with NaN, which we can later remove
by_class_f.replace(float(0), np.nan, inplace=True)
by_class_f

Frequency Class,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Assisted Living,Athletics & Sports,Auditorium,Auto Dealership,Automotive Shop,BBQ Joint,...,University,Urgent Care Center,Urology Clinic,Veterinarian,Video Store,Volleyball Court,Water Park,Women's Health Clinic,Women's Store,Yoga Studio
0,0.008319,0.001116,0.001736,0.005208,0.006966,0.001603,0.000933,0.011619,0.022043,0.002661,...,0.009404,0.007695,0.001302,0.001359,0.005279,,0.001202,0.006579,0.001838,0.003289
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,0.011905,,,,


In [158]:
# Let's look at class 0 by itself. There are 16 hospitals in this class
f_class_0= pd.Series(by_class_f.iloc[0].dropna())  # Drop missing values
f_class_0.sort_values(ascending=False, inplace=True)  # Sort by descending order
f_class_0.head(40)

Office                                      0.070048
Building                                    0.050580
Residential Building (Apartment / Condo)    0.025510
Medical Lab                                 0.024188
Government Building                         0.022616
Bus Line                                    0.022587
Bank                                        0.022405
Chiropractor                                0.022270
Automotive Shop                             0.022043
Christian Church                            0.020774
Parking                                     0.016170
Pharmacy                                    0.016016
Pool                                        0.015430
Cafeteria                                   0.015064
School                                      0.014377
Surgery Center                              0.013982
Coffee Shop                                 0.013166
Convenience Store                           0.012801
Salon                                       0.

#### Analysis:
Hospitals in frequency class 0 appear to be mostly surrounded by buildings.

In [151]:
# Let's look at class 1 by itself. There is 1 hospital in this class
f_class_1= pd.Series(by_class_f.iloc[1].dropna())  # Drop missing values
f_class_1.sort_values(ascending=False, inplace=True)  # Sort by descending order
f_class_1.head(10)

School         0.2
Office         0.2
Food Court     0.2
Coffee Shop    0.2
Beer Garden    0.2
Name: 1, dtype: float64

#### Analysis:
Frequency class 1 has just one hospital in it, and it has just five venues of five different types within walking distance. The only escapes are a food court, a coffee shop, and a beer garden.
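The step from "five values of 0.2" to "five venues of five types" rests on an assumption: each frequency is the share of a hospital's venues that fall in that category. That is how a mean over one-hot encoded venue rows would behave, but the computation is not shown in this section. Under that assumption, a rough sketch with a hypothetical helper `category_frequencies` reproduces the numbers:

```python
from collections import Counter


def category_frequencies(categories):
    """Return the share of venues belonging to each category."""
    counts = Counter(categories)
    total = len(categories)
    return {category: n / total for category, n in counts.items()}


# One venue of each type, as inferred for the single hospital in frequency class 1
venues = ["School", "Office", "Food Court", "Coffee Shop", "Beer Garden"]
freqs = category_frequencies(venues)
print(freqs)
```

Strictly speaking, equal shares of 0.2 only show that the five types occur equally often. Five venues is the smallest count consistent with that output.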

In [152]:
# Let's look at class 2 by itself. There is 1 hospital in this class
f_class_2= pd.Series(by_class_f.iloc[2].dropna())  # Drop missing values
f_class_2.sort_values(ascending=False, inplace=True)  # Sort by descending order
f_class_2.head(10)

Pool                   0.25
Music School           0.25
Housing Development    0.25
Elementary School      0.25
Name: 2, dtype: float64

#### Analysis:
Frequency class 2 has just one hospital in it, and it has just four venues of four different types within walking distance. The only escape is a pool.

In [153]:
# Let's look at class 3 by itself. There is 1 hospital in this class
f_class_3= pd.Series(by_class_f.iloc[3].dropna())  # Drop missing values
f_class_3.sort_values(ascending=False, inplace=True)  # Sort by descending order
f_class_3.head(10)

Playground               0.25
Military Base            0.25
Government Building      0.25
General Entertainment    0.25
Name: 3, dtype: float64

#### Analysis:
Frequency class 3 has just one hospital in it, and it has just four venues of four different types within walking distance. The two possible escapes are a playground and general entertainment.

In [156]:
# Let's look at class 4 by itself. There are 6 hospitals in this class
f_class_4= pd.Series(by_class_f.iloc[4].dropna())  # Drop missing values
f_class_4.sort_values(ascending=False, inplace=True)  # Sort by descending order
f_class_4.head(30)

Café                      0.124356
Building                  0.116618
Physical Therapist        0.091023
Parking                   0.066642
Medical Lab               0.061880
Eye Doctor                0.057714
Nursing Home              0.049976
Cardiovascular Clinic     0.049976
Office                    0.047967
College Gym               0.029142
Surgery Center            0.029142
Pharmacy                  0.029142
Coworking Space           0.029142
Furniture / Home Store    0.016667
General Travel            0.016667
Bar                       0.016667
Spa                       0.016667
Bus Line                  0.016667
Baseball Field            0.011905
Volleyball Court          0.011905
Event Space               0.011905
Tech Startup              0.011905
Mental Health Office      0.011905
Optical Shop              0.011905
Police Station            0.011905
School                    0.011905
Government Venue          0.011905
Conference Room           0.009259
Daycare             

#### Analysis:
Frequency class 4 has 6 hospitals in it. There are a variety of venues around them, the most common of which are cafes, buildings, and physical therapists. The escapes are not common, but include bars and spas.

After this initial analysis, I will generalize the frequency-based categories with the following names:
#### Frequency Categories
    0 - Mostly Office Buildings
    1 - Food Court, Coffee Shop, Beer Garden
    2 - Elem. School and Pool
    3 - Entertainment and Playground
    4 - Cafes and Buildings

In [159]:
# Let's save these category names as a dictionary
int_f_class = {0:"Mostly Office Buildings", 1:"Food Court, Coffee Shop, Beer Garden", 2:"Elem. School and Pool",
              3:"Entertainment and Playground", 4:"Cafes and Buildings"}

In [160]:
int_class.get(4)

'Cafe, Pharm and Med'

In [161]:
int_f_class.get(4)

'Cafes and Buildings'

Before trying the easier method above, I tried this: I filtered 'nova_hospitals' with one of the 'Distance Class' labels to retrieve a list of hospitals. I then filtered the 'nova_venues_no_h' dataframe according to those hospitals. I added the bin value column (using join) from 'min_venue_distances', using Hospital and Category as keys. Lastly, I filtered the dataframe for just those rows with the closest venues ('Distance Bin' = 10) to see what kinds of venues were represented. Look at the Category column in the output field below to see what kinds of venues are nearest to the hospitals in category 0 of my 'Distance Class' cluster. I opted out of using this method because the output is so verbose that I cannot even see all of the venue categories represented.

In [100]:
# Here's another way to look at venues based on Distance Class 
# Get hospital 'Name' where 'Distance Class'==0
Hospital_series = nova_hospitals[nova_hospitals['Distance Class']==0]['Name']
# Keep just the venues that belong to the hospitals in Hospital_series
filtered_hospitals = nova_venues_no_h[nova_venues_no_h['Hospital'].isin(Hospital_series)]
# Join on the hierarchical index for min_venue_distances: min_venue_distances_H
combined_venues = filtered_hospitals.join(min_venue_distances_H, on=['Hospital', 'Category'], how='inner',
                                          lsuffix='', rsuffix='.2')
# Remove unnecessary columns
combined_venues.drop(['Venue Lat', 'Venue Lng', 'Meters from Hospital','Meters from Hospital.2'], axis=1, inplace=True)
# Filter the dataframe for just the closest venues ('Distance Bin'=10)
combined_venues = combined_venues[combined_venues['Distance Bin'].isin([10])]
combined_venues.reset_index(drop=True, inplace=True)
combined_venues

Unnamed: 0,Venue,Category,Hospital,Distance Bin
0,Bank of America,Bank,Access Emergency Hospital,10
1,Citibank,Bank,Access Emergency Hospital,10
2,HSBC,Bank,Access Emergency Hospital,10
3,United Bank,Bank,Access Emergency Hospital,10
4,Middleburg Bank,Bank,Access Emergency Hospital,10
5,TD Bank,Bank,Access Emergency Hospital,10
6,NFCU ATM,Bank,Access Emergency Hospital,10
7,Capital One Bank,Bank,Access Emergency Hospital,10
8,Nexus Systems,Office,Dominion Hospital,10
9,Emerson Lee CPA,Office,Dominion Hospital,10


Getting back to the two classes of clusters, let's see how they look on a map!
### Map the hospitals in the distance-based clusters!

In [88]:
# First I'll import FloatImage, the folium plugin for overlaying an image (my legend) on a map
from folium.plugins import FloatImage

In [93]:
# My attempt to add a legend image to the map
# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about the possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
streaming_body_1 = client_ab73e7ab23a545909f79eeb564d5629e.get_object(Bucket='ibmdatasciencecapstoneprojectchar-donotdelete-pr-u5aqkb54k9pyz9', Key='Legend.png')['Body']
# add missing __iter__ method so pandas accepts body as file-like object
if not hasattr(streaming_body_1, "__iter__"): streaming_body_1.__iter__ = types.MethodType( __iter__, streaming_body_1 ) 
streaming_body_1  # display the object returned by get_object

# I'm not sure how to load or read the .png file from the StreamingBody

<ibm_botocore.response.StreamingBody object at 0x7f9f761e2f98>
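One possible way forward is to read the raw bytes out of the body and wrap them in a base64 data URI, which is a plain string that can be used wherever an image URL is expected. The sketch below is untested against Watson Studio. It uses `io.BytesIO` as a stand-in for the StreamingBody, relying only on a `read()` method. Whether `FloatImage` accepts a data URI is an assumption that should be checked against the folium documentation. The helper name `png_to_data_uri` is hypothetical.

```python
import base64
import io


def png_to_data_uri(body):
    """Read PNG bytes from a file-like object and return them as a data URI."""
    raw = body.read()  # assumed to return all remaining bytes, as file-like objects do
    encoded = base64.b64encode(raw).decode("ascii")
    return "data:image/png;base64," + encoded


# Stand-in for streaming_body_1: just the 8-byte PNG signature, not a real image
fake_body = io.BytesIO(b"\x89PNG\r\n\x1a\n")
legend_uri = png_to_data_uri(fake_body)
print(legend_uri)
```

A stream can normally be read only once, so the resulting string should be stored rather than re-read from the body.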


In [165]:
# Now let's map the hospitals and label them according to the distance class

# set the coordinates for Northern Virginia (use Sunrise at Fair Oaks)
latitude = 38.8223352 # To automate the center of the data: nova_hospitals['Latitude'].mean()
longitude = -77.3833199 # This works best if the points are evenly distributed in all quadrants

# create map of northern Virginia using folium
n_virginia_map = folium.Map(location=[latitude, longitude], zoom_start=10)

Title = 'Hospitals Classified by Surrounding Venues'
# Now I will map the hospitals in the 'nova_hospitals' dataframe and add the hospital name
# and category as labels, according to the 'Distance Class'.

# 'Distance Class' has five different values (types of hospitals) so we will make 5
# different colors of markers on the map.
k_types = 5 
x = np.arange(k_types)
ys = [i + x + (i*x)**2 for i in range(k_types)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add hospital markers to map
#markers_colors = []
for lat, lng, hospital, d_class in zip(nova_hospitals['Latitude'], nova_hospitals['Longitude'],
                                       nova_hospitals['Name'], nova_hospitals['Distance Class']):
           
    label = '{}, {}'.format(hospital, int_class.get(d_class))
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[d_class],
        fill=True,
        fill_color=rainbow[d_class],
        fill_opacity=0.7,
        parse_html=False).add_to(n_virginia_map)  

#FloatImage(Legend_image, bottom=0, left=1).add_to(n_virginia_map)
n_virginia_map

    Legend
    Purple - Variety Mid Dist
    Lt Blue - Variety Farther Distance
    Aqua Grn - Variety and Leisure Close
    Orange - Pharm with Farther Variety
    Red - Cafe, Pharm and Med
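A legend like the one above could also be generated from the category dictionary rather than typed by hand. The sketch below builds a small HTML block in plain Python. The hex colors are placeholders standing in for the notebook's `rainbow` list, and `build_legend_html` is a hypothetical helper. Attaching the markup to the map, shown in the final comment, is an assumption about folium that is not exercised here.

```python
import html

int_class = {
    0: "Variety Mid Dist",
    1: "Variety Farther Distance",
    2: "Variety and Leisure Close",
    3: "Pharm with Farther Variety",
    4: "Cafe, Pharm and Med",
}

# Placeholder colors; in the notebook these would be taken from the `rainbow` list
legend_colors = ["#8000ff", "#00b5eb", "#80ffb4", "#ffb360", "#ff0000"]


def build_legend_html(labels, palette):
    """Return an HTML snippet with one colored swatch per class label."""
    rows = []
    for key in sorted(labels):
        rows.append(
            '<div><span style="display:inline-block;width:12px;height:12px;'
            'background:{};"></span> {}</div>'.format(palette[key], html.escape(labels[key]))
        )
    return (
        '<div style="position:fixed;bottom:20px;left:20px;z-index:9999;'
        'background:white;padding:8px;">' + "".join(rows) + "</div>"
    )


legend_html = build_legend_html(int_class, legend_colors)
print(legend_html)
# Assumption: with folium imported, the snippet could be attached to the map with
# n_virginia_map.get_root().html.add_child(folium.Element(legend_html))
```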

### Map the hospitals in the frequency-based clusters!

In [163]:
# Now let's map the hospitals and label them according to the frequency class

# set the coordinates for Northern Virginia (use Sunrise at Fair Oaks)
latitude = 38.8223352 # To automate the center of the data: nova_hospitals['Latitude'].mean()
longitude = -77.3833199 # This works best if the points are evenly distributed in all quadrants

# create map of northern Virginia using folium
n_virginia_map_f = folium.Map(location=[latitude, longitude], zoom_start=10)

Title = 'Hospitals Classified by Surrounding Venues'
# Now I will map the hospitals in the 'nova_hospitals' dataframe and add the hospital name
# and category as labels, according to the 'Frequency Class'.

# 'Frequency Class' has five different values (types of hospitals) so we will make 5
# different colors of markers on the map.
k_types = 5 
x = np.arange(k_types)
ys = [i + x + (i*x)**2 for i in range(k_types)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add hospital markers to map
#markers_colors = []
for lat, lng, hospital, d_class in zip(nova_hospitals['Latitude'], nova_hospitals['Longitude'],
                                       nova_hospitals['Name'], nova_hospitals['Frequency Class']):
           
    label = '{}, {}'.format(hospital, int_f_class.get(d_class))
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[d_class],
        fill=True,
        fill_color=rainbow[d_class],
        fill_opacity=0.7,
        parse_html=False).add_to(n_virginia_map_f)  

#FloatImage(Legend_image, bottom=0, left=1).add_to(n_virginia_map_f)
n_virginia_map_f

    Legend
    Purple - Mostly Office Buildings
    Lt Blue - Food Court, Coffee Shop, Beer Garden
    Aqua Grn - Elem. School and Pool
    Orange - Entertainment and Playground
    Red - Cafes and Buildings

### That's great but...

Although these clusters are certainly intriguing, they don't really capture the information that I feel would be most helpful to hospital patrons. Instead of relying on a machine learning algorithm to group the hospitals, I will group them according to my own criteria. I believe that the venues hospital patrons visit most often are fast food restaurants, parks, and convenience stores. I am assuming that patients in the hospital would not need a pharmacy until they are released, but a pharmacy can also serve as a convenience store, so I will include it in my target venues. Let's make the following clusters:

    1 - Park, Fast Food, and Convenience Store
    2 - Fast Food and Park
    3 - Fast Food and Convenience Store
    4 - Park and Convenience Store
    5 - Park
    6 - Fast Food
    7 - Convenience Store
    8 - None of the three

In [28]:
my_clusters = {1:'Park, Fast Food, and Convenience Store', 2:'Fast Food and Park', 
                3:'Fast Food and Convenience Store', 4:'Park and Convenience Store',
                5:'Park', 6:'Fast Food', 7:'Convenience Store', 8:'None of the three'}
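With the clusters named, each hospital can be assigned a number from the set of venue categories found near it. The sketch below is one way to do that. The category names in `PARK`, `FAST_FOOD`, and `CONVENIENCE` are assumed spellings of the Foursquare categories, and `assign_cluster` is a hypothetical helper rather than the notebook's code.

```python
my_clusters = {1: 'Park, Fast Food, and Convenience Store', 2: 'Fast Food and Park',
               3: 'Fast Food and Convenience Store', 4: 'Park and Convenience Store',
               5: 'Park', 6: 'Fast Food', 7: 'Convenience Store', 8: 'None of the three'}

# Assumed category names for each target group (pharmacies count as convenience stores)
PARK = {"Park"}
FAST_FOOD = {"Fast Food Restaurant"}
CONVENIENCE = {"Convenience Store", "Pharmacy"}

# (has park, has fast food, has convenience store) -> cluster number
CLUSTER_LOOKUP = {
    (True, True, True): 1,
    (True, True, False): 2,
    (False, True, True): 3,
    (True, False, True): 4,
    (True, False, False): 5,
    (False, True, False): 6,
    (False, False, True): 7,
    (False, False, False): 8,
}


def assign_cluster(categories):
    """Return the cluster number for the venue categories found near one hospital."""
    found = set(categories)
    key = (bool(found & PARK), bool(found & FAST_FOOD), bool(found & CONVENIENCE))
    return CLUSTER_LOOKUP[key]


example = assign_cluster(["Office", "Pharmacy", "Fast Food Restaurant"])
print(example, my_clusters[example])
```

The lookup table keeps all eight combinations visible in one place, which makes it easy to check against the list of clusters above.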

Rather than just focus on the hospitals in northern Virginia, let's apply our groupings to all Virginia hospitals. I will have to grab new venue data for hospitals_only then clean up the venue data just like I did for hospitals in northern Virginia.

### Get Venue Data for General Hospitals Throughout Virginia

In [70]:
# Let's call the function above to get a dataframe with venues for all hospitals in Virginia.
# create a dictionary to initialize my dataframe
dictionary1 = {'name': ['value'], 'lat': ['NaN'],
               'lng': ['NaN'], 'distance': ['NaN'], 'categories': ['value'], 'Hospital': ['value']}
# create a dataframe to contain the combined venue data
hospital_venues = pd.DataFrame(dictionary1) 
for i, hospital in enumerate(hospitals_only['Name']):
    # Call the function to get nearby venues for each hospital and
    # append the result to the combined dataframe
    hospital_venues = hospital_venues.append(getNearbyVenues(hospital, hospitals_only.iloc[i,4],
                                                   hospitals_only.iloc[i,5], RADIUS))

# Testing on just one hospital    
#venue_data = getNearbyVenues(hospitals_only.loc[12,'Name'], hospitals_only.loc[12,'Latitude'], hospitals_only.loc[12,'Longitude'], radius)
#venue_data.head() # This is the venue data for one hospital
#hospital_venues = hospital_venues.append(getNearbyVenues(hospitals_only.loc[1,'Latitude'], df_hospitals.loc[1,'Longitude'], radius), sort=False)    

print('hospital_venues shape:', hospital_venues.shape)
hospital_venues.head()

hospital_venues shape: (21095, 6)


Unnamed: 0,name,lat,lng,distance,categories,Hospital
0,value,,,,value,value
0,Richardson Memorial Library,36.6872,-77.5412,352.0,Library,A B Adams Convalescent Center
1,Greensville County Courthouse,36.6858,-77.5426,436.0,Courthouse,A B Adams Convalescent Center
2,Peggy Malone - State Farm Insurance Agent,36.694,-77.5381,920.0,Office,A B Adams Convalescent Center
3,New Century Hospice - Emporia,36.6854,-77.5438,537.0,Medical Center,A B Adams Convalescent Center
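`DataFrame.append` was removed in pandas 2.0, so on a recent installation the loop above would need a different pattern. A common replacement is to collect the per-hospital frames in a list and combine them once with `pd.concat`, which also removes the need for a placeholder row. The sketch below uses a hypothetical stand-in, `fake_get_nearby_venues`, in place of the Foursquare call.

```python
import pandas as pd


def fake_get_nearby_venues(hospital_name):
    """Hypothetical stand-in for getNearbyVenues; returns a tiny hard-coded frame."""
    return pd.DataFrame(
        {
            "name": ["Venue 1", "Venue 2"],
            "distance": [100, 250],
            "Hospital": [hospital_name] * 2,
        }
    )


# Collect one frame per hospital, then combine them in a single step
frames = [fake_get_nearby_venues(name) for name in ["Hospital A", "Hospital B"]]

# ignore_index=True renumbers the rows 0..n-1 so every row has a unique label
toy_venues = pd.concat(frames, ignore_index=True)
print(toy_venues.shape)
```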


### Inspect and Clean the Data

In [71]:
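# Drop the placeholder row that was used to initialize the dataframe.
# Caution: the index restarts at 0 for each hospital, so labels=0 matches
# the first venue of every hospital as well as the placeholder row.
hospital_venues.drop(labels = 0,axis = 0, inplace = True)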
# Rename the columns
hospital_venues.rename(columns={"name":"Venue","distance":"Meters from Hospital", "lat":"Venue Lat", "lng":"Venue Lng", "categories":"Category"}, inplace=True)
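The `drop(labels=0, ...)` call removes rows by index label, not by position. That matters here if each per-hospital frame is indexed from 0, as the repeated 0 labels in the `head()` output suggest. The toy example below, with made-up frames, contrasts dropping by label with dropping by position.

```python
import pandas as pd

placeholder = pd.DataFrame({"name": ["value"], "Hospital": ["value"]})
hospital_a = pd.DataFrame({"name": ["Library", "Courthouse"], "Hospital": ["A", "A"]})
hospital_b = pd.DataFrame({"name": ["Cafe", "Bank"], "Hospital": ["B", "B"]})

# Concatenating without ignore_index keeps each frame's own 0-based labels
stacked = pd.concat([placeholder, hospital_a, hospital_b])

by_label = stacked.drop(labels=0, axis=0)  # removes every row whose label is 0
by_position = stacked.iloc[1:]             # removes only the first row

print(len(stacked), len(by_label), len(by_position))
```

Dropping by position, or resetting the index before dropping, keeps the first venue of each hospital in the data.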

In [72]:
hospital_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20875 entries, 1 to 99
Data columns (total 6 columns):
Venue                   20875 non-null object
Venue Lat               20875 non-null object
Venue Lng               20875 non-null object
Meters from Hospital    20875 non-null object
Category                19879 non-null object
Hospital                20875 non-null object
dtypes: object(6)
memory usage: 1.1+ MB


Yep. It looks like we have the same issues with this larger data set: every column is stored as type object, and 996 of the 20,875 venues have no Category.

In [73]:
# Let's see if any venues have a null value for 'Meters from Hospital'
hospital_venues[hospital_venues['Meters from Hospital'].isna()]

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital


In [74]:
# Changing the 'Meters from Hospital' column to type 'int'
hospital_venues['Meters from Hospital'] = hospital_venues['Meters from Hospital'].astype('int')
hospital_venues['Meters from Hospital'].dtypes # Let's check the type now

dtype('int64')

In [81]:
# Changing 'Venue Lat' and 'Venue Lng' columns to type 'float'
hospital_venues['Venue Lat'] = hospital_venues['Venue Lat'].astype('float')
hospital_venues['Venue Lng'] = hospital_venues['Venue Lng'].astype('float')
print('Venue Lat has type:', hospital_venues['Venue Lat'].dtypes)
print('Venue Lng has type:', hospital_venues['Venue Lng'].dtypes)

Venue Lat has type: float64
Venue Lng has type: float64
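The conversions above can also be written as a loop over the numeric columns. The sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns entries that cannot be parsed into NaN instead of raising an error. The toy dataframe and its "not available" entry are invented to show that behavior.

```python
import pandas as pd

# Toy data stored as strings, with one deliberately unparseable entry
toy = pd.DataFrame(
    {
        "Venue Lat": ["36.6872", "36.6858"],
        "Venue Lng": ["-77.5412", "-77.5426"],
        "Meters from Hospital": ["352.0", "not available"],
    }
)

numeric_columns = ["Venue Lat", "Venue Lng", "Meters from Hospital"]
for column in numeric_columns:
    # errors="coerce" replaces unparseable entries with NaN instead of raising
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)
```

Coerced values show up as NaN, so a follow-up `isna()` check reveals how many entries failed to parse.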


In [75]:
hospital_venues.reset_index(drop=True, inplace=True)
hospital_venues[hospital_venues['Category'].isna()]

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
15,Telpage,36.6788,-77.5478,1186,,A B Adams Convalescent Center
16,Habitat ReStore,36.6931,-77.5353,849,,A B Adams Convalescent Center
24,MARKET DRIVE CITGO,36.6939,-77.5396,925,,A B Adams Convalescent Center
97,Monumental UMC,36.691,-77.5355,620,,A B Adams Convalescent Center
146,VCUHS - Nelson Clinic,37.5395,-77.4308,63,,A D Williams Memorial Clinic
166,Miles Lab,37.5397,-77.4303,19,,A D Williams Memorial Clinic
232,Fairfax Government Center,38.9656,-77.3585,136,,Access Emergency Hospital
265,"Baldino's Lock & Key, Reston",38.9673,-77.3548,255,,Access Emergency Hospital
273,Nv pools,38.9615,-77.356,473,,Access Emergency Hospital
274,Horsey Headquarters,38.9688,-77.3572,353,,Access Emergency Hospital


In [76]:
# I'm going to manually assign a category for some of these venues
hospital_venues.loc[hospital_venues['Venue'].str.contains('Government'),'Category'] = 'Government Venue'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Lock & Key'),'Category'] = 'Locksmith'
hospital_venues.loc[hospital_venues['Venue'].str.contains('pool'),'Category'] = 'Pool'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Pool'),'Category'] = 'Pool'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Barber'),'Category'] = 'Barbershop'
hospital_venues.loc[hospital_venues['Venue'].str.contains('pharmacy'),'Category'] = 'Pharmacy'
hospital_venues.loc[hospital_venues['Venue'].str.contains('El Pollo Rico'),'Category'] = 'Peruvian Restaurant'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Endodontic'),'Category'] = 'Endodontic Dentist'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Family Medicine'),'Category'] = 'Family Medicine'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Salon'),'Category'] = 'Salon'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Massage & Spa'),'Category'] = 'Massage and Spa'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Chapel Next'),'Category'] = 'Christian Church'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Church'),'Category'] = 'Christian Church'
hospital_venues.loc[hospital_venues['Venue'].str.contains('University'),'Category'] = 'University'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Child Development'),'Category'] = 'Child Daycare'
hospital_venues.loc[hospital_venues['Venue'].str.contains('SAIC'),'Category'] = 'Corporate Office'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Hospice'),'Category'] = 'Nursing Home'
hospital_venues.loc[hospital_venues['Venue'].str.contains('store'),'Category'] = 'Store'
hospital_venues.loc[hospital_venues['Venue'].str.contains("Doctors Office"),'Category'] = "Doctor's Office"
hospital_venues.loc[hospital_venues['Venue'].str.contains('Allergy'),'Category'] = 'Allergy Clinic'
hospital_venues.loc[hospital_venues['Venue'].str.contains('William Urology'),'Category'] = 'Urology Clinic'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Medical Supply'),'Category'] = 'Medical Supply'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Otolaryngology'),'Category'] = 'Otolaryngology Clinic'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Gift Shop'),'Category'] = 'Gift Shop'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Cardiovascular'),'Category'] = 'Cardiovascular Clinic'
hospital_venues.loc[hospital_venues['Venue'].str.contains('DVMS'),'Category'] = 'Veterinarian'
hospital_venues.loc[hospital_venues['Venue'].str.contains('JP Killeen'),'Category'] = 'Healthcare IT Business'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Fed Ex Box'),'Category'] = 'FedEx Box'
hospital_venues.loc[hospital_venues['Venue'].str.contains('pool'),'Category'] = 'Pool'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Cookie'),'Category'] = 'Cookie Shop'
hospital_venues.loc[hospital_venues['Venue'].str.contains("Women's Health"),'Category'] = "Women's Health Clinic"
hospital_venues.loc[hospital_venues['Venue'].str.contains('Imaging Center'),'Category'] = 'Hospital Imaging Center'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Law Office'),'Category'] = 'Law Office'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Chow Hall'),'Category'] = 'Chow Hall'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Town hall'),'Category'] = 'Town Hall'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Surgery Center'),'Category'] = 'Surgery Center'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Town hall'),'Category'] = 'Town Hall'
hospital_venues.loc[hospital_venues['Venue'].str.contains('Grapevine'),'Category'] = 'Wine and Cigar Shop'

hospital_venues[hospital_venues['Category'].isna()]

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
15,Telpage,36.6788,-77.5478,1186,,A B Adams Convalescent Center
16,Habitat ReStore,36.6931,-77.5353,849,,A B Adams Convalescent Center
24,MARKET DRIVE CITGO,36.6939,-77.5396,925,,A B Adams Convalescent Center
97,Monumental UMC,36.691,-77.5355,620,,A B Adams Convalescent Center
146,VCUHS - Nelson Clinic,37.5395,-77.4308,63,,A D Williams Memorial Clinic
166,Miles Lab,37.5397,-77.4303,19,,A D Williams Memorial Clinic
274,Horsey Headquarters,38.9688,-77.3572,353,,Access Emergency Hospital
343,Blue Ridge Beads,38.0408,-78.479,359,,Albemarle County Health Department
344,420 Altamont,38.0447,-78.4799,388,,Albemarle County Health Department
405,Camerons Color And Cut,37.4406,-79.1766,453,,Alexander W Terrell Memorial Infirmary
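
The long run of near-identical `.loc` assignments above could also be driven from a single keyword-to-category table. The following is a minimal sketch, not the code that produced the output above: the `keyword_to_category` dictionary holds only a few of the rules, the small `venues` frame is made-up sample data standing in for `hospital_venues`, and `regex=False` and `na=False` are my additions so that keywords are matched literally and missing names are skipped. As in the original cell, when several keywords match the same venue, the rule applied last wins.

```python
import pandas as pd

# Made-up sample rows standing in for hospital_venues
venues = pd.DataFrame({
    'Venue': ['Fairfax Government Center', "Baldino's Lock & Key, Reston",
              'Nv pools', 'Horsey Headquarters'],
    'Category': pd.Series([None, None, None, None], dtype='object'),
})

# A few of the keyword rules from the cell above; later entries override earlier ones
keyword_to_category = {
    'Government': 'Government Venue',
    'Lock & Key': 'Locksmith',
    'pool': 'Pool',
    'Pool': 'Pool',
}

for keyword, category in keyword_to_category.items():
    # regex=False matches the keyword literally; na=False skips missing names
    mask = venues['Venue'].str.contains(keyword, regex=False, na=False)
    venues.loc[mask, 'Category'] = category

# Rows that still need a manual category
uncategorized = venues[venues['Category'].isna()]
print(uncategorized['Venue'].tolist())
```

Keeping the rules in one dictionary makes accidental repeats (such as the second `'pool'` and `'Town hall'` lines) easier to spot.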


In [82]:
# If necessary, remove the extra index column which I forgot to drop earlier
#all_venues.drop('index', axis = 1, inplace = True)
hospital_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19956 entries, 0 to 20874
Data columns (total 6 columns):
Venue                   19956 non-null object
Venue Lat               19956 non-null float64
Venue Lng               19956 non-null float64
Meters from Hospital    19956 non-null int64
Category                19956 non-null object
Hospital                19956 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 1.1+ MB


In [83]:
# Now drop the remaining rows where 'Category' is still missing (NaN)
hospital_venues = hospital_venues[~(hospital_venues['Category'].isna())]
hospital_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19956 entries, 0 to 20874
Data columns (total 6 columns):
Venue                   19956 non-null object
Venue Lat               19956 non-null float64
Venue Lng               19956 non-null float64
Meters from Hospital    19956 non-null int64
Category                19956 non-null object
Hospital                19956 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 1.1+ MB


In [84]:
# I thought I saw some duplicate venues, so let's go through the dataframe one more time to
# make sure no venue is listed twice for the same hospital.
# Reassigning (instead of inplace=True) avoids modifying a filtered view of the dataframe.
hospital_venues = hospital_venues.drop_duplicates()
hospital_venues.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19956 entries, 0 to 20874
Data columns (total 6 columns):
Venue                   19956 non-null object
Venue Lat               19956 non-null float64
Venue Lng               19956 non-null float64
Meters from Hospital    19956 non-null int64
Category                19956 non-null object
Hospital                19956 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 1.1+ MB


In [85]:
# Now let's see the statistics
hospital_venues.describe()
#max(nova_venues['Meters from Hospital'])

Unnamed: 0,Venue Lat,Venue Lng,Meters from Hospital
count,19956.0,19956.0,19956.0
mean,37.575945,-78.304214,3792.944
std,0.751921,1.966885,106765.8
min,19.684407,-156.018251,0.0
25%,37.02196,-79.401464,204.0
50%,37.413173,-77.527962,460.0
75%,38.033646,-77.288752,871.0
max,46.342502,-71.791618,7817226.0


In [86]:
# Keep only venues within 500 meters of their hospital. This also drops the
# outliers seen above (the maximum distance was 7,817,226 meters).
hospital_venues = hospital_venues[hospital_venues['Meters from Hospital']<501]
hospital_venues.describe()

Unnamed: 0,Venue Lat,Venue Lng,Meters from Hospital
count,10586.0,10586.0,10586.0
mean,37.725758,-77.984917,233.224825
std,0.747696,1.418898,131.567693
min,36.569351,-83.051193,0.0
25%,37.185852,-78.870783,127.0
50%,37.541032,-77.462217,214.0
75%,38.354521,-77.285575,337.0
max,39.198217,-75.375535,500.0


In [87]:
hospital_venues.reset_index(drop = True, inplace=True) # resetting the index again
hospital_venues.head(25)

Unnamed: 0,Venue,Venue Lat,Venue Lng,Meters from Hospital,Category,Hospital
0,Greensville County Courthouse,36.685785,-77.542643,436,Courthouse,A B Adams Convalescent Center
1,Veteran's Memorial Park,36.688216,-77.540897,395,Park,A B Adams Convalescent Center
2,Emporia General District Court,36.686527,-77.542419,426,Courthouse,A B Adams Convalescent Center
3,First Presbyterian Church,36.68797,-77.542599,500,Christian Church,A B Adams Convalescent Center
4,emporia foot center,36.684927,-77.542728,452,Doctor's Office,A B Adams Convalescent Center
5,dr william t tillar (Optometrist),36.685074,-77.542906,464,Doctor's Office,A B Adams Convalescent Center
6,Emporia Municipal Bldg,36.687361,-77.542352,449,City Hall,A B Adams Convalescent Center
7,United States Post Office,36.688684,-77.541535,472,Government Building,A B Adams Convalescent Center
8,Commonwealth Atty Office,36.686531,-77.542389,423,City Hall,A B Adams Convalescent Center
9,Emporia Post Office,36.68867,-77.541355,460,Post Office,A B Adams Convalescent Center


#### Great! Now I'll write the code to assign each hospital to a group, dependent on the venue types nearby.
### Methodology

In [88]:
# Let's transform the data to make it easier to perform an operation on venues by hospital.

# one hot encoding
onehot = pd.get_dummies(hospital_venues[['Category']], prefix="", prefix_sep="")

# add hospital column back to dataframe
onehot['Target Hospital'] = hospital_venues['Hospital'] 

# move hospital column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,Target Hospital,ATM,Accessories Store,Acupuncturist,Advertising Agency,Airport Terminal,Allergy Clinic,American Restaurant,Animal Shelter,Antique Shop,...,Water Park,Wine Bar,Wine Shop,Wine and Cigar Shop,Winery,Wings Joint,Women's Health Clinic,Women's Store,Yoga Studio,Zoo
0,A B Adams Convalescent Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,A B Adams Convalescent Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,A B Adams Convalescent Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,A B Adams Convalescent Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,A B Adams Convalescent Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [89]:
# Now let's group the rows by hospital and take the mean of each venue category column
by_hospital = onehot.groupby('Target Hospital').mean().reset_index()
by_hospital

Unnamed: 0,Target Hospital,ATM,Accessories Store,Acupuncturist,Advertising Agency,Airport Terminal,Allergy Clinic,American Restaurant,Animal Shelter,Antique Shop,...,Water Park,Wine Bar,Wine Shop,Wine and Cigar Shop,Winery,Wings Joint,Women's Health Clinic,Women's Store,Yoga Studio,Zoo
0,A B Adams Convalescent Center,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
1,A D Williams Memorial Clinic,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.010309,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
2,Access Emergency Hospital,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
3,Albemarle County Health Department,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.025316,...,0.000000,0.012658,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.012658,0.000000
4,Alexander W Terrell Memorial Infirmary,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
5,Alleghany Memorial Hospital,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.025641,0.000000,0.025641,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
6,Alleghany Regional Hospital,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.074074,0.037037,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
7,Andrew Rader Clinic,0.018868,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.018868,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
8,Arlington Free Clinic,0.000000,0.0000,0.000000,0.000000,0.0,0.021739,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
9,Ashland Convalescent Center,0.000000,0.0000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
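
Because each row of `onehot` has a single 1 in the column of its category, the per-hospital mean of a column is the fraction of that hospital's venues that fall in that category. Here is a small sketch with made-up venues (the hospital names and categories are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up sample: hospital A has 4 venues, hospital B has 2
sample = pd.DataFrame({
    'Hospital': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category': ['Park', 'Pharmacy', 'Pharmacy', 'Bank', 'Park', 'Bank'],
})

# One column per category, with a 1 marking each venue's category
onehot_sample = pd.get_dummies(sample['Category'], dtype=float)
onehot_sample['Target Hospital'] = sample['Hospital']

# The mean of a 0/1 column is the share of venues in that category
shares = onehot_sample.groupby('Target Hospital').mean()
print(shares)
```

Read this way, the 0.010309 under `American Restaurant` for A D Williams Memorial Clinic is consistent with one American restaurant among 97 venues (1/97 ≈ 0.0103).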


Now I'll create a column called 'My Clusters' to store the new cluster ID

In [90]:

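# Create an empty (NaN) 'My Clusters' column in the 'by_hospital' dataframe.
# The explicit dtype keeps the behavior the same across pandas versions.
by_hospital['My Clusters'] = pd.Series(index=by_hospital.index, dtype='float64')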
by_hospital.head()

Unnamed: 0,Target Hospital,ATM,Accessories Store,Acupuncturist,Advertising Agency,Airport Terminal,Allergy Clinic,American Restaurant,Animal Shelter,Antique Shop,...,Wine Bar,Wine Shop,Wine and Cigar Shop,Winery,Wings Joint,Women's Health Clinic,Women's Store,Yoga Studio,Zoo,My Clusters
0,A B Adams Convalescent Center,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,A D Williams Memorial Clinic,0.0,0.0,0.0,0.0,0.0,0.0,0.010309,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,Access Emergency Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,Albemarle County Health Department,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025316,...,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,
4,Alexander W Terrell Memorial Infirmary,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [91]:
my_clusters

{1: 'Park, Fast Food, and Convenience Store',
 2: 'Fast Food and Park',
 3: 'Fast Food and Convenience Store',
 4: 'Park and Convenience Store',
 5: 'Park',
 6: 'Fast Food',
 7: 'Convenience Store',
 8: 'None of the three'}

Now I need to find the names of columns that fall into one of my three important venue types.

In [92]:
# Collect the Category names that contain any of these target words.
# 'Park' is appended explicitly below because a substring match on 'Park'
# would also pick up 'Parking'.
target_words = ['Walking', 'Nature', 'Fast Food', 'Burger', 'Sandwich', 
                'Convenience', 'Grocery', 'Pharmacy', 'Pizza']
target_columns = []
for cat in hospital_venues['Category'].unique():
    # any() stops at the first match, so each category is added at most once
    if any(word in cat for word in target_words):
        target_columns.append(cat)

target_columns.append('Park')
target_columns.append('Target Hospital')
target_columns

['Pharmacy',
 'Fast Food Restaurant',
 'Sandwich Place',
 'Pizza Place',
 'Grocery Store',
 'Burger Joint',
 'Convenience Store',
 'Organic Grocery',
 'Park',
 'Target Hospital']

Good! Since I won't be using any other columns to categorize the hospitals, I can filter them out.

In [93]:
target_venues_by_hospital=by_hospital[by_hospital.columns[by_hospital.columns.isin(target_columns)]]
target_venues_by_hospital.head(10)

Unnamed: 0,Target Hospital,Burger Joint,Convenience Store,Fast Food Restaurant,Grocery Store,Organic Grocery,Park,Pharmacy,Pizza Place,Sandwich Place
0,A B Adams Convalescent Center,0.0,0.0,0.0,0.0,0.0,0.052632,0.052632,0.0,0.0
1,A D Williams Memorial Clinic,0.0,0.0,0.020619,0.0,0.0,0.0,0.010309,0.0,0.010309
2,Access Emergency Hospital,0.0,0.0,0.010753,0.010753,0.0,0.0,0.010753,0.032258,0.010753
3,Albemarle County Health Department,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alexander W Terrell Memorial Infirmary,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Alleghany Memorial Hospital,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641
6,Alleghany Regional Hospital,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0
7,Andrew Rader Clinic,0.0,0.0,0.0,0.018868,0.0,0.0,0.018868,0.018868,0.0
8,Arlington Free Clinic,0.01087,0.01087,0.0,0.0,0.0,0.01087,0.0,0.01087,0.0
9,Ashland Convalescent Center,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Great! Now let's see if I can use the 'target_venues_by_hospital' dataframe to create boolean masks for each of the venue categories important to me.

In [94]:
bool_park = (target_venues_by_hospital['Park']>0)
bool_park.head()

0     True
1    False
2    False
3    False
4    False
Name: Park, dtype: bool

It worked! I'll go ahead and create masks for the other two venue categories.

In [95]:
bool_food = ((target_venues_by_hospital['Burger Joint']>0) | (target_venues_by_hospital['Fast Food Restaurant']>0) | (target_venues_by_hospital['Pizza Place']>0) | (target_venues_by_hospital['Sandwich Place']>0))
bool_convenience = ((target_venues_by_hospital['Convenience Store']>0) | (target_venues_by_hospital['Grocery Store']>0) | (target_venues_by_hospital['Organic Grocery']>0) | (target_venues_by_hospital['Pharmacy']>0))
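
The same masks can be written more compactly by listing the columns in each group and asking whether any of them is positive. This is a sketch on made-up numbers: `frame` stands in for `target_venues_by_hospital`, and the names `food_cols`, `convenience_cols`, and the `*_mask` variables are mine; the group lists mirror the columns used above.

```python
import pandas as pd

# Made-up stand-in for target_venues_by_hospital (three hospitals)
frame = pd.DataFrame({
    'Burger Joint':         [0.00, 0.02, 0.0],
    'Fast Food Restaurant': [0.00, 0.00, 0.0],
    'Pizza Place':          [0.00, 0.00, 0.0],
    'Sandwich Place':       [0.00, 0.00, 0.0],
    'Convenience Store':    [0.00, 0.00, 0.0],
    'Grocery Store':        [0.00, 0.01, 0.0],
    'Organic Grocery':      [0.00, 0.00, 0.0],
    'Pharmacy':             [0.05, 0.00, 0.0],
    'Park':                 [0.05, 0.00, 0.0],
})

food_cols = ['Burger Joint', 'Fast Food Restaurant', 'Pizza Place', 'Sandwich Place']
convenience_cols = ['Convenience Store', 'Grocery Store', 'Organic Grocery', 'Pharmacy']

# True when at least one column in the group is greater than zero
food_mask = frame[food_cols].gt(0).any(axis=1)
convenience_mask = frame[convenience_cols].gt(0).any(axis=1)
park_mask = frame['Park'].gt(0)

print(food_mask.tolist(), convenience_mask.tolist(), park_mask.tolist())
```

With the groups held in lists, adding another category to a group is a one-word change rather than another `|` clause.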

In [96]:
# Let's see the clusters again before creating the clustering algorithm.
my_clusters

{1: 'Park, Fast Food, and Convenience Store',
 2: 'Fast Food and Park',
 3: 'Fast Food and Convenience Store',
 4: 'Park and Convenience Store',
 5: 'Park',
 6: 'Fast Food',
 7: 'Convenience Store',
 8: 'None of the three'}

### My Clustering Algorithm

In [97]:
# Let's fill the 'My Clusters' column with an integer to represent the appropriate cluster for that hospital
length = by_hospital['My Clusters'].size  # get the number of rows
my_clusters_index = by_hospital.columns.get_loc('My Clusters') # position of the 'My Clusters' column
print("length of by_hospital['My Clusters']:", length)
index = 0
while index < length:   # walk through the hospitals row by row
    if (bool_park.iloc[index]):
        if (bool_food.iloc[index]):
            if (bool_convenience.iloc[index]):
                by_hospital.iloc[index, my_clusters_index] = 1
            else:
                by_hospital.iloc[index, my_clusters_index] = 2
        elif (bool_convenience.iloc[index]):
            by_hospital.iloc[index, my_clusters_index] = 4
        else:
            by_hospital.iloc[index, my_clusters_index] = 5
    elif (bool_food.iloc[index]):
        if (bool_convenience.iloc[index]):
            by_hospital.iloc[index, my_clusters_index] = 3
        else:
            by_hospital.iloc[index, my_clusters_index] = 6
    elif (bool_convenience.iloc[index]):
        by_hospital.iloc[index, my_clusters_index] = 7
    else:
        by_hospital.iloc[index, my_clusters_index] = 8
    index = index + 1
    
print(by_hospital['My Clusters']) # Let's take a look at the new cluster IDs.

length of by_hospital['My Clusters']: 218
0      4.0
1      3.0
2      3.0
3      8.0
4      8.0
5      6.0
6      6.0
7      3.0
8      1.0
9      8.0
10     8.0
11     3.0
12     2.0
13     6.0
14     8.0
15     8.0
16     8.0
17     4.0
18     6.0
19     3.0
20     7.0
21     3.0
22     8.0
23     5.0
24     1.0
25     8.0
26     6.0
27     8.0
28     8.0
29     2.0
      ... 
188    7.0
189    6.0
190    8.0
191    6.0
192    3.0
193    6.0
194    3.0
195    3.0
196    2.0
197    8.0
198    2.0
199    5.0
200    3.0
201    7.0
202    3.0
203    7.0
204    8.0
205    3.0
206    8.0
207    7.0
208    7.0
209    3.0
210    3.0
211    3.0
212    6.0
213    8.0
214    8.0
215    8.0
216    8.0
217    3.0
Name: My Clusters, Length: 218, dtype: float64


Nice, except I would like my clusters to be integers, not floats.
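
The floats appear because the column started out filled with NaN, which is a floating-point value. One way to sidestep both the conversion and the nested `if` chain is a lookup table keyed on the three booleans. The following is a sketch of that idea rather than the code used in this notebook; `CLUSTER_ID` and `assign_cluster` are names I made up, and the numbering follows the `my_clusters` dictionary.

```python
# (park, food, convenience) -> cluster number, matching the my_clusters dictionary
CLUSTER_ID = {
    (True,  True,  True):  1,  # Park, Fast Food, and Convenience Store
    (True,  True,  False): 2,  # Fast Food and Park
    (False, True,  True):  3,  # Fast Food and Convenience Store
    (True,  False, True):  4,  # Park and Convenience Store
    (True,  False, False): 5,  # Park
    (False, True,  False): 6,  # Fast Food
    (False, False, True):  7,  # Convenience Store
    (False, False, False): 8,  # None of the three
}

def assign_cluster(has_park, has_food, has_convenience):
    """Return the integer cluster for one hospital."""
    return CLUSTER_ID[(bool(has_park), bool(has_food), bool(has_convenience))]

# Example: a park and a pharmacy nearby, but no fast food
print(assign_cluster(True, False, True))  # 4
```

Applied to the masks above, this would look like `[assign_cluster(p, f, c) for p, f, c in zip(bool_park, bool_food, bool_convenience)]`, which yields integers directly.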

In [100]:
# Let's change the cluster value to integers then see how many hospitals are in each cluster!
by_hospital['My Clusters']=by_hospital['My Clusters'].astype('int')
by_hospital['My Clusters'].value_counts()

8    68
3    51
7    34
6    27
5    11
2    11
1     9
4     7
Name: My Clusters, dtype: int64

Wow! The group with none of the three target venue types is the largest, with 68 of the 218 hospitals. That means more hospitals have no fast food, convenience store, or park within 500 meters than fall into any other single group. Now I can plot the hospitals on the map of Virginia and color them according to their group.

In [153]:
# Generate an array of colors (I opted out of using the first method because the colors
# are chosen randomly and therefore change each time the code is run).
#from random import randint
#rand_colors = []
#for i in range(8):
#    rand_colors.append('#%06X' % randint(0, 0xFFFFFF))

# 'My Clusters' has eight different values (groups) so we will make 8
# different colors of markers on the map.
k_types = 8 
x = np.arange(k_types)
ys = [i + x + (i*x)**2 for i in range(k_types)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

Lee_Regional_Lng = -83.0477424
Twin_County_Lng = -80.9238909
Legend_Lng = (Lee_Regional_Lng - 1.0)
Legend_Lat = np.linspace(37.65, 39.15, 8)
Legend_Lat

array([37.65      , 37.86428571, 38.07857143, 38.29285714, 38.50714286,
       38.72142857, 38.93571429, 39.15      ])

In [138]:
# Import the DivIcon feature to print my legend text
from folium.features import DivIcon

In [154]:
# Let's map the hospitals and label them with the new classifications

# set the coordinates for Virginia
latitude = 37.9965159
longitude = -79.8305715

# create map of Virginia using folium
virginia_map_2 = folium.Map(location=[latitude, longitude], zoom_start=7)

# Now I will map the hospitals in the 'hospitals_only' dataframe and add the hospital name
# and 'My Clusters' as labels.
for lat, lng, hospital, m_class in zip(hospitals_only['Latitude'], hospitals_only['Longitude'],
                                       hospitals_only['Name'], by_hospital['My Clusters']):
           
    label = '{}, {}'.format(hospital, my_clusters.get(m_class))
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[m_class-1],
        fill=True,
        fill_color=rainbow[m_class-1],
        fill_opacity=0.7,
        parse_html=False).add_to(virginia_map_2)  

lng = Legend_Lng
# Draw the legend west of the state: one colored dot and one text label per cluster
for lat, key in zip(Legend_Lat, my_clusters):
    label = my_clusters[key]
    legend_color = rainbow[key-1]  # same color lookup as the hospital markers
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color=legend_color,
        fill=True,
        fill_color=legend_color,
        fill_opacity=0.7).add_to(virginia_map_2)
    # The text label is drawn with a DivIcon marker placed just beside the dot
    text_lat = lat + 0.09
    text_lng = lng + 0.14
    folium.map.Marker(
        [text_lat,text_lng],
        icon=DivIcon(
            icon_size=(300,12),
            icon_anchor=(0,0),
            html='<div style="font-size: 10pt">{}</div>'.format(label),
            )
        ).add_to(virginia_map_2)

virginia_map_2

And there we have it. A map with useful data about the venues surrounding hospitals in Virginia.
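
One caveat about the plotting loop: `zip` pairs `hospitals_only` and `by_hospital` by row position, which is only correct if both frames list the same hospitals in the same order. A more defensive variant would join the two frames on the hospital name first. The sketch below assumes that `hospitals_only['Name']` holds the same strings as `by_hospital['Target Hospital']`; the two small frames and the name `plot_data` are made-up stand-ins.

```python
import pandas as pd

# Made-up stand-ins; note the different row order and the extra hospital
hospitals_only_sample = pd.DataFrame({
    'Name': ['Hospital B', 'Hospital A', 'Hospital C'],
    'Latitude': [37.5, 36.7, 38.9],
    'Longitude': [-77.4, -77.5, -77.3],
})
by_hospital_sample = pd.DataFrame({
    'Target Hospital': ['Hospital A', 'Hospital B'],
    'My Clusters': [4, 3],
})

# An inner join keeps only hospitals present in both frames and aligns them by name
plot_data = hospitals_only_sample.merge(
    by_hospital_sample[['Target Hospital', 'My Clusters']],
    left_on='Name', right_on='Target Hospital', how='inner')

for name, cluster in zip(plot_data['Name'], plot_data['My Clusters']):
    print(name, cluster)
```

Looping over the joined frame means a hospital that lost all its venues in the 500 meter filter is simply left off the map instead of shifting every later hospital onto the wrong cluster.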