# Final Project:
## New Facility Location Selection
### by: Jeffrey Dupree

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The owner of several successful gyms wants to open a new facility in Tampa, FL. The owner wants the gym located in an area that is not already saturated with gyms or with other businesses that might compete with a new gym. This initial analysis identifies, at the neighborhood level, where to consider placing the new facility. Selecting the final location will require later analysis and research of available real estate, which is beyond the scope of this analysis.

In order to conduct this analysis, we must collect:
* Zip Codes in Tampa, FL
* Zip Code locations (latitude/longitude)
* Zip Code boundaries
* Business type and frequency

Before we start we need to install and import the necessary libraries and dependencies.

In [1]:
# If you don't have these packages available, uncomment the appropriate lines below to install them.
import sys
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install lxml
#!{sys.executable} -m pip install requests
#!{sys.executable} -m pip install geopy
#!{sys.executable} -m pip install geojson
#!conda install -c conda-forge folium=0.5.0 --yes

from bs4 import BeautifulSoup #For scraping and parsing webpages.
import requests
import pandas as pd
import numpy as np
import re #This will allow use of regular expressions (regex)
from tqdm import tqdm   #This will allow a progress bar to show that there is progress being made. This is helpful when an
                        #iterative process may take more than a few seconds.
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter #This spaces out requests to avoid errors from sending too many requests.
from functools import partial #This will allow extra keyword arguments to be bound to the rate-limited geocode call.
import json #Library to handle JSON files
from pandas.io.json import json_normalize #Transform JSON file into a pandas dataframe (newer pandas versions expose this as pandas.json_normalize)
import matplotlib.cm as cm #Matplotlib and associated plotting modules
import matplotlib.colors as colors
import folium #Map rendering library
import urllib.request #Explicitly import the submodule used later to download the GeoJSON file
import geojson
from geojson import Feature, Point, FeatureCollection, MultiPolygon

## Data <a name="data"></a>

### Zip Codes
To begin with, the analysis will need specific Zip Code data for Tampa, FL.

<b>Step one:</b> Identify the list of Zip Codes that correspond to Tampa, FL. For that, this notebook will scrape information from a ZIP-CODES.COM page https://www.zip-codes.com/state/fl.asp#zipcodes to create a dataframe consisting of the Zip Code, the City name, County name and the Zip Code type.

In [2]:
source = requests.get('https://www.zip-codes.com/state/fl.asp#zipcodes').text

Use the BeautifulSoup package to scrape the information from the webpage. I used the lxml parsing method, but you can use any you like.

In [3]:
soup = BeautifulSoup(source, 'lxml')

Find the table using `soup.find` from BeautifulSoup. The second line displays the structure and content of the table, which can be long. Once the analyst understands the structure, they can develop the logic required to extract the desired elements in the next steps.

In [4]:
table = soup.find(id="tblZIP")
print(table.prettify())

<table border="0" cellpadding="0" cellspacing="0" class="statTable" id="tblZIP" title="All Florida ZIP Codes, City, County, Classification, and Area Codes." width="99%">
 <tr>
  <td class="label" title="All ZIP Codes for Florida">
   <strong>
    ZIP Code
   </strong>
  </td>
  <td class="info" title="The official city name as designated by the USPS.">
   <strong>
    City
   </strong>
  </td>
  <td class="info" title="The primary county or parish this ZIP Code serves.">
   <strong>
    County
   </strong>
  </td>
  <td class="info" title="The classification type of this ZIP Code.">
   <strong>
    Type
   </strong>
  </td>
 </tr>
 <tr>
  <td>
   <a href="/zip-code/32003/zip-code-32003.asp" title="ZIP Code 32003">
    ZIP Code 32003
   </a>
  </td>
  <td>
   <a href="/city/fl-fleming-island.asp" title="Fleming Island, FL">
    Fleming Island
   </a>
  </td>
  <td>
   <a href="/county/fl-clay.asp">
    Clay
   </a>
  </td>
  <td>
   Standard
  </td>
 </tr>
 <tr>
  <td>
   <a href="/zip-

Now a pandas dataframe needs to be created. This will require looping through the elements from the table and assigning the elements to a list. The list can then be made into a dataframe using `pd.DataFrame`. The columns will need header names. I manually assigned these instead of pulling them from the BeautifulSoup object `table`.

In [5]:
table_rows = table.find_all('tr')

res = []
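for tr in table_rows:
    cells = tr.find_all('td')
    # Keep the non-empty text of each cell; a distinct name avoids shadowing the row variable `tr`.
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)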

# Label the columns.
df = pd.DataFrame(res[1:], columns=['Zip_Code','City','County','Type'])

# Remove the text 'Zip Code' from the records in the Zip Code column.
df['Zip_Code'] = df['Zip_Code'].str[-5:]

# Select only the Zip Codes for Tampa, FL.
df = df.loc[df['City'] == "Tampa"]

Next, keep only the rows where the type is "Standard". This removes the "P.O. Box" and "Unique" Zip Codes.

In [6]:
# Remove rows with Type = "P.O. Box" and "Unique", and reset the index to start at 0
df = df[df.Type == 'Standard']
df = df.reset_index(drop=True)
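
The cleaning steps in the two cells above can also be sketched without pandas, which may help when checking the logic on a few rows by hand. This is only an illustration: `clean_rows` is a hypothetical helper name, the first two rows mirror rows shown in this notebook, and the third row is a made-up placeholder.

```python
import re

# Rows shaped like the scraped table: [zip text, city, county, type].
# The third row is a made-up placeholder used only to show the filter.
rows = [
    ["ZIP Code 32003", "Fleming Island", "Clay", "Standard"],
    ["ZIP Code 33602", "Tampa", "Hillsborough", "Standard"],
    ["ZIP Code 00000", "Tampa", "Hillsborough", "P.O. Box"],
]

def clean_rows(rows, city="Tampa", zip_type="Standard"):
    """Keep rows for one city and Zip Code type, reducing the zip text to five digits."""
    cleaned = []
    for zip_text, row_city, county, row_type in rows:
        match = re.search(r"(\d{5})$", zip_text)
        if match and row_city == city and row_type == zip_type:
            cleaned.append([match.group(1), row_city, county, row_type])
    return cleaned

tampa_rows = clean_rows(rows)
print(tampa_rows)
```

Matching five trailing digits with a regular expression is slightly stricter than slicing the last five characters, because a malformed cell is skipped instead of being kept as a bad Zip Code.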

The first five rows of the resulting dataframe look like this.

In [7]:
df.head()

Unnamed: 0,Zip_Code,City,County,Type
0,33602,Tampa,Hillsborough,Standard
1,33603,Tampa,Hillsborough,Standard
2,33604,Tampa,Hillsborough,Standard
3,33605,Tampa,Hillsborough,Standard
4,33606,Tampa,Hillsborough,Standard


<b>Step two:</b> The locations of the Zip Codes (latitude and longitude) will need to be collected. This will be accomplished through Nominatim in the Geopy library. This leverages the OpenStreetMap (OSM) dataset application programming interface (API) to geolocate each Zip Code.

In [8]:
# @hidden_cell
user_agent = "JGD_20191006"

In [9]:
tqdm.pandas()
geolocator = Nominatim(user_agent=user_agent)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=0.5, max_retries=2, error_wait_seconds=5.0, swallow_exceptions=True, return_value_on_exception=None)
df['location'] = df['Zip_Code'].progress_apply(partial(geocode, country_codes='us'))
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
df.head()

100%|██████████| 26/26 [00:27<00:00,  1.06s/it]


Unnamed: 0,Zip_Code,City,County,Type,location,point
0,33602,Tampa,Hillsborough,Standard,"(Ybor City, Tampa, Hillsborough County, Florid...","(27.9516574, -82.449638, 0.0)"
1,33603,Tampa,Hillsborough,Standard,"(Tampa, Hillsborough County, Florida, 33603, U...","(27.9823952329372, -82.4629461754956, 0.0)"
2,33604,Tampa,Hillsborough,Standard,"(Sulphur Springs, Tampa, Hillsborough County, ...","(28.0127051, -82.4665599, 0.0)"
3,33605,Tampa,Hillsborough,Standard,"(East Ybor, Tampa, Hillsborough County, Florid...","(27.96589, -82.4209639, 0.0)"
4,33606,Tampa,Hillsborough,Standard,"(Davis Islands, Tampa, Hillsborough County, Fl...","(27.9368959, -82.4596737, 0.0)"
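
To make the role of the rate limiter more concrete, here is a minimal sketch of the general idea: space calls out, retry after an error, and return `None` if every attempt fails. It is not geopy's implementation. `rate_limited` and `fake_geocode` are hypothetical names, and the parameter names are borrowed from the call above only for readability.

```python
import time

def rate_limited(func, min_delay_seconds=0.5, max_retries=2, error_wait_seconds=5.0):
    """Wrap func so calls are spaced out, retried on error, and return None on final failure."""
    last_call = [float("-inf")]  # time of the previous call (mutable so the wrapper can update it)

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            # Wait until at least min_delay_seconds have passed since the previous call.
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt < max_retries:
                    time.sleep(error_wait_seconds)
        return None  # every attempt failed, so the error is swallowed

    return wrapper

# Hypothetical stand-in for a geocoder call; it fails for one specific query.
def fake_geocode(query):
    if query == "bad":
        raise ValueError("lookup failed")
    return "found " + query

geocode_demo = rate_limited(fake_geocode, min_delay_seconds=0.01,
                            max_retries=1, error_wait_seconds=0.01)
results = [geocode_demo(q) for q in ["33602", "bad", "33603"]]
print(results)
```

Returning `None` for a failed lookup is why the `point` column above is built with a conditional: rows without a location need to be handled before their coordinates are unpacked.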


In [10]:
df[['Latitude','Longitude','3']] = pd.DataFrame(df['point'].tolist(), index=df.index)
df = df.drop(columns=['point','3'])

Now the latitude and longitude values for each of the Zip Codes are separated out into their respective columns. Next we keep the first portion of the location string, removing everything after the first comma, and rename the column "Neighborhood".

In [11]:
df['location'] = df['location'].astype(str)
df['location'] = df['location'].str.split(", ").str[0].tolist()
df = df.rename(columns={"location": "Neighborhood"})
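
The two transformations above (splitting the point tuple and trimming the address string) can be sketched on single values as follows. The helper names `split_point` and `first_component` are hypothetical, the address string is a shortened illustration, and the `None` guard is an assumption about how a failed lookup might be handled.

```python
def split_point(point):
    """Return (latitude, longitude) from a (lat, lon, altitude) tuple, or (None, None)."""
    if point is None:  # the geocoder found nothing for this Zip Code
        return (None, None)
    latitude, longitude, _altitude = point
    return (latitude, longitude)

def first_component(location_text):
    """Return the text before the first ', ' in a geocoder address string."""
    return str(location_text).split(", ")[0]

coords = split_point((27.9516574, -82.449638, 0.0))
missing = split_point(None)
name = first_component("Ybor City, Tampa")  # shortened address, for illustration only
print(coords, missing, name)
```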

The dataframe now looks like this.

In [12]:
df.head()

Unnamed: 0,Zip_Code,City,County,Type,Neighborhood,Latitude,Longitude
0,33602,Tampa,Hillsborough,Standard,Ybor City,27.951657,-82.449638
1,33603,Tampa,Hillsborough,Standard,Tampa,27.982395,-82.462946
2,33604,Tampa,Hillsborough,Standard,Sulphur Springs,28.012705,-82.46656
3,33605,Tampa,Hillsborough,Standard,East Ybor,27.96589,-82.420964
4,33606,Tampa,Hillsborough,Standard,Davis Islands,27.936896,-82.459674


<b>Step three:</b> The last piece of Zip Code data needed is the boundary of each Zip Code. The boundaries are stored as the latitudes and longitudes of the vertices of polygons representing the area corresponding to each Zip Code. This data is downloaded as a GeoJSON file from https://opendata.arcgis.com/datasets/d356e19e0fb34524b54d189fafb0d675_0.geojson.

### Business Data
Once the Zip Code data are collected, we need to collect data on the surrounding businesses. We use the Foursquare API to collect data about the businesses near each Zip Code location.

## Methodology <a name="methodology"></a>

#### Locate Zip Codes Lacking Gyms
We can start by visualizing the location of each zip code (based on the coordinates associated with it).

In [13]:
# Create map of Tampa using latitude and longitude values
tampa = geolocator.geocode({"state": "fl", "city": "tampa"})
map_tampa = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=10)

# add markers to map
for lat, lng, hood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(hood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tampa)  
    
map_tampa

In [14]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

Create the URL that will query the Foursquare API for up to 150 venues (`LIMIT`) within 1,000 meters (`radius`) of the first Zip Code location in the dataframe. The cell above assigns the client ID and client secret to variables that are used below.

In [15]:
search_lat = df.Latitude[0]
search_lon = df.Longitude[0]
LIMIT = 150 # Limit of number of venues returned by Foursquare API
radius = 1000 # Define radius in meters

# Create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_lat, 
    search_lon, 
    radius, 
    LIMIT)
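
As an alternative to formatting the query string by hand, the same parameters can be assembled with the standard library's `urlencode`, which handles escaping. This is a sketch only: `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Assemble the explore URL from the same parameters used in the cell above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),  # latitude,longitude as one value
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; the coordinates are those of the first Zip Code shown earlier.
demo_url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
                             27.9516574, -82.449638, 1000, 150)
print(demo_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is the standard escaped form of the same value.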


The URL is requested using `requests.get()`, and the response is parsed from JSON into a Python dictionary.

In [16]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5dfbbff247b43d38a9690110'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Village of Tampa',
  'headerFullLocation': 'Village of Tampa, University',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 87,
  'suggestedBounds': {'ne': {'lat': 27.960657409000007,
    'lng': -82.43946844978649},
   'sw': {'lat': 27.94265739099999, 'lng': -82.4598075502135}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b53ccf9f964a520b8ab27e3',
       'name': 'Pour House at Grand Central',
       'location': {'address': '1208 E Kennedy Blvd #112',
        'crossStreet': 'at N. 11th St.',
        'lat': 27.95

After reviewing the structure of the JSON returned above, the function below was created to extract the category of each venue.

In [17]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [18]:
venues = results['response']['groups'][0]['items']
#venues # Uncomment to see the results, potentially large.

In [19]:
# Flatten JSON
nearby_venues = json_normalize(venues)

In [20]:
# Filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

In [21]:
# Filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

In [22]:
# Clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

In [23]:
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Pour House at Grand Central,Bar,27.951357,-82.44774
1,Crunch - Channelside,Gym / Fitness Center,27.951152,-82.44794
2,Cena,Italian Restaurant,27.951569,-82.447869
3,Publix - Channelside,Grocery Store,27.952128,-82.448741
4,City Dog Cantina,Mexican Restaurant,27.951118,-82.447726


This cell defines a function that uses the Foursquare API to find the nearby venues for all of the Zip Codes in the dataframe.

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zip_Code', 
                  'Zip Latitude', 
                  'Zip Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
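
The flattening step inside `getNearbyVenues` assumes that every response contains a group and that every venue has at least one category. A more defensive variant might look like the sketch below. `flatten_venues` is a hypothetical helper, and the sample response is hand-written to resemble the JSON shown earlier (the second venue is made up).

```python
def flatten_venues(zip_code, zip_lat, zip_lon, response):
    """Turn one explore-style response dict into flat rows for a single Zip Code."""
    groups = response.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None  # tolerate missing categories
        rows.append((zip_code, zip_lat, zip_lon, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written response shaped like the JSON shown earlier; the second venue is made up.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Crunch - Channelside",
               "location": {"lat": 27.951152, "lng": -82.44794},
               "categories": [{"name": "Gym / Fitness Center"}]}},
    {"venue": {"name": "Example venue without a category",
               "location": {"lat": 27.95, "lng": -82.45},
               "categories": []}},
]}]}}

rows = flatten_venues("33602", 27.951657, -82.449638, sample)
empty = flatten_venues("33603", 27.982395, -82.462946, {"meta": {"code": 400}})
print(len(rows), empty)
```

With this shape, a failed request for one Zip Code yields no rows instead of stopping the whole loop.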

The `getNearbyVenues` function can then be applied to the dataframe to create a dataframe of the venues near the coordinates associated with each Zip Code.

In [25]:
tampa_venues = getNearbyVenues(names=df['Zip_Code'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

The resulting dataframe will look like this.

In [26]:
print(tampa_venues.shape)
tampa_venues.head()

(655, 7)


Unnamed: 0,Zip_Code,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,33602,27.951657,-82.449638,Pour House at Grand Central,27.951357,-82.44774,Bar
1,33602,27.951657,-82.449638,Crunch - Channelside,27.951152,-82.44794,Gym / Fitness Center
2,33602,27.951657,-82.449638,Cena,27.951569,-82.447869,Italian Restaurant
3,33602,27.951657,-82.449638,Publix - Channelside,27.952128,-82.448741,Grocery Store
4,33602,27.951657,-82.449638,City Dog Cantina,27.951118,-82.447726,Mexican Restaurant


Before the frequency of each venue can be calculated, a list of the unique venue categories must be created and evaluated.

In [27]:
tampa_venues['Venue Category'].unique()

array(['Bar', 'Gym / Fitness Center', 'Italian Restaurant',
       'Grocery Store', 'Mexican Restaurant', 'Theater', 'Food Truck',
       'Art Gallery', 'Plaza', 'Dog Run', 'Pizza Place', 'Tea Room',
       'Beer Garden', 'Coffee Shop', 'Café', 'Asian Restaurant',
       'Skate Park', 'Movie Theater', 'Bagel Shop', 'Tex-Mex Restaurant',
       'Harbor / Marina', 'Speakeasy', 'Restaurant', 'Floating Market',
       'Hockey Arena', 'Aquarium', 'Print Shop', 'Hotel', 'Sports Bar',
       'Cuban Restaurant', 'Deli / Bodega', 'Mediterranean Restaurant',
       'Zoo Exhibit', 'Shipping Store', 'Gastropub', 'BBQ Joint',
       'Thai Restaurant', 'Park', 'Nightclub', 'Gym', 'Bowling Alley',
       'French Restaurant', 'Juice Bar', 'Cocktail Bar', 'Clothing Store',
       'Sandwich Place', 'Bank', 'Pharmacy', 'Caribbean Restaurant',
       'American Restaurant', 'Chinese Restaurant', 'Hookah Bar',
       'Fast Food Restaurant', 'Cruise', 'Port', 'Brewery',
       'Burger Joint', 'Playground', '

As you can see above, there are several venue categories that could be generally categorized as a 'gym'. There are other venue categories that are not necessarily types of gyms but might compete with a gym as a place where people go to be active. Another venue category that would compete with a gym is 'Military Base'. Military bases have gyms and fitness centers that military members can use at no cost, which could reduce the need for another gym in the area. We will need to recode these categories with a common category name (i.e., "Gym").

In [28]:
gym = ['Gym / Fitness Center', 'Park', 'Martial Arts Dojo', 'Gym', 'Pool', 'Tennis Court', 'Disc Golf', 'Volleyball Court',
       'Soccer Field', 'Basketball Court', 'Yoga Studio', 'College Basketball Court', 'College Gym','College Track',
       'Dance Studio', 'Military Base', 'Athletics & Sports', 'Golf Course', 'Baseball Field', 'Trail', 'Hockey Arena',
       'Hockey Field', 'Track', 'Water Park', 'Outdoors & Recreation', 'State / Provincial Park', 'Playground']

The gym-like venues in the venues dataframe can be renamed "Gym" to treat them as one category.

In [29]:
tampa_venues['Venue Category'] = tampa_venues['Venue Category'].replace(to_replace=gym, value="Gym")
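
The effect of this recoding can be illustrated on a plain Python list. The subset of gym-like categories and the sample values below are chosen only for illustration.

```python
# A small subset of the gym-like categories listed above.
gym_like = {"Gym / Fitness Center", "Yoga Studio", "Military Base", "Park"}

categories = ["Bar", "Gym / Fitness Center", "Park", "Coffee Shop", "Yoga Studio"]

# Collapse every gym-like category into the single label "Gym".
recoded = ["Gym" if category in gym_like else category for category in categories]

print(recoded)
print(len(set(categories)), "unique categories before,", len(set(recoded)), "after")
```

Collapsing the labels first means the later frequency calculation treats every gym-like venue as one combined category.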

A relative density of all venues for each zip code can be determined with a simple count.

In [30]:
tampa_venues.groupby('Zip_Code').count()

Unnamed: 0_level_0,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zip_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
33602,87,87,87,87,87,87
33603,19,19,19,19,19,19
33604,45,45,45,45,45,45
33605,11,11,11,11,11,11
33606,60,60,60,60,60,60
33607,4,4,4,4,4,4
33609,52,52,52,52,52,52
33610,15,15,15,15,15,15
33611,5,5,5,5,5,5
33612,24,24,24,24,24,24


In [31]:
print('There are {} unique categories.'.format(len(tampa_venues['Venue Category'].unique())))

There are 168 unique categories.


We use one-hot encoding to record the category of each venue. This creates a column for each unique category and, for every venue row, assigns a value of 1 in the column matching that venue's category and 0 in all other columns.

In [32]:
# One hot encoding
tampa_onehot = pd.get_dummies(tampa_venues[['Venue Category']], prefix="", prefix_sep="")

# Add zip code column back to dataframe
tampa_onehot['Zip_Code'] = tampa_venues['Zip_Code'] 

# Move zip code column to the first column
fixed_columns = [tampa_onehot.columns[-1]] + list(tampa_onehot.columns[:-1])
tampa_onehot = tampa_onehot[fixed_columns]

tampa_onehot.head()

Unnamed: 0,Zip_Code,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Asian Restaurant,Automotive Shop,BBQ Joint,Bagel Shop,...,Vape Store,Video Game Store,Video Store,Vietnamese Restaurant,Waste Facility,Wine Bar,Wings Joint,Women's Store,Zoo,Zoo Exhibit
0,33602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,33602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,33602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,33602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,33602,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


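With the one-hot encoded data, we can determine the frequency with which each venue type occurs in each Zip Code. Taking the mean of each column within a Zip Code gives the share of that Zip Code's venues that fall in each category. This results in a dataframe with a column for each unique venue type and a row for each unique Zip Code.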

In [33]:
tampa_grouped = tampa_onehot.groupby('Zip_Code').mean().reset_index()
tampa_grouped.head()

Unnamed: 0,Zip_Code,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Asian Restaurant,Automotive Shop,BBQ Joint,Bagel Shop,...,Vape Store,Video Game Store,Video Store,Vietnamese Restaurant,Waste Facility,Wine Bar,Wings Joint,Women's Store,Zoo,Zoo Exhibit
0,33602,0.0,0.034483,0.0,0.011494,0.011494,0.011494,0.0,0.022989,0.011494,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011494
1,33603,0.0,0.0,0.052632,0.0,0.105263,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,33604,0.0,0.022222,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.022222,0.288889
3,33605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,33606,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
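
A toy example may make the link between one-hot encoding and frequency clearer: because each venue row holds a single 1, the column means within a Zip Code are category shares that sum to 1. The six venues below are made up for illustration.

```python
import pandas as pd

# Made-up venues: four in one Zip Code, two in another.
venues = pd.DataFrame({
    "Zip_Code": ["33602", "33602", "33602", "33602", "33603", "33603"],
    "Venue Category": ["Bar", "Gym", "Bar", "Coffee Shop", "Brewery", "Gym"],
})

# One column per category, with a 1 marking each venue's own category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Zip_Code", venues["Zip_Code"])

# The mean of each 0/1 column is the share of venues in that category.
freq = onehot.groupby("Zip_Code").mean()
print(freq)
```

Here half of the venues in 33602 are bars and a quarter are gyms, so the `Bar` and `Gym` entries for that Zip Code are 0.5 and 0.25.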


Next we determine the five most frequent venue types within each Zip Code to describe its neighborhood 'type', so that Zip Codes can be compared by type similarity.

In [34]:
num_top_venues = 5

for hood in tampa_grouped['Zip_Code']:
    print("----"+hood+"----")
    temp = tampa_grouped[tampa_grouped['Zip_Code'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----33602----
         venue  freq
0          Bar  0.08
1          Gym  0.07
2  Coffee Shop  0.05
3         Café  0.05
4   Sports Bar  0.03


----33603----
                    venue  freq
0            Intersection  0.16
1                 Brewery  0.11
2             Art Gallery  0.11
3  Furniture / Home Store  0.05
4  Thrift / Vintage Store  0.05


----33604----
                          venue  freq
0                   Zoo Exhibit  0.29
1                           Gym  0.16
2                   Coffee Shop  0.07
3  Theme Park Ride / Attraction  0.04
4                           Bar  0.04


----33605----
                    venue  freq
0                     Gym  0.27
1            Dessert Shop  0.09
2             Record Shop  0.09
3  Furniture / Home Store  0.09
4          Hardware Store  0.09


----33606----
         venue  freq
0          Gym  0.17
1  Coffee Shop  0.07
2   Sports Bar  0.05
3  Pizza Place  0.05
4          Bar  0.05


----33607----
                     venue  freq
0      Re

We create a function that will return the most common venues for each zip code.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Zip_Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
zip_venues_sorted = pd.DataFrame(columns=columns)
zip_venues_sorted['Zip_Code'] = tampa_grouped['Zip_Code']

for ind in np.arange(tampa_grouped.shape[0]):
    zip_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tampa_grouped.iloc[ind, :], num_top_venues)

# Define a function to color the text red if a venue type is "Gym"
def color_gym_red(val):
    color = 'red' if val == "Gym" else 'black'
    return 'color: %s' % color

# Display the first five records of the dataframe, with "Gym" highlighted red.
zip_venues_sorted.head().style.applymap(color_gym_red)

Unnamed: 0,Zip_Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,33602,Bar,Gym,Coffee Shop,Café,American Restaurant
1,33603,Intersection,Art Gallery,Brewery,Burger Joint,Thrift / Vintage Store
2,33604,Zoo Exhibit,Gym,Coffee Shop,Bar,Theme Park Ride / Attraction
3,33605,Gym,Dessert Shop,Light Rail Station,Southern / Soul Food Restaurant,Hardware Store
4,33606,Gym,Coffee Shop,Pizza Place,Sports Bar,Bar
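
The ranking and column-naming logic above can be sketched in plain Python. `ordinal` and `top_categories` are hypothetical helpers. The frequencies are the rounded values printed earlier for 33602, and the alphabetical tie-break is an assumption added here so that categories with equal frequency come out in a repeatable order.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

def top_categories(frequencies, n):
    """Return the n most frequent categories, breaking ties alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]

# Rounded frequencies printed earlier for Zip Code 33602.
freqs = {"Bar": 0.08, "Gym": 0.07, "Coffee Shop": 0.05, "Café": 0.05, "Sports Bar": 0.03}

labels = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 4)]
top3 = top_categories(freqs, 3)
print(dict(zip(labels, top3)))
```

Without an explicit tie-break, categories with identical frequencies can appear in a different order from one display to the next.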


Now that we can see what the five most common venues are in each Zip Code, we can eliminate those Zip Codes with 'gym' type venues in the top five.

In [37]:
zip_venues_reduced = zip_venues_sorted[(zip_venues_sorted['1st Most Common Venue'] != 'Gym') & (zip_venues_sorted['2nd Most Common Venue'] != 'Gym') & 
                                       (zip_venues_sorted['3rd Most Common Venue'] != 'Gym') & (zip_venues_sorted['4th Most Common Venue'] != 'Gym') &
                                       (zip_venues_sorted['5th Most Common Venue'] != 'Gym')]
zip_venues_reduced

Unnamed: 0,Zip_Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,33603,Intersection,Art Gallery,Brewery,Burger Joint,Thrift / Vintage Store
5,33607,Health & Beauty Service,Food Truck,Rental Car Location,Harbor / Marina,Filipino Restaurant
15,33618,Pizza Place,American Restaurant,Coffee Shop,Massage Studio,Breakfast Spot
18,33625,Fast Food Restaurant,Cafeteria,Dog Run,Coffee Shop,Pizza Place
19,33626,Insurance Office,Home Service,Zoo Exhibit,Filipino Restaurant,Fried Chicken Joint


Now the list only includes Zip Codes where 'gym' type venues are not one of the five most frequent venue types. We can sort this list by ascending frequency of gyms, so the Zip Codes with the fewest gym-type venues come first. Where the gym frequencies are equal, records are sorted by Zip_Code in ascending order.

In [38]:
index = zip_venues_reduced.index
locations = tampa_grouped.loc[index, ['Zip_Code','Gym']].sort_values(by=['Gym','Zip_Code'], ascending=True)
locations

Unnamed: 0,Zip_Code,Gym
5,33607,0.0
18,33625,0.0
19,33626,0.0
15,33618,0.025641
1,33603,0.052632


Now that we have the reduced list of Zip Codes, we join it to our location dataframe, rename the 'Gym' column to 'Gym Frequency', and reset the index.

In [39]:
cols = ['Zip_Code']
locations = locations.join(df.set_index(cols), on=cols)
locations = locations.rename(columns={"Gym": "Gym Frequency"}).reset_index(drop=True)
locations

Unnamed: 0,Zip_Code,Gym Frequency,City,County,Type,Neighborhood,Latitude,Longitude
0,33607,0.0,Tampa,Hillsborough,Standard,Tampa,27.973131,-82.585196
1,33625,0.0,Tampa,Hillsborough,Standard,Hillsborough County,28.068383,-82.557238
2,33626,0.0,Tampa,Hillsborough,Standard,Hillsborough County,28.057031,-82.610797
3,33618,0.025641,Tampa,Hillsborough,Standard,Mullis City,28.039589,-82.508293
4,33603,0.052632,Tampa,Hillsborough,Standard,Tampa,27.982395,-82.462946


Now we can display the locations on a map. Selecting a marker on the map will display that Zip Code and the frequency of 'gym' type venues within 1 km of the Zip Code's central point.

In [40]:
# Create map
map_locations = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=11)

# Assign one color to each candidate Zip Code
colors_array = cm.rainbow(np.linspace(0, 1, locations.shape[0]))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Add markers to the map
for color, lat, lon, poi, freq, nbh in zip(rainbow, locations['Latitude'], locations['Longitude'],
                                           locations['Zip_Code'], locations['Gym Frequency'],
                                           locations['Neighborhood']):
    label = folium.Popup('Neighborhood: ' + str(nbh) + ' / Zip Code: ' + str(poi) + ' / Gym Frequency: ' + str(freq),
                         parse_html=False)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(map_locations)
       
map_locations

In [41]:
url = 'https://opendata.arcgis.com/datasets/d356e19e0fb34524b54d189fafb0d675_0.geojson'
with urllib.request.urlopen(url) as response:
    plt = json.loads(response.read().decode())
#plt #If uncommented the JSON will be displayed, and is potentially very large.

Using the GeoJSON file from https://opendata.arcgis.com/datasets/d356e19e0fb34524b54d189fafb0d675_0.geojson, polygons for the Zip Codes of interest can be defined using the latitude and longitude coordinates. Below we remove every feature whose Zip Code is not in the `locations` dataframe, so that only the polygons for the candidate Zip Codes remain.

In [42]:
# Keep only the features whose Zip Code appears in the candidate locations
keep = set(locations['Zip_Code'])
plt['features'] = [feature for feature in plt['features']
                   if feature['properties']['Zip_Code'] in keep]

## Results <a name="results"></a>

Now the polygons for the areas represented by the zip code can be overlaid on the map.

In [43]:
# Create map
map_test = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=11)

# Add polygons to the map
for feature in plt['features']:
    zip_code = feature['properties']['Zip_Code']
    neighborhood = locations.loc[locations['Zip_Code'] == zip_code, 'Neighborhood'].iloc[0]
    # Named `layer` to avoid shadowing the imported `geojson` module
    layer = folium.GeoJson(feature, name=neighborhood)
    folium.Popup(neighborhood + " " + zip_code).add_to(layer)
    layer.add_to(map_test)

folium.LayerControl().add_to(map_test)

map_test

## Discussion <a name="discussion"></a>

Using this method, the analyst is able to quickly gather and display location and venue information for the area of interest. With this data, the analyst can categorize the areas by the types of venues in each area and the frequency with which they occur. This allows for a cursory analysis to narrow down the locations that may be good choices for a new gym facility.

There are some drawbacks to this approach. The primary one is that the search for venues is conducted in a circular area with a radius of 1 km around the coordinates returned by Nominatim for each Zip Code. These coordinates do not always correspond to the geographic center of the Zip Code area. If the coordinates map to a remote section of the Zip Code area, there may not be many venues within 1 km of the point. Also, some of the points may be less than 1 km from the boundary, so venues from neighboring Zip Codes may be counted, and the same venue may be included for multiple Zip Codes.
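
One way to gauge the overlap problem is to compute the distance between two search centers: if they are less than twice the search radius apart, their circles overlap. The sketch below uses the haversine formula with the coordinates geocoded earlier for 33602 and 33606. The helper names are hypothetical, and the Earth radius is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def circles_overlap(p1, p2, radius_km=1.0):
    """True if two search circles of equal radius around p1 and p2 overlap."""
    return haversine_km(*p1, *p2) < 2 * radius_km

# Coordinates geocoded earlier in this notebook.
zip_33602 = (27.9516574, -82.449638)
zip_33606 = (27.9368959, -82.4596737)

distance = haversine_km(*zip_33602, *zip_33606)
overlap = circles_overlap(zip_33602, zip_33606)
print(round(distance, 2), overlap)
```

These two centers are roughly 1.9 km apart, so their 1 km search circles overlap slightly and a venue in the shared area could be counted for both Zip Codes.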

However, the strength of this methodology is that it is dynamic. As venue information is added or modified within the Foursquare platform, the results of this analysis will take those changes into account when it is rerun.

## Conclusion <a name="conclusion"></a>

At the time of this model run, 5 Zip Codes met the criteria for the new location. The customer can now focus the location search on a few Zip Codes, saving time and money.