# Final Project:
## New Facility Location Selection
### by: Jeffrey Dupree

This notebook scrapes ZIP Code information from the ZIP-CODES.COM page https://www.zip-codes.com/state/fl.asp#zipcodes to create a dataframe consisting of the ZIP Code, the city name, the county name, and the ZIP Code type.

#### Section One: Scrape Tampa, FL ZIP Codes from website

First, we import the necessary libraries, installing any that are missing.

In [3]:
# If you don't have these packages available, uncomment the appropriate lines below to install them.

import sys
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install lxml
#!{sys.executable} -m pip install requests

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Next, we need to get the information from the webpage using `requests.get`.

In [4]:
source = requests.get('https://www.zip-codes.com/state/fl.asp#zipcodes').text

Use the BeautifulSoup package to scrape the information from the webpage. I used the lxml parsing method, but you can use any you like.

In [5]:
soup = BeautifulSoup(source, 'lxml')

Find the table using `soup.find` from BeautifulSoup. Uncomment the second line to see the structure and content of the table. The tags are needed for the next steps.

In [6]:
table = soup.find(id="tblZIP")
# print(table.prettify())

Now a pandas dataframe needs to be created. This requires looping through the rows of the table and appending the cell values of each row to a list. The list can then be made into a dataframe using `pd.DataFrame`. The columns need header names; I assigned these manually instead of pulling them from the BeautifulSoup object `table`.

In [7]:
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Keep only the non-empty cell text for this row.
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)

# Label the columns.
df = pd.DataFrame(res[1:], columns=['Zip_Code','City','County','Type'])

# Remove the text 'Zip Code' from the records in the Zip Code column.
df['Zip_Code'] = df['Zip_Code'].str[-5:]

# Select only the Zip Codes for Tampa, FL.
df = df.loc[df['City'] == "Tampa"]
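The slice `str[-5:]` above assumes every cell ends with exactly five digits. As a hedged alternative, the sketch below pulls the ZIP Code out with a regular expression, so trailing whitespace or a changed label is less likely to corrupt the value silently. The helper name `extract_zip` and the sample strings are assumptions made for illustration; they are not taken from the scraped page.

```python
import re

# Matches a standalone run of exactly five digits.
ZIP_PATTERN = re.compile(r"\b(\d{5})\b")

def extract_zip(cell_text):
    """Return the first five-digit ZIP Code found in cell_text, or None."""
    match = ZIP_PATTERN.search(cell_text)
    return match.group(1) if match else None

# Hypothetical cell contents in the style of the scraped table.
samples = ["ZIP Code 33602", "ZIP Code 33603 ", "no digits here"]
zips = [extract_zip(s) for s in samples]
print(zips)
```

In pandas, the same pattern can be applied to a whole column with `df['Zip_Code'].str.extract(r'(\d{5})', expand=False)`.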

Next, remove the rows where the type is "P.O. Box" or "Unique", keeping only the "Standard" ZIP Codes.

In [8]:
# Keep only rows with Type = "Standard" (this drops "P.O. Box" and "Unique"), and reset the index to start at 0
df = df[df.Type == 'Standard']
df = df.reset_index(drop=True)

The resulting dataframe looks like this.

In [9]:
df

Unnamed: 0,Zip_Code,City,County,Type
0,33602,Tampa,Hillsborough,Standard
1,33603,Tampa,Hillsborough,Standard
2,33604,Tampa,Hillsborough,Standard
3,33605,Tampa,Hillsborough,Standard
4,33606,Tampa,Hillsborough,Standard
5,33607,Tampa,Hillsborough,Standard
6,33609,Tampa,Hillsborough,Standard
7,33610,Tampa,Hillsborough,Standard
8,33611,Tampa,Hillsborough,Standard
9,33612,Tampa,Hillsborough,Standard


Check the size of the dataframe.

In [10]:
df.shape

(26, 4)

#### Section Two: Geolocate ZIP Codes

In [11]:
# @hidden_cell
user_agent = "JGD_20191006"

In [12]:
import re

# Uncomment the next line to install geopy if necessary.
#!{sys.executable} -m pip install geopy

# tqdm shows a progress bar, which helps when an iterative process takes more than a few seconds.
from tqdm import tqdm
tqdm.pandas()

# partial lets us fix extra keyword arguments (here country_codes) on the rate-limited geocode call.
from functools import partial

import geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent=user_agent)

# RateLimiter spaces out requests and retries failures, so the service is less likely to reject us for sending too many requests.
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=0.5, max_retries=2,
                      error_wait_seconds=5.0, swallow_exceptions=True,
                      return_value_on_exception=None)
df['location'] = df['Zip_Code'].progress_apply(partial(geocode, country_codes='us'))

df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
df

100%|██████████████████████████████████████████████████████████████████████████████████| 26/26 [00:26<00:00,  1.02s/it]


Unnamed: 0,Zip_Code,City,County,Type,location,point
0,33602,Tampa,Hillsborough,Standard,"(Tampa Heights, Tampa, Hillsborough County, Fl...","(27.964132, -82.459452, 0.0)"
1,33603,Tampa,Hillsborough,Standard,"(Tampa, Florida, 33603, USA, (27.9824664343995...","(27.9824664343995, -82.4630092024653, 0.0)"
2,33604,Tampa,Hillsborough,Standard,"(Sulphur Springs, Tampa, Hillsborough County, ...","(28.0127051, -82.4665599, 0.0)"
3,33605,Tampa,Hillsborough,Standard,"(East Ybor, Tampa, Hillsborough County, Florid...","(27.96589, -82.4209639, 0.0)"
4,33606,Tampa,Hillsborough,Standard,"(Davis Islands, Tampa, Hillsborough County, Fl...","(27.9368959, -82.4596737, 0.0)"
5,33607,Tampa,Hillsborough,Standard,"(Tampa, Hillsborough County, Florida, 33607, U...","(27.9727946, -82.5855743, 0.0)"
6,33609,Tampa,Hillsborough,Standard,"(Palma Ceia, Tampa, Hillsborough County, Flori...","(27.9448134, -82.5362755, 0.0)"
7,33610,Tampa,Hillsborough,Standard,"(Ybor City, Tampa, Hillsborough County, Florid...","(27.977944, -82.4429745, 0.0)"
8,33611,Tampa,Hillsborough,Standard,"(Palma Ceia, Tampa, Hillsborough County, Flori...","(27.8731959, -82.4885783, 0.0)"
9,33612,Tampa,Hillsborough,Standard,"(Sulphur Springs, Tampa, Hillsborough County, ...","(28.0495089, -82.4146255, 0.0)"
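The `RateLimiter` wrapper used above spaces out calls and retries failed ones. To make that behaviour concrete, here is a simplified, standard-library sketch of the same idea. It is not geopy's implementation, and `rate_limited`, `flaky_lookup`, and the returned coordinates are made up for this illustration.

```python
import time

def rate_limited(func, min_delay=0.5, max_retries=2, error_wait=5.0):
    """Wrap func so attempts are at least min_delay seconds apart and a
    failing call is retried up to max_retries times (None if all fail)."""
    last_call = [None]  # mutable holder for the time of the previous attempt

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            # Sleep until min_delay has passed since the previous attempt.
            if last_call[0] is not None:
                remaining = min_delay - (time.monotonic() - last_call[0])
                if remaining > 0:
                    time.sleep(remaining)
            last_call[0] = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Wait a little longer before retrying a failed attempt.
                if attempt < max_retries:
                    time.sleep(error_wait)
        return None  # every attempt failed; similar in spirit to swallowing exceptions

    return wrapper

# Hypothetical lookup that fails on its first call, then succeeds.
calls = {"n": 0}

def flaky_lookup(zip_code):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("temporary failure")
    return (zip_code, 27.96, -82.46)  # invented coordinates

lookup = rate_limited(flaky_lookup, min_delay=0.05, error_wait=0.01)
start = time.monotonic()
first = lookup("33602")
second = lookup("33603")
elapsed = time.monotonic() - start
print(first, second, round(elapsed, 2))
```

Returning `None` after the final failure is what lets the later `lambda loc: tuple(loc.point) if loc else None` step skip ZIP Codes that could not be geocoded.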


In [13]:
df[['Latitude','Longitude','3']] = pd.DataFrame(df['point'].tolist(), index=df.index)
df = df.drop(columns=['point','3'])
df

Unnamed: 0,Zip_Code,City,County,Type,location,Latitude,Longitude
0,33602,Tampa,Hillsborough,Standard,"(Tampa Heights, Tampa, Hillsborough County, Fl...",27.964132,-82.459452
1,33603,Tampa,Hillsborough,Standard,"(Tampa, Florida, 33603, USA, (27.9824664343995...",27.982466,-82.463009
2,33604,Tampa,Hillsborough,Standard,"(Sulphur Springs, Tampa, Hillsborough County, ...",28.012705,-82.46656
3,33605,Tampa,Hillsborough,Standard,"(East Ybor, Tampa, Hillsborough County, Florid...",27.96589,-82.420964
4,33606,Tampa,Hillsborough,Standard,"(Davis Islands, Tampa, Hillsborough County, Fl...",27.936896,-82.459674
5,33607,Tampa,Hillsborough,Standard,"(Tampa, Hillsborough County, Florida, 33607, U...",27.972795,-82.585574
6,33609,Tampa,Hillsborough,Standard,"(Palma Ceia, Tampa, Hillsborough County, Flori...",27.944813,-82.536276
7,33610,Tampa,Hillsborough,Standard,"(Ybor City, Tampa, Hillsborough County, Florid...",27.977944,-82.442975
8,33611,Tampa,Hillsborough,Standard,"(Palma Ceia, Tampa, Hillsborough County, Flori...",27.873196,-82.488578
9,33612,Tampa,Hillsborough,Standard,"(Sulphur Springs, Tampa, Hillsborough County, ...",28.049509,-82.414625


Now there are latitude and longitude values for each of the postal codes.


#### Section Three

In [2]:
import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Comment out the next line if folium is already installed.
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library


In [None]:
# This check may no longer be useful; consider deleting it.
# df has no 'Borough' column, so count the distinct ZIP Codes instead.
print('The dataframe has {} ZIP Codes.'.format(
        len(df['Zip_Code'].unique())
    )
)

In [17]:
# create map of Tampa using latitude and longitude values
tampa = geolocator.geocode({"state": "fl", "city": "tampa"})
map_tampa = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=11)

# add markers to map
for lat, lng, county, city in zip(df['Latitude'], df['Longitude'], df['County'], df['City']):
    label = '{}, {}'.format(county, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tampa)  
    
map_tampa

In [20]:
tampa = geolocator.geocode({"state": "fl", "city": "tampa"})
map_tampa = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=11)
map_tampa

In [None]:
# The code was removed by Watson Studio for sharing.

Create the URL that queries the Foursquare API for up to 100 venues within 500 meters of the location. The cell above, whose contents were removed for sharing, assigns the client ID and client secret to variables that are used below.

In [None]:
search_lat = df.Latitude[0]
search_lon = df.Longitude[0]
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_lat, 
    search_lon, 
    radius, 
    LIMIT)
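String formatting works for this URL, but it does not escape special characters in the values. A minimal sketch of an alternative, assuming the same endpoint and parameter names as the cell above, builds the query string with `urllib.parse.urlencode`. The helper `build_explore_url` is hypothetical, and the credential and version values are placeholders rather than real keys.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon,
                      radius=500, limit=100):
    """Assemble the venues/explore URL with properly encoded parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    return base + "?" + urlencode(params)

# Placeholder credentials and version string; substitute your own values.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "YYYYMMDD",
                        27.964132, -82.459452)
print(url)
```

An equivalent option is to pass the same dictionary to `requests.get(base, params=params)` and let `requests` do the encoding.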


In [None]:
results = requests.get(url).json()
results

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    # The key is 'categories' on raw venue data and 'venue.categories' after json_normalize.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

This creates a function for using the Foursquare API to find the nearby venues for all of the boroughs in the dataframe.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
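The comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` for any venue that comes back with an empty category list. The sketch below shows one defensive way to flatten a single item. `venue_to_row` is a hypothetical helper, and the sample items are invented to mimic the nested structure the function reads; they are not actual API output.

```python
def venue_to_row(name, lat, lng, item):
    """Flatten one result item into a tuple, tolerating missing fields."""
    venue = item.get("venue", {})
    location = venue.get("location", {})
    categories = venue.get("categories") or []
    # Use the first category name when there is one, otherwise None.
    category = categories[0].get("name") if categories else None
    return (
        name,
        lat,
        lng,
        venue.get("name"),
        location.get("lat"),
        location.get("lng"),
        category,
    )

# Invented items shaped like the entries the function iterates over.
items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 27.965, "lng": -82.458},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabeled Spot",
               "location": {"lat": 27.963, "lng": -82.460},
               "categories": []}},
]

rows = [venue_to_row("33602", 27.964132, -82.459452, it) for it in items]
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step, instead of stopping the whole loop.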

In [None]:
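# df has no 'Borough' column, so the ZIP Code is used as the location name.
# getNearbyVenues still labels this column 'Borough' in its output.
toronto_venues = getNearbyVenues(names=df['Zip_Code'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )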

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

In [None]:
toronto_venues.groupby('Borough').count()

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

We use one-hot encoding to determine if a venue type exists in a neighborhood. This will create a column for each of the unique categories, and assign a value of 1 if that venue type exists in the neighborhood or 0 otherwise.

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
toronto_onehot.shape

With the one-hot encoded data, we can determine the frequency with which each venue type occurs in each borough. This results in a dataframe with a column for each unique venue type and a row for each unique borough.

In [None]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped
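To make the arithmetic behind the two cells above concrete: taking the mean of a one-hot column within a group is the same as dividing that category's count by the number of venues in the group. The dependency-free sketch below shows this on invented data; the group names and categories are made up, and `top_categories` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Invented (group, venue category) pairs standing in for the venues dataframe.
venues = [
    ("33602", "Coffee Shop"),
    ("33602", "Coffee Shop"),
    ("33602", "Park"),
    ("33602", "Bar"),
    ("33603", "Park"),
    ("33603", "Pizza Place"),
]

# Count how often each category appears in each group.
counts = defaultdict(Counter)
for group, category in venues:
    counts[group][category] += 1

# Frequency = category count / total venues in the group, which equals
# the mean of that category's one-hot column within the group.
frequencies = {
    group: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for group, counter in counts.items()
}

def top_categories(group, n=2):
    """Return the n most frequent categories for a group."""
    return [cat for cat, _ in counts[group].most_common(n)]

print(frequencies["33602"])
print(top_categories("33602", n=1))
```

Categories that never occur in a group are simply absent here, whereas the dataframe version stores them as explicit zeros.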

Next, we determine the five most frequent venues within each borough to describe a neighborhood 'type', and group the boroughs by type similarity.

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Borough']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    # Use 'st', 'nd', 'rd' for the first three ranks and 'th' for the rest.
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

boroughs_venues_sorted.head()

Using k-means clustering, we group the boroughs by the similarity of the venues available. For this example we chose 5 clusters, but this can be adjusted by setting the `kclusters` variable to the desired number of clusters in the code below.

In [None]:
# set number of clusters
kclusters = 5

# Drop the label column so only the numeric frequency columns are clustered.
toronto_grouped_clustering = toronto_grouped.drop(columns='Borough')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
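For readers who want to see what a k-means fit does in principle, here is a simplified pure-Python sketch of Lloyd's algorithm (alternating assignment and centroid-update steps) on invented 2-D points. It is not scikit-learn's implementation, which is considerably more sophisticated; `simple_kmeans` and `squared_distance` are hypothetical helpers, and the points are made up.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def simple_kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Invented points forming two loose groups.
points = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
          (5.0, 5.0), (5.4, 5.1), (5.1, 5.6)]
labels, centroids = simple_kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the starting centroids, and k-means can settle in a poor local optimum, which is one reason to fix `random_state` when results need to be reproducible.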

Each borough is now assigned to one of five clusters, indexed as 0-4.

In [None]:
# add clustering labels
boroughs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# Join the cluster labels and top venues onto df. df has no 'Borough' column,
# so its Zip_Code column is matched against the 'Borough' index.
toronto_merged = toronto_merged.join(boroughs_venues_sorted.set_index('Borough'), on='Zip_Code')

toronto_merged.head() # check the last columns!

Visualized on a map, the borough clusters look like this.

In [None]:
# create map
# Center the map on Tampa, using the `tampa` location geocoded earlier.
map_clusters = folium.Map(location=[tampa.latitude, tampa.longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
# toronto_merged is built from df, which identifies each location by Zip_Code.
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Zip_Code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters