### Clustering of San Francisco Bay Area

This project clusters the ZIP codes of the San Francisco Bay Area by their most popular venues.

The target audience is business owners who already run a successful business in one Bay Area ZIP code and want to expand (or open more shops) in other Bay Area ZIP codes. They need data science input to judge where they are likely to succeed.

The project helps these business owners shortlist areas whose market and customer characteristics, as found by clustering, resemble those of the ZIP code where their current business succeeds. The reasoning is that the venue data is sourced from customer reviews of businesses, which indicate both the volume of customers and their interests. An owner who has been successful in one ZIP code is therefore more likely to be successful in other ZIP codes in the same cluster, and may prefer those over ZIP codes that fall in other clusters.

#### Data Source

I used the public Bay Area ZIP Codes dataset from sfgov.org, downloaded from https://data.sfgov.org/Geographic-Locations-and-Boundaries/Bay-Area-ZIP-Codes/u5j3-svi6. I also used the Foursquare API to get information about popular venues in the Bay Area ZIP codes.

So, let us get into the project.

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#!pip install geopy
!pip install --upgrade --force-reinstall geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
from pandas import json_normalize # transform JSON file into a pandas dataframe


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


<a id='item1'></a>

We start by reading the San Francisco Bay Area CSV data file, which was uploaded alongside this notebook:

In [2]:
BA_df=pd.read_csv('/kaggle/input/bay-area-zip-codes/bayarea_zipcodes.csv')
BA_df.head()


Unnamed: 0,PO_NAME,the_geom,ZIP,STATE,Area__,Length__
0,NAPA,MULTIPOLYGON (((-122.10329200180091 38.5132829...,94558,CA,12313260000.0,995176.225313
1,FAIRFIELD,MULTIPOLYGON (((-121.947475002335 38.301511000...,94533,CA,991786100.0,200772.556587
2,DIXON,MULTIPOLYGON (((-121.65335500334429 38.3133870...,95620,CA,7236950000.0,441860.2014
3,SONOMA,MULTIPOLYGON (((-122.406843003057 38.155681999...,95476,CA,3001414000.0,311318.546326
4,NAPA,MULTIPOLYGON (((-122.29368500225117 38.1552379...,94559,CA,1194302000.0,359104.646602


Drop the columns 'the_geom', 'Area__' and 'Length__'

In [3]:
BA_df.drop(['the_geom','Area__','Length__'],axis=1, inplace=True)
BA_df.head()
#BA_df.shape

Unnamed: 0,PO_NAME,ZIP,STATE
0,NAPA,94558,CA
1,FAIRFIELD,94533,CA
2,DIXON,95620,CA
3,SONOMA,95476,CA
4,NAPA,94559,CA


Geocode each ZIP code to get its latitude and longitude

In [4]:
geolocator = Nominatim(user_agent="ba_explorer")

#BA_df_s=BA_df.loc[0:19,:]
BA_df_s=BA_df
#BA_df_s.head()

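latAll=[]
longAll=[]

for z in BA_df_s['ZIP']:
    # geocode each ZIP once and reuse the result for both coordinates
    # (a bare ZIP query is not restricted to the US, so some matches may be in other countries)
    location = geolocator.geocode(z)
    latAll.append(location.latitude)
    longAll.append(location.longitude)

BA_df_s['LAT']=latAll
BA_df_s['LONG']=longAll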

BA_df_s.head()

Unnamed: 0,PO_NAME,ZIP,STATE,LAT,LONG
0,NAPA,94558,CA,38.319227,-122.286037
1,FAIRFIELD,94533,CA,47.757226,18.127488
2,DIXON,95620,CA,49.114702,2.208189
3,SONOMA,95476,CA,38.289863,-122.463968
4,NAPA,94559,CA,48.883567,12.79424
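
Several coordinates in the output above fall far outside California (for example FAIRFIELD at 47.757226, 18.127488 and DIXON at 49.114702, 2.208189), which suggests that a bare ZIP query can match postcodes in other countries. The following is a minimal sketch of one way to narrow the search, not the method used in this notebook. It assumes geopy's `Nominatim.geocode` is called with a structured `postalcode` query and the `country_codes` argument; the helper name `geocode_us_zip` is hypothetical, and the geocoder is passed in as a parameter so that the helper can be tried without network access.

```python
from typing import Callable, Optional, Tuple


def geocode_us_zip(zip_code, geocode: Callable) -> Tuple[Optional[float], Optional[float]]:
    """Geocode one ZIP code with a single request, restricted to the United States.

    `geocode` is any callable with the same interface as geopy's
    Nominatim.geocode (hypothetical helper, not part of the original notebook).
    """
    # A structured query marks the value as a postal code rather than free text,
    # and country_codes='us' limits the search to US results.
    location = geocode({'postalcode': str(zip_code)}, country_codes='us')
    if location is None:
        # No match: return empty coordinates instead of raising AttributeError.
        return None, None
    return location.latitude, location.longitude


# Usage in the notebook (requires network access, so it is left commented out):
# geolocator = Nominatim(user_agent="ba_explorer")
# coords = [geocode_us_zip(z, geolocator.geocode) for z in BA_df_s['ZIP']]
```

Rows that come back as `(None, None)` can then be dropped or inspected by hand before the venue lookup.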


The map below shows the locations of the ZIP codes on a map of the Bay Area

In [5]:
# create map of Bay Area using latitude and longitude values
latitude1=37.834929
longitude1=-122.042662
map_ba = folium.Map(location=[latitude1, longitude1], zoom_start=9)

# add markers to map
for lat, lng, po, zipcode in zip(BA_df_s['LAT'], BA_df_s['LONG'], BA_df_s['PO_NAME'], BA_df_s['ZIP']):
    label = '{}, {}'.format(po, zipcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ba)  
    
map_ba

Now define the Foursquare credentials

In [6]:
CLIENT_ID = 'PQAASFB12T0V......5M2XGTGT0O5' # your Foursquare ID
CLIENT_SECRET = '1R1LEKIGYRWH.......I0JYD20QRT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PQAASFB12T0V......5M2XGTGT0O5
CLIENT_SECRET: 1R1LEKIGYRWH.......I0JYD20QRT
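
Printing credentials in a notebook makes them easy to leak when the notebook is shared. As a hedged sketch of an alternative, the credentials could be read from environment variables and only a masked form printed. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for illustration; they are not part of the Foursquare API or of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    The variable names are hypothetical; use whatever names your setup defines.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first `visible` characters and replace the rest with '*'."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Usage: print only the masked values.
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
# print('CLIENT_ID: ' + mask(CLIENT_ID))
```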


Exploratory data analysis for the first row using the Foursquare API

In [7]:
neighborhood_latitude = BA_df_s.loc[0, 'LAT'] # neighborhood latitude value
neighborhood_longitude = BA_df_s.loc[0, 'LONG'] # neighborhood longitude value

neighborhood_name = BA_df_s.loc[0, 'PO_NAME'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

LIMIT = 5
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},\
{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, \
                                   neighborhood_longitude, VERSION, radius, LIMIT)

print(url)

Latitude and longitude values of NAPA are 38.3192266951654, -122.2860374389313.
https://api.foursquare.com/v2/venues/explore?client_id=PQAASFB12T0V......5M2XGTGT0O5&client_secret=1R1LEKIGYRWH.......I0JYD20QRT&ll=38.3192266951654,-122.2860374389313&v=20180605&radius=1000&limit=5


In [8]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now make a trial call to the Foursquare API

In [9]:
from pandas import json_normalize # transform JSON file into a pandas dataframe

results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
 
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,Elks Club,Concert Hall,38.320073,-122.283828
1,Nob Hill Foods,Grocery Store,38.324249,-122.286022
2,Jamba Juice,Juice Bar,38.324214,-122.286552
3,Trancas Steakhouse,Steakhouse,38.323378,-122.295698
4,Starbucks,Coffee Shop,38.323991,-122.287089


Get nearby venues

In [10]:


def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['ZIP', 
                  'ZIP Latitude', 
                  'ZIP Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues



ba_venues = getNearbyVenues(names=BA_df_s['ZIP'],
                                   latitudes=BA_df_s['LAT'],
                                   longitudes=BA_df_s['LONG']
                                  )

ba_venues.head()

Unnamed: 0,ZIP,ZIP Latitude,ZIP Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,94558,38.319227,-122.286037,Elks Club,38.320073,-122.283828,Concert Hall
1,94558,38.319227,-122.286037,Nob Hill Foods,38.324249,-122.286022,Grocery Store
2,94558,38.319227,-122.286037,Jamba Juice,38.324214,-122.286552,Juice Bar
3,94558,38.319227,-122.286037,Trancas Steakhouse,38.323378,-122.295698,Steakhouse
4,94558,38.319227,-122.286037,Starbucks,38.323991,-122.287089,Coffee Shop
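
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]` directly, so a venue without categories, or a response without `groups`, would raise an exception partway through the loop. The following is a minimal sketch of a more defensive parser, assuming the same response shape that the notebook already relies on. The helper name `extract_venue_rows` is hypothetical, and the sample payload in the test is made up for illustration.

```python
def extract_venue_rows(payload, zip_code, zip_lat, zip_lng):
    """Turn one explore response (already parsed from JSON) into row tuples.

    Assumes the payload shape used in this notebook:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Venues without a category get None instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((
            zip_code,
            zip_lat,
            zip_lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows
```

Inside the loop of `getNearbyVenues`, the rows for each ZIP code could then be built with `extract_venue_rows(requests.get(url).json(), name, lat, lng)`, so a single failed or empty response contributes no rows instead of stopping the run.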


Now apply one-hot encoding to the venue categories and average the result per ZIP code

In [11]:
# one hot encoding
ba_onehot = pd.get_dummies(ba_venues[['Venue Category']], prefix="", prefix_sep="")

# add ZIP column back to dataframe
ba_onehot['ZIP'] = ba_venues['ZIP'] 

# move ZIP column to the first column
fixed_columns = [ba_onehot.columns[-1]] + list(ba_onehot.columns[:-1])
ba_onehot = ba_onehot[fixed_columns]

ba_grouped = ba_onehot.groupby('ZIP').mean().reset_index()
ba_grouped.head()

Unnamed: 0,ZIP,Adult Store,Advertising Agency,Airport,American Restaurant,Animal Shelter,Antique Store,Art Museum,Arts and Crafts Store,Arts and Entertainment,Asian Restaurant,Auditorium,BBQ Joint,Bakery,Bar,Baseball Field,Bavarian Restaurant,Beach,Bed and Breakfast,Beer Bar,Beer Garden,Bicycle Store,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Burmese Restaurant,Bus Station,Bus Stop,Butcher,Café,Cajun and Creole Restaurant,Campground,Candy Store,Car Parts and Accessories,Castle,Cheese Store,Children's Clothing Store,Chinese Restaurant,Church,Circus,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,Comfort Food Restaurant,Concert Hall,Construction Supplies Store,Convenience Store,Cuban Restaurant,Cupcake Shop,Department Store,Dessert Shop,Disc Golf,Dog Park,Donut Shop,Drugstore,Eastern European Restaurant,Electric Vehicle Charging Station,Electronics Store,Exhibit,Fair,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish Market,Fishing Store,Flea Market,Flower Store,Food Court,Food Truck,Food and Beverage Retail,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit and Vegetable Store,Fuel Station,Furniture and Home Store,Garden,Garden Center,Gastropub,German Restaurant,Gift Store,Gourmet Store,Grocery Store,Gym,Hair Salon,Harbor or Marina,Hiking Trail,History Museum,Hobby Store,Hot Dog Joint,Hotel,Hunan Restaurant,Ice Cream Parlor,Indian Restaurant,Indonesian Restaurant,Island,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Karaoke Bar,Kitchen Supply Store,Korean Restaurant,Lake,Landmarks and Outdoors,Latin American Restaurant,Lighting Store,Lingerie Store,Liquor Store,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Monument,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,National Park,Nature Preserve,Neighborhood,New American Restaurant,Other Great Outdoors,Park,Performing Arts Venue,Pet Supplies Store,Pharmacy,Pizzeria,Playground,Plaza,Poke Restaurant,Pub,Puerto Rican Restaurant,Rail Station,Ramen Restaurant,Recording Studio,Rest Area,Restaurant,Retail,Road,Rock Club,Salad Restaurant,Sandwich Spot,Scandinavian Restaurant,Scenic Lookout,Seafood Restaurant,"Shipping, Freight, and Material Transportation Service",Shoe Store,Shopping Mall,Shopping Plaza,Snack Place,Spa,Speakeasy,Sporting Goods Retail,Stable,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Taco Restaurant,Taiwanese Restaurant,Taxi Stand,Tea Room,Tennis Stadium,Thai Restaurant,Theater,Tibetan Restaurant,Tiki Bar,Toy Store,Trattoria,Travel Agency,Travel and Transportation,Tree,Vacation Rental,Vietnamese Restaurant,Vintage and Thrift Store,Warehouse or Wholesale Store,Wine Bar,Wine Store,Zoo,Zoo Exhibit
0,94002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,94015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,94019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,94022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,94024,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
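
Each value in the table above is the share of a ZIP code's venues that fall in a category. With `LIMIT = 5`, a value of 0.2 means one venue out of five and 0.4 means two out of five. The toy example below, with made-up venues, shows how one-hot encoding followed by a per-ZIP mean produces these frequencies.

```python
import pandas as pd

# Made-up venues for two ZIP codes, for illustration only.
toy = pd.DataFrame({
    'ZIP': [94002, 94002, 94002, 94002, 94015, 94015],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Supermarket', 'Park',
                       'Hotel', 'Park'],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'ZIP', toy['ZIP'])

# Averaging the indicators per ZIP gives the share of venues in each category.
freq = onehot.groupby('ZIP').mean()
print(freq)
```

Each row of `freq` sums to 1, so ZIP codes with different numbers of venues remain comparable.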


Get the top 10 venues per ZIP code

In [12]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [13]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['ZIP']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
zip_venues_sorted = pd.DataFrame(columns=columns)
zip_venues_sorted['ZIP'] = ba_grouped['ZIP']

for ind in np.arange(ba_grouped.shape[0]):
    zip_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ba_grouped.iloc[ind, :], num_top_venues)

zip_venues_sorted.head(10)

Unnamed: 0,ZIP,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,94002,Supermarket,Café,Fast Food Restaurant,Breakfast Spot,Plaza,Playground,Movie Theater,Museum,Pub,Music Store
1,94015,Restaurant,Trattoria,Hotel,Italian Restaurant,Park,Museum,Music Store,Music Venue,Nail Salon,National Park
2,94019,Hotel,Food and Beverage Retail,Fried Chicken Joint,Travel Agency,Adult Store,New American Restaurant,Movie Theater,Museum,Music Store,Music Venue
3,94022,History Museum,Pizzeria,Bookstore,Farmers Market,Bakery,Plaza,Playground,Puerto Rican Restaurant,Museum,Music Store
4,94024,Park,Fuel Station,American Restaurant,Pizzeria,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,National Park
5,94025,Toy Store,Coffee Shop,Bookstore,Cupcake Shop,Grocery Store,Other Great Outdoors,Museum,Music Store,Music Venue,Nail Salon
6,94027,Hiking Trail,Stable,Speakeasy,Grocery Store,Performing Arts Venue,Neighborhood,Movie Theater,Museum,Music Store,Music Venue
7,94028,Mexican Restaurant,Grocery Store,Italian Restaurant,Café,Seafood Restaurant,Adult Store,Other Great Outdoors,Music Store,Music Venue,Nail Salon
8,94030,Bubble Tea Shop,Hunan Restaurant,Grocery Store,Vietnamese Restaurant,Pet Supplies Store,Adult Store,New American Restaurant,Museum,Music Store,Music Venue
9,94035,Scenic Lookout,American Restaurant,Sporting Goods Retail,Other Great Outdoors,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,National Park
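
Because at most five venues are fetched per ZIP code, the lower ranks in the table above are filled with categories whose frequency is zero (for example the repeated Museum, Music Store, Music Venue entries). These are not really common venues for that ZIP code. The following is a minimal sketch of a variant that pads with `None` instead. The helper name `top_nonzero_categories` is hypothetical, and it expects only the category columns of a row, without the ZIP value.

```python
import pandas as pd


def top_nonzero_categories(freq_row, num_top_venues):
    """Return up to `num_top_venues` category names ordered by frequency.

    Categories with zero frequency are skipped, and the result is padded
    with None so that it always has `num_top_venues` entries.
    """
    ranked = freq_row[freq_row > 0].sort_values(ascending=False)
    names = list(ranked.index[:num_top_venues])
    return names + [None] * (num_top_venues - len(names))


# Made-up frequencies for one ZIP code, for illustration only.
row = pd.Series({'Bakery': 0.2, 'Hotel': 0.0, 'Park': 0.5, 'Pub': 0.0, 'Zoo': 0.3})
print(top_nonzero_categories(row, 4))
```

In the notebook this could be applied to `ba_grouped.iloc[ind, 1:]` in place of `return_most_common_venues`.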


Run K-Means clustering to get the clusters. We assume 4 clusters, which is enough to segment the market adequately without giving the business owner who wants to expand into other ZIP codes more detail than is useful.

In [14]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
    
# set number of clusters
kclusters = 4

ba_grouped_clustering = ba_grouped.drop('ZIP', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ba_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]



array([1, 0, 0, 1, 1, 2, 2, 2, 2, 1], dtype=int32)
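
The choice of 4 clusters is an assumption. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on a made-up toy dataset; it is not part of the original analysis, and the helper name `best_k_by_silhouette` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(features, candidate_ks, random_state=0):
    """Fit K-Means for each candidate k and return the k with the highest
    mean silhouette score, together with all the scores."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get), scores


# Toy data: three tight, well-separated groups of five points each.
offsets = np.array([[0, 0], [0.1, 0], [-0.1, 0], [0, 0.1], [0, -0.1]])
centers = np.array([[0, 0], [10, 0], [0, 10]])
toy = np.vstack([c + offsets for c in centers])

best_k, scores = best_k_by_silhouette(toy, range(2, 6))
print(best_k, scores)
```

In the notebook, `ba_grouped_clustering` would take the place of `toy`. A higher silhouette score indicates tighter, better-separated clusters, but the business interpretation of the clusters still matters when picking k.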

In [15]:
# add clustering labels
zip_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
#ba_merged['Cluster Labels'] = kmeans.labels_

ba_merged = BA_df_s

# join the cluster labels and top venues onto the Bay Area data, which holds latitude/longitude for each ZIP code
ba_merged = ba_merged.join(zip_venues_sorted.set_index('ZIP'), on='ZIP')
#ba_merged1=ba_merged.dropna(axis=0, inplace=False)
ba_merged.dropna(axis=0, inplace=True)
ba_merged['Cluster Labels']=ba_merged['Cluster Labels'].astype(int)
#ba_merged1['Cluster Labels']
ba_merged.head(10) # check the last columns!
#zip_venues_sorted.head()

Unnamed: 0,PO_NAME,ZIP,STATE,LAT,LONG,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,NAPA,94558,CA,38.319227,-122.286037,2,Concert Hall,Grocery Store,Steakhouse,Juice Bar,Coffee Shop,Pub,Other Great Outdoors,Museum,Music Store,Music Venue
1,FAIRFIELD,94533,CA,47.757226,18.127488,1,Restaurant,Gastropub,Eastern European Restaurant,Bar,Adult Store,Other Great Outdoors,Museum,Music Store,Music Venue,Nail Salon
2,DIXON,95620,CA,49.114702,2.208189,1,Movie Theater,Rail Station,Italian Restaurant,Gastropub,French Restaurant,New American Restaurant,Museum,Music Store,Music Venue,Nail Salon
3,SONOMA,95476,CA,38.289863,-122.463968,2,Grocery Store,Wine Bar,Farmers Market,Breakfast Spot,Adult Store,New American Restaurant,Movie Theater,Museum,Music Store,Music Venue
4,NAPA,94559,CA,48.883567,12.79424,1,Fishing Store,Electronics Store,Fuel Station,Middle Eastern Restaurant,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,National Park
5,PETALUMA,94954,CA,38.261442,-122.629879,1,Burger Joint,Disc Golf,Donut Shop,Juice Bar,Park,Adult Store,Other Great Outdoors,Music Store,Music Venue,Nail Salon
6,RIO VISTA,94571,CA,38.138132,-121.703775,1,Campground,Beach,Monument,Museum,Music Store,Music Venue,Nail Salon,National Park,Nature Preserve,Neighborhood
7,TRAVIS AFB,94535,CA,48.706599,13.249879,0,Supermarket,Hotel,Antique Store,Restaurant,Italian Restaurant,Other Great Outdoors,Movie Theater,Museum,Music Store,Music Venue
8,AMERICAN CANYON,94503,CA,38.192436,-122.259517,1,Fast Food Restaurant,Sporting Goods Retail,Farm,Donut Shop,Fuel Station,Adult Store,Neighborhood,Music Store,Music Venue,Nail Salon
9,NOVATO,94949,CA,38.067582,-122.531148,1,Sushi Restaurant,American Restaurant,Animal Shelter,Burger Joint,Sandwich Spot,New American Restaurant,Movie Theater,Museum,Music Store,Music Venue


Visualize the clusters

In [16]:
# create map
map_clusters = folium.Map(location=[latitude1, longitude1], zoom_start=9)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(ba_merged['LAT'], ba_merged['LONG'], ba_merged['ZIP'], ba_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
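
The stated goal of the project is to suggest ZIP codes that share a cluster with the owner's current location. The following is a minimal sketch of that lookup, assuming a dataframe with the `ZIP` and `Cluster Labels` columns of `ba_merged`. The helper name `zips_in_same_cluster` is hypothetical, and the toy rows reuse four ZIP/label pairs from the merged table above.

```python
import pandas as pd


def zips_in_same_cluster(merged, zip_code, zip_col='ZIP', label_col='Cluster Labels'):
    """Return the other ZIP codes that share a cluster with `zip_code`."""
    match = merged.loc[merged[zip_col] == zip_code, label_col]
    if match.empty:
        raise KeyError(f'ZIP {zip_code} has no cluster label')
    same = merged[merged[label_col] == match.iloc[0]]
    # Exclude the owner's current ZIP code from the suggestions.
    return sorted(z for z in same[zip_col] if z != zip_code)


toy = pd.DataFrame({
    'ZIP': [94558, 95476, 94954, 94535],
    'Cluster Labels': [2, 2, 1, 0],
})
print(zips_in_same_cluster(toy, 94558))
```

Calling `zips_in_same_cluster(ba_merged, some_zip)` in the notebook would give the candidate ZIP codes for an owner whose current business is in `some_zip`.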