### Clustering of San Francisco Bay Area

This project clusters the ZIP codes of the San Francisco Bay Area by their most popular venues.

The target audience is business owners who already run a successful business in one Bay Area ZIP code and want to expand (or open more shops) in other Bay Area ZIP codes. They need data science input to find where they are likely to succeed.

The project helps these owners shortlist areas whose market and customer characteristics, as found by clustering, are similar to those of the ZIP code where the current business is successful. The reasoning is that the data comes from customer reviews of businesses, which indicate both the volume of customers and their interests. An owner who has been successful in one ZIP code is therefore more likely to be successful in other ZIP codes in the same cluster, and may choose to open new shops there rather than in ZIP codes that fall into other clusters.

#### Data Source

The ZIP code data comes from the sfgov.org public data portal and was downloaded from https://data.sfgov.org/Geographic-Locations-and-Boundaries/Bay-Area-ZIP-Codes/u5j3-svi6. Information about popular venues in the Bay Area ZIP codes comes from the Foursquare API.

So, let us get into the project.

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions: from pandas.io.json import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


<a id='item1'></a>

We start by reading the San Francisco Bay Area CSV data file, which was uploaded to the same directory as this notebook:

In [3]:
BA_df=pd.read_csv('bayarea_zipcodes.csv')
BA_df.head()


Unnamed: 0,PO_NAME,the_geom,ZIP,STATE,Area__,Length__
0,NAPA,MULTIPOLYGON (((-122.10329200180091 38.5132829...,94558,CA,12313260000.0,995176.225313
1,FAIRFIELD,MULTIPOLYGON (((-121.947475002335 38.301511000...,94533,CA,991786100.0,200772.556587
2,DIXON,MULTIPOLYGON (((-121.65335500334429 38.3133870...,95620,CA,7236950000.0,441860.2014
3,SONOMA,MULTIPOLYGON (((-122.406843003057 38.155681999...,95476,CA,3001414000.0,311318.546326
4,NAPA,MULTIPOLYGON (((-122.29368500225117 38.1552379...,94559,CA,1194302000.0,359104.646602


Drop the columns 'the_geom', 'Area__' and 'Length__', which are not needed for clustering

In [4]:
BA_df.drop(['the_geom','Area__','Length__'],axis=1, inplace=True)
BA_df.head()
#BA_df.shape

Unnamed: 0,PO_NAME,ZIP,STATE
0,NAPA,94558,CA
1,FAIRFIELD,94533,CA
2,DIXON,95620,CA
3,SONOMA,95476,CA
4,NAPA,94559,CA


Geocode each ZIP code to get its latitude and longitude

In [5]:
geolocator = Nominatim(user_agent="ba_explorer")

#BA_df_s=BA_df.loc[0:19,:]
BA_df_s=BA_df
#BA_df_s.head()

latAll=[]
longAll=[]

for z in BA_df_s['ZIP']:
    location = geolocator.geocode(z) # geocode once per ZIP and reuse the result
    latAll.append(location.latitude)
    longAll.append(location.longitude)

BA_df_s['LAT']=latAll
BA_df_s['LONG']=longAll

BA_df_s.head()

Unnamed: 0,PO_NAME,ZIP,STATE,LAT,LONG
0,NAPA,94558,CA,38.323846,-122.276453
1,FAIRFIELD,94533,CA,38.264929,-122.042662
2,DIXON,95620,CA,38.457097,-121.834727
3,SONOMA,95476,CA,38.290392,-122.463819
4,NAPA,94559,CA,38.288119,-122.294436
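
Geocoding a bare ZIP code can match a place far outside California; the TRAVIS AFB row in the merged table further down, for instance, sits at latitude 48.698179 and longitude 13.231094. The sketch below shows one way such rows could be flagged before they reach the map. The bounding box limits and the helper name `in_bay_area` are illustrative assumptions, not part of the original notebook.

```python
# Rough bounding box for the Bay Area; these limits are illustrative assumptions.
BAY_AREA_BOX = {"lat_min": 36.8, "lat_max": 38.9, "lon_min": -123.6, "lon_max": -121.2}

def in_bay_area(lat, lon, box=BAY_AREA_BOX):
    """Return True if (lat, lon) falls inside the bounding box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lon_min"] <= lon <= box["lon_max"])

# Coordinates as they appear in this notebook's outputs
rows = [
    ("NAPA", 94558, 38.323846, -122.276453),
    ("DIXON", 95620, 38.457097, -121.834727),
    ("TRAVIS AFB", 94535, 48.698179, 13.231094),
]

# Collect the rows whose geocoded point lies outside the box
suspect = [(name, zip_code) for name, zip_code, lat, lon in rows
           if not in_bay_area(lat, lon)]
print(suspect)
```

Rows flagged this way could be geocoded again with a more specific query or dropped before clustering.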


The map below shows the locations of the ZIP codes on a map of the Bay Area

In [7]:
# create map of Bay Area using latitude and longitude values
latitude1=37.834929
longitude1=-122.042662
map_ba = folium.Map(location=[latitude1, longitude1], zoom_start=9)

# add markers to map
for lat, lng, po, zipcode in zip(BA_df_s['LAT'], BA_df_s['LONG'], BA_df_s['PO_NAME'], BA_df_s['ZIP']):
    label = '{}, {}'.format(po, zipcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ba)  
    
map_ba

Now define the FourSquare credentials

In [23]:
CLIENT_ID = 'PQAASFB12T0V....C5M2XGTGT0O5' # your Foursquare ID
CLIENT_SECRET = '1R1LEKIGYRW....2ATP4SD5I0JYD20QRT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PQAASFB12T0V....C5M2XGTGT0O5
CLIENT_SECRET:1R1LEKIGYRW....2ATP4SD5I0JYD20QRT


Exploratory data analysis of the first row with the FourSquare API

In [24]:
neighborhood_latitude = BA_df_s.loc[0, 'LAT'] # neighborhood latitude value
neighborhood_longitude = BA_df_s.loc[0, 'LONG'] # neighborhood longitude value

neighborhood_name = BA_df_s.loc[0, 'PO_NAME'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

LIMIT = 5
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},\
{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, \
                                   neighborhood_longitude, VERSION, radius, LIMIT)

print(url)

Latitude and longitude values of NAPA are 38.3238458220492, -122.27645284197.
https://api.foursquare.com/v2/venues/explore?client_id=PQAASFB12T0V....C5M2XGTGT0O5&client_secret=1R1LEKIGYRW....2ATP4SD5I0JYD20QRT&ll=38.3238458220492,-122.27645284197&v=20180605&radius=1000&limit=5
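
The request URL above is assembled by string formatting. As a hedged alternative, the same query string can be built with the standard library's `urlencode`, which takes care of escaping. The function name `build_explore_url` and the placeholder credentials are assumptions made for this sketch; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Assemble the explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")

# Placeholder credentials; substitute real ones before calling the API
example_url = build_explore_url("MY_ID", "MY_SECRET",
                                38.3238458220492, -122.27645284197,
                                "20180605", 1000, 5)
print(example_url)
```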


In [10]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now make a trial call to the FourSquare API

In [11]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
 
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,HdV Winery,Winery,38.325914,-122.283236
1,Nob Hill Foods,Grocery Store,38.324249,-122.286022
2,Elks Club,Concert Hall,38.320073,-122.283828
3,Jamba Juice,Juice Bar,38.324214,-122.286552
4,Starbucks,Coffee Shop,38.323991,-122.287089


Get nearby venues

In [12]:


def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['ZIP', 
                  'ZIP Latitude', 
                  'ZIP Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)



ba_venues = getNearbyVenues(names=BA_df_s['ZIP'],
                                   latitudes=BA_df_s['LAT'],
                                   longitudes=BA_df_s['LONG']
                                  )

ba_venues.head()

Unnamed: 0,ZIP,ZIP Latitude,ZIP Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,94558,38.323846,-122.276453,HdV Winery,38.325914,-122.283236,Winery
1,94558,38.323846,-122.276453,Nob Hill Foods,38.324249,-122.286022,Grocery Store
2,94558,38.323846,-122.276453,Elks Club,38.320073,-122.283828,Concert Hall
3,94558,38.323846,-122.276453,Jamba Juice,38.324214,-122.286552,Juice Bar
4,94558,38.323846,-122.276453,Starbucks,38.323991,-122.287089,Coffee Shop
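
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` if a venue comes back without a category. The sketch below shows one possible defensive version of the row-extraction step. The helper name `extract_venue_rows` is an assumption, and the second sample item is a hand-written, hypothetical venue used only to exercise the empty-category case.

```python
def extract_venue_rows(zip_code, zip_lat, zip_lng, items):
    """Turn the 'items' list of an explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        location = venue.get("location", {})
        rows.append((zip_code, zip_lat, zip_lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

# Hand-written stand-in for results['response']['groups'][0]['items']
sample_items = [
    {"venue": {"name": "HdV Winery",
               "location": {"lat": 38.325914, "lng": -122.283236},
               "categories": [{"name": "Winery"}]}},
    {"venue": {"name": "Unnamed Spot",  # hypothetical venue without a category
               "location": {"lat": 38.32, "lng": -122.28},
               "categories": []}},
]

venue_rows = extract_venue_rows(94558, 38.323846, -122.276453, sample_items)
print(venue_rows)
```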


Now apply one-hot encoding to the venue categories

In [17]:
# one hot encoding
ba_onehot = pd.get_dummies(ba_venues[['Venue Category']], prefix="", prefix_sep="")

# add ZIP column back to dataframe
ba_onehot['ZIP'] = ba_venues['ZIP'] 

# move ZIP column to the first column
fixed_columns = [ba_onehot.columns[-1]] + list(ba_onehot.columns[:-1])
ba_onehot = ba_onehot[fixed_columns]

ba_grouped = ba_onehot.groupby('ZIP').mean().reset_index()
ba_grouped.head()

Unnamed: 0,ZIP,ATM,Accessories Store,Airport Lounge,American Restaurant,Animal Shelter,Antique Shop,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Bike Shop,Bike Trail,Bistro,Boarding House,Boat or Ferry,Bookstore,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Burmese Restaurant,Burrito Place,Bus Station,Business Service,Café,Cajun / Creole Restaurant,Cambodian Restaurant,Campground,Cantonese Restaurant,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Church,City,Clothing Store,Cocktail Bar,Coffee Shop,College Baseball Diamond,College Basketball Court,College Bookstore,College Gym,College Rec Center,College Soccer Field,Comfort Food Restaurant,Comic Shop,Community Center,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Distillery,Dive Bar,Dog Run,Donut Shop,Dumpling Restaurant,Eye Doctor,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flower Shop,Food & Drink Shop,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,Gift Shop,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,History Museum,Hobby Shop,Hookah Bar,Hot Dog Joint,Hotel,Hotel Bar,Hunan Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Juice Bar,Kebab Restaurant,Korean Restaurant,Lake,Latin American Restaurant,Library,Lighting Store,Liquor Store,Market,Massage Studio,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Motorcycle Shop,Mountain,Movie Theater,Multiplex,Museum,Music Store,Music Venue,Nail Salon,Nature Preserve,Neighborhood,New American Restaurant,Noodle House,Opera House,Paper / Office Supplies Store,Park,Performing Arts Venue,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Poke Place,Pool,Print Shop,Pub,Public Art,Recording Studio,Recreation Center,Restaurant,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shopping Mall,Skate Park,Smoothie Shop,Snack Place,Soccer Field,South American Restaurant,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Club,Stables,Steakhouse,Street Food Gathering,Supermarket,Sushi Restaurant,Szechuan Restaurant,Taco Place,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Tree,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Winery,Wings Joint,Yoga Studio
0,94002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,94005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,94010,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,94014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,94015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
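
Each value in `ba_grouped` is the share of a ZIP code's venues that fall into a category: taking the mean of one-hot rows is the same as counting a category and dividing by the number of venues. With `LIMIT = 5`, one venue out of five gives 0.2. The sketch below reproduces that arithmetic without pandas, purely as an illustration; the function name `category_frequencies` is an assumption, and the pairs are the five rows shown in the `ba_venues` output.

```python
from collections import Counter, defaultdict

# (ZIP, venue category) pairs taken from the ba_venues output
venue_pairs = [
    (94558, "Winery"),
    (94558, "Grocery Store"),
    (94558, "Concert Hall"),
    (94558, "Juice Bar"),
    (94558, "Coffee Shop"),
]

def category_frequencies(pairs):
    """Return {zip: {category: share of that ZIP's venues}}."""
    counts = defaultdict(Counter)
    for zip_code, category in pairs:
        counts[zip_code][category] += 1
    return {
        zip_code: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for zip_code, counter in counts.items()
    }

freqs = category_frequencies(venue_pairs)
print(freqs[94558]["Winery"])  # 1 of 5 venues
```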


Get the top 10 venues per ZIP code

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [19]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['ZIP']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
zip_venues_sorted = pd.DataFrame(columns=columns)
zip_venues_sorted['ZIP'] = ba_grouped['ZIP']

for ind in np.arange(ba_grouped.shape[0]):
    zip_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ba_grouped.iloc[ind, :], num_top_venues)

zip_venues_sorted.head(10)

Unnamed: 0,ZIP,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,94002,Falafel Restaurant,Park,Hot Dog Joint,Sandwich Place,Nail Salon,Yoga Studio,Cycle Studio,Dog Run,Dive Bar,Distillery
1,94005,Mexican Restaurant,Deli / Bodega,Coffee Shop,Park,Bubble Tea Shop,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar,Distillery
2,94010,Pizza Place,American Restaurant,Clothing Store,Coffee Shop,Italian Restaurant,College Soccer Field,Dance Studio,Donut Shop,Dog Run,Dive Bar
3,94014,Filipino Restaurant,Hookah Bar,Athletics & Sports,Kebab Restaurant,Yoga Studio,Deli / Bodega,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar
4,94015,Sandwich Place,Burger Joint,Golf Course,Supermarket,Deli / Bodega,Cycle Studio,Dog Run,Dive Bar,Distillery,Discount Store
5,94019,Mexican Restaurant,Spanish Restaurant,Grocery Store,Burger Joint,Deli / Bodega,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar,Distillery
6,94022,Italian Restaurant,Indian Restaurant,Food Truck,Japanese Restaurant,Mexican Restaurant,Comic Shop,Community Center,Dumpling Restaurant,Donut Shop,Dog Run
7,94024,Gym / Fitness Center,Pizza Place,Grocery Store,Park,Cosmetics Shop,Yoga Studio,Cycle Studio,Dog Run,Dive Bar,Distillery
8,94025,Bookstore,Gym / Fitness Center,Pool,Performing Arts Venue,Street Food Gathering,Dance Studio,Donut Shop,Dog Run,Dive Bar,Distillery
9,94027,Farm,Hookah Bar,Bed & Breakfast,Gas Station,Furniture / Home Store,Dance Studio,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar
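
Because `LIMIT = 5`, each ZIP code has at most five venues, so at most five categories have a nonzero share. The remaining "most common" slots in the table above are ties at 0.0, which is why entries such as Dog Run, Dive Bar and Distillery recur across rows. One possible way to keep only categories that were actually observed is sketched below. The helper name `top_nonzero_categories` is an assumption, and the sample row uses the five leading categories listed for ZIP 94002 plus two of the zero-share columns.

```python
def top_nonzero_categories(freq_by_category, num_top):
    """Return up to num_top categories, highest share first, skipping zero shares."""
    present = [(cat, share) for cat, share in freq_by_category.items() if share > 0]
    # Sort by share (descending), then by name so ties come out in a stable order
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in present[:num_top]]

# Shares for ZIP 94002, consistent with the tables above
row_94002 = {
    "Falafel Restaurant": 0.2, "Hot Dog Joint": 0.2, "Nail Salon": 0.2,
    "Park": 0.2, "Sandwich Place": 0.2, "Yoga Studio": 0.0, "Dog Run": 0.0,
}

top_94002 = top_nonzero_categories(row_94002, 10)
print(top_94002)
```

Raising `LIMIT` in the API call would be another way to fill more of the ten slots with observed categories.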


Run K-Means clustering to get the clusters. We assume 4 clusters: enough to segment the market adequately without giving the business owner, who wants to expand by opening shops in other ZIP codes, more detail than is useful.

In [20]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
    
# set number of clusters
kclusters = 4

ba_grouped_clustering = ba_grouped.drop(columns='ZIP')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ba_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0], dtype=int32)
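
To make the clustering step less of a black box, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm): assign each row to its nearest centroid, then move each centroid to the mean of its members. It is not scikit-learn's implementation, which by default uses k-means++ initialization among other refinements. The function names and the toy frequency vectors are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def simple_kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy frequency vectors over three hypothetical venue categories
toy_rows = [
    [0.8, 0.2, 0.0],
    [0.0, 0.2, 0.8],
    [0.6, 0.4, 0.0],
    [0.0, 0.4, 0.6],
]

toy_labels, toy_centroids = simple_kmeans(toy_rows, k=2)
print(toy_labels)
```

Rows with similar category shares end up with the same label, which is the property the notebook relies on when it treats ZIP codes in one cluster as similar markets.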

In [21]:
# add clustering labels
zip_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
#ba_merged['Cluster Labels'] = kmeans.labels_

ba_merged = BA_df_s

# merge zip_venues_sorted with BA_df_s to add the cluster label and top venues for each ZIP code
ba_merged = ba_merged.join(zip_venues_sorted.set_index('ZIP'), on='ZIP')
#ba_merged1=ba_merged.dropna(axis=0, inplace=False)
ba_merged.dropna(axis=0, inplace=True)
ba_merged['Cluster Labels']=ba_merged['Cluster Labels'].astype(int)
#ba_merged1['Cluster Labels']
ba_merged.head(10) # check the last columns!
#zip_venues_sorted.head()

Unnamed: 0,PO_NAME,ZIP,STATE,LAT,LONG,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,NAPA,94558,CA,38.323846,-122.276453,0,Juice Bar,Winery,Coffee Shop,Grocery Store,Concert Hall,Yoga Studio,Donut Shop,Dog Run,Dive Bar,Distillery
1,FAIRFIELD,94533,CA,38.264929,-122.042662,0,American Restaurant,Grocery Store,Burger Joint,Arts & Crafts Store,Sushi Restaurant,Yoga Studio,Deli / Bodega,Dumpling Restaurant,Donut Shop,Dog Run
2,DIXON,95620,CA,38.457097,-121.834727,0,Pizza Place,Bakery,Grocery Store,Burger Joint,Diner,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar,Distillery
3,SONOMA,95476,CA,38.290392,-122.463819,0,Grocery Store,Farmers Market,Breakfast Spot,American Restaurant,Yoga Studio,Dance Studio,Donut Shop,Dog Run,Dive Bar,Distillery
4,NAPA,94559,CA,38.288119,-122.294436,1,Playground,Pharmacy,Breakfast Spot,Bed & Breakfast,Mexican Restaurant,Deli / Bodega,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar
5,PETALUMA,94954,CA,38.261121,-122.632746,0,Pizza Place,Juice Bar,Burger Joint,Playground,Gym,College Soccer Field,Donut Shop,Dog Run,Dive Bar,Distillery
6,RIO VISTA,94571,CA,38.164218,-121.690924,1,Sandwich Place,Italian Restaurant,Mexican Restaurant,American Restaurant,Grocery Store,Diner,Deli / Bodega,Department Store,Dessert Shop,Yoga Studio
7,TRAVIS AFB,94535,CA,48.698179,13.231094,0,Boarding House,Hotel,Dance Studio,Eye Doctor,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar,Distillery,Discount Store
8,AMERICAN CANYON,94503,CA,38.182826,-122.251848,0,Donut Shop,Mobile Phone Shop,Fast Food Restaurant,Video Game Store,Salon / Barbershop,Cycle Studio,Dog Run,Dive Bar,Distillery,Discount Store
9,NOVATO,94949,CA,38.071253,-122.534211,0,Sandwich Place,American Restaurant,Animal Shelter,Sushi Restaurant,Thai Restaurant,Yoga Studio,Dumpling Restaurant,Donut Shop,Dog Run,Dive Bar


Visualize the clusters

In [22]:
# create map
map_clusters = folium.Map(location=[latitude1, longitude1], zoom_start=9)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ba_merged['LAT'], ba_merged['LONG'], ba_merged['ZIP'], ba_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters