## Introduction:
In this project, we try to find the best city in the San Francisco Bay Area to live in, based on the venues available in each city. The results should help people who are moving to the Bay Area for a new job or to start a family.

## Data:
Cities in San Francisco Bay Area
Initial data of the cities of the San Francisco Bay Area will be scraped from Wikipedia.

Geospatial
We'll look up the latitude and longitude of each city by name, using the Nominatim geocoder from the geopy Python package.

Venues
With the latitude and longitude, we'll retrieve venue information from the Foursquare API.

Importing required packages

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if folium is already installed
import folium

Fetching the Wikipedia page and parsing it with the lxml parser

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_the_San_Francisco_Bay_Area').text

soup = BeautifulSoup(source, 'lxml')

Data Wrangling

Scraping Data from Wikipedia

In [4]:
sfbayarea = []

for city in soup.find_all('th', scope='row'):
    #print(city.text)
    sfbayarea.append(city.text)
    if city.text == "Yountville\n":
        break
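
The loop above stops at a hard-coded sentinel (`"Yountville\n"`) and relies on a later cleanup step to strip newlines. As a minimal sketch of the same extraction on a small inline HTML sample, the snippet below collects the text of every `<th scope="row">` cell and strips whitespace as it goes. It uses the standard-library `html.parser` module rather than BeautifulSoup so that it runs without extra packages; the `RowHeaderParser` class and the sample markup are illustrative assumptions, not the structure of the live Wikipedia page.

```python
from html.parser import HTMLParser


class RowHeaderParser(HTMLParser):
    """Collect the text of every <th scope="row"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_row_header = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Start a new entry when a row-header cell opens.
        if tag == 'th' and dict(attrs).get('scope') == 'row':
            self.in_row_header = True
            self.names.append('')

    def handle_endtag(self, tag):
        if tag == 'th':
            self.in_row_header = False

    def handle_data(self, data):
        # Text inside nested tags (such as <a>) also arrives here.
        if self.in_row_header:
            self.names[-1] += data


# Hand-written sample that only mimics the shape of a city table.
sample_html = """
<table>
  <tr><th scope="col">Name</th><th scope="col">County</th></tr>
  <tr><th scope="row"><a href="#">Alameda</a>
</th><td>Alameda</td></tr>
  <tr><th scope="row"><a href="#">Albany</a>
</th><td>Alameda</td></tr>
</table>
"""

parser = RowHeaderParser()
parser.feed(sample_html)
cities = [name.strip() for name in parser.names]
print(cities)
```

Stripping whitespace at extraction time means any later comparison can use the plain city name rather than a string with a trailing newline.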


Create Pandas Dataframe from list

In [5]:
df = pd.DataFrame(sfbayarea,columns=['City'])
df['City'] = df['City'].str.replace("\n","")
df = df.drop(df.index[[0]])

df.head(5)

Unnamed: 0,City
1,Alameda
2,Albany
3,American Canyon
4,Antioch
5,Atherton


Get the shape of the dataframe

In [6]:
df.shape

(101, 1)

Pulling geospatial data from Nominatim

In [7]:
cityWithGeo = []
# create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="ny_explorer")
for city in df['City']:
    address = city + ', California'
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    cityWithGeo.append([city, "California", latitude, longitude])
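
A geocoder's `geocode` call returns `None` when it finds no match, in which case reading `.latitude` raises an `AttributeError`. The sketch below shows one way to make the loop tolerant of that: it takes the geocoding function as a parameter and records unmatched cities instead of failing. The names `geocode_cities`, `fake_geocode`, and `Location` are hypothetical, and the stub stands in for a real network call so the example runs offline.

```python
from collections import namedtuple

# Stand-in for the object a geocoder returns; only the two attributes used here.
Location = namedtuple('Location', ['latitude', 'longitude'])


def geocode_cities(cities, geocode, state='California'):
    """Return (rows, missing): one row per geocoded city, plus unmatched names."""
    rows, missing = [], []
    for city in cities:
        location = geocode('{}, {}'.format(city, state))
        if location is None:
            # No match: remember the city instead of raising AttributeError.
            missing.append(city)
            continue
        rows.append([city, state, location.latitude, location.longitude])
    return rows, missing


def fake_geocode(address):
    """Hypothetical offline stub that knows a single address."""
    known = {'Albany, California': Location(37.88687, -122.297747)}
    return known.get(address)


rows, missing = geocode_cities(['Albany', 'Nowhereville'], fake_geocode)
print(rows)
print(missing)
```

In the notebook, the same function could be called with `geolocator.geocode` in place of the stub.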

Creating new dataframe with geospatial data

In [8]:
df_cityGeo = pd.DataFrame(cityWithGeo,columns=['City','State','Latitude','Longitude'])
df_cityGeo.head(5)

Unnamed: 0,City,State,Latitude,Longitude
0,Alameda,California,37.609029,-121.899142
1,Albany,California,37.88687,-122.297747
2,American Canyon,California,38.223457,-122.227043
3,Antioch,California,38.004921,-121.805789
4,Atherton,California,37.461327,-122.197743


Get the shape of the dataframe

In [9]:
df_cityGeo.shape

(101, 4)

Get the latitude and longitude of Alameda, California, to use as the center of the folium map.

In [10]:
address = 'Alameda, California'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Alameda are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Alameda are 37.6090291, -121.899142.


Folium map of the San Francisco Bay Area, centered on Alameda, California.

In [11]:
# create map of the Bay Area using latitude and longitude values
map_bayarea = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for city, lat, lng in zip(df_cityGeo['City'], df_cityGeo['Latitude'], df_cityGeo['Longitude']):
    label = '{}, California'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bayarea)  
    
map_bayarea

Define Foursquare credentials and version

In [12]:
CLIENT_ID = '54K3S1GZMKIQNXJMHKLPE4X3RGW15YISTFJ102HMXOC2X3GP' # your Foursquare ID
CLIENT_SECRET = 'RQKZENVAAOZXU2HHP2SKZKUUD2TOL4RPQ3A3QAJCKZKALBME' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 54K3S1GZMKIQNXJMHKLPE4X3RGW15YISTFJ102HMXOC2X3GP
CLIENT_SECRET:RQKZENVAAOZXU2HHP2SKZKUUD2TOL4RPQ3A3QAJCKZKALBME
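
Hard-coding credentials in a notebook and printing them, as above, exposes them to anyone who can read the file. As a hedged alternative, the sketch below reads them from environment variables and masks them before printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helper names, and the demo values are all assumptions made for illustration.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the client ID and secret from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # hypothetical variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Demo mapping with made-up values, so the example does not need real credentials.
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'EXAMPLEID1234',
    'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET5678',
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

Calling `load_foursquare_credentials()` with no argument reads from the real environment.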


Exploring the second city in the dataframe

In [13]:
df_cityGeo.loc[1, 'City']

'Albany'

Get the city's latitude and longitude values.

In [14]:
city_latitude = df_cityGeo.loc[1, 'Latitude'] # city latitude value
city_longitude = df_cityGeo.loc[1, 'Longitude'] # city longitude value

city_name = df_cityGeo.loc[1, 'City'] # city name

print('Latitude and longitude values of {} are {}, {}.'.format(city_name, 
                                                               city_latitude, 
                                                               city_longitude))



Latitude and longitude values of Albany are 37.88687, -122.2977475.


Now, let's get the top 100 venues that are in Albany within a radius of 500 meters.

In [15]:
LIMIT = 100

radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    city_latitude, 
    city_longitude, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=54K3S1GZMKIQNXJMHKLPE4X3RGW15YISTFJ102HMXOC2X3GP&client_secret=RQKZENVAAOZXU2HHP2SKZKUUD2TOL4RPQ3A3QAJCKZKALBME&v=20180605&ll=37.88687,-122.2977475&radius=500&limit=100'
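
The URL above is assembled with `str.format`, which leaves escaping to the caller. A minimal sketch of the same request URL built with the standard library's `urlencode` follows; it uses the endpoint and parameter names from the cell above, while the function name `build_explore_url` and the placeholder credentials are assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; the coordinates are those of Albany from the cell above.
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 37.88687, -122.2977475)

# Round-trip the query string to confirm the parameters survive encoding.
query = parse_qs(urlparse(url).query)
print(query['ll'], query['radius'], query['limit'])
```

`requests.get` also accepts a `params` dictionary, which achieves the same escaping without building the string by hand.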

Send GET request

In [16]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e84dc7a6d8c56001b22433e'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Albany',
  'headerFullLocation': 'Albany',
  'headerLocationGranularity': 'city',
  'totalResults': 60,
  'suggestedBounds': {'ne': {'lat': 37.891370004500004,
    'lng': -122.29205634295731},
   'sw': {'lat': 37.8823699955, 'lng': -122.30343865704269}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b2e7ca3f964a52010e124e3',
       'name': "Sam's Log Cabin",
       'location': {'address': '945 San Pablo Ave',
        'lat': 37.88858866521545,
        'lng': -122.29825769171232,
        'labeledLatLngs': [{'label': 'display',
   

Get category type

In [17]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

JSON to dataframe

In [18]:
# on pandas >= 1.0, pd.json_normalize is the top-level spelling of this function
from pandas.io.json import json_normalize

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Sam's Log Cabin,Breakfast Spot,37.888589,-122.298258
1,Potala Organic Cafe,Vegetarian / Vegan Restaurant,37.885131,-122.297013
2,Patisserie Rotha,Bakery,37.884811,-122.296931
3,Sprouts Farmers Market,Grocery Store,37.885157,-122.297564
4,Hal's Office,Café,37.890522,-122.295885


Venues returned by Foursquare

In [19]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

60 venues were returned by Foursquare.


Define a function that repeats this process for all cities

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
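
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which raises an `IndexError` for any venue that comes back with an empty category list. The sketch below shows a tolerant row extractor applied to a hand-written sample that mimics the response shape printed earlier. The function name `venue_row` and the second sample venue are hypothetical.

```python
def venue_row(item):
    """Extract (name, lat, lng, category) from one item; category may be None."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (
        venue['name'],
        venue['location']['lat'],
        venue['location']['lng'],
        category,
    )


# Hand-written sample items; only the fields used above are included.
sample_items = [
    {'venue': {'name': "Sam's Log Cabin",
               'location': {'lat': 37.888589, 'lng': -122.298258},
               'categories': [{'name': 'Breakfast Spot'}]}},
    {'venue': {'name': 'Uncategorized Place',  # made-up venue with no category
               'location': {'lat': 37.0, 'lng': -122.0},
               'categories': []}},
]

rows = [venue_row(item) for item in sample_items]
for row in rows:
    print(row)
```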

Calling function to get venues for all cities

In [21]:
bayarea_venues = getNearbyVenues(names=df_cityGeo['City'],
                                   latitudes=df_cityGeo['Latitude'],
                                   longitudes=df_cityGeo['Longitude']
                                  )

Alameda
Albany
American Canyon
Antioch
Atherton
Belmont
Belvedere
Benicia
Berkeley
Brentwood
Brisbane
Burlingame
Calistoga
Campbell
Clayton
Cloverdale
Colma
Concord
Corte Madera
Cotati
Cupertino
Daly City
Danville
Dixon
Dublin
East Palo Alto
El Cerrito
Emeryville
Fairfax
Fairfield
Foster City
Fremont
Gilroy
Half Moon Bay
Hayward
Healdsburg
Hercules
Hillsborough
Lafayette
Larkspur
Livermore
Los Altos
Los Altos Hills
Los Gatos
Martinez
Menlo Park
Mill Valley
Millbrae
Milpitas
Monte Sereno
Moraga
Morgan Hill
Mountain View
Napa
Newark
Novato
Oakland
Oakley
Orinda
Pacifica
Palo Alto
Petaluma
Piedmont
Pinole
Pittsburg
Pleasant Hill
Pleasanton
Portola Valley
Redwood City
Richmond
Rio Vista
Rohnert Park
Ross
St. Helena
San Anselmo
San Bruno
San Carlos
San Francisco
San Jose
San Leandro
San Mateo
San Pablo
San Rafael
San Ramon
Santa Clara
Santa Rosa
Saratoga
Sausalito
Sebastopol
Sonoma
South San Francisco
Suisun City
Sunnyvale
Tiburon
Union City
Vacaville
Vallejo
Walnut Creek
Windsor
Woodside
Y

Checking on size of dataframe

In [22]:
print(bayarea_venues.shape)
bayarea_venues.head()

(3167, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albany,37.88687,-122.297747,Sam's Log Cabin,37.888589,-122.298258,Breakfast Spot
1,Albany,37.88687,-122.297747,Potala Organic Cafe,37.885131,-122.297013,Vegetarian / Vegan Restaurant
2,Albany,37.88687,-122.297747,Patisserie Rotha,37.884811,-122.296931,Bakery
3,Albany,37.88687,-122.297747,Sprouts Farmers Market,37.885157,-122.297564,Grocery Store
4,Albany,37.88687,-122.297747,Hal's Office,37.890522,-122.295885,Café


Let's check how many venues were returned for each city

In [23]:
bayarea_venues.groupby('City').count()

City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Albany,60,60,60,60,60,60
American Canyon,2,2,2,2,2,2
Antioch,11,11,11,11,11,11
Atherton,6,6,6,6,6,6
Belmont,61,61,61,61,61,61
Belvedere,8,8,8,8,8,8
Benicia,29,29,29,29,29,29
Berkeley,49,49,49,49,49,49
Brentwood,24,24,24,24,24,24
Brisbane,24,24,24,24,24,24


Let's find out how many unique categories can be curated from all the returned venues

In [24]:
print('There are {} unique categories.'.format(len(bayarea_venues['Venue Category'].unique())))

There are 308 unique categories.


Analyze Each City

In [25]:
# one hot encoding
bayarea_onehot = pd.get_dummies(bayarea_venues[['Venue Category']], prefix="", prefix_sep="")

# add city column back to dataframe
bayarea_onehot['City'] = bayarea_venues['City'] 

# move city column to the first column
fixed_columns = [bayarea_onehot.columns[-1]] + list(bayarea_onehot.columns[:-1])
bayarea_onehot = bayarea_onehot[fixed_columns]

bayarea_onehot.head()

Unnamed: 0,City,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,...,Vietnamese Restaurant,Vineyard,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Albany,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Albany,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Albany,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Albany,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Albany,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Get size of dataframe

In [26]:
bayarea_onehot.shape

(3167, 309)

Grouping rows by cities

In [27]:
bayarea_grouped = bayarea_onehot.groupby('City').mean().reset_index()
bayarea_grouped

Unnamed: 0,City,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,...,Vietnamese Restaurant,Vineyard,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Albany,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.016667,0.0,0.000000,0.000000,0.000000
1,American Canyon,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.500000,0.0,0.0,0.000000,0.000000,0.5,0.000000,0.000000,0.000000
2,Antioch,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
3,Atherton,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
4,Belmont,0.016393,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.016393,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
5,Belvedere,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
6,Benicia,0.000000,0.000000,0.000000,0.000000,0.0,0.068966,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.068966,0.000000,0.0,0.000000,0.000000,0.000000
7,Berkeley,0.000000,0.000000,0.000000,0.000000,0.0,0.020408,0.000000,0.000000,0.000000,...,0.020408,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.020408
8,Brentwood,0.000000,0.000000,0.000000,0.000000,0.0,0.125000,0.000000,0.000000,0.000000,...,0.041667,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
9,Brisbane,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.083333,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
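
Taking the mean of one-hot columns per city gives the share of that city's venues in each category, which is why every row sums to 1. A small sketch on made-up data illustrates this; it assumes pandas is installed, and "Example City" and its venues are invented for illustration. The American Canyon rows mirror the two venues reported above.

```python
import pandas as pd

# Tiny stand-in for bayarea_venues with only the two columns that matter here.
venues = pd.DataFrame({
    'City': ['American Canyon', 'American Canyon',
             'Example City', 'Example City', 'Example City', 'Example City'],
    'Venue Category': ['Winery', 'Vineyard',
                       'Coffee Shop', 'Coffee Shop', 'Bakery', 'Park'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'City', venues['City'])

# The per-city mean of a 0/1 column is that category's relative frequency.
grouped = onehot.groupby('City').mean()
print(grouped)
```

A city with only two venues, such as American Canyon, ends up with frequencies of 0.5, so cities with few venues produce coarse profiles.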


Confirming size

In [28]:
bayarea_grouped.shape

(92, 309)
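
The grouped dataframe has 92 rows, while 101 cities were geocoded, so nine cities returned no venues within the search radius. A minimal sketch for listing such cities follows; the function name `cities_without_venues` is an assumption, and the short lists stand in for the `City` columns of `df_cityGeo` and `bayarea_venues`.

```python
def cities_without_venues(all_cities, venue_cities):
    """Return the cities that never appear in the venue results, in input order."""
    seen = set(venue_cities)
    return [city for city in all_cities if city not in seen]


# Stand-ins for df_cityGeo['City'] and bayarea_venues['City'].
all_cities = ['Alameda', 'Albany', 'American Canyon']
venue_cities = ['Albany', 'Albany', 'American Canyon']

missing = cities_without_venues(all_cities, venue_cities)
print(missing)
```

Cities in this list drop out of the clustering entirely, so it is worth checking whether their coordinates point at the intended place.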

Let's print each city along with the top 5 most common venues

In [29]:
num_top_venues = 5

for hood in bayarea_grouped['City']:
    print("----"+hood+"----")
    temp = bayarea_grouped[bayarea_grouped['City'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Albany----
                 venue  freq
0          Coffee Shop  0.07
1      Thai Restaurant  0.05
2          Pizza Place  0.05
3  Japanese Restaurant  0.05
4    Indian Restaurant  0.03


----American Canyon----
                     venue  freq
0                   Winery   0.5
1                 Vineyard   0.5
2                      ATM   0.0
3  New American Restaurant   0.0
4     Outdoor Supply Store   0.0


----Antioch----
                  venue  freq
0  Fast Food Restaurant  0.27
1    Mexican Restaurant  0.18
2              Pharmacy  0.09
3         Grocery Store  0.09
4           Flower Shop  0.09


----Atherton----
                venue  freq
0   Food & Drink Shop  0.17
1                 Spa  0.17
2       Train Station  0.17
3        Home Service  0.17
4  Mexican Restaurant  0.17


----Belmont----
              venue  freq
0       Coffee Shop  0.05
1         Pet Store  0.05
2  Sushi Restaurant  0.05
3       Pizza Place  0.03
4    Sandwich Place  0.03


----Belvedere----
[output truncated]

                venue  freq
0         Coffee Shop  0.12
1  Italian Restaurant  0.06
2            Pharmacy  0.06
3       Burrito Place  0.06
4             Brewery  0.06


----Morgan Hill----
                   venue  freq
0     Italian Restaurant  0.10
1                Brewery  0.08
2  Vietnamese Restaurant  0.05
3                   Bank  0.05
4    American Restaurant  0.05


----Mountain View----
                venue  freq
0         Coffee Shop  0.07
1              Bakery  0.05
2                Park  0.05
3    Sushi Restaurant  0.05
4  Chinese Restaurant  0.03


----Napa----
                 venue  freq
0             Wine Bar  0.10
1  American Restaurant  0.09
2   Italian Restaurant  0.04
3     Sushi Restaurant  0.03
4               Lounge  0.03


----Newark----
                venue  freq
0  Mexican Restaurant  0.27
1    Asian Restaurant  0.13
2  Chinese Restaurant  0.07
3        Dessert Shop  0.07
4     Bubble Tea Shop  0.07


----Novato----
[output truncated]

                venue  freq
0         Coffee Shop  0.11
1  Mexican Restaurant  0.07
2                 Gym  0.07
3              Market  0.07
4           BBQ Joint  0.07


----Woodside----
                     venue  freq
0  New American Restaurant   0.3
1                     Bank   0.1
2                  Stables   0.1
3           Breakfast Spot   0.1
4      Sporting Goods Shop   0.1


----Yountville----
             venue  freq
0         Wine Bar  0.11
1            Hotel  0.11
2         Vineyard  0.09
3  Bed & Breakfast  0.06
4           Bakery  0.06




First, let's write a function to sort the venues in descending order.

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
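
Sorting every category means that cities with few venues get their top list padded with categories whose frequency is zero, as in the American Canyon output above, where ATM appears with a frequency of 0.0. The sketch below shows one way to keep only categories that actually occur, padding with `None` instead. It works on a plain dictionary, and the name `top_categories` is hypothetical.

```python
def top_categories(freqs, n):
    """Return the n most frequent categories with non-zero frequency.

    Ties are broken alphabetically; missing slots are filled with None.
    """
    nonzero = [(cat, f) for cat, f in freqs.items() if f > 0]
    nonzero.sort(key=lambda pair: (-pair[1], pair[0]))
    names = [cat for cat, _ in nonzero[:n]]
    return names + [None] * (n - len(names))


# Frequencies shaped like one row of bayarea_grouped.
american_canyon = {'ATM': 0.0, 'Vineyard': 0.5, 'Winery': 0.5, 'Yoga Studio': 0.0}
top = top_categories(american_canyon, 4)
print(top)
```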

Now let's create the new dataframe and display the top 10 venues for each city

In [31]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = bayarea_grouped['City']

for ind in np.arange(bayarea_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bayarea_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Albany,Coffee Shop,Japanese Restaurant,Pizza Place,Thai Restaurant,Burger Joint,Sandwich Place,Indian Restaurant,French Restaurant,Flower Shop,Mexican Restaurant
1,American Canyon,Winery,Vineyard,Yoga Studio,Flower Shop,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market
2,Antioch,Fast Food Restaurant,Mexican Restaurant,Gym,Grocery Store,Flower Shop,Bank,Pharmacy,Coffee Shop,Food,Food & Drink Shop
3,Atherton,Home Service,Spa,Baseball Field,Mexican Restaurant,Food & Drink Shop,Train Station,Flea Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service
4,Belmont,Coffee Shop,Sushi Restaurant,Pet Store,Grocery Store,Dessert Shop,Sandwich Place,Pizza Place,Convenience Store,Mobile Phone Shop,Salon / Barbershop
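
The column names above come from a three-item suffix list with a fallback to "th", which is correct up to 20 but would produce labels such as "21th" for larger values of `num_top_venues`. A small sketch of a general ordinal helper follows; the function name `ordinal` is an assumption.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['City'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(columns)
```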


## Cluster Cities

Run k-means to group the cities into 5 clusters.


In [41]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

bayarea_grouped_clustering = bayarea_grouped.drop('City', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bayarea_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1], dtype=int32)
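
Before interpreting the clusters, it helps to check how many cities fall into each one, since k-means can produce very uneven groups. The sketch below counts labels with the standard library's `Counter`, using the ten labels printed above as sample input. With the fitted model, `kmeans.labels_.tolist()` could be passed instead.

```python
from collections import Counter

# The first ten labels printed above, used here as sample input.
labels = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print('Cluster label {}: {} cities'.format(cluster, size))
```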

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each city

In [42]:
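# add clustering labels (guarded so the cell can be re-run without error)
if 'Cluster Labels' not in city_venues_sorted.columns:
    city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

bayarea_merged = df_cityGeo

# join the cluster labels and top venues onto the city coordinates
bayarea_merged = bayarea_merged.join(city_venues_sorted.set_index('City'), on='City')

# drop cities that returned no venues and therefore have no cluster label
bayarea_merged.dropna(inplace=True)

bayarea_merged.head() # check the last columns!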

Unnamed: 0,City,State,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Albany,California,37.88687,-122.297747,1.0,Coffee Shop,Japanese Restaurant,Pizza Place,Thai Restaurant,Burger Joint,Sandwich Place,Indian Restaurant,French Restaurant,Flower Shop,Mexican Restaurant
2,American Canyon,California,38.223457,-122.227043,1.0,Winery,Vineyard,Yoga Studio,Flower Shop,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market
3,Antioch,California,38.004921,-121.805789,0.0,Fast Food Restaurant,Mexican Restaurant,Gym,Grocery Store,Flower Shop,Bank,Pharmacy,Coffee Shop,Food,Food & Drink Shop
4,Atherton,California,37.461327,-122.197743,0.0,Home Service,Spa,Baseball Field,Mexican Restaurant,Food & Drink Shop,Train Station,Flea Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service
5,Belmont,California,37.520215,-122.275801,1.0,Coffee Shop,Sushi Restaurant,Pet Store,Grocery Store,Dessert Shop,Sandwich Place,Pizza Place,Convenience Store,Mobile Phone Shop,Salon / Barbershop


Finally, let's visualize the resulting clusters

In [43]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bayarea_merged['Latitude'], bayarea_merged['Longitude'], bayarea_merged['City'], bayarea_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
 
map_clusters
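
The marker colors above are looked up with `rainbow[int(cluster-1)]`, so cluster label 0 takes the last color in the list and every other label is shifted by one. Each cluster still gets a distinct color, but the mapping is easy to misread. The sketch below builds a palette that is indexed directly by cluster label. It uses the standard-library `colorsys` module instead of matplotlib so that it runs without extra packages, and the name `cluster_palette` is an assumption.

```python
import colorsys


def cluster_palette(n):
    """Return n visually distinct hex colors, one per cluster label 0..n-1."""
    palette = []
    for i in range(n):
        # Spread the hues evenly around the color wheel.
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(5)
for cluster, color in enumerate(palette):
    print(cluster, color)
```

With this palette, the marker color for a row would be `palette[int(cluster)]`.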

## Examine Clusters

Now we can examine each cluster and identify the venue categories that distinguish it. Based on those defining categories, each cluster can then be given a descriptive name.

### Cluster 1

In [44]:
bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 0, bayarea_merged.columns[[0] + list(range(5, bayarea_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Antioch,Fast Food Restaurant,Mexican Restaurant,Gym,Grocery Store,Flower Shop,Bank,Pharmacy,Coffee Shop,Food,Food & Drink Shop
4,Atherton,Home Service,Spa,Baseball Field,Mexican Restaurant,Food & Drink Shop,Train Station,Flea Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service
9,Brentwood,Mexican Restaurant,Pizza Place,American Restaurant,Bar,Breakfast Spot,Gas Station,Liquor Store,Sandwich Place,Taco Place,Laundromat
23,Dixon,Mexican Restaurant,Sushi Restaurant,Café,Tea Room,Bistro,Bar,Breakfast Spot,Fountain,Food Truck,French Restaurant
25,East Palo Alto,Mexican Restaurant,Fast Food Restaurant,Bagel Shop,Gas Station,Market,Library,Gym / Fitness Center,Grocery Store,Fish Market,Filipino Restaurant
44,Martinez,Coffee Shop,American Restaurant,Mexican Restaurant,Farmers Market,Liquor Store,Bar,Sandwich Place,BBQ Joint,Park,Asian Restaurant
54,Newark,Mexican Restaurant,Asian Restaurant,Chinese Restaurant,Pharmacy,Supermarket,Sandwich Place,Coffee Shop,Bubble Tea Shop,Convenience Store,American Restaurant
57,Oakley,Mexican Restaurant,Hawaiian Restaurant,Convenience Store,Ice Cream Shop,Grocery Store,Flea Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop
64,Pittsburg,Mexican Restaurant,Fast Food Restaurant,Convenience Store,Bakery,Deli / Bodega,Dentist's Office,Ice Cream Shop,Park,Chinese Restaurant,Sandwich Place
69,Richmond,Convenience Store,Food Truck,Art Gallery,Grocery Store,Mexican Restaurant,Cosmetics Shop,BBQ Joint,Hot Dog Joint,Fast Food Restaurant,Metro Station
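
One simple way to suggest a name for a cluster is to count which category appears most often as the top venue among its cities. The sketch below does this for the ten Cluster 1 rows shown above, with the values copied from the "1st Most Common Venue" column. It is a rough heuristic rather than a full analysis of the cluster.

```python
from collections import Counter

# '1st Most Common Venue' for the ten Cluster 1 cities listed above.
cluster_1_top_venues = [
    'Fast Food Restaurant', 'Home Service', 'Mexican Restaurant',
    'Mexican Restaurant', 'Mexican Restaurant', 'Coffee Shop',
    'Mexican Restaurant', 'Mexican Restaurant', 'Mexican Restaurant',
    'Convenience Store',
]

counts = Counter(cluster_1_top_venues)
label, count = counts.most_common(1)[0]
print('{} leads in {} of {} cities'.format(label, count, len(cluster_1_top_venues)))
```

With the dataframe at hand, the input list could come from `bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 0, '1st Most Common Venue']`.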


### Cluster 2

In [45]:
bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 1, bayarea_merged.columns[[0] + list(range(5, bayarea_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Albany,Coffee Shop,Japanese Restaurant,Pizza Place,Thai Restaurant,Burger Joint,Sandwich Place,Indian Restaurant,French Restaurant,Flower Shop,Mexican Restaurant
2,American Canyon,Winery,Vineyard,Yoga Studio,Flower Shop,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market
5,Belmont,Coffee Shop,Sushi Restaurant,Pet Store,Grocery Store,Dessert Shop,Sandwich Place,Pizza Place,Convenience Store,Mobile Phone Shop,Salon / Barbershop
6,Belvedere,Bakery,Bookstore,Bay,Chinese Restaurant,Park,Clothing Store,Harbor / Marina,Deli / Bodega,Food Court,Food & Drink Shop
7,Benicia,Sushi Restaurant,Wine Bar,American Restaurant,Mexican Restaurant,Italian Restaurant,Pet Store,Theater,Coffee Shop,Clothing Store,Bakery
8,Berkeley,Sushi Restaurant,Music Venue,Theater,Asian Restaurant,Vegetarian / Vegan Restaurant,Brewery,Pizza Place,Ramen Restaurant,Coffee Shop,Gastropub
10,Brisbane,Mexican Restaurant,Vietnamese Restaurant,Pizza Place,Farmers Market,Sushi Restaurant,Sandwich Place,Thai Restaurant,Donut Shop,Café,Chinese Restaurant
11,Burlingame,Japanese Restaurant,Breakfast Spot,Coffee Shop,Sandwich Place,Italian Restaurant,Pub,Chinese Restaurant,Salon / Barbershop,Diner,Business Service
12,Calistoga,Hotel,Bed & Breakfast,Wine Bar,American Restaurant,Resort,Italian Restaurant,Coffee Shop,Bakery,Ice Cream Shop,New American Restaurant
13,Campbell,Yoga Studio,Sandwich Place,Italian Restaurant,Mexican Restaurant,Pizza Place,Ice Cream Shop,Sushi Restaurant,Steakhouse,Boutique,Sporting Goods Shop


### Cluster 3

In [46]:
bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 2, bayarea_merged.columns[[0] + list(range(5, bayarea_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
88,Sebastopol,Coffee Shop,Fondue Restaurant,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Yoga Studio


### Cluster 4

In [47]:
bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 3, bayarea_merged.columns[[0] + list(range(5, bayarea_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
42,Los Altos Hills,Home Service,Music Venue,Yoga Studio,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market,Flea Market


### Cluster 5

In [48]:
bayarea_merged.loc[bayarea_merged['Cluster Labels'] == 4, bayarea_merged.columns[[0] + list(range(5, bayarea_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
61,Petaluma,Farm,Dog Run,Yoga Studio,Flower Shop,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market,Flea Market
