This notebook will be used to complete the Data Science Capstone project for Coursera, and serve as a more detailed description and "deep dive" on methods used to pull, clean, and analyze the Foursquare data.

In [2]:
import pandas as pd
import numpy as np

# More needed dependencies, taken from Week 3 lab on NYC:
import requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import json

import pickle # for saving pulled foursquare data later

print('Libraries imported.')

Libraries imported.


In [3]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Section 1: Data

### 1.1: Foursquare (Venues)

We will source our venues data from Foursquare. Specifically, we'll aim to get all bars & lounges in New York City. However, we have a bit of a snag to work out: Foursquare limits us to only 50 results from a single search query, so we will have to get a little creative to get a sizable dataset to work with. I propose the following:

1. A handy-dandy New York City neighborhoods dataset was provided to us in the "Segmenting and Clustering Neighborhoods in New York City" lab. It included the borough, name, latitude and longitude of all 306 neighborhoods in NYC. If we take the last two bits, the latitude and longitude, we can get Foursquare results for *each* of these neighborhoods.

2. We'll do the same search query on each of these lat/long pairs: up to 50 results from each of 306 pairs, which is a maximum of *15,300 results*. We will store each resulting dataframe into a list.

3. Pandas is magical, so there's obviously a function to combine these into one big dataframe: pd.concat. 

4. However, these results will **not all be unique**, so we'll have to drop all duplicate rows. Again, pandas is magical, so of course there's a simple method for this: df.drop_duplicates().

5. Our only other constraint on the results is that they're located *in* New York City, which may be a little silly (consider a bar right outside the city limits next to 10 others right inside of the city limits), but we'll roll with it for the sake of the product. This is, again, easy to do with pandas.
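
Before running this against the real API, here is a minimal sketch of steps 3 to 5 on made-up data. The venue names, addresses, and the two-item allow-list are all hypothetical; the real versions are built in the steps that follow.

```python
import pandas as pd

# Toy stand-ins for two per-neighborhood search results (hypothetical data).
batch_a = pd.DataFrame({'name': ['Bar One', 'Bar Two'],
                        'address': ['1 Main St', '2 Main St'],
                        'city': ['Brooklyn', 'Hoboken']})
batch_b = pd.DataFrame({'name': ['Bar Two', 'Bar Three'],
                        'address': ['2 Main St', '3 Main St'],
                        'city': ['Hoboken', 'New York']})

# Step 3: stack the batches; ignore_index gives one clean 0..n-1 index.
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# Step 4: keep the first copy of each repeated venue.
unique = combined.drop_duplicates().reset_index(drop=True)

# Step 5: keep only rows whose city is on an allow-list.
in_city = unique[unique['city'].isin(['Brooklyn', 'New York'])].reset_index(drop=True)

print(len(combined), len(unique), len(in_city))
```

The row count shrinks at each stage, which is the same pattern we should expect on the real data.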

Let's do it!

#### Step 1: New York City Neighborhoods Dataset

In [None]:
# Much of this code is from the lab, but adapted to use requests instead of wget.

r = requests.get('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json')
newyork_data = r.json()
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one dict per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

In [None]:
# Double check
neighborhoods.shape

We got it!

#### Step 2: Search Query on Each Lat/Long Pair

Now, we just need to run a regular Foursquare search query on each one of these lat/long pairs. We can iterate over the dataframe; do a search on each row's coordinates; then store the resulting dataframe into a list of results.

**!!! WARNING !!! If you plan on running this yourself, be prepared to make _306_ Foursquare API calls. Remember that a free account without a verified credit card only gets 950 per day.**

In [None]:
CLIENT_ID = 'JCLRUISTSH4ETI2F1GQRNHLJVAWMOACMQVEZJJ2XP2LKVD5U'
CLIENT_SECRET = 'Removed for Upload'
VERSION = '20210121'
QUERY = 'bar'
LIMIT = 50 # max
# The following categoryID will limit results to the 'Nightlife' category, which includes bars; lounges; nightclubs; etc.
# List of category IDs: https://developer.foursquare.com/docs/build-with-foursquare/categories/
CATEGORYID = '4d4b7105d754a06376d81259'


# For later cleaning of search results:
def get_category_type(row):
    # fall back to the nested column name if 'categories' is not present
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


def clean_results(data):
    # turns data from venues query into a dataframe and cleans it up into an easily-usable format
    results_data = data['response']['venues']
    result = pd.json_normalize(results_data)
    filtered_columns = ['name', 'categories'] + [col for col in result.columns if col.startswith('location.')] + ['id']
    # copy so the assignment below works on its own dataframe, not a view of `result`
    cleaned = result.loc[:, filtered_columns].copy()
    # filter the category for each row
    cleaned['categories'] = cleaned.apply(get_category_type, axis=1)
    # clean column names by keeping only last term
    cleaned.columns = [column.split('.')[-1] for column in cleaned.columns]

    return cleaned
    

def fs_venue_batch_query(lldf, CLIENT_ID=CLIENT_ID, CLIENT_SECRET=CLIENT_SECRET, VERSION=VERSION, QUERY=QUERY, LIMIT=LIMIT, CATEGORYID=CATEGORYID):
    results = []
    base_uri = 'https://api.foursquare.com/v2/venues/search?'

    # Iterate over rows of lat/long dataframe (lldf):
    for index, row in lldf.iterrows():
        LL = '{},{}'.format(row['Latitude'], row['Longitude'])
        payload = {'client_id': CLIENT_ID,
                   'client_secret': CLIENT_SECRET,
                   'll': LL,
                   'v': VERSION,
                   'query': QUERY,
                   'limit': LIMIT,
                   'categoryId': CATEGORYID}

        # Pull data with requests
        r = requests.get(base_uri, params=payload)

        # Stop at the first failed request and show what came back
        if r.status_code != requests.codes.ok:
            print('Something went wrong at lat-long {}, {}, index {}. See response:\n'.format(row['Latitude'], row['Longitude'], index))
            print(r.text)
            break

        results.append(clean_results(r.json()))

    return results
    

search_results = fs_venue_batch_query(neighborhoods)

In [None]:
print(len(search_results))

We got it! Again!

#### Step 3: Combine Into One DataFrame
Simple enough, no?

In [None]:
all_results = pd.concat(search_results)
all_results.reset_index(drop=True, inplace=True)

In [None]:
all_results.head()

In [None]:
all_results.shape

It really was! And we really did get nearly 15,300 results.

**However**, I ran into an issue that I believe is worth mentioning. Notice that I reset the index, and do this repeatedly going forward. This is key to preserving a lot of the results. Without doing this, following the same process, I ended up with fewer than 200 results. pd.concat keeps each batch's original index, so the combined dataframe has many rows sharing the same label, and any later step that removes rows by index label takes out every row with that label. I haven't tracked down exactly which step is responsible; drop_duplicates() compares column values rather than the index, so the more likely candidate is the label-based drop() in Step 5. If you attempt to do something similar, remember to reset your indices!
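
One way duplicate index labels can silently remove rows is a label-based `drop`. The sketch below uses made-up data to show the mechanism; it is an illustration of how this can happen, not a confirmed diagnosis of the run above.

```python
import pandas as pd

# Two toy batches, each with its own 0..1 index, like separate query results.
a = pd.DataFrame({'name': ['A', 'B'], 'city': ['New York', 'Hoboken']})
b = pd.DataFrame({'name': ['C', 'D'], 'city': ['Bronx', 'Bronx']})

stacked = pd.concat([a, b])  # index is now 0, 1, 0, 1
bad_labels = stacked[stacked['city'] == 'Hoboken'].index  # just label 1

# Label-based drop removes every row labelled 1, so 'D' is lost along with 'B'.
without_reset = stacked.drop(bad_labels)

# After resetting, each row has a unique label and only 'B' is removed.
reset = stacked.reset_index(drop=True)
with_reset = reset.drop(reset[reset['city'] == 'Hoboken'].index)

print(list(without_reset['name']), list(with_reset['name']))
```

With hundreds of batches that all share the same small set of labels, a handful of unwanted rows can take most of the dataset with them.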

Anyway, the data looks great now, so back to the show.

#### Step 4: Drop All Duplicate Venues

Turns out, drop_duplicates doesn't work if some of your values are lists. This is only an issue because of the "labeledLatLngs" and "formattedAddress" columns, neither of which we really need, so I'll just get rid of them, and then keep only the unique results.

**NOTE**: This is an opportunity for some leeway. How do we determine if a particular result is unique across all of our data? Do all aspects need to match exactly? The pandas drop_duplicates method actually has a "subset" parameter that lets us choose which columns to consider. For our purposes, I'm going to consider only the 'name' and 'address' columns - if a venue has the same name and address as another, it's probably a duplicate. Because both columns have to match, chain locations (same name, different address) are kept as separate venues; passing only 'name' as the subset would collapse each chain into one row and make for a more 'local' experience. :-)
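
To see how the choice of `subset` changes what counts as a duplicate, here is a small sketch on made-up venues (the names and addresses are hypothetical).

```python
import pandas as pd

venues = pd.DataFrame({
    'name':    ['Chain Bar', 'Chain Bar', 'Chain Bar', 'Solo Pub'],
    'address': ['1 First Ave', '1 First Ave', '9 Ninth Ave', '1 First Ave'],
})

# Both columns must match for a row to count as a duplicate.
by_name_and_address = venues.drop_duplicates(subset=['name', 'address'])

# Only the name has to match, so a chain collapses to a single row.
by_name_only = venues.drop_duplicates(subset=['name'])

print(len(by_name_and_address), len(by_name_only))
```

The two-column subset keeps the second Chain Bar location; the name-only subset does not.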

In [None]:
all_results_no_lists = all_results.drop(columns=['labeledLatLngs', 'formattedAddress'])
unique_results = all_results_no_lists.drop_duplicates(subset=['name', 'address']) # rows count as duplicates only if both name and address match
unique_results.reset_index(drop=True, inplace=True)

In [None]:
unique_results.shape

In [None]:
unique_results.head()

#### Step 5: Ensure All Venues Are In New York

Slight issue: some values in 'city' are the boroughs, and others are the neighborhoods, not just 'New York'. We can't just do a simple drop command on any rows that don't have city == 'New York' anymore. So, we'll have to get a little creative. We can instead make a list of the unique boroughs and neighborhoods out of the Neighborhoods data, add NYC, and then do a drop command on any rows that have a 'city' value that's not in any of the places to include. It's a little complicated, but can be done with just a few lines of code (isn't Python awesome?).

And finally, we'll drop any rows that don't have lat/long values, just in case.
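
As a rough sketch of the filter described above, here it is on a made-up table. The toy `hoods` and `venues` frames are hypothetical, and I'm using a boolean mask rather than a label-based drop, which gives the same rows without depending on the index.

```python
import numpy as np
import pandas as pd

hoods = pd.DataFrame({'Borough': ['Bronx', 'Queens'],
                      'Neighborhood': ['Wakefield', 'Flushing']})
venues = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                       'city': ['Flushing', 'Jersey City', 'New York', 'Bronx'],
                       'lat': [40.75, 40.72, np.nan, 40.90],
                       'lng': [-73.84, -74.04, -73.99, -73.85]})

# Every borough and neighborhood name, plus the city-wide spellings.
places_to_keep = set(hoods['Borough']) | set(hoods['Neighborhood']) | {'New York', 'New York City'}

# Boolean mask: True where 'city' is one of the accepted places.
in_nyc = venues['city'].isin(places_to_keep)

# Keep matching rows, then require both coordinates to be present.
kept = venues[in_nyc].dropna(subset=['lat', 'lng']).reset_index(drop=True)

print(list(kept['name']))
```

Venue B is removed by the city filter and venue C by the missing latitude.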

In [None]:
neighborhoods_to_keep = list(neighborhoods.Borough.unique())
neighborhoods_to_keep.extend(list(neighborhoods.Neighborhood.unique()))
neighborhoods_to_keep.extend(['New York', 'New York City'])

ny_bars = unique_results.drop(unique_results[~unique_results.city.isin(neighborhoods_to_keep)].index)
ny_bars.dropna(axis=0,how='any',subset=['lat','lng'],inplace=True)
ny_bars.reset_index(drop=True, inplace=True)

In [None]:
ny_bars.head(20)

In [None]:
print(ny_bars.shape)

Bingo! We got it. That's our dataset! It's actually become quite slim, down from over 15,000 to just under 650. However, that's more than enough to work with - I was hoping for at least 500.

However, there's a lot of excess information here. Why don't we get rid of most of these unnecessary columns? We'll only keep the name, category, address, latitude, longitude, location, and id - which is probably a bit more than we need, but still a big improvement. I'd like to keep id because, even though it doesn't carry any useful information in and of itself, it is useful as a unique identifier for each location.

In [None]:
bars = ny_bars.drop(columns=['distance', 'postalCode', 'cc', 'state', 'country', 'crossStreet', 'neighborhood'])
bars.rename(columns={'name':'Name', 
                     'categories':'Category', 
                     'address':'Address', 
                     'lat':'Lat', 
                     'lng':'Lng', 
                     'city':'Location'},
            inplace=True)

bars.head()

Perfect. To stop wasting API calls, I'll also save both of these DataFrames with Pickle (the original and the slim versions).

In [None]:
ny_bars.to_pickle('NY Bars, v20210121, Full')
bars.to_pickle('NY Bars, v20210121, Slim')
print('Save successful!')

In [None]:
# Just to make sure:
test_df = pd.read_pickle('NY Bars, v20210121, Slim')
test_df.head()

Absolutely amazing. That's our *final* dataset. Really, I won't be making any more changes, promise (at least, in this section).

### 1.2: Folium (Maps)

Now how do these venues look plotted on a map of New York City? Let's find out using Folium. We can use other packages (such as Nominatim) to find the city's exact latitude and longitude, but since we did that in a previous lab (and the information is readily available through a quick Google search), we'll just hard-code the latitude and longitude: 40.7127281, -74.0060152.

In [4]:
# If running from this section, run this cell to load the locally-stored dataframes.
ny_bars = pd.read_pickle('NY Bars, v20210121, Full')
bars = pd.read_pickle('NY Bars, v20210121, Slim')
print('Load successful!')

Load successful!


In [6]:
ny_bars.head()

Unnamed: 0,name,categories,address,lat,lng,distance,postalCode,cc,city,state,country,crossStreet,neighborhood,id
0,Wine Bar,Wine Bar,at Citi Field,40.756446,-73.84638,15391,11368,US,Flushing,NY,United States,,,4deab02dae60e98923605bc2
1,Last Stop Bar,Bar,4609 White Plains Rd,40.90241,-73.851398,927,10470,US,Bronx,NY,United States,east 240th street,,5994fb6362420b10800f6429
2,Klassique Bar & Lounge,Bar,3801-3823 U.S. 1,40.881665,-73.838284,1634,10466,US,Bronx,NY,United States,,,4f935dd2e4b00fe65f18aa7c
3,Tommy Bahama's Bar,Sports Bar,Yankee Stadium,40.829045,-73.927661,9965,10452,US,Bronx,NY,United States,"Gate 4, upstairs from Great Hall",,4bf1ec69324cc9b672a5cc92
4,The Liffy II Bar,Bar,5009 Broadway,40.869042,-73.9173,6555,10034,US,New York,NY,United States,,,4b2b1dfbf964a52083b424e3


In [7]:
bars.head()

Unnamed: 0,Name,Category,Address,Lat,Lng,Location,id
0,Wine Bar,Wine Bar,at Citi Field,40.756446,-73.84638,Flushing,4deab02dae60e98923605bc2
1,Last Stop Bar,Bar,4609 White Plains Rd,40.90241,-73.851398,Bronx,5994fb6362420b10800f6429
2,Klassique Bar & Lounge,Bar,3801-3823 U.S. 1,40.881665,-73.838284,Bronx,4f935dd2e4b00fe65f18aa7c
3,Tommy Bahama's Bar,Sports Bar,Yankee Stadium,40.829045,-73.927661,Bronx,4bf1ec69324cc9b672a5cc92
4,The Liffy II Bar,Bar,5009 Broadway,40.869042,-73.9173,New York,4b2b1dfbf964a52083b424e3


In [5]:
nyc_latitude = 40.7127281
nyc_longitude = -74.0060152

In [21]:
# Now I'm taking code, again, directly from a previous lab (Segmenting and Clustering Neighborhoods in New York City).

# Creating a map of New York City
map_newyork = folium.Map(location=[nyc_latitude, nyc_longitude], zoom_start=10)

# add markers to map
for lat, lng, name in zip(bars['Lat'], bars['Lng'], bars['Name']):
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Amazing results - we're ready to analyze the data.

## Section 2: Analysis

The algorithm I've chosen to use for this analysis is k-Means, and that's for a couple of reasons:

1. It's really simple, easy to understand and modify quickly, and fun to use, in my opinion.
2. The value of k can directly correspond to the length of our "(x) Bar Crawl Locations" list (x = k).

It may not be absolutely perfect for this kind of analysis, but it works well enough, and I'm happy with that.
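
One possible reason it isn't perfect (my assumption, not something tested here): k-Means on raw latitude/longitude measures distance in degrees, and a degree of longitude covers less ground than a degree of latitude at New York's latitude. A minimal sketch of converting coordinates to approximate kilometres first is below. The helper name `to_local_km` and the constant 111.32 km per degree are illustrative choices, and the conversion is an equirectangular approximation that is only reasonable at city scale.

```python
import numpy as np

def to_local_km(lat, lng, ref_lat=40.7127281, ref_lng=-74.0060152):
    """Approximate (north, east) offsets in km from a reference point."""
    km_per_degree = 111.32  # approximate length of one degree of latitude
    lat = np.asarray(lat, dtype=float)
    lng = np.asarray(lng, dtype=float)
    north = (lat - ref_lat) * km_per_degree
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    east = (lng - ref_lng) * km_per_degree * np.cos(np.radians(ref_lat))
    return np.column_stack([north, east])

# Compare 0.01 degrees of latitude with 0.01 degrees of longitude near NYC.
offsets = to_local_km([40.7227281, 40.7127281], [-74.0060152, -73.9960152])
print(offsets.round(2))
```

The resulting array could be passed to KMeans in place of the raw Lat/Lng columns; whether that changes the clusters noticeably for this dataset is something I haven't checked.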

In [12]:
kClusters = 100
bars_grouped_clustering = bars[['Lat', 'Lng']]
bars_grouped_clustering.head()

Unnamed: 0,Lat,Lng
0,40.756446,-73.84638
1,40.90241,-73.851398
2,40.881665,-73.838284
3,40.829045,-73.927661
4,40.869042,-73.9173


In [13]:
# run k-means clustering
kmeans = KMeans(n_clusters=kClusters, random_state=0).fit(bars_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([ 6, 92, 45, 57, 83, 23, 83, 25, 28, 33])

In [14]:
grouped_bars = bars.copy()
grouped_bars.insert(0, 'Cluster Label', kmeans.labels_)
grouped_bars.head()

Unnamed: 0,Cluster Label,Name,Category,Address,Lat,Lng,Location,id
0,6,Wine Bar,Wine Bar,at Citi Field,40.756446,-73.84638,Flushing,4deab02dae60e98923605bc2
1,92,Last Stop Bar,Bar,4609 White Plains Rd,40.90241,-73.851398,Bronx,5994fb6362420b10800f6429
2,45,Klassique Bar & Lounge,Bar,3801-3823 U.S. 1,40.881665,-73.838284,Bronx,4f935dd2e4b00fe65f18aa7c
3,57,Tommy Bahama's Bar,Sports Bar,Yankee Stadium,40.829045,-73.927661,Bronx,4bf1ec69324cc9b672a5cc92
4,83,The Liffy II Bar,Bar,5009 Broadway,40.869042,-73.9173,New York,4b2b1dfbf964a52083b424e3


Well... that was easy enough. Let's save this locally and then visualize using some code again adapted from the "Segmenting and Clustering..." lab.

In [16]:
grouped_bars.to_pickle('Grouped NY Bars, v20210121')
print('Save successful!')

Save successful!


In [17]:
grouped_bars = pd.read_pickle('Grouped NY Bars, v20210121')
print('Load successful!')

Load successful!


In [18]:
# create map
map_clusters = folium.Map(location=[nyc_latitude, nyc_longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kClusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(grouped_bars['Lat'], grouped_bars['Lng'], grouped_bars['Name'], grouped_bars['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=1,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to k-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Wow! That's awesome. If you zoom in and click around, you'll find that even though some colors are really similar, they're actually different destinations. The resolution here is incredible. As expected, the most 'lively' spots are all in and around Manhattan, but there are some other clear clusters well outside the heart of the city.

However...this map is hard to read. There are *one hundred* different clusters. It's tough to tell different ones apart where the colors are quite similar, so the only way to confirm that two markers belong to the same cluster is to click each one individually and compare the labels. Not very efficient.

To address this, let's write up a function to show only one cluster at a time. The clusters aren't ordered in any way, so we'll just write a function to show only one Cluster Label.

In [26]:
def map_cluster(cluster_num, main_lat=nyc_latitude, main_lng=nyc_longitude, data=grouped_bars, k=kClusters):
    #print('Main Lat: {}, Main Lng: {}, k = {}'.format(main_lat, main_lng, k))
    #print(data.head())
    # create map
    main_map = folium.Map(location=[main_lat, main_lng], zoom_start=10)
    cluster_df = data[data['Cluster Label'] == cluster_num]
    

    # add markers to the map
    for lat, lng, name in zip(cluster_df['Lat'], cluster_df['Lng'], cluster_df['Name']):
        label = folium.Popup(name, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=3,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(main_map)  
        
    return main_map

Let's give it a try!

In [35]:
bar_crawl_map = map_cluster(33)

bar_crawl_map

In [36]:
grouped_bars[grouped_bars['Cluster Label'] == 33]

Unnamed: 0,Cluster Label,Name,Category,Address,Lat,Lng,Location,id
9,33,Teddy's Bar & Grill,Pub,96 Berry St,40.719205,-73.958431,Brooklyn,4262f880f964a52025211fe3
27,33,Surf Bar,Seafood Restaurant,139 N 6th St,40.717636,-73.958714,Brooklyn,4581734ff964a520653f1fe3
144,33,Full Circle Bar,Bar,318 Grand St,40.712662,-73.956688,Brooklyn,4af33812f964a520d2eb21e3
149,33,D.O.C. Wine Bar,Italian Restaurant,83 N 7th St,40.719581,-73.960445,Brooklyn,4f825769e4b024a26f122b5d
153,33,Baker Bar,Cocktail Bar,200 N 11th St,40.718704,-73.953766,Brooklyn,4ed990830e61d46ad70a4bf9
159,33,North 4 Bar,Bar,160 N 4th St,40.715708,-73.959199,Brooklyn,4a7f9a5cf964a5205ff41fe3
160,33,The Bar 245,Bar,245 S 1st St,40.712777,-73.957918,Brooklyn,537fb988498e691413819ab5
169,33,East River Bar,Dive Bar,97 S 6th St,40.710883,-73.964771,Brooklyn,40bfbb80f964a520d4001fe3
214,33,Las Tainas Bar & Restaurant,Lounge,347 Broadway,40.707903,-73.955831,Brooklyn,4bd3e8089854d13a2ce1fe4d
215,33,Zamaan Hookah Bar and Lounge,Hookah Bar,349 Broadway,40.707847,-73.955812,Brooklyn,4d33b71c2c76a1434f2f7ec7


It works! We can show a map of any of the 100 bar crawl locations we found.
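
Since the cluster labels aren't ordered in any meaningful way, one option for picking which ones to map is to rank them by how many bars they contain. The sketch below does this on a made-up stand-in for grouped_bars; the `min_bars` threshold is an arbitrary, hypothetical cutoff.

```python
import pandas as pd

# Toy stand-in for grouped_bars (hypothetical rows).
toy = pd.DataFrame({'Cluster Label': [2, 0, 2, 1, 2, 0],
                    'Name': ['A', 'B', 'C', 'D', 'E', 'F']})

# Number of venues per cluster, largest first.
cluster_sizes = toy['Cluster Label'].value_counts()

# Keep only clusters with at least `min_bars` venues.
min_bars = 2
crawl_candidates = cluster_sizes[cluster_sizes >= min_bars].index.tolist()

print(cluster_sizes.to_dict(), crawl_candidates)
```

Each label in `crawl_candidates` could then be passed to map_cluster to view that bar crawl on its own.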