# Car ride-share potential in mid-size U.S. cities from geographic spread
## (notebook 1)

This notebook supports the IBM Data Science Specialization on Coursera and accompanies the official report PDF. For all details, see the PDF.

## Extract geographic location of mid-size U.S. cities

The 2017 U.S. census estimate for city size can be obtained from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk . It is faithfully represented in a corresponding Wikipedia page https://en.wikipedia.org/w/index.php?title=List_of_United_States_cities_by_population&oldid=883568308 (retrieved 3 March 2019) from which it can be easily parsed.

In [1]:
import pandas as pd
import numpy as np

Following https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722 (retrieved 24 Feb 2019), use BeautifulSoup to get a parseable representation of the Wikipedia page, then load the table with all cities into `city_table`:

In [2]:
import requests
url = 'https://en.wikipedia.org/w/index.php?title=List_of_United_States_cities_by_population&oldid=883568308'
website_url = requests.get(url).text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url, 'lxml')
city_table = soup.find('table', { 'class' : 'wikitable sortable' })
print("{}\n\n   [...]\n\n{}".format(str(city_table)[:500].replace('\n', '').replace('<tr>', '\n\n<tr>'), str(city_table)[-500:]))


<table class="wikitable sortable" style="text-align:center"><tbody>

<tr><th>2017<br/>rank</th><th>City</th><th>State<sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[5]</a></sup></th><th>2017<br/>estimate</th><th>2010<br/>Census</th><th>Change</th><th colspan="2">2016 land area</th><th colspan="2">2016 population density</th><th>Location</th></tr>

<tr><td>1</td><td style="text-align:left;background-color:#cfecec"><i><a href="/wiki/New_York_City" title="New York 

   [...]

"latitude">38°21′14″N</span> <span class="longitude">121°58′22″W</span></span></span><span class="geo-multi-punct">﻿ / ﻿</span><span class="geo-default"><span class="vcard"><span class="geo-dec" title="Maps, aerial photos, and other data for this location">38.3539°N 121.9728°W</span><span style="display:none">﻿ / <span class="geo">38.3539; -121.9728</span></span><span style="display:none">﻿ (<span class="fn org">Vacaville</span>)</span></span></span></a></span></small>
</td></tr></tbody></table>


From `city_table` find all cities with an estimated 2017 population between 300,000 and 400,000 and parse out the latitude and longitude into numeric values:
* City name is the second column (remove references in square brackets if present),
* city state is the third column,
* population is the 4th column (remove thousands-separator commas before interpreting as integer),
* latitude and longitude are contained in the 11th column, but have to be substring-filtered.
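
As a minimal sketch of the last bullet, the decimal coordinates can also be pulled out with a single regular expression instead of chained substitutions. This is an alternative to the notebook's own parsing, not the method used in the next cell; the helper name `parse_decimal_latlong` and the sample string (hand-written to resemble the cell text of the Vacaville row) are assumptions for illustration.

```python
import re

# Matches a "lat; long" pair of signed decimal numbers, e.g. "38.3539; -121.9728".
_DECIMAL_PAIR = re.compile(r'(-?\d+(?:\.\d+)?);\s*(-?\d+(?:\.\d+)?)')

def parse_decimal_latlong(cell_text):
    """Return (latitude, longitude) as floats, or None if no decimal pair is found."""
    match = _DECIMAL_PAIR.search(cell_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Hypothetical sample resembling the text of the "Location" cell.
sample = '38°21′14″N 121°58′22″W / 38.3539°N 121.9728°W / 38.3539; -121.9728 (Vacaville)'
print(parse_decimal_latlong(sample))
```

Returning `None` for unparseable cells keeps the caller in control of how to treat rows without coordinates.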

In [3]:
import re

l = []

table_rows = city_table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td]  # avoid shadowing the loop variable tr
    if len(row) < 1:
        print("(ignoring empty row)")
        test_size = 0
    else:
        test_size = int(row[3].replace(',', ''))
        
    if test_size >= 300000 and test_size <= 400000:
        city_name = re.sub(r'\[.*\]', '', row[1])
        city_state = row[2]
        city_estd_pop2017 = test_size
        city_latlongraw = re.sub(r'^.*/', '', re.sub(r'\(.*\)', '', row[10])).replace(' ', '')
        # strip non-ASCII residue
        city_latlongraw = city_latlongraw.encode('ascii',errors='ignore').decode()
        city_lat = float(re.sub(';.*$', '', city_latlongraw))
        city_long = float(re.sub('^.*;', '', city_latlongraw))
        l.append([city_name, city_state, city_estd_pop2017, city_lat, city_long])

cities_df = pd.DataFrame(l)
cities_df.columns = ['City name', 'City state', 'Population', 'Latitude', 'Longitude']
print(cities_df)


(ignoring empty row)
         City name    City state  Population  Latitude  Longitude
0        Arlington         Texas      396394   32.7007   -97.1247
1      New Orleans     Louisiana      393292   30.0534   -89.9345
2          Wichita        Kansas      390591   37.6907   -97.3459
3        Cleveland          Ohio      385525   41.4785   -81.6794
4            Tampa       Florida      385430   27.9701   -82.4797
5      Bakersfield    California      380874   35.3212  -119.0183
6           Aurora      Colorado      366623   39.6880  -104.6897
7          Anaheim    California      352497   33.8555  -117.7601
8         Honolulu        Hawaii      350395   21.3243  -157.8476
9        Santa Ana    California      334136   33.7363  -117.8830
10       Riverside    California      327728   33.9381  -117.3932
11  Corpus Christi         Texas      325605   27.7543   -97.1734
12       Lexington      Kentucky      321959   38.0407   -84.4583
13        Stockton    California      310496   37.9763 

In [4]:
# persist the DataFrame (at least for a little while)
cities_df.to_csv('cities_df.csv')


## Get venues using Foursquare

(copy secrets from local file)

In [6]:
print('CLIENT_ID set: {}'.format(CLIENT_ID is not None))
print('CLIENT_SECRET set: {}'.format(CLIENT_SECRET is not None))

VERSION = '20180605' # Foursquare API version


CLIENT_ID set: True
CLIENT_SECRET set: True


In [7]:
# (optional: restore all of the above data from storage, import all libraries from above)
import pandas as pd
import numpy as np
import requests
import re

cities_df = pd.read_csv('cities_df.csv')

Define a Foursquare query that gets all venues within a default radius of 500 meters around a latitude and longitude. The number of venues returned is capped at 100 by default.

In [8]:
import time

def getVenuesNearLatLong(latitude, longitude, radius=500, limit=100, verbose=True):
    
    venues_list=[]
                
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,
            latitude, 
            longitude, 
            radius, 
            limit)
            
    # make the GET request (on error try four more times before giving up)
    num_tries = 0
    results = [] # assume no venues if persistent error

    while num_tries < 5:
        num_tries += 1
        try:
            results_raw = requests.get(url)
            results = results_raw.json()["response"]['groups'][0]['items']
            break # success: stop retrying so that no API quota is wasted
        except Exception:
            print('(err)', end='')
            time.sleep(2) # sleep for two seconds, then retry
        
    # return only relevant information for each nearby venue
    venues_list.append([(
            latitude, 
            longitude, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    if (len(results) > 0):
        nearby_venues.columns = [
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Category']
    
    if verbose:
        print('found {} venues within {} meters of {}/{}'.format(len(results), radius, latitude, longitude))
    else:
        print('{}'.format(len(results)), end='. ')
    
    return(nearby_venues)


### Test number of venues finder

Using the first record in the cities_df DataFrame, test the function defined above.

In [9]:
print(cities_df.head(1))


   Unnamed: 0  City name City state  Population  Latitude  Longitude
0           0  Arlington      Texas      396394   32.7007   -97.1247


In [10]:
venues_df = getVenuesNearLatLong(cities_df.Latitude[0], cities_df.Longitude[0])
city_name = cities_df['City name'][0]
city_state = cities_df['City state'][0]


found 7 venues within 500 meters of 32.7007/-97.1247


In [11]:
print('{}, {} venues_df.shape = {}'.format(city_name, city_state, venues_df.shape))
venues_df


Arlington, Texas venues_df.shape = (7, 4)


Unnamed: 0,Latitude,Longitude,Venue,Venue Category
0,32.7007,-97.1247,Krispy Kreme Doughnuts,Donut Shop
1,32.7007,-97.1247,Texas Vision Care,Optical Shop
2,32.7007,-97.1247,Kenner's Kolache Bakery,Breakfast Spot
3,32.7007,-97.1247,Cooper St Bakery,Bakery
4,32.7007,-97.1247,El Pollo Regio,Mexican Restaurant
5,32.7007,-97.1247,Avis Car Rental,Rental Car Location
6,32.7007,-97.1247,Metro Flex,Gym


### Search hex grid around a given coordinate

Define a function that takes a latitude and longitude and returns the venues at the 6 coordinate points around that location, in a hex grid. Each point in the hex grid will be labeled by a pair of integers as shown in the following diagram around the origin (0,0):

```
        ( -1, 1 )      ( 0, 1 )
                \      /
                 \    /
( -1, 0 )  ---  ( 0, 0 )  ---  ( 1, 0 )
                  /  \
                 /    \
        ( 0, -1 )     ( 1, -1 )
```

Given a latitude and longitude, the entire hex grid can therefore be described by a set of tuples `( x, y )`. The function will collect the venues result and append it to a dictionary that is keyed by these `( x, y )` tuples.
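
The mapping from a grid label to a geographic position can be sketched in isolation, without calling Foursquare. The sketch below mirrors the arithmetic of the function defined in the next cell (a grid spacing of 0.7 times the radius, a locally flat Earth with a radius of 6378 km); the name `hex_to_latlong` is a hypothetical helper introduced only for this illustration.

```python
import math

R_EARTH = 6378000.0  # approximate radius of the Earth in meters
OVERLAP = 1.4        # same overlap factor as in the notebook's grid function

def hex_to_latlong(lat_orig, long_orig, coord, radius=500):
    """Convert an integer hex-grid label (x, y) to (latitude, longitude) in degrees."""
    x, y = coord
    step = radius * OVERLAP / 2.0          # distance between neighboring grid points
    dx = x * step + y * step / 2.0         # each row is shifted sideways by half a step
    dy = y * step * math.sqrt(3.0) / 2.0   # rows are sqrt(3)/2 steps apart
    lat = lat_orig + math.degrees(dy / R_EARTH)
    lng = long_orig + math.degrees(dx / R_EARTH) / math.cos(math.radians(lat_orig))
    return lat, lng

# Arlington, Texas as the origin (first row of cities_df).
print(hex_to_latlong(32.7007, -97.1247, (-1, 0)))
print(hex_to_latlong(32.7007, -97.1247, (0, 1)))
```

Keeping this conversion free of side effects makes it easy to check the grid geometry before any API quota is spent.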

In [12]:
import math

def get_venues_in_hex_grid(latitude, longitude, venues_dict, this_coord, radius=500, limit=100, new_coords=None, verbose=True):
    '''
    Calls Foursquare in a hex grid around a given coordinate point. If venues
    have already been searched on one of the hex grid points, that result is
    kept and no new search is executed.
    
    Parameters:
    
    latitude and longitude are those of the origin coordinate (0, 0),
    venues_dict are the venues found so far (dictionary keys are coordinate tuples),
    this_coord is the center coordinate around which the hex grid is to be searched,
    radius is the radius [meters] to search around a coordinate point,
    limit is the maximum number of venues to return from a Foursquare search,
    new_coords is a list of coordinate points that have not been probed yet
        (defaults to a new empty list).
    
    Returns new_coords with any newly found coordinate tuples appended
    '''
    
    # avoid a mutable default argument: a shared list would leak state between calls
    if new_coords is None:
        new_coords = []
    
    r_earth = 6378000. # approximate radius of the Earth in meters
    pi = math.pi
    sqrt_three = math.sqrt(3.)
    overlap = 1.4 # 40% overlap
    
    cx = this_coord[0] # center X
    cy = this_coord[1] # center Y
    hex_coords = [ (cx-1,cy), (cx+1, cy), (cx,cy-1), (cx,cy+1), (cx-1,cy+1), (cx+1,cy-1) ] # the hex grid around this_coord
    
    if (cx, cy) in new_coords:
        new_coords.remove((cx, cy))
    
    for this_hex in hex_coords:
        if not this_hex in venues_dict:
            # the coordinate has not been searched for
            
            # get the x- and y-step from a hex grid; start with a square grid (letting the circles overlap a bit):
            dx_square = this_hex[0] * radius * ( overlap / 2. )
            dy_square = this_hex[1] * radius * ( overlap / 2. )
            # now convert to a hex grid:
            dx = dx_square + dy_square / 2.
            dy = dy_square * ( sqrt_three / 2. )
            # approximate the center point's latitude and longitude assuming locally flat Earth
            hex_latitude  = latitude  + (dy / r_earth) * (180 / pi)
            hex_longitude = longitude + (dx / r_earth) * (180 / pi) / math.cos(latitude * pi/180)
            
            if verbose:
                print('getting coordinate {}...'.format(this_hex))
            else:
                print('({},{}):'.format(this_hex[0], this_hex[1]), end='')
                
            this_venues = getVenuesNearLatLong(hex_latitude, hex_longitude, radius=radius, limit=limit, verbose=verbose)
            venues_dict[this_hex] = this_venues
            if not this_hex in new_coords:
                new_coords.append(this_hex)
    
    return new_coords


Test the hexagonal grid function on the above city, to see how it functions.

In [13]:
# initialize venues_dict with the venues dataframe at the origin
venues_dict = {}
origin_coord = (0,0)
venues_dict[origin_coord] = venues_df
lat_orig = cities_df.Latitude[0]
long_orig = cities_df.Longitude[0]

# call the hex grid exploration function
new_coords = get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (0,0) )


print('\n ... same call but with terse output:\n')

venues_dict = {}
origin_coord = (0,0)
venues_dict[origin_coord] = venues_df
new_coords = get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (0,0), verbose=False )


getting coordinate (-1, 0)...
found 3 venues within 500 meters of 32.7007/-97.12843637006064
getting coordinate (1, 0)...
found 8 venues within 500 meters of 32.7007/-97.12096362993937
getting coordinate (0, -1)...
found 12 venues within 500 meters of 32.69797706801414/-97.12656818503032
getting coordinate (0, 1)...
found 12 venues within 500 meters of 32.703422931985855/-97.12283181496969
getting coordinate (-1, 1)...
found 4 venues within 500 meters of 32.703422931985855/-97.12656818503032
getting coordinate (1, -1)...
found 11 venues within 500 meters of 32.69797706801414/-97.12283181496969

 ... same call but with terse output:

(-1,0):3. (1,0):8. (0,-1):12. (0,1):12. (-1,1):4. (1,-1):11. 

Output the result:

In [14]:
print('New coordinates probed: {}\n'.format(new_coords))

for key, entry in venues_dict.items():
    print('Venues at {}:\n{}\n\n'.format(key, entry['Venue Category'])) 
    

New coordinates probed: [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]

Venues at (0, 1):
0                 Donut Shop
1       Caribbean Restaurant
2               Burger Joint
3         Mexican Restaurant
4               Burger Joint
5        Filipino Restaurant
6     Thrift / Vintage Store
7                Pizza Place
8         Mexican Restaurant
9           Department Store
10                Smoke Shop
11        Mexican Restaurant
Name: Venue Category, dtype: object


Venues at (-1, 1):
0          Dessert Shop
1    Mexican Restaurant
2    Mexican Restaurant
3           Pizza Place
Name: Venue Category, dtype: object


Venues at (0, 0):
0             Donut Shop
1           Optical Shop
2         Breakfast Spot
3                 Bakery
4     Mexican Restaurant
5    Rental Car Location
6                    Gym
Name: Venue Category, dtype: object


Venues at (-1, 0):
0           Optical Shop
1         Breakfast Spot
2    Rental Car Location
Name: Venue Category, dtype: object


Ve

Get a graphical representation of where these venues are, using folium.

In [15]:
# import libraries required for folium

import json
!conda install -c conda-forge geopy=1.18.1 --yes


Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.18.1                     py_0    conda-forge


In [16]:
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
!conda install -c conda-forge folium=0.8.0 --yes
import folium


Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.8.0                      py_0    conda-forge


In [17]:
def make_map_from_dict(lat_orig, long_orig, venues_dict, zoom_start,
                       city_name='(city_name)', city_state='(city_state)',
                       radius_zoom=1.0, width=600, height=600,
                       no_touch=True,
                       zoom_control=False):
    
    if not zoom_control:
        min_zoom = zoom_start
        max_zoom = zoom_start
    else:
        min_zoom = 0
        max_zoom = 16
    
    new_map = folium.Map(
        location=[lat_orig, long_orig],
        zoom_start=zoom_start,
        width=width,
        height=height,
        control_scale=True,
        no_touch=no_touch,
        min_zoom=min_zoom,
        max_zoom=max_zoom
    )

    # add markers to map
    for coords, venues in venues_dict.items():
        if venues.shape[0] > 0: 
            label = '{}, # venues: {}'.format(coords, venues.shape[0])
            label = folium.Popup(label, parse_html=True)
            folium.CircleMarker(
                [venues['Latitude'][0], venues['Longitude'][0]],
                radius=venues.shape[0] * radius_zoom,
                popup=label,
                color='blue',
                fill=True,
                fill_color='#3186cc',
                fill_opacity=0.7,
                parse_html=False).add_to(new_map)
        
    legend_html = ('<div style="position: fixed; top: 30px; left: 50px; width: 450px;' 
                + 'height: 30px; border: 2px solid grey; z-index: 9999; font-size: 16px; background-color: white">' 
                + '&nbsp;{},&nbsp;{}' 
                + '</div>').format(
                     city_name,
                     city_state
                     )
    new_map.get_root().html.add_child(folium.Element(legend_html))
     
    return new_map

make_map_from_dict(lat_orig, long_orig, venues_dict, 14, city_name, city_state, radius_zoom=0.5)


### Interpretation of the test

A quick visual inspection confirms the correctness of the algorithm (the coordinates and expected number of venues match the text printout).

However, it is also obvious that the city is much larger, and many more venue points would have to be calculated this way. The Foursquare sandbox account is limited to just under 1000 API queries per day, and we have 19 cities in our data set.

Performing 100 query points for each city amounts to 1900 Foursquare API queries. This can be done over the course of two or three days. 100 query points, however, is just a 10x10 grid. Therefore, increase the search radius to 1500 meters, and accept that in certain circles we'll exceed the maximum number of venues returned by Foursquare.
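
The query budget can be checked with a few lines of arithmetic. The daily limit of 950 used here is an assumed stand-in for "just under 1000"; the exact sandbox quota is not stated in this notebook.

```python
import math

n_cities = 19
points_per_city = 100
daily_quota = 950  # assumption: placeholder for "just under 1000" queries per day

total_queries = n_cities * points_per_city
days_needed = math.ceil(total_queries / daily_quota)

print('total queries: {}'.format(total_queries))
print('days needed at {} queries/day: {}'.format(daily_quota, days_needed))
```

Any retries or exploratory test queries come on top of this figure, which is why an extra day of margin is sensible.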

Re-testing with the new parameters:

In [18]:
# initialize venues_dict with the venues dataframe at the origin
venues_dict = {}
origin_coord = (0,0)
lat_orig = cities_df.Latitude[0]
long_orig = cities_df.Longitude[0]
venues_dict[origin_coord] = getVenuesNearLatLong(lat_orig, long_orig, radius=1500)

# call the hex grid exploration function
new_coords = get_venues_in_hex_grid(
    lat_orig, long_orig, venues_dict, (0,0), radius=1500, verbose=False)


found 58 venues within 1500 meters of 32.7007/-97.1247
(-1,0):64. (1,0):72. (0,-1):96. (0,1):63. (-1,1):50. (1,-1):84. 

In [19]:
make_map_from_dict(lat_orig, long_orig, venues_dict, 14, city_name, city_state, radius_zoom=0.5)


At the risk of hitting the maximum number of venues returned from Foursquare, the radius 1500 meters will be chosen going forward. The number of cities will be capped at 19, and the number of venue points in the algorithm at 100 so that we can gather all required data within 2-3 days from Foursquare.

In a real analysis, we would of course pay for a Foursquare subscription and get many more data points.

### Manually run the next iteration

The algorithm to be built will then look at the coordinate points found so far, pick the one with the highest number of venues, and perform a hex lookup around that coordinate. In the above city, the most venues were returned at coordinate `(0, -1)`. Manually perform an iteration step around that coordinate:
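
Selecting the next coordinate amounts to taking the maximum over the unexplored points. A minimal sketch, using the venue counts printed above for radius 1500 in place of the actual DataFrames (the `counts` dictionary and the helper name `pick_next_coord` are illustrative assumptions):

```python
def pick_next_coord(new_coords, counts):
    """Return the unexplored coordinate with the most venues (first one wins ties)."""
    return max(new_coords, key=lambda coord: counts[coord])

# Venue counts from the radius=1500 test run above.
counts = {(-1, 0): 64, (1, 0): 72, (0, -1): 96, (0, 1): 63, (-1, 1): 50, (1, -1): 84}
new_coords = list(counts)

print(pick_next_coord(new_coords, counts))  # prints (0, -1)
```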

In [20]:
new_coords = get_venues_in_hex_grid(
    lat_orig, long_orig, venues_dict, (0,-1), radius=1500, new_coords=new_coords)


getting coordinate (-1, -1)...
found 47 venues within 1500 meters of 32.69253120404243/-97.14151366527287
getting coordinate (0, -2)...
found 100 venues within 1500 meters of 32.684362408084866/-97.13590911018191
getting coordinate (1, -2)...
found 94 venues within 1500 meters of 32.684362408084866/-97.1247


In [21]:
print('total new coordinates not probed yet: {}'.format(new_coords))


total new coordinates not probed yet: [(-1, 0), (1, 0), (0, 1), (-1, 1), (1, -1), (-1, -1), (0, -2), (1, -2)]


Three new coordinate points were queried from Foursquare, because the other three are already known. Resulting map:

In [22]:
make_map_from_dict(lat_orig, long_orig, venues_dict, 14, city_name, city_state, radius_zoom=0.5)


Just to get an overview, perform a grid query for venues and plot the result.

In [23]:
for x in range(-4, 5):
    for y in range(-4, 5):
        get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (x,y), radius=1500, verbose=False)
        

(-5,-4):65. (-3,-4):16. (-4,-5):34. (-4,-3):52. (-5,-3):75. (-3,-5):16. (-3,-3):8. (-4,-4):70. (-4,-2):21. (-5,-2):48. (-3,-2):8. (-4,-1):8. (-5,-1):17. (-3,-1):10. (-4,0):12. (-5,0):14. (-3,0):34. (-4,1):27. (-5,1):21. (-3,1):61. (-4,2):71. (-5,2):17. (-3,2):79. (-4,3):44. (-5,3):38. (-3,3):52. (-4,4):51. (-5,4):37. (-3,4):34. (-4,5):49. (-5,5):26. (-2,-4):17. (-2,-5):10. (-2,-3):12. (-2,-2):13. (-2,-1):7. (-2,0):39. (-2,1):58. (-2,2):68. (-2,3):36. (-2,4):35. (-3,5):42. (-1,-4):25. (-1,-5):10. (-1,-3):33. (-1,-2):42. (-1,2):(err)69. (-1,3):58. (-1,4):70. (-2,5):51. (0,-4):98. (0,-5):11. (0,-3):100. (0,2):74. (0,3):64. (0,4):57. (-1,5):75. (1,-4):(err)100. (1,-5):94. (1,-3):100. (1,1):61. (1,2):82. (1,3):29. (1,4):45. (0,5):94. (2,-4):100. (2,-5):71. (2,-3):100. (2,-2):100. (2,-1):60. (2,0):51. (2,1):49. (2,2):61. (2,3):27. (2,4):29. (1,5):67. (3,-4):73. (3,-5):27. (3,-3):97. (3,-2):74. (3,-1):15. (3,0):56. (3,1):43. (3,2):42. (3,3):(err)31. (3,4):16. (2,5):53. (4,-4):74. (4,-5):20. (

That's one way of running up your Foursquare quota! Result:

In [24]:
make_map_from_dict(lat_orig, long_orig, venues_dict, 12, city_name, city_state, radius_zoom=0.15)


### Edge case test: no venues

At latitude -10 and longitude 0 we are in the middle of the Atlantic Ocean, where there are no venues. Make sure the function behaves correctly in that edge case as well.

In [25]:
# initialize venues_dict with the venues dataframe at the origin
venues_dict = {}
origin_coord = (0,0)
venues_dict[origin_coord] = {}
lat_orig = -10
long_orig = 0

# call the hex grid exploration function
new_coords = get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (0,0) )

venues_dict


getting coordinate (-1, 0)...
found 0 venues within 500 meters of -10.0/-0.003192674936220234
getting coordinate (1, 0)...
found 0 venues within 500 meters of -10.0/0.003192674936220234
getting coordinate (0, -1)...
found 0 venues within 500 meters of -10.002722931985856/-0.001596337468110117
getting coordinate (0, 1)...
found 0 venues within 500 meters of -9.997277068014144/0.001596337468110117
getting coordinate (-1, 1)...
found 0 venues within 500 meters of -9.997277068014144/-0.001596337468110117
getting coordinate (1, -1)...
found 0 venues within 500 meters of -10.002722931985856/0.001596337468110117


{(-1, 0): Empty DataFrame
 Columns: []
 Index: [], (-1, 1): Empty DataFrame
 Columns: []
 Index: [], (0, -1): Empty DataFrame
 Columns: []
 Index: [], (0, 0): {}, (0, 1): Empty DataFrame
 Columns: []
 Index: [], (1, -1): Empty DataFrame
 Columns: []
 Index: [], (1, 0): Empty DataFrame
 Columns: []
 Index: []}

Works. We'll have to filter out such empty results later when looking for indicator venues.

### Iterative search algorithm

The following iterative search will be performed:
1. Seed the venues_dict dictionary with the venues at the origin,
2. mark the origin as the first (and only) coordinate point not yet explored,
3. loop: explore the hex grid around the unexplored coordinate point with the highest number of venues, until 100 coordinate points have been tested.
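
The loop can be tried out without spending any Foursquare quota by replacing the API call with a synthetic venue-count function. The sketch below is a simplified stand-in for the function in the next cell, not a copy of it: `explore`, `hex_neighbors`, and `fake_count` are hypothetical names, and the synthetic density is made up for illustration.

```python
def hex_neighbors(coord):
    """The six neighbors of a hex-grid label, in the same order as the notebook."""
    x, y = coord
    return [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1), (x - 1, y + 1), (x + 1, y - 1)]

def explore(count_fn, max_centers=10):
    """Greedy exploration: always expand the unexplored point with the most venues."""
    counts = {(0, 0): count_fn((0, 0))}  # step 1: seed with the origin
    frontier = [(0, 0)]                  # step 2: the origin is the only unexplored point
    centers = []
    while frontier and len(centers) < max_centers:  # step 3: expand the best point
        best = max(frontier, key=lambda c: counts[c])
        frontier.remove(best)
        centers.append(best)
        for nb in hex_neighbors(best):
            if nb not in counts:         # query each grid point at most once
                counts[nb] = count_fn(nb)
                frontier.append(nb)
    return centers, counts

# Synthetic density (an assumption for the demo): venues peak at grid point (3, -2).
def fake_count(coord):
    return max(0, 100 - 10 * (abs(coord[0] - 3) + abs(coord[1] + 2)))

centers, counts = explore(fake_count, max_centers=10)
print(centers[:4])
```

With this synthetic density the search walks from the origin straight toward the peak, which is the behavior expected of the real search as it moves toward venue-dense areas.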

In [26]:
def find_venues_geo_distribution(cities_df, city_index, max_coords_tested=100, radius=1500, limit=100, verbose=True):
    
    city_name = cities_df['City name'][city_index]
    city_state = cities_df['City state'][city_index]
    
    # initialize venues_dict with the venues dataframe at the origin
    venues_dict = {}
    origin_coord = (0,0)
    lat_orig = cities_df.Latitude[city_index]
    long_orig = cities_df.Longitude[city_index]
    print('[test #1 of {}] (0,0):'.format(max_coords_tested), end='')
    venues_df = getVenuesNearLatLong(lat_orig, long_orig, radius=radius, verbose=verbose)
    venues_dict[origin_coord] = venues_df
    
    # mark the origin as the first (and only) coordinate point not yet explored
    new_coords = [(0, 0)]
    num_coords_tested = 1
    
    while num_coords_tested < max_coords_tested:
        
        highest_venues = -1
        new_test_coord = None
        
        for this_coord in new_coords:
            venues_df = venues_dict[this_coord]
            if venues_df.shape[0] > highest_venues:
                new_test_coord = this_coord
                highest_venues = venues_df.shape[0]
        
        # call the hex grid exploration function
        num_coords_tested += 1
        print('[test #{} of {}]'.format(num_coords_tested, max_coords_tested), end=' ')
        new_coords = get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, new_test_coord, radius=radius, new_coords=new_coords, limit=limit, verbose=verbose )
        
    return lat_orig, long_orig, venues_dict, city_name, city_state


### Test run on the first city

Manually test the algorithm for the first city in `cities_df`. The fixed grid queried earlier should overlap with this iteratively acquired geographic distribution of venues.

In [27]:
lat_orig, long_orig, venues_dict, city_name, city_state = find_venues_geo_distribution(
    cities_df, 0, max_coords_tested=100, verbose=False)


[test #1 of 100] (0,0):58. [test #2 of 100] (-1,0):64. (1,0):72. (0,-1):96. (0,1):63. (-1,1):50. (1,-1):84. [test #3 of 100] (-1,-1):47. (0,-2):100. (1,-2):94. [test #4 of 100] (-1,-2):42. (0,-3):100. (1,-3):100. [test #5 of 100] (-1,-3):33. (0,-4):98. (1,-4):100. [test #6 of 100] (2,-3):100. (2,-4):100. [test #7 of 100] (1,-5):94. (2,-5):71. [test #8 of 100] (3,-3):97. (2,-2):100. (3,-4):73. [test #9 of 100] (3,-5):27. [test #10 of 100] (3,-2):74. (2,-1):60. [test #11 of 100] (-1,-4):25. (0,-5):11. [test #12 of 100] (4,-3):76. (4,-4):74. [test #13 of 100] [test #14 of 100] (1,-6):46. (2,-6):73. [test #15 of 100] [test #16 of 100] (5,-3):28. (4,-2):46. (5,-4):23. [test #17 of 100] (3,-1):15. [test #18 of 100] (4,-5):20. (5,-5):15. [test #19 of 100] [test #20 of 100] (3,-6):78. (2,-7):44. (3,-7):41. [test #21 of 100] (4,-6):28. (4,-7):25. [test #22 of 100] (2,0):51. (1,1):61. [test #23 of 100] [test #24 of 100] (-2,0):39. (-2,1):58. [test #25 of 100] (0,2):74. (-1,2):69. [test #26 of 10

In [28]:
make_map_from_dict(lat_orig, long_orig, venues_dict, 11, city_name, city_state, radius_zoom=0.1)


In [29]:
# test the second city in the list
lat_orig, long_orig, venues_dict, city_name, city_state = find_venues_geo_distribution(
    cities_df, 1, max_coords_tested=100, verbose=False)


[test #1 of 100] (0,0):4. [test #2 of 100] (-1,0):5. (1,0):3. (0,-1):9. (0,1):2. (-1,1):6. (1,-1):5. [test #3 of 100] (-1,-1):31. (0,-2):18. (1,-2):7. [test #4 of 100] (-2,-1):33. (-1,-2):29. (-2,0):24. [test #5 of 100] (-3,-1):36. (-2,-2):40. (-3,0):20. [test #6 of 100] (-3,-2):19. (-2,-3):26. (-1,-3):22. [test #7 of 100] (-4,-1):19. (-4,0):8. [test #8 of 100] (0,-3):9. [test #9 of 100] (-3,-3):24. (-2,-4):16. (-1,-4):14. [test #10 of 100] (-2,1):3. (-3,1):8. [test #11 of 100] (-4,-3):10. (-3,-4):16. (-4,-2):25. [test #12 of 100] (-5,-2):19. (-5,-1):8. [test #13 of 100] (0,-4):11. [test #14 of 100] (-4,1):6. [test #15 of 100] [test #16 of 100] (-5,0):7. [test #17 of 100] (-6,-2):12. (-5,-3):18. (-6,-1):8. [test #18 of 100] (1,-3):8. [test #19 of 100] (-6,-3):17. (-5,-4):16. (-4,-4):18. [test #20 of 100] (-4,-5):12. (-3,-5):9. [test #21 of 100] (-7,-3):18. (-6,-4):26. (-7,-2):12. [test #22 of 100] (-7,-4):14. (-6,-5):37. (-5,-5):20. [test #23 of 100] (-7,-5):30. (-6,-6):31. (-5,-6):26.

In [30]:
make_map_from_dict(
    lat_orig, long_orig, venues_dict, 11, city_name, city_state, radius_zoom=0.1)


Looks like it's working OK. The distribution of venues across these two cities is very different, so we should get good results from clustering.

### Foursquare Personal Account

Because I'd really like to get more points at a better resolution (which also means not losing possible indicator venues that will go into the analysis later) I've upgraded to the Foursquare personal account. Now armed with 99,500 queries per day, I can query with much higher resolution.


## Determine aggregates describing the geographical distribution

In order to cluster cities by shape of their venues, we need to calculate aggregate variables:
* Determine the center of the venues by averaging the coordinate points, weighted by the number of venues at each point.
* Determine the mean distance of all venues from that center.
* Determine the standard deviation of the distance distribution, as well as skewness and kurtosis ("peaky-ness").

Once we have these values, all details on the geographic venues distribution will be discarded.
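
One possible way to compute these aggregates is sketched below. It is not necessarily what notebook 2 does: it assumes the grid points have already been converted to planar x/y offsets in meters, weights every point by its venue count, and reports excess kurtosis (a normal distribution gives 0). The function name and argument layout are hypothetical.

```python
import math

def venue_distribution_aggregates(points, counts):
    """points: list of (x, y) offsets in meters; counts: number of venues at each point."""
    total = float(sum(counts))
    weights = [c / total for c in counts]

    # venue-weighted center of all coordinate points
    cx = sum(w * x for w, (x, _) in zip(weights, points))
    cy = sum(w * y for w, (_, y) in zip(weights, points))

    # distance of each point from the center, and its weighted mean
    dist = [math.hypot(x - cx, y - cy) for x, y in points]
    mean = sum(w * d for w, d in zip(weights, dist))

    def central_moment(order):
        return sum(w * (d - mean) ** order for w, d in zip(weights, dist))

    std = math.sqrt(central_moment(2))
    if std == 0:
        skew = kurt = float('nan')  # undefined when all distances are equal
    else:
        skew = central_moment(3) / std ** 3
        kurt = central_moment(4) / std ** 4 - 3.0
    return {'center': (cx, cy), 'mean_distance': mean, 'std': std,
            'skewness': skew, 'excess_kurtosis': kurt}

# Tiny made-up example: two venues at the origin, one venue 3 m to the east.
print(venue_distribution_aggregates([(0.0, 0.0), (3.0, 0.0)], [2, 1]))
```

Weighting by venue count treats every venue as sitting at its query point, which matches the resolution of the hex grid rather than the true venue positions.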

We will also sum up the number of indicator venues as compared to the total number of venues.

# Continued in notebook 2

In order to clean up things a bit I'll start a new notebook.