# Car ride-share potential in mid-size U.S. cities from geographic spread

This notebook supports the official report (PDF) for the IBM Data Science Specialization on Coursera. For all details, see the PDF.

## Extract geographic location of mid-size U.S. cities

The 2017 U.S. census estimate for city size can be obtained from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk . It is faithfully represented in a corresponding Wikipedia page https://en.wikipedia.org/w/index.php?title=List_of_United_States_cities_by_population&oldid=883568308 (retrieved 3 March 2019) from which it can be easily parsed.

In [4]:
import pandas as pd
import numpy as np

Following https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722 (retrieved 24 Feb 2019), use BeautifulSoup to get a parseable representation of the Wikipedia page, then load the table with all cities into `city_table`:

In [5]:
import requests
url = 'https://en.wikipedia.org/w/index.php?title=List_of_United_States_cities_by_population&oldid=883568308'
website_url = requests.get(url).text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url, 'lxml')
city_table = soup.find('table', { 'class' : 'wikitable sortable' })
print("{}\n\n   [...]\n\n{}".format(str(city_table)[:500].replace('\n', '').replace('<tr>', '\n\n<tr>'), str(city_table)[-500:]))

<table class="wikitable sortable" style="text-align:center"><tbody>

<tr><th>2017<br/>rank</th><th>City</th><th>State<sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[5]</a></sup></th><th>2017<br/>estimate</th><th>2010<br/>Census</th><th>Change</th><th colspan="2">2016 land area</th><th colspan="2">2016 population density</th><th>Location</th></tr>

<tr><td>1</td><td style="text-align:left;background-color:#cfecec"><i><a href="/wiki/New_York_City" title="New York 

   [...]

"latitude">38°21′14″N</span> <span class="longitude">121°58′22″W</span></span></span><span class="geo-multi-punct">﻿ / ﻿</span><span class="geo-default"><span class="vcard"><span class="geo-dec" title="Maps, aerial photos, and other data for this location">38.3539°N 121.9728°W</span><span style="display:none">﻿ / <span class="geo">38.3539; -121.9728</span></span><span style="display:none">﻿ (<span class="fn org">Vacaville</span>)</span></span></span></a></span></small>
</td></tr></tbody></table>


From `city_table` find all cities with an estimated 2017 population between 300,000 and 400,000 and parse out the latitude and longitude into numeric values:
* city name is the 2nd column (remove references in square brackets if present),
* city state is the 3rd column,
* population is the 4th column (remove thousands-separator commas before interpreting as integer),
* latitude and longitude are contained in the 11th column, but have to be substring-filtered.
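
The substring filter from the last bullet can be illustrated on a single cell. The sketch below is not part of the original notebook: `parse_lat_long` is a hypothetical helper, and `sample` is an assumed reconstruction of the location cell text, pieced together from the HTML excerpt printed above (including the invisible U+FEFF characters around the slashes).

```python
import re

def parse_lat_long(location_text):
    """Extract the decimal latitude and longitude from a location cell."""
    # drop the trailing "(City)" label, then everything up to the last slash
    raw = re.sub(r'^.*/', '', re.sub(r'\(.*\)', '', location_text)).replace(' ', '')
    # strip non-ASCII residue such as U+FEFF
    raw = raw.encode('ascii', errors='ignore').decode()
    lat_text, long_text = raw.split(';')
    return float(lat_text), float(long_text)

# assumed cell text, reconstructed from the HTML excerpt above
sample = ('38°21′14″N 121°58′22″W\ufeff / \ufeff38.3539°N 121.9728°W'
          '\ufeff / 38.3539; -121.9728\ufeff (Vacaville)')
print(parse_lat_long(sample))
```

Only the last of the three coordinate representations in the cell is used, because it is already in signed decimal degrees.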

In [7]:
import re

l = []

table_rows = city_table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td]
    if len(row) < 1:
        print("(ignoring empty row)")
        test_size = 0
    else:
        test_size = int(row[3].replace(',', ''))
        
    if test_size >= 300000 and test_size <= 400000:
        city_name = re.sub(r'\[.*\]', '', row[1])
        city_state = row[2]
        city_estd_pop2017 = test_size
        # keep only the decimal "lat; long" part after the last slash, without the "(City)" label
        city_latlongraw = re.sub(r'^.*/', '', re.sub(r'\(.*\)', '', row[10])).replace(' ', '')
        # strip non-ASCII residue
        city_latlongraw = city_latlongraw.encode('ascii', errors='ignore').decode()
        city_lat = float(re.sub(r';.*$', '', city_latlongraw))
        city_long = float(re.sub(r'^.*;', '', city_latlongraw))
        l.append([city_name, city_state, city_estd_pop2017, city_lat, city_long])

cities_df = pd.DataFrame(l)
cities_df.columns = ['City name', 'City state', 'Population', 'Latitude', 'Longitude']
print(cities_df)

(ignoring empty row)
         City name    City state  Population  Latitude  Longitude
0        Arlington         Texas      396394   32.7007   -97.1247
1      New Orleans     Louisiana      393292   30.0534   -89.9345
2          Wichita        Kansas      390591   37.6907   -97.3459
3        Cleveland          Ohio      385525   41.4785   -81.6794
4            Tampa       Florida      385430   27.9701   -82.4797
5      Bakersfield    California      380874   35.3212  -119.0183
6           Aurora      Colorado      366623   39.6880  -104.6897
7          Anaheim    California      352497   33.8555  -117.7601
8         Honolulu        Hawaii      350395   21.3243  -157.8476
9        Santa Ana    California      334136   33.7363  -117.8830
10       Riverside    California      327728   33.9381  -117.3932
11  Corpus Christi         Texas      325605   27.7543   -97.1734
12       Lexington      Kentucky      321959   38.0407   -84.4583
13        Stockton    California      310496   37.9763 

In [9]:
# persist the DataFrame (at least for a little while)
cities_df.to_csv('cities_df.csv')

## Get venues using Foursquare

(this cell contains the Foursquare API credentials)

<!--
CLIENT_ID = '<your Foursquare client ID>' # your Foursquare ID
CLIENT_SECRET = '<your Foursquare client secret>' # your Foursquare Secret
-->

In [10]:
print('CLIENT_ID set: {}'.format(CLIENT_ID is not None))
print('CLIENT_SECRET set: {}'.format(CLIENT_SECRET is not None))

VERSION = '20180605' # Foursquare API version

CLIENT_ID set: True
CLIENT_SECRET set: True


In [11]:
# (optional: restore all of the above data from storage, import all libraries from above)
import pandas as pd
import numpy as np
import requests
import re

cities_df = pd.read_csv('cities_df.csv')

Define a Foursquare query that gets all venues within a default radius of 500 meters around a latitude and longitude. The number of venues returned is capped at 100 by default.

In [12]:
def getVenuesNearLatLong(latitude, longitude, radius=500, limit=100):
    
    venues_list=[]
                
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,
            latitude, 
            longitude, 
            radius, 
            limit)
            
    # make the GET request
    results_raw = requests.get(url)
    results = results_raw.json()["response"]['groups'][0]['items']
        
    # return only relevant information for each nearby venue
    venues_list = [(
            latitude,
            longitude,
            v['venue']['name'],
            v['venue']['categories'][0]['name']) for v in results]

    # name the columns in the constructor so that an empty result still yields a valid DataFrame
    nearby_venues = pd.DataFrame(venues_list, columns=[
                  'Latitude',
                  'Longitude',
                  'Venue',
                  'Venue Category'])

    print('found {} venues within {} meters of {}/{}'.format(len(results), radius, latitude, longitude))

    return nearby_venues
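
As an aside, the query string used in `getVenuesNearLatLong` can also be assembled from a dictionary, which avoids hand-written `&` separators. This is a minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper that is not part of the notebook, and it only builds the URL without contacting Foursquare.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version,
                      latitude, longitude, radius=500, limit=100):
    """Build the venues/explore request URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(latitude, longitude),
        'radius': radius,
        'limit': limit,
    }
    # keep the comma in "ll" readable instead of percent-encoding it
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# placeholder credentials, for illustration only
example_url = build_explore_url('ID', 'SECRET', '20180605', 32.7007, -97.1247)
print(example_url)
```

When using `requests`, the same dictionary could instead be passed via the `params` argument of `requests.get`.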

### Test number of venues finder

Using the first record in the cities_df DataFrame, test the function defined above.

In [13]:
print(cities_df.head(1))

   Unnamed: 0  City name City state  Population  Latitude  Longitude
0           0  Arlington      Texas      396394   32.7007   -97.1247


In [14]:
venues_df = getVenuesNearLatLong(cities_df.Latitude[0], cities_df.Longitude[0])

found 7 venues within 500 meters of 32.7007/-97.1247


In [15]:
print('venues_df.shape = {}'.format(venues_df.shape))
venues_df

venues_df.shape = (7, 4)


   Latitude  Longitude                    Venue       Venue Category
0   32.7007   -97.1247   Krispy Kreme Doughnuts           Donut Shop
1   32.7007   -97.1247  Kenner's Kolache Bakery       Breakfast Spot
2   32.7007   -97.1247        Texas Vision Care         Optical Shop
3   32.7007   -97.1247         Cooper St Bakery               Bakery
4   32.7007   -97.1247           El Pollo Regio   Mexican Restaurant
5   32.7007   -97.1247          Avis Car Rental  Rental Car Location
6   32.7007   -97.1247               Metro Flex                  Gym


### Search hex grid around a given coordinate

Define a function that takes a latitude and longitude and collects the venues at the 6 coordinate points around that location, arranged in a hex grid. Each point in the hex grid is labeled by a pair of integers, as shown in the following diagram around the origin (0,0):

```
        ( -1, 1 )      ( 0, 1 )
                \      /
                 \    /
( -1, 0 )  ---  ( 0, 0 )  ---  ( 1, 0 )
                  /  \
                 /    \
        ( 0, -1 )     ( 1, -1 )
```

Given the latitude and longitude of the origin, the entire hex grid can therefore be described by a set of tuples `( x, y )`. The function collects the venue results for each grid point and stores them in a dictionary that is keyed by these `( x, y )` tuples.

In [23]:
import math

def get_venues_in_hex_grid(latitude, longitude, venues_dict, this_coord, radius=500, limit=100):
    '''
    Calls Foursquare in a hex grid around a given coordinate point. If venues
    have already been searched on one of the hex grid points, that result is
    kept and no new search is executed.
    
    Parameters:
    
    latitude and longitude are those of the origin coordinate (0, 0),
    venues_dict are the venues found so far (dictionary keys are a coordinate tuple),
    this_coord is the center coordinate around which the hex grid is to be searched,
    radius is the radius [meters] to search around a coordinate point,
    limit is the maximum number of venues to return from a Foursquare search.
    '''
    
    r_earth = 6378000. # approximate radius of the Earth in meters
    pi = math.pi
    sqrt_three = math.sqrt(3.)
    overlap = 1.4 # 40% overlap
    
    cx = this_coord[0] # center X
    cy = this_coord[1] # center Y
    hex_coords = [ (cx-1,cy), (cx+1, cy), (cx,cy-1), (cx,cy+1), (cx-1,cy+1), (cx+1,cy-1) ] # the hex grid around this_coord

    for this_hex in hex_coords:
        if this_hex not in venues_dict:
            # the coordinate has not been searched for
            
            # get the x- and y-step from a hex grid; start with a square grid (letting the circles overlap a bit):
            dx_square = this_hex[0] * radius * ( overlap / 2. )
            dy_square = this_hex[1] * radius * ( overlap / 2. )
            # now convert to a hex grid:
            dx = dx_square + dy_square / 2.
            dy = dy_square * ( sqrt_three / 2. )
            # approximate the center point's latitude and longitude assuming locally flat Earth
            hex_latitude  = latitude  + (dy / r_earth) * (180 / pi)
            hex_longitude = longitude + (dx / r_earth) * (180 / pi) / math.cos(latitude * pi/180)
            
            this_venues = getVenuesNearLatLong(hex_latitude, hex_longitude, radius, limit)
            venues_dict[this_hex] = this_venues
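
The coordinate conversion inside `get_venues_in_hex_grid` can be checked in isolation, without calling Foursquare. The sketch below repeats the same arithmetic in a standalone function; `hex_to_lat_long` is a hypothetical name that does not appear in the notebook, while the constants are copied from the function above.

```python
import math

R_EARTH = 6378000.0  # approximate radius of the Earth in meters (same value as above)
OVERLAP = 1.4        # same overlap factor as in get_venues_in_hex_grid

def hex_to_lat_long(latitude, longitude, hex_coord, radius=500):
    """Convert a hex grid tuple (x, y) into latitude and longitude.

    latitude and longitude are those of the origin coordinate (0, 0).
    """
    # step on a square grid first, letting neighbouring circles overlap
    dx_square = hex_coord[0] * radius * (OVERLAP / 2.0)
    dy_square = hex_coord[1] * radius * (OVERLAP / 2.0)
    # shear and compress the square grid into a hex grid
    dx = dx_square + dy_square / 2.0
    dy = dy_square * (math.sqrt(3.0) / 2.0)
    # locally flat Earth approximation
    hex_latitude = latitude + (dy / R_EARTH) * (180 / math.pi)
    hex_longitude = longitude + (dx / R_EARTH) * (180 / math.pi) / math.cos(latitude * math.pi / 180)
    return hex_latitude, hex_longitude

# grid point (0, -1) around the first city in cities_df (Arlington, Texas)
print(hex_to_lat_long(32.7007, -97.1247, (0, -1)))
```

With a radius of 500 meters, neighbouring grid points are 350 meters apart (`radius * OVERLAP / 2`), so adjacent search circles overlap considerably.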


Test the hexagonal grid function on the above city, to see how it functions.

In [49]:
# initialize venues_dict with the venues dataframe at the origin
venues_dict = {}
origin_coord = (0,0)
venues_dict[origin_coord] = venues_df
lat_orig = cities_df.Latitude[0]
long_orig = cities_df.Longitude[0]

# call the hex grid exploration function
get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (0,0) )

found 4 venues within 500 meters of 32.7007/-97.12843637006064
found 8 venues within 500 meters of 32.7007/-97.12096362993937
found 10 venues within 500 meters of 32.69797706801414/-97.12656818503032
found 11 venues within 500 meters of 32.703422931985855/-97.12283181496969
found 4 venues within 500 meters of 32.703422931985855/-97.12656818503032
found 10 venues within 500 meters of 32.69797706801414/-97.12283181496969


Output the result:

In [50]:
for key, entry in venues_dict.items():
    print('Venues at {}:\n{}\n\n'.format(key, entry['Venue Category'])) 

Venues at (0, 1):
0                 Donut Shop
1       Caribbean Restaurant
2               Burger Joint
3         Mexican Restaurant
4               Burger Joint
5        Filipino Restaurant
6     Thrift / Vintage Store
7                Pizza Place
8         Mexican Restaurant
9                        Bar
10        Mexican Restaurant
Name: Venue Category, dtype: object


Venues at (-1, 1):
0              Dessert Shop
1        Mexican Restaurant
2    Thrift / Vintage Store
3               Pizza Place
Name: Venue Category, dtype: object


Venues at (0, 0):
0             Donut Shop
1         Breakfast Spot
2           Optical Shop
3                 Bakery
4     Mexican Restaurant
5    Rental Car Location
6                    Gym
Name: Venue Category, dtype: object


Venues at (-1, 0):
0           Optical Shop
1         Breakfast Spot
2    Rental Car Location
3                   Café
Name: Venue Category, dtype: object


Venues at (0, -1):
0           Optical Shop
1         Breakfast Spot

Get a graphical representation of where these venues are, using folium:

In [51]:
# import libraries required for folium

import json
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.18.1                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge


In [53]:
def make_map_from_dict(lat_orig, long_orig, venues_dict, zoom_start):
    new_map = folium.Map(location=[lat_orig, long_orig], zoom_start=zoom_start)

    # add markers to map
    for coords, venues in venues_dict.items():
        label = '{}, # venues: {}'.format(coords, venues.shape[0])
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [venues['Latitude'][0], venues['Longitude'][0]],
            radius=venues.shape[0],
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(new_map)  
     
    return new_map

make_map_from_dict(lat_orig, long_orig, venues_dict, 14)

### Interpretation of the test

A quick visual inspection confirms the correctness of the algorithm (the coordinates and expected number of venues match the text printout).

However, it is also obvious that the city is much larger, and many more venue points would have to be calculated this way. The Foursquare sandbox account is limited to just under 1000 API queries per day, and we have 19 cities in our data set.

Querying 100 points for each city amounts to 1900 Foursquare API queries, which can be spread over two or three days. However, 100 query points is only about a 10x10 grid, which covers little of a city at a 500-meter radius. Therefore, increase the search radius to 1500 meters, and accept that some circles will contain more venues than the maximum number returned by Foursquare.
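
The budget above can be reproduced with a few lines of arithmetic. This is a rough sketch: the daily quota of 1000 is an assumed upper bound (the text only says "just under 1000"), and the grid step uses the overlap factor of 1.4 from `get_venues_in_hex_grid`.

```python
import math

cities = 19
points_per_city = 100
daily_quota = 1000  # assumption: upper bound for "just under 1000" queries per day

total_queries = cities * points_per_city
days_needed = math.ceil(total_queries / daily_quota)
print('{} queries, at least {} days'.format(total_queries, days_needed))

overlap = 1.4  # overlap factor used in get_venues_in_hex_grid
for radius in (500, 1500):
    step = radius * overlap / 2.0  # distance between neighbouring grid points in meters
    print('radius {} m: grid step {:.0f} m, 10 steps span {:.1f} km'.format(
        radius, step, 10 * step / 1000.0))
```

Tripling the radius triples the grid step, so the same 100 query points span roughly three times the distance in each direction.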

Re-testing with the new parameters:

In [54]:
# initialize venues_dict with the venues dataframe at the origin
venues_dict = {}
origin_coord = (0,0)
lat_orig = cities_df.Latitude[0]
long_orig = cities_df.Longitude[0]
venues_dict[origin_coord] = getVenuesNearLatLong(lat_orig, long_orig, radius=1500)

# call the hex grid exploration function
get_venues_in_hex_grid(lat_orig, long_orig, venues_dict, (0,0), radius=1500)

found 59 venues within 1500 meters of 32.7007/-97.1247
found 69 venues within 1500 meters of 32.7007/-97.13590911018191
found 71 venues within 1500 meters of 32.7007/-97.1134908898181
found 96 venues within 1500 meters of 32.69253120404243/-97.13030455509096
found 62 venues within 1500 meters of 32.708868795957564/-97.11909544490905
found 48 venues within 1500 meters of 32.708868795957564/-97.13030455509096
found 84 venues within 1500 meters of 32.69253120404243/-97.11909544490905


In [55]:
make_map_from_dict(lat_orig, long_orig, venues_dict, 14)

At the risk of hitting the maximum number of venues returned by Foursquare, a radius of 1500 meters is used going forward. The number of cities is capped at 19 and the number of query points per city at 100, so that all required data can be gathered from Foursquare within 2-3 days.
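
The notebook does not show how the 100 query points per city are selected. One possible approach, sketched below under that caveat, is to grow the hex grid outward from the origin breadth-first until the cap is reached. `hex_grid_coords` is a hypothetical helper; it only enumerates grid tuples and makes no Foursquare calls.

```python
from collections import deque

# the six neighbour offsets used in get_venues_in_hex_grid
HEX_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]

def hex_grid_coords(max_points=100):
    """Enumerate hex grid tuples breadth-first, starting at the origin (0, 0)."""
    origin = (0, 0)
    seen = {origin}
    ordered = [origin]
    queue = deque([origin])
    while queue and len(ordered) < max_points:
        cx, cy = queue.popleft()
        for dx, dy in HEX_OFFSETS:
            neighbour = (cx + dx, cy + dy)
            if neighbour not in seen and len(ordered) < max_points:
                seen.add(neighbour)
                ordered.append(neighbour)
                queue.append(neighbour)
    return ordered

coords = hex_grid_coords(100)
print(len(coords), coords[:7])
```

Each tuple in `coords` could then be passed as `this_coord` to `get_venues_in_hex_grid`, which skips grid points that are already present in `venues_dict`.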

In a real analysis, we would of course pay for a Foursquare subscription and get many more data points.