# Clustering and Scoring Job Relocation Opportunities - Data Gathering Playground

Austin Rainwater

---

# Initialization

In [139]:
!pip install --quiet --upgrade sqlalchemy pymysql

from urllib.parse import quote as url_encode

import pandas as pd
import numpy as np
import aiohttp
import asyncio
import requests
import re # regex
import xml.etree.ElementTree as xml

from concurrent.futures import ProcessPoolExecutor

from pandas import json_normalize
from itertools import product

from sqlalchemy import (
    create_engine,
    Table,
    Column,
    MetaData,
    String,
    Numeric,
    Integer
)

import yaml

with open('secrets.yaml', 'r') as secrets_file:
    secrets = yaml.safe_load(secrets_file)
    
header = {"User-Agent": 
          'datascience jupyter notebook/0.0 '
          '(https://github.com/pacorain/datascience-certification-final-project; '
          'Austin Rainwater, paco@heckin.io)'}
v = '20201108'

---

# City Definition

A good place to start is with the cities themselves. Below is the table definition for the cities I will be exploring and the traits I want to record for each.

In [2]:
engine = create_engine(secrets['db_connection_string'], echo=True)

In [3]:
meta = MetaData()

cities = Table(
    'city', meta,
    Column('city_name', String(50), primary_key=True, comment='Community Name'),
    Column('metro_name', String(50), comment='Metropolitan Area Name'),
    Column('state', String(2), nullable=False, comment='2-Letter abbreviation of State'),
    Column('lat', Numeric(10, 6), nullable=False, comment='Latitude of City'),
    Column('lng', Numeric(10, 6), nullable=False, comment='Longitude of City'),
    Column('area_val', Numeric(10, 4), nullable=False, comment='Area of city in square miles'),
    Column('total_pop', Integer, nullable=False, comment='Total population of city')
)

In [4]:
meta.drop_all(engine) # During development
meta.create_all(engine)

2020-12-07 19:43:49,114 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2020-12-07 19:43:49,115 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:49,123 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'lower_case_table_names'
2020-12-07 19:43:49,124 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:49,142 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2020-12-07 19:43:49,142 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:49,150 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8mb4' and `Collation` = 'utf8mb4_bin'
2020-12-07 19:43:49,151 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:49,163 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2020-12-07 19:43:49,165 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:49,168 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2020-12-07 19:43:49,169 INFO sqlalchemy.engine.base.E

Let's start with my birthplace: Fort Wayne, Indiana.

In [5]:
new_city = cities.insert()

try:
    engine.execute(new_city, [
        {'city_name': 'Fort Wayne, IN', 'metro_name': 'Fort Wayne, IN', 'state': 'IN'}
    ])
except Exception:  # e.g. a database error caused by the missing NOT NULL columns
    print("Oops! That didn't work.")

2020-12-07 19:43:49,300 INFO sqlalchemy.engine.base.Engine INSERT INTO city (city_name, metro_name, state) VALUES (%(city_name)s, %(metro_name)s, %(state)s)
2020-12-07 19:43:49,302 INFO sqlalchemy.engine.base.Engine {'city_name': 'Fort Wayne, IN', 'metro_name': 'Fort Wayne, IN', 'state': 'IN'}
2020-12-07 19:43:49,306 INFO sqlalchemy.engine.base.Engine ROLLBACK
Oops! That didn't work.


Ah, the table requires some more data to be able to insert the record. I could use the geocoder library from before to get the latitude and longitude, but since I will be using Wikipedia anyway, let's see if I can grab it from there.

I did some experimenting with the [Wikipedia API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Fort%20Wayne%2C%20Indiana&redirects=1&prop=wikitext), and oddly enough, while there are multiple endpoints capable of getting the _names_ of the templates used in a page, I could not for the life of me find a way to get the _data passed to_ the templates in an easy format such as JSON. So instead, I'm going to grab the `parsetree` and parse it with Python's XML libraries.
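
To make the idea concrete without calling the API, here is a minimal sketch that reads template parameters from a parse tree. The XML fragment is hand-written and simplified; its shape (`template` > `title`, and `part` > `name`/`value`) is an assumption inferred from the lookups used later in this notebook, and `template_params` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a parse tree; not copied from Wikipedia.
SAMPLE_PARSETREE = """
<root>
  <template>
    <title>Infobox settlement</title>
    <part><name>area_total_sq_mi</name>=<value>110.79</value></part>
    <part><name>population_est</name>=<value>270402</value></part>
  </template>
</root>
"""


def template_params(tree, template_title):
    """Return {parameter name: value text} for the first matching template."""
    template = tree.find(f".//template[title='{template_title}']")
    if template is None:
        return {}
    params = {}
    for part in template.findall("part"):
        name = part.findtext("name", default="").strip()
        value = part.findtext("value", default="").strip()
        if name:
            params[name] = value
    return params


tree = ET.fromstring(SAMPLE_PARSETREE)
infobox = template_params(tree, "Infobox settlement")
print(infobox["area_total_sq_mi"], infobox["population_est"])
```

Returning a plain dict makes it easy to see which parameters a template actually carries before picking the ones worth storing.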

In [6]:
city_name = 'Fort Wayne'
state_name = 'IN'

wikipedia_url = 'https://en.wikipedia.org/w/api.php'
params = {
    "action": "parse",
    "format": "json",
    "redirects": "1",
    "page": f"{city_name}, {state_name}",
    "prop": "parsetree"
}

response = requests.get(wikipedia_url, params=params, headers=header).json()['parse']['parsetree']['*']
response = xml.canonicalize(response, strip_text=True)

# Write XML data for local exploration
with open('data/fort_wayne.xml', 'w') as xml_file:
    xml_file.write(response)

Ah, going through the XML file, the map on the Wikipedia article is an SVG (i.e. an image, not something that contains computer-readable geographic data), so I will need to use a geocoder. 

I recall from the previous lab that when you grab data from Foursquare's API, it will geocode the 'near' parameter and return the latitude and longitude used.

I also want to include the total size of the city, so in order to enter data into the table, I need to grab data from Wikipedia _and_ Foursquare. Which is fine, because I need more data to explore possible features.
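
As a rough illustration of where the geocoded coordinates sit, the sketch below pulls the center point out of a response dictionary. The sample is a hypothetical, trimmed-down response containing only the keys this notebook reads (`geocode` > `center`), and `geocode_center` is an illustrative helper name.

```python
# Hypothetical, trimmed-down response; only the keys read below are included.
sample_response = {
    "geocode": {
        "where": "fort wayne in",
        "center": {"lat": 41.1306, "lng": -85.12886},
    },
    "groups": [],
}


def geocode_center(response):
    """Return (lat, lng) of the geocoded `near` value, or None if it is absent."""
    center = response.get("geocode", {}).get("center")
    if not center:
        return None
    return float(center["lat"]), float(center["lng"])


print(geocode_center(sample_response))
```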

In [7]:
wiki_data = xml.fromstring(response)

In [8]:
foursquare_url = "https://api.foursquare.com/v2/venues/explore"

params = {
    'client_id': secrets['4SQ_CLIENT_ID'],
    'client_secret': secrets['4SQ_CLIENT_SECRET'],
    'limit': '50',
    'v': v,
    'near': 'Fort Wayne, IN',
    'radius': 1000,
    'time': 'any', 
    'day': 'any',
    'sortByPopularity': '1'
}

foursquare_response = requests.get(foursquare_url, params=params, headers=header).json()['response']

In [9]:
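def template_value(wiki_data, template_title, part_name):
    """Return the text of parameter `part_name` in the template titled `template_title`."""
    template = wiki_data.find(".//template[title='{}']".format(template_title))
    # Each template argument is a <part> element holding <name> and <value> children
    return template.find("./part[name='{}']/value".format(part_name)).text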

lat = float(foursquare_response['geocode']['center']['lat'])
lng = float(foursquare_response['geocode']['center']['lng'])
sq_mi = float(template_value(wiki_data, "Infobox settlement", "area_total_sq_mi"))
total_pop = int(template_value(wiki_data, "Infobox settlement", "population_est"))

Alright, I've gotten the values I need initially for a city; now let's try inserting it.

In [10]:
engine.execute(new_city, [{
    'city_name': 'Fort Wayne', 
    'metro_name': 'Fort Wayne', 
    'state': 'IN', 
    'lat': lat,
    'lng': lng,
    'area_val': sq_mi,
    'total_pop': total_pop
}])

2020-12-07 19:43:50,558 INFO sqlalchemy.engine.base.Engine INSERT INTO city (city_name, metro_name, state, lat, lng, area_val, total_pop) VALUES (%(city_name)s, %(metro_name)s, %(state)s, %(lat)s, %(lng)s, %(area_val)s, %(total_pop)s)
2020-12-07 19:43:50,559 INFO sqlalchemy.engine.base.Engine {'city_name': 'Fort Wayne', 'metro_name': 'Fort Wayne', 'state': 'IN', 'lat': 41.1306, 'lng': -85.12886, 'area_val': 110.79, 'total_pop': 270402}
2020-12-07 19:43:50,563 INFO sqlalchemy.engine.base.Engine COMMIT


<sqlalchemy.engine.result.ResultProxy at 0x7f4a9addcc70>

In [11]:
query = cities.select()

pd.read_sql(query, engine)

2020-12-07 19:43:50,593 INFO sqlalchemy.engine.base.OptionEngine SELECT city.city_name, city.metro_name, city.state, city.lat, city.lng, city.area_val, city.total_pop 
FROM city
2020-12-07 19:43:50,594 INFO sqlalchemy.engine.base.OptionEngine {}


Unnamed: 0,city_name,metro_name,state,lat,lng,area_val,total_pop
0,Fort Wayne,Fort Wayne,IN,41.1306,-85.12886,110.79,270402


Not bad. 

Next, I want to grab some data from Foursquare to build a feature based on what's popular within 1, 5, 25, and 100 km. I'll use the category hierarchy like I did in the week 3 lab. Given that the Foursquare API allows for 99,500 of these calls a day, and up to 5,000 per hour, I can also do this comfortably with each section defined in the `venues/explore` endpoint to see how much variety is in each section in an area.
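
A quick sketch of what this plan costs in API calls, and of the fan-out pattern it implies: one request per (radius, section) pair, issued concurrently. The radii and section names mirror the ones used in this notebook; `fake_explore` is a stand-in that performs no network I/O, and the hourly limit is simply the figure quoted above.

```python
import asyncio
from itertools import product

radii = [1000, 5000, 25000, 100000]  # meters
sections = ['food', 'drinks', 'coffee', 'shops', 'arts',
            'outdoors', 'sights', 'trending', 'topPicks']
HOURLY_LIMIT = 5000  # figure quoted in the text above

combos = list(product(radii, sections))
calls_per_city = len(combos)
cities_per_hour = HOURLY_LIMIT // calls_per_city


async def fake_explore(radius, section):
    """Stand-in for one venues/explore request; performs no network I/O."""
    await asyncio.sleep(0)
    return {"radius": radius, "section": section}


async def explore_all():
    # gather() returns results in the same order as the awaitables it is given
    return await asyncio.gather(*(fake_explore(r, s) for r, s in combos))


results = asyncio.run(explore_all())
print(calls_per_city, cities_per_hour, results[0])
```

Inside Jupyter, where an event loop is already running, `await explore_all()` would take the place of `asyncio.run(explore_all())`.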

In [12]:
url = 'https://api.foursquare.com/v2/venues/categories'
params = {
    'client_id': secrets['4SQ_CLIENT_ID'],
    'client_secret': secrets['4SQ_CLIENT_SECRET'],
    'v': v
}
foursquare_categories = requests.get(url, params=params).json()

def category_hier(categories, prefix=None):
    """Flatten the nested category tree into one Series of levels per category id."""
    prefix = prefix or []  # avoid a mutable default argument
    result = []

    for category in categories:
        category = json_normalize(category).iloc[0]
        current_category = pd.Series(
            data=prefix + [category.shortName] + [np.nan] * (4 - len(prefix)),
            name=str(category.id),
            index=[
                'cat_level_1',
                'cat_level_2',
                'cat_level_3',
                'cat_level_4',
                'cat_level_5'
            ]
        )
        result.append(current_category)
        if subcategories := category.categories:
            result += category_hier(subcategories, prefix + [category.shortName])

    return result

categories = foursquare_categories['response']['categories']
category_df = pd.DataFrame(category_hier(categories))

In [13]:
radii = [1000, 5000, 25000, 100000]
sections = ['food', 'drinks', 'coffee', 'shops', 'arts', 'outdoors', 'sights', 'trending', 'topPicks']

async def get_popular_spots(city):
    """
    Get popular spots in various "sections" within various distances of `city`
    """
    async with aiohttp.ClientSession() as session:
        tasks = []
        for r, s in product(radii, sections):
            task = query_places(session, city, r, section=s)
            tasks.append(task)
        results = await asyncio.gather(*tasks)
    return pd.concat(results, ignore_index=True)
    
    
async def query_places(session, location, radius, section='', query=''):
    """
    With an existing HTTP `session`, get popular spots of the type `section` within `radius` meters of `location`
    
    Response parsing is handed to the global `executor` (a process pool) to speed up the 36 calls made by `get_popular_spots`
    """
    async with session.get("https://api.foursquare.com/v2/venues/explore", params={
        'client_id': secrets['4SQ_CLIENT_ID'],
        'client_secret': secrets['4SQ_CLIENT_SECRET'],
        'limit': '50',
        'v': v,
        'near': location,
        'radius': radius, 
        'section': section,
        'query': query,
        'sortByPopularity': 1
    }) as result:
        data = await result.json()
    loop = asyncio.get_running_loop()
    venues = await loop.run_in_executor(executor, normalize_foursquare_response, data)
    if venues is not None:
        venues['city'] = location
        venues['radius'] = radius
        if section:
            venues['section'] = section
        if query:
            venues['query'] = query
    return venues
    
    
def normalize_foursquare_response(data):
    """
    Converts the Foursquare response into a dataframe with all of the venues, as well as geolocation metadata.
    """
    if 'groups' not in data['response']:
        return None
    venues = json_normalize(data, ['response', 'groups', 'items'], sep='_')
    geo = json_normalize(data['response']['geocode'], sep='_').loc[0]  # json_normalize returns a single-row df
    geo.index = pd.Index(f'geo_{name}' for name in geo.index)
    venues.loc[:, geo.index] = geo.values
    venues['search_popularity'] = venues.index.values
    return venues


In [14]:
executor = ProcessPoolExecutor()
places_df = await get_popular_spots('Fort Wayne, IN')
places_df

Unnamed: 0,referralId,reasons_count,reasons_items,venue_id,venue_name,venue_location_address,venue_location_lat,venue_location_lng,venue_location_labeledLatLngs,venue_location_postalCode,...,geo_geometry_bounds_sw_lng,search_popularity,city,radius,section,venue_venuePage_id,flags_outsideRadius,venue_location_neighborhood,venue_events_count,venue_events_summary
0,e-3-4b5a3e80f964a5201ab728e3-0,0,"[{'summary': 'This spot is popular', 'type': '...",4b5a3e80f964a5201ab728e3,BakerStreet,4820 N Clinton St,41.122200,-85.125421,"[{'label': 'display', 'lat': 41.12219979566053...",46825,...,-85.303308,0,"Fort Wayne, IN",1000,food,,,,,
1,e-3-4b26e239f964a5207c8224e3-1,0,"[{'summary': 'This spot is popular', 'type': '...",4b26e239f964a5207c8224e3,Agaves Mexican Grill,211 E Washington Center Rd,41.132293,-85.138746,"[{'label': 'display', 'lat': 41.13229322911061...",46825,...,-85.303308,1,"Fort Wayne, IN",1000,food,,,,,
2,e-3-4b86ebf4f964a5200ba631e3-2,0,"[{'summary': 'This spot is popular', 'type': '...",4b86ebf4f964a5200ba631e3,Papa John's Pizza,5626 Coldwater Rd,41.130839,-85.135060,"[{'label': 'display', 'lat': 41.13083878708079...",46825,...,-85.303308,2,"Fort Wayne, IN",1000,food,,,,,
3,e-3-4b5f57bff964a5201db529e3-3,0,"[{'summary': 'This spot is popular', 'type': '...",4b5f57bff964a5201db529e3,Cork 'n Cleaver,221 Washington Ctr Rd,41.132858,-85.138249,"[{'label': 'display', 'lat': 41.132858, 'lng':...",46825,...,-85.303308,3,"Fort Wayne, IN",1000,food,,,,,
4,e-3-4bfec4844e5d0f47a7207d1f-4,0,"[{'summary': 'This spot is popular', 'type': '...",4bfec4844e5d0f47a7207d1f,Wendy’s,5701 Coldwater Rd,41.130943,-85.136557,"[{'label': 'display', 'lat': 41.13094264479011...",46825,...,-85.303308,4,"Fort Wayne, IN",1000,food,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1518,e-1-4b86e581f964a52089a431e3-45,0,"[{'summary': 'This spot is popular', 'type': '...",4b86e581f964a52089a431e3,Bruno's Pizza,28046 County Road 16,41.674179,-86.004036,"[{'label': 'display', 'lat': 41.67417862568361...",46516,...,-85.303308,45,"Fort Wayne, IN",100000,topPicks,,,,,
1519,e-1-4c955265f244b1f70cd72a1d-46,0,"[{'summary': 'This spot is popular', 'type': '...",4c955265f244b1f70cd72a1d,Lucy's vedi twist,,41.627403,-85.424383,"[{'label': 'display', 'lat': 41.62740269, 'lng...",46761,...,-85.303308,46,"Fort Wayne, IN",100000,topPicks,,,,,
1520,e-1-4d0a57e6e023224b45825fd2-47,0,"[{'summary': 'This spot is popular', 'type': '...",4d0a57e6e023224b45825fd2,Yoder's Country Market,375 Eleanor Dr,41.920600,-85.511006,"[{'label': 'display', 'lat': 41.92060038578456...",49032,...,-85.303308,47,"Fort Wayne, IN",100000,topPicks,,,,,
1521,e-1-4cbb1d4fa33bb1f7d04992fd-48,0,"[{'summary': 'This spot is popular', 'type': '...",4cbb1d4fa33bb1f7d04992fd,Orchard Tree,501 Grand Lake Rd,40.555072,-84.551998,"[{'label': 'display', 'lat': 40.55507221383220...",45822,...,-85.303308,48,"Fort Wayne, IN",100000,topPicks,,,,,


In [15]:
print(f"{len(places_df.venue_id.unique())} unique venues")

799 unique venues


Cool, that gives me an idea of what we can do on an evening or a weekend.

Let's add in the venue category hierarchy.
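
For intuition about what the hierarchy lookup provides, here is a stdlib-only sketch that flattens a toy category tree into fixed-width level lists. The tree, its IDs, and the `flatten_categories` helper are all hypothetical; the real lookup in this notebook is built with pandas from the Foursquare categories response.

```python
# Hypothetical miniature category tree; IDs and names are placeholders.
toy_categories = [
    {"id": "c1", "shortName": "Food", "categories": [
        {"id": "c2", "shortName": "Desserts", "categories": [
            {"id": "c3", "shortName": "Ice Cream", "categories": []},
        ]},
        {"id": "c4", "shortName": "Pizza", "categories": []},
    ]},
]

MAX_DEPTH = 5  # matches cat_level_1 .. cat_level_5


def flatten_categories(categories, prefix=()):
    """Map each category id to its path from the root, padded to MAX_DEPTH."""
    rows = {}
    for category in categories:
        path = prefix + (category["shortName"],)
        rows[category["id"]] = list(path) + [None] * (MAX_DEPTH - len(path))
        rows.update(flatten_categories(category.get("categories", []), path))
    return rows


levels = flatten_categories(toy_categories)
print(levels["c3"])  # ['Food', 'Desserts', 'Ice Cream', None, None]
```

Keying by category ID means a venue's primary category can be resolved to its full path with a single lookup.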

In [16]:
def get_categories(row):
    if not row.venue_categories:
        return pd.Series(
            [np.nan] * 5,
            category_df.columns,
            name=row.name
        )
    return category_df.loc[row.venue_categories[0]['id']]

places_df = places_df.merge(
    places_df.apply(get_categories, axis=1), 
    left_index=True,
    right_index=True
)
places_df

Unnamed: 0,referralId,reasons_count,reasons_items,venue_id,venue_name,venue_location_address,venue_location_lat,venue_location_lng,venue_location_labeledLatLngs,venue_location_postalCode,...,venue_venuePage_id,flags_outsideRadius,venue_location_neighborhood,venue_events_count,venue_events_summary,cat_level_1,cat_level_2,cat_level_3,cat_level_4,cat_level_5
0,e-3-4b5a3e80f964a5201ab728e3-0,0,"[{'summary': 'This spot is popular', 'type': '...",4b5a3e80f964a5201ab728e3,BakerStreet,4820 N Clinton St,41.122200,-85.125421,"[{'label': 'display', 'lat': 41.12219979566053...",46825,...,,,,,,Food,Steakhouse,,,
1,e-3-4b26e239f964a5207c8224e3-1,0,"[{'summary': 'This spot is popular', 'type': '...",4b26e239f964a5207c8224e3,Agaves Mexican Grill,211 E Washington Center Rd,41.132293,-85.138746,"[{'label': 'display', 'lat': 41.13229322911061...",46825,...,,,,,,Food,Mexican,,,
2,e-3-4b86ebf4f964a5200ba631e3-2,0,"[{'summary': 'This spot is popular', 'type': '...",4b86ebf4f964a5200ba631e3,Papa John's Pizza,5626 Coldwater Rd,41.130839,-85.135060,"[{'label': 'display', 'lat': 41.13083878708079...",46825,...,,,,,,Food,Pizza,,,
3,e-3-4b5f57bff964a5201db529e3-3,0,"[{'summary': 'This spot is popular', 'type': '...",4b5f57bff964a5201db529e3,Cork 'n Cleaver,221 Washington Ctr Rd,41.132858,-85.138249,"[{'label': 'display', 'lat': 41.132858, 'lng':...",46825,...,,,,,,Food,Steakhouse,,,
4,e-3-4bfec4844e5d0f47a7207d1f-4,0,"[{'summary': 'This spot is popular', 'type': '...",4bfec4844e5d0f47a7207d1f,Wendy’s,5701 Coldwater Rd,41.130943,-85.136557,"[{'label': 'display', 'lat': 41.13094264479011...",46825,...,,,,,,Food,Fast Food,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1518,e-1-4b86e581f964a52089a431e3-45,0,"[{'summary': 'This spot is popular', 'type': '...",4b86e581f964a52089a431e3,Bruno's Pizza,28046 County Road 16,41.674179,-86.004036,"[{'label': 'display', 'lat': 41.67417862568361...",46516,...,,,,,,Food,Pizza,,,
1519,e-1-4c955265f244b1f70cd72a1d-46,0,"[{'summary': 'This spot is popular', 'type': '...",4c955265f244b1f70cd72a1d,Lucy's vedi twist,,41.627403,-85.424383,"[{'label': 'display', 'lat': 41.62740269, 'lng...",46761,...,,,,,,Food,Desserts,Ice Cream,,
1520,e-1-4d0a57e6e023224b45825fd2-47,0,"[{'summary': 'This spot is popular', 'type': '...",4d0a57e6e023224b45825fd2,Yoder's Country Market,375 Eleanor Dr,41.920600,-85.511006,"[{'label': 'display', 'lat': 41.92060038578456...",49032,...,,,,,,Food,American,,,
1521,e-1-4cbb1d4fa33bb1f7d04992fd-48,0,"[{'summary': 'This spot is popular', 'type': '...",4cbb1d4fa33bb1f7d04992fd,Orchard Tree,501 Grand Lake Rd,40.555072,-84.551998,"[{'label': 'display', 'lat': 40.55507221383220...",45822,...,,,,,,Food,American,,,


Let's pull out the columns that would be helpful in creating or visualizing features.
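
Column names such as `venue_location_lat` come from flattening the nested JSON with `_` as the separator. The sketch below mimics that naming with a small recursive function; it is not how pandas implements `json_normalize`, and the sample item is hypothetical.

```python
def flatten(record, parent="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


# Hypothetical item shaped like one entry of the response's `items` list.
item = {
    "referralId": "e-3-example-0",
    "venue": {
        "id": "example-id",
        "name": "Example Venue",
        "location": {"lat": 41.1222, "lng": -85.125421},
    },
}

print(sorted(flatten(item)))
```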

In [17]:
columns = [
    'venue_id', 'venue_name', 'venue_location_lat', 'venue_location_lng', 
    'venue_location_crossStreet', 'venue_delivery_id', 'search_popularity', 
    'venue_location_city', 'venue_location_state',
    'geo_where', 'geo_slug', 'geo_longId', 'geo_center_lat', 
    'geo_center_lng', 'city', 'radius', 'section', 'cat_level_1', 
    'cat_level_2', 'cat_level_3', 'cat_level_4'
]

places_df[columns]

Unnamed: 0,venue_id,venue_name,venue_location_lat,venue_location_lng,venue_location_crossStreet,venue_delivery_id,search_popularity,venue_location_city,venue_location_state,geo_where,...,geo_longId,geo_center_lat,geo_center_lng,city,radius,section,cat_level_1,cat_level_2,cat_level_3,cat_level_4
0,4b5a3e80f964a5201ab728e3,BakerStreet,41.122200,-85.125421,,1502561,0,Fort Wayne,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,food,Food,Steakhouse,,
1,4b26e239f964a5207c8224e3,Agaves Mexican Grill,41.132293,-85.138746,Coldwater Road,,1,Fort Wayne,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,food,Food,Mexican,,
2,4b86ebf4f964a5200ba631e3,Papa John's Pizza,41.130839,-85.135060,,2413959,2,Fort Wayne,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,food,Food,Pizza,,
3,4b5f57bff964a5201db529e3,Cork 'n Cleaver,41.132858,-85.138249,Coldwater Rd,,3,Fort Wayne,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,food,Food,Steakhouse,,
4,4bfec4844e5d0f47a7207d1f,Wendy’s,41.130943,-85.136557,,1682119,4,Fort Wayne,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,food,Food,Fast Food,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1518,4b86e581f964a52089a431e3,Bruno's Pizza,41.674179,-86.004036,,,45,Elkhart,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,topPicks,Food,Pizza,,
1519,4c955265f244b1f70cd72a1d,Lucy's vedi twist,41.627403,-85.424383,,,46,Lagrange,IN,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,topPicks,Food,Desserts,Ice Cream,
1520,4d0a57e6e023224b45825fd2,Yoder's Country Market,41.920600,-85.511006,,,47,Centreville,MI,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,topPicks,Food,American,,
1521,4cbb1d4fa33bb1f7d04992fd,Orchard Tree,40.555072,-84.551998,,,48,Celina,OH,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,topPicks,Food,American,,


Finally, let's put these results in some tables.
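
The layout used here keeps one row per search hit in one table and one row per venue in another, so venue details are not repeated for every search that returns them. A minimal sketch of that split, assuming an in-memory SQLite database as a stand-in for the real one and hypothetical placeholder rows:

```python
import sqlite3

# Hypothetical rows: the same venue can appear in several searches.
search_rows = [
    ("venue-a", "Fort Wayne, IN", 1000, "food", 0),
    ("venue-b", "Fort Wayne, IN", 1000, "food", 1),
    ("venue-a", "Fort Wayne, IN", 5000, "topPicks", 3),
]
venue_rows = {
    "venue-a": ("venue-a", "Example Steakhouse"),
    "venue-b": ("venue-b", "Example Pizzeria"),
}  # a dict keyed by venue_id keeps one row per venue

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real database
conn.execute(
    "CREATE TABLE venue_searches (id INTEGER PRIMARY KEY, venue_id TEXT NOT NULL, "
    "city TEXT NOT NULL, radius INTEGER NOT NULL, section TEXT NOT NULL, "
    "search_popularity INTEGER NOT NULL)"
)
conn.execute("CREATE TABLE venue_data (venue_id TEXT PRIMARY KEY, venue_name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO venue_searches (venue_id, city, radius, section, search_popularity) "
    "VALUES (?, ?, ?, ?, ?)", search_rows)
conn.executemany("INSERT INTO venue_data VALUES (?, ?)", venue_rows.values())

# Join the two tables to count how many searches each venue showed up in.
appearances = conn.execute(
    "SELECT d.venue_name, COUNT(*) FROM venue_searches s "
    "JOIN venue_data d ON d.venue_id = s.venue_id "
    "GROUP BY d.venue_id ORDER BY d.venue_name"
).fetchall()
print(appearances)
```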

In [18]:
meta = MetaData()

search_results_df = places_df[['venue_id', 'city', 'radius', 'section', 'search_popularity']]
venue_data_df = places_df[[
    'venue_id', 'venue_name', 'venue_location_lat', 'venue_location_lng', 'venue_location_crossStreet',
    'venue_location_city', 'venue_location_state',
    'venue_delivery_id', 'cat_level_1', 'cat_level_2', 'cat_level_3', 'cat_level_4'
]].drop_duplicates('venue_id')

with engine.begin() as conn:
    venue_searches = Table(
        "venue_searches", meta,
        Column('id', Integer, primary_key=True, comment='Venue search ID'),
        Column('venue_id', String(24), nullable=False, comment='Foursquare Venue ID'),
        Column('city', String(128), nullable=False, comment='Search City'),
        Column('radius', Integer, nullable=False, comment='Radius in meters'),
        Column('section', String(20), nullable=False, comment='Search section'),
        Column('search_popularity', Integer, nullable=False, comment='Popularity in search results')
    )

    venue_data = Table(
        'venue_data', meta,
        Column('venue_id', String(24), primary_key=True, comment='Foursquare Venue ID'),
        Column('venue_name', String(255), nullable=False, comment='Venue name'),
        Column('venue_location_lat', Numeric(10, 6), nullable=False, comment='Venue Location Latitude'),
        Column('venue_location_lng', Numeric(10, 6), nullable=False, comment='Venue Location Longitude'),
        Column('venue_location_crossStreet', String(255), comment='Street Intersection of Venue Location'),
        Column('venue_delivery_id', String(40), comment='Venue Delivery Identifier'),
        Column('venue_location_city', String(100), comment='City Name of Venue Location'),
        Column('venue_location_state', String(25), comment='State Code of Venue Location'),
        Column('cat_level_1', String(50), comment='Level 1 Category Name'),
        Column('cat_level_2', String(50), comment='Level 2 Category Name'),
        Column('cat_level_3', String(50), comment='Level 3 Category Name'),
        Column('cat_level_4', String(50), comment='Level 4 Category Name')
    )

    meta.drop_all(conn) # During development
    meta.create_all(conn)

    search_results_df.to_sql('venue_searches', conn, if_exists='append', index=False)
    venue_data_df.to_sql('venue_data', conn, if_exists='append', index=False)

2020-12-07 19:43:53,994 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2020-12-07 19:43:53,997 INFO sqlalchemy.engine.base.Engine DESCRIBE `venue_searches`
2020-12-07 19:43:53,998 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,011 INFO sqlalchemy.engine.base.Engine DESCRIBE `venue_data`
2020-12-07 19:43:54,012 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,018 INFO sqlalchemy.engine.base.Engine 
DROP TABLE venue_data
2020-12-07 19:43:54,020 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,042 INFO sqlalchemy.engine.base.Engine 
DROP TABLE venue_searches
2020-12-07 19:43:54,042 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,071 INFO sqlalchemy.engine.base.Engine DESCRIBE `venue_searches`
2020-12-07 19:43:54,072 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,076 INFO sqlalchemy.engine.base.Engine DESCRIBE `venue_data`
2020-12-07 19:43:54,077 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:54,086 INFO sqlalchemy.engine.base.

In [19]:
pd.read_sql(venue_searches.select(), engine, index_col='id')

2020-12-07 19:43:54,519 INFO sqlalchemy.engine.base.OptionEngine SELECT venue_searches.id, venue_searches.venue_id, venue_searches.city, venue_searches.radius, venue_searches.section, venue_searches.search_popularity 
FROM venue_searches
2020-12-07 19:43:54,520 INFO sqlalchemy.engine.base.OptionEngine {}


id,venue_id,city,radius,section,search_popularity
1,4b5a3e80f964a5201ab728e3,"Fort Wayne, IN",1000,food,0
2,4b26e239f964a5207c8224e3,"Fort Wayne, IN",1000,food,1
3,4b86ebf4f964a5200ba631e3,"Fort Wayne, IN",1000,food,2
4,4b5f57bff964a5201db529e3,"Fort Wayne, IN",1000,food,3
5,4bfec4844e5d0f47a7207d1f,"Fort Wayne, IN",1000,food,4
...,...,...,...,...,...
1519,4b86e581f964a52089a431e3,"Fort Wayne, IN",100000,topPicks,45
1520,4c955265f244b1f70cd72a1d,"Fort Wayne, IN",100000,topPicks,46
1521,4d0a57e6e023224b45825fd2,"Fort Wayne, IN",100000,topPicks,47
1522,4cbb1d4fa33bb1f7d04992fd,"Fort Wayne, IN",100000,topPicks,48


In [20]:
pd.read_sql(venue_data.select(), engine, index_col='venue_id')

2020-12-07 19:43:54,614 INFO sqlalchemy.engine.base.OptionEngine SELECT venue_data.venue_id, venue_data.venue_name, venue_data.venue_location_lat, venue_data.venue_location_lng, venue_data.`venue_location_crossStreet`, venue_data.venue_delivery_id, venue_data.venue_location_city, venue_data.venue_location_state, venue_data.cat_level_1, venue_data.cat_level_2, venue_data.cat_level_3, venue_data.cat_level_4 
FROM venue_data
2020-12-07 19:43:54,615 INFO sqlalchemy.engine.base.OptionEngine {}


venue_id,venue_name,venue_location_lat,venue_location_lng,venue_location_crossStreet,venue_delivery_id,venue_location_city,venue_location_state,cat_level_1,cat_level_2,cat_level_3,cat_level_4
4a8731bef964a5202d0320e3,Lake James,41.697409,-85.031737,,,Angola,IN,Outdoors & Recreation,Lake,,
4b0618b3f964a52098e822e3,Regal Coldwater Crossing,41.131203,-85.142060,,,Fort Wayne,IN,Arts & Entertainment,Movie Theater,,
4b12ed20f964a520ff9023e3,Starbucks,41.075410,-85.145640,,1507234,Fort Wayne,IN,Food,Coffee Shop,,
4b130366f964a520a89223e3,Mad Anthony Brewing Company,41.067643,-85.152640,at Taylor,2274630,Fort Wayne,IN,Nightlife,Brewery,,
4b1304aaf964a520c29223e3,JK O'Donnell's Irish Pub,41.078097,-85.140302,btw Harrison & Calhoun,,Fort Wayne,IN,Food,Irish,,
...,...,...,...,...,...,...,...,...,...,...,...
5c6b13485455b2002c90eaed,The Backyard,41.700681,-85.001763,,,Angola,IN,Arts & Entertainment,Mini Golf,,
5c9530f98c35dc002ce78390,Walmart Grocery Pickup and Delivery,41.128227,-85.137894,,,Fort Wayne,IN,Shops,Food & Drink,Grocery Store,
5c95310401bc5a002cc1d2ba,Walmart Grocery Pickup & Delivery,41.687057,-86.058655,,,Elkhart,IN,Shops,Food & Drink,Grocery Store,
5d4df4ed78bb0e0007c07d84,Promenade Park,41.084171,-85.143061,,,Fort Wayne,IN,Outdoors & Recreation,Park,,


---

## Favorite Venue Types

The other thing I want to look for in cities is places that we know we enjoy. I will still use the `explore` endpoint, but with the `query` parameter.
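
A small sketch of how the request parameters differ between a section search and a query search. `explore_params` is a hypothetical helper, the credentials are placeholders, and omitting the unused filter is my own variation: the notebook's `query_places` sends it as an empty string instead.

```python
from itertools import product

radii = [1000, 5000, 25000, 100000]
favorite_venue_types = ['hiking trail', 'bbq', 'historic sites', 'park', 'dog park',
                        'british food', 'irish food', 'arcade', 'pizzeria', 'ice cream shop']


def explore_params(near, radius, section="", query=""):
    """Build venues/explore parameters, leaving out empty optional filters."""
    params = {
        "client_id": "YOUR_CLIENT_ID",          # placeholder credential
        "client_secret": "YOUR_CLIENT_SECRET",  # placeholder credential
        "v": "20201108",
        "near": near,
        "radius": radius,
        "limit": 50,
        "sortByPopularity": 1,
    }
    if section:
        params["section"] = section
    if query:
        params["query"] = query
    return params


requests_to_make = [explore_params("Fort Wayne, IN", r, query=q)
                    for r, q in product(radii, favorite_venue_types)]
print(len(requests_to_make), requests_to_make[0]["query"])
```

With four radii and ten venue types, this comes to 40 requests per city.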

In [21]:
favorite_venue_types = [
    'hiking trail',
    'bbq',
    'historic sites',
    'park',
    'dog park',
    'british food',
    'irish food',
    'arcade',
    'pizzeria',
    'ice cream shop'
]

In [22]:
async def get_favorite_sites(city):
    """
    Get popular spots of my favorite types in the city `city`
    """
    async with aiohttp.ClientSession() as session:
        tasks = []
        for r, q in product(radii, favorite_venue_types):
            task = query_places(session, city, r, query=q)
            tasks.append(task)
        results = await asyncio.gather(*tasks)
    return pd.concat(results, ignore_index=True)

In [23]:
executor = ProcessPoolExecutor()
favorite_sites = await get_favorite_sites('Fort Wayne, IN')
favorite_sites

Unnamed: 0,referralId,reasons_count,reasons_items,venue_id,venue_name,venue_location_lat,venue_location_lng,venue_location_labeledLatLngs,venue_location_postalCode,venue_location_cc,...,city,radius,query,venue_venuePage_id,venue_delivery_id,venue_delivery_url,venue_delivery_provider_name,venue_delivery_provider_icon_prefix,venue_delivery_provider_icon_sizes,venue_delivery_provider_icon_name
0,e-0-4dea1f20b0fb8293f7cb31de-0,0.0,"[{'summary': 'This spot is popular', 'type': '...",4dea1f20b0fb8293f7cb31de,Pufferbelly Trail,41.180053,-85.155595,"[{'label': 'display', 'lat': 41.18005280328018...",46825,US,...,"Fort Wayne, IN",1000,hiking trail,,,,,,,
1,e-0-4c780866a868370430df0b4d-1,0.0,"[{'summary': 'This spot is popular', 'type': '...",4c780866a868370430df0b4d,Parkview Outdoor Trail Head,41.173982,-85.148131,"[{'label': 'display', 'lat': 41.17398188943014...",46825,US,...,"Fort Wayne, IN",1000,hiking trail,,,,,,,
2,e-0-53a98b9d498e0aca74c23326-2,0.0,"[{'summary': 'This spot is popular', 'type': '...",53a98b9d498e0aca74c23326,RiverGreen Way Trail Head,41.087400,-85.048495,"[{'label': 'display', 'lat': 41.0874, 'lng': -...",,US,...,"Fort Wayne, IN",1000,hiking trail,,,,,,,
3,e-0-50561db8e4b017bb26f16c5b-3,0.0,"[{'summary': 'This spot is popular', 'type': '...",50561db8e4b017bb26f16c5b,Safari Trail,41.105984,-85.151028,"[{'label': 'display', 'lat': 41.10598377025788...",46805,US,...,"Fort Wayne, IN",1000,hiking trail,,,,,,,
4,e-0-4d12510e80f6721eb7bf16eb-4,0.0,"[{'summary': 'This spot is popular', 'type': '...",4d12510e80f6721eb7bf16eb,Towpath Trail (Smith Road Trailhead),41.050813,-85.210342,"[{'label': 'display', 'lat': 41.05081337287187...",,US,...,"Fort Wayne, IN",1000,hiking trail,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1124,e-0-4c6346e8ec94a5937e882bca-45,0.0,"[{'summary': 'This spot is popular', 'type': '...",4c6346e8ec94a5937e882bca,Dairy Queen,41.946599,-84.874081,"[{'label': 'display', 'lat': 41.94659899648241...",49082,US,...,"Fort Wayne, IN",100000,ice cream shop,,,,,,,
1125,e-0-4ba2cfa0f964a520191b38e3-46,0.0,"[{'summary': 'This spot is popular', 'type': '...",4ba2cfa0f964a520191b38e3,Eric's All-American Ice Cream Factory,41.283002,-84.330147,"[{'label': 'display', 'lat': 41.28300205877007...",43512,US,...,"Fort Wayne, IN",100000,ice cream shop,,,,,,,
1126,e-0-4ba02e2df964a520dd5f37e3-47,0.0,"[{'summary': 'This spot is popular', 'type': '...",4ba02e2df964a520dd5f37e3,Payne's Restaurant,40.481977,-85.546167,"[{'label': 'display', 'lat': 40.48197658404458...",46933,US,...,"Fort Wayne, IN",100000,ice cream shop,77441402,,,,,,
1127,e-0-4c040043f56c2d7f55a31d66-48,0.0,"[{'summary': 'This spot is popular', 'type': '...",4c040043f56c2d7f55a31d66,Ritter's Frozen Custard,41.227833,-85.796091,"[{'label': 'display', 'lat': 41.22783310777793...",46580,US,...,"Fort Wayne, IN",100000,ice cream shop,,,,,,,


In [24]:
# get_categories() is unchanged from cell In [16] and is reused here

favorite_sites = favorite_sites.merge(
    favorite_sites.apply(get_categories, axis=1), 
    left_index=True,
    right_index=True
)


In [25]:
fave_columns = [
    'venue_id',
    'venue_name',
    'venue_location_lat',
    'venue_location_lng',
    'venue_location_crossStreet',
    'venue_delivery_id',
    'venue_location_city',
    'venue_location_state',
    'search_popularity',
    'geo_where',
    'geo_slug',
    'geo_longId',
    'geo_center_lat',
    'geo_center_lng',
    'city',
    'radius',
    'query',
    'cat_level_1',
    'cat_level_2',
    'cat_level_3',
    'cat_level_4'
]

favorite_sites[fave_columns]

Unnamed: 0,venue_id,venue_name,venue_location_lat,venue_location_lng,venue_location_crossStreet,venue_delivery_id,venue_location_city,venue_location_state,search_popularity,geo_where,...,geo_longId,geo_center_lat,geo_center_lng,city,radius,query,cat_level_1,cat_level_2,cat_level_3,cat_level_4
0,4dea1f20b0fb8293f7cb31de,Pufferbelly Trail,41.180053,-85.155595,,,Fort Wayne,IN,0,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,hiking trail,Outdoors & Recreation,Trail,,
1,4c780866a868370430df0b4d,Parkview Outdoor Trail Head,41.173982,-85.148131,,,Fort Wayne,IN,1,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,hiking trail,Outdoors & Recreation,Trail,,
2,53a98b9d498e0aca74c23326,RiverGreen Way Trail Head,41.087400,-85.048495,,,Fort Wayne,IN,2,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,hiking trail,Outdoors & Recreation,Trail,,
3,50561db8e4b017bb26f16c5b,Safari Trail,41.105984,-85.151028,,,Fort Wayne,IN,3,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,hiking trail,Outdoors & Recreation,Trail,,
4,4d12510e80f6721eb7bf16eb,Towpath Trail (Smith Road Trailhead),41.050813,-85.210342,,,,Indiana,4,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",1000,hiking trail,Outdoors & Recreation,Trail,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1124,4c6346e8ec94a5937e882bca,Dairy Queen,41.946599,-84.874081,at Brown St.,,Quincy,MI,45,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,ice cream shop,Food,Desserts,Ice Cream,
1125,4ba2cfa0f964a520191b38e3,Eric's All-American Ice Cream Factory,41.283002,-84.330147,btw Greenhouse & Degler,,Defiance,OH,46,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,ice cream shop,Food,Desserts,Ice Cream,
1126,4ba02e2df964a520dd5f37e3,Payne's Restaurant,40.481977,-85.546167,I-69 Exit 59,,Gas City,IN,47,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,ice cream shop,Food,Desserts,Ice Cream,
1127,4c040043f56c2d7f55a31d66,Ritter's Frozen Custard,41.227833,-85.796091,,,Warsaw,IN,48,fort wayne in,...,72057594042848359,41.1306,-85.12886,"Fort Wayne, IN",100000,ice cream shop,Food,Desserts,Ice Cream,


In [26]:
meta = MetaData()

search_results_df = favorite_sites[['venue_id', 'city', 'radius', 'query', 'search_popularity']]
venue_data_df = favorite_sites[[
    'venue_id', 'venue_name', 'venue_location_lat', 'venue_location_lng', 'venue_location_crossStreet',
    'venue_location_city', 'venue_location_state',
    'venue_delivery_id', 'cat_level_1', 'cat_level_2', 'cat_level_3', 'cat_level_4'
]].drop_duplicates('venue_id')



with engine.begin() as conn:
    
    # Insert only venues we don't already have
    current_venues = pd.read_sql_table('venue_data', conn)
    venues_to_insert = venue_data_df[~venue_data_df.venue_id.isin(current_venues.venue_id)]
    
    venue_favorites = Table(
        "venue_favorites", meta,
        Column('id', Integer, primary_key=True, comment='Venue search ID'),
        Column('venue_id', String(24), nullable=False, comment='Foursquare Venue ID'),
        Column('city', String(128), nullable=False, comment='Search City'),
        Column('radius', Integer, nullable=False, comment='Radius in meters'),
        Column('query', String(20), nullable=False, comment='Search section'),
        Column('search_popularity', Integer, nullable=False, comment='Popularity in search results')
    )

    meta.drop_all(conn) # During development
    meta.create_all(conn)

    search_results_df.to_sql('venue_favorites', conn, if_exists='append', index=False)
    venues_to_insert.to_sql('venue_data', conn, if_exists='append', index=False)

2020-12-07 19:43:56,397 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2020-12-07 19:43:56,400 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `coursera_dev`
2020-12-07 19:43:56,402 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:56,410 INFO sqlalchemy.engine.base.Engine SHOW FULL TABLES FROM `coursera_dev`
2020-12-07 19:43:56,411 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:56,433 INFO sqlalchemy.engine.base.Engine SHOW CREATE TABLE `venue_data`
2020-12-07 19:43:56,434 INFO sqlalchemy.engine.base.Engine {}
2020-12-07 19:43:56,444 INFO sqlalchemy.engine.base.Engine SELECT venue_data.venue_id, venue_data.venue_name, venue_data.venue_location_lat, venue_data.venue_location_lng, venue_data.`venue_location_crossStreet`, venue_data.venue_delivery_id, venue_data.venue_location_city, venue_data.venue_location_state, venue_data.cat_level_1, venue_data.cat_level_2, venue_data.cat_level_3, venue_data.cat_level_4 
FROM venue_data
2020-12-07 19:43:56,445 INFO sqlalch

---

## More Wiki Data

Let's see what else I can get from Wikipedia. I'd like to get surrounding cities.

In [63]:
population_sq_mi = float(wiki_data.find(".//part[name='population_density_sq_mi']/value").text)

In [63]:


def get_weather_data(wiki_data):
    weather_box = wiki_data.find(".//template[title='Weather box']")
    if weather_box is None:
        return None
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'year']
    stat_names = {
        'record high F': 'record_high',
        'avg record high F': 'avg_record_high',
        'high F': 'avg_high',
        'low F': 'avg_low',
        'avg record low F': 'avg_record_low',
        'record low F': 'record_low',
        'precipitation inch': 'avg_precip',
        'snow inch': 'avg_snow',
        'precipitation days': 'precip_days',
        'snow days': 'snow_days',
        'sun': 'sunshine_hours',
        'percentsun': 'daily_sunshine'
    }
    
    series_list = []
    for month in months:
        data = {}
        for stat in stat_names.keys():
            elem = weather_box.find(f".//part[name='{month} {stat}']/value")
            if elem is None:
                val = np.nan
            else:
                # Wikipedia writes negatives with the Unicode minus sign (U+2212), which float() cannot parse
                val = float(elem.text.replace('−', '-'))
            data[stat_names[stat]] = val
        series_list.append(
            pd.Series(data=data.values(), index=data.keys(), name=month)
        )
    return pd.DataFrame(series_list)
            

weather_data = get_weather_data(wiki_data)
weather_data['metro'] = 'Fort Wayne, IN'
weather_data

Unnamed: 0,record_high,avg_record_high,avg_high,avg_low,avg_record_low,record_low,avg_precip,avg_snow,precip_days,snow_days,sunshine_hours,daily_sunshine,metro
Jan,69.0,53.5,32.4,17.4,-4.5,-24.0,2.26,10.1,12.6,9.5,148.5,50.0,"Fort Wayne, IN"
Feb,73.0,56.9,36.3,20.3,0.5,-19.0,2.04,7.7,10.1,6.9,158.5,53.0,"Fort Wayne, IN"
Mar,87.0,72.5,48.0,28.7,10.9,-10.0,2.71,4.1,12.2,4.1,206.3,56.0,"Fort Wayne, IN"
Apr,90.0,81.0,61.1,38.9,23.4,7.0,3.52,1.0,12.9,1.0,251.4,63.0,"Fort Wayne, IN"
May,97.0,86.6,71.7,49.2,35.4,27.0,4.27,0.0,13.0,0.0,311.9,69.0,"Fort Wayne, IN"
Jun,106.0,93.0,80.9,59.3,46.2,36.0,4.16,0.0,10.9,0.0,340.0,75.0,"Fort Wayne, IN"
Jul,106.0,93.4,84.4,62.7,51.6,38.0,4.24,0.0,9.8,0.0,347.0,76.0,"Fort Wayne, IN"
Aug,102.0,91.7,82.2,60.8,49.3,38.0,3.64,0.0,9.4,0.0,318.2,75.0,"Fort Wayne, IN"
Sep,100.0,88.9,76.0,52.6,38.2,29.0,2.8,0.0,9.1,0.0,258.1,69.0,"Fort Wayne, IN"
Oct,91.0,81.0,63.4,41.8,27.8,19.0,2.84,0.3,9.7,0.2,207.6,60.0,"Fort Wayne, IN"
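
To make the lookup inside `get_weather_data` easier to follow, here is a minimal offline sketch of reading a single statistic. The `sample` XML is a hand-written stand-in for the real parse tree (its shape is an assumption, trimmed to two parts), and `read_stat` is a hypothetical helper that is not part of the notebook. The two values are the January figures from the table above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a 'Weather box' template; the real parse tree is much larger.
sample = (
    "<template><title>Weather box</title>"
    "<part><name>Jan record low F</name><value>\u221224</value></part>"
    "<part><name>Jan high F</name><value>32.4</value></part>"
    "</template>"
)
box = ET.fromstring(sample)


def read_stat(box, month, stat):
    """Return one 'month stat' value as a float, or NaN when the part is absent."""
    elem = box.find(f"./part[name='{month} {stat}']/value")
    if elem is None or elem.text is None:
        return float("nan")
    # U+2212 (minus sign) has to become an ASCII hyphen before float() will accept it
    return float(elem.text.strip().replace("\u2212", "-"))


print(read_stat(box, "Jan", "record low F"))  # -24.0
print(read_stat(box, "Jan", "high F"))        # 32.4
print(read_stat(box, "Jan", "snow inch"))     # nan (no such part in the sample)
```

Returning NaN for a missing part mirrors what the function above does, so cities whose weather box lacks a statistic still produce a full row.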


In [87]:
def get_population_data(wiki_data, city_name):
    census = wiki_data.find(".//template[title='US Census population']")
    if census is None:
        return
    
    data = []
    index = []
    
    for part in census.findall("part"):
        year = part.find("name").text
        if year.isnumeric() and len(year) == 4:
            data.append(int(part.find("value").text))
            index.append(year)
    
    return pd.Series(data, pd.MultiIndex.from_arrays([[city_name] * len(index), index], names=['city', 'year']), name='census_population')
        
get_population_data(wiki_data, 'Fort Wayne, IN')

Fort Wayne, IN  1850      4282
                1860     10388
                1870     17718
                1880     26880
                1890     35393
                1900     45115
                1910     63933
                1920     86549
                1930    114946
                1940    118410
                1950    133607
                1960    161776
                1970    178269
                1980    172196
                1990    173072
                2000    205727
                2010    253691
Name: census_data, dtype: int64

I also realized I had missed the fact that the Infobox Settlement _does_ in fact have the latitude and longitude of a city. So I can go ahead and grab that.

In [114]:
def get_settlement_coordinates(wiki_data):
    coords = wiki_data.findall(".//part[name='coordinates']/value/template[title='coord']/part/value")
    # Convert from DMS (degrees, minutes, seconds) to Decimal
    lat_deg, lat_min, lat_sec, lat_pole, lng_deg, lng_min, lng_sec, lng_pole = [x.text for x in coords[:8]]
    lat_sign = 1 if lat_pole == 'N' else -1
    lng_sign = 1 if lng_pole == 'E' else -1
    latitude = sum([float(lat_deg), float(lat_min)/60.0, float(lat_sec)/3600.0]) * lat_sign
    longitude = sum([float(lng_deg), float(lng_min)/60.0, float(lng_sec)/3600.0]) * lng_sign
    return latitude, longitude

get_settlement_coordinates(wiki_data)

(41.080555555555556, -85.13916666666667)
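
As a worked check of the DMS arithmetic, the conversion can be pulled out into a standalone helper. This is only a sketch: `dms_to_decimal` is a hypothetical name, and the inputs 41°04′50″N, 85°08′21″W are my assumption, chosen because they reproduce the output above.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"Unknown hemisphere: {hemisphere!r}")
    # One degree is 60 minutes and 3600 seconds
    magnitude = float(degrees) + float(minutes) / 60.0 + float(seconds) / 3600.0
    # South and west are negative by convention
    sign = -1.0 if hemisphere in ("S", "W") else 1.0
    return sign * magnitude


lat = dms_to_decimal(41, 4, 50, "N")
lng = dms_to_decimal(85, 8, 21, "W")
print(round(lat, 6), round(lng, 6))  # 41.080556 -85.139167
```

Validating the hemisphere letter explicitly avoids silently treating an unexpected value as south or west.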

There's one more thing to grab. I know that Fort Wayne is a metropolitan area with lots of cities surrounding it; however, I don't know for sure that every city I'm going to input is going to be the metro area. I also want to grab surrounding cities and get information on them so that I can perhaps find a nearby city that's not as populated or has a different character.

I've found that the best place to find this on Wikipedia is in their nav boxes at the bottom of pages. There is a box that gives you the county you are looking at and the other cities and communities within it. It also gives you the "county seat". I'll use this data to help me branch out my queries and, instead of getting single cities of job locations, look more at what different areas nearby have to offer.

Here's the hard part. For Fort Wayne, the XML in `wiki_data` that puts this box on the page is as follows:

```xml
<template><title>Allen County, Indiana</title></template>
```

It's not much. There's not even really enough here to tell me definitively that this gives me the county info.

So I think I'm going to have to iterate through all of the templates in `Navboxes` and get the one that matches what I'm looking for.

In [121]:
navboxes = wiki_data.findall(".//template[title='Navboxes']/part[name='list']/value/template/title")
templates_to_search = ['Template:{}'.format(elem.text) for elem in navboxes]
templates_to_search

['Template:Fort Wayne, Indiana',
 'Template:Allen County, Indiana',
 'Template:Fort Wayne Metro',
 'Template:Indiana cities and mayors of 100,000 population',
 'Template:All-American City Award Hall of Fame',
 'Template:County Seats of Indiana',
 'Template:Indiana',
 'Template:Midwestern United States']

Now let's grab the data about the template from the API that shows what I'm looking for.

In [145]:
wikipedia_url = 'https://en.wikipedia.org/w/api.php'
params = {
    "action": "parse",
    "format": "json",
    "redirects": "1",
    "page": "Template:Allen County, Indiana",
    "prop": "parsetree"
}

raw_response = requests.get(wikipedia_url, params=params, headers=header).json()['parse']['parsetree']['*']
with open('data/allen_count.xml', 'w') as f:
    f.write(raw_response)
response = xml.canonicalize(raw_response, strip_text=True)
template_data = xml.fromstring(response)
template_data

<Element 'root' at 0x7f4a99f90630>

In [146]:
root = template_data.find(".//template[title='US county navigation box']")
# From here, I could continue a loop if root is None
seat = root.find(".//part[name='seat']/value").text
seat

'Fort Wayne'
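
The loop hinted at in the comment above could look roughly like the sketch below. `find_county_navbox` and `fetch_parsetree` are hypothetical names: the latter stands for any callable that returns a template's parse tree as an XML string (for example, a wrapper around the `requests.get` call above). Here it is replaced with a small dictionary of hand-written XML so that the example runs offline.

```python
import xml.etree.ElementTree as ET


def find_county_navbox(template_titles, fetch_parsetree):
    """Return (title, element) for the first template that wraps a
    'US county navigation box', or None when no template does."""
    for title in template_titles:
        tree = ET.fromstring(fetch_parsetree(title))
        box = tree.find(".//template[title='US county navigation box']")
        if box is not None:
            return title, box
    return None


# Hand-written stand-ins for two parse trees (an assumption about their shape)
fake_pages = {
    "Template:Fort Wayne, Indiana":
        "<root><template><title>Navbox</title></template></root>",
    "Template:Allen County, Indiana":
        "<root><template><title>US county navigation box</title>"
        "<part><name>seat</name><value>Fort Wayne</value></part>"
        "</template></root>",
}

title, box = find_county_navbox(list(fake_pages), fake_pages.__getitem__)
print(title)                                         # Template:Allen County, Indiana
print(box.find(".//part[name='seat']/value").text)   # Fort Wayne
```

Passing the fetcher in as an argument keeps the search logic testable without hitting the Wikipedia API on every run.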

Hm...the `seat` value doesn't give the state.

I may need to store city names based on how Wikipedia resolves them. This is easy using the "query" action, though--if I [query](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&titles=Fort%20Wayne%7CFort%20Wayne%2C%20IN%7CFt.%20Wayne%2C%20Indiana&redirects=1) with `redirects` on, it gives me a full list of the page names that were redirected to something else, as well as the pages themselves (if they exist).

Example response:

```json
{
    "batchcomplete": "",
    "query": {
        "normalized": [
            {
                "from": "ftwnindnnnnn",
                "to": "Ftwnindnnnnn"
            }
        ],
        "redirects": [
            {
                "from": "Ft. Wayne, Indiana",
                "to": "Fort Wayne, Indiana"
            },
            {
                "from": "Fort Wayne, IN",
                "to": "Fort Wayne, Indiana"
            },
            {
                "from": "Fort Wayne",
                "to": "Fort Wayne, Indiana"
            }
        ],
        "pages": {
            "-1": {
                "ns": 0,
                "title": "Ftwnindnnnnn",
                "missing": ""
            },
            "11232": {
                "pageid": 11232,
                "ns": 0,
                "title": "Fort Wayne, Indiana"
            }
        }
    }
}
```

Let's write a function real quick to get the normalized name of a city according to Wikipedia.

In [147]:
def normalize_city_names(city_names):
    normalization = {}
    city_names = pd.unique(city_names)
    wikipedia_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        "action": "query",
        "format": "json",
        "redirects": 1,
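        "titles": "|".join(city_names)  # The API separates multiple titles with a pipe, not a comma
    }
    response = requests.get(wikipedia_url, params=params, headers=header).json()['query']
    # 'redirects' is left out of the response when none of the titles were redirected
    for redirect in response.get('redirects', []):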
        normalization[redirect['from']] = redirect['to']
    for page_id in response['pages'].keys():
        page = response['pages'][page_id]
        if "missing" in page.keys():
            raise ValueError("City name {} is not on Wikipedia".format(page['title']))
        if page['title'] in city_names:
            normalization[page['title']] = page['title']
    return normalization

normalize_city_names(['Fort Wayne'])['Fort Wayne'] # normalize_city_names returns a dict, but I'm only checking one name right now

'Fort Wayne, Indiana'
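
The example response also contains a `normalized` list, which the function above does not look at. As a rough sketch of how the three parts of the response fit together, the hypothetical helper `resolve_titles` below follows each requested name through `normalized`, then `redirects`, and marks missing pages as `None`. It works on the `query` object from the example response, so it runs without a network call.

```python
def resolve_titles(query, requested):
    """Map each requested name to its final Wikipedia title, or None if the page is missing."""
    normalized = {n["from"]: n["to"] for n in query.get("normalized", [])}
    redirects = {r["from"]: r["to"] for r in query.get("redirects", [])}
    missing = {p["title"] for p in query.get("pages", {}).values() if "missing" in p}

    resolved = {}
    for name in requested:
        title = normalized.get(name, name)   # e.g. capitalization fixes
        title = redirects.get(title, title)  # follow the redirect, if any
        resolved[name] = None if title in missing else title
    return resolved


# The "query" object from the example response above
example_query = {
    "normalized": [{"from": "ftwnindnnnnn", "to": "Ftwnindnnnnn"}],
    "redirects": [
        {"from": "Ft. Wayne, Indiana", "to": "Fort Wayne, Indiana"},
        {"from": "Fort Wayne, IN", "to": "Fort Wayne, Indiana"},
        {"from": "Fort Wayne", "to": "Fort Wayne, Indiana"},
    ],
    "pages": {
        "-1": {"ns": 0, "title": "Ftwnindnnnnn", "missing": ""},
        "11232": {"pageid": 11232, "ns": 0, "title": "Fort Wayne, Indiana"},
    },
}

names = ["ftwnindnnnnn", "Ft. Wayne, Indiana", "Fort Wayne, IN", "Fort Wayne"]
result = resolve_titles(example_query, names)
print(result["Fort Wayne"])    # Fort Wayne, Indiana
print(result["ftwnindnnnnn"])  # None
```

Returning `None` instead of raising is a design choice for this sketch only; it lets one bad name through without discarding the rest of the batch.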

Cool, now back to our regularly scheduled wiki parsing.

Taking a look back at the county file, the sections are not well defined in the template. They are really just called things like `title1` and `body1`, `title2` and `body2`, and so forth. However, I can use [regular expressions](https://docs.python.org/3/library/re.html) to just grab all of the cities mentioned.

In [152]:
listed_city = re.compile(r'^\*\ *\[\[([^\|]+)\|([^\|]+)\]\](‡)?$', re.MULTILINE)
listed_city.findall(raw_response)

[('Fort Wayne, Indiana', 'Fort Wayne', ''),
 ('New Haven, Indiana', 'New Haven', ''),
 ('Woodburn, Indiana', 'Woodburn', ''),
 ('Grabill, Indiana', 'Grabill', ''),
 ('Huntertown, Indiana', 'Huntertown', ''),
 ('Leo-Cedarville, Indiana', 'Leo-Cedarville', ''),
 ('Monroeville, Indiana', 'Monroeville', ''),
 ('Zanesville, Indiana', 'Zanesville', '‡'),
 ('Aboite Township, Allen County, Indiana', 'Aboite', ''),
 ('Adams Township, Allen County, Indiana', 'Adams', ''),
 ('Cedar Creek Township, Allen County, Indiana', 'Cedar Creek', ''),
 ('Eel River Township, Allen County, Indiana', 'Eel River', ''),
 ('Jackson Township, Allen County, Indiana', 'Jackson', ''),
 ('Jefferson Township, Allen County, Indiana', 'Jefferson', ''),
 ('Lafayette Township, Allen County, Indiana', 'Lafayette', ''),
 ('Lake Township, Allen County, Indiana', 'Lake', ''),
 ('Madison Township, Allen County, Indiana', 'Madison', ''),
 ('Marion Township, Allen County, Indiana', 'Marion', ''),
 ('Maumee Township, Allen County,