# Clustering and Scoring Job Relocation Opportunities - ETL Scripts

Austin Rainwater

---

For this notebook, I have created some libraries to assist in building an Extract/Transform/Load (ETL) data pipeline. The pipeline uses asyncio so that requests to several endpoints can be in flight at the same time. It also lets the extract, transform, and load steps run concurrently, rather than running each step to completion and waiting for its results before starting the next.
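
As a rough illustration of why this matters, the sketch below compares awaiting three simulated requests one at a time with running them together through `asyncio.gather`. The `fake_request` coroutine, the job names, and the delays are stand-ins invented for this example; none of them come from the `etl` library.

```python
import asyncio
import time

async def fake_request(name, delay):
    """Stand-in for a network call: sleeps, then returns a label (hypothetical)."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_sequentially(jobs):
    # Each request finishes before the next one starts
    return [await fake_request(name, delay) for name, delay in jobs]

async def run_concurrently(jobs):
    # All requests are in flight at once; gather keeps the input order
    return await asyncio.gather(*(fake_request(name, delay) for name, delay in jobs))

async def compare():
    jobs = [("wikipedia", 0.05), ("foursquare", 0.05), ("database", 0.05)]
    start = time.perf_counter()
    sequential = await run_sequentially(jobs)
    sequential_time = time.perf_counter() - start
    start = time.perf_counter()
    concurrent = await run_concurrently(jobs)
    concurrent_time = time.perf_counter() - start
    return sequential, sequential_time, concurrent, concurrent_time
```

In a notebook this can be run with `await compare()`; in a plain script, `asyncio.run(compare())`. The concurrent run should take roughly as long as the slowest single request, while the sequential run takes about the sum of all delays.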

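The pipeline is built from subclasses of an abstract `PipelineStep` class. Each step must define a `process_batch(self, batch)` coroutine; the steps in this notebook write it as an async generator that yields its output records.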
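
To make the shape of a step concrete, here is a minimal sketch of what such a class could look like. `SketchStep`, `UppercaseStep`, and `run_step` are hypothetical names used only for illustration; the real `PipelineStep` in `etl` also handles queues, batching, and duplicate detection, none of which is shown here.

```python
import asyncio
from abc import ABC, abstractmethod

class SketchStep(ABC):
    """Hypothetical stand-in for PipelineStep: subclasses only implement process_batch."""

    @abstractmethod
    def process_batch(self, batch):
        """Return an async iterator of output records for one batch."""

class UppercaseStep(SketchStep):
    async def process_batch(self, batch):
        for record in batch:
            await asyncio.sleep(0)  # hand control back to the loop, as a real request would
            yield record.upper()

async def run_step(step, batch):
    # Drain the async generator into a list
    return [record async for record in step.process_batch(batch)]
```

Because `process_batch` is an async generator, a caller can start forwarding the first records downstream before the whole batch has been processed.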

Before running my imports, I'm going to run `test.py`, which tests to make sure the libraries are working as intended.

In [1]:
!python test.py

..............
----------------------------------------------------------------------
Ran 14 tests in 1.231s

OK


In [2]:
!pip install --quiet --upgrade sqlalchemy==1.3.22 pymysql==0.9.3 PyYAML aiohttp aiomysql==0.0.21

from etl import PipelineStep, DataPipeline
from warnings import warn
from abc import ABC
import xml.etree.ElementTree as xml
import re
import yaml
import warnings

import numpy as np
import pandas as pd

import asyncio
import aiohttp

import sqlalchemy as sa
from aiomysql.sa import create_engine
from aiomysql import Warning as MariaDBWarning
from sqlalchemy import (
    Table,
    Column,
    MetaData,
    String,
    Numeric
)
from sqlalchemy.schema import CreateTable, DropTable
from sqlalchemy.dialects.mysql import YEAR as Year, INTEGER as Integer

with open('secrets.yaml', 'r') as secrets_file:
    secrets = yaml.safe_load(secrets_file)

## PipelineStep Subclasses

These are some PipelineStep subclasses I plan to use multiple times because I will be accessing specific APIs.

In [3]:
class WikipediaPipelineStep(PipelineStep, ABC):
    request_counter = 0
    wikipedia_url = 'https://en.wikipedia.org/w/api.php'
    wikipedia_header = {"User-Agent": 
          'datascience jupyter notebook/0.2 '
          '(https://github.com/pacorain/datascience-certification-final-project; '
          'Austin Rainwater, paco@heckin.io)'}
    max_batch_size = 1
    async_batches = True
    
    def start(self):
        self.session = aiohttp.ClientSession()
        super().start()
        
    def stop(self):
        super().stop()
        loop = asyncio.get_running_loop()
        loop.create_task(self.session.close())
    
    async def make_request(self, params):
        async with self.session.get(self.wikipedia_url, params=params, headers=self.wikipedia_header) as request:
            self.__class__.request_counter += 1
            assert 200 <= request.status <= 299
            response = await request.json()
            if params['action'] == 'query' and 'continue' in response:
                warn("A Continue was issued but not handled.")
                # I'm not sure how to integrate this, or if I will even need to.
            return response
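
The warning above fires when a `query` response carries a `continue` object. One possible way to handle it, sketched under the assumption that the values in `continue` are merged into the parameters of the next request, is an async generator that keeps requesting until no `continue` is returned. `query_all` and `fake_make_request` are hypothetical names, and the fake responses are hand-written, not real API output.

```python
import asyncio

async def query_all(make_request, params):
    """Yield every response for a query, following `continue` values.

    `make_request` is any coroutine function that takes a params dict
    and returns the parsed JSON response.
    """
    request_params = dict(params)
    while True:
        response = await make_request(request_params)
        yield response
        if 'continue' not in response:
            break
        # Merge the continuation values into the original parameters
        request_params = {**params, **response['continue']}

async def fake_make_request(params):
    """Hypothetical two-part API used only to exercise query_all."""
    if 'plcontinue' not in params:
        return {'continue': {'plcontinue': 'part2'}, 'query': {'pages': ['a']}}
    return {'query': {'pages': ['b']}}

async def collect():
    return [r['query']['pages'] async for r in query_all(fake_make_request, {'action': 'query'})]
```

Whether the pipeline needs this at all depends on the queries involved; it is only a sketch of the loop, not a drop-in change to `make_request`.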

In [4]:
class FoursquarePipelineStep(PipelineStep, ABC):
    pass

In [5]:
class DatabasePipelineStep(PipelineStep, ABC):
    """Pipeline step that handles data for this project's database.
    
    async_batches are disabled here because I just have a simple MariaDB instance, on a single
    disk, so running multiple queries will not likely run any faster--and may even run slower.
    
    Attributes
    ----------
    drop_tables : bool
        Whether to drop and re-create the tables before running the pipeline, essentially
        starting over. Defaults to false.
    """
    async_batches = False
    max_batch_size = 500
    
    engine = None
    meta = None
    
    def __init__(self, drop_tables=False):
        super().__init__()
        self.drop_tables = drop_tables
        
    async def setup(self):
        if self.meta:
            async with self.engine.acquire() as conn:
                if self.drop_tables:
                    await self.drop_all(conn)
                await self.create_all(conn)
                
    async def drop_all(self, conn):
        for table in self.meta.tables.values():
            try:
                await conn.execute(DropTable(table))
            except Exception:
                # Most likely the table doesn't exist yet
                pass
                
    async def create_all(self, conn):
        for table in self.meta.tables.values():
            await conn.execute(CreateTable(table))
            
DatabasePipelineStep.engine = await create_engine(**secrets['db_prod'], autocommit=True, echo=True)


## Defining the actual pipeline steps

Now I can use the code I've written to create a data pipeline. To test, I will start with Fort Wayne, IN, as I did in the previous document.

In [6]:
class NormalizeCityNames(WikipediaPipelineStep):
    """First step: take incoming city names and normalize them according to the titles of
    their pages on Wikipedia.

    Sets the batch size to 49, just under the 50 titles Wikipedia allows per _query_.
    """

    batch_size = 49

    async def process_batch(self, city_names):
        params = {
            "action": "query",
            "format": "json",
            "redirects": 1,
            "titles": "|".join(city_names)
        }
        try:
            result = await self.make_request(params)
            response = result['query']
        except KeyError as e:
            raise RuntimeError(f"Response not valid: {result}", city_names) from e
        if 'redirects' in response:
            for redirect in response['redirects']:
                #TODO: Cache redirects
                yield redirect['to']
        for page in response['pages'].values():
            if "missing" in page:
                warn(f"The city {page['title']} was provided but is not available on Wikipedia, and has been skipped")
            elif page['title'] in city_names:
                # Original name was not redirected; yield the original name
                yield page['title']

```python
>>> pipeline = DataPipeline(NormalizeCityNames(), data=['Fort Wayne'])
>>> await pipeline.run()
```
```
['Fort Wayne, Indiana']
```
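
The response handling in `process_batch` can be tried without network access. The sketch below applies the same idea (redirect targets first, then requested titles that exist under their own name, skipping missing pages) to a hand-written dictionary shaped like a `query` response. `normalized_titles` is a hypothetical helper, and the sample, including the made-up missing city, is not real API output.

```python
def normalized_titles(response, requested):
    """Return redirect targets plus requested titles that exist under their own name."""
    query = response['query']
    titles = [redirect['to'] for redirect in query.get('redirects', [])]
    for page in query['pages'].values():
        if 'missing' in page:
            continue  # no such article, so skip it
        if page['title'] in requested:
            titles.append(page['title'])
    return titles

# Hand-written sample in the shape of a `query` response (not real API output)
sample = {
    'query': {
        'redirects': [{'from': 'Fort Wayne', 'to': 'Fort Wayne, Indiana'}],
        'pages': {
            '1': {'title': 'Fort Wayne, Indiana'},
            '2': {'title': 'New Haven, Indiana'},
            '-1': {'title': 'Nowhereville, Indiana', 'missing': ''},
        },
    }
}
requested = ['Fort Wayne', 'New Haven, Indiana', 'Nowhereville, Indiana']
titles = normalized_titles(sample, requested)
```

Here `titles` contains the redirect target for `'Fort Wayne'` and the unchanged `'New Haven, Indiana'`, while the missing page is dropped.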

In [7]:
class ParseTree(WikipediaPipelineStep):
    """
    For each incoming city name, attach its Wikipedia parsetree
    """
    async def process_batch(self, normalized_city_names):
        for city in normalized_city_names:
            params = {
                "action": "parse",
                "format": "json",
                "redirects": 1,
                "prop": "parsetree",
                "page": city
            }
            raw_response = (await self.make_request(params))['parse']['parsetree']['*']
            response = xml.canonicalize(raw_response, strip_text=True)
            tree = xml.fromstring(response)
            await self.remove_comments(tree)
            yield (city, tree)
            
    async def remove_comments(self, root):
        """Remove any `<comment>` children tags, as they can interfere with parsing"""
        for child in list(root):  # Create list of elements that doesn't change during iteration
            if child.tag == 'comment':
                root.remove(child)
            else:
                await self.remove_comments(child)

```python
>>> pipeline = DataPipeline(NormalizeCityNames(), ParseTree(), data=['Fort Wayne'])
>>> await pipeline.run()
```
```
[('Fort Wayne, Indiana', <Element 'root' at 0x7ff547383f90>)]
```
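
To see what the comment stripping does to a tree, the following sketch runs a synchronous variant of `remove_comments` on a small hand-written fragment in the style of a parsetree. `strip_comments` and the sample XML are assumptions made for illustration; real parsetrees are much larger.

```python
import xml.etree.ElementTree as ET

def strip_comments(root):
    """Recursively remove <comment> elements (synchronous variant for illustration)."""
    for child in list(root):  # copy the children, because root is mutated while looping
        if child.tag == 'comment':
            root.remove(child)
        else:
            strip_comments(child)

# Hand-written fragment in the style of a parsetree
sample = ET.fromstring(
    "<root><comment>note</comment>"
    "<template><title>Infobox settlement</title>"
    "<part><name>name</name><comment>inner</comment><value>Fort Wayne</value></part>"
    "</template></root>"
)
strip_comments(sample)
remaining = [elem.tag for elem in sample.iter()]
name_value = sample.find(".//template[title='Infobox settlement']/part[name='name']/value").text
```

After the call, `remaining` holds no `comment` tags, and the same kind of ElementPath predicate used in later steps (`template[title='...']`) still finds the value. One caveat: `Element.remove` also discards any tail text attached to the removed element.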

In [8]:
class GetCitiesFromCounties(WikipediaPipelineStep):
    """
    Figure out which of a city's navboxes is for the county, and use that to expand each
    city into the cities in its county.

    Leaves city and tree attached so that the tree for the original city is not obtained twice.
    """
    async def process_batch(self, city_parsetrees):
        for original_city, tree in city_parsetrees:
            seat = None
            cities = None
            state = None
            for template in self.get_navbox_templates(tree):
                raw_response, template_page = await self.get_template_page(template)
                county_root = template_page.find(".//template[title='US county navigation box']")
                state_root = template_page.find(".//template[title='US state navigation box']")
                if county_root is not None:
                    seat = await self.get_seat(county_root)
                    cities = await self.parse_cities(raw_response)
                elif state_root is not None:
                    state = await self.get_state(state_root)
                if seat and state:
                    break
            for city in cities or []:  # cities stays None if no county navbox was found
                yield (original_city, tree, city, state, seat)

    def get_navbox_templates(self, wiki_page_tree):
        """Finds the topic navigation boxes on the wiki page (usually at the bottom)"""
        navboxes = wiki_page_tree.findall(".//template[title='Navboxes']/part[name='list']/value/template/title")
        return ['Template:{}'.format(elem.text) for elem in navboxes]

    async def get_template_page(self, template):
        """Gets the template that defines the navigation box.
        
        Returns
        -------
        tuple[str, ElementTree]
            The first element is the raw XML returned by the Wikipedia API.
            The second element is the stripped and parsed XML data.
        """
        params = {
            "action": "parse",
            "format": "json",
            "redirects": 1,
            "prop": "parsetree",
            "page": template
        }
        raw_response = (await self.make_request(params))['parse']['parsetree']['*']
        response = xml.canonicalize(raw_response, strip_text=True)
        return raw_response, xml.fromstring(response)
        
    async def get_seat(self, root):
        """Gets the seat (e.g. metropolis) for a specified area from the navbox"""
        seat_name = root.find(".//part[name='seat']/value").text
        params = {
            "action": "query",
            "format": "json",
            "redirects": "1",
            "titles": seat_name
        }
        response = await self.make_request(params)
        page = list(response['query']['pages'].values())[0]
        assert 'missing' not in page
        return page['title']

        
    async def parse_cities(self, raw_response_txt):
        """Using the raw XML, gets the cities from the navbox because they will be formatted as a list."""
        listed_city = re.compile(r"""
            ^\* \ *       # Line starts with "*" plus any number of spaces
            \[{2}         # Start of link "[["
                ([^\|]+)  # First part of link (between "[[" and "|"). This is the part that gets captured.
                \|        # Separator "|"
                [^\|]+    # Second part of link (between "|" and "]]")
            \]{2}‡?       # End of link "]]" plus optional ‡ character
            \ *$          # End with any number of spaces
        """, re.VERBOSE | re.MULTILINE)
        listed_cities = listed_city.findall(raw_response_txt)
        return listed_cities
    
    async def get_state(self, root):
        """Gets the canonical state name for a city from the navbox"""
        return root.find(".//part[name='template_name']/value").text

```python
>>> pipeline = DataPipeline(
...     NormalizeCityNames(), 
...     ParseTree(), 
...     GetCitiesFromCounties(), 
...     data=['Fort Wayne']
... )
...
>>> results = await pipeline.run()
>>> results[:3]
```

```
[('Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546a30d60>,
  'Fort Wayne, Indiana',
  'Indiana',
  'Fort Wayne, Indiana'),
 ('Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546a30d60>,
  'New Haven, Indiana',
  'Indiana',
  'Fort Wayne, Indiana'),
 ('Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546a30d60>,
  'Woodburn, Indiana',
  'Indiana',
  'Fort Wayne, Indiana')]
```
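
The list-item pattern in `parse_cities` can be checked on its own. The sketch below compiles the same expression and runs it on a few hand-written lines of wikitext in the style of a county navbox; the sample text is an assumption, not copied from Wikipedia.

```python
import re

listed_city = re.compile(r"""
    ^\* \ *       # "*" at the start of a line, then optional spaces
    \[{2}         # opening "[["
    ([^\|]+)      # link target (captured)
    \|            # separator "|"
    [^\|]+        # display text
    \]{2}‡?       # closing "]]" and an optional ‡ marker
    \ *$          # optional trailing spaces, then end of line
""", re.VERBOSE | re.MULTILINE)

# Hand-written wikitext in the style of a county navbox list
sample = (
    "* [[Fort Wayne, Indiana|Fort Wayne]]‡\n"
    "* [[New Haven, Indiana|New Haven]]\n"
    "Not a list item\n"
)
cities = listed_city.findall(sample)
```

`cities` ends up as `['Fort Wayne, Indiana', 'New Haven, Indiana']`: only the link target before the `|` is captured, and the line that is not a list item is ignored.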

In [9]:
normalize = NormalizeCityNames()
parse = ParseTree()

class SecondNormalizeStep(NormalizeCityNames):
    """Subclass of NormalizeCityNames, which is modified to accept and yield back the city's county seat"""
    max_batch_size = 49
    def __init__(self, first_step):
        super().__init__()
        self._duplicate_cache = first_step._duplicate_cache
    
    async def process_batch(self, batch):
        cities, states, seats = zip(*batch)
        i = 0
        async for normalized_city in super().process_batch(cities):
            yield normalized_city, states[i], seats[i]
            i += 1
            
    def is_duplicate(self, record):
        city = record[0]
        return super().is_duplicate(city)
        
    
class SecondParseTreeStep(ParseTree):
    """Subclass of ParseTree, which is modified to accept and yield back the city's county seat"""
    def __init__(self, first_step):
        super().__init__()
        self._duplicate_cache = first_step._duplicate_cache
        
    async def process_batch(self, batch):
        norm_cities, states, seats = zip(*batch)
        i = 0
        async for city_name, tree in super().process_batch(norm_cities):
            yield city_name, states[i], seats[i], tree
            i += 1
            
    def is_duplicate(self, record):
        city = record[0]
        return super().is_duplicate(city)
        

class NewCitySplit(PipelineStep):
    """Runs the same processing on new cities in the county.
    
    It's designed specifically not to process the original city by creating a "split":
    
        incoming city matches original?
          |          \
         yes          no
          |            \
          |             |
          |             v
          |         normalize
          |             |
          |             v
          |           parse
          |             |
          v             v
         yield        yield
           \            /
            \          /
             \        /
             |       |
             v       v
         NewCitySplit.outputs
    
    """
    def __init__(self, normalize_step, parse_step):
        super().__init__()
        self.normalize_step = SecondNormalizeStep(normalize_step)
        self.parse_step = SecondParseTreeStep(parse_step)
        self.normalize_step.attach(self.parse_step)
        self.parse_step.outputs = self.outputs
    
    async def process_batch(self, batch):
        for original_city, tree, city, state, seat in batch:
            if city == original_city:
                yield city, state, seat, tree
            else:
                self.normalize_step.put((city, state, seat))
        # Don't mark this task as "complete" until the inner pipeline steps are done
        await self.normalize_step.join()
        await self.parse_step.join()
    
    def start(self):
        super().start()
        self.normalize_step.start()
        self.parse_step.start()
        
        
    def stop(self):
        super().stop()
        self.normalize_step.stop()
        self.parse_step.stop()
                
        

```python
>>> normalize = NormalizeCityNames()
>>> parse = ParseTree()
>>>
>>> pipeline = DataPipeline(
...     normalize, 
...     parse, 
...     GetCitiesFromCounties(), 
...     NewCitySplit(normalize, parse), 
...     data=['Fort Wayne, IN']
... )
... 
>>> wikipedia_results = await pipeline.run()
>>> wikipedia_results[:5]
```

```
[('Fort Wayne, Indiana',
  'Indiana',
  'Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546932f90>),
 ('Aboite, Indiana',
  'Indiana',
  'Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546890400>),
 ('Aboite Township, Allen County, Indiana',
  'Indiana',
  'Fort Wayne, Indiana',
  <Element 'root' at 0x7ff546897b30>),
 ('Academie, Indiana',
  'Indiana',
  'Fort Wayne, Indiana',
  <Element 'root' at 0x7ff5468a2360>),
 ('Adams Township, Allen County, Indiana',
  'Indiana',
  'Fort Wayne, Indiana',
  <Element 'root' at 0x7ff54689b950>)]
```
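
The idea behind the split can be shown without the queues the real step uses. In this simplified sketch, records whose city matches the original pass straight through with the tree they already carry, and the rest go through a lookup concurrently. `expensive_lookup` and `split` are hypothetical names, and the string "trees" are placeholders for parsed XML.

```python
import asyncio

async def expensive_lookup(city):
    """Hypothetical stand-in for the normalize + parse round trips."""
    await asyncio.sleep(0)
    return f"<tree for {city}>"

async def split(records):
    """Pass through records that already have a tree; look up the rest concurrently."""
    ready, pending = [], []
    for original_city, tree, city in records:
        if city == original_city:
            ready.append((city, tree))  # tree was already fetched for this city
        else:
            pending.append(city)
    trees = await asyncio.gather(*(expensive_lookup(city) for city in pending))
    return ready + list(zip(pending, trees))

records = [
    ('Fort Wayne, Indiana', '<tree for Fort Wayne, Indiana>', 'Fort Wayne, Indiana'),
    ('Fort Wayne, Indiana', '<tree for Fort Wayne, Indiana>', 'New Haven, Indiana'),
]
```

Running `await split(records)` returns one `(city, tree)` pair per record, with only the second record triggering a lookup.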

In [10]:
class WikiParsing(PipelineStep):
    """
    Extracts data from XML ElementTree to put into database or use with Foursquare API.
    
    Methods
    -------
    put(record):
        Adds a record. The record should be a Tuple of the following elements:
        
            1. city (string): The canonical name of the city; e.g. `'New Haven, Indiana'`
            2. state (string): The name of the state; e.g. `'Indiana'`
            3. seat (string): The county seat; e.g. `'Fort Wayne, Indiana'`
            4. tree (xml.etree.ElementTree): The parsed XML from Wikipedia for the city
    """
    async_batches = True
    batch_size = None
    
    async def process_batch(self, cities):
        for city, state, seat, tree in cities:
            latitude, longitude = self.parse_settlement_coords(tree)
            yield (city, {
                'city_name': city,
                'metro_name': seat,
                'state_name': state,
                'center_latitude': latitude,
                'center_longitude': longitude,
                'area_val': self.infobox_value(tree, "area_total_sq_mi", float),
                'city_population': self.infobox_value(tree, "population_est", int),
                'population_density': self.infobox_value(tree, "population_density_sq_mi", float)
            }, self.get_weather_table(tree) if city == seat else None, self.get_population_history(tree, city))
            
    def is_duplicate(self, record):
        city = record[0]
        return super().is_duplicate(city)
    
    def parse_settlement_coords(self, wiki_data):
        coords = wiki_data.findall(".//part[name='coordinates']/value/template[title='coord']/part/value")
        if len(coords) < 8:
            # No coordinates, or not in the 8-part DMS form this parser expects
            return (None, None)
        # Convert from DMS (degrees, minutes, seconds) to Decimal
        lat_deg, lat_min, lat_sec, lat_pole, lng_deg, lng_min, lng_sec, lng_pole = [x.text for x in coords[:8]]
        lat_sign = 1 if lat_pole == 'N' else -1
        lng_sign = 1 if lng_pole == 'E' else -1
        latitude = sum([float(lat_deg), float(lat_min)/60.0, float(lat_sec)/3600.0]) * lat_sign
        longitude = sum([float(lng_deg), float(lng_min)/60.0, float(lng_sec)/3600.0]) * lng_sign
        return latitude, longitude
    
    def infobox_value(self, tree, part_name, astype=str):
        template = tree.find(".//template[title='Infobox settlement']")
        if template is None:
            return None
        value = template.find(".part[name='{}'].value".format(part_name))
        if value is None or value.text is None:
            return None
        return astype(value.text)
    
    def get_weather_table(self, tree):
        weather_box = tree.find(".//template[title='Weather box']")
        if weather_box is None:
            return None
        months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'year']
        stat_names = {
            'record high F': 'record_high',
            'avg record high F': 'avg_record_high',
            'high F': 'avg_high',
            'low F': 'avg_low',
            'avg record low F': 'avg_record_low',
            'record low F': 'record_low',
            'precipitation inch': 'avg_precip',
            'snow inch': 'avg_snow',
            'precipitation days': 'precip_days',
            'snow days': 'snow_days',
            'sun': 'sunshine_hours',
            'percentsun': 'daily_sunshine'
        }

        series_list = []
        for month in months:
            data = {}
            for stat in stat_names.keys():
                elem = weather_box.find(f".//part[name='{month} {stat}'].value")
                if elem is None:
                    val = None
                else:
                    val = float(elem.text.replace('−', '-'))  # Wikipedia uses the Unicode minus sign (U+2212), which float() rejects
                data[stat_names[stat]] = val
            ix = pd.Index(data.keys(), name='weather_month')
            series_list.append(
                pd.Series(data=data.values(), index=ix, name=month)
            )
        return pd.DataFrame(series_list)
    
    def get_population_history(self, tree, city_name):
        census = tree.find(".//template[title='US Census population']")
        if census is None:
            return

        data = []

        for part in census.findall("part"):
            year = part.find("name").text
            if year.isnumeric() and len(year) == 4:
                data.append((city_name, year, int(part.find("value").text)))
                
        return pd.DataFrame(data, columns=['city_name', 'census_year', 'census_population'])

In [11]:
%store -r
try:
    parse_results
except NameError:
    parsing_pipeline = DataPipeline(
        normalize, 
        parse, 
        GetCitiesFromCounties(),
        NewCitySplit(normalize, parse), 
        WikiParsing(),
        data=['Fort Wayne, IN']
    )
    parse_results = await parsing_pipeline.run()
    fort_wayne_data = parse_results[0][1]
    fort_wayne_weather_data = parse_results[0][2]
    fort_wayne_pop_history = parse_results[0][3]
    %store parse_results fort_wayne_data fort_wayne_weather_data fort_wayne_pop_history

In [12]:
fort_wayne_data

{'city_name': 'Fort Wayne, Indiana',
 'metro_name': 'Fort Wayne, Indiana',
 'state_name': 'Indiana',
 'center_latitude': 41.080555555555556,
 'center_longitude': -85.13916666666667,
 'area_val': 110.79,
 'city_population': 270402,
 'population_density': 2445.46}
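
The coordinates above come from the degrees/minutes/seconds conversion in `parse_settlement_coords`: decimal degrees are degrees + minutes/60 + seconds/3600, negated for south or west. A minimal sketch of that arithmetic follows; `dms_to_decimal` is a hypothetical helper, and the DMS inputs are back-computed from the decimal output above rather than read from Wikipedia.

```python
def dms_to_decimal(degrees, minutes, seconds, direction):
    """Convert degrees/minutes/seconds plus a compass letter to signed decimal degrees."""
    sign = 1 if direction in ('N', 'E') else -1
    return sign * (float(degrees) + float(minutes) / 60 + float(seconds) / 3600)

# DMS values back-computed from the decimal output above (assumed, not read from Wikipedia)
latitude = dms_to_decimal('41', '04', '50', 'N')
longitude = dms_to_decimal('85', '08', '21', 'W')
```

For the latitude, 41 + 4/60 + 50/3600 ≈ 41.080556, which agrees with `center_latitude`; the longitude works out to about -85.139167 in the same way.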

In [13]:
fort_wayne_weather_data

| weather_month | record_high | avg_record_high | avg_high | avg_low | avg_record_low | record_low | avg_precip | avg_snow | precip_days | snow_days | sunshine_hours | daily_sunshine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 69.0 | 53.5 | 32.4 | 17.4 | -4.5 | -24.0 | 2.26 | 10.1 | 12.6 | 9.5 | 148.5 | 50.0 |
| Feb | 73.0 | 56.9 | 36.3 | 20.3 | 0.5 | -19.0 | 2.04 | 7.7 | 10.1 | 6.9 | 158.5 | 53.0 |
| Mar | 87.0 | 72.5 | 48.0 | 28.7 | 10.9 | -10.0 | 2.71 | 4.1 | 12.2 | 4.1 | 206.3 | 56.0 |
| Apr | 90.0 | 81.0 | 61.1 | 38.9 | 23.4 | 7.0 | 3.52 | 1.0 | 12.9 | 1.0 | 251.4 | 63.0 |
| May | 97.0 | 86.6 | 71.7 | 49.2 | 35.4 | 27.0 | 4.27 | 0.0 | 13.0 | 0.0 | 311.9 | 69.0 |
| Jun | 106.0 | 93.0 | 80.9 | 59.3 | 46.2 | 36.0 | 4.16 | 0.0 | 10.9 | 0.0 | 340.0 | 75.0 |
| Jul | 106.0 | 93.4 | 84.4 | 62.7 | 51.6 | 38.0 | 4.24 | 0.0 | 9.8 | 0.0 | 347.0 | 76.0 |
| Aug | 102.0 | 91.7 | 82.2 | 60.8 | 49.3 | 38.0 | 3.64 | 0.0 | 9.4 | 0.0 | 318.2 | 75.0 |
| Sep | 100.0 | 88.9 | 76.0 | 52.6 | 38.2 | 29.0 | 2.8 | 0.0 | 9.1 | 0.0 | 258.1 | 69.0 |
| Oct | 91.0 | 81.0 | 63.4 | 41.8 | 27.8 | 19.0 | 2.84 | 0.3 | 9.7 | 0.2 | 207.6 | 60.0 |


In [14]:
fort_wayne_pop_history

|  | city_name | census_year | census_population |
|---|---|---|---|
| 0 | Fort Wayne, Indiana | 1850 | 4282 |
| 1 | Fort Wayne, Indiana | 1860 | 10388 |
| 2 | Fort Wayne, Indiana | 1870 | 17718 |
| 3 | Fort Wayne, Indiana | 1880 | 26880 |
| 4 | Fort Wayne, Indiana | 1890 | 35393 |
| 5 | Fort Wayne, Indiana | 1900 | 45115 |
| 6 | Fort Wayne, Indiana | 1910 | 63933 |
| 7 | Fort Wayne, Indiana | 1920 | 86549 |
| 8 | Fort Wayne, Indiana | 1930 | 114946 |
| 9 | Fort Wayne, Indiana | 1940 | 118410 |
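
The population history is built by walking the `part` children of the census template and keeping those whose name looks like a four-digit year. The sketch below applies that filter to a hand-written fragment and returns plain tuples instead of a DataFrame. `census_rows` is a hypothetical helper; the two populations are the 1850 and 1860 figures shown above, and the surrounding XML is an assumption about the template's shape.

```python
import xml.etree.ElementTree as ET

def census_rows(template, city_name):
    """Collect (city, year, population) for parts whose name is a four-digit year."""
    rows = []
    for part in template.findall('part'):
        name = part.find('name').text
        if name.isnumeric() and len(name) == 4:
            rows.append((city_name, name, int(part.find('value').text)))
    return rows

# Hand-written fragment; only the two population figures come from the output above
sample = ET.fromstring(
    "<template><title>US Census population</title>"
    "<part><name>1850</name><value>4282</value></part>"
    "<part><name>1860</name><value>10388</value></part>"
    "<part><name>footnote</name><value>Source: census</value></part>"
    "</template>"
)
rows = census_rows(sample, 'Fort Wayne, Indiana')
```

The `footnote` part is skipped because its name is not numeric, so `rows` holds one tuple per census year.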


In [15]:
class WikipediaDatabaseStep(DatabasePipelineStep):
    meta = MetaData()
        
    city_table = Table("cities", meta,
        Column('city_name', String(50), primary_key=True, comment='City Name'),
        Column('metro_name', String(50), comment='Metropolitan Area Name'),
        Column('state_name', String(25), nullable=False, comment='State Name'),
        Column('center_latitude', Numeric(10, 6), comment='Latitude of City center'),
        Column('center_longitude', Numeric(10, 6), comment='Longitude of City center'),
        Column('area_val', Numeric(10, 4), comment='Area of city in square miles'),
        Column('city_population', Integer, comment='Total population of city'),
        Column('population_density', Numeric(10, 4), comment='Population Density per square mile')
    )
    
    weather_table = Table(
        "metro_weather_data", meta,
        Column('city_name', String(50), primary_key=True, comment='City Name'),
        Column('weather_month', String(10), primary_key=True, comment='Month of weather conditions'),
        Column('record_high', Numeric(5, 2), comment='Record high temperature'),
        Column('avg_record_high', Numeric(5, 2), comment='Average highest temperature of month'),
        Column('avg_high', Numeric(5, 2), comment='Average daily high temperature of month'),
        Column('avg_low', Numeric(5, 2), comment='Average daily low temperature of month'),
        Column('avg_record_low', Numeric(5, 2), comment='Average lowest temperature of month'),
        Column('record_low', Numeric(5, 2), comment='Record low temperature'),
        Column('avg_precip', Numeric(7, 3), comment='Average precipitation inches'),
        Column('avg_snow', Numeric(7, 3), comment='Average snowfall inches'),
        Column('precip_days', Numeric(5, 2), comment='Average number of days where precipitation ≥0.1 inches'),
        Column('snow_days', Numeric(5, 1), comment='Average number of days where snowfall ≥0.1 inches'),
        Column('sunshine_hours', Numeric(7, 2), comment='Average number of hours with sunshine'),
        Column('daily_sunshine', Numeric(5, 2), comment='Average percent of daily sunshine')
    )
    
    population_table = Table(
        "city_population_history", meta,
        Column('city_name', String(50), primary_key=True, comment='City Name'),
        Column('census_year', Integer(4, zerofill=True), primary_key=True, comment='Census Year'),
        Column('census_population', Integer, comment='Total recorded population')
    )
    
    async def process_batch(self, batch):
        cities, city_records, weather_tables, population_tables = zip(*batch)
        # No preprocessing needs to be done, so we can just yield the city names back
        for city in cities:
            yield city 
            
        await self.insert_cities(city_records)
        await self.insert_weather_data(cities, weather_tables)
        await self.insert_population_history(population_tables)
        
    async def insert_cities(self, city_records):
        async with self.engine.acquire() as conn:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", category=MariaDBWarning)
                await conn.execute(
                    self.city_table.insert(),
                    city_records
                )
    
    async def insert_weather_data(self, cities, weather_tables):
        data_frame = pd.concat(weather_tables, keys=cities, names=['city_name']).reset_index()
        data_frame = data_frame.rename(columns={data_frame.columns[1]: 'weather_month'})
        records = data_frame.to_dict('records')
        for record in records:
            record.update({key: None for key in record.keys() if pd.isna(record[key])})
        async with self.engine.acquire() as conn:
            await conn.execute(
                self.weather_table.insert(),
                records
            )
            
    async def insert_population_history(self, population_tables):
        data_frame = pd.concat(population_tables)
        records = data_frame.to_dict('records')
        
        async with self.engine.acquire() as conn:
            await conn.execute(
                self.population_table.insert(),
                records
            )
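
`insert_weather_data` replaces NaN values with `None` before executing the insert, so that missing values reach the database as NULL. The same cleanup can be sketched with the standard library alone; `nan_to_none` is a hypothetical helper, and the record is made up for the example.

```python
import math

def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }

# Made-up record with one gap, in the shape produced by DataFrame.to_dict('records')
record = {
    'city_name': 'Example City',
    'weather_month': 'Jan',
    'avg_snow': 10.1,
    'sunshine_hours': float('nan'),
}
cleaned = nan_to_none(record)
```

Unlike the in-place `record.update(...)` used above, this version returns a new dictionary and leaves the input untouched, which can be easier to reason about when the records are reused.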

In [16]:
await DataPipeline(
    WikipediaDatabaseStep(drop_tables=True),
    data=parse_results
).run()

sync_engine = sa.create_engine(secrets['db_connection_string'])


In [17]:
pd.read_sql(WikipediaDatabaseStep.city_table.select(), sync_engine)

|  | city_name | metro_name | state_name | center_latitude | center_longitude | area_val | city_population | population_density |
|---|---|---|---|---|---|---|---|---|
| 0 | Aboite Township, Allen County, Indiana | Fort Wayne, Indiana | Indiana | 41.052222 | -85.285278 | 33.35 |  | 1073.20 |
| 1 | Aboite, Indiana | Fort Wayne, Indiana | Indiana | 41.000000 | -85.318056 |  |  |  |
| 2 | Academie, Indiana | Fort Wayne, Indiana | Indiana | 41.170833 | -85.145833 |  |  |  |
| 3 | Adams Township, Allen County, Indiana | Fort Wayne, Indiana | Indiana | 41.045833 | -85.050556 |  |  |  |
| 4 | Allen, Indiana | Fort Wayne, Indiana | Indiana | 41.200000 | -85.195833 |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65 | Wayne Township, Allen County, Indiana | Fort Wayne, Indiana | Indiana | 41.049444 | -85.164722 |  |  |  |
| 66 | Woodburn, Indiana | Fort Wayne, Indiana | Indiana | 41.126111 | -84.852778 | 0.96 | 1639.0 | 1712.64 |
| 67 | Yoder, Indiana | Fort Wayne, Indiana | Indiana | 40.931111 | -85.176667 |  |  |  |
| 68 | Zanesville, Indiana | Fort Wayne, Indiana | Indiana | 40.915833 | -85.281389 | 0.84 | 621.0 | 750.00 |


In [18]:
pd.read_sql(WikipediaDatabaseStep.weather_table.select(), sync_engine)

|  | city_name | weather_month | record_high | avg_record_high | avg_high | avg_low | avg_record_low | record_low | avg_precip | avg_snow | precip_days | snow_days | sunshine_hours | daily_sunshine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fort Wayne, Indiana | Apr | 90.0 | 81.0 | 61.1 | 38.9 | 23.4 | 7.0 | 3.52 | 1.0 | 12.9 | 1.0 | 251.4 | 63.0 |
| 1 | Fort Wayne, Indiana | Aug | 102.0 | 91.7 | 82.2 | 60.8 | 49.3 | 38.0 | 3.64 | 0.0 | 9.4 | 0.0 | 318.2 | 75.0 |
| 2 | Fort Wayne, Indiana | Dec | 71.0 | 56.0 | 36.2 | 22.1 | 1.9 | -18.0 | 2.77 | 8.5 | 13.0 | 8.2 | 108.2 | 38.0 |
| 3 | Fort Wayne, Indiana | Feb | 73.0 | 56.9 | 36.3 | 20.3 | 0.5 | -19.0 | 2.04 | 7.7 | 10.1 | 6.9 | 158.5 | 53.0 |
| 4 | Fort Wayne, Indiana | Jan | 69.0 | 53.5 | 32.4 | 17.4 | -4.5 | -24.0 | 2.26 | 10.1 | 12.6 | 9.5 | 148.5 | 50.0 |
| 5 | Fort Wayne, Indiana | Jul | 106.0 | 93.4 | 84.4 | 62.7 | 51.6 | 38.0 | 4.24 | 0.0 | 9.8 | 0.0 | 347.0 | 76.0 |
| 6 | Fort Wayne, Indiana | Jun | 106.0 | 93.0 | 80.9 | 59.3 | 46.2 | 36.0 | 4.16 | 0.0 | 10.9 | 0.0 | 340.0 | 75.0 |
| 7 | Fort Wayne, Indiana | Mar | 87.0 | 72.5 | 48.0 | 28.7 | 10.9 | -10.0 | 2.71 | 4.1 | 12.2 | 4.1 | 206.3 | 56.0 |
| 8 | Fort Wayne, Indiana | May | 97.0 | 86.6 | 71.7 | 49.2 | 35.4 | 27.0 | 4.27 | 0.0 | 13.0 | 0.0 | 311.9 | 69.0 |
| 9 | Fort Wayne, Indiana | Nov | 79.0 | 68.9 | 49.9 | 32.9 | 18.9 | -1.0 | 3.09 | 1.8 | 11.2 | 2.6 | 124.2 | 42.0 |


In [19]:
pd.read_sql(WikipediaDatabaseStep.population_table.select(), sync_engine)

|  | city_name | census_year | census_population |
|---|---|---|---|
| 0 | Fort Wayne, Indiana | 1850 | 4282 |
| 1 | Fort Wayne, Indiana | 1860 | 10388 |
| 2 | Fort Wayne, Indiana | 1870 | 17718 |
| 3 | Fort Wayne, Indiana | 1880 | 26880 |
| 4 | Fort Wayne, Indiana | 1890 | 35393 |
| ... | ... | ... | ... |
| 73 | Woodburn, Indiana | 2000 | 1579 |
| 74 | Woodburn, Indiana | 2010 | 1520 |
| 75 | Zanesville, Indiana | 1880 | 93 |
| 76 | Zanesville, Indiana | 2000 | 602 |
