Research solution for loading project data to new PostgreSQL database instances #3

Closed · 6 of 7 tasks
MikeTheCanuck opened this issue Feb 1, 2018 · 12 comments
MikeTheCanuck (Contributor) commented Feb 1, 2018

Very soon after project teams have distributed (#2) and wrangled their initial project data, they will ask us to host a PostgreSQL-compatible database instance in our AWS cloud.

We need to work out a procedure to automate the creation of new PostgreSQL instances into which cleaned data can be loaded. This will have to be performed at least once for each of the five project teams, and it would be reasonable to expect to perform this task more often as the season progresses.

Assume that:

  • project team has cleaned the data and it is ready for consumption by the team's API
  • project team has created a script (perhaps with our help, or based on a generic template we supply) for automating the migration of data
  • project team has already tested this migration script on a separate, non-AWS-hosted PostgreSQL instance
  • each project team's members will have credentials to read/write data to existing tables (but not to create new tables)
  • there will be a privileged user (from the DevOps team, at least at first) who will have the ability to create new instances, instantiate the schema, create database credentials, and bulk load data (a minimal sketch of this permissions split follows the list)
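
A minimal sketch of that read/write-but-not-create split, assuming a hypothetical database projectdb, the default public schema, and a placeholder team role project_rw (none of these names are decided here):

    #!/usr/bin/env bash
    # Sketch only: give a project team read/write on existing tables without
    # letting them create new ones. Database, schema and role names are placeholders.
    set -euo pipefail

    DB=projectdb          # hypothetical project database
    TEAM_ROLE=project_rw  # hypothetical team credentials

    psql -d "$DB" -c "GRANT USAGE ON SCHEMA public TO ${TEAM_ROLE};"
    psql -d "$DB" -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO ${TEAM_ROLE};"
    psql -d "$DB" -c "GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO ${TEAM_ROLE};"
    # PUBLIC has CREATE on the public schema by default (PostgreSQL 14 and earlier),
    # so revoke it to stop table creation by non-owners
    psql -d "$DB" -c "REVOKE CREATE ON SCHEMA public FROM PUBLIC;"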

Requirements

Here are my initial thoughts on what is required to successfully provide the "new database" service:

  • propose automation script(s) that would create the database instance, create one or more sets of credentials, grant those credentials the needed permissions on database objects, and perform any other actions necessary for a database to serve the ECS-based APIs (a rough sketch follows this list)
  • propose automation script(s) that would upload data from an S3 bucket (in a specific file format) to an existing PostgreSQL database instance
  • document the prerequisites - what information we need from project teams, and what steps they must take, for us to successfully set up a new database for them
  • propose a naming convention for database instances
  • propose a permissions model for each database instance to enable read, read/write and/or create/modify schema for various roles (e.g. DevOps squadroneers, Project Team data manager, Project Team developers)
  • summarize any suggestions on how to test that the database instance was successfully created and is ready for use by the project team
  • summarize any suggestions on how to communicate the successful result, including how to obtain the IP address of the database instance, and reasonable security measures for distributing database credentials to the parties who need them
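
To make the first bullet concrete, here is a rough sketch of a provisioning script; every host, name and credential below is a placeholder, and a real script would pull them from configuration or a secrets store:

    #!/usr/bin/env bash
    # Sketch of the "new database" provisioning step. Assumes an already-running
    # PostgreSQL server and a privileged DevOps connection; all values are placeholders.
    set -euo pipefail

    PGHOST=db.example.internal                 # placeholder server address
    PROJECT=myproject                          # placeholder project name
    APP_ROLE="${PROJECT}_app"                  # credentials handed to the project team
    APP_PASSWORD="$(openssl rand -base64 24)"  # generated here, distributed securely later

    psql -h "$PGHOST" -U postgres -v ON_ERROR_STOP=1 \
         -c "CREATE ROLE ${APP_ROLE} WITH LOGIN PASSWORD '${APP_PASSWORD}';" \
         -c "CREATE DATABASE ${PROJECT} OWNER ${APP_ROLE};"

    echo "Created database ${PROJECT} with role ${APP_ROLE}; distribute the password out of band."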

The steps in this procedure should ultimately be documented so that any project team can understand their prerequisites and what happens once their "new database" request is initiated. Let's hack up some initial docs quickly, and use our first couple of rounds of actual fulfilment to refine them.

znmeb (Contributor) commented Feb 1, 2018

I can tell you what I'm doing (manually) at the moment for Transportation Systems. It clearly won't scale, but at this point I don't know that we're going to get a huge amount of data.

I started with the official Postgres image from the Docker store (https://store.docker.com/images/postgres). Pros: official, supported, tracks updates to the underlying software. Cons: it is non-trivial to configure, and the documentation is difficult to decipher. We'd have to write our own docs; I'm willing to do that.

Because we're doing GIS things, I augmented that image with PostGIS. That image currently lives at https://hub.docker.com/r/znmeb/postgis/. It's an autobuild; whenever I push to master in GitHub or the base official PostgreSQL image changes, it gets rebuilt on Docker Hub.

Once that's up and running via Docker Compose, I write code to insert our raw data into the database and take a dump file. It's usually easy; the only tricky one so far has been a database we received in Microsoft Access MDB format.

At run time, I make a Docker image with the database dump in the right place. When the container starts, the magic inside the Postgres image restores the database dump and the container is listening on port 5432 just like any Postgres server would.

To be determined at deployment time: where the data filesystems are stored. In one case (https://github.com/hackoregon/postgis-geocoder-test/tree/master/Docker) the files were large, so I mounted host filesystems into the container. In the other (https://github.com/hackoregon/crash-data-wrangling/tree/master/Docker) I just put everything in the image, since it was only 70 megabytes bigger than the raw image, which was about 450 megabytes.
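
For reference, the "magic" is the official image's /docker-entrypoint-initdb.d hook: any *.sh, *.sql or *.sql.gz file in that directory runs on the container's first start. A sketch of the kind of restore script that could live there (the dump filename and database name are placeholders):

    #!/bin/bash
    # Runs on first container start because it sits in /docker-entrypoint-initdb.d/.
    # Assumes a pg_dump custom-format file baked into the image; names are placeholders.
    set -e

    createdb -U "${POSTGRES_USER:-postgres}" projectdb
    pg_restore -U "${POSTGRES_USER:-postgres}" -d projectdb /docker-entrypoint-initdb.d/projectdb.backup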

khashf self-assigned this Feb 1, 2018
k4y3ff commented Feb 2, 2018

> the only tricky one so far has been a database we received in Microsoft Access MDB format.

Aside from CSVs and MDBs, are there other data formats we frequently receive data in?

Not sure who, exactly, is the best person to ask. @MikeTheCanuck? @znmeb, since you've responded?

znmeb (Contributor) commented Feb 2, 2018

As far as I know Transportation Systems will only be receiving MDBs and CSVs. However, other teams are getting GIS data (usually shapefiles) and may be importing data directly from APIs. We will probably also be importing Census data but I'm hoping we can get a group of people from all the teams to co-ordinate that. @BrianHGrant is the data manager.

DingoEatingFuzz (Contributor) commented:
We have also received Excel sheets of varying machine readability and PDFs that plain aren't machine readable. For these files, there's no expectation that we can automagically import them into a db.

As mentioned in our devops meeting, I'd like to get into the habit of backing up all data as we receive it (even if that's a PDF) to the cloud. Then humans can use those files to create clean CSVs which would in turn be imported into some db.

znmeb (Contributor) commented Feb 4, 2018

@DingoEatingFuzz Excel sheets? PDFs? Bring it on! They have to be dealt with one at a time but the processes can be automated once we validate them with the providers of the data.

Backups? Unless we've got a budget constraint I think we should use Google Drive for data files; make our existing Google Drive as big as it needs to be. I'm extremely leery of raw AWS / S3 for storage - it's an extra cognitive load for people who are primarily application developers.

The other artifact we will have and need to back up / manage is Docker images. The correct way to manage those is with a Docker registry. It would need to be private to Hack Oregon but that's something we can buy and it's not too hard to build if that's something the DevOps people want to do.

MikeTheCanuck (Contributor, Author) commented:
@k4y3ff Based on past seasons and other experience I have, I'd say that the vast majority of data will show up in CSV format, then MDB, "other Excel formats" and one-offs. What do you think about prioritizing "find a solution for CSV first, and then tackle the other scenarios after the CSV case is working"?

@DingoEatingFuzz We're addressing raw data uploads in #2, though I'm not sure that I explicitly called out PDF as an expected data format. (Not sure that matters, when S3 just works at the file level.)

@znmeb good thought to consider the complexity differential between S3 and other systems like Google Drive. We also want to consider the confusion of anyone trying to find data that's been inconsistently scattered across a variety of systems and platforms.

@znmeb I'll spawn off the Docker image question to a separate issue #6.

znmeb (Contributor) commented Feb 4, 2018

We are covered on CSV, MDB and xls/xlsx. PDFs that don't require optical character recognition are slightly more work but usually respond to TabulaPDF. At some point I should sit down and document all this magic in containers, though.

znmeb (Contributor) commented Feb 7, 2018

I'd like to have something functional in the next week or so - we lost a couple of person-hours tonight because I somehow managed to get two different versions of the ODOT crash data SQL dump into the wild. ;-)

An FTP server? WebDAV? All we need is some kind of file server with authentication and key management.

MikeTheCanuck (Contributor, Author) commented:
@k4y3ff and @khashf I've updated the text of the above "Requirements" - clarified each ask and formatted them as a checklist that can be tackled one by one.

@khashf Based on last night's discussion, please focus on the second Requirement above. Here are the scenarios I believe we'll need to address first for the broadest application across teams (a sketch of both follows below):

  • import a CSV file from a specified S3 bucket to an existing PostgreSQL instance.
  • import a PostgreSQL database dump from a specified S3 bucket to an existing PostgreSQL instance.

Other scenarios to follow once we have these two worked out.
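
A hedged sketch of both scenarios, assuming the AWS CLI is configured and using placeholder bucket, host, database and table names:

    #!/usr/bin/env bash
    # Sketch of the two import paths above; every name here is a placeholder.
    set -euo pipefail

    BUCKET=s3://hacko-data-archive/myproject   # placeholder S3 location
    PGHOST=db.example.internal                 # placeholder database host
    DB=myproject                               # placeholder existing database

    # Scenario 1: CSV -> existing table (the table and its columns must already exist)
    aws s3 cp "$BUCKET/clean/mytable.csv" /tmp/mytable.csv
    psql -h "$PGHOST" -d "$DB" \
         -c "\copy mytable FROM '/tmp/mytable.csv' WITH (FORMAT csv, HEADER true)"

    # Scenario 2: pg_dump custom-format backup -> existing (empty) database
    aws s3 cp "$BUCKET/dumps/mydb.backup" /tmp/mydb.backup
    pg_restore -h "$PGHOST" -d "$DB" --no-owner /tmp/mydb.backup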

znmeb (Contributor) commented Feb 10, 2018
znmeb commented Feb 10, 2018

  1. We are using PostgreSQL 10.1 and PostGIS 2.4.2 for Transportation. Any server we use needs both, and they'll have to be at least those versions.
  2. I have pg_dump files and scripts to load them in our Google Drive - hit me up on Slack for the location.
  3. The way the PostgreSQL / PostGIS Docker image works, if you put .sh or .sql script files in a specified location, they will be executed in alphabetical order the first time the image runs. That's how I made those dump files.

I need to clarify what kind of "sql" that is - I think it's psql, which means the CSV import is two lines per table: create the table to define the columns and their types, and then a \copy to load the rows (see the sketch below). You'll need to check with the users on whether you need to add a primary key to tables restored from CSV. I've been adding them when I create the dumps.
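
In other words, per table the script would look roughly like this (expressed here as shell calls to psql; the table, columns and CSV file are made up for illustration - the linked ingest.psql below is the real example):

    # Illustration of the two-lines-per-table pattern; names are hypothetical.
    psql -d ridership -c "CREATE TABLE ridership (id serial PRIMARY KEY, route text, boardings integer);"
    psql -d ridership -c "\copy ridership (route, boardings) FROM 'ridership.csv' WITH (FORMAT csv, HEADER true)"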

Here's an example: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql

znmeb (Contributor) commented Feb 15, 2018
znmeb commented Feb 15, 2018

We now have automated restores in our PostgreSQL / PostGIS images / containers. If you put a PostgreSQL pg_dump format backup file in the designated host directory before doing the Docker build, the Dockerfile will copy it onto the image and it will be restored the first time the container runs.

See https://github.com/hackoregon/data-science-pet-containers#automatic-database-restores for the details; if anyone else wants to do this I can get them started. Note that the databases we're using this for are smallish - in the 80 - 140 megabyte range. We have one coming that is much larger and I doubt if we'll be able to use this for it.
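
For anyone preparing one of these, the flow is roughly as follows (the backups/ directory, image tag and filename are placeholders; the linked README documents the actual layout):

    # Sketch: drop a pg_dump custom-format backup where the Dockerfile expects it,
    # build the image, and let the first run perform the restore. Names are placeholders.
    cp /path/to/mydb.backup backups/
    docker build -t hackoregon/postgis-with-data .
    docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=changeme hackoregon/postgis-with-data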

MikeTheCanuck (Contributor, Author) commented:
This is the message I sent out to all Data Managers today - with this, I believe we have fulfilled the purpose of this issue and it's now ready to close:

To enable the DevOps team to keep straight all the databases that will be co-located on one server, we will be prefixing all database instances with the team/project names as documented here:
#1 (comment)

That would mean, for example, that the database for passenger census would be addressable on the AWS server as transportation-systems-passenger_census.

All database backups should restore to a database that follows this naming convention. If one doesn't, we'll manually rename the database once it's restored on the server, and all calling Django code or other clients will have to be reconfigured. Please, if at all possible, follow this naming convention for the database on the box from which you'll be generating the backup that DevOps will use to restore. Here are the backup instructions that Ed and I put together:
https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-create-backup-for-new-database-creation.md
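
As a concrete illustration of that convention (the S3 prefix below is a placeholder; the linked HOWTO is authoritative):

    # Data-manager side (sketch): the source database already carries the convention
    # name, so back it up in pg_dump custom format and hand it to DevOps via S3.
    pg_dump -Fc -d transportation-systems-passenger_census \
            -f transportation-systems-passenger_census.backup
    aws s3 cp transportation-systems-passenger_census.backup \
            s3://hacko-data-archive/transportation-systems/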

We’re looking forward to bringing this centralized developer PostgreSQL service online for you shortly - please let us know which database backups we should pull from the “hacko-data-archive” S3 bucket. Once we have your first database online, I’ll communicate the IP address, database user + password and database name to each of you for your databases.

Here’s the design we’re aiming to provide:

  • a single database server - hosted in an EC2 instance, running PostgreSQL 9.6.6 (latest supported by Amazon Linux 2)
  • a single database user + password for each project (i.e. one set of credentials used to access multiple databases)
  • each database user is Owner for all databases for that project
  • backups and/or snapshots provided on some sort of semi-automated basis by the DevOps squad (not something you'll have to get involved in) - TBD as we haven't fleshed out that infrastructure yet
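
Once a project's first database is online, connecting with the per-project credentials would look something like this (IP address, role name and password are placeholders - the real values will be communicated per project):

    # Placeholder host, user and password; '\dt' just lists tables as a smoke test.
    PGPASSWORD='<password from DevOps>' \
      psql -h 203.0.113.10 -U transportation-systems \
           -d transportation-systems-passenger_census -c '\dt'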
