Research solution for loading project data to new PostgreSQL database instances #3

Closed · 6 of 7 tasks
MikeTheCanuck opened this issue Feb 1, 2018 · 12 comments
MikeTheCanuck (Contributor) commented Feb 1, 2018

Very soon after project teams have distributed (#2) and wrangled their initial project data, they will ask us to host a PostgreSQL-compatible database instance in our AWS cloud.

We need to work out a procedure to automate the creation of new PostgreSQL instances into which cleaned data can be loaded. This will have to be performed at least once for each of the five project teams, and it would be reasonable to expect to perform this task more often as the season progresses.

Assume that:

  • project team has cleaned the data and it is ready for consumption by the team's API
  • project team has created a script (perhaps with our help, or based on a generic template we supply) for automating the migration of data
  • project team has already tested this migration script on a separate, non-AWS-hosted PostgreSQL instance
  • each project team's members will have credentials to read/write data to existing tables (but not to create new tables)
  • there will be a privileged user (from the DevOps team, at least at first) who will have the ability to create new instances, instantiate the schema, create database credentials, and bulk load data (a minimal sketch of this permissions split follows the list)
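
A minimal sketch of that read/write-but-not-create split, assuming a hypothetical database projectdb, the default public schema, and a placeholder team role project_rw (none of these names are decided here):

    #!/usr/bin/env bash
    # Sketch only: give a project team read/write on existing tables without
    # letting them create new ones. Database, schema and role names are placeholders.
    set -euo pipefail

    DB=projectdb          # hypothetical project database
    TEAM_ROLE=project_rw  # hypothetical team credentials

    psql -d "$DB" -c "GRANT USAGE ON SCHEMA public TO ${TEAM_ROLE};"
    psql -d "$DB" -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO ${TEAM_ROLE};"
    psql -d "$DB" -c "GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO ${TEAM_ROLE};"
    # PUBLIC has CREATE on the public schema by default (PostgreSQL 14 and earlier),
    # so revoke it to stop table creation by non-owners
    psql -d "$DB" -c "REVOKE CREATE ON SCHEMA public FROM PUBLIC;"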

Requirements

Here are my initial thoughts on what is required to successfully provide the "new database" service:

  • propose automation script(s) that would create the database instance, create one or more sets of credentials, grant those credentials the needed permissions on database objects, and perform any other actions necessary for a database to serve the ECS-based APIs (a rough sketch follows this list)
  • propose automation script(s) that would upload data from an S3 bucket (in a specific file format) to an existing PostgreSQL database instance
  • document the prerequisites - what information we need from project teams, and what steps they must take, for us to successfully set up a new database for them
  • propose a naming convention for database instances
  • propose a permissions model for each database instance to enable read, read/write and/or create/modify schema for various roles (e.g. DevOps squadroneers, Project Team data manager, Project Team developers)
  • summarize any suggestions on how to test that the database instance was successfully created and is ready for use by the project team
  • summarize any suggestions on how to communicate the successful result, including how to obtain the IP address of the database instance, and reasonable security measures for distributing database credentials to the parties who need them
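
To make the first bullet concrete, here is a rough sketch of a provisioning script; every host, name and credential below is a placeholder, and a real script would pull them from configuration or a secrets store:

    #!/usr/bin/env bash
    # Sketch of the "new database" provisioning step. Assumes an already-running
    # PostgreSQL server and a privileged DevOps connection; all values are placeholders.
    set -euo pipefail

    PGHOST=db.example.internal                 # placeholder server address
    PROJECT=myproject                          # placeholder project name
    APP_ROLE="${PROJECT}_app"                  # credentials handed to the project team
    APP_PASSWORD="$(openssl rand -base64 24)"  # generated here, distributed securely later

    psql -h "$PGHOST" -U postgres -v ON_ERROR_STOP=1 \
         -c "CREATE ROLE ${APP_ROLE} WITH LOGIN PASSWORD '${APP_PASSWORD}';" \
         -c "CREATE DATABASE ${PROJECT} OWNER ${APP_ROLE};"

    echo "Created database ${PROJECT} with role ${APP_ROLE}; distribute the password out of band."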

The steps in this procedure should ultimately be documented so that any project team can understand their prerequisites and what happens once their "new database" request is initiated. Let's hack up some initial docs quickly, and use our first couple of rounds of actual fulfilment to refine them.

znmeb (Contributor) commented Feb 1, 2018

I can tell you what I'm doing (manually) at the moment for Transportation Systems. It clearly won't scale, but at this point I don't know that we're going to get a huge amount of data.

I started with the official Postgres image from the Docker store (https://store.docker.com/images/postgres). Pros: official, supported, tracks updates to the underlying software. Cons: it is non-trivial to configure, and the documentation is difficult to decipher. We'd have to write our own docs; I'm willing to do that.

Because we're doing GIS things, I augmented that image with PostGIS. That image currently lives at https://hub.docker.com/r/znmeb/postgis/. It's an autobuild; whenever I push to master in GitHub or the base official PostgreSQL image changes, it gets rebuilt on Docker Hub.

Once that's up and running via Docker Compose, I write code to insert our raw data into the database and take a dump file. It's usually easy; the only tricky one so far has been a database we received in Microsoft Access MDB format.

At run time, I make a Docker image with the database dump in the right place. When the container starts, the magic inside the Postgres image restores the database dump and the container is listening on port 5432 just like any Postgres server would.

To be determined at deployment time: where the data filesystems are stored. In one case (https://github.com/hackoregon/postgis-geocoder-test/tree/master/Docker) the files were large, so I mounted host filesystems into the container. In the other (https://github.com/hackoregon/crash-data-wrangling/tree/master/Docker) I just put everything in the image, since it was only 70 megabytes bigger than the raw image, which was about 450 megabytes.
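
For reference, the "magic" is the official image's /docker-entrypoint-initdb.d hook: any *.sh, *.sql or *.sql.gz file in that directory runs on the container's first start. A sketch of the kind of restore script that could live there (the dump filename and database name are placeholders):

    #!/bin/bash
    # Runs on first container start because it sits in /docker-entrypoint-initdb.d/.
    # Assumes a pg_dump custom-format file baked into the image; names are placeholders.
    set -e

    createdb -U "${POSTGRES_USER:-postgres}" projectdb
    pg_restore -U "${POSTGRES_USER:-postgres}" -d projectdb /docker-entrypoint-initdb.d/projectdb.backup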

khashf self-assigned this Feb 1, 2018
k4y3ff commented Feb 2, 2018

> the only tricky one so far has been a database we received in Microsoft Access MDB format.

Aside from CSVs and MDBs, are there other data formats we frequently receive data in?

Not sure who, exactly, is the best person to ask. @MikeTheCanuck? @znmeb, since you've responded?

znmeb (Contributor) commented Feb 2, 2018

As far as I know Transportation Systems will only be receiving MDBs and CSVs. However, other teams are getting GIS data (usually shapefiles) and may be importing data directly from APIs. We will probably also be importing Census data but I'm hoping we can get a group of people from all the teams to co-ordinate that. @BrianHGrant is the data manager.

DingoEatingFuzz (Contributor) commented:
We have also received Excel sheets of varying machine readability and PDFs that plain aren't machine readable. For these files, there's no expectation that we can automagically import them into a db.

As mentioned in our devops meeting, I'd like to get into the habit of backing up all data as we receive it (even if that's a PDF) to the cloud. Then humans can use those files to create clean CSVs which would in turn be imported into some db.

znmeb (Contributor) commented Feb 4, 2018

@DingoEatingFuzz Excel sheets? PDFs? Bring it on! They have to be dealt with one at a time but the processes can be automated once we validate them with the providers of the data.

Backups? Unless we've got a budget constraint I think we should use Google Drive for data files; make our existing Google Drive as big as it needs to be. I'm extremely leery of raw AWS / S3 for storage - it's an extra cognitive load for people who are primarily application developers.

The other artifact we will have and need to back up / manage is Docker images. The correct way to manage those is with a Docker registry. It would need to be private to Hack Oregon but that's something we can buy and it's not too hard to build if that's something the DevOps people want to do.

MikeTheCanuck (Contributor, Author) commented:
@k4y3ff Based on past seasons and other experience I have, I'd say that the vast majority of data will show up in CSV format, then MDB, "other Excel formats" and one-offs. What do you think about prioritizing "find a solution for CSV first, and then tackle the other scenarios after the CSV case is working"?

@DingoEatingFuzz We're addressing raw data uploads in #2, though I'm not sure that I explicitly called out PDF as an expected data format. (Not sure that matters, when S3 just works at the file level.)

@znmeb good thought to consider the complexity differential between S3 and other systems like Google Drive. We also want to consider the confusion of anyone trying to find data that's been inconsistently scattered across a variety of systems and platforms.

@znmeb I'll spawn off the Docker image question to a separate issue #6.

znmeb (Contributor) commented Feb 4, 2018

We are covered on CSV, MDB and xls/xlsx. PDFs that don't require optical character recognition are slightly more work but usually respond to TabulaPDF. At some point I should sit down and document all this magic in containers, though.

znmeb (Contributor) commented Feb 7, 2018

I'd like to have something functional in the next week or so - we lost a couple of person-hours tonight because I somehow managed to get two different versions of the ODOT crash data SQL dump into the wild. ;-)

An FTP server? WebDAV? All we need is some kind of file server with authentication and key management.

MikeTheCanuck (Contributor, Author) commented:
@k4y3ff and @khashf I've updated the text of the above "Requirements" - clarified each ask and formatted them as a checklist that can be tackled one by one.

@khashf Based on last night's discussion, please focus on the second Requirement above. Here are the scenarios I believe we'll need to address first for the broadest application across teams (a sketch of both follows below):

  • import a CSV file from a specified S3 bucket to an existing PostgreSQL instance.
  • import a PostgreSQL database dump from a specified S3 bucket to an existing PostgreSQL instance.

Other scenarios to follow once we have these two worked out.
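
A hedged sketch of both scenarios, assuming the AWS CLI is configured and using placeholder bucket, host, database and table names:

    #!/usr/bin/env bash
    # Sketch of the two import paths above; every name here is a placeholder.
    set -euo pipefail

    BUCKET=s3://hacko-data-archive/myproject   # placeholder S3 location
    PGHOST=db.example.internal                 # placeholder database host
    DB=myproject                               # placeholder existing database

    # Scenario 1: CSV -> existing table (the table and its columns must already exist)
    aws s3 cp "$BUCKET/clean/mytable.csv" /tmp/mytable.csv
    psql -h "$PGHOST" -d "$DB" \
         -c "\copy mytable FROM '/tmp/mytable.csv' WITH (FORMAT csv, HEADER true)"

    # Scenario 2: pg_dump custom-format backup -> existing (empty) database
    aws s3 cp "$BUCKET/dumps/mydb.backup" /tmp/mydb.backup
    pg_restore -h "$PGHOST" -d "$DB" --no-owner /tmp/mydb.backup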

znmeb (Contributor) commented Feb 10, 2018
znmeb commented Feb 10, 2018

  1. We are using PostgreSQL 10.1 and PostGIS 2.4.2 for Transportation. Any server we use needs both, and they'll have to be at least those versions.
  2. I have pg_dump files and scripts to load them in our Google Drive - hit me up on Slack for the location.
  3. The way the PostgreSQL / PostGIS Docker image works, if you put .sh or .sql script files in a specified location, they will be executed in alphabetical order the first time the image runs. That's how I made those dump files.

I need to clarify what kind of "sql" that is - I think it's psql, which means the CSV import is two lines per table: create the table to define the columns and their types, and then a \copy to load the rows (see the sketch below). You'll need to check with the users on whether you need to add a primary key to tables restored from CSV. I've been adding them when I create the dumps.
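
In other words, per table the script would look roughly like this (expressed here as shell calls to psql; the table, columns and CSV file are made up for illustration - the linked ingest.psql below is the real example):

    # Illustration of the two-lines-per-table pattern; names are hypothetical.
    psql -d ridership -c "CREATE TABLE ridership (id serial PRIMARY KEY, route text, boardings integer);"
    psql -d ridership -c "\copy ridership (route, boardings) FROM 'ridership.csv' WITH (FORMAT csv, HEADER true)"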

Here's an example: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql

znmeb (Contributor) commented Feb 15, 2018
znmeb commented Feb 15, 2018

We now have automated restores in our PostgreSQL / PostGIS images / containers. If you put a PostgreSQL pg_dump format backup file in the designated host directory before doing the Docker build, the Dockerfile will copy it onto the image and it will be restored the first time the container runs.

See https://github.com/hackoregon/data-science-pet-containers#automatic-database-restores for the details; if anyone else wants to do this I can get them started. Note that the databases we're using this for are smallish - in the 80 - 140 megabyte range. We have one coming that is much larger and I doubt if we'll be able to use this for it.
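
For anyone preparing one of these, the flow is roughly as follows (the backups/ directory, image tag and filename are placeholders; the linked README documents the actual layout):

    # Sketch: drop a pg_dump custom-format backup where the Dockerfile expects it,
    # build the image, and let the first run perform the restore. Names are placeholders.
    cp /path/to/mydb.backup backups/
    docker build -t hackoregon/postgis-with-data .
    docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=changeme hackoregon/postgis-with-data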

MikeTheCanuck (Contributor, Author) commented:
This is the message I sent out to all Data Managers today - with this, I believe we have fulfilled the purpose of this issue and it's now ready to close:

To enable the DevOps team to keep straight all the databases that will be co-located on one server, we will be prefixing all database instances with the team/project names as documented here:
#1 (comment)

That would mean, for example, that the database for passenger census would be addressable on the AWS server as transportation-systems-passenger_census.

All database backups should restore to a database that follows this naming convention. If one doesn't, we'll manually rename the database once it's restored on the server, and all calling Django code or other clients will have to be reconfigured. Please, if at all possible, follow this naming convention for the database on the box from which you'll be generating the backup that DevOps will use to restore. Here are the backup instructions that Ed and I put together:
https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-create-backup-for-new-database-creation.md
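
As a concrete illustration of that convention (the S3 prefix below is a placeholder; the linked HOWTO is authoritative):

    # Data-manager side (sketch): the source database already carries the convention
    # name, so back it up in pg_dump custom format and hand it to DevOps via S3.
    pg_dump -Fc -d transportation-systems-passenger_census \
            -f transportation-systems-passenger_census.backup
    aws s3 cp transportation-systems-passenger_census.backup \
            s3://hacko-data-archive/transportation-systems/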

We’re looking forward to bringing this centralized developer PostgreSQL service online for you shortly - please let us know which database backups we should pull from the “hacko-data-archive” S3 bucket. Once we have your first database online, I’ll communicate the IP address, database user + password and database name to each of you for your databases.

Here’s the design we’re aiming to provide:

  • a single database server - hosted in an EC2 instance, running PostgreSQL 9.6.6 (latest supported by Amazon Linux 2)
  • a single database user + password for each project (i.e. one set of credentials used to access multiple databases)
  • each database user is Owner for all databases for that project
  • backups and/or snapshots provided on some sort of semi-automated basis by the DevOps squad (not something you'll have to get involved in) - TBD as we haven't fleshed out that infrastructure yet
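
Once a project's first database is online, connecting with the per-project credentials would look something like this (IP address, role name and password are placeholders - the real values will be communicated per project):

    # Placeholder host, user and password; '\dt' just lists tables as a smoke test.
    PGPASSWORD='<password from DevOps>' \
      psql -h 203.0.113.10 -U transportation-systems \
           -d transportation-systems-passenger_census -c '\dt'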
