Research solution for loading project data to new PostgreSQL database instances #3
Comments
I can tell you what I'm doing (manually) at the moment for Transportation Systems. It clearly won't scale, but at this point I don't know that we're going to get a huge amount of data.

I started with the official Postgres image from the Docker store (https://store.docker.com/images/postgres). Pros: official, supported, tracks updates to the underlying software. Cons: it is non-trivial to configure, and the documentation is difficult to decipher. We'd have to write our own; I'm willing to do that.

Because we're doing GIS things, I augmented that image with PostGIS. That image currently lives at https://hub.docker.com/r/znmeb/postgis/. It's an autobuild; whenever I push to master in GitHub or the base official PostgreSQL image changes, it gets rebuilt on Docker Hub.

Once that's up and running via Docker Compose, I write code to insert our raw data into the database and take a dump file. It's usually easy; the only tricky one so far has been a database we received in Microsoft Access MDB format.

At run time, I make a Docker image with the database dump in the right place. When the container starts, the magic inside the Postgres image restores the database dump, and the container is listening on port 5432 just like any Postgres server would.

To be determined at deployment time: where the data filesystems are stored. In one case (https://github.com/hackoregon/postgis-geocoder-test/tree/master/Docker) the files were large, so I mounted host filesystems into the container. In the other (https://github.com/hackoregon/crash-data-wrangling/tree/master/Docker) I just put everything in the image, since it was only 70 megabytes bigger than the raw image, which was about 450 megabytes.
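For concreteness, the "mount host filesystems into the container" variant looks roughly like this. It's a sketch only - the container name, password, and host path are made up, and it assumes the znmeb/postgis image keeps the official postgres image's POSTGRES_PASSWORD and data-directory conventions:

```bash
# Sketch: run the PostGIS image with the data directory kept on the host.
# Container name, password, and host path are illustrative only.
docker run -d \
  --name transportation-db \
  -e POSTGRES_PASSWORD=changeme \
  -p 5432:5432 \
  -v /srv/pgdata:/var/lib/postgresql/data \
  znmeb/postgis
```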
Aside from CSVs and MDBs, are there other formats we frequently receive data in? Not sure who, exactly, is the best person to ask. @MikeTheCanuck? @znmeb, since you've responded?
As far as I know Transportation Systems will only be receiving MDBs and CSVs. However, other teams are getting GIS data (usually shapefiles) and may be importing data directly from APIs. We will probably also be importing Census data but I'm hoping we can get a group of people from all the teams to co-ordinate that. @BrianHGrant is the data manager.
We have also received Excel sheets of varying machine readability and PDFs that plain aren't machine readable. For these files, there's no expectation that we can automagically import them into a db. As mentioned in our devops meeting, I'd like to get into the habit of backing up all data as we receive it (even if that's a PDF) to the cloud. Then humans can use those files to create clean CSVs which would in turn be imported into some db.
@DingoEatingFuzz Excel sheets? PDFs? Bring it on! They have to be dealt with one at a time, but the processes can be automated once we validate them with the providers of the data.

Backups? Unless we've got a budget constraint I think we should use Google Drive for data files; make our existing Google Drive as big as it needs to be. I'm extremely leery of raw AWS / S3 for storage - it's an extra cognitive load for people who are primarily application developers.

The other artifact we will have and need to back up / manage is Docker images. The correct way to manage those is with a Docker registry. It would need to be private to Hack Oregon, but that's something we can buy, and it's not too hard to build if that's something the DevOps people want to do.
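As a sketch of what registry-based image management looks like (the registry hostname here is purely hypothetical, standing in for whatever private registry we end up using):

```bash
# Illustration only: publish a team image to a private registry,
# then pull it from any machine that can authenticate to that registry.
docker tag znmeb/postgis registry.example.org/hackoregon/postgis:latest
docker push registry.example.org/hackoregon/postgis:latest
docker pull registry.example.org/hackoregon/postgis:latest
```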
@k4y3ff Based on past seasons and other experience I have, I'd say that the vast majority of data will show up in CSV format, then MDB, "other Excel formats" and one-offs. What do you think about prioritizing "find a solution for CSV first, and then tackle the other scenarios after the CSV case is working"?

@DingoEatingFuzz We're addressing raw data uploads in #2, though I'm not sure that I explicitly called out PDF as an expected data format. (Not sure that matters, when S3 just works at the file level.)

@znmeb Good thought to consider the complexity differential between S3 and other systems like Google Drive. We also want to consider the confusion of anyone trying to find data that's been inconsistently scattered across a variety of systems and platforms.

@znmeb I'll spawn off the Docker image question to a separate issue #6.
We are covered on CSV, MDB and xls/xlsx. PDFs that don't require optical character recognition are slightly more work but usually respond to TabulaPDF. At some point I should sit down and document all this magic in containers, though.
I'd like to have something functional in the next week or so - we lost a couple of person-hours tonight because I somehow managed to get two different versions of the ODOT crash data SQL dump into the wild. ;-) An FTP server? WebDAV? All we need is some kind of fileserver with authentication and key management.
@k4y3ff and @khashf I've updated the text of the above "Requirements" - clarified each ask and formatted them as a checklist that can be tackled one by one. @khashf Based on last night's discussion, please focus on the second Requirement above. Here are the scenarios I believe we'll need to address first for the broadest application across teams:
Other scenarios to follow once we have these two worked out.
I need to clarify what kind of "sql" that is - I think it's PSQL, which means the CSV import is two lines per table: a CREATE TABLE to define the columns and their types, and then a \copy to load the CSV file. Here's an example: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql
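To make the two-lines-per-table point concrete, a hypothetical import might look like this. The table layout and file name are invented (the real script is the ingest.psql linked above), and it assumes the target database already exists:

```bash
# Line 1: define the columns and their types.
psql -d ridership -c "
  CREATE TABLE ridership (
    route_id   integer,
    route_name text,
    year       integer,
    boardings  integer
  );"
# Line 2: bulk-load the CSV into that table.
psql -d ridership -c "\copy ridership FROM 'ridership.csv' WITH CSV HEADER"
```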
We now have automated restores in our PostgreSQL / PostGIS images / containers. If you put a PostgreSQL dump file in the right place, it gets restored automatically when the container starts. See https://github.com/hackoregon/data-science-pet-containers#automatic-database-restores for the details; if anyone else wants to do this I can get them started. Note that the databases we're using this for are smallish - in the 80 - 140 megabyte range. We have one coming that is much larger and I doubt we'll be able to use this for it.
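For anyone who hasn't seen the pattern before, here's a rough sketch of the general idea using the official postgres image's /docker-entrypoint-initdb.d convention (scripts and .sql/.sql.gz files there run on first startup). The Hack Oregon images have their own restore mechanism, so the README linked above is the authoritative reference; the database name and password below are placeholders:

```bash
# Take a plain-SQL dump and compress it.
pg_dump -d crash_data | gzip > crash_data.sql.gz

# Mount the dump where the official image's init scripts will pick it up
# on first startup of a fresh data directory.
docker run -d \
  -e POSTGRES_PASSWORD=changeme \
  -p 5432:5432 \
  -v "$PWD/crash_data.sql.gz:/docker-entrypoint-initdb.d/crash_data.sql.gz:ro" \
  postgres
```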
This is the message I sent out to all Data Managers today - with this, I believe we have fulfilled the purpose of this issue and it's now ready to close:
Very soon after project teams have distributed (#2) and wrangled their initial project data, they will ask us to host a PostgreSQL-compatible database instance in our AWS cloud.
We need to work out a procedure to automate the creation of new PostgreSQL instances into which cleaned data can be loaded. This will have to be performed at least once for each of the five project teams, and it would be reasonable to expect to have to perform this task more often as the season progresses.
Assume that:
Requirements
Here are my initial thoughts on what is required to successfully provide the "new database" service:
The steps in this procedure should ultimately be documented so that any project team can understand their prerequisites and what happens once their "new database" request is initiated. Let's hack some initial docs up quickly, and use our first couple of rounds of the actual fulfilment to refine the docs.
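As a very rough illustration, the automated per-team step might boil down to something like the following. The role, database, and extension choices are placeholders rather than agreed requirements, and it assumes the commands run with superuser privileges against the target instance:

```bash
# Create a non-superuser role to own the team's database.
createuser --no-superuser --no-createrole transportation_admin

# Create the team database owned by that role.
createdb --owner transportation_admin transportation

# Enable PostGIS for teams that need GIS support.
psql -d transportation -c "CREATE EXTENSION IF NOT EXISTS postgis;"
```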