Add single facility submission endpoint #896
User accounts will need to be added to the `can_submit_facility` group in order to use the single facility submission endpoint.
Implementing the previously unused create action on the facilities resource is a natural fit for submitting a single facility record. This commit implements the required authentication but none of the actual logic.
The single facility create endpoint requires POSTing a JSON body with `name`, `address`, and `country` fields. We reuse the existing validation function from list processing to ensure that the submitted `country` can be converted to a country code.
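A minimal sketch of the validation described above, assuming a small hypothetical lookup table in place of the project's existing list-processing validator. The names `COUNTRY_CODES`, `get_country_code`, and `validate_submission` are illustrative and not taken from the codebase.

```python
# Hypothetical lookup table; the real project reuses its list-processing
# validation function, which is not shown in this PR conversation.
COUNTRY_CODES = {"united states": "US", "us": "US", "china": "CN", "cn": "CN"}

REQUIRED_FIELDS = ("name", "address", "country")


def get_country_code(country):
    """Return a two-letter country code, or raise ValueError if unknown."""
    code = COUNTRY_CODES.get(country.strip().lower())
    if code is None:
        raise ValueError(f"Could not find a country code for '{country}'.")
    return code


def validate_submission(body):
    """Return cleaned data, or raise ValueError for the first problem found."""
    for field in REQUIRED_FIELDS:
        value = body.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field}' is required.")
    return {
        "name": body["name"].strip(),
        "address": body["address"].strip(),
        "country_code": get_country_code(body["country"]),
    }
```

Raising on the first missing field keeps the sketch short; a real view would more likely collect all errors and return them in a 400 response.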
Handles parsing and setting the default values for the `create` and `public` boolean query string arguments.
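The parsing could look roughly like the sketch below. It assumes both `create` and `public` default to true when omitted, which is inferred from the rest of this conversation rather than stated in the commit, and the helper names are hypothetical.

```python
def parse_bool_param(query_params, name, default):
    """Read a boolean query string argument, falling back to a default."""
    raw = query_params.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"'{name}' must be 'true' or 'false', got '{raw}'.")


def parse_submission_options(query_params):
    """Return the create/public flags (assumed defaults: both True)."""
    return {
        "create": parse_bool_param(query_params, "create", True),
        "public": parse_bool_param(query_params, "public", True),
    }
```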
It's a best practice to use string constants rather than literals when referring to a string value in multiple places. It is safe to edit the migration since we are not changing any actual logic, just extracting the string constant.
User accounts must belong to the `can_submit_private_facility` group in order to submit facilities to be displayed without a specific contributor affiliation.
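The two group checks can be sketched in plain Python as follows. The group names come from the commits above and are held in string constants; the function name and the assumption that a private submission requires membership in both groups are illustrative only. The testing instructions in this PR expect a 403 when `public=false` is passed without the private group.

```python
# Group names from the commit messages, kept as string constants.
SUBMIT_GROUP = "can_submit_facility"
PRIVATE_GROUP = "can_submit_private_facility"


def can_submit(user_groups, public=True):
    """Return True if a user in `user_groups` may make this submission.

    Assumption: a private submission (public=False) needs both groups.
    """
    groups = set(user_groups)
    if SUBMIT_GROUP not in groups:
        return False
    if not public and PRIVATE_GROUP not in groups:
        return False
    return True
```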
In order to support quickly matching individual list items we needed a way of holding a trained and indexed model in memory and keeping it updated as the facility data changed. The new `GazetteerCache` class implements this by holding a trained model in a class variable and reading from the `HistoricalFacility` table to determine whether items need to be added or removed from the index. The `GazetteerCache` includes a threading lock to ensure that only one thread can ever be updating the cached gazetteer. Although the management command-based matching uses the new `GazetteerCache`, the actual behavior has not changed, since starting a process from scratch will result in training a new model, as we have always done. As part of this refactor we have also addressed a problem with single item CSV uploads causing the match process to run out of memory. We address this by changing the way models are trained. Instead of using the input file, we use 20% of the submitted list items as our "dirty" data.
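The caching approach can be illustrated with a simplified sketch. It is not the actual `GazetteerCache`: `StubGazetteer` stands in for a trained Dedupe model, the module-level `HISTORY` list stands in for the `HistoricalFacility` table, and the `"+"`/`"-"` change markers are an assumption about how history rows are typed.

```python
import threading

# Stand-in for the HistoricalFacility table. Each row is
# (history_id, change_type, facility_id, record); "-" marks a deletion.
HISTORY = []


class StubGazetteer:
    """Hypothetical stand-in for a trained and indexed Dedupe model."""

    def __init__(self):
        self.indexed = {}

    def index(self, records):
        self.indexed.update(records)

    def unindex(self, facility_ids):
        for facility_id in facility_ids:
            self.indexed.pop(facility_id, None)


class GazetteerCacheSketch:
    _lock = threading.Lock()   # only one thread may update the cache
    _gazetteer = None          # the model lives in a class variable
    _latest_history_id = 0     # last history row already applied

    @classmethod
    def get_latest(cls):
        with cls._lock:
            if cls._gazetteer is None:
                # Training a fresh model would happen here.
                cls._gazetteer = StubGazetteer()
            for history_id, change_type, facility_id, record in HISTORY:
                if history_id <= cls._latest_history_id:
                    continue  # already reflected in the index
                if change_type == "-":
                    cls._gazetteer.unindex([facility_id])
                else:
                    cls._gazetteer.index({facility_id: record})
                cls._latest_history_id = history_id
            return cls._gazetteer
```

Because the model is stored on the class, every call within one process reuses the same instance and only applies history rows it has not seen yet. A new process starts with `_gazetteer` unset and so trains from scratch.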
@hectcastro There is an ops consideration regarding the Based on the research documented in #702, matching facilities in the length of a web request requires holding a trained and indexed Dedupe model in memory. My solution is to keep the model in a class variable, control access to it with a class method, and use the This will require expanding the RAM available to our app containers as the Facility dataset grows. My experiments with 100000 facilities show that it will fit within 16GB.
Starting to read up on this now.
I pushed an additional commit that adds some swagger documentation. There are some known shortcomings, including the "Try it out" form returning CSRF errors. I want to make resolving that a separate issue.
+1 went through all the tests and they work as described. Will give the code a quick look in a bit.
The only minor suggestion I'd make is that when successfully POSTing without `create=false`, we should send back a 201 instead of a 200, to better communicate that something was created. Not sure how big of a change that would be.
That's a good idea. Should be a simple change.
Implemented in fixup 8841b18
So, the Fargate pricing page has a pricing example that reads:
Following the math there for our current setup leads to:
That roughly checks out via the Cost Explorer if I group by costs for usage type If we went up to 16GB for production and played it more conservatively (2GB) for staging, it leads to:
This change appears as though it would increase our overall bill by about a third of its current cost.
Thanks for documenting those price calculations. Good to see that there is a basically linear relationship between resources and dollars.
+1 read through the code. Everything is very cleanly written and very well separated, easy to follow. Exemplary work! Reminder to squash the fixups before merging.
This commit adds the matching and persistence logic to the view function. The flow of the view is based on the batch processing pipeline, but all three steps are completed at once. Each stage is wrapped in an individual try block so we can log unexpected failures similar to the batch process. The "parse" step is combined with the saving of the initial `Source` and `FacilityListItem` records, since the data fields are submitted "pre-parsed." We create `Source` and `FacilityListItem` objects even when the user passes the `create=false` option because we still want to track the submission of the data in the `Source` table and the performance of the matching in the `FacilityListItem.processing_results` field. We were able to reuse the `save_match_details` function used by the batch processing. The only change we needed to make was to wrap the model save code with a check of the `Source.create` property. We needed to adjust the facility delete method because submitting high-confidence matches with `create=false` was creating `FacilityListItem` rows with foreign key references to the `Facility`, but not a corresponding `FacilityMatch` row.
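The shape of that flow, with each stage in its own try block, might look like the sketch below. The stage names follow the `parse`, `geocode`, and `match` actions of the batch process; the dictionary layout and function names are hypothetical and the real view works with `Source` and `FacilityListItem` models instead.

```python
def run_stage(item, stage_name, stage_fn):
    """Run one stage and record its outcome in processing_results."""
    try:
        stage_fn(item)
    except Exception as exc:  # unexpected failures are recorded, not raised
        item["processing_results"].append(
            {"action": stage_name, "error": True, "message": str(exc)}
        )
        return False
    item["processing_results"].append({"action": stage_name, "error": False})
    return True


def process_single_item(item, parse, geocode, match):
    """Run all three stages in one call, stopping at the first failure."""
    item.setdefault("processing_results", [])
    for name, fn in (("parse", parse), ("geocode", geocode), ("match", match)):
        if not run_stage(item, name, fn):
            break
    return item
```

Recording a result per stage mirrors the idea of tracking matching performance in `FacilityListItem.processing_results`, even for submissions that create nothing.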
This adds some example request and response data and documents the available query parameters. There are known issues:
- Requests made with the "Try it out" form return a CSRF error.
- The "data" parameter has an `undefined` data type. It was unclear how to properly define the data type within the `coreapi.Field` object.
- The return status code is listed as 201, but this endpoint does not always create objects and sometimes returns 200. It was unclear how to change the response section.
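The status code behavior mentioned in the last item can be summarized in a small sketch, assuming the rule discussed in review: 201 when the request was allowed to create records, 200 when `create=false` was passed. The function name is hypothetical.

```python
def response_status(create):
    """Pick the success status: 201 if records may be created, else 200."""
    return 201 if create else 200
```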
Thanks for the review.
Overview
Add single facility submission endpoint.
Connects #820
Connects #872
Connects #873
Demo
Match response
Potential match response
New facility response
No match and geocode returned no results response
Notes
We are introducing 2 new groups/waffle flags:
- `can_submit_facility` controls access to the single-item submission endpoint.
- `can_submit_private_facility` allows passing a `public=false` query string argument.
Implementing the previously unused `create` action on the facilities resource is a natural fit for submitting a single facility record.
The endpoint requires POSTing a JSON body with `name`, `address`, and `country` fields. We reuse the existing validation function from list processing to ensure that the submitted `country` can be converted to a country code.
In order to support quickly matching individual list items we needed a way of holding a trained and indexed model in memory and keeping it updated as the facility data changed. The new `GazetteerCache` class implements this by holding a trained model in a class variable and reading from the `HistoricalFacility` table to determine whether items need to be added or removed from the index.
The `GazetteerCache` includes a threading lock to ensure that only one thread can ever be updating the cached gazetteer.
Although the management command-based matching uses the new `GazetteerCache`, the actual behavior has not changed, since starting a process from scratch will result in training a new model, as we have always done.
As part of this refactor we have also addressed a problem with single item CSV uploads causing the match process to run out of memory. We address this by changing the way models are trained. Instead of using the input file, we use 20% of the submitted list items as our "messy" data.
The flow of the `create` view is based on the batch processing pipeline, but all three steps are completed at once. Each stage is wrapped in an individual try block so we can report unexpected failures similar to the batch process.
The "parse" step is combined with the saving of the initial `Source` and `FacilityListItem` records, since the data fields are submitted "pre-parsed." We create `Source` and `FacilityListItem` objects even when the user passes the `create=false` option because we still want to track the submission of the data in the `Source` table and the performance of the matching in the `FacilityListItem.processing_results` field.
We were able to reuse the `save_match_details` function used by the batch processing. The only change we needed to make was to wrap the model save code with a check of the `Source.create` property.
We needed to adjust the facility delete method because submitting high-confidence matches with `create=false` was creating `FacilityListItem` rows with foreign key references to the `Facility`, but not a corresponding `FacilityMatch` row.
Testing Instructions
Setup / Regression test batch processing
`./scripts/manage migrate` and `./scripts/resetdb`. Verify that `resetdb` matches models without error and that the dev data is matched correctly.
`c8@example.com`, browse http://localhost:6543/lists, and verify that there are some pending matches.
`./scripts/manage batch_process --list-id 16 --action parse`
`./scripts/manage batch_process --list-id 16 --action geocode`
`./scripts/manage batch_process --list-id 16 --action match`
Test single item submission
`export OAR_API_TOKEN={token}`
`./scripts/manage shell_plus` and add the user to the group that allows single-item submission
`NEW_FACILITY`.
`POTENTIAL_MATCH`
`MATCHED`
`create=false` argument, verify a successful response, and verify that the "Azavea" facility is now searchable http://localhost:6543/facilities?q=azavea
`create=false`, verify a successful response, and verify that the contributor with `"id": 8` is listed in the `contributors` array.
`public=false` and verify that a 403 is returned
`./scripts/manage shell_plus` and add the user to the group that allows private submission
`public=false` and verify a successful response with a status of `NEW_FACILITY`
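For the single item submission steps, a request could be assembled along the lines of the sketch below. The `/api/facilities/` path and the `Token` authorization scheme are assumptions not confirmed in this PR; the sample facility data is made up. Only `OAR_API_TOKEN`, the JSON field names, and the `create`/`public` arguments come from the text above.

```python
import json
import os
import urllib.parse
import urllib.request


def build_submission_request(base_url, token, facility, create=True, public=True):
    """Build (but do not send) a POST request for a single facility."""
    query = urllib.parse.urlencode(
        {"create": str(create).lower(), "public": str(public).lower()}
    )
    # Assumed endpoint path; adjust to the actual route.
    url = f"{base_url}/api/facilities/?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(facility).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed token scheme (Django REST framework style).
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


request = build_submission_request(
    "http://localhost:6543",
    os.environ.get("OAR_API_TOKEN", ""),
    {"name": "Azavea", "address": "123 Example St", "country": "US"},
    create=False,
)
# To send it against a running dev server:
# with urllib.request.urlopen(request) as response:
#     print(response.status, json.load(response))
```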
Checklist
`fixup!` commits have been squashed