Skip to content
This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Add single facility submission endpoint #896

Merged
merged 10 commits into from Oct 31, 2019
Merged

Conversation

jwalgran
Copy link
Contributor

@jwalgran jwalgran commented Oct 25, 2019

Overview

Add single facility submission endpoint.

Connects #820
Connects #872
Connects #873

Demo

Match response

{
  "matches": [
    {
      "id": "CN2019303BQ3FZP",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          120.596047,
          32.172013
        ]
      },
      "properties": {
        "name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
        "address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong",
        "country_code": "CN",
        "oar_id": "CN2019303BQ3FZP",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 4,
            "name": "Researcher A (Summer 2019 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "China",
        "claim_info": null,
        "other_locations": []
      },
      "confidence": 0.8153
    }
  ],
  "item_id": 964,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      120.596047,
      32.172013
    ]
  },
  "geocoded_address": "Guoyuanzhen, Rugao, Nantong, Jiangsu, China",
  "status": "MATCHED",
  "oar_id": "CN2019303BQ3FZP"
}

Potential match response

{
  "matches": [
    {
      "id": "CN2019303BQ3FZP",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          120.596047,
          32.172013
        ]
      },
      "properties": {
        "name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
        "address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong",
        "country_code": "CN",
        "oar_id": "CN2019303BQ3FZP",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 4,
            "name": "Researcher A (Summer 2019 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "China",
        "claim_info": null,
        "other_locations": []
      },
      "confidence": 0.7686
    }
  ],
  "item_id": 959,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      120.596047,
      32.172013
    ]
  },
  "geocoded_address": "Guoyuanzhen, Rugao, Nantong, Jiangsu, China",
  "status": "POTENTIAL_MATCH"
}

New facility response

{
  "matches": [],
  "item_id": 954,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      119.2221539,
      33.79772
    ]
  },
  "geocoded_address": "30, 32 Yanhuang Ave, Lianshui Xian, Huaian Shi, Jiangsu Sheng, China, 223402",
  "status": "NEW_FACILITY"
}

No match and geocode returned no results response

{
  "matches": [],
  "item_id": 965,
  "geocoded_geometry": null,
  "geocoded_address": null,
  "status": "ERROR_MATCHING"
}

Notes

We are introducing 2 new groups/waffle flags

  • can_submit_facility controls access to the single-item submission endpoint.
  • can_submit_private_facility allows passing a public=false query string argument.

Implementing the previously unused create action on the facilities resource is a natural fit for submitting a single facility record.

The endpoint requires POSTing a JSON body with name, address, and country fields. We reuse the existing validation function from list processing to ensure that the submitted country can be converted to acountry code.

In order to support quickly matching individual list items we needed a way of holding a trained and indexed model in memory and keeping it updated as the facility data changed. The new GazetteerCache class implements this by holding a trained model in a class variable and reading from the HistoricalFacility table to determine whether items need to be added or removed from the index.

The GazetteerCache includes a threading lock to ensure that only one thread can ever be updating the cached gazetteer.

Although the management command-based matching uses the new GazetteerCache the actual behavior has not changed, since starting a process from scratch will result in training a new model, as we have always done.

As part of this refactor we have also addressed a problem with single item CSV uploads causing the match process run out of memory. We address this by changing the way models are trained. Instead of using the input file, we use 20% of the submitted list items as our "messy" data.

The flow of the create view is based on the batch processing pipeline, but all three steps are completed at once. Each stage is wrapped in an individual try block so we can report unexpected failures similar to the batch process.

The "parse" step is combined with the saving of the initial Source and FacilityListItem records, since the data fields are submitted "pre-parsed." We create Source and FacilityListItem objects even when the user passes the create=false option because we still want to track the submission of the data in the Source table and the performance of the matching in the FacilityListItem.processing_results field.

We were able to reuse the save_match_details function used by the batch processing. The only change we needed to make was to wrap the model save code with a check of the Source.create property.

We needed to adjust the facility delete method because submitting high-confidence matches with create=false was creating FacilityListItem rows with foreign key references to the Facility, but not a corresponding FacilityMatch row.

Testing Instructions

Setup / Regression test batch processing

  • Run ./scripts/manage migrate and ./scripts/resetdb. Verify that resetdb matches models without error and that the dev data is matched correctly.
  • Log in as c8@example.com, browse http://localhost:6543/lists, and verify that there are some pending matches.
  • Submit and process other-location-test.csv.zip. Verify that the steps complete without error
    • ./scripts/manage batch_process --list-id 16 --action parse
    • ./scripts/manage batch_process --list-id 16 --action geocode
    • ./scripts/manage batch_process --list-id 16 --action match
  • Browse http://localhost:6543/lists/16 and verify that there are matches

Test single item submission

  • Still logged in as c8@example.com, browse "My Profile" and create an API token"
  • On the Vagrant VM, add the token to your shell environemnt: export OAR_API_TOKEN={token}
  • Post to the endpoint and verify that it returns 401 / requires auth.
http POST http://localhost:8081/api/facilities/?create=false 
  • Post to the endpoint with auth and verify that it returns 403
http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN"
  • Run ./scripts/manage shell_plus and add the user to the group that allows single-item submission
from django.contrib import auth
g = auth.models.Group.objects.get(name='can_submit_facility')
u = User.objects.get(email='c8@example.com')
u.groups.add(g)
u.save()
  • Post to the endpoint and verify that it returns 400 because a JSON body was not included
http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" 
  • POST with a body and verify a successful response with a status of NEW_FACILITY.
http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "US",
  "name": "Azavea",
  "address": "990 Spring Garden Street, Philadelphia"
}'
  • POST a near match and verify a successful response with a status of POTENTIAL_MATCH
http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
	"country": "China",
	"name": "Nantong Jackbeanie Headwear & Garment Co. Ltd.",
	"address": "No.808,the third industry park,Guoyuan Town,Nantong 226500."
}'
  • POST an exact match and verify a successful response with a status of MATCHED
http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
	"country": "China",
	"name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
	"address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong"
}'
http POST http://localhost:8081/api/facilities/ \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "US",
  "name": "Azavea",
  "address": "990 Spring Garden Street, Philadelphia"
}'
  • POST a match without create=false, verify a successful response, and verify that the contributor with "id": 8 is listed in the contributors array.
http POST http://localhost:8081/api/facilities/ \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Bangladesh",
  "name": "AKH Shirts Ltd.",
  "address": "Savar Dhaka 1340 Bangladesh"
}'
  • Attempt to POST a facility with public=false and verify that a 403 is returned
http POST http://localhost:8081/api/facilities/?public=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Vietnam",
  "name": "28 Hung Phu",
  "address": "No. 168 Quang Trung Street Ward 10 Go Vap District, Hochi Minh"
}'
  • Run ./scripts/manage shell_plus and add the user to the group that allows private submission
from django.contrib import auth
g = auth.models.Group.objects.get(name='can_submit_private_facility')
u = User.objects.get(email='c8@example.com')
u.groups.add(g)
u.save()
  • Repeat the POST with public=false and verify a successful response with a status of NEW_FACILITY
http POST http://localhost:8081/api/facilities/?public=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Vietnam",
  "name": "28 Hung Phu",
  "address": "No. 168 Quang Trung Street Ward 10 Go Vap District, Hochi Minh"
}'

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

@jwalgran jwalgran force-pushed the jcw/submit_single_facility branch 5 times, most recently from c0f66d3 to 7ab5ec8 Compare October 30, 2019 05:26
User accounts will need to be added to the `can_submit_facility` group in order
to use the single facility submission endpoint.
Implementing the previously unused create action on the facilities resource is a
natural fit for submitting a single facility record. This commit implements the
required authentication but none of the actual logic.
The single facility create endpoint requires POSTing a JSON body with `name`,
`address`, and `country` fields. We reuse the existing validation function from
list processing to ensure that the submitted `country` can be converted to a
country code.
Handles parsing and setting the default values for the `create` and `public`
boolean query string arguments.
It's a best practice to use string constants rather than literals when referring
to a string value in multiple places.

It is safe to edit the migration since we are not changing any actual logic,
just extracting the string constant.
User accounts must belong to the `can_submit_private_facility` group in order to
submit facilities to be displayed without a specific contributor affiliation.
In order to support quickly matching individual list items we needed a way of
holding a trained and indexed model in memory and keeping it updated as the
facility data changed. The new `GazetteerCache` class implements this by holding
a trained model in a class variable and reading from the `HistoricalFacility`
table to determine whether items need to be added or removed from the index.

The `GazetteerCache` includes a threading lock to ensure that only one thread
can ever be updating the cached gazetteer.

Although the management command-based matching uses the new `GazetteerCache` the
actual behavior has not changed, since starting a process from scratch will
result in training a new model, as we have always done.

As part of this refactor we have also addressed a problem with single item CSV
uploads causing the match process run out of memory. We address this by changing
the way models are trained. Instead of using the input file, we use 20% of the
submitted list items as our "dirty" data.
@jwalgran jwalgran marked this pull request as ready for review October 30, 2019 16:40
@jwalgran
Copy link
Contributor Author

@hectcastro There is an ops consideration regarding the GazetteerCache introduced in this PR that I would like your commentary on.

https://github.com/open-apparel-registry/open-apparel-registry/blob/876ef29a40f485a1073c1925a4c64cd55b320dce/src/django/api/matching.py#L360-L421

Based on the research documented in #702, matching facilities in the length of a web request requires holding a trained and index Dedupe model in memory. My solution is to keep model in a class variable, control access to it with a class method, and use the HistoricalFacility table to freshen the index with any Facility records that have been modified since the last time the model was used.

This will require expanding the RAM available to our app containers as the Facility dataset grows. My experiments with 100000 facilities show that it will fit within 16GB.

@rajadain
Copy link
Contributor

Starting to read up this now.

@jwalgran
Copy link
Contributor Author

I pushed an additional commit that adds some swagger documentation. There are some known shortcomings, including the "Try it out form" returning CSRF errors. I want to make resolving that a separate issue.

Copy link
Contributor

@rajadain rajadain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 went through all the tests and they work as described. Will give the code a quick look in a bit.

The only minor suggestion I'd make is that when successfully POSTing without create=false, we should send back a 201 instead of a 200, to better communicate that something was created. Not sure how big of a chance that would be.

@jwalgran
Copy link
Contributor Author

we should send back a 201 instead of a 200, to better communicate that something was created.

That's a good idea. Should be a simple change.

@jwalgran
Copy link
Contributor Author

we should send back a 201 instead of a 200, to better communicate that something was created.

Implemented in fixup 8841b18

@hectcastro
Copy link
Contributor

So, the Fargate pricing page has a pricing example that reads:

Total memory charges = # of Tasks x memory in GB x price per GB x memory duration per day (seconds) x # of days

Total memory charges = 5 x 2 x 0.000001235 x 600 x 30 = $0.22

Following the math there for our current setup leads to:

Staging: 1 * .5 * 0.000001235 * 86400 * 30 = ~$1.60
Production: 2 * 2 * 0.000001235 * 86400 * 30 = ~$12.80

Total: ~$14.40

That roughly checks out via the Cost Explorer if I group by costs for usage type EU-Fargate-GB-Hours (Hrs).

If we went up to 16GB for production and played it more conservatively (2GB) for staging, it leads to:

Staging: 1 * 2 * 0.000001235 * 86400 * 30 = ~$6.40
Production: 2 * 16 * 0.000001235 * 86400 * 30 = ~$102.40

Total: $108.80

This change appears as though it would increase our overall bill by about a third of its current cost.

@jwalgran
Copy link
Contributor Author

Thanks for dpcumenting those price calculations. Good to see that there is a basically linear relationship between resources and dollars.

Copy link
Contributor

@rajadain rajadain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 read through the code. Everything is very cleanly written and very well separated, easy to follow. Exemplary work! Reminder to squash the fixups before merging.

@rajadain rajadain assigned jwalgran and unassigned rajadain Oct 31, 2019
This commit adds the matching and persistence logic to the view function.

The flow of the view is based on the batch processing pipeline, but all three
steps are completed at once. Each stage is wrapped in an individual try block so
we can log unexpected failures similar to the batch process.

The "parse" step is combined with the saving of the initial `Source` and
`FacilityListItem` records, since the data fields are submitted "pre-parsed." We
create `Source` and `FacilityListItem` objects even when the user passes the
`create=false` option because we still want to track the submission of the data
in the `Source` table and the performance of the matching in the
`FacilityListItem.processing_results` field.

We were able to reuse the `save_match_details` function used by the batch
processing. The only change we needed to make was to wrap the model save code
with a check of the `Source.create` property.

We needed to adjust the facility delete method because submitting
high-confidence matches with `create=false` was creating `FacilityListItem` rows
with foreign key references to the `Facility`, but not a corresponding
`FacilityMatch` row.
This adds some example request and response data and documents the available
query parameters. There are known issues:

- Requests made with the "Try it out" form return a CSRF error.
- The "data" parameter has an `undefined` data type. It was unclear how to
  properly define the data type within the `coreapi.Field` object.
- The return status code is listed as 201, but this endpoint does not always
  create objects and sometimes returns 200. It was unclear how to change the
  response section.
@jwalgran
Copy link
Contributor Author

Thanks for the review.

@jwalgran jwalgran merged commit ef1e928 into develop Oct 31, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants