Add single facility submission endpoint #896

jwalgran · 2019-10-25T21:34:55Z

Overview

Add single facility submission endpoint.

Connects #820
Connects #872
Connects #873

Demo

Match response

{
  "matches": [
    {
      "id": "CN2019303BQ3FZP",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          120.596047,
          32.172013
        ]
      },
      "properties": {
        "name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
        "address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong",
        "country_code": "CN",
        "oar_id": "CN2019303BQ3FZP",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 4,
            "name": "Researcher A (Summer 2019 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "China",
        "claim_info": null,
        "other_locations": []
      },
      "confidence": 0.8153
    }
  ],
  "item_id": 964,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      120.596047,
      32.172013
    ]
  },
  "geocoded_address": "Guoyuanzhen, Rugao, Nantong, Jiangsu, China",
  "status": "MATCHED",
  "oar_id": "CN2019303BQ3FZP"
}

Potential match response

{
  "matches": [
    {
      "id": "CN2019303BQ3FZP",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          120.596047,
          32.172013
        ]
      },
      "properties": {
        "name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
        "address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong",
        "country_code": "CN",
        "oar_id": "CN2019303BQ3FZP",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 4,
            "name": "Researcher A (Summer 2019 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "China",
        "claim_info": null,
        "other_locations": []
      },
      "confidence": 0.7686
    }
  ],
  "item_id": 959,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      120.596047,
      32.172013
    ]
  },
  "geocoded_address": "Guoyuanzhen, Rugao, Nantong, Jiangsu, China",
  "status": "POTENTIAL_MATCH"
}

New facility response

{
  "matches": [],
  "item_id": 954,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      119.2221539,
      33.79772
    ]
  },
  "geocoded_address": "30, 32 Yanhuang Ave, Lianshui Xian, Huaian Shi, Jiangsu Sheng, China, 223402",
  "status": "NEW_FACILITY"
}

No match and geocode returned no results response

{
  "matches": [],
  "item_id": 965,
  "geocoded_geometry": null,
  "geocoded_address": null,
  "status": "ERROR_MATCHING"
}

Notes

We are introducing 2 new groups/waffle flags

can_submit_facility controls access to the single-item submission endpoint.
can_submit_private_facility allows passing a public=false query string argument.

Implementing the previously unused create action on the facilities resource is a natural fit for submitting a single facility record.

The endpoint requires POSTing a JSON body with name, address, and country fields. We reuse the existing validation function from list processing to ensure that the submitted country can be converted to acountry code.

In order to support quickly matching individual list items we needed a way of holding a trained and indexed model in memory and keeping it updated as the facility data changed. The new GazetteerCache class implements this by holding a trained model in a class variable and reading from the HistoricalFacility table to determine whether items need to be added or removed from the index.

The GazetteerCache includes a threading lock to ensure that only one thread can ever be updating the cached gazetteer.

Although the management command-based matching uses the new GazetteerCache the actual behavior has not changed, since starting a process from scratch will result in training a new model, as we have always done.

As part of this refactor we have also addressed a problem with single item CSV uploads causing the match process run out of memory. We address this by changing the way models are trained. Instead of using the input file, we use 20% of the submitted list items as our "messy" data.

The flow of the create view is based on the batch processing pipeline, but all three steps are completed at once. Each stage is wrapped in an individual try block so we can report unexpected failures similar to the batch process.

The "parse" step is combined with the saving of the initial Source and FacilityListItem records, since the data fields are submitted "pre-parsed." We create Source and FacilityListItem objects even when the user passes the create=false option because we still want to track the submission of the data in the Source table and the performance of the matching in the FacilityListItem.processing_results field.

We were able to reuse the save_match_details function used by the batch processing. The only change we needed to make was to wrap the model save code with a check of the Source.create property.

We needed to adjust the facility delete method because submitting high-confidence matches with create=false was creating FacilityListItem rows with foreign key references to the Facility, but not a corresponding FacilityMatch row.

Testing Instructions

Setup / Regression test batch processing

Run ./scripts/manage migrate and ./scripts/resetdb. Verify that resetdb matches models without error and that the dev data is matched correctly.
Log in as c8@example.com, browse http://localhost:6543/lists, and verify that there are some pending matches.
Submit and process other-location-test.csv.zip. Verify that the steps complete without error
- ./scripts/manage batch_process --list-id 16 --action parse
- ./scripts/manage batch_process --list-id 16 --action geocode
- ./scripts/manage batch_process --list-id 16 --action match
Browse http://localhost:6543/lists/16 and verify that there are matches

Test single item submission

Still logged in as c8@example.com, browse "My Profile" and create an API token"
On the Vagrant VM, add the token to your shell environemnt: export OAR_API_TOKEN={token}
Post to the endpoint and verify that it returns 401 / requires auth.

http POST http://localhost:8081/api/facilities/?create=false

Post to the endpoint with auth and verify that it returns 403

http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN"

Run ./scripts/manage shell_plus and add the user to the group that allows single-item submission

from django.contrib import auth
g = auth.models.Group.objects.get(name='can_submit_facility')
u = User.objects.get(email='c8@example.com')
u.groups.add(g)
u.save()

Post to the endpoint and verify that it returns 400 because a JSON body was not included

http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN"

POST with a body and verify a successful response with a status of NEW_FACILITY.

http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "US",
  "name": "Azavea",
  "address": "990 Spring Garden Street, Philadelphia"
}'

POST a near match and verify a successful response with a status of POTENTIAL_MATCH

http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
	"country": "China",
	"name": "Nantong Jackbeanie Headwear & Garment Co. Ltd.",
	"address": "No.808,the third industry park,Guoyuan Town,Nantong 226500."
}'

POST an exact match and verify a successful response with a status of MATCHED

http POST http://localhost:8081/api/facilities/?create=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
	"country": "China",
	"name": "Nantong Jackbeanie Headwear Garment Co. Ltd.",
	"address": "No. 808, The Third Industry Park, Guoyuan Town, Rugao City Nantong"
}'

POST without the create=false argument, verify a successful response, and verify that the "Azavea" facility is now searchable http://localhost:6543/facilities?q=azavea

http POST http://localhost:8081/api/facilities/ \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "US",
  "name": "Azavea",
  "address": "990 Spring Garden Street, Philadelphia"
}'

POST a match without create=false, verify a successful response, and verify that the contributor with "id": 8 is listed in the contributors array.

http POST http://localhost:8081/api/facilities/ \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Bangladesh",
  "name": "AKH Shirts Ltd.",
  "address": "Savar Dhaka 1340 Bangladesh"
}'

Attempt to POST a facility with public=false and verify that a 403 is returned

http POST http://localhost:8081/api/facilities/?public=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Vietnam",
  "name": "28 Hung Phu",
  "address": "No. 168 Quang Trung Street Ward 10 Go Vap District, Hochi Minh"
}'

Run ./scripts/manage shell_plus and add the user to the group that allows private submission

from django.contrib import auth
g = auth.models.Group.objects.get(name='can_submit_private_facility')
u = User.objects.get(email='c8@example.com')
u.groups.add(g)
u.save()

Repeat the POST with public=false and verify a successful response with a status of NEW_FACILITY

http POST http://localhost:8081/api/facilities/?public=false \
"Authorization: Token $OAR_API_TOKEN" \
<<< '{
  "country": "Vietnam",
  "name": "28 Hung Phu",
  "address": "No. 168 Quang Trung Street Ward 10 Go Vap District, Hochi Minh"
}'

Verify that the facility is searchable http://localhost:6543/facilities?q=hung%20phu and that the details page for the facility does not mention the contributor.

Checklist

fixup! commits have been squashed
CI passes after rebase
CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

User accounts will need to be added to the `can_submit_facility` group in order to use the single facility submission endpoint.

Implementing the previously unused create action on the facilities resource is a natural fit for submitting a single facility record. This commit implements the required authentication but none of the actual logic.

The single facility create endpoint requires POSTing a JSON body with `name`, `address`, and `country` fields. We reuse the existing validation function from list processing to ensure that the submitted `country` can be converted to a country code.

Handles parsing and setting the default values for the `create` and `public` boolean query string arguments.

It's a best practice to use string constants rather than literals when referring to a string value in multiple places. It is safe to edit the migration since we are not changing any actual logic, just extracting the string constant.

User accounts must belong to the `can_submit_private_facility` group in order to submit facilities to be displayed without a specific contributor affiliation.

In order to support quickly matching individual list items we needed a way of holding a trained and indexed model in memory and keeping it updated as the facility data changed. The new `GazetteerCache` class implements this by holding a trained model in a class variable and reading from the `HistoricalFacility` table to determine whether items need to be added or removed from the index. The `GazetteerCache` includes a threading lock to ensure that only one thread can ever be updating the cached gazetteer. Although the management command-based matching uses the new `GazetteerCache` the actual behavior has not changed, since starting a process from scratch will result in training a new model, as we have always done. As part of this refactor we have also addressed a problem with single item CSV uploads causing the match process run out of memory. We address this by changing the way models are trained. Instead of using the input file, we use 20% of the submitted list items as our "dirty" data.

jwalgran · 2019-10-30T16:59:59Z

@hectcastro There is an ops consideration regarding the GazetteerCache introduced in this PR that I would like your commentary on.

https://github.com/open-apparel-registry/open-apparel-registry/blob/876ef29a40f485a1073c1925a4c64cd55b320dce/src/django/api/matching.py#L360-L421

Based on the research documented in #702, matching facilities in the length of a web request requires holding a trained and index Dedupe model in memory. My solution is to keep model in a class variable, control access to it with a class method, and use the HistoricalFacility table to freshen the index with any Facility records that have been modified since the last time the model was used.

This will require expanding the RAM available to our app containers as the Facility dataset grows. My experiments with 100000 facilities show that it will fit within 16GB.

rajadain · 2019-10-30T18:47:40Z

Starting to read up this now.

jwalgran · 2019-10-30T20:59:29Z

I pushed an additional commit that adds some swagger documentation. There are some known shortcomings, including the "Try it out form" returning CSRF errors. I want to make resolving that a separate issue.

rajadain

+1 went through all the tests and they work as described. Will give the code a quick look in a bit.

The only minor suggestion I'd make is that when successfully POSTing without create=false, we should send back a 201 instead of a 200, to better communicate that something was created. Not sure how big of a chance that would be.

jwalgran · 2019-10-30T21:59:11Z

we should send back a 201 instead of a 200, to better communicate that something was created.

That's a good idea. Should be a simple change.

jwalgran · 2019-10-30T22:23:56Z

we should send back a 201 instead of a 200, to better communicate that something was created.

Implemented in fixup 8841b18

hectcastro · 2019-10-31T03:24:22Z

So, the Fargate pricing page has a pricing example that reads:

Total memory charges = # of Tasks x memory in GB x price per GB x memory duration per day (seconds) x # of days

Total memory charges = 5 x 2 x 0.000001235 x 600 x 30 = $0.22

Following the math there for our current setup leads to:

Staging: 1 * .5 * 0.000001235 * 86400 * 30 = ~$1.60
Production: 2 * 2 * 0.000001235 * 86400 * 30 = ~$12.80

Total: ~$14.40

That roughly checks out via the Cost Explorer if I group by costs for usage type EU-Fargate-GB-Hours (Hrs).

If we went up to 16GB for production and played it more conservatively (2GB) for staging, it leads to:

Staging: 1 * 2 * 0.000001235 * 86400 * 30 = ~$6.40
Production: 2 * 16 * 0.000001235 * 86400 * 30 = ~$102.40

Total: $108.80

This change appears as though it would increase our overall bill by about a third of its current cost.

jwalgran · 2019-10-31T12:39:08Z

Thanks for dpcumenting those price calculations. Good to see that there is a basically linear relationship between resources and dollars.

rajadain

+1 read through the code. Everything is very cleanly written and very well separated, easy to follow. Exemplary work! Reminder to squash the fixups before merging.

This commit adds the matching and persistence logic to the view function. The flow of the view is based on the batch processing pipeline, but all three steps are completed at once. Each stage is wrapped in an individual try block so we can log unexpected failures similar to the batch process. The "parse" step is combined with the saving of the initial `Source` and `FacilityListItem` records, since the data fields are submitted "pre-parsed." We create `Source` and `FacilityListItem` objects even when the user passes the `create=false` option because we still want to track the submission of the data in the `Source` table and the performance of the matching in the `FacilityListItem.processing_results` field. We were able to reuse the `save_match_details` function used by the batch processing. The only change we needed to make was to wrap the model save code with a check of the `Source.create` property. We needed to adjust the facility delete method because submitting high-confidence matches with `create=false` was creating `FacilityListItem` rows with foreign key references to the `Facility`, but not a corresponding `FacilityMatch` row.

This adds some example request and response data and documents the available query parameters. There are known issues: - Requests made with the "Try it out" form return a CSRF error. - The "data" parameter has an `undefined` data type. It was unclear how to properly define the data type within the `coreapi.Field` object. - The return status code is listed as 201, but this endpoint does not always create objects and sometimes returns 200. It was unclear how to change the response section.

jwalgran · 2019-10-31T17:12:52Z

Thanks for the review.

jwalgran force-pushed the jcw/submit_single_facility branch 5 times, most recently from c0f66d3 to 7ab5ec8 Compare October 30, 2019 05:26

jwalgran added 7 commits October 30, 2019 08:17

Add migration to create submit facility Waffle flag and group

b78a336

User accounts will need to be added to the `can_submit_facility` group in order to use the single facility submission endpoint.

Add a create method to the FacilitiesViewSet

6e6ac5f

Implementing the previously unused create action on the facilities resource is a natural fit for submitting a single facility record. This commit implements the required authentication but none of the actual logic.

Add query param serializer for facility create endpoint

a007352

Handles parsing and setting the default values for the `create` and `public` boolean query string arguments.

Add migration and authorization for private facility submission

213f639

User accounts must belong to the `can_submit_private_facility` group in order to submit facilities to be displayed without a specific contributor affiliation.

jwalgran force-pushed the jcw/submit_single_facility branch from 7ab5ec8 to 876ef29 Compare October 30, 2019 15:18

jwalgran requested a review from rajadain October 30, 2019 16:40

jwalgran assigned rajadain Oct 30, 2019

jwalgran marked this pull request as ready for review October 30, 2019 16:40

rajadain approved these changes Oct 30, 2019

View reviewed changes

rajadain approved these changes Oct 31, 2019

View reviewed changes

rajadain assigned jwalgran and unassigned rajadain Oct 31, 2019

jwalgran added 3 commits October 31, 2019 09:25

Update CHANGELOG

e59ef7d

jwalgran force-pushed the jcw/submit_single_facility branch from 8841b18 to 6dfda0c Compare October 31, 2019 17:02

jwalgran merged commit ef1e928 into develop Oct 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add single facility submission endpoint #896

Add single facility submission endpoint #896

jwalgran commented Oct 25, 2019 •

edited

jwalgran commented Oct 30, 2019

rajadain commented Oct 30, 2019

jwalgran commented Oct 30, 2019

rajadain left a comment

jwalgran commented Oct 30, 2019

jwalgran commented Oct 30, 2019

hectcastro commented Oct 31, 2019

jwalgran commented Oct 31, 2019

rajadain left a comment

jwalgran commented Oct 31, 2019

Add single facility submission endpoint #896

Add single facility submission endpoint #896

Conversation

jwalgran commented Oct 25, 2019 • edited

Overview

Demo

Match response

Potential match response

New facility response

No match and geocode returned no results response

Notes

Testing Instructions

Setup / Regression test batch processing

Test single item submission

Checklist

jwalgran commented Oct 30, 2019

rajadain commented Oct 30, 2019

jwalgran commented Oct 30, 2019

rajadain left a comment

Choose a reason for hiding this comment

jwalgran commented Oct 30, 2019

jwalgran commented Oct 30, 2019

hectcastro commented Oct 31, 2019

jwalgran commented Oct 31, 2019

rajadain left a comment

Choose a reason for hiding this comment

jwalgran commented Oct 31, 2019

jwalgran commented Oct 25, 2019 •

edited