Exclude details from facility list API by defaut #1739

jwalgran · 2022-03-20T00:11:49Z

Serializing contributor fields and extended fields requires multiple database lookups and is expensive. We add a detail query parameter to the facilities list and only include the contributors, contributor_fields, and extended_fields in the serialized output if that detail parameter is set to true.

This change creates a regressing in the CSV downloads because the additional fields are no longer available. That should be fixed in a future commit.

Connects #1737

Demo

Quick search

Full download for CSV/Excel

Notes

While testing embedded map downloads I saw that there were contributor attributions in the fields. I opened #1741 to look into those.

Testing Instructions

Run searches. Verify that they are fast.
Log in as c2@example.com
Upload 100-nodupes-latlng-extended.csv and ./tools/batc_process {id}
Search for number of workers less than 100. View details for some facilities and and verify the extended fields appear.
Download the search results. They should be slow as the data is refetched. Verify that the downloaded file includes extended fields
BUG in my testing I found the result count returned and shown on the list tab did not match the row count in the file when searching by Number of Workers. See if you can repeat this and diagnose it
Use the Embed tab on http://localhost:6543/settings and verify that searching and downloading work correctly

Checklist

fixup! commits have been squashed
CI passes after rebase
CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

We reduced the max page size to 50 a while ago so that the more expensive to build responses that include all the details required to generate CSV/Excel downlod do not exceed our the cloudfront timeout.

Serializing contributor fields and extended fields requires multiple database lookups and is expensive. We add a `detail` query parameter to the facilities list and only include the `contributors`, `contributor_fields`, and `extended_fields` in the serialized output if that `detail` parameter is set to true. This change creates a regressing in the CSV downloads because the additional fields are no longer available. That should be fixed in a future commit.

Previously we defaulted to returning the full details so previously loaded data was ready for CSV/Excel download. Now that we default to excluding those details we need to update the action chain to refetch facilities.

TaiWilkin · 2022-03-20T12:55:47Z

src/django/api/views.py

@@ -967,8 +959,14 @@ def list(self, request):
        context = {'request': request}

        if page_queryset is not None:
+            should_serialize_details = params.validated_data['detail']
+            exclude_fields = [


This is a clever way to handle this.

TaiWilkin

This is (mostly?) working well for me. The search is running noticeably faster, and the extended fields are appearing correctly in the details and the download. I also have some custom contributor fields in my development database, and they downloaded/displayed correctly in conjunction with extended fields.
In regards to the counts bug: The count in the header says 102; the number of rows for CSV download was 102 on first try. The number of facilities actually returned when I count them manually is 152; the number of facilities in the XLSX download is 153. There is a 'loading more facilities' tab at the bottom, even though the all the items shown in the CSV have downloaded and there are more facilities than the count indicates. On a second CSV download attempt, I received 154 facilities and some new ones showed up at the bottom of the list in the sidebar (but it still says 102 and 'loading more'). I'm continuing to look into it.

TaiWilkin · 2022-03-20T13:55:55Z

Additional local testing results (will continue to add to this comment as I uncover things so as to not spam the PR):

>>> FacilityIndex.objects.filter(number_of_workers__overlap=['Less than 1000']).count()
103
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).count()
2499
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).distinct('number_of_workers')
<QuerySet [<FacilityIndex: FacilityIndex object (ID202207454ERZC)>]>
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).distinct('number_of_workers').values('number_of_workers')
<QuerySet [{'number_of_workers': []}]>
>>> FacilityIndex.objects.distinct('number_of_workers').values('number_of_workers')
<QuerySet [{'number_of_workers': []}, {'number_of_workers': ['Less than 1000']}]>
>>> f = FacilityIndex.objects.filter(number_of_workers__overlap=['Less than 1000']).values_list('id', flat=True)
>>> Facility.objects.filter(id__in=f).count()
102

I checked the larger downloaded file for duplicates and determined that there is a chunk of facilities that are duplicates. If we remove them, the XLSX contains the expected number of items.
The source of the bug seems to be that sometimes, inconsistently, when we select 'download', the second page of facilities is fetched twice. I am seeing the page=2 request twice in the network tab. Both requests contain detail=true.
In the logDownload function, we look to see if there is a 'next page url' in the redux state, and while there is, we download the next page. I suspect that the next page url in state might be getting overwritten/confused by the multiple types of download somehow.

TaiWilkin · 2022-03-20T14:52:04Z

I believe I have a fix/fixes to prevent the extra facilities from showing up. When running repeated tests with options A and B I didn't see the bug happen, although of course with an inconsistent bug it can be difficult to be 100% certain.
Option A: Prevent the list from displaying at all (and therefore initiating extra page downloads) when the download is occurring by wrapping the list in

{!downloadingCSV && (
                <div style={filterSidebarStyles.controlPanelContentStyles}>...</div>
)}

Option B: Prevent the list from downloading more items when the download is occurring (this seems like a better option since it doesn't change the display behavior; the only issue is that then the 'more loading' stays up longer if you scroll down since the list download takes longer):

onInfiniteLoad={() => {
                            if (!downloadingCSV) {
                                fetchNextPage();
                            }
                        }}

Option C: Run CSV detailed fetching on a completely different set of actions / redux state than the 'thin' fetching. Downside is that it's not a quick-fix solution, upside is that it feels more 'correct' in terms of handling the data.

While a download is in progress infinite scroll events from the component showing the list were previously resulting in duplicate data being added to the CSV. We attempt to resolve the issue by disabling infinite scroll handling while downloading.

jwalgran · 2022-03-20T15:59:17Z

Thank you for your incredibly detailed diagnosis of the download issue. I agree that a rewrite of the download handling to use different actions and redux storage is a good design, but I went with the the onInfiniteLoad fix which looks like it should have been part of the system from the beginning.

I pushed up that fix for re-review.

TaiWilkin

Looks good!

jwalgran · 2022-03-20T17:28:55Z

Thanks again for the detailed review.

jwalgran added 3 commits March 19, 2022 17:09

Correct the swagger summary for the facilities GET endpoint

f18709f

We reduced the max page size to 50 a while ago so that the more expensive to build responses that include all the details required to generate CSV/Excel downlod do not exceed our the cloudfront timeout.

Fetch full details when download facilities for CSV/Excel

6464be0

Previously we defaulted to returning the full details so previously loaded data was ready for CSV/Excel download. Now that we default to excluding those details we need to update the action chain to refetch facilities.

jwalgran requested a review from TaiWilkin March 20, 2022 04:48

jwalgran assigned TaiWilkin Mar 20, 2022

jwalgran marked this pull request as ready for review March 20, 2022 04:49

TaiWilkin reviewed Mar 20, 2022

View reviewed changes

TaiWilkin assigned jwalgran and unassigned TaiWilkin Mar 20, 2022

jwalgran added 2 commits March 20, 2022 08:50

Update CHANGELOG

63f5e7b

jwalgran requested a review from TaiWilkin March 20, 2022 15:59

jwalgran assigned TaiWilkin and unassigned jwalgran Mar 20, 2022

TaiWilkin approved these changes Mar 20, 2022

View reviewed changes

TaiWilkin assigned jwalgran and unassigned TaiWilkin Mar 20, 2022

jwalgran merged commit ade00e2 into develop Mar 20, 2022

jwalgran deleted the feature/jcw/fast-search branch March 20, 2022 17:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude details from facility list API by defaut #1739

Exclude details from facility list API by defaut #1739

jwalgran commented Mar 20, 2022 •

edited

TaiWilkin Mar 20, 2022

TaiWilkin left a comment •

edited

TaiWilkin commented Mar 20, 2022 •

edited

TaiWilkin commented Mar 20, 2022

jwalgran commented Mar 20, 2022

TaiWilkin left a comment

jwalgran commented Mar 20, 2022

Exclude details from facility list API by defaut #1739

Exclude details from facility list API by defaut #1739

Conversation

jwalgran commented Mar 20, 2022 • edited

Demo

Quick search

Full download for CSV/Excel

Notes

Testing Instructions

Checklist

TaiWilkin Mar 20, 2022

Choose a reason for hiding this comment

TaiWilkin left a comment • edited

Choose a reason for hiding this comment

TaiWilkin commented Mar 20, 2022 • edited

TaiWilkin commented Mar 20, 2022

jwalgran commented Mar 20, 2022

TaiWilkin left a comment

Choose a reason for hiding this comment

jwalgran commented Mar 20, 2022

jwalgran commented Mar 20, 2022 •

edited

TaiWilkin left a comment •

edited

TaiWilkin commented Mar 20, 2022 •

edited