Skip to content
This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Exclude details from facility list API by defaut #1739

Merged
merged 5 commits into from Mar 20, 2022

Conversation

jwalgran
Copy link
Contributor

@jwalgran jwalgran commented Mar 20, 2022

Serializing contributor fields and extended fields requires multiple database lookups and is expensive. We add a detail query parameter to the facilities list and only include the contributors, contributor_fields, and extended_fields in the serialized output if that detail parameter is set to true.

This change creates a regressing in the CSV downloads because the additional fields are no longer available. That should be fixed in a future commit.

Connects #1737

Demo

Quick search

2022-03-19 20 59 37

Full download for CSV/Excel

2022-03-19 21 03 50

Notes

While testing embedded map downloads I saw that there were contributor attributions in the fields. I opened #1741 to look into those.

Testing Instructions

  • Run searches. Verify that they are fast.
  • Log in as c2@example.com
  • Upload 100-nodupes-latlng-extended.csv and ./tools/batc_process {id}
  • Search for number of workers less than 100. View details for some facilities and and verify the extended fields appear.
  • Download the search results. They should be slow as the data is refetched. Verify that the downloaded file includes extended fields
  • BUG in my testing I found the result count returned and shown on the list tab did not match the row count in the file when searching by Number of Workers. See if you can repeat this and diagnose it
  • Use the Embed tab on http://localhost:6543/settings and verify that searching and downloading work correctly

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

We reduced the max page size to 50 a while ago so that the more expensive to
build responses that include all the details required to generate CSV/Excel
downlod do not exceed our the cloudfront timeout.
Serializing contributor fields and extended fields requires multiple database
lookups and is expensive. We add a `detail` query parameter to the facilities
list and only include the `contributors`, `contributor_fields`, and
`extended_fields` in the serialized output if that `detail` parameter is set to
true.

This change creates a regressing in the CSV downloads because the additional
fields are no longer available. That should be fixed in a future commit.
Previously we defaulted to returning the full details so previously loaded data
was ready for CSV/Excel download. Now that we default to excluding those details
we need to update the action chain to refetch facilities.
@jwalgran jwalgran requested a review from TaiWilkin March 20, 2022 04:48
@jwalgran jwalgran marked this pull request as ready for review March 20, 2022 04:49
@@ -967,8 +959,14 @@ def list(self, request):
context = {'request': request}

if page_queryset is not None:
should_serialize_details = params.validated_data['detail']
exclude_fields = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever way to handle this.

Copy link
Contributor

@TaiWilkin TaiWilkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is (mostly?) working well for me. The search is running noticeably faster, and the extended fields are appearing correctly in the details and the download. I also have some custom contributor fields in my development database, and they downloaded/displayed correctly in conjunction with extended fields.
In regards to the counts bug: The count in the header says 102; the number of rows for CSV download was 102 on first try. The number of facilities actually returned when I count them manually is 152; the number of facilities in the XLSX download is 153. There is a 'loading more facilities' tab at the bottom, even though the all the items shown in the CSV have downloaded and there are more facilities than the count indicates. On a second CSV download attempt, I received 154 facilities and some new ones showed up at the bottom of the list in the sidebar (but it still says 102 and 'loading more'). I'm continuing to look into it.

@TaiWilkin
Copy link
Contributor

TaiWilkin commented Mar 20, 2022

Additional local testing results (will continue to add to this comment as I uncover things so as to not spam the PR):

>>> FacilityIndex.objects.filter(number_of_workers__overlap=['Less than 1000']).count()
103
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).count()
2499
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).distinct('number_of_workers')
<QuerySet [<FacilityIndex: FacilityIndex object (ID202207454ERZC)>]>
>>> FacilityIndex.objects.exclude(number_of_workers__overlap=['Less than 1000']).distinct('number_of_workers').values('number_of_workers')
<QuerySet [{'number_of_workers': []}]>
>>> FacilityIndex.objects.distinct('number_of_workers').values('number_of_workers')
<QuerySet [{'number_of_workers': []}, {'number_of_workers': ['Less than 1000']}]>
>>> f = FacilityIndex.objects.filter(number_of_workers__overlap=['Less than 1000']).values_list('id', flat=True)
>>> Facility.objects.filter(id__in=f).count()
102

I checked the larger downloaded file for duplicates and determined that there is a chunk of facilities that are duplicates. If we remove them, the XLSX contains the expected number of items.
The source of the bug seems to be that sometimes, inconsistently, when we select 'download', the second page of facilities is fetched twice. I am seeing the page=2 request twice in the network tab. Both requests contain detail=true.
In the logDownload function, we look to see if there is a 'next page url' in the redux state, and while there is, we download the next page. I suspect that the next page url in state might be getting overwritten/confused by the multiple types of download somehow.

@TaiWilkin
Copy link
Contributor

I believe I have a fix/fixes to prevent the extra facilities from showing up. When running repeated tests with options A and B I didn't see the bug happen, although of course with an inconsistent bug it can be difficult to be 100% certain.
Option A: Prevent the list from displaying at all (and therefore initiating extra page downloads) when the download is occurring by wrapping the list in

{!downloadingCSV && (
                <div style={filterSidebarStyles.controlPanelContentStyles}>...</div>
)}

Option B: Prevent the list from downloading more items when the download is occurring (this seems like a better option since it doesn't change the display behavior; the only issue is that then the 'more loading' stays up longer if you scroll down since the list download takes longer):

onInfiniteLoad={() => {
                            if (!downloadingCSV) {
                                fetchNextPage();
                            }
                        }}

Option C: Run CSV detailed fetching on a completely different set of actions / redux state than the 'thin' fetching. Downside is that it's not a quick-fix solution, upside is that it feels more 'correct' in terms of handling the data.

@TaiWilkin TaiWilkin assigned jwalgran and unassigned TaiWilkin Mar 20, 2022
While a download is in progress infinite scroll events from the component
showing the list were previously resulting in duplicate data being added to the
CSV. We attempt to resolve the issue by disabling infinite scroll handling while
downloading.
@jwalgran
Copy link
Contributor Author

Thank you for your incredibly detailed diagnosis of the download issue. I agree that a rewrite of the download handling to use different actions and redux storage is a good design, but I went with the the onInfiniteLoad fix which looks like it should have been part of the system from the beginning.

I pushed up that fix for re-review.

@jwalgran jwalgran requested a review from TaiWilkin March 20, 2022 15:59
@jwalgran jwalgran assigned TaiWilkin and unassigned jwalgran Mar 20, 2022
Copy link
Contributor

@TaiWilkin TaiWilkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@TaiWilkin TaiWilkin assigned jwalgran and unassigned TaiWilkin Mar 20, 2022
@jwalgran
Copy link
Contributor Author

Thanks again for the detailed review.

@jwalgran jwalgran merged commit ade00e2 into develop Mar 20, 2022
@jwalgran jwalgran deleted the feature/jcw/fast-search branch March 20, 2022 17:29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants