DataGranules.get raises IndexError when Granules found = zero #526

andypbarrett · 2024-04-16T21:59:38Z

What Happens

If I do a simple search for granules that returns zero hits, I get an index error.

In [12]: results = earthaccess.search_data(short_name='ATL07', temporal=('2023-01-01','2023-01-01'), bounding_box=(-180.,65.,180.,90.))
Granules found: 0
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[12], line 1
----> 1 results = earthaccess.search_data(short_name='ATL07', temporal=('2023-01-01','2023-01-01'), bounding_box=(-180.,65.,180.,90.))

File ~/src/earthaccess/earthaccess/api.py:124, in search_data(count, **kwargs)
    122 if count > 0:
    123     return query.get(count)
--> 124 return query.get_all()

File ~/mambaforge/envs/earthaccess-dev/lib/python3.10/site-packages/cmr/queries.py:105, in Query.get_all(self)
     96 def get_all(self):
     97     """
     98     Returns all of the results for the query. This will call hits() first to determine how many
     99     results their are, and then calls get() with that number. This method could take quite
   (...)
    102     :returns: query results as a list
    103     """
--> 105     return self.get(self.hits())

File ~/src/earthaccess/earthaccess/search.py:402, in DataGranules.get(self, limit)
    386 """Get all the collections (datasets) that match with our current parameters
    387 up to some limit, even if spanning multiple pages.
    388 
   (...)
    398     query results as a list of `DataGranules` instances.
    399 """
    400 response = get_results(self, limit)
--> 402 cloud = self._is_cloud_hosted(response[0])
    404 return list(DataGranule(granule, cloud_hosted=cloud) for granule in response)

IndexError: list index out of range

What I expect

I would expect this to return an empty list as print Granules found: 0 to stdout.

Proposed change

Check that response is not empty before checking _is_cloud_hosted. Return empty list if it is empty.

The text was updated successfully, but these errors were encountered:

mfisher87 · 2024-04-17T16:18:06Z

earthaccess/earthaccess/search.py

Lines 521 to 525 in 0e4f392

    
           response = get_results(self.session, self, limit) 
        
           cloud = self._is_cloud_hosted(response[0]) 
        
           return [DataGranule(granule, cloud_hosted=cloud) for granule in response]

This has a smell to me. Why are we doing this post-process step to calculate if all granules are cloud-hosted based on the first granule, instead of having the DataGranule class set cloud_hosted for each granule at init-time? It would be great to put that rationale in a comment for future us.

If there's good rationale, or if we're looking for a short-term fix, I think we can add a quick:

if not results:
    return []

to resolve the bug.

chuckwondo · 2024-04-17T17:07:40Z

earthaccess/earthaccess/search.py

Lines 521 to 525 in 0e4f392

response = get_results(self.session, self, limit)

cloud = self._is_cloud_hosted(response[0])

return [DataGranule(granule, cloud_hosted=cloud) for granule in response]

This has a smell to me. Why are we doing this post-process step to calculate if all granules are cloud-hosted based on the first granule, instead of having the DataGranule class set cloud_hosted for each granule at init-time? It would be great to put that rationale in a comment for future us.

If there's good rationale, or if we're looking for a short-term fix, I think we can add a quick:
if not results:
    return []
to resolve the bug.

I would suggest simply changing the assignment to cloud to this to fix the IndexError:

cloud = len(response) > 0 and self._is_cloud_hosted(response[0])

However, I think this reveals another bug. It is erroneous to assume that if response[0] is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify the cloud_hosted parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.

For example, it is valid to specify multiple collection concept IDs in a CMR granule search query. If one collection is cloud hosted and the other is not, then the granules will be a mix of cloud hosted and non-cloud hosted.

Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to cloud and then change the return statement to this:

return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response]

mfisher87 · 2024-04-17T18:34:56Z

However, I think this reveals another bug. It is erroneous to assume that if response[0] is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify the cloud_hosted parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.

💯 This is the smell I was trying to communicate, thanks for explaining it better than me!

mfisher87 · 2024-04-17T18:39:20Z

Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to cloud and then change the return statement to this:
return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response] 

Since DataGranule constructor and _is_cloud_hosted both depend on this granule object (loaded JSON), what do you think of encapsulating that logic within DataGranule and moving the _is_cloud_hosted method from DataGranules to DataGranule?

chuckwondo · 2024-04-17T19:11:16Z

Therefore, I believe the proper fix to address both bugs (IndexError and wrong assumption) is to completely remove the line that assigns a value to cloud and then change the return statement to this:
return [DataGranule(granule, cloud_hosted=self._is_cloud_hosted(granule)) for granule in response] 
Since DataGranule constructor and _is_cloud_hosted both depend on this granule object (loaded JSON), what do you think of encapsulating that logic within DataGranule and moving the _is_cloud_hosted method from DataGranules to DataGranule?

Agreed that _is_cloud_hosted makes sense to be moved to DataGranule. I would say there's also no need to make it "private" either (i.e., drop the underscore prefix), and should perhaps even be a cached property (from functools) on DataGranule, similar to this (I also suggest dropping the is prefix, but not utterly opposed to keeping it):

@cached_property
def cloud_hosted(self) -> bool:
    ...

I might also suggest dropping the cloud_hosted parameter of DataGranule, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to force cloud_hosted to be True even though the logic currently in _is_cloud_hosted would evaluate to False, so it serves as an override in the "I know what I'm doing" sense.

cc: @betolink

andypbarrett · 2024-04-17T19:29:58Z

The cloud_hosted parameter enables searching for only cloud-hosted data when cloud_hosted=True and only DAAC-hosted (a.k.a on-prem) when cloud_hosted=False.

chuckwondo · 2024-04-17T19:36:32Z

The cloud_hosted parameter enables searching for only cloud-hosted data when cloud_hosted=True and only DAAC-hosted (a.k.a on-prem) when cloud_hosted=False.

I'm referring to the cloud_hosted param to DataGranule (singular, not plural).

mfisher87 · 2024-04-17T20:00:08Z

Agreed that _is_cloud_hosted makes sense to be moved to DataGranule. I would say there's also no need to make it "private" either (i.e., drop the underscore prefix), and should perhaps even be a cached property (from functools) on DataGranule

💯

chuckwondo · 2024-05-01T11:08:27Z

I might also suggest dropping the cloud_hosted parameter of DataGranule, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to force cloud_hosted to be True even though the logic currently in _is_cloud_hosted would evaluate to False, so it serves as an override in the "I know what I'm doing" sense.

Anybody know why DataGranule takes a cloud_hosted parameter?

With my current understanding, it doesn't seem to make sense to me to explicitly specify such a parameter, particularly if we move the logic in DataGranules._is_cloud_hosted to a new attribute DataGranule.cloud_hosted. Note that we're moving logic from DataGranules (plural) to DataGranule (singular).

I would think that with such a move, we would want to deprecate the cloud_hosted parameter to DataGranule.__init__.

chuckwondo · 2024-05-10T18:42:59Z

I might also suggest dropping the cloud_hosted parameter of DataGranule, but I don't know if there's a specific reason that's there. Perhaps there are cases where we want to force cloud_hosted to be True even though the logic currently in _is_cloud_hosted would evaluate to False, so it serves as an override in the "I know what I'm doing" sense.

Anybody know why DataGranule takes a cloud_hosted parameter?

With my current understanding, it doesn't seem to make sense to me to explicitly specify such a parameter, particularly if we move the logic in DataGranules._is_cloud_hosted to a new attribute DataGranule.cloud_hosted. Note that we're moving logic from DataGranules (plural) to DataGranule (singular).

I would think that with such a move, we would want to deprecate the cloud_hosted parameter to DataGranule.__init__.

@betolink, do you have any background knowledge on why this is implemented this way?

betolink · 2024-05-10T19:34:03Z

Sorry for the delayed response, there are collections that exist both in the DAAC and AWS with the same short_name. cloud_hosted is a flag to indicate that we can use the S3 links. The ultimate reason is to simplify open() and download() so it defaults to S3 if the code is running in us-west-2. We could check for the S3 links in "RelatedUrls" (umm) but injecting a flag from the search results was faster if I recall. @chuckwondo

chuckwondo · 2024-05-10T23:38:03Z

Sorry for the delayed response, there are collections that exist both in the DAAC and AWS with the same short_name. cloud_hosted is a flag to indicate that we can use the S3 links. The ultimate reason is to simplify open() and download() so it defaults to S3 if the code is running in us-west-2. We could check for the S3 links in "RelatedUrls" (umm) but injecting a flag from the search results was faster if I recall. @chuckwondo

Sure, injecting a flag would be faster (although I suspect this would be negligible, thus a questionable "optimization"), but more importantly, it's possibly simply wrong, per an earlier comment I made:

However, I think this reveals another bug. It is erroneous to assume that if response[0] is (or is not) cloud hosted, then the rest of the granules in the response are (or are not) cloud hosted as well. If the query does not specify the cloud_hosted parameter, then the results could possibly be a mix of cloud hosted and non-cloud hosted granules.

For example, it is valid to specify multiple collection concept IDs in a CMR granule search query. If one collection is cloud hosted and the other is not, then the granules will be a mix of cloud hosted and non-cloud hosted.

mfisher87 added the bug Something isn't working label Apr 17, 2024

mfisher87 added the good first issue Good for newcomers label Apr 17, 2024

mfisher87 removed the good first issue Good for newcomers label May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataGranules.get raises IndexError when Granules found = zero #526

DataGranules.get raises IndexError when Granules found = zero #526

andypbarrett commented Apr 16, 2024

mfisher87 commented Apr 17, 2024 •

edited

chuckwondo commented Apr 17, 2024

mfisher87 commented Apr 17, 2024

mfisher87 commented Apr 17, 2024 •

edited

chuckwondo commented Apr 17, 2024

andypbarrett commented Apr 17, 2024

chuckwondo commented Apr 17, 2024

mfisher87 commented Apr 17, 2024

chuckwondo commented May 1, 2024

chuckwondo commented May 10, 2024

betolink commented May 10, 2024

chuckwondo commented May 10, 2024

DataGranules.get raises IndexError when Granules found = zero #526

DataGranules.get raises IndexError when Granules found = zero #526

Comments

andypbarrett commented Apr 16, 2024

What Happens

What I expect

Proposed change

mfisher87 commented Apr 17, 2024 • edited

chuckwondo commented Apr 17, 2024

mfisher87 commented Apr 17, 2024

mfisher87 commented Apr 17, 2024 • edited

chuckwondo commented Apr 17, 2024

andypbarrett commented Apr 17, 2024

chuckwondo commented Apr 17, 2024

mfisher87 commented Apr 17, 2024

chuckwondo commented May 1, 2024

chuckwondo commented May 10, 2024

betolink commented May 10, 2024

chuckwondo commented May 10, 2024

mfisher87 commented Apr 17, 2024 •

edited

mfisher87 commented Apr 17, 2024 •

edited