pluck: Check that all spiders support finding the latest date #451

Closed
2 tasks
jpmckinney opened this issue Jul 16, 2020 · 7 comments
Labels
framework-spiders Relating to common spider functionality

Comments

@jpmckinney
Member

jpmckinney commented Jul 16, 2020

This looks mostly correct. However:

  • moldova_mtender (and moldova): I expected these to have more recent dates.
  • mexico_inai: I thought they started publishing again. Mexico: INAI new #452
  • france: the date is in the future. Are there any dates in the past? If so, we could filter out future dates.

For the spiders whose latest date is "today", we should check whether their data is actually fresh or whether they implement the date incorrectly.

I think nepal is currently having intermittent errors.

2015-09-17,canada_buyandsell
2017-08-04,moldova_old
2018-04-24,mexico_administracion_publica_federal
2018-06-05,honduras_cost
2018-06-29,mexico_inai
2018-08-13,nepal_portal
2018-10-20,moldova_mtender
2018-10-23,moldova
2018-12-04,mexico_grupo_aeroporto
2019-06-25,georgia_records
2019-06-25,georgia_releases
2019-11-08,mexico_quien_es_quien
2020-01-08,honduras_portal_releases
2020-01-23,nepal_dhangadhi
2020-02-03,colombia
2020-03-11,indonesia_bandung
2020-05-19,nigeria_portal
2020-05-31,canada_montreal
2020-06-11,chile_compra_releases
2020-06-14,ecuador_emergency
2020-06-15,uruguay_releases
2020-06-19,afghanistan_releases
2020-07-01,australia_nsw
2020-07-08,uk_fts
2020-07-11,kenya_makueni
2020-07-14,uganda_releases
2020-07-15,scotland
2020-07-16,argentina_vialidad
2020-07-16,australia
2020-07-16,uk_contracts_finder
2020-07-17,armenia
2020-12-09,france
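
The triage described above (old dates, dates equal to "today", dates in the future) could be sketched as below. This is only an illustration, not part of Kingfisher Collect: the `classify` helper, the 30-day staleness threshold, and the reference date are assumptions; the input format is the `date,spider` listing above.

```python
import csv
from datetime import date, timedelta
from io import StringIO


def classify(listing, today, stale_after=timedelta(days=30)):
    """Group spiders by how their plucked date compares with `today`.

    `stale_after` is an arbitrary threshold chosen for this sketch.
    """
    groups = {"future": [], "today": [], "recent": [], "stale": []}
    for row in csv.reader(StringIO(listing)):
        if not row:
            continue  # skip blank lines
        plucked = date.fromisoformat(row[0])
        spider = row[1]
        if plucked > today:
            groups["future"].append(spider)  # candidates for filtering out
        elif plucked == today:
            groups["today"].append(spider)  # fresh data, or a wrong date field?
        elif today - plucked <= stale_after:
            groups["recent"].append(spider)
        else:
            groups["stale"].append(spider)  # worth checking manually
    return groups


sample = """2018-10-23,moldova
2020-07-15,scotland
2020-07-16,australia
2020-12-09,france
"""
groups = classify(sample, today=date(2020, 7, 16))
```

Spiders in the `today` group are the ones to inspect for an incorrectly implemented date, and those in `future` are the ones where filtering out future dates might help.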
@jpmckinney jpmckinney added the framework-spiders Relating to common spider functionality label Jul 16, 2020
@jpmckinney jpmckinney added this to To do in CDS 2020-05/2021-02 via automation Jul 16, 2020
@yolile
Member

yolile commented Jul 17, 2020

Mexico INAI needs a new spider, issue #452

@yolile yolile moved this from To do to To do: Kingfisher Collect in CDS 2020-05/2021-02 Jul 21, 2020
@jpmckinney jpmckinney changed the title latestreleasedate: Check that all spiders support finding the latest date pluck: Check that all spiders support finding the latest date Aug 21, 2020
@jpmckinney jpmckinney moved this from To do: Kingfisher Collect to Priority [12 max] in CDS 2020-05/2021-02 Oct 6, 2020
@jpmckinney
Member Author

jpmckinney commented Jan 31, 2021

Now that #449 and #450 are closed via #572, we'll need to check whether the dates plucked by CompressedFileSpider and DigiwhistBase spiders are in fact the latest date.

@yolile yolile added this to the Priority milestone Mar 3, 2021
@jpmckinney
Member Author

jpmckinney commented Mar 25, 2021

  • zambia: Download links in reverse order
  • budeshi: Sort initial list by year
  • nepal_dhangadhi: Sort fiscal_years by name (or just reverse)
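
The three fixes above share one idea: request the newest item first, so that pluck sees the latest date without crawling everything. A minimal sketch, assuming fiscal-year names that sort chronologically as strings; the sample values and the `newest_first` helper are hypothetical, not taken from the spiders.

```python
def newest_first(items, key=None):
    """Return items sorted in descending order, so the latest comes first."""
    return sorted(items, key=key, reverse=True)


# Hypothetical fiscal-year names; equal-width strings sort chronologically.
fiscal_years = ["2074-75", "2076-77", "2075-76"]
ordered = newest_first(fiscal_years)

# Hypothetical (year, file) pairs, as for a list that is sorted by year.
links = [("2018", "a.json"), ("2020", "c.json"), ("2019", "b.json")]
by_year = newest_first(links, key=lambda pair: int(pair[0]))
```

If a source already lists items oldest first, reversing the list is enough and avoids relying on the names being sortable.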

Sources that actually publish daily:

  • afghanistan_release_packages
  • australia
  • australia_nsw
  • pakistan_ppra_releases
  • peru_compras
  • scotland_public_contracts
  • spain_zaragoza
  • uk_contracts_finder
  • uk_fts
  • uruguay_releases

Sources that seem to change the date to the current time:

  • armenia

Other sources:

  • moldova: paginates by date offset in chronological order, so we would need to set from_date
  • indonesia_bandung: time travel (data has 2021-04-01 on 2021-03-25)

Large files that are not in chronological order:

  • digiwhist_* (can manually check Last-Modified header instead)
  • france (can manually check metadata in open data catalog)
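
For large files that are not in chronological order, the manual `Last-Modified` check mentioned above could look like the sketch below. The `last_modified` and `parse_last_modified` helpers are assumptions for illustration, and the thread does not confirm that every server sends this header, so the function returns `None` when it is absent.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def parse_last_modified(value):
    """Parse an HTTP Last-Modified header value into a datetime."""
    return parsedate_to_datetime(value)


def last_modified(url, timeout=30):
    """Send a HEAD request and return the Last-Modified time, or None if absent."""
    request = Request(url, method="HEAD")  # HEAD avoids downloading the file
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Last-Modified")
    return parse_last_modified(value) if value else None


# Parsing alone needs no network access.
parsed = parse_last_modified("Thu, 25 Mar 2021 10:30:00 GMT")
```

The header reports when the file changed on the server, which is only a proxy for the latest release date inside the file.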

jpmckinney added a commit that referenced this issue Mar 26, 2021
@jpmckinney
Member Author

jpmckinney commented Mar 26, 2021

I haven't yet investigated why these give old dates using pluck:

  • colombia
  • mexico_quien_es_quien
  • portugal_releases

These might be harder since they are bulk:

  • portugal
  • costa_rica_poder_judicial_releases
  • mexico_administracion_publica_federal_bulk

@yolile
Member

yolile commented Mar 26, 2021

mexico_quien_es_quien

I've fixed this one in #686

colombia

They sort the results in reverse chronological order, but support from_date parameters.

@yolile
Member

yolile commented May 19, 2023

Now that we have the registry, do we still care about this command's exactness?

@jpmckinney
Member Author

I think we want spiders to sort in reverse chronological order by default, since that is what users mostly want (and it's also what we need when doing a BI project), but we can do this on a case-by-case basis and open issues for specific spiders.
