Incremental updates with database store #1025

Closed

jbothma opened this issue Sep 19, 2023 · 4 comments · Fixed by #1026

Comments

@jbothma

jbothma commented Sep 19, 2023

I think there's a bug with the incremental update behaviour of the DatabaseStore.

If I understand correctly, `crawl_time` has to be set to the same value each time the spider is run for the crawls to be incremental.

The first time,

  1. it crawls
  2. it saves data to the `crawl_time` directory in file names like `start.json`, `PageNumber-2.json`, ...
  3. it creates one CSV file from all the crawl files
  4. it creates the OCDS data table and inserts the data from the CSV file

Subsequent times,

  1. it gets the latest publish date from the data in the table
  2. it crawls from that publish date
    1. it saves data to the `crawl_time` directory in file names like `start.json`, `PageNumber-2.json`, ... (I think it's overwriting files here; see the sketch below)
  3. it creates one CSV file from all the crawl files
  4. it deletes the existing data and inserts the data from the CSV file

Expected: all the data crawled previously plus the new data should be in the database.
Actual: data in the overwritten files is missing from the database.

Am I doing something wrong, or is the overwriting an issue here? If I change `crawl_time` for each crawl, none of the first crawl's data is included.

Some options I see:

  • parse out the latest `PageNumber-` index and save to the next number (decouple the API page number from the saved file name; see the sketch after this list)
  • use the existing data as part of the input to release compilation. That only solves the problem for people who enable compiling releases, but I want that, so it's fine for me.
@yolile
Member

yolile commented Sep 19, 2023

Thank you, @jbothma, for reporting. This is a bug indeed. It happens for spiders that use "generic" names as file names. One approach could be to ensure each file name is always unique (for example, by including a timestamp as part of the file name). The only issue with this approach is that the compile-releases option would then be required to avoid duplicates in some cases.

@jpmckinney
Member

If a crawl is performed twice with the same parameters, the filenames should be the same.

I think the simplest solution might be to prepend `from_date` to `start.json` and to set `formatter` in `start_requests` to something like `join(pretty(self.from_date), parameters('page'))` (where `pretty` is a new function that formats datetimes).

@jpmckinney
Member

The `path` and `qs:*` spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.

@jbothma
Author

jbothma commented Sep 20, 2023

Amazing. Thanks both!
