Incremental updates with database store #1025

Closed

jbothma opened this issue Sep 19, 2023 · 4 comments · Fixed by #1026

Comments

@jbothma

jbothma commented Sep 19, 2023

I think there's a bug with the incremental update behaviour of the DatabaseStore.

If I understand correctly, `crawl_time` has to be set to the same value each time the spider is run for the crawls to be incremental.

The first time,

  1. it crawls
  2. it saves data to the `crawl_time` directory in file names like `start.json`, `PageNumber-2.json`, ...
  3. it creates one CSV file from all the crawl files
  4. it creates the OCDS data table and inserts the data from the CSV file

Subsequent times,

  1. it gets the latest publish date from the data in the table
  2. it crawls from that publish date
    1. it saves data to the `crawl_time` directory in file names like `start.json`, `PageNumber-2.json`, ... (I think it's overwriting files here; see the sketch below)
  3. it creates one CSV file from all the crawl files
  4. it deletes the existing data and inserts the data from the CSV file

Expected: all the data crawled previously plus the new data should be in the database.
Actual: data in the overwritten files is missing from the database.

Am I doing something wrong, or is the overwriting an issue here? If I change `crawl_time` for each crawl, none of the first crawl's data is included.

Some options I see:

  • parse out the latest `PageNumber-` index and save to the next number (decouple the API page number from the saved file name; see the sketch after this list)
  • use the existing data as part of the input to release compilation. That only solves the problem for people who enable compiling releases, but I want that, so it's fine for me.
@yolile
Member

yolile commented Sep 19, 2023

Thank you, @jbothma, for reporting. This is a bug indeed. It happens for spiders that use "generic" names as file names. One approach could be to ensure each file name is always unique (for example, by including a timestamp as part of the file name). The only issue with this approach is that the compile-releases option would then be required to avoid duplicates in some cases.

@jpmckinney
Member

If a crawl is performed twice with the same parameters, the filenames should be the same.

I think the simplest solution might be to prepend `from_date` to `start.json` and to set `formatter` in `start_requests` to something like `join(pretty(self.from_date), parameters('page'))` (where `pretty` is a new function that formats datetimes).

@jpmckinney
Member

The `path` and `qs:*` spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.

@jbothma
Author

jbothma commented Sep 20, 2023

Amazing. Thanks both!
