Incremental updates with database store #1025
Thank you, @jbothma, for reporting. This is a bug indeed. It happens for spiders that use "generic" names as file names. One approach could be to ensure each file name is always unique (for example, by including a timestamp as part of the filename). The only issue with this approach is that the compile release option will be required to avoid duplicates in some cases.
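The timestamp idea above could look something like the following sketch. Note that `unique_path`, its arguments, and the file layout are hypothetical illustrations, not the extension's actual API:

```python
import time
from pathlib import Path

def unique_path(directory: str, base_name: str) -> Path:
    """Make a crawl's output file name unique by embedding a timestamp.

    `directory` and `base_name` are illustrative; the real store builds
    its paths differently. "start.json" becomes e.g.
    "start-20240101T120000.json", so repeated crawls never overwrite
    each other's files.
    """
    base = Path(base_name)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    return Path(directory) / f"{base.stem}-{stamp}{base.suffix}"
```

The trade-off mentioned above still applies: because every run now produces distinct files, a de-duplication step (such as the compile release option) would be needed downstream.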
If a crawl is performed twice with the same parameters, the filenames should be the same. I think the simplest solution might be to prepend …
Amazing. Thanks both!
I think there's a bug with the incremental update behaviour of the DatabaseStore.
If I understand correctly, crawl_time has to be set to the same value each time the spider is run to get an incremental crawl.
The first time, the spider writes start.json, PageNumber-2.json, ...
Subsequent times, it writes start.json, PageNumber-2.json, ... (I think it's overwriting files here)
expected: all the data crawled previously plus the new data should be in the database
actual: data in the overwritten files is missing from the database
Am I doing something wrong, or is the overwriting an issue here? If I change crawl_time for each crawl, none of the first crawl's data is included.
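To make the suspected behaviour concrete, here is a minimal, self-contained simulation (not the DatabaseStore's actual code) of two runs writing the same deterministic file names into one crawl directory:

```python
import json
import tempfile
from pathlib import Path

def fake_crawl(directory: Path, pages: dict) -> None:
    # Each "run" writes one JSON file per page; the names are
    # deterministic, just like start.json / PageNumber-2.json above.
    for name, records in pages.items():
        (directory / name).write_text(json.dumps(records))

def load_all(directory: Path) -> list:
    # Stand-in for the database load: read every file in the directory.
    records = []
    for path in sorted(directory.glob("*.json")):
        records.extend(json.loads(path.read_text()))
    return records

crawl_dir = Path(tempfile.mkdtemp())   # stands in for one crawl_time directory
fake_crawl(crawl_dir, {"start.json": [1, 2], "PageNumber-2.json": [3]})
fake_crawl(crawl_dir, {"start.json": [9]})   # second run reuses the name

# start.json now holds only [9]; records 1 and 2 from the first run are gone.
print(load_all(crawl_dir))
```

Whether the fix is unique names per run or merging before load, the key point is that identical file names across runs silently drop the earlier run's records.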
Some options I see: