Cannot save state when restarting scrapydweb #21

Closed

jdespatis opened this issue Jan 24, 2019 · 23 comments
Labels
feature request Request for new features

Comments

@jdespatis

I'm running scrapydweb in Docker.
I can start a job and then see some statistics; I also see the finished jobs, which is perfect.

However, when I restart my container, I lose this state. For example, I no longer see the finished jobs.
=> What data should I persist in my Docker container so that I can see everything when I restart the container?

I've tried persisting /usr/local/lib/python3.6/site-packages/scrapydweb/data, but it doesn't seem to do the trick.

SpiderKeeper keeps all its state in a SpiderKeeper.db file, which is perfect for keeping state across container restarts.

Any idea how to achieve the same thing with scrapydweb?

Thanks again for your work!

@my8100
Owner

my8100 commented Jan 25, 2019

Actually, it's an issue with Scrapyd. I'll figure it out in the next release.
You can check out the log of finished jobs in the Logs page for the time being, and there is no need to persist the data folder of ScrapydWeb.

@my8100 my8100 self-assigned this Jan 25, 2019
@jdespatis
Author

Yes, it would be awesome to support this feature!
Will it be possible to see the job graph after a scrapydweb restart?

@my8100
Owner

my8100 commented Jan 25, 2019

I just implemented a snapshot mechanism for the Dashboard page, so that you can still check out its last view in case the Scrapyd service is restarted.
What do you mean by 'job graph'?

@jdespatis
Author

I mean the graph that shows the number of items stored per minute and the number of pages crawled per minute, as well as the other graph giving the progression of the total number of crawled pages / stored items.
Very handy to have after a scrapydweb restart.

@my8100
Owner

my8100 commented Jan 25, 2019

The stats and graphs of a job remain available as long as the json file generated by LogParser or the original logfile still exists.
You may need to adjust jobs_to_keep and finished_to_keep in Scrapyd's config file.
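
For reference, both options live in the [scrapyd] section of Scrapyd's config file (e.g. /etc/scrapyd/scrapyd.conf, or a scrapyd.conf next to where Scrapyd is started); the values below are only illustrative:

    [scrapyd]
    # Number of finished jobs (log and item files) to keep per spider;
    # keeping more logfiles also keeps the json files LogParser derives from them.
    jobs_to_keep = 100
    # Number of finished processes to keep in the launcher
    # (what listjobs.json and the Jobs page report as finished).
    finished_to_keep = 500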

@my8100
Owner

my8100 commented Jan 25, 2019

But why emphasize "after a ScrapydWeb restart"? Is there anything wrong with v1.1.0?

@jdespatis
Author

Well, indeed: I launched a job, it finished, then I restarted both scrapydweb and scrapyd. So I guess scrapydweb no longer shows the finished job, and as a result I can no longer get the stats and graph of that job.

I imagine that if scrapydweb persists the finished jobs (next release), I'll also be able to see the graph that was built in real time.

Is that right?
I'll be happy to test this new release ;)

@jdespatis
Author

Well, I've just noticed that I can see the graph of a job by going to the Files > Logs section; a nice column lets me see the graph for all log files, which is perfect for me!

With a snapshot of the dashboard, it will be even better!

@my8100
Owner

my8100 commented Jan 31, 2019

As I told you before: "You can check out the log of finished jobs in the Logs page for the time being"

@my8100
Owner

my8100 commented Jan 31, 2019

Also note that the json files generated by LogParser will be removed by Scrapyd when it deletes the original logfiles.

@my8100
Owner

my8100 commented Mar 12, 2019

v1.2.0: Persist jobs information in the database
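
If you run ScrapydWeb in Docker, mounting the directory that holds its database onto the host should keep that information across container restarts. A minimal sketch, assuming the files end up under the scrapydweb/data directory mentioned at the top of this thread (check where your version actually writes them; the host path and image name below are placeholders):

    docker run -d \
        -p 5000:5000 \
        -v /host/scrapydweb_data:/usr/local/lib/python3.6/site-packages/scrapydweb/data \
        my-scrapydweb-image    # hypothetical image name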

@my8100 my8100 closed this as completed Mar 12, 2019
@my8100 my8100 added feature request Request for new features and removed enhancement labels Mar 23, 2019
@Digenis

Digenis commented Apr 16, 2019

Hi,
scrapyd uses sqlite only as a concurrently accessed queue.
The persistence of scheduled jobs that you see right now was not on purpose.
scrapyd should have used https://docs.python.org/3/library/queue.html
to implement the spider queue instead of sqlite.

I think what's best is to make scrapyd more modular
so that developers like @my8100 can easily plug in custom components,
e.g. a persistent job table.
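
(Not Scrapyd's actual code, just an illustrative sketch of the contrast above: a queue.Queue lives in memory and loses pending jobs when the process exits, whereas rows in an SQLite file survive a restart simply because they live on disk.)

    import queue
    import sqlite3

    # In-memory queue: pending jobs are gone once the process exits.
    mem_queue = queue.Queue()
    mem_queue.put('{"project": "demo", "spider": "example"}')

    # SQLite file used as a queue (hypothetical schema): rows persist
    # across restarts as a side effect of being stored on disk.
    conn = sqlite3.connect("spider_queue.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, message TEXT)")
    conn.execute("INSERT INTO pending (message) VALUES (?)",
                 ('{"project": "demo", "spider": "example"}',))
    conn.commit()
    conn.close()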

@my8100 my8100 removed their assignment May 16, 2019
@goshaQ

goshaQ commented Aug 7, 2019

Are there any plans to add this feature in future releases? There are a lot of cases where it's nice to be able to restore a failed parser right from where it stopped, so that already-scheduled requests won't be lost, just like it's implemented in SpiderKeeper.

@my8100
Owner

my8100 commented Aug 7, 2019

@goshaQ
What’s the meaning of “restore failed parser right from where it stopped“?

@goshaQ

goshaQ commented Aug 7, 2019

@my8100
The same as in the first comment. SpiderKeeper allows saving the state of the queue that contains the scheduled requests, so if a spider stops (because of a user request or anything else), it can resume rather than start from scratch. But now I think there are some considerations that make it hard to provide such functionality, and it's not that hard to do it yourself with all the specifics of a particular use case.

Btw, I just noticed that the Items section shows an error if Scrapyd doesn't return any items, which is normal if the results are written to a database. It looks to me like this is the same reason the Jobs section keeps showing the red tip telling me to install logparser to show the number of parsed items, even after I've installed logparser and launched it. Or am I doing something wrong? Sorry for the unrelated question.

@my8100
Owner

my8100 commented Aug 8, 2019

  1. pip install scrapydweb==1.3.0
    Both the Classic view and the Database view of the Jobs page are provided,
    which is why I closed this issue in v1.2.0.

  2. Set SHOW_SCRAPYD_ITEMS to False to hide the Items link in the sidebar.

    ############################## Page Display ###################################
    # The default is True, set it to False to hide the Items page, as well as
    # the Items column in the Jobs page.
    SHOW_SCRAPYD_ITEMS = True

  3. What's the result of visiting http://127.0.0.1:6800/logs/stats.json

@goshaQ

goshaQ commented Aug 8, 2019

Thanks, that's what I was looking for. But it appears that there is no stats.json on the server:

  1. What's the result of visiting http://127.0.0.1:6800/logs/stats.json

The reply is No Such Resource.

@my8100
Owner

my8100 commented Aug 8, 2019

I've installed logparser and launched it.

Restart logparser and post the full log.

@goshaQ

goshaQ commented Aug 8, 2019

Restart logparser and post the full log.

[2019-08-08 07:23:22,926] INFO     in logparser.run: LogParser version: 0.8.2
[2019-08-08 07:23:22,927] INFO     in logparser.run: Use 'logparser -h' to get help
[2019-08-08 07:23:22,927] INFO     in logparser.run: Main pid: 20297
[2019-08-08 07:23:22,927] INFO     in logparser.run: Check out the config file below for more advanced settings.

****************************************************************************************************
Loading settings from /usr/local/lib/python3.6/dist-packages/logparser/settings.py
****************************************************************************************************

[2019-08-08 07:23:22,928] DEBUG    in logparser.run: Reading settings from command line: Namespace(delete_json_files=False, disable_telnet=False, main_pid=0, scrapyd_logs_dir='/somepath', scrapyd_server='127.0.0.1:6800', sleep=10, verbose=False)
[2019-08-08 07:23:22,928] DEBUG    in logparser.run: Checking config
[2019-08-08 07:23:22,928] INFO     in logparser.run: SCRAPYD_SERVER: 127.0.0.1:6800
[2019-08-08 07:23:22,928] INFO     in logparser.run: SCRAPYD_LOGS_DIR: /somepath
[2019-08-08 07:23:22,928] INFO     in logparser.run: PARSE_ROUND_INTERVAL: 10
[2019-08-08 07:23:22,928] INFO     in logparser.run: ENABLE_TELNET: True
[2019-08-08 07:23:22,928] INFO     in logparser.run: DELETE_EXISTING_JSON_FILES_AT_STARTUP: False
[2019-08-08 07:23:22,928] INFO     in logparser.run: VERBOSE: False

****************************************************************************************************
Visit stats at: http://127.0.0.1:6800/logs/stats.json
****************************************************************************************************

[2019-08-08 07:23:23,294] INFO     in logparser.utils: Running the latest version: 0.8.2
[2019-08-08 07:23:26,299] WARNING  in logparser.logparser: New logfile found: /somepath/2019-08-08T07_19_39.log (121355 bytes)
[2019-08-08 07:23:26,299] WARNING  in logparser.logparser: Json file not found: /somepath/2019-08-08T07_19_39.json
[2019-08-08 07:23:26,299] WARNING  in logparser.logparser: New logfile: /somepath/2019-08-08T07_19_39.log (121355 bytes) -> parse
[2019-08-08 07:23:26,331] WARNING  in logparser.logparser: Saved to /somepath/2019-08-08T07_19_39.json
[2019-08-08 07:23:26,332] WARNING  in logparser.logparser: Saved to http://127.0.0.1:6800/logs/stats.json
[2019-08-08 07:23:26,332] WARNING  in logparser.logparser: Sleep 10 seconds
[2019-08-08 07:23:36,343] WARNING  in logparser.logparser: Saved to http://127.0.0.1:6800/logs/stats.json
[2019-08-08 07:23:36,343] WARNING  in logparser.logparser: Sleep 10 seconds
[2019-08-08 07:23:46,350] WARNING  in logparser.logparser: Saved to http://127.0.0.1:6800/logs/stats.json
[2019-08-08 07:23:46,351] WARNING  in logparser.logparser: Sleep 10 seconds

@my8100
Owner

my8100 commented Aug 8, 2019

Check if SCRAPYD_LOGS_DIR/stats.json exists.
Visit http://127.0.0.1:6800/logs/stats.json again.
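
A quick way to check both from the box running Scrapyd, assuming SCRAPYD_LOGS_DIR is the /somepath shown in your log:

    ls -l /somepath/stats.json
    curl http://127.0.0.1:6800/logs/stats.json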

@goshaQ

goshaQ commented Aug 8, 2019

There is a .json file, but it's named the same as the .log file.
The reply is the same: No Such Resource.

@my8100
Owner

my8100 commented Aug 8, 2019

Check if SCRAPYD_LOGS_DIR/stats.json exists.

@my8100
Owner

my8100 commented Aug 8, 2019
