
Refresh crawl status after long time leads to memory error. #11

Closed
WNiels opened this issue Nov 30, 2018 · 7 comments
Labels
bug Something isn't working

Comments

@WNiels commented Nov 30, 2018

I have a crawl running. After 87000 seconds since the last refresh, the following error occurs when I try to refresh the status:

Traceback (most recent call last):
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 94, in dispatch_request
    self.request_scrapy_log()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 142, in request_scrapy_log
    self.status_code, self.text = self.make_request(self.url, api=False, auth=self.AUTH)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/myview.py", line 191, in make_request
    front = r.text[:min(100, len(r.text))].replace('\n', '')
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

The crawl itself seems to be running fine, though.

@my8100 (Owner) commented Nov 30, 2018

In my experience, it's due to insufficient memory. Could you tell me the size of the current log file and your free / total RAM?
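
For context, here is a minimal sketch (not ScrapydWeb's actual code; the URL is a placeholder) of why this runs out of memory: the traceback shows make_request() fetching the whole Scrapyd log over HTTP and r.text decoding it into a single Python string, so the process needs free RAM on the order of the log size. Checking Content-Length first, or streaming the response, keeps memory use flat:

import requests

# Placeholder URL for a Scrapyd log endpoint (project/spider/job names are made up).
LOG_URL = 'http://127.0.0.1:6800/logs/myproject/myspider/jobid.log'

# A HEAD request reports the size of the log before downloading it.
head = requests.head(LOG_URL)
print('log size: %s bytes' % head.headers.get('Content-Length', 'unknown'))

# r.text (as in myview.py's make_request) decodes the whole body into one str.
# Streaming the response instead processes it chunk by chunk:
with requests.get(LOG_URL, stream=True) as r:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        pass  # inspect/parse each 1 MB chunk here instead of keeping it all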

@my8100 (Owner) commented Nov 30, 2018

Also, if ScrapydWeb and Scrapyd run on the same host, you can set the SCRAPYD_LOGS_DIR option so that the local log file is read directly. This works only when the Scrapyd server is added as '127.0.0.1' in ScrapydWeb's config file.
Note that parsing the log file with regular expressions may still cause a memory error if memory is insufficient.

https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py#L60

# Set to speed up loading scrapy logs.
# e.g., 'C:/Users/username/logs/' or '/home/username/logs/'
# The setting takes effect only when both ScrapydWeb and Scrapyd run on the same machine,
# and the Scrapyd server ip is added as '127.0.0.1'.
# Check out here to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
SCRAPYD_LOGS_DIR = ''
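
For example, a user's ScrapydWeb settings file might then look roughly like this (a sketch only; the log path is a placeholder, and the Scrapyd server must be the same machine):

# Sketch of a local ScrapydWeb config: the Scrapyd server is added as 127.0.0.1,
# so SCRAPYD_LOGS_DIR lets ScrapydWeb read the log files straight from disk
# instead of downloading them over HTTP. The path below is a placeholder.
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',
]
SCRAPYD_LOGS_DIR = '/home/username/logs/'  # should match Scrapyd's logs_dir setting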

@WNiels (Author) commented Nov 30, 2018

Thanks for the fast reply.
I don't want to interrupt the crawl, but it should finish within a few days; then I'll test the above and give an update.

@my8100 (Owner) commented Nov 30, 2018

Actually, you only need to reconfigure and restart ScrapydWeb; your crawl won't be interrupted.

@my8100 (Owner) commented Nov 30, 2018

It's possible that you won't be able to reproduce the problem once your crawl is finished, since there would then be enough memory for ScrapydWeb to parse the log.
Alternatively, as a temporary solution, you can run another ScrapydWeb instance on a different computer with enough memory.

@WNiels (Author) commented Nov 30, 2018

OK, that's the issue: 600 MB of RAM left and an 800 MB log.

@WNiels WNiels closed this as completed Nov 30, 2018
@my8100 my8100 mentioned this issue Dec 19, 2018
@my8100 my8100 reopened this Jan 21, 2019
@my8100 (Owner) commented Jan 21, 2019

Fixed in v1.1.0: large log files are now cut into chunks and parsed periodically and incrementally with the help of LogParser.
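
Roughly, incremental parsing means remembering how far into the file the previous pass got and reading only the newly appended bytes on each pass. A minimal sketch under assumed names (placeholder path and example pattern; not LogParser's actual implementation):

import os
import re
import time

LOG_PATH = '/home/username/logs/myproject/myspider/jobid.log'  # placeholder path
CRAWLED = re.compile(r'Crawled \(\d{3}\)')  # example: count Scrapy "Crawled (200)" lines

offset = 0      # bytes already parsed in earlier passes
crawled = 0
while True:
    size = os.path.getsize(LOG_PATH)
    if size > offset:
        with open(LOG_PATH, 'rb') as f:
            f.seek(offset)
            chunk = f.read(size - offset)          # only the newly appended bytes
        crawled += len(CRAWLED.findall(chunk.decode('utf-8', errors='replace')))
        offset = size
        print('crawled pages seen so far:', crawled)
    time.sleep(10)  # parse periodically instead of re-reading the whole file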

@my8100 my8100 closed this as completed Jan 21, 2019
@my8100 my8100 added the bug Something isn't working label Jan 21, 2019
@my8100 my8100 self-assigned this Jan 21, 2019
@my8100 my8100 removed their assignment May 16, 2019