
Refresh crawl status after long time leads to memory error. #11

Closed
WNiels opened this issue Nov 30, 2018 · 7 comments
Labels
bug Something isn't working

Comments

@WNiels commented Nov 30, 2018

I have a crawl running. After 87000 seconds since the last refresh, the following error occurs when I try to refresh the status:

Traceback (most recent call last):
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 94, in dispatch_request
    self.request_scrapy_log()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 142, in request_scrapy_log
    self.status_code, self.text = self.make_request(self.url, api=False, auth=self.AUTH)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/myview.py", line 191, in make_request
    front = r.text[:min(100, len(r.text))].replace('\n', '')
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

The crawl itself seems to be running fine, though.

@my8100 (Owner) commented Nov 30, 2018

In my experience, it's due to insufficient memory. Could you tell me the size of the current log file and your free / total RAM?
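
For context, here is a minimal sketch (not ScrapydWeb's actual code; the URL is a placeholder) of why this runs out of memory: the traceback shows make_request() fetching the whole Scrapyd log over HTTP and r.text decoding it into a single Python string, so the process needs free RAM on the order of the log size. Checking Content-Length first, or streaming the response, keeps memory use flat:

import requests

# Placeholder URL for a Scrapyd log endpoint (project/spider/job names are made up).
LOG_URL = 'http://127.0.0.1:6800/logs/myproject/myspider/jobid.log'

# A HEAD request reports the size of the log before downloading it.
head = requests.head(LOG_URL)
print('log size: %s bytes' % head.headers.get('Content-Length', 'unknown'))

# r.text (as in myview.py's make_request) decodes the whole body into one str.
# Streaming the response instead processes it chunk by chunk:
with requests.get(LOG_URL, stream=True) as r:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        pass  # inspect/parse each 1 MB chunk here instead of keeping it all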

@my8100 (Owner) commented Nov 30, 2018

Also, if ScrapydWeb and Scrapyd run on the same host, you can set the SCRAPYD_LOGS_DIR option so that the local log file is read directly. This works only when the Scrapyd server is added as '127.0.0.1' in ScrapydWeb's config file.
Note that parsing the log file with regular expressions may still cause a memory error if memory is insufficient.

https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py#L60

# Set to speed up loading scrapy logs.
# e.g., 'C:/Users/username/logs/' or '/home/username/logs/'
# The setting takes effect only when both ScrapydWeb and Scrapyd run on the same machine,
# and the Scrapyd server ip is added as '127.0.0.1'.
# Check out here to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
SCRAPYD_LOGS_DIR = ''
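
For example, a user's ScrapydWeb settings file might then look roughly like this (a sketch only; the log path is a placeholder, and the Scrapyd server must be the same machine):

# Sketch of a local ScrapydWeb config: the Scrapyd server is added as 127.0.0.1,
# so SCRAPYD_LOGS_DIR lets ScrapydWeb read the log files straight from disk
# instead of downloading them over HTTP. The path below is a placeholder.
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',
]
SCRAPYD_LOGS_DIR = '/home/username/logs/'  # should match Scrapyd's logs_dir setting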

@WNiels (Author) commented Nov 30, 2018

Thanks for the fast reply.
I don't want to interrupt the crawl, but it should finish within a few days; then I'll test the above and give an update.

@my8100 (Owner) commented Nov 30, 2018

Actually, you only need to reconfigure and restart ScrapydWeb; your crawl won't be interrupted.

@my8100 (Owner) commented Nov 30, 2018

It's possible that you won't be able to reproduce the problem once your crawl is finished, since there would then be enough memory for ScrapydWeb to parse the log.
Alternatively, as a temporary solution, you can run another ScrapydWeb instance on a different computer with enough memory.

@WNiels (Author) commented Nov 30, 2018

OK, that's the issue: 600 MB of RAM left and an 800 MB log.

@WNiels WNiels closed this as completed Nov 30, 2018
@my8100 my8100 mentioned this issue Dec 19, 2018
@my8100 my8100 reopened this Jan 21, 2019
@my8100 (Owner) commented Jan 21, 2019

Fixed in v1.1.0: large log files are now cut into chunks and parsed periodically and incrementally with the help of LogParser.
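
Roughly, incremental parsing means remembering how far into the file the previous pass got and reading only the newly appended bytes on each pass. A minimal sketch under assumed names (placeholder path and example pattern; not LogParser's actual implementation):

import os
import re
import time

LOG_PATH = '/home/username/logs/myproject/myspider/jobid.log'  # placeholder path
CRAWLED = re.compile(r'Crawled \(\d{3}\)')  # example: count Scrapy "Crawled (200)" lines

offset = 0      # bytes already parsed in earlier passes
crawled = 0
while True:
    size = os.path.getsize(LOG_PATH)
    if size > offset:
        with open(LOG_PATH, 'rb') as f:
            f.seek(offset)
            chunk = f.read(size - offset)          # only the newly appended bytes
        crawled += len(CRAWLED.findall(chunk.decode('utf-8', errors='replace')))
        offset = size
        print('crawled pages seen so far:', crawled)
    time.sleep(10)  # parse periodically instead of re-reading the whole file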

@my8100 my8100 closed this as completed Jan 21, 2019
@my8100 my8100 added the bug Something isn't working label Jan 21, 2019
@my8100 my8100 self-assigned this Jan 21, 2019
@my8100 my8100 removed their assignment May 16, 2019