Time format cleaning fails when scraping WeChat official account articles #46

Closed
showthesunli opened this issue Mar 22, 2022 · 3 comments

@showthesunli
Contributor

The test script is as follows:

from src.collector.wechat_feddd.start import WeiXinSpider
WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()

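Cause of the error:
During data cleaning, the expected time format is 2022-03-21 20:59, but the value actually scraped is 2022-03-22 20:37:12 (it includes seconds), so the clean_doc_ts function raises an error, as shown in the screenshot below.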
image
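
The mismatch described above can be reproduced and tolerated with a small parser that tries more than one format. This is only a minimal sketch, not the project's actual clean_doc_ts: the function name parse_doc_ts and the format list are hypothetical, chosen to cover the two strings quoted in this report.

```python
from datetime import datetime

# Hypothetical list of accepted formats; the project's real clean_doc_ts may differ.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_doc_ts(raw: str) -> datetime:
    """Parse a publish-time string that may or may not include seconds."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; try the next candidate.
            continue
    raise ValueError(f"unrecognised time format: {raw!r}")


# Both strings from this report are accepted.
with_minutes = parse_doc_ts("2022-03-21 20:59")
with_seconds = parse_doc_ts("2022-03-22 20:37:12")
```

Trying the formats in order keeps the cleaning step strict (unknown formats still raise) while no longer depending on whether the page renders seconds.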

@showthesunli
Contributor Author

If the doc_ts extraction in wechat_itme.py is switched to the one on line 47, the article is scraped correctly, as shown in the screenshot below.
image

@howie6879
Owner

This is a bug. Time extraction will be changed to read the value directly from the page's JS script:

image
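
The actual change is in the screenshot and is not reproduced here. As a rough illustration of the idea of reading the time from an inline script instead of the rendered text, here is a sketch that assumes the page embeds a Unix timestamp as `var ct = "...";`. The variable name ct, the function extract_doc_ts, and the sample HTML are all assumptions made for this example.

```python
import re
from typing import Optional

# Assumed pattern: an inline script line such as  var ct = "1647952632";
# The variable name `ct` is an assumption, not confirmed by this thread.
CT_PATTERN = re.compile(r'var\s+ct\s*=\s*"(\d+)"')


def extract_doc_ts(html: str) -> Optional[int]:
    """Return the Unix timestamp found in the inline script, or None if absent."""
    match = CT_PATTERN.search(html)
    return int(match.group(1)) if match else None


# Made-up sample page fragment, for illustration only.
sample_html = '<script>var ct = "1647952632";</script>'
ts = extract_doc_ts(sample_html)
```

Reading a numeric timestamp avoids depending on how the page formats the visible date string, which is what broke the original cleaning step.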

@howie6879
Owner

howie6879 commented Mar 22, 2022

Fixed. Pull the updated image and restart:

docker pull liuliio/schedule:v0.2.4
