Error In "search_crawler.py" #1

Open
simonjisu opened this issue Feb 2, 2019 · 1 comment

Comments

@simonjisu

Hello! I ran into an error and am opening this issue after working out a fix.

Environment:

Ubuntu Server 18.04 LTS
Python : 3.6.5
beautifulsoup4 : 4.7.1
requests : 2.21.0

Command run

$ python searching_news_comments.py --query_file queries.txt

Error description

The error was raised in the _parse_urls_from_page function in search_crawler.py:

url_patterns = ('a[href=https://news.naver.com/main/read.nhn?]',
            'a[href^=https://entertain.naver.com/main/read.nhn?]',
            'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
            'a[href^=https://news.naver.com/sports/index.nhn?]')
...
for pattern in url_patterns:
        article_urls = [link['href'] for link in article_blocks.select(pattern)]
        urls_in_page.update(article_urls)

The article_blocks variable (a bs4.element.Tag object) could not parse pattern.
Wrapping the URL in each pattern string in double quotes (" ") fixed the problem.

  • Before: ('a[href=https://news.naver.com/main/read.nhn?]'
  • After: ('a[href="https://news.naver.com/main/read.nhn?"]'
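
For anyone who wants to try the quoted form in isolation, here is a minimal sketch. The HTML snippet and the parse_article_urls helper are made up for illustration and are not the repository's code. It also assumes a prefix match (^=) is intended for every pattern, as in three of the four patterns quoted above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a block of search results (not real Naver markup).
html = """
<ul>
  <li><a href="https://news.naver.com/main/read.nhn?aid=1">news</a></li>
  <li><a href="https://entertain.naver.com/main/read.nhn?aid=2">entertain</a></li>
  <li><a href="https://example.com/other">unrelated</a></li>
</ul>
"""

# Attribute values are wrapped in double quotes, so the selector does not
# depend on how leniently the installed CSS engine treats unquoted values.
url_patterns = (
    'a[href^="https://news.naver.com/main/read.nhn?"]',
    'a[href^="https://entertain.naver.com/main/read.nhn?"]',
    'a[href^="https://sports.news.naver.com/sports/index.nhn?"]',
    'a[href^="https://news.naver.com/sports/index.nhn?"]',
)


def parse_article_urls(article_blocks):
    """Collect the href of every link that matches one of url_patterns."""
    urls_in_page = set()
    for pattern in url_patterns:
        urls_in_page.update(link['href'] for link in article_blocks.select(pattern))
    return urls_in_page


soup = BeautifulSoup(html, 'html.parser')
article_blocks = soup.select_one('ul')
urls = parse_article_urls(article_blocks)
print(sorted(urls))
```

Only the two links whose href starts with one of the listed prefixes are collected; the unrelated link is skipped.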

PS: I'm not sure whether this only happens to me, but I hope it helps anyone else who hits the same error. :)


Error details

Traceback (most recent call last):
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 97, in _parse_urls_from_page
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/bs4/element.py", line 1376, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
    return cp._cached_css_compile(pattern, namespaces, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 192, in _cached_css_compile
    CSSParser(pattern, flags).process_selectors(),
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 894, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 744, in parse_selectors
    key, m = next(iselector)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
    raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "searching_news_comments.py", line 73, in <module>
    main()
  File "searching_news_comments.py", line 70, in main
    crawler.search(query, bd, ed)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 131, in search
    scrap_date, verbose=self.verbose, debug=self.debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 26, in get_article_urls
    search_result_url, num_articles, verbose, debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 70, in _extract_urls_from_search_result
    urls_in_page = _parse_urls_from_page(search_result_url, page)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 100, in _parse_urls_from_page
    raise ValueError('Failed to extract urls from page %s' % str(e))
ValueError: Failed to extract urls from page Malformed attribute selector at position 1
@lovit
Owner

lovit commented Feb 2, 2019

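Thank you for looking into the error. I suspected the beautifulsoup4 version, and checking confirmed that it was the cause.

I developed this with beautifulsoup4 4.6.0. In that version the selector behaves the same whether or not the value after href^= is wrapped in ". This seems to have changed in 4.7.x, where the selector only works if the " are present.

Since recent versions require the " and 4.6.x works the same with them, I will update the code to the quoted form.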
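
If a project has many selector strings written in the old unquoted style, one possible way to migrate them is a small helper like the sketch below. quote_attribute_values is a hypothetical name, not part of beautifulsoup4 or this repository. It assumes attribute values contain no ] or " characters, and it leaves values that are already quoted untouched.

```python
import re

# Matches [name OP value] where value does not start with a quote character.
# OP is one of the CSS attribute operators: =, ~=, |=, ^=, $=, *=
_UNQUOTED_ATTR = re.compile(
    r"""\[\s*([\w-]+)\s*([~|^$*]?=)\s*([^\]"'\s][^\]]*?)\s*\]"""
)


def quote_attribute_values(selector):
    """Return selector with every unquoted attribute value wrapped in double quotes."""
    def _quote(match):
        name, operator, value = match.groups()
        return '[{}{}"{}"]'.format(name, operator, value)

    return _UNQUOTED_ATTR.sub(_quote, selector)


old_pattern = 'a[href^=https://entertain.naver.com/main/read.nhn?]'
new_pattern = quote_attribute_values(old_pattern)
print(new_pattern)
```

Writing the quotes directly in the source, as described above, is the simpler fix. A helper like this is only worth it for bulk-converting existing patterns.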

Thank you for describing the issue and the fix in such detail. It made the bug easy to track down!!

@lovit lovit closed this as completed in df85f08 Feb 2, 2019
lovit added a commit that referenced this issue Feb 2, 2019
@lovit lovit reopened this Feb 3, 2019