# Web Crawling with Beautiful Soup

## Beautiful Soup
A Python library for extracting meaningful data from HTML and XML files. It works together with a parser to navigate, search, and modify the parse tree.

Installation (already included with Anaconda):

`pip install beautifulsoup4`

In [None]:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.naver.com')
soup = BeautifulSoup(page.content, "html.parser")
soup.find_all('img')[:5]

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(type(soup))
# print(soup)
print(soup.prettify())

In [None]:
soup.title         # tag

In [None]:
soup.title.name    # tag name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']     # attributes
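
Attribute access behaves like a dictionary lookup. The following is a minimal sketch on a hypothetical one-tag document (not `html_doc`) showing that `class` comes back as a list while other attributes come back as strings, and that `.get()` avoids a `KeyError` for a missing attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for illustration.
demo = BeautifulSoup('<p class="title" id="intro">Hi</p>', "html.parser")
p_tag = demo.p

class_value = p_tag["class"]     # multi-valued attribute: returned as a list
id_value = p_tag["id"]           # single-valued attribute: returned as a str
missing = p_tag.get("lang")      # .get() returns None instead of raising KeyError
all_attrs = p_tag.attrs          # every attribute as a dict

print(class_value, id_value, missing, all_attrs)
```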

In [None]:
soup.a

In [None]:
soup.find_all('a')

In [None]:
soup.find(id="link3")

Extracting all the URLs found within a page’s `<a>` tags:

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))     # 'href' attribute
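
Real pages often mix relative links with `<a>` tags that have no `href` at all. As a hedged sketch, the snippet below filters with `href=True` and resolves each link against a base URL using `urllib.parse.urljoin`; the `base_url` value and the HTML snippet are assumptions made up for this illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://example.com/"   # assumed base URL for this illustration
snippet = (
    '<a href="/elsie">Elsie</a>'
    '<a name="anchor">no href here</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
)
demo = BeautifulSoup(snippet, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
urls = [urljoin(base_url, a["href"]) for a in demo.find_all("a", href=True)]
print(urls)
```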

Extracting all the text from a page:

In [None]:
soup.get_text()
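
By default `get_text()` concatenates every string exactly as it appears, including surrounding whitespace. A small sketch, using a hypothetical snippet instead of `html_doc`, of the optional `separator` argument and `strip=True`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = "<p> Hello, <b> world </b>!</p>"
demo = BeautifulSoup(snippet, "html.parser")

raw_text = demo.get_text()                    # strings concatenated as-is
joined_text = demo.get_text("|", strip=True)  # strip each string, join with "|"

print(repr(raw_text))
print(repr(joined_text))
```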

## Navigating the tree

In [None]:
soup

In [None]:
body_tag = soup.body
body_tag.contents   # list of direct children (tags and strings)

### Going down

In [None]:
for child in soup.body.children:    # iterator over direct children only
    print(repr(child), end="***\n")

In [None]:
for child in soup.body.descendants: # iterator over all descendants (depth first)
    print(repr(child), end='***\n')
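
The difference between the two iterators is easier to see on a tiny tree. This sketch assumes a hypothetical `<div>` snippet and counts what each iterator yields; text nodes are included in both, so `Tag` is used to pick out the elements:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, used only for illustration.
demo = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = demo.div

children = list(div.children)        # the two <p> tags only
descendants = list(div.descendants)  # tags and strings, depth first

# Keep only element nodes to see the visiting order.
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), descendant_tags)
```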

### Going up

In [None]:
title_tag = soup.title
title_tag.parent      # element's parent

In [None]:
link = soup.a
print(link)
for parent in link.parents:  # every ancestor, nearest first
    if parent is None:
        print(parent)
    else:
        print(parent.name)

### Going sideways

In [None]:
link = soup.a
link

In [None]:
for sibling in link.next_siblings: # siblings that come after this element
    print(repr(sibling), end='***\n')

In [None]:
for sibling in soup.find(id="link3").previous_siblings: # siblings that come before this element
    print(repr(sibling))
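
Sibling iteration includes the text between tags, so the immediate neighbour of a link is often a string rather than another tag. A minimal sketch, on a hypothetical two-link snippet, of using `find_next_sibling()` and `find_previous_sibling()` to skip straight to the neighbouring tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

first_link = demo.find(id="link1")
second_link = demo.find(id="link2")

text_neighbour = first_link.next_sibling            # the string between the tags
next_tag = first_link.find_next_sibling("a")        # next <a> sibling
prev_tag = second_link.find_previous_sibling("a")   # previous <a> sibling

print(repr(text_neighbour), next_tag["id"], prev_tag["id"])
```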

### find_all()

In [None]:
soup.find_all("p", "title")

In [None]:
soup.find(id="link2")

In [None]:
import re
soup.find_all(id=re.compile("link"))
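
`find_all()` accepts several kinds of filters besides a plain tag name. The sketch below, built on a hypothetical snippet, combines a list of tag names, a compiled regular expression on `id`, and the `limit` argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = (
    '<p><b>bold</b><a id="link1">one</a>'
    '<a id="link2">two</a><a id="other">three</a></p>'
)
demo = BeautifulSoup(snippet, "html.parser")

by_list = demo.find_all(["a", "b"])                    # any tag named in the list
by_regex = demo.find_all(id=re.compile(r"^link\d$"))   # id matched with a regex
limited = demo.find_all("a", limit=2)                  # stop after two matches

print([t.name for t in by_list], [t["id"] for t in by_regex], len(limited))
```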

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.find_all("p", class_="strikeout")

In [None]:
css_soup.find_all("p", class_="body")
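
Matching on the full class string is stricter than matching on a single class. As a sketch on the same one-tag document (rebuilt here under the hypothetical name `css_demo` so the block runs on its own), the exact string matches, a reordered string does not, and a CSS selector matches regardless of order:

```python
from bs4 import BeautifulSoup

css_demo = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

class_values = css_demo.p["class"]                           # list of classes
exact = css_demo.find_all("p", class_="body strikeout")      # exact string value
reordered = css_demo.find_all("p", class_="strikeout body")  # different order
both = css_demo.select("p.strikeout.body")                   # order-independent

print(class_values, len(exact), len(reordered), len(both))
```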

### CSS selectors
Find tags:

In [None]:
soup.select("p")   # 'p' tag

In [None]:
soup.select("p:nth-of-type(3)")   # the third 'p' tag

Find tags beneath other tags:

In [None]:
soup.select("body a")  # 'a' tags anywhere beneath the 'body' tag

Find tags directly beneath other tags:

In [None]:
soup.select("p > a")   # 'a' tags directly beneath a 'p' tag

In [None]:
soup.select("p > a:nth-of-type(2)")

In [None]:
soup.select("p > #link1")  # the element with id 'link1' directly beneath a 'p' tag

Find tags by CSS class:

In [None]:
soup.select(".sister")

Find tags by attribute value:

In [None]:
soup.select('a[href="http://example.com/elsie"]')
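
Attribute selectors can also match on a prefix or suffix, and `select_one()` returns only the first match (or `None`). A brief sketch on a trimmed copy of the sample links, rebuilt here as the hypothetical variable `demo` so the block is self-contained:

```python
from bs4 import BeautifulSoup

snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

first = demo.select_one(".sister")                           # first match only
ends_with = demo.select('a[href$="tillie"]')                 # href ends with
starts_with = demo.select('a[href^="http://example.com/"]')  # href starts with
missing = demo.select_one("table")                           # no match gives None

print(first["id"], [a["id"] for a in ends_with], len(starts_with), missing)
```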

## Example

In [13]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://namu.wiki/w/%EB%B0%A9%ED%83%84%EC%86%8C%EB%85%84%EB%8B%A8')
r.content   # raw response body as bytes

b'<!doctype html>\n<html data-n-head-ssr><head ><title>\xeb\xb0\xa9\xed\x83\x84\xec\x86\x8c\xeb\x85\x84\xeb\x8b\xa8 - \xeb\x82\x98\xeb\xac\xb4\xec\x9c\x84\xed\x82\xa4</title><meta data-n-head="ssr" charset="utf-8"><meta data-n-head="ssr" name="viewport" content="user-scalable=no, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0, width=device-width"><meta data-n-head="ssr" http-equv="x-ua-compatible" content="ie=edge"><meta data-n-head="ssr" name="generator" content="the seed"><meta data-n-head="ssr" name="mobile-web-app-capable" content="yes"><meta data-n-head="ssr" name="application-name" content="\xeb\x82\x98\xeb\xac\xb4\xec\x9c\x84\xed\x82\xa4"><meta data-n-head="ssr" name="msapplication-tooltip" content="\xeb\x82\x98\xeb\xac\xb4\xec\x9c\x84\xed\x82\xa4"><meta data-n-head="ssr" name="msapplication-starturl" content="/w/%EB%82%98%EB%AC%B4%EC%9C%84%ED%82%A4:%EB%8C%80%EB%AC%B8"><meta data-n-head="ssr" name="robots" content="max-image-preview:large"><meta data-n-head="ssr" name="

In [16]:
r.text   # response body decoded to str

'<!doctype html>\n<html data-n-head-ssr><head ><title>방탄소년단 - 나무위키</title><meta data-n-head="ssr" charset="utf-8"><meta data-n-head="ssr" name="viewport" content="user-scalable=no, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0, width=device-width"><meta data-n-head="ssr" http-equv="x-ua-compatible" content="ie=edge"><meta data-n-head="ssr" name="generator" content="the seed"><meta data-n-head="ssr" name="mobile-web-app-capable" content="yes"><meta data-n-head="ssr" name="application-name" content="나무위키"><meta data-n-head="ssr" name="msapplication-tooltip" content="나무위키"><meta data-n-head="ssr" name="msapplication-starturl" content="/w/%EB%82%98%EB%AC%B4%EC%9C%84%ED%82%A4:%EB%8C%80%EB%AC%B8"><meta data-n-head="ssr" name="robots" content="max-image-preview:large"><meta data-n-head="ssr" name="theme-color" content="#008275"><meta data-n-head="ssr" name="googlebot" content="noarchive"><link data-n-head="ssr" rel="canonical" href="https://namu.wiki/w/%EB%B0%A9%ED%83%84%EC%86%8C%EB

In [17]:
s = BeautifulSoup(r.content, "html.parser")
s

<!DOCTYPE doctype html>

<html data-n-head-ssr=""><head><title>방탄소년단 - 나무위키</title><meta charset="utf-8" data-n-head="ssr"/><meta content="user-scalable=no, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0, width=device-width" data-n-head="ssr" name="viewport"/><meta content="ie=edge" data-n-head="ssr" http-equv="x-ua-compatible"/><meta content="the seed" data-n-head="ssr" name="generator"/><meta content="yes" data-n-head="ssr" name="mobile-web-app-capable"/><meta content="나무위키" data-n-head="ssr" name="application-name"/><meta content="나무위키" data-n-head="ssr" name="msapplication-tooltip"/><meta content="/w/%EB%82%98%EB%AC%B4%EC%9C%84%ED%82%A4:%EB%8C%80%EB%AC%B8" data-n-head="ssr" name="msapplication-starturl"/><meta content="max-image-preview:large" data-n-head="ssr" name="robots"/><meta content="#008275" data-n-head="ssr" name="theme-color"/><meta content="noarchive" data-n-head="ssr" name="googlebot"/><link data-n-head="ssr" href="https://namu.wiki/w/%EB%B0%A9%ED%83%84%EC%86%8

In [4]:
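# 'href' is an attribute of <a> tags; <td> tags do not carry one,
# so searching 'td' would print nothing.
for tag in s.find_all('a'):
    href = tag.get("href")
    if href is not None:
        print(href, tag.text)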
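
Because `href` lives on `<a>` tags rather than on table cells, links inside a table are usually reached with a descendant selector. The sketch below runs offline on a hypothetical table (the paths and labels are invented for illustration and are not taken from the page above):

```python
from bs4 import BeautifulSoup

# Hypothetical table, used only for illustration.
table_html = """
<table>
  <tr><td><a href="/w/PageA">Page A</a></td><td>first row</td></tr>
  <tr><td><a href="/w/PageB">Page B</a></td><td>second row</td></tr>
</table>
"""
demo = BeautifulSoup(table_html, "html.parser")

# "td a" selects every <a> nested anywhere inside a <td>.
pairs = [(a["href"], a.get_text(strip=True)) for a in demo.select("td a")]
print(pairs)
```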