# Web Crawling with Beautiful Soup

## Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. 

Installation (already included with Anaconda):

`pip install beautifulsoup4`

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.naver.com"
with urlopen(url) as f:
    naver = BeautifulSoup(f, "html.parser")
naver.find_all('img')[:5]

[<img alt="연합뉴스TV" class="api_logo" height="24" src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd154219877.png"/>,
 <img alt="아이뉴스24" class="api_logo" height="24" src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd153955864.png"/>,
 <img alt="서울신문" class="api_logo" height="24" src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd145738195.png"/>,
 <img alt="KBS" class="api_logo" height="24" src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd173124306.png"/>,
 <img alt="파이낸셜뉴스" class="api_logo" height="24" src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd172557496.png"/>]

In [2]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(type(soup))
soup
# print(soup.prettify())

<class 'bs4.BeautifulSoup'>



<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [4]:
soup.title         # tag

<title>The Dormouse's story</title>

In [5]:
soup.title.name    # tag name

'title'

In [6]:
soup.title.string

"The Dormouse's story"

In [7]:
soup.title.parent.name

'head'

In [8]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [9]:
soup.p['class']     # attributes

['title']

In [10]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [11]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [12]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extracting all the URLs found within a page’s `<a>` tags:

In [13]:
for link in soup.find_all('a'):
    print(link.get('href'))     # 'href' attribute

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
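
Real pages often mix absolute and relative `href` values, and some `<a>` tags carry no `href` at all. The following is a minimal sketch of one way to normalize them, assuming the standard library's `urllib.parse.urljoin`; the `base_url` and the markup are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used only for illustration.
base_url = "http://example.com/stories/index.html"
html = '<a href="/elsie">Elsie</a> <a href="lacie">Lacie</a> <a name="top">top</a>'

page = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
absolute_links = [
    urljoin(base_url, a["href"])
    for a in page.find_all("a", href=True)
]
print(absolute_links)
```

Filtering with `href=True` avoids the `None` values that `link.get('href')` returns for anchors without a target.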


Extracting all the text from a page:

In [14]:
soup.get_text()

"\nThe Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"

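The raw result of `get_text()` keeps every newline and space from the markup. As a small sketch (the markup below is invented for illustration), the `separator` argument and `strip=True` can tidy the result, and `stripped_strings` yields the same pieces one at a time.

```python
from bs4 import BeautifulSoup

# Made-up markup with stray whitespace, for illustration only.
snippet = BeautifulSoup("<p>  Once upon a time </p>\n<p> The end </p>", "html.parser")

raw = snippet.get_text()                   # keeps every whitespace character
clean = snippet.get_text(" ", strip=True)  # strip each string, join with one space
pieces = list(snippet.stripped_strings)    # the stripped strings as a list

print(repr(raw))
print(repr(clean))
print(pieces)
```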
## Navigating the tree

In [15]:
soup


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [16]:
body_tag = soup.body
body_tag.contents

['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']

### Going down

In [17]:
for child in soup.body.children:    # iterator over direct children only
    print(child, end="***\n")


***
<p class="title"><b>The Dormouse's story</b></p>***

***
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>***

***
<p class="story">...</p>***

***


In [18]:
for child in soup.body.descendants: # iterator over all descendants
    print(child, end='***\n')


***
<p class="title"><b>The Dormouse's story</b></p>***
<b>The Dormouse's story</b>***
The Dormouse's story***

***
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>***
Once upon a time there were three little sisters; and their names were
***
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>***
Elsie***
,
***
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>***
Lacie***
 and
***
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>***
Tillie***
;
and they lived at the bottom of a well.***

***
<p class="story">...</p>***
...***

***
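
Both iterators also yield the whitespace strings between tags, as the outputs above show. When only elements are wanted, one possible approach is to filter on the `Tag` type, or to call `find_all(recursive=False)` for direct children. The markup here is a shortened stand-in, not the document used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in markup, for illustration only.
doc = BeautifulSoup("<body>\n<p>one <b>two</b></p>\n<p>three</p>\n</body>", "html.parser")

# Keep only Tag objects; NavigableString items such as '\n' are skipped.
child_tags = [c.name for c in doc.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in doc.body.descendants if isinstance(d, Tag)]

# find_all with recursive=False looks at direct children only.
direct = [t.name for t in doc.body.find_all(recursive=False)]

print(child_tags, descendant_tags, direct)
```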


### Going up

In [19]:
title_tag = soup.title
title_tag.parent      # element's parent

<head><title>The Dormouse's story</title></head>

In [20]:
link = soup.a
print(link)
for parent in link.parents:  # every ancestor, from nearest to the document root
    print(parent.name)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
p
body
html
[document]
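
Instead of looping over `.parents` by hand, `find_parent()` can jump straight to the nearest matching ancestor. A minimal sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
doc = BeautifulSoup(
    '<body><p class="story"><a id="link1">Elsie</a></p></body>', "html.parser"
)
link = doc.a

nearest_p = link.find_parent("p")       # closest enclosing <p>
no_match = link.find_parent("div")      # None when no ancestor matches
ancestor_names = [p.name for p in link.parents]

print(nearest_p["class"], no_match, ancestor_names)
```

The last name is `[document]` because the `BeautifulSoup` object itself is the root of the tree.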


### Going sideways

In [21]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [22]:
for sibling in link.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [23]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
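
As the outputs above show, the sibling iterators return text nodes as well as tags. A small sketch, on invented markup, of how `find_next_sibling()` and `find_previous_sibling()` skip over the text and return only matching tags:

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
doc = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>',
    "html.parser",
)
first = doc.a

plain_next = first.next_sibling                        # the text node ',\n'
next_link = first.find_next_sibling("a")               # the next <a> tag
later_ids = [a["id"] for a in first.find_next_siblings("a")]
before_last = doc.find(id="link3").find_previous_sibling("a")

print(repr(plain_next), next_link["id"], later_ids, before_last["id"])
```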


### find_all()

In [24]:
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [25]:
soup.find(id="link2")

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [26]:
import re
soup.find_all(id=re.compile("link"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
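
Besides strings and regular expressions, `find_all()` also accepts a list of names, a `limit`, and a function that receives each tag. The sketch below uses invented markup, and `has_class_but_no_id` is a helper defined here for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
markup = (
    '<p class="title"><b>story</b></p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
)
doc = BeautifulSoup(markup, "html.parser")


def has_class_but_no_id(tag):
    """Filter function: True for tags with a class attribute and no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_list = [t.name for t in doc.find_all(["a", "b"])]       # match any listed name
first_only = [t["id"] for t in doc.find_all("a", limit=1)]  # stop after one result
by_function = [t.name for t in doc.find_all(has_class_but_no_id)]

print(by_list, first_only, by_function)
```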

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [27]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.find_all("p", class_="strikeout")

[<p class="body strikeout"></p>]

In [28]:
css_soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]
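
A related detail: `class_` can also match the exact string value of the attribute, but then the order of the classes matters. The sketch below reuses the same one-line markup; requiring several classes in any order is easier with a CSS selector.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

exact = css_soup.find_all("p", class_="body strikeout")      # exact string: matches
reordered = css_soup.find_all("p", class_="strikeout body")  # different order: no match
both = css_soup.select("p.strikeout.body")                   # both classes, any order

print(len(exact), len(reordered), len(both))
```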

### CSS selectors
Find tags:

In [29]:
soup.select("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [30]:
soup.select("p:nth-of-type(3)")

[<p class="story">...</p>]

Find tags beneath other tags:

In [31]:
soup.select("body a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags directly beneath other tags:

In [32]:
soup.select("p > a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [33]:
soup.select("p > a:nth-of-type(2)")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [34]:
soup.select("p > #link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find tags by CSS class:

In [35]:
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

In [36]:
soup.select('a[href="http://example.com/elsie"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
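
`select()` always returns a list. When a single element is expected, `select_one()` returns the first match or `None`. The sketch below, on invented markup, also shows the substring forms of the attribute selector (`$=` for a suffix, `*=` for "contains").

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
doc = BeautifulSoup(
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a href="http://example.com/tillie" id="link3">Tillie</a></p>',
    "html.parser",
)

first = doc.select_one("p > a")                     # first match only
missing = doc.select_one("div")                     # None when nothing matches
ending = [a["id"] for a in doc.select('a[href$="tillie"]')]
containing = [a["id"] for a in doc.select('a[href*="example.com"]')]

print(first["id"], missing, ending, containing)
```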

## Example

In [37]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

context = ssl._create_unverified_context()  # skips certificate verification (insecure); only for when the default check fails
url = "https://en.wikipedia.org/wiki/BTS_(band)"
with urlopen(url, context=context) as f:
    s = BeautifulSoup(f, "html.parser")
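
The cell above turns off certificate verification through a private `ssl` helper. Where the system certificates are in order, a safer variant is possible. The following is a sketch only: `fetch_soup`, its `timeout` default, and the `User-Agent` string are hypothetical names and values chosen for illustration.

```python
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0):
    """Download url with certificate verification enabled and parse it.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    context = ssl.create_default_context()  # verifies the server certificate
    # Placeholder User-Agent; some sites reject the urllib default.
    request = Request(url, headers={"User-Agent": "bs4-tutorial-example/0.1"})
    with urlopen(request, timeout=timeout, context=context) as response:
        return BeautifulSoup(response, "html.parser")
```

Under these assumptions, `s = fetch_soup(url)` would stand in for the `with urlopen(...)` block above.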

In [38]:
for tag in s.find_all('a'):
    href = tag.get("href")
    if href is not None:
        print(href, tag.text)

/wiki/Wikipedia:Protection_policy#semi 
#mw-head Jump to navigation
#p-search Jump to search
/wiki/File:ROK_Order_of_Cultural_Merit_Hwa-gwan_(5th_Class)_ribbon.PNG 
/wiki/Order_of_Cultural_Merit_(Korea) Hwagwan Order of Cultural Merit
/wiki/File:%E2%80%98LG_Q7_BTS_%EC%97%90%EB%94%94%EC%85%98%E2%80%99_%EC%98%88%EC%95%BD_%ED%8C%90%EB%A7%A4_%EC%8B%9C%EC%9E%91_(42773472410)_(cropped).jpg 
/wiki/LG_Electronics LG Electronics
/wiki/V_(singer) V
/wiki/J-Hope J-Hope
/wiki/RM_(rapper) RM
/wiki/Kim_Seok-jin Jin
/wiki/Jimin_(singer,_born_1995) Jimin
/wiki/Jungkook Jungkook
/wiki/Suga_(rapper) Suga
/wiki/Seoul Seoul
/wiki/K-pop K-pop
/wiki/Hip_hop_music hip hop
/wiki/Contemporary_R%26B R&B
/wiki/Electronic_dance_music EDM
/wiki/Big_Hit_Entertainment Big Hit
/wiki/Pony_Canyon Pony Canyon
/wiki/Def_Jam_Recordings Def Jam Japan
#cite_note-1 [1]
/wiki/Columbia_Records Columbia
#cite_note-2 [2]
http://bts.ibighit.com bts.ibighit.com
/wiki/Kim_Seok-jin Jin
/wiki/Suga_(rapper) Suga
/wiki/J-Hope J-Hope
/w

http://www.billboard.com/articles/columns/k-town/7549104/bts-korean-boy-band-kpop-record-break "How Korean Boy Band BTS Broke a U.S. K-pop Chart Record – Without Any Songs in English"
https://web.archive.org/web/20161022142349/http://www.billboard.com/articles/columns/k-town/7549104/bts-korean-boy-band-kpop-record-break Archived
#cite_ref-227 ^
https://www.ibtimes.sg/producer-talks-about-why-bts-became-successful-k-pop-10615 "Producer talks about why BTS became successful in K-pop"
#cite_ref-:12_228-0 a
#cite_ref-:12_228-1 b
http://m.entertain.naver.com/read?oid=433&aid=0000041444 ""우리가 바로, 앰배서더"…BTS, '푸마' 글로벌 모델 발탁"
#cite_ref-229 ^
https://www.billboard.com/articles/columns/k-town/8458071/south-korean-president-moon-jae-bts-first-no-1-album-billboard-200-chart-kpop "South Korean President Moon Jae-in Congratulates BTS on First No. 1 Album"
#cite_ref-230 ^
https://www.etonline.com/bts-names-their-musical-inspiration-and-their-most-unexpected-celebrity-fan-104281 "BTS Names Their Musica