# Beautiful Soup

A library for extracting data from HTML and XML files.

[Beautiful Soup reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

<hr>

In [1]:
from bs4 import BeautifulSoup

In [2]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [4]:
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [5]:
soup.title # the tag together with its contents

<title>The Dormouse's story</title>

In [6]:
soup.title.name # the tag's name

'title'

In [7]:
soup.title.string # the text inside the tag

"The Dormouse's story"

In [8]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [9]:
soup.p['class'] # attribute values are accessed dict-style

['title']

In [10]:
soup.a # finds only the first <a> tag

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [11]:
soup.find_all('a') # finds every <a> tag

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [12]:
soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [13]:
for link in soup.find_all('a'):
    print(link.get('href')) # when a tag has several attributes, get() reads the one you name

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [14]:
# Extract all of the text in the document
soup.get_text()

"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"

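The string returned by `get_text()` keeps every newline from the markup. As a minimal sketch (the sample markup here is made up for illustration and is not part of the notebook), `get_text()` also accepts a separator and a `strip=True` flag, and `.stripped_strings` yields the same pieces one at a time:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the notebook above.
markup = "<p> Once <a>Elsie</a>, and <a>Lacie</a> </p>"
soup = BeautifulSoup(markup, "html.parser")

# Join the text pieces with "|" and strip whitespace from each piece;
# pieces that are empty after stripping are dropped.
joined = soup.get_text("|", strip=True)

# .stripped_strings yields the same stripped pieces lazily.
pieces = list(soup.stripped_strings)

print(joined)
print(pieces)
```

This form tends to be easier to post-process than the raw string, since the whitespace that only exists for source formatting is gone.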
<hr>

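## Parsers
1. html.parser: Python's built-in parser, nothing extra to install
2. lxml: the fastest of the four, so it is the recommended choice (requires the lxml package)
3. lxml-xml: lxml's XML parser, also selectable as 'xml'
4. html5lib: parses pages the way a web browser does, but is slow (requires the html5lib package)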
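The parsers differ mainly in how they repair invalid markup. The following is a rough sketch, assuming only that `bs4` is installed: it feeds the same broken snippet to each parser name from the list and records `None` for any parser whose backing package is missing. The variable names are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A stray closing tag: each parser repairs this differently.
invalid_markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "lxml-xml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(invalid_markup, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} {output!r}")
```

Because the resulting trees can differ, it helps to name the parser explicitly in every `BeautifulSoup(...)` call, as the cells in this notebook do.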

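## Kinds of objects

Beautiful Soup converts a complex HTML document into a tree of Python objects.

1. Tag
    - Corresponds to an HTML tag
    - Every tag has a name, accessible as .name
    - A tag can have any number of attributes and values. They are accessed dict-style, and .attrs returns the attribute/value dictionary
    - Attributes can be added, removed, and modified
    - An attribute that can hold several values (such as class) is returned as a list
2. NavigableString
    - The class that holds the text inside a tag
    - Can be converted to a plain Python string with str()
    - The string can be swapped for another with replace_with()
    - To use the text outside Beautiful Soup, convert it with str() first; otherwise it keeps a reference to the entire parse tree
3. BeautifulSoup
    - Represents the parsed document as a whole
4. Comment
    - A special type of NavigableString that holds a comment

Beautiful Soup also defines further classes such as Stylesheet (strings inside a style tag), Script (inside a script tag), and TemplateString (inside a template tag).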
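As a small sketch of how these classes relate, the helper below (the name `kind` and the sample markup are hypothetical) labels each node in a tree. The order of the checks matters: `Comment` is a subclass of `NavigableString`, and `BeautifulSoup` is a subclass of `Tag`, so the more specific class is tested first.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup containing a string, a comment, and a nested tag.
markup = '<p class="note">Hello<!-- hidden --><b>world</b></p>'
soup = BeautifulSoup(markup, "html.parser")

def kind(node):
    """Return the name of the most specific of the four main classes."""
    if isinstance(node, BeautifulSoup):   # subclass of Tag, so check first
        return "BeautifulSoup"
    if isinstance(node, Comment):         # subclass of NavigableString
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return type(node).__name__

kinds = [kind(child) for child in soup.p.contents]
print(kind(soup), kinds)
```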

In [15]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

In [16]:
tag = soup.b

In [17]:
type(tag)

bs4.element.Tag

In [18]:
tag

<b class="boldest">Extremely bold</b>

In [19]:
tag.name

'b'

In [20]:
tag.name = 'blockquote' # renaming the tag...

In [21]:
tag # ...changes the generated markup

<blockquote class="boldest">Extremely bold</blockquote>

In [22]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b

In [23]:
tag['id']

'boldest'

In [24]:
tag.attrs

{'id': 'boldest'}

In [25]:
tag['id'] = 'verybold'

In [26]:
tag

<b id="verybold">bold</b>

In [28]:
tag['another-attribute'] = 1

In [29]:
tag

<b another-attribute="1" id="verybold">bold</b>

In [30]:
del tag['id']
del tag['another-attribute']
tag

<b>bold</b>

In [31]:
tag.get('id') # None
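The difference between the two access styles only shows up when an attribute is missing. A minimal sketch, using made-up markup: indexing raises `KeyError`, while `get()` returns `None` or a default you supply.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with an href attribute but no id attribute.
tag = BeautifulSoup('<a href="/x">x</a>', "html.parser").a

href = tag.get("href")            # attribute present: its value
missing = tag.get("id")           # attribute absent: None
fallback = tag.get("id", "n/a")   # attribute absent: the supplied default

try:
    tag["id"]                     # dict-style access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)
```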

In [32]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

['body']

In [33]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

['body', 'strikeout']

In [34]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']

['index']

In [35]:
rel_soup.a['rel'] = ['index', 'contents'] # assigning a list sets several values

In [36]:
print(rel_soup.p)

<p>Back to the <a rel="index contents">homepage</a></p>


In [37]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# Looks like two values, but id is not multi-valued in the HTML standard, so the string is returned unchanged

'my id'

In [38]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser',
                             multi_valued_attributes=None)
# multi_valued_attributes=None turns off the list conversion, so the raw string is returned
no_list_soup.p['class']

'body strikeout'

In [39]:
id_soup.p.get_attribute_list('id') # always returns a list

['my id']

In [40]:
# With the XML parser there are no multi-valued attributes
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'

In [41]:
# Passing multi_valued_attributes to the XML parser
# makes it return the named attributes as lists
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml',
                         multi_valued_attributes=class_is_multi)
xml_soup.p['class']

['body', 'strikeout']

<hr>

In [42]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string

'Extremely bold'

In [43]:
type(tag.string)

bs4.element.NavigableString

In [44]:
unicode_string = str(tag.string)

In [45]:
unicode_string

'Extremely bold'

In [46]:
type(unicode_string)

str

In [47]:
tag.string.replace_with("No longer bold")
tag

<b class="boldest">No longer bold</b>

<hr>

In [48]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


In [49]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string

In [50]:
type(comment)

bs4.element.Comment

In [51]:
comment

'Hey, buddy. Want to buy a used parser?'

In [52]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
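Since a `Comment` is just a special string, it can be found with the same string filters as ordinary text. The sketch below, on made-up markup, collects every comment in a document and then removes them with `extract()`. Treat it as one possible approach, not the only one.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup with two comments around a paragraph.
markup = "<div><!-- a --><p>kept</p><!-- b --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Match only strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [str(c) for c in comments]

# extract() detaches each comment from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(texts)
print(cleaned)
```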


<hr>

## Navigating the Tree

- Walk the parsed tree by tag name
- .contents: returns a tag's direct children as a list
- .children: an iterator over the same direct children, for use in a loop

In [53]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

In [54]:
# Accessing by tag name returns only the first matching tag
soup.head

<head><title>The Dormouse's story</title></head>

In [55]:
soup.title

<title>The Dormouse's story</title>

In [56]:
soup.body.b # the first <b> tag beneath <body>

<b>The Dormouse's story</b>

In [57]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [58]:
# To find every tag with a given name, use find_all()
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [59]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [60]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [61]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [62]:
title_tag.contents

["The Dormouse's story"]

In [66]:
title_tag.contents[0].contents

AttributeError: 'NavigableString' object has no attribute 'contents'

In [63]:
len(soup.contents)

2

In [64]:
soup.contents

['\n',
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body></html>]

In [65]:
soup.contents[1].name

'html'

In [68]:
for child in title_tag.children:
    print(child)

The Dormouse's story


In [69]:
for child in soup.children:
    print(child)



<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
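As the output above shows, `.contents` and `.children` include whitespace-only strings such as `'\n'` alongside the tags. A minimal sketch, on made-up markup, of keeping only the `Tag` children with an `isinstance` check:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical list markup; the newlines between tags become string children.
markup = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
ul = BeautifulSoup(markup, "html.parser").ul

# Every direct child, including the "\n" strings between the <li> tags.
all_children = list(ul.children)

# Keep only real tags and drop the NavigableString children.
tag_children = [child for child in ul.children if isinstance(child, Tag)]
items = [str(child.string) for child in tag_children]

print(len(all_children), len(tag_children), items)
```

The same filter also guards against the `AttributeError` shown earlier, since only `Tag` objects have `.contents`.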
