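## How to crawl or scrape
* Send a request to the target web page and receive the resulting HTML.
* Parse the received HTML.
* Extract only the information you need.
> Building a web crawler in Python requires one module for handling HTTP requests and responses and another for parsing HTML.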

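## Required modules
* Beautiful Soup: a Python library for parsing HTML and XML.
* requests: handles HTTP request and response communication.
* selenium: Beautiful Soup only parses the HTML it is given and cannot perform user actions such as clicks or typing. When the data appears only after such interaction, selenium is needed to drive the browser.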
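
The three steps listed earlier (request, parse, extract) can be sketched roughly as follows. This is a minimal illustration, not a complete crawler: `fetch_html` and `extract_links` are hypothetical helper names, and the fetch step uses `urlopen` from the standard library, as the Naver example later in this notebook does. The `sample` string is made up so the parsing steps can run without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: send a request and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_links(html):
    """Steps 2 and 3: parse the HTML, then keep only (text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)  # skip <a> tags without href
    ]


# Made-up HTML so the example runs offline.
sample = '<p>Happy <a href="https://www.python.org">Python</a></p>'
print(extract_links(sample))
```

With a live page, the same function would be called as `extract_links(fetch_html(url))`.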


# Learning Beautiful Soup for fetching web data
## Main API
* find_all() : returns every matching tag as a list
* find() : returns the first matching tag

In [1]:
from bs4 import BeautifulSoup

In [3]:
with open("./data/03. test_first.html", 'r') as f:  # close the file after reading
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Very Simple HTML Code by PinkWink
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    Happy PinkWink.
    <a href="http://www.pinkwink.kr" id="pw-link">
     PinkWink
    </a>
   </p>
   <p class="inner-text second-item">
    Happy Data Science.
    <a href="https://www.python.org" id="py-link">
     Python
    </a>
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    Data Science is funny.
   </b>
  </p>
  <p class="outer-text">
   <b>
    All I need is Love.
   </b>
  </p>
 </body>
</html>


In [5]:
type(soup.children)

list_iterator

In [11]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>
 <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>
 </html>]

In [16]:
soup.body

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

In [28]:
soup.body.get_text()

'\n\n\n                Happy PinkWink.\n                PinkWink\n\n\n                Happy Data Science.\n                Python\n\n\n\n\n                Data Science is funny.\n            \n\n\n\n                All I need is Love.\n            \n\n'
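
The raw `get_text()` result above keeps the indentation and line breaks of the source HTML. One way to tidy it, sketched here with plain string methods rather than any Beautiful Soup feature, is to split on whitespace and rejoin. The `raw` string is copied from the output above; `normalize_whitespace` and `nonempty_lines` are hypothetical helper names.

```python
raw = (
    "\n\n\n                Happy PinkWink.\n                PinkWink\n\n\n"
    "                Happy Data Science.\n                Python\n\n\n\n\n"
    "                Data Science is funny.\n            \n\n\n\n"
    "                All I need is Love.\n            \n\n"
)


def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def nonempty_lines(text):
    """Return the stripped lines of text, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(normalize_whitespace(raw))
print(nonempty_lines(raw))
```

The first helper gives one flat sentence stream, while the second keeps one entry per original text line, which is often easier to process further.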

In [19]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>, <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>, <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

In [20]:
soup.find('p')

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

In [26]:
for each_tag in soup.find_all('p'):
    print(each_tag.get_text(strip=True))

Happy PinkWink.PinkWink
Happy Data Science.Python
Data Science is funny.
All I need is Love.


In [21]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

In [22]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]
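
In the two results above, `class_="outer-text"` also returns the tag whose class attribute is `outer-text first-item`, because Beautiful Soup treats `class` as a multi-valued attribute and matches against each value. The sketch below contrasts that with matching the exact attribute string and with filtering on the parsed class list. The inline `html` string is an assumption, a trimmed-down version of the test file.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the two outer paragraphs of the test file.
html = (
    '<p class="outer-text first-item" id="second">Data Science is funny.</p>'
    '<p class="outer-text">All I need is Love.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Matches any <p> whose class list contains "outer-text".
contains_outer = soup.find_all("p", class_="outer-text")

# Matches only when the whole class attribute string is identical.
exact_string = soup.find_all("p", class_="outer-text first-item")

# Keeps only tags whose class list is exactly ["outer-text"].
only_outer = [p for p in contains_outer if p.get("class") == ["outer-text"]]

print(len(contains_outer))
print([p["id"] for p in exact_string])
print([p.get_text() for p in only_outer])
```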

In [23]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>]

In [29]:
links = soup.find_all('a')
links

[<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>,
 <a href="https://www.python.org" id="py-link">Python</a>]

In [30]:
for each in links:
    href = each['href']
    text = each.string
    print(text + ' -> ' + href)

PinkWink -> http://www.pinkwink.kr
Python -> https://www.python.org
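
The loop above works because each `<a>` tag here has an `href` and exactly one text child. On less tidy pages, `.string` returns `None` when a tag has more than one child, and `each['href']` raises `KeyError` when the attribute is missing. A more defensive variant might look like this; `describe_link` is a hypothetical helper and the `html` string is made up to trigger both cases.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first link has nested markup, the second has no href.
html = (
    '<a href="https://www.python.org"><b>Python</b> docs</a>'
    '<a id="no-href">anchor only</a>'
)
soup = BeautifulSoup(html, "html.parser")


def describe_link(tag):
    """Return 'text -> href', or None when the tag has no href."""
    href = tag.get("href")  # None instead of KeyError when missing
    if href is None:
        return None
    text = tag.get_text(" ", strip=True)  # works with nested tags too
    return f"{text} -> {href}"


links = soup.find_all("a")
print(links[0].string)  # None, because the tag has two children
print([describe_link(a) for a in links])
```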


# Finding the tag you want with Chrome Developer Tools

## Fetching exchange rate information from Naver Finance

In [33]:
from urllib.request import urlopen
 
url = "http://info.finance.naver.com/marketindex/"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

In [34]:
'''
# Key HTML tag holding the exchange rate (the live value changes over time)
<span class="value">
    1,067.00
</span>
'''
soup.find_all('span', 'value')[0].string

'1,077.00'
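
The scraped value is a string with a thousands separator, so it cannot be passed to `float()` directly. A minimal sketch of the conversion follows, where `parse_rate` is a hypothetical helper and the input is the value shown above.

```python
def parse_rate(text):
    """Convert a string such as '1,077.00' to a float."""
    return float(text.strip().replace(",", ""))


print(parse_rate("1,077.00"))  # 1077.0
```

For currency arithmetic where rounding matters, `decimal.Decimal` could be used in place of `float` with the same cleanup step.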