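# Libraries for web crawling
- Static collection tools: Requests, Urllib
- Dynamic collection tools: Selenium (BeautifulSoup is covered in the same section, but it only parses HTML that another tool has already fetched)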

## Using the static collection libraries
### Requests

In [None]:
# Check the robots exclusion standard (whether crawling is allowed)
## Page URL + /robots.txt
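
The check described above can be automated with the standard library's `urllib.robotparser`. The following is a minimal sketch, not a complete crawler policy: the robots.txt content, the `example.com` URLs, and the helper name `build_parser` are made up for illustration, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow that URL.
print(parser.can_fetch("*", "https://example.com/index.html"))    # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
```

For a live site, calling `parser.set_url(page_url + "/robots.txt")` followed by `parser.read()` downloads the file instead of parsing a string.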

In [1]:
import requests

html = requests.get('https://www.youtube.com')
print(html.encoding)
print(html.headers)
print(html.text[:200])

utf-8
{'Content-Type': 'text/html; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'Date': 'Sun, 19 Mar 2023 11:52:08 GMT', 'Strict-Transport-Security': 'max-age=31536000', 'X-Frame-Options': 'SAMEORIGIN', 'Cross-Origin-Opener-Policy-Report-Only': 'same-origin-allow-popups; report-to="youtube_main"', 'Permissions-Policy': 'ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*', 'Report-To': '{"group":"youtube_main","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/youtube_main"}]}', 'P3P': 'CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=ko for more info."', 'Content-Encoding': 'gzip', 'Server': 'ESF', 'X-XSS-Protection': '0', 'Set-Cookie': 'GPS=1; Domain=.youtube.com; Expires=Sun, 19-M

In [2]:
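# When an ID and password are required
# (auth= sends HTTP Basic credentials; a site with a form-based login simply ignores them)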

html = requests.get("https://itmaster98.tistory.com", 
                    auth=('ghdcosml','password'))
print(html.text[:200])

<!doctype html>
<html lang="ko">
	                                                                <head>
                
                
                        <!-- BusinessLicenseInfo - START --
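
Passing an `auth=(user, password)` tuple to `requests.get` is shorthand for HTTP Basic authentication, which adds an `Authorization` header to the request. The sketch below rebuilds that header by hand with the standard library to show what is actually sent. The function name `basic_auth_header` and the credentials are hypothetical, and the sketch assumes ASCII credentials.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value used by HTTP Basic auth."""
    # Basic auth is "username:password", base64-encoded (not encrypted).
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Hypothetical credentials, for illustration only.
print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Because the value is only base64-encoded, Basic credentials should be sent over HTTPS only.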


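### Urllib
- Unlike requests, urllib sends and receives data as raw bytes, so the response body must be decoded explicitly
- Look up the charset in the response headers or in the HTML head, and decode with that charset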
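
The decoding step can be sketched without a network call. `urlopen(...).headers` is an `email.message.Message` subclass, so the example below uses a plain `Message` to show how the charset is read from a `Content-Type` header and then used to decode bytes. The helper name `charset_from_content_type`, the sample header, and the sample bytes are assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


# Hypothetical response: a body encoded as EUC-KR and its header.
raw_body = "한글 페이지".encode("euc-kr")
header = "text/html; charset=EUC-KR"

charset = charset_from_content_type(header)
print(charset)                    # euc-kr
print(raw_body.decode(charset))   # 한글 페이지
```

Falling back to UTF-8 when the header declares no charset is a convenience choice here; a page may still declare a different encoding in a `<meta>` tag.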

In [3]:
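import urllib.request  # a bare "import urllib" does not reliably load the request submodule

html = urllib.request.urlopen('https://www.youtube.com')
charset = html.headers.get_content_charset() or 'utf-8'  # fall back to UTF-8 if none is declared
print(charset)
print(html.read().decode(charset)[:200])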

utf-8
<!DOCTYPE html><html style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" lang="ko-KR" system-icons typography typography-spacing darker-dark-theme darker-dark-theme-deprecate><head><meta h


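## Using the dynamic collection tools
### BeautifulSoup
- Parses HTML that has already been fetched; it does not download or render pages itself, so pair it with Requests, Urllib, or Selenium
- Search the parsed tree with find(), find_all(), and select()
- Use get() to read an attribute value and text to read a tag's text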
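
The calls listed above can be tried on a small inline document. This is a minimal sketch that assumes `beautifulsoup4` is installed; the HTML snippet, its class and id values, and the `example.com` link are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <p class="title" id="first">Heading</p>
  <p id="second">Body text</p>
  <a href="https://example.com">link</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_p = soup.find("p")              # first matching tag, or None
all_p = soup.find_all("p")            # list of every matching tag
by_class = soup.find(class_="title")  # class_ avoids the reserved word "class"
by_id = soup.select_one("#second")    # CSS selector, first match only
href = soup.a.get("href")             # attribute value; None if the attribute is absent
missing = soup.find("table")          # no <table> in the document, so this is None

print(first_p.text)   # Heading
print(len(all_p))     # 2
print(by_id.text)     # Body text
print(href)           # https://example.com
print(missing)        # None
```

Because `find()` and `select_one()` return `None` when nothing matches, checking the result before indexing into it avoids a `TypeError`.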

In [7]:
!pip install bs4

Collecting bs4
  Using cached bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.11.2-py3-none-any.whl (129 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.4-py3-none-any.whl (37 kB)
Using legacy 'setup.py install' for bs4, since package 'wheel' is not installed.
Installing collected packages: soupsieve, beautifulsoup4, bs4
  Running setup.py install for bs4: started
  Running setup.py install for bs4: finished with status 'done'
Successfully installed beautifulsoup4-4.11.2 bs4-0.0.1 soupsieve-2.4




In [12]:
from bs4 import BeautifulSoup as bs
with open('연습.html', 'r', encoding='utf-8') as f:
    html_doc = f.read()
soup = bs(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   제목
  </title>
 </head>
 <body>
  <h1>
   제목1
  </h1>
  <h2>
   제목2
  </h2>
  <p>
   내용들
  </p>
 </body>
</html>


In [13]:
print('Title:', soup.title.text)

Title: 제목


In [14]:
# Identify the name of the parent tag
soup.title.parent.name

'head'

In [15]:
# Extract the first p tag
soup.p

<p>내용들</p>

In [16]:
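# Extract an attribute value
# 연습.html contains no <a> tag, so soup.a is None and soup.a['href'] would raise TypeError
link = soup.a
print(link.get('href') if link is not None else None)

None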

In [17]:
# find()
soup.find('p')

<p>내용들</p>

In [18]:
# find_all()
soup.find_all('p')

[<p>내용들</p>]

In [21]:
# Find the first tag with class "title" (returns None here, so the cell shows no output)
soup.find(class_='title')

In [22]:
# Find the first tag with id "second" (also None for this document)
soup.find(attrs={'id':'second'})

### select()

In [23]:
soup.select('p')

[<p>내용들</p>]

In [24]:
soup.select_one('p')

<p>내용들</p>

## Selenium
- Unlike urllib and requests, it drives a real browser, so it can collect dynamically rendered content
- Must be installed beforehand


In [25]:
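# The apt/cp commands below assume a Debian/Ubuntu environment such as Google Colab.
# On Windows only the pip line applies; Selenium 4.6+ can locate a driver itself through Selenium Manager.
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!chromium-browser --version
!chromedriver --version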

Collecting selenium
  Downloading selenium-4.8.2-py3-none-any.whl (6.9 MB)
     ---------------------------------------- 6.9/6.9 MB 10.8 MB/s eta 0:00:00
Collecting trio~=0.17
  Using cached trio-0.22.0-py3-none-any.whl (384 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.10.2-py3-none-any.whl (17 kB)
Collecting sortedcontainers
  Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting sniffio
  Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting outcome
  Using cached outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting cffi>=1.14
  Downloading cffi-1.15.1-cp39-cp39-win_amd64.whl (179 kB)
     ------------------------------------- 179.1/179.1 KB 11.3 MB/s eta 0:00:00
Collecting async-generator>=1.9
  Using cached async_generator-1.10-py3-none-any.whl (18 kB)
Collecting attrs>=19.2.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.1-py3-none-any.whl (14 kB)
Colle

'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'apt' is not recognized as an internal or external command,
operable program or batch file.
'cp' is not recognized as an internal or external command,
operable program or batch file.
'chromium-browser' is not recognized as an internal or external command,
operable program or batch file.
'chromedriver' is not recognized as an internal or external command,
operable program or batch file.


In [26]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')        # run Chrome without a visible window
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Passing the driver path positionally is deprecated in Selenium 4;
# with chromedriver on PATH (or Selenium Manager), options alone is enough.
driver = webdriver.Chrome(options=options)
driver.maximize_window()

In [27]:
driver.implicitly_wait(1)   # applies to every later find_element call, so set it before the lookups
driver.get("https://www.daum.net/")
time.sleep(3)
# The selectors below depend on Daum's page markup and may need updating
driver.find_element(By.XPATH, '//*[@id="q"]').send_keys('인공지능')   # search keyword: "artificial intelligence"
driver.find_element(By.CSS_SELECTOR, "#daumSearch > fieldset > div > div > button.ico_pctop.btn_search").click()
time.sleep(3)
rows = driver.find_elements(By.CSS_SELECTOR, 'p.f_eb.desc')
text = [row.text for row in rows]   # description text of each search result
driver.quit()                       # close the browser once the text is collected
text