## Components of a URL

  - Scheme: the communication protocol
  - Host: the server address
  - Path: the location of the document on the server
  - Query: additional information passed to the document

## URL example

http://www.bobaedream.co.kr/mycar/mycar_list.php?sel_m_gubun=ALL&page=2

- Scheme: `http` (written as `http://` in the URL)
- Host: `www.bobaedream.co.kr`
- Path: `/mycar/mycar_list.php`
- Query: `sel_m_gubun=ALL&page=2` (everything after the `?`)

## Parsing a URL in code

In [1]:
import urllib.parse
url = 'https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=scraping'
p = urllib.parse.urlparse(url)

In [2]:
p.scheme

'https'

In [3]:
p.hostname

'search.naver.com'

In [4]:
p.path

'/search.naver'

In [8]:
p.query

'sm=top_hty&fbm=1&ie=utf8&query=scraping'
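
The `p.query` value above is still one flat string. As a minimal sketch, the standard library's `urllib.parse.parse_qs` can split it into a dictionary; the variable name `params` is only illustrative.

```python
import urllib.parse

url = 'https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=scraping'
p = urllib.parse.urlparse(url)

# parse_qs maps each name to a list of values,
# because the same name may appear more than once in a query string
params = urllib.parse.parse_qs(p.query)
print(params['query'])  # ['scraping']
print(params['fbm'])    # ['1']
```

Every value comes back as a string inside a list, so `params['fbm'][0]` would need an explicit `int()` if a number is wanted.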

# The requests module
- https://2.python-requests.org/en/master/
- A library for making HTTP requests
- Supports both GET and POST, and handles request details such as cookies and headers.
- Not part of the standard library, so it must be installed (Anaconda already includes it).
    - `pip install requests`
    - `conda install -c conda-forge requests`

In [9]:
import requests

Guide: the URL below is taken from the textbook. The page is simple, which makes it good for quick practice.

In [None]:
# GET request
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
res = requests.get(url)

In [34]:
res.url

'http://www.pythonscraping.com/pages/warandpeace.html'

In [17]:
res.status_code

200

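## Status codes

- 2XX: success
- 3XX: redirection (the resource has moved to another address)
    - requests follows 3XX redirects automatically by default, so you rarely see them while crawling.
- 4XX: client error (the request was wrong)
  - 404: the address does not exist
- 5XX: server error (something went wrong on the server)
  - 503: the service is unavailable, for example because the server is down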
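
The families in the list above can be checked in code. The following is a minimal sketch using only the standard library; `status_class` is a hypothetical helper written for this example, not part of requests.

```python
from http import HTTPStatus


def status_class(code):
    """Return the status-code family for an integer HTTP status (hypothetical helper)."""
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirection'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'other'


# HTTPStatus supplies the standard reason phrase for each code
for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, status_class(code))
```

With a real response, the same helper could be called as `status_class(res.status_code)`. requests also provides `res.raise_for_status()`, which raises an exception for 4XX and 5XX responses.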

In [21]:
res.headers

{'Date': 'Wed, 26 Jun 2019 21:35:54 GMT', 'Server': 'Apache', 'Last-Modified': 'Sat, 09 Jun 2018 19:15:59 GMT', 'ETag': '"4121bd1-2dcb-56e3a58bcb54a"', 'Accept-Ranges': 'bytes', 'Content-Length': '11723', 'Cache-Control': 'max-age=1209600', 'Expires': 'Wed, 10 Jul 2019 21:35:54 GMT', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html'}

In [23]:
res.cookies

<RequestsCookieJar[]>

In [26]:
res.encoding

ISO-8859-1


In [27]:
res.text

'<html>\n<head>\n<style>\n.green{\n\tcolor:#55ff55;\n}\n.red{\n\tcolor:#ff5555;\n}\n#text{\n\twidth:50%;\n}\n</style>\n</head>\n<body>\n<h1>War and Peace</h1>\n<h2>Chapter 1</h2>\n<div id="text">\n"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don\'t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy \'faithful slave,\' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.</span>"\n<p/>\nIt was in July, 1805, and the speaker was the well-known <span class="green">Anna\nPavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya\nFedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man\nof high rank

## Passing a query string
- The query string is appended to the end of the URL when the request is sent
- Pass it as a dictionary of name: value pairs

- https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=scraping

In [28]:
url = 'https://search.naver.com/search.naver'
params = {'sm':'top_hty', 'fbm':1, 'ie':'utf8', 'query':'scraping'}
res = requests.get(url, params=params)
res.url

'https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=scraping'
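
To see how a dictionary of parameters becomes the query string in `res.url`, the encoding step can be reproduced by hand with `urllib.parse.urlencode`. This is only a sketch of the idea, not a description of how requests builds the URL internally; `full_url` is an illustrative name.

```python
from urllib.parse import urlencode

url = 'https://search.naver.com/search.naver'
params = {'sm': 'top_hty', 'fbm': 1, 'ie': 'utf8', 'query': 'scraping'}

# urlencode joins name=value pairs with '&' and converts values to strings
query = urlencode(params)
full_url = url + '?' + query
print(full_url)

# Characters that are unsafe in a URL are escaped; a space becomes '+'
print(urlencode({'query': 'web scraping'}))  # query=web+scraping
```

Passing the dictionary instead of concatenating strings by hand avoids escaping mistakes when a value contains spaces or non-ASCII text.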

In [30]:
print(res.text)

<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="scraping : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="'scraping'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="'scraping'의 네이버 통합검색 결과입니다."> <title>scraping : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_190612.css"> <link rel="stylesheet" type="te

# BeautifulSoup4

- Parser
    - lxml parser: good performance.
        - `pip install lxml` (already included with Anaconda)

In [33]:
from bs4 import BeautifulSoup
import requests

In [71]:
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
res = requests.get(url)
bs = BeautifulSoup(res.text, 'lxml') # arguments: HTML text, parser name (lxml is the faster choice)

### Lookup methods
- find_all(name=tag name, attrs={attribute name: attribute value, ..})
   - Returns all matching tags
- find(name=tag name, attrs={attribute name: attribute value})
    - Returns the first matching tag
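
The difference between the two methods can be tried without any network access. The sketch below uses a small inline HTML string made up for illustration and Python's built-in `html.parser`, so lxml is not required; `all_red` and `first_red` are illustrative names.

```python
from bs4 import BeautifulSoup

# Inline document invented for this example
html = '''
<div id="text">
  <span class="red">first red</span>
  <span class="green">first green</span>
  <span class="red">second red</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every match as a ResultSet (a list subclass)
all_red = soup.find_all(name='span', attrs={'class': 'red'})
# find returns only the first match as a Tag
first_red = soup.find(name='span', attrs={'class': 'red'})

print(len(all_red))          # 2
print(first_red.get_text())  # first red
print(soup.find('h1'))       # None, because nothing matches
```

When nothing matches, `find` returns `None` while `find_all` returns an empty result, so code that chains `.get_text()` after `find` should check for `None` first.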

In [69]:
# Look up a specific element
span = bs.find_all('span')
print(type(span), len(span), sep=' , ')

<class 'bs4.element.ResultSet'> , 75


In [42]:
span2 = bs.find_all('span', {'class':'red'})
print(type(span2), len(span2), sep=' , ')

<class 'bs4.element.ResultSet'> , 34


In [43]:
# Find every span whose class attribute is either red or green
span3 = bs.find_all('span', {'class':{'red','green'}}) 
print(type(span3), len(span3), sep=' , ')

<class 'bs4.element.ResultSet'> , 75


In [56]:
print(type(span3[0]))
span3[0]

<class 'bs4.element.Tag'>


<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [62]:
# Find several kinds of tags at once
h_tag = bs.find_all(['h1', 'h2'])
print(len(h_tag))
h_tag

2


[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [63]:
# Get the text inside each tag
for h in h_tag:
    print(h.get_text())

War and Peace
Chapter 1


In [64]:
for txt in span2:
    print(txt.get_text())
    print("-"*50)

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
--------------------------------------------------
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
--------------------------------------------------
Heavens! what a virulent attack!
--------------------------------------------------
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
--------------------------------------------------
C

In [81]:
# Guide: optional, can be skipped in class.
tags = bs.find_all(text='the prince') # matches only when the text is exactly 'the prince', not when it merely contains it
print(len(tags))

0
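
The result of 0 follows from exact matching. As a minimal sketch on a made-up inline document, passing a compiled regular expression instead of a plain string matches text that merely contains the phrase. The example uses `string=`, the current name for the `text=` argument used above (assuming Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Inline document invented for this example
html = '<p><span>the prince</span> <span>said the prince</span></p>'
soup = BeautifulSoup(html, 'html.parser')

# A plain string must equal the whole text node
exact = soup.find_all(string='the prince')
# A compiled regex is applied with search(), so containing text also matches
containing = soup.find_all(string=re.compile('the prince'))

print(len(exact))       # 1
print(len(containing))  # 2
```

Both calls return the matching text nodes (`NavigableString` objects) rather than the enclosing tags.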


In [83]:
type(bs.find('span'))

bs4.element.Tag
