## When image crawling with BeautifulSoup does not work

- Crawling test in Python using re and requests
- Example site: dogs (no thanks) -> cats (yes please~)
    - https://search.daum.net/search?nil_suggest=btn&w=img&DA=SBC&q=%EA%B3%A0%EC%96%91%EC%9D%B4

In [1]:
!python --version

Python 3.11.9


### Test order

1. Declare the regular expression
2. Preprocess the matches to extract the URLs
3. Image-saving code

#### Module imports

In [2]:
import requests
import re
import time

#### Function declaration

In [3]:
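# Save the image at a URL to disk
# (assumes the ./results directory already exists; open() fails otherwise)
def save_image(idx, img_url):
    res = requests.get(img_url).content
    with open(f'./results/{idx}.jpg', 'wb') as f:
        # the pillow module is recommended for image-specific work
        f.write(res)
        print(f'saved by {idx}')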
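The function above names every file `.jpg`, although some of the URLs extracted later end in `.webp` or have no extension at all. A minimal sketch of deriving the output path from the URL follows. The helper names `guess_extension` and `build_output_path` and the `.jpg` fallback are illustrative assumptions, not part of the original notebook.

```python
import os
from urllib.parse import urlparse


def guess_extension(img_url, default='.jpg'):
    """Return the extension found in the URL path, or `default` if there is none."""
    path = urlparse(img_url).path            # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return ext or default


def build_output_path(idx, img_url, out_dir='./results'):
    """Create out_dir if needed and return a file path for this image (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, f'{idx}{guess_extension(img_url)}')


print(guess_extension('https://example.com/img.webp'))      # .webp
print(guess_extension('https://example.com/no_extension'))  # .jpg
```

Inside `save_image`, `open(build_output_path(idx, img_url), 'wb')` could stand in for the hard-coded path.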

#### Writing it step by step

In [4]:
start_time = time.time()

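# Declare the regular expression
## grab the text between oimgurl and cpid
reg = re.compile('oimgurl: ".{0,300}", cpid')  # the pattern can be written in several ways

url = 'https://search.daum.net/search?nil_suggest=btn&w=img&DA=SBC&q=%EA%B3%A0%EC%96%91%EC%9D%B4'
html = requests.get(url=url)
# str() on bytes yields the b'...' repr shown in the output; html.text would give decoded text
html_raw_data = str(html.content)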
print(html_raw_data)

b'<!doctype html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="ko">\n<head profile="http://a9.com/-/spec/opensearch/1.1/">\n<meta http-equiv="content-Type" content="text/html;charset=utf-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<meta name="autocomplete" content="off" />\n<meta name="referrer" content="always">\n<meta name="format-detection" content="telephone=no" />\n<meta property="og:title" content="\xea\xb3\xa0\xec\x96\x91\xec\x9d\xb4 &ndash; Daum \xea\xb2\x80\xec\x83\x89" />\n<meta property="og:url" content="https://search.daum.net/search?nil_suggest=btn&amp;w=img&amp;DA=SBC&amp;q=%EA%B3%A0%EC%96%91%EC%9D%B4" />\n<meta property="og:description" content="Daum \xea\xb2\x80\xec\x83\x89\xec\x97\x90\xec\x84\x9c \xea\xb3\xa0\xec\x96\x91\xec\x9d\xb4\xec\x97\x90 \xeb\x8c\x80\xed\x95\x9c \xec\xb5\x9c\xec\x8b\xa0\xec\xa0\x95\xeb\xb3\xb4\xeb\xa5\xbc \xec\xb0\xbe\xec\x95\x84\xeb\xb3\xb4\xec\x84\xb8\xec\x9a\x94." />\n<meta property="og:image" content="https://search1

##### Inspecting html_raw_data
- Copying the source into a text editor shows that
    - oimgurl holds the image URL
    - cpid holds the ID of each image
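Given that layout, a capture group can return the URL directly instead of the whole `oimgurl ... cpid` span. The sketch below runs on a made-up string that only imitates the structure described above; the real page source may differ, and the non-greedy pattern is one possible variant of the notebook's regex, not the one it uses.

```python
import re

# Hypothetical snippet imitating the page source; the real markup may differ.
sample = (
    'oimgurl: "https://example.com/cat1.jpg", cpid: "AAA", '
    'oimgurl: "https://example.com/cat2.webp", cpid: "BBB"'
)

# Non-greedy capture of the text between the quotes that follow `oimgurl:`.
pattern = re.compile(r'oimgurl: "(.{0,300}?)", cpid')

# With exactly one group, findall() returns the captured strings.
urls = pattern.findall(sample)
print(urls)  # ['https://example.com/cat1.jpg', 'https://example.com/cat2.webp']
```

The non-greedy `?` keeps a match from running into the next entry when two entries sit within 300 characters of each other.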

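##### Extracting the URLs step by step

- The iterator returned by reg.finditer(html_raw_data) is exhausted after a single pass, so each cell below creates it again.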
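A small, self-contained illustration of that exhaustion, using a toy pattern and string (both hypothetical) rather than the crawled page:

```python
import re

reg_demo = re.compile(r'\d+')
text = 'a1 b22 c333'

it = reg_demo.finditer(text)
first_pass = [m.group() for m in it]   # consumes the iterator
second_pass = [m.group() for m in it]  # nothing left to yield

print(first_pass)   # ['1', '22', '333']
print(second_pass)  # []

# If the matches are needed more than once, materialize them into a list.
matches = list(reg_demo.finditer(text))
```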

In [27]:
reg_iter = reg.finditer(html_raw_data)

for idx, res_data in enumerate(reg_iter):
    # preprocessing step that extracts the URL
    img_url = res_data.group().split('oimgurl:')
    print(img_url)

['', ' "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg", cpid']
['', ' "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg", cpid']
['', ' "https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp", cpid']
['',

In [28]:
reg_iter = reg.finditer(html_raw_data)

for idx, res_data in enumerate(reg_iter):
    # preprocessing step that extracts the URL
    img_url = res_data.group().split('oimgurl:')[1]
    print(img_url)

 "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg", cpid
 "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg", cpid
 "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg", cpid
 "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg", cpid
 "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg", cpid
 "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg", cpid
 "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg", cpid
 "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg", cpid
 "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg", cpid
 "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg", cpid
 "https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp", cpid
 "https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp", cpid
 "ht

In [29]:
reg_iter = reg.finditer(html_raw_data)

for idx, res_data in enumerate(reg_iter):
    # preprocessing step that extracts the URL
    img_url = res_data.group().split('oimgurl:')[1].split(', cpid')[0]
    print(img_url)

 "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg"
 "https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg"
 "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg"
 "https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg"
 "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg"
 "https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg"
 "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg"
 "https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg"
 "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg"
 "https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg"
 "https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp"
 "https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp"
 "https://t1.daumcdn.net/cfile/tistory/1934853550FE4A3C13"
 "https://t1.daum

In [30]:
reg_iter = reg.finditer(html_raw_data)

for idx, res_data in enumerate(reg_iter):
    # preprocessing step that extracts the URL
    img_url = res_data.group().split('oimgurl:')[1].split(', cpid')[0].replace('"', '')
    print(img_url)

 https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg
 https://t1.daumcdn.net/news/202402/24/joongangsunday/20240224000150215efvd.jpg
 https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg
 https://t1.daumcdn.net/news/202402/25/akn/20240225070103638jeqk.jpg
 https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg
 https://t1.daumcdn.net/news/202103/26/sportskhan/20210326154226470iwqg.jpg
 https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg
 https://t1.daumcdn.net/news/202312/06/ned/20231206090317055bgdg.jpg
 https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg
 https://t1.daumcdn.net/news/202408/11/ZDNetKorea/20240811100837818cyjh.jpg
 https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp
 https://blog.kakaocdn.net/dn/oZZxr/btsHWRfL9Kd/KE3zKNpvsde1lJGg4lMeck/img.webp
 https://t1.daumcdn.net/cfile/tistory/1934853550FE4A3C13
 https://t1.daumcdn.net/cfile/tistory/19348
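In the output above each URL is printed twice and still carries a leading space. As a rough sketch, the same split-and-replace chain can be wrapped in a helper that also strips whitespace and drops repeats while keeping first-seen order. The name `extract_unique_urls` and the sample strings are assumptions for illustration only.

```python
def extract_unique_urls(matched_strings):
    """Clean each regex match and drop repeated URLs, keeping first-seen order."""
    cleaned = (
        s.split('oimgurl:')[1].split(', cpid')[0].replace('"', '').strip()
        for s in matched_strings
    )
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(cleaned))


# Hypothetical matches in the same shape as res_data.group() above.
sample_matches = [
    'oimgurl: "https://example.com/a.jpg", cpid',
    'oimgurl: "https://example.com/a.jpg", cpid',
    'oimgurl: "https://example.com/b.webp", cpid',
]
unique_urls = extract_unique_urls(sample_matches)
print(unique_urls)  # ['https://example.com/a.jpg', 'https://example.com/b.webp']
```

With the real data this would be called as `extract_unique_urls(m.group() for m in reg.finditer(html_raw_data))`, which would roughly halve the number of downloads if every URL really does appear twice.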

##### Saving the images
- Measure the elapsed time

In [5]:
start_time = time.time()

reg_iter = reg.finditer(html_raw_data)

for idx, res_data in enumerate(reg_iter):
    # preprocessing step that extracts the URL
    img_url = res_data.group().split('oimgurl:')[1].split(', cpid')[0].replace('"', '')
    save_image(idx, img_url)

end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 0
saved by 1
saved by 2
saved by 3
saved by 4
saved by 5
saved by 6
saved by 7
saved by 8
saved by 9
saved by 10
saved by 11
saved by 12
saved by 13
saved by 14
saved by 15
saved by 16
saved by 17
saved by 18
saved by 19
saved by 20
saved by 21
saved by 22
saved by 23
saved by 24
saved by 25
saved by 26
saved by 27
saved by 28
saved by 29
saved by 30
saved by 31
saved by 32
saved by 33
saved by 34
saved by 35
saved by 36
saved by 37
saved by 38
saved by 39
saved by 40
saved by 41
saved by 42
saved by 43
saved by 44
saved by 45
saved by 46
saved by 47
saved by 48
saved by 49
saved by 50
saved by 51
saved by 52
saved by 53
saved by 54
saved by 55
saved by 56
saved by 57
saved by 58
saved by 59
saved by 60
saved by 61
saved by 62
saved by 63
saved by 64
saved by 65
saved by 66
saved by 67
saved by 68
saved by 69
saved by 70
saved by 71
saved by 72
saved by 73
saved by 74
saved by 75
saved by 76
saved by 77
saved by 78
saved by 79
saved by 80
saved by 81
saved by 82
saved by 83
sa

##### About 22.4 seconds in total

### Reducing crawl time with parallel processing
- Parallelize the downloads with threading and concurrent.futures

#### Parallel processing with Thread

#### Module imports

In [7]:
import requests
import re
import threading    ## added for threads
import time

##### The rest is the same as before

In [8]:
# Declare the regular expression
## grab the text between oimgurl and cpid
reg = re.compile('oimgurl: ".{0,300}", cpid')  # the pattern can be written in several ways

url = 'https://search.daum.net/search?nil_suggest=btn&w=img&DA=SBC&q=%EA%B3%A0%EC%96%91%EC%9D%B4'
html = requests.get(url=url)
html_raw_data = str(html.content)

In [10]:
start_time = time.time()

reg_iter = reg.finditer(html_raw_data)
thread_list = []

for idx, res_data in enumerate(reg_iter):
    img_url = res_data.group().split('oimgurl:')[1].split(', cpid')[0].replace('"', '')
    thread_worker = threading.Thread(target=save_image, args=(idx, img_url))
    thread_worker.start()
    thread_list.append(thread_worker)

for thread in thread_list:
    thread.join()

end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 10
saved by 18
saved by 19
saved by 26
saved by 27
saved by 30
saved by 7
saved by 31
saved by 12
saved by 6
saved by 25
saved by 8
saved by 16
saved by 35
saved by 9
saved by 22
saved by 24
saved by 5
saved by 34
saved by 3
saved by 23
saved by 0
saved by 20
saved by 11
saved by 13
saved by 4
saved by 1
saved by 2
saved by 39
saved by 50
saved by 53
saved by 51
saved by 52
saved by 58
saved by 62
saved by 63
saved by 68
saved by 59
saved by 65
saved by 64
saved by 69
saved by 21
saved by 17
saved by 81
saved by 48
saved by 80
saved by 38
saved by 49
saved by 86
saved by 87
saved by 90
saved by 56
saved by 93
saved by 57
saved by 92
saved by 91
saved by 144
saved by 145
saved by 99
saved by 98
saved by 101
saved by 100
saved by 107
saved by 106
saved by 104
saved by 105
saved by 45
saved by 110
saved by 78
saved by 115
saved by 44
saved by 114
saved by 121
saved by 120
saved by 111
saved by 126
saved by 127
saved by 61
saved by 60
saved by 148
saved by 79
saved by 149
saved by

##### About 3.5 seconds in total

#### Parallel processing with concurrent.futures
- concurrent.futures is a high-level interface module for asynchronous execution
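As a quick feel for that interface, here is a minimal sketch using `ThreadPoolExecutor.map`. The function `fake_download` is a hypothetical stand-in that sleeps instead of touching the network, and the worker count and sleep duration are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(idx):
    """Stand-in for save_image: sleeps instead of doing network I/O."""
    time.sleep(0.05)
    return idx * 2


with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, even if tasks finish out of order.
    results = list(executor.map(fake_download, range(8)))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Leaving the `with` block waits for all submitted tasks, which is why no explicit `join()` loop is needed.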

#### Module imports

In [11]:
import requests
import re
import time
from concurrent.futures import ThreadPoolExecutor

In [13]:
start_time = time.time()

reg_iter = reg.finditer(html_raw_data)

# the executor manages its own worker threads, so no thread list is needed
with ThreadPoolExecutor(max_workers=16) as executor:
    for idx, res_data in enumerate(reg_iter):
        # preprocessing step that extracts the URL
        img_url = res_data.group().split('oimgurl:')[1].split(', cpid')[0].replace('"', '')
        executor.submit(save_image, idx, img_url)

end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 10
saved by 7
saved by 4
saved by 3
saved by 2
saved by 6
saved by 11
saved by 5
saved by 8
saved by 1
saved by 9
saved by 0
saved by 12
saved by 16
saved by 22
saved by 17
saved by 18
saved by 20
saved by 13
saved by 19
saved by 27
saved by 26
saved by 23
saved by 21
saved by 24
saved by 25
saved by 30
saved by 31
saved by 14
saved by 34
saved by 35
saved by 28
saved by 39
saved by 38
saved by 29
saved by 37
saved by 15
saved by 36
saved by 46
saved by 42
saved by 44
saved by 47
saved by 43
saved by 40
saved by 45
saved by 41
saved by 48
saved by 51
saved by 52
saved by 49
saved by 53
saved by 57
saved by 33
saved by 56
saved by 58
saved by 62
saved by 59
saved by 63
saved by 64
saved by 32
saved by 65
saved by 60
saved by 61
saved by 69
saved by 66
saved by 67
saved by 78
saved by 70
saved by 79
saved by 72
saved by 73
saved by 71
saved by 76
saved by 80
saved by 77
saved by 81
saved by 86
saved by 87
saved by 54
saved by 55
saved by 85
saved by 84
saved by 75
saved by 50
sa

##### Processing time reduced to about 1.73 seconds

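#### Conclusion
- When a job involves I/O, parallel processing can shorten the total run time
- Where possible, prefer the high-level concurrent.futures module over the threading module
    - Note that it does not always make things faster
- It also offers a range of features, such as returning the results of parallel tasks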
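One way to collect those results is `as_completed`, sketched below. The task `fetch_size`, its fake byte counts, and the deliberate failure at index 2 are invented for illustration. In the notebook's `executor.submit(save_image, ...)` loop the returned futures are discarded, so an exception raised inside `save_image` would go unnoticed; calling `future.result()` surfaces it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_size(idx):
    """Hypothetical task: returns a fake byte count and fails for one index."""
    if idx == 2:
        raise ValueError(f'download {idx} failed')
    return idx * 100


sizes, errors = {}, {}
with ThreadPoolExecutor(max_workers=4) as executor:
    # remember which index each future belongs to
    future_to_idx = {executor.submit(fetch_size, i): i for i in range(5)}
    for future in as_completed(future_to_idx):
        idx = future_to_idx[future]
        try:
            sizes[idx] = future.result()   # re-raises the task's exception, if any
        except ValueError as exc:
            errors[idx] = str(exc)

print(sorted(sizes.items()))  # [(0, 0), (1, 100), (3, 300), (4, 400)]
print(errors)                 # {2: 'download 2 failed'}
```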

##### Python reference link
- [concurrent.futures documentation](https://docs.python.org/ko/3/library/concurrent.futures.html)