## Image Crawling with BeautifulSoup

- Crawling test in Python with requests and BeautifulSoup
- Example site: Bing image search results for 고양이 ("cat")
    - https://www.bing.com/images/search?q=%EA%B3%A0%EC%96%91%EC%9D%B4&form=HDRSC3&first=1&cw=1177&ch=930
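
The `q` value in the URL above is the percent-encoded form of the search term. As a minimal sketch, the same URL can be rebuilt with the standard library instead of pasting the encoded text by hand. `build_search_url` is a hypothetical helper, and the parameter names and values are simply copied from the example URL; what Bing does with each of them is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.bing.com/images/search'

def build_search_url(keyword, first=1):
    # Parameter names and values are copied from the example URL;
    # how Bing interprets each one is an assumption of this sketch.
    params = {'q': keyword, 'form': 'HDRSC3', 'first': first, 'cw': 1177, 'ch': 930}
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes
    return f'{BASE_URL}?{urlencode(params)}'

print(build_search_url('고양이'))
```

Building the URL this way makes it easy to swap in another keyword without encoding it manually.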

In [1]:
!python --version

Python 3.11.9


### Test order

1. Preprocessing to extract the image URLs
2. Image-saving code

#### Module imports

In [3]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Using cached soupsieve-2.6-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.3 soupsieve-2.6





In [4]:
import requests
import time
from bs4 import BeautifulSoup

#### Function definition

In [29]:
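import os

# Download the image at img_url and save it under ./results
def save_image(idx, img_url):
    res = requests.get(img_url).content
    os.makedirs('./results', exist_ok=True)  # open() fails if the folder is missing
    with open(f'./results/bing_{idx}.jpg', 'wb') as f:
        # For real image processing, the Pillow module is recommended
        f.write(res)
        print(f'saved by {idx}')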
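
`save_image` always writes a `.jpg` file, whatever the server actually returned. The following is a minimal, stdlib-only sketch of checking the leading bytes before choosing an extension. `guess_extension` is a hypothetical helper that is not part of the notebook, and it covers only a few common formats; for anything beyond this, Pillow is the better tool.

```python
def guess_extension(data: bytes, default: str = '.jpg') -> str:
    """Guess a file extension from an image's first bytes (hypothetical helper)."""
    if data.startswith(b'\xff\xd8\xff'):            # JPEG signature
        return '.jpg'
    if data.startswith(b'\x89PNG\r\n\x1a\n'):       # PNG signature
        return '.png'
    if data.startswith((b'GIF87a', b'GIF89a')):     # GIF signatures
        return '.gif'
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':  # WebP inside a RIFF container
        return '.webp'
    return default                                  # unknown: keep the caller's default

print(guess_extension(b'\x89PNG\r\n\x1a\n' + b'\x00' * 8))
```

The result could be used to build the file name, for example `f'./results/bing_{idx}{guess_extension(res)}'`.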

#### Step-by-step implementation

In [11]:
start_time = time.time()

url = 'https://www.bing.com/images/search?q=%EA%B3%A0%EC%96%91%EC%9D%B4&form=HDRSC3&first=1&cw=1177&ch=930'
html = requests.get(url=url)
# Use the decoded text; str(html.content) would give the bytes repr ("b'...'"), not clean HTML
html_raw_data = html.text
soup = BeautifulSoup(html_raw_data, 'html.parser')

##### Extracting the tags in a loop

In [13]:
# Collect every <img> tag whose class is "mimg"
tag_list = soup.find_all("img", "mimg")
print(len(tag_list))

30


In [27]:
for idx, res_data in enumerate(tag_list):
    # Some tags have no 'src'; fall back to 'data-src' for those
    if res_data.get('src') is None:
        print(res_data.get('data-src'))
    else:
        print(res_data.get('src'))

https://tse1.mm.bing.net/th/id/OIP.Y9h80DtXEhUNVqMCm8cn8wHaEJ?w=303&h=180&c=7&r=0&o=5&pid=1.7
https://tse1.mm.bing.net/th/id/OIP.N1JNQXWUeZWlWBdo7zwrEAHaE4?w=225&h=180&c=7&r=0&o=5&pid=1.7
https://tse4.mm.bing.net/th/id/OIP.HUXfYr1XIAm6QzPprjdzuwHaGP?w=202&h=180&c=7&r=0&o=5&pid=1.7
https://tse1.mm.bing.net/th/id/OIP.FQXMBKq8VyvEw98bOUiBUQHaE7?w=223&h=180&c=7&r=0&o=5&pid=1.7
https://tse4.mm.bing.net/th/id/OIP.nfOunZ_SW8pyzi--CPui6AHaEJ?w=315&h=180&c=7&r=0&o=5&pid=1.7
https://tse2.mm.bing.net/th/id/OIP.ZZhEg5i3_rLVlYGmFr2REgHaE8?w=234&h=180&c=7&r=0&o=5&pid=1.7
https://tse2.mm.bing.net/th/id/OIP.bnRLASbftzBG1750I_Lu1QHaHW?w=178&h=180&c=7&r=0&o=5&pid=1.7
https://tse3.mm.bing.net/th/id/OIP.xPQ03hsKbT5n1X_qjLHiNgHaEJ?w=316&h=180&c=7&r=0&o=5&pid=1.7
https://tse4.mm.bing.net/th/id/OIP.SuPaBflgtROH6Qq0igYEggHaE8?w=277&h=185&c=7&r=0&o=5&pid=1.7
https://tse1.mm.bing.net/th/id/OIP.ZxKHXOvHFDpOS0YW9RiMqAHaFx?w=238&h=185&c=7&r=0&o=5&pid=1.7
https://tse1.mm.bing.net/th/id/OIP.q-q9G52ELypGVoztuHdu2gHaE
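
The `src` / `data-src` fallback is repeated in every later loop, so it may be worth pulling into one function. The sketch below uses a hypothetical helper, `pick_image_url`, and plain dicts as stand-ins for BeautifulSoup tags, since both expose `.get()`. It also drops tags that carry neither attribute, which would otherwise pass `None` on to the download step.

```python
def pick_image_url(tag):
    """Return the tag's 'src', falling back to 'data-src' (hypothetical helper)."""
    src = tag.get('src')
    return src if src is not None else tag.get('data-src')

# Plain dicts stand in for BeautifulSoup tags here; both expose .get().
tags = [
    {'src': 'https://example.com/a.jpg'},
    {'data-src': 'https://example.com/b.jpg'},
    {},  # neither attribute present
]

# Keep only the tags that actually yielded a URL
urls = [url for url in map(pick_image_url, tags) if url]
print(urls)
```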

##### Saving the images
- Measure the elapsed time

In [30]:
start_time = time.time()

for idx, res_data in enumerate(tag_list):
    if res_data.get('src') is None:
        save_image(idx, res_data.get('data-src'))
    else:
        save_image(idx, res_data.get('src'))
        
end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 0
saved by 1
saved by 2
saved by 3
saved by 4
saved by 5
saved by 6
saved by 7
saved by 8
saved by 9
saved by 10
saved by 11
saved by 12
saved by 13
saved by 14
saved by 15
saved by 16
saved by 17
saved by 18
saved by 19
saved by 20
saved by 21
saved by 22
saved by 23
saved by 24
saved by 25
saved by 26
saved by 27
saved by 28
saved by 29
time elapsed : 6.309060335159302 sec.


##### About 6.3 seconds in total

### Reducing crawl time with parallel processing
- Run the downloads in parallel using threading and concurrent.futures
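
Threads help here because each download spends most of its time waiting on the network, and those waits can overlap. The sketch below illustrates the effect without touching the network: `fake_download` is a hypothetical stand-in that sleeps for 0.1 seconds, so the timings describe the sketch only, not Bing.

```python
import threading
import time

def fake_download(delay=0.1):
    # time.sleep stands in for waiting on the network (an assumption of this sketch)
    time.sleep(delay)

def run_sequential(n):
    start = time.time()
    for _ in range(n):
        fake_download()
    return time.time() - start

def run_threaded(n):
    start = time.time()
    workers = [threading.Thread(target=fake_download) for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()   # wait until every worker has finished
    return time.time() - start

sequential = run_sequential(5)
threaded = run_threaded(5)
print(f'sequential: {sequential:.2f} sec, threaded: {threaded:.2f} sec')
```

The sequential run pays for every wait in turn, while the threaded run takes roughly as long as a single wait.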

#### Parallel processing with threading.Thread

#### Module imports

In [31]:
import requests
import threading    # added for threads
import time
from bs4 import BeautifulSoup

##### The rest is the same as before

In [34]:
start_time = time.time()
thread_list = []

for idx, res_data in enumerate(tag_list):
    img_url = ''
    if res_data.get('src') is None:
        img_url = res_data.get('data-src')
    else:
        img_url = res_data.get('src')

    thread_worker = threading.Thread(target=save_image, args=(idx, img_url))
    thread_worker.start()
    thread_list.append(thread_worker)

for thread in thread_list:
    thread.join()
        
end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 1
saved by 0
saved by 8
saved by 6
saved by 21
saved by 18
saved by 17
saved by 26
saved by 22
saved by 4
saved by 2
saved by 7
saved by 13
saved by 10
saved by 5
saved by 3
saved by 9
saved by 11
saved by 14
saved by 12
saved by 15
saved by 19
saved by 20
saved by 27
saved by 23
saved by 16
saved by 24
saved by 29
saved by 25
saved by 28
time elapsed : 0.27315640449523926 sec.
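
The loop above starts one thread per image, which is fine for 30 tags but grows without bound for longer lists. One way to cap the number of simultaneous downloads is a semaphore, sketched below. `MAX_CONCURRENT`, `limited_task`, and the sleep that stands in for the download are all assumptions made for illustration.

```python
import threading
import time

MAX_CONCURRENT = 4  # arbitrary limit chosen for this sketch
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
stats = {'active': 0, 'peak': 0}

def limited_task(idx):
    with slots:  # at most MAX_CONCURRENT threads are inside this block at once
        with lock:
            stats['active'] += 1
            stats['peak'] = max(stats['peak'], stats['active'])
        time.sleep(0.05)  # stands in for the actual download
        with lock:
            stats['active'] -= 1

threads = [threading.Thread(target=limited_task, args=(i,)) for i in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak concurrency: {stats['peak']}")
```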


##### About 0.27 seconds in total

#### Parallel processing with concurrent.futures
- concurrent.futures is a module that provides a high-level interface for running callables asynchronously
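
As a minimal sketch of that interface, `ThreadPoolExecutor.map` runs a function over an iterable in a pool of worker threads and yields the results in input order. `fake_save` is a hypothetical stand-in for `save_image` that sleeps instead of downloading.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_save(idx):
    # Stand-in for save_image: sleeps instead of downloading
    time.sleep(0.05)
    return f'saved by {idx}'

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() dispatches the calls to the pool and yields results in input order
    results = list(executor.map(fake_save, range(8)))

print(results)
```

Leaving the `with` block waits for all submitted work to finish, so no explicit `join()` calls are needed.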

#### Module imports

In [35]:
import requests
import time
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

In [36]:
start_time = time.time()

with ThreadPoolExecutor(max_workers=16) as executor:
    for idx, res_data in enumerate(tag_list):
        img_url = ''
        if res_data.get('src') is None:
            img_url = res_data.get('data-src')
        else:
            img_url = res_data.get('src')

        executor.submit(save_image, idx, img_url)

end_time = time.time()
print(f'time elapsed : {end_time - start_time} sec.')

saved by 1
saved by 5
saved by 11
saved by 3
saved by 0
saved by 2
saved by 8
saved by 16
saved by 18
saved by 19
saved by 10
saved by 4
saved by 9
saved by 14
saved by 13
saved by 7
saved by 12
saved by 6
saved by 15
saved by 20
saved by 27
saved by 17
saved by 28
saved by 22
saved by 21
saved by 23
saved by 24
saved by 25
saved by 26
saved by 29
time elapsed : 0.4644143581390381 sec.
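
`executor.submit` returns a `Future`, and an exception raised inside a worker stays in that future until `result()` is called. The cell above discards the futures, so a failed download would pass silently. The sketch below shows one way to surface failures; `flaky_save` is a hypothetical stand-in for `save_image` that fails for a single index.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def flaky_save(idx):
    # Hypothetical stand-in for save_image that fails for one index
    if idx == 3:
        raise ValueError(f'download failed for {idx}')
    return idx

saved, failed = [], []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the index it was submitted with
    futures = {executor.submit(flaky_save, i): i for i in range(6)}
    for future in as_completed(futures):
        idx = futures[future]
        try:
            saved.append(future.result())  # re-raises the worker's exception, if any
        except ValueError as exc:
            failed.append((idx, str(exc)))

print(sorted(saved), failed)
```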


##### About 0.46 seconds