# Working with Unstructured Data


# Scraping (BeautifulSoup)
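> Understanding and practicing with the BeautifulSoup library.

> Fetching the content of each tag from the 당근마켓 (Daangn Market) web page.

> Fetching weather information from a weather service page, organizing it into a dataframe, and creating a csv file.

> Exercise 3. Organizing the content of the Totally Normal Gifts web page into a dataframe and creating a csv file.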

## BeautifulSoup

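- Used to read and parse HTML and XML files.
- Usually installed and imported through the bs4 package; elements are searched with the find_all function (findAll is its older alias).
- You need to understand the concepts of children and descendants tags.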


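`<body>
    <div>			<!-- every tag here is a descendant of the body tag -->
        <h1></h1>		<!-- h1 is both a child and a descendant of the div tag -->
        <p></p>
    </div>
    <div>
        <p>
            <a></a>	<!-- a is a child of p; it is a descendant of both p and the div -->
        </p>
    </div>
</body>`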
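
The difference between children and descendants can be checked directly. The following is a minimal sketch that parses the same structure from an inline string (written without whitespace so that no text nodes appear); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same structure as the sample markup, written without whitespace
html = "<body><div><h1></h1><p></p></div><div><p><a></a></p></div></body>"
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of a tag
body_children = [tag.name for tag in soup.body.children]

# .descendants walks the whole subtree, depth first
body_descendants = [tag.name for tag in soup.body.descendants]

print(body_children)     # ['div', 'div']
print(body_descendants)  # ['div', 'h1', 'p', 'div', 'p', 'a']
```

Every child is also a descendant, but `h1`, `p`, and `a` are descendants of `body` without being its children.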

## 사전작업

In [5]:
import requests
from bs4 import BeautifulSoup

In [6]:
webpage = requests.get('https://www.daangn.com/hot_articles')
# webpage.text
soup = BeautifulSoup(webpage.content , 'html.parser') 

In [7]:
soup

<!DOCTYPE html>

<html lang="ko">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no" name="viewport"/>
<link href="https://www.daangn.com/hot_articles" rel="canonical"/>
<title>당근마켓 중고거래 | 당신 근처의 당근마켓</title>
<meta content="당근마켓에서 거래되는 인기 중고 매물을 소개합니다. 지금 당근마켓에서 거래되고 있는 다양한 매물을 구경해보세요." name="description"/>
<meta content="https://www.daangn.com/hot_articles" property="og:url"/>
<meta content="당근마켓 중고거래 | 당신 근처의 당근마켓" property="og:title"/>
<meta content="당근마켓에서 거래되는 인기 중고 매물을 소개합니다. 지금 당근마켓에서 거래되고 있는 다양한 매물을 구경해보세요." property="og:description"/>
<meta content="당근마켓" property="og:site_name"/>
<meta content="https://www.daangn.com/images/meta/home/flea_market.png" property="og:image"/>
<meta content="article" property="og:type"/>
<meta content="ko_KR" property="og:locale"/>
<meta content="1463621440622064" property="fb:app_id"/>
<meta content="summary_large_image" name="tw

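#### Import the required libraries, store the response in a variable, and parse it into soup with BeautifulSoup

#### The output is the page's HTML, the same code the browser shows when you press F12 on the web page.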

## Navigating Tags

In [8]:
print(soup.p)

<p>당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!</p>


In [9]:
print(soup.p.string)

당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!


In [10]:
print(soup.h1)

<h1 id="fixed-bar-logo-title">
<a href="https://www.daangn.com/">
<span class="sr-only">당근마켓</span>
<img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-24b18257ac4ef693c02233bf21e9cb7ecbf43ebd8d5b40c24d99e14094a44c81.svg"/>
</a> </h1>


In [11]:
for child in soup.h1.children :
    print(child)



<a href="https://www.daangn.com/">
<span class="sr-only">당근마켓</span>
<img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-24b18257ac4ef693c02233bf21e9cb7ecbf43ebd8d5b40c24d99e14094a44c81.svg"/>
</a>
 


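#### This prints the `<a>` tag: `.children` returns the direct children of the `<h1>` tag. The blank lines in the output are the whitespace text nodes around `<a>`, which also count as children.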
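
If only the tags are wanted, the whitespace strings can be filtered out. This is a small sketch on a simplified, made-up `<h1>` fragment; `Tag` is the Beautiful Soup class for element nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified stand-in for the <h1> element printed above
html = '<h1 id="logo">\n<a href="/">home</a>\n</h1>'
soup = BeautifulSoup(html, "html.parser")

# .children includes the newline strings before and after <a>
all_children = list(soup.h1.children)

# Keep element nodes only
tag_children = [child for child in soup.h1.children if isinstance(child, Tag)]

# find_all(True, recursive=False) also returns direct child tags only
direct_tags = soup.h1.find_all(True, recursive=False)

print(len(all_children))                 # 3
print([t.name for t in tag_children])    # ['a']
```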

In [12]:
for d in soup.div.children :
    print(d) 



<h1 id="fixed-bar-logo-title">
<a href="https://www.daangn.com/">
<span class="sr-only">당근마켓</span>
<img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-24b18257ac4ef693c02233bf21e9cb7ecbf43ebd8d5b40c24d99e14094a44c81.svg"/>
</a> </h1>


<section id="fixed-bar-search">
<div class="search-input-wrap">
<span class="sr-only">검색</span>
<input class="fixed-search-input" id="header-search-input" name="header-search-input" placeholder="동네 이름, 물품명 등을 검색해보세요!" type="text"/>
<button id="header-search-button">
<img alt="Search" class="fixed-search-icon" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/search-icon-7008edd4f9aaa32188f55e65258f1c1905d7a9d1a3ca2a07ae809b5535380f14.svg"/>
</button>
</div>
</section>


<section id="fixed-bar-download">
<h3 class="hide">다운로드</h3>
<a class="fixed-download-button" href="https://itunes.apple.com/kr/app/pangyojangteo/id1018769995?l=ko&amp;ls=1&amp;mt=8" id="header-download-butto

#### find_all(): returns every element that matches the given criteria

In [13]:
print( soup.find_all('h2'))

[<h2 class="card-title">위닉스10리터 제습기 급매 1만원</h2>, <h2 class="card-title">캠핑, 잡템 먼저가져가면 장땡</h2>, <h2 class="card-title">아이폰8 화이트 64기가 판매합니다.</h2>, <h2 class="card-title">스팸 판매합니다. </h2>, <h2 class="card-title">이케아 책상 가져가실분</h2>, <h2 class="card-title">복합기</h2>, <h2 class="card-title">캠핑키친테이블(거의새거임)</h2>, <h2 class="card-title">아이폰에어팟 2세대 무선(정품)</h2>, <h2 class="card-title">선물세트 팝니데이</h2>, <h2 class="card-title">전동킥보드</h2>, <h2 class="card-title">타이틀 리스트 골프백</h2>, <h2 class="card-title">제습기</h2>, <h2 class="card-title">스피드랙 선반(새거)</h2>, <h2 class="card-title">캠핑용품-해먹,릴선</h2>, <h2 class="card-title">아이패드 매직키보드 트랙패드 12.9인치</h2>, <h2 class="card-title">제습기</h2>, <h2 class="card-title">아이폰11</h2>, <h2 class="card-title">자전거</h2>, <h2 class="card-title">딥디크 도손향수 </h2>, <h2 class="card-title">몽클레어 야상/바람막이/경량패딩</h2>, <h2 class="card-title">미니책장 드립니다.</h2>, <h2 class="card-title">제습기</h2>, <h2 class="card-title">자전거 판매합니다</h2>, <h2 class="card-title">LG제습기 LD-067DSR</h2>, <h2 class="card-title">전
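
For comparison, `find()` returns only the first match (or `None`), and `find_all()` accepts a `limit`. A minimal sketch on a made-up fragment with three card titles:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the card titles above
html = """
<h2 class="card-title">item A</h2>
<h2 class="card-title">item B</h2>
<h2 class="card-title">item C</h2>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h2")                    # first matching tag
first_two = soup.find_all("h2", limit=2)   # stop after two matches
missing = soup.find("h4")                  # no match, so None

titles = [h2.get_text() for h2 in soup.find_all("h2")]
print(titles)  # ['item A', 'item B', 'item C']
```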

#### Regular expressions can be used: what if you want a list of every `<ol>` and `<ul>` tag?

In [14]:
import re

In [15]:
soup.find_all(re.compile('^[ou]l$'))
# re.compile(pattern): compiles the pattern string into a pattern object
# ^ and $ anchor the pattern so that only the tag names ol and ul match

[<ul class="footer-list">
 <li class="footer-list-item"><a class="link-highlight" href="/trust">믿을 수 있는 중고거래</a></li>
 <li class="footer-list-item"><a class="link-highlight" href="/wv/faqs">자주 묻는 질문</a></li>
 </ul>, <ul class="footer-list">
 <li class="footer-list-item"><a href="http://team.daangn.com" target="_blank">회사 소개</a></li>
 <li class="footer-list-item"><a class="link-highlight" href="https://ad.daangn.com/" target="_blank">광고주센터</a></li>
 <li class="footer-list-item">
 <a class="ga-click" data-event-action="hot_articles" data-event-category="town_link_from" data-event-label="footer_town" href="https://town.daangn.com" target="_blank">동네가게</a>
 </li>
 </ul>, <ul class="footer-list policy">
 <li class="footer-list-item"><a href="https://policy.daangn.com/terms.html" target="_blank">이용약관</a></li>
 <li class="footer-list-item"><a href="https://policy.daangn.com/privacy.html" target="_blank">개인정보취급방침</a></li>
 <li class="footer-list-item"><a href="https://policy.daangn.com/locatio
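
Beautiful Soup tests a compiled pattern against each tag name with the pattern's `search()` method, so an unanchored pattern also matches names that merely contain it. A small check with the standard `re` module (the tag names in the list are illustrative):

```python
import re

tag_names = ["ol", "ul", "col", "colgroup", "li"]

loose = re.compile("[ou]l")      # matches anywhere inside the name
strict = re.compile("^[ou]l$")   # matches the whole name only

loose_hits = [name for name in tag_names if loose.search(name)]
strict_hits = [name for name in tag_names if strict.search(name)]

print(loose_hits)   # ['ol', 'ul', 'col', 'colgroup']
print(strict_hits)  # ['ol', 'ul']
```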

In [16]:
soup.find_all(re.compile('^h[1-6]$'))
# matches <h1> through <h6>, the heading levels defined in HTML

[<h1 id="fixed-bar-logo-title">
 <a href="https://www.daangn.com/">
 <span class="sr-only">당근마켓</span>
 <img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-24b18257ac4ef693c02233bf21e9cb7ecbf43ebd8d5b40c24d99e14094a44c81.svg"/>
 </a> </h1>,
 <h3 class="hide">다운로드</h3>,
 <h1 class="head-title" id="hot-articles-head-title">
     
     
     중고거래 인기매물
   </h1>,
 <h2 class="card-title">위닉스10리터 제습기 급매 1만원</h2>,
 <h2 class="card-title">캠핑, 잡템 먼저가져가면 장땡</h2>,
 <h2 class="card-title">아이폰8 화이트 64기가 판매합니다.</h2>,
 <h2 class="card-title">스팸 판매합니다. </h2>,
 <h2 class="card-title">이케아 책상 가져가실분</h2>,
 <h2 class="card-title">복합기</h2>,
 <h2 class="card-title">캠핑키친테이블(거의새거임)</h2>,
 <h2 class="card-title">아이폰에어팟 2세대 무선(정품)</h2>,
 <h2 class="card-title">선물세트 팝니데이</h2>,
 <h2 class="card-title">전동킥보드</h2>,
 <h2 class="card-title">타이틀 리스트 골프백</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">스피드랙 선반(새거)</h2>,
 <h2 class="card-title">캠핑

In [17]:
soup.find_all(['h1', 'p'])
# Passing a list fetches two or more kinds of tags at once.

[<h1 id="fixed-bar-logo-title">
 <a href="https://www.daangn.com/">
 <span class="sr-only">당근마켓</span>
 <img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-24b18257ac4ef693c02233bf21e9cb7ecbf43ebd8d5b40c24d99e14094a44c81.svg"/>
 </a> </h1>, <h1 class="head-title" id="hot-articles-head-title">
     
     
     중고거래 인기매물
   </h1>, <p>당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!</p>]

In [18]:
soup.find_all(attrs={'class':'card-title'})
# Fetch elements by class attribute

[<h2 class="card-title">위닉스10리터 제습기 급매 1만원</h2>,
 <h2 class="card-title">캠핑, 잡템 먼저가져가면 장땡</h2>,
 <h2 class="card-title">아이폰8 화이트 64기가 판매합니다.</h2>,
 <h2 class="card-title">스팸 판매합니다. </h2>,
 <h2 class="card-title">이케아 책상 가져가실분</h2>,
 <h2 class="card-title">복합기</h2>,
 <h2 class="card-title">캠핑키친테이블(거의새거임)</h2>,
 <h2 class="card-title">아이폰에어팟 2세대 무선(정품)</h2>,
 <h2 class="card-title">선물세트 팝니데이</h2>,
 <h2 class="card-title">전동킥보드</h2>,
 <h2 class="card-title">타이틀 리스트 골프백</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">스피드랙 선반(새거)</h2>,
 <h2 class="card-title">캠핑용품-해먹,릴선</h2>,
 <h2 class="card-title">아이패드 매직키보드 트랙패드 12.9인치</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">아이폰11</h2>,
 <h2 class="card-title">자전거</h2>,
 <h2 class="card-title">딥디크 도손향수 </h2>,
 <h2 class="card-title">몽클레어 야상/바람막이/경량패딩</h2>,
 <h2 class="card-title">미니책장 드립니다.</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">자전거 판매합니다</h2>,
 <h2 class="card-title">LG제습기 LD-067DSR</h2>,
 

In [19]:
soup.select('.card-title')
# Fetch elements by class attribute (2): CSS selector

[<h2 class="card-title">위닉스10리터 제습기 급매 1만원</h2>,
 <h2 class="card-title">캠핑, 잡템 먼저가져가면 장땡</h2>,
 <h2 class="card-title">아이폰8 화이트 64기가 판매합니다.</h2>,
 <h2 class="card-title">스팸 판매합니다. </h2>,
 <h2 class="card-title">이케아 책상 가져가실분</h2>,
 <h2 class="card-title">복합기</h2>,
 <h2 class="card-title">캠핑키친테이블(거의새거임)</h2>,
 <h2 class="card-title">아이폰에어팟 2세대 무선(정품)</h2>,
 <h2 class="card-title">선물세트 팝니데이</h2>,
 <h2 class="card-title">전동킥보드</h2>,
 <h2 class="card-title">타이틀 리스트 골프백</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">스피드랙 선반(새거)</h2>,
 <h2 class="card-title">캠핑용품-해먹,릴선</h2>,
 <h2 class="card-title">아이패드 매직키보드 트랙패드 12.9인치</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">아이폰11</h2>,
 <h2 class="card-title">자전거</h2>,
 <h2 class="card-title">딥디크 도손향수 </h2>,
 <h2 class="card-title">몽클레어 야상/바람막이/경량패딩</h2>,
 <h2 class="card-title">미니책장 드립니다.</h2>,
 <h2 class="card-title">제습기</h2>,
 <h2 class="card-title">자전거 판매합니다</h2>,
 <h2 class="card-title">LG제습기 LD-067DSR</h2>,
 

In [20]:
soup.select('#hot-articles-navigation')
# Fetch elements by id attribute

[<nav id="hot-articles-navigation">
 <select class="hot-articles-nav-select" id="region1" name="region1" onchange="changeRegion('r1', this.value)"><option value="">지역을 선택하세요</option><option value="서울특별시">서울특별시</option>
 <option value="부산광역시">부산광역시</option>
 <option value="대구광역시">대구광역시</option>
 <option value="인천광역시">인천광역시</option>
 <option value="광주광역시">광주광역시</option>
 <option value="대전광역시">대전광역시</option>
 <option value="울산광역시">울산광역시</option>
 <option value="세종특별자치시">세종특별자치시</option>
 <option value="경기도">경기도</option>
 <option value="강원도">강원도</option>
 <option value="충청북도">충청북도</option>
 <option value="충청남도">충청남도</option>
 <option value="전라북도">전라북도</option>
 <option value="전라남도">전라남도</option>
 <option value="경상북도">경상북도</option>
 <option value="경상남도">경상남도</option>
 <option value="제주특별자치도">제주특별자치도</option></select>
 <select class="hot-articles-nav-select" disabled="disabled" id="region2" name="region2" onchange="changeRegion('r2', this.value)"><option value="">동네를 선택하세요</option><option va
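
The class and id lookups above can be written in several equivalent ways. The sketch below uses a small made-up fragment; note that `select()` always returns a list, while `select_one()` and `find()` return a single tag or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the class and id names from the page
html = """
<nav id="hot-articles-navigation"><select id="region1"></select></nav>
<h2 class="card-title">item A</h2>
<h2 class="card-title">item B</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to match on the class attribute
by_attrs = soup.find_all(attrs={"class": "card-title"})
by_keyword = soup.find_all(class_="card-title")
by_css = soup.select(".card-title")

# Three ways to match on the id attribute
nav_list = soup.select("#hot-articles-navigation")     # list with one tag
nav_one = soup.select_one("#hot-articles-navigation")  # the tag itself
nav_find = soup.find(id="hot-articles-navigation")     # the tag itself

print([tag.get_text() for tag in by_css])  # ['item A', 'item B']
```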

In [21]:
for card in soup.select('.card-title')[:10] :
    print(card.get_text())
# Get the text inside each tag; select() runs once and the first 10 results are sliced

위닉스10리터 제습기 급매 1만원
캠핑, 잡템 먼저가져가면 장땡
아이폰8 화이트 64기가 판매합니다.
스팸 판매합니다. 
이케아 책상 가져가실분
복합기
캠핑키친테이블(거의새거임)
아이폰에어팟 2세대 무선(정품)
선물세트 팝니데이
전동킥보드


#### get_text(): returns only the text inside a tag, without the markup
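
`get_text()` and the `.string` attribute used earlier differ when a tag holds more than one child. A minimal sketch on a made-up list item:

```python
from bs4 import BeautifulSoup

# Hypothetical list item: a link followed by extra text
html = '<li><a href="/trust">Trusted trading</a> (new)</li>'
soup = BeautifulSoup(html, "html.parser")

# .string works only when the tag has a single child
link_string = soup.li.a.string   # 'Trusted trading'
item_string = soup.li.string     # None, because <li> has two children

# get_text() concatenates every piece of text below the tag
item_text = soup.li.get_text()

print(item_text)  # Trusted trading (new)
```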

In [22]:
# Fix Korean font rendering in matplotlib
import platform

from matplotlib import font_manager, rc
# plt.rcParams['axes.unicode_minus'] = False

if platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
elif platform.system() == 'Windows':
    path = "c:/Windows/Fonts/malgun.ttf"
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc('font', family=font_name)
else:
    print('Unknown system... sorry~~~~') 


In [23]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [24]:
html = urlopen('https://www.daangn.com/hot_articles')

# When the page is fetched with urlopen, pass html.read() to BeautifulSoup
soup = BeautifulSoup(html.read() , 'html.parser')

print(soup.h1)

In [62]:
from urllib.request import urlopen
from urllib.error   import HTTPError
from urllib.error   import URLError

In [63]:
try:
    html = urlopen('https://www.daangn.com/hot_articles')
except HTTPError as he :
    print('http error')
except URLError as ue :
    print('url error')
else :
    soup = BeautifulSoup(html.read() , 'html.parser')
    print(soup)

<!DOCTYPE html>

<html lang="ko">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no" name="viewport"/>
<link href="https://www.daangn.com/hot_articles" rel="canonical"/>
<title>당근마켓 중고거래 | 당신 근처의 당근마켓</title>
<meta content="당근마켓에서 거래되는 인기 중고 매물을 소개합니다. 지금 당근마켓에서 거래되고 있는 다양한 매물을 구경해보세요." name="description"/>
<meta content="https://www.daangn.com/hot_articles" property="og:url"/>
<meta content="당근마켓 중고거래 | 당신 근처의 당근마켓" property="og:title"/>
<meta content="당근마켓에서 거래되는 인기 중고 매물을 소개합니다. 지금 당근마켓에서 거래되고 있는 다양한 매물을 구경해보세요." property="og:description"/>
<meta content="당근마켓" property="og:site_name"/>
<meta content="https://www.daangn.com/images/meta/home/flea_market.png" property="og:image"/>
<meta content="article" property="og:type"/>
<meta content="ko_KR" property="og:locale"/>
<meta content="1463621440622064" property="fb:app_id"/>
<meta content="summary_large_image" name="tw
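
The order of the `except` clauses in the cell above matters: `HTTPError` is a subclass of `URLError`, so it has to be handled first. The sketch below shows this without touching the network; `describe_error` is a hypothetical helper, and the two exceptions are constructed by hand for illustration.

```python
from urllib.error import HTTPError, URLError

def describe_error(error):
    """Return a short label for an exception raised by urlopen."""
    # HTTPError must be checked first because it is also a URLError
    if isinstance(error, HTTPError):
        return f"http error {error.code}"
    if isinstance(error, URLError):
        return f"url error: {error.reason}"
    return "unexpected error"

# Hand-built exceptions standing in for real failures
http_label = describe_error(HTTPError("http://example.com", 404, "Not Found", None, None))
url_label = describe_error(URLError("name resolution failed"))

print(issubclass(HTTPError, URLError))  # True
print(http_label)                       # http error 404
print(url_label)                        # url error: name resolution failed
```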

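## Exercise 3
> Scraping the Totally Normal Gifts page.
Each img is stored as its link (the src attribute).

> Difference between the list approach and the append approach: when a dataframe is built from a dict, each field is appended to its own list, because pandas then takes one list per column. A list of rows also works if the column names are passed separately.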
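
The two approaches can be compared side by side. A minimal sketch with two rows taken from the gift table; the variable names are illustrative.

```python
import pandas as pd

# List approach: one inner list per table row
rows = [
    ["Vegetable Basket", "$15.00"],
    ["Dead Parrot", "$0.50"],
]
df_from_rows = pd.DataFrame(rows, columns=["title", "cost"])

# Append approach: one list per column, filled row by row
title, cost = [], []
for row_title, row_cost in rows:
    title.append(row_title)
    cost.append(row_cost)
df_from_columns = pd.DataFrame({"title": title, "cost": cost})

# Both dataframes hold the same data
print(df_from_rows.to_dict("list") == df_from_columns.to_dict("list"))  # True
```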

In [79]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error   import HTTPError
from urllib.error   import URLError

In [80]:
try:
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
except HTTPError as he :
    print('http error')
except URLError as ue :
    print('url error')
else :
    soup = BeautifulSoup(html.read() , 'html.parser')

## Instructor's Solution
> I am not sure whether the result came out correctly.

In [81]:
import csv

# Locate the gift table first; the loop below reads its rows
table = soup.find('table', {'id' : 'giftList'})

test_data = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if tds :   # skip rows that have no <td> cells
        title   = tds[0].text.strip('\n')
        desc    = tds[1].text.strip('\n')
        cost    = tds[2].text.strip('\n')
        img_src = tds[3].find('img')['src']
        test_data.append([title, desc, cost, img_src])

test_data

# csv.writer quotes fields that contain commas, such as "$10,000.52"
with open('test.csv', 'w', encoding="utf-8", newline='') as file:
    csv.writer(file).writerows(test_data)

In [82]:
# for child in soup.find('table' , {'id' : 'giftList'}).children :
#     print(child)

# Find sibling tags

for child in soup.find('table' , {'id' : 'giftList'}).tr.next_siblings :
    print(child)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr
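
As the blank lines in the output suggest, `.next_siblings` also yields the whitespace strings between rows. The sketch below uses a simplified, hypothetical version of the gift table (the header row and the reduced columns are assumptions) to keep only the `<tr>` tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified, hypothetical version of the gift table
html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

header_row = soup.find("table", {"id": "giftList"}).tr

# Filter out the newline strings that sit between the rows
gift_rows = [node for node in header_row.next_siblings if isinstance(node, Tag)]
gift_ids = [row["id"] for row in gift_rows]

# find_next_siblings() returns tags only, so no filtering is needed
same_rows = header_row.find_next_siblings("tr")

print(gift_ids)  # ['gift1', 'gift2']
```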

## Wooseok's Solution
> Builds the dataframe with the append approach and creates a csv file. Some values were wrapped in \n characters,
so strip was used to remove them.

In [83]:
table = soup.find('table', {'id' : 'giftList'})
table

trs = table.find_all('tr')
trs

# scraping_exec[03]
# Extract the data in the <table> tag and save it in csv format
import pandas as pd

title = []
desc = []
cost = []
img = []

for tr in trs :
    tds = tr.find_all('td')
    if tds :   # skip rows that have no <td> cells
        title.append(tds[0].text.strip('\n'))
        desc.append(tds[1].text.strip('\n'))
        cost.append(tds[2].text.strip('\n'))

        # store the image as its link; None keeps all four lists the same length
        image_tag = tr.find('img')
        img.append(image_tag['src'].strip('\n') if image_tag else None)

        
gift_df = pd.DataFrame({
    'title' : title,
    'description' : desc,
    'cost' : cost,
    'image' : img
})

gift_df.to_csv('gift_df.csv', mode = 'w', encoding = 'utf-8')
gift_df

Unnamed: 0,title,description,cost,image
0,Vegetable Basket,This vegetable basket is the perfect gift for ...,$15.00,../img/gifts/img1.jpg
1,Russian Nesting Dolls,"Hand-painted by trained monkeys, these exquisi...","$10,000.52",../img/gifts/img2.jpg
2,Fish Painting,"If something seems fishy about this painting, ...","$10,005.00",../img/gifts/img3.jpg
3,Dead Parrot,This is an ex-parrot! Or maybe he's only resting?,$0.50,../img/gifts/img4.jpg
4,Mystery Box,"If you love suprises, this mystery box is for ...",$1.50,../img/gifts/img6.jpg
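
By default `to_csv` also writes the row index as an extra first column with an empty header, and reading such a file back with `read_csv` shows that column as `Unnamed: 0`. If the index is not needed, `index=False` leaves it out. A minimal sketch that writes to an in-memory buffer instead of a file (the one-row dataframe is illustrative):

```python
import io
import pandas as pd

gift_df = pd.DataFrame({
    "title": ["Russian Nesting Dolls"],
    "cost": ["$10,000.52"],
})

# Write without the index column; fields containing commas are quoted
buffer = io.StringIO()
gift_df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)  # ['title,cost', 'Russian Nesting Dolls,"$10,000.52"']

# Reading it back gives exactly the original two columns
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(round_trip.columns))  # ['title', 'cost']
```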


## How to Fetch Data

> Fetch the required weather information from a weather web page, build a table with pandas,
and create a csv file.

In [76]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error   import HTTPError
from urllib.error   import URLError

In [77]:
try:
    html = urlopen('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.X21Im2gzaUm')
except HTTPError as he :
    print('http error')
except URLError as ue :
    print('url error')
else :
    soup = BeautifulSoup(html.read() , 'html.parser')


In [41]:
soup

<!DOCTYPE html>

<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/><title>National Weather Service</title><meta content="National Weather Service" name="DC.title"><meta content="NOAA National Weather Service National Weather Service" name="DC.description"/><meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/><meta content="" name="DC.date.created" scheme="ISO8601"/><meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/><meta content="weather, National Weather Service" name="DC.keywords"/><meta content="NOAA's National Weather Service" name="DC.publisher"/><meta content="National Weather Service" name="DC.contributor"/><meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/><meta content="General" name="rating"/><meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="./images/favicon.ico" rel="shor

In [42]:
# sevenDays = soup.select("#seven-day-forecast")
sevenDays = soup.find(id="seven-day-forecast")
sevenDays

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><div id="headline-container"><div id="headline-info" onclick="$('#headline-detail').toggle(); $('#headline-detail-now').hide()" style="margin-top: 5px"><div id="headline-detail"><div>Heat Advisory October 1, 11:00am until October 1, 08:00pm</div></div><span class="fa fa-info-circle"></span>Click here for hazard details and duration</div><div class="headline-bar headline-advisory " style="top: 40px; left: 176px; height: 125px; width: 93px">
<div class="headline-title">Heat Advisory</div>
</div></div><ul class="list-unstyled" id="seven-day-forecast-list" style="padding-top: 60px"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Overnight<br/><br/></p>
<p><img alt="Overnight: Wides

In [43]:
forecast = sevenDays.find_all(class_ = 'tombstone-container')

In [44]:
period = forecast[0].find(class_ = 'period-name').get_text()
period

'Overnight'

In [45]:
short_desc = forecast[0].find(class_ = 'short-desc').get_text()
short_desc

'Haze'

In [46]:
img = forecast[0].find('img')
img_src = img['src']
img_src

'newimages/medium/hz.png'

In [47]:
sevenDays = soup.select("#seven-day-forecast")

## Fetching All of the Required Weather Information

In [48]:
sevenDays = soup.find(id="seven-day-forecast")
sevenDays

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><div id="headline-container"><div id="headline-info" onclick="$('#headline-detail').toggle(); $('#headline-detail-now').hide()" style="margin-top: 5px"><div id="headline-detail"><div>Heat Advisory October 1, 11:00am until October 1, 08:00pm</div></div><span class="fa fa-info-circle"></span>Click here for hazard details and duration</div><div class="headline-bar headline-advisory " style="top: 40px; left: 176px; height: 125px; width: 93px">
<div class="headline-title">Heat Advisory</div>
</div></div><ul class="list-unstyled" id="seven-day-forecast-list" style="padding-top: 60px"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Overnight<br/><br/></p>
<p><img alt="Overnight: Wides

In [49]:
sevenDays = soup.find(id="seven-day-forecast")
periods = sevenDays.select('.tombstone-container .period-name')
periods

[<p class="period-name">Overnight<br/><br/></p>,
 <p class="period-name">Thursday<br/><br/></p>,
 <p class="period-name">Thursday<br/>Night</p>,
 <p class="period-name">Friday<br/><br/></p>,
 <p class="period-name">Friday<br/>Night</p>,
 <p class="period-name">Saturday<br/><br/></p>,
 <p class="period-name">Saturday<br/>Night</p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>]

In [50]:
periods_text = [text.get_text() for text in periods ]
periods_text

['Overnight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight']
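
Values such as 'ThursdayNight' are glued together because `get_text()` drops the `<br/>` tag and joins the surrounding strings with nothing in between. One possible fix, sketched on two of the elements shown above, is to pass a separator:

```python
from bs4 import BeautifulSoup

# Two period-name elements copied from the output above
html = """
<p class="period-name">Overnight<br/><br/></p>
<p class="period-name">Thursday<br/>Night</p>
"""
soup = BeautifulSoup(html, "html.parser")
periods = soup.select(".period-name")

# Default: strings are joined with no separator
joined = [p.get_text() for p in periods]

# separator=" " puts a space between strings; strip=True trims each one
spaced = [p.get_text(separator=" ", strip=True) for p in periods]

print(joined)  # ['Overnight', 'ThursdayNight']
print(spaced)  # ['Overnight', 'Thursday Night']
```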

In [51]:
descs = sevenDays.select('.tombstone-container .short-desc')
descs

[<p class="short-desc">Haze</p>,
 <p class="short-desc">Haze then<br/>Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Mostly Sunny</p>,
 <p class="short-desc">Mostly Cloudy</p>,
 <p class="short-desc">Partly Sunny</p>,
 <p class="short-desc">Mostly Cloudy</p>]

In [52]:
desc_text = [text.get_text() for text in descs ]
desc_text

['Haze',
 'Haze thenSunny',
 'Mostly Clear',
 'Sunny',
 'Mostly Clear',
 'Mostly Sunny',
 'Mostly Cloudy',
 'Partly Sunny',
 'Mostly Cloudy']

In [53]:
temp_text = [text.get_text() for text in sevenDays.select('.tombstone-container .temp')]
temp_text

['Low: 60 °F',
 'High: 92 °F',
 'Low: 62 °F',
 'High: 82 °F',
 'Low: 59 °F',
 'High: 74 °F',
 'Low: 57 °F',
 'High: 70 °F',
 'Low: 57 °F']
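
If the temperatures are needed as numbers later, the strings can be split into a label and a value. This is an optional sketch using the standard `re` module; the pattern assumes the 'Low: 60 °F' layout seen in the output above.

```python
import re

# First three values from the output above
temp_text = ['Low: 60 °F', 'High: 92 °F', 'Low: 62 °F']

# Label (Low or High), then an integer, then the unit
pattern = re.compile(r"(Low|High):\s*(-?\d+)\s*°F")

parsed = []
for text in temp_text:
    match = pattern.search(text)
    if match:
        parsed.append((match.group(1), int(match.group(2))))
    else:
        parsed.append((None, None))   # keep the list length unchanged

print(parsed)  # [('Low', 60), ('High', 92), ('Low', 62)]
```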

In [60]:
# temp_text.append('High: 77 °F')
print(len(periods_text))
print(len(desc_text))
print(len(temp_text))
# -> if temp_text has only 8 items the table cannot be built, so a value is appended to temp_text (see the commented-out line above)

9
9
9


#### Before building a table, it is a good idea to check that every column has the same length
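
One way to keep the columns the same length is to read each forecast container as one row and store `None` for a missing field. The following sketch uses a made-up fragment in which the second item has no temperature; `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second container has no .temp element
html = """
<div class="tombstone-container">
  <p class="period-name">Overnight</p><p class="short-desc">Haze</p><p class="temp">Low: 60 °F</p>
</div>
<div class="tombstone-container">
  <p class="period-name">Thursday</p><p class="short-desc">Sunny</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def text_or_none(container, css_class):
    """Return the text of the first matching child, or None if absent."""
    tag = container.find(class_=css_class)
    return tag.get_text(strip=True) if tag else None

columns = {"period": [], "desc": [], "temp": []}
for container in soup.find_all(class_="tombstone-container"):
    columns["period"].append(text_or_none(container, "period-name"))
    columns["desc"].append(text_or_none(container, "short-desc"))
    columns["temp"].append(text_or_none(container, "temp"))

# Every column now has one entry per container
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)          # {'period': 2, 'desc': 2, 'temp': 2}
print(columns["temp"])  # ['Low: 60 °F', None]
```

The resulting dict can be passed to `pd.DataFrame` in the same way as the lists built above.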

In [55]:
import pandas as pd

In [56]:
forecast_df = pd.DataFrame({
    'period' : periods_text,
    'desc'   : desc_text,
    'temp'   : temp_text
})

In [57]:
forecast_df

Unnamed: 0,period,desc,temp
0,Overnight,Haze,Low: 60 °F
1,Thursday,Haze thenSunny,High: 92 °F
2,ThursdayNight,Mostly Clear,Low: 62 °F
3,Friday,Sunny,High: 82 °F
4,FridayNight,Mostly Clear,Low: 59 °F
5,Saturday,Mostly Sunny,High: 74 °F
6,SaturdayNight,Mostly Cloudy,Low: 57 °F
7,Sunday,Partly Sunny,High: 70 °F
8,SundayNight,Mostly Cloudy,Low: 57 °F


In [58]:
forecast_df.to_csv('forecast_df.csv' , mode='w' , encoding='utf-8')
print('success')

success
