Web Scraping
==============

Parsing a Local HTML File
-----------

In [10]:
from bs4 import BeautifulSoup
import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    
print(soup.prettify()) # prettify() returns the markup as an indented, readable string

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


In [11]:
match = soup.title # a tag can be selected by accessing it as an attribute of the soup
print(match)

<title>Test - A Sample Website</title>


In [12]:
match2 = soup.title.text
print(match2)

Test - A Sample Website


In [13]:
match3 = soup.div
print(match3)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [15]:
# find() returns the first tag that matches, here a div with a specific class
match4 = soup.find('div', class_ = 'footer')
print(match4)

<div class="footer">
<p>Footer Information</p>
</div>
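
One detail worth noting about find(): when nothing matches it returns None, so chaining .p.text onto the result raises an AttributeError. The sketch below shows one way to guard against that. It assumes an inline HTML string in place of simple.html and the built-in 'html.parser' so that it runs on its own; the class name 'sidebar' is hypothetical and deliberately absent from the markup.

```python
from bs4 import BeautifulSoup

# Inline stand-in for simple.html (an assumption, so the snippet is self-contained)
html = '<div class="footer"><p>Footer Information</p></div>'
soup_demo = BeautifulSoup(html, 'html.parser')

footer = soup_demo.find('div', class_='footer')
sidebar = soup_demo.find('div', class_='sidebar')  # hypothetical class, not in the markup

# find() returns None when there is no match, so check before chaining
footer_text = footer.p.text if footer is not None else None
sidebar_text = sidebar.p.text if sidebar is not None else None

print(footer_text)
print(sidebar_text)
```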


In [16]:
article = soup.find('div', class_ = 'article')
headline = article.h2.a.text
print(headline)

Article 1 Headline


In [17]:
summary = article.p.text
print(summary)

This is a summary of article 1


In [18]:
# find_all() returns a list of all the tags that match the given arguments
for article in soup.find_all('div', class_ = 'article'):
    headline = article.h2.a.text
    print(headline)
    
    summary = article.p.text
    print(summary)
    
    print()

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2
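
Instead of printing each field, the same loop can collect the results into a list of dictionaries, which is easier to pass on to later steps. This is only a sketch: the inline HTML repeats the two article blocks from the output above, 'html.parser' is used to avoid the lxml dependency, and the names records and block are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the two article blocks (assumed here so the snippet runs by itself)
html = '''
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p></div>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

records = []
for block in soup_demo.find_all('div', class_='article'):
    records.append({
        'headline': block.h2.a.text.strip(),  # text inside the link
        'link': block.h2.a['href'],           # attribute access works like a dict
        'summary': block.p.text.strip(),
    })

print(records)
```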



Scraping an Actual Website
---------------

In [21]:
source = requests.get('http://coreyms.com').text

soup = BeautifulSoup(source, 'lxml')

In [22]:
article = soup.find('article')

In [24]:
headline = article.a.text
print(headline)

Python Tutorial: Itertools Module – Iterator Functions for Efficient Looping


In [25]:
summary = article.find('div', class_ = 'entry-content').p.text
print(summary)

In this Python Programming Tutorial, we will be learning about the itertools module. The itertools module is a collection of functions that allows us to work with iterators in an efficient way. Depending on your problem, this can save you a lot of memory and also a lot of work. Let’s get started…


In [26]:
vid_src = article.find('iframe', class_ = 'youtube-player')
print(vid_src)

<iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/Qu3dThVy6KQ?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="640"></iframe>


In [27]:
vid_src = article.find('iframe', class_ = 'youtube-player')['src']
print(vid_src)

https://www.youtube.com/embed/Qu3dThVy6KQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent


In [30]:
vid_id = vid_src.split('/')[4]
vid_id = vid_id.split('?')[0]
print(vid_id)

Qu3dThVy6KQ
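
Splitting on '/' and taking index 4 depends on the exact shape of the URL. A possible alternative is to parse the URL with the standard library and take the last path segment. The helper name extract_video_id is hypothetical, and the sample URL is a shortened form of the src value shown above.

```python
from urllib.parse import urlparse


def extract_video_id(embed_url):
    """Return the last path segment of a YouTube embed URL (hypothetical helper)."""
    path = urlparse(embed_url).path  # the query string is already separated out
    return path.rstrip('/').split('/')[-1]


src = ('https://www.youtube.com/embed/Qu3dThVy6KQ'
       '?version=3&rel=1&fs=1&autohide=2')
print(extract_video_id(src))
```

Because urlparse separates the query string from the path, no second split on '?' is needed.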


In [31]:
yt_link = f'http://youtube.com/watch?v={vid_id}'
print(yt_link)

http://youtube.com/watch?v=Qu3dThVy6KQ


In [41]:
import csv

# newline='' avoids blank rows on Windows; utf-8 preserves characters such as the en dash
csv_file = open('cms_scrape.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])

for article in soup.find_all('article'):
    headline = article.a.text
    print(headline)

    summary = article.find('div', class_ = 'entry-content').p.text
    print(summary)

    try:
        vid_src = article.find('iframe', class_ = 'youtube-player')['src']
        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]
        yt_link = f'http://youtube.com/watch?v={vid_id}'
    except (TypeError, KeyError, IndexError):
        # No matching iframe (find() returned None), no src attribute, or an unexpected URL shape
        yt_link = None
    print(yt_link)

    print()

    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()

Python Tutorial: Itertools Module – Iterator Functions for Efficient Looping
In this Python Programming Tutorial, we will be learning about the itertools module. The itertools module is a collection of functions that allows us to work with iterators in an efficient way. Depending on your problem, this can save you a lot of memory and also a lot of work. Let’s get started…
http://youtube.com/watch?v=Qu3dThVy6KQ

Python Coding Problem: Creating Your Own Iterators
In this Python Coding Problem, we will be creating our own iterators from scratch. First, we will create an iterator using a class. Then we will create an iterator with the same functionality using a generator. If you haven’t watched the tutorial video on Iterators and Iterables then I would suggest watching that first. With that said, let’s get started…
http://youtube.com/watch?v=C3Z9lJXI6Qw

Python Tutorial: Iterators and Iterables – What Are They and How Do They Work?
In this Python Programming Tutorial, we will be learning a
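
The csv calls in the cell above can also be wrapped in a with block, so the file is closed even if an exception interrupts the loop. The sketch below assumes two made-up rows in place of scraped data and writes to a temporary directory; only the column names and the file name come from the cell above.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped (headline, summary, video_link) values
rows = [
    ('First headline', 'First summary', 'http://youtube.com/watch?v=Qu3dThVy6KQ'),
    ('Second headline', 'Second summary', None),  # article without a video
]

csv_path = os.path.join(tempfile.mkdtemp(), 'cms_scrape.csv')

# The with block closes the file automatically, even if writing raises
with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['headline', 'summary', 'video_link'])
    csv_writer.writerows(rows)

# Read the file back to inspect what was written
with open(csv_path, newline='', encoding='utf-8') as csv_file:
    written = list(csv.reader(csv_file))

print(written)
```

A None value is written as an empty field, so rows without a video still keep three columns.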

References
--------

https://www.youtube.com/watch?v=ng2o98k983k&index=43&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU