# Web Scraping

* A parser tells BeautifulSoup how to read the HTML text and turn it into a tree of tags we can search
* lxml and html5lib are external libraries that must be installed separately
* lxml will be used in this notebook
* html.parser is Python's built-in HTML parser
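
As a minimal sketch of how the parser choice plays out in code, the helper below tries `lxml` first and falls back to the built-in `html.parser` when `lxml` is not installed. `make_soup` is a hypothetical name for illustration, not part of BeautifulSoup; the library raises `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<html><head><title>Test</title></head><body><p>Hello</p></body></html>')
print(soup.title.text)
print(soup.p.text)
```

The parsers can disagree on how they repair badly formed HTML, so it is probably safest to stick with one parser throughout a project.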

In [70]:
import requests as r
import os
from bs4 import BeautifulSoup
import csv

# Web Scraping with Simple HTML Text

In [11]:
os.chdir(r'C:\Users\tanzh\Documents\Python\Other Materials\web_scraping')
with open('simple_html.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


In [16]:
print(soup.title) # this retrieves the first title tag on the page
print(soup.title.text) # this retrieves only the text inside that title tag

<title>Test - A Sample Website</title>
Test - A Sample Website


In [17]:
soup.div # this retrieves the first div tag in the HTML

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>

In [20]:
soup.find('div') # same as soup.div: returns the first matching tag

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>

In [21]:
soup.find('div', class_='footer') # this finds the first div with a particular class

<div class="footer">
<p>Footer Information</p>
</div>
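
One thing worth knowing when looking up a class that might not exist: `find` returns `None` when nothing matches, so chaining `.p.text` onto the result would raise an `AttributeError`. A small sketch, using an inline HTML string as an assumed stand-in for the sample file and the built-in `html.parser` so it runs without `lxml`. The `sidebar` class is made up to show the no-match case.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="article"><p>Summary</p></div>'
    '<div class="footer"><p>Footer Information</p></div>'
)
soup = BeautifulSoup(html, 'html.parser')

footer = soup.find('div', class_='footer')
sidebar = soup.find('div', class_='sidebar')  # no such div in this markup

# Guard against None before drilling into child tags
footer_text = footer.p.text if footer is not None else None
sidebar_text = sidebar.p.text if sidebar is not None else None

print(footer_text)
print(sidebar_text)
```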

In [25]:
# the code below lets us access the individual tags inside a div

article = soup.find('div', class_='article')
headline = article.h2.a.text
summary = article.p.text

print(article)
print('\n')
print(headline)
print('\n')
print(summary)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


Article 1 Headline


This is a summary of article 1


In [34]:
# with the above, we can find every div with the same class

# soup.find_all('div', class_='article') ---> this returns a list-like ResultSet

headline_list = []
summary_list = []


for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary = article.p.text
    
    headline_list.append(headline)
    summary_list.append(summary)

    print(headline)
    print(summary)

    print('\n')

print('==============')
print('\n')

print(headline_list)
print(summary_list)

Article 1 Headline
This is a summary of article 1


Article 2 Headline
This is a summary of article 2




['Article 1 Headline', 'Article 2 Headline']
['This is a summary of article 1', 'This is a summary of article 2']
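
If the headline and summary should stay paired, one alternative (a sketch, not the only way) is to collect one dictionary per article instead of two parallel lists. The HTML string below is a trimmed, assumed stand-in for `simple_html.html`, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p></div>
<div class="footer"><p>Footer Information</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# One dict per article keeps related fields together
articles = [
    {'headline': div.h2.a.text, 'summary': div.p.text, 'link': div.h2.a['href']}
    for div in soup.find_all('div', class_='article')
]

for item in articles:
    print(item)
```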


# Web Scraping on Corey Schafer's Website

In [56]:
source = r.get('https://coreyms.com/').text # retrieve the HTML text of a webpage
soup = BeautifulSoup(source,'lxml') # parse the text into a soup object
article = soup.find('article') # this finds the FIRST article tag on the page
print(article.prettify())

<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </spa

In [57]:
article = soup.find('article')
headline = article.h2.a.text
print(headline)

Python Tutorial: Zip Files – Creating and Extracting Zip Archives


In [52]:
summary = article.find('div', class_='entry-content').p.text
print(summary)

In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…


In [59]:
video_source_tag = article.find('iframe', class_='youtube-player')
# the attributes of a tag can be accessed like dictionary keys
video_source = video_source_tag['src']
print(video_source_tag)
print(video_source)

<iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe>
https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
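
Attribute access on a tag behaves like a dictionary lookup. A brief sketch on a shortened iframe (the `src` query string is trimmed here for readability), contrasting `tag['src']`, which raises `KeyError` for a missing attribute, with `tag.get(...)`, which returns `None` instead:

```python
from bs4 import BeautifulSoup

html = (
    '<iframe class="youtube-player" width="640" '
    'src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1"></iframe>'
)
tag = BeautifulSoup(html, 'html.parser').find('iframe', class_='youtube-player')

src = tag['src']            # raises KeyError if the attribute is absent
missing = tag.get('title')  # returns None if the attribute is absent
width = tag.attrs['width']  # .attrs is the underlying dict of attributes

print(src)
print(missing)
print(width)
```

Note that the parser decodes `&amp;` in the attribute value, so `src` contains plain `&` characters.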


In [63]:
video_source_id = video_source.split('/')[4]
video_source_id = video_source_id.split('?')[0]
print(video_source_id)

z0gguhEmWiY


In [64]:
yt_link = f'https://youtube.com/watch?v={video_source_id}'
print(yt_link)

https://youtube.com/watch?v=z0gguhEmWiY
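
Splitting on `'/'` and taking index 4 works for this URL but depends on the exact number of slashes. A possible alternative, sketched with the standard library's `urllib.parse`, takes the last segment of the URL path instead. `embed_to_watch_url` is a hypothetical helper name, and the embed URL below is a shortened copy of the one printed earlier.

```python
from urllib.parse import urlparse


def embed_to_watch_url(embed_url):
    """Turn a YouTube embed URL into a watch URL (hypothetical helper)."""
    path = urlparse(embed_url).path             # e.g. '/embed/z0gguhEmWiY'
    video_id = path.rstrip('/').split('/')[-1]  # last path segment is the ID
    return f'https://youtube.com/watch?v={video_id}'


embed = 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&fs=1'
print(embed_to_watch_url(embed))
```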


In [74]:
# Now we can loop through all the articles and write the results to a CSV file

os.chdir(r'C:\Users\tanzh\Documents\Python\Other Materials\web_scraping')

# newline='' stops the csv module from writing blank rows on Windows;
# the with block closes the file even if an error occurs
with open('cms_scrape.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['headline', 'summary', 'video_link']) # column headers

    for article in soup.find_all('article'):
        headline = article.h2.a.text
        summary = article.find('div', class_='entry-content').p.text

        try:
            video_source = article.find('iframe', class_='youtube-player')['src']
            video_source_id = video_source.split('/')[4].split('?')[0]
            yt_link = f'https://youtube.com/watch?v={video_source_id}'
        except (TypeError, KeyError):
            # no iframe (find returned None) or no src attribute: no video link
            yt_link = None

        print(headline)
        print(summary)
        print(yt_link)
        print('')

        csv_writer.writerow([headline, summary, yt_link])

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how t
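
To check what the CSV step produces without touching the disk, here is a minimal sketch that writes to an in-memory buffer with `csv.DictWriter` and reads it back. The two rows are placeholder data, not scraped results. It also shows that a `None` link is written as an empty field and that a comma inside a headline is quoted correctly.

```python
import csv
import io

# Placeholder rows standing in for scraped articles
rows = [
    {'headline': 'Article 1 Headline', 'summary': 'Summary 1', 'video_link': None},
    {'headline': 'Article 2, with a comma', 'summary': 'Summary 2',
     'video_link': 'https://youtube.com/watch?v=z0gguhEmWiY'},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=['headline', 'summary', 'video_link'])
writer.writeheader()
writer.writerows(rows)

# Rewind and read the data back to confirm it round-trips
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
for row in read_back:
    print(row['headline'], '|', row['video_link'])
```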

In [None]:
# Alternative 

In [94]:

source_html = r.get('https://coreyms.com/').text
soup = BeautifulSoup(source_html, 'lxml')

article = soup.find('article')
print(article.prettify())

<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </spa

In [91]:
headline = article.h2.a.text
print(headline)

Python Tutorial: Zip Files – Creating and Extracting Zip Archives


In [130]:
date_of_article = article.p.time['datetime'].split('T')[0]
time_of_upload = article.p.time['datetime'].split('T')[1]

print(date_of_article)
print(time_of_upload)

2019-11-19
13:02:37-05:00
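
Splitting on `'T'` gives two strings. If real date and time objects are needed (for sorting or arithmetic), a possible alternative is `datetime.fromisoformat` from the standard library, sketched here on the same `datetime` attribute value shown above:

```python
from datetime import datetime

raw = '2019-11-19T13:02:37-05:00'  # the datetime attribute of the <time> tag
published = datetime.fromisoformat(raw)

print(published.date())       # calendar date
print(published.time())       # wall-clock time without the offset
print(published.utcoffset())  # UTC offset as a timedelta
```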


In [122]:
summary = article.find('div', class_='entry-content').p.text
print(summary)

In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…


In [146]:
article_link = article.h2.a['href'] # link to the article page, not to the video
print(article_link)

https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives


In [None]:
for article in soup.find_all('article'):