# Web Scraping With BeautifulSoup and Requests

What is web scraping?
- The process of extracting information from a webpage by using patterns in the page's HTML.

**We will scrape the website https://coreyms.com/**

**We will grab the post titles, summaries, and links to YouTube videos from this webpage.**

**To start, we will scrape a simple HTML page to get an idea of how scraping works.**

To parse the HTML file, we use the **lxml parser**.
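The parser is selected by the second argument to BeautifulSoup. The following is a minimal sketch, assuming a short inline markup string (hypothetical, standing in for simple.html). It uses the built-in `html.parser` so that it runs without extra packages; `'lxml'` can be passed the same way once the lxml package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for simple.html.
markup = (
    "<html><head><title>Test - A Sample Website</title></head>"
    "<body><h1 id='site_title'>Test Website</h1></body></html>"
)

# 'html.parser' ships with Python; pass 'lxml' instead if that package is installed.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.text
print(title_text)
```

Naming the parser explicitly keeps the result the same across machines, since BeautifulSoup otherwise picks whichever parser happens to be installed.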

In [3]:
from bs4 import BeautifulSoup
import requests

with open('simple.html') as html_file:
    # name the parser explicitly instead of letting BeautifulSoup guess
    soup = BeautifulSoup(html_file, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>



## Using the find() method to extract a specific tag

In [5]:
# pass the class_ argument to get the div with class "footer"
match = soup.find('div', class_='footer')
print(match)

<div class="footer">
<p>Footer Information</p>
</div>


In [6]:
match = soup.find('h1', id="site_title")
print(match)

<h1 id="site_title">Test Website</h1>


In [10]:

# Get headline and summary of article
article = soup.find('div',  class_="article")
headline = article.h2.a.text
summary = article.p.text
print(headline)
print(summary)

Article 1 Headline
This is a summary of article 1


## Using the find_all() method
- The find_all() method returns a list of all matches.
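The difference between find() and find_all() can be sketched on a small inline document. The markup below is a hypothetical cut-down version of the sample page, and the `sidebar` class is an assumption chosen because it matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the two articles in the sample page.
html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, or None when nothing matches.
first = soup.find("div", class_="article")
missing = soup.find("div", class_="sidebar")

# find_all() returns a list-like collection of every match (empty when nothing matches).
all_articles = soup.find_all("div", class_="article")
none_found = soup.find_all("div", class_="sidebar")

first_headline = first.h2.a.text
headlines = [article.h2.a.text for article in all_articles]
print(first_headline)
print(headlines)
print(missing, len(none_found))
```

Because find_all() gives back an empty collection instead of None, a loop over its result simply runs zero times when nothing matches.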

In [12]:
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary = article.p.text
    print(headline)
    print(summary)

Article 1 Headline
This is a summary of article 1
Article 2 Headline
This is a summary of article 2


# Scraping a Real Website

https://coreyms.com/

To get a response from the webpage above, we have to send a request to it.

For this purpose, we use the **requests** library.

- Use the **text** property of the response to get the page content as a string.
- Parse the data using **BeautifulSoup**.
- We need to extract the heading, summary, and video link of each post on the page.
- Transform the video link into the format below:
    - https://www.youtube.com/watch?v=ng2o98k983k
- Store this information in a CSV file.
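The last two steps above can be sketched without touching the network. The embed URL below is a hypothetical example that reuses the video ID from the list, `embed_to_watch_url` is an illustrative helper name, and the rows are written to an in-memory buffer instead of a file.

```python
import csv
import io
from urllib.parse import urlsplit


def embed_to_watch_url(embed_url):
    """Turn a YouTube embed URL into a watch URL."""
    # The video ID is the last path segment; the query string is dropped.
    video_id = urlsplit(embed_url).path.rstrip("/").split("/")[-1]
    return f"https://www.youtube.com/watch?v={video_id}"


# Hypothetical embed URL, reusing the video ID shown in the list above.
embed_url = "https://www.youtube.com/embed/ng2o98k983k?version=3&rel=1"
watch_url = embed_to_watch_url(embed_url)

# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Heading", "Summary", "Youtube URL"])
writer.writerow(["Example heading", "Example summary", watch_url])
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for a file opened with `newline=''` would store the same rows on disk.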

In [74]:
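from bs4 import BeautifulSoup
import requests
import re
import csv

source = requests.get('http://coreyms.com').text
# parse the page with the lxml parser
soup = BeautifulSoup(source, 'lxml')

# the with block closes the csv file automatically;
# newline='' stops the csv module from writing blank rows on Windows
with open('scrap_data.csv', 'w', newline='') as data_file:
    csv_writer = csv.writer(data_file)
    csv_writer.writerow(['Heading', 'Summary', 'Youtube URL'])
    for article in soup.find_all('article'):
        heading = article.header.h2.a.text
        summary = article.find('div', class_="entry-content").p.text
        # find() returns None when an article has no embedded video
        iframe = article.find('iframe', class_='youtube-player')
        if iframe is None:
            youtube_url = None
        else:
            # drop the query string, then turn .../embed/<id> into .../watch?v=<id>
            youtube_url = iframe['src'].split('?')[0]
            youtube_url = re.sub('embed/', 'watch?v=', youtube_url)
        csv_writer.writerow([heading, summary, youtube_url])

Without the None check, an article that has no embedded video makes find() return None, and indexing it with ['src'] raises TypeError: 'NoneType' object is not subscriptable.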
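This error can be reproduced and avoided on a small inline document. The sketch below assumes hypothetical markup with one article that has a player iframe and one that does not; `iframe_src` is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no embedded video.
html = """
<article><h2>With video</h2>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/ng2o98k983k?version=3"></iframe>
</article>
<article><h2>Without video</h2></article>
"""
soup = BeautifulSoup(html, "html.parser")


def iframe_src(article):
    """Return the player iframe's src, or None when the article has no player."""
    iframe = article.find("iframe", class_="youtube-player")
    if iframe is None:
        # Subscripting None here is what raises the TypeError.
        return None
    return iframe.get("src")


sources = [iframe_src(article) for article in soup.find_all("article")]
print(sources)
```

Returning None for the missing video lets the caller decide whether to skip the row or write an empty cell.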

# Exercise Programs

**Scraping Numbers from HTML using BeautifulSoup**

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data file below, parse the data, extract the numbers, and compute their sum.

Actual data: http://py4e-data.dr-chuck.net/comments_1331184.html

## Using the requests library to fetch data

In [19]:
from bs4 import BeautifulSoup
import requests
data_source = requests.get('http://py4e-data.dr-chuck.net/comments_1331184.html').text
# name the parser explicitly instead of letting BeautifulSoup guess
soup = BeautifulSoup(data_source, 'lxml')
def extract_compute():
    # each number sits in a <span class="comments"> tag
    return sum(int(number.text) for number in soup.find_all('span', class_='comments'))
print(extract_compute())

2359


## Using the urllib.request library to fetch data

In [21]:
from bs4 import BeautifulSoup
import urllib.request
with urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1331184.html') as html_res:
    # read() returns bytes; BeautifulSoup detects the encoding itself
    data_source = html_res.read()
soup = BeautifulSoup(data_source, 'lxml')
def extract_compute():
    # each number sits in a <span class="comments"> tag
    return sum(int(number.text) for number in soup.find_all('span', class_='comments'))
print(extract_compute())

2359
