
# Implementing Web Scraping in Python with BeautifulSoup

There are mainly two ways to extract data from a website:

* Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.

* Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.

This tutorial walks through the steps involved in web scraping using Beautiful Soup, a Python library for web scraping.

Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library for Python: requests.

2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract data simply through string processing. We need a parser that can create a nested/tree structure of the HTML data. There are many HTML parser libraries available; this tutorial uses html5lib, which parses pages the same way a web browser does and is therefore very tolerant of malformed HTML, at the cost of speed.

3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will be using another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files.
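
The three steps above can be sketched end to end. This is a minimal illustration rather than the tutorial's scraper: to keep it runnable offline, the HTTP request of step 1 is replaced by a hard-coded HTML string (`SAMPLE_HTML`, a made-up page), and Python's built-in `html.parser` stands in for html5lib so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# Step 1 (stand-in): in a real scraper this would be requests.get(url).content.
# SAMPLE_HTML is a made-up page used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <ul>
      <li><a href="/first">First link</a></li>
      <li><a href="/second">Second link</a></li>
    </ul>
  </body>
</html>
"""

# Step 2: parse the HTML into a nested tree.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 3: navigate and search the tree.
page_title = soup.title.string
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)
print(links)
```

The same three calls (fetch, parse, search) appear in each of the steps that follow, only against a live page.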

### Step 1: Accessing the HTML content from webpage

In [44]:
import requests
URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)

b'<!doctype html><html lang=en-us prefix="og: http://ogp.me/ns#"><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1"><link rel="shortcut icon" href=https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png type=image/x-icon><meta name=theme-color content="#308D46"><meta name=image property="og:image" content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png"><meta property="og:image:type" content="image/png"><meta property="og:image:width" content="200"><meta property="og:image:height" content="200"><script defer src=https://apis.google.com/js/platform.js></script><script async src=//cdnjs.cloudflare.com/ajax/libs/require.js/2.1.14/require.min.js></script><title>Data Structures - GeeksforGeeks</title><link rel=profile href=http://gmpg.org/xfn/11><link rel=pingback href><script type=application/ld+json>\r\n    {\r\n        "@context" : "http://schema.org",\r\n        "@type" : "Organization",\r\n    

Let us try to understand this piece of code.

* First of all, import the requests library.
* Then, specify the URL of the webpage you want to scrape.
* Send an HTTP request to the specified URL and save the response from the server in a response object called r.
* Now, print r.content to get the raw HTML content of the webpage. It is of type bytes (note the b'...' prefix in the output above); r.text gives the same content decoded to a string.
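
The difference between bytes and text can be shown without a network call. In the sketch below, `raw` is a made-up byte string standing in for `r.content`; decoding it yields the kind of `str` that `r.text` would return. UTF-8 is assumed here, whereas requests works out the encoding from the response itself.

```python
# `raw` is a made-up stand-in for r.content: the undecoded bytes of a response body.
raw = b"<title>Caf\xc3\xa9 menu</title>"

# Decoding produces a str, comparable to what r.text returns.
# UTF-8 is an assumption for this sample.
text = raw.decode("utf-8")

print(type(raw).__name__)
print(type(text).__name__)
print(text)
```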


### Step 2: Parsing the HTML content

In [45]:
# This will not run in an online IDE
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') # If this line raises an error, run 'pip install html5lib'
print(soup.prettify())


<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US">
 <head>
  <title>
   Inspirational Quotes - Motivational Quotes - Leadership Quotes | PassItOn.com
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1.0" name="viewport"/>
  <meta content="The Foundation for a Better Life | Pass It On.com" name="description"/>
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="/site.webmanifest" rel="manifest"/>
  <link color="#c8102e" href="/safari-pinned-tab.svg" rel="mask-icon"/>
  <meta content="#c8102e" name="msapplication-TileColor"/>
  <meta content="#ffffff" name="theme-color"/>
  <link crossorigin="anonymous" href="https://stackp

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html5lib, lxml, and html.parser. So a BeautifulSoup object can be created and the parser library specified at the same time.

In the example above,

soup = BeautifulSoup(r.content, 'html5lib')

We create a BeautifulSoup object by passing two arguments:

* r.content : It is the raw HTML content.
* html5lib : Specifying the HTML parser we want to use.

When soup.prettify() is printed, it gives a visual representation of the parse tree created from the raw HTML content.
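
To see on a small scale what the parse tree looks like, the sketch below parses a deliberately unclosed fragment (a made-up string). It uses the built-in `html.parser` instead of html5lib so that it runs without extra installs; parsers can repair broken markup differently, so the exact tree may vary with the parser chosen.

```python
from bs4 import BeautifulSoup

# A made-up fragment with unclosed <p> and <b> tags.
broken = "<p>Hello <b>world"

# Beautiful Soup closes the open tags while building the tree.
soup = BeautifulSoup(broken, "html.parser")

# prettify() returns a str that shows the tree with indentation.
pretty = soup.prettify()
print(pretty)

# The nesting is now explicit: <b> is a child of <p>.
inner = soup.p.b.get_text()
print(inner)
```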

### Step 3: Searching and navigating through the parse tree

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in the nested structure which could be programmatically extracted. In our example, we are scraping a webpage consisting of some quotes. So, we would like to create a program to save those quotes (and all relevant information about them).
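
Before scraping the live page, the search calls can be tried on a small inline fragment. The markup below is a simplified, made-up imitation of a quote listing (the `quote-card` class, the paths, and the text are assumptions, not the site's real HTML), parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up markup that loosely imitates a quote listing.
SAMPLE_HTML = """
<div id="all_quotes">
  <div class="quote-card">
    <h5>OPTIMISM</h5>
    <a href="/quotes/1"><img src="/img/1.jpg" alt="Keep going. #Author One"></a>
  </div>
  <div class="quote-card">
    <h5>COURAGE</h5>
    <a href="/quotes/2"><img src="/img/2.jpg" alt="Be brave. #Author Two"></a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match, or None when nothing matches.
container = soup.find("div", attrs={"id": "all_quotes"})

# find_all() returns every matching descendant of the container.
cards = container.find_all("div", class_="quote-card")

quotes = []
for card in cards:
    quotes.append({
        "theme": card.h5.get_text(strip=True),  # text inside <h5>
        "url": card.a["href"],                  # attribute lookup by name
        "img": card.img["src"],
    })

print(quotes)
print(soup.find("div", attrs={"id": "missing"}))
```

Because `find()` returns `None` for a missing element, checking its result before chaining further calls avoids an `AttributeError` when a page's layout changes.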

In [65]:

!wget --continue https://raw.githubusercontent.com/jgamel/learn_n_dev/input_data/inspirational_quotes.csv -O /tmp/inspirational_quotes.csv

# Python program to read CSV file line by line
# import necessary packages
import csv

# Open file
with open('/tmp/inspirational_quotes.csv') as file_obj:
	
	# Create reader object by passing the file
	# object to reader method
	reader_obj = csv.reader(file_obj)
	
	# Iterate over each row in the csv
	# file using reader object
	for row in reader_obj:
		print(row)


--2022-05-05 14:24:45--  https://raw.githubusercontent.com/jgamel/learn_n_dev/input_data/inspirational_quotes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

['theme', 'url', 'img', 'lines', 'author']
['OPTIMISM', '/inspirational-quotes/8309-it-s-not-that-optimism-solves-all-of-life-s', 'https://assets.passiton.com/quotes/quote_artwork/8309/medium/20220302_wednesday_quote.jpg?1645735819', 'It’s not that optimism solves all of life’s problems; it is just that it can sometimes make the difference between coping and collapsing. ', '<Author:0x00005607a3fbacd8>']
['OPTIMISM', '/inspirational-quotes/4506-perpetual-optimism-is-a-force-multiplier', 'https://assets.passiton.com/quotes/quote_artw

In [68]:
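# Python program to scrape a website
# and save its quotes to a CSV file
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store quotes

# the <div> that wraps all quote cards; find() returns None if it is absent
table = soup.find('div', attrs={'id': 'all_quotes'})

# each quote card is a <div> whose class attribute matches this string;
# fall back to an empty list so the loop is skipped when the wrapper is missing
rows = table.find_all('div',
                      attrs={'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}) if table is not None else []

for row in rows:
    # the image alt text has the form "quote text #author"
    lines, _, author = row.img['alt'].partition(" #")
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = lines
    quote['author'] = author  # empty string if " #" is not present
    quotes.append(quote)

# write one CSV row per quote, preceded by a header row
filename = '/tmp/inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)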
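
Two details of the program above can be tried in isolation with the standard library alone. The sketch assumes an `alt` text of the form `quote #author` (the sample strings and the helper `split_alt` are made up for illustration) and uses `str.partition`, which, unlike indexing the result of `split`, does not raise when the separator is missing. It then writes the rows with `csv.DictWriter` to an in-memory buffer instead of a file and reads them back.

```python
import csv
import io

FIELDS = ["lines", "author"]

def split_alt(alt):
    """Split 'quote #author' at the first ' #'; author is '' if it is absent."""
    lines, _, author = alt.partition(" #")
    return {"lines": lines, "author": author}

# Made-up alt texts; the second has no ' #' separator.
rows = [split_alt("Stay curious. #Jane Doe"), split_alt("No author here")]

# Write to an in-memory buffer rather than /tmp so nothing touches disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the same data back as dictionaries keyed by the header row.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)
```

If the read-back list is empty while the header is present, the writer ran but no rows were collected, which points at the search step rather than the CSV step.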


In [69]:
# Python program to read CSV file line by line
# import necessary packages
import csv

# Open file
with open('/tmp/inspirational_quotes.csv') as file_obj:
	
	# Create reader object by passing the file
	# object to reader method
	reader_obj = csv.reader(file_obj)
	
	# Iterate over each row in the csv
	# file using reader object
	for row in reader_obj:
		print(row)


['theme', 'url', 'img', 'lines', 'author']
