In [1]:
import requests

response = requests.get('http://dataquestio.github.io/web-scraping-pages/simple.html')
content = response.content
print(content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


Web Scraping - The requests library fetches a web page. The content attribute of the response holds the raw page as bytes (note the b'...' prefix in the output), and it matches the HTML source of the page.
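A small sketch of the bytes-versus-string distinction. It reuses the byte string printed above instead of making a live request, so it runs offline. The UTF-8 encoding is an assumption; with a real response, requests also exposes a decoded string as response.text.

```python
# Raw bytes, copied from the output of the cell above.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n'
    b'        <title>A simple example page</title>\n    </head>\n'
    b'    <body>\n        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Decoding turns the bytes into a regular string (UTF-8 is assumed here).
html_text = content.decode('utf-8')

print(type(content).__name__)
print(type(html_text).__name__)
print(html_text)
```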

In [2]:
from bs4 import BeautifulSoup

parser = BeautifulSoup(content,'html.parser')
body = parser.body

p = body.p
print(p.text)

Here is some simple content for this page.


To extract individual pieces of content ('scraping' the page), we use the BeautifulSoup library. Here, BeautifulSoup is given 'html.parser', the parser that ships with Python, to parse the downloaded HTML.
'.text' is the property that returns the text contained in a tag.

In [3]:
head = parser.head
title = head.title
title_text = title.text
print(title_text)

A simple example page


Code to get the text of the title tag. If the tag you want is nested inside another tag, start at the outer tag and step through the inner tags until you reach the one you want.
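A minimal sketch of the same navigation written as one chained expression. The HTML is an inline copy of the simple example page, so no request is needed. The lookup of a table tag is only there to illustrate what happens when a tag is missing.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page, so no network request is needed.
html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

parser = BeautifulSoup(html, 'html.parser')

# Attribute access can be chained: each step returns the first
# matching tag found inside the current one.
title_text = parser.head.title.text
paragraph_text = parser.body.p.text
print(title_text)
print(paragraph_text)

# A tag that does not exist comes back as None, so check before using .text.
table = parser.body.table
print(table is None)
```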

In [4]:
#Using the find_all method to find all occurrences of a particular tag; it returns a list

body = parser.find_all('body')
p = body[0].find_all('p')
print(p[0].text)

Here is some simple content for this page.


In [5]:
#Similarly, retrieve the text within the title tag

head = parser.find_all('head')
title = head[0].find_all('title')
print(title[0].text)

A simple example page
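When only the first match is needed, BeautifulSoup's find method may be a shorter alternative to find_all(...)[0]. A small sketch on an inline one-line version of the same page:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
parser = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like collection of every match.
paragraphs = parser.find_all('p')
print(len(paragraphs))

# find returns only the first match, or None when nothing matches.
first_paragraph = parser.find('p')
missing = parser.find('table')
print(first_paragraph.text)
print(missing)
```

Indexing an empty find_all result with [0] raises an IndexError, while find returns None, so the two styles fail differently when a tag is absent.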


In [7]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

first_paragraph = parser.find_all('p',id='first')[0]
print(first_paragraph.text)


                First paragraph.
            


A web page is often divided into logical sections using the 'div' tag, and individual elements can carry an ID that is meant to be unique on the page. To pick out one specific element, we pass the extra 'id' argument to find_all.
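The output above keeps the newlines and indentation from the page source. A sketch of one way to trim them, using a hypothetical HTML snippet that is only assumed to resemble the fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, assumed to resemble the page fetched above.
html = """
<html>
    <body>
        <div>
            <p id="first">
                First paragraph.
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph.
            </b>
        </p>
    </body>
</html>
"""
parser = BeautifulSoup(html, 'html.parser')

# An ID is meant to be unique on a page, so find (first match) is enough.
first_paragraph = parser.find('p', id='first')

# .text keeps the whitespace from the source; get_text(strip=True) trims it.
raw_text = first_paragraph.text
first_text = first_paragraph.get_text(strip=True)
print(repr(raw_text))
print(first_text)
```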

In [8]:
second_paragraph = parser.find_all('p',id='second')[0]
second_paragraph_text = second_paragraph.text
print(second_paragraph_text)



                Second paragraph.
            



In [10]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

#Finding the first Inner Paragraph using the class_ parameter
first_inner_paragraph = parser.find_all('p',class_='inner-text')[0]
print(first_inner_paragraph.text)

#Finding the second Inner Paragraph using the class_ parameter
second_inner_paragraph = parser.find_all('p',class_='inner-text')[1]
second_inner_paragraph_text = second_inner_paragraph.text
print(second_inner_paragraph_text)

#Finding the first Outer Paragraph using the class_ parameter
first_outer_paragraph = parser.find_all('p',class_='outer-text')[0]
first_outer_paragraph_text = first_outer_paragraph.text
print(first_outer_paragraph_text)


                First paragraph.
            

                Second paragraph.
            


                First outer paragraph.
            



Elements can also be grouped into classes. Unlike IDs, class names are not unique within a page; elements that share a class are usually related.
One element can belong to multiple classes.
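A short sketch of an element that belongs to two classes. The HTML is hypothetical, and the class name 'first-item' is borrowed from the next cell for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two of the paragraphs carry two classes each.
html = """
<div>
    <p class="inner-text first-item">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item">First outer paragraph.</p>
"""
parser = BeautifulSoup(html, 'html.parser')

# The class attribute is exposed as a list of class names.
first = parser.find('p')
print(first['class'])

# class_ matches an element if any one of its classes equals the given name.
inner = parser.find_all('p', class_='inner-text')
first_items = parser.find_all('p', class_='first-item')
inner_texts = [p.text for p in inner]
first_item_texts = [p.text for p in first_items]
print(inner_texts)
print(first_item_texts)
```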

In [11]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

#Select all elements with the 'first-item' class.
first_items = parser.select('.first-item')
print(first_items[0].text)

#Select all elements with the 'outer-text' class.
outer_text = parser.select('.outer-text')
first_outer_text = outer_text[0].text
print(first_outer_text)

#Select all elements with the id = second
second_id = parser.select('#second')
second_text = second_id[0].text
print(second_text)


                First paragraph.
            


                First outer paragraph.
            



                First outer paragraph.
            



CSS selectors are the patterns a stylesheet uses to target elements for styling: '.name' matches a class and '#name' matches an ID. We can pass the same selectors to the .select method, which returns a list of the matching elements.
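A sketch of a few selector forms on a hypothetical HTML snippet (assumed to be similar in shape to the fetched page). It also shows select_one, which returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with both IDs and classes.
html = """
<div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
"""
parser = BeautifulSoup(html, 'html.parser')

# '.name' selects by class, '#name' selects by ID.
outer_texts = [p.text for p in parser.select('.outer-text')]
second_text = parser.select('#second')[0].text

# Selectors can be combined: p tags that have both classes.
both_texts = [p.text for p in parser.select('p.inner-text.first-item')]

# select_one returns the first match directly (or None if there is none).
first_text = parser.select_one('#first').text

print(outer_texts)
print(second_text)
print(both_texts)
print(first_text)
```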

In [14]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)

total_plays = parser.select('#total-plays')[0]
patriot_total_plays = total_plays.select('td')[2]
patriot_total_plays_count = patriot_total_plays.text
print(patriot_total_plays_count)

total_yards = parser.select('#total-yards')[0]
seahawks_total_yards = total_yards.select('td')[1]
seahawks_total_yards_count = seahawks_total_yards.text
print(seahawks_total_yards_count)

1
72
396


We can nest CSS selectors to reach the data more directly: select the row by its ID, then select the 'td' cells inside that row and index the one we want.
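The two-step lookup can also be written as a single nested selector, where a space means "inside". A minimal sketch on a hypothetical table with made-up numbers; only the shape (rows with IDs, one td per column) is assumed to match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stats table with made-up numbers and team names.
html = """
<table>
    <tr><th>Stat</th><th>Team A</th><th>Team B</th></tr>
    <tr id="turnovers"><td>Turnovers</td><td>3</td><td>4</td></tr>
    <tr id="total-plays"><td>Total plays</td><td>50</td><td>60</td></tr>
</table>
"""
parser = BeautifulSoup(html, 'html.parser')

# Two-step version: select the row, then the cells inside it.
row = parser.select('#turnovers')[0]
team_a_two_step = row.select('td')[1].text

# Nested version: a space means "td anywhere inside #turnovers".
team_a_nested = parser.select('#turnovers td')[1].text

# Same idea for the third cell of another row.
team_b_plays = parser.select('#total-plays td')[2].text

print(team_a_two_step, team_a_nested, team_b_plays)
```

Both versions index the cells by position, so they depend on the column order of the table staying the same.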