# Practical: 7 

`Ishika Tailor 180280116118`

**Aim: Explore the web scraping library BeautifulSoup. Provide a suitable web scraping example for each of the following functions.**

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

In [1]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [2]:
from bs4 import BeautifulSoup

# prettify()

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are some simple ways to navigate that data structure:

# find_all(), find()

In [4]:
print("soup.title:",soup.title)

print("soup.title.name:",soup.title.name)

print("soup.title.string:",soup.title.string)

print("soup.title.parent.name:",soup.title.parent.name)

print("soup.p:",soup.p)

print("soup.p['class']:",soup.p['class'])

print("soup.a:",soup.a)

print("soup.find_all('a'):",soup.find_all('a'))

print("soup.find(id=\"link3\"):",soup.find(id="link3"))

soup.title: <title>The Dormouse's story</title>
soup.title.name: title
soup.title.string: The Dormouse's story
soup.title.parent.name: head
soup.p: <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']: ['title']
soup.a: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a'): [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3"): <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
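
The difference between the two search methods can be sketched as follows: find() returns the first match (or None), while find_all() always returns a list. This is a minimal sketch; the names `snippet` and `demo` are hypothetical and the markup is a reduced copy of the example document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, reduced from the example document above.
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_link = demo.find("a")
missing = demo.find("table")

# find_all() always returns a list-like result, possibly empty.
sisters = demo.find_all("a", class_="sister")  # filter by CSS class
first_two = demo.find_all("a", limit=2)        # stop after two matches
no_tables = demo.find_all("table")

print(first_link["id"])              # link1
print(missing)                       # None
print([a.string for a in sisters])   # ['Elsie', 'Lacie', 'Tillie']
print(len(first_two), len(no_tables))  # 2 0
```

Because find() can return None, it is usually worth checking the result before indexing into it.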


One common task is extracting all the URLs found within a page’s <a> tags:

In [5]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


Another common task is extracting all the text from a page:

#  get_text()

In [6]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
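
The raw get_text() output keeps the whitespace of the source markup. As a small sketch (the markup and the names `demo`, `raw`, `spaced`, `piped` are made up for illustration), the `separator` and `strip` arguments give more control over the result:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with uneven whitespace around the text.
demo = BeautifulSoup("<div><p> Hello </p><p>world<b> !</b></p></div>", "html.parser")

# Default: all strings concatenated exactly as they appear.
raw = demo.get_text()

# strip=True trims each string; the first argument joins the pieces.
spaced = demo.get_text(" ", strip=True)
piped = demo.get_text("|", strip=True)

# .stripped_strings yields the same trimmed pieces one by one.
pieces = list(demo.stripped_strings)

print(repr(raw))     # ' Hello world !'
print(repr(spaced))  # 'Hello world !'
print(repr(piped))   # 'Hello|world|!'
print(pieces)        # ['Hello', 'world', '!']
```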



The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object, so it supports most of the navigation and search methods (see the “Navigating the tree” and “Searching the tree” sections of the Beautiful Soup documentation).

You can also pass a BeautifulSoup object into one of the tree-modifying methods (the “Modifying the tree” section of the documentation), just as you would a Tag. This lets you do things like combine two parsed documents:

#  replace_with

In [18]:
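html1="""<document><content/>INSERT FOOTER HERE</document>"""
html2="""<footer>Here's the footer</footer>"""
# The "xml" parser requires the lxml package to be installed.
doc = BeautifulSoup(html1, "xml")
footer = BeautifulSoup(html2, "xml")
# replace_with() swaps the matched string for the footer document
# and returns the string that was replaced.
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
# 'INSERT FOOTER HERE'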
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
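
replace_with() is not limited to strings or whole documents. The sketch below, which uses made-up markup and the hypothetical names `demo`, `new_tag` and `old_tag`, swaps one tag for a freshly created one using only the built-in html.parser:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>See <b>this</b> page.</p>", "html.parser")

# Build the replacement tag with new_tag() and give it some text.
new_tag = demo.new_tag("em")
new_tag.string = "that"

# replace_with() returns the element that was taken out of the tree.
old_tag = demo.b.replace_with(new_tag)

print(demo)            # <p>See <em>that</em> page.</p>
print(old_tag)         # <b>this</b>
print(old_tag.parent)  # None
```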


You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

#  .next_sibling , .previous_sibling

In [19]:
html3="""<a><b>text1</b><c>text2</c></a>"""
sibling_soup = BeautifulSoup(html3, 'html.parser')
print(sibling_soup.prettify())

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>


In [20]:
print(sibling_soup.b.next_sibling)
# <c>text2</c>

print(sibling_soup.c.previous_sibling)
# <b>text1</b>


<c>text2</c>
<b>text1</b>


In [21]:
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

None
None


In [23]:
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
link=soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [24]:
link.next_sibling

',\n'

In [25]:
link.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
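
As the cells above show, .next_sibling often lands on a whitespace or punctuation string rather than on the next tag. One way around this, sketched here with reduced markup and hypothetical variable names, is to use find_next_sibling() and find_previous_sibling(), which accept the same filters as find():

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, reduced from the example document.
snippet = ('<p><a id="link1">Elsie</a>,\n'
           '<a id="link2">Lacie</a> and\n'
           '<a id="link3">Tillie</a>;</p>')
demo = BeautifulSoup(snippet, "html.parser")
first = demo.a

# The immediate sibling is the text between the two links.
raw_next = first.next_sibling

# find_next_sibling() skips ahead to the first sibling matching the filter.
next_link = first.find_next_sibling("a")
later_links = first.find_next_siblings("a")
prev_link = demo.find(id="link3").find_previous_sibling("a")

print(repr(raw_next))                   # ',\n'
print(next_link["id"])                  # link2
print([a["id"] for a in later_links])   # ['link2', 'link3']
print(prev_link["id"])                  # link2
```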

#  .next_siblings , .previous_siblings

We can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

In [7]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [8]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
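
The sibling generators yield strings as well as tags. If only the tags are wanted, one possible approach (a sketch with made-up markup and hypothetical names) is to filter by type:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

demo = BeautifulSoup("<ul><li>one</li> <li>two</li> <li>three</li></ul>", "html.parser")
first_item = demo.li

# Keep only real tags, dropping the whitespace strings in between.
tag_siblings = [s for s in first_item.next_siblings if isinstance(s, Tag)]

# The strings are still there if they are needed.
text_siblings = [s for s in first_item.next_siblings if isinstance(s, NavigableString)]

print([t.string for t in tag_siblings])  # ['two', 'three']
print(len(text_siblings))                # 2
```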


#  .next_element , .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

In [9]:
last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# ';\nand they lived at the bottom of a well.'

';\nand they lived at the bottom of a well.'

In [10]:
# .next_element of a tag is the first thing parsed inside it: here, the link's text
last_a_tag.next_element

'Tillie'

In [11]:
last_a_tag.previous_element
# ' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
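
To make the parse order concrete, the following sketch walks .next_elements from a link and records what it meets. The markup and the names `demo`, `link` and `order` are assumptions made for illustration:

```python
from bs4 import BeautifulSoup, Tag

demo = BeautifulSoup("<p><a>Tillie</a>; the end</p><p>...</p>", "html.parser")
link = demo.a

# next_sibling stays on the same level of the tree ...
sibling = link.next_sibling

# ... while next_elements follows document order, descending into tags.
order = [e.name if isinstance(e, Tag) else str(e) for e in link.next_elements]

print(repr(sibling))  # '; the end'
print(order)          # ['Tillie', '; the end', 'p', '...']
```

The first entry is the link's own text, which is why .next_element and .next_sibling differ for any tag that has contents.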

#  find_parents() and find_parent()

In [12]:
a_string = soup.find(string="Lacie")
a_string
# 'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [13]:
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []

[]
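
With nested tags of the same name, the two methods behave differently: find_parent() stops at the nearest match, while find_parents() collects every match from the inside out. A minimal sketch with hypothetical markup and names:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<div id="outer"><div id="inner"><span>hi</span></div></div>',
    "html.parser",
)
text = demo.find(string="hi")

# Nearest enclosing <div> only.
nearest = text.find_parent("div")

# Every enclosing <div>, innermost first.
all_divs = text.find_parents("div")

# No match gives None (find_parent) or an empty list (find_parents).
no_table = text.find_parent("table")

print(nearest["id"])                 # inner
print([d["id"] for d in all_divs])   # ['inner', 'outer']
print(no_table)                      # None
```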

#  clear()

Tag.clear() removes the contents of a tag:

In [14]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

tag.clear()
tag

<a href="http://example.com/"></a>
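
Note that clear() empties the tag but leaves the tag itself and its attributes in place, so it can be refilled afterwards. A short sketch, assuming made-up markup and the hypothetical names `menu` and `new_item`:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<ul id="menu"><li>Home</li><li>About</li></ul>', "html.parser")
menu = demo.ul

# Remove all children; the <ul> and its id attribute survive.
menu.clear()
emptied = str(menu)

# The emptied tag can be filled with new content.
new_item = demo.new_tag("li")
new_item.string = "Contact"
menu.append(new_item)

print(emptied)  # <ul id="menu"></ul>
print(menu)     # <ul id="menu"><li>Contact</li></ul>
```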

#  extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

In [15]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

# extract() removes the <i> tag from the tree and returns it
i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)

None


In [16]:
my_string = i_tag.string.extract()
my_string
# 'example.com'

print(my_string.parent)
# None
i_tag 
# <i></i>


None


<i></i>
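
Because extract() hands back the removed element intact, it can be reused elsewhere. The sketch below moves a paragraph from one container to another; the markup and the names `demo` and `note` are hypothetical:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<div id="a"><p>note</p></div><div id="b"></div>', "html.parser")

# Take the paragraph out of the first <div> ...
note = demo.p.extract()
detached_parent = note.parent

# ... and attach it to the second one.
demo.find(id="b").append(note)

print(detached_parent)  # None
print(demo)             # <div id="a"></div><div id="b"><p>note</p></div>
```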

# decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

In [17]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i

i_tag.decompose()
a_tag

<a href="http://example.com/">I linked to </a>
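
A typical use of decompose() in scraping is discarding tags that are never needed again, such as scripts and styles, before reading the text. This is one possible sketch with made-up markup and hypothetical names:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<div><style>p{color:red}</style><p>Visible text</p><script>track()</script></div>",
    "html.parser",
)

# find_all() accepts a list of tag names; each match is destroyed in turn.
for junk in demo.find_all(["script", "style"]):
    junk.decompose()

cleaned_markup = str(demo)
cleaned_text = demo.get_text()

print(cleaned_markup)  # <div><p>Visible text</p></div>
print(cleaned_text)    # Visible text
```

Unlike extract(), the decomposed tags should not be used afterwards.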

# wrap() and unwrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

In [18]:
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))

<div><p><b>I wish I was bold.</b></p></div>
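
wrap() can also be applied in a loop. As a sketch (the markup, file names and variable names are assumptions), each image below gets its own `<figure>` wrapper; a new wrapper tag is created per element because a tag can only live in one place in the tree:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<div><img src="a.png"/><img src="b.png"/></div>', "html.parser")

wrappers = []
for img in demo.find_all("img"):
    # wrap() returns the new wrapper tag.
    wrappers.append(img.wrap(demo.new_tag("figure")))

print([w.name for w in wrappers])  # ['figure', 'figure']
print(demo)
# <div><figure><img src="a.png"/></figure><figure><img src="b.png"/></figure></div>
```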

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup:

In [19]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

a_tag.i.unwrap()
a_tag

<a href="http://example.com/">I linked to example.com</a>
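
To strip several kinds of inline markup at once, unwrap() can be combined with find_all(). A minimal sketch, using made-up markup and hypothetical names:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<p>Some <b>bold</b> and <i>italic <b>nested</b></i> text.</p>",
    "html.parser",
)

# Collect the tags first, then replace each one with its own contents.
for tag in demo.find_all(["b", "i"]):
    tag.unwrap()

print(demo)             # <p>Some bold and italic nested text.</p>
print(demo.get_text())  # Some bold and italic nested text.
```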