# Web Scraping with BeautifulSoup

## Web Scraping

Web scraping is a software technique for automatically extracting information from websites.
It focuses on transforming unstructured data on the web, typically in HTML format, into structured data such as a database or spreadsheet.

Web scraping can be done in various ways, including Google Docs, Python, PHP, R, Java, and so on.

However, Python is widely used for extracting information from the web because it is easy to use and has a rich ecosystem.

### What Is Web Scraping?
You often find yourself in situations where you need to extract data from the web.

For example:

* To compare the market price of a product before launching a promotion for online sales.
* To build data products and services using target industry websites, such as those of the hotel or airline industry.
* To find out which movies are coming up this week.
* To learn about new techniques in the market.
* To support daily activities such as online shopping or finding restaurants and other places of interest.
* To find reviews on the internet and to predict or interpret results from existing data.

### Web Scraping Process: Basic Preparation
There are two basic things to consider before setting up the web scraping process:

1. Understanding the target data on the Internet.
2. Finalizing the list of websites.

Once you have understood the target data and finalized the list of websites, you need to design the web scraping process. 

The steps involved in a typical web scraping process are as follows:

* Step 1: A web request is sent to the targeted website to collect the required data.

* Step 2: The information is retrieved from the targeted website in HTML or XML format.

* Step 3: The retrieved information is passed to a parser chosen according to the data format.

     __Parsing is a technique to read data and extract information from the available document.__

* Step 4: The parsed data is stored in the desired format. You can follow the same process to scrape another targeted website.
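
The four steps above can be sketched in Python as follows. This is a minimal sketch rather than a complete scraper: the function names (`fetch_html`, `parse_headings`, `to_csv`) and the sample markup are hypothetical, and the standard-library `urllib.request` is used here only as a stand-in for whichever HTTP client you prefer.

```python
import csv
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1 and 2: send the web request and return the response body."""
    with urlopen(url) as response:
        return response.read()


def parse_headings(html):
    """Step 3: parse the markup and collect the text of every <h2> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


def to_csv(headings):
    """Step 4: store the parsed values in a structured format (CSV text here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["heading"])
    for heading in headings:
        writer.writerow([heading])
    return buffer.getvalue()


# Hypothetical sample markup, used instead of a live request
sample_html = "<html><body><h2>London</h2><h2>Paris</h2></body></html>"
headings = parse_headings(sample_html)
csv_text = to_csv(headings)
print(csv_text)
```

`fetch_html` is defined but not called, so the sketch runs without network access. To scrape a live page, pass its return value to `parse_headings`.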

### Web Scraping Software vs. Web Browser
Web scraping software interacts with websites in the same way as your web browser.

A web scraper is used to extract information from the web in a routine, automated manner.

A web browser displays the data, whereas web scraping software saves the data from the web page to a local file or database.

### Web Scraping Considerations
As part of the web scraping process, we might reach out to several websites to extract information in an automated way. So, reading and understanding the legal information and the terms and conditions stated on each website is very important.

The targeted website can fall into several categories. It may belong to a registered company and carry:

* Legal constraints
* Notices
* Copyright
* Trademarked material
* Patented information

None of this material can be used without the owner's permission.

## Web Scraping Tool: BeautifulSoup
BeautifulSoup is an easy, intuitive, and robust Python library designed for web scraping.

### Features of BeautifulSoup
* An efficient tool for dissecting documents and extracting information from web pages.
* Has a powerful set of built-in methods for navigating, searching, and modifying a parse tree.
* Works with parsers that support both HTML and XML documents.
* Converts all incoming documents to Unicode automatically.
* Converts all outgoing documents to UTF-8 automatically.

### Common Data/Page Formats on the Web
1. __HTML:__ HTML pages are one of the oldest, easiest, and most popular ways to publish information on the web.
2. __HTML5:__ HTML5 is a newer HTML standard that gained popularity with mobile devices.
3. __XML:__ XML is another popular way to publish information on the web.
4. __CSS:__ CSS (Cascading Style Sheets) is mainly used for the consistent presentation of data.
5. __APIs:__ Application Programming Interfaces, or APIs, have become a common way to extract information from the web.
6. __PDF:__ PDF is also widely used to publish information and reports.
7. __JSON:__ JavaScript Object Notation, or JSON, is a lightweight and popular format used for information exchange on the web.

## Parser
What is a parser?
How does it help data scientists in the web scraping process?

A parser is a basic tool to interpret or render information from a web document. A parser receives input in the form of program instructions, interactive commands, and markup tags, and outputs the web document as objects with methods and attributes. This enables us to extract the information in a meaningful way.

![image.png](attachment:image.png)

A Parser is also used to validate the input information before processing it. 

### Importance of Parsing
Parsing data is one of the most important steps in the web scraping process.  The extracted file can be understood and stored in the desired format only if it is parsed successfully.

Note: Failing to parse the data would eventually lead to a failure of the entire process.

### Various Parsers
Various parsers supported by BeautifulSoup are:

* __html.parser:__ Python's built-in HTML parser. It is written in Python, fast, and lenient.
* __lxml html:__ The lxml HTML parser is not written in pure Python; it depends on C. It is fast and lenient.
* __lxml xml:__ The lxml XML parser is the only XML parser supported, and it also depends on C.
* __html5lib:__ html5lib is another Python-based parser. It is slow, but it creates valid HTML5.
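
Because lxml and html5lib are separate installs, it can help to fall back to the built-in parser when the preferred one is missing. The sketch below shows one way to do that; the helper name `make_soup` is hypothetical, while `FeatureNotFound` is the exception BeautifulSoup raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from the same invalid markup, so it is worth naming the parser explicitly, as the demos in this document do.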


## Importance of Objects
Once a web document is parsed using the appropriate parser, it is transformed into a tree of objects. So, a tree can be defined as a collection of simple and complex objects.

Objects are used to extract the required information from the tree structure by searching or navigating through the parsed document. The relationships between the objects also make it possible to extract information faster and more efficiently.
![image.png](attachment:image.png)

### Types of Objects
BeautifulSoup transforms a complex HTML document into a complex tree of Python objects. There are four types of objects. They are:

* __Tag:__ A tag object is an XML or HTML tag in the web document. Tags have a lot of attributes and methods.
* __NavigableString:__ A NavigableString is a string or set of characters that correspond to the text present within a tag.
* __BeautifulSoup:__ A BeautifulSoup object represents the entire web document and supports navigating and searching the document tree.
* __Comment:__ A Comment represents the comment or information section of the document. It is a special type of NavigableString.
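
The four object types can be seen side by side in a short sketch. The markup below is a made-up example; note that it uses a real HTML comment (`<!-- ... -->`), which is what produces a `Comment` object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup containing a tag with text and a real HTML comment
markup = '<p class="test">My first paragraph.</p><b><!--This is a comment--></b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p              # Tag
text = soup.p.string      # NavigableString
comment = soup.b.string   # Comment (a special kind of NavigableString)

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Since `Comment` is a subclass of `NavigableString`, use `isinstance(obj, Comment)` when comments need to be told apart from ordinary text.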

### Demo - 01: Parsing a web document and extracting data using objects
Demonstrate how to scrape a web document, parse it, and use objects to extract information.


In [1]:
#import BeautifulSoup from the bs4 library
from bs4 import BeautifulSoup

#create a html document
html_doc="""<html>
                <body>
                    <h1>My First Heading</h1>
                    <b>!--This is a commnet line--</b>
                    <p title = "About me" class="test">My first paragraph.</p>
                    <div class = "Cities">
                    <h2>London</h2>
                    </div>
                </body>
            </html>"""

#parse it using html parser
Soup = BeautifulSoup(html_doc,'html.parser')

#View the Soup Type
type(Soup)

bs4.BeautifulSoup

In [2]:
#view the soup object
print(Soup)

<html>
<body>
<h1>My First Heading</h1>
<b>!--This is a commnet line--</b>
<p class="test" title="About me">My first paragraph.</p>
<div class="Cities">
<h2>London</h2>
</div>
</body>
</html>


In [3]:
#create a tag object
tag = Soup.p

#View the tag type
type(tag)

bs4.element.Tag

In [4]:
#print the tag
print(tag)

<p class="test" title="About me">My first paragraph.</p>


In [5]:
#Get the string inside the <b> tag
#(the markup is not a real <!-- ... --> comment, so this is a NavigableString, not a Comment)
comment = Soup.b.string

#View the object type
type(comment)

bs4.element.NavigableString

In [6]:
#View the string value
comment

'!--This is a commnet line--'

In [7]:
#View the tag attributes
tag.attrs

{'title': 'About me', 'class': ['test']}

In [8]:
#view the tag value
tag.string

'My first paragraph.'

In [9]:
#view the type of the tag's string
type(tag.string)

bs4.element.NavigableString

## Understanding the tree
![Undettree1.PNG](attachment:Undettree1.PNG)

![Undettree2.PNG](attachment:Undettree2.PNG)

![Undettree3.PNG](attachment:Undettree3.PNG)

### Various Operations
#### Searching Tree: Filters
With the help of search filters, you can extract specific information from the parsed document. The filters can be treated as search criteria for extracting information based on the elements present in the document.

There are various kinds of filters used for searching information from a tree.
* __String:__ A string is the simplest filter. BeautifulSoup performs a match against that exact string, for example a tag name.
* __Regular Expressions:__ A regular expression filter matches elements against the pattern given as the search criterion.
* __List:__ A list filter matches elements against any item in the list.
* __Function:__ A function filter takes an element as its only argument and returns True for the elements that match.
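
The four filter types listed above can be compared on one small document. This is an illustrative sketch: the markup and the helper `has_id` are made up for the example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for the example
html = "<html><body><h1>Title</h1><p id='intro'>one</p><p>two</p><b>bold</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# String filter: tags named exactly "p"
by_string = soup.find_all("p")

# Regular expression filter: tags whose name starts with "b"
by_regex = soup.find_all(re.compile("^b"))

# List filter: tags named "h1" or "b"
by_list = soup.find_all(["h1", "b"])


# Function filter: tags that carry an id attribute
def has_id(tag):
    return tag.has_attr("id")


by_function = soup.find_all(has_id)

print([t.name for t in by_string])    # ['p', 'p']
print([t.name for t in by_regex])     # ['body', 'b']
print([t.name for t in by_list])      # ['h1', 'b']
print([t.name for t in by_function])  # ['p']
```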

#### Searching the Tree: find_all()
BeautifulSoup defines many methods for searching the parse tree, and they all work in a similar way.

The most popular methods for searching the parse tree are find_all() and find().

##### Searching the tree with find_all()
The find_all() method looks through a tag's descendants and retrieves all of them that match your filters.

The syntax for find_all():
![tree1.PNG](attachment:tree1.PNG)

##### Searching the tree with find()
The find_all() method scans the entire document looking for results.

To find one result, use find(). The find() method has a syntax similar to that of the find_all() method; however, there are some key differences. 
![tree2.PNG](attachment:tree2.PNG)
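
The key differences between find() and find_all() show up in what they return, especially when nothing matches. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the example
soup = BeautifulSoup("<ul><li>Jack</li><li>Kelly</li></ul>", "html.parser")

all_items = soup.find_all("li")           # list-like ResultSet of every match
first_item = soup.find("li")              # only the first match
limited = soup.find_all("li", limit=1)    # still a list, but with one element

missing_one = soup.find("table")          # None when nothing matches
missing_all = soup.find_all("table")      # empty list when nothing matches

print(len(all_items), first_item.string, len(limited))
print(missing_one, missing_all)
```

Because find() returns None for a missing tag, check the result before chaining attribute access such as `.string` onto it.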

#### Searching the Tree with Other Methods
Searching the parse tree can also be performed by various other methods such as:
![tree3.PNG](attachment:tree3.PNG)

#### Demo - 02: Searching in a tree with filters
Demonstrate the ways to search in a tree using filters.

In [10]:
#import the required library
from bs4 import BeautifulSoup
#import the web scraping example html file
HTMLfilepath = "D:\\NIPUN_SC_REC\\3_Practice_Project\\Course_5_Data Science with Python\\Lesson_recap\\web_scrapping_example.html"

with open (HTMLfilepath,'r') as Organization:
    Soup = BeautifulSoup(Organization,"lxml")
    
#view the content of the soup object
Soup.contents

[<html>
 <head>
 <title>Web Scrapping Demo</title>
 </head>
 <body>
 <div class="Organizationlist">
 <ul id="HR">
 <li class="HRmanager">
 <div class="name">Jack</div>
 <div class="ID">101</div>
 </li>
 <li class="HRmanager">
 <div class="name">Kelly</div>
 <div class="ID">103</div>
 </li>
 </ul>
 <ul id="IT">
 <li class="ITmanager">
 <div class="name">Daren</div>
 <div class="ID">65</div>
 </li>
 </ul>
 <ul id="Finance">
 <li class="GEmanager">
 <div class="name">Sammy</div>
 <div class="ID">007</div>
 </li>
 <li class="AccManger">
 <div class="name">Joseph</div>
 <div class="ID">097</div>
 </li>
 </ul>
 </div>
 </body>
 </html>]

In [11]:
#searching using find() methods
tag_li = Soup.find("li")

#print the tag type
print(type(tag_li))

<class 'bs4.element.Tag'>


In [12]:
#print the tag
print(tag_li)

<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>


In [13]:
#search the document using find() method for ID
find_id = Soup.find(id="HR")

#print the find_id object
print(find_id)

<ul id="HR">
<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>
<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>
</ul>


In [14]:
#print the string value
print(find_id.li.div.string)

Jack


In [15]:
#search using string only
search_for_stringonly = Soup.find_all(string=["Kelly","Jack"])

#print the search results
print(search_for_stringonly)

['Jack', 'Kelly']


In [16]:
#Search based on CSS class name (passed as a dictionary of attributes)
CSS_class_search = Soup.find(attrs={"class": "ITmanager"})
print(CSS_class_search)

<li class="ITmanager">
<div class="name">Daren</div>
<div class="ID">65</div>
</li>


In [17]:
#create a filter function that matches the tag whose id attribute is "Finance"
def is_account_manager(tag):
    return tag.has_attr("id") and tag.get("id")=="Finance"

#Search the document using Function and Print it
account_manager = Soup.find(is_account_manager)
print(account_manager.li.div.string)

Sammy


In [18]:
#print tag name using True - which returns all the tags present in the document
for tag in Soup.find_all(True):
    print(tag.name)

html
head
title
body
div
ul
li
div
div
li
div
div
ul
li
div
div
ul
li
div
div
li
div
div


In [19]:
#Search using find_all() method for the given class
find_class = Soup.find_all(class_='HRmanager')

#View the type of the class
type(find_class)

bs4.element.ResultSet

In [20]:
#print the first result
print(find_class[0])

<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>


In [21]:
#print the second result
print(find_class[1])

<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>


In [22]:
#find the parents using find parent method
find_class = find_class[0]
find_parent = find_class.find_parent("ul")
print(find_parent)

<ul id="HR">
<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>
<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>
</ul>


In [23]:
#now use find method to search based on the id
Org = Soup.find(id="IT")

#find the search object
print(Org)

<ul id="IT">
<li class="ITmanager">
<div class="name">Daren</div>
<div class="ID">65</div>
</li>
</ul>


In [24]:
#find the next siblings
next_siblings = Org.find_next_siblings()
print(next_siblings)

#print the names of the parents (find_parents is a method, so it must be called)
parents = Org.find_parents()
print([parent.name for parent in parents])

[<ul id="Finance">
<li class="GEmanager">
<div class="name">Sammy</div>
<div class="ID">007</div>
</li>
<li class="AccManger">
<div class="name">Joseph</div>
<div class="ID">097</div>
</li>
</ul>]
['div', 'body', 'html', '[document]']


In [25]:
#find and print all previous elements
all_previous = Org.find_all_previous()
print(all_previous)

[<div class="ID">103</div>, <div class="name">Kelly</div>, <li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>, <div class="ID">101</div>, <div class="name">Jack</div>, <li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>, <ul id="HR">
<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>
<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>
</ul>, <div class="Organizationlist">
<ul id="HR">
<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>
<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>
</ul>
<ul id="IT">
<li class="ITmanager">
<div class="name">Daren</div>
<div class="ID">65</div>
</li>
</ul>
<ul id="Finance">
<li class="GEmanager">
<div class="name">Sammy</div>
<div class="ID">007</div>
</li>
<li class="AccManger">
<div class="name">Joseph</div>
<div class="ID">097</div>
</li>
</ul>
</div>,

In [26]:
#search and print the previous sibling
previous_sibiling = Org.find_previous_sibling()
print(previous_sibiling)

<ul id="HR">
<li class="HRmanager">
<div class="name">Jack</div>
<div class="ID">101</div>
</li>
<li class="HRmanager">
<div class="name">Kelly</div>
<div class="ID">103</div>
</li>
</ul>


In [27]:
#search and print all the elements that come next
all_next = Org.find_all_next()
print(all_next)

[<li class="ITmanager">
<div class="name">Daren</div>
<div class="ID">65</div>
</li>, <div class="name">Daren</div>, <div class="ID">65</div>, <ul id="Finance">
<li class="GEmanager">
<div class="name">Sammy</div>
<div class="ID">007</div>
</li>
<li class="AccManger">
<div class="name">Joseph</div>
<div class="ID">097</div>
</li>
</ul>, <li class="GEmanager">
<div class="name">Sammy</div>
<div class="ID">007</div>
</li>, <div class="name">Sammy</div>, <div class="ID">007</div>, <li class="AccManger">
<div class="name">Joseph</div>
<div class="ID">097</div>
</li>, <div class="name">Joseph</div>, <div class="ID">097</div>]


In [28]:
#Use regular expression to search the document
import re
email_example ="""<br/>
<p>My email id is </p>
abc@example.com"""
soup_email =BeautifulSoup(email_example,"lxml")

#Use compile method to build the regular expression (raw string avoids invalid escape warnings)
emailid_regexp = re.compile(r"\w+@\w+\.\w+")

#find and print the email id using regular expression
email_id = soup_email.find(string=emailid_regexp)
print(email_id)


abc@example.com


### Navigating Options
With the help of BeautifulSoup, it is easy to navigate the parse tree based on the need.

There are four options to navigate the tree. They are:
1. __Navigating Down:__ 

This technique shows you how to extract information from children tags.

Following are the attributes used to navigate down:

    * .contents and .children
    * .descendants
    * .string
    * .strings and .stripped_strings

2. __Navigating Up:__ 

Every tag has a parent. Two attributes help navigate up the family tree:

    * .parent and .parents

3. __Navigating Sideways:__ 

This technique shows you how to extract information from the same level in the tree. The attributes used to navigate sideways are:
    
    * .next_sibling and .previous_sibling

4. __Navigating Back and Forth:__ 

This technique shows you how to parse the tree back and forth. The attributes used to navigate back and forth are:
    
    * .next_element and .previous_element
    * .next_elements and .previous_elements
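
The four navigation directions can be tried on a very small tree. This is a minimal sketch on made-up markup, written without whitespace between tags so that siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <div> with two <p> children
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first = soup.p

# Navigating down
child_names = [child.name for child in soup.div.children]
texts = list(soup.div.stripped_strings)

# Navigating up
parent_name = first.parent.name

# Navigating sideways
second = first.next_sibling
back_to_first = second.previous_sibling

# Navigating back and forth: the next element parsed after <p> is its own string
after_first = first.next_element

print(child_names, texts, parent_name)
print(second, after_first)
```

In real pages the markup usually contains newlines between tags, so `.next_sibling` often returns a whitespace string first, as Demo - 03 shows.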
    
#### Demo - 03: Navigating a tree
Demonstrate how to navigate the web tree using various techniques.

In [29]:
#import the required library
from bs4 import BeautifulSoup
#Create html document
book_html_doc = """<catalog>
<head><title>The web book catalog </title><head>
<p class ="title"> <b> The Book Catalog </b></p>
<books>
    <book id = "bk001">
        <author>Hightower, Kim</author>
        <title> The first Book</title>
            <genre>Fictional</genre>
            <price>44.95</price>
            <pub_date>2000-10-01</pub_date>
            <review>An amazing story of nothing</review>
    </book>
    <book id = "bk002">
        <author>Nagata, Susanne</author>
        <title>Becoming somebody</title>
            <genre>Biography</genre>
            <review>A master piece on the fine art of gossiping</review>
    </book>
    <book id = "bk003">
        <author>Obey, Bruce</author>
        <title> The Poet's first poem</title>
            <genre>Poem</genre>
            <price>24.95</price>
            <review>The last poetic poem of the decade</review>
    </book>
</books>
</catalog>"""

#create a Soup object
Book_Soup = BeautifulSoup(book_html_doc,'html.parser')
#print the catalog tag
print(Book_Soup)

<catalog>
<head><title>The web book catalog </title><head>
<p class="title"> <b> The Book Catalog </b></p>
<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>
<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>
<book id="bk003">
<author>Obey, Bruce</author>
<title> The Poet's first poem</title>
<genre>Poem</genre>
<price>24.95</price>
<review>The last poetic poem of the decade</review>
</book>
</books>
</head></head></catalog>


In [30]:
#View the head of the book_html_doc
Book_Soup.head

<head><title>The web book catalog </title><head>
<p class="title"> <b> The Book Catalog </b></p>
<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>
<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>
<book id="bk003">
<author>Obey, Bruce</author>
<title> The Poet's first poem</title>
<genre>Poem</genre>
<price>24.95</price>
<review>The last poetic poem of the decade</review>
</book>
</books>
</head></head>

In [31]:
#View the title of the book_html_doc
title_tag = Book_Soup.title
title_tag

<title>The web book catalog </title>

In [32]:
#print the paragraph tag (which contains the bold tag) of the catalog
print(Book_Soup.catalog.p)

<p class="title"> <b> The Book Catalog </b></p>


In [33]:
#Navigate down the descendants and print them
for descen in Book_Soup.head.descendants:
    print(descen)

<title>The web book catalog </title>
The web book catalog 
<head>
<p class="title"> <b> The Book Catalog </b></p>
<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>
<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>
<book id="bk003">
<author>Obey, Bruce</author>
<title> The Poet's first poem</title>
<genre>Poem</genre>
<price>24.95</price>
<review>The last poetic poem of the decade</review>
</book>
</books>
</head>


<p class="title"> <b> The Book Catalog </b></p>
 
<b> The Book Catalog </b>
 The Book Catalog 


<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing 

In [34]:
#Navigate down using stripped string method
for string in Book_Soup.stripped_strings:
    print(repr(string))

'The web book catalog'
'The Book Catalog'
'Hightower, Kim'
'The first Book'
'Fictional'
'44.95'
'2000-10-01'
'An amazing story of nothing'
'Nagata, Susanne'
'Becoming somebody'
'Biography'
'A master piece on the fine art of gossiping'
'Obey, Bruce'
"The Poet's first poem"
'Poem'
'24.95'
'The last poetic poem of the decade'


In [35]:
#Navigate up using parent method
title_tag.parent

<head><title>The web book catalog </title><head>
<p class="title"> <b> The Book Catalog </b></p>
<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>
<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>
<book id="bk003">
<author>Obey, Bruce</author>
<title> The Poet's first poem</title>
<genre>Poem</genre>
<price>24.95</price>
<review>The last poetic poem of the decade</review>
</book>
</books>
</head></head>

In [36]:
#Create element object to navigate back and forth
element_soup = Book_Soup.catalog.books

#Navigate forward using next_element method
next_element = element_soup.next_element.next_element
next_element

<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>

In [37]:
#Navigate back using previous_element attribute
previous_element = next_element.previous_element.previous_element
previous_element

<books>
<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>
<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>
<book id="bk003">
<author>Obey, Bruce</author>
<title> The Poet's first poem</title>
<genre>Poem</genre>
<price>24.95</price>
<review>The last poetic poem of the decade</review>
</book>
</books>

In [38]:
#create a sibling object and navigate to view it
next_sibling = Book_Soup.catalog.books.book
next_sibling

<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>

In [39]:
#navigate to next sibling
#(the first next_sibling is the whitespace string between the tags, so step once more)
next_sibling2 = next_sibling.next_sibling
next_sibling2.next_sibling

<book id="bk002">
<author>Nagata, Susanne</author>
<title>Becoming somebody</title>
<genre>Biography</genre>
<review>A master piece on the fine art of gossiping</review>
</book>

In [40]:
#Navigate to previous sibling
previous_sibling = next_sibling2.previous_sibling
previous_sibling

<book id="bk001">
<author>Hightower, Kim</author>
<title> The first Book</title>
<genre>Fictional</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<review>An amazing story of nothing</review>
</book>

### Modifying the Tree
With BeautifulSoup, you can also modify the tree and write your changes as a new HTML or XML document.

There are several methods to modify the tree:

![tree4.PNG](attachment:tree4.PNG)

* __.string:__

The .string attribute is used to modify the string value of a tag.

* __append():__

It works similarly to the append() method of Python lists.

* __NavigableString():__

It is used to add a string value to a document.

* __.new_tag():__

It is used to create a new tag.

* __insert():__

It is used to insert the content or values in the desired numeric position.

* __insert_before() and insert_after():__

They insert values before and after the given position respectively.

* __clear():__

It is used to remove the content of the tag.

* __extract():__

It is used to remove a tag or string from the tree and returns the extracted tag or string.

* __decompose():__

It is used to remove a tag from the tree and destroys the content completely.

* __replace_with():__

It is used to replace a tag with another.

* __wrap():__

It is used to wrap an element in the tag and returns new wrapper.

* __unwrap():__

It is the opposite of wrap(): it replaces a tag with whatever is inside that tag.
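
Several of these methods can be chained on a tiny document to see their effect. The sketch below uses made-up markup and takes a snapshot of the tree after each step.

```python
from bs4 import BeautifulSoup

# Hypothetical starting markup
soup = BeautifulSoup("<p><b>old</b></p>", "html.parser")

# .string replaces the text of a tag
soup.b.string = "new"

# new_tag() creates a tag; append() adds it as the last child
italic = soup.new_tag("i")
italic.string = "added"
soup.p.append(italic)
after_append = str(soup)   # <p><b>new</b><i>added</i></p>

# wrap() puts the <i> tag inside a new <span>
soup.i.wrap(soup.new_tag("span"))
after_wrap = str(soup)     # <p><b>new</b><span><i>added</i></span></p>

# unwrap() replaces <b> with its contents
soup.b.unwrap()
after_unwrap = str(soup)   # <p>new<span><i>added</i></span></p>

# extract() removes the <span> from the tree and returns it
removed = soup.span.extract()
after_extract = str(soup)  # <p>new</p>

print(after_extract, removed)
```

extract() hands back the removed element so it can be reused, whereas decompose() destroys it.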

#### Demo - 04: Modifying the tree
Demonstrate how to modify a web tree to get the desired result with the help of an example.

In [41]:
#import the required library
from bs4 import BeautifulSoup

#create employee html document
employee_html_doc = """<employees>
<employee class = "accountant">
    <firstname> John </firstname> <lastname> Doe </lastname>
</employee>
<employee class = "manager">
    <firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class = "developer">
    <firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>
"""

#create soup object and pass the web doc as a parameter
soup_emp = BeautifulSoup(employee_html_doc,'html.parser')

#access and view the tag
tag = soup_emp.employee
tag

<employee class="accountant">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee>

In [42]:
#Modify the tag
tag['class'] = 'manager'

#View the tag to see the modification
tag

<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee>

In [43]:
#view soup object to verify the modification
soup_emp

<employees>
<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee>
<employee class="manager">
<firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class="developer">
<firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>

In [44]:
#Add a tag
tag = soup_emp.new_tag('rank')
tag.string='manager'

#modify using insert_after method
soup_emp.employees.employee.insert_after(tag)

#view the soup object
print(soup_emp)

<employees>
<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee><rank>manager</rank>
<employee class="manager">
<firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class="developer">
<firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>



In [45]:
#clear the contents of the newly added 'rank' tag (the tag itself stays in the tree)
tag.clear() 

#view the soup object
soup_emp

<employees>
<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee><rank></rank>
<employee class="manager">
<firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class="developer">
<firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>

In [46]:
#remove the tag 'rank' from the tree and destroy its content completely
tag.decompose() 

#view the soup object
soup_emp

<employees>
<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee>
<employee class="manager">
<firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class="developer">
<firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>

In [47]:
#create a tag object and view it.
tag=soup_emp.employees.employee
tag

<employee class="manager">
<firstname> John </firstname> <lastname> Doe </lastname>
</employee>

In [48]:
#remove the string from the firstname tag using extract method (returns the removed string)
tag.firstname.string.extract()

' John '

In [49]:
#replace the (now empty) firstname tag with the plain string 'firstname'
tag.firstname.replace_with('firstname')

<firstname></firstname>

In [50]:
#view the changes
soup_emp.employees

<employees>
<employee class="manager">
firstname <lastname> Doe </lastname>
</employee>
<employee class="manager">
<firstname> Anna </firstname> <lastname> Smith </lastname>
</employee>
<employee class="developer">
<firstname> Peter </firstname> <lastname> Jones </lastname>
</employee>
</employees>

## Parsing Only Part of the Document
It is a waste of time and memory to parse the complete document when you need only a part of it. To overcome this problem, use the __SoupStrainer class__ to parse only a part of the document.

__SoupStrainer class__:

It allows us to choose the part of the document to be parsed.

* Create a SoupStrainer object and pass it to the BeautifulSoup constructor as a parse_only argument.

* However, this feature of parsing a part of the document will not work with the html5lib parser.
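
As a minimal sketch of the idea, the example below parses only the `<a>` tags of a made-up document. The markup and variable names are hypothetical; `SoupStrainer` and the `parse_only` argument are the pieces described above.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup for the example
html = (
    "<html><body><h1>Links</h1>"
    "<a href='/one'>One</a><p>text</p><a href='/two'>Two</a>"
    "</body></html>"
)

# Keep only <a> tags while parsing; everything else is skipped
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)            # ['/one', '/two']
print(soup.find("h1"))  # None, because <h1> was never parsed
```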

### Demo - 05: Parsing a part of the document
Demonstrate how to parse only a part of the document with the help of an example.

__Parsing a document can be performed in six steps.__

In [51]:
#import the required library
from bs4 import BeautifulSoup

#Sample web document from www.simplilearn.com website
data_SL = """<ul class = "content-col_discover">
            <h5> Discover </h5>
            <li><a herf = "/resources" id="free_resources"> Free resource </a></li>
            <li><a herf = "http://community.simplilearn.com/" id="community"> Simplilearn community </a></li>
            <li><a herf = "/career-data-labs" id="lab"> Career data labs </a></li>
            <li><a herf = "/scholarships-for-veterans" id="scholarship"> Veterans scholarship </a></li>
            <li><a herf = "http://www.simplilearn.com/feed/" id="rss"> RSS feed </a></li>
            </ul>"""

#create soup object and pass the web doc as a parameter
soup_SL = BeautifulSoup(data_SL,'html.parser')

#extract only the text (string) values of the tags using the get_text method
print(soup_SL.get_text())


 Discover 
 Free resource 
 Simplilearn community 
 Career data labs 
 Veterans scholarship 
 RSS feed 



In [52]:
#import SoupStrainer class for parsing the desired part of the web document
from bs4 import SoupStrainer

#create object to parse only the tag whose id is "lab"
tags_with_lablink = SoupStrainer(id="lab")

#print the part of the parsed document
print(BeautifulSoup(data_SL,'html.parser',parse_only=tags_with_lablink).prettify())

<a herf="/career-data-labs" id="lab">
 Career data labs
</a>


## Output: Printing and Formatting

### Printing
We can print the output using two methods.

* prettify()
* unicode() or str()

![output1.PNG](attachment:output1.PNG)

__prettify():__

The prettify(), or pretty-printing, method turns a parse tree into a nicely formatted Unicode string, with each HTML or XML tag and each string on its own line.

__unicode() or str():__

The str() method (unicode() exists only in Python 2) turns a parse tree into a plain string with no added line breaks or indentation. This is generally called non-pretty printing.
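The difference between the two printing methods can be sketched on a small fragment. The markup and the names `snippet_soup`, `compact`, and `pretty` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment used only for illustration
snippet_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Non-pretty printing: the markup comes back on a single line
compact = str(snippet_soup)

# Pretty printing: each tag and string on its own line, indented by depth
pretty = snippet_soup.prettify()

print(compact)
print(pretty)
```

The compact form is usually the better choice when the output is fed to another program, while prettify() is meant for reading the structure by eye.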

### Formatters
The formatters are used to generate different types of output with the desired formatting.
![output2.PNG](attachment:output2.PNG)

* __HTML and XML:__

HTML and XML formatting converts Unicode characters into HTML and XML entities, respectively, wherever possible.

* __Minimal:__

Minimal formatting processes strings only enough to ensure that the output is valid HTML/XML.

* __None:__

None formatting will not modify the content or string on output.

* __Uppercase and lowercase:__

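Uppercase and lowercase formatting converts string values to uppercase and lowercase, respectively. These are not built-in formatter names; they are implemented by passing a custom function as the formatter.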
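The formatter options listed above can be compared side by side on a fragment that contains an ampersand and an accented character. This is a sketch under stated assumptions: the markup and the variable names are hypothetical, and `Tag.decode()` is used instead of prettify() only to keep each result on one line.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing an ampersand and an accented character
formatter_soup = BeautifulSoup("<p>Café &amp; bar</p>", "html.parser")

# "minimal" (the default) escapes only what is needed for valid markup
minimal_out = formatter_soup.p.decode(formatter="minimal")

# "html" also converts Unicode characters to HTML entities where possible
html_out = formatter_soup.p.decode(formatter="html")

# None leaves strings untouched, so the bare ampersand is written as-is
none_out = formatter_soup.p.decode(formatter=None)

# A custom function receives each string and returns the text to output
upper_out = formatter_soup.p.decode(formatter=lambda text: text.upper())

for label, output in [("minimal", minimal_out), ("html", html_out),
                      ("None", none_out), ("custom", upper_out)]:
    print(label, "->", output)
```

Note that None and custom functions perform no escaping of their own, so the output may not be valid HTML/XML if the strings contain characters such as `&` or `<`.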

### Demo - 06: Formatting and Printing
Demonstrate how to format, print, and encode the web document.

In [53]:
#import the required libraries
from bs4 import BeautifulSoup
import lxml
import requests # API to extract web page

#define url for which formatting should be performed
url = 'http://simplilearn.com'

#access result through request object
result = requests.get(url)

#load the right content
page_content = result.content

#create soup object
soup = BeautifulSoup(page_content,'html.parser')

#View the contents
soup.contents

['html',
 <html dir="ltr" lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <title>World's #1 Online Bootcamp &amp; Certification Course Provider | Simplilearn</title>
 <link href="https://ssl.google-analytics.com/" rel="dns-prefetch"/>
 <link href="https://stats.g.doubleclick.net/" rel="dns-prefetch">
 <link href="https://www.google.com/" rel="dns-prefetch"/>
 <script type="text/javascript">
   ;window.NREUM||(NREUM={});NREUM.init={privacy:{cookies_enabled:true}};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(t){try{c.console&&console.log(t)}catch(e){}}var i,o=t("ee"),a

In [54]:
#prettify the output (generate output in web format)
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <title>
   World's #1 Online Bootcamp &amp; Certification Course Provider | Simplilearn
  </title>
  <link href="https://ssl.google-analytics.com/" rel="dns-prefetch"/>
  <link href="https://stats.g.doubleclick.net/" rel="dns-prefetch">
   <link href="https://www.google.com/" rel="dns-prefetch"/>
   <script type="text/javascript">
    ;window.NREUM||(NREUM={});NREUM.init={privacy:{cookies_enabled:true}};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(t){try{c.console&&console.log(t)}catch

In [55]:
#view the original encoding of soup object
soup.original_encoding

'utf-8'

In [56]:
#format the first <a> tag in the body using the minimal formatter
soup.body.a.prettify(formatter='minimal')

'<a href="https://www.simplilearn.com/resources" rel="noreferrer" target="_blank" title="Resources">\n Resources\n</a>\n'

In [57]:
#define a custom function to convert string values to uppercase
def uppercaseFn(strtext):
    return strtext.upper()

#format using custom function for outputting string texts in uppercase
soup.body.a.prettify(formatter=uppercaseFn)

'<a href="HTTPS://WWW.SIMPLILEARN.COM/RESOURCES" rel="NOREFERRER" target="_BLANK" title="RESOURCES">\n RESOURCES\n</a>\n'

## Encoding
BeautifulSoup handles two kinds of encoding:

__Document Encoding:__

* HTML or XML documents are written in specific encodings, such as ASCII or UTF-8.
* When you load the document into BeautifulSoup, it gets converted into Unicode.
* The original encoding is available through the .original_encoding attribute of the BeautifulSoup object.

__Output Encoding:__

* When you write a document from BeautifulSoup, you get a UTF-8 document irrespective of the original encoding.
* If some other encoding is required, you can pass it to prettify().

BeautifulSoup uses a sub-library called Unicode, Dammit (the UnicodeDammit class) to identify the encoding of a document and convert it to Unicode.

We can also call encode() on the BeautifulSoup object, or on any element in the soup, just as if it were a Python string.

Any character that cannot be represented in the chosen encoding is converted into a numeric XML entity reference.
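A short sketch of both kinds of encoding follows. The sample markup and variable names are hypothetical, and `from_encoding="utf-8"` is passed here as an assumption to make the input encoding explicit rather than relying on automatic detection.

```python
from bs4 import BeautifulSoup

# Hypothetical UTF-8 encoded document containing a non-ASCII character
utf8_bytes = "<p>Snowman: \N{SNOWMAN}</p>".encode("utf-8")

# Document encoding: the bytes are converted to Unicode on load
enc_soup = BeautifulSoup(utf8_bytes, "html.parser", from_encoding="utf-8")
print(enc_soup.original_encoding)

# Output encoding: encode() returns bytes in the requested encoding
utf8_out = enc_soup.p.encode("utf-8")

# The snowman has no ASCII form, so it becomes a numeric entity reference
ascii_out = enc_soup.p.encode("ascii")

# prettify() also accepts an encoding and then returns bytes
pretty_bytes = enc_soup.prettify("latin-1")

print(utf8_out)
print(ascii_out)
print(pretty_bytes)
```

The snowman character is code point U+2603, which is 9731 in decimal, so the ASCII output contains `&#9731;` in its place.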