### BeautifulSoup
BeautifulSoup is a Python library for quickly extracting (scraping) data from web pages. It lets you choose which parser to use, such as html.parser, lxml, or html5lib.

BeautifulSoup cannot do the job alone, because it does not download pages. We use another library, such as requests or urllib, to download the web page, and then use BeautifulSoup to parse the HTML source code.

### Installing & Importing prerequisites

In [1]:
import requests
import webbrowser

from bs4 import BeautifulSoup

In [2]:
very_simple_html = """

<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
<p class="title">
    <b>The Dormouse's story</b>
</p>

<p class="story">
Once upon a time there were three little sisters; and their names were:

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.
</p>

<p class="story">The story continues</p>
</body>
</html>
"""

In [3]:
soup = BeautifulSoup(very_simple_html, 'lxml') # name the parser explicitly to avoid a "no parser specified" warning

In [4]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were:
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;

and they lived at the bottom of a well.
  </p>
  <p class="story">
   The story continues
  </p>
 </body>
</html>



In [5]:
soup.title

<title>The Dormouse's story</title>

In [6]:
soup.body

<body>
<p class="title">
<b>The Dormouse's story</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.
</p>
<p class="story">The story continues</p>
</body>

In [7]:
soup.body.name

'body'

In [8]:
soup.body.parent.name

'html'

In [9]:
soup.p

<p class="title">
<b>The Dormouse's story</b>
</p>

In [10]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
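Attribute access such as `soup.p` or `soup.a` returns only the first matching tag. To collect every match, `find_all()` can be used instead. The sketch below is a minimal illustration on a shortened copy of the markup above; the variable names (`html_doc`, `all_links`, `hrefs`, `names`) are chosen here for the example.

```python
from bs4 import BeautifulSoup

# A shortened copy of the "three sisters" paragraph used above.
html_doc = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.a is a shortcut for the first <a> tag only.
first_link = soup.a

# find_all() returns every matching tag, in document order.
all_links = soup.find_all("a")

hrefs = [a["href"] for a in all_links]
names = [a.get_text() for a in all_links]

print(first_link["id"])
print(hrefs)
print(names)
```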

### Different Types Of Parsers
BeautifulSoup supports several parsers; which one to use depends on the type of markup you want to parse. The currently supported markup types are “html”, “xml”, and “html5”.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
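lxml and html5lib are separate packages, so they may not be installed everywhere, while html.parser ships with Python. One possible way to cope with that is to try the parsers in order of preference and fall back when BeautifulSoup raises `FeatureNotFound`. This is only a sketch: the helper `make_soup` and its `preferred` parameter are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. `make_soup` is a hypothetical helper.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = make_soup("<h1><a /><b><th <td>")
print(used_parser)
print(soup.prettify())
```

Because each parser repairs broken markup differently, the resulting tree depends on which parser was actually used, so it can be worth logging `used_parser`.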

In [11]:
html = """
    <h1><a /><b><th <td>
"""

In [12]:
soup = BeautifulSoup(html)

print(soup.prettify())

<html>
 <body>
  <h1>
   <a>
   </a>
   <b>
   </b>
   <th>
   </th>
  </h1>
 </body>
</html>



### lxml
https://lxml.de/index.html

In [13]:
soup = BeautifulSoup(html, 'lxml') # faster than html.parser

print(soup.prettify())

<html>
 <body>
  <h1>
   <a>
   </a>
   <b>
   </b>
   <th>
   </th>
  </h1>
 </body>
</html>



In [14]:
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<h1>
 <a>
 </a>
 <b>
  <th <td="">
  </th>
 </b>
</h1>



In [15]:
soup = BeautifulSoup(html, 'html5lib') # slow

print(soup.prettify())

<html>
 <head>
 </head>
 <body>
  <h1>
   <a>
    <b>
    </b>
   </a>
  </h1>
 </body>
</html>



In [16]:
soup = BeautifulSoup(html, 'xml')

print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<h1>
 <a/>
 <b>
  <th>
   <td/>
  </th>
 </b>
</h1>



In [17]:
soup = BeautifulSoup(html, 'lxml-xml') # to read xml files with lxml parser

print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<h1>
 <a/>
 <b>
  <th>
   <td/>
  </th>
 </b>
</h1>



----

#### Downloading a webpage using requests

In [18]:
resp = requests.get("https://google.com") 

resp

<Response [200]>
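A bare `requests.get()` call waits indefinitely if the server never answers, and it does not raise on an error status such as 404. A slightly more defensive variant might look like the sketch below. The helper name `fetch_soup` and the 10-second default timeout are assumptions made for this example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download `url` and return the parsed page (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, parser)


# Usage (needs network access):
# soup = fetch_soup("https://google.com")
# print(soup.title)
```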

### Parsing the downloaded page with BeautifulSoup, here using the lxml parser.

In [19]:
soup = BeautifulSoup(resp.text, "lxml")

soup

<!DOCTYPE html>
<html dir="rtl" itemscope="" itemtype="http://schema.org/WebPage" lang="ar"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/><title>Google</title><script nonce="8QQDg3kIE0gu7BixFpPbdA">(function(){var _g={kEI:'L-26ZajeOoqQ9u8P8Iui4AU',kEXPI:'0,793344,572123,207,4804,1132070,1962,868575,327218,646,380090,44798,23792,12311,17588,4998,17075,38444,2872,2891,4140,7614,606,29877,791,19391,10631,16105,230,20583,4,86661,6633,7593,1,42154,2,39761,6700,31122,4567,6256,24673,30151,2913,2,2,1,24626,2006,8155,23350,22436,9779,42459,20198,40912,32267,3030,15816,1804,13806,7206,5396,9821,10853,476,1159,5265786,712,2,296,69,586,518,528,99,5991660,1209,2806666,7475465,20540004,16672,43887,3,1603,3,262,3,234,3,2121276,2585,22636438,392913,4126,8673,8409,4505,2072,2323,1989,5775,10,4931,8082,4427,3860,6717,1670,4208,17455,13537,10511,2370,6198,209,2765,2575,4462,4044,266

In [20]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="rtl" itemscope="" itemtype="http://schema.org/WebPage" lang="ar">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script nonce="8QQDg3kIE0gu7BixFpPbdA">
   (function(){var _g={kEI:'L-26ZajeOoqQ9u8P8Iui4AU',kEXPI:'0,793344,572123,207,4804,1132070,1962,868575,327218,646,380090,44798,23792,12311,17588,4998,17075,38444,2872,2891,4140,7614,606,29877,791,19391,10631,16105,230,20583,4,86661,6633,7593,1,42154,2,39761,6700,31122,4567,6256,24673,30151,2913,2,2,1,24626,2006,8155,23350,22436,9779,42459,20198,40912,32267,3030,15816,1804,13806,7206,5396,9821,10853,476,1159,5265786,712,2,296,69,586,518,528,99,5991660,1209,2806666,7475465,20540004,16672,43887,3,1603,3,262,3,234,3,2121276,2585,22636438,392913,4126,8673,8409,4505,2072,2323,1989,5775,10,4931,8082,4427,3860,6717,1670,4208,17455,13537,10511,2370,6198,20

In [21]:
type(soup)

bs4.BeautifulSoup

In [22]:
with open('files/GiantPanda.html') as html_code:
    soup = BeautifulSoup(html_code, 'lxml')

In [23]:
print(soup.prettify())

<html>
 <head>
  <title>
   "Giant Panda"
  </title>
 </head>
 <body>
  <h1>
   The giant panda also known as panda bear or simply panda
  </h1>
  <h2 style="color:blue;">
   Giant Panda
  </h2>
  <h3 style="background-color:yellow;color:red;">
   The name "giant panda" is sometimes used to distinguish it from the
   <br/>
   red panda.
  </h3>
  <h6>
   The giant panda lives in a few mountain ranges in central China
  </h6>
  <b class="panda">
   pandas were thought to be rare and noble creatures
  </b>
  <div>
   <a href="https://en.wikipedia.org/wiki/Giant_panda">
    Link to Wikipedia page
   </a>
  </div>
  <br/>
  <br/>
  <i>
   <!--Here is a Panda image from Wikipedia -->
  </i>
  <div>
   <img alt="panda not found" src="Panda.jpeg"/>
  </div>
  <div>
   <p class="panda highlight">
    Giant pandas in the wild will occasionally eat other grasses, wild tubers, or even meat in the form of birds
   </p>
  </div>
 </body>
</html>



### Tag :

In [24]:
tag = soup.title

tag

<title>"Giant Panda"</title>

In [25]:
type(tag)

bs4.element.Tag

In [26]:
tag = soup.h2

tag

<h2 style="color:blue;">Giant Panda</h2>

### Name :

In [27]:
tag = soup.h1

tag

<h1>The giant panda also known as panda bear or simply panda</h1>

#### Every tag has a name, which we can access through its "name" attribute.

In [28]:
tag.name

'h1'

#### We can also change that name, and the change is reflected in the HTML that BeautifulSoup generates.

In [29]:
tag.name = "user_defined"

In [30]:
tag.name

'user_defined'

In [32]:
tag.text

'The giant panda also known as panda bear or simply panda'

In [33]:
tag

<user_defined>The giant panda also known as panda bear or simply panda</user_defined>
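Renaming a tag changes the tree itself, so lookups by the old name stop matching. The following self-contained sketch illustrates this on a one-line document; the markup is a shortened stand-in for the GiantPanda file.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>The giant panda</h1>", "html.parser")

tag = soup.h1
tag.name = "h2"  # rename the tag in place

# The serialized document now uses the new name...
print(soup)

# ...and the old name no longer finds anything.
old_lookup = soup.h1
new_lookup = soup.h2
print(old_lookup, new_lookup)
```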

### Attributes :

In [34]:
tag = soup.a

tag

<a href="https://en.wikipedia.org/wiki/Giant_panda"> Link to Wikipedia page </a>

In [35]:
tag['href']

'https://en.wikipedia.org/wiki/Giant_panda'

In [36]:
tag.attrs

{'href': 'https://en.wikipedia.org/wiki/Giant_panda'}

In [37]:
tag = soup.img

tag

<img alt="panda not found" src="Panda.jpeg"/>

In [38]:
tag.attrs

{'src': 'Panda.jpeg', 'alt': 'panda not found'}
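A tag behaves much like a dictionary for its attributes: besides reading them, you can add, change, and delete them. Indexing a missing attribute raises `KeyError`, whereas `get()` returns `None`. Here is a minimal sketch using a copy of the `<img>` tag above; the `width` and `title` attributes are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img alt="panda not found" src="Panda.jpeg"/>', "html.parser")
img = soup.img

# get() returns None for a missing attribute instead of raising.
missing = img.get("width")

# tag["name"] raises KeyError when the attribute is absent.
try:
    img["title"]
    raised = False
except KeyError:
    raised = True

# Attributes can be added, changed, and deleted like dictionary entries.
img["width"] = "200"
del img["alt"]

print(missing, raised)
print(img.attrs)
```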

#### Multi-valued attributes 

In [39]:
tag = soup.b

tag

<b class="panda">pandas were thought to be rare and noble creatures</b>

In [40]:
tag['class']

['panda']

In [41]:
tag = soup.p

tag

<p class="panda highlight">Giant pandas in the wild will occasionally eat other grasses, wild tubers, or even meat in the form of birds</p>

In [42]:
tag['class']

['panda', 'highlight']

#### To always get an attribute's value as a list, use the 'get_attribute_list()' method; it returns a list even when the attribute has a single value.

In [43]:
tag.get_attribute_list('class')

['panda', 'highlight']
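The difference shows up with attributes that are not multi-valued, such as `id`: plain indexing gives a string, while `get_attribute_list()` wraps it in a list. A small sketch, assuming a made-up `<p>` tag with both attributes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="panda highlight">text</p>', "html.parser")
p = soup.p

# "class" is multi-valued in HTML, so indexing already gives a list.
class_value = p["class"]

# "id" is single-valued, so indexing gives a plain string...
id_value = p["id"]

# ...but get_attribute_list() returns a list in both cases.
id_list = p.get_attribute_list("id")
class_list = p.get_attribute_list("class")

print(class_value, id_value)
print(id_list, class_list)
```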

### NavigableString :

In [44]:
tag = soup.p

tag

<p class="panda highlight">Giant pandas in the wild will occasionally eat other grasses, wild tubers, or even meat in the form of birds</p>

NavigableString is a subclass of Python's Unicode string (str). A NavigableString object holds the text within an HTML or XML tag.

In [45]:
type(tag.string)

bs4.element.NavigableString

In [46]:
tag.string

'Giant pandas in the wild will occasionally eat other grasses, wild tubers, or even meat in the form of birds'
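Note that `.string` only works when a tag has exactly one child. If the tag mixes text with other tags, `.string` is `None`, and `get_text()` or `.strings` can be used instead. The sketch below shows this on a made-up paragraph.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Giant <b>pandas</b> eat bamboo</p>", "html.parser")
p = soup.p

# <p> has three children (text, <b>, text), so .string is None.
whole = p.string

# <b> has a single text child, so .string is a NavigableString.
inner = soup.b.string

# get_text() joins all the text under the tag.
text = p.get_text()

# .strings yields each piece of text separately.
pieces = list(p.strings)

print(whole, inner)
print(text)
print(pieces)
```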

### Comment:

In [47]:
comment = soup.i

comment

<i><!--Here is a Panda image from Wikipedia --></i>

In [48]:
type(comment)

bs4.element.Tag

In [49]:
comment.string

'Here is a Panda image from Wikipedia '
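`soup.i` is the `<i>` tag that wraps the comment, which is why its type is `Tag`. The comment itself is the tag's `.string`, and it is a `Comment` object, a special kind of NavigableString. A minimal self-contained sketch, using a one-line stand-in for the GiantPanda file:

```python
from bs4 import BeautifulSoup, Comment

markup = "<i><!--Here is a Panda image from Wikipedia --></i><p>text</p>"
soup = BeautifulSoup(markup, "html.parser")

# The <i> tag has a single child, so .string returns that child.
node = soup.i.string
is_comment = isinstance(node, Comment)

# Collect every comment in the document by filtering on the string type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(type(node).__name__, is_comment)
print(comments)
```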