# Substitutions and Flags

Regex objects have a .sub() method, which replaces every match of a pattern within a string.

In the example below, we replace & with and.

In [17]:
import re
sample_text = '''
Jack & Jill
Batman & Robin
Joker & Harley
'''
regex = re.compile(r'&') # Create a regex object to carry out the substitution
new_text = regex.sub(r'and', sample_text) # Replace every occurrence of the pattern with the replacement string
print("Original Text:", sample_text)
print("New Text:", new_text)

Original Text: 
Jack & Jill
Batman & Robin
Joker & Harley

New Text: 
Jack and Jill
Batman and Robin
Joker and Harley
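
As a small aside, .sub() also accepts an optional count argument that caps the number of replacements, and the module-level re.sub() does the same job without compiling a regex object first. The sketch below uses a made-up one-line string (an assumption, not the sample above) to show both:

```python
import re

# Hypothetical one-line sample, chosen only for illustration
pairs = "Jack & Jill, Batman & Robin, Joker & Harley"

regex = re.compile(r'&')

# count=1 stops after the first replacement
first_only = regex.sub('and', pairs, count=1)

# re.sub() takes the pattern as a string and replaces every match
replace_all = re.sub(r'&', 'and', pairs)

print(first_only)
print(replace_all)
```

Compiling first is mostly worthwhile when the same pattern is reused several times, as in the cells of this notebook.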



## Substitutions using Groups

Here we use groups to remove the middle name and keep only the first name and last name. Not everyone has a middle name, so the group that matches it must be optional, and so must one of the spaces around it.

To mark part of a pattern as optional, we place the ? character after it. We match the space with [ ] rather than the \s sequence, because \s also matches newlines and we do not want a match to run across lines. The first group matches the first name. The space after it and the second group (the middle name) are each followed by ?, so both can be skipped when there is no middle name. The space in front of the last name is compulsory: when there is no middle name, the optional space matches nothing and the compulsory space separates the first name from the last name, so there are never two spaces between them. Finally, the third group matches the last name. Everyone has a last name, so this group is not followed by ?.

In [18]:
sample_text = '''
Anuj kumar Baid
Harshit Jain
Anant middle Bhandari
Shubham Sachdeva
Thrifty Kapila
Maggu Gupta
'''
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)') # A-Z, not A-z: the range A-z also covers symbols such as [ and _
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<re.Match object; span=(1, 16), match='Anuj kumar Baid'>
<re.Match object; span=(17, 29), match='Harshit Jain'>
<re.Match object; span=(30, 51), match='Anant middle Bhandari'>
<re.Match object; span=(52, 68), match='Shubham Sachdeva'>
<re.Match object; span=(69, 83), match='Thrifty Kapila'>
<re.Match object; span=(84, 95), match='Maggu Gupta'>
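
To see why the pattern uses [ ] instead of \s, the following sketch runs both variants on a two-line string. The sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical two-line sample; neither person has a middle name
text = "Harshit Jain\nMaggu Gupta"

# With \s the separator may be a newline, so a match can run across lines
loose = re.compile(r'([a-zA-Z]+)\s?([a-zA-Z]+)?\s([a-zA-Z]+)')

# With [ ] the separator must be a literal space, so each match stays on one line
strict = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

loose_matches = [m.group(0) for m in loose.finditer(text)]
strict_matches = [m.group(0) for m in strict.finditer(text)]

print(loose_matches)   # one match that swallows the newline
print(strict_matches)  # one match per line
```

With \s, "Jain" is taken as a middle name and "Maggu" as the last name, which is exactly the cross-line match we want to avoid.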


### Using groups in order to distinguish between first name, middle name and last name

In [19]:
sample_text = '''
Anuj kumar Baid
Harshit Jain
Anant middle Bhandari
Shubham Sachdeva
Thrifty kapila
Maggu Gupta
'''
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')
matches = regex.finditer(sample_text)
for match in matches:
    print("\n First name:"+match.group(1))
    if match.group(2) is None:
        print("Middle name : None")
    else:
        print("Middle Name:"+match.group(2))
    print("Last Name:"+match.group(3))


 First name:Anuj
Middle Name:kumar
Last Name:Baid

 First name:Harshit
Middle name : None
Last Name:Jain

 First name:Anant
Middle Name:middle
Last Name:Bhandari

 First name:Shubham
Middle name : None
Last Name:Sachdeva

 First name:Thrifty
Middle name : None
Last Name:kapila

 First name:Maggu
Middle name : None
Last Name:Gupta


In [20]:
sample_text = '''
Anuj kumar Baid
Harshit Jain
Anant middle Bhandari
Shubham Sachdeva
Thrifty kapila
Maggu Gupta
'''
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')
new_text = regex.sub(r'\1 \3', sample_text) # \1 and \3 refer to the first and third groups
print(new_text) 


Anuj Baid
Harshit Jain
Anant Bhandari
Shubham Sachdeva
Thrifty kapila
Maggu Gupta
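
The replacement given to .sub() can also be a function that receives each match object and returns the replacement string. This is handy when the result depends on whether an optional group matched. A minimal sketch, in which the last_first helper and the two-line sample are hypothetical:

```python
import re

name_regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

def last_first(match):
    """Hypothetical helper: format a name as 'Last, First M.'."""
    first, middle, last = match.group(1), match.group(2), match.group(3)
    # group(2) is None when the optional middle-name group did not match
    if middle is None:
        return f"{last}, {first}"
    return f"{last}, {first} {middle[0].upper()}."

result = name_regex.sub(last_first, "Anuj kumar Baid\nHarshit Jain")
print(result)
```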



# Flags

By default, regexes are case sensitive. Flags give us more flexibility: they are optional arguments that change how a pattern is matched. For example, the re.IGNORECASE flag performs case-insensitive matching. In the code below, we have a string that contains the same name in different cases. We pass re.IGNORECASE to indicate that we do not care about case when matching.

In [23]:
sample_text = 'Harshit Jain is a good boy. HARSHIT has the will and potential to do anything but he just has to believe in himself and work hard in order to achieve them!'

In [24]:
regex = re.compile(r'harshit',re.IGNORECASE)
matches = regex.finditer(sample_text)
for match in matches:
    print(match)

<re.Match object; span=(0, 7), match='Harshit'>
<re.Match object; span=(28, 35), match='HARSHIT'>
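
Flags can be combined with the | operator, and a flag can also be written inline at the start of the pattern, for example (?i) for re.IGNORECASE. The sketch below uses a made-up three-line string to show both forms. re.MULTILINE, which makes ^ match at the start of every line, is not covered above and is included only to illustrate combining flags:

```python
import re

# Hypothetical sample with the name in three different cases
text = "Harshit Jain\nHARSHIT again\nthen harshit once more"

# Two flags combined with |: ignore case, and let ^ match at each line start
line_start = re.compile(r'^harshit', re.IGNORECASE | re.MULTILINE)
starts = [m.group(0) for m in line_start.finditer(text)]

# The inline form (?i) is equivalent to passing re.IGNORECASE
everywhere = re.findall(r'(?i)harshit', text)

print(starts)
print(everywhere)
```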


# BeautifulSoup

It is a Python library that allows us to pull data out of HTML and XML files. It is particularly useful when the original document is formatted as HTML or XML.

Problems with BeautifulSoup:
1. It works best with perfectly formatted HTML. If the HTML document we are analyzing has missing information or mistakes, BeautifulSoup may return the wrong text.
2. Not all 10-Ks are in HTML or XML format. The older 10-Ks are in neither format, so we will not be able to use BeautifulSoup on them.

# Parser

A parser is a piece of software whose primary job is to build a hierarchical tree that gives a structural representation of the HTML or XML file. In other words, the parser divides a complex file into simpler parts while keeping track of how these parts are related to each other. BeautifulSoup supports a number of parsers, but we will focus only on the lxml parser, which can parse both HTML and XML files and has the advantage of being very fast.

We can install lxml with the following command:
pip install lxml

## Parsing an HTML File

To parse an HTML or XML document, we pass the document into the BeautifulSoup constructor. The constructor, BeautifulSoup(file, 'parser'), parses the given file using the given parser and returns a BeautifulSoup object. We can pass our file to the constructor either as a string or as an open filehandle, while the parser is a string that names the parser we want to use.

The BeautifulSoup constructor transforms the HTML or XML file into a complex tree of Python objects. One of these objects is the BeautifulSoup object returned by the constructor, which represents the document as a whole and can be searched using various methods. Usage:
    
    from bs4 import BeautifulSoup
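
The two ways of passing a document, as a string or as an open filehandle, can be sketched as follows. The HTML snippet is made up, io.StringIO stands in for a real file, and Python's built-in 'html.parser' is used instead of 'lxml' only so that the sketch runs without installing lxml:

```python
import io
from bs4 import BeautifulSoup

# Hypothetical document, kept short for illustration
html_doc = """<html>
<head><title>Sample page</title></head>
<body><h1 id="1">Main heading</h1></body>
</html>"""

# 1. Pass the markup as a string
soup_from_string = BeautifulSoup(html_doc, 'html.parser')

# 2. Pass a file-like object; io.StringIO stands in for open('./sample.html')
soup_from_handle = BeautifulSoup(io.StringIO(html_doc), 'html.parser')

print(soup_from_string.title.get_text())
print(soup_from_handle.h1['id'])
```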

In [25]:
from bs4 import BeautifulSoup

# opening the html file and create a BeautifulSoup Object:

with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
    
# Printing the BeautifulSoup object:

print(page_content)

<html>
<!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
<head>
<title>Enter a title, displayed at the top of the window.</title>
</head>
<!-- The information between the BODY and /BODY tags is displayed.-->
<body>
<h1>Enter the main heading, usually the same as the title.</h1>
<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
<ul>
<li>The first item in your list</li>
<li>The second item; <i>italicize</i> key words</li>
</ul>
<p>Improve your image by including an image. </p>
<p><img alt="A Great HTML Resource" src="http://www.mygifs.com/CoverImage.gif"/></p>
<p>Add a link to you

In [26]:
# Printing the BeautifulSoup Object with prettify

print(page_content.prettify())

<html>
 <!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
 <head>
  <title>
   Enter a title, displayed at the top of the window.
  </title>
 </head>
 <!-- The information between the BODY and /BODY tags is displayed.-->
 <body>
  <h1>
   Enter the main heading, usually the same as the title.
  </h1>
  <p>
   Be
   <b>
    bold
   </b>
   in stating your key points. Put them in a list:
  </p>
  <ul>
   <li>
    The first item in your list
   </li>
   <li>
    The second item;
    <i>
     italicize
    </i>
    key words
   </li>
  </ul>
  <p>
   Improve your image by including an image.
  </p>
  <p

As we can see above, the prettify() method makes the output easier to read and lets us identify tags more readily.

# Navigating the Parse Tree

The most straightforward way of navigating the parse tree created by BeautifulSoup is by accessing the HTML or XML tags by name. We can access them as if they were attributes of the BeautifulSoup object itself.

We access the <head> tag in our page_content object with the statement:
    page_content.head
    
Whenever we access a tag in this manner, we get a Tag object. Below, we save the Tag object in the page_head variable and then print it to see what it looks like.


In [27]:
from bs4 import BeautifulSoup 
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
    
page_head = page_content.head
print(page_head.prettify())

<head>
 <title>
  Enter a title, displayed at the top of the window.
 </title>
</head>



As we can see, the output contains only the <head> tag and everything inside it, including the opening and closing tags of its subtags. These subtags are called the children of the <head> tag.
    
We can access the child tags within the <head> tag as if they were attributes of page_head. For example, to access the <title> tag within the <head> tag we can use:
    

In [28]:
page_head.title

<title>Enter a title, displayed at the top of the window.</title>

## Getting text

In [29]:
print(page_head.title)

<title>Enter a title, displayed at the top of the window.</title>


### If we only want to print the text in the title tag within the head tag

In [31]:
print(page_head.title.get_text())

Enter a title, displayed at the top of the window.
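
When a tag has several children, get_text() joins all of their strings together. Its optional separator and strip arguments control how the pieces are joined. A small sketch on a made-up snippet, using the built-in 'html.parser' rather than 'lxml' so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html_doc = "<ul><li>The first item</li><li>The second item</li></ul><p>  Be <b>bold</b>  </p>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Without a separator, the strings of both <li> tags run together
joined = soup.ul.get_text()

# separator is placed between the individual strings
separated = soup.ul.get_text(separator=' | ')

# strip=True trims each string and drops the ones that end up empty
stripped = soup.p.get_text(separator=' ', strip=True)

print(joined)
print(separated)
print(stripped)
```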


## Getting Attributes

An HTML or XML tag can have many attributes. BeautifulSoup allows us to get the value of a tag's attribute by treating the tag like a dictionary.

In [36]:
from bs4 import BeautifulSoup
# opening the html file and creating a BeautifulSoup object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
# Access the h1 tag
page_h1 = page_content.body.h1
# get the value of the id attribute from the h1 tag
h1_id_attr = page_h1['id']
# Print the value of the id attribute
print(h1_id_attr)

1
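
Because a tag behaves like a dictionary, asking for an attribute that does not exist with square brackets raises a KeyError. The .get() method returns None instead, and .attrs exposes all attributes at once. A minimal sketch on a made-up snippet, with the built-in 'html.parser' standing in for 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html_doc = '<html><body><h1 id="1" class="main">Heading</h1><p>No id here</p></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

h1 = soup.body.h1
h1_id = h1['id']       # dictionary-style access
h1_attrs = h1.attrs    # every attribute of the tag as a dict

# The <p> tag has no id attribute: .get() returns None instead of raising
missing = soup.body.p.get('id')

print(h1_id)
print(h1_attrs)
print(missing)
```

Note that class is a multi-valued attribute in HTML, so BeautifulSoup stores its value as a list.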


## Finding all tags

In [37]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
print(page_content.prettify())

<html>
 <!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
 <head>
  <title>
   Enter a title, displayed at the top of the window.
  </title>
 </head>
 <!-- The information between the BODY and /BODY tags is displayed.-->
 <body>
  <h1 id="1">
   Enter the main heading, usually the same as the title.
  </h1>
  <p>
   Be
   <b>
    bold
   </b>
   in stating your key points. Put them in a list:
  </p>
  <ul>
   <li>
    The first item in your list
   </li>
   <li>
    The second item;
    <i>
     italicize
    </i>
    key words
   </li>
  </ul>
  <p>
   Improve your image by including an image.
  </

In [41]:
page_content.body.p

<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
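
Accessing a tag by name, as in page_content.body.p, returns only the first tag with that name, which is why a single <p> tag is shown above. The sketch below illustrates this on a made-up snippet (parsed with the built-in 'html.parser' rather than 'lxml') and also shows that a missing tag gives None:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <p> tags and no <table> tag
html_doc = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

first_p = soup.body.p               # only the first <p> tag
all_p = soup.body.find_all('p')     # every <p> tag
no_table = soup.body.table          # None, because there is no <table> tag

print(first_p.get_text())
print(len(all_p))
print(no_table)
```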

# Searching the parse tree

BeautifulSoup provides many methods for searching the parse tree, but we will only cover the .find_all() method.

The find_all(filter) method searches an entire document for the given filter and returns a list of all matches. The filter can be a string containing an HTML or XML tag name, a tag attribute, or even a regular expression.

In [44]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
h1_tags = page_content.find_all('h1') # Not named list, which would shadow the built-in
print(h1_tags)

[<h1 id="1">Enter the main heading, usually the same as the title.</h1>, <h1 id="2"> The second heading just for random testing purpose</h1>]


In [45]:
for tag in h1_tags:
    print(tag)

<h1 id="1">Enter the main heading, usually the same as the title.</h1>
<h1 id="2"> The second heading just for random testing purpose</h1>
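
Two related calls are worth knowing, although they are not covered here: .find() returns only the first match (or None), and the limit argument of .find_all() caps the number of results. A small sketch on a made-up snippet, parsed with the built-in 'html.parser' instead of 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html_doc = '<html><body><h1 id="1">First</h1><p>Text</p><h1 id="2">Second</h1></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

all_h1 = soup.find_all('h1')             # every <h1> tag
first_h1 = soup.find('h1')               # just the first <h1> tag
limited = soup.find_all('h1', limit=1)   # a list with at most one tag
no_table = soup.find_all('table')        # empty list when nothing matches

print([tag['id'] for tag in all_h1])
print(first_h1['id'])
print(len(limited), len(no_table))
```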


## Searching for multiple tags

We can also search for more than one tag at a time by passing a list to the .find_all() method.

In [46]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
for tag in page_content.find_all(['h1','p']):
    print(tag.prettify())

<h1 id="1">
 Enter the main heading, usually the same as the title.
</h1>

<p>
 Be
 <b>
  bold
 </b>
 in stating your key points. Put them in a list:
</p>

<p class="improve">
 Improve your image by including an image.
</p>

<p>
 <img alt="A Great HTML Resource" src="http://www.mygifs.com/CoverImage.gif"/>
</p>

<p>
 Add a link to your favorite
 <a href="https://www.dummies.com/">
  Web site
 </a>
 .
Break up your page with a horizontal rule or two.
</p>

<p>
 Finally, link to
 <a href="page2.html">
  another page
 </a>
 in your own Web site.
</p>

<p>
 © Wiley Publishing, 2011
</p>

<h1 id="2">
 The second heading just for random testing purpose
</h1>



## Searching for tags with a specific attribute

In [48]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
h1 = page_content.find_all('h1', id ='1')
for tag in h1:
    print(tag)

<h1 id="1">Enter the main heading, usually the same as the title.</h1>


## Searching for attributes directly

In [49]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
for tag in page_content.find_all(id='1'):
    print(tag)

<h1 id="1">Enter the main heading, usually the same as the title.</h1>


# Searching by class

To search by CSS class, we cannot pass class as a keyword argument to find_all(). The reason is that class is a reserved keyword in Python, so using it as an argument name gives a syntax error. To solve this, BeautifulSoup provides the class_ keyword argument.

In [52]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
for tag in page_content.find_all(class_='2'): # Class values are strings, so we pass '2'
    print(tag)

<p class="2">Improve your image by including an image. </p>
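
A tag can carry several classes at once, and class_ matches if any one of them equals the given value. The same search can also be written with the attrs dictionary, which avoids the reserved word altogether. A minimal sketch on a made-up snippet, with the built-in 'html.parser' standing in for 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <p> tag has two classes
html_doc = '<p class="improve">A</p><p class="improve highlight">B</p><p>C</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# class_ matches any tag that has "improve" among its classes
by_keyword = [tag.get_text() for tag in soup.find_all(class_='improve')]

# Equivalent search using the attrs dictionary
by_attrs = [tag.get_text() for tag in soup.find_all(attrs={'class': 'improve'})]

# Tag name and class can be combined
highlighted = [tag.get_text() for tag in soup.find_all('p', class_='highlight')]

print(by_keyword, by_attrs, highlighted)
```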


# Searching with Regular Expressions

We can even pass a regular expression to the find_all() method.

In [54]:
import re
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
# Print the names of all tags whose name contains the letter i:
for tag in page_content.find_all(re.compile(r'i')):
    print(tag.name)

title
li
li
i
img
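
A regular expression can be anchored to make the search more precise, and it can also be used as the filter for an attribute value. The sketch below uses a made-up snippet and the built-in 'html.parser' (instead of 'lxml') to find all heading tags and all links whose href starts with https:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html_doc = (
    '<h1>Title</h1><h2>Subtitle</h2>'
    '<p><a href="https://www.dummies.com/">Web site</a> '
    '<a href="page2.html">another page</a></p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# ^ and $ anchor the pattern, so only h1 to h6 match
heading_names = [tag.name for tag in soup.find_all(re.compile(r'^h[1-6]$'))]

# A compiled regex can also filter on an attribute value
external_links = [a['href'] for a in soup.find_all('a', href=re.compile(r'^https://'))]

print(heading_names)
print(external_links)
```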


# A Tag's Children

Tags may contain other tags and strings within them. These elements are known as the tag's children.

In [55]:
from bs4 import BeautifulSoup

with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')
print(page_content.prettify())

<html>
 <!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
 <head>
  <title>
   Enter a title, displayed at the top of the window.
  </title>
 </head>
 <!-- The information between the BODY and /BODY tags is displayed.-->
 <body>
  <h1 id="1">
   Enter the main heading, usually the same as the title.
  </h1>
  <p>
   Be
   <b>
    bold
   </b>
   in stating your key points. Put them in a list:
  </p>
  <ul>
   <li>
    The first item in your list
   </li>
   <li>
    The second item;
    <i>
     italicize
    </i>
    key words
   </li>
  </ul>
  <p class="2">
   Improve your image by including an i

In [63]:
from bs4 import BeautifulSoup

with open('./sample.html') as f:
    title_text = BeautifulSoup(f, 'lxml').head.title.get_text()
print(title_text)

Enter a title, displayed at the top of the window.


In [67]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
page_html = page_content.html
print(page_html.contents)
print("\n The <html> tag contains {} elements as children".format(len(page_html.contents)))

['\n', " Text between angle brackets is an HTML tag and is not displayed.\nMost tags, such as the HTML and /HTML tags that surround the contents of\na page, come in pairs; some tags, like HR, for a horizontal rule, stand \nalone. Comments, such as the text you're reading, are not displayed when\nthe Web page is shown. The information between the HEAD and /HEAD tags is \nnot displayed. The information between the BODY and /BODY tags is displayed.", '\n', <head>
<title>Enter a title, displayed at the top of the window.</title>
</head>, '\n', ' The information between the BODY and /BODY tags is displayed.', '\n', <body>
<h1 id="1">Enter the main heading, usually the same as the title.</h1>
<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
<ul>
<li>The first item in your list</li>
<li>The second item; <i>italicize</i> key words</li>
</ul>
<p class="2">Improve your image by including an image. </p>
<p><img alt="A Great HTML Resource" src="http://www.mygifs.com/CoverImag

In [71]:
# Instead of getting a tag's children as a list, we can also loop over them by using the .children attribute
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
for child in page_content.head.children:
    print(child)



<title>Enter a title, displayed at the top of the window.</title>
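
To summarize the difference: .contents gives a tag's direct children as a list, .children iterates over the same direct children, and .descendants (not used above) walks through the children's children as well. A minimal sketch on a made-up document without whitespace between tags, parsed with the built-in 'html.parser' rather than 'lxml':

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, written without newlines so that no
# whitespace-only strings show up among the children
html_doc = "<html><head><title>T</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Direct children only
direct = [child.name for child in soup.html.children]
direct_count = len(soup.html.contents)

# All nested tags; the isinstance check skips the text strings
all_tags = [node.name for node in soup.html.descendants if isinstance(node, Tag)]

print(direct, direct_count)
print(all_tags)
```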




## The Recursive Argument

If we call the .find_all() method on a Tag object, as in tag.find_all(), the method searches all of the tag's children, its children's children, and so on. However, there will be times when we only want BeautifulSoup to search a tag's direct children. To do this, we pass the recursive=False argument to the .find_all() method.

In [72]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    print(BeautifulSoup(f,'lxml').prettify())

<html>
 <!-- Text between angle brackets is an HTML tag and is not displayed.
Most tags, such as the HTML and /HTML tags that surround the contents of
a page, come in pairs; some tags, like HR, for a horizontal rule, stand 
alone. Comments, such as the text you're reading, are not displayed when
the Web page is shown. The information between the HEAD and /HEAD tags is 
not displayed. The information between the BODY and /BODY tags is displayed.-->
 <head>
  <title>
   Enter a title, displayed at the top of the window.
  </title>
 </head>
 <!-- The information between the BODY and /BODY tags is displayed.-->
 <body>
  <h1 id="1">
   Enter the main heading, usually the same as the title.
  </h1>
  <p>
   Be
   <b>
    bold
   </b>
   in stating your key points. Put them in a list:
  </p>
  <ul>
   <li>
    The first item in your list
   </li>
   <li>
    The second item;
    <i>
     italicize
    </i>
    key words
   </li>
  </ul>
  <p class="2">
   Improve your image by including an i

We can see above that the <head> tag is directly beneath the <html> tag, and the <title> tag is beneath the <head> tag. If we search for the <title> tag by calling find_all() on the <html> tag, BeautifulSoup will find it, because it searches all the descendants of the <html> tag.

In [75]:
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
for tag in page_content.html.find_all('title'):
    print(tag)


<title>Enter a title, displayed at the top of the window.</title>


In [77]:
# Since the <title> tag is not a direct child of the <html> tag, we will not get any output.
# Passing recursive=False restricts the search to the <html> tag's direct children.
from bs4 import BeautifulSoup
with open('./sample.html') as f:
    page_content = BeautifulSoup(f,'lxml')
for tag in page_content.html.find_all('title', recursive = False):
    print(tag)
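
The effect of recursive=False can be checked without sample.html. The sketch below uses a made-up inline document and the built-in 'html.parser' in place of 'lxml', and compares the three searches side by side:

```python
from bs4 import BeautifulSoup

# Hypothetical document: <title> sits inside <head>, which sits inside <html>
html_doc = "<html><head><title>Sample page</title></head><body><p>Text</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Default search looks at all descendants, so <title> is found
deep = soup.html.find_all('title')

# Restricted to direct children, <title> is not found
shallow = soup.html.find_all('title', recursive=False)

# <head> is a direct child of <html>, so it is still found
direct_head = soup.html.find_all('head', recursive=False)

print(len(deep), len(shallow), len(direct_head))
```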