# Regular Expressions

Regular expressions let us search and manipulate large amounts of text efficiently with a compact syntax. Python provides regular expression functionality through the `re` module.

## Matching Characters
Most characters simply match themselves. Some characters, however, do not match themselves; instead they signal that something special should happen in `re`. They are called metacharacters. This is the complete list:
``` python
. ^ $ * + ? { } [ ] \ | ( )
```
- `[]` is used to specify a character class. `[abc]` means `a`, `b` or `c`. The same class can be written as the range `[a-c]`.
- `^` is used to complement a set when it is the first character inside the class. For example, `[^a]` matches any character except `a`.
- `\` is used to signal various special sequences. It is also used to escape the metacharacters. To match a literal `\`, use `\\`. The equivalences below hold for ASCII text; in Python 3, string patterns also match the corresponding Unicode characters unless the `re.ASCII` flag is set.
    - `\d` matches any decimal digit: equivalent to `[0-9]`
    - `\D` matches any non-digit character: equivalent to `[^0-9]`
    - `\s` matches any whitespace character: equivalent to `[ \t\n\r\f\v]`
    - `\S` matches any non-whitespace character: equivalent to `[^ \t\n\r\f\v]`
    - `\w` matches any alphanumeric character: equivalent to `[a-zA-Z0-9_]`
    - `\W` matches any non-alphanumeric character: equivalent to `[^a-zA-Z0-9_]`
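
The character classes and special sequences listed above can be tried out directly. The following is a minimal sketch; the sample characters are arbitrary choices made for illustration, and `re.fullmatch()` is used so that a pattern counts as matching only when it covers the whole string.

```python
import re

# Single-character samples to test against each pattern (chosen for illustration).
samples = ["a", "d", "7", " ", "_", "-"]

patterns = [r"[abc]", r"[a-c]", r"[^a]", r"\d", r"\s", r"\w"]

# re.fullmatch() succeeds only if the whole string matches the pattern.
results = {
    pattern: [ch for ch in samples if re.fullmatch(pattern, ch)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)
```

Note that `_` is matched by `\w`, while `-` is matched only by the complemented class `[^a]`.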

## Repeating Things
The metacharacters for repeating things are:
``` python
* + ? {m,n}
```
- `*` specifies that the preceding character (or group) can be matched zero or more times.
- `+` specifies that the preceding character (or group) can be matched one or more times.
- `?` specifies that the preceding character (or group) can be matched zero or one time.
- `{m,n}` specifies that the preceding character (or group) must be matched at least `m` and at most `n` times.
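
To see how the repetition metacharacters differ, here is a small sketch that tests each one against the same set of strings. The helper `matches()` and the candidate strings are illustrative assumptions, not part of the `re` module.

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Strings with zero to four 'b' characters between 'a' and 'c'.
candidates = ["ac", "abc", "abbc", "abbbc", "abbbbc"]

star = [s for s in candidates if matches(r"ab*c", s)]         # zero or more b
plus = [s for s in candidates if matches(r"ab+c", s)]         # one or more b
optional = [s for s in candidates if matches(r"ab?c", s)]     # zero or one b
bounded = [s for s in candidates if matches(r"ab{2,3}c", s)]  # two or three b

print(star)
print(plus)
print(optional)
print(bounded)
```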

## Compiling Regular Expressions
Regular expressions are compiled into pattern objects. The RE is passed to `re.compile()` as a string:

In [1]:
import re
p = re.compile('ab*')
p

re.compile(r'ab*')

`re.compile()` also accepts optional flag arguments, for example `re.IGNORECASE` for case-insensitive matching:

In [2]:
p = re.compile('ab*', re.IGNORECASE)
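
The effect of a flag is easiest to see side by side. This is a minimal sketch with made-up input strings; it also shows that several flags can be combined with `|`, here `re.IGNORECASE` together with `re.MULTILINE`.

```python
import re

case_sensitive = re.compile(r"ab*")
case_insensitive = re.compile(r"ab*", re.IGNORECASE)

# Without the flag, upper-case input does not match a lower-case pattern.
print(case_sensitive.match("ABBB"))            # None
print(case_insensitive.match("ABBB").group())  # ABBB

# Flags are combined with the bitwise OR operator.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
per_line = re.compile(r"^ab*$", re.IGNORECASE | re.MULTILINE)
lines_found = per_line.findall("AB\nabb\nxyz")
print(lines_found)
```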

## The Backslash Plague
Regular expressions use `\` to indicate special forms, which conflicts with Python's use of the same character for the same purpose in string literals. This conflict is known as the backslash plague. The table below shows the stages of escaping needed to match the example LaTeX string `\section`:

| Characters  | Stage  |
|---|---|
| \section  |  Text string to be matched |
|  \\section |  Escaped backslash for re.compile() |
| "\\\\section"  | Escaped backslashes for a string literal  |

Alternatively, we can use Python's raw string notation for regular expressions, in which backslashes are not treated specially:

|  Regular String |  Raw string |
|---|---|
| "ab*" |  r"ab*" |
|  "\\\\section" |  r"\\section"|
| "\\w+\\s+\\1"  | r"\w+\s+\1" |
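
The equivalence between the two notations can be checked in code. The sketch below uses the `\section` example from the tables; the sample text `\section{Introduction}` is an assumption made for illustration.

```python
import re

regular = "\\\\section"  # regular string literal: four backslashes in the source
raw = r"\\section"       # raw string literal: two backslashes in the source

# Both literals produce the same nine-character pattern: \\section
same_pattern = (regular == raw)
print(same_pattern, len(raw))

# The pattern matches one literal backslash followed by 'section'.
text = r"\section{Introduction}"
found = re.search(raw, text)
print(found.group())
```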

## Performing Matches
The most important methods of a pattern object are:

|  Method/Attribute |  Purpose |
|---|---|
| match() |  Determine if the RE matches at the beginning of the string |
|  search() |  Scan through a string, looking for any location where this RE matches.|
| findall()  | Find all substrings where the RE matches, and returns them as a list. |
| finditer() | Find all substrings where the RE matches, and returns them as an iterator. |

For experiment, we will import `re` and compile a regular expression:

In [3]:
import re
p = re.compile(r'[a-z]+')
p

re.compile(r'[a-z]+')

Try to match the empty string with this expression. Since `+` requires at least one character, `match()` returns `None`:

p.match("")
print(p.match(""))

Try to match this expression with string `tempo`.

In [4]:
m = p.match('tempo')
m

<_sre.SRE_Match at 0x7f2c0447b9d0>

| Method/Attribute | Purpose |
|---|---|
| group() | Return the string matched by the RE| 
|start() | Return the starting position of the match|
| end() | Return the ending position of the match|
| span() |Return a tuple containing the (start, end) positions of the match |

In [5]:
m.group()

'tempo'

In [6]:
m.start(), m.end()

(0, 5)

In [7]:
m.span()

(0, 5)

`findall()` returns a list of matching strings:

In [8]:
p = re.compile(r'\d+')
p.findall('12 , 13, 14, 1000')

['12', '13', '14', '1000']

`finditer()` returns an iterator of match objects:

In [9]:
it = p.finditer('12, 13, 14, 15')
it

<callable-iterator at 0x7f2c0446b1d0>

In [10]:
for match in it:
    print(match.span())

(0, 2)
(4, 6)
(8, 10)
(12, 14)
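
The cells above exercise `match()`, `findall()` and `finditer()`. The difference between `match()` and `search()` can be sketched as follows; the input string `'123 tempo'` is an assumption chosen so that the first lower-case letter is not at position 0.

```python
import re

p = re.compile(r"[a-z]+")
text = "123 tempo"

at_start = p.match(text)   # match() only tries position 0, which holds a digit
anywhere = p.search(text)  # search() scans forward until the pattern matches

print(at_start)
print(anywhere.group(), anywhere.span())
```

In this sketch `match()` returns `None`, while `search()` finds `'tempo'` at span `(4, 9)`.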


## Module-Level Functions
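The `re` module also provides top-level functions such as `re.match()`. They take the pattern as their first argument, so we do not have to compile a pattern object first: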

In [11]:
print(re.match(r'\s', 'A B'))

None


In [12]:
print(re.match(r'\s', ' A'))

<_sre.SRE_Match object at 0x7f2c04551c00>


## Grouping
Groups are marked by `(` and `)`. They let us group part of a pattern together and apply RE operations, such as repetition, to the whole group.

In [13]:
p = re.compile(r'(ab)*')
print(p.match('ababababa').span())

(0, 8)


In [14]:
p = re.compile('(a)b')
m = p.match('ab')
m.group()


'ab'
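
Calling `group()` with no argument returns the whole match. The text captured by each individual group can be retrieved by passing the group number. A small sketch, using a hypothetical pattern with nested groups:

```python
import re

# Groups are numbered from left to right by their opening parenthesis.
p = re.compile(r"(a)(b(c))")
m = p.match("abc")

whole = m.group(0)       # the entire match, same as m.group()
first = m.group(1)       # text captured by (a)
second = m.group(2)      # text captured by (b(c))
third = m.group(3)       # text captured by the inner (c)
all_groups = m.groups()  # tuple with every captured group

print(whole, first, second, third)
print(all_groups)
```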

## Modifying Strings

|Method/Attribute | Purpose|
|---|---|
|split()|Split the string into a list, splitting it wherever the RE matches|
|sub()|Find all substrings where the RE matches, and replace them with a different string|
|subn()|Does the same thing as sub(), but returns the new string and the number of replacements|

## Split

In [15]:
p = re.compile(r'\W+')
p.split('Hello is what we say. But Mello is what minions say')

['Hello',
 'is',
 'what',
 'we',
 'say',
 'But',
 'Mello',
 'is',
 'what',
 'minions',
 'say']

## Search and Replace

In [16]:
p = re.compile('(egg|fish|chicken)')
p.sub('food', 'We eat egg, fish and chicken')

'We eat food, food and food'

`subn()` works the same way, but it returns a 2-tuple containing the new string value and the number of replacements.

In [17]:
p = re.compile('(egg|fish|chicken)')
p.subn('food', 'We eat egg, fish and chicken')

('We eat food, food and food', 3)
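
Two further options of `sub()` are sketched here, reusing the same sentence: the `count` argument limits how many matches are replaced, and a backreference such as `\1` in the replacement string inserts the text captured by group 1. The square brackets in the replacement are just an illustrative choice.

```python
import re

p = re.compile(r"(egg|fish|chicken)")
sentence = "We eat egg, fish and chicken"

# Replace only the first match.
first_only = p.sub("food", sentence, count=1)

# \1 in the replacement refers to whatever group 1 captured.
wrapped = p.sub(r"[\1]", sentence)

print(first_only)
print(wrapped)
```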

# REST API

We begin by importing the `requests` module:

In [18]:
import requests

Let's fetch GitHub's public events endpoint:

In [19]:
r = requests.get('https://api.github.com/events')

Our *response* object is `r`. We can make an HTTP `POST` request like this:

In [20]:
r = requests.post('https://httpbin.org/post', data = {'key':'value'})

The other HTTP methods, `PUT`, `DELETE`, `HEAD` and `OPTIONS`, work the same way:

In [21]:
r = requests.put('https://httpbin.org/put', data = {'key':'value'})
r = requests.delete('https://httpbin.org/delete')
r = requests.head('https://httpbin.org/get')
r = requests.options('https://httpbin.org/get')

## Passing Parameters in URLs
If you wanted to pass `key1=value1` and `key2=value2` to `httpbin.org/get`, you would use the following code:

In [22]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

In [23]:
print(r.url)

https://httpbin.org/get?key2=value2&key1=value1
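
To inspect how `params` is encoded into the URL without contacting the server, the request can be built and prepared but not sent. This is a sketch using `requests.Request(...).prepare()`; the variable name `prepared` is an illustrative choice.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build and prepare the request without sending it, so no network access is needed.
prepared = requests.Request('GET', 'https://httpbin.org/get', params=payload).prepare()

# The dictionary is encoded into the query string of the URL.
print(prepared.url)
```

On Python 3.7 and later, dictionaries keep their insertion order, so the parameters appear in the order they were written.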


## Response Content

In [24]:
import requests
r = requests.get('https://api.github.com/events')
r.text

u'[{"id":"11490532220","type":"CreateEvent","actor":{"id":10810283,"login":"direwolf-github","display_login":"direwolf-github","gravatar_id":"","url":"https://api.github.com/users/direwolf-github","avatar_url":"https://avatars.githubusercontent.com/u/10810283?"},"repo":{"id":239819832,"name":"direwolf-github/ephemeral-ci-24f091d5","url":"https://api.github.com/repos/direwolf-github/ephemeral-ci-24f091d5"},"payload":{"ref":null,"ref_type":"repository","master_branch":"master","description":null,"pusher_type":"user"},"public":true,"created_at":"2020-02-11T17:14:53Z"},{"id":"11490532217","type":"ForkEvent","actor":{"id":270587,"login":"Hsaka","display_login":"Hsaka","gravatar_id":"","url":"https://api.github.com/users/Hsaka","avatar_url":"https://avatars.githubusercontent.com/u/270587?"},"repo":{"id":143147062,"name":"koreezgames/phaser3-particle-editor","url":"https://api.github.com/repos/koreezgames/phaser3-particle-editor"},"payload":{"forkee":{"id":239819833,"node_id":"MDEwOlJlcG9zaXR

`requests` guesses which text encoding the response uses, based on the HTTP headers:

In [25]:
r.encoding

'utf-8'

## Response Status Codes

In [26]:
r = requests.get('https://httpbin.org/get')
r.status_code

200

We can also compare the status code against the built-in status code lookup object, `requests.codes`:

In [27]:
r.status_code == requests.codes.ok

True

## Response Headers

In [28]:
r.headers

{'Content-Length': '306', 'Server': 'gunicorn/19.9.0', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Date': 'Tue, 11 Feb 2020 17:20:00 GMT', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json'}

We can access the headers using any capitalization we want:

In [29]:
print(r.headers['Content-Type'])
print(r.headers['content-type'])

application/json
application/json


## HTTP Basic Authentication

In [30]:
response = requests.get("https://api.github.com/user", auth=("user", "pass"))
print(response.status_code)

403
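
What the `auth` tuple does can be seen by preparing a request without sending it. The sketch below assumes the same placeholder credentials, `user` and `pass`, and shows that the tuple is shorthand for `requests.auth.HTTPBasicAuth`, which adds an `Authorization` header containing the Base64-encoded `user:pass` pair.

```python
import requests
from requests.auth import HTTPBasicAuth

url = "https://api.github.com/user"

# Prepare (but do not send) the request in both notations.
with_tuple = requests.Request("GET", url, auth=("user", "pass")).prepare()
with_class = requests.Request("GET", url, auth=HTTPBasicAuth("user", "pass")).prepare()

auth_header = with_tuple.headers["Authorization"]
print(auth_header)                                         # Basic dXNlcjpwYXNz
print(auth_header == with_class.headers["Authorization"])  # True
```

Base64 is an encoding, not encryption, so Basic Authentication should only be used over HTTPS.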


# BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML files.

In [31]:
from bs4 import BeautifulSoup as bs

## Constructor
BeautifulSoup objects can be constructed in the following way:

In [32]:
import requests

response = requests.get("http://0.facebook.com")

# We pass two arguments to the constructor:
# the first is the raw HTML content we want to parse,
# the second is the name of the HTML parser we want to use
# ('html5lib' is a separate package that must be installed)
soup = bs(response.content, 'html5lib')

## Navigation
We can navigate the HTML file and get useful information easily using the attributes and functions built into BeautifulSoup

### Using Tag Names
Accessing a tag name as an attribute returns a Tag object representing the first HTML/XML tag with that name

In [33]:
soup.head

<head><title>Facebook \u2013 log in or sign up</title><meta content="default" id="meta_referrer" name="referrer"/><style type="text/css">/*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>span:after{left:100%;margin-left:15px;}.b .bs{color:#4b4f56;font-size:14px;}.b .bl{border:solid 1px #999;box-sizing:border-box;width:100%;}.b .t{border:0;border-collapse:collapse;margin:0;padding:0;width:100%;}.b .t tb

In [34]:
soup.title

<title>Facebook \u2013 log in or sign up</title>

You can use the dot operator again and again to get tags nested inside another tag

In [35]:
soup.head.meta

<meta content="default" id="meta_referrer" name="referrer"/>

### Using prettify
Calling prettify() returns the text of an HTML document with proper indentation

In [36]:
print(soup.prettify())

<!--?xml version="1.0" encoding="utf-8"?-->
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Facebook – log in or sign up
  </title>
  <meta content="default" id="meta_referrer" name="referrer"/>
  <style type="text/css">
   /*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>sp

### Using .contents and .children
A tag's children are available in a list called .contents

In [37]:
head_tag = soup.head
head_tag.contents

[<title>Facebook \u2013 log in or sign up</title>,
 <meta content="default" id="meta_referrer" name="referrer"/>,
 <style type="text/css">/*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>span:after{left:100%;margin-left:15px;}.b .bs{color:#4b4f56;font-size:14px;}.b .bl{border:solid 1px #999;box-sizing:border-box;width:100%;}.b .t{border:0;border-collapse:collapse;margin:0;padding:0;width:100%;}.b .t t

In [38]:
title_tag = head_tag.contents[0]
title_tag

<title>Facebook \u2013 log in or sign up</title>

You can also iterate over a tag's children by using the .children iterator

In [39]:
for child in head_tag.children:
    print(child)

<title>Facebook – log in or sign up</title>
<meta content="default" id="meta_referrer" name="referrer"/>
<style type="text/css">/*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>span:after{left:100%;margin-left:15px;}.b .bs{color:#4b4f56;font-size:14px;}.b .bl{border:solid 1px #999;box-sizing:border-box;width:100%;}.b .t{border:0;border-collapse:collapse;margin:0;padding:0;width:100%;}.b .t tbody{verti

### Using .descendants
.contents only gives a list of the direct children of a tag. To get all descendants (direct and indirect) of a tag, we use .descendants, which iterates over all children of a tag recursively

In [40]:
for child in soup.table.descendants:
    print(child) 

<tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody>
<tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr>
<td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8

### Using .string, .strings, and .stripped_strings
If a tag has only one child, and it is a string, we can access it using .string

In [41]:
title_tag.string

u'Facebook \u2013 log in or sign up'

If a tag has more than one child, then you can access all the strings inside it using .strings

In [42]:
for string in soup.ul.strings:
    print(repr(string))

u'Phone number or email address'
u'Password'


The strings returned by .strings can include extra whitespace. To get the strings with leading and trailing whitespace removed, and whitespace-only strings skipped, you can use .stripped_strings

In [43]:
for string in soup.ul.stripped_strings:
    print(repr(string))

u'Phone number or email address'
u'Password'
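
The two outputs above look identical because this particular page contains no extra whitespace inside the list. The difference shows up on a small hand-written document. The sketch below makes two assumptions: the HTML snippet is made up, and it uses Python's built-in `'html.parser'` rather than `'html5lib'` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# A made-up snippet with indentation and padding inside the tags.
html = """
<ul>
  <li> Phone number or email address </li>
  <li> Password </li>
</ul>
"""

demo_soup = BeautifulSoup(html, "html.parser")

raw_strings = list(demo_soup.ul.strings)             # includes whitespace-only strings
clean_strings = list(demo_soup.ul.stripped_strings)  # stripped, blanks skipped

for string in raw_strings:
    print(repr(string))
print(clean_strings)
```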


### Accessing Parent of Tag
You can access the parent tag of a tag or string using .parent



In [44]:
title_tag = soup.title
print(title_tag)
print(title_tag.parent)

<title>Facebook – log in or sign up</title>
<head><title>Facebook – log in or sign up</title><meta content="default" id="meta_referrer" name="referrer"/><style type="text/css">/*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>span:after{left:100%;margin-left:15px;}.b .bs{color:#4b4f56;font-size:14px;}.b .bl{border:solid 1px #999;box-sizing:border-box;width:100%;}.b .t{border:0;border-collapse:collapse;

If you want to iterate over all of the parents of a tag, you can use .parents

In [45]:
link = soup.a
print(link)
for parent in link.parents:
    print(parent.name)

<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>
td
tr
tbody
table
div
div
div
body
html
[document]


### Accessing Siblings of Tag
You can use .next_sibling and .previous_sibling to access the next and previous sibling of a tag respectively

In [46]:
sibling_soup = bs("<a><b>text1</b><c>text2</c></a>", 'html5lib')
print(sibling_soup.prettify())

<html>
 <head>
 </head>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>


In [47]:
sibling_soup.b.next_sibling

<c>text2</c>

In [48]:
sibling_soup.c.previous_sibling

<b>text1</b>

You can iterate over a tag's next and previous siblings using .next_siblings and .previous_siblings respectively

In [49]:
for sibling in soup.li.next_siblings:
    print(sibling)

<li class="bf"><div><span class="bh bi" id="u_0_1">Password</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bm bn" name="pass" type="password"/></div></li>
<li class="bf"><input class="l q m bo bp bq" name="login" type="submit" value="Log In"/></li>


### Traversing the parse tree
You can access the next and previous element of a tag in the pre-order traversal of the parse tree using .next_element and .previous_element respectively

In [50]:
a = soup.a
print(a)

<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>


In [51]:
a.previous_element

<td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td>

In [52]:
a.next_element

<span class="s">Create Account</span>

You can iterate over all next and previous elements using the .next_elements and .previous_elements iterators respectively

In [53]:
previous_elements = [element for element in a.previous_elements]
print(len(previous_elements))

24


In [54]:
next_elements = [element for element in a.next_elements]
print(len(next_elements))

76


### find_all() function
The find_all() function is the most important function in BeautifulSoup. It finds tags that fulfill certain criteria.

#### Finding tag using string
Entering the name of the tag in the find_all() function will give a list of all tags matching the condition

In [55]:
soup.find_all('a')

[<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>,
 <a href="/recover/initiate/?c=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;r&amp;cuid&amp;ars=facebook_login&amp;lwv=100&amp;refid=8" id="forgot-password-link" tabindex="0">Forgotten password?</a>,
 <a class="ci" href="/a/language.php?l=as_IN&amp;lref=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;index=2&amp;sref=legacy_mbasic_footer&amp;gfid=AQBOoS6Z1friG7g2&amp;ref_component=mbasic_footer&amp;ref_page=%2Fwap%2Findex.php&amp;refid=8">\u0985\u09b8\u09ae\u09c0\u09af\u09bc\u09be</a>,
 <a class="ci" href="/a/language.php?l=pt_BR&amp;lref=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;index=4&amp;sref=legacy_mbasic_footer&amp;gfid=AQDK3Eh9WdQjDzE1&amp;ref_componen

If you insert a list of strings inside find_all(), you will get a list of tags whose name matches any name inside the list

In [56]:
soup.find_all(['a', 'li'])

[<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>,
 <li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" type="text"/></li>,
 <li class="bf"><div><span class="bh bi" id="u_0_1">Password</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bm bn" name="pass" type="password"/></div></li>,
 <li class="bf"><input class="l q m bo bp bq" name="login" type="submit" value="Log In"/></li>,
 <li class="bf"><a href="/recover/initiate/?c=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;r&amp;cuid&amp;ars=facebook_login&amp;lwv=100&amp;refid=8" id="forgot-password-link" tabindex="0">Forgotten password?</a></li>,
 <a href="/recover/initiate/?c=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669

#### Finding tag using RE
You can enter an RE inside find_all() to find all tags whose name is accepted by the RE

In [57]:
import re

soup.find_all(re.compile('^b'))   # Returns list of tags whose name starts with 'b'
                                  # e.g. 'b', or 'body'

[<body class="b c d e" tabindex="0"><div class="f"><div id="viewport"><div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table></div><div id="objects_container"><div class="e" id="root"><table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocompl

#### Find tag using custom function
You can pass find_all() a custom function that takes a tag as its argument. find_all() will then return a list of all tags for which the function returns True

In [58]:
soup.find_all(lambda tag: tag.has_attr('id') and not tag.has_attr('class'))

[<meta content="default" id="meta_referrer" name="referrer"/>,
 <div id="viewport"><div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table></div><div id="objects_container"><div class="e" id="root"><table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><in

If you pass True to find_all(), it will return all tags in the BeautifulSoup object

In [59]:
soup.find_all(True)

[<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook \u2013 log in or sign up</title><meta content="default" id="meta_referrer" name="referrer"/><style type="text/css">/*<![CDATA[*/.s{pointer-events:none;}.w{padding:8px;}.ca{padding-bottom:12px;padding-top:12px;}.bz{padding-left:4px;padding-right:4px;}.bu{padding-left:8px;padding-right:8px;}.b .bw{display:block;width:auto;}.b .bw .bx,.b .bw .bx:hover{background-color:#42b72a;color:#fff;height:44px;}.b .br{display:block;margin-bottom:5px;margin-left:3%;margin-top:-3px;overflow:hidden;text-align:center;white-space:nowrap;width:94%;}.b .br>span{display:inline-block;position:relative;}.b .br>span:before,.b .br>span:after{background:#ccd0d5;content:"";height:1px;position:absolute;top:50%;width:9999px;}.b .br>span:before{margin-right:15px;right:100%;}.b .br>span:after{left:100%;margin-left:15px;}.b .bs{color:#4b4f56;font-size:14px;}.b .bl{border:solid 1px #999;box-sizing:border-box;width:100%;}.b .t{border:0;border-collapse:coll

#### Find tags based on attribute values
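You can get a list of all tags that have a particular value for a particular attribute by passing a keyword argument of the form 'attribute=value' to find_all(). Since class is a reserved word in Python, the keyword for the class attribute is class_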

In [60]:
soup.find_all(id='login_error') # Find all tags that have login_error as id

[<div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div>]

In [61]:
soup.find_all('li', class_='bf') # Find all <li> tags that have class bf

[<li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" type="text"/></li>,
 <li class="bf"><div><span class="bh bi" id="u_0_1">Password</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bm bn" name="pass" type="password"/></div></li>,
 <li class="bf"><input class="l q m bo bp bq" name="login" type="submit" value="Log In"/></li>,
 <li class="bf"><a href="/recover/initiate/?c=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;r&amp;cuid&amp;ars=facebook_login&amp;lwv=100&amp;refid=8" id="forgot-password-link" tabindex="0">Forgotten password?</a></li>,
 <li class="bf"></li>,
 <li class="bf"></li>]

#### Other arguments of find_all()
You can search for a string in the HTML file using the string argument of find_all()

In [62]:
soup.find_all(string='Password')

[u'Password']

You can also combine it with a tag name to find tags whose inner text is a particular string

In [63]:
soup.find_all('span', string='Password')

[<span class="bh bi" id="u_0_1">Password</span>]

By putting 'recursive=False' inside find_all() you can restrict the search to the direct children of a tag instead of all of its descendants

In [64]:
soup.html.find_all('title')

[<title>Facebook \u2013 log in or sign up</title>]

In [65]:
soup.html.find_all('title', recursive=False) # Returns an empty list because <title>
                                             # is not a direct child of <html>

[]

You can get only the first n elements matching a condition in find_all() using limit=n

In [66]:
soup.find_all('li', limit=2)

[<li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" type="text"/></li>,
 <li class="bf"><div><span class="bh bi" id="u_0_1">Password</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bm bn" name="pass" type="password"/></div></li>]

### find()
find() is like find_all() with limit=1, except that it returns the first element fulfilling a condition itself rather than a list (or None if nothing matches)

In [67]:
soup.find('li', class_='bf')

<li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" type="text"/></li>
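
The difference in return types between find() and find_all() matters most when nothing matches. A minimal sketch on a made-up document, again assuming the built-in `'html.parser'`:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<ul><li class='bf'>one</li><li class='bf'>two</li></ul>", "html.parser"
)

first = demo.find("li", class_="bf")                 # a single Tag
limited = demo.find_all("li", class_="bf", limit=1)  # a list holding that Tag

missing_one = demo.find("table")      # None when nothing matches
missing_all = demo.find_all("table")  # empty list when nothing matches

print(first.string, len(limited))
print(missing_one, missing_all)
```

Because find() can return None, it is safer to check the result before accessing attributes on it.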

### find_parents(), find_parent(), find_next_siblings(), find_next_sibling(), find_previous_siblings(), find_previous_sibling(), find_all_next(), find_next(), find_all_previous(), find_previous()

find_parents() is like find_all() and find_parent() is like find(), but they search only among the parents (ancestors) of a tag

In [68]:
a = soup.a
a.find_parents(lambda tag: tag.has_attr('id'))

[<div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table></div>,
 <div id="viewport"><div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></t

In [69]:
a.find_parent(lambda tag: tag.has_attr('id'))

<div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table></div>

find_next_siblings() and find_previous_siblings() are like find_all(), but they search among the next and previous siblings of a tag respectively.

Similarly, find_next_sibling() and find_previous_sibling() return only the first tag fulfilling the condition

In [70]:
soup.find('div', class_='g').find_next_sibling()

<div id="objects_container"><div class="e" id="root"><table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" value="AVoB7ote"/><input autocomplete="off" name="jazoest" type="hidden" value="2711"/><input name="m_ts" type="hidden" value="1581441609"/><input name="li" type="hidden" value="SeJCXoM6n5yMQMdE6mofS0DQ"/><input name="try_number" type="hidden" value="0"/><input name="unrecognized_tries" type="hidden" value="0"/><ul class="be bf bg"><li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" 

In [71]:
soup.find('a', string=re.compile('Por')).find_previous_siblings('b')

[<b class="ch">English (UK)</b>]

find_all_next() and find_all_previous() are like find_all(), but they search the next and previous tags in the parse tree of the HTML file respectively.

Similarly, find_next() and find_previous() return only the first tag fulfilling the conditions


In [72]:
soup.find('div', class_='g').find_all_next('table')

[<table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table>,
 <table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" value="AVoB7ote"/><input autocomplete="off" name="jazoest" type="hidden" value="2711"/><input name="m_ts" type="hidden"

In [73]:
soup.find('div', class_='g').find_previous(class_='f')

<div class="f"><div id="viewport"><div class="g h" id="header"><table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table></div><div id="objects_container"><div class="e" id="root"><table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" v
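The difference between the sibling methods and the parse-tree methods is easier to see on a small document. The following is a minimal sketch on a hypothetical miniature page (the markup, ids and variable names are made up for illustration and are not taken from the Facebook page above); it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
html = """
<div id="wrap">
  <p id="first">one</p>
  <b>bold</b>
  <p id="second">two</p>
  <table id="t1"></table>
</div>
<table id="t2"></table>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p", id="first")
second = soup.find("p", id="second")

# Sibling searches stay inside the same parent (div#wrap here).
next_sibling_names = [tag.name for tag in first.find_next_siblings()]
previous_b = second.find_previous_sibling("b")

# Parse-tree searches keep going past the end of the parent.
later_tables = [t["id"] for t in first.find_all_next("table")]

# Searching backwards can also return an ancestor, because the
# ancestor's opening tag comes earlier in the document.
nearest_previous_p = second.find_previous("p")
enclosing_div = second.find_previous("div")
```

Here the sibling search from the first paragraph never reaches the table with id "t2", while find_all_next() does, because that table lies outside the shared parent.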

### CSS Selectors
BeautifulSoup has a .select() method that takes a CSS selector, the same syntax CSS uses to pick elements for styling, and returns a list of all matching tags.

In [74]:
soup.select('title')

[<title>Facebook \u2013 log in or sign up</title>]

You can select tags nested inside other tags at any depth by separating the tag names with spaces

In [75]:
soup.select('body div div table')

[<table cellpadding="0" cellspacing="0" class="i"><tbody><tr><td valign="middle"><h1 style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Facebook</h1><img alt="facebook" class="j" height="16" src="https://static.xx.fbcdn.net/rsrc.php/v3/y8/r/k97pj8-or6s.png" width="77"/></td><td class="k" valign="middle"><a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a></td></tr></tbody></table>,
 <table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" value="AVoB7ote"/><input autocomplete="off" name="jazoest" type="hidden" value="2711"/><input name="m_ts" type="hidden"

You can select direct children (tags exactly one level below another tag) with >

In [76]:
soup.select('head > title')

[<title>Facebook \u2013 log in or sign up</title>]

You can find tags by id

In [77]:
soup.select('div#login_top_banner')

[<div id="login_top_banner"></div>]

You can find tags by CSS class

In [78]:
soup.select('.t')

[<table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" value="AVoB7ote"/><input autocomplete="off" name="jazoest" type="hidden" value="2711"/><input name="m_ts" type="hidden" value="1581441609"/><input name="li" type="hidden" value="SeJCXoM6n5yMQMdE6mofS0DQ"/><input name="try_number" type="hidden" value="0"/><input name="unrecognized_tries" type="hidden" value="0"/><ul class="be bf bg"><li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email" type="text"/></li><li class="bf"><div><span class="b

You can select siblings of tags: + matches only the sibling that immediately follows a tag, while ~ matches every later sibling

In [79]:
soup.select('div.g + div')

[<div id="objects_container"><div class="e" id="root"><table class="t"><tbody><tr><td class="u"><div class="v w x" id="login_error" style="display: none;"><div class="y"></div></div><div class="z ba"><div id="login_top_banner"></div><div class="bb"><form action="/login/device-based/regular/login/?refsrc=https%3A%2F%2F0.facebook.com%2F&amp;lwv=100&amp;refid=8" class="bc bd" id="login_form" method="post" novalidate="1"><input autocomplete="off" name="lsd" type="hidden" value="AVoB7ote"/><input autocomplete="off" name="jazoest" type="hidden" value="2711"/><input name="m_ts" type="hidden" value="1581441609"/><input name="li" type="hidden" value="SeJCXoM6n5yMQMdE6mofS0DQ"/><input name="try_number" type="hidden" value="0"/><input name="unrecognized_tries" type="hidden" value="0"/><ul class="be bf bg"><li class="bf"><span class="bh bi" id="u_0_0">Phone number or email address</span><input autocapitalize="off" autocomplete="on" autocorrect="off" class="bj bk bl" id="m_login_email" name="email"

In [80]:
soup.select('head title ~ meta')

[<meta content="default" id="meta_referrer" name="referrer"/>,
 <meta content="Create an account or log in to Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates." name="description"/>]

You can find tags that have a certain attribute, or an attribute with a certain value. An empty list means that no tag matched.

In [81]:
soup.select('a[href]')

[<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>,
 <a href="/recover/initiate/?c=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;r&amp;cuid&amp;ars=facebook_login&amp;lwv=100&amp;refid=8" id="forgot-password-link" tabindex="0">Forgotten password?</a>,
 <a class="ci" href="/a/language.php?l=as_IN&amp;lref=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;index=2&amp;sref=legacy_mbasic_footer&amp;gfid=AQBOoS6Z1friG7g2&amp;ref_component=mbasic_footer&amp;ref_page=%2Fwap%2Findex.php&amp;refid=8">\u0985\u09b8\u09ae\u09c0\u09af\u09bc\u09be</a>,
 <a class="ci" href="/a/language.php?l=pt_BR&amp;lref=https%3A%2F%2Fm.facebook.com%2F%3Fzero_e%3D6%26zero_et%3D1581441669%26refsrc%3Dhttps%253A%252F%252F0.facebook.com%252F&amp;index=4&amp;sref=legacy_mbasic_footer&amp;gfid=AQDK3Eh9WdQjDzE1&amp;ref_componen

In [82]:
soup.select('input[value=2699]')

[]

You can use the select_one() method to select the first tag that matches a CSS selector

In [83]:
soup.select_one('a')

<a class="l m n o p q" href="/reg/?cid=102&amp;refid=8"><span class="s">Create Account</span></a>
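The selector forms above can be tried together on a small document. This is a minimal sketch on a hypothetical page (the markup and variable names are assumptions made for illustration, not the Facebook page used above), parsed with the built-in "html.parser" backend. The attribute value is quoted in the selector, which is the form CSS accepts for any value, including ones that start with a digit.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
html = """
<html><head><title>Demo</title><meta name="a"/><meta name="b"/></head>
<body>
  <div class="g" id="header"><a href="/reg/">Create Account</a></div>
  <div id="content"><form><input name="n" value="42"/><a>no link</a></form></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

child_title = soup.select("head > title")        # direct child
by_id = soup.select("div#content")               # tag with a given id
adjacent = soup.select("div.g + div")            # immediately following sibling
following_meta = soup.select("head title ~ meta")  # every later sibling
with_href = soup.select("a[href]")               # attribute is present
by_value = soup.select('input[value="42"]')      # attribute has this value

first_link = soup.select_one("a")   # first match only
missing = soup.select_one("table")  # None when nothing matches
```

Note that select() always returns a list, possibly empty, while select_one() returns a single tag or None, so the result of select_one() should be checked before it is used.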

# JSON
JSON is a popular format for sending and receiving data through the web. 

JSON stands for JavaScript Object Notation

It is a plain-text format whose syntax follows the way object literals are written in JavaScript

Python has built-in support for JSON.

In [84]:
import json

## Parsing JSON - Converting from JSON to Python
If you have a JSON string, you can parse it by calling the json.loads() function. To parse JSON from an open file object, use json.load() instead.

In [85]:
x = '{ "name": "John", "age": 30, "hobbies": ["Gaming", "Cycling"] }'    # A JSON text

y = json.loads(x)                                          # Parse x

Because the top-level JSON value here is an object, the result will be a Python dictionary

In [86]:
print(y)

print(y["hobbies"])     # You can access elements like a normal Python dictionary

{u'age': 30, u'name': u'John', u'hobbies': [u'Gaming', u'Cycling']}
[u'Gaming', u'Cycling']
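As a small sketch of how parsing behaves beyond this example: json.loads() reads a string, json.load() reads from a file-like object, and JSON values map onto Python types (object to dict, array to list, null to None, true to True). The "pet" and "active" keys and the variable names below are assumptions added for illustration, and io.StringIO stands in for an opened file.

```python
import io
import json

text = ('{"name": "John", "age": 30, "hobbies": ["Gaming", "Cycling"], '
        '"pet": null, "active": true}')

# Parse from a string.
from_string = json.loads(text)

# Parse from a file-like object; io.StringIO stands in for open(...).
with io.StringIO(text) as fh:
    from_file = json.load(fh)

# The top-level value does not have to be an object.
top_level_list = json.loads("[1, 2, 3]")

# Invalid JSON (single quotes here) raises an error that is a ValueError.
error_message = None
try:
    json.loads("{'name': 'John'}")
except ValueError as err:
    error_message = str(err)
```

Catching ValueError works because json.JSONDecodeError is a subclass of it.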


## Converting from Python to JSON
Conversely, if you have a Python object, you can convert it into a JSON string using the json.dumps() function

In [87]:
x = {                                  # x is a Python dictionary
    "name": "John",
    "age": 30,
    "hobbies": ["Gaming", "Cycling"]
}

y = json.dumps(x)                       # Convert x to JSON string

print(y)                               # y is a JSON string

{"age": 30, "name": "John", "hobbies": ["Gaming", "Cycling"]}
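The conversion is not always a perfect mirror image of parsing. The sketch below, with made-up data and variable names, shows that a tuple is written as a JSON array (and so comes back as a list), that True and None become true and null, and that a type without a JSON equivalent, such as a set, is rejected with a TypeError.

```python
import json

# Hypothetical record, used only for illustration.
record = {"name": "John", "scores": (1, 2), "active": True, "pet": None}

encoded = json.dumps(record)   # Python object -> JSON string
decoded = json.loads(encoded)  # JSON string -> Python object

# Sets have no JSON counterpart, so json.dumps() refuses them.
try:
    json.dumps({"tags": {"a", "b"}})
except TypeError:
    set_supported = False
else:
    set_supported = True
```

After the round trip, decoded["scores"] is a list rather than a tuple, so a value that must stay a tuple has to be converted back by hand.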


## Formatting the result
Normally json.dumps() will put all the text in one line, which may be difficult to read.
If you want to make the JSON string easier to read, you can use the indent parameter in json.dumps()

In [88]:
print(json.dumps(x, indent=4))

{
    "age": 30, 
    "name": "John", 
    "hobbies": [
        "Gaming", 
        "Cycling"
    ]
}


If you want the keys in the result to appear in sorted order, pass sort_keys=True to json.dumps()

In [89]:
print(json.dumps(x, indent=4, sort_keys=True))

{
    "age": 30, 
    "hobbies": [
        "Gaming", 
        "Cycling"
    ], 
    "name": "John"
}
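One practical use of sort_keys is getting the same text for dictionaries that hold the same data but were built in a different order. The following is a minimal sketch with made-up dictionaries; it also uses the separators parameter of json.dumps() to produce a compact form with no spaces, which is the opposite trade-off from indent.

```python
import json

# Same data, different insertion order (hypothetical example data).
a = {"name": "John", "age": 30}
b = {"age": 30, "name": "John"}

# With sorted keys, both dictionaries serialize to identical text.
same_when_sorted = (json.dumps(a, sort_keys=True)
                    == json.dumps(b, sort_keys=True))

# Readable form: one key per line, indented by two spaces.
pretty = json.dumps(a, indent=2, sort_keys=True)

# Compact form: no whitespace after "," or ":".
compact = json.dumps(a, separators=(",", ":"), sort_keys=True)
```

The readable form suits files that people read or compare, while the compact form keeps the text short when it is only sent between programs.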
