## Parsing HTML documents (web pages) with Beautiful Soup
Note: This notebook is based on the example from: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

#### Keep in mind that a website may not be as straightforward as illustrated here. Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to scrape anything from the page. 

#### Making an HTTP request from your Python script is different from how you access a page from your browser. Just because you can log in to the page through your browser doesn’t mean you’ll be able to scrape it with your Python script. However, the requests library comes with the built-in capacity to handle authentication. With these techniques, you can log in to websites when making the HTTP request from your Python script and then scrape information that’s hidden behind a login. 
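
#### As a rough sketch of what that built-in authentication support can look like, the snippet below builds (but does not send) a request with HTTP Basic credentials. The URL, username, and password are placeholders chosen for illustration. Many sites use a form-based login instead, which needs a different approach.

```python
import requests

# Placeholder URL and credentials (assumptions for illustration only).
url = "https://example.com/private"
request = requests.Request("GET", url, auth=("my_user", "my_password"))

# prepare() builds the final request without sending anything over the network.
prepared = request.prepare()

# HTTP Basic credentials travel in the Authorization header.
print(prepared.method, prepared.url)
print(prepared.headers["Authorization"].startswith("Basic "))

# To actually send the request, one would call:
# response = requests.get(url, auth=("my_user", "my_password"))
```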

#### Static sites are straightforward to work with because the server sends you an HTML page that contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.

#### With a dynamic website, the server might not send back any HTML. Instead, you could receive JavaScript code as a response. This code will look completely different from what we are inspecting here.
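
#### To make the difference concrete, here is a small sketch with a made-up response (dynamic_doc is hypothetical, not taken from a real site). The visible text would only appear after a browser runs the script. Beautiful Soup parses the markup but does not execute JavaScript, so the container stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical response from a dynamic site: the text is only created
# when a browser executes the script.
dynamic_doc = """
<html><body>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Elsie, Lacie and Tillie";</script>
</body></html>
"""

soup_dynamic = BeautifulSoup(dynamic_doc, "html.parser")
app = soup_dynamic.find(id="app")

# The script was never run, so there is nothing inside the div to scrape.
print(repr(app.get_text()))        # ''
print(soup_dynamic.find_all("a"))  # []
```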

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

### Navigating the data structure

#### It can be challenging to wrap your head around a long block of HTML code. To make it easier to read, the prettify() method in BeautifulSoup formats the output and shows how the tags are nested in the document.

In [17]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify()) # the html uses class names

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


### Looking at the title 

In [3]:
soup.title

<title>The Dormouse's story</title>

In [5]:
soup.title.string # parsing out the string

"The Dormouse's story"

In [6]:
soup.title.parent.name

'head'
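
#### The .string attribute works for the title because that tag holds a single piece of text. As a small sketch of the other case, a tag with several children returns None for .string, while get_text() joins all the text inside it. The html_doc is repeated here only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# First paragraph with class "story": it mixes text and three <a> tags.
story = soup.find("p", class_="story")

print(story.string)       # None, because the tag has more than one child
print(story.get_text())   # all text inside the paragraph, tags removed
print(story.parent.name)  # body
```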

### Looking at the paragraph's class attribute

In [7]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [8]:
soup.p['class']

['title']
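
#### Other attributes can be read the same way. The sketch below uses a one-link fragment of the document (the variable name fragment is just for this example). Note that class comes back as a list because a tag may carry several classes, and that get() returns None for a missing attribute instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# One link from the document above, repeated so the snippet runs on its own.
fragment = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(fragment, "html.parser").a

print(link["href"])       # http://example.com/elsie
print(link["class"])      # ['sister']
print(link.get("title"))  # None: the tag has no title attribute
print(link.attrs)         # all attributes as a dictionary
```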

### Checking links

#### The a tag defines a hyperlink, which is used to link from one page to another.

In [9]:
soup.a # locates the first occurrence

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [18]:
soup.find_all('a') # find all the a tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
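
#### The search can also be narrowed. As a minimal sketch, find_all() accepts a class_ filter, and select() takes a CSS selector. The html_doc is repeated here only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) avoids the Python keyword "class".
sisters = soup.find_all("a", class_="sister")
print([a["id"] for a in sisters])            # ['link1', 'link2', 'link3']

# CSS selector: <a> tags that are direct children of <p class="story">.
story_links = soup.select("p.story > a")
print([a.get_text() for a in story_links])   # ['Elsie', 'Lacie', 'Tillie']
```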

#### Let us check the p tags, which define the paragraphs

In [11]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [19]:
paragraphs = soup.find_all('p')  # store the result in a variable

In [20]:
paragraphs[2]  # I am interested in the third paragraph, the second one with class "story"


<p class="story">...</p>

In [24]:
soup.find(id="link3") # find a specific link by id. Note the output

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
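
#### When nothing matches, find() returns None rather than raising an error, so it is worth checking the result before using it. A small sketch, where link4 is a made-up id that does not exist in the document:

```python
from bs4 import BeautifulSoup

# The three links from the document above, repeated so the snippet runs on its own.
links_html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup_links = BeautifulSoup(links_html, "html.parser")

missing = soup_links.find(id="link4")  # hypothetical id, not in the document
print(missing)                         # None

tillie = soup_links.find(id="link3")
if tillie is not None:                 # guard before reading attributes
    print(tillie["href"])              # http://example.com/tillie
```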

In [25]:
# To find the URLs

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
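
#### Instead of printing the URLs, one option is to collect them in a dictionary keyed by the link text. This is only a sketch. The name links_by_name is chosen for this example.

```python
from bs4 import BeautifulSoup

# The three links from the document above, repeated so the snippet runs on its own.
links_html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup_links = BeautifulSoup(links_html, "html.parser")

# Map each link's visible text to its href attribute.
links_by_name = {a.get_text(): a.get("href") for a in soup_links.find_all("a")}

print(links_by_name["Lacie"])  # http://example.com/lacie
```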


In [26]:
# Another common task is extracting all the text from a page:

print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
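
#### The raw get_text() output keeps the blank lines and line breaks from the HTML. As a minimal sketch, stripped_strings yields each piece of text with the surrounding whitespace removed, and get_text() accepts a separator and a strip flag. The html_doc is repeated here only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Each text fragment, stripped, with whitespace-only fragments skipped.
pieces = list(soup.stripped_strings)
print(pieces[0])  # The Dormouse's story

# The same fragments joined with a single space.
text = soup.get_text(" ", strip=True)
print(text)
```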



## References

https://blog.hartleybrody.com/web-scraping-cheat-sheet/#using-beautifulsoup
