# Introduction to BeautifulSoup

    Here you'll learn the main BeautifulSoup attributes and methods.
    I tried to make this pretty simple, so you can learn fast.
    
Video that helped me create this tutorial: https://www.youtube.com/watch?v=4UcqECQe5Kc&list=WL&index=20

In [1]:
from bs4 import BeautifulSoup
# We won't need to import requests (since we already have the 'html')

In [95]:
html_doc = """
<div class="header">
  <h2>Blog Name</h2>
</div>

<div class="row">
  <div class="leftcolumn">
    <div class="card">
      <h2>TITLE HEADING</h2>
      <h5>Title description, Dec 7, 2017</h5>
      <div class="fakeimg" style="height:200px;">Image</div>
      <p>Some text..</p>
    </div>
    <div class="card">
      <h2>TITLE HEADING</h2>
      <h5>Title description, Sep 2, 2017</h5>
      <div class="fakeimg" style="height:200px;">Image</div>
      <p>Some text..</p>
    </div>
  </div>
  <div class="rightcolumn">
    <div class="card">
      <h2>About Me</h2>
      <div class="fakeimg" style="height:100px;">Image</div>
      <p>Some text about me in culpa qui officia deserunt mollit anim..</p>
    </div>
    <div class="card">
      <h3>Popular Post</h3>
      <div class="fakeimg">Image</div><br>
      <div class="fakeimg">Image</div><br>
      <div class="fakeimg">Image</div>
    </div>
    <div class="stack-it">
      <span>    
          <h3>Popular Post</h3>
          <div class="fakeimg">Image</div><br>
          <div class="fakeimg">Image</div><br>
          <div class="fakeimg">Image</div>
      </span>
    </div>
    <div class="card">
      <h3 id="follow">Follow Me</h3>
      <p>Some text..</p>
    </div>
  </div>
</div>

<div class="footer">
  <h2>Footer</h2>
</div>
"""

### As you'll see, BeautifulSoup makes sense of the HTML document
        Even though html_doc is a string, bs4 parses it and converts it into
        an object with attributes and methods that help us
        get specific information.


In [96]:
# ps: BeautifulSoup is a class; soup is an instance of it.
soup = BeautifulSoup(html_doc, 'html.parser')
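
A quick way to convince yourself that the string really became an object is to check the types. This is just a small sketch: `sample_html` and `sample_soup` are made-up names for a tiny stand-in document, not the `html_doc` above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-line document, only for this sketch.
sample_html = '<div class="header"><h2>Blog Name</h2></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_html).__name__)         # str: the raw input is plain text
print(type(sample_soup).__name__)         # BeautifulSoup: the parsed document
print(isinstance(sample_soup.div, Tag))   # True: each element is a Tag object
print(sample_soup.div.name)               # div
```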

In [97]:
# Printing soup shows the content like this; .prettify() gives a nicer view
print(soup)


<div class="header">
<h2>Blog Name</h2>
</div>
<div class="row">
<div class="leftcolumn">
<div class="card">
<h2>TITLE HEADING</h2>
<h5>Title description, Dec 7, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p>Some text..</p>
</div>
<div class="card">
<h2>TITLE HEADING</h2>
<h5>Title description, Sep 2, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p>Some text..</p>
</div>
</div>
<div class="rightcolumn">
<div class="card">
<h2>About Me</h2>
<div class="fakeimg" style="height:100px;">Image</div>
<p>Some text about me in culpa qui officia deserunt mollit anim..</p>
</div>
<div class="card">
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</div>
<div class="stack-it">
<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</span>
</div>
<div class="card">
<h3 id="follow">Follow Me</h3

In [98]:
# ...it's possible to see a better-formatted HTML version

print(soup.prettify())

<div class="header">
 <h2>
  Blog Name
 </h2>
</div>
<div class="row">
 <div class="leftcolumn">
  <div class="card">
   <h2>
    TITLE HEADING
   </h2>
   <h5>
    Title description, Dec 7, 2017
   </h5>
   <div class="fakeimg" style="height:200px;">
    Image
   </div>
   <p>
    Some text..
   </p>
  </div>
  <div class="card">
   <h2>
    TITLE HEADING
   </h2>
   <h5>
    Title description, Sep 2, 2017
   </h5>
   <div class="fakeimg" style="height:200px;">
    Image
   </div>
   <p>
    Some text..
   </p>
  </div>
 </div>
 <div class="rightcolumn">
  <div class="card">
   <h2>
    About Me
   </h2>
   <div class="fakeimg" style="height:100px;">
    Image
   </div>
   <p>
    Some text about me in culpa qui officia deserunt mollit anim..
   </p>
  </div>
  <div class="card">
   <h3>
    Popular Post
   </h3>
   <div class="fakeimg">
    Image
   </div>
   <br/>
   <div class="fakeimg">
    Image
   </div>
   <br/>
   <div class="fakeimg">
    Image
   </div>
  </div>
  <div class

#### Using attributes of the bs4 object (named after the HTML tags), we can fetch the information we want

In [99]:
# This approach gives us the first matching tag, but the content we want isn't always in the first tag
soup.div

<div class="header">
<h2>Blog Name</h2>
</div>

In [100]:
# Let's say we need TITLE HEADING. We know it is inside an h2 tag, but as I said, this only gives us the first one
soup.h2 # First h2 tag

<h2>Blog Name</h2>

In [101]:
# To get TITLE HEADING we can use the find() method, passing a tag name and an attribute as arguments.
# Since the TITLE HEADING h2 doesn't have an attribute, let's look for the tag that contains it

soup.find('div', class_= 'card')

<div class="card">
<h2>TITLE HEADING</h2>
<h5>Title description, Dec 7, 2017</h5>
<div class="fakeimg" style="height:200px;">Image</div>
<p>Some text..</p>
</div>

In [102]:
# We can also assign this result to a variable (it is still a bs4 object)
specific_heading = soup.find('div', class_= 'card')


# and now we got it
specific_heading.h2

<h2>TITLE HEADING</h2>

In [103]:
# To get only the actual content, you can use the .text attribute

specific_heading.h2.text

'TITLE HEADING'
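
One thing to watch out for: when nothing matches, `find()` returns `None`, and so does attribute access like `.h2`. Calling `.text` on that raises an `AttributeError`. Here is a minimal sketch of a guard; the helper `heading_text` and the sample document are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = '<div class="card"><h2>TITLE HEADING</h2></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

def heading_text(soup, css_class):
    """Return the text of the first <h2> inside the first div with
    the given class, or None when either one is missing."""
    card = soup.find("div", class_=css_class)
    if card is None or card.h2 is None:
        return None
    return card.h2.text

print(heading_text(sample_soup, "card"))     # TITLE HEADING
print(heading_text(sample_soup, "missing"))  # None
```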

#### The .find_all() method (findAll() is the older name for it) works similarly, but as its name says, it returns all the tags that match the arguments we pass. The result works just like a list.

In [104]:
# All divs
soup.findAll('div')

[<div class="header">
 <h2>Blog Name</h2>
 </div>,
 <div class="row">
 <div class="leftcolumn">
 <div class="card">
 <h2>TITLE HEADING</h2>
 <h5>Title description, Dec 7, 2017</h5>
 <div class="fakeimg" style="height:200px;">Image</div>
 <p>Some text..</p>
 </div>
 <div class="card">
 <h2>TITLE HEADING</h2>
 <h5>Title description, Sep 2, 2017</h5>
 <div class="fakeimg" style="height:200px;">Image</div>
 <p>Some text..</p>
 </div>
 </div>
 <div class="rightcolumn">
 <div class="card">
 <h2>About Me</h2>
 <div class="fakeimg" style="height:100px;">Image</div>
 <p>Some text about me in culpa qui officia deserunt mollit anim..</p>
 </div>
 <div class="card">
 <h3>Popular Post</h3>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div>
 </div>
 <div class="stack-it">
 <span>
 <h3>Popular Post</h3>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div>
 </span>
 </div>
 <div clas

In [105]:
# All h2s
soup.findAll('h2')

[<h2>Blog Name</h2>,
 <h2>TITLE HEADING</h2>,
 <h2>TITLE HEADING</h2>,
 <h2>About Me</h2>,
 <h2>Footer</h2>]

In [106]:
# All divs whose class attribute is equal to 'fakeimg'
soup.findAll('div', class_="fakeimg")

[<div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:100px;">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>]
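
`find_all()` also accepts a list of tag names and a `limit` argument, and it returns an empty list (not `None`) when nothing matches. A small sketch on a made-up document (`sample_html` is not the `html_doc` above):

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = """
<h2>Blog Name</h2>
<h3>Popular Post</h3>
<h2>About Me</h2>
<h3 id="follow">Follow Me</h3>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of them, in document order.
headings = sample_soup.find_all(["h2", "h3"])
print([h.text for h in headings])

# limit stops the search after that many matches.
first_h2_only = sample_soup.find_all("h2", limit=1)
print(len(first_h2_only))   # 1

# No match gives an empty list, so looping over it is always safe.
no_tables = sample_soup.find_all("table")
print(len(no_tables))       # 0
```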

#### Another way to get specific information is 'stacking' the attributes

In [107]:
soup.find('div', class_="stack-it")

<div class="stack-it">
<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</span>
</div>

In [108]:
stacking_attributes = soup.find('div', class_="stack-it")

stacking_attributes.span

<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</span>

In [109]:
stacking_attributes.span.h3

<h3>Popular Post</h3>

In [110]:
stacking_attributes.span.h3.text

'Popular Post'
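
Stacking attributes is shorthand for chaining `find()` calls: each `.tagname` gives the first tag with that name below the current one. A minimal sketch on a made-up snippet, showing that both routes land on the same element:

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = '<div class="stack-it"><span><h3>Popular Post</h3></span></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

box = sample_soup.find("div", class_="stack-it")

via_attributes = box.span.h3              # stacked attribute access
via_find = box.find("span").find("h3")    # the same walk, spelled out

print(via_attributes is via_find)   # True: same element in the tree
print(via_attributes.text)          # Popular Post
```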

#### The .contents attribute returns the tag's direct children (the tags and text directly inside that tag)

In [158]:
# in a list format
stacking_attributes.contents

['\n',
 <span>
 <h3>Popular Post</h3>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div>
 </span>,
 '\n']

In [164]:
stacking_attributes.contents[0]

'\n'

In [167]:
'''
    Notice that <span> is a direct child, but <h3> and the <div>s are children of <span>,
    that's why the second item in the list is the whole <span>
'''
stacking_attributes.contents[1]

<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</span>
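
Since `.contents` also includes the `'\n'` strings between tags, you often want only the child tags. Two ways to do that are sketched below, on a made-up snippet: filter by type, or ask `find_all()` not to go deeper than the direct children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical small document, only for this sketch.
sample_html = """<div class="stack-it">
<span><h3>Popular Post</h3></span>
</div>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
box = sample_soup.find("div", class_="stack-it")

# Newline, <span>, newline
print(len(box.contents))   # 3

# Option 1: keep only the items that are tags.
tag_children = [child for child in box.contents if isinstance(child, Tag)]
print([child.name for child in tag_children])   # ['span']

# Option 2: match every tag (True), but only direct children.
direct_tags = box.find_all(True, recursive=False)
print([child.name for child in direct_tags])    # ['span']
```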

#### We can use .select() to select tags with a CSS selector, for example directly by class
    ps: It returns a list

In [155]:
soup.select(".stack-it")

[<div class="stack-it">
 <span>
 <h3>Popular Post</h3>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div>
 </span>
 </div>]

In [119]:
soup.select(".fakeimg")

[<div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:200px;">Image</div>,
 <div class="fakeimg" style="height:100px;">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>]

List characteristics:

In [144]:
soup.select(".fakeimg")[0]

<div class="fakeimg" style="height:200px;">Image</div>

In [145]:
soup.select(".fakeimg")[2:5]

[<div class="fakeimg" style="height:100px;">Image</div>,
 <div class="fakeimg">Image</div>,
 <div class="fakeimg">Image</div>]
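
Because `.select()` takes CSS selectors, you can combine tag names, classes, ids and nesting in one string. There is also `.select_one()`, which returns only the first match (or `None`). A short sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = """
<div class="card"><h3>Popular Post</h3><div class="fakeimg">Image</div></div>
<div class="stack-it"><span><h3>Popular Post</h3><div class="fakeimg">Image</div></span></div>
<div class="card"><h3 id="follow">Follow Me</h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_fake = sample_soup.select("div.fakeimg")            # tag + class
nested_fake = sample_soup.select(".stack-it .fakeimg")  # only inside .stack-it
follow = sample_soup.select_one("#follow")              # by id, first match
missing = sample_soup.select_one(".missing")            # no match

print(len(all_fake))      # 2
print(len(nested_fake))   # 1
print(follow.text)        # Follow Me
print(missing)            # None
```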

#### .get_text(), as you probably guessed... yeah, it gets the text
    Just like the .text attribute.
    However, .get_text() also supports keyword arguments that change how it behaves
    (separator, strip, types). If you need more control over the result, use the method form.

In [120]:
soup.find('div', class_="stack-it").get_text()

'\n\nPopular Post\nImage\nImage\nImage\n\n'

In [121]:
soup.find('div', class_="stack-it").text

'\n\nPopular Post\nImage\nImage\nImage\n\n'

In [129]:
# You can specify a string to be used to join the bits of text together:
soup.find('div', class_="stack-it").get_text('|')

'\n|\n|Popular Post|\n|Image|\n|Image|\n|Image|\n|\n'

In [130]:
soup.find('div', class_="stack-it").get_text(' ')

'\n \n Popular Post \n Image \n Image \n Image \n \n'

In [140]:
# You can strip the whitespace around each piece of text (including '\n') too
soup.find('div', class_="stack-it").get_text(' ', strip=True)

'Popular Post Image Image Image'
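
If you would rather have the pieces of text as a list instead of one joined string, the `.stripped_strings` generator gives them one by one, already stripped. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = """<div class="stack-it">
<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br>
<div class="fakeimg">Image</div>
</span>
</div>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
box = sample_soup.find("div", class_="stack-it")

# Each non-empty piece of text, with surrounding whitespace removed.
pieces = list(box.stripped_strings)
print(pieces)   # ['Popular Post', 'Image', 'Image']

# Joining the pieces gives the same result as get_text with strip=True.
print(" ".join(pieces) == box.get_text(" ", strip=True))   # True
```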

#### Let's run some tests hehe

In [143]:
# A list
soup.select(".stack-it")

[<div class="stack-it">
 <span>
 <h3>Popular Post</h3>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div><br/>
 <div class="fakeimg">Image</div>
 </span>
 </div>]

In [147]:
# Each item from that list
for item in soup.select(".stack-it"):
    print(item)    

<div class="stack-it">
<span>
<h3>Popular Post</h3>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div><br/>
<div class="fakeimg">Image</div>
</span>
</div>


In [148]:
# Since we're working with a bs4 object, we can print only the text content:
for item in soup.select(".stack-it"):
    print(item.text)



Popular Post
Image
Image
Image




In [150]:
# Funny
for item in soup.select(".stack-it"):
    print(item.get_text('-'))


-
-Popular Post-
-Image-
-Image-
-Image-
-



In [152]:
# No '\n'
for item in soup.select(".stack-it"):
    print(item.get_text('-', strip=True))

Popular Post-Image-Image-Image


#### Oh, with .find() we can also look for specific identifiers

In [172]:
soup.find(id='follow')

<h3 id="follow">Follow Me</h3>

In [170]:
# Same result? No! Here we get a list (of all tags whose id is 'follow')
soup.findAll(id='follow')

[<h3 id="follow">Follow Me</h3>]
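
The `id=` keyword is a shortcut: the same filter can be written with an `attrs` dictionary, and the attribute value can be read back from the tag with square brackets. The sketch below uses a made-up snippet and also shows what the two methods return when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = '<div class="card"><h3 id="follow">Follow Me</h3><p>Some text..</p></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_keyword = sample_soup.find(id="follow")
by_attrs = sample_soup.find(attrs={"id": "follow"})

print(by_keyword is by_attrs)   # True: both find the same element
print(by_keyword["id"])         # follow

# No match: find() gives None, find_all() gives an empty list.
no_tag = sample_soup.find(id="nope")
no_tags = sample_soup.find_all(id="nope")
print(no_tag is None, len(no_tags))   # True 0
```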

#### Now, heading back to the .contents/children subject: we can find a tag's parent (the tag that contains the tag we have) using .find_parent() (findParent() is the older name for it)

In [178]:
# The div is the parent of <h3 id="follow">
soup.find(id='follow').findParent()

<div class="card">
<h3 id="follow">Follow Me</h3>
<p>Some text..</p>
</div>

#### And it is possible to find its siblings with find_next_sibling() (findNextSibling() is the older name for it)

In [179]:
soup.find(id='follow').findNextSibling()

<p>Some text..</p>

In [180]:
soup.find(id='follow').find_next_sibling()

<p>Some text..</p>
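
A detail that is easy to trip over: the `.next_sibling` attribute returns whatever comes next, which is often just the `'\n'` between two tags, while `find_next_sibling()` skips ahead to the next tag. Both `find_parent()` and `find_next_sibling()` also accept the same filters as `find()`. A small sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical small document, only for this sketch.
sample_html = """<div class="rightcolumn">
<div class="card">
<h3 id="follow">Follow Me</h3>
<p>Some text..</p>
</div>
</div>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
heading = sample_soup.find(id="follow")

# The raw next node is the newline text between </h3> and <p>.
print(repr(heading.next_sibling))

# find_next_sibling() looks only at tags, optionally filtered by name.
print(heading.find_next_sibling("p").text)   # Some text..

# find_parent() with filters climbs until it finds a matching ancestor.
outer = heading.find_parent("div", class_="rightcolumn")
print(outer["class"])                        # ['rightcolumn']

# No tag comes before the <h3> inside its parent.
print(heading.find_previous_sibling())       # None
```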

### That's it. Of course there is a lot more, but these are the main methods and attributes you'll use.