<a href="https://colab.research.google.com/github/punctuationmarks/Python-Libraries/blob/master/BeautifulSoup4_GC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Beautiful Soup basics
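Beautiful Soup (bs4) works best on static HTML. If the content you need is generated dynamically by JavaScript, you'll most likely need a different library, because bs4 only parses the HTML it is given and never runs scripts.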
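To see what "static only" means in practice, here is a minimal sketch on a made-up page (the markup and the `books` id are assumptions for illustration, not taken from a real site). In a browser, the script would fill the list; bs4 just sees an empty `<ul>` and the script's source text.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <ul> is empty in the HTML and a browser would fill it via the script.
dynamic_html = """
<html><body>
  <ul id="books"></ul>
  <script>
    document.getElementById("books").innerHTML = "<li>Sharp Objects</li>";
  </script>
</body></html>
"""

dynamic_soup = BeautifulSoup(dynamic_html, "html.parser")

# bs4 never runs the script, so the list items a browser would show are absent.
list_items = dynamic_soup.find_all("li")
print(len(list_items))

# The script source is still available, but only as plain text.
script_text = dynamic_soup.find("script").string
print("innerHTML" in script_text)
```

If the page you want looks like this, the data usually comes from somewhere else (an API call, for example), or you need a tool that drives a real browser.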

In [0]:
# dependencies

from bs4 import BeautifulSoup # beautifulsoup4, for web scraping
import requests # for loading the website

In [0]:
# a good practice website is "http://toscrape.com/"

requested_page = requests.get("http://books.toscrape.com/")
type(requested_page)

requests.models.Response

In [0]:
# the Response object is not the HTML itself, so you have to grab the content of the page
content = requested_page.content
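The request above assumes everything goes well. A slightly more defensive version might look like the sketch below. `fetch_html` is a made-up helper name and the 10-second timeout is an arbitrary choice; `timeout`, `raise_for_status()` and `requests.exceptions.RequestException` are part of requests itself.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    Hypothetical helper for this notebook, not part of requests.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.content

# Usage, same site as above:
# content = fetch_html("http://books.toscrape.com/")
```

Returning `None` keeps the notebook running when a page is unreachable; in a script you might prefer to let the exception propagate.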

In [0]:
# this is the page's raw HTML: every line squished together into one long bytes string
content




In [0]:
type(content)

bytes

In [0]:
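# this is where Beautiful Soup comes into play
# "html.parser" is the parser built into Python; if you leave the argument out,
# bs4 picks the best parser installed and warns you about it
soup = BeautifulSoup(content, "html.parser")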
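If you want to experiment without hitting the network, the same call works on a small inline document. The markup below is a made-up stand-in for `requested_page.content` (bytes, like the real thing), and the `sample_` names are just for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for requested_page.content (hypothetical markup).
sample_content = (
    b"<html><head><title>Books to Scrape</title></head>"
    b"<body><p>In stock</p></body></html>"
)

sample_soup = BeautifulSoup(sample_content, "html.parser")

# Tags can be reached as attributes of the soup,
# and their text comes back as str, not bytes.
print(sample_soup.title.text)
print(type(sample_soup.title.text))
print(sample_soup.p.text)
```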

In [0]:
# look how much prettier that looks. Well, at least it's readable
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [0]:
# prettify() is mostly useful when you can't open the site in a browser.
# Otherwise it's usually easier to inspect the page there:
# right click, "Inspect" (or something to that effect).
# Then it's detective work to figure out how the developer built the website
# and where the information you want lives.
  

# now find what you want and loop through it

# in this book example, let's say we want the titles of the books for sale
# they're listed in an ordered list with a class of "row"
# and each title is in an h3 tag inside an article with the class "product_pod"
# (the price is in a div with the class "product_price" inside the same article)

grouping_title_and_price = soup.find_all("article", {"class":"product_pod"})
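The dictionary form used above is one of several ways to filter by class. As a sketch on made-up markup (the `sidebar` article is an assumption added to show what gets filtered out), these three calls should select the same tags: the attribute dictionary, the `class_` keyword, and a CSS selector through `select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the book listing.
listing_html = """
<ol class="row">
  <li><article class="product_pod"><h3><a title="Sharp Objects">Sharp Objects</a></h3></article></li>
  <li><article class="product_pod"><h3><a title="Olio">Olio</a></h3></article></li>
  <li><article class="sidebar">Not a book</article></li>
</ol>
"""
listing_soup = BeautifulSoup(listing_html, "html.parser")

by_dict = listing_soup.find_all("article", {"class": "product_pod"})
by_keyword = listing_soup.find_all("article", class_="product_pod")  # class is a Python keyword, hence class_
by_selector = listing_soup.select("article.product_pod")             # CSS selector syntax

print(len(by_dict), len(by_keyword), len(by_selector))
print([article.h3.text for article in by_selector])
```

Which one to use is mostly taste; `select()` tends to read better once the path to the tag gets long.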

In [0]:
# printing this is still a lot, but it narrows things down if you didn't
# find everything you needed at first.
# Also, what prints is a list, and every article with that class
# is separated by a comma (good to know)
# (Side note: it's also nice to double check that you got all of the books)
grouping_title_and_price

[<article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>, <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thum

In [0]:
type(grouping_title_and_price)

bs4.element.ResultSet

In [0]:
# since find_all() returns a ResultSet, which is a list subclass, it supports indexing
grouping_title_and_price[3]

<article class="product_pod">
<div class="image_container">
<a href="catalogue/sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>
</div>
<p class="star-rating Four">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
<div class="product_price">
<p class="price_color">£47.82</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
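Because a `ResultSet` is a list subclass, the usual list tools apply, not just a single index. A quick sketch on a made-up three-item list (the titles are borrowed from the site, the `<ul>` markup is an assumption):

```python
from bs4 import BeautifulSoup

sample_list_html = "<ul><li>Olio</li><li>Soumission</li><li>Set Me Free</li></ul>"
items = BeautifulSoup(sample_list_html, "html.parser").find_all("li")

print(isinstance(items, list))          # ResultSet subclasses list
print(len(items))                       # handy for checking you got everything
print(items[-1].text)                   # negative indexes work
print([li.text for li in items[:2]])    # so does slicing
```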

In [0]:
# to grab just the title
# we can apply the find() or find_all() method (depending on the scenario)
# and ask for just the text
grouping_title_and_price[3].find("h3").text

'Sharp Objects'
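One thing to watch: `.text` returns what the page displays, and the site shortens long titles (see "A Light in the ..." in the output further up). The full title sits in the `title` attribute of the `<a>` tag, which you can read with dictionary-style access. A sketch using markup trimmed from the first article printed above:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first article from the output above.
article_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
  </div>
</article>
"""
sample_article = BeautifulSoup(article_html, "html.parser").find("article")

link = sample_article.find("h3").find("a")
print(link.text)       # the shortened text shown on the page
print(link["title"])   # the full title stored in the attribute
print(sample_article.find("p", class_="price_color").text)
```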

In [0]:
# here's an example of using find_all()
# (this would be necessary if you needed a specific one of a group or
# if you wanted to loop over a group)
# (the difference is that find_all() returns a list, so you index into it)
grouping_title_and_price[3].find_all("h3")[0].text

'Sharp Objects'
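The two methods also differ when nothing matches: `find()` returns `None`, while `find_all()` returns an empty list. That matters because calling `.text` on `None` raises an `AttributeError`. A small sketch (the `<h4>` lookup is a made-up example of a tag that isn't there):

```python
from bs4 import BeautifulSoup

lonely_article = BeautifulSoup(
    "<article><h3>Sharp Objects</h3></article>", "html.parser"
).find("article")

missing_one = lonely_article.find("h4")       # no <h4> in this article
missing_all = lonely_article.find_all("h4")

print(missing_one)   # a single missing tag comes back as None
print(missing_all)   # a missing group comes back as an empty list

# Guard before touching .text so a missing tag doesn't crash the loop.
subtitle_text = missing_one.text if missing_one is not None else ""
print(repr(subtitle_text))
```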

In [0]:
help(BeautifulSoup)

Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
 |  This class defines the basic interface called by the tree builders.
 |  
 |  These methods will be called by the parser:
 |    reset()
 |    feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    handle_starttag(name, attrs) # See note about return value
 |    handle_endtag(name)
 |    handle_data(data) # Appends to the current data node
 |    endData(containerClass=NavigableString) # Ends the current data node
 |  
 |  No matter how complicated the underlying parser is, you should be
 |  able to build a tree using 'start tag' events, 'end tag' events,
 |  'data' events, and "done with data" events.
 |  
 |  If you encounter an empty-element tag (aka a self-closing tag,
 |  like HTML's <br> tag), call handle_starttag and then
 |  handle_endtag.
 |  
 |  Method resolution order:
 |      BeautifulSoup
 |      bs4.element.Tag
 |      bs4.element.PageElement
 | 

In [0]:
# careful: a Tag has no .type attribute. bs4 reads tag.type as
# "find a child tag named <type>", and there is none, so this prints None
# (use type(tag) for the Python type or tag.name for the tag's name)
print(grouping_title_and_price[3].find_all("h3")[0].type)

None
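That `None` comes from how bs4 handles attribute access on a tag: an unknown attribute name is treated as a search for a child tag with that name. The sketch below (on a one-line stand-in for the `h3` above) contrasts that with the attributes that do describe the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

heading = BeautifulSoup(
    "<h3><a title='Sharp Objects'>Sharp Objects</a></h3>", "html.parser"
).find("h3")

print(heading.type)              # None: bs4 looked for a child <type> tag and found none
print(heading.a)                 # the same shortcut finds the real child <a> tag
print(heading.name)              # the tag's own name
print(isinstance(heading, Tag))  # the Python type is bs4.element.Tag
print(heading.a.attrs)           # the tag's HTML attributes as a dict
```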


In [0]:
# speaking of for loops, let's loop over all of this, starting with the titles

print("Titles: \t")
for article in grouping_title_and_price:
  print(article.find("h3").text)

Titles: 	
A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History ...
The Requiem Red
The Dirty Little Secrets ...
The Coming Woman: A ...
The Boys in the ...
The Black Maria
Starving Hearts (Triangular Trade ...
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little ...
Rip it Up and ...
Our Band Could Be ...
Olio
Mesaerion: The Best Science ...
Libertarianism for Beginners
It's Only the Himalayas
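Printing is fine for a first look, but you'll usually want to keep the results. Here is one possible way to collect titles and prices into a list of dictionaries, sketched on two articles whose data appears in the outputs above. The record layout (`title`, `price`) and the choice to turn the price into a float are assumptions, not something the site dictates.

```python
from bs4 import BeautifulSoup

# Two articles trimmed from the outputs above.
two_books_html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a title="Sharp Objects">Sharp Objects</a></h3>
  <p class="price_color">£47.82</p>
</article>
"""
sample_articles = BeautifulSoup(two_books_html, "html.parser").find_all(
    "article", class_="product_pod"
)

books = []
for article in sample_articles:
    full_title = article.find("h3").find("a")["title"]   # attribute, not the shortened text
    price_text = article.find("p", class_="price_color").text
    books.append({"title": full_title, "price": float(price_text.lstrip("£"))})

for book in books:
    print(f"{book['title']}: {book['price']:.2f}")
```

On the real page you would loop over `grouping_title_and_price` instead of `sample_articles`.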
