<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Using-BeautifulSoup" data-toc-modified-id="Using-BeautifulSoup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Using BeautifulSoup</a></span></li><li><span><a href="#Using-Loops-to-Print-Output" data-toc-modified-id="Using-Loops-to-Print-Output-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Using Loops to Print Output</a></span></li><li><span><a href="#Parsing-from-innerHTML" data-toc-modified-id="Parsing-from-innerHTML-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parsing from innerHTML</a></span></li></ul></div>

# Intro to Parsing with BeautifulSoup

Import the BeautifulSoup and requests libraries

In [8]:
import requests as r
from bs4 import BeautifulSoup

# Save the URL in a variable so it can be reused
urltoget = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'

Request the products page, store the response in `res`, and examine its content
- `\t` refers to tabs
- `\n` refers to newlines
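
As a small offline illustration of why those escape sequences show up literally: `res.content` is a `bytes` object, and displaying a `bytes` object shows its `repr`, with `\t` and `\n` written out. The sketch below uses a hypothetical bytes literal (`raw`) in place of the live response, and assumes the page is UTF-8 encoded.

```python
# Hypothetical snippet standing in for the start of res.content.
raw = b'<body>\n\t<div id="searchresults">\n\t\t<h2>Search Results</h2>'

# repr() shows the escape sequences literally, as the notebook output does.
print(repr(raw))

# Decoding to str and printing renders the tabs and newlines as whitespace.
# UTF-8 is an assumption here; a real page may declare another encoding.
text = raw.decode('utf-8')
print(text)

# Count the lines and tabs that the escapes stand for.
line_count = len(text.splitlines())
tab_count = text.count('\t')
```

With a real response, `res.text` gives an already-decoded string, so `print(res.text)` would show the indented HTML rather than the escapes.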

In [10]:
res = r.get(urltoget)
res.content

b'<html>\n<head>\n<link rel="stylesheet" href="style/badstyle.css">\n</head>\n<body>\n\t<div id="searchresults">\n\t\t<h2>Search Results</h2>\n\t\t<a href="products/B01NAJGGA2.html">\n\t\t\t<div class="productresult">\n\t\t\t\t<span class="productid">B01NAJGGA2</span>\n\t\t\t\t<span class="producttitle">Mpow 059 Bluetooth Headphones</span>\n\t\t\t\t<span class="productprice">$35.99</span>\n\t\t\t\t<span class="productdesc">Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV</span>\n\t\t\t</div>\n\t\t</a>\n\t\t<a href="products/B07JMSQLCP.html">\n\t\t\t<div class="productresult">\n\t\t\t\t<span class="productid">B07JMSQLCP</span>\n\t\t\t\t<span class="producttitle">APIE Bluetooth Headphones, Wireless Earbuds</span>\n\t\t\t\t<span class="productprice">$19.99</span>\n\t\t\t\t<span class="productdesc">Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling

## Using BeautifulSoup

Call BeautifulSoup and parse the content into an object
- Here we're parsing with lxml

In [12]:
soup = BeautifulSoup(res.content,'lxml')

Use `soup.find` to find an anchor tag (`'a'`) in the HTML
- The result includes everything nested inside the tag
- `soup.find` returns only the **first** element that matches

In [14]:
#Find the first anchor link
results = soup.find("a")
print(results)

<a href="products/B01NAJGGA2.html">
<div class="productresult">
<span class="productid">B01NAJGGA2</span>
<span class="producttitle">Mpow 059 Bluetooth Headphones</span>
<span class="productprice">$35.99</span>
<span class="productdesc">Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV</span>
</div>
</a>
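
One detail worth knowing about `find`: when nothing matches, it returns `None` rather than raising an error. The sketch below shows both cases on a hypothetical two-link page. It uses the built-in `'html.parser'` backend instead of `lxml` only so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link page, not the course site.
html = '<body><a href="first.html">First</a><a href="second.html">Second</a></body>'
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first matching tag, even though two exist.
first_link = soup.find('a')
print(first_link['href'])

# find returns None when no tag matches, so check before indexing.
missing = soup.find('table')
if missing is None:
    print('no <table> on this page')
```

Checking for `None` before indexing avoids a `TypeError` when a page lacks the expected tag.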


We can then pull the href out of this result
- Note this is a relative path (relative to the page URL), which is why it looks truncated

In [16]:
#pull the href
print(results['href'])

products/B01NAJGGA2.html
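
To request a product page, the relative href has to be combined with the page URL. One way to do this, sketched here with the standard library's `urljoin`, is shown below. The variable names `relative_href` and `product_url` are illustrative.

```python
from urllib.parse import urljoin

# The page URL used earlier and the relative href pulled from the first link.
urltoget = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'
relative_href = 'products/B01NAJGGA2.html'

# urljoin resolves the relative path against the base URL.
product_url = urljoin(urltoget, relative_href)
print(product_url)
```

The trailing slash on `urltoget` matters: without it, `urljoin` would replace the last path segment instead of appending to it.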


Use soup.find_all to see all of the anchor links instead of just the first one

In [17]:
results = soup.find_all("a")

Put this in a loop to print all of the hrefs

## Using Loops to Print Output

In [18]:
#Loop to see the links
for l in results:
    print(l['href'])

products/B01NAJGGA2.html
products/B07JMSQLCP.html
products/B018APC4LE.html
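
The loop above assumes every anchor has an `href`. On pages where that may not hold, indexing with `['href']` raises a `KeyError`. The sketch below shows two ways to guard against this, using hypothetical HTML and the built-in `'html.parser'` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle anchor has no href attribute.
html = '''
<a href="products/B01NAJGGA2.html">one</a>
<a name="top">anchor without href</a>
<a href="products/B07JMSQLCP.html">two</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the anchors that actually carry an href.
hrefs = [link['href'] for link in soup.find_all('a', href=True)]
print(hrefs)

# .get returns None for a missing attribute instead of raising KeyError.
all_hrefs = [link.get('href') for link in soup.find_all('a')]
print(all_hrefs)
```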


To get the text inside each link (the innerHTML with the tags stripped out), we can loop on the text:

In [19]:
for l in results:
    print(l.text)



B01NAJGGA2
Mpow 059 Bluetooth Headphones
$35.99
Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV




B07JMSQLCP
APIE Bluetooth Headphones, Wireless Earbuds
$19.99
Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling - Black




B018APC4LE
Bluetooth Headphones, Otium
$19.97
Best Wireless Sports Earphones W/Mic IPX7 Waterproof HD Stereo Sweatproof in Ear Earbuds Gym Running Workout 8 Hour Battery Noise Cancelling Headsets
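
The blank lines in that output come from the whitespace between the nested tags. If cleaner output is wanted, one option is `stripped_strings` or `get_text` with `strip=True`, sketched below on a shortened, hypothetical copy of one product link (parsed with the built-in `'html.parser'`).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one product link from the page.
html = '''<a href="products/B01NAJGGA2.html">
    <div class="productresult">
        <span class="productid">B01NAJGGA2</span>
        <span class="productprice">$35.99</span>
    </div>
</a>'''
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# stripped_strings yields each piece of text with surrounding whitespace removed.
fields = list(link.stripped_strings)
print(fields)

# get_text can join the stripped pieces with a separator of our choosing.
one_line = link.get_text(separator=' | ', strip=True)
print(one_line)
```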




## Parsing from innerHTML

This is a great start, but what if we want to pull individual items out of each result?

Example: pull the product IDs
- We can search by the CSS class `productid`
- Pass in the `attrs` (attributes) parameter
    - In this example we give a small dictionary whose key is the attribute on the tag we're looking for (`class`) and whose value is the text we want to match (`productid`)
- This gives us all of our product IDs

In [20]:
results = soup.find_all('span', attrs={'class' : 'productid'})
for l in results:
    #print(l) #returns HTML of each found node
    print(l.text)

B01NAJGGA2
B07JMSQLCP
B018APC4LE
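
BeautifulSoup offers other spellings for the same class search. The sketch below compares the `attrs` dictionary with the `class_` keyword and a CSS selector passed to `select`. It uses a hypothetical two-product snippet and the built-in `'html.parser'` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical two-product snippet modelled on the page structure.
html = '''
<div class="productresult"><span class="productid">B01NAJGGA2</span><span class="productprice">$35.99</span></div>
<div class="productresult"><span class="productid">B07JMSQLCP</span><span class="productprice">$19.99</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# The attrs dictionary, as used in the cell above.
by_attrs = [s.text for s in soup.find_all('span', attrs={'class': 'productid'})]

# class_ keyword (trailing underscore because class is a Python keyword).
by_keyword = [s.text for s in soup.find_all('span', class_='productid')]

# CSS selector syntax via select.
by_selector = [s.text for s in soup.select('span.productid')]

print(by_attrs, by_keyword, by_selector)
```

All three return the same tags here, so the choice is mostly a matter of readability.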


We can do the same to find the prices:

In [21]:
results = soup.find_all('span', attrs={'class' : 'productprice'})
for l in results:
    #print(l) #returns HTML of each found node
    print(l.text)

$35.99
$19.99
$19.97
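
The prices come back as strings, so they need converting before any arithmetic or sorting. A minimal sketch, assuming every price is a plain dollar amount like the three above (no thousands separators or other currencies), using the standard library's `Decimal`:

```python
from decimal import Decimal

# The three price strings printed above.
price_texts = ['$35.99', '$19.99', '$19.97']

# Strip whitespace and the leading dollar sign, then convert.
# Decimal avoids the rounding surprises of float for currency values.
prices = [Decimal(text.strip().lstrip('$')) for text in price_texts]

cheapest = min(prices)
total = sum(prices)
print(cheapest, total)
```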


Further, we can search by the ID of a tag:

In [23]:
#Instead of class, we can look for id
results = soup.find('div', attrs={'id' : 'searchresults'})
print(results)

<div id="searchresults">
<h2>Search Results</h2>
<a href="products/B01NAJGGA2.html">
<div class="productresult">
<span class="productid">B01NAJGGA2</span>
<span class="producttitle">Mpow 059 Bluetooth Headphones</span>
<span class="productprice">$35.99</span>
<span class="productdesc">Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV</span>
</div>
</a>
<a href="products/B07JMSQLCP.html">
<div class="productresult">
<span class="productid">B07JMSQLCP</span>
<span class="producttitle">APIE Bluetooth Headphones, Wireless Earbuds</span>
<span class="productprice">$19.99</span>
<span class="productdesc">Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling - Black</span>
</div>
</a>
<a href="products/B018APC4LE.html">
<div class="productresult">
<span class="productid">B018APC4LE</span>
<span class="producttitle">Bluetooth Headphones, Otium</span>
<span