<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Parsing-Review" data-toc-modified-id="Parsing-Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Parsing Review</a></span></li><li><span><a href="#Nested-Search" data-toc-modified-id="Nested-Search-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Nested Search</a></span></li><li><span><a href="#Prepare-CSV-Export" data-toc-modified-id="Prepare-CSV-Export-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Prepare CSV Export</a></span></li><li><span><a href="#Export-to-CSV" data-toc-modified-id="Export-to-CSV-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Export to CSV</a></span></li></ul></div>

## Parsing Review

Import the libraries, including csv, which we will use to generate the CSV output

In [1]:
import requests as r
from bs4 import BeautifulSoup
import csv

Set the URL and the output filepath once so they can be reused, then run a GET request on the URL

In [2]:
urltoget = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'
filepath = '/Users/jdmini/Library/Group Containers/D75L7R8266.com.reinvented.KeepIt/Keep It/Files/MS_DataScience/Business Intelligence/Assets/2.1_DataOut'

res = r.get(urltoget)

Small script to check if request is good

In [3]:
#Check if we have a good request
if res.status_code == 200:
    print('request is good')
else:
    print('bad request, received code ' + str(res.status_code))

request is good


Examine the response headers to double-check that we're working with a normal HTML file

In [None]:
#Let's look at the server header
print(res.headers)    
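
Rather than reading the printed headers by eye, the check can be made explicit. The following is a minimal sketch, not part of the original notebook: the helper name `is_html` is hypothetical, and `sample_headers` is an assumed stand-in for `res.headers` so the example runs without a network call.

```python
# Stand-in for res.headers; the value below is an assumed example,
# not actual output from the server used in this notebook.
sample_headers = {'Content-Type': 'text/html; charset=UTF-8'}

def is_html(headers):
    """Return True when the Content-Type media type is text/html."""
    content_type = headers.get('Content-Type', '')
    # Drop parameters such as "; charset=UTF-8" before comparing
    media_type = content_type.split(';')[0].strip().lower()
    return media_type == 'text/html'

print(is_html(sample_headers))
print(is_html({'Content-Type': 'application/json'}))
```

With a real response, `is_html(res.headers)` should behave the same way, since `res.headers` also supports `.get`.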

Fire up the parser and place the parsed HTML into the 'soup' object

In [5]:
#Let's identify our products blocks in HTML
soup = BeautifulSoup(res.content,'lxml')

- find_all the anchor tags
- Store them in product_result
- Use for loop to print all of the anchor tag content

In [6]:
product_result = soup.find_all('a')
for pr in product_result:
    print(pr)

<a href="products/B01NAJGGA2.html">
<div class="productresult">
<span class="productid">B01NAJGGA2</span>
<span class="producttitle">Mpow 059 Bluetooth Headphones</span>
<span class="productprice">$35.99</span>
<span class="productdesc">Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV</span>
</div>
</a>
<a href="products/B07JMSQLCP.html">
<div class="productresult">
<span class="productid">B07JMSQLCP</span>
<span class="producttitle">APIE Bluetooth Headphones, Wireless Earbuds</span>
<span class="productprice">$19.99</span>
<span class="productdesc">Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling - Black</span>
</div>
</a>
<a href="products/B018APC4LE.html">
<div class="productresult">
<span class="productid">B018APC4LE</span>
<span class="producttitle">Bluetooth Headphones, Otium</span>
<span class="productprice">$19.97</span>
<span class="

## Nested Search
- Our previous approach works on simple pages, but additional anchors elsewhere on the page would end up in the results and complicate parsing
    - Example: footers often contain anchor tags
- This is a good use for nested searches
- In this case, we know that all of the products are contained within a div with an id of 'searchresults'
    - We can pass that into the attrs parameter of soup.find
In [8]:
search_results = soup.find('div', attrs={'id' : 'searchresults'})

Now our new variable 'search_results' is an object containing just the results in the searchresults div of our soup object
- We can run the same find_all within it to locate all of the anchor tags
- Loop through it as before to print all of the results
- In this example there weren't any extra anchor tags on the page, so this output is the same
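
To see what the nested search buys us when a page does have extra anchors, here is a small self-contained sketch. The HTML string, the footer link, and the variable names are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two product anchors plus one unrelated footer anchor
sample_html = """
<div id="searchresults">
  <a href="products/A1.html"><span class="productid">A1</span></a>
  <a href="products/B2.html"><span class="productid">B2</span></a>
</div>
<div id="footer">
  <a href="about.html">About</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Searching the whole document picks up the footer link as well
all_links = [a['href'] for a in sample_soup.find_all('a')]

# Searching inside the searchresults div returns only the product links
container = sample_soup.find('div', attrs={'id': 'searchresults'})
product_links = [a['href'] for a in container.find_all('a')]

print(all_links)
print(product_links)
```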

In [9]:
#Now, search for anchors within that result
product_result = search_results.find_all('a')    
for pr in product_result:
    print(pr)

<a href="products/B01NAJGGA2.html">
<div class="productresult">
<span class="productid">B01NAJGGA2</span>
<span class="producttitle">Mpow 059 Bluetooth Headphones</span>
<span class="productprice">$35.99</span>
<span class="productdesc">Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV</span>
</div>
</a>
<a href="products/B07JMSQLCP.html">
<div class="productresult">
<span class="productid">B07JMSQLCP</span>
<span class="producttitle">APIE Bluetooth Headphones, Wireless Earbuds</span>
<span class="productprice">$19.99</span>
<span class="productdesc">Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling - Black</span>
</div>
</a>
<a href="products/B018APC4LE.html">
<div class="productresult">
<span class="productid">B018APC4LE</span>
<span class="producttitle">Bluetooth Headphones, Otium</span>
<span class="productprice">$19.97</span>
<span class="

## Prepare CSV Export
- To begin our CSV export, we first print out the fields for each item
    - In this case we can locate each field by the class attribute of its span tag
- Note: the parsed page is held in memory on this machine
    - This doesn't re-query the server
- Notice the string concatenation inside each print statement
    - We also convert the result of pr.find to a string with .text
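
One thing to keep in mind is that find returns None when a tag is missing, so calling .text on the result would raise an AttributeError. The sketch below shows one possible way to guard against that. The helper `span_text` and the sample HTML are hypothetical, and 'html.parser' is used only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical product block that is missing its productprice span
sample_html = """
<a href="products/X1.html">
  <div class="productresult">
    <span class="productid">X1</span>
    <span class="producttitle">Example Headphones</span>
  </div>
</a>
"""

def span_text(tag, css_class, default=''):
    """Return the stripped text of the first matching span, or default if absent."""
    found = tag.find('span', attrs={'class': css_class})
    return found.get_text(strip=True) if found is not None else default

anchor = BeautifulSoup(sample_html, 'html.parser').find('a')
record = {
    'url': anchor['href'],
    'id': span_text(anchor, 'productid'),
    'title': span_text(anchor, 'producttitle'),
    'price': span_text(anchor, 'productprice'),  # absent here, falls back to ''
}
print(record)
```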


In [10]:
#Let's print out each part for each item    
product_result = search_results.find_all('a')    
for pr in product_result:
    print('URL:  ' + pr['href'])
    print('Product ID:  ' + pr.find('span', attrs={'class' : 'productid'}).text)
    print('Product Title:  ' + pr.find('span', attrs={'class' : 'producttitle'}).text)
    print('Product Price:  ' + pr.find('span', attrs={'class' : 'productprice'}).text)
    print('Product Description:  ' + pr.find('span', attrs={'class' : 'productdesc'}).text)
    print('----------------')

URL:  products/B01NAJGGA2.html
Product ID:  B01NAJGGA2
Product Title:  Mpow 059 Bluetooth Headphones
Product Price:  $35.99
Product Description:  Over Ear, Hi-Fi Stereo Wireless Headset, Foldable, Soft Memory-Protein Earmuffs, w/Built-in Mic Wired Mode PC/Cell Phones/TV
----------------
URL:  products/B07JMSQLCP.html
Product ID:  B07JMSQLCP
Product Title:  APIE Bluetooth Headphones, Wireless Earbuds
Product Price:  $19.99
Product Description:  Bluetooth 4.1 with Microphone Sport Stereo Headset, Stereo Neckband Headset, Premium Sound with Bass, Noise Cancelling - Black
----------------
URL:  products/B018APC4LE.html
Product ID:  B018APC4LE
Product Title:  Bluetooth Headphones, Otium
Product Price:  $19.97
Product Description:  Best Wireless Sports Earphones W/Mic IPX7 Waterproof HD Stereo Sweatproof in Ear Earbuds Gym Running Workout 8 Hour Battery Noise Cancelling Headsets
----------------
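
The URL values printed above are relative to the page we requested. If absolute links are wanted in the export, one option is urljoin from the standard library. This is a sketch only; the variable names are illustrative, while the base URL and href are taken from the cells above.

```python
from urllib.parse import urljoin

# Base URL from the request earlier in the notebook
base_url = 'http://drd.ba.ttu.edu/isqs6339/imbadproducts/'

# Relative href as it appears in the first anchor tag
relative_href = 'products/B01NAJGGA2.html'

# urljoin resolves the relative path against the base URL
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```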


## Export to CSV 
- Begin by opening the file at the filepath in write mode ('w')
    - Alias it to dataout
- Create the csv.writer object 'datawriter'
    - Use the dataout file
    - Set delimiter to comma
    - quotechar (quote character) sets the character placed around quoted fields
        - Since comma is our delimiter, quoting keeps a comma in the middle of a data item from being read as a column break
    - csv.QUOTE_NONNUMERIC quotes only nonnumeric fields
        - Numeric values are left unquoted, since theoretically they shouldn't contain our delimiter (comma)
        - Everything returned by .text is a string, so in this export every field ends up quoted, including the price
- The first datawriter.writerow call adds the first row to create column headers (names)
    - Nothing makes this a header except that it's the first row
    - In this case we pass it as a list
- Loop through product_result writing the rows sequentially
    - Each pass writes one row built from the values found with .find
    - Should result in one row for each of the products
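
To make the quoting behavior concrete, the sketch below writes to an in-memory buffer instead of a file. The row values are invented for illustration; the writer settings match the ones described above.

```python
import csv
import io

# In-memory text buffer standing in for the output file
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

writer.writerow(['title', 'price_text', 'price_number'])
# The title contains a comma; '$35.99' is a string, 35.99 is a float
writer.writerow(['Headphones, Over Ear', '$35.99', 35.99])

print(buffer.getvalue())
```

The strings are wrapped in quotes, so the comma inside the title stays within a single column, while the float is written without quotes.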

In [13]:
#Let's put this in a csv file

#newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(filepath, 'w', newline='') as dataout:
    datawriter = csv.writer(dataout, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    #Header row
    datawriter.writerow(['URL', 'id', 'title', 'price', 'description'])
    #One row per product anchor
    for pr in product_result:
        datawriter.writerow([pr['href'],
                             pr.find('span', attrs={'class' : 'productid'}).text,
                             pr.find('span', attrs={'class' : 'producttitle'}).text,
                             pr.find('span', attrs={'class' : 'productprice'}).text,
                             pr.find('span', attrs={'class' : 'productdesc'}).text])
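
A quick way to confirm the export is to read the file back. The sketch below is a hypothetical round trip using a temporary directory and made-up rows, so it does not touch the real filepath; it assumes the same header names as the cell above.

```python
import csv
import os
import tempfile

# Made-up rows with the same columns as the export above
rows = [
    ['URL', 'id', 'title', 'price', 'description'],
    ['products/X1.html', 'X1', 'Example Headphones', '$9.99', 'Foldable, wireless'],
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'products.csv')

    # Write with the same quoting rule used in the export
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)

    # Read back; DictReader uses the first row as the column names
    with open(path, newline='', encoding='utf-8') as f:
        records = list(csv.DictReader(f))

# The comma inside the description survives as part of one field
print(records[0]['description'])
```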