<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#About" data-toc-modified-id="About-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>About</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Import" data-toc-modified-id="Import-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import</a></span></li><li><span><a href="#Initialization" data-toc-modified-id="Initialization-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Initialization</a></span></li></ul></li><li><span><a href="#Get-Good-Book-Links" data-toc-modified-id="Get-Good-Book-Links-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get Good Book Links</a></span><ul class="toc-item"><li><span><a href="#Save-Book-Links" data-toc-modified-id="Save-Book-Links-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Save Book Links</a></span></li></ul></li><li><span><a href="#Extract-Download-Links" data-toc-modified-id="Extract-Download-Links-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Extract Download Links</a></span><ul class="toc-item"><li><span><a href="#Save-Download-Links" data-toc-modified-id="Save-Download-Links-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Save Download Links</a></span></li></ul></li></ul></div>

# About

This notebook scrapes LibriVox for good data points.

A good data point is a complete, solo project. Additionally, only one example per reader is kept.

The result is a list of download links.
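
The selection rule above can be sketched on plain Python records. This is only an illustration: the function name `select_good_books`, the field names, and the sample records are hypothetical and do not come from LibriVox.

```python
def select_good_books(books):
    """Keep complete solo projects, at most one per reader (first seen wins)."""
    seen_readers = set()
    selected = []
    for book in books:
        # Skip anything that is not a complete, solo project
        if book["status"] != "complete" or book["project_type"] != "solo":
            continue
        # Skip readers that already contributed one example
        if book["reader"] in seen_readers:
            continue
        seen_readers.add(book["reader"])
        selected.append(book)
    return selected


# Hypothetical records, not real LibriVox data
books = [
    {"title": "A", "reader": "r1", "status": "complete", "project_type": "solo"},
    {"title": "B", "reader": "r1", "status": "complete", "project_type": "solo"},
    {"title": "C", "reader": "r2", "status": "in progress", "project_type": "solo"},
    {"title": "D", "reader": "r3", "status": "complete", "project_type": "collaborative"},
    {"title": "E", "reader": "r4", "status": "complete", "project_type": "solo"},
]
print([b["title"] for b in select_good_books(books)])  # prints ['A', 'E']
```

Tracking readers in a set keeps the membership check cheap even for several thousand books.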

# Setup
## Import

In [1]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pickle as pkl

## Initialization

In [2]:
path_chromedriver = '/anaconda3/chromedriver'
driver = webdriver.Chrome(path_chromedriver)
book_links = [] # List of books to download
search_url = ("https://librivox.org/search?title=&author=&reader=&keywords=&genre_id=0&status=complete&project_type=solo&recorded_language=&sort_order=alpha&search_page={}&search_form=advanced")
no_pages = 322 # As of 10/06/2019
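
The `{}` placeholder in `search_url` is filled with the page number, and the remaining query parameters carry the filters (`status=complete`, `project_type=solo`). The sketch below decodes one formatted URL with the standard library to make that visible; the helper name `describe_search_page` is an assumption made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

search_url = ("https://librivox.org/search?title=&author=&reader=&keywords=&genre_id=0&status=complete&project_type=solo&recorded_language=&sort_order=alpha&search_page={}&search_form=advanced")


def describe_search_page(page):
    """Return the URL for one results page and its decoded query parameters."""
    url = search_url.format(page)
    # keep_blank_values=True retains empty filters such as title= and author=
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return url, {key: values[0] for key, values in query.items()}


url, params = describe_search_page(3)
print(params["status"], params["project_type"], params["search_page"])  # prints complete solo 3
```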

# Get Good Book Links

Scrape LibriVox for audiobooks that are complete and recorded by a single reader.

In [None]:
for page in range(1,1+no_pages):
    if page%10 == 0:
        print('{} of {}'.format(page,no_pages))
        
    # Load page    
    driver.get(search_url.format(page))

    # Wait until search results have been loaded
    results_loaded = EC.presence_of_element_located((By.CLASS_NAME , "catalog-result"))
    element = WebDriverWait(driver,100).until(results_loaded)

    # Soupify HTML
    html_source = driver.page_source
    soup = BeautifulSoup(html_source,'html.parser')

    # Get results    
    results_list = soup.find('ul', {'class': 'browse-list'})
    results_links = results_list.find_all('li', {'class': 'catalog-result'})

    # Extract relevant book links
    for result in results_links:
        # Extract relevant result info
        result_data = result.find('div', {'class': 'result-data'})
        book_meta = result_data.find('p', {'class': 'book-meta'})
        link = result_data.a["href"]

        # Conditions for good datum
        is_complete = "Complete" in str(book_meta) # str.find returns -1 (truthy) when absent
        is_new = link not in book_links

        if is_complete and is_new:
            book_links.append(link)

driver.close()

10 of 322
20 of 322


## Save Book Links

In [None]:
with open('book_links.pkl','wb') as fout:
    pkl.dump(book_links,fout)
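
A later session can reload the saved list instead of repeating the scrape. The following is a minimal round-trip sketch; the helper names `save_links` and `load_links` and the example URLs are placeholders, and a temporary directory is used so that no real file is overwritten.

```python
import os
import pickle as pkl
import tempfile


def save_links(links, path):
    """Write a list of links to a pickle file."""
    with open(path, "wb") as fout:
        pkl.dump(links, fout)


def load_links(path):
    """Read a list of links back from a pickle file."""
    with open(path, "rb") as fin:
        return pkl.load(fin)


# Placeholder links, not real LibriVox pages
example_links = ["https://example.org/book-1/", "https://example.org/book-2/"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "book_links.pkl")
    save_links(example_links, path)
    restored = load_links(path)

print(restored == example_links)  # prints True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.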

# Extract Download Links

In [None]:
readers = []
download_links = []
sizes = []
bad_links = []

In [None]:
driver = webdriver.Chrome(path_chromedriver)

for i,link in enumerate(book_links):
    if i%10 == 0:
        print('{} of {}'.format(i,len(book_links)))

    driver.get(link)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source,'html.parser')

    download_button = soup.find('a',{'class':'book-download-btn'})
    product_details = soup.find('dl', {'class': 'product-details clearfix'})

    # A page is usable only if it has both a download button and a details list
    if download_button is None or product_details is None:
        if link not in bad_links:
            bad_links.append(link)
        continue

    product_details_list = product_details.find_all("dd")
    download_link = download_button['href']
    reader = product_details_list[3].get_text()
    size_mb = product_details_list[1].get_text()
    try:
        size_mb = float(size_mb[:-2]) # Drop the trailing two-character unit
    except ValueError:
        pass # Keep the raw text if it is not numeric

    # Keep only the first book seen for each reader
    if reader not in readers:
        readers.append(reader)
        sizes.append(size_mb)
        download_links.append(download_link)

driver.close()
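
The size field is converted by dropping its last two characters, which assumes the text always ends in a two-letter unit. A slightly more defensive parser could look like the sketch below. The helper name `parse_size_mb` is hypothetical, and the accepted format (a number followed by `MB`) is an assumption about the page text rather than something confirmed here.

```python
import re


def parse_size_mb(text):
    """Return the size in MB as a float, or None if the text does not match."""
    # Assumed format: a number, optional whitespace, then the unit "MB"
    match = re.fullmatch(r"\s*([\d.,]+)\s*MB\s*", text, flags=re.IGNORECASE)
    if match is None:
        return None
    try:
        # Remove thousands separators before converting
        return float(match.group(1).replace(",", ""))
    except ValueError:
        return None


print(parse_size_mb("123MB"))    # prints 123.0
print(parse_size_mb("unknown"))  # prints None
```

Returning `None` for unexpected text keeps the `sizes` list numeric wherever a value is present, instead of mixing floats and raw strings.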

## Save Download Links

In [None]:
with open('download_links.pkl','wb') as fout:
    pkl.dump(download_links,fout)
    