## Scraping Best Sellers (Top 100) on Audible.com with Selenium on Google Colab

In this notebook, I use Python and Selenium to scrape the Best Sellers section of Audible.com, which lists the top 100 audio books, and store the data in a CSV file.

The link to scrape:
https://www.audible.com/adblbestsellers

robots.txt has been checked and the above link is allowed.
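
One way to run such a check programmatically is with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up set of rules (not Audible's actual robots.txt) so that it runs offline, and `is_allowed` is a hypothetical helper name. Against the live site, the parser's `set_url()` and `read()` methods would fetch the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Audible's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /account
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.audible.com/adblbestsellers")
blocked = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.audible.com/cart")
print(allowed, blocked)
```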

For each audio book, the following fields are scraped:
- Book title
- Authors
- Narrators
- Length
- Release date
- Language
- Ratings

## Install ChromeDriver and Selenium on Google Colab

In [None]:
%%shell

# Add the Debian buster repositories
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

Executing: /tmp/apt-key-gpghome.2TRQWq5o3G/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
gpg: key DCC9EFBF77E11517: public key "Debian Stable Release Key (10/buster) <debian-release@lists.debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.Ole8OMsA6p/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
gpg: key DC30D7C23CBBABEE: public key "Debian Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.rx0RUjfhw8/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
gpg: key 4DFAB270CAA96DFA: public key "Debian Security Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1




In [None]:
!apt-get update
!apt-get install chromium chromium-driver

Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://deb.debian.org/debian buster-updates InRelease [56.6 kB]
Get:3 http://deb.debian.org/debian-security buster/updates InRelease [34.8 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease [3,622 B]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease
Get:7 http://deb.debian.org/debian buster/main a

In [None]:
!pip install selenium

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting selenium
  Downloading selenium-4.8.3-py3-none-any.whl (6.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 44.6 MB/s eta 0:00:00
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.10.2-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 384.9/384.9 KB 37.4 MB/s eta 0:00:00
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting sniffio
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

from time import sleep

import pandas as pd
import numpy as np

In [None]:
def web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument('--no-sandbox')  # needed because Colab runs as root
    options.add_argument('--headless')  # or use pyvirtualdisplay
    options.add_argument('--disable-gpu')
    options.add_argument("--window-size=1920,1200")  # no space after the comma
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver

## Scraping data with selenium

In [None]:
driver = web_driver()

url = 'https://www.audible.com/adblbestsellers'

driver.get(url)

A browser screenshot can be taken to check that the driver is working:

In [None]:
# driver.save_screenshot('1.png')  # Selenium saves screenshots as PNG

Get the number of the last page to scrape

In [None]:
last_page = driver.find_elements(By.CSS_SELECTOR, '.bc-link.refinementFormLink.pageNumberElement.bc-color-link')
last_page_no = int(last_page[-1].text)
last_page_no

5

Scrape data from the fields and store them in respective lists

In [None]:
title_list, author_list, narrator_list, length_list, \
release_date_list, language_list, rating_list = [], [], [], [], [], [], []

for page in range(last_page_no):
  print('Page', page+1)
  box = driver.find_elements(By.CSS_SELECTOR, '.bc-col-responsive .bc-col-6')
  
  for item in box:
      
      name = item.find_element(By.CSS_SELECTOR, 'h3 a').text

      # a book may list several authors, so use find_elements, which returns a list
      authors = item.find_elements(By.CSS_SELECTOR, '.bc-list-item.authorLabel a')
      # join multiple authors into one string, e.g. 'Jodi Picoult, Jennifer Finney Boylan'
      author = ', '.join([i.text for i in authors])

      # same pattern as for authors, since a book may have several narrators
      narrated_by = item.find_elements(By.CSS_SELECTOR, '.bc-list-item.narratorLabel a')
      narrator = ', '.join([i.text for i in narrated_by])

      length_audio = item.find_element(By.CSS_SELECTOR, '.bc-list-item.runtimeLabel span').text
      length = length_audio.removeprefix('Length: ')
   
      dates = item.find_element(By.CSS_SELECTOR, '.bc-list-item.releaseDateLabel span').text
      release_date = dates.removeprefix('Release date: ')

      language = item.find_element(By.CSS_SELECTOR, '.bc-list-item .languageLabel span').text
      language = language.removeprefix('Language: ')

      # some books, especially new ones, may not have ratings yet, so catch NoSuchElementException

      try:
          rating = item.find_element(By.CSS_SELECTOR, '.bc-list-item .ratingsLabel span.bc-text.bc-pub-offscreen').text
          rating = rating.removesuffix(' out of 5 stars')
          
      except NoSuchElementException:
          rating = 'Not rated yet'
          
      title_list.append(name)
      author_list.append(author)
      narrator_list.append(narrator)
      length_list.append(length)
      release_date_list.append(release_date)
      language_list.append(language)
      rating_list.append(rating)

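  try:
    # named next_button to avoid shadowing the built-in next()
    next_button = driver.find_element(By.CSS_SELECTOR, '.bc-button.bc-button-secondary.nextButton.refinementFormButton.bc-button-small.bc-button-inline')
    next_button.click()
    print('moving to next page')
    sleep(5)  # give the next page time to load
  except Exception:  # no clickable next button was found
    print('THIS IS THE LAST PAGE')
    break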
    
driver.quit()

# for checking
print(len(title_list))
print(len(author_list))

Page 1
moving to next page
Page 2
moving to next page
Page 3
moving to next page
Page 4
moving to next page
Page 5
moving to next page
100
100
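
The label stripping in the loop relies on `str.removeprefix` and `str.removesuffix`, which exist only in Python 3.9 and later. As a sketch, the hypothetical helper `strip_label` below does the same job on any Python 3 version. The sample strings are assumed examples of the raw label text, inferred from the prefixes and suffix used in the loop rather than copied from the page.

```python
def strip_label(text: str, prefix: str = "", suffix: str = "") -> str:
    """Remove an exact prefix and/or suffix if present (fallback for Python < 3.9)."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text

# Assumed raw label texts, shaped like what the selectors in the loop return.
length = strip_label("Length: 5 hrs and 35 mins", prefix="Length: ")
release_date = strip_label("Release date: 10-16-18", prefix="Release date: ")
rating = strip_label("4.5 out of 5 stars", suffix=" out of 5 stars")
print(length, release_date, rating, sep=" | ")
```

Text that does not carry the expected label is returned unchanged, which matches the behaviour of the built-in methods.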


## Handling the scraped data

In [None]:
data = list(zip(title_list, author_list, narrator_list, length_list, release_date_list, language_list, rating_list))
# checking the data of first 2 audio books
data[:2]

[('Atomic Habits',
  'James Clear',
  'James Clear',
  '5 hrs and 35 mins',
  '10-16-18',
  'English',
  '5'),
 ('Spare',
  'Prince Harry The Duke of Sussex',
  'Prince Harry The Duke of Sussex',
  '15 hrs and 39 mins',
  '01-10-23',
  'English',
  '5')]
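
All scraped fields are still strings at this point. If numeric or date values are needed later, they could be parsed along the following lines. This is a sketch rather than part of the original notebook: the helper names are hypothetical, and it assumes lengths look like `5 hrs and 35 mins` and dates are in `MM-DD-YY` order, as in the two rows above.

```python
import re
from datetime import date, datetime
from typing import Optional

def length_to_minutes(length: str) -> int:
    """Convert e.g. '5 hrs and 35 mins' to total minutes; either part may be absent."""
    hours = re.search(r"(\d+)\s*hrs?", length)
    minutes = re.search(r"(\d+)\s*mins?", length)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

def parse_release_date(text: str) -> date:
    """Parse an assumed MM-DD-YY string such as '10-16-18'."""
    return datetime.strptime(text, "%m-%d-%y").date()

def parse_rating(text: str) -> Optional[float]:
    """Return the rating as a float, or None for placeholders like 'Not rated yet'."""
    try:
        return float(text)
    except ValueError:
        return None

minutes = length_to_minutes("5 hrs and 35 mins")
released = parse_release_date("10-16-18")
score = parse_rating("5")
print(minutes, released, score)
```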

Turn the data list into a dataframe

In [None]:
file = pd.DataFrame(data, columns=['title', 'author', 'narrator', 'length', 'release_date', 'language', 'ratings'])

# Start the index at 1 so that it doubles as the ranking of the audio books
file.index = np.arange(1, len(file) + 1)
file

| | title | author | narrator | length | release_date | language | ratings |
|---|---|---|---|---|---|---|---|
| 1 | Atomic Habits | James Clear | James Clear | 5 hrs and 35 mins | 10-16-18 | English | 5 |
| 2 | Spare | Prince Harry The Duke of Sussex | Prince Harry The Duke of Sussex | 15 hrs and 39 mins | 01-10-23 | English | 5 |
| 3 | Lessons in Chemistry | Bonnie Garmus | Miranda Raison, Bonnie Garmus, Pandora Sykes | 11 hrs and 55 mins | 04-05-22 | English | 4.5 |
| 4 | Daisy Jones & The Six | Taylor Jenkins Reid | Jennifer Beals, Benjamin Bratt, Judy Greer, Pa... | 9 hrs and 3 mins | 03-05-19 | English | 4.5 |
| 5 | I Will Find You | Harlan Coben | Steven Weber | 10 hrs and 16 mins | 03-14-23 | English | 4.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | A Court of Silver Flames | Sarah J. Maas | Stina Nielsen | 26 hrs and 5 mins | 02-16-21 | English | 4.5 |
| 97 | Shadow and Bone | Leigh Bardugo | Lauren Fortgang | 9 hrs and 21 mins | 11-30-12 | English | 4.5 |
| 98 | I Have Some Questions for You | Rebecca Makkai | Julia Whelan, JD Jackson | 14 hrs and 4 mins | 02-21-23 | English | 4 |
| 99 | Horse | Geraldine Brooks | James Fouhey, Lisa Flanagan, Graham Halstead, ... | 14 hrs and 6 mins | 06-14-22 | English | 4.5 |


In [None]:
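# Note: index=False leaves the ranking index out of the CSV; pass index_label='rank' instead to keep it
file.to_csv('Audible Top 100 best sellers.csv', index=False)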