# Scraping Audible Books using Python


![](https://i.imgur.com/Vf38DRk.jpg)

**What is web scraping?**

Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed. If you've ever copied and pasted content from a website into an Excel spreadsheet, this is essentially what web scraping is, but on a very small scale. Refer to the article to know more.

#### | Data is everywhere - perhaps in the form of "information" and "misinformation"


Audible is an American online audiobook and podcast service that allows users to purchase and stream audiobooks and other forms of spoken word content.

**Python** is a popular choice for scraping information from websites, and it provides many libraries for the job, such as BeautifulSoup and Scrapy. Before you scrape any website, though, please go through its terms and conditions.

Now let's scrape the information about all the audiobooks in the Business and Career category. Go to [audible.in](https://www.audible.in/search?node=21881793031&pageSize=50&sort=&ref=a_search_c1_sort_0&pf_rd_p=ef3fe3b8-5e51-4a15-8620-4b6299f0f80d&pf_rd_r=5SXJAR17GV319RGXZG53) and get a feel for how the page is structured: right-click anywhere on the page and select the 'Inspect' option, which is available in almost all modern browsers.

We will be using the following packages: [Requests](https://www.w3schools.com/python/module_requests.asp?msclkid=32785f24c4bf11ec9ffc8255b855ffcc), [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/?msclkid=509f5e64c3c711eca02fb7b32259be6c), [Pandas](https://pandas.pydata.org/docs/?msclkid=7703b982c3c711ecad8359fe405ad5b0). If you are not familiar with them, please go through the linked documentation.

In [2]:
# Installing packages
# The jovian library lets you save notebooks to your Jovian profile

!pip install jovian --upgrade --quiet
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet

In [3]:
#Importing the packages to use

import jovian
import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd

## Download the webpage using `requests`

In [4]:
audible_url= 'https://www.audible.in/search?node=21881793031&pageSize=50&sort=&page=1'

The libraries are now installed and imported, and the URL of the first results page is stored in `audible_url`.

To download a page, we can use the `get` function from requests, which returns a response object.

In [5]:
response = rq.get(audible_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the response was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.
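
The 200 to 299 rule can be written down as a small helper. This is only a sketch: `is_success` and `describe` are hypothetical names, not part of `requests`, and the standard library's `http.HTTPStatus` is used here just to look up the reason phrase for a code.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True when the status code falls in the 200-299 range."""
    return 200 <= status_code <= 299


def describe(status_code):
    """Return a short label such as '404 Not Found' for known codes."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        # HTTPStatus only knows registered codes
        return f"{status_code} (unknown status)"


print(is_success(200), describe(200))
print(is_success(404), describe(404))
```

In the notebook, the same check would be applied to `response.status_code`.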

In [6]:
response.status_code

200

The request was successful. We can get the contents of the page using `response.text`.

In [7]:
page_contents=response.text
len(page_contents)

2298532

In the above cell, `page_contents` contains the [HTML](https://en.wikipedia.org/wiki/HTML) of the webpage [audible.in](https://www.audible.in/search?node=21881793031&pageSize=50&sort=&ref=a_search_c1_sort_0&pf_rd_p=ef3fe3b8-5e51-4a15-8620-4b6299f0f80d&pf_rd_r=5SXJAR17GV319RGXZG53), and `len(page_contents)` shows that it is about 2.3 million characters long.

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [8]:
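# The page contains non-ASCII characters (for example ₹), so set the encoding explicitly
with open('business-career-books.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

In [9]:
with open('business-career-books.html', 'r', encoding='utf-8') as f:
    html_source = f.read()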
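
One detail worth spelling out when saving the page: the HTML contains non-ASCII characters such as the ₹ sign, so passing an explicit encoding avoids relying on the platform default. Below is a minimal sketch of the write-then-read round trip. It uses a made-up one-line HTML string and a temporary file instead of the real page; the names `sample_html` and `sample-page.html` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

# Stand-in for response.text (not real Audible markup)
sample_html = '<p class="price">₹1,519.00</p>'

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / 'sample-page.html'
    # Write and read back with the same explicit encoding
    html_path.write_text(sample_html, encoding='utf-8')
    restored = html_path.read_text(encoding='utf-8')

print(restored == sample_html)
```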

### Parse the HTML source code using `beautifulsoup4`

In [10]:
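# Name the parser explicitly so the result does not depend on which parsers are installed
doc = bs(html_source, 'html.parser')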

It looks like we have the data for 50 audiobooks per page, and there are about 24 pages. Here is a function that fetches any one of those pages.

**Creating a helper function that gets you desired webpage by taking page number as argument**

In [11]:
def get_pageno(pageno):
    pageno= str(pageno)
    # Construct the URL
    books_pageno_url = 'https://www.audible.in/search?node=21881793031&pageSize=50&sort=&page=' + pageno
    
    # Get the HTML page content using requests
    response = rq.get(books_pageno_url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + books_pageno_url)

    # Construct a Beautiful Soup document, naming the parser explicitly
    doc = bs(response.text, 'html.parser')
    
    return doc
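
As an alternative to string concatenation, the page URL can be assembled from its query parameters. The sketch below only builds the URL and sends no request. `build_page_url`, `BASE_SEARCH_URL` and the keyword arguments are hypothetical names; the parameter values are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = 'https://www.audible.in/search'


def build_page_url(pageno, node='21881793031', page_size=50):
    """Build the search URL for one results page of a category."""
    params = {
        'node': node,           # category id taken from the URL above
        'pageSize': page_size,  # number of books per page
        'sort': '',             # left empty, as in the original URL
        'page': pageno,
    }
    return BASE_SEARCH_URL + '?' + urlencode(params)


print(build_page_url(1))
```

Keeping the category id as an argument makes it a little easier to point the same code at another category later.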

Let's get page 1 and start scraping the data out of it.

In [12]:
pageno_1= get_pageno(1)

## Extracting the data


![](https://miro.medium.com/max/1400/1*QBCx2VuHpD4vTxm8JFbMTQ.png)


Here, each `<li>` tag with `class="bc-list-item productListItem"` contains all the data about one book, so let's find all of these `<li>` tags and store them in a variable.

In [13]:
book_contents= pageno_1.find_all('li', class_='bc-list-item productListItem')

## 1. Title of Book:

![](https://miro.medium.com/max/1400/1*DALpLoNyaichld2dcZz8tA.png)

Now that we have `book_contents`, which holds all the `<li>` tags, let's create a function that gets all the book names.

The name is stored as the text of the `<a>` tag inside each book's `<h3>` tag, and we can retrieve it using the function below.

In [14]:
def get_book_names(book_contents):
    book_names = []
    for tag in book_contents:
        # The title is the text of the <a> tag directly under the <h3> tag
        a_tag_name = tag.h3.find_all('a', recursive=False)
        book_name = a_tag_name[0].text.strip()
        book_names.append(book_name)
    return book_names

We can call our function `get_book_names` to get the book names.

In [15]:
get_book_names(book_contents)

['The Everyday Hero Manifesto',
 'HBR at 100',
 'Start with Why',
 'The Design of Everyday Things',
 '$100M Offers',
 'The Daily Stoic',
 'Think and Grow Rich',
 'Range',
 'The ONE Thing',
 'Trading in the Zone',
 'The Personal MBA: Master the Art of Business',
 'Indomitable',
 'Think Again',
 'Algorithms to Live By',
 'The Founders',
 'Hyperfocus',
 'The Ambuja Story',
 'How Come No One Told Me That',
 'Designing Data-Intensive Applications',
 'Indistractable',
 'The Obstacle Is the Way',
 'Elon Musk',
 'No Excuses!',
 'The DevOps Handbook, Second Edition',
 'The Hard Thing About Hard Things',
 'The Lean Startup',
 'Rework',
 'Your Money or Your Life',
 'The Goal',
 'Make Time',
 'The Warren Buffett Way',
 'The Millionaire Fastlane: Crack the Code to Wealth and Live Rich for a Lifetime',
 'Grit',
 'No Rules Rules',
 'Hooked: How to Build Habit-Forming Products',
 'Extreme Ownership',
 'How to Talk to Anyone',
 'Leadership Wisdom from the Monk Who Sold His Ferrari',
 'Eat That Frog!',


## 2. URL of Each Book:

![](https://cdn-images-1.medium.com/max/1600/1*DALpLoNyaichld2dcZz8tA.png)

Under the same `<a>` tag, `href` attribute contains the link of the book. We can retrieve the link of the audio book using the below function.

In [16]:
def get_book_links(book_contents):
    base_url='https://www.audible.in'
    book_links=[]
    for tag in book_contents:
        a_tag_name= tag.h3.find_all('a', recursive=False)
        url= a_tag_name[0]['href'].strip()
        book_link= base_url+url
        book_links.append(book_link)
    return book_links

We can call our function `get_book_links` to get the book URLs.

In [17]:
get_book_links(book_contents)

['https://www.audible.in/pd/The-Everyday-Hero-Manifesto-Audiobook/B08XY8T574',
 'https://www.audible.in/pd/HBR-at-100-Audiobook/B09WFVS56M',
 'https://www.audible.in/pd/Start-with-Why-Audiobook/B09J5J1PTZ',
 'https://www.audible.in/pd/The-Design-of-Everyday-Things-Audiobook/B07L5T1Q55',
 'https://www.audible.in/pd/100M-Offers-Audiobook/B09BK615JG',
 'https://www.audible.in/pd/The-Daily-Stoic-Audiobook/B079B59VKY',
 'https://www.audible.in/pd/Think-and-Grow-Rich-Audiobook/B07BN7HM3H',
 'https://www.audible.in/pd/Range-Audiobook/B07P9RSNSB',
 'https://www.audible.in/pd/The-ONE-Thing-Audiobook/B079P6NWGL',
 'https://www.audible.in/pd/Trading-in-the-Zone-Audiobook/B07BTC3K6T',
 'https://www.audible.in/pd/The-Personal-MBA-Master-the-Art-of-Business-Audiobook/B079TN1LCT',
 'https://www.audible.in/pd/Indomitable-Audiobook/B09SGPKYMV',
 'https://www.audible.in/pd/Think-Again-Audiobook/B08PL4YK66',
 'https://www.audible.in/pd/Algorithms-to-Live-By-Audiobook/B079P87243',
 'https://www.audible.in
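
Joining the base URL and the `href` by plain concatenation works for the site-relative paths shown above. A slightly more forgiving variant, sketched here with the standard library's `urljoin`, also copes with an `href` that is already absolute. `absolute_book_url` is a hypothetical helper name, and the sample paths are taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.audible.in'


def absolute_book_url(href):
    """Turn a site-relative href into a full URL; absolute hrefs pass through."""
    return urljoin(BASE_URL, href.strip())


print(absolute_book_url('/pd/Range-Audiobook/B07P9RSNSB'))
print(absolute_book_url('https://www.audible.in/pd/HBR-at-100-Audiobook/B09WFVS56M'))
```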

## 3. Duration of Audio Book:

![](https://cdn-images-1.medium.com/max/1600/1*OQo-8FshD6lVimttzKHhmQ.png)

There is another `<li>` tag for the duration; this time the class name must be specified to pick the right one, i.e. `class_='bc-list-item runtimeLabel'`. Inside it there is a `<span>` tag that holds the text. We can retrieve the length of the audiobook using the function below.

In [18]:
def get_book_length(book_contents):
    book_length=[]
    for tag in book_contents:
        try:
            len_tag= tag.find('li', class_='bc-list-item runtimeLabel')
            length_tag = len_tag.find('span')
            length = length_tag.text.strip()
            book_length.append(length)
        except AttributeError:
            book_length.append(None)
    return book_length

We can call our function `get_book_length` to get the duration of the audiobook.

In [19]:
get_book_length(book_contents)

['Length: 9 hrs and 27 mins',
 'Length: 17 hrs and 7 mins',
 'Length: 7 hrs and 18 mins',
 'Length: 10 hrs and 39 mins',
 'Length: 3 hrs and 48 mins',
 'Length: 10 hrs and 6 mins',
 'Length: 10 hrs and 15 mins',
 'Length: 10 hrs and 17 mins',
 'Length: 5 hrs and 24 mins',
 'Length: 7 hrs and 57 mins',
 'Length: 15 hrs and 25 mins',
 'Length: 15 hrs and 26 mins',
 'Length: 6 hrs and 41 mins',
 'Length: 11 hrs and 50 mins',
 'Length: 15 hrs and 46 mins',
 'Length: 6 hrs and 39 mins',
 'Length: 13 hrs and 59 mins',
 'Length: 8 hrs and 21 mins',
 'Length: 20 hrs and 56 mins',
 'Length: 5 hrs and 15 mins',
 'Length: 6 hrs and 7 mins',
 'Length: 13 hrs and 23 mins',
 'Length: 6 hrs and 51 mins',
 'Length: 15 hrs and 51 mins',
 'Length: 7 hrs and 57 mins',
 'Length: 8 hrs and 38 mins',
 'Length: 2 hrs and 50 mins',
 'Length: 11 hrs and 21 mins',
 'Length: 11 hrs and 45 mins',
 'Length: 4 hrs and 58 mins',
 'Length: 10 hrs and 31 mins',
 'Length: 12 hrs and 46 mins',
 'Length: 9 hrs and 21 min
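
The durations come back as text such as `'Length: 9 hrs and 27 mins'`. For later analysis it may be handy to turn them into a number of minutes. The helper below is a sketch, not part of the original notebook: `length_to_minutes` is a hypothetical name, and it assumes the text always follows the "N hrs and M mins" pattern seen above, with either part optional.

```python
import re


def length_to_minutes(text):
    """Convert 'Length: 9 hrs and 27 mins' to 567; return None if nothing matches."""
    if text is None:
        return None
    hours = re.search(r'(\d+)\s*hrs?', text)
    minutes = re.search(r'(\d+)\s*mins?', text)
    if hours is None and minutes is None:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


print(length_to_minutes('Length: 9 hrs and 27 mins'))
```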

## 4. Authors of the book:

![](https://cdn-images-1.medium.com/max/1600/1*d_JgCClWrjFEhNL7E4Z0KQ.png)

As with the book length, there is another `<li>` tag, and the class name must be specified to pick the right one, i.e. `class_='bc-list-item authorLabel'`. Inside it there is an `<a>` tag that holds the author's name. We can retrieve the author name using the function below.

In [20]:
def get_written_by(book_contents):
    written_by=[]
    for tag in book_contents:
        author_tag= tag.find('li', class_='bc-list-item authorLabel')
        try:
            auth_tag = author_tag.find('a')
            author = auth_tag.text.strip()
            written_by.append(author)
        except AttributeError:
            written_by.append(None)
    return written_by

We can call our function `get_written_by` to get the authors.

In [21]:
get_written_by(book_contents)

['Robin Sharma',
 'Harvard Business Review',
 'Simon Sinek',
 'Don Norman',
 'Alex Hormozi',
 'Ryan Holiday',
 'Napoleon Hill',
 'David Epstein',
 'Gary Keller',
 'Mark Douglas',
 'Josh Kaufman',
 'Arundhati Bhattacharya',
 'Adam Grant',
 'Brian Christian',
 'Jimmy Soni',
 'Chris Bailey',
 'Narotam Sekhsaria',
 'Prakash Iyer',
 'Martin Kleppmann',
 'Nir Eyal',
 'Ryan Holiday',
 'Ashlee Vance',
 'Brian Tracy',
 'Gene Kim',
 'Ben Horowitz',
 'Eric Ries',
 'Jason Fried',
 'Vicki Robin',
 'Eliyahu M. Goldratt',
 'Jake Knapp',
 'Robert Hagstrom',
 'MJ DeMarco',
 'Angela Duckworth',
 'Reed Hastings',
 'Nir Eyal',
 'Jocko Willink',
 'Leil Lowndes',
 'Robin Sharma',
 'Brian Tracy',
 'John A List',
 'Grant Cardone',
 'Reid Hoffman',
 'Greg McKeown',
 'Malcolm Gladwell',
 'Amanda Frances',
 'Daniel H. Pink',
 'Kurien',
 'T. Harv Eker',
 'Robert Greene',
 'Simon Sinek']

## 5. Description about the audio book:

![](https://cdn-images-1.medium.com/max/1600/1*OOQOfDkDHllYam2e1QxLdQ.png)

The `<li class='bc-list-item subtitle'>` tag and the `<span>` inside it hold the description (subtitle) field. Let's retrieve it using the function below.

In [22]:
def get_description(book_contents):
    description=[]
    for tag in book_contents:
        about_tag= tag.find('li', class_='bc-list-item subtitle')
        try:
            description_tag = about_tag.find('span').text.strip()
            description.append(description_tag)
        except AttributeError:
            description.append(None)
    return description

We can call our function `get_description` to get the descriptions of the books.

In [23]:
get_description(book_contents)

['Activate Your Positivity, Maximize Your Productivity, Serve The World',
 "The Most Influential and Innovative Articles from Harvard Business Review's First Century",
 'How Great Leaders Inspire Everyone To Take Action',
 'Revised and Expanded Edition',
 'How to Make Offers So Good People Feel Stupid Saying No',
 '366 Meditations on Wisdom, Perseverance, and the Art of Living',
 None,
 'How Generalists Triumph in a Specialized World',
 'The Surprisingly Simple Truth Behind Extraordinary Results',
 'Master the Market with Confidence, Discipline, and a Winning Attitude',
 None,
 "A Working Woman's Notes on Work, Life and Leadership",
 "The Power of Knowing What You Don't Know",
 'The Computer Science of Human Decisions',
 None,
 None,
 'How a Group of Ordinary Men Created an Extraordinary Company',
 'Life Lessons, Practical Advice and Timeless Wisdom for Success',
 'The Big Ideas Behind Reliable, Scalable, and Maintainable Systems',
 'How to Control Your Attention and Choose Your Life',

## 6. Language of the Audio Book:

![](https://cdn-images-1.medium.com/max/1600/1*EbNErCWkTvNgFHUFSRY73g.png)

The `<li class='bc-list-item languageLabel'>` tag and the `<span>` inside it hold the language field. Let's retrieve it using the function below.

In [24]:
def get_language(book_contents):
    language=[]
    for tag in book_contents:
        lang_tag= tag.find('li', class_='bc-list-item languageLabel')
        try:
            language_tag = lang_tag.find('span').text.split()
            language.append(language_tag)
        except AttributeError:
            language.append(None)
    return language

We can call our function `get_language` to get the language of the audiobook.

In [25]:
get_language(book_contents)

[['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
 ['Language:', 'English'],
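
Because `get_language` uses `.split()`, each entry is a list like `['Language:', 'English']` rather than a plain string. If a single value per book is preferred, the label can be dropped afterwards. A small sketch, with `clean_language` as a hypothetical helper name:

```python
def clean_language(tokens):
    """Turn ['Language:', 'English'] into 'English'; pass None through."""
    if not tokens:
        return None
    # Drop the label and keep whatever words remain
    words = [token for token in tokens if token != 'Language:']
    return ' '.join(words) or None


print(clean_language(['Language:', 'English']))
```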
 

## 7. Ratings for the book

![](https://cdn-images-1.medium.com/max/1600/1*FWuF4TwLDsVYPO9Ot_PZhQ.png)

The `<li class='bc-list-item ratingsLabel'>` tag and the `<span class="bc-text bc-pub-offscreen">` inside it hold the rating field. Let's retrieve the number of stars using the function below.

In [26]:
def get_rating(book_contents):
    rating=[]
    for tag in book_contents:
        star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
        try:
            rating_tag = star_tag.find('span', class_='bc-text bc-pub-offscreen').text.strip()
            rating.append(rating_tag)
        except AttributeError:
            rating.append(None)
    return rating

We can call our function `get_rating` to get the ratings.

In [27]:
get_rating(book_contents)

[None,
 None,
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '5 out of 5 stars',
 '5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 None,
 '4.5 out of 5 stars',
 '5 out of 5 stars',
 '5 out of 5 stars',
 '5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 None,
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars',
 '5 out of 5 stars',
 '4.5 out of 5 stars',
 '4.5 out of 5 stars
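
The ratings are strings such as `'4.5 out of 5 stars'`, with `None` for unrated books. To sort or average them, the leading number can be pulled out. This is a sketch under the assumption that every rating follows the "X out of 5 stars" wording shown above; `stars_to_float` is a hypothetical name.

```python
import re


def stars_to_float(text):
    """Convert '4.5 out of 5 stars' to 4.5; return None for missing ratings."""
    if text is None:
        return None
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of\s+5', text)
    return float(match.group(1)) if match else None


print(stars_to_float('4.5 out of 5 stars'))
```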

## 8. Number of People Who Rated

![](https://cdn-images-1.medium.com/max/1600/1*GLRZwKaiB_-bTz8d194qGA.png)

The `<li class='bc-list-item ratingsLabel'>` tag and the `<span class="bc-text bc-size-small bc-color-secondary">` inside it hold the number-of-ratings field. Let's retrieve the number of ratings using the function below.

In [28]:
def get_no_of_ratings(book_contents):
    no_of_ratings=[]
    for tag in book_contents:
        star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
        try:
            rating_tag = star_tag.find('span', class_='bc-text bc-size-small bc-color-secondary').text.strip()
            no_of_ratings.append(rating_tag)
        except AttributeError:
            no_of_ratings.append(None)
    return no_of_ratings

We can call our function `get_no_of_ratings` to get the number of people who rated each book.

In [29]:
get_no_of_ratings(book_contents)

['Not rated yet',
 'Not rated yet',
 '86 ratings',
 '129 ratings',
 '54 ratings',
 '50 ratings',
 '583 ratings',
 '482 ratings',
 '443 ratings',
 '550 ratings',
 '91 ratings',
 '12 ratings',
 '415 ratings',
 '194 ratings',
 'Not rated yet',
 '944 ratings',
 '1 rating',
 '12 ratings',
 '24 ratings',
 '465 ratings',
 '98 ratings',
 '2,180 ratings',
 '445 ratings',
 'Not rated yet',
 '532 ratings',
 '329 ratings',
 '590 ratings',
 '51 ratings',
 '186 ratings',
 '902 ratings',
 '743 ratings',
 '278 ratings',
 '318 ratings',
 '557 ratings',
 '363 ratings',
 '53 ratings',
 '509 ratings',
 '54 ratings',
 '260 ratings',
 '3 ratings',
 '1,093 ratings',
 '15 ratings',
 '395 ratings',
 '694 ratings',
 '2 ratings',
 '3 ratings',
 '6 ratings',
 '944 ratings',
 '37 ratings',
 '53 ratings']
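
The counts above mix three shapes: `'Not rated yet'`, `'1 rating'`, and values with a thousands separator such as `'2,180 ratings'`. A possible way to normalise them to integers is sketched below. `ratings_to_int` is a hypothetical helper, and treating "Not rated yet" as 0 is an assumption, not something the site states.

```python
import re


def ratings_to_int(text):
    """Convert '2,180 ratings' to 2180 and 'Not rated yet' to 0."""
    if text is None:
        return None
    if text.strip().lower() == 'not rated yet':
        return 0  # assumption: an unrated book has zero ratings
    match = re.match(r'\s*([\d,]+)\s+ratings?', text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


print(ratings_to_int('2,180 ratings'))
```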

## 9. Regular Price of the Audio Book

![](https://cdn-images-1.medium.com/max/1600/1*ubzh_Hwz2nDmgFNf9H1TiA.png)

The `<p class='bc-text buybox-regular-price bc-spacing-none bc-spacing-top-none'>` tag contains two `<span class="bc-text bc-size-base bc-color-base">` tags, and the second one holds the price. Let's retrieve the regular price of the audiobook using the function below.

In [30]:
def get_regular_price(book_contents):
    regular_price=[]
    for tag in book_contents:
        buy_tag= tag.find('p', class_='bc-text buybox-regular-price bc-spacing-none bc-spacing-top-none')
        try:
            price_tag = buy_tag.find_all('span', class_='bc-text bc-size-base bc-color-base')
            price= price_tag[1].text.strip()
            regular_price.append(price)
        except (AttributeError, IndexError):
            regular_price.append(None)
    return regular_price

We can call our function `get_regular_price` to get the regular price of each book.

In [31]:
get_regular_price(book_contents)

['₹1,519.00',
 '₹703.00',
 '₹888.00',
 '₹500.00',
 '₹501.00',
 '₹388.00',
 '₹166.00',
 '₹323.00',
 '₹1,172.00',
 '₹879.00',
 '₹836.00',
 '₹1,575.00',
 '₹888.00',
 '₹568.00',
 '₹645.00',
 '₹323.00',
 '₹1,575.00',
 '₹879.00',
 '₹1,675.00',
 '₹501.00',
 '₹668.00',
 '₹820.00',
 '₹820.00',
 '₹836.00',
 '₹1,350.00',
 '₹1,005.00',
 '₹452.00',
 '₹1,005.00',
 '₹937.00',
 '₹615.00',
 '₹1,005.00',
 '₹568.00',
 '₹820.00',
 '₹888.00',
 '₹501.00',
 '₹134.00',
 '₹844.00',
 '₹586.00',
 '₹233.00',
 '₹888.00',
 '₹703.00',
 '₹888.00',
 '₹820.00',
 '₹500.00',
 '₹668.00',
 '₹797.00',
 '₹586.00',
 '₹1,181.00',
 '₹820.00',
 '₹615.00']
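
The prices are strings with a currency sign and a thousands separator, for example `'₹1,519.00'`. Before doing arithmetic on them, they can be converted to numbers. The sketch below uses `Decimal` to keep the two decimal places exact; `price_to_decimal` is a hypothetical name, and it assumes all prices are in rupees and formatted as above.

```python
from decimal import Decimal, InvalidOperation


def price_to_decimal(text):
    """Convert '₹1,519.00' to Decimal('1519.00'); return None if not numeric."""
    if text is None:
        return None
    digits = text.replace('₹', '').replace(',', '').strip()
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


print(price_to_decimal('₹1,519.00'))
```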

## 10. Images of the Book Cover

![](https://cdn-images-1.medium.com/max/1600/1*IUkG5NrmhLmmeLw1CUVYJg.png)

The `<img class='bc-pub-block bc-image-inset-border js-only-element'>` tag holds the cover image, and its `src` attribute contains the image link. Let's retrieve the links of all the images using the function below.

In [32]:
def get_cover_img(book_contents):
    cover_img=[]
    for tag in book_contents:
        img_tag= tag.find_all('img', class_='bc-pub-block bc-image-inset-border js-only-element')
        try:
            book_image_url= img_tag[0]['src'].strip()
            cover_img.append(book_image_url)
        except (IndexError, KeyError):
            # No matching <img> tag, or the tag has no src attribute
            cover_img.append(None)
    return cover_img

We can call our function `get_cover_img` to get the images of the book covers.

In [33]:
get_cover_img(book_contents)

['https://m.media-amazon.com/images/I/51LP52ob7CL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/41xwPia5dAL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/41Px2q4eSiL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51Dl6lXXesL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51DbY4as4EL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/514WgltUohL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51vKha04DuL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/41eSJ3-5wUL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/41mDQ4JH8EL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51+8mBt4k7L._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51crV7aDATL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51q82y93idL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/41+9W87E6nL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51HiU+5mTwL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51KReSBvxoL._SL500_.jpg',
 'https://m.media-amazon.com/images/I/51

## Create a Dictionary of Items By Using all the Functions 


Now that we have all the items we need from the website, let's define a function that parses the HTML from a range of pages and collects the lists of items into a single dictionary, with one key per column.

In [34]:
def parse_pages_ranged(end_page):
    all_page_contents = {
            'Book_Name':[],
            'Description':[],
            'Author':[],
            'Rating':[],
            'No_of_Ratings':[],
            'Regular_Price':[],
            'Language':[],
            'Book_Audio_Length':[],
            'Cover_IMG':[],
            'Book_URL':[],
            }    
    # Pages are numbered from 1, so loop over 1..end_page inclusive
    for page in range(1, end_page + 1):
        pageno_x = get_pageno(page)
        book_contents = pageno_x.find_all('li', class_='bc-list-item productListItem')
        all_page_contents['Book_Name'] += get_book_names(book_contents)
        all_page_contents['Description'] += get_description(book_contents)
        all_page_contents['Author'] += get_written_by(book_contents)
        all_page_contents['Rating'] += get_rating(book_contents)
        all_page_contents['No_of_Ratings'] += get_no_of_ratings(book_contents)
        all_page_contents['Regular_Price'] += get_regular_price(book_contents)
        all_page_contents['Language'] += get_language(book_contents)
        all_page_contents['Book_Audio_Length'] += get_book_length(book_contents)
        all_page_contents['Cover_IMG'] += get_cover_img(book_contents)
        all_page_contents['Book_URL'] += get_book_links(book_contents)
    return all_page_contents
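
Since this function requests many pages in a row, it can be considerate to pause briefly between requests. The sketch below shows one way to structure such a loop. It is not the notebook's code: `crawl_pages` is a hypothetical helper, the one-second default delay is an arbitrary assumption, and `fetch_page` stands for any function that takes a page number (such as `get_pageno`). The `sleep` argument exists only so the delay can be replaced in a dry run.

```python
import time


def crawl_pages(first_page, last_page, fetch_page, delay_seconds=1.0, sleep=time.sleep):
    """Fetch pages first_page..last_page inclusive, pausing between requests."""
    results = []
    for page in range(first_page, last_page + 1):
        results.append(fetch_page(page))
        if page < last_page:
            # No need to wait after the final page
            sleep(delay_seconds)
    return results


# Dry run with a fake fetcher and no real waiting
print(crawl_pages(1, 3, lambda page: f'page-{page}', delay_seconds=0))
```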

We can call our function `parse_pages_ranged` to collect the data parsed from the Audible website. Using pandas, we can then view the data in tabular form.

`pandas.DataFrame()` creates a DataFrame: a two-dimensional data structure in which data is aligned in rows and columns, like a table.
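
One thing to keep in mind: when a DataFrame is built from a dictionary of lists, every list must have the same length, otherwise pandas raises a `ValueError`. This is why each helper above appends `None` when a field is missing instead of skipping the book. A quick sanity check can be sketched in plain Python; `check_column_lengths` is a hypothetical name and the sample data is made up.

```python
def check_column_lengths(columns):
    """Return the common column length, or raise ValueError if lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Columns have different lengths: {lengths}')
    # An empty dict has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)


sample = {
    'Book_Name': ['Range', 'Grit'],
    'Rating': ['4.5 out of 5 stars', None],  # None keeps the rows aligned
}
print(check_column_lengths(sample))
```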

In [None]:
all_pages_scraped= pd.DataFrame(parse_pages_ranged(24))
all_pages_scraped

## Save the extracted information to a CSV file

We've scraped 10 columns and 1000+ rows. Let's write all the data collected in a CSV file.

In [None]:
all_pages_scraped.to_csv('Audible_Business_and_Careers_Books_2022.csv', index=False)

## Summary


Here's a brief summary of the step-by-step process we followed for scraping Business and Career audiobooks from audible.in:

1. We downloaded the webpage using requests
2. We parsed the HTML source code of the web page using beautifulsoup4
3. We extracted the book name, description, author, rating, number of ratings, regular price, language, length, cover image, and book URL.
4. We compiled the data and created a CSV file using pandas.

### Here's the complete code that we've used to get this project done

In [None]:
def get_pageno(pageno):
    pageno= str(pageno)
    # Construct the URL
    books_pageno_url = 'https://www.audible.in/search?node=21881793031&pageSize=50&sort=&page=' + pageno
    
    # Get the HTML page content using requests
    response = rq.get(books_pageno_url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + books_pageno_url)

    # Construct a Beautiful Soup document, naming the parser explicitly
    doc = bs(response.text, 'html.parser')
    
    return doc

def get_book_names(book_contents):
    book_names = []
    for tag in book_contents:
        # The title is the text of the <a> tag directly under the <h3> tag
        a_tag_name = tag.h3.find_all('a', recursive=False)
        book_name = a_tag_name[0].text.strip()
        book_names.append(book_name)
    return book_names

def get_book_links(book_contents):
    base_url='https://www.audible.in'
    book_links=[]
    for tag in book_contents:
        a_tag_name= tag.h3.find_all('a', recursive=False)
        url= a_tag_name[0]['href'].strip()
        book_link= base_url+url
        book_links.append(book_link)
    return book_links

def get_book_length(book_contents):
    book_length=[]
    for tag in book_contents:
        try:
            len_tag= tag.find('li', class_='bc-list-item runtimeLabel')
            length_tag = len_tag.find('span')
            length = length_tag.text.strip()
            book_length.append(length)
        except AttributeError:
            book_length.append(None)
    return book_length

def get_written_by(book_contents):
    written_by=[]
    for tag in book_contents:
        author_tag= tag.find('li', class_='bc-list-item authorLabel')
        try:
            auth_tag = author_tag.find('a')
            author = auth_tag.text.strip()
            written_by.append(author)
        except AttributeError:
            written_by.append(None)
    return written_by

def get_description(book_contents):
    description=[]
    for tag in book_contents:
        about_tag= tag.find('li', class_='bc-list-item subtitle')
        try:
            description_tag = about_tag.find('span').text.strip()
            description.append(description_tag)
        except AttributeError:
            description.append(None)
    return description

def get_language(book_contents):
    language=[]
    for tag in book_contents:
        lang_tag= tag.find('li', class_='bc-list-item languageLabel')
        try:
            language_tag = lang_tag.find('span').text.split()
            language.append(language_tag)
        except AttributeError:
            language.append(None)
    return language

def get_rating(book_contents):
    rating=[]
    for tag in book_contents:
        star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
        try:
            rating_tag = star_tag.find('span', class_='bc-text bc-pub-offscreen').text.strip()
            rating.append(rating_tag)
        except AttributeError:
            rating.append(None)
    return rating

def get_no_of_ratings(book_contents):
    no_of_ratings=[]
    for tag in book_contents:
        star_tag= tag.find('li', class_='bc-list-item ratingsLabel')
        try:
            rating_tag = star_tag.find('span', class_='bc-text bc-size-small bc-color-secondary').text.strip()
            no_of_ratings.append(rating_tag)
        except AttributeError:
            no_of_ratings.append(None)
    return no_of_ratings

def get_regular_price(book_contents):
    regular_price=[]
    for tag in book_contents:
        buy_tag= tag.find('p', class_='bc-text buybox-regular-price bc-spacing-none bc-spacing-top-none')
        try:
            price_tag = buy_tag.find_all('span', class_='bc-text bc-size-base bc-color-base')
            price= price_tag[1].text.strip()
            regular_price.append(price)
        except (AttributeError, IndexError):
            regular_price.append(None)
    return regular_price

def get_cover_img(book_contents):
    cover_img=[]
    for tag in book_contents:
        img_tag= tag.find_all('img', class_='bc-pub-block bc-image-inset-border js-only-element')
        try:
            book_image_url= img_tag[0]['src'].strip()
            cover_img.append(book_image_url)
        except (IndexError, KeyError):
            # No matching <img> tag, or the tag has no src attribute
            cover_img.append(None)
    return cover_img

def parse_pages_ranged(end_page):
    all_page_contents = {
            'Book_Name':[],
            'Description':[],
            'Author':[],
            'Rating':[],
            'No_of_Ratings':[],
            'Regular_Price':[],
            'Language':[],
            'Book_Audio_Length':[],
            'Cover_IMG':[],
            'Book_URL':[],
            }    
    # Pages are numbered from 1, so loop over 1..end_page inclusive
    for page in range(1, end_page + 1):
        pageno_x = get_pageno(page)
        book_contents = pageno_x.find_all('li', class_='bc-list-item productListItem')
        all_page_contents['Book_Name'] += get_book_names(book_contents)
        all_page_contents['Description'] += get_description(book_contents)
        all_page_contents['Author'] += get_written_by(book_contents)
        all_page_contents['Rating'] += get_rating(book_contents)
        all_page_contents['No_of_Ratings'] += get_no_of_ratings(book_contents)
        all_page_contents['Regular_Price'] += get_regular_price(book_contents)
        all_page_contents['Language'] += get_language(book_contents)
        all_page_contents['Book_Audio_Length'] += get_book_length(book_contents)
        all_page_contents['Cover_IMG'] += get_cover_img(book_contents)
        all_page_contents['Book_URL'] += get_book_links(book_contents)
    return all_page_contents

The same approach works for the books in any other Audible category. All we have to do is find the relevant tags, put the appropriate tag and class names into the functions, and rename the variables so the code stays easy to follow.

### Saving the notebook

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(files=['Audible_Business_and_Careers_Books_2022.csv'])

In [None]:
#jovian.commit(project="web-scraping-project-audible_B&C")

## References
1. [Jovian](https://jovian.ai/) A platform to learn Data Science.
2. [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Documentation
3. [Pandas](https://pandas.pydata.org/docs/) Documentation
4. This Web Scraping project is completed under the guidance of Jovian Team. Thank you for all the support.

## Future Work
In the near future, I'll post again on web scraping using Selenium or Scrapy. I'll also be analyzing and visualizing the scraped data, so stay tuned!

## Follow me Here

[LinkedIn](https://www.linkedin.com/in/pratulot/) | [Jovian](https://jovian.ai/pratulofficialthings) | [GitHub](https://github.com/pratulot)