# Use data to find the best audiobooks
> This blog shows you how to use Python to improve your audiobook-listening experience

- toc:true
- branch: master
- badges: true
- comments: true
- author: Mikkel Freltoft Krogsholm
- categories: [audiobooks, api, python, jupyter]

I am a voracious audiobook consumer and I finish 40-60 books a year (the trick is to listen to them at 2x-3x speed). I am also picky. There is a lot of great content out there and I only want to listen to the best of it.

I have a trial membership of the audiobook service [Nextory](https://www.nextory.dk/). I wasn't completely satisfied with the way their filtering works, so I decided to see if I could access their data and create my own filtering. It turns out I could.

This blog shows you how, so you can create a list of great audiobooks to listen to!

### Import libraries
We need to import a few libraries for the code to work.

In [1]:
# For doing API calls
import requests
from requests.exceptions import HTTPError

# For creating data frames
import pandas as pd

# For cleaning text
import re

# For doing math
import numpy as np 

# For showing a progress bar
from tqdm import tqdm

### Get the data

It turns out that the [Nextory](https://www.nextory.dk/) webpage makes API requests to their database behind the scenes, and by mimicking those, we can get the data in a nice clean format that is ready to be analysed.

We can use the `requests` library to mimic the API calls.

### Custom functions

I am defining three custom functions that I need to either retrieve data or clean data. 

- A `get_categories` function that gets all of the "main" categories that Nextory has. 
- A `get_books` function that gets a given number of books in a category.
- A `clean_cat` function that cleans the category label and makes it pretty.

In [2]:
# Define a custom function for getting book categories
def get_categories():
    
    headers = {
        'Authorization': 'Basic bmV4dG9yeXVpOnRvYmVkZWNpZGVk',
        'locale': 'da_DK',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }

    url = 'https://www.nextory.dk/api/web/catalogue/6.2/groups?view=category'

    try:
        response = requests.get(url, headers = headers)
        response.raise_for_status()

        # access JSON content
        jsonResponse = response.json()

        bookgroups = jsonResponse['data']['bookgroups']

        return bookgroups

    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except Exception as err:
        print(f'Other error occurred: {err}')
        

# Define a custom function for getting books
def get_books(category, rows = 20):
    
    headers = {
    'Authorization': 'Basic bmV4dG9yeXVpOnRvYmVkZWNpZGVk',
    'locale': 'da_DK',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    
    url = "https://www.nextory.dk/api/web/catalogue/6.2/booksforbookgroup/" + category + "?rows=" + str(rows)
    
    try:
        response = requests.get(url, headers = headers)
        response.raise_for_status()
        
        # access JSON content
        jsonResponse = response.json()
        
        books = jsonResponse['data']['books']
    
        return books

    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except Exception as err:
        print(f'Other error occurred: {err}')
    
# Define a custom function for cleaning categories
def clean_cat(s):
    s = re.sub(r'-\d$', '', s)
    s = re.sub(r'-', ' ', s)
    s = s.title()
    return s
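
As a quick sanity check of `clean_cat`, here is a small self-contained sketch that repeats the function and runs it on three slugs of the kind the API returns. It only illustrates what the two regular expressions and `title()` do: a trailing `-<digit>` is stripped, hyphens become spaces, and each word is capitalised.

```python
import re

# Same logic as the clean_cat function defined above
def clean_cat(s):
    s = re.sub(r'-\d$', '', s)   # drop a trailing "-<single digit>" suffix
    s = re.sub(r'-', ' ', s)     # turn the remaining hyphens into spaces
    s = s.title()                # capitalise every word
    return s

examples = ['krimi-1', 'feelgood-and-romance-1', 'books-in-english']
cleaned = [clean_cat(s) for s in examples]
print(cleaned)
# ['Krimi', 'Feelgood And Romance', 'Books In English']
```

Note that the first pattern only matches a single trailing digit, so a slug ending in a two-digit suffix would keep its number.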

### Get the book categories

First I run my `get_categories` function to get the categories.

In [3]:
bookcats = get_categories()

This returns a list of dictionaries with information about each category. Let's look at the first entry. It has some information about the "main" category and then a list of subcategories within the main one.

**NOTE**: As you can tell, the names are in Danish. You can change the locale for Nextory in the custom functions above to get data for another region.

In [4]:
bookcats[0]

{'id': '4001',
 'title': 'Krimi',
 'position': 1,
 'slugname': 'krimi-1',
 'subcategories': [{'id': '4002',
   'title': 'Hyggelig krimi',
   'position': 1,
   'slugname': 'hyggelig-krimi-2'},
  {'id': '4003',
   'title': 'Humoristisk krimi',
   'position': 2,
   'slugname': 'humoristisk-krimi-2'},
  {'id': '4004',
   'title': 'Privatdetektiver',
   'position': 3,
   'slugname': 'privatdetektiver-2'},
  {'id': '4005',
   'title': 'Noir krimi',
   'position': 4,
   'slugname': 'noir-krimi-2'},
  {'id': '4006',
   'title': 'Politi romaner',
   'position': 5,
   'slugname': 'politi-romaner-2'},
  {'id': '4007',
   'title': 'Klassisk krimi',
   'position': 6,
   'slugname': 'klassisk-krimi-2'},
  {'id': '4008',
   'title': 'Historisk krimi',
   'position': 7,
   'slugname': 'historisk-krimi-2'}]}

I am interested in the slugnames of the main categories. These are the names I need to use in order to get books from each category.

In [5]:
slugs = []
 
for bookcat in bookcats:
    slugs.append(bookcat['slugname'])

In [6]:
slugs

['krimi-1',
 'spaending-1',
 'sande-historier-1',
 'borneboger-1',
 'feelgood-and-romance-1',
 'biografier-and-reportage-1',
 'skonlitteratur-1',
 'skraek-1',
 'klassikere-and-poesi-1',
 'fantasy-and-science-fiction-1',
 'personlig-udvikling-1',
 'fakta-1',
 'samfund-and-politik-1',
 'serier-and-humor-1',
 'livsstil-and-hobby-1',
 'teenager-and-young-adult-1',
 'letlaest-1',
 'books-in-english']

With this list of categories, it is time to query each one and get books. For that I will use my `get_books` function. I will try to get 1,000 books from each category.

In [8]:
books_list = []

for slug in tqdm(slugs):
    
    try:
        books = get_books(slug, 1000)
        df = pd.DataFrame(books)
        df['category'] = slug
        books_list.append(df)

    except Exception as err:
        print(f'An error occurred with {slug}: {err}')


100%|██████████| 18/18 [01:01<00:00,  3.42s/it]


This gives me a list of pandas data frames, one for each category, that I will concatenate into one big data frame.

In [11]:
booksdf = pd.concat(books_list)
booksdf.head(5)

Unnamed: 0,id,title,type,imageurl,weburl,descriptionbrief,relatedbookid,pubdate,authors,allowedinlibrary,...,titleslug,narratorslug,inCompletedList,relatedInCompletedList,bookstatus,relatedbookstatus,numberinseries,seriesslug,series,category
0,10370882,Alt det som ingen ser,2,https://www.nextory.se/coverimg/130/10370882_2...,https://www.nextory.dk/bog/alt-det-som-ingen-s...,Danske Rasmus vågner i en seng i Moskva sammen...,10374229,2019-06-06 00:00:00 +0200,[Jan Have Eriksen],1,...,alt-det-som-ingen-ser-10370882,paul-becker-400331,0,0,ACTIVE,ACTIVE,,,,krimi-1
1,10471409,Bag lukkede døre,2,https://www.nextory.se/coverimg/130/10471409_2...,https://www.nextory.dk/bog/bag-lukkede-døre-10...,#1 PÅ SUNDAY TIMES-BESTSELLERLISTE. Kender du ...,10471545,2017-01-31 00:00:00 +0100,[B.A. Paris],1,...,bag-lukkede-døre-10471409,bolette-schrøder-400250,0,0,ACTIVE,ACTIVE,1.0,mørke-hemmeligheder-20092,Mørke hemmeligheder,krimi-1
2,10502808,Hvor flodkrebsene synger,2,https://www.nextory.se/coverimg/130/10502808_2...,https://www.nextory.dk/bog/hvor-flodkrebsene-s...,"Kya Clark er den vilde pige, ”marskpigen”. Hun...",10502806,2019-09-27 00:00:00 +0200,[Delia Owens],1,...,hvor-flodkrebsene-synger-10502808,sara-ullner-415618,0,0,ACTIVE,ACTIVE,,,,krimi-1
3,10528019,Offer 2117,2,https://www.nextory.se/coverimg/130/10528019_2...,https://www.nextory.dk/bog/offer-2117-10528019,Offer 2117 er ottende bind i Jussi Adler-Olsen...,10664393,2019-06-14 00:00:00 +0200,[Jussi Adler-Olsen],1,...,offer-2117-10528019,githa-lehrmann-400308,0,0,ACTIVE,ACTIVE,8.0,afdeling-q-18468,Afdeling Q,krimi-1
4,10471424,Sammenbruddet,2,https://www.nextory.se/coverimg/130/10471424_2...,https://www.nextory.dk/bog/sammenbruddet-10471424,"Cass har haft det svært siden den aften, hun s...",10471502,2018-08-24 00:00:00 +0200,[B.A. Paris],1,...,sammenbruddet-10471424,bolette-schrøder-400250,0,0,ACTIVE,ACTIVE,2.0,mørke-hemmeligheder-20092,Mørke hemmeligheder,krimi-1


### Find the best books

First I want to remove duplicates in my data. A book can belong to multiple categories, but I only want one category per book. To do this I look at each ISBN and pick the category where that book has the highest position. If there are ties after that I just pick the first one. Then I drop the rank and position columns.

In [16]:
idx = booksdf.groupby(['isbn'])['position'].transform('max') == booksdf['position']
booksunique = booksdf[idx]
booksunique = booksunique.groupby('isbn').first()
booksunique = booksunique.drop(['rank', 'position'], axis=1)
booksunique.head(5)

isbn,id,title,type,imageurl,weburl,descriptionbrief,relatedbookid,pubdate,authors,allowedinlibrary,...,titleslug,narratorslug,inCompletedList,relatedInCompletedList,bookstatus,relatedbookstatus,numberinseries,seriesslug,series,category
9780007227761,10051841,The Hobbit,2,https://www.nextory.se/coverimg/130/10051841_2...,https://www.nextory.dk/bog/the-hobbit-10051841,Bilbo Baggins is a hobbit who enjoys a comfort...,,2005-10-17 00:00:00 +0200,[J. R. R. Tolkien],1,...,the-hobbit-10051841,martin-shaw-86014,0,0,ACTIVE,,,,,books-in-english
9780007228386,10051842,The Fellowship of the Ring (The Lord of the Ri...,2,https://www.nextory.se/coverimg/130/10051842_2...,https://www.nextory.dk/bog/the-fellowship-of-t...,"Continuing the story begun in The Hobbit, this...",,2005-10-17 00:00:00 +0200,[J. R. R. Tolkien],1,...,the-fellowship-of-the-ring-the-lord-of-the-rin...,rob-inglis-85977,0,0,ACTIVE,,1.0,the-lord-of-the-rings-2356,The Lord of the Rings,books-in-english
9780007228393,10051850,"The Two Towers (The Lord of the Rings, Book 2)",2,https://www.nextory.se/coverimg/130/10051850_2...,https://www.nextory.dk/bog/the-two-towers-the-...,Building on the story begun in The Hobbit and ...,,2005-10-17 00:00:00 +0200,[J. R. R. Tolkien],1,...,the-two-towers-the-lord-of-the-rings-book-2--1...,rob-inglis-85977,0,0,ACTIVE,,2.0,the-lord-of-the-rings-2356,The Lord of the Rings,books-in-english
9780007237494,10052538,"A Dance With Dragons (A Song of Ice and Fire, ...",2,https://www.nextory.se/coverimg/130/10052538_8...,https://www.nextory.dk/bog/a-dance-with-dragon...,HBO’s hit series A GAME OF THRONES is based on...,,2011-07-12 00:00:00 +0200,[George R.R. Martin],1,...,a-dance-with-dragons-a-song-of-ice-and-fire-bo...,roy-dotrice-95418,0,0,ACTIVE,,5.0,a-song-of-ice-and-fire-2394,A Song of Ice and Fire,books-in-english
9780007237500,10052539,"A Game of Thrones (A Song of Ice and Fire, Boo...",2,https://www.nextory.se/coverimg/130/10052539_7...,https://www.nextory.dk/bog/a-game-of-thrones-a...,HBO’s hit series A GAME OF THRONES is based on...,,2011-07-12 00:00:00 +0200,[George R.R. Martin],1,...,a-game-of-thrones-a-song-of-ice-and-fire-book-...,roy-dotrice-95418,0,0,ACTIVE,,1.0,a-song-of-ice-and-fire-2394,A Song of Ice and Fire,books-in-english
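
To see what the deduplication step does, here is a minimal sketch on a made-up toy frame. The ISBNs, positions and the variable names `toy`, `is_max` and `unique` are hypothetical; only the `groupby` / `transform` / `first` pattern mirrors the cell above.

```python
import pandas as pd

# Toy data: ISBN "111" appears in two categories, "333" appears twice with a tie
toy = pd.DataFrame({
    'isbn': ['111', '111', '222', '333', '333'],
    'position': [3, 7, 1, 5, 5],
    'category': ['krimi-1', 'spaending-1', 'fakta-1',
                 'skraek-1', 'fantasy-and-science-fiction-1'],
})

# Keep only the rows whose position equals the maximum position for that ISBN
is_max = toy.groupby('isbn')['position'].transform('max') == toy['position']

# Ties survive the filter, so take the first remaining row per ISBN
unique = toy[is_max].groupby('isbn').first()
print(unique)
```

ISBN `111` keeps the category with position 7, and the tie for `333` is resolved by taking whichever row comes first in the frame.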


Now I want to find the best books. I am using my own logic here and you are welcome to do it in another way - but please share that with me so I can learn.

First I am removing all books without a rating. Then I am calculating the natural log of the number of ratings, which dampens the advantage of heavily rated books and gives good books with fewer ratings a better chance.

Then I multiply the average rating with the log of the number of ratings to produce a score. The logic is that the product of the two is the best indicator of a good book, i.e. one that consistently generates positive reviews.

Finally I sort them according to their new score.

In [17]:
booksunique = booksunique[booksunique['avgrate'] != 0]
booksunique = booksunique[booksunique['numberofrates'] != 0]

booksunique['log_of_numberofrates'] = booksunique['numberofrates'].apply(np.log)

booksunique['score'] = booksunique['avgrate'] * booksunique['log_of_numberofrates']

booksunique = booksunique.sort_values(by = ['score'], ascending = False)
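
To get a feel for how the score behaves, here is a small sketch with made-up numbers. The helper name `score` and the sample ratings are assumptions for illustration; it uses `math.log`, which gives the same natural log as `np.log` for a single number.

```python
import math

def score(avgrate, numberofrates):
    # average rating weighted by the natural log of the rating count
    return avgrate * math.log(numberofrates)

few_but_great = score(4.9, 5)      # excellent average, very few ratings
many_and_good = score(4.4, 1000)   # slightly lower average, many ratings
single_rating = score(5.0, 1)      # log(1) is 0, so the score is 0.0

print(few_but_great, many_and_good, single_rating)
```

The book with many ratings wins despite its lower average, and a book with exactly one rating always scores zero, whatever its rating. That last point is a side effect of the formula worth keeping in mind.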

Finally I will clean up the data frame a bit. I will limit the number of columns and clean the category name. Then I pick only the top ten in each category.

In [29]:
# take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
goodbooks = booksunique[['title', 'category', 'avgrate', 'numberofrates', 'score']].copy()
goodbooks['category'] = goodbooks['category'].apply(clean_cat)

# sort by category, then by score (best first) within each category
goodbooks = goodbooks.sort_values(['category', 'score'], ascending=[True, False]).reset_index(drop=True)

# select top N rows within each category
goodbooks = goodbooks.groupby('category').head(10)
goodbooks


Unnamed: 0,title,category,avgrate,numberofrates,score
0,Sygeplejersken - En af Danmarkshistoriens mest...,Biografier And Reportage,4.443798,1032,30.836646
1,Geggo: Fra vild teenager til businesskvinde og...,Biografier And Reportage,4.424134,837,29.773642
2,Rødder: En gangsters udvej. Nedim Yasars historie,Biografier And Reportage,4.358811,471,26.827863
3,Kære Zoe Ukhona,Biografier And Reportage,4.695817,263,26.165818
4,Nicklas Bendtner - Begge sider,Biografier And Reportage,4.347059,340,25.338769
...,...,...,...,...,...
5074,Twilight: Tusmørke,Teenager And Young Adult,4.352941,17,12.332811
5075,Råddenskab: Ravneringene 2,Teenager And Young Adult,4.461538,13,11.443620
5076,Evnen: Ravneringene 3,Teenager And Young Adult,4.727273,11,11.335505
5077,Døde piger lyver ikke,Teenager And Young Adult,4.416667,12,10.975004


### FIN
And voila. Now you have a list of pretty good audiobooks to pass the time with. Enjoy :)