This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Daily Blink Page Layout has changed - IndexError: list index out of range #32

Open
ptrstn opened this issue Jun 3, 2022 · 20 comments
Labels
bug Something isn't working

Comments

@ptrstn
Owner

ptrstn commented Jun 3, 2022

The layout and URL of the free Daily page have changed.

New URL: https://www.blinkist.com/en/content/daily

The locator attribute values for BeautifulSoup have to be updated accordingly. The previous values are no longer valid and cause an IndexError:

    def _create_blink_info(response_text):
        soup = BeautifulSoup(response_text, "html.parser")
>       daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
E       IndexError: list index out of range
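Until the new selector is known, the failing lookup can at least be made defensive, so a layout change produces a clear error instead of a bare IndexError. A minimal sketch; note that `daily-book__cta` is the old class name, and any replacement selector for the new layout is an assumption until verified against the live page:

```python
from bs4 import BeautifulSoup

def find_daily_book_href(response_text):
    """Return the daily book link, failing loudly if the layout changed."""
    soup = BeautifulSoup(response_text, "html.parser")
    # "daily-book__cta" is the pre-2022 class name; it no longer matches
    # the new layout and would need updating for the current page.
    link = soup.find("a", {"class": "daily-book__cta"})
    if link is None or not link.has_attr("href"):
        raise RuntimeError(
            "Daily Blink link not found - the page layout has likely changed"
        )
    return link["href"]
```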
@ptrstn ptrstn added the bug Something isn't working label Jun 3, 2022
@kotzer3

kotzer3 commented Jun 4, 2022

Confirmed, I've also been having this since 22.05.2022, because the last folder in my library is:
'2022-05-21 - Finde den Weg zu deiner inneren Mitte'/

root@banane:~# python3 -m dailyblink
dailyblink v1.2.1, Python 3.9.2, Linux armv7l 32bit ELF
Downloading the free daily Blinks on 2022-06-04 22:47:32...
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 67, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 63, in main
    blinkist_scraper.download_daily_blinks(args.language, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 37, in download_daily_blinks
    self._attempt_daily_blinks_download(languages, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 56, in _attempt_daily_blinks_download
    self._download_daily_blinks(language_code, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 63, in _download_daily_blinks
    blink_info = self._get_daily_blink_info(language=language_code)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 126, in _get_daily_blink_info
    return _create_blink_info(response.text)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 171, in _create_blink_info
    daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
IndexError: list index out of range
root@banane:~#

@Erik262

Erik262 commented Jun 5, 2022

Yep, same here. How can this be fixed?

@NicoWeio

NicoWeio commented Jun 7, 2022

I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)

@Erik262

Erik262 commented Jun 8, 2022

I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)

Perfect, let me please know!

@NicoWeio

NicoWeio commented Jun 8, 2022

Here you go. :)

⚠️ Update: I've created a repo with updated code here

Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.

import cloudscraper
from datetime import datetime
from pathlib import Path
import requests
from rich import print
from rich.progress import track

BASE_URL = 'https://www.blinkist.com/'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0',
    'x-requested-with': 'XMLHttpRequest',
}

LOCALES = ['en', 'de']
DOWNLOAD_DIR = Path.home() / 'Musik' / 'Blinkist'

scraper = cloudscraper.create_scraper()


def get_book_dir(book):
    return DOWNLOAD_DIR / f"{datetime.today().strftime('%Y-%m-%d')}{book['slug']}"


def get_free_daily(locale):
    # see also: https://www.blinkist.com/en/content/daily
    response = scraper.get(
        BASE_URL + 'api/free_daily',
        params={'locale': locale}
    )
    return response.json()


def get_chapters(book_slug):
    url = f"{BASE_URL}/api/books/{book_slug}/chapters"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()['chapters']


def get_chapter(book_id, chapter_id):
    url = f"{BASE_URL}/api/books/{book_id}/chapters/{chapter_id}"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def download_chapter_audio(book, chapter_data):
    book_dir = get_book_dir(book)
    book_dir.mkdir(exist_ok=True)
    file_path = book_dir / f"chapter_{chapter_data['order_no']}.m4a"

    if file_path.exists():
        print(f"Skipping existing file: {file_path}")
        return

    assert 'm4a' in chapter_data['signed_audio_url']
    response = scraper.get(chapter_data['signed_audio_url'])
    assert response.status_code == 200
    file_path.write_bytes(response.content)
    print(f"Downloaded chapter {chapter_data['order_no']}")


for locale in LOCALES:
    free_daily = get_free_daily(locale=locale)
    book = free_daily['book']
    print(f"Today's free daily in {locale} is: “{book['title']}”")

    # list of chapters without their content
    chapter_list = get_chapters(book['slug'])

    # fetch chapter content
    chapters = [get_chapter(book['id'], chapter['id']) for chapter in track(chapter_list, description='Fetching chapters…')]

    # download audio
    for chapter in track(chapters, description='Downloading audio…'):
        download_chapter_audio(book, chapter)

    # write markdown
    # excluded for brevity – just access chapter['text'] etc.
    # markdown_text = download_book_md(book, chapters)
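The omitted markdown step could look roughly like this. This is a hypothetical sketch: the `title` and `text` keys on each chapter dict are assumptions based on the comment above, not a confirmed API shape:

```python
def chapters_to_markdown(book, chapters):
    # Hypothetical: assumes each chapter dict carries 'title' and 'text'.
    lines = [f"# {book['title']}", ""]
    for chapter in chapters:
        lines.append(f"## {chapter.get('title', '')}")
        lines.append(chapter.get('text', ''))
        lines.append("")
    return "\n".join(lines)
```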

@Erik262

Erik262 commented Jun 8, 2022

@NicoWeio does your code work straight out of the box, or does it need to replace core.py?

@WrayOfSunshine

Would this approach work on a Windows machine?

@NicoWeio

@NicoWeio does your code work straight out of the box, or does it need to replace core.py?

See my earlier comment:

However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium.

Assuming you have cloudscraper installed, my script works out of the box, and it should download the audio just fine. However, it does not generate a text or cover image file, does not set the audio's metadata, and does not precisely follow dailyblink's naming conventions.

@NicoWeio

Would this approach work on a Windows machine?

If dailyblink worked on Windows before, yes. Both my approach using Blinkist's API and the current approach using BeautifulSoup.

@Erik262

Erik262 commented Jun 13, 2022

@ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

@ptrstn
Owner Author

ptrstn commented Jun 13, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

@Erik262

Erik262 commented Jun 13, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

Sure, you're right about that.

@rajeshbhavikatti

rajeshbhavikatti commented Jun 14, 2022

Here you go. :)

Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.


Executing this code on Google Colab, I get a 403 Forbidden error on line 70 when calling get_chapters. After troubleshooting, I found that response.raise_for_status() raises the error because the URL can't be accessed. How can I resolve this?
@NicoWeio

@NicoWeio

@rajeshbhavikatti I just published my code here, so we can keep this issue clean from further discussions.
Notice the double slash in the URL? That might be the cause, although it didn't cause issues for me. Maybe because of a different requests version? Anyway, I fixed the double slashes in my code. Plus, I've added CI to my repo, and it works just fine there, too.
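For reference, the double slash comes from BASE_URL ending in '/' while the f-strings in get_chapters and get_chapter add another one. A small sketch of one way to avoid it, using the stdlib urljoin together with a leading-slash strip:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.blinkist.com/'

def api_url(path):
    # Normalizing the leading slash prevents '.com//api/...' URLs
    # regardless of whether callers pass '/api/...' or 'api/...'.
    return urljoin(BASE_URL, path.lstrip('/'))
```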

@kotzer3

kotzer3 commented Sep 29, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

Hi Peter @ptrstn , do you have some updates on this?

@ptrstn
Owner Author

ptrstn commented Sep 29, 2022

Hi Peter @ptrstn , do you have some updates on this?

I'll be able to work on it starting in early October, since I'm still busy with private matters.

@kotzer3

kotzer3 commented Dec 10, 2022

Hi Peter @ptrstn , do you have some updates on this?

I'll be able to work on it starting beginning of October, since I'm still busy with private issues

Any news for us?

@rajeshbhavikatti

Hi, I have made some updates based on this repo. Feel free to reach out to me about any changes or updates; check out my notebook here.

@Erik262

Erik262 commented Jan 7, 2023

@rajeshbhavikatti nice work, but you don't fetch the mp3 files.

@rajeshbhavikatti

@Erik262 yes, as the Notion API doesn't support it yet.
