This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Daily Blink Page Layout has changed - IndexError: list index out of range #32

Open
ptrstn opened this issue Jun 3, 2022 · 20 comments
Labels
bug Something isn't working

Comments

@ptrstn
Owner

ptrstn commented Jun 3, 2022

The layout and URL of the free Daily page have changed.

New URL: https://www.blinkist.com/en/content/daily

The locator attribute values for BeautifulSoup have to be updated accordingly. The previous values are no longer valid and cause an IndexError:

    def _create_blink_info(response_text):
        soup = BeautifulSoup(response_text, "html.parser")
>       daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
E       IndexError: list index out of range
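Until the new selector is known, the failing lookup can at least be made defensive, so a layout change produces a clear error instead of a bare IndexError. A minimal sketch; note that `daily-book__cta` is the old class name, and any replacement selector for the new layout is an assumption until verified against the live page:

```python
from bs4 import BeautifulSoup

def find_daily_book_href(response_text):
    """Return the daily book link, failing loudly if the layout changed."""
    soup = BeautifulSoup(response_text, "html.parser")
    # "daily-book__cta" is the pre-2022 class name; it no longer matches
    # the new layout and would need updating for the current page.
    link = soup.find("a", {"class": "daily-book__cta"})
    if link is None or not link.has_attr("href"):
        raise RuntimeError(
            "Daily Blink link not found - the page layout has likely changed"
        )
    return link["href"]
```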
@ptrstn ptrstn added the bug Something isn't working label Jun 3, 2022
@kotzer3

kotzer3 commented Jun 4, 2022

Confirmed, I've also been having this since 22.05.2022, because the last folder in my library is:
'2022-05-21 - Finde den Weg zu deiner inneren Mitte'/

root@banane:~# python3 -m dailyblink
dailyblink v1.2.1, Python 3.9.2, Linux armv7l 32bit ELF
Downloading the free daily Blinks on 2022-06-04 22:47:32...
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 67, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 63, in main
    blinkist_scraper.download_daily_blinks(args.language, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 37, in download_daily_blinks
    self._attempt_daily_blinks_download(languages, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 56, in _attempt_daily_blinks_download
    self._download_daily_blinks(language_code, base_path)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 63, in _download_daily_blinks
    blink_info = self._get_daily_blink_info(language=language_code)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 126, in _get_daily_blink_info
    return _create_blink_info(response.text)
  File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 171, in _create_blink_info
    daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
IndexError: list index out of range
root@banane:~#

@Erik262

Erik262 commented Jun 5, 2022

Yep, same here. How can this be fixed?

@NicoWeio

NicoWeio commented Jun 7, 2022

I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)

@Erik262

Erik262 commented Jun 8, 2022

I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)

Perfect, let me please know!

@NicoWeio

NicoWeio commented Jun 8, 2022

Here you go. :)

⚠️ Update: I've created a repo with updated code here

Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.

import cloudscraper
from datetime import datetime
from pathlib import Path
import requests
from rich import print
from rich.progress import track

BASE_URL = 'https://www.blinkist.com/'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0',
    'x-requested-with': 'XMLHttpRequest',
}

LOCALES = ['en', 'de']
DOWNLOAD_DIR = Path.home() / 'Musik' / 'Blinkist'

scraper = cloudscraper.create_scraper()


def get_book_dir(book):
    return DOWNLOAD_DIR / f"{datetime.today().strftime('%Y-%m-%d')}{book['slug']}"


def get_free_daily(locale):
    # see also: https://www.blinkist.com/en/content/daily
    response = scraper.get(
        BASE_URL + 'api/free_daily',
        params={'locale': locale}
    )
    return response.json()


def get_chapters(book_slug):
    url = f"{BASE_URL}/api/books/{book_slug}/chapters"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()['chapters']


def get_chapter(book_id, chapter_id):
    url = f"{BASE_URL}/api/books/{book_id}/chapters/{chapter_id}"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def download_chapter_audio(book, chapter_data):
    book_dir = get_book_dir(book)
    book_dir.mkdir(exist_ok=True)
    file_path = book_dir / f"chapter_{chapter_data['order_no']}.m4a"

    if file_path.exists():
        print(f"Skipping existing file: {file_path}")
        return

    assert 'm4a' in chapter_data['signed_audio_url']
    response = scraper.get(chapter_data['signed_audio_url'])
    assert response.status_code == 200
    file_path.write_bytes(response.content)
    print(f"Downloaded chapter {chapter_data['order_no']}")


for locale in LOCALES:
    free_daily = get_free_daily(locale=locale)
    book = free_daily['book']
    print(f"Today's free daily in {locale} is: “{book['title']}”")

    # list of chapters without their content
    chapter_list = get_chapters(book['slug'])

    # fetch chapter content
    chapters = [get_chapter(book['id'], chapter['id']) for chapter in track(chapter_list, description='Fetching chapters…')]

    # download audio
    for chapter in track(chapters, description='Downloading audio…'):
        download_chapter_audio(book, chapter)

    # write markdown
    # excluded for brevity – just access chapter['text'] etc.
    # markdown_text = download_book_md(book, chapters)
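The omitted markdown step could look roughly like this. This is a hypothetical sketch: the `title` and `text` keys on each chapter dict are assumptions based on the comment above, not a confirmed API shape:

```python
def chapters_to_markdown(book, chapters):
    # Hypothetical: assumes each chapter dict carries 'title' and 'text'.
    lines = [f"# {book['title']}", ""]
    for chapter in chapters:
        lines.append(f"## {chapter.get('title', '')}")
        lines.append(chapter.get('text', ''))
        lines.append("")
    return "\n".join(lines)
```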

@Erik262

Erik262 commented Jun 8, 2022

@NicoWeio does your code work straight out of the box, or does it need to replace core.py?

@WrayOfSunshine

Would this approach work on a Windows machine?

@NicoWeio

@NicoWeio does your code work straight out of the box, or does it need to replace core.py?

See my earlier comment:

However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium.

Assuming you have cloudscraper installed, my script works out of the box, and it should download the audio just fine. However, it does not generate a text or cover image file, does not set the audio's metadata, and does not precisely follow dailyblink's naming conventions.

@NicoWeio

Would this approach work on a Windows machine?

If dailyblink worked on Windows before, yes. Both my approach using Blinkist's API and the current approach using BeautifulSoup.

@Erik262

Erik262 commented Jun 13, 2022

@ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.

@ptrstn
Owner Author

ptrstn commented Jun 13, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

@Erik262

Erik262 commented Jun 13, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

Sure, you're right about that.

@rajeshbhavikatti

rajeshbhavikatti commented Jun 14, 2022

Here you go. :)

Again, I haven't tried other values for User-Agent yet, and I can't check whether this approach will work for Premium content.


Executing this code on Google Colab, I get a 403 Forbidden error on line 70 when calling get_chapters. After troubleshooting, I found that response.raise_for_status() raises the error because the URL can't be accessed. How can I resolve this?
@NicoWeio

@NicoWeio

@rajeshbhavikatti I just published my code here, so we can keep this issue clean from further discussions.
Notice the double slash in the URL? That might be the cause, although it didn't cause issues for me. Maybe because of a different requests version? Anyway, I fixed the double slashes in my code. Plus, I've added CI to my repo, and it works just fine there, too.
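For reference, the double slash comes from BASE_URL ending in '/' while the f-strings in get_chapters and get_chapter add another one. A small sketch of one way to avoid it, using the stdlib urljoin together with a leading-slash strip:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.blinkist.com/'

def api_url(path):
    # Normalizing the leading slash prevents '.com//api/...' URLs
    # regardless of whether callers pass '/api/...' or 'api/...'.
    return urljoin(BASE_URL, path.lstrip('/'))
```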

@kotzer3

kotzer3 commented Sep 29, 2022

@ptrstn Is there a fix/update coming? you said until Sunday and then you removed your answer.

This change requires some refactoring and a little bit more time than initially expected. I'll see what I can do. Can't guarantee you when though, since I've got other things in life to take care of first.

Hi Peter @ptrstn , do you have some updates on this?

@ptrstn
Owner Author

ptrstn commented Sep 29, 2022

Hi Peter @ptrstn , do you have some updates on this?

I'll be able to work on it starting in early October, since I'm still busy with private matters.

@kotzer3

kotzer3 commented Dec 10, 2022

Hi Peter @ptrstn , do you have some updates on this?

I'll be able to work on it starting beginning of October, since I'm still busy with private issues

Any news for us?

@rajeshbhavikatti

Hi, I have made some updates based on this repo. Feel free to reach out to me about any changes or updates; check out my notebook here.

@Erik262

Erik262 commented Jan 7, 2023

@rajeshbhavikatti nice work, but you don't fetch the mp3 files.

@rajeshbhavikatti

@Erik262 yes, as the Notion API doesn't support it yet.
