Add retry logic with detailed logs to extraction of video data #214

Merged: 3 commits, Jul 10, 2024
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+
+- Change log level of "Video at {url} has not yet been translated into {requested_lang_code}" messages from warning to debug (way too verbose)
+
+### Fixed
+
+- Restore functionality to resist temporary bad TED responses when parsing video pages (#209)
+
 ## [3.0.2] - 2024-06-24

 ### Changed
11 changes: 11 additions & 0 deletions codecov.yml
@@ -0,0 +1,11 @@
+coverage:
+  status:
+    project:
+      default:
+        informational: true
+    patch:
+      default:
+        informational: true
+    changes:
+      default:
+        informational: true
31 changes: 24 additions & 7 deletions src/ted2zim/scraper.py
@@ -12,7 +12,7 @@
 import dateutil.parser
 import jinja2
 import yt_dlp
-from bs4 import BeautifulSoup
+from bs4 import BeautifulSoup, Tag
 from kiwixstorage import KiwixStorage
 from pif import get_public_ip
 from slugify import slugify
@@ -821,15 +821,32 @@
         try:
             soup = BeautifulSoup(html_content, features="html.parser")

-            json_data = json.loads(
-                soup.find(
-                    "script", attrs={"id": "__NEXT_DATA__"}
-                ).string  # pyright: ignore
-            )["props"]["pageProps"]["videoData"]
+            next_data_tag = soup.find("script", attrs={"id": "__NEXT_DATA__"})
+
+            # TED is sometimes inconsistent in sending HTML content, it sometimes sends
+            # the HTML without the required script containing the talks data, so we
+            # retry after 5 seconds
+            if (
+                not next_data_tag
+                or not isinstance(next_data_tag, Tag)
+                or not isinstance(next_data_tag.string, str)
+            ):
+                logger.debug(
+                    "Insufficient data returned by server, __NEXT_DATA__ script not "
+                    "found in HTML page. Retrying in 5 seconds..."
+                )
+                time.sleep(5)
+                return self.extract_info_from_video_page(
+                    url, retry_count=retry_count + 1
+                )
+
+            json_data = json.loads(next_data_tag.string)["props"]["pageProps"][
+                "videoData"
+            ]

             requested_lang_code = self.get_lang_code_from_url(url)
             if requested_lang_code and json_data["language"] != requested_lang_code:
-                logger.warning(
+                logger.debug(
                     f"Video at {url} has not yet been translated into "
                     f"{requested_lang_code}"
                 )
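
The retry behaviour in this hunk can be illustrated with a standalone sketch. It is not the project's code: it swaps the recursive call for a bounded loop, uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra dependencies, and the names `NextDataParser`, `extract_video_data`, `fetch_html`, `max_retries` and `sleep` are all hypothetical. The retry limit of 3 is an assumption too, since the hunk does not show how `retry_count` is capped.

```python
import json
import time
from html.parser import HTMLParser
from typing import Callable, Optional


class NextDataParser(HTMLParser):
    """Collect the text of <script id="__NEXT_DATA__">, if the page has one."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.payload: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True
            self.payload = ""

    def handle_data(self, data):
        if self._inside:
            self.payload += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_video_data(
    fetch_html: Callable[[], str],  # hypothetical: returns the page HTML
    max_retries: int = 3,  # assumed limit, not taken from the diff
    delay_seconds: float = 5.0,  # the diff waits 5 seconds between tries
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> dict:
    """Return props.pageProps.videoData, retrying while the script is missing."""
    for attempt in range(max_retries + 1):
        parser = NextDataParser()
        parser.feed(fetch_html())
        parser.close()
        if parser.payload:
            return json.loads(parser.payload)["props"]["pageProps"]["videoData"]
        # Script tag missing or empty: wait and fetch again, unless this
        # was the last allowed attempt.
        if attempt < max_retries:
            sleep(delay_seconds)
    raise RuntimeError("__NEXT_DATA__ script not found after retries")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a caller that wants the behaviour described in the diff would leave it at `time.sleep`.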