Fix behavior of "auto" parser to be first parser returning links wins
This is how it was actually documented to work in the code, but had been
intentionally broken to just try to work with as much as possible. With more
rigorous testing of the parsers and add's behavior, we shouldn't need that.

History explained here:
ArchiveBox#1363 (comment)

With this, we pass 9 of the 19 tests in tests/parser.
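To illustrate the difference the commit message describes, here is a minimal, self-contained sketch contrasting the two selection strategies. It is not the ArchiveBox implementation: the three parser functions, the `PARSERS` list, and both `*_wins` helpers are hypothetical stand-ins invented for this example, and returning `([], None)` when no parser matches is an assumption (the diff does not show how the real function handles that case).

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical stand-ins for real parsers: each takes text and returns URLs.
def parse_json_lines(text: str) -> Iterable[str]:
    raise ValueError('not JSON')          # simulates a parser that fails

def parse_plain_urls(text: str) -> Iterable[str]:
    return [word for word in text.split() if word.startswith('http')]

def parse_everything(text: str) -> Iterable[str]:
    return text.split()                   # greedy: treats every word as a link

Parser = Tuple[str, Callable[[str], Iterable[str]]]
PARSERS: List[Parser] = [
    ('json', parse_json_lines),
    ('urls', parse_plain_urls),
    ('generic', parse_everything),
]

def first_parser_wins(text: str, parsers: List[Parser]) -> Tuple[List[str], Optional[str]]:
    """Return the result of the first parser that yields any links."""
    for name, func in parsers:
        try:
            links = list(func(text))
        except Exception:
            continue                      # this parser failed, try the next one
        if links:
            return links, name            # stop at the first success
    return [], None                       # assumption: no parser matched

def most_links_wins(text: str, parsers: List[Parser]) -> Tuple[List[str], Optional[str]]:
    """Run every parser and keep whichever result has the most links."""
    best: List[str] = []
    best_name: Optional[str] = None
    for name, func in parsers:
        try:
            links = list(func(text))
        except Exception:
            continue
        if len(links) > len(best):
            best, best_name = links, name
    return best, best_name

sample = 'see https://example.com for details'
first_result = first_parser_wins(sample, PARSERS)
most_result = most_links_wins(sample, PARSERS)
```

With this sample input, the first-wins strategy stops at the `urls` parser, while the most-links strategy prefers the greedy `generic` parser because it reports more items. That shows why parser order matters once the first success ends the search.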
jimwins committed Mar 4, 2024
1 parent 861f78f commit 9980674
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions archivebox/parsers/__init__.py
@@ -130,10 +130,8 @@ def run_parser_functions(to_parse: IO[str], timer, root_url: Optional[str]=None,
if not parsed_links:
raise Exception(f'No links found using {parser_name} parser')

-# print(f'[√] Parser {parser_name} succeeded: {len(parsed_links)} links parsed')
-if len(parsed_links) > len(most_links):
-most_links = parsed_links
-best_parser_name = parser_name
+print(f'[√] Parser {parser_name} succeeded: {len(parsed_links)} links parsed')
+break

except Exception as err: # noqa
# Parsers are tried one by one down the list, and the first one
@@ -143,8 +141,9 @@ def run_parser_functions(to_parse: IO[str], timer, root_url: Optional[str]=None,
# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
# raise
pass

timer.end()
-return most_links, best_parser_name
+return parsed_links, parser_name


@enforce_types
