Readability.summarize breaks when using Floki html5ever parser #65

@vkryukov

I discovered this while trying to fix some parsing errors for pages with charsets other than UTF-8.

Floki allows changing the underlying HTML parser, which you might want to do for faster parsing, for example. However, selecting the html5ever parser breaks things:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.Html5ever

Now Readability.summarize is broken:

iex(1)> Readability.summarize("https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58")
** (FunctionClauseError) no function clause matching in Readability.Helper.remove_tag/2    
    
    The following arguments were given to Readability.Helper.remove_tag/2:
    
        # 1
        {:doctype, "html", "", ""}
    
        # 2
        #Function<0.45730907/1 in Readability.Helper.normalize/2>
    
    Attempted function clauses (showing 4 out of 4):
    
        def remove_tag(content, _) when is_binary(content)
        def remove_tag([], _)
        def remove_tag([h | t], fun)
        def remove_tag({tag, attrs, inner_tree} = html_tree, fun)
    
    (readability 0.12.1) lib/readability/helper.ex:62: Readability.Helper.remove_tag/2
    (readability 0.12.1) lib/readability/helper.ex:66: Readability.Helper.remove_tag/2
    (readability 0.12.1) lib/readability.ex:92: Readability.summarize/2
    iex:1: (file)
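
The trace shows that remove_tag/2 only has a tuple clause for 3-element {tag, attrs, inner_tree} nodes, while the html5ever parser also emits a 4-element {:doctype, "html", "", ""} node. A possible workaround, sketched here under the assumption that dropping doctype nodes before further processing is acceptable, is to filter them out of the parsed tree. The DoctypeFilter module and its strip_doctype/1 function are hypothetical names and are not part of Readability or Floki:

```elixir
defmodule DoctypeFilter do
  # Hypothetical helper, not part of Readability or Floki.
  # Recursively removes {:doctype, _, _, _} nodes from a Floki-style HTML tree.

  # A list of nodes: drop doctype tuples, then recurse into what remains.
  def strip_doctype(tree) when is_list(tree) do
    tree
    |> Enum.reject(&match?({:doctype, _, _, _}, &1))
    |> Enum.map(&strip_doctype/1)
  end

  # A regular element node: keep it and clean its children.
  def strip_doctype({tag, attrs, children}) do
    {tag, attrs, strip_doctype(children)}
  end

  # Text and any other node shape are returned unchanged.
  def strip_doctype(other), do: other
end

tree = [
  {:doctype, "html", "", ""},
  {"html", [], [{"body", [], ["hi"]}]}
]

IO.inspect(DoctypeFilter.strip_doctype(tree))
# prints [{"html", [], [{"body", [], ["hi"]}]}]
```

An alternative would be an extra remove_tag/2 clause that skips such nodes, but that belongs to the library itself; this sketch only illustrates the node shape that triggers the error.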
