Use Readability.js and Selenium to extract the useful text from a web page.

mattblaha/readability-selenium

readability-selenium

Use Readability.js and Selenium to extract the useful text from a web page.

After trying many options, I found that approaches that avoid launching a real web browser fail on too many sites that rely heavily on JavaScript to load their content. To work on the real web, I needed to automate a real browser.

This will inject Readability.js into the browser, execute it and fetch the results. It is essentially identical to visiting the page in Firefox and hitting the reader view button in the URL bar.
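The inject-and-execute step can be sketched roughly as follows. This is a minimal illustration, not the contents of example.py: the function names `build_injection_script` and `extract_article` are hypothetical, and `driver` is assumed to be any Selenium WebDriver instance. It relies on Selenium running the script as a function body, so the Readability source defined at the top is in scope for the final `return`.

```python
def build_injection_script(readability_source: str) -> str:
    """Return JavaScript that defines Readability, then parses the page.

    Hypothetical helper: the library source is prepended so that the
    Readability constructor exists when the last statement runs.
    """
    return (
        readability_source
        + "\n"
        # Parse a clone so Readability does not modify the live DOM.
        + "return new Readability(document.cloneNode(true)).parse();"
    )


def extract_article(driver, url: str, readability_source: str):
    """Load `url` in the browser and return Readability's parse result.

    `driver` is assumed to be a Selenium WebDriver. The result is a dict
    with keys such as "title" and "textContent", or None if Readability
    could not extract an article.
    """
    driver.get(url)
    return driver.execute_script(build_injection_script(readability_source))
```

With a real driver, `readability_source` would be the text of the Readability.js file placed alongside example.py, and the readable text would be available as `result["textContent"]` when the result is not None.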

To run the example, just clone the repo, place your own copy of Readability.js alongside example.py, and run:

    python example.py https://github.com/mattblaha/readability-selenium

The simplest way I know of to set up a Selenium server that works with the example is with Docker (or Podman):

    docker run -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:3.141.59
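Once the container is running, a client might connect to it along these lines. This is a sketch under assumptions rather than what example.py necessarily does: `hub_url` and `connect` are hypothetical names, the host and port mirror the `-p 4444:4444` mapping above, and the `options=` argument assumes a Selenium 4 Python client (a Selenium 3 client would pass `desired_capabilities` instead).

```python
def hub_url(host: str = "localhost", port: int = 4444) -> str:
    """Build the WebDriver endpoint exposed by the standalone server.

    Selenium 3 standalone images serve the WebDriver API under /wd/hub.
    """
    return f"http://{host}:{port}/wd/hub"


def connect(url: str = ""):
    """Open a remote Chrome session (assumes a Selenium 4 Python client)."""
    # Imported lazily so this module loads even where selenium is absent.
    from selenium import webdriver

    return webdriver.Remote(
        command_executor=url or hub_url(),
        options=webdriver.ChromeOptions(),
    )
```

Remember to call `driver.quit()` when finished, so the session on the server is released.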
