Release 3.9.1
mborsetti committed Jan 28, 2022
1 parent 93ad084 commit db79546
Showing 9 changed files with 129 additions and 74 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 3.9
current_version = 3.9.1
message = Release {new_version}
parse = ^
(?P<major>\d+)
111 changes: 105 additions & 6 deletions CHANGELOG.rst
@@ -30,13 +30,112 @@ can check out the `wish list <https://github.com/mborsetti/webchanges/blob/main/
Security, in case of vulnerabilities. [triggers a minor patch]
Internals, for changes that don't affect users. [triggers a minor patch]
Version 3.9.1
===================
2022-01-27

⚠ Breaking changes in the near future
-------------------------------------
Pyppeteer will be replaced with Playwright (can opt in now!)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The implementation of ``use_browser: true`` jobs (i.e. those running on a browser to run JavaScript) using Pyppeteer
has been very problematic, as the library:

* is in alpha,
* is very slow,
* defaults to years-old obsolete versions of Chromium,
* can be insecure (found that TLS certificate verification was disabled for downloading browsers!),
* creates conflicts with imports (e.g. requires an obsolete version of websockets),
* is poorly documented,
* is poorly maintained,
* and freezes when running it in the current version of Python (3.10)!

Pyppeteer's `open issues <https://github.com/pyppeteer/pyppeteer/issues>`__ now exceed 110.

As a result, I have been investigating a substitute, and found one in `Playwright
<https://playwright.dev/python/>`__. This package has none of the issues above, the core dev team apparently is the same
that wrote Puppeteer (of which Pyppeteer is a port to Python), and it is supported by the deep pockets of Microsoft. The
Python version is officially supported and up-to-date, and we can easily use the latest stable version of Google Chrome
with it without mucking around with setting chromium_revisions.

You can upgrade to Playwright now!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Playwright implementation in this release of **webchanges** is extremely stable, fully tested (even on Python
3.10!), and much faster than Pyppeteer (some of my jobs are running 3x faster!). While it's probably production
quality, for the moment it is being released as an opt-in beta only.

I urge you to switch to Playwright. To do so:

Ensure that you have at least Python 3.8 (not tested in 3.7 due to testing limitations).

Install dependencies::

   pip install --upgrade webchanges[playwright]

Ensure you have an up-to-date Chrome installation::

   webchanges --install-chrome

Edit your configuration file...::

   webchanges --edit-config

to add ``_beta_use_playwright: true`` (note the leading underline) under the ``browser`` section of ``job_defaults``,
like this:

.. code-block:: yaml

   job_defaults:
     browser:
       _beta_use_playwright: true

That's it!

All job sub-directives work as they are, with only two minor exceptions:

* ``wait_for`` needs to be replaced with either ``wait_for_selector`` (see more `here
  <https://playwright.dev/python/docs/api/class-frame/#frame-wait-for-selector>`__) or ``wait_for_function`` (see
  more `here <https://playwright.dev/python/docs/api/class-frame/#frame-wait-for-function>`__).
  These can still be strings (in which case they will be either the selector or the expression) but also dicts with
  arguments accepted by those functions (except for timeout, which is set by the ``timeout`` sub-directive).
* The experimental ``block_elements`` sub-directive is not implemented (yet?) and is simply ignored.
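
The string-or-dict rule described above can be sketched as follows. This is a minimal illustration, not the code used by **webchanges**: the helper name ``wait_kwargs`` is hypothetical, and it assumes the timeout has already been converted to the unit the wait call expects.

```python
from typing import Any, Dict, Union


def wait_kwargs(directive: Union[str, Dict[str, Any]], key: str, timeout: float) -> Dict[str, Any]:
    """Hypothetical helper: turn a string-or-dict sub-directive into keyword arguments.

    ``key`` names the main argument: 'selector' for wait_for_selector,
    'expression' for wait_for_function.
    """
    if isinstance(directive, str):
        # A bare string is the selector or the expression itself.
        kwargs: Dict[str, Any] = {key: directive}
    elif isinstance(directive, dict):
        # Copy so that the job definition is not mutated.
        kwargs = dict(directive)
    else:
        raise TypeError(f'expected str or dict, got {type(directive).__name__}')
    # The timeout always comes from the job's ``timeout`` sub-directive.
    kwargs['timeout'] = timeout
    return kwargs
```

Overwriting ``timeout`` last means a value supplied inside the dict is ignored, which matches the exception noted in the bullet above.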

The following sub-directives are new:

* ``referer``: Referer header value. If provided, it will take precedence over the referer header value set by the
  ``headers`` sub-directive.
* ``headless`` (true/false): Launch browser in headless mode (i.e. invisible) (defaults to true). Set it to false to see
  what's going on in the browser for debugging purposes.

Please make sure to open a GitHub `issue <https://github.com/mborsetti/webchanges/issues>`__ if you encounter
anything wrong!

If you decide to stick with Playwright, you can free up disk space (if no other package uses Pyppeteer) by removing
the downloaded Chromium by deleting the *directory* shown by running::

   webchanges --chromium-directory

and uninstalling the Pyppeteer package by running::

   pip uninstall pyppeteer

The Playwright implementation also determines the maximum number of jobs to run in parallel based on the amount of free
memory available, which seems to be the relevant constraint, and this will make **webchanges** faster on machines with
lots of memory and more stable on small ones.
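
One way to picture a memory-based limit is the sketch below. The changelog does not state the rule that is actually used, so everything here is an assumption: the function name, the 200 MiB per-job figure, and the hard cap are illustrative only.

```python
def max_parallel_jobs(free_bytes: int, per_job_bytes: int = 200 * 1024**2, hard_cap: int = 32) -> int:
    """Hypothetical sizing rule: one worker per ``per_job_bytes`` of free memory.

    Always allows at least one worker and never more than ``hard_cap``.
    """
    by_memory = free_bytes // per_job_bytes
    return max(1, min(by_memory, hard_cap))
```

With these assumed numbers, a machine with 1 GiB free would run 5 browser jobs at a time, while one with 64 GiB free would be limited by the cap rather than by memory.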

Fixed
-----
* Config file directives checker would incorrectly reject reports added through ``hooks.py``. Reported by `Knut Wannheden
<https://github.com/knutwannheden>`__ at `#24 <https://github.com/mborsetti/webchanges/issues/24>`__.
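
The idea behind this fix (see the ``storage.py`` hunk of this commit) can be tried in isolation with the sketch below. It is a simplified stand-in, not the project's code: ``ReporterBase`` here is a dummy class, ``drop_hook_reports`` is a hypothetical name, and the ``hooks`` module is built in memory for the demonstration.

```python
import inspect
import types
from typing import Any, Dict


class ReporterBase:
    """Dummy stand-in for the project's reporter base class (assumption)."""

    __kind__ = ''


def drop_hook_reports(extras: Dict[str, Any], hooks_module: types.ModuleType) -> Dict[str, Any]:
    """Return ``extras`` without report keys owned by reporters defined in ``hooks_module``."""
    report = dict(extras.get('report', {}))
    for _name, obj in inspect.getmembers(hooks_module, inspect.isclass):
        # Only classes defined in the hooks module itself, not ones it merely imports.
        if obj.__module__ == hooks_module.__name__ and issubclass(obj, ReporterBase):
            report.pop(obj.__kind__, None)  # tolerate hook reporters that are not configured
    result = {key: value for key, value in extras.items() if key != 'report'}
    if report:
        result['report'] = report
    return result


# Build an in-memory ``hooks`` module containing one custom reporter.
hooks = types.ModuleType('hooks')
hooks.CustomReporter = type(
    'CustomReporter', (ReporterBase,), {'__kind__': 'custom', '__module__': 'hooks'}
)
```

Unlike the committed code, this sketch passes a default to ``pop`` so that a hook reporter with no entry in the configuration does not raise ``KeyError``; that is a design choice of the example.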


Version 3.9
===================
2022-01-26

⚠ Breaking changes in the near future (opt-in now):
---------------------------------------------------
Pyppetter will be replaced with Playwright (can opt in now!)
⚠ Breaking changes in the near future
-------------------------------------
Pyppeteer will be replaced with Playwright (can opt in now!)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The implementation of ``use_browser: true`` jobs (i.e. those running on a browser to run JavaScript) using Pyppeteer
has been very problematic, as the library:
@@ -80,7 +179,7 @@ Edit your configuration file...::

webchanges --edit-config

...to add ``_beta_use_playwright: true`` (note the leading underline) under the ``browser`` section of ``job_defaults``,
to add ``_beta_use_playwright: true`` (note the leading underline) under the ``browser`` section of ``job_defaults``,
like this:

.. code-block:: yaml
@@ -809,7 +908,7 @@ Relative to *urlwatch* 2.21:
* New ``additions_only`` directive to report only added lines (useful when monitoring only new content)
* New ``deletions_only`` directive to report only deleted lines
* New ``contextlines`` directive to set the number of context lines in the unified diff
* Support for Python 3.9
* Support for Python Version 3.9
* Backward compatibility with *urlwatch* 2.21 (except running on Python 3.5 or using ``lynx``, which is replaced by
the built-in ``html2text`` filter)

@@ -870,7 +969,7 @@ Relative to *urlwatch* 2.21:
* Expanded (only slightly) testing
* Using flake8 to check PEP-8 compliance and more
* Using coverage to check unit testing coverage
* Upgraded Travis CI to Python 3.9 from 3.9-dev and cleaned up pip installs
* Upgraded Travis CI to Python Version 3.9 from Version 3.9-dev and cleaned up pip installs

Removed
-------
58 changes: 2 additions & 56 deletions RELEASE.rst
@@ -87,61 +87,7 @@ The Playwright implementation also determines the maximum number of jobs to run
memory available, which seems to be the relevant constraint, and this will make **webchanges** faster on machines with
lots of memory and more stable on small ones.

Changed
-------
* The method ``bs4`` of filter ``html2text`` has a new ``strip`` sub-directive which is passed to BeautifulSoup, and
its default value has changed to false to conform to BeautifulSoup's default. This gives better output in most
cases. To restore the previous non-standard behavior, add the ``strip: true`` sub-directive to the ``html2text``
filter of jobs.
* Pyppeteer (used for URL jobs with ``use_browser: true``) is now crashing during certain tests with Python 3.7.
There will be no new development to fix this as the use of Pyppeteer will soon be deprecated in favor of Playwright.
See above to start using Playwright now (highly suggested).

Added
-----
* The method ``bs4`` of filter ``html2text`` now accepts the sub-directives ``separator`` and ``strip``.
* When using the command line argument ``--test-diff``, the output can now be sent to a specific reporter by also
specifying the ``--test-reporter`` argument. For example, if running on a machine with a web browser, you can see
the HTML version of the last diff(s) from job 1 with ``webchanges --test-diff 1 --test-reporter browser`` on your
local browser.
* New filter ``remove-duplicate-lines``. Contributed by `Michael Sverdlin <https://github.com/sveder>`__ upstream `here
<https://github.com/thp/urlwatch/pull/653>`__ (with modifications).
* New filter ``csv2text``. Contributed by `Michael Sverdlin <https://github.com/sveder>`__ upstream `here
<https://github.com/thp/urlwatch/pull/658>`__ (with modifications).
* The ``html`` report type has a new job directive ``monospace`` which sets the output to use a monospace font.
This can be useful e.g. for tabular text extracted by the ``pdf2text`` filter.
* The ``command_run`` report type has a new environment variable ``WEBCHANGES_CHANGED_JOBS_JSON``.
* Opt-in to use Playwright for jobs with ``use_browser: true`` instead of pyppeteer (see above).

Fixed
-----
* During conversion of Markdown to HTML,
* Code blocks were not rendered without wrapping and in monospace font;
* Spaces immediately after ````` (code block opening) were being dropped.
* The ``email`` reporter's ``sendmail`` sub-directive was not passing the ``from`` sub-directive (when specified) to
the ``sendmail`` executable as an ``-f`` command line argument. Contributed by
`Jonas Witschel <https://github.com/diabonas>`__ upstream `here <https://github.com/thp/urlwatch/pull/671>`__ (with
modifications).
* HTML characters were not being unescaped when the job name is determined from the <title> tag of the data monitored
(if present).
* Command line argument ``--test-diff`` was only showing the last diff instead of all saved ones.
* The ``command_run`` report type was not setting variables ``count`` and ``jobs`` (always 0). Contributed by
`Brian Rak <https://github.com/devicenull>`__ in `#23 <https://github.com/mborsetti/webchanges/issues/23>`__.

Documentation
-------------
* Updated the "recipe" for monitoring Facebook public posts.
* Improved documentation for filter ``pdf2text``.

Internals
---------
* Support for Python 3.10 (except for URL jobs with ``use_browser`` using pyppeteer since it does not yet support it;
use Playwright instead).
* Improved speed of detection and handling of lines starting with spaces during conversion of Markdown to HTML.
* Logging (``--verbose``) now shows thread IDs to help with debugging.

Known issues
------------
* Pyppeteer (used for URL jobs with ``use_browser: true``) is now crashing during certain tests with Python 3.7.
There will be no new development to fix this as the use of Pyppeteer will soon be deprecated in favor of Playwright.
See above to start using Playwright now (highly suggested).
* Config file directives checker would incorrectly reject reports added through ``hooks.py``. Reported by `Knut Wannheden
<https://github.com/knutwannheden>`__ at `#24 <https://github.com/mborsetti/webchanges/issues/24>`__.
6 changes: 3 additions & 3 deletions tests/reporters_test.py
@@ -35,12 +35,12 @@
('+Added line', '<tr style="background-color:#d1ffd1;color:#082b08"><td>Added line</td></tr>'),
(
'-Deleted line',
'<tr style="background-color:#fff0f0;color:#9c1c1c;text-decoration:line-through">' '<td>Deleted line</td></tr>',
'<tr style="background-color:#fff0f0;color:#9c1c1c;text-decoration:line-through"><td>Deleted line</td></tr>',
),
# Changes line
(
'@@ -1,1 +1,1 @@',
'<tr style="background-color:#fbfbfb"><td style="font-family:monospace">@@ -1,1 +1,1 @@' '</td></tr>',
'<tr style="background-color:#fbfbfb"><td style="font-family:monospace">@@ -1,1 +1,1 @@</td></tr>',
),
# Horizontal ruler is manually expanded since <hr> tag is used to separate jobs
(
@@ -111,7 +111,7 @@ def test_diff_to_htm_padded_table():
job = JobBase.unserialize({'url': 'https://www.example.com', 'is_markdown': True, 'markdown_padded_tables': True})
result = ''.join(list(HtmlReporter(report, {}, [], 0)._diff_to_html(inpt, job)))
assert result[250:-8] == (
'<tr><td><span style="font-family:monospace;white-space:pre-wrap">| table | ' 'row |</span></td></tr>'
'<tr><td><span style="font-family:monospace;white-space:pre-wrap">| table | row |</span></td></tr>'
)
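
The string changes in this file (and the similar ones in other files of this commit) only merge adjacent string literals. Python joins adjacent literals at compile time, so the expected values in the tests stay the same; a quick check using one of the strings above:

```python
# Adjacent string literals are concatenated by the compiler, so both spellings are equal.
split = '<tr style="background-color:#fbfbfb"><td style="font-family:monospace">@@ -1,1 +1,1 @@' '</td></tr>'
joined = '<tr style="background-color:#fbfbfb"><td style="font-family:monospace">@@ -1,1 +1,1 @@</td></tr>'
assert split == joined
```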


4 changes: 2 additions & 2 deletions webchanges/__init__.py
@@ -16,9 +16,9 @@
# * MINOR version when you add functionality in a backwards compatible manner, and
# * MICRO or PATCH version when you make backwards compatible bug fixes. We no longer use '0'
# If unsure on increments, use pkg_resources.parse_version to parse
__version__ = '3.9'
__version__ = '3.9.1'
__description__ = (
'Check web (or commands) for changes since last run and notify.\n\n' 'Anonymously alerts you of webpage changes.'
'Check web (or commands) for changes since last run and notify.\n\nAnonymously alerts you of webpage changes.'
)
__author__ = 'Mike Borsetti <mike@borsetti.com>'
__copyright__ = 'Copyright 2020- Mike Borsetti'
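
To see why ``3.9.1`` sorts after ``3.9`` but before ``3.10``, the simplified sketch below compares versions as integer tuples. It is an illustration only: ``release_tuple`` is a hypothetical helper that handles purely numeric dotted versions, not pre-releases or other suffixes that a full version parser covers.

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Split a purely numeric dotted version such as '3.9.1' into integers."""
    return tuple(int(part) for part in version.split('.'))


# Tuples compare element by element, so a shorter equal prefix sorts first.
assert release_tuple('3.9') < release_tuple('3.9.1') < release_tuple('3.10')
```

Comparing the raw strings instead would place ``'3.10'`` before ``'3.9'``, which is why versions are parsed before being compared.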
2 changes: 1 addition & 1 deletion webchanges/_vendored/packaging_version.py
@@ -196,7 +196,7 @@ def __init__(self, version: str) -> None:
self._key = _legacy_cmpkey(self._version)

warnings.warn(
'Creating a LegacyVersion has been deprecated and will be ' 'removed in the next major release',
'Creating a LegacyVersion has been deprecated and will be removed in the next major release',
DeprecationWarning,
)

2 changes: 1 addition & 1 deletion webchanges/filters.py
@@ -149,7 +149,7 @@ def normalize_filter_list(
:param filter_spec: A list of either filter_kind, subfilter (where subfilter is a dict) or a legacy
string-based filter list specification.
:returns: Iterator of filter_kind, subfilter (where subfilter is a dict)
:returns: Iterator of filter_kind, subfilter (where subfilter is a dict).
"""
for filter_kind, subfilter in cls._internal_normalize_filter_list(filter_spec):
filtercls = cls.__subclasses__.get(filter_kind, None)
2 changes: 1 addition & 1 deletion webchanges/reporters.py
@@ -1055,7 +1055,7 @@ def submit(self, **kwargs: Any) -> None: # type: ignore[override]
service = self.web_service_get()
except Exception as e:
raise RuntimeError(
f'Failed to load or connect to {self.__kind__} - are the dependencies installed and ' 'configured?'
f'Failed to load or connect to {self.__kind__} - are the dependencies installed and configured?'
) from e

self.web_service_submit(service, 'Website Change Detected', text)
16 changes: 13 additions & 3 deletions webchanges/storage.py
@@ -4,11 +4,13 @@
import copy
import email.utils
import getpass
import inspect
import logging
import os
import shutil
import sqlite3
import stat
import sys
import threading
from abc import ABC, abstractmethod
from collections import defaultdict
@@ -22,6 +24,7 @@
from . import __docs_url__, __project_name__, __version__
from .filters import FilterBase
from .jobs import JobBase, ShellJob
from .reporters import ReporterBase
from .util import edit_file

try:
@@ -662,6 +665,13 @@ def check_for_unrecognized_keys(self, config: Config) -> None:
        if 'slack' in config_for_extras.get('report', {}):  # legacy key; ignore
            config_for_extras['report'].pop('slack')  # type: ignore[typeddict-item]
        extras: Config = self.dict_deep_difference(config_for_extras, DEFAULT_CONFIG)
        if extras.get('report') and 'hooks' in sys.modules:
            # skip reports added by hooks
            for name, obj in inspect.getmembers(sys.modules['hooks'], inspect.isclass):
                if obj.__module__ == 'hooks' and issubclass(obj, ReporterBase):
                    extras['report'].pop(obj.__kind__)  # type: ignore[misc]
            if not len(extras['report']):
                extras.pop('report')  # type: ignore[misc]
        if extras:
            raise ValueError(
                f'Unrecognized directive(s) in the configuration file {self.filename}:\n'
@@ -737,7 +747,7 @@ def _parse(cls, fp: TextIO) -> List[JobBase]:
if conflicting_jobs:
raise ValueError(
'\n '.join(
['Each job must have a unique URL/command (for URLs, append #1, #2, etc. to ' 'make them unique):']
['Each job must have a unique URL/command (for URLs, append #1, #2, etc. to make them unique):']
+ conflicting_jobs
)
)
@@ -1147,7 +1157,7 @@ def load(self, guid: str) -> Snapshot:
"""
with self.lock:
row = self._execute(
'SELECT msgpack_data, timestamp FROM webchanges WHERE uuid = ? ' 'ORDER BY timestamp DESC LIMIT 1',
'SELECT msgpack_data, timestamp FROM webchanges WHERE uuid = ? ORDER BY timestamp DESC LIMIT 1',
(guid,),
).fetchone()
if row:
@@ -1175,7 +1185,7 @@ def get_history_data(self, guid: str, count: Optional[int] = None) -> Dict[str,

with self.lock:
rows = self._execute(
'SELECT msgpack_data, timestamp FROM webchanges WHERE uuid = ? ' 'ORDER BY timestamp DESC', (guid,)
'SELECT msgpack_data, timestamp FROM webchanges WHERE uuid = ? ORDER BY timestamp DESC', (guid,)
).fetchall()
if rows:
for msgpack_data, timestamp in rows: