- New and updated requirements:
- packaging >= 20.0
- scrapy-poet >= 0.9.0
- web-poet >= 0.13.0
- zyte-common-items
- Added a scrapy-poet provider for Zyte API. Currently supported data types:
  - web_poet.BrowserHtml
  - web_poet.BrowserResponse
  - zyte_common_items.Product
- Added a zyte_api_default_params request meta key which allows users to ignore the ZYTE_API_DEFAULT_PARAMS setting for individual requests.
- CI fixes.
- Fixed an exception raised by the downloader middleware when cookies were enabled.
- Made Python 3.11 support official.
- Added support for the upcoming automatic extraction feature of Zyte API.
- Included a descriptive message in the exception that triggers when the download handler cannot be initialized.
- Clarified that LOG_LEVEL must be DEBUG for ZYTE_API_LOG_REQUESTS messages to be visible.
- Fixed the handling of response cookies without a domain.
- CI fixes.
- Fixed an AssertionError when cookies are disabled.
- Added links to the README to improve navigation from GitHub.
- Added a license file (BSD-3-Clause).
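The logging and default-parameter entries above can be sketched as follows. This is a minimal illustration that uses only names from those entries. The zyte_api parameters shown, the boolean value of ZYTE_API_LOG_REQUESTS, and the False value for zyte_api_default_params are assumptions about typical usage, not something this changelog states.

```python
# settings.py (sketch): ZYTE_API_LOG_REQUESTS messages are only visible
# when LOG_LEVEL is DEBUG.
LOG_LEVEL = "DEBUG"
ZYTE_API_LOG_REQUESTS = True  # assumed to be a boolean switch

# Request meta (sketch), intended to be passed as scrapy.Request(url, meta=request_meta).
request_meta = {
    "zyte_api": {"browserHtml": True},  # assumed example parameters
    "zyte_api_default_params": False,  # assumed: False ignores ZYTE_API_DEFAULT_PARAMS
}
```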
Added experimental cookie support:
- The experimental.responseCookies response parameter is now mapped to the response headers as Set-Cookie headers, as well as added to the cookiejar of the request.
- A new boolean setting, ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED, can be set to True to enable automated mapping of cookies from a request cookiejar into the experimental.requestCookies Zyte API parameter.
- The ZyteAPITextResponse is now a subclass of HtmlResponse, so that the open_in_browser function of Scrapy uses the .html extension for Zyte API responses.
  While not ideal, this is much better than the previous behavior, where the .html extension was never used for Zyte API responses.
- ScrapyZyteAPIDownloaderMiddleware now also supports non-string slot IDs.
- It is now possible to log the parameters of requests sent.
- Stats for HTTP and HTTPS traffic used to be kept separate, and only one of those sets of stats would be reported. This is fixed now.
- Fixed some code examples and references in the README.
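To make the experimental cookie support more concrete, here is a rough sketch of the setting plus a simplified mapping from cookie dictionaries to Set-Cookie header values. The to_set_cookie helper and the cookie field names (name, value, domain, path) are hypothetical and chosen for illustration; this is not the plugin's actual code.

```python
# settings.py (sketch): enable automated mapping of request cookiejar cookies.
ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED = True


def to_set_cookie(cookie: dict) -> str:
    """Render one cookie dict as a simplified Set-Cookie header value.

    Hypothetical helper; the field names are assumptions.
    """
    parts = [f"{cookie['name']}={cookie['value']}"]
    if cookie.get("domain"):  # cookies without a domain are allowed
        parts.append(f"Domain={cookie['domain']}")
    if cookie.get("path"):
        parts.append(f"Path={cookie['path']}")
    return "; ".join(parts)


# Assumed shape of the response cookies, for illustration only.
response_cookies = [
    {"name": "session", "value": "abc", "domain": "example.com", "path": "/"},
    {"name": "lang", "value": "en"},
]
set_cookie_headers = [to_set_cookie(cookie) for cookie in response_cookies]
```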
When upgrading, you should set the following in your Scrapy settings:
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
# only applicable for Scrapy 2.7+
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
Fixed an issue where scrapy-zyte-api was slow when the Autothrottle Addon was enabled in Scrapy Cloud. The new ScrapyZyteAPIDownloaderMiddleware resolves it.
It now supports Scrapy 2.7's new REQUEST_FINGERPRINTER_CLASS setting, which ensures that Zyte API requests are properly fingerprinted. This addresses the issue where Scrapy marks POST requests as duplicates if they point to the same URL despite having different request bodies. As a workaround, users were marking their requests with dont_filter=True to prevent such duplicate filtering.
For users with scrapy >= 2.7, simply update your Scrapy settings to have REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter".
If your Scrapy project performs requests other than Zyte API requests, you can set ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "custom.RequestFingerprinter" to allow custom fingerprinting. By default, the default Scrapy request fingerprinter is used for non-Zyte API requests.
For users with scrapy < 2.7, see the following link for different ways of handling the duplicate request issue: https://github.com/scrapy-plugins/scrapy-zyte-api#request-fingerprinting-before-scrapy-27.
More information about request fingerprinting can be found at https://github.com/scrapy-plugins/scrapy-zyte-api#request-fingerprinting.
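The duplicate-filtering problem described above can be illustrated with a toy fingerprint. The two settings are copied from the text; the two fingerprint functions are hypothetical and only show why a fingerprint that ignores the request body makes same-URL POST requests collide. They are not how Scrapy or the plugin computes fingerprints.

```python
import hashlib

# settings.py (Scrapy 2.7+), values as quoted in the entry above.
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
# "custom.RequestFingerprinter" is the placeholder path used in the entry above.
ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "custom.RequestFingerprinter"


def naive_fingerprint(method: str, url: str, body: bytes) -> str:
    """Toy fingerprint that ignores the body (hypothetical)."""
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()


def body_aware_fingerprint(method: str, url: str, body: bytes) -> str:
    """Toy fingerprint that also hashes the body (hypothetical)."""
    digest = hashlib.sha1(f"{method} {url}".encode())
    digest.update(body)
    return digest.hexdigest()


first = ("POST", "https://example.com/api", b'{"page": 1}')
second = ("POST", "https://example.com/api", b'{"page": 2}')

# Same URL, different bodies: the naive fingerprints are equal, so a
# duplicate filter would drop the second request.
naive_collides = naive_fingerprint(*first) == naive_fingerprint(*second)
body_aware_collides = body_aware_fingerprint(*first) == body_aware_fingerprint(*second)
```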
Various improvements to docs and tests.
- Add a ZYTE_API_TRANSPARENT_MODE setting, False by default, which can be set to True to make all requests use Zyte API by default, with request parameters being automatically mapped to Zyte API parameters.
- Add a Request meta key, zyte_api_automap, that can be used to enable automated request parameter mapping for specific requests, or to modify the outcome of automated request parameter mapping for specific requests.
- Add a ZYTE_API_AUTOMAP_PARAMS setting, which is a counterpart for ZYTE_API_DEFAULT_PARAMS that applies to requests where automated request parameter mapping is enabled.
- Add the ZYTE_API_SKIP_HEADERS and ZYTE_API_BROWSER_HEADERS settings to control the automatic mapping of request headers.
- Add a ZYTE_API_ENABLED setting, True by default, which can be used to disable this plugin.
- Document how Zyte API responses are mapped to Scrapy response subclasses.
- Raise the minimum dependency of Zyte API's Python API to zyte-api>=0.4.0. This changes all requests to Zyte API to have Accept-Encoding: br and automatically decompress brotli responses.
- Rename "Zyte Data API" to simply "Zyte API" in the README.
- Lower the minimum Scrapy version from 2.6.0 to 2.0.1.
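As a rough sketch of the transparent mode and automap entries above, a project might combine the setting with per-request meta keys as shown here. The specific values (the geolocation parameter, False to opt a request out, a dict to adjust the mapping) are assumptions about typical usage and should be checked against the README.

```python
# settings.py (sketch)
ZYTE_API_TRANSPARENT_MODE = True  # all requests use Zyte API by default
ZYTE_API_AUTOMAP_PARAMS = {"geolocation": "US"}  # assumed example parameters

# Request meta (sketch), intended to be passed as scrapy.Request(url, meta=...).
opt_out_meta = {"zyte_api_automap": False}  # assumed: skip Zyte API for this request
browser_meta = {"zyte_api_automap": {"browserHtml": True}}  # assumed: adjust the mapping
```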
- Zyte Data API error responses (after retries) are no longer ignored, and instead raise a zyte_api.aio.errors.RequestError exception, which allows user-side handling of errors and provides better feedback for debugging.
- Allowed retry policies to be specified as import path strings, which is required for the ZYTE_API_RETRY_POLICY setting, and allows requests with the zyte_api_retry_policy request.meta key to remain serializable.
- Fixed the naming of stats for some error types.
- Updated the output examples on the README.
- Cleaned up Scrapy stats names: fixed an issue with //, renamed scrapy-zyte-api/api_error_types/.. to scrapy-zyte-api/error_types/.., and added scrapy-zyte-api/error_types/<empty> for cases where the error type is unknown.
- Added the error type to the error log messages.
- Testing improvements.
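The serializability point about retry policies can be illustrated with a small sketch. The import path myproject.retry_policies.CUSTOM_RETRY_POLICY is a hypothetical placeholder, and the lambda merely stands in for some object that cannot be pickled; whether a given retry policy object is picklable is not stated here.

```python
import pickle

# settings.py (sketch): hypothetical import path, placeholder names.
ZYTE_API_RETRY_POLICY = "myproject.retry_policies.CUSTOM_RETRY_POLICY"

# Request meta that references the policy by import path string.
meta_with_path = {
    "zyte_api": {"browserHtml": True},  # assumed example parameters
    "zyte_api_retry_policy": "myproject.retry_policies.CUSTOM_RETRY_POLICY",
}
# A plain string survives a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(meta_with_path))

# A lambda stands in for an arbitrary non-serializable object.
meta_with_object = {"zyte_api_retry_policy": lambda: None}
try:
    pickle.dumps(meta_with_object)
    object_is_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    object_is_picklable = False
```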
Fixed incorrect 0.4.0 release.
- Requires a more recent Python client library, zyte-api ≥ 0.3.0.
- Stats from zyte-api are now copied into Scrapy stats. The scrapy-zyte-api/request_count stat has been renamed to scrapy-zyte-api/processed accordingly.
- The CONCURRENT_REQUESTS Scrapy setting is properly supported; in previous releases, the maximum concurrency of Zyte API requests was limited to 15.
- The retry policy for Zyte API requests can be overridden, using either the ZYTE_API_RETRY_POLICY setting or the zyte_api_retry_policy request.meta key.
- The proper response.status is set when Zyte API returns the statusCode field.
- The URL of the Zyte API server can be set using the ZYTE_API_URL Scrapy setting. This feature is currently used in tests.
- The minimum required Scrapy version (2.6.0) is now enforced in setup.py.
- Test and documentation improvements.
Remove the Content-Decoding header when returning the responses. This prevents Scrapy from decompressing contents that Zyte Data API has already decompressed, which would otherwise lead to errors inside Scrapy's HttpCompressionMiddleware.
Introduce ZyteAPIResponse and ZyteAPITextResponse, which are subclasses of scrapy.http.Response and scrapy.http.TextResponse respectively. These new response classes hold the raw Zyte Data API response in the raw_api_response attribute.
Introduce a new setting named ZYTE_API_DEFAULT_PARAMS.
- At the moment, this only applies to Zyte API enabled scrapy.Request (which is declared by having the zyte_api parameter in the Request meta having valid parameters, set to True, or {}).
Specify in the README to set dont_filter=True when using the same URL but with different zyte_api parameters in the Request meta. This is a current workaround, since Scrapy will otherwise tag them as duplicate requests and filter them out.
Various documentation improvements.
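One way to picture how ZYTE_API_DEFAULT_PARAMS interacts with the zyte_api meta key is the sketch below. The effective_params helper is hypothetical, and the rule that request-level parameters override the defaults is an assumption made for illustration; only the fact that True or {} enables Zyte API for a request comes from the text above.

```python
from typing import Optional

# settings.py (sketch): assumed example parameters.
ZYTE_API_DEFAULT_PARAMS = {"browserHtml": True, "geolocation": "US"}


def effective_params(zyte_api_meta, default_params: dict) -> Optional[dict]:
    """Hypothetical helper: combine defaults with per-request parameters.

    Returns None when the request is not Zyte API enabled.
    """
    if zyte_api_meta is True:
        zyte_api_meta = {}  # True means "enabled, no extra parameters"
    if not isinstance(zyte_api_meta, dict):
        return None  # missing or False: not a Zyte API request
    # Assumption: request-level parameters take precedence over defaults.
    return {**default_params, **zyte_api_meta}


defaults_only = effective_params(True, ZYTE_API_DEFAULT_PARAMS)
overridden = effective_params({"geolocation": "DE"}, ZYTE_API_DEFAULT_PARAMS)
not_enabled = effective_params(None, ZYTE_API_DEFAULT_PARAMS)
```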
- Initial release