
Playwright with Scrapy? #213

Closed

LeMoussel opened this issue Oct 3, 2020 · 4 comments

Comments

@LeMoussel

LeMoussel commented Oct 3, 2020

Hi Team,

I'm starting a project to provide a Scrapy download handler that executes requests using Playwright. It can be used to process pages that require JavaScript. The design is strongly inspired by the Pyppeteer integration for Scrapy (scrapy-pyppeteer).
The main issue when running Scrapy and Playwright together is that Scrapy uses Twisted while Playwright for Python uses asyncio.

Like scrapy-pyppeteer, I make the PlaywrightDownloadHandler class inherit from the default http/https handler, and it only uses Playwright for requests that are explicitly marked. The basic usage is identical to scrapy-pyppeteer's basic usage, as sketched below.
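A minimal sketch of the intended usage, mirroring scrapy-pyppeteer (the handler path follows my project layout, so treat the names as illustrative):

# settings.py: route http/https through the custom handler and make
# Twisted run on top of asyncio so both frameworks share one event loop.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.http.PlaywrightDownloadHandler",
    "https": "scrapy_playwright.http.PlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider: only requests explicitly marked with meta={"playwright": True}
# go through Playwright; everything else uses the default handler.
yield scrapy.Request(url, meta={"playwright": True})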

The problem is that during the execution of a request (download_request) I get the following error:

(node:3992) UnhandledPromiseRejectionWarning: Error: EPIPE: broken pipe, write
    at Socket._write (internal/net.js:54:25)
    at doWrite (_stream_writable.js:403:12)
......

I use Chrome as the headless browser, and the browser itself launches correctly.

import asyncio
import logging

from typing import Coroutine, Type, TypeVar, Optional

from scrapy import Spider, signals
from scrapy.crawler import Crawler
from scrapy.http import Request, Response
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.utils.reactor import verify_installed_reactor

from twisted.internet.defer import Deferred, inlineCallbacks

# https://github.com/microsoft/playwright-python
# https://github.com/microsoft/playwright/blob/master/docs/api.md
from playwright import async_playwright
from playwright.async_api import Playwright, Browser, BrowserContext, Page

logger = logging.getLogger("scrapy-playwright")

PlaywrightHandler = TypeVar("PlaywrightHandler", bound="PlaywrightDownloadHandler")

# Wrap an asyncio coroutine in a Twisted Deferred
def _force_deferred(coro: Coroutine) -> Deferred:
  future = asyncio.ensure_future(coro)
  return Deferred.fromFuture(future)

class PlaywrightDownloadHandler(HTTPDownloadHandler):
  playwright: Optional[Playwright] = None
  browser: Optional[Browser] = None
  context: Optional[BrowserContext] = None
  navigation_timeout: Optional[int] = None

  def __init__(self, crawler: Crawler) -> None:
    super().__init__(settings=crawler.settings, crawler=crawler)
    verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

    crawler.signals.connect(self._launch_browser_signal_handler, signals.engine_started)

    self.stats = crawler.stats

    if crawler.settings.get("PLAYWRIGHT_BROWSER_TYPE"):
      self.browser_type: str = crawler.settings.get("PLAYWRIGHT_BROWSER_TYPE")
    else:
      self.browser_type: str = 'chromium'
    if crawler.settings.get("PLAYWRIGHT_NAVIGATION_TIMEOUT"):
      self.navigation_timeout: int = crawler.settings.getint("PLAYWRIGHT_NAVIGATION_TIMEOUT")
    self.launch_options: dict = crawler.settings.getdict("PLAYWRIGHT_LAUNCH_OPTIONS") or {}

    logger.info("Browser launch options: %s" % self.launch_options)

  @classmethod
  def from_crawler(cls: Type[PlaywrightHandler], crawler: Crawler) -> PlaywrightHandler:
    return cls(crawler)

  def download_request(self, request: Request, spider: Spider) -> Deferred:
    if request.meta.get("playwright"):
      return _force_deferred(self._download_request(request, spider))
    return super().download_request(request, spider)

  # https://github.com/elacuesta/scrapy-pyppeteer/blob/master/scrapy_pyppeteer/handler.py#L103
  async def _download_request(self, request: Request, spider: Spider) -> Response:
    print("TODO PlaywrightDownloadHandler:_download_request()")

    page = await self.context.newPage() # => Error is caused here
    # response = await page.goto(request.url)

    return None

  async def _spider_closed(self):
    await self.browser.close()
    await self.playwright.stop()

  @inlineCallbacks
  def close(self) -> Deferred:
    yield super().close()
    if self.browser:
      yield _force_deferred(self._spider_closed())

  def _launch_browser_signal_handler(self) -> Deferred:
    return asyncio.get_event_loop().run_until_complete(asyncio.ensure_future(self._launch_browser()))

  async def _launch_browser(self) -> None:
    self.playwright = await async_playwright().start()
    # https://playwright.dev/#version=master&path=docs%2Fapi.md&q=browsertypelaunchoptions
    self.browser = await getattr(self.playwright,self.browser_type).launch(**self.launch_options)
    # https://playwright.dev/#version=master&path=docs%2Fapi.md&q=browsernewcontextoptions
    self.context = await self.browser.newContext()

    # For test purposes: no error occurs in this function
    page = await self.context.newPage()
    response = await page.goto('http://example.com/')
    print(response.status)

What I understand is that this type of error occurs when writing to a closed stream/connection.
What I don't understand is why the stream is closed here, while the same calls work in _launch_browser.
Can you help me?

Thank you

@mxschmitt
Member

Hi, such an error normally occurs, as you've already pointed out, when the pipe is closed. In our case this would happen when you call self.playwright.stop(). Could you add a log statement there and check if and when it gets called before you make the request?
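For example, something along these lines in your _spider_closed:

  async def _spider_closed(self):
    logger.info("Stopping Playwright")  # does this appear before the failing request?
    await self.browser.close()
    await self.playwright.stop()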
It would also be great if you could provide a full example of your project, so we could clone it and debug / investigate ourselves.

Thanks!

@LeMoussel
Author

LeMoussel commented Oct 4, 2020

Hi,
I added a log statement in close(). close() is only called when the spider is finishing, which is the expected behavior: download_request() runs before the spider closes.

System information:
OS: Windows 10 Pro x64/AMD64
Python: 3.7.8

I may have identified why the pipe is closed.
By handling exceptions I caught this one:
Task <Task pending coro=<PlaywrightDownloadHandler._download_request() running at d:\Developpement\Python\POC_Scrapy_Playwright\POC_Scrapy_Playwright\scrapy_playwright\http.py:65> cb=[Deferred.fromFuture.<locals>.adapt() at C:\Users\pc\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py:822]> got Future <Future pending> attached to a different loop exception
scrapy_playwright\http.py:65 corresponds to the line page = await self.context.newPage()

Hmm, I don't know asyncio well.
If I understand correctly, the problem is that I'm launching coroutines in a newly created loop, but I can't confirm that all internal operations also use this new loop. So _download_request would be attached to a different loop, the default loop of the current thread.
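A minimal standalone sketch of my understanding (no Scrapy or Playwright involved) that reproduces the "attached to a different loop" error:

import asyncio

async def waiter(fut):
  await fut  # the awaiting Task runs on loop_b

loop_a = asyncio.new_event_loop()
loop_b = asyncio.new_event_loop()

fut = loop_a.create_future()  # the Future is bound to loop_a at creation
try:
  loop_b.run_until_complete(waiter(fut))
except RuntimeError as exc:
  print(exc)  # Task ... got Future <Future pending> attached to a different loop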

@LeMoussel
Author

This appears to be the same error as #178.

@pavelfeldman
Member

The Playwright API can only be used in a single async loop. It should be the same thread / loop you started Playwright with via async_playwright().start(). My understanding is that your download_request arrives on a different thread / loop and things get confused. You can store the loop where you start Playwright:

loop = asyncio.get_event_loop()

and then post tasks onto it. That way everything that talks to Playwright happens within that loop. Depending on whether your app uses multiple threads or multiple loops, you can then call loop.call_soon or loop.call_soon_threadsafe to post Playwright actions to it.
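For example, a sketch assuming the Playwright loop keeps running (e.g. in a dedicated thread); asyncio.run_coroutine_threadsafe is the coroutine-level counterpart of loop.call_soon_threadsafe:

import asyncio

# Remember the loop Playwright was started on:
playwright_loop = asyncio.get_event_loop()

async def fetch(context, url):
  page = await context.newPage()
  return await page.goto(url)

# From another thread, post the coroutine onto the Playwright loop
# instead of awaiting it locally:
future = asyncio.run_coroutine_threadsafe(fetch(context, "http://example.com/"), playwright_loop)
response = future.result()  # blocks until the Playwright loop finishes the work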
