
Playwright with Scrapy? #213

Closed

LeMoussel opened this issue Oct 3, 2020 · 4 comments

Comments

@LeMoussel

LeMoussel commented Oct 3, 2020

Hi Team,

I'm starting a project to provide a Scrapy download handler that executes requests using Playwright. It can be used to process pages that require JavaScript. The design is strongly inspired by the Pyppeteer integration for Scrapy (scrapy-pyppeteer).
The main issue when running Scrapy and Playwright together is that Scrapy uses Twisted while Playwright for Python uses asyncio.

Like scrapy-pyppeteer, I make the PlaywrightDownloadHandler class inherit from the default http/https handler, and it only uses Playwright for requests that are explicitly marked. The basic usage is identical to scrapy-pyppeteer's basic usage, as sketched below.
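A minimal sketch of the intended usage, mirroring scrapy-pyppeteer (the handler path follows my project layout, so treat the names as illustrative):

# settings.py: route http/https through the custom handler and make
# Twisted run on top of asyncio so both frameworks share one event loop.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.http.PlaywrightDownloadHandler",
    "https": "scrapy_playwright.http.PlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider: only requests explicitly marked with meta={"playwright": True}
# go through Playwright; everything else uses the default handler.
yield scrapy.Request(url, meta={"playwright": True})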

The problem is that during the execution of a request (download_request) I get the following error:

(node:3992) UnhandledPromiseRejectionWarning: Error: EPIPE: broken pipe, write
    at Socket._write (internal/net.js:54:25)
    at doWrite (_stream_writable.js:403:12)
......

I use Chrome as the headless browser, and the browser itself launches correctly.

import asyncio
import logging

from typing import Coroutine, Type, TypeVar, Optional

from scrapy import Spider, signals
from scrapy.crawler import Crawler
from scrapy.http import Request, Response
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.utils.reactor import verify_installed_reactor

from twisted.internet.defer import Deferred, inlineCallbacks

# https://github.com/microsoft/playwright-python
# https://github.com/microsoft/playwright/blob/master/docs/api.md
from playwright import async_playwright
from playwright.async_api import Playwright, Browser, BrowserContext, Page

logger = logging.getLogger("scrapy-playwright")

PlaywrightHandler = TypeVar("PlaywrightHandler", bound="PlaywrightDownloadHandler")

# Wrap an asyncio coroutine in a Twisted Deferred
def _force_deferred(coro: Coroutine) -> Deferred:
  future = asyncio.ensure_future(coro)
  return Deferred.fromFuture(future)

class PlaywrightDownloadHandler(HTTPDownloadHandler):
  playwright: Optional[Playwright] = None
  browser: Optional[Browser] = None
  context: Optional[BrowserContext] = None
  navigation_timeout: Optional[int] = None

  def __init__(self, crawler: Crawler) -> None:
    super().__init__(settings=crawler.settings, crawler=crawler)
    verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

    crawler.signals.connect(self._launch_browser_signal_handler, signals.engine_started)

    self.stats = crawler.stats

    if crawler.settings.get("PLAYWRIGHT_BROWSER_TYPE"):
      self.browser_type: str = crawler.settings.get("PLAYWRIGHT_BROWSER_TYPE")
    else:
      self.browser_type: str = 'chromium'
    if crawler.settings.get("PLAYWRIGHT_NAVIGATION_TIMEOUT"):
      self.navigation_timeout: int = crawler.settings.getint("PLAYWRIGHT_NAVIGATION_TIMEOUT")
    self.launch_options: dict = crawler.settings.getdict("PLAYWRIGHT_LAUNCH_OPTIONS") or {}

    logger.info("Browser launch options: %s" % self.launch_options)

  @classmethod
  def from_crawler(cls: Type[PlaywrightHandler], crawler: Crawler) -> PlaywrightHandler:
    return cls(crawler)

  def download_request(self, request: Request, spider: Spider) -> Deferred:
    if request.meta.get("playwright"):
      return _force_deferred(self._download_request(request, spider))
    return super().download_request(request, spider)

  # https://github.com/elacuesta/scrapy-pyppeteer/blob/master/scrapy_pyppeteer/handler.py#L103
  async def _download_request(self, request: Request, spider: Spider) -> Response:
    print("TODO PlaywrightDownloadHandler:_download_request()")

    page = await self.context.newPage() # => Error is caused here
    # response = await page.goto(request.url)

    return None

  async def _spider_closed(self):
    await self.browser.close()
    await self.playwright.stop()

  @inlineCallbacks
  def close(self) -> Deferred:
    yield super().close()
    if self.browser:
      yield _force_deferred(self._spider_closed())

  def _launch_browser_signal_handler(self) -> Deferred:
    return asyncio.get_event_loop().run_until_complete(asyncio.ensure_future(self._launch_browser()))

  async def _launch_browser(self) -> None:
    self.playwright = await async_playwright().start()
    # https://playwright.dev/#version=master&path=docs%2Fapi.md&q=browsertypelaunchoptions
    self.browser = await getattr(self.playwright,self.browser_type).launch(**self.launch_options)
    # https://playwright.dev/#version=master&path=docs%2Fapi.md&q=browsernewcontextoptions
    self.context = await self.browser.newContext()

    # For test purposes: no error occurs in this function
    page = await self.context.newPage()
    response = await page.goto('http://example.com/')
    print(response.status)

What I understand is that this type of error occurs when writing to a closed stream/connection.
What I don't understand is why the stream is closed here, while the same calls work in _launch_browser.
Can you help me?

Thank you

@mxschmitt
Member

Hi, such an error normally occurs, as you've already pointed out, when the pipe is closed. In our case this would happen when you call self.playwright.stop(). Could you add a log statement there and check if and when it gets called before you make the request?
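For example, something along these lines in your _spider_closed:

  async def _spider_closed(self):
    logger.info("Stopping Playwright")  # does this appear before the failing request?
    await self.browser.close()
    await self.playwright.stop()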
It would also be great if you could provide a full example of your project, so we could clone it and debug / investigate ourselves.

Thanks!

@LeMoussel
Author

LeMoussel commented Oct 4, 2020

Hi,
I added a log statement in close(). close() is only called when the spider is finishing, which is the expected behavior: download_request() runs before the spider closes.

System information:
OS: Windows 10 Pro x64/AMD64
Python: 3.7.8

I may have identified why the pipe is closed.
By handling exceptions I caught this one:
Task <Task pending coro=<PlaywrightDownloadHandler._download_request() running at d:\Developpement\Python\POC_Scrapy_Playwright\POC_Scrapy_Playwright\scrapy_playwright\http.py:65> cb=[Deferred.fromFuture.<locals>.adapt() at C:\Users\pc\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py:822]> got Future <Future pending> attached to a different loop exception
scrapy_playwright\http.py:65 corresponds to the line page = await self.context.newPage()

Hmm, I don't know asyncio well.
If I understand correctly, the problem is that I'm launching coroutines in a newly created loop, but I can't confirm that all internal operations also use this new loop. So _download_request would be attached to a different loop, the default loop of the current thread.
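A minimal standalone sketch of my understanding (no Scrapy or Playwright involved) that reproduces the "attached to a different loop" error:

import asyncio

async def waiter(fut):
  await fut  # the awaiting Task runs on loop_b

loop_a = asyncio.new_event_loop()
loop_b = asyncio.new_event_loop()

fut = loop_a.create_future()  # the Future is bound to loop_a at creation
try:
  loop_b.run_until_complete(waiter(fut))
except RuntimeError as exc:
  print(exc)  # Task ... got Future <Future pending> attached to a different loop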

@LeMoussel
Author

This appears to be the same error as #178.

@pavelfeldman
Member

The Playwright API can only be used in a single async loop. It should be the same thread / loop you started Playwright with via async_playwright().start(). My understanding is that your download_request arrives on a different thread / loop and things get confused. You can store the loop where you start Playwright:

loop = asyncio.get_event_loop()

and then post tasks onto it. That way everything that talks to Playwright happens within that loop. Depending on whether your app uses multiple threads or multiple loops, you can then call loop.call_soon or loop.call_soon_threadsafe to post Playwright actions to it.
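For example, a sketch assuming the Playwright loop keeps running (e.g. in a dedicated thread); asyncio.run_coroutine_threadsafe is the coroutine-level counterpart of loop.call_soon_threadsafe:

import asyncio

# Remember the loop Playwright was started on:
playwright_loop = asyncio.get_event_loop()

async def fetch(context, url):
  page = await context.newPage()
  return await page.goto(url)

# From another thread, post the coroutine onto the Playwright loop
# instead of awaiting it locally:
future = asyncio.run_coroutine_threadsafe(fetch(context, "http://example.com/"), playwright_loop)
response = future.result()  # blocks until the Playwright loop finishes the work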
