Archive Method: Puppeteer support for scripting specific Chrome archive flows #51

Open
pirate opened this Issue Nov 2, 2017 · 8 comments

Owner

pirate commented Nov 2, 2017

https://github.com/GoogleChrome/puppeteer is fantastic for scripting actions on pages before making a screenshot or PDF.

I could add support for custom Puppeteer scripts for certain URLs that need a user action performed before archiving (e.g. logging in or closing a welcome popup).

Puppeteer code looks like this:

        const puppeteer = require('puppeteer')

        ;(async () => {
            const browser = await puppeteer.launch({headless: false})
            const page = await browser.newPage()

            await page.goto('https://carbon.now.sh')

            // focus the code editor, select all (Cmd+A on macOS), then delete the selection
            const code_input = 'div.ReactCodeMirror div.CodeMirror-code > pre:nth-child(11)'
            await page.click(code_input)
            await page.keyboard.down('Meta')
            await page.keyboard.down('a')
            await page.keyboard.up('a')
            await page.keyboard.up('Meta')
            await page.keyboard.press('Backspace')
        })()
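
The idea of running custom scripts only for certain URLs could be sketched roughly like this. This is a minimal illustration in Python, not existing ArchiveBox code: the `URL_SCRIPTS` table, its glob patterns, the script names, and the `scripts_for_url` helper are all hypothetical.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical table: glob pattern on "host/path" -> names of scripts to run
# before archiving. Neither the patterns nor the names come from ArchiveBox.
URL_SCRIPTS = [
    ("carbon.now.sh/*", ["clear_editor"]),
    ("*.example.com/login*", ["log_in"]),
    ("*", ["dismiss_modals"]),  # fallback applied to every URL
]


def scripts_for_url(url, table=URL_SCRIPTS):
    """Return the script names whose pattern matches the URL, in table order."""
    parts = urlsplit(url)
    target = parts.netloc + (parts.path or "/")  # query string is ignored
    names = []
    for pattern, script_names in table:
        if fnmatch(target, pattern):
            for name in script_names:
                if name not in names:  # keep each script only once
                    names.append(name)
    return names


if __name__ == "__main__":
    print(scripts_for_url("https://carbon.now.sh/"))
    print(scripts_for_url("https://news.example.com/login?next=/"))
```

Glob patterns are used here only because they are easy to read in a config file; regular expressions or per-domain lookups would work just as well.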

Owner Author

pirate commented Sep 12, 2018

Contributor

FiloSottile commented Dec 1, 2018

archive.is has a nice set of scripts that do things like expanding all Reddit threads or scrolling through Twitter timelines before taking a snapshot. This is the kind of thing I've seen a nice community develop around, as with youtube-dl.

Owner Author

pirate commented Mar 15, 2019

Work on this will begin with our move from chromium-browser to pyppeteer (#177). After that, these become possible:

  • support for scripted user flows (this ticket)
  • dismissing GDPR / cookie / subscription / donation popups automatically: #175
  • autoscroll before archiving, with full-page dynamic-height screenshots: #80
  • saving dynamic/interactive requests into the WARC, with pyppeteer running through pywb: #130
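
As a rough illustration of the popup-dismissal item above, here is a minimal sketch that assumes a pyppeteer-style `page` object whose `evaluate(page_function, *args)` coroutine runs JavaScript in the page. The selector list, `DISMISS_JS`, and `dismiss_popups` are hypothetical; real sites vary widely and would need a much larger, curated set of rules.

```python
import asyncio

# Assumed selectors, for illustration only.
POPUP_SELECTORS = [".modal", "[class*='cookie']", "[id*='consent']"]

# JS function evaluated in the page: removes every matching element and
# returns how many elements it matched and removed.
DISMISS_JS = """(selectors) => {
    let removed = 0;
    for (const sel of selectors) {
        document.querySelectorAll(sel).forEach(el => { el.remove(); removed += 1; });
    }
    return removed;
}"""


async def dismiss_popups(page, selectors=POPUP_SELECTORS):
    """Run DISMISS_JS on the page and return the number of removed elements."""
    return await page.evaluate(DISMISS_JS, selectors)
```

Removing nodes outright is the bluntest option; clicking an "accept" or "close" button instead may leave the page in a more natural state, at the cost of site-specific logic.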

Contributor

n0ncetonic commented Mar 23, 2019

I have experience writing Puppeteer scripts. I'm willing either to start implementing fixes for #175, #80, and #130 as independent code samples in preparation for pyppeteer, or to start a branch that simply replicates current functionality with pyppeteer. Which one depends on whether you've already started a private branch or would prefer to implement it yourself.

Owner Author

pirate commented Mar 23, 2019

Sweet, the super-rough planned design is for ArchiveBox to run user-provided scripts like this:

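from pyppeteer import launch

# each script is the source of a JS function, evaluated in the page's context
archive_scripts = {
    'dismiss_modals': '() => { document.querySelectorAll(".modal").forEach(el => el.remove()) }',
    # ...
}

async def archive(links):
    browser = await launch()
    page = await browser.newPage()

    for link in links:
        await page.goto(link['url'])
        history = link.setdefault('history', {})

        # run every user script and record whatever it returns
        for script_name, script_js in archive_scripts.items():
            history.setdefault(script_name, []).append(await page.evaluate(script_js))

        history.setdefault('screenshot', []).append(await page.screenshot({'path': 'screenshot.png'}))
        history.setdefault('pdf', []).append(await page.pdf({'path': 'output.pdf'}))

    await browser.close()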

The final implementation will be more fully featured than this, of course. I imagine we'll do something very similar to how archive_methods.py works right now, with a check function should_fetch_xyz to decide whether each script should run. The script then runs in the page's JS context, and any output it returns gets saved onto the link in a result entry like this: {start_ts, end_ts, duration, cmd, pwd, status, output}.

Link {
    timestamp: str,     (how we uniquely id links)
    url: str,
    title: str,
    ...
    history: {
        wget: [
            {start_ts, end_ts, duration, cmd, pwd, status, output},
            ...
        ],
        screenshot: [
            {start_ts, end_ts, duration, script, status, output},
        ],
        ...
    },
}
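
To make the check-then-run-then-record flow described above a bit more concrete, here is a minimal sketch. It is not the actual archive_methods.py logic: `should_fetch_script`, `run_script`, the `evaluate` callable (standing in for something like a page's `evaluate` coroutine), and the "succeeded"/"failed" status values are all assumptions made for illustration. The result entry uses the {start_ts, end_ts, duration, script, status, output} fields from the schema.

```python
import asyncio
from datetime import datetime, timezone


def should_fetch_script(link, script_name):
    """Hypothetical check: run a script only if it has no successful result yet."""
    results = link.get("history", {}).get(script_name, [])
    return not any(r["status"] == "succeeded" for r in results)


async def run_script(link, script_name, script_js, evaluate):
    """Run one script via `evaluate` and append a result entry to link['history'].

    Returns the new entry, or None when the check function says to skip.
    """
    if not should_fetch_script(link, script_name):
        return None
    start = datetime.now(timezone.utc)
    try:
        output, status = await evaluate(script_js), "succeeded"
    except Exception as err:  # record the failure instead of aborting the run
        output, status = repr(err), "failed"
    end = datetime.now(timezone.utc)
    entry = {
        "start_ts": start.isoformat(),
        "end_ts": end.isoformat(),
        "duration": (end - start).total_seconds(),
        "script": script_js,
        "status": status,
        "output": output,
    }
    link.setdefault("history", {}).setdefault(script_name, []).append(entry)
    return entry
```

Because failures are stored as entries too, a later run can retry a script that previously failed while skipping ones that already succeeded.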

Contributor

n0ncetonic commented Mar 23, 2019

Alright cool, I will start working on getting that implemented on my fork.

I'm planning to do this in three phases across two milestones, which I think align well with the current roadmap.

Phase I. Import pyppeteer and replace all current chromium-browser calls with pyppeteer equivalents.

Milestone I. ArchiveBox migrated to pyppeteer

Phase II. Implement minimalist scripting support allowing users to extend browser-based modules using javascript.

Milestone II. Codebase aligned with Roadmap's Long Term Change to allow user-defined scripting of the browser.

Phase III. Bootstrap a collection of browser scripts by creating and including:

  • autoscroll_screenshot.js - captures a screenshot of the entire page by autoscrolling (#80)
  • anti_detection.js - bypasses detection/blocking of headless browsers by selectively overwriting page-wide getter properties. This is something I already have working for a personal project that used Puppeteer.
  • cookie_accept.js - generic enumeration and dismissal of GDPR/subscription/cookie popups (#175)
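
For the autoscroll item, one possible approach is to keep scrolling until the page height stops growing and only then take a full-page screenshot. The sketch below is written in Python against a pyppeteer-style `page` (using its `evaluate` and `screenshot` coroutines and the `fullPage` screenshot option); the function name, `max_rounds`, and `pause` are hypothetical, and this is not the planned autoscroll_screenshot.js itself.

```python
import asyncio

# JS snippets evaluated in the page context.
PAGE_HEIGHT_JS = "() => document.body.scrollHeight"
SCROLL_TO_BOTTOM_JS = "() => window.scrollTo(0, document.body.scrollHeight)"


async def autoscroll_screenshot(page, path, max_rounds=20, pause=0.5):
    """Scroll until the page height stops growing, then screenshot the full page."""
    last_height = await page.evaluate(PAGE_HEIGHT_JS)
    for _ in range(max_rounds):  # hard cap so infinite feeds cannot loop forever
        await page.evaluate(SCROLL_TO_BOTTOM_JS)
        await asyncio.sleep(pause)  # give lazy-loaded content time to arrive
        height = await page.evaluate(PAGE_HEIGHT_JS)
        if height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = height
    return await page.screenshot({"path": path, "fullPage": True})
```

The `max_rounds` cap matters for endless timelines, where the height never stabilizes and some cutoff has to be chosen.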

Note: As my primary aim is to make progress on the Roadmap, #130 will not be a prerequisite for completing Phase III. Once Phase III is complete and merged into master, a separate Pull Request will address extending WARC generation.

We'll move on to next steps (like mimicking how archive_methods.py loads scripts) once Phase III provides a working, basic scripting subsystem.
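
One possible shape for script loading, assuming the browser scripts live as .js files in a single directory: read each file into a name-to-source mapping similar to the archive_scripts dict sketched earlier in the thread. The directory layout and the `load_archive_scripts` name are hypothetical, and this does not claim to mirror how archive_methods.py actually works.

```python
from pathlib import Path


def load_archive_scripts(script_dir):
    """Map each *.js file's stem to its source text, in filename order."""
    scripts = {}
    for path in sorted(Path(script_dir).glob("*.js")):
        scripts[path.stem] = path.read_text(encoding="utf-8")
    return scripts
```

With files such as cookie_accept.js in that directory, the returned dict would be keyed by "cookie_accept" and so on, ready to be passed one by one to the page for evaluation.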

Owner Author

pirate commented Mar 23, 2019

If possible, work on the Phase III scripts first. Those would be most helpful to me, as I've already started work on the phase I and II steps you outlined above over the last few months.

You can test your scripts using the pyppeteer demo code from their README, and I'll make sure the ArchiveBox API is compatible to work with them.

Owner Author

pirate commented Apr 17, 2019

I found some huge repositories of Selenium/Puppeteer scripts for dismissing modals and logging in to lots of sites. These are going to be super useful:
