Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stress test: scrape search results user story #89

Closed
pavelfeldman opened this issue Jul 18, 2017 · 7 comments
Closed

Stress test: scrape search results user story #89

pavelfeldman opened this issue Jul 18, 2017 · 7 comments

Comments

@pavelfeldman
Copy link
Contributor

I attempted to create a following stress test:

  • Navigate to google.com
  • Enter "Blin"
  • Capture screenshot with results
  • Go over the results
  • Capture screenshot for every result page

This is pretty much impossible using existing API :) Dumping observations here...

  • page.waitFor resolves even if element is not on the screen (display:none, etc). Many elements are in DOM too early, no way to click-via-screen them.
  • page.waitFor does not time timeout
  • keyboard API surface is large and cryptic. Did you figure out how to press 'Enter' using just page.*.
  • keyboard API does not allow for natural click delays, does not populate timestamps. google.com does not like that
  • Expose custom raf or timer-based page.waitFor(predicate)
  • Expose page.sleep(time)
  • Need to be able to page.waitForNavigation in case navigation is initiated via click [i'll fix]
  • Reload brings me back to the originally navigated page [i'll fix]
  • page.waitFor(selector) in my code is always followed by either page.click or page.focus with that element. Handles would make it look like (await page.waitFor(selector)).click().
  • navigation API is not exposed, I need to be able to click browser's back/forward.
  • Taking screenshot halts on mac headless good half of the time [i'll look at it]

^^ @aslushnikov @JoelEinbinder @dgozman @paulirish

@JoelEinbinder
Copy link
Collaborator

JoelEinbinder commented Jul 18, 2017

Why do you want to press enter? Usually there is an equally clickable element.

@JoelEinbinder
Copy link
Collaborator

JoelEinbinder commented Jul 18, 2017

let Browser = require('../lib/Browser.js');
let browser = new Browser({ args: ['--no-sandbox'] });
browser.newPage().then(async page => {
  await page.navigate('https://www.google.com');
  let query = 'input[title="Search"]';
  page.focus(query);
  page.type('blin');
  page.keyboard.press('Enter');
  await page.waitFor('h3 a');
  let urls = await page.$$('h3 a', a => a.href);
  let i = 0;
  for (let url of urls) {
    i++;
    await page.navigate(url);
    await page.screenshot({
      path: i + '.png'
    });
  }
  browser.close();
});

page.waitFor didn't work for me. Added text nodes crash it, and a new node could make the selector resolve anywhere in the document.

@pavelfeldman
Copy link
Contributor Author

  • Running your script on a Mac, it does not work. I get stuck with blink| in the search field.
  • Also, this script does not do what I want - I want to be clicking search results, not navigating manually. That covers the back/forward use cases
  • And to add stress to it, I want to be clicking the search button. It is in DOM all the time, but invisible half of the time which adds fun!

@aslushnikov
Copy link
Contributor

The page.waitFor was broken and is now fixed in 28265fc

A few observations:

  • For some reason, google.com handles "Enter" in input field differently every other time. Sometimes it issues a navigation to the https://www.google.com/search?q=blin endpoint, and sometimes it behaves as SPA and navigates to https://www.google.com/#q=Blin.
  • Opening multiple tabs and navigating them altogether doesn't work - half of the tabs timeout navigation
  • Trying to take screenshot simultaneously from multiple tabs doesn't work good - screenshotting hangs

The following works for me, but I don't really click the links and don't really click the "search" button. But we'll fix it!

const {Browser} = new require('.');
const browser = new Browser({headless: false});

const VIEWPORT = {width: 1000, height: 600};

browser.newPage().then(async page => {
  await page.setViewport(VIEWPORT);
  await page.navigate('https://google.com');
  await page.focus('#lst-ib');
  await page.type('Blin');
  await page.keyboard.press('Enter');

  try {
    await page.waitFor('div.g');
  } catch(e) {
    // Page might issue a navigation, so the page.waitFor will throw.
    await page.waitFor('div.g');
  }

  await page.screenshot({path: 'ee.png'});
  let links = await page.$$('div.g h3 > a', link => link.href);
  // have to do this sequentially =/
  for (let i = 0; i < links.length; ++i)
    await screenshotURL(links[i], `${i}.png`);
  browser.close();
});

async function screenshotURL(url, name) {
  let page = await browser.newPage();
  await page.setViewport(VIEWPORT);
  try {
    await page.navigate(url, { maxTime: 5000 });
  } catch (e) {
    // we did our best.
  }
  await page.screenshot({path: name});
  page.close();
  console.log('Done: ' + name);
}

@aslushnikov
Copy link
Contributor

aslushnikov commented Jul 18, 2017

After the offline discussion of the stress-test, here are the bullets:

  • make mouse to scroll element into view before clicking (otherwise it doesn't work!)
  • introduce page.press as a shortcut for page.keyboard.press
  • don't do any network throttling other than offline mode. We're bad at emulating it
  • implement page.waitForVisible(selector) which comes handy
  • implement page.waitForNavigation
  • implement page.back / page.forward
  • improve support for DEBUG module for public api methods
  • make screenshot tasks per browser, not per page. Screenshotting doesn't work across multiple tabs
  • screenshots on mac seem to be unstable
  • rename the keyboard's hold into down, and release into up
  • page.waitForSelector should survive navigation (currently it doesn't)
  • page.waitForSelector should timeout after some time (e.g. 30s by default)

aslushnikov pushed a commit that referenced this issue Jul 19, 2017
This patch introduces Page.waitForNavigation which allows to wait
for render-initiated navigation.

This patch also does a nice refactoring, replacing Navigator with NavigatorWatcher which
is not a part of a page state.

References #89
aslushnikov pushed a commit that referenced this issue Jul 19, 2017
This patch introduces page.goBack/page.goForward methods
to navigate the page history.

References #89.
aslushnikov added a commit that referenced this issue Jul 19, 2017
This patch improves on DEBUG module to trace all puppeteer's
public API calls.

References #89.
aslushnikov added a commit that referenced this issue Jul 19, 2017
Currently, it's impossible to do screenshots in parallel.
This patch:
- makes all screenshot tasks sequential inside one browser
- starts activating target before taking screenshot
- adds a test to make sure it's possible to take screenshots across
  tabs
- starts waiting for a proper page closing after each test. This might
  finally solve the ECONNRESET issues in tests.

References #89
aslushnikov pushed a commit that referenced this issue Jul 19, 2017
aslushnikov pushed a commit that referenced this issue Jul 19, 2017
This patch:
- introduces page.press() method
- adds more input tests

References #89
aslushnikov added a commit that referenced this issue Jul 20, 2017
This patch implements page.waitFor method which survives navigation.

References #89.
aslushnikov added a commit that referenced this issue Jul 20, 2017
This patch implements page.waitFor method which survives navigation.

References #89.
aslushnikov added a commit that referenced this issue Jul 21, 2017
This patch adds a 'visible' option to the Page.waitFor method, making
it possible to wait for the element to become actually visible.

References #89, #91.
aslushnikov added a commit that referenced this issue Jul 21, 2017
This patch implements timeout option for page.waitFor. The function
will throw if the selector doesn't appear during timeout milliseconds
of waittime.

References #89, #91.
aslushnikov pushed a commit that referenced this issue Jul 22, 2017
This patch:
- adds Mouse class which holds mouse state and implements mouse primitives,
such as moving, button down and button up.
- implements high-level mouse api, such as `page.click` and `page.hover`.

References #40, References #89
@aslushnikov
Copy link
Contributor

The resulting script looks like this:

const {Browser} = require('puppeteer');
const browser = new Browser({headless: false});

browser.newPage().then(async page => {
  page.on('load', () => console.log('LOADED: ' + page.url()));
  await page.navigate('https://google.com');
  await page.waitFor('input[name=q]');
  await page.focus('input[name=q]');
  await page.type('blin');
  await page.press('Enter');
  for (let i = 0; i < 10; ++i) {
    let searchResult = `div.g:nth-child(${i + 1}) h3 a`;
    await page.waitFor(searchResult, {visible: true});
    page.click(searchResult);
    await page.waitForNavigation();
    await page.screenshot({path: `screenshot-${i + 1}.png`});
    await page.goBack();
  }
  browser.close();
});

The two checkboxes from feedback which are not addressed are filed separately:

@ralyodio
Copy link

ralyodio commented Jun 2, 2018

You might be in different buckets too. I'm sure google is doing A/B testing. There's a way to force a bucket, but I'm not sure how.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants