About this Software
This software navigates to a web link, collects all the links, records their "coordinates" (their getBoundingClientRect position), and saves this data alongside a screenshot of the page.
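As a rough illustration of that collection step, the in-page logic can be sketched as a function that maps anchor elements to their href and bounding box. This is a hypothetical helper, not the repo's actual code; in practice something like it would run in the browser context via puppeteer's page.evaluate:

```javascript
// Hypothetical sketch of the per-link collection step (not the repo's
// exact implementation). Each anchor is reduced to its href plus the
// viewport coordinates reported by getBoundingClientRect.
function extractLinkCoords(anchors) {
  return anchors.map((a) => {
    const rect = a.getBoundingClientRect();
    return {
      href: a.href,
      x: rect.x,
      y: rect.y,
      width: rect.width,
      height: rect.height,
    };
  });
}
```

Inside a page this would be called as extractLinkCoords(Array.from(document.querySelectorAll('a'))), with the result saved alongside the screenshot.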
This coordinate data can be used to examine the "spatial incidence rate" of certain domains or content types (e.g. "How often do Stack Overflow links appear in the top half of a SERP?", "How often do Wikipedia links appear in the right half of a SERP?"). You could also use these coordinates to generate a "ranked list". However, for traditional ranking analyses, you may wish to examine software that includes platform-specific parsing.
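For example, a "top half" incidence question can be answered with a few lines over the saved records. This is a sketch with hypothetical field names based on the description above, not the repo's actual schema:

```javascript
// Sketch: fraction of a domain's links whose top edge falls in the top
// half of the page. Assumes records shaped like { href, y } and a known
// page height; names are illustrative only.
function topHalfRate(records, domain, pageHeight) {
  const fromDomain = records.filter(
    (r) => new URL(r.href).hostname.endsWith(domain)
  );
  if (fromDomain.length === 0) return 0;
  const inTopHalf = fromDomain.filter((r) => r.y < pageHeight / 2);
  return inTopHalf.length / fromDomain.length;
}
```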
While written with the goal of studying search engine results pages (SERPs), in theory the software works for any website. See the examples below for scraping Google's and Bing's homepages.
It uses the open source puppeteer library to automate headless browsing.
The basic concept of using puppeteer for SERP scraping is based on NikolaiT's library se-scraper.
Key differences from se-scraper:
- This repo contains a separate, more minimal implementation of the link coordinate collection, without the additional scraping features from se-scraper (e.g. use of puppeteer-cluster, specific parsing rules for Google News, etc.).
- This repo focuses on spatial analysis, not ranking analyses. The results are links and their coordinates, not ranks. While this has some advantages, there are also limitations to using a spatial approach.
- This repo is currently maintained as a side-project by a grad student, and may not be updated as frequently as other similar packages. Contributions and feedback are welcome!
Requirements
- node and npm (most recently run with node 10.15.3 and npm 6.4.1)
- python3 distribution (anaconda recommended)
To play with the results_notebook.py, you may want to use a Jupyter-compatible tool, e.g. JupyterLab or VSCode's notebook feature (https://code.visualstudio.com/docs/python/jupyter-support).
Downloading node packages
To install relevant node packages into a local node_modules folder, navigate to this folder (e.g. cd LinkCoordMin) and run npm install.
Generating search queries
A critical part of studying SERPs is generating relevant search queries. This is a huge topic, so it has a separate README!
See README in
The collect.js script runs SERP collection.
There are a variety of named command line args you can pass. Check out collect.js to most directly see the options, or run node collect.js -h. You can also see examples below.
See EXAMPLE_RUN.sh for how you can run 4 scripts in sequence to programmatically generate queries and save SERP data for those queries.
Specific Examples of SERP collection
To run a script that:
- emulates iPhone X using puppeteer's Devices API (--device=iphonex)
- searches the Google search engine by visiting https://www.google.com/search?q= (--platform=google)
- makes covid_stems queries (--queryCat=covid_stems)
- from the 0.txt query file (--queryFile=0)
- from the uw location (University of Washington lat / long / zip) (--geoName=uw)
- to dir test (--outDir=test)
node collect.js --device=iphonex --platform=google --queryCat=covid_stems --queryFile=0 --geoName=uw --outDir=test
For bing & no location spoofing:
node collect.js --device=iphonex --platform=bing --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/test
For bing on Chrome/Windows and a single test query (q = 'covid')
node collect.js --device=chromewindows --platform=bing --queryCat=test --queryFile=0 --geoName=None --outDir=output/test0
To run google and bing at the same time (using & for parallel):
node collect.js --device=chromewindows --platform=google --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/covidout_mar20 & node collect.js --device=chromewindows --platform=bing --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/covidout_mar20 & wait
This software can collect data for websites other than SERPs as well!
To scrape reddit, we just create a queryCat called reddit. The software will look at search_queries/reddit/0.txt and visit any websites listed there.
node collect.js --device=chromewindows --platform=reddit --queryCat=reddit --queryFile=0 --geoName=None --outDir=output/reddit
Similarly, to visit search engine homepages:
node collect.js --device=chromewindows --platform=se --queryCat=homepages --queryFile=0 --geoName=None --outDir=output/reddit
Note that --sleepMin and --sleepMax default to 15 and 30 (seconds) respectively. You may wish to make these larger for longer jobs to avoid being rate limited (see discussion in the se-scraper repo).
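The delay presumably works along these lines (an assumed sketch of a randomized inter-query sleep, not the exact collect.js logic):

```javascript
// Assumed sketch: pick a random delay between sleepMin and sleepMax
// seconds (check collect.js for the actual implementation).
function randomSleepMs(sleepMin, sleepMax) {
  return Math.round((sleepMin + Math.random() * (sleepMax - sleepMin)) * 1000);
}

// Promise-based sleep, awaited between page visits.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// usage: await sleep(randomSleepMs(15, 30));
```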
Running many query categories with a python script
See covid.py for a script that collects a variety of COVID-19 related data.
- This script is a useful template for running a bunch of tasks at once, or setting up regular data collection.
Data visualization and analysis
See WikipediaSERP.html for a worked example.
See results_notebook.py for details. If you're not using an Anaconda environment, you may need to pip install dependencies like pandas, matplotlib, etc.
results_notebook.py is formatted for use with VS Code's interactive Jupyter notebook features. You can alternatively use the results_notebook.ipynb version (updated semi-regularly) or just run results_notebook.py as a Python script, e.g. set SAVE_PLOTS to True, then run python results_notebook.py > my_results.txt.
Known Issues and Debugging
- Location spoofing is inconsistent and the feature most likely to break. If performing any location-specific analyses, consider doing extra manual validation for data quality!
- Bing mobile pages only load the top results (appears to be 4-6 items). The bottom half of the page is left with placeholder images, i.e. it hasn't loaded the full page yet. When this issue first arose, the "scrollDown" function seemed to fix it (it issues scroll actions until the bottom of the page is reached).
- Reddit sometimes has issues loading.
- Duckduckgo has some hard-to-replicate bugs when location spoofing is enabled.
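The scroll-until-bottom fix mentioned above can be sketched roughly as follows. Names and logic are hypothetical (see the repo's own scrollDown function for the real version); the page argument just needs an evaluate method, like a puppeteer Page:

```javascript
// Hedged sketch of scrolling until the bottom of the page is reached, so
// lazily-loaded results (e.g. on Bing mobile) get a chance to render.
// Illustrative only, not the repo's actual implementation.
async function scrollToBottom(page, maxRounds = 50, pauseMs = 250) {
  for (let i = 0; i < maxRounds; i++) {
    const atBottom = await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight); // scroll one viewport down
      return window.scrollY + window.innerHeight >= document.body.scrollHeight;
    });
    if (atBottom) break;
    await new Promise((resolve) => setTimeout(resolve, pauseMs)); // let content load
  }
}
```

The maxRounds cap keeps the loop from spinning forever on pages whose height keeps growing (e.g. infinite scroll feeds).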
Using headful mode for development
- Pass --headless=0
- This is very useful for debugging: you can watch the web browser in real time!
- If you are interested in helping to debug any issues with the software (including new issues that may arise as SERPs change), consider using headful mode and watching the software "in action".
Run node tests/testStealth.js to see how puppeteer-extra-plugin-stealth is doing. This library is meant to help puppeteer scripts avoid detection, i.e. so websites don't detect the script.