Scrape ads from the interweb.
- Getting Started
- Running
- Adding a new site
- Customizing
- Contributing
- Versioning
- Authors
- License
- See also
- Acknowledgments
Follow these instructions to get the scraper running on your local machine.
You will need Node.js with npm installed. You can download it from the official website: https://nodejs.org.
The following steps show how to get a working copy of this project onto your local machine.
Clone the git repository:
git clone https://github.com/intelligenerator/ad-scraper.git
cd ad-scraper/
Then install the dependencies (Note: This will take up some disk space, as a copy of Chromium will be downloaded to your machine):
npm install
Sites that should be scraped can be added to config/sites.json. See Adding a new site for detailed instructions.
Happy scraping!
To start the scraping process, run this in the project folder:
npm run scrape
You may add additional websites to scrape in config/sites.json. To specify additional domains of ad servers, modify config/ad-servers.json.
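The format of config/ad-servers.json is not described in this README. As a rough sketch, assuming the file holds a JSON array of domain strings, a request URL could be checked against that list as shown here. The function name isAdServerUrl and the sample domains are illustrative only and are not taken from the project.

```javascript
// Assumption: config/ad-servers.json contains a JSON array of domains.
// These sample values are illustrative, not copied from the real file.
const adServers = ["doubleclick.net", "googlesyndication.com"];

// Hypothetical helper: returns true when the URL's hostname equals one of
// the listed domains or is a subdomain of one of them.
function isAdServerUrl(requestUrl, domains) {
  let hostname;
  try {
    hostname = new URL(requestUrl).hostname;
  } catch {
    return false; // not a parseable absolute URL
  }
  return domains.some(
    (domain) => hostname === domain || hostname.endsWith("." + domain)
  );
}

console.log(isAdServerUrl("https://ad.doubleclick.net/pixel", adServers)); // true
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives when a listed domain merely appears in a path or query string.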
You can add a site to the scraping list by editing config/sites.json.
A site information object has the following fields (* are required fields):
- *name: Name of the website (such as "Google")
- *url: URL of the website (such as "www.google.com")
- cookies: CSS selector of the accept cookies button to remove consent banner from screenshots
- additional: Array of additional CSS selectors to screenshot
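To illustrate how the cookies and additional fields might be used, here is a minimal sketch. It assumes a page object that behaves like a Puppeteer Page (with goto and $) and element handles that offer click and screenshot. The helper captureSite is hypothetical and is not the scraper's actual implementation.

```javascript
// Hypothetical helper, not taken from the ad-scraper source.
// `page` is assumed to behave like a Puppeteer Page object.
async function captureSite(page, site) {
  await page.goto(site.url);

  // Click the consent button, if a selector is configured and it matches.
  if (site.cookies) {
    const button = await page.$(site.cookies);
    if (button) {
      await button.click();
    }
  }

  // Screenshot every additional element that exists on the page.
  const shots = [];
  for (const selector of site.additional || []) {
    const element = await page.$(selector);
    if (element) {
      shots.push({ selector, image: await element.screenshot() });
    }
  }
  return shots;
}
```

Selectors that match nothing are skipped rather than treated as errors, since consent banners and ad slots do not appear on every page load.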
Here is a simple example of a site information object.
For more examples, see sites.example.json.
[
  {
    "name": "NYTimes",
    "url": "https://www.nytimes.com/",
    "cookies": "[data-testid='GDPR-accept']"
  }
]
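Before running the scraper, it can help to check that each entry follows the field rules listed above. The following is a small sketch of such a check. validateSite is a hypothetical helper and is not part of the project. It only checks field presence and types, not whether the URL or selectors are valid.

```javascript
// Hypothetical helper: returns a list of problems found in one site entry.
// An empty list means the entry has the required fields with expected types.
function validateSite(site) {
  const errors = [];
  for (const field of ["name", "url"]) {
    if (typeof site[field] !== "string" || site[field].trim() === "") {
      errors.push(`"${field}" is required and must be a non-empty string`);
    }
  }
  if (site.cookies !== undefined && typeof site.cookies !== "string") {
    errors.push('"cookies" must be a CSS selector string');
  }
  if (
    site.additional !== undefined &&
    !(Array.isArray(site.additional) &&
      site.additional.every((selector) => typeof selector === "string"))
  ) {
    errors.push('"additional" must be an array of CSS selector strings');
  }
  return errors;
}

// Example input mirroring the entry shown above.
const sites = JSON.parse(`[
  {
    "name": "NYTimes",
    "url": "https://www.nytimes.com/",
    "cookies": "[data-testid='GDPR-accept']"
  }
]`);

for (const site of sites) {
  const errors = validateSite(site);
  console.log(site.name, errors.length === 0 ? "ok" : errors);
}
```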
Please read CONTRIBUTING.md and CODE_OF_CONDUCT.md for details on our code of conduct and the process for submitting pull requests.
We use SemVer for versioning. For the versions available, see the tags on this repository.
Ulysse McConnell - umcconnell
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE.md file for details.
- Puppeteer docs - Puppeteer
- Contributor Covenant - Code of Conduct