Scrape any content on any website that has a search bar (using Puppeteer).
Automate data scraping on the web. It works on any website that has a search bar. The scraper enters a text in the search bar as a user would, then explores the links it finds. You can specify any number of websites to explore, any text to search for, and any number of values to retrieve.
Just go to the official Node.js website and download the installer. Also, make sure git is available in your PATH, since npm might need it (you can find git here).
You can install nodejs and npm easily with apt install. Just run the following commands.
$ sudo apt install nodejs
$ sudo apt install npm
You can find more information about the installation on the official Node.js website and the official NPM website.
If the installation was successful, you should be able to run the following commands.
$ node --version
v14.17.6
$ npm --version
7.24.1
If you need to update npm, you can do it using npm itself. After running the following command, just reopen the command line.
$ npm install npm -g
$ npm install @ownw/web-search-scrap
The module has the following functions:
const {scrap, saveJsonAsyncGenerator, pagesToScrap, nameFile, qos} = require('@ownw/web-search-scrap');
//starts the scrapping
scrap(toSearchFor:string|string[], pagesToScrap:...PageToScrap): AsyncGenerator<Object>
//save results to a directory
saveJsonAsyncGenerator(fileDir:string, gens:...AsyncGenerator): Promise<void>
//loads all config files
pagesToScrap(directoryName:string): Promise<PageToScrap[]>
//generates a name with the current date
nameFile(names:...string): string
Your main file could look like this:
const path = require('path');
const {scrap, saveJsonAsyncGenerator, pagesToScrap, nameFile} = require('@ownw/web-search-scrap');
pagesToScrap(path.join(__dirname, 'pageToScrap')).then(async pages => {
const pathDir = path.join('results', nameFile("search1"));
await saveJsonAsyncGenerator(pathDir, scrap(["text to search", "other text to search"], pages['target1']));
const pathDir2 = path.join('results', nameFile("search2"));
await saveJsonAsyncGenerator(pathDir2, scrap("text to search", pages['target1'], pages['target2']));
process.exit(0);
});
The function saveJsonAsyncGenerator() saves 3 files to the specified directory. Let's say you have this code:
pathDir = 'out'
pathSearch = path.join(pathDir, nameFile('search'))
targetPage = ...
saveJsonAsyncGenerator(pathSearch, scrap([...], targetPage))
You will have the following files:
out/[date].search/
->[date].search.json
->[date].search.log
->[date].search.qos
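If you want to check that a run produced everything, the layout above can be turned into a list of paths. This is only a sketch: expectedOutputFiles is a hypothetical helper (not part of the module), and it assumes each file is named after its directory, as in the listing above.

```javascript
const path = require('path');

// Hypothetical helper, not part of @ownw/web-search-scrap.
// Assumes the three files share the directory's own name as their base name.
function expectedOutputFiles(searchDir) {
  const base = path.basename(searchDir);
  return ['json', 'log', 'qos'].map(ext => path.join(searchDir, `${base}.${ext}`));
}
```

You could then pass each returned path to fs.existsSync() after the save completes.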
You can also directly use the results generated:
const {scrap} = require('@ownw/web-search-scrap');
const targetWebsite = ...;
const asyncFn = async () => {
for await (const res of scrap("text to search", targetWebsite)){
//do something with res...
//res.type = ('data'|'log'|'qos')
// ->data: contains the actual data (use res.value)
// ->log: contains the log
// ->qos: contains metrics for the search
}
};
asyncFn();
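If you would rather get everything at once instead of handling items one by one, the loop above can be wrapped in a small collector. This is a sketch under assumptions: collectByType and fakeScrap are hypothetical names, and it assumes log and qos items expose their payload on res.value like data items do. The stand-in generator is only there so the example runs without a browser.

```javascript
// Hypothetical helper: drains any async generator that yields
// {type, value} objects and groups the values by their type.
async function collectByType(gen) {
  const grouped = { data: [], log: [], qos: [] };
  for await (const res of gen) {
    // Keep unexpected types too, rather than dropping them silently.
    if (!Object.prototype.hasOwnProperty.call(grouped, res.type)) {
      grouped[res.type] = [];
    }
    grouped[res.type].push(res.value);
  }
  return grouped;
}

// Stand-in for scrap(...): yields items shaped like the ones described above.
async function* fakeScrap() {
  yield { type: 'log', value: 'search started' };
  yield { type: 'data', value: { title: 'example item' } };
  yield { type: 'qos', value: { elapsedMs: 12 } };
}

collectByType(fakeScrap()).then(grouped => {
  console.log(`${grouped.data.length} data item(s) collected`);
});
```

With the real module you would pass scrap("text to search", targetWebsite) in place of fakeScrap().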
Each file loaded by the function pagesToScrap(directoryName) needs to have the following structure:
{
"name": ...,
"url": ...,
"searchBarSelector": ...,
"xpathResults": [...],
"xpathPagination": {"next": ...},
"disableIntercept": true/false,
"delayStrategy": {
"nbUrlPerChunk": ...,
"delayBetweenChunks": ...(ms)
},
"fields": [
{
"name": ...,
"xpath": ...,
"htmlProperty": "textContent"/"src"/...
},
...
],
"captcha": {
"xpath": ...,
"retryIn": ...(ms),
"maxTries": ...
}
}
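A quick sanity check before launching a long run can save some time. The sketch below is not part of the module: checkPageConfig is a hypothetical helper, and the list of keys it treats as required is an assumption based on the structure above (the library may well accept files that omit some of them).

```javascript
// Assumption: these keys are treated as mandatory for the purpose of this check.
const REQUIRED_KEYS = ['name', 'url', 'searchBarSelector', 'xpathResults', 'fields'];

// Hypothetical helper: returns a list of problems found (empty if none).
function checkPageConfig(config) {
  const problems = [];
  for (const key of REQUIRED_KEYS) {
    if (config[key] === undefined) problems.push(`missing "${key}"`);
  }
  if (config.xpathResults !== undefined && !Array.isArray(config.xpathResults)) {
    problems.push('"xpathResults" should be an array');
  }
  if (Array.isArray(config.fields)) {
    config.fields.forEach((field, i) => {
      // Each field entry needs a name, an xpath and the HTML property to read.
      for (const key of ['name', 'xpath', 'htmlProperty']) {
        if (typeof field[key] !== 'string') {
          problems.push(`fields[${i}] is missing "${key}"`);
        }
      }
    });
  } else if (config.fields !== undefined) {
    problems.push('"fields" should be an array');
  }
  return problems;
}
```

For instance, checkPageConfig(JSON.parse(fs.readFileSync(file, 'utf8'))) would return an empty array for a file that follows the structure above.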
For example, if the target website is Amazon, the following values are suggested:
{
"name": "amazon",
"url": "https://www.amazon.com",
"searchBarSelector": "#twotabsearchtextbox",
"xpathResults": [
"//*[@data-component-type='s-search-result']//a[not(contains(@href, '#customerReviews') or contains(@href, 'javascript') or contains(@href, 'offer-listing') or contains(@href, 'bestsellers'))]"
],
"xpathPagination" : {
"next": "//ul[@class='a-pagination']//li[@class='a-last']//a"
},
"disableIntercept": true,
"delayStrategy": {
"nbUrlPerChunk": 2,
"delayBetweenChunks": 20000
},
"fields": [
{
"name": "title",
"xpath": "//*[@id=\"productTitle\"]",
"htmlProperty": "textContent"
},
{
"name": "image",
"xpath": "//*[@id=\"landingImage\"]",
"htmlProperty": "src"
},
{
"name": "price",
"xpath": "//*[@id=\"priceblock_ourprice\"]",
"htmlProperty": "textContent"
}
],
"captcha": {
"xpath": "//title[contains(.,'CAPTCHA')]",
"retryIn": 250000,
"maxTries": 5
}
}
This will yield results like the following, depending on the text searched (note that you can get at most 350 results per search on Amazon):
[
{
"url": "https://www.amazon.fr/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A05352261ELNJONHOCE7P&url=%2FPhilips-Connect%25C3%25A9e-Compatible-Bluetooth-Fonctionne%2Fdp%2FB07SS377J3%2Fref%3Dsr_1_1_sspa%3F__mk_fr_FR%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26dchild%3D1%26keywords%3DPhilips%2BHue%2Bampoule%26qid%3D1596722562%26sr%3D8-1-spons%26psc%3D1&qualifier=1596722562&id=6255010379111342&widgetName=sp_atf",
"title": "Philips Hue Ampoule LED Connectée White & Color Ambiance E27 Compatible Bluetooth, Fonctionne avec Alexa",
"image": "https://images-na.ssl-images-amazon.com/images/I/71rIv9NRlZL._AC_SX342_.jpg",
"price": "59,90 €"
},
{
"url": "https://www.amazon.fr/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A030014930UETVZTN5TEW&url=%2FPhilips-Connect%25C3%25A9e-Compatible-Bluetooth-Fonctionne%2Fdp%2FB07SNGBWG4%2Fref%3Dsr_1_2_sspa%3F__mk_fr_FR%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26dchild%3D1%26keywords%3DPhilips%2BHue%2Bampoule%26qid%3D1596722562%26sr%3D8-2-spons%26psc%3D1&qualifier=1596722562&id=6255010379111342&widgetName=sp_atf",
"title": "Philips Hue Ampoule LED Connectée White Filament E27 Forme Standard, Compatible Bluetooth 7 W, Fonctionne avec Alexa et Google Assistant",
"image": "https://images-na.ssl-images-amazon.com/images/I/61LalkKznwL._AC_SX342_.jpg",
"price": "19,99 €"
},
{
"url": "https://www.amazon.fr/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A01163081ITUF15PTIXWC&url=%2FPhilips-Connect%25C3%25A9es-Compatible-Bluetooth-Fonctionne%2Fdp%2FB07SR3DTPG%2Fref%3Dsr_1_3_sspa%3F__mk_fr_FR%3D%25C3%2585M%25C3%2585%25C5%25BD%25C3%2595%25C3%2591%26dchild%3D1%26keywords%3DPhilips%2BHue%2Bampoule%26qid%3D1596722562%26sr%3D8-3-spons%26psc%3D1&qualifier=1596722562&id=6255010379111342&widgetName=sp_atf",
"title": "Philips Hue Ampoules LED Connectées White Ambiance E27 Compatible Bluetooth, Fonctionne avec Alexa Pack de 2",
"image": "https://images-na.ssl-images-amazon.com/images/I/71-HeRcTqSL._AC_SX342_.jpg",
"price": "44,99 €"
},
...
]
The purpose of each attribute is documented in the code.
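To give an intuition for the delayStrategy values used in the configuration, here is one way such a strategy could work: visit nbUrlPerChunk URLs, wait delayBetweenChunks milliseconds, then continue. This is only an illustration and not the module's actual implementation. chunk, sleep and visitInChunks are hypothetical names, and visit stands for any async function you supply.

```javascript
// Splits a list into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('chunk size must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: visits URLs chunk by chunk, pausing between chunks.
async function visitInChunks(urls, delayStrategy, visit) {
  const results = [];
  const chunks = chunk(urls, delayStrategy.nbUrlPerChunk);
  for (let i = 0; i < chunks.length; i++) {
    // No pause before the first chunk, only between chunks.
    if (i > 0) await sleep(delayStrategy.delayBetweenChunks);
    results.push(...await Promise.all(chunks[i].map(visit)));
  }
  return results;
}
```

Under this reading, the Amazon values (2 URLs per chunk, 20000 ms between chunks) would spend about 20 seconds of waiting for every 2 result pages visited.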
To search on Google, use:
{
"name": "google",
"url": "https://www.google.com/",
"searchBarSelector": "input.gLFyf.gsfi",
"xpathResults": [
".//*[contains(@href, 'https://webcache.googleusercontent.com/search')]"
],
"xpathPagination" : {
"next": "//*[@id=\"pnnext\"]"
},
"delayStrategy": {
"nbUrlPerChunk": 1,
"delayBetweenChunks": 20000
},
"disableIntercept": true,
"fields": [
...
],
"captcha": {
"xpath": "//*[@id='captcha-form']",
"retryIn": 250000,
"maxTries": 5
}
}
with
scrap("text site:targetedWebsite.com", pages['google'])
For more examples, please refer to the documentation.
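If you build such queries for several websites, a tiny helper avoids typos in the site: part. siteQuery is a hypothetical name, not something exported by the module. It only does string formatting.

```javascript
// Hypothetical helper: appends Google's site: operator to a search text.
function siteQuery(text, domain) {
  // Strip any scheme and path so only the host name is passed to site:.
  const host = domain.replace(/^https?:\/\//, '').split('/')[0];
  return `${text} site:${host}`;
}
```

The returned string can be passed as the first argument of scrap() together with the Google configuration.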
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Project Link: