
Node.js Asynchronous Multithreading Web Scraping

Reading online data multiple times faster ;)

About

Web scraping is the process of extracting data from websites. In today's world, web scraping has become an essential technique for businesses and organizations to gather valuable data for their research and analysis. Node.js is a powerful platform that enables developers to perform web scraping in an efficient and scalable manner.

What is Multithreaded Web Scraping?

Multithreaded web scraping is a technique that involves dividing the web scraping task into multiple threads. Each thread performs a specific part of the scraping process, such as downloading web pages, parsing HTML, or saving data to a database. By using multiple threads, the scraping process can be performed in parallel, which can significantly improve the speed and efficiency of the scraping task.

Why use Multithreaded Web Scraping?

There are several reasons why multithreaded web scraping is beneficial. Firstly, it can significantly reduce the time required to scrape large amounts of data from multiple websites. Secondly, it makes better use of the machine's resources by keeping all CPU cores busy instead of just one. Lastly, it can help avoid roadblocks such as being blocked by a website for sending too many requests in a row from a single process.

How to implement Multithreaded Web Scraping in Node.js?

To implement multithreaded web scraping in Node.js, we can use the built-in "cluster" module. The cluster module enables the creation of child processes that run in parallel and communicate with the parent process through inter-process communication (IPC) messages. By creating multiple child processes, we can distribute the scraping task across all available cores of the CPU.

Running the code

In this code example we use tabnews.com.br as a target. The objective is to generate a JSON file per page listing each article's title and URL. Our code will:

  • Start the master process and fork one cluster worker per available CPU
  • Apply the web scraping engine to each cluster worker
  • Read the page, generate the screenshot, and break the content down into an article list
  • Save a .json file with each article's title and URL
  • Finish the process and start another

Usage

Start app

    yarn start

Links

Tabnews
Medium

Let's stay connected

Hope you find it useful and enjoy it! Connect with me on LinkedIn and follow to see what comes next ;)

Cya ! :)