
Finished working on a basic parallel crawler for Structured Data in Rust #1

Merged: 7 commits, Mar 1, 2021

Conversation

last-genius (Collaborator) commented Mar 1, 2021

I've implemented a simple crawler that uses a thread pool and message passing to crawl webpages in parallel. It uses most of the CPU's resources and does everything we agreed upon for this week, but there is still a lot of room for improvement; more on that below.
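A minimal sketch of that shape, using only the standard library; the names here (`Job`, `spawn_workers`) are illustrative, not this PR's actual API. Workers are spawned once, pull URLs from a shared job channel, and report discovered links back over a second channel:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = String; // a URL to crawl

fn spawn_workers(
    n: usize,
    jobs: Arc<Mutex<mpsc::Receiver<Job>>>,
    found: mpsc::Sender<Vec<String>>,
) -> Vec<thread::JoinHandle<()>> {
    (0..n)
        .map(|_| {
            let jobs = Arc::clone(&jobs);
            let found = found.clone();
            thread::spawn(move || loop {
                // Take the lock only to pull one job, then release it.
                let job = jobs.lock().unwrap().recv();
                let url = match job {
                    Ok(url) => url,
                    Err(_) => break, // channel closed: no more work
                };
                // Fetching and parsing would happen here; we fake one link.
                let links = vec![format!("{}/found", url)];
                if found.send(links).is_err() {
                    break; // main thread stopped listening
                }
            })
        })
        .collect()
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel();
    let (found_tx, found_rx) = mpsc::channel();
    let workers = spawn_workers(4, Arc::new(Mutex::new(job_rx)), found_tx);

    job_tx.send(String::from("https://example.com")).unwrap();
    drop(job_tx); // closing the queue lets workers exit once it drains

    for links in found_rx {
        println!("discovered: {:?}", links);
    }
    for w in workers {
        w.join().unwrap();
    }
}
```

Sharing the receiver behind a `Mutex` keeps the sketch short; dropping the job sender is what lets every worker exit cleanly.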

Completed:

  • A minimal Docker image for the crawler.
  • A CLI app that takes a text input file and then crawls the listed webpages in parallel, collecting more and more links.
  • The app saves the collected structured data in plain form into an output file (the I/O shape is sketched below).
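A minimal sketch of that CLI and file I/O; the argument order and file format are placeholders, and there is no real crawling here (the seeds are just passed through to the output):

```rust
use std::env;
use std::fs::{self, File};
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Expect: crawler <input-file> <output-file>
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <input-file> <output-file>", args[0]);
        std::process::exit(1);
    }

    // One seed URL per line in the input file.
    let seeds: Vec<String> = fs::read_to_string(&args[1])?
        .lines()
        .map(String::from)
        .collect();

    // Crawling would go here; we just pass the seeds through.
    let mut out = File::create(&args[2])?;
    for item in &seeds {
        writeln!(out, "{}", item)?; // structured data, one record per line
    }
    Ok(())
}
```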

Things left to do, roughly in descending order of importance:

  • Look into asynchronous implementations for each worker in the thread pool
  • Improve link collection: stop treating anchors on the same page as different pages (see the sketch after this list)
  • Add more documentation and split the file into several modules
  • Add tests for the crawler and thread pool implementations
  • Fix up progress display in a Docker image
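For the anchor item above, a sketch assuming the `url` crate: links that differ only in their `#fragment` refer to the same document, so normalization can drop the fragment (and `join` resolves relative links along the way):

```rust
use url::Url;

/// Hypothetical helper: resolve `link` against the page it appeared on
/// and drop the `#fragment`, so in-page anchors don't count as new pages.
fn normalize(base: &Url, link: &str) -> Option<Url> {
    let mut url = base.join(link).ok()?; // resolves relative links too
    url.set_fragment(None);
    Some(url)
}

fn main() {
    let base = Url::parse("https://example.com/docs/").unwrap();
    let a = normalize(&base, "page.html#intro").unwrap();
    let b = normalize(&base, "page.html#usage").unwrap();
    assert_eq!(a, b); // both anchors collapse to the same page
    println!("{}", a); // https://example.com/docs/page.html
}
```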

Commits:

Implemented better command-line argument reading; the app now takes an input file with pages (it does not yet perform crawling) on which to perform structured data scraping.

Now it also writes the data, plain and simple, into another file.

There is a lot of new stuff inside; it doesn't work yet, though, with a couple of issues left to resolve around parsing out relative URLs. Lots of TODOs inside, and tests still have to be written.

The architecture still sucks; going to rework it in the next day or two, hopefully. I need a smarter thread pool: I probably don't have to create so many threads each time, and can instead create them once and just feed them new URLs.

Added a nice TUI and divided everything up into several structures and their associated functions. There are still a lot of TODOs and cleanup left, but it should work roughly as expected.

Updated the Dockerfile, since we now need the input file copied into the image and command-line parameters passed to the executable. The image is still very small, only 14 MB.
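As a sketch only, a scratch-based Dockerfile with the shape that commit describes; the binary name, target triple, and file paths are all placeholders (a `scratch` image needs a statically linked binary, e.g. a musl build):

```dockerfile
# Placeholder names throughout; shape only.
FROM scratch
COPY target/x86_64-unknown-linux-musl/release/crawler /crawler
COPY input.txt /input.txt
# The executable now needs its input and output paths as arguments.
ENTRYPOINT ["/crawler", "/input.txt", "/output.txt"]
```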

Improved error handling in link normalization; threads no longer panic there if something goes wrong. They can still panic in other places, and error handling will have to be improved there too once we start working with more data from different websites.
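A sketch of that non-panicking step, with hypothetical helper names: every fallible stage returns a `Result`, so a worker logs and skips a bad link instead of unwrapping and dying:

```rust
use url::Url;

/// Hypothetical crawl step: any parse failure bubbles up as an error.
fn crawl_one(url: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let base = Url::parse(url)?; // no unwrap(): bad input becomes an Err
    // Fetching and scraping would go here; we fake one resolved link.
    Ok(vec![base.join("/next")?.to_string()])
}

fn worker_step(url: &str) {
    match crawl_one(url) {
        Ok(links) => println!("{}: {} links", url, links.len()),
        Err(e) => eprintln!("skipping {}: {}", url, e), // log, don't panic
    }
}

fn main() {
    worker_step("https://example.com"); // fine
    worker_step("not a url");           // logged and skipped, no panic
}
```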

Improved the progress bar display: added progress reporting and a finish message. The scratch Docker image is not capable of displaying the progress bar; I'll probably have to fix that somehow, or implement a workable fallback for dumb terminals.
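A sketch of that progress display, assuming the `indicatif` crate (an assumption; the PR's actual dependency isn't shown here). `indicatif` also hides the bar when stderr isn't a terminal, which might be one route to a dumb-terminal fallback:

```rust
use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    let pages = 100; // number of queued pages, for illustration
    let bar = ProgressBar::new(pages);
    // {msg} in the template is what finish_with_message() fills in.
    bar.set_style(
        ProgressStyle::with_template("{bar:40} {pos}/{len} {msg}").unwrap(),
    );
    for _ in 0..pages {
        // crawl one page here ...
        bar.inc(1);
    }
    bar.finish_with_message("crawl finished");
}
```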