
Finished working on a basic parallel crawler for Structured Data in Rust #1

Merged: 7 commits, Mar 1, 2021

Conversation

last-genius (Collaborator) commented Mar 1, 2021

I've implemented a simple crawler that uses a thread pool and message passing to crawl webpages in parallel. It uses most of the CPU's resources and does everything we agreed upon for this week, but there is still a lot of room for improvement; more on that below.
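A minimal sketch of that shape, using only the standard library; the names here (`Job`, `spawn_workers`) are illustrative, not this PR's actual API. Workers are spawned once, pull URLs from a shared job channel, and report discovered links back over a second channel:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = String; // a URL to crawl

fn spawn_workers(
    n: usize,
    jobs: Arc<Mutex<mpsc::Receiver<Job>>>,
    found: mpsc::Sender<Vec<String>>,
) -> Vec<thread::JoinHandle<()>> {
    (0..n)
        .map(|_| {
            let jobs = Arc::clone(&jobs);
            let found = found.clone();
            thread::spawn(move || loop {
                // Take the lock only to pull one job, then release it.
                let job = jobs.lock().unwrap().recv();
                let url = match job {
                    Ok(url) => url,
                    Err(_) => break, // channel closed: no more work
                };
                // Fetching and parsing would happen here; we fake one link.
                let links = vec![format!("{}/found", url)];
                if found.send(links).is_err() {
                    break; // main thread stopped listening
                }
            })
        })
        .collect()
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel();
    let (found_tx, found_rx) = mpsc::channel();
    let workers = spawn_workers(4, Arc::new(Mutex::new(job_rx)), found_tx);

    job_tx.send(String::from("https://example.com")).unwrap();
    drop(job_tx); // closing the queue lets workers exit once it drains

    for links in found_rx {
        println!("discovered: {:?}", links);
    }
    for w in workers {
        w.join().unwrap();
    }
}
```

Sharing the receiver behind a `Mutex` keeps the sketch short; dropping the job sender is what lets every worker exit cleanly.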

Completed:

  • A minimal Docker image for the crawler.
  • A CLI app that takes a text input file and then crawls the listed webpages in parallel, collecting more and more links.
  • The app saves the collected structured data in plain form into an output file (the I/O shape is sketched below).
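A minimal sketch of that CLI and file I/O; the argument order and file format are placeholders, and there is no real crawling here (the seeds are just passed through to the output):

```rust
use std::env;
use std::fs::{self, File};
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Expect: crawler <input-file> <output-file>
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <input-file> <output-file>", args[0]);
        std::process::exit(1);
    }

    // One seed URL per line in the input file.
    let seeds: Vec<String> = fs::read_to_string(&args[1])?
        .lines()
        .map(String::from)
        .collect();

    // Crawling would go here; we just pass the seeds through.
    let mut out = File::create(&args[2])?;
    for item in &seeds {
        writeln!(out, "{}", item)?; // structured data, one record per line
    }
    Ok(())
}
```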

Things left to do, roughly in descending order of importance:

  • Look into asynchronous implementations for each worker in the thread pool
  • Improve link collection: stop treating anchors on the same page as different pages (see the sketch after this list)
  • Add more documentation and split the file into several modules
  • Add tests for the crawler and thread pool implementations
  • Fix up progress display in a Docker image
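For the anchor item above, a sketch assuming the `url` crate: links that differ only in their `#fragment` refer to the same document, so normalization can drop the fragment (and `join` resolves relative links along the way):

```rust
use url::Url;

/// Hypothetical helper: resolve `link` against the page it appeared on
/// and drop the `#fragment`, so in-page anchors don't count as new pages.
fn normalize(base: &Url, link: &str) -> Option<Url> {
    let mut url = base.join(link).ok()?; // resolves relative links too
    url.set_fragment(None);
    Some(url)
}

fn main() {
    let base = Url::parse("https://example.com/docs/").unwrap();
    let a = normalize(&base, "page.html#intro").unwrap();
    let b = normalize(&base, "page.html#usage").unwrap();
    assert_eq!(a, b); // both anchors collapse to the same page
    println!("{}", a); // https://example.com/docs/page.html
}
```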

Commits:

Implemented better command-line argument reading; the app now takes an input file with pages (it does not yet perform crawling) on which to perform structured data scraping.

Now it also writes the data, plain and simple, into another file.

There is a lot of new stuff inside; it doesn't work yet, though, with a couple of issues left to resolve around parsing out relative URLs. Lots of TODOs inside, and tests still have to be written.

The architecture still sucks; going to rework it in the next day or two, hopefully. I need a smarter thread pool: I probably don't have to create so many threads each time, and can instead create them once and just feed them new URLs.

Added a nice TUI and divided everything up into several structures and their associated functions. There are still a lot of TODOs and cleanup left, but it should work roughly as expected.

Updated the Dockerfile, since we now need the input file copied into the image and command-line parameters passed to the executable. The image is still very small, only 14 MB.
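As a sketch only, a scratch-based Dockerfile with the shape that commit describes; the binary name, target triple, and file paths are all placeholders (a `scratch` image needs a statically linked binary, e.g. a musl build):

```dockerfile
# Placeholder names throughout; shape only.
FROM scratch
COPY target/x86_64-unknown-linux-musl/release/crawler /crawler
COPY input.txt /input.txt
# The executable now needs its input and output paths as arguments.
ENTRYPOINT ["/crawler", "/input.txt", "/output.txt"]
```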

Improved error handling in link normalization; threads no longer panic there if something goes wrong. They can still panic in other places, and error handling will have to be improved there too once we start working with more data from different websites.
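A sketch of that non-panicking step, with hypothetical helper names: every fallible stage returns a `Result`, so a worker logs and skips a bad link instead of unwrapping and dying:

```rust
use url::Url;

/// Hypothetical crawl step: any parse failure bubbles up as an error.
fn crawl_one(url: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let base = Url::parse(url)?; // no unwrap(): bad input becomes an Err
    // Fetching and scraping would go here; we fake one resolved link.
    Ok(vec![base.join("/next")?.to_string()])
}

fn worker_step(url: &str) {
    match crawl_one(url) {
        Ok(links) => println!("{}: {} links", url, links.len()),
        Err(e) => eprintln!("skipping {}: {}", url, e), // log, don't panic
    }
}

fn main() {
    worker_step("https://example.com"); // fine
    worker_step("not a url");           // logged and skipped, no panic
}
```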

Improved the progress bar display: added progress reporting and a finish message. The scratch Docker image is not capable of displaying the progress bar; I'll probably have to fix that somehow, or implement a workable fallback for dumb terminals.
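A sketch of that progress display, assuming the `indicatif` crate (an assumption; the PR's actual dependency isn't shown here). `indicatif` also hides the bar when stderr isn't a terminal, which might be one route to a dumb-terminal fallback:

```rust
use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    let pages = 100; // number of queued pages, for illustration
    let bar = ProgressBar::new(pages);
    // {msg} in the template is what finish_with_message() fills in.
    bar.set_style(
        ProgressStyle::with_template("{bar:40} {pos}/{len} {msg}").unwrap(),
    );
    for _ in 0..pages {
        // crawl one page here ...
        bar.inc(1);
    }
    bar.finish_with_message("crawl finished");
}
```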