Domclick html pages parser

Written in Rust

For educational purposes only!

The website blocked Python scripts that tried to parse its source code, so I decided to download the pages with undetected chromedriver and then parse them in Rust. This project uses the scraper crate to extract values from the HTML, polars to build and maintain the data table at runtime, and rayon to scan the HTML files on multiple threads.
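As a rough illustration of the "parse many saved pages in parallel" idea, here is a minimal standard-library sketch. It is not the project's code: it swaps rayon for std::thread::scope and swaps the scraper crate for a naive substring search, so that it compiles with no dependencies. The Listing struct, the price field, and the `<span class="price">` markup are hypothetical placeholders, since this README does not list the real columns or selectors.

```rust
use std::thread;

/// Hypothetical record; the real project's columns are not listed in this README.
#[derive(Debug, PartialEq)]
struct Listing {
    price: Option<f64>,
}

/// Naive stand-in for a CSS-selector lookup: returns the trimmed text
/// between the first `open` marker and the next `close` marker.
fn text_between<'a>(html: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    Some(html[start..end].trim())
}

/// Extracts one listing from a page; a missing or unparsable price becomes None.
fn parse_page(html: &str) -> Listing {
    // Assumed markup, for illustration only.
    let price = text_between(html, "<span class=\"price\">", "</span>")
        .and_then(|s| s.replace(' ', "").parse::<f64>().ok());
    Listing { price }
}

/// Splits the pages into chunks and parses each chunk on its own thread.
/// Results come back in the same order as the input pages.
fn parse_all(pages: &[String], workers: usize) -> Vec<Listing> {
    let workers = workers.max(1);
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|p| parse_page(p)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

With rayon, the chunking and joining above would typically collapse into a single parallel iterator over the pages, which is presumably why the project uses it.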

It first extracts the raw data, then fills in the missing values, one-hot encodes the categorical columns, and normalizes the numeric columns.
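The three preprocessing steps can be sketched on plain vectors as follows. This is only an illustration under assumptions: the README does not say how gaps are filled or which normalization is used, so the sketch assumes mean imputation and min-max scaling to [0, 1], and it uses standard-library vectors rather than polars. All function names are hypothetical.

```rust
/// Fills missing values with the mean of the present ones
/// (assumed strategy; the README does not specify one).
fn fill_with_mean(column: &[Option<f64>]) -> Vec<f64> {
    let present: Vec<f64> = column.iter().flatten().copied().collect();
    let mean = if present.is_empty() {
        0.0
    } else {
        present.iter().sum::<f64>() / present.len() as f64
    };
    column.iter().map(|v| v.unwrap_or(mean)).collect()
}

/// Min-max scaling to [0, 1] (assumed normalization).
/// A constant column maps to all zeros to avoid dividing by zero.
fn min_max_normalize(column: &[f64]) -> Vec<f64> {
    let min = column.iter().copied().fold(f64::INFINITY, f64::min);
    let max = column.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    column
        .iter()
        .map(|v| if range > 0.0 { (v - min) / range } else { 0.0 })
        .collect()
}

/// One-hot encoding: returns the sorted distinct categories and,
/// for each input value, a row with a single 1 in the matching position.
fn one_hot(column: &[&str]) -> (Vec<String>, Vec<Vec<u8>>) {
    let mut categories: Vec<String> = column.iter().map(|s| s.to_string()).collect();
    categories.sort();
    categories.dedup();
    let rows = column
        .iter()
        .map(|value| {
            categories
                .iter()
                .map(|c| u8::from(c.as_str() == *value))
                .collect()
        })
        .collect();
    (categories, rows)
}
```

For example, a numeric column with one gap would go through fill_with_mean and then min_max_normalize, while a categorical column would be expanded by one_hot into one 0/1 column per category.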

It saves the result to the out.tsv table.

You can also get the raw data table in short.tsv by running the project with the -s flag.
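A minimal sketch of how such a flag could be detected with the standard library is shown below. The function name is hypothetical, and the README does not say whether the project parses arguments by hand or with a crate. Since the project ships a Cargo.toml, the flag would most likely be passed as `cargo run --release -- -s`, where everything after `--` goes to the program rather than to cargo.

```rust
/// Returns true when the raw table (short.tsv) was requested with `-s`.
/// In a real binary the arguments would come from
/// `std::env::args().skip(1).collect::<Vec<String>>()`.
fn wants_raw_table(args: &[String]) -> bool {
    args.iter().any(|a| a == "-s")
}
```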

It scanned 2000 HTML pages (1 GB in total) and built the table in 1.5 seconds.

mpxx1/domclick-parser-rs
