html2text_rs_py

A Python library that wraps Rust's html2text to convert HTML to plain text. The Rust core keeps conversion fast and efficient, while the Python interface keeps it easy to use.

Note this entire thing was done with GPT-4 and it's my first time touching Rust -- just a bit of a weekend sidequest/learning experience. As a wise man once said: "I'm in the arena trying stuff. Some will work, some won't. But always learning."

Installation

Prerequisites:

Ensure you have Python installed on your machine. Rust and maturin are only needed if you build from source (Option 2 below).
If you plan to build from source, install maturin:

pip install maturin

Building and Installing:

Option 1: Use precompiled binaries from PyPI

You can use the precompiled binaries available on PyPI. This means you don't need to compile anything yourself, and the Rust toolchain is not required.

pip install html2text_rs_py

Option 2: Building from source:

If you prefer to compile the Rust code yourself, or if you're interested in developing, you can build directly from the source code:

First, ensure you have the Rust toolchain installed. If you don't have it, get it from rustup.rs.
Clone this repo:

git clone https://github.com/mpr1255/html2text_rs_py.git
cd html2text_rs_py

Build and install the Python package:

maturin develop --release

This will compile the Rust code and link it with the Python wrapper, making the module available for Python.
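
Whichever option you used, a quick sanity check is to ask Python whether the module can be found before running a real conversion. This is only a minimal sketch: `is_installed` is a hypothetical helper (not part of this package), and it assumes the import name matches the package name, as in the usage examples in this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module without importing it."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# Hypothetical check; run it in the same environment you installed into.
print("html2text_rs_py installed:", is_installed("html2text_rs_py"))
```

If this prints False after `maturin develop --release`, the most likely cause is that the build ran in a different virtual environment from the one you are checking.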

Usage

After installing, you can use the Rust functions directly in Python:

from html2text_rs_py import (
    convert_html_directory_to_text,
    convert_html_file_to_text_py,
    convert_html_files_to_text_batch_py,
    extract_text_from_html_file_py,
)

# Convert a directory of HTML files to text
convert_html_directory_to_text("./input_directory", "./output_directory")

# Convert a single HTML file to text
convert_html_file_to_text_py("input_file.html", "output_file.txt")

# Convert multiple HTML files to text in a batch
input_files = ["input1.html", "input2.html"]
output_files = ["output1.txt", "output2.txt"]
convert_html_files_to_text_batch_py(input_files, output_files)

# Read an HTML file and return its text as a Python string
input_file_path = "path/to/file.html"
extracted_text = extract_text_from_html_file_py(input_file_path)
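
The batch function takes two parallel lists, so in practice you will usually want to build them from a directory listing rather than type them out. Here is a rough sketch of one way to do that with pathlib. `build_batch_paths` and `convert_directory_in_batch` are hypothetical helpers, not part of this package, and the sketch assumes the batch function pairs inputs and outputs by position, as the example above suggests.

```python
from pathlib import Path


def build_batch_paths(input_dir, output_dir):
    """Pair every .html file in input_dir with a .txt path in output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    # Create the output directory up front so the paths are writable.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sort for a stable order; only top-level *.html files are picked up.
    html_paths = sorted(in_dir.glob("*.html"))
    input_files = [str(p) for p in html_paths]
    output_files = [str(out_dir / (p.stem + ".txt")) for p in html_paths]
    return input_files, output_files


def convert_directory_in_batch(input_dir, output_dir):
    """Hypothetical wrapper: build the two lists, then hand them to the batch function."""
    # Imported here so build_batch_paths works even without the package installed.
    from html2text_rs_py import convert_html_files_to_text_batch_py

    input_files, output_files = build_batch_paths(input_dir, output_dir)
    # Assumption: input_files[i] is written to output_files[i].
    convert_html_files_to_text_batch_py(input_files, output_files)
    return output_files
```

For a plain directory-to-directory conversion, convert_html_directory_to_text already covers this; building the lists yourself is mainly useful when you want to filter files or control the output names.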

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you'd like to change. Please make sure to update tests as appropriate.

License

MIT

Note on benchmarks

Speed was the motivation for this little project. To make sure the comparison was 1:1, I generated a ~1 GB dataset of HTML files that do NOT contain links (because the Rust html2text library does not expose a flag to stop generating the hyperlinked URLs, and I don't know enough Rust to figure it out). With everything running multi-threaded, this wrapper is only ~6.5x faster than the normal Python html2text implementation and only ~3.6x faster than Tika... Not that great... However, I will say there is a lot of boilerplate overhead with those (multithreading), whereas this wrapper has a few very simple functions you can call, and the multithreading happens for free under the hood with Rust's rayon.

Benchmarks

Method	Threading	Documents Processed	Total Output Size (bytes)	Time (seconds)
tika	single-threaded	3007	1500926103	94.76
html2text	single-threaded	3007	1500340646	184.90
tika	multi-threaded	3007	1500926103	14.29
html2text	multi-threaded	3007	1500340646	25.65
rust	multi-threaded	3007	1531829273	3.92
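
To see where the speedup figures come from, the ratios can be recomputed from the table. The numbers below are copied from the benchmark rows above; the `speedup` helper is just illustrative arithmetic, not part of the benchmark scripts.

```python
# Timings in seconds, copied from the benchmark table.
timings = {
    ("tika", "single-threaded"): 94.76,
    ("html2text", "single-threaded"): 184.90,
    ("tika", "multi-threaded"): 14.29,
    ("html2text", "multi-threaded"): 25.65,
    ("rust", "multi-threaded"): 3.92,
}
DOCUMENTS = 3007  # documents processed in every run

rust_time = timings[("rust", "multi-threaded")]


def speedup(method, threading):
    """How many times longer a run took compared with the Rust wrapper."""
    return timings[(method, threading)] / rust_time


for (method, threading), seconds in timings.items():
    ratio = seconds / rust_time
    docs_per_second = DOCUMENTS / seconds
    print(f"{method:10s} {threading:16s} {ratio:5.1f}x  {docs_per_second:7.1f} docs/s")
```

The fair comparison is against the multi-threaded rows, since the Rust wrapper is multi-threaded as well; against the single-threaded runs the gap is much larger.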
