minutia is a library for summarizing the content of various internet services. It uses a URL-based interface: each URL is mapped to a dictionary that contains data about the linked resource, such as its title, its file size and more.
- Supported protocols: HTTP
- Customized extraction mechanisms for various services.
- Calculates a TTL value that can be used for caching.
- Supports multiple languages.
- Built with Python async.
- Includes tests that can be used to ensure that each service is parsed properly.
- [media] Supports many content types: HTML, PDF, PNG, JPEG, MP3, AAC, FLAC and more.
- [explicit] Detects explicit content using metadata and ML models.
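The TTL value lends itself to a small in-memory cache. The following is only a sketch of that idea: it assumes the TTL is a number of seconds, and because this document does not show how the TTL is exposed in the result, `ttl_of` is a hypothetical callable that extracts it. `fetch` stands in for whatever coroutine performs the lookup.

```python
import time


class TTLCache:
    """Tiny in-memory cache keyed by URL; entries expire after their TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # url -> (expires_at, value)

    async def get_or_fetch(self, url, fetch, ttl_of):
        """Return the cached value for `url`, awaiting `fetch(url)` on a miss.

        `fetch` and `ttl_of` are hypothetical: `fetch` is any coroutine
        function taking a URL, `ttl_of` returns the TTL in seconds for
        the fetched value.
        """
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh, skip the network
        value = await fetch(url)
        self._entries[url] = (now + ttl_of(value), value)
        return value
```

Injecting the clock keeps the expiry logic testable without sleeping.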
Before this library can be used, it must be initialized. After initialization it can then be configured. When you are done using it, you should also terminate it.
Below is a simple example of how to use the library:
```python
import asyncio
import minutia


async def main():
    # Initialize
    await minutia.init()

    # Configure
    minutia.set_lang("en-GB")
    minutia.set_max_filesize(6_553_500)  # about 6.5 MB in bytes

    # Use the library
    status, data = await minutia.http.get("http://hyperimpose.org")
    # On success `status` is "ok" and `data` will be a dictionary like:
    # {
    #     "@": "http:html",
    #     "t": "This is the title of the page"
    # }

    # Terminate
    await minutia.terminate()


asyncio.run(main())
```
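Because the library should be terminated when you are done, it can help to tie initialization and termination together so that termination also runs when an exception is raised. The sketch below is one possible way to do that with the standard library. `lifecycle` is a hypothetical helper, not part of minutia; it accepts any pair of coroutine functions.

```python
import contextlib


@contextlib.asynccontextmanager
async def lifecycle(init, terminate):
    """Await `init()` on entry and always await `terminate()` on exit.

    If `init()` itself raises, `terminate()` is not called.
    """
    await init()
    try:
        yield
    finally:
        await terminate()
```

With minutia this would be used as `async with lifecycle(minutia.init, minutia.terminate): ...`, with the configuration and lookups placed inside the block.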
You can install this library using pip:
pip install "git+https://github.com/hyperimpose/minutia.git@master"
Some of the features listed above are optional. You can tell pip which optional features you want installed by using the syntax `[media,explicit]` after the installation URL:
pip install "git+https://github.com/hyperimpose/minutia.git@master"[media]
The availability of those dependencies is checked once, when minutia is imported. When an optional dependency is missing, the library logs a warning.
Table of available optional features:
Name | Description |
---|---|
media | Extract extra information from various media files. |
explicit | See Explicit. |
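For readers curious how an import-time availability check of this kind can work, here is a generic sketch using only the standard library. It is not minutia's actual code: the logger name, the function name and the module lists are all hypothetical.

```python
import importlib.util
import logging

log = logging.getLogger("example.optional")  # hypothetical logger name


def feature_available(feature, modules):
    """Return True if every top-level module needed by `feature` is importable.

    Only checks that the modules can be located; nothing is imported.
    """
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    if missing:
        log.warning("feature %r disabled, missing: %s", feature, ", ".join(missing))
    return not missing
```

Running such a check once at import time and storing the result avoids repeating the lookup on every call.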
The library is able to detect explicit content using a Machine Learning (ML) model. To run the ML model, TensorFlow and other dependencies need to be installed, which comes with the following caveats:
- Increased installation size.
- When the code is started for the first time it will need to download and cache the ML model.
- TensorFlow requires a CPU with AVX support.
  - If this requirement is not met the program will crash and show this message: `Illegal Instruction`.
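To find out ahead of time whether a machine meets the AVX requirement, the CPU flags can be inspected. The following sketch assumes an x86 Linux host, where `/proc/cpuinfo` contains a `flags` line; both function names are hypothetical, and other platforms need a different approach.

```python
def has_avx(cpuinfo_text):
    """Return True if a `flags` line in /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the exact token so that "avx2" alone does not count.
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


def host_has_avx(path="/proc/cpuinfo"):
    """Linux-only wrapper; returns None when the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return has_avx(f.read())
    except OSError:
        return None
```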
To avoid those issues this feature is provided through a secondary service. The library can then be configured, at runtime, to connect to that service over a unix socket.
An example workflow to use this feature:
- Do NOT use the `explicit` flag when installing minutia as a library for your project.
  - This keeps the size of the library relatively small.
- Create a new venv, clone this repo and install the library with the `explicit` flag.
  - Install with: `pip install ./minutia[explicit]`
  - Run `python minutia explicit -u /tmp/your_path_here.sock` to start the service.
    - CAUTION: the file path given after the -u option will be replaced on startup.
- After initializing the library in your code, set up the explicit client:
  - Add this code: `await minutia.setup_explicit_unix_socket("/tmp/your_path_here.sock")`
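Since the client connects to a socket file created by a separate process, it may be useful to confirm that the service has created the socket before setting up the client. A minimal sketch, assuming a POSIX system; `is_unix_socket` is a hypothetical helper and not part of minutia.

```python
import os
import stat


def is_unix_socket(path):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path, permission error, and so on.
        return False
```

A caller could then run `await minutia.setup_explicit_unix_socket(path)` only when `is_unix_socket(path)` is true, and log or retry otherwise.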
The explicit service can provide video support. Some extra runtime dependencies are needed for this to work. The availability of those dependencies is checked once, when the service is started. When a runtime dependency is missing, the service logs a warning.
Table of extra runtime dependencies:
Name | Description |
---|---|
ffmpeg | Enables explicit detection of videos. |
ffprobe | MUST be installed with ffmpeg. |
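You can verify these runtime dependencies yourself before starting the service. The sketch below uses `shutil.which` to look the tools up on `PATH`; `missing_tools` is a hypothetical helper, not something the service provides.

```python
import shutil


def missing_tools(names=("ffmpeg", "ffprobe"), which=shutil.which):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

An empty list means both executables were found.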
Callable | Description |
---|---|
async init() | Initialize the library |
async terminate() | Terminate the library |
Callable | Description | Default |
---|---|---|
set_http_useragent(ua: str) | The user agent to use when making HTTP requests. | |
set_lang(lang: str) | The default language to request content in. The value is passed in HTTP headers such as Accept-Language. | "en" |
set_max_filesize(i: int) | The maximum number of bytes to download for deep inspection of supported media files. Set to <= 0 to disable the feature. | 14_600 |
set_max_htmlsize(i: int) | The maximum number of bytes to download when parsing HTML pages. | 14_600 |
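To make the effect of these settings more concrete, the sketch below shows the general idea behind a language header and a download cap. It is an illustration under stated assumptions, not minutia's internals: `request_headers` and `read_capped` are hypothetical names, and returning no bytes for a non-positive limit is only one way to model "disabled".

```python
import io


def request_headers(lang="en", useragent=""):
    """Build HTTP headers carrying the preferred language and user agent."""
    headers = {"Accept-Language": lang}
    if useragent:
        headers["User-Agent"] = useragent
    return headers


def read_capped(stream, max_bytes):
    """Read at most `max_bytes` from a binary stream; <= 0 reads nothing."""
    if max_bytes <= 0:
        return b""
    return stream.read(max_bytes)


# A 20 000 byte body is cut off at the 14_600 byte default.
preview = read_capped(io.BytesIO(b"x" * 20_000), 14_600)
```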
Callable | Description | Default |
---|---|---|
async setup_explicit_unix_socket(path: str) | Path to the explicit service. When called, a new client is started. "" disables the feature. | "" |
This module is used when working with HTTP/HTTPS links.
Callable | Description |
---|---|
async get(link: str, lang: str = "") | Visit the link and return information about it. If `lang` is given, it is used instead of the default language set with set_lang. |
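The usage example earlier shows `get` returning a pair whose first element is `"ok"` and whose second is the data dictionary. A small helper can unpack that pair defensively. This is a sketch: the shape of unsuccessful results is not documented here, so anything other than `"ok"` is simply treated as a failure, and `title_of` is a hypothetical name.

```python
def title_of(result, default=None):
    """Return the "t" field of an ("ok", data) result, else `default`."""
    status, data = result
    if status == "ok" and isinstance(data, dict):
        return data.get("t", default)
    # Any other status is treated as a failure (assumption).
    return default
```

For example, `title_of(await minutia.http.get(link, lang="en-GB"))` would give the page title, or `None` when the lookup did not succeed or no title was found.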
minutia uses the `logging` module to log various events. Everything is logged under the `minutia` logger.
When the library is imported it might log information about the availability of various features. If you want to capture those you must configure logging in your application before importing minutia.
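A minimal sketch of such a setup is shown below. The logger name `minutia` comes from this document; the helper name, the format string and the chosen level are assumptions you can adapt.

```python
import logging


def configure_minutia_logging(level=logging.INFO):
    """Attach a stream handler to the `minutia` logger and set its level."""
    logger = logging.getLogger("minutia")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Call this before `import minutia` so import-time messages are captured.
logger = configure_minutia_logging()
```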
The library has an extra installation option, `dev`, to be used during development. It is built using Flit.
You can set up a development environment with all the dependencies by running the following:
python -m venv venv
source venv/bin/activate
pip install flit
flit install
This project provides both a library for use in other Python projects and standalone services to provide extra features to the library.
- `/minutia`: The library to import in other programs
- `/services/explicit`: The explicit service
- `/__main__.py`: Magic module to start the services from the CLI
minutia is licensed under the GNU Affero General Public License version 3 (AGPLv3).
A copy of this license is included in the file COPYING.