minutia

minutia is a library for summarizing the content of various internet services. It uses a URL-based interface: each URL is mapped to a dictionary that contains data about the linked resource, such as its title and file size.

Features

  • Supported protocols: HTTP
  • Customized extraction mechanisms for various services.
  • Calculates a TTL value that can be used for caching.
  • Supports multiple languages.
  • Built with Python async.
  • Includes tests that can be used to ensure that each service is parsed properly.
  • [media] Supports many content types: HTML, PDF, PNG, JPEG, MP3, AAC, FLAC and more.
  • [explicit] Detects explicit content using metadata and ML models.
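
The TTL value mentioned in the list above lends itself to a small in-memory cache. The sketch below is only an illustration: `TTLCache` is a hypothetical helper, not part of minutia, and because this README does not show how the TTL appears in the result dictionary, the TTL is passed in explicitly as a number of seconds.

```python
import time


class TTLCache:
    """Hypothetical in-memory cache keyed by URL (not part of minutia)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so that expiry can be tested deterministically.
        self._clock = clock
        self._entries = {}  # url -> (expires_at, data)

    def put(self, url, data, ttl):
        """Store `data` for `url` for `ttl` seconds."""
        self._entries[url] = (self._clock() + ttl, data)

    def get(self, url):
        """Return the cached data, or None if it is missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            # Drop the stale entry so the caller fetches a fresh summary.
            del self._entries[url]
            return None
        return data
```

A caller would check `cache.get(url)` first and only ask minutia for a fresh summary when it returns None.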

Usage

The library must be initialized before it can be used. Once initialized, it can be configured. When you are done using it, you should terminate it.

Below is a simple example of how to use the library:

import asyncio
import minutia

async def main():
    # Initialize
    await minutia.init()

    # Configure
    minutia.set_lang("en-GB")
    minutia.set_max_filesize(6_553_500)  # 6.5 MB in bytes

    # Use the library. get() returns a status and the data dictionary;
    # the status is "ok" on success.
    status, data = await minutia.http.get("http://hyperimpose.org")

    # `data' will be a dictionary like:
    # {
    #     "@": "http:html",
    #     "t": "This is the title of the page"
    # }

    # Terminate
    await minutia.terminate()

asyncio.run(main())
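
If an exception is raised between initialization and termination, the terminate step in the example above is skipped. One way to guard against that is an async context manager. This is a minimal sketch: `minutia_session` is a hypothetical helper, not part of the library, and it only assumes that the object passed in exposes the async init() and terminate() callables documented in the API section.

```python
import contextlib


@contextlib.asynccontextmanager
async def minutia_session(lib):
    """Initialize `lib`, hand it to the caller, and always terminate it.

    `lib` is any object with async init() and terminate(), for example the
    imported minutia module. Hypothetical helper, not part of minutia.
    """
    await lib.init()
    try:
        yield lib
    finally:
        # Runs on normal exit and when the body raises.
        await lib.terminate()
```

With this in place, the body of main() could be written as `async with minutia_session(minutia) as m:` followed by the configuration and `m.http.get(...)` calls.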

Installation

You can install this library using pip:

pip install "git+https://github.com/hyperimpose/minutia.git@master"

Optional Dependencies

Some of the features listed above are optional. You can tell pip which optional features to install by adding them in brackets, for example [media,explicit], after the installation URL:

pip install "git+https://github.com/hyperimpose/minutia.git@master"[media]

The availability of those dependencies is checked once, when minutia is imported. When an optional dependency is missing, the library logs a warning.
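
For readers unfamiliar with this pattern, an import-time availability check can be sketched as follows. This is a generic illustration and not minutia's actual code: the function name, the logger name and the message are all made up for the example.

```python
import importlib.util
import logging

# Illustrative logger name; minutia itself logs under "minutia".
logger = logging.getLogger("example.optional")


def has_optional(module_name):
    """Return True if a top-level module can be imported, else warn.

    Generic sketch of an optional-dependency check, not minutia's code.
    """
    available = importlib.util.find_spec(module_name) is not None
    if not available:
        logger.warning("optional dependency %s is not installed", module_name)
    return available
```

Running such a check once at import time lets the rest of the code branch on a simple boolean instead of catching ImportError repeatedly.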

Table of available optional features:

| Name     | Description                                         |
|----------|-----------------------------------------------------|
| media    | Extract extra information from various media files. |
| explicit | See Explicit.                                       |

Explicit

The library can detect explicit content using a Machine Learning (ML) model. Running the ML model requires TensorFlow and other dependencies to be installed, which has the following caveats:

  • Increased installation size.
  • When the code is started for the first time it will need to download and cache the ML model.
  • TensorFlow requires a CPU with AVX support.
    • If this requirement is not met, the program will crash and show this message: Illegal Instruction.

To avoid those issues, this feature is provided through a secondary service. The library can then be configured, at runtime, to connect to that service over a Unix socket.

An example workflow to use this feature:

  • Do NOT use the explicit flag when installing minutia as a library for your project.
    • This keeps the size of the library relatively small.
  • Create a new venv, clone this repo and install the library with the explicit flag.
    • Install with: pip install ./minutia[explicit]
    • Run python minutia explicit -u /tmp/your_path_here.sock to start the service.
      • CAUTION: the file path given after the -u option will be replaced on startup.
  • After initializing the library in your code, set up the explicit client:
    • Add this code: await minutia.setup_explicit_unix_socket("/tmp/your_path_here.sock")
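
The client side of the workflow above can be sketched as follows. The two helpers are hypothetical and not part of minutia. The check that the path is a Unix socket is only a precaution added for this example, since this README does not say what happens when the service is not running.

```python
import os
import stat


def is_unix_socket(path):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


async def start_with_explicit(lib, socket_path):
    """Initialize `lib` and connect the explicit client if possible.

    `lib` is expected to expose async init() and
    setup_explicit_unix_socket(path), like the minutia module.
    Returns True if the explicit client was set up.
    """
    await lib.init()
    if not is_unix_socket(socket_path):
        # The service is probably not running; continue without it.
        return False
    await lib.setup_explicit_unix_socket(socket_path)
    return True
```

In an application this would be called as `await start_with_explicit(minutia, "/tmp/your_path_here.sock")`, using the same path that was given to the service with the -u option.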

Video support

The explicit service can provide video support. Some extra runtime dependencies are needed for this to work. Their availability is checked once, when the service is started. When a runtime dependency is missing, a warning is logged.

Table of extra runtime dependencies:

| Name    | Description                           |
|---------|---------------------------------------|
| ffmpeg  | Enables explicit detection of videos. |
| ffprobe | MUST be installed with ffmpeg.        |
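
Before starting the service you may want to verify that both tools are on your PATH. The helper below is a hypothetical convenience for that purpose. It is not how the service itself performs its check.

```python
import shutil


def missing_video_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of the given executables not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means both executables were found. Otherwise the list names the ones that still need to be installed.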

API

minutia

Initialization / Termination

| Callable          | Description            |
|-------------------|------------------------|
| async init()      | Initialize the library. |
| async terminate() | Terminate the library.  |

Configuration

| Callable                    | Description | Default |
|-----------------------------|-------------|---------|
| set_http_useragent(ua: str) | The user agent to use when making HTTP requests. | |
| set_lang(lang: str)         | The default language to request content in. The value is passed in HTTP headers such as Accept-Language. | "en" |
| set_max_filesize(i: int)    | The max number of bytes to download for deep inspection of supported media files. Set to <= 0 to disable the feature. | 14_600 |
| set_max_htmlsize(i: int)    | The max bytes to download when parsing HTML pages. | 14_600 |
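
If your application keeps its settings in a dictionary, for example loaded from a config file, the setters above can be applied in one pass. This is a sketch: `apply_settings` and the dictionary keys are made up for this example, and only the four set_* callables from the table are assumed to exist.

```python
def apply_settings(lib, settings):
    """Call the matching set_* function on `lib` for each key in `settings`.

    Hypothetical helper. `lib` is expected to expose the set_* callables
    listed in the configuration table, like the minutia module.
    """
    setters = {
        "http_useragent": "set_http_useragent",
        "lang": "set_lang",
        "max_filesize": "set_max_filesize",
        "max_htmlsize": "set_max_htmlsize",
    }
    unknown = set(settings) - set(setters)
    if unknown:
        # Fail early on typos instead of silently ignoring them.
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    for key, value in settings.items():
        getattr(lib, setters[key])(value)
```

For example, `apply_settings(minutia, {"lang": "en-GB", "max_filesize": 6_553_500})` would be equivalent to the two configuration calls in the usage example.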

Setup

| Callable | Description | Default |
|----------|-------------|---------|
| async setup_explicit_unix_socket(path: str) | Path to the explicit service. When called, a new client is started. "" disables the feature. | "" |

minutia.http

This module is used when working with HTTP/HTTPS links.

| Callable | Description |
|----------|-------------|
| async get(link: str, lang: str = "") | Visit the link and return information about it. If lang is given, it is used instead of the default language set with set_lang. |
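
Because get() is a coroutine, several links can be summarized concurrently. The sketch below makes a few assumptions that this README does not confirm: that concurrent calls are safe, that every call returns a two-item (status, data) pair as in the usage example, and that any status other than "ok" means the data should be skipped. `summarize_all` is a hypothetical helper. It takes the get callable as a parameter so that it can be tried without network access.

```python
import asyncio


async def summarize_all(get, links, lang=""):
    """Fetch all `links` concurrently and return {link: title}.

    `get` is an async callable with the signature of minutia.http.get.
    Links whose status is not "ok" are left out of the result.
    """
    results = await asyncio.gather(*(get(link, lang=lang) for link in links))
    titles = {}
    for link, (status, data) in zip(links, results):
        if status == "ok":
            # "t" is the title key shown in the usage example.
            titles[link] = data.get("t")
    return titles
```

After initializing the library, this could be called as `await summarize_all(minutia.http.get, ["http://hyperimpose.org"])`.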

Logging

minutia uses the logging module to log various events. Everything is logged under the minutia logger.

When the library is imported it might log information about the availability of various features. If you want to capture those you must configure logging in your application before importing minutia.
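
A minimal sketch of that ordering is shown below. The function name, handler and format string are arbitrary choices for the example. Only the logger name "minutia" comes from this document.

```python
import logging


def configure_minutia_logging(level=logging.INFO):
    """Attach a stream handler to the "minutia" logger and set its level."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("minutia")
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Configure logging first ...
configure_minutia_logging()

# ... and only then import the library, so that the messages logged at
# import time are captured:
# import minutia
```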

Developer Notes

The library is built using Flit. It has an extra installation option, dev, to be used during development.

You can set up a development environment with all the dependencies by running the following:

python -m venv venv
source venv/bin/activate
pip install flit
flit install

Project Structure

This project provides both a library for use in other Python projects and standalone services that provide extra features to the library.

/minutia            The library to import in other programs
/services/explicit  The explicit service
/__main__.py        Magic module to start the services from the CLI

License

minutia is licensed under the GNU Affero General Public License version 3 (AGPLv3).

A copy of this license is included in the file COPYING.