Skip to content

🌏 Truly Universal Encoding Detection in Golang 🌎

License

Notifications You must be signed in to change notification settings

gonejack/charamel

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Charamel

Charamel is a Go port of the Charamel Python library by AI.

This is a machine learning-based universal character encoding detection library that supports 98 character encoding formats.

Features

  • 🌈 Powered by machine learning
  • πŸ“¦ No external dependencies
  • ⚑ High performance
  • 🐍 Supports all 98 Python encodings
  • 🌍 Works on 60+ languages
  • πŸ”Ž High accuracy

Installation

go get github.com/gonejack/charamel

Usage

Basic Usage

package main

import (
    "fmt"
    "log"
    
    "github.com/gonejack/charamel"
)

func main() {
    // Create detector
    detector, err := charamel.NewDetector(nil, 0.0)
    if err != nil {
        log.Fatal(err)
    }
    
    // Detect encoding
    content := []byte("Hello World")
    encoding := detector.Detect(content)
    if encoding != nil {
        fmt.Printf("Detected encoding: %s\n", *encoding)
    }
}

Multiple Encoding Detection

// Get top 3 most likely encodings with confidences
results := detector.Probe(content, 3)
for _, result := range results {
    fmt.Printf("Encoding: %s, Confidence: %.4f\n", 
        result.Encoding, result.Confidence)
}

Custom Detector

// Detect only specific encodings
encodings := []charamel.Encoding{
    charamel.UTF8,
    charamel.GBK,
    charamel.BIG5,
}

// Set minimum confidence threshold
detector, err := charamel.NewDetector(encodings, 0.7)

API Documentation

Types

Encoding

type Encoding string

Represents character encoding type. Supported encodings include but are not limited to:

  • UTF8
  • GBK
  • BIG5
  • LATIN1
  • CP1252
  • And many more...

ProbeResult

type ProbeResult struct {
    Encoding   Encoding
    Confidence float64
}

Represents the result of encoding detection, containing encoding type and confidence.

Detector

type Detector struct {
    // private fields
}

Universal encoding detector.

Functions

NewDetector(encodings []Encoding, minConfidence float64) (*Detector, error)

Creates a new encoding detector.

Parameters:

  • encodings: List of encodings to support, pass nil to support all encodings
  • minConfidence: Minimum confidence threshold (0.0-1.0)

AllEncodings() []Encoding

Returns a list of all supported encodings.

ParseEncoding(s string) (Encoding, error)

Parses encoding type from string.

Methods

(*Detector) Detect(content []byte) *Encoding

Detects the most probable encoding. Returns nil if no encoding meets the minimum confidence threshold.

(*Detector) Probe(content []byte, top int) []ProbeResult

Detects the top N most probable encodings with their confidences.

Supported Encodings

This library supports 98 character encodings, including:

Unicode

  • UTF-8, UTF-16, UTF-32 (including BE/LE variants)
  • UTF-7, UTF-8-SIG

Chinese Encodings

  • GB2312, GBK, GB18030
  • Big5, Big5-HKSCS
  • HZ

Japanese Encodings

  • Shift_JIS, EUC-JP
  • ISO-2022-JP (various variants)

Korean Encodings

  • EUC-KR, CP949, JOHAB

European Encodings

  • ISO-8859-1 to ISO-8859-16
  • CP1250-CP1258
  • KOI8-R, KOI8-U

Others

  • ASCII
  • Various CP code pages
  • Mac encodings

For the complete list, see the AllEncodings() function.

License

Apache License 2.0

Technical Implementation

  • βœ… Real Data: Uses pre-trained machine learning models from the original Python version
  • βœ… Embedded Resources: Uses go:embed to embed model data into the binary
  • βœ… IEEE 754 Support: Complete implementation of half-precision float parsing
  • βœ… Gzip Decompression: Automatic decompression of model data files
  • βœ… High Performance: Leverages Go's high-performance characteristics
  • βœ… Zero Dependencies: No external dependencies required

Model Data

This library contains the following pre-trained model files:

  • features.gzip: 57,425 byte-level feature mappings
  • biases.gzip: Linear model bias values for 98 encodings
  • weights/: 98 gzip files, one weight file per encoding

Contributing

Pull Requests and Issues are welcome!

About

🌏 Truly Universal Encoding Detection in Golang 🌎

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Go 100.0%