Charamel is an AI-generated Go port of the Charamel Python library.
This is a machine learning-based universal character encoding detection library that supports 98 character encoding formats.
- Powered by machine learning
- No external dependencies
- High performance
- Supports all 98 Python encodings
- Works on 60+ languages
- High accuracy
Install:

    go get github.com/gonejack/charamel

Basic usage:

    package main

    import (
        "fmt"
        "log"

        "github.com/gonejack/charamel"
    )

    func main() {
        // Create detector
        detector, err := charamel.NewDetector(nil, 0.0)
        if err != nil {
            log.Fatal(err)
        }

        // Detect encoding
        content := []byte("Hello World")
        encoding := detector.Detect(content)
        if encoding != nil {
            fmt.Printf("Detected encoding: %s\n", *encoding)
        }
    }

    // Get top 3 most likely encodings with confidences
    results := detector.Probe(content, 3)
    for _, result := range results {
        fmt.Printf("Encoding: %s, Confidence: %.4f\n",
            result.Encoding, result.Confidence)
    }

    // Detect only specific encodings
    encodings := []charamel.Encoding{
        charamel.UTF8,
        charamel.GBK,
        charamel.BIG5,
    }

    // Set minimum confidence threshold
    detector, err := charamel.NewDetector(encodings, 0.7)

    type Encoding string

Represents a character encoding type. Supported encodings include but are not limited to:
- UTF8
- GBK
- BIG5
- LATIN1
- CP1252
- And many more...
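
To illustrate how a string-based encoding type can be paired with name lookup, here is a minimal standalone sketch. It does not use the library: the constant values, the `known` table, and the `lookupEncoding` helper are all hypothetical, and the library's real string values and parsing rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Encoding mirrors the string-based type described above.
type Encoding string

// The string values below are assumptions for illustration only.
const (
	UTF8   Encoding = "utf_8"
	GBK    Encoding = "gbk"
	BIG5   Encoding = "big5"
	LATIN1 Encoding = "latin_1"
	CP1252 Encoding = "cp1252"
)

// known is a hypothetical lookup table keyed by normalized name.
var known = map[string]Encoding{
	"utf_8":   UTF8,
	"gbk":     GBK,
	"big5":    BIG5,
	"latin_1": LATIN1,
	"cp1252":  CP1252,
}

// lookupEncoding is a hypothetical helper: it trims and lowercases the
// name, maps '-' to '_', and reports whether the result is known.
func lookupEncoding(name string) (Encoding, bool) {
	key := strings.ReplaceAll(strings.ToLower(strings.TrimSpace(name)), "-", "_")
	enc, ok := known[key]
	return enc, ok
}

func main() {
	for _, name := range []string{"UTF-8", "gbk", "klingon"} {
		if enc, ok := lookupEncoding(name); ok {
			fmt.Printf("%q -> %s\n", name, enc)
		} else {
			fmt.Printf("%q is not recognized\n", name)
		}
	}
}
```

Because the type is a plain string, values print and compare without any conversion, which is presumably why the examples above can pass an encoding straight to `%s`.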
    type ProbeResult struct {
        Encoding   Encoding
        Confidence float64
    }

Represents the result of encoding detection, containing the encoding type and its confidence.
    type Detector struct {
        // private fields
    }

Universal encoding detector.
NewDetector: Creates a new encoding detector.
Parameters:
- encodings: List of encodings to support; pass nil to support all encodings
- minConfidence: Minimum confidence threshold (0.0-1.0)

AllEncodings: Returns a list of all supported encodings.

Parses an encoding type from a string.

Detect: Detects the most probable encoding. Returns nil if no encoding meets the minimum confidence threshold.

Probe: Detects the top N most probable encodings with their confidences.
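
The "top N" and "nil below the threshold" behavior described above can be sketched without the library. The types below are local mirrors of `Encoding` and `ProbeResult`; `topN` and `best` are hypothetical helpers, and both the `>=` comparison and applying the threshold to the top-N list are assumptions rather than documented behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// Local mirrors of the documented types, so the sketch is self-contained.
type Encoding string

type ProbeResult struct {
	Encoding   Encoding
	Confidence float64
}

// topN keeps results at or above minConfidence, orders them by
// descending confidence, and returns at most n of them.
func topN(results []ProbeResult, n int, minConfidence float64) []ProbeResult {
	kept := make([]ProbeResult, 0, len(results))
	for _, r := range results {
		if r.Confidence >= minConfidence {
			kept = append(kept, r)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return kept[i].Confidence > kept[j].Confidence
	})
	if n < 0 {
		n = 0
	}
	if n < len(kept) {
		kept = kept[:n]
	}
	return kept
}

// best returns a pointer to the top encoding, or nil when no result
// reaches minConfidence.
func best(results []ProbeResult, minConfidence float64) *Encoding {
	top := topN(results, 1, minConfidence)
	if len(top) == 0 {
		return nil
	}
	return &top[0].Encoding
}

func main() {
	results := []ProbeResult{
		{Encoding: "gbk", Confidence: 0.20},
		{Encoding: "utf_8", Confidence: 0.75},
		{Encoding: "big5", Confidence: 0.05},
	}
	for _, r := range topN(results, 2, 0.1) {
		fmt.Printf("Encoding: %s, Confidence: %.4f\n", r.Encoding, r.Confidence)
	}
	if enc := best(results, 0.9); enc == nil {
		fmt.Println("no encoding met the threshold")
	}
}
```

Returning a pointer lets the caller tell "no confident answer" apart from any valid encoding value, which is why callers should check for nil before dereferencing.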
This library supports 98 character encodings, including:
- UTF-8, UTF-16, UTF-32 (including BE/LE variants)
- UTF-7, UTF-8-SIG
- GB2312, GBK, GB18030
- Big5, Big5-HKSCS
- HZ
- Shift_JIS, EUC-JP
- ISO-2022-JP (various variants)
- EUC-KR, CP949, JOHAB
- ISO-8859-1 to ISO-8859-16
- CP1250-CP1258
- KOI8-R, KOI8-U
- ASCII
- Various CP code pages
- Mac encodings
For the complete list, see the AllEncodings() function.
Apache License 2.0
- Real Data: Uses pre-trained machine learning models from the original Python version
- Embedded Resources: Uses go:embed to embed model data into the binary
- IEEE 754 Support: Complete implementation of half-precision float parsing
- Gzip Decompression: Automatic decompression of model data files
- High Performance: Leverages Go's high-performance characteristics
- Zero Dependencies: No external dependencies required
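
As a rough illustration of the gzip and half-precision points above, the following standard-library sketch gunzips a byte stream and decodes it as IEEE 754 binary16 values. It is not the library's code: the function names are hypothetical, and the little-endian, back-to-back layout of the values is an assumption about the file format.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"io"
	"math"
)

// halfToFloat32 converts an IEEE 754 binary16 bit pattern to float32.
func halfToFloat32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1F
	frac := uint32(h) & 0x3FF

	switch {
	case exp == 0 && frac == 0: // signed zero
		return math.Float32frombits(sign)
	case exp == 0: // subnormal: value is frac * 2^-24
		v := float32(frac) * float32(math.Pow(2, -24))
		if sign != 0 {
			v = -v
		}
		return v
	case exp == 0x1F: // infinity or NaN
		return math.Float32frombits(sign | 0x7F800000 | frac<<13)
	default: // normal: rebias the exponent from 15 to 127
		return math.Float32frombits(sign | (exp+112)<<23 | frac<<13)
	}
}

// decodeGzippedHalfs gunzips data and decodes the payload as consecutive
// half-precision values. Little-endian byte order is an assumption.
func decodeGzippedHalfs(data []byte) ([]float32, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	if len(raw)%2 != 0 {
		return nil, fmt.Errorf("payload has odd length %d", len(raw))
	}

	out := make([]float32, len(raw)/2)
	for i := range out {
		out[i] = halfToFloat32(binary.LittleEndian.Uint16(raw[2*i:]))
	}
	return out, nil
}

// gzipBytes compresses raw in memory so the example needs no files.
func gzipBytes(raw []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(raw)
	zw.Close()
	return buf.Bytes()
}

func main() {
	// 1.0, -2.0 and 0.5 as little-endian binary16 values.
	packed := gzipBytes([]byte{0x00, 0x3C, 0x00, 0xC0, 0x00, 0x38})
	vals, err := decodeGzippedHalfs(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(vals)
}
```

In a real build the compressed bytes would come from an embedded file rather than `gzipBytes`; the in-memory helper only keeps the sketch runnable on its own.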
This library contains the following pre-trained model files:
- features.gzip: 57,425 byte-level feature mappings
- biases.gzip: Linear model bias values for 98 encodings
- weights/: 98 gzip files, one weight file per encoding
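
To show how features, biases, and per-encoding weights could fit together in a linear model, here is a toy sketch. Everything in it is an assumption for illustration: the two-byte features, the tiny hand-written parameters, and the softmax normalization are not taken from the library, whose actual feature extraction and scoring are not described here.

```go
package main

import (
	"fmt"
	"math"
)

// model holds toy parameters with the same three parts as the data files.
type model struct {
	features map[string]int       // byte sequence -> feature index
	biases   map[string]float64   // encoding name -> bias
	weights  map[string][]float64 // encoding name -> weight per feature
}

// scores starts each encoding at its bias and adds the matching weight
// for every known two-byte sequence found in content.
func (m model) scores(content []byte) map[string]float64 {
	out := make(map[string]float64, len(m.biases))
	for enc, b := range m.biases {
		out[enc] = b
	}
	for i := 0; i+2 <= len(content); i++ {
		idx, ok := m.features[string(content[i:i+2])]
		if !ok {
			continue
		}
		for enc, w := range m.weights {
			out[enc] += w[idx]
		}
	}
	return out
}

// softmax rescales raw scores so that they are positive and sum to 1.
func softmax(scores map[string]float64) map[string]float64 {
	peak := math.Inf(-1)
	for _, s := range scores {
		if s > peak {
			peak = s
		}
	}
	out := make(map[string]float64, len(scores))
	var sum float64
	for k, s := range scores {
		e := math.Exp(s - peak) // subtract the peak for numerical stability
		out[k] = e
		sum += e
	}
	for k := range out {
		out[k] /= sum
	}
	return out
}

// toyModel returns hand-written parameters for two encodings.
func toyModel() model {
	return model{
		features: map[string]int{"\xc4\xe3": 0, "\xe4\xbd": 1},
		biases:   map[string]float64{"gbk": 0, "utf_8": 0},
		weights: map[string][]float64{
			"gbk":   {2.0, -1.0},
			"utf_8": {-1.0, 2.0},
		},
	}
}

func main() {
	// The bytes C4 E3 BA C3 are "你好" encoded as GBK.
	conf := softmax(toyModel().scores([]byte{0xC4, 0xE3, 0xBA, 0xC3}))
	fmt.Printf("gbk=%.4f utf_8=%.4f\n", conf["gbk"], conf["utf_8"])
}
```

A linear model of this shape needs only table lookups and additions per input position, which fits the one-weight-file-per-encoding layout listed above.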
Pull Requests and Issues are welcome!