Skip to content

struct-tag preprocessing and validation for CSV/TSV/LTSV, Parquet, Excel.

License

Notifications You must be signed in to change notification settings

nao1215/fileprep

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

fileprep

Go Reference Go Report Card MultiPlatformUnitTest Coverage

日本語 | Español | Français | 한국어 | Русский | 中文

fileprep-logo

fileprep is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, and zstd streams.

Why fileprep?

I developed nao1215/filesql, which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. I also created nao1215/csv for CSV file validation.

While studying machine learning, I realized: "If I extend nao1215/csv to support the same file formats as nao1215/filesql, I could combine them to perform ETL-like operations." This idea led to the creation of fileprep—a library that bridges data preprocessing/validation with SQL-based file querying.

Features

  • Multiple file format support: CSV, TSV, LTSV, Parquet, Excel (.xlsx)
  • Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst)
  • Name-based column binding: Fields auto-match snake_case column names, customizable via name tag
  • Struct tag-based preprocessing (prep tag): trim, lowercase, uppercase, default values
  • Struct tag-based validation (validate tag): required, and more
  • Seamless filesql integration: Returns io.Reader for direct use with filesql
  • Detailed error reporting: Row and column information for each error

Installation

go get github.com/nao1215/fileprep

Requirements

  • Go Version: 1.24 or later
  • Operating Systems:
    • Linux
    • macOS
    • Windows

Quick Start

package main

import (
    "fmt"
    "os"
    "strings"

    "github.com/nao1215/fileprep"
)

// User represents a user record with preprocessing and validation
type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase"`
    Age   string
}

func main() {
    csvData := `name,email,age
  John Doe  ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var users []User

    reader, result, err := processor.Process(strings.NewReader(csvData), &users)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

    for _, user := range users {
        fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
    }

    // reader can be passed directly to filesql
    _ = reader
}

Output:

Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"

Preprocessing Tags (prep)

Multiple tags can be combined: prep:"trim,lowercase,default=N/A"

Basic Preprocessors

Tag Description Example
trim Remove leading/trailing whitespace prep:"trim"
ltrim Remove leading whitespace prep:"ltrim"
rtrim Remove trailing whitespace prep:"rtrim"
lowercase Convert to lowercase prep:"lowercase"
uppercase Convert to uppercase prep:"uppercase"
default=value Set default if empty prep:"default=N/A"

String Transformation

Tag Description Example
replace=old:new Replace all occurrences prep:"replace=;:,"
prefix=value Prepend string to value prep:"prefix=ID_"
suffix=value Append string to value prep:"suffix=_END"
truncate=N Limit to N characters prep:"truncate=100"
strip_html Remove HTML tags prep:"strip_html"
strip_newline Remove newlines (LF, CRLF, CR) prep:"strip_newline"
collapse_space Collapse multiple spaces into one prep:"collapse_space"

Character Filtering

Tag Description Example
remove_digits Remove all digits prep:"remove_digits"
remove_alpha Remove all alphabetic characters prep:"remove_alpha"
keep_digits Keep only digits prep:"keep_digits"
keep_alpha Keep only alphabetic characters prep:"keep_alpha"
trim_set=chars Remove specified characters from both ends prep:"trim_set=@#$"

Padding

Tag Description Example
pad_left=N:char Left-pad to N characters prep:"pad_left=5:0"
pad_right=N:char Right-pad to N characters prep:"pad_right=10: "

Advanced Preprocessors

Tag Description Example
normalize_unicode Normalize Unicode to NFC form prep:"normalize_unicode"
nullify=value Treat specific string as empty prep:"nullify=NULL"
coerce=type Type coercion (int, float, bool) prep:"coerce=int"
fix_scheme=scheme Add or fix URL scheme prep:"fix_scheme=https"
regex_replace=pattern:replacement Regex-based replacement prep:"regex_replace=\\d+:X"

Validation Tags (validate)

Multiple tags can be combined: validate:"required,email"

Basic Validators

Tag Description Example
required Field must not be empty validate:"required"
boolean Must be true, false, 0, or 1 validate:"boolean"

Character Type Validators

Tag Description Example
alpha ASCII alphabetic characters only validate:"alpha"
alphaunicode Unicode letters only validate:"alphaunicode"
alphaspace Alphabetic characters or spaces validate:"alphaspace"
alphanumeric ASCII alphanumeric characters validate:"alphanumeric"
alphanumunicode Unicode letters or digits validate:"alphanumunicode"
numeric Valid integer validate:"numeric"
number Valid number (integer or decimal) validate:"number"
ascii ASCII characters only validate:"ascii"
printascii Printable ASCII characters (0x20-0x7E) validate:"printascii"
multibyte Contains multibyte characters validate:"multibyte"

Numeric Comparison Validators

Tag Description Example
eq=N Value equals N validate:"eq=100"
ne=N Value not equals N validate:"ne=0"
gt=N Value greater than N validate:"gt=0"
gte=N Value greater than or equal to N validate:"gte=1"
lt=N Value less than N validate:"lt=100"
lte=N Value less than or equal to N validate:"lte=99"
min=N Value at least N validate:"min=0"
max=N Value at most N validate:"max=100"
len=N Exactly N characters validate:"len=10"

String Validators

Tag Description Example
oneof=a b c Value is one of the allowed values validate:"oneof=active inactive"
lowercase Value is all lowercase validate:"lowercase"
uppercase Value is all uppercase validate:"uppercase"
eq_ignore_case=value Case-insensitive equality validate:"eq_ignore_case=yes"
ne_ignore_case=value Case-insensitive not equal validate:"ne_ignore_case=no"

String Content Validators

Tag Description Example
startswith=prefix Value starts with prefix validate:"startswith=http"
startsnotwith=prefix Value does not start with prefix validate:"startsnotwith=_"
endswith=suffix Value ends with suffix validate:"endswith=.com"
endsnotwith=suffix Value does not end with suffix validate:"endsnotwith=.tmp"
contains=substr Value contains substring validate:"contains=@"
containsany=chars Value contains any of the chars validate:"containsany=abc"
containsrune=r Value contains the rune validate:"containsrune=@"
excludes=substr Value does not contain substring validate:"excludes=admin"
excludesall=chars Value does not contain any of the chars validate:"excludesall=<>"
excludesrune=r Value does not contain the rune validate:"excludesrune=$"

Format Validators

Tag Description Example
email Valid email address validate:"email"
uri Valid URI validate:"uri"
url Valid URL validate:"url"
http_url Valid HTTP or HTTPS URL validate:"http_url"
https_url Valid HTTPS URL validate:"https_url"
url_encoded URL encoded string validate:"url_encoded"
datauri Valid data URI validate:"datauri"
uuid Valid UUID validate:"uuid"

Network Validators

Tag Description Example
ip_addr Valid IP address (v4 or v6) validate:"ip_addr"
ip4_addr Valid IPv4 address validate:"ip4_addr"
ip6_addr Valid IPv6 address validate:"ip6_addr"
cidr Valid CIDR notation validate:"cidr"
cidrv4 Valid IPv4 CIDR validate:"cidrv4"
cidrv6 Valid IPv6 CIDR validate:"cidrv6"
fqdn Valid fully qualified domain name validate:"fqdn"
hostname Valid hostname (RFC 952) validate:"hostname"
hostname_rfc1123 Valid hostname (RFC 1123) validate:"hostname_rfc1123"
hostname_port Valid hostname:port validate:"hostname_port"

Cross-Field Validators

Tag Description Example
eqfield=Field Value equals another field validate:"eqfield=Password"
nefield=Field Value not equals another field validate:"nefield=OldPassword"
gtfield=Field Value greater than another field validate:"gtfield=MinPrice"
gtefield=Field Value >= another field validate:"gtefield=StartDate"
ltfield=Field Value less than another field validate:"ltfield=MaxPrice"
ltefield=Field Value <= another field validate:"ltefield=EndDate"
fieldcontains=Field Value contains another field's value validate:"fieldcontains=Keyword"
fieldexcludes=Field Value excludes another field's value validate:"fieldexcludes=Forbidden"

Supported File Formats

Format Extension Compressed
CSV .csv .csv.gz, .csv.bz2, .csv.xz, .csv.zst
TSV .tsv .tsv.gz, .tsv.bz2, .tsv.xz, .tsv.zst
LTSV .ltsv .ltsv.gz, .ltsv.bz2, .ltsv.xz, .ltsv.zst
Excel .xlsx .xlsx.gz, .xlsx.bz2, .xlsx.xz, .xlsx.zst
Parquet .parquet .parquet.gz, .parquet.bz2, .parquet.xz, .parquet.zst

Note on Parquet compression: The external compression (.parquet.gz, etc.) is for the container file itself. Parquet files may also use internal compression (Snappy, GZIP, LZ4, ZSTD) which is handled transparently by the parquet-go library.

Note on Excel files: Only the first sheet is processed. Multi-sheet workbooks will have subsequent sheets ignored.

Integration with filesql

// Process file with preprocessing and validation
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var records []MyRecord

reader, result, err := processor.Process(file, &records)
if err != nil {
    return err
}

// Check for validation errors
if result.HasErrors() {
    for _, e := range result.ValidationErrors() {
        log.Printf("Row %d, Column %s: %s", e.Row, e.Column, e.Message)
    }
}

// Pass preprocessed data to filesql using Builder pattern
ctx := context.Background()
builder := filesql.NewBuilder().
    AddReader(reader, "my_table", filesql.FileTypeCSV)

validatedBuilder, err := builder.Build(ctx)
if err != nil {
    return err
}

db, err := validatedBuilder.Open(ctx)
if err != nil {
    return err
}
defer db.Close()

// Execute SQL queries on preprocessed data
rows, err := db.QueryContext(ctx, "SELECT * FROM my_table WHERE age > 20")

Design Considerations

Name-Based Column Binding

Struct fields are mapped to file columns by name, not by position. Field names are automatically converted to snake_case to match CSV column headers:

// File columns: user_name, email_address, phone_number (any order)
type User struct {
    UserName     string  // → matches "user_name" column
    EmailAddress string  // → matches "email_address" column
    PhoneNumber  string  // → matches "phone_number" column
}

Column order doesn't matter - fields are matched by name, so you can reorder columns in your CSV without changing your struct.

Custom Column Names with name Tag

Use the name tag to override the auto-generated column name:

type User struct {
    UserName string `name:"user"`       // → matches "user" column (not "user_name")
    Email    string `name:"mail_addr"`  // → matches "mail_addr" column (not "email")
    Age      string                     // → matches "age" column (auto snake_case)
}

Missing Columns Behavior

If a CSV column doesn't exist for a struct field, the field value is treated as an empty string. Validation still runs, so required will catch missing columns:

type User struct {
    Name    string `validate:"required"`  // Error if "name" column is missing
    Country string                        // Empty string if "country" column is missing
}

Case Sensitivity and Duplicate Headers

Header matching is case-sensitive and exact. A struct field UserName maps to user_name, but headers like User_Name, USER_NAME, or userName will not match:

type User struct {
    UserName string  // ✓ matches "user_name"
                     // ✗ does NOT match "User_Name", "USER_NAME", "userName"
}

This applies to all file formats: CSV, TSV, LTSV keys, and Parquet/XLSX column names must match exactly.

Duplicate column names: If a file contains duplicate header names (e.g., id,id,name), the first occurrence is used for binding:

id,id,name
first,second,John  → struct.ID = "first" (first "id" column wins)

Format-Specific Notes

LTSV, Parquet, and XLSX follow the same case-sensitive matching rules. Keys/column names must match exactly:

type Record struct {
    UserID string                 // expects "user_id" key/column
    Email  string `name:"EMAIL"`  // use name tag for non-snake_case columns
}

If your LTSV keys use hyphens (user-id) or Parquet/XLSX columns use camelCase (userId), use the name tag to specify the exact column name.

Memory Usage

fileprep loads the entire file into memory for processing. This enables random access and multi-pass operations but has implications for large files:

File Size Approx. Memory Recommendation
< 100 MB ~2-3x file size Direct processing
100-500 MB ~500 MB - 1.5 GB Monitor memory, consider chunking
> 500 MB > 1.5 GB Split files or use streaming alternatives

For compressed inputs (gzip, bzip2, xz, zstd), memory usage is based on decompressed size.

Benchmark target available: Run make bench to measure processing performance on your system.

Related or inspired Projects

  • nao1215/filesql - sql driver for CSV, TSV, LTSV, Parquet, Excel with gzip, bzip2, xz, zstd support.
  • nao1215/csv - read csv with validation and simple DataFrame in golang.
  • go-playground/validator - Go Struct and Field validation, including Cross Field, Cross Struct, Map, Slice and Array diving
  • shogo82148/go-header-csv - go-header-csv is encoder/decoder csv with a header.

Contributing

Contributions are welcome! Please see the Contributing Guide for more details.

Support

If you find this project useful, please consider:

  • Giving it a star on GitHub - it helps others discover the project
  • Becoming a sponsor - your support keeps the project alive and motivates continued development

Your support, whether through stars, sponsorships, or contributions, is what drives this project forward. Thank you!

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

struct-tag preprocessing and validation for CSV/TSV/LTSV, Parquet, Excel.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published