utf8clean is a robust PostgreSQL function that strips out invalid or non-UTF-8 byte sequences from text fields.
It was built to handle messy external data sources (looking at you, Windows-1252 ๐) and ensure stored data is clean, safe, and consistent.
- External vendors and data feeds often deliver text with invalid UTF-8 characters.
- Standard PostgreSQL conversions can fail or truncate.
- This function inspects each byte, validating against the UTF-8 spec, and only passes allowed sequences.
- Written in PL/pgSQL.
- Strict adherence to UTF-8 rules (including multi-byte sequences).
- Preserves newline and carriage return characters.
- Easy to integrate into
INSERT/UPDATEqueries. - Works on large datasets.
Run the utf8clean.sql in your PostgreSQL instance
Just wrap strings for insert/update in utf8clean(), for example:
update some_table set some_column = utf8clean('some naughty string');