unzalgo

Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.

Installation

$ npm i -D unzalgo

About

You can use unzalgo to both detect Zalgo text and transform it back into normal text without breaking internationalization. For example, you could transform:

T͘H͈̩̬̺̩̭͇I͏̼̪͚̪͚S͇̬̺ ́E̬̬͈̮̻̕V҉̙I̧͖̜̹̩̞̱L͇͍̝ ̺̮̟̙̘͎U͝S̞̫̞͝E͚̘͝R IṊ͍̬͞P̫Ù̹̳̝͓̙̙T̜͕̺̺̳̘͝

into

THIS EVIL USER INPUT

while also keeping

thiŝ te̅xt unchanged, since some lângûaĝes aĉtuallŷ uŝe thêse sŷmbo̅ls,

and, at the same time, keep all diacritics in

Z nich ovšem pouze předposlední sdílí s výše uvedenou větou příliš žluťoučký kůň úpěl […]

which remains unchanged after a transformation.

Is there a demo?

Yes! You can check it out here. You can edit the text at the top; the lower part shows the result of cleaning it with the default threshold.

How does it work?

In Unicode, every character is assigned to a character category. Zalgo text uses characters that belong to the categories Mn (Mark, Nonspacing) or Me (Mark, Enclosing).
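To make the two categories concrete, here is a minimal sketch that picks out Mn and Me code points using JavaScript's Unicode property escapes. The helper name `listCombiningMarks` is made up for this illustration and is not part of unzalgo.

```javascript
// Matches a single code point in category Mn (Mark, Nonspacing) or Me (Mark, Enclosing).
const COMBINING_MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper (not part of unzalgo): list the Mn/Me code points
// found in a string as "U+XXXX" labels.
function listCombiningMarks(text) {
  return [...text] // iterate by code point, not by UTF-16 unit
    .filter((ch) => COMBINING_MARK.test(ch))
    .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(listCombiningMarks("e\u0301")); // "e" + COMBINING ACUTE ACCENT (category Mn)
console.log(listCombiningMarks("a\u20DD")); // "a" + COMBINING ENCLOSING CIRCLE (category Me)
console.log(listCombiningMarks("plain"));   // no combining marks at all
```

Note that a precomposed character such as "é" (U+00E9) is a letter, not a mark; its accent only shows up as a separate Mn code point after decomposition, for example with `"é".normalize("NFD")`.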

First, the text is divided into words; each word is then assigned a score that reflects how heavily it uses characters from the categories above, combined with a small amount of statistics. If the score exceeds a threshold, the text is detected as Zalgo text, which allows us to strip away all characters from those categories.
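The steps above can be sketched roughly as follows. This is an illustration, not unzalgo's actual source: the function names `scoreWord` and `cleanSketch`, the whitespace-based word splitting, the NFD/NFC normalization and the per-word `>=` comparison are all assumptions, and the statistical step mentioned above is left out.

```javascript
const MARK = /[\p{Mn}\p{Me}]/u;    // one Mn or Me code point
const MARKS = /[\p{Mn}\p{Me}]+/gu; // runs of Mn or Me code points

// Assumption: a word's score is the share of its code points
// (after NFD decomposition) that are Mn or Me marks.
function scoreWord(word) {
  const codePoints = [...word.normalize("NFD")];
  if (codePoints.length === 0) return 0;
  return codePoints.filter((ch) => MARK.test(ch)).length / codePoints.length;
}

// Assumption: words are whitespace-delimited and each word is judged on its own.
function cleanSketch(text, threshold = 0.55) {
  return text
    .split(/(\s+)/) // the capture group keeps the whitespace runs in the result
    .map((part) =>
      scoreWord(part) >= threshold
        ? part.normalize("NFD").replace(MARKS, "").normalize("NFC")
        : part
    )
    .join("");
}

console.log(cleanSketch("t\u0301\u0302\u0303h\u0304\u0305\u0306 is fine")); // first word is 6 marks out of 8 code points
console.log(cleanSketch("fran\u00e7ais")); // 1 mark out of 9 code points after decomposition
```

Scoring each word separately is what lets a heavily decorated word be stripped while an ordinary accented word in the same sentence keeps its diacritics.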

Getting started

import assert from "assert";
import { clean, isZalgo } from "unzalgo";
/* Regular cleaning */
assert(clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋") === "this");
/* Clean only if there are no "normal" characters in the word (t, h, i and s are "normal") */
assert(clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋", 1) === "ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋");
/* Clean only if there is at least one combining character  */
assert(clean("français", 0) === "francais");
/* "français" is not a Zalgo text, of course */
assert(isZalgo("français") === false);
/* Unless you define the Zalgo property as containing combining characters */
assert(isZalgo("français", 0) === true);
/* You can also define the Zalgo property as consisting of nothing but combining characters */
assert(isZalgo("français", 1) === false);

Threshold

Both clean and isZalgo accept an optional threshold argument that controls how sensitive unzalgo is. The threshold is a number between 0 and 1 and defaults to 0.55.

A threshold of 0 indicates that a string should be classified as Zalgo text if at least 0% of its codepoints have the Unicode category Mn or Me.

A threshold of 1 indicates that a string should be classified as Zalgo text if at least 100% of its codepoints have the Unicode category Mn or Me.
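The two endpoints can be checked by hand with a small sketch. It reads "at least" literally as `ratio >= threshold` and decomposes the string with NFD before counting; both are assumptions, as are the helper names `markRatio` and `meetsThreshold`, and none of this is unzalgo's own code.

```javascript
const MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper: share of code points (after NFD decomposition)
// whose Unicode category is Mn or Me.
function markRatio(text) {
  const codePoints = [...text.normalize("NFD")];
  return codePoints.length === 0
    ? 0
    : codePoints.filter((ch) => MARK.test(ch)).length / codePoints.length;
}

// Assumption: "at least" is read literally as ratio >= threshold.
function meetsThreshold(text, threshold) {
  return markRatio(text) >= threshold;
}

const samples = {
  mixed: "fran\u00e7ais",          // 1 mark out of 9 code points after decomposition
  marksOnly: "\u0301\u0302\u0303", // nothing but combining marks
};

for (const [name, text] of Object.entries(samples)) {
  console.log(
    name,
    markRatio(text).toFixed(2),
    meetsThreshold(text, 0),    // threshold 0
    meetsThreshold(text, 0.55), // default threshold
    meetsThreshold(text, 1)     // threshold 1
  );
}
```

Under this reading, a word that mixes letters and marks only ever reaches a threshold of 1 if it contains no letters at all, which is why the default sits in between.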

Exports

clean(string, threshold) [default export]

Removes all Zalgo text characters for every "likely Zalgo" word in string. Returns a representation of string without Zalgo text.

computeScores(string)

Computes a score ∈ [0, 1] for every word in the input string. Each score represents the ratio of Zalgo characters to total characters in a word.

isZalgo(string, threshold)

Returns true if string is a Zalgo text, else false.
