
Commit

Fixed an issue when handling CJK characters
Saiya authored and Saiya committed Nov 26, 2015
1 parent 54d70b4 commit 027360a
Showing 4 changed files with 51 additions and 6 deletions.
4 changes: 3 additions & 1 deletion lib/truncate.coffee
@@ -45,7 +45,7 @@ truncate = (html, length, options)->
# <p>Lorem ipsum <span>dolor sit</span> amet, consectetur</p>
# tempor incididunt ut labore
#
- $ = cheerio.load "<div>#{html}</div>"
+ $ = cheerio.load "<div>#{html}</div>", decodeEntities: options.decodeEntities
$html = $('div').first()

# remove excludes elements
@@ -100,6 +100,8 @@ truncate.defaultOptions =
stripTags: false
# postfix of the string
ellipsis: '...'
+ # decode html entities
+ decodeEntities: false
# excludes: img
# length: 0

7 changes: 5 additions & 2 deletions lib/truncate.js

This is a generated file, so its diff is not rendered by default.

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "truncate-html",
- "version": "0.0.5",
+ "version": "0.0.6",
"description": "truncate html and keep tags in safe",
"main": "lib/truncate.js",
"scripts": {
44 changes: 42 additions & 2 deletions readme.md
@@ -15,14 +15,16 @@ truncate(html, [length], [options])
stripTags: Boolean, whether to remove tags
ellipsis: String, custom ellipsis sign, set it to empty string to remove the ellipsis postfix
excludes: String or Array, the selectors of the elements you want to ignore
+ decodeEntities: Boolean, auto decode html entities in the html string
}
```

### Default options
```js
truncate.defaultOptions = {
stripTags: false,
- ellipsis: '...'
+ ellipsis: '...',
+ decodeEntities: false
};
```

@@ -35,7 +37,6 @@ npm install truncate-html
**Notice** Extra whitespace in the html content will be removed. If the html string's content is shorter than `options.length`, no ellipsis is appended to the final html string. If it is longer, the final html content's length will be `options.length` plus the length of `options.ellipsis`.



```js
var truncate = require('truncate-html');

@@ -91,6 +92,45 @@ truncate(html, {
});
// returns: This is a string for~


// handling encoded characters
var html = '<p>&nbsp;test for &lt;p&gt; encoded string</p>'
truncate(html, {
length: 20,
decodeEntities: true
});
// returns: <p> test for &lt;p&gt; encode...</p>

// when decodeEntities is set to false
var html = '<p>&nbsp;test for &lt;p&gt; encoded string</p>'
truncate(html, {
length: 20,
decodeEntities: false // this is the default value
});
// returns: <p>&nbsp;test for &lt;p...</p>


// setting `decodeEntities` to true may give a surprising result when handling CJK characters
var html = '<p>&nbsp;test for &lt;p&gt; 中文 string</p>'
truncate(html, {
length: 20,
decodeEntities: true
});
// returns: <p> test for &lt;p&gt; &#x4E2D;&#x6587; str...</p>
// to work around this, see "Known issues" below

```

### Known issues
Handling CJK characters when the option `decodeEntities` is set to `true`.

When `decodeEntities` is `true`, encoded html entities are decoded automatically before truncating, so `&amp;` is counted as a single character. This is probably what you want. However, if the html string contains CJK characters, they are replaced by numeric character references such as `&#x4E2D;` in the final html you get, which can be confusing.
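
The counting difference described above can be illustrated without the library. The following is only a sketch: `decodeBasicEntities` and its small entity table are hypothetical helpers written for this example, not part of truncate-html or cheerio, and they cover just a handful of named entities.

```javascript
// Hypothetical helper for illustration only; not part of truncate-html or cheerio.
// A deliberately tiny entity table (real decoders know many more names).
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', nbsp: '\u00A0' };

function decodeBasicEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference: hexadecimal (&#x..;) or decimal (&#..;)
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Named reference: unknown names are left as they are
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

const raw = 'a &amp; b';
console.log(raw.length);                      // 9: "&amp;" counts as five characters
console.log(decodeBasicEntities(raw).length); // 5: "&" counts as one character
```

With a limit of `length: 20`, text counted the first way reaches the limit sooner, which is why the two examples above are cut at different points.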

To work around the CJK issue, you have two choices:

- keep the option `decodeEntities` set to `false`; `&amp;` will then be counted as five characters.
- modify cheerio's source code: find the function `getInverse` in the file `./node_modules/cheerio/node_modules/entities/lib/encode.js` and comment out its last line, `.replace(re_nonASCII, singleCharReplacer);`.
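
A third possibility, not taken from the library's documentation, is to leave cheerio untouched and post-process the returned string. The sketch below assumes the output uses hexadecimal references of the form `&#x4E2D;`, as in the example above. `restoreNonAscii` is a hypothetical helper name, and it turns only non-ASCII references back into characters, so escaped markup such as `&lt;` stays escaped.

```javascript
// Hypothetical post-processing helper; not part of truncate-html or cheerio.
function restoreNonAscii(html) {
  return html.replace(/&#[xX]([0-9a-fA-F]+);/g, (match, hex) => {
    const codePoint = parseInt(hex, 16);
    // Keep ASCII references (e.g. &#x3C; for "<") and invalid values escaped
    if (codePoint <= 0x7f || codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

// Input copied from the decodeEntities: true example above
const truncated = '<p> test for &lt;p&gt; &#x4E2D;&#x6587; str...</p>';
console.log(restoreNonAscii(truncated));
// prints: <p> test for &lt;p&gt; 中文 str...</p>
```

This keeps the character counting that `decodeEntities: true` provides and needs no edits inside `node_modules`, at the cost of an extra pass over the output.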



