encoding unicode values - best practice? #65

Open
kevinmarks opened this Issue Mar 18, 2016 · 5 comments

kevinmarks commented Mar 18, 2016

If I start with

<div class="h-entry"><span class="p-name">Entity &mdash; emdash</span></div>
<div class="h-entry"><span class="p-name">unicode — emdash</span></div>

I get

{"rels": {}, 
"items": 
  [{"type": ["h-entry"], "properties": {"name": ["Entity \u2014 emdash"]}}, 
  {"type": ["h-entry"], "properties": {"name": ["unicode \u2014 emdash"]}}], 
"rel-urls": {}}

with the em dash serialized as a \u2014 escape sequence.
If we passed ensure_ascii=False to json.dumps() we'd get

{"rels": {},
"items": 
  [{"type": ["h-entry"], "properties": {"name": ["Entity — emdash"]}}, 
  {"type": ["h-entry"], "properties": {"name": ["unicode — emdash"]}}],  
"rel-urls": {}}

Would that be more normal json? What is good practice here?
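For reference, a minimal sketch of the two serializations using only the standard library `json` module. The `parsed` dict is copied by hand from the output above rather than produced by calling mf2py, so treat it as an illustration, not as the parser's actual code path. Both forms are valid JSON and decode to the same data; the difference is only in the bytes on the wire.

```python
import json

# Hand-copied from the parser output above (assumption: not generated by mf2py here).
parsed = {
    "rels": {},
    "items": [
        {"type": ["h-entry"], "properties": {"name": ["Entity \u2014 emdash"]}},
    ],
    "rel-urls": {},
}

escaped = json.dumps(parsed)                      # default: ensure_ascii=True
literal = json.dumps(parsed, ensure_ascii=False)  # keeps U+2014 as a literal character

print(escaped)
print(literal)

# Both serializations decode to the same Python object.
assert json.loads(escaped) == json.loads(literal) == parsed

# Only the literal form contains non-ASCII characters, so it must be
# written out with an encoding such as UTF-8.
assert escaped.isascii() and not literal.isascii()
```

A consumer that uses a conforming JSON parser should not be able to tell the two apart after decoding, so the choice mostly affects readability and output size.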

kevinmarks Mar 18, 2016

If I make it an e- field instead, we still get \u encoding in the html, which seems off:

<div class="h-entry"><span class="e-name">Entity &mdash; emdash</span></div>
<div class="h-entry"><span class="e-name">unicode — emdash</span></div>

becomes

{"rels": {}, 
"items":
   [{"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "Entity \u2014 emdash", 
      "value": "Entity \u2014 emdash"}]}}, 
  {"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "unicode \u2014 emdash", 
      "value": "unicode \u2014 emdash"}]}}], 
"rel-urls": {}}

Is having \u escaped text in the HTML field a good idea?

kylewm Apr 23, 2016

I guess my expectation is that the HTML property would preserve the original markup, i.e. continue to include the entity &mdash;

I'd vote not to force the result to ASCII anymore, because every system we expect to use mf2py supports UTF-8, and we don't want to subject our Russian friends to

"content": [{
  "html": "\n<p>\u0430 \u043f\u0440\u044f\u043c\u043e\u0433\u043e \u0438\u0437 \u041c\u0421\u041a \u0432 \u0442\u043e\u0447\u043a\u0443 \u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u044f \u043d\u0435 \u0431\u044b\u043b\u043e?</p>\n",
   "value": "\n\u0430 \u043f\u0440\u044f\u043c\u043e\u0433\u043e \u0438\u0437 \u041c\u0421\u041a \u0432 \u0442\u043e\u0447\u043a\u0443 \u043d\u0430\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u044f \u043d\u0435 \u0431\u044b\u043b\u043e?\n"
}]

so I propose

<div class="h-entry"><span class="e-name">Entity &mdash; emdash</span></div>

should be parsed as

{"rels": {}, 
"items":
   [{"type": ["h-entry"], "properties": 
    {"name": 
      [{"html": "Entity &mdash; emdash", 
      "value": "Entity — emdash"}]}}], 
"rel-urls": {}}
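A rough sketch of what that proposal amounts to for a single property, using the standard library `html.unescape`. This is not mf2py's implementation: `source_markup` and `prop` are illustrative names, and the input is assumed to contain no child tags, so decoding entities is enough to get the plain-text value.

```python
import html
import json

# Inner HTML exactly as the author wrote it (assumption: no nested tags).
source_markup = "Entity &mdash; emdash"

prop = {
    "html": source_markup,                  # proposal: keep the entity untouched
    "value": html.unescape(source_markup),  # plain text: entity decoded to U+2014
}

# ensure_ascii=False so the decoded character is emitted literally, not as \u2014.
print(json.dumps(prop, ensure_ascii=False))
```

With real markup the "value" would come from the element's text content rather than from unescaping the raw HTML string.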

kevinmarks Apr 25, 2016

Can you clarify that 'Russian' example? They use KOI8-r or utf8, don't they?
The JSON output is in utf8, surely?
You can't be utf8 and preserve source encoding.
Oh, hang on, utf9 was a typo, and I think we're mostly agreeing.
I think removing HTML safe entity encoding (apart from < > etc) is worth doing for the sake of uniformity.
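One way that alternative could look, as a sketch only: decode every character reference except the ones that are significant to markup. `decode_safe_entities`, `KEEP` and `ENTITY` are hypothetical names introduced here, not part of mf2py, and the regex handles only well-formed references that end in a semicolon.

```python
import html
import re

# Named references that must stay escaped so the fragment still parses the same way.
KEEP = {"lt", "gt", "amp", "quot", "apos"}

# Decimal, hexadecimal, or named character reference terminated by ';'.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_safe_entities(fragment: str) -> str:
    """Decode character references unless they stand for markup characters."""

    def repl(match: re.Match) -> str:
        if match.group(1) in KEEP:
            return match.group(0)
        decoded = html.unescape(match.group(0))
        # Numeric or alternate spellings such as &#60; also resolve to markup
        # characters, so leave those escaped as well.
        if decoded in ("<", ">", "&", '"', "'"):
            return match.group(0)
        return decoded

    return ENTITY.sub(repl, fragment)


print(decode_safe_entities("a &lt; b &mdash; c &#60; d"))
```

The exception list is the cost of this approach: the decoder has to know which references are unsafe to expand.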

kylewm Apr 25, 2016

Heh, yeah UTF-9 was an unfortunate typo. Wish GitHub would wait a tick before sending the email notification...

And yep I'm agreeing with you, except I think we should leave html entities as-is in the "html" output (precisely because there are exceptions like &lt; and &gt;, easier to just treat everything the same)

kartikprabhu Mar 11, 2018

phpmf2 also encodes the &mdash; entity as \u2014. So at the moment this seems fine.
