Consider normalisation to NFKC in latex/Unicode completer #10673

Open
Carreau opened this Issue Jun 24, 2017 · 4 comments

2 participants
Owner

Carreau commented Jun 24, 2017

Thanks @grumpfou for the report. There is a weird interaction between our latex completer and the normalisation of identifiers:

[image: unnamed]

Even weirder, there are three distinct "epsilon-like" glyphs involved above:

[image: screen shot 2017-06-24 at 00 22 23]

This implies you can't assume that the following does not raise:

s = 'ɛ'; exec('dict(%s=1)["%s"]' % (s,s))

(works with "varepsilon", but not "epsilon")
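To make the failure concrete, here is a minimal sketch of the same round trip written with explicit escapes. It assumes the two glyphs in question are U+03F5 (lunate epsilon) and U+03B5 (small epsilon); which latex name the completer maps to which code point is an assumption on my side, and `roundtrip` is just a helper name for this illustration.

```python
import unicodedata

lunate = "\u03f5"  # ϵ GREEK LUNATE EPSILON SYMBOL (assumed to be what \epsilon inserts)
plain = "\u03b5"   # ε GREEK SMALL LETTER EPSILON (assumed to be what \varepsilon inserts)

# NFKC folds the lunate form into the plain one.
assert unicodedata.normalize("NFKC", lunate) == plain


def roundtrip(name):
    """Bind `name` as a keyword argument, then look it up using the same text.

    Returns None when the lookup raises KeyError, i.e. when the parser
    normalised the identifier but the string literal kept the raw glyph.
    """
    try:
        return eval('dict(%s=1)["%s"]' % (name, name))
    except KeyError:
        return None


print(roundtrip(plain))   # the key matches the string literal
print(roundtrip(lunate))  # the keyword was normalised, the string literal was not
```

The identifier is normalised by the parser while the string literal keeps the raw glyph, so the two no longer compare equal.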

I'm tempted to think that we should insert the NFKC form when completing so that, at least, the identifier you complete to and the identifier which is actually generated are the same, and users are prevented from entering invalid code that is auto-normalized. Typically you would expect the following to return 14:
[image: screen shot 2017-06-24 at 07 30 38]

Though, as \phi gets normalized to \varphi, you bound the same name twice, which is awfully confusing.
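A small sketch of that double binding, assuming the two glyphs are U+03D5 (phi symbol) and U+03C6 (small phi); this is not the code from the screenshot, only an illustration of the same effect:

```python
ns = {}
# Two assignments that look like two different variables on screen.
exec("\u03d5 = 1\n\u03c6 = 2\n", ns)

# Only one name survives, because the parser normalises U+03D5 to U+03C6.
names = [k for k in ns if k != "__builtins__"]
print(names, ns["\u03c6"])
```

The second assignment silently overwrites the first, since both spellings end up as the same identifier.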

In the meantime, I know that some people are using the latex completer to actually write in docstrings and I'm unsure if it's ok to break that.

If we normalize, this also means that some notebooks that have \phi in the code will keep working with a glyph which is implicitly normalized and can't be typed anymore, which I'm not a fan of.

AFAICT there is no way in Python to get a warning, as the normalisation is done during parsing. str.isidentifier has no way to tell us that non-NFKC characters are used, but there is a bug open for that. There is also no way to compile without normalisation, nor to get a warning if a non-normalized identifier is present; I'm unsure an enhancement request for that would be accepted.
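As a possible workaround on our side, a check could be sketched on top of the standard tokenize module, which (as far as I can tell) reports NAME tokens with their raw spelling, before the parser normalises them. `non_nfkc_names` is a hypothetical helper, not an existing IPython or CPython function, and it assumes the source tokenizes cleanly.

```python
import io
import tokenize
import unicodedata


def non_nfkc_names(source):
    """Return (raw, normalised, line_number) for each NAME token that NFKC would change."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.NAME:
            continue  # strings, comments and operators are left alone
        normalised = unicodedata.normalize("NFKC", tok.string)
        if normalised != tok.string:
            found.append((tok.string, normalised, tok.start[0]))
    return found


# U+03F5 is used as an identifier on line 2; U+03D5 only appears inside a string.
source = 'x = 1\n\u03f5 = 2\ns = "\u03d5"\n'
report = non_nfkc_names(source)
print(report)
```

Only the identifier is reported; the glyph inside the string literal is not, which is the distinction a warning would need.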

Owner

takluyver commented Jun 24, 2017

If Python normalises them in identifiers, +1 to doing it pre-emptively in our completer to avoid the confusion.
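Roughly, doing it pre-emptively could look like the sketch below. The table is a tiny illustrative stand-in for the completer's real symbol table (the four code points are my assumption), and `complete_latex` is a hypothetical name, not the actual IPython function.

```python
import unicodedata

# Illustrative subset only; the real completer table is much larger.
LATEX_SYMBOLS = {
    "\\epsilon": "\u03f5",     # ϵ
    "\\varepsilon": "\u03b5",  # ε
    "\\phi": "\u03d5",         # ϕ
    "\\varphi": "\u03c6",      # φ
}


def complete_latex(name, normalise=True):
    """Return the glyph for a latex name, optionally in NFKC form.

    Returns None for unknown names.
    """
    glyph = LATEX_SYMBOLS.get(name)
    if glyph is not None and normalise:
        glyph = unicodedata.normalize("NFKC", glyph)
    return glyph


print(complete_latex("\\epsilon"), complete_latex("\\epsilon", normalise=False))
```

With normalisation on, \epsilon and \varepsilon complete to the same glyph, so what you see is what Python will use as the identifier. The `normalise` switch is there because of the docstring use case mentioned above.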

Owner

Carreau commented Jul 6, 2017

After discussing with @takluyver: we might be able to do something with tokenize.

Owner

takluyver commented Jul 6, 2017

In theory, I think the completer could tokenize the text up to the cursor, work out if it's in a string, and do the normalisation if not. Any non-ascii character outside a string must be part of an identifier, I think (or a syntax error). It would be fiddly to get right, though.
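A very rough sketch of that idea, to show where the fiddliness comes from. Both function names are hypothetical. It is deliberately conservative: anything that fails to tokenize cleanly before the cursor (for example an unterminated string) is treated as "do not normalise". How tokenize reports incomplete input differs between Python versions, so treat this as a starting point only.

```python
import io
import tokenize
import unicodedata

# Tokens that carry no text of their own and should not count as "the last token".
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER}


def ends_in_identifier(text_before_cursor):
    """Guess whether the text up to the cursor ends inside a NAME token."""
    last = None
    try:
        for tok in tokenize.generate_tokens(io.StringIO(text_before_cursor).readline):
            if tok.type == tokenize.ERRORTOKEN:
                return False  # e.g. a stray or unterminated quote on older Pythons
            if tok.type not in _LAYOUT:
                last = tok
    except (tokenize.TokenError, SyntaxError):
        pass  # incomplete input: fall back to the last token seen before the error
    return (last is not None
            and last.type == tokenize.NAME
            and text_before_cursor.endswith(last.string))


def insert_glyph(text_before_cursor, glyph):
    """Append `glyph`, normalised to NFKC only when it lands in an identifier."""
    if ends_in_identifier(text_before_cursor + glyph):
        glyph = unicodedata.normalize("NFKC", glyph)
    return text_before_cursor + glyph


print(insert_glyph("x = ", "\u03f5"))    # identifier position
print(insert_glyph('s = "', "\u03f5"))   # inside an unterminated string
```

Comments are treated like strings here (no normalisation), since the last token is then a COMMENT rather than a NAME.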

Owner

Carreau commented Jul 6, 2017

> In theory, I think the completer could tokenize the text up to the cursor, work out if it's in a string, and do the normalisation if not. Any non-ascii character outside a string must be part of an identifier, I think (or a syntax error). It would be fiddly to get right, though.

Yes, and that would still not really help with visually different identifiers in different cells. I believe jedi also has an "am I in a string" utility, so we could try to use that as well.
