New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file encoding to configure_file #3135

Merged
merged 4 commits into from Jun 9, 2018

Conversation

Projects
None yet
6 participants
@infirit
Contributor

infirit commented Feb 25, 2018

meson should not make the assumption that all files are encoded in utf-8 or isolatin1 when passed to configure_file. The new keyword will allow the user to provide the encoding for the the input and output file. It will default to utf-8 so all current code will keep working.

In #1542 people suggest to detect the encoding. This may work for some file but in the end this is nothing more than an educated guess. Meson should not guess if it can ask the user.

Fixes #1542

@jpakkane

This comment has been minimized.

Member

jpakkane commented Feb 27, 2018

What is the specific use case for this? Under which circumstances would one need to generate a header file with mixed encoding?

@infirit

This comment has been minimized.

Contributor

infirit commented Feb 27, 2018

What is the specific use case for this?

Configure any file that is not utf-8 or isolatin1. See #1542 for two examples where it actually causes an error. Worse case is that the input is decoded incorrectly and you end up corrupting the output.

file with mixed encoding

It's not mixed, in my example the whole file is encoded koi8-r. It just so happens the other characters are in the same place as isolatin1.

Here is an example where things go horribly wrong without an error.

test_kanji = '準拠'
>>> test_kanji.encode('ISO 2022-JP')
b'\x1b$B=`5r\x1b(B'
>>> iso_22022_jp = test_kanji.encode('ISO 2022-JP')
>>> iso_22022_jp.decode('utf-8')
'\x1b$B=`5r\x1b(B'
>>> iso_22022_jp.decode('ISO 2022-JP')
'準拠'
@infirit

This comment has been minimized.

Contributor

infirit commented Feb 27, 2018

btw, decoding from bytes to code-points is not strictly necessary. You can also read the input as bytes and call replace() with bytes as argument.

binary_input = b'My test @to be replaced@'
>>> binary_input.replace(b'@to be replaced@', b'has been replaced')
b'My test has been replaced'

But if you choose to decode knowing the input encoding is an absolute requirement.

edit: But this moves the problem as the argument to replace then needs to be encoded.

@NickeZ

This comment has been minimized.

Contributor

NickeZ commented Mar 1, 2018

Could we add a copyonly/binary mode as well and fix #860 at the same time? Only supporting textmode/UTF-8 is a real limitation.

edit: Actually, maybe this should be a switch on project instead. @infirit are all your files in koi8-r?. Or maybe both... Wouldn't a switch on project solve most of the cases? I can imagine that if you are writing a project that has to handle multiple encodings it could be nice to have multiple encodings per configure_file() as well. But that seems like a very niche project. I think the "binary" mode is more important so that one can copy files without overhead.

@hardening

This comment has been minimized.

Contributor

hardening commented Mar 1, 2018

@NickeZ do you have a practical case in mind, as usually configure files (at least in autotools and cmake) are always text files.

@NickeZ

This comment has been minimized.

Contributor

NickeZ commented Mar 1, 2018

Most of my use cases configure_data() is empty. So the overhead of parsing the text is unnecessary:

$ rg -A4 configure_file
src/tools/test/meson.build
17:    configure_file(
18-        input: ''.join([t, '.plt']),
19-        output: ''.join([t, '.t']),
20-        configuration: configuration_data(),
21-    )
src/libCom/flex/meson.build
35:#flex_static = configure_file(
36-#    command: [cp, '@INPUT@', '@OUTPUT@'],
37-#    input: 'flex.skel.static',
38-#    output: 'flex.skel.static'
39-#)
src/libCom/env/meson.build
6:configure_file(
7-    input: 'envDefs.h',
8-    output: 'envDefs.h',
9-    configuration: configuration_data()
10-)
src/libCom/test/meson.build
7:configure_file(
8-    input: 'epicsUnitTestTest.plt',
9-    output: 'epicsUnitTestTest.t',
10-    configuration: configuration_data(),
11-)
@nirbheek

This comment has been minimized.

Member

nirbheek commented Mar 1, 2018

As a start, we could optimize the case where the configuration data object is empty and copy without substitution.

@NickeZ

This comment has been minimized.

Contributor

NickeZ commented Mar 1, 2018

I guess configuration is a required parameter today. It could also be made optional, and if it is missing the default behavior is to copy the file.

@infirit

This comment has been minimized.

Contributor

infirit commented Mar 1, 2018

This is a bit out of scope for this PR but.

If meson is going to allow just copying with configure_file I would suggest to instead add something to the api. Calling configure_file then configure nothing is weird and confusing.

@NickeZ, I stumbled upon #1542 by accident. koi8-r was just first encoding that I thought off to use as a test that would actually error. Not all decoding causes errors like I showed with ISO-2022-JP.

@NickeZ

This comment has been minimized.

Contributor

NickeZ commented Mar 1, 2018

Sorry yeah i went a bit off-topic. But what do you think about adding this as a parameter on the whole project instead? Surely you don't want to have mixed encodings in a project?

@infirit

This comment has been minimized.

Contributor

infirit commented Mar 1, 2018

But what do you think about adding this as a parameter on the whole project instead?

Why would you limit your users like that? If the user cares the user will convert everything to utf-8 if possible, but sometimes it can not be done for everything.

Surely you don't want to have mixed encodings in a project?

Why not, there may be legacy or regulatory requirements for things to be done in a certain non utf-8 way. I have worked for a company ~6years ago that needed to setup an edi between them an a transport company. The transport company only accepted EDIFACT which has encodings but definitely not utf-8. The point is that utf-8 is not widely adopted still to this day in many areas of daily life.

@ePirat

This comment has been minimized.

Contributor

ePirat commented Mar 17, 2018

Surely you don't want to have mixed encodings in a project?

For example at VLC we had NSIS installer language files that all needed to be in different encodings (Windows codepages) depending on the language. We never (luckily) had to replace any value in those using the buildsystem, but it is a case that certainly could happen at some point.

@infirit

This comment has been minimized.

Contributor

infirit commented Mar 22, 2018

Resolved conflict. Whats the word on this, merge reject?

@jpakkane

This comment has been minimized.

Member

jpakkane commented Mar 25, 2018

If meson is going to allow just copying with configure_file I would suggest to instead add something to the api.

This has actually been discussed several times, specifically by adding a function copy_to_builddir or somesuch. It's just that no-one has yet had the time and energy to implement this and create a MR.

@infirit

This comment has been minimized.

Contributor

infirit commented Apr 13, 2018

Fixed conflicts

@infirit

This comment has been minimized.

Contributor

infirit commented May 22, 2018

Fixed conflicts again

@infirit

This comment has been minimized.

Contributor

infirit commented May 23, 2018

One of the jobs failed to download qt and failed, the others completed.

@nirbheek

This comment has been minimized.

Member

nirbheek commented May 23, 2018

One of the jobs failed to download qt and failed, the others completed.

I restarted the job.

@jpakkane

This comment has been minimized.

Member

jpakkane commented May 24, 2018

The functionality itself is ok, there is just a few open questions about supportability and the like.

Specifically, what should we say about supported encodings and how they are named? Is it ok to say that we support ASCII, utf-8, iso-latin-1 and "possibly others" or should we commit to something more?

This also needs a test for the case when the user tries to create a file with content that can not be expressed in the given encoding. Such as trying to write Ä in an ASCII file. Meson should abort with a hard error in this case.

In addition:

  • since this is new functionality, there should be a release note snippet
  • the reference manual should mention that this kwarg is only available since 0.47.0
@infirit

This comment has been minimized.

Contributor

infirit commented May 24, 2018

Specifically, what should we say about supported encodings and how they are named? Is it ok to say that we support ASCII, utf-8, iso-latin-1 and "possibly others" or should we commit to something more?

Supported is basically anything that python3 supports which is pretty much all of them. Thinking about it, I do think people should be encouraged to use utf-8.

I'll update the docs in the relevant place(s) along the lines of below, hopefully this covers it. Please suggest additional bits and/or resources to include if you have them.

The default meson file encoding to configure files is utf-8. If you need to configure files that are not
utf-8 encoded the encoding keyword will allow you to specify which file encoding to use. It is however
strongly advised to convert your non utf-8 file to utf-8 whenever possible. Supported file encodings are
those of python3, see https://docs.python.org/3/library/codecs.html#standard-encodings

It may also be worth to add a meaningful error message on UnicodeEncodeError. I'll experiment with it a bit and see if it is actually useful.

I'll update the PR with the test, snippet and mention the meson version over the weekend.

@infirit

This comment has been minimized.

Contributor

infirit commented May 27, 2018

This also needs a test for the case when the user tries to create a file with content that can not be expressed in the given encoding. Such as trying to write Ä in an ASCII file. Meson should abort with a hard error in this case.

configure_file currently will raise a MesonException and the test I added will fail (as it should).

How do I handle this in the meson.build file? I do not see a way to use test on configure_file directly ad mark it should fail. Is there another way for me to mark it as "should fail" in meson.build?

@jpakkane

This comment has been minimized.

Member

jpakkane commented May 27, 2018

Put the test in a subdirectory under test cases/failing. The rest will happen automatically.

@infirit

This comment has been minimized.

Contributor

infirit commented May 27, 2018

Thanks. Added docs, snippet and test for the encoding keyword now succeed locally. Now waiting for travis and appveyor to finish.

@jpakkane

This comment has been minimized.

Member

jpakkane commented May 29, 2018

The test fails because of the wrong reason:

ERROR: First statement must be a call to project
@jpakkane

This comment has been minimized.

Member

jpakkane commented Jun 1, 2018

Once the test is fixed to fail due to the correct reason (wrong encoding rather than missing a call to project() at the beginning) this is ok to merge.

@nirbheek nirbheek self-assigned this Jun 1, 2018

@nirbheek

This comment has been minimized.

Member

nirbheek commented Jun 7, 2018

You will also need to resolve conflicts now, since other features were merged that touch the same code.

infirit added some commits Feb 25, 2018

Add file encoding to configure_file
Input files can be in any file encoding, not just utf-8 or isolatin1. Meson
should not make assumptions here and allow for the user to specify the
encoding to use.
Add new encoding keyword for configure_file to manual
Also add a section how to deal with file encodings.
@infirit

This comment has been minimized.

Contributor

infirit commented Jun 8, 2018

Resolved conflicts and fixed test.

@jpakkane

This comment has been minimized.

Member

jpakkane commented Jun 9, 2018

The failure text is now:

meson.build:5:0: ERROR:  Could not write output file /Users/jpakkane/encoding/build/config9.h: 'ascii' codec can't encode character '\u0434' in position 17: ordinal not in range(128)

Which looks to be ok.

@nirbheek nirbheek merged commit ec616c2 into mesonbuild:master Jun 9, 2018

3 checks passed

ci/sideci No issues found!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment