New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Variable string encoding for SAC files #1773
Conversation
obspy/io/sac/util.py
Outdated
@@ -171,7 +171,7 @@ def _clean_str(value, strip_whitespace=True): | |||
Trace that the user may have manually added. | |||
""" | |||
try: | |||
value = value.decode('ASCII', 'replace') | |||
value = value.decode(encode, 'replace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a good idea to make this something like..
try:
value = value.decode(encoding, 'strict')
except UnicodeError:
msg = ('Encountered characters that can not be '
'decoded with encoding {}: {}\nThese characters '
'are ignored.'.format(encoding, value))
warnings.warn(msg)
value = value.decode(encoding, 'replace')
except AttributeError:
...
obspy/io/sac/sactrace.py
Outdated
if val.startswith(native_str('-12345')): | ||
val = HD.SNULL | ||
hs[i] = val | ||
hs[i] = val.encode('ascii', 'replace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't follow this logic, why always encode to ASCII when the user might have specified a different encoding?
So right now you''re only handling reading files with a different encoding? I assume the idea is to regard files with an encoding different then ASCII as illegal files and only facilitate reading them while any output will be encoded to ASCII? I can follow that logic but then there should be an warning message I think when a user specifies a different encoding but output is ASCII encoded.. |
Thank @megies for reviewing the pull request. I will answer your two review comments and one last comment together here.
I think that we should always reinforce the alphanumeric idea by using only ASCII encoding. This also helps the encoding problem between py2 and py3. (e.g. You will have an error when decoding special characters encoded in UTF-8 using ASCII, but not the other way around) Since
from Wikipedia
|
I tend to agree with the general idea outlined above. Default:
Spefifying an encoding:
|
By the way, the current testing file |
To reiterate what I said in the issue, changing a string header from unknown to '?????' by default is different behavior from the way the SAC program behaves (which seems to allow arbitrary byte strings to pass through unchanged). This may trip up some users, and I suspect will result in bug reports. Also, I think our current test suite has some assumptions that a file can be read and re-written unchanged. |
Right now byte strings can also not pass through unchanged.. and we never claim to be 100% sac compatible, so I don't see that as a show stopper. Showing meaningful warnings in these exotic and probably sac-format-breaching cases is enough I would say.. |
I don't think so, as it looks like you're calling this string cleanup method that will always decode.. |
If we want to achieve that, we may need a class attribute that store the original headers in bytes. An alternative way to go is to record the user inputted encoding flag and use that when writing to SAC. Either way will involve changing the code structure which I am not comfortable to do as a newcomer to Obspy.
I double we have some testing suite that checks the integrity of SAC file (unchanged after read-write process). My tweaks to the code pass the test properly which clearly does not hold that principle. |
Can anyone give me some advice here? The Travis CI throw some tests fail like:
which I didn't even mess with. All changes I made were in @jkmacc-LANL and @megies, Maybe keep ASCII encoding by default messed some pre-existing assumption? |
@lijunzh That failure may be unrelated. If you see it when you run the test suite on the |
@lijunzh but there's also some "real" fails in io.sac: http://tests.obspy.org/79829/#1 |
In any case, I'm leaving this to @jkmacc-LANL to decide how to tackle this encoding issue in |
Seems to be that the clipping of the edges is slightly different - we can just up the tolerance a tiny bit - no need to recreate the test images IMHO. I'll do it in a separate PR. |
obspy/io/sac/sactrace.py
Outdated
@@ -1100,7 +1100,7 @@ def reftime(self, new_reftime): | |||
# --------------------------- I/O METHODS --------------------------------- | |||
@classmethod | |||
def read(cls, source, headonly=False, ascii=False, byteorder=None, | |||
checksize=False, debug_strings=False): | |||
checksize=False, debug_strings=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using kwargs
- can you please directly specify the argument name in the function definition? This is IMHO much easier to read and understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually easier for me to code. I was trying to keep all the code unchanged until absolutely needed. If you guys have no problem with adding a new parameter to read
in obspy.io.sac.sactrace
and change the calling statement in _internal_read_sac
in obspy.io.sac.core
, I will be more than happy to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great :) Public methods (not starting with underscore) should be backwards compatible but just adding a new argument that defaults to the old behavior of course satisfies that constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I will change the code accordingly.
Can anyone give some insight what this test is for? def test_reftime_incomplete(self):
"""
Replacement for SACTrace._from_arrays doctest which raises UserWarning
"""
sac = SACTrace._from_arrays()
self.assertTrue(sac.lcalda)
self.assertFalse(sac.leven)
self.assertFalse(sac.lovrok)
self.assertFalse(sac.lpspol)
self.assertEqual(sac.iztype, None)
self.assertRaises(SacHeaderTimeError, getattr, sac, 'reftime')
# raises "UserWarning: Reference time information incomplete"
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always', UserWarning)
str(sac)
self.assertEqual(len(w), 1)
self.assertIn("Reference time information incomplete", str(w[0])) My code fails at self.assertEqual(len(w), 1) in which I don't know why we want to keep |
It is probably useful to either jump to the line with a debugger or just print the caught warnings. Maybe some other warning got triggered as well? Tests usually assert the number of caught warnings as they intentionally trigger only one warning. |
I kind of throw multiple warnings for different content when using the "encoding" flag. It will first warn user that even they specify an encoding, it will use ASCII as output. The second one tell user that what characters are replaced by "?" when conversion happens. It that triggers this test to fail, I can merge them. |
This test is to verify that the user gets exceptions when they explicitly request time header information that isn't there, and to warn them with they implicitly request it ( EDIT: Actually, I guess I don't see why this test is failing. |
In fact, when I tested on my machine, I don't see this fail. It correctly delivers only one warning:
More importantly, Travis CI shows that it only fails on py27 and py33. I then created an env for py27 and repeat the test. Here are my results: (py27) lijun@Lijuns-MacBook-Pro:~$ python
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import obspy
>>> sac = obspy.io.sac.sactrace.SACTrace._from_arrays()
>>> sac
SACTrace(lcalda=1, leven=0, lovrok=0, lpspol=0)
>>> str(sac)
/Users/lijun/anaconda3/envs/py27/lib/python2.7/site-packages/obspy/io/sac/sactrace.py:1441: UserWarning: Reference time information incomplete.
warnings.warn(msg)
'Reference Time = XX/XX/XX (XXX) XX:XX:XX.XXXXXX\n\tiztype not set\nlcalda = True\nleven = False\nlovrok = False\nlpspol = False'
>>> import warnings
>>> with warnings.catch_warnings(record=True) as w:
... warnings.simplefilter('always', UserWarning)
...
>>> len(w)
0
>>> type(w)
<type 'list'>
>>> w
[] Apparently, py27 treat @jkmacc-LANL Maybe we should fix this problem in a different PR. You may have more insight in this test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty darn good PR; thanks! I have just a few relatively minor suggestions.
obspy/io/sac/core.py
Outdated
@@ -381,7 +381,12 @@ def _internal_read_sac(buf, headonly=False, debug_headers=False, fsize=True, | |||
:return: A ObsPy Stream object. | |||
""" | |||
# read SAC file | |||
sac = SACTrace.read(buf, headonly=headonly, ascii=False, checksize=fsize) | |||
if 'encoding' in kwargs.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating the SACTrace.read
call in branches, maybe you can just do a single SACTrace.read(..., kwargs.get('encoding', 'ASCII')
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know that we can do this. It will be much cleaner.
obspy/io/sac/util.py
Outdated
@@ -162,7 +162,7 @@ def is_same_byteorder(bo1, bo2): | |||
return (bo1.lower() in le) == (bo2.lower() in le) | |||
|
|||
|
|||
def _clean_str(value, strip_whitespace=True): | |||
def _clean_str(value, encoding='ASCII', strip_whitespace=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the encoding
keyword to the end, to preserve the original ordering of the strip_whitespace
argument.
obspy/io/sac/core.py
Outdated
@@ -381,7 +381,12 @@ def _internal_read_sac(buf, headonly=False, debug_headers=False, fsize=True, | |||
:return: A ObsPy Stream object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encoding
keyword may also need to be documented here? Maybe overkill, but the others are too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have second thoughts here. I don't think we want to make "encoding" flag a standard parameter nor we want to encourage users to use it. Its existence is due to the fact that bad SAC files are created when people passing files from one system to another. When they are facing this encoding problem like I have (i.e. someone maybe read the file in Windows system and didn't encode it correctly before writing it to SAC. That's why I have a lot of characters in CP 1252), we have a backup plan to guide them using an "encoding" flag by throwing a user warning. I did document the "encoding" parameter in obspy.io.sac.sactrace
which I regard as the lowest level that we can hide its existence.
:param encoding: By default, ASCII encoding is used. ASCII-characters
encoded in a different encoding scheme will be converted to ASCII
while other special characters will be replaced by '?'.
:type encoding: str
If you guys think it is better to have it as a standard parameter, I have no problem with it. But it seems to be unnecessary changes to the documents and Obspy functionality that most people will not use nor they want to konw about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thought. I'm OK with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very nice of you. I will go ahead and change the rest of the PR and make a push now.
obspy/io/sac/sactrace.py
Outdated
"All non-ASCII characters will be replaced by " | ||
"?".format(encoding)) | ||
warnings.warn(msg) | ||
val = _ut._clean_str(val, encoding=encoding, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap position of encoding
and strip_whitespace
arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. The parameter positions need to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, the order doesn't technically matter here.
obspy/io/sac/sactrace.py
Outdated
@@ -1231,7 +1248,7 @@ def _from_arrays(cls, hf=None, hi=None, hs=None, data=None): | |||
.. rubric:: Example | |||
|
|||
>>> sac = SACTrace._from_arrays() | |||
>>> print(sac) # doctest: +NORMALIZE_WHITESPACE +SKIP | |||
>>> print(sac) # doctest: +NORMALIZE_WHITESPACE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this test is giving the PR problems, and it was previously skipped, maybe just put the +SKIP
back in and I'll make a new PR for it.
Looking at the reflog you posted, this is the state before the rebase that went wrong, I think: Personally, I would recommend to just raze your branch, start again from current master and manually apply the diff of this page (i.e. the final state):
yeah.. but it's really pretty simple.. just search for lines "checkout" because before rebasing at some point you needed to switch to that branch.. :-) |
@megies I did try to erase the branch and start all over again. But that automatically close my PR and I asked how to reopen it before. I was only able to reopen my PR once I restore my old branch and then the rebase problem came back. |
just do..
|
@megies I did what you said here:
And it says
However, I can't see my changes on EDIT: I realized that I didn't commit the changes which only sit in my stash. I added those changes back and pushed a new commit:
Somehow, the Travis-ci and Circle-ci test were canceled. But the code changes should be fine. They have passed those test multiple times |
I'm on the move now.. maybe somebody on our gitter channel can give further advice, otherwise I'll do that change tomorrow. |
@megies @jkmacc-LANL I think this PR is now ready for merge. There is only one commit from it which capture all changes I made for the minimal-change version. |
@@ -1092,7 +1092,7 @@ def read(cls, source, headonly=False, ascii=False, byteorder=None, | |||
val = _ut._clean_str(val, strip_whitespace=False) | |||
if val.startswith(native_str('-12345')): | |||
val = HD.SNULL | |||
hs[i] = val | |||
hs[i] = val.encode('ASCII', 'replace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me; when reading, why would you want bytes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. hs[i]
was an array of bytes. So, the previous statement will make the implicit conversion which is equivalent to hs[i] = val.encode('ASCII', 'strict')
. However, when header string is not encoded in 'ASCII', it causes a problem and thus raises an exemption. This PR was trying to let the non-ASCII characters pass the read
function for now as '?' and leave the implementation of a 'encoding' flag for the later version as we discussed before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ut._clean_str
produces a str
; encode
goes from str
to bytes
; I think you might have something backwards. Are you testing on Python 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, upon further investigation, this works because SAC I/O is not very good at defining boundaries and does encode/decode a bit too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing I didn't know is that NumPy does an implicit encode
here when assigning to hs[i]
(even on Python 3, probably for backwards compatibility reasons, but it's kind of unfortunate.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree with you. This has given us a lot problems in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@QuLogic Yep.. it's a bit strange that we store it internally as bytes again (in numpy) after first decoding it. But we kind of decided to make this PR a minimal fix and postpone the more major cleanup for later..
The latest Travis build is here but for some reason it's not linking it right. |
So even GitHub is confused here: Maybe a reopen will fix it... |
Bah, still wrong, but at least the merge commit (which Travis and AppVeyor test) is correct:
|
I'm still not sure the existing architecture of |
PS, both docker bots and CircleCI are running on the PR head, whereas Travis and AppVeyor are using the PR merge head. I'm not sure how GitHub got out of sync like this, but that's why they're failing. |
Thank @QuLogic for trying to clear the git problem here. Hopefully, we can merge with those red crosses. The code itself is actually good. It passed those tests before I trying to rebase the git changes. |
So, if I "dismiss" @krischer's review, will that make merging possible? Is everything else good to go? |
Rest of the failing CI is unrelated, so I'm gonna merge this now. Thanks @lijunzh for working on this and uncovering what the actual problem was/is exactly! |
You can just press "merge" and then use your admin privileges to merge it even if some reviews are not yet accepted. Sorry for being a bit slow this week. |
@megies @jkmacc-LANL @krischer @QuLogic Thank you all for helping me through this PR. It was a delightful experience contributing to Obspy community. @jkmacc-LANL Do you want me to start another PR for actually adding 'encoding' flag to SAC read/write functions or you want to do it yourself? |
I can't confirm (obviously), but I think this override switch is only available for members of the "admins" team (which is rather small) but not for the "developers" team (which is huge).. which makes sense kind of.. see https://help.github.com/articles/repository-permission-levels-for-an-organization/ |
@megies Correct, "merge" wasn't highlighted, so I think I needed the help. Thanks for managing that. EDIT: I'll do a new PR for adding the write encoding. I think you and @QuLogic are right, storing string headers in a NumPy array that recasts to bytestrings all the time spreads the encoding/decoding logic around too much. I have some ideas about that, and I can follow up. Thanks! |
@jkmacc-LANL Since we only access the header field by field, why not store it in a list during the read/write process which can easily hold a byte or string. In this way, we only need to encode/decode once during read/write process.
Regarding the 'encoding' flag, I don't think we now have it for either read or write. This PR was only a minimal fix for the current unexpected exception. I can work with you to set a uniform read/write 'encoding' flag. |
I can second that, thanks for your work and.. welcome! :-) |
This is an attempt to add "encoding" flag to obspy.io.sac.satrace.read() function in order to read some SAC that has encoding other than "ASCII".
PR Checklist
CHANGELOG.txt
.CONTRIBUTORS.txt
.Closes #1768.