Where is the best place to discuss usage/interpretation/best practices? #25
Here 😄
Can you reproduce the figure with a signed exponent for the floating-point representation? I.e. in your notebook

```julia
vals = var.second[:,:]
signed_exponent!(vals)          # add this line
a = bitinformation(vals, dims=1)
```

With the standard biased exponent, all eight exponent bits flip between values on either side of 2:

```julia
julia> bitstring.([1.9f0,2.1f0],:split)
2-element Vector{String}:
 "0 01111111 11100110011001100110011"
 "0 10000000 00001100110011001100110"
```

which is prevented with

```julia
julia> bitstring.(signed_exponent([1.9f0,2.1f0]),:split)
2-element Vector{String}:
 "0 00000000 11100110011001100110011"
 "0 00000001 00001100110011001100110"
```

For more information see here.
Regarding SST, and other data that include some form of missing values: this is something that gave me a lot of headache in the past, the reason being that you don't want to calculate the mutual information across boundaries. Ideally, one would use some form of space-filling curve to avoid the missing values in the data, but that's a bigger project. Technically, it should also be possible in the future to pass a masked array to bitinformation.
@milankl, ah that makes sense! I hadn't thought about the missing data in some fields -- SOILMOISTURE is missing over the ocean, SST is missing over the land, etc. Okay, I will revamp my notebook to select regions for variables that don't have missing values. Do you have thoughts on how to best calculate the keepbits for each variable when you have 40 years of simulation? I only used a single time step of model output for the analysis above, so I was thinking of making a more robust assessment by running BitInformation on increasing numbers of time steps chosen randomly from the simulation until I get a stable distribution of the keepbits value for each variable, then picking the mode or the highest value from the distribution to use as the keepbits value for compression. Or maybe it's smarter to average all the bitinformation from multiple time steps before determining the keepbits? Or is there a different recommended approach?
Yes, you can do that. However, it might be overkill. I found the bitwise information fairly robust in general, such that I'd probably start by looking at a time step at the beginning of the 40 years and one at the end. Then maybe also check how the information changes in the different dimensions; we did this analysis for temperature in Fig. S10. This highly depends on the resolution in each dimension, but I usually found longitude to have the highest correlation. Let me know if yours differs.
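To make that comparison concrete, here is a hedged sketch: the helper name `keepbits_estimate` and the 99% bookkeeping are mine, not the package API, and the `bitinformation` call just follows the signature used earlier in this thread. It could be run on a time step from the start and one from the end of the simulation and the two estimates compared.

```julia
using BitInformation

# Estimate how many mantissa bits preserve `level` of the total bitwise
# information of a Float32 array, analysed along dimension `dims`.
function keepbits_estimate(A::AbstractArray{Float32}; dims=1, level=0.99)
    ic = bitinformation(A, dims=dims)   # information content per bit position
    c  = cumsum(ic) ./ sum(ic)          # cumulative fraction of total information
    lastbit = findfirst(>=(level), c)   # last bit position needed for `level`
    lastbit === nothing && return 23    # keep the full mantissa if never reached
    return max(lastbit - 9, 0)          # subtract sign bit + 8 exponent bits
end

# e.g. compare keepbits_estimate(first_step) with keepbits_estimate(last_step)
```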
@rsignell-usgs The bitinformation method is applicable for relative-precision trimming and relies on two assumptions that unfortunately were not put upfront in the paper: roughly, that the uncertainty in the data scales with the magnitude of the values (i.e. is multiplicative), and that the statistical properties of the field are uniform, in particular with no missing values (@milankl please correct me if I am wrong). The former assumption is violated for winds, and the second one for SST with missing values. Both are violated for the cloud quantities from the paper, causing an absolutely unrealistic estimate of 1 mantissa bit per value needed. As a practical solution for winds, temperatures and a few other quantities I use trimming that keeps absolute precision. Please let me know if you need any further info on that or on lossy compression of WRF data (we did it for a four-year hourly dataset).
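For illustration only, trimming to absolute precision can be as simple as the following minimal sketch; the helper name `trim_absolute` and the quantum `q` are made up here, and this is not the actual tooling used for the WRF archive mentioned above.

```julia
# Round every value to the nearest multiple of the quantum `q`, which bounds
# the absolute error by q/2, e.g. q = 1/128 for the precision quoted above.
trim_absolute(A::AbstractArray, q::Real) = q .* round.(A ./ q)

trim_absolute([271.31, 0.004, -3.27], 1/128)   # values snapped to 1/128 steps
```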
This is correct when used with floats, but due to floats and not due to the method itself. You can use other bitencodings like linear quantisation and calculate the bitinformation based on those. This will bound the max abs error. (I can't remember the compression factors, but they are presumably similar, given that the lossless compression will compress away the unused exponent bits in the Float32 encoding.)
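For illustration only, a hand-rolled linear quantisation could look like this; `linquant16` is a made-up helper, not the package's own quantisation API.

```julia
# Map values onto 2^16 equidistant steps between the array's min and max;
# the absolute error is then bounded by half a step (assumes hi > lo).
function linquant16(A::AbstractArray{<:AbstractFloat})
    lo, hi = extrema(A)
    scale = (2^16 - 1) / (hi - lo)
    return round.(UInt16, (A .- lo) .* scale)
end
```

The bitwise information could then be analysed on that UInt16 encoding rather than on the float encoding, assuming bitinformation accepts unsigned-integer arrays.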
I've always struggled to find a geophysical variable with additive noise but a wide dynamic range. I assumed that this is due to the scale-invariance of the Navier-Stokes equations, such that any error would scale as well. You mention winds, but why do you consider them to have additive noise? Thanks for the input, I think fundamental answers here are very important to take away some of the subjectivity that has always gone into choices of data compression!
Yeah, for temperature in Kelvin chopping off mantissa bits preserves absolute precision as long as all your values are in [256K,512K), because floats are linearly distributed within a power of two. For the same number of mantissa bits, values in [128K,256K) will have twice the absolute precision compared to [256K,512K), etc.
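A quick REPL check of the spacing argument (plain Julia, no package functions; 300 and 150 just stand in for temperatures in the two binades):

```julia
julia> eps(300f0)   # Float32 spacing in [256,512)
3.0517578f-5

julia> eps(150f0)   # Float32 spacing in [128,256), half as large
1.5258789f-5
```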
Yes, but note that what non-uniform statistical properties mean for the bitwise information content (given a bitwise encoding!) is not trivial. For example, a change in the mean (e.g. going from ˚C to K) will flag different mantissa bits with information when using floats, but will not have an impact when using linear quantisation as the bitencoding. On the other hand, a scaling (e.g. going from micro- to milli-some unit) will not affect the bitwise information in float encoding (and is bitwise reproducible if the scaling is a power of two). In that sense, a change in variance might not classify as a non-uniform statistical property. But hey, really understanding what these mean is very important and I'm glad that people ask these questions, because I see them as essential going forward!
Yes, I haven't yet implemented a bitpair count (i.e. counting 00, 01, 10 and 11 in adjacent grid points) that allows for a mask so as not to count bitpairs across boundaries. Technically that is possible. However, @rkouznetsov, on the discussion on non-uniformity, note that nothing stops one from preserving different mantissa bits for, say, different regions of a 2D field, or different vertical levels, in the round+lossless framework. While this opens up a whole new door that I didn't want to address yet, it's no problem to choose different keepbits for different subsets of an array.
@rkouznetsov and @milankl thanks for taking the time to help inform me and the community about how this all works. I think what got me really excited was the possibility to achieve much more compression in a scientifically defensible way, and this discussion seems key. @rkouznetsov I absolutely would like to see/hear more about what you did with WRF output. Is it okay to discuss this here @milankl, or should we start a new issue?
Happy to keep this open; if there's more on a given topic that risks other points being lost here, feel free to branch off into another issue.
@rsignell-usgs, @milankl I am not sure if "O3 mixing ratio field keeping 99% of information" (or should one keep 99.9%?) is any more defensible than "accurate within 1/128 or 0.0625 ppb". At least I know how to interpret the latter for any specific application. Known error bars are better than potentially smaller but unknown ones. Making the precision truncation separately for levels/subdomains would make the error quantification even more complicated. @rsignell-usgs, WRF processing is indeed off-topic here, but I see no better place to put the reply on it. Here is what I use for WRF archiving for SILAM use. It gets a bit more than 12x compression, though with the omission of a few variables, which you might wish to keep. All uncertainties are well quantified and are rather conservative according to an expert* (i.e. my :) opinion/experience). You might wish to adjust it here and there though:
baa=5 is to ensure that "bit rounding" is used. The default "bit adjustment algorithm" of nco has been changed a few times during the last 5 years. Any feedback on this set of precisions would be welcome (personally, or publicly). *Expert: a person who shows slides. :)
You can always communicate the error (abs/rel) as you prefer. The 99% level is just guidance hopefully applicable to a wide range of variables, which then translates to a relative error (for floats/log-quantisation) or an absolute error (linear quantisation, or floats within a power of two) that you can communicate explicitly as 'accurate within 1/128', but that information is also stored implicitly if all but the first 8 mantissa bits are zero.
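If useful, the implicit bound can be checked empirically. This is a hedged sketch, assuming that round(A, keepbits) is the round-to-nearest bit rounding used elsewhere in this thread:

```julia
using BitInformation

A  = 1000f0 .* rand(Float32, 1_000_000) .+ 1f0   # strictly positive test data
kb = 7                                           # keepbits
Ar = round(A, kb)                                # bit rounding to 7 mantissa bits
maximum(abs.(Ar .- A) ./ A) <= 2.0^(-kb - 1)     # relative error <= 2^-8, expect true
```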
Yes, that's why I'm afraid of using it. But then you agree that the issues around non-uniform statistics of a field aren't actually a drawback of the method, but rather reflect underlying properties of the variable?
I would not say so for physical quantities, but I would argue that the majority of model quantities do have both additive and multiplicative errors. Consider e.g. clwc from a model that has two modules, cloud evaporation-condensation (microphysics) and spectral advection (transport), and uses process splitting. The microphysics module (I assume it is coded properly) would set clwc to zero everywhere where rh < 100%. Advection would create some numerical noise over the whole domain due to the finite precision of a Fourier transform. If the output is taken after the microphysics step, the domain would be full of zeros except for some small areas with smooth fields. The bitinformation ("real") of such a field will be quite high. If the output is taken after transport, the field will have a lot of entropy in all bits on average, and a low bitinformation. Note that in both cases it is the same model, and both ways of output are perfectly valid and match observations equally well.
The difference between 0.01 m/s and 0.0001 m/s (a factor of 100) is negligible, while 10 m/s and 15 m/s differ quite substantially. Therefore, the additive model is more adequate here, if one has to choose. There seems to be a false contradiction between floating point with multiplicative noise/error and linear quantization with additive noise/error. So I consider bitinformation an interesting and useful tool to study datasets, but it should not be used as the sole ground to decide on the number of bits to store. Uncertainties and artifacts originating from the model that produces the data, and the sensitivity of the model/analysis downstream, should be considered when deciding on lossy-compression parameters.
@milankl, assuming you agree with this statement, this seems like something very important for readers of the Nature paper and potential users of the BitInformation approach for compression to understand, right? One of the exciting ideas of the paper was a completely quantitative method for determining the true information content and the number of bits to store. If that's not exactly true, it would be great to know the "best practices/caveats" for using the approach (or do I just need to read the paper more carefully?).
In principle I agree with the argument that one should consider different uncertainty quantifications if possible. But it's rarely straightforward what the uncertainty of a dataset is. Sure, one may think of a measurement device with some lab-quantified error. In that case, true, I would just discard precision according to that measurement error. But with any model simulation data there are so many errors that contribute to the uncertainty, and then it's also a question with respect to what the uncertainty should be quantified. E.g. I may change the discretization error by increasing the resolution. That'll give me an error estimate, but it won't tell me whether the error that comes from the cloud physics scheme is larger.

Similarly, in an ensemble forecast the uncertainty increases over time. Eventually the uncertainty is so large that, rounding the values to match the uncertainty, we are just left with climatology. This is because the ensemble forecast cannot provide any more information than the climatology at such long lead times. Similarly, calculating the bitwise information across the ensemble dimension will round away everything but the climatology (because that's what the bits across the ensemble dimension still have in common). However, if you want to investigate simulated hurricanes then you should keep many of those bits, and the uncertainty relevant to your use case is another one. Calculating the bitwise information in the ensemble dimension and assuming that's the overall uncertainty of the data set is clearly not a good idea. The bitwise information analysis cannot tell you a precision that is necessary to retain for every use case.

So, yes, if possible include other uncertainty quantifications. But, and this is the issue that we (and presumably many others) often face, it is nearly impossible to obtain such uncertainty quantifications for hundreds of variables, each with respect to hundreds of different scientific analyses. However, many use cases will find the mutual information of bits in adjacent grid points (as it's defined in the paper) essential to preserve, hence we propose this as a new perspective on the problem. In our data sets, unifying the different precisions required for many variables into a single "preserve 99% of real information" works well and is the first approach that I know of that tries to generalise precision for gridded data into a single information-based concept.
So on the note of best practices, I would say that "real information" is maybe understood by many as "the only true precision one should care about"; this is not the case, nor can it be for any lossy compression method. In fact, "real information" is calculated with respect to something (as discussed further up in this thread), and one is free to change the predictor in the mutual information to account for that.
@milankl, thanks for that pep talk. And I like the notes on best practices. I'm going to revisit the BitInformation calculations we made for our 40 year WRF simulation for the US and report back here with issues!
Thanks for the discussion here. We (@observingClouds and me) want to simply replace my manually chosen keepbits and tried the bitinformation approach. However, for one variable we find (sorry for the plot, it's my first time with this plotting setup) relative differences of up to 20% between raw and rounded data when only keeping 1 bit (instead of 6).
Hi @aaronspring, many thanks for sharing this analysis!
The data I analyze is the first and the last year of a 50-year simulation with monthly output. Here's one variable to download. The grid is curvilinear, i.e. the coordinates are two-dimensional.
I have a static land-sea mask. When I drop all missing values, I do the following:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset(i)            # i = path to the input netcdf file
ds = ds.stack(d=ds.dims)           # flatten all dimensions into a single one
ds['d'] = np.arange(ds['d'].size)
ds = ds.dropna('d')                # drop the masked (NaN) grid points
ds = ds.squeeze()
for n in range(4):                 # pad back up to 4 dimensions
    if len(ds.dims) < 4:
        ds = ds.expand_dims(f'extra_{n}')
assert len(ds.dims) == 4
ds.to_netcdf(o)                    # o = path to the output netcdf file
```
see https://owncloud.gwdg.de/index.php/s/eRUA8rSTNjgd4Et for original netcdf file
New analysis on the input data without
Thanks for sharing the data, I see several problems with it. Many adjacent grid points contain bit-identical values, e.g.

```julia
julia> bitstring.(fgco2[118:120,16:25,1],:split)
3×10 Matrix{String}:
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"  …  "0 01100001 11110110001110000011100"
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"     "0 01100001 11110110001110000011100"
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"     "0 01100001 11110010100010001111001"
```

This creates an artificial mutual information, as is discussed in the NatCompSci paper. The solution would be to actually use the data on the original grid that's apparent in your plots.
For example, rounding to 3 keepbits and compressing gives roughly a factor of 8:

```julia
julia> using Blosc, BitInformation

julia> Blosc.set_compressor("zlib")

julia> keepbits = 3

julia> sizeof(fgco2)/sizeof(compress(round(fgco2,keepbits)))
8.079668129161835
```

But again this has to be treated with caution, given that I currently find it difficult to grasp what other effects your interpolation method has.
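If helpful, the same one-liner can be swept over a few keepbits values to see the trade-off; this sketch assumes fgco2, Blosc and the zlib compressor are set up exactly as above.

```julia
# Compression factor as a function of the number of kept mantissa bits.
for kb in (1, 3, 7, 15)
    cf = sizeof(fgco2) / sizeof(compress(round(fgco2, kb)))
    println("keepbits = $kb -> compression factor of $(round(cf, digits=1))")
end
```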
I'm not sure I managed to get my point across: if you approach the compression of your data already knowing that, say, a 1% relative error is acceptable, then that error bound translates directly into a number of mantissa bits:

```julia
julia> floor(-log2(0.01))
6.0
```

The idea of BitInformation.jl is that it can tell you how many bits you need in the case where you don't have prior knowledge about an error or an uncertainty. That's why I ask again: how do you know that a 1% error is okay, but a 5% error is not?
It's not a heuristic, it's a rigid error bound as long as you stay within the normal floating-point numbers:

```julia
julia> floatmin(Float32), floatmax(Float32)
(1.1754944f-38, 3.4028235f38)
```

Floats don't guarantee you an absolute error in the same way, as their spacing increases towards ±infinity. But on a logarithmic axis they are approximately equidistant, which bounds the relative error.
Here's the curvilinear grid. At the equator it is similar to a regular grid, but especially the Fram Strait is different. I do not do any interpolation here. This is the native MPIOM grid, and while not regular, adjacent array entries are still adjacent in physical space. The
I don't know and cannot answer. There is no objective metric I can use here. I just played around with the numbers beforehand on metrics I am most familiar with and decided by eye 👁️.
Now, I get it.
And dropping
I see that, due to the convergence of the grid lines on Greenland, the plot you first shared indeed has Greenland entirely spread out over the top edge of the plot. Sorry, but this is not what I meant by interpolation. Here you can still see some grid lines. I'm therefore postulating that, because these are fluxes between ocean and atmosphere, and the atmospheric model has a lower resolution than the ocean, the atmospheric model grid has to be interpolated onto the ocean grid, giving us these artifacts. I currently can't see an easy way to correct for this other than analysing the bitwise information for other regions instead. Otherwise, you'll get some artificial information.
Not as BitInformation.jl currently implements the algorithm; the essential lines are these: BitInformation.jl/src/mutual_information.jl, lines 75 to 82 at fa2dd50.
From a given array A, one could add a method bitinformation(A::AbstractArray, mask::BitArray) such that the user can pass on a mask B, e.g.

```julia
for i in eachindex(A)[1:end-1]
    if ~(B[i] | B[i+1])                  # if neither entry is masked
        bitcount_pair!(C, A[i], A[i+1])  # count all bits and increase counter C
    end
end
```

which would walk along the first dimension. For other dimensions, the array and its mask would need to be transposed/permutedim-ed upfront, as we already do anyway. This would still come with the problem that we'd walk across the domain edges (e.g. we would count bitpairs with A[end,1] and A[1,2] as they lie adjacent in memory). Maybe there's a neat way to solve that too? Sorry, literally thinking out loud here. I can potentially implement that in two weeks as I think it would be useful for people. Feel free to use these code snippets and create a PR though! You are more than welcome to contribute!
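Building on that snippet, a hedged sketch of how a masked count over a 2D array could also avoid walking across the edge of the first dimension; bitcount_pair! and the counter C are assumed to behave as in the loop above, and nothing here is the package's actual implementation.

```julia
# Count bitpairs along the first dimension only, skipping masked points and
# never pairing the last entry of one column with the first of the next.
function masked_bitpair_count!(C, A::AbstractMatrix, B::BitMatrix)
    nx, ny = size(A)
    for j in 1:ny, i in 1:nx-1           # stop one short of the edge in dim 1
        if !(B[i, j] | B[i+1, j])        # neither grid point is masked
            bitcount_pair!(C, A[i, j], A[i+1, j])
        end
    end
    return C
end
```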
Yes, that's the case.
Could chunking be a problem here, i.e. if the array is spatially chunked, adjacent grid points could be separated in memory (maybe that's only on disk)? A PR sounds good. Happy to test it :)
I find it most practical to run
Wouldn't it be more powerful to determine the keepbits separately for subsets of the data, e.g. per vertical level?
Yes, nothing stops you from assessing the bitinformation for subsets of your data arrays. Whether this is in the vertical, or say you want to use a different number of keepbits for the tropics vs high/mid-latitudes. The beauty of the bitrounding is that it is indeed possible to have different precisions within the same array. However, I was always afraid to open up this door because there are just too many ways one could do it. I believe you'd need a good prior justification to do so, which, in turn, goes against the idea that you want the algorithm to tell you the required precision for some data, without particular knowledge about the data or its uncertainties. Hence, yes, it may be more powerful, but the analysis also gets vastly more complex. Having said that, a variable like nitrogen dioxide concentration is subject to very different dynamics at different vertical levels (industry & ship tracks at the surface, turbulence and radiation higher up in the atmosphere), so if output is already stored on independent vertical levels, a vertically varying keepbits might be a good choice. However, imagine you have 500 variables to output on 100 levels each; that's a lot of keepbits compression parameters to preanalyse and store.
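A hedged sketch of the per-level bookkeeping, assuming `A` is a Float32 array with dimensions (lon, lat, level); the bitinformation call follows the signature used earlier in this thread, and deriving a keepbits value from each information vector (e.g. with the 99% rule) is left as in the helper sketched further up.

```julia
using BitInformation

nlev = size(A, 3)
# one vector of per-bit information values per vertical level
info_per_level = [bitinformation(A[:, :, k], dims=1) for k in 1:nlev]
```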
Usage suggestion: when the data has an oscillation (seasonality or a daily cycle), the training period for bitinformation should include the extremes of that cycle, and the threshold level (0.99 ... 0.999999) should probably be increased with respect to non-oscillatory data when the oscillation isn't averaged out before the analysis.
Can you provide an example for why that's an issue that requires special treatment? |
I was a bit too fast in my previous post: it's hard to explain subtle differences consistently across many variables. Seasonality increases the range of values, but not necessarily the real bitwise information content. Sure, I get different outcomes for the 99% and 100% cutoff bits.
Full (and not conclusive) analysis: https://gist.github.com/aaronspring/6d949097c7f06b6e7a10c161b60f4e50
No seasonality means you've used the same data as before, but subtracted a mean seasonal cycle? But with and without seasonal cycle this is the bitwise information in longitude? Could you further clarify what conceptual model you are using to justify that the information with and without the seasonal cycle should be the same? It sounds as if you are saying information(anomaly) = information(anomaly + seasons) should hold, but that's not true on the bit level, even if you think of the seasons as being some constant. For example, using ˚C or K will shift the bitwise information content across bit positions. There's an example in the paper that illustrates this: for a given autocorrelation in your data, adding or removing a constant will shift the information across bit positions. However, the compressibility does not change as much as you may think, but it's also not identical. You can think of this as moving information from the exponent bits into the mantissa bits: it makes the compression of the exponent bits easier, but the compression of the mantissa bits harder.

In your example I'm actually quite surprised/happy to see how robust the method is against the removal of a seasonal cycle (which supports the conceptual idea that the information of the seasonal cycle is negligible compared to the information of the anomalies). Given that you end up with a discrete number of keepbits in the analysis, having +-1 when using the de-seasonalised data could also just be part of the uncertainty (as in, repeat the analysis with an independent other sample of your data and you may get a somewhat different result). The exception being that the dissicos keepbits for 100% preserved information jump from 12 to 19, I agree. However, note that this might also be just a consequence of uncertainty: information/entropy is by definition a non-negative quantity; in practice, this means that even for two fully independent variables X,Y the mutual information between them is not zero, but somewhat larger when estimated. In order to filter out the non-significant information bitinformation calls the

On a more practical note, you can run
Short clarification: I compare 49 Januaries (no seasonality) against 49 consecutive months (seasonality). No anomalies are taken, so the climate states and data are different. Maybe I should also try the full dataset, raw and with the seasonal cycle removed.
No, I don't think they should be the same.
I like that justification for 99%, or at least for not going for 100% = 0.999999999.
@milankl, where is the best place to discuss usage/interpretation/best practices for these tools?
For example, I've got two questions from applying these techniques to a new CONUS WRF run. I've got a CONUS404 Jupyter notebook where I started with your tutorial and wrote a little loop to go through all the variables:
I'm wondering what it means that the distributions for many of the variables in the cell [8] figure aren't continuous. Some are continuous, like V, but others, like SST, are pretty strange looking. Is this expected for some variables, or does it mean I did something wrong, or didn't use enough data?
I only used one 2D field here, so I was thinking of making a more robust assessment by selecting 100 or 1000 fields through the year and seeing what the histogram of keepbits for each variable looks like, and then picking something on the conservative side to apply. Does that sound reasonable?