
Fix: denspeed.pyx to give correct output for nlmeans #1018

Merged
merged 4 commits into nipy:master on May 13, 2016

Conversation

4 participants
@riddhishb
Contributor

riddhishb commented Mar 24, 2016

This fixes the bug in the implementation of denspeed, in reference to #1011.
Now we can use any patch_radius and block_radius for denoising.

  • The implementation follows the block weighting approach, which is then averaged for each voxel.
  • It does not exactly follow the [coupe2008] block averaging approach.
  • To follow the [coupe2008] approach, the existing code by @omarocegueda here is accurate, but it needs to be optimized (I am working on that currently!).
Riddhish Bhalodia
Fix: denspeed.pyx to give correct output for nlmeans
The bugfix in the implementation of denspeed
Now we can use any patch_radius and block_radius for denoising purpose

Note: It is now following the weighted averaging approach applied at each pixel
	  and not the block averaging approach of [coupe2008]
@samuelstjean

Contributor

samuelstjean commented Mar 24, 2016

The implementation here is actually only for the voxelwise average, and not the blockwise average.

The thing we are actually missing is the weighting check based on the mean and variance of eq. 11; I never got around to implementing it. Not sure why the sqrt(2) factor in the denominator went away, though.
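
For reference, a minimal sketch of that missing check (illustrative only, not dipy code; the mu1 and sigma1 thresholds are roughly the ones suggested in [coupe2008]):

import numpy as np

def patch_preselect(patch_i, patch_j, mu1=0.95, sigma1=0.5):
    # Eq. 11 style pre-selection: only give a patch pair a non-zero weight
    # if their local means and variances are similar enough.
    mean_i, mean_j = patch_i.mean(), patch_j.mean()
    var_i, var_j = patch_i.var(), patch_j.var()
    if mean_j == 0 or var_j == 0:
        return False
    return (mu1 < mean_i / mean_j < 1.0 / mu1 and
            sigma1 < var_i / var_j < 1.0 / sigma1)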

@riddhishb

Contributor

riddhishb commented Mar 24, 2016

@samuelstjean if it's the voxelwise average, then the patch submitted fixes it, right? (except for the enhancement of adding equation 11)

Also sorry about the sqrt(2) removal, will change that :D

One more thing, isn't the blockwise approach more helpful for denoising? The results from ornlm are much better than from the voxelwise implementation; the only drawback is the speed, which can be improved if optimized.

@samuelstjean

Contributor

samuelstjean commented Mar 24, 2016

That's the thing, it might be hard to sell the "it's better but slower" angle. I tried to add the option to use a 3D volume, which had a performance cost of around 10-15 secs over a full run of the algorithm, and I pretty much had to outline a bunch of timing tests to show that.

If you go for splitting it into a second function, there will be a ton of duplication, so I don't know how that would go either. Adding a keyword to choose between voxelwise and blockwise could be a compromise, maybe.

The implementation we went for is actually this one http://scil.dinf.usherbrooke.ca/wp-content/papers/descoteaux-etal-miccai08.pdf, so that is why it's the voxelwise and not the blockwise version.

Oh and also the whole thing is in cython and pretty much optimized to death already, but the blockwise version only runs through more for loops, so I doubt it can be made faster.

@riddhishb

Contributor

riddhishb commented Mar 24, 2016

@samuelstjean Fair point.

Will change the sqrt(2) and try to include the equation 11 thing.
Yeah, the patch submitted follows the http://scil.dinf.usherbrooke.ca/wp-content/papers/descoteaux-etal-miccai08.pdf approach; we need to update the docstrings as well (they mention the [coupe2008] paper). Can you run it to verify the correctness?

A keyword may work, I will try to code it up.

Also, I am not sure why the Travis CI check is failing; it shows a failure in test_nlmeans.py for Python 2.7,
but when I do
python2.7 -m nose test_nlmeans.py
on my local machine it passes all the tests.

@samuelstjean

Contributor

samuelstjean commented Mar 24, 2016

It is not wrong per se for the docstring; the paper outlines both the voxelwise and the blockwise version, but we had a binary executable of the voxelwise version, so we went with it for convenience.

There seems to be some weird test checking that the parallel version is faster than the single-threaded version, but Travis might use some weird OpenMP setup, which would explain why some tests fail. I know I had problems with the default not running full speed, but I did not write the part dealing with OpenMP or its tests; it might be worth asking the person who originally added that (the default was to run full speed, but Mac OS X did not have OpenMP support in Apple's stock compiler configuration or something, and it behaved weirdly).

@riddhishb

Contributor

riddhishb commented Mar 26, 2016

@samuelstjean yeah apple and openmp is a bad combo.
So what are your suggestions for the patch itself (its correctness with regard to the bugfix for #1011)?
Should I try the keyword based approach, or let the voxelwise approach remain as it is?

If we were to code for some adaptive techniques for denoising (suggested by @Garyfallidis ) then the blockwise approach gives a much better result, so maybe a keyword does make sense.

@samuelstjean

Contributor

samuelstjean commented Mar 26, 2016

Actually, I modified nlmeans a while back so it could support a whole 3D volume for the noise estimation, so it is easier for anyone to roll out their own adaptive version transparently thanks to that. As for the blockwise approach, it does work better for structural images (which is what nlmeans was made for in the first place), but since this library does diffusion MRI, many people use it for processing a whole bunch of 3D volumes independently.

While this works on paper, I fear that in practice the blockwise approach will smear out your edges near regions of signal drop between adjacent structures, which is why we went for the voxelwise version, but feel free to try it. Now that I have seen the issue (I did not see it the first time), I am starting to wonder if the bug was in my original version of the cython translation, go figure. At the very least it should flag an error for patchsize >= blocksize, I'd say.
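
Something along these lines would do it (a hypothetical guard, not existing dipy code; parameter names follow the nlmeans signature):

def check_radii(patch_radius, block_radius):
    # Guard against the degenerate case discussed above: the search block
    # has to be strictly larger than the patch being compared.
    if patch_radius >= block_radius:
        raise ValueError("patch_radius must be smaller than block_radius "
                         "(got patch_radius=%d, block_radius=%d)"
                         % (patch_radius, block_radius))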

@riddhishb

Contributor

riddhishb commented Mar 27, 2016

I will try it.
For the bugfix, does anything else need to be changed apart from the cython file?

for i in prange(B, I - B):
for j in range(B, J - B):
for k in range(B, K - B):
for i in range(B+P, I - B-P):

@Garyfallidis

Garyfallidis Apr 26, 2016

Member

Space between operators.
All PEP8 conventions should be followed in Cython files, except for the line-width limitation. Please correct this everywhere.

@@ -153,49 +152,53 @@ cdef double process_block(double [:, :, ::1] arr,

cdef:
cnp.npy_intp m, n, o, M, N, O, a, b, c, cnt, step
double patch_vol_size
double patch_vol_size,block_vol_size

@Garyfallidis

Garyfallidis Apr 26, 2016

Member

Same here: add a space after the comma.

@Garyfallidis

Member

Garyfallidis commented Apr 26, 2016

Hi @riddhishb I am a bit worried about the error that you are currently getting.

assert_equal(duration_2core < duration_1core, True)

False

Basically the error says that using two CPUs is slower than using one. Have you looked into how threading is used in this code? Please make sure that threading is used optimally after your changes. Let me know if you need some help figuring this out.

@riddhishb

Contributor

riddhishb commented Apr 29, 2016

@Garyfallidis I am trying to fix that bug, but for some reason I am not able to annotate the cython code for denspeed

I do cython -a denspeed.pyx

but I get this

Error compiling Cython file

import numpy as np
cimport numpy as cnp

cimport cython
cimport safe_openmp as openmp

denspeed.pyx:7:8: 'safe_openmp.pxd' not found

Error compiling Cython file

import numpy as np
cimport numpy as cnp

cimport cython
cimport safe_openmp as openmp
from safe_openmp cimport have_openmp

denspeed.pyx:8:0: 'safe_openmp/have_openmp.pxd' not found

I cannot figure out why this is happening, any ideas?

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

@riddhishb go to the main directory of dipy and do
make clean
make ext
and tell me if that works for you

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

Okay, when I tried cython -a dipy/align/bundlemin.pyx I also got some errors. For example:
bundlemin.pyx:224:36: Converting to Python object not allowed without gil

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

Bundlemin also uses threads (openmp). What this is telling us is that the cython command doesn't know where to find openmp. And that is correct.

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

In setup.py we explicitly specify how to find the headers etc. before compiling the cython files. However, the cython command doesn't have that information. Is this clear? Please let me know otherwise.

@riddhishb

Contributor

riddhishb commented Apr 29, 2016

Oh, fair point, so is there any workaround for that? Yeah, bundlemin has the same kind of openmp imports.

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

Maybe you can provide extra information to the cython command. I haven't tried it yet.

@riddhishb

Contributor

riddhishb commented Apr 29, 2016

Ok will try that!

@Garyfallidis

Member

Garyfallidis commented Apr 29, 2016

The cython command is like gcc: you can add include directories etc. I hope this tip helps ;).

@riddhishb

Contributor

riddhishb commented Apr 29, 2016

Yeah solved :)

@Garyfallidis

Member

Garyfallidis commented Apr 30, 2016

Cool, can you provide here the command that worked for you? It is good to keep track of things as we move ahead. It may be interesting for other people too.

@riddhishb

Contributor

riddhishb commented Apr 30, 2016

cython -I src/ -a dipy/denoise/denspeed.pyx did the trick; the cython files for openmp are in that folder.

Riddhish Bhalodia
W = <double *> malloc(BS * BS * BS * sizeof(double))
cache = <double *> malloc(BS * BS * BS * sizeof(double))
sigma_block = <double *> malloc(BS * BS * BS * sizeof(double))
W = <double * > malloc(PS * PS * PS * sizeof(double))

@Garyfallidis

Garyfallidis May 9, 2016

Member

No need for space after * in this case because now * is not an operator but a pointer.

cache = <double *> malloc(BS * BS * BS * sizeof(double))
sigma_block = <double *> malloc(BS * BS * BS * sizeof(double))
W = <double * > malloc(PS * PS * PS * sizeof(double))
cache = <double * > malloc(TS * TS * TS * sizeof(double))

@Garyfallidis
sigma_block = <double *> malloc(BS * BS * BS * sizeof(double))
W = <double * > malloc(PS * PS * PS * sizeof(double))
cache = <double * > malloc(TS * TS * TS * sizeof(double))
sigma_block = <double * > malloc(TS * TS * TS * sizeof(double))

@Garyfallidis
@@ -273,7 +307,7 @@ cdef cnp.npy_intp copy_block_3d(double * dest,

for i in range(I):
for j in range(J):
memcpy(&dest[i * J * K + j * K], &source[i + min_i, j + min_j, min_k], K * sizeof(double))
memcpy(& dest[i * J * K + j * K], & source[i + min_i, j + min_j, min_k], K * sizeof(double))

@Garyfallidis

Garyfallidis May 9, 2016

Member

No need for space here.

@Garyfallidis

Member

Garyfallidis commented May 9, 2016

Hi @riddhishb, did you figure out why the code was slower with 2 CPUs? Can you also report the timings for the old version and the new version of the code? Any changes in quality? I also pointed out some simple stylistic changes. I think we are very close to merging this. Let us know what you think.

for i in prange(B, I - B):
for j in range(B, J - B):
for k in range(B, K - B):
for i in prange(B + P, I - B - P):

@riddhishb

riddhishb May 10, 2016

Contributor

@Garyfallidis The bug behind the CPU speed mismatch was pretty simple: I didn't use prange here, which was causing that error.
The timings for the denoise_nlmeans example from doc/examples/ are:
Original = 4.70s
Updated = 4.65s
so it's not much different. In terms of quality, the updated version gives a much better denoised output (maybe too smooth at times), and it also solves the bug reported in #1011.
I will commit those simple changes. I also think this can be merged into the main branch; it's nearly done.
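
For context, the kind of change involved is roughly this (a simplified Cython sketch, not the actual denspeed code; it has to be compiled with OpenMP flags such as -fopenmp):

# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def parallel_sum(double[::1] arr):
    # prange releases the GIL and splits the outer loop across OpenMP
    # threads; a plain range() here would keep everything on one core,
    # which is what made the 2-core timing test fail.
    cdef Py_ssize_t i
    cdef double total = 0
    for i in prange(arr.shape[0], nogil=True):
        total += arr[i]
    return total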

@Garyfallidis

Member

Garyfallidis commented May 13, 2016

Okay, I am merging this. Let's try to monitor it during the next months using different datasets and see if it is all good. For example, a dataset where the brain is touching the edge of the volume would be good to try. Sometimes with patch methods one needs to be careful about what happens near the boundaries. So far the tests are happy :) Congrats @riddhishb!

@Garyfallidis merged commit 56fd9ed into nipy:master May 13, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@riddhishb

Contributor

riddhishb commented May 13, 2016

@Garyfallidis Thank You :)
I will run tests on this patch under several limiting conditions and report the outputs in some kind of documentation in the next few days.


# calculate weights between the central patch and the moving patch in block
# (m, n, o) coordinates are the center of the moving patch
# (a, b, c) run inside both patches
for m in range(P, BS - P):
for n in range(P, BS - P):
for o in range(P, BS - P):

@riddhishb

riddhishb Jun 22, 2016

Contributor

@Garyfallidis @omarocegueda
The essential algorithm here is that we have a patch (V) of radius (P) around a voxel (x), and then we do the following to get the denoised output:
[equation: the denoised value at x is the weighted average NL(u)(x) = sum over x_i in V of w(x, x_i) u(x_i), normalized by sum over x_i in V of w(x, x_i)]

and the weights are computed from the distance between patches (of size S, with a different radius B) centered at x and at x_i, for all i in V. Mathematically:

[equation: w(x, x_i) decays exponentially with the distance between the blocks of radius B centred at x and x_i, scaled by the smoothing parameter h]

So in the current code there is a correction needed in the normalization of the weights (I will correct that, but it does not affect the oversmoothing problem).

The changed code follows the algorithm exactly. The problem with the previous code lies primarily in these lines, as it did something entirely different, and if you keep P=B or P>B you can see these loops will cause a problem. Hence I restructured it completely and followed the paper's algorithm step by step. I have rechecked it and I do not find any problem with the implementation apart from the normalization of sigma (lines 222, 223). Please do have a look at this.

@grlee77

grlee77 Jun 22, 2016

Contributor

I see that the term in the exponential as defined here (and in the reference you linked) differs from that defined in Eq. 2 of the paper by Coupe et al. (http://www.ncbi.nlm.nih.gov/pubmed/18390341). Namely, both the numerator and denominator of the exp exponent are squared in Eq. 2 relative to the above.
I think the original Buades paper definition matches the Coupe one.

This will affect the degree of smoothing, as (a / b) != (sqrt(a) / sqrt(b)) in most cases. When the numerator of the exp term is large relative to h, the weights from Coupe should fall off faster than the ones defined above, giving a sharper image.
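
A quick numeric illustration (made-up numbers, just to show the falloff):

import numpy as np

d2, h = 4.0, 1.0                      # squared patch distance and smoothing parameter
w_coupe = np.exp(-d2 / h ** 2)        # Eq. 2 form: squared distance over squared h
w_linked = np.exp(-np.sqrt(d2) / h)   # form from the linked reference
print(w_coupe, w_linked)              # ~0.018 vs ~0.135: the Eq. 2 weights decay faster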

denom = sqrt(2) * (sigm / patch_vol_size)**2
w = exp(-(summ / patch_vol_size) / denom)
denom = sqrt(2) * (sigm / block_vol_size)**2
w = exp(-(summ / block_vol_size) / denom)
sumw += w

@riddhishb

riddhishb Jun 22, 2016

Contributor

@Garyfallidis these will be changed to
denom = sigm / block_vol_size
w = exp(-sqrt(summ / block_vol_size) / denom)

in PR #1066, but that still doesn't get rid of the oversmoothing.

W = <double *> malloc(BS * BS * BS * sizeof(double))
cache = <double *> malloc(BS * BS * BS * sizeof(double))
sigma_block = <double *> malloc(BS * BS * BS * sizeof(double))
W = <double *> malloc(PS * PS * PS * sizeof(double))

@grlee77

grlee77 Jun 22, 2016

Contributor

Pretty sure this should still be BS * BS * BS. You want to examine the distances for all patches falling within the block.

@riddhishb

riddhishb Jun 22, 2016

Contributor

Ok, maybe there is some naming inconsistency: I take the central patch of size PS and then compute the distance between blocks of size BS, i.e. between the block around the central voxel and the block around each voxel in the patch.

@grlee77

grlee77 Jun 22, 2016

Contributor

But what you want are differences between patches of size PS for all patches within BS. There should be no comparison of distances between blocks. That is what the code was doing before. I think the bug in the old code was just that the block radius searched over was smaller than specified by one patch radius, not a problem in the actual implementation.
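
For illustration, the structure being described is roughly this (a plain NumPy sketch, not the dipy code; it assumes (i, j, k) lies at least B + P voxels away from the array edges, and the exact form of the exponent is the separate point discussed above):

import numpy as np

def nlmeans_voxel(arr, i, j, k, P, B, h2):
    # Voxelwise NLM at (i, j, k): the outer m, n, o loops walk over every
    # candidate voxel inside the search block of radius B; the distance is
    # computed between the small patches of radius P around the centre
    # voxel and around each candidate voxel.
    center = arr[i - P:i + P + 1, j - P:j + P + 1, k - P:k + P + 1]
    num = denom = 0.0
    for m in range(-B, B + 1):
        for n in range(-B, B + 1):
            for o in range(-B, B + 1):
                ii, jj, kk = i + m, j + n, k + o
                patch = arr[ii - P:ii + P + 1, jj - P:jj + P + 1, kk - P:kk + P + 1]
                d2 = np.mean((center - patch) ** 2)   # inner a, b, c loops live here
                w = np.exp(-d2 / h2)
                num += w * arr[ii, jj, kk]
                denom += w
    return num / denom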

@riddhishb

riddhishb Jun 22, 2016

Contributor

Ok @grlee77, if you can just wait a few minutes, I will try the changes you suggested and give you the output.

cache,
PS + BS - 1,
PS + BS - 1,
PS + BS - 1,

@grlee77

grlee77 Jun 22, 2016

Contributor

not important, but you have already defined TS = PS + BS - 1 above

for o in range(P, BS - P):
for m in range(-P, P + 1):
for n in range(-P, P + 1):
for o in range(-P, P + 1):

@grlee77

grlee77 Jun 22, 2016

Contributor

I think the outer m, n, o loops should be over blocks as it was before

@riddhishb

riddhishb Jun 22, 2016

Contributor

I guess these two comments will also be taken care of if we consider that naming inconsistency, right?

for c in range(-P, P + 1):
for a in range(-B, B + 1):
for b in range(-B, B + 1):
for c in range(-B, B + 1):

@grlee77

grlee77 Jun 22, 2016

Contributor

The inner a, b, c loops should be over the patches as it was before

denom = sqrt(2) * (sigm / patch_vol_size)**2
w = exp(-(summ / patch_vol_size) / denom)
denom = sqrt(2) * (sigm / block_vol_size)**2
w = exp(-(summ / block_vol_size) / denom)

@grlee77

grlee77 Jun 22, 2016

Contributor

Patch sizes, not block sizes, should be in these denominators.

for o in range(P, BS - P):
for m in range(-P, P + 1):
for n in range(-P, P + 1):
for o in range(-P, P + 1):

@grlee77

grlee77 Jun 22, 2016

Contributor

The m, n, o loops here should be over -B, B + 1

@grlee77

grlee77 Jun 22, 2016

Contributor

The m, n, o loops here should also be over -B, B + 1 as before

@samuelstjean

Contributor

samuelstjean commented Jun 22, 2016

I just realized this goes back to March; I have not looked at it since April (ISMRM + summer stuff), and now I have started receiving tons of emails.

So actually my original concern about 'there is probably no bug' is that if you select patch_size = block_size, your search region is of the same size as what you are trying to denoise, hence the mean of one element is one element, and nothing is processed. Of course this is still valid no matter the size you select, as you have basically no effective search region. Yes, it's a strange corner case, but perfectly valid mathematically.

So on that matter, apparently there was a problem, and I now see a bunch of TS variables, so what exactly is the logic we need to double-check for problems? I'm pretty sure that if anyone runs the example on the latest master with the defaults everywhere, they will see the huge effect reported earlier.

@samuelstjean

Contributor

samuelstjean commented Jun 22, 2016

Ah yes, in any case the original (closed source) matlab code is available to double-check against in case you have discrepancies. Our version pretty much mimics the original C version (minus the mean and variance dropout check, if it has it), for which we do not have the source anymore, as @mdesco mentioned earlier.

https://sites.google.com/site/pierrickcoupe/softwares/denoising-for-medical-imaging/mri-denoising/mri-denoising-software
http://personales.upv.es/jmanjon/denoising/index.htm

@riddhishb

Contributor

riddhishb commented Jun 22, 2016

@samuelstjean The TS variables are there because the cache should be of size patch size + block size; the big block of size TS takes care of all this, even in the initially buggy case of blocksize > patchsize.

Also, the ornlm implementation from Coupe's website, which uses the same algorithm as denspeed but with blockwise averaging, is being added behind a keyword (like you initially suggested) through #1066.

@samuelstjean

Contributor

samuelstjean commented Jun 22, 2016

Yes, saw that go by; still busy doing stuff for the end of July, might look at it when finished.
By the way, if you want to roll out an adaptive version based on nlmeans, this version accepts a 3D array especially for that purpose, so you might want to re-use all those parts.

It makes it easier to roll out your own version by computing the sigma volume externally with the various noise profile estimation functions around.
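
For example, something along these lines (a sketch only; local_noise_estimate is a placeholder for whatever per-voxel noise estimator you prefer, it is not a dipy function):

import numpy as np
from dipy.denoise.nlmeans import nlmeans

# data: 3D (or 4D) volume already loaded as a numpy array.
# sigma_3d: per-voxel noise estimate with the same spatial shape as data,
# computed externally by any adaptive estimator (hypothetical helper here).
sigma_3d = local_noise_estimate(data)
denoised = nlmeans(data, sigma=sigma_3d, patch_radius=1, block_radius=5,
                   rician=True)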
