Obtained Mismatching Reproducing Results #10

Open
xxliang99 opened this issue May 28, 2021 · 5 comments

@xxliang99

Dear Quande,

Thank you for your time. I modified the data-preparation step as you explained in the previous issue #8. However, I am sorry to report that I still could not obtain Dice scores close to those reported in the paper.

I tried several times with different data pre-processing methods. Below is a description of my modifications.

  1. At first, the only Python file I edited myself was prepare_dataset.py.

    • I resized the input images and labels (except Domain 2) to 384×384. After that, the image data was converted to a numpy array, and the amplitude spectra were extracted using the provided extract_amp_spectrum function in prepare_dataset.py.

    • The label mask was split into 2 masks, Disc and Cup. Pixels with gray level 255 are set to 0 as background, and pixels with gray levels 128 and 0 are set to 1 as foreground in the Disc and Cup masks. After all the steps above, the final numpy file is stored as [image_np, disc_mask_np, cup_mask_np] with shape [384, 384, 5]. The image data keeps its original 3 channels, and the 2 masks are binary arrays.

    • For Domain 2, at the beginning, I cropped each image into 2 individual samples that share the same label mask; the label mask I used is cropped from the left half of the raw mask. The numbers of samples per domain therefore become 101, 318, 400, 400, respectively. (A minimal sketch of this preparation step follows this list.)
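For reference, here is a minimal sketch of the preparation step described above. The helper name prepare_sample, the paths, and the exact mask-level convention (disc foreground at gray levels 128 and 0, cup foreground at gray level 0) are my own assumptions for illustration, not the repository's code:

import numpy as np
from PIL import Image

def prepare_sample(image_path, mask_path, out_path, size=(384, 384)):
    # Resize image and label; nearest-neighbor resampling keeps the mask's gray levels intact.
    image = np.asarray(Image.open(image_path).convert('RGB').resize(size)).astype(np.float32)
    mask = np.asarray(Image.open(mask_path).convert('L').resize(size, Image.NEAREST))

    # Assumed label convention: 255 = background, 128 = disc, 0 = cup (the cup lies inside the disc).
    disc_mask = (mask <= 128).astype(np.float32)  # foreground where the gray level is 128 or 0
    cup_mask = (mask == 0).astype(np.float32)     # foreground where the gray level is 0

    # Store as [image, disc_mask, cup_mask] with shape [384, 384, 5].
    sample = np.concatenate([image, disc_mask[..., None], cup_mask[..., None]], axis=-1)
    np.save(out_path, sample)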

To show that I have modified prepare_dataset.py correctly, here are some examples of the Disc's and Cup's background and contour, extracted just as the original code does.

[screenshot: examples of extracted Disc and Cup background and contour]

  2. Noticing that the paper explains that the best performance is reached when the interpolation ratio is a random number in [0, 1], while in the provided code it was set to 1 (meaning all low-frequency information is transferred), I changed it to a random number in the function low_freq_mutate_np of fundus_dataloader.py. Below is the code:

Original:

def low_freq_mutate_np(amp_src, amp_trg, L=0.1):
    .......
    #ratio = random.randint(1, 10) / 10

    a_src[:, h1:h2, w1:w2] = a_trg[:, h1:h2, w1:w2]
    #a_src[:, h1:h2, w1:w2] = a_src[:, h1:h2, w1:w2] * ratio + a_trg[:, h1:h2, w1:w2] * (1 - ratio)
    ......

Modified:

def low_freq_mutate_np(amp_src, amp_trg, L=0.1):
    ......
    ratio = random.randint(1, 10) / 10

    # a_src[:, h1:h2, w1:w2] = a_trg[:, h1:h2, w1:w2]
    a_src[:, h1:h2, w1:w2] = a_src[:, h1:h2, w1:w2] * ratio + a_trg[:, h1:h2, w1:w2] * (1 - ratio)
    ......
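One caveat with this change (my own observation, not from the repository): random.randint(1, 10) / 10 draws ratio from the discrete set {0.1, 0.2, ..., 1.0}, and with the interpolation written as above, ratio = 1 keeps the source amplitude entirely unchanged. If a continuous ratio in [0, 1] is intended, as the paper describes, something like the following could be used instead:

    ratio = random.uniform(0.0, 1.0)  # continuous interpolation ratio in [0, 1]
    a_src[:, h1:h2, w1:w2] = a_src[:, h1:h2, w1:w2] * ratio + a_trg[:, h1:h2, w1:w2] * (1 - ratio)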
  3. With the results still unsatisfactory, I further modified fundus_dataloader.py by adding a random rotation and flipping step. I noticed that you mentioned in issue "Questions about data augmentation in fundus dataset." #6 that the data augmentation steps are performed offline. Since the locally kept data remains unchanged during the training process, I added these steps to __getitem__ in fundus_dataloader.py, making use of the RandomRotFlip() function already provided in the same file. Below is the code I added.

Original code:

def __getitem__(self, idx):
        raw_file = self.image_list[idx]

        mask_patches = []
        raw_inp = np.load(raw_file)
        image_patch = raw_inp[..., 0:3] 
        mask_patch = raw_inp[..., 3:]
        image_patches = image_patch.copy()
      
        # image_patches = 
        # print (image_patch.dtype)
        # print (mask_patch.dtype)
        disc_contour, disc_bg, cup_contour, cup_bg = _get_coutour_sample(mask_patch)
        # print ('raw', np.min(image_patch), np.max(image_patch))
        for tar_freq_domain in np.random.choice(self.freq_site_index, 2):
               ......

Modified code:

def __init__(self, unseen_site_idx, client_idx=None, freq_site_idx=None, split='train', transform=None):
        self.unseen_site_idx = unseen_site_idx
        self.client_idx = client_idx    
        ......
def __getitem__(self, idx):
        raw_file = self.image_list[idx]

        mask_patches = []

        raw_inp = np.load(raw_file)
        image_patch = raw_inp[..., 0:3]
        mask_patch = raw_inp[..., 3:]

        # image_patches = 
        # print (image_patch.dtype)
        # print (mask_patch.dtype)

        if self.client_idx != self.unseen_site_idx:
            sample = {"image": image_patch, "label": mask_patch}
            preprocessor = RandomRotFlip()
            sample = preprocessor(sample)
            mask_patch = sample["label"]
            image_patch = sample["image"]

        image_patches = image_patch.copy()

        disc_contour, disc_bg, cup_contour, cup_bg = _get_coutour_sample(mask_patch)
        # print ('raw', np.min(image_patch), np.max(image_patch))
        for tar_freq_domain in np.random.choice(self.freq_site_index, 2):
                 .......
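For reference, the kind of transform RandomRotFlip applies is sketched below. This is an assumption about its behavior (the same random multiple-of-90-degree rotation and random flip applied to image and label together), not the repository's exact implementation:

import numpy as np

class RandomRotFlip:
    # Sketch: apply the same random 90-degree rotation and flip to image and label.
    def __call__(self, sample):
        image, label = sample['image'], sample['label']
        k = np.random.randint(0, 4)               # number of 90-degree rotations
        image = np.rot90(image, k, axes=(0, 1)).copy()
        label = np.rot90(label, k, axes=(0, 1)).copy()
        if np.random.rand() < 0.5:                # random horizontal flip
            image = np.flip(image, axis=1).copy()
            label = np.flip(label, axis=1).copy()
        return {'image': image, 'label': label}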

To show that all the modification steps are correct, the raw and transformed images are shown below.

[screenshot: raw and transformed images]

  • The image above was taken under the scheme where the interpolation ratio is chosen at random from [0, 1].

The Dice scores of the Disc after training for 100 epochs are reported in the following table. "1, 2, 3" refer to the modification steps above, along with their corresponding combinations.

[screenshot: table of Disc Dice scores for modification steps 1, 2, 3 and their combinations]

I only trained once for each setting, but the performance I reproduced is clearly much worse than that reported in the paper, which leaves me quite confused. As the data-preparation steps are absent from the provided code, I am sincerely asking for your help in checking my preparation steps.

Another question: from the provided code, I did not observe any difference in how data from Domain 2 is treated compared to the other domains. If 2 views are generated from each image in Domain 2, as explained in issue #8, the aggregation weights should depend on [101, 318, 400, 400], but in line 63 of train_ELCFS.py the numbers are still [101, 159, 400, 400]. If the 2 views are stored in a single .npy file, then the code below, from lines 37–45 of fundus_dataloader.py, would raise errors because of the change in numpy array size. (A small illustration of the aggregation-weight point follows the code.)

def __getitem__(self, idx):
        raw_file = self.image_list[idx]

        mask_patches = []

        raw_inp = np.load(raw_file)
        image_patch = raw_inp[..., 0:3]
        mask_patch = raw_inp[..., 3:]

Thank you for your time. I was greatly inspired by your contribution, and I sincerely hope that a satisfactory result can be reproduced. I apologize if I have made any mistake; please do not hesitate to point it out.

Thank you very much!

Best,
Vivian

@liuquande
Owner

Hi Vivian,

It seems that you directly resized the original images and labels to 384×384 for network training. In our experiments, the model was trained on center-cropped ROI regions instead of the original images, which may explain the large difference from our model's performance.

The released fundus data (Fundus) contains the center-cropped samples we used on the four datasets. Please try using these released datasets.
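(For illustration only, a generic center crop is sketched below; the exact ROI selection used for the released data may differ, e.g. cropping around the disc region rather than the image center.)

import numpy as np

def center_crop(image, size=384):
    # Crop a size x size region around the image center.
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]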

Best,
Quande.

@xxliang99
Author

xxliang99 commented Jun 1, 2021

> Hi Vivian,
>
> It seems that you directly resized the original images and labels to 384×384 for network training. In our experiments, the model was trained on center-cropped ROI regions instead of the original images, which may explain the large difference from our model's performance.
>
> The released fundus data (Fundus) contains the center-cropped samples we used on the four datasets. Please try using these released datasets.
>
> Best,
> Quande.

Hi Quande,

Thank you for your reply. Yes, the data I used was downloaded from the provided DoFE repository, and the download link I used is in its "Usage" section, as shown in the picture below. The link provided in DoFE refers to this.

[screenshot: the download link in the "Usage" section of the DoFE repository]

Besides, I did not find any other released dataset in DoFE. I also double-checked that in Domains 1 and 2 the images seem to be center-cropped (although they are not square in resolution), but in Domains 3 and 4 they are not.

[screenshots: example images from Domains 1–2 and Domains 3–4]

I would be grateful if you could provide further suggestions.

Thank you very much!

Best,
Vivian

@SakurajimaMaiii

The center-cropped data can be found in Domain3/train/ROIs/image and Domain3/train/ROIs/mask; the same applies to Domain4.

[screenshot: directory listing showing the ROIs folders]

@aldiak

aldiak commented Sep 19, 2022


Hi, can you please share with me the code you use to organize your datasets? Thank you

@freshman97

> Another question: from the provided code, I did not observe any difference in how data from Domain 2 is treated compared to the other domains. If 2 views are generated from each image in Domain 2, as explained in issue #8, the aggregation weights should depend on [101, 318, 400, 400], but in line 63 of train_ELCFS.py the numbers are still [101, 159, 400, 400]. If the 2 views are stored in a single .npy file, then the code in lines 37–45 of fundus_dataloader.py would raise errors because of the change in numpy array size.

Hello Liang,

I also experienced the same problem. My trained result is very similar to yours, and the final average Dice across the four domains is around 74. So I think the problem might not be on your side; I hope the author can disclose more training details. @liuquande
