Wakehealth dataset: large unannexed files #177

Closed
kousu opened this issue May 13, 2022 · 1 comment
kousu commented May 13, 2022

The wakehealth dataset has become slow to download:

p115628@joplin:~/datasets$ git clone git@data.neuro.polymtl.ca:datasets/wakehealth
Cloning into 'wakehealth'...
remote: Enumerating objects: 11138, done.
remote: Counting objects: 100% (11138/11138), done.
remote: Compressing objects: 100% (5655/5655), done.
remote: Total 11138 (delta 6042), reused 5618 (delta 2377), pack-reused 0
Receiving objects: 100% (11138/11138), 5.92 GiB | 39.49 MiB/s, done.
Resolving deltas: 100% (6042/6042), done.

The reason is that its files are distributed unevenly, with more data stored as plain git objects than in the annex:

git@data:~/repositories/datasets$ (cd wakehealth.git/; du -hs objects/ annex/)
6.0G    objects/
2.1G    annex/

In contrast, the other datasets are weighted the other way:

git@data:~/repositories/datasets$ (cd beijing-tumor.git/; du -hs objects/ annex/)
1.7M    objects/
16G     annex/
git@data:~/repositories/datasets$ (cd philadelphia-pediatric.git/; du -hs objects/ annex/)
4.1M    objects/
2.2G    annex/
git@data:~/repositories/datasets$ (cd spine-generic-processed.git/; du -hs objects/ annex/)
36M     objects/
16G     annex/
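
The per-repository checks above could be repeated for every dataset in one pass. This is only a minimal sketch: the function name `report_balance` is made up for illustration, and it assumes each dataset is a bare repository named `*.git` with `objects/` and (optionally) `annex/` directly inside it, as in the listings above.

```shell
# Report git-object vs. annex size (in KiB) for every bare repository under a
# directory, flagging repositories where plain git objects outweigh the annex.
# report_balance is a hypothetical helper name, not an existing tool.
report_balance() {
    root=${1:-.}
    for repo in "$root"/*.git; do
        # Skip anything that is not a repository (including an unmatched glob).
        [ -d "$repo/objects" ] || continue
        # du -sk prints "<KiB><TAB><path>"; keep only the number.
        objects_kib=$(du -sk "$repo/objects" | cut -f1)
        if [ -d "$repo/annex" ]; then
            annex_kib=$(du -sk "$repo/annex" | cut -f1)
        else
            annex_kib=0   # a repository without an annex counts as 0
        fi
        if [ "$objects_kib" -gt "$annex_kib" ]; then
            flag=UNBALANCED
        else
            flag=ok
        fi
        printf '%s\t%s\t%s\t%s\n' "$(basename "$repo")" "$objects_kib" "$annex_kib" "$flag"
    done
}
```

Called as `report_balance ~/repositories/datasets`, it prints one tab-separated line per repository: name, objects size, annex size, and a flag. The "objects larger than annex" rule is a rough heuristic chosen for this sketch, not an established threshold.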

Please figure out why and fix it.

kousu self-assigned this May 13, 2022

kousu commented May 13, 2022

I used the approach from "How to find/identify large commits in git history?" to determine that the cause is a folder called 'sourcedata' that's full of large Hamamatsu .ndpi files.

git@data:~/repositories/datasets/wakehealth.git$ git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort --numeric-sort --key=2 |
    cut -c 1-12,41- |
    $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest |
    tail -n 100
e60bcf433d3e    296B 087/a8b/SHA256E-s8411822--c5cf2814003e957625305f1c3abc1a508b5ee1a141b4931d2b88134ad03d2868.tif.log
e6d9e59ad5fa    296B ad5/8fd/SHA256E-s47376--3cfc86663e3507b356a22ca632644f9da3528f7ca83885325c4e8e65f67022fc.tif.log
e71b53e09536    296B 7d1/fae/SHA256E-s6765366--e198987c8c0a1ff411500a078a91fd475fb8c8ae39a4c915cc1b87adb4204290.tif.log
e94d58e1026d    296B 62c/343/SHA256E-s5482050--617e6f90e7bbda8b9402ac86084c0013f9971fbac9373ad9ea449c9dfc9abcb0.tif.log
eb39b7fa2402    296B 799/151/SHA256E-s11992332--a6100e9fd5e500e38fb75f312be5ebf77b8b5b89053c59b8fa6395d760c7a355.tif.log
ec6b76a28613    296B 8ea/332/SHA256E-s11066482--cd64c2f5e88c5908ef2b9b15bd8990750921dd6cbb3a064b269be3220c79ca63.tif.log
ee7c8407addc    296B 748/28a/SHA256E-s5184290--0df2f0e44feb6e1b6710e91c6e247ef27120fcab9acc52d7523dd72b6692e78b.tif.log
f12b537a0120    296B 7a4/352/SHA256E-s6593606--cb5f7435dceb568d96f1dd6b4f57eadbaf8029174e27c52580e8038ef64be1c7.tif.log
f299bef40747    296B ef8/487/SHA256E-s8345880--e0016965267984c406f4d87df1c75693bf18f36edd9505e960c1ae7e572d80e5.tif.log
f38efa0a8d83    296B 0a0/e0b/SHA256E-s3689220--dcded11818dac86a9b33045fa6b9755d878124a64178073d19a1eeaf7b9caef2.tif.log
f41511e66273    296B c07/32f/SHA256E-s3326216--ab655e85141f152d262b567598a93eb76ba52a94be733bbef40d31bb5b34ff81.tif.log
f820c931bf4d    296B 873/a90/SHA256E-s2903244--980d2d6eecb939ff419caa70fc00c965139b3fba26c4cd9e92d430abab8faef2.tif.log
f83ca4eb127f    296B 306/ec1/SHA256E-s53128--9dd4e4827d7f3ed05fbefed2ca05fea7c544705cfc1d0818355f2d3b3e90ae03.tif.log
f8d0570118ce    296B 9d8/083/SHA256E-s3803310--7f784fb8c9a02c514cee2658fcf30a82b9e4429bc0ea152379637c35e6bab1ac.tif.log
f96eb61f610c    296B 1d9/6c9/SHA256E-s5235508--976d2679e9ddd447c0cad920e46cec60bc9169431cf7a26039a63efbb4758aa8.tif.log
fe5725923293    296B 51f/378/SHA256E-s66966--28384393e2a955942a4bcb4b91eda71b8e645a2be3b012b9fa410fe974dc4b66.tif.log
feb595aedfbe    296B 33d/788/SHA256E-s6763992--394eda4ecfcceb65b4179ea35f45f90dea2b5066e4832122cafe34934e711eed.tif.log
ff906fc6935f    296B 769/509/SHA256E-s4355134--9dbeb03667c6378d5520d21af3fb8da0c9d35a3ac3cd528eb81b50e911deef32.tif.log
95be6a82ce78    312B samples.json
f00fdd057845    360B uuid.log
4f3c3f2607c2    477B uuid.log
d69d384faf17    535B participants.tsv
5a3a7121a92d    685B participants.tsv
0aad2c83a0b1    961B README
821d910921d8  1.6KiB samples.tsv
228fe31b584a  2.0KiB samples.tsv
a75037611793   16KiB .DS_Store
b09b789ec056   36KiB derivatives/labels/sub-21133/microscopy/sub-21133_sample-21133_acq-crop_chunk-4_BF_seg-axon-manual.png
3562f8b0de29   53KiB derivatives/labels/sub-21151/microscopy/sub-21151_sample-21151_acq-crop_chunk-9_BF_seg-axon-manual.png
a0395d2dd663   83KiB derivatives/labels/sub-21133/microscopy/sub-21133_sample-21133_acq-crop_chunk-4_BF_seg-myelin-manual.png
b3a591f5e6dd  105KiB derivatives/labels/sub-21133/microscopy/sub-21133_sample-21133_acq-crop_chunk-4_BF_seg-axonmyelin-manual.png
5a4c4253a470  107KiB derivatives/labels/sub-21151/microscopy/sub-21151_sample-21151_acq-crop_chunk-9_BF_seg-myelin-manual.png
49334b4eecd9  125KiB derivatives/labels/sub-21151/microscopy/sub-21151_sample-21151_acq-crop_chunk-9_BF_seg-axonmyelin-manual.png
547f326d52bd   38MiB sourcedata/sub-21145/micr/21-145_-_2021-06-02_12.12.04.ndpi
cfeef71134e4   42MiB sourcedata/sub-17268/micr/17-268_94144536_-_2017-09-27_22.44.10.ndpi
1436060e62b6   50MiB sourcedata/sub-21159/micr/21-159_-_2021-06-17_14.31.05.ndpi
eb00d0bb04f6   50MiB sourcedata/sub-21129/micr/21-129.ndpi
55a7bfd6f7d6   54MiB sourcedata/sub-21148/micr/21-148_-_2021-06-02_12.17.42.ndpi
533c60ef78af   56MiB sourcedata/sub-21161/micr/21-161_-_2021-06-17_14.35.09.ndpi
5e6d58a7b1cf   60MiB sourcedata/sub-17290/micr/17-290_-_2017-09-27_22.51.18.ndpi
0ed20b7f8792   60MiB sourcedata/sub-21160/micr/21-160_-_2021-06-17_14.33.03.ndpi
83847fedecef   62MiB sourcedata/sub-21140/micr/21-140_-_2021-06-02_12.02.28.ndpi
247e89043265   62MiB sourcedata/sub-21142/micr/21-142_-_2021-06-02_12.06.30.ndpi
8bb2c565a9b5   63MiB sourcedata/sub-21157/micr/21-157_-_2021-06-17_14.24.59.ndpi
466f46fc277f   65MiB sourcedata/sub-21115/micr/21-115_-_2021-05-11_13.31.28.ndpi
78536722c022   66MiB sourcedata/sub-21127/micr/21-127.ndpi
99b14b30d6a7   72MiB sourcedata/sub-21123/micr/21-123_-_2021-05-11_13.33.23.ndpi
485f91753890   73MiB sourcedata/sub-17288/micr/17-288_-_2017-09-27_23.03.50.ndpi
dbb8776e0b68   73MiB sourcedata/sub-21137/micr/21-137.ndpi
56d44382c3ad   73MiB sourcedata/sub-21144/micr/21-144_-_2021-06-02_12.10.11.ndpi
3a6ee4c3141c   77MiB sourcedata/sub-21146/micr/21-146_-_2021-06-02_12.13.33.ndpi
2a6bacf58ea7   77MiB sourcedata/sub-21135/micr/21-135.ndpi
d2af68a42fa3   77MiB sourcedata/sub-21126/micr/21-126.ndpi
79d439aa69dc   79MiB sourcedata/sub-21119/micr/21-119_-_2021-05-11_13.21.23.ndpi
5cf247ed5e45   79MiB sourcedata/sub-21134/micr/21-134.ndpi
21642f4e8dde   79MiB sourcedata/sub-21143/micr/21-143_-_2021-06-02_12.08.18.ndpi
55b93f80b3a0   84MiB sourcedata/sub-17298/micr/17-298_-_2017-09-27_23.11.37.ndpi
3e1cd5c42612   86MiB sourcedata/sub-21122/micr/21-122_-_2021-05-11_13.35.51.ndpi
c0cffb5a83d4   86MiB sourcedata/sub-17293/micr/17-293_-_2017-09-27_22.53.11.ndpi
599d33520c49   87MiB sourcedata/sub-21163/micr/21-163_-_2021-06-17_14.39.28.ndpi
9728c66904e3   88MiB sourcedata/sub-21141/micr/21-141_-_2021-06-02_12.04.18.ndpi
920667263d92   89MiB sourcedata/sub-21118/micr/21-118_-_2021-05-11_13.23.54.ndpi
85498622cc31   90MiB sourcedata/sub-21162/micr/21-162_-_2021-06-17_14.37.03.ndpi
2d7523ea1cf0   91MiB sourcedata/sub-17289/micr/17-289_-_2017-09-27_23.09.45.ndpi
843b73a34120   92MiB sourcedata/sub-21156/micr/21-156_-_2021-06-17_14.22.04.ndpi
284713bfd8d4   94MiB sourcedata/sub-17295/micr/17-295_-_2017-09-27_23.16.45.ndpi
c7906d5a8cfd   99MiB sourcedata/sub-21117/micr/21-117_-_2021-05-11_13.26.17.ndpi
a992273c4333  100MiB sourcedata/sub-21120/micr/21-120_-_2021-05-11_13.12.50.ndpi
990fb5dd6a04  101MiB sourcedata/sub-17269/micr/17-269_45635549_-_2017-09-27_22.48.49.ndpi
001557db4a8d  103MiB sourcedata/sub-21147/micr/21-147_-_2021-06-02_12.15.29.ndpi
dd24489fad0d  103MiB sourcedata/sub-21132/micr/21-132.ndpi
ff6d68ea03b9  103MiB sourcedata/sub-21149/micr/21-149_-_2021-06-02_12.28.54.ndpi
4e965e181571  104MiB sourcedata/sub-21139/micr/21-139.ndpi
c574af021a71  105MiB sourcedata/sub-17302/micr/17-302_-_2017-09-27_22.58.14.ndpi
d9b33e7deae0  106MiB sourcedata/sub-21125/micr/21-125.ndpi
fa635ad097e1  110MiB sourcedata/sub-21138/micr/21-138.ndpi
5356534f0f3d  110MiB sourcedata/sub-21124/micr/21-124_-_2021-05-11_13.18.46.ndpi
1d382ddf6598  112MiB sourcedata/sub-17296/micr/17-296_-_2017-09-27_22.45.47.ndpi
921144298e4c  112MiB sourcedata/sub-21116/micr/21-116_-_2021-05-11_13.28.50.ndpi
5de6e8968bdd  116MiB sourcedata/sub-21155/micr/21-155_-_2021-06-17_14.19.03.ndpi
4b0ae0a0c0c8  117MiB sourcedata/sub-17299/micr/17-299_-_2017-09-27_23.14.03.ndpi
973e4b777816  118MiB sourcedata/sub-21133/micr/21-133.ndpi
a3b94a3783b2  123MiB sourcedata/sub-21153/micr/21-153_-_2021-06-17_14.13.46.ndpi
00472ae07bd2  126MiB sourcedata/sub-17301/micr/17-301_-_2017-09-27_23.00.49.ndpi
7e8425f26541  130MiB sourcedata/sub-17303/micr/17-303_-_2017-09-27_22.55.21.ndpi
e24815f275ec  136MiB sourcedata/sub-21131/micr/21-131.ndpi
ab5961a15c43  138MiB sourcedata/sub-21164/micr/21-164_-_2021-06-17_14.41.38.ndpi
3e088f6aaf23  139MiB sourcedata/sub-21121/micr/21-121_-_2021-05-11_13.38.31.ndpi
d0c1aff63d47  149MiB sourcedata/sub-21128/micr/21-128.ndpi
473a697573eb  154MiB sourcedata/sub-21158/micr/21-158_-_2021-06-17_14.27.06.ndpi
2f6cccbd7bfe  165MiB sourcedata/sub-21154/micr/21-154_-_2021-06-17_14.16.19.ndpi
029456a1b343  167MiB sourcedata/sub-21150/micr/21-150_-_2021-06-17_14.03.55.ndpi
328a5a024189  167MiB sourcedata/sub-21136/micr/21-136.ndpi
21859196b0cd  179MiB sourcedata/sub-21130/micr/21-130.ndpi
1fc1f95e7bc6  179MiB sourcedata/sub-21152/micr/21-152_-_2021-06-17_14.10.37.ndpi
b4a155d68b5d  183MiB sourcedata/sub-17297/micr/17-297.ndpi
f2d029ede2dc  193MiB sourcedata/sub-17300/micr/17-300.ndpi
0d97a07bd8a0  197MiB sourcedata/sub-17294/micr/17-294_-_2017-09-27_23.05.56.ndpi
a3f6ff02aa8b  205MiB sourcedata/sub-21151/micr/21-151-_2021-06-17_14.07.12.ndpi
0f667ea56080  216MiB sourcedata/sub-17292/micr/17-292_-_2017-09-27_23.18.47.ndpi
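
The listing above shows individual blobs; a per-directory total can make the culprit more obvious. The following is a rough sketch, not part of the original investigation: `sum_by_top_dir` is a hypothetical helper that reads the raw `--batch-check` output (before the `sed`, `cut` and `numfmt` steps) and sums blob sizes by top-level path component. It assumes top-level directory names contain no spaces.

```shell
# Sum blob sizes (in bytes) per top-level path component.
# Input on stdin: lines of "<objecttype> <objectname> <objectsize> <path>",
# i.e. the format requested from `git cat-file --batch-check` above.
# sum_by_top_dir is a hypothetical helper name.
sum_by_top_dir() {
    awk '$1 == "blob" && NF >= 4 {
             # parts[1] is the first path component (a directory or a root file).
             split($4, parts, "/")
             total[parts[1]] += $3
         }
         END {
             # %.0f avoids integer overflow in awk variants with 32-bit %d.
             for (dir in total) printf "%.0f\t%s\n", total[dir], dir
         }' | sort -rn
}

# Possible usage inside a repository:
# git rev-list --objects --all |
#   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#   sum_by_top_dir
```

The output is sorted largest first, so a directory like `sourcedata` would be expected to appear at the top if it dominates the repository.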

It looks like these files don't exist on master, though; they're actually part of #175:

git@data:~/repositories/datasets/wakehealth.git$ git log --name-only ac/bids-update -- 'sourcedata/*/*/*.ndpi'
commit 80f5fba68a0e58c4b00f2634eed898ef1bbe2445
Author: Armand Collin <armac@videotron.ca>
Date:   Thu Feb 10 11:31:05 2022 -0500

    Added NDPI source data

sourcedata/sub-17268/micr/17-268_94144536_-_2017-09-27_22.44.10.ndpi
sourcedata/sub-17269/micr/17-269_45635549_-_2017-09-27_22.48.49.ndpi
sourcedata/sub-17288/micr/17-288_-_2017-09-27_23.03.50.ndpi
sourcedata/sub-17289/micr/17-289_-_2017-09-27_23.09.45.ndpi
sourcedata/sub-17290/micr/17-290_-_2017-09-27_22.51.18.ndpi
sourcedata/sub-17292/micr/17-292_-_2017-09-27_23.18.47.ndpi
sourcedata/sub-17293/micr/17-293_-_2017-09-27_22.53.11.ndpi
sourcedata/sub-17294/micr/17-294_-_2017-09-27_23.05.56.ndpi
sourcedata/sub-17295/micr/17-295_-_2017-09-27_23.16.45.ndpi
sourcedata/sub-17296/micr/17-296_-_2017-09-27_22.45.47.ndpi
sourcedata/sub-17297/micr/17-297.ndpi
[...]

I'll deal with it over there.
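
To double-check which branches actually carry the files, something along these lines could be used. It is a sketch under assumptions: `branches_with_path` is a made-up helper name, it only looks at the tip of each local branch (not at history), and it must be run from inside the repository.

```shell
# List local branches whose tip tree contains at least one file under a path.
# branches_with_path is a hypothetical helper; run it from inside the
# repository (bare or non-bare).
branches_with_path() {
    path=$1
    git for-each-ref --format='%(refname:short)' refs/heads |
    while read -r branch; do
        # ls-tree prints nothing when the path is absent from that branch's tree.
        if [ -n "$(git ls-tree -r --name-only "$branch" -- "$path" | head -n 1)" ]; then
            printf '%s\n' "$branch"
        fi
    done
}
```

For example, `branches_with_path sourcedata` would list only the branches where the `sourcedata` folder is present at the tip. Blobs that were committed and later deleted on a branch would still be stored in the repository but would not be reported by this check.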
