# Get total dataset size
This notebook tallies how much disk space the projections, reconstructions and .zarr datasets take up.
It's loosely based on what [we did for the acini project](https://github.com/habi/acinar-analysis/blob/master/DataSizeBragging.ipynb) with Johannes.

In [1]:
import platform
import glob
import os
import pandas
import matplotlib.pyplot as plt
import seaborn

In [2]:
def get_git_hash():
    '''
    Get the current git hash from the repository.
    Based on http://stackoverflow.com/a/949391/323100 and
    http://stackoverflow.com/a/18283905/323100
    '''
    from subprocess import Popen, PIPE
    import os
    gitprocess = Popen(['git',
                        '--git-dir',
                        os.path.join(os.getcwd(), '.git'),
                        'rev-parse',
                        '--short',
                        '--verify',
                        'HEAD'],
                       stdout=PIPE)
    (output, _) = gitprocess.communicate()
    return output.strip().decode("utf-8")
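
As written, `get_git_hash()` returns an empty string when the notebook is run outside a git checkout and raises if `git` is not installed. A minimal sketch of a more defensive variant follows. The name `get_git_hash_safe` and the `'nogit'` fallback label are hypothetical, and it lets git locate the repository from the current directory instead of passing `--git-dir`.

```python
import subprocess


def get_git_hash_safe(fallback='nogit'):
    """Return the short git hash of HEAD, or `fallback` if it cannot be determined."""
    try:
        result = subprocess.run(['git', 'rev-parse', '--short', '--verify', 'HEAD'],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return fallback
    # Guard against an unexpectedly empty answer
    return result.stdout.strip() or fallback
```

With this, the output directory always gets a non-empty last component.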

In [3]:
# Make directory for output
OutPutDir = os.path.join(os.getcwd(), 'Output', get_git_hash())
print('We are saving all the output to %s' % OutPutDir)
os.makedirs(OutPutDir, exist_ok=True)

We are saving all the output to /home/habi/P/Documents/AcinarSize_Johannes/Output/1eeb872


In [4]:
# The data lives in different locations on Linux and on Windows.
# Set FastSSD to True to read from the fast SSD, which speeds things up significantly.
FastSSD = False
if 'Linux' in platform.system():
    if FastSSD:
        BasePath = os.path.join(os.sep, 'media', 'habi', 'Fast_SSD')
    else:
        BasePath = os.path.join(os.sep, 'home', 'habi', '1272')
else:
    if FastSSD:
        BasePath = os.path.join('F:\\')
    else:
        if 'anaklin' in platform.node():
            BasePath = os.path.join('S:\\')
        else:
            BasePath = os.path.join('D:\\Results')
Root = os.path.join(BasePath, 'ZMK')
print('We are loading all the data from %s' % Root)

We are loading all the data from /home/habi/1272/ZMK


In [5]:
# Make us a dataframe for saving all that we need
Data = pandas.DataFrame()

In [6]:
# Look only for folders: https://stackoverflow.com/a/38216530
Data['Folder'] = glob.glob(os.path.join(Root, 'ToothBattallion', '*' + os.path.sep))

In [7]:
print('We found %s tooth folders in %s' % (len(Data), Root))

We found 104 tooth folders in /home/habi/1272/ZMK


In [8]:
# We could do it in a list comprehension, but then it fails if we're still scanning a tooth
# Data['LogFile'] = [sorted(glob.glob(os.path.join(f, '*.log')))[0] for f in Data['Folder']]
for c, row in Data.iterrows():
    try:
        Data.at[c, 'LogFile'] = sorted(glob.glob(os.path.join(row['Folder'], '*.log')))[0]
    except IndexError:
        print('No logfile found in %s, removing the folder temporarily' % row.Folder)
        Data.at[c, 'LogFile'] = 'scanning'
Data = Data[Data['LogFile'] != 'scanning']
Data.reset_index(drop=True, inplace=True)
print('We have %s tooth folders to work with' % (len(Data)))

We have 104 tooth folders to work with
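
The try/except loop above could also be written with a small helper that returns `None` for folders that are still being scanned. This is only a sketch, and the helper name `first_logfile` is hypothetical.

```python
import glob
import os


def first_logfile(folder):
    """Return the alphabetically first *.log file in `folder`, or None if there is none."""
    logfiles = sorted(glob.glob(os.path.join(folder, '*.log')))
    return logfiles[0] if logfiles else None
```

The result can be assigned with `Data['LogFile'] = [first_logfile(f) for f in Data['Folder']]`, after which `Data.dropna(subset=['LogFile'])` removes the folders without a log file.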


In [9]:
Data['Sample'] = [os.path.splitext(os.path.basename(l))[0] for l in Data['LogFile']]

In [10]:
# Proper sorting *with* leading zeros :)
Data.sort_values(by=['Sample'], inplace=True)

In [11]:
# # Temporarily drop some data
# Data = Data[:3]
# print('We are currently working with a subset of %s teeth' % len(Data))

In [12]:
# Get the projection details
Data['Projections'] = [sorted(glob.glob(os.path.join(f,
                                                     '*.tif'))) for f in Data['Folder']]
Data['NumProj'] = [len(r) for r in Data['Projections']]

In [13]:
print('In total, we have %s projections over all the %s datasets'
      % (Data['NumProj'].sum(),
         len(Data)))

In total, we have 239173 projections over all the 104 datasets


In [14]:
print('On average, we recorded about %s projections for each of the %s teeth.'
      % (int(round(Data['NumProj'].mean())),
         len(Data)))

On average, we recorded about 2300 projections for each of the 104 teeth.


In [15]:
# Get the size of the original TIF files
Data['SizeProj'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['Projections']]
Data['SizeProjSum'] = [sum(sizes) for sizes in Data['SizeProj']]

In [16]:
Data['SizeProjSum']

0      4568284528
16    10231938720
27    10231938720
38     5763828950
49     8615737750
         ...     
2      8615737750
3      8615737750
4      8615737750
5      8615737750
6      8615737750
Name: SizeProjSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/*.tif
````
in a Linux console
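
`du -b` reports apparent sizes in bytes, which is also what `os.path.getsize` returns. The same sum-over-a-glob pattern is repeated for every data type below, so it could be wrapped in one helper. A minimal sketch follows. The name `glob_size` is hypothetical, and unlike the cells in this notebook it skips anything that is not a regular file.

```python
import glob
import os


def glob_size(pattern):
    """Sum the apparent sizes (in bytes) of all regular files matching `pattern`."""
    return sum(os.path.getsize(f) for f in glob.glob(pattern) if os.path.isfile(f))
```

For one tooth folder `f`, `glob_size(os.path.join(f, '*.tif'))` should then give the same number as the corresponding `SizeProjSum` entry.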

In [17]:
print('On average, the projections of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeProjSum'].mean() * 1e-9))

On average, the projections of each of the 104 assessed samples are 8.17 GB in size


In [18]:
print('In total, all projections are %0.2f GB in size' % (Data['SizeProjSum'].sum() * 1e-9))

In total, all projections are 849.59 GB in size
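
All sizes in this notebook are multiplied by `1e-9`, so they are decimal gigabytes (GB). File managers and tools that report binary units (GiB) will show smaller numbers for the same data. A small sketch of the difference, using one of the `SizeProjSum` values printed above (the function names are hypothetical):

```python
def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes * 1e-9


def to_gib(num_bytes):
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


size = 8615737750  # bytes, taken from the SizeProjSum output
print('%0.2f GB' % to_gb(size))    # 8.62 GB
print('%0.2f GiB' % to_gib(size))  # 8.02 GiB
```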


----

In [19]:
# Get the file names of the reconstructions
Data['Reconstructions'] = [sorted(glob.glob(os.path.join(f,
                                                         'rec',
                                                         '*rec*.png'))) for f in Data['Folder']]
Data['NumRec'] = [len(r) for r in Data['Reconstructions']]

In [20]:
print('In total, we have %s reconstructions for all the %s datasets'
      % (Data['NumRec'].sum(),
         len(Data)))

In total, we have 282062 reconstructions for all the 104 datasets


In [21]:
print('On average, each of the %s tooth datasets has about %s reconstructions.'
      % (len(Data),
         int(round(Data['NumRec'].mean()))))

On average, each of the 104 tooth datasets has about 2712 reconstructions.


In [22]:
# Get the size of the reconstruction PNG files
Data['SizeRec'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['Reconstructions']]
Data['SizeRecSum'] = [sum(sizes) for sizes in Data['SizeRec']]

In [23]:
Data['SizeRecSum']

0     2322645692
16    3403138701
27    4086408490
38    3511385613
49    3134716556
         ...    
2     3274405150
3     4315113565
4     3877259032
5     3347123825
6     4327010362
Name: SizeRecSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/rec/*.png
````
in a Linux console

In [24]:
print('On average, the reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecSum'].mean() * 1e-9))

On average, the reconstructions of each of the 104 assessed samples are 3.13 GB in size


In [25]:
print('In total, the reconstructions are %0.2f GB in size' % (Data['SizeRecSum'].sum() * 1e-9))

In total, the reconstructions are 325.54 GB in size


----

In [26]:
# Get the file names of the zarred reconstructions
Data['ReconstructionsZarr'] = [sorted(glob.glob(os.path.join(f,
                                                             '*rec.zarr', '*'))) for f in Data['Folder']]

In [27]:
Data['SizeRecZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['ReconstructionsZarr']]
Data['SizeRecZarrSum'] = [sum(sizes) for sizes in Data['SizeRecZarr']]

In [28]:
Data['SizeRecZarrSum']

0     2187845495
16    3238810024
27    3797976564
38    3501731282
49    3179400124
         ...    
2     3321625321
3     4373727364
4     4070391520
5     3396211408
6     4341612924
Name: SizeRecZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_rec.zarr/*
````
in a Linux console
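
Both the Python glob and the shell `*` skip names starting with a dot (for example the `.zarray` metadata file of a Zarr v2 directory store), and neither descends into subdirectories. To measure everything a store occupies, we can walk the directory instead. This is a sketch that assumes the `.zarr` stores are plain directories on disk; the name `directory_size` is hypothetical.

```python
import os


def directory_size(path):
    """Sum the sizes (in bytes) of all regular files below `path`, hidden ones included."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            # Do not follow symbolic links, count only real files
            if not os.path.islink(filepath):
                total += os.path.getsize(filepath)
    return total
```

The matching console command would be `du -sb` on the store itself, without the trailing `/*`.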

In [29]:
print('On average, the *zarred* reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecZarrSum'].mean() * 1e-9))

On average, the *zarred* reconstructions of each of the 104 assessed samples are 3.14 GB in size


In [30]:
print('In total, the *zarred* reconstructions are %0.2f GB in size' % (Data['SizeRecZarrSum'].sum() * 1e-9))

In total, the *zarred* reconstructions are 326.36 GB in size


----

In [31]:
# Get the file names of the cropped, zarred reconstructions
Data['ReconstructionsCropZarr'] = [sorted(glob.glob(os.path.join(f,
                                                                 '*rec_crop.zarr', '*'))) for f in Data['Folder']]

In [32]:
Data['SizeRecCropZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['ReconstructionsCropZarr']]
Data['SizeRecCropZarrSum'] = [sum(sizes) for sizes in Data['SizeRecCropZarr']]

In [33]:
Data['SizeRecCropZarrSum']

0      753157774
16     887660265
27    1396845850
38    1035119296
49     859072421
         ...    
2     3321625321
3      787210022
4     3292101958
5     3053843562
6      879544202
Name: SizeRecCropZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_rec_crop.zarr/*
````
in a Linux console

In [34]:
print('On average, the cropped reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecCropZarrSum'].mean() * 1e-9))

On average, the cropped reconstructions of each of the 104 assessed samples are 1.37 GB in size


In [35]:
print('In total, the cropped reconstructions are %0.2f GB in size' % (Data['SizeRecCropZarrSum'].sum() * 1e-9))

In total, the cropped reconstructions are 142.00 GB in size


----

In [36]:
# Get the file names of the zarred pulpa
Data['PulpaZarr'] = [sorted(glob.glob(os.path.join(f,
                                                   '*pulpa.zarr', '*'))) for f in Data['Folder']]

In [37]:
Data['SizePulpaZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['PulpaZarr']]
Data['SizePulpaZarrSum'] = [sum(sizes) for sizes in Data['SizePulpaZarr']]

In [38]:
Data['SizePulpaZarrSum']

0      1579581
16     1447936
27     2317964
38     2814709
49     2845325
        ...   
2      8656858
3      1881379
4     16420815
5      7826571
6      1664073
Name: SizePulpaZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_pulpa.zarr/*
````
in a Linux console

In [39]:
print('On average, the extracted pulpa of each of the %s assessed samples '
      'are %0.2f MB in size' % (len(Data),
                                Data['SizePulpaZarrSum'].mean() * 1e-6))

On average, the extracted pulpa of each of the 104 assessed samples are 3.42 MB in size


In [40]:
print('In total, the extracted pulpa are %0.2f MB in size' % (Data['SizePulpaZarrSum'].sum() * 1e-6))

In total, the extracted pulpa are 355.79 MB in size


----

In [41]:
# Get the file names of the reformatted bottom part
Data['BottomFiles'] = [sorted(glob.glob(os.path.join(f,
                                                     'base_reslice',
                                                     '*.png'))) for f in Data['Folder']]

In [42]:
Data['SizeBase'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['BottomFiles']]
Data['SizeBaseSum'] = [sum(sizes) for sizes in Data['SizeBase']]

In [43]:
Data['SizeBaseSum']

0     100539658
16     79626508
27    132858342
38     84835069
49     69946256
        ...    
2     273638687
3      82951126
4     348122415
5     288166988
6     111953882
Name: SizeBaseSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/base_reslice/*.png
````
in a Linux console

In [44]:
print('On average, the resliced base images of each of the %s assessed samples '
      'are %0.2f MB in size' % (len(Data),
                                Data['SizeBaseSum'].mean() * 1e-6))

On average, the resliced base images of each of the 104 assessed samples are 126.75 MB in size


In [46]:
print('In total, the resliced base images are %0.2f GB in size' % (Data['SizeBaseSum'].sum() * 1e-9))

In total, the resliced base images are 13.18 GB in size
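
To get one number for the whole dataset, the totals printed above can simply be added up. The sketch below hard-codes the rounded values from the outputs of this notebook, so the sum is only as precise as those two decimals. The dictionary labels are our own.

```python
# Totals as printed by the cells above, in decimal GB
totals_gb = {
    'projections (TIF)': 849.59,
    'reconstructions (PNG)': 325.54,
    'reconstructions (zarr)': 326.36,
    'cropped reconstructions (zarr)': 142.00,
    'pulpa (zarr)': 355.79e-3,  # printed in MB above
    'base reslice (PNG)': 13.18,
}
grand_total_gb = sum(totals_gb.values())
for name, size in totals_gb.items():
    print('%-32s %8.2f GB' % (name, size))
print('%-32s %8.2f GB' % ('everything', grand_total_gb))
```

Working on the dataframe directly, the same sum can be had from the `Size...Sum` columns without the rounding.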
