# Get total dataset size
This notebook measures how much disk space the projections, reconstructions and .zarr datasets take up.
It is loosely based on what [we did for the acini project](https://github.com/habi/acinar-analysis/blob/master/DataSizeBragging.ipynb) with Johannes.

In [1]:
import platform
import glob
import os
import pandas
import matplotlib.pyplot as plt
import seaborn

In [2]:
def get_git_hash():
    '''
    Get the current git hash from the repository.
    Based on http://stackoverflow.com/a/949391/323100 and
    http://stackoverflow.com/a/18283905/323100
    '''
    from subprocess import Popen, PIPE
    import os
    gitprocess = Popen(['git',
                        '--git-dir',
                        os.path.join(os.getcwd(), '.git'),
                        'rev-parse',
                        '--short',
                        '--verify',
                        'HEAD'],
                       stdout=PIPE)
    (output, _) = gitprocess.communicate()
    return output.strip().decode("utf-8")
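
The function above returns an empty string when `git` cannot resolve `HEAD`, and raises an error when `git` is not installed. A minimal sketch of a more defensive variant is shown below. The name `get_git_hash_or_default` and its `default` parameter are assumptions made for illustration, and unlike the original it lets `git` discover the repository from the current working directory instead of passing `--git-dir`.

```python
import subprocess


def get_git_hash_or_default(default='unknown'):
    """Return the short git hash of HEAD, or `default` if it cannot be determined."""
    try:
        result = subprocess.run(['git', 'rev-parse', '--short', '--verify', 'HEAD'],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not inside a repository
        return default
    # Fall back to the default if git printed nothing
    return result.stdout.strip() or default
```

With such a fallback the output directory would still get a usable name when the notebook is run outside of a git checkout.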

In [3]:
# Make directory for output
OutPutDir = os.path.join(os.getcwd(), 'Output', get_git_hash())
print('We are saving all the output to %s' % OutPutDir)
os.makedirs(OutPutDir, exist_ok=True)

We are saving all the output to P:\Documents\ZMK\Output\8d1ee91


In [4]:
# Data locations differ between Linux and Windows
# Set FastSSD to True to read from the fast SSD, which speeds things up significantly
FastSSD = False
if 'Linux' in platform.system():
    if FastSSD:
        BasePath = os.path.join(os.sep, 'media', 'habi', 'Fast_SSD')
    else:
        BasePath = os.path.join(os.sep, 'home', 'habi', '1272')
else:
    if FastSSD:
        BasePath = os.path.join('F:\\')
    else:
        if 'anaklin' in platform.node():
            BasePath = os.path.join('S:\\')
        else:
            BasePath = os.path.join('D:\\Results')
Root = os.path.join(BasePath, 'ZMK')
print('We are loading all the data from %s' % Root)

We are loading all the data from D:\Results\ZMK
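
The base path above is selected by operating system and host name. As a sketch of an alternative, the path could be overridden without editing the notebook by reading an environment variable. Both the function name `resolve_base_path` and the variable name `ZMK_BASEPATH` are hypothetical and not part of the original setup.

```python
import os


def resolve_base_path(default, env_var='ZMK_BASEPATH'):
    """Return the path stored in the environment variable `env_var`, or `default` if it is unset."""
    # ZMK_BASEPATH is a hypothetical name, chosen only for this illustration
    return os.environ.get(env_var, default)
```

Under these assumptions, `Root = os.path.join(resolve_base_path(BasePath), 'ZMK')` would keep the existing behaviour whenever the variable is not set.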


In [5]:
# Make us a dataframe for saving all that we need
Data = pandas.DataFrame()

In [6]:
# Look only for folders: https://stackoverflow.com/a/38216530
Data['Folder'] = glob.glob(os.path.join(Root,
                                        'ToothBattallion',
                                        '*' + os.path.sep))

In [7]:
print('We found %s tooth folders in %s' % (len(Data), Root))

We found 104 tooth folders in D:\Results\ZMK


In [9]:
# Get all the log files
Data['LogFile'] = [sorted(glob.glob(os.path.join(folder,
                                                 'proj',
                                                 '*.log')))[0] for folder in Data['Folder']]
print('We have %s log files to work with' % (len(Data)))

We have 104 log files to work with
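
The list comprehension above takes element `[0]` of each glob result, so it raises an `IndexError` as soon as one folder has no log file in `proj`. A small sketch of a helper that returns `None` in that case is given below. The name `first_log_file` is an assumption for illustration.

```python
import glob
import os


def first_log_file(folder):
    """Return the alphabetically first *.log file in folder/proj, or None if there is none."""
    logfiles = sorted(glob.glob(os.path.join(folder, 'proj', '*.log')))
    return logfiles[0] if logfiles else None
```

Folders without a log file could then be found by looking for missing values in the `LogFile` column before the sample names are constructed.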


In [10]:
# Construct sample names
Data['Sample'] = [os.path.splitext(os.path.basename(logfile))[0] for logfile in Data['LogFile']]

In [11]:
# Proper sorting *with* leading zeros :)
Data.sort_values(by=['Sample'], inplace=True)
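
Sorting by sample name only gives the numerical order because the names carry leading zeros. The sketch below illustrates the difference with made-up sample names, which are not taken from the dataset.

```python
# Hypothetical sample names, only to illustrate string sorting
padded = ['Tooth010', 'Tooth002', 'Tooth100']
unpadded = ['Tooth10', 'Tooth2', 'Tooth100']

# Zero-padded names sort in numerical order
print(sorted(padded))    # ['Tooth002', 'Tooth010', 'Tooth100']
# Without padding, 'Tooth100' sorts before 'Tooth2'
print(sorted(unpadded))  # ['Tooth10', 'Tooth100', 'Tooth2']
```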

In [12]:
# # Temporarily drop some data
# Data = Data[:3]
# print('We are currently working with a subset of %s teeth' % len(Data))

In [16]:
# Get the projection details
Data['Projections'] = [sorted(glob.glob(os.path.join(folder,
                                                     'proj',
                                                     '*.tif'))) for folder in Data['Folder']]
Data['NumProj'] = [len(r) for r in Data['Projections']]

In [17]:
print('In total, we have %s projections over all the %s datasets'
      % (Data['NumProj'].sum(),
         len(Data)))

In total, we have 2425 projections over all the 104 datasets


In [18]:
print('On average, we recorded about %s projections for each of the %s teeth.'
      % (int(round(Data['NumProj'].mean())),
         len(Data)))

On average, we recorded about 23 projections for each of the 104 teeth.


In [19]:
# Get the size of the original TIF files
Data['SizeProj'] = [[os.path.getsize(proj) for proj in projs] for projs in Data['Projections']]
Data['SizeProjSum'] = [sum(sizes) for sizes in Data['SizeProj']]

In [20]:
Data['SizeProjSum']

0     0
16    0
27    0
38    0
49    0
     ..
2     0
3     0
4     0
5     0
6     0
Name: SizeProjSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/proj/*.tif
````
in a Linux console.
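
With GNU `du`, the `-b` flag reports apparent sizes in bytes and `-c` adds a grand total, which is the same quantity that `os.path.getsize` returns per file. A rough Python counterpart is sketched below. The helper name `total_size` is an assumption for illustration.

```python
import glob
import os


def total_size(pattern):
    """Sum the apparent size in bytes of all regular files matching a glob pattern."""
    return sum(os.path.getsize(f) for f in glob.glob(pattern) if os.path.isfile(f))
```

Calling this helper with the same pattern as the `du` command should give a number close to the total line that `du` prints.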

In [21]:
print('On average, the projections of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeProjSum'].mean() * 1e-9))

On average, the projections of each of the 104 assessed samples are 0.08 GB in size


In [22]:
print('In total, all projections are %0.f GB in size' % (Data['SizeProjSum'].sum() * 1e-9))

In total, all projections are 9 GB in size
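
The sizes in this notebook are converted with a factor of `1e-9`, so they are decimal gigabytes (1 GB = 10^9 bytes). Tools that report binary units (1 GiB = 2^30 bytes) show smaller numbers for the same data. The sketch below compares the two for an arbitrary example value. The function names are assumptions for illustration.

```python
def bytes_to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes), as done in this notebook."""
    return num_bytes * 1e-9


def bytes_to_gib(num_bytes):
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


size = 3_000_000_000  # arbitrary example value, not taken from the dataset
print('%0.2f GB' % bytes_to_gb(size))    # 3.00 GB
print('%0.2f GiB' % bytes_to_gib(size))  # 2.79 GiB
```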


----

In [23]:
# Get the file names of the reconstructions
Data['Reconstructions'] = [sorted(glob.glob(os.path.join(folder,
                                                         'rec',
                                                         '*rec*.png'))) for folder in Data['Folder']]
Data['NumRec'] = [len(r) for r in Data['Reconstructions']]

In [24]:
print('In total, we have %s reconstructions for all the %s datasets'
      % (Data['NumRec'].sum(),
         len(Data)))

In total, we have 282025 reconstructions for all the 104 datasets


In [25]:
print('On average, each of the %s tooth datasets has about %s reconstructions.'
      % (len(Data),
         int(round(Data['NumRec'].mean()))))

On average, each of the 104 tooth datasets has about 2712 reconstructions.


In [23]:
# Get the size of the reconstructions
Data['SizeRec'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['Reconstructions']]
Data['SizeRecSum'] = [sum(sizes) for sizes in Data['SizeRec']]

In [24]:
Data['SizeRecSum']

0     2322645692
16    3403138701
27    4086408490
38    3511385613
49    3134716556
         ...    
2     3274405150
3     4315113565
4     3877259032
5     3347123825
6     4327010362
Name: SizeRecSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/rec/*.png
````
in a Linux console

In [25]:
print('On average, the reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecSum'].mean() * 1e-9))

On average, the reconstructions of each of the 104 assessed samples are 3.13 GB in size


In [26]:
print('In total, the reconstructions are %0.f GB in size' % (Data['SizeRecSum'].sum() * 1e-9))

In total, the reconstructions are 326 GB in size


----

In [26]:
# Get the file names of the zarred reconstructions
Data['ReconstructionsZarr'] = [sorted(glob.glob(os.path.join(folder,
                                                             '*rec.zarr', '*'))) for folder in Data['Folder']]

In [28]:
Data['SizeRecZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['ReconstructionsZarr']]
Data['SizeRecZarrSum'] = [sum(sizes) for sizes in Data['SizeRecZarr']]

In [29]:
Data['SizeRecZarrSum']

0     2236552493
16    3269925002
27    3848577222
38    3534229472
49    3196891467
         ...    
2     3344769715
3     4403832286
4     4109684013
5     3430171788
6     4404762486
Name: SizeRecZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_rec.zarr/*
````
in a Linux console
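
The glob pattern used for the `.zarr` stores only lists the entries directly inside each store. If a store keeps its chunks in nested subfolders (an assumption, since this depends on how the stores were written), the files inside those subfolders are not counted. A sketch of a recursive alternative based on `os.walk` is given below. The helper name `directory_size` is an assumption for illustration.

```python
import os


def directory_size(path):
    """Sum the size in bytes of all regular files below `path`, including subfolders."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

For stores with a flat layout this gives the same result as summing over the glob matches.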

In [30]:
print('On average, the *zarred* reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecZarrSum'].mean() * 1e-9))

On average, the *zarred* reconstructions of each of the 104 assessed samples are 3.17 GB in size


In [31]:
print('In total, the *zarred* reconstructions are %0.f GB in size' % (Data['SizeRecZarrSum'].sum() * 1e-9))

In total, the *zarred* reconstructions are 330 GB in size


----

In [32]:
# Get the file names of the cropped, zarred reconstructions
Data['ReconstructionsCropZarr'] = [sorted(glob.glob(os.path.join(folder,
                                                                 '*rec_crop.zarr',
                                                                 '*'))) for folder in Data['Folder']]

In [33]:
Data['SizeRecCropZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['ReconstructionsCropZarr']]
Data['SizeRecCropZarrSum'] = [sum(sizes) for sizes in Data['SizeRecCropZarr']]

In [34]:
Data['SizeRecCropZarrSum']

0      732951760
16     868776195
27    1371973308
38    1013726895
49     799702689
         ...    
2     2845728173
3      763224010
4     3265892359
5     3025245060
6      769648083
Name: SizeRecCropZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_rec_crop.zarr/*
````
in a Linux console

In [35]:
print('On average, the cropped reconstructions of each of the %s assessed samples '
      'are %0.2f GB in size' % (len(Data),
                                Data['SizeRecCropZarrSum'].mean() * 1e-9))

On average, the cropped reconstructions of each of the 104 assessed samples are 1.11 GB in size


In [36]:
print('In total, the cropped reconstructions are %0.f GB in size' % (Data['SizeRecCropZarrSum'].sum() * 1e-9))

In total, the cropped reconstructions are 115 GB in size


----

In [28]:
# Get the file names of the zarred root canals
Data['RootCanalZarr'] = [sorted(glob.glob(os.path.join(folder,
                                                       '*rootcanal.zarr',
                                                       '*'))) for folder in Data['Folder']]

In [31]:
Data['SizeRootCanalZarr'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['RootCanalZarr']]
Data['SizeRootCanalZarrSum'] = [sum(sizes) for sizes in Data['SizeRootCanalZarr']]

In [33]:
Data['SizeRootCanalZarrSum']

0      1604058
16     1487890
27     2422112
38     2710712
49     2608521
        ...   
2      7229116
3      1780260
4     16382915
5      7845735
6      1529387
Name: SizeRootCanalZarrSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/Tooth*_rootcanal.zarr/*
````
in a Linux console

In [34]:
print('On average, the extracted root canals of each of the %s assessed samples '
      'are %0.2f MB in size' % (len(Data),
                                Data['SizeRootCanalZarrSum'].mean() * 1e-6))

On average, the extracted root canals of each of the 104 assessed samples are 2.97 MB in size


In [35]:
print('In total, the extracted root canals are %0.2f MB in size' % (Data['SizeRootCanalZarrSum'].sum() * 1e-6))

In total, the extracted root canals are 309.05 MB in size


----

In [36]:
# Get the file names of the reformatted bottom part
Data['ApexFiles'] = [sorted(glob.glob(os.path.join(folder,
                                                   'apex_reslice',
                                                   '*.png'))) for folder in Data['Folder']]

In [37]:
Data['SizeApex'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['ApexFiles']]
Data['SizeApexSum'] = [sum(sizes) for sizes in Data['SizeApex']]

In [38]:
Data['SizeApexSum']

0     35818973
16    22963837
27    38075422
38    22788583
49    11829993
        ...   
2     16030677
3     21928582
4     31148795
5     24345631
6     48254904
Name: SizeApexSum, Length: 104, dtype: int64

To get (nearly) the same size, use
````bash
du -csb [123]/apex_reslice/*.png
````
in a Linux console.

In [39]:
print('On average, the extracted apexes of each of the %s assessed samples '
      'are %0.2f MB in size' % (len(Data),
                                Data['SizeApexSum'].mean() * 1e-6))

On average, the extracted apexes of each of the 104 assessed samples are 21.97 MB in size


In [40]:
print('In total, the extracted apexes are %0.2f GB in size' % (Data['SizeApexSum'].sum() * 1e-9))

In total, the extracted apexes are 2.28 GB in size
