# Get the size of the datasets on disk

For the manuscript on the fishes, we want to know how much data we produced.
This notebook is based on a copy of `DataWrangling.ipynb` and https://github.com/habi/zmk-tooth-cohort/blob/master/ToothDataSize.ipynb

In [1]:
import platform
import os
import glob
import pandas
from tqdm import notebook
# import imageio
# import numpy
# import matplotlib.pyplot as plt
# from matplotlib_scalebar.scalebar import ScaleBar
# import seaborn
# import dask
# import dask_image.imread
# from dask.distributed import Client, LocalCluster
# import skimage

In [24]:
# The canonical place for *this* notebook is the iee research storage, as only there we have *all* the data
if 'Linux' in platform.system():
    Root = os.path.join(os.sep, 'home', 'habi', 'research-storage-iee')
else:
    Root = os.path.join('I:\\microCTupload')
print('We are loading all the data from %s' % Root)

We are loading all the data from /home/habi/research-storage-iee


In [25]:
def get_git_hash():
    '''
    Get the current git hash from the repository.
    Based on http://stackoverflow.com/a/949391/323100 and
    http://stackoverflow.com/a/18283905/323100
    '''
    from subprocess import Popen, PIPE
    import os
    gitprocess = Popen(['git',
                        '--git-dir',
                        os.path.join(os.getcwd(), '.git'),
                        'rev-parse',
                        '--short',
                        '--verify',
                        'HEAD'],
                       stdout=PIPE)
    (output, _) = gitprocess.communicate()
    return output.strip().decode("utf-8")
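
The function above returns an empty string when `git` is missing or the working directory is not a repository, which would silently place all output directly in `Output`. A more defensive variant could look like the sketch below. The function name `get_git_hash_or_default` and the `'nogit'` fallback label are hypothetical and not part of the original notebook.

```python
import subprocess


def get_git_hash_or_default(default='nogit'):
    """Return the short hash of HEAD, or `default` if it cannot be determined.

    `default` is an illustrative fallback label, not something the
    original notebook defines.
    """
    try:
        result = subprocess.run(
            ['git', 'rev-parse', '--short', '--verify', 'HEAD'],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return default
    # Fall back to the default if git printed nothing
    return result.stdout.decode('utf-8').strip() or default


print(get_git_hash_or_default())
```

With this variant, the output directory always gets a non-empty last path component.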

In [26]:
# Make directory for output
OutPutDir = os.path.join(os.getcwd(), 'Output', get_git_hash())
print('We are saving all the output to %s' % OutPutDir)
os.makedirs(OutPutDir, exist_ok=True)

We are saving all the output to /home/habi/P/Documents/EAWAG/Output/d3ac286


In [27]:
# Make us a dataframe for saving all that we need
Data = pandas.DataFrame()

In [28]:
# Get *all* log files
# Sort them by time, not name
Data['LogFile'] = [f for f in sorted(glob.glob(os.path.join(Root, '**', '*.log'),
                                               recursive=True),
                                     key=os.path.getmtime)]
print('We have %s log files to work with' % (len(Data)))

We have 1172 log files to work with


In [29]:
# Get all folders
Data['Folder'] = [os.path.dirname(f) for f in Data['LogFile']]

In [30]:
# Generate some meaningful columns
Data['Fish'] = [l[len(Root)+1:].split(os.sep)[0] for l in Data['LogFile']]
Data['Scan'] = ['_'.join(l[len(Root)+1:].split(os.sep)[1:-1]) for l in Data['LogFile']]
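
The slicing above takes the first folder below `Root` as the fish name and joins the remaining folders (without the file name) into the scan name. The same split can be sketched with `os.path.relpath`, which avoids the manual `len(Root)+1` offset. The helper name `fish_and_scan`, the `/data` root and the `scan.log` file name are made up for illustration.

```python
import os


def fish_and_scan(logfile, root):
    """Split a log file path into (fish, scan), relative to `root`."""
    parts = os.path.relpath(logfile, root).split(os.sep)
    # The first folder below the root is the fish
    fish = parts[0]
    # Everything between the fish folder and the file name, joined with '_'
    scan = '_'.join(parts[1:-1])
    return fish, scan


# Hypothetical root and file name, only used for this example
root = os.path.join(os.sep, 'data')
example = os.path.join(root, '103908', 'jaw', 'rec', 'scan.log')
print(fish_and_scan(example, root))  # ('103908', 'jaw_rec')
```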

In [31]:
# How many fishes did we scan?
# We scanned six 'buckets of fish' and one set of only 'teeth', so subtract those :)
print('We have %s unique names in our corpus of scans' % (len(Data.Fish.unique())-7))
print('We performed %s scans in total' % len(Data.Scan))

We have 127 unique names in our corpus of scans
We performed 1172 scans in total


In [32]:
# # Temporarily drop some data
# Data = Data[:3]
# print('We are currently working with a subset of %s teeth' % len(Data))

In [33]:
for i in Data.Folder[:10]:
    print(i)

/home/habi/research-storage-iee/14298/proj
/home/habi/research-storage-iee/103908/jaw/rec
/home/habi/research-storage-iee/103908/jaw/proj
/home/habi/research-storage-iee/Teeth/W/rec_al0.25
/home/habi/research-storage-iee/Teeth/W/proj_nofilter
/home/habi/research-storage-iee/Teeth/W/rec_nofilter
/home/habi/research-storage-iee/Teeth/P/proj_al0.25
/home/habi/research-storage-iee/Teeth/P/rec_al0.25
/home/habi/research-storage-iee/Teeth/P/proj_nofilter
/home/habi/research-storage-iee/Teeth/P/rec_nofilter


In [34]:
# Get the projection details
# Let's look for both 'tif' *and* 'iif' files (the latter are alignment projections)
Data['Projections'] = [sorted(glob.glob(os.path.join(folder,
                                                     '*.?if'))) for folder in Data['Folder']]
Data['NumberOfProjections'] = [len(r) for r in Data['Projections']]

In [35]:
# Get the size of the TIFFs
Data['ProjectionSize'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['Projections']]
Data['ProjectionSizeSum'] = [sum(size) for size in Data['ProjectionSize']]

In [36]:
Data[['Folder', 'NumberOfProjections', 'ProjectionSize', 'ProjectionSizeSum']]

| | Folder | NumberOfProjections | ProjectionSize | ProjectionSizeSum |
|---|---|---|---|---|
| 0 | /home/habi/research-storage-iee/14298/proj | 1948 | [32171566, 32171566, 32171566, 32171566, 32171... | 62612307144 |
| 1 | /home/habi/research-storage-iee/103908/jaw/rec | 2 | [982606, 982606] | 1965212 |
| 2 | /home/habi/research-storage-iee/103908/jaw/proj | 977 | [8043342, 8043342, 8043342, 8043342, 8043342, ... | 7844223662 |
| 3 | /home/habi/research-storage-iee/Teeth/W/rec_al... | 0 | [] | 0 |
| 4 | /home/habi/research-storage-iee/Teeth/W/proj_n... | 472 | [3564886, 3564886, 3564886, 3564886, 3564886, ... | 1682626192 |
| ... | ... | ... | ... | ... |
| 1167 | /home/habi/research-storage-iee/BucketOfFish_F... | 0 | [] | 0 |
| 1168 | /home/habi/research-storage-iee/BucketOfFish_F... | 0 | [] | 0 |
| 1169 | /home/habi/research-storage-iee/BucketOfFish_F... | 3616 | [11944838, 11944838, 11944838, 11944838, 11944... | 43192534208 |
| 1170 | /home/habi/research-storage-iee/BucketOfFish_F... | 2 | [2387182, 2387182] | 4774364 |


To get (nearly) the same size, use
````bash
find . -iname '*.?if' -print0 | du -ch --files0-from=-
````
in a Linux console.
The command is based on https://askubuntu.com/a/558989/759778
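
The result is only "nearly" the same for at least two reasons: `find` descends into every subfolder and matches case-insensitively (`-iname`), and `du` reports disk usage rather than the apparent file size unless told otherwise. A rough Python counterpart that sums apparent sizes recursively could look like the sketch below. The helper name `total_size` and the demo files are illustrative only.

```python
import fnmatch
import os
import tempfile


def total_size(top, pattern):
    """Sum the sizes (in bytes) of all files below `top` matching `pattern`."""
    total = 0
    for dirpath, _, filenames in os.walk(top):
        for name in filenames:
            # Compare lower-cased names to mimic the case-insensitive -iname
            if fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


# Small demo on throwaway files of known size
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for name, size in [('a.tif', 10), (os.path.join('sub', 'B.IIF'), 5), ('c.png', 7)]:
        with open(os.path.join(tmp, name), 'wb') as handle:
            handle.write(b'\0' * size)
    demo_total = total_size(tmp, '*.?if')
print(demo_total)  # 15, the PNG file is not counted
```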

In [37]:
# Dividing by 1024 three times gives binary gigabytes (GiB), labelled 'GB' here
print('In total, all projections are %0.2f GB in size' % (Data['ProjectionSizeSum'].sum() / 1024 / 1024 / 1024))

In total, all projections are 28518.61 GB in size


In [38]:
print('In total, all projections are %0.2f TB in size' % (Data['ProjectionSizeSum'].sum() / 1024 / 1024 / 1024 / 1024))

In total, all projections are 27.85 TB in size
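
The two cells above divide by 1024 repeatedly, so the reported values are binary units (GiB, TiB), even though they are labelled GB and TB. For the manuscript it may matter which convention is quoted. The following sketch formats a byte count in either convention. `format_size` is a hypothetical helper, not something the notebook defines.

```python
def format_size(num_bytes, binary=True):
    """Format a byte count with binary (KiB, MiB, ...) or decimal (kB, MB, ...) units."""
    if binary:
        base = 1024
        units = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB']
    else:
        base = 1000
        units = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below the base,
        # or at the largest unit we know about
        if value < base or unit == units[-1]:
            return '%0.2f %s' % (value, unit)
        value /= base


# ProjectionSizeSum of the first row in the table above
print(format_size(62612307144))
print(format_size(62612307144, binary=False))
```

Applied to `Data['ProjectionSizeSum'].sum()`, the decimal variant gives a larger number than the binary one for the same amount of data.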


----

In [39]:
# Get the file names of the reconstructions
Data['Reconstructions'] = [sorted(glob.glob(os.path.join(f, '*rec0*.png'))) for f in Data['Folder']]
Data['NumberOfReconstructions'] = [len(r) for r in Data.Reconstructions]

In [40]:
print('In total, we have %s reconstructions for all the %s datasets'
      % (Data['NumberOfReconstructions'].sum(),
         len(Data)))

In total, we have 931205 reconstructions for all the 1172 datasets


In [41]:
print('On average, each of the %s datasets has about %s reconstructions.'
      % (len(Data),
         int(round(Data['NumberOfReconstructions'].mean()))))

On average, each of the 1172 datasets has about 795 reconstructions.


In [42]:
# Check for samples which have not been reconstructed yet (nothing is dropped here)
# Based on https://stackoverflow.com/a/13851602
# for c, row in Data.iterrows():
#     if not row['NumberOfReconstructions']:
#         print('%s contains no PNG files, we might be currently reconstructing it' % row.Folder)
print('We have %s folders in total' % (len(Data)))
print("Of which %s folders contain reconstructions (Data['NumberOfReconstructions']>0)" % (len(Data[Data['NumberOfReconstructions'] > 0])))

We have 1172 folders in total
Of which 338 folders contain reconstructions (Data['NumberOfReconstructions']>0)
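
Since only 338 of the 1172 folders contain reconstructions, the average over all folders understates the typical stack size. A quick check with the numbers printed above (the variable names are chosen for this example only):

```python
# Values copied from the cell outputs above
total_reconstructions = 931205
folders_total = 1172
folders_with_reconstructions = 338

mean_all = total_reconstructions / folders_total
mean_nonempty = total_reconstructions / folders_with_reconstructions

print('Mean over all folders: %d' % round(mean_all))  # 795
print('Mean over folders with reconstructions: %d' % round(mean_nonempty))  # 2755
```

In the notebook itself, the second value would correspond to the mean of `NumberOfReconstructions` restricted to rows where it is larger than zero.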


In [43]:
# Get the size of the reconstructions
Data['ReconstructionSize'] = [[os.path.getsize(rec) for rec in recs] for recs in Data['Reconstructions']]
Data['ReconstructionSizeSum'] = [sum(sizes) for sizes in Data['ReconstructionSize']]

In [44]:
print('In total, the reconstructions are %0.2f GB in size' % (Data['ReconstructionSizeSum'].sum() / 1024 / 1024 / 1024))

In total, the reconstructions are 1417.46 GB in size


In [45]:
print('In total, the reconstructions are %0.2f TB in size' % (Data['ReconstructionSizeSum'].sum() / 1024 / 1024 / 1024 / 1024))

In total, the reconstructions are 1.38 TB in size


To get (nearly) the same size, use

````bash
find . -iname '*rec0*.png' -print0 | du -ch --files0-from=-
````

in a Linux console.