## Summary of notebook

This notebook:
1. Given access to a directory containing the HCP files, creates a pickle for each subject in the output_pklz directory.
2. Creates a dictionary called dfs.
    + dfs contains a dataframe summarizing the files at two levels of depth in the HCP file tree.
    + dfs is written to a small pickled file called 100_subjects.pklz.
3. Shows a quick summary of the data. The main conclusion is that the bulk of the files, those contained in the results directories of 'MNINonLinear', were recently accessed.

## Setup

In [None]:
pwd

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
# import seaborn as sns
import pickle
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('max_colwidth',500)
# %matplotlib inline

# dir = {'900':'/data/HCP/HCP_900/s3/hcp/**/*','1200': '/data/HCP/HCP_1200/**/*'}
dirs = list(Path('/data/HCP/HCP_900/s3/hcp').iterdir())
# cols = ['perms','links','user','group','size','year','time','dir']
cols = ['size','date','dir']
pklz_dir = Path('output_pklz')
if not pklz_dir.exists():
    # alternatively if subject_pklz.tar.gz exists this could be unpickled
    pklz_dir.mkdir()


In [None]:
def get_subdir(d_path):
    return d_path.name
def get_output_path(d_path):
    return Path('output_pklz').joinpath('files_characterization_' + d_path.parent.name + '_' + get_subdir(d_path) + '.pklz')
def write_tsv_with_atimes_and_size(d_path):
    # -lu gives the access time
    # -d1 lists each file/dir itself (not its contents), one per line
    # the glob pattern provides all the files/dirs
    # the awk command turns the listing into tab-separated output
    # cut keeps fields 5, 6 and 8: size, date and path
    # globstar must be enabled in bash: shopt -s globstar
    output_file = get_output_path(d_path)
    print(d_path.as_posix() + '/**/*')
    ! shopt -s globstar;ls -lu -d1  --time-style long-iso {d_path.as_posix() + '/**/*'}| awk -v OFS="\t" '$1=$1'|cut -f5,6,8 > {output_file.with_suffix('.tsv')}
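
The shell pipeline above depends on bash, `ls`, `awk` and `cut`. As a rough alternative, the same three columns could be gathered in pure Python. This is only a sketch, not what was run for this notebook: `collect_atimes_and_sizes` is a hypothetical helper, and unlike the bash glob, `rglob('*')` also returns hidden files.

```python
from datetime import datetime
from pathlib import Path
import tempfile

import pandas as pd


def collect_atimes_and_sizes(root):
    """Return one row per file or directory below root: size, access date, path.

    Hypothetical helper, not part of the original notebook.
    """
    rows = []
    for p in sorted(Path(root).rglob('*')):
        st = p.lstat()  # lstat: describe a symlink itself, as ls -l does
        rows.append({
            'size': st.st_size,
            'date': datetime.fromtimestamp(st.st_atime).strftime('%Y-%m-%d'),
            'dir': p.as_posix(),
        })
    return pd.DataFrame(rows, columns=['size', 'date', 'dir'])


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'sub' / 'deeper').mkdir(parents=True)
    (Path(tmp) / 'sub' / 'deeper' / 'a.txt').write_text('hello')
    demo = collect_atimes_and_sizes(tmp)

print(demo)
```

The column names match `cols` defined in the setup cell, so the result could be handled the same way as the TSV output.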


## Write pickle for every subject

The below cell was used to generate a tsv file for each subject:

The output pickles were subsequently tarred:
tar -czvf subject_pklz.tar.gz output_pklz
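
The step that turns a per-subject TSV into the pickle later read by `pd.read_pickle` is not shown in this notebook. One plausible way to do it, assuming the TSV has no header and the three columns named in `cols`, is sketched below. The file name, subject ID, and rows are made-up placeholders, not HCP data.

```python
import tempfile
from pathlib import Path

import pandas as pd

cols = ['size', 'date', 'dir']

# Made-up stand-in for the output of the ls/awk/cut pipeline.
tsv_text = (
    "4096\t2018-03-01\t/data/HCP/HCP_900/s3/hcp/000001/MNINonLinear\n"
    "1024\t2018-03-02\t/data/HCP/HCP_900/s3/hcp/000001/MNINonLinear/Results\n"
)

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / 'files_characterization_hcp_000001.tsv'
    tsv_path.write_text(tsv_text)

    # Read the headerless TSV and attach the column names.
    df_sub = pd.read_csv(tsv_path, sep='\t', names=cols, header=None)

    # Write the pickle next to the TSV and read it back to check the round trip.
    pkl_path = tsv_path.with_suffix('.pklz')
    df_sub.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

print(restored)
```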

## Load 100 random subjects and assess the access times in their file trees.

### Define helper functions:

In [None]:
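def get_least_common_value(series):
    # Despite the name, this returns the shortest string in the series
    # (for parent_dir values, the directory highest up in the tree).
    return series[series.apply(len).idxmin()]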
def get_dir_level_summary(df,level = 7):
    df_grouped = (
        df.loc[pd.notnull(df[level+ 1]) ,:].
        groupby(list(range(level + 1)))
    )
    df = (
        df_grouped.
        aggregate({'date': 'max', 'size': 'sum', 'file': len, 'parent_dir': get_least_common_value}).
        assign(total_size_gb = lambda df: round(df['size'] /1000000000,3)).
        rename(columns = {'date':'most_recent_access',
                     'file' : 'num_files'}).
#         reset_index(drop = True).
        assign(tree_depth = level)
        
    )
    return df

# from IPython.core.debugger import Pdb; ipdb=Pdb()
# ipdb.runcall(get_dir_level_summary, df_split, 8)

# test = pd.concat([df_sub.head(100), df_sub.head(100).file.str.split('/',expand = True)], axis = 1)
# get_dir_level_summary(test, 7)
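
To illustrate the idea behind `get_dir_level_summary`, here is a minimal sketch on a toy table: each path is split into one column per tree level, rows that have nothing below the chosen level are dropped, and the rest are grouped by the leading levels. The paths, sizes, and dates are invented for illustration, and named aggregation is used here in place of the dictionary form.

```python
import pandas as pd

# Toy listing: three files plus one directory entry (all values invented).
files = pd.DataFrame({
    'file': ['/a/b/x.nii', '/a/b/y.nii', '/a/c/z.nii', '/a/c'],
    'size': [10, 20, 5, 4096],
    'date': ['2018-01-01', '2018-02-01', '2018-01-15', '2018-01-20'],
})

# One column per path component; column 0 is '' because the paths are absolute.
parts = files['file'].str.split('/', expand=True)
df = pd.concat([files, parts], axis=1)

level = 2
summary = (
    df.loc[df[level + 1].notnull()]          # keep entries that live below `level`
      .groupby(list(range(level + 1)))       # group by path components 0..level
      .agg(most_recent_access=('date', 'max'),
           total_size=('size', 'sum'),
           num_files=('file', 'count'))
)
print(summary)
```

The entry `/a/c` has no component at depth 3, so it is excluded and only the files inside `/a/b` and `/a/c` are counted.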

In [None]:
def summarise_subject_info(subject_summary, num_subs=3, levels=(7,)):
    dfs = {}
    dfs_full = {}
    if not subject_summary.exists():
        # Sample without replacement so that no subject is counted twice.
        for d_path in np.random.choice(dirs, num_subs, replace=False):
            output_file = get_output_path(d_path)
            df_sub = pd.read_pickle(output_file)
            df_sub = df_sub.rename(columns = {'dir' : 'file'})
            df_sub['subject'] = d_path.name
            df_sub['is_file'] = df_sub.file.apply(lambda x: Path(x).is_file())
            df_sub['parent_dir'] = df_sub.file.apply(lambda x:'/'.join(x.split('/')[:-1]))
            # Create columns representing depth into the file tree to group across them:
            df_sub = pd.concat([df_sub, df_sub.file.str.split('/',expand = True)], axis = 1)
            
            for lev in levels:
                dfs[lev] = get_dir_level_summary(df_sub, lev)

                if lev in dfs_full.keys():
                    dfs_full[lev]  = pd.concat([dfs_full[lev],dfs[lev]],axis = 0)
                else:
                    dfs_full[lev] = dfs[lev].copy()

        with open(subject_summary, "wb") as f:
            pickle.dump(dfs_full, f)
    else:
        with open(subject_summary, "rb") as f:
            dfs_full = pickle.load(f)
    return dfs_full

# directory of output pickles required:
# tar xvf subject_pklz.tar.gz output_pklz

## Create merged dataframes

The code below creates a dictionary of dataframes keyed by tree depth. Each dataframe summarizes the files at that depth in the tree:

In [None]:
dfs = summarise_subject_info(Path('100_subjects.pklz'), num_subs= 100, levels=[7,9])

In [None]:
dfs[7].head()


In [None]:
dfs[9].head()


## Size of 'MNINonLinear' results directories

Most of the disk space of the HCP dataset is used by the results directory in the 'MNINonLinear' directory.

Across all 100 subjects, the files have a total size (in GB) of:

In [None]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,:]['total_size_gb'].sum()

Across all 100 subjects, the MNINonLinear results files have a total size (in GB) of:

In [None]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear']['total_size_gb'].sum()
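
The two totals can be combined into a single fraction. The sketch below shows the pattern on a toy two-level index, using `xs` to select one value of a named index level; the level names, subject labels, and sizes are invented, and the real `dfs[9]` index has more (unnamed) levels.

```python
import pandas as pd

# Toy summary indexed by (subject, top-level directory); sizes are invented.
index = pd.MultiIndex.from_tuples(
    [('s1', 'MNINonLinear'), ('s1', 'T1w'),
     ('s2', 'MNINonLinear'), ('s2', 'T1w')],
    names=['subject', 'top_dir'],
)
summary = pd.DataFrame({'total_size_gb': [8.0, 2.0, 6.0, 4.0]}, index=index)

# Total over everything, then over the MNINonLinear rows only.
total_gb = summary['total_size_gb'].sum()
mni_gb = summary.xs('MNINonLinear', level='top_dir')['total_size_gb'].sum()

fraction = mni_gb / total_gb
print(f"MNINonLinear share of disk usage: {fraction:.0%}")
```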

## Most recent access times in the MNINonLinear results directories

Nine levels deep in the tree, all directories in MNINonLinear/*/Results have been accessed recently. This is the bulk of the data in the HCP dataset.

In [None]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear'].head()

In [None]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear']['most_recent_access'].unique()
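
"Recently" can be made explicit by comparing the access dates to a cutoff. A minimal sketch, assuming `most_recent_access` holds ISO date strings as produced by `ls --time-style long-iso`; the dates and the cutoff below are made-up placeholders, not values from the HCP data.

```python
import pandas as pd

# Invented access dates standing in for the most_recent_access column.
most_recent_access = pd.Series(['2018-03-01', '2018-02-27', '2016-11-05'])

# Hypothetical cutoff defining what counts as "recent".
cutoff = pd.Timestamp('2018-01-01')

accessed = pd.to_datetime(most_recent_access)
is_recent = accessed >= cutoff

# Fraction of directories accessed on or after the cutoff.
share_recent = is_recent.mean()
print(is_recent.tolist(), share_recent)
```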

## Save conda environment

In [None]:
!   conda env export > hcp_characterization.yml