## Summary of notebook

This notebook:
1. Given access to a directory containing the HCP files, creates a pickle for each subject in the output_pklz directory.
2. Creates a dictionary called dfs.
    + dfs contains dataframes summarizing the files at three depths (7, 9 and 10) in the HCP file tree.
    + dfs is written to a small pickled file called 100_subjects_with_10.pklz.
3. Shows a quick summary of the data. The main conclusion is that the bulk of the files, those in the Results directories of 'MNINonLinear', were last accessed on 2017-11-16.

## Setup

In [1]:
pwd

'/Users/rodgersleejg/Documents/nih/code/hcp_characterization'

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np
# import seaborn as sns
import pickle
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('max_colwidth',500)
# %matplotlib inline

# dir = {'900':'/data/HCP/HCP_900/s3/hcp/**/*','1200': '/data/HCP/HCP_1200/**/*'}
hcp_dir = Path('/data/HCP/HCP_900/s3/hcp')
if hcp_dir.exists():
    dirs = list(hcp_dir.iterdir())
# cols = ['perms','links','user','group','size','year','time','dir']
cols = ['size','date','dir']
pklz_dir = Path('output_pklz')
if not pklz_dir.exists():
    # alternatively if subject_pklz.tar.gz exists this could be unpickled
    pklz_dir.mkdir()


In [3]:
def get_subdir(d_path):
    return d_path.name
def get_output_path(d_path):
    return Path('output_pklz').joinpath('files_characterization_' + d_path.parent.name + '_' + get_subdir(d_path) + '.pklz')
def write_tsv_with_atimes_and_size(d_path):
    # ls flags: -lu lists the access time; -d1 lists each file/dir itself rather than its contents.
    # The glob pattern expands to every file/dir below d_path.
    # awk re-joins the fields with tabs; cut keeps size, date and path (fields 5, 6 and 8).
    # Requires globstar to be enabled in bash: shopt -s globstar
    output_file = get_output_path(d_path)
    print(d_path.as_posix() + '/**/*')
    ! shopt -s globstar;ls -lu -d1  --time-style long-iso {d_path.as_posix() + '/**/*'}| awk -v OFS="\t" '$1=$1'|cut -f5,6,8 > {output_file.with_suffix('.tsv')}
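
Where the `ls`/`awk`/`cut` pipeline above is not available, roughly the same three columns (size, access date, path) could be collected with the standard library. This is only a sketch: the function names `collect_atimes_and_sizes` and `write_tsv` are hypothetical and not part of the notebook, and unlike the bash glob, `rglob('*')` also returns hidden files.

```python
import time
from pathlib import Path


def collect_atimes_and_sizes(root):
    """Return (size_bytes, access_date, path) for every entry below root."""
    rows = []
    for path in sorted(Path(root).rglob('*')):
        st = path.lstat()  # lstat does not follow symlinks
        access_date = time.strftime('%Y-%m-%d', time.localtime(st.st_atime))
        rows.append((st.st_size, access_date, path.as_posix()))
    return rows


def write_tsv(rows, tsv_path):
    """Write the rows as tab-separated size, date and path columns."""
    with open(tsv_path, 'w') as handle:
        for size, date, name in rows:
            handle.write(f'{size}\t{date}\t{name}\n')
```

Because the path is written as a single field, this variant is not affected by spaces in file names, which would shift the columns in the whitespace-based pipeline.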


## Write pickle for every subject

The cell below was used to generate a tsv file for each subject:

The output pickles were subsequently tarred:
tar -czvf subject_pklz.tar.gz output_pklz
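
The step that turns each tsv file into a pickle is not preserved in this notebook. A minimal sketch of how it might look is given here, assuming the tsv has no header row and uses the column names from `cols` in the setup cell; the helper name `tsv_to_dataframe` and the example row are made up for illustration.

```python
import io

import pandas as pd

cols = ['size', 'date', 'dir']  # same column names as in the setup cell


def tsv_to_dataframe(tsv_source):
    """Read a headerless size/date/path tsv into a dataframe."""
    return pd.read_csv(tsv_source, sep='\t', header=None, names=cols)


# Illustrative in-memory tsv standing in for a real per-subject file.
example = io.StringIO('1024\t2017-04-06\texample/path/file.txt\n')
df = tsv_to_dataframe(example)

# Writing the pickle (path is hypothetical, so left commented out):
# df.to_pickle('output_pklz/example.pklz')
```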

## Load 100 random subjects and assess the access times in their file trees.

### Define helper functions:

In [4]:
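def get_least_common_value(series):
    # Despite the name, this returns the shortest string in the series,
    # which for parent_dir is normally the group's top-level directory.
    return series[series.apply(len).idxmin()]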
def get_dir_level_summary(df,level = 7):
    df_grouped = (
        df.loc[pd.notnull(df[level+ 1]) ,:].
        groupby(list(range(level + 1)))
    )
    df = (
        df_grouped.
        aggregate({'date':max,'size': sum, 'file':len,'parent_dir' : lambda x :get_least_common_value(x)}).
        assign(total_size_gb = lambda df: round(df['size'] /1000000000,3)).
        rename(columns = {'date':'most_recent_access',
                     'file' : 'num_files'}).
#         reset_index(drop = True).
        assign(tree_depth = level)
        
    )
    return df

# from IPython.core.debugger import Pdb; ipdb=Pdb()
# ipdb.runcall(get_dir_level_summary, df_split, 8)

# test = pd.concat([df_sub.head(100), df_sub.head(100).file.str.split('/',expand = True)], axis = 1)
# get_dir_level_summary(test, 7)
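
To make the grouping in `get_dir_level_summary` easier to follow, here is a toy version on made-up paths. It is a sketch rather than the notebook's code: it uses pandas named aggregation instead of the dictionary form, and the paths, sizes and dates are invented.

```python
import pandas as pd

files = pd.DataFrame({
    'file': ['/a/b', '/a/b/x.txt', '/a/b/y.txt', '/a/c/z.txt', '/a/c/w.txt'],
    'size': [4096, 10, 20, 5, 7],
    'date': ['2017-04-06', '2017-04-06', '2017-11-16', '2017-04-06', '2017-04-07'],
})

# One column per path component; the leading '/' makes column 0 an empty string.
parts = files['file'].str.split('/', expand=True)
df = pd.concat([files, parts], axis=1)

level = 2
summary = (
    # Keep only entries that lie below the chosen depth ...
    df.loc[df[level + 1].notna()]
    # ... and group them by every path component up to that depth.
    .groupby(list(range(level + 1)))
    .agg(most_recent_access=('date', 'max'),
         size=('size', 'sum'),
         num_files=('file', 'count'))
)
print(summary)
```

The entry `/a/b` itself is dropped by the `notna` filter because it has no component at depth 3, so each group only counts what is inside the directory.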

In [5]:
def summarise_subject_info(pickle_path,num_subs=3,levels=[7],dirs=None):
    pickle_path = Path(pickle_path)
    dfs = {}
    dfs_full = {}
    if not pickle_path.exists():
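        # dirs must be a list of subject directories when the pickle has to be built.
        # replace=False ensures that no subject is sampled twice.
        for d_path in np.random.choice(dirs, num_subs, replace=False):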
            output_file = get_output_path(d_path)
            df_sub = pd.read_pickle(output_file)
            df_sub = df_sub.rename(columns = {'dir' : 'file'})
            df_sub['subject'] = d_path.name
            df_sub['is_file'] = df_sub.file.apply(lambda x: Path(x).is_file())
            df_sub['parent_dir'] = df_sub.file.apply(lambda x:'/'.join(x.split('/')[:-1]))
#             Create columns representing depth into the file tree to group across them:
            df_sub = pd.concat([df_sub, df_sub.file.str.split('/',expand = True)], axis = 1)
            
            for lev in levels:
                dfs[lev] = get_dir_level_summary(df_sub, lev)

                if lev in dfs_full.keys():
                    dfs_full[lev]  = pd.concat([dfs_full[lev],dfs[lev]],axis = 0)
                else:
                    dfs_full[lev] = dfs[lev].copy()

        # Context managers make sure the pickle file is closed after use.
        with open(pickle_path.as_posix(), "wb") as f:
            pickle.dump(dfs_full, f)
    else:
        with open(pickle_path.as_posix(), "rb") as f:
            dfs_full = pickle.load(f)
    return dfs_full

# directory of output pickles required:
# tar xvf subject_pklz.tar.gz output_pklz

## Create merged dataframes

The code below creates a dictionary of dataframes keyed by tree depth. Each dataframe summarizes the files below every directory at that depth in the file tree:

In [6]:
dfs = summarise_subject_info(Path('100_subjects_with_10.pklz'), num_subs= 100, levels=[7,9,10])

In [7]:
dfs[7].head()


| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | most_recent_access | size | num_files | parent_dir | total_size_gb | tree_depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | 2017-11-16 | 56167660000.0 | 16493 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear | 56.168 | 7 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | T1w | 2017-11-16 | 9631565000.0 | 499 | /data/HCP/HCP_900/s3/hcp/178748/T1w | 9.632 | 7 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | release-notes | 2017-04-06 | 49098.0 | 54 | /data/HCP/HCP_900/s3/hcp/178748/release-notes | 0.0 | 7 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | unprocessed | 2017-11-16 | 11847280000.0 | 381 | /data/HCP/HCP_900/s3/hcp/178748/unprocessed | 11.847 | 7 |
|  | data | HCP | HCP_900 | s3 | hcp | 200008 | MNINonLinear | 2017-11-16 | 51709440000.0 | 12037 | /data/HCP/HCP_900/s3/hcp/200008/MNINonLinear | 51.709 | 7 |
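
The integer index levels in the output above are positions in the path after splitting on `/`. A short check, using one `parent_dir` value from the notebook's own output, shows why depth 7 corresponds to the directories directly below a subject and depth 9 to the individual run directories:

```python
path = '/data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR'
parts = path.split('/')

# The leading '/' produces an empty string at position 0,
# so 'data' is at position 1 and the subject ID at position 6.
for depth in (6, 7, 9):
    print(depth, parts[depth])
# prints:
# 6 178748
# 7 MNINonLinear
# 9 rfMRI_REST1_LR
```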


In [8]:
dfs[9].head()


| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | most_recent_access | size | num_files | parent_dir | total_size_gb | tree_depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_LR | 2017-11-16 | 8354434000.0 | 1534 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR | 8.354 | 9 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_RL | 2017-11-16 | 8266014000.0 | 1404 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_RL | 8.266 | 9 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST2_LR | 2017-11-16 | 8322172000.0 | 1484 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST2_LR | 8.322 | 9 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST2_RL | 2017-11-16 | 8463622000.0 | 1664 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST2_RL | 8.464 | 9 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | tfMRI_EMOTION | 2017-11-16 | 549169700.0 | 712 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/tfMRI_EMOTION | 0.549 | 9 |


In [9]:
dfs[10].head()


| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | most_recent_access | size | num_files | parent_dir | total_size_gb | tree_depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_LR | RestingStateStats | 2017-04-06 | 84586500 | 52 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR/RestingStateStats | 0.085 | 10 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_LR | RibbonVolumeToSurfaceMapping | 2017-04-06 | 50389 | 1 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR/RibbonVolumeToSurfaceMapping | 0.0 | 10 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_LR | rfMRI_REST1_LR | 2017-11-16 | 110857036 | 59 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR | 0.111 | 10 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_LR | rfMRI_REST1_LR_hp2000.ica | 2017-11-16 | 1027505181 | 1331 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica | 1.028 | 10 |
|  | data | HCP | HCP_900 | s3 | hcp | 178748 | MNINonLinear | Results | rfMRI_REST1_RL | RestingStateStats | 2017-04-06 | 86634343 | 52 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_RL/RestingStateStats | 0.087 | 10 |


## Size of 'MNINonLinear' results directories

Most of the disk space of the HCP dataset is used by the Results directory inside the 'MNINonLinear' directory.

Across all 100 subjects, the files summarized at depth 9 have a total size (in GB) of:

In [10]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,:]['total_size_gb'].sum()

7683.5219999999999

Across all 100 subjects, the MNINonLinear results files have a total size (in GB) of:

In [11]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear']['total_size_gb'].sum()

5052.2399999999998
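
The two totals above can be combined into a share. The numbers are copied from the outputs of cells 10 and 11; the variable names are only illustrative.

```python
total_gb = 7683.522        # output of cell 10: everything summarized at depth 9
mninonlinear_gb = 5052.24  # output of cell 11: the 'MNINonLinear' part

share = mninonlinear_gb / total_gb
print(f'{share:.1%}')  # prints 65.8%
```

So roughly two thirds of the sampled data at this depth sits under 'MNINonLinear'.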

## Most recent access times in the MNINonLinear results directories

At depth 9 in the tree, every directory under each subject's MNINonLinear directory (for example MNINonLinear/Results/rfMRI_REST1_LR) has a most recent access date of 2017-11-16. This is the bulk of the data in the HCP dataset.

In [12]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear'].head()

| 6 | 8 | 9 | most_recent_access | size | num_files | parent_dir | total_size_gb | tree_depth |
|---|---|---|---|---|---|---|---|---|
| 178748 | Results | rfMRI_REST1_LR | 2017-11-16 | 8354434000.0 | 1534 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_LR | 8.354 | 9 |
| 178748 | Results | rfMRI_REST1_RL | 2017-11-16 | 8266014000.0 | 1404 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST1_RL | 8.266 | 9 |
| 178748 | Results | rfMRI_REST2_LR | 2017-11-16 | 8322172000.0 | 1484 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST2_LR | 8.322 | 9 |
| 178748 | Results | rfMRI_REST2_RL | 2017-11-16 | 8463622000.0 | 1664 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/rfMRI_REST2_RL | 8.464 | 9 |
| 178748 | Results | tfMRI_EMOTION | 2017-11-16 | 549169700.0 | 712 | /data/HCP/HCP_900/s3/hcp/178748/MNINonLinear/Results/tfMRI_EMOTION | 0.549 | 9 |


In [13]:
dfs[9].loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear']['most_recent_access'].unique()

array(['2017-11-16'], dtype=object)
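
The positional `.loc` selections used in this section pick one value at a given index level. As a possibly more readable alternative, `DataFrame.xs` selects by level explicitly. The sketch below uses a reduced two-level index (subject, directory) with values taken from the depth 7 output; it is an illustration, not the notebook's dataframe.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('178748', 'MNINonLinear'),
    ('178748', 'T1w'),
    ('200008', 'MNINonLinear'),
])
df = pd.DataFrame({'total_size_gb': [56.168, 9.632, 51.709]}, index=index)

# Select every row whose second index level is 'MNINonLinear';
# xs drops that level, leaving the subject ID as the index.
subset = df.xs('MNINonLinear', level=1)
print(subset)
print(subset['total_size_gb'].sum())
```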

### Summing the size of files/directories one level deeper that were not accessed

Grouping at depth 10 excludes files that sit directly in shallower directories, but it shows that even at this deeper level the large majority of the data under 'MNINonLinear' (about 1,037 GB of roughly 1,071 GB) was last accessed on 2017-11-16.

In [14]:
(dfs[10].
 loc['','data','HCP','HCP_900','s3','hcp',:,'MNINonLinear'].
 drop('tree_depth',axis = 1).
 groupby('most_recent_access').
 sum()
)

| most_recent_access | size | num_files | total_size_gb |
|---|---|---|---|
| 2016-02-06 | 411705.0 | 8 | 0.0 |
| 2017-04-04 | 17222290000.0 | 16226 | 17.178 |
| 2017-04-05 | 8204448000.0 | 7830 | 8.183 |
| 2017-04-06 | 4708983000.0 | 4172 | 4.696 |
| 2017-04-07 | 4140584000.0 | 3892 | 4.13 |
| 2017-05-02 | 54441.0 | 464 | 0.0 |
| 2017-11-16 | 1037069000000.0 | 1237293 | 1036.597 |
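
The share of data touched on the most recent date follows directly from the `total_size_gb` column above. The dictionary simply restates that column; the variable names are illustrative.

```python
# total_size_gb per most_recent_access date, copied from the output above
size_gb_by_date = {
    '2016-02-06': 0.0,
    '2017-04-04': 17.178,
    '2017-04-05': 8.183,
    '2017-04-06': 4.696,
    '2017-04-07': 4.13,
    '2017-05-02': 0.0,
    '2017-11-16': 1036.597,
}

total_gb = sum(size_gb_by_date.values())
recent_share = size_gb_by_date['2017-11-16'] / total_gb
print(f'{total_gb:.3f} GB in total, {recent_share:.1%} last accessed on 2017-11-16')
# prints: 1070.784 GB in total, 96.8% last accessed on 2017-11-16
```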


## Save conda environment

In [15]:
!   conda env export > hcp_characterization.yml
# but the following should work (pickle is part of the Python standard library, so it is not installed separately):
# conda create -n file_exploration_env python=3 pandas numpy