# File manager
This notebook shows how to organize files within the **desipipe** framework. It requires **desipipe**, which can be installed with:
```
python -m pip install git+https://github.com/cosmodesi/desipipe#egg=desipipe
```
You can also take a look at https://desipipe.readthedocs.io/en/latest/user/getting_started.html.

In [1]:
!rm -rf _tests_fm  # remove directory if exists

## File object
The `BaseFile` object holds the information that describes a file:
- file path
- author, description
- save / load functions
- options (specify how the file is to be produced)

In [2]:
from desipipe.file_manager import BaseFile

# The list of file types is in https://github.com/cosmodesi/desipipe/blob/main/desipipe/io.py
from desipipe.io import TextFile
print('registered file types', TextFile._registry)

file = BaseFile('_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits',  # path can have {} patterns to be replaced by options
                description='A mock catalog',
                author='You',
                filetype='catalog',
                options={'boxsize': 1000., 'imock': 1})

# To get description
print('description', file.description)
# To get options
print('options', file.options)
# To get filepath
print('filepath', file.filepath)

registered file types {'base': <class 'desipipe.io.BaseFile'>, 'text': <class 'desipipe.io.TextFile'>, 'catalog': <class 'desipipe.io.CatalogFile'>, 'power': <class 'desipipe.io.PowerSpectrumFile'>, 'correlation': <class 'desipipe.io.CorrelationFunctionFile'>, 'covariance': <class 'desipipe.io.ObservableCovarianceFile'>, 'observable': <class 'desipipe.io.ObservableArrayFile'>, 'wmatrix': <class 'desipipe.io.BaseMatrixFile'>, 'chain': <class 'desipipe.io.ChainFile'>, 'profiles': <class 'desipipe.io.ProfilesFile'>, 'generic': <class 'desipipe.io.GenericFile'>}
description A mock catalog
options {'boxsize': 1000.0, 'imock': 1}
filepath _tests_fm/data_boxsize1000_1.fits
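
The `{}` patterns in the path follow Python's format-specification syntax. The sketch below is not desipipe code: it reproduces the file path printed above using only `str.format`, on the assumption that desipipe resolves the patterns in a comparable way. The names `template` and `options` are illustrative.

```python
# Stand-alone illustration with plain Python strings (not the desipipe API).
template = '_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits'
options = {'boxsize': 1000., 'imock': 1}

# '.0f' writes the float without decimals, 'd' requires an integer.
filepath = template.format(**options)
print(filepath)

# Options that do not appear in the template leave the path unchanged.
filepath_extra = template.format(**options, csize=1000)
print(filepath_extra == filepath)
```

With plain `str.format`, an option that is absent from the template (here `csize`) is simply ignored, whereas a pattern without a matching option raises a `KeyError`.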


In [3]:
# Example: let's create a catalog
from mockfactory import RandomBoxCatalog

catalog = RandomBoxCatalog(boxsize=file.options['boxsize'], csize=1000)
# Let's save the catalog
file.save(catalog)

# Load the catalog
catalog = file.load()

In [4]:
# To copy and update
file = file.clone(description='a created mock catalog', options={**file.options, 'csize': 1000})
print(file.options)

# To create a symlink when saving the catalog, set the 'link' property
file = file.clone(link='_tests_fm/link_to_catalog.fits')
file.symlink()

{'boxsize': 1000.0, 'imock': 1, 'csize': 1000}


In [5]:
!ls -l _tests_fm/  # see, the symlink is there

total 32
-rw-r--r-- 1 adematti idphp 31680 oct.  17 20:54 data_boxsize1000_1.fits
lrwxrwxrwx 1 adematti idphp    33 oct.  17 20:54 link_to_catalog.fits -> _tests_fm/data_boxsize1000_1.fits


## File entry

A `BaseFileEntry` gathers a coherent set of `BaseFile` objects, i.e. files that differ only in their options.

In [6]:
from desipipe.file_manager import BaseFileEntry

# File entry for 100 mocks
entry = BaseFileEntry('_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits',  # path can have {} patterns to be replaced by options
                     description='A mock catalog',
                     author='You',
                     filetype='catalog',
                     options={'boxsize': 1000., 'imock': range(100)})

In [7]:
# To get a `BaseFileEntry` restricted to 'imock': range(10)

entry2 = entry.select(imock=range(10))
print(entry2)

BaseFileEntry(
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits,
filetype: catalog,
id: ,
author: You,
options: {'boxsize': [1000.0], 'imock': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
description: A mock catalog,
link: 
)


In [8]:
# To get a `BaseFile`, we need to specify all options!
# This returns the `BaseFile` for mock #3
file = entry2.get(imock=3)
print('filepath', file)
print('options', file.options)

filepath _tests_fm/data_boxsize1000_3.fits
options {'boxsize': 1000.0, 'imock': 3}
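
The difference between `select` (restrict the lists of options) and `get` (pin every option to a single value) can be pictured with plain dictionaries. This is a minimal sketch under that reading of the cells above; the functions `select` and `get` below are hypothetical helpers written for illustration, not the desipipe methods.

```python
def select(options, **restrict):
    """Return a copy of ``options`` with the given options restricted to the requested values."""
    selected = dict(options)
    for name, values in restrict.items():
        selected[name] = [value for value in options[name] if value in values]
    return selected


def get(options, **restrict):
    """Return one value per option, or raise if an option is not uniquely specified."""
    selected = select(options, **{name: [value] for name, value in restrict.items()})
    ambiguous = [name for name, values in selected.items() if len(values) != 1]
    if ambiguous:
        raise ValueError('options {} are not uniquely specified'.format(ambiguous))
    return {name: values[0] for name, values in selected.items()}


options = {'boxsize': [1000.], 'imock': list(range(100))}
restricted = select(options, imock=range(10))  # still 10 possible files
single = get(restricted, imock=3)              # exactly one file
print(restricted['imock'])
print(single)
```

In this sketch, calling `get(restricted)` without `imock` raises a `ValueError`, because ten values of `imock` remain.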


In [9]:
# Tip: if we do not know how many mocks we have, we can leave the option unconstrained by setting it to Ellipsis

entry2 = entry.clone(options={**entry.options, 'imock': Ellipsis})
print(entry2)

file = entry2.get(imock=2345)
print(file)

BaseFileEntry(
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits,
filetype: catalog,
id: ,
author: You,
options: {'boxsize': [1000.0], 'imock': Ellipsis},
description: A mock catalog,
link: 
)
_tests_fm/data_boxsize1000_2345.fits


In [10]:
# Tip: let's say we have an option, 'shuffled', that should only appear in the file path when it is True,
# we can provide a dictionary of {option: string_in_path}

entry3 = entry2.clone(path='_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}{shuffled}.fits',
                      options={**entry.options, 'shuffled': {True: '_shuffled', False: ''}})
print(entry3.get(imock=24, shuffled=True), entry3.get(imock=24, shuffled=False))

# Alternatively, we can specify foptions, in the same order as options
entry3 = entry2.clone(path='_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}{shuffled}.fits',
                      options={**entry.options, 'shuffled': [True, False]}, foptions={'shuffled': ['_shuffled', '']})
print(entry3.get(imock=24, shuffled=True), entry3.get(imock=24, shuffled=False))

_tests_fm/data_boxsize1000_24_shuffled.fits _tests_fm/data_boxsize1000_24.fits
_tests_fm/data_boxsize1000_24_shuffled.fits _tests_fm/data_boxsize1000_24.fits
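
The idea behind the `{option: string_in_path}` dictionary is a lookup from the option value to the text written in the path. The sketch below shows that lookup with standard Python only; `shuffled_in_path` and `format_path` are hypothetical names, and desipipe's internal handling may differ.

```python
# Stand-alone illustration (not the desipipe API).
template = '_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}{shuffled}.fits'

# Map each option value to the string that should appear in the path.
shuffled_in_path = {True: '_shuffled', False: ''}


def format_path(boxsize, imock, shuffled):
    """Format the template, translating the boolean option into its path string."""
    return template.format(boxsize=boxsize, imock=imock,
                           shuffled=shuffled_in_path[shuffled])


paths = [format_path(1000., 24, shuffled) for shuffled in (True, False)]
print(paths)
```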


## File Manager

The file manager aims at keeping track of files (of all kinds) produced in the processing.
A `FileManager` is a collection of `BaseFileEntry` objects, so it can contain many files of different types, authors, etc.
It is usually useful to identify a file entry with an "id", typically a short string that makes clear what the corresponding files are.

In [11]:
# Let's create a file manager, essentially works like a list of entries
from desipipe.file_manager import FileManager

fm = FileManager()
entry1 = BaseFileEntry('_tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits',
                       id='my_mock_catalog',  # id, this will be useful to select files
                       description='Mock catalogs',
                       author='You',
                       filetype='catalog',
                       options={'boxsize': 1000., 'imock': range(100)})
fm.append(entry1)
entry2 = BaseFileEntry('_tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy',
                       id='my_mock_power_spectrum',
                       description='Power spectrum measurements for mock catalogs',
                       author='You',
                       filetype='power',
                       options={'boxsize': 1000., 'imock': Ellipsis, 'interlacing': [0, 3]})
fm.append(entry2)
print(fm)

FileManager(
BaseFileEntry(
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits,
filetype: catalog,
id: my_mock_catalog,
author: You,
options: {'boxsize': [1000.0], 'imock': range(0, 100)},
description: Mock catalogs,
link: 
),
BaseFileEntry(
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy,
filetype: power,
id: my_mock_power_spectrum,
author: You,
options: {'boxsize': [1000.0], 'imock': Ellipsis, 'interlacing': [0, 3]},
description: Power spectrum measurements for mock catalogs,
link: 
)
)


In [12]:
# One can restrict the file manager to some specific options
fm = fm.select(imock=range(10))
print(fm)

# And select one specific file
file = fm.get(id='my_mock_power_spectrum')

FileManager(
BaseFileEntry(
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits,
filetype: catalog,
id: my_mock_catalog,
author: You,
options: {'boxsize': [1000.0], 'imock': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
description: Mock catalogs,
link: 
),
BaseFileEntry(
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy,
filetype: power,
id: my_mock_power_spectrum,
author: You,
options: {'boxsize': [1000.0], 'imock': range(0, 10), 'interlacing': [0, 3]},
description: Power spectrum measurements for mock catalogs,
link: 
)
)


ValueError: "get" is not applicable as there are multiple options:
BaseFileEntry(
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy,
filetype: power,
id: my_mock_power_spectrum,
author: You,
options: {'boxsize': [1000.0], 'imock': range(0, 10), 'interlacing': [0, 3]},
description: Power spectrum measurements for mock catalogs,
link: 
)

In [13]:
# Oops! The error states that there are multiple matching options: we did not select on 'imock' and 'interlacing'!

file = fm.get(id='my_mock_power_spectrum', imock=2, interlacing=3)
print(file, file.options)

_tests_fm/power_spectrum_boxsize1000_interlacing3_2.npy {'boxsize': 1000.0, 'imock': 2, 'interlacing': 3}


In [14]:
# We can also iterate on the file manager
for fi in fm.select(imock=[0, 1]).iter():
    print(fi)

_tests_fm/data_boxsize1000_0.fits
_tests_fm/data_boxsize1000_1.fits
_tests_fm/power_spectrum_boxsize1000_interlacing0_0.npy
_tests_fm/power_spectrum_boxsize1000_interlacing3_0.npy
_tests_fm/power_spectrum_boxsize1000_interlacing0_1.npy
_tests_fm/power_spectrum_boxsize1000_interlacing3_1.npy
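
The order of the power spectrum files above is what a Cartesian product over the options gives when the last option varies fastest. The sketch below reproduces those four paths with `itertools.product`; it only assumes that the options are visited in the order in which they were declared, and is not taken from the desipipe source.

```python
import itertools

# Stand-alone illustration (not the desipipe API).
template = ('_tests_fm/power_spectrum_boxsize{boxsize:.0f}'
            '_interlacing{interlacing:d}_{imock:d}.npy')
options = {'boxsize': [1000.], 'imock': [0, 1], 'interlacing': [0, 3]}

names = list(options)
paths = []
# product() iterates over all combinations, the last option varying fastest.
for values in itertools.product(*options.values()):
    paths.append(template.format(**dict(zip(names, values))))

for path in paths:
    print(path)
```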


In [15]:
# It is often useful to iterate on a file to produce, let's do this for the power spectrum:
# we want to get the catalog corresponding to each power spectrum to compute
for fpower in fm.select(id='my_mock_power_spectrum', imock=[0, 1]).iter():
    print(fpower.options)
    fcatalog = fm.get(id='my_mock_catalog', **fpower.options)

{'boxsize': 1000.0, 'imock': 0, 'interlacing': 0}


ValueError: "get" is not applicable as there are no matching entries

In [16]:
# Oops! This does not work because 'interlacing' is not one of the my_mock_catalog options
# To remedy this, let's use ignore=True

for fpower in fm.select(id='my_mock_power_spectrum', imock=[0, 1]).iter():
    fcatalog = fm.get(id='my_mock_catalog', **fpower.options, ignore=True)
    print(fpower, fcatalog)

_tests_fm/power_spectrum_boxsize1000_interlacing0_0.npy _tests_fm/data_boxsize1000_0.fits
_tests_fm/power_spectrum_boxsize1000_interlacing3_0.npy _tests_fm/data_boxsize1000_0.fits
_tests_fm/power_spectrum_boxsize1000_interlacing0_1.npy _tests_fm/data_boxsize1000_1.fits
_tests_fm/power_spectrum_boxsize1000_interlacing3_1.npy _tests_fm/data_boxsize1000_1.fits
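
One way to picture what `ignore=True` achieves is to drop, before the lookup, every requested option that the target entry does not define. The helper `keep_known_options` below is hypothetical and only illustrates this filtering with dictionaries; how desipipe implements it is not shown in this notebook.

```python
def keep_known_options(entry_options, requested):
    """Keep only the requested options that the entry actually defines."""
    return {name: value for name, value in requested.items() if name in entry_options}


# Options of the catalog entry and of one power spectrum file, as printed above.
catalog_options = {'boxsize': [1000.], 'imock': list(range(10))}
power_options = {'boxsize': 1000., 'imock': 0, 'interlacing': 3}

kept = keep_known_options(catalog_options, power_options)
print(kept)  # 'interlacing' is dropped, 'boxsize' and 'imock' are kept
```

An explicit alternative is to pass only the options the catalog needs, for example `imock=fpower.options['imock']`, which makes the dependency between the two entries visible in the code.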


In [17]:
# We can also select entries with keywords, which are matched against the entries' descriptions

print(fm.select(keywords='spectrum'))

FileManager(
BaseFileEntry(
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy,
filetype: power,
id: my_mock_power_spectrum,
author: You,
options: {'boxsize': [1000.0], 'imock': range(0, 10), 'interlacing': [0, 3]},
description: Power spectrum measurements for mock catalogs,
link: 
)
)
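
As a rough mental model, keyword selection can be thought of as a text search in the descriptions. The sketch below assumes a case-insensitive match in which every word of the keyword string must appear in the description; this rule is an assumption made for illustration and may not be the exact rule desipipe applies. `match_keywords` is a hypothetical helper.

```python
# Descriptions of the two entries defined above, indexed by id.
descriptions = {
    'my_mock_catalog': 'Mock catalogs',
    'my_mock_power_spectrum': 'Power spectrum measurements for mock catalogs',
}


def match_keywords(description, keywords):
    """Assumed rule: every keyword must appear in the description, ignoring case."""
    description = description.lower()
    return all(word in description for word in keywords.lower().split())


matched = [entry_id for entry_id, description in descriptions.items()
           if match_keywords(description, 'spectrum')]
print(matched)
```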


In [18]:
# Finally, the file manager can be saved to a .yaml file
fm.save('_tests_fm/files.yaml')

# And reloaded
fm = FileManager('_tests_fm/files.yaml')
print(fm)

FileManager(
BaseFileEntry(
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits,
filetype: catalog,
id: my_mock_catalog,
author: You,
options: {'boxsize': [1000.0], 'imock': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
description: Mock catalogs,
link: 
),
BaseFileEntry(
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy,
filetype: power,
id: my_mock_power_spectrum,
author: You,
options: {'boxsize': [1000.0], 'imock': range(0, 10), 'interlacing': [0, 3]},
description: Power spectrum measurements for mock catalogs,
link: 
)
)


In [19]:
# Display the yaml file
!cat '_tests_fm/files.yaml'

author: You
description: Mock catalogs
filetype: catalog
id: my_mock_catalog
link: ''
options:
  boxsize: [1000.0]
  imock: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
path: _tests_fm/data_boxsize{boxsize:.0f}_{imock:d}.fits
---
author: You
description: Power spectrum measurements for mock catalogs
filetype: power
id: my_mock_power_spectrum
link: ''
options:
  boxsize: [1000.0]
  imock: range(0, 10)
  interlacing: [0, 3]
path: _tests_fm/power_spectrum_boxsize{boxsize:.0f}_interlacing{interlacing:d}_{imock:d}.npy


In [20]:
# To add a new entry
fm.append(dict(description='added file', id='added_file', filetype='catalog', path='test.fits'))
# To delete an entry
del fm[-1]

# Interaction with the task manager
Let's write a file manager, and see how it can be used together with desipipe's task management system (look at `task_manager_examples.yaml`).

In [21]:
%%file '_tests_fm/files.yaml'

description: Some text file
id: my_input_file
filetype: text
path: ${SOMEDIR}/in_{option1}_{i:d}.txt
author: Chuck Norris
options:
  option1: ['a', 'b']
  i: range(0, 3, 1)

Overwriting _tests_fm/files.yaml


In [22]:
# We can provide environment variables, to be used in the paths
fm = FileManager('_tests_fm/files.yaml', environ=dict(SOMEDIR='_tests_fm'))
# Iterate over files
for fi in fm.select(keywords='text file', option1=['a']):
    print(fi)
    # Write text
    fi.save('hello world!')

_tests_fm/in_a_0.txt
_tests_fm/in_a_1.txt
_tests_fm/in_a_2.txt
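
The path in the yaml file mixes two kinds of placeholders: `${SOMEDIR}` for an environment variable and `{option1}`, `{i:d}` for options. The sketch below resolves them in two steps with the standard library (`string.Template`, then `str.format`). It is a stand-alone illustration of the idea, not the way desipipe is known to implement it.

```python
from string import Template

# Stand-alone illustration (not the desipipe API).
path = '${SOMEDIR}/in_{option1}_{i:d}.txt'
environ = {'SOMEDIR': '_tests_fm'}

# Step 1: replace ${...} placeholders; plain {...} patterns are left untouched.
with_env = Template(path).substitute(environ)
print(with_env)

# Step 2: fill in the options, as for any other file path.
filepaths = [with_env.format(option1='a', i=i) for i in range(0, 3, 1)]
print(filepaths)
```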


In [23]:
# Let's add a cloned entry
fm.append(fm.data[0].clone(id='my_output_file', description='cloned file', path='${SOMEDIR}/out_{option1}_{i:d}.txt'))
fm.save('_tests_fm/files.yaml')
# Display the new file database
!cat '_tests_fm/files.yaml'

author: Chuck Norris
description: Some text file
filetype: text
id: my_input_file
link: ''
options:
  i: range(0, 3)
  option1: [a, b]
path: ${SOMEDIR}/in_{option1}_{i:d}.txt
---
author: Chuck Norris
description: cloned file
filetype: text
id: my_output_file
link: ''
options:
  i: range(0, 3)
  option1: [a, b]
path: ${SOMEDIR}/out_{option1}_{i:d}.txt


In [24]:
from desipipe import Queue, Environment, TaskManager, FileManager, spawn

# Let's instantiate a Queue, which records all tasks to be performed
queue = Queue('my_fm_test', base_dir='_tests')
queue.clear()
# Pool of 4 workers
# Any environment variable can be passed to Environment: it will be set when running the tasks below
tm = TaskManager(queue, environ=Environment(), scheduler=dict(max_workers=4))

# Let's create a task!
@tm.python_app
def copy(text_in, text_out):
    import numpy as np  # just to illustrate below that the package version is tracked
    text = text_in.load()
    text += ' this is my first message'
    print('saving', text_out.filepath)
    text_out.save(text)

In [25]:
# Iterate over files, add tasks to the queue
for fo in fm.select(id='my_output_file', option1=['a']).iter():
    copy(fm.get(id='my_input_file', **fo.options), fo)

In [26]:
# Let's spawn a process
from desipipe import spawn
spawn(queue, timestep=1.)

[000000.36]  10-17 20:54  BaseFile                  INFO     Loading _tests_fm/in_a_0.txt
[000000.39]  10-17 20:54  BaseFile                  INFO     Loading _tests_fm/in_a_1.txt
[000000.40]  10-17 20:54  BaseFile                  INFO     Moving output to _tests_fm/out_a_0.txt
[000000.41]  10-17 20:54  BaseFile                  INFO     Loading _tests_fm/in_a_2.txt
[000000.44]  10-17 20:54  BaseFile                  INFO     Moving output to _tests_fm/out_a_1.txt
[000000.58]  10-17 20:54  BaseFile                  INFO     Moving output to _tests_fm/out_a_2.txt
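
The "Moving output to" log lines suggest that each output is first written elsewhere and then moved to its final path, which avoids leaving half-written files behind if a task fails. That reading of the log is an assumption. The sketch below shows the general write-then-move pattern with the standard library; `save_text_atomically` is a hypothetical helper, not a desipipe function.

```python
import os
import tempfile


def save_text_atomically(filepath, text):
    """Write ``text`` to a temporary file, then move it to ``filepath``."""
    dirname = os.path.dirname(os.path.abspath(filepath))
    os.makedirs(dirname, exist_ok=True)
    # Create the temporary file in the target directory, so that the final
    # move stays on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as tmp:
            tmp.write(text)
        os.replace(tmp_path, filepath)  # overwrites any existing file
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    out = os.path.join(tmp_dir, 'out_a_0.txt')
    save_text_atomically(out, 'hello world! this is my first message')
    with open(out) as f:
        content = f.read()
print(content)
```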


In [27]:
!ls -a _tests_fm/

.			 .desipipe   in_a_1.txt		   out_a_0.txt
..			 files.yaml  in_a_2.txt		   out_a_1.txt
data_boxsize1000_1.fits  in_a_0.txt  link_to_catalog.fits  out_a_2.txt


In [28]:
!cat _tests_fm/out_a_0.txt

hello world! this is my first message

In [29]:
# This is where desipipe processing information is saved
!ls -a _tests_fm/.desipipe
print('\n*.py file is:')
!cat _tests_fm/.desipipe/copy.py
print('\n*.versions file is:')
!cat _tests_fm/.desipipe/copy.versions

.  ..  copy.py	copy.versions

*.py file is:
def copy(text_in, text_out):
    import numpy as np  # just to illustrate below that the package version is tracked
    text = text_in.load()
    text += ' this is my first message'
    print('saving', text_out.filepath)
    text_out.save(text)

*.versions file is:
ctypes=1.1.0
mpi4py=4.0.0
numpy=1.26.4


In [30]:
# Delete queue
queue.delete(kill=False)