# 2.1 Advanced Indexing

## Indexing files

As was shown earlier, we can create an index of the data space using the `index()` method:

In [1]:
import signac

project = signac.get_project(root='projects/tutorial')
index = list(project.index())

for doc in index[:3]:
    print(doc)

{'_id': '10743bc8b95bffab09503bce9abbe627', 'V_gas': 10000.0, 'statepoint': {'b': 0, 'kT': 1.0, 'p': 0.1, 'a': 0, 'N': 1000}, 'fluid': 'ideal gas', 'V_liq': 0.0, 'signac_id': '10743bc8b95bffab09503bce9abbe627'}
{'_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'V_liq': 0, 'signac_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'V_gas': 1000.0, 'statepoint': {'b': 0, 'kT': 1.0, 'p': 1.0, 'a': 0, 'N': 1000}}
{'_id': '304357838edbf2ec730f4847bb8a0e20', 'V_gas': 100.0, 'statepoint': {'b': 0, 'kT': 1.0, 'p': 10.0, 'a': 0, 'N': 1000}, 'fluid': 'ideal gas', 'V_liq': 0.0, 'signac_id': '304357838edbf2ec730f4847bb8a0e20'}


At this point the index contains information about the statepoint and all data stored in the job document.
If we also want to index the `V.txt` text files in which we stored data, we need to tell **signac** the filename pattern and, optionally, the file format.
Any name defined as a `str` constant, or even a Python class, may serve as a format definition.

We will specify that, in addition to the job documents, all files matching the regular expression `.*/V\.txt` are to be indexed as `TextFile`.

In [2]:
definitions = {r'.*/V\.txt': 'TextFile'}  # raw string avoids an invalid-escape warning for '\.'
index = list(project.index(definitions))
for doc in index[-3:]:
    print(doc)

{'_id': '11d61a67dcd734885038d6cdc71d279a', 'md5': 'd84a7bdf88719b706b1b2b86169dd14f', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'statepoint': {'b': 0.03201, 'kT': 1.0, 'a': 1.355, 'p': 7.800000000000001, 'N': 1000}, 'file_id': 'd84a7bdf88719b706b1b2b86169dd14f', 'signac_id': '3b4c55fd2100f14c914afd33bc3ef530', 'filename': '3b4c55fd2100f14c914afd33bc3ef530/V.txt', 'format': 'TextFile'}
{'_id': 'be351f5e04f69a3b1725c4729042cb1f', 'md5': '58d1e9bfeb4c80cd0713d876f38143af', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'statepoint': {'b': 0.03201, 'kT': 1.0, 'a': 1.355, 'p': 8.9, 'N': 1000}, 'file_id': '58d1e9bfeb4c80cd0713d876f38143af', 'signac_id': 'bd5c16f1a57c1c664bada558045cd06e', 'filename': 'bd5c16f1a57c1c664bada558045cd06e/V.txt', 'format': 'TextFile'}
{'_id': 'e02537599f8e86e5aaed45c7381ae345', 'md5': '1d80dad51981a1f55c105b4b5dd22a11', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspac

**Tip**: Consider creating a shared set of format definitions within your environment to serve as format conventions.
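
To make the pattern-to-format mapping concrete, here is a minimal sketch of how such definitions could be checked against file paths. It does not use signac: `FORMAT_DEFINITIONS` and `match_format` are hypothetical names, and the sketch assumes that patterns are applied with `re.match` to the file path.

```python
import re

# Hypothetical shared conventions: regular expression -> format name.
FORMAT_DEFINITIONS = {
    r'.*/V\.txt': 'TextFile',
}


def match_format(path, definitions=FORMAT_DEFINITIONS):
    """Return the format name of the first pattern matching path, or None."""
    for pattern, format_name in definitions.items():
        # re.match anchors the pattern at the start of the path only.
        if re.match(pattern, path):
            return format_name
    return None


print(match_format('workspace/3b4c55fd2100f14c914afd33bc3ef530/V.txt'))  # TextFile
print(match_format('workspace/3b4c55fd2100f14c914afd33bc3ef530/Vxtxt'))  # None
```

Because `re.match` does not anchor the end of the string, a path such as `.../V.txt.bak` would also match in this sketch; appending `$` to the pattern would exclude it.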

Accessing files via the index is useful, for example, to select specific data subsets.

In [3]:
import os

def select(doc):
    return 'TextFile' in doc.get('format', '') and doc['statepoint']['p'] < 5.0

docs_selected = [doc for doc in index if select(doc)]
for doc in docs_selected[:3]:
    print('p=', doc['statepoint']['p'], end=' ')
    fn = os.path.join(doc['root'], doc['filename'])
    with open(fn) as file:
        print('V=', file.read().strip())

p= 0.1 V= 0.0,10000.0
p= 1.0 V= 0,1000.0
p= 2.575 V= 0,388.34951456310677
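
The file content printed above is raw text. A minimal sketch for turning it into numbers follows; it assumes that each `V.txt` holds two comma-separated values, which the output above suggests correspond to `V_liq` and `V_gas`. The helper `parse_volumes` is a hypothetical name and not part of signac.

```python
def parse_volumes(text):
    """Parse 'V_liq,V_gas' text into a pair of floats (assumed file layout)."""
    v_liq, v_gas = (float(value) for value in text.strip().split(','))
    return v_liq, v_gas


print(parse_volumes('0.0,10000.0\n'))  # (0.0, 10000.0)
```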


## Customized Project Crawlers

The `index()` function and the `$ signac index` command both internally create a `Crawler` instance that crawls through the data space and creates the index.
To have more control over the indexing process, we can do this explicitly:

In [4]:
from signac.contrib import SignacProjectCrawler

# Specialize a SignacProjectCrawler...
class TutorialProjectCrawler(SignacProjectCrawler):
    pass

# Define files to index...
TutorialProjectCrawler.define(r'.*/V\.txt', 'TextFile')

# Create a crawler instance and generate the index.
crawler = TutorialProjectCrawler(root=project.workspace())
index = list(crawler.crawl())
for doc in index[:3]:
    print(doc)

{'_id': '10743bc8b95bffab09503bce9abbe627', 'V_gas': 10000, 'V_liq': 0, 'fluid': 'ideal gas', 'statepoint': {'b': 0, 'kT': 1.0, 'p': 0.1, 'a': 0, 'N': 1000}, 'signac_id': '10743bc8b95bffab09503bce9abbe627'}
{'signac_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'statepoint': {'b': 0, 'kT': 1.0, 'p': 1.0, 'a': 0, 'N': 1000}, '_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'V_gas': 1000, 'V_liq': 0}
{'_id': '304357838edbf2ec730f4847bb8a0e20', 'V_gas': 100, 'V_liq': 0, 'fluid': 'ideal gas', 'statepoint': {'b': 0, 'kT': 1.0, 'p': 10.0, 'a': 0, 'N': 1000}, 'signac_id': '304357838edbf2ec730f4847bb8a0e20'}


We could specialize the `TutorialProjectCrawler` further, e.g., to add more metadata to the index.
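
The idea of specializing a crawler to add metadata can be sketched without signac. The classes below are hypothetical stand-ins (`ToyCrawler`, `AnnotatingCrawler`) and the `size_bytes` field is invented for illustration; they mimic the define-then-crawl pattern shown above but are not signac's implementation.

```python
import os
import re
import tempfile


class ToyCrawler:
    """Stand-in crawler: walks root and yields one document per matching file."""
    definitions = {}

    def __init__(self, root):
        self.root = root

    @classmethod
    def define(cls, pattern, format_name):
        # Build a new dict so that a subclass does not mutate its parent's definitions.
        cls.definitions = {**cls.definitions, pattern: format_name}

    def docs_from_file(self, dirpath, filename):
        # Normalize separators so that the '/' in the pattern matches on any OS.
        path = os.path.join(dirpath, filename).replace(os.sep, '/')
        for pattern, format_name in self.definitions.items():
            if re.match(pattern, path):
                yield {'root': dirpath, 'filename': filename, 'format': format_name}

    def crawl(self):
        for dirpath, _, filenames in os.walk(self.root):
            for filename in sorted(filenames):
                yield from self.docs_from_file(dirpath, filename)


class AnnotatingCrawler(ToyCrawler):
    """Specialization that adds extra metadata to every document."""

    def docs_from_file(self, dirpath, filename):
        for doc in super().docs_from_file(dirpath, filename):
            doc['size_bytes'] = os.path.getsize(os.path.join(dirpath, filename))
            yield doc


AnnotatingCrawler.define(r'.*/V\.txt', 'TextFile')

# Build a throwaway data space with a single job directory and crawl it.
with tempfile.TemporaryDirectory() as root:
    job_dir = os.path.join(root, 'job0')
    os.makedirs(job_dir)
    with open(os.path.join(job_dir, 'V.txt'), 'w') as file:
        file.write('0.0,100.0\n')
    docs = list(AnnotatingCrawler(root).crawl())

print(docs)
```

Overriding a single per-file hook keeps the directory traversal in the base class, so a specialization only needs to describe what it adds to each document.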

## Using a Master Crawler

A master crawler uses other crawlers to compile a combined master index of one or more data spaces.
This allows you, and everyone else who has access to the master index, to search and possibly access all data within the shared data space.

To expose the project to a `MasterCrawler`, we need to create a so-called *access module*.
For signac projects, this is simplified by the `create_access_module()` method.
Let's create an access module:

In [5]:
try:
    project.create_access_module({r'.*/V\.txt': 'TextFile'})
except IOError:
    pass  # File already exists...

This function creates a file called `signac_access.py` within our project's root directory.

In [6]:
% cat projects/tutorial/signac_access.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

from signac.contrib import SignacProjectCrawler
from signac.contrib import MasterCrawler


class TutorialProjectCrawler(SignacProjectCrawler):
    pass
TutorialProjectCrawler.define('.*/V\.txt', 'TextFile')


def get_crawlers(root):
    return {'main': TutorialProjectCrawler(os.path.join(root, 'workspace'))}


if __name__ == '__main__':
    master_crawler = MasterCrawler('.')
    for doc in master_crawler.crawl(depth=1):
        print(doc)


You will notice that this file looks very similar to our custom crawler definition earlier.
It also shows us how to execute a master crawler for this data space.
Let's do that:

In [7]:
from signac.contrib import MasterCrawler
master_crawler = MasterCrawler('projects')
master_index = list(master_crawler.crawl(depth=1))
for doc in master_index[:3]:
    print(doc)

{'_id': '10743bc8b95bffab09503bce9abbe627', 'format': None, 'V_gas': 10000, 'V_liq': 0, 'fluid': 'ideal gas', 'statepoint': {'b': 0, 'kT': 1.0, 'p': 0.1, 'a': 0, 'N': 1000}, 'signac_id': '10743bc8b95bffab09503bce9abbe627', 'project': 'tutorial'}
{'_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'format': None, 'V_gas': 1000, 'statepoint': {'b': 0, 'kT': 1.0, 'p': 1.0, 'a': 0, 'N': 1000}, 'V_liq': 0, 'signac_id': 'f906bdf73414abbbd2e8d2b672201fb3', 'project': 'tutorial'}
{'_id': '304357838edbf2ec730f4847bb8a0e20', 'format': None, 'V_gas': 100, 'V_liq': 0, 'fluid': 'ideal gas', 'statepoint': {'b': 0, 'kT': 1.0, 'p': 10.0, 'a': 0, 'N': 1000}, 'signac_id': '304357838edbf2ec730f4847bb8a0e20', 'project': 'tutorial'}


The index generated by the master crawler contains all the information about our project, including the files, without us having to provide any additional information.
This is possible because the `MasterCrawler` searches the data space for files named `signac_access.py` and then collects all indexes generated by the slave crawlers defined within these modules.

This allows us to easily generate a *master index* of multiple projects and even to fetch data directly using only the index, as the following sections show.
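
The discovery mechanism described above can be sketched in plain Python. This is an illustration under assumptions, not signac's implementation: `find_access_modules`, `load_crawlers`, and `master_crawl` are hypothetical helpers, `depth=1` is assumed to mean "the root and its direct subdirectories", and the demo access module defines a dummy crawler so that the example runs without signac.

```python
import importlib.util
import os
import tempfile

ACCESS_MODULE = 'signac_access.py'


def find_access_modules(root, depth=1):
    """Yield paths of access modules found at most `depth` levels below root."""
    root = os.path.abspath(root)
    base_level = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath.count(os.sep) - base_level >= depth:
            dirnames[:] = []  # do not descend any further
        if ACCESS_MODULE in filenames:
            yield os.path.join(dirpath, ACCESS_MODULE)


def load_crawlers(module_path):
    """Import an access module from its path and call its get_crawlers()."""
    spec = importlib.util.spec_from_file_location('access_module', module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.get_crawlers(os.path.dirname(module_path))


def master_crawl(root, depth=1):
    """Combine the documents of all crawlers exposed below root."""
    for module_path in find_access_modules(root, depth):
        for crawler in load_crawlers(module_path).values():
            yield from crawler.crawl()


# A dummy access module (no signac import) used only for this demonstration.
DEMO_MODULE = '''
class DemoCrawler:
    def __init__(self, root):
        self.root = root

    def crawl(self):
        yield {'root': self.root, 'format': None}


def get_crawlers(root):
    return {'main': DemoCrawler(root)}
'''

with tempfile.TemporaryDirectory() as projects:
    os.makedirs(os.path.join(projects, 'tutorial'))
    with open(os.path.join(projects, 'tutorial', ACCESS_MODULE), 'w') as file:
        file.write(DEMO_MODULE)
    master_docs = list(master_crawl(projects))

print(len(master_docs))  # 1
```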

## Fetch data via filename

Just like before, we can access data via the filenames specified in the index documents:

In [8]:
import os

docs_files = [doc for doc in master_index if doc['format'] is not None]
for doc in docs_files[:3]:
    fn = os.path.join(doc['root'], doc['filename'])
    with open(fn) as file:
        print(doc['statepoint']['p'], file.read().strip())

0.1 0.0,10000.0
1.0 0,1000.0
10.0 0.0,100.0


## Fetch data via index documents

But even better, data files can be seamlessly fetched using the `signac.fetch()` function:

In [9]:
for doc in docs_files[:3]:
    with signac.fetch(doc) as file:
        print(doc['statepoint']['p'], file.read().strip())

0.1 0.0,10000.0
1.0 0,1000.0
10.0 0.0,100.0


Think of `fetch()` like the built-in `open()` function. It allows us to retrieve and open files based on the index document (file id) instead of an absolute file path. This makes it easier to operate on data independently of its actual physical location.
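
To illustrate the analogy with `open()`, here is a minimal sketch of a fetch-like helper. It is not how `signac.fetch()` is implemented: `toy_fetch` is a hypothetical function that resolves the file from the `root` and `filename` fields of an index document and, assuming that `md5` is the checksum of the file content, verifies it before handing out the file object.

```python
import hashlib
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def toy_fetch(doc, mode='r'):
    """Open the file described by an index document (illustration only)."""
    path = os.path.join(doc['root'], doc['filename'])
    # Verify that the file content still matches the checksum stored in the index.
    with open(path, 'rb') as file:
        if hashlib.md5(file.read()).hexdigest() != doc['md5']:
            raise IOError('File content does not match the index document.')
    with open(path, mode) as file:
        yield file


# Create a throwaway file plus a matching index document, then fetch it.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'V.txt'), 'w') as file:
        file.write('0.0,100.0')
    doc = {
        'root': root,
        'filename': 'V.txt',
        'md5': hashlib.md5(b'0.0,100.0').hexdigest(),
    }
    with toy_fetch(doc) as file:
        content = file.read().strip()

print(content)  # 0.0,100.0
```

The caller only passes the index document; where the file lives is resolved inside the helper, which is the property that makes code independent of the storage location.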