# 2.1 Advanced Indexing

## Indexing files

As was shown earlier, we can create an index of the data space using the `index()` method:

In [1]:
import signac

project = signac.get_project(root='projects/tutorial')
index = list(project.index())

for doc in index[:3]:
    print(doc)

{'V_liq': 32.8033857086239, 'signac_id': '732fb26b0b6fb83687625248b1f0a0b6', '_id': '732fb26b0b6fb83687625248b1f0a0b6', 'V_gas': 416.2699738123549, 'fluid': 'argon', 'statepoint': {'a': 1.355, 'kT': 1.0, 'p': 1.2000000000000002, 'b': 0.03201, 'N': 1000}}
{'V_liq': 30.65957481807377, 'signac_id': 'fdfd39a204f42e56bbe1b9c861674430', '_id': 'fdfd39a204f42e56bbe1b9c861674430', 'V_gas': 64.0177766935273, 'fluid': 'water', 'statepoint': {'a': 5.536, 'kT': 1.0, 'p': 7.800000000000001, 'b': 0.03049, 'N': 1000}}
{'V_liq': 30.659542839653493, 'signac_id': 'e578035d17bbb374a03d4af7c3f9ecaa', '_id': 'e578035d17bbb374a03d4af7c3f9ecaa', 'V_gas': 56.09500386107218, 'fluid': 'water', 'statepoint': {'a': 5.536, 'kT': 1.0, 'p': 8.9, 'b': 0.03049, 'N': 1000}}


At this point the index contains information about the statepoint and all data stored in the job document.
To also include in the index the `V.txt` text files that we used to store data, we need to tell **signac** the filename pattern and the file format.
Any Python class may serve as a format definition.

We will specify that, in addition to the job documents, all files matching the regular expression `.*/V\.txt` are to be indexed as `MyTextFile`.

In [2]:
class MyTextFile(object):
    pass

definitions = {r'.*/V\.txt': MyTextFile}  # raw string keeps the regex backslash intact
index = list(project.index(definitions))



You will notice that **signac** is warning us that the `MyTextFile` class has no `read()` and no `close()` method.
The exact reason why this is important will become clear in the section about master crawlers, but we can fix this problem for now by extending our class declaration slightly.

In [3]:
class MyTextFile(object):
    
    def __init__(self, fd):
        self.fd = fd
        
    def read(self):
        return self.fd.read()
    
    def close(self):
        self.fd.close()

definitions = {r'.*/V\.txt': MyTextFile}  # raw string keeps the regex backslash intact
index = list(project.index(definitions))

Because this is a very common pattern, we don't need to implement it every time; we can simply use or specialize **signac**'s `TextFile` class.

In [4]:
from signac.contrib.formats import TextFile

index = list(project.index({r'.*/V\.txt': TextFile}))
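
To make the pattern-to-format mapping more concrete, the following signac-free sketch shows one way such a definitions dictionary could be matched against file paths. It is an illustration only, not **signac**'s actual implementation; `TextFileSketch`, `match_format` and the example paths are hypothetical names introduced here.

```python
import re


class TextFileSketch:
    """Hypothetical stand-in for a format class (not signac's TextFile)."""


# Same shape as the definitions used above: regular expression -> format class.
definitions = {r'.*/V\.txt': TextFileSketch}


def match_format(path, definitions):
    """Return the format class of the first pattern matching `path`, else None."""
    for pattern, fmt in definitions.items():
        if re.match(pattern, path):
            return fmt
    return None


# Hypothetical paths, for illustration only.
print(match_format('workspace/job_a/V.txt', definitions))      # the format class
print(match_format('workspace/job_a/V_txt', definitions))      # None: '\.' needs a literal dot
print(match_format('V.txt', definitions))                      # None: pattern requires a '/'
```

Note that `re.match` only anchors at the start of the string, so in this sketch a path such as `job_a/V.txt.bak` would also match; appending `$` to the pattern would exclude it.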

Accessing files via the index is useful, for example, to select specific subsets of the data.

In [5]:
import os

def select(doc):
    return 'TextFile' in doc.get('format', '') and doc['statepoint']['p'] < 5.0

docs_selected = [doc for doc in index if select(doc)]
for doc in docs_selected[:3]:
    print('p=', doc['statepoint']['p'], end=' ')
    fn = os.path.join(doc['root'], doc['filename'])
    with open(fn) as file:
        print('V=', file.read().strip())

p= 1.2000000000000002 V= 32.8033857086239,416.2699738123549
p= 3.4000000000000004 V= 32.80193336746696,146.6628568456784
p= 0.1 V= 32.804113976682224,8430.935727416612
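
The file contents printed above are still raw text. As a small sketch, the comma-separated values could be converted to numbers as shown below. The column order (liquid volume first, gas volume second) is an assumption inferred from comparing the output with the `V_liq` and `V_gas` entries of the same job in the index; `parse_volumes` is a hypothetical helper, not part of **signac**.

```python
def parse_volumes(text):
    """Parse 'V_liq,V_gas' text into a dict of floats (assumed column order)."""
    v_liq, v_gas = (float(value) for value in text.strip().split(','))
    return {'V_liq': v_liq, 'V_gas': v_gas}


# Content of one V.txt file, copied from the output above.
sample = '32.8033857086239,416.2699738123549\n'
volumes = parse_volumes(sample)
print(volumes)
```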


## Customized Project Crawlers

The `index()` function and the `$ signac index` command both internally create a `Crawler` instance that crawls through the data space and creates the index.
To have more control over the indexing process, we can do this explicitly:

In [6]:
from signac.contrib.crawler import SignacProjectCrawler
from signac.contrib.formats import TextFile

# Specialize a SignacProjectCrawler...
class TutorialProjectCrawler(SignacProjectCrawler):
    pass

# Define files to index...
TutorialProjectCrawler.define(r'.*/V\.txt', TextFile)

# Create a crawler instance and generate the index.
crawler = TutorialProjectCrawler(root=project.workspace())
index = list(crawler.crawl())
for doc in index[:3]:
    print(doc)

{'V_liq': 32.8033857086239, 'signac_id': '732fb26b0b6fb83687625248b1f0a0b6', '_id': '732fb26b0b6fb83687625248b1f0a0b6', 'V_gas': 416.2699738123549, 'fluid': 'argon', 'statepoint': {'a': 1.355, 'kT': 1.0, 'p': 1.2000000000000002, 'b': 0.03201, 'N': 1000}}
{'V_liq': 30.65957481807377, 'signac_id': 'fdfd39a204f42e56bbe1b9c861674430', '_id': 'fdfd39a204f42e56bbe1b9c861674430', 'V_gas': 64.0177766935273, 'fluid': 'water', 'statepoint': {'a': 5.536, 'kT': 1.0, 'p': 7.800000000000001, 'b': 0.03049, 'N': 1000}}
{'V_liq': 30.659542839653493, 'signac_id': 'e578035d17bbb374a03d4af7c3f9ecaa', '_id': 'e578035d17bbb374a03d4af7c3f9ecaa', 'V_gas': 56.09500386107218, 'fluid': 'water', 'statepoint': {'a': 5.536, 'kT': 1.0, 'p': 8.9, 'b': 0.03049, 'N': 1000}}


We could specialize the `TutorialProjectCrawler` further, e.g., to add more metadata to the index.
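
The crawler hooks needed for such a specialization are not shown in this tutorial, so the sketch below takes a different, simpler route: it post-processes an already generated index and attaches extra fields to every document. `with_metadata` and the `owner` key are hypothetical and only serve as an illustration; the sample documents are abbreviated versions of the ones printed above.

```python
def with_metadata(index, **extra):
    """Yield a copy of each index document with the `extra` fields added."""
    for doc in index:
        enriched = dict(doc)      # shallow copy, the original document stays untouched
        enriched.update(extra)
        yield enriched


# Abbreviated index documents (ids and fluids taken from the output above).
sample_index = [
    {'_id': '732fb26b0b6fb83687625248b1f0a0b6', 'fluid': 'argon'},
    {'_id': 'fdfd39a204f42e56bbe1b9c861674430', 'fluid': 'water'},
]

tagged = list(with_metadata(sample_index, owner='johndoe'))  # 'owner' is a made-up field
for doc in tagged:
    print(doc)
```

With a real index, the same generator could be applied to the result of `crawler.crawl()` before the documents are stored or searched.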

## Using a Master Crawler

A master crawler uses other crawlers to compile a combined master index of one or more data spaces.
This allows you, and everyone else with access to the master index, to search and possibly access all data within the shared data space.

To expose the project to a `MasterCrawler` we need to create a so-called *access module*.
For signac projects this is simplified by using the `create_access_module()` method.
Let's create an access module:

In [7]:
try:
    project.create_access_module({r'.*/V\.txt': TextFile})
except IOError:
    pass  # File already exists...

This function creates a file called `signac_access.py` within our project's root directory.

In [8]:
% cat projects/tutorial/signac_access.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

from signac.contrib.crawler import SignacProjectCrawler
from signac.contrib.crawler import MasterCrawler
from signac.contrib.formats import TextFile


class TutorialProjectCrawler(SignacProjectCrawler):
    pass
TutorialProjectCrawler.define('.*/V\.txt', TextFile)


def get_crawlers(root):
    return {'main': TutorialProjectCrawler(os.path.join(root, 'workspace'))}


if __name__ == '__main__':
    master_crawler = MasterCrawler('.')
    for doc in master_crawler.crawl(depth=1):
        print(doc)


You will notice that this file looks very similar to our custom crawler definition earlier.
It also shows us how to execute a master crawler for this data space.
Let's do that:

In [9]:
from signac.contrib.crawler import MasterCrawler
master_crawler = MasterCrawler('projects')
master_index = list(master_crawler.crawl(depth=1))
for doc in master_index[:3]:
    print(doc)

{'project': 'tutorial', 'format': None, 'V_liq': 32.8033857086239, 'signac_id': '732fb26b0b6fb83687625248b1f0a0b6', '_id': '732fb26b0b6fb83687625248b1f0a0b6', 'V_gas': 416.2699738123549, 'fluid': 'argon', 'signac_link': {'access_module': 'signac_access.py', 'access_crawler_id': 'main', 'link_type': 'module_fetch', 'access_crawler_root': '/home/johndoe/signac-examples/notebooks/projects/tutorial'}, 'statepoint': {'a': 1.355, 'kT': 1.0, 'p': 1.2000000000000002, 'b': 0.03201, 'N': 1000}}
{'project': 'tutorial', 'format': None, 'V_liq': 30.65957481807377, 'signac_id': 'fdfd39a204f42e56bbe1b9c861674430', '_id': 'fdfd39a204f42e56bbe1b9c861674430', 'V_gas': 64.0177766935273, 'fluid': 'water', 'signac_link': {'access_module': 'signac_access.py', 'access_crawler_id': 'main', 'link_type': 'module_fetch', 'access_crawler_root': '/home/johndoe/signac-examples/notebooks/projects/tutorial'}, 'statepoint': {'a': 5.536, 'kT': 1.0, 'p': 7.800000000000001, 'b': 0.03049, 'N': 1000}}
{'project': 'tutorial

The index generated by the master crawler contains all the information about our project, including the files, even though we did not pass any file definitions to the master crawler itself.
This is possible because the `MasterCrawler` searches the data space for files named `signac_access.py` and then collects all indexes generated by the *slave crawlers* defined within these modules.

This allows us to easily generate a *master index* of multiple projects and even to fetch data directly using only the index, as the following sections show.
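
The discovery step described above can be pictured with a short, signac-free sketch that walks a directory tree and collects files named `signac_access.py` up to a given depth. This is not the `MasterCrawler` implementation, and the exact meaning of `depth` in **signac** may differ; `find_access_modules` and the temporary directory layout are hypothetical.

```python
import os
import tempfile


def find_access_modules(root, depth=1, name='signac_access.py'):
    """Yield paths of files called `name` at most `depth` levels below `root`."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        level = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if level >= depth:
            dirnames[:] = []      # prune: do not descend any further
        if name in filenames:
            yield os.path.join(dirpath, name)


# Build a throwaway tree: one module at depth 1, one buried too deep to be found.
with tempfile.TemporaryDirectory() as root:
    for sub in ('tutorial', os.path.join('tutorial', 'workspace', 'job')):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        open(os.path.join(root, sub, 'signac_access.py'), 'w').close()
    found = [os.path.relpath(path, root)
             for path in find_access_modules(root, depth=1)]

print(found)
```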

## Fetch data via filename

Just like before, we can access data via the filenames specified in the index documents:

In [10]:
import os

docs_files = [doc for doc in master_index if doc['format'] is not None]
for doc in docs_files[:3]:
    fn = os.path.join(doc['root'], doc['filename'])
    with open(fn) as file:
        print(doc['statepoint']['p'], file.read().strip())

1.2000000000000002 32.8033857086239,416.2699738123549
7.800000000000001 30.65957481807377,64.0177766935273
8.9 30.659542839653493,56.09500386107218


## Fetch data via index

Even better, data files that were indexed with a `MasterCrawler` can be fetched seamlessly using the `fetch()` and `fetch_one()` functions:

In [11]:
for doc in docs_files[:3]:
    file = signac.fetch_one(doc)
    print(doc['statepoint']['p'], file.read().strip())

1.2000000000000002 32.8033857086239,416.2699738123549
7.800000000000001 30.65957481807377,64.0177766935273
8.9 30.659542839653493,56.09500386107218


Using the `fetch()` and `fetch_one()` functions allows us to access data in a way that is agnostic to the actual data source.

The format of each file object is the one that we defined earlier, which is why it is important that it has a `read()` and `close()` method.

In [12]:
for doc in docs_files[:3]:
    file = signac.fetch_one(doc)
    print(type(file))

<class 'signac.contrib.formats.TextFile'>
<class 'signac.contrib.formats.TextFile'>
<class 'signac.contrib.formats.TextFile'>
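
To illustrate why the `read()` and `close()` methods matter, the sketch below uses a class shaped like the `MyTextFile` definition from earlier, wrapped around an in-memory stream instead of a real file. Any consumer that relies only on these two methods, here `contextlib.closing`, can handle the object without knowing where the data comes from. `TextFileSketch` is a hypothetical stand-in, not **signac**'s `TextFile`.

```python
import io
from contextlib import closing


class TextFileSketch:
    """Minimal file-like format wrapper with read() and close()."""

    def __init__(self, fd):
        self.fd = fd

    def read(self):
        return self.fd.read()

    def close(self):
        self.fd.close()


# An in-memory stream stands in for a fetched data source.
stream = io.StringIO('32.8033857086239,416.2699738123549\n')

# closing() calls close() on exit, so the underlying stream is released.
with closing(TextFileSketch(stream)) as file:
    content = file.read().strip()

print(content)
print(stream.closed)
```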
