# 4. Advanced Indexing

**This part of the tutorial covers a special topic which may not be relevant to all users!**

## Indexing files

As was shown earlier, we can create an index of the data space using the `index()` method:

In [None]:
import signac

project = signac.get_project(root='projects/tutorial')

for doc in project.index():
    print(doc)

At this point the index contains information about the statepoint and all data stored in the job document. If we use text files to store data, we additionally need to specify the format of those files to make them indexable. In general, any Python class may serve as a format definition; ideally, however, a format class provides a file-like interface. An example of such a format class is the `TextFile` class.
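As a rough illustration of what a "file-like interface" could mean here, the sketch below defines a hypothetical format class, `VolumeTextFile`, that wraps an open file object and forwards `read()` and `close()` to it. The class name and its methods are assumptions made for illustration only; they are not taken from the signac source.

```python
import io


class VolumeTextFile:
    """Hypothetical format class wrapping an open text file object."""

    def __init__(self, file_object):
        self._file_object = file_object

    def read(self, size=-1):
        # Delegate reading to the wrapped file object.
        return self._file_object.read(size)

    def close(self):
        # Release the wrapped file object.
        self._file_object.close()


# An in-memory file stands in for a real text file on disk.
example = VolumeTextFile(io.StringIO("12.5\n"))
content = example.read().strip()
print(content)
example.close()
```

Because the class only relies on `read()` and `close()`, any object that behaves like an open file could be wrapped this way.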

We will specify that in addition to the job documents all files matching the regular expression `.*/V\.txt` are to be indexed as `TextFile`:

In [None]:
from signac.contrib.formats import TextFile

# A raw string keeps the backslash in the regular expression intact.
for doc in project.index({r'.*/V\.txt': TextFile}):
    print(doc)

You will notice that the index now has a few additional entries for the text files.
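To get a feel for which files a pattern like `.*/V\.txt` selects, the following sketch applies it to a few made-up paths using Python's `re` module. The paths are hypothetical, and whether signac itself matches with `re.match` or with a different function is an assumption here.

```python
import re

pattern = re.compile(r'.*/V\.txt')

# Hypothetical paths, made up for illustration.
paths = [
    'workspace/abc123/V.txt',
    'workspace/abc123/signac_job_document.json',
    'workspace/def456/V_txt',
    'V.txt',
]

# Keep only the paths that the pattern matches from the start.
matching = [p for p in paths if pattern.match(p)]
print(matching)
```

Only the first path matches: the escaped dot rules out `V_txt`, and the `/` in the pattern requires the file to sit inside some directory.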

This is primarily useful to expose the data set to a database, for example using a `MasterCrawler`.

## Customized Project Crawlers

The `index()` function as well as the `$ signac index` command internally set up a `SignacProjectCrawler` to crawl through the data space and create the index.
To have more control over the indexing process, we can do this explicitly:

In [None]:
from signac.contrib.crawler import SignacProjectCrawler
from signac.contrib.formats import TextFile

class TutorialProjectCrawler(SignacProjectCrawler):
    pass

# Register the pattern as a raw string so the backslash is preserved.
TutorialProjectCrawler.define(r'.*/V\.txt', TextFile)

crawler = TutorialProjectCrawler(root=project.workspace())
for doc in crawler.crawl():
    print(doc)

We could specialize the `TutorialProjectCrawler` further, e.g., to add more metadata to the index.

## Using a Master Crawler

A master crawler uses other crawlers to compile a combined master index of one or more data spaces. This allows you to expose your project data to yourself and to everyone else who has access to the index.

To expose the project to a `MasterCrawler` we need to create an access module.
For signac projects this is simplified by using the `create_access_module()` method.
Let's create the access module:

In [None]:
try:
    project.create_access_module({r'.*/V\.txt': TextFile})
except IOError:
    pass  # File already exists...

This function creates a file called `signac_access.py` within our project's root directory.

In [None]:
% cat projects/tutorial/signac_access.py

You will notice that this file looks very similar to our custom crawler definition earlier.
It also shows us how to run a master crawler for this data space.
Let's do that:

In [None]:
from signac.contrib.crawler import MasterCrawler
master_crawler = MasterCrawler('projects')
master_index = list(master_crawler.crawl(depth=1))
for doc in master_index:
    print(doc)

The index generated by the master crawler contains all the information about our project, even the files.
The `MasterCrawler` searches the data space for files named `signac_access.py` and collects all indexes generated by the slave crawlers defined within those modules.

This allows us, for example, to generate a *master index* containing information about the data of multiple projects.
Furthermore, we can fetch data directly using this index, as shown in the next section.
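The search step described above can be pictured with a small standard-library sketch. It is not signac's implementation: the helper `find_access_modules()` is hypothetical, and the assumption that `depth` counts directory levels below the root is ours.

```python
import os
import tempfile


def find_access_modules(root, depth=1, name='signac_access.py'):
    """Yield paths of files called `name` at most `depth` levels below root."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            level = 0
        else:
            level = os.path.relpath(dirpath, root).count(os.sep) + 1
        if level >= depth:
            # Do not descend any further below this directory.
            dirnames[:] = []
        if name in filenames:
            yield os.path.join(dirpath, name)


# Build a throwaway directory tree with two access modules.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ('tutorial', os.path.join('other', 'nested')):
        os.makedirs(os.path.join(tmp, sub))
        open(os.path.join(tmp, sub, 'signac_access.py'), 'w').close()
    found = [os.path.relpath(p, tmp)
             for p in find_access_modules(tmp, depth=1)]
print(found)
```

With `depth=1`, only the module directly inside `tutorial` is reported; the one in `other/nested` lies two levels down and is skipped.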

## Fetch data via filename

The master index contains information about the data source, allowing us to access the data.

In [None]:
import os

for doc in master_index:
    if doc['format'] is not None:
        fn = os.path.join(doc['root'], doc['filename'])
        with open(fn) as file:
            print(doc['statepoint']['p'], file.read().strip())

## Fetch data via index

Even better, data which was indexed with a `MasterCrawler` can be seamlessly fetched using the `fetch()` and `fetch_one()` functions:

In [None]:
for doc in master_index:
    if doc['format'] is not None:
        file = signac.fetch_one(doc)
        print(doc['statepoint']['p'], file.read().strip())

Using the `fetch()` and `fetch_one()` functions allows us to access data in a way that is agnostic with respect to the actual data source.