# 1.2 Exploring Data

## Finding jobs

In [section one](signac_101_Getting_Started.ipynb) of this tutorial, we evaluated the ideal gas equation and stored the results in the *job document* and in a file called `V.txt`.
Let's now have a look at how we can explore our data space for basic and advanced analysis.

We already saw how to iterate over the *complete* data space using the `find_jobs()` method.
Instead of finding all jobs, we can also find a subset using *filters*.

Let's get started by getting a handle on our project using the `get_project()` function.
We don't need to initialize the project again, since we already did that in section 1.

In [1]:
import signac
project = signac.get_project('projects/tutorial')

Next, suppose we would like to find all jobs where *p=10.0*. For this, we can use the `find_jobs()` method, which takes a dictionary of parameters as its filter argument.

In [2]:
for job in project.find_jobs({'p': 10.0}):
    print(job.statepoint())

{'N': 1000, 'kT': 1.0, 'p': 10.0}


In this case, that is of course only a single job.
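To make the filter semantics concrete, here is a minimal sketch in plain Python. It is not how signac implements `find_jobs()`: ordinary dictionaries stand in for jobs, and the helper `matches()` is a hypothetical name introduced only for illustration. It assumes that a filter selects every state point containing all of the filter's key-value pairs.

```python
# Illustration only: plain dictionaries stand in for the state points
# of the three jobs in the tutorial project.
statepoints = [
    {'N': 1000, 'kT': 1.0, 'p': 0.1},
    {'N': 1000, 'kT': 1.0, 'p': 1.0},
    {'N': 1000, 'kT': 1.0, 'p': 10.0},
]


def matches(statepoint, filter_):
    """Return True if every key-value pair of the filter occurs in the state point.

    Hypothetical helper, not part of the signac API.
    """
    return all(
        key in statepoint and statepoint[key] == value
        for key, value in filter_.items()
    )


# Select the state points that match the filter {'p': 10.0}.
selected = [sp for sp in statepoints if matches(sp, {'p': 10.0})]
print(selected)
```

Under this assumption, an empty filter matches every state point, which mirrors calling `find_jobs()` without arguments.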

You can execute the same kind of find operation on the [command line](signac_105_Command_Line_Interface.ipynb) with `$ signac find`, as will be shown later.

While the filtering method is optimized for a simple dissection of the data space, it is possible to construct more complex query routines, for example using [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).

This is an example of how to select all jobs where the pressure *p* is greater than 0.1:

In [3]:
jobs_p_gt_0_1 = [job for job in project.find_jobs() if job.statepoint()['p'] > 0.1]
for job in jobs_p_gt_0_1:
    print(job.statepoint(), job.document)

{'N': 1000, 'kT': 1.0, 'p': 10.0} {'V': 100.0}
{'N': 1000, 'kT': 1.0, 'p': 1.0} {'V': 1000.0}


Finding jobs by certain criteria requires an index of the data space.
In all previous examples this index was created implicitly. However, depending on the size of the data space, it may make sense to create the index explicitly once and reuse it. This is shown in the next section.

## Indexing

An index is a complete record of the data and its associated metadata within our project’s data space. To create an index, we need to crawl through the project’s data space, for example by calling the `index()` method:

In [4]:
for doc in project.index():
    print(doc)

{'_id': '5a456c131b0c5897804a4af8e77df5aa', 'V': 100.0, 'statepoint': {'N': 1000, 'kT': 1.0, 'p': 10.0}, 'signac_id': '5a456c131b0c5897804a4af8e77df5aa'}
{'_id': '5a6c687f7655319db24de59a2336eff8', 'V': 10000.0, 'statepoint': {'N': 1000, 'kT': 1.0, 'p': 0.1}, 'signac_id': '5a6c687f7655319db24de59a2336eff8'}
{'_id': 'ee617ad585a90809947709a7a45dda9a', 'V': 1000.0, 'statepoint': {'N': 1000, 'kT': 1.0, 'p': 1.0}, 'signac_id': 'ee617ad585a90809947709a7a45dda9a'}


*Crawling* here refers to collecting information from files within a certain data space, which may be a set of directories or other sources.
Each index document contains at least the state point parameters and the contents of the *job document*.

You can store the index wherever it may be useful, e.g., a file, a database, or even just in a variable for repeated find operations within one script.

In [5]:
# Create the index once
index = list(project.index())

# Use it multiple times
for job in project.find_jobs({'p': 10.0}, index=index):
    print(job.statepoint())
    
for job in project.find_jobs({'p': 1.0}, index=index):
    print(job.statepoint())

{'N': 1000, 'kT': 1.0, 'p': 10.0}
{'N': 1000, 'kT': 1.0, 'p': 1.0}
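Storing the index in a file works along the same lines. The following is a minimal sketch using only the standard library: the literal list stands in for `list(project.index())` and is copied from the output shown earlier, and the file name `index.json` and its temporary location are assumptions made for illustration.

```python
import json
import os
import tempfile

# Stand-in for `list(project.index())`; the documents are copied from
# the index output shown earlier in this section.
index = [
    {'_id': '5a456c131b0c5897804a4af8e77df5aa', 'V': 100.0,
     'statepoint': {'N': 1000, 'kT': 1.0, 'p': 10.0},
     'signac_id': '5a456c131b0c5897804a4af8e77df5aa'},
    {'_id': '5a6c687f7655319db24de59a2336eff8', 'V': 10000.0,
     'statepoint': {'N': 1000, 'kT': 1.0, 'p': 0.1},
     'signac_id': '5a6c687f7655319db24de59a2336eff8'},
    {'_id': 'ee617ad585a90809947709a7a45dda9a', 'V': 1000.0,
     'statepoint': {'N': 1000, 'kT': 1.0, 'p': 1.0},
     'signac_id': 'ee617ad585a90809947709a7a45dda9a'},
]

# Hypothetical location; any writable path would do.
index_path = os.path.join(tempfile.mkdtemp(), 'index.json')

# Write the index to disk once ...
with open(index_path, 'w') as file:
    json.dump(index, file)

# ... and load it again later, e.g., in another script.
with open(index_path) as file:
    restored_index = json.load(file)

# Check that the round trip preserved all index documents.
print(restored_index == index)
```

The restored list has the same structure as the original index, so it should be usable wherever the in-memory index was used above.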


## Views

Sometimes we want to examine our data on the file system directly. However, the file paths within the workspace are obfuscated by the *job id*. The solution is to use *views*, which are human-readable, maximally compact hierarchical links to our data space.

To create a linked view, we simply execute the `create_linked_view()` method within Python or the `$ signac view` command on the [command line](signac_105_Command_Line_Interface.ipynb).

In [6]:
# Create an empty view directory
% rm -rf projects/tutorial/view
% mkdir projects/tutorial/view

# Create the linked view within the view directory
project.create_linked_view(prefix='projects/tutorial/view')
% ls projects/tutorial/view

p_0.1/  p_1.0/  p_10.0/


The view paths only contain parameters which actually vary across the different jobs.
In this example, that is only the pressure *p*.

This allows us to examine the data with highly compact, human-readable path names:

In [7]:
% ls 'projects/tutorial/view/p_1.0/job/'
% cat 'projects/tutorial/view/p_1.0/job/V.txt'

V.txt                     signac_job_document.json  signac_statepoint.json
1000.0
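To illustrate why only *p* appears in the view paths, here is a minimal sketch in plain Python that determines which parameters vary across the state points and builds path names from them. This is not signac's implementation: the helpers `varying_keys()` and `view_path()` are hypothetical, and the `key_value` naming is an assumption modeled on the directory listing above.

```python
# Illustration only: the state points of the three tutorial jobs.
statepoints = [
    {'N': 1000, 'kT': 1.0, 'p': 0.1},
    {'N': 1000, 'kT': 1.0, 'p': 1.0},
    {'N': 1000, 'kT': 1.0, 'p': 10.0},
]


def varying_keys(statepoints):
    """Return the sorted keys whose values differ between state points.

    Hypothetical helper, not part of the signac API.
    """
    keys = sorted({key for sp in statepoints for key in sp})
    return [
        key for key in keys
        if len({repr(sp.get(key)) for sp in statepoints}) > 1
    ]


def view_path(statepoint, keys):
    """Build a path such as 'p_1.0' from the varying keys only."""
    return '/'.join('{}_{}'.format(key, statepoint[key]) for key in keys)


keys = varying_keys(statepoints)
paths = sorted(view_path(sp, keys) for sp in statepoints)
print(keys)
print(paths)
```

Since *N* and *kT* are identical for all three jobs, only *p* remains, which matches the directory names `p_0.1`, `p_1.0`, and `p_10.0` in the view.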


Tip: Consider creating a linked view for large data sets on an [**in-memory** file system](https://en.wikipedia.org/wiki/Tmpfs) for best performance!

The [next section](signac_103_A_Basic_Workflow.ipynb) will demonstrate how to implement a basic but complete workflow for more expensive computations.