# Ensembles in Kosh

Frequently the need arises to run an *ensemble*, i.e., to produce many datasets that share some common `metadata` or `sources`.

Kosh provides a convenience class `KoshEnsemble` that helps you keep all of your datasets in sync.

## The basics

In essence, by creating a `KoshEnsemble` you lock a set of metadata that will be shared by all members of the ensemble. These metadata will be identical for all datasets in the ensemble and can only be edited from the `KoshEnsemble` object.

Additionally, you can associate data with the ensemble. The data will then appear as if it were associated with each dataset.


In [1]:
import kosh

store = kosh.connect("ensembles_example.sql", delete_all_contents=True)

# let's create an ensemble. 
# we use the dedicated `create_ensemble` function that works just like the `create` function for datasets

ensemble = store.create_ensemble(name="My Example Dataset", metadata={"root":"/root/path/for/ensemble", "project":"Example"})

print(ensemble)

KOSH ENSEMBLE
	id: 273779eff30b41d6847794e548323c5b
	name: My Example Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (0)---
--- Member Datasets (0)---
	[]


In [2]:
# Let's associate a file common to all datasets with the ensemble
ensemble.associate("../LICENSE", "text")
print(ensemble)

KOSH ENSEMBLE
	id: 273779eff30b41d6847794e548323c5b
	name: My Example Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g20/moreno45/Projects/ASCAML/kosh/LICENSE ( 01a96a0627564ccbab9da7da488ec9e0 )
--- Member Datasets (0)---
	[]


In [3]:
# Now let's add a member to our ensemble.
# We use the `create` function which works exactly as the store's `create` function.
ds1 = ensemble.create(name="First dataset", metadata={"param1":1., "param2": "a"})
# Notice that our ensemble attributes and associated data appear on the dataset
print(ds1)

KOSH DATASET
	id: e3b9a36e2a484b3b953ef438043b4968
	name: First dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: First dataset
	param1: 1.0
	param2: a
--- Associated Data (1)---
	Mime_type: text
		/g/g20/moreno45/Projects/ASCAML/kosh/LICENSE ( 01a96a0627564ccbab9da7da488ec9e0 )
--- Ensembles (1)---
	['273779eff30b41d6847794e548323c5b']
--- Ensemble Attributes ---
	--- Ensemble 273779eff30b41d6847794e548323c5b ---
		['project', 'root']
--- Alias Feature Dictionary ---


In [4]:
# Dataset 1 also appears as part of the ensemble:
print(ensemble)

KOSH ENSEMBLE
	id: 273779eff30b41d6847794e548323c5b
	name: My Example Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g20/moreno45/Projects/ASCAML/kosh/LICENSE ( 01a96a0627564ccbab9da7da488ec9e0 )
--- Member Datasets (1)---
	['e3b9a36e2a484b3b953ef438043b4968']


In [5]:
# We can also create a dataset on its own as usual:
ds2 = store.create(name="Second dataset", metadata={"param1":2., "param2": "b"})
# And later add it to the ensemble
ensemble.add(ds2)
print(ensemble)

KOSH ENSEMBLE
	id: 273779eff30b41d6847794e548323c5b
	name: My Example Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g20/moreno45/Projects/ASCAML/kosh/LICENSE ( 01a96a0627564ccbab9da7da488ec9e0 )
--- Member Datasets (2)---
	['e3b9a36e2a484b3b953ef438043b4968', '0baf46544b1748da86d9495ae4000d18']


In [6]:
# We can also tell a dataset to join an ensemble:
# Let's create a dataset:
ds3 = store.create(name="Third dataset", metadata={"param1":3., "param2": "c"})
# Now let's ask the dataset to join the ensemble:
ds3.join_ensemble(ensemble)
print(ensemble)

KOSH ENSEMBLE
	id: 273779eff30b41d6847794e548323c5b
	name: My Example Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g20/moreno45/Projects/ASCAML/kosh/LICENSE ( 01a96a0627564ccbab9da7da488ec9e0 )
--- Member Datasets (3)---
	['e3b9a36e2a484b3b953ef438043b4968', '0baf46544b1748da86d9495ae4000d18', '74cd06177903469ebcb2bddfe0875d96']


In [7]:
# Now we can access all datasets of an ensemble:
list(ensemble.get_members(ids_only=True))

['e3b9a36e2a484b3b953ef438043b4968',
 '0baf46544b1748da86d9495ae4000d18',
 '74cd06177903469ebcb2bddfe0875d96']

In [8]:
# Similarly a dataset can leave or be removed from an ensemble.
dataset = ensemble.create()
print("Ensemble has {} members.".format(len(list(ensemble.get_members(ids_only=True)))))
dataset.leave_ensemble(ensemble)
print("Ensemble has {} members after dataset left.".format(len(list(ensemble.get_members(ids_only=True)))))
ensemble.add(dataset)
print("Ensemble has {} members after adding dataset back.".format(len(list(ensemble.get_members(ids_only=True)))))
ensemble.delete(dataset)
print("Ensemble has {} members after removing dataset.".format(len(list(ensemble.get_members(ids_only=True)))))

Ensemble has 4 members.
Ensemble has 3 members after dataset left.
Ensemble has 4 members after adding dataset back.
Ensemble has 3 members after removing dataset.
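
The membership operations above keep two views consistent: the ensemble's list of members and each dataset's list of ensembles. The following plain-Python sketch shows one way such bookkeeping can be kept in sync. `ToyEnsemble` and `ToyDataset` are hypothetical stand-ins written for this illustration; they only mirror the method names used in the cells above and say nothing about how Kosh stores the relation internally.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; not Kosh's implementation."""

    def __init__(self, name):
        self.name = name
        self.ensembles = []  # ensembles this dataset currently belongs to

    def join_ensemble(self, ensemble):
        # Delegate to the ensemble so both views are updated in one place.
        ensemble.add(self)

    def leave_ensemble(self, ensemble):
        ensemble.delete(self)


class ToyEnsemble:
    """Hypothetical stand-in for an ensemble; not Kosh's implementation."""

    def __init__(self, name):
        self.name = name
        self.members = []

    def create(self, name):
        # Create a dataset that is a member from the start.
        dataset = ToyDataset(name)
        self.add(dataset)
        return dataset

    def add(self, dataset):
        if dataset not in self.members:
            self.members.append(dataset)
            dataset.ensembles.append(self)

    def delete(self, dataset):
        if dataset in self.members:
            self.members.remove(dataset)
            dataset.ensembles.remove(self)


toy = ToyEnsemble("toy ensemble")
first = toy.create("first")      # created directly in the ensemble
second = ToyDataset("second")
toy.add(second)                  # added from the ensemble side
third = ToyDataset("third")
third.join_ensemble(toy)         # joined from the dataset side
print(len(toy.members))          # 3
third.leave_ensemble(toy)
print(len(toy.members), len(third.ensembles))  # 2 0
```

Routing every change through `add` and `delete` is the design choice that matters here: whichever side starts the operation, both lists are updated together.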


## Attributes

As previously mentioned the ensemble attributes appear on all of its members. 

Changing or adding an ensemble attribute propagates to all of its members:


In [9]:
ensemble.root = "foo"
ensemble.new_attribute = "bar"
[(x.root, x.new_attribute) for x in ensemble.get_members()]

[('foo', 'bar'), ('foo', 'bar'), ('foo', 'bar')]

***WARNING:*** You cannot set an attribute belonging to an ensemble from one of its members; attempting to do so raises a `KeyError`:

In [10]:
try:
    ds1.root = "root_from_ds1"
except KeyError as err:
    print(err)
ds1.root

'The attribute root is controlled by ensemble: 273779eff30b41d6847794e548323c5b and cannot be set here'


'foo'

You can ask a dataset whether one of its attributes belongs to an ensemble:

In [11]:
print("Is `root` an ensemble attribute?", ds1.is_ensemble_attribute("root"))
print("Is `param1` an ensemble attribute?", ds1.is_ensemble_attribute("param1"))

Is `root` an ensemble attribute? True
Is `param1` an ensemble attribute? False


You can also get which ensemble the attribute comes from:

In [12]:
print("Attribute `root` belongs to ensemble:", ds1.is_ensemble_attribute("root", ensemble_id=True))

Attribute `root` belongs to ensemble: 273779eff30b41d6847794e548323c5b


In [13]:
print("Attribute `param1` belongs to ensemble:", ds1.is_ensemble_attribute("param1", ensemble_id=True))

Attribute `param1` belongs to ensemble: 
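
The behavior shown in this section (ensemble values visible on members, writes from a member refused, and a query for the controlling ensemble) can be pictured with a small plain-Python model. This is only a sketch under stated assumptions: `ToyEnsemble`, `ToyDataset`, `owner_of`, `get`, and `set` are hypothetical names, and resolving ensemble attributes at read time is an assumption of the sketch, not a description of Kosh's internals.

```python
class ToyEnsemble:
    """Hypothetical ensemble holding the shared attributes."""

    def __init__(self, ens_id, **attributes):
        self.id = ens_id
        self.attributes = dict(attributes)


class ToyDataset:
    """Hypothetical dataset with its own attributes plus inherited ones."""

    def __init__(self, **attributes):
        self.own = dict(attributes)
        self.ensembles = []

    def owner_of(self, name):
        """Return the id of the ensemble controlling `name`, or "" if none."""
        for ensemble in self.ensembles:
            if name in ensemble.attributes:
                return ensemble.id
        return ""

    def get(self, name):
        # Ensemble attributes are resolved when read, so a change made on
        # the ensemble is visible from every member immediately.
        for ensemble in self.ensembles:
            if name in ensemble.attributes:
                return ensemble.attributes[name]
        return self.own[name]

    def set(self, name, value):
        owner = self.owner_of(name)
        if owner:
            raise KeyError(
                f"attribute {name} is controlled by ensemble {owner}")
        self.own[name] = value


ens = ToyEnsemble("ens-1", root="/root/path", project="Example")
ds = ToyDataset(param1=1.0)
ds.ensembles.append(ens)

ens.attributes["root"] = "foo"   # change made on the ensemble ...
print(ds.get("root"))            # ... is seen from the member: foo

try:
    ds.set("root", "root_from_ds")
except KeyError as err:
    print("refused:", err)

print(ds.owner_of("root"), repr(ds.owner_of("param1")))
```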


## Searching

We can search a store for ensembles that have specific attribute values:

In [14]:
ensembles = store.find_ensembles(root="foo", ids_only=True)
print(list(ensembles))

['273779eff30b41d6847794e548323c5b']


The ensemble metadata appear as dataset metadata, so we can search for datasets based on ensemble attributes:

In [15]:
list(store.find(root="foo", ids_only=True))

['0baf46544b1748da86d9495ae4000d18',
 'e3b9a36e2a484b3b953ef438043b4968',
 '74cd06177903469ebcb2bddfe0875d96']

Just like for datasets, the `find` function is used to look up associated sources:

In [16]:
next(ensemble.find(mime_type="text", ids_only=True))

'01a96a0627564ccbab9da7da488ec9e0'

The associated data will also appear and be searchable for each individual dataset.

In [17]:
next(ds1.find(mime_type="text", ids_only=True))

'01a96a0627564ccbab9da7da488ec9e0'

We can also search for datasets within an ensemble.

In [18]:
next(ensemble.find_datasets(param1=1, ids_only=True))

'e3b9a36e2a484b3b953ef438043b4968'
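
To see why a search on an ensemble attribute returns member datasets, it can help to think of each dataset's searchable metadata as its own attributes merged with those of its ensembles. The sketch below models that idea with plain dictionaries. The helpers `merged_metadata` and `find` are hypothetical and written for this illustration; Kosh's actual query machinery is not shown.

```python
def merged_metadata(own, ensembles):
    """Combine a dataset's own attributes with those of its ensembles."""
    merged = dict(own)
    for attributes in ensembles:
        merged.update(attributes)
    return merged


def find(datasets, **criteria):
    """Yield ids of datasets whose merged metadata match every criterion."""
    for ds_id, (own, ensembles) in datasets.items():
        metadata = merged_metadata(own, ensembles)
        if all(metadata.get(key) == value for key, value in criteria.items()):
            yield ds_id


shared = {"root": "foo", "project": "Example"}    # ensemble attributes
datasets = {
    "ds1": ({"param1": 1.0, "param2": "a"}, [shared]),
    "ds2": ({"param1": 2.0, "param2": "b"}, [shared]),
    "ds3": ({"param1": 3.0, "param2": "c"}, []),  # not in the ensemble
}

print(list(find(datasets, root="foo")))             # ['ds1', 'ds2']
print(list(find(datasets, root="foo", param1=1)))   # ['ds1']
```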

## Multiple ensembles

Datasets can be part of multiple ensembles. For example, you might run a parameter study for each problem and repeat it with two different tools; each dataset then belongs to both a problem ensemble and a tool ensemble.



In [19]:
problem1_ensemble = store.create_ensemble(name="problem 1", metadata={"problem":"problem1"})
problem2_ensemble = store.create_ensemble(name="problem 2", metadata={"problem":"problem2"})
tool1_ensemble = store.create_ensemble(name="tool1", metadata={"tool":"tool1"})
tool2_ensemble = store.create_ensemble(name="tool2", metadata={"tool":"tool2"})

for problem in ["problem1", "problem2"]:
    for tool in ["tool1", "tool2"]:
        for param1 in [1, 2, 3]:
            ds = store.create(metadata={"param1": param1})
            tool_ensemble = next(store.find_ensembles(tool=tool))
            ds.join_ensemble(tool_ensemble)
            problem_ensemble = next(store.find_ensembles(problem=problem))
            ds.join_ensemble(problem_ensemble)

# now let's find datasets for tool1 and problem1
datasets = list(store.find(tool="tool1", problem="problem1"))
print("We found:",len(datasets),"datasets")
ds = datasets[0]  # belongs to two ensembles
# Note that the string representation shows which attributes belong to which ensemble
ds

We found: 3 datasets


KOSH DATASET
	id: 41e372b21767464d9c6feae23a122361
	name: Unnamed Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: Unnamed Dataset
	param1: 2
--- Associated Data (0)---
--- Ensembles (2)---
	['418456b853c541a3b804fa425c8dd53c', '5f069d8ec3af466ab236e0c94fa062f1']
--- Ensemble Attributes ---
	--- Ensemble 418456b853c541a3b804fa425c8dd53c ---
		['tool']
	--- Ensemble 5f069d8ec3af466ab236e0c94fa062f1 ---
		['problem']
--- Alias Feature Dictionary ---

***WARNING:*** For a dataset to belong to multiple ensembles, each ensemble must have a unique set of attributes, unless you pass `inherit_attributes=False`, as shown in the next couple of cells.

For example, if another ensemble had the `problem` attribute and a dataset belonged to both ensembles, we could not determine which ensemble to take the `problem` attribute from:

In [20]:
e3 = store.create_ensemble(metadata={"problem":"another problem"})
try:
    ds.join_ensemble(e3)
except Exception as err:
    print(err)

Dataset 41e372b21767464d9c6feae23a122361 is already part of ensemble 5f069d8ec3af466ab236e0c94fa062f1 which already provides support for attribute: problem. Bailing


Similarly, you cannot create a new attribute on an ensemble if one of its members belongs to another ensemble that already controls this attribute:


In [21]:
try:
    problem1_ensemble.tool = "some tool"
except Exception as err:
    print(err)

A member of this ensemble belongs to ensemble de9eec810e684998a970800c82403849 which already controls attribute tool
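
Both refusals above come down to the same rule: an attribute name may be controlled by at most one of the ensembles a dataset belongs to. The following sketch expresses that rule as a set intersection over attribute names. `conflicting_attributes` and the ensemble ids used here are hypothetical; the sketch shows the rule itself, not Kosh's code.

```python
def conflicting_attributes(current_ensembles, candidate):
    """Map each clashing attribute name to the id of the ensemble that
    already controls it. An empty dict means there is no conflict."""
    clashes = {}
    for ens_id, attributes in current_ensembles.items():
        # Attribute names present both in an existing ensemble and in
        # the candidate are ambiguous for the dataset.
        for name in attributes.keys() & candidate.keys():
            clashes[name] = ens_id
    return clashes


# Ensembles a dataset already belongs to, keyed by ensemble id.
current = {
    "tool-ens": {"tool": "tool1"},
    "problem-ens": {"problem": "problem1"},
}

print(conflicting_attributes(current, {"problem": "another problem"}))
# {'problem': 'problem-ens'}
print(conflicting_attributes(current, {"campaign": "2024"}))
# {}
```

The same check covers the second case: before a new attribute is set on an ensemble, it would be run for every member against the other ensembles that member belongs to.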


### Adding Multiple Datasets to Ensembles with the same attributes and organizing with ensemble tags

Datasets also can be part of multiple ensembles and they can be further organized within a single ensemble using `ensemble_tags`. 

For example, say you want to add your train, validation, and test datasets to a single ensemble while still keeping track of which is which. A dataset attribute would not work, because it has the same value in every ensemble the dataset belongs to, whereas the train, validation, and test split is randomized for each ensemble. An ensemble attribute would not work either, because it applies to the whole ensemble, so you would need three ensembles: one each for train, validation, and test. `ensemble_tags` let you organize the datasets within a single ensemble.

We also use `inherit_attributes=False` so that datasets and ensembles, as well as different ensembles containing the same datasets, can have attributes with the same names; otherwise the identical attribute names would clash.

**Note:** If a dataset was added to another ensemble using the default parameter `inherit_attributes=True` and the new ensemble and/or dataset attributes have the same name, there will be a conflict. To fix this, update the special ensemble tag `'INHERIT_ATTRIBUTES'` to `False` for that other dataset-ensemble relation. The dataset attributes will then no longer be tied to that other ensemble, so the conflict disappears. If the dataset belongs to multiple ensembles with `inherit_attributes=True` (and there are attribute conflicts), this needs to be done for each of those ensembles: `dataset.add_ensemble_tags(ensemble_id, {'INHERIT_ATTRIBUTES': False})`
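
One way to picture ensemble tags is as labels stored on the (ensemble, dataset) pair rather than on either object, which is what lets the same dataset be training data in one ensemble and test data in another. The sketch below models that with a plain dictionary. The names `relation_tags` and `members_with_tag` are hypothetical, and the storage layout is an assumption made for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the split is repeatable
dataset_ids = [f"ds_{i}" for i in range(10)]

# Tags live on the (ensemble, dataset) pair, not on either object.
relation_tags = {}
for ens_id in ["ens_0", "ens_1"]:
    # Each ensemble gets its own random train/validation/test split.
    shuffled = rng.sample(dataset_ids, k=len(dataset_ids))
    for position, ds_id in enumerate(shuffled):
        if position < 6:
            split = "train data"
        elif position < 8:
            split = "validation data"
        else:
            split = "test data"
        relation_tags[(ens_id, ds_id)] = {"data_type": split}


def members_with_tag(ens_id, **tags):
    """Ids of datasets whose tags for `ens_id` match every given tag."""
    return sorted(
        ds_id
        for (ens, ds_id), ds_tags in relation_tags.items()
        if ens == ens_id and all(ds_tags.get(k) == v for k, v in tags.items())
    )


print(members_with_tag("ens_0", data_type="test data"))
print(members_with_tag("ens_1", data_type="test data"))
```

Because the split is drawn separately for each ensemble, the two printed lists will generally differ even though both ensembles contain the same ten datasets.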

In [22]:
import random

temp_datasets = []
for i in range(20):
    metadata={"problem":f"problem_ds_{i}",
              "tool":f"tool_ds_{i}",
              "param1": random.randint(0, 1),
              "param2": random.randint(-10, 10)}

    temp_dataset = store.create(id=f"ds_{i}", metadata=metadata)
    temp_datasets.append(temp_dataset)

for i in range(10):
    ensemble = store.create_ensemble(id=f"ens_{i}",
                                    metadata={"problem":f"problem_ens_{i}",
                                              "tool":f"tool_ens_{i}"})
    for j, temp_ds in enumerate(temp_datasets):

        ensemble_tags = {}

        if j % 2 == 0:
            ensemble_tags["even_or_odd"] = "even"
        else:
            ensemble_tags["even_or_odd"] = "odd"

        if j <= 11:
            ensemble_tags["data_type"] = "train data"
        elif j <= 15:
            ensemble_tags["data_type"] = "validation data"
        else:
            ensemble_tags["data_type"] = "test data"

        ensemble.add(temp_ds, inherit_attributes=False, ensemble_tags=ensemble_tags)

print(ensemble)
print(temp_dataset)

KOSH ENSEMBLE
	id: ens_9
	name: Unnamed Ensemble
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: Unnamed Ensemble
	problem: problem_ens_9
	tool: tool_ens_9
--- Associated Data (0)---
--- Member Datasets (20)---
	['ds_0', 'ds_1', 'ds_2', 'ds_3', 'ds_4', 'ds_5', 'ds_6', 'ds_7', 'ds_8', 'ds_9', 'ds_10', 'ds_11', 'ds_12', 'ds_13', 'ds_14', 'ds_15', 'ds_16', 'ds_17', 'ds_18', 'ds_19']
KOSH DATASET
	id: ds_19
	name: Unnamed Dataset
	creator: moreno45

--- Attributes ---
	creator: moreno45
	name: Unnamed Dataset
	param1: 0
	param2: -5
--- Associated Data (0)---
--- Ensembles (10)---
	['ens_0', 'ens_1', 'ens_2', 'ens_3', 'ens_4', 'ens_5', 'ens_6', 'ens_7', 'ens_8', 'ens_9']
--- Ensemble Attributes ---
	--- Ensemble ens_0 ---
		['problem', 'tool']
		--- Ensemble Tags ---
			['data_type', 'even_or_odd']
	--- Ensemble ens_1 ---
		['problem', 'tool']
		--- Ensemble Tags ---
			['data_type', 'even_or_odd']
	--- Ensemble ens_2 ---
		['problem', 'tool']
		--- Ensemble Tags ---
			['d

### Converting `ensemble.find()` results to a Pandas DataFrame

You can also pass the arguments accepted by the `ensemble.find()` method to the `ensemble.to_dataframe()` method to get the attributes of the filtered datasets within that specific ensemble. By default the result includes the `id`, `name`, and `creator` columns as well as the ensemble attributes and ensemble tags; the latter two can be turned off with `include_ensemble_attributes=False` and `include_ensemble_tags=False`.
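
As a plain-Python picture of the kind of filter used in the cell that follows (an exact match on one attribute, a numeric range on another, and an ensemble tag), consider the sketch below. `ToyRange` and `matches` are hypothetical helpers written for this illustration; the inclusivity defaults chosen for `ToyRange` are assumptions of the sketch, and the rows are made-up values, not data from the store.

```python
from dataclasses import dataclass


@dataclass
class ToyRange:
    """Hypothetical numeric range with explicit bound inclusivity."""
    min: float
    max: float
    min_inclusive: bool = True
    max_inclusive: bool = False

    def contains(self, value):
        above = value >= self.min if self.min_inclusive else value > self.min
        below = value <= self.max if self.max_inclusive else value < self.max
        return above and below


def matches(row, criteria):
    """True if every criterion is met; a ToyRange is tested by containment."""
    for key, wanted in criteria.items():
        if key not in row:
            return False
        value = row[key]
        if isinstance(wanted, ToyRange):
            if not wanted.contains(value):
                return False
        elif value != wanted:
            return False
    return True


# Made-up rows: dataset attributes plus the tag for one ensemble.
rows = [
    {"id": "ds_a", "param1": 1, "param2": 10, "data_type": "train data"},
    {"id": "ds_b", "param1": 1, "param2": -3, "data_type": "train data"},
    {"id": "ds_c", "param1": 0, "param2": 5, "data_type": "train data"},
    {"id": "ds_d", "param1": 1, "param2": 4, "data_type": "test data"},
]
criteria = {
    "param1": 1,
    "param2": ToyRange(min=0, max=10, max_inclusive=True),
    "data_type": "train data",
}
print([row["id"] for row in rows if matches(row, criteria)])  # ['ds_a']
```

Note that `ds_a` is kept only because the upper bound is inclusive; with `max_inclusive=False` its `param2` value of 10 would fall outside the range.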

In [23]:
# All Datasets in Ensemble
df = ensemble.to_dataframe()
print('All Datasets in Ensemble')
print(df,'\n\n')

# Filtered Datasets in Ensemble
from sina.utils import DataRange
target_data = {'param1': 1,
               'param2': DataRange(min=0, max=10, max_inclusive=True)}
target_ensemble_tags = {"data_type": "train data"}
df = ensemble.to_dataframe(data=target_data, ensemble_tags=target_ensemble_tags)
print('Filtered Datasets in Ensemble')
print(df,'\n\n')

# Specific columns without ensemble attributes or ensemble tags
df = ensemble.to_dataframe(data=target_data, ensemble_tags=target_ensemble_tags,
                           data_columns=['param2'],
                           include_ensemble_attributes=False, include_ensemble_tags=False)
print('Filtered Datasets with specific columns')
print(df,'\n\n')

All Datasets in Ensemble
       id             name                           creator  param1  param2  \
0    ds_0  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0      -4   
1   ds_14  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0      -6   
2    ds_6  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       1       4   
3    ds_9  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       1       1   
4   ds_12  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       1       1   
5   ds_13  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0       9   
6    ds_2  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       1       7   
7   ds_18  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       1      -6   
8   ds_16  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0      10   
9   ds_17  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0      -7   
10   ds_5  Unnamed Dataset  9b7d60f394284459a1ae979bb0af019f       0       9   
11  ds_11  Unna

