# Important Data Structures

This tutorial walks through the data structures that users will encounter most often when working with ``pyjpt``.

## Sets

Since sets are ubiquitous in almost every mathematical theory, ``pyjpt`` provides fast and flexible implementations of several kinds of sets.

### Discrete Sets

Domains of [jpt.variables.SymbolicVariable](../autoapi/jpt/variables/index.html#jpt.variables.SymbolicVariable) and [jpt.variables.IntegerVariable](../autoapi/jpt/variables/index.html#jpt.variables.IntegerVariable) are ordinary Python sets. They can be created with a set literal or with the built-in set constructor.

In [76]:
symbolic_set = {"Dog", "Cat", "Mouse"}
integer_set = {1, 2, 3}

For [jpt.variables.SymbolicVariable](../autoapi/jpt/variables/index.html#jpt.variables.SymbolicVariable), a set of strings can be used, whereas [jpt.variables.IntegerVariable](../autoapi/jpt/variables/index.html#jpt.variables.IntegerVariable) requires a set of integers.
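
Because these domains are plain Python sets, the standard set operators work on them without any ``pyjpt``-specific code. The following short sketch uses only built-in Python; the second set ``pets`` is a made-up example and not part of the tutorial data.

```python
symbolic_set = {"Dog", "Cat", "Mouse"}
pets = {"Dog", "Cat", "Hamster"}  # hypothetical second domain, for illustration only

common = symbolic_set & pets       # intersection: elements in both sets
either = symbolic_set | pets       # union: elements in at least one set
only_first = symbolic_set - pets   # difference: elements only in symbolic_set

# set() builds a set from any iterable, e.g. a range of integers
integer_set = set(range(1, 4))

print(sorted(common), sorted(either), sorted(only_first), sorted(integer_set))
```

Sets are unordered, so the results are sorted before printing to get a stable display.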

### Continuous Sets

Since real-world applications often involve variables with a continuous domain, ``pyjpt`` implements [jpt.base.intervals.ContinuousSet](../autoapi/jpt/base/intervals.html#jpt.base.intervals.ContinuousSet) and [jpt.base.intervals.RealSet](../autoapi/jpt/base/intervals.html#jpt.base.intervals.RealSet) as domains for
[numeric random variables](../autoapi/jpt/variables/index.html#jpt.variables.NumericVariable).
Continuous sets represent intervals on $\mathbb{R}$ and work very similarly to Python sets. After importing the required names, a continuous set can be created by
   * calling the constructor
   * parsing it from a string
   * parsing it from a list

In [77]:
from jpt.base.intervals import ContinuousSet
from jpt.base.utils import list2interval

a = ContinuousSet(0, 1)
b = ContinuousSet.fromstring("[1, 2)")
c = list2interval([-1, 1])

a, b, c

(<ContinuousSet=[0.000,1.000]>,
 <ContinuousSet=[1.000,2.000[>,
 <ContinuousSet=[-1.000,1.000]>)

The usual set operations also apply to continuous sets.

In [78]:
a_union_b = a.union(b)
a_difference_b = a.difference(b)
a_intersection_c = a.intersection(c)

a_union_b, a_difference_b, a_intersection_c

(<ContinuousSet=[0.000,2.000[>,
 <ContinuousSet=[0.000,1.000[>,
 <ContinuousSet=[0.000,1.000]>)

Note that a continuous set can also be empty or contain exactly one element.

In [79]:
from jpt.base.intervals import EMPTY
d = EMPTY
print("Empty set through Construction (%s) and intersection (%s)" % (d, b.intersection(ContinuousSet(2,100))))

single_element_set = b.intersection(c)
print("Set with only one element %s" % single_element_set)

Empty set through Construction (∅) and intersection (∅)
Set with only one element {1.0}
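
Whether an intersection is empty or a single point depends on which bounds are open and which are closed. The following is a minimal sketch of that logic using a hypothetical ``Interval`` class. It is a simplified stand-in written for this explanation and not the ``pyjpt`` implementation of ``ContinuousSet``.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical, simplified interval on the real line (not pyjpt's class)."""
    lower: float
    upper: float
    lower_closed: bool = True
    upper_closed: bool = True

    def contains(self, x: float) -> bool:
        above = x > self.lower or (self.lower_closed and x == self.lower)
        below = x < self.upper or (self.upper_closed and x == self.upper)
        return above and below

    def intersection(self, other: "Interval") -> Optional["Interval"]:
        """Return the overlap of both intervals, or None if it is empty."""
        # The larger lower bound wins; on a tie it is closed only if both are.
        if self.lower > other.lower:
            lower, lower_closed = self.lower, self.lower_closed
        elif self.lower < other.lower:
            lower, lower_closed = other.lower, other.lower_closed
        else:
            lower, lower_closed = self.lower, self.lower_closed and other.lower_closed
        # The smaller upper bound wins; on a tie it is closed only if both are.
        if self.upper < other.upper:
            upper, upper_closed = self.upper, self.upper_closed
        elif self.upper > other.upper:
            upper, upper_closed = other.upper, other.upper_closed
        else:
            upper, upper_closed = self.upper, self.upper_closed and other.upper_closed
        # Empty if the bounds cross, or if they meet at an excluded point.
        if lower > upper or (lower == upper and not (lower_closed and upper_closed)):
            return None
        return Interval(lower, upper, lower_closed, upper_closed)


b = Interval(1, 2, upper_closed=False)      # [1, 2)
c = Interval(-1, 1)                         # [-1, 1]

single = b.intersection(c)                  # both contain 1, so only that point remains
empty = b.intersection(Interval(2, 100))    # 2 is excluded from b, so nothing is shared
print(single, empty)
```

The two results mirror the cell above: the intervals that share the closed bound 1 intersect in a single point, while the open bound at 2 leaves the other intersection empty.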


Applying arbitrary set operations to continuous sets can produce [real sets](../autoapi/jpt/base/intervals/index.html#jpt.base.intervals.RealSet). These are disjoint unions of continuous sets.
Additionally, real sets can be created through their constructor or parsed from strings.

In [80]:
from jpt.base.intervals import RealSet

c_union_b_difference_a = c.union(b).difference(a)

print("RealSet from set operations %s" % c_union_b_difference_a)

e = RealSet([c, list2interval([100, 200])])
print("RealSet from construction %s" % e)


RealSet from set operations [-1.0,0.0[ ∪ ]1.0,2.0[
RealSet from construction [-1.0,1.0] ∪ [100.0,200.0]


Real sets can also be simplified. Simplification merges overlapping intervals so that all remaining sets are disjoint.

In [81]:
joint_real_set = RealSet([a, b])
print("Not simplified RealSet %s; Simplified RealSet %s" % (joint_real_set, joint_real_set.simplify()))

Not simplified RealSet [0.0,2.0[ ∪ [1.0,2.0[; Simplified RealSet [0.0,2.0[
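
The idea behind such a simplification can be sketched in a few lines of plain Python. The function below is an illustrative assumption, not the algorithm used by ``RealSet.simplify``: it handles only closed intervals given as ``(lower, upper)`` tuples and ignores open bounds.

```python
def merge_closed_intervals(intervals):
    """Merge overlapping or touching closed intervals given as (lower, upper) tuples."""
    merged = []
    for lower, upper in sorted(intervals):
        if merged and lower <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            # Starts after the previous interval ends: keep it separate.
            merged.append((lower, upper))
    return merged


print(merge_closed_intervals([(1, 2), (0, 1), (5, 6)]))
# → [(0, 2), (5, 6)]
```

Sorting by lower bound first means each interval only needs to be compared with the most recently merged one.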


## Variable Assignments

All information passed to JPTs is stored in VariableAssignments. A VariableAssignment is either a LabelAssignment or a ValueAssignment. For users, LabelAssignments are the more interesting data structure. They extend Python dictionaries and map variables to values. Semantically, they describe the (partial) information that an agent provides to the probability distribution. The easiest way to create one is to bind a Python dictionary with the jpt.trees.JPT.bind method. Additionally, they can be created
   * through their constructor
   * from ValueAssignments
   * through the jpt.trees.JPT._preprocess_query method.
The last option should only be used by developers, as indicated by the leading _ in the function name.
Dictionary-like updating is also supported.

To create LabelAssignments through a JPT, we first have to fit one. For that, we will use the Iris toy dataset.

In [82]:
import pandas as pd
import jpt.trees
import jpt.variables
from jpt import infer_from_dataframe
import sklearn.datasets

dataset, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)

# replace the integer class indices with their names
y = y.map(dict(enumerate(['setosa', 'versicolor', 'virginica'])))

dataset["leaf"] = y

model = jpt.trees.JPT(infer_from_dataframe(dataset), min_samples_leaf=0.1)
model.fit(dataset)

# create the LabelAssignment through binding
query = {"leaf" : {"setosa", "versicolor"},
         "sepal length (cm)" : [5,6]}

bounded = model.bind(query)
print("Bounded query from python dictionary %s" % bounded)

# create it through direct constructor calling
query_ = jpt.variables.LabelAssignment({model.varnames["leaf"]: {"setosa", "versicolor"}}.items())
query_[model.varnames["sepal length (cm)"]] = list2interval([5,6])
print("Direct construction of a LabelAssignment %s" % query_)

Bounded query from python dictionary <LabelAssignment {leaf: {'setosa', 'versicolor'}, sepal length (cm): <ContinuousSet=[5.000,6.000]>}>
Direct construction of a LabelAssignment <LabelAssignment {leaf: {'setosa', 'versicolor'}, sepal length (cm): <ContinuousSet=[5.000,6.000]>}>


ValueAssignments are very similar to LabelAssignments. However, they use the internal representation of variables inside the trees: every discrete value is replaced by its index in the distribution, and continuous sets are scaled according to the preprocessing of the variables. ValueAssignments can be created in the same ways as LabelAssignments, and each can be converted into the other by calling the respective method.

In [83]:
print("Internal representation of the query from the previous example %s" % bounded.value_assignment())
print("External representation of the query from the previous example %s" % bounded.value_assignment().label_assignment())

Internal representation of the query from the previous example <ValueAssignment {leaf: {0, 1}, sepal length (cm): <ContinuousSet=[-0.983,-0.011]>}>
External representation of the query from the previous example <LabelAssignment {leaf: {'setosa', 'versicolor'}, sepal length (cm): <ContinuousSet=[5.000,6.000]>}>
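
The relationship between the two representations can be illustrated without ``pyjpt``. The sketch below assumes that labels are indexed in sorted order, which matches the ``{0, 1}`` shown above, and that continuous values are transformed by an affine map. The constants ``OFFSET`` and ``SCALE`` are hypothetical placeholders, not the parameters that ``pyjpt`` derives from the data.

```python
# Discrete part: labels <-> indices (assumption: indices follow sorted label order)
labels = sorted({"setosa", "versicolor", "virginica"})
label_to_index = {label: index for index, label in enumerate(labels)}


def to_values(label_set):
    """Replace every label by its index."""
    return {label_to_index[label] for label in label_set}


def to_labels(index_set):
    """Replace every index by its label."""
    return {labels[index] for index in index_set}


# Continuous part: hypothetical affine scaling, placeholder constants only
OFFSET, SCALE = 5.0, 2.0


def scale(x):
    return (x - OFFSET) / SCALE


def unscale(z):
    return z * SCALE + OFFSET


internal = to_values({"setosa", "versicolor"})
print(internal, sorted(to_labels(internal)))
print(scale(6.0), unscale(scale(6.0)))
```

Both directions are exact inverses of each other here, which is what allows a query to be converted back and forth without losing information.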
