This notebook is licensed under the MIT License. See the [LICENSE file](https://github.com/tommasocarraro/LTNtorch/blob/main/LICENSE) in the project root for details.

## Complementary Notebook: Appropriate Operators to Approximate Connectives and Quantifiers

This notebook is a complement to the tutorial on operators (2-grounding_connectives.ipynb).

Logical connectives are grounded in LTN using fuzzy semantics. However, while all fuzzy logic operators make sense when simply *querying* the language, not every operator is equally suited for *learning*.

We will look at common problems with some fuzzy semantics and see which operators are better suited to the task of *learning*.

In [7]:
import ltn
import torch

### Querying

One can access the implementation of the most common fuzzy semantics in the `ltn.fuzzy_ops` module.
They are implemented using PyTorch primitives.

Here, we compare:
- the product t-norm: $u \land_{\mathrm{prod}} v = uv$,
- the Lukasiewicz t-norm: $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$,
- the minimum aggregator: $\min(u_1,\dots,u_n)$,
- the p-mean error aggregator (generalized mean of the deviations w.r.t. the truth): $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

These operators convey very different meanings, but each of them can make sense depending on the intent of the query.

The following cells show that different semantics for the conjunction return very different results on the same inputs.
The same holds when comparing different aggregators computed on the same input.

In [8]:
x1 = torch.tensor(0.4)
x2 = torch.tensor(0.7)

# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.AndProd(stable=False)
and_luk = ltn.fuzzy_ops.AndLuk()

print(and_prod(x1, x2))
print(and_luk(x1, x2))

tensor(0.2800)
tensor(0.1000)


In [9]:
xs = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1])

# the stable keyword is explained at the end of the notebook
forall_min = ltn.fuzzy_ops.AggregMin()
forall_pME = ltn.fuzzy_ops.AggregPMeanError(p=4, stable=False)

print(forall_min(xs, dim=0))
print(forall_pME(xs, dim=0))

tensor(0.1000)
tensor(0.3134)
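
As a cross-check, the four formulas listed above can be written directly with PyTorch primitives. This is only a sketch: the helper names (`prod_tnorm`, `luk_tnorm`, `min_aggregator`, `pme_aggregator`) are hypothetical and are not the `ltn.fuzzy_ops` implementation, but on the same inputs they should give values close to the ones printed above.

```python
import torch

# Hypothetical helpers written from the formulas above (not the ltn.fuzzy_ops code).
def prod_tnorm(u, v):
    return u * v

def luk_tnorm(u, v):
    return torch.clamp(u + v - 1.0, min=0.0)

def min_aggregator(us):
    return us.min()

def pme_aggregator(us, p):
    return 1.0 - (1.0 - us).pow(p).mean().pow(1.0 / p)

u, v = torch.tensor(0.4), torch.tensor(0.7)
us = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1])

conj_prod = prod_tnorm(u, v)       # 0.4 * 0.7
conj_luk = luk_tnorm(u, v)         # max(0.4 + 0.7 - 1, 0)
agg_min = min_aggregator(us)       # smallest truth value
agg_pme = pme_aggregator(us, p=4)  # 1 - generalized mean of the deviations

print(conj_prod, conj_luk, agg_min, agg_pme)
```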


### Learning

While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are not suitable for gradient-based algorithms. For more details, read [van Krieken et al., *Analyzing Differentiable Fuzzy Logic Operators*, 2020](https://arxiv.org/abs/2002.06100).

Here, we give simple illustrations of such gradient issues.

#### 1. Vanishing Gradients

Some operators have vanishing gradients on some part of their domains.

For example, in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, if $u+v-1 < 0$, the gradients vanish.

The following cell uses inputs in this region ($0.3 + 0.5 - 1 < 0$), for which the Lukasiewicz conjunction returns zero gradients.

In [4]:
x1 = torch.tensor(0.3, requires_grad=True)
x2 = torch.tensor(0.5, requires_grad=True)

y = and_luk(x1, x2)
y.backward()  # this is necessary to compute the gradients
res = y.item()
gradients = [v.grad for v in [x1, x2]]
# print the result of the aggregation
print(res)
# print gradients of x1 and x2
print(gradients)

0.0
[tensor(0.), tensor(0.)]
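
For contrast, the same inputs can be passed through the product t-norm, whose partial derivatives are $\partial(uv)/\partial u = v$ and $\partial(uv)/\partial v = u$. The sketch below uses plain PyTorch; `luk`, `prod` and `conjunction_grads` are hypothetical helpers written from the formulas in this notebook, not the `ltn` operators.

```python
import torch

def conjunction_grads(conjunction, a, b):
    """Return (value, d/da, d/db) of a binary fuzzy conjunction at (a, b)."""
    u = torch.tensor(a, requires_grad=True)
    v = torch.tensor(b, requires_grad=True)
    y = conjunction(u, v)
    y.backward()
    return y.item(), u.grad.item(), v.grad.item()

def luk(u, v):
    # Lukasiewicz t-norm: max(u + v - 1, 0)
    return torch.clamp(u + v - 1.0, min=0.0)

def prod(u, v):
    # product t-norm: u * v
    return u * v

luk_result = conjunction_grads(luk, 0.3, 0.5)
prod_result = conjunction_grads(prod, 0.3, 0.5)
print("Lukasiewicz:", luk_result)
print("product:    ", prod_result)
```

On these inputs the Lukasiewicz conjunction gives no learning signal, while the product t-norm passes a nonzero gradient to both inputs.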


#### 2. Single-Passing Gradients

Some operators have gradients propagating to only one input at a time, meaning that all other inputs will not benefit from learning at this step.

An example is the minimum aggregator, namely $\min(u_1,\dots,u_n)$.

In the following, the `Min` aggregator leads to single-passing gradients: only the smallest input receives a nonzero gradient.

In [5]:
xs = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_min(xs, dim=0)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print gradients of xs
print(gradients)

0.10000000149011612
tensor([0., 0., 0., 0., 0., 0., 0., 1.])
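
One way to quantify this behavior is to count how many inputs receive a nonzero gradient. The sketch below does so for the minimum and, for comparison, for the p-mean error with $p=4$ written from its formula. It assumes plain PyTorch only; `count_updated_inputs` is a hypothetical helper introduced for illustration.

```python
import torch

values = [1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1]

def count_updated_inputs(aggregator):
    """Number of inputs that receive a nonzero gradient from `aggregator`."""
    us = torch.tensor(values, requires_grad=True)
    aggregator(us).backward()
    return int(torch.count_nonzero(us.grad))

# minimum aggregator
n_min = count_updated_inputs(lambda us: us.min())
# p-mean error with p = 4, written from its definition
n_pme = count_updated_inputs(lambda us: 1.0 - (1.0 - us).pow(4).mean().pow(0.25))

print(n_min, n_pme)
```

With the p-mean error, every input that is not already fully true contributes to the update, whereas the minimum updates a single input.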


#### 3. Exploding Gradients

Some operators have exploding gradients on some part of their domains.

An example is the `pMeanError` aggregator, namely $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

In the edge case where all inputs are $1.0$, this operator leads to exploding gradients.

In the following, it is possible to observe this behavior.

In [6]:
xs = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)

y = forall_pME(xs, dim=0, p=4)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print the gradients of xs
print(gradients)

1.0
tensor([nan, nan, nan])
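
The `NaN` values can be traced back to the closed-form gradient. Differentiating the definition gives $\frac{\partial\,\mathrm{pME}}{\partial u_j} = \frac{1}{n}(1-u_j)^{p-1} \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}-1}$. For $p > 1$ the exponent $\frac{1}{p}-1$ is negative, so when all inputs are $1$ the second factor diverges while the first one is $0$, and floating-point arithmetic returns `NaN`. The sketch below checks this formula against autograd and reproduces the edge case. It uses plain PyTorch with hypothetical helper names (`pme`, `pme_grad_closed_form`), not the `ltn` implementation.

```python
import torch

def pme(us, p):
    """p-mean error written directly from its definition."""
    return 1.0 - (1.0 - us).pow(p).mean().pow(1.0 / p)

def pme_grad_closed_form(us, p):
    """Gradient of pME from the closed-form expression given above."""
    mean_dev = (1.0 - us).pow(p).mean()
    return (1.0 - us).pow(p - 1) / us.numel() * mean_dev.pow(1.0 / p - 1.0)

p = 4

# Regular inputs: autograd and the closed form should agree.
us = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)
pme(us, p).backward()
formula_matches_autograd = torch.allclose(
    us.grad, pme_grad_closed_form(us.detach(), p), atol=1e-6
)

# Edge case: all inputs are 1, so the mean deviation is 0.
ones = torch.ones(3, requires_grad=True)
pme(ones, p).backward()
outer_factor = (1.0 - ones.detach()).pow(p).mean().pow(1.0 / p - 1.0)  # 0 ** (-3/4)

print(formula_matches_autograd)
print(outer_factor, ones.grad)
```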


### Stable Product Configuration

#### Product Configuration

In general, we recommend using the following "product configuration" in LTN:
* not: the standard negation  $\lnot u = 1-u$,
* and: the product t-norm $u \land v = uv$,
* or: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
* implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
* existential quantification ("exists"): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$,
* universal quantification ("for all"): the generalized mean of "the deviations w.r.t. the truth" (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$.

#### "Stable"

As is, this "product configuration" is not fully exempt from issues:
- the product t-norm has vanishing gradients on the edge case $u=v=0$;
- the product t-conorm has vanishing gradients on the edge case $u=v=1$;
- the Reichenbach implication has vanishing gradients on the edge case $u=0$, $v=1$;
- `pMean` has exploding gradients on the edge case $u_1=\dots=u_n=0$;
- `pMeanError` has exploding gradients on the edge case $u_1=\dots=u_n=1$.
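
The first four edge cases can be reproduced with autograd. The sketch below writes the connectives and `pMean` (with $p=2$) directly from the formulas of the product configuration. It is an illustration in plain PyTorch, and `grads_at` is a hypothetical helper, not part of `ltn`.

```python
import torch

def grads_at(op, *values):
    """Gradients of `op` with respect to each scalar input in `values`."""
    inputs = [torch.tensor(v, requires_grad=True) for v in values]
    op(*inputs).backward()
    return [x.grad.item() for x in inputs]

# Connectives of the product configuration, evaluated on their edge cases.
and_grads = grads_at(lambda u, v: u * v, 0.0, 0.0)                # u = v = 0
or_grads = grads_at(lambda u, v: u + v - u * v, 1.0, 1.0)         # u = v = 1
implies_grads = grads_at(lambda u, v: 1.0 - u + u * v, 0.0, 1.0)  # u = 0, v = 1

# pMean with p = 2 on inputs that are all 0.
zeros = torch.zeros(3, requires_grad=True)
zeros.pow(2).mean().pow(0.5).backward()

print(and_grads, or_grads, implies_grads)
print(zeros.grad)
```

The three connectives return zero gradients for both inputs, and `pMean` returns non-finite gradients, matching the list above.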

However, all these issues happen on edge cases and can easily be fixed using the following "trick":
- if the edge case happens when an input $u$ is $0$, we modify every input with $u' = (1-\epsilon)u+\epsilon$;
- if the edge case happens when an input $u$ is $1$, we modify every input with $u' = (1-\epsilon)u$;

where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).

This "trick" gives us a stable version of such operators, stable in the sense that they no longer have these gradient issues.
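
The two input transformations can be sketched in a few lines of plain PyTorch. The helper names `not_zeros` and `not_ones` and the value `EPS = 1e-4` are assumptions made for this illustration only; they are not claimed to match the `ltn` internals.

```python
import torch

EPS = 1e-4  # illustrative value, chosen for this sketch only

def not_zeros(u, eps=EPS):
    """Move inputs away from 0: u' = (1 - eps) * u + eps."""
    return (1.0 - eps) * u + eps

def not_ones(u, eps=EPS):
    """Move inputs away from 1: u' = (1 - eps) * u."""
    return (1.0 - eps) * u

# Product t-norm at u = v = 0: the gradients are small but no longer zero.
u = torch.tensor(0.0, requires_grad=True)
v = torch.tensor(0.0, requires_grad=True)
(not_zeros(u) * not_zeros(v)).backward()

# p-mean error (p = 4) with all inputs equal to 1: the gradients become finite.
ones = torch.ones(3, requires_grad=True)
stable_pme = 1.0 - (1.0 - not_ones(ones)).pow(4).mean().pow(0.25)
stable_pme.backward()

print(u.grad, v.grad)
print(stable_pme, ones.grad)
```

The price of the trick is a slightly biased truth value: with all inputs equal to $1$, the stable p-mean error evaluates to about $1-\epsilon$ rather than exactly $1$.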

One can trigger the stable version of such operators by using the boolean parameter `stable`. It is possible to set a default
value for `stable` when initializing the operator, or to use different values at each call of the operator.

In the following, we repeat the last example, this time using the stable version of the `pMeanError`
operator. The gradients are no longer `NaN`: thanks to the stable version of the
operator, we now obtain suitable gradients.

In [10]:
xs = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)

# the exploding gradient problem is solved
y = forall_pME(xs, dim=0, p=4, stable=True)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print the gradients of xs
print(gradients)

0.9998999834060669
tensor([0.3333, 0.3333, 0.3333])


#### The hyper-parameter $p$ in the generalized means

The hyper-parameter $p$ of `pMean` and `pMeanError` offers flexibility in writing more or less strict formulas, to
account for outliers in the data depending on the application. However, $p$ should be carefully set since it could have
strong implications for the training of LTN.

In the following, we see how a large increase of $p$ leads to single-passing gradients in the `pMeanError` operator. This is
intuitive: in the second tutorial we observed that `pMean` tends to the `Max` when $p$ tends to infinity and, by the same
argument, `pMeanError` tends to the `Min`. As seen before in this tutorial, the `Min` aggregator leads to single-passing
gradients, and the same holds for the `Max` aggregator.

In [11]:
xs = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_pME(xs, dim=0, p=4)
res = y.item()
y.backward()
gradients = xs.grad
# print result of aggregation
print(res)
# print gradients of xs
print(gradients)

0.31339913606643677
tensor([0.0000, 0.0000, 0.0000, 0.0483, 0.1325, 0.1977, 0.1977, 0.2815])


In [12]:
xs = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_pME(xs, dim=0, p=20)
res = y.item()
y.backward()
gradients = xs.grad
# print result of aggregation
print(res)
# print gradients of xs
print(gradients)

0.18157517910003662
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0734e-05, 6.4147e-03, 8.1100e-02,
        8.1100e-02, 7.6019e-01])


While it can be tempting to set a high value for $p$ when querying, in a learning setting this can quickly lead to a "single-passing" operator that focuses too much on outliers at each step (i.e., the gradients concentrate on one input at this step, potentially harming the training of the others). We recommend not setting $p$ too high when learning.
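
To get a feeling for how quickly this happens, one can track the share of the total gradient that goes to a single input as $p$ grows. The sketch below does this for the p-mean error written from its definition, on the same inputs as above. It uses plain PyTorch, and `largest_gradient_share` and the chosen values of $p$ are hypothetical, picked for illustration.

```python
import torch

values = [1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1]

def largest_gradient_share(p):
    """Fraction of the total pME gradient received by a single input."""
    us = torch.tensor(values, requires_grad=True)
    (1.0 - (1.0 - us).pow(p).mean().pow(1.0 / p)).backward()
    return (us.grad.max() / us.grad.sum()).item()

shares = {p: largest_gradient_share(p) for p in (1, 2, 4, 10, 20)}
for p, share in shares.items():
    print(f"p={p:>2}: {share:.3f}")
```

With $p=1$ every input receives the same share of the gradient. On these inputs, the share of the smallest input then grows to roughly a third for $p=4$ and to more than $80\%$ for $p=20$, consistent with the gradients printed in the two cells above.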
