# Constraint-based structure estimation #720

Merged
merged 8 commits into from Aug 30, 2016

Projects
None yet
4 participants
Contributor

# Basic constraint-based structure estimation

This PR adds `estimators/ConstraintBasedEstimator.py`. The class implements the PC algorithm to construct a partially directed acyclic graph (PDAG), either from a given set of independencies or by using independence tests on a discrete data set. The PDAG can then be completed to obtain a `BayesianModel`.

## Methods of estimator class:

### `test_conditional_independence(self, X, Y, Zs=[])`

Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)

### `build_skeleton(nodes, independencies)`

Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)

### `skeleton_to_pdag(skel, seperating_sets)`

Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayseian Networks, Section 10.1.2, page 550, Algorithm 10.2)

### `pdag_to_dag(pdag)`

Method to (faithfully) orient the remaining edges to obtain `BayesianModel` (Implemented as described here on page 454 last paragraph (in text)).

Finally three methods that combine the above parts for convenient access:

### `estimate(self, significance_level=0.01)`

-> returns `BayesianModel` estimate for data

### `estimate_skeleton(self, significance_level=0.01)`

-> returns `UndirectedGraph` estimate for data

### `estimate_from_independencies(nodes, independencies)`

-> static, takes set of independencies and estimates `BayesianModel`.

## Examples:

```import pandas as pd
import numpy as np
from pgmpy.base import DirectedGraph
from pgmpy.estimators import ConstraintBasedEstimator
from pgmpy.independencies import Independencies

data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ'))
data['sum'] = data.sum(axis=1)
print(data)

# estimate BN structue:
c = ConstraintBasedEstimator(data)
model = c.estimate()
print("Resulting network: ", model.edges())```

Output:

``````      X  Y  Z  sum
0     1  3  4    8
1     3  3  0    6
2     4  4  1    9
...  .. .. ..  ...
2497  0  4  2    6
2498  0  3  1    4
2499  2  1  3    6

[2500 rows x 4 columns]

Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]
``````

Using parts of the algorithm manually:

```# some (in)dependence tests:
data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
data['E'] = data['A'] + data['B'] + data['C']
c = ConstraintBasedEstimator(data)

print("\n P-value for hypothesis test that A, C are dependent: ",
c.test_conditional_independence('A', 'C'))
print("P-value for hypothesis test that A, B are dependent, given D: ",
c.test_conditional_independence('A', 'B', 'D'))
print("P-value for hypothesis test that A, B are dependent, given D and E: ",
c.test_conditional_independence('A', 'B', ['D', 'E']))

# build skeleton from list of independencies:
ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D'])
ind = ind.closure()
skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind)
print("Some skeleton: ", skel.edges())

# build PDAG from skeleton (+ sep_sets):
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD'))
data['C'] = data['A'] - data['B']
data['D'] += data['A']
c = ConstraintBasedEstimator(data)
pdag = c.skeleton_to_pdag(*c.estimate_skeleton())
print("Some PDAG: ", pdag.edges())  # edges: A->C, B->C, A--D (not directed)

# complete PDAG to DAG:
pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')])
print("PDAG: ", pdag1.edges())
dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1)
print("DAG:  ", dag1.edges())
```

Output:

``````P-value for hypothesis test that A, C are dependent:  0.995509460079
P-value for hypothesis test that A, B are dependent, given D:  0.998918522413
P-value for hypothesis test that A, B are dependent, given D and E:  0.0
Some skeleton:  [('A', 'D'), ('C', 'D'), ('B', 'D')]
Some PDAG:  [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')]
PDAG:  [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
DAG:   [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
``````

### chrisittner added some commits Jul 14, 2016

``` estimators/ConstraintBasedEstimator.py & method for conditional indep… ```
`…endence testing`
``` be2baa6 ```
``` Constraint-based BN skeleton & PDAG estimation (PC algorithm) ```
``` 63b52a9 ```
``` ConstraintBasedEstimator.estimate()-method + PDAG to DAG conversion ```
``` 34895dd ```
``` Added contains()/__contains__()-method to Independencies-class ```
``` cb6894f ```
``` ConstraintBasedEstimator: seperated build_skeleton() & skeleton_to_pd… ```
`…ag()`
``` 7c56282 ```
``` ConstraintBasedEstimator: Added estimate_from_independencies() and da… ```
`…g_to_pdag()`
``` 0b2abf6 ```

### coveralls commented Aug 19, 2016 • edited

 Coverage decreased (-0.2%) to 96.203% when pulling 783113c on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

### yashu-seth reviewed Aug 19, 2016

 """ if not isinstance(model, BayesianModel): TypeError("'model' must be a BayesianModel()-instance.")

#### yashu-seth Aug 19, 2016

Member

@chrisittner There should be `raise TypeError`.
And I think the message `"model: Expected BayesianModel instance, got type {model_type}".format(model_type=type(model))"` would be consistent with other TypeErrors.

### yashu-seth reviewed Aug 19, 2016

 #!/usr/bin/env python import numpy as np import pandas as pd from warnings import warn

#### yashu-seth Aug 19, 2016

Member

@chrisittner
Imports should be grouped in the following order:

standard library imports
related third party imports
local application/library specific imports

You should put a blank line between each group of imports.

### yashu-seth reviewed Aug 19, 2016

 Parameters ---------- p_value: float, default: 0.01

#### yashu-seth Aug 19, 2016 • edited

Member

@chrisittner I think this should be renamed to `significance_level`. P-value is what we obtain for a particular test. Then we compare it to the significance level and make our decision accordingly. Naming this variable `p-value` seems confusing to me. What do you think?

Contributor

Yes! Thanks.

### yashu-seth reviewed Aug 19, 2016

 ---------- p_value: float, default: 0.01 A significance level to use for conditional independence tests in the data set. The p_value is the threshold probability of falsely

#### yashu-seth Aug 19, 2016

Member

@chrisittner This line is not very clear to me. I think mentioning the null hypothesis and the alternate hypothesis here would make things more clear.

### yashu-seth reviewed Aug 19, 2016

 nodes = self.state_names.keys() def is_independent(X, Y, Zs): """Computes p-value of chi2 independence test. If it is larger

#### yashu-seth Aug 19, 2016

Member

@chrisittner Same here, I feel the null hypothesis should be mentioned.

### yashu-seth reviewed Aug 19, 2016

 p_value: float A significance level for the hypothesis that X and Y are dependent given Zs. The p_value is the probability of falsely rejecting the hypothesis that the variables are conditionally dependent. A low

#### yashu-seth Aug 19, 2016 • edited

Member

@chrisittner I don't think this should be a defined as a significance level. p-value is the probability of obtaining the given sample statistic or an even more extreme statistic considering the null hypothesis is true. If this obtained p-value is greater than the significance_level then we fail to reject the null hypothesis. Here, the null hypothesis is that the variables are independent and the alternate hypothesis is that they are dependent.

Please correct me if you feel otherwise.
And I cannot understand the statement - "falsely rejecting the hypothesis".

#### chrisittner Aug 20, 2016

Contributor

Fixed, this was confused. I also changed the return type, now the docstring reads:

``````        Returns
-------
chi2: float
The chi2 test statistic.
p_value: float
The p_value, i.e. the probability of observing the computed chi2
statistic (or an even higher value), given the null hypothesis
that X _|_ Y | Zs.
sufficient_data: bool
A flag that indicates if the sample size is considered sufficient.
As in [4], require at least 5 samples per parameter (on average).
That is, the size of the data set must be greater than
`5 * (c(X) - 1) * (c(Y) - 1) * prod([c(Z) for Z in Zs])`
(c() denotes the variable cardinality).

``````
``` ConstraintBasedEstimator: Added tests & minor edits ```
``` b0a3c8e ```
Contributor

### chrisittner commented Aug 20, 2016

 I amended to the last commit to account for @yashu-seth's helpful comments. `p_value` is renamed to `significance_level` & descriptions are improved/fixed. Also changed `test_conditional_independence` to also return chi2 value and an indicator whether the data sample was sufficiently large.

### coveralls commented Aug 20, 2016

 Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

### coveralls commented Aug 20, 2016

 Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

### coveralls commented Aug 20, 2016

 Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.
``` moved test_conditional_independencies() to BaseEstimator class ```
``` 10510a0 ```

Open

### coveralls commented Aug 20, 2016 • edited

 Coverage decreased (-0.2%) to 96.222% when pulling 10510a0 on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.
Contributor

### chrisittner commented Aug 20, 2016

 Moved `test_conditional_independence` to the base class, MMHC needs it as well.

### ankurankan reviewed Aug 25, 2016

 @@ -0,0 +1,533 @@ #!/usr/bin/env python from warnings import warn

#### ankurankan Aug 25, 2016

Member

@chrisittner Add a blank line after shebang.

### ankurankan reviewed Aug 25, 2016

 [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')] """ skel, seperating_sets = self.estimate_skeleton(significance_level)

#### ankurankan Aug 25, 2016

Member

@chrisittner Typo in `separating`. Please replace everywhere.

### ankurankan reviewed Aug 25, 2016

 ------- skeleton: UndirectedGraph An estimate for the undirected graph skeleton of the BN underlying the data. seperating_sets: dict

#### ankurankan Aug 25, 2016

Member

Add a blank line before this. And in other places.

### ankurankan merged commit `10510a0` into pgmpy:dev Aug 30, 2016 1 of 3 checks passed

#### 1 of 3 checks passed

coverage/coveralls Coverage decreased (-0.2%) to 96.222%
Details
code-quality/landscape Landscape is checking code quality
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details