New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constraint-based structure estimation #720

Merged
merged 8 commits into from Aug 30, 2016

Conversation

Projects
None yet
4 participants
@chrisittner
Contributor

chrisittner commented Aug 19, 2016

Basic constraint-based structure estimation

This PR adds estimators/ConstraintBasedEstimator.py. The class implements the PC algorithm to construct a partially directed acyclic graph (PDAG), either from a given set of independencies or by using independence tests on a discrete data set. The PDAG can then be completed to obtain a BayesianModel.

Methods of estimator class:

test_conditional_independence(self, X, Y, Zs=[])

Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)

build_skeleton(nodes, independencies)

Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)

skeleton_to_pdag(skel, seperating_sets)

Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayseian Networks, Section 10.1.2, page 550, Algorithm 10.2)

pdag_to_dag(pdag)

Method to (faithfully) orient the remaining edges to obtain BayesianModel (Implemented as described here on page 454 last paragraph (in text)).

Finally three methods that combine the above parts for convenient access:

estimate(self, significance_level=0.01)

-> returns BayesianModel estimate for data

estimate_skeleton(self, significance_level=0.01)

-> returns UndirectedGraph estimate for data

estimate_from_independencies(nodes, independencies)

-> static, takes set of independencies and estimates BayesianModel.

Examples:

import pandas as pd
import numpy as np
from pgmpy.base import DirectedGraph
from pgmpy.estimators import ConstraintBasedEstimator
from pgmpy.independencies import Independencies

data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ'))
data['sum'] = data.sum(axis=1)
print(data)

# estimate BN structue:
c = ConstraintBasedEstimator(data)
model = c.estimate()
print("Resulting network: ", model.edges())

Output:

      X  Y  Z  sum
0     1  3  4    8
1     3  3  0    6
2     4  4  1    9
...  .. .. ..  ...
2497  0  4  2    6
2498  0  3  1    4
2499  2  1  3    6

[2500 rows x 4 columns]

Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]

Using parts of the algorithm manually:

# some (in)dependence tests:
data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
data['E'] = data['A'] + data['B'] + data['C']
c = ConstraintBasedEstimator(data)

print("\n P-value for hypothesis test that A, C are dependent: ",
      c.test_conditional_independence('A', 'C'))
print("P-value for hypothesis test that A, B are dependent, given D: ",
      c.test_conditional_independence('A', 'B', 'D'))
print("P-value for hypothesis test that A, B are dependent, given D and E: ",
      c.test_conditional_independence('A', 'B', ['D', 'E']))

# build skeleton from list of independencies:
ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D'])
ind = ind.closure()
skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind)
print("Some skeleton: ", skel.edges())

# build PDAG from skeleton (+ sep_sets):
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD'))
data['C'] = data['A'] - data['B']
data['D'] += data['A']
c = ConstraintBasedEstimator(data)
pdag = c.skeleton_to_pdag(*c.estimate_skeleton())
print("Some PDAG: ", pdag.edges())  # edges: A->C, B->C, A--D (not directed)

# complete PDAG to DAG:
pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')])
print("PDAG: ", pdag1.edges())
dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1)
print("DAG:  ", dag1.edges())

Output:

P-value for hypothesis test that A, C are dependent:  0.995509460079
P-value for hypothesis test that A, B are dependent, given D:  0.998918522413
P-value for hypothesis test that A, B are dependent, given D and E:  0.0
Some skeleton:  [('A', 'D'), ('C', 'D'), ('B', 'D')]
Some PDAG:  [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')]
PDAG:  [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
DAG:   [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
@coveralls

This comment has been minimized.

coveralls commented Aug 19, 2016

Coverage Status

Coverage decreased (-0.2%) to 96.203% when pulling 783113c on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

"""
if not isinstance(model, BayesianModel):
TypeError("'model' must be a BayesianModel()-instance.")

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner There should be raise TypeError.
And I think the message "model: Expected BayesianModel instance, got type {model_type}".format(model_type=type(model))" would be consistent with other TypeErrors.

#!/usr/bin/env python
import numpy as np
import pandas as pd
from warnings import warn

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner
Imports should be grouped in the following order:

standard library imports
related third party imports
local application/library specific imports

You should put a blank line between each group of imports.

Parameters
----------
p_value: float, default: 0.01

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner I think this should be renamed to significance_level. P-value is what we obtain for a particular test. Then we compare it to the significance level and make our decision accordingly. Naming this variable p-value seems confusing to me. What do you think?

This comment has been minimized.

@chrisittner

chrisittner Aug 20, 2016

Contributor

Yes! Thanks.

----------
p_value: float, default: 0.01
A significance level to use for conditional independence tests in
the data set. The p_value is the threshold probability of falsely

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner This line is not very clear to me. I think mentioning the null hypothesis and the alternate hypothesis here would make things more clear.

nodes = self.state_names.keys()
def is_independent(X, Y, Zs):
"""Computes p-value of chi2 independence test. If it is larger

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner Same here, I feel the null hypothesis should be mentioned.

p_value: float
A significance level for the hypothesis that X and Y are dependent
given Zs. The p_value is the probability of falsely rejecting the
hypothesis that the variables are conditionally dependent. A low

This comment has been minimized.

@yashu-seth

yashu-seth Aug 19, 2016

Member

@chrisittner I don't think this should be a defined as a significance level. p-value is the probability of obtaining the given sample statistic or an even more extreme statistic considering the null hypothesis is true. If this obtained p-value is greater than the significance_level then we fail to reject the null hypothesis. Here, the null hypothesis is that the variables are independent and the alternate hypothesis is that they are dependent.

Please correct me if you feel otherwise.
And I cannot understand the statement - "falsely rejecting the hypothesis".

This comment has been minimized.

@chrisittner

chrisittner Aug 20, 2016

Contributor

Fixed, this was confused. I also changed the return type, now the docstring reads:

        Returns
        -------
        chi2: float
            The chi2 test statistic.
        p_value: float
            The p_value, i.e. the probability of observing the computed chi2
            statistic (or an even higher value), given the null hypothesis
            that X _|_ Y | Zs.
        sufficient_data: bool
            A flag that indicates if the sample size is considered sufficient.
            As in [4], require at least 5 samples per parameter (on average).
            That is, the size of the data set must be greater than
            `5 * (c(X) - 1) * (c(Y) - 1) * prod([c(Z) for Z in Zs])`
            (c() denotes the variable cardinality).

@chrisittner

This comment has been minimized.

Contributor

chrisittner commented Aug 20, 2016

I amended to the last commit to account for @yashu-seth's helpful comments. p_value is renamed to significance_level & descriptions are improved/fixed. Also changed test_conditional_independence to also return chi2 value and an indicator whether the data sample was sufficiently large.

@coveralls

This comment has been minimized.

coveralls commented Aug 20, 2016

Coverage Status

Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

2 similar comments
@coveralls

This comment has been minimized.

coveralls commented Aug 20, 2016

Coverage Status

Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

@coveralls

This comment has been minimized.

coveralls commented Aug 20, 2016

Coverage Status

Coverage decreased (-0.2%) to 96.223% when pulling b0a3c8e on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

@coveralls

This comment has been minimized.

coveralls commented Aug 20, 2016

Coverage Status

Coverage decreased (-0.2%) to 96.222% when pulling 10510a0 on chrisittner:constraint-based-structure-search into e6d612d on pgmpy:dev.

@chrisittner

This comment has been minimized.

Contributor

chrisittner commented Aug 20, 2016

Moved test_conditional_independence to the base class, MMHC needs it as well.

@@ -0,0 +1,533 @@
#!/usr/bin/env python
from warnings import warn

This comment has been minimized.

@ankurankan

ankurankan Aug 25, 2016

Member

@chrisittner Add a blank line after shebang.

[('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]
"""
skel, seperating_sets = self.estimate_skeleton(significance_level)

This comment has been minimized.

@ankurankan

ankurankan Aug 25, 2016

Member

@chrisittner Typo in separating. Please replace everywhere.

-------
skeleton: UndirectedGraph
An estimate for the undirected graph skeleton of the BN underlying the data.
seperating_sets: dict

This comment has been minimized.

@ankurankan

ankurankan Aug 25, 2016

Member

Add a blank line before this. And in other places.

@ankurankan ankurankan merged commit 10510a0 into pgmpy:dev Aug 30, 2016

1 of 3 checks passed

coverage/coveralls Coverage decreased (-0.2%) to 96.222%
Details
code-quality/landscape Landscape is checking code quality
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment