New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor MLE.py/BaseEstimator.py #694

Merged
merged 5 commits into from Jun 7, 2016

Conversation

Projects
None yet
3 participants
@chrisittner
Contributor

chrisittner commented May 31, 2016

3 commits:

  1. cleaned code in get_parameters a bit, added internal _estimate_cpd method, because estimation is done node-by-node.
  2. Allow to pass an optional list/set of possible values for each node in BaseEstimator.__init()__, to deal with cases where not all values that a variable may have occur in the data set.
  3. Make use of the state_name feature of CPDs for values taken from the data set:
import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
mle=MaximumLikelihoodEstimator(BayesianModel([('A','B')]), 
                               pd.DataFrame(data={'A':[1, 2], 'B':[23, 42]}))
print(str(mle.get_parameters()[0]))

Output before:

╒═════╤═════╤═════╕
│ A   │ A_0 │ A_1 │
├─────┼─────┼─────┤
│ B_0 │ 1.0 │ 0.0 │
├─────┼─────┼─────┤
│ B_1 │ 0.0 │ 1.0 │
╘═════╧═════╧═════╛

Output now:

╒═══════╤══════╤══════╕
│ A     │ A(1) │ A(2) │
├───────┼──────┼──────┤
│ B(23) │ 1.0  │ 0.0  │
├───────┼──────┼──────┤
│ B(42) │ 0.0  │ 1.0  │
╘═══════╧══════╧══════╛
@chrisittner

This comment has been minimized.

Contributor

chrisittner commented May 31, 2016

There are two more things that should be changed in the MLE code:

  • remove restriction that values in data set must be integers Update: fixed in 4. commit, see comment below
  • deal with missing data (e.g. #687)

@ankurankan you self-assigned to #687, are you already working on it? Otherwise, I can send a PR for that, too.

@coveralls

This comment has been minimized.

coveralls commented May 31, 2016

Coverage Status

Coverage decreased (-0.02%) to 96.093% when pulling a322652 on chrisittner:refactor_MLE into c9bc0e2 on pgmpy:dev.

@coveralls

This comment has been minimized.

coveralls commented May 31, 2016

Coverage Status

Coverage decreased (-0.004%) to 96.106% when pulling 0815f3c on chrisittner:refactor_MLE into c9bc0e2 on pgmpy:dev.

@chrisittner

This comment has been minimized.

Contributor

chrisittner commented Jun 1, 2016

I added a fourth commit to use arbitrary state_names in the data set, not only ints. Now this works:

import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
model = BayesianModel([('Light?', 'Color'), ('Fruit', 'Color')])
data = pd.DataFrame(data={'Fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana'],
                          'Light?': [True,   True,   False,   False,    True],
                          'Color': ['red',   'green', 'black', 'black',  'yellow']})
mle = MaximumLikelihoodEstimator(model, data)
print(str(mle._estimate_cpd('Color')))

Output:

╒═══════════════╤═══════════════╤══════════════╤═══════════════╤═══════════════╕
│ Fruit         │ Fruit(Apple)  │ Fruit(Apple) │ Fruit(Banana) │ Fruit(Banana) │
├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
│ Light?        │ Light?(False) │ Light?(True) │ Light?(False) │ Light?(True)  │
├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
│ Color(black)  │ 1.0           │ 0.0          │ 1.0           │ 0.0           │
├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
│ Color(green)  │ 0.0           │ 0.5          │ 0.0           │ 0.0           │
├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
│ Color(red)    │ 0.0           │ 0.5          │ 0.0           │ 0.0           │
├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
│ Color(yellow) │ 0.0           │ 0.0          │ 0.0           │ 1.0           │
╘═══════════════╧═══════════════╧══════════════╧═══════════════╧═══════════════╛

Edit: There was a copy paste error in the output, now fixed.


The existing code for MLE still had issues if the variable states weren't numbered from 0, which are now fixed with proper state names. Using 1 and 2 as values for a binary variable instead of 0 and 1 led to wrong output if the variable had parents and in the case without parents it was confusing because of the ordering:

>>> import pandas as pd
>>> from pgmpy.models import BayesianModel
>>> from pgmpy.estimators import MaximumLikelihoodEstimator
>>> model = BayesianModel([('A', 'C'), ('B', 'C')])
>>> 
>>> data1 = pd.DataFrame(data={'A': [0, 1, 1],
...                            'B': [0, 1, 1],
...                            'C': [0, 0, 1]})
>>> cpds1 = MaximumLikelihoodEstimator(model, data1).get_parameters()
>>> 
>>> data2 = pd.DataFrame(data={'A': [2, 1, 1],
...                            'B': [0, 1, 1],
...                            'C': [0, 0, 1]})
>>> cpds2 = MaximumLikelihoodEstimator(model, data2).get_parameters()
>>> 
>>> print(str(cpds1[0]))
╒═════╤═════╤═════╤═════╤═════╕
│ A   │ A_0 │ A_0 │ A_1 │ A_1 │
├─────┼─────┼─────┼─────┼─────┤
│ B   │ B_0 │ B_1 │ B_0 │ B_1 │
├─────┼─────┼─────┼─────┼─────┤
│ C_0 │ 1.00.50.50.5 │
├─────┼─────┼─────┼─────┼─────┤
│ C_1 │ 0.00.50.50.5 │
╘═════╧═════╧═════╧═════╧═════╛
>>> print(str(cpds2[0]))
╒═════╤═════╤═════╤═════╤═════╕
│ A   │ A_0 │ A_0 │ A_1 │ A_1 │
├─────┼─────┼─────┼─────┼─────┤
│ B   │ B_0 │ B_1 │ B_0 │ B_1 │
├─────┼─────┼─────┼─────┼─────┤
│ C_0 │ 0.50.50.50.5 │
├─────┼─────┼─────┼─────┼─────┤
│ C_1 │ 0.50.50.50.5 │
╘═════╧═════╧═════╧═════╧═════╛
>>> print(str(cpds1[1]))
╒═════╤══════════╕
│ A_0 │ 0.333333 │
├─────┼──────────┤
│ A_1 │ 0.666667 │
╘═════╧══════════╛
>>> print(str(cpds2[1]))
╒═════╤══════════╕
│ A_0 │ 0.666667 │
├─────┼──────────┤
│ A_1 │ 0.333333 │
╘═════╧══════════╛

@chrisittner chrisittner changed the title from Refactor MLE.py/BaseEstimator.py to [WIP] Refactor MLE.py/BaseEstimator.py Jun 1, 2016

@ankurankan

This comment has been minimized.

Member

ankurankan commented Jun 4, 2016

@chrisittner travis is failing. Can you please check: https://travis-ci.org/pgmpy/pgmpy/builds/134468165

@chrisittner

This comment has been minimized.

Contributor

chrisittner commented Jun 4, 2016

Yep, will do today. It's a bit tricky because I can't reproduce it yet. Tests run fine on my system.

@coveralls

This comment has been minimized.

coveralls commented Jun 6, 2016

Coverage Status

Coverage decreased (-0.03%) to 96.076% when pulling c7b4035 on chrisittner:refactor_MLE into c9bc0e2 on pgmpy:dev.

@chrisittner chrisittner changed the title from [WIP] Refactor MLE.py/BaseEstimator.py to Refactor MLE.py/BaseEstimator.py Jun 6, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 6, 2016

Coverage Status

Coverage decreased (-0.03%) to 96.078% when pulling 04c4d14 on chrisittner:refactor_MLE into c9bc0e2 on pgmpy:dev.

@ankurankan

This comment has been minimized.

Member

ankurankan commented Jun 7, 2016

@chrisittner Can you please fix #687 also if you haven't yet ?

@ankurankan

This comment has been minimized.

Member

ankurankan commented Jun 7, 2016

@chrisittner BTW you don't need to use print(str(cpd)) for printing the structure. print(cpd) should give the same result.

"""
def __init__(self, model, data):
def __init__(self, model, data, node_values=None):

This comment has been minimized.

@ankurankan

ankurankan Jun 7, 2016

Member

@chrisittner Add docstring.

This comment has been minimized.

@ankurankan

ankurankan Jun 7, 2016

Member

@chrisittner Is node_values is for passing the state names ? If yes, I would suggest you to change the name as it's not clear from the name.

This comment has been minimized.

@chrisittner

chrisittner Jun 7, 2016

Contributor

Fixed. It is for optionally passing the possible states for each variable/node, i used state_names now.

>>> data = pd.DataFrame(data={'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]})
>>> model = BayesianModel([('A', 'C'), ('B', 'C')])
>>> cpd_A = MaximumLikelihoodEstimator(model, data)._estimate_cpd('A')
>>> print(str(cpd_A))

This comment has been minimized.

@ankurankan

ankurankan Jun 7, 2016

Member

@chrisittner You can replace this with just print(cpd_A).

evidence_card=parents_cardinalities,
state_names=state_names)
cpd.normalize()
return cpd

This comment has been minimized.

@ankurankan

ankurankan Jun 7, 2016

Member

@chrisittner Add a new line at the end of the file.

This comment has been minimized.

@chrisittner

chrisittner Jun 7, 2016

Contributor

Hm, I'm positive that the files end with a newline. When i add another one my PEP8 linter complains about having an extra blank line in the end.

for node in model.nodes():
if node in node_values:
if not set(self._get_node_values(node)) <= set(node_values[node]):
raise ValueError("Data contains unexpected values for variable '" + str(node) + "'.")

This comment has been minimized.

@ankurankan

ankurankan Jun 7, 2016

Member

@chrisittner Use .format instead of using +.

@ankurankan

This comment has been minimized.

Member

ankurankan commented Jun 7, 2016

@chrisittner PR looks good. Just have a look at the comments.

@chrisittner

This comment has been minimized.

Contributor

chrisittner commented Jun 7, 2016

Fixed the points raised in the comments. I will make a seperate PR for #687 and one to push my GSoC things, so this is ready from my side.

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2016

Coverage Status

Coverage decreased (-0.007%) to 96.103% when pulling 523fb0c on chrisittner:refactor_MLE into c9bc0e2 on pgmpy:dev.

@ankurankan ankurankan merged commit 523fb0c into pgmpy:dev Jun 7, 2016

1 of 4 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
coverage/coveralls Coverage decreased (-0.007%) to 96.103%
Details
code-quality/landscape Landscape is checking code quality
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@chrisittner chrisittner deleted the chrisittner:refactor_MLE branch Jun 8, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment