In this notebook, we show an example of learning the parameters (CPDs) of a Discrete Bayesian Network given the data and the model structure. pgmpy has two main methods for learning the parameters:
1. Maximum Likelihood Estimator (pgmpy.estimators.MaximumLikelihoodEstimator)
2. Bayesian Estimator (pgmpy.estimators.BayesianEstimator)

In the examples, we generate data from a given model and then try to learn the model parameters back from the generated data.

### Step 1: Generate some data

In [1]:
# Use the alarm model to generate data from it.

from pgmpy.utils import get_example_model
from pgmpy.sampling import BayesianModelSampling

alarm_model = get_example_model('alarm')
samples = BayesianModelSampling(alarm_model).forward_sample(size=int(1e5))
samples.head()

  "Found unknown state name. Trying to switch to using all state names as state numbers"
Generating for node: CVP: 100%|██████████| 37/37 [01:23<00:00,  2.27s/it]         


|   | MINVOLSET | VENTMACH | DISCONNECT | VENTTUBE | INTUBATION | PULMEMBOLUS | SHUNT | PAP | FIO2 | KINKEDTUBE | ... | HRBP | LVFAILURE | HISTORY | HYPOVOLEMIA | STROKEVOLUME | CO | BP | LVEDVOLUME | PCWP | CVP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORMAL | NORMAL | False | LOW | ESOPHAGEAL | False | NORMAL | NORMAL | NORMAL | False | ... | LOW | False | False | True | NORMAL | LOW | LOW | NORMAL | NORMAL | NORMAL |
| 1 | NORMAL | NORMAL | False | LOW | NORMAL | False | NORMAL | NORMAL | NORMAL | False | ... | LOW | False | False | False | NORMAL | NORMAL | LOW | NORMAL | NORMAL | NORMAL |
| 2 | NORMAL | NORMAL | False | LOW | ESOPHAGEAL | False | NORMAL | NORMAL | NORMAL | False | ... | NORMAL | False | False | False | NORMAL | HIGH | HIGH | NORMAL | NORMAL | NORMAL |
| 3 | NORMAL | NORMAL | False | LOW | NORMAL | False | NORMAL | NORMAL | NORMAL | False | ... | HIGH | False | False | False | NORMAL | HIGH | HIGH | NORMAL | NORMAL | NORMAL |
| 4 | NORMAL | NORMAL | False | LOW | NORMAL | False | NORMAL | HIGH | NORMAL | False | ... | HIGH | False | False | False | NORMAL | HIGH | HIGH | NORMAL | NORMAL | NORMAL |
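
To make the sampling step concrete, here is a minimal sketch of forward (ancestral) sampling: each node is drawn after its parents, conditioned on the parent values already drawn. The two-node network `Rain -> WetGrass`, its probabilities, and the helper names are hypothetical and chosen only for illustration; this is not pgmpy's implementation.

```python
import random

# Hypothetical two-node network Rain -> WetGrass (illustration only).
P_RAIN = {"yes": 0.2, "no": 0.8}
P_WET_GIVEN_RAIN = {
    "yes": {"wet": 0.9, "dry": 0.1},
    "no": {"wet": 0.1, "dry": 0.9},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} dictionary."""
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def forward_sample(size, seed=0):
    """Sample parents before children, conditioning each child on its sampled parents."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        rain = draw(P_RAIN, rng)                 # root node: sample from its marginal
        wet = draw(P_WET_GIVEN_RAIN[rain], rng)  # child: sample from the matching CPD column
        rows.append({"Rain": rain, "WetGrass": wet})
    return rows


toy_samples = forward_sample(20000)
rain_freq = sum(r["Rain"] == "yes" for r in toy_samples) / len(toy_samples)
rainy = [r for r in toy_samples if r["Rain"] == "yes"]
wet_given_rain_freq = sum(r["WetGrass"] == "wet" for r in rainy) / len(rainy)
print(rain_freq, wet_given_rain_freq)
```

With enough samples, the empirical frequencies should approach the probabilities used to generate them, which is what makes it possible to learn the parameters back in the following steps.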


### Step 2: Define a model structure

Since we are trying to learn the model parameters back, we reuse the structure of the model that generated the data.

In [2]:
# Defining the Bayesian Model structure

from pgmpy.models import BayesianModel

model_struct = BayesianModel(ebunch=alarm_model.edges())
model_struct.nodes()

NodeView(('HYPOVOLEMIA', 'LVEDVOLUME', 'STROKEVOLUME', 'CVP', 'PCWP', 'LVFAILURE', 'HISTORY', 'CO', 'ERRLOWOUTPUT', 'HRBP', 'ERRCAUTER', 'HREKG', 'HRSAT', 'INSUFFANESTH', 'CATECHOL', 'ANAPHYLAXIS', 'TPR', 'BP', 'KINKEDTUBE', 'PRESS', 'VENTLUNG', 'FIO2', 'PVSAT', 'SAO2', 'PULMEMBOLUS', 'PAP', 'SHUNT', 'INTUBATION', 'MINVOL', 'VENTALV', 'DISCONNECT', 'VENTTUBE', 'MINVOLSET', 'VENTMACH', 'EXPCO2', 'ARTCO2', 'HR'))

### Step 3: Learning the model parameters 

In [3]:
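"""
Background:
Analysing a dataset (different value vectors of the same set of variables) with a Bayesian network means:
starting from the data, find the Bayesian network that is optimal for the data in some sense.
The result is a statistical model of the data, called a Bayesian network model.
"""

"""
1. The concept of maximum likelihood estimation (MLE)
When judging whether a value is a suitable estimate of a parameter, consider how well it fits the data.
The fit of a candidate value is measured by the conditional probability of the data given that value.

The higher the probability, the better the fit.
Given a parameter value, the conditional probability of the data is called the likelihood of the parameter.

If the data are held fixed and the parameter varies over its domain, L is a function of the parameter, called the likelihood function.
The maximum likelihood estimate (MLE) of the parameter is the value at which L reaches its maximum.
The concept extends to multiple parameters and multiple variables.

For example:
data = pd.DataFrame(data={'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]})
model = BayesianModel([('A', 'C'), ('B', 'C')])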

        >>> cpd_A = MaximumLikelihoodEstimator(model, data).estimate_cpd('A')
        >>> print(cpd_A)
        ╒══════╤══════════╕
        │ A(0) │ 0.666667 │
        ├──────┼──────────┤
        │ A(1) │ 0.333333 │
        ╘══════╧══════════╛
        >>> cpd_C = MaximumLikelihoodEstimator(model, data).estimate_cpd('C')
        >>> print(cpd_C)
        ╒══════╤══════╤══════╤══════╤══════╕
        │ A    │ A(0) │ A(0) │ A(1) │ A(1) │
        ├──────┼──────┼──────┼──────┼──────┤
        │ B    │ B(0) │ B(1) │ B(0) │ B(1) │
        ├──────┼──────┼──────┼──────┼──────┤
        │ C(0) │ 0.0  │ 0.0  │ 1.0  │ 0.5  │
        ├──────┼──────┼──────┼──────┼──────┤
        │ C(1) │ 1.0  │ 1.0  │ 0.0  │ 0.5  │
        ╘══════╧══════╧══════╧══════╧══════╛
"""

# Fitting the model using Maximum Likelihood Estimator
from pgmpy.estimators import MaximumLikelihoodEstimator

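"""
MaximumLikelihoodEstimator inherits from ParameterEstimator and initializes it via super().
ParameterEstimator inherits from BaseEstimator and initializes it via super().

In BaseEstimator, if the data passed in is not empty, the following happens:
1. Store whether only complete samples are used (no argument is passed here, so the default, complete, applies).
2. Set the variables to the values of the column names.
(The code handling an explicit state_names argument is skipped; it is empty in this example.)
3. Collect state_names with the collect_state_names function:
    states = sorted(list(self.data.loc[:, variable].dropna().unique()))
     .loc selects data by label (e.g. the row with label 0: A -1, B -2); here it selects all rows of the column named variable
     .dropna drops the missing (NaN) entries
     .unique removes duplicate values
"""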

mle = MaximumLikelihoodEstimator(model=model_struct, data=samples)


# Estimating the CPD for a single node.
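"""
1. state_counts is obtained from the state_counts function of the BaseEstimator class.
    Purpose: count how often each state of a variable occurs for each configuration of its parents.
    1) Get parents_states (the state_names of every parent) and count the data:
          state_count_data = data.groupby([variable] + parents).size().unstack(parents)
          groups by the variable and its parents, counts rows with size(), and moves the parents into the columns with unstack
    2) Get the rows: row_index = self.state_names[variable]
    3) Get the columns: column_index
       = pd.MultiIndex.from_product(parents_states, names=parents)
       MultiIndex.from_product creates one index entry for every combination of elements of the given lists
    4) state_counts = state_count_data.reindex(
                index=row_index, columns=column_index
            ).fillna(0)
       Reindexing reorders the counts to follow the new index; combinations that never occur in the data are filled with 0

2. Get parents, parents_cardinalities and node_cardinality, then build the CPD:
    cpd = TabularCPD(
            node,
            node_cardinality,
            np.array(state_counts),
            evidence=parents,
            evidence_card=parents_cardinalities,
            state_names={var: self.state_names[var] for var in chain([node], parents)},
        )
    cpd.normalize()  # normalize so that each column sums to 1
    return cpd

"""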
print(mle.estimate_cpd(node='FIO2'))
print(mle.estimate_cpd(node='CVP'))

# Estimating CPDs for all the nodes in the model
"""
Calls the estimate_cpd function above in a loop, once per node.

"""
mle.get_parameters()[:10] # Show just the first 10 CPDs in the output

+--------------+---------+
| FIO2(LOW)    | 0.05005 |
+--------------+---------+
| FIO2(NORMAL) | 0.94995 |
+--------------+---------+
+-------------+----------------------+----------------------+----------------------+
| LVEDVOLUME  | LVEDVOLUME(HIGH)     | LVEDVOLUME(LOW)      | LVEDVOLUME(NORMAL)   |
+-------------+----------------------+----------------------+----------------------+
| CVP(HIGH)   | 0.6989774078478003   | 0.011237205162438807 | 0.00951605298126795  |
+-------------+----------------------+----------------------+----------------------+
| CVP(LOW)    | 0.009845422116527943 | 0.9506008010680908   | 0.038807207052738366 |
+-------------+----------------------+----------------------+----------------------+
| CVP(NORMAL) | 0.29117717003567184  | 0.0381619937694704   | 0.9516767399659937   |
+-------------+----------------------+----------------------+----------------------+


[<TabularCPD representing P(ANAPHYLAXIS:2) at 0x7f14727362e8>,
 <TabularCPD representing P(ARTCO2:3 | VENTALV:4) at 0x7f1471306048>,
 <TabularCPD representing P(BP:3 | CO:3, TPR:3) at 0x7f1471306f28>,
 <TabularCPD representing P(CATECHOL:2 | ARTCO2:3, INSUFFANESTH:2, SAO2:3, TPR:3) at 0x7f1471306a20>,
 <TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f1471306fd0>,
 <TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f1471306630>,
 <TabularCPD representing P(DISCONNECT:2) at 0x7f1471306e48>,
 <TabularCPD representing P(ERRCAUTER:2) at 0x7f1471306dd8>,
 <TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f1471306320>,
 <TabularCPD representing P(EXPCO2:4 | ARTCO2:3, VENTLUNG:4) at 0x7f14712b8160>]
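
The counting procedure described in the annotations above can be sketched without pgmpy. The following is a minimal pure-Python illustration, not the library's code: `mle_cpd` is a hypothetical helper, and the uniform fallback for parent configurations that never occur is an assumption made to match the 0.5/0.5 column in the small `A`, `B`, `C` example.

```python
from collections import Counter
from itertools import product

# The three samples from the small example: A and B are parents of C.
rows = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 0},
]
states = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}


def mle_cpd(rows, node, parents, states):
    """Return {parent_configuration: {state: probability}} from relative frequencies."""
    counts = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
    cpd = {}
    for config in product(*(states[p] for p in parents)):
        column = [counts[(config, s)] for s in states[node]]
        total = sum(column)
        if total == 0:
            # Assumption: an unseen parent configuration gets a uniform column.
            probs = [1 / len(column)] * len(column)
        else:
            probs = [c / total for c in column]
        cpd[config] = dict(zip(states[node], probs))
    return cpd


cpd_a = mle_cpd(rows, "A", [], states)           # no parents: single configuration ()
cpd_c = mle_cpd(rows, "C", ["A", "B"], states)   # one column per (A, B) combination
print(cpd_a)
print(cpd_c)
```

Each column of the result is the observed state counts divided by their sum, so the maximum likelihood estimate is simply the relative frequency of each state within a parent configuration.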

In [4]:
# Verifying that the learned parameters are almost equal to the true ones.
import numpy as np

"""
allclose checks whether two arrays are element-wise equal within a tolerance.
"""
np.allclose(alarm_model.get_cpds('FIO2').values, mle.estimate_cpd('FIO2').values, atol=0.01)

True
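
As a small illustration of what this check does: `np.allclose(a, b)` tests `|a - b| <= atol + rtol * |b|` element-wise, with `rtol` defaulting to `1e-5`. In the sketch below, `learned_cpd` holds the `FIO2` values printed above, while `true_cpd` and `far_cpd` are values assumed for illustration only.

```python
import numpy as np

true_cpd = np.array([0.05, 0.95])           # assumed reference values, for illustration
learned_cpd = np.array([0.05005, 0.94995])  # FIO2 estimate printed by the MLE above
far_cpd = np.array([0.07, 0.93])            # hypothetical estimate that is off by 0.02

atol, rtol = 0.01, 1e-5

# Manual version of the element-wise test performed by np.allclose.
manual_close = bool(np.all(np.abs(true_cpd - learned_cpd) <= atol + rtol * np.abs(learned_cpd)))

close = np.allclose(true_cpd, learned_cpd, atol=atol)
not_close = np.allclose(true_cpd, far_cpd, atol=atol)
print(manual_close, close, not_close)
```

The absolute tolerance `atol=0.01` therefore accepts estimates within roughly one percentage point of the reference probabilities.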

In [5]:
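# Fitting the model using the Bayesian Estimator
from pgmpy.estimators import BayesianEstimator
"""
Some basic concepts:
1. Prior probability p(theta): a guess about the uncertainty in the parameter theta before any evidence is observed.
2. Likelihood function p(x|theta): a function whose argument is the parameter of the statistical model; for a given parameter it is the conditional distribution of the observed evidence x.
   Likelihood vs. probability: probability is a function of the observations for a given parameter value; likelihood describes the parameter for a given observation.
     Probability: given that a coin is fair, what is the probability of ten heads in a row?
     Likelihood: if a coin lands heads in all 10 tosses, how plausible is it that the coin is fair? This plausibility is the likelihood.

3. Posterior probability p(theta|x): the probability of the parameter theta given the evidence x.

p(theta|x) = p(x|theta) * p(theta) / p(x)

4. Conjugate distributions: if the prior p(theta) and the likelihood p(x|theta) are such that the prior p(theta) and the posterior p(theta|x) have the same form,
           the prior and posterior are called conjugate distributions (the prior is a conjugate prior for the likelihood).
           Why it matters: the prior and the posterior have the same form, which 1) matches intuition and
                2) forms a chain of priors: the current posterior can serve as the prior for the next update.

5. Dirichlet distribution: the conjugate prior of the multinomial distribution, normalized to be a probability distribution.
"""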
best = BayesianEstimator(model=model_struct, data=samples)
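"""
Signature:
    def estimate_cpd(
        self, node, prior_type="BDeu", pseudo_counts=[], equivalent_sample_size=5)
    prior_type is one of 'dirichlet', 'BDeu', 'K2'.

        >>> data = pd.DataFrame(data={'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]})
        >>> model = BayesianModel([('A', 'C'), ('B', 'C')])
        >>> estimator = BayesianEstimator(model, data)
        >>> cpd_C = estimator.estimate_cpd('C', prior_type="dirichlet", pseudo_counts=[1, 2])

        The first table shows pseudo counts plus observed counts; the second shows the CPD after normalizing each column.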
        ╒══════╤══════╤══════╤══════╤════════════════════╕
        │ A    │ A(0) │ A(0) │ A(1) │ A(1)               │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ B    │ B(0) │ B(1) │ B(0) │ B(1)               │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ C(0) │ 1    │ 1    │ 1+1  │    1               │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ C(1) │ 2+1  │ 2+1  │ 2    │    2               │
        ╘══════╧══════╧══════╧══════╧════════════════════╛
        ╒══════╤══════╤══════╤══════╤════════════════════╕
        │ A    │ A(0) │ A(0) │ A(1) │ A(1)               │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ B    │ B(0) │ B(1) │ B(0) │ B(1)               │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ C(0) │ 0.25 │ 0.25 │ 0.5  │ 0.3333333333333333 │
        ├──────┼──────┼──────┼──────┼────────────────────┤
        │ C(1) │ 0.75 │ 0.75 │ 0.5  │ 0.6666666666666666 │
        ╘══════╧══════╧══════╧══════╧════════════════════╛
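
        node_cardinality = len(self.state_names[node])
        parents = sorted(self.model.get_parents(node))
        parents_cardinalities = [len(self.state_names[parent]) for parent in parents]

        Columns are the combinations of parent states;
        rows are the states of the variable being estimated.
        cpd_shape = (node_cardinality, np.prod(parents_cardinalities, dtype=int))

        pseudo_counts are the Dirichlet hyperparameters, i.e. the prior knowledge.
        pseudo_counts = np.ones(cpd_shape, dtype=int) * pseudo_counts
        After this multiplication there is a pseudo count for every state of the variable,
        which is then added to the counts observed in the data:
        bayesian_counts = state_counts + pseudo_counts


        The differences between K2, BDeu and Dirichlet are left for the next chapter.
"""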
print(best.estimate_cpd(node='FIO2', prior_type="BDeu", equivalent_sample_size=1000))
# Uniform pseudo count for each state. Can also accept an array of the size of CPD.
print(best.estimate_cpd(node='CVP', prior_type="dirichlet", pseudo_counts=100))

# Learning CPDs for all the nodes in the model. With a Dirichlet prior, a dict of
# pseudo_counts (one entry per node) needs to be provided instead.
best.get_parameters(prior_type="BDeu", equivalent_sample_size=1000)[:10]

+--------------+----------+
| FIO2(LOW)    | 0.054505 |
+--------------+----------+
| FIO2(NORMAL) | 0.945495 |
+--------------+----------+
+-------------+----------------------+----------------------+----------------------+
| LVEDVOLUME  | LVEDVOLUME(HIGH)     | LVEDVOLUME(LOW)      | LVEDVOLUME(NORMAL)   |
+-------------+----------------------+----------------------+----------------------+
| CVP(HIGH)   | 0.6938335287221571   | 0.021640826873385012 | 0.010898174626886907 |
+-------------+----------------------+----------------------+----------------------+
| CVP(LOW)    | 0.014396248534583822 | 0.9306632213608957   | 0.04006430776672784  |
+-------------+----------------------+----------------------+----------------------+
| CVP(NORMAL) | 0.2917702227432591   | 0.04769595176571921  | 0.9490375176063852   |
+-------------+----------------------+----------------------+----------------------+


[<TabularCPD representing P(HYPOVOLEMIA:2) at 0x7f1472736e10>,
 <TabularCPD representing P(LVEDVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f147128b898>,
 <TabularCPD representing P(STROKEVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f147128bb70>,
 <TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f147128bbe0>,
 <TabularCPD representing P(PCWP:3 | LVEDVOLUME:3) at 0x7f147128b9e8>,
 <TabularCPD representing P(LVFAILURE:2) at 0x7f147128bc18>,
 <TabularCPD representing P(HISTORY:2 | LVFAILURE:2) at 0x7f147128bfd0>,
 <TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f147128bef0>,
 <TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f147128b6a0>,
 <TabularCPD representing P(HRBP:3 | ERRLOWOUTPUT:2, HR:3) at 0x7f147128b5f8>]
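
The Bayesian estimates above can be reproduced by hand: pseudo counts are added to the observed counts and each column is normalized. The sketch below is a minimal NumPy illustration, not pgmpy's code. `posterior_mean_cpd` and `bdeu_pseudo_counts` are hypothetical helpers; the BDeu rule used (the equivalent sample size spread evenly over all cells of the CPD) is the standard definition; and the `FIO2` counts are an assumption inferred from the MLE output (0.05005 and 0.94995 of 100000 samples).

```python
import numpy as np


def posterior_mean_cpd(state_counts, pseudo_counts):
    """Add pseudo counts to observed counts and normalize each column to sum to 1."""
    bayesian_counts = np.asarray(state_counts, dtype=float) + np.asarray(pseudo_counts, dtype=float)
    return bayesian_counts / bayesian_counts.sum(axis=0, keepdims=True)


def bdeu_pseudo_counts(equivalent_sample_size, node_cardinality, n_parent_configs):
    """BDeu prior: spread the equivalent sample size evenly over every cell of the CPD."""
    alpha = equivalent_sample_size / (node_cardinality * n_parent_configs)
    return np.full((node_cardinality, n_parent_configs), alpha)


# Small A, B -> C example: rows are C(0), C(1); columns are the (A, B) combinations.
state_counts_c = np.array([[0, 0, 1, 0],
                           [1, 1, 0, 0]])
pseudo_counts_c = np.array([[1, 1, 1, 1],
                            [2, 2, 2, 2]])
cpd_c_bayes = posterior_mean_cpd(state_counts_c, pseudo_counts_c)

# FIO2 has two states and no parents; counts assumed from the MLE output above.
fio2_counts = np.array([[5005], [94995]])
fio2_bdeu = posterior_mean_cpd(fio2_counts, bdeu_pseudo_counts(1000, 2, 1))

print(cpd_c_bayes)
print(fio2_bdeu)
```

With 100000 samples, an equivalent sample size of 1000 shifts `FIO2(LOW)` only slightly toward the uniform value 0.5, which matches the small difference between the MLE and BDeu outputs shown above.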

In [7]:
# Shortcut for learning all the parameters and adding the CPDs to the model.

model_struct = BayesianModel(ebunch=alarm_model.edges())
model_struct.fit(data=samples, estimator=MaximumLikelihoodEstimator)
print(model_struct.get_cpds('FIO2'))

model_struct = BayesianModel(ebunch=alarm_model.edges())
model_struct.fit(data=samples, estimator=BayesianEstimator, prior_type='BDeu', equivalent_sample_size=1000)
print(model_struct.get_cpds('FIO2'))

+--------------+---------+
| FIO2(LOW)    | 0.05005 |
+--------------+---------+
| FIO2(NORMAL) | 0.94995 |
+--------------+---------+
+--------------+----------+
| FIO2(LOW)    | 0.054505 |
+--------------+----------+
| FIO2(NORMAL) | 0.945495 |
+--------------+----------+
