# A Proof of Bayes' Theorem

In this first chapter, we prove the famous Bayes' theorem:

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}
$$

We use the Penguins dataset to illustrate the proof and to check each step numerically.

In [23]:
import pandas as pd
import numpy as np

In [24]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
df = df.dropna()
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |


In [25]:
df.shape

(333, 7)

In [26]:
set(df.species), set(df.island), set(df.sex)

({'Adelie', 'Chinstrap', 'Gentoo'},
 {'Biscoe', 'Dream', 'Torgersen'},
 {'FEMALE', 'MALE'})

We see that there are three penguin species ('Adelie', 'Chinstrap', 'Gentoo') and two sexes ('FEMALE', 'MALE'). From here on we use only these two variables, `species` and `sex`.

We define event A as "the penguin is FEMALE" and event B as "the penguin is Adelie".

## A & B

Let us first look at $P(A \& B)$:

In [27]:
def prob_sex_and_species(df, sex_str, species_str):
    """Joint probability: fraction of all rows with the given sex AND species."""
    subset = df[(df['sex'] == sex_str) & (df['species'] == species_str)]
    return len(subset) / len(df)

In [28]:
prob_sex_and_species(df, sex_str='FEMALE', species_str='Adelie')

0.21921921921921922

<!-- An alternative approach:

## Fraction of Adelie species
adelie = (df['species'] == 'Adelie')
adelie.head()

## Adelie and Female
female = (df['sex'] == 'FEMALE')
female.head()

def prob(A):
    '''probability of A'''
    '''Input: a series of True and False values'''
    return A.mean()

prob(female & adelie)

female[adelie].head()

prob(female[adelie])

`female` holds the result of `sex == "FEMALE"` (True or False) for every row of `df`. `female[adelie]` is still the result of `sex == "FEMALE"`, but only for the rows of `df` where `species == 'Adelie'` is True, not for all rows.

prob(female[adelie]) == prob(female & adelie)/prob(adelie)

prob(female & adelie) == prob(adelie & female)
-->

## A|B

We compute $P(A|B)$, that is, P(FEMALE|Adelie), directly:

In [29]:
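def prob_sex_given_species(df, sex_str, species_str):
    """Conditional probability: fraction of the given species that has the given sex."""
    # Restrict to the conditioning event (species) first, then count within it.
    species_subset = df[df.species == species_str]
    sex_subset_within_species_subset = species_subset[species_subset.sex == sex_str]
    return len(sex_subset_within_species_subset) / len(species_subset)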

In [30]:
prob_sex_given_species(df, 'FEMALE', 'Adelie')

0.5

## 贝叶斯定理

First, we observe that

$$P(A|B) = \frac{P(A\&B)}{P(B)}
$$

In [31]:
def prob_species(df, species_str):
    """Marginal probability: fraction of all rows belonging to the given species."""
    subset = df[df.species == species_str]
    return len(subset) / len(df)

In [32]:
prob_species(df, 'Adelie')

0.43843843843843844

In [33]:
prob_sex_given_species(
    df, 'FEMALE', 'Adelie') == prob_sex_and_species(
    df, 'FEMALE', 'Adelie')/prob_species(df, 'Adelie')

True
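The `==` comparison above returns `True` for these particular numbers, but comparing floating-point ratios with `==` can fail in general because of rounding. A minimal sketch of two more robust checks follows. It assumes the counts 73 (female Adelie), 146 (Adelie) and 333 (all rows), which are back-computed from the outputs shown above rather than read from the data; the variable names are illustrative only.

```python
from fractions import Fraction
import math

# Assumed counts, back-computed from the printed outputs above:
# 73 / 333 ~ 0.2192 (P(A & B)), 146 / 333 ~ 0.4384 (P(B)), 73 / 146 = 0.5 (P(A|B)).
n_total = 333
n_adelie = 146
n_female_adelie = 73

# Exact rational arithmetic: no rounding error at all.
p_a_and_b = Fraction(n_female_adelie, n_total)
p_b = Fraction(n_adelie, n_total)
p_a_given_b = Fraction(n_female_adelie, n_adelie)
exact_match = (p_a_given_b == p_a_and_b / p_b)

# Floating-point arithmetic: compare with a tolerance instead of ==.
float_match = math.isclose(float(p_a_given_b), float(p_a_and_b) / float(p_b))

print(exact_match, float_match)
```

With `Fraction`, the identity $P(A|B) = P(A\&B)/P(B)$ holds exactly; with floats, `math.isclose` is the safer comparison.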

We also know that

$$P(A\&B) = P(B\&A)$$

This needs no separate proof: $A \& B$ and $B \& A$ describe the same set of penguins.

Rearranging the definition of conditional probability, we get

$$P(A\&B) = P(A|B) P(B)$$

Swapping the roles of $A$ and $B$ gives

$$P(B\&A) = P(B|A) P(A)$$

Therefore

$$
P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(B\&A)}{P(B)} = \frac{ P(A) P(B|A) }{P(B)}
$$

This completes the proof.
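The notebook checks $P(A|B) = P(A\&B)/P(B)$ numerically but never evaluates the right-hand side of Bayes' theorem itself, which also needs $P(A)$ and $P(B|A)$. The sketch below does that on a small hypothetical table of ten (sex, species) records. The records, the helper `prob` and the predicate names are made up for illustration and are not the penguins data.

```python
from fractions import Fraction

# Hypothetical toy records (sex, species); NOT the penguins dataset.
records = [
    ("FEMALE", "Adelie"), ("FEMALE", "Adelie"), ("MALE", "Adelie"), ("MALE", "Adelie"),
    ("FEMALE", "Gentoo"), ("MALE", "Gentoo"), ("MALE", "Gentoo"),
    ("FEMALE", "Chinstrap"), ("FEMALE", "Chinstrap"), ("MALE", "Chinstrap"),
]

def prob(rows, predicate):
    """Exact fraction of rows for which predicate(row) is true."""
    return Fraction(sum(1 for row in rows if predicate(row)), len(rows))

def is_female(row):   # event A
    return row[0] == "FEMALE"

def is_adelie(row):   # event B
    return row[1] == "Adelie"

p_a = prob(records, is_female)                                       # P(A)
p_b = prob(records, is_adelie)                                       # P(B)
p_a_given_b = prob([r for r in records if is_adelie(r)], is_female)  # P(A|B)
p_b_given_a = prob([r for r in records if is_female(r)], is_adelie)  # P(B|A)

# Right-hand side of Bayes' theorem: P(A) P(B|A) / P(B)
bayes_rhs = p_a * p_b_given_a / p_b

print(p_a_given_b, bayes_rhs, p_a_given_b == bayes_rhs)
```

The same check can be run on `df` by reusing the functions defined earlier plus analogous ones for $P(A)$ and $P(B|A)$; exact fractions are used here only so that `==` is a safe comparison.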