# Week 03 (Wednesday), AST 8581 / PHYS 8581 / CSCI 8581: Big Data in Astrophysics

### Michael Coughlin <cough052@umn.edu>, Michael Steinbach <stei0062@umn.edu>, Nico Adams <adams900@umn.edu>

---

### Conditional probability is the foundation of Bayes' theorem and Bayesian statistics

Given two random variables, $X$ and $Y$, we can define the conditional probability of $Y$ given $X$, and of $X$ given $Y$, as follows:

$P(Y|X) = \frac{P(X,Y)}{P(X)}$ and $P(X|Y) = \frac{P(X,Y)}{P(Y)}$

The derivation of Bayes theorem follows immediately from these two equations.

$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$
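
To spell out the intermediate step: both definitions of conditional probability can be solved for the joint probability $P(X,Y)$, and equating the two expressions gives the theorem (assuming $P(X) > 0$):

$$P(X,Y) = P(Y|X)P(X) = P(X|Y)P(Y) \quad\Longrightarrow\quad P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$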


$P(Y)$ is the *prior probability* of $Y$, belief in $Y$ before knowing $X$  
$P(Y|X)$ is the *posterior probability*, belief in $Y$ after knowing $X$  
$P(X|Y)$ is the *likelihood*, i.e., the probability of $X$ (the data) given $Y$  
$P(X)$ is the marginal probability of $X$, sometimes called the *evidence* in the Bayesian approach 

A couple of practical points. $P(X)$ is interpreted as $P(X=x)$, where the $x$ is omitted for convenience.   
Also, note that $P(X=x) = \sum_{y \in Y} P(X=x,Y=y)$, and likewise, $P(Y=y) = \sum_{x \in X} P(Y=y,X=x)$    
This is known as marginalization, since we are summing across a row or column of the probability table to get the marginal sums. 

Let's see how this works for a simple example in which we have a discrete distribution and a table of the joint probabilities.

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
from pandas import Series, DataFrame
import scipy.stats

In [3]:
# Let's randomly generate the joint distribution, P(X,Y), of X and Y
# X has values 0 to 4, and Y has values 0 to 3. 
np.random.seed(123)
pxy = np.random.rand(5,4)
pxy /= pxy.sum()

# We also need the marginal distributions, i.e., P(X) and P(Y)
# For each value of X=x the P(X) is obtained by summing across all values of Y
# and similarly for Y. 
px = pxy.sum(axis = 1)
py = pxy.sum(axis = 0)

print(pxy,'\n')
print(px,'\n')
print(py)

[[0.07277757 0.02990014 0.02370485 0.05760965]
 [0.07518093 0.04421252 0.10248498 0.0715613 ]
 [0.05025499 0.04097433 0.0358604  0.07618207]
 [0.04582862 0.00623604 0.04159364 0.07711685]
 [0.01906948 0.01833384 0.05554447 0.05557334]] 

[0.1839922  0.29343973 0.20327179 0.17077516 0.14852113] 

[0.26311158 0.13965687 0.25918834 0.33804321]


#### Calculate P(X|Y) and P(Y|X). Note that these are conditional distributions, which we calculate by applying the formula for conditional probability to each combination of X and Y values. No use of Bayes' theorem yet!

In [4]:
# First compute P(Y|X)
py_given_x = np.zeros((5,4))
# First just do a double loop
for i in range(5):
    for j in range(4):
        py_given_x[i, j] = pxy[i,j] / px[i] 
        
print('py_given_x Using loops:\n', py_given_x)

# If we want to use the vectorization power of numpy, we can do this in one line.
# px[:, np.newaxis] has shape (5, 1), so each row i of pxy is divided by px[i].
py_given_x = pxy / px[:, np.newaxis]

print('py_given_x Using numpy:\n', py_given_x)


# Next compute P(X|Y)

px_given_y = np.zeros((5,4))
# First just do a double loop
for i in range(5):
    for j in range(4):
        px_given_y[i, j] = pxy[i,j] / py[j] 
        
print('px_given_y Using loops:\n', px_given_y)

# If we want to use the vectorization power of numpy, we can do this in one line.
# py has shape (4,), so broadcasting divides each column j of pxy by py[j].
px_given_y = pxy / py

print('px_given_y Using numpy:\n', px_given_y)


py_given_x Using loops:
 [[0.39554701 0.16250763 0.12883616 0.3131092 ]
 [0.25620569 0.15066985 0.34925393 0.24387053]
 [0.24723053 0.20157411 0.176416   0.37477935]
 [0.26835647 0.0365161  0.24355794 0.45156949]
 [0.12839573 0.12344261 0.37398366 0.374178  ]]
py_given_x Using numpy:
 [[0.39554701 0.16250763 0.12883616 0.3131092 ]
 [0.25620569 0.15066985 0.34925393 0.24387053]
 [0.24723053 0.20157411 0.176416   0.37477935]
 [0.26835647 0.0365161  0.24355794 0.45156949]
 [0.12839573 0.12344261 0.37398366 0.374178  ]]
px_given_y Using loops:
 [[0.27660343 0.21409715 0.091458   0.17042097]
 [0.28573782 0.31657964 0.39540737 0.21169277]
 [0.19100258 0.29339288 0.13835651 0.22536193]
 [0.1741794  0.04465261 0.16047653 0.2281272 ]
 [0.07247677 0.13127773 0.21430159 0.16439714]]
px_given_y Using numpy:
 [[0.27660343 0.21409715 0.091458   0.17042097]
 [0.28573782 0.31657964 0.39540737 0.21169277]
 [0.19100258 0.29339288 0.13835651 0.22536193]
 [0.1741794  0.04465261 0.16047653 0.2281272 ]
 [0.07247677 0.13127773 0.21430159 0.16439714]]
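
As a quick consistency check, the two conditional tables can be tied together with Bayes' theorem. The sketch below is not part of the original notebook run: it rebuilds the same arrays from the same seed so that it is self-contained, and the names `py_given_x_bayes`, `bayes_holds`, `rows_sum_to_one`, and `cols_sum_to_one` are introduced here for illustration.

```python
import numpy as np

# Rebuild the joint distribution and its marginals exactly as above
np.random.seed(123)
pxy = np.random.rand(5, 4)
pxy /= pxy.sum()
px = pxy.sum(axis=1)   # P(X), one entry per row
py = pxy.sum(axis=0)   # P(Y), one entry per column

# Conditional distributions from the definition of conditional probability
py_given_x = pxy / px[:, np.newaxis]   # each row divided by P(X=x)
px_given_y = pxy / py                  # each column divided by P(Y=y)

# Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
py_given_x_bayes = px_given_y * py / px[:, np.newaxis]
bayes_holds = np.allclose(py_given_x, py_given_x_bayes)

# Each conditional distribution must sum to 1 over the variable left of the bar
rows_sum_to_one = np.allclose(py_given_x.sum(axis=1), 1.0)
cols_sum_to_one = np.allclose(px_given_y.sum(axis=0), 1.0)

print('Bayes theorem reproduces P(Y|X):', bayes_holds)
print('Rows of P(Y|X) sum to 1:', rows_sum_to_one)
print('Columns of P(X|Y) sum to 1:', cols_sum_to_one)
```

Note the direction of normalization: $P(Y|X)$ sums to 1 along each row (fixed $x$), while $P(X|Y)$ sums to 1 down each column (fixed $y$).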

Now we will show how to create a probability and conditional probability table from a table of data.  
We will use that table to perform classification using the naive Bayes technique and Bayes' theorem. 

We will compute the probability of each class using Bayes theorem, i.e.,  
$$P(Y|X_1, X_2, ..., X_d) = \frac{P(X_1, X_2, ..., X_d|Y)P(Y)}{P(X_1, X_2, ..., X_d)}$$

Then, we will take the naive Bayes approach, which assumes the predictor variables, $X_1, X_2, ..., X_d$, are conditionally independent of one another given the class variable, $Y$.  

Two variables, $X_i$ and $X_j$, are conditionally independent given $Y$ if $P(X_i, X_j|Y) = P(X_i|Y)P(X_j|Y)$. Thus, 

$$P(Y|X_1, X_2, ..., X_d) = \frac{P(X_1|Y)P(X_2|Y)...P(X_d|Y)P(Y)}{P(X_1, X_2, ..., X_d)}$$

Note that for classification, $P(X_1, X_2, ..., X_d)$ is the same for all classes and thus can be ignored.
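
Written as a decision rule, this amounts to picking the class with the largest numerator. The symbol $\hat{y}$ is notation introduced here for the predicted class:

$$\hat{y} = \arg\max_{y} \; P(Y=y)\prod_{i=1}^{d} P(X_i|Y=y)$$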


In [None]:
# Define a sample data table, where Refund, Marital Status, and Taxable Income are predictor variables,   
# while Evaded Taxes (Yes or No) is the class variable. 

In [5]:
from IPython.display import display
# Use a name other than `dict` so the Python builtin is not shadowed
data_dict = {'Refund': ['Yes','No','No','Yes','No','No','Yes','No','No','No'],
        'Marital Status':['Single','Married','Single','Married','Divorced','Married','Divorced','Single','Married','Single'],
        'Taxable Income':[125,100,70,120,95,60,220,85,75,90],
        'Evaded Taxes':['No','No','No','No','Yes', 'No','No','Yes', 'No','Yes']}
data = DataFrame(data_dict)
display(data)

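| | Refund | Marital Status | Taxable Income | Evaded Taxes |
|---|---|---|---|---|
| 0 | Yes | Single | 125 | No |
| 1 | No | Married | 100 | No |
| 2 | No | Single | 70 | No |
| 3 | Yes | Married | 120 | No |
| 4 | No | Divorced | 95 | Yes |
| 5 | No | Married | 60 | No |
| 6 | Yes | Divorced | 220 | No |
| 7 | No | Single | 85 | Yes |
| 8 | No | Married | 75 | No |
| 9 | No | Single | 90 | Yes |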


Evaded Taxes is our class variable. From the data, we can compute the conditional probabilities needed to use Bayes' theorem and naive Bayes.

In [14]:
# Evaded Taxes is our class variable. 
# From the data, we can compute the conditional probabilities needed to use Bayes Theorem and naive Bayes

refund_evaded = pd.crosstab(data.Refund, data['Evaded Taxes'],margins=True, normalize='all')
display(refund_evaded)

refund_given_evaded = pd.crosstab(data.Refund, data['Evaded Taxes'],margins=True,normalize='columns')
display(refund_given_evaded)

m_status_evaded = pd.crosstab(data['Marital Status'], data['Evaded Taxes'],margins=True,normalize='all')
display(m_status_evaded)

m_status__given_evaded = pd.crosstab(data['Marital Status'], data['Evaded Taxes'],normalize='columns')
display(m_status__given_evaded)

evaded_no = data['Evaded Taxes'] == 'No'
p_evaded_no = sum(evaded_no)/ len(data)
evaded_yes = data['Evaded Taxes'] == 'Yes'
p_evaded_yes = sum(evaded_yes)/ len(data)
print('P(Evaded Taxes = No) = %5.2f   P(Evaded Taxes = Yes) = %5.2f )' %(p_evaded_no, p_evaded_yes))

mean_income_no = data['Taxable Income'][evaded_no].mean()
std_income_no = data['Taxable Income'][evaded_no].std()

mean_income_yes = data['Taxable Income'][evaded_yes].mean()
std_income_yes = data['Taxable Income'][evaded_yes].std()

print("Evaded=No  income mean = %5.3f, income std = %5.3f" %(mean_income_no, std_income_no))
print("Evaded=Yes income mean = %5.3f, income std = %5.3f" %(mean_income_yes, std_income_yes))

| Refund | Evaded Taxes = No | Evaded Taxes = Yes | All |
|---|---|---|---|
| No | 0.4 | 0.3 | 0.7 |
| Yes | 0.3 | 0.0 | 0.3 |
| All | 0.7 | 0.3 | 1.0 |


| Refund | Evaded Taxes = No | Evaded Taxes = Yes | All |
|---|---|---|---|
| No | 0.571429 | 1.0 | 0.7 |
| Yes | 0.428571 | 0.0 | 0.3 |


| Marital Status | Evaded Taxes = No | Evaded Taxes = Yes | All |
|---|---|---|---|
| Divorced | 0.1 | 0.1 | 0.2 |
| Married | 0.4 | 0.0 | 0.4 |
| Single | 0.2 | 0.2 | 0.4 |
| All | 0.7 | 0.3 | 1.0 |


| Marital Status | Evaded Taxes = No | Evaded Taxes = Yes |
|---|---|---|
| Divorced | 0.142857 | 0.333333 |
| Married | 0.571429 | 0.0 |
| Single | 0.285714 | 0.666667 |


P(Evaded Taxes = No) =  0.70   P(Evaded Taxes = Yes) =  0.30 )
Evaded=No  income mean = 110.000, income std = 54.544
Evaded=Yes income mean = 90.000, income std = 5.000


Using naive Bayes, classify a test record: $X =$ (Refund=No, Marital Status=Divorced, Taxable Income=120,000, entered as 120 in the units of the table) by computing  
P(Evaded Taxes = Yes| $X$) and P(Evaded Taxes = No| $X$).  
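
For the continuous predictor, the calculation assumes a Gaussian class-conditional density, with a mean and standard deviation estimated separately for each class. The symbols $x$, $\mu_y$, and $\sigma_y$ are notation introduced here for the income value and the per-class estimates:

$$P(\text{Taxable Income}=x|Y=y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$$

Strictly speaking this is a density rather than a probability, but it plays the role of the likelihood term when the classes are compared. With the estimates printed above, $\mu_{\text{No}} = 110$, $\sigma_{\text{No}} = 54.544$, $\mu_{\text{Yes}} = 90$, and $\sigma_{\text{Yes}} = 5$.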


In [41]:
# Since P(X) is the same when computing the P(Evaded Taxes = Yes|X) or P(Evaded Taxes = No|X),   
# we only need to compute   
#P(Refund=No|Yes)P(Marital Status=Divorced |Yes)P(Taxable Income=120,000|Yes)P(Evaded Taxes = Yes) and
#P(Refund=No|No)P(Marital Status=Divorced |No)P(Taxable Income=120,000|No)P(Evaded Taxes = No)
# We do that using the probabilities computed above. 

# Assuming Gaussian distributions, we compute the densities for the income 
# under 'Evaded Taxes=No' and 'Evaded Taxes=Yes'
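# Use the per-class means and standard deviations computed above instead of hard-coded values
p_inc_given_no = scipy.stats.norm.pdf(120, loc=mean_income_no, scale=std_income_no)
p_inc_given_yes = scipy.stats.norm.pdf(120, loc=mean_income_yes, scale=std_income_yes)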
print('p_inc_given_no = %5.4f   p_inc_given_yes = %5.1e )' %(p_inc_given_no, p_inc_given_yes))

# Next compute the probability of divorce under 'Evaded Taxes=No' and 'Evaded Taxes=Yes'
p_div_given_no = m_status__given_evaded.loc['Divorced','No']
p_div_given_yes = m_status__given_evaded.loc['Divorced','Yes']
print('p_div_given_no = %5.4f   p_div_given_yes = %5.4f )' %(p_div_given_no, p_div_given_yes))

# Then compute the probability of refund=no under 'Evaded Taxes=No' and 'Evaded Taxes=Yes'
p_no_refund_given_no  = refund_given_evaded.loc['No','No']
p_no_refund_given_yes = refund_given_evaded.loc['No','Yes']
print('p_no_refund_given_no = %5.4f   p_no_refund_given_yes = %5.4f )' %(p_no_refund_given_no, p_no_refund_given_yes))

# Now we can compute the posterior probability of 'Evaded Taxes=No' and 'Evaded Taxes=Yes' (ignoring the denominator)
p_X_given_no = p_inc_given_no * p_div_given_no * p_no_refund_given_no 
p_X_given_yes = p_inc_given_yes * p_div_given_yes * p_no_refund_given_yes 
print('p_X_given_no = %5.6f   p_X_given_yes = %5.6e )' %(p_X_given_no, p_X_given_yes))

p_evaded_no_given_data =  p_X_given_no * p_evaded_no
p_evaded_yes_given_data = p_X_given_yes * p_evaded_yes
print('P(Evaded Taxes=No|X) = %5.6f   P(Evaded Taxes = Yes|X) = %5.2e )' %(p_evaded_no_given_data, p_evaded_yes_given_data))

p_inc_given_no = 0.0072   p_inc_given_yes = 1.2e-09 )
p_div_given_no = 0.1429   p_div_given_yes = 0.3333 )
p_no_refund_given_no = 0.5714   p_no_refund_given_yes = 1.0000 )
p_X_given_no = 0.000587   p_X_given_yes = 4.050589e-10 )
P(Evaded Taxes=No|X) = 0.000411   P(Evaded Taxes = Yes|X) = 1.22e-10 )


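Both values are unnormalized, since the common denominator $P(X)$ was dropped, but only their relative size matters: the value for No is larger by more than six orders of magnitude. Therefore, we choose Evaded Taxes = No as the class.  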
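
If actual posterior probabilities are wanted rather than just the winning class, the two scores can be divided by their sum, which plays the role of $P(X)$. The following is a minimal sketch, not part of the original notebook: it hard-codes the estimates printed above, and the helper `gaussian_pdf` and the dictionary names are introduced here for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Estimates copied from the tables and summary statistics printed above
prior = {'No': 0.7, 'Yes': 0.3}                  # P(Evaded Taxes)
p_refund_no = {'No': 4 / 7, 'Yes': 1.0}          # P(Refund=No | class)
p_divorced = {'No': 1 / 7, 'Yes': 1 / 3}         # P(Divorced | class)
income_stats = {'No': (110.0, 54.544), 'Yes': (90.0, 5.0)}  # (mean, std)

# Unnormalized scores: likelihood terms times the prior, per class
scores = {}
for label in prior:
    mean, std = income_stats[label]
    likelihood = gaussian_pdf(120, mean, std) * p_divorced[label] * p_refund_no[label]
    scores[label] = likelihood * prior[label]

# Dividing by the sum over classes restores the dropped denominator P(X)
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}

for label in posterior:
    print('P(Evaded Taxes=%s|X) = %.8f' % (label, posterior[label]))
```

Normalizing does not change which class wins; it only rescales the scores so that they sum to 1.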
In conclusion, note that scikit-learn has a naive Bayes classifier. https://scikit-learn.org/stable/modules/naive_bayes.html