# Probabilities

- **What are mutually exclusive events?**
    - These are events that cannot occur together: if one happens, the other cannot, and vice versa.
    - Example: In a single die throw, you cannot get 1 and 5 at the same time; one outcome prevents the other from happening.

![](venn.png)

- **What are independent events?**
    - These are events where the occurrence of one does not change the probability of the other.
    - Example: If you throw two coins, the outcome of one (heads or tails) does not affect the outcome of the other.

- Intersection: P(A and B) = P(A ∩ B)
- Union: P(A or B) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
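
The definitions above can be checked by counting outcomes. The following is a minimal sketch, assuming a fair die and fair coins so that every outcome is equally likely; the helper `prob` and the event names are illustrative and not part of the original notebook.

```python
from fractions import Fraction
from itertools import product


def prob(event, sample_space):
    """Probability of an event (a set of outcomes), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


# Single die throw: "1" and "5" are mutually exclusive, so their intersection is empty.
die = set(range(1, 7))
one, five = {1}, {5}
p_one_and_five = prob(one & five, die)
p_one_or_five = prob(one | five, die)

# Overlapping events on the same die: the addition rule subtracts the intersection once.
even, greater_than_3 = {2, 4, 6}, {4, 5, 6}
p_union = prob(even | greater_than_3, die)
p_addition_rule = (prob(even, die) + prob(greater_than_3, die)
                   - prob(even & greater_than_3, die))

# Two coin throws: independent events satisfy P(A and B) = P(A) * P(B).
coins = set(product("HT", repeat=2))
first_heads = {o for o in coins if o[0] == "H"}
second_heads = {o for o in coins if o[1] == "H"}
p_both_heads = prob(first_heads & second_heads, coins)
p_product = prob(first_heads, coins) * prob(second_heads, coins)

print(p_one_and_five, p_one_or_five, p_union, p_addition_rule, p_both_heads, p_product)
```

Using `Fraction` keeps the comparisons exact, which avoids floating point noise when checking that both sides of each rule agree.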

### Joint, Conditional and Marginal Probability

![](join.png)

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'male': [240, 200, 100], 
                   'female': [150, 50, 260]}, 
                  index=['cricket', 'football', 'others'])

In [5]:
df

|  | male | female |
|---|---|---|
| cricket | 240 | 150 |
| football | 200 | 50 |
| others | 100 | 260 |


In [13]:
#Create a class that will extract all relevant probability information
#from the original dataset, Joint, Marginal and Conditional Probs

class probDF(object):
    def __init__(self, df):
        """
        Initialize class with dataset, calculate row and column
        totals, then normalize everything dividing by the total
        of totals.
        """
        self.df = df.copy()
        total_col = self.df.sum(axis=1)
        total_row = self.df.sum(axis=0)
        self.df.loc['total'] = total_row
        self.df['total'] = total_col
        self.df.loc['total', 'total'] = total_col.sum()
        self.norm_df = self.df / self.df.loc['total', 'total']
        
    def marginal_prob(self, var, col=True):
        """
        Calculate marginal probability, that is, the probability
        of an event happening irrespective of any other variable.
        P(Male) does not care whether they prefer cricket, football or
        others, as long as they are male.
        """
        if col:
            print(f'marginal prob. for {var}')
            return self.norm_df.loc['total', var]
        else:
            print(f'marginal prob. for {var}')
            return self.norm_df.loc[var, 'total']
        
    def joint_prob(self, row, col):
        """
        Joint probability is the probability of 2 events happening
        at the same time, P(Female and Cricket).
        """
        print(f'joint prob for {row} and {col}')
        return self.norm_df.loc[row, col]
    
    def conditional_prob(self, value, given, given_col=True):
        """
        Conditional probability is the probability that an event occurs
        given that another event has happened. That is, we have prior
        information to take into account when calculating the probability.
        P(Cricket | Male) is the probability that the selected male likes
        cricket, so we dismiss the female part of the dataset.
        """
        print(f'conditional prob of {value} given {given}')
        if given_col:
            return self.norm_df.loc[value, given] / self.norm_df.loc['total', given]
        else:
            return self.norm_df.loc[given, value] / self.norm_df.loc[given, 'total']
        
    def show_df(self):
        return self.df
    
    def show_norm_df(self):
        return self.norm_df

In [8]:
prob_df = probDF(df)

Show the dataset with the totals.

In [9]:
prob_df.show_df()

|  | male | female | total |
|---|---|---|---|
| cricket | 240 | 150 | 390.0 |
| football | 200 | 50 | 250.0 |
| others | 100 | 260 | 360.0 |
| total | 540 | 460 | 1000.0 |


Normalize the dataset, that is, divide every count by the total number of people in it.

In [10]:
prob_df.show_norm_df()

|  | male | female | total |
|---|---|---|---|
| cricket | 0.24 | 0.15 | 0.39 |
| football | 0.2 | 0.05 | 0.25 |
| others | 0.1 | 0.26 | 0.36 |
| total | 0.54 | 0.46 | 1.0 |


Extract the information needed:
- What is the probability that a person likes cricket and is female?

In [16]:
prob_df.joint_prob('cricket', 'female')

joint prob for cricket and female


0.15

- What is the probability that a person likes others and is male?

In [17]:
prob_df.joint_prob('others', 'male')

joint prob for others and male


0.1

- What is the probability that I randomly select a male (marginal prob.) from the dataset?

In [18]:
prob_df.marginal_prob('male')

marginal prob. for male


0.54

- What is the probability that a randomly selected person likes cricket (marginal prob.)?

In [19]:
prob_df.marginal_prob('cricket', False)

marginal prob. for cricket


0.39

- Given that I already know that he is male, what is the probability that he likes cricket?

In [20]:
prob_df.conditional_prob('cricket', 'male')

conditional prob of cricket given male


0.4444444444444444

- Given that I already know that he/she likes cricket, what is the probability that he/she is male?

In [21]:
prob_df.conditional_prob('male', 'cricket', False)

conditional prob of male given cricket


0.6153846153846153

> **NOTE:** Conditional probabilities are not symmetrical: P(Male|Cricket) is not necessarily equal to P(Cricket|Male). Here they are about 0.615 and 0.444, respectively.
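
The asymmetry comes from the denominator: the same joint probability is divided by a different marginal in each case. A minimal sketch of this, using plain pandas operations on the same counts instead of the class above (the variable names are illustrative):

```python
import pandas as pd

counts = pd.DataFrame({'male': [240, 200, 100],
                       'female': [150, 50, 260]},
                      index=['cricket', 'football', 'others'])

# Joint probabilities: every count divided by the grand total.
joint = counts / counts.to_numpy().sum()

# Marginals: sum the joint table over the other variable.
marginal_gender = joint.sum(axis=0)   # P(male), P(female)
marginal_sport = joint.sum(axis=1)    # P(cricket), P(football), P(others)

# P(sport | gender): divide each column by its column marginal.
sport_given_gender = joint.div(marginal_gender, axis=1)
# P(gender | sport): divide each row by its row marginal.
gender_given_sport = joint.div(marginal_sport, axis=0)

p_cricket_given_male = sport_given_gender.loc['cricket', 'male']   # 0.24 / 0.54
p_male_given_cricket = gender_given_sport.loc['cricket', 'male']   # 0.24 / 0.39
print(p_cricket_given_male, p_male_given_cricket)
```

Both results share the numerator P(Cricket and Male) = 0.24; only the conditioning total changes.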

# Bayes Theorem

![](bayes.png)

In this case **Actual Spam** will be our **A** and **Detected Spam** will be our **B**.

- P(A|B) = P(B|A) * P(A) / P(B) reads:
    - Probability of actual spam given that spam was detected = probability of detecting spam given actual spam * probability of actual spam / probability of detecting spam

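Given values:
- prob_spam = 0.03, the prevalence of spam, P(A)
- detection_rate = 0.99, the probability of detecting spam when it is present, P(B|A)
- false_positive = 0.002, the probability of detecting spam when it is not present, P(B|~A)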
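
As a check by hand, the given values can be plugged into the formula by expanding P(B) with the law of total probability. This worked example assumes that `detection_rate` stands for P(B|A) and `false_positive` for P(B|¬A), the chance of flagging a message that is not spam:

$$
\begin{aligned}
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)} \\
&= \frac{0.99 \times 0.03}{0.99 \times 0.03 + 0.002 \times 0.97} \\
&= \frac{0.0297}{0.03164} \approx 0.9387
\end{aligned}
$$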

In [24]:
class bayesDF(object):
    def __init__(self, columns, rows):
        """
        Initialize object, create a dataframe filled with zeros
        to be used to calculate the Bayes Table.
        """
        self.df = pd.DataFrame(np.zeros((len(rows), len(columns))), 
                                    columns=columns, index=rows)
        self.rows = rows
        self.cols = columns
        self.bayes_df = None
        self.flag = False
        
    def ret(self):
        """
        Auxiliary function: return the Bayes table, or warn if it
        has not been populated yet.
        """
        if not self.flag:
            print('Need to populated dataframe first')
        else:
            return self.bayes_df
    
    def populate(self, positive_prob, detection_rate, fp_rate):
        """
        Populate table using starting conditions and information
        """
        self.bayes_df = self.df.copy()
        self.bayes_df.loc[self.rows[-1], self.cols[1]] = positive_prob     #P(A) = Prob. of spam
        self.bayes_df.loc[self.rows[-1], self.cols[0]] = 1 - positive_prob #P(~A) = Prob. of not spam

        self.bayes_df.loc[self.rows[0], self.cols[1]] = positive_prob * detection_rate #Prob. of detecting spam if it is present
        self.bayes_df.loc[self.rows[1], self.cols[1]] = positive_prob - self.bayes_df.loc[self.rows[0], self.cols[1]] #Prob. of not detecting spam if it is present

        self.bayes_df.loc[self.rows[0], self.cols[0]] = (1 - positive_prob) * fp_rate # Prob. of detecting spam if it is not present
        self.bayes_df.loc[self.rows[1], self.cols[0]] = (1 - positive_prob) * (1 - fp_rate) #Prob. of not detecting spam if it is not present

        self.bayes_df.loc[:, self.cols[-1]] = self.bayes_df.sum(axis=1) #Get totals for P(B) and P(~B)
        self.flag = True
        
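    def bayes_rule(self, tl, pred):
        """
        Return P(tl | pred): the joint probability of the true label
        and the prediction divided by the total probability of the
        prediction. Requires populate() to have been called first.
        """
        print(f'P({tl}|{pred}) = P({pred}|{tl}) * P({tl}) / P({pred})')
        return self.bayes_df.loc[pred, tl] / self.bayes_df.loc[pred, self.cols[-1]]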

In [25]:
# TL = True Label or ground truth.
# Columns hold the ground truth and rows hold the predictions.
# The order is important: columns are [negative, positive, total]
# and rows are [predicted positive, predicted negative, total].
columns = ['tl_not_spam', 'tl_spam', 'total']
rows = ['pred_spam', 'pred_not_spam', 'total']

bdf = bayesDF(columns, rows)
bdf.ret()

Need to populated dataframe first


In [27]:
prob_spam = 0.03
detection_rate = 0.99
false_positive = 0.002

bdf.populate(prob_spam, detection_rate, false_positive)
bdf.ret()

|  | tl_not_spam | tl_spam | total |
|---|---|---|---|
| pred_spam | 0.00194 | 0.0297 | 0.03164 |
| pred_not_spam | 0.96806 | 0.0003 | 0.96836 |
| total | 0.97 | 0.03 | 1.0 |


In [28]:
bdf.bayes_rule('tl_spam', 'pred_spam')

P(tl_spam|pred_spam) = P(pred_spam|tl_spam) * P(tl_spam) / P(pred_spam)


0.9386852085967131

In [29]:
bdf.bayes_rule('tl_not_spam', 'pred_spam')

P(tl_not_spam|pred_spam) = P(pred_spam|tl_not_spam) * P(tl_not_spam) / P(pred_spam)


0.06131479140328699

In [30]:
bdf.bayes_rule('tl_spam', 'pred_not_spam')

P(tl_spam|pred_not_spam) = P(pred_not_spam|tl_spam) * P(tl_spam) / P(pred_not_spam)


0.00030980213970011327

In [31]:
bdf.bayes_rule('tl_not_spam', 'pred_not_spam')

P(tl_not_spam|pred_not_spam) = P(pred_not_spam|tl_not_spam) * P(tl_not_spam) / P(pred_not_spam)


0.9996901978602999

In these problems you are typically given:

- The prevalence of the positive class (from which the prevalence of the negative class follows automatically).
- The detection rate or recall, that is, how likely we are to detect the positive class when it is present.
- The false positive rate, that is, how likely we are to classify something as positive when it is not.

From these we can derive all the other parameters.
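
A minimal sketch of that derivation, written as a standalone function rather than through the `bayesDF` class. The function name `derive_rates` and the dictionary keys are illustrative choices, not names from the notebook.

```python
def derive_rates(prevalence, detection_rate, false_positive_rate):
    """Derive the remaining quantities from the three usual inputs."""
    p_pos = prevalence
    p_neg = 1 - prevalence

    # Joint probabilities (the four inner cells of the Bayes table).
    true_pos = p_pos * detection_rate
    false_neg = p_pos * (1 - detection_rate)
    false_pos = p_neg * false_positive_rate
    true_neg = p_neg * (1 - false_positive_rate)

    # Marginals of the prediction: P(B) and P(~B).
    p_detected = true_pos + false_pos
    p_not_detected = false_neg + true_neg

    return {
        'p_negative': p_neg,
        'false_negative_rate': 1 - detection_rate,
        'specificity': 1 - false_positive_rate,
        'p_detected': p_detected,
        'precision': true_pos / p_detected,                        # P(A | B)
        'negative_predictive_value': true_neg / p_not_detected,    # P(~A | ~B)
    }


rates = derive_rates(prevalence=0.03, detection_rate=0.99, false_positive_rate=0.002)
for name, value in rates.items():
    print(f'{name}: {value:.6f}')
```

With the spam values used above, `precision` and `negative_predictive_value` should agree with the results of `bayes_rule('tl_spam', 'pred_spam')` and `bayes_rule('tl_not_spam', 'pred_not_spam')`.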