## The Bayesian Theorem
- In contrast, P(H) is the prior probability of H for any sample, regardless of how the data in the sample looks.
- The posterior probability P(H|X) is based on more information than the prior probability P(H).
- The Bayesian Theorem provides a way of calculating the posterior probability P(H|X) using probabilities P(H), P(X), and P(X|H).
- The basic relation is:
$$ P(H|X) = \frac{P(X|H)P(H)}{P(X)} $$
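
As a quick numeric check of the relation above, here is a minimal sketch. The probabilities are hypothetical values chosen only for illustration, and the helper name `posterior` is an assumption, not part of the original notebook. P(X) is obtained from the law of total probability over H and not-H.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Return P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical numbers, for illustration only
p_h = 0.3              # prior P(H)
p_x_given_h = 0.8      # likelihood P(X|H)
p_x_given_not_h = 0.2  # likelihood P(X|not H)

# Law of total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))  # 0.6316
```

Observing X raises the belief in H from the prior 0.3 to roughly 0.63, because X is four times more likely under H than under its complement.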

- Suppose now that there is a set of m samples S = {S1, S2, …, Sm} (the training data set), where every sample Si is represented as an n-dimensional vector {x1, x2, …, xn}.
- Values xi correspond to attributes A1, A2, …, An, respectively.
- Also, there are k classes C1, C2, …, Ck, and every sample belongs to one of these classes.
- Given an additional data sample X (its class is unknown), it is possible to predict the class for X as the one with the highest conditional probability P(Ci|X), where i = 1, …, k.
- That is the basic idea of the Naïve Bayesian classifier. These probabilities are computed using Bayes' theorem:
$$ P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)} $$
- As P(X) is constant for all classes, only the product P(X|Ci) · P(Ci) needs to be maximized. The prior probability of each class is estimated as
- P(Ci) = (number of training samples of class Ci) / m, where m is the total number of training samples.
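
The prior estimate and the maximization step can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of class labels and hypothetical likelihood values P(X|Ci); the function names `class_priors` and `predict` are not from the original notebook.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(Ci) as (samples of class Ci) / (total samples m)."""
    m = len(labels)
    return {c: n / m for c, n in Counter(labels).items()}


def predict(priors, likelihoods):
    """Pick the class maximizing P(X|Ci) * P(Ci); P(X) is the same for all classes."""
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get), scores


# Hypothetical training labels (3 x H, 2 x M, 5 x L)
labels = ["H", "L", "L", "M", "H", "L", "M", "L", "H", "L"]
priors = class_priors(labels)

# Hypothetical likelihoods P(X|Ci) for one unseen sample X
likelihoods = {"H": 0.10, "M": 0.30, "L": 0.05}

best, scores = predict(priors, likelihoods)
print(priors)
print(best)  # M
```

Class L has the largest prior here, yet M wins because its likelihood more than compensates; the decision depends on the product, not on either factor alone.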


- Because computing P(X|Ci) directly is extremely expensive, especially for large data sets, the naïve assumption of conditional independence between attributes is made.
- Using this assumption, we can express P(X|Ci) as a product:
$$ P(X|C_i) = \prod_{t=1}^n P(x_t|C_i) $$
- where x_t are the values of the attributes in the sample X. The probabilities P(x_t|Ci) can be estimated from the training data set.
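
For categorical attributes, each factor of the product can be estimated as a relative frequency within the class: the number of class-Ci samples with that attribute value, divided by the number of class-Ci samples. Below is a minimal sketch under that assumption; the toy data and the names `fit` and `likelihood` are hypothetical, and no smoothing is applied, so an unseen value yields a factor of 0.

```python
from collections import Counter, defaultdict
from math import prod


def fit(samples, labels):
    """Count classes and, per class, how often each attribute takes each value."""
    class_counts = Counter(labels)
    # value_counts[c][t][v] = number of class-c samples whose attribute t equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for t, v in enumerate(x):
            value_counts[c][t][v] += 1
    return class_counts, value_counts


def likelihood(x, c, class_counts, value_counts):
    """P(X|c) as the product over attributes of P(x_t|c), each a within-class frequency."""
    return prod(value_counts[c][t][v] / class_counts[c] for t, v in enumerate(x))


# Hypothetical toy data: (gender, age group) -> income class
samples = [("F", "young"), ("F", "old"), ("M", "young"),
           ("M", "young"), ("M", "old"), ("F", "young")]
labels = ["H", "H", "H", "L", "L", "L"]

class_counts, value_counts = fit(samples, labels)
m = len(labels)

x_new = ("M", "young")
scores = {c: likelihood(x_new, c, class_counts, value_counts) * class_counts[c] / m
          for c in class_counts}
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)  # L
```

Note that the denominator of every factor is the class count, which is what makes each ratio an estimate of P(x_t|Ci) as opposed to P(Ci|x_t).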

In [2]:
import pandas as pd

df = pd.read_csv('data_reduced.csv')
df.head()

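| Unnamed: 0 | name | age | height | weight | genders | income |
|---|---|---|---|---|---|---|
| 0 | Ethan | 87.0 | 316 | 26.0 | F | H |
| 1 | Alexander | 53.65 | 154 | 140.0 | M | L |
| 2 | Henry | 87.0 | 185 | 187.0 | M | L |
| 3 | Xiao | 29.94 | 216 | 129.0 | M | L |
| 4 | Adam | 29.94 | 213 | 90.0 | M | L |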


In [53]:
# Let's build a naive Bayes classification from scratch
# First, we need the prior probability of each income class (H, M, L)

count_of_H = df['income'].value_counts()['H']
count_of_M = df['income'].value_counts()['M']
count_of_L = df['income'].value_counts()['L']

total_count = df['income'].count()

prob_of_H = count_of_H / total_count
prob_of_M = count_of_M / total_count
prob_of_L = count_of_L / total_count

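# Counts per gender. Note: count_of_M is reassigned here; from this point on it
# holds the number of male samples, not the number of medium-income samples.
count_of_F = df['genders'].value_counts()['F']
count_of_M = df['genders'].value_counts()['M']

# Caution: the ratios below divide by the attribute-value count (e.g. count_of_F),
# so they estimate P(income | value). The naive Bayes product P(x_t|Ci) would
# divide by the class count instead (e.g. the number of income == 'H' samples).
p_H_F = str(df[(df['income'] == 'H') & (df['genders'] == 'F')].shape[0]) + "/" + str(count_of_F)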
p_M_F = str(df[(df['income'] == 'M') & (df['genders'] == 'F')].shape[0]) + "/" + str(count_of_F)
p_L_F = str(df[(df['income'] == 'L') & (df['genders'] == 'F')].shape[0]) + "/" + str(count_of_F)

p_H_M = str(df[(df['income'] == 'H') & (df['genders'] == 'M')].shape[0]) + "/" + str(count_of_M)
p_M_M = str(df[(df['income'] == 'M') & (df['genders'] == 'M')].shape[0]) + "/" + str(count_of_M)
p_L_M = str(df[(df['income'] == 'L') & (df['genders'] == 'M')].shape[0]) + "/" + str(count_of_M)

likelihoods_gender_income = {'F': [p_H_F, p_M_F, p_L_F], 'M': [p_H_M, p_M_M, p_L_M]}

count_of_age_87 = df['age'].value_counts()[87.00]
count_of_age_53 = df['age'].value_counts()[53.65]
count_of_age_29 = df['age'].value_counts()[29.94]

p_H_87 = str(df[(df['income'] == 'H') & (df['age'] == 87.00)].shape[0]) + "/" + str(count_of_age_87)
p_M_87 = str(df[(df['income'] == 'M') & (df['age'] == 87.00)].shape[0]) + "/" + str(count_of_age_87)
p_L_87 = str(df[(df['income'] == 'L') & (df['age'] == 87.00)].shape[0]) + "/" + str(count_of_age_87)

p_H_53 = str(df[(df['income'] == 'H') & (df['age'] == 53.65)].shape[0]) + "/" + str(count_of_age_53)
p_M_53 = str(df[(df['income'] == 'M') & (df['age'] == 53.65)].shape[0]) + "/" + str(count_of_age_53)
p_L_53 = str(df[(df['income'] == 'L') & (df['age'] == 53.65)].shape[0]) + "/" + str(count_of_age_53)

p_H_29 = str(df[(df['income'] == 'H') & (df['age'] == 29.94)].shape[0]) + "/" + str(count_of_age_29)
p_M_29 = str(df[(df['income'] == 'M') & (df['age'] == 29.94)].shape[0]) + "/" + str(count_of_age_29)
p_L_29 = str(df[(df['income'] == 'L') & (df['age'] == 29.94)].shape[0]) + "/" + str(count_of_age_29)

likelihoods_age_income = {'87.00': [p_H_87, p_M_87, p_L_87], '53.65': [p_H_53, p_M_53, p_L_53], '29.94': [p_H_29, p_M_29, p_L_29]}

likelihoods_age_income_df = pd.DataFrame(likelihoods_age_income, index=['H', 'M', 'L'])
print("likelihoods age income df:\n", likelihoods_age_income_df)

likelihoods_gender_income_df = pd.DataFrame(likelihoods_gender_income, index=['H', 'M', 'L'])
print("likelihoods gender income df:\n", likelihoods_gender_income_df)

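# weight is continuous, so for each income class we estimate its mean and
# population standard deviation and later plug them into a Gaussian pdf
income_H_weight = df[df['income'] == 'H']['weight'].to_list()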
mean_income_H_weight = round(sum(income_H_weight) / len(income_H_weight), 2)
stdev_income_H_weight = round((sum([(x - mean_income_H_weight) ** 2 for x in income_H_weight]) / len(income_H_weight)) ** 0.5, 2)
print("income H weight mean: ",mean_income_H_weight, " stdev: ", stdev_income_H_weight, " list: ", income_H_weight)

income_M_weight = df[df['income'] == 'M']['weight'].to_list()
mean_income_M_weight = round(sum(income_M_weight) / len(income_M_weight), 2)
stdev_income_M_weight = round((sum([(x - mean_income_M_weight) ** 2 for x in income_M_weight]) / len(income_M_weight)) ** 0.5, 2)
print("income M weight mean: ",mean_income_M_weight, " stdev: ", stdev_income_M_weight, " list: ", income_M_weight)

income_L_weight = df[df['income'] == 'L']['weight'].to_list()
mean_income_L_weight = round(sum(income_L_weight) / len(income_L_weight), 2)
stdev_income_L_weight = round((sum([(x - mean_income_L_weight) ** 2 for x in income_L_weight]) / len(income_L_weight)) ** 0.5, 2)
print("income L weight mean: ",mean_income_L_weight, " stdev: ", stdev_income_L_weight, " list: ", income_L_weight)

import math

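# Normal (Gaussian) probability density function, used as the likelihood of a
# continuous attribute. The result is rounded to 4 decimals, which discards
# precision for small densities.
def pdf(x, mean, stdev):
    exponent = math.exp(-((x-mean)**2 / (2 * stdev**2 )))
    return round((1 / (math.sqrt(2 * math.pi) * stdev)) * exponent, 4)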

# Print the prior probability of each income class
print("prob of H: ", prob_of_H)
print("prob of M: ", prob_of_M)
print("prob of L: ", prob_of_L)

# Sample to classify: name, age, height, weight, gender => guess income
x = ['Ali', 29.94, 163, 62, 'M']

# Intended score for class H: P(age=29.94|H) * P(weight|H) * P(gender=M|H) * P(H)
# (height is not used; weight goes through the Gaussian pdf)
# Convert the "count/total" strings back into numeric ratios
p_H_M = int(p_H_M.split('/')[0]) / count_of_M
p_M_M = int(p_M_M.split('/')[0]) / count_of_M
p_L_M = int(p_L_M.split('/')[0]) / count_of_M
p_H_29 = int(p_H_29.split('/')[0]) / count_of_age_29
p_M_29 = int(p_M_29.split('/')[0]) / count_of_age_29
p_L_29 = int(p_L_29.split('/')[0]) / count_of_age_29
# x[3] is the sample's weight
prob_H_X = round(p_H_29 * pdf(x[3], mean_income_H_weight, stdev_income_H_weight) * p_H_M * prob_of_H, 4)
prob_M_X = round(p_M_29 * pdf(x[3], mean_income_M_weight, stdev_income_M_weight) * p_M_M * prob_of_M, 4)
prob_L_X = round(p_L_29 * pdf(x[3], mean_income_L_weight, stdev_income_L_weight) * p_L_M * prob_of_L, 4)
print("prob_H_X: ", prob_H_X, " prob_M_X: ", prob_M_X, " prob_L_X: ", prob_L_X)

# P(H|X) = P(X|H) * P(H) / P(X)
# P(M|X) = P(X|M) * P(M) / P(X)
# P(L|X) = P(X|L) * P(L) / P(X)
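# P(X) is the sum of the per-class scores. The scores were already rounded to
# 4 decimals, so the normalized posteriors below are coarse.
prob_X = prob_H_X + prob_M_X + prob_L_X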
prob_H_X = round(prob_H_X / prob_X, 4)
prob_M_X = round(prob_M_X / prob_X, 4)
prob_L_X = round(prob_L_X / prob_X, 4)

print("prob_H_X: ", prob_H_X, " prob_M_X: ", prob_M_X, " prob_L_X: ", prob_L_X)

# P(H) = 0.35
# P(M) = 0.3
# P(L) = 0.35




likelihoods age income df:
   87.00 53.65 29.94
H   4/6   1/6   2/8
M   1/6   2/6   3/8
L   1/6   3/6   3/8
likelihoods gender income df:
      F     M
H  4/7  3/13
M  3/7  3/13
L  0/7  7/13
income H weight mean:  86.86  stdev:  62.02  list:  [26.0, 134.0, 118.0, 118.0, 13.0, 18.0, 181.0]
income M weight mean:  69.0  stdev:  68.03  list:  [66.0, 3.0, 133.0, 18.0, 10.0, 184.0]
income L weight mean:  142.29  stdev:  30.58  list:  [140.0, 187.0, 129.0, 90.0, 132.0, 181.0, 137.0]
prob of H:  0.35
prob of M:  0.3
prob of L:  0.35
prob_H_X:  0.0001  prob_M_X:  0.0002  prob_L_X:  0.0
prob_H_X:  0.3333  prob_M_X:  0.6667  prob_L_X:  0.0
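
The posterior of exactly 0.0 for class L in the last line is a side effect of rounding the per-class scores to 4 decimals before normalizing. The following is a minimal sketch of the same normalization without intermediate rounding. It assumes the ratios, priors, and weight statistics exactly as printed in the output above (copied as-is, not re-derived from the data), and the name `gaussian_pdf` is hypothetical.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Unrounded normal probability density at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


# Values copied from the printed output above
priors = {"H": 0.35, "M": 0.30, "L": 0.35}
age_ratio = {"H": 2 / 8, "M": 3 / 8, "L": 3 / 8}          # column 29.94
gender_ratio = {"H": 3 / 13, "M": 3 / 13, "L": 7 / 13}    # column M
weight_stats = {"H": (86.86, 62.02), "M": (69.0, 68.03), "L": (142.29, 30.58)}

weight = 62  # weight of the sample being classified

# Per-class score, kept at full floating-point precision
joint = {
    c: age_ratio[c] * gaussian_pdf(weight, *weight_stats[c]) * gender_ratio[c] * priors[c]
    for c in priors
}

# Normalize so the posteriors sum to 1; round only for display
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in joint}

for c, p in posterior.items():
    print(c, round(p, 4))
```

With this ordering the predicted class is still M, but class L keeps a small nonzero posterior instead of being truncated to zero.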
