##### We are given a multivariate classification dataset containing 11314 training and 7532 test documents. Each document is represented as a 2000-dimensional binary vector.
##### Each feature indicates whether a word appears in the corresponding document.

## Import the libraries required.

In [6]:
import numpy as np
import matplotlib.pyplot as plt

### Read the training and test sets

In [7]:
X_train = np.genfromtxt(fname = "20newsgroup_words_train.csv", delimiter = "," ,dtype = int)
X_test = np.genfromtxt(fname = "20newsgroup_words_test.csv", delimiter = "," ,dtype = int)
Y_train = np.genfromtxt(fname = "20newsgroup_labels_train.csv", delimiter = "," ,dtype = int)
Y_test = np.genfromtxt(fname = "20newsgroup_labels_test.csv", delimiter = "," ,dtype = int)


print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(11314, 2000)
(7532, 2000)
(11314,)
(7532,)


### Estimating the prior probabilities

$ \widehat{\Pr}(y = c) = \dfrac{\sum\limits_{i = 1}^{N}\mathbb{1}(y_{i} = c)}{N}$

In [8]:
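np.random.seed(42)
N = Y_train.shape[0]   # number of training documents
K = np.max(Y_train)    # number of classes (labels run from 1 to K)
# fraction of training documents that belong to each class
priors = np.array([np.sum(Y_train == c) / N for c in range(1, K + 1)])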
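
As a small sanity check of the prior estimator, the sketch below applies the same formula to a hypothetical toy label vector (`y_toy` is made up for illustration and is not part of the dataset). It uses `np.bincount` as an alternative to the list comprehension and checks that the estimated priors sum to one.

```python
import numpy as np

# Hypothetical toy labels in {1, 2, 3}; not taken from the newsgroup data.
y_toy = np.array([1, 1, 2, 3, 3, 3])
K_toy = np.max(y_toy)

# Count occurrences of each label; index 0 is dropped because labels start at 1.
counts = np.bincount(y_toy, minlength=K_toy + 1)[1:]
priors_toy = counts / y_toy.shape[0]

print(priors_toy)
print(np.isclose(priors_toy.sum(), 1.0))
```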

### To build a naive Bayes classifier, after finding the priors we also need the likelihood of each attribute under each class. We can then calculate the posterior probability (remember to normalize).

$ \widehat{\Pr}(y = c | x) = \dfrac{\hat{p}(x|y = c)\widehat{\Pr}(y=c)}{\hat{p}(x)}$
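
The normalization by $\hat{p}(x)$ can be carried out stably in log space. The following is a minimal sketch, assuming a hypothetical matrix `log_joint` whose entry $(i, c)$ holds $\log\hat{p}(x_i|y = c) + \log\widehat{\Pr}(y = c)$; the numbers are invented for illustration. Subtracting the row-wise log-sum-exp corresponds to dividing by $\hat{p}(x)$.

```python
import numpy as np

# Hypothetical log joint scores for 2 documents and 3 classes (illustrative values).
log_joint = np.array([[-1050.0, -1052.0, -1060.0],
                      [-980.0, -975.0, -975.0]])

# log p(x) via the log-sum-exp trick: shift by the row maximum to avoid underflow.
row_max = np.max(log_joint, axis=1, keepdims=True)
log_evidence = row_max + np.log(
    np.sum(np.exp(log_joint - row_max), axis=1, keepdims=True)
)

# Posterior probabilities; each row sums to 1.
posteriors = np.exp(log_joint - log_evidence)
print(posteriors.sum(axis=1))
```

Exponentiating `log_joint` directly would underflow to zero for values this negative, which is why the row maximum is subtracted first.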

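Let

$ \pi_{cd} = p(x_{d} = 1 | y = c)$

There are $20 \times 2000$ of these parameters, one for each class-feature pair. With $N_{c}$ the number of training documents in class $c$, the estimate is

$ \pi_{cd} = \dfrac{\sum\limits_{i=1}^{N}\mathbb{1}(y_{i} = c, x_{id} = 1)}{N_{c}}$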


##### However, we also have to add α to the numerator and αD to the denominator to avoid zero probabilities (Laplace smoothing). Here α is set to 0.2.

$ \pi_{cd} = \dfrac{\sum\limits_{i=1}^{N}\mathbb{1}(y_{i} = c, x_{id} = 1) + \alpha}{N_{c} + \alpha D}$
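
The smoothed estimate can be checked by hand on a tiny example. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`, and the sizes are invented for illustration) and a one-hot label matrix so that a single matrix product yields the per-class word counts.

```python
import numpy as np

# Hypothetical toy data: 4 documents, 3 binary word features, 2 classes.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 2, 2])
alpha = 0.2
K_toy, D_toy = 2, X_toy.shape[1]

# One-hot encode the labels so that a matrix product counts, per class,
# how many documents contain each word.
Y_onehot = (y_toy[:, None] == np.arange(1, K_toy + 1)[None, :]).astype(int)
feature_counts = Y_onehot.T @ X_toy      # shape (K_toy, D_toy)
class_counts = Y_onehot.sum(axis=0)      # shape (K_toy,)

# Smoothed estimate: (count + alpha) / (N_c + alpha * D).
pi_toy = (feature_counts + alpha) / (class_counts[:, None] + alpha * D_toy)
print(pi_toy)
```

Because the count for a class can never exceed $N_{c}$, every entry stays strictly between 0 and 1, so both $\log \pi_{cd}$ and $\log(1 - \pi_{cd})$ are finite.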

In [11]:
a = 0.2
print(X_train.T.shape)

# number of training documents in each class, N_c
class_counts = np.array([np.sum(Y_train == c + 1) for c in range(K)])
print(class_counts)

D = X_train.shape[1]
print(D)

# one-hot encode the labels (N x K) so that the product below counts,
# for every class, the documents in which each word appears (K x D)
Y_onehot = (Y_train[:, None] == np.arange(1, K + 1)[None, :]).astype(int)
feature_counts = Y_onehot.T @ X_train

# smoothed estimate; the whole numerator is divided by N_c + a * D
pi = (feature_counts + a) / (class_counts[:, None] + a * D)

(2000, 11314)
[480 584 591 590 578 593 585 594 598 597 600 595 591 594 593 599 546 564
 465 377]
2000


### Now that we have the requirements (priors and conditionals), we can calculate the score values.

#### Recall how we calculated the score in the previous lab.

Note that $p(x|y)$ is a product of Bernoulli distributions, one per word, rather than a multinomial: each feature is binary, and naive Bayes assumes that the words in a document are conditionally independent given the class.

$ g_{c}(x) = \log\left[\prod\limits_{d = 1}^{D}\hat{p}(x_d | y = c)\right] + \log\widehat{\Pr}(y = c)$

$ =  \log\left[\prod\limits_{d = 1}^{D}\hat{\pi}_{cd}^{x_{d}}(1-\hat{\pi}_{cd})^{1 - x_{d}}\right] +\log\widehat{\Pr}(y = c)$
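
Taking the logarithm turns the product into a sum, which can be written with two matrix products. The sketch below illustrates this on hypothetical parameters (`pi_toy`, `priors_toy`, and `X_toy` are invented values, not estimates from the dataset) and compares the vectorized score against a direct evaluation of the product formula.

```python
import numpy as np

# Hypothetical parameters for 2 classes and 3 binary features (illustrative values).
pi_toy = np.array([[0.8, 0.5, 0.4],
                   [0.1, 0.5, 0.9]])
priors_toy = np.array([0.5, 0.5])
X_toy = np.array([[1, 0, 0],
                  [0, 1, 1]])

# Vectorized score: present words contribute log(pi), absent words log(1 - pi).
scores = (X_toy @ np.log(pi_toy).T
          + (1 - X_toy) @ np.log(1 - pi_toy).T
          + np.log(priors_toy))

# Direct evaluation of the product formula for the first document and first class.
x, c = X_toy[0], 0
direct = (np.log(np.prod(pi_toy[c] ** x * (1 - pi_toy[c]) ** (1 - x)))
          + np.log(priors_toy[c]))

print(np.isclose(scores[0, c], direct))
print(np.argmax(scores, axis=1) + 1)  # class with the highest score, 1-based ids
```

The `(1 - X_toy)` factor is what selects the $\log(1 - \hat{\pi}_{cd})$ terms for absent words; multiplying both log matrices by `X_toy` would ignore absent words and count present words twice.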


In [5]:
## the score function above, written with matrix products

# pi must hold probabilities strictly between 0 and 1 with shape (K, D);
# entries >= 1 make np.log(1 - pi) return nan or -inf
log_pi = np.log(pi)
log_one_minus_pi = np.log(1 - pi)
log_priors = np.log(priors)

# words present contribute log(pi), words absent contribute log(1 - pi)
score_values = X_train @ log_pi.T + (1 - X_train) @ log_one_minus_pi.T + log_priors

print(score_values)
