# Assignment 1
by Nicolas Larranaga Cifuentes - nlarranagac - 2879695

Let $D = \{d_1, ..., d_n\}$ be a vector of documents, $T = \{t_1, ..., t_m\}$ a vector of terms, $TD = (TD_{ij})_{i=1...m,j=1...n}$ the matrix of frequencies of the terms T in the documents D, and $L = (l_1, ..., l_m)$ a vector describing the lengths of the terms T. A document $d_j$ is chosen at random with uniform probability, and then a term $t_i$ present in $d_j$ is chosen at random with probability proportional to the frequency of $t_i$ in $d_j$.

Let's assume that the following values are provided.

In [5]:
import numpy as np

n, m = 5, 6  # n documents (columns of TD), m terms (rows of TD)

TD = np.matrix([[2,3,0,3,7],
                [0,5,5,0,3],
                [5,0,7,3,3],
                [3,1,0,9,9],
                [0,0,7,1,3],
                [6,9,4,6,0]])

# L is a column vector (m x 1), so we transpose the 1 x m row matrix
L = np.matrix([5,2,3,6,4,3]).T

a) Calculate Matrix $P(T, D)$

Start from the fact that $P(T,D) = P(T \cap D)$, which by the product rule equals $P(D)*P(T|D)$. Bayes' theorem could be used to calculate $P(T|D)$, but since the matrix TD gives the frequency of each term per document, I can instead divide each frequency by the total frequency of its document (the column sum of TD). With $P(D) = \frac{1}{n}$ for every document, this can be summarized in the following expression.

$P(T,D) = \frac{1}{n}* \left( TD / \left(J_m * TD\right) \right)$

where / refers to element-wise matrix division, * is the ordinary matrix product, and $J_m$ denotes the $m \times m$ matrix filled with 1's, so that every row of $J_m * TD$ holds the column sums of TD.
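As a worked example of a single entry, take $t_1$ and $d_1$ with the values given above. The first column of TD sums to $2 + 0 + 5 + 3 + 0 + 6 = 16$, so

$P(t_1, d_1) = \frac{1}{n} * \frac{TD_{1 1}}{\sum_{i=1}^{m} TD_{i 1}} = \frac{1}{5} * \frac{2}{16} = 0.025$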

In [7]:
P_T_D = 1/n*np.divide(TD,np.ones((m,m))*TD)

print(P_T_D)

[[0.025      0.03333333 0.         0.02727273 0.056     ]
 [0.         0.05555556 0.04347826 0.         0.024     ]
 [0.0625     0.         0.06086957 0.02727273 0.024     ]
 [0.0375     0.01111111 0.         0.08181818 0.072     ]
 [0.         0.         0.06086957 0.00909091 0.024     ]
 [0.075      0.1        0.03478261 0.05454545 0.        ]]


As a sanity check, the entries of $P(T,D)$ sum to 1.0.

In [8]:
print(P_T_D.sum())

1.0
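The same computation can be sketched with plain `ndarray` broadcasting instead of multiplying by a matrix of ones. This is only an alternative formulation, not the approach used in the cells above; the names `col_totals` and `P_joint` are illustrative.

```python
import numpy as np

# Term-by-document frequency matrix from the problem statement
TD = np.array([[2, 3, 0, 3, 7],
               [0, 5, 5, 0, 3],
               [5, 0, 7, 3, 3],
               [3, 1, 0, 9, 9],
               [0, 0, 7, 1, 3],
               [6, 9, 4, 6, 0]], dtype=float)
n = TD.shape[1]  # number of documents

# Total frequency of each document, kept as a (1, n) row so it broadcasts
col_totals = TD.sum(axis=0, keepdims=True)

# P(T, D) = P(D) * P(T|D) with P(D) = 1/n for every document
P_joint = TD / col_totals / n

print(P_joint.sum())
```

Broadcasting avoids building the $m \times m$ matrix of ones, which matters little here but scales better for large vocabularies.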


b) Calculate Matrix $P(T|D)$

As explained in the previous problem,  $P(T|D)$ can be calculated using matrix TD as follows:

$P(T|D) = TD / (J_m * TD)$

In [10]:
P_TD = np.divide(TD, np.ones((m,m))*TD)

print (P_TD)

[[0.125      0.16666667 0.         0.13636364 0.28      ]
 [0.         0.27777778 0.2173913  0.         0.12      ]
 [0.3125     0.         0.30434783 0.13636364 0.12      ]
 [0.1875     0.05555556 0.         0.40909091 0.36      ]
 [0.         0.         0.30434783 0.04545455 0.12      ]
 [0.375      0.5        0.17391304 0.27272727 0.        ]]
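Since each column of $P(T|D)$ is a probability distribution over the terms of one document, every column should sum to 1. A minimal check, assuming the same TD values and using the illustrative names `P_cond` and `col_sums`:

```python
import numpy as np

TD = np.array([[2, 3, 0, 3, 7],
               [0, 5, 5, 0, 3],
               [5, 0, 7, 3, 3],
               [3, 1, 0, 9, 9],
               [0, 0, 7, 1, 3],
               [6, 9, 4, 6, 0]], dtype=float)

# P(T|D): each frequency divided by the total frequency of its document
P_cond = TD / TD.sum(axis=0, keepdims=True)

# One sum per document (column); each should be 1
col_sums = P_cond.sum(axis=0)
print(col_sums)
```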


c) Calculate $P(D|T)$

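Using Bayes' theorem and the results obtained before, I can write $P(D|T) = P(T,D)/P(T)$, where $P(T)$ is obtained by summing each row of $P(T,D)$. In matrix form, with $J_n$ the $n \times n$ matrix of 1's (so that every column of $P(T,D)*J_n$ holds the row sums of $P(T,D)$):

$P(D|T) = P(T,D)/(P(T,D)*J_n)$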

In [13]:
P_DT = np.divide(P_T_D, (P_T_D* np.ones((n,n))))

print(P_DT)

[[0.17654612 0.23539482 0.         0.19259576 0.3954633 ]
 [0.         0.45154704 0.35338464 0.         0.19506832]
 [0.35787437 0.         0.34853851 0.15616336 0.13742376]
 [0.18524987 0.05488885 0.         0.40418153 0.35567975]
 [0.         0.         0.64782097 0.09675248 0.25542655]
 [0.28373832 0.37831776 0.13158879 0.20635514 0.        ]]
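As a cross-check, the same matrix can be obtained from Bayes' theorem in its usual form, $P(D|T) = P(T|D) * P(D) / P(T)$. The sketch below assumes the TD values given above and uses illustrative variable names; each row of the result is a distribution over documents, so it should sum to 1.

```python
import numpy as np

TD = np.array([[2, 3, 0, 3, 7],
               [0, 5, 5, 0, 3],
               [5, 0, 7, 3, 3],
               [3, 1, 0, 9, 9],
               [0, 0, 7, 1, 3],
               [6, 9, 4, 6, 0]], dtype=float)
n = TD.shape[1]

P_T_given_D = TD / TD.sum(axis=0, keepdims=True)  # likelihood P(T|D)
P_D = np.full(n, 1.0 / n)                         # uniform prior P(D)

# Numerator of Bayes' theorem: P(T|D) * P(D), broadcast over columns
numerator = P_T_given_D * P_D

# Evidence P(T): sum of the numerator over documents, kept as a column
P_T = numerator.sum(axis=1, keepdims=True)

P_D_given_T = numerator / P_T
print(P_D_given_T.sum(axis=1))
```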


d) Calculate Matrix $P(D)$

Since all documents in D are equally probable, let's define $P(D)$ as follows, where $J_{1,n}$ is the $1 \times n$ matrix of 1's.

$P(D) = \frac{1}{n} * J_{1,n}$

In [15]:
P_D = 1/n * np.ones((1,n))

print (P_D)

[[0.2 0.2 0.2 0.2 0.2]]


e) Calculate Matrix $P(T)$

Let's marginalize the joint distribution from problem a) over the documents, which amounts to summing each row of $P(T,D)$:

$P(T) = P(T,D) * J_{n,1}$

In [16]:
P_T = P_T_D * np.ones((n,1))

print (P_T)

[[0.14160606]
 [0.12303382]
 [0.17464229]
 [0.20242929]
 [0.09396047]
 [0.26432806]]
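Equivalently, the law of total probability gives $P(T) = \sum_j P(T|d_j) * P(d_j)$, which is a matrix-vector product of $P(T|D)$ with $P(D)$. A small sketch under the same assumed values, with illustrative names:

```python
import numpy as np

TD = np.array([[2, 3, 0, 3, 7],
               [0, 5, 5, 0, 3],
               [5, 0, 7, 3, 3],
               [3, 1, 0, 9, 9],
               [0, 0, 7, 1, 3],
               [6, 9, 4, 6, 0]], dtype=float)
n = TD.shape[1]

P_T_given_D = TD / TD.sum(axis=0, keepdims=True)  # shape (m, n)
P_D = np.full(n, 1.0 / n)                         # shape (n,)

# Law of total probability as a matrix-vector product
P_T = P_T_given_D @ P_D

print(P_T, P_T.sum())
```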


f) Calculate $E(L)$ where $E(x)$ is the expected value of x

Starting from the formula

$E(x) = \sum_x P(x)*x$

I can get the following equivalent

$E(L) = \sum_l P(l) \times l$

But let's remember that the length depends on the term, whose probability in turn depends on the document that is selected. So I end up with the following, where the sum runs over all entries of the matrix and $\times$ stands for element-wise matrix multiplication.

$E(L) = \sum P(T,D) \times (L * J_{1,n})$

In [18]:
length_expected_value = np.multiply(P_T_D, L * np.ones((1,n))).sum()

print(length_expected_value)

3.8614266578831797
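Because the length $l_i$ does not depend on the document, the double sum collapses onto the marginal $P(T)$ from problem e). Using those values rounded to four decimals:

$E(L) = \sum_{i=1}^{m} \sum_{j=1}^{n} P(t_i, d_j) \, l_i = \sum_{i=1}^{m} l_i \, P(t_i)$

$E(L) \approx 5(0.1416) + 2(0.1230) + 3(0.1746) + 6(0.2024) + 4(0.0940) + 3(0.2643) \approx 3.8614$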


g) Calculate $Var(L)$ 

I make use of the following formula

$Var(X) = \sum_{i} P_i \times (x_i - \mu)^2$

where $\mu$ is the expected value, and apply it to the matrix scenario as follows.

Let $Y = L*J_{1,n} - \mu$, the matrix of deviations of each term length from the mean, repeated for every document.

Then we can define $Var(L)$ as $Var(L) = \sum Y\times Y \times P(T,D)$ (here $\times$ stands for element-wise matrix multiplication and the sum runs over all entries).

In [19]:
mu = length_expected_value
Y = L * np.ones((1,n)) - mu
length_variance = np.multiply(P_T_D, np.multiply(Y,Y)).sum()

print(length_variance)

1.8632262826095156
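The two results can be sanity-checked by simulating the sampling process described at the start: pick a document uniformly, then pick a term with probability proportional to its frequency in that document. The sketch below is an illustrative Monte Carlo check, not part of the assignment; the seed, the sample size, and all variable names are arbitrary assumptions. It also computes the exact values through the identity $Var(L) = E(L^2) - E(L)^2$.

```python
import numpy as np

TD = np.array([[2, 3, 0, 3, 7],
               [0, 5, 5, 0, 3],
               [5, 0, 7, 3, 3],
               [3, 1, 0, 9, 9],
               [0, 0, 7, 1, 3],
               [6, 9, 4, 6, 0]], dtype=float)
L = np.array([5, 2, 3, 6, 4, 3], dtype=float)
m, n = TD.shape

P_T_given_D = TD / TD.sum(axis=0, keepdims=True)
P_T = P_T_given_D.mean(axis=1)  # uniform P(D) = 1/n, so average the columns

# Exact moments from the marginal P(T)
exact_mean = L @ P_T
exact_var = (L ** 2) @ P_T - exact_mean ** 2

# Simulation (seed and sample size are arbitrary choices)
rng = np.random.default_rng(0)
num_samples = 200_000
docs = rng.integers(0, n, size=num_samples)  # step 1: uniform document

terms = np.empty(num_samples, dtype=int)
for j in range(n):
    mask = docs == j
    # step 2: term drawn with probability proportional to its frequency in d_j
    terms[mask] = rng.choice(m, size=int(mask.sum()), p=P_T_given_D[:, j])

sample_lengths = L[terms]
sample_mean = sample_lengths.mean()
sample_var = sample_lengths.var()

print(exact_mean, sample_mean)
print(exact_var, sample_var)
```

The sample estimates should land close to the exact values, with the gap shrinking as `num_samples` grows.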
