# TP4 - Non-negative Matrix Factorization
The goal is to study the use of nonnegative matrix factorisation (NMF) for topic extraction from a dataset of text documents. The rationale is to interpret each extracted NMF component as being associated with a specific topic. 

Study and test the following script (introduced in the [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)):

1. Test and comment on the effect of varying the initialisation, especially using random
nonnegative values as initial guesses (for W and H coefficients, using the notations introduced
during the lecture).
2. Compare and comment on the difference between the results obtained with the $\ell_2$ cost compared
to the generalised Kullback-Leibler cost.
3. Test and comment on the results obtained using a simpler term-frequency representation
as input (as opposed to the TF-IDF representation considered in the code above) when
considering the Kullback-Leibler cost.

In [14]:
###### CUSTOM NMF IMPLEMENTATION ######
# Multiplicative Update Rules for NMF #
# estimation with beta divergences    #
import numpy

# TODO: translate slides 59 [beta-divergence] & 47 [error and special cases]

def custom_NMF(V, K, W=None, H=None, steps=50, beta=0, toll=0.1, show_div=False):
    
    F = len(V) #Number of V rows
    N = len(V[0]) #Number of V columns

    if W is None:
        W = numpy.random.rand(F,K)
        
    if H is None:
        H = numpy.random.rand(K,N)
        
    if N != len(H[0]):
        raise ValueError("H has "+str(len(H[0]))+" columns, expected N="+str(N))
    if F != len(W):
        raise ValueError("W has "+str(len(W))+" rows, expected F="+str(F))
        
    #Setup n_iter
    n_iter = 1
    
    # Setup initial error
    init_error = _beta_div(V,W,H,beta,F,N,K)
    if show_div:
        print("Initial error: "+str(init_error))
    error = init_error
    
    for step in range(steps):
    
        # Matrix form of the multiplicative updates:
        # numpy.multiply is the element-wise (Hadamard) product, numpy.dot the matrix product
        upd_UP = numpy.dot(W.T, numpy.multiply(pow(numpy.dot(W,H),beta-2), V))
        upd_DOWN = numpy.dot(W.T, pow(numpy.dot(W,H),beta-1))
        upd = upd_UP / upd_DOWN
        H = numpy.multiply(H, upd)
        
        upd_UP = numpy.dot(numpy.multiply(pow(numpy.dot(W,H),beta-2), V),H.T)
        upd_DOWN = numpy.dot(pow(numpy.dot(W,H),beta-1), H.T)
        upd = upd_UP / upd_DOWN
        W = numpy.multiply(W, upd)

        # Test element-wise products
#         for i in range(F):
#             for j in range(N):
#                 for k in range(K):
#                     x = V[i][j]
#                     w = W[i][k]
#                     h = H[k][j]
#                     y = w*h
# #                     print("x:"+str(x)+" | w:"+str(w)+" | h:"+str(h)+" | y:"+str(y))
#                     # Update h
#                     upd_up = w*(pow(y,beta-2)*x)
#                     upd_down = w*pow(y,beta-1)
#                     upd = upd_up/upd_down
#                     h = h*upd
#                     # Update w
#                     upd_up = (pow(y,beta-2)*x)*h
#                     upd_down = pow(y,beta-1)*h
#                     upd = upd_up/upd_down
#                     w = w*upd
        
        if toll > 0:
            new_error = _beta_div(V,W,H,beta,F,N,K)
            if show_div:
                print("Error on iteration "+str(n_iter)+": " +str(new_error))
            # Check if approximation error relative decrease is below the desired threshold
            rel_dec = ((error - new_error) / init_error)
            if show_div:
                print("Error relative decrease at iteration "+str(n_iter)+": "+str(rel_dec))
            if rel_dec < toll:
                break
            error = new_error
            
        n_iter += 1
            
    return W, H

def _beta_div(V,W,H,beta,F,N,K):
    div = 0
    # Update beta_divergence
    if beta == 1: # generalized Kullback-Leibler divergence. x log(x/y) - x + y
        # div = numpy.dot(V, numpy.log(V,numpy.dot(W,H))) - numpy.sum(V) + numpy.sum(numpy.dot(W,H))
        func = _kullback_leiber
    elif beta == 0: # Itakura-Saito divergence. (x/y) - log(x/y) -1
        # div = numpy.sum(V / numpy.dot(W,H)) - numpy.sum(numpy.log(V / numpy.dot(W,H))) - numpy.product(len(V))
        func = _itakura_saito
    else: # General beta-divergence (beta not in {0, 1}); beta == 2 gives half the squared Euclidean distance. (1/(beta(beta-1)))(x^beta + (beta-1)y^beta - beta*x*y^(beta-1))
        func = _euclidean_distance
    WH = numpy.dot(W, H)
    for i in range(F):
        for j in range(N):
            x = V[i][j]
            if x == 0:
                x = numpy.finfo(numpy.double).tiny
            y = WH[i][j]
            div += func(x,y,beta)
    return div

def _kullback_leiber(x,y,beta):
    return x*numpy.log(x/y) - x + y

def _itakura_saito(x,y,beta):
    # Itakura-Saito divergence: (x/y) - log(x/y) - 1
    return x/y - numpy.log(x/y) - 1

def _euclidean_distance(x,y,beta):
    return (1/(beta*(beta-1)))*(pow(x,beta) + (beta-1)*pow(y,beta) - beta*x*pow(y,beta-1))

#######

if __name__ == "__main__":
    V = [
         [5,3,0,1],
         [4,0,0,1],
         [1,1,0,5],
         [1,0,0,4],
         [0,1,5,4],
        ]

    V = numpy.array(V) # Data matrix F x N 
    K = 2

    W, H = custom_NMF(V, K, beta = 1, toll = 0.0001, show_div = True)

Initial error: 51.643944750947426
Error on iteration 1: 15.69087154542149
Error on iteration 2: 14.365351593662911
Error on iteration 3: 13.202652632634065
Error on iteration 4: 12.499595673095968
Error on iteration 5: 12.168553108484467
Error on iteration 6: 12.025669046860067
Error on iteration 7: 11.963370396526086
Error on iteration 8: 11.933537330556263
Error on iteration 9: 11.916593277429161
Error on iteration 10: 11.90507427494362
Error on iteration 11: 11.896195303173478
Error on iteration 12: 11.888872871384228
Error on iteration 13: 11.882634726996741
Error on iteration 14: 11.877236534843778
Error on iteration 15: 11.872526795726381
Error on iteration 16: 11.86839730448731
Error on iteration 17: 11.864763575927029
Error on iteration 18: 11.861556280112953
Error on iteration 19: 11.858717024360454
Error on iteration 20: 11.856195999375727
Error on iteration 21: 11.853950489771055
Error on iteration 22: 11.851943813976042
Error on iteration 23: 11.850144493353692
Error on ite

In [15]:
##### TEST RESULTS #####
W

array([[4.23771815e-002, 1.13908590e+000],
       [1.14926699e-001, 5.43001531e-001],
       [9.34300906e-001, 1.02757983e-030],
       [6.67357790e-001, 9.15737397e-129],
       [4.45558650e-001, 8.73979726e-001]])
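
As a cross-check for the loop-based `_beta_div` used in the implementation above, the same three cases can be written in vectorised NumPy. This is a sketch, not the reference formulation from the lecture: the function name `beta_divergence` and the `eps` parameter are introduced here for illustration, and zeros in `V` are replaced by a tiny positive value, as `_beta_div` does.

```python
import numpy as np

def beta_divergence(V, WH, beta, eps=np.finfo(float).tiny):
    """Sum of element-wise beta-divergences between V and its approximation WH."""
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    X = np.where(V == 0, eps, V)  # assumption: same zero replacement as _beta_div
    if beta == 1:
        # Generalised Kullback-Leibler: x log(x/y) - x + y
        return np.sum(X * np.log(X / WH) - X + WH)
    if beta == 0:
        # Itakura-Saito: x/y - log(x/y) - 1
        ratio = X / WH
        return np.sum(ratio - np.log(ratio) - 1)
    # General case; beta == 2 is half the squared Euclidean distance
    total = np.sum(X**beta + (beta - 1) * WH**beta - beta * X * WH**(beta - 1))
    return total / (beta * (beta - 1))

V = np.array([[1.0, 2.0], [3.0, 4.0]])
WH = np.array([[1.0, 1.0], [2.0, 5.0]])
print(beta_divergence(V, WH, 2))  # 0.5 * ||V - WH||_F^2 = 1.5
print(beta_divergence(V, WH, 1))  # generalised Kullback-Leibler
print(beta_divergence(V, WH, 0))  # Itakura-Saito
```

Each divergence is zero when `WH` equals `V`, which is a convenient sanity check before comparing against the values printed with `show_div=True`.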