# answers 10: the adventure of the moonlighting genes
**Sean's answers**

I've downloaded [Adler's data file, w10-data.tbl](http://mcb112.org/w10/w10-data.tbl) to my current directory.

## 1. write a script that simulates positive control data

In [1]:
import numpy as np

np.random.seed(42)   # Make the simulation reproducible (so if we re-run this cell, we don't have to re-run the whole page)

R = 4       # Number of hidden components, R >= 3 (the set construction below builds at least 3 sets)
N = 100     # Number of genes. HW used sand mouse names; here we'll just use gene0 - gene99
M = 60      # Number of cell types/samples/experiments

X = 20      # Assign X genes to components 0..R-2; component R-1 gets the rest.
Y = 3       # Sets 0 and 1 share <Y> moonlighting genes

# Assign N genes randomly to R components.
# Sets 0..R-2 have X=20 genes each; set R-1 gets the rest
# Sets 0 and 1 share Y=3 moonlighting genes.
#   => so there have to be more than (R-1)*X-Y genes.
#
assert(N > (R-1)*X-Y)  
shuffidx = np.random.permutation(N)    # Randomly shuffle gene indices, to randomly assign genes to sets
S   = []
S.append(set(shuffidx[0:X]))           # Set 0 gets first X genes
S.append(set(shuffidx[X-Y:X*2-Y]))     # Set 1 also has X genes, but overlaps set 0 by Y genes (random, because of the permuted order)
i = X*2-Y
for a in range(2, R-1):                # Sets 2..R-2 get X genes each
    S.append(set(shuffidx[i:i+X]))
    i += X
S.append(set(shuffidx[i:]))            # Set R-1 gets the rest.

# H_ru  for component r, experiment u
# Columns u are probability vectors, \sum_r H_ru = 1
#   - np.ones(R) gives us a vector of R values = 1.0 (Dirichlet parameters)
#   - sampling from a Dirichlet with all \alpha_i = 1 samples p_i uniformly
#   - size=M argument says do it M times, giving us M row probability vectors: MxR matrix
#   - we need an RxM matrix with columns as prob vectors, so .T transposes 
#
H = np.random.dirichlet(np.ones(R), size=M).T


# W_ir = P(i | r)    for gene i, component r
# Sparse ... for each module r, only some genes are expressed.
#
W = np.zeros((N,R))
for a in range(R):
    p = np.random.dirichlet(np.ones(len(S[a])))  # probabilities of the subset that's expressed
    for j, i in enumerate(S[a]):                 # assign those probabilities to the subset of expressed genes
        W[i,a] = p[j]


# C_u : total counts per experiment (900-1100)*N
#
C = np.zeros(M).astype(int)
for u in range(M):
    C[u] = (900 + np.random.randint(201)) * N


# lambda_iu = C_u \sum_r W_ir H_ru = C_u \sum_r P(i | r) P(r | u)  for each experiment u
# (we can't use lambda as a variable name, because it's a reserved keyword in Python)
#
lam = (W @ H) * C      # @ is matrix multiplication (Python 3.5+); broadcasting C scales column u by C[u]
        

# V_iu = observed data, Poisson distributed around \lambda_iu
#
V  = np.zeros((N,M)).astype(int)
for i in range(N):
    for u in range(M):
        V[i,u] = np.random.poisson(lam[i,u])
        
        
# Let's remember the true solution. We'll be using W,H a lot below.
#
W_true = np.copy(W)
H_true = np.copy(H)
  

### inferred assignments of genes to components

In [2]:
# It's useful to look at the W's normalized in the a direction:
# the posterior probability that component a generates gene i, as
# opposed to the other components. We'll reuse this function to 
# look at optimized solutions; for now we use it to look at the true
# solution in our simulated data.
# 
def print_sets(W, H, genenames=[]):
    (N,R) = W.shape
    (R,M) = H.shape
    pp = np.zeros((N,R))
    Z  = np.sum(W, axis=1)
    for i in range(N):
        for a in range(R):
            pp[i,a] = W[i,a] / Z[i]
            
    nassigned = np.zeros(R).astype(int)
    for i in range(N):
        if len(genenames) > 0:
            print("{0:15s} ".format(genenames[i]), end='')
        else:
            print("{0:5d} ".format(i), end='')
        for a in range(R):
            print("{0:6.4f} ".format( pp[i,a] ), end='')
        print('      ', end='')

        nin = 0
        for a in range(R):
            if pp[i,a] > 0.10:            # Arbitrary threshold for calling a gene "assigned" to a component!
                print("YES ", end='')     # Here, for the true solution, unassigned is a pp=0
                nin += 1                  # but when we run NMF, we'll only get pp~0.
                nassigned[a] += 1
            else:       
                print("  . ", end='')
        if nin > 1:
            print("    <= moonlighter", end='')
        print('')

    for a in range(R):
        print("Component {0:2d}: {1:2d} genes assigned.".format(a, nassigned[a]))
        
        
print_sets(W_true, H_true)

    0 1.0000 0.0000 0.0000 0.0000       YES   .   .   . 
    1 0.0000 0.0000 0.0000 1.0000         .   .   . YES 
    2 0.0000 0.0000 0.0000 1.0000         .   .   . YES 
    3 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
    4 1.0000 0.0000 0.0000 0.0000       YES   .   .   . 
    5 0.0000 1.0000 0.0000 0.0000         . YES   .   . 
    6 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
    7 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
    8 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
    9 0.0000 1.0000 0.0000 0.0000         . YES   .   . 
   10 1.0000 0.0000 0.0000 0.0000       YES   .   .   . 
   11 0.0000 1.0000 0.0000 0.0000         . YES   .   . 
   12 0.7821 0.2179 0.0000 0.0000       YES YES   .   .     <= moonlighter
   13 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
   14 0.0000 0.0000 0.0000 1.0000         .   .   . YES 
   15 0.0000 1.0000 0.0000 0.0000         . YES   .   . 
   16 0.0000 0.0000 1.0000 0.0000         .   . YES   . 
   17 0.0000 

### log likelihood of the data

We're going to want a function that calculates the log likelihood of an NMF model. Let's go ahead and implement that function now, and see what the log likelihood of the true model is. Our NMF implementation should approach this (it can even do better, by overfitting the noise in the data a bit).
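
For reference, here's the quantity we're computing, written out under the Poisson model used in the simulation above, where each count $V_{iu}$ is Poisson-distributed with mean $\lambda_{iu} = C_u \sum_r W_{ir} H_{ru}$:

$$
\log P(V \mid W, H) = \sum_{i} \sum_{u} \left[ V_{iu} \log \lambda_{iu} - \lambda_{iu} - \log V_{iu}! \right]
$$

The $\log V_{iu}!$ term doesn't depend on $W$ or $H$, so the function below drops it. That's why the values reported on this page come out positive: they're log likelihoods up to an additive constant, only meaningful for comparing models on the same data.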

In [3]:
# A function for calculating the total log likelihood (up to an additive constant)
#
def loglikelihood(V, lam):
    N,M = V.shape
    ll = 0.
    for i in range(N):
        for u in range(M):
            ll += V[i,u] * np.log(lam[i,u]) - lam[i,u]   # \sum_u \sum_i log(V_iu ! ) is a constant offset; we neglect it.
    return ll

print("{0:.6g}".format(loglikelihood(V, lam)))

4.01408e+07
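
The double loop is easy to read but slow, and it gets called on every NMF iteration below. Here's a vectorized equivalent, as a sketch: `loglikelihood_vec` and the little demo arrays are made-up names and values, not part of the homework, and it assumes every `lam[i,u]` is greater than 0.

```python
import numpy as np

def loglikelihood_vec(V, lam):
    """Poisson log likelihood up to the constant -sum(log V!), without Python loops.

    Assumes every lam[i,u] > 0 (otherwise np.log gives -inf or nan).
    """
    return np.sum(V * np.log(lam) - lam)

# Tiny made-up example (not the homework data): 2 genes x 3 samples.
V_demo   = np.array([[3, 0, 5],
                     [1, 2, 4]])
lam_demo = np.array([[2.5, 0.5, 4.0],
                     [1.0, 2.0, 5.0]])
ll_demo  = loglikelihood_vec(V_demo, lam_demo)
```

It should agree with `loglikelihood(V, lam)` up to floating point roundoff, since it sums the same terms.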


## 2. Implement non-negative matrix factorization

In [4]:
def NMF(V, R):
    """
    Input:
       V are the data: NxM matrix of mapped read counts
       R is the number of hidden components we want
       
    Returns:
       W : the NxR weight matrix
       H : the RxM coefficient matrix
       ll : log likelihood of the WxH solution
       n  : number of iterations it took
    """
    N,M = V.shape
    C = np.sum(V, axis=0)
    
    # Initial guesses for H, W
    #
    H = np.random.dirichlet(np.ones(R), size=M).T    # gives an RxM matrix, columns are probability vectors
    W = np.random.dirichlet(np.ones(N), size=R).T    #  ... and NxR, ditto
            
    # NMF optimization (following Lee and Seung, 1999, but with explicit counts, and normalized H_au)
    #
    iterations = 0
    while True:   
        # Update our expected mean lambda_iu, given current W,H.
        #
        lam = (W @ H) * C      # broadcasting scales column u by C[u]
                
        # Calculate the log likelihood of the current model W,H, 
        # and test for convergence.
        #
        ll = loglikelihood(V, lam)
        #print("current log likelihood = {0:.3f} ".format( ll))
        if iterations >= 1:
            diff = (ll - old_ll) / abs(old_ll)
            if diff < 1e-8: break
        iterations += 1
        old_ll      = ll

        # Update eqn for W'_ia
        #
        new_W = np.zeros((N,R))
        for i in range(N):
            for a in range(R):
                term1 = 0.
                term2 = 0.
                for u in range(M):
                    term1 += (V[i,u] / lam[i,u]) * C[u] * H[a,u]
                    term2 += C[u] * H[a,u]
                new_W[i,a] = W[i,a] * term1 / term2 
        
        # Renormalization of W_ia
        #
        Z = np.sum(new_W, axis=0)
        for a in range(R):
            for i in range(N):
                new_W[i,a] = new_W[i,a] / Z[a]

        # Update eqn for H'_au
        new_H = np.zeros((R,M))
        for a in range(R):
            for u in range(M):
                term1 = 0.
                for i in range(N):
                    term1 += (V[i,u] / lam[i,u]) * W[i,a]
                new_H[a,u] = H[a,u] * term1      # no denominator needed, because \sum_i W_ia = 1
        
        # Now swap the updated estimates in, making them the current model.
        W = new_W
        H = new_H

    return (W,H, ll, iterations)
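
The triple loops are where most of the run time goes. One way to speed things up is to write a single update step with matrix operations. This is a sketch, not a drop-in replacement I've timed: `nmf_update` and the demo variables are hypothetical names, and the demo data are made up. It computes the same update equations as the loops above, including using the old `W` in the `H` update.

```python
import numpy as np

def nmf_update(V, W, H, C):
    """One multiplicative update of W and H, vectorized.

    V: NxM counts; W: NxR, columns sum to 1; H: RxM, columns sum to 1;
    C: length-M total counts per sample (column sums of V).
    """
    lam   = (W @ H) * C                            # expected counts, NxM
    ratio = V / lam                                # V_iu / lambda_iu
    CH    = H * C                                  # RxM matrix of C_u * H_au
    new_W = W * (ratio @ CH.T) / CH.sum(axis=1)    # term1 / term2 for every (i,a)
    new_W /= new_W.sum(axis=0)                     # renormalize each column of W
    new_H = H * (W.T @ ratio)                      # uses the old W, like the loops
    return new_W, new_H

# Made-up demo data: 6 genes, 5 samples, 3 components.
rng    = np.random.default_rng(0)
V_demo = rng.poisson(50, size=(6, 5))
C_demo = V_demo.sum(axis=0)
W_demo = rng.dirichlet(np.ones(6), size=3).T       # 6x3, columns sum to 1
H_demo = rng.dirichlet(np.ones(3), size=5).T       # 3x5, columns sum to 1
new_W, new_H = nmf_update(V_demo, W_demo, H_demo, C_demo)
```

The columns of `new_H` stay normalized without an explicit renormalization step, because `C` holds the column sums of `V`.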

It's a local optimizer, so let's wrap the NMF optimization in something that runs it many times and takes the solution with the best log likelihood:

In [5]:
def NMF_best(V, R, nruns):
    for run in range(nruns):
        (W, H, ll, n_iterations) = NMF(V,R)
        print ("run: {0:2d}  iterations: {1:5d}  logL: {2:.6g}".format(run, n_iterations, ll))
        if run == 0 or ll > ll_best:
            ll_best = ll
            W_best  = np.copy(W)
            H_best  = np.copy(H)
    return (W_best, H_best, ll_best)        

Each run takes a while, and doing several runs takes several times as long ... set it running and get a cup of coffee...

In [6]:
(W, H, logL) = NMF_best(V,R,5)

run:  0  iterations:   399  logL: 4.0141e+07
run:  1  iterations:   266  logL: 4.01411e+07
run:  2  iterations:   539  logL: 4.0141e+07
run:  3  iterations:   431  logL: 4.01411e+07
run:  4  iterations:   421  logL: 4.0141e+07


In [7]:
print_sets(W,H)

    0 0.0058 0.9902 0.0000 0.0040         . YES   .   . 
    1 0.0026 0.0659 0.0009 0.9305         .   .   . YES 
    2 0.0099 0.0568 0.0013 0.9319         .   .   . YES 
    3 0.9621 0.0062 0.0091 0.0225       YES   .   .   . 
    4 0.0007 0.9926 0.0000 0.0067         . YES   .   . 
    5 0.0009 0.0206 0.9708 0.0077         .   . YES   . 
    6 0.9631 0.0064 0.0087 0.0218       YES   .   .   . 
    7 0.9643 0.0039 0.0118 0.0200       YES   .   .   . 
    8 0.9640 0.0032 0.0120 0.0208       YES   .   .   . 
    9 0.0001 0.0323 0.9677 0.0000         .   . YES   . 
   10 0.0000 0.9933 0.0000 0.0066         . YES   .   . 
   11 0.0001 0.0189 0.9729 0.0081         .   . YES   . 
   12 0.0018 0.7648 0.2270 0.0064         . YES YES   .     <= moonlighter
   13 0.9634 0.0090 0.0097 0.0179       YES   .   .   . 
   14 0.0072 0.0594 0.0071 0.9263         .   .   . YES 
   15 0.0116 0.0148 0.9559 0.0177         .   . YES   . 
   16 0.9621 0.0038 0.0120 0.0221       YES   .   .   . 
   17 0.9622 

We can make a couple of observations here. 

One is that NMF sparsifies, but not completely: when a gene doesn't belong to a component, NMF drives its weight toward zero without quite reaching it. Many implementations of NMF add a sparsity penalty term to the objective function, to push the factorization toward even sparser solutions.

So to deduce which genes are assigned to which components, the `print_sets` function applies an arbitrary threshold of 0.1 to the posterior probability. It happens to work on this problem because our moonlighting genes are only shared across a couple of components, so their assignments stand out easily from the background haze of low pp's for other genes.

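And given those semi-arbitrary (i.e. thresholded) assignments, we're usually able to figure out which genes truly belong to which component, even when there's some overlap (here, the three moonlighting genes). The component labels themselves are arbitrary, though: in the output above, the genes of true component 0 come back as inferred component 1, for example.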
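
If we want the assignments as data rather than as a printed table, the same thresholding can be pulled out into a small helper. This is a sketch: `call_assignments` is a hypothetical name, the 0.10 threshold is the same arbitrary choice that `print_sets` makes, and the demo matrix is made up.

```python
import numpy as np

def call_assignments(W, threshold=0.10):
    """Threshold the row-normalized weights of an NxR matrix W.

    Returns (assigned, moonlighters): a boolean NxR matrix, and the list of
    gene indices assigned to more than one component.
    """
    pp = W / W.sum(axis=1, keepdims=True)     # normalize each gene's row across components
    assigned = pp > threshold
    moonlighters = np.flatnonzero(assigned.sum(axis=1) > 1).tolist()
    return assigned, moonlighters

# Made-up example: 4 genes, 3 components; gene 2 is shared by components 0 and 1.
W_demo = np.array([[0.30, 0.00, 0.00],
                   [0.00, 0.25, 0.01],
                   [0.20, 0.10, 0.00],
                   [0.00, 0.00, 0.40]])
assigned, moon = call_assignments(W_demo)
```

Here `moon` is `[2]`, and `assigned.sum(axis=0)` gives the number of genes called for each component.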

We can also note that all five NMF runs converge to nearly the same log likelihood on these data (about 4.0141e+07, slightly above the true model's 4.01408e+07, as expected from a little overfitting), so there isn't much of a local optimum problem. We probably don't need five runs; we can get away with fewer. (Which is good, because it takes a while to compute.)

### can we deduce the number of components?

We set the true number of components to R=4. Can we deduce R, by looking at the log likelihood of our fit, as we vary R? We don't have to do multiple runs of NMF per R choice to get a feel for this.

In [8]:
for r in range(3,7):
    (W, H, logL, niter) = NMF(V,r)
    print ("R: {0:2d}  logL: {1:.6g}".format(r, logL))

R:  3  logL: 3.9598e+07
R:  4  logL: 4.0141e+07
R:  5  logL: 4.01411e+07
R:  6  logL: 4.01412e+07


I'd say yes, we can. The log likelihood jumps going from R=3 to R=4, then pretty much goes flat: adding components beyond R=4 only changes it in the sixth significant digit.
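
Eyeballing works fine here, but the "goes flat" rule can be written down too. A sketch, with assumptions: `pick_R` is a hypothetical helper, and the relative tolerance of 1e-4 is an arbitrary choice of mine, not something from the homework. The input values are the log likelihoods printed above.

```python
def pick_R(logL_by_R, rel_tol=1e-4):
    """Smallest R after which one more component improves logL by less than rel_tol (relative)."""
    Rs = sorted(logL_by_R)
    for r, r_next in zip(Rs, Rs[1:]):
        gain = (logL_by_R[r_next] - logL_by_R[r]) / abs(logL_by_R[r])
        if gain < rel_tol:
            return r
    return Rs[-1]     # never went flat: fall back to the largest R tried

# Log likelihoods printed above for the simulated data.
logLs  = {3: 3.9598e+07, 4: 4.0141e+07, 5: 4.01411e+07, 6: 4.01412e+07}
best_R = pick_R(logLs)
```

On these numbers `best_R` comes out as 4, matching the true R. A rule like this is only as good as its tolerance, so it's worth still looking at the numbers.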

## 3. Analyze the Adler data

We have a working NMF implementation, so let's analyze Adler's data set now.

In [9]:
# Read the data file into Vx[]
# I'm calling the real data Vx[] so I don't overwrite the simulated data from above.
#
infile = "w10-data.tbl"
genenames = []
Vx        = []
with open(infile) as f:
    for line in f:
        fields = line.split()
        genenames.append( fields[0] )
        Vx.append( [ float(x) for x in fields[1:] ])
Vx     = np.array(Vx)
Nx, Mx = Vx.shape

What's our best guess for how many gene batteries there are? We've been told the choices are 3..6. Let's look at the log likelihood of NMF solutions for that range of R. 

In [10]:
for Rx in range(3,7):
    (Wx, Hx, logLx, niter) = NMF(Vx,Rx)
    print ("R: {0:2d}  logL: {1:.6g}".format(Rx, logLx))

R:  3  logL: 4.03928e+07
R:  4  logL: 4.07958e+07
R:  5  logL: 4.07955e+07
R:  6  logL: 4.07961e+07


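Looks like 4 components, because adding more components after that barely changes the log likelihood. (R=5 even comes out slightly lower than R=4 here, presumably because a single NMF run per R doesn't always reach the best optimum.) Recompute the R=4 solution, taking the best of three runs, so we can look at it: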

In [11]:
Rx = 4
(Wx, Hx, logLx) = NMF_best(Vx, Rx, 3)
print_sets(Wx, Hx, genenames=genenames)

run:  0  iterations:   351  logL: 4.0796e+07
run:  1  iterations:   240  logL: 4.07959e+07
run:  2  iterations:   384  logL: 4.07959e+07
anise           0.0000 0.9912 0.0001 0.0087         . YES   .   . 
apricot         0.0010 0.0000 0.9917 0.0072         .   . YES   . 
artichoke       0.9828 0.0109 0.0056 0.0008       YES   .   .   . 
arugula         0.9838 0.0162 0.0000 0.0000       YES   .   .   . 
asparagus       0.0030 0.0000 0.9941 0.0029         .   . YES   . 
avocado         0.9821 0.0116 0.0049 0.0014       YES   .   .   . 
banana          0.9871 0.0104 0.0025 0.0000       YES   .   .   . 
basil           0.0234 0.0105 0.0000 0.9661         .   .   . YES 
beet            0.9839 0.0129 0.0032 0.0000       YES   .   .   . 
blackberry      0.9806 0.0131 0.0042 0.0020       YES   .   .   . 
blueberry       0.0026 0.0012 0.9913 0.0049         .   . YES   . 
broccoli        0.9857 0.0118 0.0017 0.0008       YES   .   .   . 
butternut       0.0251 0.0135 0.0000 0.9614         .   .  

There are three moonlighting genes: _grape, kohlrabi, and pepper._

Three of the components have 20 genes assigned; the other has 43. 

(Which happens to be the same as my simulation, because I used the same simulation code that generated Adler's data! That's a little dangerous. Better, I should be running simulations with different parameters, to be sure my NMF inferences are robust.)
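
To make that easier, the simulation at the top could be wrapped in a function that takes the parameters as arguments. Here's a sketch under some assumptions: `simulate` is a hypothetical name, and it uses a local `np.random.default_rng` generator instead of `np.random.seed`, so it won't reproduce the exact numbers on this page. It follows the same generative scheme otherwise, and needs R >= 3.

```python
import numpy as np

def simulate(N=100, M=60, R=4, X=20, Y=3, seed=None):
    """Simulate counts V (NxM), and return them with the true W (NxR) and H (RxM).

    Sets 0..R-2 get X genes each, sets 0 and 1 share Y genes, set R-1 gets the rest.
    """
    assert R >= 3 and 0 <= Y <= X and N > (R-1)*X - Y
    rng  = np.random.default_rng(seed)
    perm = rng.permutation(N)                    # random gene order
    S = [perm[0:X], perm[X-Y:2*X-Y]]             # sets 0 and 1 overlap by Y genes
    i = 2*X - Y
    for a in range(2, R-1):                      # sets 2..R-2 get X genes each
        S.append(perm[i:i+X])
        i += X
    S.append(perm[i:])                           # set R-1 gets the rest

    H = rng.dirichlet(np.ones(R), size=M).T      # RxM, columns are probability vectors
    W = np.zeros((N, R))
    for a in range(R):
        W[S[a], a] = rng.dirichlet(np.ones(len(S[a])))   # only genes in set a are expressed
    C = rng.integers(900, 1101, size=M) * N      # expected total counts per sample
    V = rng.poisson((W @ H) * C)                 # observed counts
    return V, W, H

# A smaller problem with different parameters than the homework's.
V2, W2, H2 = simulate(N=50, M=10, R=3, X=12, Y=2, seed=1)
```

Then something like `NMF_best(V2, 3, 3)` followed by `print_sets` would show whether the two moonlighters and the 12/12/28 set sizes are recovered.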