# Lab 3 - NumPy Arrays

Latent Dirichlet Allocation (LDA) is a model used to represent bodies of text. It is a hierarchical, probabilistic, generative model that represents each document in a collection as a mixture of topics. Each topic is, in turn, a mixture of words.

For more information, see: Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. The Journal of Machine Learning Research. 2003;3:993-1022.


Consider the following version of the LDA generative process, which records the words in
each document as well as which topic produced each word.

Run this code to create your own document corpus. Note that the generative process is stochastic, so your corpus will be different from everyone else's, and if you run the code again, you will get a different corpus.

In [6]:
import numpy as np

# the vocabulary has 2000 distinct words; alpha is the Dirichlet parameter over words
alpha = np.full(2000, .1)

# there are 100 topics; beta is the Dirichlet parameter over topics
beta = np.full(100, .1)

# this gets us the probability of each word happening in each of the 100 topics
wordsInTopic = np.random.dirichlet(alpha, 100)

# produced[doc, topic, word] gives us the number of times that the given word was
# produced by the given topic in the given doc
produced = np.zeros((50, 100, 2000))

# generate each of the 50 docs
for doc in range(50):
    #
    # get the topic probabilities for this doc
    topicsInDoc = np.random.dirichlet(beta)
    #
    # assign each of the 2000 words in this doc to a topic
    wordsToTopic = np.random.multinomial(2000, topicsInDoc)
    #
    # and generate each of the 2000 words
    for topic in range(100):
        produced[doc, topic] = np.random.multinomial(wordsToTopic[topic], wordsInTopic[topic])
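
Because every run gives a different corpus, it can help to repeat the same process at a small scale with a fixed seed and check the basic invariants. The sketch below is not part of the lab: the sizes, the seed, and the names `n_docs`, `n_topics`, `n_vocab`, `words_per_doc`, and `small` are assumptions made for illustration, and it uses `np.random.default_rng` instead of the legacy `np.random` functions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable

# hypothetical small sizes, chosen only to keep the example quick to inspect
n_docs, n_topics, n_vocab, words_per_doc = 5, 4, 30, 200

# per-topic word distributions, shape (n_topics, n_vocab)
words_in_topic = rng.dirichlet(np.full(n_vocab, 0.1), n_topics)

small = np.zeros((n_docs, n_topics, n_vocab))
for doc in range(n_docs):
    # topic probabilities for this doc
    topics_in_doc = rng.dirichlet(np.full(n_topics, 0.1))
    # how many of this doc's words each topic is responsible for
    words_to_topic = rng.multinomial(words_per_doc, topics_in_doc)
    for topic in range(n_topics):
        small[doc, topic] = rng.multinomial(words_to_topic[topic], words_in_topic[topic])

print(small.shape)             # (5, 4, 30)
print(small.sum(axis=(1, 2)))  # one total per document, each equal to words_per_doc
```

Summing over the topic and word axes gives one total per document, which should always equal the number of words drawn for that document.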

As described in the comments,
```
  produced [doc, topic, word]
```  
gives the number of times that the given word was produced by the given topic in the given document. You need to complete the five tasks where we have not given an answer, and then show your answers in order to get checked off:

(1) Write a line of code that computes the number of words produced by topic 17 in
document 18.

In [35]:
produced[18,17,:].sum ()

0.0

(2) Write a line of code that computes the number of words produced by topics 17 through 45 in document 18.

In [72]:
produced[18,17:46,:].sum ()

1112.0
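
The slice in (2) stops at 46 because the end of a NumPy slice is exclusive. A minimal sketch on a hypothetical toy array (the name `toy` and its sizes are assumptions, not part of the lab) shows the off-by-one that results from stopping one index too early:

```python
import numpy as np

# hypothetical toy array: 2 docs, 6 topics, 3 words, every count equal to 1
toy = np.ones((2, 6, 3))

# topics 2 through 4 inclusive in document 1: the slice has to stop at 5
inclusive = toy[1, 2:5, :].sum()   # 3 topics * 3 words
too_short = toy[1, 2:4, :].sum()   # only topics 2 and 3

print(inclusive, too_short)        # 9.0 6.0
```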

(3) Write a line of code that computes the number of words in the entire corpus.

In [37]:
produced.sum()

100000.0

(4) Write a line of code that computes the number of words in the entire corpus produced by topic 17.

In [47]:
produced[:,17,:].sum ()

597.0

(5) Write a line of code that computes the number of words in the entire corpus
produced by topic 17 or topic 23.

In [42]:
produced[:, [17, 23], :].sum()

1333.0

(6) Write a line of code that computes the number of words in the entire corpus
produced by even-numbered topics.

In [43]:
produced[:, ::2, :].sum()

48724.0

(7) Write a line of code that computes the number of times each word was produced by topic 15.

In [94]:
produced[:,15,:].sum(0).nonzero()

(array([  11,   30,   36,   39,   69,   71,   74,   76,   84,   85,   88,
         102,  109,  116,  121,  126,  130,  157,  168,  171,  172,  178,
         182,  199,  207,  209,  227,  228,  230,  239,  240,  251,  271,
         274,  282,  290,  295,  318,  322,  335,  336,  344,  347,  370,
         444,  452,  457,  461,  462,  470,  488,  498,  504,  508,  518,
         522,  527,  531,  541,  544,  555,  558,  585,  589,  596,  598,
         605,  612,  617,  620,  623,  641,  649,  650,  659,  665,  666,
         683,  691,  694,  701,  705,  719,  729,  731,  733,  746,  749,
         770,  775,  784,  790,  796,  801,  808,  814,  820,  836,  840,
         846,  882,  889,  890,  899,  900,  902,  906,  912,  925,  930,
         950,  976,  977,  992,  998, 1007, 1015, 1027, 1028, 1041, 1043,
        1060, 1067, 1074, 1077, 1084, 1088, 1098, 1100, 1113, 1114, 1125,
        1127, 1136, 1137, 1140, 1156, 1169, 1173, 1177, 1209, 1211, 1212,
        1216, 1219, 1220, 1225, 1229, 
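
Note that `.nonzero()` returns the indices of the words that topic 15 produced at least once, not how many times each word was produced. Dropping `.nonzero()` leaves the per-word counts themselves. The sketch below shows both results side by side on a hypothetical toy array (`toy` and its contents are assumptions for illustration):

```python
import numpy as np

# hypothetical toy array: 2 docs, 3 topics, 5 words
toy = np.zeros((2, 3, 5))
toy[0, 1, 2] = 4   # doc 0, topic 1 produced word 2 four times
toy[1, 1, 2] = 1   # doc 1, topic 1 produced word 2 once
toy[1, 1, 4] = 3   # doc 1, topic 1 produced word 4 three times

counts = toy[:, 1, :].sum(0)    # per-word totals for topic 1, summed over docs
present = counts.nonzero()[0]   # indices of the words with a nonzero total

print(counts)    # [0. 0. 5. 0. 3.]
print(present)   # [2 4]
```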


(8) Write a line of code that computes the topic responsible for the most instances of each word in the corpus.

In [98]:
produced.sum(0).argmax(0)

array([41, 80,  0, ..., 11, 41, 82])
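
In (8), summing over axis 0 collapses the documents and leaves an array of shape (topics, words); `argmax(0)` then picks, for each word, the topic with the largest total. A minimal sketch on a hypothetical toy array (`toy` and its counts are assumptions for illustration):

```python
import numpy as np

# hypothetical toy array: 2 docs, 3 topics, 4 words
toy = np.zeros((2, 3, 4))
toy[0, 2, 0] = 5   # word 0: topic 2 has 5, topic 0 has 1
toy[1, 0, 0] = 1
toy[0, 0, 1] = 3   # word 1: topic 0 has 3, topic 1 has 2 + 2 = 4
toy[0, 1, 1] = 2
toy[1, 1, 1] = 2
toy[1, 0, 2] = 7   # word 2: only topic 0 produced it
                   # word 3: never produced by any topic

per_topic_word = toy.sum(0)            # shape (3, 4): topics x words
top_topic = per_topic_word.argmax(0)   # best topic for each word

print(top_topic)   # [2 1 0 0]
```

Word 3 was never produced, so all three topics tie at zero; `argmax` resolves ties by returning the first (lowest) index, which is why it reports topic 0.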

(9) Write a line of code that, for each topic, computes the maximum number of occurrences (summed over all documents) of any word that it was responsible for.

In [5]:
produced[:,np.arange(0,100,1),produced.sum(0).argmax(1)].sum(0)

array([26., 27., 19., 20., 19., 16., 19.,  7., 30., 39., 24., 32., 16.,
       22., 38., 28., 17., 64., 17., 29., 21., 22., 30., 17., 13., 21.,
       39., 16., 24., 20., 34., 13., 25., 26., 32., 16., 57., 19., 16.,
       32., 21., 35., 15., 15., 19., 20., 18., 11., 26., 19., 19., 21.,
       12., 29., 54., 25., 14., 23., 49., 12., 18., 34., 31., 26., 21.,
       21., 18., 25., 16., 27., 18., 16., 13., 17., 25., 22.,  6., 42.,
       27., 17., 33., 18., 26., 17., 16., 24., 37., 45., 19., 38., 45.,
       21., 26., 36., 31., 43., 19., 19., 29., 38.])
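
The indexing expression in (9) finds, for each topic, the word with the largest total via `argmax`, then sums that word's counts over the documents. This should give the same values as taking the row-wise maximum of the summed counts directly. The sketch below checks that on a hypothetical random toy array; `toy`, its sizes, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption) for repeatability

# hypothetical toy array: 3 docs, 4 topics, 6 words, small integer counts
toy = rng.integers(0, 5, size=(3, 4, 6)).astype(float)

totals = toy.sum(0)   # shape (4, 6): per-topic, per-word totals over all docs

# same pattern as the answer above: pick each topic's argmax word, then sum over docs
via_index = toy[:, np.arange(4), totals.argmax(1)].sum(0)

# shorter form: take the maximum of the totals along the word axis
via_max = totals.max(1)

print(np.array_equal(via_index, via_max))   # True
```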