<h4> Latent Dirichlet Allocation for Topic Modeling </h4>
- LDA is one of the most popular topic modeling techniques
- it assumes each document is produced from a mixture of topics, and each topic generates words according to its own probability distribution
- given a dataset of documents, LDA works backwards and tries to infer which topics would have produced those documents in the first place
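A minimal sketch of the generative story described above, using only the standard library. The two topics, their word probabilities, and the helper name `generate_document` are hypothetical and chosen purely for illustration.

```python
import random

def generate_document(topic_mixture, topic_word_probs, length, rng):
    """Sample a document: pick a topic for each word slot, then a word from that topic."""
    topics = list(topic_mixture)
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's topic mixture.
        topic = rng.choices(topics, weights=[topic_mixture[t] for t in topics])[0]
        # Step 2: choose a word according to that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        word = rng.choices(vocab, weights=[topic_word_probs[topic][w] for w in vocab])[0]
        words.append(word)
    return words

# Hypothetical topics and probabilities (not learned from any data).
topic_word_probs = {
    "health": {"sugar": 0.5, "doctor": 0.3, "pressure": 0.2},
    "family": {"father": 0.5, "sister": 0.4, "school": 0.1},
}
topic_mixture = {"health": 0.7, "family": 0.3}

rng = random.Random(0)
doc = generate_document(topic_mixture, topic_word_probs, length=8, rng=rng)
print(doc)
```

LDA runs this process in reverse: it only sees documents like `doc` and has to estimate the mixture and the per-topic word probabilities.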


LDA can also be viewed as a matrix factorization technique
- In vector space, any corpus (collection of documents) can be represented as a document-term matrix

- the following matrix shows a corpus of N documents (D1, D2, ..., DN) and a vocabulary of M words (W1, W2, ..., WM)
    - the value of the (i, j) cell gives the frequency count of word Wj in document Di
![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/Modeling2.png)
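As a small illustration of the matrix above, the sketch below builds a dense document-term matrix from two made-up tokenized documents. The documents and the variable names are assumptions for the example only.

```python
from collections import Counter

# Two tiny tokenized documents (hypothetical).
docs = [
    ["sugar", "bad", "sugar"],
    ["father", "sister", "sugar"],
]

# Columns: the sorted vocabulary (W1..WM). Rows: documents (D1..DN).
vocab = sorted({word for doc in docs for word in doc})
matrix = [[Counter(doc)[word] for word in vocab] for doc in docs]

print(vocab)   # column labels
print(matrix)  # matrix[i][j] = count of vocab[j] in docs[i]
```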

- LDA converts this Document-Term Matrix into two lower dimensional matrices - M1 and M2
    - M1 is a document-topics matrix and M2 is a topic-terms matrix with dimensions (N, K) and (K, M) respectively
        - K is the number of topics
        - M is the vocabulary size
![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/modeling3.png)
![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/Modeling4.png)
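To make the dimensions concrete, here is a sketch with N = 2 documents, K = 2 topics, and M = 3 words. The numbers are invented, and `matmul` is a small helper written for this example. Multiplying a (N, K) document-topic matrix by a (K, M) topic-term matrix gives an (N, M) matrix of per-document word probabilities.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols), as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# M1: document-topic matrix, shape (N, K). Each row sums to 1.
doc_topic = [[0.9, 0.1],
             [0.2, 0.8]]

# M2: topic-term matrix, shape (K, M). Each row sums to 1.
topic_term = [[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8]]

# Result has shape (N, M); each row is a word distribution for one document.
doc_term_probs = matmul(doc_topic, topic_term)
for row in doc_term_probs:
    print(row)
```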

<h5> These 2 matrices provide topic-word and document-topic distributions </h5>
- LDA then refines these distributions using sampling techniques
- It iterates through each word "w" in each document "d" and tries to replace the word's current topic assignment with a new one
    - a new topic "k" is assigned to word "w" with a probability P, which is the product of two probabilities p1 and p2

<h3> For every topic, two probabilities are calculated: </h3>
- P1: p(topic t | document d)
    - the proportion of words in document d that are currently assigned to topic t
- P2: p(word w | topic t)
    - the proportion of assignments to topic t, over all documents, that come from word w
    
<h5> The current word's topic assignment is replaced by a new topic, chosen with probability given by the product of p1 and p2 </h5>
- In this step, the model assumes that all the existing word-topic assignments except the current word's are correct
- the product is essentially the probability that topic t generated word w, so it makes sense to resample the current word's topic using it

After a number of iterations, a steady state is reached in which the document-topic and topic-term distributions are fairly good
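The update for a single word can be sketched as follows. This is an illustrative version only: the count tables are made up, the function name `topic_weights` is hypothetical, and the `alpha` and `beta` smoothing terms are an assumption borrowed from the usual collapsed Gibbs sampling formulation, added here so that no topic gets a weight of exactly zero. The counts are assumed to already exclude the word being resampled.

```python
import random

def topic_weights(word, doc_topic_counts, topic_word_counts, topic_totals,
                  vocab_size, alpha, beta):
    """Return one unnormalized weight p1 * p2 per topic for a single word."""
    weights = []
    for k, total in enumerate(topic_totals):
        # p1: how strongly this document already uses topic k (smoothed count).
        p1 = doc_topic_counts[k] + alpha
        # p2: share of topic k's assignments that come from this word (smoothed).
        p2 = (topic_word_counts[k].get(word, 0) + beta) / (total + beta * vocab_size)
        weights.append(p1 * p2)
    return weights

# Hypothetical counts for one document and two topics.
doc_topic_counts = [3, 0]                           # words in this doc per topic
topic_word_counts = [{"sugar": 4}, {"father": 5}]   # corpus-wide word counts per topic
topic_totals = [4, 5]                               # total assignments per topic

weights = topic_weights("sugar", doc_topic_counts, topic_word_counts,
                        topic_totals, vocab_size=2, alpha=0.1, beta=0.01)

# Draw the new topic with probability proportional to the weights.
rng = random.Random(0)
new_topic = rng.choices(range(len(weights)), weights=weights)[0]
print(weights, new_topic)
```

Because the document already leans on topic 0 and "sugar" appears often in topic 0, that topic receives by far the larger weight.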

<h2> Parameters of LDA: </h2>

<h6> Alpha and Beta Parameters </h6>
- Alpha represents document-topic density
    - higher value: documents are composed of more topics
    - lower value: documents contain fewer topics
- Beta represents topic-word density
    - higher value: topics are composed of a large number of words in the corpus
    - lower value: topics are composed of fewer words
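
The effect of alpha can be illustrated by sampling document-topic mixtures from a symmetric Dirichlet prior. This is a standard-library sketch. The helper names, the choice of 5 topics, and the sample count are assumptions for the example. A Dirichlet draw is obtained by normalizing independent gamma draws.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_largest_share(alpha, num_topics=5, samples=500, seed=0):
    """Average share taken by the single most dominant topic in each mixture."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, num_topics, rng))
               for _ in range(samples)) / samples

low_alpha = average_largest_share(0.1)    # mixtures dominated by few topics
high_alpha = average_largest_share(10.0)  # mixtures spread across many topics
print(low_alpha, high_alpha)
```

With a low alpha the dominant topic takes most of each document. With a high alpha the shares are closer to uniform. Beta plays the same role for the word distribution of each topic.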
    
<h6> Number of Topics </h6>
- number of topics to be extracted from the corpus
- example approach: Kullback-Leibler divergence score
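
For reference, the Kullback-Leibler divergence between two discrete distributions can be computed as below. This sketch only shows the divergence itself. How it is turned into a score for choosing the number of topics is not specified here, and the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.5, 0.5]
skewed = [0.9, 0.1]

same = kl_divergence(uniform, uniform)    # identical distributions
different = kl_divergence(uniform, skewed)
print(same, different)
```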

<h6> Number of Topic Terms </h6>
- number of terms used to describe a single topic
- If you want to extract themes or concepts, a higher number is better
- If you want to extract features or terms, a lower number is better

<h1> Example Model </h1>

In [3]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

In [4]:
doc_all = [doc1, doc2, doc3, doc4, doc5]  # compile documents

<h3> Cleaning and Preprocessing </h3>
- remove punctuation and stopwords, then normalize (lemmatize) the corpus

In [5]:
# requires the NLTK data packages: nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [10]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

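def clean(doc):
    """Lowercase, drop stopwords and punctuation, then lemmatize each word."""
    # drop stopwords (tokens are still attached to any trailing punctuation here)
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    # strip punctuation characters
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    # reduce each word to its lemma, e.g. "likes" -> "like"
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized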

doc_clean = [clean(doc).split() for doc in doc_all]

In [14]:
clean(doc1)

'sugar bad consume sister like sugar father'

<h3> Preparing Document-Term Matrix </h3>
- all the text documents combined are known as the corpus
- to run any mathematical model on a text corpus, it is good practice to convert it into a matrix representation
- LDA models look for repeating term patterns in the entire DT matrix

In [12]:
import gensim
from gensim import corpora

In [15]:
# creating the term dictionary for our corpus
dictionary = corpora.Dictionary(doc_clean)

In [20]:
# converting corpus into Document Term Matrix using above dictionary
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
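
Note that `doc2bow` does not return a dense matrix row. It returns a sparse list of `(token_id, count)` pairs for the tokens present in the document. The sketch below is a standard-library stand-in that mimics that shape. The helpers `build_token_ids` and `to_bow` are hypothetical, and the ids are assigned by first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_ids = {}
    for doc in docs:
        for token in doc:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bow(doc, token_ids):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = Counter(token_ids[t] for t in doc if t in token_ids)
    return sorted(counts.items())

# Tokenized documents in the same form as doc_clean (shortened, hypothetical).
docs = [
    ["sugar", "bad", "consume", "sister", "like", "sugar", "father"],
    ["father", "sister", "dance"],
]
token_ids = build_token_ids(docs)
bow = to_bow(docs[0], token_ids)
print(bow)
```

Only tokens that occur in a document appear in its list, which keeps the representation compact for large vocabularies.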

<h3> Running LDA Model </h3>
- need to create an object for the LDA model and train it on the Document-Term Matrix

In [23]:
# Creating the object for LDA using gensim
Lda = gensim.models.ldamodel.LdaModel

In [24]:
# training the LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

In [27]:
print(ldamodel.print_topics(num_topics=3, num_words=3))
# each tuple is one topic: its id, then its top terms with their weights

[(0, '0.071*"pressure" + 0.041*"feel" + 0.041*"school"'), (1, '0.085*"sister" + 0.085*"father" + 0.084*"sugar"'), (2, '0.075*"say" + 0.075*"lifestyle" + 0.075*"health"')]
