## LDA (Latent Dirichlet Allocation)

LDA (Latent Dirichlet Allocation) is an unsupervised machine learning algorithm that extracts hidden themes, or topics, from a given set of documents.


### Steps used in this Algorithm

1.  Import all the necessary libraries

2.  Define the Sample Dataset

3.  Perform Text Vectorization (Bag of Words) with TfidfVectorizer

4.  Apply LDA Model

5.  Display the Topics

6.  Predict Topic for Each Document
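The steps above can be condensed into one small helper. This is only a sketch: the function name `fit_topics` and its parameters are made up for illustration and are not part of scikit-learn.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_topics(docs, n_topics=2, seed=42):
    """Hypothetical helper: vectorize docs, fit LDA, return the fitted pieces."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse (n_documents, n_terms) matrix
    model = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = model.fit_transform(X)  # (n_documents, n_topics) matrix
    return vectorizer, model, doc_topic


docs = [
    "I love to eat pizza and pasta",
    "Python is great for data science and machine learning",
    "The pizza at that restaurant was delicious",
]
vectorizer, model, doc_topic = fit_topics(docs)
print(doc_topic.shape)  # (number of documents, number of topics)
```

Returning the fitted vectorizer together with the model makes it possible to transform new text later with the same vocabulary.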

### Step 1:  Import all the necessary libraries

In [208]:
import numpy             as  np
import pandas            as  pd
import matplotlib.pyplot as plt
import seaborn           as sns

from   sklearn.feature_extraction.text import TfidfVectorizer
from   sklearn.decomposition           import LatentDirichletAllocation

### OBSERVATIONS:

1.  numpy ---------------> numerical array computation

2.  pandas --------------> data manipulation

3.  matplotlib ----------> data visualization

4.  seaborn -------------> statistical data visualization (built on matplotlib)

5.  TfidfVectorizer -----> converts text into a matrix of numerical TF-IDF scores

6.  LatentDirichletAllocation -----> discovers topics, or hidden themes, in a collection of documents

### Step 2: Define the Sample Dataset

In [209]:
documents = [
    "I love to eat pizza and pasta",
    "The new movie was fantastic and thrilling",
    "Python is great for data science and machine learning",
    "I enjoy watching movies and eating popcorn",
    "Data scientists use Python for analysis and modeling",
    "The pizza at that restaurant was delicious",
]

In [210]:
documents

['I love to eat pizza and pasta',
 'The new movie was fantastic and thrilling',
 'Python is great for data science and machine learning',
 'I enjoy watching movies and eating popcorn',
 'Data scientists use Python for analysis and modeling',
 'The pizza at that restaurant was delicious']

### OBSERVATIONS:

1. The corpus is a list of sentences that is fed into the vectorizer, which converts each sentence into a numerical vector.

2. A machine learning model can then be trained on these numerical vectors.

### Step 3: Perform Text Vectorization (Bag of Words) with TfidfVectorizer

In [211]:
### Create the TfidfVectorizer object

tfidf = TfidfVectorizer()


### Learn the vocabulary and IDF weights from the documents, then transform them

X_vectorized = tfidf.fit_transform(documents)

In [212]:
X_vectorized

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 43 stored elements and shape (6, 33)>

In [213]:
### convert the sparse matrix into numpy array for better view

X_array = X_vectorized.toarray()

In [214]:
X_array

array([[0.        , 0.23062572, 0.        , 0.        , 0.        ,
        0.4501536 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.4501536 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.4501536 ,
        0.36913239, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.4501536 ,
        0.        , 0.        , 0.        ],
       [0.        , 0.21635609, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.422301  , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.422301  , 0.        , 0.422301  , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.34629286, 0.422301  , 0.        ,
        0.        , 0.34629286, 0.        ],
       [0.        , 0.18988419, 0.        , 0.30392276, 0.        ,
        0.        , 0.    

### OBSERVATIONS:

1. The input text is converted into a sparse TF-IDF matrix of shape (6, 33): 6 documents by 33 vocabulary terms, with 43 non-zero entries.

2. The sparse matrix is converted into a dense NumPy array only to make it easier to inspect.

### Step 4: Apply LDA Model

In [215]:
### Create the LDA object with two topics; random_state makes the result reproducible

lda = LatentDirichletAllocation(n_components=2, random_state=42)

In [216]:
### using the object of lda, train the model

lda.fit(X_array)

### OBSERVATIONS:

1. An object of Latent Dirichlet Allocation (LDA) is created.

2. Because n_components is 2, the model discovers two topics in the given list of documents.

3. The model is trained on the TF-IDF array of the documents.

### Step 5:  Display the Topics

In [217]:
### Get the vocabulary (feature names) learned from the documents

words = tfidf.get_feature_names_out()

In [218]:
words

array(['analysis', 'and', 'at', 'data', 'delicious', 'eat', 'eating',
       'enjoy', 'fantastic', 'for', 'great', 'is', 'learning', 'love',
       'machine', 'modeling', 'movie', 'movies', 'new', 'pasta', 'pizza',
       'popcorn', 'python', 'restaurant', 'science', 'scientists', 'that',
       'the', 'thrilling', 'to', 'use', 'was', 'watching'], dtype=object)

In [219]:
lda.components_

array([[0.51080001, 0.72423887, 0.90039821, 0.51159214, 0.90039821,
        0.94162535, 0.51332734, 0.51332734, 0.51863425, 0.51159214,
        0.51016456, 0.51016456, 0.51016456, 0.94162535, 0.51016456,
        0.51080001, 0.51863425, 0.51332734, 0.51863425, 0.94162535,
        1.19480172, 0.51332734, 0.51159214, 0.90039821, 0.51016456,
        0.51080001, 0.90039821, 0.8648026 , 0.51863425, 0.94162535,
        0.51080001, 0.86480215, 0.51332734],
       [0.88825141, 1.34040429, 0.50726379, 1.11955849, 0.50726379,
        0.50852825, 0.92259069, 0.92259069, 0.90366675, 1.11955849,
        0.86046649, 0.86046649, 0.86046649, 0.50852825, 0.86046649,
        0.88825141, 0.90366675, 0.92259069, 0.90366675, 0.50852825,
        0.50861935, 0.92259069, 1.11955849, 0.50726379, 0.86046649,
        0.88825141, 0.50726379, 0.81577894, 0.90366675, 0.50852825,
        0.88825141, 0.81577939, 0.92259069]])

In [220]:
### Get the top five words for each topic (argsort is ascending, so the highest-weighted word is printed last)

for x , y in enumerate(lda.components_):
    print(f"Topic:{x+1}")
    print([words[j] for j in y.argsort()[-5:]])

Topic:1
['pasta', 'love', 'to', 'eat', 'pizza']
Topic:2
['popcorn', 'data', 'python', 'for', 'and']


### OBSERVATIONS:

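1. It prints the five highest-weighted words for each topic, ordered from lowest to highest weight, so the last word in each list is the strongest one.

2. Topic 1 is dominated by food words, while Topic 2 mixes data science words with the common words "for" and "and".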

### Step 6:  Predict Topic for Each Document

In [221]:
topic_distribution = lda.transform(X_array)

print(topic_distribution)

for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}: {doc}")
    print(f"Topic Distribution: {topic_distribution[i]}")
    print(f"Predicted Topic: {topic_distribution[i].argmax() + 1}")

[[0.83077895 0.16922105]
 [0.18895813 0.81104187]
 [0.1448701  0.8551299 ]
 [0.16891665 0.83108335]
 [0.1501231  0.8498769 ]
 [0.84408263 0.15591737]]

Document 1: I love to eat pizza and pasta
Topic Distribution: [0.83077895 0.16922105]
Predicted Topic: 1

Document 2: The new movie was fantastic and thrilling
Topic Distribution: [0.18895813 0.81104187]
Predicted Topic: 2

Document 3: Python is great for data science and machine learning
Topic Distribution: [0.1448701 0.8551299]
Predicted Topic: 2

Document 4: I enjoy watching movies and eating popcorn
Topic Distribution: [0.16891665 0.83108335]
Predicted Topic: 2

Document 5: Data scientists use Python for analysis and modeling
Topic Distribution: [0.1501231 0.8498769]
Predicted Topic: 2

Document 6: The pizza at that restaurant was delicious
Topic Distribution: [0.84408263 0.15591737]
Predicted Topic: 1


### OBSERVATIONS:

1. X_array is the TF-IDF matrix of the input documents.

2. lda.transform is applied to it to get the topic probabilities for every document.

3. The resulting matrix has shape (n_documents, n_topics), here (6, 2).

4. Each row of the matrix corresponds to a document and each column corresponds to a topic.

5. Each cell holds the estimated proportion of that topic in that document, so every row sums to 1. The predicted topic is the column with the largest value.

Example:

The topic matrix above reads as follows:

- Document 1 is 83% Topic 1 and 17% Topic 2.

- Document 2 is 19% Topic 1 and 81% Topic 2.

- Document 3 is 14% Topic 1 and 86% Topic 2.

- Document 4 is 17% Topic 1 and 83% Topic 2.

- Document 5 is 15% Topic 1 and 85% Topic 2.

- Document 6 is 84% Topic 1 and 16% Topic 2.
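The same two calls can score text the model has not seen. The sketch below assumes a made-up sentence in `new_docs`. The key point is to reuse the fitted vectorizer with `transform` rather than `fit_transform`, so the new text is mapped onto the vocabulary the model was trained on.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I love to eat pizza and pasta",
    "The new movie was fantastic and thrilling",
    "Python is great for data science and machine learning",
    "I enjoy watching movies and eating popcorn",
    "Data scientists use Python for analysis and modeling",
    "The pizza at that restaurant was delicious",
]

tfidf = TfidfVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(tfidf.fit_transform(docs))

# Hypothetical unseen document, used only for illustration.
new_docs = ["I ate pasta at the restaurant"]

# transform (not fit_transform) keeps the vocabulary learned from docs.
new_X = tfidf.transform(new_docs)
new_dist = lda.transform(new_X)          # shape (1, n_topics), row sums to 1
predicted = new_dist.argmax(axis=1) + 1  # 1-based topic number, as in the loop above

print(new_dist.round(2), predicted)
```

Words that are not in the training vocabulary, such as "ate" here, are ignored by the vectorizer.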