# Topic Modelling with LDA
#### A really trivial exercise to warm up with LDA


#### Modules to import

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#### Data Set up
- For simplicity, the corpus has already been prepared as a list of documents.
- Ideally, you would preprocess the input text first, for example by removing stopwords and punctuation.
- Because we are using scikit-learn, the corpus must be split into individual documents (each may contain more than one sentence).
- Each document is represented as a string.
- The corpus (to be vectorized) is a list of these documents.
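
The preprocessing mentioned in the list above can be sketched in a few lines. This is only an illustration: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical names, not part of this notebook or of scikit-learn, and a real project would likely use a fuller stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "is", "for", "the"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords from one document."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

raw_docs = ["Cats eat rice.", "Daddy cat is a super agent!"]
corpus_clean = [preprocess(d) for d in raw_docs]
print(corpus_clean)
```

Each cleaned string can then be placed in the corpus list exactly like the hand-written documents in the next cell.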

In [2]:
doc1 = "cats eat rice"
doc2 = "Daddy cat is a super agent "
doc3 = "Mummy cat goes shopping for baby cat food"
corpus = [doc1, doc2, doc3]

#### Let us vectorize now

In [3]:

tf_vectorizer = CountVectorizer()
doc_terms_matrix = tf_vectorizer.fit_transform(corpus)

In [4]:
## answer: 
doc_terms_matrix = tf_vectorizer.fit_transform(corpus)

#### To see the feature names (terms)

In [5]:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out().tolist()
print(tf_feature_names)

['agent', 'baby', 'cat', 'cats', 'daddy', 'eat', 'food', 'for', 'goes', 'is', 'mummy', 'rice', 'shopping', 'super']


#### To see the document-term matrix using pandas. This is purely for inspection; the pandas version of the document-term matrix (DTM) is not used later.


In [6]:
pd.DataFrame(doc_terms_matrix.toarray(), columns=tf_vectorizer.get_feature_names_out())

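|   | agent | baby | cat | cats | daddy | eat | food | for | goes | is | mummy | rice | shopping | super |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |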
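
To see where these counts come from, the matrix can be rebuilt by hand. The sketch below assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters), which is why the one-letter word "a" in doc2 never appears as a column. The `tokenize` helper is hypothetical and written only for this illustration.

```python
import re
from collections import Counter

corpus = [
    "cats eat rice",
    "Daddy cat is a super agent ",
    "Mummy cat goes shopping for baby cat food",
]

def tokenize(doc):
    # Mirrors CountVectorizer's default token pattern: 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))        # alphabetical, like the columns
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Note that "cat" and "cats" stay separate columns, because no stemming or lemmatization is applied.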



#### Now the real work starts: fit the LDA model on the count matrix and transform it into document-topic weights

In [None]:

no_topics = 6
lda_model = LatentDirichletAllocation(n_components=??how many topics as a first guess???)
document_topics = lda_model.fit_transform(???what do you want to fit and transform???)

In [7]:
#Answer:
no_topics = 6
lda_model = LatentDirichletAllocation(n_components=no_topics)
document_topics = lda_model.fit_transform(doc_terms_matrix)

#### Let us take a look at the document-topics

In [8]:
print(document_topics.shape)
document_topics

(3, 6)


array([[0.04166989, 0.04166727, 0.04166989, 0.04166754, 0.79165553,
        0.04166989],
       [0.02778068, 0.0278857 , 0.02778068, 0.86099338, 0.02777889,
        0.02778068],
       [0.01852084, 0.90736427, 0.01852084, 0.0185538 , 0.01851941,
        0.01852084]])

In [9]:
sum(document_topics)

array([0.0879714 , 0.97691725, 0.0879714 , 0.92121472, 0.83795382,
       0.0879714 ])

##### Reflection time:
- Now we know that the original corpus has ????? documents and ????? terms.
- We also know that the document at index 1 has its heaviest weight on topic No. ?????
#### Let us take a look at the topics-terms

In [11]:
topic_terms = lda_model.components_
print ("Shape is : ", topic_terms.shape)
topic_terms

Shape is :  (6, 14)


array([[0.16667015, 0.16666972, 0.16667273, 0.16667096, 0.16667015,
        0.16667096, 0.16666972, 0.16666972, 0.16666972, 0.16667015,
        0.16666972, 0.16667096, 0.16666972, 0.16667015],
       [0.16666734, 1.16665551, 2.16701035, 0.16666748, 0.16666734,
        0.16666748, 1.16665551, 1.16665551, 1.16665551, 0.16666734,
        1.16665551, 0.16666748, 1.16665551, 0.16666734],
       [0.16667015, 0.16666972, 0.16667273, 0.16667096, 0.16667015,
        0.16667096, 0.16666972, 0.16666972, 0.16666972, 0.16667015,
        0.16666972, 0.16667096, 0.16666972, 0.16667015],
       [1.16665423, 0.1666675 , 1.16630246, 0.16666783, 1.16665423,
        0.16666783, 0.1666675 , 0.1666675 , 0.1666675 , 1.16665423,
        0.1666675 , 0.16666783, 0.1666675 , 1.16665423],
       [0.166668  , 0.16666783, 0.16666899, 1.16665182, 0.166668  ,
        1.16665182, 0.16666783, 0.16666783, 0.16666783, 0.166668  ,
        0.16666783, 1.16665182, 0.16666783, 0.166668  ],
       [0.16667015, 0.16666972, 0.1

##### Reflection: Now, we know that the corpus is made up of _____ topics.
##### Each topic is a collection of ____ terms with different weights
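
Each row of topic_terms holds pseudo-counts rather than probabilities; dividing a row by its sum turns it into a distribution over the terms. As a small sketch, the row for topic 1 is copied from the output above and paired with the feature names to rank its terms.

```python
terms = ['agent', 'baby', 'cat', 'cats', 'daddy', 'eat', 'food',
         'for', 'goes', 'is', 'mummy', 'rice', 'shopping', 'super']

# Row for topic 1, copied from the topic_terms output above.
topic_1 = [0.16666734, 1.16665551, 2.16701035, 0.16666748, 0.16666734,
           0.16666748, 1.16665551, 1.16665551, 1.16665551, 0.16666734,
           1.16665551, 0.16666748, 1.16665551, 0.16666734]

total = sum(topic_1)
probabilities = [w / total for w in topic_1]      # now sums to 1

# Rank terms from most to least probable within this topic.
ranked = sorted(zip(terms, probabilities), key=lambda pair: pair[1], reverse=True)
for term, p in ranked[:7]:
    print(f"{term:10s} {p:.3f}")
```

On the fitted model, the same normalization for all topics at once is lda_model.components_ / lda_model.components_.sum(axis=1)[:, None].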

#### Questions:
1. How many documents are there in the original corpus?
2. What is the number of terms that are extracted?
3. Does the number of rows in document_topics correspond to the number of documents?
4. Does the number of rows in topic_terms correspond to the number of topics (n_components)?
5. Does the number of columns in document_topics correspond to the number of rows in topic_terms?

#### Exercise:
1. Change the number of topics (e.g. set no_topics = 3) and rerun the notebook.
2. Discuss how topic_terms (lda_model.components_) and document_topics change.
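
For the exercise, one option is to loop over a few topic counts and compare the shapes instead of editing the cells by hand. This is a sketch, assuming scikit-learn is installed; the values tried and the fixed random_state=0 (added so that reruns give the same numbers) are assumptions made here, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "cats eat rice",
    "Daddy cat is a super agent ",
    "Mummy cat goes shopping for baby cat food",
]
doc_terms_matrix = CountVectorizer().fit_transform(corpus)

shapes = {}
for no_topics in (2, 3, 6):
    # random_state is an added assumption for reproducibility.
    lda_model = LatentDirichletAllocation(n_components=no_topics, random_state=0)
    document_topics = lda_model.fit_transform(doc_terms_matrix)
    topic_terms = lda_model.components_
    shapes[no_topics] = (document_topics.shape, topic_terms.shape)
    print(no_topics, document_topics.shape, topic_terms.shape)
```

Only the topic dimension changes between runs; the number of documents and the number of terms are fixed by the corpus.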

#### Reference:
- www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/
