### Filename: topic_modeling_lvl-1.ipynb
### Author: Nikhil Singh
### Date: September 17, 2019, Tuesday

#### Details: The first notebook in a series on topic modeling in Python. A starting point for learning topic modeling as a part of **NLP**.

Reference: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

- Topic modeling differs from rule-based text mining approaches that use regular expressions or dictionary-based keyword searches. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of text.
- **Topics**: Repeating patterns of co-occurring terms in a corpus.
- Can be used for *Document Clustering*
- Can be used for organizing textual data

### Part-1: Latent Dirichlet Allocation for Topic Modeling
- LDA can be viewed as a matrix factorization technique.
- We get two lower-dimensional matrices from the document-term matrix of a corpus.
- If the document-term matrix has dimension [M X N], the two smaller matrices have dimensions [M X A] and [A X N].
- M is the number of documents, A is the number of topics, and N is the vocabulary size (the number of unique words across the documents).
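
The way the dimensions line up can be checked with a small sketch. It uses plain Python lists and made-up numbers (M = 3, A = 2, N = 4 and all matrix values are assumptions chosen for illustration), and it only shows how the shapes combine, not how LDA estimates the two matrices.

```python
# Illustrative sizes only (not taken from the corpus used in this notebook).
M, A, N = 3, 2, 4  # documents, topics, vocabulary terms

# Document-topic matrix [M x A]: each row is one document's topic mixture.
doc_topic = [
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]

# Topic-term matrix [A x N]: each row is one topic's distribution over terms.
topic_term = [
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]

# The product has the same shape as the document-term matrix, [M x N].
doc_term = matmul(doc_topic, topic_term)
print(len(doc_term), "x", len(doc_term[0]))
```

Each row of the product is a mixture of the topic rows, weighted by that document's topic proportions.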

In [21]:
!pip install gensim

^C


In [22]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import string

import gensim
from gensim import corpora



In [13]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [14]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [15]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [16]:
doc_complete

['Sugar is bad to consume. My sister likes to have sugar, but not my father.',
 'My father spends a lot of time driving my sister around to dance practice.',
 'Doctors suggest that driving may cause increased stress and blood pressure.',
 'Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.',
 'Health experts say that Sugar is not good for your lifestyle.']

In [17]:
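def clean(doc):
    """Lowercase a document, drop stop words and punctuation, and lemmatize."""
    # Remove English stop words from the lowercased, whitespace-split text
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation character by character
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized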
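
One detail of the cleaning order may be worth knowing: stop words are filtered before punctuation is stripped, so a stop word with punctuation attached (such as `it?`) is not matched and survives. The sketch below illustrates this with the standard library only. The tiny stop list, the sample sentence, and both helper function names are hypothetical, and lemmatization is left out.

```python
import string

# Tiny illustrative stop list (an assumption; the notebook uses NLTK's English list).
stop = {"it", "is", "to"}
exclude = set(string.punctuation)

def stop_then_punct(text):
    """Same order as clean(): filter stop words first, then strip punctuation."""
    kept = " ".join(w for w in text.lower().split() if w not in stop)
    return "".join(ch for ch in kept if ch not in exclude).split()

def punct_then_stop(text):
    """Alternative order: strip punctuation first, then filter stop words."""
    stripped = "".join(ch for ch in text.lower() if ch not in exclude)
    return [w for w in stripped.split() if w not in stop]

sample = "Sugar, is it? Bad to consume it."
print(stop_then_punct(sample))  # ['sugar', 'it', 'bad', 'consume', 'it']
print(punct_then_stop(sample))  # ['sugar', 'bad', 'consume']
```

For the five documents in this notebook the difference does not show up in the cleaned output, but it can matter on noisier text.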

In [18]:
doc_clean = [clean(doc).split() for doc in doc_complete]

In [19]:
print('\n\nCleaned Data\n\n')
print(doc_clean)



Cleaned Data


[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'], ['father', 'spends', 'lot', 'time', 'driving', 'sister', 'around', 'dance', 'practice'], ['doctor', 'suggest', 'driving', 'may', 'cause', 'increased', 'stress', 'blood', 'pressure'], ['sometimes', 'feel', 'pressure', 'perform', 'well', 'school', 'father', 'never', 'seems', 'drive', 'sister', 'better'], ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]


In [24]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
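
To make the bag-of-words step concrete, here is a minimal standard-library sketch of the same idea: map each unique term to an integer id, then represent each document as a list of (token_id, count) pairs. The `token2id` mapping and the `to_bow` helper are hypothetical names, and the ids gensim assigns may differ from the order-of-first-appearance ids used here.

```python
from collections import Counter

# Two of the cleaned documents from above.
docs = [
    ["sugar", "bad", "consume", "sister", "like", "sugar", "father"],
    ["health", "expert", "say", "sugar", "good", "lifestyle"],
]

# Assign each unique term an integer id in order of first appearance.
# (This ordering is an assumption; gensim's id assignment may differ.)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow[0])  # [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
```

The pair `(0, 2)` records that the term with id 0 ("sugar") appears twice in the first document. Terms that do not occur in a document are simply absent, which keeps the representation sparse.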

### Running LDA model

In [26]:
# Alias the LdaModel class from the gensim library
lda = gensim.models.ldamodel.LdaModel

# Running and training LDA model on the document term matrix
ldamodel = lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

In [27]:
ldamodel.print_topics(num_topics=3, num_words=3)

[(0, '0.029*"sugar" + 0.029*"driving" + 0.029*"sister"'),
 (1, '0.050*"driving" + 0.050*"sister" + 0.050*"father"'),
 (2, '0.059*"sugar" + 0.059*"father" + 0.059*"sister"')]
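
Each line of the output pairs a topic id with its top terms and their weights within that topic. As a small sketch for working with this output programmatically, the strings can be split into (word, weight) pairs using the standard library. The `parse_topic` helper and the regular expression are hypothetical, and the input below is copied from the output above rather than produced by a fresh run (results can vary between runs).

```python
import re

# Topic strings copied from the print_topics output above.
topics = [
    (0, '0.029*"sugar" + 0.029*"driving" + 0.029*"sister"'),
    (1, '0.050*"driving" + 0.050*"sister" + 0.050*"father"'),
    (2, '0.059*"sugar" + 0.059*"father" + 0.059*"sister"'),
]

def parse_topic(description):
    """Turn '0.059*"sugar" + ...' into [('sugar', 0.059), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}
print(parsed[2])  # [('sugar', 0.059), ('father', 0.059), ('sister', 0.059)]
```

With only five short documents, the three topics share most of their top terms, so the separation between them is weak.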