## Latent Dirichlet Allocation (LDA)

##### LDA is an example of topic modeling, a statistical technique for discovering hidden (latent) topics in a collection of documents.

##### LDA builds a words-per-topic model and a topics-per-document model, both of which are modeled as Dirichlet distributions.
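
##### To make the two Dirichlet-distributed models concrete, here is a minimal sketch of the generative story LDA assumes, written with the standard library only. The toy vocabulary, the number of topics, the concentration value 0.5, and the helper name `sample_dirichlet` are all assumptions made for illustration; this is not how gensim fits a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

vocab = ["engine", "car", "goal", "team", "orbit", "launch"]  # toy vocabulary (assumption)
num_topics = 3  # assumption for the sketch

# Words-per-topic model: one distribution over the vocabulary for each topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

# Topics-per-document model: one distribution over topics for a single document.
doc_topic = sample_dirichlet([0.5] * num_topics, rng)

# Generate a short document: pick a topic for each position, then a word from that topic.
document = []
for _ in range(8):
    topic = rng.choices(range(num_topics), weights=doc_topic)[0]
    word = rng.choices(vocab, weights=topic_word[topic])[0]
    document.append(word)

print(doc_topic)
print(document)
```

##### Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and word distributions that could plausibly have produced them.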

##### Let's load the relevant libraries.

In [None]:
import nltk
# nltk.download('stopwords')
import re
import pandas as pd

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

##### We also load some stopwords to remove from our documents.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

##### We will use the 20 Newsgroups data, which contains about 18,000 USENET newsgroup documents spread roughly evenly across 20 topics. Here we load only the training subset (about 11,300 documents). We will use LDA to learn these latent topics.

In [None]:
from sklearn.datasets import fetch_20newsgroups
df = fetch_20newsgroups(subset='train',shuffle=True)

##### We will clean up the data a little by removing links, email handles, and non-alphanumeric characters.

In [None]:
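data = df.data
# Remove URLs of the form scheme://host/path
data = [re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', sent) for sent in data]
# Replace @handles, non-alphanumeric characters, and any leftover URLs with a space
# (raw string avoids invalid-escape warnings for \w and \S)
data = [re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", sent) for sent in data]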

In [None]:
data[0]
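
##### To see what the two substitutions do, the sketch below applies the same patterns to a single made-up sentence. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

# Made-up sample text (assumption), containing an email address and a link.
sample = "Contact bob@example.com or visit http://example.com/faq for details!"

# Step 1: drop anything that looks like scheme://host/path.
no_links = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', sample)

# Step 2: replace @handles and every character that is not a letter,
# digit, space, or tab with a single space.
cleaned = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", no_links)

print(no_links)
print(cleaned.split())
```

##### Note that the `@example` part of the email address is treated as a handle and removed, so only `bob` and `com` survive from the address.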

##### We will use a built-in gensim function to convert each document into a list of lowercase words, which also strips punctuation.

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; tokenization itself drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

##### We will also remove all stopwords from the data.

In [None]:
# data_words is already tokenized, so filter each token list directly
data_words_nostops = [[word for word in doc if word not in stop_words] for doc in data_words]
print(data_words_nostops[:1])

##### Next, we will create a dictionary of every word across all documents. The idea is to give each word a unique integer identifier (ID) so that computations work with IDs rather than the actual string values.

In [88]:
id2word = corpora.Dictionary(data_words_nostops)

##### Next, we will count how often each word occurs in each document. This builds our corpus: for every document, a list of (word ID, count) pairs.

In [89]:
texts = data_words_nostops
corpus = [id2word.doc2bow(text) for text in texts]

In [None]:
print(corpus[0])
print(id2word[13])
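
##### The bag-of-words idea behind the dictionary and corpus can be sketched with the standard library alone. The toy documents, the `word_to_id` mapping, and the `to_bow` helper below are assumptions for illustration; gensim's `Dictionary` may assign IDs in a different order.

```python
from collections import Counter

# Toy tokenized documents (assumption), standing in for data_words_nostops.
docs = [["car", "engine", "car"], ["team", "goal", "team", "team"]]

# Assign each distinct word an integer ID in order of first appearance.
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(doc):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]

print(word_to_id)
print(toy_corpus)
```

##### Each document becomes a short list of (ID, count) pairs, and word order is discarded, which is the representation LDA works with.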

##### Now we can use the corpus object to build our LDA model. We will provide the corpus, the dictionary (vocabulary), and the number of topics we want to identify as parameters.

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

##### We can then look at each topic. By default, the print_topics function shows the top 10 words for each topic along with their weights; these keywords describe what the topic is about.

In [None]:
from pprint import pprint  # pprint was not imported earlier

pprint(lda_model.print_topics())

##### It can be hard to judge whether the topics fit the data well. One option is to go through every document, manually assign it a topic, and compare those manual labels with the topics LDA generated, but that does not scale. Instead, you can compute the coherence of the model. A coherence score measures the degree of semantic similarity between the high-scoring words in a topic.

##### A high coherence score suggests that the topics are meaningful and fit the data reasonably well.

In [None]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
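
##### To give a feel for what a coherence score rewards, here is a rough sketch of a simpler, UMass-style measure based on document co-occurrence. It is not the `c_v` measure used above, and the function name `umass_style_coherence` and the toy documents are assumptions for illustration.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, tokenized_docs):
    """Average log co-occurrence ratio over pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        base = doc_freq(earlier)
        if base == 0:
            continue  # the word never appears, so skip this pair
        scores.append(math.log((doc_freq(earlier, later) + 1) / base))
    return sum(scores) / len(scores) if scores else 0.0

# Toy tokenized documents (assumption).
toy_docs = [
    ["car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["team", "goal", "match"],
    ["team", "goal", "car"],
]

related = umass_style_coherence(["car", "engine"], toy_docs)
unrelated = umass_style_coherence(["engine", "goal"], toy_docs)
print(related, unrelated)
```

##### Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind comparing models by coherence.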

##### The lda_model object also gives you the topic distribution for each document (for example, via get_document_topics), so you can attach the most likely topic back to the original data to summarize findings or conduct additional analyses.
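
##### One way to attach a single topic to each document is to keep the most probable entry from its topic distribution. The sketch below assumes each distribution is a list of (topic ID, probability) pairs, and uses hand-written stand-in values rather than real model output; the helper name `dominant_topic` is hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hand-written stand-ins (assumption) for per-document topic distributions;
# real values would come from the trained model.
example_doc_topics = [
    [(0, 0.10), (3, 0.85), (7, 0.05)],
    [(2, 0.60), (5, 0.40)],
    [(1, 0.30), (4, 0.25), (9, 0.45)],
]

assignments = [dominant_topic(doc_topics) for doc_topics in example_doc_topics]
print(assignments)

# With the notebook's objects, the same idea would look roughly like:
# doc_topics = [lda_model.get_document_topics(bow) for bow in corpus]
# labels = [dominant_topic(t)[0] for t in doc_topics]
```

##### The resulting labels line up one-to-one with the original documents, so they can be stored alongside the text, for instance as a new column in a pandas DataFrame.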