# Parameter Selection for LDA

### Import the necessary libraries
- LDA works only with the bag-of-words representation.
- The regular expression module re and nltk are imported for text processing.
- scikit-learn supplies the count vectorizer, the LDA model and the 20 Newsgroups dataset.
- NumPy is used for numerical work.
- Pandas is used for manipulating and viewing data in tabular format.



In [2]:
# Sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

# Numerical, tabular and text-processing utilities
import numpy as np
import pandas as pd
import re, nltk


#### display_topics() is a commonly used function to display topics and their top terms
- model - the fitted LDA model
- feature_names - the feature names (the vocabulary terms)
- no_top_words - how many terms to display per topic

In [3]:

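def display_topics(model, feature_names, no_top_words):
    # Each row of model.components_ holds the word weights of one topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        # argsort() is ascending, so the reversed slice yields the
        # indices of the no_top_words highest-weighted terms
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))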
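
The slice used in display_topics() can be hard to read at first. The sketch below isolates it on a made-up topic row; the vocabulary and the weights are hypothetical and only stand in for one row of model.components_.

```python
import numpy as np

# Hypothetical vocabulary and word weights for a single topic
feature_names = ["car", "engine", "god", "game", "team"]
topic = np.array([0.2, 3.1, 0.05, 7.4, 5.0])

no_top_words = 3
# argsort() returns indices in ascending order of weight, so the slice
# [:-no_top_words - 1:-1] walks backwards from the largest weight and
# stops after no_top_words indices.
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_terms = [feature_names[i] for i in top_idx]
print(" ".join(top_terms))  # game team engine
```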

#### Data Cleaning
- clean_documents() is a function that performs data cleansing on the raw documents

In [4]:
def clean_documents(raw_documents):
    # Placeholder: write the data preparation code here.
    # For now the documents are returned unchanged.
    documents = raw_documents
    return documents

#### This is the section to modify if you have other sources
- Load the documents from their source.
- The LDA topic model algorithm requires a document-word matrix as its main input.
- Vectorise the documents using count vectorization.
- LDA should be given raw term counts rather than weighted values, because it is a probabilistic graphical model of word counts.


In [5]:


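# Load the 20 Newsgroups posts without headers, footers and quoted replies
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = clean_documents(dataset.data)

# Build the document-word matrix from raw term counts
no_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=no_features, stop_words='english')
tf_vectorized_documents = tf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()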

#### How do you know whether the model performs well?
- One way is to use perplexity and log-likelihood.
- A model with higher log-likelihood and lower perplexity (exp(-1. * log-likelihood per word)) is considered to be good.
- The main drawback is that these measures do not consider the context and semantic associations between words.

- Start by pre-specifying an initial range of "sensible" values for the number of topics k.
- Then apply LDA for every topic count k in that range.
- Calculate the log-likelihood and perplexity of each model at the same time.
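
The relationship between the two measures can be checked numerically. The sketch below fits a small model on a made-up corpus (toy_documents and toy_lda are hypothetical names) and recomputes perplexity by hand from score(), assuming the number of words is the sum of all counts in the matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for checking the formula
toy_documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
X = CountVectorizer().fit_transform(toy_documents)

toy_lda = LatentDirichletAllocation(n_components=2, max_iter=10,
                                    learning_method="batch", random_state=0)
toy_lda.fit(X)

# score() returns the approximate log-likelihood summed over all words
log_likelihood = toy_lda.score(X)
n_words = X.sum()

# perplexity = exp(-1. * log-likelihood per word)
manual_perplexity = np.exp(-log_likelihood / n_words)
reported_perplexity = toy_lda.perplexity(X)
print(manual_perplexity, reported_perplexity)
```

The two printed values should agree, which is why a higher log-likelihood always goes together with a lower perplexity on the same data.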


### TO DO:

The code segment below implements a loop that fits a model for each value of k (components) in a range.
Use your intuition to set boundaries for the minimum and maximum number of topics.

In [6]:
kmin = 15
kmax = 25

### TO DO:
Initiate the LDA model and call the relevant methods of the LDA object to return the log-likelihood and perplexity.


In [7]:


topic_models = []
# Try each value of k
for k in range(kmin, kmax + 1):
    print("Applying LDA for k=%d ..." % k)
    lda_model = LatentDirichletAllocation(n_components=k, max_iter=5,
                                          learning_method='online',
                                          learning_offset=50., random_state=0)
    # Fit the model and keep the document-topic matrix
    lda_output = lda_model.fit_transform(tf_vectorized_documents)
    # score() returns the approximate log-likelihood of the data
    log_likelihood = lda_model.score(tf_vectorized_documents)
    perplexity = lda_model.perplexity(tf_vectorized_documents)
    # Store for later
    topic_models.append((k, lda_model, lda_output, log_likelihood, perplexity))

Applying LDA for k=15 ...
Applying LDA for k=16 ...
Applying LDA for k=17 ...
Applying LDA for k=18 ...
Applying LDA for k=19 ...
Applying LDA for k=20 ...
Applying LDA for k=21 ...
Applying LDA for k=22 ...
Applying LDA for k=23 ...
Applying LDA for k=24 ...
Applying LDA for k=25 ...


- Print the number of topics and the corresponding log-likelihood and perplexity

In [8]:
for model in topic_models:
    print("k topics : % 2d, Log Likelihood : % 5.2f   Perplexity : %5.2f" % (model[0], model[3], model[4]))

k topics :  15, Log Likelihood : -3059155.03   Perplexity : 259.51
k topics :  16, Log Likelihood : -3057436.03   Perplexity : 258.70
k topics :  17, Log Likelihood : -3054443.24   Perplexity : 257.29
k topics :  18, Log Likelihood : -3051152.22   Perplexity : 255.76
k topics :  19, Log Likelihood : -3052920.90   Perplexity : 256.58
k topics :  20, Log Likelihood : -3049240.73   Perplexity : 254.87
k topics :  21, Log Likelihood : -3048468.07   Perplexity : 254.52
k topics :  22, Log Likelihood : -3047531.30   Perplexity : 254.08
k topics :  23, Log Likelihood : -3055736.92   Perplexity : 257.90
k topics :  24, Log Likelihood : -3051220.08   Perplexity : 255.79
k topics :  25, Log Likelihood : -3054335.57   Perplexity : 257.24
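
The printed values can also be compared programmatically. The sketch below copies the figures from the output above into a list named results (a hypothetical name) and picks the row with the lowest perplexity and the row with the highest log-likelihood. With the notebook's own topic_models list, the same idea would use positions 4 and 3 of each tuple.

```python
# (k, log_likelihood, perplexity) copied from the printed output
results = [
    (15, -3059155.03, 259.51),
    (16, -3057436.03, 258.70),
    (17, -3054443.24, 257.29),
    (18, -3051152.22, 255.76),
    (19, -3052920.90, 256.58),
    (20, -3049240.73, 254.87),
    (21, -3048468.07, 254.52),
    (22, -3047531.30, 254.08),
    (23, -3055736.92, 257.90),
    (24, -3051220.08, 255.79),
    (25, -3054335.57, 257.24),
]

# Lower perplexity is better, higher log-likelihood is better
best_by_perplexity = min(results, key=lambda row: row[2])
best_by_likelihood = max(results, key=lambda row: row[1])

print("Lowest perplexity at k =", best_by_perplexity[0])
print("Highest log-likelihood at k =", best_by_likelihood[0])
```

Both criteria point to the same k here because they are computed from the same data. The differences between neighbouring values of k are small, so the selected value is a starting point for inspecting the topics rather than a definitive answer.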


### Questions:
1. Based on the results, how should the log-likelihood and perplexity be interpreted?
2. Which number of topics is the 'optimal' one?
3. Would this 'best' number be the final answer?



In [13]:
### Answer:
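no_top_words = 10
best_k = 22
# topic_models is ordered by k starting at kmin, so k = 22 sits at index best_k - kmin = 7
best_lda_model = topic_models[best_k - kmin][1]
display_topics(best_lda_model, tf_feature_names, no_top_words)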

Topic 0:
government gun state law people states right rights guns control
Topic 1:
new national public private health use administration years service research
Topic 2:
greek body right sound left order david pro mentioned cross
Topic 3:
key chip encryption keys clipper use des algorithm government security
Topic 4:
10 15 11 12 25 14 17 16 20 13
Topic 5:
better does case point use way question problem don think
Topic 6:
windows thanks window use using dos does help problem know
Topic 7:
information university 1993 general medical water april air new page
Topic 8:
just don like know think ve people time say good
Topic 9:
00 car power 50 speed cars 000 engine new used
Topic 10:
edu com available ftp mail list pub information software version
Topic 11:
game team year play games season hockey league players win
Topic 12:
god jesus people believe christian bible does life church say
Topic 13:
mr people think don group going president know stephanopoulos yes
Topic 14:
drive card db scsi disk


### Optional Exercises:
1. Retrieve the best model
2. Display the topics
3. Check if the terms make sense
4. Create a pandas DataFrame to show the relationship between terms, topics and documents
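
As a starting point for exercise 4, here is a minimal sketch with made-up numbers. The arrays doc_topic and topic_term are hypothetical stand-ins for the document-topic matrix returned by fit_transform() and for the components_ attribute of the fitted model; the vocabulary is invented as well.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 4 documents x 3 topics
doc_topic = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.30, 0.30, 0.40],
    [0.05, 0.05, 0.90],
])
# Hypothetical topic-term weights: 3 topics x 4 terms
topic_term = np.array([
    [5.0, 0.1, 0.2, 1.0],
    [0.3, 4.0, 0.1, 0.5],
    [0.2, 0.4, 6.0, 0.1],
])
terms = ["gun", "game", "god", "law"]

topic_names = ["Topic %d" % i for i in range(doc_topic.shape[1])]
doc_names = ["Doc %d" % i for i in range(doc_topic.shape[0])]

# Documents versus topics, with the highest-weighted topic per document
df_doc_topic = pd.DataFrame(doc_topic, columns=topic_names, index=doc_names)
df_doc_topic["dominant_topic"] = doc_topic.argmax(axis=1)

# Topics versus terms
df_topic_term = pd.DataFrame(topic_term, columns=terms, index=topic_names)

print(df_doc_topic)
print(df_topic_term)
```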

#### Reference:
- www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/
