# Latent Semantic Analysis

LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), uses the bag-of-words (BoW) model, which results in a term-document matrix recording the occurrence of terms in each document. Rows represent terms and columns represent documents. LSA learns latent topics by performing a matrix decomposition on the term-document matrix using singular value decomposition (SVD). It is typically used as a dimensionality reduction or noise reduction technique.
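
To make the decomposition concrete, here is a minimal sketch of a truncated SVD on a toy term-document matrix using NumPy. The terms, counts, and the choice of `k = 2` are made up for illustration, and this is not how Gensim implements LSA internally; it only shows the idea of keeping the largest singular values.

```python
import numpy as np

# Toy term-document matrix (illustrative counts): rows are terms, columns are documents.
terms = ["trump", "clinton", "vote", "goal", "leagu"]
X = np.array([
    [2, 1, 0, 0],   # trump
    [1, 2, 0, 0],   # clinton
    [1, 1, 0, 0],   # vote
    [0, 0, 2, 1],   # goal
    [0, 0, 1, 2],   # leagu
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values; each one corresponds to a latent topic.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of the original matrix (the "noise-reduced" version).
X_k = U_k @ np.diag(s_k) @ Vt_k

# Each column is one document expressed in the k-dimensional topic space.
doc_topics = np.diag(s_k) @ Vt_k
```

In this toy matrix the first two documents share no terms with the last two, so each document loads on only one of the two retained topics. Real corpora are not this cleanly separated.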


## Implementing LSA using Gensim

Import the required libraries.


In [1]:
#import modules
import os.path
from gensim import corpora
from gensim.models import LsiModel
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt


In [2]:
def load_data(path,file_name):
    """
    Input  : path and file_name
    Purpose: loading text file (one document per line)
    Output : list of paragraphs/documents and
             titles (first 100 characters of each document used as its title)
    """
    documents_list = []
    titles = []
    with open(os.path.join(path, file_name), "r", encoding="utf8") as fin:
        for line in fin:
            text = line.strip()
            documents_list.append(text)
            # keep the first 100 characters of each document as its title
            titles.append(text[0:min(len(text), 100)])
    print("Total Number of Documents:", len(documents_list))
    return documents_list, titles

##  Preprocessing Data

After loading the data, you need to preprocess the text. The following steps are taken:

- Tokenize the text articles
- Remove stop words
- Stem the remaining tokens


In [3]:
def preprocess_data(doc_set):
    """
    Input  : document list
    Purpose: preprocess text (tokenize, removing stopwords, and stemming)
    Output : preprocessed text
    """
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]
        # add tokens to list
        texts.append(stemmed_tokens)
    return texts

## Prepare Corpus

The next step is to prepare the corpus. Here, you need to create a dictionary of terms and a document-term matrix.

In [4]:
def prepare_corpus(doc_clean):
    """
    Input  : clean documents
    Purpose: create the term dictionary of our corpus and convert the list of documents (corpus) into a document-term matrix
    Output : term dictionary and document-term matrix
    """
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary,doc_term_matrix
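
To see what the dictionary and the bag-of-words vectors look like, here is a small pure-Python sketch of the same idea. The helper names `build_vocabulary` and `to_bow` are hypothetical and the sample tokens are invented; Gensim's `Dictionary` may assign ids in a different order, so treat this as an illustration of the data shape rather than a reimplementation.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Two tiny, already-preprocessed documents (illustrative only).
docs = [["trump", "say", "vote"], ["vote", "eu", "vote"]]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the same shape of data that `doc2bow` produces for every entry of `doc_term_matrix`.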


## Create an LSA model using Gensim

After corpus creation, you can generate a model using LSA.
 

In [5]:
def create_gensim_lsa_model(doc_clean,number_of_topics,words):
    """
    Input  : clean document, number of topics and number of words associated with each topic
    Purpose: create LSA model using gensim
    Output : return LSA model
    """
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate LSA model
    lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)  # train model
    print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
    return lsamodel


## Determine the number of topics

One more step is needed to optimize the results: identifying the optimum number of topics. Here, you will generate coherence scores for a range of topic counts and use them to choose that number.

In [6]:
def compute_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start=2, step=3):
    """
    Input   : dictionary : Gensim dictionary
              doc_term_matrix : Gensim corpus (bag-of-words document-term matrix)
              doc_clean : list of preprocessed (tokenized) input texts
              stop : upper bound (exclusive) on the number of topics
              start : smallest number of topics to try
              step : increment between successive topic counts
    Purpose : compute c_v coherence for various numbers of topics
    Output  : model_list : list of LSA topic models
              coherence_values : coherence values corresponding to the LSA models with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        # generate LSA model
        model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)  # train model
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


Let's plot the coherence scores.


In [7]:
def plot_graph(doc_clean,start, stop, step):
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix,doc_clean,
                                                            stop, start, step)
    # Show graph
    x = range(start, stop, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')
    plt.show()


You can read the result directly from this graph: the number of topics is on the x-axis and the coherence score is on the y-axis. The coherence score is highest at 7 topics, so the optimum number of topics is 7.
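
If you prefer to pick the topic count programmatically instead of reading it off the plot, a small helper along these lines could work. The function name `best_num_topics` is hypothetical, and the scores in the usage example are placeholders for illustration only, not results from this article.

```python
def best_num_topics(coherence_values, start, stop, step):
    """Return (number_of_topics, coherence) for the highest coherence score.

    The topic counts are assumed to come from range(start, stop, step),
    matching the loop that produced coherence_values.
    """
    topic_counts = list(range(start, stop, step))
    if len(topic_counts) != len(coherence_values):
        raise ValueError("coherence_values does not match range(start, stop, step)")
    # Index of the largest coherence score (first one wins on ties).
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return topic_counts[best_index], coherence_values[best_index]


# Placeholder scores for illustration only; they are not results from this article.
example_scores = [0.31, 0.42, 0.38]
best_k, best_score = best_num_topics(example_scores, start=2, stop=11, step=3)
```

In practice you would pass the `coherence_values` list returned by `compute_coherence_values` together with the same `start`, `stop`, and `step` used there.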


Run all the above functions


In [8]:
# LSA Model
number_of_topics=7
words=10
# read one document per line and close the file when done
with open("articles.txt", encoding="utf8") as fin:
    document_list = [line.strip() for line in fin]
clean_text=preprocess_data(document_list)
model=create_gensim_lsa_model(clean_text,number_of_topics,words)


[(0, '0.869*"â" + 0.155*"trump" + 0.136*"say" + 0.118*"said" + 0.075*"would" + 0.071*"peopl" + 0.070*"clinton" + 0.070*"one" + 0.059*"year" + 0.059*"campaign"'), (1, '0.389*"citi" + 0.372*"v" + 0.358*"2016" + 0.358*"h" + 0.356*"2017" + 0.165*"unit" + 0.160*"west" + 0.157*"manchest" + 0.116*"apr" + 0.112*"dec"'), (2, '-0.330*"eu" + 0.307*"trump" + -0.244*"say" + 0.222*"â" + -0.215*"would" + -0.173*"leav" + -0.147*"uk" + 0.136*"clinton" + -0.134*"said" + -0.132*"vote"'), (3, '-0.454*"trump" + 0.276*"min" + -0.202*"clinton" + 0.201*"â" + -0.181*"said" + -0.175*"campaign" + -0.172*"eu" + -0.139*"vote" + -0.132*"say" + 0.124*"goal"'), (4, '-0.391*"min" + -0.386*"trump" + 0.279*"â" + -0.181*"clinton" + -0.172*"goal" + -0.144*"ball" + -0.120*"1" + -0.114*"0" + -0.102*"win" + -0.100*"leagu"'), (5, '0.433*"bank" + -0.263*"eu" + -0.240*"say" + -0.190*"min" + 0.183*"market" + 0.176*"year" + 0.165*"rate" + -0.143*"leav" + 0.127*"financi" + -0.123*"cameron"'), (6, '0.615*"say" + -0.225*"eu" + -0.17

- Topic 1: a, trump, say, said, would, peopl, clinton, one, campaign (US Presidential Elections)
- Topic 2: citi, v, h, unit, west, manchest, apr, dec (English Premier League)
- Topic 3: eu, trump, say, a, would, leav, uk, clinton, said, vote (US Presidential Elections)
- Topic 4: trump, min, clinton, said, campaign, eu, vote, say, goal (US Presidential Elections, EPL)
- Topic 5: min, trump, clinton, goal, ball, 1, 0, win, leagu (US Presidential Elections, EPL)
- Topic 6: bank, eu, say, min, market, year, rate, leav, financi, cameron (Brexit and Market Condition)
- Topic 7: say, eu, said, vote, poll, campaign, govern, remain, leav, tax (US Presidential Elections and Financial Planning)

Here, 7 topics were discovered using Latent Semantic Analysis, and some of them overlap. To capture multiple meanings with higher accuracy, we need to try LDA (Latent Dirichlet Allocation).


I will leave this as an exercise for you: try it out using Gensim and share your views.
