## Latent Semantic Analysis (LSA) Overview

Latent Semantic Analysis (LSA) is a foundational technique in natural language processing and information retrieval. It identifies patterns in the relationships between terms and documents and uncovers latent semantic structures, effectively grouping together terms that are used in similar contexts.

### Steps in LSA:

1. **Term-Document Matrix Creation**:
   LSA constructs a matrix that represents the frequency of terms (words) across a set of documents. The matrix entries may be raw counts or, more commonly, weighted frequencies such as TF-IDF scores.

2. **Matrix Decomposition**:
   The term-document matrix is decomposed using Singular Value Decomposition (SVD). SVD separates the matrix into three components: a term-concept matrix, a diagonal matrix of singular values, and a concept-document matrix.

3. **Dimensionality Reduction**:
   By selecting the top `k` singular values and their corresponding vectors, LSA reduces the dimensionality of the term and document space to the `k` most informative concepts. This step helps in denoising the data and clarifying the structure.

4. **Concept Identification**:
   In the reduced `k`-dimensional space, terms and documents are now associated with latent concepts, which can often be interpreted as topics. The proximity of terms and documents within this space indicates their semantic similarity.

5. **Similarity Measurement**:
   LSA allows for the measurement of semantic similarity between terms and documents by using the cosine similarity of their vectors in the reduced space. Small angles between vectors indicate a high degree of semantic similarity.

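LSA handles synonymy well, because words used in similar contexts end up close together in the reduced space. It only partially addresses polysemy, since each term is represented by a single vector regardless of how many senses it has. It also ignores word order and syntax (it is a bag-of-words method), and its results depend heavily on the choice of `k`.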
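The five steps can be sketched end to end on a tiny corpus. This is a minimal illustration, not the pipeline used later in this notebook: the corpus and every `toy_*` name are hypothetical, and scikit-learn's `TruncatedSVD` stands in for a full SVD followed by truncation.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: two documents about cats, two about finance.
toy_docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# Step 1: weighted document-term matrix (rows are documents, as scikit-learn expects).
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

# Steps 2-3: truncated SVD keeps only the top k = 2 singular directions.
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_doc_vectors = toy_svd.fit_transform(toy_dtm)  # shape: (4 documents, 2 concepts)

# Step 4: terms are placed in the same reduced space via the component matrix.
toy_term_vectors = toy_svd.components_.T  # shape: (n_terms, 2 concepts)

# Step 5: cosine similarity between documents in the reduced space.
toy_sims = cosine_similarity(toy_doc_vectors)
print(np.round(toy_sims, 2))
```

In this toy setup the two cat documents should score as far more similar to each other than to either finance document, because after stop-word removal the two groups share no vocabulary.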


In [3]:
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import CoherenceModel
import numpy as np
from scipy.stats import entropy
# Loading the data
data = pd.read_csv("train_no_simplify.csv")

# Extracting the cleaned text
texts = data['clean_text'].values


### Text Vectorization and LSA Application:

- Vectorize the cleaned text with TF-IDF, excluding common English stop words and ignoring terms that appear in more than 100 documents (`max_df=100`) or in fewer than 5 documents (`min_df=5`).
- Apply LSA to the TF-IDF matrix using `TruncatedSVD` to identify latent topics within the data.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Define the number of topics
n_topics = 20

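# Vectorize the cleaned text using TF-IDF
# Integer thresholds are absolute document counts: max_df=100 drops terms found in
# more than 100 documents, min_df=5 drops terms found in fewer than 5 documents.
# (A float such as max_df=0.95 would be read as a proportion of documents instead.)
vectorizer = TfidfVectorizer(max_df=100, min_df=5, stop_words='english')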
dtm_tfidf = vectorizer.fit_transform(texts)

# Applying LSA (Truncated SVD) on the TF-IDF matrix
lsa_model = TruncatedSVD(n_components=n_topics, n_iter=10)
lsa_topic_matrix = lsa_model.fit_transform(dtm_tfidf)
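Because the choice of `k` matters so much, one common heuristic is to check how much variance the first `k` components retain before settling on a value such as `n_topics = 20`. The sketch below shows the idea on a hypothetical toy corpus; `toy_texts`, `toy_matrix`, and `cumulative_explained_variance` are illustrative names that do not appear elsewhere in this notebook, and the same function could in principle be called on `dtm_tfidf`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real corpus.
toy_texts = [
    "cat kitten purr whiskers",
    "cat dog pet animal",
    "dog puppy bark leash",
    "stocks bonds market investors",
    "market prices stocks trading",
    "bank loans interest market",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_texts)

def cumulative_explained_variance(dtm, max_k, random_state=0):
    """Return the cumulative explained-variance ratio for k = 1..max_k."""
    svd = TruncatedSVD(n_components=max_k, n_iter=10, random_state=random_state)
    svd.fit(dtm)
    return np.cumsum(svd.explained_variance_ratio_)

curve = cumulative_explained_variance(toy_matrix, max_k=4)
for k, share in enumerate(curve, start=1):
    print(f"k={k}: {share:.2f} of variance retained")
```

Explained variance is only a rough guide for topic modeling; a curve like this is usually read alongside coherence scores and manual inspection of the top words.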


### Score calculation:

In [6]:
from gensim.corpora import Dictionary

# Ensure texts_list is a list of lists of tokens
texts_list = [text.split() for text in texts]
gensim_dictionary = Dictionary(texts_list)


n_top_words = 10
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words = np.array(vectorizer.get_feature_names_out())
top_words = [words[np.argsort(topic)[-n_top_words:]] for topic in lsa_model.components_]

# Create the topics list expected by CoherenceModel
topics = [list(topic) for topic in top_words]
cm = CoherenceModel(topics=topics, texts=texts_list, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = cm.get_coherence()

In [7]:
coherence_score

0.5014920513250186

In [10]:
import numpy as np

def calculate_topic_exclusivity(model, feature_names, top_n_words=20):
    """Calculates the topic exclusivity score for a given topic model.

    Args:
        model: The fitted topic model with a `components_` attribute containing topic-word distributions.
        feature_names: A list of feature names corresponding to the columns of the topic-word matrix.
        top_n_words: The number of top words to consider for exclusivity calculation (default: 20).

    Returns:
        The overall topic exclusivity score, taken as the median of the per-topic scores.
    """

    topics = model.components_
    exclusivity_scores = []

    for topic_idx, topic in enumerate(topics):
        top_features_ind = topic.argsort()[:-top_n_words-1:-1]
        top_features_ind = top_features_ind.astype(int)  # Ensure it's an integer array
        top_features = [feature_names[i] for i in top_features_ind]  # Use list comprehension for indexing


        other_topics = np.delete(topics, topic_idx, axis=0)

        # Check for zero denominator and handle it appropriately
        if np.sum(other_topics[:, top_features_ind]) == 0:
            topic_exclusivity_score = np.inf  # Assign infinite exclusivity if no overlap
        else:
            topic_exclusivity_score = np.sum(topic[top_features_ind]) / np.sum(other_topics[:, top_features_ind])

        exclusivity_scores.append(topic_exclusivity_score)

    # Use a robust averaging method to handle potential outliers
    overall_exclusivity = np.median(exclusivity_scores)  # Consider np.mean as well

    return overall_exclusivity

# Calculating Topic Exclusivity
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
topic_exclusivity_score = calculate_topic_exclusivity(lsa_model, feature_names)
print(topic_exclusivity_score)

0.755777164947796
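To see what this ratio does, it can help to evaluate it by hand on a tiny topic-word matrix. The sketch below re-implements the per-topic ratio in a compact form; `exclusivity_ratio` and `toy_components` are hypothetical names used only for illustration, and the weights are made up.

```python
import numpy as np

def exclusivity_ratio(components, topic_idx, top_n):
    """Weight of a topic's top words in that topic, relative to all other topics."""
    topic = components[topic_idx]
    top_idx = np.argsort(topic)[::-1][:top_n]  # indices of the top_n largest weights
    others = np.delete(components, topic_idx, axis=0)
    return topic[top_idx].sum() / others[:, top_idx].sum()

# Two made-up topics over four words; each topic clearly "owns" two of them.
toy_components = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.7, 0.9],
])

ratio_topic_0 = exclusivity_ratio(toy_components, 0, top_n=2)  # 1.7 / 0.3
ratio_topic_1 = exclusivity_ratio(toy_components, 1, top_n=2)  # 1.6 / 0.1
print(ratio_topic_0, ratio_topic_1)
```

The ratio has no upper bound, and with signed weights such as LSA components it can also be negative, so it is best read as a relative indicator rather than a proportion.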


In [1]:
# Adjusting the number of topics to be less than the number of features
n_topics_adjusted = min(dtm_tfidf.shape[1] - 1, 5)  # Setting a maximum of 5 topics for this small dataset

# Re-applying LSA (Truncated SVD) with the adjusted number of topics
lsa_model_adjusted = TruncatedSVD(n_components=n_topics_adjusted, n_iter=10)
lsa_topic_matrix_adjusted = lsa_model_adjusted.fit_transform(dtm_tfidf)

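# NOTE: kl_divergence is not defined earlier in this notebook. The helper below is an
# assumed reconstruction and may not reproduce the value reported underneath.
# LSA scores can be negative, so it takes absolute values before comparing;
# scipy's entropy(p, q) then normalizes both vectors and returns KL(p || q).
def kl_divergence(p, q, eps=1e-12):
    p = np.abs(p) + eps  # eps avoids division by zero in q
    q = np.abs(q) + eps
    return entropy(p, q)

# Recalculating the KL Divergence and Topic Diversity score with the adjusted number of topics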
topic_diversity_scores_adjusted = []
for i in range(n_topics_adjusted):
    divergences = []
    for j in range(n_topics_adjusted):
        if i != j:
            divergences.append(kl_divergence(lsa_topic_matrix_adjusted[:, i], lsa_topic_matrix_adjusted[:, j]))
    topic_diversity_scores_adjusted.append(np.mean(divergences))

# Calculate the overall Topic Diversity score as the mean of individual topic scores
topic_diversity_score_adjusted = np.mean(topic_diversity_scores_adjusted)
topic_diversity_score_adjusted


kl_divergence: 1.549283


1. **Coherence Score (0.501)**:
   - The coherence score (here the `c_v` measure, computed over the top 10 words of each topic) measures how semantically related the top words within each topic are; higher is better. A score of 0.501 indicates moderate coherence: the top words in each topic show some meaningful connections, but the topics could be made more interpretable.

2. **Topic Exclusivity Score (0.756)**:
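   - The topic exclusivity score evaluates how distinctive each topic's top words are. As computed here, it is the median across topics of a ratio: the summed weight of a topic's top 20 words within that topic, divided by the summed weight of the same words in all other topics. A score of 0.756 suggests that the topics are relatively exclusive, meaning that their top words are not heavily shared with other topics. Because LSA components can contain negative weights, the ratio is not bounded between 0 and 1 and should be read as a relative indicator rather than a proportion. Exclusivity says nothing about coherence, so it complements the coherence score rather than confirming it.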

3. **KL Divergence (1.549)**:
   - KL divergence measures how much one distribution differs from another: it is 0 for identical distributions and has no upper bound. Here it is averaged over all ordered pairs of topics in the adjusted 5-topic model, comparing each topic's loadings across documents. A value of 1.549 suggests a moderate level of diversity: the topics are not highly dissimilar, but they do not overlap heavily either. Whether this balance is desirable depends on the application.
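As a point of reference for reading a KL divergence value, the sketch below computes it for two small, made-up probability distributions using `scipy.stats.entropy`, which returns KL(p || q) in nats when given two arguments. The arrays `p` and `q` are hypothetical and unrelated to the topics above.

```python
import numpy as np
from scipy.stats import entropy

# Two made-up discrete distributions over the same two outcomes.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

kl_pq = entropy(p, q)  # KL(p || q)
kl_qp = entropy(q, p)  # KL(q || p), generally a different value
kl_pp = entropy(p, p)  # a distribution compared with itself

print(f"KL(p||q) = {kl_pq:.4f}")
print(f"KL(q||p) = {kl_qp:.4f}")
print(f"KL(p||p) = {kl_pp:.4f}")
```

KL divergence is asymmetric, which is why averaging over ordered pairs (both `i, j` and `j, i`) as the diversity loop does gives a different result from averaging over unordered pairs.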

In summary, the scores indicate topics with moderate coherence, relatively high exclusivity, and moderate diversity. Note that coherence and exclusivity were computed on the 20-topic model, while the KL divergence was computed on the adjusted 5-topic model, so the three figures do not describe exactly the same set of topics. The topics are distinct and do not overlap heavily, but their interpretability could be improved by raising coherence. Depending on the goals of the application, tuning the number of topics or the vectorizer thresholds, or applying post-processing, may help reach the desired balance between coherence, exclusivity, and diversity.