## Latent Semantic Analysis (LSA) Overview

Latent Semantic Analysis (LSA) is a foundational technique in natural language processing and information retrieval. It identifies patterns in the relationships between terms and documents and uncovers latent semantic structures, effectively grouping together terms that are used in similar contexts.

### Steps in LSA:

1. **Term-Document Matrix Creation**:
   LSA constructs a matrix that represents the frequency of terms (words) across a set of documents. The matrix entries may be raw counts or, more commonly, weighted frequencies such as TF-IDF scores.

2. **Matrix Decomposition**:
   The term-document matrix is decomposed using Singular Value Decomposition (SVD). SVD separates the matrix into three components: a term-concept matrix, a diagonal matrix of singular values, and a concept-document matrix.

3. **Dimensionality Reduction**:
   By keeping only the top `k` singular values and their corresponding vectors, LSA reduces the term and document space to the `k` most informative concepts. Discarding the smaller singular values removes much of the noise in the original counts and makes the dominant co-occurrence structure easier to see.

4. **Concept Identification**:
   In the reduced `k`-dimensional space, terms and documents are now associated with latent concepts, which can often be interpreted as topics. The proximity of terms and documents within this space indicates their semantic similarity.

5. **Similarity Measurement**:
   LSA allows for the measurement of semantic similarity between terms and documents by using the cosine similarity of their vectors in the reduced space. Small angles between vectors indicate a high degree of semantic similarity.
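
The five steps above can be sketched on a toy example. This is a minimal illustration using NumPy's full SVD, not the pipeline used later in this notebook; the term list, the count matrix `X`, and the helper `cosine` are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Step 1: toy term-document count matrix (rows = terms, columns = documents).
# Hypothetical data: three "vehicle" documents and two "food" documents.
terms = ["car", "automobile", "engine", "banana", "fruit"]
X = np.array([
    [2, 0, 1, 0, 0],  # car
    [0, 2, 1, 0, 0],  # automobile
    [1, 1, 2, 0, 0],  # engine
    [0, 0, 0, 2, 1],  # banana
    [0, 0, 0, 1, 2],  # fruit
], dtype=float)

# Step 2: SVD gives term-concept (U), singular values (s), concept-document (Vt).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Step 3: keep the top k concepts.
k = 2
term_vecs = U[:, :k] * s[:k]     # term coordinates in concept space
doc_vecs = Vt[:k, :].T * s[:k]   # document coordinates in concept space

# Steps 4 and 5: compare terms by cosine similarity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_sim = cosine(X[0], X[1])                        # car vs automobile, raw counts
sim_synonyms = cosine(term_vecs[0], term_vecs[1])   # car vs automobile, LSA space
sim_unrelated = cosine(term_vecs[0], term_vecs[3])  # car vs banana, LSA space
print(raw_sim, sim_synonyms, sim_unrelated)
```

In the raw count space "car" and "automobile" share only one document and have a cosine of 0.2; after projection onto two concepts they point in essentially the same direction, while "car" and "banana" stay orthogonal.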

LSA handles synonymy well, because terms that occur in similar contexts end up close together in the reduced space. It addresses polysemy only partially: each term gets a single vector, so the different senses of a word are blended together. LSA also ignores word order and syntax, and its results depend strongly on the choice of `k`.
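
One common heuristic for choosing `k` is to look at how much of the total squared singular-value mass the top `k` components retain. The sketch below assumes the singular values are already available and sorted in descending order; the function name `smallest_k_for_energy` and the 0.9 threshold are illustrative assumptions, not something this notebook uses.

```python
import numpy as np

def smallest_k_for_energy(singular_values, threshold=0.9):
    """Smallest k whose top-k squared singular values reach `threshold` of the total."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Hypothetical singular values, sorted in descending order
s = np.array([3.0, 2.0, 1.0])
k = smallest_k_for_energy(s, threshold=0.9)
print(k)  # 2, since (9 + 4) / 14 is about 0.93
```

With scikit-learn's `TruncatedSVD`, the fitted `explained_variance_ratio_` attribute offers a related per-component view. For topic modelling, a coherence score over a range of `k` values is often a more relevant criterion than retained variance.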


## Implementation

Import the required libraries and load the preprocessed data.

In [None]:
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import CoherenceModel
import numpy as np
from scipy.stats import entropy

# Loading the data
data = pd.read_csv("/content/train_no_simplify.csv")

# Extracting the cleaned text
texts = data['clean_text'].values


### Text Vectorization and LSA Application:

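- Vectorizes the cleaned text using TF-IDF, excluding common English stop words and any term that appears in more than 100 documents (`max_df=100`) or in fewer than 5 documents (`min_df=5`). Both thresholds are integers, so scikit-learn treats them as absolute document counts rather than proportions.
- Applies LSA to the TF-IDF matrix using `TruncatedSVD` with 20 components to identify latent topics within the data.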

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Define the number of topics
n_topics = 20

# Vectorize the cleaned text using TF-IDF
vectorizer = TfidfVectorizer(max_df=100, min_df=5, stop_words='english')
dtm_tfidf = vectorizer.fit_transform(texts)

# Applying LSA (Truncated SVD) on the TF-IDF matrix
lsa_model = TruncatedSVD(n_components=n_topics, n_iter=10)
lsa_topic_matrix = lsa_model.fit_transform(dtm_tfidf)


### Topic Extraction and Coherence Calculation:

- Prepares the data for coherence score calculation by tokenizing the texts and creating a Gensim dictionary.
- Identifies the top words from each LSA topic.
- Constructs a list of topics with top words and calculates the coherence score using Gensim's CoherenceModel.

In [None]:
from gensim.corpora import Dictionary

# Ensure texts_list is a list of lists of tokens
texts_list = [text.split() for text in texts]
gensim_dictionary = Dictionary(texts_list)


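n_top_words = 10
words = np.array(vectorizer.get_feature_names_out())
# For each topic, take the n_top_words terms with the largest loadings,
# ordered from highest to lowest (argsort alone sorts in ascending order)
top_words = [words[np.argsort(topic)[::-1][:n_top_words]] for topic in lsa_model.components_]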

# Create the topics list expected by CoherenceModel
topics = [list(topic) for topic in top_words]
cm = CoherenceModel(topics=topics, texts=texts_list, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = cm.get_coherence()

In [None]:
coherence_score

0.4757086283363246
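
Because LSA loadings are signed, the "top words" of a topic can be ranked either by raw loading or by absolute loading. The helper below is a hypothetical sketch (the name `top_words_per_topic` and the `by_abs` flag are not part of the notebook) that shows both rankings on a small made-up component matrix.

```python
import numpy as np

def top_words_per_topic(components, vocab, n_top_words=10, by_abs=False):
    """Return the top words of each topic, highest-ranked first."""
    vocab = np.asarray(vocab)
    scores = np.abs(components) if by_abs else np.asarray(components)
    # Sort each row in descending order and keep the first n_top_words indices
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n_top_words]
    return [vocab[row].tolist() for row in order]

# Hypothetical single-topic example with one strongly negative loading
components = np.array([[0.1, 0.9, -0.95, 0.3]])
vocab = ["a", "b", "c", "d"]

signed_top = top_words_per_topic(components, vocab, n_top_words=2)
abs_top = top_words_per_topic(components, vocab, n_top_words=2, by_abs=True)
print(signed_top)  # [['b', 'd']]
print(abs_top)     # [['c', 'b']]
```

Which ranking is appropriate depends on whether strongly negative loadings should count as characteristic of a topic; the notebook's cell above uses the signed ranking.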

In [None]:
import numpy as np

def calculate_topic_exclusivity(model, feature_names, top_n_words=20):
    """Calculates the topic exclusivity score for a given topic model.

    Args:
        model: The fitted topic model with a `components_` attribute containing topic-word weights.
        feature_names: A NumPy array of feature names corresponding to the columns of the topic-word matrix.
        top_n_words: The number of top words to consider for exclusivity calculation (default: 20).

    Returns:
        The overall topic exclusivity score, taken as the median across all topics.
    """

    topics = model.components_
    exclusivity_scores = []

    for topic_idx, topic in enumerate(topics):
        top_features_ind = topic.argsort()[:-top_n_words-1:-1]
        top_features = feature_names[top_features_ind]

        other_topics = np.delete(topics, topic_idx, axis=0)

        # Check for zero denominator and handle it appropriately
        if np.sum(other_topics[:, top_features_ind]) == 0:
            topic_exclusivity_score = np.inf  # Assign infinite exclusivity if no overlap
        else:
            topic_exclusivity_score = np.sum(topic[top_features_ind]) / np.sum(other_topics[:, top_features_ind])

        exclusivity_scores.append(topic_exclusivity_score)

    # Summarize with the median, which is robust to outliers and infinite scores
    overall_exclusivity = np.median(exclusivity_scores)  # np.mean is an alternative

    return overall_exclusivity

# Calculating Topic Exclusivity
feature_names = vectorizer.get_feature_names_out()
topic_exclusivity_score = calculate_topic_exclusivity(lsa_model, feature_names)
print(topic_exclusivity_score)

0.6288398456379947
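
The ratio above sums signed loadings, so with LSA the numerator or denominator can shrink or change sign when positive and negative loadings cancel. One possible variant, shown here only as a sketch, uses absolute loadings so the ratio stays non-negative. The function name `exclusivity_abs` is hypothetical, and this is not the metric that produced the score printed above.

```python
import numpy as np

def exclusivity_abs(components, top_n_words=20):
    """Median ratio of a topic's top-word weight to the other topics' weight on those words."""
    weights = np.abs(np.asarray(components, dtype=float))
    scores = []
    for i, row in enumerate(weights):
        top = np.argsort(row)[::-1][:top_n_words]            # indices of the top words
        own = row[top].sum()                                  # this topic's weight on them
        others = np.delete(weights, i, axis=0)[:, top].sum()  # all other topics' weight
        scores.append(np.inf if others == 0 else own / others)
    return float(np.median(scores))

# Hypothetical two-topic example with one negative loading
toy_components = np.array([[-2.0, 1.0],
                           [1.0, 2.0]])
toy_score = exclusivity_abs(toy_components, top_n_words=1)
print(toy_score)  # 2.0
```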


In [None]:
import numpy as np
from scipy.special import rel_entr

def calculate_average_topic_divergence(model):
    """Average pairwise KL divergence between topics.

    Note: rel_entr expects non-negative inputs. LSA components typically
    contain negative loadings, for which rel_entr returns inf.
    """
    topics = model.components_  # Topic-term loadings (signed for LSA)
    n_topics = len(topics)
    divergence_matrix = np.zeros((n_topics, n_topics))

    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            # rel_entr is elementwise; sum over terms to get a scalar divergence
            divergence = np.sum(rel_entr(topics[i], topics[j]))

            # KL divergence is asymmetric; only the (i, j) direction is
            # computed here and mirrored into (j, i)
            divergence_matrix[i, j] = divergence
            divergence_matrix[j, i] = divergence

    # Average over off-diagonal entries; nanmean skips NaN but not infinity
    off_diagonal = ~np.eye(n_topics, dtype=bool)
    average_divergence = np.nanmean(divergence_matrix[off_diagonal])

    return average_divergence



average_divergence = calculate_average_topic_divergence(lsa_model)  # Adjust the model argument as needed
print("Average Topic Divergence:", average_divergence)


Average Topic Divergence: inf
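
The result is `inf` because `rel_entr` is defined for non-negative inputs and returns infinity whenever it meets a negative value, and LSA components are signed loadings rather than probability distributions. A finite score needs each topic to be mapped to a distribution first. The sketch below is one possible approach, not the notebook's method: it normalizes absolute loadings (an assumption, with a small `eps` to avoid zeros) and swaps in the Jensen-Shannon divergence, which is symmetric and bounded by ln 2. All function names are hypothetical, and at least two topics are assumed.

```python
import numpy as np

def to_distribution(row, eps=1e-12):
    """Map signed loadings to a probability vector via absolute values."""
    weights = np.abs(np.asarray(row, dtype=float)) + eps
    return weights / weights.sum()

def kl_divergence(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and ln 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def average_topic_divergence(components):
    """Mean Jensen-Shannon divergence over all unordered topic pairs."""
    dists = [to_distribution(row) for row in components]
    n_topics = len(dists)
    values = [js_divergence(dists[i], dists[j])
              for i in range(n_topics) for j in range(i + 1, n_topics)]
    return float(np.mean(values))

# Hypothetical examples: disjoint topics vs. topics that differ only in sign
disjoint = average_topic_divergence(np.array([[1.0, 0.0], [0.0, 1.0]]))
mirrored = average_topic_divergence(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(disjoint, mirrored)
```

Taking absolute values discards the sign of each loading, so two topics that are mirror images of each other score as identical; whether that is acceptable depends on how the topics are interpreted.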
