# 02 - Latent Semantic Analysis (LSA)
This notebook performs Latent Semantic Analysis on the governance data set. It assumes the data was cleaned and placed in the cache by the _00 - Preprocess the Governance Data Set_ notebook.

The code in this notebook is largely based on the [Introduction to Latent Semantic Analysis](https://www.youtube.com/playlist?list=PLroeQp1c-t3qwyrsq66tBxfR6iX6kSslt) YouTube series by Joshua Cook.


---
## Dependencies and Imports
Here we import the libraries we need to perform our analysis.

In [None]:
!pip install scikit-learn pandas seaborn graphviz wordcloud

In [None]:
import re
import sys
from pathlib import Path
print("python=={}".format(re.sub(r'\s.*', '', sys.version)))

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import __version__ as sklearn__version__
print(f"scikit-learn=={sklearn__version__}")

import pandas as pd
print(f"pandas=={pd.__version__}")
ROW    = 0
COLUMN = 1
STRING = 'string'
OBJECT = 'object'
NUMBER = 'number'
CATEGORY = 'category'
INTEGER = 'integer'
UNSIGNED = 'unsigned'
FLOAT = 'float'

import matplotlib.pyplot as plt
from matplotlib import __version__ as matplotlib__version__
print(f"matplotlib=={matplotlib__version__}")

import seaborn as sns
CMAP_VLAG = sns.color_palette("vlag", as_cmap=True)
print(f"seaborn=={sns.__version__}")

from graphviz import __version__ as graphviz__version__
print(f"graphviz=={graphviz__version__}")
from graphviz import Source as dot_graph

import numpy as np
print(f"numpy=={np.__version__}")

from wordcloud import WordCloud
from wordcloud import __version__ as wordcloud__version__
print(f"wordcloud=={wordcloud__version__}")


---
## Process Steps
The image below shows the process steps that we take in this notebook. The values for `min_df` and `max_df` are motivated in the parameter analysis later in this notebook.

In [None]:
dot_graph('''
digraph {
  rankdir=LR;
  node [fontname="georgia; serif" fontsize=9, margin=0 arrowhead=normal arrowtail=dot];
  edge [arrowsize=0.75]

  node [shape=folder]; documents;
  node [shape=oval]; vec [label="count vectorizer", color=grey fontcolor=grey style=dotted];
      tfidf [label="TF-IDF vectorizer"]; svd [label="singular value decomposition"];
  node [shape=plain]; min_df [label="min_df=15"]; max_df [label="max_df=85%"];
      sublinear_tf [label="sublinear_tf=True"]; n_components [label="n_components=2"]; dot0 [label=" "];
  node [shape=box]; dtm [label="document term matrix"]; dict [label="dictionary"];
      topic_encodings [label="topic encodings"]; encoding_matrix [label="encoding matrix"]; explained_variance_sum [label="sum(explained variance)"];

  documents -> corpus;
  corpus -> vec [color=grey style=dotted arrowhead=vee arrowsize=0.5];
  corpus -> tfidf;
  sublinear_tf -> tfidf;
  min_df -> tfidf;
  max_df -> tfidf;
  vec -> dtm [color=grey style=dotted arrowhead=vee arrowsize=0.5];
  tfidf -> dtm;
  vec -> dict [color=grey style=dotted arrowhead=vee arrowsize=0.5];
  tfidf -> dict;
  dtm -> svd;
  dict -> svd;
  n_components -> svd;
  svd -> topic_encodings;
  svd -> encoding_matrix;
  svd -> explained_variance_sum;
}
''')

In [None]:
# the caller is expected to select the right column from their data frame and
# pass it in as a series.
def compose_document_term_matrix(series, min_df, max_df, sublinear_tf):
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, sublinear_tf=sublinear_tf)
    dt_matrix  = vectorizer.fit_transform(series)
    dictionary = vectorizer.get_feature_names_out()

    return pd.DataFrame(index=series.index, columns=dictionary, data=dt_matrix.toarray()), \
           dictionary

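def perform_lsa(dt_matrix, dictionary, n_components, column_names=None):
    svd = TruncatedSVD(n_components=n_components)
    # explained_variance_ holds the absolute variance captured per component;
    # explained_variance_ratio_ would give it as a fraction of the total variance.
    explained_variance_sum = svd.fit(dt_matrix).explained_variance_.sum()
    lsa = svd.transform(dt_matrix)
    if not column_names:
        column_names = [f"topic {n}" for n in range(lsa.shape[1])]

    # returns: topic encodings (documents x topics), encoding matrix (terms x topics),
    # and the summed explained variance
    return pd.DataFrame(index=dt_matrix.index, data=lsa, columns=column_names), \
           pd.DataFrame(index=column_names, data=svd.components_, columns=dictionary).T, \
           explained_variance_sum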

def n_most_significant(topic, n_terms, encoding_matrix):
    df = encoding_matrix[topic]
    return df.iloc[df.abs().argsort()][::-1][:n_terms].to_frame()
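
The relationship between the document term matrix, the topic encodings and the encoding matrix can be illustrated with plain NumPy. The sketch below uses a made-up 4 x 3 matrix and an exact SVD; `TruncatedSVD` uses an approximate solver by default, so this only illustrates the idea and is not how scikit-learn computes the result.

```python
import numpy as np

# Toy document term matrix: 4 documents (rows) x 3 terms (columns).
# The values are made up for illustration only.
X = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])
n_components = 2

# Exact SVD: X = U @ diag(S) @ Vt, singular values sorted from large to small.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first n_components right singular vectors.
components = Vt[:n_components]        # topics x terms
topic_encodings = X @ components.T    # documents x topics
encoding_matrix = components.T        # terms x topics

print(topic_encodings.shape)
print(encoding_matrix.shape)
```

Each document is thus described by `n_components` numbers, and each column of the encoding matrix shows how strongly every term contributes to one topic.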


---
## Apply LSA to the Governance Data Set
We apply LSA to the subset of _duurzaamheidsvisie_ (sustainability vision, DV) documents from the governance data set.

### Load the DV Documents from the Governance Data

In [None]:
CACHE_DIR = '../cache/Governance'

GLOB_DV = CACHE_DIR + '/GM????DV??.txt'

# take a glob and make it iterable. We cannot use globs as objects, since these get
# "exhausted" when you iterate over them.
# https://stackoverflow.com/questions/51108256/how-to-take-a-pathname-string-with-wildcards-and-resolve-the-glob-with-pathlib
def expand_glob(glob):
    p = Path(glob)
    return Path(p.parent).expanduser().glob(p.name)

print(f"'Duurzaamheidsvisie' documents = {GLOB_DV}")

DOCUMENT_BODY = 'body'

def load_corpus_as_dataframe(glob):
    # expand the glob once, so that bodies and index are guaranteed to line up
    files = list(expand_glob(glob))
    df = pd.DataFrame(data   =[file.read_text() for file in files],
                      index  =[file.stem        for file in files],
                      columns=[DOCUMENT_BODY])
    df[DOCUMENT_BODY] = df[DOCUMENT_BODY].astype(STRING)

    return df
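
The remark about globs getting "exhausted" can be reproduced with the standard library alone. The sketch below creates two hypothetical files in a temporary directory and iterates over the same `Path.glob` result twice; the file names are invented to match the `GM????DV??.txt` pattern.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names that mimic the GM????DV??.txt pattern.
    for name in ("GM0001DV01.txt", "GM0002DV01.txt"):
        (Path(tmp) / name).write_text("example", encoding="utf-8")

    matches = Path(tmp).glob("GM????DV??.txt")      # a one-shot iterator
    first_pass = sorted(p.name for p in matches)
    second_pass = sorted(p.name for p in matches)   # nothing left to iterate

print(first_pass)   # ['GM0001DV01.txt', 'GM0002DV01.txt']
print(second_pass)  # []
```

This is why the glob is kept as a string and expanded again whenever the files are needed.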


In [None]:
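MIN_DF = 15   # absolute: minimum number of documents a term must occur in
MAX_DF = 0.85 # relative: maximum fraction of documents a term may occur in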
SUBLINEAR_TF = True
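
To make the meaning of `min_df` and `max_df` concrete, here is a minimal pure-Python sketch of the document-frequency filter on a made-up, already tokenised corpus. It mirrors the rule as we understand it (an integer is an absolute document count, a float is a fraction of the documents) and is not scikit-learn's own implementation.

```python
# Toy corpus of already tokenised documents (made-up example data).
docs = [
    ["energy", "wind", "solar"],
    ["energy", "wind"],
    ["energy", "solar", "nature"],
    ["energy", "nature", "water"],
]
min_df = 2      # absolute: a term must occur in at least 2 documents
max_df = 0.85   # relative: a term may occur in at most 85% of the documents

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

max_count = max_df * len(docs)
vocabulary = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
print(vocabulary)  # ['nature', 'solar', 'wind']
```

Here "energy" is dropped because it occurs in every document, and "water" because it occurs in only one.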


In [None]:
corpus = load_corpus_as_dataframe(GLOB_DV)
corpus


In [None]:
corpus.info()


In [None]:
document_term_matrix, dictionary = compose_document_term_matrix(corpus[DOCUMENT_BODY], min_df=MIN_DF, max_df=MAX_DF, sublinear_tf=SUBLINEAR_TF)
document_term_matrix
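
With `sublinear_tf=True`, scikit-learn replaces the raw term count `tf` by `1 + log(tf)` before the IDF weighting and normalisation are applied. The sketch below shows only this dampening step, using a hypothetical helper function; it does not reproduce the full TF-IDF computation.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 2))
```

A term that occurs 100 times in a document thus weighs about 5.6 times as much as a term that occurs once, instead of 100 times as much. This matters for a corpus of lengthy documents, where raw counts can become very large.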


In [None]:
dictionary

In [None]:
topic_encodings, encoding_matrix, _ = perform_lsa(document_term_matrix, dictionary, 2)
topic_encodings


In [None]:
encoding_matrix


In [None]:
n_most_significant("topic 0", 20, encoding_matrix)


In [None]:
n_most_significant("topic 1", 20, encoding_matrix)
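
Note that `n_most_significant` ranks terms by the absolute value of their weight, so strongly negative terms appear next to strongly positive ones. A minimal sketch of the same idea on a plain dictionary, with made-up weights and a hypothetical helper name:

```python
# Made-up term weights for a single topic; not taken from the governance data.
weights = {"energie": 0.42, "afval": -0.55, "natuur": 0.10, "water": -0.03}

def most_significant(weights, n_terms):
    """Return the n_terms (term, weight) pairs with the largest absolute weight."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)[:n_terms]

print(most_significant(weights, 2))  # [('afval', -0.55), ('energie', 0.42)]
```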


### Determine the correct number of topics using latent semantic analysis
Explained variance based on https://stackoverflow.com/questions/69091520/determine-the-correct-number-of-topics-using-latent-semantic-analysis

In [None]:
performance = []
test = range(1, 255)

for n in test: # slow: fits one SVD per candidate number of topics
    _, _, explained_variance_sum = perform_lsa(document_term_matrix, dictionary, n)
    performance.append(explained_variance_sum)


In [None]:
fig = plt.figure(figsize=(15, 5))
plt.plot(test, performance, 'ro--')
plt.title('cumulative explained variance ratio by n-components')
plt.xlim(1, 255)


Between 70% and 80% of the variance is explained with 125 to 150 topics.

In [None]:
fig = plt.figure(figsize=(15, 5))
plt.plot(test, np.gradient(performance), 'ro--')
plt.title('explained variance ratio by n-components')
plt.xlim(0, 25) # zoom to 25 to see the changes in the first components


The graph shows that the explained variance of the LSA model is about 75% at the maximum of 253 topics (equal to the number of documents). This low explained variance is probably due to the relatively few but lengthy documents.
The gain in explained variance per added topic flattens out at around 4 topics.
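
The difference between an absolute explained variance and an explained variance ratio can be sketched with NumPy on random stand-in data. The example assumes the definition of variance per encoded column divided by the total variance of the input; scikit-learn exposes the corresponding values as `explained_variance_` and `explained_variance_ratio_` on `TruncatedSVD`. `np.diff` is used here instead of `np.gradient` to obtain the exact gain per added component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 8))   # made-up stand-in for a document term matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)
encodings = U * S                     # documents x components

# Variance of each encoded column, relative to the total variance in X.
explained_variance = encodings.var(axis=0)
total_variance = X.var(axis=0).sum()
ratio = explained_variance / total_variance

cumulative = np.cumsum(ratio)                      # cumulative curve
per_component = np.diff(cumulative, prepend=0.0)   # gain per added component
print(np.round(cumulative, 3))
```

With all components included the cumulative ratio reaches 1; the elbow is the point after which `per_component` stays close to zero.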

### Determining the best parameters for the TF-IDF matrix in the LSA context
Below, a matrix is constructed for different values of the `min_df` (absolute counts) and `max_df` (relative fractions) settings used to build the TF-IDF matrix. Each combination is evaluated on the total explained variance and on the explained variance of the first 4 topics from the LSA. The composition of the resulting topics is compared as well.

In [None]:
min_ls = [1, 5, 10, 15, 20, 25, 35, 50]  # an integer min_df is a document count, so 1 keeps every term
max_ls = [0.5, 0.75, 0.85, 0.9, 0.95, 1.0]
dimensions = pd.DataFrame(index=min_ls, columns=max_ls)
exvar_tot =  pd.DataFrame(index=min_ls, columns=max_ls)
exvar_4 =    pd.DataFrame(index=min_ls, columns=max_ls)
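# build one independent inner list per row; `[[None] * n] * m` would repeat the same list m times
words = [[None] * len(max_ls) for _ in min_ls]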
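
A nested list that stores one result per (`min_df`, `max_df`) combination needs an independent inner list for each row. The sketch below shows the difference between the two common ways of building such a grid in Python; the sizes are arbitrary.

```python
rows, cols = 2, 3

aliased = [[None] * cols] * rows                     # every row is the same list object
independent = [[None] * cols for _ in range(rows)]   # one new list per row

aliased[0][0] = "x"
independent[0][0] = "x"

print(aliased)      # [['x', None, None], ['x', None, None]]
print(independent)  # [['x', None, None], [None, None, None]]
```

With the aliased variant, a value written for one `min_df` row silently shows up in every other row.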


In [None]:
for i, min_df in enumerate(min_ls):
    for c, max_df in enumerate(max_ls):
        # `terms` avoids shadowing the built-in `dict`
        dtm, terms = compose_document_term_matrix(corpus[DOCUMENT_BODY], min_df=min_df, max_df=max_df, sublinear_tf=SUBLINEAR_TF)
        dimensions.loc[min_df, max_df] = terms.size

        # total explained variance, using n=255
        _, _, expl_variance_sum = perform_lsa(dtm, terms, 255)
        exvar_tot.loc[min_df, max_df] = expl_variance_sum

        # explained variance of the first 4 topics, using n=4
        _, enc_matrix, expl_variance_sum = perform_lsa(dtm, terms, 4)
        words[i][c] = enc_matrix
        exvar_4.loc[min_df, max_df] = expl_variance_sum


In [None]:
dimensions


In [None]:
exvar_tot


High total explained variances are achieved for low values of both `min_df` and `max_df`. For `min_df`, this is the result of including words that occur in only a few documents. For `max_df`, it is most likely because the total number of word occurrences is reduced, which reduces the total variance.

In [None]:
exvar_4


In [None]:
exvar_4 / exvar_tot


The first 4 topics explain the most variance at a high `max_df` and a low `min_df`. The explained variance ratio (the variance explained by the first 4 topics divided by the total explained variance) is largest at a high `max_df` and a high `min_df`; however, this ratio decreases only slightly for medium values of `max_df`. The influence of `max_df` therefore mainly stems from the higher total explained variance. The effect of `min_df` is explained by the documents having fewer unique words, so that the remaining topics are less able to explain variance.

### Influence of `min_df` and `max_df` on Topic Interpretation
Below, the top 20 words in each topic are listed in order to interpret the common theme within the topic. The influence of different `min_df` and `max_df` settings can be investigated by changing the index into the `words` matrix (see the matrices above).

Topic labelling seems quite stable for most combinations of `min_df` and `max_df`. Only for `max_df<=0.75` does the label of the last two topics change. These two topics seem to share a quite similar theme. It is concluded that low values of `max_df` result in less interpretable topics.

Topic composition is influenced only slightly by different values of `max_df` and `min_df`. Low values of `max_df` seem to reduce the explainability of the topics.

From this analysis it can be concluded that medium values for both `max_df` and `min_df` result in a good balance between the total explained variance and the explained variance of the first 4 topics. For our analysis we propose `min_df=15` and `max_df=0.85`.

### Topic assessment with 4 topics
According to the elbow method, 4 topics are optimal to capture the information in the documents. Below, these 4 topics are assessed on their content and their value for clustering.

In [None]:
topic_encodings, encoded_matrix, _ = perform_lsa(document_term_matrix, dictionary, 4,
                                                ['samenleving', 'milieu', 'mens en natuur', 'energie'])
topic_encodings


The first 4 topics explain only 6.6% of the variance between the documents.

In [None]:
encoded_matrix


In [None]:
sns.pairplot(topic_encodings);


---
## Generate Word Clouds per Topic

In [None]:
def wordcloud_of_topic(series):
    ranked = series.sort_values(ascending=False)  # highest weights first
    print(ranked[:20])
    cloud = WordCloud(background_color="white", max_words=50).generate_from_frequencies(ranked)
    plt.axis('off')
    plt.imshow(cloud);


In [None]:
wordcloud_of_topic(encoded_matrix['samenleving'])


In [None]:
wordcloud_of_topic(encoded_matrix['milieu'])


In [None]:
wordcloud_of_topic(encoded_matrix['mens en natuur'])


In [None]:
wordcloud_of_topic(encoded_matrix['energie'])
