<h1>Welcome to the accompanying Jupyter Notebook for exploring how pre-processing and algorithm choices in UML affect the results and their interpretability and representativeness</h1>

This notebook lets you change the pre-processing steps and parameters to see how the results change. It produces two Excel files: one for assessing interpretability and one for assessing representativeness. A dataset is provided, but you can use your own if you wish.


Let us begin by importing general Python data analysis packages and spaCy, the natural language processing library we will be using. To install spaCy, visit https://spacy.io/usage. We recommend using Anaconda to manage Python environments.

In [1]:
import pandas as pd
import numpy as np
import re

import spacy

In [2]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [3]:
spacy_model = "en_core_web_sm"

Loading the provided data sample into a pandas dataframe and applying some basic data cleansing.

In [4]:
df = pd.read_excel('cern_news_data.xlsx')

#Choosing the columns we want for our analysis
df = df[['Document', 'Label']]

#Dropping rows with empty data
df = df.dropna(how = 'any',axis = 0).reset_index(drop = True)
#Dropping rows of duplicate text documents
df = df.drop_duplicates(subset="Document")

#If you want to take a smaller sample of the data for faster exploration, uncomment the following line:
#df = df.head(550)

More basic cleansing steps are applied to the text documents in the 'Document' column of the dataframe.
Extra whitespace, asterisks, and quotation marks are removed.

In [5]:
#Making a new list of text documents with asterisks and quotation marks
#(which would complicate further cleansing) removed
cleaner_documents = [text.replace("*", " ").replace('"','') for text in list(df['Document'])]
#Collapsing every run of whitespace into a single space
clean_documents = [re.sub(r'\s+', ' ', text) for text in cleaner_documents]

#Adding the cleansed documents to the dataframe
df['Action Clean'] = clean_documents
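
As a small illustration of the whitespace handling above, the sketch below contrasts two regular expressions on a made-up string (`sample` is hypothetical and not taken from the dataset). A character class such as `[\s+]` matches one character at a time, so each whitespace character becomes its own space, whereas `\s+` matches a whole run of whitespace and collapses it into one space.

```python
import re

# Hypothetical sample text containing a tab, a newline and repeated spaces
sample = "CERN\t announces  new\nresults"

# Character class: every single whitespace character (or a literal '+') is replaced
per_char = re.sub(r"[\s+]", " ", sample)

# Quantified class: each run of whitespace is collapsed into one space
collapsed = re.sub(r"\s+", " ", sample)

print(repr(per_char))
print(repr(collapsed))
```

Only the second pattern removes the extra whitespace; the first merely normalises tabs and newlines to spaces.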

Let us take a look at a sample of the clean documents:

In [6]:
for clean_document in clean_documents[:5]:
    print(clean_document+"\n")

Founded in 2004, Zecotek operates three divisions: Imaging Systems, Optronics Systems and 3D Display Systems with labs located in Canada, Korea, Russia, Singapore and U.S.A. The management team is focused on building shareholder value by commercializing over 50 patented and patent pending novel photonic technologies directly and through strategic alliances with Hamamatsu Photonics (Japan), the European Organization for Nuclear Research (Switzerland), Shanghai EBO Optoelectronics Technology Co. (China), NuCare Medical Systems (South Korea), the University of Washington (United States), and National NanoFab Center (South Korea). For more information visit www.zecotek.com and follow @zecotek on Twitter.

Pakistan has a long tradition of international scientific collaborations. In addition to being actively involved in IAEA's activities, for decades Pakistan has been contributing and regularly participating in European Organization for Nuclear Research's projects, theoretical and nuclear e

<h2>Make pre-processing choices.</h2>

Set the pre-processing steps you wish to use to 'True' and the others to 'False'.
A warning is raised if you choose conflicting pre-processing steps.
The code still runs after the warning: if you choose 'True' for both chunks and n-grams, the program will make n-grams of chunks.

Consult the table on pre-processing choices in the article for details on these choices.

In [7]:
import warnings 

#Tokenization:
basic_tokenization = False
chunk_tokenization = False
ngram_tokenization = True
#When using n-grams, how many adjacent tokens to gram together:
n = 2

#If conflicting choices, raise warning.
if sum(map(bool, [basic_tokenization,chunk_tokenization,ngram_tokenization])) != 1:
    warnings.warn("Please specify exactly one tokenization.")

#Vectorization:
bow_vectorization = False
tf_idf_vectorization = True

#If conflicting choices, raise warning.
if sum(map(bool, [bow_vectorization, tf_idf_vectorization])) != 1:
    warnings.warn("Please specify exactly one vectorization.")

#Lemmatization:
lemmatized = True

#Loading the spaCy model
nlp = spacy.load(spacy_model)
#If chunk tokenization is chosen, spaCy will merge noun chunks into single tokens
if chunk_tokenization:
    nlp.add_pipe("merge_noun_chunks")
#Otherwise nothing needs to be disabled: "merge_noun_chunks" is not part of
#the default pipeline and is only present when added explicitly

The following function is for naming output-files according to the pre-processing regime used. 

In [8]:
from datetime import datetime
#Get the date of today to add to file names for file version control
today = str(datetime.now().date())

def pre_processing_name():
    if basic_tokenization:
        tokenization_name = "Basic"
    if chunk_tokenization:
        tokenization_name = "Chunk"
    if ngram_tokenization:
        tokenization_name = "Ngram"
    if bow_vectorization:
        vectorizer_name = "BOW"
    if tf_idf_vectorization:
        vectorizer_name = "TFIDF"
    
    #Returns a tuple of strings; it is joined with "_" when the file names are built
    return (tokenization_name, vectorizer_name, today)
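
To make the naming scheme concrete, here is a minimal sketch of how such a tuple ends up in a file name. The helper `build_output_name` and the fixed date are hypothetical and only serve the illustration; the notebook itself joins the tuple inline when writing the Excel files.

```python
from datetime import date

def build_output_name(prefix, tokenization_name, vectorizer_name, day):
    """Join the pre-processing labels and a date into an Excel file name."""
    parts = (tokenization_name, vectorizer_name, day.isoformat())
    return prefix + "_" + "_".join(parts) + ".xlsx"

# Fixed, made-up date so that the result is reproducible
example = build_output_name("interpretability_assessment", "Ngram", "TFIDF", date(2024, 1, 31))
print(example)
```

Including the pre-processing regime and the date in the name keeps the outputs of different runs from overwriting each other.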

Using spaCy to tokenize the text documents. Every token is lowercased and, if lemmatization is chosen, spaCy's lemmas are used. Tokens that are part of known entities, stopwords, punctuation, spaces, numerals, URLs, and emails are removed in cleansing. To retain any of these, simply remove the corresponding "token.(what_you_want_to_keep)" condition from the following:

In [9]:
def basic_tokenizer(document, lemmatized=lemmatized):
    #Converting the text document into a Spacy document
    document = nlp(document)
    if not lemmatized:
        tokenized = [token.text.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]
    if lemmatized:
        tokenized = [token.lemma_.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]       
    #Returns a list of tokens
    return tokenized

The following function does exactly what basic_tokenizer does, but combines the tokens into n-grams according to the set n-value.

In [10]:
def ngram_tokenizer(document, n=n, lemmatized=lemmatized):
    #Converting the text document into a Spacy document
    document = nlp(document)
    if not lemmatized:
        tokenized = [token.text.lower() for token in document if token.ent_iob == 2 
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]
        #Joins found tokens with "_" into n-grams
        ngrams = ["_".join(ngram) for ngram in zip(*[tokenized[i:] for i in range(n)])]
    if lemmatized:
        tokenized = [token.lemma_.lower() for token in document if token.ent_iob == 2 
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]       
        #Joins found tokens with "_" into n-grams
        ngrams = ["_".join(ngram) for ngram in zip(*[tokenized[i:] for i in range(n)])]
    #Returns a list of tokens
    return ngrams
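
The n-gram step can be tried in isolation on a plain list of tokens. The sketch below uses the same `zip` idiom as above; the function name `make_ngrams` and the token list are made up for illustration.

```python
def make_ngrams(tokens, n=2):
    """Join every run of n adjacent tokens with an underscore."""
    # n copies of the list, each shifted one position further to the left
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so only complete n-grams are produced
    return ["_".join(gram) for gram in zip(*shifted)]

tokens = ["found", "operate", "division", "lab"]
bigrams = make_ngrams(tokens, n=2)
trigrams = make_ngrams(tokens, n=3)
too_short = make_ngrams(["lab"], n=2)

print(bigrams)
print(trigrams)
print(too_short)
```

Note that a document with fewer than n tokens yields an empty list, so very short documents contribute nothing to the vocabulary when n-grams are used.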

Initializing an empty list for tokenized documents to be added to.

In [11]:
tokenized_documents = []

Tokenizing the cleansed text documents according to the pre-processing choices and adding each tokenized document to the initialized list.

In [12]:
for document in clean_documents:
    if chunk_tokenization:
        #using basic tokenizer on the document with Spacy's chunks enabled
        tokenized = basic_tokenizer(document)
    if basic_tokenization:
        #using basic tokenizer on the document with Spacy's chunks disabled
        tokenized = basic_tokenizer(document)
    if ngram_tokenization:
        tokenized = ngram_tokenizer(document)
    #adding the tokenized document to tokenized_documents
    tokenized_documents.append(tokenized)

#Adding the tokenized documents to the dataframe
df['Tokenized'] = tokenized_documents

Let us take a look at the tokenized documents compared to the cleansed ones:

In [13]:
for i in range(5):
    print("Cleansed: "+clean_documents[i]+"\n")
    print("Tokenized: "+str(tokenized_documents[i])+"\n")

Cleansed: Founded in 2004, Zecotek operates three divisions: Imaging Systems, Optronics Systems and 3D Display Systems with labs located in Canada, Korea, Russia, Singapore and U.S.A. The management team is focused on building shareholder value by commercializing over 50 patented and patent pending novel photonic technologies directly and through strategic alliances with Hamamatsu Photonics (Japan), the European Organization for Nuclear Research (Switzerland), Shanghai EBO Optoelectronics Technology Co. (China), NuCare Medical Systems (South Korea), the University of Washington (United States), and National NanoFab Center (South Korea). For more information visit www.zecotek.com and follow @zecotek on Twitter.

Tokenized: ['found_operate', 'operate_division', 'division_lab', 'lab_locate', 'locate_management', 'management_team', 'team_focus', 'focus_build', 'build_shareholder', 'shareholder_value', 'value_commercialize', 'commercialize_patented', 'patented_patent', 'patent_pende', 'pend

Getting the vectorizers from scikit-learn. scikit-learn vectorizers tokenize documents themselves unless another tokenizer is specified. Our documents are already tokenized, so we provide a dummy "tokenizer" function that simply returns the tokenized document unchanged.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

#the dummy function that returns the already tokenized document
def id_fun(already_tokenized):
    return already_tokenized

#initializing tf-idf
if tf_idf_vectorization:
    vectorizer = TfidfVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
    
#initializing bag-of-words
if bow_vectorization:
    vectorizer = CountVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
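
To see what the two vectorizations do to already tokenized documents, here is a plain-Python sketch that does not use scikit-learn. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is what scikit-learn documents as its default TF-IDF behaviour; the function names and the two tiny documents are made up.

```python
import math
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Bag-of-words: raw count of every vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def tfidf_vectors(tokenized_docs):
    """TF-IDF with smoothed idf and L2 normalisation (assumed defaults)."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
           for term in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = bow_vector(doc, vocabulary)
        raw = [count * idf[term] for count, term in zip(counts, vocabulary)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors

docs = [["higgs_boson", "particle_physics"], ["higgs_boson", "dark_matter"]]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
print(vectors[0])
```

In the first document the token that occurs in only one document gets a larger weight than the token shared by both, which is the effect TF-IDF is chosen for.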

Applying the vectorization on the tokenized documents:

In [15]:
vectorized = vectorizer.fit_transform(tokenized_documents)

The vectorized data is in the form of a sparse matrix. The vocabulary can be retrieved with the '.get_feature_names_out()' function.

In [16]:
print(type(vectorized))
features = vectorizer.get_feature_names_out()
print(features[-20:])

<class 'scipy.sparse._csr.csr_matrix'>
['young_work' 'youngster_interested' 'youth_visit' 'z._jurgen'
 'z._renewable' 'zentrum_für' 'zenuity_autonomous' 'zenuity_found'
 'zenuity_issue' 'zeplin_draw' 'zip_essentially' 'zip_helium' 'zone_grow'
 'zone_launch' 'zoom_electromagnetic' 'zoom_large' '£_sale' 'à_et'
 'â(euro)oefor_outstanding' 'â€‹gã‰ant_lead']


<h3>Determining the number of topics and/or clusters to make with the parametric algorithms:</h3>

In [17]:
clusters = 20

<h3>Running Clustering and Topic Modelling Algorithms</h3>

Fitting the clustering algorithms. Mean Shift will take a long time unless you have restricted the analysis to a smaller sample of the data. If Affinity Propagation does not converge, scikit-learn warns about it and the later cells may fail because the cluster centres are missing or degenerate.

In [None]:
from sklearn.cluster import KMeans, AffinityPropagation, MeanShift

#Fitting K-means on the vectorized documents:
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(vectorized)
print("K-means done.")

#If Affinity Propagation does not converge, try increasing damping (allowed range: 0.5 <= damping < 1)
ap = AffinityPropagation(affinity='euclidean', damping=0.9, random_state=None).fit(vectorized)
print("Affinity Propagation done. Now patience for Mean Shift.\nThis may take a few hours.")

#K-means and Affinity Propagation accept sparse matrices
#Mean Shift only accepts array-like input, so the sparse matrix is converted to an array
ms = MeanShift().fit(vectorized.toarray())
print("Mean Shift done.")


K-means done.
Affinity Propagation done. Now patience for Mean Shift.
This may take a few hours.


Fitting the topic modelling algorithms. Gensim algorithms require specific data types (dictionary and corpus), which are created first. A missing Levenshtein library may raise a deprecation warning, but similarity functions are not used here.

In [None]:
from gensim import corpora
from gensim import matutils, models

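#The corpus is built from the scikit-learn matrix, so its word ids are the column indices of that matrix
corpus = matutils.Sparse2Corpus(vectorized, documents_columns=False)
#The dictionary must use the same ids; building it from the raw tokens would number them differently
id2word = dict(enumerate(vectorizer.get_feature_names_out()))
dictionary = corpora.Dictionary.from_corpus(corpus, id2word=id2word)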

#Fitting LSI
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=clusters)
print("LSI done.")

#Fitting LDA
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=clusters)
print("LDA done.")

#Fitting HDP
hdp = models.HdpModel(corpus, id2word=dictionary)
print("HDP done.")
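
Because the corpus above is derived from the scikit-learn matrix, the integer ids in it are column indices of that matrix. The sketch below (plain Python, hypothetical helper names and documents) shows why a token-to-id mapping built in order of first appearance generally disagrees with an alphabetically sorted vocabulary such as the feature listing printed earlier. This is worth checking whenever a dictionary and a corpus are built separately, since a mismatch silently attaches the wrong tokens to the topics.

```python
def ids_by_first_appearance(tokenized_docs):
    """Number tokens in the order in which they are first seen."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def ids_by_sorted_vocabulary(tokenized_docs):
    """Number tokens by their position in the sorted vocabulary."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    return {token: idx for idx, token in enumerate(vocabulary)}

docs = [["zoom_large", "dark_matter"], ["higgs_boson", "dark_matter"]]
appearance = ids_by_first_appearance(docs)
alphabetical = ids_by_sorted_vocabulary(docs)

# Tokens whose id differs between the two numbering schemes
mismatched = sorted(t for t in appearance if appearance[t] != alphabetical[t])
print(appearance)
print(alphabetical)
print(mismatched)
```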

<h2>Interpretability Analysis</h2>

Organizing the tokens of the created clusters according to 'importance':

In [None]:
#K-means and Mean Shift return cluster centres as numpy arrays
#Affinity Propagation returns a sparse matrix, which first needs to be
#converted to array form
km_order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
ap_order_centroids = ap.cluster_centers_.toarray().argsort()[:, ::-1]
ms_order_centroids = ms.cluster_centers_.argsort()[:, ::-1]
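
The ordering above sorts the column indices of each cluster centre from the largest weight to the smallest, so the first indices point at the most characteristic tokens. A minimal plain-Python sketch of the same idea, with a made-up vocabulary and centre:

```python
def top_token_indices(centroid, top=3):
    """Indices of the largest centroid weights, largest first."""
    order = sorted(range(len(centroid)), key=lambda idx: centroid[idx], reverse=True)
    return order[:top]

# Hypothetical vocabulary and one hypothetical cluster centre
example_features = ["dark_matter", "higgs_boson", "particle_physics", "zoom_large"]
example_centroid = [0.05, 0.60, 0.30, 0.00]

top_tokens = [example_features[idx] for idx in top_token_indices(example_centroid, top=2)]
print(top_tokens)
```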

The function returns a dictionary for a clustering output that maps each cluster produced by the method to a list of its top tokens.

In [None]:
def print_clustering(order_centroids, name, top=10):
    #Initialising data structure
    dictionary = {}
    #Iterating through every cluster
    for i in range(len(order_centroids)):
        ind = i + 1
        #Naming cluster
        cluster_name = name + " " + str(ind)
        #Finding top tokens in the ordered list of tokens
        token_list = [features[ind] for ind in order_centroids[i, :top]]
        #Adding the result to the dictionary
        dictionary[cluster_name] = token_list
    return dictionary

Creating dictionaries per clustering output for interpretability assessment.

In [None]:
km_cluster_dict = print_clustering(km_order_centroids, "K-means")
ap_cluster_dict = print_clustering(ap_order_centroids, "Affinity Propagation")
ms_cluster_dict = print_clustering(ms_order_centroids, "Mean Shift")

Creating the pandas dataframes from the dictionaries per clustering output for interpretability assessment.

In [None]:
kmeans_df = pd.DataFrame.from_dict(km_cluster_dict, orient='index')
ap_df = pd.DataFrame.from_dict(ap_cluster_dict, orient='index')
ms_df = pd.DataFrame.from_dict(ms_cluster_dict, orient='index')

The function returns a dictionary for a topic modelling output that maps each topic produced by the method to a list of its top tokens. HDP output is formatted slightly differently from LSI and LDA output, so it is handled separately.

In [None]:
def print_topics(model, name, clusters, hdp=False, vocab=None, top=10):
    #Initialising data structure
    dictionary = {}
    if not hdp:
        topics = model.show_topics(num_topics=clusters, num_words=top)
        #Iterating over every topic
        for item in topics:
            topic = list(item)
            #Naming topic
            topic_name = name + " " + str(int(topic[0])+1)
            clean_str = topic[1].replace('""','"')
            terms = re.findall('\"(.*?)\"', clean_str)
            #Finding top tokens in the list of tokens
            token_list = [str(token) for token in terms[:top]]
            for tok in token_list:
                if "*" in tok:
                    print(token_list)
                    print(item)
            #Adding the result to the dictionary
            dictionary[topic_name] = token_list
    else:
        num_topics = model.get_topics().shape[0] #HDP ignores 'clusters'; with gensim's default settings it returns 150 topics
        topics = model.show_topics(num_topics=num_topics,num_words=top)
        #Iterating over every topic
        for item in topics:
            topic = list(item)
            #Naming topic
            topic_name = name + " " + str(int(topic[0])+1)
            #Cleansing the output to find tokens: the formatted topic string
            #is split into "weight*token" parts at every "+"
            tokens_to_clean = topic[1].split("+")
            #Keeping only the token part of every "weight*token" pair
            tokens_clean = [token.split("*")[1].strip() for token in tokens_to_clean]
            #Keeping the cleansed tokens (all but the last one)
            token_list = [str(token) for token in tokens_clean[:-1]]
            #Adding the result to the dictionary
            dictionary[topic_name] = token_list
    return dictionary 
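
The parsing for LSI and LDA can be tested on its own. The sketch below assumes a formatted topic string of the form `weight*"token" + weight*"token"`; the sample string and the helper name `parse_topic_string` are made up for illustration.

```python
import re

def parse_topic_string(topic_string, top=10):
    """Extract the quoted tokens from a 'weight*"token" + ...' string."""
    return re.findall(r'"(.*?)"', topic_string)[:top]

# Hypothetical formatted topic string
sample_topic = '0.030*"higgs_boson" + 0.020*"dark_matter" + 0.010*"particle_physics"'
tokens = parse_topic_string(sample_topic, top=2)
print(tokens)
```

Extracting whatever sits between the quotation marks avoids splitting on "+" or "*", which would break if a token itself contained one of those characters.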

Creating dictionaries per topic modelling output for interpretability assessment.

In [None]:
lsi_topic_dict = print_topics(lsi, "LSI", clusters)
lda_topic_dict = print_topics(lda, "LDA", clusters)
hdp_topic_dict = print_topics(hdp, "HDP", clusters, True)

Creating the pandas dataframes from the dictionaries per topic modelling output for interpretability assessment.

In [None]:
lsi_df = pd.DataFrame.from_dict(lsi_topic_dict, orient='index')
lda_df = pd.DataFrame.from_dict(lda_topic_dict, orient='index')
hdp_df = pd.DataFrame.from_dict(hdp_topic_dict, orient='index')

Combining the topic modelling and clustering interpretability dataframes to an Excel-file, algorithm per sheet. 

In [None]:
with pd.ExcelWriter("interpretability_assessment_"+str("_".join(pre_processing_name()))+".xlsx") as writer:  
    kmeans_df.to_excel(writer, sheet_name='K-means')
    ap_df.to_excel(writer, sheet_name='Aff. Prop.')
    ms_df.to_excel(writer, sheet_name='Mean Shift')
    lda_df.to_excel(writer, sheet_name='LDA')
    lsi_df.to_excel(writer, sheet_name='LSI')
    hdp_df.to_excel(writer, sheet_name='HDP')

<h2>Representativeness Analysis</h2>

Next, for each vectorized text document, we find the cluster it belongs to with the built-in "predict" function of each clustering algorithm, and add these predictions to the dataframe along with the top tokens of the assigned cluster. We use the predict function because this section of the notebook can then also be used to assign clusters to new, unseen documents.

In [None]:
#Predicting the K-means clusters for the vectorized documents
km_pred = kmeans.predict(vectorized)
#Getting the cluster names for the predictions
km_clusters = ["K-means "+str(cluster+1) for cluster in km_pred]
#Getting the cluster tokens per cluster prediction
km_cluster_tokens = [km_cluster_dict[cluster] for cluster in km_clusters]

#Inserting the predictions and tokens into the dataframe as columns
df['K-means Clusters'] = km_clusters
df['K-means tokens'] = ["; ".join(token_list) for token_list in km_cluster_tokens]

#The predict methods of AP and MS accept only array-like data

#Predicting the Affinity Propagation clusters for the vectorized documents
ap_pred = ap.predict(vectorized.toarray())
#Getting the cluster names for the predictions
ap_clusters = ["Affinity Propagation "+str(cluster+1) for cluster in ap_pred]
#Getting the cluster tokens per cluster prediction
ap_cluster_tokens = [ap_cluster_dict[cluster] for cluster in ap_clusters]

#Inserting the predictions and tokens into the dataframe as columns
df['Affinity Propagation Clusters'] = ap_clusters
df['Affinity Propragation tokens'] = ["; ".join(token_list) for token_list in ap_cluster_tokens]

#Predicting the Mean Shift clusters for the vectorized documents
ms_pred = ms.predict(vectorized.toarray())
#Getting the cluster names for the predictions
ms_clusters = ["Mean Shift "+str(cluster+1) for cluster in ms_pred]
#Getting the cluster tokens per cluster prediction
ms_cluster_tokens = [ms_cluster_dict[cluster] for cluster in ms_clusters]                   

#Inserting the predictions and tokens into the dataframe as columns
df['Mean Shift Clusters'] = ms_clusters
df['Mean Shift tokens'] = ["; ".join(token_list) for token_list in ms_cluster_tokens]
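
For K-means, predicting means assigning each document to the closest cluster centre. The sketch below shows that nearest-centre rule in plain Python with made-up two-dimensional vectors; `nearest_centroid` is a hypothetical helper and not part of scikit-learn.

```python
import math

def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest Euclidean distance to the vector."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))

# Hypothetical cluster centres and document vectors
centroids = [[1.0, 0.0], [0.0, 1.0]]
documents = [[0.9, 0.1], [0.2, 0.8]]

labels = [nearest_centroid(doc, centroids) for doc in documents]
# Same naming scheme as in the notebook: clusters are numbered from 1
names = ["K-means " + str(label + 1) for label in labels]
print(names)
```

Since the rule only needs the fitted centres, it applies equally to documents that were not part of the fitting data, provided they are vectorized with the same vocabulary.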

This function sorts a list of (topic, value) tuples by the absolute value of the second element, largest first. It is used to find the most representative topic for each document.

In [None]:
def sort_tuple(tup):   
    tup.sort(key = lambda x: abs(x[1]), reverse=True)  
    return tup 
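
If only the top entry is needed, the same result can be obtained without sorting. A minimal sketch with made-up topic distributions; `dominant_topic` is a hypothetical helper, and the negative value mimics an LSI-style weight.

```python
def dominant_topic(topic_weights):
    """Return the (topic, value) pair with the largest absolute value, or None."""
    if not topic_weights:
        return None
    return max(topic_weights, key=lambda pair: abs(pair[1]))

# Hypothetical outputs for one document
lda_like = [(0, 0.10), (3, 0.75), (5, 0.15)]
lsi_like = [(0, 0.20), (1, -0.90)]

print(dominant_topic(lda_like))
print(dominant_topic(lsi_like))
print(dominant_topic([]))
```

Unlike sorting in place, this leaves the input list untouched and handles an empty distribution explicitly.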

Initializing the same lists for topic modelling as were just made for clustering: the predicted top topic and the tokens that go along with that topic. Because topic modelling is probabilistic, a document belongs to multiple topics, but we look only at the most representative one. We also record the probability of the most representative topic (for LSI this value is a weight that can be negative rather than a probability, which is why the sorting uses absolute values).

In [None]:
lda_topics = []
lda_probabilities = []
lda_topic_tokens = []

lsi_topics = []
lsi_probabilities = []
lsi_topic_tokens = []

hdp_topics = []
hdp_probabilities = []
hdp_topic_tokens = []

Next, the corpus created for the gensim topic models is iterated over, and a topic distribution is predicted for each document, similarly to clustering. From the result, the most representative topic, its probability, and its tokens are collected and appended to the lists initialized previously. Documents without any assigned topic are given placeholder values instead.

In [None]:
for item in corpus:
    #LDA
    #Get ordered topic distribution for document
    lda_res = sort_tuple(lda[item])
    if lda_res:
        #Get topic and name it
        lda_topic = "LDA " +str(lda_res[0][0]+1)
        #Add to list the topic name
        lda_topics.append(lda_topic)
        #Add to list the tokens of topic
        lda_topic_tokens.append(lda_topic_dict[lda_topic])
        #Add to list the probability of topic
        lda_probabilities.append(lda_res[0][1])
    else:
        #If no result, assign missing or NaN values instead
        lda_topic = "LDA NaN"
        lda_topics.append(lda_topic)
        lda_topic_tokens.append(["Missing"])
        lda_probabilities.append(np.nan)

    #LSI
    #Get ordered topic distribution for document
    lsi_res = sort_tuple(lsi[item])
    if lsi_res:
        #Get topic and name it
        lsi_topic = "LSI "+str(lsi_res[0][0]+1)
        #Add to list the topic name
        lsi_topics.append(lsi_topic)
        #Add to list the tokens of topic
        lsi_topic_tokens.append(lsi_topic_dict[lsi_topic])
        #Add to list the probability of topic
        lsi_probabilities.append(lsi_res[0][1])
    else:
        #If no result, assign missing or NaN values instead
        lsi_topic = "LSI NaN"
        lsi_topics.append(lsi_topic)
        lsi_topic_tokens.append(["Missing"])
        lsi_probabilities.append(np.nan)

    #HDP
    #Get ordered topic distribution for document
    hdp_res = sort_tuple(hdp[item])
    if hdp_res:
        #Get topic and name it
        hdp_topic = "HDP " +str(hdp_res[0][0]+1)
        #Add to list the topic name
        hdp_topics.append(hdp_topic)
        #Add to list the tokens of topic
        hdp_topic_tokens.append(hdp_topic_dict[hdp_topic])
        #Add to list the probability of topic
        hdp_probabilities.append(hdp_res[0][1])
    else:
        #If no result, assign missing or NaN values instead
        hdp_topic = "HDP NaN"
        hdp_topics.append(hdp_topic)
        hdp_topic_tokens.append(["Missing"])
        hdp_probabilities.append(np.nan)

Adding the topic modelling topics, probabilities, and tokens as columns to the dataframe. Token lists are formatted to separate tokens in list with ";".

In [None]:
print(lda_topics)

In [None]:
df['LDA Topic']=lda_topics
df['LDA Probability']=lda_probabilities
df['LDA Topic Tokens'] = ["; ".join(token_list) for token_list in lda_topic_tokens]

df['LSI Topic']=lsi_topics
df['LSI Probability']=lsi_probabilities
df['LSI Topic Tokens'] = ["; ".join(token_list) for token_list in lsi_topic_tokens]

df['HDP Topic']=hdp_topics
df['HDP Probability']=hdp_probabilities
df['HDP Topic Tokens'] = ["; ".join(token_list) for token_list in hdp_topic_tokens]

Formatting dataframes of the representativeness data to be used as sheets in the Excel-file.

In [None]:
kmdf = df[['Document', 'Label','Action Clean','Tokenized','K-means Clusters','K-means tokens']]
apdf = df[['Document', 'Label','Action Clean','Tokenized','Affinity Propagation Clusters','Affinity Propragation tokens']]
msdf = df[['Document', 'Label','Action Clean','Tokenized','Mean Shift Clusters','Mean Shift tokens']]
ldadf = df[['Document', 'Label','Action Clean','Tokenized','LDA Topic','LDA Probability','LDA Topic Tokens']]
lsidf = df[['Document', 'Label','Action Clean','Tokenized','LSI Topic','LSI Probability','LSI Topic Tokens']]
hdpdf = df[['Document', 'Label','Action Clean','Tokenized','HDP Topic','HDP Probability','HDP Topic Tokens']]

Creating the Excel-file from the dataframes.

In [None]:
with pd.ExcelWriter("representativeness_assessment_"+str("_".join(pre_processing_name()))+".xlsx") as writer:  
    kmdf.to_excel(writer, sheet_name='K-means')
    apdf.to_excel(writer, sheet_name='Aff. Prop.')
    msdf.to_excel(writer, sheet_name='Mean Shift')
    ldadf.to_excel(writer, sheet_name='LDA')
    lsidf.to_excel(writer, sheet_name='LSI')
    hdpdf.to_excel(writer, sheet_name='HDP')

<h3>Confusion Matrices and Comparison to SML</h3>

The text documents were also coded by a human into predefined categories; in the provided dataset these are Human capital, Technology, and Scientific knowledge (see the dictionary below). Here we set LDA and K-means to create the same number of topics or clusters, and we compare them against the human categorization with a confusion matrix.

Setting the number of clusters/topics to make to match the SML categorization and creating a dictionary to abbreviate the categories.

In [None]:
number_of_sml_classes = 3
category_dict = {"Human capital":"Hu","Technology":"Tech","Scientific knowledge":"Sci"}

While we would absolutely love to make beautiful graphics for you, most libraries for this conflict with the ones currently in use, at least for now. So alas, we need to make do with simple text representations of the results. We also show how many documents are in each topic, cluster, and SML category.

In [None]:
import collections 

#Creating the LDA model with one topic per SML class
confusion_lda = models.LdaModel(corpus, id2word=dictionary, num_topics=number_of_sml_classes)
#Creating the K-means model with one cluster per SML class
confusion_kmeans = KMeans(n_clusters=number_of_sml_classes, random_state=0).fit(vectorized)

#Abbreviating the categories and adding the abbreviations to the dataframe
df['Class ID'] = [category_dict[item] for item in list(df['Label'])]
#Finding the most representative topics for each document in corpus
df['LDA Confusion analysis'] = [sort_tuple(confusion_lda[item])[0][0]+1 for item in corpus]
#Finding the predicted clusters for each document in documents (same as corpus but different format)
df['K-means Confusion analysis'] = [str(cluster+1) for cluster in confusion_kmeans.predict(vectorized)]

#The confusion matrix of LDA against SML categorization
print("LDA")
lda_c_confusion_matrix = pd.crosstab(df['Class ID'], df['LDA Confusion analysis'], 
                                     rownames=['SML'], colnames=['Topic'])
print(lda_c_confusion_matrix)

print("\nNumber of documents in each topic: ")
print(collections.Counter(list(df['LDA Confusion analysis'])).most_common(7))
print("Number of documents in each SML category: ")
print(collections.Counter(list(df['Class ID'])).most_common(7))

#The confusion matrix of K-means against SML categorization
print("\nK-means")
km_confusion_matrix = pd.crosstab(df['Class ID'], df['K-means Confusion analysis'], 
                                  rownames=['SML'], colnames=['Cluster'])
print(km_confusion_matrix)

print("\nNumber of documents in each cluster: ")
print(collections.Counter(list(df['K-means Confusion analysis'])).most_common(7))
print("Number of documents in each SML category: ")
print(collections.Counter(list(df['Class ID'])).most_common(7))
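
The cross-tabulation itself is just a count of (human category, predicted group) pairs. A minimal plain-Python sketch with made-up labels, which does not use pandas; `cross_tabulate` is a hypothetical helper.

```python
from collections import Counter

def cross_tabulate(human_labels, predicted):
    """Count how often each (human category, predicted group) pair occurs."""
    return Counter(zip(human_labels, predicted))

# Hypothetical abbreviated categories and cluster assignments for five documents
human = ["Hu", "Hu", "Tech", "Sci", "Tech"]
clusters = ["1", "1", "2", "2", "2"]

table = cross_tabulate(human, clusters)
columns = sorted(set(clusters))
for label in sorted(set(human)):
    print(label, [table[(label, column)] for column in columns])
```

A row with its counts concentrated in a single column means that the category maps cleanly onto one cluster or topic; counts spread across a row indicate that the category is split.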

Depending on the pre-processing, you will usually find that topic modelling topics are dispersed all over the matrix, while clustering results are sparser and more concentrated. This demonstrates the probabilistic nature of topic modelling: "All topics can be found in all documents", which makes it difficult to assign a document to a topic or a topic to a document.