# OntoGen workflow notebook tutorial based on Python script modules for LBD

OntoGen [4] is a semi-automatic, data-driven, interactive text mining tool that assists users in the creative process of generating topic ontologies. 
It groups documents into related clusters, which can be viewed as concepts in an automatically created topic ontology. 
The underlying methodology is *k*-means clustering, a popular technique because the user only needs to choose the parameter *k*, i.e. the number of clusters into which the documents are grouped.
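
To make the clustering idea concrete, the following is a minimal sketch of *k*-means (Lloyd's algorithm) on toy document vectors. It is not OntoGen's implementation, which is not shown in this notebook and may differ in its distance measure and initialization. The function `kmeans`, the naive "first *k* points" initialization and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Naive deterministic initialization (assumption): the first k points.
    centroids = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Toy document vectors over a two-word vocabulary (hypothetical data).
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centroids = kmeans(doc_vectors, k=2)
print(labels)     # cluster index of each document
print(centroids)  # one centroid per cluster
```

Only `k` has to be chosen by the user; everything else follows from the data, which is the property the paragraph above refers to.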

<hr>

[1] Petrič, I., Urbančič, T., Cestnik, B., Macedoni-Lukšič, M. (2009). Literature mining method RaJoLink for uncovering relations between biomedical concepts. *Journal of Biomedical Informatics*, 42(2), pp. 219–227.

[4] Fortuna, B., Grobelnik, M., Mladenić, D. (2006). Semi-automatic data-driven ontology construction system. *Proceedings of the 9th International Multi-conference Information Society*, pp. 223–226.

[5] Cestnik, B., Fabbretti, E., Gubiani, D., Urbančič, T., Lavrač, N. (2017). Reducing the search space in literature-based discovery by exploring outlier documents: a case study in finding links between gut microbiome and Alzheimer’s disease. *Genomics and computational biology*, 3(3), e58-1-e58-10. doi:10.18547/gcb.2017.vol3.iss3.e58.

[6] Sluban, B., Juršič, M., Cestnik, B., Lavrač, N. (2012). Exploring the power of outliers for cross-domain literature mining. In M. R. Berthold (Ed.), Bisociative Knowledge Discovery (pp. 325–337). Springer.

[7] Petrič, I., Cestnik, B., Lavrač, N., Urbančič, T. (2012). Outlier detection in cross-context link discovery for creative literature mining. *The Computer Journal*, 55(1), 47–61.

<hr>

Note that **our motive** was to **re-implement parts** of tools such as **OntoGen**, RaJoLink and CrossBee so that we can **repeat the results** of past experiments with these tools, rather than to optimize the Python code. The focus was on understanding the learning processes and visualizing the workflows with a view to the repeatability of the results obtained; the efficiency and elegance of the code can be addressed in future versions of the scripts.

Import and initialize the `logging` library to track the execution of the scripts.

In [None]:
import logging

# Initialize logging with a basic configuration
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s: %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

Import LBD components from the framework notebooks. The description of the individual components from the framework notebooks can be found in the respective notebooks.

The purpose of the **import_ipynb** library is to allow the direct import of Jupyter notebooks as modules so that code, functions and classes defined in one notebook can be easily reused in other notebooks or Python scripts.
If the **import_ipynb** library is omitted (or commented out), the corresponding module will be imported from **.py** files exported from the **.ipynb** files. Note that importing from **.py** files is usually much faster and therefore more suitable for running scripts in production.

In [None]:
# import import_ipynb
import LBD_01_data_acquisition
import LBD_02_data_preprocessing
import LBD_03_feature_extraction
import LBD_04_text_mining
import LBD_05_results_analysis
import LBD_06_visualization

Import additional Python libraries.

In [None]:
import os
import nltk
import numpy as np
import itertools
import pandas as pd
import spacy
from sklearn.metrics.pairwise import cosine_distances
from typing import List, Dict
from collections import defaultdict

Define the names of the domains $C$ and $A$, then load the corresponding text from the input file. The expected file format is as follows:

1. The file is encoded in ASCII (if it is in UTF-8 or another encoding, it should be converted to ASCII).
2. Each line in the file represents one document. The words in each document are separated by spaces. The length of the individual documents may vary.
3. The first word in each line is the **unique id**, followed by a semicolon. Normally the **pmid** (PubMed ID) can be used for this purpose, although any unique id (e.g. a **sequential count**) suffices.
4. The second word in each line can optionally stand for a predefined domain (or class) of the document. In this case, the second word is preceded by **!**. For example, if the file contains documents that originate from two domains, e.g. *migraine* and *magnesium*, the second word in each line is either **!migraine** or **!magnesium**. If the file contains documents that originate from *autism* and *calcineurin*, the second word in each line will be either **!autism** or **!calcineurin**.
5. If the second word is not preceded by **!**, it will be considered the first word of the document. In this case, the document will be given the domain **!NA** (**not applicable** or **not available**).
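
The format rules above can be sketched as a small parser. This is only an illustration: the function `parse_line` is hypothetical, and it assumes that the semicolon directly follows the id. The loader actually used in this notebook is `LBD_01_data_acquisition.load_data_from_file`, whose internals are not shown here.

```python
def parse_line(line):
    """Split one input line into (doc_id, domain, text).

    Assumed layout: '<id>; [!domain] word word ...'
    """
    doc_id, _, rest = line.partition(';')
    words = rest.split()
    if words and words[0].startswith('!'):
        # The second word is a predefined domain marker such as '!autism'.
        domain, words = words[0], words[1:]
    else:
        # No marker: the word belongs to the document text.
        domain = '!NA'
    return doc_id.strip(), domain, ' '.join(words)

print(parse_line('12345; !autism early delay in social skills'))
print(parse_line('2; calcineurin is a protein phosphatase'))
```

The second call shows rule 5: without a leading **!**, the first word stays part of the document and the domain defaults to **!NA**.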


**A background story for this experiment**

First, we selected *autism* and *calcineurin* as our domains of interest.

*Autism* belongs to a group of pervasive developmental disorders that are characterized by an
early delay and abnormal development of a person's cognitive, communication and social
interaction skills. It is a very complex and not yet sufficiently understood
domain, where precise causes are still unknown; research suggests that it may be
related to genetic mutations, environmental factors, and brain structure and function. 
*Calcineurin* is a protein phosphatase with a high prevalence in the brain.

The dataset from the input file was constructed using the following PubMed queries:

1. autis* [TIAB] AND 1900/01/01:2007/12/31 [PDAT]
2. calcineurin [TIAB] AND 1900/01/01:2007/12/31 [PDAT]

The input file *input/f_autism_calcineurin.txt* was prepared for the experiments described in [1]. It contains 16,139 titles and abstracts: 10,819 from *autism* and 5,320 from *calcineurin*.

In this experiment, we use **OntoGen** to detect outlier documents as described in Chapter 6 *Outlier-based Closed Discovery*, Section 6.3 *Outlier document detection and b-term identification through document clustering*.

In the LBD outlier detection approach, each document from the two literatures is an instance represented by a set of words using frequency statistics based on the Bag Of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF) text representations.
The BoW and TF-IDF vectors make it possible to measure the content similarity of documents. In OntoGen, content similarity is measured by the cosine similarity of documents weighted with the standard TF-IDF measure, so that a high frequency of co-occurring words indicates high document similarity.
The cosine similarity measure is used to position the documents according to their similarity to the representative document (centroid) of a selected domain.
Documents positioned based on the cosine similarity measure can be visualized in OntoGen by a similarity graph with cosine similarity values falling in the interval [0, 1].
The value 0 means extreme dissimilarity, i.e. two documents have no words in common, while the value 1 represents the similarity between two documents that are identical in their BoW representation.
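
The positioning of documents relative to a domain centroid can be sketched as follows. This is an illustrative example, not OntoGen's code: the helper `cosine_similarity` defined here, the three toy vectors and the use of the arithmetic mean as the centroid are assumptions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy weighted vectors of three documents from one domain (hypothetical data),
# over the vocabulary ['gene', 'brain', 'protein'].
domain_docs = np.array([[2.0, 1.0, 0.0],
                        [1.0, 2.0, 0.0],
                        [0.0, 0.0, 3.0]])

# The centroid acts as the "representative document" of the domain.
centroid = domain_docs.mean(axis=0)

similarities = [cosine_similarity(doc, centroid) for doc in domain_docs]
outlier_index = int(np.argmin(similarities))  # document least similar to the centroid
print(similarities, outlier_index)
```

Because BoW and TF-IDF vectors have no negative entries, the similarities stay within [0, 1]; documents with the lowest similarity to their own domain centroid are the outlier candidates.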

In [None]:
# Set the global variable CONTEXT_SWITCH to the following values, depending on which domain pairs you would like to set as a context:
# 1 for Autism-Calcineurin
# 2 for Alzheimer-Macrobiota

CONTEXT_SWITCH = 1

In [None]:
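if CONTEXT_SWITCH == 1:
    domainName = 'Autism-Calcineurin'
    fileName = 'input/f_autism_calcineurin.txt'
elif CONTEXT_SWITCH == 2:
    domainName = 'Alzheimer-Macrobiota'
    fileName = 'input/f_alzheimer_gimb.txt'
else:
    # Fail early with a clear message instead of a NameError on fileName below
    raise ValueError(f'Unsupported CONTEXT_SWITCH value: {CONTEXT_SWITCH}')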

lines = LBD_01_data_acquisition.load_data_from_file(fileName)
# display the first 7 lines of the document
[LBD_02_data_preprocessing.truncate_with_ellipsis(line, 110) for line in lines[:7]]

**Preprocess the documents into a dictionary - might take a few minutes for longer files**

The script in the next cell is used to prepare text data for further analysis in Literature-Based Discovery (LBD). The aim is to clean, standardize and structure the documents so that they are suitable for further tasks such as feature extraction, topic modeling and the discovery of hidden relationships in the literature. The script prepares the documents stored in `lines` in a dictionary.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Creating a dictionary from raw data*: The script starts by converting a list of rows into a structured dictionary. 
    - *`construct_dict_from_list`*: this function takes the raw list of text lines (`lines`) and creates a dictionary (`docs_dict`) in which each entry typically represents a document, with a unique identifier as the key and the text of the document as the value.
    - This conversion is important because it puts the text data into a more manageable format that allows efficient processing and retrieval.

2. *Preprocessing of documents*: the script then applies various pre-processing steps to the documents:
    - *Cleaning*: the text is cleaned to remove unwanted characters, punctuation and other errors.
    - *Remove stop words*: frequent words that do not provide meaningful information (e.g. "the", "and") are removed.
    - *Lemmatization*: words are reduced to their base or root form (e.g. "running" becomes "run") to ensure consistency.
    - *Minimum word length*: words shorter than four characters are filtered out.
    - *Keep only nouns*: setting the parameter `keep_only_nouns=True` ensures that only nouns (and proper nouns) are considered in further analysis, removing other word types such as adjectives and verbs. Note that the cell below sets `keep_only_nouns = False`, so this filter is skipped there.
    When the filter is enabled, a trained pipeline from *spaCy* named *en_core_web_md* is used for the task [https://spacy.io/models/en#en_core_web_md]. The filter uses functions from the external *spaCy* library to check every word in the vocabulary of the documents and is therefore time-consuming.
    - *MeSH-specific filtering*: the parameter `keep_only_mesh=False` skips the MeSH filtering in this preprocessing.

3. *Extract document IDs and processed text*:  the script then extracts lists of document IDs and the corresponding preprocessed text:
    - *`extract_ids_list`*: returns a list of document IDs from the preprocessed dictionary to facilitate document lookup and management.
    - *`extract_preprocessed_documents_list`*: extracts the cleaned and processed text for each document to prepare it for feature extraction or other analysis.
By extracting these lists, the script organizes the data in a format that is easy to manipulate in subsequent steps, such as creating a Bag of Words (BoW) model or calculating TF-IDF scores.

**Practical applications**

- *Biomedical research and discovery*: This pre-processing approach is valuable in the biomedical field, where ensuring the relevance and accuracy of terms is critical to discovering new relationships between diseases, drugs and other biological concepts. By focusing on specific vocabularies such as MeSH, researchers can more effectively search the literature for new hypotheses or overlooked relationships.
- *Data preparation for machine learning*: The cleaned and structured data generated by this script can be fed directly into machine learning models for tasks such as document classification or clustering.

**Use**

To use this script effectively:
1. *Prepare the data*: Make sure you have a list of raw text lines (`lines`).
2. *Execute the preprocessing steps*: Run the script to clean, filter and structure the text data.
3. *Extract and analyze*: Use the extracted IDs and processed text for further analysis, e.g. to create models and visualizations or for exploratory research.

</details>
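
The core of the preprocessing steps can be sketched with the standard library alone. This is a simplified stand-in for `preprocess_docs_dict`, not its implementation: the function `simple_preprocess` and the tiny stop-word set are hypothetical, and lemmatization and the noun filter are left out because they require NLTK or spaCy models.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {'the', 'and', 'with', 'that', 'this', 'from'}

def simple_preprocess(text, remove_list=(), min_word_length=4):
    """Lowercase, strip non-letters, drop stop words, removed words and short words."""
    tokens = re.findall(r'[a-z]+', text.lower())  # cleaning: keep letter runs only
    kept = [
        t for t in tokens
        if t not in STOP_WORDS
        and t not in remove_list
        and len(t) >= min_word_length
    ]
    return ' '.join(kept)

print(simple_preprocess('Calcineurin and the Brain: a protein phosphatase!',
                        remove_list=['calcineurin']))
```

Passing the domain name in `remove_list` mirrors the notebook's choice to remove the original domain names from the vocabulary.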

In [None]:
# 1. Creating a dictionary from raw data
docs_dict = LBD_02_data_preprocessing.construct_dict_from_list(lines)

# 2. Preprocessing of documents
keep_list = []
# Normally, the original domain names are removed from the vocabulary. If the domain names (e.g. "autism" and "calcineurin") were kept in the vocabulary,
# they might dominate the analysis because they are the focus of the study. The algorithms might attach extra importance to these terms
# and push less obvious but potentially important terms or concepts into the background.
if CONTEXT_SWITCH == 1:
    remove_list = ['autism', 'calcineurin']
elif CONTEXT_SWITCH == 2:
    remove_list = ['alzheimer', 'gut', 'microbiota']
else:
    remove_list = []

prep_docs_dict = LBD_02_data_preprocessing.preprocess_docs_dict(
    docs_dict, keep_list = keep_list, remove_list = remove_list, mesh_word_list = [], \
    cleaning = True, remove_stopwords = True, lemmatization = True, \
    min_word_length = 4, keep_only_nouns = False, keep_only_mesh = False, stemming = False, stem_type = None)

# 3. Extract document IDs and processed text
ids_list = LBD_02_data_preprocessing.extract_ids_list(prep_docs_dict)
prep_docs_list = LBD_02_data_preprocessing.extract_preprocessed_documents_list(prep_docs_dict)

The next three cells show the first dictionary entries, the document IDs (PubMed) and the pre-processed documents.

When displaying the first few dictionary entries, we can observe the difference between the original and the pre-processed documents.

In [None]:
# display the first 7 dictionary items
truncated_dict = {
    key: {sub_key: LBD_02_data_preprocessing.truncate_with_ellipsis(value, 110) for sub_key, value in sub_dict.items()}
    for key, sub_dict in itertools.islice(prep_docs_dict.items(), 7)
}
truncated_dict

In [None]:
# display the ids of the first 7 documents
ids_list[:7]

In [None]:
# display the preprocessed text for the first 7 documents
[LBD_02_data_preprocessing.truncate_with_ellipsis(line, 110) for line in prep_docs_list[:7]]

**Construct BoW model from important words and n-grams**

The next script continues the feature extraction process and refines a Bag of Words (BoW) model by filtering out less important n-grams. It creates a BoW matrix from the list of pre-processed documents and then removes n-grams that occur fewer than `min_count_ngram` times (in our case 3) in the entire document corpus. The filter can additionally remove words that are not contained in the MeSH list `mesh_word_list`; this MeSH filter is not applied in the cell below. Reducing the vocabulary improves the quality and relevance of the text representation and makes the following steps faster.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Set parameters*: The script starts by setting the parameters for the n-gram size and the minimum document frequency:
    - *`ngram_size`*: Specifies the size of the word sequences (n-grams) used as features, e.g. 2 for pairs of consecutive words (bigrams). In the cell below it is set to 1, so only single words are used.
    - *`min_df`*: Specifies the minimum number of documents in which a word or n-gram must occur in order to be included in the initial vocabulary.

2. *Create Bag of Words representation*: The next step is to create the BoW model using the specified n-gram size.
This function creates a vocabulary (`word_list`) from all terms and n-grams found in the preprocessed documents (`prep_docs_list`), together with the corresponding frequency matrix (`bow_matrix`). The output vocabulary includes all n-grams without filtering.

3. *Filtering low-frequency n-grams*: The script then filters out n-grams that occur less frequently than a certain threshold:
    - *`min_count_ngram`*: Specifies the minimum number of occurrences of n-grams to keep.
    - The script calculates two important metrics:
        - *document frequency*: How many documents contain each word or n-gram.
        - *total frequency*: How often each word or n-gram appears in all documents.

4. *Filtering based on specific criteria*: The script applies a further filtering step to refine the vocabulary. In its full form, the loop evaluates each term or n-gram in the vocabulary:
   - *Non-n-grams*: Will only be retained if they are in a predefined `mesh_word_list`.
   - *n-grams*: Are retained if:
       - They fulfill the minimum frequency criteria.
       - All partial words are contained in `mesh_word_list`.
       - The n-gram does not consist of repeated words (e.g. "word word").

   The cell below applies a reduced version of this filter: single words are always retained, and n-grams are retained if they occur at least `min_count_ngram` times in the corpus.


5. *Applying the filters*: The script then filters both the rows and the columns of the BoW matrix. 
   - *`filter_matrix_columns`*: Refines the BoW matrix by retaining only the selected words or n-grams that meet the filter criteria.
   - The updated vocabulary and matrix are then stored in `word_list` and `bow_matrix`, respectively.

**Practical applications**

- *Biomedical research and discovery*: This filtering method is particularly useful in medical research, where the focus is on extracting and analyzing relevant biomedical terms and concepts.
- *Document analysis and classification*: By refining the feature set, this script can improve the performance of classifiers used in the categorization of scientific literature or other text corpora.
- *Network analysis*: The terms of the filtered vocabulary can serve as nodes in a network graph of meaningful terms and their co-occurrences, which can be analyzed to detect hidden connections.

**Use**

To use this script effectively, you need to make sure you have a preprocessed document list (`prep_docs_list`). Adjust the parameters like `ngram_size`, `min_df` and `min_count_ngram` to your specific needs. After running the script, you will get a filtered vocabulary and a corresponding BoW matrix, which is more suitable for further analysis such as clustering, topic modeling or discovering new hypotheses in biomedical research.

</details>
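
The two frequency metrics and the n-gram filter can be sketched on a toy corpus as follows. The helper `extract_terms`, the convention that an n-gram is a space-joined string and the threshold of 2 are assumptions for illustration; the framework's `create_bag_of_words` and `word_is_nterm` may work differently.

```python
from collections import Counter

def extract_terms(doc, ngram_size=2):
    """Return the words of a document plus all n-grams up to ngram_size."""
    tokens = doc.split()
    terms = list(tokens)
    for n in range(2, ngram_size + 1):
        terms += [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

docs = ['brain protein brain', 'protein gene', 'brain protein gene']
term_counts = [Counter(extract_terms(doc)) for doc in docs]
vocabulary = sorted(set().union(*term_counts))

# Rows are documents, columns are terms; a missing term counts as 0.
bow = [[counts[term] for term in vocabulary] for counts in term_counts]

# Document frequency: in how many documents a term occurs.
doc_freq = {t: sum(1 for counts in term_counts if t in counts) for t in vocabulary}
# Total frequency: how often a term occurs in the whole corpus.
total_freq = {t: sum(counts[t] for counts in term_counts) for t in vocabulary}

# Keep every single word; keep an n-gram only if it is frequent enough.
min_count_ngram = 2
kept = [t for t in vocabulary if ' ' not in t or total_freq[t] >= min_count_ngram]
print(kept)
```

Here the bigram "protein brain" occurs only once and is dropped, while all single words survive regardless of their counts.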

In [None]:
# 1. Set parameters
if CONTEXT_SWITCH == 1:
    ngram_size = 1 # to reduce the vocabulary, only single words are used for further analysis
    min_df = 3
elif CONTEXT_SWITCH == 2:
    ngram_size = 1 
    min_df = 1
else:
    ngram_size = 1 
    min_df = 1

# 2. Create Bag of Words representation
word_list, bow_matrix = LBD_03_feature_extraction.create_bag_of_words(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary with all n-grams: ', len(word_list))

# 3. Filtering low-frequency n-grams
#    remove n-grams with frequency count less than min_count_ngram from vocabulary word_list and bow_matrix
min_count_ngram = 3

tmp_sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

tmp_sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

# 4. Filtering based on specific criteria
tmp_filter_columns = []
for i, word in enumerate(word_list):
    if not LBD_03_feature_extraction.word_is_nterm(word):
        tmp_filter_columns.append(i)
    else:
        if tmp_sum_count_word_in_docs[word] >= min_count_ngram:
            tmp_filter_columns.append(i)

# 5. Applying the filters
#    keep the original order of rows
tmp_filter_rows = list(range(len(ids_list)))

tmp_filtered_word_list, tmp_filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, bow_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
bow_matrix = tmp_filtered_bow_matrix
print('Number of terms in preprocessed vocabulary after removing infrequent n-grams: ', len(word_list))

# Output the lists for checking the order
#LBD_02_data_preprocessing.save_list_to_file(word_list, "output/_list.txt")
#LBD_02_data_preprocessing.save_list_to_file(prep_docs_list, "output/_prep_list.txt")

**Calculate relevant indicators for the BoW matrix**

The script in the next cell is a continuation of the text preprocessing pipeline that calculates the margins for the Bag of Words (BoW) matrix and optimizes the BoW matrix for better interpretability and analysis. By arranging the matrix to highlight the most important terms and documents, this script helps to recognize patterns in the data, which is a crucial step in LBD.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Counting word frequencies*: The script begins by calculating various frequency counts that provide insight into how words are distributed across documents:
   - *`sum_count_docs_containing_word`*: Counts how many documents each word appears in.
   - *`sum_count_word_in_docs`*: Totals the occurrences of each word across all documents.
   - *`sum_count_words_in_doc`*: Tallies the total number of words in each document.

   These metrics are essential for understanding the significance and distribution of terms within the corpus, which can guide further analysis.

2. *Displaying frequency counts*: The script then prints a subset of these frequency counts to give an overview of the data:
   - *`islice`* from `itertools` is used to print just the first few items, making it easier to inspect the data without overwhelming output.
   - These print statements help users quickly assess the distribution and frequency of words and documents in the BoW model.

3. *Optimizing the BoW matrix*: The script proceeds to rearrange the BoW matrix so that the most frequent words and documents are positioned at the top-left corner of the matrix:
   - *sorting*: The words and documents are sorted by their frequencies in descending order.
   - *filtering*: The indices of these sorted words and documents are then used to rearrange the BoW matrix.

   This step ensures that the most significant terms and documents are easily accessible, facilitating further analysis such as clustering, topic modeling, or visualization.

4. *Rearranging the matrix*: Finally, the script filters the matrix according to the computed order:
   - *`filter_matrix`*: This function reorders the BoW matrix based on the sorted indices, ensuring that the most relevant terms and documents are emphasized.

   The script then prints out the first few items in the reordered lists:
   - This output allows users to verify that the matrix has been rearranged as intended, highlighting the most important elements of the dataset.

**Use**

To use this script, you must have a BoW matrix (`bow_matrix`) and the corresponding lists of words (`word_list`) and document IDs (`ids_list`). The script processes these inputs to calculate the frequency counts, reorder the matrix and output the reordered BoW matrix. This optimized matrix can be used for various downstream tasks, e.g. for creating visualizations, for deeper statistical analysis or as a basis for machine learning models for predictions.

</details>
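
The margin computation and the reordering can be sketched with NumPy on a toy matrix. The variable names and data are hypothetical; the notebook itself uses the framework functions `sort_dict_by_value`, `get_index_list_of_dict1_keys` and `filter_matrix` for the same purpose.

```python
import numpy as np

ids = ['d1', 'd2', 'd3']
words = ['brain', 'gene', 'protein']
bow = np.array([[0, 1, 0],
                [1, 0, 4],
                [0, 2, 1]])

# Margins of the matrix.
word_totals = bow.sum(axis=0)  # occurrences of each word in all documents
doc_totals = bow.sum(axis=1)   # number of words in each document

# Descending order; a stable sort keeps the original order among ties.
col_order = np.argsort(-word_totals, kind='stable')
row_order = np.argsort(-doc_totals, kind='stable')

# Reorder rows and columns in one step.
sorted_bow = bow[np.ix_(row_order, col_order)]
sorted_ids = [ids[i] for i in row_order]
sorted_words = [words[j] for j in col_order]
print(sorted_ids, sorted_words)
print(sorted_bow)
```

After reordering, the longest document and the most frequent word meet in the top-left cell, which is why the plotted corner of the matrix is no longer dominated by zeros.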

In [None]:
# 1. Counting word frequencies
sum_count_docs_containing_word = LBD_03_feature_extraction.sum_count_documents_containing_each_word(word_list, bow_matrix)

sum_count_word_in_docs = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, bow_matrix)

sum_count_words_in_doc = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, bow_matrix)

# 2. Displaying frequency counts
print('Number of documents in which each word is present: ', dict(itertools.islice(sum_count_docs_containing_word.items(), 7)))
print('Number of occurrences of each word in all documents: ', dict(itertools.islice(sum_count_word_in_docs.items(), 7)))
print('Number of words in each document: ', dict(itertools.islice(sum_count_words_in_doc.items(), 7)))

# 3. Optimizing the BoW matrix
#    Compute the order of rows (documents) and columns (words) in the bow matrix so that the most frequent words are in the top-left corner. 
filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_word_in_docs, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(sum_count_words_in_doc, reverse=True), ids_list) 

# 4. Rearranging the matrix
#    Rearrange (filter) the bow matrix according to the previously computed order.
filtered_ids_list, filtered_word_list, filtered_bow_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, bow_matrix, filter_rows, filter_columns)

print('The first few documents in the rows of the filtered bow matrix: ', filtered_ids_list[:7])
print('The first few words in the columns of the filtered bow matrix: ', filtered_word_list[:7])

**Visualize a part of BoW matrix**

Visualize the upper left part of the Bag of Words (BoW) matrix. In the BoW matrix, each row corresponds to a document and each column to a word (or n-gram). The values in the matrix represent the frequency of the word in the corresponding document.
As the BoW matrix mainly contains zeros, the displayed matrix is sorted so that the higher values in the cells are moved to the top left-hand corner of the matrix.

In [None]:
first_row = 0
last_row = 20
first_column = 0
last_column = 15
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered Bag of Words', \
                                           filtered_bow_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], \
                                           filtered_word_list[first_column:last_column], as_int = True)

**Construct TF-IDF matrix from important words and n-grams**

The next script is designed to create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix from a set of preprocessed documents and then refine this matrix by filtering out less relevant terms.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Creating the TF-IDF matrix*:<br>
   The script begins by generating a TF-IDF matrix using a list of preprocessed documents:
   - *TF-IDF matrix*: This matrix represents the importance of each word (or n-gram) across all documents in the corpus.
   - *`ngram_size`*: Specifies the size of word sequences to consider (e.g., unigrams, bigrams).
   - *`min_df`*: Filters out terms that appear in fewer than a specified number of documents, reducing noise in the analysis.

   This step is essential for transforming raw text data into a structured format that highlights important terms.

2. *Rearranging the TF-IDF matrix*:
   The script then aligns the TF-IDF matrix with the filtered BoW matrix:
   - *filtering*: The columns are filtered with the selection computed earlier for the BoW matrix (`tmp_filter_columns`), so that infrequent n-grams are removed from both matrices.
   - *rearranging*: The rows keep their original order (`tmp_filter_rows`), so that each row still corresponds to the same document in `ids_list`.

   This refinement process is crucial for improving the quality of the analysis by focusing on the most impactful terms, which can lead to more accurate and insightful results.

**Use**

Users can apply this script as part of a larger text mining workflow where the TF-IDF matrix serves as an important step in structuring and analyzing the data. By filtering and refining the matrix, users can ensure that their analysis focuses on the most relevant and meaningful terms, leading to more meaningful insights. In the context of LBD, this script is an essential tool for turning raw text data into actionable insights.

</details>
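
How a TF-IDF matrix is derived from a BoW matrix can be sketched with the textbook weighting $\mathrm{tf} \cdot \ln(N / \mathrm{df})$. This is an assumption for illustration: the exact variant used by `LBD_03_feature_extraction.create_tfidf` is not shown in this notebook and may differ. scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2 row normalization by default.

```python
import numpy as np

# Toy BoW matrix (hypothetical data): rows are documents, columns are words.
bow = np.array([[2, 0, 1],
                [0, 1, 1],
                [0, 2, 1]], dtype=float)

n_docs = bow.shape[0]
doc_freq = (bow > 0).sum(axis=0)  # number of documents containing each word
idf = np.log(n_docs / doc_freq)   # rare words get a high weight (every df is >= 1 here)
tfidf = bow * idf                 # broadcast: scale each column by its IDF

print(np.round(tfidf, 3))
```

The third word occurs in every document, so its IDF is ln(1) = 0 and its whole column vanishes, while the first word, which occurs in a single document, receives the largest weight.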

In [None]:
# 1. Creating the TF-IDF matrix
word_list, tfidf_matrix = LBD_03_feature_extraction.create_tfidf(prep_docs_list, ngram_size, min_df)
print('Number of terms in initial vocabulary with all n-grams: ', len(word_list))

# 2. Rearranging the TF-IDF matrix
#    Rearrange (filter) the tfidf matrix according to the previously computed order from bow matrix.
tmp_filtered_word_list, tmp_filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix_columns(
    word_list, tfidf_matrix, tmp_filter_rows, tmp_filter_columns)

word_list = tmp_filtered_word_list
tfidf_matrix = tmp_filtered_tfidf_matrix
print('Number of terms in preprocessed vocabulary after removing infrequent n-grams: ', len(word_list))

**Compute margins for TF-IDF matrix**

This script is designed to analyze and manipulate Term Frequency-Inverse Document Frequency (TF-IDF) data for a corpus of documents. It computes various statistics related to the TF-IDF values for both words and documents, then filters the TF-IDF matrix to reorder it based on the most important words and documents. Note that the importance of words and documents is estimated from the calculated aggregates from TF-IDF matrix.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Summing and maximizing TF-IDF values*:
   - `sum_count_each_word_in_all_documents`: Calculates the sum of TF-IDF scores for each word across all documents, providing insight into the overall importance of words in the entire corpus.
   - `max_tfidf_each_word_in_all_documents`: Finds the maximum TF-IDF score for each word, i.e. its score in the document where the word is most important.
   - `sum_count_all_words_in_each_document`: Computes the sum of TF-IDF scores for all words in each document, which can be used to determine the "weight" or importance of the document itself.
   - `max_tfidf_all_words_in_each_document`: Identifies the highest TF-IDF score for each document, which can help isolate which document contains particularly important terms.

2. *Output statistics*:
   - The script uses Python's `itertools.islice` function to print a preview of the first 7 entries of each TF-IDF statistic. This offers a quick way to inspect the data without overwhelming the output with large lists.

3. *Sorting and filtering the TF-IDF matrix*:
   - After calculating the TF-IDF statistics, the script computes an ordering for the rows (documents) and columns (words) based on the maximum TF-IDF values. This ensures that the most important terms and documents are given priority in subsequent analyses.
   - The `filter_matrix` function then reorders the original TF-IDF matrix based on these computed rankings, allowing for a focused view of the most significant content in the corpus.

**Practical applications**

- *Document analysis and classification*: By identifying the most important terms and documents in a corpus, this technique can assist in classifying documents into relevant categories.
- *Term and key concept extraction*: Researchers can use the sum and max TF-IDF scores to isolate critical keywords that may represent novel concepts or ideas in the context of Literature-Based Discovery.
- *Summarization and information retrieval*: By filtering out less important words and documents, this script can help narrow down a large corpus to the most relevant data, making retrieval tasks more efficient.

**Use**

This script is a practical tool for analyzing TF-IDF data in text mining applications. By summing and maximizing TF-IDF scores for words and documents, users can highlight the most significant elements of their corpus. The filtered matrix provides a more focused view of the most important terms, which is highly useful in fields like Literature-Based Discovery and NLP.

</details>
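
The difference between the sum and the maximum as importance estimates can be seen on a toy matrix. The data and variable names below are hypothetical; the notebook computes the same aggregates with the framework functions and orders the matrix by the maximum.

```python
import numpy as np

words = ['brain', 'gene', 'protein']
# Toy TF-IDF matrix (hypothetical data): rows are documents, columns are words.
tfidf = np.array([[0.3, 0.0, 0.1],
                  [0.3, 0.8, 0.0],
                  [0.3, 0.0, 0.2]])

sum_word = dict(zip(words, tfidf.sum(axis=0)))  # overall weight in the corpus
max_word = dict(zip(words, tfidf.max(axis=0)))  # peak weight in a single document

by_sum = sorted(words, key=lambda w: sum_word[w], reverse=True)
by_max = sorted(words, key=lambda w: max_word[w], reverse=True)
print('Ordered by sum:', by_sum)
print('Ordered by max:', by_max)
```

A word that is moderately present everywhere ("brain") leads the sum ranking, whereas a word that is highly specific to one document ("gene") leads the max ranking. Ordering by the maximum therefore pushes document-specific terms to the top-left corner of the plotted matrix.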

In [None]:
# 1. Summing and maximizing TF-IDF Values
sum_word_tfidf = LBD_03_feature_extraction.sum_count_each_word_in_all_documents(word_list, tfidf_matrix)
max_word_tfidf = LBD_03_feature_extraction.max_tfidf_each_word_in_all_documents(word_list, tfidf_matrix)

sum_doc_tfidf = LBD_03_feature_extraction.sum_count_all_words_in_each_document(ids_list, tfidf_matrix)
max_doc_tfidf = LBD_03_feature_extraction.max_tfidf_all_words_in_each_document(ids_list, tfidf_matrix)

# 2. Output statistics
print('Sum of TF-IDF for each word: ', dict(itertools.islice(sum_word_tfidf.items(), 7)))
print('Max of TF-IDF for each word: ', dict(itertools.islice(max_word_tfidf.items(), 7)))

print('Sum of TF-IDF for each document: ', dict(itertools.islice(sum_doc_tfidf.items(), 7)))
print('Max of TF-IDF for each document: ', dict(itertools.islice(max_doc_tfidf.items(), 7)))

# 3. Sorting and filtering the TF-IDF matrix
#    Compute the order of rows (documents) and columns (words) in the TF-IDF matrix so that the most important words are in the top-left corner. 
filter_columns = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_word_tfidf, reverse=True), word_list)
filter_rows = LBD_02_data_preprocessing.get_index_list_of_dict1_keys(
    LBD_02_data_preprocessing.sort_dict_by_value(max_doc_tfidf, reverse=True), ids_list) 

#    Rearrange (filter) the TF-IDF matrix according to the previously computed order.
filtered_ids_list, filtered_word_list, filtered_tfidf_matrix = LBD_03_feature_extraction.filter_matrix(
    ids_list, word_list, tfidf_matrix, filter_rows, filter_columns)

**Visualize a part of TF-IDF matrix**

Visualize the upper-left part of the TF-IDF matrix. In the TF-IDF matrix, each row corresponds to a document and each column to a word (or n-gram). Each value is the term frequency-inverse document frequency (TF-IDF) of the word in the corresponding document, a measure of how relevant the word is to that document relative to the whole corpus: it increases with the number of occurrences of the word in the document and is offset by how many documents in the corpus contain the word.
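
To make the measure concrete, here is a small hand computation using the textbook form of TF-IDF (relative term frequency times `log(N / df)`). This is only a sketch: the weighting used by the notebook's feature extraction module is not shown here, and libraries such as scikit-learn apply a smoothed and normalized variant by default, so the actual matrix values will differ. The corpus `toy_docs` and the helper `tf_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus; each document is a list of tokens.
toy_docs = [
    ['autism', 'gene', 'gene'],
    ['calcineurin', 'gene'],
    ['autism', 'brain'],
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# 'calcineurin' occurs in one document only, so it receives a high weight there.
w_rare = tf_idf('calcineurin', toy_docs[1], toy_docs)
# 'gene' occurs in two of the three documents, so the same term frequency is weighted lower.
w_common = tf_idf('gene', toy_docs[1], toy_docs)
```

Both terms make up half of the second document, yet `w_rare` is larger than `w_common` because the rarer term is more specific to that document.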

In [None]:
first_row, last_row, first_column, last_column = (0, 20, 0, 25)
LBD_06_visualization.plot_bow_tfidf_matrix('Filtered TfIdf', filtered_tfidf_matrix[first_row:last_row,first_column:last_column], \
                                           filtered_ids_list[first_row:last_row], filtered_word_list[first_column:last_column], as_int = False)

Create a list of domain names of all documents (from the dictionary containing the documents) and a list of unique domain names. There are two distinct domains: *Autism* and *Calcineurin*.

In [None]:
domains_list = LBD_02_data_preprocessing.extract_domain_names_list(prep_docs_dict)
print('Domain names for the first few documents: ', domains_list[:7])
unique_domains_list = LBD_02_data_preprocessing.extract_unique_domain_names_list(prep_docs_dict)
print('A list of all unique domain names in all the documents: ', unique_domains_list)
for unique_domain in unique_domains_list:
    print('Number of documents in ', unique_domain, ': ', domains_list.count(unique_domain), sep='')

Visualize the documents of the two original domains in a 2D diagram by reducing the dimensionality of the TF-IDF matrix with PCA. You can experiment with the interactive diagram by clicking on the legend elements and showing or hiding the documents belonging to the selected element.

In [None]:
LBD_06_visualization.visualize_tfidf_pca_interactive(ids_list, [], domains_list, tfidf_matrix, transpose = False, color_schema = 11)
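
The dimensionality reduction behind such a diagram can be sketched with plain NumPy. This is an assumption about the general technique, not the implementation of `visualize_tfidf_pca_interactive`, which is not shown here; the helper `pca_2d` and the matrix `toy_tfidf` are hypothetical, and the sketch assumes a dense array.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of a dense matrix onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # The rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical toy TF-IDF matrix: two documents dominated by the first word,
# two documents dominated by the second word.
toy_tfidf = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.2],
])

coords = pca_2d(toy_tfidf)  # one (x, y) pair per document
```

Each row of `coords` gives the position of one document in the 2D plot; documents with similar term profiles land close together along the first component.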

**Clustering of documents with k-means**

This Python script performs the clustering of documents with the k-means algorithm from *sklearn*. It takes a TF-IDF matrix as input, groups documents into clusters based on their text similarity, and outputs the distribution of documents in each cluster. This script is a fundamental step for exploring and organizing large text corpora, providing deeper insights into text data and facilitating advanced tasks such as LBD.

<details>
  <summary>Click for more ...</summary>

**Functionality**

The main function of the script is to categorize a collection of documents into a predefined number of clusters (`n_clusters`) based on their TF-IDF representations. This is how it works:

1. *Clustering*: Uses the KMeans algorithm to divide documents into `n_clusters` groups based on their similarity in feature space.
2. *Cluster assignments*: Assigns each document to a cluster and stores the labels indicating which cluster the document belongs to.
3. *Summary*: Outputs a list of unique cluster labels and counts the number of documents in each cluster.

**Practical applications**

1. *Topic modeling*: Helps group documents by topic based on text similarity.
2. *Document categorization*: Organizes unstructured text data, such as research articles or web pages, into different categories.
3. *Information retrieval*: Improves search systems by clustering documents for better indexing and retrieval.
4. *Literature-Based Discovery (LBD)*: Identifies related research or conceptual clusters in large bibliographic datasets.

**Use**

1. *Prepare the input data*: First create a TF-IDF matrix of your documents, for example with scikit-learn's `TfidfVectorizer`.
2. *Set parameters*: Set the number of clusters (`n_clusters`) based on your dataset and your goals.
3. *Run the script*: Run the script to determine the cluster assignments and the number of documents per cluster.
4. *Analyze the output*: Use the cluster information to examine the distribution of documents and recognize meaningful patterns or groupings in the data.

</details>
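
The clustering step can be sketched directly with scikit-learn. The internals of `LBD_04_text_mining.perform_clustering` are not shown here, so the parameters below (`n_init`, `random_state`) and the conversion of labels to strings are assumptions made for this toy example; `toy_tfidf` and `toy_assignments` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy TF-IDF matrix with two clearly separated groups of documents.
toy_tfidf = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.2],
])

n_clusters = 2
# n_init and random_state are fixed here only to make the toy run repeatable.
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = model.fit_predict(toy_tfidf)

# The notebook compares cluster labels as strings (e.g. ca == "1"), so convert them.
toy_assignments = [str(label) for label in labels]

for cluster in sorted(set(toy_assignments)):
    print('Number of documents in ', cluster, ': ', toy_assignments.count(cluster), sep='')
```

Which group receives label `0` and which receives label `1` is arbitrary, so downstream code should not rely on a particular numbering.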

In [None]:
# Cluster the documents using k-means
n_clusters = 2
cluster_assignments = LBD_04_text_mining.perform_clustering(tfidf_matrix, n_clusters)

unique_cluster_assignments = list(set(cluster_assignments))
print('A list of all unique cluster names in all the documents: ', unique_cluster_assignments)
for unique_cluster in unique_cluster_assignments:
    print('Number of documents in ', unique_cluster, ': ', cluster_assignments.count(unique_cluster), sep='')

Create an interactive visualization of the document clusters found by k-means on the TF-IDF matrix. The visualization reduces the high-dimensional TF-IDF space to two principal components using PCA and colors each point (document) according to its cluster assignment. You can experiment with the interactive diagram by clicking on the legend items and showing or hiding the documents belonging to the selected item.

In [None]:
LBD_06_visualization.visualize_tfidf_pca_interactive(ids_list, [], cluster_assignments, tfidf_matrix, transpose = False, color_schema = 12)

**Analysis of domain cluster combinations and extraction of relevant rows**

This script pairs the domain of each document with its assigned cluster, extracts the lines of the documents that match selected domain-cluster conditions, and calculates the frequency of each domain-cluster combination. It uses the `Counter` class from Python's `collections` module to count the combinations. The script is useful for filtering and analyzing data in clustering workflows and enables a detailed examination of domain-cluster overlaps.

<details>
  <summary>Click for more ...</summary>

**Functionality**

1. *Assign domains to clusters*:
   - Combines each domain from `domains_list` with the corresponding cluster assignment (`cluster_assignments`), creating a `dl_ca` list of `domain-cluster` strings.
   - Extracts specific lines based on the domain and cluster assignment conditions and appends them to the `outlier_lines` list.

2. *Counting frequencies*:
   - Uses `Counter` to calculate the frequency of each `domain-cluster` combination in `dl_ca`.
   - Displays the number of each combination.

3. *Output*:
   - Outputs the count of all unique `domain-cluster` combinations and the `dl_ca` list.

**Practical applications**

1. *Cluster characterization*: Helps to identify the distribution of domains across different clusters.
2. *Text filtering*: Enables extraction of specific text data based on clusters and domain relevance.
3. *Data Analysis*: Provides a quick overview of the relationships between domains and clusters and helps to understand thematic patterns in text data.
4. *LBD*: Supports targeted discovery by focusing on specific domain-cluster overlaps in the scientific literature.

**Use**

1. *Prepare input data*:
   - Create `domains_list`, `cluster_assignments`, and `lines`, where:
     - `domains_list` contains the domain names of the documents.
     - `cluster_assignments` contains the cluster labels for the documents.
     - `lines` contains the corresponding text data for each document.
2. *Run the script*:
   - The script creates `domain-cluster` combinations (`dl_ca`) and extracts the lines that fulfill the defined conditions, which depend on the domains used (e.g. `Autism` in cluster `1` or `Calcineurin` in cluster `0`; or `Alzheimer` in cluster `0` or `GIMB` in cluster `1`).
3. *Analyze the output*:
   - Check the frequency distribution of the domain-cluster combinations and the extracted lines (`outlier_lines`).

</details>

In [None]:
dl_ca = []
outlier_lines = []
for i, (dl, ca) in enumerate(zip(domains_list, cluster_assignments)):
    dl_ca.append(dl + '-' + ca)
    # You have to decide which combinations of the original domain and the assigned cluster
    # represent outlier documents; the cluster labels produced by k-means are arbitrary.
    if CONTEXT_SWITCH == 1:
        if (dl, ca) in (('Autism', '1'), ('Calcineurin', '0')):
            outlier_lines.append(lines[i])
    elif CONTEXT_SWITCH == 2:
        if (dl, ca) in (('Alzheimer', '0'), ('GIMB', '1')):
            outlier_lines.append(lines[i])

from collections import Counter

# Step 1: Use Counter to count the frequencies of each value
frequencies = Counter(dl_ca)

# Step 2: Display the frequencies
for value, count in frequencies.items():
    print(f"Value {value} appears {count} times")

Draw an interactive visualization of the documents, colored by the four combinations of original domain and k-means cluster. You can experiment with the interactive diagram by clicking on the legend elements and showing or hiding the documents belonging to the selected element.

In [None]:
LBD_06_visualization.visualize_tfidf_pca_interactive(ids_list, [], dl_ca, tfidf_matrix, transpose = False, color_schema = 13)

**Crosstab generator for frequency analysis**

This script generates a crosstab matrix from two lists of values, showing the frequency of co-occurrence between items in the lists.

<details>
  <summary>Click for more ...</summary>

**Functionality**

The `create_crosstab` function is designed to take two equal-length lists as input and compute the frequency of their pairwise combinations. It uses Python's `defaultdict` to construct a two-dimensional frequency table. Here’s how it works:
1. *Input Validation*: Ensures both lists have the same length to avoid data mismatches.
2. *Data Processing*: Iterates over the paired elements of the lists, updating the frequency count in a nested dictionary.
3. *Matrix Representation*: Organizes the results into a readable tabular format. It dynamically determines row and column headers based on the unique elements in the input lists and then displays the crosstab as a frequency table.

The function `display_crosstab` returns the crosstab as a formatted string for display in a table format.

**Example workflow**

1. Replace `domains_list` and `cluster_assignments` in the function call with your lists of interest.
2. Run the script to view a frequency matrix as a text-based table.
3. Analyze the output to derive insights, such as identifying dominant relationships or clusters.

</details>

In [None]:
from collections import defaultdict

def create_crosstab(list1, list2):
    """Count how often each (list1 value, list2 value) pair occurs."""
    # Check if the two lists have the same length
    if len(list1) != len(list2):
        raise ValueError("The two lists must have the same length.")

    # Nested dictionary: crosstab[row_value][column_value] -> frequency
    crosstab = defaultdict(lambda: defaultdict(int))

    # Populate the crosstab with frequency counts
    for val1, val2 in zip(list1, list2):
        crosstab[val1][val2] += 1

    return crosstab

def display_crosstab(crosstab):
    """Format the crosstab as a text table and return it as a string."""
    row_labels = sorted(crosstab.keys())  # unique values of the first list
    col_labels = sorted({val for row in crosstab.values() for val in row.keys()})  # unique values of the second list

    # Determine a common column width from the longest label or count
    counts = [count for row in crosstab.values() for count in row.values()]
    col_width = max(len(str(x)) for x in row_labels + col_labels + counts) + 2

    # Create the header row and a separator line
    header_cells = [" " * col_width] + [f"{label:>{col_width}}" for label in col_labels]
    header_line = " | ".join(header_cells)
    table = header_line + "\n" + "-" * len(header_line) + "\n"

    # Add one row per value of the first list; missing combinations stay blank
    for row_label in row_labels:
        cells = [f"{row_label:>{col_width}}"]
        for col_label in col_labels:
            value = crosstab.get(row_label, {}).get(col_label, "")
            cells.append(f"{value:>{col_width}}")
        table += " | ".join(cells) + "\n"

    return table

print(display_crosstab(create_crosstab(domains_list, cluster_assignments)))

Create a sparse visualization that uses only every tenth document (about 10 percent of the corpus). With a large number of documents, this noticeably reduces the time needed to draw the diagram.

In [None]:
# Select every tenth document
selectors = [i % 10 == 0 for i in range(len(domains_list))]

LBD_06_visualization.visualize_tfidf_pca_interactive([element for element, select in zip(ids_list, selectors) if select], [], [element for element, select in zip(dl_ca, selectors) if select], 
                                                     tfidf_matrix[selectors,:], transpose = False, color_schema = 13)
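
The cell above selects every tenth document, which is deterministic. If a random subset is preferred, a boolean mask of the same shape can be built as in the following sketch; the helper `random_selectors`, its parameters and the fixed seed are assumptions for illustration and are not part of the LBD modules.

```python
import random

def random_selectors(n_items, fraction=0.1, seed=42):
    """Return a boolean mask that selects about `fraction` of n_items at random."""
    rng = random.Random(seed)  # fixed seed so that the sample is repeatable
    n_selected = min(n_items, max(1, round(n_items * fraction)))
    chosen = set(rng.sample(range(n_items), n_selected))
    return [i in chosen for i in range(n_items)]

# Hypothetical usage with 50 documents: five of them are selected.
toy_selectors = random_selectors(50)
```

The resulting list can be used in place of `selectors`, for example as `random_selectors(len(domains_list))`.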

If we have documents from two domains of interest, $A$ and $C$, we first train a clustering model that splits the documents into two clusters, $0$ and $1$, representing $A'$ and $C'$.
It is assumed that cluster $0$ largely coincides with domain $A$ and cluster $1$ with domain $C$, or vice versa, so that $A'$ and $C'$ approximate $A$ and $C$.

The model created can then be used to classify all documents (i.e. the documents from $A \cup C$). 
The documents that are misclassified with respect to their domain of origin (i.e. documents from $A$ that are classified as $C'$, and documents from $C$ that are classified as $A'$) are called **outlier documents**.
They are also referred to as borderline documents because, according to the model, they are more similar to the other domain than to their domain of origin.
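
This definition can be sketched without hard-coding the domain-cluster pairs, under the assumption that each domain is dominated by a different cluster: the majority cluster of a domain plays the role of $A'$ or $C'$, and every document outside the majority cluster of its domain is an outlier. The helper `find_outlier_indices` and the toy lists are hypothetical; the notebook itself selects the outlier combinations manually.

```python
from collections import Counter

def find_outlier_indices(domains, clusters):
    """Return indices of documents whose cluster is not the majority cluster of their domain."""
    majority = {}
    for domain in set(domains):
        counts = Counter(c for d, c in zip(domains, clusters) if d == domain)
        # In case of a tie, the cluster encountered first is taken as the majority.
        majority[domain] = counts.most_common(1)[0][0]
    return [i for i, (d, c) in enumerate(zip(domains, clusters)) if c != majority[d]]

# Hypothetical toy data: domain A is mostly in cluster '0', domain C mostly in cluster '1'.
toy_domains = ['A', 'A', 'A', 'C', 'C', 'C']
toy_clusters = ['0', '0', '1', '1', '1', '0']
outlier_indices = find_outlier_indices(toy_domains, toy_clusters)
```

Here the third and sixth documents are flagged, since they fall into the cluster dominated by the other domain. If both domains share the same majority cluster, the assumption fails and the result should not be interpreted as a set of outliers.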

Now we display the number of outlier documents, which will be stored in a file for further processing. Note that the number of outlier documents is usually much smaller than the number of original documents.

In [None]:
print('Number of original documents:', len(lines))
print('Number of outlier documents:', len(outlier_lines))

Save the outlier documents to a file, adding '_outliers' to the end of the original input file name.

In [None]:
# Specify the file path (or name) where the list outlier_lines will be saved

# Split the full path into directory and file name
tmp_path, tmp_file_name = os.path.split(fileName)
# Remove the file extension from the file name
tmp_file_name_without_ext = os.path.splitext(tmp_file_name)[0]
# Add _outliers to the original filename
file_path1 = os.path.join(tmp_path, tmp_file_name_without_ext + '_outliers.txt')

# Open the file in write mode ('w') and write each string to the file
with open(file_path1, 'w', encoding="ascii") as file:
    for line in outlier_lines:
        file.write(line)

The file with the outlier documents is prepared for further processing. 

Let us repeat that the main advantage of focusing on outlier documents is a more efficient search for b-terms: the search space of potential b-terms is reduced to the terms that occur in outlier documents. This significantly reduces the effort required to find cross-domain bridging terms, because only a much smaller subset of documents, in which most b-terms occur, has to be examined.
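
The effect on the search space can be illustrated on a made-up corpus by comparing the vocabulary of all documents with the vocabulary of the outlier documents only. The documents, the identifiers in `toy_outlier_ids` and the helper `vocabulary` are hypothetical, and simple whitespace tokenization is assumed instead of the notebook's preprocessing.

```python
# Hypothetical toy corpus: document id -> text.
toy_docs = {
    'd1': 'autism synaptic plasticity',
    'd2': 'autism calcium channel',
    'd3': 'calcineurin calcium signalling',
    'd4': 'calcineurin phosphatase activity',
}
toy_outlier_ids = ['d2', 'd3']  # assumed to have been flagged as outliers

def vocabulary(texts):
    """Collect the set of whitespace-separated terms in the given texts."""
    return {term for text in texts for term in text.split()}

all_terms = vocabulary(toy_docs.values())
outlier_terms = vocabulary(toy_docs[doc_id] for doc_id in toy_outlier_ids)

print(len(all_terms), 'candidate terms in all documents')
print(len(outlier_terms), 'candidate terms in outlier documents')
```

In this toy case the candidate list shrinks while the term shared by both domains (`calcium`) is still among the candidates.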

The experimental results in the standard migraine-magnesium domain as well as in the autism-calcineurin domain confirm the hypothesis that most bridging terms occur in outlier documents and that the search space for identifying b-terms can be greatly reduced by considering outlier documents [6, 7].