Automatically partitioning my e-book collection into categories and labeling each category according to its content
The goal of this project was to automatically organize my large collection of e-books in PDF format into groups, or clusters, so that I could easily find sources of useful information when needed. However, the code in this repo could be handy for the general case where one has many PDF files and wants to split them into groups, each containing books or documents on a similar subject/topic. In machine learning terms, such a task is called document categorization, and it is completely unsupervised.
As an example, I used 21 e-books from my personal collection (due to copyright restrictions, I cannot upload these books here). Here is the list:
- A-Gentle-Introduction-to-Apache-Spark
- Advanced Deep Learning with Keras
- Advanced Deep Learning with Python
- Advanced Elasticsearch 7.0
- Advanced_Analytics_with_Spark
- Apache Spark 2.x Machine Learning Cookbook
- Apache Spark 2.x for Java Developers
- Apache Spark Deep Learning Cookbook
- Apache_Solr_Essentials
- Apache_Solr_High_Performance
- Deep Learning with TensorFlow 2 and Keras - Second Edition
- Deep_Learning_for_Search
- Deep_Learning_with_JavaScript
- Elasticsearch 5.x Cookbook - Third Edition
- Elasticsearch 7 Quick Start Guide
- Elasticsearch A Complete Guide
- Elasticsearch_for_Hadoop
- Hands-On Deep Learning for IoT
- Learning Elastic Stack 6.0
- Learning Elasticsearch
- Mastering ElasticSearch
These books could roughly be divided into 3-4 clusters: "Spark", "Deep Learning", and "Elasticsearch/Solr" (with the latter possibly splitting into separate "Elasticsearch" and "Solr" clusters). I intentionally selected the books so that there would be no (significant) overlap in their topics and document clustering would have clear targets.
In the training phase, one has a set of documents to begin working with. The outcome is a set of clusters and popular words extracted from each cluster.
The main file with all the code to execute in your favorite IDE or from the command line is document_categorization.py. The file categorization.env is the environment file where all important parameters, such as the clustering method or the number of topics per cluster, are set up as follows:
# Set a directory with electronic books
BOOK_PATH="C:/eBooks/A"
# The number of top bigrams/trigrams to select
TOP_N=100
# Clustering algorithm
# (valid values are "affinity", "kmeans", "hierarchical")
CLUSTERING="affinity"
# The number of clusters to detect (only for CLUSTERING="kmeans")
CLUSTER_NUMBER=3
# The number of features to describe each cluster
FEATURE_NUMBER=10
# Topic modeling algorithm
# (valid values are "lda", "nmf")
TOPIC_MODELING="lda"
# The number of topics per cluster
TOPIC_NUMBER_PER_CLUSTER=1
# File name to save a vectorizer object
VECTORIZER_PKL_FILENAME="vectorizer_pickle_model.pkl"
# File name to save a clustering model object
CLUSTERING_PKL_FILENAME="clustering_pickle_model.pkl"
# File name to save a topic modeling model object
TOPIC_MODELING_PKL_FILENAME="topic_modeling_pickle_model.pkl"
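For reference, here is a minimal sketch of how these parameters can be read at runtime; whether the repo uses the python-dotenv package for this is my assumption:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read categorization.env into the process environment
load_dotenv("categorization.env")

book_path = os.getenv("BOOK_PATH", "C:/eBooks/A")
top_n = int(os.getenv("TOP_N", "100"))
clustering = os.getenv("CLUSTERING", "affinity")
topic_modeling = os.getenv("TOPIC_MODELING", "lda")
```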
I used a lot of code from the great book "Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data", written by Dipanjan Sarkar and published by Apress in 2016. My role was to write the so-called integration code linking together different parts of the processing pipeline described in the next section. Wherever code was adopted, I preserved the original file and function names given by Dipanjan Sarkar. I also adopted two functions related to word cloud generation from the Jupyter notebook (https://nbviewer.jupyter.org/github/LucasTurtle/national-anthems-clustering/blob/master/Cluster_Anthems.ipynb) created by Lucas de Sá.
The application searches for PDF files in a specified folder and extracts text (as one long string) from each of them using the tika parser. Once this is done, the text is split into sentences and text pre-processing starts, which includes text filtering (removal of email addresses and web links, tokens with mixed letters and numbers, punctuation symbols, stopwords, tokens from code snippets embedded in the text, and certain parts of speech), conversion of all words to lowercase, and tokenization of sentences into words.
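A condensed sketch of this step, assuming the tika Python bindings and NLTK (the helper names below are illustrative, not the ones used in the repo):

```python
import glob
import os
import re

import nltk  # requires the "punkt" models: nltk.download("punkt")
from tika import parser


def extract_text(pdf_path):
    """Extract the raw text of one PDF as a single long string."""
    parsed = parser.from_file(pdf_path)  # tika returns a dict
    return parsed.get("content") or ""


def clean_sentence(sentence):
    """Drop email addresses, web links and tokens mixing letters and digits."""
    sentence = re.sub(r"\S+@\S+", " ", sentence)                # emails
    sentence = re.sub(r"https?://\S+|www\.\S+", " ", sentence)  # links
    sentence = re.sub(r"\b\w*\d\w*\b", " ", sentence)           # mixed tokens
    return sentence.lower()


for pdf in glob.glob(os.path.join("C:/eBooks/A", "*.pdf")):
    text = extract_text(pdf)
    sentences = [clean_sentence(s) for s in nltk.sent_tokenize(text)]
    tokens = [nltk.word_tokenize(s) for s in sentences]
```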
I have noticed that code snippets embedded in the text can harm document clustering by showing up in large numbers among the top words characterizing cluster centroids. For now, I adopted a rather straightforward solution: manually creating a so-called "black list" of such words, which are removed from further analysis whenever they are encountered in the text. However, this is clearly a sub-optimal solution that needs to be replaced with an automatic one (see the last section for details).
I also assumed that not all parts of speech (POS) are useful for representing the book content. I opted to preserve only three POS: adjectives (JJ tag), singular nouns (NN tag) and singular proper nouns (NNP tag). All other POS are filtered out.
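A sketch of both filters together, assuming NLTK's default POS tagger (the black-list entries here are invented for illustration; the repo keeps its own manually curated list):

```python
import nltk  # requires nltk.download("averaged_perceptron_tagger")

# Invented examples of code-snippet tokens; the real black list is manual
BLACK_LIST = {"println", "val", "init", "args"}

KEEP_TAGS = {"JJ", "NN", "NNP"}  # adjectives, singular (proper) nouns


def filter_tokens(tokens):
    """Keep JJ/NN/NNP tokens that do not appear on the black list."""
    return [word for word, tag in nltk.pos_tag(tokens)
            if tag in KEEP_TAGS and word not in BLACK_LIST]


print(filter_tokens(["a", "deep", "network", "println", "learns"]))
# roughly: ['deep', 'network'] -- tags depend on sentence context
```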
Once each document is normalized, the top N bigrams and top N trigrams are extracted from the remaining adjectives and nouns. The lists of bigrams and trigrams are then flattened and concatenated into a single list, without removing duplicated words.
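This can be done with NLTK's collocation finders; the ranking measure below (likelihood ratio) is my assumption, as the repo may score n-grams differently:

```python
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder)


def top_ngrams_flat(tokens, top_n=100):
    """Top n bigrams and trigrams, flattened into one list of words."""
    bigrams = BigramCollocationFinder.from_words(tokens).nbest(
        BigramAssocMeasures.likelihood_ratio, top_n)
    trigrams = TrigramCollocationFinder.from_words(tokens).nbest(
        TrigramAssocMeasures.likelihood_ratio, top_n)
    # Flatten and concatenate; duplicated words are deliberately kept
    return [word for ngram in bigrams + trigrams for word in ngram]
```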
Each book's title and its content, i.e., the extracted top bigrams and trigrams, are written to an SQLite database (file documents.sqlite). A check prevents any book from being written more than once, in order to avoid duplicated records and unnecessary database growth.
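A possible shape of this check (the actual table schema in documents.sqlite may differ):

```python
import sqlite3


def save_document(title, content, db="documents.sqlite"):
    """Store one (title, content) row unless the title already exists."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS documents "
                "(title TEXT PRIMARY KEY, content TEXT)")
    # The PRIMARY KEY on title turns repeated inserts into no-ops
    con.execute("INSERT OR IGNORE INTO documents VALUES (?, ?)",
                (title, content))
    con.commit()
    con.close()
```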
Next, feature extraction is performed: TF-IDF features form a feature matrix that is given as input to the pre-specified clustering function, followed by the selected topic modeling.
The TF-IDF vectorizer object, the clustering model object and the topic modeling model objects (one per cluster, stored together in a single list) are saved to files in the current folder.
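In scikit-learn terms, this stage boils down to something like the following sketch, with toy stand-ins for the n-gram strings and default hyperparameters assumed:

```python
import pickle

from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the flattened top n-grams of each book
contents = ["spark apache dataframe rdd cluster",
            "elasticsearch index query shard node",
            "keras layer loss gradient tensorflow"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(contents)

model = AffinityPropagation()
model.fit(features.toarray())

with open("vectorizer_pickle_model.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("clustering_pickle_model.pkl", "wb") as f:
    pickle.dump(model, f)
```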
There are three clustering methods (affinity propagation, k-means and Ward's hierarchical clustering) and two topic modeling methods (Latent Dirichlet Allocation, or LDA, and Non-negative Matrix Factorization, or NMF).
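Both topic models are available in scikit-learn; a minimal per-cluster sketch, assuming `cluster_features` holds the TF-IDF rows of one cluster:

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation


def fit_topic_model(cluster_features, method="lda", n_topics=1):
    """Fit the selected topic model on one cluster's TF-IDF rows."""
    if method == "lda":
        model = LatentDirichletAllocation(n_components=n_topics)
    else:
        model = NMF(n_components=n_topics)
    return model.fit(cluster_features)
```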
Affinity propagation, unlike k-means, does not require the number of clusters to be specified in advance. Although Ward's hierarchical clustering does not need that number in advance either, it requires a human to judge the final number of clusters from a dendrogram, i.e., the cluster partitioning is rather subjective. Having decided on this number, a user can then supply it to k-means. The dendrogram for my set of 21 books is shown below.
One can observe 3 clusters, represented by the red, green and light blue lines. One cluster includes all books about Deep Learning, another all books about Apache Spark, and the third all books about search engines/platforms (Elasticsearch and Solr).
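For reference, such a dendrogram can be produced along these lines, assuming SciPy and reusing `features` from the clustering sketch above (`titles` is an assumed list of the 21 book names):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage works on the dense TF-IDF matrix with Euclidean distances
Z = linkage(features.toarray(), method="ward")
dendrogram(Z, labels=titles, orientation="right")
plt.show()
```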
Given these considerations, I decided to go with affinity propagation, as it is rarely the case that the number of clusters is known beforehand. All results below were obtained with this clustering method.
Word clouds for each of the extracted clusters are given below.
One can see that Cluster 0 is about Apache Spark, Cluster 1 about Solr, Cluster 2 about Deep Learning, and Cluster 3 about Elasticsearch.
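In the repo, the word clouds come from the two functions adopted from Lucas de Sá's notebook; a self-contained approximation using the wordcloud package looks like this:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud


def show_cloud(freqs, title):
    """Render one cluster's word cloud from a {word: weight} dict."""
    cloud = WordCloud(background_color="white").generate_from_frequencies(freqs)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()


# Toy weights, loosely echoing the Cluster 0 centroid features
show_cloud({"spark": 3.2, "apache": 1.6, "scala": 1.3}, "Cluster 0")
```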
A visualization of the 21 clustered books in 2D space is shown below, which confirms the cluster partitioning suggested by the word clouds.
Here are results showing top words describing each cluster's centroid, the books assigned to a given cluster and the top topics characterizing that cluster (LDA was used for topic modeling and word weights were assigned based on TF-IDF features):
Key features: 'spark', 'machine', 'regression', 'logger', 'error', 'apache', 'program', 'sparksession', 'feature', 'mllib'
Documents in this cluster: A-Gentle-Introduction-to-Apache-Spark, Advanced_Analytics_with_Spark, Apache Spark 2.x Machine Learning Cookbook, Apache Spark 2.x for Java Developers, Apache Spark Deep Learning Cookbook
Topic #1 with weights ('spark', 3.22), ('apache', 1.56), ('machine', 1.38), ('network', 1.37), ('system', 1.37), ('neural', 1.36), ('screenshot', 1.32), ('scala', 1.31), ('sum', 1.31), ('model', 1.31)
Key features: 'solr', 'query', 'facet', 'cache', 'parser', 'filter', 'content', 'index', 'folder', 'extraction'
Documents in this cluster: Apache_Solr_Essentials, Apache_Solr_High_Performance
Topic #1 with weights ('solr', 1.9), ('query', 1.69), ('index', 1.45), ('search', 1.26), ('apache', 1.25), ('parser', 1.24), ('performance', 1.24), ('filter', 1.22), ('result', 1.22), ('response', 1.22)
Key features: 'loss', 'accuracy', 'train', 'tensorflow', 'model', 'automl', 'tf', 'relu', 'encoder', 'activation'
Documents in this cluster: Advanced Deep Learning with Keras, Advanced Deep Learning with Python, Deep Learning with TensorFlow 2 and Keras - Second Edition, Deep_Learning_for_Search, Deep_Learning_with_JavaScript, Hands-On Deep Learning for IoT
Topic #1 with weights ('model', 2.17), ('loss', 1.95), ('iot', 1.74), ('train', 1.68), ('deep', 1.61), ('tf', 1.6), ('okun', 1.56), ('neural', 1.53), ('image', 1.51), ('input', 1.49)
Key features: 'elasticsearch', 'query', 'index', 'score', 'node', 'search', 'title', 'match', 'logstash', 'twitter'
Documents in this cluster: Advanced Elasticsearch 7.0, Elasticsearch 5.x Cookbook - Third Edition, Elasticsearch 7 Quick Start Guide, Elasticsearch A Complete Guide, Elasticsearch_for_Hadoop, Learning Elastic Stack 6.0, Learning Elasticsearch, Mastering ElasticSearch
Topic #1 with weights ('index', 2.96), ('elasticsearch', 2.48), ('query', 2.47), ('aggregation', 1.92), ('elastic', 1.65), ('es', 1.59), ('score', 1.55), ('search', 1.55), ('level', 1.54), ('hadoop', 1.53)
Both the words describing the centroids and the topics capture the essence of each cluster sufficiently well.
In the inference phase, one already has the detected clusters and the popular words extracted from each cluster. A new (previously unseen) document is presented to the inference pipeline with the purpose of assigning it to one of the existing clusters and updating the list of popular words.
In this phase, one has to run the command update.py --fname="Unseen file.pdf", or execute update.py from an IDE with the command-line parameter set to --fname="Unseen file.pdf". I chose "Caffe2 Quick Start Guide.pdf" to test the inference pipeline, so the parameter is --fname="Caffe2 Quick Start Guide.pdf".
As before, text is extracted from the input PDF document and pre-processed, followed by feature extraction. If the text includes new words unseen in the training phase, these words are simply ignored by the vectorizer (feature extractor). The clustering model object, retrieved from its pickle file when the application starts, is then used to assign the new document to one of the existing clusters, after which that cluster's centroid is updated. The document title and its content, in the form of flattened top bigrams and trigrams, are saved to the SQLite database that stores the documents from the training phase. Finally, the list of popular words is updated by means of the topic modeling object retrieved from another pickle file: this object is re-fit to the new corpus of documents containing only the books belonging to the affected cluster. Both the clustering and topic modeling objects are then saved back to their respective files to reflect the changes.
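Stripped of the database and centroid bookkeeping, the cluster assignment step amounts to something like this (`normalize` is a hypothetical stand-in for the same pre-processing as in training; predict() is available for affinity propagation and k-means, but not for hierarchical clustering):

```python
import pickle

from tika import parser

with open("vectorizer_pickle_model.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("clustering_pickle_model.pkl", "rb") as f:
    clustering = pickle.load(f)

text = parser.from_file("Caffe2 Quick Start Guide.pdf")["content"]
features = vectorizer.transform([normalize(text)])  # hypothetical normalize()
cluster_id = clustering.predict(features.toarray())[0]
print(f"Assigned to cluster {cluster_id}")
```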
As Caffe2 is one of the Deep Learning frameworks, one can naturally expect this book to be assigned to Cluster 2, which combines all books on this topic. As can be seen below, this is indeed what happened. Though all key words remained the same, their order changed a bit. The outcome of topic modeling also reflects this change: most topic words got their weights increased, while new words entered the list of top words, replacing some "old" words and thus making Cluster 2 more Deep Learning-like.
Key features: 'loss', 'accuracy', 'model', 'train', 'tensorflow', 'automl', 'tf', 'relu', 'activation', 'encoder'
Documents in this cluster: Advanced Deep Learning with Keras, Advanced Deep Learning with Python, Deep Learning with TensorFlow 2 and Keras - Second Edition, Deep_Learning_for_Search, Deep_Learning_with_JavaScript, Hands-On Deep Learning for IoT, Caffe2 Quick Start Guide
Topic #1 with weights ('model', 2.76), ('loss', 1.9), ('layer', 1.9), ('network', 1.87), ('input', 1.85), ('neural', 1.78), ('iot', 1.78), ('caffe', 1.77), ('tf', 1.66), ('deep', 1.65)
I observed that tokens from programming code sometimes polluted the clusters. This happened because many of my books contain a lot of code snippets, and the text pre-processing, despite being rigorous, was unable to clean up these artifacts. One potential solution to this problem could be paragraph extraction, e.g., based on heuristics such as blank lines between paragraphs, followed by paragraph classification into code and plain text. Naturally, the latter would require a one-class or binary classifier trained on examples of code in several popular programming languages and, in the binary case, on examples of plain text. The goal is to filter out paragraphs consisting (almost) entirely of code, while leaving paragraphs with a minor fraction of code untouched, as a few instances of code in a whole large book are unlikely to result in high TF-IDF scores, and hence such "noisy" tokens would not do much harm to document clustering.
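As a starting point, even a crude heuristic, applied before any trained classifier, could flag code-heavy paragraphs; the markers and threshold below are guesses, not a tested design:

```python
import re

# Characters and indentation patterns that rarely occur in running prose
CODE_MARKERS = re.compile(r"[{};=<>]|\(\)|^\s{4,}", re.MULTILINE)


def looks_like_code(paragraph, threshold=0.5):
    """Flag a paragraph when most of its lines carry code markers."""
    lines = [ln for ln in paragraph.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(bool(CODE_MARKERS.search(ln)) for ln in lines)
    return hits / len(lines) >= threshold
```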