# Topic Modeling with Gensim

This notebook demonstrates how to perform topic modeling on a corpus of documents using Gensim's Latent Dirichlet Allocation (LDA) implementation. We'll process text documents and extract meaningful topics from them.

In [1]:
# Import required libraries
import os
import nltk
import glob
from pathlib import Path

# Download required NLTK data if it is not already present
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('corpora/stopwords', 'stopwords'),
    ('corpora/wordnet', 'wordnet'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)

# Import gensim and other libraries
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_tags,
    strip_punctuation,
    strip_multiple_whitespaces,
    strip_numeric,
    remove_stopwords,
    strip_short,
    stem_text,
)
import pandas as pd

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...


In [2]:
# Create sample documents for topic modeling
sample_documents = [
    """
    Machine learning is a method of data analysis that automates analytical model building. 
    It is a branch of artificial intelligence based on the idea that systems can learn from data, 
    identify patterns and make decisions with minimal human intervention. Machine learning algorithms 
    build a model based on sample data, known as training data, in order to make predictions or 
    decisions without being explicitly programmed to do so.
    """,
    """
    Deep learning is part of a broader family of machine learning methods based on artificial neural networks 
    with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep learning 
    architectures such as deep neural networks, deep belief networks, recurrent neural networks and 
    convolutional neural networks have been applied to fields including computer vision, speech recognition, 
    natural language processing, audio recognition, social network filtering, machine translation, 
    bioinformatics, drug design, medical image analysis, material inspection and board game programs.
    """,
    """
    Natural language processing is a subfield of linguistics, computer science, and artificial intelligence 
    concerned with the interactions between computers and human language, in particular how to program 
    computers to process and analyze large amounts of natural language data. The goal is a computer capable 
    of understanding the contents of documents, including the contextual nuances of the language within them.
    """,
    """
    Computer vision is an interdisciplinary scientific field that deals with how computers can gain 
    high-level understanding from digital images or videos. From the perspective of engineering, 
    it seeks to automate tasks that the human visual system can do. Computer vision is concerned with 
    the automatic extraction, analysis and understanding of useful information from a single image or 
    a sequence of images.
    """,
    """
    Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and 
    systems to extract knowledge and insights from structured and unstructured data. It uses techniques 
    and theories drawn from many fields within the context of mathematics, statistics, computer science, 
    and information science. Data science is a concept to unify statistics, data analysis, informatics, 
    and their related methods in order to understand and analyze actual phenomena with data.
    """,
    """
    Artificial intelligence is intelligence demonstrated by machines, as opposed to the natural intelligence 
    displayed by animals including humans. Leading AI textbooks define the field as the study of intelligent 
    agents, which are systems that perceive their environment and take actions that maximize their chances 
    of achieving their goals. Some popular accounts use the term artificial intelligence to describe machines 
    that mimic cognitive functions that humans associate with the human mind, such as learning and problem solving.
    """,
    """
    Robotics is an interdisciplinary branch of engineering and science that includes mechanical engineering, 
    electrical engineering, computer science, and others. Robotics deals with the design, construction, 
    operation, and use of robots, as well as computer systems for their control, sensory feedback, 
    and information processing. These technologies are used to develop machines that can substitute for humans. 
    Robots are used in many situations that are dangerous or unpleasant to humans.
    """,
    """
    Internet of things is a system of interrelated computing devices, mechanical and digital machines, 
    objects, animals or people that are provided with unique identifiers and the ability to transfer 
    data over a network without requiring human-to-human or human-to-computer interaction. IoT has 
    evolved from wireless sensor networks and now encompasses a wide range of applications.
    """
]

# Save sample documents to files
os.makedirs("sample_docs", exist_ok=True)
for i, doc in enumerate(sample_documents):
    with open(f"sample_docs/doc_{i+1}.txt", "w", encoding="utf-8") as f:
        f.write(doc)

print(f"Created {len(sample_documents)} sample documents in the 'sample_docs' directory.")

Created 8 sample documents in the 'sample_docs' directory.


In [3]:
# Define the TopicModeler class (simplified version of the script)
class TopicModeler:
    """A class to perform topic modeling on text documents."""
    
    def __init__(self, language='english'):
        """
        Initialize the TopicModeler.
        
        Args:
            language (str): Language for stopwords and tokenization
        """
        self.language = language
        try:
            self.stop_words = set(nltk.corpus.stopwords.words(language))
        except LookupError:
            nltk.download('stopwords')
            self.stop_words = set(nltk.corpus.stopwords.words(language))
        
        self.documents = []
        self.processed_docs = []
        self.dictionary = None
        self.corpus = None
        self.lda_model = None
        
    def load_documents(self, input_path: str) -> None:
        """
        Load documents from a directory or file.
        
        Args:
            input_path (str): Path to directory containing text files or a single file
        """
        print(f"Loading documents from {input_path}")
        
        path = Path(input_path)
        
        if path.is_file():
            # Single file
            with open(path, 'r', encoding='utf-8') as f:
                content = f.read()
                self.documents = [content]
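        elif path.is_dir():
            # Directory of files: start from an empty list so repeated calls do not duplicate documents
            self.documents = []
            file_patterns = ['*.txt', '*.md', '*.html', '*.htm']
            files = []
            for pattern in file_patterns:
                files.extend(glob.glob(str(path / '**' / pattern), recursive=True))
            # Sort so the document order is reproducible across platforms
            files.sort()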
            
            print(f"Found {len(files)} files")
            
            for file_path in files:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                        self.documents.append(content)
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
        else:
            raise ValueError(f"Input path {input_path} is neither a file nor a directory")
        
        print(f"Loaded {len(self.documents)} documents")
    
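    def preprocess_documents(self, custom_filters: list = None) -> None:
        """
        Preprocess documents for topic modeling.
        
        Args:
            custom_filters (list): Extra words to filter out, in addition to Gensim's stopwords
        """
        print("Preprocessing documents")
        
        # Default Gensim filters
        default_filters = [
            str.lower,  # remove_stopwords is case-sensitive, so lowercase first
            strip_tags,
            strip_punctuation,
            strip_multiple_whitespaces,
            strip_numeric,
            remove_stopwords,
            strip_short,
            stem_text
        ]
        
        # Stem the custom words so they match the stemmed tokens
        custom_words = {stem_text(word) for word in (custom_filters or [])}
        
        # Process each document
        self.processed_docs = []
        for doc in self.documents:
            # Apply Gensim preprocessing
            processed = preprocess_string(doc, default_filters)
            # Drop any custom words supplied by the caller
            processed = [token for token in processed if token not in custom_words]
            self.processed_docs.append(processed)
        
        print(f"Preprocessed {len(self.processed_docs)} documents")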
    
    def create_dictionary_and_corpus(self) -> None:
        """Create dictionary and corpus for LDA."""
        print("Creating dictionary and corpus")
        
        # Create dictionary
        self.dictionary = corpora.Dictionary(self.processed_docs)
        
        # Filter extremes (optional)
        self.dictionary.filter_extremes(no_below=2, no_above=0.8)
        
        # Create corpus
        self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]
        
        print(f"Dictionary size: {len(self.dictionary)}")
        print(f"Corpus size: {len(self.corpus)}")
    
    def train_lda_model(self, num_topics: int = 10, passes: int = 10, 
                       alpha: str = 'auto', eta: str = 'auto') -> None:
        """
        Train the LDA model.
        
        Args:
            num_topics (int): Number of topics to extract
            passes (int): Number of passes through the corpus
            alpha (str): Document-topic density parameter
            eta (str): Topic-word density parameter
        """
        print(f"Training LDA model with {num_topics} topics")
        
        if not self.corpus or not self.dictionary:
            raise ValueError("Dictionary and corpus must be created before training the model")
        
        # Train LDA model
        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.dictionary,
            num_topics=num_topics,
            passes=passes,
            alpha=alpha,
            eta=eta,
            random_state=42
        )
        
        print("LDA model training completed")
    
    def print_topics(self, num_words: int = 10) -> None:
        """
        Print the topics in a readable format.
        
        Args:
            num_words (int): Number of words to show per topic
        """
        if not self.lda_model:
            raise ValueError("Model must be trained before printing topics")
        
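        # num_topics=-1 returns every topic (the default caps the list at 20)
        topics = self.lda_model.print_topics(num_topics=-1, num_words=num_words)
        
        print("\n" + "="*50)
        print("TOPICS IDENTIFIED")
        print("="*50)
        
        # Label each topic with the model's own topic ID (shown 1-based)
        for topic_id, topic in topics:
            print(f"\nTopic {topic_id + 1}:")
            print(topic)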

In [4]:
# Perform topic modeling on the sample documents
modeler = TopicModeler()

# Load documents
modeler.load_documents("sample_docs")

# Preprocess documents
modeler.preprocess_documents()

# Create dictionary and corpus
modeler.create_dictionary_and_corpus()

# Train LDA model
modeler.train_lda_model(num_topics=5, passes=20)

# Print topics
modeler.print_topics(num_words=10)

Loading documents from sample_docs
Found 8 files
Loaded 8 documents
Preprocessing documents
Preprocessed 8 documents
Creating dictionary and corpus
Dictionary size: 42
Corpus size: 8
Training LDA model with 5 topics
LDA model training completed

TOPICS IDENTIFIED

Topic 1:
0.133*"engin" + 0.074*"human" + 0.070*"scienc" + 0.069*"deal" + 0.069*"interdisciplinari" + 0.069*"us" + 0.069*"inform" + 0.038*"machin" + 0.038*"mechan" + 0.038*"branch"

Topic 2:
0.143*"learn" + 0.108*"network" + 0.073*"data" + 0.073*"machin" + 0.056*"base" + 0.038*"method" + 0.038*"analysi" + 0.038*"artifici" + 0.038*"program" + 0.021*"field"

Topic 3:
0.151*"data" + 0.122*"scienc" + 0.064*"method" + 0.064*"field" + 0.064*"us" + 0.035*"system" + 0.035*"algorithm" + 0.035*"order" + 0.035*"analysi" + 0.035*"process"

Topic 4:
0.107*"comput" + 0.103*"human" + 0.081*"imag" + 0.056*"network" + 0.056*"understand" + 0.056*"digit" + 0.056*"vision" + 0.031*"data" + 0.031*"interact" + 0.031*"machin"

Topic 5:
0.123*"intelli

In [7]:
modeler.dictionary

<gensim.corpora.dictionary.Dictionary at 0x14cc9bec440>

In [8]:
# Analyze topic distribution for individual documents
print("\n" + "="*50)
print("DOCUMENT TOPIC DISTRIBUTIONS")
print("="*50)

for i in range(min(5, len(modeler.corpus))):  # Show first 5 documents
    print(f"\nDocument {i+1}:")
    doc_topics = modeler.lda_model.get_document_topics(modeler.corpus[i])
    for topic_id, prob in doc_topics:
        print(f"  Topic {topic_id+1}: {prob:.4f}")


DOCUMENT TOPIC DISTRIBUTIONS

Document 1:
  Topic 2: 0.9926

Document 2:
  Topic 2: 0.9934

Document 3:
  Topic 5: 0.9922

Document 4:
  Topic 1: 0.2223
  Topic 4: 0.7723

Document 5:
  Topic 3: 0.9928


In [None]:
# Inspect the raw (topic_id, probability) pairs left over from the last loop iteration
doc_topics

[(2, np.float32(0.99283814))]
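
The lists returned by `get_document_topics` are sparse: by default, topics whose probability falls below the model's minimum-probability threshold are omitted, which is why most documents above show only one or two entries. The sketch below, in plain Python, shows one way to expand such a list into a dense vector and pick the dominant topic. The helper names `to_dense` and `dominant_topic` are illustrative and not part of Gensim.

```python
def to_dense(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = float(prob)
    return dense


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Values copied from the Document 4 output above (topic IDs are 0-based here)
example = [(0, 0.2223), (3, 0.7723)]
print(to_dense(example, num_topics=5))  # [0.2223, 0.0, 0.0, 0.7723, 0.0]
print(dominant_topic(example))          # (3, 0.7723)
```

Passing `minimum_probability=0.0` to `get_document_topics` is another option when an entry for every topic is wanted directly from the model.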

## Using the Standalone Script

You can also use the standalone script we created for topic modeling. To use it from the command line:

```bash
python src/pydocs/topic_modeling.py --input_dir sample_docs --num_topics 5 --passes 20
```

This approach is useful for processing larger datasets or when you want to run topic modeling as a batch process.
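
The script itself is not shown in this notebook. As a rough sketch of how a command-line entry point with these three flags could be wired around the `TopicModeler` class, the example below uses `argparse`. The function names, help texts, and default values are assumptions for illustration and are not taken from the actual script.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the command above (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Run LDA topic modeling on a directory of documents"
    )
    parser.add_argument("--input_dir", required=True, help="Directory containing the text files")
    parser.add_argument("--num_topics", type=int, default=10, help="Number of topics to extract")
    parser.add_argument("--passes", type=int, default=10, help="Number of passes through the corpus")
    return parser


def main(argv=None):
    """Parse the arguments; the modeling calls are left as comments."""
    args = build_parser().parse_args(argv)
    # With the TopicModeler class from this notebook in scope, the run could be:
    #   modeler = TopicModeler()
    #   modeler.load_documents(args.input_dir)
    #   modeler.preprocess_documents()
    #   modeler.create_dictionary_and_corpus()
    #   modeler.train_lda_model(num_topics=args.num_topics, passes=args.passes)
    #   modeler.print_topics()
    return args


# Simulate the command line shown above
args = main(["--input_dir", "sample_docs", "--num_topics", "5", "--passes", "20"])
print(args.input_dir, args.num_topics, args.passes)
```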

## Conclusion

In this notebook, we've demonstrated how to perform topic modeling using Gensim's LDA implementation. We:

1. Created sample documents on various technology topics
2. Implemented a TopicModeler class to handle the topic modeling process
3. Preprocessed the documents using Gensim's text preprocessing utilities
4. Created a dictionary and corpus for LDA
5. Trained an LDA model to extract topics
6. Analyzed the topic distributions in individual documents

This approach can be applied to any corpus of text documents to discover hidden thematic structures.

## Next Steps

To further improve the topic modeling results, you could:

1. Experiment with different numbers of topics
2. Adjust the preprocessing steps (e.g., add custom stop words)
3. Try different algorithms like Latent Semantic Analysis (LSA) or Hierarchical Dirichlet Process (HDP)
4. Evaluate model quality using coherence scores
5. Visualize topics using tools like pyLDAvis

In [1]:
from pypdf import PdfReader
import os
import glob

In [2]:
# Define the directory to search (an absolute path to the sample PDFs in this example)
directory_path = r"C:\Users\Jonathan\Desktop\projects\pydocs\dev\sample-pdfs"
# Define the desired file extension
file_extension = 'pdf'

# Construct the glob pattern
pattern = os.path.join(directory_path, f'**/*.{file_extension}')

# Get a list of files matching the pattern
matching_files = glob.glob(pattern, recursive=True)
print(matching_files)

['C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\001-trivial\\minimal-document.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\002-trivial-libre-office-writer\\002-trivial-libre-office-writer.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\003-pdflatex-image\\pdflatex-image.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\004-pdflatex-4-pages\\pdflatex-4-pages.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\005-libreoffice-writer-password\\libreoffice-writer-password.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\006-pdflatex-outline\\pdflatex-outline.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\007-imagemagick-images\\imagemagick-ASCII85Decode.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydocs\\dev\\sample-pdfs\\007-imagemagick-images\\imagemagick-CCITTFaxDecode.pdf', 'C:\\Users\\Jonathan\\Desktop\\projects\\pydo

In [9]:
docs = []
for file_path in matching_files:
    doc = {}
    try:
        with open(file_path, 'rb') as f:
            r = PdfReader(f)
            doc["metadata"] = r.metadata
            doc["content"] = [i.extract_text() for i in r.pages]
        docs.append(doc)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

Error reading C:\Users\Jonathan\Desktop\projects\pydocs\dev\sample-pdfs\005-libreoffice-writer-password\libreoffice-writer-password.pdf: File has not been decrypted


Invalid parent xref., rebuild xref
parsing for Object Streams
Object 174 0 not defined.
Object 172 0 not defined.


Error reading C:\Users\Jonathan\Desktop\projects\pydocs\dev\sample-pdfs\017-unreadable-meta-data\unreadablemetadata.pdf: Invalid object in /Pages


In [11]:
docs[0]

{'metadata': {'/Producer': 'pdfTeX-1.40.23',
  '/Creator': 'TeX',
  '/CreationDate': "D:20220403180542+02'00'",
  '/ModDate': "D:20220403180542+02'00'",
  '/Trapped': '/False',
  '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.23 (TeX Live 2021) kpathsea version 6.3.3'},
 'content': ['Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod\ntempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero\neos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea taki-\nmata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur\nsadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna\naliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea\nrebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit\namet.\n1']}

In [11]:
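# `reader` is assumed to be a PdfReader opened on one of the sample PDFs;
# the cell that created it is not shown here
meta = reader.metadata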

In [12]:
print(meta.title)
print(meta.author)
print(meta.subject)
print(meta.creator)
print(meta.producer)
print(meta.creation_date)
print(meta.modification_date)

None


LaTeX with hyperref
pdfTeX-1.40.23
2022-04-06 20:15:41+02:00
2022-07-16 17:23:03-05:00


In [13]:
for page in reader.pages:
    print(page.extract_text())

Contents
1 Foo 2
2 Bar 2
3 Baz 2
4 Foo 2
5 Bar 3
6 Baz 3
7 Foo 3
8 Bar 4
9 Baz 4
1
1 Foo
Hello, here is some text without a meaning. This text should show what a
printed text will look like at this place. If you read this text, you will get no
information. Really? Is there no information? Is there a difference between this
text and some nonsense like “Huardest gefburn”? Kjift – not at all! A blind
text like this gives you information about the selected font, how the letters are
written and an impression of the look. This text should contain all letters of the
alphabet and it should be written in of the original language. There is no need
for special content, but the length of words should match the language. Hello,
here is some text without a meaning. This text should show what a printed text
will look like at this place. If you read this text, you will get no information.
Really? Is there no information? Is there a difference between this text and
some nonsense like “Huardest gefburn”