# Data Preparation for Topic Modeling

This notebook demonstrates how to load and prepare text data for topic modeling. We'll work with the Cohere/movies dataset from Hugging Face.

## Parameters Explanation

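- **dataset_type**: Source of the dataset. `"HF"` loads it from Hugging Face; any other value loads a local CSV file
- **dataset_name**: The dataset identifier on Hugging Face (here `Cohere/movies`), or the local CSV filename without the `.csv` extension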

In [3]:
# Define the type and name of the dataset
dataset_type: str = "HF"
dataset_name: str = "Cohere/movies"

## Data Loading

We'll now load our dataset from Hugging Face and prepare it for topic modeling. 

In [11]:
import warnings

import numpy as np
import pandas as pd
from datasets import load_dataset

from src.data_utils import TextPreProcessor, CorpusProcessor

warnings.filterwarnings('ignore')

# Load the dataset: from Hugging Face if dataset_type is "HF",
# otherwise from a local CSV file named f"{dataset_name}.csv"
if dataset_type == "HF":
    dataset = load_dataset(dataset_name)
else:
    dataset = load_dataset("csv", data_files={"train": f"{dataset_name}.csv"})
len(dataset['train'])

4803

## What is a 'Document' in Topic Modeling?

In topic modeling, a **document** is a discrete text unit that we analyze to discover underlying topics. Some key points:

- A document can be of any length, from a short tweet to a full book chapter
- In our case, each movie overview/description is a separate 'document'
- Topic modeling assumes each document is a mixture of multiple topics
- The same words can appear in different topics with different probabilities
- The definition of what constitutes a 'document' depends on your specific analytical needs

The granularity of what you define as a 'document' can significantly impact your topic modeling results.
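
To make the effect of granularity concrete, here is a minimal sketch that splits the same text into documents at two different levels. The sample text and the helpers `split_paragraphs` and `split_sentences` are hypothetical and not part of this notebook's `src.data_utils`; the sentence splitter is a naive regex, not a proper sentence tokenizer.

```python
import re

# Hypothetical sample text with two paragraphs (illustrative only).
sample_text = (
    "A Marine is sent to a distant moon. He must choose a side.\n\n"
    "A detective hunts a killer. The city hides many secrets."
)

def split_paragraphs(raw: str) -> list[str]:
    """Treat each blank-line-separated paragraph as one document."""
    return [p.strip() for p in re.split(r"\n\s*\n", raw) if p.strip()]

def split_sentences(raw: str) -> list[str]:
    """Treat each sentence as one document (naive split after ., ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw) if s.strip()]

paragraph_docs = split_paragraphs(sample_text)
sentence_docs = split_sentences(sample_text)

print(len(paragraph_docs), "paragraph-level documents")
print(len(sentence_docs), "sentence-level documents")
```

The same text yields fewer, longer documents at the paragraph level and more, shorter documents at the sentence level, which changes the word co-occurrence statistics a topic model sees.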


In [19]:
# Collect the non-missing overviews as documents and print how many there are
documents = [d for d in dataset['train']['overview'] if d is not None]

print('Number of documents:', len(documents))

# Print the first raw document (preprocessing happens in a later cell)
print('First document:', documents[0])

Number of documents: 4800
First document: In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.


## Text Preprocessing

Before running our topic model, we need to clean and prepare the text data:
1. Tokenization: Breaking text into individual words
2. Removing stopwords: Common words like "the", "and", "is" that don't carry much meaning
3. Lemmatization/stemming: Reducing words to a base form (for example, "running" to "run")
4. Creating a document-term matrix: Converting processed text to numerical format
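
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the stopword list is a tiny hand-picked set, `crude_stem` is a deliberately simplistic stand-in for real stemming or lemmatization, and none of these helpers reflect how `TextPreProcessor` or `CorpusProcessor` are actually implemented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on"}

def tokenize(text: str) -> list[str]:
    """Step 1: lowercase the text and extract alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Step 2: drop common function words."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token: str) -> str:
    """Step 3: strip a trailing 's' (a crude stand-in for real stemming)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess_text(text: str) -> list[str]:
    """Run steps 1 to 3 on a single document."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

# Hypothetical toy documents (not taken from the dataset).
toy_docs = ["The robots are fighting the aliens.", "An alien befriends a robot."]
toy_processed = [preprocess_text(d) for d in toy_docs]

# Step 4: build a document-term matrix of raw counts
# (rows are documents, columns are vocabulary terms in sorted order).
toy_vocab = sorted({t for doc in toy_processed for t in doc})
toy_dtm = [[Counter(doc)[term] for term in toy_vocab] for doc in toy_processed]

print(toy_vocab)
print(toy_dtm)
```

Each row of `toy_dtm` counts how often each vocabulary term occurs in one document, which is the numerical format a topic model consumes.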

In [20]:
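import os
import pickle

# Preprocess the text documents
tp = TextPreProcessor()
documents = tp.preprocess(documents)

# Save the preprocessed documents (create the output directory if needed)
os.makedirs('./data', exist_ok=True)
with open('./data/documents.pkl', 'wb') as file:
    pickle.dump(documents, file)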


In [8]:
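# Calculate and print the average number of words in a document
# (assumes each preprocessed document is a list of tokens; on a plain
# string, len(d) would count characters instead)
average_words = np.mean([len(d) for d in documents])
print('Average number of words per document:', average_words)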

Average number of words per document: 53.24791666666667


In [9]:
import pickle

# Process the documents to create a document-term matrix
cp = CorpusProcessor(max_relative_frequency=0.9, min_absolute_frequency=5)
cp.process(documents)

# Get and save the vocabulary
vocab = cp.get_vocab()
with open('./data/vocab.pkl', 'wb') as file:
    pickle.dump(vocab, file)
print('Vocabulary size:', len(vocab))

Vocabulary size: 4270
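
The parameter names `max_relative_frequency` and `min_absolute_frequency` suggest that the vocabulary is pruned by how often terms occur. One plausible reading, which this notebook does not confirm, is that terms appearing in more than 90% of documents or in fewer than 5 documents are dropped. The sketch below illustrates that kind of document-frequency filter; `prune_vocab` is a hypothetical helper, and `CorpusProcessor` may count frequencies differently.

```python
from collections import Counter

def prune_vocab(tokenized_docs, max_relative_frequency=0.9, min_absolute_frequency=5):
    """Keep terms whose document frequency lies within both thresholds.

    Assumption: frequency means the number of documents containing the term.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_absolute_frequency and df / n_docs <= max_relative_frequency
    )

# Hypothetical toy corpus; the minimum threshold is lowered to fit its size.
toy_corpus = [
    ["film", "hero", "war"],
    ["film", "hero"],
    ["film", "love"],
    ["film", "hero", "love"],
]
kept_terms = prune_vocab(toy_corpus, max_relative_frequency=0.9, min_absolute_frequency=2)
print(kept_terms)
```

In this toy corpus, "film" appears in every document and "war" in only one, so both are removed, while "hero" and "love" are kept.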


In [10]:
# Get the document-term matrix and save it
X = cp.get_vectorised_documents()
with open('./data/doc_term_matrix.pkl', 'wb') as file:
    pickle.dump(X, file)
print('Document-term matrix shape:', X.shape)

Document-term matrix shape: (4799, 4270)
