# Data Preparation for Topic Modeling

This notebook demonstrates how to load and prepare text data for topic modeling. We'll work with the Cohere/movies dataset from Hugging Face.

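## Parameters

- **dataset_type**: Source of the dataset. `"HF"` loads from the Hugging Face Hub; any other value loads a local CSV file.
- **dataset_name**: The Hugging Face dataset ID (here `Cohere/movies`), or the local CSV filename without the `.csv` extension.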

In [5]:
# Define the type and name of the dataset
dataset_type: str = "HF"
dataset_name: str = "Cohere/movies"

## Data Loading and Preparation

We'll now load the dataset from Hugging Face and prepare it for topic modeling. This involves:
1. Loading the raw text data
2. Dropping records whose `overview` text is missing
3. Inspecting a few documents to check the data structure

In [6]:
import numpy as np
from src.data_utils import TextPreProcessor, CorpusProcessor
import pandas as pd
from datasets import load_dataset

# Load the dataset: a Hugging Face dataset ID when dataset_type is "HF",
# otherwise a local CSV file named f"{dataset_name}.csv"
if dataset_type == "HF":
    dataset = load_dataset(dataset_name)
else:
    dataset = load_dataset("csv", data_files={"train": f"{dataset_name}.csv"})
len(dataset['train'])

4803

In [7]:
# Collect documents and print the number of documents
documents = []
for d in dataset['train']['overview']:
    if d is not None:
        documents.append(d)

print('Number of documents:', len(documents))

Number of documents: 4800
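
The same filter can be written as a list comprehension, and comparing lengths before and after shows how many records were dropped (above, 4803 - 4800 = 3). A small sketch on made-up records, where `raw_overviews` is a hypothetical stand-in for `dataset['train']['overview']`:

```python
# Hypothetical stand-in for dataset['train']['overview']
raw_overviews = ["A heist goes wrong.", None, "Two friends travel north.", None]

# Keep only the records that actually contain text
kept = [text for text in raw_overviews if text is not None]
dropped = len(raw_overviews) - len(kept)

print("Kept:", len(kept), "Dropped:", dropped)
```

Note that this check only removes `None`; empty strings, if any, would still pass through.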


In [8]:
# Print the first 2 documents to get an idea of the data
print('First 2 documents:', documents[:2])

First 2 documents: ['In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.', '84 years later, a 101-year-old woman named Rose DeWitt Bukater tells the story to her granddaughter Lizzy Calvert, Brock Lovett, Lewis Bodine, Bobby Buell and Anatoly Mikailavich on the Keldysh about her life set in April 10th 1912, on a ship called Titanic when young Rose boards the departing ship with the upper-class passengers and her mother, Ruth DeWitt Bukater, and her fiancé, Caledon Hockley. Meanwhile, a drifter and artist named Jack Dawson and his best friend Fabrizio De Rossi win third-class tickets to the ship in a game. And she explains the whole story from departure until the death of Titanic on its first and last voyage April 15th, 1912 at 2:20 in the morning.']


## Text Preprocessing

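Before running our topic model, we need to clean and prepare the text data. A typical pipeline has four steps:
1. Tokenization: breaking text into individual words
2. Removing stopwords: dropping common words like "the", "and", "is" that carry little topical meaning
3. Lemmatization/stemming: reducing words to their root forms
4. Creating a document-term matrix: converting the processed text to numerical format

In this notebook, `TextPreProcessor` handles tokenization. As its output below shows, it lowercases the text and strips digits, punctuation, and non-ASCII characters (the "é" in "fiancé", for example), but leaves stopwords and inflected forms in place. `CorpusProcessor` then builds the document-term matrix.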
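
The first two steps can be sketched in plain Python. This is only an illustration, not the implementation of `TextPreProcessor`: the `tokenize` and `remove_stopwords` helpers and the tiny `STOPWORDS` set are hypothetical, and the regex assumes that only runs of ASCII letters should survive.

```python
import re

# Tiny illustrative stopword list (hypothetical; real lists are much longer)
STOPWORDS = {"a", "an", "and", "the", "is", "in", "on", "to", "but"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]

sentence = "In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora."
tokens = tokenize(sentence)
filtered = remove_stopwords(tokens)

print(tokens)
print(filtered)
```

Because digits are discarded, "22nd" is reduced to the fragment "nd"; a production pipeline might prefer to drop such leftovers with a minimum token length.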

In [9]:
# Preprocess the text documents
tp = TextPreProcessor()
documents = tp.preprocess(documents)

# Print the first 2 preprocessed documents
print('First 2 preprocessed documents:', documents[:2])

First 2 preprocessed documents: [['in', 'the', 'nd', 'century', 'a', 'paraplegic', 'marine', 'is', 'dispatched', 'to', 'the', 'moon', 'pandora', 'on', 'a', 'unique', 'mission', 'but', 'becomes', 'torn', 'between', 'following', 'orders', 'and', 'protecting', 'an', 'alien', 'civilization'], ['years', 'later', 'a', 'year', 'old', 'woman', 'named', 'rose', 'dewitt', 'bukater', 'tells', 'the', 'story', 'to', 'her', 'granddaughter', 'lizzy', 'calvert', 'brock', 'lovett', 'lewis', 'bodine', 'bobby', 'buell', 'and', 'anatoly', 'mikailavich', 'on', 'the', 'keldysh', 'about', 'her', 'life', 'set', 'in', 'april', 'th', 'on', 'a', 'ship', 'called', 'titanic', 'when', 'young', 'rose', 'boards', 'the', 'departing', 'ship', 'with', 'the', 'upper', 'class', 'passengers', 'and', 'her', 'mother', 'ruth', 'dewitt', 'bukater', 'and', 'her', 'fianc', 'caledon', 'hockley', 'meanwhile', 'a', 'drifter', 'and', 'artist', 'named', 'jack', 'dawson', 'and', 'his', 'best', 'friend', 'fabrizio', 'de', 'rossi', 'win

In [10]:
# Calculate and print the average number of words in a document
average_words = np.mean([len(d) for d in documents])
print('Average number of words per document:', average_words)

Average number of words per document: 53.24791666666667


In [11]:
import os
import pickle

# Process the documents to create a document-term matrix
cp = CorpusProcessor(max_relative_frequency=0.9, min_absolute_frequency=5)
cp.process(documents)

# Get and save the vocabulary (create ./data first if it does not exist)
vocab = cp.get_vocab()
os.makedirs('./data', exist_ok=True)
with open('./data/vocab.pkl', 'wb') as file:
    pickle.dump(vocab, file)
print('Vocabulary size:', len(vocab))

Vocabulary size: 4270
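
The two `CorpusProcessor` arguments look like frequency cutoffs for the vocabulary. The sketch below shows one plausible reading, assuming that `max_relative_frequency` is the largest fraction of documents a term may appear in and `min_absolute_frequency` is the smallest number of documents it must appear in. Both that interpretation and the helper name `build_vocab` are assumptions; `src.data_utils` may define the cutoffs differently (for example, over raw term counts).

```python
from collections import Counter

def build_vocab(docs, max_relative_frequency=0.9, min_absolute_frequency=5):
    """Keep terms whose document frequency lies within both cutoffs (assumed semantics)."""
    n_docs = len(docs)
    # Count each term at most once per document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_absolute_frequency and df / n_docs <= max_relative_frequency
    )

# Toy corpus of already-tokenized documents
toy_docs = [
    ["the", "ship", "sails"],
    ["the", "ship", "sinks"],
    ["the", "alien", "moon"],
    ["the", "moon", "rises"],
]
vocab_sketch = build_vocab(toy_docs, max_relative_frequency=0.9, min_absolute_frequency=2)
print(vocab_sketch)
```

Under this reading, the upper cutoff removes words that occur almost everywhere (which is how stopwords left in by preprocessing could be filtered out), and the lower cutoff removes rare words that give the model little to learn from.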


In [12]:
# Get the document-term matrix and save it
X = cp.get_vectorised_documents()
with open('./data/doc_term_matrix.pkl', 'wb') as file:
    pickle.dump(X, file)
print('Document-term matrix shape:', X.shape)

Document-term matrix shape: (4799, 4270)
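
The matrix has 4799 rows, one fewer than the 4800 documents collected earlier, so row indices may not line up one-to-one with `documents`; this is worth checking before mapping topics back to the original texts.

A later notebook can reload the saved artifacts with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a made-up vocabulary so that it runs on its own; in practice the paths would be `./data/vocab.pkl` and `./data/doc_term_matrix.pkl` as above:

```python
import pickle
import tempfile
from pathlib import Path

toy_vocab = ["alien", "moon", "ship"]  # hypothetical stand-in for vocab

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.pkl"

    # Save, mirroring the cells above
    with open(vocab_path, "wb") as file:
        pickle.dump(toy_vocab, file)

    # Reload and confirm the contents are unchanged
    with open(vocab_path, "rb") as file:
        restored_vocab = pickle.load(file)

print(restored_vocab == toy_vocab)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that come from a trusted source.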
