# Gutenberg Text Preprocessing

This script demonstrates how to:
- Download and use NLTK resources (stopwords, punkt, wordnet, gutenberg).
- Clean text using regex, remove stopwords, lemmatize, and stem tokens.
- Randomly sample fixed-length slices of the cleaned tokens (slices are drawn independently, so they may overlap).
- Organize data into a Pandas DataFrame for further use or modeling.

## Requirements

- Python 3.x  
- NLTK  
- pandas  
- (Optional) Jupyter or similar environment for interactive exploration
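
To confirm the requirements above are met before running the script, a minimal sketch like the following can report the installed versions. It uses only the standard library; `installed_versions` is a hypothetical helper name, not part of the script.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["nltk", "pandas"]).items():
        print(f"{name}: {ver or 'not installed'}")
```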

## Installation

1. Install dependencies:
    ```bash
    pip install nltk pandas
    ```
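2. Download the necessary NLTK data (stopwords, punkt, wordnet, gutenberg). Recent NLTK releases load `punkt_tab` rather than `punkt` for `word_tokenize`, so download both if you are unsure which version you have:
    ```python
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('punkt_tab')  # used by word_tokenize on recent NLTK releases
    nltk.download('wordnet')
    nltk.download('gutenberg')
    ```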

## Usage

1. **Set your file IDs** and corresponding labels:
    ```python
    fileids = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
    labels = ['a', 'b', 'c']
    ```
2. **Call the main function** to generate a DataFrame of cleaned text partitions:
    ```python
    df = process_gutenberg_books(fileids, labels, n_partitions=200, partition_size=100)
    ```
3. **Export or examine** the resulting DataFrame:
    ```python
    df.to_csv('gutenberg_cleaned_partitions.csv', index=False)
    print(df.head())
    ```
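
The script draws start indices with Python's module-level `random` functions, so each call produces different rows. If you need repeatable output, one option is to seed the generator before calling `process_gutenberg_books`. The sketch below illustrates the effect with `sample_windows`, a hypothetical stand-in that mirrors the script's sampling logic on a toy token list.

```python
import random


def sample_windows(tokens, n_windows, window_size):
    """Hypothetical stand-in for the script's sampler: random contiguous slices."""
    max_start = len(tokens) - window_size
    if max_start < 0:
        # Not enough tokens for even one window
        return []
    windows = []
    for _ in range(n_windows):
        start = random.randint(0, max_start)  # inclusive on both ends
        windows.append(tokens[start:start + window_size])
    return windows


tokens = [f"tok{i}" for i in range(50)]

random.seed(42)  # fix the module-level generator before sampling
first_run = sample_windows(tokens, n_windows=5, window_size=10)

random.seed(42)  # same seed, same sequence of start indices
second_run = sample_windows(tokens, n_windows=5, window_size=10)

print(first_run == second_run)  # True
```

In the real script the equivalent would be a single `random.seed(...)` call placed just before `process_gutenberg_books(...)`; the seed value itself is arbitrary.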

## Functions Overview

1. **`clean_text_words(word_list)`**  
   Cleans and normalizes a list of tokens (lowercase, remove non-alpha, remove stopwords, lemmatize, stem).

2. **`get_random_partitions_of_tokens(cleaned_tokens, label, n_partitions=200, partition_size=100)`**  
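   Draws `n_partitions` random contiguous slices of `partition_size` cleaned tokens and pairs each with `label`. Slices are sampled independently, so they may overlap. If there are fewer than `partition_size` tokens, it returns an empty list.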

3. **`process_gutenberg_books(fileids, labels, n_partitions=200, partition_size=100)`**  
   - Retrieves raw text from the Gutenberg corpus.  
   - Tokenizes and cleans the text.  
   - Partitions cleaned tokens into multiple labeled subsets.  
   - Returns a Pandas DataFrame of (label, text) rows.
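
The normalization part of `clean_text_words` (lowercase, strip non-letters, drop stopwords) can be sketched without NLTK to see what it does to individual tokens. This is an illustration only: `normalize_tokens` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and lemmatization and stemming are left out.

```python
import re

# Tiny illustrative stopword set (an assumption); the script uses NLTK's English list.
DEMO_STOPWORDS = {"the", "was", "a", "of"}


def normalize_tokens(tokens, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip non-letters, and drop empty strings and stopwords."""
    kept = []
    for tok in tokens:
        # Same regex as the script: anything outside a-z is removed
        tok = re.sub(r"[^a-z]", "", tok.lower())
        if tok and tok not in stop_words:
            kept.append(tok)
    return kept


sample = ["The", "year", "1815", "was", "well-known", ",", "Mr.", "Knightley"]
print(normalize_tokens(sample))  # ['year', 'wellknown', 'mr', 'knightley']
```

Two side effects are worth knowing: hyphenated words are fused (`well-known` becomes `wellknown`), and accented letters are removed because the pattern only keeps ASCII `a-z` (`café` becomes `caf`).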

## Example

```python
if __name__ == "__main__":
    fileids = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
    labels = ['a', 'b', 'c']
    df = process_gutenberg_books(fileids, labels)
    df.to_csv('gutenberg_cleaned_partitions.csv', index=False)
    print(df.head())


```

## Full Script

```python
import re
import random
import pandas as pd
import nltk

# -------------------------------------------------------------------
# NLTK downloads (stopwords, punkt, wordnet, gutenberg) -- run once if you haven't
# -------------------------------------------------------------------
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # used by word_tokenize on recent NLTK releases
nltk.download('wordnet')
nltk.download('gutenberg')

from nltk.corpus import stopwords, gutenberg
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean_text_words(word_list):
    """
    Perform text cleaning steps:
      - Lowercase
      - Remove punctuation/numbers (regex)
      - Remove stopwords
      - Lemmatize
      - Stem (Porter stemmer, applied to the lemma)
    Return the cleaned list of tokens.
    """
    cleaned_words = []
    for w in word_list:
        # Lowercase
        w = w.lower()
        # Keep only alphabetic characters
        w = re.sub(r'[^a-z]', '', w)

        if w and w not in stop_words:
            # Lemmatize
            w_lemma = lemmatizer.lemmatize(w)
            # Stem
            w_stem = stemmer.stem(w_lemma)
            cleaned_words.append(w_stem)

    return cleaned_words

def get_random_partitions_of_tokens(cleaned_tokens, label, n_partitions=200, partition_size=100):
    """
    Given a list of cleaned tokens (already preprocessed) and a label:
      - Randomly select 'n_partitions' slices of length 'partition_size'
      - Return a list of tuples: (label, [list_of_cleaned_tokens_for_partition])
    """
    partitions = []
    max_start_index = len(cleaned_tokens) - partition_size

    # If not enough tokens for a single partition of size 'partition_size', return empty
    if max_start_index < 0:
        return partitions

    for _ in range(n_partitions):
        start = random.randint(0, max_start_index)
        chunk = cleaned_tokens[start : start + partition_size]
        partitions.append((label, chunk))

    return partitions

def process_gutenberg_books(fileids, labels, n_partitions=200, partition_size=100):
    """
    Takes:
      - fileids: list of fileids from nltk.corpus.gutenberg (e.g., ['austen-emma.txt', ...])
      - labels: list of labels (same length as fileids)
    Returns a Pandas DataFrame with columns [label, text],
    where each row's 'text' holds `partition_size` cleaned words joined by spaces.
    Books with fewer than `partition_size` cleaned tokens contribute no rows.
    """
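    if len(fileids) != len(labels):
        # zip() would otherwise silently drop the unmatched entries
        raise ValueError("fileids and labels must have the same length")

    all_partitions = []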

    for fileid, label in zip(fileids, labels):
        # 1) Load raw text from Gutenberg
        text = gutenberg.raw(fileid)

        # 2) Tokenize
        tokens = nltk.word_tokenize(text)

        # 3) Clean the tokens (stopwords, lemmatization, stemming, etc.)
        cleaned_tokens = clean_text_words(tokens)

        # 4) Get random partitions of size `partition_size` from cleaned tokens
        partitions = get_random_partitions_of_tokens(
            cleaned_tokens,
            label,
            n_partitions=n_partitions,
            partition_size=partition_size
        )

        # 5) Convert each list of cleaned tokens to a single string
        for lbl, chunk_words in partitions:
            combined_text = " ".join(chunk_words)
            all_partitions.append((lbl, combined_text))

    # Convert to DataFrame
    df = pd.DataFrame(all_partitions, columns=['label', 'text'])
    return df

# -------------------------------------------------------------------
# Example usage
# -------------------------------------------------------------------
if __name__ == "__main__":
    # For demonstration, let's pick three Gutenberg file IDs
    fileids = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
    labels = ['a', 'b', 'c']  # one label per file

    # Generate the DataFrame
    df = process_gutenberg_books(fileids, labels, n_partitions=200, partition_size=100)

    # Save to CSV
    df.to_csv('gutenberg_cleaned_partitions.csv', index=False)

    # Check some rows
    print("Data preparation complete. Sample rows:")
    print(df.head())

```

Output (the sampled rows differ from run to run because partitions are drawn at random):

```text
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!

Data preparation complete. Sample rows:
  label                                               text
0     a  face suit give belief produc anoth dread perha...
1     a  swell half hour relat contain multipli proof s...
2     a  mr knightley must never marri littl henri must...
3     a  like charad slight much better passion poet lo...
4     a  would wish leg salt know nice loin dress direc...
```