<a href="https://www.kaggle.com/code/mipeichao/kaggle-bards-play-2?scriptVersionId=258440627" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This program uses the Tiny Shakespeare dataset to explore data preprocessing techniques. At the end, I include two discussions inspired by Google AI:<br>

Should tokenization of an NLP dataset be done row by row?<br>

Should punctuation be handled before or after tokenization?<br>

In addition, how can I effectively remove the character names from the Shakespeare dataset? This notebook tries spaCy's `en_core_web_sm` model, but it did not work well: many names were left in place. I plan to find a list of character names from the Bard's plays and filter the text against it instead.



References:<br>
[What are the most effective ways to train and test NLP models on different datasets?](https://www.linkedin.com/advice/3/what-most-effective-ways-train-test-nlp-models-rb3ke)<br>
[Key Steps in Text Preprocessing + Hands-on with Python](https://pub.towardsai.net/text-preprocessing-for-nlp-a-step-by-step-guide-to-clean-raw-text-data-2bb8918a4e2c)<br>
[Do we need to pre-process both the test and train data set?](https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set)<br>
[Coding tips](https://www.cs.williams.edu/~kkeith/teaching/s23/cs375/attach/shakespeare-gen.html)

Tokenization: Split text into individual words or tokens.<br>
Lowercasing: Convert all text to lowercase to ensure consistency.<br>
Removing Noise: Eliminate irrelevant characters, punctuation, or special symbols.<br>
Stopword Removal: Exclude common words like "and", "the", etc., which don't contribute much to meaning.<br>
Stemming or Lemmatization: Reduce words to their root form to normalize variations (e.g., "running" to "run").<br>
Handling Numerical Data: Convert numbers to a standard format if necessary.<br>
Handling Rare Words: Replace rare or misspelled words with a special token or correct them.<br>
Padding or Truncation: Make all sequences uniform in length by adding padding or truncating.<br>
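
The last two steps in this list (rare-word handling and padding or truncation) are not implemented in this notebook. As a rough illustration only, here is a minimal sketch in plain Python. The names `replace_rare` and `pad_or_truncate`, the `<unk>` and `<pad>` tokens, and the sample rows are all illustrative assumptions, not part of the original code.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # hypothetical special tokens

def replace_rare(token_lists, min_count=2):
    """Replace tokens seen fewer than min_count times (across all rows) with UNK."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[tok if counts[tok] >= min_count else UNK for tok in toks]
            for toks in token_lists]

def pad_or_truncate(tokens, length):
    """Cut the list to `length` tokens, or right-pad it with PAD."""
    return tokens[:length] + [PAD] * max(0, length - len(tokens))

rows = [["speak", "speak"], ["hear", "me", "speak"]]
rows = replace_rare(rows)                      # "hear" and "me" occur only once
rows = [pad_or_truncate(r, 3) for r in rows]   # every row now has 3 tokens
print(rows)
```

Whether rare words should be replaced at all depends on the task; for a small corpus like this one, a low `min_count` keeps most of the vocabulary.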

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("thedevastator/the-bards-best-a-character-modeling-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/the-bards-best-a-character-modeling-dataset


In [2]:
import pandas as pd
import re

In [3]:
# Read the original data and create a DataFrame with one row per non-empty line
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    lines = re.sub(r'\d+', '', content)     # Remove digits
    lines = re.sub(r'[^\w\s]', '', lines)   # Remove punctuation (this also strips the ":" after speaker names)
    lines = lines.replace('\r\n', '\n').replace('\r', '\n').split('\n')  # Normalize newlines, then split into lines
    lines = [line.strip() for line in lines if line.strip()]  # Trim whitespace and drop empty lines
    return pd.DataFrame({'origin_text': lines})

`f` is a file object created by the `open()` function. It represents the opened file specified by `file_path`, allowing you to read its contents using methods like `f.read()`.

The list comprehension `[line.strip() for line in lines if line.strip()]` creates a new list by iterating over each string in `lines`, removing leading and trailing whitespace with `strip()`, and keeping only the non-empty results. It effectively cleans up the list by dropping blank or whitespace-only lines.
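
To make the newline handling and blank-line filtering concrete, here is a small sketch that runs the same two steps on a made-up string with mixed line endings. The `sample` text is an assumption for illustration; it is not taken from the dataset files.

```python
# A made-up sample with Windows (\r\n), old Mac (\r) and Unix (\n) line endings
sample = "First Citizen\r\nSpeak speak\r\r  \n  All  \n"

# Step 1: normalize every line ending to "\n", then split into lines
lines = sample.replace("\r\n", "\n").replace("\r", "\n").split("\n")
print(lines)    # still contains empty and whitespace-only entries

# Step 2: trim each line and keep only the non-empty ones
cleaned = [line.strip() for line in lines if line.strip()]
print(cleaned)
```

The order of the two `replace` calls matters: handling `\r\n` first avoids turning one Windows line ending into two newlines.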

In [4]:
df_train = load_data('/kaggle/input/the-bards-best-a-character-modeling-dataset/train.csv')
df_test = load_data('/kaggle/input/the-bards-best-a-character-modeling-dataset/test.csv')
df_validation = load_data('/kaggle/input/the-bards-best-a-character-modeling-dataset/validation.csv')

from IPython.display import display

display(df_train)
display(df_test)
display(df_validation)

Unnamed: 0,origin_text
0,text
1,First Citizen
2,Before we proceed any further hear me speak
3,All
4,Speak speak
...,...
29238,Talk not to me I will go sit and weep
29239,Till I can find occasion of revenge
29240,BAPTISTA
29241,Was ever gentleman thus grieved as I


Unnamed: 0,origin_text
0,text
1,rance taen
2,As shall with either parts agreement stand
3,BAPTISTA
4,Not in my house Lucentio for you know
...,...
1830,And yet so fast asleep
1831,ANTONIO
1832,Noble Sebastian
1833,Thou letst thy fortune sleepdie rather winkst


Unnamed: 0,origin_text
0,text
1,GREMIO
2,Good morrow neighbour Baptista
3,BAPTISTA
4,Good morrow neighbour Gremio
...,...
1698,The match is made and all is done
1699,Your son shall have my daughter with consent
1700,TRANIO
1701,I thank you sir Where then do you know best


In [5]:
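def remove_characters(texts):
    # Drop the header row ("text") and any line ending in ":" (a speaker label).
    # load_data has already stripped punctuation, so in practice only the header row is dropped.
    out = [e for e in texts if e[-1] != ":" and e != "text"]
    return pd.DataFrame({'origin_text': out})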

In [6]:
df_train1 = remove_characters(df_train["origin_text"])
df_test1 = remove_characters(df_test["origin_text"])
df_validation1 = remove_characters(df_validation["origin_text"])

df_train1

Unnamed: 0,origin_text
0,First Citizen
1,Before we proceed any further hear me speak
2,All
3,Speak speak
4,First Citizen
...,...
29237,Talk not to me I will go sit and weep
29238,Till I can find occasion of revenge
29239,BAPTISTA
29240,Was ever gentleman thus grieved as I
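
Speaker labels such as `First Citizen` and `BAPTISTA` survive in the output above because the colon that marks them was removed in `load_data` before `remove_characters` ran. One possible fix, sketched below under the assumption that the raw file puts each speaker on its own line with a trailing colon (as in `First Citizen:`), is to drop those lines first and strip punctuation afterwards. The function name `drop_speaker_lines` and the sample text are hypothetical.

```python
import re

def drop_speaker_lines(raw_text):
    """Drop lines ending in ':' (assumed speaker labels), then strip punctuation."""
    kept = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and speaker labels
        kept.append(re.sub(r"[^\w\s]", "", line))
    return kept

sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)
print(drop_speaker_lines(sample))
```

A line of dialogue that happens to end in a colon would also be dropped by this rule, so it is worth spot-checking the removed lines before relying on it.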


In [7]:
# Batch processing with nlp.pipe for speed
import spacy
nlp = spacy.load("en_core_web_sm")

def remove_named_entities_batch(texts):
    cleaned_texts = []
    for doc in nlp.pipe(texts, batch_size=32):
        cleaned = doc.text
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                cleaned = cleaned.replace(ent.text, "")
        cleaned_texts.append(cleaned)
    return cleaned_texts

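This function removes named entities labeled as `PERSON` from a list of texts, using spaCy's batch processing.

- It uses `nlp.pipe()` to process all texts in batches (faster than calling `nlp` on each text one by one).
- For each document, it replaces every detected person name with an empty string, which deletes the name from the text (a leftover space may remain where the name was).
- The cleaned texts are collected and returned as a list.

This approach scales to large datasets, but it only removes names that the model tags as `PERSON`. In the outputs below, `Lucentio` and `Baptista` are removed from running text, while `Gremio`, `Sebastian`, and the all-caps speaker labels are left in place.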

`doc.ents` is a spaCy property that returns a list of named entities detected in a processed text (a `Doc` object). Each entity has attributes like `text` (the entity string) and `label_` (the entity type, e.g., PERSON, ORG).

Example:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (one entity per line):
# Barack Obama PERSON
# Hawaii GPE
```
So, `doc.ents` gives you all recognized entities in the text.

In [8]:
df_train1['Cleaned_Text'] = remove_named_entities_batch(df_train1['origin_text'])

df_train1

Unnamed: 0,origin_text,Cleaned_Text
0,First Citizen,First Citizen
1,Before we proceed any further hear me speak,Before we proceed any further hear me speak
2,All,All
3,Speak speak,Speak speak
4,First Citizen,First Citizen
...,...,...
29237,Talk not to me I will go sit and weep,Talk not to me I will go sit and weep
29238,Till I can find occasion of revenge,Till I can find occasion of revenge
29239,BAPTISTA,BAPTISTA
29240,Was ever gentleman thus grieved as I,Was ever gentleman thus grieved as I


The `unique` values 21461 and 21454 reported by `describe()` for this DataFrame are the number of distinct text entries in each column, **not** the number of characters or words. Each row here is one line of text, so these figures count how many different lines the dataset contains.

In [9]:
df_test1['Cleaned_Text'] = remove_named_entities_batch(df_test1['origin_text'])

df_test1

Unnamed: 0,origin_text,Cleaned_Text
0,rance taen,rance taen
1,As shall with either parts agreement stand,As shall with either parts agreement stand
2,BAPTISTA,BAPTISTA
3,Not in my house Lucentio for you know,Not in my house for you know
4,Pitchers have ears and I have many servants,Pitchers have ears and I have many servants
...,...,...
1829,And yet so fast asleep,And yet so fast asleep
1830,ANTONIO,ANTONIO
1831,Noble Sebastian,Noble Sebastian
1832,Thou letst thy fortune sleepdie rather winkst,Thou letst thy fortune sleepdie rather winkst


In [10]:
df_validation1['Cleaned_Text'] = remove_named_entities_batch(df_validation1['origin_text'])

df_validation1

Unnamed: 0,origin_text,Cleaned_Text
0,GREMIO,GREMIO
1,Good morrow neighbour Baptista,Good morrow neighbour
2,BAPTISTA,BAPTISTA
3,Good morrow neighbour Gremio,Good morrow neighbour Gremio
4,God save you gentlemen,God save you gentlemen
...,...,...
1697,The match is made and all is done,The match is made and all is done
1698,Your son shall have my daughter with consent,Your son shall have my daughter with consent
1699,TRANIO,TRANIO
1700,I thank you sir Where then do you know best,I thank you sir Where then do you know best
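
Since the spaCy model misses many names here, the name-list idea from the introduction could look roughly like the sketch below. `CHARACTER_NAMES` is a tiny hand-written placeholder, not a real list of Shakespeare's characters, and `remove_names` is a hypothetical helper.

```python
# Placeholder list for illustration; a real list would come from an external source
CHARACTER_NAMES = {"baptista", "gremio", "lucentio", "tranio", "sebastian", "antonio"}

def remove_names(text, names=CHARACTER_NAMES):
    """Drop every whitespace-separated word whose lowercase form is in `names`."""
    kept = [word for word in text.split() if word.lower() not in names]
    return " ".join(kept)  # joining on single spaces leaves no stray whitespace

rows = [
    "Good morrow neighbour Gremio",
    "BAPTISTA",
    "Not in my house Lucentio for you know",
]
cleaned_rows = [remove_names(r) for r in rows]
print(cleaned_rows)
```

Matching on lowercase catches both `Gremio` and `GREMIO`. Two limits of this sketch: a name that is also an ordinary word would be removed everywhere, and multi-word labels such as `First Citizen` need separate handling.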


In [11]:
import re

def tokenize_text(text):
    return re.split(r"\W+", text.lower())

In [12]:
# Apply to each row in the Series
tokens_train = df_train1['Cleaned_Text'].apply(tokenize_text)
tokens_train


0                                         [first, citizen]
1        [before, we, proceed, any, further, hear, me, ...
2                                                    [all]
3                                           [speak, speak]
4                                         [first, citizen]
                               ...                        
29237     [talk, not, to, me, i, will, go, sit, and, weep]
29238          [till, i, can, find, occasion, of, revenge]
29239                                           [baptista]
29240         [was, ever, gentleman, thus, grieved, as, i]
29241                              [but, who, comes, here]
Name: Cleaned_Text, Length: 29242, dtype: object

In [13]:
tokens_test = df_test1['Cleaned_Text'].apply(tokenize_text)
tokens_test

0                                           [rance, taen]
1       [as, shall, with, either, parts, agreement, st...
2                                              [baptista]
3                    [not, in, my, house, for, you, know]
4       [pitchers, have, ears, and, i, have, many, ser...
                              ...                        
1829                         [and, yet, so, fast, asleep]
1830                                            [antonio]
1831                                   [noble, sebastian]
1832    [thou, letst, thy, fortune, sleepdie, rather, ...
1833                          [whiles, thou, art, waking]
Name: Cleaned_Text, Length: 1834, dtype: object

In [14]:
tokens_validation = df_validation1['Cleaned_Text'].apply(tokenize_text)
tokens_validation

0                                                [gremio]
1                             [good, morrow, neighbour, ]
2                                              [baptista]
3                       [good, morrow, neighbour, gremio]
4                             [god, save, you, gentlemen]
                              ...                        
1697           [the, match, is, made, and, all, is, done]
1698    [your, son, shall, have, my, daughter, with, c...
1699                                             [tranio]
1700    [i, thank, you, sir, where, then, do, you, kno...
1701                        [we, be, affied, and, such, ]
Name: Cleaned_Text, Length: 1702, dtype: object
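
Some token lists above end in an empty string (for example `[good, morrow, neighbour, ]`). `re.split(r"\W+", ...)` returns an empty string whenever the text starts or ends with a non-word character, which happens here when a deleted name leaves a trailing space behind. One alternative, shown only as a sketch, is to collect word matches with `re.findall`; `tokenize_words` is a hypothetical name.

```python
import re

def tokenize_words(text):
    """Return lowercase runs of word characters; never yields empty tokens."""
    return re.findall(r"\w+", text.lower())

text = "Good morrow neighbour "            # note the trailing space
split_tokens = re.split(r"\W+", text.lower())
found_tokens = tokenize_words(text)

print(split_tokens)   # last element is an empty string
print(found_tokens)
```

With `findall`, a row that contains no word characters at all becomes an empty list, which the later "remove empty rows" step can then drop.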

In [15]:
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)  # Set of ASCII punctuation characters

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
def filter_tokens(tokens, stop_words, punctuation):
    # Keep tokens that are neither stopwords nor punctuation marks
    return [word for word in tokens if word.lower() not in stop_words and word not in punctuation]

In [17]:
clean_train_tokens1 = tokens_train.apply(lambda tokens: filter_tokens(tokens, stop_words, punctuation))
clean_test_tokens1 = tokens_test.apply(lambda tokens: filter_tokens(tokens, stop_words, punctuation))
clean_validation_tokens1 = tokens_validation.apply(lambda tokens: filter_tokens(tokens, stop_words, punctuation))

clean_validation_tokens1

0                                [gremio]
1             [good, morrow, neighbour, ]
2                              [baptista]
3       [good, morrow, neighbour, gremio]
4                  [god, save, gentlemen]
                      ...                
1697                  [match, made, done]
1698      [son, shall, daughter, consent]
1699                             [tranio]
1700             [thank, sir, know, best]
1701                           [affied, ]
Name: Cleaned_Text, Length: 1702, dtype: object

In [18]:
# Remove rows where the token list is empty
clean_test_tokens1 = clean_test_tokens1[clean_test_tokens1.apply(lambda x: len(x) > 0)]
clean_train_tokens1 = clean_train_tokens1[clean_train_tokens1.apply(lambda x: len(x) > 0)]
clean_validation_tokens1 = clean_validation_tokens1[clean_validation_tokens1.apply(lambda x: len(x) > 0)]

clean_validation_tokens1

0                                [gremio]
1             [good, morrow, neighbour, ]
2                              [baptista]
3       [good, morrow, neighbour, gremio]
4                  [god, save, gentlemen]
                      ...                
1697                  [match, made, done]
1698      [son, shall, daughter, consent]
1699                             [tranio]
1700             [thank, sir, know, best]
1701                           [affied, ]
Name: Cleaned_Text, Length: 1690, dtype: object

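This code filters out stopwords and punctuation from the tokenized text.<br>

- `tokens_train` is a Series where each row contains a list of tokens (words) from the cleaned text.<br>
- The `apply` method runs a lambda function on each list of tokens.<br>
- For each token, `filter_tokens` is meant to check two things:<br>
  - If the lowercase word is **not** in the set of English stopwords (`stop_words`)<br>
  - If the word is **not** in the set of punctuation characters (`punctuation`)<br>
- Only tokens passing the checks are kept.

In practice only the stopword check changes these results, because `load_data` already stripped punctuation. Empty-string tokens, such as the one in `[affied, ]`, are not stopwords, so they survive this filter.

The result, `clean_train_tokens1`, is a Series of lists containing only the more meaningful words from each row. Rows whose list ends up empty are then dropped, which is why the validation Series shrinks from 1702 to 1690 rows.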

In [19]:
def lower_word(tokens):
    return [word.lower() for word in tokens]

In [20]:
# Lowercase all tokens in each list
clean_train_tokens2 = clean_train_tokens1.apply(lambda tokens: lower_word(tokens))
clean_test_tokens2 = clean_test_tokens1.apply(lambda tokens: lower_word(tokens))
clean_validation_tokens2 = clean_validation_tokens1.apply(lambda tokens: lower_word(tokens))
clean_train_tokens2

0                        [first, citizen]
1                  [proceed, hear, speak]
3                          [speak, speak]
4                        [first, citizen]
5         [resolved, rather, die, famish]
                       ...               
29237               [talk, go, sit, weep]
29238     [till, find, occasion, revenge]
29239                          [baptista]
29240    [ever, gentleman, thus, grieved]
29241                             [comes]
Name: Cleaned_Text, Length: 29062, dtype: object

In [21]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_and_lemmatize_tokens(token_series, stemmer, lemmatizer):
    """
    Applies stemming and lemmatization to each list of tokens in a pandas Series.

    Args:
        token_series: pandas Series of lists of tokens.
        stemmer: an instance of a stemmer (e.g., PorterStemmer).
        lemmatizer: an instance of a lemmatizer (e.g., WordNetLemmatizer).

    Returns:
        stemmed_tokens: Series of stemmed token lists.
        lemmatized_tokens: Series of lemmatized token lists.
    """
    stemmed_tokens = token_series.apply(lambda tokens: [stemmer.stem(word) for word in tokens])
    lemmatized_tokens = token_series.apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
    return stemmed_tokens, lemmatized_tokens

# Apply stemming and lemmatization to each token in each list
stemmed_train, lemmatized_train = stem_and_lemmatize_tokens(clean_train_tokens2, stemmer, lemmatizer)

stemmed_test, lemmatized_test = stem_and_lemmatize_tokens(clean_test_tokens2, stemmer, lemmatizer)

stemmed_validation, lemmatized_validation = stem_and_lemmatize_tokens(clean_validation_tokens2, stemmer, lemmatizer)

print(stemmed_train.head())
print(lemmatized_train.head())


0                 [first, citizen]
1           [proceed, hear, speak]
3                   [speak, speak]
4                 [first, citizen]
5    [resolv, rather, die, famish]
Name: Cleaned_Text, dtype: object
0                   [first, citizen]
1             [proceed, hear, speak]
3                     [speak, speak]
4                   [first, citizen]
5    [resolved, rather, die, famish]
Name: Cleaned_Text, dtype: object


In [22]:
# Apply stemming and lemmatization to a whole sentence
raw_text = "Before we proceed any further, hear me speak."
text_stem = " ".join([stemmer.stem(word) for word in raw_text.split()])
text_lemma = " ".join([lemmatizer.lemmatize(word) for word in raw_text.split()])
print("Stemmed Text:", text_stem)
print("Lemmatized Text:", text_lemma)

Stemmed Text: befor we proceed ani further, hear me speak.
Lemmatized Text: Before we proceed any further, hear me speak.


AI Overview<br>
Tokenizing an NLP dataset row by row is a common and often necessary approach, but the optimal method depends on the specific context and tools being used.

**Why row-by-row (or batch) processing is common:**

- **Individual text units:** Each row in a dataset often represents a distinct unit of text (e.g., a sentence, a document, a social media post) that needs to be processed independently for tasks like sentiment analysis, text classification, or machine translation.
- **Applying a tokenizer:** Tokenizers from libraries like Hugging Face Transformers, NLTK, or spaCy are designed to process individual text inputs or batches of inputs. Applying them row by row ensures each text unit is correctly broken down into tokens according to the chosen tokenization strategy (word, subword, character).
- **Memory management:** Processing large datasets all at once can be memory-intensive. Tokenizing row by row, or in small batches, helps manage memory efficiently, especially when dealing with very long texts or a massive number of rows.

**Considerations for efficiency:**

- **Batching for speed:** While "row by row" implies individual processing, modern NLP libraries often support batch processing for efficiency. This means you can process multiple rows within a single tokenizer call, which significantly speeds up the tokenization of large datasets. Libraries like Hugging Face Datasets, with its `map` function and `batched=True`, are designed for this.
- **Parallel processing:** For extremely large datasets, consider using parallel processing techniques (e.g., Python's `multiprocessing` module, Dask) to distribute the tokenization workload across multiple CPU cores.
- **Pre-trained tokenizers:** If using pre-trained models, their associated tokenizers are optimized for efficiency and often handle batching internally.

**In summary:** Tokenizing an NLP dataset involves breaking down text into smaller units (tokens). While the conceptual approach is to process each text unit (often a row), for practical efficiency, especially with large datasets, batching is highly recommended over strictly processing one row at a time. Utilize the batching capabilities of your chosen NLP library or implement parallel processing for optimal performance.
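
The batching idea can be sketched with the standard library alone. The example below is an assumption-laden illustration: `batched` and `tokenize_batch` are hypothetical helpers, the regex tokenizer stands in for a real library's batch tokenizer, and the sample rows are made up.

```python
import re
from itertools import islice

def batched(rows, batch_size):
    """Yield successive lists of up to `batch_size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def tokenize_batch(batch):
    """Tokenize every row of a batch in one call (stand-in for a library tokenizer)."""
    return [re.findall(r"\w+", text.lower()) for text in batch]

rows = ["Speak speak", "First Citizen", "You are all resolved", "We know it", "Let us kill him"]
tokens = []
for batch in batched(rows, batch_size=2):
    tokens.extend(tokenize_batch(batch))   # one tokenizer call per batch

print(len(tokens), tokens[0])
```

Each row is still tokenized independently, so the result is the same as a row-by-row loop; batching only changes how many rows are handed to the tokenizer per call.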

AI Overview<br>
The decision of whether to remove punctuation before or after tokenization in Python depends on the specific goals of the Natural Language Processing (NLP) task.

**Removing punctuation before tokenization:**

- **Simplifies tokenization:** When punctuation is removed beforehand, tokenization becomes more straightforward, as the tokenizer primarily focuses on splitting text based on whitespace or other defined delimiters, without needing to handle punctuation as separate tokens or attached to words.
- **Reduces noise and dimensionality:** Punctuation marks often carry little semantic meaning for many NLP tasks (e.g., text classification, topic modeling) and can increase the number of unique tokens, leading to higher dimensionality in feature representation. Removing them pre-tokenization can reduce noise and improve model efficiency.
- **Ensures consistent word forms:** Words like "hello" and "hello!" would be treated as distinct tokens if punctuation is not removed, potentially leading to less accurate analysis. Removing punctuation ensures these are treated as the same base word.

**Removing punctuation after tokenization (or handling it during tokenization):**

- **Preserves specific information:** In certain NLP tasks, punctuation can be crucial. For example, in sentiment analysis, exclamation marks or question marks can indicate strong emotions. In Named Entity Recognition, hyphens in compound words (e.g., "state-of-the-art") might need to be preserved to correctly identify entities.
- **Allows for more nuanced analysis:** If the task requires understanding the grammatical structure or stylistic elements of the text, keeping punctuation (or treating it as separate tokens) might be necessary.

**General recommendation:** For most general-purpose NLP tasks where the focus is on word content rather than specific punctuation nuances, removing punctuation before tokenization is the more common and often recommended approach. This simplifies the data and generally leads to better performance for tasks like text classification, information retrieval, and basic text analysis.
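
The two orderings can give different tokens for the same text. The sketch below compares them on a made-up sentence using only the standard library; the regex tokenizer is an illustrative choice, not the method used earlier in this notebook.

```python
import re
import string

text = "Speak, speak! What's here?"

# Option 1: strip punctuation first, then split on whitespace
no_punct = text.translate(str.maketrans("", "", string.punctuation))
before = no_punct.lower().split()

# Option 2: tokenize first, keeping each punctuation mark as its own token,
# then decide per token whether to drop it
with_punct = re.findall(r"\w+|[^\w\s]", text.lower())
after = [tok for tok in with_punct if tok not in string.punctuation]

print(before)      # the apostrophe is deleted, so "What's" becomes one token
print(with_punct)  # punctuation kept as separate tokens
print(after)       # the apostrophe splits "What's" into two tokens
```

Contractions are where the difference shows up: removing the apostrophe first fuses `What's` into `whats`, while tokenizing first splits it into `what` and `s`. Which result is preferable depends on the downstream task.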