This toy program compares sentences from the Poetry Foundation with the Tiny Shakespeare dataset on Kaggle and identifies the words that appear in both modern poems and the Bard’s plays.

References:<br>
[Dataset: TinyShakespeare (Shakespeare's Plays)](https://www.kaggle.com/datasets/thedevastator/the-bards-best-a-character-modeling-dataset)<br>
[NLP (Natural Language Processing) with Python 1.Representing text as numerical data](https://www.kaggle.com/code/faressayah/natural-language-processing-nlp-for-beginners)

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("thedevastator/the-bards-best-a-character-modeling-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/the-bards-best-a-character-modeling-dataset


In [2]:
import pandas as pd
import re

In [3]:
df_raw = pd.read_csv('/kaggle/input/the-bards-best-a-character-modeling-dataset/train.csv', encoding='utf-8')
df_raw.head()

Unnamed: 0,text
0,"First Citizen:\nBefore we proceed any further,..."


In [4]:
# Read the raw file as plain text and split it into non-empty lines
with open('/kaggle/input/the-bards-best-a-character-modeling-dataset/train.csv', 'r', encoding='utf-8') as f:
    content = f.read()

lines = content.replace('\r\n', '\n').replace('\r', '\n').split('\n')
lines = [line.strip() for line in lines if line.strip()]

# Build a DataFrame with one original line per row
df_origin = pd.DataFrame({'origin_text': lines})

df_origin.head()

Unnamed: 0,origin_text
0,text
1,"""First Citizen:"
2,"Before we proceed any further, hear me speak."
3,All:
4,"Speak, speak."
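Because `df_raw` shows the whole text sitting in a single `text` cell, an alternative to re-reading the file as plain text is to split that cell on newlines. The sketch below uses a tiny in-memory CSV as a stand-in for `train.csv` (the sample text and the name `sample_csv` are assumptions for illustration). A side effect of parsing the file as CSV first is that the header line and the stray quote characters never become rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: one quoted cell holding the whole text.
sample_csv = (
    'text\n'
    '"First Citizen:\n'
    'Before we proceed any further, hear me speak.\n'
    'All:\n'
    'Speak, speak."\n'
)

df_raw = pd.read_csv(io.StringIO(sample_csv))

# Split the single cell into lines, one row per line, then drop blank lines.
lines = df_raw['text'].str.split('\n').explode().str.strip()
lines = lines[lines != ''].reset_index(drop=True)

df_origin = pd.DataFrame({'origin_text': lines})
print(df_origin)
```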


In [5]:
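# Drop speaker labels (lines ending with ":") and the CSV header line "text".
# Note: this is a simple heuristic, so a spoken line that ends with ":" is dropped too.
def remove_names(texts):
    return [e for e in texts if not e.endswith(":") and e != "text"]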

In [6]:
df = remove_names(df_origin["origin_text"])
df

['Before we proceed any further, hear me speak.',
 'Speak, speak.',
 'You are all resolved rather to die than to famish?',
 'Resolved. resolved.',
 'First, you know Caius Marcius is chief enemy to the people.',
 "We know't, we know't.",
 "Let us kill him, and we'll have corn at our own price.",
 "Is't a verdict?",
 "No more talking on't; let it be done: away, away!",
 'One word, good citizens.',
 'We are accounted poor citizens, the patricians good.',
 'What authority surfeits on would relieve us: if they',
 'would yield us but the superfluity, while it were',
 'wholesome, we might guess they relieved us humanely;',
 'but they think we are too dear: the leanness that',
 'afflicts us, the object of our misery, is as an',
 'inventory to particularise their abundance; our',
 'sufferance is a gain to them Let us revenge this with',
 'our pikes, ere we become rakes: for the gods know I',
 'speak this in hunger for bread, not in thirst for revenge.',
 'Would you proceed especially against Ca

In [7]:
df = pd.DataFrame({'origin_text': df})

# Optionally sample a few rows from `df` for a quick test
#df = df.sample(n=3, random_state=42)


In [8]:
# Lowercase each line, extract word tokens with a regex, and join them back into a single string
df['origin_text'] = df['origin_text'].str.lower().apply(lambda x: ' '.join(re.findall(r'\b\w+\b', x)))
df

Unnamed: 0,origin_text
0,before we proceed any further hear me speak
1,speak speak
2,you are all resolved rather to die than to famish
3,resolved resolved
4,first you know caius marcius is chief enemy to...
...,...
21575,and for your love to her lead apes in hell
21576,talk not to me i will go sit and weep
21577,till i can find occasion of revenge
21578,was ever gentleman thus grieved as i
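One consequence of the `\b\w+\b` pattern is easy to see on lines from the data: an apostrophe is not a word character, so contractions are split into separate tokens. A minimal check using only the standard library (the helper name `normalize` is an assumption for illustration):

```python
import re

def normalize(text):
    """Lowercase, keep runs of word characters, rejoin with single spaces."""
    return ' '.join(re.findall(r'\b\w+\b', text.lower()))

print(normalize("We know't, we know't."))  # we know t we know t
print(normalize("Is't a verdict?"))        # is t a verdict
```

The stray `t` tokens do not reach the vocabulary later on, because `CountVectorizer` with default settings ignores single-character tokens.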


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(df['origin_text'])
vect_name = vect.get_feature_names_out()  # Call the method after fitting
print(vect_name)

simple_train = vect.transform(df['origin_text'])
simple_train = pd.DataFrame(simple_train.toarray(), columns=vect_name)
simple_train

['abandon' 'abase' 'abate' ... 'zealous' 'zodiacs' 'zounds']


Unnamed: 0,abandon,abase,abate,abated,abbot,abed,abel,abet,abhor,abhorr,...,your,yours,yourself,yourselves,youth,youthful,zeal,zealous,zodiacs,zounds
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21575,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
21576,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21577,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21578,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


CountVectorizer converts a collection of text documents into a matrix of token counts, a common preprocessing step for machine learning on text.

Example: given the documents ["I love cats", "I hate dogs"],<br>
CountVectorizer with default settings creates:<br>
Vocabulary: ["cats", "dogs", "hate", "love"] (text is lowercased, and single-character tokens such as "I" are dropped)<br>
Document 1: [1, 0, 0, 1] (1 "cats", 0 "dogs", 0 "hate", 1 "love")<br>
Document 2: [0, 1, 1, 0] (0 "cats", 1 "dogs", 1 "hate", 0 "love")


The result of vect.fit() does not need to be assigned back to vect. The fit() method updates the existing vect object in place and returns self (the same object).

When you call vect.fit(df['origin_text']), it:<br>
Updates the internal state of the existing vect CountVectorizer object<br>
Builds the vocabulary from the training data<br>
Returns the same vect object (which is typically ignored)<br>

The vect variable still points to the same CountVectorizer object, but now it's trained and ready to transform text data.

In [10]:
poetry = ["Any fool can get into an ocean",
          "But it takes a Goddess",
          "Peichao, Look at the sea otters bobbing wildly"]
poetry = pd.DataFrame({'poetry_test': poetry})
# Apply the same normalization as the training text: lowercase, keep word tokens, join with single spaces
poetry['poetry_test'] = poetry['poetry_test'].str.lower().apply(lambda x: ' '.join(re.findall(r'\b\w+\b', x)))
poetry


Unnamed: 0,poetry_test
0,any fool can get into an ocean
1,but it takes a goddess
2,peichao look at the sea otters bobbing ...


In [11]:
simple_test = vect.transform(poetry['poetry_test'])
simple_test = pd.DataFrame(simple_test.toarray(), columns=vect_name)
simple_test

Unnamed: 0,abandon,abase,abate,abated,abbot,abed,abel,abet,abhor,abhorr,...,your,yours,yourself,yourselves,youth,youthful,zeal,zealous,zodiacs,zounds
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Get all words from all rows combined
all_words = []
for i in range(len(poetry)):
    words = poetry['poetry_test'].iloc[i].split()
    all_words.extend(words)

result = pd.DataFrame({
    'word': all_words,
    'in_BardsPlay': [1 if w in vect_name else 0 for w in all_words]
})

print(result)

       word  in_BardsPlay
0       any             1
1      fool             1
2       can             1
3       get             1
4      into             1
5        an             1
6     ocean             1
7       but             1
8        it             1
9     takes             1
10        a             0
11  goddess             1
12  peichao             0
13     look             1
14       at             1
15      the             1
16      sea             1
17   otters             0
18  bobbing             0
19   wildly             1
