In [16]:
!pip install datasets spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [34]:
from datasets import load_dataset
import spacy
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load IMDb dataset
dataset = load_dataset("imdb")
sample_text = dataset["train"][1]["text"]
sample_label = dataset["train"][1]["label"]

In [35]:
print("Original Text:\n")
print(sample_text)
print("\nLabel:", "Positive" if sample_label == 1 else "Negative")

Original Text:

"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, a

In [36]:
text_lower = sample_text.lower()
print("Lowercased Text:\n")
print(text_lower)

Lowercased Text:

"i am curious: yellow" is a risible and pretentious steaming pile. it doesn't matter what one's political views are because this film can hardly be taken seriously on any level. as for the claim that frontal male nudity is an automatic nc-17, that isn't true. i've seen r-rated films with male nudity. granted, they only offer some fleeting views, but where are the r-rated films with gaping vulvas and flapping labia? nowhere, because they don't exist. the same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. and those pretentious indie movies like the brown bunny, in which we're treated to the site of vincent gallo's throbbing johnson, but not a trace of pink visible on chloe sevigny. before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude,

In [37]:
# Remove digits and punctuation: keep only lowercase letters and whitespace
text_clean = re.sub(r'[^a-z\s]', '', text_lower)
print("Text without punctuation and numbers:\n")
print(text_clean)

Text without punctuation and numbers:

i am curious yellow is a risible and pretentious steaming pile it doesnt matter what ones political views are because this film can hardly be taken seriously on any level as for the claim that frontal male nudity is an automatic nc that isnt true ive seen rrated films with male nudity granted they only offer some fleeting views but where are the rrated films with gaping vulvas and flapping labia nowhere because they dont exist the same goes for those crappy cable shows schlongs swinging in the breeze but not a clitoris in sight and those pretentious indie movies like the brown bunny in which were treated to the site of vincent gallos throbbing johnson but not a trace of pink visible on chloe sevigny before crying or implying doublestandard in matters of nudity the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women there are no genitals on display when actresses appears nude and the same can
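
The regex in the cell above deletes every character outside a-z and whitespace, which is why "r-rated" comes out as "rrated" and "nc-17" as "nc". Here is a minimal sketch of that behaviour next to one possible variant that substitutes a space instead of deleting. The function names and the example sentence are made up for illustration and are not part of the notebook.

```python
import re

def strip_non_letters(text: str) -> str:
    """Same rule as the cell above: delete everything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def replace_non_letters(text: str) -> str:
    """Variant: turn each run of non-letters into a single space, then trim."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

# Hypothetical example sentence, chosen to contain hyphens, digits and an apostrophe
example = "R-rated films; NC-17 isn't automatic."

deleted = strip_non_letters(example)
spaced = replace_non_letters(example)

print(deleted)  # rrated films nc isnt automatic
print(spaced)   # r rated films nc isn t automatic
```

Neither variant is clearly better: deleting fuses hyphenated words, while substituting a space splits contractions into fragments such as "isn" and "t". Which trade-off is acceptable depends on what the downstream model needs.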

In [38]:
# Tokenization (apostrophes were already stripped, so "doesnt" is split into 'does' + 'nt')
doc = nlp(text_clean)

tokens = [token.text for token in doc]
print(tokens[:20])

['i', 'am', 'curious', 'yellow', 'is', 'a', 'risible', 'and', 'pretentious', 'steaming', 'pile', 'it', 'does', 'nt', 'matter', 'what', 'ones', 'political', 'views', 'are']


In [39]:
# Stop-word removal: keep only tokens that spaCy does not flag as stop words
tokens_no_stop = [token.text for token in doc if not token.is_stop]
print("Tokens without Stop Words:\n")
print(tokens_no_stop[:20])


Tokens without Stop Words:

['curious', 'yellow', 'risible', 'pretentious', 'steaming', 'pile', 'nt', 'matter', 'ones', 'political', 'views', 'film', 'hardly', 'taken', 'seriously', 'level', 'claim', 'frontal', 'male', 'nudity']
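
The fragment 'nt' survives the filter above. As far as I understand, `token.is_stop` is essentially a membership test of the lowercased token text against the language's stop list (available as `nlp.Defaults.stop_words`), and that list is expected to hold the contraction as "n't" rather than "nt". The sketch below mimics the lookup with a tiny hand-picked stop list. `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the list is not spaCy's real one.

```python
# Hypothetical, hand-picked stop list for illustration only;
# spaCy's real English list is much longer.
STOP_WORDS = {"i", "am", "is", "a", "and", "it", "does", "what", "are", "n't"}

def remove_stop_words(tokens):
    """Keep tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Tokens as produced after punctuation was stripped ("doesnt" -> 'does', 'nt')
tokens = ["it", "does", "nt", "matter", "what", "ones", "political", "views", "are"]
kept = remove_stop_words(tokens)
print(kept)  # ['nt', 'matter', 'ones', 'political', 'views']

# With the apostrophe intact, the contraction matches the stop list and is dropped
print(remove_stop_words(["it", "does", "n't", "matter"]))  # ['matter']
```

In the notebook itself, `"n't" in nlp.Defaults.stop_words` and `"nt" in nlp.Defaults.stop_words` would confirm whether the same thing happens with the real list.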


In [40]:
# Lemmatization: keep lemmas of alphabetic, non-stop-word tokens
lemmas = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
print("Lemmatized Tokens:\n")
print(lemmas[:20])


Lemmatized Tokens:

['curious', 'yellow', 'risible', 'pretentious', 'steaming', 'pile', 'not', 'matter', 'one', 'political', 'view', 'film', 'hardly', 'take', 'seriously', 'level', 'claim', 'frontal', 'male', 'nudity']
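
The steps above run spaCy on text whose punctuation has already been removed. An alternative ordering, sketched below, hands the raw review to spaCy and filters afterwards with the same token attributes used in this notebook (`is_alpha`, `is_stop`, `lemma_`). This is a different technique from the cells above, not a restatement of them. `preprocess` is a hypothetical helper, and `nlp` is assumed to be the pipeline loaded earlier with `spacy.load("en_core_web_sm")`.

```python
def preprocess(text, nlp):
    """Tokenize raw text with the given spaCy pipeline, then filter and lemmatize.

    Assumption: `nlp` is a callable returning tokens that expose
    `is_alpha`, `is_stop` and `lemma_`, as spaCy's Token does.
    """
    doc = nlp(text)
    return [
        tok.lemma_.lower()      # lowercase after lemmatizing, not before
        for tok in doc
        if tok.is_alpha         # drops punctuation, numbers and mixed tokens
        and not tok.is_stop     # drops stop words
    ]

# In this notebook it would be called as:
# lemmas_raw = preprocess(sample_text, nlp)
# print(lemmas_raw[:20])
```

With this ordering the tokenizer sees "doesn't" and "we're" intact, so contractions should be split on their real boundaries. One consequence worth checking before adopting it: the negation that shows up as 'not' in the output above would probably be filtered out as a stop word, which may matter for sentiment classification.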
