🚀 Hi Everyone, Welcome to the GenAI Journey! 🤖✨

🌟 Since this is the first practical in the series, we’ll start with the most important step:

📌 Data Cleaning (also called Text Preprocessing for NLP / ML / GenAI)

This process transforms messy, real-world text into clean, structured input that models can understand and learn from.
This file contains reusable and modular functions for cleaning raw text.
It's designed for use in NLP tasks, chatbots, sentiment analysis, and LLM-based systems.
Each step in this script is explained with clear comments and practical use cases.

**Import Required Libraries and Dataset**

In [1]:
import os
import pandas as pd

In [2]:
data=pd.read_csv("https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv")

In [3]:
data.sample(5)

Unnamed: 0,review,sentiment
44501,"""Imaginary Heroes"" is a 2004 film starring Sig...",positive
16945,So don't even think about renting this from th...,negative
8408,"""The Next Karate Kid"" is a thoroughly predicta...",negative
34144,"I expected a bad movie, and got a bad movie. B...",negative
9360,Another hand-held horror means another divisiv...,positive


Define Individual Cleaning Functions (Text Pre-Processing)
Each function handles a specific part of the text cleaning pipeline.

1. Lowercasing

🔹 Convert all characters to lowercase for uniformity

In [4]:
data['review'] = data['review'].str.lower()

In [5]:
data["review"][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

2. Remove HTML Tags

🔹 Remove HTML tags so the text is fully readable and clean for NLP or GenAI models

In [6]:
text = "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them"


In [7]:
## Regular expressions (the re module) let us find patterns in text data
import re
def Remove_html_tags(text):
    # Non-greedy match of anything between < and >
    pattern = re.compile(r'<.*?>')
    return pattern.sub("",text)

In [8]:
Remove_html_tags(text)

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them"

In [9]:
data['review'] = data['review'].apply(Remove_html_tags)
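
🔹 Deleting a tag outright glues the neighbouring words together (the output above contains "time.This" and "zombie.OK"). A minimal sketch of one alternative, assuming it is acceptable to replace each tag with a space and then collapse repeated whitespace. The name remove_html_tags_spaced is hypothetical and not part of the original notebook:

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")   # anything that looks like <...>
SPACE_PATTERN = re.compile(r"\s+")     # runs of whitespace

def remove_html_tags_spaced(text):
    # Replace each tag with a space so adjacent sentences stay separated
    without_tags = TAG_PATTERN.sub(" ", text)
    # Collapse the extra spaces this introduces and trim the ends
    return SPACE_PATTERN.sub(" ", without_tags).strip()

sample = "fighting all the time.<br /><br />this movie is slower"
print(remove_html_tags_spaced(sample))
```

On the sample, "time." and "this" stay separated by a single space instead of being merged.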

3. Remove URLs

🔹 Remove URLs from text, e.g. https://www.example.com

In [10]:
def Remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub("",text)

In [11]:
text1 = 'Visit on Website https://www.org/'

In [12]:
Remove_url(text1)

'Visit on Website '

In [13]:
data['review'] = data['review'].apply(Remove_url)

4. Remove punctuation and special characters

🔹 The string.punctuation attribute lists the ASCII punctuation characters we will strip

In [14]:
import string

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
exclude = string.punctuation

In [17]:
def Remove_punct(text):
    for char in exclude:
        text = text.replace(char,"")
    return text

In [19]:
data['review'] = data['review'].apply(Remove_punct)

In [20]:
data['review'][2]

'i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy the plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer while some may be disappointed when they realize this is not match point 2 risk addiction i thought it was proof that woody allen is still fully in control of the style many of us have grown to lovethis was the most id laughed at one of woodys comedies in years dare i say a decade while ive never been impressed with scarlet johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanthis may not be the crown jewel of his career but it was wittier than devil wears prada and more interesting than superman a great comedy to go see with friends'
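
🔹 The loop in Remove_punct scans the whole string once per punctuation character. A minimal sketch of a single-pass alternative built on str.maketrans and str.translate. The name remove_punct_fast is hypothetical:

```python
import string

# Translation table that maps every punctuation character to None (deleted)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    # One pass over the text instead of one pass per punctuation character
    return text.translate(PUNCT_TABLE)

print(remove_punct_fast("hello, world! it's 3/10..."))
```

It should give the same result as Remove_punct, and the speed difference is likely to show on a full dataset of reviews.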

5. Slang

🔹 (Short Words / Abbreviations)


In [21]:
chat_word = {"AFK": "Away From Keyboard",
"ASAP":"As Soon As Possible",
"BTW":"By The Way",
"B4":"Before",
"LMAO":"Laugh My A.. Off",
"FYI":"For your information"}

In [22]:
text = "FYI this is not true"
text2 = "LMAO the class was so funny"
text3 = "i want report ASAP"

In [23]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        # Look the word up in upper case so "fyi" and "FYI" both match
        if w.upper() in chat_word:
            new_text.append(chat_word[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [24]:
chat_conversion(text)

'For your information this is not true'
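
🔹 chat_conversion splits on whitespace, so a token such as "ASAP!" or "btw," carries its punctuation and will not match a dictionary key. A minimal sketch of one way around this, assuming a regular expression with word boundaries is acceptable. The names chat_words, CHAT_PATTERN and expand_chat_words are hypothetical, and the dictionary is a shortened copy for illustration:

```python
import re

# Small illustrative dictionary; extend it for real data
chat_words = {
    "ASAP": "As Soon As Possible",
    "BTW": "By The Way",
    "FYI": "For your information",
}

# One alternation of all abbreviations, matched only at word boundaries
CHAT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in chat_words) + r")\b",
    flags=re.IGNORECASE,
)

def expand_chat_words(text):
    # Look up the upper-cased match so "btw" and "BTW" both expand
    return CHAT_PATTERN.sub(lambda m: chat_words[m.group(0).upper()], text)

print(expand_chat_words("btw, i want the report ASAP!"))
```

The comma and exclamation mark stay in place while the abbreviations are expanded. Case-insensitive matching can also expand ordinary words that happen to look like an abbreviation, so the dictionary needs some care.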

6. Spelling Correction

In [28]:
## Import TextBlob, which provides spelling correction
from textblob import TextBlob

In [29]:
## Text to correct
text = 'plese download my notbook'

In [31]:
TextBlob(text)

TextBlob("plese download my notbook")

In [32]:
## Create a TextBlob object from the text
txtblob = TextBlob(text)

In [33]:
txtblob.correct().string

'please download my notebook'

7. Remove common stopwords from Text

In [37]:
import nltk
from nltk.corpus import stopwords

In [38]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [39]:
stopwords.words("english")

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...]

In [40]:
text5 = "I'am Asmit and we all are learning GenAi and Nlp which is Important"

In [41]:
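def remove_stopwords(text):
    # Build the stopword set once; calling stopwords.words() for every word is slow
    stop_words = set(stopwords.words("english"))
    # Compare in lowercase and drop stopwords outright, so no empty strings
    # (and therefore no runs of spaces) end up in the result
    new_text = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(new_text)

In [42]:
remove_stopwords(text5)

"I'am Asmit learning GenAi Nlp Important"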

8. Remove emojis from text to avoid noise in analysis & Convert emojis to their textual description

In [44]:
pip install emoji

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


In [45]:
import emoji

In [46]:
original_text = "This is a 😃 sample text with some stopwords like a and the 🚀"

In [48]:
## Manual way to remove emojis: match common emoji code-point ranges
def remove_emojis(text):
    emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [49]:
remove_emojis(original_text)

'This is a  sample text with some stopwords like a and the '
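
🔹 The four ranges in remove_emojis do not cover every emoji. For example, 🤖 (U+1F916) and ✨ (U+2728) fall outside them. A minimal sketch that adds two more Unicode blocks. The name remove_emojis_extended is hypothetical, and the list of ranges is still an assumption rather than a complete emoji definition:

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator (flag) letters
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis_extended(text):
    # Delete every run of characters that falls in the ranges above
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis_extended("Welcome to the GenAI Journey! 🤖✨"))
```

For production use, a maintained library such as the emoji package imported above is usually a safer choice than a hand-written range list.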

In [50]:
## Convert emojis to their textual description
def demojize_emojis(text):
    return emoji.demojize(text)

In [51]:
demojize_emojis(original_text)

'This is a :grinning_face_with_big_eyes: sample text with some stopwords like a and the :rocket:'

9. Tokenization

In [52]:
import re
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize

In [57]:
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [69]:
text="i am asmit working as a data Analyst Intern. i work in analytical team"

In [70]:
text.split()

['i',
 'am',
 'asmit',
 'working',
 'as',
 'a',
 'data',
 'Analyst',
 'Intern.',
 'i',
 'work',
 'in',
 'analytical',
 'team']

In [68]:
# split() is the simplest way to tokenize
# but when split() is not enough (e.g. 'Intern.' keeps its full stop), you can use a regular expression
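
🔹 text.split() leaves the full stop attached to 'Intern.'. A minimal sketch of regular-expression tokenization that separates words from punctuation. The name regex_tokenize and the pattern are assumptions chosen for illustration, not the only reasonable choice:

```python
import re

def regex_tokenize(text):
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

text = "i am asmit working as a data Analyst Intern. i work in analytical team"
print(regex_tokenize(text))
```

With this pattern 'Intern' and '.' come out as two separate tokens. Contractions such as "don't" would be split into three tokens, which is one reason to reach for a library tokenizer.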

In [58]:
my_text = "i am going to delhi today evening"

In [59]:
word_tokenize(my_text)

['i', 'am', 'going', 'to', 'delhi', 'today', 'evening']

In [60]:
## A corpus is essentially a database of language data, containing a vast amount of text and/or speech
my_corpus = """Generative artificial intelligence (generative AI, genAI, GenAI, GAI or GenAI[1]) is artificial intelligence capable of generating text, images or other data using generative models,[2] often in response to prompts.[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.[5][6]
Improvements in transformer-based deep neural networks enabled an AI boom of generative AI systems in the early 2020s. These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA, and text-to-image artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7][8][9] Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models."""

In [61]:
sent_tokenize(my_corpus)

['Generative artificial intelligence (generative AI, genAI, GenAI, GAI or GenAI[1]) is artificial intelligence capable of generating text, images or other data using generative models,[2] often in response to prompts.',
 '[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.',
 '[5][6]\nImprovements in transformer-based deep neural networks enabled an AI boom of generative AI systems in the early 2020s.',
 'These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA, and text-to-image artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7][8][9] Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.']

Tokenization can be done in several ways, from the simplest to the most capable:
- split
- regular expressions (patterns are easy to look up online)
- nltk
- spacy

In [64]:
## English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
import spacy

In [63]:
nlp = spacy.load("en_core_web_sm")

In [65]:
def tokenize_text(text):
    # Process the text with the spacy model
    doc = nlp(text)

    # Extract tokens from the processed document
    tokens = [token.text for token in doc]

    return tokens

In [66]:
tokens=tokenize_text(my_corpus)

In [67]:
for word in tokens:
    print(word)

Generative
artificial
intelligence
(
generative
AI
,
genAI
,
GenAI
,
GAI
or
GenAI[1
]
)
is
artificial
intelligence
capable
of
generating
text
,
images
or
other
data
using
generative
models,[2
]
often
in
response
to
prompts.[3][4
]
Generative
AI
models
learn
the
patterns
and
structure
of
their
input
training
data
and
then
generate
new
data
that
has
similar
characteristics.[5][6
]


Improvements
in
transformer
-
based
deep
neural
networks
enabled
an
AI
boom
of
generative
AI
systems
in
the
early
2020s
.
These
include
large
language
model
(
LLM
)
chatbots
such
as
ChatGPT
,
Copilot
,
Bard
,
and
LLaMA
,
and
text
-
to
-
image
artificial
intelligence
art
systems
such
as
Stable
Diffusion
,
Midjourney
,
and
DALL
-
E.[7][8][9
]
Companies
such
as
OpenAI
,
Anthropic
,
Microsoft
,
Google
,
and
Baidu
as
well
as
numerous
smaller
firms
have
developed
generative
AI
models
.


In [71]:
# Example usage:
text = "Spacy is a powerful natural language processing library."
tokens = tokenize_text(text)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")

Original text: Spacy is a powerful natural language processing library.
Tokens: ['Spacy', 'is', 'a', 'powerful', 'natural', 'language', 'processing', 'library', '.']


10. Stemming and Lemmatization

In [72]:
## Stemming
from nltk.stem import PorterStemmer

def stemming(text):
    obj=PorterStemmer()
    stem_word=[obj.stem(word) for word in text.split()]
    return stem_word

In [73]:
input_sentence = "The quick brown foxes are jumping over the lazy dogs"

In [75]:
stemming(input_sentence)

['the', 'quick', 'brown', 'fox', 'are', 'jump', 'over', 'the', 'lazi', 'dog']

In [76]:
## Lemmatization
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [77]:
from nltk.stem import WordNetLemmatizer

def lammatization(text):
    words=text.split()
    lemmetizer=WordNetLemmatizer()
    lemetized_word=[lemmetizer.lemmatize(word) for word in words]
    return lemetized_word

In [78]:
lammatization(input_sentence)

['The',
 'quick',
 'brown',
 'fox',
 'are',
 'jumping',
 'over',
 'the',
 'lazy',
 'dog']

🔹 Note: 'are' and 'jumping' come back unchanged because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, e.g. lemmatize(word, pos='v') for verbs.

Take a different scenario and apply the steps that fit it:
- tokenization
- lowercase
- uppercase
- emojis
- punctuation
- HTML, URL
- stopwords
- abbreviations or slang
- stemming and lemmatization
- spelling correction

Did you get the idea of how to clean the text?
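
🔹 One possible way to chain several of the steps from this notebook into a single function, using only the standard library. This is a sketch under assumptions, not the notebook's own pipeline: the name clean_text and the step order are hypothetical, and the steps that need extra packages (spelling, stopwords, lemmatization) are left out:

```python
import re
import string

HTML_PATTERN = re.compile(r"<[^>]+>")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    text = text.lower()                   # 1. lowercase
    text = HTML_PATTERN.sub(" ", text)    # 2. HTML tags -> space
    text = URL_PATTERN.sub(" ", text)     # 3. URLs -> space
    text = text.translate(PUNCT_TABLE)    # 4. strip punctuation
    return " ".join(text.split())         # 5. collapse whitespace

print(clean_text("Great movie!<br /><br />See https://example.com for more..."))
```

The order matters: tags and URLs are removed before punctuation, because stripping "<", ">", ":" and "/" first would leave fragments that the two patterns can no longer recognise.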
