## 🚀 Hi Everyone, Welcome to the GenAI Journey! 🤖✨
#### 🌟 As this is our very first practical in this series, we’ll start with the most important step: 
#### 📌 **Data Cleaning (also called Text Preprocessing for NLP / ML / GenAI)** 
##### This process transforms messy, real-world text into clean, structured input that models can understand and learn from.
##### This file contains reusable and modular functions for cleaning raw text.
##### It's designed for use in NLP tasks, chatbots, sentiment analysis, and LLM-based systems.
##### Each step in this script is explained with clear comments and practical use cases.

### ✅ 1. Import Required Libraries and Dataset

In [39]:
import re
import emoji
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

In [41]:
data = pd.read_csv(r"https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv")

In [5]:
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [14]:
data.shape

(50000, 2)

In [11]:
data['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

#### As we can see, the user review data (movie feedback) is not clean. We will clean it by removing noise, lowercasing the text, lemmatizing, and removing stopwords to prepare it for NLP models.

### ✅ 2. Define Individual Cleaning Functions (Text Pre-Processing)
##### Each function handles a specific part of the text cleaning pipeline.

#### 1. Lowercasing 
##### 🔹 Convert all characters to lowercase for uniformity

In [59]:
data['review'] = data['review'].str.lower()

In [61]:
data['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

#### 2. Remove HTML Tags
##### 🔹 Remove HTML tags so the text is fully readable and clean for NLP or GenAI models

In [49]:
text = "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them"

In [51]:
text

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them"

In [53]:
# Use a regular expression to find HTML tags in the text
import re

def Remove_html_tags(text):
    # Non-greedy match of everything between '<' and '>'
    pattern = re.compile('<.*?>')
    return pattern.sub("", text)

In [55]:
Remove_html_tags(text)

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them"

In [57]:
data['review'] = data['review'].apply(Remove_html_tags)
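The output of `Remove_html_tags` above shows neighbouring sentences fused together ("time.This"), because each tag is replaced by an empty string. The following is a minimal sketch of one possible variant, assuming it is acceptable to replace each tag with a space and to decode HTML entities such as `&amp;`. The name `strip_html` is hypothetical and not part of this notebook.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like <...>
WS_RE = re.compile(r"\s+")        # runs of whitespace

def strip_html(text):
    """Replace tags with a space, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return WS_RE.sub(" ", decoded).strip()

sample = "fighting all the time.<br /><br />This movie is slow &amp; dull."
print(strip_html(sample))
# fighting all the time. This movie is slow & dull.
```

Like the original pattern, this is a heuristic: text containing a literal `<` followed later by `>` would also be removed.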

#### 3. Remove URLs
##### 🔹 Remove URLs from text (e.g., http://example.com)

In [74]:
def Remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub("",text)

In [82]:
text1 = 'Visit on Website https://www.org/'

In [84]:
Remove_url(text1)

'Visit on Website '

In [86]:
data['review'] = data['review'].apply(Remove_url)

#### 4. Remove punctuation and special characters
##### 🔹 Strip every character listed in `string.punctuation`

In [91]:
import string

In [102]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [104]:
exclude = string.punctuation

In [108]:
def Remove_punct(text):
    for char in exclude:
        text = text.replace(char,"")
    return text

In [110]:
Remove_punct(text)

'Basically theres a family where a little boy Jake thinks theres a zombie in his closet  his parents are fighting all the timebr br This movie is slower than a soap opera and suddenly Jake decides to become Rambo and kill the zombiebr br OK first of all when youre going to make a film you must Decide if its a thriller or a drama As a drama the movie is watchable Parents are divorcing  arguing like in real life And then we have Jake with his closet which totally ruins all the film I expected to see a BOOGEYMAN similar movie and instead i watched a drama with some meaningless thriller spotsbr br 3 out of 10 just for the well playing parents  descent dialogs As for the shots with Jake just ignore them'

In [112]:
data['review'] = data['review'].apply(Remove_punct)

In [114]:
data['review'][2]

'i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy the plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer while some may be disappointed when they realize this is not match point 2 risk addiction i thought it was proof that woody allen is still fully in control of the style many of us have grown to lovethis was the most id laughed at one of woodys comedies in years dare i say a decade while ive never been impressed with scarlet johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanthis may not be the crown jewel of his career but it was wittier than devil wears prada and more interesting than superman a great comedy to go see with friends'
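`Remove_punct` calls `str.replace` once per punctuation character, so every review is scanned 32 times. A minimal sketch of a single-pass alternative using `str.translate` is shown here; `remove_punct_fast` is a hypothetical name, and the result should match `Remove_punct` for ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Remove ASCII punctuation in a single pass over the text."""
    return text.translate(PUNCT_TABLE)

print(remove_punct_fast("OK, first of all: it's a drama!"))
# OK first of all its a drama
```

Neither version touches non-ASCII punctuation such as curly quotes, because `string.punctuation` only lists ASCII characters.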

#### 5. Slang (Short Words / Abbreviations)

In [117]:
chat_word = {
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "BTW": "By The Way",
    "B4": "Before",
    "LMAO": "Laughing My A.. Off",
    "FYI": "For your information",
}

In [129]:
text = "FYI this is not true"
text2 = "LMAO the class was so funny"
text3 = "i want report ASAP"

In [137]:
def chat_conversion(text):
    # Replace each known abbreviation with its expanded form
    new_text = []
    for w in text.split():
        if w.upper() in chat_word:
            new_text.append(chat_word[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [139]:
chat_conversion(text)

'For your information this is not true'
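Because `chat_conversion` splits on whitespace, an abbreviation with punctuation attached (for example `ASAP!`) is not found in the dictionary. A minimal sketch of a regex-based variant with word boundaries follows; `CHAT_WORDS` is a small illustrative subset of the dictionary and `expand_chat_words` is a hypothetical name.

```python
import re

# Illustrative subset of the abbreviation dictionary (assumption).
CHAT_WORDS = {
    "ASAP": "As Soon As Possible",
    "BTW": "By The Way",
    "FYI": "For your information",
}

# One alternation of all keys, longest first, bounded by word boundaries.
_KEYS = sorted(map(re.escape, CHAT_WORDS), key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(_KEYS) + r")\b", flags=re.IGNORECASE)

def expand_chat_words(text):
    """Expand abbreviations even when punctuation is attached to them."""
    return _PATTERN.sub(lambda m: CHAT_WORDS[m.group(1).upper()], text)

print(expand_chat_words("btw, i want the report asap!"))
# By The Way, i want the report As Soon As Possible!
```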

#### 6. Spelling Correction

In [146]:
from textblob import TextBlob  # TextBlob provides a simple spelling corrector

In [158]:
text = 'plese downlod my notbook'  # sample text with misspellings

In [160]:
TextBlob(text)

TextBlob("plese downlod my notbook")

In [162]:
txtblob = TextBlob(text)  # wrap the text in a TextBlob object

In [164]:
txtblob.correct().string

'please download my notebook'

#### 7. Remove common stopwords

In [167]:
from nltk.corpus import stopwords

In [171]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asmit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [175]:
stopwords.words("english")

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...]
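The list printed above includes the negations `no`, `nor`, and `not`. For sentiment analysis, dropping them can flip the meaning of a review ("not good" becomes "good"). Here is a minimal sketch of keeping negations while removing other stopwords. `STOP_WORDS` is a tiny hand-written stand-in for the NLTK list, and all names are hypothetical.

```python
# Small stand-in for NLTK's English list (assumption: the real list is much longer).
STOP_WORDS = {"a", "the", "is", "was", "this", "not", "no", "nor", "and"}
NEGATIONS = {"not", "no", "nor"}

# Keep negation words so that "not good" does not collapse to "good".
SENTIMENT_STOP_WORDS = STOP_WORDS - NEGATIONS

def remove_stopwords_keep_negation(text):
    """Drop stopwords (case-insensitive) but keep negation words."""
    kept = [w for w in text.split() if w.lower() not in SENTIMENT_STOP_WORDS]
    return " ".join(kept)

print(remove_stopwords_keep_negation("this movie was not good"))
# movie not good
```

With the real NLTK list, the same idea would be `set(stopwords.words("english")) - NEGATIONS`.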

In [177]:
text5 = "I'am Asmit and we all are learning GenAi and Nlp which is Important"

In [187]:
stop_words = set(stopwords.words("english"))  # build the lookup set once

def remove_stopwords(text):
    # Keep only the words that are not stopwords (case-insensitive check)
    new_text = []
    for word in text.split():
        if word.lower() not in stop_words:
            new_text.append(word)
    return " ".join(new_text)

In [189]:
remove_stopwords(text5)

"I'am Asmit learning GenAi Nlp Important"

#### 8. Remove emojis from text to avoid noise in analysis, or convert emojis to their textual description

In [194]:
import emoji

In [196]:
original_text = "This is a 😃 sample text with some stopwords like a and the 🚀"

In [202]:
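def remove_emojis(text):
    # Covers emoticons, symbols & pictographs, transport & map symbols, and flags;
    # emojis outside these four blocks are left in place.
    emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)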

In [204]:
remove_emojis(original_text)

'This is a  sample text with some stopwords like a and the '
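The pattern in `remove_emojis` lists four code-point ranges, so an emoji outside them (for example 🤖, U+1F916) passes through unchanged. As a rough alternative, here is a sketch that drops every character in the Unicode category `So` ("Symbol, other"), which is where most emoji characters fall. This is a different technique from the regex above, and `strip_symbol_chars` is a hypothetical name.

```python
import unicodedata

def strip_symbol_chars(text):
    """Drop every character whose Unicode category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

print(repr(strip_symbol_chars("Great movie 😃🚀 loved it")))
print(repr(strip_symbol_chars("bot 🤖")))
```

The trade-off is that other `So` characters, such as © or °, are removed as well, while skin-tone modifiers and zero-width joiners belong to other categories and would remain.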

##### Convert emojis to their textual description 

In [208]:
def demojize_emojis(text):
    return emoji.demojize(text)

In [210]:
demojize_emojis(original_text)

'This is a :grinning_face_with_big_eyes: sample text with some stopwords like a and the :rocket:'

#### 9. Tokenization

In [221]:
import re
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize

In [227]:
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\asmit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [231]:
my_text = "i am going to delhi evening"

In [233]:
word_tokenize(my_text)

['i', 'am', 'going', 'to', 'delhi', 'evening']

In [13]:
my_corpus = """Generative artificial intelligence (generative AI, genAI, GenAI, GAI or GenAI[1]) is artificial intelligence capable of generating text, images or other data using generative models,[2] often in response to prompts.[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.[5][6]
Improvements in transformer-based deep neural networks enabled an AI boom of generative AI systems in the early 2020s. These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA, and text-to-image artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7][8][9] Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models."""

In [15]:
my_corpus

'Generative artificial intelligence (generative AI, genAI, GenAI, GAI or GenAI[1]) is artificial intelligence capable of generating text, images or other data using generative models,[2] often in response to prompts.[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.[5][6]\nImprovements in transformer-based deep neural networks enabled an AI boom of generative AI systems in the early 2020s. These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA, and text-to-image artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7][8][9] Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.'

In [239]:
sent_tokenize(my_corpus)

['Generative artificial intelligence (generative AI, genAI, GenAI, GAI or GenAI[1]) is artificial intelligence capable of generating text, images or other data using generative models,[2] often in response to prompts.',
 '[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.',
 '[5][6]\nImprovements in transformer-based deep neural networks enabled an AI boom of generative AI systems in the early 2020s.',
 'These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA, and text-to-image artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7][8][9] Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.']
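In the output above, bracketed citation markers such as `[3][4]` stick to the start of the following sentence, and `DALL-E.[7][8][9] Companies` is not split at all. A minimal sketch of stripping such markers before tokenizing follows, assuming markers always have the form `[digits]`; `strip_citation_markers` is a hypothetical name.

```python
import re

CITATION_RE = re.compile(r"\[\d+\]")  # matches [1], [23], ...

def strip_citation_markers(text):
    """Remove Wikipedia-style numeric citation markers."""
    return CITATION_RE.sub("", text)

snippet = "often in response to prompts.[3][4] Generative AI models learn patterns.[5][6]"
print(strip_citation_markers(snippet))
# often in response to prompts. Generative AI models learn patterns.
```

Running `sent_tokenize` on the cleaned text should give it cleaner sentence boundaries to work with.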

#### Tokenization can be done with NLTK (as above) and also with spaCy. Common approaches:

##### `str.split()`
##### Regular expressions
##### NLTK
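To make the first two approaches concrete, here is a small sketch comparing a plain whitespace split with a regular-expression tokenizer. The sample sentence and the pattern are illustrative choices, not taken from the notebook.

```python
import re

sentence = "i am going to delhi, this evening!"

# 1. Whitespace split: punctuation stays glued to the neighbouring word.
split_tokens = sentence.split()

# 2. Regex: runs of word characters, or any single non-space, non-word character.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
# ['i', 'am', 'going', 'to', 'delhi,', 'this', 'evening!']
print(regex_tokens)
# ['i', 'am', 'going', 'to', 'delhi', ',', 'this', 'evening', '!']
```

The regex version separates punctuation into its own tokens, which is closer to what `word_tokenize` returns for simple sentences.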

In [4]:
import spacy

# Load the English NLP model from spacy
nlp = spacy.load("en_core_web_sm")

In [7]:
def tokenize_text(text):
    # Process the text with the spacy model
    doc = nlp(text)
    
    # Extract tokens from the processed document
    tokens = [token.text for token in doc]
    
    return tokens

In [17]:
tokens=tokenize_text(my_corpus)

In [19]:
for word in tokens:
    print(word)

Generative
artificial
intelligence
(
generative
AI
,
genAI
,
GenAI
,
GAI
or
GenAI[1
]
)
is
artificial
intelligence
capable
of
generating
text
,
images
or
other
data
using
generative
models,[2
]
often
in
response
to
prompts.[3][4
]
Generative
AI
models
learn
the
patterns
and
structure
of
their
input
training
data
and
then
generate
new
data
that
has
similar
characteristics.[5][6
]


Improvements
in
transformer
-
based
deep
neural
networks
enabled
an
AI
boom
of
generative
AI
systems
in
the
early
2020s
.
These
include
large
language
model
(
LLM
)
chatbots
such
as
ChatGPT
,
Copilot
,
Bard
,
and
LLaMA
,
and
text
-
to
-
image
artificial
intelligence
art
systems
such
as
Stable
Diffusion
,
Midjourney
,
and
DALL
-
E.[7][8][9
]
Companies
such
as
OpenAI
,
Anthropic
,
Microsoft
,
Google
,
and
Baidu
as
well
as
numerous
smaller
firms
have
developed
generative
AI
models
.


#### 10. Stemming and Lemmatization