
# NLP TEXT PREPROCESSING

## Efficient Text Preprocessing Pipeline for Natural Language Processing Applications

## INTRODUCTION

Natural Language Processing (NLP) is a core area of Artificial Intelligence that focuses on enabling machines to understand, interpret, and generate human language. However, real-world text data is often noisy, unstructured, and inconsistent. Before applying any NLP model or algorithm, it is crucial to preprocess and clean the raw text data to ensure better performance and accuracy.

This project focuses on developing an automated text preprocessing pipeline capable of handling various forms of unstructured text and preparing it for downstream NLP tasks such as sentiment analysis, text classification, topic modeling, and machine translation.

## OBJECTIVES

* To clean and normalize raw text data from multiple sources.

* To remove noise such as punctuation, HTML tags, and stop words.

* To apply text normalization techniques like tokenization, stemming, and lemmatization.

* To convert processed text into numerical representations suitable for machine learning models.

* To evaluate the impact of preprocessing steps on NLP model performance.

## METHODOLOGY

The project is divided into the following stages:

i) **Data Collection:**

Collect raw textual data from open datasets (this notebook uses the IMDB movie review dataset).

ii) **Data Cleaning:**

* Lowercasing text.

* Removing punctuation, special characters, numbers, and extra whitespace.

* Removing URLs, emojis, and HTML tags.

iii) **Tokenization:**

Splitting sentences into individual words or tokens using libraries like NLTK or spaCy.

iv) **Stop Word Removal:**

Removing frequently occurring but semantically weak words (e.g., “the”, “is”, “and”).

v) **Text Normalization:**

* Stemming: Reducing words to their root form using the Porter or Snowball stemmer.

* Lemmatization: Using linguistic rules to obtain the base form of words.
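The cleaning stage listed above can be sketched as a single function. This is a minimal illustration using only the standard library, not the notebook's final pipeline: `clean_text` and the module-level pattern names are hypothetical, and tags and URLs are replaced with a space (rather than an empty string) so that neighbouring words do not fuse together.

```python
import re
import string

TAG_RE = re.compile(r'<.*?>')                    # HTML tags
URL_RE = re.compile(r'https?://\S+|www\.\S+')    # http(s) and bare www links
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Hypothetical helper that chains the cleaning steps in a safe order."""
    text = text.lower()
    text = TAG_RE.sub(' ', text)        # before punctuation removal, or '<br />' becomes 'br'
    text = URL_RE.sub(' ', text)        # before punctuation removal, or '://' is gone
    text = text.translate(PUNCT_TABLE)  # drop punctuation characters
    text = re.sub(r'\d+', ' ', text)    # drop numbers
    return ' '.join(text.split())       # collapse extra whitespace

print(clean_text('A <b>GREAT</b> film!! See https://example.com for 10 reasons.'))
# a great film see for reasons
```

The order matters: once punctuation is stripped, neither the tag pattern nor the URL pattern can match any more.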



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re

In [None]:
data_path = '/content/IMDB Dataset.csv'

In [None]:
df = pd.read_csv(data_path)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
df.shape

(50000, 2)

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


In [None]:
df['sentiment'].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
positive,25000
negative,25000


In [None]:
# Keep only the first 10 reviews so the effect of each step is easy to inspect

df=df.head(10).copy()


In [None]:
df.shape

(10, 2)

# Lowercasing Text

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"this show was an amazing, fresh & innovative i...",negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


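# Removing HTML tags

In [None]:
# re: Python's regular expression module
import re

# Compile once; the non-greedy '<.*?>' matches each tag separately
html_tag_pattern = re.compile(r'<.*?>')

def remove_html_tags(text):
  return html_tag_pattern.sub('', text)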

In [None]:
df['review'] = df['review'].apply(remove_html_tags)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"this show was an amazing, fresh & innovative i...",negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive
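The tag pattern above removes markup such as `<br />` but leaves HTML entities (for example `&amp;`) untouched. A minimal sketch of handling both, assuming entity decoding is wanted at all: `strip_html` is a hypothetical helper, and `html.unescape` comes from the standard library.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def strip_html(text):
    # Hypothetical helper: drop tags first, then decode entities such as &amp;
    return html.unescape(TAG_RE.sub('', text))

sample = 'a wonderful little production. <br /><br />the filming &amp; editing'
print(strip_html(sample))
# a wonderful little production. the filming & editing
```

Decoding after tag removal means that escaped text such as `&lt;b&gt;` survives as literal characters instead of being mistaken for a tag.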


# Removing URLs

In [None]:
def remove_url(text):
  # Match http(s):// links and bare www. links up to the next whitespace
  pattern=re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub('',text)

In [None]:
df['review'] =df['review'].apply(remove_url)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"this show was an amazing, fresh & innovative i...",negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive
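The ten sample reviews contain no links, so the output above is unchanged. The sketch below applies the same pattern to a made-up string and also collapses the leftover whitespace, a step listed in the methodology. `remove_url_and_squeeze` is a hypothetical name and the example URLs are placeholders.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')

def remove_url_and_squeeze(text):
    # Hypothetical helper: strip URLs, then collapse runs of whitespace
    return re.sub(r'\s+', ' ', URL_RE.sub('', text)).strip()

print(remove_url_and_squeeze('full review at https://example.com/review?id=1 or www.example.org today'))
# full review at or today
```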


# Punctuation handling

In [None]:
# string provides the punctuation constant; time is used below to time the removal

import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_punctuation(text):
  for char in exclude:
    text=text.replace(char,'')
  return text

In [None]:
df['review'] = df['review'].apply(remove_punctuation)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
5,probably my alltime favorite movie a story of ...,positive
6,i sure would like to see a resurrection of a u...,positive
7,this show was an amazing fresh innovative ide...,negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


In [None]:
#time taken to remove punctuation

t1=time.time()
df['review'] = df['review'].apply(remove_punctuation)
t2=time.time()
total_time=t2-t1
print(total_time)

0.0007281303405761719
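The loop above calls `str.replace` once per punctuation character. A common alternative is a single `str.translate` pass with a precomputed table. The sketch below compares the two on a made-up sample; the function names are hypothetical, and the timings depend on the machine and input size, so none are quoted here.

```python
import string
import timeit

def remove_punctuation_loop(text):
    # Same approach as the notebook: one replace() call per punctuation character
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

# Translation table that maps every punctuation character to None (deleted)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(text):
    # Single pass over the text using the precomputed table
    return text.translate(PUNCT_TABLE)

sample = "probably my all-time favorite movie, a story of selflessness! " * 1000

# Both versions must produce identical output before timings are compared
assert remove_punctuation_loop(sample) == remove_punctuation_translate(sample)

for fn in (remove_punctuation_loop, remove_punctuation_translate):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f'{fn.__name__}: {seconds:.4f} s')
```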


# Chat word conversion

In [None]:
# Expand chat abbreviations/acronyms (e.g. IDK)

chat_words={
    'AFAIK':'As Far As I Know',
    'AFK':'Away From Keyboard',
    'ASAP':'As Soon As Possible',
    'ATK':'At The Keyboard',
    'ATM':'At The Moment',
    'A3':'Anytime, Anywhere, Anyplace',
    'LOL':'Laughing Out Loud',
    'IDK':"I Don't Know"
    }


In [None]:
def chat_conversion(text):
  new_text=[]
  for w in text.split():
    if w.upper() in chat_words:
      new_text.append(chat_words[w.upper()])
    else:
      new_text.append(w)
  return " ".join(new_text)


In [None]:
chat_conversion('Do this work asap')

'Do this work As Soon As Possible'
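Because `chat_conversion` splits on whitespace, a token such as `asap!` would not match the dictionary unless punctuation has already been removed. A minimal sketch of a variant that matches whole words with a regular expression and keeps adjacent punctuation; `chat_conversion_regex` and `sample_chat_words` are hypothetical names, and the dictionary is a two-entry subset used only for illustration.

```python
import re

# Small illustrative subset of the chat word dictionary
sample_chat_words = {
    'ASAP': 'As Soon As Possible',
    'IDK': "I Don't Know",
}

def chat_conversion_regex(text, mapping=sample_chat_words):
    # Replace each whole word found in the mapping; everything else is kept as-is
    def expand(match):
        word = match.group(0)
        return mapping.get(word.upper(), word)
    return re.sub(r'\b\w+\b', expand, text)

print(chat_conversion_regex('idk, do this work asap!'))
# I Don't Know, do this work As Soon As Possible!
```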

# Incorrect Text Handling

In [None]:
# Correct spelling mistakes with TextBlob (best effort; some words may be left unchanged)
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saameee mannneer'

from textblob import TextBlob
text = TextBlob(incorrect_text)
text.correct().string

'certain conditions during several generations are modified in the saameee manner'

# Stop Word Removal

In [None]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...

In [None]:
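# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words=set(stopwords.words('english'))

def remove_stopwords(text):
  new_text=[]
  for w in text.split():
    if w in stop_words:
      # A blank placeholder is kept, which leaves runs of spaces in the output
      new_text.append(' ')
    else:
      new_text.append(w)
  return ' '.join(new_text)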


In [None]:
remove_stopwords('Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but its not preachy or boring.')
#print(r)

'Probably   all-time favorite movie,   story   selflessness, sacrifice   dedication     noble cause,       preachy   boring.'
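The output above contains runs of spaces because each stop word is replaced by a blank placeholder. A minimal sketch of a variant that simply drops stop words is shown below. `remove_stopwords_compact` is a hypothetical name, and `SAMPLE_STOP_WORDS` is a tiny hand-written set (an assumption made so the sketch runs without downloading NLTK data); in the notebook it would be `set(stopwords.words('english'))`. Unlike the original, the comparison is case-insensitive.

```python
# Assumption: a tiny hand-written stop word set, used only for illustration
SAMPLE_STOP_WORDS = {'my', 'a', 'of', 'and', 'to', 'but', 'its', 'not', 'or'}

def remove_stopwords_compact(text, stop_words=SAMPLE_STOP_WORDS):
    # Keep only tokens that are not stop words; lowercase so 'The' matches 'the'
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

review = ('Probably my all-time favorite movie, a story of selflessness, '
          'sacrifice and dedication to a noble cause, but its not preachy or boring.')
print(remove_stopwords_compact(review))
# Probably all-time favorite movie, story selflessness, sacrifice dedication noble cause, preachy boring.
```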

In [None]:
# Displayed only: the result is not assigned back, so df['review'] is unchanged below
df['review'].apply(remove_stopwords)


Unnamed: 0,review
0,one reviewers mentioned watching ...
1,wonderful little production filming techni...
2,thought wonderful way spend time ...
3,basically theres family little boy jake ...
4,petter matteis love time money visua...
5,probably alltime favorite movie story se...
6,sure would like see resurrection d...
7,show amazing fresh innovative idea 7...
8,encouraged positive comments film ...
9,like original gut wrenching laughter l...


In [None]:
df.head(10)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
5,probably my alltime favorite movie a story of ...,positive
6,i sure would like to see a resurrection of a u...,positive
7,this show was an amazing fresh innovative ide...,negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


# Handling Emojis (😘😍😅)

i) Remove emoji

ii) Emoji into text

In [None]:
# i) Remove emoji

import re
def remove_emoji(text):
  emoji_pattern = re.compile("["
                             "\U0001F600-\U0001F64F"  # emoticons
                             "\U0001F300-\U0001F5FF"  # symbols & pictographs
                             "\U0001F680-\U0001F6FF"  # transport & map symbols
                             "\U0001F1E0-\U0001F1FF"  # flags
                             "\U00002702-\U000027B0"  # other miscellaneous symbols
                             "\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-emoji characters
                             "]+",flags=re.UNICODE)
  # Each run of matched characters is replaced by a single space
  return emoji_pattern.sub(r' ',text)



In [None]:
remove_emoji('loved this movie😘.it was😍😍')

'loved this movie .it was '

In [None]:
remove_emoji("lmao😅")

'lmao '

In [None]:
# ii) Emoji into text
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


In [None]:
import emoji
print(emoji.demojize('python is❤️‍🔥 '))

python is:heart_on_fire: 


In [None]:
print(emoji.demojize('loved the movie😘'))

loved the movie:face_blowing_a_kiss:
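Where installing the `emoji` package is not an option, a rough standard-library fallback is to substitute each symbol with its Unicode character name. This is only a sketch under stated assumptions: `symbols_to_names` is a hypothetical helper, it treats every code point in the "Symbol, other" (`So`) category as an emoji, it works one code point at a time (so joined sequences are not merged into one name), and the Unicode names it produces can differ from the names `emoji.demojize` returns.

```python
import unicodedata

def symbols_to_names(text):
    # Replace each 'So' (Symbol, other) code point with :its_unicode_name:
    out = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            name = unicodedata.name(ch, 'unknown symbol')
            out.append(':' + name.lower().replace(' ', '_') + ':')
        else:
            out.append(ch)
    return ''.join(out)

print(symbols_to_names('loved the movie😘'))
```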


# Tokenization

i) *Using the split function*

ii) *Using regular expression*

iii) *Using NLTK*

iv) *Using spaCy*


**LIBRARIES**

* nltk

* spacy

## i) Using the split function

In [None]:
# Sentence tokenization

text="i am going to japan.i will stay there for 3 days.lets hope the trip to be great"
text.split('.')

['i am going to japan',
 'i will stay there for 3 days',
 'lets hope the trip to be great']

In [None]:
#word tokenization
word1='iam going to japan'
word1.split()

['iam', 'going', 'to', 'japan']

## ii) Using regular expression

In [None]:
#word
import re
sen1='iam going to visit tokyo.i love japan'
token=re.findall(r'[\w]+',sen1)
print(token)

['iam', 'going', 'to', 'visit', 'tokyo', 'i', 'love', 'japan']


In [None]:
#sentence
sen2="""Japan is very famous for its unique blend of ancient traditions and modern innovations.
      Iconic elements include the stunning cherry blossoms, traditional tea ceremonies, samurai culture, and advanced technology such as bullet trains and robotics.
      Japan's anime and manga have also achieved worldwide popularity."""
token2 = re.compile('[.!?]').split(sen2)
print(token2)

['Japan is very famous for its unique blend of ancient traditions and modern innovations', '\n      Iconic elements include the stunning cherry blossoms, traditional tea ceremonies, samurai culture, and advanced technology such as bullet trains and robotics', "\n      Japan's anime and manga have also achieved worldwide popularity", '']
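The split above drops the sentence-ending punctuation, keeps the leading newline and indentation, and returns a trailing empty string. A minimal sketch of a variant that avoids these effects by splitting on the whitespace that follows `.`, `!` or `?`; `split_sentences` is a hypothetical helper and the sample text is shortened for illustration.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ., ! or ?; the punctuation stays with its sentence
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = """Japan is very famous.
      Iconic elements include cherry blossoms!   Is anime popular worldwide?"""
print(split_sentences(sample))
# ['Japan is very famous.', 'Iconic elements include cherry blossoms!', 'Is anime popular worldwide?']
```

This is still a heuristic: abbreviations such as "Dr." trigger a split, and sentences written without a space after the period are not separated.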


## iii) Using NLTK

In [None]:
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
sent1 = 'iam going to visit japan'
word_tokenize(sent1)

['iam', 'going', 'to', 'visit', 'japan']

In [None]:
sent2="""Japan is very famous for its unique blend of ancient traditions and modern innovations.
      Iconic elements include the stunning cherry blossoms, traditional tea ceremonies, samurai culture, and advanced technology such as bullet trains and robotics.
      Japan's anime and manga have also achieved worldwide popularity."""
sent_tokenize(sent2)

['Japan is very famous for its unique blend of ancient traditions and modern innovations.',
 'Iconic elements include the stunning cherry blossoms, traditional tea ceremonies, samurai culture, and advanced technology such as bullet trains and robotics.',
 "Japan's anime and manga have also achieved worldwide popularity."]

## iv) Using spaCy

In [None]:
import spacy
nlp=spacy.load('en_core_web_sm')
doc=nlp(sent1)
doc

iam going to visit japan

In [None]:
doc2 = nlp(sent2)
doc2

Japan is very famous for its unique blend of ancient traditions and modern innovations.
      Iconic elements include the stunning cherry blossoms, traditional tea ceremonies, samurai culture, and advanced technology such as bullet trains and robotics.
      Japan's anime and manga have also achieved worldwide popularity.

# Text Normalization

## Stemmer

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
ps=PorterStemmer()
def st_words(text):
  return " ".join(ps.stem(word) for word in text.split())

In [None]:
eg='walk walks walked walking'
st_words(eg)

'walk walk walk walk'

In [None]:
eg2='Japan is very famous for its unique blend of ancient traditions and modern innovations.Japans anime and manga have also achieved worldwide popularity.'
st_words(eg2)

'japan is veri famou for it uniqu blend of ancient tradit and modern innovations.japan anim and manga have also achiev worldwid popularity.'

## Lemmatization

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
word_lemma = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = set("?:!.;")
# Filter into a new list; removing items from a list while iterating over it can skip elements
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
  # pos='v' treats every token as a verb, so nouns such as 'hours' stay unchanged
  print("{0:20}{1:20}".format(word,word_lemma.lemmatize(word,pos='v')))


Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## APPLICATIONS

* Sentiment Analysis

* Chatbot Development

* Text Classification and Information Retrieval

* Machine Translation and Summarization

## CONCLUSION

Effective text preprocessing is the foundation of any successful NLP project. This project aims to design and implement a robust, reusable, and efficient text preprocessing pipeline that can serve as a core component for various natural language understanding tasks, ultimately improving the reliability and interpretability of NLP models.