

-----

## **Text Pre-Processing**

- **IMDB Dataset of 50K Movie Reviews**

-----

### **Import Libraries**

In [446]:
from textblob import TextBlob
import pandas as pd
import string,time
import emoji
import re

### **Load Dataset**

In [410]:
df = pd.read_csv('/content/IMDB Dataset.csv')

In [411]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### **Shape of the Dataset**

In [412]:
df.shape

(50000, 2)

### **Take Sample of Data**

In [413]:
df = df.sample(1000)

### **1. Lowercase the data**

In [414]:
df['review'] = df['review'].str.lower()

In [415]:
df.head()

Unnamed: 0,review,sentiment
46826,it really boggles my mind when someone comes a...,negative
30042,i used to love sabrina the teenage witch and h...,positive
18753,this film takes what could have been a good id...,negative
7971,"after viewing ""still life"", a short film direc...",negative
27805,my girlfriend wanted to see this (lol this is ...,negative


### **2. Remove HTML Tags**

In [416]:
import re  # Import the regular expression module

def remove_html_tags(text):
    # Compile a regular expression pattern that matches HTML tags
    pattern = re.compile('<.*?>')

    # Substitute the matched HTML tags with an empty string and return the result
    return pattern.sub(r'', text)

In [417]:
# take an example to use this function

rev = "A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-styled. <br /><br />T"

In [418]:
cleaned_text = remove_html_tags(rev)
print(cleaned_text)

A wonderful little production. The filming technique is very unassuming- very old-time-styled. T


In [419]:
df['review'] = df['review'].apply(remove_html_tags) # this will remove html tags from review column

In [420]:
df.head()  # the reviews are now lowercase and the HTML tags are removed

Unnamed: 0,review,sentiment
46826,it really boggles my mind when someone comes a...,negative
30042,i used to love sabrina the teenage witch and h...,positive
18753,this film takes what could have been a good id...,negative
7971,"after viewing ""still life"", a short film direc...",negative
27805,my girlfriend wanted to see this (lol this is ...,negative
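
Deleting a tag outright can glue neighbouring words together when the tag sits directly between them (for example, `great.<br />the` becomes `great.the`). HTML entities such as `&amp;` are not tags, so the pattern above leaves them in place. The following is a minimal sketch of one way to handle both, assuming it is acceptable to replace each tag with a space and collapse the whitespace afterwards. `strip_html` is a hypothetical helper name, not part of the original notebook.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    # Replace each tag with a space so adjacent words are not glued together
    no_tags = TAG_PATTERN.sub(" ", text)
    # Decode entities after tag removal, so an escaped "&lt;b&gt;" is not treated as a tag
    unescaped = html.unescape(no_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return " ".join(unescaped.split())

print(strip_html("great film.<br /><br />tom &amp; jerry"))
# great film. tom & jerry
```

It could be applied the same way as above, with `df['review'].apply(strip_html)`.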


### **3. Remove URLs**

In [421]:
import re  # Import the regular expressions module

def remove_url(text):
    # Compile a regular expression pattern to match URLs
    # The pattern matches both 'http' and 'https' URLs, as well as those starting with 'www.'
    pattern = re.compile(r'https?://\S+|www\.\S+')

    # Substitute all occurrences of the pattern in the text with an empty string
    # This effectively removes the URLs from the input text
    return pattern.sub(r'', text)

In [422]:
df['review'] = df['review'].apply(remove_url)

In [423]:
df.head()

Unnamed: 0,review,sentiment
46826,it really boggles my mind when someone comes a...,negative
30042,i used to love sabrina the teenage witch and h...,positive
18753,this film takes what could have been a good id...,negative
7971,"after viewing ""still life"", a short film direc...",negative
27805,my girlfriend wanted to see this (lol this is ...,negative
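
The sampled rows shown above contain no visible URLs, so the effect of `remove_url` is easier to see on a hand-written string. The sketch below repeats the same pattern on two made-up sentences (the example.com addresses are placeholders). It also shows one limitation: `\S+` consumes everything up to the next whitespace, including a closing bracket that follows the URL.

```python
import re

# Same pattern as in remove_url above
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_url(text):
    # Replace every match with an empty string
    return URL_PATTERN.sub("", text)

print(repr(remove_url("trailer at https://example.com/watch?v=1 and www.example.org today")))
# 'trailer at  and  today'

# The closing parenthesis is swallowed together with the URL
print(repr(remove_url("see the site (https://example.com) for details")))
# 'see the site ( for details'
```

The doubled spaces left behind are harmless here because the later steps split on whitespace.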


### **4. Remove Punctuation**

In [424]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [425]:
punctuations = string.punctuation

In [426]:
def remove_punctuations(text):
    # Iterate over each character in the 'punctuations' string
    for punctuation in punctuations:
        # Replace occurrences of the current punctuation character in 'text' with an empty string
        text = text.replace(punctuation, '')
    # Return the modified text with all specified punctuation removed
    return text

In [427]:
df['review'] = df['review'].apply(remove_punctuations) # apply remove punctuations

In [428]:
df.head()

Unnamed: 0,review,sentiment
46826,it really boggles my mind when someone comes a...,negative
30042,i used to love sabrina the teenage witch and h...,positive
18753,this film takes what could have been a good id...,negative
7971,after viewing still life a short film directed...,negative
27805,my girlfriend wanted to see this lol this is t...,negative
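
The loop above scans the text once per punctuation character. A common alternative is a single pass with `str.translate`. The sketch below assumes the same `string.punctuation` set and gives the same result. `remove_punctuations_fast` is a hypothetical name used only for this illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations_fast(text):
    # One pass over the text instead of one pass per punctuation character
    return text.translate(PUNCT_TABLE)

print(remove_punctuations_fast('after viewing "still life", a short film!'))
# after viewing still life a short film
```

Either version also strips apostrophes, so contractions such as `don't` become `dont`.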


### **5. Handle Chat Conversations**

- People often use abbreviations in chat-style text instead of complete words. We can expand them as well. Here are a few examples:

In [429]:
chat_words = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'FYI': 'For Your Information',
    'BRB': 'Be Right Back',
    'BTW': 'By The Way',
    'OMG': 'Oh My God',
    'IMO': 'In My Opinion',
    'LOL': 'Laugh Out Loud',
    'TTYL': 'Talk To You Later',
    'GTG': 'Got To Go',
    'TTYT': 'Talk To You Tomorrow',
    'IDK': "I Don't Know",
    'TMI': 'Too Much Information',
    'IMHO': 'In My Humble Opinion',
    'ICYMI': 'In Case You Missed It',
    'FAQ': 'Frequently Asked Questions',
    'TGIF': "Thank God It's Friday",
    'FYA': 'For Your Action',
}

In [430]:
def chat_conversations(text):
    # Initialize an empty list to store the processed words
    new_text = []

    # Split the input text into individual words
    for w in text.split():
        # Check if the uppercase version of the word is in the chat_words dictionary
        if w.upper() in chat_words:
            # If it is, append the corresponding full form to the new_text list
            new_text.append(chat_words[w.upper()])
        else:
            # If not, append the original word to the new_text list
            new_text.append(w)

    # Join the processed words back into a single string and return it
    return " ".join(new_text)

In [431]:
df['review'] = df['review'].apply(chat_conversations)

In [432]:
df.head()

Unnamed: 0,review,sentiment
46826,it really boggles my mind when someone comes a...,negative
30042,i used to love sabrina the teenage witch and h...,positive
18753,this film takes what could have been a good id...,negative
7971,after viewing still life a short film directed...,negative
27805,my girlfriend wanted to see this Laugh Out Lou...,negative
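
The expansion puts capital letters back into text that was lowercased in step 1, as the `Laugh Out Lou...` row above shows. If consistent lowercase matters for later, case-sensitive steps, one option is to lowercase the expansion as it is inserted. The sketch below assumes that behaviour is wanted. `CHAT_WORDS` is a small subset of the dictionary above and `expand_chat_words` is a hypothetical helper name.

```python
# Small subset of the chat_words dictionary, for illustration only
CHAT_WORDS = {
    "LOL": "Laugh Out Loud",
    "BRB": "Be Right Back",
    "IMO": "In My Opinion",
}

def expand_chat_words(text, mapping=CHAT_WORDS):
    expanded = []
    for word in text.split():
        # Look up the uppercase form; None means the word is not an abbreviation
        full_form = mapping.get(word.upper())
        # Lowercase the expansion so the text stays consistently lowercase
        expanded.append(full_form.lower() if full_form else word)
    return " ".join(expanded)

print(expand_chat_words("lol this is the best film imo"))
# laugh out loud this is the best film in my opinion
```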


### **6. Handle Incorrect Text (Spelling Correction)**

In [433]:
from textblob import TextBlob  # Import the TextBlob library for text processing

def correct_spelling(text):
    # Create a TextBlob object from the input text, which allows for text analysis and correction
    textBlb = TextBlob(text)

    # correct() returns a TextBlob object, so convert it back to a plain string
    # to keep the 'review' column as str for the later steps
    return str(textBlb.correct())

In [434]:
# df['review'] = df['review'].apply(correct_spelling) # spelling correction takes some time
# df.head()

### **7. Remove Stop Words**

Removing stopwords is a common practice in natural language processing (NLP) for several reasons:

1. **Reduced Noise**: Stopwords (like "and," "the," "is," etc.) carry little semantic meaning. Removing them helps reduce noise in the data, allowing algorithms to focus on more meaningful words.

2. **Improved Efficiency**: Reducing the size of the dataset by removing stopwords can speed up processing times for algorithms, especially in tasks like text classification or clustering.

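3. **Enhanced Feature Extraction**: In tasks such as topic modeling, focusing on content-bearing words can lead to better feature extraction and improved model performance. For sentiment analysis the benefit is less clear-cut, because stopword lists such as NLTK's include negations like "not".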

4. **Better Representation**: Removing stopwords can lead to a more accurate representation of the underlying themes and sentiments in the text, as it highlights the more relevant content.

5. **Dimensionality Reduction**: In vector space models, removing stopwords helps reduce the dimensionality of the feature space, which can improve the performance of machine learning models.

Overall, removing stopwords helps to streamline text data, making it more manageable and relevant for analysis.
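
The dimensionality point can be illustrated on a toy example. The sketch below uses a tiny hand-written stopword set and two made-up reviews, both assumptions for illustration (NLTK's English list is much longer). It counts the distinct words before and after filtering. The second review also shows a side effect: dropping "not", which NLTK's English list contains too, turns "not good" into "good".

```python
# Tiny illustrative stopword set (an assumption; not NLTK's list)
STOP_WORDS = {"the", "is", "a", "and", "this", "was", "not", "of"}

def remove_stopwords_demo(text):
    # Keep only the words that are not in the stopword set
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

reviews = [
    "the movie is a masterpiece",
    "this movie was not good and the plot is thin",
]

# Vocabulary = set of distinct words across all reviews
vocab_before = {word for review in reviews for word in review.split()}
cleaned = [remove_stopwords_demo(review) for review in reviews]
vocab_after = {word for review in cleaned for word in review.split()}

print(len(vocab_before), len(vocab_after))  # 12 5
print(cleaned[1])  # movie good plot thin
```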

In [435]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [436]:
# stopwords.words('english') # to check the English stopwords from the NLTK library

In [437]:
from nltk.corpus import stopwords  # Import the stopwords list from the NLTK library

# Build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split the input text into words and keep only those that are not stopwords.
    # Dropping stopwords (instead of appending empty strings) avoids leftover double spaces.
    return " ".join(word for word in text.split() if word not in stop_words)

In [438]:
df['review'] = df['review'].apply(remove_stopwords)
df.head()

Unnamed: 0,review,sentiment
46826,really boggles mind someone comes across m...,negative
30042,used love sabrina teenage witch seen ever...,positive
18753,film takes could good idea mummified 200...,negative
7971,viewing still life short film directed jon ...,negative
27805,girlfriend wanted see Laugh Out Loud cas...,negative


### **8. Handle Emojis**

In [439]:
import re  # Import the regular expression module

def remove_emoji(text):
    # Define a regex pattern to match various emoji ranges
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # Emoticons (smiley faces)
        u"\U0001F300-\U0001F5FF"  # Symbols and pictographs
        u"\U0001F680-\U0001F6FF"  # Transport and map symbols
        u"\U0001F1E0-\U0001F1FF"  # Regional indicator symbols (flags)
        u"\U00002702-\U000027B0"  # Dingbats
        u"\U000024C2-\U0001F251"  # Enclosed characters; very broad range that also covers many non-emoji characters (e.g. CJK text)
        "]+", flags=re.UNICODE)  # re.UNICODE is already the default for str patterns in Python 3

    # Substitute any matched emoji in the text with an empty string
    return emoji_pattern.sub(r'', text)  # Return the cleaned text without emojis

In [441]:
# let's see an example

remove_emoji("I love nlp ❤")

'I love nlp '

In [442]:
df['review'] = df['review'].apply(remove_emoji)
df.head()

Unnamed: 0,review,sentiment
46826,really boggles mind someone comes across m...,negative
30042,used love sabrina teenage witch seen ever...,positive
18753,film takes could good idea mummified 200...,negative
7971,viewing still life short film directed jon ...,negative
27805,girlfriend wanted see Laugh Out Loud cas...,negative
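
The last range in `remove_emoji`, `\U000024C2-\U0001F251`, spans far more than emoji. It also covers CJK ideographs, so non-English text would be deleted too. The sketch below compares that range with a narrower pattern that keeps only the other five ranges. `NARROW_EMOJI_PATTERN` and `BROAD_RANGE` are hypothetical names, and the narrower pattern is an assumption about what is wanted (it will miss emoji that fall outside the listed blocks).

```python
import re

# Same ranges as remove_emoji above, minus the broad \U000024C2-\U0001F251 range
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Symbols and pictographs
    "\U0001F680-\U0001F6FF"  # Transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # Regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # Dingbats
    "]+"
)

# The broad range on its own, to show what else it matches
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

# "great", two CJK characters (U+6620, U+753B), and a heart (U+2764)
sample = "great \u6620\u753b \u2764"

# True: the narrow pattern removes the heart but keeps the CJK text
print("\u6620\u753b" in NARROW_EMOJI_PATTERN.sub("", sample))
# False: the broad range removes the CJK text as well
print("\u6620\u753b" in BROAD_RANGE.sub("", sample))
```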


#### **If We want to keep emoji:**

In [445]:
import emoji

In [447]:
print(emoji.demojize('I love nlp ❤'))

I love nlp :red_heart:


### **9. Tokenization**

#### **1. Word level Tokenization**

In [448]:
sent1 = "I am learning NLP"
sent1.split()

['I', 'am', 'learning', 'NLP']

#### **2. Sentence Level Tokenization**

In [449]:
sent2 = "We need to understand nlp. It is crucial to understand how llms work. If we understand how llms work then we can find out how gen Ai Works."
sent2.split('.')

['We need to understand nlp',
 ' It is crucial to understand how llms work',
 ' If we understand how llms work then we can find out how gen Ai Works',
 '']
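
The raw `split('.')` result keeps leading spaces and ends with an empty string. A minimal cleanup sketch follows; `split_sentences_naive` is a hypothetical helper name. The second call shows why splitting on a bare period is still fragile: abbreviations such as "Dr." are treated as sentence ends.

```python
def split_sentences_naive(text):
    # Split on '.', strip surrounding whitespace, and drop empty pieces
    return [part.strip() for part in text.split(".") if part.strip()]

print(split_sentences_naive("We need to understand nlp. It is crucial."))
# ['We need to understand nlp', 'It is crucial']

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(split_sentences_naive("Dr. Smith liked the film. I did not."))
# ['Dr', 'Smith liked the film', 'I did not']
```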

#### **3. With Regular Expression**

In [450]:
import re
sent3 = 'I am going to Lahore!'
tokens = re.findall(r"[\w']+", sent3)  # raw string, so \w is not treated as a string escape
tokens

['I', 'am', 'going', 'to', 'Lahore']
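
The character class decides what counts as part of a token. The sketch below runs the same `[\w']+` pattern on a made-up sentence with a contraction and a hyphenated word. It then shows a variant that also allows hyphens, an assumption about what might be wanted for words like "old-time".

```python
import re

# Same pattern as above: word characters and apostrophes
TOKEN_PATTERN = re.compile(r"[\w']+")
# Variant that also keeps hyphens inside tokens
HYPHEN_TOKEN_PATTERN = re.compile(r"[\w'-]+")

sentence = "I don't like old-time films!"

print(TOKEN_PATTERN.findall(sentence))
# ['I', "don't", 'like', 'old', 'time', 'films']

print(HYPHEN_TOKEN_PATTERN.findall(sentence))
# ['I', "don't", 'like', 'old-time', 'films']
```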

In [451]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
# Split on whitespace that follows '.', '!' or '?'; the lookbehind keeps the punctuation,
# and \s+ also matches the newline after 'industry?' (a literal space alone would not)
sentences = re.split(r'(?<=[.!?])\s+', text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

#### **4. Using NLTK**

In [453]:
import nltk  # Import the Natural Language Toolkit (NLTK) library

# Import specific functions for tokenizing text into words and sentences
from nltk.tokenize import word_tokenize, sent_tokenize

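# Download the 'punkt' tokenizer models, which are necessary for sentence and word tokenization
# (recent NLTK releases look for the 'punkt_tab' resource instead; download that one if 'punkt' is reported missing)
nltk.download('punkt')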

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [454]:
sent1 = "I am learning NLP"
word_tokenize(sent1)

['I', 'am', 'learning', 'NLP']

In [455]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

#### **5. Using spaCy**

In [456]:
import spacy  # Import the spaCy library

# Load the English language model for spaCy
nlp = spacy.load('en_core_web_sm')

In [457]:
sent1 = "I am learning NLP"
result = nlp(sent1)
print(result)
for token in result:
    print(token)

I am learning NLP
I
am
learning
NLP


### **10. Stemming**

In [458]:
from nltk.stem.porter import PorterStemmer  # Import the Porter stemming algorithm from NLTK

In [459]:
ps = PorterStemmer()  # Create an instance of the Porter stemming algorithm
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])  # Apply stemming to each word in the text and join them back

In [460]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [462]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'
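
Stems such as `movi` and `thi` in the output above are not dictionary words because a stemmer strips suffixes by rule, without looking anything up. The sketch below is a deliberately crude suffix stripper that shows the idea. It is not the Porter algorithm, which applies many more ordered rules and conditions. `SUFFIXES`, `naive_stem` and `naive_stem_text` are hypothetical names.

```python
# Suffixes are checked in this order; the first match is stripped
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_stem_text(text):
    # Stem each whitespace-separated word and join the results
    return " ".join(naive_stem(word) for word in text.split())

print(naive_stem_text("walk walks walking walked"))  # walk walk walk walk
print(naive_stem_text("running this"))  # runn thi
```

The second call shows how purely mechanical rules produce non-words.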

### **11. Lemmatization**

In [463]:
import nltk  # Import the Natural Language Toolkit (nltk) for natural language processing
from nltk.stem import WordNetLemmatizer  # Import the WordNetLemmatizer for lemmatization

# Download the WordNet and Open Multilingual WordNet resources
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Define a sample sentence for processing
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

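# Define the punctuation tokens to remove from the sentence
# (a separate name, so the 'punctuations' string used by remove_punctuations is not overwritten)
punctuation_marks = set("?:!.,;")

# Tokenize the sentence into words
sentence_words = nltk.word_tokenize(sentence)

# Drop punctuation tokens by building a new list;
# removing items from a list while iterating over it can skip elements
sentence_words = [word for word in sentence_words if word not in punctuation_marks]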

# Output the words after punctuation removal
print("{0:20}{1:20}".format("Word", "Lemma"))  # Print table header

# Loop through the cleaned list of words and print each word with its lemma
for word in sentence_words:
    # Lemmatize the word with the part-of-speech tag set to 'v' (verb)
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


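- **NOTE: Stemming and lemmatization both reduce words to a root form. Lemmatization generally gives better results because it returns valid dictionary words, but it is slower. Stemming is faster but can produce non-words (for example, "movi" for "movie").**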
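
Lemmatization quality also depends on looking each word up with the right part of speech. In the table above every word was looked up as a verb (`pos='v'`), which is why `hours` comes back unchanged. The sketch below imitates that behaviour with a toy lookup table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical stand-ins for illustration, not how `WordNetLemmatizer` is implemented.

```python
# Toy lemma table keyed by (word, part of speech); a stand-in for a real dictionary
LEMMA_TABLE = {
    ("was", "v"): "be",
    ("running", "v"): "run",
    ("has", "v"): "have",
    ("hours", "n"): "hour",
}

def toy_lemmatize(word, pos):
    # Fall back to the unchanged word when the (word, pos) pair is not in the table
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("hours", "v"))    # hours
print(toy_lemmatize("hours", "n"))    # hour
```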

-----