## Text Preprocessing Basics 

- Lower Casing
- Removing HTML Tags
- Removing URLs
- Removing Punctuation
- Chat words treatment
- Spelling Correction
- Removing Stopwords
- Handling Emojis
- Tokenization
- Stemming
- Lemmatization 

In [1]:
# importing numpy and pandas
import pandas as pd
import numpy as np

In [2]:
# Data set Link 
# https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [3]:
# Reading the CSV file and storing the data in a DataFrame 'df'
df = pd.read_csv("../archive/IMDB Dataset.csv")

In [4]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


## Lowercasing 

In [5]:
# Accessing the 'review' column of the DataFrame 'df' and converting the text 
# in the fourth row (index 3) to lowercase

lowercase_review = df['review'][3].lower()
lowercase_review

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
# Converting all the text in the 'review' column of the DataFrame 'df' to lowercase
df['review'] = df['review'].str.lower()

In [7]:
df.sample(5)

Unnamed: 0,review,sentiment
1547,the comments already left for this show are wa...,negative
46095,i saw this film at temple university. i cannot...,negative
22827,"simple, meaningful and delivers an emotional p...",positive
31455,this film is terrible. the story concerns a wo...,negative
24724,"since i am a fan of natalie portman, i had to ...",positive


## Remove HTML Tags 

In [8]:
import re

def remove_html_tags(text):
    # Define the regular expression pattern to match HTML tags
    pattern = re.compile('<.*?>')
    
    # Use the 'sub' method to replace all occurrences of HTML tags with 
    # an empty string
    # This effectively removes all HTML tags from the 'text' input
    return pattern.sub(r'', text)


In [9]:
# Removing HTML tags from the 'review' text in the fourth row (index 3) of the DataFrame 'df'
cleaned_review = remove_html_tags(df['review'][3])
print(cleaned_review[:100])

basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his par


In [10]:
# Applying the 'remove_html_tags()' function to the 'review' column 
# This will remove HTML tags from all the reviews in the 'review' column
df['review'] = df['review'].apply(remove_html_tags)

In [11]:
df.sample(5)

Unnamed: 0,review,sentiment
12161,"ok, so, chuck norris somehow found a way to ge...",negative
28835,faithful to the work of pearl s. buck whose ye...,positive
32826,this excruciatingly boring and unfunny movie m...,negative
33854,well another shootem up. typical run around fi...,negative
44569,poor michael madsen; he must be kicking himsel...,negative


## Remove URLs

In [12]:
import re

def remove_url(text):
    # Define the regular expression pattern to match URLs
    pattern = re.compile(r'https?://\S+|www\.\S+')
    
    # Use the 'sub' method to replace all occurrences of URLs with an empty string
    # This effectively removes all URLs from the 'text' input
    return pattern.sub(r'', text)

In [13]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook'
text3 = 'Check out my notebook www.kaggle.com'
text4 = 'Check out my notebook https://www.kaggle.com/campusx/notebook and also www.google.com'

In [14]:
for text in [text1,text2,text3,text4]:
    print(remove_url(text))

Check out my notebook 
Check out my notebook 
Check out my notebook 
Check out my notebook  and also 


## Remove Punctuation 

In [15]:
import string
import time

# The 'string.punctuation' attribute contains all punctuation characters
# It includes symbols like !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
punctuation_chars = string.punctuation
punctuation_chars

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
exclude = string.punctuation

def remove_punc(text):
    # Iterate through each character in 'exclude'
    for char in exclude:
        # Replace the current character with an empty string in 'text'
        text = text.replace(char, '')
    
    # Return the 'text' with all punctuation characters removed
    return text

In [17]:
text = 'string . with ! #  punctuation ? '

In [18]:
# Record the start time before calling the function
start = time.time()
# Call the 'remove_punc' function to remove punctuation from the 'text'
print(remove_punc(text))
# Calculate the time taken to execute the 'remove_punc' function
time1 = time.time() - start
# Print the time taken in seconds
print(time1)

string  with    punctuation  
0.00018095970153808594


In [19]:
# The loop makes one pass over the text per punctuation character, which becomes slow on a large dataset

In [20]:
def remove_punc1(text):
    return text.translate(str.maketrans('','',exclude))

In [21]:
start = time.time()
print(remove_punc1(text))
time2 = time.time() - start
print(time2)

string  with    punctuation  
0.00034427642822265625


In [22]:
time1/time2  # ratio of loop time to translate time; below 1 means the loop was faster in this single run

0.525623268698061
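A single `time.time()` measurement on a short string is dominated by noise (in the run above the `translate` call actually took longer). A steadier comparison, sketched below, repeats each call many times with `timeit`. The enlarged sample text and the repetition count are arbitrary assumptions, and no particular speed-up is implied.

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call
table = str.maketrans('', '', exclude)

def remove_punc_loop(text):
    # One str.replace pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

def remove_punc_translate(text):
    # Single pass over the text using the prebuilt table
    return text.translate(table)

# Assumed test input: the short sample string repeated to a more realistic size
sample = 'string . with ! #  punctuation ? ' * 1000

loop_time = timeit.timeit(lambda: remove_punc_loop(sample), number=100)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=100)
print(f'loop: {loop_time:.4f}s, translate: {translate_time:.4f}s')
```

Both functions return the same string, so whichever is faster on your data can be passed to `df['review'].apply(...)`.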

In [23]:
# Applying the 'remove_punc1' function to the 'review' column of the DataFrame 'df'
# This will remove punctuation from all the reviews in the 'review' column
df['review'] = df['review'].apply(remove_punc1)

In [24]:
df.sample(3)

Unnamed: 0,review,sentiment
6820,am i the only one to notice that the realism o...,negative
8758,scoop is also the name of a latethirties evely...,positive
34339,in sri lanka a country divided by religion and...,positive


## Chat word Treatment 

In [25]:
# Slang words 
# https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt

In [26]:
import requests
url = 'https://raw.githubusercontent.com/rishabhverma17/sms_slang_translator/master/slang.txt'
page = requests.get(url)
chat_words = page.text

In [27]:
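chat_words = chat_words.split('\n')
chat_words_dict = {}
for line in chat_words:
    # Each line has the form 'ABBR=Expansion'
    key_n_val = line.split('=')
    # Skip blank or malformed lines that contain no '='
    if len(key_n_val) >= 2:
        chat_words_dict[key_n_val[0]] = key_n_val[1]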

In [28]:
chat_words_dict

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [29]:
def chat_conversation(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_dict:
            new_text.append(chat_words_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [30]:
chat_conversation('IMHO he is good')

'In My Honest/Humble Opinion he is good'

In [31]:
chat_conversation('FYI he is good, hes G9')

'For Your Information he is good, hes Genius'
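Because the function splits on whitespace, a token with punctuation attached (such as `asap!`) does not match any dictionary key. One possible workaround, sketched here, strips trailing punctuation before the lookup and re-attaches it afterwards. The name `expand_chat_words` and the three-entry `chat_words_sample` mapping are hypothetical stand-ins so the sketch runs offline.

```python
import string

# Small excerpt of the slang mapping, hard-coded so the sketch runs offline
chat_words_sample = {
    'FYI': 'For Your Information',
    'IMHO': 'In My Honest/Humble Opinion',
    'ASAP': 'As Soon As Possible',
}

def expand_chat_words(text, mapping=chat_words_sample):
    new_text = []
    for w in text.split():
        # Separate trailing punctuation so 'asap!' still matches 'ASAP'
        core = w.rstrip(string.punctuation)
        trailing = w[len(core):]
        expanded = mapping.get(core.upper())
        # Keep the original token when there is no expansion
        new_text.append(expanded + trailing if expanded else w)
    return ' '.join(new_text)

print(expand_chat_words('fyi, reply asap!'))
# For Your Information, reply As Soon As Possible!
```

If punctuation has already been removed earlier in the pipeline, the plain whitespace split is enough.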

## Spelling Correction

In [32]:
from textblob import TextBlob

# The text with incorrect spelling
incorrect = 'ceertain conditionas duriing seveal geenarations aree moodified in the samme maner.'

# Create a TextBlob object with the incorrect text
textBlb = TextBlob(incorrect)

# Use the 'correct()' method to correct spelling mistakes in the text
# The 'string' attribute will retrieve the corrected text as a string
corrected_text = textBlb.correct().string

# Print the corrected text
print(corrected_text)

certain conditions during several generations are modified in the same manner.


## Removing Stopwords

In [33]:
from nltk.corpus import stopwords

# Accessing the list of stopwords in the English language
stop_words_english = stopwords.words('english')

# Print the list of stopwords - printing only 20 
print(stop_words_english[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [34]:
# stopwords.words('spanish')

In [35]:
# Build the stopword set once; a set lookup is much faster than scanning the list for every word
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # Create an empty list to store the kept words
    new_text = []
    # Split the input 'text' into individual words and iterate through each word
    for word in text.split():
        if word in english_stopwords:
            # Replace a stopword with an empty string; this keeps its position,
            # so the joined result contains extra spaces
            new_text.append('')
        else:
            # If it is not a stopword, keep the word as is
            new_text.append(word)
    # Join the words back into a single string using spaces and return it
    return " ".join(new_text)

In [36]:
text = "As the gentle breeze rustled through the leaves, the vibrant colors of the autumn foliage danced in harmony, creating a breathtaking tapestry of nature's artistry that captivated the hearts of all who beheld it."
print(remove_stopwords(text))

As  gentle breeze rustled   leaves,  vibrant colors   autumn foliage danced  harmony, creating  breathtaking tapestry  nature's artistry  captivated  hearts    beheld it.


In [37]:
df['review'][:10].apply(remove_stopwords)

0    one    reviewers  mentioned   watching  1 oz e...
1     wonderful little production  filming techniqu...
2     thought    wonderful way  spend time    hot s...
3    basically theres  family   little boy jake thi...
4    petter matteis love   time  money   visually s...
5    probably  alltime favorite movie  story  selfl...
6     sure would like  see  resurrection    dated s...
7     show   amazing fresh innovative idea   70s   ...
8    encouraged   positive comments   film     look...
9      like original gut wrenching laughter   like ...
Name: review, dtype: object
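The outputs above contain runs of extra spaces because every removed stopword is replaced by an empty string before joining. A compact variant, sketched below, filters the words out instead and compares in lowercase, so a capitalized `As` is dropped too. The small `stop_words_sample` set is an assumed stand-in for `stopwords.words('english')` so that the sketch runs without the NLTK corpus.

```python
# Assumed stand-in for set(stopwords.words('english')); use the NLTK list in practice
stop_words_sample = {'the', 'of', 'in', 'a', 'all', 'who', 'it', 'that', 'through', 'as'}

def remove_stopwords_compact(text, stop_words=stop_words_sample):
    # Keep only the words whose lowercase form is not a stopword;
    # dropped words leave no gap behind
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stopwords_compact('As the gentle breeze rustled through the leaves'))
# gentle breeze rustled leaves
```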

## Handling emojis 

### Removing 

In [38]:
import re

def remove_emoji(text):
    # Define the regular expression pattern to match emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U00002702-\U000027B0"  # other miscellaneous symbols
                               u"\U000024C2-\U0001F251"  # enclosed characters (broad range: also matches CJK and other non-emoji characters)
                               "]+", flags=re.UNICODE)
    # Use 're.sub()' to replace all occurrences of emojis with an empty string
    return emoji_pattern.sub(r'', text)
# Test the function with an example text containing emojis
result_text = remove_emoji("😄😄😄😄 hello 😄😄😄😄")
print(result_text)

 hello 


### Replacing 

In [39]:
import emoji

In [40]:
emoji.demojize("😄😄😄😄 hello 😄😄😄😄")

':grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes: hello :grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes::grinning_face_with_smiling_eyes:'

## Tokenization

Tokenization in NLP preprocessing is the process of breaking down a text or a sentence into smaller units called tokens. These tokens are typically words or subwords, and tokenization is a fundamental step in preparing text data for various NLP tasks.

### 1. Using the split function 

In [41]:
# word tokenization 
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [42]:
# sentence tokenization 
sent2 = 'I am going to delhi. I will stay there for 2 days. Lets hope the trip will be good'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 2 days',
 ' Lets hope the trip will be good']

In [43]:
# Problem with split function 
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [44]:
sent4 = 'Where do you think I should go? I have 3 day holiday'
sent4.split('.')

['Where do you think I should go? I have 3 day holiday']

### 2. Using Regular Expressions

In [45]:
import re 
# The sentence to tokenize
sent3 = 'I am going to delhi!'

# Use 're.findall()' to find all word and apostrophe tokens in the sentence
# '\w' matches any word character (letters, digits, or underscore), and "'" matches apostrophes
# '+' matches one or more occurrences of the pattern (one or more word characters and/or apostrophes)
# The result will be a list of all tokens in the sentence
tokens = re.findall(r"[\w']+", sent3)

# Print the list of tokens
print(tokens)

['I', 'am', 'going', 'to', 'delhi']


In [46]:
text = '''Lorem ipsum is simply dummy text of the printing and typesetting industry?
 Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
 when an unknown printer took a gallery of type and scrambled it to make a type specimen book.'''

sentences = re.compile('[.!?]').split(text)
sentences

['Lorem ipsum is simply dummy text of the printing and typesetting industry',
 "\n Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\n when an unknown printer took a gallery of type and scrambled it to make a type specimen book",
 '']
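Splitting on `[.!?]` discards the terminating punctuation and leaves leading whitespace and an empty trailing string. A possible refinement, sketched below, splits on the whitespace that follows a terminator by using a lookbehind. The helper name `split_sentences` is hypothetical, and the pattern still breaks on abbreviations such as `Dr. Smith`.

```python
import re

def split_sentences(text):
    # Split on whitespace that follows '.', '!' or '?', keeping the terminator
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop any empty pieces
    return [p for p in parts if p]

print(split_sentences('Where do you think I should go? I have 3 day holiday'))
# ['Where do you think I should go?', 'I have 3 day holiday']
```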

### 3. NLTK 

In [47]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [48]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [49]:
text = '''Lorem ipsum is simply dummy text of the printing and typesetting industry?
 Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
 when an unknown printer took a gallery of type and scrambled it to make a type specimen book.'''
sent_tokenize(text)

['Lorem ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\n when an unknown printer took a gallery of type and scrambled it to make a type specimen book."]

In [50]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.58'


In [51]:
for sent in [sent5,sent6,sent7]:
    print(word_tokenize(sent))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.58']


### 4. spaCy (BEST)

In [52]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [53]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
# doc4 = nlp(sent8)

In [54]:
for token in doc1:
    print(token)

I
have
a
Ph
.
D
in
A.I


## Stemming 

In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.

e.g. walk: walk, walking, walked, walker, walks

_Stemming_ is the process of reducing inflected words to their root form, so that a group of related words maps to the same stem even if the stem itself is not a valid word in the language.

- It is mostly used in information retrieval systems (e.g. Google Search)


Stemmer: an algorithm that performs stemming

- e.g.:
    - Porter stemmer (for English)
    - Snowball stemmer (for English and several other languages)
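To make the idea of a stem concrete, here is a toy suffix stripper. It is only an illustration: the function `toy_stem` and its four suffixes are assumptions made for this sketch, and the real Porter algorithm applies many more rules and conditions.

```python
def toy_stem(word):
    # Illustrative suffix stripping only, not the Porter algorithm
    word = word.lower()
    for suffix in ('ing', 'ed', 'es', 's'):
        # Require at least three remaining characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['walk', 'walks', 'walking', 'walked', 'running']:
    print(w, '->', toy_stem(w))
# walk -> walk
# walks -> walk
# walking -> walk
# walked -> walk
# running -> runn
```

The last line shows a stem (`runn`) that is not a valid English word, which is acceptable for stemming as long as related forms map to the same string.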

In [55]:
from nltk.stem.porter import PorterStemmer

# Create a Porter Stemmer object
ps = PorterStemmer()

def stem_words(text):
    # Split the input 'text' into individual words and apply stemming to each word
    # Join the stemmed words back into a single string using spaces and return it
    return ' '.join([ps.stem(word) for word in text.split()])

In [56]:
sample = 'walk walks walking walked'
stem_words(sample)

'walk walk walk walk'

In [57]:
text = "The quick brown fox jumps over the lazy dog. The dogs were barking loudly, but the fox didn't seem to care. It continued to run through the fields, chasing after its prey. The fox's agility and speed were unmatched, making it a formidable hunter in the animal kingdom."

In [58]:
stem_words(text)

"the quick brown fox jump over the lazi dog. the dog were bark loudly, but the fox didn't seem to care. it continu to run through the fields, chase after it prey. the fox' agil and speed were unmatched, make it a formid hunter in the anim kingdom."

## Lemmatization

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

- Almost the same as stemming, but the root word here is a valid word
- It takes a little longer than stemming

- If we don't have to show the output to the user, we can use stemming
- Otherwise we can use lemmatization

#### Lemmatization is done using a lexical dictionary instead of an algorithm.
The WordNet lexical dictionary is used here
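The dictionary-lookup idea can be sketched with a miniature lexicon keyed by word and part of speech. The `toy_lexicon` entries and the `toy_lemmatize` helper are hypothetical. WordNet plays the lexicon's role in NLTK and is far larger. The sketch only shows why the part-of-speech argument changes the result.

```python
# Hypothetical miniature lexicon; WordNet plays this role in the NLTK lemmatizer
toy_lexicon = {
    ('was', 'v'): 'be',
    ('has', 'v'): 'have',
    ('running', 'v'): 'run',
    ('swimming', 'v'): 'swim',
    ('hours', 'n'): 'hour',
}

def toy_lemmatize(word, pos='n'):
    # Look up the (word, part-of-speech) pair; unknown pairs are returned unchanged
    return toy_lexicon.get((word.lower(), pos), word)

print(toy_lemmatize('running', pos='v'))  # run
print(toy_lemmatize('running'))           # running (no noun entry in the toy lexicon)
print(toy_lemmatize('hours'))             # hour
```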

In [59]:
import nltk
from nltk.stem import WordNetLemmatizer
# Initialize the WordNet Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
sentence = '''He was running and eating at the same time. He has a bad habit of swimming after playing 
long hours in the Sun.'''

# Assumed setup: tokenize with NLTK and list the punctuation marks to skip
punctuations = "?:!.,;"
sentence_words = nltk.word_tokenize(sentence)
filtered_words = []

# Iterate through each word in the sentence
for word in sentence_words:
    if word not in punctuations:
        # If the word is not a punctuation mark, append it to the filtered_words list
        filtered_words.append(word)
# Lemmatize the words using the WordNet Lemmatizer (default part of speech: noun)
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in filtered_words]

print(lemmatized_words)

['He', 'wa', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', 'He', 'ha', 'a', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'Sun']


In [60]:
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    print('{0:20}{1:20}'.format(word,wordnet_lemmatizer.lemmatize(word,pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
the                 the                 
same                same                
time                time                
.                   .                   
He                  He                  
has                 have                
a                   a                   
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun             