# Text Preprocessing in NLP

- 1. Lowercasing
- 2. Removing HTML Tags
- 3. Removing URLs
- 4. Removing Punctuation Marks
- 5. Spelling Correction
- 6. Removing Stop Words
- 7. Handling Emojis
- 8. Tokenization
- 9. Stemming/Lemmatization

In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv('IMDB Dataset.csv')

In [3]:
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.shape

(50000, 2)

In [5]:
reviews=[]  # avoid the name 'list', which would shadow the built-in type


### 1. Lowercasing

In [6]:
df['review'][2].lower()

'i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. the plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). while some may be disappointed when they realize this is not match point 2: risk addiction, i thought it was proof that woody allen is still fully in control of the style many of us have grown to love.<br /><br />this was the most i\'d laughed at one of woody\'s comedies in years (dare i say a decade?). while i\'ve never been impressed with scarlet johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />this may not be the crown jewel of his career, but it was wittier than "devil wears prada" and more interesting than "superman" a great comedy to go see with friends.'

In [7]:
df['review']=df['review'].str.lower()

In [8]:
df['review'].head(5)

0    one of the other reviewers has mentioned that ...
1    a wonderful little production. <br /><br />the...
2    i thought this was a wonderful way to spend ti...
3    basically there's a family where a little boy ...
4    petter mattei's "love in the time of money" is...
Name: review, dtype: object

### 2. Removing HTML tags

In [9]:
# Regex patterns can be tested interactively at regex101.com
import re

def remove_html_tag(text):
    # Non-greedy match of everything between '<' and '>'
    pattern=re.compile(r'<.*?>')
    return pattern.sub(r'',text)


In [10]:
df['review']=df['review'].apply(remove_html_tag)

In [11]:
df['review'].head(5)

0    one of the other reviewers has mentioned that ...
1    a wonderful little production. the filming tec...
2    i thought this was a wonderful way to spend ti...
3    basically there's a family where a little boy ...
4    petter mattei's "love in the time of money" is...
Name: review, dtype: object
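
The regex above strips tags but leaves HTML entities such as `&amp;` untouched. A minimal sketch, assuming the reviews may also contain entities, that decodes them with the standard-library `html.unescape` after removing the tags. The helper name `clean_html` and the sample string are made up for illustration:

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def clean_html(text):
    """Remove HTML tags, then decode entities such as &amp; or &quot;."""
    without_tags = TAG_RE.sub('', text)
    return html.unescape(without_tags)

sample = 'a wonderful little production. <br /><br />tom &amp; jerry'
print(clean_html(sample))
```

On the DataFrame this could be applied the same way as `remove_html_tag`, with `df['review'].apply(clean_html)`.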

### 3. Removing URLs

In [12]:
text1= 'check out my notebook https://www.kaggle.com/code/ifteshanajnin/exoplanet-detection-on-kepler-data '

In [13]:
def remove_url(text):
    pattern=re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [14]:
remove_url(text1)

'check out my notebook  '
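
Removing a URL leaves the surrounding spaces behind, as the trailing whitespace in the output above shows. A small sketch that also collapses the leftover whitespace. It reuses the same URL pattern; the helper name `remove_url_clean` and the example URLs are hypothetical:

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')

def remove_url_clean(text):
    """Drop URLs, then collapse runs of whitespace into single spaces."""
    no_urls = URL_RE.sub('', text)
    return re.sub(r'\s+', ' ', no_urls).strip()

print(remove_url_clean('check out my notebook https://www.kaggle.com/code and www.example.com today '))
```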

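### 4. Removing Punctuation Marks
- Punctuation marks are characters such as ! " $ %; Python lists them in `string.punctuation`.
- Left attached, punctuation makes the same word look like different tokens (for example `great` and `great!`).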

In [15]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
exclude=string.punctuation

def remove_punc(text):
    for char in exclude:
        text=text.replace(char,'')
    return text

In [17]:
text= 'Hello, here is the punctuation!'

In [18]:
start=time.time()
remove_punc(text)
print(time.time()-start)

0.0


In [19]:
def remove_punc1(text):
    return text.translate(str.maketrans('','',exclude))

In [20]:
start=time.time()
remove_punc1(text)
print(time.time()-start)

0.0
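
A single `time.time()` measurement on a short string is too coarse to separate the two approaches, which is why both cells print `0.0`. A rough sketch of a fairer comparison using the standard-library `timeit` module on a longer, made-up input. Absolute timings depend on the machine, so none are shown here:

```python
import string
import timeit

exclude = string.punctuation
TABLE = str.maketrans('', '', exclude)

def remove_punc(text):
    # One pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

def remove_punc1(text):
    # A single pass using a precomputed translation table
    return text.translate(TABLE)

long_text = 'Hello, here is the punctuation! ' * 1000  # illustrative input

loop_time = timeit.timeit(lambda: remove_punc(long_text), number=100)
table_time = timeit.timeit(lambda: remove_punc1(long_text), number=100)
print(f'replace loop: {loop_time:.4f}s, translate: {table_time:.4f}s')
```

Repeating each call many times averages out timer resolution and noise, which matters when a single call finishes in microseconds.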


In [21]:
### Removing Chat Words
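
Chat words are informal abbreviations such as "imo" or "brb". A minimal sketch of one common way to handle them, expanding each abbreviation through a lookup table. The three-entry dictionary and the function name `expand_chat_words` are assumptions for illustration, not part of the notebook:

```python
# Hypothetical, deliberately tiny lookup table; a real one would be much larger
CHAT_WORDS = {
    'imo': 'in my opinion',
    'brb': 'be right back',
    'fyi': 'for your information',
}

def expand_chat_words(text):
    """Replace known chat abbreviations, matching case-insensitively."""
    words = []
    for word in text.split():
        words.append(CHAT_WORDS.get(word.lower(), word))
    return ' '.join(words)

print(expand_chat_words('IMO the movie was great'))
```

Because the lookup is on whole whitespace-separated tokens, an abbreviation with punctuation attached (`brb,`) is not expanded, so this step fits best after punctuation removal.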

### 5. Spelling Correction

In [22]:
from textblob import TextBlob

In [23]:
incorrect_text= 'I am wreting this intentinally incooract, to check the correctar'

In [24]:
textblob=TextBlob(incorrect_text)
textblob.correct().string

'I am writing this intentionally incorrect, to check the correct'

### 6. Removing Stop Words
- Stop words are usually kept when the task is POS tagging, because they carry grammatical information the tagger relies on.

In [25]:
from nltk.corpus import stopwords

In [26]:
stopwords.words('english')[1:5]

['me', 'my', 'myself', 'we']

In [27]:
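stop_words = set(stopwords.words('english'))  # build once; set lookups are fast

def remove_stopwords(text):
    new_text = []

    for word in text.split():
        if word in stop_words:
            # an empty placeholder is kept, which is why the output shows double spaces
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)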

In [28]:
text= 'Keep smiling, because life is a beautiful thing and there\'s so much to smile about.'
remove_stopwords(text)

"Keep smiling,  life   beautiful thing  there's  much  smile about."

In [29]:
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [30]:
df['review'][0:4].apply(remove_stopwords)

0    one    reviewers  mentioned   watching  1 oz e...
1     wonderful little production.  filming techniq...
2     thought    wonderful way  spend time    hot s...
3    basically there's  family   little boy (jake) ...
Name: review, dtype: object
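
`remove_stopwords` puts an empty string where each stop word was, which is why the outputs above contain runs of spaces. A sketch of an alternative that looks words up in a set and skips stop words instead of replacing them. The small `STOP_WORDS` set is a stand-in so the example runs without NLTK data; in the notebook it would be `set(stopwords.words('english'))`. The function name is hypothetical:

```python
# Stand-in for set(stopwords.words('english')) so the sketch needs no NLTK data
STOP_WORDS = {'a', 'and', 'because', 'is', 'so', 'to', 'the', 'about'}

def remove_stopwords_compact(text, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

text = "Keep smiling, because life is a beautiful thing and there's so much to smile about."
print(remove_stopwords_compact(text))
```

Tokens with punctuation attached, such as `about.`, do not match the set and are kept, so the result still depends on whether punctuation was removed first.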

### 7. Handling Emojis

In [31]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also covers CJK characters, so use with care on non-English text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [32]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [33]:
# Replacing the emoji with its meaning

import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


### 8. Tokenization

Challenges in tokenization:
suffixes (10km), prefixes ($20), infixes (New-York), and exceptions (U.S).
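
To make these cases concrete, here is a rough sketch of a regex tokenizer that handles each of them. The pattern and the example sentence are assumptions written for illustration; this is not how NLTK or spaCy tokenize internally:

```python
import re

TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}              # dotted abbreviations such as U.S.
  | \$?\d+(?:\.\d+)?                # numbers, with optional $ prefix and decimals
  | [A-Za-z]+(?:[-'][A-Za-z]+)*     # words, keeping infix hyphens and apostrophes
  | [^\w\s]                         # any other punctuation mark on its own
""", re.VERBOSE)

sentence = "In the U.S. a 5km ride to New-York costs $20"
tokens = TOKEN_RE.findall(sentence)
print(tokens)
```

The alternatives are tried in order, so the abbreviation rule wins over the plain word rule. Splitting `5km` into `5` and `km` is a design choice of this sketch; a different task might prefer to keep the unit attached.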

#### Using split()

In [34]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [35]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [36]:
# Problem with split(): punctuation stays attached to the last token
sent3 = 'I am going to delhi!'
sent3.split()



['I', 'am', 'going', 'to', 'delhi!']

#### Using regex

In [37]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'delhi']

In [38]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

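Splitting on `'[.!?] '` discards the sentence terminator and only recognises a break when a literal space follows it. A sketch of a variant that uses a lookbehind so the terminator stays with its sentence and any whitespace, including a newline, counts as a break. The function name `split_sentences` and the sample text are made up for illustration:

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace, keeping the mark."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sample = "I am going to delhi. I will stay there for 3 days!\nLet's hope the trip is great"
print(split_sentences(sample))
```

This still breaks on abbreviations followed by a space (for example `Ph.D. in`), which is one reason to prefer a library tokenizer for real text.
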
#### Using NLTK

In [39]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [40]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [41]:
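# Edge cases for NLTK: abbreviations, e-mail addresses, units and prices
# (only the result of the last expression in the cell is displayed)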

sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent6)
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

#### Using spaCy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [None]:
for token in doc4:  # doc6 was never defined; doc4 is the last document created above
    print(token)

### 9. Stemming/Lemmatization

#### Stemming

In [46]:
import nltk
from nltk.stem.porter import PorterStemmer

In [47]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [48]:
sample = "dance dancing danced"
stem_words(sample)

'danc danc danc'

In [49]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [50]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

#### Lemmatization

In [52]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

text='The girl loves painting, she is drawing while listing some music.'
punctuations='?:!.,;'

# Filter with a list comprehension; removing items from a list while iterating over it can skip elements
text_words=[word for word in nltk.word_tokenize(text) if word not in punctuations]

print("{0:20}{1:20}".format("Word","Lemma"))

for word in text_words:
    # pos='v' asks the lemmatizer to treat every token as a verb
    print("{0:20}{1:20}".format(word,lemmatizer.lemmatize(word,pos='v')))

Word                Lemma               
The                 The                 
girl                girl                
loves               love                
painting            paint               
she                 she                 
is                  be                  
drawing             draw                
while               while               
listing             list                
some                some                
music               music               


In [53]:
print(stem_words('went'))
print(lemmatizer.lemmatize('went',pos='v'))

went
go
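
The individual steps can be chained into a single function. A minimal sketch using only the standard-library steps from this notebook (lowercasing, HTML tag removal, URL removal, punctuation removal). The name `preprocess`, the ordering and the sample string are assumptions; spelling correction, stop-word removal and stemming are left out to keep the example free of external dependencies:

```python
import re
import string

HTML_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text):
    text = text.lower()
    text = HTML_RE.sub(' ', text)       # a space keeps neighbouring words from fusing
    text = URL_RE.sub('', text)         # must run before punctuation removal
    text = text.translate(PUNCT_TABLE)  # strip every character in string.punctuation
    return ' '.join(text.split())       # collapse leftover whitespace

print(preprocess('A wonderful little production.<br /><br />See https://example.com for more!'))
```

URL removal comes before punctuation removal because stripping `:`, `/` and `.` first would leave the URL unrecognisable to the pattern. On the DataFrame this would be applied with `df['review'].apply(preprocess)`.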
