# Steps

1. Import Data (Acquisition)
2. Lowercase
3. Remove HTML Tags
4. Remove URLs
5. Remove Punctuation
6. Chat Word Treatment
7. Spelling Correction
8. Removing Stop Words
9. Tokenization
10. Lemmatization
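
Several of the cleaning steps listed above can be chained into one function. The following is a minimal sketch covering only steps 2 to 5, using the standard library. The name `clean_text` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re
import string

# Hypothetical helper that chains steps 2-5 from the list above
HTML_TAG = re.compile(r'<.*?>')
URL = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    text = text.lower()                  # 2. lowercase
    text = HTML_TAG.sub('', text)        # 3. remove HTML tags
    text = URL.sub('', text)             # 4. remove URLs
    text = text.translate(PUNCT_TABLE)   # 5. remove punctuation
    return text

sample = "A <b>GREAT</b> movie! See https://example.com for more."
print(clean_text(sample))
```

The order matters in this sketch: tags and URLs are removed before punctuation, because stripping `<`, `>`, `:` and `/` first would leave them unrecognisable to the two patterns.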

1. Import Data (Acquisition)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
df.shape

(50000, 2)

In [4]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [5]:
df = df.dropna()  # dropna() returns a new DataFrame, so assign the result back
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
df.shape

(50000, 2)

2. Lowercase

In [7]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [8]:
df['review'] = df['review'].str.lower()
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [9]:
import warnings
warnings.filterwarnings("ignore")


3. Remove HTML Tags

In [10]:
import re


def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In [11]:
df['review'] = df['review'].apply(remove_html_tags)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
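
Removing a tag outright can join the words on either side of it when the source has no surrounding spaces (for example, `time.<br /><br />this` becomes `time.this`). One possible variation, shown here as a sketch and not as the notebook's method, replaces each tag with a space and then collapses repeated whitespace. The function name `remove_html_tags_spaced` is an assumption.

```python
import re

def remove_html_tags_spaced(text):
    # Replace each tag with a space so adjacent words stay separated
    text = re.sub(r'<.*?>', ' ', text)
    # Collapse the runs of whitespace this can create
    return re.sub(r'\s+', ' ', text).strip()

print(remove_html_tags_spaced("all the time.<br /><br />this movie is slower"))
```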

In [12]:
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

4. Remove URLs

In [13]:
import re

def remove_urls(text):
    # Raw string so that \S and \. reach the regex engine unchanged
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [14]:
df['review'] = df['review'].apply(remove_urls)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
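
The truncated preview above shows no visible change, so a small made-up string helps confirm what the pattern matches. The sample text below is an assumption for illustration only.

```python
import re

def remove_urls(text):
    # Match http:// or https:// links, or anything starting with www.
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)

sample = "read the review at https://example.com/review?id=1 or www.example.org today"
print(remove_urls(sample))
```

Because `\S+` consumes everything up to the next whitespace, punctuation that directly follows a link is removed together with it.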

5. Remove Punctuation

In [15]:
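import re
import string

def removing_punctuation(text):
    # Build the character class from string.punctuation; re.escape protects
    # the characters that are special inside [...] such as ], \, ^ and -
    pattern = re.compile('[' + re.escape(string.punctuation) + ']')
    return pattern.sub(r'', text)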

In [16]:
df['review'] = df['review'].apply(removing_punctuation)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [17]:
df['review'][3]

'basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs as for the shots with jake just ignore them'

A faster alternative, using str.translate:

In [18]:

import string

exclude = string.punctuation

In [19]:
def remove_pun1(text):
    return text.translate(str.maketrans('','',exclude))

In [20]:
df['review'] = df['review'].apply(remove_pun1)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
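
One way to check the speed difference on your own machine is to time both approaches on the same input. This is a rough sketch; the sample text, the repeat count and the helper names `with_regex` and `with_translate` are assumptions, and the timings will vary between machines.

```python
import re
import string
import timeit

text = "Hello, world! This is a test... (with punctuation) & symbols." * 100

# Build the regex and the translation table once, outside the timed calls
pattern = re.compile('[' + re.escape(string.punctuation) + ']')
table = str.maketrans('', '', string.punctuation)

def with_regex(s):
    return pattern.sub('', s)

def with_translate(s):
    return s.translate(table)

# Both approaches should produce the same cleaned text
assert with_regex(text) == with_translate(text)

print("regex    :", timeit.timeit(lambda: with_regex(text), number=200))
print("translate:", timeit.timeit(lambda: with_translate(text), number=200))
```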

6. Chat Word Treatment (in chat boxes, people use shorthand such as asp, blv, kmn)
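
The notebook gives no code for this step. A common approach is a lookup table that maps each abbreviation to its expansion. The sketch below assumes a small hand-written table; the entries in `chat_words` and the function name `chat_conversion` are hypothetical and would need to be replaced with the shorthand that actually occurs in your data.

```python
# Hypothetical abbreviation table; extend it with the shorthand found in your data
chat_words = {
    "asap": "as soon as possible",
    "brb": "be right back",
    "fyi": "for your information",
    "imo": "in my opinion",
}

def chat_conversion(text):
    # Replace each known abbreviation and leave every other word untouched
    return " ".join(chat_words.get(word.lower(), word) for word in text.split())

print(chat_conversion("imo the ending was weak fyi"))
```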

7. Spelling Correction

In [21]:
from autocorrect import Speller

In [22]:
spell = Speller()

In [23]:
# df['review'] = df['review'].apply(lambda x: str(spell(x)))

# df['review']
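
The `autocorrect` call above is commented out. To illustrate the idea behind spelling correction without that package, here is a rough sketch that swaps in a different technique: matching each word against a small known vocabulary with the standard library's `difflib.get_close_matches`. The vocabulary, the 0.8 cutoff and the name `correct_spelling` are assumptions, and this is not how `autocorrect` works internally.

```python
import difflib

# Hypothetical vocabulary; a real one would come from a dictionary or corpus
vocabulary = ["wonderful", "production", "movie", "terrible"]

def correct_spelling(text, cutoff=0.8):
    corrected = []
    for word in text.split():
        # Take the closest vocabulary entry, if any is similar enough
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_spelling("a wonderfull little prodution"))
```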
    
    

8. Removing Stop Words [ a, the, of, are, any ]

In [24]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [25]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [26]:
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the words that are not stop words
    return " ".join(word for word in text.split() if word not in stop_words)

In [27]:
# df['review'].apply(remove_stopwords)  # takes a long time on all 50,000 reviews
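
Stop-word removal over many reviews is much cheaper when the stop-word collection is built once and stored as a set, since set membership tests take constant time on average. The sketch below uses a tiny hand-written set as a stand-in (an assumption, so the example runs without the NLTK corpus); with NLTK, `set(stopwords.words('english'))` would take its place. The name `remove_stopwords_fast` is also hypothetical.

```python
# Stand-in stop-word set; an assumption so that this example needs no downloads
STOP_WORDS = {"a", "an", "the", "of", "are", "any", "is", "in", "to"}

def remove_stopwords_fast(text):
    # Filtering drops stop words entirely, so no empty strings are left behind
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stopwords_fast("the plot of the movie is a mess"))
```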

11. Emoji Handling

In [34]:
import emoji
import re

# Replace each emoji with its text name
text = "I love Python! ❤️🐍"

demojized = emoji.demojize(text)
demojized

'I love Python! :red_heart::snake:'

In [35]:

text = "I love Python! ❤️🐍"

# Remove emojis using a regular expression.
# \U0001F000-\U0001FAFF covers the main emoji blocks, \u2600-\u27BF covers
# Miscellaneous Symbols and Dingbats (including the heart, U+2764), and
# \uFE0F is the variation selector that follows the heart.
clean_text = re.sub(r'[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]', '', text)

print(f"Text without emojis: {clean_text}")


Text without emojis: I love Python! 


# Tokenization

1. Using the split function

In [36]:
#word tokenization
sent1 = 'I am going to Germany'
sent1.split()

['I', 'am', 'going', 'to', 'Germany']

In [39]:
#sentence tokenization

sent2 = 'I am going to Germany. I will stay there for 15 years. Let\'s hope the trip to be great'

sent2.split('.')

['I am going to Germany',
 ' I will stay there for 15 years',
 " Let's hope the trip to be great"]

2. Regular Expression

In [41]:
import re

sent3 = 'I am going to Dhaka'

tokens = re.findall(r'\w+', sent3)
tokens

['I', 'am', 'going', 'to', 'Dhaka']

In [44]:
text = 'Please note that this is a simplified template and does not cover the intricacies and specific legal language required for a real law. Creating a law involves a legal drafting process that adheres to established legal principles, follows the legal framework of the jurisdiction, and often requires legal review and approval through a legislative process?.'

In [45]:
# Split on '.', ',' and '?'; note that commas separate clauses, not only sentences
sentences = re.compile('[.,?]').split(text)
sentences

['Please note that this is a simplified template and does not cover the intricacies and specific legal language required for a real law',
 ' Creating a law involves a legal drafting process that adheres to established legal principles',
 ' follows the legal framework of the jurisdiction',
 ' and often requires legal review and approval through a legislative process',
 '',
 '']
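
The split above drops the punctuation and leaves empty strings at the end. A possible refinement, sketched here under the assumption that sentences end with `.`, `!` or `?` followed by whitespace, uses a lookbehind so the punctuation stays attached, and filters out empty pieces. The name `split_sentences` is hypothetical, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows '.', '!' or '?', keeping the punctuation
    parts = re.split(r'(?<=[.!?])\s+', text)
    # Drop any empty strings left over from the split
    return [part for part in parts if part]

print(split_sentences("I am going to Dhaka. Will it rain? Let's hope not!"))
```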

3. NLTK

In [46]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [47]:
sent1 = 'I am going to Dhaka'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'Dhaka']

In [48]:
text = 'Please note that this is a simplified template and does not cover the intricacies and specific legal language required for a real law. Creating a law involves a legal drafting process that adheres to established legal principles, follows the legal framework of the jurisdiction, and often requires legal review and approval through a legislative process?.'

sent_tokenize(text)

['Please note that this is a simplified template and does not cover the intricacies and specific legal language required for a real law.',
 'Creating a law involves a legal drafting process that adheres to established legal principles, follows the legal framework of the jurisdiction, and often requires legal review and approval through a legislative process?.']

4. PorterStemmer

In [51]:
from nltk.stem.porter import PorterStemmer

In [52]:
ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [53]:
sample = 'walks walk walked walking'
stem_words(sample)

'walk walk walk walk'

In [54]:
stem_words(text)  # stems are often not real words, so the output is hard for a human to read

'pleas note that thi is a simplifi templat and doe not cover the intricaci and specif legal languag requir for a real law. creat a law involv a legal draft process that adher to establish legal principles, follow the legal framework of the jurisdiction, and often requir legal review and approv through a legisl process?.'

5. WordNetLemmatizer

In [56]:
import nltk
from nltk.stem import WordNetLemmatizer

WordNet_Lemmatizer = WordNetLemmatizer()

sentence = "The quick brown fox jumped over the lazy dog's tail, and the dog barked loudly! What a noisy commotion, right?"

punctuation = "?:!.,;"

sentence_words = nltk.word_tokenize(sentence)

In [57]:
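# Build a new list instead of removing items from the list being iterated over
sentence_words = [word for word in sentence_words if word not in punctuation]

print("{0:20}{1:20}".format("Word","Lemma"))

# lemmatize() treats every word as a noun by default (pos='n'),
# so verb forms such as 'jumped' and 'barked' are returned unchanged
for word in sentence_words:
    print("{0:20}{1:20}".format(word, WordNet_Lemmatizer.lemmatize(word)))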

Word                Lemma               
The                 The                 
quick               quick               
brown               brown               
fox                 fox                 
jumped              jumped              
over                over                
the                 the                 
lazy                lazy                
dog                 dog                 
's                  's                  
tail                tail                
and                 and                 
the                 the                 
dog                 dog                 
barked              barked              
loudly              loudly              
What                What                
a                   a                   
noisy               noisy               
commotion           commotion           
right               right               
