# NLP Basics
1. Lowercasing
2. Remove HTML Tags
3. Remove URLs
4. Remove Punctuation
5. Chat Word Treatment
6. Spelling Correction
7. Removing Stop Words
8. Handling Emojis
9. Tokenization
10. Stemming
11. Lemmatization

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
pd.set_option("max_colwidth", 1000)

In [16]:
data = pd.read_csv("./data/IMDB Dataset.csv")
data.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


In [17]:
data.shape

(50000, 2)

## Lowercasing

In [18]:
# Convert the entire review column to lowercase.
data["review"] = data["review"].str.lower()

In [19]:
data.head(2)

Unnamed: 0,review,sentiment
0,"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the...",positive
1,"a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only ""has got all the polari"" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.",positive
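For text that may contain non-ASCII characters, Python's `str.casefold()` is a more aggressive alternative to `str.lower()`. The following is a minimal sketch on plain Python strings; the sample strings are illustrative and not taken from the dataset. pandas exposes the same operation as `Series.str.casefold()`.

```python
# Illustrative strings only; "Straße" is included to show where the two methods differ.
samples = ["Trust Me", "OZ", "Straße"]

# lower() maps each character to its lowercase form.
lowered = [s.lower() for s in samples]

# casefold() additionally normalizes characters such as "ß" for caseless matching.
folded = [s.casefold() for s in samples]

print(lowered)  # ['trust me', 'oz', 'straße']
print(folded)   # ['trust me', 'oz', 'strasse']
```

For English-only reviews the two methods give the same result, so `str.lower()` is sufficient here.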


## Removing HTML Tags

Regular expressions such as the one used in this section can be tested interactively at https://regex101.com/


In [20]:
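import re

def remove_html_tags(text):
    # Non-greedy match: "<", then as few characters as possible, then ">".
    pattern = re.compile(r"<.*?>")
    # Replace every matched tag with an empty string.
    return pattern.sub("", text)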

In [21]:
text = """<div class="ipc-html-content-inner-div">Earth's future has been riddled by disasters, famines, and droughts. There is only one way to ensure mankind's survival: Interstellar travel. A newly discovered wormhole in the far reaches of our solar system allows a team of astronauts to go where no man has gone before, a planet that may have the right environment to sustain human life.<span style="display:inline-block" data-reactroot=""> <!-- -->—<a class="ipc-link ipc-link--base sc-9eebdf80-0 VsoxL" role="button" tabindex="0" aria-disabled="false" href="/search/title/?plot_author=ahmetkozan&amp;view=simple&amp;sort=alpha&amp;ref_=tt_stry_pl">ahmetkozan</a></span></div>
"""

In [22]:
# Example
remove_html_tags(text)

"Earth's future has been riddled by disasters, famines, and droughts. There is only one way to ensure mankind's survival: Interstellar travel. A newly discovered wormhole in the far reaches of our solar system allows a team of astronauts to go where no man has gone before, a planet that may have the right environment to sustain human life. —ahmetkozan\n"

In [24]:
# Removing tags from the review column
data["review"] = data["review"].apply(remove_html_tags)
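The reviews separate paragraphs with `<br /><br />` and no surrounding spaces, so substituting an empty string joins the neighbouring words ("me.<br /><br />the" becomes "me.the"). One possible variant, sketched here under the hypothetical name `remove_html_tags_keep_spacing`, replaces each tag with a space and then collapses repeated whitespace. It is not what the notebook applies to the column, only an alternative to consider.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")
WHITESPACE_PATTERN = re.compile(r"\s+")

def remove_html_tags_keep_spacing(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", without_tags).strip()

sample = "this is exactly what happened with me.<br /><br />the first thing that struck me"
print(remove_html_tags_keep_spacing(sample))
# this is exactly what happened with me. the first thing that struck me
```

Note that a regular expression is not a full HTML parser: by default `.` does not match newlines, so a tag that spans several lines would be left in place.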

## Removing URLs


In [41]:
def remove_url(text):
    # Match http:// or https:// links, or bare www. links, up to the next whitespace.
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)

In [42]:
url = " this is the url of the interstellar movie on imdb https://www.imdb.com/title/tt0816692/?ref_=ttls_li_tt very good movie to watch"

In [43]:
remove_url(url)

' this is the url of the interstellar movie on imdb  very good movie to watch'

In [44]:
# Removing URLs from the review column
data["review"] = data["review"].apply(remove_url)
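It can help to check how the URL pattern behaves on a few edge cases before applying it to a whole column. The sketch below reuses the same regular expression on made-up strings (the `example.com` link and the sentences are illustrative assumptions).

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

examples = [
    "see https://www.imdb.com/title/tt0816692/ for details",
    "visit www.imdb.com today",
    "link at the end (http://example.com).",
]

# Remove every match; the surrounding whitespace is left untouched.
cleaned = [URL_PATTERN.sub("", example) for example in examples]

for result in cleaned:
    print(repr(result))
# 'see  for details'
# 'visit  today'
# 'link at the end ('
```

Two things stand out: removing a URL leaves a double space behind, and because `\S+` runs to the next whitespace, punctuation that directly follows a URL (the `).` in the last example) is removed with it.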

## Removing Punctuation

In [45]:
import string
import time

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [46]:
exclude = string.punctuation

In [47]:
# Works, but slow on large data: one full pass over the text per punctuation character.
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,"")
    return text

In [48]:
text = "What's your full name? Y'day was a holiday."

In [57]:

print(remove_punc(text))

Whats your full name Yday was a holiday


In [58]:
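# Note: a single call on a short string can finish below the resolution of
# time.time(), which is why the elapsed time prints as 0.0 here.
start = time.time()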
print(remove_punc(text))
time1 = time.time() - start
print(time1)

Whats your full name Yday was a holiday
0.0


In [59]:
# Faster alternative: delete all punctuation in a single pass using a translation table.
def remove_punc(text):
    return text.translate(str.maketrans("", "", exclude))

In [60]:
print(remove_punc(text))

Whats your full name Yday was a holiday


In [61]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1)

Whats your full name Yday was a holiday
0.0
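Both timings above print `0.0` because one call on a short string is too quick for `time.time()` to measure. A more informative comparison repeats each call many times with the standard-library `timeit` module. The sketch below renames the two versions to `remove_punc_loop` and `remove_punc_translate` (hypothetical names, used only so both can coexist) and builds the translation table once.

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call.
TABLE = str.maketrans("", "", exclude)

def remove_punc_loop(text):
    # One str.replace pass per punctuation character.
    for char in exclude:
        text = text.replace(char, "")
    return text

def remove_punc_translate(text):
    # A single pass that drops every character listed in the table.
    return text.translate(TABLE)

text = "What's your full name? Y'day was a holiday."

# Total time in seconds for 10,000 calls of each version.
loop_time = timeit.timeit(lambda: remove_punc_loop(text), number=10_000)
translate_time = timeit.timeit(lambda: remove_punc_translate(text), number=10_000)

print(f"loop:      {loop_time:.4f} s for 10,000 calls")
print(f"translate: {translate_time:.4f} s for 10,000 calls")
```

The exact numbers depend on the machine and the Python version, so it is worth running this locally rather than relying on a quoted figure.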


In [62]:
# Remove punctuation from the review column
data["review"] = data["review"].apply(remove_punc)

In [63]:
data.head(2)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pic...,positive
1,a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done,positive
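Deleting punctuation outright fuses hyphenated words, as in "oldtimebbc" in the output above. If that matters for a downstream task, one option is to map each punctuation character to a space and then collapse repeated whitespace. The sketch below uses a hypothetical helper name, `replace_punc_with_space`, and is an alternative to consider rather than the step applied in this notebook.

```python
import re
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punc_with_space(text):
    spaced = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of spaces left behind and trim the ends.
    return re.sub(r"\s+", " ", spaced).strip()

print(replace_punc_with_space("very old-time-bbc fashion"))
# very old time bbc fashion
print(replace_punc_with_space("what's your full name?"))
# what s your full name
```

The trade-off is visible in the second example: contractions are split at the apostrophe, so which variant works better depends on how the tokens are used later.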
