# NLP Basics
1. Lowercasing
2. Remove HTML Tags
3. Remove URLs
4. Remove Punctuation
5. Chat word treatment
6. Spelling Correction
7. Removing Stop Words
8. Handling Emojis
9. Tokenization
10. Stemming
11. Lemmatization

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
pd.set_option("max_colwidth", 1000)

In [3]:
data = pd.read_csv("./data/IMDB Dataset.csv")
data.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


In [4]:
data.shape

(50000, 2)

# Lowercasing

In [5]:
# Convert the entire review column to lowercase.
data["review"] = data["review"].str.lower()

In [6]:
data.head(2)

Unnamed: 0,review,sentiment
0,"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the...",positive
1,"a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only ""has got all the polari"" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.",positive


# Removing HTML Tags

Regex patterns can be tested interactively at https://regex101.com/


In [3]:
import re
def remove_html_tags(text):
    pattern = re.compile("<.*?>")
    return pattern.sub(r'',text)

In [4]:
text = """<div class="ipc-html-content-inner-div">Earth's future has been riddled by disasters, famines, and droughts. There is only one way to ensure mankind's survival: Interstellar travel. A newly discovered wormhole in the far reaches of our solar system allows a team of astronauts to go where no man has gone before, a planet that may have the right environment to sustain human life.<span style="display:inline-block" data-reactroot=""> <!-- -->—<a class="ipc-link ipc-link--base sc-9eebdf80-0 VsoxL" role="button" tabindex="0" aria-disabled="false" href="/search/title/?plot_author=ahmetkozan&amp;view=simple&amp;sort=alpha&amp;ref_=tt_stry_pl">ahmetkozan</a></span></div>away.br />br />i 
"""

In [5]:
# Example
remove_html_tags(text)

"Earth's future has been riddled by disasters, famines, and droughts. There is only one way to ensure mankind's survival: Interstellar travel. A newly discovered wormhole in the far reaches of our solar system allows a team of astronauts to go where no man has gone before, a planet that may have the right environment to sustain human life. —ahmetkozanaway.br />br />i \n"

In [10]:
# Removing tags from the review column
data["review"] = data["review"].apply(remove_html_tags)
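
Replacing a tag with an empty string glues together words that were separated only by `<br />` (for example, "me.<br /><br />the" becomes "me.the"). The following is a minimal sketch of a variant that substitutes a space and then collapses repeated whitespace. The function name `remove_html_tags_spaced` is hypothetical and not part of the original notebook.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")
WHITESPACE_PATTERN = re.compile(r"\s+")

def remove_html_tags_spaced(text):
    # Replace each tag with a space so neighbouring words stay separated.
    without_tags = TAG_PATTERN.sub(" ", text)
    # Collapse the runs of whitespace that the substitution leaves behind.
    return WHITESPACE_PATTERN.sub(" ", without_tags).strip()

sample = "this is exactly what happened with me.<br /><br />the first thing that struck me"
print(remove_html_tags_spaced(sample))
# this is exactly what happened with me. the first thing that struck me
```

Whether to use this variant depends on the later steps: it keeps word boundaries intact, which matters once punctuation is removed and the text is split on whitespace.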

# Removing URLs


In [11]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [12]:
url = " this is the urls of intersteller movie imdb https://www.imdb.com/title/tt0816692/?ref_=ttls_li_tt very good movie to watch"

In [13]:
remove_url(url)

' this is the urls of intersteller movie imdb  very good movie to watch'

In [14]:
# Removing URLs from the review column
data["review"] = data["review"].apply(remove_url)

# Removing Punctuation

In [15]:
import string, time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
exclude = string.punctuation

In [17]:
# Correct but slow: one str.replace pass over the text for every punctuation character
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,"")
    return text

In [18]:
text = "What's your full name? Y'day was a holiday."

In [19]:

print(remove_punc(text))

Whats your full name Yday was a holiday


In [20]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1)

Whats your full name Yday was a holiday
0.0


In [21]:
def remove_punc(text):
    return text.translate(str.maketrans("","",exclude))

In [22]:
print(remove_punc(text))

Whats your full name Yday was a holiday


In [23]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1)

Whats your full name Yday was a holiday
0.0
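
A single `time.time()` measurement around one call on a short string rounds to `0.0`, so it cannot show which version is faster. The sketch below repeats each call many times with `timeit` instead. The names `remove_punc_loop`, `remove_punc_translate` and `PUNCT_TABLE` are hypothetical, and the timings depend on the machine, so no numbers are shown.

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call.
PUNCT_TABLE = str.maketrans("", "", exclude)

def remove_punc_loop(text):
    # One str.replace pass per punctuation character.
    for char in exclude:
        text = text.replace(char, "")
    return text

def remove_punc_translate(text):
    # A single pass over the text using the prebuilt table.
    return text.translate(PUNCT_TABLE)

text = "What's your full name? Y'day was a holiday."
# Both versions must produce the same result before comparing speed.
assert remove_punc_loop(text) == remove_punc_translate(text)

# Repeat each call many times so the total is large enough to measure.
loop_seconds = timeit.timeit(lambda: remove_punc_loop(text), number=20_000)
translate_seconds = timeit.timeit(lambda: remove_punc_translate(text), number=20_000)
print(f"loop:      {loop_seconds:.4f} s")
print(f"translate: {translate_seconds:.4f} s")
```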


In [24]:
# Removing punctuations from the review column
data["review"] = data["review"].apply(remove_punc)

In [25]:
data.head(2)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pic...,positive
1,a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done,positive


# Chat Word Treatment

In [26]:
file = open("./data/slang.txt", "r")

In [27]:
# Read one "ABBREVIATION=Expansion" entry per line, then release the file handle.
file.seek(0)
chat_words_str = file.read().splitlines()
file.close()

In [28]:
#file.seek(0)
#chat_words_str = ''.join(map(str, file.readlines())).splitlines()

In [29]:
chat_words_str

['AFAIK=As Far As I Know',
 'AFK=Away From Keyboard',
 'ASAP=As Soon As Possible',
 'ATK=At The Keyboard',
 'ATM=At The Moment',
 'A3=Anytime, Anywhere, Anyplace',
 'BAK=Back At Keyboard',
 'BBL=Be Back Later',
 'BBS=Be Back Soon',
 'BFN=Bye For Now',
 'B4N=Bye For Now',
 'BRB=Be Right Back',
 'BRT=Be Right There',
 'BTW=By The Way',
 'B4=Before',
 'B4N=Bye For Now',
 'CU=See You',
 'CUL8R=See You Later',
 'CYA=See You',
 'FAQ=Frequently Asked Questions',
 'FC=Fingers Crossed',
 "FWIW=For What It's Worth",
 'FYI=For Your Information',
 'GAL=Get A Life',
 'GG=Good Game',
 'GN=Good Night',
 'GMTA=Great Minds Think Alike',
 'GR8=Great!',
 'G9=Genius',
 'IC=I See',
 'ICQ=I Seek you (also a chat program)',
 'ILU=ILU: I Love You',
 'IMHO=In My Honest/Humble Opinion',
 'IMO=In My Opinion',
 'IOW=In Other Words',
 'IRL=In Real Life',
 'KISS=Keep It Simple, Stupid',
 'LDR=Long Distance Relationship',
 'LMAO=Laugh My A.. Off',
 'LOL=Laughing Out Loud',
 'LTNS=Long Time No See',
 'L8R=Later',
 'M

In [30]:
mystr = ""
for i in chat_words_str:
    mystr = mystr + i+"\n"
    
print(mystr)


AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=ILU: I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolli

In [31]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=ILU: I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [32]:
chat_words_map_dict = {}
for line in chat_words_str.split("\n"):
    if line != "":
        # Split only on the first "=" so the expansion is kept whole.
        cw, cw_expanded = line.split("=", 1)
        chat_words_map_dict[cw] = cw_expanded
# Set of known abbreviations, used for fast membership tests.
chat_words_list = set(chat_words_map_dict)

In [33]:
def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [34]:
chat_words_conversion("imo this is awesome")

'In My Opinion this is awesome'

In [35]:
chat_words_conversion('IMHO he is the best')

'In My Honest/Humble Opinion he is the best'
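
Because the function above splits on whitespace, an abbreviation with punctuation attached (such as "imo,") is not recognised. A possible alternative, sketched here under the assumption that punctuation has not yet been removed, matches whole words with a regex and leaves everything else untouched. The names `chat_words_subset` and `chat_words_conversion_regex` are hypothetical, and the small dictionary stands in for the full map built from the slang list.

```python
import re

# A small illustrative subset of the slang map.
chat_words_subset = {
    "IMO": "In My Opinion",
    "BRB": "Be Right Back",
    "FYI": "For Your Information",
}

WORD_PATTERN = re.compile(r"\b\w+\b")

def chat_words_conversion_regex(text, mapping=chat_words_subset):
    def expand(match):
        word = match.group(0)
        # Fall back to the original word when it is not a known abbreviation.
        return mapping.get(word.upper(), word)
    # Only the matched words are rewritten; punctuation and spacing are preserved.
    return WORD_PATTERN.sub(expand, text)

print(chat_words_conversion_regex("imo, this is awesome. brb!"))
# In My Opinion, this is awesome. Be Right Back!
```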

# Spelling Correction

In [36]:
from textblob import TextBlob

In [37]:
incorrect_text = 'theree you goo. I hoope it corrrect hte spellling rigtl'

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'there you go. I hope it correct the spelling right'

# Stop Words

In [38]:
from nltk.corpus import stopwords


In [39]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [40]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [41]:
len(stopwords.words('english'))

179

In [42]:
stopwords.words('hinglish')

['a',
 'aadi',
 'aaj',
 'aap',
 'aapne',
 'aata',
 'aati',
 'aaya',
 'aaye',
 'ab',
 'abbe',
 'abbey',
 'abe',
 'abhi',
 'able',
 'about',
 'above',
 'accha',
 'according',
 'accordingly',
 'acha',
 'achcha',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 'agar',
 'ain',
 'aint',
 "ain't",
 'aisa',
 'aise',
 'aisi',
 'alag',
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'andar',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'ap',
 'apan',
 'apart',
 'apna',
 'apnaa',
 'apne',
 'apni',
 'appear',
 'are',
 'aren',
 'arent',
 "aren't",
 'around',
 'arre',
 'as',
 'aside',
 'ask',
 'asking',
 'at',
 'aur',
 'avum',
 'aya',
 'aye',
 'baad',
 'baar',
 'bad',
 'bahut',
 'bana',
 'banae',
 'banai',
 'banao',
 'banaya',
 'banaye',
 'banayi',
 'banda',
 'bande',
 'bandi',
 'bane',
 'bani',
 'bas',
 'bata',
 'bat

In [43]:
len(stopwords.words('hinglish'))

1036

In [44]:
# Build the stop-word set once. Calling stopwords.words('english') for every
# token rebuilds the list each time, which is what made the first version slow.
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the tokens that are not stop words.
    return " ".join(word for word in text.split() if word not in english_stopwords)

In [45]:
# Removing stop words from the review column
data["review"] = data["review"].apply(remove_stopwords)

In [46]:
x = "hello how are you "

In [47]:
#" ".join([t for t in x.split() if t not in stopwords.words('english')])


In [48]:
#data["review"] = data["review"].apply(lambda x: " ".join([t for t in x.split() if t not in stopwords.words("english")]))

# Dealing With Emojis 😊

In [49]:
# Removing Emojis
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [50]:
remove_emoji("TIme for a quick coffee ☕ break 🍩")

'TIme for a quick coffee  break '
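
One caveat with the pattern above: the range `\U000024C2-\U0001F251` spans far more than emoji. It also covers CJK ideographs and many other non-emoji characters, so Chinese or Japanese text would be stripped as well. The sketch below uses narrower ranges. It is an illustrative subset rather than a complete emoji definition, and the name `remove_emoji_narrow` is hypothetical.

```python
import re

# Narrower ranges than the pattern above; an illustrative subset, not a full emoji list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols (includes the hot beverage sign)
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)

def remove_emoji_narrow(text):
    return EMOJI_PATTERN.sub("", text)

print(repr(remove_emoji_narrow("TIme for a quick coffee \u2615 break \U0001F369")))
# 'TIme for a quick coffee  break '
print(repr(remove_emoji_narrow("\u6771\u4eac is lovely \U0001F60A")))
# The two CJK characters are kept; only the emoji is removed.
```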

In [51]:
# Replacing emojis with their text descriptions
import emoji
print(emoji.demojize("TIme for a quick coffee ☕ break 🍩"))

TIme for a quick coffee :hot_beverage: break :doughnut:


# Tokenization

**A tokenizer has to handle the following cases:**
- Prefix: characters at the beginning of a token (e.g. the $ in $20)
- Suffix: characters at the end of a token (e.g. the KM in 15KM, or a trailing ! or ) )
- Infix: characters in between (e.g. the - in new-delhi, or _ and /)
- Exception: a special-case rule that splits a string into several tokens, or prevents a token from being split, when punctuation rules are applied (e.g. let's, U.S.)
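
The cases listed above can be sketched with a single ordered regex, where earlier alternatives take priority over later ones. This is a simplified illustration that assumes ASCII letters only. It is not how NLTK or spaCy tokenize, and the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}               # exception: dotted abbreviations are kept whole
  | \d+(?:\.\d+)?                    # numbers, with an optional decimal part
  | [A-Za-z]+(?:['-][A-Za-z]+)*      # words, keeping an infix apostrophe or hyphen
  | \S                               # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    # findall scans left to right and skips the whitespace between tokens.
    return TOKEN_PATTERN.findall(text)

print(tokenize("A 15KM ride in the U.S. cost $20, let's go to new-delhi!"))
# ['A', '15', 'KM', 'ride', 'in', 'the', 'U.S.', 'cost', '$', '20', ',',
#  "let's", 'go', 'to', 'new-delhi', '!']
```

The prefix `$` and the suffix `KM` are split off, the infix characters in `let's` and `new-delhi` are kept, and `U.S.` is protected by the exception rule that comes first in the pattern.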

## 1. Using the split function 

In [52]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [53]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [54]:
#Sentence word level tokenization
sent_lst = []
for sent in sent2.split('.'):
    sent_lst.append(sent.split())
sent_lst

[['I', 'am', 'going', 'to', 'delhi'],
 ['I', 'will', 'stay', 'there', 'for', '3', 'days'],
 ["Let's", 'hope', 'the', 'trip', 'to', 'be', 'great']]

In [55]:
# Problems with split function
sent3 = 'I am going to delhi!'
sent3.split()
# delhi has ! with it

['I', 'am', 'going', 'to', 'delhi!']

In [56]:
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']

## 2. Regular Expressions
- Better than split, but you have to design a pattern for every case you want to handle.

In [57]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens
# ! is removed

['I', 'am', 'going', 'to', 'delhi']

In [58]:
sent3 = "Let's go to delhi!"
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

["Let's", 'go', 'to', 'delhi']

In [59]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

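Splitting on `'[.!?] '` consumes the sentence terminator, so the question mark is lost in the output above. One possible refinement, sketched here, uses a lookbehind so that only the whitespace after the terminator is consumed. The names `SENTENCE_BOUNDARY` and `split_sentences` are hypothetical, and the pattern will still split after abbreviations that end in a period.

```python
import re

# Match whitespace that is preceded by ., ! or ? without consuming the terminator.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    # Drop empty strings that could result from leading or trailing whitespace.
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

print(split_sentences("Where do you think I should go? I have a 3 day holiday. Let's hope the trip is great!"))
# ['Where do you think I should go?', 'I have a 3 day holiday.', "Let's hope the trip is great!"]
```
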
## 3. NLTK
- Use libraries wherever possible. spaCy, shown in the next section, usually handles the edge cases best.

In [60]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [61]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [62]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [63]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at robin@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [64]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'robin',
 '@',
 'gmail.com']

In [65]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

## 4. spaCy
- Often the most reliable library for tokenization, although it still makes mistakes (see how Ph.D is split in the output below).

In [66]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [67]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [68]:
for token in doc1:
    print(token)

I
have
a
Ph
.
D
in
A.I


# Stemming
- In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender and mood.
- Examples: walk, walking, walked, walks, do, undoable, redo

**Stemming** is the process of reducing inflected words to a root form (the stem), so that a group of related words maps to the same stem even if the stem itself is not a valid word in the language.
- Stemming is very useful in information retrieval systems such as search engines.
- Stemmers are the algorithms used for stemming. NLTK provides the Porter stemmer and the Snowball stemmer.
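
To make the idea of mapping inflected forms to a shared stem concrete, here is a deliberately naive suffix stripper. It is not the Porter or Snowball algorithm, which apply many more rules and conditions. The names `SUFFIXES` and `naive_stem` are hypothetical and exist only for this illustration.

```python
# Suffixes are tried in order, so the longer ones come first.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a stem of at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walk", "walking", "walked", "walks"]
print({w: naive_stem(w) for w in words})
# {'walk': 'walk', 'walking': 'walk', 'walked': 'walk', 'walks': 'walk'}

print(naive_stem("studies"))
# studie
```

The last call shows the trade-off mentioned above: all forms of "walk" collapse to one stem, but "studies" becomes "studie", which is not a valid English word.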

In [69]:
from nltk.stem.porter import PorterStemmer

In [70]:
ps = PorterStemmer()

In [71]:
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [72]:
text = "I walk daily as walking is good for health. I even walked yday"

In [73]:
stem_words(text)

'i walk daili as walk is good for health. i even walk yday'

In [74]:
# Keep in mind that the stem produced by stemming may not be a real word, e.g. 'daili'.
# Lemmatization avoids this, at the cost of slower processing.

# Lemmatization
- Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

**WordNet** is a lexical database of English. It records how different words are related to one another, and NLTK's WordNet lemmatizer uses it to look up lemmas.

In [75]:
import nltk
nltk.download('wordnet')
#nltk.download()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [76]:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Build a new list instead of calling remove() while iterating, which can skip tokens.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

# On some NLTK versions lemmatize() also needs the 'omw-1.4' resource;
# if it raises LookupError, run nltk.download('omw-1.4') and retry.
print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               


LookupError: 
**********************************************************************
  Resource omw-1.4 not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('omw-1.4')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/omw-1.4

  Searched in:
    - 'C:\\Users\\Admin/nltk_data'
    - 'C:\\Users\\Admin\\anaconda3\\nltk_data'
    - 'C:\\Users\\Admin\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\Admin\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Admin\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [None]:
import nltk
# nltk.data.find raises LookupError if the resource is not installed.
print(nltk.data.find('corpora/wordnet.zip'))

In [77]:
import os
import requests

url = "https://api.themoviedb.org/3/genre/movie/list?language=en"

headers = {
    "accept": "application/json",
    # The bearer token is read from the environment instead of being hard-coded;
    # TMDB_API_TOKEN is an assumed variable name.
    "Authorization": "Bearer " + os.environ["TMDB_API_TOKEN"]
}

response = requests.get(url, headers=headers)

print(response.text)

{"genres":[{"id":28,"name":"Action"},{"id":12,"name":"Adventure"},{"id":16,"name":"Animation"},{"id":35,"name":"Comedy"},{"id":80,"name":"Crime"},{"id":99,"name":"Documentary"},{"id":18,"name":"Drama"},{"id":10751,"name":"Family"},{"id":14,"name":"Fantasy"},{"id":36,"name":"History"},{"id":27,"name":"Horror"},{"id":10402,"name":"Music"},{"id":9648,"name":"Mystery"},{"id":10749,"name":"Romance"},{"id":878,"name":"Science Fiction"},{"id":10770,"name":"TV Movie"},{"id":53,"name":"Thriller"},{"id":10752,"name":"War"},{"id":37,"name":"Western"}]}


In [78]:
import os
import requests

url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page=1"

headers = {
    "accept": "application/json",
    # The bearer token is read from the environment instead of being hard-coded;
    # TMDB_API_TOKEN is an assumed variable name.
    "Authorization": "Bearer " + os.environ["TMDB_API_TOKEN"]
}

response = requests.get(url, headers=headers)

print(response.text)

{"page":1,"results":[{"adult":false,"backdrop_path":"/tmU7GeKVybMWFButWEGl2M4GeiP.jpg","genre_ids":[18,80],"id":238,"original_language":"en","original_title":"The Godfather","overview":"Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge.","popularity":124.761,"poster_path":"/3bhkrj58Vtu7enYsRolD1fZdja1.jpg","release_date":"1972-03-14","title":"The Godfather","video":false,"vote_average":8.708,"vote_count":19066},{"adult":false,"backdrop_path":"/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg","genre_ids":[18,80],"id":278,"original_language":"en","original_title":"The Shawshank Redemption","overview":"Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he

In [85]:
# response.text is a plain string and has no .json() method;
# call .json() on the Response object to parse the body.
response.json()