# NLP - Count Verbs

By [Leonardo Tozo](https://www.linkedin.com/in/leotozo/)

****************************
Hello,
<br>This is part of my personal portfolio. My intention with this series of notebooks is to keep practicing and improving my AI and machine learning skills.
 
*Leonardo Tozo Bisinoto*
<br>*MBA in Artificial Intelligence & Machine Learning*
<br>*LinkedIn: https://www.linkedin.com/in/leotozo/*
<br>*Github: https://github.com/leotozo*
**************************** 

This data analysis uses the IMDB reviews dataset. I will apply a few basic NLP techniques.

In [1]:
import pandas as pd

# Reading the IMDB dataset.

In [6]:
df = pd.read_csv(
    './movies.csv',encoding='utf-8'
).sample(10000)


# Describing the IMDB dataset.

In [8]:
df.describe()

Unnamed: 0,review,sentiment
count,10000,10000
unique,9981,2
top,Alfred Hitchcock invented any kind of thriller...,negative
freq,2,5007


# Displaying the first 5 rows of the dataset.

In [9]:
df.head()

Unnamed: 0,review,sentiment
18086,I was 15 years old when this movie premiered o...,positive
48348,"Yes, my summary just about tells it all.<br />...",negative
8664,"Lou Gossett, Jr. is great as 'Chappy Sinclair'...",positive
7677,"I really liked Get Shorty, but this movie was ...",negative
19679,This movie was astonishing. It is beyond atroc...,negative


# Removing the missing values.

In [11]:
df.dropna(inplace=True)


# Displaying the dataset shape (# of rows, # of columns)


In [14]:
df.shape

(10000, 2)

# Creating a new column: review + sentiment

In [16]:
df["reviews"] = df['review'] + " " + df['sentiment']

df.reviews[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## How many unigrams are there before and after removing stopwords?


In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LEONARDOTOZOBISINOTO\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!


True

## Unigram count before and after removing the stopwords

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1,1))
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('UNIGRAMS with the STOPWORDS', text_vect.shape[1])

UNIGRAMS with the STOPWORDS 35466


In [33]:
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words('portuguese')

vect = CountVectorizer(ngram_range=(1,1), stop_words=stopwords)
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('UNIGRAMS without the STOPWORDS', text_vect.shape[1])

UNIGRAMS without the STOPWORDS 35310
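
To make the numbers printed above concrete: `text_vect.shape[1]` is the vocabulary size, that is, the number of distinct tokens seen across all documents. The sketch below reproduces that count in plain Python for a tiny made-up corpus. It assumes `CountVectorizer`'s default behaviour (lowercasing, and keeping only tokens of two or more word characters). The corpus, the stop word list, and the helper name `vocabulary_size` are illustrative only and are not part of the notebook.

```python
import re

# Same pattern as CountVectorizer's default token_pattern: words with 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary_size(docs, stop_words=()):
    """Count the distinct lowercase tokens across docs, skipping stop words."""
    stop_words = set(stop_words)
    vocabulary = set()
    for doc in docs:
        for token in TOKEN_PATTERN.findall(doc.lower()):
            if token not in stop_words:
                vocabulary.add(token)
    return len(vocabulary)


# Hypothetical mini-corpus and stop word list, for illustration only.
docs = ["The movie was great", "The movie was not great at all"]
toy_stop_words = ["the", "was", "not", "at", "all"]

print("UNIGRAMS with the STOPWORDS", vocabulary_size(docs))
print("UNIGRAMS without the STOPWORDS", vocabulary_size(docs, toy_stop_words))
```

Removing stop words can only shrink the unigram vocabulary by, at most, the length of the stop word list, which is why the two unigram counts above are so close.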


## Bigram count before and after removing the stopwords


In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(2,2))
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('BIGRAMS with the STOPWORDS', text_vect.shape[1])

BIGRAMS with the STOPWORDS 159553


In [14]:
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words('portuguese')

vect = CountVectorizer(ngram_range=(2,2), stop_words=stopwords)
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('BIGRAMS without the STOPWORDS', text_vect.shape[1])

BIGRAMS without the STOPWORDS 145409
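
The bigram count drops from 159553 to 145409, a much larger change than for unigrams. One reason, as far as I understand `CountVectorizer`, is that stop words are filtered out before the n-grams are assembled, so every bigram containing a stop word disappears and the remaining words are paired across the gap. The sketch below illustrates this on hand-written token lists. The helper `distinct_ngrams` and the toy data are assumptions made for this example, not code from the notebook.

```python
def distinct_ngrams(tokens_per_doc, n, stop_words=()):
    """Return the set of distinct n-grams, dropping stop words before joining."""
    stop_words = set(stop_words)
    ngrams = set()
    for tokens in tokens_per_doc:
        kept = [t for t in tokens if t not in stop_words]
        # Slide a window of size n over the remaining tokens.
        for i in range(len(kept) - n + 1):
            ngrams.add(" ".join(kept[i:i + n]))
    return ngrams


# Hypothetical, already tokenized documents and stop word list.
toy_docs = [
    ["the", "plot", "of", "the", "movie"],
    ["the", "plot", "was", "thin"],
]
toy_stop_words = ["the", "of", "was"]

with_stop = distinct_ngrams(toy_docs, 2)
without_stop = distinct_ngrams(toy_docs, 2, toy_stop_words)

print("BIGRAMS with the STOPWORDS", len(with_stop))
print("BIGRAMS without the STOPWORDS", len(without_stop), sorted(without_stop))
```

Note that `plot movie` never occurs as adjacent words in the toy text. It only exists because the stop words between the two words were removed first.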


## Trigram count before and after removing the stopwords

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(3,3))
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('TRIGRAMS with the STOPWORDS', text_vect.shape[1])

TRIGRAMS with the STOPWORDS 228162


In [37]:
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words('portuguese')

vect = CountVectorizer(ngram_range=(3,3), stop_words=stopwords)
vect.fit(df.texto)
text_vect = vect.transform(df.texto)

print('TRIGRAMS without the STOPWORDS', text_vect.shape[1])

TRIGRAMS without the STOPWORDS 177869


## Verbs and Adjectives count

In [17]:
nltk.download('rslp')
nltk.download('punkt')
nltk.download('universal_tagset')
nltk.download('averaged_perceptron_tagger')

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package rslp to
[nltk_data]     C:\Users\LEONARDOTOZOBISINOTO\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LEONARDOTOZOBISINOTO\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\LEONARDOTOZOBISINOTO\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LEONARDOTOZOBISINOTO\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [21]:
word_tokenize('O Hobbit - 7ª Ed. 2013  Produto NovoBilbo Bolseiro é um hobbit que')

['O',
 'Hobbit',
 '-',
 '7ª',
 'Ed',
 '.',
 '2013',
 'Produto',
 'NovoBilbo',
 'Bolseiro',
 'é',
 'um',
 'hobbit',
 'que']

In [22]:
df['tokens'] = df.texto.apply(word_tokenize)
df.tokens.head()

0    [O, Hobbit, -, 7ª, Ed, ., 2013, Produto, NovoB...
1    [Livro, -, It, A, Coisa, -, Stephen, King, Pro...
2    [Box, As, Crônicas, De, Gelo, E, Fogo, Pocket,...
3    [Box, Harry, Potter, Produto, Novo, e, Físico,...
4    [Livro, Origem, -, Dan, Brown, Produto, NovoDe...
Name: tokens, dtype: object

In [23]:
df['tags'] = df.tokens.apply(pos_tag, tagset='universal')

df.tags.head(11)

0     [(O, NOUN), (Hobbit, NOUN), (-, .), (7ª, NUM),...
1     [(Livro, NOUN), (-, .), (It, PRON), (A, DET), ...
2     [(Box, NOUN), (As, ADP), (Crônicas, NOUN), (De...
3     [(Box, NOUN), (Harry, NOUN), (Potter, NOUN), (...
4     [(Livro, NOUN), (Origem, NOUN), (-, .), (Dan, ...
5     [(Mais, NOUN), (Escuro, NOUN), (-, .), (Cinque...
6     [(O, NOUN), (Silmarillion, NOUN), (-, .), (5ª,...
7     [(O, NOUN), (Pequeno, NOUN), (Principe, NOUN),...
8     [(Ed, NOUN), (&, CONJ), (Lorraine, NOUN), (War...
9     [(Box, NOUN), (-, .), (Franz, NOUN), (Kafka, N...
10    [(Livros, VERB), (The, DET), (Walking, NOUN), ...
Name: tags, dtype: object

In [24]:
df.tags[0]

[('O', 'NOUN'),
 ('Hobbit', 'NOUN'),
 ('-', '.'),
 ('7ª', 'NUM'),
 ('Ed', 'NOUN'),
 ('.', '.'),
 ('2013', 'NUM'),
 ('Produto', 'NOUN'),
 ('NovoBilbo', 'NOUN'),
 ('Bolseiro', 'NOUN'),
 ('é', 'NOUN'),
 ('um', 'ADJ'),
 ('hobbit', 'NOUN'),
 ('que', 'ADJ'),
 ('leva', 'NOUN'),
 ('uma', 'ADJ'),
 ('vida', 'NOUN'),
 ('confortável', 'NOUN'),
 ('e', 'NOUN'),
 ('sem', 'NOUN'),
 ('ambições', 'NOUN'),
 ('.', '.'),
 ('Mas', 'NOUN'),
 ('seu', 'VERB'),
 ('contentamento', 'ADJ'),
 ('é', 'NOUN'),
 ('perturbado', 'NOUN'),
 ('quando', 'NOUN'),
 ('Gandalf', 'NOUN'),
 (',', '.'),
 ('o', 'NOUN'),
 ('mago', 'NOUN'),
 (',', '.'),
 ('e', 'X'),
 ('uma', 'ADJ'),
 ('companhia', 'NOUN'),
 ('de', 'ADP'),
 ('anões', 'X'),
 ('batem', 'NOUN'),
 ('à', 'NOUN'),
 ('sua', 'NOUN'),
 ('porta', 'NOUN'),
 ('e', 'VERB'),
 ('levam-no', 'ADJ'),
 ('para', 'NOUN'),
 ('uma', 'ADJ'),
 ('expedição', 'NOUN'),
 ('.', '.'),
 ('Eles', 'NOUN'),
 ('têm', 'VERB'),
 ('um', 'ADJ'),
 ('plano', 'NOUN'),
 ('para', 'NOUN'),
 ('roubar', 'NOUN'),
 ('

In [25]:
from collections import Counter

counter = Counter()

counter['a'] += 2

counter

Counter({'a': 2})

In [34]:
from collections import Counter

counter = Counter()

for tags in df.tags:
  for _, tag in tags:
    counter[tag] += 1
    
print('Verbs', counter.get('VERB'))
print('Adjectives', counter.get('ADJ'))

Verbs 41774
Adjectives 50788
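
The counting loop above can be tried without running the tagger. The sketch below uses a few hand-written `(token, tag)` pairs as a stand-in for two rows of `df.tags` (the pairs are made up for illustration) and builds the same kind of `Counter` in a single expression. It also shows a small difference worth knowing: indexing a `Counter` with a missing tag gives `0`, while `.get()` gives `None`.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for two rows of df.tags.
toy_tags = [
    [("They", "PRON"), ("have", "VERB"), ("a", "DET"), ("good", "ADJ"), ("plan", "NOUN")],
    [("Dwarves", "NOUN"), ("knock", "VERB"), ("loudly", "ADV")],
]

# Counter accepts any iterable, so the nested loop collapses into one expression.
tag_counts = Counter(tag for tags in toy_tags for _, tag in tags)

print("Verbs", tag_counts["VERB"])
print("Adjectives", tag_counts["ADJ"])
print("Numerals via indexing", tag_counts["NUM"])   # tag never seen: 0
print("Numerals via get", tag_counts.get("NUM"))    # tag never seen: None
```

Indexing is the safer choice when the result feeds into arithmetic, since a tag that never occurs then counts as zero instead of `None`.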


In [27]:
counter.get('NOUN')

357208

## Applying a stemmer to a sentence

In [29]:
from nltk.stem import PorterStemmer
from nltk.stem.rslp import RSLPStemmer


tokens = df.tokens[0]


ps = PorterStemmer()
rslp = RSLPStemmer()

for tok in tokens:
  print('PorterStemmer: %s \t\t RSLPStemmer: %s' % (ps.stem(tok), rslp.stem(tok)))


PorterStemmer: O 		 RSLPStemmer: o
PorterStemmer: hobbit 		 RSLPStemmer: hobbit
PorterStemmer: - 		 RSLPStemmer: -
PorterStemmer: 7ª 		 RSLPStemmer: 7ª
PorterStemmer: Ed 		 RSLPStemmer: ed
PorterStemmer: . 		 RSLPStemmer: .
PorterStemmer: 2013 		 RSLPStemmer: 2013
PorterStemmer: produto 		 RSLPStemmer: produt
PorterStemmer: novobilbo 		 RSLPStemmer: novobilb
PorterStemmer: bolseiro 		 RSLPStemmer: bols
PorterStemmer: é 		 RSLPStemmer: é
PorterStemmer: um 		 RSLPStemmer: um
PorterStemmer: hobbit 		 RSLPStemmer: hobbit
PorterStemmer: que 		 RSLPStemmer: que
PorterStemmer: leva 		 RSLPStemmer: lev
PorterStemmer: uma 		 RSLPStemmer: uma
PorterStemmer: vida 		 RSLPStemmer: vid
PorterStemmer: confortável 		 RSLPStemmer: confort
PorterStemmer: e 		 RSLPStemmer: e
PorterStemmer: sem 		 RSLPStemmer: sem
PorterStemmer: ambiçõ 		 RSLPStemmer: amb
PorterStemmer: . 		 RSLPStemmer: .
PorterStemmer: ma 		 RSLPStemmer: mas
PorterStemmer: seu 		 RSLPStemmer: seu
PorterStemmer: contentamento 		 RSLPStem

## Unigram count after applying the stemmer

In [30]:
from nltk.stem.rslp import RSLPStemmer
from sklearn.feature_extraction.text import CountVectorizer


rslp = RSLPStemmer()

def stem_pandas(line):
  return ' '.join([rslp.stem(token) for token in line])

df['stemmer'] = df.tokens.apply(stem_pandas)

df.stemmer.head()

0    o hobbit - 7ª ed . 2013 produt novobilb bols é...
1    livr - it a cois - stephen king produt novodur...
2    box as crôn de gel e fog pocket 5 livr produt ...
3    box harry pott produt nov e físic a séri harry...
4    livr orig - dan brown produt novod ond vi ? pa...
Name: stemmer, dtype: object

In [35]:
stopwords = nltk.corpus.stopwords.words('portuguese')
vect = CountVectorizer(ngram_range=(1,1), stop_words=stopwords)
vect.fit(df.stemmer)

text_vect = vect.transform(df.stemmer)

print('UNIGRAMS without the STOPWORDS', text_vect.shape[1])

UNIGRAMS without the STOPWORDS 26528
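
The vocabulary shrinks from 35310 unigrams to 26528 after stemming because inflected forms of a word collapse into a single stem. The sketch below shows the effect with a deliberately naive suffix stripper. `toy_stem` is a hypothetical helper written for this example. It is not the RSLP or Porter algorithm, and the token list is made up.

```python
def toy_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical tokens: several surface forms of two underlying words.
tokens = ["watch", "watched", "watching", "watches", "film", "films"]

raw_vocabulary = set(tokens)
stemmed_vocabulary = {toy_stem(t) for t in tokens}

print("UNIGRAMS before stemming", len(raw_vocabulary))
print("UNIGRAMS after stemming", len(stemmed_vocabulary), sorted(stemmed_vocabulary))
```

Real stemmers apply many more rules, ordered by language-specific priorities, but the consequence for the unigram count is the same: fewer distinct entries in the vocabulary.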
