
Authors: Arnaldo Gualberto and Leandro B. Marinho.
NLTK documentation: https://www.nltk.org/

# NLTK Tutorial

In this notebook, we will learn the basics of the `NLTK` (*__N__atural __L__anguage __T__ool**K**it*) module.

First, let's import the Python libraries used in this tutorial:

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize


Next, let's download the NLTK data packages the tutorial relies on:

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Tokenization

**Tokenization** is the process of splitting a text into a list of tokens. These tokens can be sentences, words, or symbols.

### Sentence Tokenization

In [None]:
text = """Hello Mr. Smith, how are you doing today?
    The weather is great, and city is awesome.
    The sky is pinkish-blue. You shouldn't eat cardboard
"""

tokenized_sent = sent_tokenize(text)
print(tokenized_sent)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
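To see why a trained sentence tokenizer is useful, compare the result above with a naive rule that splits after every `.`, `!` or `?`. The sketch below uses only the standard `re` module. The `sample` string is a shortened version of the text above and the splitting rule is an illustrative assumption, not what `sent_tokenize` does internally.

```python
import re

# Shortened version of the tutorial text (illustrative input).
sample = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive rule: split on whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
# ['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great.']
```

The naive rule breaks the first sentence at the abbreviation "Mr.", whereas `sent_tokenize` kept "Hello Mr. Smith, how are you doing today?" as a single sentence.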


We can also tokenize text in other languages:

In [None]:
portuguese_text = "Bom dia, Sr. Smith. Como você está? O tempo está bom, e a cidade maravilhosa."

print(sent_tokenize(portuguese_text, "portuguese"))

['Bom dia, Sr. Smith.', 'Como você está?', 'O tempo está bom, e a cidade maravilhosa.']


### Word Tokenization

In [None]:
tokenized_word = word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
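For comparison, here is a minimal regex-based word tokenizer written with the standard library only. The function name `simple_word_tokenize` and its pattern are hypothetical, chosen for illustration. It is a rough approximation and does not reproduce the rules that `word_tokenize` applies.

```python
import re

def simple_word_tokenize(text):
    """Return words (keeping inner hyphens/apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard")
print(tokens)
# ['The', 'sky', 'is', 'pinkish-blue', '.', 'You', "shouldn't", 'eat', 'cardboard']
```

Unlike the NLTK output above, this sketch keeps "shouldn't" as one token instead of splitting it into 'should' and "n't", and it would separate the period from an abbreviation such as "Mr.".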


# Stopwords

In [None]:
stop_words = set(stopwords.words("english"))

print(stop_words)

{'but', 'only', 'here', 'that', "weren't", "needn't", 't', 'we', "that'll", 'a', 'most', 'her', 'on', 'above', 'against', 'aren', 'who', 'ourselves', "wouldn't", 'yourselves', 'again', 'with', "shan't", 'hadn', 've', 'few', 'were', 'the', "you'd", "mustn't", 'during', 'just', 'both', 'then', 'other', 'in', 'too', 'any', 'me', 'was', 'what', 'very', 'below', 'under', 'don', 'should', 'having', 'ain', 'ours', 'and', 'same', 'now', "she's", 'did', 'd', 'i', 'for', 'off', 'at', 'its', 'he', 'once', 'll', 'himself', "you've", 'how', 'out', 'further', 'him', 'whom', 'is', 's', 'from', 'each', 'not', 'hasn', "mightn't", 'up', 'or', 'm', 'does', 'your', 'has', 'of', 'didn', 'own', 'because', 'been', 'so', 'can', 'more', "doesn't", 'it', 'through', "didn't", 'am', "hadn't", 're', 'wouldn', 'an', 'mightn', 'mustn', 'by', "it's", 'ma', 'needn', 'have', 'doing', 'them', 'yours', 'y', 'his', 'themselves', 'why', 'wasn', "you'll", "isn't", 'our', 'all', 'do', 'these', 'as', 'no', 'doesn', 'theirs', 

In [None]:
filtered_words = [word for word in tokenized_word if word not in stop_words]

print("Tokenized Words:", tokenized_word)
print("Filtered Sentence:", filtered_words)

Tokenized Words: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']
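Note that the filtered output still contains 'The' and 'You': the stopword list is lowercase and the comparison is case-sensitive. A possible refinement, sketched below, lowercases each token before the lookup and also drops tokens made only of punctuation. The small `demo_stop_words` set is a hand-picked stand-in so the example runs without downloading NLTK data. In the notebook you would use `stop_words` instead.

```python
import string

# Hand-picked stand-in for the NLTK stopword set (illustrative assumption).
demo_stop_words = {"the", "is", "and", "how", "are", "you", "doing", "should"}

tokens = ['The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']

def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Compare the lowercased token against the stopword set and skip punctuation.
filtered = [t for t in tokens
            if t.lower() not in demo_stop_words and not is_punctuation(t)]
print(filtered)
# ['weather', 'great', 'city', 'awesome']
```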


# Stemming

**Stemming** reduces words to their stems. For example, the words *connection*, *connected*, and *connecting* are all reduced to "*connect*". There are several stemming algorithms, but the best known is the Porter stemmer, available in NLTK as `PorterStemmer`.

In [None]:
example_words = ['connect', 'connected', 'connecting']

ps = PorterStemmer()

stemmed_words = [ps.stem(w) for w in example_words]

print("Filtered Sentence:", example_words)
print("Stemmed Sentence:", stemmed_words)

Filtered Sentence: ['connect', 'connected', 'connecting']
Stemmed Sentence: ['connect', 'connect', 'connect']
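To make the idea of suffix stripping concrete, here is a toy stemmer that removes a handful of English suffixes. It is a deliberately simplified sketch and not the Porter algorithm: the `SUFFIXES` tuple, the minimum stem length, and the name `toy_stem` are all assumptions made for illustration.

```python
# Suffixes are tried in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection"]
print([toy_stem(w) for w in words])
# ['connect', 'connect', 'connect', 'connect']
```

Real stemmers apply many more rules and conditions, which is why `PorterStemmer` is preferable outside of a demonstration.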


The `Snowball` stemmer can stem words in several languages. The list printed below contains 15 languages plus the original `porter` algorithm:

In [None]:
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [None]:
example_words = ['conexão', 'conectado', 'conectando', 'conectar']

ss = SnowballStemmer("portuguese")

stemmed_words = [ss.stem(w) for w in example_words]

print('Stemmed sentence:', stemmed_words)

Stemmed sentence: ['conexã', 'conect', 'conect', 'conect']


# Lemmatization

**Lemmatization** reduces words to their base form, known as the *lemma*. For example, the word "better" has "good" as its lemma. In general, it is more sophisticated than stemming because it takes context into account, such as the word's part of speech. However, it is slower than stemming.

In [None]:
stemmer = PorterStemmer()
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('are'))
print(stemmer.stem('geese'))
print(stemmer.stem('went'))

stone
speak
are
gees
went


In [None]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking',pos='v'))
print(lemmatizer.lemmatize('are',pos='v'))
print(lemmatizer.lemmatize('geese'))
print(lemmatizer.lemmatize('went',pos='v'))

stone
speak
be
goose
go


# POS Tagging

The main goal of **Part-of-Speech (POS)** tagging is to identify the grammatical category of a given word: *noun, pronoun, adjective, verb, adverb*, etc. It takes context into account, looking at relationships within the sentence, and assigns a corresponding tag to each word.

In [None]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens = nltk.word_tokenize(sent)
print('Sentence:', tokens)

nltk.pos_tag(tokens)

Sentence: ['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']


[('Albert', 'NNP'),
 ('Einstein', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Ulm', 'NNP'),
 (',', ','),
 ('Germany', 'NNP'),
 ('in', 'IN'),
 ('1879', 'CD'),
 ('.', '.')]
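The tags above follow the Penn Treebank convention, while `WordNetLemmatizer.lemmatize` expects the single-letter codes 'n', 'v', 'a' or 'r' in its `pos` argument. A common way to connect the two is a small mapping function. The sketch below is one possible version: the name `penn_to_wordnet` is hypothetical, and defaulting to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (defaults to noun)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBN, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun

# A few (word, tag) pairs copied from the output above.
tagged = [("Albert", "NNP"), ("was", "VBD"), ("born", "VBN"), ("in", "IN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('Albert', 'n'), ('was', 'v'), ('born', 'v'), ('in', 'n')]
```

The returned letter can then be passed as `pos=` to `lemmatizer.lemmatize`, so the part of speech no longer has to be typed by hand for each word.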

# N-Grams

Overlapping sequences of n consecutive words.

In [None]:
from nltk import bigrams
string_bigrams = list(bigrams(tokenized_word))
print(string_bigrams)


[('Hello', 'Mr.'), ('Mr.', 'Smith'), ('Smith', ','), (',', 'how'), ('how', 'are'), ('are', 'you'), ('you', 'doing'), ('doing', 'today'), ('today', '?'), ('?', 'The'), ('The', 'weather'), ('weather', 'is'), ('is', 'great'), ('great', ','), (',', 'and'), ('and', 'city'), ('city', 'is'), ('is', 'awesome'), ('awesome', '.'), ('.', 'The'), ('The', 'sky'), ('sky', 'is'), ('is', 'pinkish-blue'), ('pinkish-blue', '.'), ('.', 'You'), ('You', 'should'), ('should', "n't"), ("n't", 'eat'), ('eat', 'cardboard')]
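Bigrams are the case n = 2. The same idea generalizes to any n by zipping the token list with shifted copies of itself. The helper below is a standard-library sketch with a hypothetical name, `ngrams_sketch`. NLTK also provides its own `ngrams` function for this purpose.

```python
def ngrams_sketch(tokens, n):
    """Return the overlapping n-token tuples of a token list."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Hello', 'Mr.', 'Smith', ',', 'how']
print(ngrams_sketch(tokens, 2))
# [('Hello', 'Mr.'), ('Mr.', 'Smith'), ('Smith', ','), (',', 'how')]
print(ngrams_sketch(tokens, 3))
# [('Hello', 'Mr.', 'Smith'), ('Mr.', 'Smith', ','), ('Smith', ',', 'how')]
```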
