# **Activity 1 - Tokenization, Stemming, Lemmatization and Regular Expressions in NLP**

> **Question**: [Python re.split() vs nltk word_tokenize and sent_tokenize](https://stackoverflow.com/questions/35345761/python-re-split-vs-nltk-word-tokenize-and-sent-tokenize/35348340#35348340)

#### **Team**
* Camille Santanta
* Nayla Chagas
* Túlio Gois

In [None]:
!pip install --quiet nltk

## **re.split() vs nltk tokenizer**

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### **str.split()**

In [None]:
sent = "This is a foo, bar sentence."
sent.split()

['This', 'is', 'a', 'foo,', 'bar', 'sentence.']

### **nltk tokenizer**

In [None]:
from nltk import word_tokenize

word_tokenize(sent)

['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
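The cells above compare `str.split()` with `word_tokenize`, but never call `re.split()` itself. The sketch below is one possible regex-based comparison, not NLTK's method: the patterns `\s+` and `\w+|[^\w\s]` are assumptions chosen for this example sentence, and they do not handle contractions, abbreviations, or other cases that `word_tokenize` covers.

```python
import re

sent = "This is a foo, bar sentence."

# Splitting on runs of whitespace gives the same tokens as str.split()
# for this input: punctuation stays attached to the neighboring word.
whitespace_tokens = re.split(r"\s+", sent)

# \w+ matches a run of word characters; [^\w\s] matches a single
# punctuation mark, so commas and periods become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", sent)

print(whitespace_tokens)
print(regex_tokens)
```

For this sentence the second pattern produces the same token list as `word_tokenize`, which is why a plain regex can look sufficient on simple input.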

## **Measuring processing time**

In [None]:
import time
import urllib.request

# Download the test corpus once and decode it to text.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Time 10 passes of whitespace splitting over every line of the corpus.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

# Time 10 passes of NLTK's word_tokenize over the same lines.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)

str.split():	 0.06275653839111328
str.split():	 0.05916309356689453
str.split():	 0.0557255744934082
str.split():	 0.05568814277648926
str.split():	 0.05805182456970215
str.split():	 0.05927252769470215
str.split():	 0.06445670127868652
str.split():	 0.0639803409576416
str.split():	 0.058779001235961914
str.split():	 0.05937671661376953
word_tokenize():	 5.144822597503662
word_tokenize():	 3.4601986408233643
word_tokenize():	 3.4366424083709717
word_tokenize():	 4.27911901473999
word_tokenize():	 4.30973482131958
word_tokenize():	 3.406015396118164
word_tokenize():	 3.4833948612213135
word_tokenize():	 5.541997909545898
word_tokenize():	 3.72883939743042
word_tokenize():	 3.4897234439849854
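The loops above time each pass by subtracting `time.time()` values. As an alternative sketch, the standard-library `timeit` module can repeat the measurement and report the best run, which is usually less sensitive to background noise. The `lines` list and the `split_all` helper are hypothetical stand-ins so the example runs without downloading the corpus.

```python
import timeit

# Hypothetical stand-in for the downloaded corpus.
lines = ["This is a foo, bar sentence."] * 1000


def split_all(lines):
    """Tokenize every line with whitespace splitting."""
    return [line.split() for line in lines]


# Run 5 independent measurements, each making 10 passes over the data.
passes = 10
timings = timeit.repeat(lambda: split_all(lines), repeat=5, number=passes)

# The minimum is the measurement least disturbed by other processes.
best = min(timings) / passes
print(f"str.split(): best of 5 runs: {best:.6f} s per pass")
```

Swapping `split_all` for a function that calls `word_tokenize` would time the NLTK tokenizer the same way.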


### **Measuring the time for the TokTok tokenizer**

In [None]:
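from nltk.tokenize import ToktokTokenizer

# Download the same test corpus used in the previous cell.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Bind the tokenize method once so the loop does not re-create the tokenizer.
toktok = ToktokTokenizer().tokenize

# Time 10 passes of the TokTok tokenizer over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)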

toktok:	 1.2742667198181152
toktok:	 2.208367347717285
toktok:	 2.056178092956543
toktok:	 1.2482244968414307
toktok:	 1.236565351486206
toktok:	 1.2486047744750977
toktok:	 1.2709181308746338
toktok:	 1.2581582069396973
toktok:	 1.2600646018981934
toktok:	 1.2960405349731445
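To compare the three sets of timings at a glance, the sketch below summarizes them with the median, which is less affected by the occasional slow run than the mean. The values are copied from the outputs above and rounded to four decimals; the choice of the median as the summary statistic is an assumption of this example, not part of the original comparison.

```python
from statistics import median

# Wall-clock seconds copied from the runs above, rounded to four decimals.
timings = {
    "str.split()": [0.0628, 0.0592, 0.0557, 0.0557, 0.0581,
                    0.0593, 0.0645, 0.0640, 0.0588, 0.0594],
    "word_tokenize()": [5.1448, 3.4602, 3.4366, 4.2791, 4.3097,
                        3.4060, 3.4834, 5.5420, 3.7288, 3.4897],
    "toktok": [1.2743, 2.2084, 2.0562, 1.2482, 1.2366,
               1.2486, 1.2709, 1.2582, 1.2601, 1.2960],
}

# Median time per tokenizer, then the slowdown relative to str.split().
medians = {name: median(values) for name, values in timings.items()}
baseline = medians["str.split()"]

for name, value in medians.items():
    print(f"{name}: median {value:.4f} s, {value / baseline:.1f}x str.split()")
```

On these numbers both NLTK tokenizers are well over an order of magnitude slower than whitespace splitting, with TokTok the faster of the two.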


## **Comparing with the native Perl implementation of the TokTok tokenizer**

In [None]:
!wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
!wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt

--2025-06-16 23:06:45--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3003 (2.9K) [text/plain]
Saving to: ‘tok-tok.pl.1’


2025-06-16 23:06:45 (35.0 MB/s) - ‘tok-tok.pl.1’ saved [3003/3003]

--2025-06-16 23:06:45--  http://wget/
Resolving wget (wget)... failed: Name or service not known.
wget: unable to resolve host address ‘wget’
--2025-06-16 23:06:45--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, await

In [None]:
!time perl tok-tok.pl < test.txt > /tmp/null


real	0m1.348s
user	0m1.192s
sys	0m0.035s


### **sent_tokenize()**

In [None]:
text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """

answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""


In [None]:
sum(1 for sent in text.split('\n') if sent in answer)

0
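The count is 0 because `text` contains no newline characters, so `text.split('\n')` returns a single element holding the whole passage, and that string is not a substring of `answer`. The sketch below reproduces the effect on a much shorter, hypothetical pair of strings (`text` and `answer` here are stand-ins, not the passage above).

```python
# Hypothetical short stand-ins: `text` is one line, `answer` has one
# sentence per line, and one sentence boundary in `text` lacks a space.
text = "First sentence. Second sentence.Third sentence."
answer = "First sentence.\nSecond sentence.\nThird sentence."

# With no newline in `text`, split('\n') yields the whole string once.
pieces = text.split("\n")
matches = sum(1 for sent in pieces if sent in answer)
print(len(pieces), matches)

# Checking the other way round shows every expected sentence is present
# in `text`; newline splitting simply cannot find the boundaries.
expected = answer.split("\n")
found = sum(1 for sent in expected if sent in text)
print(len(expected), found)
```

Newline splitting only works when the input already has one sentence per line, which is the gap a sentence tokenizer is meant to fill.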

In [None]:
display(text.split('\n'))

['In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre\'s observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original p

In [None]:
from nltk import sent_tokenize
sent_tokenize(text)

['In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.',
 'Such were Willarski and even the Grand Master of the principal lodge.',
 'Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.',
 "These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing.",
 'Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.',
 'He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated 