
In [None]:
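import re
import nltk

# Resources used below: Punkt models (tokenization), RSLP rules
# (Portuguese stemmer), WordNet (lemmatization).
# Recent NLTK releases may additionally need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('rslp')
nltk.download('wordnet')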

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
text = (
    "What do you think about the latest news? "
    "A group of scientists led by dr. Smith has discovered "
    "a new method to stop the spread of pandemic."
)

Tokenization

In [None]:
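# Naive sentence split: a raw string avoids the invalid '\?' escape, and '?'
# needs no escaping inside a character class. Note the wrong split after "dr.".
re.split(r'[.?!]', text)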

['What do you think about the latest news',
 ' A group of scientists led by dr',
 ' Smith has discovered a new method to stop the spread of pandemic',
 '']
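
The split above drops the punctuation and leaves a trailing empty string. A minimal sketch of one possible refinement, assuming a lookbehind-based pattern is acceptable: it keeps the end-of-sentence mark attached to each sentence, but it still breaks after the abbreviation "dr.", which is the case nltk.sent_tokenize handles. The helper name split_sentences is hypothetical.

```python
import re

text = (
    "What do you think about the latest news? "
    "A group of scientists led by dr. Smith has discovered "
    "a new method to stop the spread of pandemic."
)

def split_sentences(text):
    # Split on whitespace that follows '.', '?' or '!'. The lookbehind
    # matches a position, so the punctuation stays in the sentence.
    return re.split(r'(?<=[.?!])\s+', text.strip())

sentences = split_sentences(text)
for sentence in sentences:
    print(repr(sentence))
```

This yields three pieces rather than two, because the pattern has no notion of abbreviations.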

In [None]:
text.split()

['What',
 'do',
 'you',
 'think',
 'about',
 'the',
 'latest',
 'news?',
 'A',
 'group',
 'of',
 'scientists',
 'led',
 'by',
 'dr.',
 'Smith',
 'has',
 'discovered',
 'a',
 'new',
 'method',
 'to',
 'stop',
 'the',
 'spread',
 'of',
 'pandemic.']

In [None]:
nltk.sent_tokenize(text)

['What do you think about the latest news?',
 'A group of scientists led by dr. Smith has discovered a new method to stop the spread of pandemic.']

In [None]:
nltk.word_tokenize(text)

['What',
 'do',
 'you',
 'think',
 'about',
 'the',
 'latest',
 'news',
 '?',
 'A',
 'group',
 'of',
 'scientists',
 'led',
 'by',
 'dr.',
 'Smith',
 'has',
 'discovered',
 'a',
 'new',
 'method',
 'to',
 'stop',
 'the',
 'spread',
 'of',
 'pandemic',
 '.']
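
For comparison, a rough regex-only word tokenizer can be sketched as follows. It is not how nltk.word_tokenize works internally; it is an illustrative approximation, and the name simple_word_tokenize is hypothetical. Unlike the NLTK output above, it separates "dr" from its period.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches one character that is neither a word character nor
    # whitespace, i.e. a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Led by dr. Smith, right?")
print(tokens)
# ['Led', 'by', 'dr', '.', 'Smith', ',', 'right', '?']
```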

Stemming

In [None]:
# RSLPStemmer targets Portuguese; it is included only for comparison
# with the three English stemmers.
stemmers = [
    nltk.stem.LancasterStemmer(),
    nltk.stem.PorterStemmer(),
    nltk.stem.RSLPStemmer(),
    nltk.stem.SnowballStemmer(language="english"),
]

# Tokenize once and reuse the tokens for every stemmer.
tokens = nltk.word_tokenize(text)
for stemmer in stemmers:
    stems = map(stemmer.stem, tokens)
    output_text = ' '.join(stems)
    print(f'{stemmer.__class__.__name__}: \n', output_text, '\n')

LancasterStemmer: 
 what do you think about the latest new ? a group of sci led by dr. smi has discov a new method to stop the spread of pandem . 

PorterStemmer: 
 what do you think about the latest news ? a group of scientist led by dr. smith ha discov a new method to stop the spread of pandem . 

RSLPStemmer: 
 what do you think about the latest new ? a group of scientist led by dr. smith ha discovered a new method to stop the spread of pandemic . 

SnowballStemmer: 
 what do you think about the latest news ? a group of scientist led by dr. smith has discov a new method to stop the spread of pandem . 
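
To make the idea behind these outputs concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter, Lancaster, RSLP, or Snowball algorithm; the suffix list, the minimum stem length, and the name naive_stem are all assumptions chosen for illustration. It does reproduce one effect visible above: "news" is cut down to "new".

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    # Lowercase, then strip the first matching suffix, but only if at
    # least min_stem_len characters would remain.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["scientists", "discovered", "news", "has", "Smith"]:
    print(w, "->", naive_stem(w))
```

Real stemmers apply many ordered rewrite rules with conditions on the remaining stem, which is why their results differ from each other and from this sketch.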



In [None]:
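# Polish sample. In English: "I wonder how algorithms designed for English
# will cope with text in Polish. English differs significantly, after all, right?"
polish_text = (
    "Ciekawe jak algorytmy dedykowane dla "
    "języka angielskiego poradzą sobie z "
    "tekstem w języku polskim. Angielski "
    "różni się przecież znacząco, prawda?"
)
polish_tokens = nltk.word_tokenize(polish_text)
for stemmer in stemmers:
    stems = map(stemmer.stem, polish_tokens)
    output_text = " ".join(stems)
    print(f"{stemmer.__class__.__name__}:\n", output_text, "\n")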

LancasterStemmer:
 ciekaw jak algorytmy dedykow dla języka angielskiego poradzą soby z tekstem w języku polskim . angielsk różni się przecież znacząco , prawd ? 

PorterStemmer:
 ciekaw jak algorytmi dedykowan dla języka angielskiego poradzą sobi z tekstem w języku polskim . angielski różni się przecież znacząco , prawda ? 

RSLPStemmer:
 ciekaw jak algorytmy dedykowan dla język angielskieg poradzą sobi z tekst w języku polskim . angielsk różn się przecież znacząc , prawd ? 

SnowballStemmer:
 ciekaw jak algorytmi dedykowan dla języka angielskiego poradzą sobi z tekstem w języku polskim . angielski różni się przecież znacząco , prawda ? 



Lemmatization

In [None]:
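lemmatizer = nltk.stem.WordNetLemmatizer()
tokens = nltk.word_tokenize(text)
# lemmatize() defaults to pos='n' (noun), so verb forms such as
# "discovered" are left unchanged and "has" is reduced to "ha".
lemmas = map(lemmatizer.lemmatize, tokens)
output_text = ' '.join(lemmas)
output_text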

'What do you think about the latest news ? A group of scientist led by dr. Smith ha discovered a new method to stop the spread of pandemic .'

In [None]:
from nltk.corpus.reader import wordnet
# Passing the part of speech lets WordNet resolve the irregular comparative.
lemmatizer.lemmatize('worse', pos=wordnet.ADJ)

'bad'

In [None]:
# Without pos, the word is treated as a noun and comes back unchanged.
lemmatizer.lemmatize('worse')

'worse'
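
Since the result depends on the part of speech, a common pattern is to tag tokens first and translate each tag into one of WordNet's POS letters ('n', 'v', 'a', 'r'). The helper below is a minimal sketch of that translation, assuming Penn Treebank tags as input; the name penn_to_wordnet and the hand-written tags in the example are assumptions, not tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs used only for illustration.
tagged = [("worse", "JJR"), ("discovered", "VBN"), ("scientists", "NNS")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('worse', 'a'), ('discovered', 'v'), ('scientists', 'n')]
```

With NLTK, the tags would typically come from nltk.pos_tag(tokens), which needs an additional tagger resource to be downloaded, and each mapped letter would be passed as the pos argument of lemmatizer.lemmatize.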