## Tokenizing words and sentences

In [23]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [24]:
example_sentences = "Natural Language Processing is the task we give computers to read and understand (process) written text (natural language)"

In [30]:
tokenized_sentences = word_tokenize(example_sentences)
print(tokenized_sentences)

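['Natural', 'Language', 'Processing', 'is', 'the', 'task', 'we', 'give', 'computers', 'to', 'read', 'and', 'understand', '(', 'process', ')', 'written', 'text', '(', 'natural', 'language', ')']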
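
The import above also brings in `sent_tokenize`, which splits text into sentences rather than words. As a rough illustration of the idea only, here is a minimal sketch that uses a plain regular expression. The helper name `naive_sent_split` is hypothetical, and this is not how NLTK works internally (NLTK relies on the trained Punkt model); it simply breaks after `.`, `!` or `?` followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Toy rule: split after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "Hello there. How are you today? The weather is great!"
sentences = naive_sent_split(text)
print(sentences)

# The toy rule has no notion of abbreviations, so it splits too eagerly here
abbreviation_case = naive_sent_split("Mr. Smith left.")
print(abbreviation_case)
```

The second call shows why a trained sentence tokenizer is usually preferable: the naive rule treats the period in "Mr." as the end of a sentence.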


## Stop words

In [27]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [28]:
example_sentences = "This is an example showing off stop word filtration."
stop_words = set(stopwords.words("english"))

In [29]:
words = word_tokenize(example_sentences)

In [21]:
# Keep only the tokens that are not in the stop word set
filtered_sentences = [w for w in words if w not in stop_words]
filtered_sentences

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
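
Note that 'This' survives the filter: the membership test is case-sensitive, and the entries in NLTK's English stop word list are lowercase. A minimal sketch of a case-insensitive variant follows. To keep it self-contained, `demo_stop_words` is a small hand-picked stand-in (an assumption for illustration) rather than the full `stopwords.words("english")` list.

```python
# Hypothetical stand-in for set(stopwords.words("english")); the real list is much longer
demo_stop_words = {"this", "is", "an", "off"}

words = ["This", "is", "an", "example", "showing", "off",
         "stop", "word", "filtration", "."]

# Case-sensitive test, as in the cell above: 'This' is kept
case_sensitive = [w for w in words if w not in demo_stop_words]

# Lowercase each token before the test so 'This' is filtered as well
case_insensitive = [w for w in words if w.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the task; capitalization sometimes carries meaning that you may want to keep.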

## Stemming

This is a form of data pre-processing in natural language processing called "stemming." The idea is to normalize words by removing the affixes from their ends. We do this so that we do not need to store the meaning of every single tense or form of a word separately. For example:

- Reader
- Reading
- Read

Or, for a more advanced example:

- I was taking a ride in the car
- I was riding in the car

Aside from tense (and the fact that one of these is a noun), they all share the same meaning through their "root" stem (read). This way, we store a single value for the root stem "read." Then, when we wish to learn more, we can look at the affixes that were on the end: "ing" marks an ongoing action (present or past), "er" gives "reader," someone who reads, and plain "read" is either past or present tense.
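
To make the idea concrete before turning to NLTK's stemmer, here is a toy sketch of suffix stripping. The suffix list and the helper `toy_stem` are hypothetical and deliberately simplistic; this is not the Porter algorithm, only an illustration of how several word forms can collapse to one stored root.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; real stemmers use many more rules
SUFFIXES = ("ing", "er", "ed", "s")

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three characters to remain so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Group each word form under its toy stem
groups = defaultdict(list)
for w in ["Reader", "Reading", "Read", "riding", "ride"]:
    groups[toy_stem(w)].append(w)

print(dict(groups))
```

The three "read" forms end up under one key, while "riding" and "ride" do not, because blind suffix stripping leaves "rid" and "ride". Real stemmers apply more careful rules to handle cases like this.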

In [3]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [4]:
ps = PorterStemmer()

In [5]:
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

In [6]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [9]:
new_text = "It is important to be pythonly while you are pythoning with python."
tokenized_new_text = word_tokenize(new_text)

In [10]:
for w in tokenized_new_text:
    print(ps.stem(w))

It
is
import
to
be
pythonli
while
you
are
python
with
python
.
