# Introduction to **nltk**

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

Before you can analyze text data programmatically, you first need to preprocess it. In this chapter, you’ll take your first look at the kinds of text preprocessing tasks you can do with NLTK.

This chapter is based on the following tutorial: https://realpython.com/nltk-nlp-python/

## Tokenizing

Tokenizing refers to splitting up text, typically by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

Let's import the relevant parts of **nltk** so we can tokenize by word or sentence.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

Next, let's create an example string to tokenize.

In [None]:
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

We use `sent_tokenize()` to tokenize by sentence, which results in a `list` of three `strings`.

In [None]:
sent_tokenize(example_string)

["\nMuad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]

Next, let's tokenize by word, which also results in a `list` of `strings`.

In [None]:
print(word_tokenize(example_string))

["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']


Notice that "It's" was split at the apostrophe to give you 'It' and "'s", but "Muad'Dib" was left whole. This happened because the tokenizer has rules for common English contractions: it treats 'It' and "'s" (a contraction of “is”) as two distinct words and counts them separately. "Muad'Dib" isn’t a recognized contraction like "It's", so it wasn’t split and was left intact.
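
To see the contraction rule in isolation, you can call a word tokenizer directly on a short phrase. The sketch below uses `TreebankWordTokenizer`, a rule-based tokenizer that ships with NLTK and needs no downloaded data. It is used here as a stand-in, on the assumption that `word_tokenize()` applies similar Treebank-style rules after splitting the text into sentences.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# "'s" followed by a space matches a contraction rule and is split off.
contraction_tokens = tokenizer.tokenize("It's shocking")

# "'Dib" is not one of the recognized contraction endings, so no rule
# applies and the name stays in one piece.
name_tokens = tokenizer.tokenize("Muad'Dib learned rapidly")

print(contraction_tokens)
print(name_tokens)
```

The first call should give you 'It' and "'s" as separate tokens, while the second keeps "Muad'Dib" whole.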

## Filtering Stop Words

*Stop words* are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text.

Here’s how to import the relevant parts of **nltk** in order to filter out stop words:

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /home/pritam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let's start with an example piece of text.

In [None]:
worf_quote = "Sir, I protest. I am not a merry man!"

In [None]:
words_in_quote = word_tokenize(worf_quote)
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

Next, let's create a `set` of stop words.

In [None]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{'s', 'when', 'your', 'a', 'does', 'as', 'some', 'itself', 'so', "couldn't", 'no', "don't", 'd', 'she', 'the', 'in', 'after', 'shouldn', 'ma', 'against', 'not', 'to', 'weren', 'my', 'but', 'doing', 'with', 'himself', 'own', 'should', "it's", 'this', 'out', 'mustn', "wasn't", 'its', 'hasn', 'same', 'has', 'off', 'now', 'couldn', 'who', "won't", 'am', 'both', 'such', "doesn't", 'than', 'any', 'themselves', 'most', 've', 'about', 'herself', "you've", 'shan', 'them', "isn't", 'what', 'all', 'ourselves', 'will', "shouldn't", "didn't", 'under', 'few', 'll', "that'll", 'm', 'yourself', 'isn', 'aren', 'because', 'over', "you'll", 'for', 'nor', "you'd", 'hers', 're', 'further', 'can', 'those', 'been', 'through', 'then', 'that', 'was', 'they', 'being', 'very', 'wasn', 'their', "aren't", "needn't", 'there', 'o', 'ours', 'her', "hadn't", 'be', 'too', 'if', 'myself', 'these', 'below', 'hadn', 'why', 'are', 'on', 'by', 'and', 'yourselves', 'an', 'while', 'into', 'having', "mightn't", "weren't", 'it'

In [None]:
# filtered_list = []
# for word in words_in_quote:
#     if word.casefold() not in stop_words:
#         filtered_list.append(word)
# filtered_list

Now we can use a `list` comprehension to filter out the words in `stop_words`. Notice that we use `.casefold()` so the comparison ignores whether a `word` is uppercase or lowercase.

In [None]:
filtered_list = [
    word for word in words_in_quote if word.casefold() not in stop_words 
]
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

You filtered out a few words like 'am' and 'a', but you also filtered out 'not', which does affect the overall meaning of the sentence. 

Words like 'I' and 'not' may seem too important to filter out, and depending on what kind of analysis you want to do, they can be.
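
If negations matter for your analysis, one option is to remove them from the stop word set before filtering. Here is a minimal sketch. To keep it runnable without any downloads, it uses a small hand-written stand-in for `stopwords.words('english')`; the names `negations` and `custom_stop_words` are made up for this example.

```python
words_in_quote = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

# Stand-in for set(stopwords.words('english')); only the entries that
# matter for this quote are listed.
stop_words = {'i', 'am', 'not', 'a'}

# Words you want to keep even though they are in the stop word list.
negations = {'not', 'no', 'nor'}
custom_stop_words = stop_words - negations

filtered_list = [
    word for word in words_in_quote
    if word.casefold() not in custom_stop_words
]
print(filtered_list)
```

With the real NLTK list, the same set difference works: `set(stopwords.words('english')) - negations`.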

## Stemming

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but we’ll be using the Porter stemmer.

Here’s how to import the relevant parts of **nltk** in order to start stemming:

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

Now that you’re done importing, you can create a `stemmer` with `PorterStemmer()`:

In [None]:
stemmer = PorterStemmer()

Let's create an example string to stem. Notice that this is a somewhat contrived sentence that uses many different forms of the word *discovery*.

In [None]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""

Before we can start stemming, we need to tokenize by word.

In [None]:
words = word_tokenize(string_for_stemming)
print(words)

['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']


Create a list of the stemmed versions of the words in `words` by using `stemmer.stem()` in a list comprehension:

In [None]:
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

Those results look a little inconsistent. Why would 'Discovery' give you 'discoveri' when 'Discovering' gives you 'discov'?

The Porter stemming algorithm dates from 1979, so it’s a little on the older side. The Snowball stemmer, which is also called Porter2, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. It’s also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.
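
To compare the two stemmers on the example words, you can run them side by side. This is a sketch rather than a benchmark: it assumes `SnowballStemmer("english")` as the way to get NLTK's English Snowball stemmer, and the name `sample_words` is made up for illustration. No expected output is listed here, so you can inspect the results yourself.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Word forms taken from the example sentence above.
sample_words = ["Discovery", "discovered", "discoveries", "Discovering"]

# Stem each word with both stemmers and keep the results together.
comparison = [
    (word, porter.stem(word), snowball.stem(word)) for word in sample_words
]

for word, porter_stem, snowball_stem in comparison:
    print(f"{word:12} porter={porter_stem:10} snowball={snowball_stem}")
```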

Fortunately, you have some other ways to reduce words to their core meaning, such as lemmatizing, which you’ll see later in this chapter. But first, we need to cover parts of speech.

## Parts of Speech

*Part of speech* is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In English, there are eight parts of speech:

1. Noun - a person, place, or thing.
2. Pronoun - replaces a noun.
3. Adjective - gives information about what a noun is like.
4. Verb - an action or a state of being.
5. Adverb - gives information about a verb, an adjective, or another adverb.
6. Preposition - gives information about how a noun or pronoun is connected to another word.
7. Conjunction - connects two other words or phrases.
8. Interjection - an exclamation.

Some sources also include articles (like “a” or “the”) as a separate category in the list of parts of speech, while other sources consider them to be adjectives. **nltk** uses the term *determiner* to refer to articles.

In [None]:
from nltk.tokenize import word_tokenize

Let's create some text to tag.

In [None]:
sagan_quote = """
If you wish to make an apple pie from scratch,
you must first invent the universe."""

Next, let's tokenize by word.

In [None]:
words_in_sagan_quote = word_tokenize(sagan_quote)

Now call `nltk.pos_tag()` on the `list` of tokens/words:

In [None]:
nltk.pos_tag(words_in_sagan_quote)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]
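
Once you have the tagged tokens, you can filter or group them by tag. The sketch below copies the output above into a literal list, so it runs without the tagger, and groups words by their Penn Treebank tag. The names `tagged`, `nouns`, and `words_by_tag` are made up for illustration.

```python
from collections import defaultdict

# Output of nltk.pos_tag(words_in_sagan_quote), copied from above.
tagged = [
    ('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'),
    ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'),
    ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'),
    ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'),
    ('universe', 'NN'), ('.', '.'),
]

# Noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# Group every word under its exact tag.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(nouns)
print(words_by_tag['VB'])
```

Grouping like this also makes questionable tags easier to spot. Here 'first' ends up under `VB`, which looks like a tagging mistake, since 'first' acts as an adverb in this sentence.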

We can examine the meaning of all the parts of speech in the tag set.

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to /home/pritam/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [None]:
# jabberwocky_excerpt = """
# 'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
# all mimsy were the borogoves, and the mome raths outgrabe."""

In [None]:
# words_in_excerpt = word_tokenize(jabberwocky_excerpt)

In [None]:
# nltk.pos_tag(words_in_excerpt)

## Lemmatizing

Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

Note: A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

For example, if you were to look up the word “blending” in a dictionary, then you’d need to look at the entry for “blend,” but you would find “blending” listed in that entry.

In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when you lemmatize a word, you are reducing it to its lemma.

Here’s how to import the relevant parts of **nltk** in order to start lemmatizing:

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Let’s start with lemmatizing a plural noun:

In [None]:
lemmatizer.lemmatize('scarves')

'scarf'

"scarves" gave you 'scarf', so that’s already a bit more sophisticated than what you would have gotten with the Porter stemmer, which is 'scarv'. Next, create a string with more than one word to lemmatize:

In [None]:
string_for_lemmatizing = "The friends of DeSoto love scarves."

Now tokenize that string by word:

In [None]:
words = word_tokenize(string_for_lemmatizing)
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

Create a list containing all the words in `words` after they’ve been lemmatized:

In [None]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
lemmatized_words

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing "worst":

In [None]:
lemmatizer.lemmatize('worst')

'worst'

You got the result 'worst' because `lemmatizer.lemmatize()` treats its input as a noun by default (`pos='n'`). You can make it clear that you want "worst" to be treated as an adjective:

In [None]:
lemmatizer.lemmatize('worst', pos='a')

'bad'
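
Passing `pos` by hand does not scale to whole sentences. A common pattern is to tag the tokens first and translate each Penn Treebank tag into one of the part-of-speech letters that `lemmatize()` accepts. The following is a minimal sketch of that translation. `penn_to_wordnet` is a hypothetical helper, not part of NLTK, and it assumes the letters 'n', 'v', 'a', and 'r' for noun, verb, adjective, and adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    return 'n'      # everything else falls back to noun


for tag in ['JJS', 'VBD', 'RB', 'NN', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

You could then lemmatize each `(word, tag)` pair returned by `nltk.pos_tag()` with `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.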