# Tokenisation


Tokens are *the parts that make up a sentence*. They may or may not correspond exactly to words.

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize
#import nltk
#nltk.download('punkt')

`sent_tokenize` splits a text into sentences, while `word_tokenize` splits it into word-level tokens. A token can be bigger or smaller than a single word (a contraction such as "It's" is split into two tokens), and punctuation marks are tokens too.

In [2]:
example_string = "It is dangerous to go alone. Please take care."

sent_res = sent_tokenize(example_string)
print(sent_res)

['It is dangerous to go alone.', 'Please take care.']


You can see this relies heavily on punctuation.
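
To see why punctuation matters, a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace can be sketched with the standard `re` module. This is only an illustration and not how `sent_tokenize` works internally; the helper name `naive_sent_split` and the second sample string are made up for this example.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

example_string = "It is dangerous to go alone. Please take care."
print(naive_sent_split(example_string))

# An abbreviation trips up the naive rule: 'Dr.' is treated as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

A trained sentence tokenizer is designed to cope with abbreviations better than a fixed rule like this one.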

In [3]:
res = word_tokenize(example_string)
print(res)

['It', 'is', 'dangerous', 'to', 'go', 'alone', '.', 'Please', 'take', 'care', '.']


Note how, this time, each punctuation mark becomes a token of its own.
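
For intuition, a rough approximation of this behaviour can be written with one regular expression: runs of word characters become tokens, and every other non-space character becomes a token by itself. This is a simplified sketch, not the algorithm `word_tokenize` uses; `simple_word_split` is a hypothetical helper name.

```python
import re

def simple_word_split(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

example_string = "It is dangerous to go alone. Please take care."
print(simple_word_split(example_string))
```

On this sentence the sketch gives the same tokens as `word_tokenize`, but it differs on contractions: it would split "It's" into three pieces, whereas `word_tokenize` keeps `'s` together.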

In [5]:
import nltk

nltk.download('omw-1.4')
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package omw-1.4 to /home/pachy/nltk_data...
[nltk_data] Downloading package tagsets to /home/pachy/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package punkt to /home/pachy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/pachy/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pachy/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/pachy/nltk_data...


True

## Stopwords

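Stopwords are very common words (such as "it" or "not") that carry little semantic meaning, so it is often useful to filter them out.

In NLTK jargon, a *corpus* is a collection of texts; the `nltk.corpus` module also gives access to word lists such as the stopwords.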

In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

quote = "It's leviòsa, not leviosà"

words_in_quote = word_tokenize(quote)
words_in_quote

['It', "'s", 'leviòsa', ',', 'not', 'leviosà']

In [10]:
stop_words = set(stopwords.words('english'))
filtered_list = []      #let's take out the stopwords

for word in words_in_quote:
    if word.casefold() not in stop_words:       #casefold makes the inclusion not case-sensitive
        filtered_list.append(word)

print(filtered_list)

["'s", 'leviòsa', ',', 'leviosà']
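
The same filter can be written more compactly as a list comprehension. The sketch below copies the tokens from the output above and uses a tiny hand-written stop list as a stand-in (an assumption, so that it runs without the NLTK data); with NLTK available, `set(stopwords.words('english'))` would take its place. It also shows one possible way of dropping leftover punctuation tokens.

```python
# Tokens copied from the word_tokenize output above.
words_in_quote = ['It', "'s", 'leviòsa', ',', 'not', 'leviosà']

# Tiny stand-in stop list (assumption); in practice use stopwords.words('english').
stop_words = {'it', 'not'}

# Same filter as the loop, written as a list comprehension.
filtered_list = [w for w in words_in_quote if w.casefold() not in stop_words]
print(filtered_list)

# Optionally keep only alphabetic tokens, which also drops "'s" and ','.
alpha_only = [w for w in filtered_list if w.isalpha()]
print(alpha_only)
```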


# Stemming

Stemming is the preprocessing step that reduces a word to its stem by stripping suffixes. The stem is not necessarily a real word.

In [12]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
quote = 'The crew of USS Discovery discovered many discoveries. Discovering is what explorers do.'

In [13]:
words = word_tokenize(quote)
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['the',
 'crew',
 'of',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

The Porter stemmer dates back to 1979, so let's also try NLTK's English Snowball stemmer, a later revision by the same author (also known as Porter2).

In [14]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language = 'english')
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['the',
 'crew',
 'of',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']
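
The two stemmers agree on every token above, so it may help to compare them on a word where they differ. The sketch below is a minimal side-by-side comparison; the word `generously` is not from the original quote and was chosen only because the two algorithms treat it differently.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

# 'discoveries' comes from the quote above; 'generously' is an extra test word.
for word in ['discoveries', 'generously']:
    print(word, '->', porter.stem(word), '|', snowball.stem(word))
```

Here the Porter stemmer cuts `generously` down to `gener`, while the Snowball stemmer stops at `generous`.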

In [15]:
quote = 'Computer Science is no more about computers than astronomy is about telescopes.'

tokens = word_tokenize(quote)
tags = nltk.pos_tag(tokens)     #POS stands for "part of speech"
tags

[('Computer', 'NNP'),
 ('Science', 'NNP'),
 ('is', 'VBZ'),
 ('no', 'DT'),
 ('more', 'RBR'),
 ('about', 'IN'),
 ('computers', 'NNS'),
 ('than', 'IN'),
 ('astronomy', 'NN'),
 ('is', 'VBZ'),
 ('about', 'IN'),
 ('telescopes', 'NNS'),
 ('.', '.')]

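Each of these abbreviations corresponds to a part of speech in the Penn Treebank tagset: for example, `NNS` is a plural noun and `VBZ` a verb in the third person singular present.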
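
As a quick reference, the tags that appear in this output can be spelled out with a small hand-written glossary. The dictionary `TAG_MEANINGS` below is a hypothetical helper written for this example and covers only the tags seen here, not the full tagset.

```python
# Tagged tokens copied from the output above.
tags = [('Computer', 'NNP'), ('Science', 'NNP'), ('is', 'VBZ'), ('no', 'DT'),
        ('more', 'RBR'), ('about', 'IN'), ('computers', 'NNS'), ('than', 'IN'),
        ('astronomy', 'NN'), ('is', 'VBZ'), ('about', 'IN'),
        ('telescopes', 'NNS'), ('.', '.')]

# Hand-written glossary (hypothetical helper) for the Penn Treebank tags seen here.
TAG_MEANINGS = {
    'NNP': 'proper noun, singular',
    'NNS': 'noun, plural',
    'NN': 'noun, singular or mass',
    'VBZ': 'verb, 3rd person singular present',
    'DT': 'determiner',
    'RBR': 'adverb, comparative',
    'IN': 'preposition or subordinating conjunction',
    '.': 'sentence-final punctuation',
}

# Print each token next to its tag and the tag's meaning.
for word, tag in tags:
    print(f"{word:<12}{tag:<5}{TAG_MEANINGS.get(tag, 'unknown')}")
```

With the `tagsets` resource downloaded earlier, `nltk.help.upenn_tagset('NNP')` should print the description of a given tag directly.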

A lemmatizer returns the lemma of a word, i.e. its dictionary form. Unlike a stem, a lemma is always a real word: it may coincide with the root, or be something like the singular of a noun.

In [16]:
from nltk.stem import WordNetLemmatizer

#a simple case

lemmatizer = WordNetLemmatizer()
res = lemmatizer.lemmatize('scarves')

res

'scarf'

In [17]:
#a difficult case: with no pos given, the word is treated as a noun

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('worst')

'worst'

In [18]:
#the same word, this time marked as an adjective

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('worst', pos = 'a')       #pos is the part of speech; 'a' means adjective

'bad'
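
Since the lemmatizer expects WordNet-style letters (`'n'`, `'v'`, `'a'`, `'r'`) while `nltk.pos_tag` returns Penn Treebank tags, a small conversion step is needed to feed one into the other. The function `penn_to_wordnet` below is a hypothetical helper sketching one common way to do this, and the tagged sample is hand-written for illustration rather than produced by the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # fall back to noun, the lemmatizer's default

# Hand-written (word, tag) pairs, assumed for illustration.
tagged = [('worst', 'JJS'), ('telescopes', 'NNS'), ('is', 'VBZ')]

for word, tag in tagged:
    # With WordNet downloaded one could call:
    # lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
    print(word, tag, '->', penn_to_wordnet(tag))
```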

# Chunking

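Chunking groups tagged tokens into phrases by matching patterns over their POS tags, so see regular expressions (regex) first!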
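
As a preview, here is a minimal sketch of noun-phrase chunking with `nltk.RegexpParser`, reusing the tagged tokens from the POS tagging example. The grammar is an assumption chosen for illustration (an optional determiner, any number of adjectives, then one or more nouns); other patterns are equally valid.

```python
import nltk

# Tagged tokens copied from the POS tagging output earlier in this notebook.
tagged = [('Computer', 'NNP'), ('Science', 'NNP'), ('is', 'VBZ'), ('no', 'DT'),
          ('more', 'RBR'), ('about', 'IN'), ('computers', 'NNS'), ('than', 'IN'),
          ('astronomy', 'NN'), ('is', 'VBZ'), ('about', 'IN'),
          ('telescopes', 'NNS'), ('.', '.')]

# Illustrative grammar: optional determiner, adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

The pattern inside the braces works like a regular expression over tags rather than characters, which is why regex comes first.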