# Tokenizing Words and Sentences

[Link to tutorial](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

Definitions:
* __Corpus__: A body of text. The plural is corpora.
* __Lexicon__: Words and their meanings. Example: an English dictionary.
* __Token__: Each "entity" produced when text is split up according to some rules. For example, each word is a token when a sentence is tokenized into words, and each sentence is a token when a paragraph is tokenized into sentences.
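
To make the idea of a token concrete, here is a minimal sketch that splits text with a regular expression from Python's standard library. It is not how NLTK's `word_tokenize` works (NLTK also handles abbreviations such as "Mr." and splits contractions like "shouldn't" into two tokens); the function name `simple_word_tokens` and the pattern are assumptions made for illustration.

```python
import re

def simple_word_tokens(text):
    """Split text into word and punctuation tokens with a naive regex."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions in one piece;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_word_tokens("The sky is pinkish-blue. You shouldn't eat cardboard.")
print(tokens)
```

The rules decide what counts as a token: this pattern keeps "shouldn't" whole, whereas a different rule set could split it.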


In [10]:
from nltk.tokenize import sent_tokenize, word_tokenize
from pprint import pprint

EXAMPLE_TEXT = ("Hello Mr. Smith, how are you doing today? "
                "The weather is great, and Python is awesome. "
                "The sky is pinkish-blue. You shouldn't eat cardboard.")

print("Output of sent_tokenize(EXAMPLE_TEXT):")
pprint(sent_tokenize(EXAMPLE_TEXT), indent=4)

print("\nOutput of word_tokenize(EXAMPLE_TEXT):")
pprint(word_tokenize(EXAMPLE_TEXT), indent=4)

Output of sent_tokenize(EXAMPLE_TEXT):
[   'Hello Mr. Smith, how are you doing today?',
    'The weather is great, and Python is awesome.',
    'The sky is pinkish-blue.',
    "You shouldn't eat cardboard."]

Output of word_tokenize(EXAMPLE_TEXT):
[   'Hello',
    'Mr.',
    'Smith',
    ',',
    'how',
    'are',
    'you',
    'doing',
    'today',
    '?',
    'The',
    'weather',
    'is',
    'great',
    ',',
    'and',
    'Python',
    'is',
    'awesome',
    '.',
    'The',
    'sky',
    'is',
    'pinkish-blue',
    '.',
    'You',
    'should',
    "n't",
    'eat',
    'cardboard',
    '.']


# Stop Words

Stop words are words that carry little meaning for the task at hand and are usually filtered out before further processing. The term is informal: there is no single agreed list, but pronouns, articles, and prepositions are commonly treated as stop words. NLTK's default English list is:

```python
>>> from nltk.corpus import stopwords
>>> set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'} 
```

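The cell below filters these stop words out of a tokenized sentence. The list is all lowercase and the membership test is case-sensitive, so the capitalized 'This' survives the filter.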

In [12]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words   = set(stopwords.words('english'))
word_tokens  = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print("Remaining (word) tokens after filtering out stop words:")
print(filtered_sentence)

Remaining (word) tokens after filtering out stop words:
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
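
The capitalized 'This' and the punctuation tokens remain in the result because the comparison is case-sensitive and punctuation is not in the stop word list. A minimal sketch of one way to handle both, assuming a small hand-picked stop word set (`STOP_WORDS` is a hypothetical subset, used so the example runs without downloading NLTK data):

```python
import string

# Hypothetical subset of stop words; in practice you would use
# set(stopwords.words('english')) from NLTK instead.
STOP_WORDS = {"this", "is", "a", "off", "the"}
PUNCTUATION = set(string.punctuation)

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words (compared in lowercase) and single punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in stop_words and t not in PUNCTUATION]

word_tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
               "off", "the", "stop", "words", "filtration", "."]
print(filter_tokens(word_tokens))
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive.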


# Stemming

[Link to tutorial](https://pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/)

Stemming strips affixes (such as the suffix 'ing' in 'riding') so that related word forms reduce to a common stem. The stem does not have to be a real word, as 'pythonli' in the output below shows.

In [15]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [16]:
new_text = ("It is important to be very pythonly while you are pythoning with python. "
            "All pythoners have pythoned poorly at least once.")
for w in word_tokenize(new_text):
    print(ps.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.
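
To show the basic mechanism, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several phases of rules with conditions on the remaining stem; the suffix list, the `min_stem_len` guard, and the name `naive_stem` are all assumptions made for illustration.

```python
# Assumed suffix list, checked in order; a real stemmer uses many more rules.
SUFFIXES = ("ing", "ed", "er", "ly", "s")

def naive_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word would remain.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[:-len(suffix)]
    return lowered

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(w, "-->", naive_stem(w))
```

Unlike the Porter stemmer output above, this sketch turns 'pythonly' into 'python' rather than 'pythonli', which illustrates that different rule sets produce different stems.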


# Lemmatizing

[Link to tutorial](https://pythonprogramming.net/lemmatizing-nltk-tutorial/)

TL;DR: like stemming, except that the result (the lemma) is always a real word.

In [38]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def print_lemmatize(s, **kwargs):
    print(s, "-->", lemmatizer.lemmatize(s, **kwargs))
for word in ['cats', 'cacti', 'geese', 'rocks', 'python', 'run']:
    print_lemmatize(word)

cats --> cat
cacti --> cactus
geese --> goose
rocks --> rock
python --> python
run --> run


In [39]:
print_lemmatize('better')
# pos is the part of speech; the default is pos='n' (noun).
# 'a' means adjective.
print_lemmatize('better', pos='a')

better --> better
better --> good
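
The 'better' example shows why the part of speech matters: the same surface form can map to different lemmas. A toy sketch of that idea, using a lookup table keyed by word and part of speech (the `LEMMAS` table and `toy_lemmatize` are hypothetical; a real lemmatizer such as `WordNetLemmatizer` relies on a much larger lexicon plus morphological rules):

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma.
LEMMAS = {
    ("geese", "n"): "goose",
    ("cacti", "n"): "cactus",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("geese"))            # looked up as a noun
print(toy_lemmatize("better"))           # no noun entry, returned unchanged
print(toy_lemmatize("better", pos="a"))  # adjective entry found
```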


# The corpora with NLTK

Here's an example of accessing one of the many text documents in the corpora: the King James Bible from the Project Gutenberg selection.

In [42]:
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Load the raw text of the King James Bible.
sample = gutenberg.raw("bible-kjv.txt")

# Split the text into sentences and print the first five.
tok = sent_tokenize(sample)

for sentence in tok[:5]:
    print(sentence)

[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.
And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.


## WordNet

WordNet is a lexical database for the English language, developed at Princeton University and available as one of the NLTK corpora.

In [58]:
from nltk.corpus import wordnet
# Look up the synsets (sets of synonymous lemmas) that contain "program".
syns = wordnet.synsets("program")
print("syns:\n", syns)
print("\n\n")
# Lemma names (the synonyms) in the first synset.
pprint([lemma.name() for lemma in syns[0].lemmas()])

syns:
 [Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]



['plan', 'program', 'programme']


# Machine Learning with NLTK Examples

### Text Classification

In [77]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category) 
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)
# Commented out below because the output is long.
#print("First 10 words of documents[1] (joined):")
#pprint(' '.join(documents[1][0][:10]), width=80)

# Count how often each lowercased word occurs across all reviews.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
pprint(all_words.most_common(15))
print(all_words["stupid"])

[(',', 77717),
 ('the', 76529),
 ('.', 65876),
 ('a', 38106),
 ('and', 35576),
 ('of', 34123),
 ('to', 31937),
 ("'", 30585),
 ('is', 25195),
 ('in', 21822),
 ('s', 18513),
 ('"', 17612),
 ('it', 16107),
 ('that', 15924),
 ('-', 15595)]
253
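
In NLTK 3, `FreqDist` subclasses `collections.Counter`, so the counting step can be sketched with the standard library alone. The sentence below is made-up sample data, not text from the `movie_reviews` corpus.

```python
from collections import Counter

# Made-up sample text, already split into tokens on whitespace.
words = "The movie was stupid , the plot was stupid , and the ending was great .".split()

# Count lowercased tokens, mirroring the w.lower() step used for the corpus.
freq = Counter(w.lower() for w in words)

print(freq.most_common(3))   # the three most frequent tokens with their counts
print(freq["stupid"])        # count for a single token
print(freq["brilliant"])     # tokens that never occur count as 0
```

As in the corpus output above, punctuation and stop words dominate the top of the list, which is one reason to filter them out before using word counts as features.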


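If the `movie_reviews` corpus is not installed yet, download it once before running the cell above:

In [61]: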
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/brandon/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True