# Installing NLTK
If the NLTK library is not already installed on your system, install it first.

In [1]:
#pip install --user -U nltk

In [2]:
import nltk
import numpy
from nltk.tokenize import sent_tokenize, word_tokenize

print('done')

done


## Use NLTK's download features

In [3]:
#nltk.download('popular')

## Important terms in NLP
- Tokenizers: word tokenizers and sentence tokenizers, which act as separators that split text into tokens
- Corpora: bodies of text, such as medical journals, speeches, or the English language as a whole
- Lexicon: words and their meanings, including the terminology specific to a group such as investors or doctors

These are the words you will most commonly hear upon entering the Natural Language Processing (NLP) space, but there are many more that we will be covering in time. With that, let's show an example of how one might actually tokenize something into tokens with the NLTK module.

In [4]:
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


## The Process 

At first, you may think tokenizing by things like words or sentences is a rather trivial enterprise. For many sentences it can be. The first step would likely be a simple .split('. '), that is, splitting by a period followed by a space. Then maybe you would bring in some regular expressions to split by period, space, and then a capital letter. The problem is that things like Mr. Smith would cause you trouble, along with many other cases. Splitting by word is also a challenge, especially when considering contractions, like we and are becoming we're. NLTK saves you a ton of time with this seemingly simple, yet very complex, operation.

The code above outputs the text split into a list of sentences, which you can then iterate through with a for loop:
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
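
To see concretely why the hand-rolled approaches described above fall short, here is a minimal sketch using only the standard library. The names `naive` and `regex_split`, and the exact regular expression, are illustrative assumptions and not part of NLTK.

```python
import re

EXAMPLE_TEXT = ("Hello Mr. Smith, how are you doing today? The weather is great, "
                "and Python is awesome. The sky is pinkish-blue. "
                "You shouldn't eat cardboard.")

# Attempt 1: split on a period followed by a space.
naive = EXAMPLE_TEXT.split('. ')

# Attempt 2: split on whitespace that follows ., ? or ! and precedes a capital letter.
regex_split = re.split(r'(?<=[.?!])\s+(?=[A-Z])', EXAMPLE_TEXT)

for piece in naive:
    print('naive:', piece)
for piece in regex_split:
    print('regex:', piece)
```

Both attempts break `Mr. Smith` apart. The first also discards the periods it splits on and misses the question mark, and the second yields five pieces where `sent_tokenize` found four sentences.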

So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:

In [5]:
# Here, let's tokenize the text by word
print(word_tokenize(EXAMPLE_TEXT))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


In [6]:
for i in word_tokenize(EXAMPLE_TEXT):
    print(i)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
Python
is
awesome
.
The
sky
is
pinkish-blue
.
You
should
n't
eat
cardboard
.
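
For contrast, a plain `str.split()` on whitespace, sketched here with only the standard library, leaves punctuation and contractions glued to their neighbours. The variable name `whitespace_tokens` is just for illustration.

```python
sentence = "You shouldn't eat cardboard."

# Splitting on whitespace keeps "shouldn't" whole and leaves the final
# period attached to "cardboard".
whitespace_tokens = sentence.split()
print(whitespace_tokens)
```

For the same sentence, `word_tokenize` produced `should`, `n't`, `cardboard`, and `.` as separate tokens.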


## Handling Stop Words
Immediately, we can recognize that some words carry more meaning than others. We can also see that some words are just plain useless filler. We use them in the English language, for example, to sort of "fluff" up the sentence so it is not so strange sounding. An example of one of the most common, unofficial, useless words is "umm." People stuff in "umm" frequently, some more than others. This word means nothing, unless of course we're searching for someone who is maybe lacking confidence, is confused, or hasn't practiced much speaking. We all do it; you can hear me saying "umm" or "uhh" in the videos plenty of ...uh ... times. For most analysis, these words are useless.

We would not want these words taking up space in our database, or taking up valuable processing time. As such, we call these words "stop words" because they are useless, and we wish to do nothing with them. Another version of the term "stop words" can be more literal: Words we stop on.

For example, you may wish to cease analysis immediately if you detect words that are commonly used sarcastically. Sarcastic words or phrases are going to vary by lexicon and corpus. For now, we'll be considering stop words as words that just contain no meaning, and we want to remove them.

You can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a set of words that it considers to be stop words, which you can access via the NLTK corpus with:

In [8]:
# First, let's bring in what we need from the corpus module
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

print('done')

done


In [10]:
# Set up an example sentence
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))

# Let's examine what NLTK uses as stop words
print(stop_words)

{'just', "don't", 'this', 'being', 'up', 'doesn', "weren't", "shan't", 'where', 'be', "hadn't", 'there', 'needn', 'hasn', "wouldn't", 'aren', 'from', 'ain', 'against', 'which', 'about', 'will', 'shouldn', 'our', 'should', 'yourselves', 'ma', 'his', 'after', 'very', 'now', 'is', 'having', 'her', "you'd", 'with', 'no', 'wouldn', 'own', 'if', 'isn', 'as', 'him', 'themselves', 'an', 'below', 'why', 'few', 'off', 'do', "it's", 'your', 'out', 'further', 'above', 'my', 'these', 'over', "shouldn't", "you're", 'you', "isn't", 'by', 'am', 'yourself', 'who', 'while', 'down', 'all', 'only', 'himself', 'such', 'because', 'other', 'between', 'was', 'wasn', 'hers', 'on', 'how', 'haven', "won't", "doesn't", "that'll", 'again', 'didn', 'same', 'won', 'some', 'both', 'theirs', 'to', "wasn't", 'too', 'myself', 'their', 's', 'i', "haven't", "you'll", 'been', 'not', 'when', 'y', 'ours', 'each', 'm', 'so', 'me', 't', 'she', 'does', 're', 'have', "needn't", "she's", 'it', 'has', 'at', 'under', 'then', 've', 

In [14]:
# Let's use it now on our example sentence
words = word_tokenize(example_sent)

filtered_sentence = []

for w in words:
    # Case-sensitive check: NLTK's stop words are lowercase, so 'This' is kept
    if w not in stop_words:
        filtered_sentence.append(w)

# You could also write it as a list comprehension:
# filtered_sentence = [w for w in words if w not in stop_words]

filtered_sentence

['This',
 'sample',
 'sentence',
 ',',
 'showing',
 'stop',
 'words',
 'filtration',
 '.']
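
Notice that `'This'` survived the filter above: the stop-word set is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. To stay self-contained it uses a hand-written token list and a small hypothetical stop list (`custom_stop_words`), both assumptions for illustration in place of NLTK's own data.

```python
# Tokens for the example sentence, written out by hand to keep the sketch
# self-contained (no NLTK data download needed).
tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']

# Hypothetical, hand-picked stop list for illustration only; in practice
# you would use set(stopwords.words("english")) from NLTK.
custom_stop_words = {'this', 'is', 'a', 'off', 'the'}

# Lowercase each token before the membership test so 'This' matches 'this'.
filtered = [w for w in tokens if w.lower() not in custom_stop_words]
print(filtered)
```

Punctuation tokens still pass through here; whether to drop them as well depends on the analysis you have in mind.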

## Stemming Words 
The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than when tense is involved.

The reason why we stem is to shorten the lookup, and normalize sentences.

Consider:

- I was taking a ride in the car.
- I was riding in the car.

These sentences mean the same thing. "in the car" is the same. "I was" is the same. The "was" puts both sentences in the past tense, and the "ing" marks an ongoing action, so is it truly necessary to differentiate between ride and riding, in the case of just trying to figure out the meaning of what this past-tense activity was?

No, not really.

This is just one minor example, but imagine every word in the English language, every possible tense and affix you can put on a word. Having individual dictionary entries per version would be highly redundant and inefficient, especially since, once we convert to numbers, the "value" is going to be identical.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

First, we're going to grab and define our stemmer:

In [1]:
# grab our Stemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
print(ps)

<PorterStemmer>


In [2]:
# Let's test out stemming on some examples
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

for i in example_words:
    print(ps.stem(i))
    
    

python
python
python
python
pythonli
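
Connecting this back to the point about redundant dictionary entries, here is a minimal sketch of counting words under their stems. It assumes NLTK is installed; `stem_counts` is an illustrative name, and `Counter` comes from the standard library's `collections` module.

```python
from collections import Counter
from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

# Count each word under its stem, so variants that share a stem share one entry.
stem_counts = Counter(ps.stem(w) for w in example_words)
print(stem_counts)
```

Given the stems printed above, four of the five variants collapse into the single entry `python`, while `pythonly` keeps its own entry as `pythonli`.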


In [4]:
# Now let's use the stemmer on an actual sentence

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.
