In [63]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jfenata/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [64]:
text="Mary had a little lamb. Her fleece was white as snow"

from nltk.tokenize import word_tokenize, sent_tokenize

sents=sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [65]:
words=[word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]


## Remove all the stop words

NLTK comes with some built-in linguistic resources, including collections of stopwords for several languages. You can import these stopwords from the nltk.corpus module.

Along with the stopwords provided by NLTK, we also import punctuation from Python's string module, so that punctuation marks are filtered out as well.

In [66]:
nltk.download('stopwords')

from nltk.corpus import stopwords
from string import punctuation

customStopWords = set(stopwords.words('english') + list(punctuation))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jfenata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, given any list of tokens produced by tokenizing a piece of text, you can filter it to keep only the words that are not in the set of custom stopwords.

In [67]:
wordsWithoutStopwords = [word for word in word_tokenize(text) if word not in customStopWords]

print(wordsWithoutStopwords)

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
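
Notice that 'Her' survives the filter: the NLTK English stopwords are lowercase and the membership test is case-sensitive. The following is a minimal sketch of a case-insensitive variant. It assumes a small hand-written set, `SAMPLE_STOPWORDS` (a hypothetical stand-in for `stopwords.words('english')`), and reuses the tokens printed earlier, so it runs without any NLTK data.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); only the
# entries needed for this sentence are listed.
SAMPLE_STOPWORDS = {"had", "a", "her", "was", "as"}
custom_stop_words = SAMPLE_STOPWORDS | set(punctuation)

# Tokens copied from the tokenizer output shown earlier (flattened).
tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'fleece', 'was', 'white', 'as', 'snow']

# Lowercase each token only for the membership test, so the
# original casing is preserved in the result.
filtered = [tok for tok in tokens if tok.lower() not in custom_stop_words]
print(filtered)
```

Whether to lowercase before filtering depends on the task; keeping the original casing in the result can still be useful for later steps such as part-of-speech tagging.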


## Identify bigrams

N-grams are groups of n words that occur together in a text. In any piece of text, you may want to identify the important n-grams that occur in it.

This section shows how to construct bigrams (pairs of adjacent words) from a list of words and how to see how frequently each bigram occurs in that list.

> Collocations - any words that are collocated or that occur together.

In [68]:
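from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Association measures (e.g. PMI) for scoring bigrams; not used in this cell
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wordsWithoutStopwords)

# ngram_fd maps each bigram to its count; sorted() orders the items alphabetically
sorted(finder.ngram_fd.items())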

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]

Now we have every bigram present in this list of words together with its frequency, that is, the number of times it occurs in the list.
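
To make the counting step concrete, here is a rough standard-library sketch of the same idea. It is not how NLTK implements its finder; `count_ngrams` is a hypothetical helper, and the word list repeats 'little lamb' on purpose so that one bigram stands out.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Illustrative input: the filtered words, with 'little lamb' appended again.
words = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow',
         'little', 'lamb']

bigram_counts = count_ngrams(words, n=2)

# most_common() orders by count, highest first.
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

Calling `count_ngrams(words, n=3)` counts trigrams in the same way.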

> Note that sorted() in the cell above orders the bigrams alphabetically, not by count. If you had a piece of text in which some bigrams occur more often than others, sorting by count instead (for example with finder.ngram_fd.most_common()) would put the most frequent bigrams on top.


> The collocations module also has a trigram collocation finder, which you can use in a very similar way to find trigrams, or groups of three words.

## Stemming and Part Of Speech Tagging

The same word can appear in different morphological forms, for example close, closing and closed.

In [69]:
text2 = "Mary closed on closing night when she was in the mood to close."

from nltk.stem.lancaster import LancasterStemmer

st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)]

print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']


Ordinarily, if you tokenized this sentence and counted its words, each morphological form of the word close would be treated as a different token: closed, closing and close would each get a count of one.

If you instead want to treat them all as the same word and get a count of three, you have to reduce all three words to their root form. This is called stemming, and the nltk.stem module provides several algorithms for it.

The cell above uses one of them, the Lancaster stemmer, which reduces all three forms to the stem clos.
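
The effect on word counts can be sketched with `collections.Counter`. To keep the example free of NLTK, both lists below are copied from the notebook output (the tokens of `text2` and the stemmed words printed above), so no stemmer is actually run here.

```python
from collections import Counter

# Tokens of text2, as they appear in the notebook output.
raw_tokens = ['Mary', 'closed', 'on', 'closing', 'night', 'when', 'she',
              'was', 'in', 'the', 'mood', 'to', 'close', '.']

# Stemmed words, copied from the printed stemmedWords list.
stemmed_tokens = ['mary', 'clos', 'on', 'clos', 'night', 'when', 'she',
                  'was', 'in', 'the', 'mood', 'to', 'clos', '.']

raw_counts = Counter(raw_tokens)
stem_counts = Counter(stemmed_tokens)

# Before stemming, each form is counted separately.
print(raw_counts['closed'], raw_counts['closing'], raw_counts['close'])

# After stemming, all three forms share one entry.
print(stem_counts['clos'])
```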

Next, we want to determine whether a particular word is a noun, a verb, an adverb, and so on.

Once again, NLTK has a built-in function for this: pos_tag (part-of-speech tagging). All you need to do is pass it a list of words, and it returns each word paired with its part-of-speech tag.

In [70]:
nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(word_tokenize(text2))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jfenata/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]

Each word is now paired with an abbreviation (tag) that represents its part of speech.
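
Because each result is a `(word, tag)` tuple, the output is easy to post-process. The sketch below groups words into nouns and verbs, relying on the Penn Treebank convention that noun tags start with `NN` and verb tags start with `VB`. The `tagged` list is copied from the output above, and the category names are illustrative choices.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above.
tagged = [('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'),
          ('closing', 'NN'), ('night', 'NN'), ('when', 'WRB'),
          ('she', 'PRP'), ('was', 'VBD'), ('in', 'IN'), ('the', 'DT'),
          ('mood', 'NN'), ('to', 'TO'), ('close', 'VB'), ('.', '.')]

by_category = defaultdict(list)
for word, tag in tagged:
    if tag.startswith('NN'):      # NN, NNS, NNP, NNPS
        by_category['noun'].append(word)
    elif tag.startswith('VB'):    # VB, VBD, VBG, VBN, VBP, VBZ
        by_category['verb'].append(word)

print(dict(by_category))
```

Note that the tagger labelled 'closing' as a noun in this sentence, which a stemmer alone cannot tell you.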

> Examples: NNP is a singular proper noun, VBD is a past-tense verb, and PRP is a personal pronoun.


> For a complete list of the tags and what they mean, see the NLTK documentation.