<a href="https://colab.research.google.com/github/xprilion/introduction-to-nltk/blob/master/Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP with NLTK 

Welcome to a Natural Language Processing tutorial using NLTK.

## What is Natural Language Processing (NLP)?

Let us understand the concept of NLP in detail.

Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.

Before we get into the details of NLP, let us first answer the following question:

- What is natural language, and how is it different from other types of data?

Natural language refers to the way we humans communicate with each other, namely speech and text. We are surrounded by text.
Think about how much text you see each day:

- Signs
- Menus
- Email
- SMS
- Web Pages

and so much more…

The list is endless.

Now think about speech. We may speak to each other, as a species, more than we write. It may even be easier to learn to speak than to write. Voice and text are how we communicate with each other. Given the importance of this type of data, we must have methods to understand and reason about natural language, just like we do for other types of data.

Now let's get into the details of this tutorial.

### <a id='01'>Natural Language Processing using NLTK</a>

#### <a id='1'>1. Introduction to NLTK</a>

The NLTK module is a massive toolkit aimed at helping with the entire Natural Language Processing (NLP) methodology. NLTK will aid with everything from splitting paragraphs into sentences, splitting sentences into words, recognizing the part of speech of those words, and highlighting the main subjects, to helping a machine understand what the text is all about. In this tutorial, we're going to tackle the field of opinion mining, or sentiment analysis.

In our path to learning how to do sentiment analysis with NLTK, we're going to learn the following:

- Tokenizing - Splitting sentences and words from the body of text.
- Part of Speech tagging
- Machine Learning with the Naive Bayes classifier
- How to tie in Scikit-learn (sklearn) with NLTK
- Training classifiers with datasets
- Performing live, streaming, sentiment analysis with Twitter.

...and much more.

In order to get started, you are going to need the NLTK module, as well as Python.

In [0]:
import matplotlib
matplotlib.use('Agg')

In [2]:
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

#### <a id='2'>2. Tokenizing Words & Sentences</a>

Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words, numbers, punctuation marks, or even sentences. Word tokenization does this by locating word boundaries, that is, the point where one word ends and the next begins. It is also known as word segmentation.

- <b>Challenges in tokenization</b> depend on the type of language. Languages such as English and French are referred to as space-delimited, as most words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing sentences in an unsegmented language requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. Structures of languages can be grouped into three categories:

    - Isolating: Words do not divide into smaller units. Example: Mandarin Chinese

    - Agglutinative: Words divide into smaller units. Example: Japanese, Tamil

    - Inflectional: Boundaries between morphemes are not clear and ambiguous in terms of grammatical meaning. Example: Latin.
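For a space-delimited language such as English, locating word boundaries can be approximated with a single regular expression. The sketch below uses only Python's standard `re` module; the pattern and the `simple_tokenize` name are illustrative assumptions, not what NLTK does internally.

```python
import re

# Runs of word characters form one token; any other non-space character
# (such as punctuation) becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Weather is great. Is it?"))
```

A pattern like this has nothing to hold on to in an unsegmented language, which is where the additional lexical and morphological information mentioned above comes in.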


Let us understand some more basic terminology.

- What is a corpus?

A corpus (plural: corpora) is a body of text, e.g. medical journals, presidential speeches, or the English language.

- What is a lexicon?

A lexicon is a set of words and their meanings, e.g. investor speak vs. regular English speak.

For example, investors use "bull" to describe a stock or market that is rising (bullish), whereas in regular English "bull" simply describes the animal.
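One simple way to picture a lexicon is as a mapping from words to meanings, with one mapping per domain. The sketch below is purely illustrative: the two dictionaries, their entries, and the `lookup` helper are made up for this example and are not part of NLTK.

```python
# Two tiny hand-written lexicons (illustrative entries only).
general_lexicon = {"bull": "an adult male bovine animal"}
investor_lexicon = {"bull": "a rising stock or market (bullish)"}

def lookup(word, domain_lexicon, fallback_lexicon):
    """Prefer the domain-specific meaning, fall back to the general one."""
    key = word.lower()
    return domain_lexicon.get(key, fallback_lexicon.get(key))

print(lookup("BULL", investor_lexicon, general_lexicon))
print(lookup("BULL", {}, general_lexicon))
```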

    
For now, let us look at the word tokenizer and the sentence tokenizer in NLTK.

In [0]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [4]:
example_text = "Hi Mr.Pavan , How are you doing today? You got a nice job at IBM. Wow thats an awesome car. Weather is great."

print(sent_tokenize(example_text))

['Hi Mr.Pavan , How are you doing today?', 'You got a nice job at IBM.', 'Wow thats an awesome car.', 'Weather is great.']


As you can see, the sentence tokenizer split the example text into separate sentences. Now let us look at the word tokenizer below.

In [5]:
print(word_tokenize(example_text))

['Hi', 'Mr.Pavan', ',', 'How', 'are', 'you', 'doing', 'today', '?', 'You', 'got', 'a', 'nice', 'job', 'at', 'IBM', '.', 'Wow', 'thats', 'an', 'awesome', 'car', '.', 'Weather', 'is', 'great', '.']


As you can see, the word tokenizer split the example text into separate words and punctuation marks.

#### <a id='3'>3. Stopwords</a>

Stop words are natural language words which have very little meaning, such as "and", "the", "a", "an", and similar words.

During the preprocessing of natural language text we usually eliminate the stop words, as they are redundant and do not convey any meaningful insight into the data.

In [0]:
from nltk.corpus import stopwords

Now let us load the stop words for the English language and see what they are.

In [7]:
stop_words = set(stopwords.words("english"))

print(stop_words)

{'don', "wasn't", 'was', 'who', 'during', 'it', "couldn't", "you've", 'll', 'mustn', 'yourselves', 'each', 'will', 'or', "hadn't", 'against', "shouldn't", 'do', 'doesn', 've', 'are', 'just', 'yourself', 'should', 'the', 'shan', 'himself', 'whom', 'and', 'because', 'by', 'over', 'where', 'more', 'how', 'such', 'so', 'needn', 'been', 'same', 'my', "doesn't", "you'd", 'until', 'both', 'itself', 'own', 'our', 'he', 'before', 'too', 'between', 'did', 'off', 'does', 'further', 'only', "shan't", 'me', 'these', 'hasn', "won't", 'now', 'through', 'again', "mightn't", 'there', 'can', 'shouldn', 'his', 'a', 'why', 'wouldn', "don't", 'be', 'of', 'this', "she's", 'aren', 'hadn', 'them', 'after', 'your', 'y', 'up', 're', 'she', 'couldn', 'myself', 'for', 'nor', 'we', 'to', 'o', 'very', 'but', "that'll", 'some', "didn't", 'what', 'down', 'doing', 'm', 'd', 'those', 'isn', 'am', 'any', 'ain', "you're", 'about', 'other', 'ourselves', "you'll", 'were', 'herself', 'as', "haven't", "wouldn't", 'is', 'whic

Now let us tokenize the sample text and filter the sentence by removing the stop words from it.

In [8]:
example_text = "Hi Mr.Pavan , How are you doing today?. Cool you got a nice job at IBM. Wow thats an awesome car. Weather is great."
words = word_tokenize(example_text)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)    

['Hi', 'Mr.Pavan', ',', 'How', 'today', '?', '.', 'Cool', 'got', 'nice', 'job', 'IBM', '.', 'Wow', 'thats', 'awesome', 'car', '.', 'Weather', 'great', '.']


As you can see above, that is how we can filter the stop words out of a given text before processing the data further.
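Notice that "How" survives the filter in the output above even though "how" is in the stop word set: the membership test is case-sensitive, and the stop words printed earlier are all lowercase. A minimal sketch of a case-insensitive variant follows. The tiny `stop_words` set here is a hand-written stand-in (an assumption to keep the example self-contained); in practice you would use `set(stopwords.words("english"))` as above.

```python
# Small stand-in for NLTK's English stop word list (assumption for this sketch).
stop_words = {"how", "are", "you", "a", "an", "at", "is", "doing"}

tokens = ["Hi", "How", "are", "you", "doing", "today", "?"]

# Lowercase each token only for the comparison, so the original casing is kept.
filtered = [tok for tok in tokens if tok.lower() not in stop_words]
print(filtered)  # ['Hi', 'today', '?']
```

Whether to lowercase is a design choice: it removes more stop words, but it can also drop capitalized tokens that happen to match a stop word.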

#### <a id='4'>4. Stemming Words</a>

Stemming is the process of reducing inflected or derived words to their word stem, base or root form. It works by stripping affixes (suffixes and prefixes) from words; unlike a lemma, the resulting stem does not have to be a real word. It is also a preprocessing step in natural language processing.

Examples: Words like
- organise, organising, organisation: the stem is "organis".
- intelligence, intelligently: the stem is "intellig".

So stemming produces an intermediate representation of the word which may not have any meaning. In this case "intellig" has no meaning.

The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than when tense is involved.

The reason why we stem is to shorten the lookup, and normalize sentences.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.
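As a quick sketch of what the Porter stemmer does to individual words, the snippet below stems a few words that appear in this section's examples. It only relies on `PorterStemmer` from `nltk.stem`, which needs no downloaded data; the word list itself is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the examples used in this section.
words = ["intelligent", "intelligently", "performer", "company"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both "intelligent" and "intelligently" end up with the same stem, which is the normalizing effect described above.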

First, we're going to grab and define our stemmer:

In [0]:
from nltk.stem import PorterStemmer

In [10]:
txt = "John is an intelligent individual.He intelligently does smart work. He is a top performer in the company."
sentences = sent_tokenize(txt)
stemmer = PorterStemmer()
new_sentence = []
# Stem every word of every sentence, then rebuild each sentence as a string.
for sentence in sentences:
    words = word_tokenize(sentence)
    stemmed_words = [stemmer.stem(word) for word in words]
    new_sentence.append(' '.join(stemmed_words))
print(new_sentence)

['john is an intellig individual.h intellig doe smart work .', 'He is a top perform in the compani .']


As you can see above, "intelligent" and "intelligently" are both reduced to the stem "intellig", which is not a real word. Now let us look at lemmatization.

#### <a id='5'>5. Lemmatization</a>

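Lemmatization is similar to stemming, but the resulting root form (the lemma) is a real word with a meaning. It is also a preprocessing step in natural language processing.

Examples: Words like
- going, goes, gone - when we do lemmatization we get "go"
- intelligence, intelligently - when we do lemmatization we get "intelligent".

So lemmatization produces an intermediate representation of the word which has a meaning. In this case "intelligent" has meaning.

Note that NLTK's WordNetLemmatizer treats each word as a noun unless a part of speech is passed through its pos argument. That is why, in the output below, "does" is lemmatized as the plural of the noun "doe" and "intelligently" is left unchanged.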

In [0]:
from nltk.stem import WordNetLemmatizer

In [12]:
txt = "John is an intelligent individual.He intelligently does smart work. He is a top performer in the company."
sentences = sent_tokenize(txt)
lemmatizer = WordNetLemmatizer()
lemmatized_sentences = []
# Lemmatize every word (as a noun, the default) and rebuild each sentence.
for sentence in sentences:
    words = word_tokenize(sentence)
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    lemmatized_sentences.append(' '.join(lemmas))
print(lemmatized_sentences)

['John is an intelligent individual.He intelligently doe smart work .', 'He is a top performer in the company .']


#### <a id='6'>6. Part of Speech Tagging</a>

One of the more powerful aspects of NLTK is the part of speech tagging that it can do. This means labeling words in a sentence as nouns, adjectives, verbs, etc. Even more impressive, it also labels by tense, and more. Here's a list of the tags, what they mean, and some examples:

##### POS tag list:

- CC	coordinating conjunction
- CD	cardinal digit
- DT	determiner
- EX	existential there (like: "there is" ... think of it like "there exists")
- FW	foreign word
- IN	preposition/subordinating conjunction
- JJ	adjective	'big'
- JJR	adjective, comparative	'bigger'
- JJS	adjective, superlative	'biggest'
- LS	list marker	1)
- MD	modal	could, will
- NN	noun, singular 'desk'
- NNS	noun plural	'desks'
- NNP	proper noun, singular	'Harrison'
- NNPS	proper noun, plural	'Americans'
- PDT	predeterminer	'all the kids'
- POS	possessive ending	parent's
- PRP	personal pronoun	I, he, she
- PRP$	possessive pronoun	my, his, hers
- RB	adverb	very, silently,
- RBR	adverb, comparative	better
- RBS	adverb, superlative	best
- RP	particle	give up
- TO	to	go 'to' the store.
- UH	interjection	errrrrrrrm
- VB	verb, base form	take
- VBD	verb, past tense	took
- VBG	verb, gerund/present participle	taking
- VBN	verb, past participle	taken
- VBP	verb, sing. present, non-3rd person	take
- VBZ	verb, 3rd person sing. present	takes
- WDT	wh-determiner	which
- WP	wh-pronoun	who, what
- WP$	possessive wh-pronoun	whose
- WRB	wh-adverb	where, when
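Many tags in this list share a prefix: noun tags start with NN, verb tags with VB, adjective tags with JJ and adverb tags with RB. The sketch below uses that to group tagged words into coarse word classes. The `COARSE` mapping and the `coarse_class` helper are hypothetical names introduced for this example, and the tagged pairs are written by hand in the (word, tag) format that `nltk.pos_tag` returns.

```python
from collections import defaultdict

# A few (word, tag) pairs written by hand in the format nltk.pos_tag returns.
tagged = [("Crocodiles", "NNS"), ("are", "VBP"), ("large", "JJ"),
          ("aquatic", "JJ"), ("reptiles", "NNS")]

# Hypothetical mapping from a two-letter tag prefix to a coarse word class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_class(tag):
    """Return a coarse word class for a Penn Treebank tag, or 'other'."""
    return COARSE.get(tag[:2], "other")

# Collect the words that fall into each coarse class.
groups = defaultdict(list)
for word, tag in tagged:
    groups[coarse_class(tag)].append(word)

print(dict(groups))
```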

Now let us use a new sentence tokenizer, called the PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so we can actually train it on any body of text that we use.

In [0]:
from nltk.tokenize import PunktSentenceTokenizer
# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"
# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)
# Then we can actually tokenize, using:
tokenized = cust_tokenizer.tokenize(sample_text)

Now we can finish up this part of speech tagging script by creating a function that will run through and tag all of the parts of speech per sentence like so:

In [14]:
print("Speech Tagging Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

process_text()

Speech Tagging Output
[('Crocodiles', 'NNS'), ('are', 'VBP'), ('large', 'JJ'), ('aquatic', 'JJ'), ('reptiles', 'NNS'), ('which', 'WDT'), ('are', 'VBP'), ('carnivorous.Allegators', 'NNS'), ('belong', 'RB'), ('to', 'TO'), ('this', 'DT'), ('same', 'JJ'), ('reptile', 'NN'), ('species', 'NNS')]
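In the output above, "carnivorous.Allegators" comes through as a single token, and the sample text is treated as one sentence. The likely cause is the missing space after the period in `sample_text`. The sketch below illustrates the effect with an untrained `PunktSentenceTokenizer` (an assumption made to keep the example independent of any training text); the two example strings are invented for this illustration.

```python
from nltk.tokenize import PunktSentenceTokenizer

# An untrained tokenizer is enough to show the effect (assumption: no
# training text is passed, so no abbreviations have been learned).
tokenizer = PunktSentenceTokenizer()

no_space = "Crocodiles are carnivorous.Alligators are reptiles too."
with_space = "Crocodiles are carnivorous. Alligators are reptiles too."

# Without whitespace after the period the text stays in one piece.
print(tokenizer.tokenize(no_space))
# With whitespace after the period the text is split into two sentences.
print(tokenizer.tokenize(with_space))
```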


#### <a id='7'>7. Chunking</a>

Now that we know the parts of speech, we can do what is called chunking, and group words into hopefully meaningful chunks. One of the main goals of chunking is to group into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are in relation to them.

In order to chunk, we combine the part of speech tags with regular expressions. Mainly from regular expressions, we are going to utilize the following:

"+" = match 1 or more

"?" = match 0 or 1 repetitions.

"*" = match 0 or MORE repetitions	  

"." = Any character except a new line

The last thing to note is that part of speech tags are denoted with "<" and ">", and we can also place regular expressions within the tags themselves to account for things like "all nouns" (<N.*>).

Let us take the same code from the above Speech Tagging section and modify it to include chunking for noun plural (NNS) and adjective (JJ).

In [15]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)
print("Chunked Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk:{<NNS.?>*<JJ>+}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #chunked.draw()
            print(chunked)

    except Exception as e:
        print(str(e))

process_text()

Chunked Output
(S
  Crocodiles/NNS
  are/VBP
  (Chunk large/JJ aquatic/JJ)
  reptiles/NNS
  which/WDT
  are/VBP
  carnivorous.Allegators/NNS
  belong/RB
  to/TO
  this/DT
  (Chunk same/JJ)
  reptile/NN
  species/NNS)
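To see the tag-pattern syntax in isolation, here is a minimal sketch that chunks noun phrases in a hand-tagged sentence. The sentence, its tags, and the `NP` grammar are assumptions made for this example; writing the tags by hand means no tagger model is needed.

```python
from nltk import RegexpParser

# Hand-tagged tokens in the (word, tag) format produced by nltk.pos_tag.
tagged = [("the", "DT"), ("large", "JJ"), ("reptile", "NN"),
          ("swims", "VBZ"), ("in", "IN"), ("warm", "JJ"), ("rivers", "NNS")]

# Optional determiner, any number of adjectives, then one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP chunk found in the sentence.
noun_phrases = [[word for word, tag in subtree.leaves()]
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)  # [['the', 'large', 'reptile'], ['warm', 'rivers']]
```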


#### <a id='8'>8. Chinking</a>

You may find that, after a lot of chunking, you have some words in your chunk you still do not want, but you have no idea how to get rid of them by chunking. You may find that chinking is your solution.

Chinking is a lot like chunking: it is basically a way for you to remove a chunk from a chunk. The chunk that you remove from your chunk is your chink.

The code is very similar. You just denote the chink, after the chunk, with }{ instead of the chunk's {}.

In [16]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)

print("Chinked Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #chunked.draw()
            print(chunked)

    except Exception as e:
        print(str(e))

process_text()

Chinked Output
(S
  (Chunk Crocodiles/NNS)
  are/VBP
  (Chunk large/JJ aquatic/JJ reptiles/NNS which/WDT)
  are/VBP
  (Chunk carnivorous.Allegators/NNS belong/RB)
  to/TO
  this/DT
  (Chunk same/JJ reptile/NN species/NNS))
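The same idea can be sketched on a small hand-tagged sentence: first chunk everything, then chink out verbs and prepositions so that only the groups around them remain. The sentence, its tags, and the `NP` label are assumptions made for this example.

```python
from nltk import RegexpParser

# Hand-tagged tokens in the (word, tag) format produced by nltk.pos_tag.
tagged = [("the", "DT"), ("large", "JJ"), ("reptile", "NN"),
          ("swims", "VBZ"), ("in", "IN"), ("warm", "JJ"), ("rivers", "NNS")]

# First chunk every token, then chink (remove) runs of verbs and prepositions.
grammar = r"""NP: {<.*>+}
              }<VB.?|IN>+{"""
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Words that remain inside a chunk after chinking.
noun_phrases = [[word for word, tag in subtree.leaves()]
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
# Tokens that were chinked out sit directly under the root as plain tuples.
chinked = [child for child in tree if isinstance(child, tuple)]

print(noun_phrases)
print(chinked)
```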


#### <a id='9'>9. Named Entity Recognition</a>

One of the most common forms of chunking in natural language processing is called "Named Entity Recognition." The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.

This can be a bit of a challenge, but NLTK has this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

In [19]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)

print("Named Entity Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged,binary = True)
            print(namedEnt)

    except Exception as e:
        print(str(e))

process_text()

Named Entity Output
(S
  Crocodiles/NNS
  are/VBP
  large/JJ
  aquatic/JJ
  reptiles/NNS
  which/WDT
  are/VBP
  carnivorous.Allegators/NNS
  belong/RB
  to/TO
  this/DT
  same/JJ
  reptile/NN
  species/NNS)
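No entities were found in the sample text above, so the tree contains only plain tokens. When entities are recognized with `binary=True`, they appear as subtrees labeled `NE`. The sketch below shows how such subtrees could be collected. The tree is built by hand to mimic that output shape; the sentence and the entity labels in it are assumptions for illustration, not the result of running a model.

```python
from nltk.tree import Tree

# Hand-built tree mimicking ne_chunk(..., binary=True) output; the entity
# labels below are assumed for illustration, not produced by a model.
tree = Tree("S", [
    ("You", "PRP"), ("got", "VBD"), ("a", "DT"), ("job", "NN"), ("at", "IN"),
    Tree("NE", [("IBM", "NNP")]),
    ("in", "IN"),
    Tree("NE", [("New", "NNP"), ("York", "NNP")]),
])

# Join the words of each NE subtree into one entity string.
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NE")]
print(entities)  # ['IBM', 'New York']
```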


#### <a id='10'>10. The Corpora</a>

The NLTK corpus is a massive dump of all kinds of natural language data sets that are definitely worth taking a look at.

Almost all of the files in the NLTK corpus follow the same rules for accessing them by using the NLTK module, but nothing is magical about them. These files are plain text files for the most part, some are XML and some are other formats, but they are all accessible manually or via the module and Python. Let's talk about viewing them manually.

In [20]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("bible-kjv.txt")
tok = sent_tokenize(sample)
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and