## Natural Language Processing 

Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data (Wikipedia).

This rapidly improving area of artificial intelligence covers tasks such as speech recognition, natural-language understanding, and natural language generation.

In this project, I built a small NLP toolkit to practice:

* Tokenizing: splitting a body of text into sentences and words
* Part of speech tagging
* Chunking

Machine learning topics covered in conjunction with NLP:

* Machine learning in NLP
* How to tie in Scikit-learn (sklearn) with NLTK
* Training classifiers with datasets (Next Project)
* Performing live, streaming sentiment analysis with Twitter (Next Project)

I used the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language.

From the command line, I ran the following commands:
ubuntu@ip-172-30-1-174:~$ conda install -c anaconda nltk 
Solving environment: done
..
..
..
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

ubuntu@ip-172-30-1-174:~$ python
Python 3.5.6 |Anaconda 4.1.1 (64-bit)| (default, Aug 26 2018, 21:41:56) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader

    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader> u

Nothing to update.


    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> popular
Done downloading collection popular


    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all-nltk
Downloading collection 'all-nltk'
  Done downloading collection all-nltk


    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit

Downloader> q
True
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit
>>> exit()
ubuntu@ip-172-30-1-174:~$ 


Basic Terms:

##### Corpus - 
A body of text (singular; the plural is corpora). Example: a collection of medical journals.
##### Lexicon - 
Words and their meanings. Example: an English dictionary. Consider, however, that various fields will have different lexicons. 
##### Token - 
Each "entity" that results from splitting text up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.

In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello Pranshu, how are you doing today? The cricket is awesome, and Python is awesome. Have nice day coding."

print(sent_tokenize(text))

['Hello Pranshu, how are you doing today?', 'The cricket is awesome, and Python is awesome.', 'Have nice day coding.']


In [25]:
print(word_tokenize(text))

['Hello', 'Pranshu', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'cricket', 'is', 'awesome', ',', 'and', 'Python', 'is', 'awesome', '.', 'Have', 'nice', 'day', 'coding', '.']
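
To make the idea of a token concrete, the sketch below approximates both tokenizers with plain regular expressions. It is only an illustration: the two patterns are assumptions of this example, not what NLTK does internally (`sent_tokenize` relies on a trained Punkt model, and `word_tokenize` handles contractions and abbreviations that these patterns would get wrong).

```python
import re

text = ("Hello Pranshu, how are you doing today? The cricket is awesome, "
        "and Python is awesome. Have nice day coding.")

# Naive sentence split: break after ., ? or ! when followed by whitespace.
# (Assumption of this sketch; it would wrongly split after "Mr." or "W.")
sentences = re.split(r"(?<=[.?!])\s+", text)

# Naive word split: runs of word characters, or any single punctuation mark.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words[:9])
```

For this particular text the result happens to agree with the NLTK output above; on messier text the two approaches diverge quickly.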


##### Stop Words with NLTK:

The ultimate goal of NLP is to make a computer act on, or respond to, language.

To achieve this we must preprocess the data. Preprocessing means reducing or modifying the text so that only the meaningful parts remain. Very common words that carry little meaning on their own, such as "the", "is", and "and", are called stop words and are usually removed.

In [26]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
print("Total stop words: ",len(stopwords.words('english')))
print(type(stopwords.words('english')))
# print(set(stopwords.words('french')))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [27]:
example_sent = "This is some sample text, showing off the stop words filtration."

stop_words_ = stopwords.words('english')
stop_words_.append(".")
stop_words=set(stop_words_)
word_tokens = word_tokenize(example_sent)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'text', ',', 'showing', 'stop', 'words', 'filtration']
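
Notice that 'This' and ',' survive the filter above: the stop word list is all lowercase, so the capitalized 'This' does not match, and '.' was the only punctuation mark added to the list. A minimal sketch of a case-insensitive filter that also drops punctuation follows. The tiny `STOP_WORDS` set is a hand-written stand-in for `stopwords.words('english')` so the example runs without the NLTK data, and `remove_stop_words` is a hypothetical helper name.

```python
# Hand-written stand-in for stopwords.words('english') (assumption of this sketch)
STOP_WORDS = {"this", "is", "some", "off", "the"}

word_tokens = ['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off',
               'the', 'stop', 'words', 'filtration', '.']

def remove_stop_words(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

filtered = remove_stop_words(word_tokens, STOP_WORDS)
print(filtered)
```

Whether to drop punctuation and ignore case depends on the task; for sentiment analysis, for instance, punctuation such as '!' may carry signal.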


##### Stemming Words with NLTK:

Stemming is the process of reducing a word to its stem, the base form to which suffixes and prefixes attach. (Reducing a word to its dictionary form, the lemma, is a related but separate technique called lemmatization.) Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used in queries and Internet search engines.

Like stop word removal, it is a preprocessing step.

Programs that perform stemming are commonly referred to as stemming algorithms or stemmers. A stemming algorithm maps the morphological variants of a word to a common stem: for example, “chocolates”, “chocolatey”, and “choco” to “chocolate”, and “retrieval”, “retrieved”, and “retrieves” to “retrieve”. In practice the stem need not be a dictionary word: the Porter stemmer used below reduces all three “retrieve” variants to “retriev”.
https://www.geeksforgeeks.org/python-stemming-words-with-nltk/

In language, different variations of words and sentences often have the same meaning. Stemming is desirable because it reduces redundancy: most of the time a word stem and its inflected or derived forms mean the same thing.


To account for all the variations of words in the English language, we can use the Porter stemmer, which has been around since 1979.

Recognizing, searching and retrieving more forms of words returns more results. When a form of a word is recognized it can make it possible to return search results that otherwise might have been missed. That additional information retrieved is why stemming is integral to search queries and information retrieval.
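
The retrieval benefit described above can be sketched with a tiny inverted index keyed by stem. The `crude_stem` function and the three documents are hypothetical, written only for this illustration; `crude_stem` strips a handful of suffixes and is far simpler than the Porter algorithm. In practice one would pass a real stemmer, such as the `stem` method of NLTK's `PorterStemmer`, instead.

```python
from collections import defaultdict

def crude_stem(word):
    """Toy stemmer (not Porter): strip the first matching suffix."""
    word = word.lower()
    for suffix in ("ing", "al", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index

docs = {
    1: "fast document retrieval",
    2: "she retrieved the file",
    3: "the cat sleeps",
}
index = build_index(docs, crude_stem)

# A query for "retrieves" finds documents that use other forms of the word.
hits = sorted(index[crude_stem("retrieves")])
print(hits)
```

Without stemming, the query "retrieves" would match neither document, because neither contains that exact form.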

In [28]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["retrieval", "retrieved", "retrieves"]

for w in example_words:
    print(ps.stem(w))

print("")

example_words = ["ride", "riding", "rided", "rides"]
for w in example_words:
    print(ps.stem(w))

retriev
retriev
retriev

ride
ride
ride
ride


In [29]:
# Stemming an entire sentence:

_1 = "Like all Americans I am outraged by the violence, lawlessness and mayhem." 
_2 = " The demonstrators who infiltrated the Capitol have defiled the seat of American democracy."

new_text=_1 + _2

words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

like
all
american
I
am
outrag
by
the
violenc
,
lawless
and
mayhem
.
the
demonstr
who
infiltr
the
capitol
have
defil
the
seat
of
american
democraci
.


##### Part of Speech Tagging with NLTK

Part of speech tagging means labeling each word as a noun, verb, adjective, and so on (English is traditionally described as having eight parts of speech). NLTK's tagger also distinguishes verb tenses, for example VBD for a past-tense verb. While we're at it, we are also going to import a new sentence tokenizer, PunktSentenceTokenizer. This tokenizer is capable of unsupervised learning, so it can be trained on any body of text.

In [30]:
# We can use documents from nltk.corpus. As an example, below is the UDHR (Universal Declaration of Human Rights)
# https://www.nltk.org/book/ch02.html
from nltk.corpus import udhr
print(udhr.raw('English-Latin1'))

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, 

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, 

Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, 

Whereas it is essential to promote the development of friendly relations between nations, 

Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in

In [31]:
# importing some sample and training text - George Bush's 2005 and 2006 state of the union addresses. 
# https://medium.com/@ishan.cdixit/pythons-natural-language-tool-kit-nltk-tutorial-part-2-f5a4d70fd01e
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [32]:
# train the PunktSentenceTokenizer

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [33]:
# tokenize the sample_text using our trained tokenizer

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [36]:
# Tag each tokenized word with a part of speech
import nltk

def process_content():
    try:
        for i in tokenized[:1]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

        
# Prints a list of tuples: each word with its part of speech tag
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
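
Because `pos_tag` returns a plain list of `(word, tag)` tuples, ordinary Python is enough to work with the result. The sketch below copies a few tuples from the output above, counts the tags, and pulls out the proper nouns. It also makes one weakness visible: in the all-caps title line, words such as 'THE' are tagged NNP (proper noun), presumably because of their capitalization.

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagger output above
tagged = [('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'),
          ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which words were tagged as proper nouns?
proper_nouns = [word for word, tag in tagged if tag == 'NNP']

print(tag_counts.most_common(1))
print(proper_nouns)
```

Lowercasing or skipping title lines before tagging is one possible way to reduce such mis-tags, though that is untested here.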


##### Chunking with NLTK

Now that each word has been tagged with a part of speech, we can move on to chunking: grouping the words into meaningful clusters. The main goal of chunking is to group words into "noun phrases", each of which is a noun together with any associated verbs, adjectives, or adverbs.

The part of speech tags that were generated in the previous step are combined with regular expressions. In the grammar used below, `<RB.?>*` matches zero or more adverbs of any kind, `<VB.?>*` zero or more verbs in any tense, `<NNP>+` one or more proper nouns, and `<NN>?` an optional common noun:


In [39]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # draw the chunks with nltk
            # chunked.draw()     

    except Exception as e:
        print(str(e))

        
process_content()

In [40]:
# We can access the chunks, which are stored as an NLTK tree 

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            
            # draw the chunks with nltk
            # chunked.draw()     

    except Exception as e:
        print(str(e))

        
process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)
(Chunk Coretta/NNP Scott/NNP King/NNP)
(Chunk Applause/NNP)
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP)
(Chunk Tuesday/NNP)
(Chunk Jan/NNP)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
(Chunk Tonight/NNP)
(Chunk Union/NNP)
(Chunk Applause/NNP)
(Chunk United/NNP)
(Chunk America/NNP)
(Chunk Applause/NNP)
(Chunk America/NNP)
(Chunk September/NNP)
(Chunk Dictatorships/NNP shelter/NN)
(Chunk Applause/NNP)
(Chunk Afghanistan/NNP)
(
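
To see how a tag pattern such as `{<RB.?>*<VB.?>*<NNP>+<NN>?}` selects words, here is a simplified stand-in written with the standard `re` module. It is not how `RegexpParser` is implemented; it only illustrates the idea of running a regular expression over the sequence of tags. The `chunk_tagged` helper is a hypothetical name, and the input is stitched together for illustration from tagged words that appear in the output above, plus two hand-tagged filler words.

```python
import re

# Same pattern as chunkGram above, with each tag wrapped in a non-capturing group
TAG_PATTERN = re.compile(
    r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*(?:<NNP>)+(?:<NN>)?"
)

def chunk_tagged(tagged):
    """Return lists of (word, tag) pairs whose tag sequence matches TAG_PATTERN."""
    # Encode the tags as one string, remembering where each tag starts
    starts, pieces, pos = {}, [], 0
    for i, (_, tag) in enumerate(tagged):
        starts[pos] = i
        piece = "<" + tag + ">"
        pieces.append(piece)
        pos += len(piece)
    tag_string = "".join(pieces)

    chunks = []
    for m in TAG_PATTERN.finditer(tag_string):
        first = starts[m.start()]
        count = m.group().count("<")  # one "<" per matched tag
        chunks.append(tagged[first:first + count])
    return chunks

# 'under' and 'the' are hand-tagged filler words (assumption of this sketch)
tagged = [('have', 'VBP'), ('served', 'VBN'), ('America', 'NNP'),
          ('under', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), ('dome', 'NN')]

for chunk in chunk_tagged(tagged):
    print(" ".join(word + "/" + tag for word, tag in chunk))
```

The two chunks found here correspond to "have served America" and "Capitol dome", both of which also appear in the NLTK output above.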

##### Chinking with NLTK

Sometimes there are words in the chunks that we don't want; we can remove them using a process called chinking.

In [41]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The main difference here is the inverted braces }...{ instead of {...}. They remove
            # (chink) from the chunk any run of one or more verbs, prepositions, determiners, or the word 'to'.

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            # chunked.draw()

    except Exception as e:
        print(str(e))

        
process_content()

(Chunk 31/CD ,/, 2006/CD ./.)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN I/PRP)
(Chunk invited/JJ)
(Chunk rostrum/NN ,/, I/PRP)
(Chunk privilege/NN ,/, and/CC mindful/NN)
(Chunk history/NN we/PRP)
(Chunk together/RB ./.)
(Chunk We/PRP)
(Chunk Capitol/NNP dome/NN)
(Chunk moments/NNS)
(Chunk national/JJ mourning/NN and/CC national/JJ achievement/NN ./.)
(Chunk We/PRP)
(Chunk America/NNP)
(Chunk one/CD)
(Chunk most/RBS consequential/JJ periods/NNS)
(Chunk our/PRP$ history/NN --/: and/CC it/PRP)
(Chunk my/PRP$ honor/NN)
(Chunk you/PRP ./.)
(Chunk system/NN)
(Chunk
  two/CD
  parties/NNS
  ,/,
  two/CD
  chambers/NNS
  ,/,
  and/CC
  two/CD
  elected/JJ
  branches/NNS
  ,/,
  there/EX
  will/MD
  always/RB)
(Chunk differences/NNS and/CC debate/NN ./.)
(Chunk But/CC even/RB tough/JJ debates/NNS can/MD)
(Chunk
  civil/JJ
  tone/NN
  ,/,
  and/CC
  our/PRP$
  differences/NNS
  can/MD
  not/RB)
(Chunk anger/NN ./.)
(Chunk great/JJ issues/NNS)
(Chunk us/PRP ,/, 
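
Chinking can be sketched in plain Python: start with the whole sentence as one chunk, then cut it wherever a run of chink tags (`VB.?`, `IN`, `DT`, `TO`) occurs. This is a simplified illustration, not NLTK's implementation; the `chink` helper and the hand-tagged sentence are assumptions of the example.

```python
import re
from itertools import groupby

# Tags to remove, mirroring }<VB.?|IN|DT|TO>+{ in the grammar above
CHINK_TAG = re.compile(r"VB.?|IN|DT|TO")

def chink(tagged):
    """Split a tagged sentence into chunks at runs of chink tags."""
    chunks = []
    for is_chink, group in groupby(
            tagged, key=lambda pair: CHINK_TAG.fullmatch(pair[1]) is not None):
        if not is_chink:
            chunks.append(list(group))
    return chunks

# Hand-tagged example sentence (hypothetical input)
tagged = [('We', 'PRP'), ('gather', 'VBP'), ('under', 'IN'), ('this', 'DT'),
          ('Capitol', 'NNP'), ('dome', 'NN'), ('.', '.')]

for chunk in chink(tagged):
    print(" ".join(word + "/" + tag for word, tag in chunk))
```

As in the NLTK output above, punctuation stays attached to the neighboring chunk because its tag is not in the chink pattern.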

##### Named Entity Recognition with NLTK

One of the most common forms of chunking in natural language processing is called "Named Entity Recognition." NLTK is able to identify people, places, things, locations, monetary figures, and more.

There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

Here we pass binary=True, which means that each chunk is either a named entity or not; no further detail is given.

In [42]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # With binary=True every named entity is simply labeled 'NE'
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            # namedEnt.draw()
            
    except Exception as e:
        print(str(e))

        
process_content()

### Text Classification

##### Text classification using NLTK: positive/negative movie reviews

Now that we have covered the basics of preprocessing for Natural Language Processing, we can move on to text classification using simple machine learning classification algorithms.

In [51]:
import random
import nltk
from nltk.corpus import movie_reviews
# using NLTK corpus movie reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[0]))

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words) # count how often each word occurs

print('Most common words: {}'.format(all_words.most_common(15)))
print('The word happy: {}'.format(all_words["happy"])) #number of times word happy occurs

Number of Documents: 2000
First Review: (['an', 'affluent', 'horse', 'breeder', "'", 's', 'past', 'comes', 'up', 'to', 'haunt', 'him', ';', 'an', 'ages', 'old', 'cover', '-', 'up', 'and', 'blackmail', 'comes', 'back', 'to', 'haunt', 'him', 'at', 'the', 'hands', 'of', 'one', 'of', 'his', 'accomplices', '.', 'that', "'", 's', 'pretty', 'much', 'the', 'essence', 'of', 'the', 'movie', 'and', 'i', 'have', 'to', 'say', 'that', 'it', 'becomes', 'quite', 'boring', 'at', 'times', 'and', 'is', 'very', 'slow', '.', 'that', 'aside', 'the', 'story', 'was', 'well', 'presented', 'and', 'probably', 'quite', 'close', 'and', 'representative', 'of', 'its', 'source', '.', 'the', 'acting', 'in', 'particular', 'i', 'found', 'very', 'good', ',', 'the', 'character', 'development', 'was', 'also', 'quite', 'interesting', 'but', 'alas', 'the', 'story', 'simply', 'did', 'not', 'hold', 'my', 'interest', 'enough', 'for', 'me', 'to', 'get', 'into', 'the', 'movie', '.', 'a', 'few', 'things', 'about', 'the', 'story', 

In [57]:
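# We'll use 4000 words as features
print(len(all_words))  # number of distinct words
# Note: this takes the first 4000 keys in whatever order keys() returns them, not the
# 4000 most frequent words; all_words.most_common(4000) with stop words removed could work better
word_features = list(all_words.keys())[:4000]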

39768
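
`nltk.FreqDist` is a subclass of `collections.Counter`, so the difference between slicing `keys()` and calling `most_common()` can be shown with a plain `Counter` and a toy word list (the word list is made up for this sketch). On Python 3.7 and later the keys come back in first-seen order; on older versions the order is arbitrary. Either way it is not frequency order.

```python
from collections import Counter

# Toy stand-in for movie_reviews.words()
words = "a slow plot but the acting was good and the ending was good".split()
counts = Counter(w.lower() for w in words)

first_three = list(counts.keys())[:3]              # first-seen order, not frequency
top_three = [w for w, _ in counts.most_common(3)]  # most frequent first

print(first_three)
print(top_three)
```

The first slice picks up whatever words happen to come first, while `most_common` returns the words that actually occur most often.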


In [46]:
# The find_features function will determine which of the 4000 word features are contained in the review
def find_features(document):
    words = set(document)
    features = {}#dictionary
    for w in word_features:
        features[w] = (w in words)

    return features


# Let's use an example from a negative review
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for key, value in features.items():
    if value:
        print(key)

half
what
drive
obviously
they
ago
scenes
understanding
by
)
line
t
7
hide
kudos
does
sure
her
problems
but
a
or
someone
plot
lazy
actually
unravel
attempt
package
meantime
2
figured
own
tons
idea
while
explained


In [58]:
# Now for all the documents
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [59]:
# we can split the featuresets into training and testing datasets using sklearn
from sklearn import model_selection

# define a seed for reproducibility
seed = 1

# split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state=seed)

In [60]:
print(len(training))
print(len(testing))

1500
500
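
Conceptually, the split above shuffles the data with a fixed seed and holds out a quarter of it. The sketch below shows that idea with the standard library only. `simple_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the same partition as `train_test_split`.

```python
import random

def simple_split(items, test_size=0.25, seed=1):
    """Shuffle a copy of items with a fixed seed, then hold out test_size of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(2000))  # stand-in for the 2000 feature sets
training, testing = simple_split(data)
print(len(training), len(testing))
```

Fixing the seed matters because accuracy figures are only comparable between runs when the same items end up in the test set.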


In [61]:
# We can use sklearn algorithms in NLTK
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

# train the model on the training data
model.train(training)

# and test on the testing dataset!
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 67.2
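
For context, accuracy is simply the fraction of test items whose predicted label matches the true label. The `movie_reviews` corpus contains 1000 positive and 1000 negative reviews, so a classifier that always guesses the same label would score about 50%. The sketch below computes accuracy by hand on made-up labels; the `accuracy` helper and both label lists are illustrative, not the notebook's data.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual    = ['pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg']

print("Accuracy: {:.1f}%".format(accuracy(predicted, actual) * 100))

# Baseline: always predict 'pos' on a balanced test set
baseline = accuracy(['pos'] * len(actual), actual)
print("Baseline: {:.1f}%".format(baseline * 100))
```

Comparing against such a baseline is what gives a figure like 67.2% its meaning: it is better than guessing, but leaves a lot of room for improvement.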


##### Results and Summary

Not bad! But not too great either. That's okay, we'll learn how to further improve the results later. In this project, we built a foundation for Natural Language Processing in Python. We covered tokenizing, stop word removal, stemming, part of speech tagging, chunking, chinking, named entity recognition, and text classification.

In future projects we will look at combining multiple classification algorithms to produce better results. Furthermore, we'll move on to more difficult challenges, such as sentiment analysis.
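
As a preview of what combining classifiers could look like, here is a minimal majority-vote sketch. The three rule-based "classifiers" are toy stand-ins invented for this example; in a real project they would be trained models, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(classifiers, features):
    """Return the most common label and the share of classifiers that chose it."""
    votes = [clf(features) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Toy classifiers: each looks at a different feature of the review
def clf_a(f):
    return 'pos' if f.get('good') else 'neg'

def clf_b(f):
    return 'neg' if f.get('boring') else 'pos'

def clf_c(f):
    return 'pos' if f.get('great') else 'neg'

features = {'good': True, 'boring': True, 'great': False}
label, confidence = majority_vote([clf_a, clf_b, clf_c], features)
print(label, round(confidence, 2))
```

Using an odd number of voters avoids ties in a two-class problem, and the share of agreeing votes can serve as a rough confidence score.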