# Natural Language Processing (NLP)
Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

# Natural Language Toolkit (NLTK)
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook.

NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. It is used in courses at 32 universities in the US and in 25 countries. NLTK supports **classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities**.

## Installing NLTK

On Windows, Linux, or macOS, you can install NLTK with pip: `pip install nltk`.

To check that NLTK installed correctly, open a Python interpreter and type `import nltk`. If the import completes without an error, NLTK is installed.
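
The same check can be scripted. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and not part of NLTK. Note that the examples later in this notebook also depend on data packages (tokenizer models, stop word lists, WordNet, corpora) that are fetched separately with `nltk.download(...)`, and the exact resource names vary between NLTK versions.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK is not installed; run: pip install nltk")
```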


## Tokenization
Tokenization is the act of breaking up a string of text into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. Some tokenizers discard characters such as punctuation marks; NLTK's `word_tokenize`, used below, keeps them as separate tokens.

In [313]:
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize   # sent_tokenize to tokenize sentences 

In [314]:
test = "In the ninja world, those who break the rules are trash. That's true, but those who abandon their friends are worse than trash."

In [315]:
sent_tokenize(test)

['In the ninja world, those who break the rules are trash.',
 "That's true, but those who abandon their friends are worse than trash."]

In [316]:
words = word_tokenize(test)
print(words,"\ncount = ",len(words))

['In', 'the', 'ninja', 'world', ',', 'those', 'who', 'break', 'the', 'rules', 'are', 'trash', '.', 'That', "'s", 'true', ',', 'but', 'those', 'who', 'abandon', 'their', 'friends', 'are', 'worse', 'than', 'trash', '.'] 
count =  28
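
To make the idea concrete, here is a minimal regex-based sketch of sentence and word splitting. It is not how NLTK's tokenizers work; the function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for illustration, and the rules are deliberately naive (for example, `That's` becomes three tokens here, whereas `word_tokenize` above produced `That` and `'s`).

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Tokenizers split text. They keep punctuation, too!"
print(simple_sent_tokenize(sample))
print(simple_word_tokenize(sample))
```

Abbreviations such as "Dr." would be split incorrectly by this sentence rule, which is one reason to prefer a trained tokenizer for real text.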


## Stop Words
In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

In [317]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [318]:
clean_words = [w for w in words if w.lower() not in stop]
print(clean_words,"\ncount = ",len(clean_words))


['ninja', 'world', ',', 'break', 'rules', 'trash', '.', "'s", 'true', ',', 'abandon', 'friends', 'worse', 'trash', '.'] 
count =  15
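
The filtered list above still contains punctuation tokens. A minimal sketch of a filter that drops both stop words and punctuation follows. The stop word set here is a small hand-picked assumption for illustration, not NLTK's English list, and `remove_stop_words` is a hypothetical helper name.

```python
import string

# Small illustrative stop word set (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"in", "the", "those", "who", "are", "that", "but", "their", "than"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and tokens made only of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["In", "the", "ninja", "world", ",", "those", "who",
          "break", "the", "rules", "are", "trash", "."]
print(remove_stop_words(tokens))
```

Storing the stop words in a set keeps each membership test constant-time, which matters once the token list grows to hundreds of thousands of words.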


## POS Tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as **nouns, verbs, adjectives, adverbs, etc.**

In [319]:
from nltk.corpus import state_union
text = state_union.raw("2006-GWBush.txt")
#print(text)

In [320]:
from nltk import pos_tag


pos = pos_tag(word_tokenize(text.lower()))
pos2=np.array(pos)
pos2


array([['president', 'NN'],
       ['george', 'NN'],
       ['w.', 'VBD'],
       ...,
       ['applause', 'IN'],
       ['.', '.'],
       [')', ')']], dtype='<U18')

In [321]:
print(pos_tag(["One"]),
      pos_tag(["legendary"]),
      pos_tag(["flying"]),
      pos_tag(["person"])
)

[('One', 'CD')] [('legendary', 'JJ')] [('flying', 'VBG')] [('person', 'NN')]
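
The single-word calls above give the tagger no context to work with. The toy sketch below illustrates why context matters: a word such as "book" can be a noun or a verb, and the preceding tag helps decide. The lexicon, the single context rule, and the name `toy_pos_tag` are all assumptions for illustration; this is not how `pos_tag` is implemented.

```python
# Toy lexicon: each word lists its possible tags, most likely first (illustrative only).
LEXICON = {
    "the": ["DT"],
    "a": ["DT"],
    "to": ["TO"],
    "i": ["PRP"],
    "want": ["VBP"],
    "book": ["NN", "VB"],
    "flight": ["NN"],
}


def toy_pos_tag(tokens):
    """Tag tokens by lexicon lookup, using the previous tag to resolve noun/verb ambiguity."""
    tagged = []
    previous_tag = None
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["NN"])  # unknown words default to noun
        tag = candidates[0]
        # Context rule: after "to", prefer a verb reading when one is available.
        if previous_tag == "TO" and "VB" in candidates:
            tag = "VB"
        tagged.append((token, tag))
        previous_tag = tag
    return tagged


print(toy_pos_tag(["the", "book"]))
print(toy_pos_tag(["I", "want", "to", "book", "a", "flight"]))
```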


## Stemming
Background reading: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

In [322]:
stem_words = ["play", "played", "playing", "player", "happier", "happiness", "universe", "universal"]
from nltk.stem import PorterStemmer
ps = PorterStemmer()
for w in stem_words:
    print(ps.stem(w))

play
play
play
player
happier
happi
univers
univers
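
Outputs such as `happi` and `univers` show that a stem need not be a dictionary word. A deliberately crude suffix stripper makes the same point. The suffix list and the name `toy_stem` are assumptions for illustration; this is not the Porter algorithm, which applies a much larger set of ordered, conditional rules.

```python
# Suffixes tried in order; the list is illustrative, not the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if at least `min_stem_length` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for w in ["play", "played", "playing", "plays", "running"]:
    print(w, "->", toy_stem(w))
```

The four forms of "play" collapse to one stem, while "running" becomes "runn": related words still map together, even though the stem itself is not a valid word.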


## Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatization algorithms is an open area of research.

In [323]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

In [324]:
lem.lemmatize("good", pos = 'a')
    

'good'

In [325]:
lem.lemmatize("better", pos = 'a')

'good'

In [326]:
lem.lemmatize("universe", pos = 'a') 

'universe'

In [327]:
lem.lemmatize("universal", pos = 'a')

'universal'

In [328]:
lem.lemmatize("university", pos = 'n')   

'university'

In [329]:
lem.lemmatize("painting", pos = 'n')

'painting'

In [330]:
lem.lemmatize("painting", pos = 'v')

'paint'
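
The two `painting` calls above show that the lemma depends on the part of speech. A toy dictionary lookup keyed by `(word, pos)` captures that idea. The table and the name `toy_lemmatize` are assumptions for illustration; `WordNetLemmatizer` relies on the WordNet database rather than a hand-written table like this one.

```python
# Toy lemma table keyed by (word, part of speech). The entries mirror the
# examples above; a real lemmatizer consults a full lexical database.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("painting", "v"): "paint",
    ("painting", "n"): "painting",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("painting", pos="n"))
print(toy_lemmatize("painting", pos="v"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better", pos="n"))
```

The last call returns the word unchanged because the table has no noun entry for "better", which is why passing the correct part of speech matters.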

## Using NLP for text classification 

https://medium.com/data-from-the-trenches/text-classification-the-first-step-toward-nlp-mastery-f5f95d525d73

In [331]:
from nltk.corpus import movie_reviews

In [332]:
len(movie_reviews.fileids())  # dataset contains 2000 movie reviews out of which 1000 are positive and rest are negative.

2000

In [333]:
movie_reviews.fileids('pos')

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt',
 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt',
 'pos/cv007_4968.txt',
 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt',
 'pos/cv010_29198.txt',
 'pos/cv011_12166.txt',
 'pos/cv012_29576.txt',
 'pos/cv013_10159.txt',
 'pos/cv014_13924.txt',
 'pos/cv015_29439.txt',
 'pos/cv016_4659.txt',
 'pos/cv017_22464.txt',
 'pos/cv018_20137.txt',
 'pos/cv019_14482.txt',
 'pos/cv020_8825.txt',
 'pos/cv021_15838.txt',
 'pos/cv022_12864.txt',
 'pos/cv023_12672.txt',
 'pos/cv024_6778.txt',
 'pos/cv025_3108.txt',
 'pos/cv026_29325.txt',
 'pos/cv027_25219.txt',
 'pos/cv028_26746.txt',
 'pos/cv029_18643.txt',
 'pos/cv030_21593.txt',
 'pos/cv031_18452.txt',
 'pos/cv032_22550.txt',
 'pos/cv033_24444.txt',
 'pos/cv034_29647.txt',
 'pos/cv035_3954.txt',
 'pos/cv036_16831.txt',
 'pos/cv037_18510.txt',
 'pos/cv038_9749.txt',
 'pos/cv039_6170.txt',
 'pos/cv040_8276.txt',
 'pos/cv041_21113.txt',
 

In [334]:
fileid = 'pos/cv892_17576.txt'   # corpus-relative id, so the cell does not depend on where nltk_data is installed

In [335]:
str1 = movie_reviews.raw(fileid)   # read the raw review text through the corpus reader
print(str1)              # sample text of a movie review  : The Matrix (1999)


perhaps the most dramatic changes in the motion picture industry in this decade have to do with special effects . 
there is no question that action-adventure and science-fiction/action movies are now judged by the character of their light and noise . 
whereas classic adventure pics of the last twenty years , such as raiders of the lost ark , were made in grand traditional fashion ; contemporary films like jurassic park are multimillion-dollar creations of computer technology . 
the latest in this visually awesome series of movies , the wachowski brothers' the matrix , is a testament to the skilled use of special effects and its ability to enhance a movie's story . 
unlike many sci-fi movies which promote themselves as effects-heavy blockbusters but fail to deliver on that promise , the matrix is a carefully constructed special effects event . 
it runs 135 minutes in length and employs a countless number of computerized tricks which range from gimmick to grandiose , and the quality of t

In [336]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])
documents[0:5]  # preparing the dataset: each entry pairs a review's token list with its category label

[[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'],
 [['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...], 'neg'],
 [['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...], 'neg'],
 [['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...], 'neg'],
 [['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...], 'neg']]

In [337]:
import random
random.seed(2)
random.shuffle(documents)  # shuffle the examples so both classes appear in the training and test sets


In [338]:
training_documents = documents[0:1500]
testing_documents = documents[1500:]
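
The shuffle-then-slice pattern above can be wrapped in a small helper. This is a sketch under assumptions: `shuffle_split` is a hypothetical name, the labelled examples are made up, and the helper shuffles a copy with its own seeded generator instead of the module-level `random.seed` used above.

```python
import random
from collections import Counter


def shuffle_split(examples, train_size, seed=2):
    """Shuffle a copy of `examples` reproducibly and split it at `train_size`."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical labelled examples: (document id, label).
examples = [(i, "pos" if i % 2 == 0 else "neg") for i in range(20)]
train, test = shuffle_split(examples, train_size=15)

print(len(train), len(test))
print(Counter(label for _, label in train))  # check that both classes are represented
```

Printing the label counts is a cheap sanity check: without the shuffle, the first 1500 documents in this corpus ordering would contain every negative review and only half of the positive ones.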

In [339]:
import string
punc = list(string.punctuation)         # list of punctuation characters
all_words = []
stop = stop + punc                      # extend the stop word list with punctuation
for doc in training_documents:
    for w in doc[0]:
        if w.lower() not in stop:
            all_words.append(w.lower()) # collect every lowercased word that is neither punctuation nor a stop word

In [340]:
len(all_words)

529972

In [341]:
import nltk
dist = nltk.FreqDist(all_words)     # getting frequency of all words
features = dist.most_common(3000)   # getting top 3000 words 
feature_words = [i[0] for i in features]
stop=stopwords.words('english')
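
The counting step can be sketched with `collections.Counter` from the standard library, which provides the same `most_common` method used above. The token list below is made up and stands in for `all_words`.

```python
from collections import Counter

# Made-up token list standing in for `all_words`.
tokens = ["good", "film", "good", "plot", "bad", "film", "good"]

counts = Counter(tokens)
top_two = counts.most_common(2)            # (word, count) pairs, most frequent first
top_words = [word for word, _ in top_two]  # keep only the words, as feature_words does

print(top_two)
print(top_words)
```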

In [342]:
def get_features(document, stop):
    # Lowercase every token and drop stop words and punctuation.
    excluded = set(stop) | set(string.punctuation)
    words = {w.lower() for w in document if w.lower() not in excluded}

    # Binary bag-of-words: 1 if the feature word occurs in the document, else 0.
    # Note that this records presence, not how often the word occurs.
    return {w: int(w in words) for w in feature_words}
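
`get_features` marks each feature word as present or absent rather than counting its occurrences. The difference between the two encodings can be sketched as follows; the function names, vocabulary, and review tokens are hypothetical.

```python
from collections import Counter


def presence_features(tokens, vocabulary):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return {word: int(word in present) for word in vocabulary}


def count_features(tokens, vocabulary):
    """Number of times each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}


vocabulary = ["good", "bad", "plot"]          # hypothetical feature words
review = ["good", "plot", "good", "acting"]   # hypothetical tokenized review

print(presence_features(review, vocabulary))
print(count_features(review, vocabulary))
```

Both encodings yield non-negative values, so either can be fed to a multinomial Naive Bayes model; which one works better on a given corpus is an empirical question.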

In [343]:
#get_features(training_documents[0][0],stop)

In [344]:
training_documents[0][1]

'pos'

In [345]:
training_data = [[get_features(i[0],stop), i[1]] for i in training_documents]        # preparing the training data

In [346]:
testing_data = [[get_features(i[0],stop), i[1]] for i in testing_documents]          # preparing the testing data

In [347]:
from nltk.classify.scikitlearn import SklearnClassifier                        # using SVM to classify
from sklearn.svm import SVC

In [348]:
classifier_sklearn = SklearnClassifier(SVC())
classifier_sklearn.train(training_data)

<SklearnClassifier(SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))>

In [349]:
nltk.classify.accuracy(classifier_sklearn, testing_data)

0.708

In [350]:
from nltk.classify.scikitlearn import SklearnClassifier
from  sklearn import naive_bayes

In [351]:
classifier_sklearn = SklearnClassifier(naive_bayes.MultinomialNB())            # using Multinomial Naive Bayes
classifier_sklearn.train(training_data)                                        # pass the list directly, as in the SVM cell

<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>

In [352]:
nltk.classify.accuracy(classifier_sklearn, testing_data)

0.8
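
Accuracy here is the fraction of test documents whose predicted label matches the true label. With 500 test documents, 0.708 corresponds to 354 correct predictions and 0.8 to 400. A minimal sketch of the calculation, using made-up labels and a hypothetical `accuracy` helper:

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


predicted = ["pos", "neg", "pos", "pos", "neg"]   # hypothetical predictions
actual = ["pos", "neg", "neg", "pos", "pos"]      # hypothetical gold labels
print(accuracy(predicted, actual))                # 3 of 5 labels match
```

Because the corpus is balanced between positive and negative reviews, a classifier that always guesses one class would score about 0.5, which is the baseline both results should be read against.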