# NLTK for NLP 
NLTK (Natural Language Toolkit) is a Python library for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

corpus - a large collection of text <br>
corpora - plural of corpus

In [2]:
# Brown is one of the corpora offered by NLTK. It contains text from 500 sources,
# categorized by genre (news, editorial, and so on), so sample sentences can be
# pulled from any category.
# If the corpus data is missing locally, nltk.download('brown') fetches it.
from nltk.corpus import brown

In [4]:
data=brown.categories()
print(type(data),len(data),data,sep='\n')

<class 'list'>
15
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [8]:
data=brown.sents(categories='fiction')
for i in range(5):
    print(' '.join(data[i]))

Thirty-three
Scotty did not go back to school .
His parents talked seriously and lengthily to their own doctor and to a specialist at the University Hospital -- Mr. McKinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the term off , spend a lot of time in bed and , for the rest , do pretty much as he chose -- provided , of course , he chose to do nothing too exciting or too debilitating .
His teacher and his school principal were conferred with and everyone agreed that , if he kept up with a certain amount of work at home , there was little danger of his losing a term .
Scotty accepted the decision with indifference and did not enter the arguments .


# Bag of words pipeline
Machine learning classifiers work on numbers, not text. For a task such as predicting a movie rating from its reviews, the only input is the review text in human-readable language, so the text must first be converted into numbers. The bag-of-words model does this conversion so that an ML classifier can work on the data. The pipeline is:
<ul>
    <li>
Get the Data/Corpus    </li>    <li>
Tokenisation, Stopword Removal    </li>    <li>
Stemming/Lemmatisation   </li>    <li>
Building a Vocab   </li>    <li>
Vectorization - converting sentence to a vector of numbers   </li>    <li>
Classification    </li>
    </ul>

In [34]:
document = """It was a very pleasant day. The weather was cool and there were light showers.
I went to the market to buy some fruits."""

sentence = "Send all the 50 documents related to chapters 1,2,3 at Harshit@gmail.com"

### Tokenization

In [35]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [36]:
sents = sent_tokenize(document)
print(sents)
print(len(sents))

['It was a very pleasant day.', 'The weather was cool and there were light showers.', 'I went to the market to buy some fruits.']
3


In [13]:
print(sentence.split())

['Send', 'all', 'the', '50', 'documents', 'related', 'to', 'chapters', '1,2,3', 'at', 'Harshit@gmail.com']


In [14]:
words = word_tokenize(sentence)
print(words)

['Send', 'all', 'the', '50', 'documents', 'related', 'to', 'chapters', '1,2,3', 'at', 'Harshit', '@', 'gmail.com']


### Stopwords

In [17]:
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
print(sw)

{'of', 'aren', 've', 'theirs', 'your', 'these', 'yourself', 'in', "mustn't", 'himself', 'other', "you've", 'his', 'had', "wouldn't", 'ain', 'at', 'such', 'off', 'hasn', 'ours', 'as', 'down', 'its', "that'll", 'during', 'once', "mightn't", "you'll", "hadn't", 'do', 'doesn', "weren't", 'yours', 'because', 'from', 'doing', 's', 're', 'shan', 'under', "haven't", 'you', "you're", 'm', 'are', 'the', 'too', 'we', 'only', 'won', "won't", 'has', 'yourselves', 'she', "it's", 'above', 'into', 'before', 'very', 'our', 'until', "needn't", 'not', 'or', 'her', 'just', 'nor', 'further', "wasn't", 'some', 'which', 'where', 'were', 'an', 'again', 'a', 'itself', 'out', 'haven', 'through', "hasn't", 'myself', "you'd", 't', 'how', 'and', 'they', 'here', 'up', 'most', 'am', 'with', 'him', "couldn't", 'no', "shouldn't", 'i', 'themselves', 'does', 'mightn', 'll', 'was', 'over', "should've", 'about', "doesn't", 'ourselves', 'what', 'then', 'mustn', 'on', "don't", 'wasn', 'own', 'don', 'if', 'by', 'couldn', 'ne

In [19]:
def remove_stopwords(text,stopwords):
    useful_words = [w for w in text if w not in stopwords]
    return useful_words
text = "i am not bothered about her very much".split()
useful_text = remove_stopwords(text,sw)
print(useful_text)

['bothered', 'much']


### Tokenization using Regular Expression

In [37]:
sentence = "Send all the 50 documents related to chapters 1,2,3 at Harshit@gmail.com"
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer('[a-zA-Z@.]')  # the brackets list the characters (or character ranges)
useful_text = tokenizer.tokenize(sentence)  # that are allowed to appear in a token

In [39]:
print(useful_text)

['S', 'e', 'n', 'd', 'a', 'l', 'l', 't', 'h', 'e', 'd', 'o', 'c', 'u', 'm', 'e', 'n', 't', 's', 'r', 'e', 'l', 'a', 't', 'e', 'd', 't', 'o', 'c', 'h', 'a', 'p', 't', 'e', 'r', 's', 'a', 't', 'H', 'a', 'r', 's', 'h', 'i', 't', '@', 'g', 'm', 'a', 'i', 'l', '.', 'c', 'o', 'm']


In [40]:
tokenizer = RegexpTokenizer('[a-zA-Z@.]+')  # '+' matches a continuous run of allowed characters,
useful_text = tokenizer.tokenize(sentence)  # so each run becomes a single token
print(useful_text)

['Send', 'all', 'the', 'documents', 'related', 'to', 'chapters', 'at', 'Harshit@gmail.com']
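
The effect of the `+` quantifier can be checked with the standard library alone. This is a minimal sketch, assuming that `RegexpTokenizer` in its default mode simply returns every non-overlapping match of the pattern, which is what `re.findall` does:

```python
import re

sentence = "Send all the 50 documents related to chapters 1,2,3 at Harshit@gmail.com"

# Without a quantifier, each match is a single allowed character.
chars = re.findall(r"[a-zA-Z@.]", sentence)

# With '+', each match is a maximal run of allowed characters, i.e. one token.
tokens = re.findall(r"[a-zA-Z@.]+", sentence)

print(chars[:4])
print(tokens)
```

Digits and commas are not in the character class, so `50` and `1,2,3` never match and drop out of the token list.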


### Stemming / Lemmatisation
- A process that reduces inflected words (verb forms, plurals) to their root form
- Preserves the semantics of the sentence without increasing the number of unique tokens
- Example - jumps, jumping, jumped, jump ==> jump

Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is always a real word of the language. Stemming applies a fixed sequence of rule-based steps to each word, which makes it faster than lemmatization.
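
To make the difference concrete, here is a toy sketch in plain Python. It is not the Porter or Snowball algorithm: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical stand-ins, chosen only to show why rule-based stripping can produce non-words while a dictionary lookup cannot.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ful", "ed", "s")  # hypothetical rule list

# Tiny hand-written table standing in for a real dictionary such as WordNet.
LEMMAS = {"jumps": "jump", "jumping": "jump", "jumped": "jump"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["jumps", "jumping", "jumped", "beautiful"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer blindly removes `ful` from `beautiful` and returns `beauti`, while the lookup leaves the word intact because it only ever returns entries that are real words.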

In [42]:
from nltk.stem import SnowballStemmer, PorterStemmer
from nltk.stem import LancasterStemmer
# NLTK offers several stemmers, including Snowball, Porter, and Lancaster

In [43]:
ps = PorterStemmer()
ps.stem('jumping')

'jump'

In [44]:
ps.stem('jumps')

'jump'

In [45]:
ps.stem('loving')

'love'

In [33]:
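## Lemmatization
# WordNetLemmatizer needs the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer

wn = WordNetLemmatizer()
ss = SnowballStemmer('english')  # Snowball stemmer for English, used for comparison
print(wn.lemmatize('beautiful'))
print(ss.stem('beautiful'))  # 'beauti' is not a real word!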

beautiful
beauti


### Building a Vocab & Vectorization

In [81]:
# Sample Corpus - Contains 4 Documents, each document can have 1 or more sentences
corpus = [
        'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
        'We will win next Lok Sabha Elections, says confident Indian PM',
        'The nobel laurate won the hearts of the people.',
        'The movie Raazi is an exciting Indian Spy thriller based upon a real story.'
]
# This dummy corpus has 4 documents from different categories: sports, politics,
# literature, and movies. Next we build a vocab, the list of all unique words in the
# entire corpus. To turn text into numeric features, each document becomes a vector:
# start with a vector of zeros whose length is the vocab size, then go through the
# words of the document. For example, if 'india' were stored at index 7 of the vocab,
# every occurrence of 'india' would add 1 to position 7. The result is the document
# as a vector of counts. The mapping from each word to its index is stored in a
# dictionary.
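
The counting procedure described in the comments above can be sketched in plain Python. This is only an illustration of the idea, not how `CountVectorizer` is implemented; the helper names `build_vocab` and `vectorize` and the toy corpus are made up for the example.

```python
def build_vocab(documents):
    """Map every unique lowercase word to an index, in alphabetical order."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocab):
    """Return a count vector; words missing from the vocab are ignored."""
    vector = [0] * len(vocab)
    for word in document.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

toy_corpus = ["the cat sat", "the dog sat on the mat"]  # hypothetical data
vocab = build_vocab(toy_corpus)
print(vocab)
print(vectorize("the dog sat on the mat", vocab))
print(vectorize("the bird sat", vocab))  # 'bird' is unseen, so it is dropped
```

Every vector has the same length as the vocab, which is what lets a classifier treat documents of different lengths as rows of one matrix.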

In [88]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
vectorized_corpus = cv.fit_transform(corpus)  # fit learns the vocab from the corpus;
# transform then encodes the corpus using that vocab.
# The result is a sparse matrix.

In [89]:
print(cv.vocabulary_) #vocab in form of dictionary

{'indian': 12, 'cricket': 6, 'team': 31, 'will': 37, 'wins': 39, 'world': 41, 'cup': 7, 'says': 27, 'capt': 4, 'virat': 35, 'kohli': 14, 'be': 3, 'held': 11, 'at': 1, 'sri': 29, 'lanka': 15, 'we': 36, 'win': 38, 'next': 19, 'lok': 17, 'sabha': 26, 'elections': 8, 'confident': 5, 'pm': 23, 'the': 32, 'nobel': 20, 'laurate': 16, 'won': 40, 'hearts': 10, 'of': 21, 'people': 22, 'movie': 18, 'raazi': 24, 'is': 13, 'an': 0, 'exciting': 9, 'spy': 28, 'thriller': 33, 'based': 2, 'upon': 34, 'real': 25, 'story': 30}


In [90]:
print(type(vectorized_corpus)) 
# When the matrix is printed, an entry such as (0, 6)  1 means that in document 0
# the word at vocab index 6 occurs once

<class 'scipy.sparse.csr.csr_matrix'>
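
The `(row, column)  count` notation can be mimicked with a small plain-Python sketch. The matrix `dense` is hypothetical, and the list built here mirrors only the printed coordinate view, not the internal layout of a SciPy CSR matrix.

```python
dense = [
    [0, 2, 0, 1],
    [1, 0, 0, 0],
]  # hypothetical count matrix: 2 documents, 4 vocab terms

# Keep only the non-zero cells as ((row, column), count) pairs.
sparse = [
    ((r, c), value)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value
]

for coords, count in sparse:
    print(coords, count)
```

Most cells of a bag-of-words matrix are zero, so storing only the non-zero entries saves a lot of memory on a large vocab.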


In [85]:
vectorized_corpus = vectorized_corpus.toarray()
print(vectorized_corpus.size)
print(vectorized_corpus)

168
[[0 1 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1
  0 2 0 1 0 2]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0
  1 1 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 3 0 0 0
  0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0
  0 0 0 0 0 0]]


In [76]:
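# Reverse mapping: recover the words of a document from its count vector.
# The output below is in sparse form, so this cell was run on the sparse matrix,
# i.e. before the .toarray() conversion above.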
numbers = vectorized_corpus[2]
print(numbers)
s = cv.inverse_transform(numbers)
print(s)

  (0, 32)	3
  (0, 20)	1
  (0, 16)	1
  (0, 40)	1
  (0, 10)	1
  (0, 21)	1
  (0, 22)	1
[array(['the', 'nobel', 'laurate', 'won', 'hearts', 'of', 'people'],
      dtype='<U9')]


#### Vectorization with  Stopword Removal

In [78]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer('[a-zA-Z@.]+')  # '+' is required so that whole words, not single characters, become tokens
useful_text = tokenizer.tokenize(sentence)

def myTokenizer(document):
    words = tokenizer.tokenize(document.lower())
    # Remove Stopwords
    words = remove_stopwords(words,sw)
    return words

In [79]:
cv = CountVectorizer(tokenizer=myTokenizer)
vectorized_corpus = cv.fit_transform(corpus).toarray()
print(vectorized_corpus)

[[0 1 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 2]
 [0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0]]


In [86]:
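# For test data, call transform only, so the vocab learned from the training corpus is reused.
# Words that are not in the vocab (here 'rock') are ignored.
# The 42-column output below matches the vocab of the default CountVectorizer built earlier,
# not the 33-column one produced with myTokenizer.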
test_corpus = [
        'Indian cricket rock !',        
]
cv.transform(test_corpus).toarray()

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

### More ways to Create Features
- Unigrams - every word is a feature
- Bigrams - every pair of consecutive words is a single feature
- Trigrams - every run of 3 consecutive words is a single feature
- n-grams - every run of n consecutive words is a single feature
- TF-IDF Normalisation

When is a bigram model more useful than a unigram model?
For example, take "this is not good movie". A unigram model may predict this as a good review, because it treats the words as an unordered bag and cannot tie the negation "not" to "good". A bigram model keeps "not good" together as one feature, so it can learn that this is a bad review.
Similarly, for "this is good movie but actor is not present", a unigram model may classify it as a negative review because of the word "not", even though "not" here modifies "present" rather than "good".
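
A short plain-Python sketch shows what each model actually sees. The helper `ngrams` is a hypothetical name used only for this illustration:

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in text, joined by spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

review = "this is not good movie"
unigrams = ngrams(review, 1)
bigrams = ngrams(review, 2)
print(unigrams)
print(bigrams)
```

In the unigram list, "not" and "good" are separate features; in the bigram list, "not good" is a single feature that a classifier can associate with negative reviews.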

In [91]:
sent_1  = ["this is good movie"]
sent_2 = ["this is good movie but actor is not present"]
sent_3 = ["this is not good movie"]

In [92]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(ngram_range=(2,2)) # (n, n) keeps only n-grams of size n, so (2, 2) gives bigrams
vectorized_sent2=cv.fit_transform(sent_2).toarray()
print(cv.vocabulary_)

{'this is': 7, 'is good': 3, 'good movie': 2, 'movie but': 5, 'but actor': 1, 'actor is': 0, 'is not': 4, 'not present': 6}


In [94]:
cv=CountVectorizer(ngram_range=(1,3)) # every single word, every 2 adjacent words, and every
# 3 adjacent words each become a feature, so the vocab combines the features of the
# unigram, bigram, and trigram models
vectorized_sent2=cv.fit_transform(sent_2).toarray()
print(cv.vocabulary_)

{'this': 20, 'is': 9, 'good': 6, 'movie': 14, 'but': 3, 'actor': 0, 'not': 17, 'present': 19, 'this is': 21, 'is good': 10, 'good movie': 7, 'movie but': 15, 'but actor': 4, 'actor is': 1, 'is not': 12, 'not present': 18, 'this is good': 22, 'is good movie': 11, 'good movie but': 8, 'movie but actor': 16, 'but actor is': 5, 'actor is not': 2, 'is not present': 13}


## Tf-idf Normalisation

- Avoid relying on features that occur very often, because they carry less information
- The information a term carries decreases as it appears in more documents of different types
- So we define a new measure, tf-idf (term frequency times inverse document frequency), which associates a weight with every term
- tf = term frequency, idf = inverse document frequency

In [95]:
sent_1  = "this is good movie"
sent_2 = "this was good movie"
sent_3 = "this is not good movie"

corpus = [sent_1,sent_2,sent_3]

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [96]:
vc = tfidf.fit_transform(corpus).toarray()
print(vc)
print(tfidf.vocabulary_)

[[0.46333427 0.59662724 0.46333427 0.         0.46333427 0.        ]
 [0.41285857 0.         0.41285857 0.         0.41285857 0.69903033]
 [0.3645444  0.46941728 0.3645444  0.61722732 0.3645444  0.        ]]
{'this': 4, 'is': 1, 'good': 0, 'movie': 2, 'was': 5, 'not': 3}
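
The weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings, namely the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row; the helper `tfidf_row` is a made-up name, and a vectorizer with other settings would give different numbers.

```python
import math

corpus = ["this is good movie", "this was good movie", "this is not good movie"]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})  # alphabetical, like vocabulary_
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Weight raw term counts by idf, then scale the row to unit L2 length."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

for doc in docs:
    print([round(x, 8) for x in tfidf_row(doc)])
```

Terms found in every document ("this", "good", "movie") get the minimum idf of 1, while "was" and "not", which each appear in a single document, get the largest weights in their rows.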
