# NLP
Natural Language Processing (NLP) is a subfield of linguistics, computer science and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

NLP is the ability of a computer program to understand human language as it is spoken and written -- referred to as natural language. It is a component of artificial intelligence (AI).<br>

NLP has existed for more than 50 years and has roots in the field of linguistics. It has a variety of real-world applications in a number of fields, including medical research, search engines and business intelligence.


### Real World Applications



*   Contextual advertisements: targeted ads are shown based on analysis of the content
*   Email clients: spam filtering, smart reply
*   Social media: removing adult content, opinion mining
*   Search engines: summaries and one-line answers for the searched content
*   Chatbots








## Common NLP Tasks


1.   Text/Document Classification
2.   Sentiment Analysis
3.   Information Retrieval
4.   Part-of-Speech Tagging
5.   Language Detection and Machine Translation
6.   Conversational Agents
7.   Knowledge Graph and QA Systems
8.   Text Summarization
9.   Topic Modelling
10.  Text Generation
11.  Spell Checking and Grammar Correction
12.  Text Parsing
13.  Speech-To-Text















## Approaches to NLP
1.   Heuristic Methods
2.   Machine Learning Based Methods
3.   Deep Learning Based Methods

## Challenges in NLP


1.   Ambiguity
2.   Contextual Words
3.   Colloquialisms and slang
4.   Synonyms
5.   Irony, sarcasm and tonal difference
6.   Spelling errors
7.   Creativity
8.   Diversity




An NLP pipeline is the set of steps followed to build end-to-end NLP software.
It consists of the following steps:


1.   Data Acquisition
2.   Text Preparation

> *   Text Clean up block
> *   Basic preprocessing
> *   Advanced Preprocessing

3.   Feature Engineering
4.   Modelling

> *   Model Building
> *   Evaluation 

5.   Deployment

> *   Deployment
> *   Monitoring
> *   Model Update








Data preprocessing involves preparing and "cleaning" text data so that machines can analyze it. Preprocessing puts data in workable form and highlights features in the text that an algorithm can work with. There are several ways this can be done, including:
- Tokenization. This is when text is broken down into smaller units to work with.
- Stop word removal. This is when common words are removed from text so that the words offering the most information about the text remain.
- Lemmatization and stemming. This is when words are reduced to their root forms for processing.
- Part-of-speech tagging. This is when words are marked based on their part of speech, such as noun, verb or adjective.


Once the data has been preprocessed, an algorithm is developed to process it. There are many different natural language processing algorithms, but two main types are commonly used:

- Rules-based system. This system uses carefully designed linguistic rules. This approach was used early on in the development of natural language processing, and is still used.
- Machine learning-based system. Machine learning algorithms use statistical methods. They learn to perform tasks based on training data they are fed, and adjust their methods as more data is processed. Using a combination of machine learning, deep learning and neural networks, natural language processing algorithms hone their own rules through repeated processing and learning.
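
To make the rules-based idea concrete, here is a minimal sketch of a hand-written sentiment rule set. The word lists and the function name `rule_based_sentiment` are hypothetical and chosen only for illustration. A machine learning system would learn such associations from labelled training data instead of having them written by hand.

```python
import re

# Hypothetical hand-written lexicons; a real system would use far larger lists.
POSITIVE = {"good", "great", "profit", "love"}
NEGATIVE = {"bad", "poor", "loss", "hate"}
NEGATIONS = {"not", "no", "never"}

def rule_based_sentiment(text):
    """Label text as 'positive', 'negative' or 'neutral' using fixed rules."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # Rule: a negation word directly before a sentiment word flips it
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The shares are in profit, great!"))
print(rule_based_sentiment("This is not good."))
```

Every behaviour of this classifier is visible in its rules, which makes it easy to debug but hard to extend to new vocabulary. That trade-off is the main reason statistical methods are often preferred once enough training data is available.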


In [1]:
# import necessary libraries 
import nltk
import string
import re

### Text lowercase

We lowercase the text to reduce the size of the vocabulary of our text data.

In [2]:
def lowercase_text(text): 
    return text.lower() 
  
input_str = 'We bought 6 equity shares last Month, and 5 are in profit.'
lowercase_text(input_str) 

'we bought 6 equity shares last month, and 5 are in profit.'

In [3]:
# For Removing numbers 
def remove_num(text): 
    result = re.sub(r'\d+', '', text) 
    return result 
  
input_s = 'We bought 6 equity shares, and 5 are in profit.'
remove_num(input_s) 

'We bought  equity shares, and  are in profit.'

In [4]:
# import the library 
import inflect 
q = inflect.engine() 
  
# convert number into text 
def convert_num(text): 
    # split strings into list of texts 
    temp_string = text.split() 
    # initialise empty list 
    new_str = [] 
  
    for word in temp_string: 
        # if the token consists only of digits, convert it 
        # to words and append it to the new_str list 
        if word.isdigit(): 
            temp = q.number_to_words(word) 
            new_str.append(temp) 
  
        # append the texts as it is 
        else: 
            new_str.append(word) 
  
    # join the texts of new_str to form a string 
    temp_str = ' '.join(new_str) 
    return temp_str 
  
input_str = 'We bought 6 equity shares, and 5 are in profit.'
convert_num(input_str)

'We bought six equity shares, and five are in profit.'

In [5]:
# let's remove punctuation 
def rem_punct(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator) 
  
input_str = 'We bought 6 equity shares :-) :-)$$$$***!, and 5 are in profit:-) :-):-) :-):-) :-)!!.'
rem_punct(input_str) 

'We bought 6 equity shares   and 5 are in profit   '
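
Removing punctuation leaves runs of spaces behind, as the output above shows. A minimal sketch of one way to tidy this up, collapsing whitespace after the punctuation is stripped (the function name `remove_punct_and_squeeze` is made up for this example):

```python
import re
import string

def remove_punct_and_squeeze(text):
    """Strip ASCII punctuation, then collapse whitespace runs to one space."""
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    # \s+ matches any run of whitespace; strip() drops leading/trailing spaces
    return re.sub(r'\s+', ' ', no_punct).strip()

print(remove_punct_and_squeeze(
    'We bought 6 equity shares :-) :-)$$$$***!, and 5 are in profit:-) !!.'
))
```

The same `re.sub(r'\s+', ' ', ...)` step can be applied after number removal, which leaves double spaces in the same way.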

### Remove default stopwords:

Stopwords are very common words (such as "the", "is" and "in") that usually carry little information about the content of a text. For many tasks they can be removed without losing much meaning, although not for all: dropping "not", for example, can reverse the sentiment of a sentence. The NLTK (Natural Language Toolkit) library ships a list of stopwords, which we can use to remove stopwords from our text and return a list of word tokens.

In [6]:
# importing nltk library
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

nltk.download('stopwords')
nltk.download('punkt')
  
# remove stopwords function 
def rem_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
ex_text = "Data is the new oil. A.I is the last invention"
rem_stopwords(ex_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']
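
Note that the membership test above is case-sensitive, so a capitalised stopword such as "The" at the start of a sentence would not be removed. A minimal sketch of a case-insensitive variant follows. To stay self-contained it uses a tiny hand-written stop list and a simple regular expression tokenizer, both of which are assumptions for illustration and not NLTK's own list or tokenizer.

```python
import re

# A tiny hand-written stop list, used only for illustration.
# It is an assumption, not NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}

def remove_stopwords_ci(text):
    """Drop stop words, comparing in lowercase so 'The' matches 'the'."""
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

print(remove_stopwords_ci("The data is on the server"))
```

The original casing of the surviving tokens is preserved, since only the comparison is lowercased.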

### Stemming

Stemming is the process of reducing a word to its root form. The root, or stem, is the part to which affixes (such as -ed or -ize) are attached. We create the stem by stripping the prefix or suffix of a word, so stemming may not produce an actual word.

For example:

- Mangoes ---> Mango
- Boys ---> Boy
- going ---> go
             
If the text is not yet tokenized, we first need to split it into tokens; each word token can then be reduced to its root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer and the Lancaster stemmer. We usually use the Porter stemmer among them.

In [7]:
#importing nltk's porter stemmer 
from nltk.stem.porter import PorterStemmer 
from nltk.tokenize import word_tokenize 
stem1 = PorterStemmer() 
  
# stem words in the list of tokenised words 
def s_words(text): 
    word_tokens = word_tokenize(text) 
    stems = [stem1.stem(word) for word in word_tokens] 
    return stems 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
s_words(text)

['data',
 'is',
 'the',
 'new',
 'revolut',
 'in',
 'the',
 'world',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individu',
 'would',
 'gener',
 'terabyt',
 'of',
 'data',
 '.']

In [8]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
paragrap="""On the 8th International Day of Yoga, PM Modi participated in a mass yoga demonstration at Mysuru Palace Ground. Speaking on the occasion, the Prime Minister said that the yogic energy, which has been nurtured for centuries by the spiritual centers of India like Mysuru, is today giving direction to global health.PM Modi took part in a programme at Sri Suttur Math, Mysuru. He said, as per scripture, there is nothing as noble as knowledge that is why our sages shaped our consciousness that is laced with knowledge and adorned with science, the one that grows by enlightenment and gets stronger by research."""
sentences=nltk.sent_tokenize(paragrap) # Converts paragraphs into sentences
print(sentences)
# words=nltk.word_tokenize(paragrap) # Converts paragraphs into words
# print(words)
# Stemming reduces each word to its stem by stripping affixes
# (suffixes and prefixes); the stem need not be a dictionary word.
stemmer=PorterStemmer()
# Build the stop word set once instead of once per token
stop_words=set(stopwords.words('english'))
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i]=' '.join(words)
print(sentences)
print(stop_words)

['On the 8th International Day of Yoga, PM Modi participated in a mass yoga demonstration at Mysuru Palace Ground.', 'Speaking on the occasion, the Prime Minister said that the yogic energy, which has been nurtured for centuries by the spiritual centers of India like Mysuru, is today giving direction to global health.PM Modi took part in a programme at Sri Suttur Math, Mysuru.', 'He said, as per scripture, there is nothing as noble as knowledge that is why our sages shaped our consciousness that is laced with knowledge and adorned with science, the one that grows by enlightenment and gets stronger by research.']
['on 8th intern day yoga , pm modi particip mass yoga demonstr mysuru palac ground .', 'speak occas , prime minist said yogic energi , nurtur centuri spiritu center india like mysuru , today give direct global health.pm modi took part programm sri suttur math , mysuru .', 'he said , per scriptur , noth nobl knowledg sage shape conscious lace knowledg adorn scienc , one grow enl

**Stemming problem:** <br>

The stem produced may not be a meaningful word. <br>
e.g. intelligen, fina, etc.
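
The three NLTK stemmers named earlier can be compared side by side to see how their stems differ and how often they are not dictionary words. This is a minimal sketch: the helper `compare_stems` and the word list are chosen only for illustration, and the stemmers need no corpus download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

def compare_stems(words):
    """Return {word: {stemmer_name: stem}} for side-by-side inspection."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

for word, stems in compare_stems(["going", "revolution", "generously"]).items():
    print(word, stems)
```

Running this on your own vocabulary is a quick way to judge which stemmer is too aggressive or too conservative for a given task.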

### Lemmatization

Lemmatization does the same job as stemming, with one difference: it ensures that the root word (the lemma) belongs to the language, so the result is a valid word. In NLTK (Natural Language Toolkit), we use WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context, so we pass the part of speech (pos) as a parameter.

In [11]:
from nltk.stem import wordnet 
from nltk.tokenize import word_tokenize 
lemma = wordnet.WordNetLemmatizer()
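# 'all' downloads every NLTK dataset; the lemmatizer itself reads the WordNet corpus
nltk.download('all')
# lemmatize string 
def lemmatize_word(text): 
    word_tokens = word_tokenize(text) 
    # provide context i.e. part-of-speech (pos); 'v' treats every token as a verb
    lemmas = [lemma.lemmatize(word, pos ='v') for word in word_tokens] 
    return lemmas 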
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
lemmatize_word(text)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloadin

['Data',
 'be',
 'the',
 'new',
 'revolution',
 'in',
 'the',
 'World',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individual',
 'would',
 'generate',
 'terabytes',
 'of',
 'data',
 '.']

In [12]:
lemmatize_word(paragrap)

['On',
 'the',
 '8th',
 'International',
 'Day',
 'of',
 'Yoga',
 ',',
 'PM',
 'Modi',
 'participate',
 'in',
 'a',
 'mass',
 'yoga',
 'demonstration',
 'at',
 'Mysuru',
 'Palace',
 'Ground',
 '.',
 'Speaking',
 'on',
 'the',
 'occasion',
 ',',
 'the',
 'Prime',
 'Minister',
 'say',
 'that',
 'the',
 'yogic',
 'energy',
 ',',
 'which',
 'have',
 'be',
 'nurture',
 'for',
 'centuries',
 'by',
 'the',
 'spiritual',
 'center',
 'of',
 'India',
 'like',
 'Mysuru',
 ',',
 'be',
 'today',
 'give',
 'direction',
 'to',
 'global',
 'health.PM',
 'Modi',
 'take',
 'part',
 'in',
 'a',
 'programme',
 'at',
 'Sri',
 'Suttur',
 'Math',
 ',',
 'Mysuru',
 '.',
 'He',
 'say',
 ',',
 'as',
 'per',
 'scripture',
 ',',
 'there',
 'be',
 'nothing',
 'as',
 'noble',
 'as',
 'knowledge',
 'that',
 'be',
 'why',
 'our',
 'sag',
 'shape',
 'our',
 'consciousness',
 'that',
 'be',
 'lace',
 'with',
 'knowledge',
 'and',
 'adorn',
 'with',
 'science',
 ',',
 'the',
 'one',
 'that',
 'grow',
 'by',
 'enlightenm

### Parts of Speech (POS) Tagging

A part-of-speech (POS) tag describes how a word is used in a sentence. The same word can have different roles and meanings depending on its context. Basic natural language processing (NLP) models like bag-of-words (BoW) fail to capture these relationships between words. POS tagging addresses this by marking each word with a tag based on its context in the data. POS tags are also used to extract relationships between words.

In [13]:
# importing tokenize library
from nltk.tokenize import word_tokenize 
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
  
# convert text into word_tokens with their tags 
def pos_tagg(text): 
    word_tokens = word_tokenize(text) 
    return pos_tag(word_tokens) 
  
pos_tagg('Are you afraid of something?') 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Are', 'NNP'),
 ('you', 'PRP'),
 ('afraid', 'IN'),
 ('of', 'IN'),
 ('something', 'NN'),
 ('?', '.')]

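In the above example, NNP stands for proper noun (singular), PRP for personal pronoun, and IN for preposition or subordinating conjunction. Note that the tagger is not perfect: here "Are" is tagged as a proper noun and "afraid" as a preposition. The details of every POS tag can be looked up in the Penn Treebank tagset.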

In [14]:
# downloading the tagset  
nltk.download('tagsets') 
  
# extract information about the tag 
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### Chunking

Chunking is the process of extracting phrases from unstructured text to give it more structure. It is also called shallow parsing. It is done on top of POS tagging and groups words into chunks, mainly noun phrases. Here we do the chunking with a regular expression grammar.

In [15]:
#importing libraries
from nltk.tokenize import word_tokenize  
from nltk import pos_tag 
  
# here we define chunking function with text and regular 
# expressions representing grammar as parameter 
def chunking(text, grammar): 
    word_tokens = word_tokenize(text) 
  
    # label words with pos 
    word_pos = pos_tag(word_tokens) 
  
    # create chunk parser using grammar 
    chunkParser = nltk.RegexpParser(grammar) 
  
    # test it on the list of word tokens with tagged pos 
    tree = chunkParser.parse(word_pos) 
      
    for subtree in tree.subtrees(): 
        print(subtree) 
    #tree.draw() 
      
sentence = 'the little red parrot is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar) 

(S
  (NP the/DT little/JJ red/JJ parrot/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ red/JJ parrot/NN)
(NP the/DT sky/NN)


In the above example, we defined the grammar using a regular expression rule. The rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).
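
To experiment with the grammar rule in isolation, the chunker can be fed tokens that are already tagged. The sketch below hard-codes the tags from the output above, so no tagger model is needed, and returns each noun phrase as a plain string. The helper name `noun_phrases` is made up for this example.

```python
import nltk

# Tokens tagged by hand (copied from the output above), so no tagger model
# needs to be downloaded for this sketch.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("red", "JJ"), ("parrot", "NN"),
    ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"), ("the", "DT"), ("sky", "NN"),
]

def noun_phrases(tagged_tokens, grammar="NP: {<DT>?<JJ>*<NN>}"):
    """Return each NP chunk found by the grammar as a plain string."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == "NP")
    ]

print(noun_phrases(tagged))
```

Passing a different `grammar` string makes it easy to see how a rule change affects which phrases are chunked. For instance, this rule only matches singular common nouns tagged `NN`.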

Libraries like spaCy and TextBlob also provide chunking, for example noun phrase extraction.

### Named Entity Recognition

Named entity recognition (NER) is used to extract information from unstructured text. It classifies the entities present in the text into categories such as person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.

In [16]:
#Importing tokenization and chunk
from nltk.tokenize import word_tokenize 
from nltk import pos_tag, ne_chunk 
nltk.download('maxent_ne_chunker')
nltk.download('words')
  
def ner(text): 
    # tokenize the text 
    word_tokens = word_tokenize(text) 
  
    # pos tagging of words 
    word_pos = pos_tag(word_tokens) 
  
    # tree of word entities 
    print(ne_chunk(word_pos)) 
  
text = 'Brain Lara scored the highest 400 runs in a test match which played in between WI and England.'
ner(text) 

(S
  (PERSON Brain/NNP)
  (PERSON Lara/NNP)
  scored/VBD
  the/DT
  highest/JJS
  400/CD
  runs/NNS
  in/IN
  a/DT
  test/NN
  match/NN
  which/WDT
  played/VBD
  in/IN
  between/IN
  (ORGANIZATION WI/NNP)
  and/CC
  (GPE England/NNP)
  ./.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


# Text normalization

Text normalization is a highly overlooked step in text pre-processing. It is the process of transforming text into a canonical (or standard) form. For example, "ok" and "k" can be transformed to "okay", their canonical form. Another example is mapping near-identical words such as "preprocessing", "pre-processing" and "pre processing" to just "preprocessing".

Text normalization is especially useful for noisy texts such as social media comments, comments on blog posts and text messages, where abbreviations, misspellings and out-of-vocabulary (OOV) words are prevalent.


### Effects of normalization

Text normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. We have also found it useful for topic extraction where near synonyms and spelling differences are common (like 'topic modelling', 'topic modeling', 'topic-modeling', 'topic-modelling').

Unlike stemming and lemmatization, there is no standard way to normalize texts. It typically depends on the task. For example, the way you would normalize clinical texts would arguably be different from how you normalize text messages.

Some of the common approaches to text normalization include dictionary mappings, statistical machine translation (SMT) and spelling-correction based approaches.
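
Of these approaches, dictionary mapping is the simplest to sketch. The example below is only an illustration: the mapping table `NORMALIZATION_MAP` is hypothetical and built from the examples in this section, whereas a real table would be task-specific and much larger.

```python
import re

# Hypothetical mapping for illustration; real tables are task-specific.
NORMALIZATION_MAP = {
    "k": "okay",
    "ok": "okay",
    "pre-processing": "preprocessing",
    "pre processing": "preprocessing",
    "modelling": "modeling",
}

# Try longer variants first so multi-word keys win over shorter ones.
_variants = sorted(NORMALIZATION_MAP, key=len, reverse=True)
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in _variants) + r")\b",
    flags=re.IGNORECASE,
)

def normalize(text):
    """Replace each known variant with its canonical form."""
    return _pattern.sub(lambda m: NORMALIZATION_MAP[m.group(0).lower()], text)

print(normalize("Ok, pre-processing and pre processing are done, k?"))
```

The word boundaries (`\b`) keep short keys such as "k" from matching inside longer words. The replacement is always the lowercase canonical form, so the original capitalisation is lost.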

# Word Count

Assuming you understand tokenization, the first figure we can calculate is the word frequency: how many times each token appears in the text. When talking about word frequency, we distinguish between *types* and *tokens*. Types are the distinct words in a corpus, whereas tokens are all the words, including repeats. Let's see how this works in practice.

Let's take an example for better understanding:

“There is no need to panic. We need to work together, take small yet important measures to ensure self-protection,” the Prime Minister tweeted.

How many tokens and types are there in the above sentences?

Let's use Python to calculate these figures. First, tokenize the sentence with a tokenizer that uses whitespace as the separator.

In [17]:
from nltk.tokenize.regexp import WhitespaceTokenizer
m = "'There is no need to panic. We need to work together, take small yet important measures to ensure self-protection,' the Prime Minister tweeted."

In [18]:
tokens = WhitespaceTokenizer().tokenize(m)
print(len(tokens))

23


In [19]:
tokens

["'There",
 'is',
 'no',
 'need',
 'to',
 'panic.',
 'We',
 'need',
 'to',
 'work',
 'together,',
 'take',
 'small',
 'yet',
 'important',
 'measures',
 'to',
 'ensure',
 "self-protection,'",
 'the',
 'Prime',
 'Minister',
 'tweeted.']

In [20]:
my_vocab = set(tokens)
print(len(my_vocab))

20


In [21]:
my_vocab

{"'There",
 'Minister',
 'Prime',
 'We',
 'ensure',
 'important',
 'is',
 'measures',
 'need',
 'no',
 'panic.',
 "self-protection,'",
 'small',
 'take',
 'the',
 'to',
 'together,',
 'tweeted.',
 'work',
 'yet'}

In [22]:
my_st = "'There is no need to panic. We need to work together, take small yet important measures to ensure self-protection,' the Prime Minister tweeted."
#We'll import a different tokenizer:
from nltk.tokenize.regexp import WordPunctTokenizer
#This tokenizer also splits punctuation into separate tokens:
m_t = WordPunctTokenizer().tokenize(my_st)

print(len(m_t))

30


In [23]:
m_t

["'",
 'There',
 'is',
 'no',
 'need',
 'to',
 'panic',
 '.',
 'We',
 'need',
 'to',
 'work',
 'together',
 ',',
 'take',
 'small',
 'yet',
 'important',
 'measures',
 'to',
 'ensure',
 'self',
 '-',
 'protection',
 ",'",
 'the',
 'Prime',
 'Minister',
 'tweeted',
 '.']

In [24]:
my_vocab = set(m_t)
print(len(my_vocab))

26
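
The token and type counts from both tokenizers can be cross-checked with the standard library alone. In this sketch, `str.split()` stands in for the whitespace tokenizer, and the pattern `\w+|[^\w\s]+` is, to my understanding, the one `WordPunctTokenizer` is documented to use. Treat it as an approximation of the NLTK classes, not a replacement for them.

```python
import re

sentence = ("'There is no need to panic. We need to work together, take small "
            "yet important measures to ensure self-protection,' the Prime "
            "Minister tweeted.")

def count_tokens_and_types(tokens):
    """Return (number of tokens, number of distinct types)."""
    return len(tokens), len(set(tokens))

# Split on whitespace only: punctuation stays attached to words
whitespace_tokens = sentence.split()
# Runs of word characters, or runs of non-word, non-space characters
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(count_tokens_and_types(whitespace_tokens))
print(count_tokens_and_types(wordpunct_tokens))
```

The two results show how much the choice of tokenizer changes both figures: splitting off punctuation adds tokens, and it also adds types such as `'.'` and `','`.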


# Frequency distribution

What is a frequency distribution? It is basically a count of how often each word occurs in your text. Here is a brief example of how it works:

In [25]:
#from nltk.book import *
import nltk
#nltk.download('gutenberg')
print("\n\n\n")
text1 = "'There is no need to panic. We need to work together, take small yet important measures to ensure self-protection,' the Prime Minister tweeted."
freqDist = nltk.FreqDist(word_tokenize(text1))
print(freqDist)





<FreqDist with 23 samples and 28 outcomes>
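
In the summary line above, "samples" are the distinct tokens (types) and "outcomes" are all tokens, repeats included. `nltk.FreqDist` is a subclass of `collections.Counter`, so the same quantities can be sketched with the standard library. The token list here is made up for illustration and is not the output of `word_tokenize`.

```python
from collections import Counter

# A small hand-made token list, used only for illustration
tokens = ["need", "to", "work", "need", "to", "to", "panic"]
freq = Counter(tokens)

samples = len(freq)             # distinct tokens (types)
outcomes = sum(freq.values())   # total tokens, repeats included

print(samples, outcomes)
print(freq["to"])               # count for a token that occurs
print(freq["person"])           # a missing token counts as 0
print(freq.most_common(2))      # the two most frequent tokens
```

Looking up a word that never occurs returns 0 and does not raise a `KeyError`, which is convenient when probing a text for arbitrary words.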


The class **FreqDist** works like a dictionary where the keys are the words in the text and the values are the counts associated with those words. For example, if you want to see how many times the word "person" occurs in the text, you can type: