## Introduction to Natural Language Processing

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data efficiently. Using NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.


## Text Preprocessing

Text is the most unstructured form of all available data, so it contains various types of noise and is not readily analyzable without preprocessing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It mainly comprises three steps:

* Noise Removal
* Lexicon Normalization
* Object Standardization

### Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be treated as noise.

Examples include language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.

A general approach to noise removal is to prepare a dictionary of noisy entities and iterate over the text token by token (or word by word), eliminating the tokens that are present in the noise dictionary.

In [3]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")


'sample text'

Another approach is to use regular expressions when dealing with special patterns of noise.

In [4]:

# Sample code to remove a regex pattern
import re

def _remove_regex(input_text, regex_pattern):
    # re.sub deletes every match of the pattern in one pass, so matched
    # text is never re-interpreted as a regular expression itself
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#\w*"

_remove_regex("remove this #hashtag from the #tweet", regex_pattern)

'remove this  from the '
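
The same idea extends to several kinds of noise at once. The following is a minimal sketch, assuming that URLs, mentions, and hashtags are the noise of interest; the pattern list and the function name `remove_noise_patterns` are illustrative choices, not part of the original notebook.

```python
import re

# Hypothetical patterns for three common kinds of social-media noise.
NOISE_PATTERNS = [
    r"https?://\S+",   # URLs
    r"@\w+",           # user mentions
    r"#\w+",           # hashtags
]

def remove_noise_patterns(text, patterns=NOISE_PATTERNS):
    """Delete every match of each pattern, then collapse leftover whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise_patterns("thanks @bob see https://example.com #nlp #tips"))
```

Collapsing whitespace at the end avoids the double spaces that plain deletion leaves behind.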

### Lexicon Normalization

Another type of textual noise concerns the multiple representations exhibited by a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are variations of the word “play”. Though their meanings differ, they are contextually similar. This step converts all the variants of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) to a low-dimensional space (1 feature), which is ideal for any NLP model.

The most common lexicon normalization practices are:

* **Stemming**: Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
* **Lemmatization**: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).




In [1]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 
lem.lemmatize(word, "v")

'multiply'

In [2]:
stem.stem(word)

'multipli'
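
To make the "rule-based suffix stripping" idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm (which applies many more rules); the suffix list and the `min_stem_len` guard are assumptions chosen for illustration.

```python
# Hypothetical suffix list, ordered so that "es" is tried before "s".
SUFFIXES = ["ing", "ly", "es", "s"]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["playing", "plays", "quickly", "multiplying", "played"]:
    print(w, "->", naive_stem(w))
```

Because "ed" is not in the list, "played" is left unchanged, which shows how quickly a purely rule-based approach runs into gaps.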

### Object Standardization

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.

In [13]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) 
    new_text = " ".join(new_words) 
    return new_text

_lookup_words("RT this is a retweeted tweet by Nicholas")

'Retweet this is a retweeted tweet by Nicholas'
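
Splitting on whitespace misses slang that has punctuation attached (for example "luv,"). A possible variant, sketched here under the assumption that slang tokens are plain word characters, uses a regular expression with a replacement function; `lookup_words_regex` is a hypothetical name.

```python
import re

# Same slang dictionary as in the cell above.
lookup_dict = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

def lookup_words_regex(text):
    """Replace whole-word slang tokens, leaving attached punctuation untouched."""
    def replace(match):
        word = match.group(0)
        return lookup_dict.get(word.lower(), word)
    return re.sub(r"\b\w+\b", replace, text)

print(lookup_words_regex("RT: I luv this, it is awsm!"))
```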

The majority of available text data is highly unstructured and noisy; to achieve better insights or build better algorithms, it is necessary to work with clean data. For example, social media data is highly unstructured: it is informal communication, and typos, bad grammar, slang, and unwanted content such as URLs, stopwords, and expressions are the usual suspects.

As a typical business problem, assume you want to find out which features of an iPhone are most popular among fans. You have extracted consumer opinions related to the iPhone, and here is one of the tweets:

```I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy :) http://www.apple.com```


Steps for data cleaning:

**Escaping HTML characters**: Data obtained from the web usually contains many HTML entities such as &lt; &gt; &amp; embedded in the original data, and it is necessary to get rid of them. One approach is to remove them directly with specific regular expressions. Another is to use an appropriate package or module (for example, Python's HTML parsing utilities), which can convert these entities back to standard characters. For example, &lt; is converted to “<” and &amp; is converted to “&”.

In [27]:
import html

original_tweet = "I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy :) http://www.apple.com"

# html.unescape (Python 3) converts HTML entities back to plain characters
tweet = html.unescape(original_tweet)
tweet

"I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo happppppy :) http://www.apple.com"

**Decoding data**: This is the process of transforming information from complex symbols into simple, easier-to-understand characters. Text data may arrive in different encodings such as “Latin” or “UTF-8”. For consistent analysis, it is necessary to keep all the data in one standard encoding. UTF-8 is widely accepted and recommended.


In [28]:
# Python 3: strings are already Unicode; this drops any non-ASCII characters
tweet = original_tweet.encode('ascii', 'ignore').decode('ascii')

In [29]:
tweet

"I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy :) http://www.apple.com"

**Removal of Stop-words**: When the analysis needs to be data-driven at the word level, the commonly occurring words (stop-words) should be removed. One can either create a long list of stop-words or use predefined language-specific libraries.

**Removal of Punctuations**: Punctuation marks should be dealt with according to their priority. For example, “.”, “,” and “?” are important punctuation marks that should be retained, while others need to be removed.
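
A minimal sketch of this rule, assuming only ASCII punctuation matters and that ".", "," and "?" are the marks to keep (the names `KEEP`, `DROP_TABLE`, and `remove_punctuation` are illustrative):

```python
import string

# Punctuation to keep, following the example in the text above.
KEEP = set(".,?")

# Translation table that deletes every other ASCII punctuation character.
DROP_TABLE = str.maketrans("", "", "".join(c for c in string.punctuation if c not in KEEP))

def remove_punctuation(text):
    """Remove all ASCII punctuation except the characters in KEEP."""
    return text.translate(DROP_TABLE)

print(remove_punctuation("Wait... what?! (really) #wow, ok."))
```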

**Removal of Expressions**: Textual data (usually speech transcripts) may contain human expressions like [laughing], [Crying], [Audience paused]. These expressions are usually not relevant to the content of the speech and hence need to be removed. A simple regular expression can be useful in this case.
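
One way such a regular expression might look, assuming expressions are always enclosed in square brackets and are never nested (`remove_expressions` is a hypothetical helper):

```python
import re

def remove_expressions(text):
    """Drop [bracketed] annotations and tidy the whitespace they leave behind."""
    without_tags = re.sub(r"\[[^\]]*\]", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()

print(remove_expressions("Thank you all [laughing] for coming [Audience paused] today."))
```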

**Split Attached Words**: Text generated on social forums is completely informal in nature. Many tweets contain multiple attached words such as RainyDay or PlayingInTheCold. These entities can be split into their normal forms using simple rules and regex.
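
As one illustration of such a rule, the sketch below assumes attached words are written in CamelCase and inserts a space at every lowercase-to-uppercase boundary. The function name is hypothetical, and the rule would also split names like "iPhone", so it is a starting point rather than a complete solution.

```python
import re

def split_attached_words(text):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_attached_words("such a RainyDay, PlayingInTheCold again"))
```

Because the pattern only inserts spaces, any lowercase text before the first capital letter is preserved.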

In [37]:
cleaned = " ".join(re.findall('[A-Z][^A-Z]*', original_tweet))
cleaned

"I luv my &lt;3 iphone &amp; you're awsm apple.  Display Is Awesome, sooo happppppy :) http://www.apple.com"

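**Standardizing words**: Sometimes words are not in a proper format. For example, “I looooveee you” should be “I love you”. Simple rules and regular expressions can help with these cases. The rule used here caps every run of a repeated character at two, which also alters legitimate runs such as the “www” in a URL.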
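
A regular-expression version of the same "cap repeats at two" rule might look like the sketch below. It assumes that tokens starting with `http://` or `https://` are URLs and should be left alone; `squeeze_repeats` is an illustrative name.

```python
import re

def squeeze_repeats(text):
    """Cap runs of a repeated character at two, skipping tokens that look like URLs."""
    tokens = []
    for token in text.split():
        if not token.startswith(("http://", "https://")):
            token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        tokens.append(token)
    return " ".join(tokens)

print(squeeze_repeats("sooo happppppy :) http://www.apple.com"))
```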

In [40]:
import itertools
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
tweet

"I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, soo happy :) http://ww.apple.com"

### Text to Features (Feature Engineering on text data)

To analyze preprocessed data, it needs to be converted into features. Depending on the usage, text features can be constructed using assorted techniques: syntactic parsing, entities / N-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.


**Dependency Trees** - Sentences are composed of words sewn together. The relationships among the words in a sentence are determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented as a triplet (relation, governor, dependent).
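
The triplet representation can be sketched as a small data structure. The relations below are hand-annotated for illustration (they are not the output of a parser), and the `Dependency` type and `dependents_of` helper are hypothetical names.

```python
from collections import namedtuple

# One labeled, directed relation between two words.
Dependency = namedtuple("Dependency", ["relation", "governor", "dependent"])

# Hand-annotated (not parser output) for the sentence: "I am learning NLP"
triplets = [
    Dependency("nsubj", "learning", "I"),
    Dependency("aux", "learning", "am"),
    Dependency("dobj", "learning", "NLP"),
]

def dependents_of(word, deps):
    """Return (relation, dependent) pairs governed by the given word."""
    return [(d.relation, d.dependent) for d in deps if d.governor == word]

print(dependents_of("learning", triplets))
```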

**Part of speech tagging** - Apart from grammatical relations, every word in a sentence is also associated with a part of speech (POS) tag (noun, verb, adjective, adverb, etc.). POS tags define the usage and function of a word in the sentence. The tags used below come from the Penn Treebank tag set, developed at the University of Pennsylvania. The following code uses NLTK to perform POS tagging on input text (NLTK provides several implementations; the default is the perceptron tagger).

In [42]:
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing in today's class"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('in', 'IN'), ('today', 'NN'), ("'s", 'POS'), ('class', 'NN')]


Part of Speech tagging is used for many important purposes in NLP:

**Word sense disambiguation**: Some words have multiple meanings depending on their usage. For example, in the two sentences below:

1. "Please book my flight for Delhi"

2. "I am going to read this book in the flight"

"Book" is used in a different context in each sentence, and the part of speech tag differs accordingly. In sentence 1, the word “book” is used as a verb, while in sentence 2 it is used as a noun.


**Improving word-based features**: When plain words are used as features, a learning model may conflate the different contexts of a word; if the part of speech tag is attached to each word, the context is preserved, making for stronger features. For example:

Sentence - "book my flight, I will read this book"

Tokens - ("book", 2), ("my", 1), ("flight", 1), ("I", 1), ("will", 1), ("read", 1), ("this", 1)

Tokens with POS – ("book_VB", 1), ("my_PRP$", 1), ("flight_NN", 1), ("I_PRP", 1), ("will_MD", 1), ("read_VB", 1), ("this_DT", 1), ("book_NN", 1)
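
The two feature sets above can be reproduced with a few lines of code. This sketch uses hand-tagged tokens copied from the example rather than calling a tagger; the variable names are illustrative.

```python
from collections import Counter

# Hand-tagged tokens matching the example above (Penn Treebank tags).
tagged = [("book", "VB"), ("my", "PRP$"), ("flight", "NN"), ("I", "PRP"),
          ("will", "MD"), ("read", "VB"), ("this", "DT"), ("book", "NN")]

# Plain word counts merge both uses of "book" into one feature.
plain_features = Counter(word for word, _ in tagged)

# Word_TAG counts keep the verb and noun uses apart.
pos_features = Counter(f"{word}_{tag}" for word, tag in tagged)

print(plain_features["book"])
print(pos_features["book_VB"], pos_features["book_NN"])
```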


**Normalization and Lemmatization**: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

**Efficient stopword removal**: POS tags are also useful for efficient removal of stopwords.

For example, some tags consistently mark the less important words of a language: IN (“within”, “upon”, “except”), CD (“one”, “two”, “hundred”), MD (“may”, “must”, etc.).
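
A minimal sketch of tag-based filtering, assuming the tokens have already been tagged. The sentence is hand-tagged for illustration, and the choice of `STOP_TAGS` simply mirrors the three tags named above.

```python
# Tags treated as low-information here; the set is an illustrative assumption.
STOP_TAGS = {"IN", "CD", "MD"}

# Hand-tagged example sentence (Penn Treebank tags), not tagger output.
tagged = [("You", "PRP"), ("must", "MD"), ("read", "VB"), ("two", "CD"),
          ("books", "NNS"), ("within", "IN"), ("a", "DT"), ("week", "NN")]

# Keep only the words whose tag is not in the stop set.
kept = [word for word, tag in tagged if tag not in STOP_TAGS]
print(kept)
```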



### Entity Extraction (Entities as features)

Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases, or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing. Entity detection is applied in automated chat bots, content analyzers, and consumer insights.

Topic Modeling and Named Entity Recognition are the two key entity detection methods in NLP.


#### Named Entity Recognition (NER)
The process of detecting named entities such as person names, location names, and company names in text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities –  ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New York”)

A typical NER model consists of three blocks:

**Noun phrase identification**: This step extracts all the noun phrases from a text using dependency parsing and part of speech tagging.

**Phrase classification**: This is the classification step, in which all the extracted noun phrases are classified into their respective categories (locations, names, etc.). The Google Maps API provides a good way to disambiguate locations, and open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.

**Entity disambiguation**: Entities are sometimes misclassified, so creating a validation layer on top of the results is useful. Knowledge graphs can be used for this purpose. Popular knowledge graphs include Google Knowledge Graph, IBM Watson, and Wikipedia.
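
The lookup-table idea from the phrase classification step can be sketched as follows. This is only the dictionary-lookup part of an NER pipeline, with a hypothetical hand-built table; it skips noun phrase identification and disambiguation entirely.

```python
# Hypothetical lookup table (gazetteer); a real one would be curated from
# sources such as the open databases mentioned above.
GAZETTEER = {
    "Sergey Brin": "person",
    "Google Inc.": "org",
    "New York": "location",
}

def tag_entities(sentence, gazetteer=GAZETTEER):
    """Return (label, phrase) pairs for gazetteer phrases found in the
    sentence, ordered by where they first appear."""
    found = [(sentence.find(phrase), label, phrase)
             for phrase, label in gazetteer.items() if phrase in sentence]
    return [(label, phrase) for _, label, phrase in sorted(found)]

sentence = "Sergey Brin, the manager of Google Inc. is walking in the streets of New York."
print(tag_entities(sentence))
```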



#### Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.



#### N-Grams as Features
A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative as features than single words (unigrams). Bigrams (N = 2) are often considered the most important of these. The following code generates the bigrams of a text.

In [50]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

generate_ngrams('this is a sample text', 2)

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
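
To use N-grams as features, they are usually counted. A small sketch, assuming whitespace tokenization; `ngram_counts` is an illustrative helper, and tuples are used instead of lists so that each N-gram can act as a dictionary key.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count each n-gram, using tuples so they can serve as dictionary keys."""
    words = text.split()
    # zip over n shifted copies of the word list yields consecutive n-grams
    return Counter(zip(*(words[i:] for i in range(n))))

counts = ngram_counts("the cat sat on the mat and the cat slept", 2)
print(counts[("the", "cat")])
print(counts[("cat", "sat")])
```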

### Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this section:


**Term Frequency - Inverse Document Frequency (TF - IDF)**
TF-IDF is a weighted model commonly used for information retrieval problems. It converts text documents into vectors on the basis of the occurrence of words in the documents, without considering their exact ordering. For example, suppose there is a dataset of N text documents. For any document “D”, TF and IDF are defined as follows:

**Term Frequency (TF)** - TF for a term “t” is defined as the count of the term “t” in a document “D”.

**Inverse Document Frequency (IDF)** - IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”.

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651



The model creates a vocabulary dictionary and assigns an index to each word. Each row of the output contains a tuple (i, j) and the TF-IDF value of the word at index j in document i.
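
The plain TF and IDF definitions given earlier can also be computed by hand. The sketch below assumes whitespace tokenization, lowercase text, and the natural logarithm. Its numbers will not match the scikit-learn output above, because `TfidfVectorizer` by default uses a smoothed IDF and normalizes each row.

```python
import math
from collections import Counter

corpus = ["this is sample document", "another random document", "third sample document text"]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    """Raw count of the term in one tokenized document."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / number of documents containing the term).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "document" appears in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("document", docs[0], docs))
print(round(tf_idf("sample", docs[0], docs), 4))
print(round(tf_idf("this", docs[0], docs), 4))
```

A term that occurs everywhere carries no weight, while a term unique to one document gets the highest weight.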


#### Count / Density / Readability Features
Count- or density-based features can also be used in models and analysis. These features might seem trivial but can have a great impact on learning models. Some of these features are word count, sentence count, punctuation counts, and industry-specific word counts. Other measures include readability measures such as syllable counts, SMOG index, and Flesch reading ease.
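
A few of these counts can be computed with the standard library alone. The sketch below makes simplifying assumptions: words are split on whitespace, and sentences end at ".", "!" or "?". The feature names are illustrative.

```python
import re
import string

def count_features(text):
    """Return a few simple count and density features for a text."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "punctuation_count": punctuation,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

print(count_features("NLP is fun. Is it hard? Not really!"))
```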


#### Word Embedding (text vectors)

Word embedding is the modern way of representing words as vectors. Its aim is to map high-dimensional word features into low-dimensional feature vectors while preserving contextual similarity in the corpus. Word embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

[Word2Vec](https://code.google.com/archive/p/word2vec/) and [GloVe](http://nlp.stanford.edu/projects/glove/) are two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output.

The Word2Vec model is composed of a preprocessing module and two shallow neural network models, Continuous Bag of Words (CBOW) and skip-gram. Its embeddings are widely reused as input features for other NLP problems. It first constructs a vocabulary from the training corpus and then learns the word embedding representations. The following code uses the gensim package to train word embeddings on a small corpus.

In [53]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['john', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

In [54]:
# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

# word vectors live on model.wv in current gensim; the value may vary between runs
print(model.wv.similarity('data', 'science'))

-0.23661329464331132
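
A similarity score of this kind is, to the best of my understanding of gensim, the cosine similarity between the two word vectors. The sketch below computes that measure directly on made-up three-dimensional vectors; the vectors are purely illustrative and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, purely for illustration.
vectors = {
    "data": [0.9, 0.1, 0.0],
    "science": [0.8, 0.2, 0.1],
    "deep": [0.0, 0.1, 0.9],
}

print(round(cosine_similarity(vectors["data"], vectors["science"]), 3))
print(round(cosine_similarity(vectors["data"], vectors["deep"]), 3))
```

Vectors pointing in nearly the same direction score close to 1, while nearly orthogonal vectors score close to 0.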



In [55]:
print(model.wv['learning'])

[ 2.5001867e-03 -4.5279837e-03  3.2164674e-04 -4.4285697e-03
  4.1545318e-03 -2.1649762e-03 -3.8046767e-03 -1.7539723e-03
  5.9814251e-04  3.5325827e-03 -3.9156429e-03  5.8388693e-04
  2.1662123e-03 -4.5341258e-03  1.5721378e-03  4.6008709e-03
 -2.1712466e-03 -1.1091464e-03 -3.1049079e-03 -3.8329777e-03
  2.6114439e-03  6.9665466e-04  1.8230440e-03 -2.0439368e-04
 -1.3953244e-03  2.5326526e-03 -1.3467000e-03  7.9407363e-04
 -8.5678376e-04 -2.7677296e-03 -3.2886763e-03 -4.6619233e-03
  7.3562146e-06  1.1412736e-03 -1.2158153e-03  7.6474954e-04
 -4.3854597e-03  1.7202114e-03  2.6836181e-03  1.4622611e-03
 -2.5849789e-03  4.8951223e-03  4.8026382e-03 -1.6887399e-03
  3.7905199e-03 -1.8123538e-03 -1.8461549e-03 -3.6006782e-03
 -8.0820097e-04  2.4030278e-03  2.6205764e-03 -1.3611730e-03
  3.2463272e-03  4.4065714e-03 -4.9645538e-03  1.0241962e-03
  7.2724192e-04  1.0305879e-03  3.8043426e-03 -1.5256479e-03
  4.4337765e-04  8.2238915e-04  1.0542222e-03  1.5848925e-04
  2.7174465e-04 -4.18748


### Important tasks of NLP

#### Text Classification

Text classification is one of the classic problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification, and organization of web pages by search engines.

Text classification, in simple terms, is a technique for systematically assigning a text object (document or sentence) to one of a fixed set of categories. It is especially helpful when the amount of data is large, for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction. First, the text input is processed and features are created. A machine learning model then learns from these features and is used to predict labels for new text.
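
The training and prediction steps can be illustrated with a very small Naive Bayes classifier written from scratch. This is one possible classifier among many, chosen here only because it is short; the training sentences are made up, and word counts with add-one smoothing stand in for the feature step.

```python
import math
from collections import Counter, defaultdict

def train(labeled_texts):
    """Training: count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_texts:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Prediction: pick the class with the highest log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, purely for illustration.
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training_data)
print(predict("free prize money", model))
print(predict("agenda for the team meeting", model))
```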

#### Other NLP problems / tasks

**Text Summarization** - Given a text article or paragraph, automatically summarize it to produce the most important and relevant sentences in order.

**Machine Translation** - Automatically translate text from one human language to another, taking care of grammar, semantics, information about the real world, etc.

**Natural Language Generation and Understanding** - Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.

**Optical Character Recognition** - Given an image representing printed text, determine the corresponding text.

**Document to Information** - This involves parsing textual data present in documents (websites, files, PDFs, and images) into an analyzable and clean format.