# **Natural Language Processing**



## **What Is Natural Language Processing?**

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data... ([Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing))

By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

## **Use Cases of NLP**

In simple terms, NLP represents the automatic handling of natural human language like speech or text, and although the concept itself is fascinating, the real value behind this technology comes from the use cases.

NLP can help with many tasks, and its fields of application keep growing. Here are some examples:

 1. NLP enables the recognition and prediction of diseases based on electronic health records and patients’ own speech. This capability is being explored for health conditions ranging from cardiovascular disease to depression and even schizophrenia. For example, Amazon Comprehend Medical is a service that uses NLP to extract disease conditions, medications, and treatment outcomes from patient notes, clinical trial reports, and other electronic health records.

 2. Organizations can determine what customers are saying about a service or product by identifying and extracting information from sources like social media. This sentiment analysis can reveal a lot about customers’ choices and their decision drivers.

 3. Companies like Yahoo and Google filter and classify your emails with NLP, analyzing the text of emails that flow through their servers and stopping spam before it even enters your inbox.

 4. To help identify fake news, the NLP Group at MIT developed a system that determines whether a news source is accurate or politically biased, and therefore whether it can be trusted.

 5. Amazon’s Alexa and Apple’s Siri are examples of intelligent voice-driven interfaces that use NLP to respond to vocal prompts and do things like find a particular shop, tell us the weather forecast, suggest the best route to the office, or turn on the lights at home.

 6. Having insight into what is happening and what people are talking about can be very valuable to financial traders. NLP is being used to track news, reports, and comments about possible mergers between companies; all of this can then be incorporated into a trading algorithm in search of profit. Remember: buy the rumor, sell the news.

 7. NLP is also being used in both the search and selection phases of talent recruitment, identifying the skills of potential hires and also spotting prospects before they become active on the job market.

 8. Chatbots: Virtual personal assistants, also known as chatbots, are rapidly establishing their presence in the digital world. Businesses are using chatbots across support, marketing, and healthcare verticals.

 9. Speech Recognition: This is where devices like Alexa, Siri, Google Home, and other virtual assistants come into the picture. NLP has also put down roots in healthcare, where speech recognition has allowed clinicians to transcribe notes for efficient EHR data entry for nearly two decades.

 10. Creditworthiness assessment: Many banks and lending companies now leverage NLP to assess the creditworthiness of clients with little or no credit history. For example, students who have just started their first job have little or no credit history, yet they are potential loan customers for banks. Even if these clients have never used credit before, most of them still use smartphones, browse the internet, and engage in other activities that leave a lot of digital footprints. NLP algorithms analyze geolocation data, social media activity, and browsing behaviour to derive insights into their habits, peer networks, and the strength of their relationships. By analyzing thousands of client-related variables, the software generates a credit score that is highly predictive of the customer’s future activity.

 11. Neural Machine Translation: Machine translation, which once seemed like an awkward attempt to imitate professional translation, has improved substantially, and neural machine translation (NMT) has taken the improvements even further. Google, Amazon, and Microsoft are competing to deliver the best machine translation today.

## **Basic concepts in NLP**

### **What is Language?**

A language is, basically, a fixed vocabulary of words that is shared by a community of humans to express and communicate their thoughts.
This vocabulary is taught to humans as part of growing up and remains mostly fixed, with a few additions each year.
Elaborate resources such as dictionaries are maintained so that a person who comes across a new word can look up its meaning. Once the person has been exposed to the word, it is added to his or her vocabulary and can be used in further communication.

### **How do computers understand language?**

A computer is a machine working under mathematical rules. It lacks the complex interpretation and understanding that humans manage with ease, but it can perform a complex calculation in seconds.

> For a computer to work with any concept it is necessary that there should be a way to express the said concept in the form of a mathematical model.

This constraint greatly limits the scope and the areas of natural language a computer can work with. So far, machines have been most successful at classification and translation tasks.


### **Basic Transformations**

As mentioned earlier, for a machine to make sense of natural language (the language used by humans), the language needs to be converted into some sort of mathematical framework that can be modeled. Below are some of the most commonly used techniques that help us achieve that.

#### **Data cleaning**

There are some basic cleaning steps that can help improve the effectiveness of an NLP system. However, they should be applied carefully and on a case-by-case basis. These are:

 + Converting text to lowercase
 + Removing punctuation
 + Removing special characters (such as @ and #) and HTML tags
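
The cleaning steps listed above can be sketched with Python's standard `re` module. This is a minimal illustration, not a complete cleaner: the function name `clean_text` and the sample string are made up for this example, and the regular expressions assume plain English text.

```python
import re

def clean_text(text):
    """Lowercase the text, strip HTML tags, and replace non-letters with spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags such as <b> or </p>
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation, digits, @, #, ...
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("<p>Hello, @World! #NLP is <b>fun</b>.</p>"))
# hello world nlp is fun
```

Whether each step helps depends on the task: for instance, lowercasing discards the difference between "US" and "us", which is why these steps are best applied case by case.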

#### **Tokenization**

Tokenization is the process of segmenting running text into sentences and words. In essence, it’s the task of cutting a text into pieces called tokens, and at the same time throwing away certain characters, such as punctuation. 

As a short example, let's look at the opening lines of the song “Across the Universe” by The Beatles:

> Words are flowing out like endless rain into a paper cup,

>  They slither while they pass, they slip away across the universe

Now, the result of tokenization would be:

`Words` `are` `flowing` `out` `like` `endless` `rain` `into` `a` `paper` `cup`

`They` `slither` `while` `they` `pass` `they` `slip` `away` `across` `the` `universe`

Although it may seem quite basic in this case, and in languages like English that separate words with a blank space (called segmented languages), not all languages behave the same way. And if you think about it, blank spaces alone are not sufficient even for English to perform proper tokenization. Splitting on blank spaces may break up what should be considered one token, as in the case of certain names (e.g. San Francisco or New York) or borrowed foreign phrases (e.g. laissez faire).

**Tokenization can remove punctuation too**, easing the path to proper word segmentation but also triggering possible complications. When a period follows an abbreviation (e.g. dr.), the period should be considered part of the same token and not be removed.

The tokenization process can be particularly problematic when dealing with biomedical text domains which contain lots of hyphens, parentheses, and other punctuation marks.
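
To make the pitfalls above concrete, here is a small sketch of a tokenizer that strips punctuation but keeps abbreviation periods and merges a few multi-word names. The `ABBREVIATIONS` and `MULTIWORD` lists are tiny, hand-picked assumptions for illustration only; real tokenizers rely on far larger resources or trained models.

```python
import re

# Hypothetical, hand-picked lists used only for this illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}
MULTIWORD = {("san", "francisco"), ("new", "york"), ("laissez", "faire")}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep the period with the abbreviation
        else:
            # keep word characters (allowing inner hyphens/apostrophes), drop other punctuation
            tokens.extend(re.findall(r"\w+(?:[-']\w+)*", chunk))
    # merge adjacent tokens that form a known multi-word expression
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("Dr. Smith moved to San Francisco, they say."))
# ['Dr.', 'Smith', 'moved', 'to', 'San Francisco', 'they', 'say']
```

Splitting on whitespace alone would have produced `Francisco,` with the comma attached and `San` and `Francisco` as two separate tokens.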

#### **Stop Words**

Stop word removal means getting rid of very common words, such as articles, pronouns, conjunctions, and prepositions (“and”, “the”, or “to” in English). In this process, words that appear to provide little or no value to the NLP objective are filtered out of the text to be processed, removing widespread and frequent terms that are not informative about the corresponding text.

Stop words can be filtered out with a lookup in a pre-defined list of keywords, which frees up storage space and improves processing time.

> There is no universal list of stop words. 

Stop word lists can be pre-selected or built from scratch. A potential approach is to begin with a pre-defined list and add words to it later on. Nevertheless, the general trend in recent years seems to have been a move from large standard stop word lists toward using no lists at all.

The catch is that stop word removal can wipe out relevant information and change the meaning of a given sentence. For example, if we are performing sentiment analysis, we might throw our algorithm off track if we remove a stop word like “not”. Under these conditions, you might select a minimal stop word list and add terms depending on your specific objective.
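
The effect of including or excluding “not” can be seen in a few lines of Python. The `STOP_WORDS` set below is a deliberately tiny, made-up list for illustration; it is not a standard list from any library.

```python
# A deliberately tiny, hypothetical stop-word list; "not" is left out on purpose.
STOP_WORDS = {"a", "an", "and", "the", "to", "is", "of", "in", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This movie is not good".split()

print(remove_stop_words(tokens))
# ['movie', 'not', 'good']

# Adding "not" to the list flips the apparent sentiment of what remains.
print(remove_stop_words(tokens, STOP_WORDS | {"not"}))
# ['movie', 'good']
```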

#### **Stemming**

Stemming refers to the process of slicing the end or the beginning of words with the intention of removing affixes (lexical additions to the root of the word).

> Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).

Stemming is a kind of normalization for words. Normalization is a technique in which different surface forms of a word are mapped to a single common form, so that they can be looked up and counted together. Words that share a meaning but vary with context or position in the sentence are normalized to the same form.
Generally, there is one root word with many variations. For example, the root word "eat" has variations such as "eats", "eating", and "eaten". With the help of stemming, we can reduce many of these variations to the root word. In short, stemming is a rudimentary rule-based process of stripping affixes, most often suffixes (“ing”, “ly”, “es”, “s”, etc.), from a word.

However, stemming has serious limitations, as it sometimes changes the meaning of a word altogether. For example, **News** can be stemmed into **New**. So why do we use it? First of all, it can help reduce the impact of some spelling errors in the tokens. Stemmers are also simple to use and run very fast (they perform simple operations on a string), so if speed and performance are important in the NLP model, stemming is certainly the way to go. Remember, we use it with the objective of improving our performance, not as a grammar exercise.
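
As a rough sketch of the rule-based idea, the toy stemmer below strips the first matching suffix from a short list. It is an illustration written for this section, not the Porter algorithm or any library's stemmer; the `SUFFIXES` tuple and the `min_stem_len` parameter are assumptions chosen to keep the example small.

```python
# Toy suffix list, checked in order; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "quickly", "caring", "news"]:
    print(w, "->", naive_stem(w))
# eats -> eat
# eating -> eat
# quickly -> quick
# caring -> car
# news -> new
```

The last two lines show the limitation described above: blind suffix stripping turns "caring" into "car" and "news" into "new". An irregular form such as "eaten" is left untouched by these rules.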

#### **Lemmatization**

Lemmatization has the objective of reducing a word to its base form and grouping together different forms of the same word. For example, verbs in the past tense are changed into the present (e.g. “went” is changed to “go”) and irregular comparative and superlative forms are mapped to their base adjective (e.g. “best” is changed to “good”), hence standardizing related word forms to their root. Although it seems closely related to the stemming process, lemmatization uses a different approach to reach the root forms of words.

> Lemmatization resolves words to their dictionary form (known as the lemma), which requires detailed dictionaries in which the algorithm can look up words and link them to their corresponding lemmas.

For example, the words “running”, “runs” and “ran” are all forms of the word “run”, so “run” is the lemma of all the previous words.

Let's compare lemmatization and stemming for the word **Caring**.

$$\text{Lemmatization}: \text{Caring} \rightarrow \text{Care}$$

$$\text{Stemming}: \text{Caring} \rightarrow \text{Car}$$

Lemmatization also takes into consideration the context of the word in order to solve other problems like disambiguation, which means it can discriminate between identical words that have different meanings depending on the specific context. Think about words like “bat” (which can correspond to the animal or to the metal/wooden club used in baseball) or “bank” (corresponding to the financial institution or to the land alongside a body of water). By providing a part-of-speech parameter for a word (whether it is a noun, a verb, and so on), it’s possible to define a role for that word in the sentence and resolve the ambiguity.

As you might have already pictured, lemmatization is a much more resource-intensive task than stemming. Since it requires more knowledge about the language structure than a stemming approach, it demands more computational power than setting up or adapting a stemming algorithm.

> Since stemming occurs based on a set of rules, the root word returned by stemming might not always be a word of the English language. Lemmatization, on the other hand, reduces the inflected words properly, ensuring that the root word belongs to the English language.
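
A dictionary-based lemmatizer can be sketched as a lookup keyed by the word and its part of speech. The `LEMMAS` table below is a hypothetical miniature dictionary built for this example (the “saw” entries are added here purely to show why the part-of-speech parameter matters); real lemmatizers use large lexical resources.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("went", "verb"): "go",
    ("running", "verb"): "run",
    ("runs", "verb"): "run",
    ("ran", "verb"): "run",
    ("caring", "verb"): "care",
    ("best", "adj"): "good",
    ("saw", "verb"): "see",   # "I saw a bird"
    ("saw", "noun"): "saw",   # "a sharp saw"
}

def lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercased word if there is no entry."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize("ran", "verb"))     # run
print(lemmatize("Caring", "verb"))  # care
print(lemmatize("saw", "verb"))     # see
print(lemmatize("saw", "noun"))     # saw
```

Unlike a suffix-stripping stemmer, every value returned from the table is a real dictionary word, at the cost of having to build and search the dictionary.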

#### **N-Grams**

N-grams are sequences of N neighbouring words combined together for representation purposes, where N is the number of words in each combination.

For example, consider the first line from the example above: "Words are flowing out like endless rain into a paper cup".

A 1-gram or unigram model will tokenize the sentence into single words, so the output will be "Words, are, flowing, out, like, endless, rain, into, a, paper, cup".

A bigram model, on the other hand, will tokenize it into combinations of 2 words each, and the output will be "Words are, are flowing, flowing out, out like, like endless, endless rain, rain into, into a, a paper, paper cup".

Similarly, a trigram model will break it into "Words are flowing, are flowing out, flowing out like, out like endless, like endless rain, endless rain into, rain into a, into a paper, a paper cup". An n-gram model will thus tokenize a sentence into combinations of n consecutive words.

> Breaking down a natural language into n-grams is essential for maintaining counts of words occurring in sentences which forms the backbone of traditional mathematical processes used in Natural Language Processing.
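
Generating n-grams amounts to sliding a window of n tokens across the sentence. The helper below is a minimal sketch (the name `ngrams` is chosen for this example) that uses plain whitespace splitting.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Words are flowing out like endless rain into a paper cup".split()

print(ngrams(tokens, 2)[:3])
# ['Words are', 'are flowing', 'flowing out']
print(ngrams(tokens, 3)[-1])
# a paper cup
print(len(ngrams(tokens, 1)), len(ngrams(tokens, 2)), len(ngrams(tokens, 3)))
# 11 10 9
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the 11-word line gives 10 bigrams and 9 trigrams.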

### **Mathematical transformation for solving NLP**

As computers can only work with mathematical forms, it is necessary to convert text into a mathematical representation. Some of the most popular methods used in NLP are:

 1. One-Hot Encodings
 2. Bag of words
 3. TF-IDF
 4. Word embeddings

#### **One-Hot Encodings**

One-hot encoding is a way of representing words in numeric form. Each word is represented by a vector whose length equals the size of the vocabulary, with a value of 1 at the position assigned to that word and 0 everywhere else. An observation (for example, a sentence) is then represented by a matrix with one row per word in the observation and one column per vocabulary entry.
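
Before turning to a library, the encoding can be built by hand with NumPy. This is a sketch for illustration; it assumes the vocabulary is simply the alphabetically sorted set of tokens in one sentence, and the variable names are chosen for this example.

```python
import numpy as np

tokens = "words are flowing out like endless rain into a paper cup".split()
vocab = sorted(set(tokens))                       # alphabetical vocabulary
index = {word: i for i, word in enumerate(vocab)}

# One row per token, one column per vocabulary entry.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, index[word]] = 1                 # a single 1 per row

print(one_hot.shape)   # (11, 11)
print(one_hot[0])      # vector for "words": [0 0 0 0 0 0 0 0 0 0 1]
```

Because "words" sorts last among the 11 vocabulary entries, its vector has the 1 in the final position.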



In [0]:
Str1="Words are flowing out like endless rain into a paper cup"

Str2="They slither while they pass, they slip away across the universe"

In [0]:
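import re

# Making lowercase
Str1=Str1.lower()
# Removing punctuation, numbers, and special characters
# (str.replace does not interpret regular expressions, so re.sub is used instead)
Str1=re.sub("[^a-zA-Z]", " ", Str1)
# Tokenization: split on whitespace
Str1=Str1.split()
print(Str1)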

['words', 'are', 'flowing', 'out', 'like', 'endless', 'rain', 'into', 'a', 'paper', 'cup']


In [0]:
import numpy as np
X=np.array(Str1)
X=X.reshape(-1, 1)
print(X.shape)
print(X)

(11, 1)
[['words']
 ['are']
 ['flowing']
 ['out']
 ['like']
 ['endless']
 ['rain']
 ['into']
 ['a']
 ['paper']
 ['cup']]


In [0]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X=enc.fit_transform(X)
print(X)

  (0, 10)	1.0
  (1, 1)	1.0
  (2, 4)	1.0
  (3, 7)	1.0
  (4, 6)	1.0
  (5, 3)	1.0
  (6, 9)	1.0
  (7, 5)	1.0
  (8, 0)	1.0
  (9, 8)	1.0
  (10, 2)	1.0


In [0]:
enc.categories_

[array(['a', 'are', 'cup', 'endless', 'flowing', 'into', 'like', 'out',
        'paper', 'rain', 'words'], dtype='<U7')]

In [0]:
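import numpy as np
import pandas as pd

# Corpus holds the two raw lyric lines, one document per entry
Corpus=["Words are flowing out like endless rain into a paper cup","They slither while they pass, they slip away across the universe"]
data=pd.DataFrame(Corpus,columns=['Text'])
data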

| | Text |
|---|---|
| 0 | Words are flowing out like endless rain into a... |
| 1 | They slither while they pass, they slip away a... |


In [0]:
# Split each document on whitespace (punctuation stays attached, e.g. "pass,")
tokenized_text = data['Text'].apply(lambda x: x.split())
tokenized_text

0    [Words, are, flowing, out, like, endless, rain...
1    [They, slither, while, they, pass,, they, slip...
Name: Text, dtype: object

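#### **Bag of Words**

A bag-of-words model represents each document by how many times each vocabulary word occurs in it, ignoring word order. The same idea extends to n-grams by counting word sequences instead of single words.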

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

Corpus=["Words are flowing out like endless rain into a paper cup","They slither while they pass, they slip away across the universe"]

In [0]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(Corpus)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
# print(vectorizer.get_feature_names_out())
# print(X.toarray())
import pandas as pd
data=pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
data

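| | across | are | away | cup | endless | flowing | into | like | out | paper | pass | rain | slip | slither | the | they | universe | while | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 3 | 1 | 1 | 0 |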
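
The unigram counts can also be reproduced without scikit-learn, which makes it clear what the vectorizer is doing. This sketch assumes `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters (which is why "a" is missing from the vocabulary); the helper names are chosen for this example.

```python
import re
from collections import Counter

corpus = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]

def tokenize(text):
    # Lowercase, then keep tokens of 2+ word characters (assumed vectorizer defaults).
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocab = sorted(set().union(*counts))                     # shared, sorted vocabulary
matrix = [[c[word] for word in vocab] for c in counts]   # one row per document

print(len(vocab))                        # 19
print(matrix[1][vocab.index("they")])    # 3
print("a" in vocab)                      # False
```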


In [0]:
# ngram_range=(2, 2) counts bigrams only
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(Corpus)
# print(vectorizer2.get_feature_names_out())
# print(X2.toarray())

data2=pd.DataFrame(X2.toarray(),columns=vectorizer2.get_feature_names_out())
data2

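| | across the | are flowing | away across | endless rain | flowing out | into paper | like endless | out like | paper cup | pass they | rain into | slip away | slither while | the universe | they pass | they slip | they slither | while they | words are |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |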


In [0]:
# ngram_range=(3, 3) counts trigrams only
vectorizer3 = CountVectorizer(analyzer='word', ngram_range=(3, 3))
X3 = vectorizer3.fit_transform(Corpus)
# print(vectorizer3.get_feature_names_out())
# print(X3.toarray())

data3=pd.DataFrame(X3.toarray(),columns=vectorizer3.get_feature_names_out())
data3

| | across the universe | are flowing out | away across the | endless rain into | flowing out like | into paper cup | like endless rain | out like endless | pass they slip | rain into paper | slip away across | slither while they | they pass they | they slip away | they slither while | while they pass | words are flowing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |


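#### **TF-IDF**

TF-IDF (term frequency-inverse document frequency) weights each term count by how rare the term is across the corpus, so that words appearing in many documents contribute less.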

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(Corpus)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
# print(vectorizer.get_feature_names_out())

dataTF=pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
dataTF

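| | across | are | away | cup | endless | flowing | into | like | out | paper | pass | rain | slip | slither | the | they | universe | while | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.316228 | 0.0 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.0 | 0.316228 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316228 |
| 1 | 0.242536 | 0.0 | 0.242536 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.242536 | 0.0 | 0.242536 | 0.242536 | 0.242536 | 0.727607 | 0.242536 | 0.242536 | 0.0 |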
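
The values in this output can be traced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The helper names are chosen for this example, and the tokenizer mirrors the vectorizer's assumed default of lowercased tokens with two or more word characters.

```python
import math
import re
from collections import Counter

corpus = [
    "Words are flowing out like endless rain into a paper cup",
    "They slither while they pass, they slip away across the universe",
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
n_docs = len(corpus)
# Number of documents that contain each term.
doc_freq = Counter(term for c in counts for term in c)

def tfidf(count):
    # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in count.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the document
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(c) for c in counts]
print(round(vectors[0]["words"], 6))   # 0.316228
print(round(vectors[1]["across"], 6))  # 0.242536
print(round(vectors[1]["they"], 6))    # 0.727607
```

In this two-document corpus every term occurs in exactly one document, so all terms share the same idf and it cancels during normalisation. The first document has 10 terms with count 1, giving 1/√10 ≈ 0.316228 each; the second has 8 terms with count 1 plus "they" with count 3, giving 1/√17 ≈ 0.242536 and 3/√17 ≈ 0.727607.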
