## NLP

Natural Language Processing (NLP) is the technology that helps machines understand and learn from text and spoken language. With NLP, data scientists aim to teach machines to make sense of what people say and write. It applies machine learning algorithms to text and speech.



![Alt text](images/NLP.webp)

### What are the techniques used in NLP?

NLP has two main aspects: natural language understanding (NLU), also called natural language interpretation (NLI), which covers the human-to-machine direction, and natural language generation (NLG), which covers the machine-to-human direction. Together they make up what is broadly called NLP. Put simply, NLG is the inverse of NLU: it is when software automatically transforms data into a written narrative.
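As a rough illustration of the "data into written narrative" idea, here is a minimal template-based sketch. Real NLG systems are far more sophisticated; the `describe_weather` function, the record fields, and the template are all hypothetical names made up for this example.

```python
# Toy template-based NLG: turn a structured record into a sentence.
# The record fields and the template are hypothetical, for illustration only.
def describe_weather(record: dict) -> str:
    template = "In {city}, the temperature is {temp_c} degrees Celsius with {condition} skies."
    return template.format(**record)


report = describe_weather({"city": "Lahore", "temp_c": 31, "condition": "clear"})
print(report)
```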



### Syntactic and semantic analysis
Natural Language Processing tasks are primarily achieved through syntactic analysis and semantic analysis. Several processes are involved, such as:

- Named entity recognition (NER): determine the parts of a text that can be identified and categorized into preset groups, like names of people and objects.
- Word sense disambiguation: pick the intended meaning of a word based on its context.
- Natural language generation (NLG): use databases to derive semantic intentions and convert them into human language.

### Programming

In Python, we have several libraries to work with text.

- Scikit-learn, Keras, TensorFlow: offer some text processing capabilities.
- NLTK: the Natural Language Toolkit.
- spaCy: an industrial-strength NLP package with many practical tools behind a clean API.

Other libraries include TextBlob, gensim, Stanford CoreNLP, and OpenNLP.

![Alt text](images/normalization.webp)

### Types
- Pattern-based: we look for patterns of characters or strings to match.
- AI-based: we use deep learning (DL) models such as seq2seq or transformers to capture the context of the language.
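To make the pattern-based approach concrete, here is a small sketch that uses a hand-written regular expression to pull ISO-style dates out of a sentence. The pattern and the sample text are assumptions chosen for illustration; a real system would need many such rules.

```python
import re

# Pattern-based approach: a hand-written rule that matches dates written as YYYY-MM-DD.
# The pattern and the sample text are illustrative assumptions.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

text = "The model was trained on 2023-05-14 and evaluated on 2023-06-01."
dates = DATE_PATTERN.findall(text)
print(dates)
```

The rule is cheap and predictable, but it knows nothing about context: it would miss "14 May 2023" entirely, which is where the AI-based approaches come in.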

## Understand vocabulary

In [3]:
# One simple way to process English text is a bag of words (BOW)

from collections import Counter
s = "My name is hamsof, and meaning of hamsof is ..."
token_counter = Counter(s.split())

token_counter


Counter({'My': 1,
         'name': 1,
         'is': 2,
         'hamsof,': 1,
         'and': 1,
         'meaning': 1,
         'of': 1,
         'hamsof': 1,
         '...': 1})

There are several issues here. For example, `hamsof` and `hamsof,` are counted as different tokens. Let's resolve that.
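One quick fix, before reaching for regular expressions, might be to strip punctuation from the edges of each chunk. The sketch below also lowercases the tokens, which is an extra normalization step not present in the cell above.

```python
from collections import Counter
import string

s = "My name is hamsof, and meaning of hamsof is ..."

# Lowercase each whitespace-separated chunk and strip punctuation from its edges.
cleaned = [word.strip(string.punctuation).lower() for word in s.split()]
# Chunks that were pure punctuation (such as "...") become empty strings; drop them.
cleaned = [word for word in cleaned if word]

token_counter = Counter(cleaned)
print(token_counter)
```

With this normalization, both spellings of `hamsof` fall into the same bucket and the stray `...` token disappears.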

### Let's now talk about punctuation and stop words

Regular expressions can improve the vocabulary by splitting not only on whitespace (' ') but also on punctuation such as `?` or the full stop.

In [11]:
import re
pattern = re.compile(r"([-\s.,;!?])+")
sentence = "Natural Language Processing is so awesome, isn't it?"
tokens = pattern.split(sentence)
# Drop the empty strings, whitespace and punctuation that the split captures
tokens = [token for token in tokens if token and token not in '- \t\n.,;!?']
tokens

['Natural',
 'Language',
 'Processing',
 'is',
 'so',
 'awesome',
 "isn't",
 'it']

But regular expressions (regex) make tokenization complex to code by hand, so let's move to some built-in functionality:

Some of the most commonly used libraries are spaCy and NLTK. We will mostly utilize the NLTK library.

### NLTK

In [14]:
from nltk.tokenize import TreebankWordTokenizer
sentence = "Natural Language Processing is so awesome, isn't it?"
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence)

tokens

['Natural',
 'Language',
 'Processing',
 'is',
 'so',
 'awesome',
 ',',
 'is',
 "n't",
 'it',
 '?']

This is pretty clean, but sometimes we also need to remove words like "is", "am", and "are", which carry little meaning on their own. Let's explore them.

#### Stop words

In [18]:
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')


from nltk.tokenize import TreebankWordTokenizer
sentence = "Natural Language Processing is so awesome, isn't it?"
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence)

print(tokens)

tokens = [token for token in tokens if token not in stop_words]

tokens



['Natural', 'Language', 'Processing', 'is', 'so', 'awesome', ',', 'is', "n't", 'it', '?']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['Natural', 'Language', 'Processing', 'awesome', ',', "n't", '?']

Now you can compare the two token lists: the first is the raw tokenizer output, and the second has the stop words removed.
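One thing to watch for: NLTK's English stop word list is all lowercase, so a capitalised token such as "Is" or "It" would slip through a case-sensitive filter. The sketch below shows the effect using a small hand-picked `STOP_WORDS` set, which is a hypothetical stand-in for the much longer NLTK list so that the example runs without any downloads.

```python
# A small hand-picked stop word set, standing in for NLTK's much longer list.
STOP_WORDS = {"is", "so", "it", "the", "a", "an"}

tokens = ["It", "is", "so", "awesome", ",", "Is", "n't", "it", "?"]

# Case-sensitive filtering misses capitalised stop words.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]
# Comparing the lowercased token catches them while keeping the original casing.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```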

### Stemming and Lemmatization

Stemming and lemmatization are methods used by search engines and chatbots to analyze the meaning behind a word. Stemming reduces a word to its stem by stripping affixes, while lemmatization maps it to its dictionary form (the lemma), taking into account how the word is being used.

![Alt text](images/lemma%20or%20Stemma.png)

### Stemming

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting succeeds on some occasions but not always, which is the main limitation of the approach.
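The idea, and its limitations, can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm or any other published stemmer; the `naive_stem` function and its suffix list are hypothetical and chosen only for illustration.

```python
# A deliberately naive suffix-stripping stemmer. This is NOT the Porter
# algorithm; the suffix list is a hypothetical one chosen for illustration.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "caring", "studies"]:
    print(word, "->", naive_stem(word))
```

The first three words collapse neatly to `play`, but `caring` becomes `car` and `studies` becomes `studi`, which shows how blind cutting can produce a misleading or non-existent word.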

### Lemmatizing

Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can look through to link each form back to its lemma.
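A toy version of that dictionary lookup might look like the sketch below. The `LEMMA_TABLE` entries and the `toy_lemmatize` helper are hypothetical; a real lemmatizer relies on a full lexical resource such as WordNet plus morphological rules.

```python
# A toy dictionary-based lemmatizer. The lookup table is a tiny hypothetical
# stand-in for a real lexical resource such as WordNet.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("caring", "verb"): "care",
    ("mice", "noun"): "mouse",
}


def toy_lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("Mice", "noun"))
print(toy_lemmatize("caring", "verb"))
print(toy_lemmatize("caring", "noun"))
```

Note that the part of speech is part of the lookup key: the same surface form can map to different lemmas, or to itself, depending on how it is used.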


But sometimes stemming comes at a cost in accuracy. First, here is what the Porter stemmer returns for a single word:

In [20]:
from nltk.stem.porter import PorterStemmer
token = "caring"
stemmer = PorterStemmer()
stems = stemmer.stem(token)

stems


'care'

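And we can clearly see the difference with lemmatization. `WordNetLemmatizer.lemmatize` treats its input as a noun unless told otherwise, so `'caring'` comes back unchanged; passing `pos='v'` treats it as a verb and returns `'care'`.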

In [23]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('caring')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


'caring'

Now let's try to classify IMDB reviews as positive or negative, using BOW features and training a Naive Bayes classifier on the dataset.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load dataset, we only use training data
(X, y), _ = tf.keras.datasets.imdb.load_data()

# Create a Bag-Of-Words Dataframe
X = [Counter(x) for x in X[:5000]]
y = y[:5000]

X = pd.DataFrame(X).fillna(0).astype(int)

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=1)

# Instantiate and fit model on training data
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make a prediction and get accuracy
prediction = clf.predict(X_test)
accuracy = np.sum(y_test==prediction) / len(y_test)

print(accuracy)
# >>> 0.798
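
To give a feel for what a multinomial Naive Bayes classifier does with bag-of-words counts, here is a rough pure-Python sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the four-review corpus, the labels, and the `predict` helper are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made corpus; the reviews and labels are invented for illustration.
train = [
    ("a great and moving film", 1),
    ("great acting and a great story", 1),
    ("a boring and dull film", 0),
    ("dull story and terrible acting", 0),
]

# Bag of words per class: how often each word occurs in each class.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}


def predict(text: str) -> int:
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen (word, class) pairs.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    # Return the label with the highest log-probability score.
    return max(scores, key=scores.get)


print(predict("a great story"))
print(predict("terrible and dull"))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once reviews contain hundreds of words.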