# NLP Series 1

Any machine learning problem starts with quality data: we feed the data to a model to train it, and the trained model then makes predictions. Data varies in shape, size, structure, and complexity. It can be text, which is our focus in NLP, or it can be categorical features, etc. Data preprocessing techniques are needed to prepare the data before any model is built.

Let us take an example to make the concept concrete: the publicly available dataset of Amazon product reviews.

People all over the world use Amazon to buy products, and some of them leave feedback or a review based on their experience. How can a machine tell whether a review is positive or negative? That is where Natural Language Processing (NLP) comes into play, and various text processing techniques are used to handle such scenarios.

We will use the open source NLTK library to explain a few of the concepts.


## Text Processing

In this notebook, we go over some simple techniques to clean and prepare text data for modeling with machine learning.

1. <a href="#1">Simple text cleaning processes</a>
2. <a href="#2">Lexicon-based text processing</a>
    * Tokenization
    * Lemmatization
    * Stemming
    * Stop words removal 
   

## 1. <a name="1">Simple text cleaning processes</a>
(<a href="#0">Go to top</a>)

In this section, we will try some text cleaning. 

In [None]:
# Text that we will clean, tokenize, stem, lemmatize, and filter for stop words
original_text = "   We need to clean-up this message. It consists of several steps, which we will follow as we proceed further. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     . I am trying and adding another sentence to make it more real-life like. There is another sentence to give it more weight. How, do you feel about it now? Is there anything that can be further added to make it look like a paragraph. At this stage, I think I have few sentences and I can proceed with the process. /  "
print(original_text)

# Clean-up Approach

# Assign original_text to text to keep a backup of the original entry


In [None]:
# Let's get rid of the leading and trailing whitespace, as it doesn't add any value to the ML process
text = original_text
text = text.strip()
print(text)

In [None]:
# Now let's get rid of the HTML tags/markup
import re

text = re.compile('<.*?>').sub('', text)
print(text)
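
The `?` in the pattern `<.*?>` makes the match non-greedy, which matters as soon as a line contains more than one tag. The following is a minimal sketch of the difference; the `sample` string is a made-up review snippet, not part of the dataset.

```python
import re

sample = "Great <b>product</b>, fast delivery<br>"  # hypothetical review snippet

# Non-greedy: each match stops at the first '>', so only the tags are removed
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy))
print(repr(greedy))
```

With the greedy pattern, the text between the tags is deleted along with them, which is why the non-greedy form is used here.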

In [None]:
# Let's now remove the punctuation marks, replacing each one with a space
import re, string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)

In [None]:
# Convert all to lowercase
text = text.lower()
print(text)

In [None]:
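# Now let's collapse runs of spaces and tabs into a single space
import re

# Raw string so that \s is passed to the regex engine rather than treated as a string escape
text = re.sub(r'\s+', ' ', text)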
print(text)
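
When many reviews need the same treatment, the steps of this section can be bundled into one reusable function. This is only a sketch: the name `clean_text`, the pattern constants, and the final `strip()` are assumptions added here for convenience, not part of the original notebook.

```python
import re
import string

# Compile the patterns once so the function can be reused on many reviews
TAG_PATTERN = re.compile(r'<.*?>')
PUNCT_PATTERN = re.compile('[%s]' % re.escape(string.punctuation))
SPACE_PATTERN = re.compile(r'\s+')

def clean_text(raw):
    """Apply the cleaning steps from this section in the same order."""
    cleaned = raw.strip()                      # leading/trailing whitespace
    cleaned = TAG_PATTERN.sub('', cleaned)     # HTML tags
    cleaned = PUNCT_PATTERN.sub(' ', cleaned)  # punctuation -> space
    cleaned = cleaned.lower()                  # lowercase
    cleaned = SPACE_PATTERN.sub(' ', cleaned)  # collapse runs of whitespace
    # Extra step (assumption): replacing punctuation can leave a space at the edges
    return cleaned.strip()

print(clean_text("  Hello, <br> World!!  It's   GREAT.  "))
```

Note that replacing punctuation with spaces splits contractions, so "It's" ends up as two tokens, "it" and "s". Whether that is acceptable depends on the downstream model.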

## 2. <a name="2">Lexicon-Based Cleaning Approach</a>

Lexicon-based methods are usually applied after the common text cleaning steps. They put words into a common form, which also brings out similarities (if any) between sentences.

To do that, we need to download a few NLTK resources.

In [None]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

### a) Tokenization

Simply put, tokenization is the process of splitting text or documents into smaller units, such as sentences or words, typically at white space and punctuation. Let us see what happens when we apply tokenization to the cleaned text.

In [None]:
# Tokenize into sentences. You can tokenize text into either sentences or words.
# The result is a list of sentences. Because punctuation was removed during cleaning, the whole
# text comes back as a single entry here; text that keeps its punctuation yields one entry per sentence
sentences = nltk.sent_tokenize(text)
print(sentences)

In [None]:
# Tokenize into words. Depending on the paragraph/sentence, the same word can appear more than once
words = nltk.word_tokenize(text)
print(words)
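
To make the idea of splitting at white space and punctuation concrete, here is a rough standard-library sketch. The helper names `simple_sentence_split` and `simple_word_split` are hypothetical, and the rules are far simpler than what NLTK's tokenizers do (they do not handle abbreviations, for instance).

```python
import re

def simple_sentence_split(paragraph):
    # Split after '.', '!' or '?' when the mark is followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

def simple_word_split(sentence):
    # Runs of word characters become tokens; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "How do you feel about it now? I think it works."
print(simple_sentence_split(sample))
print(simple_word_split("I think it works."))
```

Sentence splitting depends on punctuation, so it is usually worth running it before the punctuation is stripped from the text.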

### b) Stemming and Stop Words

Stemming applies a set of rules that strip word endings so that related words reduce to a common stem. Example: connected, connecting, connection -> connect

Why is this important? Reducing related words to one stem lets the model treat them as the same term, so "connect" carries the signal that would otherwise be split across "connected", "connecting", and "connection". Stemming does not necessarily produce a human-readable word, though. For example, a rule-based stemmer such as Snowball typically reduces "studies" and "studying" to "studi", which is not an English word. Lemmatization, in contrast, maps a word to its dictionary form, so "studies" becomes "study".

Lemmatization takes more time than stemming, however. So which one should you use? It depends on the use case. For sentiment analysis or a spam classifier, we might not need the base word to be meaningful English, so stemming is often enough. For use cases like chatbots, FAQs, and Q&A, we might need the complete meaning of the word, so lemmatization is more appropriate for a meaningful representation.
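
The contrast between rule-based stemming and dictionary-based lemmatization can be sketched without any library. Everything in the block below is hypothetical and deliberately tiny: `toy_stem` is not the Snowball algorithm, and `LEMMA_TABLE` stands in for a real lexicon such as WordNet.

```python
# Toy stemmer: strips a known suffix by rule, with no dictionary behind it
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks up the dictionary form in a hand-written table
LEMMA_TABLE = {"studies": "study", "studying": "study", "went": "go", "going": "go"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "studying", "went"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based function is cheap but can return a non-word ("studi") and cannot relate "went" to "go", while the lookup handles both at the cost of needing a dictionary.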


In [None]:
# We will use a tokenizer and stemmer from the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords

# Create the stemmer and build the stop word set once, outside the loop
snowstemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

# Stemming: store the result in a new list so that `sentences` stays unchanged
stemmed_sentences = []
for sentence in sentences:
    words = word_tokenize(sentence)  # words in the sentence
    words = [snowstemmer.stem(word) for word in words if word not in stop_words]
    stemmed_sentences.append(' '.join(words))
print(stemmed_sentences)

As you can see, not every stemmed word is a meaningful English word. Let us now look at the lemmatization technique.

### c) Lemmatization and Stop Words

In [None]:
# Importing the necessary functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Lemmatizer object and stop word set, both created once outside the loop
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Lemmatization: without a part-of-speech argument, lemmatize() treats each word as a noun
lemmatized_sentences = []
for sentence in sentences:
    words = word_tokenize(sentence)  # words in the sentence
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    lemmatized_sentences.append(' '.join(words))
print(lemmatized_sentences)

You can see that lemmatization produces more readable results than stemming, because each output is a dictionary word.
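
Both the stemming and the lemmatization cells also drop stop words, which are very common words such as "the" or "is" that carry little meaning on their own. The filtering step can be sketched in isolation as follows; the `STOP_WORDS` set is a small hand-picked assumption for illustration, whereas NLTK's English list is much longer.

```python
# Small hand-picked stop word set for illustration only
STOP_WORDS = {"i", "a", "an", "the", "is", "to", "it", "and", "of", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "The" and "the" are treated alike
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "This is the best phone I have owned".split()
print(remove_stop_words(tokens))
```

For sentiment analysis it is worth checking which words a stop word list contains before applying it. NLTK's English list includes negations such as "not", and removing them can change the apparent meaning of a review.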

Next, we will look at how Bag of Words (BoW), TF-IDF, and Word2Vec are used in practice. For that, please check out the NLP-Series-2 notebook.