# NLP Series - Part 1
## Text Preprocessing Techniques


### Installing Natural Language Toolkit
#### NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

##### Importing NLTK & downloading all libraries

In [1]:
import nltk
# nltk.download()

###### Note: uncommenting and running `nltk.download()` opens the NLTK download manager GUI. Select 'all' (the first option at the top) and click Download.
###### This downloads all NLTK packages to the local machine.
 

In [2]:
# defining a paragraph of text to perform actions
paragraph = "Sunset is the time of day when our sky meets the outer space solar winds. There are blue, pink, and purple swirls, spinning and twisting, like clouds of balloons caught in a whirlwind. The sun moves slowly to hide behind the line of horizon, while the moon races to take its place in prominence atop the night sky. People slow to a crawl, entranced, fully forgetting the deeds that must still be done. There is a coolness, a calmness, when the sun does set."

#### Tokenizing - NLTK Tokenizer Package

#### Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

##### https://www.nltk.org/api/nltk.tokenize.html
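
To make the idea concrete, here is a minimal regex-based sketch of sentence and word splitting. It is not how NLTK's tokenizers work internally; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are far cruder (abbreviations such as "Dr." would be split incorrectly).

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "There is a coolness, a calmness. The sun does set."
print(simple_sent_tokenize(sample))
print(simple_word_tokenize("a calmness, when the sun does set."))
```

Like the NLTK word tokenizer used in this notebook, the sketch keeps punctuation marks as separate tokens, which is why commas and periods appear in the token list.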


In [3]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)
len(sentences)

5

In [4]:
sentences

['Sunset is the time of day when our sky meets the outer space solar winds.',
 'There are blue, pink, and purple swirls, spinning and twisting, like clouds of balloons caught in a whirlwind.',
 'The sun moves slowly to hide behind the line of horizon, while the moon races to take its place in prominence atop the night sky.',
 'People slow to a crawl, entranced, fully forgetting the deeds that must still be done.',
 'There is a coolness, a calmness, when the sun does set.']

In [5]:
# tokenizing words
words = nltk.word_tokenize(paragraph)
len(words)

98

In [6]:
words

['Sunset',
 'is',
 'the',
 'time',
 'of',
 'day',
 'when',
 'our',
 'sky',
 'meets',
 'the',
 'outer',
 'space',
 'solar',
 'winds',
 '.',
 'There',
 'are',
 'blue',
 ',',
 'pink',
 ',',
 'and',
 'purple',
 'swirls',
 ',',
 'spinning',
 'and',
 'twisting',
 ',',
 'like',
 'clouds',
 'of',
 'balloons',
 'caught',
 'in',
 'a',
 'whirlwind',
 '.',
 'The',
 'sun',
 'moves',
 'slowly',
 'to',
 'hide',
 'behind',
 'the',
 'line',
 'of',
 'horizon',
 ',',
 'while',
 'the',
 'moon',
 'races',
 'to',
 'take',
 'its',
 'place',
 'in',
 'prominence',
 'atop',
 'the',
 'night',
 'sky',
 '.',
 'People',
 'slow',
 'to',
 'a',
 'crawl',
 ',',
 'entranced',
 ',',
 'fully',
 'forgetting',
 'the',
 'deeds',
 'that',
 'must',
 'still',
 'be',
 'done',
 '.',
 'There',
 'is',
 'a',
 'coolness',
 ',',
 'a',
 'calmness',
 ',',
 'when',
 'the',
 'sun',
 'does',
 'set',
 '.']

#### Stemming 
##### Stemming reduces inflected or derived words to a common stem by stripping affixes. Programs that do this are commonly referred to as stemming algorithms or stemmers. Ideally, a stemmer maps “chocolates” and “chocolatey” to one stem, and “retrieval”, “retrieved”, “retrieves” to another. The stem is not always a dictionary word: the Porter stemmer used below turns “purple” into “purpl” and “slowly” into “slowli”. Stemming is faster than lemmatization.

##### More examples: the variants "likes", "liked", "likely", and "liking" all share the root word "like".

##### Because a stem may or may not be a meaningful word, stemming suits applications such as sentiment analysis, which only need related words grouped together. Stemmers make two kinds of errors: understemming (related words end up with different stems) and overstemming (unrelated words are reduced to the same stem).
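
The following toy suffix stripper is a rough sketch of the rule-based idea only. It is not the Porter algorithm; the suffix list, the `min_stem_len` threshold, and the name `toy_stem` are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order; the Porter stemmer uses far richer rules.
SUFFIXES = ("ing", "ly", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for variant in ["likes", "liked", "likely", "liking"]:
    print(variant, "->", toy_stem(variant))
```

With these rules "likes" and "likely" become "like" while "liked" and "liking" become "lik", so related words end up with two different stems. That is a small instance of understemming.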

In [7]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

##### https://www.nltk.org/api/nltk.stem.porter.html

##### Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 
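
As a small sketch of stop-word filtering, the snippet below uses a hand-written stop list. The set `STOP_WORDS` and the helper `remove_stop_words` are hypothetical; NLTK's English list is much longer.

```python
# Hypothetical mini stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "when", "our"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["Sunset", "is", "the", "time", "of", "day"]
print(remove_stop_words(tokens))
```

Storing the stop words in a set makes each membership check a constant-time lookup, which matters when every token of a large corpus is tested.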

In [8]:
stemmer = PorterStemmer()

In [9]:
# build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    word_list = nltk.word_tokenize(sentence)
    # lowercase, drop stop words, then stem what is left
    word_list = [stemmer.stem(word.lower()) for word in word_list if word.lower() not in stop_words]
    new_sentences = " ".join(word_list)
    print(new_sentences)

sunset time day sky meet outer space solar wind .
blue , pink , purpl swirl , spin twist , like cloud balloon caught whirlwind .
sun move slowli hide behind line horizon , moon race take place promin atop night sky .
peopl slow crawl , entranc , fulli forget deed must still done .
cool , calm , sun set .


#### Lemmatization 

##### Lemmatization uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

##### Lemmatization produces meaningful words but requires more computation time than stemming. It is mostly used in chatbots, Q&A systems, etc.
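
The vocabulary-lookup idea can be sketched with a tiny hand-written table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical stand-ins for WordNet and its lemmatizer, not their actual implementation.

```python
# Hypothetical mini-vocabulary standing in for WordNet (illustration only).
LEMMA_TABLE = {
    "winds": "wind",
    "clouds": "cloud",
    "deeds": "deed",
    "mice": "mouse",
    "geese": "goose",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

print([toy_lemmatize(w) for w in ["Winds", "mice", "slowly", "purple"]])
```

Unknown words pass through unchanged, so the output always consists of real words, in contrast to the stems "slowli" and "purpl" in the stemming output above. Note also that `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument is passed, which is probably why verb forms such as "spinning" and "twisting" come back unchanged in this notebook's lemmatization output.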

In [10]:
from nltk.stem import WordNetLemmatizer

In [11]:
lemmatizer = WordNetLemmatizer()

In [12]:
# build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    word_lemma = nltk.word_tokenize(sentence)
    # lowercase, drop stop words, then lemmatize what is left
    word_lemma = [lemmatizer.lemmatize(word.lower()) for word in word_lemma if word.lower() not in stop_words]
    lemma_sentences = " ".join(word_lemma)
    print(lemma_sentences)

sunset time day sky meet outer space solar wind .
blue , pink , purple swirl , spinning twisting , like cloud balloon caught whirlwind .
sun move slowly hide behind line horizon , moon race take place prominence atop night sky .
people slow crawl , entranced , fully forgetting deed must still done .
coolness , calmness , sun set .


#### Bag Of Words

##### The bag-of-words (BoW) model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. In other words, it is a numerical representation of text that can be fed to a model. There are two kinds of BoW representation: normal BoW (a sentence-by-word matrix of counts) and binary BoW (a sentence-by-word matrix of 0/1 presence flags). It is useful mainly for small datasets.

##### Disadvantage: every word is weighted only by its raw count, so the model cannot tell which of two equally frequent words should carry more weight. The representation also scales poorly to large datasets.

###### (1) John likes to watch movies. Mary likes movies too.
###### (2) Mary also likes to watch football games.

###### BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
###### BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};
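
The two bags above can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the helper names `bag_of_words` and `binary_bag_of_words` are hypothetical, and the letters-only regex is an assumption about how to split the text.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

def bag_of_words(text):
    """Count each word; punctuation is dropped and word order is ignored."""
    return Counter(re.findall(r"[A-Za-z]+", text))

def binary_bag_of_words(text):
    """Record only whether a word is present (1), not how often it occurs."""
    return {word: 1 for word in bag_of_words(text)}

bow1, bow2 = (bag_of_words(doc) for doc in docs)
print(dict(bow1))
print(dict(bow2))
print(binary_bag_of_words(docs[0]))
```

In the binary variant "likes" and "movies" get the value 1 even though each occurs twice in the first document.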

In [13]:
# cleaning text
import re

corpus = []
stop_words = set(stopwords.words('english'))  # build once, reuse for every sentence
for sentence in sentences:
    # replace everything except letters with a space
    cleaned = re.sub("[^a-zA-Z]", ' ', sentence)
    # lowercase and split into a list of words
    cleaned = cleaned.lower().split()
    # drop stop words and lemmatize the rest
    cleaned = [lemmatizer.lemmatize(word) for word in cleaned if word not in stop_words]
    cleaned = " ".join(cleaned)
    corpus.append(cleaned)

In [14]:
# Creating BoW model
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)
model = cv.fit_transform(corpus).toarray()

In [15]:
model

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
        0, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        1, 1, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1,
        0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0]], dtype=int64)
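
The matrix is easier to read once each column is tied to a word. The sketch below rebuilds the same counts in plain Python. It assumes the cleaned corpus equals the lemmatized output shown earlier with punctuation removed, and it relies on `CountVectorizer` ordering its columns alphabetically by term. It is an illustration, not scikit-learn's implementation.

```python
# Cleaned sentences, copied from the lemmatization output with punctuation removed.
corpus = [
    "sunset time day sky meet outer space solar wind",
    "blue pink purple swirl spinning twisting like cloud balloon caught whirlwind",
    "sun move slowly hide behind line horizon moon race take place prominence atop night sky",
    "people slow crawl entranced fully forgetting deed must still done",
    "coolness calmness sun set",
]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per sentence; each cell counts how often the column's word occurs.
matrix = [[doc.split().count(term) for term in vocabulary] for doc in corpus]

print(len(vocabulary), "columns")
for row_index, row in enumerate(matrix):
    present = [vocabulary[i] for i, count in enumerate(row) if count]
    print(row_index, present)
```

Here column 9 is "day" and column 32 is "sky", and the last row has non-zero entries only for "calmness", "coolness", "set", and "sun".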

#### TFIDF

##### In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

###### Term frequency, tf(t,d), is the relative frequency of term t within document d.

###### The inverse document frequency is a measure of how much information the word provides, i.e., if it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

![tf-df](https://cdn-media-1.freecodecamp.org/images/1*q3qYevXqQOjJf6Pwdlx8Mw.png)
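
The two definitions can be worked through by hand on the John/Mary sentences from the bag-of-words section. This is a minimal sketch of the textbook formula as described in the text (relative term frequency times the natural log of N over document frequency). The function names are hypothetical, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

docs = [
    ["john", "likes", "to", "watch", "movies", "mary", "likes", "movies", "too"],
    ["mary", "also", "likes", "to", "watch", "football", "games"],
]

def tf(term, doc):
    """Relative frequency of term within one document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing term)."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print("likes :", tf_idf("likes", docs[0], docs))
print("movies:", tf_idf("movies", docs[0], docs))
```

"likes" appears in both documents, so its idf is log(2/2) = 0 and its weight vanishes. "movies" appears only in the first document, so it keeps a positive weight of (2/9)·log 2.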


In [16]:
# Creating tfidf model
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(max_features=1500)
model = cv.fit_transform(corpus).toarray()
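
With its default settings, `TfidfVectorizer` does not use the plain formula: it smooths the idf as ln((1 + N) / (1 + df)) + 1 and then scales each row to unit L2 length. The plain-Python sketch below applies those two steps by hand. It assumes the cleaned corpus equals the lemmatized output shown earlier with punctuation removed, and it is an illustration rather than scikit-learn's code.

```python
import math

# Cleaned sentences, copied from the lemmatization output with punctuation removed.
corpus = [
    "sunset time day sky meet outer space solar wind",
    "blue pink purple swirl spinning twisting like cloud balloon caught whirlwind",
    "sun move slowly hide behind line horizon moon race take place prominence atop night sky",
    "people slow crawl entranced fully forgetting deed must still done",
    "coolness calmness sun set",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Smoothed inverse document frequency for every term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Raw count times idf for each term, then scaled to unit L2 norm."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

first_row = tfidf_row(docs[0])
print("day:", round(first_row[vocabulary.index("day")], 7))
print("sky:", round(first_row[vocabulary.index("sky")], 7))
```

"sky" occurs in two sentences, so it gets a lower weight than words such as "day" that occur in only one. Up to rounding, the two printed weights should match the two distinct non-zero values in the first row of the model output.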

In [17]:
model

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.3399922 ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.3399922 ,
        0.        , 0.        , 0.        , 0.        , 0.3399922 ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.27430356, 0.        , 0.        ,
        0.3399922 , 0.3399922 , 0.        , 0.        , 0.        ,
        0.3399922 , 0.        , 0.        , 0.3399922 , 0.        ,
        0.        , 0.3399922 ],
       [0.        , 0.30151134, 0.        , 0.30151134, 0.        ,
        0.30151134, 0.30151134, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.30151134, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.     