# Introduction to NLP

**NLP stands for Natural Language Processing, a field that draws on computer science, human language, and artificial intelligence. It is the technology machines use to understand, analyse, manipulate, and interpret human language. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.**

![](https://image.shutterstock.com/image-photo/cognitive-computing-concept-future-technology-260nw-1259313322.jpg)

# Components of NLP
NLP has two main components:

 **1. Natural Language Understanding (NLU)**

Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting the metadata from content such as concepts, entities, keywords, emotion, relations, and semantic roles.

NLU is mainly used in business applications to understand a customer's problem in both spoken and written language.

NLU involves the following tasks:

* Mapping the given input into a useful representation.
* Analyzing different aspects of the language.

 **2. Natural Language Generation (NLG)**

Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation. It mainly involves text planning, sentence planning, and text realization.

# Table of contents

* Text processing
  * Tokenization
  * Normalization
    * Removing stopwords
    * Lemmatization & stemming
* Text representation
  * Bag of Words
  * N-Grams
  * TF-IDF (Term Frequency and Inverse Document Frequency)
* Modelling: Naive Bayes

In [None]:
import pandas as pd 

In [None]:
data=pd.read_csv("../input/physics-vs-chemistry-vs-biology/dataset/train.csv")

In [None]:
data

Unnamed: 0,Id,Comment,Topic
0,0x840,A few things. You might have negative- frequen...,Biology
1,0xbf0,Is it so hard to believe that there exist part...,Physics
2,0x1dfc,There are bees,Biology
3,0xc7e,I'm a medication technician. And that's alot o...,Biology
4,0xbba,Cesium is such a pretty metal.,Chemistry
...,...,...,...
8690,0x1e02,I make similar observations over the last week...,Biology
8691,0xc8d,You would know.,Biology
8692,0x723,Also use the correct number of sig figs,Chemistry
8693,0x667,"What about the ethical delimmas, groundbreaki...",Biology


In [None]:
data.drop('Id',axis=1 , inplace=True)

In [None]:
data

Unnamed: 0,Comment,Topic
0,A few things. You might have negative- frequen...,Biology
1,Is it so hard to believe that there exist part...,Physics
2,There are bees,Biology
3,I'm a medication technician. And that's alot o...,Biology
4,Cesium is such a pretty metal.,Chemistry
...,...,...
8690,I make similar observations over the last week...,Biology
8691,You would know.,Biology
8692,Also use the correct number of sig figs,Chemistry
8693,"What about the ethical delimmas, groundbreaki...",Biology


In [None]:
data.shape

(8695, 2)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8695 entries, 0 to 8694
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  8695 non-null   object
 1   Topic    8695 non-null   object
dtypes: object(2)
memory usage: 136.0+ KB


In [None]:
# Keep the target labels in y; the Comment column is used as the input text later
y = data['Topic']

In [None]:
y

0         Biology
1         Physics
2         Biology
3         Biology
4       Chemistry
          ...    
8690      Biology
8691      Biology
8692    Chemistry
8693      Biology
8694      Biology
Name: Topic, Length: 8695, dtype: object

# Text processing 

> Text processing contains two main phases: tokenization and normalization.

> **Tokenization** is the process of splitting a longer string of text into smaller pieces, or tokens.

> **Normalization** refers to converting numbers to their word equivalents, removing punctuation, converting all text to the same case, removing stopwords, removing noise, and lemmatizing or stemming.
>
> * Stemming: removing affixes (suffixes, prefixes, infixes, circumfixes). For example, "running" becomes "run".
>
> * Lemmatization: reducing a word to its canonical form, or lemma. For example, "better" becomes "good".
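
As a rough illustration of the simpler normalization steps listed above, the sketch below lower-cases a string, strips punctuation, and collapses whitespace using only the Python standard library. The function name `normalize_text` and the sample sentence are made up for this example; stopword removal, stemming, and lemmatization are not covered here.

```python
import string


def normalize_text(text: str) -> str:
    """Lower-case the text, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    # A translation table with a deletion set removes every punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() without arguments splits on any run of whitespace
    return " ".join(no_punct.split())


sample = "Cesium is SUCH a pretty metal!!  Isn't it?"
print(normalize_text(sample))  # cesium is such a pretty metal isnt it
```

Note that stripping punctuation also removes apostrophes, so "Isn't" becomes "isnt"; whether that is acceptable depends on the task.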

# 1. Tokenization


> A tokenizer separates sentences into a list of single words (tokens).

![](https://miro.medium.com/max/1050/0*EKgminT7W-0R4Iae.png)

> NLTK provides several tokenizers. This notebook uses WordPunctTokenizer and word_tokenize; RegexpTokenizer is another option.
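
To see why different tokenizers give different tokens, the sketch below compares a plain whitespace split with a regular expression that separates runs of word characters from runs of punctuation, in the spirit of WordPunctTokenizer. It uses only the standard library rather than NLTK, and both function names are made up for this example.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def word_punct_tokenize(text: str) -> list[str]:
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


sample = "A few things. You might have negative- frequency dependent selection"
print(whitespace_tokenize(sample))
print(word_punct_tokenize(sample))
```

The whitespace split keeps `things.` and `negative-` as single tokens, while the regular expression yields `things`, `.`, `negative`, and `-` separately.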

In [None]:
data1=data['Comment']

In [None]:
data1

0       A few things. You might have negative- frequen...
1       Is it so hard to believe that there exist part...
2                                          There are bees
3       I'm a medication technician. And that's alot o...
4                          Cesium is such a pretty metal.
                              ...                        
8690    I make similar observations over the last week...
8691                                      You would know.
8692              Also use the correct number of sig figs
8693    What about the ethical delimmas,  groundbreaki...
8694                            I would like to know too.
Name: Comment, Length: 8695, dtype: object

In [None]:
data1.shape

(8695,)

In [None]:
type(data1)

pandas.core.series.Series

In [None]:
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize

In [None]:
WPT=WordPunctTokenizer()

In [None]:
word_punct_token = WPT.tokenize(data1[0])
Word_tokenize= word_tokenize(data1[0])

In [None]:
print(Word_tokenize)

['A', 'few', 'things', '.', 'You', 'might', 'have', 'negative-', 'frequency', 'dependent', 'selection', 'going', 'on', 'where', 'the', 'least', 'common', 'phenotype', ',', 'reflected', 'by', 'genotype', ',', 'is', 'going', 'to', 'have', 'an', 'advantage', 'in', 'the', 'environment', '.', 'For', 'instance', ',', 'if', 'a', 'prey', 'animal', 'such', 'as', 'a', 'vole', 'were', 'to', 'have', 'a', 'light', 'and', 'a', 'dark', 'phenotype', ',', 'a', 'predator', 'might', 'recognize', 'the', 'more', 'common', 'phenotype', 'as', 'food', '.', 'So', 'if', 'the', 'light', 'voles', 'are', 'more', 'common', ',', 'foxes', 'may', 'be', 'keeping', 'a', 'closer', 'eye', 'out', 'for', 'light', 'phenotypic', 'voles', ',', 'recognising', 'them', 'as', 'good', 'prey', '.', 'This', 'would', 'reduce', 'the', 'light', 'causing', 'alleles', 'due', 'to', 'increased', 'predation', 'and', 'the', 'dark', 'genotypes', 'would', 'increase', 'their', 'proportion', 'of', 'the', 'population', 'until', 'this', 'scenario',

In [None]:
print(word_punct_token)

['A', 'few', 'things', '.', 'You', 'might', 'have', 'negative', '-', 'frequency', 'dependent', 'selection', 'going', 'on', 'where', 'the', 'least', 'common', 'phenotype', ',', 'reflected', 'by', 'genotype', ',', 'is', 'going', 'to', 'have', 'an', 'advantage', 'in', 'the', 'environment', '.', 'For', 'instance', ',', 'if', 'a', 'prey', 'animal', 'such', 'as', 'a', 'vole', 'were', 'to', 'have', 'a', 'light', 'and', 'a', 'dark', 'phenotype', ',', 'a', 'predator', 'might', 'recognize', 'the', 'more', 'common', 'phenotype', 'as', 'food', '.', 'So', 'if', 'the', 'light', 'voles', 'are', 'more', 'common', ',', 'foxes', 'may', 'be', 'keeping', 'a', 'closer', 'eye', 'out', 'for', 'light', 'phenotypic', 'voles', ',', 'recognising', 'them', 'as', 'good', 'prey', '.', 'This', 'would', 'reduce', 'the', 'light', 'causing', 'alleles', 'due', 'to', 'increased', 'predation', 'and', 'the', 'dark', 'genotypes', 'would', 'increase', 'their', 'proportion', 'of', 'the', 'population', 'until', 'this', 'scenar

# 2. Normalization

**Removing Stopwords**

> Stopwords are words that carry little information on their own, such as prepositions.

![](https://user.oc-static.com/upload/2021/01/06/16099626487943_P1C2.png)

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')# Get the list of stop words
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
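# Keep only tokens that are not in the stop-word list.
# The comparison is case-sensitive, so capitalized forms such as 'A' or 'You' are kept.
tokens = [x for x in word_punct_token if x not in stop_words]
print(tokens)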

['A', 'things', '.', 'You', 'might', 'negative', '-', 'frequency', 'dependent', 'selection', 'going', 'least', 'common', 'phenotype', ',', 'reflected', 'genotype', ',', 'going', 'advantage', 'environment', '.', 'For', 'instance', ',', 'prey', 'animal', 'vole', 'light', 'dark', 'phenotype', ',', 'predator', 'might', 'recognize', 'common', 'phenotype', 'food', '.', 'So', 'light', 'voles', 'common', ',', 'foxes', 'may', 'keeping', 'closer', 'eye', 'light', 'phenotypic', 'voles', ',', 'recognising', 'good', 'prey', '.', 'This', 'would', 'reduce', 'light', 'causing', 'alleles', 'due', 'increased', 'predation', 'dark', 'genotypes', 'would', 'increase', 'proportion', 'population', 'scenario', 'reversed', '.', 'This', 'cycle', 'continues', 'perpetually', '.', '\\', 'n', '\\', 'nHowever', ',', 'unlikely', 'strictly', 'yearly', 'usually', 'takes', 'time', 'year', 'entire', 'populations', 'allele', 'frequencies', 'change', 'enough', 'make', 'large', 'enough', 'difference', 'alter', 'fitness', '.'
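
The filtered list still contains capitalized stopwords such as 'A', 'You', and 'For', because the stop-word list is all lower-case. One possible way to catch them is to lower-case each token before the lookup. The sketch below assumes a tiny hand-picked stop-word set (`STOP_WORDS`) purely for illustration; NLTK's English list is much longer.

```python
# Hypothetical, deliberately small stop-word set for this example only
STOP_WORDS = {"a", "you", "have", "on", "the", "for", "is", "to"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens whose lower-cased form is in STOP_WORDS."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


sample_tokens = ["A", "few", "things", ".", "You", "might", "have", "negative"]
print(remove_stop_words(sample_tokens))  # ['few', 'things', '.', 'might', 'negative']
```

Using a set instead of a list also makes each membership test a constant-time lookup.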

**Lemmatization & stemming**

> Lemmatizing and stemming both reduce the size of the vocabulary, either by returning words to their dictionary base form (lemmatizing) or by stripping affixes such as suffixes and prefixes (stemming). Stemming is effective at shrinking the vocabulary, but the result is often not a real word, because it only chops off affixes rather than returning the word to its base form.
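
The difference can be sketched with a toy example. The suffix list and lookup table below are invented for illustration and are far cruder than NLTK's stemmers or the WordNet lemmatizer, but they show why a stem may not be a real word while a lemma is.

```python
# Toy data for illustration only; real stemmers and lemmatizers are far more elaborate
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"running": "run", "better": "good", "studies": "study"}


def toy_stem(word: str) -> str:
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "studies", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

Here the toy stemmer turns "running" into "runn" and "studies" into "studi", while the lookup returns "run" and "study". "better" has no suffix to strip, so only the lemma lookup maps it to "good".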

In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
print(lemmatizer.lemmatize(tokens[7]))

frequency


In [None]:
from nltk.stem.snowball import SnowballStemmer
SBS= SnowballStemmer(language="english")

from nltk.stem.porter import PorterStemmer
PS = PorterStemmer()

from nltk.stem.lancaster import LancasterStemmer
LS = LancasterStemmer()

In [None]:
print(SBS.stem("recognising"))
print(PS.stem("recognising"))
print(LS.stem(tokens[20]))       

recognis
recognis
environ


In [None]:
import string
punct = string.punctuation

In [None]:
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
New_tokens = [x for x in tokens if x not in punct]  
print(New_tokens)

['A', 'things', 'You', 'might', 'negative', 'frequency', 'dependent', 'selection', 'going', 'least', 'common', 'phenotype', 'reflected', 'genotype', 'going', 'advantage', 'environment', 'For', 'instance', 'prey', 'animal', 'vole', 'light', 'dark', 'phenotype', 'predator', 'might', 'recognize', 'common', 'phenotype', 'food', 'So', 'light', 'voles', 'common', 'foxes', 'may', 'keeping', 'closer', 'eye', 'light', 'phenotypic', 'voles', 'recognising', 'good', 'prey', 'This', 'would', 'reduce', 'light', 'causing', 'alleles', 'due', 'increased', 'predation', 'dark', 'genotypes', 'would', 'increase', 'proportion', 'population', 'scenario', 'reversed', 'This', 'cycle', 'continues', 'perpetually', 'n', 'nHowever', 'unlikely', 'strictly', 'yearly', 'usually', 'takes', 'time', 'year', 'entire', 'populations', 'allele', 'frequencies', 'change', 'enough', 'make', 'large', 'enough', 'difference', 'alter', 'fitness', 'n', 'nMore', 'likely', 'year', 'year', 'basis', 'population', 'experiencing', 'fluct

# Text representation

> Feature extraction, also known as text representation or text vectorization, is the process of converting text into numbers. It is called vectorization because the converted text takes the form of a vector.

# Label encoder


![](https://miro.medium.com/max/386/1*Yp6r7m82IoSnnZDPpDpYNw.png)

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [None]:
y=le.fit_transform(y)
print(y)

[0 2 0 ... 1 0 0]
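
Label encoding simply replaces each class name with an integer. A minimal standard-library sketch is shown below, assuming the classes are numbered in sorted order, which matches the output above (Biology is 0, Physics is 2). The function name `fit_label_encoder` is made up for this example.

```python
def fit_label_encoder(labels: list[str]) -> dict[str, int]:
    """Map each distinct label to an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


topics = ["Biology", "Physics", "Biology", "Biology", "Chemistry"]
mapping = fit_label_encoder(topics)
encoded = [mapping[topic] for topic in topics]

print(mapping)  # {'Biology': 0, 'Chemistry': 1, 'Physics': 2}
print(encoded)  # [0, 2, 0, 0, 1]
```

The integers are only identifiers; their order carries no meaning for the classifier used later.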


# Bag Of Words

> Bag of words is somewhat similar to one-hot encoding, where each word is entered as a binary value. In a bag of words, each document is kept as a single row that records the count of each word in that document.


![](https://miro.medium.com/max/880/1*hLvya7MXjsSc3NS2SoLMEg.png)
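
The idea can be sketched by hand with `collections.Counter`. The three short documents below are made up, and the sketch splits on whitespace only, so it is a simplification of what CountVectorizer does (for instance, CountVectorizer's default token pattern ignores single-character tokens such as "a").

```python
from collections import Counter

docs = ["there are bees", "bees are insects", "cesium is a metal"]

# Vocabulary: every distinct word across the corpus, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc: str) -> list[int]:
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why the matrix becomes very wide and mostly zero on a real corpus.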

In [None]:
data1

0       A few things. You might have negative- frequen...
1       Is it so hard to believe that there exist part...
2                                          There are bees
3       I'm a medication technician. And that's alot o...
4                          Cesium is such a pretty metal.
                              ...                        
8690    I make similar observations over the last week...
8691                                      You would know.
8692              Also use the correct number of sig figs
8693    What about the ethical delimmas,  groundbreaki...
8694                            I would like to know too.
Name: Comment, Length: 8695, dtype: object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bow = cv.fit_transform(data1)

In [None]:
bow

<8695x18177 sparse matrix of type '<class 'numpy.int64'>'
	with 190076 stored elements in Compressed Sparse Row format>

In [None]:
bow.shape

(8695, 18177)

In [None]:
vectorizer = CountVectorizer(max_features=4000)
bow = vectorizer.fit_transform(data1).toarray()
features = vectorizer.get_feature_names_out()
bow = pd.DataFrame(bow, columns=features)

In [None]:
bow

Unnamed: 0,000,019,02,10,100,1000,10th,11,12,13,...,young,younger,your,yourself,youtu,youtube,yt,yup,zero,zinc
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8690,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8691,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8692,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8693,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# N-Grams

> This technique is similar to bag of words. All the techniques so far build the vocabulary from single words, which loses the context given by neighbouring words. The N-gram technique addresses this by constructing the vocabulary from sequences of multiple words.




![](https://images.deepai.org/glossary-terms/867de904ba9b46869af29cead3194b6c/8ARA1.png)
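
Generating n-grams from a token list takes only a sliding window. The helper below is a minimal sketch with a made-up name (`ngrams`); it returns tuples rather than the space-joined strings that CountVectorizer uses as feature names.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


sample_tokens = "cesium is such a pretty metal".split()
print(ngrams(sample_tokens, 2))
print(ngrams(sample_tokens, 3))
```

A sentence of six tokens produces five bigrams and four trigrams, and a sentence shorter than `n` produces none.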

In [None]:
# Bigram model
from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer(ngram_range=(2, 2))  # ngram_range is a (min_n, max_n) tuple
bow = CV.fit_transform(data1)

In [None]:
bow

<8695x121785 sparse matrix of type '<class 'numpy.int64'>'
	with 228017 stored elements in Compressed Sparse Row format>

In [None]:
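vectorizer = CountVectorizer(ngram_range=(2, 2))  # ngram_range is a (min_n, max_n) tuple
# Note: a dense 8695 x 121785 integer array needs several gigabytes of memory
bow = vectorizer.fit_transform(data1).toarray()
features = vectorizer.get_feature_names_out()
bow = pd.DataFrame(bow, columns=features)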

In [None]:
bow

Unnamed: 0,000 000,000 americans,000 billion,000 btu,000 eggs,000 times,000 without,000 would,000 years,00000000000000000000000332 grams,...,µg scale,µg scales,área so,árvore como,æther or,μg in,μg nnewspapers,μm or,μμ and,الله اکبر
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8690,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8691,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8692,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8693,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# TF-IDF (Term Frequency and Inverse Document Frequency)

> The next technique works differently from the ones above. It assigns a different weight to each word in a document. The core idea is that a word that appears many times in a document but rarely in the rest of the corpus is very important for that document, so it receives a higher weight. The weight is calculated from two terms, TF (term frequency) and IDF (inverse document frequency).





![](https://miro.medium.com/max/816/1*1pTLnoOPJKKcKIcRi3q0WA.jpeg)
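
The weighting can be worked through by hand. The sketch below uses the textbook definitions, TF as the share of a document's tokens that are the term and IDF as log(N / df), on a made-up three-document corpus. TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, already tokenized
docs = [
    "bees are insects".split(),
    "bees make honey".split(),
    "cesium is a metal".split(),
]


def tf(term: str, doc: list[str]) -> float:
    """Share of the document's tokens that are the given term."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: list[list[str]]) -> float:
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)


print(round(tf_idf("bees", docs[0], docs), 3))     # 0.135
print(round(tf_idf("insects", docs[0], docs), 3))  # 0.366
```

Both words appear once in the first document, but "insects" occurs in only one of the three documents while "bees" occurs in two, so "insects" receives the higher weight.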

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_result=tfidf.fit_transform(data1).toarray()

In [None]:
tfidf_result

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
vectorizer = TfidfVectorizer(max_features=6000)
tfidf_result = vectorizer.fit_transform(data1).toarray()
features = vectorizer.get_feature_names_out()
tfidf_result = pd.DataFrame(tfidf_result, columns=features)

In [None]:
tfidf_result

Unnamed: 0,000,01,019,02,020,021,03,04,07,09,...,yours,yourself,youtu,youtube,yt,yup,zeolites,zero,zinc,zp
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8690,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8692,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Modelling : Naive Bayes


> Naive Bayes is a family of probabilistic classifiers that apply Bayes' theorem under the assumption that features are independent of each other given the class. It is commonly used for Natural Language Processing (NLP) tasks such as text categorization.

![](https://cdn-images-1.medium.com/max/600/1*aFhOj7TdBIZir4keHMgHOw.png)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
NB= MultinomialNB()

> **MultinomialNB** implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification
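
To make the idea concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier with Laplace (add-one) smoothing. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the four-document corpus are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Denominator: all word occurrences in the class plus smoothing mass
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical training data
train_docs = [
    "cell dna protein".split(),
    "enzyme cell gene".split(),
    "acid base reaction".split(),
    "molecule acid bond".split(),
]
train_labels = ["Biology", "Biology", "Chemistry", "Chemistry"]

log_prior, log_like = train_nb(train_docs, train_labels)
print(predict_nb("gene protein cell".split(), log_prior, log_like))  # Biology
print(predict_nb("acid bond".split(), log_prior, log_like))          # Chemistry
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on long documents.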

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(tfidf_result, y, test_size=0.2)
NB.fit(xtrain, ytrain)

MultinomialNB()

In [None]:
# Predict on the held-out test split
ypred = NB.predict(xtest)

In [None]:
ypred.shape

(1739,)

In [None]:
ytrain.shape

(6956,)

In [None]:
# ypred holds the predictions for xtest, so it is scored against the full ytest
print("Testing Results:\n")
print(classification_report(ytest, ypred))

Testing Results:

              precision    recall  f1-score   support

           0       0.66      0.87      0.75       748
           1       0.68      0.63      0.66       560
           2       0.88      0.48      0.62       431

    accuracy                           0.70      1739
   macro avg       0.74      0.66      0.68      1739
weighted avg       0.72      0.70      0.69      1739
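
The precision and recall columns in such a report are computed per class, from labels and predictions that belong to the same rows of the same split. The sketch below shows the calculation with the standard library only. The function name `per_class_report` and the six hypothetical labels are assumptions for this example, and the length check guards against pairing predictions with labels from a different split.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} for equally long label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must come from the same split")
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


# Hypothetical labels and predictions for a six-item test split
y_test_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 1, 2, 1]
print(per_class_report(y_test_demo, y_pred_demo))
```

In this toy split, class 0 has perfect recall but a precision of 2/3, because one class-1 item was also predicted as class 0.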

