# Natural Language Processing (NLP) with Scikit-Learn

## Introduction to Natural Language Processing
Natural language processing (NLP) is an automated way to understand and analyze natural human languages and to extract information from language data by applying machine learning algorithms.

![Intro_nlp.PNG](attachment:Intro_nlp.PNG)

It is also described as the field of computer science or AI concerned with extracting linguistic information from the underlying data.

### Why Natural Language Processing
The world is now connected globally due to the advancement of technology and devices. This has resulted in a high volume of data around the world and has led to a number of challenges in analyzing that data, such as:
* Analyzing tons of data
* Identifying various languages
* Applying quantitative analysis
* Handling ambiguities

NLP can automate much of this analysis by using modern software libraries, modules, and packages.
![Intro_nlp2.PNG](attachment:Intro_nlp2.PNG)

### NLP Terminology
* __Word boundaries__: 
    * Determines where one word ends and the other begins
* __Tokenization__:
    * Splits words, phrases, and idioms
* __Stemming__:
    * Maps to the valid root word
* __Tf-idf__:
    * Represents term frequency and inverse document frequency
* __Semantic analytics__:
    * Compares words, phrases, and idioms in a set of documents to extract meaning
* __Disambiguation__:
    * Determines meaning and sense of words (context vs. intent)
* __Topic models__:
    * Discovers topics in a collection of documents
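
As a rough illustration of word boundaries and tokenization, the sketch below splits text into sentences and then into lowercase word tokens using only the standard library. The function names `split_sentences` and `tokenize` and the regular expressions are assumptions made for this example; libraries such as NLTK provide far more robust tokenizers.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return lowercase word tokens; apostrophes inside words are kept."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

text = "NLP is fun. Don't you think so? Let's tokenize!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A simple pattern like this will mis-handle abbreviations such as "e.g." and hyphenated words, which is one reason word-boundary detection is listed as its own problem.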

### NLP Approach for Text Data
Let us look at the Natural Language Processing approaches to analyze text data.
* Conduct basic text processing
* Categorize and tag words
* Classify text
* Extract information
* Analyze sentence structure
* Build feature-based structure
* Analyze the meaning



### Demo - 01: Demonstrate the installation of NLP environment

In this demo, we will see how to install the NLP environment.
* Open the Anaconda prompt.
* Run pip install scikit-learn
* Run pip install nltk
* Once installed, open a Python terminal in the Anaconda command prompt.
* Type import nltk, and press Enter.
* Type nltk.download()
* A window will pop up; from there, download the required NLTK subpackages, such as stopwords.
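
To confirm that the installation worked, a quick check along the following lines may help. It is only a sketch: the helper name `is_installed` is made up for this example, and it relies on the standard-library `importlib.util.find_spec` rather than on anything specific to NLTK or Scikit-learn.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Import names can differ from pip names: 'scikit-learn' is imported as 'sklearn'.
for module_name in ("sklearn", "nltk"):
    status = "installed" if is_installed(module_name) else "missing"
    print(f"{module_name}: {status}")
```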

### Demo - 02: Demonstrate how to perform the sentence analysis

In [8]:
#Eliminate punctuation and stopwords from the sentence
#Import required library
import string
from nltk.corpus import stopwords

#view first 10 stopwords present in English corpus
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [17]:
#Create a test sentence
test_sentence = 'This is my first test string. wow!! we are doing hust fine.'

#eliminate the punctuation character by character and print the result
no_punctuation = [char for char in test_sentence if char not in string.punctuation]
no_punctuation

['T',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'm',
 'y',
 ' ',
 'f',
 'i',
 'r',
 's',
 't',
 ' ',
 't',
 'e',
 's',
 't',
 ' ',
 's',
 't',
 'r',
 'i',
 'n',
 'g',
 ' ',
 'w',
 'o',
 'w',
 ' ',
 'w',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 'd',
 'o',
 'i',
 'n',
 'g',
 ' ',
 'h',
 'u',
 's',
 't',
 ' ',
 'f',
 'i',
 'n',
 'e']

In [18]:
#now join the remaining characters back into a whole sentence without punctuation
no_punctuation = ''.join(no_punctuation)
no_punctuation

'This is my first test string wow we are doing hust fine'

In [19]:
#split the new sentence into individual words (stopwords are removed in the next step)
no_punctuation.split()

['This',
 'is',
 'my',
 'first',
 'test',
 'string',
 'wow',
 'we',
 'are',
 'doing',
 'hust',
 'fine']

In [22]:
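#Eliminate stopwords
#build the stopword set once so that each lookup is fast
english_stopwords = set(stopwords.words('english'))
clean_sentence = [word for word in no_punctuation.split() if word.lower() not in english_stopwords]
print(clean_sentence)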

['first', 'test', 'string', 'wow', 'hust', 'fine']
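
The steps of this demo can be folded into one reusable function, sketched below. The function name `clean_sentence` and the small `sample_stop_words` set are assumptions chosen so that the example runs without the NLTK corpus; in practice one would likely pass `set(stopwords.words('english'))` instead.

```python
import string

def clean_sentence(sentence, stop_words):
    """Strip punctuation, split on whitespace, and drop stopwords (case-insensitive)."""
    no_punctuation = sentence.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.split() if word.lower() not in stop_words]

# Hypothetical, hand-picked stopword set so the sketch runs without NLTK data.
sample_stop_words = {"this", "is", "my", "we", "are", "doing"}

test_sentence = 'This is my first test string. wow!! we are doing hust fine.'
print(clean_sentence(test_sentence, sample_stop_words))
```

Using `str.translate` removes all punctuation in a single pass instead of rebuilding the sentence character by character.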


### Applications of NLP
1. __Machine Translation:__
    Machine translation is used to translate one language into another. Google Translate is an example. It uses NLP to translate the input data from one language to another.
2. __Speech Recognition:__
    The speech recognition application understands human speech and uses it as input information. It is useful for applications like Siri, Google Now, and Microsoft Cortana.
3. __Sentiment Analysis:__
    Sentiment analysis processes large volumes of text received from different interfaces and sources to determine the opinions it expresses. For example, NLP can be applied to social media activity to find the popular topics of discussion and their importance.

### Major NLP Libraries
* NLTK
* Scikit-Learn
* TextBlob
* spaCy

## The Scikit-Learn Approach
Scikit-learn is a very powerful library with a set of modules to process and analyze data, such as text and images, and to extract information from it using machine learning algorithms.

* __Built-in module:__
    Contains built-in modules to load the dataset’s content and categories.
    
* __Feature extraction:__
    A way to extract information from data which can be text or images.
    
* __Model training:__
    Analyzes the content based on particular categories and then trains them according to a specific model.
    
* __Pipeline building mechanism:__
    A technique in the Scikit-learn approach that streamlines the NLP process into stages.
        Various stages of pipeline learning:
        1. Vectorization
        2. Transformation
        3. Model training and application

* __Performance optimization:__
    In this stage we train the models to optimize the overall process.
    
* __Grid search for finding good parameters:__
    It’s a powerful way to search parameters affecting the outcome for model training purposes.
    
### Modules to Load Content and Category
Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of a data load object. 
![Intro_nlp3-2.PNG](attachment:Intro_nlp3-2.PNG)

The text files are loaded with categories as subfolder names.
![Intro_nlp4.PNG](attachment:Intro_nlp4.PNG)

![Intro_nlp5.PNG](attachment:Intro_nlp5.PNG)
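
As a sketch of how text files are loaded with subfolder names as categories, the example below builds a tiny temporary folder structure and reads it with `sklearn.datasets.load_files`. The folder names `ham` and `spam`, the file name, and the message texts are invented for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # One subfolder per category, each holding one small text file.
    for category, text in [("ham", "See you at lunch"), ("spam", "Win a free prize now")]:
        folder = Path(root) / category
        folder.mkdir()
        (folder / "message1.txt").write_text(text, encoding="utf-8")

    # shuffle=False keeps the documents in folder order for a predictable result.
    dataset = load_files(root, encoding="utf-8", shuffle=False)

print(dataset.target_names)  # category names taken from the subfolder names
print(dataset.target)        # integer label of each document
print(dataset.data)          # decoded text of each document
```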

The attributes of a data load object are:
![Intro_nlp6.PNG](attachment:Intro_nlp6.PNG)
The example shows how a dataset can be loaded using Scikit-learn:

__Example on Digits dataset to view the data using built-in methods.__


In [31]:
#Import dataset
from sklearn.datasets import load_digits

#Load the dataset
#create object of the loaded dataset
digit_dataset = load_digits()

#Describe the dataset
#Use the built-in DESCR attribute to describe the dataset
digit_dataset.DESCR

".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixel

In [32]:
#View type of dataset
type(digit_dataset)

sklearn.utils.Bunch

In [33]:
#View the data
digit_dataset.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [34]:
#View the target
digit_dataset.target

array([0, 1, 2, ..., 8, 9, 8])
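
Beyond `DESCR`, `data`, and `target`, a few other attributes of the loaded object are worth a look. The following sketch prints the shapes and the class labels; it assumes nothing beyond `load_digits` itself.

```python
from sklearn.datasets import load_digits

digit_dataset = load_digits()

# Each row of `data` is one flattened 8x8 image, so it has 64 features.
print(digit_dataset.data.shape)
# `images` holds the same pixels in their original 8x8 layout.
print(digit_dataset.images.shape)
# `target_names` lists the distinct class labels (the digits 0 to 9).
print(digit_dataset.target_names)
# The loaded object also supports dictionary-style access to its fields.
print(sorted(digit_dataset.keys()))
```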

## Feature Extraction
Feature extraction is a technique to convert content into numerical vectors so that machine learning can be performed on it.

1. __Text feature extraction:__
    For example: Large datasets or documents
2. __Image feature extraction:__
    For example: Patch extraction, hierarchical clustering
    

### Bag of Words
Bag of words is used to convert text data into numerical feature vectors with a fixed size.

__Step 1 - Tokenizing:__
Assign a fixed integer id to each word

__Step 2 - Counting:__
Number of occurrences of each word

__Step 3 - Storing:__
Store the count of each word as the value of its feature.
![Intro_nlp7.PNG](attachment:Intro_nlp7.PNG)
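
The three steps can be sketched by hand with the standard library, as shown below. The two sample sentences and the helper name `to_vector` are made up for this example, and ids are assigned in order of first appearance, whereas `CountVectorizer` assigns them alphabetically.

```python
from collections import Counter

documents = ["the cat sat", "the cat saw the dog"]

# Step 1 - Tokenizing: assign a fixed integer id to each distinct word.
vocabulary = {}
for document in documents:
    for word in document.split():
        vocabulary.setdefault(word, len(vocabulary))

# Steps 2 and 3 - Counting and storing: one fixed-size count vector per document.
def to_vector(document):
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = count
    return vector

print(vocabulary)
for document in documents:
    print(to_vector(document))
```

Every document maps to a vector of the same length (the vocabulary size), which is what lets documents of different lengths be fed to the same model.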

### CountVectorizer Class Signature
![Intro_nlp8.PNG](attachment:Intro_nlp8.PNG)

### Demo - 03: Bag of Words
Demonstrate the Bag of Words technique to process documents

In [38]:
#import the required library
from sklearn.feature_extraction.text import CountVectorizer

#instantiate the vectorizer
vectorizer = CountVectorizer()

#Create 3 Documents
document1 = 'Hi How are you'
document2 = 'today is very very very pleasant day and we can have some fun fun fun'
document3 = 'This was an amazing experience'

#put all documents together
list_of_documents = [document1,document2,document3]

#fit the vectorizer to learn the vocabulary (fit returns the vectorizer itself)
bag_of_words = vectorizer.fit(list_of_documents)

#check the fitted vectorizer
bag_of_words


CountVectorizer()

In [40]:
#Apply transform method
bag_of_words = vectorizer.transform(list_of_documents)

#print bag of words
print(bag_of_words)

  (0, 3)	1
  (0, 9)	1
  (0, 10)	1
  (0, 19)	1
  (1, 2)	1
  (1, 4)	1
  (1, 5)	1
  (1, 7)	3
  (1, 8)	1
  (1, 11)	1
  (1, 12)	1
  (1, 13)	1
  (1, 15)	1
  (1, 16)	3
  (1, 18)	1
  (2, 0)	1
  (2, 1)	1
  (2, 6)	1
  (2, 14)	1
  (2, 17)	1


As we can see, each line starts with a tuple, followed by the frequency of a word.
The tuple holds the document number and the feature index of a word that occurs in that document.
Feature indices are assigned when the vectorizer is fitted, and the transform method uses them to place each count. This is called the feature extraction process; to look up a feature, we refer to the index assigned to it.

In [41]:
#verifying the vocabulary for repeated words
print(vectorizer.vocabulary_.get('very'))
print(vectorizer.vocabulary_.get('fun'))

16
7


In [42]:
#check the type of bag of words
type(bag_of_words)

scipy.sparse.csr.csr_matrix

In this demo, we learned how to use the bag of words technique and how to transform documents using the transform method.
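
The sparse output can be hard to read, so the sketch below shows one possible way to view the same counts next to their words. It rebuilds the demo's documents so that it runs on its own; deriving `feature_names` from `vocabulary_` is a choice made here so that it does not depend on a particular Scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hi How are you',
             'today is very very very pleasant day and we can have some fun fun fun',
             'This was an amazing experience']

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and builds the count matrix in one call.
bag_of_words = vectorizer.fit_transform(documents)

# Order the learned words by their feature index so they line up with the columns.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# toarray() turns the sparse matrix into a dense one: one row per document.
dense_counts = bag_of_words.toarray()
for word, count in zip(feature_names, dense_counts[1]):
    if count:
        print(word, count)
```

Converting to a dense array is fine for three short documents, but for large corpora the sparse form should be kept to save memory.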

### Text Feature Extraction Considerations

* __Sparse__:
    This utility stores the extracted features as a sparse matrix in memory. Sparse data is common when extracting feature values, especially for large document datasets, because each document contains only a small part of the vocabulary.
* __Vectorizer__:
    It implements tokenization and occurrence counting. By default, only words with at least two letters are tokenized. We can use the analyzer parameter to customize how the text data is vectorized.
* __Tf-idf__:
    It is a term weighting utility for term frequency and inverse document frequency. Term frequency indicates how often a particular term occurs in a document. Inverse document frequency is a factor that diminishes the weight of terms that occur in many documents.
* __Decoding__:
    This utility can decode text files if their encoding is specified.
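
To make the tf-idf weighting concrete, here is a hand-rolled sketch using the textbook form: raw term count multiplied by log(N / document frequency). The sample documents and the function name `tf_idf` are invented for this example. Scikit-learn's `TfidfTransformer` uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

documents = [
    "free prize free entry",
    "meeting at noon",
    "free lunch at noon",
]
tokenized = [document.split() for document in documents]

def tf_idf(term, tokens):
    """Raw term count times log(N / document frequency).

    Assumes the term occurs in at least one document of the corpus.
    """
    term_frequency = Counter(tokens)[term]
    document_frequency = sum(1 for other in tokenized if term in other)
    return term_frequency * math.log(len(tokenized) / document_frequency)

# 'free' occurs twice in the first document but also appears in another document.
print(round(tf_idf("free", tokenized[0]), 3))
# 'prize' occurs once but is unique to the first document.
print(round(tf_idf("prize", tokenized[0]), 3))
```

Although "free" is more frequent in the first document, "prize" receives the higher weight because it appears in fewer documents overall.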

## Model Training
An important task in model training is to identify the right model for the given dataset. The choice of model completely depends on the type of dataset.

1. Supervised:
    * Models predict the outcome of new observations and datasets, and classify documents based on the features and response of a given dataset. 
    * Example: Naïve Bayes, SVM, linear regression, k-nearest neighbors (k-NN)
2. Unsupervised:
    * Models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms.
    * Example: K-means
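
The difference between the two families shows up directly in how the models are fitted. The sketch below uses a tiny invented corpus: the supervised model receives features and labels, while the unsupervised one receives only features. The messages, labels, and parameter values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Tiny hand-made corpus, invented for illustration only.
messages = ["win a free prize", "free prize inside", "lunch at noon", "meeting at noon"]
labels = ["spam", "spam", "ham", "ham"]
features = CountVectorizer().fit_transform(messages)

# Supervised: the model is fitted on the features AND the known labels.
classifier = KNeighborsClassifier(n_neighbors=1).fit(features, labels)
print(classifier.predict(features[:1]))

# Unsupervised: the model sees only the features and groups them itself.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(clusterer.labels_)
```

The cluster numbers printed by K-means are arbitrary identifiers; unlike the classifier's output, they carry no meaning such as "spam" or "ham".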

### Naïve Bayes Classifier
It is one of the most basic techniques for text classification.

__Advantages:__
* It is efficient as it uses limited CPU and memory.
* It is fast as the model training takes less time.

__Uses:__
* Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
* Multinomial Naïve Bayes is used when multiple occurrences of the words matter.

Let us take a look at the signature of the multinomial Naïve Bayes classifier:
![Naive1.PNG](attachment:Naive1.PNG)
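
A minimal sketch of the multinomial Naïve Bayes classifier on word counts might look as follows. The four training messages and their labels are invented for illustration, and `alpha=1.0` simply spells out the default smoothing value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hand-made corpus, invented for illustration only.
train_messages = [
    "win a free prize now",
    "free entry win cash",
    "see you at lunch today",
    "are we meeting for lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_messages)

# alpha is the additive (Laplace) smoothing parameter; 1.0 is the default.
classifier = MultinomialNB(alpha=1.0)
classifier.fit(train_counts, train_labels)

# New text must be transformed with the SAME fitted vectorizer.
new_counts = vectorizer.transform(["free cash prize", "lunch meeting today"])
print(classifier.predict(new_counts))
```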


### Grid Search and Multiple Parameters
Document classifiers can have many parameters. A grid search helps to find the parameter values that give the best model, so that outcomes are predicted more accurately.
![Grid1.PNG](attachment:Grid1.PNG)

![Grid2.PNG](attachment:Grid2.PNG)

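In the grid search mechanism, the candidate values of each parameter form a grid, and a search can be run on the entire grid or on a chosen combination of parameters. Each combination is evaluated with cross-validation, which divides the dataset into multiple folds.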
![Grid3.PNG](attachment:Grid3.PNG)
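
To see what a grid of parameters amounts to, the sketch below enumerates every combination with the standard library. The grid itself is hypothetical: `tfidf__use_idf` and `clf__alpha` follow the step-name-plus-parameter naming used with pipelines, and the candidate values were picked only for illustration.

```python
from itertools import product

# Hypothetical grid: two pipeline parameters with two and three candidate values.
parameter_grid = {
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-2, 1e-3, 1e-4],
}

names = list(parameter_grid)
candidates = [dict(zip(names, values)) for values in product(*parameter_grid.values())]

for candidate in candidates:
    print(candidate)

# Every candidate is fitted once per cross-validation fold.
n_folds = 5
print(len(candidates), "candidates x", n_folds, "folds =", len(candidates) * n_folds, "fits")
```

The number of fits grows multiplicatively with each added parameter, which is why grids are usually kept small.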

### Pipeline
A pipeline is a combination of vectorizers, transformers, and model training.

![pipe1.PNG](attachment:pipe1.PNG)
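
A small pipeline can be sketched as follows, with one step for each of the stages of vectorization, transformation, and model training. The corpus is invented for illustration, and `MultinomialNB` is used here only as a convenient classifier; any other estimator could take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hand-made corpus, invented for illustration only.
messages = ["win a free prize now", "free entry win cash",
            "see you at lunch today", "are we meeting for lunch"]
labels = ["spam", "spam", "ham", "ham"]

text_pipeline = Pipeline([
    ("vect", CountVectorizer()),    # stage 1: vectorization
    ("tfidf", TfidfTransformer()),  # stage 2: transformation
    ("clf", MultinomialNB()),       # stage 3: model training and application
])

# fit() runs every stage in order; predict() reuses the fitted stages on new text.
text_pipeline.fit(messages, labels)
print(text_pipeline.predict(["free prize", "lunch today"]))
```

Because the pipeline accepts raw text, there is no need to call the vectorizer and transformer separately when predicting.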

### Demo - 04:  Pipeline and Grid Search
Demonstrate the Pipeline and Grid Search technique.

In [55]:
#import required libraries
import pandas as pd
import string
from pprint import pprint
from time import time

#import dataset
#get spam data collection with the help of the pandas library
df_spam_collection = pd.read_csv('D:\\NIPUN_SC_REC\\3_Practice_Project\\Course_5_Data Science with Python\\Lesson_recap\\SMSSpamCollection.csv', sep = ',', names=['response','message'])

#view first 5 records with head method
df_spam_collection.head()

|   | response | message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [64]:
#import text processing library from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#import SGD Classifier
from sklearn.linear_model import SGDClassifier

#import Grid Search
from sklearn.model_selection import GridSearchCV

#import pipeline
from sklearn.pipeline import Pipeline

#define pipeline
pipeline = Pipeline([
            ('vect',CountVectorizer()),
            ('tfidf',TfidfTransformer()),
            ('clf',SGDClassifier())
            ])

#parameter for grid search
parameters = {'tfidf__use_idf':(True,False)}

#perform the gridsearch with pipeline and parameters
grid_search = GridSearchCV(pipeline,parameters,n_jobs = -1,verbose = 1)

print('Performing Grid Search now..')
print('Parameters...')
pprint(parameters)
t0 = time()
grid_search.fit(df_spam_collection['message'],
               df_spam_collection['response'])
print('done in %0.3fs' %(time() - t0))
print()

Performing Grid Search now..
Parameters...
{'tfidf__use_idf': (True, False)}
Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


done in 0.692s



[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.4s finished
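
Once a grid search has been fitted, its results can be inspected through attributes such as `best_score_` and `best_params_`. The sketch below is self-contained, so it swaps the SMS file for a tiny invented corpus and uses `cv=2` and `random_state=0` purely to keep the toy example small and repeatable; scores on such data are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hand-made corpus standing in for the SMS data; invented for illustration.
messages = ["win a free prize now", "free entry win cash", "claim your free cash",
            "prize waiting claim now", "see you at lunch today",
            "are we meeting for lunch", "call me after the meeting",
            "see you after lunch"]
responses = ["spam"] * 4 + ["ham"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])
parameters = {"tfidf__use_idf": (True, False)}

# cv=2 only because the toy corpus is so small.
grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(messages, responses)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)
# With the default refit=True, the search object predicts with the best pipeline.
print(grid_search.predict(["free prize now"]))
```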
