## Natural Language Processing
The steps in a traditional NLP pipeline are:
1. **Text Cleaning -** Removing HTML tags and other markup from the text data.
2. **Text Preprocessing -** Noise removal, tokenization of words and phrases, and standardization and normalization of words, objects and tokens.
3. **Feature Engineering -** POS tagging of the words in a sentence, syntactic relationships between the words in a sentence, and Named Entity Recognition (NER), i.e. identification and classification of the entities in a sentence.
    1. Rule Based Feature extraction 
    2. Statistical Features
    3. Word Embeddings
4. **NLP Tasks -**
    1. Sentiment Classification
    2. Autocorrect & Autocomplete in Search bar
    3. Social Media Monitoring
    4. Chatbots
    5. Text Summarization
    6. Machine Translation
    7. NLG & NLU
    8. Document to Information
    9. Topic Modeling
    10. Curated content feed that has highest information entropy
    
Word2Vec represents each word as a vector in a multi-dimensional space so that the vector captures the word's semantics. It starts from a one-hot encoding of the word, a vector whose size equals the size of the vocabulary.
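
To make the one-hot starting point concrete, here is a minimal sketch. The four-word `vocabulary` and the helper name `one_hot` are made up for illustration; a real vocabulary is far larger, and this shows only the input encoding, not Word2Vec training itself.

```python
# Toy vocabulary (hypothetical); real vocabularies hold many thousands of words.
vocabulary = ["data", "science", "machine", "learning"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("science"))  # [0, 1, 0, 0]
```

Every one-hot vector is equally far from every other one, so the encoding carries no notion of similarity; learned embeddings are meant to add that.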

The objective of traditional NLP is to model language using rules.
Traditional NLP models are rule based and work well for small, specific tasks. They require far less compute than Gen AI models, but their speed and low operating cost come at the price of lower accuracy and less flexibility.

Gen AI models are more generic and give their users more flexibility.

In [5]:
import nltk

# 2. Text Preprocessing
- Noise Removal 
    - Stopword removal (e.g. is, am, or, of, an)
    - Punctuation removal
    - Removing URLs, links, social media hashtags, and mentions
    - Lowercasing
- Tokenization
- Lexicon Normalization 
    - Stemming
    - Lemmatization
- Object Standardization

## 2.a) Noise Removal
- List-based or regex-based removal of specific words that carry little useful information
- Various Other Noise Removal Steps:
   - Encoding-Decoding Noise
   - Grammar Check
   - Spelling Correction
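
The regex-based approach can be sketched as follows. The function name `remove_social_noise` and the patterns are illustrative assumptions, not a complete treatment of URLs or social media syntax.

```python
import re

def remove_social_noise(input_text):
    """Strip URLs, @mentions and #hashtags, then lowercase and tidy whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", input_text)  # URLs and links
    text = re.sub(r"[@#]\w+", " ", text)                      # mentions and hashtags
    text = re.sub(r"\s+", " ", text)                          # collapse repeated whitespace
    return text.strip().lower()

print(remove_social_noise("Loving #NLP with @nltk_org https://www.nltk.org today"))
# loving with today
```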

In [1]:
noise_list = ["is", 'a', "this"]
def remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if(word not in noise_list)]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

In [2]:
print(remove_noise("this is a sample text"))

sample text


## 2.b) Lexicon Normalization
- **Stemming** 
    - Rule-based process that strips suffixes to reduce a word to its root or base form
    - Text Standardization
    - Can cause the root form of a word to lose its meaning
- **Lemmatization** 
    - Step-by-step process of obtaining the root form of a word using vocabulary and morphological analysis
    - Restores words to their root form

In [11]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [12]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/prakharrastogi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "geese"
# Stem and lemmatize the original word separately so the two results can be compared
print(stem.stem(word))
print(lem.lemmatize(word))

gees
goose


## 2.c) Object Standardization
- Some words and phrases are not recognized by search engines and models
- Examples: acronyms, hashtags, colloquial slang
- This type of noise can be fixed using regex and manually prepared data dictionaries

In [25]:
lookup_dict = {'rt':'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love"}
def word_standardize(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

In [26]:
print(word_standardize("RT this is a retweeted tweet by Prakhar Rastogi, DM Me"))

Retweet this is a retweeted tweet by Prakhar Rastogi, direct message Me


# 3. Feature Engineering on Textual Data

## 3.a Syntactic Parsing
Analysis of the words in a sentence for grammar, and of their arrangement, in a way that shows the relationships among the words.

In [37]:
from nltk import word_tokenize, pos_tag

In [42]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/prakharrastogi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/prakharrastogi/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### POS Tagging

### Applications of POS Tagging
- Word Sense Disambiguation
- Improving Word based features
- Normalization and Lemmatization
- Efficient Stopword removal

In [45]:
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)

In [46]:
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


## 3.b Entity Extraction
- Entity detection algorithms are ensemble models of **rule based parsing**, dictionary lookups, POS tagging, and dependency parsing

## NER
An NER model consists of 3 blocks:
- Noun Phrase Identification
- Phrase Classification
- Entity Disambiguation

- The spaCy library can be used to perform basic NER.
- Special models can be trained for use-case specific NER.

# 4. Topic Modeling
- A topic is a repeating pattern of co-occurring terms in a corpus.
- Topic modeling identifies the topics present in a text corpus and derives hidden patterns among the words in the corpus in an unsupervised manner.
- An essential statistical tool for revealing the abstract topics present in a set of documents.
- Enables the discovery of concealed semantic structures within a body of text.
- Gives insight into the **underlying themes and concepts embedded in the documents** under investigation.
- Rule-based text mining - regex or keyword search based
- Useful for **document clustering, organizing blocks of textual data, categorization and labeling of articles on internet, information retrieval from unstructured data and feature selection**

- **Applications**
    - Text Content Recommendation Engines
    - Job Recommendation to right candidate
    - Organize emails, customer reviews, & user social media profiles
    - Extracting Meaningful entities from 10-K filings or other financial texts
    - Extract entities from legal documents
    - Detecting breaking news on social media
    - Detect Fake messages
    - Recommend Personalized Messages
    - Characterizing Information Flow
    
- **Downstream Use Cases**
    - Topic Analysis over the period of time (or Trend Analysis)
    - Combine topical data with geographic and demographic data like zip code, income range of zip codes, etc.

- **Statistical Techniques**
    - Term Frequency
    - Inverse Document Frequency
    - Non-negative Matrix Factorization
    - LDA (Most popular)
    
## Latent Dirichlet Allocation
- Build the Document-Term Matrix (N x M), where entry (i, j) is the frequency count of word Wj in document Di
- LDA converts the Document-Term Matrix into two lower-dimensional matrices, M1 and M2
- M1 - Document - Topic Matrix (N x K)
- M2 - Topic - Term Matrix (K x M)
- Parameters of LDA
    - Alpha - Document-Topic density
    - Beta - Topic-word density
    - High alpha - Documents are composed of more topics, and vice versa
    - High beta - Topics are composed of a large number of words, and vice versa
    
- Perform Preprocessing steps for better performance
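
As a small illustration of the N x M Document-Term Matrix and the shapes of the two factor matrices, here is a pure-Python sketch. The toy corpus `toy_docs` and the choice of K = 2 are assumptions made for this example; the code only builds the count matrix and prints the shapes, it does not run LDA.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical); N = 3 documents.
toy_docs = [["sugar", "bad", "sugar"],
            ["father", "driving", "sister"],
            ["driving", "stress"]]

# Vocabulary of M unique terms, sorted so that column indices are stable.
vocab = sorted({term for doc in toy_docs for term in doc})

# Entry (i, j) is the frequency count of term W_j in document D_i.
doc_term_matrix = [[Counter(doc)[term] for term in vocab] for doc in toy_docs]

n_docs, n_terms, n_topics = len(toy_docs), len(vocab), 2  # K = 2 is an arbitrary choice
print(vocab)
print(doc_term_matrix)
print("M1 shape:", (n_docs, n_topics), "M2 shape:", (n_topics, n_terms))
```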

## BERTopic

In [7]:
from nltk.corpus import stopwords
import string
import gensim
from gensim import corpora
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/prakharrastogi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."
doc_complete = [doc1, doc2, doc3]

In [15]:
doc6 = "Unless otherwise set forth in Policies (as defined in section 2.10), or required by state or federal law, Facility shall submit Claims to Plan, subject to any applicable HIPAA requirements, using appropriate and current Coded Service Identifiers, within one hundred twenty (120) days from the date the Health Services are rendered or Plan will refuse payment. If Plan is the secondary payor,the one hundred twenty (120) day period will not begin until Facility receives notification of primary payor's responsibility."
doc_complete = [doc6]

In [16]:
def clean(doc):
    stop_free = " ".join([word for word in doc.lower().split() if word not in stop])
    punc_free = ''.join([character for character in stop_free if character not in exclude])
    normalized = " ".join(lem.lemmatize(word) for word in punc_free.split())
    return normalized

In [17]:
doc_clean = [clean(doc).split() for doc in doc_complete]
print(doc_clean)

[['unless', 'otherwise', 'set', 'forth', 'policy', 'a', 'defined', 'section', '210', 'required', 'state', 'federal', 'law', 'facility', 'shall', 'submit', 'claim', 'plan', 'subject', 'applicable', 'hipaa', 'requirement', 'using', 'appropriate', 'current', 'coded', 'service', 'identifier', 'within', 'one', 'hundred', 'twenty', '120', 'day', 'date', 'health', 'service', 'rendered', 'plan', 'refuse', 'payment', 'plan', 'secondary', 'payorthe', 'one', 'hundred', 'twenty', '120', 'day', 'period', 'begin', 'facility', 'receives', 'notification', 'primary', 'payors', 'responsibility']]


In [18]:
def topic_model(doc_clean):
    # Creating term dictionary of our corpus, where every unique term is assigned an index
    dictionary = corpora.Dictionary(doc_clean)
    #Converting list of documents into Document term matrix using dictionary
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    #Creating object for LDA model using gensim library
    lda = gensim.models.ldamodel.LdaModel
    #Running and training LDA model on document term matrix
    ldamodel = lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
    return ldamodel

In [19]:
ldamodel = topic_model(doc_clean)
ldamodel.print_topics(num_topics=3, num_words=3)

[(0, '0.046*"plan" + 0.032*"facility" + 0.032*"twenty"'),
 (1, '0.021*"date" + 0.021*"rendered" + 0.021*"otherwise"'),
 (2, '0.021*"secondary" + 0.021*"forth" + 0.021*"refuse"')]

## Tips to improve Topic Modeling results
- **Frequency Filter:** Remove low-frequency terms, as they are weak features in the document-term matrix. Perform exploratory analysis to determine the critical frequency value.
- **POS Tag Filter**
- **Batch-wise LDA:** Running LDA multiple times, once per batch, gives different results; the best topic terms are the intersection across all batches.
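
A frequency filter could be sketched like this. The helper name `frequency_filter` and the threshold `min_count=2` are assumptions for illustration; in practice the cut-off would come from the exploratory analysis mentioned above.

```python
from collections import Counter

def frequency_filter(tokenized_docs, min_count=2):
    """Drop terms whose total count across the corpus is below min_count."""
    counts = Counter(term for doc in tokenized_docs for term in doc)
    return [[term for term in doc if counts[term] >= min_count]
            for doc in tokenized_docs]

sample_docs = [["plan", "facility", "claim"], ["plan", "payment"], ["facility", "plan"]]
print(frequency_filter(sample_docs))
# [['plan', 'facility'], ['plan'], ['facility', 'plan']]
```

When working with gensim, `Dictionary.filter_extremes` offers a related filter based on document frequency.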

## Problems with Unsupervised Topic Model
- Overlapping and Correlated Topics
- **Bag of words:** Words are treated as exchangeable, so sentence structure is not modeled
- **Unsupervised:** Weak supervision is desirable

# 5. Statistical Features

## N-grams as features

In [71]:
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

In [72]:
generate_ngrams('this is a sample text', 2)

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

## Term Frequency - Inverse Document Frequency (TF - IDF)
- Weighted model used for information retrieval problems.
- Converts text into vector models on the basis of the occurrence of words in documents, without considering their exact ordering
- **TF** = Count of a term 't' in a document
- **IDF** = Log of the ratio of the total number of documents in the corpus to the number of documents containing term 't'
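
The two definitions above can be worked through by hand. This is a minimal sketch of the plain textbook formulas; the names `toy_corpus`, `tf`, `idf` and `tf_idf` are chosen for this example, and `idf` assumes the term occurs in at least one document.

```python
import math

toy_corpus = [["this", "is", "sample", "document"],
              ["another", "random", "document"],
              ["third", "sample", "document", "text"]]

def tf(term, doc):
    """Raw count of the term in one tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """Log of (total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("document", toy_corpus[0], toy_corpus))  # 0.0, the term is in every document
print(tf_idf("sample", toy_corpus[0], toy_corpus))    # log(3/2), about 0.405
```

scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values differ from these raw ones.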

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [78]:
#Each row in the output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)

In [77]:
print(X)

  (0, 1)	0.34520501686496574
  (0, 4)	0.444514311537431
  (0, 2)	0.5844829010200651
  (0, 7)	0.5844829010200651
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (1, 1)	0.3853716274664007
  (2, 5)	0.5844829010200651
  (2, 6)	0.5844829010200651
  (2, 1)	0.34520501686496574
  (2, 4)	0.444514311537431


## Count/Density/Readability Features
- Used in models and analysis
- Features that look trivial but can have a great impact on learning models
- Features - Word Count, Sentence Count, Punctuation Count, Industry Specific Word Count
- **Other Measures -** Syllable Counts, SMOG Index, Flesch Reading Ease
- **SMOG Index -** "Simple Measure of Gobbledygook" is a readability formula that estimates how many years of education a person needs to understand a piece of writing. Mostly useful for texts with at least 30 sentences.
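
A few of these features can be computed with the standard library alone. The sketch below is illustrative: `count_features` uses a naive sentence split on `.`, `!` and `?`, and `smog_grade` applies the published SMOG formula to a polysyllable count (words of three or more syllables) that is assumed to be supplied by the caller, since syllable counting needs its own heuristic or library.

```python
import math
import re
import string

def count_features(text):
    """Return simple count-based features for one piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(text.split()),
        "sentence_count": len(sentences),
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
    }

def smog_grade(polysyllable_count, sentence_count):
    """SMOG grade from the number of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllable_count * 30 / sentence_count) + 3.1291

print(count_features("This is a sample text. It has two sentences!"))
# {'word_count': 9, 'sentence_count': 2, 'punctuation_count': 2}
print(round(smog_grade(polysyllable_count=30, sentence_count=30), 2))
```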

# Word Embedding (Text Vectors)
- Modern way of representing words as vectors.
- Redefines high-dimensional word features as low-dimensional feature vectors while preserving contextual similarity in the corpus
- Word2Vec and GloVe are two text embedding models

## Word2Vec Model

In [83]:
# Word2Vec combines a preprocessing module with two shallow neural network models: Continuous Bag of Words (CBOW) and skip-gram
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]
model = Word2Vec(sentences, min_count=1)
print(model.wv.similarity('data', 'science'))

-0.023671655


# 6. Important NLP Tasks

# Text Classification
- Systematically classify a text object into one of a fixed set of categories, for organizing, filtering, and storing information
- Email Spam Identification
- Topic Classification of news
- Sentiment Classification
- Organization of web pages by search engines

In [101]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob

In [102]:
training_corpus = [('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]

test_corpus = [("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

In [90]:
model = NBC(training_corpus)

In [92]:
print(model.classify("Their codes are amazing."))

Class_A


In [93]:
print(model.classify("I don't like their computer."))

Class_B


In [94]:
print(model.accuracy(test_corpus))

0.8333333333333334


In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import classification_report
from sklearn import svm

In [97]:
#Preparing data for SVM model
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])
    
test_data = []
test_labels = []
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

# Create feature vectors 
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)
# Apply model on test data 
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)

In [98]:
prediction

array(['Class_A', 'Class_A', 'Class_B', 'Class_B', 'Class_A', 'Class_A'],
      dtype='<U7')

In [103]:
print(classification_report(test_labels, prediction))

              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



# Text Matching/Similarity

- Matching text objects to find similarities between them
- Applications: 
    - Automatic Spelling Correction
    - Data De-duplication
    - Genome Analysis
    - Genome Analysis

## Levenshtein Distance 
- Minimum number of edits needed to transform one string into another, where the allowed edit operations are insertion, deletion, and substitution of a single character

In [104]:
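def levenshtein(s1, s2):
    # Keep s1 as the shorter string so each row of distances stays small
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # Distances from the empty prefix of s2 to every prefix of s1
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                # Matching characters cost nothing extra
                new_distances.append(distances[index1])
            else:
                # 1 + the cheapest of substitution, deletion, and insertion
                new_distances.append(1 + min(distances[index1], distances[index1 + 1], new_distances[-1]))
        distances = new_distances
    return distances[-1]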

In [105]:
print(levenshtein("analyze", "analyse"))

1


## Phonetic Matching
- Produces a character string that identifies a set of words that are roughly phonetically similar.
- Very useful for searching large text corpora, correcting spelling errors, and matching relevant names
- **Soundex** and **Metaphone** are the two main phonetic algorithms used for this purpose

In [107]:
import fuzzy

In [113]:
soundex = fuzzy.Soundex(4)
soundex("ankit")

''
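
To show how a Soundex code is built, here is a minimal pure-Python sketch of the standard American Soundex rules, independent of the `fuzzy` package. The names `SOUNDEX_CODES`, `soundex_code` and the `length` parameter are chosen for this example, and the input is assumed to be a non-empty alphabetic word.

```python
# Consonant groups of American Soundex; vowels, h, w and y carry no digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex_code(word, length=4):
    """Return the Soundex code of an alphabetic word, padded with zeros."""
    word = word.lower()
    code = word[0].upper()                    # the first letter is kept as-is
    previous = SOUNDEX_CODES.get(word[0], "")
    for letter in word[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:       # skip repeats of the same code
            code += digit
        if letter not in "hw":                # h and w do not separate equal codes
            previous = digit
    return (code + "0" * length)[:length]

print(soundex_code("Robert"), soundex_code("Rupert"))  # R163 R163
print(soundex_code("ankit"))                           # A523
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code, which is what makes the code usable as a fuzzy lookup key.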

## Flexible String Matching
- Exact String Matching
- Lemmatized Matching
- Compact Matching

## Cosine Similarity
- Also known as vector similarity
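
As a minimal sketch of the idea, the snippet below turns two texts into word-count vectors and computes the cosine of the angle between them. The function name `cosine_similarity`, the whitespace tokenization, and the choice to return 0.0 for empty texts are assumptions for this example.

```python
import math
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec1, vec2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(vec1[word] * vec2[word] for word in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(count ** 2 for count in vec1.values()))
    norm2 = math.sqrt(sum(count ** 2 for count in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm1 * norm2)

# Four of the five words are shared, so the result is close to 0.8
print(cosine_similarity("this is a sample text", "this is another sample text"))
```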

## Coreference Resolution
- Process of finding relational links among words within sentences
- Component of NLP that does this job automatically
- Used in document summarization, question answering, and information extraction

## Sentiment Analysis
- https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/?utm_source=reading_list&utm_medium=https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/

## Information Retrieval
- Using BERT and Bayes for automating data extraction tasks from diverse sources including medical records
- POS: There are 8 parts of speech in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection


## Opinion Mining
- Classify each Yelp review into one of five aspects: food, service, price, ambience, or anecdotal/miscellaneous
- Use Kaggle labelled data for opinion mining using LLM
- In a pipeline, we can use topic modeling before aspect based model
- Use **Aspect-Based Model** - Requires features/topic to look at. 

## Knowledge Graphs
- Sentence Segmentation: Split the text document or article into sentences
- https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/?utm_source=reading_list&utm_medium=https://www.analyticsvidhya.com/blog/2015/04/pagerank-explained-simple/
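
The sentence segmentation step can be approximated with a regular expression. This is a naive sketch under the assumption that sentences end in `.`, `!` or `?` followed by whitespace; the helper name `split_sentences` is made up for this example, and abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("Sugar is bad to consume. My sister likes sugar! Does my father?"))
# ['Sugar is bad to consume.', 'My sister likes sugar!', 'Does my father?']
```

For real documents, a trained segmenter such as NLTK's `sent_tokenize` or spaCy's sentence boundaries is usually a better choice.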