In this lecture, we cover several advanced NLP techniques and use machine
learning algorithms to extract information from text data. We also walk
through some advanced NLP applications, with the solution approach and
implementation for each:

1. Noun Phrase extraction
2. Text similarity
3. Parts of speech tagging
4. Information extraction – NER – Entity recognition
5. Topic modeling
6. Text classification
7. Sentiment analysis
8. Word sense disambiguation
9. Speech recognition and speech to text
10. Text to speech
11. Language detection and translation

Before getting into the recipes, let’s understand the NLP pipeline and life
cycle. This book implements many concepts, and the volume of content can be
overwhelming. To keep things simple, let’s look at the flow we need to
follow for an NLP solution.

For example, let’s consider customer sentiment analysis and
prediction for a product or brand or service.
- **Define the Problem**: Understand the customer sentiment across the products.

- **Understand the depth and breadth of the problem**: Understand the customer/user sentiments across the product. Why are we doing this? What is the business impact?

- **Data requirement brainstorming**: Have a brainstorming activity to list out all possible data points.
  - All the reviews from customers on e-commerce platforms like Amazon, Flipkart, etc.
  - Emails sent by customers
  - Warranty claim forms
  - Survey data
  - Call center conversations using speech to text
  - Feedback forms
  - Social media data like Twitter, Facebook, and LinkedIn

- **Data collection**: We learned different techniques to collect data earlier. Based on the data and the problem, we might have to incorporate different data collection methods. In this case, we can use web scraping and Twitter APIs.


- **Text Preprocessing**: We know that data won’t always be clean. We need to spend a significant amount of time processing it and extracting insight from it, using the different methods that we discussed earlier.

- **Text to feature**: As we discussed, texts are characters and machines will have a tough time understanding them. We have to convert them to features that machines and algorithms can understand using any of the methods we learned in the previous chapter.


- **Machine learning/Deep learning**: Machine learning and deep learning sit under the artificial intelligence umbrella and let systems learn patterns in data automatically, without being explicitly programmed. Most NLP solutions are based on them, and since we have converted text to features, we can use machine learning or deep learning algorithms for goals such as text classification and natural language generation.


- **Insights and deployment**: There is no use in building NLP solutions unless proper insights are communicated to the business. Always take time to connect the dots between the model/analysis output and the business, thereby creating the maximum impact.

## 1. Extracting Noun Phrases

Noun Phrase extraction is important when you want to analyze the “who”
in a sentence. Let’s see an example below using TextBlob.

In [None]:
!python -m textblob.download_corpora

In [101]:
#Import libraries
import nltk
from textblob import TextBlob
#Extract noun
blob = TextBlob("John is learning natural language processing.He lives in USA")
for np in blob.noun_phrases:
    print(np)

john
natural language processing.he
usa


## 2. Finding Similarity Between Texts

We are going to discuss how to find the similarity between two documents
or texts. There are many similarity metrics, such as Euclidean, cosine, and
Jaccard. Applications of text similarity can be found in areas like
spelling correction and data deduplication.
Here are a few of the similarity measures:

- **Cosine similarity**: Calculates the cosine of the angle between the two vectors.
- **Jaccard similarity**: The score is calculated as the size of the intersection of the two word sets divided by the size of their union.
- **Jaccard Index** = (the number in both sets) / (the number in either set) * 100.
- **Levenshtein distance**: Minimal number of insertions, deletions, and replacements required for transforming string “a” into string “b.”
- **Hamming distance**: Number of positions at which the two strings have different symbols. It is defined only for strings of equal length.
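
The measures listed above can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and raw word counts (not TF-IDF weights); the function names are made up for this example and do not come from any library. Note that `jaccard` returns a fraction between 0 and 1 rather than a percentage.

```python
from collections import Counter
from math import sqrt


def cosine_sim(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Size of the word intersection divided by the size of the word union."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(jaccard("I like NLP", "I like advanced NLP"))   # 0.75
print(levenshtein("kitten", "sitting"))               # 3
print(hamming("karolin", "kathrin"))                  # 3
```

Cosine and Jaccard work on whole words here, while Levenshtein and Hamming compare strings character by character, which is why the latter two suit spelling correction.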

The simplest way to do this is by using cosine similarity from the sklearn
library.

In [None]:
documents = (
"I like NLP",
"I am exploring ML",
"I am a beginner in NLP",
"I want to learn NLP",
"I like advanced NLP"
)

In [None]:
#Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

In [None]:
tfidf_matrix

In [None]:
print(tfidf_matrix.toarray())

In [None]:
tfidf_matrix.shape

In [None]:
tfidf_matrix[0:1].toarray()

In [None]:
#compute similarity for first sentence with rest of the sentences
cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)

Looking at the output, the first and last sentences have a higher
similarity to each other than to the rest of the sentences.

### Phonetic matching

The next kind of similarity checking is phonetic matching, which roughly
matches two words or sentences by how they sound. It encodes each word or
text as an alphanumeric string. It is very useful for searching large text
corpora, correcting spelling errors, and matching relevant names. Soundex
and Metaphone are the two main phonetic algorithms used for this purpose.
The cells below use a simplified hand-written Soundex function and the
`metaphone` package.

In [None]:
def soundex(word):
    soundex_code = word[0].upper()

    mapping = {
        'BFPV': '1',
        'CGJKQSXZ': '2',
        'DT': '3',
        'L': '4',
        'MN': '5',
        'R': '6'
    }

    for char in word[1:]:
        for key, code in mapping.items():
            if char.upper() in key:
                if code != soundex_code[-1]:
                    soundex_code += code

    soundex_code = soundex_code.replace("0", "")
    soundex_code = soundex_code[:4].ljust(4, "0")
    
    return soundex_code

# Example usage
word = "Smith"
print("Soundex code for '{}': {}".format(word, soundex(word)))  # Output: S530


In [None]:
# Example usage
word = "Smythe"
print("Soundex code for '{}': {}".format(word, soundex(word)))  # Output: S530

In [None]:
# Example usage
word = "John"
print("Soundex code for '{}': {}".format(word, soundex(word)))  # Output: J500

In [None]:
!pip install metaphone


In [None]:
from metaphone import doublemetaphone

def double_metaphone(word):
    # doublemetaphone returns a (primary, secondary) pair of phonetic codes
    return doublemetaphone(word)

# Example usage
for word in ["Smith", "Smythe", "John"]:
    print("Double Metaphone codes for '{}': {}".format(word, double_metaphone(word)))


## 3. Tagging Part of Speech

Part-of-speech (POS) tagging is another crucial part of natural language
processing. It labels each word with a part of speech such as noun, verb,
or adjective. POS tagging is the basis for Named Entity Recognition,
Sentiment Analysis, Question Answering, and Word Sense Disambiguation.

There are two ways to build a tagger:
- **Rule based** - Manually created rules that assign a word to a particular POS.
- **Stochastic based** - These algorithms model the sequence of words and assign the most probable tag sequence, for example using hidden Markov models.
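
To make the rule-based idea concrete, here is a toy tagger built from hand-written regular expressions. The rules and the `rule_based_tag` name are illustrative assumptions, far too crude for real text; NLTK's `RegexpTagger` follows a similar first-match-wins approach.

```python
import re

# Hand-written (pattern, tag) rules, tried in order; the first match wins.
RULES = [
    (r"^\d+(\.\d+)?$", "CD"),                   # numbers
    (r"^(the|a|an)$", "DT"),                    # determiners
    (r"^(i|he|she|we|they|you|it)$", "PRP"),    # personal pronouns
    (r".*ing$", "VBG"),                         # gerunds
    (r".*ed$", "VBD"),                          # simple past
    (r".*ly$", "RB"),                           # adverbs
    (r".*s$", "NNS"),                           # plural nouns
    (r".*", "NN"),                              # default: singular noun
]


def rule_based_tag(tokens):
    """Tag each token with the tag of the first rule whose pattern matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.match(pattern, token.lower()):
                tagged.append((token, tag))
                break
    return tagged


print(rule_based_tag("She quickly finished 2 reports".split()))
# [('She', 'PRP'), ('quickly', 'RB'), ('finished', 'VBD'), ('2', 'CD'), ('reports', 'NNS')]
```

A stochastic tagger replaces these fixed rules with probabilities learned from a tagged corpus, so it can resolve words whose tag depends on context.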

NLTK provides a ready-to-use POS tagging module. nltk.pos_tag(tokens) takes
a list of tokens and returns a (word, tag) pair for each one. Loop over the
sentences to generate POS tags for all the words present in the document.

In [None]:
Text = "I love NLP and I will learn NLP in 2 month"

In [None]:
type(Text)

In [None]:
# Importing necessary packages and stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
# Tokenize the text
tokens = sent_tokenize(Text)

In [None]:
tokens

In [None]:
type(tokens)

In [None]:
#Generate tagging for all the tokens using loop
for i in tokens:
    words = nltk.word_tokenize(i)
    words = [w for w in words if w not in stop_words]
    # POS-tagger.
    tags = nltk.pos_tag(words)

In [None]:
tags

Below are the short forms and explanations of the POS tags. The word
“love” is tagged VBP, which means verb, non-3rd person singular present.

- CC coordinating conjunction
- CD cardinal number
- DT determiner
- EX existential there (like: “there is” ... think of it like “there exists”)
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective ‘big’
- JJR adjective, comparative ‘bigger’
- JJS adjective, superlative ‘biggest’
- LS list marker 1)
- MD modal could, will
- NN noun, singular ‘desk’
- NNS noun plural ‘desks’
- NNP proper noun, singular ‘Harrison’
- NNPS proper noun, plural ‘Americans’
- PDT predeterminer ‘all the kids’
- POS possessive ending parent’s
- PRP personal pronoun I, he, she
- PRP$ possessive pronoun my, his, hers
- RB adverb very, silently
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- TO to go ‘to’ the store
- UH interjection
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, non-3rd person sing. present take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when

## 4. Extract Entities from Text

Named Entity Recognition (NER) is a natural language processing (NLP) technique used to identify and classify named entities in text into predefined categories such as person names, organization names, locations, dates, and more. The goal of NER is to extract and understand the key entities mentioned in the text, providing a deeper level of understanding about the content.

Named entities are specific pieces of information that refer to real-world objects and entities. For example, in the sentence "Barack Obama was born in Hawaii and served as the 44th President of the United States," the named entities are "Barack Obama" (PERSON), "Hawaii" (GPE - geopolitical entity), "44th" (ORDINAL), and "the United States" (GPE).

NER is essential in various NLP applications such as information extraction, question answering, sentiment analysis, and text summarization. It helps in structuring unstructured text data by identifying relevant entities and their types.

How NER works:
1. Tokenization: The input text is split into individual words or tokens.
2. Part-of-Speech (POS) Tagging: Each token is assigned a part-of-speech tag, such as noun, verb, adjective, etc.
3. Chunking: Groups of tokens with specific POS tags are identified, forming chunks or phrases.
4. Entity Recognition: Using context and patterns, the algorithm identifies and classifies chunks into named entity categories like PERSON, ORGANIZATION, GPE, DATE, etc.
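
The chunking and entity recognition steps above can be sketched with a toy example. Everything here is an assumption for illustration: the POS tags are written out by hand instead of coming from a tagger, and the `GAZETTEER` lookup table is a hypothetical stand-in for a trained classifier.

```python
# Hypothetical, hand-made lookup table standing in for a trained classifier.
GAZETTEER = {
    "John": "PERSON",
    "Stanford University": "ORGANIZATION",
    "California": "GPE",
}


def chunk_proper_nouns(tagged_tokens):
    """Step 3: group runs of consecutive NNP tokens into candidate chunks."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NNP":
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def recognize_entities(tagged_tokens):
    """Step 4: label each chunk via the gazetteer; unknown chunks get 'UNKNOWN'."""
    return [(chunk, GAZETTEER.get(chunk, "UNKNOWN"))
            for chunk in chunk_proper_nouns(tagged_tokens)]


# Steps 1-2 (tokenization and POS tagging) are written out by hand here.
tagged = [("John", "NNP"), ("is", "VBZ"), ("studying", "VBG"), ("at", "IN"),
          ("Stanford", "NNP"), ("University", "NNP"), ("in", "IN"),
          ("California", "NNP")]
print(recognize_entities(tagged))
# [('John', 'PERSON'), ('Stanford University', 'ORGANIZATION'), ('California', 'GPE')]
```

Real NER systems learn the classification step from annotated data, so they can label entities that never appeared in any lookup table.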

NER models are often built using machine learning techniques, particularly sequence labeling approaches like conditional random fields (CRFs) or deep learning architectures like bidirectional LSTM (Long Short-Term Memory) networks.

Libraries like `spaCy`, NLTK, StanfordNLP, and Flair offer NER capabilities and make it easier to perform Named Entity Recognition in Python.

In [None]:
sent = "John is studying at ,,  Stanford University in California"

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [None]:
!pip install ghostscript

In [None]:
#import libraries
import nltk
from nltk import ne_chunk
from nltk import word_tokenize
#NER
ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False)

In [None]:
word_tokenize(sent)

In [None]:
sent.split()

### Using SpaCy

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Read/create a sentence
doc = nlp('Apple is ready to launch new phone worth $10000 in New york time square ')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "Apple Inc. is a technology company based in California, United States. It was founded by Steve Jobs and Steve Wozniak on April 1, 1976."

# Process the text using SpaCy's NLP pipeline
doc = nlp(text)

# Extract and display named entities
for entity in doc.ents:
    print(entity.text, entity.label_)


According to the output of the first example, Apple is an organization,
10000 is money, and New York is a place. Entities extracted this way can
feed into other NLP applications.

## 5. Extracting Topics from Text

We are going to discuss how to identify topics from the
document. Say, for example, there is an online library with multiple
departments based on the kind of book. As the new book comes in,
you want to look at the unique keywords/topics and decide on which
department this book might belong to and place it accordingly. In these
kinds of situations, topic modeling would be handy.
Basically, this is document tagging and clustering.

### 5.1 Create the text data

In [None]:
doc1 = "I am learning NLP, it is very interesting and exciting.it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"

In [None]:
doc_complete = [doc1, doc2, doc3]
doc_complete

### 5.2 Cleaning and preprocessing

In [None]:
# Install and import libraries
!pip install gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [None]:
# Text preprocessing
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [None]:
print(stop)
print(exclude)
print(lemma)

In [None]:
def clean(doc):
    # Lowercase the document, split it into words, and drop stopwords
    words = [word for word in doc.lower().split() if word not in stop]

    # Strip punctuation characters, including those attached to words (e.g. "nlp,")
    no_punct = "".join(ch for ch in " ".join(words) if ch not in exclude)

    # Lemmatize the remaining words and join them back into a single string
    return " ".join(lemma.lemmatize(word) for word in no_punct.split())

In [None]:
doc1

In [None]:
clean(doc1)

In [None]:
lemma.lemmatize('learning')

In [None]:
clean

In [None]:
# Clean and normalize each sentence using the 'clean' function
cleaned_sentences = [clean(sentence).split() for sentence in doc_complete]

print(cleaned_sentences)

### 5.3 Preparing document term matrix

A Document-Term Matrix (DTM) is a widely used representation of text data in Natural Language Processing (NLP) and Information Retrieval (IR). It is a mathematical matrix that describes the frequency of terms (words or phrases) that occur in a collection of documents. Each row of the matrix represents a document, and each column represents a term. The elements of the matrix are typically the term frequencies or some other measure of importance for the terms within the corresponding documents.

The DTM is a fundamental data structure used in various text analysis tasks, including text classification, topic modeling, sentiment analysis, and more. It serves as the basis for many text-based machine learning models.

Here's how a Document-Term Matrix is constructed:

1. Tokenization: The text in each document is first tokenized into individual words or terms. Tokenization breaks the text into smaller units, typically words, but it can also be n-grams (combinations of n words).

2. Vocabulary Creation: The vocabulary is the set of all unique terms that appear in the collection of documents. Each term becomes a column in the DTM.

3. Counting Term Frequencies: For each document, the occurrence of each term is counted. The count represents how many times a specific term appears in that document.

4. Assembling the Matrix: The DTM is constructed by arranging the term frequencies for each document in a matrix, where rows represent documents, and columns represent terms.

Here's a simple example of a Document-Term Matrix:

Consider the following three sentences:

1. "I love natural language processing."
2. "Text analysis is fascinating."
3. "Natural language processing is a subfield of NLP."

Vocabulary (ignoring case and the pronoun "I"): love, natural, language, processing, text, analysis, fascinating, is, a, subfield, of, NLP.

DTM:

| Document | love | natural | language | processing | text | analysis | fascinating | is | a | subfield | of | NLP |
|----------|------|---------|----------|------------|------|----------|-------------|----|---|----------|----|-----|
| Sentence 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sentence 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Sentence 3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

In this example, the DTM represents the three sentences and the term frequencies for each term within each sentence. The DTM is a useful data representation for further text analysis and modeling tasks.
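
As a rough sketch of the construction steps, the following plain-Python snippet builds a DTM for the three example sentences. It assumes lowercasing, a simple alphabetic tokenizer, and a fixed vocabulary list that leaves out the pronoun "I"; these choices are made for this illustration only and are not what gensim does internally.

```python
from collections import Counter
import re

sentences = [
    "I love natural language processing.",
    "Text analysis is fascinating.",
    "Natural language processing is a subfield of NLP.",
]

# Fixed vocabulary (one column per term); "i" is deliberately left out.
vocabulary = ["love", "natural", "language", "processing", "text", "analysis",
              "fascinating", "is", "a", "subfield", "of", "nlp"]


def tokenize(text):
    # Lowercase the text and keep runs of alphabetic characters
    return re.findall(r"[a-z]+", text.lower())


# One row per document, one count per vocabulary term
dtm = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    dtm.append([counts[term] for term in vocabulary])

for row in dtm:
    print(row)
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
# [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
```

Most entries are zero, which is why libraries usually store the DTM in a sparse format such as gensim's list of (term id, count) pairs.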

In [None]:
# Importing gensim
import gensim
from gensim import corpora
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(cleaned_sentences)
# Converting a list of documents (corpus) into Document-TermMatrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleaned_sentences]
print(doc_term_matrix)

### 5.4 LDA model

Latent Dirichlet Allocation (LDA) is a widely used probabilistic generative model in Natural Language Processing (NLP) and Machine Learning for topic modeling. LDA is used to discover hidden topics within a collection of documents by analyzing the distribution of words across these topics. It is an unsupervised learning algorithm that can automatically identify the main themes or topics present in the text data.

The LDA model assumes that each document is a mixture of various topics, and each topic is a probability distribution over words. The goal of LDA is to infer the topic distribution for each document and the word distribution for each topic.
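
The generative assumption in the previous paragraph can be sketched in a few lines. This is only an illustration of how LDA imagines documents being produced, not the inference algorithm gensim runs; the two topics, their word probabilities, and the `alpha` value are invented for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "nlp":    {"language": 0.4, "text": 0.3, "token": 0.3},
    "mobile": {"android": 0.5, "app": 0.3, "screen": 0.2},
}


def sample_dirichlet(alpha, k):
    """Draw k topic proportions from a symmetric Dirichlet(alpha)."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5):
    """Generate one document: pick a topic mixture, then a topic and word per slot."""
    names = list(topics)
    theta = sample_dirichlet(alpha, len(names))           # document-topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=theta)[0]   # pick a topic for this slot
        dist = topics[topic]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return theta, words


theta, words = generate_document(8)
print([round(p, 2) for p in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.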

Here's a high-level overview of how the LDA model works:

1. Initialization: Choose the number of topics `k` you want to discover in the corpus of documents.

2. Preprocessing: Prepare the text data by tokenizing the documents, removing stop words, and performing other text cleaning steps.

3. Create the Document-Term Matrix (DTM): Represent the text data as a DTM, where rows represent documents, and columns represent terms (words).

4. LDA Model Fitting: Apply the LDA algorithm to the DTM to find the topic distributions for each document and the word distributions for each topic. The algorithm uses probabilistic inference to estimate these distributions.

5. Interpretation: Examine the output of the LDA model to understand the discovered topics. Each topic will have a distribution of words, and each document will have a distribution of topics. The most probable words in each topic can be considered the representative words of that topic.

6. Topic Assignments: The LDA model assigns a probability distribution of topics to each document. It allows documents to be associated with multiple topics to varying degrees.

Applications of LDA:
- Topic Modeling: Discovering the main themes or topics within a collection of documents.
- Document Clustering: Grouping similar documents based on their topic distributions.
- Information Retrieval: Enhancing search results by associating documents with relevant topics.
- Recommender Systems: Providing recommendations based on user preferences and document-topic associations.

LDA is a powerful tool for unsupervised analysis of text data and is widely used in various domains to gain insights from large collections of documents.

Let's walk through an example of using the LDA model for topic modeling with sample texts. For simplicity, we'll use a smaller set of documents.

Sample Texts:
1. "I love natural language processing."
2. "Text analysis is fascinating."
3. "Natural language processing is a subfield of NLP."
4. "Machine learning is an essential part of NLP."
5. "NLP applications are widespread in the industry."

Now, let's use the LDA model to discover topics in these sample texts:

Step 1: Preprocessing
We'll preprocess the text data by tokenizing, converting to lowercase, and removing stopwords. We'll also remove single-character words and punctuations.

Preprocessed Texts:
1. ["love", "natural", "language", "processing"]
2. ["text", "analysis", "fascinating"]
3. ["natural", "language", "processing", "subfield", "NLP"]
4. ["machine", "learning", "essential", "part", "NLP"]
5. ["NLP", "applications", "widespread", "industry"]

Step 2: Create Document-Term Matrix (DTM)
The DTM will represent the text data, where each row corresponds to a document, and each column corresponds to a term (word).

|            | love | natural | language | processing | text | analysis | fascinating | subfield | NLP | machine | learning | essential | part | applications | widespread | industry |
|------------|------|---------|----------|------------|------|----------|-------------|----------|-----|---------|----------|-----------|------|--------------|------------|----------|
| Document 1 | 1    | 1       | 1        | 1          | 0    | 0        | 0           | 0        | 0   | 0       | 0        | 0         | 0    | 0            | 0          | 0        |
| Document 2 | 0    | 0       | 0        | 0          | 1    | 1        | 1           | 0        | 0   | 0       | 0        | 0         | 0    | 0            | 0          | 0        |
| Document 3 | 0    | 1       | 1        | 1          | 0    | 0        | 0           | 1        | 1   | 0       | 0        | 0         | 0    | 0            | 0          | 0        |
| Document 4 | 0    | 0       | 0        | 0          | 0    | 0        | 0           | 0        | 1   | 1       | 1        | 1         | 1    | 0            | 0          | 0        |
| Document 5 | 0    | 0       | 0        | 0          | 0    | 0        | 0           | 0        | 1   | 0       | 0        | 0         | 0    | 1            | 1          | 1        |


Step 3: Train the LDA Model
We'll use the DTM to train the LDA model and discover topics in the sample texts.

Step 4: Interpretation
The LDA model will provide us with topic distributions for each document and word distributions for each topic. We can then interpret the topics based on the most probable words in each topic.

In this example, the LDA model may discover topics like:
- Topic 1: "NLP, natural language processing, subfield"
- Topic 2: "Text analysis, fascinating"
- Topic 3: "Machine learning, essential part"
- Topic 4: "Applications, widespread, industry"

Keep in mind that the number of topics and their interpretations depend on the input data and the hyperparameters chosen for the LDA model. The goal of the LDA model is to identify topics that capture the main themes present in the documents.

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and training the LDA model on the document-term matrix for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word =dictionary, passes=50)
# Results
print(ldamodel.print_topics())

In [None]:
# Train the LDA model
num_topics = 2  # Number of topics to discover
lda_model = gensim.models.LdaModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the discovered topics and their most probable words
for topic_id in range(num_topics):
    print(f"Topic {topic_id + 1}: {lda_model.print_topic(topic_id)}")

The weights of the words within each topic look almost identical, which is
expected on such a tiny corpus. The point of running LDA on sample data is
to get familiar with the workflow. You can use the same code snippet on a
large corpus to extract significant topics and insights.

## 6. Classifying Text

Text classification: the aim of text classification is to automatically classify text documents into predefined categories.

Applications:
- Sentiment Analysis
- Document classification
- Spam – ham mail classification
- Resume shortlisting
- Document summarization

**Spam - ham classification using machine learning**

"Spam" and "ham" are terms commonly used in the context of email classification and spam filtering. They refer to two categories of emails based on their content.

1. Spam:
Spam refers to unsolicited or unwanted emails, typically sent in bulk to a large number of recipients. These emails often contain promotional content, advertisements, scams, or other forms of irrelevant and potentially harmful messages. The primary purpose of sending spam emails is usually to promote products, services, or fraudulent activities.

2. Ham:
Ham, on the other hand, refers to legitimate and non-spam emails. These are the desired emails that users expect to receive from known and trusted sources, such as personal or work-related emails, newsletters from subscribed services, and communication from friends, colleagues, or organizations.

Spam filtering is a technique used to automatically identify and separate spam emails from legitimate ones (ham) in an email inbox. Machine learning algorithms, such as Naive Bayes, Support Vector Machines, and Deep Learning models, are commonly used for spam filtering tasks. These algorithms analyze the content and characteristics of emails to classify them as either spam or ham.
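
To show how a Naive Bayes spam filter reasons, here is a minimal multinomial Naive Bayes written in plain Python. The four training messages are made up for this sketch (they are not from any dataset), and the `predict` function with Laplace smoothing `alpha` is an illustrative assumption, not scikit-learn's implementation.

```python
from collections import Counter
from math import log

# Tiny hand-made training set (hypothetical messages, not from a real dataset)
train = [
    ("win a free prize now", "spam"),
    ("free entry claim your prize", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("see you at home tonight", "ham"),
]

# Count words per class and documents per class
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def predict(text, alpha=1.0):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    scores = {}
    for label in word_counts:
        total = sum(word_counts[label].values())
        score = log(doc_counts[label] / len(train))             # log prior
        for word in text.split():
            if word in vocab:                                    # skip unseen words
                score += log((word_counts[label][word] + alpha)
                             / (total + alpha * len(vocab)))     # log likelihood
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("claim your free prize"))   # spam
print(predict("lunch at home today"))     # ham
```

Each word nudges the score toward the class where it was seen more often, and smoothing keeps a single unseen word from zeroing out a class.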

By using spam filters, email providers and users can help reduce the annoyance and potential risks associated with receiving unwanted or harmful spam emails, ensuring that legitimate emails reach the inbox while unwanted spam is filtered out.

For example, Gmail has a folder called “Spam.” It classifies your emails
into spam and ham so that you don’t have to read unnecessary emails.

### 6.1 Data collection and understanding

Please download the data from the link below and save it in your working
directory:

https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

In [71]:
import pandas as pd
#Read the data
Email_Data = pd.read_csv("spam.csv",encoding ='latin1')
#Data understanding
Email_Data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [72]:
Email_Data

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [73]:
Email_Data = Email_Data[['v1', 'v2']]
Email_Data = Email_Data.rename(columns={"v1":"Target","v2":"Email"})
Email_Data.head()

Unnamed: 0,Target,Email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### 6.2 Text processing and feature engineering

In [76]:
#import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import os
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import sklearn.feature_extraction.text as text
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm

In [77]:
#pre processing steps like lower case, stemming and lemmatization
st = PorterStemmer()
Email_Data['Email'] = Email_Data['Email'].apply(lambda x:" ".join(x.lower() for x in x.split()))
stop = stopwords.words('english')
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [81]:
Email_Data['Email'][3]

'u dun say earli hor... u c alreadi say...'

In [82]:
#Splitting data into train and validation
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Email_Data['Email'], Email_Data['Target'])


In the context of TF-IDF vectorization, the terms "analyzer" and "token pattern" are important concepts that relate to how text is processed and tokenized before creating the TF-IDF representation.

1. Analyzer:
The analyzer is a component in TF-IDF vectorization that defines how the text data should be preprocessed and tokenized before generating the TF-IDF representation. It essentially determines the unit of analysis, which can be at the word level, character level, or any other custom level.

The default analyzer in scikit-learn's TfidfVectorizer is `word`, which means that the input text will be tokenized into individual words. With the default settings the text is also lowercased and single-character tokens are dropped, so the sentence "I love natural language processing" will be tokenized into the following words: ["love", "natural", "language", "processing"].

You can also specify a custom analyzer by providing your own function. For instance, you could create a custom analyzer that tokenizes the text at the character level, or applies additional preprocessing steps like stemming or lemmatization before tokenization.

2. Token Pattern:
The token pattern is a regular expression (regex) that specifies what constitutes a "token" during tokenization. It defines the rules for breaking the text into smaller units, which are then considered as individual tokens.

By default, the token pattern in scikit-learn's TfidfVectorizer is `r"(?u)\b\w\w+\b"`. This pattern means that tokens are formed by sequences of two or more word characters (letters, digits, or underscores) bounded by word boundaries. The `(?u)` is used to indicate that Unicode word boundaries should be used, allowing tokenization of words with non-ASCII characters.

For example, with the default token pattern (and default lowercasing), the sentence "Text analysis is fascinating" will be tokenized into the following words: ["text", "analysis", "is", "fascinating"].

You can customize the token pattern to suit your specific requirements. For instance, if you want to include single-letter words, you can modify the pattern to `r"(?u)\b\w+\b"`. If you want to consider phrases as tokens, you can use a different regex pattern to match phrases based on your desired criteria.

In summary, the analyzer and token pattern in TF-IDF vectorization allow you to control how text data is preprocessed, tokenized, and converted into a numerical representation. These options are essential for fine-tuning the TF-IDF vectorization process to best suit your specific text analysis needs.
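
The effect of the token pattern can be checked directly with Python's `re` module. This sketch mimics the word analyzer by lowercasing the text and applying the regex with `re.findall`; it only illustrates the two patterns discussed above and does not cover other vectorizer options such as stop words or n-grams.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"     # default: tokens of 2+ word characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"   # also keeps one-character tokens

text = "I love natural language processing"

# TfidfVectorizer lowercases by default before tokenizing, so we do the same.
default_tokens = re.findall(DEFAULT_PATTERN, text.lower())
single_char_tokens = re.findall(SINGLE_CHAR_PATTERN, text.lower())

print(default_tokens)       # ['love', 'natural', 'language', 'processing']
print(single_char_tokens)   # ['i', 'love', 'natural', 'language', 'processing']
```

The pattern `r'\w{1,}'` used in the feature generation cell behaves like the second pattern here, so one-character tokens such as "u" and "c" in the SMS data are kept as features.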

In [83]:
# TFIDF feature generation for a maximum of 5000 features
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the label mapping learned on the training labels
tfidf_vect = TfidfVectorizer(analyzer='word',token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(Email_Data['Email'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
xtrain_tfidf.data

array([0.57107987, 0.30126137, 0.35392139, ..., 0.27726381, 0.20726274,
       0.30688117])

### 6.3 Model training

In [84]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    # accuracy against the encoded validation labels (valid_y is defined in the cell above)
    return metrics.accuracy_score(valid_y, predictions)

In [85]:
# Naive Bayes training
accuracy = train_model(naive_bayes.MultinomialNB(alpha=0.2),xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)

Accuracy:  0.9877961234745154


In [86]:
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(),xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)

Accuracy:  0.9655419956927495


On this validation set, Naive Bayes (about 98.8% accuracy) gives better results
than the linear classifier (about 96.6%). We can try many more classifiers and
then choose the best one.

## 7. Carrying Out Sentiment Analysis

In this section, we discuss how to determine the sentiment of
a particular sentence or statement. Sentiment analysis is widely used
across industries to understand how customers and users feel about
products and services. It assigns a sentence or statement a sentiment
score that tends toward positive or negative.

The simplest way to do this is by using the **TextBlob** or **VADER** library.

Let’s follow the steps in this section to do sentiment analysis using
TextBlob. It gives two metrics:
- Polarity: lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
- Subjectivity: lies in the range [0, 1], where values near 1 mean the text is mostly personal opinion and values near 0 mean it is mostly factual information.

Polarity and subjectivity are two important concepts in sentiment analysis and text analysis:

1. Polarity:
Polarity refers to the sentiment expressed in a piece of text, indicating whether the text expresses a positive, negative, or neutral sentiment. In sentiment analysis, the polarity score typically ranges from -1 to 1, where:
   - A polarity score of -1 indicates a highly negative sentiment.
   - A polarity score of 0 indicates a neutral sentiment.
   - A polarity score of 1 indicates a highly positive sentiment.

For example:
- "I love this product! It's amazing." - Positive polarity (polarity score close to 1)
- "I hate Mondays." - Negative polarity (polarity score close to -1)
- "The weather is fine." - Neutral polarity (polarity score close to 0)

2. Subjectivity:
Subjectivity refers to the degree of subjectiveness or objectiveness in a piece of text. A subjective statement expresses personal opinions, emotions, or feelings, while an objective statement presents factual information without any personal bias. Subjectivity is usually measured on a scale from 0 to 1, where:
   - A subjectivity score of 0 indicates complete objectivity (the text is purely factual).
   - A subjectivity score of 1 indicates complete subjectivity (the text is purely opinionated or emotional).

For example:
- "The sun rises in the east." - Objective (subjectivity score close to 0)
- "In my opinion, the movie was fantastic!" - Subjective (subjectivity score close to 1)

In sentiment analysis, both polarity and subjectivity scores are commonly used to understand the sentiment expressed in the text and to distinguish between objective and subjective statements. These scores are helpful for various NLP applications, including sentiment analysis, text classification, and opinion mining.
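As a rough illustration of how such scores can arise, the sketch below averages per-word scores from a tiny lexicon. The lexicon `TOY_LEXICON` and the function `toy_sentiment` are hypothetical, and the numbers are invented for this example; they are not TextBlob's values. Real analyzers also handle negation and intensifiers, which this sketch ignores.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
# The values are invented for illustration only.
TOY_LEXICON = {
    "love": (0.5, 0.6),
    "amazing": (0.6, 0.9),
    "good": (0.7, 0.6),
    "hate": (-0.8, 0.9),
    "bad": (-0.7, 0.7),
}

def toy_sentiment(text):
    """Average the scores of the words found in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        # No opinion words found: treat the text as neutral and objective
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

positive = toy_sentiment("I love this product! It's amazing.")
negative = toy_sentiment("I hate Mondays.")
factual = toy_sentiment("The sun rises in the east.")

print(positive, negative, factual)
```

Sentences without any lexicon word come out as (0.0, 0.0), which matches the intuition that purely factual statements are neutral and objective.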

In [87]:
review = "I like this phone. screen quality and camera clarity is really good."
review2 = "This tv is not good. Bad quality, no clarity, worste xperience"


In [89]:
#import libraries
from textblob import TextBlob
#TextBlob has a pre trained sentiment prediction model
blob = TextBlob(review2)
blob.sentiment

Sentiment(polarity=-0.5249999999999999, subjectivity=0.6333333333333333)

This is a negative review, as the polarity is about -0.52.
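If a discrete label is needed instead of a raw score, the polarity can be mapped to one with a small helper. This is only a sketch: the function `label_from_polarity` and the cut-off of 0.1 are assumptions chosen for illustration, not part of TextBlob.

```python
def label_from_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is a hypothetical cut-off; tune it for your data.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity value copied from the TextBlob output shown above
label = label_from_polarity(-0.5249999999999999)
print(label)
```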

## 8. Disambiguating Text

Ambiguity arises when a word has different meanings in different
contexts.
For example,

- Text1 = 'I went to the bank to deposit my money'
- Text2 = 'The river bank was full of dead fishes'

In the above texts, the word “bank” has different meanings based on
the context of the sentence.

The Lesk algorithm is one of the best-known algorithms for word sense
disambiguation. Let’s see how to apply it using the pywsd and nltk packages.

The Lesk Algorithm is a word sense disambiguation (WSD) algorithm used to determine the correct sense of an ambiguous word within a given context. It was proposed by Michael E. Lesk in 1986. Word sense disambiguation is an important task in natural language processing (NLP) to resolve the ambiguity that arises when a word has multiple meanings or senses depending on its context.

The Lesk Algorithm is based on the idea that words in a given context tend to share similar words in their definitions. It utilizes the glosses (short definitions) of different word senses from a lexical database, such as WordNet, and compares these glosses with the words in the context of the ambiguous word.

Here's a high-level overview of the Lesk Algorithm:

1. Given an ambiguous word and its context sentence.
2. Retrieve the glosses (short definitions) of all possible word senses of the ambiguous word from a lexical database like WordNet.
3. Tokenize the context sentence to obtain individual words.
4. For each candidate word sense, calculate the overlap between the words in its gloss and the set of words in the context. The overlap is determined using set intersection or other similarity metrics.
5. Choose the word sense with the highest overlap as the disambiguated sense for the ambiguous word.

By comparing the glosses of different word senses with the words in the context, the Lesk Algorithm aims to find the most relevant sense that best fits the context.




The Lesk Algorithm has been widely used in NLP tasks, especially in the field of lexical semantics and word sense disambiguation. While the algorithm is relatively simple, it provides a practical and effective approach for resolving word sense ambiguity in various applications, such as machine translation, information retrieval, and question answering systems.

The following step-by-step walkthrough of the Lesk Algorithm uses a simple example to help you visualize the process.

Let's consider the word "bank," which can have multiple senses (meanings) depending on the context:

1. Bank (financial institution)
2. Bank (side of a river)

Context Sentence: "I went to the bank to deposit some money."

- Step 1: Get Synsets from WordNet
Retrieve all possible word senses (synsets) of the word "bank" from WordNet.

- Step 2: Get Glosses (Short Definitions)
Retrieve the glosses (short definitions) for each synset.

```
Synset 1 (Bank - financial institution):
    Gloss: "a financial institution that accepts deposits and channels the money into lending activities"

Synset 2 (Bank - side of a river):
    Gloss: "sloping land (especially the slope beside a body of water)"
```

- Step 3: Tokenize the Context Sentence
Tokenize the context sentence to obtain individual words.

Context Tokens: ["I", "went", "to", "the", "bank", "to", "deposit", "some", "money"]

- Step 4: Calculate Overlap
For each synset's gloss, count the words it shares with the context sentence. Function words such as "the" and "to" are ignored here, and word forms are compared after stemming, so "deposit" matches "deposits". The overlap can be determined using set intersection or other similarity metrics.

```
Overlap with Synset 1:
    "money" - exact match
    "deposit" / "deposits" - match after stemming
    Total overlap: 2

Overlap with Synset 2:
    no shared content words
    Total overlap: 0
```

- Step 5: Choose the Best Sense
Select the synset with the highest overlap as the disambiguated sense for the ambiguous word "bank." In this example, Synset 1 (financial institution) has the higher overlap (2 versus 0), so it is chosen as the sense of "bank" in this context.

In more complex examples, the Lesk Algorithm considers more words and senses, and the overlap calculations become more extensive. However, this simplified example illustrates the core idea of how the Lesk Algorithm works to disambiguate an ambiguous word based on its context.
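The walkthrough above can be sketched in a few lines of plain Python. This is a simplified illustration, not the pywsd implementation: the glosses are copied from the WordNet definitions printed in this section, while the stopword list, the crude plural-stripping "stemmer", and the names `content_words` and `simplified_lesk` are assumptions made for the example.

```python
import re

# Small hypothetical stopword list, just enough for this example
STOPWORDS = {"a", "an", "the", "i", "to", "of", "my", "some",
             "that", "and", "into", "was", "especially"}

GLOSSES = {
    "financial_institution": "a financial institution that accepts deposits "
                             "and channels the money into lending activities",
    "river_bank": "sloping land (especially the slope beside a body of water)",
}

def content_words(text):
    """Lowercase, drop stopwords, and strip a trailing 's' (a very crude stemmer)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in words if w not in STOPWORDS}

def simplified_lesk(context, glosses):
    """Return the sense whose gloss shares the most content words with the context."""
    context_set = content_words(context)
    overlaps = {sense: len(context_set & content_words(gloss))
                for sense, gloss in glosses.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

best_sense, overlaps = simplified_lesk("I went to the bank to deposit my money", GLOSSES)
print(best_sense, overlaps)

# With bare glosses, the river sentence shares no content word with either sense
_, river_overlaps = simplified_lesk("The river bank was full of dead fishes", GLOSSES)
print(river_overlaps)
```

For the second sentence both overlaps are zero, so bare glosses cannot separate the senses. This is why practical implementations typically extend each sense's signature with example sentences and related words before counting overlap.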

### 8.1 Import libraries

In [90]:
#First, import the libraries:
#Install pywsd
!pip install pywsd
#Import functions
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain
from pywsd.lesk import simple_lesk

Warming up PyWSD (takes ~10 secs)... took 7.8118109703063965 secs.


### 8.2 Disambiguating word sense

In [92]:
# Sentences
bank_sents = ['I went to the bank to deposit my money','The river bank was full of dead fishes']
# calling the lesk function and printing results for both the sentences
print ("Context-1:", bank_sents[0])
answer = simple_lesk(bank_sents[0],'bank')
print ("Sense:", answer)
print ("Definition : ", answer.definition())
print ("Context-2:", bank_sents[1])
answer = simple_lesk(bank_sents[1],'bank','n')
print ("Sense:", answer)
print ("Definition : ", answer.definition())

Context-1: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition :  a financial institution that accepts deposits and channels the money into lending activities
Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)


Observe that in context-1, “bank” is a financial institution, but in
context-2, “bank” is sloping land.

## 9. Converting Speech to Text

The simplest way to do this is by using the SpeechRecognition and PyAudio libraries.

Interaction with machines is trending toward voice, which is the usual
way of human communication. Popular examples are Siri, Alexa, and Google
Voice.

In [93]:
!pip install SpeechRecognition
!pip install PyAudio
import speech_recognition as sr

If you are using a Mac, you may face problems while installing PyAudio. In that case, run the command below in the terminal to install the package:

<span style="color:red;">conda install -c anaconda pyaudio</span>

After you run the code snippet below, whatever you say into the
microphone is converted into text using the `recognize_google` function.



In [None]:
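r = sr.Recognizer()
with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")
try:
    # recognize_google sends the recorded audio to Google's web speech API
    print("I think you said: " + r.recognize_google(audio))
except sr.UnknownValueError:
    # the speech could not be understood
    print("Sorry, could not understand the audio")
except sr.RequestError as e:
    # the API was unreachable or returned an error
    print("Could not request results:", e)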

## 10. Converting Text to Speech

Converting text to speech is another useful NLP technique.

The simplest way to do this is by using the gTTS library.


The gTTS (Google Text-to-Speech) library is a Python library and CLI tool that interfaces with Google Translate's text-to-speech API. It provides a simple interface to generate speech from text and save it as an MP3 file, and it supports multiple languages.

In [96]:
!pip install gTTS
from gtts import gTTS

In [None]:
!pip install --upgrade pip


After you run the code snippet below, whatever you pass in the `text` parameter is converted into audio.

In [97]:
from gtts import gTTS
import os

# Text to convert to speech
text = "LWM tamil youtube channel is the best in the world!"
# Create a gTTS object
tts = gTTS(text=text, lang='en')

# Save the speech as an audio file
tts.save("output.mp3")

## 11. Translating Speech

Whenever you try to analyze data from blogs that are hosted across the
globe, especially websites from countries like China, where Chinese is
used predominantly, analyzing that data or performing NLP tasks on it
is difficult. That’s where language translation comes to the rescue: you translate the text from one language to another.

In [98]:
!pip install goslate
import goslate

In [99]:
text = "LWM ist das beste YouTube Kanal der Welt"

In [100]:
gs = goslate.Goslate()
translatedText = gs.translate(text,'en')
print(translatedText)

LWM is the best YouTube channel in the world
