## NLP methods <br>
Extract information from text data using NLP and ML methods.

## 1. Extracting Noun Phrases <br>
Use case: noun phrases matter when you need to identify the 'who' in a sentence.

In [1]:
import nltk
from textblob import TextBlob

# extract
blob = TextBlob("Harsha is learning natural language processing")
for np in blob.noun_phrases:
    print(np)

harsha
natural language processing


## 2. Finding the similarity between texts<br>
Different types of similarity:<br>
**Cosine similarity**: The cosine of the angle between two vectors.<br>
**Jaccard similarity**: The score is the size of the intersection of the two word sets divided by the size of their union.<br>
**Jaccard Index**: (the number in both sets) / (the number in either set) * 100 <br>
**Levenshtein distance**: The minimum number of insertions, deletions, and replacements required to transform string "a" into string "b".<br>

**Hamming distance**: The number of positions at which the two strings have different symbols. It is defined only for strings of equal length. <br>
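
The three string-based measures listed here can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names are made up for this example, words are split on whitespace, and treating two empty texts as identical is a convention chosen here.

```python
def jaccard_similarity(text_a, text_b):
    """Shared words divided by all distinct words across both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # convention chosen for this sketch
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(jaccard_similarity("I like NLP", "I like advanced NLP"))
print(levenshtein_distance("kitten", "sitting"))
print(hamming_distance("karolin", "kathrin"))
```

Multiplying the Jaccard score by 100 gives the percentage form of the Jaccard Index.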

Cosine similarity looks at the angle between two vectors; Euclidean distance looks at the distance between two points.

Let's say you are in an e-commerce setting and you want to compare users for product recommendations:

User 1 bought 1x eggs, 1x flour and 1x sugar.<br>
User 2 bought 100x eggs, 100x flour and 100x sugar.<br>
User 3 bought 1x eggs, 1x Vodka and 1x Red Bull.<br>
By cosine similarity, user 1 and user 2 are more similar. By Euclidean distance, user 3 is more similar to user 1.
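
The shopping example can be checked numerically. The sketch below assumes each user is encoded as a count vector over the products (eggs, flour, sugar, Vodka, Red Bull); that ordering and the helper names are choices made for this illustration.

```python
import math

# Purchase counts in the order: eggs, flour, sugar, Vodka, Red Bull
user1 = [1, 1, 1, 0, 0]
user2 = [100, 100, 100, 0, 0]
user3 = [1, 0, 0, 1, 1]


def cosine_sim(u, v):
    """Cosine of the angle between two vectors; ignores their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_dist(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print("cosine    1 vs 2:", cosine_sim(user1, user2))
print("cosine    1 vs 3:", cosine_sim(user1, user3))
print("euclidean 1 vs 2:", euclidean_dist(user1, user2))
print("euclidean 1 vs 3:", euclidean_dist(user1, user3))
```

Users 1 and 2 point in the same direction, so their cosine similarity is at its maximum even though they are far apart in absolute quantities. User 3 is close to user 1 in distance but shares only one product with them.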

In [2]:
documents = (
"I like NLP",
"I am exploring NLP",
"I am a beginner in NLP",
"I want to learn NLP",
"I like advanced NLP"
)

In [3]:
# lets find the similarity

#Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#Compute tfidf : feature engineering

In [4]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)

(5, 10)


In [5]:
#compute similarity for first sentence with rest of the sentences
cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)

array([[ 1.        ,  0.17682765,  0.14284054,  0.13489366,  0.68374784]])

Observation: the first and last sentences are more similar to each other than the first sentence is to the rest of the sentences.
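
To make the TF-IDF and cosine steps less of a black box, here is a from-scratch sketch of the same computation. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization); the helper names are made up for this example, and it is an illustration rather than a replacement for `TfidfVectorizer`.

```python
import math
import re

documents = (
    "I like NLP",
    "I am exploring NLP",
    "I am a beginner in NLP",
    "I want to learn NLP",
    "I like advanced NLP",
)


def tokenize(text):
    # Lowercase and keep tokens with at least two word characters,
    # so "I" and "a" are dropped
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(documents)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(term in doc for doc in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_vector(tokens):
    # Term count times IDF, then scale the vector to unit length
    weights = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


vectors = [tfidf_vector(tokens) for tokens in tokenized]

# For unit-length vectors the cosine similarity is just the dot product
similarities = [sum(a * b for a, b in zip(vectors[0], v)) for v in vectors]

print(len(documents), len(vocabulary))
print([round(s, 4) for s in similarities])
```

"NLP" appears in every document, so it gets the lowest IDF weight; the shared word "like" is what lifts the similarity between the first and last sentences.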

Method 2: **Phonetic matching** roughly matches two words or sentences by encoding each as an alphanumeric string that represents how it sounds.<br>
Usage: searching large text corpora, correcting spelling errors, and matching relevant names.<br>
Algorithms: Soundex and Metaphone.<br>
Example: "Natural" and "Natuaral" get the same code.

In [6]:
import fuzzy

In [15]:
soundex = fuzzy.Soundex(8)


In [None]:
#generate the phonetic form
soundex("hi")
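
For readers without the `fuzzy` package installed, the idea behind Soundex can be sketched in plain Python. This follows the commonly described American Soundex rules; the function name and the `length` parameter are choices made for this sketch, and its codes are not guaranteed to match the `fuzzy` library's output character for character.

```python
def simple_soundex(word, length=4):
    """Encode a word as its first letter plus digits for the consonants that follow."""
    groups = (
        ("bfpv", "1"),
        ("cgjkqsxz", "2"),
        ("dt", "3"),
        ("l", "4"),
        ("mn", "5"),
        ("r", "6"),
    )
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = codes.get(ch, "")  # vowels map to "" and reset the repeat check
        if code and code != previous:
            result += code
        previous = code
    # Pad with zeros or truncate to the requested length
    return (result + "0" * length)[:length]


print(simple_soundex("Natural"), simple_soundex("Natuaral"))
print(simple_soundex("Robert"), simple_soundex("Rupert"))
```

Because vowels are skipped and repeated codes are collapsed, the misspelling "Natuaral" maps to the same code as "Natural".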

## 3. Tagging part of speech<br>
Part-of-speech (POS) tagging labels each word with its part of speech, such as noun, verb, or adjective. POS tagging is the base for Named Entity Resolution, Sentiment Analysis, Question Answering, and word sense disambiguation.<br>
Two approaches: <br>
Rule based: Manually created rules that assign a word to a particular POS.<br>
Stochastic: These algorithms capture the sequence of the words and tag the probability of the sequence using hidden Markov models.<br>

Uses nltk.pos_tag(words), where words is a list of tokens. <br>
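
To illustrate the rule-based approach, here is a toy tagger built from a handful of hand-written patterns. The rules, the default tag, and the function name are all hypothetical and chosen only for this sketch; real rule-based taggers use far larger rule sets, and NLTK's own tagger does not work this way.

```python
import re

# Hypothetical hand-written rules, checked in order; the first match wins
RULES = [
    (r"\d+", "CD"),                      # digits -> cardinal number
    (r"I|you|he|she|it|we|they", "PRP"),  # a few personal pronouns
    (r"and|or|but", "CC"),               # coordinating conjunctions
    (r"the|a|an", "DT"),                 # determiners
    (r".+ing", "VBG"),                   # -ing suffix -> gerund
    (r".+ly", "RB"),                     # -ly suffix -> adverb
    (r"[A-Z].*", "NNP"),                 # capitalized -> proper noun
]
DEFAULT_TAG = "NN"  # anything unmatched is treated as a noun


def rule_based_tag(tokens):
    """Tag each token with the first rule whose pattern matches the whole token."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.fullmatch(pattern, token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, DEFAULT_TAG))
    return tagged


print(rule_based_tag("I love NLP and I will learn NLP in 2 month".split()))
```

The sketch tags "love" as a noun because no rule covers it, which shows why purely manual rules struggle and why stochastic taggers that use the surrounding words are preferred in practice.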

In [18]:
text  =  "I love NLP and I will learn NLP in 2 month"

In [21]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

In [23]:
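tokens = sent_tokenize(text)

# Collect tags across all sentences instead of keeping only the last one
tags = []
for sentence in tokens:
    words = nltk.word_tokenize(sentence)
    # Drop stop words (the list is lowercase, so "I" is kept)
    words = [w for w in words if w not in stop_words]
    # Tag the remaining words with their part of speech
    tags.extend(nltk.pos_tag(words))

tags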

[('I', 'PRP'),
 ('love', 'VBP'),
 ('NLP', 'NNP'),
 ('I', 'PRP'),
 ('learn', 'VBP'),
 ('NLP', 'RB'),
 ('2', 'CD'),
 ('month', 'NN')]

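Observation: "love" is tagged VBP (verb, non-3rd-person singular present). The second "NLP" is tagged RB (adverb), which is a tagger error, likely because removing stop words strips context the tagger relies on.<br>
CC - coordinating conjunction<br>
CD - cardinal digit<br>
DT - determiner<br>
FW - foreign word, and so on <br>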

## 4. Extract Entities from text<br>
Identifying and extracting entities from text is called Named Entity Recognition (NER).<br>
Libraries: NLTK chunker, StanfordNER, spaCy, OpenNLP, and NeuroNER, plus many APIs such as WatsonNLU, AlchemyAPI, NERD, Google Cloud API, and so on.<br>
Solution: ne_chunk from NLTK, or spaCy.<br>


In [1]:
sent = "John is studying at Stanford University in California"

In [4]:
import nltk
from nltk import ne_chunk
from nltk import word_tokenize
import matplotlib
matplotlib.use('Agg')

In [5]:
print( ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False) )

(S
  (PERSON John/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  in/IN
  (GPE California/NNP))


**Observation**: "John" is tagged as PERSON, "Stanford University" as ORGANIZATION, and "California" as GPE (geopolitical entity, i.e., countries, cities, states).

In [12]:
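# Method 2: spaCy
import spacy
# spaCy 3 removed the 'en' shortcut; load the small English model by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')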

In [13]:
doc = nlp(u'Apple is ready to launch new phone worth $10000 in New york time square ')

In [14]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
10000 42 47 MONEY
New york 51 59 GPE


As per the output, Apple is an organization, 10000 is money, and New York is a place.

## 5. Extracting Topics from text<br>
How to identify the topics in a document.<br>
For example, suppose an online library has multiple departments based on the kind of book. When a new book comes in, you want to look at its unique keywords/topics, decide which department the book belongs to, and place it accordingly. <br>
Solution: topic modeling with LDA from the gensim library.

In [16]:
doc1 = "I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"

doc_complete = [doc1, doc2, doc3]
doc_complete

['I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning',
 'My father is a data scientist and he is nlp expert',
 'My sister has good exposure into android development']

In [17]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [18]:
# text preprocessing
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [22]:
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [23]:
doc_clean = [clean(doc).split() for doc in doc_complete]
doc_clean

[['learning',
  'nlp',
  'interesting',
  'exciting',
  'includes',
  'machine',
  'learning',
  'deep',
  'learning'],
 ['father', 'data', 'scientist', 'nlp', 'expert'],
 ['sister', 'good', 'exposure', 'android', 'development']]

In [24]:
# preparing doc term matrix

import gensim
from gensim import corpora
# Creating the term dictionary of our corpus, where every unique term is assigned an index.

dictionary = corpora.Dictionary(doc_clean)
# Converting a list of documents (corpus) into Document-Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]
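
Each pair in the document-term matrix is (term id, count of that term in the document). The plain-Python sketch below rebuilds the same structure to show what it encodes. Assigning ids in sorted order within each document is an assumption made here because it matches the output shown; the names `token_to_id` and `to_bow` are made up for this illustration and are not gensim API.

```python
from collections import Counter

doc_clean = [
    ['learning', 'nlp', 'interesting', 'exciting', 'includes',
     'machine', 'learning', 'deep', 'learning'],
    ['father', 'data', 'scientist', 'nlp', 'expert'],
    ['sister', 'good', 'exposure', 'android', 'development'],
]

# Give every unique term an integer id, document by document
token_to_id = {}
for doc in doc_clean:
    for token in sorted(set(doc)):
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)


def to_bow(doc):
    """Return sorted (term id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())


bow_matrix = [to_bow(doc) for doc in doc_clean]
for row in bow_matrix:
    print(row)
```

The pair (4, 3) in the first row says that term 4 ("learning") occurs three times in the first document, and id 6 ("nlp") appears in both the first and second rows because both documents contain it.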

In [25]:
# LDA
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
# Results
print(ldamodel.print_topics())

[(0, '0.129*"exposure" + 0.129*"android" + 0.129*"sister" + 0.129*"good" + 0.129*"development" + 0.032*"nlp" + 0.032*"father" + 0.032*"scientist" + 0.032*"data" + 0.032*"expert"'), (1, '0.129*"nlp" + 0.129*"data" + 0.129*"scientist" + 0.129*"father" + 0.129*"expert" + 0.032*"sister" + 0.032*"exposure" + 0.032*"good" + 0.032*"android" + 0.032*"development"'), (2, '0.233*"learning" + 0.093*"deep" + 0.093*"includes" + 0.093*"interesting" + 0.093*"machine" + 0.093*"exciting" + 0.093*"nlp" + 0.023*"scientist" + 0.023*"data" + 0.023*"father"')]


Observation: with only three short documents, each topic mostly mirrors one document. The approach gives more significant results and insights on large datasets.

## 6. Classify text<br>
The aim of text classification is to automatically assign text documents to predefined categories.<br>
Applications: sentiment analysis, document classification, spam-ham classification, resume shortlisting, document summarization.<br>


In [26]:
# Spam-ham classification using ML



In [13]:

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics

In [55]:
Email_Data = pd.read_csv("./nlp_code/sms-spam-collection-dataset/spam.csv",encoding ='latin1')

In [56]:
Email_Data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [58]:
#rename the column name
Email_Data = Email_Data[ ['v1','v2']]
Email_Data = Email_Data.rename(columns={'v1': "Target", 'v2':"Email"})
Email_Data.head()

| | Target | Email |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [59]:
# Preprocessing: lowercase, stop word removal, stemming, and lemmatization

# Lowercase every word
Email_Data['Email'] = Email_Data['Email'].apply(lambda text: " ".join(word.lower() for word in text.split()))
# Remove English stop words (a set makes the membership test fast)
stop = set(stopwords.words('english'))
Email_Data['Email'] = Email_Data['Email'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
# Stem, then lemmatize, each remaining word
st = PorterStemmer()
Email_Data['Email'] = Email_Data['Email'].apply(lambda text: " ".join(st.stem(word) for word in text.split()))
Email_Data['Email'] = Email_Data['Email'].apply(lambda text: " ".join(Word(word).lemmatize() for word in text.split()))

Email_Data.head()

| | Target | Email |
|---|---|---|
| 0 | ham | go jurong point, crazy.. avail bugi n great wo... |
| 1 | ham | ok lar... joke wif u oni... |
| 2 | spam | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | ham | u dun say earli hor... u c alreadi say... |
| 4 | ham | nah think goe usf, live around though |


In [60]:
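# Split the data into training and validation sets

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Email_Data['Email'], Email_Data['Target'])

# Encode the string labels (ham/spam) as integers

encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# Reuse the mapping learned on the training labels instead of refitting
valid_y = encoder.transform(valid_y)

# TF-IDF feature generation for a maximum of 5000 features
# Note: fitting on the full dataset lets validation text influence the vocabulary
# and IDF weights; fit on train_x instead for a stricter evaluation.
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(Email_Data['Email'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)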

xtrain_tfidf.data

array([0.61796596, 0.60492428, 0.50217993, ..., 0.13902097, 0.37272412,
       0.2759143 ])

In [61]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # Fit the classifier on the training data
    classifier.fit(feature_vector_train, label)
    # Predict labels for the validation data
    predictions = classifier.predict(feature_vector_valid)
    # Score against the validation labels (valid_y is the global defined above)
    return metrics.accuracy_score(valid_y, predictions)

# Naive Bayes training
accuracy = train_model(naive_bayes.MultinomialNB(alpha=0.2), xtrain_tfidf, train_y, xvalid_tfidf)
print("Accuracy: ", accuracy)

Accuracy:  0.9885139985642498


In [62]:
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)

Accuracy:  0.9626704953338119


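Observation: on this split, Naive Bayes (about 0.989) gives better results than the linear classifier, logistic regression (about 0.963).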

## 7. Carrying out Sentiment Analysis<br>
How to understand the sentiment of a particular sentence or statement.<br>
Application: understanding how customers or users feel about a product or service (positive/negative).<br>
Libraries: TextBlob or VADER.<br>
TextBlob returns two metrics.<br>
**Polarity** lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.<br>
**Subjectivity** lies in the range [0, 1]; higher values mean the text is mostly personal opinion rather than factual information.

In [63]:
review = "I like this phone. screen quality and camera clarity is really good."
review2 = "This tv is not good. Bad quality, no clarity, worst experience"

In [64]:
from textblob import TextBlob
# TextBlob's default analyzer (PatternAnalyzer) scores sentiment with a built-in lexicon
blob = TextBlob(review)
blob.sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

In [65]:
blob = TextBlob(review2)
blob.sentiment

Sentiment(polarity=-0.6833333333333332, subjectivity=0.7555555555555555)

## 8. Disambiguating text<br>
Ambiguity arises because the same word can have different meanings in different contexts. Word sense disambiguation picks the meaning that fits the context.<br>
Ref: https://en.wikipedia.org/wiki/Word_sense_disambiguation

In [66]:
Text1 = 'I went to the bank to deposit my money'
Text2 = 'The river bank was full of dead fishes'



In [67]:
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain
from pywsd.lesk import simple_lesk

Warming up PyWSD (takes ~10 secs)... took 8.115720748901367 secs.


In [68]:
bank_sents = ['I went to the bank to deposit my money', 'The river bank was full of dead fishes']

In [69]:
print ("Context-1:", bank_sents[0])
answer = simple_lesk(bank_sents[0],'bank')
print ("Sense:", answer)
print ("Definition : ", answer.definition())
print ("Context-2:", bank_sents[1])
answer = simple_lesk(bank_sents[1],'bank','n')
print ("Sense:", answer)
print ("Definition : ", answer.definition())

Context-1: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition :  a financial institution that accepts deposits and channels the money into lending activities
Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)
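
The idea behind Lesk-style disambiguation is to pick the sense whose description shares the most words with the context. The sketch below is a heavily simplified illustration of that overlap idea: the two sense signatures (a gloss plus one example phrase each), the stop word list, and the function names are hand-written for this example, and pywsd's `simple_lesk` builds much richer signatures from WordNet.

```python
import re

# Hypothetical, hand-written signatures for two senses of "bank"
SENSES = {
    "financial institution": (
        "a financial institution that accepts deposits and channels "
        "the money into lending activities; he cashed a check at the bank"
    ),
    "river side": (
        "sloping land especially the slope beside a body of water; "
        "he sat on the bank of the river"
    ),
}

# Small illustrative stop word list
STOP_WORDS = {"i", "to", "the", "my", "a", "of", "was", "he", "on",
              "at", "and", "that", "into"}


def content_words(text, target):
    """Lowercased words without stop words and without the target word itself."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS and w != target}


def toy_lesk(sentence, target):
    """Return the sense whose signature overlaps most with the sentence."""
    context = content_words(sentence, target)
    scores = {
        sense: len(context & content_words(signature, target))
        for sense, signature in SENSES.items()
    }
    return max(scores, key=scores.get), scores


for sentence in ("I went to the bank to deposit my money",
                 "The river bank was full of dead fishes"):
    print(sentence, "->", toy_lesk(sentence, "bank"))
```

With signatures this small a single shared word ("money" or "river") decides the outcome, which is why practical implementations expand the signatures with related words and examples.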
