# Advanced Natural Language Processing

## 6. Classifying Text

Text classification aims to automatically assign text documents to
predefined categories.

Applications:
- Sentiment Analysis
- Document classification
- Spam – ham mail classification
- Resume shortlisting
- Document summarization

Download the data from: https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

In [2]:
import pandas as pd
#Read the data
Email_Data = pd.read_csv("spam.csv",encoding ='latin1')

#Data understanding
Email_Data.columns

Email_Data = Email_Data[['v1', 'v2']]
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})

Email_Data.head()


Unnamed: 0,Target,Email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
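# !pip install textblob
import nltk
# Run once to fetch the corpora used below (stop words and WordNet):
# nltk.download('stopwords')
# nltk.download('wordnet')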

In [11]:
# Imports (duplicates removed, grouped by library)
import os
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
import sklearn.feature_extraction.text as text
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from textblob import TextBlob, Word

# Preprocessing: lowercasing, stop-word removal, stemming, and lemmatization

Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(x.lower() for x in x.split()))
stop = stopwords.words('english')
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
st = PorterStemmer()
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
Email_Data['Email'] =Email_Data['Email'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

Email_Data.head()


Unnamed: 0,Target,Email
0,ham,"go jurong point, crazy.. avail bugi n great wo..."
1,ham,ok lar... joke wif u oni...
2,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,u dun say earli hor... u c alreadi say...
4,ham,"nah think goe usf, live around though"


In [12]:
#Splitting data into train and validation

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Email_Data['Email'], Email_Data['Target'])

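# Encode the string labels (ham/spam) as integers
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# Reuse the mapping learned on the training labels instead of refitting it
valid_y = encoder.transform(valid_y)

# TF-IDF feature generation, capped at 5000 features
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# Note: fitting on the full dataset lets validation vocabulary leak into the
# features; fitting on train_x alone would give a stricter evaluation.
tfidf_vect.fit(Email_Data['Email'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)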

xtrain_tfidf.data


array([0.56798178, 0.69120542, 0.4468017 , ..., 0.31474359, 0.2779298 ,
       0.24388794])

In [13]:
xtrain_tfidf.shape

(4179, 5000)
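
To make the TF-IDF weights above less opaque, here is a minimal pure-Python sketch of how one row of weights can be computed. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is scikit-learn's documented default. The toy documents and the helper name `tfidf_row` are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased; tokens are split on whitespace
docs = ["free entry win cup", "ok see you later", "win free cash"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    # Raw weight = term frequency * smoothed inverse document frequency
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_row(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in only one document (such as "entry") end up with a larger weight than terms shared across documents (such as "free").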

In [14]:
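def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    """Fit the classifier and return its accuracy on the validation set.

    Relies on the global valid_y defined in the split cell above;
    is_neural_net is currently unused.
    """
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(valid_y, predictions)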

# Naive Bayes training
accuracy = train_model(naive_bayes.MultinomialNB(alpha=0.2), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)


Accuracy:  0.9899497487437185


In [15]:
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)


Accuracy:  0.9655419956927495


In [16]:
your_own_email = ['.....']  # get the class spam/ham prediction from model >>> HW
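
One possible way to approach this exercise is sketched below. It is a self-contained illustration on a hypothetical four-message toy dataset, not the notebook's trained model: the names `train_texts`, `new_emails`, and `predicted` are assumptions. The key points are that a new message is passed through the already fitted vectorizer with `transform` (never refitted), and that `inverse_transform` maps the integer prediction back to the ham/spam label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy training data standing in for the SMS dataset
train_texts = ["win free prize now", "free entry win cash",
               "see you at lunch", "are you coming home"]
train_labels = ["spam", "spam", "ham", "ham"]

# Encode labels as integers, as in the notebook
encoder = LabelEncoder()
y = encoder.fit_transform(train_labels)

# Fit the vectorizer on the training texts only
vectorizer = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}")
X = vectorizer.fit_transform(train_texts)

model = MultinomialNB(alpha=0.2).fit(X, y)

# New messages: reuse the fitted vectorizer with transform, do not refit
new_emails = ["free cash prize"]
X_new = vectorizer.transform(new_emails)
predicted = encoder.inverse_transform(model.predict(X_new))
print(predicted)
```

With the notebook's own objects, the analogous steps would be to apply the same preprocessing to the new message, call `tfidf_vect.transform(...)`, and predict with a fitted classifier. Note that `train_model` returns only the accuracy, so the fitted classifier would need to be kept in a variable first.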

## 7. Carrying Out Sentiment Analysis

TextBlob reports two sentiment metrics:
- `Polarity` lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
- `Subjectivity` lies in the range [0, 1], where 0 means factual (objective) information and 1 means personal opinion.

In [23]:
review = "I like this phone. screen quality and camera clarity is really good."
review2 = "This tv is not good. Bad quality, no clarity, worst experience"

#import libraries
from textblob import TextBlob

# TextBlob ships with a built-in sentiment analyzer
blob = TextBlob(review)
blob.sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

In [24]:
# Now let's look at the sentiment of review2
blob = TextBlob(review2)
blob.sentiment

Sentiment(polarity=-0.6833333333333332, subjectivity=0.7555555555555555)
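
A common next step is to turn the polarity score into a coarse label. The helper below is a hypothetical sketch: the function name `polarity_to_label` and the 0.1 neutral band are assumptions chosen for illustration and are not part of TextBlob.

```python
def polarity_to_label(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    # Scores close to zero are treated as neutral (assumed threshold)
    return "neutral"

# Scores similar to the two reviews above, plus a near-zero case
for score in (0.7, -0.68, 0.05):
    print(score, "->", polarity_to_label(score))
```

In the notebook, the score to pass in would be `blob.sentiment.polarity`.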

## 8. Disambiguating Text

Ambiguity arises when the same word carries different meanings in different contexts. Word sense disambiguation picks the intended meaning from the surrounding words.
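
The idea behind Lesk-style disambiguation is to pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below is a deliberately simplified illustration, assuming a hypothetical two-sense inventory with hand-written glosses and a tiny stop-word list. It is not pywsd's implementation and does not use WordNet data; `toy_lesk` and `SENSES` are made-up names.

```python
# Hypothetical, hand-written sense inventory (not WordNet data)
SENSES = {
    "bank (financial institution)":
        "a financial institution that accepts deposits and channels the money into lending",
    "bank (river side)":
        "sloping land beside a body of water such as a river or lake",
}

# Tiny illustrative stop-word list (assumption, not NLTK's list)
STOPWORDS = {"a", "an", "the", "of", "to", "my", "i", "was",
             "and", "that", "or", "as", "such", "into"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def toy_lesk(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(toy_lesk("I went to the bank to deposit my money"))
print(toy_lesk("The river bank was full of dead fishes"))
```

Plain word overlap is brittle: "deposit" does not match "deposits" here, which is one reason practical implementations enrich the comparison with stemming and related words.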

In [25]:
Text1 = 'I went to the bank to deposit my money'
Text2 = 'The river bank was full of dead fishes'

#Install pywsd
# !pip install pywsd
# !pip install -U wn==0.0.22

In [26]:
from nltk.corpus import wordnet as wn

In [28]:
#Import functions

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain
from pywsd.lesk import simple_lesk

# Sentences

bank_sents = ['I went to the bank to deposit my money',
'The river bank was full of dead fishes']

# calling the lesk function and printing results for both the sentences

Warming up PyWSD (takes ~10 secs)... took 2.521153450012207 secs.


In [30]:
print ("Context-1:", bank_sents[0])
answer = simple_lesk(bank_sents[0],'bank')
print ("Sense:", answer)
print ("Definition : ", answer.definition())


print ("\nContext-2:", bank_sents[1])
answer = simple_lesk(bank_sents[1],'bank','n')

print ("Sense:", answer)
print ("Definition : ", answer.definition())

Context-1: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition :  a financial institution that accepts deposits and channels the money into lending activities

Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)


In [32]:
# Sentences

book_sents = ['Last night I was reading a book on Black Holes',
'Please book my flight for tomoroow to New Delhi']

# calling the lesk function and printing results for both the sentences

print ("Context-1:", book_sents[0])
answer = simple_lesk(book_sents[0],'book')
print ("Sense:", answer)
print ("Definition : ", answer.definition())


print ("\nContext-2:", book_sents[1])
# 'v' restricts the candidate senses to verbs
answer = simple_lesk(book_sents[1],'book','v')
print ("Sense:", answer)
print ("Definition : ", answer.definition())

Context-1: Last night I was reading a book on Black Holes
Sense: Synset('script.n.01')
Definition :  a written version of a play or other dramatic composition; used in preparing for a performance

Context-2: Please book my flight for tomoroow to New Delhi
Sense: Synset('reserve.v.04')
Definition :  arrange for and reserve (something for someone else) in advance


## 9. Converting Speech to Text

In [51]:
# !pip install SpeechRecognition
# !pip install PyAudio



In [35]:
# !pip install converter

In [36]:
# !pip install moviepy

In [15]:
# import moviepy.editor
# Replace the parameter with the location of the video
# video = moviepy.editor.VideoFileClip("test.mp4")
# audio = video.audio
# Replace the parameter with the location along with filename
# audio.write_audiofile("test.wav")

In [44]:
import speech_recognition as sr
# from converter import Converter

# c = Converter()
# with open("test.mp4", "wb") as handle:
#     for data in r.iter_content():
#         handle.write(data)

# conv = c.convert('test.mp4', 'test.wav', {
#     'format': 'wav',
#     'audio': {
#     'codec': 'pcm',
#     'samplerate': 44100,
#     'channels': 2
#     },
# })

# for timecode in conv:
#     pass


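r = sr.Recognizer()

# Capture audio from the default microphone (requires PyAudio)
with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")

# Alternative: transcribe a WAV file instead of the microphone
# with sr.AudioFile('test.wav') as source:
#     audio = r.record(source)

try:
    print("I think you said: " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))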

AttributeError: Could not find PyAudio; check installation

In [38]:

# Same flow, asking Google to transcribe Hindi speech (language='hi-IN')
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")

try:
    print("I think you said: " + r.recognize_google(audio, language='hi-IN'))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

Also see:
- https://realpython.com/python-speech-recognition/
- https://www.tutorialspoint.com/artificial_intelligence_with_python/artificial_intelligence_with_python_speech_recognition.htm


## 10. Converting Text to Speech

In [20]:
# !pip install gTTS

from gtts import gTTS

# Choose the language: English ('en')

myobj = gTTS(text='I like this Natural Language Processing course', lang='en', slow=False)

# Save the converted audio to an MP3 file named audio1.mp3
myobj.save(r"audio1.mp3")


## 11. Translating Speech

In [49]:
# !pip install goslate

In [50]:
import goslate

text = "Bonjour le monde" 
gs = goslate.Goslate()
translatedText = gs.translate(text,'en')

print(translatedText)

Hi world


In [51]:
text = "Weisheit für Morgen" 
gs = goslate.Goslate()
translatedText = gs.translate(text,'hi-IN')

print(translatedText)

कल के लिए ज्ञान


In [31]:
text = """do they also use google at backend?
there are companies these days who work on NLP but do they have custom code or use google only at backend?
"""

In [29]:
text = '''Former Secretary of State Mike Pompeo on Saturday said that the Wuhan Institute of Virology (WIV) was engaged in military activity alongside its civilian research'''