# NLP Major Assignment

### Machine Translation

Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language. 

#### using EasyNMT - Easy to use, state-of-the-art Neural Machine Translation
- https://pythonrepo.com/repo/UKPLab-EasyNMT

In [1]:
from easynmt import EasyNMT
model = EasyNMT('opus-mt')

100%|██████████| 11.9k/11.9k [00:00<00:00, 13.0MB/s]


In [2]:
#Translate a single sentence to Hindi
print(model.translate('This is a sentence we want to translate to German', target_lang='hi'))

#Translate several sentences to German
sentences = ['You can define a list with sentences.',
             'All sentences are translated to your target language.',
             'Note, you could also mix the languages of the sentences.']
print(model.translate(sentences, target_lang='de'))

100%|██████████| 938k/938k [00:01<00:00, 697kB/s]  
Downloading: 100%|██████████| 793k/793k [00:01<00:00, 565kB/s] 
Downloading: 100%|██████████| 1.02M/1.02M [00:04<00:00, 242kB/s] 
Downloading: 100%|██████████| 2.00M/2.00M [00:06<00:00, 323kB/s]
Downloading: 100%|██████████| 44.0/44.0 [00:00<00:00, 34.0kB/s]
Downloading: 100%|██████████| 1.12k/1.12k [00:00<00:00, 695kB/s]
Downloading: 100%|██████████| 292M/292M [02:16<00:00, 2.25MB/s] 


यह एक वाक्य है हम जर्मन के लिए अनुवाद करना चाहते हैं


Downloading: 100%|██████████| 750k/750k [00:03<00:00, 225kB/s]  
Downloading: 100%|██████████| 778k/778k [00:03<00:00, 218kB/s]  
Downloading: 100%|██████████| 1.21M/1.21M [00:08<00:00, 150kB/s] 
Downloading: 100%|██████████| 42.0/42.0 [00:00<00:00, 28.7kB/s]
Downloading: 100%|██████████| 1.30k/1.30k [00:00<00:00, 1.24MB/s]
Downloading: 100%|██████████| 284M/284M [01:44<00:00, 2.85MB/s] 


['Sie können eine Liste mit Sätzen definieren.', 'Alle Sätze werden in Ihre Zielsprache übersetzt.', 'Beachten Sie, Sie können auch die Sprachen der Sätze mischen.']


In [8]:
print(model.translate("My name is Indra",target_lang='hi'))

मेरा नाम एन्मर है


### Information Retrieval


In [9]:
# importing the librtaries
import numpy as np
import nltk
import re
import gensim
from gensim.parsing.preprocessing import remove_stopwords
from gensim import corpora
from sklearn.feature_extraction.text import TfidfVectorizer 
import heapq
# text from wikipedia about Elon Musk
txt = "Elon Reeve Musk FRS (/ˈiːlɒn/ EE-lon; born June 28, 1971) is an entrepreneur and business magnate. He is the founder, CEO, and Chief Engineer at SpaceX; early stage investor,[note 1] CEO, and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. A centibillionaire, Musk is one of the richest people in the world.Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received bachelor's degrees in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding the web software company Zip2 with his brother Kimbal. The startup was acquired by Compaq for $307 million in 1999. Musk co-founded online bank X.com that same year, which merged with Confinity in 2000 to form PayPal. The company was bought by eBay in 2002 for $1.5 billion.In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he is CEO and CTO. In 2004, he joined electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.) as chairman and product architect, becoming its CEO in 2008. In 2006, he helped create SolarCity, a solar energy services company that was later acquired by Tesla and became Tesla Energy. In 2015, he co-founded OpenAI, a nonprofit research company that promotes friendly artificial intelligence. In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces, and founded The Boring Company, a tunnel construction company. Musk has proposed the Hyperloop, a high-speed vactrain transportation system.Musk has been the subject of criticism due to unorthodox or unscientific stances and highly publicized controversies. In 2018, he was sued for defamation by a diver who advised in the Tham Luang cave rescue; a California jury ruled in favor of Musk. In the same year, he was sued by the US Securities and Exchange Commission (SEC) for falsely tweeting that he had secured funding for a private takeover of Tesla. He settled with the SEC, temporarily stepping down from his chairmanship and accepting limitations on his Twitter usage. Musk has spread misinformation about the COVID-19 pandemic and has received criticism from experts for his other views on such matters as artificial intelligence and public transport."

#class for preprocessing and creating word embedding
class Preprocessing:
    #constructor
    def __init__(self,txt):
        # Tokenization
        nltk.download('punkt')  #punkt is nltk tokenizer 
        # breaking text to sentences
        tokens = nltk.sent_tokenize(txt) 
        self.tokens = tokens
        self.tfidfvectoriser=TfidfVectorizer()

    # Data Cleaning
    # remove extra spaces
    # convert sentences to lower case 
    # remove stopword
    def clean_sentence(self, sentence, stopwords=False):
        sentence = sentence.lower().strip()
        sentence = re.sub(r'[^a-z0-9\s]', '', sentence)
        if stopwords:
          sentence = remove_stopwords(sentence)
        return sentence

    # store cleaned sentences to cleaned_sentences
    def get_cleaned_sentences(self,tokens, stopwords=False):
        cleaned_sentences = []
        for line in tokens:
          cleaned = self.clean_sentence(line, stopwords)
          cleaned_sentences.append(cleaned)
        return cleaned_sentences

    #do all the cleaning
    def cleanall(self):
        cleaned_sentences = self.get_cleaned_sentences(self.tokens, stopwords=True)
        cleaned_sentences_with_stopwords = self.get_cleaned_sentences(self.tokens, stopwords=False)
        # print(cleaned_sentences)
        # print(cleaned_sentences_with_stopwords)
        return [cleaned_sentences,cleaned_sentences_with_stopwords]

    # TF-IDF Vectorizer
    def TFIDF(self,cleaned_sentences):
        self.tfidfvectoriser.fit(cleaned_sentences)
        tfidf_vectors=self.tfidfvectoriser.transform(cleaned_sentences)
        return tfidf_vectors

    #tfidf for question
    def TFIDF_Q(self,question_to_be_cleaned):
        tfidf_vectors=self.tfidfvectoriser.transform([question_to_be_cleaned])
        return tfidf_vectors

    # main call function
    def doall(self):
        cleaned_sentences, cleaned_sentences_with_stopwords = self.cleanall()
        tfidf = self.TFIDF(cleaned_sentences)
        return [cleaned_sentences,cleaned_sentences_with_stopwords,tfidf]

#class for answering the question.
class AnswerMe:
    #cosine similarity
    def Cosine(self, question_vector, sentence_vector):
        dot_product = np.dot(question_vector, sentence_vector.T)
        denominator = (np.linalg.norm(question_vector) * np.linalg.norm(sentence_vector))
        return dot_product/denominator
    
    #Euclidean distance
    def Euclidean(self, question_vector, sentence_vector):
        vec1 = question_vector.copy()
        vec2 = sentence_vector.copy()
        if len(vec1)<len(vec2): vec1,vec2 = vec2,vec1
        vec2 = np.resize(vec2,(vec1.shape[0],vec1.shape[1]))
        return np.linalg.norm(vec1-vec2)

    # main call function
    def answer(self, question_vector, sentence_vector, method):
        if method==1: return self.Euclidean(question_vector,sentence_vector)
        else: return self.Cosine(question_vector,sentence_vector)


def RetrieveAnswer(question_embedding, tfidf_vectors,method=1):
  similarity_heap = []
  if method==1: max_similarity = float('inf')
  else: max_similarity = -1
  index_similarity = -1

  for index, embedding in enumerate(tfidf_vectors):  
    find_similarity = AnswerMe()
    similarity = find_similarity.answer((question_embedding).toarray(),(embedding).toarray() , method).mean()
    if method==1:
      heapq.heappush(similarity_heap,(similarity,index))
    else:
      heapq.heappush(similarity_heap,(-similarity,index))
  return similarity_heap


# Put Your question here
user_question = "musk born june 28 1971 is an entrepreneur and"
#define method
method = 2

preprocess = Preprocessing(txt)
cleaned_sentences,cleaned_sentences_with_stopwords,tfidf_vectors = preprocess.doall()

question = preprocess.clean_sentence(user_question, stopwords=True)
question_embedding = preprocess.TFIDF_Q(question)

similarity_heap = RetrieveAnswer(question_embedding , tfidf_vectors ,method)
print("Question: ", user_question)

number_of_sentences_to_print = 2
while number_of_sentences_to_print>0 and len(similarity_heap)>0:
  x = similarity_heap.pop(0)
  print(cleaned_sentences_with_stopwords[x[1]])
  number_of_sentences_to_print-=1

Question:  musk born june 28 1971 is an entrepreneur and
elon reeve musk frs iln eelon born june 28 1971 is an entrepreneur and business magnate
musk cofounded online bank xcom that same year which merged with confinity in 2000 to form paypal


[nltk_data] Downloading package punkt to /home/indrap24/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Sentiment Analysis

In [10]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [11]:
#load dataset
df = pd.read_csv('./Movie Rating Prdiction/Train/Train.csv')

In [12]:
df.head(20)

Unnamed: 0,review,label
0,mature intelligent and highly charged melodram...,pos
1,http://video.google.com/videoplay?docid=211772...,pos
2,Title: Opera (1987) Director: Dario Argento Ca...,pos
3,I think a lot of people just wrote this off as...,pos
4,This is a story of two dogs and a cat looking ...,pos
5,Steve Carell comes into his own in his first s...,pos
6,I'm only going to write more because it's requ...,neg
7,"OK, it was a ""risky"" move to rent this flick, ...",neg
8,"Cannibalism, a pair of cinematic references to...",pos
9,This is one of the great modern kung fu films....,pos


In [13]:
dfx = df.iloc[:,0].values
dfy = df.iloc[:,1].values
temp = [x for x in dfy]
y_train = [1 if x == 'pos' else 0 for x in temp]

In [14]:
# Cleanin Training data
sw = stopwords.words('english')
sw.remove('not')
sw = set(sw)
ps = PorterStemmer()

def cleaning_pipeline(review):
    words = word_tokenize(review.lower())
    words = [ps.stem(word) for word in words if word not in sw and word.isalpha()]
    review = " ".join(words)
    return review

cleaned_reviews = [ cleaning_pipeline(review) for review in dfx]

In [15]:
# Loading and cleaning testing data
dftest = pd.read_csv('./Movie Rating Prdiction/Test/Test.csv').values
test_reviews = dftest.reshape((-1,))
cleaned_test_rev = [cleaning_pipeline(review) for review in test_reviews]

In [16]:
## Vectorization
cv = CountVectorizer(ngram_range=(1,3))
x_train_vect = cv.fit_transform(cleaned_reviews)
x_test_vect = cv.transform(cleaned_test_rev)

In [17]:
# Multinomial NB
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()

In [18]:
#Trainig
mnb.fit(x_train_vect,y_train)

MultinomialNB()

In [19]:
#Prediction
prediction = mnb.predict(x_test_vect)

In [20]:
#score
mnb.score(x_train_vect,y_train)

0.99975

### Text Summarization

In [21]:
# importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Input text - to summarize
text = """There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. The angle will be 0 if sentences are similar."""

# Tokenizing the text
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Creating a frequency table to keep the
# score of each word

freqTable = dict()
for word in words:
	word = word.lower()
	if word in stopWords:
		continue
	if word in freqTable:
		freqTable[word] += 1
	else:
		freqTable[word] = 1

# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
	for word, freq in freqTable.items():
		if word in sentence.lower():
			if sentence in sentenceValue:
				sentenceValue[sentence] += freq
			else:
				sentenceValue[sentence] = freq



sumValues = 0
for sentence in sentenceValue:
	sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text

average = int(sumValues / len(sentenceValue))

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
	if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
		summary += " " + sentence
print(summary)


 There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.


### Spam Filtering

Automated Spam E-mail Detection Using common NLP tasks

In [23]:
import wordcloud
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/indrap24/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/indrap24/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/indrap24/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/indrap24/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [24]:
df = pd.read_csv("messages.csv",encoding='latin-1')
df.head()

Unnamed: 0,subject,message,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2893 entries, 0 to 2892
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   subject  2831 non-null   object
 1   message  2893 non-null   object
 2   label    2893 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 67.9+ KB


In [28]:
df['label'].value_counts()

0    2412
1     481
Name: label, dtype: int64

Here, we see that 1 stand for Spam mail and 0 stand for not a spam mail.

In [29]:
print("Not a Spam Email Ratio i.e. 0 label:",round(len(df[df['label']==0])/len(df['label']),2)*100,"%")
print("Spam Email Ratio that is 1 label:",round(len(df[df['label']==1])/len(df['label']),2)*100,"%")

Not a Spam Email Ratio i.e. 0 label: 83.0 %
Spam Email Ratio that is 1 label: 17.0 %


In [30]:
df['length'] = df.message.str.len()
df.head()

Unnamed: 0,subject,message,label,length
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0,2856
1,,"lang classification grimes , joseph e . and ba...",0,1800
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0,1435
3,risk,a colleague and i are researching the differin...,0,324
4,request book information,earlier this morning i was on the phone with a...,0,1046


In [31]:
import string
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])

df['message'] = df['message'].apply(lambda x: " ".join(term for term in x.split() if term not in stop_words))

In [44]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

In [35]:
tf_vec = TfidfVectorizer()
#naive = MultinomialNB()
SVM = SVC(C=1.0, kernel='linear', degree=3 , gamma='auto')
features = tf_vec.fit_transform(df['message'])
X = features
y = df['label']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [42]:
y_pred = SVM.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, y_pred)

0.987434554973822

In [46]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       773
           1       0.99      0.94      0.97       182

    accuracy                           0.99       955
   macro avg       0.99      0.97      0.98       955
weighted avg       0.99      0.99      0.99       955



### Speech Recognition

Speech is the most common means of communication and the majority of the population in the world relies on speech to communicate with one another. Speech recognition system basically translates spoken languages into text. There are various real-life examples of speech recognition systems. For example, Apple SIRI which recognize the speech and truncates into text.


Hidden Markov Model (HMM), deep neural network models are used to convert the audio into text.  This can be done with the help of the `Speech Recognition API`(*Google speech recognition API*) and `PyAudio` library.


Steps:
- Import Speech recognition library
- Initializing recognizer class in order to recognize the speech. We are using google speech recognition.
- Audio file supports by speech recognition: wav, AIFF, AIFF-C, FLAC. I used `wav` file in this example
- I have used a sample audio clip which says “**Hello, my name is Shobhit.**”
- By default, google recognizer reads English. It supports different languages.

#### From Audio.wav file

In [47]:
#import library
import speech_recognition as sr

# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Reading Audio file as source
# listening the audio file and store in audio_text variable

with sr.AudioFile('./sampleAudio.wav') as source:
    
    audio_text = r.listen(source)
    
# recoginize_() method will throw a request error if the API is unreachable, hence using exception handling
    try:
        
        # using google speech recognition
        text = r.recognize_google(audio_text)
        print('Converting audio transcripts into text ...')
        print(text)
     
    except:
         print('Sorry.. run again...')

Converting audio transcripts into text ...
hello my name is


As you can see, the API cannot recognize the name **Shobhit** since its not an English name, but can recognize the other words.

### Question Answering Pipeline

QA systems can be described as a technology that provides the right short answer to a question rather than giving a list of possible answers. In this scenario, QA systems are designed to be alert to text similarity and answer questions that are asked in natural language.

The Transformer architecture which has become a state-of-the-art approach in text based models since 2017, many Machine Learning tasks involving language can now be performed with unprecedented results. Question answering is one such task for which Machine Learning can be used. Here, we will build a Question Answering model pipeline in a really easy way using the HuggingFace Transformers library. This library is becoming increasingly important for democratizing Transformer based approaches in Machine Learning and allows people to use Transformers out-of-the-box.

In [49]:
from transformers import pipeline

In [50]:
# Open and read the article
question = "What is the capital of the Netherlands?"
context = r"The four largest cities in the Netherlands are Amsterdam, Rotterdam, The Hague and Utrecht.[17] Amsterdam is the country's most populous city and nominal capital,[18] while The Hague holds the seat of the States General, Cabinet and Supreme Court.[19] The Port of Rotterdam is the busiest seaport in Europe, and the busiest in any country outside East Asia and Southeast Asia, behind only China and Singapore."

# Generating an answer to the question in context
qa = pipeline("question-answering")
answer = qa(question=question, context=context)

# Print the answer
print(f"Question: {question}")
print(f"Answer: '{answer['answer']}' with score {answer['score']}")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)
Downloading: 100%|██████████| 473/473 [00:00<00:00, 92.0kB/s]
Downloading: 100%|██████████| 249M/249M [01:30<00:00, 2.88MB/s] 
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 13.6kB/s]
Downloading: 100%|██████████| 208k/208k [00:01<00:00, 158kB/s]  
Downloading: 100%|██████████| 426k/426k [00:02<00:00, 176kB/s]  


Question: What is the capital of the Netherlands?
Answer: 'Amsterdam' with score 0.37749919295310974
