# NLP (Natural Language Processing)
* Use cases: Autocomplete, Translation, NER (Named Entity Recognition), Sentiment Analysis, Music composition
* Sequence models handle sequential data better than plain ANNs because they accept inputs and outputs of variable length
* Text Preprocessing: Tokenization, Stop Words, Lemmatization, Bag of Words, TF-IDF, Word Embeddings
* Model building: Bidirectional LSTM RNN, GRU, Encoders and Decoders, Attention Models
* Transformers, BERT

# Text Preprocessing using NLTK

* Tokenization
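    * Splits text into sentences or words (tokens). Each unique token can then be assigned an index.
    * Limitations: naive splitting on dots breaks abbreviations (Ph.D --> Ph, D); splitting on blank spaces breaks multi-word names (San Francisco --> San, Francisco)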
    
* Stop Words
    * Removes words that add little meaning to the sentence (e.g. the, is, in). Makes the model computationally cheaper

* Stemming
    * Strips suffixes (tense, plural, etc.) from a word and keeps the stem, which may not be a real word (e.g. something --> someth). Examples: PorterStemmer, LancasterStemmer
    
* Lemmatization
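    * Stemming is a rule-based best guess at where to cut; lemmatization looks words up in a dictionary (WordNet), so it is more precise and computationally heavier
    * Tenses are resolved when the part of speech is supplied (WordNetLemmatizer treats every word as a noun by default)
    * Use lemmatization when context matters, stemming when speed matters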
    
* Word Embeddings
    * Methods to convert words to numbers:
        * Unique numbers, One hot encoding: context not considered, computationally inefficient (one sparse dimension per vocabulary word)
        * Bag of Words, TF-IDF: count-based document vectors; word order and context are not considered
        * Word2Vec, Embedding Layer: dense learned vectors (word embeddings proper); context considered, computationally efficient
        * These vectorizers give us the feature matrix that becomes x_train and x_test for machine learning model building
        * x_bow, x_tfidf can be split into x_train and x_test
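
The tokenization limitation and the cost of one hot encoding described above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: `naive_tokenize` is a hypothetical helper written for this note, not how NLTK tokenizes, and the sample sentence is made up.

```python
import re

def naive_tokenize(text):
    """Split on anything that is not a letter (hypothetical helper, for illustration only)."""
    return re.findall(r"[A-Za-z]+", text)

tokens = naive_tokenize("She earned a Ph.D in San Francisco")
print(tokens)  # "Ph.D" and "San Francisco" are both broken apart

# Unique numbers: one index per distinct token
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# One hot encoding: one vector per token, as long as the whole vocabulary
one_hot = [
    [1 if i == word_to_index[token] else 0 for i in range(len(vocabulary))]
    for token in tokens
]
print(len(vocabulary), len(one_hot[0]))
```

Every one hot vector grows with the vocabulary size, which is why this encoding becomes inefficient on real corpora.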

In [26]:
import nltk
import re
# # To update nltk libraries
# nltk.download()

In [41]:
text = """Fishers are not very known for fishing in the #canal. 
Something fishy was going on, some of the fish dove into the lake"""

# # Preprocessing text using regex
# text = re.sub('[^a-zA-Z]', ' ', text)
# text = re.sub(r'\[[0-9]*\]',' ',text)
# text = re.sub(r'\s+',' ',text)
# text = text.lower()
# text = re.sub(r'\d',' ',text)
# text = re.sub(r'\s+',' ',text)

# Tokenization
sentences = nltk.tokenize.sent_tokenize(text)
words = nltk.tokenize.word_tokenize(text)

# Stop Words
stop_words = nltk.corpus.stopwords.words('english')
stop_words_list = set(stop_words)
words_without_stop_words = [i for i in words if i not in stop_words_list]

In [42]:
# Stemming
stemmer = nltk.stem.PorterStemmer()
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    words_without_stop_words = [word for word in review if word not in stop_words_list]
    words_stemmed = [stemmer.stem(word) for word in words_without_stop_words ]
    words_stemmed = ' '.join(words_stemmed)
    corpus.append(words_stemmed)

corpus

['fisher known fish canal', 'someth fishi go fish dove lake']

In [43]:
# Lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    words_without_stop_words = [word for word in review if word not in stop_words_list]
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words_without_stop_words]
    words_lemmatized = ' '.join(words_lemmatized)
    corpus.append(words_lemmatized)

corpus

['fisher known fishing canal', 'something fishy going fish dove lake']

In [44]:
# Bag of Words Model
# Requires input as list of string sentences

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
x_bow = cv.fit_transform(corpus).toarray()

features_bow = cv.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
params_bow = cv.get_params()
print(x_bow, features_bow)

[[1 0 0 1 1 0 0 1 0 0]
 [0 1 1 0 0 1 1 0 1 1]] ['canal', 'dove', 'fish', 'fisher', 'fishing', 'fishy', 'going', 'known', 'lake', 'something']
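
To see where the matrix above comes from, the same counts can be rebuilt by hand. This is a minimal sketch that assumes the lemmatized two-sentence corpus from the earlier cell; it skips what `CountVectorizer` does beyond counting (lowercasing, dropping single-character tokens, `max_features`).

```python
corpus = ['fisher known fishing canal', 'something fishy going fish dove lake']

# Vocabulary: every distinct word, sorted alphabetically (one column per word)
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

# One row per sentence; each cell is how often that word occurs in the sentence
x_bow_manual = [
    [sentence.split().count(word) for word in vocabulary]
    for sentence in corpus
]

print(vocabulary)
print(x_bow_manual)
```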


In [45]:
# TF-IDF Model
# Requires input as list of string sentences
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer()
x_tfidf = tfidfv.fit_transform(corpus).toarray()

features_tfidf = tfidfv.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
params_tfidf = tfidfv.get_params()
print(x_tfidf, features_tfidf)

[[0.5        0.         0.         0.5        0.5        0.
  0.         0.5        0.         0.        ]
 [0.         0.40824829 0.40824829 0.         0.         0.40824829
  0.40824829 0.         0.40824829 0.40824829]] ['canal', 'dove', 'fish', 'fisher', 'fishing', 'fishy', 'going', 'known', 'lake', 'something']
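
The values 0.5 and 0.408 above can be checked by hand. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row) and the same lemmatized corpus. Because every word occurs in exactly one sentence, all IDF values are equal, so each row reduces to 1/sqrt(number of words in the sentence).

```python
import math

corpus = ['fisher known fishing canal', 'something fishy going fish dove lake']
docs = [sentence.split() for sentence in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def idf(word):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    document_frequency = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + document_frequency)) + 1

x_tfidf_manual = []
for doc in docs:
    row = [doc.count(word) * idf(word) for word in vocabulary]  # tf * idf
    norm = math.sqrt(sum(value * value for value in row))       # L2 norm of the row
    x_tfidf_manual.append([value / norm for value in row])

print(x_tfidf_manual)
```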


In [46]:
# Word2Vec Model
# Requires a list of token lists, one list of words per sentence

# Lemmatizing
lemmatizer = nltk.stem.WordNetLemmatizer()
corpus_word2vec = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    words_without_stop_words = [word for word in review if word not in stop_words_list]
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words_without_stop_words]
    corpus_word2vec.append(words_lemmatized)

# Word2Vec Model
from gensim.models import Word2Vec
model = Word2Vec(corpus_word2vec, min_count=0)
x_vocabulary = model.wv.key_to_index  # gensim >= 4.0; model.wv.vocab was removed

# Finding Word Vectors
vector = model.wv['fishy']
# Most similar words
similar = model.wv.most_similar('fishy')
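
`most_similar` ranks words by cosine similarity between their vectors. The sketch below shows that ranking on made-up 3-dimensional vectors; `toy_vectors` is hypothetical data for illustration, not output of the model above (Word2Vec vectors have 100 dimensions by default).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, chosen by hand
toy_vectors = {
    'fishy': [1.0, 2.0, 0.0],
    'fish': [2.0, 4.0, 0.0],
    'lake': [0.0, 0.0, 3.0],
}

# Rank every other word by similarity to 'fishy', most similar first
ranked = sorted(
    ((word, cosine_similarity(toy_vectors['fishy'], vector))
     for word, vector in toy_vectors.items() if word != 'fishy'),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```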

# Spam Classifier using Machine Learning
* Import DataFrame, feature engineering
* Make a corpus: regex, lower, split, remove stop words, ' '.join --> corpus
* Test Train Split
* ML Model building: Multinomial Naive Bayes model
* Make predictions

In [341]:
import pandas as pd
import numpy as np
import re
import nltk

df_spam = pd.read_csv('SMSSpamCollection.txt', sep='\t', names=['label', 'message'])

# Stop Words
stop_words = nltk.corpus.stopwords.words('english')
stop_words_list = set(stop_words)

In [342]:
lemmatizer = nltk.stem.WordNetLemmatizer()
corpus_spam = []

for i in range(len(df_spam)):
    review = df_spam['message'][i]
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()    
    words_without_stop_words = [word for word in review if word not in stop_words_list]
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words_without_stop_words]
    words_lemmatized = ' '.join(words_lemmatized)
    corpus_spam.append(words_lemmatized)

In [343]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv_spam = CountVectorizer(max_features=2500)
x_spam = cv_spam.fit_transform(corpus_spam).toarray()

# One hot encode y value and drop a column
y_spam = df_spam.drop('message', axis='columns')
y_spam = pd.get_dummies(y_spam['label'])
y_spam = y_spam['spam']

In [360]:
# Train Test Split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_spam, y_spam, test_size = 0.20)

# Training model using Naive bayes classifier
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(x_train, y_train)
y_predict = spam_detect_model.predict(x_test)
spam_detect_model.score(x_test,y_test)

0.9829596412556054
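
For intuition about what `MultinomialNB` computes, here is a from-scratch sketch on four made-up messages. It assumes Laplace smoothing with `alpha=1.0` (the scikit-learn default); the training data and helper names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
from collections import Counter

# Hypothetical toy training set
train = [
    ("win free prize now", "spam"),
    ("free cash win", "spam"),
    ("lunch meeting today", "ham"),
    ("see you at lunch", "ham"),
]

vocabulary = sorted({word for text, _ in train for word in text.split()})
labels = sorted({label for _, label in train})

# Word counts per class and number of messages per class
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
            score += math.log(smoothed)
    return score

def predict(text):
    return max(labels, key=lambda label: log_posterior(text, label))

prediction_spam = predict("free prize")
prediction_ham = predict("lunch today")
print(prediction_spam, prediction_ham)
```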

# RNN (Recurrent Neural Network)
* A neural network where the hidden state from the previous step is fed as input to the current step
* eg: Zuber likes NLP. word1 = Zuber, word2 = likes, word3 = NLP. 
* Only one RNN layer is used: the same layer (same weights) is reused at every step.
* x<t> is the hidden state vector carried from one step to the next. y(hat) is the predicted output at that instance of the RNN. Every instance also outputs a loss
* Instance 1: x<0> + word1 --> RNN --> x<1> + y(hat) + Loss1
* Instance 2: x<1> + word2 --> RNN --> x<2> + y(hat) + Loss2
* Instance 3: x<2> + word3 --> RNN --> x<3> + y(hat) + Loss3
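
The three instances above can be sketched as a loop that reuses one set of weights. This is a minimal forward pass only, assuming a tanh activation, a hidden state of size 2, one hot word vectors and made-up weights; outputs y(hat) and losses are left out to keep it short.

```python
import math

def rnn_step(hidden_prev, word_vector, w_hidden, w_input, bias):
    """One instance: new hidden state = tanh(W_h * hidden_prev + W_x * word + b)."""
    size = len(hidden_prev)
    return [
        math.tanh(
            sum(w_hidden[i][j] * hidden_prev[j] for j in range(size))
            + sum(w_input[i][k] * word_vector[k] for k in range(len(word_vector)))
            + bias[i]
        )
        for i in range(size)
    ]

# Hypothetical weights, shared by every instance
w_hidden = [[0.5, -0.3], [0.1, 0.8]]
w_input = [[0.2, 0.4, -0.1], [-0.5, 0.3, 0.7]]
bias = [0.0, 0.1]

# One hot vectors for the example sentence
sentence = {'Zuber': [1, 0, 0], 'likes': [0, 1, 0], 'NLP': [0, 0, 1]}

hidden = [0.0, 0.0]  # x<0>: initial hidden state
states = []
for word, vector in sentence.items():
    hidden = rnn_step(hidden, vector, w_hidden, w_input, bias)  # x<t-1> + word --> x<t>
    states.append(hidden)
    print(word, hidden)
```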

###### vanishing gradients and exploding gradients
* w(new) = w(old) - (learning rate)x(gradient)
* gradient = [ d(Loss)/dw ] = d1 x d2 x d3 ... (chain rule across the hidden layers or time steps of the NN)
* Factors are between 0 and 1:
    * Hidden layers increase ==> gradient value diminishes ==> weight updates shrink towards zero ==> backpropagation hardly changes any weights and biases (vanishing gradients)
* Factors are greater than 1:
    * Hidden layers increase ==> gradient value increases drastically ==> weight updates become huge ==> backpropagation changes weights and biases drastically and training becomes unstable (exploding gradients)
* Solutions: GRU (Gated Recurrent Unit) and LSTM (Long Short Term Memory); gradient clipping is commonly used against exploding gradients
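
A quick arithmetic check of the two cases above. The sketch makes the simplifying assumption that every chain-rule factor is identical (0.5 or 1.5); real factors differ per layer, and the learning rate of 0.01 is an arbitrary choice.

```python
def gradient_after(n_layers, factor):
    """Product of n identical chain-rule factors (a simplification)."""
    return factor ** n_layers

learning_rate = 0.01
for n_layers in (5, 20, 50):
    vanishing_update = learning_rate * gradient_after(n_layers, 0.5)
    exploding_update = learning_rate * gradient_after(n_layers, 1.5)
    # The learning rate is unchanged; only the size of the weight update moves
    print(n_layers, vanishing_update, exploding_update)
```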

###### LSTM
* Memory cell (both paths are updated at the same time step):
    * Long term memory: c(t-1) ------------------> c(t)
    * Short term memory: x(t-1) + word --> RNN --> x(t)
    * RNN = [ weighted sum of x(t-1) + word ] + [ Passed to a tanh or sigmoid activation function ]
    * Consists of Forget Gate, Input Gate, Output Gate
    * More accurate on longer sequences, but computationally heavier
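
For reference, one common formulation of the LSTM cell is given below. The notation is an assumption made for this note: $h_t$ is the short term memory (written x(t) above), $c_t$ the long term memory, $w_t$ the vector of the current word, $\sigma$ the sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, w_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, w_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, w_t] + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{long term memory} \\
o_t &= \sigma(W_o [h_{t-1}, w_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{short term memory}
\end{aligned}
```

The long term memory is updated by addition rather than repeated multiplication through an activation, which is what lets gradients survive over longer sequences.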

###### GRU
* Memory cell (updated at every time step):
    * Long and Short term memory share one state: x(t-1) + word --> RNN --> x(t)
    * Consists of Update Gate, Reset Gate
    * Fewer parameters than LSTM, so cheaper to train; newer than LSTM

###### Bidirectional RNN
* Processes the sequence in both directions (forward and backward), so each step sees both past and future context

# Text Preprocessing using Tensorflow
* Tokenization, One Hot Encoding
* Padding sequences to a fixed length using pad_sequences

In [66]:
text = ["""Fishers are not very known for fishing in the #canal. 
Something fishy was going on, some of the fish dove into the lake"""]

# Training a tokenizer on given text data
# Tokenization: Assigns each word an index
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index
print("\nWord Index = " , word_index)

# Convert the original text into its sequence of word indices
sequences = tokenizer.texts_to_sequences(text)
print("\nSequences = " , sequences)

# Pad Sequences (defaults: padding='pre', truncating='pre', so maxlen=5 keeps the last 5 tokens)
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences, maxlen=5)
print("\nPadded Sequences:", padded)

# Try with words that the tokenizer wasn't fit to
test_data = ['i really like my fish','my fishers likes my manatee']
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ", padded)


Word Index =  {'<OOV>': 1, 'the': 2, 'fishers': 3, 'are': 4, 'not': 5, 'very': 6, 'known': 7, 'for': 8, 'fishing': 9, 'in': 10, 'canal': 11, 'something': 12, 'fishy': 13, 'was': 14, 'going': 15, 'on': 16, 'some': 17, 'of': 18, 'fish': 19, 'dove': 20, 'into': 21, 'lake': 22}

Sequences =  [[3, 4, 5, 6, 7, 8, 9, 10, 2, 11, 12, 13, 14, 15, 16, 17, 18, 2, 19, 20, 21, 2, 22]]

Padded Sequences: [[19 20 21  2 22]]

Test Sequence =  [[1, 1, 1, 1, 19], [1, 3, 1, 1, 1]]

Padded Test Sequence:  [[ 0  0  0  0  0  1  1  1  1 19]
 [ 0  0  0  0  0  1  3  1  1  1]]
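
The padded outputs above follow from the `pad_sequences` defaults, which pad and truncate at the front. A minimal pure-Python sketch of that behaviour, where `pad_pre` is a hypothetical helper and not part of Keras:

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the defaults: keep the last maxlen tokens, then left-pad with zeros."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

sequence = [3, 4, 5, 6, 7, 8, 9, 10, 2, 11, 12, 13, 14, 15, 16, 17, 18, 2, 19, 20, 21, 2, 22]

padded_short = pad_pre(sequence, maxlen=5)           # longer than maxlen: front is cut off
padded_test = pad_pre([1, 1, 1, 1, 19], maxlen=10)   # shorter than maxlen: zeros in front
print(padded_short)
print(padded_test)
```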


# Spam Classifier using Deep Learning

* Import DataFrame, feature engineering
* Make a corpus: regex, lower, split, remove stop words, ' '.join --> corpus
* One Hot Encoding or Tokenization
* Test Train Split
* DL Model building, word embedding done using Embedding layer
* Make predictions

In [292]:
import numpy as np
import pandas as pd

import nltk
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout

In [293]:
df_spam_dl = pd.read_csv('SMSSpamCollection.txt', sep='\t', names=['label', 'message'])

# Stop Words
stop_words = nltk.corpus.stopwords.words('english')
stop_words_list = set(stop_words)

In [294]:
lemmatizer = nltk.stem.WordNetLemmatizer()
corpus_spam_dl = []

for i in range(len(df_spam_dl)):
    review = df_spam_dl['message'][i]
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()    
    words_without_stop_words = [word for word in review if word not in stop_words_list]
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words_without_stop_words]
    words_lemmatized = ' '.join(words_lemmatized)
    corpus_spam_dl.append(words_lemmatized)

In [302]:
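# One Hot Encoding: Keras one_hot hashes each word to an integer index below vocabulary_size (collisions are possible)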
vocabulary_size=10000
one_hot_encoder = [one_hot(i,vocabulary_size) for i in corpus_spam_dl] 
# print(one_hot_encoder,'\n')

# Pad Sequences: 'pre' or 'post'
sentence_length = 20
embedded_text = pad_sequences(one_hot_encoder,padding='pre',maxlen=sentence_length)
print(embedded_text, '\n')

[[   0    0    0 ... 6657 8622 2518]
 [   0    0    0 ... 3968 4687 6449]
 [6358 9361 3982 ...  408 5203 5158]
 ...
 [   0    0    0 ... 3480 9301  401]
 [   0    0    0 ... 9844 4687 3868]
 [   0    0    0 ... 7813 9525 6846]] 



In [303]:
# One hot encode y value and drop a column
y_spam_dl = df_spam_dl.drop('message', axis='columns')
y_spam_dl = pd.get_dummies(y_spam_dl['label'])
y_spam_dl = y_spam_dl['spam']

In [306]:
# DL Model Building

embedding_vector_features=40
model = Sequential()
model.add(Embedding(vocabulary_size,embedding_vector_features,input_length=sentence_length))
model.add(Bidirectional(LSTM(100)))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential_17"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 20, 40)            400000    
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 200)               112800    
_________________________________________________________________
dropout_10 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 201       
Total params: 513,001
Trainable params: 513,001
Non-trainable params: 0
_________________________________________________________________
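
The parameter counts in the summary above can be verified by hand. The sketch assumes the standard LSTM layout of four weight blocks (three gates plus the candidate memory), each with input weights, recurrent weights and a bias.

```python
vocabulary_size = 10000
embedding_vector_features = 40
lstm_units = 100

# Embedding: one vector of 40 numbers per vocabulary index
embedding_params = vocabulary_size * embedding_vector_features

# LSTM: 4 blocks, each with (input + recurrent) weights per unit plus one bias per unit
lstm_params_one_direction = 4 * (lstm_units * (embedding_vector_features + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params_one_direction  # forward and backward copies

# Dense: 200 inputs (both directions concatenated) to 1 output, plus 1 bias
dense_params = 2 * lstm_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```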


In [307]:
x_final=np.array(embedded_text)
y_final=np.array(y_spam_dl)

x_final.shape,y_final.shape

((5572, 20), (5572,))

In [335]:
x_train_dl, x_test_dl, y_train_dl, y_test_dl = train_test_split(x_final, y_final, test_size=0.2, random_state=0)
model.fit(x_train_dl,y_train_dl,epochs=1,batch_size=64)



<tensorflow.python.keras.callbacks.History at 0x17319571c70>

In [336]:
y_predict_spam_dl = model.predict(x_test_dl)
y_predict_spam_labels_dl = [0 if i<0.5 else 1 for i in y_predict_spam_dl]
print(confusion_matrix(y_test_dl,y_predict_spam_labels_dl))
print(classification_report(y_test_dl,y_predict_spam_labels_dl))

[[950   5]
 [ 12 148]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       955
           1       0.97      0.93      0.95       160

    accuracy                           0.98      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.98      0.98      0.98      1115
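
The spam row of the report can be recomputed directly from the confusion matrix above. In scikit-learn's layout rows are actual classes and columns are predicted classes, with class 1 being spam here.

```python
confusion = [[950, 5],    # actual ham:  predicted ham, predicted spam
             [12, 148]]   # actual spam: predicted ham, predicted spam
(tn, fp), (fn, tp) = confusion

precision_spam = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall_spam = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision_spam, recall_spam, f1_spam, accuracy)
```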



# Hugging Face Transformers

# Text Generation
* Text prediction loop: ask for an input string, generate a few words, ask for the next input string, ...

In [2]:
# GPT2LMHeadModel is the PyTorch model, hence return_tensors='pt' below
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
sentence = 'YouTube Title: AI learns to'
input_ids = tokenizer.encode(sentence, return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

print(tokenizer.decode(output[0], skip_special_tokens=True))


YouTube Title: AI learns to love you.

Description: The AI has learned to hate you, but it doesn't know what to do with you anymore. It wants to know how you feel about yourself, about your life, and about you


# Text Summarization

In [3]:
from transformers import pipeline
summarizer = pipeline("summarization")
article = """ Insert text """
summarizer(article, max_length=130, min_length=30, do_sample=False)

Your min_length is set to 30, but you input_length is only 6. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)
Your max_length is set to 130, but you input_length is only 6. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


[{'summary_text': '  Insert text: Insert text . Insert text.  Insert the image . Insert the text:  Insert a photo of a scene from the scene. Insert the photo of the scene from a scene .'}]

# Other Use Cases

In [None]:
# # use transformers pipeline
# from transformers import pipeline

# # Sentiment Analysis
# nlp = pipeline('sentiment-analysis')
# nlp('We are very happy to include pipeline into the transformers repository.')


# # Question Answering
# nlp = pipeline('question-answering')
# nlp({
#     'question': 'What is my name ?',
#     'context': 'I work at HuggingFace'
# })

# # Predicting Masks
# nlp = pipeline('fill-mask')
# nlp('I hope you <mask> this video')


# # Named Entity Recognition
# nlp = pipeline('ner')
# nlp('It is me, I work at HuggingFace')