# Natural Language Processing (Compiled by Pawan Neupane)

## Topics covered
*** 
#### 1) Tokenization 
#### 2) Stemming
#### 3) Lemmatization 
#### 4) Stop Words
#### 5) Bag of words
#### 6) TFIDF
#### 7) Ngrams
#### 8) Word2Vec
#### 9) NER
#### 10) Text classification example
#### 11) RNN
#### 12) LSTM
#### 13) Transformers
#### 14) BERT
#### 15) GPT








## 1) Tokenization
Splitting a larger body of text into small, manageable pieces called tokens (words, punctuation marks, or subwords).


 <img src="images/tokenization.png" style= "width:500px">

### Tokenization using spaCy

In [1]:
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm') # load the small English pipeline for convenience



In [3]:
mytext = "I live in Kathmandu NP"

In [4]:
doc = nlp(mytext)

for d in doc:
    print(d)
    
type(doc)

I
live
in
Kathmandu
NP


spacy.tokens.doc.Doc

### Tokenization using Hugging Face

In [11]:
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [12]:
mytext = "Hello! How are you?"

tokenizer.tokenize(mytext)

['Hello', '!', 'How', 'are', 'you', '?']
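
Both libraries hide their splitting rules. As a rough illustration only, the sketch below splits text into word and punctuation tokens with a single regular expression. The function name `simple_tokenize` and the pattern are assumptions made for this example; they are not how spaCy or the BERT tokenizer work internally, and they ignore contractions, subwords, and other cases that real tokenizers handle.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I live in Kathmandu NP"))
print(simple_tokenize("Hello! How are you?"))
```

For the two sentences used above, this simple rule happens to produce the same tokens as the library calls.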

## 2) Stemming
Reducing a word to its stem by chopping off suffixes such as -ing, -ly, and -ies. The result does not have to be a dictionary word (for example, easily => easili).


 <img src="images/stemming.png" style= "width:500px">

In [128]:
from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')

In [129]:
words = ['go','going','runs','ran','running','easily','fairly']

for word in words:
    print(word+' => '+s_stemmer.stem(word))


go => go
going => go
runs => run
ran => ran
running => run
easily => easili
fairly => fair
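
The idea of suffix stripping can be sketched in a few lines. This is a toy under stated assumptions, not the Snowball algorithm: the rule list `SUFFIX_RULES` and the function `toy_stem` are hypothetical, and the results differ from the output above for some words (for instance, it turns easily into easi rather than easili).

```python
# Hypothetical (suffix, replacement) rules, tried in order.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ly", ""), ("s", ""))

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: len(word) - len(suffix)]
            # After removing -ing, undo a doubled final consonant (runn -> run)
            if (suffix == "ing" and len(stem) >= 3
                    and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem + replacement
    return word

for word in ["go", "going", "runs", "ran", "running", "easily", "fairly"]:
    print(word + " => " + toy_stem(word))
```

Like a real stemmer, the sketch leaves the irregular form ran untouched, because no suffix rule applies to it.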


## 3) Lemmatization

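More advanced than stemming: it maps each word to its dictionary form (lemma) using vocabulary and part of speech, so it can handle irregular forms and tenses, like saw -> see and mice -> mouse.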

 <img src="images/lemmatization.png" style= "width:500px">

In [98]:
def show_lemmas(text):
    print("Word         POS    Token Lemma Hash       Root Lemma")
    print("\n")
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [183]:
doc2 = nlp(u"I saw eighteen mice today!")

show_lemmas(doc2)

Word         POS    Token Lemma Hash       Root Lemma


I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !
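
To see why lemmatization needs more than suffix rules, consider a minimal lookup sketch. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and cover only a few words; this is not how spaCy implements its lemmatizer. The point is that the same surface form (saw) can have different lemmas depending on its part of speech.

```python
# Hypothetical lookup table keyed by (lowercased word, part of speech).
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("mice", "NOUN"): "mouse",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if the table knows it, else the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word, pos in [("saw", "VERB"), ("saw", "NOUN"), ("mice", "NOUN"), ("today", "NOUN")]:
    print(f"{word:8} {pos:6} {toy_lemmatize(word, pos)}")
```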


## 4) Stop words
Very common words (such as the, is, and to) that carry little meaning on their own and are often removed before further processing.

In [185]:
#View default stopwords in spacy
print(len(nlp.Defaults.stop_words))
print(list(nlp.Defaults.stop_words)[:100]) #changing set to list

326
['between', 'someone', 'would', 'through', "'ll", 'anyone', 'whereas', 'his', 'ours', 'moreover', 'until', '’s', 'thereby', 'besides', 'just', 'toward', '’ll', 'always', 'mine', "'s", 'down', '’m', 'wherein', 'as', 'almost', 'will', 'give', 'back', 'seemed', 'former', 'n‘t', 'were', 'thereafter', 'what', 'further', 'nobody', 'herself', 'was', 'across', 'below', '‘re', 'become', 'from', 'if', 'beyond', 'whereupon', 'without', 'nothing', 'any', 'less', 'show', 'n’t', 'other', 'eight', 'using', 'himself', '’re', 'each', 'could', 'formerly', 'why', 'nevertheless', 'eleven', 'four', 'otherwise', 'or', 'now', 'hereupon', 'has', 'bottom', 'whereafter', 'thereupon', 'whereby', 'into', 'more', 'hereby', 'much', 'via', 'somehow', 'indeed', '’ve', 'made', 'out', 'something', 'you', 'doing', 'everyone', 'also', 'else', 'how', 'towards', 'due', 'done', 'next', 'though', 'whose', 'every', 'them', 'fifteen', 'but']


In [146]:
nlp.vocab['kathmandu'].is_stop #check if Kathmandu is a stop word


False

### Stopwords removal

In [215]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Rabin likes to play cricket, however he is not into football."
text_tokens = word_tokenize(text)

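# Note: stopwords.words() with no argument returns the stop words of every language NLTK ships;
# call stopwords.words('english') to restrict filtering to English (the output may then differ).
all_stopwords = set(stopwords.words())  # build the set once instead of once per token
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]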

print(tokens_without_sw)

['Rabin', 'likes', 'play', 'cricket', ',', 'football', '.']


In [217]:
#stopwords.words('english')

## 5) Bag of Words
Represents each document as a collection of its words with their respective counts, ignoring word order.

 <img src="images/bagofwords.png" style= "width:500px">

In [192]:
from sklearn.feature_extraction.text import CountVectorizer
 
doc_list = ["the Brown fox jumps over the mountain",
            "the mountain is over a river",
            "River runs down the mountain"]

vectorizer = CountVectorizer() #creating an instance of countvectorizer

vectorizer.fit(doc_list)
vector = vectorizer.transform(doc_list)
lis = sorted(vectorizer.vocabulary_)
print(lis)


# print(vector.toarray())
list_vector = vector.toarray()

for i in list_vector:
    for j in i:
        print(j, end='\t      ')
    print("\n")

['brown', 'down', 'fox', 'is', 'jumps', 'mountain', 'over', 'river', 'runs', 'the']
1	      0	      1	      0	      1	      1	      1	      0	      0	      2	      

0	      0	      0	      1	      0	      1	      1	      1	      0	      1	      

0	      1	      0	      0	      0	      1	      0	      1	      1	      1	      
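
The same counts can be reproduced without scikit-learn, which makes the representation easier to inspect. The sketch below assumes that lowercasing and keeping only tokens of two or more word characters approximates the `CountVectorizer` defaults; the helper name `tokenize` is made up for this example.

```python
import re
from collections import Counter

doc_list = ["the Brown fox jumps over the mountain",
            "the mountain is over a river",
            "River runs down the mountain"]

def tokenize(text):
    # Lowercase and keep tokens with at least two word characters (drops "a")
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens seen in any document
vocabulary = sorted({token for doc in doc_list for token in tokenize(doc)})

# Each document becomes a list of counts, one entry per vocabulary word
vectors = []
for doc in doc_list:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```

Each row lines up with the vocabulary, so the final 2 in the first row is the count of "the" in the first sentence.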



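## 6) TF-IDF: Term Frequency-Inverse Document Frequency

Weighs each word by how often it occurs in a document (term frequency) and discounts words that occur in many documents of the collection (inverse document frequency), instead of counting each document in isolation. It is generally more informative than plain Bag of Words (CountVectorizer).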

 <img src="images/tfidf.png" style= "width:500px">

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
 
doc_list = ["the Brown fox jumps over the mountain",
            "the mountain is over a river",
            "River runs down the mountain"]

vectorizer = TfidfVectorizer() #creating an instance of TfidfVectorizer

vectorizer.fit(doc_list)
vector = vectorizer.transform(doc_list)
lis = sorted(vectorizer.vocabulary_)
print(lis)


# print(vector.toarray())
list_vector = vector.toarray()

for i in list_vector:
    for j in i:
        print(round(j,8), end=' ')
    print("\n")

['brown', 'down', 'fox', 'is', 'jumps', 'mountain', 'over', 'river', 'runs', 'the']
0.43345167 0.0 0.43345167 0.0 0.43345167 0.25600354 0.32965117 0.0 0.0 0.51200708 

0.0 0.0 0.0 0.59188659 0.0 0.34957775 0.45014501 0.45014501 0.0 0.34957775 

0.0 0.55249005 0.0 0.0 0.0 0.32630952 0.0 0.42018292 0.55249005 0.32630952 
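
The weights above can be recomputed by hand, which shows where the numbers come from. This is a sketch that assumes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. These appear to be the `TfidfVectorizer` defaults, and the values they produce agree with the output above. The helper `tfidf_row` is hypothetical.

```python
import math
import re
from collections import Counter

docs = ["the Brown fox jumps over the mountain",
        "the mountain is over a river",
        "River runs down the mountain"]

tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Term count times idf, scaled so the row has unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print([round(v, 8) for v in row])
```

Words that appear in every document ("the", "mountain") get the lowest idf, so a rare word such as "brown" outweighs "mountain" even though both occur once in the first sentence.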



## 7) N-Grams

Grouping consecutive words into sequences of n words: 1-grams, 2-grams, 4-grams, and so on, according to the requirements. For example, according to a study, spam detection works best with 4-grams.

 <img src="images/ngrams.png" style= "width:500px">

In [233]:
import re
from nltk.util import ngrams

s = "the Brown fox-jumps over the mountain"
s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = [token for token in s.split(" ") if token != ""]
output = list(ngrams(tokens, 3)) #change the number to change n of ngram

In [235]:
print(output[0])

('the', 'brown', 'fox')
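
An n-gram generator is short enough to write by hand. The sketch below is an illustration, not the NLTK implementation; the function name `make_ngrams` is an assumption for this example. It slides a window of length n over the token list.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the brown fox jumps over the mountain".split()

print(make_ngrams(tokens, 2)[:3])  # first three bigrams
print(make_ngrams(tokens, 3)[0])   # first trigram
```

A list of k tokens yields k - n + 1 n-grams, so the seven tokens here give six bigrams and five trigrams.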


## 8) Word2Vec | Average Word2Vec

Converting a word to a vector with many dimensions (normally 100-1000) so that mathematical operations can be performed on it.

#### Average Word2Vec is the average of all word vectors in a sentence


 <img src="images/word2vec.png" style= "width:500px">

In [38]:
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab.vectors.shape)

In [12]:
# nlp(u'home').vector

# len(nlp.vector)

In [39]:
animals = nlp(u"dog pet eagle cat")
print(animals[3].similarity(animals[3]))

1.0


In [40]:
anomaly = nlp("dsdjspwpd")
anomaly[0].has_vector

False

In [44]:

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vectors in the vocabulary to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []
for word in nlp.vocab:
    # keep only lowercase alphabetic entries that have a vector
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])

['king', 'woman', 'she', 'who', 'eagle', 'when', 'dare', 'cat', 'was', 'not']


In [46]:
# Average word2vec: the .vector of a Doc is the average of its token vectors


#nlp(u'this is a nice line').vector
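
Cosine similarity and the sentence average can both be written out in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration; real embeddings have hundreds of dimensions and come from a trained model. The function names are also assumptions for this sketch.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for this example.
toy_vectors = {
    "nice": [0.9, 0.1, 0.0],
    "good": [0.8, 0.2, 0.1],
    "line": [0.0, 0.7, 0.6],
}

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def average_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]

print(cosine_similarity(toy_vectors["nice"], toy_vectors["good"]))
print(cosine_similarity(toy_vectors["nice"], toy_vectors["line"]))
print(average_vector(["nice", "line"], toy_vectors))
```

Words without a vector are skipped in the average, which mirrors the `has_vector` check used for the unknown string above.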

## 9) NER | Named Entity Recognition 
Finds the spans of a text that name real-world entities and labels each one with its category. For example: Organizations, Products, Countries etc.

 <img src="images/ner.png" style= "width:500px">

In [59]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')

for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

500 dollars 4 6 20 31 MONEY
Microsoft 11 12 53 62 ORG
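
The four numbers printed for each entity are its token span (`start`, `end`) and its character span (`start_char`, `end_char`), both with an exclusive end. The sketch below checks this against the output above using plain string slicing. Splitting on whitespace is only a stand-in for spaCy's tokens; it matches here because both entities come before the final "stock?", which spaCy would split further.

```python
text = "Can I please borrow 500 dollars from you to buy some Microsoft stock?"

# (entity text, start token, end token, start char, end char, label),
# copied from the output above
entities = [
    ("500 dollars", 4, 6, 20, 31, "MONEY"),
    ("Microsoft", 11, 12, 53, 62, "ORG"),
]

tokens = text.split()  # whitespace tokens, an approximation of spaCy's

for ent_text, start, end, start_char, end_char, label in entities:
    from_chars = text[start_char:end_char]    # slice by character offsets
    from_tokens = " ".join(tokens[start:end])  # slice by token offsets
    print(label, repr(from_chars), repr(from_tokens))
```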


In [62]:
import spacy
nlp = spacy.load('en_core_web_lg')

from spacy import displacy

In [63]:
#visualizing
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)

## 10) Text Classification (Movie Reviews)

In [64]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')

print(df.head())



df.dropna(inplace=True)

  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


In [65]:
df['label'].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

In [66]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

In [67]:
# Form a prediction set

predictions = text_clf.predict(X_test)

In [68]:

from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

print(metrics.classification_report(y_test,predictions))

[[900  91]
 [ 63 920]]
              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [69]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9219858156028369
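
Every figure in the report can be derived from the confusion matrix, where rows are the true labels and columns are the predicted labels, both in the order neg, pos. The sketch below recomputes precision, recall, F1, and accuracy from the four counts printed above; the variable names are chosen for this example.

```python
labels = ["neg", "pos"]
confusion = [[900, 91],   # true neg: 900 predicted neg, 91 predicted pos
             [63, 920]]   # true pos: 63 predicted neg, 920 predicted pos

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_positive = confusion[i][i]
    predicted = sum(confusion[r][i] for r in range(len(labels)))  # column sum
    actual = sum(confusion[i])                                    # row sum (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(f"{label}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}  support={actual}")

print("accuracy", accuracy)
```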


In [71]:
res = text_clf.predict(["this is awful"])
res[0]

'neg'

## 11) RNN

Disadvantages:
* Slow
* Vanishing gradient
* Doesn't work well for long sentences
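
The vanishing gradient problem listed above can be illustrated with a one-unit RNN. This is a simplified sketch, not a trainable model: the recurrence h_t = tanh(w * h_(t-1) + u * x_t), the weights `w` and `u`, and the constant input are all assumptions chosen for the example. By the chain rule, the gradient of h_t with respect to the initial state is a product of one factor per time step, and each factor here is smaller than 1.

```python
import math

def rnn_gradient_norms(inputs, w=0.5, u=1.0):
    """Run a scalar RNN and return |d h_t / d h_0| after each time step."""
    h = 0.0
    grad = 1.0
    grads = []
    for x in inputs:                 # steps run one after another
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)    # d h_t / d h_(t-1) = w * (1 - h_t^2)
        grads.append(abs(grad))
    return grads

grads = rnn_gradient_norms([0.1] * 50)
print(grads[0], grads[9], grads[49])
```

The loop also hints at why RNNs are slow: step t cannot start until step t-1 has produced its hidden state. After 50 steps the gradient is practically zero, so early words have almost no influence on learning, which is why long sentences are handled poorly.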


 <img src="images/RNN1.png" style= "width:500px">
     <br>
     <img src="images/RNN2.png" style= "width:500px">
         <br>
         <img src="images/RNN3.png" style= "width:500px">

## 12) LSTM

 <img src="images/LSTM.png" style= "width:500px">

## 13) Transformers

 <img src="images/transformer.png" style= "width:500px">
 <img src="images/transformer1.png" style= "width:500px">
  <img src="images/transformer2.png" style= "width:500px">
   <img src="images/transformer3.png" style= "width:500px">
 
 

## 14) BERT
BERT (Bidirectional Encoder Representations from Transformers) is built by stacking Transformer encoders together.


<img src="images/bert.png" style= "width:500px">

## 15) GPT
GPT (Generative Pre-trained Transformer) is built by stacking Transformer decoders together.


<img src="images/gpt.png" style= "width:500px">