# Chapter 3: NLP & Text Embeddings

## Representation of Text Data

- Text must be converted to numerical form before a model can use it. Several representations exist; the dense, learned vectors among them are called embeddings.
- Bag of Words (BoW) is a simple method that creates vectors where each element represents a word's frequency in a document.
- However, BoW only counts the occurrences of words, while embeddings can capture the semantic meaning of words in a numerical form.
- In this chapter, we'll build our own text embeddings with the Continuous Bag of Words (CBOW) model and cover related techniques such as n-grams, tokenization, tagging, chunking, and Term Frequency-Inverse Document Frequency (TF-IDF).
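
The Bag of Words idea above can be sketched with the standard library alone. The two toy documents and the `bow_vector` helper are made up for illustration; they are not part of the chapter's notebook.

```python
from collections import Counter

# Two toy documents (invented for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary is every unique word across all documents, sorted for a stable order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Each vector has one slot per vocabulary word, so its length grows with the vocabulary and most slots are zero for short documents.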

## Embeddings for NLP

- In natural language processing (NLP), pieces of language (words, phrases) can be represented as high-dimensional vectors, known as embeddings.
- An embedding is a vector of length 'n' that places a word at a point in 'n'-dimensional space.
- Generally, embeddings are of much lower dimensionality than the Bag of Words representation.
- A Bag of Words vector has one element for every unique word in the corpus, so the vectors become large and sparse when the corpus is big.
- Embeddings, on the other hand, typically do not exceed a few hundred dimensions, are densely packed with information, and each dimension contributes meaningfully to the representation.
- Hence, embeddings often provide superior quality representations and are better suited for deep learning models.

## GloVe

- Global Vectors (GloVe) is a set of pre-trained word embeddings derived from large text corpora.
- These embeddings are trained on a word co-occurrence matrix, under the assumption that words that frequently appear together are more likely to share a similar meaning.
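
To make the co-occurrence matrix concrete, here is a minimal sketch that only counts word pairs within a fixed window. The window size, the symmetric counting, and the toy sentence are assumptions for illustration; the snippet does not reproduce GloVe's actual weighting or training procedure.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look only at neighbours to the right so each position pair is visited once,
        # then record the count in both directions to keep the matrix symmetric.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(word, tokens[j])] += 1
            counts[(tokens[j], word)] += 1
    return counts

tokens = "ice is cold and steam is hot".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("ice", "cold")])  # within the window
print(counts[("ice", "hot")])   # too far apart to co-occur
```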

## Cosine Similarity

- Cosine similarity is a measure used to determine how similar two vectors are.
- If the angle between two vectors in an 'n'-dimensional space is 0 degrees, they point in the same direction and their cosine similarity is 1, even if their lengths differ. Perpendicular vectors have a cosine similarity of 0.
- High cosine similarity values indicate that two vectors are similar, even if they are not identical.
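
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper below is a small standard-library sketch of that definition; the name `cosine_sim` and the example vectors are illustrative choices.

```python
import math

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # perpendicular
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the vectors are normalised by their lengths, only the direction matters: scaling a vector does not change its similarity to another vector.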

## Operations on Embeddings

- Mathematical operations can be performed on embeddings to capture semantic relations (e.g., the vector for Queen - Woman + Man lies close to the vector for King).
- While pre-calculated embeddings such as GloVe are useful, it is also possible to generate our own embeddings, which may be beneficial when working with a unique corpus.
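
The analogy arithmetic can be sketched with hand-made vectors. The three dimensions (royalty, male, female) and every value below are invented so that the example works cleanly; real embeddings such as GloVe only approximate this behaviour.

```python
import math

# Toy 3-dimensional vectors, invented for illustration: [royalty, male, female].
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

print(analogy("queen", "woman", "man", toy))
```

With these hand-picked values the nearest remaining word is `king`. The input words are excluded from the candidates because the result vector usually stays close to them.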


### Exploring Continuous Bag of Words (CBOW)

- Continuous Bag of Words (CBOW) is one of the two architectures of the Word2Vec model developed by Google.
- The two Word2Vec architectures are:
    - **CBOW**: Predicts a target word in a document given the surrounding words (known as context words).
    - **Skip-gram**: Predicts the surrounding words given a target word (opposite of CBOW).
- For our CBOW model, we'll use a window of length 2. This means for our model's input/output pairs (X, y), we use ([n-2, n-1, n+1, n+2], n) where 'n' is our target word being predicted.
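
The difference between the two architectures shows up in how training pairs are built. The sketch below generates both kinds of pairs for a window of 2; the helper names and the toy sentence are assumptions, and positions without a full window are skipped to mirror the setup described above.

```python
def cbow_pairs(tokens, window=2):
    """One (context_words, target) pair per position that has a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=2):
    """One (target, context_word) pair per neighbour of each full-window position."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        pairs.extend((target, word) for word in context)
    return pairs

tokens = "the quick brown fox jumps".split()
print(cbow_pairs(tokens))      # many context words -> one target
print(skipgram_pairs(tokens))  # one target -> each context word
```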

### N-gram Language Models (LM)

- An n-gram is a sequence of 'n' consecutive words (unigrams, bigrams, trigrams, etc.); n-gram models use these sequences to describe how language is formed.
- The meaning of a word is also influenced by the words around it. N-gram models capture the order of the words in a sentence, not just the frequency.
- However, as we use larger 'n' values (e.g., bigrams, trigrams), the number of possible n-grams, and therefore the feature space, grows very quickly.
- A bigram language model calculates the probability of a word occurring, given the word that appears before it.
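
A bigram probability can be estimated from counts as P(word | previous) = count(previous, word) / count(previous). The sketch below uses this maximum-likelihood estimate with no smoothing, which is an assumption made for brevity; the function name and toy corpus are illustrative.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from raw counts."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count prev_word only where it has a successor, so the estimates sum to 1.
    prev_counts = Counter(tokens[:-1])
    if prev_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / prev_counts[prev_word]

tokens = "the cat sat on the mat the cat ran".split()
print(bigram_probability(tokens, "the", "cat"))
print(bigram_probability(tokens, "the", "mat"))
```

Here "the" is followed by "cat" twice and by "mat" once, so the two estimates are 2/3 and 1/3. Unseen bigrams get probability 0, which is why practical models add some form of smoothing.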

### Tokenization

- Tokenization is a pre-processing step that splits text into smaller units, such as splitting a document into sentences or a sentence into individual words.
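
As a rough illustration of both levels of tokenization, the sketch below uses two regular expressions. These helpers are naive stand-ins written for this note, not how NLTK's tokenizers work; abbreviations such as "Dr." would be split incorrectly.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "This is the first sentence. This is the second sentence."
print(simple_sent_tokenize(text))
print(simple_word_tokenize("This is a single sentence."))
```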

#### Stopwords

- While some NLP tasks require stopwords (common words like 'the', 'is', 'and'), others do not. For example, in sentiment analysis of a film review, stopwords may not contribute significantly to the overall meaning.
- Removing stopwords can reduce the feature space size, which in turn can reduce the time it takes for our models to train.
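
The effect on the feature space can be seen on a single toy sentence. The four-word stopword set below is a hypothetical stand-in for a real list, and tokens are lowercased before the comparison so that capitalised stopwords are removed too.

```python
# A tiny, hand-picked stopword set (hypothetical; real lists are much longer).
STOPWORDS = {"the", "is", "a", "and"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The film is a triumph and the cast is superb".split()
filtered = remove_stopwords(tokens)

print(filtered)
print(len({t.lower() for t in tokens}), "unique words before")
print(len(set(filtered)), "unique words after")
```

The words that remain (film, triumph, cast, superb) carry most of the sentiment, while the number of unique words drops from 8 to 4.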

### Tagging, Chunking & Parts of Speech (PoS)

- Tagging, chunking, and PoS are used to capture the structure of a sentence, as different words can have different functions within a sentence.
- These aspects and their relationships to one another can be incorporated into our models.

#### Tagging

- Tagging is the process of assigning PoS tags to various words in a sentence.
- Using pre-trained PoS taggers is beneficial because they consider contextual dependencies, rather than acting like a simple lookup table.
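
To see why a simple lookup table falls short, consider the sketch below. The table, its tags, and the fallback to 'NN' are all hypothetical choices for illustration; the tag names follow the Penn Treebank convention.

```python
# Hypothetical lookup table: each word gets exactly one tag, regardless of context.
LOOKUP_TAGS = {"i": "PRP", "read": "VBD", "a": "DT", "book": "NN",
               "please": "UH", "flight": "NN"}

def lookup_tag(tokens):
    """Tag each token from the table, falling back to 'NN' for unknown words."""
    return [(t, LOOKUP_TAGS.get(t.lower(), "NN")) for t in tokens]

print(lookup_tag("I read a book".split()))
print(lookup_tag("Please book a flight".split()))
```

Both sentences tag "book" as a noun, although in the second sentence it is a verb. A tagger that considers the surrounding words can tell the two uses apart.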


# Code Segment

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
glove_loc = '/home/kprasath/DS/NLP/Resources/GloVe/glove.6B.50d.txt'

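def loadGlove(path):
    #each line holds a word followed by its vector components, separated by spaces
    model = {}
    #the with-block closes the file once loading is finished
    with open(path, 'r', encoding='utf-8') as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value
    return model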

glove = loadGlove(glove_loc)

In [5]:
glove['python']

array([ 0.5897  , -0.55043 , -1.0106  ,  0.41226 ,  0.57348 ,  0.23464 ,
       -0.35773 , -1.78    ,  0.10745 ,  0.74913 ,  0.45013 ,  1.0351  ,
        0.48348 ,  0.47954 ,  0.51908 , -0.15053 ,  0.32474 ,  1.0789  ,
       -0.90894 ,  0.42943 , -0.56388 ,  0.69961 ,  0.13501 ,  0.16557 ,
       -0.063592,  0.35435 ,  0.42819 ,  0.1536  , -0.47018 , -1.0935  ,
        1.361   , -0.80821 , -0.674   ,  1.2606  ,  0.29554 ,  1.0835  ,
        0.2444  , -1.1877  , -0.60203 , -0.068315,  0.66256 ,  0.45336 ,
       -1.0178  ,  0.68267 , -0.20788 , -0.73393 ,  1.2597  ,  0.15425 ,
       -0.93256 , -0.15025 ])

In [23]:
cosine_similarity(glove['jasmine'].reshape(1, -1), glove['princess'].reshape(1, -1))

array([[0.37487816]])

### Building CBOW

In [3]:
import torch
import torch.nn as nn
import numpy as np

In [4]:
text = """How that personage haunted my dreams, I need scarcely tell you. On
stormy nights, when the wind shook the four corners of the house and
the surf roared along the cove and up the cliffs, I would see him in a
thousand forms, and with a thousand diabolical expressions. Now the leg
would be cut off at the knee, now at the hip, now he was a monstrous
kind of a creature who had never had but the one leg, and that in the
middle of his body. To see him leap and run and pursue me over hedge and
ditch was the worst of nightmares. And altogether I paid pretty dear for
my monthly fourpenny piece, in the shape of these abominable fancies"""


In [5]:
text = text.replace(',','').replace('.','').lower().split()

In [6]:
corpus = set(text)
corpus_length = len(corpus)

#use a set instead of a list as we are only concerned with the unique words within our text

In [7]:
word_dict = {}
inverse_word_dict = {}

for i, word in enumerate(corpus):
    word_dict[word] = i
    inverse_word_dict[i] = word

In [8]:
data = []

for i in range(2, len(text) -2):
    sentence = [text[i-2],text[i-1],text[i+1], text[i+2]]
    target = text[i]
    data.append((sentence, target))
    

In [9]:
print(data[3])

(['haunted', 'my', 'i', 'need'], 'dreams')


In [None]:
#Higher-dimensional embeddings can represent words in more detail, but the feature space also becomes sparser,
#so high-dimensional embeddings are generally only appropriate for large corpora.

In [10]:
embedding_length = 20

In [11]:
#build model

class CBOW(torch.nn.Module):
    def __init__(self, corpus_length, embedding_dim):
        super(CBOW,self).__init__()
        self.embeddings = nn.Embedding(corpus_length, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim,64)
        self.linear2 = nn.Linear(64,corpus_length)
        self.activation_function1 = nn.ReLU()
        self.activation_function2 = nn.LogSoftmax(dim = -1)
        
    def forward(self, inputs):
        #add the context word embeddings together into a single (1, embedding_dim) vector
        embeds = self.embeddings(inputs).sum(dim=0).view(1,-1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out
    
    def get_word_embedding(self,word):
        word = torch.LongTensor([word_dict[word]])
        return self.embeddings(word).view(1,-1)

In [12]:
#train model

model = CBOW(corpus_length, embedding_length)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def make_sentence_vector(sentence, word_dict):
    idxs = [word_dict[w] for w in sentence]
    return torch.tensor(idxs, dtype = torch.long)
print(make_sentence_vector(['stormy','nights','when','the'], word_dict))

tensor([50, 51,  5, 18])


In [13]:
for epoch in range(100):
    epoch_loss = 0
    for sentence, target in data:
        model.zero_grad()
        sentence_vector = make_sentence_vector(sentence, word_dict)  
        log_probs = model(sentence_vector)
        loss = loss_function(log_probs, torch.tensor([word_dict[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        #detach() keeps the running total out of the autograd graph
        epoch_loss += loss.detach()
    print('Epoch: '+str(epoch)+', Loss: ' + str(epoch_loss.item()))

Epoch: 0, Loss: 544.8351440429688
Epoch: 1, Loss: 480.48858642578125
Epoch: 2, Loss: 435.4610595703125
Epoch: 3, Loss: 395.34063720703125
Epoch: 4, Loss: 355.692626953125
Epoch: 5, Loss: 316.1547546386719
Epoch: 6, Loss: 276.779541015625
Epoch: 7, Loss: 238.30360412597656
Epoch: 8, Loss: 201.53378295898438
Epoch: 9, Loss: 167.13568115234375
Epoch: 10, Loss: 136.44906616210938
Epoch: 11, Loss: 110.07818603515625
Epoch: 12, Loss: 88.14295959472656
Epoch: 13, Loss: 70.43659973144531
Epoch: 14, Loss: 56.51333999633789
Epoch: 15, Loss: 45.71113586425781
Epoch: 16, Loss: 37.46113586425781
Epoch: 17, Loss: 31.1269474029541
Epoch: 18, Loss: 26.23110008239746
Epoch: 19, Loss: 22.432907104492188
Epoch: 20, Loss: 19.406402587890625
Epoch: 21, Loss: 17.020523071289062
Epoch: 22, Loss: 15.0863037109375
Epoch: 23, Loss: 13.500479698181152
Epoch: 24, Loss: 12.192682266235352
Epoch: 25, Loss: 11.087519645690918
Epoch: 26, Loss: 10.160371780395508
Epoch: 27, Loss: 9.360690116882324
Epoch: 28, Loss: 8.6

In [14]:
def get_predicted_result(log_probs, inverse_word_dict):
    #the index with the highest log-probability is the predicted word
    index = np.argmax(log_probs)
    return inverse_word_dict[index]

def predict_sentence(sentence):
    sentence_split = sentence.replace('.','').lower().split()
    sentence_vector = make_sentence_vector(sentence_split, word_dict)
    prediction_array = model(sentence_vector).detach().numpy()
    print('Preceding Words: {}\n'.format(sentence_split[:2]))
    print('Predicted Word: {}\n'.format(get_predicted_result(prediction_array[0], inverse_word_dict)))
    print('Following Words: {}\n'.format(sentence_split[2:]))

predict_sentence('to see leap and')

Preceding Words: ['to', 'see']

Predicted Word: him

Following Words: ['leap', 'and']



In [15]:
print(model.get_word_embedding('leap'))

tensor([[ 1.0962,  1.9593,  0.0110,  0.1595,  0.7783,  0.9081, -0.5268, -1.8456,
         -1.0762, -0.8705,  0.4062,  0.1166, -0.1222,  1.3338, -0.3573,  0.0959,
          0.8397, -1.2704,  1.1369, -1.8897]], grad_fn=<ViewBackward0>)


In [16]:
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords

In [17]:
text = 'This is a single sentence.'
tokens = word_tokenize(text)
print(tokens)

#the '.' is kept as its own token; whether to keep it depends on the task

['This', 'is', 'a', 'single', 'sentence', '.']


In [18]:
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
print(no_punctuation)

#removes punctuation and lowercases the remaining tokens

['this', 'is', 'a', 'single', 'sentence']


In [19]:
#sentence tokenisation 
text = "This is the first sentence. This is the second sentence. A document contains many sentences."
print(sent_tokenize(text))

['This is the first sentence.', 'This is the second sentence.', 'A document contains many sentences.']


In [20]:
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])

[['This', 'is', 'the', 'first', 'sentence', '.'], ['This', 'is', 'the', 'second', 'sentence', '.'], ['A', 'document', 'contains', 'many', 'sentences', '.']]


In [21]:
#NLTK's built-in list of English stopwords

stop_words = stopwords.words('english')
print(stop_words[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [22]:
text = 'This is a single sentence.'
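#the stopword list is lowercase and this comparison is case-sensitive, so 'This' is kept;
#compare token.lower() instead to remove it as well
tokens = [token for token in word_tokenize(text) if token not in stop_words]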
print(tokens)

['This', 'single', 'sentence', '.']


### Tagging, Chunking & PoS

In [24]:
import nltk

In [29]:
sentence = "The big dog is sleeping on the bed"
token = nltk.word_tokenize(sentence)
nltk.pos_tag(token)

[('The', 'DT'),
 ('big', 'JJ'),
 ('dog', 'NN'),
 ('is', 'VBZ'),
 ('sleeping', 'VBG'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('bed', 'NN')]

In [30]:
nltk.help.upenn_tagset("VBG")

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


In [31]:
tagged = nltk.pos_tag(token)

In [32]:
#a noun phrase (NP) chunk: an optional determiner, any number of adjectives, then a noun
expression = 'NP: {<DT>?<JJ>*<NN>}'
REchunkParser = nltk.RegexpParser(expression)
tree = REchunkParser.parse(tagged)

In [33]:
tree.draw()

# TF-IDF

In [34]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [37]:
emma = nltk.corpus.gutenberg.sents('austen-emma.txt')

emma_sentences = []
emma_word_set = []

for sentence in emma:
    emma_sentences.append([word.lower() for word in sentence if word.isalpha()])
    for word in sentence:
        if word.isalpha():
            emma_word_set.append(word.lower())

emma_word_set = set(emma_word_set)

In [40]:

def TermFreq(document, word):
    #share of the document's words that are equal to `word`
    doc_length = len(document)
    occurrences = len([w for w in document if w == word])
    return occurrences / doc_length

TermFreq(emma_sentences[5], 'ago')

0.024390243902439025

In [41]:
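def build_DF_dict():
    #document frequency: the number of sentences that contain each word
    output = {word: 0 for word in emma_word_set}
    for doc in emma_sentences:
        #set(doc) counts each word at most once per sentence
        for word in set(doc):
            output[word] += 1
    return output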
        
df_dict = build_DF_dict()

df_dict['ago']

32

In [42]:
def InverseDocumentFrequency(word):
    N = len(emma_sentences)
    #add 1 so that a word missing from df_dict does not cause a division by zero
    df = df_dict.get(word, 0) + 1
    return np.log(N/df)

InverseDocumentFrequency('ago')

5.454673404991123

In [44]:
def TFIDF(doc, word):
    tf = TermFreq(doc, word)
    idf = InverseDocumentFrequency(word)
    return tf*idf

print('ago - ' + str(TFIDF(emma_sentences[5],'ago')))
print('indistinct - ' + str(TFIDF(emma_sentences[5],'indistinct')))

ago - 0.13304081475588106
indistinct - 0.2014154581926258
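
For reference, the three functions above correspond to the definitions below. The symbols are shorthand introduced for this note: d is a document (here, one sentence), |d| is its length in words, N is the number of documents, and df(w) is the number of documents that contain the word w. The logarithm is the natural log, as computed by `np.log`.

```latex
\mathrm{tf}(w, d) = \frac{\mathrm{count}(w, d)}{|d|}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{\mathrm{df}(w) + 1}, \qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

The printed scores are consistent with this: 'indistinct' appears in fewer sentences than 'ago', so it has a larger IDF and a higher TF-IDF score in the same sentence.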


In [45]:

embeddings = []

for word in emma_sentences[5]:
    embeddings.append(glove[word])

mean_embedding = np.mean(embeddings, axis = 0).reshape(1, -1)

print(mean_embedding)

[[ 3.32575634e-01  3.16596488e-01 -1.80050732e-01 -3.82070951e-01
   4.98493527e-01  5.33804805e-01 -5.46517073e-01  9.12476195e-02
  -1.31538483e-01 -2.71967805e-02  2.99867317e-02  2.64278024e-02
  -2.06519756e-01 -1.54796634e-01  4.28036366e-01 -5.74977317e-02
  -2.65928778e-01  1.60373902e-02 -2.84913561e-01 -2.01252268e-01
  -5.96390732e-02  5.72458220e-01  2.06195927e-01 -1.54312293e-01
   2.52049805e-01 -1.64638200e+00 -3.42686049e-01  1.02592522e-01
   1.42848000e-01 -1.09779902e-01  2.89345488e+00  7.36985634e-02
  -3.73648780e-03 -2.76292784e-01  1.50580049e-01  9.80399951e-02
   2.24408780e-03  2.83664024e-01  3.92979024e-02 -2.98091634e-01
  -1.17309171e-01  2.08815776e-01  6.89953902e-03  2.92777244e-02
   5.54180122e-02 -2.20519707e-01 -2.82007805e-01 -4.34917439e-01
  -9.69051537e-02 -1.67569878e-01]]


In [46]:
embeddings = []

for word in emma_sentences[5]:
    tfidf = TFIDF(emma_sentences[5], word)
    embeddings.append(glove[word]* tfidf) 
    
tfidf_weighted_embedding = np.mean(embeddings, axis = 0).reshape(1, -1)

print(tfidf_weighted_embedding)

[[ 0.03384888  0.04561131 -0.02508487 -0.05546237  0.0651311   0.07019455
  -0.06298467  0.02670422 -0.01072827 -0.00508234  0.00517652  0.00817101
  -0.01604324 -0.01483237  0.04946372 -0.01076198 -0.05021479  0.00040191
  -0.01920397 -0.01341318 -0.01123547  0.08492142  0.02142466 -0.01588025
   0.04405683 -0.17856836 -0.03999452  0.01601948  0.02088402 -0.01340125
   0.2829529   0.00694315  0.00485215 -0.02633143  0.01534283  0.01608815
   0.00316191  0.03238881  0.0082704  -0.04192922 -0.0058766   0.01992215
  -0.00304265 -0.00353939  0.01174628 -0.03416807 -0.02939215 -0.06798914
  -0.00774682 -0.01807456]]


In [47]:
cosine_similarity(mean_embedding, tfidf_weighted_embedding)

array([[0.986523]])