# Embeddings
Word embeddings are numerical representations of words as vectors that capture semantic relationships and contextual information. They are essential in natural language processing (NLP) because they let models work with text numerically. One common method is Word2Vec, which trains a neural network to map words to dense vectors so that similar words lie close together in the vector space. Pre-trained embeddings such as Word2Vec, GloVe, or FastText are learned on large text corpora and can be reused or fine-tuned for specific NLP tasks. TF-IDF (Term Frequency-Inverse Document Frequency) is a simpler, count-based alternative: it represents each document as a vector with one entry per vocabulary word, weighted by how important that word is to the document relative to the corpus. Unlike learned embeddings, TF-IDF vectors are sparse and do not place similar words near each other, but they are easy to compute. All of these representations are used in NLP tasks such as sentiment analysis, text classification, and machine translation. This notebook builds TF-IDF vectors from scratch.
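
To make "similar words lie close together" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up purely for illustration (real embeddings are learned and usually have far more dimensions), and `cosine_similarity` is a helper defined here, not part of this notebook's pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-picked for illustration only.
toy_vectors = {
    "cricket": [0.9, 0.8, 0.1],
    "batsman": [0.8, 0.9, 0.2],
    "trophy": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["cricket"], toy_vectors["batsman"])
sim_unrelated = cosine_similarity(toy_vectors["cricket"], toy_vectors["trophy"])
print(sim_related > sim_unrelated)  # True
```

With these toy values, "cricket" and "batsman" point in nearly the same direction, while "cricket" and "trophy" do not.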

In [5]:
!pip install nltk
import nltk
import numpy as np
from nltk.tokenize import word_tokenize

# Tokenizer models required by word_tokenize
nltk.download('punkt')

Defaulting to user installation because normal site-packages is not writeable


[nltk_data] Downloading package punkt to /home/naseeha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
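# Note: the `+` operators join the five sentences into ONE string, so this
# list holds a single document (a five-sentence paragraph), not five documents.
text = [
    "The Cricket World Cup is a major event in the world of sports.\n" +
    "Cricket players from different countries participate in the World Cup.\n" +
    "The final match of the World Cup was an exciting game.\n" +
    "Batsmen scored heavily in the tournament.\n" +
    "The World Cup trophy is awarded to the winning team.\n"
]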
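
Because the five string literals are concatenated, `text` contains exactly one string. If the intent is to treat each sentence as its own document, one option is to split that string on its newline characters. This is an assumption about the intended setup, not something the notebook does, and `documents` is a name introduced here for illustration.

```python
text = [
    "The Cricket World Cup is a major event in the world of sports.\n" +
    "Cricket players from different countries participate in the World Cup.\n" +
    "The final match of the World Cup was an exciting game.\n" +
    "Batsmen scored heavily in the tournament.\n" +
    "The World Cup trophy is awarded to the winning team.\n"
]

# splitlines() breaks the single string at each "\n" and drops the newlines
documents = text[0].splitlines()

print(len(text), len(documents))  # 1 5
```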

In [7]:
sentences = []
word_set = []

In [8]:
for sent in text:
    # Lowercase the tokens and keep only purely alphabetic ones
    # (this drops punctuation and numbers).
    tokens = [tok.lower() for tok in word_tokenize(sent) if tok.isalpha()]
    sentences.append(tokens)
    for word in tokens:
        if word not in word_set:
            word_set.append(word)

# Vocabulary as a set (note: a set has no guaranteed iteration order)
word_set = set(word_set)
# Total number of documents in the corpus
total_documents = len(sentences)

# Map each vocabulary word to a fixed position in the TF-IDF vector
index_dict = {word: i for i, word in enumerate(word_set)}
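
Iterating over a Python set gives no guaranteed order, and for strings it can change between interpreter runs, so the positions stored in `index_dict` (and therefore the column order of the vectors) may differ from run to run. One way to make the mapping reproducible is to sort the vocabulary first. The sketch below uses a hypothetical helper, `build_index`, that is not part of the notebook.

```python
def build_index(tokenized_docs):
    """Map each distinct word to a position, in alphabetical order."""
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    return {word: i for i, word in enumerate(vocab)}

# Small made-up example, already tokenized
example_docs = [["the", "world", "cup"], ["the", "final", "match"]]
index = build_index(example_docs)
print(index)
# {'cup': 0, 'final': 1, 'match': 2, 'the': 3, 'world': 4}
```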

In [9]:
# Document frequency: for each vocabulary word, count how many documents contain it

def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count

word_count = count_dict(sentences)

In [10]:
# Term frequency
def termfreq(document, word):
    N = len(document)
    occurrence = len([token for token in document if token == word])
    # Take the base-10 logarithm of the relative frequency so that values stay
    # in a manageable range for corpora with thousands of words or more.
    # Since occurrence / N <= 1, the result is always <= 0.
    return np.log10(occurrence / N)

In [11]:
# Inverse document frequency

def inverse_doc_freq(word):
    # Add 1 to the document frequency so that a word missing from
    # word_count does not cause a division by zero.
    word_occurrence = word_count.get(word, 0) + 1
    # Take the base-10 logarithm so that values stay in a manageable range.
    # A word present in every document gets a negative value here,
    # because total_documents / (document frequency + 1) < 1.
    return np.log10(total_documents / word_occurrence)
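
Written out, the two functions above compute the following. The notation is introduced here for readability and does not appear in the code: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N_d$ is the number of tokens in $d$, $D$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \log_{10}\!\left(\frac{f_{t,d}}{N_d}\right),
\qquad
\mathrm{idf}(t) = \log_{10}\!\left(\frac{D}{\mathrm{df}(t) + 1}\right)
$$

This is the notebook's own variant. Many references instead define the term frequency without the logarithm, as $f_{t,d} / N_d$.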

In [12]:
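def tf_idf(sentence):
    """Return the TF-IDF vector (one entry per vocabulary word) for a tokenized document."""
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        # Entries for words absent from the document stay 0
        tf_idf_vec[index_dict[word]] = tf * idf
    return tf_idf_vec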
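
For comparison, a widely used textbook formulation takes the raw relative frequency for TF (no logarithm) and log(N / df) for IDF, computed over several documents. The sketch below is a standalone illustration under those assumptions, not the notebook's method. It uses plain whitespace splitting instead of `word_tokenize`, and the names `tf_idf_standard` and `docs` are introduced here.

```python
import math
from collections import Counter

def tf_idf_standard(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log10(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Two tiny made-up documents, already lowercased
docs = [
    "the world cup trophy".split(),
    "the final match".split(),
]
weights = tf_idf_standard(docs)
print(weights[0]["the"])         # 0.0, because "the" occurs in every document
print(weights[0]["trophy"] > 0)  # True, because "trophy" occurs in only one
```

In this formulation a word that appears in every document gets weight 0, which is the usual motivation for the IDF factor.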

In [13]:
#TF-IDF Encoded text corpus
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
 
print(vectors[0])

[0.51144093 0.51144093 0.33020282 0.42082187 0.51144093 0.51144093
 0.51144093 0.51144093 0.42082187 0.36781312 0.51144093 0.23958376
 0.51144093 0.51144093 0.51144093 0.51144093 0.51144093 0.51144093
 0.51144093 0.51144093 0.51144093 0.30103    0.51144093 0.51144093
 0.51144093 0.51144093 0.51144093 0.51144093 0.51144093 0.51144093
 0.42082187]
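
The repeated value 0.51144093 can be reproduced by hand. After filtering, the single document contains 50 alphabetic tokens, and because there is only one document, every word has a document frequency of 1, so the IDF factor is log10(1 / 2) for all of them. The check below is a sketch using only the standard library. The counts (50 tokens, 8 occurrences of "the") were tallied by hand from the sample text.

```python
import math

N = 50                    # alphabetic tokens in the single document
idf = math.log10(1 / 2)   # total_documents = 1, document frequency + 1 = 2

once = math.log10(1 / N) * idf   # any word that occurs exactly once
the = math.log10(8 / N) * idf    # "the" occurs 8 times

print(round(once, 8))  # 0.51144093
print(round(the, 8))   # 0.23958376
```

Both factors are negative, so their product is positive. Under this variant, words that occur once score higher than frequent words such as "the".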
