#### 1. Word Embedding

Word2Vec (Skip-gram and Continuous Bag of Words, CBOW)  
fastText  

### What is Word Embedding?
Embeddings are dense, continuous vector representations of data, such as words, sentences, or images, in a space of far lower dimension than sparse encodings such as one-hot vectors.  
They capture the semantic relationships and patterns in the data, where similar items are placed closer together in the vector space.  
In machine learning, embeddings are used to convert complex data into numerical form that models can process more easily.  

For example, word embeddings represent words based on their meanings and contexts, allowing models to understand relationships like synonyms or analogies.  
Embeddings are widely used in tasks like natural language processing, recommendation systems, and image recognition to improve model performance and efficiency.  
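
To make "closer together in the vector space" concrete, the sketch below computes cosine similarity, the measure reported in the Word2Vec outputs later in this section. The three-dimensional vectors are hypothetical values picked by hand for illustration; real embeddings are learned from a corpus and usually have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: hand-picked, not learned)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, and a value near 0 means they are close to orthogonal. In this toy setup the related pair scores higher than the unrelated pair.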

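**Word2Vec** is a widely used natural language processing (NLP) method that learns a vector for each word by training a shallow neural network on word co-occurrence within a context window. It has two training variants: CBOW predicts a word from its surrounding context, while Skip-gram predicts the surrounding context from a word.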
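
Word2Vec comes in two variants: CBOW predicts a word from the words around it, and Skip-gram predicts the surrounding words from a given word. The sketch below shows only how the training examples differ between the two, using a fixed symmetric window. The function names `skipgram_pairs` and `cbow_examples` are made up for this illustration. Real implementations add details not shown here, such as randomly shrinking the window, subsampling frequent words, and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs: one pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Build (context_words, center) examples: one example per position."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples


tokens = ["the", "cat", "sat", "down"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one training pair per (word, neighbour) combination, while CBOW groups all neighbours of a position into a single example. This is one reason Skip-gram is usually slower to train on the same corpus.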

#### Applications of Word Embedding

**Text classification:** Word embeddings improve accuracy in tasks such as topic categorization and sentiment analysis.  
**Named Entity Recognition (NER):** The semantic context captured by word embeddings helps identify entities such as names and locations.  
**Information Retrieval:** Embeddings are used to index and retrieve documents by semantic similarity, giving more relevant search results.  
**Machine Translation:** Word embeddings capture semantic relationships between words across languages, which supports translation.  
**Question Answering:** Embeddings help Q&A systems interpret semantic context, improving response accuracy.  
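
As a rough illustration of the information retrieval use case, the sketch below represents each document as the average of its word vectors and ranks documents by cosine similarity to a query. Everything here is an assumption made for the example: the two-dimensional `word_vectors` are hand-picked rather than learned, and averaging is only one simple way to build a document vector.

```python
import math

# Hypothetical word vectors; in practice these would come from a trained model
word_vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2], "pet": [0.85, 0.15],
    "stock": [0.1, 0.9], "market": [0.15, 0.85], "price": [0.2, 0.8],
}

documents = {
    "doc_pets": ["cat", "dog"],
    "doc_finance": ["stock", "market"],
}


def document_vector(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank(query_tokens):
    """Return (document name, score) tuples, best match first."""
    query = document_vector(query_tokens)
    scored = [(name, cosine(query, document_vector(toks)))
              for name, toks in documents.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank(["pet"])
print(ranking)
```

The query word "pet" appears in neither document, yet the pet-related document ranks first because its averaged vector lies closer to the query vector. Keyword matching alone would not find this document.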

### Word Embeddings Using Word2Vec (CBOW and Skip-gram)

In [123]:
# Python program to generate word vectors using Word2Vec

from gensim.models import Word2Vec
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')

# Read the corpus from "text.txt" (the file handle is closed automatically)
with open("text.txt", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

# Split the text into sentences, then each sentence into lowercase word tokens
data = []
for sentence in sent_tokenize(f):
    data.append([word.lower() for word in word_tokenize(sentence)])

# sg=0 trains a CBOW (Continuous Bag of Words) model
# sg=1 trains a Skip-gram model

# Create CBOW model
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=0)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ", model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ", model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - Skip Gram : ", model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " + "and 'machines' - Skip Gram : ", model2.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9822592
Cosine similarity between 'alice' and 'machines' - CBOW :  0.8997578
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.8719529
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.8462893


### Word Embeddings Using FastText

In [126]:
from gensim.models import FastText
from gensim.test.utils import common_texts

# Example corpus (replace with your own corpus)
corpus = common_texts

# Training FastText model
model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4, sg=1)

# Example usage: getting embeddings for a word
word_embedding = model.wv['computer']

# Most similar words to a given word
similar_words = model.wv.most_similar('computer')

print("Most similar words to 'computer':", similar_words)

Most similar words to 'computer': [('user', 0.1565941423177719), ('response', 0.12383823841810226), ('eps', 0.0307049248367548), ('system', 0.025573883205652237), ('interface', 0.005858757067471743), ('survey', -0.03156975656747818), ('minors', -0.054556481540203094), ('human', -0.0668589174747467), ('time', -0.06855934858322144), ('trees', -0.10636082291603088)]
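
FastText differs from Word2Vec mainly in that it represents a word through its character n-grams, so words that share substrings also share parts of their representation. The sketch below shows only the n-gram extraction step, under the assumption of boundary markers `<` and `>` and an n-gram length range of 3 to 6. The helper `char_ngrams` is a hypothetical name for this example, not part of gensim or fastText.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


# Trigrams of "where", including the boundary markers
print(char_ngrams("where", min_n=3, max_n=3))

# N-grams shared by a word and its plural form
shared = set(char_ngrams("computer")) & set(char_ngrams("computers"))
print(sorted(shared))
```

Because "computers" shares many n-grams with "computer", a subword-based model can typically build a vector for it even if only "computer" appeared in the training corpus.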
