<a href="https://colab.research.google.com/github/mtsilimos/Python-Machine-Learning-for-Beginners_-Source-code/blob/main/Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Word2Vec**

In [3]:
!pip install gensim nltk



In [4]:
import nltk
nltk.download('punkt_tab')
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Sample corpus
documents = ["This is a sentence", "Word embeddings capture meaning", "Text representation is important"]

# Tokenize each document, producing a list of lists in which each inner list holds the tokens of one document
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# Train Word2Vec model
# vector_size=100: dimensionality of the word vectors (embeddings). Each word is represented by a vector of 100 numbers.
# window=5: maximum distance between the current word and the words in its context, i.e. up to 5 words to the left and 5 words to the right of the target word.
# min_count=1: words with a total frequency lower than this value are ignored. With a value of 1, every word in the corpus is kept.
# workers=4: number of worker threads used for training. More workers can speed up training on multi-core machines.

model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Get embedding for a word. This line retrieves the 100-dimensional vector representation for the word 'word'.
word_vector = model.wv['word']
print(word_vector)

[ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459e-03  6.24894
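
The `window` and `min_count` comments above can be made concrete with a small pure-Python sketch. The helper names `context_pairs` and `filter_by_min_count` are hypothetical and are not part of gensim; the sketch uses the full symmetric window and leaves out refinements a real implementation may apply, such as subsampling or randomly shrinking the window.

```python
from collections import Counter


def context_pairs(tokens, window):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


def filter_by_min_count(docs, min_count):
    """Drop every token whose total corpus frequency is below min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


docs = [["this", "is", "a", "sentence"],
        ["text", "representation", "is", "important"]]

pairs = context_pairs(docs[0], window=1)
print(pairs)
# [('this', 'is'), ('is', 'this'), ('is', 'a'), ('a', 'is'), ('a', 'sentence'), ('sentence', 'a')]

filtered = filter_by_min_count(docs, min_count=2)
print(filtered)  # [['is'], ['is']]: only "is" occurs twice
```

With a corpus as small as the one in this notebook, `min_count=1` is what keeps any vocabulary at all; a higher threshold would discard almost every word.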



**FastText**

In [5]:
from gensim.models import FastText

# Train FastText model
fasttext_model = FastText(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Get embedding for a word
fasttext_vector = fasttext_model.wv['word']
print(fasttext_vector)

[ 2.1155151e-03  9.8680402e-04  1.2800316e-03  2.1573596e-03
  1.5427172e-04 -3.0147359e-03  2.3549281e-03  9.9292534e-05
 -1.9066199e-03 -1.4001230e-03 -7.1601936e-04 -5.6761183e-04
 -6.3550641e-04 -1.6605038e-05 -4.6853209e-03  2.0611156e-03
  3.9701643e-03 -2.0668460e-03  1.2425384e-03  1.0273020e-03
 -1.2777551e-03 -4.6210687e-04 -3.1799278e-03 -3.7297735e-04
 -1.6908546e-03  5.3695223e-04  1.3541238e-03 -9.9467847e-04
  1.7661083e-03  4.2559326e-04 -3.5514096e-03 -1.7985261e-04
 -3.8914299e-05  4.6839315e-04 -6.3550984e-04  3.3649718e-04
  1.4996319e-03  1.5863772e-03 -1.9504224e-03  1.9697666e-03
 -1.5901782e-05 -4.0345907e-04  1.6221876e-04  1.0139368e-04
 -2.7230161e-03  2.1991446e-03 -1.6542393e-03  3.2167204e-03
  1.0735159e-03  5.4604368e-04 -3.4078590e-03 -3.7226258e-03
 -4.9078057e-04  1.1654339e-03  1.7590716e-03  1.4553722e-03
 -1.2638047e-06 -3.5127540e-04  1.3029241e-03 -3.2649450e-03
  6.9016282e-04  2.8527918e-04  1.9260035e-03 -2.2928943e-03
  1.0747793e-04  2.13664
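
FastText differs from Word2Vec mainly in that it builds a word's vector from the vectors of its character n-grams, which is why it can also return a vector for a word it never saw during training. The sketch below only illustrates the n-gram extraction step; `char_ngrams` is a hypothetical helper, not a gensim function, and the default range of 3 to 6 characters is assumed to mirror gensim's `min_n` and `max_n` defaults. A real implementation additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

all_ngrams = char_ngrams("word")
print(len(all_ngrams))  # 10 n-grams of length 3 to 6 for "<word>"
```

Because related word forms such as "word" and "words" share most of their n-grams, their vectors end up close to each other even when one of them is rare.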

**GloVe**

In [6]:
import gensim.downloader as api

# Load pretrained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")

# Get embedding for a word
glove_vector = glove_model['word']
print(glove_vector)

[ 0.1233    0.55741   0.74203  -0.06547  -0.33485   0.81541  -0.16384
 -1.0327    0.41834  -0.012764 -0.60695   0.30146   0.35976   0.41161
  0.03381  -0.091115  0.35077  -0.24798  -0.13128   0.19869   0.046961
  0.014633 -0.39851  -0.11829  -0.27432  -0.032518 -0.23637  -0.072372
 -0.04237  -0.11159   0.12129   0.64011  -0.50275  -0.21584   0.30097
 -0.041772 -0.47972  -0.12897   0.6964   -0.27594  -0.29149   0.088033
  0.12874  -0.15249  -0.20548   0.029435  0.055133 -0.12994  -0.33869
 -0.61891   0.4743    0.60288   1.0209    0.48663  -1.0587   -1.9711
 -0.41751   0.12457   1.304     0.26925   0.28003   0.91141  -0.62217
 -0.70356   1.0379   -0.095316  0.54085  -0.36123  -0.10311  -0.31059
 -0.61454   0.63799   0.18329  -0.49599   0.3607    0.70414  -0.28096
  0.1062   -0.64866  -0.28698  -0.26623  -1.4502   -0.69456  -0.48722
 -1.6753    0.40353  -0.085219 -0.85528   0.65113   0.019457 -0.20924
  0.18864  -0.12794   0.41757   0.097439 -0.58381  -0.38945  -0.15608
  0.014198  0.6563
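
A raw vector like the one above is hard to interpret on its own; embeddings become useful when two of them are compared. One common measure is cosine similarity, sketched here in plain Python on made-up 3-dimensional vectors (the values are illustrative, not real GloVe data, and `cosine_similarity` is a hypothetical helper).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors: the second points in the same direction as the first.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(round(same_direction, 6))  # 1.0
print(orthogonal)                # 0.0
```

In practice there is no need to hand-roll this: the object returned by `api.load` is a gensim `KeyedVectors` instance, which provides `similarity` and `most_similar` methods for the same purpose.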

**BERT Embeddings**

In [7]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and convert to tensor
text = "Text embeddings are powerful"

# return_tensors='pt': return the output as PyTorch tensors. The result is dictionary-like and includes input_ids, a tensor of numerical IDs corresponding to the tokens of the input text.
inputs = tokenizer(text, return_tensors='pt')

# Feed the tokenized input to the pre-trained BERT model. The **inputs syntax unpacks the inputs dictionary so that its keys are passed as keyword arguments to the model.
outputs = model(**inputs)

# Extract the embedding
# outputs.last_hidden_state: the final layer's contextualized representation of each token in the input sequence, one vector of 768 numbers per token.
# .mean(dim=1): average the token embeddings of the sentence into a single vector that represents the whole sentence.
# .detach(): return a tensor that is detached from the computational graph, so no gradient history is kept.
# .numpy(): convert the tensor to a NumPy array.

bert_embedding = outputs.last_hidden_state.mean(dim=1).detach().numpy()
print(bert_embedding)


[[-2.07033634e-01 -6.09189086e-02 -1.53025746e-01  1.07333317e-01
  -1.68446124e-01 -4.37565207e-01 -1.23240098e-01  1.75206423e-01
   3.71475816e-01 -1.69855267e-01  4.47603688e-02  3.46759945e-04
  -3.71827334e-01  2.41791144e-01 -1.99983627e-01  1.25234678e-01
  -2.34488457e-01  3.28809738e-01 -2.63080508e-01  1.79441106e-02
   1.18280612e-01  2.64627859e-02 -5.89269578e-01  2.64124036e-01
   5.47728062e-01 -7.20182285e-02 -3.64039205e-02  7.64196068e-02
  -1.43767998e-01 -1.96442336e-01  8.35850760e-02  4.74337518e-01
  -1.84556305e-01 -2.60158509e-01 -9.84733924e-03 -3.76923233e-02
   3.55101943e-01 -2.18327418e-01  7.30223730e-02  4.94596688e-03
  -6.74027801e-01 -4.02410448e-01  3.52013767e-01  1.83400556e-01
  -1.14766993e-01 -6.89098120e-01 -3.30274820e-01 -1.01251625e-01
  -8.29913467e-02 -3.21356595e-01 -7.75064588e-01  1.93001807e-01
   1.32726595e-01  3.73309463e-01  2.27105200e-01  6.57030404e-01
   7.78113818e-03 -5.85645437e-01  4.79615808e-01 -9.99056473e-02
   2.50606
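
Averaging over every token position, as `.mean(dim=1)` does, works for a single sentence. When several sentences are batched with padding, the padded positions would be averaged in as well, so a common refinement is to weight the mean by the `attention_mask` that the tokenizer returns. The sketch below shows the idea on plain Python lists with made-up 2-dimensional vectors; `mean_pool` is a hypothetical helper, not a transformers or PyTorch function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position (toy values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]

# Ignoring the mask pulls the average toward the padding vector.
unmasked = mean_pool(token_vectors, [1, 1, 1])
print(unmasked[1])  # 2.0 instead of 3.0
```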

The embeddings from Word2Vec, FastText, GloVe and BERT are all dense, continuous vectors that encode semantic relationships rather than raw counts or frequencies, and their components can be negative. They are therefore a poor fit for the count-based Naive Bayes variants normally used for text (such as Multinomial Naive Bayes), which expect non-negative, count-like features. Embeddings can technically be used as input features for a Random Forest, but Random Forest is generally not the best classifier for exploiting the semantic information spread across their dimensions. You will typically get better results with models such as Logistic Regression or Support Vector Machines (SVM).
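
Before any of these classifiers can be trained on static word embeddings, each document has to be reduced to one fixed-length feature vector. A simple baseline, sketched below under that assumption, is to average the vectors of the words the embedding model knows. `document_vector` and the toy lookup table are hypothetical; with gensim, the lookup would be a model's `wv` attribute rather than a plain dictionary.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens; zeros if none is known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional lookup table standing in for a trained embedding model.
toy_embeddings = {"text": [1.0, 0.0], "embeddings": [0.0, 1.0]}

features = document_vector(["text", "embeddings", "rock"], toy_embeddings, dim=2)
print(features)  # [0.5, 0.5]: "rock" is unknown and skipped
```

A list of such vectors, one per document, together with the labels, is the kind of input a Logistic Regression or SVM classifier expects.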