**Bag-of-Words (BoW):**

**CountVectorizer** converts text documents into a matrix of token counts.
**Normalized BoW** applies scikit-learn's `normalize` function with the L1 norm, so each document vector sums to 1 and each entry is the relative frequency of that word within the document.
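
To make the two BoW variants concrete, here is a minimal pure-Python sketch of roughly what the counting and L1-normalization steps compute. The toy documents and variable names are made up for illustration, and the regular expression only approximates CountVectorizer's default tokenization (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical toy documents, chosen only for illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# Approximation of CountVectorizer's default tokenization:
# lowercase, then keep runs of two or more word characters.
token_re = re.compile(r"\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# A sorted vocabulary gives a stable column order.
vocab = sorted({tok for toks in tokenized for tok in toks})

# Raw counts: one row per document, one column per vocabulary word.
counts = []
for toks in tokenized:
    tally = Counter(toks)
    counts.append([tally[word] for word in vocab])

# L1 normalization: divide each row by its sum to get relative frequencies.
relative = [[c / sum(row) for c in row] for row in counts]

print(vocab)
for raw_row, rel_row in zip(counts, relative):
    print(raw_row, [round(v, 3) for v in rel_row])
```

In the first toy document, "the" occurs 2 times out of 6 tokens, so its normalized value is 2/6; every normalized row sums to 1.
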

**TF-IDF:**

**TfidfVectorizer** computes the term frequency-inverse document frequency (TF-IDF) matrix, which down-weights words that appear in many documents, since such words tend to be less informative.
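
As a rough illustration of the weighting, the sketch below works through TF-IDF by hand on a small, hypothetical count matrix. It assumes scikit-learn's documented defaults (raw counts as term frequency, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); other TF-IDF variants use different formulas.

```python
import math

# Hypothetical raw count matrix: 3 documents x 3 terms.
# Term 0 appears in every document, term 2 in only one.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 1, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight raw counts by IDF, then scale each row to unit L2 length.
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    length = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / length for v in weighted])

print("df: ", df)
print("idf:", [round(w, 4) for w in idf])
for row in tfidf:
    print([round(v, 4) for v in row])
```

The term that occurs in every document gets the smallest IDF (exactly 1 under this formula), while the term found in a single document gets the largest.
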

**Word2Vec:**
The text is tokenized using NLTK.
The Word2Vec model from Gensim is trained on the tokenized documents, resulting in dense vector representations (embeddings) for each word.
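
The `window` parameter controls which neighbouring words count as context for a given word during training. The sketch below is a simplified illustration of that idea only: `context_pairs` is a hypothetical helper, not part of Gensim, and Gensim's actual training loop does more than enumerate pairs.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical token list, chosen only for illustration.
tokens = ["the", "quick", "brown", "fox"]
pairs = context_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window such as 5 would pair every word in these short sentences with almost every other word.
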

In [5]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize
import gensim
from gensim.models import Word2Vec
import nltk

# Download required NLTK data packages
nltk.download('punkt')
nltk.download('punkt_tab')  # Ensures the punkt_tab resource is available

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fox is quick and brown. Lazy dogs are not quick."
]

# Preprocess documents (e.g., converting to lowercase)
processed_docs = [doc.lower() for doc in documents]

# 1. Bag-of-Words (BoW)


# 1a. Count occurrence (raw counts)
count_vectorizer = CountVectorizer()
bow_counts = count_vectorizer.fit_transform(processed_docs)
bow_df = pd.DataFrame(bow_counts.toarray(), columns=count_vectorizer.get_feature_names_out())
print("=== Bag-of-Words Count Occurrence ===")
print(bow_df, "\n")

# 1b. Normalized count occurrence (L1 normalization: each document's frequencies sum to 1)
bow_normalized = normalize(bow_counts, norm='l1', axis=1)
bow_normalized_df = pd.DataFrame(bow_normalized.toarray(), columns=count_vectorizer.get_feature_names_out())
print("=== Normalized Count Occurrence (L1 norm) ===")
print(bow_normalized_df, "\n")


# 2. TF-IDF


tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_docs)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("=== TF-IDF Matrix ===")
print(tfidf_df, "\n")


# 3. Word2Vec Embeddings

# Tokenize documents into words using NLTK
tokenized_docs = [nltk.word_tokenize(doc) for doc in processed_docs]
print("=== Tokenized Documents ===")
for i, tokens in enumerate(tokenized_docs, 1):
    print(f"Document {i}: {tokens}")
print()

# Train a Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Display the vocabulary from the Word2Vec model
vocab = list(w2v_model.wv.index_to_key)
print("=== Vocabulary in Word2Vec Model ===")
print(vocab, "\n")

# Example: Get the embedding for the word "quick"
word = "quick"
if word in w2v_model.wv:
    print(f"=== Word2Vec Embedding for '{word}' ===")
    print(w2v_model.wv[word])
else:
    print(f"Word '{word}' not found in the vocabulary.")


=== Bag-of-Words Count Occurrence ===
   and  are  brown  dog  dogs  fox  is  jump  jumps  lazy  never  not  over  \
0    0    0      1    1     0    1   0     0      1     1      0    0     1   
1    0    0      0    1     0    0   0     1      0     1      1    0     1   
2    1    1      1    0     1    1   1     0      0     1      0    1     0   

   quick  quickly  the  
0      1        0    2  
1      0        1    1  
2      2        0    0   

=== Normalized Count Occurrence (L1 norm) ===
   and  are     brown       dog  dogs       fox   is      jump     jumps  \
0  0.0  0.0  0.111111  0.111111   0.0  0.111111  0.0  0.000000  0.111111   
1  0.0  0.0  0.000000  0.142857   0.0  0.000000  0.0  0.142857  0.000000   
2  0.1  0.1  0.100000  0.000000   0.1  0.100000  0.1  0.000000  0.000000   

       lazy     never  not      over     quick   quickly       the  
0  0.111111  0.000000  0.0  0.111111  0.111111  0.000000  0.222222  
1  0.142857  0.142857  0.0  0.142857  0.000000  0.1428

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
