**Text Vectorization**

This notebook covers three common vectorization techniques used in NLP.

### Methods:
- **Bag of Words (BoW)** – Simple word counts
- **TF-IDF** – Weighted word importance
- **Word2Vec** – Dense word embeddings learned from surrounding context

BoW and TF-IDF turn each document into a vector with one dimension per vocabulary term. Word2Vec instead learns one dense vector per word.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec


In [2]:
corpus = [
    "NLP is fun and exciting",
    "NLP is powerful and useful",
    "I love learning NLP"
]

**Bag of Words**

In [3]:
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(corpus)


In [4]:
# Inspect the learned vocabulary (term -> column index) and the matrix shape
print("Vocabulary:", vectorizer_bow.vocabulary_)
print("BoW Vector Shape:", X_bow.shape)


Vocabulary: {'nlp': 6, 'is': 3, 'fun': 2, 'and': 0, 'exciting': 1, 'powerful': 7, 'useful': 8, 'love': 5, 'learning': 4}
BoW Vector Shape: (3, 9)
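
The vocabulary maps each term to a column index, so sorting it by index gives the column order of the matrix. A minimal sketch, using the dictionary printed above (the `vocabulary` and `columns` names are only illustrative):

```python
# Vocabulary copied from the output above: term -> column index
vocabulary = {'nlp': 6, 'is': 3, 'fun': 2, 'and': 0, 'exciting': 1,
              'powerful': 7, 'useful': 8, 'love': 5, 'learning': 4}

# Sort terms by their column index to recover the column order
columns = [term for term, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]
print(columns)
# ['and', 'exciting', 'fun', 'is', 'learning', 'love', 'nlp', 'powerful', 'useful']
```

The indices follow alphabetical order of the terms. Note that "I" from the third sentence is missing: the default tokenizer lowercases the text and keeps only tokens of two or more characters.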


In [5]:
print("BoW Vectors:\n", X_bow.toarray())

BoW Vectors:
 [[1 1 1 1 0 0 1 0 0]
 [1 0 0 1 0 0 1 1 1]
 [0 0 0 0 1 1 1 0 0]]
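
To see where these counts come from, the matrix can be rebuilt by hand. The sketch below uses only the standard library and assumes the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters). The helper names `tokenize` and `bow_vector` are hypothetical, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "NLP is powerful and useful",
    "I love learning NLP",
]

# Assumption: same token rule as CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and keep tokens of 2+ word characters."""
    return TOKEN.findall(doc.lower())

# Alphabetically sorted vocabulary; position in the list = column index
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

def bow_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

bow = [bow_vector(doc) for doc in corpus]
for row in bow:
    print(row)
# [1, 1, 1, 1, 0, 0, 1, 0, 0]
# [1, 0, 0, 1, 0, 0, 1, 1, 1]
# [0, 0, 0, 0, 1, 1, 1, 0, 0]
```

Each row has one entry per vocabulary term, which is why the matrix shape is (3, 9).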


**TF-IDF**

In [6]:
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)

In [7]:
print("Vocabulary:", vectorizer_tfidf.vocabulary_)
print("TF-IDF Vector Shape:", X_tfidf.shape)
print("TF-IDF Vectors:\n", X_tfidf.toarray())

Vocabulary: {'nlp': 6, 'is': 3, 'fun': 2, 'and': 0, 'exciting': 1, 'powerful': 7, 'useful': 8, 'love': 5, 'learning': 4}
TF-IDF Vector Shape: (3, 9)
TF-IDF Vectors:
 [[0.40619178 0.53409337 0.53409337 0.40619178 0.         0.
  0.31544415 0.         0.        ]
 [0.40619178 0.         0.         0.40619178 0.         0.
  0.31544415 0.53409337 0.53409337]
 [0.         0.         0.         0.         0.65249088 0.65249088
  0.38537163 0.         0.        ]]
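
The weights above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It uses only the standard library, and `tfidf_vector` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "NLP is powerful and useful",
    "I love learning NLP",
]

# Assumption: default tokenization (lowercase, tokens of 2+ word characters)
TOKEN = re.compile(r"\b\w\w+\b")
docs = [TOKEN.findall(doc.lower()) for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Weight term counts by IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

tfidf = [tfidf_vector(doc) for doc in docs]
for row in tfidf:
    print([round(x, 4) for x in row])
```

"nlp" appears in every document, so its IDF is the minimum value of 1 and it gets the lowest weight in each row. Terms unique to one document, such as "fun" or "love", get the highest.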


**Word2Vec**

In [11]:
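import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data for word_tokenize; recent NLTK releases may need "punkt_tab"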

In [12]:
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]


In [13]:
model_w2v = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=2, min_count=1, workers=1)


In [16]:
print("Word2Vec vector for 'nlp':\n", model_w2v.wv['nlp'])

Word2Vec vector for 'nlp':
 [-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]
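
Word2Vec yields one vector per word, not per document. A common way to get a document vector is to average the vectors of its words. The sketch below shows that idea under stated assumptions: `document_vector` is a hypothetical helper, and the three-dimensional `toy_vectors` are made-up values used only so the example runs without a trained model.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in word_vectors.

    Returns a zero vector of length dim if no token is known.
    """
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {"nlp": [1.0, 0.0, 2.0], "love": [3.0, 2.0, 0.0]}

doc_vec = document_vector(["i", "love", "nlp"], toy_vectors, dim=3)
print(doc_vec)  # [2.0, 1.0, 1.0]
```

With the model trained above, the same helper should work by passing `model_w2v.wv` as `word_vectors` and `model_w2v.vector_size` as `dim`. Tokens missing from the model's vocabulary are skipped.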
