# Vectorization

Vectorization is the process of converting input data from its raw format (e.g., text) into vectors of real numbers, which is the format that ML models accept.

Vectors are arrays of numbers that represent features of the text. These arrays can have different numbers of dimensions:<br>
**1D Vectors**: Represent a single item, such as an individual word (e.g., a word embedding). <br>
**2D Vectors (matrices)**: Represent sequences of words, such as sentences or documents, with one row per word (e.g., a stack of word embeddings).<br>
**Multi-Dimensional Vectors (tensors)**: Represent more complex structures, such as a batch of sentences, where each sentence is itself a 2D array.<br>
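
To make the dimensions concrete, here is a minimal sketch using plain Python lists. The `word_vectors` table and the `embed` helper are hypothetical, and the numbers are invented; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
# Hypothetical 4-dimensional word embeddings; the values are invented for illustration.
word_vectors = {
    "i": [0.1, 0.3, -0.2, 0.5],
    "love": [0.7, -0.1, 0.4, 0.0],
    "nlp": [0.2, 0.6, 0.1, -0.3],
}

def embed(sentence):
    """Turn a sentence into a list of word vectors, one row per token."""
    return [word_vectors[token] for token in sentence.lower().split()]

word = word_vectors["nlp"]                          # 1D: a single word
sentence = embed("I love NLP")                      # 2D: tokens x dimensions
batch = [embed("I love NLP"), embed("NLP love I")]  # 3D: sentences x tokens x dimensions

print(len(word))                                      # 4
print((len(sentence), len(sentence[0])))              # (3, 4)
print((len(batch), len(batch[0]), len(batch[0][0])))  # (2, 3, 4)
```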

# Why Use Vectorization?

1. Vectorization converts text into a numerical format that models can use to learn patterns and make predictions.<br>
2. Some vectorization methods, such as embeddings, also capture the meanings of words and the relationships between them. <br>
3. It transforms messy, unstructured data into a structured, manageable form.

**One-Hot Encoding**: Converts each word (or category) into a vector with one position per distinct value, where only one element is “hot” (set to 1) and all others are “cold” (set to 0).

In [1]:
from sklearn.preprocessing import OneHotEncoder

# Creating the encoder
enc = OneHotEncoder(handle_unknown='ignore')

# Sample data
X = [['Red'], ['Green'], ['Blue']]

# Fitting the encoder to the data
enc.fit(X)

# Transforming new data
result = enc.transform([['Red']]).toarray()

# Displaying the encoded result
print(result)

[[0. 0. 1.]]
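
In the output above, the 1 for `Red` sits in the last column because `OneHotEncoder` sorts the categories it learns (`Blue`, `Green`, `Red`). The sketch below reproduces that behavior in plain Python; the `one_hot` helper is hypothetical and only illustrates the idea, it is not how scikit-learn is implemented.

```python
# Sample categories, matching the example above.
categories = ["Red", "Green", "Blue"]

# Sort the unique values so each category gets a stable column index.
vocabulary = sorted(set(categories))  # ['Blue', 'Green', 'Red']
index = {value: i for i, value in enumerate(vocabulary)}

def one_hot(value):
    """Return a one-hot list for value; unknown values map to all zeros."""
    vector = [0.0] * len(vocabulary)
    if value in index:
        vector[index[value]] = 1.0
    return vector

print(one_hot("Red"))     # [0.0, 0.0, 1.0]
print(one_hot("Purple"))  # [0.0, 0.0, 0.0], mirrors handle_unknown='ignore'
```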


**Bag of Words (BoW)**: Represents each document by how often each vocabulary word occurs in it. It ignores grammar and word order and keeps only the word counts.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is great", "I enjoy learning NLP"]

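# Build the BoW model. By default, CountVectorizer lowercases the text
# and drops single-character tokens, so "I" does not appear in the vocabulary.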
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['enjoy' 'great' 'is' 'learning' 'love' 'nlp']
[[0 0 0 0 1 1]
 [0 1 1 0 0 1]
 [1 0 0 1 0 1]]
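
The same matrix can be built by hand, which shows where each count comes from. This is a minimal sketch: the `tokenize` helper is hypothetical and mimics `CountVectorizer`'s default behavior (lowercasing, and keeping only tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = ["I love NLP", "NLP is great", "I enjoy learning NLP"]

def tokenize(text):
    # Lowercase, then keep only tokens with at least two word characters.
    # This is why the one-letter word "I" disappears.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary word, values are raw counts.
bow = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)

# Word order does not matter: both phrases produce the same counts.
print(Counter(tokenize("love NLP")) == Counter(tokenize("NLP love")))  # True
```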


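**TF-IDF**: Weights each word by how often it appears in a document (term frequency) and by how rare it is across the corpus (inverse document frequency). Words that appear in every document, such as "nlp" in the example below, receive the lowest weight.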

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same corpus as in the previous example
corpus = ["I love NLP", "NLP is great", "I enjoy learning NLP"]

# Creation of TF-IDF model
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['enjoy' 'great' 'is' 'learning' 'love' 'nlp']
[[0.         0.         0.         0.         0.861037   0.50854232]
 [0.         0.65249088 0.65249088 0.         0.         0.38537163]
 [0.65249088 0.         0.         0.65249088 0.         0.38537163]]
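
The numbers above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each document vector to unit L2 length. The sketch below assumes those defaults and starts from an already tokenized corpus; the variable names are illustrative.

```python
import math

# Tokenized corpus; the one-letter token "i" is already dropped,
# as TfidfVectorizer does by default.
docs = [["love", "nlp"], ["nlp", "is", "great"], ["enjoy", "learning", "nlp"]]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: the number of documents each word appears in.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed IDF, assuming scikit-learn's default smooth_idf=True.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

tfidf = []
for doc in docs:
    # Raw term count times IDF for every vocabulary word.
    row = [doc.count(word) * idf[word] for word in vocabulary]
    # L2-normalize so each document vector has unit length.
    norm = math.sqrt(sum(value * value for value in row))
    tfidf.append([value / norm for value in row])

print(vocabulary)
for row in tfidf:
    print([round(value, 6) for value in row])
```

Because "nlp" occurs in all three documents, its IDF is exactly 1, the smallest possible value, so it ends up with a lower weight than the rarer words in the same row.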
