# Word Representations in Natural Language Processing
This notebook walks through three ways of representing text numerically in NLP: Bag of Words (BoW), TF-IDF, and Word2Vec.

In [2]:
# Required Libraries
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kinla\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Bag of Words (BoW)

### Interpreting the Output of BoW
The output of the Bag of Words (BoW) model has two parts:

Vocabulary: A dictionary whose keys are the unique words in the text and whose values are the column indices of those words in the encoded matrix. `CountVectorizer` lowercases the text by default and assigns the indices in alphabetical order.

Encoded Document: A matrix in which each row is a document and each column is a word in the vocabulary. Each value is the number of times that word occurs in that document.

In [3]:
# Example text
text = ["The quick brown fox jumped over the lazy dog."]

# Create the Transform
vectorizer = CountVectorizer()

# Tokenize and build vocab
vectorizer.fit(text)

# Encode the Document
vector = vectorizer.transform(text)

# Summarize
print('Vocabulary: ', vectorizer.vocabulary_)
print('Encoded Document: ', vector.toarray())

Vocabulary:  {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
Encoded Document:  [[1 1 1 1 1 1 1 2]]
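
To make the row and column layout concrete, here is a minimal pure-Python sketch of the counting step. It is not scikit-learn's implementation: the function name `bag_of_words` and the second sentence are hypothetical, and the sketch only assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetical column order).

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings.

    Hypothetical helper that approximates CountVectorizer's defaults.
    """
    # Lowercase, then keep runs of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Column indices follow the alphabetical order of the unique words.
    words = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {word: idx for idx, word in enumerate(words)}
    # One row per document, one column per vocabulary word.
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[word] for word in words])
    return vocabulary, matrix


# The second sentence is made up to show a matrix with more than one row.
docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked.",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

With two documents the matrix has two rows, and a word that is missing from a document gets a count of 0 in that row.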


## TF-IDF

### Interpreting the Output of TF-IDF
The output of the TF-IDF model has the same structure as the BoW output, but it contains TF-IDF scores instead of raw counts.

Vocabulary: As before, a dictionary whose keys are the unique words in the text and whose values are their column indices.

Encoded Document: A matrix in which each row is a document and each column is a word in the vocabulary. Each value is the TF-IDF score of that word in that document. By default, `TfidfVectorizer` scales each row to unit Euclidean length. Because this example contains only one document, every word receives the same IDF weight, so the scores are simply the normalized counts.

In [4]:
# Example text
text = ["The quick brown fox jumped over the lazy dog."]

# Create the Transform
vectorizer = TfidfVectorizer()

# Tokenize and build vocab
vectorizer.fit(text)

# Encode the Document
vector = vectorizer.transform(text)

# Summarize
print('Vocabulary: ', vectorizer.vocabulary_)
print('Encoded Document: ', vector.toarray())

Vocabulary:  {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
Encoded Document:  [[0.30151134 0.30151134 0.30151134 0.30151134 0.30151134 0.30151134
  0.30151134 0.60302269]]
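
The scores above can be reproduced by hand. The sketch below is a pure-Python approximation, not scikit-learn's code: the function name `tfidf` is hypothetical, and it assumes the default settings of `TfidfVectorizer`, namely the smoothed weight idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return (words, rows) with L2-normalized TF-IDF scores.

    Hypothetical helper that assumes TfidfVectorizer's default settings.
    """
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    words = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = {w: sum(w in toks for toks in tokenized) for w in words}
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in words}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[w] * idf[w] for w in words]
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm if norm else 0.0 for x in raw])
    return words, rows


words, rows = tfidf(["The quick brown fox jumped over the lazy dog."])
print(words)
print([round(x, 8) for x in rows[0]])
# Seven words score 0.30151134 and 'the' scores 0.60302269
```

With a single document every IDF weight equals 1, so the row is the count vector [1, 1, 1, 1, 1, 1, 1, 2] divided by its length sqrt(11). That gives 1/sqrt(11) ≈ 0.3015 for the words that occur once and 2/sqrt(11) ≈ 0.6030 for "the".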


## Word2Vec

### Interpreting the Output of Word2Vec
The output of a Word2Vec model differs from the previous two:

Model's Vocabulary: `model.wv.key_to_index` is a dictionary whose keys are the unique tokens in the text and whose values are their integer indices. Because the text is tokenized with `nltk.word_tokenize`, which neither lowercases the text nor drops punctuation, "The", "the", and "." are separate tokens.

Vector for a word: A Word2Vec model represents each word as a dense vector (300 dimensions in this example). When trained on a large corpus, these vectors capture semantic relationships: similar words get similar vectors, and some analogies can be expressed as vector arithmetic. A single sentence is far too little data for that, so the vectors in this example are not semantically meaningful.
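
The idea that similar words have similar vectors is usually measured with cosine similarity. The sketch below shows that measure in plain Python. The three-dimensional vectors are invented for illustration only and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen so that "fox" and "dog" point in
# similar directions while "over" points elsewhere.
toy_vectors = {
    "fox": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.4],
    "over": [-0.2, 0.9, -0.1],
}

fox_dog = cosine_similarity(toy_vectors["fox"], toy_vectors["dog"])
fox_over = cosine_similarity(toy_vectors["fox"], toy_vectors["over"])
print(f"fox vs dog:  {fox_dog:.3f}")
print(f"fox vs over: {fox_over:.3f}")
```

For a trained gensim model, `model.wv.similarity('fox', 'dog')` computes the same quantity on the learned vectors.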

In [6]:
# Preparing the text
text = ["The quick brown fox jumped over the lazy dog."]
tokenized_text = [nltk.word_tokenize(sent) for sent in text]

# Creating the model and setting values for the various parameters
num_features = 300  # Word vector dimensionality
min_word_count = 1  # Minimum word count
num_workers = 4     # Number of parallel threads
context = 10        # Context window size

# Initialize and train the model
model = Word2Vec(
    tokenized_text,
    workers=num_workers,
    vector_size=num_features,  # named `size` in gensim versions before 4.0
    min_count=min_word_count,
    window=context,
)

# Access the model's vocabulary
# `model.wv` holds the trained word vectors
# (`key_to_index` replaces the `vocab` attribute of gensim versions before 4.0)
print("Model's Vocabulary: ", model.wv.key_to_index)

# Access vector for one word
print('Vector for "fox":', model.wv['fox'])

Model's Vocabulary:  {'.': 0, 'dog': 1, 'lazy': 2, 'the': 3, 'over': 4, 'jumped': 5, 'fox': 6, 'brown': 7, 'quick': 8, 'The': 9}
Vector for "fox": [ 3.2451833e-03 -3.2601277e-03 -2.1664966e-03  9.2792866e-04
  2.1439958e-03 -1.7891225e-03  9.1749750e-04  3.0404369e-03
 -2.2718073e-03 -2.0333040e-03 -1.6632139e-03 -1.2254707e-03
  6.1657192e-04  3.2275442e-03  2.1459253e-03  1.3236403e-04
  8.2358957e-04  2.8134971e-03  3.0429931e-03  1.8762510e-03
  1.9820877e-03 -2.5402287e-03 -1.2758906e-03 -1.8934441e-03
  2.0605915e-03 -7.5214985e-04 -2.9264784e-03  2.5397083e-03
  2.7998944e-03 -1.1067450e-03  3.0388865e-03 -2.4611989e-04
 -1.2088398e-03 -1.2823025e-04  6.4808926e-05 -1.1682988e-03
  9.3774719e-04  1.9099017e-03  2.2896698e-03 -2.9678221e-03
 -7.3090952e-04 -1.8272658e-03  2.5070333e-03  2.1672344e-03
 -1.4535740e-03  7.7561103e-04 -1.9845536e-03  7.8831908e-05
  3.1539197e-03 -8.6994725e-04 -1.7292392e-03 -2.4657364e-03
 -9.7064732e-04 -2.8810461e-04  1.1759539e-03  3.2472964e-03