# Bag of Words
The Bag of Words (BoW) model is a representation technique used in natural language processing (NLP) to convert text documents into numerical vectors of word counts, ignoring grammar and word order.

Example:

Consider the following two text documents:

Document 1: "The cat sat on the mat."

Document 2: "The dog played in the garden."

### Steps in Bag of Words:

Lowercasing: Convert all text to lowercase to ensure uniformity. For example, "The" and "the" are treated as the same word.

Stopwords Removal: Remove common words (stopwords) like "the", "is", "in", etc., which occur frequently but do not carry significant meaning.

Vocabulary Frequency Counting: Create a vocabulary of unique words from the remaining words in the documents and count their frequencies.

Document Vectorization: Represent each document as a numerical vector based on the vocabulary. The vector elements correspond to the frequency of each word in the vocabulary in the respective document.

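For the given example documents, after lowercasing and removing the stopwords "the", "on", and "in", the vocabulary might be: ["cat", "dog", "sat", "played", "mat", "garden"]. The order of the vocabulary is arbitrary, but it must be the same for every document.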

Document 1: [1, 0, 1, 0, 1, 0]

Document 2: [0, 1, 0, 1, 0, 1]
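The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `STOPWORDS` set and the helper names `preprocess`, `build_vocabulary`, and `vectorize` are made up for this example, and the vocabulary is ordered by first appearance, so the columns come out in a different order than in the list above.

```python
from collections import Counter
import string

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "on", "in", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]

def build_vocabulary(token_lists):
    """Collect unique words in order of first appearance."""
    vocabulary = []
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def vectorize(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

documents = ["The cat sat on the mat.", "The dog played in the garden."]
tokenized = [preprocess(doc) for doc in documents]
vocabulary = build_vocabulary(tokenized)
vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document ends up as a list with one count per vocabulary word, which is the same information as the vectors shown above with the columns in a different order.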

### Advantages:

Easy to Implement: The Bag of Words model is straightforward to implement and understand.

Fixed-Sized Input: It generates fixed-sized input vectors, making it suitable for machine learning algorithms that require consistent input dimensions.
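To illustrate the fixed-size property, here is a small sketch in plain Python. The vocabulary is the one from the example above; the helper `vectorize` and the two test strings are made up for this illustration. A one-word document and a nine-word document both map to a vector of length six.

```python
from collections import Counter

# Fixed vocabulary taken from the worked example.
vocabulary = ["cat", "dog", "sat", "played", "mat", "garden"]

def vectorize(text, vocabulary):
    """Return one count per vocabulary word, regardless of document length."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

short_doc = "cat"
long_doc = "dog played dog sat cat played garden mat dog"

short_vec = vectorize(short_doc, vocabulary)
long_vec = vectorize(long_doc, vocabulary)

print(len(short_vec), short_vec)
print(len(long_vec), long_vec)
```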

### Disadvantages:

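Sparse Matrix Problem: The Bag of Words model often results in high-dimensional and sparse feature vectors. Each vector has one dimension per vocabulary word, and most entries are zero because any single document uses only a small part of the vocabulary. This leads to memory and computational inefficiency and can also contribute to overfitting in machine learning models.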

Ordering of Words: Bag of Words disregards word order and context, so sentences with different meanings can receive identical representations. For example, "cat sat" and "sat cat" would have the same representation.

Out of Vocabulary (OOV): Words that appear in new text but were not in the vocabulary built during training are ignored, leading to loss of information.

Semantic Meaning Not Captured: Bag of Words does not capture the semantic relationships between words. Each word is an independent dimension, so related words such as synonyms are treated as completely unrelated.
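The word-order and out-of-vocabulary limitations are easy to reproduce. The sketch below uses plain Python; the helpers `fit_vocabulary` and `transform` and the sample strings are hypothetical and only mimic the fit/transform split of a real vectorizer.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Sorted list of unique lowercase words seen in the training texts."""
    return sorted({word for text in texts for word in text.lower().split()})

def transform(text, vocabulary):
    """Count vector over a fixed vocabulary; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = fit_vocabulary(["cat sat", "dog played"])

# Word order: both orderings give the same vector.
vec_a = transform("cat sat", vocabulary)
vec_b = transform("sat cat", vocabulary)
print(vec_a == vec_b)

# Out of vocabulary: "parrot" and "sang" were never seen during fitting,
# so the resulting vector contains only zeros.
vec_oov = transform("parrot sang", vocabulary)
print(vec_oov)
```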

### N-grams in Bag of Words:

To recover some local word order and context, N-grams can be used in conjunction with the Bag of Words model. N-grams are contiguous sequences of n words from a given text. For example, instead of considering only individual words (unigrams), we can consider sequences of two or three words (bigrams or trigrams). Each distinct N-gram then becomes one feature in the vocabulary, which lets the model capture short-range sequential information at the cost of a larger and sparser feature space.
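As a rough sketch of how N-grams are formed, the plain-Python helper below (the name `ngrams` is made up for this example) slides a window of size n over a token list. The last line shows that, with bigrams, "cat sat" and "sat cat" no longer share a representation.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)

# Bigrams keep local order, so the two orderings produce different features.
print(ngrams(["cat", "sat"], 2), ngrams(["sat", "cat"], 2))
```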

## Using NLTK:

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
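# Download the tokenizer models and the stopword list.
# Assumption: recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')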

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# "dog" is repeated in the second document so that one count is greater than 1
documents = [
    "The cat sat on the mat.",
    "The dog dog played in the garden."
]

In [None]:
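# Lowercase and tokenize with NLTK to inspect the tokens.
# CountVectorizer below is fit on the raw documents and does its own lowercasing and tokenizing.
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]
tokenized_documents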

[['the', 'cat', 'sat', 'on', 'the', 'mat', '.'],
 ['the', 'dog', 'dog', 'played', 'in', 'the', 'garden', '.']]

In [None]:
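# stop_words='english' uses scikit-learn's built-in English stopword list (not NLTK's)
CountVec = CountVectorizer(stop_words='english')
# Learn the vocabulary and return the document-term count matrix (sparse)
Count_data = CountVec.fit_transform(documents)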

In [None]:
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   cat  dog  garden  mat  played  sat
0    1    0       0    1       0    1
1    0    2       1    0       1    0


In [None]:
# Print the BoW matrix
print(Count_data.toarray())

[[1 0 0 1 0 1]
 [0 2 1 0 1 0]]


In [None]:
CountVec.vocabulary_  #Unique words along with their indices

{'cat': 0, 'sat': 5, 'mat': 3, 'dog': 1, 'played': 4, 'garden': 2}

### N-grams:

In [None]:
CountVec = CountVectorizer(ngram_range=(1,1),  # unigrams only; use ngram_range=(2,2) for bigrams
                           stop_words='english',
                           max_features=5,  # keep at most 5 features, ranked by frequency across the corpus; the vocabulary has 6, so one is dropped
                           binary=True  # binary BoW: record presence (1) or absence (0) instead of counts
                           )

Count_data = CountVec.fit_transform(documents)

In [None]:
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   cat  dog  garden  mat  played
0    1    0       0    1       0
1    0    1       1    0       1


In [None]:
# Bigrams are built after stopword removal, so "sat mat" counts as a bigram
# even though "on the" separates the two words in the original sentence.
CountVec = CountVectorizer(ngram_range=(2,2),  # bigrams only
                           stop_words='english',
                           max_features=5,  # keep at most 5 features; only 5 distinct bigrams exist here, so none is dropped
                           binary=True  # binary BoW: record presence (1) or absence (0) instead of counts
                           )
Count_data = CountVec.fit_transform(documents)
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   cat sat  dog dog  dog played  played garden  sat mat
0        1        0           0              0        1
1        0        1           1              1        0


## Using spaCy:

In [None]:
import spacy
# Load the small English language model in spaCy
# (install it first if needed: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

In [None]:
documents = [
    "The cat sat on the mat.",
    "The dog dog played in the garden."
]

In [None]:
# Tokenize and preprocess the documents using spaCy
tokenized_documents = [[token.text.lower() for token in nlp(doc)] for doc in documents]
tokenized_documents

[['the', 'cat', 'sat', 'on', 'the', 'mat', '.'],
 ['the', 'dog', 'dog', 'played', 'in', 'the', 'garden', '.']]

In [None]:
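# Unigrams, English stopwords removed, binary presence/absence instead of counts
CountVec = CountVectorizer(ngram_range=(1, 1), stop_words='english', binary=True)

# CountVectorizer expects strings, so join each token list back into one string
# before building the Bag of Words representation
Count_data = CountVec.fit_transform([' '.join(doc) for doc in tokenized_documents])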

In [None]:
# Create DataFrame from the Bag of Words representation
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   cat  dog  garden  mat  played  sat
0    1    0       0    1       0    1
1    0    1       1    0       1    0


In [None]:
# Print the Bag of Words representation as a dense matrix
print(Count_data.toarray())

[[1 0 0 1 0 1]
 [0 1 1 0 1 0]]


In [None]:
print(CountVec.vocabulary_) # (word to index mapping)

{'cat': 0, 'sat': 5, 'mat': 3, 'dog': 1, 'played': 4, 'garden': 2}
