---
<strong>
    <h1 align='center'><strong>Bag of Words</strong></h1>
</strong>

---

**Importing necessary libraries**

In [1]:
from pprint import pprint

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the 'punkt' tokenizer models used by word_tokenize and sent_tokenize.
# Note: recent NLTK releases may require the 'punkt_tab' resource instead.
nltk.download('punkt', quiet=True)

# Download the NLTK stopwords dataset, which contains common stopwords for various languages.
nltk.download('stopwords', quiet=True)

# Download the WordNet lexical database, which is used for various NLP tasks like synonym and antonym lookup.
nltk.download('wordnet', quiet=True)

# Download the NLTK averaged perceptron tagger, which is used for part-of-speech tagging.
# nltk.download('averaged_perceptron_tagger', quiet=True)


# Download the NLTK names dataset, which contains lists of male and female first names.
# nltk.download('names', quiet=True)

# Download the NLTK movie_reviews dataset, which contains movie reviews categorized as positive and negative.
# nltk.download('movie_reviews', quiet=True)

# Download the NLTK reuters dataset, which is a collection of news documents categorized into topics.
# nltk.download('reuters', quiet=True)

# Download the NLTK brown corpus, which is a collection of text from various genres of written American English.
# nltk.download('brown', quiet=True)

True

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "The cat in the hat.",
    "The dog chased the cat."
]


# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the documents to create the bag of words representation
X = vectorizer.fit_transform(documents)

# Convert the bag of words representation to a dense matrix and print it
print("Bag of Words Matrix:")
print(X.toarray())

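# Get the vocabulary; the terms are already returned in column order (alphabetical),
# so sorted() here only converts the array into a plain list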
vocabulary = sorted(vectorizer.get_feature_names_out())

# Print the vocabulary
print("Vocabulary:")
print(vocabulary)

Bag of Words Matrix:
[[1 0 0 1 1 2]
 [1 1 1 0 0 2]]
Vocabulary:
['cat', 'chased', 'dog', 'hat', 'in', 'the']


## Bag of Words in Natural Language Processing (NLP)

In **Natural Language Processing (NLP)**, the **"bag of words"** (BoW) is a simple and widely used technique for **text data preprocessing** and **feature extraction**. It represents text data as a collection of individual words or tokens, without considering the order or structure of the words in the text. Here's how it works:

### Tokenization
The first step in creating a bag of words representation is to tokenize the text. Tokenization involves splitting the text into individual words or tokens. For example, the sentence "The quick brown fox" would be tokenized into the tokens: ["The", "quick", "brown", "fox"].
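
As a minimal sketch of this step, the snippet below splits a sentence into word tokens with a regular expression from the standard library. The `simple_tokenize` helper is a hypothetical name introduced here for illustration; NLTK's `word_tokenize`, imported earlier, is likely a more robust choice for real text.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens (illustrative helper, not an NLTK function)."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
    return re.findall(r"\w+", text)

tokens = simple_tokenize("The quick brown fox")
print(tokens)  # ['The', 'quick', 'brown', 'fox']
```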

### Vocabulary Creation
Next, a vocabulary or dictionary is created. This vocabulary contains all unique words or tokens from the entire corpus of text data. For example, if you have a collection of documents, the vocabulary would contain all the unique words from those documents.
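
A rough sketch of vocabulary creation, assuming a simple regex tokenizer and lowercasing (`CountVectorizer` also lowercases by default). The `build_vocabulary` helper is a hypothetical name used only for this illustration.

```python
import re

documents = [
    "The cat in the hat.",
    "The dog chased the cat.",
]

def build_vocabulary(docs):
    """Return a sorted list of unique lowercase tokens (hypothetical helper)."""
    unique_tokens = set()
    for doc in docs:
        # Lowercase first so "The" and "the" count as the same word.
        unique_tokens.update(re.findall(r"\w+", doc.lower()))
    return sorted(unique_tokens)

vocabulary = build_vocabulary(documents)
print(vocabulary)  # ['cat', 'chased', 'dog', 'hat', 'in', 'the']
```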

### Vectorization
Once the vocabulary is established, each document or text sample is represented as a vector. The vector has a fixed length equal to the size of the vocabulary, and each element corresponds to one word in the vocabulary. The value of each element is the frequency of the corresponding word in the document. There are variations, such as using binary values (`1` if the word is present, `0` if not) or term frequency-inverse document frequency (TF-IDF) values instead of raw word frequencies.
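
The count and binary variants can be sketched in plain Python as follows. The `vectorize` helper and its `binary` flag are hypothetical names for this illustration, and TF-IDF weighting is left out of the sketch.

```python
import re
from collections import Counter

vocabulary = ["cat", "chased", "dog", "hat", "in", "the"]

def vectorize(doc, vocab, binary=False):
    """Map a document to a fixed-length vector over vocab (hypothetical helper)."""
    counts = Counter(re.findall(r"\w+", doc.lower()))
    if binary:
        # Presence/absence: 1 if the word occurs at least once, else 0.
        return [1 if counts[word] > 0 else 0 for word in vocab]
    # Raw frequency: a Counter returns 0 for words it has not seen.
    return [counts[word] for word in vocab]

count_vector = vectorize("The cat in the hat.", vocabulary)
binary_vector = vectorize("The cat in the hat.", vocabulary, binary=True)
print(count_vector)   # [1, 0, 0, 1, 1, 2]
print(binary_vector)  # [1, 0, 0, 1, 1, 1]
```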

#### Example:
Let's say you have two sentences:

- Sentence 1: `"The cat in the hat."`
- Sentence 2: `"The dog chased the cat."`



Using the bag of words representation with word frequency counting, the vector representations of these sentences would be:

**Bag of Words Matrix**

| Document | cat | chased | dog | hat | in | the |
|---|---|---|---|---|---|---|
| The cat in the hat. | 1 | 0 | 0 | 1 | 1 | 2 |
| The dog chased the cat. | 1 | 1 | 1 | 0 | 0 | 2 |



- `cat` appears once in the first document and once in the second document.
- `chased` appears zero times in the first document and once in the second document.
- `dog` appears zero times in the first document and once in the second document.
- `hat` appears once in the first document and zero times in the second document.
- `in` appears once in the first document and zero times in the second document.
- `the` appears twice in the first document and twice in the second document.


**Explaining the bag of words matrix:**

- Each row in the matrix represents one of the input documents.
- Each column represents a unique word (or token) from the entire corpus of documents.
- The values in the matrix indicate how many times each word occurs in each document.

**The bag of words model has certain limitations:**

- It does not capture **word order** within a document.
- It does not consider the **semantic meaning** of words or the context in which they are used.
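
The loss of word order can be seen in a small sketch: two sentences with opposite meanings produce identical word counts. The `bag_of_words` helper is a hypothetical name used only here.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring order (hypothetical helper)."""
    return Counter(re.findall(r"\w+", text.lower()))

first = bag_of_words("The dog chased the cat.")
second = bag_of_words("The cat chased the dog.")

# Both sentences contain the same words the same number of times,
# so their bag of words representations cannot be told apart.
print(first == second)  # True
```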

To address these limitations, more advanced techniques have been developed:

- **Word embeddings**: Techniques like `Word2Vec` and `GloVe` generate word embeddings, which are dense vector representations of words that capture semantic meaning and relationships between words.

- **Deep learning models**: Recurrent Neural Networks (RNNs) and Transformer-based models (e.g., BERT) are capable of capturing complex patterns and context in text data. RNNs are especially good at considering sequential information, while Transformers excel at capturing long-range dependencies.

These advanced techniques provide richer and more meaningful representations of text data, making them suitable for a wide range of NLP tasks, including text classification, sentiment analysis, and machine translation.
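
To illustrate why dense vectors can express relatedness between words, the sketch below compares cosine similarities. The dense values are made-up numbers chosen for illustration, not output from Word2Vec or GloVe, and `cosine_similarity` is a small helper written here rather than a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors (one dimension per vocabulary word): distinct words share nothing.
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "hat": [0, 0, 1]}

# Hypothetical dense vectors, invented for this example only.
toy_dense = {"cat": [0.9, 0.8, 0.1], "dog": [0.8, 0.9, 0.2], "hat": [0.1, 0.2, 0.9]}

one_hot_sim = cosine_similarity(one_hot["cat"], one_hot["dog"])
cat_dog_sim = cosine_similarity(toy_dense["cat"], toy_dense["dog"])
cat_hat_sim = cosine_similarity(toy_dense["cat"], toy_dense["hat"])

print(one_hot_sim)                # 0.0
print(cat_dog_sim > cat_hat_sim)  # True
```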


In [4]:
# Importing necessary libraries for text cleaning and processing
import re                                  # Regular expressions library for text pattern matching and replacement
import nltk                                # Natural Language Toolkit library for various NLP tasks
from nltk.corpus import stopwords          # NLTK corpus containing common stop words
from nltk.stem.porter import PorterStemmer # Stemming algorithm for word reduction
from nltk.stem import WordNetLemmatizer    # Lemmatization tool for word normalization

# Creating the stemming and lemmatization objects (only the lemmatizer is used below)
ps = PorterStemmer()
wordnet = WordNetLemmatizer()

# This text corpus will be converted into a bag of words representation
text = """
                Natural Language Processing (NLP) is a field of artificial intelligence that focuses
                on the interaction between computers and humans through natural language. The ultimate
                goal of NLP is to enable computers to understand, interpret, and generate human language
                in a way that is both meaningful and useful. NLP techniques are used in a wide range of
                applications, including machine translation, speech recognition, sentiment analysis,
                chatbots, and information retrieval. It involves various tasks such as tokenization,
                part-of-speech tagging, named entity recognition, and syntactic parsing. NLTK is a popular
                Python library for NLP, providing tools and resources for tasks like text processing, text
                classification, and language modeling. It offers a wide range of functions and datasets
                to help you get started with NLP projects. In this sample text, we'll demonstrate some
                basic NLP tasks using NLTK, such as tokenization and part-of-speech tagging.
                Let's get started!
             """

sentences = nltk.sent_tokenize(text)
print("Number of sentences in the text corpus:", len(sentences))

corpus = []
stop_words = set(stopwords.words('english'))  # Build the stop word set once instead of once per word

for sentence in sentences:
    # Replace every non-alphabetic character with a space
    cleaned_sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    cleaned_sentence = cleaned_sentence.lower()
    cleaned_sentence = cleaned_sentence.split()
    # Drop stop words, then lemmatize the remaining words
    cleaned_sentence = [wordnet.lemmatize(word) for word in cleaned_sentence if word not in stop_words]
    cleaned_sentence = ' '.join(cleaned_sentence)
    corpus.append(cleaned_sentence)

print(corpus)

Number of sentences in the text corpus: 8
['natural language processing nlp field artificial intelligence focus interaction computer human natural language', 'ultimate goal nlp enable computer understand interpret generate human language way meaningful useful', 'nlp technique used wide range application including machine translation speech recognition sentiment analysis chatbots information retrieval', 'involves various task tokenization part speech tagging named entity recognition syntactic parsing', 'nltk popular python library nlp providing tool resource task like text processing text classification language modeling', 'offer wide range function datasets help get started nlp project', 'sample text demonstrate basic nlp task using nltk tokenization part speech tagging', 'let get started']
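
The cleaning steps in the cell above can be traced on a single phrase with the standard library alone. This is a simplified sketch: `toy_stop_words` is a small hand-picked set standing in for NLTK's English stop word list, the `clean` helper is a hypothetical name, and lemmatization is omitted.

```python
import re

# Hand-picked stop words for illustration only, not NLTK's list.
toy_stop_words = {"a", "and", "is", "of", "the", "to"}

def clean(sentence, stop_words):
    """Keep letters only, lowercase, split, and drop stop words (hypothetical helper)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in stop_words)

cleaned = clean("Part-of-speech tagging, named entity recognition!", toy_stop_words)
print(cleaned)  # part speech tagging named entity recognition
```

Because hyphens are replaced with spaces, `part-of-speech` splits into three words and the stop word `of` is then removed, which matches the `part speech tagging` fragments in the output above.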


**Bag of Words (BoW) model using the scikit-learn library in Python.**
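
The cell that follows caps the vocabulary with `max_features`. As a rough plain-Python sketch of what such a cap does, the snippet below keeps only the most frequent terms across a toy corpus. `toy_corpus` and `top_k_vocabulary` are hypothetical names, and the handling of ties here is an assumption that may differ from scikit-learn's behavior.

```python
import re
from collections import Counter

toy_corpus = [
    "nlp task nlp tool",
    "nlp task text",
    "nlp language",
]

def top_k_vocabulary(docs, k):
    """Keep the k most frequent terms across docs (hypothetical helper)."""
    totals = Counter()
    for doc in docs:
        totals.update(re.findall(r"\w+", doc.lower()))
    # most_common returns (term, count) pairs sorted by descending count.
    return sorted(term for term, _ in totals.most_common(k))

limited_vocabulary = top_k_vocabulary(toy_corpus, 2)
print(limited_vocabulary)  # ['nlp', 'task']
```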

In [5]:
# Creating the Bag of Words model
# CountVectorizer is used for text preprocessing and creating the Bag of Words model.
from sklearn.feature_extraction.text import CountVectorizer

# max_features caps the vocabulary size: only the 1500 most frequent words in the corpus
# are kept as features. This corpus has fewer unique words than that, so nothing is dropped here.
vectorizer = CountVectorizer(max_features=1500)

# Converting the corpus into the Bag of Words matrix, then to a dense NumPy array
X = vectorizer.fit_transform(corpus).toarray()

print("Bag of Words Model:")
pprint(X)

Bag of Words Model:
array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
        1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
        1, 0],
       [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 