# Bag of Words (BoW) Model

### Definition

The bag of words representation is a matrix of word frequencies per document, built from raw text data. Each row corresponds to a document and each column to a word in the vocabulary. The cells of the matrix can be filled in two ways:
1. With the frequency of the word in the document (values $\geq$ 0).
2. With `0` if the word is absent from the document or `1` if it is present. This approach is also known as the binary bag of words model.

The frequency approach is more commonly used in practice. For example, the `CountVectorizer` class from scikit-learn, used later in this notebook, fills each cell with the word count by default rather than a binary value.
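
As a minimal sketch of the two filling strategies, the snippet below builds one row of the matrix by hand using only the standard library. The helper names `count_vector` and `binary_vector`, the toy vocabulary, and the toy tokens are illustrative assumptions, not part of the notebook's data.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Return one matrix row: how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def binary_vector(tokens, vocabulary):
    """Return one matrix row: 1 if the vocabulary word occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical toy vocabulary and document
vocabulary = ["deep", "learning", "machine"]
tokens = ["machine", "learning", "machine"]

frequency_row = count_vector(tokens, vocabulary)
binary_row = binary_vector(tokens, vocabulary)
print(frequency_row)  # [0, 1, 2]
print(binary_row)     # [0, 1, 1]
```

The two rows differ only where a word occurs more than once, which is why the binary variant loses information about repetition.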

### Steps

In this notebook we'll learn how to perform some pre-processing steps before building a bag of words model. 
1. Lowercase all words so that casing is uniform; otherwise “Machine” and “machine” would be treated as two separate words.
2. Remove all punctuation from the documents.
3. Tokenize each document.
4. Finally, remove all stop words from the documents.

### Duration

$\approx$ 30 minutes

## 1. Packages

In [1]:
import nltk

In [2]:
import nltk
nltk.download('stopwords')  # only needed the first time you run this notebook
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hamzasellak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Load Data

We're going to build a bag of words model based on 10 sentences.

In [3]:
# Open the file in read mode
with open("./data/documents.txt", "r") as file:
    # Read the contents of the file
    lines = file.readlines()

# Remove leading/trailing whitespace and create a list
documents = [line.strip() for line in lines]

# Print the documents
print(documents)

['Deep learning models analyze past data to make accurate predictions.', 'Artificial intelligence encompasses various techniques, including machine learning.', 'Machine learning algorithms discover patterns and make informed decisions.', 'The power of machine learning lies in its ability to learn from experience.', 'Artificial intelligence systems leverage machine learning for autonomous task completion.', 'Machine learning enables computers to adapt and improve their performance over time.', 'Analyzing historical data is crucial for machine learning algorithms to make accurate forecasts.', 'Artificial intelligence integrates machine learning for intelligent decision-making processes.', 'Machine learning algorithms autonomously uncover insights and trends in datasets.', 'The predictive capabilities of machine learning come from analyzing past data patterns.']


## 3. Pre-process

In [4]:
def preprocess(document, punctuation, stop_words):
    """
    Prepares a document to be converted into a specific word representation model.
    
    Args:
        document (str): document to be preprocessed
        punctuation (str): punctuation marks to be removed from the document
        stop_words (list): list of words to be removed from the document 
        
    Returns:
        preprocessed_document (str): pre-processed document, with the remaining words joined by single spaces
    """
    
    # Convert to lowercase
    document = document.lower()
    
    # Remove punctuation marks
    document = document.translate(str.maketrans('', '', punctuation))
    
    # Convert document into tokens
    document = document.split()
    
    # Remove stop words
    document = [word for word in document if word not in stop_words]
    
    # Join back words to make a sentence
    document = " ".join(document)
    
    return document

In [5]:
# Punctuation marks to strip (based on string.punctuation, without the apostrophe)
punctuation = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

# Stop words (NLTK corpus names are lowercase)
stop_words = stopwords.words("english") + ["a"]

# Preprocessing
preprocessed_documents = [preprocess(d, punctuation, stop_words) for d in documents]

# Print original and pre-processed documents
for i in range(len(documents)):
    print(f"(before)D{i+1} = {documents[i]}")
    print(f"(after)D{i+1} = {preprocessed_documents[i]}\n-----------")

(before)D1 = Deep learning models analyze past data to make accurate predictions.
(after)D1 = deep learning models analyze past data make accurate predictions
-----------
(before)D2 = Artificial intelligence encompasses various techniques, including machine learning.
(after)D2 = artificial intelligence encompasses various techniques including machine learning
-----------
(before)D3 = Machine learning algorithms discover patterns and make informed decisions.
(after)D3 = machine learning algorithms discover patterns make informed decisions
-----------
(before)D4 = The power of machine learning lies in its ability to learn from experience.
(after)D4 = power machine learning lies ability learn experience
-----------
(before)D5 = Artificial intelligence systems leverage machine learning for autonomous task completion.
(after)D5 = artificial intelligence systems leverage machine learning autonomous task completion
-----------
(before)D6 = Machine learning enables computers to adapt and impro

## 4. Bag of Words

We'll use the `CountVectorizer` class from the `sklearn` package to create our bag of words model. It takes all documents (sentences) as input and converts them into a matrix in which each cell holds the frequency of a vocabulary word in a document.

In [6]:
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(preprocessed_documents)
print(bow_model)  # prints the (row, column) position and count of every non-zero cell

  (0, 18)	1
  (0, 33)	1
  (0, 38)	1
  (0, 4)	1
  (0, 39)	1
  (0, 14)	1
  (0, 37)	1
  (0, 1)	1
  (0, 43)	1
  (1, 33)	1
  (1, 6)	1
  (1, 30)	1
  (1, 21)	1
  (1, 52)	1
  (1, 48)	1
  (1, 26)	1
  (1, 36)	1
  (2, 33)	1
  (2, 37)	1
  (2, 36)	1
  (2, 3)	1
  (2, 19)	1
  (2, 40)	1
  (2, 27)	1
  (2, 17)	1
  :	:
  (7, 33)	1
  (7, 6)	1
  (7, 30)	1
  (7, 36)	1
  (7, 29)	1
  (7, 31)	1
  (7, 16)	1
  (7, 45)	1
  (8, 33)	1
  (8, 36)	1
  (8, 3)	1
  (8, 8)	1
  (8, 51)	1
  (8, 28)	1
  (8, 50)	1
  (8, 15)	1
  (9, 33)	1
  (9, 39)	1
  (9, 14)	1
  (9, 36)	1
  (9, 40)	1
  (9, 5)	1
  (9, 44)	1
  (9, 9)	1
  (9, 10)	1


Next we convert the sparse output into a dense array to see the full matrix representation.

In [7]:
matrix_representation = bow_model.toarray()
print(matrix_representation)
print(f"Number of rows (documents): {matrix_representation.shape[0]}")
print(f"Number of columns (unique words): {matrix_representation.shape[1]}")

[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
  0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0
  1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
  1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1
  1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0
  1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
  1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
 [0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0
  1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0
  1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
  1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 

## 5. From a Matrix Representation To Dataframe (optional)

We can convert the bag of words (matrix) to a `pandas.DataFrame` by assigning our unique words as columns and our documents as rows.

In [8]:
# creating a dataframe from a bag of words' matrix representation
bow_to_df = pd.DataFrame(bow_model.toarray(), columns = vectorizer.get_feature_names_out())
bow_to_df

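| | ability | accurate | adapt | algorithms | analyze | analyzing | artificial | autonomous | autonomously | capabilities | ... | predictions | predictive | processes | systems | task | techniques | time | trends | uncover | various |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |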
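
One practical benefit of the DataFrame form is that corpus-level statistics become one-liners. The sketch below uses a small hand-written frame (`toy_bow` and its values are illustrative assumptions, not the notebook's output) to compute how often each word occurs overall and in how many documents it appears.

```python
import pandas as pd

# Hypothetical bag of words for two documents and three vocabulary words
toy_bow = pd.DataFrame(
    [[0, 1, 2], [1, 1, 0]],
    columns=["deep", "learning", "machine"],
)

# Total occurrences of each word across all documents (column sums)
word_totals = toy_bow.sum(axis=0).sort_values(ascending=False)

# Number of documents that contain each word at least once
document_frequency = (toy_bow > 0).sum(axis=0)

print(word_totals)
print(document_frequency)
```

The same two expressions should work unchanged on a frame such as `bow_to_df`, since they only rely on rows being documents and columns being words.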


## 6. Limitations

### Notebook Limitations

As we can observe from the `DataFrame`, the bag of words model produces many near-duplicate features (words). For instance, ‘analyze’ and ‘analyzing’, ‘autonomous’ and ‘autonomously’, and ‘predictions’ and ‘predictive’ are just a few examples. Such features are largely redundant, since each pair shares a root and adds little extra information.

Words such as ‘perfect’ and ‘perfection’ are considered equivalent for certain natural language processing tasks such as sentiment analysis, i.e., when our goal is to detect whether a document (sentence) reflects a positive or negative sentiment. Keeping separate but equivalent words increases the number of features and may hurt the performance of machine learning algorithms.

One way to solve this problem is to apply extra pre-processing steps such as stemming and lemmatization.
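
As a hedged illustration of the stemming option, the sketch below applies NLTK's `PorterStemmer` to the word pairs mentioned above. The helper `stem_document` is a hypothetical addition, not part of the notebook's `preprocess` function; it assumes the document has already been lowercased and stripped of punctuation.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_document(document):
    """Stem every whitespace-separated token of an already pre-processed document."""
    return " ".join(stemmer.stem(word) for word in document.split())


# Word pairs taken from the redundant features observed in the DataFrame
word_pairs = [
    ("analyze", "analyzing"),
    ("autonomous", "autonomously"),
    ("predictions", "predictive"),
]

# Words in the same pair should collapse to one shared stem
for first, second in word_pairs:
    print(f"{first} -> {stemmer.stem(first)} | {second} -> {stemmer.stem(second)}")

print(stem_document("analyzing past data patterns"))
```

Stems are not always dictionary words, which is the usual trade-off against lemmatization. A lemmatizer such as NLTK's `WordNetLemmatizer` returns dictionary forms but needs additional corpus data to be downloaded first.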

### Model Limitations

The bag of words model is a simple and popular approach used in natural language processing. However, it has several limitations including:

- **Loss of word order:** The bag of words model ignores the order and structure of words in a text. It treats each word as an independent entity, discarding the sequential and contextual information. As a result, it fails to capture the meaning and nuances conveyed by word order.

- **Lack of semantic understanding:** The model treats words as isolated units without considering their semantic relationships. It fails to capture the meaning of phrases, idioms, or expressions that rely on the compositionality of words. Consequently, the model may struggle to differentiate between similar words with different meanings.

- **Vocabulary size:** The bag of words model represents a text as a fixed-length vector, where each dimension corresponds to a unique word in the vocabulary. As a result, the vocabulary size can become large, leading to high-dimensional feature vectors. This can be computationally expensive and may require significant memory resources.

- **Sparsity:** In large corpora or datasets, each document contains only a small fraction of the vocabulary. This leads to sparse feature vectors with many zero entries, making it challenging to extract meaningful patterns and relationships between words.

- **Out-of-vocabulary words:** Words that were not present in the training data or not part of the predefined vocabulary are usually ignored or handled separately. This can lead to a loss of information when encountering new or rare words that could be relevant in the text analysis.
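
Two of the limitations above, loss of word order and out-of-vocabulary words, can be seen in a few lines of standard-library Python. This is a minimal sketch: the `bag_of_words` helper, the fixed toy vocabulary, and the example sentences are all illustrative assumptions.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text; other words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary fixed at "training" time
vocabulary = ["bites", "dog", "man"]

# Loss of word order: opposite meanings, identical vectors
dog_bites_man = bag_of_words("dog bites man", vocabulary)
man_bites_dog = bag_of_words("man bites dog", vocabulary)
print(dog_bites_man == man_bites_dog)  # True

# Out-of-vocabulary word: "cat" has no column, so it is silently ignored
dog_bites_cat = bag_of_words("dog bites cat", vocabulary)
print(dog_bites_cat)  # [1, 1, 0]
```

The last vector is indistinguishable from the one for the shorter text "dog bites", which shows how information about unseen words is lost.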

Despite these limitations, the bag of words model remains a useful baseline approach for many tasks. Researchers and practitioners have developed more advanced techniques, such as word embeddings and neural network-based models, to overcome some of these limitations and capture more nuanced information from text data.