In natural language processing (NLP), a "bag of words" (BoW) is a common technique for representing text data. It reduces a text to an unordered collection of words, disregarding grammar and word order but keeping track of how often each word occurs. The term "bag" implies that the words are treated as individual, isolated elements, much like items in a bag, without regard to the order in which they appear.
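
To make the "bag" idea concrete, here is a minimal sketch using only Python's standard library. The two example sentences and the helper name `bag` are illustrative assumptions and are not part of the pipeline built later in this notebook.

```python
from collections import Counter

def bag(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word.
    return Counter(text.lower().split())

a = bag("the dog bit the man")
b = bag("the man bit the dog")

print(a["the"])  # the frequency of "the" is kept
print(a == b)    # word order is discarded, so both sentences give the same bag
```

The two sentences mean different things but produce identical bags, which is the information the representation gives up.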

The following steps explain how the bag-of-words model works:

1. Tokenization: The first step is to break down a piece of text into individual words or tokens. This process involves removing punctuation and splitting the text into words.

2. Vocabulary creation: Create a vocabulary containing all unique words present in the entire corpus (collection of documents). Each word in the vocabulary is assigned a unique index.

3. Vectorization: Represent each document in the corpus as a vector. The vector has the same length as the vocabulary, and each position corresponds to a word in the vocabulary. The value at each position is the frequency of the corresponding word in the document.
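
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus `docs` is made up, and the vocabulary is sorted so that the result is reproducible, which the notebook's own code below does not do.

```python
import re

# Toy corpus (an assumption for illustration only).
docs = ["The cat sat.", "The cat saw the dog!"]

# 1. Tokenization: lowercase, then keep runs of letters and digits.
tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]

# 2. Vocabulary creation: sorted unique tokens, each mapped to an index.
vocab = sorted({tok for toks in tokenized for tok in toks})
index = {tok: i for i, tok in enumerate(vocab)}

# 3. Vectorization: one count vector per document, as long as the vocabulary.
vectors = []
for toks in tokenized:
    vec = [0] * len(vocab)
    for tok in toks:
        vec[index[tok]] += 1
    vectors.append(vec)

print(vocab)
print(vectors)
```

Every vector has the same length as the vocabulary, and a word that appears twice in a document (here "the" in the second one) gets a count of 2.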

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
import re
import numpy as np

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [2]:
sentences = ["We are learning about Natural Language Processing", "Natural Language Processing helps computers understand language data", "The field of Natural Language Processing is evolving everyday"]

In [3]:
# Creating a pandas series object from the list of sentences
corpus = pd.Series(sentences)
corpus

0    We are learning about Natural Language Processing
1    Natural Language Processing helps computers un...
2    The field of Natural Language Processing is ev...
dtype: object

In [4]:
def text_clean(corpus, keep_list):
    '''
    Purpose : Function to keep only alphabets, digits and certain words (punctuations, qmarks, tabs etc. removed)

    Input : Takes a text corpus, 'corpus' to be cleaned along with a list of words, 'keep_list', which have to be retained
            even after the cleaning process

    Output : Returns the cleaned text corpus

    '''
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # Replace every non-alphanumeric character with a space, then lowercase
                p1 = re.sub(pattern=r'[^a-zA-Z0-9]', repl=' ', string=word)
                p1 = p1.lower()
                qs.append(p1)
            else:
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    # Series.append was removed in pandas 2.0, so build the Series once from a list
    return pd.Series(cleaned_rows, dtype=object)

In [5]:
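def stopwords_removal(corpus):
    # Take the wh-words out of the stop list so they are not removed from the text
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    # Split each document into tokens and drop the stop words
    corpus = [[word for word in doc.split() if word not in stop] for doc in corpus]
    return corpus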

In [6]:
def lemmatize(corpus):
    lem = WordNetLemmatizer()
    # Lemmatize every token as a verb (pos='v'), e.g. 'learning' -> 'learn'
    corpus = [[lem.lemmatize(word, pos = 'v') for word in doc] for doc in corpus]
    return corpus

In [7]:
def stem(corpus, stem_type = None):
    # 'snowball' selects the Snowball (Porter2) stemmer; anything else falls back to Porter
    if stem_type == 'snowball':
        stemmer = SnowballStemmer(language = 'english')
    else:
        stemmer = PorterStemmer()
    corpus = [[stemmer.stem(word) for word in doc] for doc in corpus]
    return corpus

In [8]:
def preprocess(corpus, keep_list, cleaning = True, stemming = False, stem_type = None, lemmatization = False, remove_stopwords = True):
    '''
    Purpose : Function to perform all pre-processing tasks (cleaning, stemming, lemmatization, stopwords removal etc.)

    Input :
    'corpus' - Text corpus on which pre-processing tasks will be performed
    'keep_list' - List of words to be retained during cleaning process
    'cleaning', 'stemming', 'lemmatization', 'remove_stopwords' - Boolean variables indicating whether a particular task should
                                                                  be performed or not
    'stem_type' - Choose between Porter stemmer or Snowball(Porter2) stemmer. Default is "None", which corresponds to Porter
                  Stemmer. 'snowball' corresponds to Snowball Stemmer

    Note : Either stemming or lemmatization should be used. There's no benefit of using both of them together

    Output : Returns the processed text corpus

    '''

    if cleaning:
        corpus = text_clean(corpus, keep_list)

    if remove_stopwords:
        corpus = stopwords_removal(corpus)
    else:
        # Tokenize only, so the later steps always receive lists of tokens
        corpus = [doc.split() for doc in corpus]

    if lemmatization:
        corpus = lemmatize(corpus)

    if stemming:
        corpus = stem(corpus, stem_type)

    # Join the tokens back into one string per document
    corpus = [' '.join(doc) for doc in corpus]

    return corpus

In [9]:
common_dot_words = ['U.S.', 'Mr.', 'Mrs.', 'D.C.']

In [10]:
# Preprocess the corpus using the NLP pipeline.

preprocessed_corpus = preprocess(corpus,
    keep_list = common_dot_words, stemming = False,
    stem_type = None, lemmatization = True,
    remove_stopwords = True)



In [11]:
preprocessed_corpus

['learn natural language process',
 'natural language process help computers understand language data',
 'field natural language process evolve everyday']

In [12]:
# Build our vocabulary
set_of_words = set()
for sentence in preprocessed_corpus:
  for word in sentence.split():
    set_of_words.add(word)
vocab = list(set_of_words)
print(vocab)

['evolve', 'computers', 'field', 'everyday', 'data', 'help', 'language', 'learn', 'process', 'natural', 'understand']


In [13]:
# Fetch the position/index of each token in the vocabulary
position = {}
for i, token in enumerate(vocab):
  position[token] = i
print(position)

{'evolve': 0, 'computers': 1, 'field': 2, 'everyday': 3, 'data': 4, 'help': 5, 'language': 6, 'learn': 7, 'process': 8, 'natural': 9, 'understand': 10}


In [14]:
# Creating a placeholder matrix for holding the BoW.
# The shape of the matrix is (number of sentences * length of vocabulary)
bow_matrix = np.zeros((len(preprocessed_corpus), len(vocab)))

The code above uses the NumPy library to create bow_matrix, a matrix filled with zeros whose shape is (len(preprocessed_corpus), len(vocab)). Here preprocessed_corpus is the list of preprocessed sentences (cleaned of punctuation and other noise, with stop words removed and the remaining words lemmatized), and vocab is the list of distinct words in that corpus. The built-in len function returns the number of elements in each list. The result is a two-dimensional array with as many rows as there are elements in preprocessed_corpus and as many columns as there are elements in vocab, which is 3 rows by 11 columns here. A zero matrix like this has many uses in natural language processing; here it serves as the bag-of-words matrix. In a bag-of-words model, each row corresponds to a document or sentence, each column corresponds to a particular word in the vocabulary, and the entry at position (i, j) measures the occurrence of word j in document i, for example its raw frequency or its TF-IDF score.
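
As a quick check of the shape described above, the following sketch rebuilds the placeholder matrix on its own. The two lists are copied from the outputs printed earlier in this notebook; the integer variant `bow_int` is an optional assumption and is not used by the notebook's own code.

```python
import numpy as np

# Values copied from the outputs shown earlier in this notebook.
preprocessed_corpus = [
    'learn natural language process',
    'natural language process help computers understand language data',
    'field natural language process evolve everyday',
]
vocab = ['evolve', 'computers', 'field', 'everyday', 'data', 'help',
         'language', 'learn', 'process', 'natural', 'understand']

bow_matrix = np.zeros((len(preprocessed_corpus), len(vocab)))
print(bow_matrix.shape)  # (3, 11): one row per sentence, one column per word
print(bow_matrix.dtype)  # float64, the default dtype of np.zeros

# Optional: counts are whole numbers, so an integer dtype also works.
bow_int = np.zeros((len(preprocessed_corpus), len(vocab)), dtype=int)
print(bow_int.dtype)
```

The default float64 dtype is why the matrix is displayed with entries such as `0.` rather than `0`.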

In [None]:
# Increment the count at each word's position every time the word appears in a sentence

for i, preprocessed_sentence in enumerate(preprocessed_corpus):
    for token in preprocessed_sentence.split():
        bow_matrix[i][position[token]] = \
                            bow_matrix[i][position[token]] + 1

The bag-of-words model represents each document or sentence as a vector in an m-dimensional space, where m is the number of distinct tokens (for example, words) in the corpus. The order of tokens is disregarded, hence the term "bag". The code above performs the following operations:
1. for i, preprocessed_sentence in enumerate(preprocessed_corpus) iterates over the sentences, with i as the row index of the current sentence.

2. for token in preprocessed_sentence.split() splits each sentence into separate words using the split method, which by default splits on whitespace.

3. bow_matrix[i][position[token]] + 1 increases the count stored for that token in row i of the BoW matrix. Here bow_matrix is the 2D NumPy array created earlier and position is the dictionary that maps each token to its column index. The final BoW matrix stores the number of times each word appears in each sentence, giving a comprehensive, although crude and context-free, representation of the text data in the entire corpus.
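
The same counts can be cross-checked with `collections.Counter`. This is a sketch under stated assumptions rather than the notebook's own method: the lists are copied from the outputs printed earlier, and the names `row_sums` and `second_row` are introduced here only for illustration.

```python
from collections import Counter
import numpy as np

# Values copied from the outputs shown earlier in this notebook.
preprocessed_corpus = [
    'learn natural language process',
    'natural language process help computers understand language data',
    'field natural language process evolve everyday',
]
vocab = ['evolve', 'computers', 'field', 'everyday', 'data', 'help',
         'language', 'learn', 'process', 'natural', 'understand']
position = {token: i for i, token in enumerate(vocab)}

bow_matrix = np.zeros((len(preprocessed_corpus), len(vocab)), dtype=int)
for i, sentence in enumerate(preprocessed_corpus):
    # Count each token once per sentence, then write the counts into row i.
    for token, count in Counter(sentence.split()).items():
        bow_matrix[i, position[token]] = count

# Each row sum equals the number of tokens in that sentence.
row_sums = bow_matrix.sum(axis=1)
print(row_sums)

# Read one row back as {word: count}, skipping the zero entries.
second_row = {vocab[j]: int(c) for j, c in enumerate(bow_matrix[1]) if c > 0}
print(second_row)
```

Reading a row back this way shows what survives in the representation: which words occur and how often, but not their order.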

In [15]:
bow_matrix

array([[0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0.],
       [0., 1., 0., 0., 1., 1., 2., 0., 1., 1., 1.],
       [1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0.]])