### NLP - Bag-of-words

## Bag of words (BoW) model in NLP
https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/

In [1]:
# paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
#                the world have come and invaded us, captured our lands, conquered our minds. 
#                From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
#                the French, the Dutch, all of them came and looted us, took over what was ours. 
#                Yet we have not done this to any other nation. We have not conquered anyone. 
#                We have not grabbed their land, their culture, 
#                their history and tried to enforce our way of life on them. 
#                Why? Because we respect the freedom of others. That is why my 
#                first vision is that of freedom. I believe that India got its first vision of 
#                this in 1857, when we started the War of Independence. It is this freedom that
#                we must protect and nurture and build on. If we are not free, no one will respect us.
#                My second vision for India’s development. For fifty years we have been a developing nation.
#                It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
#                in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
#                Our achievements are being globally recognised today. Yet we lack the self-confidence to
#                see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
#                I have a third vision. India must stand up to the world. Because I believe that unless India 
#                stands up to the world, no one will respect us. Only strength respects strength. We must be 
#                strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
#                My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
#                space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
#                I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
#                I see four milestones in my career."""

### Implementation of Bag of Words(NLP) Using sklearn CountVectorizer
https://medium.com/analytics-vidhya/implementation-of-bag-of-words-nlp-397f4cf67970

#### What is Bag of Words (BoW)?
Bag of words (BoW) is a way of representing text data so that it can be used by machine learning algorithms. It is simple to implement and flexible to apply.

A BoW model converts text into a matrix of features. Each column corresponds to a word in the vocabulary, each row to a sentence (or document), and each cell records how often that word occurs in that sentence. The resulting matrix can then be passed to a machine learning algorithm.
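As a minimal sketch of this idea, the following builds a count matrix by hand. It assumes plain lowercase whitespace tokenization, and the names `docs`, `vocabulary`, and `matrix` are illustrative only.

```python
from collections import Counter

# Toy corpus; tokenization is a plain lowercase whitespace split (an assumption).
docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word in the corpus, sorted so column order is stable.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word, each cell a count.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The last row contains a 2 in the `the` column because that word occurs twice in the third sentence.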

In [16]:
import nltk 
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/koushikdev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
text = "sky is nice. clouds are nice. Sky is nice and Clouds are nice."

In [43]:
# step 1 => tokenization
from nltk.tokenize import sent_tokenize
sentences=sent_tokenize(text)

In [45]:
sentences

['sky is nice.', 'clouds are nice.', 'Sky is nice and Clouds are nice.']

In [59]:
import re
# Remove periods using regex
cleaned_sentences = [re.sub(r'\.', '', sentence) for sentence in sentences]

In [57]:
cleaned_sentences

['sky is nice', 'clouds are nice', 'Sky is nice and Clouds are nice']

In [75]:
# Tokenize each sentence into words and remove stop words
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
cleaned_sentence = []

for sentence in cleaned_sentences:
    words = sentence.lower().split()  # lowercase, then split the sentence into words
    words = [w for w in words if w not in stop_words]  # drop stop words
    cleaned_sentence.append(" ".join(words))

In [77]:
cleaned_sentence

['sky nice', 'clouds nice', 'sky nice clouds nice']

In [141]:
corpus = cleaned_sentence
corpus

['sky nice', 'clouds nice', 'sky nice clouds nice']

In [107]:
##importing Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

# cv = CountVectorizer(max_features = 3)  ##give it a max features as 3
# Bagofwords = cv.fit_transform(cleaned_sentence).toarray()

# CountVectorizer lowercases the text by default, and its default token pattern
# keeps only tokens of two or more alphanumeric characters, so punctuation is dropped.
#
# max_features=3 limits the vocabulary to the 3 most frequent words in the corpus.
# This helps control vocabulary size on large datasets. The cleaned corpus here
# contains only 3 distinct words, so the limit does not remove anything.
vectorizer = CountVectorizer(max_features=3)
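To illustrate what a vocabulary cap does, here is a rough pure-Python sketch. The helper `top_k_vocabulary` is hypothetical, and its alphabetical tie-breaking is an assumption that may differ from scikit-learn's behavior.

```python
from collections import Counter

def top_k_vocabulary(documents, k):
    """Keep the k most frequent words across the corpus (illustrative helper)."""
    totals = Counter(w for doc in documents for w in doc.lower().split())
    # Rank by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    # Return the surviving words in alphabetical order.
    return sorted(word for word, _ in ranked[:k])

docs = ["sky nice", "clouds nice", "sky nice clouds nice", "rain"]
print(top_k_vocabulary(docs, 3))
print(top_k_vocabulary(docs, 10))
```

With a cap of 3 the rare word `rain` is dropped. With a cap larger than the number of distinct words, nothing is removed.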

In [143]:
'''
Fit: Learns the vocabulary from the corpus (i.e., all unique words found across the documents).
Transform: Converts the documents into a matrix of token counts.
X is a sparse matrix representing the word counts in each document.
''' 
X = vectorizer.fit_transform(corpus) 
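The split between fit and transform can be sketched with a small stand-in class. `TinyCountVectorizer` is hypothetical and heavily simplified. It is not scikit-learn's implementation, and it assumes whitespace tokenization.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical, simplified stand-in used only for illustration."""

    def fit(self, documents):
        # Learn the vocabulary: map each distinct word to a column index.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {word: idx for idx, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count only words in the learned vocabulary; unseen words are skipped.
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            row = [0] * len(self.vocabulary_)
            for word, idx in self.vocabulary_.items():
                row[idx] = counts[word]
            rows.append(row)
        return rows

corpus = ["sky nice", "clouds nice", "sky nice clouds nice"]
vec = TinyCountVectorizer().fit(corpus)
train_rows = vec.transform(corpus)
new_rows = vec.transform(["nice rain"])  # "rain" was never seen during fit
print(train_rows)
print(new_rows)
```

Because the vocabulary is fixed at fit time, the word `rain` contributes nothing to the vector for the new sentence.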

In [149]:
Vocabulary = vectorizer.get_feature_names_out()
print("Vocabulary:",Vocabulary)

Vocabulary: ['clouds' 'nice' 'sky']


In [159]:
Bagofwords = X.toarray()

In [157]:
Bagofwords

array([[0, 1, 1],
       [1, 1, 0],
       [1, 2, 1]])



In [88]:

import pandas as pd  # needed by create_data_frame below

# Function to create and style a DataFrame
def create_data_frame(dataset):
    data = {
        'Index': range(len(dataset)),
        'Type': ['str' for _ in dataset],
        'Size': [len(sentence.split()) for sentence in dataset],
        'Value': dataset
    }
    df = pd.DataFrame(data)
    
    # Style the DataFrame
    styled_df = df.style.set_table_attributes('style="width:100%; border-collapse: collapse;"').set_properties(**{'border': '1px solid black', 'padding': '5px'})
    
    return styled_df

In [90]:
import pandas as pd
import nltk
import re

# Sample text
text = """I have three visions for India. In 3000 years of our history, people from all over 
the world have come and invaded us, captured our lands, conquered our minds.
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
the French, the Dutch, all of them came and looted us, took over what was ours.
Yet we have not done this to any other nation. We have not conquered anyone.
We have not grabbed their land, their culture,
their history and tried to enforce our way of life on them.
Why? Because we respect the freedom of others. That is why my
first vision is that of freedom. I believe that India got its first vision of
this in 1857, when we started the War of Independence. It is this freedom that
we must protect and nurture and build on. If we are not free, no one will respect us.
My second vision for India’s development. For fifty years we have been a developing nation.
It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
Our achievements are being globally recognised today. Yet we lack the self-confidence to
see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
I have a third vision. India must stand up to the world. Because I believe that unless India
stands up to the world, no one will respect us. Only strength respects strength. We must be
strong not only as a military power but also as an economic power. Both must go hand-in-hand.
My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
I see four milestones in my career."""

# Tokenize and preprocess
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W', ' ', dataset[i])
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])

df = create_data_frame(dataset)
df

| Index | Type | Size | Value |
|---|---|---|---|
| 0 | str | 6 | i have three visions for india |
| 1 | str | 23 | in 3000 years of our history people from all over the world have come and invaded us captured our lands conquered our minds |
| 2 | str | 29 | from alexander onwards the greeks the turks the moguls the portuguese the british the french the dutch all of them came and looted us took over what was ours |
| 3 | str | 10 | yet we have not done this to any other nation |
| 4 | str | 5 | we have not conquered anyone |
| 5 | str | 20 | we have not grabbed their land their culture their history and tried to enforce our way of life on them |
| 6 | str | 1 | why |
| 7 | str | 7 | because we respect the freedom of others |
| 8 | str | 10 | that is why my first vision is that of freedom |
| 9 | str | 19 | i believe that india got its first vision of this in 1857 when we started the war of independence |


In [None]:
# Save DataFrame to CSV
# df.to_csv('preprocessed_sentences.csv', index=False)


### Another Example 

In [74]:
corpus = '''
Beans. I was trying to explain to somebody as
we were flying in, that’s corn.  That’s beans. And they were very impressed at my 
agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction.
I have a bunch of good friends here today, including somebody who I served with, 
who is one of the finest senators in the country, and we’re lucky to have him, your Senator,
Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen 
in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. 
I want to thank President Killeen and everybody at the U of I System for making it possible 
for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. 
He is somebody who set the path for so much outstanding public service here in Illinois. 
Now, I want to start by addressing the elephant in the room. I know people are still wondering why 
I didn’t speak at the commencement.
'''

In [94]:
import nltk 
import re 
import numpy as np 

# Convert text to lower case.
# Replace all non-word characters (including punctuation) with spaces.
# Collapse runs of whitespace into a single space.

dataset = nltk.sent_tokenize(corpus)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W', ' ', dataset[i])
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])


In [96]:
df = create_data_frame(dataset)
df

| Index | Type | Size | Value |
|---|---|---|---|
| 0 | str | 1 | beans |
| 1 | str | 15 | i was trying to explain to somebody as we were flying in that s corn |
| 2 | str | 3 | that s beans |
| 3 | str | 9 | and they were very impressed at my agricultural knowledge |
| 4 | str | 12 | please give it up for amaury once again for that outstanding introduction |
| 5 | str | 38 | i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here |
| 6 | str | 28 | i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have |
| 7 | str | 8 | and it s great to see you governor |
| 8 | str | 24 | i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today |
| 9 | str | 16 | and i am deeply honored at the paul douglas award that is being given to me |
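Fragments such as `that s` and `haven t` in the table above come from the cleanup step, which turns apostrophes into spaces. The sketch below reproduces that behavior on two short strings. The helper name `clean` is illustrative.

```python
import re

def clean(sentence):
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    sentence = sentence.lower()
    sentence = re.sub(r"\W", " ", sentence)   # punctuation and apostrophes become spaces
    sentence = re.sub(r"\s+", " ", sentence)  # runs of whitespace become one space
    return sentence

# \u2019 is the typographic apostrophe used in the corpus text.
print(repr(clean("That\u2019s beans.")))
print(repr(clean("I haven\u2019t seen him, Governor!")))
```

If contractions matter for the task, one option is to strip apostrophes before applying `\W`, so that `haven’t` becomes `havent` rather than two tokens.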


## Step #2 : Obtaining most frequent words in our text.

We apply the following steps to build the model:

1. Declare a dictionary to hold our bag of words.
2. Tokenize each sentence into words.
3. For each word in a sentence, check whether it already exists in the dictionary.
4. If it does, increment its count by 1. If it doesn’t, add it to the dictionary with a count of 1.
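The counting steps above can also be sketched with `collections.Counter` from the standard library, which handles the "add or increment" logic itself. This sketch assumes whitespace tokenization instead of `nltk.word_tokenize`, and the sample sentences are made up for illustration.

```python
from collections import Counter

# Already-cleaned sample sentences (illustrative, not taken from the corpus).
sentences = ["i want to start", "i know people", "i want more"]

word_counts = Counter()
for sentence in sentences:
    # update() adds 1 for each occurrence, creating missing keys as needed.
    word_counts.update(sentence.split())

print(word_counts["i"])
print(word_counts.most_common(2))
```

`most_common(n)` returns the `n` most frequent words with their counts, which is the information needed to pick the vocabulary.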

In [105]:
dataset

[' beans ',
 'i was trying to explain to somebody as we were flying in that s corn ',
 'that s beans ',
 'and they were very impressed at my agricultural knowledge ',
 'please give it up for amaury once again for that outstanding introduction ',
 'i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here ',
 'i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have ',
 'and it s great to see you governor ',
 'i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today ',
 'and i am deeply honored at the paul douglas award that is being given to me ',
 'he is somebody who set the path for so much outstanding public service here in illinois ',
 'now i want to start by addressing the elephant in the room ',
 'i know people are still wond

In [101]:
# Creating the Bag of Words model: count how often each word occurs
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        # dict.get returns 0 for a word that has not been seen yet
        word2count[word] = word2count.get(word, 0) + 1


In [103]:
word2count

{'beans': 2,
 'i': 12,
 'was': 1,
 'trying': 1,
 'to': 8,
 'explain': 1,
 'somebody': 3,
 'as': 1,
 'we': 2,
 'were': 2,
 'flying': 1,
 'in': 5,
 'that': 4,
 's': 3,
 'corn': 1,
 'and': 7,
 'they': 1,
 'very': 1,
 'impressed': 1,
 'at': 4,
 'my': 1,
 'agricultural': 1,
 'knowledge': 1,
 'please': 1,
 'give': 1,
 'it': 3,
 'up': 1,
 'for': 5,
 'amaury': 1,
 'once': 1,
 'again': 1,
 'outstanding': 2,
 'introduction': 1,
 'have': 3,
 'a': 2,
 'bunch': 1,
 'of': 3,
 'good': 1,
 'friends': 1,
 'here': 5,
 'today': 2,
 'including': 1,
 'who': 4,
 'served': 1,
 'with': 1,
 'is': 4,
 'one': 1,
 'the': 9,
 'finest': 1,
 'senators': 1,
 'country': 1,
 're': 1,
 'lucky': 1,
 'him': 1,
 'your': 1,
 'senator': 1,
 'dick': 1,
 'durbin': 1,
 'also': 1,
 'noticed': 1,
 'by': 2,
 'way': 1,
 'former': 1,
 'governor': 2,
 'edgar': 1,
 'haven': 1,
 't': 2,
 'seen': 1,
 'long': 1,
 'time': 1,
 'somehow': 1,
 'he': 2,
 'has': 1,
 'not': 1,
 'aged': 1,
 'great': 1,
 'see': 1,
 'you': 1,
 'want': 2,
 'thank':

In [107]:
# Select the 100 most frequent words; they become the columns of the matrix below
import heapq
freq_words = heapq.nlargest(100, word2count, key=word2count.get)


In [111]:
# freq_words

In [117]:
# Build a binary matrix: 1 if the word occurs in the sentence, 0 otherwise
X = []
for data in dataset:
    tokens = set(nltk.word_tokenize(data))  # tokenize each sentence only once
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)
X = np.asarray(X)

# Create a DataFrame from the matrix for better visualization
df = pd.DataFrame(X, columns=freq_words)

# Show every row and column when printing the DataFrame
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)     # Show all rows

# Print the DataFrame
print(df)

    i  the  to  and  in  for  here  that  at  who  is  somebody  s  it  have  \
0   0    0   0    0   0    0     0     0   0    0   0         0  0   0     0   
1   1    0   1    0   1    0     0     1   0    0   0         1  1   0     0   
2   0    0   0    0   0    0     0     1   0    0   0         0  1   0     0   
3   0    0   0    1   0    0     0     0   1    0   0         0  0   0     0   
4   0    0   0    0   0    1     0     1   0    0   0         0  0   1     0   
5   1    1   1    1   1    0     1     0   0    1   1         1  0   0     1   
6   1    1   0    1   1    0     1     0   0    1   0         0  0   0     1   
7   0    0   1    1   0    0     0     0   0    0   0         0  1   1     0   
8   1    1   1    1   0    1     1     0   1    0   0         0  0   1     0   
9   1    1   1    1   0    0     0     1   1    0   1         0  0   0     0   
10  0    1   0    0   1    1     1     0   0    1   1         1  0   0     0   
11  1    1   1    0   1    0     0     0

In [115]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])
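Note that the matrix above records only whether a word is present (0 or 1), not how many times it occurs. The sketch below contrasts the two encodings for one sentence. The three-word `freq_words` list and the sentence are assumptions made for illustration.

```python
freq_words = ["i", "the", "to"]  # assumed vocabulary for this sketch
sentence = "i want to thank the team and i want to go"
tokens = sentence.split()

# Presence encoding: 1 if the word occurs at least once.
binary_vector = [1 if word in tokens else 0 for word in freq_words]

# Count encoding: number of occurrences of each word.
count_vector = [tokens.count(word) for word in freq_words]

print(binary_vector)
print(count_vector)
```

Which encoding works better depends on the task. Counts preserve more information, while presence flags are less sensitive to sentence length.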

### Bag of Words (BoW) Model: Advantages and Disadvantages

In [13]:
# pip install --upgrade pandas


In [9]:
import pandas as pd

# Define the data for advantages and disadvantages
data = {
    "Aspect": [
        "Simplicity", 
        "Efficiency", 
        "Feature Representation", 
        "Compatibility", 
        "Semantic Understanding", 
        "Vocabulary Handling", 
        "Model Training", 
        "Generalization", 
        "Use in NLP"
    ],
    "Advantages": [
        "Easy to implement and understand.",
        "Efficient for small datasets and basic tasks.",
        "Provides a fixed-length vector representation.",
        "Works well with traditional machine learning algorithms.",
        "Does not require any semantic knowledge or language rules.",
        "No need for handling complex linguistic variations.",
        "Simplicity leads to faster model training in basic use cases.",
        "Works for general text analysis tasks without the need for deep language understanding.",
        "Suitable for tasks like spam detection and basic text classification."
    ],
    "Disadvantages": [
        "Ignores word order and context, losing semantic meaning.",
        "High dimensionality for large vocabularies, leading to inefficiency.",
        "Generates sparse matrices, which can be computationally expensive.",
        "Limited to algorithms that can handle high-dimensional sparse data.",
        "Cannot differentiate between synonyms or understand polysemy.",
        "Cannot handle out-of-vocabulary (OOV) words, and treats variations of a word as separate features.",
        "Prone to overfitting if the vocabulary size is too large or underfitting if it is too small.",
        "Limited ability to generalize to new or unseen text due to fixed vocabulary.",
        "Ineffective for complex NLP tasks that require contextual or sequential information."
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Style the DataFrame with full width and cell borders (the index is still shown)
styled_df = df.style.set_table_attributes('style="width:100%; border-collapse: collapse;"') \
                    .set_properties(**{'border': '1px solid black'})

# Display the styled DataFrame (without index)
# styled_df


In [11]:
# Convert to HTML without index
html_output = df.to_html(index=False, border=0, justify='center')

# Display the HTML output in a Jupyter notebook cell
from IPython.display import display, HTML
display(HTML(html_output))


| Aspect | Advantages | Disadvantages |
|---|---|---|
| Simplicity | Easy to implement and understand. | Ignores word order and context, losing semantic meaning. |
| Efficiency | Efficient for small datasets and basic tasks. | High dimensionality for large vocabularies, leading to inefficiency. |
| Feature Representation | Provides a fixed-length vector representation. | Generates sparse matrices, which can be computationally expensive. |
| Compatibility | Works well with traditional machine learning algorithms. | Limited to algorithms that can handle high-dimensional sparse data. |
| Semantic Understanding | Does not require any semantic knowledge or language rules. | Cannot differentiate between synonyms or understand polysemy. |
| Vocabulary Handling | No need for handling complex linguistic variations. | Cannot handle out-of-vocabulary (OOV) words, and treats variations of a word as separate features. |
| Model Training | Simplicity leads to faster model training in basic use cases. | Prone to overfitting if the vocabulary size is too large or underfitting if it is too small. |
| Generalization | Works for general text analysis tasks without the need for deep language understanding. | Limited ability to generalize to new or unseen text due to fixed vocabulary. |
| Use in NLP | Suitable for tasks like spam detection and basic text classification. | Ineffective for complex NLP tasks that require contextual or sequential information. |
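Two of the disadvantages listed above, loss of word order and out-of-vocabulary words, can be seen in a small sketch. The helper `bow_vector` and the three-word vocabulary are illustrative assumptions.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = bow_vector("dog bites man", vocabulary)
b = bow_vector("man bites dog", vocabulary)
c = bow_vector("dog bites postman", vocabulary)

print(a == b)  # the two sentences differ in meaning but share one vector
print(c)       # "postman" is not in the vocabulary, so it leaves no trace
```

The first comparison shows that sentences with opposite meanings can map to the same vector. The last vector shows that an unseen word is silently ignored.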
