# Bag-of-Words

**Bag of Words (BoW)** is a **Natural Language Processing (NLP)** technique for extracting features from text. It represents a document by the words it contains and how often each one occurs, disregarding grammar and word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded ¹³⁴.
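
As a minimal sketch of this idea, the snippet below counts words with `collections.Counter` from the Python standard library. It assumes simple lowercasing and whitespace tokenization, which is an illustrative choice rather than a prescribed tokenizer, and the helper name `bag_of_words` is hypothetical. Two sentences with the same words in a different order produce the same bag, even though they mean different things.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text`, ignoring word order (hypothetical helper)."""
    # Assumption: lowercase and split on whitespace; real tokenizers are more careful.
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2
print(a == b)    # True: word order is discarded
```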


The BoW model converts text into a bag of words: a count of how many times each vocabulary word occurs in a document (in practice, the vocabulary is often limited to the most frequently used words). The result can be visualized as a table with one column per word and the corresponding counts in each row ⁴.


In summary, BoW is a technique that helps us to understand and analyze text data by converting it into numerical data that can be processed by machine learning algorithms.

Source: Conversation with Bing, 9/25/2023
(1) An Introduction to Bag of Words (BoW) | What is Bag of Words?. https://www.mygreatlearning.com/blog/bag-of-words/.
(2) A Gentle Introduction to the Bag-of-Words Model - Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-bag-words-model/.
(3) Bag of words (BoW) model in NLP - GeeksforGeeks. https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/.
(4) Implementation of Bag of Words(NLP) | by Raj Kumar - Medium. https://medium.com/analytics-vidhya/implementation-of-bag-of-words-nlp-397f4cf67970.

This notebook builds three variants of the model:

1. Uni-grams
2. Bi-grams
3. Tri-grams
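
The three variants listed above differ only in how many consecutive words form one feature. A minimal sketch, assuming whitespace tokenization and a hypothetical helper named `ngrams`:

```python
def ngrams(text, n):
    """Return the n-grams (as strings) of consecutive words in `text`."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    # Slide a window of n tokens across the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "prince working with data science"
print(ngrams(sentence, 1))  # ['prince', 'working', 'with', 'data', 'science']
print(ngrams(sentence, 2))  # ['prince working', 'working with', 'with data', 'data science']
print(ngrams(sentence, 3))  # ['prince working with', 'working with data', 'with data science']
```

A sentence with fewer than `n` words yields an empty list, since no full window fits.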

In [4]:
import pandas as pd
import numpy as np
from nltk import word_tokenize,sent_tokenize
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk.corpus import stopwords
import re


In [99]:
x = ["Prince data science trainer","trainer teaches data science","data science knows about prince",
     "prince working with data science"]

In [100]:
df = pd.DataFrame({"Title":x})

In [101]:
df

Unnamed: 0,Title
0,Prince data science trainer
1,trainer teaches data science
2,data science knows about prince
3,prince working with data science


In [38]:
df["Title"] = df["Title"].apply(lambda x:x.lower())

In [39]:
df

Unnamed: 0,Title
0,prince data science trainer
1,trainer teaches data science
2,data science knows about prince
3,prince working with data science


In [40]:
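# Note: lemmatize() expects a single word; given a whole title it finds no match and returns the text unchanged.
df["Title"] = df["Title"].apply(lambda x:WordNetLemmatizer().lemmatize(x,pos="v"))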

In [41]:
df

Unnamed: 0,Title
0,prince data science trainer
1,trainer teaches data science
2,data science knows about prince
3,prince working with data science


In [42]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

# Uni-grams (Bag of Words)

In [85]:
cv = CountVectorizer()

In [86]:
# Learn the vocabulary and build the document-term count matrix as a dense array
m = cv.fit_transform(df['Title']).toarray()
m

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1]], dtype=int64)

In [87]:
cv.vocabulary_

{'prince': 3,
 'data': 1,
 'science': 4,
 'trainer': 6,
 'teaches': 5,
 'knows': 2,
 'about': 0,
 'working': 8,
 'with': 7}

In [88]:
# Feature names in column order (sorted alphabetically)
new_columns = cv.get_feature_names_out()

In [89]:
v = len(new_columns)

In [90]:
print("Total Number of vocabulary :-",v)

Total Number of vocabulary :- 9


In [91]:
pd.DataFrame(m,columns=new_columns)

Unnamed: 0,about,data,knows,prince,science,teaches,trainer,with,working
0,0,1,0,1,1,0,1,0,0
1,0,1,0,0,1,1,1,0,0
2,1,1,1,1,1,0,0,0,0
3,0,1,0,1,1,0,0,1,1
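
To make the table above less of a black box, here is a rough pure-Python approximation of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically, and count occurrences per document. This is a sketch under those assumptions, not scikit-learn's actual implementation; `doc_term_matrix` is a hypothetical helper, and the regular expression is meant to mirror the documented default `token_pattern`.

```python
import re
from collections import Counter

docs = [
    "prince data science trainer",
    "trainer teaches data science",
    "data science knows about prince",
    "prince working with data science",
]

# Tokens of at least two word characters (assumed to match the default pattern).
TOKEN = re.compile(r"\b\w\w+\b")

def doc_term_matrix(documents):
    """Return (sorted vocabulary, count rows), with one row per document."""
    counts = [Counter(TOKEN.findall(d.lower())) for d in documents]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

vocab, rows = doc_term_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

For this corpus the printed vocabulary and rows should line up with the columns and values of the table above.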


In [92]:
df

Unnamed: 0,Title
0,prince data science trainer
1,trainer teaches data science
2,data science knows about prince
3,prince working with data science


# Bi-grams

In [93]:
# ngram_range=(2,2) keeps only sequences of exactly two consecutive words
bi = CountVectorizer(ngram_range=(2,2))

In [94]:
mat = bi.fit_transform(df["Title"]).toarray()

In [95]:
mat

array([[0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)

In [96]:
bi.vocabulary_

{'prince data': 3,
 'data science': 1,
 'science trainer': 6,
 'trainer teaches': 8,
 'teaches data': 7,
 'science knows': 5,
 'knows about': 2,
 'about prince': 0,
 'prince working': 4,
 'working with': 10,
 'with data': 9}

In [97]:
n_c = bi.get_feature_names_out()
n_c

array(['about prince', 'data science', 'knows about', 'prince data',
       'prince working', 'science knows', 'science trainer',
       'teaches data', 'trainer teaches', 'with data', 'working with'],
      dtype=object)

In [98]:
pd.DataFrame(mat,columns=n_c)

Unnamed: 0,about prince,data science,knows about,prince data,prince working,science knows,science trainer,teaches data,trainer teaches,with data,working with
0,0,1,0,1,0,0,1,0,0,0,0
1,0,1,0,0,0,0,0,1,1,0,0
2,1,1,1,0,0,1,0,0,0,0,0
3,0,1,0,0,1,0,0,0,0,1,1


# Tri-grams

In [81]:
# ngram_range=(3,3) keeps only sequences of exactly three consecutive words
tr = CountVectorizer(ngram_range=(3,3))
tri = tr.fit_transform(df["Title"]).toarray()
tri

array([[0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]], dtype=int64)

In [82]:
tr.vocabulary_

{'prince data science': 3,
 'data science trainer': 1,
 'trainer teaches data': 7,
 'teaches data science': 6,
 'data science knows': 0,
 'science knows about': 5,
 'knows about prince': 2,
 'prince working with': 4,
 'working with data': 9,
 'with data science': 8}

In [83]:
n_cl = tr.get_feature_names_out()
n_cl

array(['data science knows', 'data science trainer', 'knows about prince',
       'prince data science', 'prince working with',
       'science knows about', 'teaches data science',
       'trainer teaches data', 'with data science', 'working with data'],
      dtype=object)

In [84]:
pd.DataFrame(tri,columns=n_cl)

Unnamed: 0,data science knows,data science trainer,knows about prince,prince data science,prince working with,science knows about,teaches data science,trainer teaches data,with data science,working with data
0,0,1,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1,1,0,0
2,1,0,1,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,0,1,1
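
Comparing the three tables, longer n-grams are shared by fewer documents: several uni-grams appear in more than one title, only one bi-gram does, and every tri-gram is unique to a single title. The standard-library sketch below checks this on the same four titles. It assumes whitespace tokenization (sufficient here, since every word has at least two characters), and `summarize` is a hypothetical helper.

```python
from collections import Counter

docs = [
    "prince data science trainer",
    "trainer teaches data science",
    "data science knows about prince",
    "prince working with data science",
]

def summarize(n):
    """Return (vocabulary size, n-grams found in more than one document)."""
    doc_freq = Counter()
    for d in docs:
        tokens = d.split()
        # A set, so each n-gram is counted at most once per document.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    shared = sorted(g for g, k in doc_freq.items() if k > 1)
    return len(doc_freq), shared

for n in (1, 2, 3):
    print(n, *summarize(n))
# 1 9 ['data', 'prince', 'science', 'trainer']
# 2 11 ['data science']
# 3 10 []
```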

