# Bag of Words (BoW)

### What is it?
- Bag of Words is a text representation technique that converts text into numerical data by:
  - Creating a vocabulary of unique words from the text.
  - Counting the frequency of each word in the text.

### Why Use It?
- To convert text into a machine-readable format for machine learning models.
- Useful for text classification, sentiment analysis, and spam detection.

### Advantages
- Simple and easy to implement.
- Effective for basic text classification tasks.
- Largely language-independent: it works for any language whose text can be split into word tokens.

### Disadvantages
- Ignores word order and context.
- High dimensionality: one feature per vocabulary word, so the vectors are long and mostly zeros (sparse).
- Does not capture word meanings or relationships.
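
The first disadvantage can be seen in a minimal sketch. The helper `bag_of_words` below is a hypothetical name introduced only for illustration; it splits on whitespace and counts words, which is a simplification of what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The two sentences mean different things, yet their bags are identical,
# because only the counts are kept and the word order is discarded.
print(a == b)
```

Any model trained on these counts alone cannot tell the two sentences apart.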

### Example
#### Input Texts:
1. "I love programming."
2. "Programming is fun."

#### Vocabulary:
`['i', 'love', 'programming', 'is', 'fun']`

#### BoW Representation:
| Text         | i | love | programming | is | fun |
|--------------|---|------|-------------|----|-----|
| Text 1       | 1 | 1    | 1           | 0  | 0   |
| Text 2       | 0 | 0    | 1           | 1  | 1   |
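
The table above can be reproduced with a few lines of plain Python. This is only a sketch: the `tokenize` helper is an assumption made for this example (lowercase, letters only), and the vocabulary is kept in order of first appearance to match the table.

```python
import re

texts = ["I love programming.", "Programming is fun."]

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance.
vocabulary = []
for text in texts:
    for token in tokenize(text):
        if token not in vocabulary:
            vocabulary.append(token)

# One row per text, one column per vocabulary word.
bow = [[tokenize(text).count(word) for word in vocabulary] for text in texts]

print(vocabulary)
for row in bow:
    print(row)
```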


In [None]:
import pandas as pd
df=pd.read_csv('smsspam.txt', sep='\t', names=['label', 'message'])

In [65]:
df

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [66]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91830\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [67]:
# Text preprocessing: keep letters only, lowercase, drop stopwords, stem.
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))  # build the set once instead of reloading the list for every word
corpus=[]
for message in df['message']:
    review=re.sub('[^a-zA-Z]', ' ', message)  # replace digits, punctuation, and special characters with spaces
    review=review.lower()
    review=review.split()
    review=[ps.stem(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)
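
To see what the cleaning steps do to a single message, here is a small standalone sketch. It deliberately skips stemming and uses a tiny hypothetical stop-word set (`STOP_WORDS`) instead of NLTK's English list, so its output will not match the notebook's `corpus` exactly.

```python
import re

# Hypothetical mini stop-word list for illustration only;
# the notebook uses NLTK's full English list.
STOP_WORDS = {"in", "a", "to", "the", "is"}

def clean(message, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split, and drop stop words (no stemming)."""
    letters_only = re.sub("[^a-zA-Z]", " ", message)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean("Free entry in 2 a wkly comp to win FA Cup"))
```

The digit `2` and the stop words `in`, `a`, and `to` disappear; with a Porter stemmer added, `entry` and `wkly` would additionally become `entri` and `wkli`, as in the notebook output.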

In [68]:
corpus

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free',
 'gonna home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw',
 'search right word thank breather

In [72]:
# Create the bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer()  # by default max_features=None, ngram_range=(1,1), binary=False

In [73]:
X=cv.fit_transform(corpus).toarray()
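
As a rough mental model of what `fit_transform` does with default settings, the sketch below tokenizes, builds an alphabetically sorted vocabulary, and counts. It is a simplified illustration, not scikit-learn's implementation; the function name `fit_transform` here is a local helper. It mimics the default token pattern, which keeps only tokens of two or more word characters, so single-letter words such as "I" are dropped.

```python
import re
from collections import Counter

def fit_transform(docs):
    """Simplified stand-in for CountVectorizer().fit_transform(docs).toarray()."""
    # Default-style tokens: lowercase, words of two or more characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Column indices follow the alphabetical order of the terms.
    terms = sorted({t for tokens in tokenized for t in tokens})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in terms])
    return vocabulary, matrix

vocabulary, X_toy = fit_transform(["I love programming.", "Programming is fun."])
print(vocabulary)
print(X_toy)
```

This also explains why single-letter tokens that survive preprocessing (for example `u` or `c`) do not show up in `cv.vocabulary_`.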

In [74]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0

In [75]:
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000,
    formatter=dict(float=lambda x: "%.3g" % x))

X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0

In [None]:
cv.vocabulary_    # maps each feature (term) to its column index in X

{'go': 2171,
 'jurong': 2827,
 'point': 4091,
 'crazi': 1169,
 'avail': 379,
 'bugi': 738,
 'great': 2245,
 'world': 6135,
 'la': 2932,
 'buffet': 736,
 'cine': 964,
 'got': 2208,
 'amor': 190,
 'wat': 5957,
 'ok': 3760,
 'lar': 2960,
 'joke': 2794,
 'wif': 6056,
 'oni': 3785,
 'free': 2007,
 'entri': 1673,
 'wkli': 6101,
 'comp': 1058,
 'win': 6067,
 'fa': 1791,
 'cup': 1220,
 'final': 1890,
 'tkt': 5536,
 'st': 5103,
 'may': 3276,
 'text': 5420,
 'receiv': 4402,
 'question': 4319,
 'std': 5131,
 'txt': 5695,
 'rate': 4364,
 'appli': 262,
 'dun': 1551,
 'say': 4651,
 'earli': 1568,
 'hor': 2477,
 'alreadi': 163,
 'nah': 3532,
 'think': 5468,
 'goe': 2175,
 'usf': 5811,
 'live': 3070,
 'around': 302,
 'though': 5485,
 'freemsg': 2013,
 'hey': 2408,
 'darl': 1267,
 'week': 5992,
 'word': 6129,
 'back': 414,
 'like': 3042,
 'fun': 2059,
 'still': 5152,
 'tb': 5367,
 'xxx': 6202,
 'chg': 922,
 'send': 4721,
 'rcv': 4375,
 'even': 1722,
 'brother': 710,
 'speak': 5037,
 'treat': 5638,
 'ai

In [77]:
len(cv.vocabulary_)

6296

In [78]:
# With binary=False (the default), each entry of X is the raw count (frequency) of a term in a message.

In [80]:
cv=CountVectorizer(max_features=100)
X=cv.fit_transform(corpus).toarray()
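
`max_features` keeps only the terms with the highest total frequency across the corpus. A minimal sketch of that selection, assuming whitespace tokenization and alphabetical tie-breaking (both choices are assumptions of this example; `limit_vocabulary` is a hypothetical helper, not a scikit-learn function):

```python
from collections import Counter

def limit_vocabulary(corpus, max_features):
    """Keep the max_features most frequent terms and index them alphabetically."""
    totals = Counter(token for doc in corpus for token in doc.split())
    # Rank by total count (descending); ties are broken alphabetically in this sketch.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    kept = sorted(ranked[:max_features])
    return {term: idx for idx, term in enumerate(kept)}

toy_corpus = ["free call free", "call now", "free prize"]
toy_vocab = limit_vocabulary(toy_corpus, max_features=2)
print(toy_vocab)
```

Rare terms such as `now` and `prize` fall out of the vocabulary, which is how the 6296 columns above shrink to 100.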

In [81]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0

In [82]:
cv.vocabulary_

{'go': 22,
 'great': 25,
 'got': 24,
 'wat': 91,
 'ok': 58,
 'free': 18,
 'win': 95,
 'text': 78,
 'txt': 87,
 'say': 69,
 'alreadi': 0,
 'think': 81,
 'hey': 28,
 'week': 93,
 'back': 4,
 'like': 40,
 'still': 74,
 'send': 71,
 'even': 15,
 'friend': 19,
 'prize': 64,
 'claim': 8,
 'call': 5,
 'mobil': 49,
 'co': 9,
 'home': 30,
 'want': 90,
 'today': 83,
 'cash': 7,
 'day': 12,
 'repli': 66,
 'www': 97,
 'right': 67,
 'thank': 79,
 'take': 76,
 'time': 82,
 'messag': 46,
 'oh': 57,
 'ye': 98,
 'make': 44,
 'way': 92,
 'feel': 16,
 'dont': 14,
 'miss': 48,
 'ur': 88,
 'tri': 86,
 'da': 11,
 'lor': 41,
 'meet': 45,
 'realli': 65,
 'get': 20,
 'know': 34,
 'love': 42,
 'amp': 1,
 'let': 38,
 'work': 96,
 'wait': 89,
 'yeah': 99,
 'tell': 77,
 'pleas': 63,
 'msg': 51,
 'see': 70,
 'pl': 62,
 'need': 53,
 'tomorrow': 84,
 'hope': 31,
 'well': 94,
 'lt': 43,
 'gt': 26,
 'ask': 2,
 'morn': 50,
 'happi': 27,
 'sorri': 73,
 'give': 21,
 'new': 54,
 'find': 17,
 'later': 36,
 'pick': 61,
 'goo

In [83]:
len(cv.vocabulary_)

100

In [84]:
cv.get_feature_names_out()

array(['alreadi', 'amp', 'ask', 'babe', 'back', 'call', 'care', 'cash', 'claim', 'co', 'come', 'da', 'day', 'dear', 'dont', 'even', 'feel', 'find', 'free', 'friend', 'get', 'give', 'go', 'good', 'got', 'great', 'gt', 'happi', 'hey', 'hi', 'home', 'hope', 'im', 'keep', 'know', 'last', 'later', 'leav', 'let', 'life', 'like', 'lor', 'love', 'lt', 'make', 'meet', 'messag', 'min', 'miss', 'mobil', 'morn', 'msg', 'much', 'need', 'new', 'night', 'number', 'oh', 'ok', 'one', 'phone', 'pick', 'pl', 'pleas', 'prize', 'realli', 'repli', 'right', 'said', 'say', 'see', 'send', 'sleep', 'sorri', 'still', 'stop', 'take', 'tell', 'text', 'thank', 'thing', 'think', 'time', 'today', 'tomorrow', 'tone', 'tri', 'txt', 'ur', 'wait', 'want', 'wat', 'way', 'week', 'well', 'win', 'work', 'www', 'ye', 'yeah'], dtype=object)

In [85]:
cv=CountVectorizer(max_features=100, binary=True)
X=cv.fit_transform(corpus).toarray()
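
With `binary=True`, every non-zero count is recorded as 1, so the matrix stores presence or absence rather than frequency. A minimal sketch on a hand-written count matrix (the `binarize` helper and the sample values are made up for illustration):

```python
def binarize(count_matrix):
    """Replace every positive count with 1, leaving zeros untouched."""
    return [[1 if count > 0 else 0 for count in row] for row in count_matrix]

counts = [[1, 2], [1, 0], [0, 1]]
binary = binarize(counts)
print(binary)
```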

In [86]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0

In [87]:
cv.vocabulary_

{'go': 22,
 'great': 25,
 'got': 24,
 'wat': 90,
 'ok': 56,
 'free': 18,
 'win': 94,
 'text': 77,
 'txt': 85,
 'say': 67,
 'alreadi': 0,
 'think': 80,
 'hey': 28,
 'week': 92,
 'back': 3,
 'like': 38,
 'still': 73,
 'send': 69,
 'even': 15,
 'friend': 19,
 'prize': 62,
 'claim': 7,
 'call': 4,
 'mobil': 47,
 'co': 8,
 'home': 30,
 'want': 89,
 'today': 82,
 'cash': 6,
 'day': 12,
 'repli': 64,
 'www': 96,
 'right': 65,
 'thank': 78,
 'take': 75,
 'time': 81,
 'use': 87,
 'messag': 44,
 'oh': 55,
 'ye': 97,
 'make': 42,
 'way': 91,
 'feel': 16,
 'dont': 14,
 'miss': 46,
 'ur': 86,
 'tri': 84,
 'da': 11,
 'lor': 39,
 'meet': 43,
 'realli': 63,
 'get': 20,
 'know': 33,
 'love': 40,
 'let': 37,
 'work': 95,
 'wait': 88,
 'yeah': 98,
 'tell': 76,
 'pleas': 61,
 'msg': 49,
 'see': 68,
 'pl': 60,
 'need': 51,
 'tomorrow': 83,
 'hope': 31,
 'well': 93,
 'lt': 41,
 'gt': 26,
 'ask': 1,
 'morn': 48,
 'happi': 27,
 'sorri': 72,
 'give': 21,
 'new': 52,
 'find': 17,
 'year': 99,
 'later': 35,
 'pi

In [93]:
cv=CountVectorizer(max_features=100, binary=True, ngram_range=(2,2))
X=cv.fit_transform(corpus).toarray()
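
`ngram_range=(min_n, max_n)` makes the features sequences of `min_n` to `max_n` consecutive words instead of single words. The sketch below shows how such word n-grams can be generated from a token list; `word_ngrams` is a hypothetical helper written for this example, not part of scikit-learn.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "free entri wkli comp".split()
bigrams = word_ngrams(tokens, (2, 2))
bi_and_trigrams = word_ngrams(tokens, (2, 3))
print(bigrams)
print(bi_and_trigrams)
```

A range of `(2, 2)` yields bigrams only, while `(2, 3)` yields both bigrams and trigrams.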

In [94]:
cv.get_feature_names_out()

array(['account statement', 'attempt contact', 'await collect', 'call claim', 'call custom', 'call identifi', 'call land', 'call landlin', 'call later', 'call mobileupd', 'call optout', 'call per', 'camera phone', 'cash prize', 'chanc win', 'claim call', 'claim ur', 'claim valid', 'co uk', 'code expir', 'come back', 'come home', 'custom servic', 'date servic', 'decim gt', 'dont know', 'doubl min', 'draw show', 'everi week', 'free call', 'free entri', 'free text', 'get back', 'gift voucher', 'go home', 'go sleep', 'good afternoon', 'good morn', 'good night', 'great day', 'gt lt', 'gt min', 'guarante call', 'gud mrng', 'gud ni', 'happi new', 'holiday cash', 'hope good', 'identifi code', 'land line', 'land row', 'last night', 'last week', 'let know', 'like lt', 'line claim', 'lt decim', 'lt gt', 'mobil number', 'nation rate', 'nd attempt', 'new year', 'nice day', 'nokia tone', 'ok lor', 'pick phone', 'pl send', 'pleas call', 'po box', 'pobox wq', 'point call', 'privat account', 'prize cla

In [95]:
cv=CountVectorizer(max_features=100, binary=True, ngram_range=(3,3))
X=cv.fit_transform(corpus).toarray()

In [96]:
cv.get_feature_names_out()

array(['account statement show', 'admir look make', 'anytim network min', 'await collect sae', 'bonu caller prize', 'bt nation rate', 'call claim code', 'call custom servic', 'call identifi code', 'call land line', 'call mobileupd call', 'call per min', 'caller prize nd', 'camcord repli call', 'cant pick phone', 'cash await collect', 'claim easi call', 'claim valid hr', 'co uk pobox', 'collect sae cs', 'congratul ur award', 'contact find reveal', 'contact today draw', 'custom servic repres', 'draw show prize', 'draw txt music', 'easi call per', 'everi week txt', 'everi wk txt', 'find reveal think', 'free entri weekli', 'free st week', 'getz co uk', 'gt lt gt', 'guarante call land', 'guarante cash prize', 'happi new year', 'hg suit land', 'holiday await collect', 'holiday cash await', 'identifi code expir', 'land line claim', 'land row hl', 'like lt gt', 'line claim valid', 'live oper claim', 'look make contact', 'lt decim gt', 'lt gt lt', 'lt gt min', 'lt gt minut', 'lt gt th', 'ltd po

In [88]:
cv=CountVectorizer(max_features=100, binary=True, ngram_range=(2,3))
X=cv.fit_transform(corpus).toarray()

In [90]:
cv.vocabulary_

{'free entri': 33,
 'claim call': 18,
 'call claim': 4,
 'free call': 32,
 'chanc win': 17,
 'txt word': 91,
 'let know': 54,
 'go home': 36,
 'pleas call': 70,
 'lt gt': 60,
 'want go': 97,
 'like lt': 55,
 'like lt gt': 56,
 'sorri call': 83,
 'call later': 12,
 'sorri call later': 84,
 'ur award': 92,
 'call custom': 5,
 'custom servic': 25,
 'cash prize': 16,
 'call custom servic': 6,
 'po box': 71,
 'tri contact': 89,
 'draw show': 29,
 'show prize': 81,
 'prize guarante': 75,
 'guarante call': 42,
 'valid hr': 95,
 'draw show prize': 30,
 'show prize guarante': 82,
 'prize guarante call': 76,
 'select receiv': 78,
 'privat account': 72,
 'account statement': 0,
 'call identifi': 7,
 'identifi code': 48,
 'code expir': 22,
 'privat account statement': 73,
 'account statement show': 1,
 'call identifi code': 8,
 'identifi code expir': 49,
 'urgent mobil': 94,
 'call landlin': 11,
 'wat time': 98,
 'ur mob': 93,
 'gud ni': 44,
 'new year': 65,
 'send stop': 80,
 'get back': 35,
 'co

In [92]:
cv.get_feature_names_out()

array(['account statement', 'account statement show', 'attempt contact', 'await collect', 'call claim', 'call custom', 'call custom servic', 'call identifi', 'call identifi code', 'call land', 'call land line', 'call landlin', 'call later', 'call mobileupd', 'call optout', 'call per', 'cash prize', 'chanc win', 'claim call', 'claim valid', 'claim valid hr', 'co uk', 'code expir', 'come back', 'come home', 'custom servic', 'date servic', 'decim gt', 'dont know', 'draw show', 'draw show prize', 'everi week', 'free call', 'free entri', 'free text', 'get back', 'go home', 'good morn', 'good night', 'great day', 'gt lt', 'gt min', 'guarante call', 'guarante call land', 'gud ni', 'happi new', 'happi new year', 'hope good', 'identifi code', 'identifi code expir', 'land line', 'land line claim', 'land row', 'last night', 'let know', 'like lt', 'like lt gt', 'line claim', 'lt decim', 'lt decim gt', 'lt gt', 'lt gt min', 'mobil number', 'nation rate', 'nd attempt', 'new year', 'nice day', 'ok lo