## Bag Of Words
**Bag of words (BoW)** is a natural language processing technique used for text analysis and feature extraction. It involves representing a piece of text as a collection or "bag" of its individual words, disregarding grammar and word order. The BoW model counts the frequency of each word in the text and creates a numerical vector representing the occurrence of each word in the document. This vector can then be used as input for machine learning algorithms to classify or analyze the text data. BoW is commonly used in sentiment analysis, spam detection, and topic modeling.

Bag of words is a technique used in natural language processing to represent text data as a collection of words or tokens, without considering their order or context. It is called a "bag" because it treats the text as an unordered collection of words, similar to how a bag holds objects in no particular order.

Bag of words is used because it simplifies the text data and makes it easier to analyze and process. By breaking down the text into individual words, it allows for simple counting and statistical analysis of the frequency of each word. This can be used for various tasks such as sentiment analysis, topic modeling, and text classification.

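Additionally, bag of words is largely language independent: it needs only a way to split text into tokens, so it requires few language-specific tools or resources. (Languages written without spaces between words, such as Chinese or Japanese, still need a word segmenter first.) It is also computationally cheap to build, making it suitable for processing large amounts of text data.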

* The process has two steps. First, we build a vocabulary of all unique words across the n articles. Second, we count how often each vocabulary word appears in each article. Each article's counts form a vector; in scikit-learn, the **CountVectorizer** class produces these count vectors.
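
The two steps above can be sketched in plain Python. This is only an illustration with a deliberately naive tokenizer (lowercase, split on whitespace); the function names and the two example sentences are made up for this sketch and are not part of the dataset used later.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

def build_vocabulary(documents):
    # Step 1: collect every unique word across all documents, sorted for stable column order.
    return sorted({word for doc in documents for word in tokenize(doc)})

def count_vector(document, vocabulary):
    # Step 2: count how often each vocabulary word occurs in one document.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
vocabulary = build_vocabulary(articles)
vectors = [count_vector(doc, vocabulary) for doc in articles]

print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

Each vector has one position per vocabulary word, so every article gets a vector of the same length regardless of how long the article is.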

<img src = "img.jpg" width = "800px" height = "500px"></img>

* Here we tackle the classic **spam** problem. We take an email body, convert it into numbers using the **Bag of Words (BoW)** model, and then apply a **Naive Bayes** classifier.

* To classify emails, we **first** need to create a **vocabulary**: the set of unique words that appear across all your emails.

<img src = "img1.jpg" width = "800px" height = "500px"></img>

* **Next**, we build the **Bag of Words (BoW)** representation with a **Count Vectorizer**: for each email, we count how often each vocabulary word appears and store the counts as a vector.

<img src = "img2.png" width = "800px" height = "400px"></img>

* The **BoW** model has certain limitations, listed below:

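   * **1.** The vocabulary is locked once it is built, so words that appear only in later emails are ignored. The representation is also **sparse**: most entries in each vector are zero, so storing the vectors as dense arrays can consume too much memory and compute.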

<img src = "img3.png" width = "800px" height = "400px"></img>

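   * **2.** It doesn't capture the meanings of words accurately: word order and context are discarded, and related words (such as synonyms) are treated as unrelated features.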

<img src = "img4.png" width = "800px" height = "400px"></img>
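
Both limitations can be seen in a tiny plain-Python sketch. The helper names and sentences are hypothetical and use a naive whitespace tokenizer; the point is only that a fitted vocabulary drops unseen words and that two sentences with opposite meanings can share one vector.

```python
from collections import Counter

def fit_vocabulary(documents):
    # The vocabulary is fixed ("locked") after this call.
    return sorted({w for d in documents for w in d.lower().split()})

def transform(document, vocabulary):
    # Words outside the vocabulary are silently ignored.
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

vocabulary = fit_vocabulary(["dog bites man", "man feeds dog"])
print(vocabulary)  # ['bites', 'dog', 'feeds', 'man']

# Limitation 2: word order is lost, so different meanings collapse to one vector.
a = transform("dog bites man", vocabulary)
b = transform("man bites dog", vocabulary)
print(a, b, a == b)  # [1, 1, 0, 1] [1, 1, 0, 1] True

# Limitation 1: 'cat' was never seen while fitting, so it leaves no trace.
unseen = transform("cat bites man", vocabulary)
print(unseen)  # [1, 0, 0, 1]
```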

* **So let's work through a spam vs. non-spam (ham) email classification problem using BoW for text representation.**

In [4]:
# Necessary libraries
import pandas as pd
import numpy as np

In [6]:
# We have a dataset of more than 5,500 spam and ham messages, and we want to use it for a classification problem.
# Let's load the dataset:
df = pd.read_csv("spam.csv")
df.sample(5)

| | Category | Message |
|---|---|---|
| 265 | ham | Why you Dint come with us. |
| 1400 | ham | You have registered Sinco as Payee. Log in at ... |
| 920 | ham | Dont talk to him ever ok its my word. |
| 2873 | ham | See you there! |
| 5232 | spam | YOU ARE CHOSEN TO RECEIVE A £350 AWARD! Pls ca... |


In [8]:
# First, we want to know how many spam and non-spam (ham) emails are in the dataset:
df.Category.value_counts()

ham     4825
spam     747
Name: Category, dtype: int64
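
To put the counts above into perspective, here is a small arithmetic sketch using only the two numbers printed by `value_counts()`. It shows the class shares and the accuracy a trivial "always predict ham" baseline would reach, which is why accuracy alone can be misleading on this dataset.

```python
# Counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is right this often.
majority_accuracy = max(counts.values()) / total

print(total)                                        # 5572
print({k: round(v, 3) for k, v in shares.items()})  # {'ham': 0.866, 'spam': 0.134}
print(round(majority_accuracy, 3))                  # 0.866
```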

In [9]:
# We can clearly see the imbalance in the dataset, but for now we skip this issue and go ahead to build an ML model.
# The Category column holds the text labels 'spam' and 'ham', so we create a new numeric column that is 1 for spam
# and 0 for non-spam.
df['spam'] = df['Category'].apply(lambda x: 1 if x =='spam' else 0)

In [10]:
# Now we will have three columns:
df.shape

(5572, 3)

In [13]:
# To see the columns and their values:
df.head(5)

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [14]:
# Next we import the train_test_split function from scikit-learn to split the dataset into train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

In [16]:
# Now if we check the number of train samples:
X_train.shape

(4457,)

In [17]:
# To check the number of test samples:
X_test.shape

(1115,)
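
The two shapes above follow from `test_size=0.2` applied to 5572 rows. The sketch below reproduces the arithmetic; it assumes, as the printed shapes suggest, that the fractional test size is rounded up and the remainder goes to the training set.

```python
import math

n_samples = 5572   # rows in the dataset
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4457 1115
```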

In [18]:
# To know about the datatype of samples, we can do it using type function.
type(X_train) # This will output that the datatype is pandas 'series'.

pandas.core.series.Series

In [19]:
# To see first 5 samples from train samples:
X_train[:5]

445     HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYARO...
2905    HI DARLIN I HOPE YOU HAD A NICE NIGHT I WISH I...
1266    Im in inperialmusic listening2the weirdest tra...
3498    Oh, the grand is having a bit of a party but i...
2911       How do you guys go to see movies on your side.
Name: Message, dtype: object

In [25]:
# To see 5 labels from the test set (positions 7 to 11):
y_test[7:12]

540     0
2104    0
1874    1
1829    0
20      0
Name: spam, dtype: int64

* By convention, we use a capital 'X' for X_train because the features form a matrix: there may be multiple independent variables, each in its own column. We use a lowercase 'y' for y_train because the target is usually a single column.

In [27]:
# To build the BoW model for numeric representation, we import CountVectorizer from sklearn.
# CountVectorizer is a class, so we create an instance of it.
# CountVectorizer has a method called 'fit_transform' which learns the vocabulary and generates the BoW vectors for X_train.
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer() # instance of the class.

X_train_cv = v.fit_transform(X_train.values)   # X_train.values converts the pandas Series into a NumPy array
X_train_cv

<4457x7759 sparse matrix of type '<class 'numpy.int64'>'
	with 59567 stored elements in Compressed Sparse Row format>
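
The numbers in this output show how sparse the representation is. The following arithmetic sketch uses only the shape and stored-element count printed above. The CSR size estimate additionally assumes 8-byte values (the `int64` shown above) and 4-byte index arrays, which is a common layout but an assumption here.

```python
rows, cols, stored = 4457, 7759, 59567   # copied from the sparse matrix summary above

cells = rows * cols
density = stored / cells           # fraction of cells that are nonzero
avg_words_per_email = stored / rows

dense_bytes = cells * 8            # a dense int64 array stores every cell
# CSR keeps values, column indices, and one row pointer per row (+1).
# Assumption: 8-byte values and 4-byte indices.
csr_bytes = stored * 8 + stored * 4 + (rows + 1) * 4

print(cells)                          # 34581863
print(round(density * 100, 2))        # 0.17 (percent of cells that are nonzero)
print(round(avg_words_per_email, 1))  # 13.4 distinct vocabulary words per email on average
print(dense_bytes, csr_bytes)         # 276654904 732636
```

Under these assumptions the dense array needs a few hundred megabytes while the sparse form stays under one megabyte, which is why the vectorizer returns a sparse matrix.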

In [26]:
# The type of 'X_train.values' is a NumPy array:
type(X_train.values)

numpy.ndarray

In [28]:
# To inspect 'X_train_cv', we have to convert it into a dense NumPy array.
X_train_cv.toarray()[:4]   # The representation is sparse, so the dense array is large and mostly zeros.

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [29]:
# To see the shape:
X_train_cv.shape

(4457, 7759)

In [34]:
# We can see there are 7759 unique words in our vocabulary.
# The fitted CountVectorizer 'v' can give us all of these words.
v.get_feature_names_out()[2000:2050]    # Shows 50 of the words.

array(['copy', 'cornwall', 'corrct', 'correct', 'correctly', 'corrupt',
       'corvettes', 'cos', 'cosign', 'cost', 'costa', 'costing', 'costs',
       'costume', 'costumes', 'cougar', 'coughing', 'could', 'coulda',
       'couldn', 'count', 'countin', 'countinlots', 'country', 'counts',
       'coupla', 'couple', 'courage', 'course', 'court', 'courtroom',
       'cousin', 'cover', 'covers', 'coz', 'cozy', 'cps', 'cr01327bt',
       'cr9', 'crab', 'crack', 'cramps', 'crap', 'crash', 'crashing',
       'crave', 'craving', 'craziest', 'crazy', 'crazyin'], dtype=object)

In [37]:
# Vocabulary size:
# v.get_feature_names_out().shape
len(v.get_feature_names_out())

7759

In [38]:
# To list all the attributes and methods supported by the 'v' object:
dir(v)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',

In [42]:
# For example, 'vocabulary_' gives us the whole vocabulary as a dict mapping each word to its column index:
# v.vocabulary_

In [44]:
# So now we have 4457 emails in X_train, and each email is represented by a vector of length 7759.
# Let's look at the first emails:
X_train_np = X_train_cv.toarray()
X_train_np[:2] # Will show two vectors for the first two emails.

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [45]:
# Most values in the vectors are zeros, so we don't see any nonzero counts above. To find them, we look up the indexes
# of the nonzero values.
np.where(X_train_np[0]!=0)  # Shows the indexes of the nonzero counts. The values are not necessarily 1; they may be 2, 3, 10, ...

(array([2381, 2964, 3204, 3432, 3533, 3536, 3818, 3843, 4240, 4585, 4587,
        5931, 5937, 7427, 7477, 7669], dtype=int64),)

In [46]:
# To print the value at one of these indexes:
X_train_np[0][2381]

1

In [47]:
# To check other index:
X_train_np[0][2964]

1

In [50]:
# The first five samples in X_train are:
X_train[:5]

445     HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYARO...
2905    HI DARLIN I HOPE YOU HAD A NICE NIGHT I WISH I...
1266    Im in inperialmusic listening2the weirdest tra...
3498    Oh, the grand is having a bit of a party but i...
2911       How do you guys go to see movies on your side.
Name: Message, dtype: object

In [51]:
# To see the full text of the first email (the one whose nonzero indexes we displayed), we look it up by its index label 445:
X_train[:5][445]  # This shows us the first email.

'HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAROUND! HOWDY GORGEOUS, HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE JEN XXX'

In [52]:
# As we saw in cell '[46]', index 2381 has the value 1. If we look up that index in the CountVectorizer's feature names,
# the resulting word must be present in the email displayed in cell '[51]'.
v.get_feature_names_out()[2381]

'doin'

* The output word is 'doin', which is present (as 'DOIN') in the email displayed in cell '[51]'.
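
The match between 'doin' and 'DOIN' comes from how the text is tokenized. The sketch below imitates what I understand to be `CountVectorizer`'s default behavior (lowercasing, then keeping runs of two or more word characters) using only the `re` module. It is an approximation for illustration, not the library's code, applied to the email shown in cell '[51]'.

```python
import re
from collections import Counter

# Assumed default token pattern: two or more word characters between word boundaries.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    # Lowercase first, then extract tokens; single-character words such as "A" are dropped.
    return re.findall(TOKEN_PATTERN, text.lower())

email = ("HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAROUND! HOWDY GORGEOUS, "
         "HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE JEN XXX")

counts = Counter(tokenize(email))

print(len(counts))      # 16 distinct tokens
print(counts["hey"])    # 2
print("doin" in counts) # True
print("a" in counts)    # False
```

The 16 distinct tokens agree with the 16 nonzero indexes returned by `np.where` in cell '[45]', and 'hey' is the one word whose count is 2 rather than 1.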

In [53]:
# Now that the training vectors are ready, we build an ML model.
# Here we use a Naive Bayes classifier, specifically MultinomialNB, which works with word counts.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_cv, y_train)

MultinomialNB()
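
To give a feel for what a multinomial Naive Bayes classifier computes from word counts, here is a minimal from-scratch sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the function names and the four toy messages are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    # Vocabulary and per-class word counts (naive whitespace tokenizer).
    vocabulary = sorted({w for d in documents for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())

    # log P(class) from class frequencies.
    log_prior = {c: math.log(n / len(documents)) for c, n in class_counts.items()}
    # log P(word | class) with additive smoothing so unseen words never get probability 0.
    log_likelihood = {}
    for c in class_counts:
        denominator = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denominator) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(model, document):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Sum log-probabilities of known words; out-of-vocabulary words are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in document.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = ["win free prize now", "free cash offer", "see you at lunch", "lunch at noon tomorrow"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham
model = train_multinomial_nb(docs, labels)

print(predict(model, "free prize offer"))  # 1
print(predict(model, "see you at noon"))   # 0
```

Working in log space avoids multiplying many small probabilities together, which would underflow on long messages.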

In [54]:
# Now we want to evaluate the performance of the model. X_test is still raw email text, so we convert it to count
# vectors with the already fitted CountVectorizer 'v'. We call transform (not fit_transform) so the vocabulary learned from X_train is reused.
X_test_cv = v.transform(X_test)

In [56]:
# For evaluation, we use the classification report. Because our dataset is imbalanced, we look at the F1-score.
# If your dataset is balanced, accuracy is a reasonable metric to report.
from sklearn.metrics import classification_report
y_pred = model.predict(X_test_cv)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       970
           1       0.98      0.90      0.94       145

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
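
As a sanity check on how the columns of this report relate, the sketch below recomputes the F1-scores and averages from the rounded precision, recall, and support values printed above. Because the inputs are already rounded to two decimals, the results are approximate.

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Values copied from the classification report above.
report = {
    0: {"precision": 0.98, "recall": 1.00, "support": 970},
    1: {"precision": 0.98, "recall": 0.90, "support": 145},
}

f1 = {label: f1_score(r["precision"], r["recall"]) for label, r in report.items()}

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_recall = sum(r["recall"] for r in report.values()) / len(report)
total_support = sum(r["support"] for r in report.values())
weighted_f1 = sum(f1[label] * report[label]["support"] for label in report) / total_support

print({label: round(v, 2) for label, v in f1.items()})  # {0: 0.99, 1: 0.94}
print(round(macro_recall, 2))                           # 0.95
print(round(weighted_f1, 2))                            # 0.98
```

The spam class (label 1) has the lower F1 because its recall is 0.90: roughly one spam message in ten is missed.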



In [57]:
# Now we can do some testing on the model:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

In [58]:
# We can also train the same model more concisely using a scikit-learn Pipeline.
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [59]:
# So now we can train the model.
# Note that we don't supply X_train_cv here, because the pipeline applies CountVectorizer to the raw text itself.
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
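
Conceptually, a pipeline just chains the steps: on `fit` every step except the last transforms the data and the last step learns from the result; on `predict` the same transforms are reused before the final step predicts. The toy classes below are hypothetical stand-ins written for this sketch (they are not scikit-learn's classes) to make that flow visible.

```python
from collections import Counter

class ToyVectorizer:
    def fit(self, texts):
        self.vocab_ = sorted({w for t in texts for w in t.lower().split()})
        return self

    def transform(self, texts):
        rows = []
        for t in texts:
            counts = Counter(t.lower().split())
            rows.append([counts[w] for w in self.vocab_])
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)

class ToyClassifier:
    """Picks the class whose summed training counts overlap most with the message."""
    def fit(self, X, y):
        self.totals_ = {label: [0] * len(X[0]) for label in set(y)}
        for row, label in zip(X, y):
            self.totals_[label] = [a + b for a, b in zip(self.totals_[label], row)]
        return self

    def predict(self, X):
        def score(row, label):
            return sum(a * b for a, b in zip(row, self.totals_[label]))
        return [max(sorted(self.totals_), key=lambda lab: score(row, lab)) for row in X]

class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs, like the Pipeline above

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X)   # learn and apply each transform on training data
        self.steps[-1][1].fit(X, y)     # final step learns from the transformed data
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)       # reuse what was learned; no refitting
        return self.steps[-1][1].predict(X)

toy = ToyPipeline([("vectorizer", ToyVectorizer()), ("clf", ToyClassifier())])
toy.fit(["free prize now", "win free cash", "see you soon", "lunch at noon"], [1, 1, 0, 0])
predictions = toy.predict(["free cash prize", "see you at lunch"])
print(predictions)  # [1, 0]
```

Bundling the vectorizer with the classifier also removes the risk of accidentally fitting the vectorizer on test data.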

In [60]:
# Now let's look at the classification report again for the model trained through the sklearn pipeline:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       970
           1       0.98      0.90      0.94       145

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



**A beautiful exercise is given...**