<a href="https://colab.research.google.com/github/niksom406/Learning_NLP/blob/main/Bag_Of_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SMS Spam Collection Analysis

This notebook analyzes the SMS Spam Collection dataset to build a model for classifying messages as spam or ham.

## Data Loading

Load the dataset into a pandas DataFrame.

In [23]:
import pandas as pd
messages = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/SMSSpam_Dataset/SMSSpamCollection.txt",
                       sep='\t',names = ["label","message"])

## Display Data

Display the first few rows of the DataFrame.

In [24]:
messages

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## Data Cleaning and Preprocessing

Import necessary libraries for text cleaning and preprocessing, including regular expressions, NLTK for natural language processing tasks, and download the 'stopwords' corpus.

In [25]:
## Data Cleaning and Preprocessing

import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Import the stopwords from NLTK and initialize a PorterStemmer for stemming words.

In [26]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

Create a corpus of cleaned and preprocessed messages. This involves removing special characters, converting to lowercase, splitting into words, stemming words, and removing stopwords.

In [27]:
corpus = []
for i in range (0,len(messages)):
  review = re.sub('[^a-zA-z]',' ',messages['message'][i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word is not stopwords.words('english')]
  review = " ".join(review)
  corpus.append(review)

## Display Corpus

Display the processed corpus.

In [28]:
corpus

['go until jurong point crazi avail onli in bugi n great world la e buffet cine there got amor wat',
 'ok lar joke wif u oni',
 'free entri in a wkli comp to win fa cup final tkt st may text fa to to receiv entri question std txt rate t c s appli over s',
 'u dun say so earli hor u c alreadi then say',
 'nah i don t think he goe to usf he live around here though',
 'freemsg hey there darl it s been week s now and no word back i d like some fun you up for it still tb ok xxx std chg to send to rcv',
 'even my brother is not like to speak with me they treat me like aid patent',
 'as per your request mell mell oru minnaminungint nurungu vettam ha been set as your callertun for all caller press to copi your friend callertun',
 'winner as a valu network custom you have been select to receivea prize reward to claim call claim code kl valid hour onli',
 'had your mobil month or more u r entitl to updat to the latest colour mobil with camera for free call the mobil updat co free on',
 'i m gonn

## Bag of Words Explanation

The Bag of Words (BoW) model is a simplified representation used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even the order of words. The frequency of each word in the document is used as a feature for training a classifier.

In this notebook, we use `CountVectorizer` from scikit-learn to create the Bag of Words model. `CountVectorizer` converts a collection of text documents to a matrix of token counts.

- `max_features`: This parameter limits the number of features (words) to consider based on their frequency. Here, we limit it to the top 2500 most frequent words.
- `binary=True`: This parameter creates a binary Bag of Words model, where the feature value is 1 if the word is present in the document and 0 otherwise, instead of the word count.

## Create the Bag of Words Model

Create the Bag of Words model using `CountVectorizer`.

In [29]:
## Create the Bag of Words

from sklearn.feature_extraction.text import CountVectorizer
## For binary BOW enable binary=True
cv = CountVectorizer(max_features=2500,binary=True)

Fit the `CountVectorizer` to the corpus and transform the text data into a matrix of token counts.

In [30]:
X = cv.fit_transform(corpus).toarray()

## Display Shape of X

Display the shape of the resulting matrix, where the number of rows is the number of documents and the number of columns is the number of features (words).

In [31]:
X.shape

(5572, 2500)

## Display X

Display the Bag of Words matrix.

In [32]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## N-Grams

N-grams are contiguous sequences of n items from a given sample of text or speech. In this context, we are considering sequences of words. Using n-grams in the Bag of Words model can capture some of the local word order information that is lost when using individual words (unigrams) only.

By setting `ngram_range=(1, 2)` in `CountVectorizer`, we are including both unigrams (single words) and bigrams (sequences of two words) in our feature set. This can potentially improve the performance of a model by considering pairs of words that frequently appear together.

In [33]:
cv.vocabulary_

{'go': 821,
 'until': 2243,
 'point': 1574,
 'crazi': 456,
 'avail': 152,
 'onli': 1475,
 'in': 985,
 'bugi': 283,
 'great': 845,
 'world': 2419,
 'la': 1111,
 'cine': 377,
 'there': 2127,
 'got': 836,
 'wat': 2320,
 'ok': 1464,
 'lar': 1123,
 'joke': 1054,
 'wif': 2381,
 'oni': 1474,
 'free': 758,
 'entri': 629,
 'wkli': 2407,
 'comp': 411,
 'to': 2163,
 'win': 2387,
 'cup': 470,
 'final': 714,
 'tkt': 2158,
 'st': 1995,
 'may': 1288,
 'text': 2109,
 'receiv': 1684,
 'question': 1646,
 'std': 2006,
 'txt': 2217,
 'rate': 1664,
 'appli': 107,
 'over': 1503,
 'dun': 589,
 'say': 1792,
 'so': 1944,
 'earli': 595,
 'alreadi': 71,
 'then': 2125,
 'nah': 1396,
 'don': 567,
 'think': 2132,
 'he': 890,
 'goe': 824,
 'usf': 2261,
 'live': 1181,
 'around': 124,
 'here': 905,
 'though': 2139,
 'freemsg': 760,
 'hey': 907,
 'darl': 489,
 'it': 1026,
 'been': 194,
 'week': 2341,
 'now': 1442,
 'and': 86,
 'no': 1428,
 'word': 2416,
 'back': 165,
 'like': 1170,
 'some': 1951,
 'fun': 779,
 'you': 2

In [34]:
## Create the Bag of Words

from sklearn.feature_extraction.text import CountVectorizer
## For binary BOW enable binary=True
cv = CountVectorizer(max_features=2500,binary=True,ngram_range=(1,2))

In [35]:
X = cv.fit_transform(corpus).toarray()

In [37]:
cv.vocabulary_

{'go': 728,
 'until': 2164,
 'point': 1565,
 'crazi': 429,
 'avail': 160,
 'onli': 1468,
 'in': 922,
 'bugi': 250,
 'great': 757,
 'world': 2371,
 'la': 1064,
 'cine': 352,
 'there': 1982,
 'got': 752,
 'wat': 2243,
 'ok': 1441,
 'lar': 1072,
 'joke': 1023,
 'wif': 2323,
 'free': 658,
 'entri': 570,
 'wkli': 2360,
 'comp': 393,
 'to': 2032,
 'win': 2337,
 'cup': 438,
 'final': 617,
 'st': 1834,
 'may': 1202,
 'text': 1908,
 'receiv': 1630,
 'question': 1603,
 'std': 1844,
 'txt': 2141,
 'rate': 1610,
 'appli': 113,
 'over': 1507,
 'free entri': 661,
 'to win': 2094,
 'to to': 2090,
 'to receiv': 2077,
 'rate appli': 1611,
 'dun': 544,
 'say': 1694,
 'so': 1787,
 'earli': 550,
 'alreadi': 51,
 'then': 1977,
 'so earli': 1790,
 'nah': 1334,
 'don': 515,
 'think': 2003,
 'he': 816,
 'goe': 737,
 'usf': 2199,
 'live': 1125,
 'around': 128,
 'here': 832,
 'though': 2015,
 'don think': 519,
 'freemsg': 666,
 'hey': 834,
 'darl': 450,
 'it': 992,
 'been': 206,
 'week': 2271,
 'now': 1396,
 'a