### About the dataset

In [None]:
For my first beginner project on Natural Language Processing (NLP), I chose the SMS Spam Collection Dataset.
It consists of 5572 SMS messages, each with a label classifying the message as "spam" or "ham" (not spam).

In [None]:
In this notebook, I am going to explore some common NLP preprocessing techniques:
1) Removing stopwords.
2) Tokenization.
3) Stemming and lemmatization.
4) Bag of Words.


In [None]:
Using these preprocessing techniques, I am going to build a Naive Bayes classifier that labels unseen messages as "spam" or "ham".

#### Importing the Dataset

In [1]:
#Importing pandas library.
import pandas as pd

In [2]:
messages = pd.read_csv('SMSSpamCollection', sep='\t', names=["label", "message"])

#### Data cleaning and preprocessing

In [3]:
#Importing the re (regular expression) module.
#It is used below to replace every non-letter character in a message with a space.

import re

In [4]:
#Importing nltk library.
#It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
#Importing other useful packages.

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem import WordNetLemmatizer 

In [6]:
#The Porter stemmer normalizes words by reducing them to their stems.

ps = PorterStemmer()

In [7]:
#The lemmatizer groups together the different inflected forms of a word so they can be analysed as a single item.
#Note: it is created here for reference only; the cleaning loop below uses the Porter stemmer.

wordnet=WordNetLemmatizer()

In [8]:
#A corpus is a collection of texts; this list will hold the cleaned messages.

corpus = []

In [9]:
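#Build the stop-word set once; rebuilding the list for every word is much slower.
stop_words = set(stopwords.words('english'))

for i in range(0, len(messages)):
    #Keep letters only, then lowercase and split into words.
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()

    #Drop stop words and stem the remaining words.
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)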
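
To see what the cleaning steps do to a single message, here is a minimal sketch using only the standard library. The sample message and the tiny `STOP_WORDS` set are made up for illustration (NLTK's English list is much longer), and stemming is left out, so the result will not match the notebook's corpus exactly.

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"you", "have", "a", "to", "the", "now"}

def clean(message):
    # Replace every non-letter character with a space.
    letters_only = re.sub('[^a-zA-Z]', ' ', message)
    # Lowercase and split on whitespace to get tokens.
    tokens = letters_only.lower().split()
    # Drop stop words (no stemming in this sketch).
    return [t for t in tokens if t not in STOP_WORDS]

sample = "WINNER!! You have won a $1000 prize. Call 09061701461 now!"
cleaned = clean(sample)
print(cleaned)  # ['winner', 'won', 'prize', 'call']
```

Digits and punctuation disappear because the regular expression keeps letters only, which is why phone numbers and prize amounts never reach the model.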

In [10]:
len(corpus)

5572

#### Creating the Bag of Words model

In [11]:
#The CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.
#It can also be used to encode new documents using that vocabulary.

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
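
Roughly speaking, `CountVectorizer(max_features=2500)` keeps the 2500 most frequent terms in the corpus and counts how often each one occurs in every message. The sketch below imitates that idea in plain Python on a made-up three-message corpus. `build_vocabulary` and `encode` are hypothetical helpers, not scikit-learn functions, and details such as scikit-learn's tokenization rule and tie-breaking are ignored.

```python
from collections import Counter

def build_vocabulary(corpus, max_features):
    # Count each term across the whole corpus.
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent terms, then sort them alphabetically
    # so that every term gets a fixed column index.
    top = [word for word, _ in totals.most_common(max_features)]
    return sorted(top)

def encode(doc, vocabulary):
    # One count per vocabulary term; words outside the vocabulary are ignored.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

toy_corpus = ["free prize call", "call later", "free free entri"]
vocabulary = build_vocabulary(toy_corpus, max_features=2)
matrix = [encode(doc, vocabulary) for doc in toy_corpus]
new_row = encode("call free win", vocabulary)

print(vocabulary)  # ['call', 'free']
print(matrix)      # [[1, 1], [1, 0], [0, 2]]
print(new_row)     # [1, 1]
```

The last line shows the point made in the comment above: a new message is encoded against the vocabulary learned from the corpus, so unseen words such as "win" simply do not get a column.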

In [12]:
len(corpus)

5572

In [13]:
#One-hot encoding the labels: one indicator column for 'ham' and one for 'spam'.

y=pd.get_dummies(messages['label'])
y

   ham  spam
0    1     0
1    1     0
2    0     1
3    1     0
4    1     0
5    0     1
6    1     0
7    1     0
8    0     1
9    0     1


In [14]:
#iloc is used to select rows and columns by number, in the order that they appear in the data frame.
#Column 1 is 'spam', so y is 1 for spam messages and 0 for ham messages.

y=y.iloc[:,1].values
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)

#### Splitting the data into training and test sets (20% test size)

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
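
As a rough picture of what the split does, the sketch below shuffles row indices with a fixed seed and holds out 20% of them. It is a plain-Python illustration, not scikit-learn's implementation, so the selected rows differ from those chosen by `train_test_split`. The set sizes should match, though, on the assumption that the test size is rounded up and the remaining rows go to training. `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    # Shuffle all row indices reproducibly.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test-set size up; the remaining rows are used for training.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5572, test_size=0.20, seed=0)
print(len(train_idx), len(test_idx))  # 4457 1115
```

The 1115 test messages agree with the total of the four cells in the confusion matrix computed later in the notebook.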

#### Training the model using a Naive Bayes classifier

In [16]:
#Multinomial Naive Bayes is a variant of Naive Bayes that is well suited to text documents.
#It explicitly models the word counts and adjusts the underlying calculations to deal with them.
#Importing the Multinomial Naive Bayes classifier.

from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train) #Fitting the data.
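
To make the idea of modelling word counts concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. The toy count matrix, the labels, and the helper names `fit_nb` and `predict_nb` are invented for illustration; the notebook itself relies on scikit-learn's implementation.

```python
import math

def fit_nb(X, y, alpha=1.0):
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Prior: fraction of training messages in class c.
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-word counts within class c.
        counts = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        total = sum(counts)
        log_likelihood[c] = [math.log(v / total) for v in counts]
    return log_prior, log_likelihood

def predict_nb(x, log_prior, log_likelihood):
    # Score each class: log prior + sum of count * log P(word | class).
    scores = {
        c: log_prior[c] + sum(n * lw for n, lw in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Columns: counts of "call", "free", "later"; labels: 1 = spam, 0 = ham.
X_toy = [[1, 2, 0], [0, 1, 0], [1, 0, 1], [0, 0, 2]]
y_toy = [1, 1, 0, 0]
log_prior, log_likelihood = fit_nb(X_toy, y_toy)

pred_spam = predict_nb([0, 2, 0], log_prior, log_likelihood)
pred_ham = predict_nb([1, 0, 1], log_prior, log_likelihood)
print(pred_spam, pred_ham)  # 1 0
```

Smoothing matters because a word that never appears in one class would otherwise get probability zero and wipe out the whole score for that class.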

In [17]:
y_pred=spam_detect_model.predict(X_test)

In [18]:
y_pred

array([0, 1, 0, ..., 0, 1, 0], dtype=uint8)

In [19]:
y_train

array([0, 0, 0, ..., 1, 0, 0], dtype=uint8)

#### Comparing y_test and y_pred

#### Confusion Matrix

In [20]:
#The confusion matrix shows how well the classifier performs on each class.
#Rows are the true labels and columns are the predicted labels.

from sklearn.metrics import confusion_matrix

In [21]:
confusion_m=confusion_matrix(y_test, y_pred)

In [22]:
confusion_m

array([[946,   9],
       [  7, 153]], dtype=int64)

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
accuracy=accuracy_score(y_test, y_pred)

In [25]:
accuracy

0.9856502242152466
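
Accuracy alone can hide how the classifier does on the rarer spam class. Assuming scikit-learn's convention that rows of the confusion matrix are true labels and columns are predicted labels (ordered 0 = ham, 1 = spam), precision and recall for spam can be read off the matrix printed earlier. This is a supplementary calculation, not part of the original notebook.

```python
# Values copied from the confusion matrix: [[TN, FP], [FN, TP]].
tn, fp = 946, 9
fn, tp = 7, 153

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these numbers, 9 ham messages were flagged as spam and 7 spam messages slipped through, giving a spam precision of about 0.944 and a recall of about 0.956.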

In [None]:
#The accuracy obtained with the Multinomial Naive Bayes classifier on the test set is about 98.57%.

## Conclusion

In [None]:
This is a short, basic approach to analyzing this dataset.
I hope it will give you a basic idea of how to approach your own analysis of this type of dataset.

In [None]:
Thank you.

Any suggestions or comments are welcome.