# Text Classification

#### Context

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

#### Dataset

https://www.kaggle.com/uciml/sms-spam-collection-dataset

##### Importing the Dataset

In [3]:
import pandas as pd

messages = pd.read_csv("spam.csv", encoding="latin-1")
messages.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


##### Data Cleaning

In [4]:
# selecting the relevant column
messages = messages[["v1","v2"]]

# renaming column
messages=messages.rename(columns={"v1":"label", "v2":"message"})
messages.head()

| | label | message |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# importing the Natural Language Tool Kit library for text processing
import nltk

In [6]:
# importing regular expression library for selecting only text
import re

In [7]:
# importing stopwords
from nltk.corpus import stopwords

In [8]:
# importing libraries for stemming and lemmatization
from nltk.stem import PorterStemmer

### Stemming

In [9]:
ps = PorterStemmer()
corpus = []

In [10]:
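# build the stop-word set once instead of once per word
# (requires the NLTK stopwords corpus, e.g. via nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))

for i in range(len(messages)):
    # keep letters only, lower-case, and tokenize on whitespace
    review = re.sub("[^a-zA-Z]", " ", messages["message"][i])
    review = review.lower()
    review = review.split()

    # drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)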
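
To make the cleaning steps concrete, here is a minimal standard-library sketch applied to a single message. The `STOP_WORDS` set and the `clean_message` helper are hypothetical and exist only for illustration: the notebook uses NLTK's full English stop-word list, and the stemming step is left out here.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# NLTK's much larger English list.
STOP_WORDS = {"in", "a", "to", "the", "is"}


def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lower-case, tokenize, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)


cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup")
print(cleaned)  # free entry wkly comp win fa cup
```

Digits and punctuation become spaces, so they never reach the vectorizer as tokens.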

In [11]:
# creating Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=3000)
X = cv.fit_transform(corpus).toarray()
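
As a rough illustration of what a Bag of Words matrix contains, the sketch below counts terms per document in plain Python. The `bag_of_words` helper and the toy documents are hypothetical; `CountVectorizer` tokenizes differently and returns a sparse matrix, so this only approximates the idea, not the library's implementation.

```python
from collections import Counter


def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Keep the most frequent terms, then sort alphabetically for stable columns.
    kept = [term for term, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["free entri win cup", "ok lar joke", "free free prize win"]
vocab, X_demo = bag_of_words(docs)
print(vocab)
for row in X_demo:
    print(row)
```

Each row is one message and each column is one vocabulary term; limiting `max_features` keeps only the most frequent terms across the corpus.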

In [12]:
# encoding target variables 0-ham, 1-spam
y = pd.get_dummies(messages["label"])
y = y.iloc[:,1].values

In [13]:
messages["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

* The dataset is imbalanced: only 747 of the 5,572 messages (about 13.4%) are spam, so we use a stratified shuffle split

In [14]:
# Stratified Shuffle
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

In [15]:
y_train.sum()/len(X_train), y_test.sum()/len(X_test)

(0.13417096701817366, 0.1336322869955157)

* The stratified shuffle split keeps the spam proportion nearly identical in the train and test sets (about 13.4% in each)
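
The idea behind stratification can be sketched in plain Python: shuffle each class separately and take the same fraction of each for the test set. The `stratified_split` helper below is hypothetical and is not how `StratifiedShuffleSplit` is implemented, so the exact set sizes can differ slightly from the notebook's.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same class counts as the notebook: 4825 ham, 747 spam.
labels = [0] * 4825 + [1] * 747
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(round(train_rate, 3), round(test_rate, 3))
```

Because every class is split in the same ratio, the spam rate in both subsets stays close to the overall rate.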

### Model Creation

In [16]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

MultinomialNB()

In [17]:
# making prediction on test data
y_pred = model.predict(X_test)

In [18]:
# importing libraries for identifying how good the model is
from sklearn.metrics import accuracy_score, confusion_matrix

In [19]:
confusion_matrix(y_test,y_pred)

array([[956,  10],
       [ 12, 137]], dtype=int64)

* 956 records have been correctly identified as "ham"
* 137 records have been correctly identified as "spam"
* 12 records were "spam" but were classified as "ham"
* 10 records were "ham" but were classified as "spam"
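
Because the classes are imbalanced, accuracy alone can hide how well spam is detected. As a small worked example, the counts from the confusion matrix above can be turned into precision, recall, and F1 for the spam class. The variable names are chosen here for illustration.

```python
# Confusion matrix from the cell above: rows are true labels, columns are
# predictions, in the order (ham, spam).
tn, fp = 956, 10
fn, tp = 12, 137

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```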

In [20]:
accuracy_score(y_test, y_pred)

0.9802690582959641

* The accuracy of the model on the test set is about 98.03%

In [21]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
score

array([0.98430493, 0.98318386, 0.97306397, 0.98092031, 0.98092031])

In [27]:
score.mean()

0.9802554694931377

* Cross-validation on the training set gives similar results, with fold accuracies between about 97.3% and 98.4%
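
For readers new to cross-validation, the sketch below shows how k-fold splitting partitions the sample indices so that every sample is used for validation exactly once. The `k_fold_indices` helper is hypothetical; for classifiers, scikit-learn uses a stratified variant, which this plain version does not reproduce.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

The model is refit on each training part and scored on the matching validation part, which yields one score per fold.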

### Lemmatization

* We will create a new model with Lemmatization instead of Stemming

In [22]:
from nltk.stem import WordNetLemmatizer


wordnet = WordNetLemmatizer()
corpus = []


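# build the stop-word set once instead of once per word
stop_words = set(stopwords.words("english"))

for i in range(len(messages)):
    # keep letters only, lower-case, and tokenize on whitespace
    review = re.sub("[^a-zA-Z]", " ", messages["message"][i])
    review = review.lower()
    review = review.split()

    # drop stop words and lemmatize the remaining tokens
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)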

    
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(corpus).toarray()

y = pd.get_dummies(messages["label"])
y = y.iloc[:,1].values


from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [24]:
confusion_matrix(y_test,y_pred)

array([[954,  12],
       [ 12, 137]], dtype=int64)

In [25]:
accuracy_score(y_test, y_pred)

0.97847533632287

In [26]:
score = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
score

array([0.98318386, 0.97869955, 0.97306397, 0.98092031, 0.98540965])

In [28]:
score.mean()

0.9802554694931377

* With lemmatization, accuracy stays almost the same as with stemming: 97.85% on the test set and about 98.03% in cross-validation

## TF-IDF Vectorization

In this part, we use a TF-IDF vectorizer instead of Bag of Words.
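
To show how TF-IDF weights differ from raw counts, here is a minimal plain-Python sketch. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is the documented default behaviour of scikit-learn's `TfidfVectorizer`. The `tfidf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = ["free prize win", "ok see later", "free call now"]
vocab, rows = tfidf(docs)
weights = dict(zip(vocab, rows[0]))
print({term: round(w, 3) for term, w in weights.items() if w > 0})
```

Terms that appear in many documents, such as "free" here, get a lower weight than terms that appear in only one.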

In [29]:
lemmatizer = WordNetLemmatizer()
corpus = []


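# build the stop-word set once instead of once per word
stop_words = set(stopwords.words("english"))

for i in range(len(messages)):
    # keep letters only, lower-case, and tokenize on whitespace
    review = re.sub("[^a-zA-Z]", " ", messages["message"][i])
    review = review.lower()
    review = review.split()

    # drop stop words and lemmatize the remaining tokens
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)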
    
    

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(max_features=3000)
X = tfv.fit_transform(corpus).toarray()


y = pd.get_dummies(messages["label"])
y = y.iloc[:,1].values


from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    

model = MultinomialNB()
model.fit(X_train, y_train)


y_pred = model.predict(X_test)

In [30]:
confusion_matrix(y_test,y_pred)

array([[964,   2],
       [ 29, 120]], dtype=int64)

In [31]:
accuracy_score(y_test, y_pred)

0.9721973094170404

In [33]:
score = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
score

array([0.97982063, 0.97421525, 0.97194164, 0.97306397, 0.97867565])

In [34]:
score.mean()

0.9755434262908105

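* Test accuracy dropped slightly, to 97.22%, compared with about 98% for the two Bag of Words models. The confusion matrix shows the trade-off: only 2 ham messages were flagged as spam, but 29 spam messages were missed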

# Conclusion

* In this notebook, we built a classifier to identify spam SMS messages
* We identified the relevant columns and renamed them for clarity
* We worked with both stemming and lemmatization
* We cleaned the data by removing non-letter characters such as digits and punctuation, lower-casing the text, splitting it into words, and removing stop words to create the final corpus
* We also used both Bag of Words and TF-IDF vectorization
* Finally, we trained Multinomial Naive Bayes models on each variant; all performed similarly, with test accuracy between about 97.2% and 98.0%