Import libraries:
1. pandas - used mainly for data manipulation and analysis.
2. sklearn (scikit-learn) - provides machine learning utilities; here, label encoding and train/test splitting.
3. keras - provides the building blocks for neural networks.
4. re - regular expression operations.

In [2]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, SpatialDropout1D
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.utils import to_categorical
from keras import optimizers


import re

Read Dataset:
1. Read the CSV file using the pandas library.
2. Pass the filename and the encoding as parameters. The encoding tells pandas how to decode the bytes in the file; this file is read as 'latin-1' rather than the default 'utf-8'.


In [3]:
dataset = pd.read_csv('spam.csv', encoding='latin-1')
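
As a small illustration of why the encoding argument matters, the sketch below decodes the same bytes with two codecs. The byte string is a made-up example, on the assumption that the file contains characters such as the pound sign stored as single Latin-1 bytes; the notebook does not say which bytes are involved.

```python
# Hypothetical sample: the pound sign stored as the single Latin-1 byte 0xA3.
raw = b"win \xa3100 now"

# latin-1 maps every byte to a character, so decoding always succeeds.
decoded = raw.decode("latin-1")
print(decoded)

# utf-8 rejects a lone 0xA3 byte because it is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```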

View Dataset:
.info() displays a summary of the dataset (column names, non-null counts, dtypes), and .head() displays its first five rows.

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
spam_or_ham    5572 non-null object
message        5572 non-null object
Unnamed: 2     50 non-null object
Unnamed: 3     12 non-null object
Unnamed: 4     6 non-null object
dtypes: object(5)
memory usage: 217.7+ KB


In [5]:
dataset.head()

Unnamed: 0,spam_or_ham,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Drop unnecessary columns:
1. Unnecessary columns are dropped using the .drop method.
2. Pass the list of column names to drop, axis (0 to drop rows by index label, 1 to drop columns), and inplace (True modifies the DataFrame in place and returns None; False returns a new DataFrame).

In [6]:
dataset.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
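
The difference between the two inplace settings can be sketched on a toy frame. The frame and its contents are hypothetical stand-ins for the real dataset; only the column names follow the notebook.

```python
import pandas as pd

# Toy frame with one hypothetical extra column, mirroring the layout above.
toy = pd.DataFrame({
    "spam_or_ham": ["ham", "spam"],
    "message": ["ok lar", "free entry"],
    "Unnamed: 2": [None, None],
})

# inplace=False (the default) returns a new frame and leaves `toy` untouched.
copy = toy.drop(["Unnamed: 2"], axis=1)

# inplace=True modifies `toy` itself and returns None.
result = toy.drop(["Unnamed: 2"], axis=1, inplace=True)

print(list(copy.columns))
print(list(toy.columns))
print(result)
```

Because the inplace=True call returns None, its result should not be assigned back to the dataset variable.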

Check drop action:
Check whether the unnecessary features were dropped from the dataframe by viewing the dataset again.

In [7]:
dataset.head()

Unnamed: 0,spam_or_ham,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Data Pre-processing:
1. Use .str.strip to remove leading and trailing whitespace from each string.
2. Use .str.lower to convert all messages to lower case.

In [8]:
dataset['spam_or_ham']=dataset.spam_or_ham.str.strip()
dataset['message']=dataset.message.str.strip()
dataset['message']=dataset.message.str.lower()
dataset.head()

Unnamed: 0,spam_or_ham,message
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


Count categories:
Use .value_counts to count the number of datapoints in each class.

In [10]:
dataset.spam_or_ham.value_counts()

ham     4825
spam     747
Name: spam_or_ham, dtype: int64
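
The counts above show that the classes are imbalanced. As a rough worked example, the sketch below computes each class share and the accuracy that a trivial classifier would reach by always predicting the majority class; that baseline is an added reference point, not something the notebook computes.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

print(total)
print(round(shares["spam"], 3))
print(round(baseline_accuracy, 3))
```

Any accuracy reported later is best read against this baseline rather than against 50%.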

Tokenization:
1. Tokenization chops a string into pieces called tokens.
2. For example:
input : "Hi python, this is my last assignment"
output : ["hi","python","this","is","my","last","assignment"]
3. Create a Tokenizer with num_words set to 1500 that splits each string on the character ' '.
4. Fit it on the messages to build the word index.
5. Convert the texts to sequences of integer word indices.

In [11]:
maximum_number_of_features = 1500
tokenizer = Tokenizer(num_words=maximum_number_of_features, split=' ')
tokenizer.fit_on_texts(dataset['message'].values)
X = tokenizer.texts_to_sequences(dataset['message'].values)
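
To make the steps above concrete, here is a plain-Python sketch of the same idea. It is a simplified imitation, not the Keras implementation: the helper names are hypothetical, string.punctuation only approximates the Keras filter list, and the rule that only indices below num_words are kept is an assumption about how the Tokenizer behaves.

```python
import string
from collections import Counter

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words ranked at or beyond num_words."""
    return [
        [word_index[w] for w in simple_tokenize(t)
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

texts = ["Hi python, this is my last assignment", "hi again python"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
print(simple_tokenize(texts[0]))
print(sequences)
```

With num_words=4 only the three most frequent words survive, which is how a small vocabulary limit discards rare words.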

Change the sequences to a 2D NumPy array:
Use pad_sequences from keras to pad every sequence with zeros to the length of the longest one, which yields a 2D NumPy array.

In [12]:
print(X)
X = pad_sequences(X)
print("after applying pad_sequences")
print(X)

[[50, 469, 840, 751, 657, 64, 8, 1323, 89, 121, 349, 1324, 147, 1325, 67, 58, 144], [46, 336, 1494, 470, 6], [47, 486, 8, 19, 4, 795, 898, 2, 178, 1198, 658, 267, 71, 2, 2, 337, 486, 554, 954, 73, 388, 179, 659, 389], [6, 246, 152, 23, 379, 6, 140, 154, 57, 152], [1017, 1, 98, 107, 69, 487, 2, 955, 69, 219, 111, 471], [796, 127, 67, 145, 108, 160, 21, 7, 38, 338, 87, 899, 55, 115, 411, 3, 44, 12, 14, 85, 46, 380, 954, 2, 68, 322, 231, 2], [212, 11, 632, 9, 25, 55, 2, 381, 36, 10, 109, 718, 10, 55], [72, 235, 13, 1199, 797, 119, 108, 608, 72, 13, 1018, 12, 51, 841, 412, 2, 1099, 13, 247, 1018], [719, 72, 4, 842, 438, 236, 3, 17, 108, 439, 2, 1326, 150, 956, 2, 124, 16, 124, 413, 516, 957, 581, 64], [136, 13, 96, 682, 1019, 26, 133, 6, 81, 1200, 2, 488, 2, 5, 323, 534, 900, 36, 339, 12, 47, 16, 5, 96, 488, 243, 47, 18], [30, 237, 32, 77, 220, 7, 1, 98, 70, 2, 289, 80, 40, 290, 1201, 232, 93, 208, 440, 88], [2, 178, 157, 48, 720, 2, 901, 441, 633, 73, 7, 68, 2, 371, 187, 65, 260, 389, 92,
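
The padding step can be sketched in plain Python. This hypothetical helper mirrors the default behaviour of pad_sequences as far as it matters here (zeros added at the front, up to the longest length) and returns lists rather than a NumPy array.

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

# The second sequence is the second message from the output above; the others are made up.
ragged = [[50, 469], [46, 336, 1494, 470, 6], [6]]
padded = pad_pre(ragged)
for row in padded:
    print(row)
```

Index 0 is never assigned to a word by the tokenizer, so it can safely act as the padding value.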

Make categorical variable:
1. Create a LabelEncoder.
2. Use fit_transform on the categorical feature 'spam_or_ham' to convert its labels to integers.
3. LabelEncoder numbers the classes from 0 to (number of classes - 1).
4. Use to_categorical to one-hot encode the integer labels.

In [13]:
labelencoder = LabelEncoder()
integer_encoded = labelencoder.fit_transform(dataset['spam_or_ham'])
Y = to_categorical(integer_encoded)
print(Y)

[[1. 0.]
 [1. 0.]
 [0. 1.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
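
The two encoding steps can be reproduced by hand for the first three labels. The helpers below are hypothetical plain-Python stand-ins for LabelEncoder and to_categorical, written on the assumption that classes are numbered in sorted order.

```python
def label_encode(labels):
    """Assign each distinct label an integer by its sorted position."""
    classes = sorted(set(labels))
    lookup = {label: i for i, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes

def one_hot(indices, num_classes):
    """Turn each integer into a row with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

labels = ["ham", "ham", "spam"]  # first three labels of the dataset shown earlier
encoded, classes = label_encode(labels)
Y_small = one_hot(encoded, len(classes))
print(classes)
print(encoded)
print(Y_small)
```

The rows match the first three rows of Y printed above: column 0 stands for ham and column 1 for spam.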


Split Dataset:
1. Use the train_test_split() function to split the data into four parts: the features for train and test (X_train and X_test) and the categorical target for train and test (Y_train and Y_test).
2. Set the test size to 33%.
3. Set random_state so that the shuffle, and therefore the split, is reproducible.
4. Print the shapes of the split data.

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

(3733, 172) (3733, 2)
(1839, 172) (1839, 2)
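
The row counts in the shapes above follow from the test fraction. The short check below assumes that the test count is rounded up and the training set takes the remainder, which is consistent with the printed shapes.

```python
import math

n_samples = 5572   # rows in the dataset, from dataset.info()
test_size = 0.33

# Assumption: the test count is rounded up and the train set takes the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```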


CNN Model training:
1. Create a Sequential model, to which layers are added one after another.
2. Add an Embedding layer followed by SpatialDropout1D, then two Conv1D layers, each followed by MaxPooling1D.
3. Finally, flatten the output, add a Dense output layer, and compile the model.

In [16]:
model = Sequential()
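# The Embedding `dropout` argument was removed in Keras 2; SpatialDropout1D (next line) provides the dropout.
model.add(Embedding(maximum_number_of_features, 128, input_length=X.shape[1]))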
model.add(SpatialDropout1D(0.2))

model.add(Conv1D(64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))

model.summary()  # prints the table itself and returns None, so no print() wrapper is needed
model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(lr=0.001), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 172, 128)          192000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 172, 128)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 172, 64)           24640     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 86, 64)            0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 86, 32)            6176      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 43, 32)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1376)              0         
__________
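
The shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic for the rows that are visible above; the helper name is hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

vocab_size, embed_dim, seq_len = 1500, 128, 172

embedding_params = vocab_size * embed_dim          # one vector per vocabulary index
conv1_params = conv1d_params(3, embed_dim, 64)     # padding='same' keeps length 172
len_after_pool1 = seq_len // 2                     # pool_size=2 halves the length
conv2_params = conv1d_params(3, 64, 32)
len_after_pool2 = len_after_pool1 // 2
flatten_size = len_after_pool2 * 32                # time steps * channels

print(embedding_params, conv1_params, conv2_params)
print(len_after_pool1, len_after_pool2, flatten_size)
```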

Fit model:
1. Fit the compiled model on the training split.
2. Train for 16 epochs (the number of passes over the training data) with a batch_size of 100 (the number of datapoints used in each gradient update), and pass validation_data so that the model is evaluated on the test split after every epoch.

In [17]:
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=16, batch_size=100, verbose=2)


Train on 3733 samples, validate on 1839 samples
Epoch 1/16
 - 8s - loss: 0.3209 - acc: 0.8853 - val_loss: 0.1134 - val_acc: 0.9685
Epoch 2/16
 - 7s - loss: 0.0665 - acc: 0.9807 - val_loss: 0.0613 - val_acc: 0.9831
Epoch 3/16
 - 6s - loss: 0.0310 - acc: 0.9901 - val_loss: 0.0669 - val_acc: 0.9826
Epoch 4/16
 - 7s - loss: 0.0209 - acc: 0.9941 - val_loss: 0.0628 - val_acc: 0.9859
Epoch 5/16
 - 7s - loss: 0.0128 - acc: 0.9960 - val_loss: 0.0717 - val_acc: 0.9864
Epoch 6/16
 - 7s - loss: 0.0081 - acc: 0.9973 - val_loss: 0.0848 - val_acc: 0.9864
Epoch 7/16
 - 6s - loss: 0.0061 - acc: 0.9987 - val_loss: 0.0952 - val_acc: 0.9875
Epoch 8/16
 - 6s - loss: 0.0033 - acc: 0.9995 - val_loss: 0.1117 - val_acc: 0.9859
Epoch 9/16
 - 7s - loss: 0.0033 - acc: 0.9989 - val_loss: 0.1101 - val_acc: 0.9848
Epoch 10/16
 - 6s - loss: 0.0017 - acc: 0.9992 - val_loss: 0.1083 - val_acc: 0.9837
Epoch 11/16
 - 6s - loss: 0.0011 - acc: 0.9995 - val_loss: 0.1492 - val_acc: 0.9842
Epoch 12/16
 - 6s - loss: 0.0010 - ac

<keras.callbacks.History at 0x14a398b3550>
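
One way to read the log is to look for the epoch with the lowest validation loss. The sketch below does this for the epochs that are visible above; the later epochs are cut off in the log, so the result only covers epochs 1 to 11.

```python
# val_loss for epochs 1-11, copied from the log above (later epochs are cut off there).
val_loss = [0.1134, 0.0613, 0.0669, 0.0628, 0.0717, 0.0848,
            0.0952, 0.1117, 0.1101, 0.1083, 0.1492]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])
```

The training loss keeps falling after that epoch while the validation loss drifts upward, a pattern that usually suggests overfitting, so training for fewer epochs may be worth trying.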

Find Accuracy of the model:
Find the accuracy using model.evaluate(), which returns the loss and the accuracy on the test split.

In [18]:
score, acc = model.evaluate(X_test, Y_test, verbose=2, batch_size=100)
print(score)
print(acc)

0.17939197689830774
0.9831430253417288
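
The accuracy can be turned back into a count of messages. This is a rough conversion that assumes accuracy is simply the fraction of correctly classified test rows; it does not say how the errors divide between ham and spam.

```python
accuracy = 0.9831430253417288   # from model.evaluate() above
n_test = 1839                   # rows in X_test

# Accuracy is the fraction of correct predictions, so multiply back and round.
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct
print(n_correct, n_wrong)
```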
