Import libraries:
1. pandas - used mainly for data manipulation and analysis.
2. sklearn (scikit-learn) - provides machine learning utilities; here, label encoding and the train/test split.
3. keras - provides the building blocks for neural networks (tokenizer, layers, models, optimizers).
4. re - provides regular expression operations.

In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Text preprocessing, model, and layer classes from Keras 2.x
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, SpatialDropout1D
from keras.layers import Conv1D, MaxPooling1D, Flatten
# to_categorical is exported from keras.utils directly
from keras.utils import to_categorical
from keras import optimizers


import re

Read Dataset:
1. Read the CSV file with pandas' read_csv.
2. Pass the filename and the encoding of the file as parameters. The cell below reads the file as 'latin-1'; a file saved as UTF-8 would use 'utf-8' instead.


In [None]:
dataset = pd.read_csv('spam.csv', encoding='latin-1')
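
The encoding matters because the same bytes can decode differently, or fail to decode, under another encoding. The sketch below uses a tiny in-memory CSV instead of spam.csv; its column names and text are made up for illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV encoded as latin-1; column names and text are made up.
raw = "label,text\nham,caf\u00e9 at 5\n".encode("latin-1")

# The byte 0xE9 ("é" in latin-1) is not valid UTF-8 here, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding recovers the text.
demo = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(utf8_ok)
print(demo.loc[0, "text"])
```

If read_csv raises a UnicodeDecodeError on a real file, trying another encoding such as 'latin-1' is a common first step.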

View Dataset:
.info() prints a summary of the dataframe (column names, non-null counts, and dtypes), and .head() displays its first five rows.

In [None]:
dataset.info()

In [None]:
dataset.head()

Drop unnecessary columns:
1. Unnecessary columns are dropped with the .drop method.
2. Pass the list of column names to drop, axis (0 to drop rows by index label, 1 to drop columns), and inplace (True modifies the dataframe in place and returns None; False returns a new dataframe).

In [None]:
dataset.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
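
The difference between the two inplace settings can be seen on a toy dataframe. The column names below are placeholders, not the ones in spam.csv.

```python
import pandas as pd

# Toy dataframe; the column names are placeholders.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace, drop returns a new dataframe and leaves the original untouched.
trimmed = demo.drop(columns=["b", "c"])

# With inplace=True, the original is modified and the call returns None.
result = demo.drop(["c"], axis=1, inplace=True)

print(list(trimmed.columns))
print(list(demo.columns))
print(result)
```

Passing columns=[...] is equivalent to passing the list with axis=1.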

Check drop action:
Check that the unnecessary columns were dropped by viewing the dataframe again.

In [None]:
dataset.head()
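
The later cells refer to columns named spam_or_ham and message. If the remaining columns carry other names, a rename is needed before those cells will run. The sketch below assumes the hypothetical names v1 and v2; check dataset.columns for the actual ones.

```python
import pandas as pd

# Hypothetical column names "v1" and "v2"; the real file may differ.
demo = pd.DataFrame({"v1": ["ham", "spam"],
                     "v2": ["see you soon", "WIN a prize now"]})

# rename returns a new dataframe with the names the later cells expect.
demo = demo.rename(columns={"v1": "spam_or_ham", "v2": "message"})
print(list(demo.columns))
```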

Data Pre-processing:
1. Use .str.strip to remove leading and trailing whitespace.
2. Use .str.lower to convert all strings to lower case.

In [None]:
dataset['spam_or_ham']=dataset.spam_or_ham.str.strip()
dataset['message']=dataset.message.str.strip()
dataset['message']=dataset.message.str.lower()
dataset.head()
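
As a small illustration of the two string methods, here they are chained on a toy Series. The messages are invented examples.

```python
import pandas as pd

# Invented messages with stray whitespace and mixed case.
messages = pd.Series(["  Free ENTRY now  ", "Ok lar...\n"])

# strip removes leading/trailing whitespace (including newlines); lower normalizes case.
cleaned = messages.str.strip().str.lower()
print(cleaned.tolist())
```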

Count categories:
Use .value_counts to count the number of datapoints in each class.

In [None]:
dataset.spam_or_ham.value_counts()
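
Relative frequencies are often easier to read than raw counts when checking class balance. A minimal sketch on invented labels, using the normalize option of value_counts:

```python
import pandas as pd

# Invented labels, only to show the call.
labels = pd.Series(["ham", "ham", "ham", "spam"])

counts = labels.value_counts()                 # absolute counts per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
print(counts)
print(shares)
```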

Tokenization:
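1. Tokenization splits a string into pieces called tokens.
2. For example:
input : "Hi python, this is my last assignment"
output : ["hi","python","this","is","my","last","assignment"]
3. Create a Tokenizer with num_words=1500 that splits strings on the character ' '. Only the most frequent words (at most num_words - 1 of them) are kept when texts are converted to sequences.
4. Fit the tokenizer on the messages to build the word index.
5. Convert each text to a sequence of integer word indices.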

In [None]:
maximum_number_of_features = 1500
tokenizer = Tokenizer(num_words=maximum_number_of_features, split=' ')
tokenizer.fit_on_texts(dataset['message'].values)
X = tokenizer.texts_to_sequences(dataset['message'].values)
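
To make the steps above concrete, here is a simplified pure-Python sketch of what the tokenizer does. It is an approximation for illustration, not Keras' implementation; the helper names tokenize, fit_word_index, and to_sequences are hypothetical, and the filter string is my understanding of the Tokenizer default.

```python
from collections import Counter

# Characters treated as separators (assumed to match the Tokenizer default).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in tokenize(text) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

texts = ["Hi python, this is my last assignment", "this is spam"]
word_index = fit_word_index(texts)
print(tokenize(texts[0]))
print(to_sequences(texts, word_index))
print(to_sequences(texts, word_index, num_words=5))
```

With num_words=5 only indices 1 to 4 survive, which is why rarer words disappear from the sequences.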

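Change the sequences to a 2D NumPy array:
Use pad_sequences from keras to pad every sequence to the same length, which yields a 2D NumPy array. By default, shorter sequences are padded with zeros at the start, up to the length of the longest sequence.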

In [None]:
print(X)
X = pad_sequences(X)
print("after applying pad_sequences")
print(X)
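
The default padding behaviour can be sketched with NumPy alone. The helper pad_pre below is hypothetical and covers only the default case (zeros added on the left, no truncation); it is not the Keras function.

```python
import numpy as np

def pad_pre(sequences, value=0):
    """Left-pad each sequence with `value` up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if seq:  # an empty sequence stays all padding
            padded[row, -len(seq):] = seq
    return padded

padded = pad_pre([[3, 4, 1], [1, 2], []])
print(padded)
```

Index 0 is reserved for padding, which is why the word index starts at 1.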

Make categorical variable:
1. Create a LabelEncoder.
2. Use fit_transform on the categorical column 'spam_or_ham' to convert its labels to integers.
3. LabelEncoder numbers the classes from 0 to (number of classes - 1).
4. Use to_categorical to turn the integer labels into one-hot vectors.

In [None]:
labelencoder = LabelEncoder()
integer_encoded = labelencoder.fit_transform(dataset['spam_or_ham'])
Y = to_categorical(integer_encoded)
print(Y)
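
The two encoding steps can be mirrored with NumPy to see what each one produces. This is a sketch of the idea on invented labels, not the scikit-learn or Keras code path.

```python
import numpy as np

# Invented labels for illustration.
labels = np.array(["ham", "spam", "ham", "ham"])

# Sorted unique classes and the integer code of each label (like LabelEncoder).
classes, codes = np.unique(labels, return_inverse=True)

# One row per label with a single 1 in the column of its class (like to_categorical).
one_hot = np.eye(len(classes), dtype="float32")[codes]

# argmax over the columns recovers the integer codes, and from them the labels.
recovered = classes[one_hot.argmax(axis=1)]
print(classes, codes)
print(one_hot)
```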

Split Dataset:
1. Use train_test_split() to split the data into four parts: the input features for training and testing (X_train and X_test) and the categorical target for training and testing (Y_train and Y_test).
2. Set the test size to 33%.
3. Set random_state to fix the seed of the shuffle so that the split is reproducible.
4. Print the shapes of the resulting arrays.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
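
The sketch below shows on toy arrays that a fixed random_state reproduces the same split. It also uses the stratify option, which the notebook does not use, to keep the class ratio equal in both parts; that can help when one class is rare. All array values are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 4 of them in the minority class.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 16 + [1] * 4)

# Two calls with the same random_state yield identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.25,
                           random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.25,
                           random_state=42, stratify=y_demo)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(same, X_tr.shape, X_te.shape, int(y_te.sum()))
```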

CNN Model training:
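1. Create a Sequential model, which is built by stacking layers one after another.
2. Add an Embedding layer followed by SpatialDropout1D, then two Conv1D layers, each followed by MaxPooling1D.
3. Finally, flatten the output and pass it to a Dense output layer with one unit per class.
4. Compile the model with categorical cross-entropy loss and the RMSprop optimizer.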

In [None]:
model = Sequential()
# Keras 2 Embedding has no dropout argument; SpatialDropout1D below provides the dropout.
model.add(Embedding(maximum_number_of_features, 128, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))

model.add(Conv1D(64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())
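# softmax yields one probability distribution over the two classes, matching categorical_crossentropy
model.add(Dense(2, activation='softmax'))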

model.summary()  # summary() prints the table itself and returns None
model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(learning_rate=0.001), metrics=['accuracy'])
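
The shapes reported by model.summary() can be checked by hand. Conv1D with padding='same' (and stride 1) keeps the sequence length, while MaxPooling1D with pool_size=2 halves it, rounding down. The sketch below follows that arithmetic; flattened_size is a hypothetical helper, and seq_len = 100 is an assumed value, since the real length is X.shape[1] and depends on the data.

```python
def flattened_size(seq_len, pool_sizes=(2, 2), last_filters=32):
    """Length after each pooling step, times the filters of the last Conv1D."""
    length = seq_len
    for pool in pool_sizes:
        length = length // pool  # pooling with stride == pool size rounds down
    return length * last_filters

seq_len = 100                    # assumed; use X.shape[1] for the real value
flat = flattened_size(seq_len)   # 100 -> 50 -> 25 time steps, times 32 filters
dense_params = flat * 2 + 2      # weights plus biases of the 2-unit Dense layer
print(flat, dense_params)
```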

Fit model:
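1. Train the compiled model on the training split.
2. Train for 16 epochs (full passes over the training data) with batch_size=100 (the number of datapoints used for each gradient update), and pass validation_data to evaluate the model after each epoch. Here the test split doubles as the validation data.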

In [None]:
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=16, batch_size=100, verbose=2)
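
The relation between epochs, batch size, and the number of weight updates is simple arithmetic. The training-set size below is an assumed figure for illustration; the real one is X_train.shape[0].

```python
import math

n_train = 1050      # assumed number of training rows
batch_size = 100
epochs = 16

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```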

Find Accuracy of the model:
model.evaluate() returns the loss and the accuracy on the test split.

In [None]:
score, acc = model.evaluate(X_test, Y_test, verbose=2, batch_size=100)
print(score)
print(acc)
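
Accuracy alone can hide poor performance on a rare class, so precision and recall for the spam class are worth a look as well. The sketch below computes all three with NumPy from one-hot targets and predicted probabilities. The numbers are invented, not output of this model, and column 1 is assumed to be the spam class.

```python
import numpy as np

# Invented values for illustration only.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]])
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3], [0.8, 0.2]])

true_cls = y_true.argmax(axis=1)   # class index of each target
pred_cls = probs.argmax(axis=1)    # most probable class per row

accuracy = (true_cls == pred_cls).mean()

spam = 1  # assumed column of the spam class
tp = np.sum((pred_cls == spam) & (true_cls == spam))
fp = np.sum((pred_cls == spam) & (true_cls != spam))
fn = np.sum((pred_cls != spam) & (true_cls == spam))
precision = tp / (tp + fp)   # share of predicted spam that is spam
recall = tp / (tp + fn)      # share of actual spam that was found
print(accuracy, precision, recall)
```

On real data, probs would come from model.predict(X_test) and y_true from Y_test.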