In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sms-spam-collection-dataset/spam.csv


# What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a highly influential natural language processing (NLP) model introduced by Google in 2018. It was a breakthrough in pre-training techniques for language understanding tasks. BERT belongs to the family of transformer-based models, which have become popular for their ability to capture context and long-range dependencies in text.

What sets BERT apart from previous models is its use of bidirectional training. Unlike earlier language models that read text strictly left-to-right or right-to-left, BERT conditions on both the left and the right context at the same time during training. This lets BERT infer the meaning of a word from all of the words and sentences surrounding it.
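
To make the contrast concrete, here is a minimal sketch of the two visibility patterns, assuming a toy sequence of four tokens. The function names are illustrative and not part of any library; real models apply masks like these inside their attention layers.

```python
def causal_mask(n):
    """Left-to-right visibility: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style visibility: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["win", "a", "free", "prize"]
for name, mask in [("causal", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the causal pattern the first token sees only itself, while in the bidirectional pattern each token sees the whole sequence.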

BERT is pre-trained on a large corpus of unlabeled text, such as Wikipedia articles, where it learns to predict words that have been masked out of sentences (masked language modeling). This pre-training lets the model capture general language knowledge and context. After pre-training, BERT is fine-tuned on specific downstream tasks, such as text classification, named entity recognition, and question answering.
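
The "predict the missing word" objective can be sketched in plain Python. This is a simplified illustration rather than BERT's actual data pipeline: the helper `mask_tokens`, its parameters, and the whitespace tokenization are assumptions made for the example.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the originals.

    Simplified: real BERT pre-training also sometimes keeps the chosen token
    or swaps in a random one instead of always using [MASK].
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token  # what the model would have to predict
        else:
            masked.append(token)
    return masked, targets


sentence = "your free text messages are valid until december".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)   # the corrupted input the model sees
print(targets)  # position -> original token to recover
```

No labels are needed for this objective, because the training targets come from the text itself.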

One of the key advantages of BERT is its ability to handle a wide range of NLP tasks with minimal task-specific modifications. By fine-tuning the pre-trained BERT model on a specific task, it can achieve state-of-the-art performance on various benchmarks, surpassing many traditional approaches and specialized models.

BERT has had a profound impact on the field of NLP, leading to advances in areas such as sentiment analysis, question answering, information retrieval, and more. Its success has also inspired subsequent models built on the same idea, such as RoBERTa and ELECTRA, further pushing the boundaries of natural language understanding.

In [2]:
import pandas as pd  # dataframe manipulation
import spacy  # text preprocessing with pretrained NLP pipelines
import numpy as np  # array operations
import tensorflow_hub as hub  # loading the pretrained BERT layers from TF Hub
import tensorflow_text as text  # registers the custom ops needed by the BERT preprocessing layer
from sklearn.model_selection import train_test_split  # splitting the data into train and test sets
from sklearn.metrics import classification_report  # model evaluation
import tensorflow as tf



To begin, let's take a look at the data.

In [3]:
data=pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv",encoding='latin-1')

In [4]:
data.v1.value_counts()

ham     4825
spam     747
Name: v1, dtype: int64

In [5]:
data=data.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"],axis=1)

As we can see, the data is imbalanced: there are 4825 ham messages but only 747 spam messages. This imbalance may hurt the model's performance later, so we build a new dataframe by sampling 747 rows from each category.
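
Balancing by downsampling can be sketched on a toy dataframe. This is an illustrative alternative to sampling each class separately, not the notebook's own code: the toy data is made up, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for the SMS data: 6 "ham" rows and 2 "spam" rows
toy = pd.DataFrame({
    "v1": ["ham"] * 6 + ["spam"] * 2,
    "v2": [f"message {i}" for i in range(8)],
})

# The smallest class decides how many rows every class keeps
n_per_class = int(toy["v1"].value_counts().min())

# Draw the same number of rows from each class, reproducibly
balanced = toy.groupby("v1").sample(n=n_per_class, random_state=2022)

print(balanced["v1"].value_counts())
```

The trade-off is that downsampling throws away most of the majority class, so the model sees fewer ham examples than are available.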

In [6]:
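# Sampling all 747 spam rows only shuffles them; ham is downsampled to the same size.
# The same seed is used for both classes so the balanced set is reproducible.
df_spam=data[data["v1"]=="spam"].sample(747,random_state=2022)
df_ham=data[data["v1"]=="ham"].sample(747,random_state=2022)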

In [7]:
final_data=pd.concat([df_spam,df_ham])

In [8]:
final_data.head()

Unnamed: 0,v1,v2
1224,spam,You are a winner U have been specially selecte...
1652,spam,For ur chance to win a å£250 cash every wk TXT...
712,spam,08714712388 between 10am-7pm Cost 10p
2308,spam,Moby Pub Quiz.Win a å£100 High Street prize if...
2363,spam,Fantasy Football is back on your TV. Go to Sky...


As you may have noticed, v1 is a categorical feature, so we need to convert it into a numerical one (1 for spam, 0 for ham).

In [9]:
final_data["v1"]=final_data["v1"].apply(lambda x:1 if x=="spam" else 0)

In [10]:
final_data

Unnamed: 0,v1,v2
1224,1,You are a winner U have been specially selecte...
1652,1,For ur chance to win a å£250 cash every wk TXT...
712,1,08714712388 between 10am-7pm Cost 10p
2308,1,Moby Pub Quiz.Win a å£100 High Street prize if...
2363,1,Fantasy Football is back on your TV. Go to Sky...
...,...,...
303,0,He is a womdarfull actor
2381,0,Best line said in Love: . \I will wait till th...
4085,0,Lemme know when you're here
5318,0,"Good morning, my Love ... I go to sleep now an..."


In [11]:
nlp=spacy.load("en_core_web_sm")

Stop words such as "always", "would", and "every" could hurt the performance of the model, so we simply remove them from each message.


In [12]:
def remove_stop_words(text):
    """Return `text` with spaCy stop words and punctuation tokens removed."""
    doc=nlp(text)
    # Keep only the tokens that are neither stop words nor punctuation
    kept_tokens=[token.text for token in doc if not (token.is_stop or token.is_punct)]
    return " ".join(kept_tokens)

In [13]:
final_data["v2"]=final_data["v2"].apply(remove_stop_words)

Splitting the data into training and test sets

In [14]:
X_train,X_test,y_train,y_test=train_test_split(final_data.v2,final_data.v1,random_state=1012)
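
The call above does not pass `test_size`, so scikit-learn falls back to its default of 25% for the test set. As a small worked check of the resulting sizes, assuming the balanced dataframe holds 747 + 747 rows:

```python
import math

n_samples = 747 + 747   # balanced dataset: 747 spam rows + 747 ham rows
test_fraction = 0.25    # scikit-learn's default when test_size is not given

# The test size is rounded up; the remaining rows go to training
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1120 374
```

The 374 test rows match the total support shown in the classification report further down.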

In [15]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

Building a function that preprocesses the text in the v2 column and converts it into embedding vectors

In [16]:
def get_sentence_embedding(text):
    preprocessed_text=bert_preprocess(text)
    return bert_encoder(preprocessed_text)["pooled_output"]
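
One common way to inspect sentence embeddings like these is to compare them with cosine similarity. The sketch below is an illustration only: it uses short made-up vectors as stand-ins so that it runs without downloading the model, and `cosine_similarity` is a helper defined here, not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 4-dimensional stand-ins; the pooled_output of this BERT model has 768 dimensions
spam_a = [0.9, 0.1, -0.3, 0.5]
spam_b = [0.8, 0.2, -0.2, 0.6]
ham_a = [-0.4, 0.7, 0.6, -0.1]

print(cosine_similarity(spam_a, spam_b))  # similar direction, close to 1
print(cosine_similarity(spam_a, ham_a))   # different direction, lower value
```

With the real model, the same helper could be applied to rows of `get_sentence_embedding([...]).numpy()`.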

Building a simple neural net

In [17]:
# BERT layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [18]:
METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)
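
The `binary_crossentropy` loss used above penalizes confident wrong predictions much more heavily than confident correct ones. Here is a minimal plain-Python sketch of the formula. The labels and probabilities are made up, and the clipping constant `eps` is an assumption chosen for the example rather than a statement about Keras internals.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels (1 = spam) and predicted spam probabilities
labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_cross_entropy(labels, confident_and_right))  # small loss
print(binary_cross_entropy(labels, confident_and_wrong))  # much larger loss
```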

In [19]:
model.fit(X_train,y_train,epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7e0a0af07df0>

In [20]:
model.evaluate(X_test,y_test)



[0.29890942573547363,
 0.8877005577087402,
 0.8839778900146484,
 0.8839778900146484]

In [21]:
y_predicted=model.predict(X_test)



In [22]:
y_predicted=y_predicted.flatten()

In [23]:

y_predicted = np.where(y_predicted > 0.5, 1, 0)
y_predicted

array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,

In [24]:
print(classification_report(y_test,y_predicted))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89       193
           1       0.88      0.88      0.88       181

    accuracy                           0.89       374
   macro avg       0.89      0.89      0.89       374
weighted avg       0.89      0.89      0.89       374
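
As a worked check of how these figures relate to each other, the sketch below recomputes the spam-class precision, recall, and F1, plus the overall accuracy, from confusion-matrix counts. The counts are an assumption: they are inferred from the support values and the metrics reported above, not printed by the notebook.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption;
# the notebook does not print them). Positive class = spam (label 1).
tp, fp, fn, tn = 160, 21, 21, 172

precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.88 recall=0.88 f1=0.88 accuracy=0.89
```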



In [25]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p pÂ£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your .",
    "Dont miss the chance get $100000000 by clicking this link http:vsvlslksdnvlkldvk"
]

In [26]:
model.predict(reviews)




array([[0.798231  ],
       [0.8944814 ],
       [0.8782828 ],
       [0.44492775],
       [0.26584288],
       [0.5294253 ]], dtype=float32)

In [27]:
def prediction(reviews):
    """Print a spam/ham verdict for each message, using a 0.5 probability threshold."""
    probabilities=model.predict(reviews).flatten()
    for probability in probabilities:
        if probability>=0.5:
            print("this message is spam")
        else:
            print("this message is ham")

In [28]:
prediction(reviews)

this message is spam
this message is spam
this message is spam
this message is ham
this message is ham
this message is spam
