<h2 align="center">BERT tutorial: Classify fake vs real tweets</h2>

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

<h4>Import the dataset (taken from Kaggle)</h4>

In [2]:
import pandas as pd

df=pd.read_excel('../data/Constraint_Train.xlsx')
df.head(5)

Unnamed: 0,id,tweet,label
0,1.0,The CDC currently reports 99031 deaths. In gen...,real
1,2.0,States reported 1121 deaths a small rise from ...,real
2,3.0,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4.0,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5.0,Populous states can generate large case counts...,real


In [3]:
#df.groupby('label').describe()

In [4]:
#df['label'].value_counts()

In [5]:
#3060/3360

**About 48% fake tweets (3060) and 52% real tweets (3360): the classes are only mildly imbalanced, so we downsample the majority class to match**

In [6]:
df_fake = df[df['label']=='fake']
df_fake.shape

(3060, 3)

In [7]:
df_real = df[df['label']=='real']
df_real.shape

(3360, 3)

In [8]:
df_real_downsampled = df_real.sample(df_fake.shape[0])
df_real_downsampled.shape

(3060, 3)

In [9]:
df_balanced = pd.concat([df_real_downsampled, df_fake])
df_balanced.shape

(6120, 3)

In [10]:
df_balanced['label'].value_counts()

real    3060
fake    3060
Name: label, dtype: int64
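
Because `sample` draws rows at random, the balanced set changes on every run. The following is a minimal sketch of the same downsampling step with a fixed seed, on a toy frame. The toy rows and `random_state=42` are illustrative assumptions and are not part of the notebook above.

```python
import pandas as pd

# Toy stand-in for the tweet dataset: 5 "real" rows and 3 "fake" rows.
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(8)],
    "label": ["real"] * 5 + ["fake"] * 3,
})

toy_fake = toy[toy["label"] == "fake"]
toy_real = toy[toy["label"] == "real"]

# Downsample the majority class to the minority size.
# A fixed random_state makes the draw repeatable across runs.
toy_real_down = toy_real.sample(n=len(toy_fake), random_state=42)

toy_balanced = pd.concat([toy_real_down, toy_fake])
counts = toy_balanced["label"].value_counts().to_dict()
print(counts)
```

Passing `random_state` to the `sample` call in cell 8 would likewise make the notebook's balanced set reproducible.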

In [11]:
df_balanced['fake']=df_balanced['label'].apply(lambda x: 1 if x=='fake' else 0)
df_balanced.sample(5)

Unnamed: 0,id,tweet,label,fake
2101,2102.0,COVID-19 is cured with hot water and baking soda.,fake,1
6259,6260.0,The pandemic must be a catalyst for taking oth...,fake,1
5699,5700.0,"Dear friends, \nNew study by China proves that...",fake,1
4900,4901.0,RT @RidgeOnSunday: 'You want children to be pr...,real,0
1095,1096.0,Netflix documentary Tiger King has risen in po...,fake,1


<h4>Split into training and test sets</h4>

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_balanced['tweet'],df_balanced['fake'], stratify=df_balanced['fake'])
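
When `test_size` is not given, `train_test_split` holds out 25% of the rows, and `stratify` keeps the fake/real ratio the same in both splits. A small sketch on synthetic labels shows the effect. The toy data, the explicit `test_size=0.25`, and `random_state=0` are assumptions made for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 40 synthetic samples, evenly split between class 0 and class 1.
toy_texts = [f"tweet {i}" for i in range(40)]
toy_labels = [0] * 20 + [1] * 20

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# Sizes of the two splits, then the per-class counts in each split.
print(len(toy_X_train), len(toy_X_test))
print(Counter(toy_y_train), Counter(toy_y_test))
```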

In [13]:
X_train.head(4)

1495    “I've always known this is a real this is a pa...
4640    _Research shows that parts of the HIV genetic ...
3711    Latest update from the Ministry of Health: The...
6176    RT @MoHFW_INDIA: #IndiaFightsCorona Last 5 lak...
Name: tweet, dtype: object

<h4>Now let's import the BERT model and get embedding vectors for a few sample sentences</h4>

In [14]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [15]:
def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

get_sentence_embeding([
    "500$ discount. hurry up", 
    "Bhavin, are you up for a volleybal game tomorrow?"]
)

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.84351707, -0.5132726 , -0.8884571 , ..., -0.74748856,
        -0.7531474 ,  0.91964495],
       [-0.87208354, -0.50543964, -0.94446677, ..., -0.8584749 ,
        -0.7174535 ,  0.88082975]], dtype=float32)>

<h4>Get embedding vectors for a few sample words and compare them using cosine similarity</h4>

In [16]:
# e = get_sentence_embeding([
#     "banana", 
#     "grapes",
#     "mango",
#     "jeff bezos",
#     "elon musk",
#     "bill gates"
# ]
# )

In [17]:
# from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity([e[0]],[e[1]])

Values near 1 mean the embeddings are similar; values near 0 mean they are very different.
Comparing "banana" with "grapes" above gives a similarity of 0.99, since both are fruits.

In [18]:
# cosine_similarity([e[0]],[e[3]])

Comparing "banana" with "jeff bezos" still gives 0.84, but that is not as close as the 0.99 we got with "grapes".

In [19]:
# cosine_similarity([e[3]],[e[4]])

Jeff Bezos and Elon Musk are more similar to each other than Jeff Bezos and banana, as indicated above.
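
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and ranges from -1 to 1. A minimal sketch in plain Python, using made-up 3-dimensional vectors rather than real BERT embeddings (the helper name `cosine` is hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: similarity close to 1.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
# Opposite direction: similarity close to -1.
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```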

<h4>Build Model</h4>

There are two types of models you can build in TensorFlow:

(1) Sequential
(2) Functional

So far we have built Sequential models. Below we will build a Functional model. More information on the two is here: https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057

In [20]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

If plotting the model fails with an ImportError about pydot and graphviz, see: https://stackoverflow.com/questions/47605558/importerror-failed-to-import-pydot-you-must-install-pydot-and-graphviz-for-py

In [21]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

In [22]:
len(X_train)

4590

In [23]:
METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)
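
The three metrics answer different questions: accuracy is the share of all predictions that are correct, precision is the share of predicted "fake" tweets that really are fake, and recall is the share of truly fake tweets that the model catches. The sketch below computes them by hand on made-up labels. The lists `toy_true` and `toy_pred` are illustrative assumptions, not model output.

```python
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, predicted fake
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real, predicted fake
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, predicted real
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real, predicted real

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```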

<h4>Train the model</h4>

In [None]:
model.fit(X_train, y_train, epochs=10)

Epoch 1/10
  2/144 [..............................] - ETA: 22:30 - loss: 1.2105 - accuracy: 0.5156 - precision: 0.5156 - recall: 1.0000

In [None]:
model.evaluate(X_test, y_test)

In [None]:
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()

In [None]:
import numpy as np

y_predicted = np.where(y_predicted > 0.5, 1, 0)
y_predicted

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_predicted)
cm 
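
For binary labels 0 and 1, scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so the layout is `[[TN, FP], [FN, TP]]`. A plain-Python sketch of the thresholding step and that layout, using made-up probabilities (`toy_probs` and `toy_truth` are hypothetical values, not model output):

```python
toy_probs = [0.91, 0.12, 0.55, 0.49, 0.78, 0.03]
toy_truth = [1, 0, 0, 1, 1, 0]

# Same rule as the np.where call above: probability > 0.5 means class 1.
toy_preds = [1 if p > 0.5 else 0 for p in toy_probs]

# Rows index the true label, columns index the predicted label.
toy_cm = [[0, 0], [0, 0]]
for t, p in zip(toy_truth, toy_preds):
    toy_cm[t][p] += 1

print(toy_preds)
print(toy_cm)
```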

In [None]:
from matplotlib import pyplot as plt
import seaborn as sn
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')

In [None]:
print(classification_report(y_test, y_predicted))

<h4>Inference</h4>

In [None]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)