# Predicting Covid-19 Misinformation on Twitter

Building a predictive model that classifies misinformation is a first step toward mitigating its spread. Because information and misinformation now spread with equal ease, misinformation needs to be caught early, before it circulates widely and causes harm. This matters particularly for the Covid-19 pandemic, where circulating misinformation could shape people's views of the virus, and of global health more broadly, on a large scale. A classifier offers an automated way to begin identifying misinformation and stopping its spread, which may in turn help mitigate the spread of Covid-19 itself.

## Import Data

In [10]:
# Source: Fighting an Infodemic: COVID-19 Fake News Dataset, https://github.com/diptamath/covid_fake_news, https://arxiv.org/abs/2011.03327

import pandas as pd
trainingdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv", usecols = ['tweet','label'])
testdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/english_test_with_labels.csv", usecols = ['tweet','label'])

trainingdata

Unnamed: 0,tweet,label
0,The CDC currently reports 99031 deaths. In gen...,real
1,States reported 1121 deaths a small rise from ...,real
2,Politically Correct Woman (Almost) Uses Pandem...,fake
3,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,Populous states can generate large case counts...,real
...,...,...
6415,A tiger tested positive for COVID-19 please st...,fake
6416,???Autopsies prove that COVID-19 is??� a blood...,fake
6417,_A post claims a COVID-19 vaccine has already ...,fake
6418,Aamir Khan Donate 250 Cr. In PM Relief Cares Fund,fake


**************************************

Example of real news tweet

In [11]:
trainingdata[trainingdata.label=="real"].loc[0, 'tweet']

'The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today.'

Example of fake news tweet

In [12]:
trainingdata[trainingdata.label=="fake"].loc[2, 'tweet']

'Politically Correct Woman (Almost) Uses Pandemic as Excuse Not to Reuse Plastic Bag https://t.co/thF8GuNFPe #coronavirus #nashville'

## Define Preprocessor

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(trainingdata.tweet)

# preprocessor converts tweets to integer sequences and pads/truncates them to the same length
def preprocessor(data, maxlen, max_words):
    # max_words is kept for compatibility with existing calls; the vocabulary
    # size is actually fixed by the tokenizer's num_words (10000) above
    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X
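
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros, and long ones keep only their last `maxlen` tokens. The helper `pad_pre` is a hypothetical stand-in written for illustration; it is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with 'pre' padding and 'pre' truncation."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with `value`
    return padded

# A short sequence gets zeros on the left; a long one loses its earliest tokens.
example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `maxlen=40`, as used in this notebook, any tweet longer than 40 tokens would lose its opening words rather than its closing ones.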

## Prepare Train and Test Data

In [22]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=40, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=40, max_words=10000)

# one hot encode Y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)
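
`pd.get_dummies` creates one indicator column per label and orders the columns alphabetically, so here column 0 is `fake` and column 1 is `real`. The pure-Python sketch below mimics that layout on a toy label list; `one_hot` is a hypothetical helper used only for illustration.

```python
def one_hot(labels):
    """Return (columns, rows): one indicator column per distinct label, sorted alphabetically."""
    columns = sorted(set(labels))
    rows = [[int(label == col) for col in columns] for label in labels]
    return columns, rows

columns, rows = one_hot(["real", "real", "fake"])
print(columns)  # ['fake', 'real']
print(rows)     # [[0, 1], [0, 1], [1, 0]]
```

This column order is what later allows an `argmax` over the two model outputs to be mapped back to a label name.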

In [48]:
print(X_train.shape)
print(X_test.shape)

(6420, 40)
(2140, 40)


### Import Libraries

In [34]:
from tensorflow.keras.layers import SimpleRNN, LSTM, Embedding, Flatten, Dense, Bidirectional, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import RMSprop  # used by the bidirectional and Conv1D models below
from tensorflow.keras import layers

### Model Evaluation Function

In [24]:
# performance metrics
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
from math import sqrt

def model_eval_metrics(y_true, y_pred, classification="TRUE"):
    # Metrics that do not apply to the chosen task type are reported as 0
    metricdata = {'accuracy': 0, 'f1_score': 0, 'precision': 0, 'recall': 0,
                  'mse': 0, 'rmse': 0, 'mae': 0, 'r2': 0}
    if classification == "TRUE":
        metricdata['accuracy'] = accuracy_score(y_true, y_pred)
        metricdata['f1_score'] = f1_score(y_true, y_pred, average="macro", zero_division=0)
        metricdata['precision'] = precision_score(y_true, y_pred, average="macro", zero_division=0)
        metricdata['recall'] = recall_score(y_true, y_pred, average="macro", zero_division=0)
    else:
        metricdata['mse'] = mean_squared_error(y_true, y_pred)
        metricdata['rmse'] = sqrt(metricdata['mse'])
        metricdata['mae'] = mean_absolute_error(y_true, y_pred)
        metricdata['r2'] = r2_score(y_true, y_pred)
    # One-row DataFrame with the same column order as the dictionary above
    return pd.DataFrame([metricdata])
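
The classification branch uses macro averaging: precision, recall, and F1 are computed per class and then averaged with equal weight. As a rough illustration of that arithmetic, the sketch below recomputes the three scores by hand on a toy example. `macro_scores` is a hypothetical helper, not a scikit-learn function, and the toy labels are made up.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 computed from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0  # mirrors zero_division=0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example: one fake tweet is misclassified as real.
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8333 0.75 0.7333
```

Because both classes count equally, a model that does well on only one class is penalized even if overall accuracy looks acceptable.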

In [49]:
# set seed for reproducibility
seed = 99
import tensorflow as tf
tf.random.set_seed(seed)

## Baseline Model 

In [None]:
tf.random.set_seed(seed)

# replace this model with the architectures from the task description
model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.942056,0.941929,0.941929,0.941929,0,0,0,0


********************************************

### Sequential Model with Embedding and LSTM layers

In [None]:
tf.random.set_seed(seed)

# replace this model with the architectures from the task description
model = Sequential()
model.add(Embedding(10000, 40, input_length=40))
model.add(LSTM(20, activation='sigmoid', return_sequences=True))
model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=64,
                    validation_split=0.2)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.933645,0.933542,0.93334,0.93385,0,0,0,0


### GloVe Embedding layer model

In [None]:
# Download Glove embedding matrix weights (pretrained embeddings)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2021-04-12 16:52:29--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2021-04-12 16:52:29--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2021-04-12 16:52:29--  http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [applic

In [None]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [None]:
# Extract embedding data for the 100-feature embedding matrix
import os
glove_dir = os.getcwd()

embeddings_index = {}
# GloVe files are UTF-8 encoded; the context manager closes the file automatically
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [None]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(trainingdata.tweet)
word_index = tokenizer.word_index


embedding_matrix = np.zeros((10000, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < 10000:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
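
Words that are missing from the GloVe index keep an all-zero row, and because the layer is frozen later, those rows carry no signal during training. One thing worth checking is therefore how much of the tweet vocabulary GloVe actually covers. The sketch below shows that check on toy dictionaries. `embedding_coverage` and the toy data are hypothetical; in the notebook the inputs would be `word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index, num_words):
    """Fraction of tokenizer words with index < num_words that have a pretrained vector."""
    kept = [word for word, i in word_index.items() if i < num_words]
    found = sum(1 for word in kept if word in embeddings_index)
    return found / len(kept) if kept else 0.0

# Toy stand-ins: two of the four vocabulary words have a pretrained vector.
toy_word_index = {"covid": 1, "the": 2, "vaccine": 3, "indiafightscorona": 4}
toy_embeddings = {"the": [0.1], "vaccine": [0.2]}
print(embedding_coverage(toy_word_index, toy_embeddings, num_words=5))  # 0.5
```

A low coverage value would suggest that many task-specific tokens, such as hashtags, are being represented by zero vectors.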

In [None]:
# Set up same model architecture as before and then import Glove weights to Embedding layer:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, embedding_dim, input_length=40))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))
# Load the GloVe weights into the Embedding layer (as in transfer learning) and turn off the trainable option to freeze them before fitting.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

mc = ModelCheckpoint('best_glovemodel.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.5, min_lr=0.001)

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=12, 
                    callbacks=[mc,red_lr])


# model.summary()

Epoch 1/25

Epoch 00001: auc improved from -inf to 0.90177, saving model to best_glovemodel.h5
Epoch 2/25

Epoch 00002: auc improved from 0.90177 to 0.96744, saving model to best_glovemodel.h5
Epoch 3/25

Epoch 00003: auc improved from 0.96744 to 0.98521, saving model to best_glovemodel.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/25

Epoch 00004: auc improved from 0.98521 to 0.99364, saving model to best_glovemodel.h5
Epoch 5/25

Epoch 00005: auc improved from 0.99364 to 0.99760, saving model to best_glovemodel.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/25

Epoch 00006: auc improved from 0.99760 to 0.99944, saving model to best_glovemodel.h5
Epoch 7/25

Epoch 00007: auc improved from 0.99944 to 0.99988, saving model to best_glovemodel.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/25

Epoch 00008: auc improved from 0.99988 to 0.99999, saving model to best_glovemodel.h5
Epoch 9/25

Epoch 00009:
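
The log above reports "reducing learning rate to 0.001" several times. A likely reason, assuming the optimizer starts at the Keras default learning rate of 0.001 (the compile call passes `optimizer='adam'` without an explicit rate), is that `min_lr=0.001` clamps every reduction back to the starting value. The sketch below is a simplified version of that clamping rule, not the Keras implementation, and `reduced_lr` is a hypothetical name.

```python
def reduced_lr(current_lr, factor, min_lr):
    """Simplified plateau update: scale the rate by `factor`, but never go below `min_lr`."""
    return max(current_lr * factor, min_lr)

# With min_lr equal to the starting rate, the "reduction" leaves the rate unchanged.
print(reduced_lr(0.001, factor=0.5, min_lr=0.001))  # 0.001
# A smaller floor would let the rate actually drop.
print(reduced_lr(0.001, factor=0.5, min_lr=1e-5))   # 0.0005
```

Under this assumption, the callback as configured in the notebook would not change the effective learning rate during training.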

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.864486,0.864087,0.864431,0.863839,0,0,0,0


### Recurrent Neural Network (RNN) model

In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(SimpleRNN(16))
model.add(Dense(2, activation='sigmoid'))

mc = ModelCheckpoint('best_rnnmodel.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.5, min_lr=0.001)

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc', 'AUC'])
history = model.fit(X_train, y_train,
                    epochs=15,
                    batch_size=32,verbose=1,
                    callbacks=[mc,red_lr])

Epoch 1/15

Epoch 00001: auc improved from -inf to 0.90291, saving model to best_rnnmodel.h5
Epoch 2/15

Epoch 00002: auc improved from 0.90291 to 0.97519, saving model to best_rnnmodel.h5
Epoch 3/15

Epoch 00003: auc improved from 0.97519 to 0.98878, saving model to best_rnnmodel.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/15

Epoch 00004: auc improved from 0.98878 to 0.99396, saving model to best_rnnmodel.h5
Epoch 5/15

Epoch 00005: auc improved from 0.99396 to 0.99569, saving model to best_rnnmodel.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/15

Epoch 00006: auc improved from 0.99569 to 0.99728, saving model to best_rnnmodel.h5
Epoch 7/15

Epoch 00007: auc improved from 0.99728 to 0.99810, saving model to best_rnnmodel.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/15

Epoch 00008: auc improved from 0.99810 to 0.99841, saving model to best_rnnmodel.h5
Epoch 9/15

Epoch 00009: auc improved fr

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.915421,0.915041,0.916536,0.914294,0,0,0,0


### Stacked Recurrent Neural Network (RNN) model


In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 116, input_length=40))
model.add(LSTM(116, return_sequences=True))
model.add(SimpleRNN(64, return_sequences=True))
model.add(SimpleRNN(64, return_sequences=True))
model.add(SimpleRNN(32))
model.add(Dense(2, activation='softmax'))

mc = ModelCheckpoint('best_rnnstackedmodel.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.5, min_lr=0.001) # metric is logged as 'auc' (lowercase)

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,callbacks=[mc,red_lr])

Epoch 1/20

Epoch 00001: auc improved from -inf to 0.92301, saving model to best_rnnstackedmodel.h5
Epoch 2/20

Epoch 00002: auc improved from 0.92301 to 0.98428, saving model to best_rnnstackedmodel.h5
Epoch 3/20

Epoch 00003: auc improved from 0.98428 to 0.99102, saving model to best_rnnstackedmodel.h5
Epoch 4/20

Epoch 00004: auc improved from 0.99102 to 0.99421, saving model to best_rnnstackedmodel.h5
Epoch 5/20

Epoch 00005: auc improved from 0.99421 to 0.99668, saving model to best_rnnstackedmodel.h5
Epoch 6/20

Epoch 00006: auc improved from 0.99668 to 0.99723, saving model to best_rnnstackedmodel.h5
Epoch 7/20

Epoch 00007: auc improved from 0.99723 to 0.99839, saving model to best_rnnstackedmodel.h5
Epoch 8/20

Epoch 00008: auc improved from 0.99839 to 0.99890, saving model to best_rnnstackedmodel.h5
Epoch 9/20

Epoch 00009: auc did not improve from 0.99890
Epoch 10/20

Epoch 00010: auc improved from 0.99890 to 0.99891, saving model to best_rnnstackedmodel.h5
Epoch 11/20

Epoc

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.940654,0.940556,0.940373,0.940809,0,0,0,0


In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 116, input_length=40))
model.add(LSTM(116, return_sequences=True))
model.add(SimpleRNN(116, return_sequences=True))
model.add(SimpleRNN(64, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, activation='sigmoid'))
model.add(Dense(32, activation='sigmoid'))
model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))

mc = ModelCheckpoint('best_rnnstacked2model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.5, min_lr=0.001)

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,callbacks=[mc,red_lr])

Epoch 1/50

Epoch 00001: auc improved from -inf to 0.85879, saving model to best_rnnstacked2model.h5
Epoch 2/50

Epoch 00002: auc improved from 0.85879 to 0.97182, saving model to best_rnnstacked2model.h5
Epoch 3/50

Epoch 00003: auc improved from 0.97182 to 0.98910, saving model to best_rnnstacked2model.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/50

Epoch 00004: auc improved from 0.98910 to 0.99174, saving model to best_rnnstacked2model.h5
Epoch 5/50

Epoch 00005: auc improved from 0.99174 to 0.99409, saving model to best_rnnstacked2model.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/50

Epoch 00006: auc improved from 0.99409 to 0.99649, saving model to best_rnnstacked2model.h5
Epoch 7/50

Epoch 00007: auc improved from 0.99649 to 0.99700, saving model to best_rnnstacked2model.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/50

Epoch 00008: auc improved from 0.99700 to 0.99823, saving model to b

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.947664,0.947571,0.947415,0.947768,0,0,0,0


In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 116, input_length=40))
model.add(LSTM(116, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(SimpleRNN(116, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(SimpleRNN(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(SimpleRNN(32, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(SimpleRNN(32, activation='sigmoid', dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(32, activation='sigmoid'))
model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))

mc = ModelCheckpoint('best_rnnstacked2model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.5, min_lr=0.001)

# model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc','AUC'])
# history = model.fit(X_train, y_train,
#                     epochs=24,
#                     batch_size=32,callbacks=[mc,red_lr])

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,callbacks=[mc,red_lr])

Epoch 1/50

Epoch 00001: auc improved from -inf to 0.84954, saving model to best_rnnstacked2model.h5
Epoch 2/50

Epoch 00002: auc improved from 0.84954 to 0.97088, saving model to best_rnnstacked2model.h5
Epoch 3/50

Epoch 00003: auc improved from 0.97088 to 0.98734, saving model to best_rnnstacked2model.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/50

Epoch 00004: auc improved from 0.98734 to 0.99320, saving model to best_rnnstacked2model.h5
Epoch 5/50

Epoch 00005: auc improved from 0.99320 to 0.99431, saving model to best_rnnstacked2model.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/50

Epoch 00006: auc improved from 0.99431 to 0.99659, saving model to best_rnnstacked2model.h5
Epoch 7/50

Epoch 00007: auc improved from 0.99659 to 0.99803, saving model to best_rnnstacked2model.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/50

Epoch 00008: auc improved from 0.99803 to 0.99848, saving model to b

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.942523,0.942474,0.942305,0.943207,0,0,0,0


### Long Short-Term Memory (LSTM) Sequential Model with Embedding layer and Dropout

In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 64, input_length=40))
model.add(LSTM(164, dropout=0.1, recurrent_dropout=0.1)) 
model.add(Dense(2, activation='sigmoid'))

mc = ModelCheckpoint('best_lstm_model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC (not val_auc)
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.05, min_lr=0.001)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=64,
                    validation_split=0.2,callbacks=[mc,red_lr])

Epoch 1/20

Epoch 00001: auc improved from -inf to 0.88257, saving model to best_lstm_model.h5
Epoch 2/20

Epoch 00002: auc improved from 0.88257 to 0.98698, saving model to best_lstm_model.h5
Epoch 3/20

Epoch 00003: auc improved from 0.98698 to 0.99745, saving model to best_lstm_model.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/20

Epoch 00004: auc improved from 0.99745 to 0.99940, saving model to best_lstm_model.h5
Epoch 5/20

Epoch 00005: auc improved from 0.99940 to 0.99941, saving model to best_lstm_model.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/20

Epoch 00006: auc improved from 0.99941 to 0.99959, saving model to best_lstm_model.h5
Epoch 7/20

Epoch 00007: auc improved from 0.99959 to 0.99999, saving model to best_lstm_model.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/20

Epoch 00008: auc improved from 0.99999 to 1.00000, saving model to best_lstm_model.h5
Epoch 9/20

Epoch 00009:

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.935047,0.934954,0.934735,0.93532,0,0,0,0


### Bidirectional LSTM model

In [None]:

model = Sequential()
model.add(Embedding(10000, 116, input_length=40))
model.add(Bidirectional(LSTM(116)))
model.add(Flatten())
model.add(Dense(2, activation='sigmoid'))

mc = ModelCheckpoint('best_lstm_model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC (not val_auc)
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.05, min_lr=0.001)

model.compile(optimizer=RMSprop(), loss='binary_crossentropy', metrics=['acc','AUC'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=116,
                    validation_split=0.2,callbacks=[mc,red_lr])

Epoch 1/20

Epoch 00001: auc improved from -inf to 0.90520, saving model to best_lstm_model.h5
Epoch 2/20

Epoch 00002: auc improved from 0.90520 to 0.98546, saving model to best_lstm_model.h5
Epoch 3/20

Epoch 00003: auc improved from 0.98546 to 0.99550, saving model to best_lstm_model.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/20

Epoch 00004: auc improved from 0.99550 to 0.99784, saving model to best_lstm_model.h5
Epoch 5/20

Epoch 00005: auc improved from 0.99784 to 0.99821, saving model to best_lstm_model.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/20

Epoch 00006: auc improved from 0.99821 to 0.99916, saving model to best_lstm_model.h5
Epoch 7/20

Epoch 00007: auc improved from 0.99916 to 0.99952, saving model to best_lstm_model.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/20

Epoch 00008: auc improved from 0.99952 to 0.99976, saving model to best_lstm_model.h5
Epoch 9/20

Epoch 00009:

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.938318,0.93824,0.938009,0.938708,0,0,0,0


### Stacked Bidirectional LSTM model

In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(Embedding(10000, 64, input_length=40))
model.add(Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2, activation='sigmoid', return_sequences=True)))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2, activation='sigmoid')))
model.add(Dense(2,activation='sigmoid'))

mc = ModelCheckpoint('best_bilstmstacked_model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # keep the checkpoint with the highest training AUC (not val_auc)
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.05, min_lr=0.001)

model.compile(loss='binary_crossentropy', optimizer=RMSprop(),  metrics=['acc','AUC'])

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=64,
                    validation_split=0.2,callbacks=[mc,red_lr])

Epoch 1/20

Epoch 00001: auc improved from -inf to 0.61413, saving model to best_bilstmstacked_model.h5
Epoch 2/20

Epoch 00002: auc improved from 0.61413 to 0.88006, saving model to best_bilstmstacked_model.h5
Epoch 3/20

Epoch 00003: auc improved from 0.88006 to 0.96126, saving model to best_bilstmstacked_model.h5

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 4/20

Epoch 00004: auc improved from 0.96126 to 0.98188, saving model to best_bilstmstacked_model.h5
Epoch 5/20

Epoch 00005: auc improved from 0.98188 to 0.98965, saving model to best_bilstmstacked_model.h5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 6/20

Epoch 00006: auc improved from 0.98965 to 0.99335, saving model to best_bilstmstacked_model.h5
Epoch 7/20

Epoch 00007: auc improved from 0.99335 to 0.99573, saving model to best_bilstmstacked_model.h5

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.001.
Epoch 8/20

Epoch 00008: auc improved from 0.99573 to 0.997

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.921028,0.920253,0.927184,0.918426,0,0,0,0


### Sequential Model with Embedding and 1D Convnet layers

In [None]:
tf.random.set_seed(seed)

model = Sequential()
model.add(layers.Embedding(10000, 116, input_length=40))
model.add(layers.Conv1D(64, 1, activation='softmax')) 
model.add(layers.MaxPooling1D(10)) #
model.add(layers.Conv1D(32, 1, activation='relu')) 
model.add(layers.MaxPooling1D(2)) #
model.add(layers.Conv1D(32, 1, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(2))

mc = ModelCheckpoint('best_1d_model.h5', monitor='auc',mode='max', verbose=1, save_best_only=True) # note: 'auc' is not among the metrics compiled for this model, so the checkpoint and LR callbacks have nothing to monitor
red_lr= ReduceLROnPlateau(monitor='auc',patience=2,verbose=1,factor=0.05, min_lr=0.001)

model.compile(optimizer=RMSprop(), loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=164,
                    validation_split=0.2,callbacks=[mc,red_lr])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [None]:
# y_test is one-hot encoded, so extract the true labels before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # list of true labels, one per test tweet

In [None]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.937383,0.937031,0.939627,0.935933,0,0,0,0


************************************************

### Evaluation

I created several deep learning models to identify misinformation related to the Covid-19 pandemic in tweets. 

These models included:
- A model with an embedding layer and dense layers
- A model using Conv1D layers
- Models with LSTM layer(s)
- A model with stacked recurrent layers
- A model with bidirectional recurrent layers

Nearly all models achieved above 0.93 on all performance metrics: accuracy, F1 score, precision, and recall.

The best-performing model was the **Stacked Recurrent Neural Network (RNN) model**, which achieved nearly 0.95 on all performance metrics.
The model has an embedding layer, an LSTM layer, and 4 SimpleRNN layers. It also uses sigmoid activations.

### Testing my top model with my own sample tweets

In [6]:
best_model = ai.aimsonnx.instantiate_model(api_url, version=69) 

In [15]:
my_test_tweets = ["Covid is not a big deal, its just a cold", # misinformation
                  "Masks are stupid. Covid is joke",  # misinformation

                  "#COVID19 vaccines are a safe way to build protection.", # actual tweet from CDC
                  "Based on COVID-NET data in recent weeks, rates of #COVID19 hospitalizations in adults ages 50–64 have risen faster than other age groups in several states.", # actual tweet from CDC
                  "You can't get Covid from vaccination. Not possible." # Twitter verified tweet by doctor             
]

In [16]:
my_test_tweets_processed = preprocessor(my_test_tweets, maxlen=40, max_words=10000)
best_model.predict(my_test_tweets_processed).argmax(axis=1)

array([1, 1, 1, 1, 1])
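
The array holds class indices rather than label names. Earlier cells map indices back to names with `y_test.columns`, which comes from `pd.get_dummies`. As a small sketch, assuming the submitted model keeps that same column order, the indices can be decoded as follows (`toy_labels` is a made-up stand-in for the label column):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the dataset's label column
toy_labels = pd.Series(["real", "fake", "real"])

# get_dummies orders the columns alphabetically: ['fake', 'real']
label_names = list(pd.get_dummies(toy_labels).columns)

# Indices returned by argmax for the five sample tweets
predicted_indices = np.array([1, 1, 1, 1, 1])
predicted_names = [label_names[i] for i in predicted_indices]
print(predicted_names)
```

Under that assumption, index 1 corresponds to `real`, so all five sample tweets, including the two written as misinformation, would be classified as real.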

### References

Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study of covid-19 misinformation on twitter." Online Social Networks and Media 22 (2021): 100104.

## Submit Placeholder Model

In [1]:
%%capture
# install aimodelshare library (%%capture must be the first line of the cell)
! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/

In [2]:
import aimodelshare as ai
from aimodelshare.aimsonnx import model_to_onnx

In [None]:
# save preprocessor
ai.export_preprocessor(preprocessor,"")

In [None]:
# save model in onnx format
onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("onnx_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



INFO:tensorflow:Assets written to: /tmp/assets


INFO:tensorflow:Assets written to: /tmp/assets


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# set credentials for modeltoapi function 
# make sure you have uploaded your credentials.txt file
from aimodelshare.aws import set_credentials
api_url = "https://wvr23l2z9i.execute-api.us-east-1.amazonaws.com/prod/m"
cred_path = "/content/drive/My Drive/Adv. Machine Learning/Code/Week 12/credentials.txt"

set_credentials(apiurl=api_url,
                credential_file=cred_path, 
                type="submit_model", 
                manual=False)

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [None]:
# submit model and predictions to competition
ai.submit_model("onnx_model.onnx",
                api_url,
                prediction_submission=predicted_labels,
                preprocessor="preprocessor.zip")

'Your model has been submitted as model version 69'

In [None]:
# check leaderboard
data=ai.get_leaderboard(api_url, verbose=3)
ai.leaderboard.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,bidirectional_layers,conv1d_layers,dense_layers,embedding_layers,flatten_layers,globalmaxpooling1d_layers,lstm_layers,maxpooling1d_layers,simplernn_layers,relu_act,sigmoid_act,softmax_act,tanh_act,loss,optimizer,model_config,username,version
0,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,67
1,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,66
2,95.00%,94.99%,94.97%,95.02%,keras,False,True,Sequential,5,1081482,1.0,,2,1,,,1.0,,,1.0,,1.0,1.0,str,RMSprop,"{'name': 'sequential_29', 'lay...",kagenlim,61
3,94.86%,94.85%,94.84%,94.87%,keras,False,True,Sequential,5,1035746,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_3', 'laye...",kagenlim,19
4,94.77%,94.76%,94.74%,94.78%,keras,False,True,Sequential,9,1313030,,,2,1,1.0,,1.0,,4.0,,3.0,,4.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",kka2120,69
5,94.58%,94.57%,94.57%,94.57%,keras,False,True,Sequential,5,1070202,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_4', 'laye...",kagenlim,60
6,94.49%,94.47%,94.47%,94.48%,keras,False,True,Sequential,3,161282,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",newusertest,4
7,94.35%,94.34%,94.32%,94.37%,keras,False,True,Sequential,6,148066,,2.0,1,1,1.0,,,1.0,,2.0,,1.0,,str,RMSprop,"{'name': 'sequential_72', 'lay...",prajseth,40
8,94.25%,94.24%,94.24%,94.24%,keras,False,True,Sequential,3,98818,,,1,1,,,1.0,,,,,1.0,1.0,str,RMSprop,"{'name': 'sequential_78', 'lay...",prajseth,41
9,94.21%,94.19%,94.18%,94.21%,keras,False,True,Sequential,3,402690,,,1,1,,,1.0,,,,1.0,,1.0,str,RMSprop,"{'name': 'sequential_5', 'laye...",xc2303_xc,63


In [18]:
# model next to mine on the leaderboard
nextbestmodel = ai.aimsonnx.instantiate_model(api_url, version=19)
nextbestmodel.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 40, 100)           1000000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 40, 32)            17024     
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                16600     
_________________________________________________________________
dense_2 (Dense)              (None, 40)                2040      
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 82        
Total params: 1,035,746
Trainable params: 1,035,746
Non-trainable params: 0
_________________________________________________________________


In [None]:
# my model
bestmodel = ai.aimsonnx.instantiate_model(api_url, version=69)
bestmodel.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 40, 116)           1160000   
_________________________________________________________________
lstm (LSTM)                  (None, 40, 116)           108112    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 40, 116)           27028     
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 40, 64)            11584     
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 40, 32)            3104      
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense (Dense)                (None, 32)                1056      

In [None]:
# Compare the two models to see differences
ai.aimsonnx.compare_models(api_url, version_list=[19,69]) 

Unnamed: 0,Model_19_Layer,Model_19_Shape,Model_19_Params,Model_69_Layer,Model_69_Shape,Model_69_Params
0,Embedding,"(None, 40, 100)",1000000.0,Embedding,"(None, 40, 116)",1160000
1,LSTM,"(None, 40, 32)",17024.0,LSTM,"(None, 40, 116)",108112
2,LSTM,"(None, 50)",16600.0,SimpleRNN,"(None, 40, 116)",27028
3,Dense,"(None, 40)",2040.0,SimpleRNN,"(None, 40, 64)",11584
4,Dense,"(None, 2)",82.0,SimpleRNN,"(None, 40, 32)",3104
5,,,,SimpleRNN,"(None, 32)",2080
6,,,,Dense,"(None, 32)",1056
7,,,,Flatten,"(None, 32)",0
8,,,,Dense,"(None, 2)",66


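My model (version 69) stacks four SimpleRNN layers on top of a single LSTM layer and adds a Flatten layer before the output layer, while the other model (version 19) uses two LSTM layers. The two models achieve nearly the same performance: 94.77% versus 94.86% accuracy on the leaderboard.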
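
The parameter counts in the comparison table can be checked by hand. The sketch below uses the standard per-layer formulas for Embedding, LSTM, SimpleRNN, and Dense layers, with the layer widths read off the table above. The vocabulary size of 10000 is assumed from `max_words=10000` in the preprocessing calls, and the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def simple_rnn_params(input_dim, units):
    # a single set of input weights, recurrent weights, and a bias
    return units * (input_dim + units) + units

def dense_params(input_dim, units):
    return input_dim * units + units

model_19 = [
    embedding_params(10000, 100),  # 1,000,000
    lstm_params(100, 32),          # 17,024
    lstm_params(32, 50),           # 16,600
    dense_params(50, 40),          # 2,040
    dense_params(40, 2),           # 82
]

model_69 = [
    embedding_params(10000, 116),  # 1,160,000
    lstm_params(116, 116),         # 108,112
    simple_rnn_params(116, 116),   # 27,028
    simple_rnn_params(116, 64),    # 11,584
    simple_rnn_params(64, 32),     # 3,104
    simple_rnn_params(32, 32),     # 2,080
    dense_params(32, 32),          # 1,056
    0,                             # Flatten has no parameters
    dense_params(32, 2),           # 66
]

print(sum(model_19), sum(model_69))
```

The totals match the `num_params` values on the leaderboard (1,035,746 and 1,313,030). Most of the parameters in both models sit in the embedding layer, so the recurrent stacks differ by far less than the totals suggest.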

***********************************************

In [20]:
# Fit the best model from the leader board to training data and 
# evaluate it on test data to complete your report.
top_model = ai.aimsonnx.instantiate_model(api_url, version=1) 
top_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 2)                 3202      
Total params: 163,202
Trainable params: 163,202
Non-trainable params: 0
_________________________________________________________________


In [67]:
# need to adjust preprocessing for top model
X_train = preprocessor(trainingdata.tweet, maxlen=100, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=100, max_words=10000)

# one hot encode Y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)


model = Sequential()
model.add(Embedding(10000, 16, input_length=100))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [68]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

In [69]:
# y_test is one-hot encoded, so extract the label names before running model_eval_metrics()
y_test_labels = list(y_test.idxmax(axis=1))  # column name of the 1 in each row, as a list of true labels

In [70]:
model_eval_metrics( y_test_labels,predicted_labels,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.938785,0.938479,0.940504,0.937535,0,0,0,0


In [71]:
model.summary()

Model: "sequential_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     (None, 100, 16)           160000    
_________________________________________________________________
flatten_20 (Flatten)         (None, 1600)              0         
_________________________________________________________________
dense_19 (Dense)             (None, 2)                 3202      
Total params: 163,202
Trainable params: 163,202
Non-trainable params: 0
_________________________________________________________________
