# Is This Tweet Fake News or Not?
<!-- Github link: https://github.com/mro9395/ML_projects/tree/main/02COVID-19_X-RayDiagnosis -->

Data provided by Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study of covid-19 misinformation on twitter." *Online Social Networks and Media* 22 (2021): 100104.

## Install required packages

In [1]:
%%capture
# Install the aimodelshare library (%%capture must be the first line of the cell)
! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/

In [34]:
# Import packages

import numpy as np
import pandas as pd

# Use tensorflow.keras throughout; mixing it with standalone keras can cause conflicts
from tensorflow.keras import layers
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Flatten, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

## Loading required data

In [3]:
# Source: Fighting an Infodemic: COVID-19 Fake News Dataset, https://github.com/diptamath/covid_fake_news, https://arxiv.org/abs/2011.03327

trainingdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv", usecols = ['tweet','label'])
testdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/english_test_with_labels.csv", usecols = ['tweet','label'])

trainingdata

| | tweet | label |
|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real |
| 1 | States reported 1121 deaths a small rise from ... | real |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real |
| 4 | Populous states can generate large case counts... | real |
| ... | ... | ... |
| 6415 | A tiger tested positive for COVID-19 please st... | fake |
| 6416 | ???Autopsies prove that COVID-19 is??� a blood... | fake |
| 6417 | _A post claims a COVID-19 vaccine has already ... | fake |
| 6418 | Aamir Khan Donate 250 Cr. In PM Relief Cares Fund | fake |


In [4]:
# Print shapes of data

print(trainingdata.shape)
print(testdata.shape)

(6420, 2)
(2140, 2)


In [5]:
# Show samples of tweets

print('1. Tweets that demonstrate misinformation:\n')
for i in trainingdata[trainingdata['label']=='fake'].tweet[:7]:
    print('~', i)

print('\n 2. Tweets that demonstrate real information:\n')
for i in trainingdata[trainingdata['label']=='real'].tweet[:7]:
    print('~', i)

1. Tweets that demonstrate misinformation:

~ Politically Correct Woman (Almost) Uses Pandemic as Excuse Not to Reuse Plastic Bag https://t.co/thF8GuNFPe #coronavirus #nashville
~ Obama Calls Trump’s Coronavirus Response A Chaotic Disaster https://t.co/DeDqZEhAsB
~ ???Clearly, the Obama administration did not leave any kind of game plan for something like this.??�
~ Retraction—Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis - The Lancet https://t.co/L5V2x6G9or
~ The NBA is poised to restart this month. In March we reported on how the Utah Jazz got 58 coronavirus tests in a matter of hours at a time when U.S. testing was sluggish. https://t.co/I8YjjrNoTh https://t.co/o0Nk6gpyos
~ Because of Donald Trump's negligence and incompetence:
 
- More than 110,000 people have died
- We have nearly 2 million cases of COVID-19
- Over 20 million Americans are out of work
 
And yet he’s patting himself on the back. Unbelievab

**Discussion**: This dataset provides a collection of 8560 tweets related to COVID-19 (6420 for training and 2140 for testing), each labeled according to whether its content presents real or false fact-checked claims.

Building a model to predict whether a tweet about COVID-19 is real or misinformation would be useful for several reasons. Given the large number of unproven COVID-19 treatments that people circulate, such a model could alert users and help prevent adverse reactions. A model that distinguishes real information could also flag potentially false claims about vaccines.

Multiple stakeholders can benefit from this. Governments can reduce the spread of false claims that could endanger public health. People can be better informed and better able to recognize fact-checked information. Platforms like Twitter can offer a healthier virtual environment for their users. One caveat is that such a model could hinder the spread of real but new findings: since the model might not recognize them as real, it could label them as false.


## Preprocessing

### Prepare preprocessor

In [36]:
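# Build vocabulary from training text data (vocabulary size capped via num_words)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(trainingdata.tweet)

# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen, max_words):
    # max_words is kept for call compatibility but is unused here:
    # the vocabulary size is fixed by the tokenizer's num_words above
    sequences = tokenizer.texts_to_sequences(data)

    # pad (or truncate) every sequence to exactly maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X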
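
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): shorter sequences are left-padded with zeros, and longer ones keep only their last `maxlen` tokens. The helper name `pad_pre` and the toy sequences are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad and truncate at the front."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]            # keep the last maxlen tokens
        out[row, -len(trunc):] = trunc   # right-align; zeros fill the left
    return out

# Hypothetical token ids, for illustration only
toy = [[5, 3], [7, 1, 4, 9, 2, 8]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With `maxlen=80`, as used in this notebook, a short tweet would therefore be mostly leading zeros, and a tweet longer than 80 tokens would lose its beginning rather than its end.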

### Prepare Train and Test Data

In [37]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=80, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=80, max_words=10000)

# one-hot encode y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)
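
The one-hot encoding matters later, when predictions are mapped back to label names through `y_test.columns`. The sketch below shows that round trip on toy data. The labels and scores are made up for illustration, and `dtype=int` is an assumption added here so the result is numeric regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Toy labels, illustrative only
labels = pd.Series(["real", "fake", "real", "fake"])
onehot = pd.get_dummies(labels, dtype=int)  # columns come out sorted: ['fake', 'real']

# A model emits one score per column; argmax picks the winning column index.
scores = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])  # hypothetical scores
predicted = [onehot.columns[i] for i in scores.argmax(axis=1)]

# idxmax recovers the original label from each one-hot row
recovered = list(onehot.idxmax(axis=1))
print(list(onehot.columns), predicted, recovered)
```

Because the columns are ordered alphabetically, index 0 corresponds to `fake` and index 1 to `real` throughout the notebook.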

In [38]:
# Print shapes of data

print(X_train.shape)
print(X_test.shape)

(6420, 80)
(2140, 80)


## Model building

In [39]:
# Declare maximum length of tweet

maxlen = 80

In [40]:
# Set function to evaluate models

def model_eval_metrics(y_true, y_pred):
    accuracy_eval = accuracy_score(y_true, y_pred)
    f1_score_eval = f1_score(y_true, y_pred,average="macro",zero_division=0)
    precision_eval = precision_score(y_true, y_pred,average="macro",zero_division=0)
    recall_eval = recall_score(y_true, y_pred,average="macro",zero_division=0)
    mse_eval = 0
    rmse_eval = 0
    mae_eval = 0
    r2_eval = 0
    metricdata = {'accuracy': [accuracy_eval], 'f1_score': [f1_score_eval], 'precision': [precision_eval], 'recall': [recall_eval], 'mse': [mse_eval], 'rmse': [rmse_eval], 'mae': [mae_eval], 'r2': [r2_eval]}
    finalmetricdata = pd.DataFrame.from_dict(metricdata)
    return finalmetricdata
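
As a rough illustration of what `average="macro"` means in the function above, the following pure-Python sketch computes precision, recall, and F1 per class and then takes their unweighted mean. The function name `macro_scores` and the toy labels are assumptions for this example; the notebook itself relies on scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Per-class precision/recall/F1, then an unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0  # same idea as zero_division=0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels, illustrative only
y_true = ["fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "real", "real", "fake"]
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(macro_precision, macro_recall, macro_f1)
```

Macro averaging gives both classes equal weight, so a model cannot score well by favoring whichever class happens to be more frequent.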

## Model 1: Embeddings with single LSTM

In [None]:
model = Sequential()
model.add(Embedding(10000, 64, input_length=maxlen))
model.add(LSTM(64))
model.add(Dense(2, activation='sigmoid'))


model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

history = model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.2)
score = model.evaluate(X_test, y_test, batch_size=16)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Calculate metrics of evaluation 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

# true labels recovered from the one-hot encoded y_test object
y_test_labels = list(y_test.idxmax(axis=1))

model_eval_metrics(y_test_labels, predicted_labels)

| | accuracy | f1_score | precision | recall | mse | rmse | mae | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.937383 | 0.937122 | 0.938427 | 0.936415 | 0 | 0 | 0 | 0 |


## Model 2: Embeddings with bidirectional LSTM

In [None]:
model = Sequential()
model.add(Embedding(10000, 64, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.2)
score = model.evaluate(X_test, y_test, batch_size=16)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Calculate metrics of evaluation 

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

# true labels recovered from the one-hot encoded y_test object
y_test_labels = list(y_test.idxmax(axis=1))

model_eval_metrics(y_test_labels, predicted_labels)

| | accuracy | f1_score | precision | recall | mse | rmse | mae | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.931776 | 0.931775 | 0.933581 | 0.933727 | 0 | 0 | 0 | 0 |
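
Models 1 and 2 differ in their output activation: two `sigmoid` units versus a two-way `softmax`. The NumPy sketch below uses hypothetical logits to show the practical difference. Sigmoid scores are independent and need not sum to 1, while softmax scores form a probability distribution. Both activations are monotonic in each logit, so the `argmax` used for prediction picks the same class either way.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = x - x.max()        # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, -1.0])   # hypothetical pre-activation values for [fake, real]

sig_scores = sigmoid(logits)     # each unit scored independently
soft_scores = softmax(logits)    # scores compete and sum to 1

print(sig_scores, sig_scores.sum())
print(soft_scores, soft_scores.sum())
print(sig_scores.argmax(), soft_scores.argmax())
```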


## Model 3: Embeddings with bidirectional and stacked LSTM

In [44]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(2, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=16,
                    validation_split=0.2)

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 80, 32)            320000    
_________________________________________________________________
bidirectional_4 (Bidirection (None, 80, 64)            16640     
_________________________________________________________________
lstm_20 (LSTM)               (None, 80, 32)            12416     
_________________________________________________________________
lstm_21 (LSTM)               (None, 80, 32)            8320      
_________________________________________________________________
lstm_22 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 66        
Total params: 365,762
Trainable params: 365,762
Non-trainable params: 0
________________________________________________
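
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas: an embedding has `vocab * dim` weights, an LSTM has four gates each with input weights, recurrent weights, and a bias, and a dense layer has one weight per input-output pair plus a bias per output. The helper names are illustrative, not part of Keras.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    return inputs * units + units

counts = {
    "embedding": embedding_params(10000, 32),
    "bidirectional": 2 * lstm_params(32, 32),  # forward + backward LSTM
    "lstm_20": lstm_params(32, 64),            # input is the 64-dim bidirectional output
    "lstm_21": lstm_params(32, 32),
    "lstm_22": lstm_params(32, 32),
    "dense": dense_params(32, 2),
}
total = sum(counts.values())
print(counts, total)
```

Note that the embedding layer alone accounts for 320,000 of the 365,762 parameters, so the recurrent stack is comparatively small.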

In [45]:
# Calculate metrics of evaluation 

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

# true labels recovered from the one-hot encoded y_test object
y_test_labels = list(y_test.idxmax(axis=1))

model_eval_metrics(y_test_labels, predicted_labels)

| | accuracy | f1_score | precision | recall | mse | rmse | mae | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.93972 | 0.939624 | 0.939426 | 0.939916 | 0 | 0 | 0 | 0 |


## Model 4: Embeddings with 1D Convolution and LSTM with dropout regularization

In [41]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(layers.Conv1D(32, 7, activation='relu')) 
model.add(layers.AveragePooling1D())
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Flatten())  # no-op here: LSTM(128) already returns a 2D (batch, 128) tensor
model.add(Dense(2, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [42]:
# Calculate metrics of evaluation 

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

# true labels recovered from the one-hot encoded y_test object
y_test_labels = list(y_test.idxmax(axis=1))

model_eval_metrics(y_test_labels, predicted_labels)

| | accuracy | f1_score | precision | recall | mse | rmse | mae | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.921495 | 0.921302 | 0.921427 | 0.921192 | 0 | 0 | 0 | 0 |


**Discussion**: All four models, which experiment with different architectures (stacked LSTM, bidirectional LSTM, and 1D convolution), score above 0.90 on all four metrics. However, some perform slightly better than others.

The best model overall combines an embedding layer, one bidirectional LSTM, and three stacked LSTM layers, all with 32 units. The worst features one 1D convolutional layer with average pooling and an LSTM of 128 units. Given that we are dealing with words and phrases that vary in length and don't follow a rigid pattern, the fixed filter size (7) of the 1D convolution could be introducing errors or missing information, as opposed to the more flexible LSTM. Also, an LSTM with 128 units appears to be too large and seems to be overfitting, since the training score is 0.997 (higher than that of the best model). A value of 32 seems to work better.

Finally, even though it is a slightly less accurate model, its computing complexity is comparatively low, so the training is faster.
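
For the Conv1D model discussed above, it can help to see how many timesteps actually reach the LSTM. The sketch below works this out assuming the Keras defaults, which Model 4 does not override: `padding='valid'` and stride 1 for `Conv1D`, and `pool_size=2` for `AveragePooling1D`. The helper names are illustrative.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_valid_length(length, pool_size=2, stride=None):
    """Output length of 1D pooling; stride defaults to pool_size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

maxlen = 80
after_conv = conv1d_valid_length(maxlen, kernel_size=7)  # 80 - 7 + 1
after_pool = pool1d_valid_length(after_conv)             # halved by the pooling
print(after_conv, after_pool)
```

Under these assumptions the LSTM processes 37 timesteps instead of 80, each summarizing a window of several tokens.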

## Submit Model

In [23]:
# Import aimodelshare package

import aimodelshare as ai
from aimodelshare.aimsonnx import model_to_onnx

In [24]:
# save preprocessor
ai.export_preprocessor(preprocessor,"")

In [None]:
# save model in onnx format
onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("onnx_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



INFO:tensorflow:Assets written to: /tmp/assets



In [26]:
# set credentials for modeltoapi function 
from aimodelshare.aws import set_credentials
api_url = "https://wvr23l2z9i.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=api_url,credential_file="credentials.txt", type="submit_model", manual=False)

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [None]:
# submit model and predictions to competition
ai.submit_model("onnx_model.onnx",
                api_url,
                prediction_submission=predicted_labels,
                preprocessor="preprocessor.zip")

'Your model has been submitted as model version 73'

In [None]:
# check leaderboard
data=ai.get_leaderboard(api_url, verbose=3)
ai.leaderboard.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,bidirectional_layers,conv1d_layers,dense_layers,embedding_layers,flatten_layers,globalmaxpooling1d_layers,lstm_layers,maxpooling1d_layers,simplernn_layers,relu_act,sigmoid_act,softmax_act,tanh_act,loss,optimizer,model_config,username,version
0,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,67
1,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,66
2,95.00%,94.99%,94.97%,95.02%,keras,False,True,Sequential,5,1081482,1.0,,2,1,,,1.0,,,1.0,,1.0,1.0,str,RMSprop,"{'name': 'sequential_29', 'lay...",kagenlim,61
3,94.86%,94.85%,94.84%,94.87%,keras,False,True,Sequential,5,1035746,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_3', 'laye...",kagenlim,19
4,94.77%,94.76%,94.74%,94.78%,keras,False,True,Sequential,9,1313030,,,2,1,1.0,,1.0,,4.0,,3.0,,4.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",kka2120,69
5,94.58%,94.57%,94.57%,94.57%,keras,False,True,Sequential,5,1070202,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_4', 'laye...",kagenlim,60
6,94.49%,94.47%,94.47%,94.48%,keras,False,True,Sequential,3,161282,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",newusertest,4
7,94.35%,94.34%,94.32%,94.37%,keras,False,True,Sequential,6,148066,,2.0,1,1,1.0,,,1.0,,2.0,,1.0,,str,RMSprop,"{'name': 'sequential_72', 'lay...",prajseth,40
8,94.25%,94.24%,94.24%,94.24%,keras,False,True,Sequential,3,98818,,,1,1,,,1.0,,,,,1.0,1.0,str,RMSprop,"{'name': 'sequential_78', 'lay...",prajseth,41
9,94.21%,94.19%,94.18%,94.21%,keras,False,True,Sequential,3,402690,,,1,1,,,1.0,,,,1.0,,1.0,str,RMSprop,"{'name': 'sequential_5', 'laye...",xc2303_xc,63


## Get best model

In [27]:
# Get best model architecture and view model summary, change version arg as needed

bestmodel = ai.aimsonnx.instantiate_model(api_url, version=67)

bestmodel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 16)            160000    
_________________________________________________________________
flatten (Flatten)            (None, 960)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1922      
Total params: 161,922
Trainable params: 161,922
Non-trainable params: 0
_________________________________________________________________


In [30]:
# Compare my model and best external model

ai.aimsonnx.compare_models(api_url, version_list=[67,73])

| | Model_67_Layer | Model_67_Shape | Model_67_Params | Model_73_Layer | Model_73_Shape | Model_73_Params |
|---|---|---|---|---|---|---|
| 0 | Embedding | (None, 60, 16) | 160000.0 | Embedding | (None, 40, 32) | 320000 |
| 1 | Flatten | (None, 960) | 0.0 | LSTM | (None, 40, 32) | 8320 |
| 2 | Dense | (None, 2) | 1922.0 | LSTM | (None, 40, 32) | 8320 |
| 3 | | | | LSTM | (None, 32) | 8320 |
| 4 | | | | Dense | (None, 2) | 66 |


In [47]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=60, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=60, max_words=10000)

# one-hot encode y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

bestmodel.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

bestmodel.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

# format y_pred as labels 
y_pred = bestmodel.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]

# true labels recovered from the one-hot encoded y_test object
y_test_labels = list(y_test.idxmax(axis=1))

model_eval_metrics(y_test_labels, predicted_labels)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


| | accuracy | f1_score | precision | recall | mse | rmse | mae | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.939252 | 0.93913 | 0.939059 | 0.939207 | 0 | 0 | 0 | 0 |


**Discussion**: While my best personal model has 80 as the maximum length of the tweet, the best model from the leaderboard has 60. The latter has a simpler architecture, with just one embedding layer, a flatten layer, and a dense layer, while mine has one embedding layer, three stacked LSTM layers, and one dense layer. In addition, my model uses a larger embedding dimension (32) than the other one (16). In terms of accuracy, both models produce metrics around 0.93 to 0.94.

## Feeding tweets

In [63]:
# Best model from leaderboard
sample_tweets = pd.Series(['The CDC currently reports 99031 deaths',
                           'The masks coverings recommended to slow spread',
                           'work is good',
                           'tiger is animal',
                           'CDC reports 45 cases during the last week in the US',
                           'covid is fake but people always says it is real',
                           'The cloth face coverings recommended to slow',
                           'Big states can generate large case counts',
                           'vaccine bad'])
y_pred = bestmodel.predict(preprocessor(sample_tweets, maxlen=60, max_words=10000)).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
predicted_labels

['fake', 'real', 'fake', 'fake', 'real', 'fake', 'fake', 'fake', 'fake']

In [64]:
# Best model from leaderboard
sample_tweets = pd.Series(['The CDC currently reports 99031 deaths',
                           'The masks coverings recommended to slow spread',
                           'work is good',
                           'tiger is animal',
                           'CDC reports 45 cases during the last week in the US',
                           'covid is fake but people always says it is real',
                           'The cloth face coverings recommended to slow',
                           'Big states can generate large case counts',
                           'vaccine bad'])
y_pred = bestmodel.predict(preprocessor(sample_tweets, maxlen=60, max_words=10000)).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
predicted_labels

['fake', 'real', 'fake', 'fake', 'real', 'fake', 'fake', 'fake', 'fake']

**Discussion**: Both models seem to predict similarly on these tweets. According to the models, longer tweets seem more likely to be classified as real. Also, the use of numbers and names of institutions does not seem to influence the result, as seen in the tweets where I used 'CDC' and 'US'.
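
One factor worth keeping in mind when feeding hand-written tweets: the `Tokenizer` in this notebook was created without an `oov_token`, so words it never saw during fitting, or whose index is at or above `num_words`, are dropped before padding. The pure-Python sketch below mimics that lookup with a hypothetical vocabulary. It is a simplification (the real tokenizer also strips punctuation), and the indices are made up.

```python
def to_sequence(text, word_index, num_words):
    """Map words to ids, silently skipping unknown or out-of-range words."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            seq.append(idx)
    return seq

# Hypothetical vocabulary, for illustration only
word_index = {"the": 1, "cdc": 2, "reports": 3, "deaths": 4}

known = to_sequence("The CDC reports 99031 deaths", word_index, num_words=10000)
unknown = to_sequence("tiger is animal", word_index, num_words=10000)
print(known, unknown)
```

A short tweet made mostly of unseen words would reach the model as nearly all padding, which could be one reason very short inputs behave differently from longer ones. Checking `tokenizer.texts_to_sequences` on the sample tweets would show how much of each one survives.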