Real Anot: identifying COVID-19-related fake news using machine learning

Barry YAP, Kelvin SOH, Kenny CHUA and Zhong Hao NEO
AI Apprentices, Batch 6, AI Singapore

Introduction

Your phone buzzes to notify you of a new message in your extended family's WhatsApp chat. The message contains claims regarding COVID-19, but you're not sure if this information is trustworthy. Is it a case of fake news?

Fake news is a form of intentional disinformation. When this disinformation is unquestioningly taken as true, it can result in severe negative consequences, particularly in the current COVID-19 climate.

Real Anot is a web app that uses machine learning to predict the probability that a given piece of text is fake news. Larger probability values mean that the text is more likely to be fake news.

Dataset

The dataset is a subset of the CoAID dataset, which contains a diverse set of COVID-19 healthcare misinformation. This dataset has a total of 1,127 real and 266 fake news samples.

Preprocessing

The preprocessing stages consist of:

Columnwise concatenation (Text & Content)
Stopwords removal, lemmatization
TF-IDF
Undersampling of real news
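
The stages above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the function names, the tiny stopword list and the smoothed IDF formula are all hypothetical choices, and lemmatization is left out because it needs an external language resource.

```python
import math
import random
from collections import Counter

# Tiny illustrative stopword list (an assumption; the project's list is not shown).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}
PUNCTUATION = ".,!?:;\"'()"

def preprocess(text, content):
    """Concatenate the two text columns, lowercase, and drop stopwords."""
    tokens = (t.strip(PUNCTUATION) for t in f"{text} {content}".lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed IDF: log((1 + n) / (1 + df)) + 1.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def undersample(samples, labels, majority_label, seed=0):
    """Randomly drop majority-class samples until both classes are the same size.

    Assumes the majority class really is at least as large as the rest.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept = sorted(rng.sample(majority, len(minority)) + minority)
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Applied to the class counts quoted in the Dataset section, `undersample` would keep 266 of the 1,127 real samples so that both classes end up the same size; whether the project balances the classes exactly 1:1 is an assumption.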

Model

The model is a simple logistic regression with uncertainty built into its predictions, which makes it a "Bayesian logistic regression". It is trained using stochastic variational inference (SVI) with Pyro; the purpose is to show that we can approximate a complex distribution using a well-behaved one such as a Normal distribution. In variational inference, the goal is to maximize the evidence lower bound (ELBO), which minimizes the KL divergence between the variational distribution Q and the posterior distribution P. The logistic regression parameters, weights W and bias b, are given an initial Normal prior: w, b ~ N(0, 10).

The goal is to define a variational distribution Q that approximates the posterior P over the model parameters. The initial parameters used for Q are w, b ~ N(eps, -8 + 0.05eps), where eps ~ N(0, 1).

The true posterior distribution P(Z|X), with parameters Z conditioned on the observed data X, is the distribution we want to approximate.
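
For reference, the standard relationship between the posterior, the ELBO and the KL divergence can be written out as below. The notation is ours (Z for the parameters W and b, X for the observed data), and this is the textbook identity rather than anything specific to this project.

```latex
\begin{align}
P(Z \mid X) &= \frac{P(X \mid Z)\, P(Z)}{P(X)}, \qquad
P(X) = \int P(X \mid Z)\, P(Z)\, dZ \\
\log P(X) &= \underbrace{\mathbb{E}_{Q(Z)}\big[\log P(X, Z) - \log Q(Z)\big]}_{\mathrm{ELBO}(Q)}
  + \mathrm{KL}\big(Q(Z) \,\|\, P(Z \mid X)\big)
\end{align}
```

Because log P(X) does not depend on Q and the KL term is non-negative, raising the ELBO is the same as lowering the KL divergence between Q and the true posterior. This matters because the integral for P(X) is intractable for logistic regression, so the posterior cannot be computed directly.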

With the above priors in place, we train the model using SVI for 3,000 iterations, optimizing the parameters of P and Q simultaneously, with the TF-IDF vectors as input data.
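
To make the training step more concrete, here is a minimal sketch of mean-field variational inference for a one-feature logistic regression, written in plain Python rather than Pyro. Several things are assumptions made for illustration: the toy data, the names `fit_svi`, `q_mean` and `q_scale`, plain gradient ascent with two hand-picked learning rates in place of the project's optimizer, reading N(0, 10) as a standard deviation of 10, and reading the Q initialisation as mean = eps and log-scale = -8 + 0.05 eps.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_svi(xs, ys, steps=3000, lr_mean=0.1, lr_scale=0.01, prior_scale=10.0, seed=0):
    """Fit Q(w) and Q(b) as Normal(mean, scale) by stochastic gradient ascent on the ELBO."""
    rng = random.Random(seed)
    n = len(xs)
    # Variational parameters for [w, b] (assumed reading of the initial values).
    mean = [rng.gauss(0, 1) for _ in range(2)]
    log_scale = [-8 + 0.05 * rng.gauss(0, 1) for _ in range(2)]
    for _ in range(steps):
        # Reparameterisation trick: theta = mean + scale * eps, with eps ~ N(0, 1).
        eps = [rng.gauss(0, 1) for _ in range(2)]
        scale = [math.exp(r) for r in log_scale]
        w, b = (mean[i] + scale[i] * eps[i] for i in range(2))
        # Gradient of log prior + log likelihood with respect to (w, b).
        grad_w = -w / prior_scale ** 2
        grad_b = -b / prior_scale ** 2
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        for i, g in enumerate((grad_w, grad_b)):
            mean[i] += lr_mean * g / n
            # The "+ 1.0" is the gradient of the Normal entropy w.r.t. the log-scale.
            log_scale[i] += lr_scale * (g * scale[i] * eps[i] + 1.0)
    return mean, [math.exp(r) for r in log_scale]

# Toy data: one feature, labels drawn from a known logistic model (slope 2.0, bias -0.5).
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(200)]
ys = [1 if data_rng.random() < sigmoid(2.0 * x - 0.5) else 0 for x in xs]
q_mean, q_scale = fit_svi(xs, ys)
```

Each iteration draws a single sample of the parameters from Q, which is what makes the procedure stochastic. With Pyro, the same idea would be expressed as a model function, a guide function and repeated optimizer steps on the ELBO, with the TF-IDF vectors replacing the single toy feature.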

With a well-behaved distribution Q (e.g. a Normal distribution), SVI reaches the lower bound at around 500 iterations (notice the shapes of the two distributions P and Q).

Making inference

To make a prediction, we sample n times from the approximate variational distribution Q and take the expectation under Q. This is equivalent to computing a weighted average over an ensemble of plain logistic regression models, each with different parameter values. The ensemble members are drawn from the same shared distributions, which gives them the ability to express uncertainty in their estimates.

For example, 200 samples of the weights W and bias b are drawn from the variational distribution Q.
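
The sampling procedure described above might look roughly like the sketch below. It assumes Q is a set of independent Normal distributions described by a list of means and a list of scales, with the bias stored last; the function names and that layout are hypothetical, not taken from the project code.

```python
import math
import random
import statistics

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_samples(x, q_mean, q_scale, n_samples=200, seed=0):
    """Draw n_samples parameter sets from Q and score the feature vector x with each."""
    rng = random.Random(seed)
    *w_mean, b_mean = q_mean      # assumed layout: weights first, bias last
    *w_scale, b_scale = q_scale
    probs = []
    for _ in range(n_samples):
        # One draw from Q gives one plain logistic regression model.
        w = [rng.gauss(m, s) for m, s in zip(w_mean, w_scale)]
        b = rng.gauss(b_mean, b_scale)
        probs.append(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b))
    return probs

def predict_proba(x, q_mean, q_scale, n_samples=200, seed=0):
    """Monte Carlo estimate of the expected fake-news probability under Q."""
    return statistics.fmean(predict_samples(x, q_mean, q_scale, n_samples, seed))
```

The list returned by `predict_samples` is the ensemble of predictions: its average estimates the expectation under Q, and its spread indicates how uncertain the model is about that particular input.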

Evaluation

Evaluation is done on the 20% test split for both models. As the results show, the Bayesian model declines to predict 13 samples because it is not confident in their prediction scores. Compared with the baseline logistic regression model, the Bayesian model made only 3 incorrect predictions, while logistic regression made 7.
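
The rule the Bayesian model uses to decline a prediction is not spelled out here. One plausible rule, shown purely as an assumption, is to abstain whenever a central interval of the sampled probabilities straddles the 0.5 decision threshold. The function name, the percentile bounds and the label strings below are all hypothetical.

```python
import statistics

def decide(probs, lower_pct=2.5, upper_pct=97.5, threshold=0.5):
    """Return (label, median) where label is 'fake', 'real' or 'uncertain'.

    probs holds the fake-news probabilities from the sampled models.
    """
    ordered = sorted(probs)

    def percentile(p):
        # Nearest-rank style percentile on the sorted samples.
        return ordered[round(p / 100 * (len(ordered) - 1))]

    low, high = percentile(lower_pct), percentile(upper_pct)
    median = statistics.median(ordered)
    if low > threshold:
        return "fake", median       # nearly all samples agree: fake
    if high < threshold:
        return "real", median       # nearly all samples agree: real
    return "uncertain", median      # samples disagree: decline to predict
```

Under a rule of this kind, samples labelled "uncertain" would be the ones handed to a human or another source for a second opinion instead of being counted as predictions.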

This shows that incorporating uncertainty into the model gives us much greater confidence in the predictions it returns, and lets us seek a third-party opinion in uncertain cases.

Installing and setting up the app

$ conda create --name fake_news_classifier
$ conda activate fake_news_classifier
$ pip install -r requirements.txt

Demo app

Instructions for launching the demo app

$ python -m src.app

After running the above command, the app will be available at http://localhost:8000, where the user can enter a piece of news and request a prediction. The number of Bayesian samples (2, 5, 10, 200 or 500) can be selected to maximize the prediction confidence for any particular piece of news.

The prediction is returned as a probability distribution: the x-axis shows the predicted probabilities from the Bayesian samples (200 in this example), and the result is summarized by taking the median probability.

References

https://github.com/cuilimeng/CoAID/
https://pyro.ai/
https://pyro.ai/examples/svi_part_i.html
https://arxiv.org/pdf/1506.04416.pdf
https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
