
Credibilator

This repository contains resources for the article When classification accuracy is not enough: Explaining news credibility assessment, published in the Special Issue on Dis/Misinformation Mining from Social Media of the Information Processing & Management journal. The research was carried out within the HOMADOS project at the Institute of Computer Science, Polish Academy of Sciences, in cooperation with the Institute for Computer Science and Engineering at CONICET and Universidad Nacional del Sur in Bahía Blanca, Argentina.

The resources available here are the following:

  • an updated corpus including credible and non-credible (fake) news documents,
  • the Credibilator browser extension for Chrome (install from the Chrome Web Store),
  • source code and data for training the credibility classifiers used,
  • server-side source code.

If you need more information, consult the paper or contact its authors!

News Style Corpus v2

The corpus used in this research contains 95,900 documents from 199 sources. News Style Corpus v2 is based on a previous corpus (see article and data) and uses the work of PolitiFact and the Pew Research Center for source-level credibility assessments. This version was refined by extracting plain text with the unfluff library and removing documents with insufficient content.

The folder NewsStyleCorpus2 contains the following files, necessary to retrieve the pages constituting the corpus from the Wayback Machine archive:

  • corpusSourcesU.tsv: a tab-separated list of all documents in the corpus, each with the website (domain) it comes from, its credibility label, the original page URL, and the address under which the document is currently available in the archive,
  • NewsDownloader-2.0-jar-with-dependencies.jar: a Java package that retrieves HTML documents from a given address list and converts them to plain text (NOTE: you will need unfluff installed on your system and available through the unfluff command),
  • CredibilityCorpusDownloaderU.java: source code for the above package,
  • foldsCVU.tsv: a list of fold identifiers for the documents from corpusSourcesU.tsv (in the same order) for the two CV scenarios described in the paper, document-based and source-based; see the sketch after this list for pairing them.
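Since the two files keep documents in the same order, pairing them is a matter of zipping the rows. A minimal Python sketch (the column positions are assumptions; inspect the files before relying on them):

import csv
from collections import defaultdict

# Read the document list and the fold assignments; rows are aligned by position.
with open("corpusSourcesU.tsv", encoding="utf-8") as f:
    docs = list(csv.reader(f, delimiter="\t"))
with open("foldsCVU.tsv", encoding="utf-8") as f:
    folds = list(csv.reader(f, delimiter="\t"))

# Group documents by fold identifier (which column holds which CV scenario
# is an assumption here).
by_fold = defaultdict(list)
for doc, fold in zip(docs, folds):
    by_fold[fold[0]].append(doc)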

You can start the download with a simple command:

java -jar NewsDownloader-2.0-jar-with-dependencies.jar /path/to/corpusSourcesU.tsv /path/to/output-dir

Note that downloading the whole corpus takes several hours. To limit the load on the Wayback Machine infrastructure and to retrieve all the pages (some may be temporarily unavailable), perform the process in stages. You can select just a part of the corpus for download by modifying the address list, as sketched below.
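One way to stage the process is to split the address list into smaller parts and run NewsDownloader on each part separately. A minimal Python sketch (the chunk size and output naming are arbitrary choices, not part of the repository):

import csv

CHUNK = 5000  # arbitrary batch size

with open("corpusSourcesU.tsv", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

# Write consecutive slices of the address list to separate part files.
for i in range(0, len(rows), CHUNK):
    part = f"corpusSources_part{i // CHUNK}.tsv"
    with open(part, "w", encoding="utf-8", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(rows[i:i + CHUNK])

Each part file can then be passed to NewsDownloader in place of the full list.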

The corpus data are released under the CC BY-NC-SA 4.0 licence.

Credibilator browser extension for Chrome

Source code

The source code for Credibilator is available in the credibilator-extension folder. It was verified to work with Chromium 91.0. The extension uses several external JS libraries.

The source code is released under the GNU GPL 3.0 licence.

Using the extension

The easiest way to use the extension is to install it from the Chrome Web Store.

If you want to load the extension from local files (tested with Chromium 91.0), you need to:

  1. Open the extensions configuration panel and turn on Developer mode (upper right corner),
  2. Click Load unpacked and select the credibilator-extension folder,
  3. The extension should appear on the configuration panel,
  4. You can now use Credibilator. Whenever you want to check the credibility of the currently browsed page, activate Credibilator from the Extensions menu (to the right of the address bar).

Video manual

Credibilator.mp4 contains a video presenting the most important features of the extension in action.

Credibility classifiers

Stylometric

To train a stylometric credibility classifier, you first need to generate features from the training documents. Since the model is going to rely on data available to a browser extension, the features have to be generated in a browser environment as well. The code in Classifiers/Stylometric/ChromiumFeatureGenerator.java executes the JavaScript code in Chromium and collects the feature values. It takes the following arguments (an example invocation follows the list):

  • chromeUserDir -- a temporary directory, empty at startup,
  • tempDir -- a temporary directory, empty at startup,
  • corpusDir -- a directory containing the corpus data, e.g. the News Style Corpus collected as above,
  • outputDir -- a directory to store the generated features,
  • batchPath -- a directory with the batch-processing JavaScript code, available in Classifiers/Stylometric/features.
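Assuming the class has been compiled (e.g. with javac), an invocation might look like the following; all paths here are placeholders:

java ChromiumFeatureGenerator /tmp/chrome-user /tmp/work /path/to/corpus /path/to/features Classifiers/Stylometric/features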

The R code in Classifiers/Stylometric/R/all-credibilator.R shows how to build a regularised logistic regression model based on the generated features. The resulting model used by the extension can be seen in credibilator-extension/style/data/features-true.tsv.
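The repository builds the model with the R code above; purely for illustration, here is a minimal sketch of the same idea in Python with scikit-learn (the file name, column layout, and regularisation strength are assumptions):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: one row per document, feature columns plus a 0/1 label.
data = pd.read_csv("features.tsv", sep="\t")
X = data.drop(columns=["label"])
y = data["label"]

# L2-regularised logistic regression; C controls the regularisation strength.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
model.fit(X, y)
print(model.coef_)  # per-feature weights, the part a client can reuse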

Neural

The code to convert the corpus to the format used by our neural classifiers is the same as in previous work and can be accessed here.

The BiLSTMAvg model was implemented in TensorFlow, using the code available in Classifiers/Neural. Subsequently, it was converted to TensorFlow.js, and the result, used by Credibilator, is exported to credibilator-extension/bilstmavg/data/tfjs-10k-interp-iter.
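For orientation, a minimal Keras sketch of what a BiLSTM with output averaging can look like (the vocabulary size, dimensions, and layer sizes are assumptions; the actual implementation is in Classifiers/Neural):

import tensorflow as tf

# Hypothetical hyperparameters, for illustration only.
VOCAB, EMB, UNITS = 10000, 100, 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, EMB),
    # BiLSTM returning the full sequence of hidden states...
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(UNITS, return_sequences=True)),
    # ...which are averaged over the time dimension (the "Avg" in BiLSTMAvg).
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # credible vs. non-credible
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])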

Server-side

If you want to set up your own backend, different from the one currently used by the extension, you will need Flask, MongoDB, and NearPy. Setting up the backend for Credibilator yourself consists of four steps: 1. index the data for ANN (approximate nearest neighbour) search, 2. set up storage in MongoDB, 3. start the Flask server, 4. update the front end.

Check Server/readme.txt for a detailed explanation.
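For orientation only, a minimal sketch of how these pieces could be wired together (the endpoint name, vector dimensionality, and document fields are assumptions, not the repository's actual API):

from flask import Flask, request, jsonify
from pymongo import MongoClient
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections
import numpy as np

DIM = 100  # assumed embedding dimensionality

# 1. ANN index: an LSH-based engine from NearPy.
engine = Engine(DIM, lshashes=[RandomBinaryProjections("rbp", 10)])

# 2. Storage: document vectors kept in MongoDB (field names are hypothetical).
db = MongoClient("mongodb://localhost:27017")["credibilator"]
for doc in db.documents.find():
    engine.store_vector(np.array(doc["vector"]), doc["url"])

# 3. A Flask server answering nearest-neighbour queries.
app = Flask(__name__)

@app.route("/neighbours", methods=["POST"])
def neighbours():
    vector = np.array(request.json["vector"])
    hits = engine.neighbours(vector)  # list of (vector, data, distance)
    return jsonify([{"url": data, "distance": float(dist)} for _, data, dist in hits])

if __name__ == "__main__":
    app.run()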
