# Exploratory Data Analysis

In [1]:
import pandas as pd
from pathlib import Path
from urllib.parse import urlparse
from collections import Counter
import random

First, let's do an exploratory data analysis. We'll start by listing the file names in the `data` directory.

In [2]:
data_path = Path("./data")  # path to files
print([file.name for file in data_path.iterdir()])

['clusters.json', 'fake_test.json', 'fake_train.json', 'real_test.json', 'real_train.json']


We can construct a dictionary whose keys are the file names (without the extension) and whose values are the `Path` objects pointing to the files. Using `Path` objects keeps this code platform-independent.

In [3]:
files_data = {file.stem: file for file in data_path.iterdir()}
files_data

{'clusters': WindowsPath('data/clusters.json'),
 'fake_test': WindowsPath('data/fake_test.json'),
 'fake_train': WindowsPath('data/fake_train.json'),
 'real_test': WindowsPath('data/real_test.json'),
 'real_train': WindowsPath('data/real_train.json')}
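As a small illustration of the platform-independence point, the sketch below builds the same relative path with the two pure path flavours from `pathlib`. The path parts are taken from the listing above; the variable names are only illustrative. On a real system, `Path` picks the matching flavour automatically, which is why the output above shows `WindowsPath`.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Same logical path, built from parts rather than a hard-coded separator.
parts = ("data", "clusters.json")

posix_path = PurePosixPath(*parts)      # flavour used on Linux/macOS
windows_path = PureWindowsPath(*parts)  # flavour used on Windows

print(posix_path)    # rendered with forward slashes
print(windows_path)  # rendered with backslashes

# The stem (file name without extension) is the same on both flavours,
# so the dictionary keys do not depend on the operating system.
print(posix_path.stem, windows_path.stem)
```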

Now we can read the contents of the *train* files, both fake and real.

In [4]:
fake_train = pd.read_json(files_data["fake_train"])
real_train = pd.read_json(files_data["real_train"])

Let's check the contents of each:

In [5]:
fake_train.head()

Unnamed: 0,url,title,text
0,https://nabd.com/s/71539812-b7228b/%D9%86%D8%B...,Online Facts New conspiracy theory: #Bel_Gates...,Roger Stone suggested on Monday that Bill Gate...
1,https://shamra.sy/news/article/8eb73454931e6d1...,Revolutionary Guards: Corona could be an Ameri...,Source\nRussia Today |\nIranian Revolutionary ...
2,https://sudanewsnow.com/19800/,Yellow skin is the host environment of the vir...,Sudan news now from all sources sudanewsnow.co...
3,https://arabic.rt.com/press/1100276-%D8%A7%D9%...,China and Russia are doing what the European U...,China and Russia are doing what the European U...
4,https://www.kachaf.com/details.php?n=5e8957fe1...,,Fatal error: Uncaught MongoDB\Driver\Exception...


In [6]:
real_train.head()

Unnamed: 0,url,title,text
0,https://www.thetimes.co.uk/edition/scotland/sc...,Scots GPs told not to meet fever patients as f...,Scots GPs told not to meet fever patients as f...
1,https://www.bbc.com/news/world-africa-52103799,Coronavirus : Fighting al - Shabab propaganda ...,Coronavirus: Fighting al-Shabab propaganda in ...
2,https://www.thetimes.co.uk/edition/business/en...,Engineer fears China virus impact,Engineer fears China virus impact\nA British e...
3,https://www.theguardian.com/world/live/2020/fe...,Coronavirus : South Korean PM vows swift act...,Here’s a summary of what’s happened so far on ...
4,https://yle.fi/uutiset/osasto/news/finnair_iss...,Finnair issues profit warning over Covid - 19 ...,Finnair issues profit warning over Covid-19 fe...


In [7]:
print(f'Elements in the real training set: {len(real_train)}\nElements in the fake training set: {len(fake_train)}')

Elements in the real training set: 800
Elements in the fake training set: 800


Both sets have the same number of rows. Let's check whether the data is fully populated.

In [8]:
real_train[real_train.title==""]

Unnamed: 0,url,title,text
35,https://www.dw.com/overlay/media/en/covid-19-r...,,The coronavirus has nearly paralyzed large are...
98,https://www.dw.com/overlay/media/en/usa-stagge...,,The US government is warning its citizens to e...
574,https://www.dw.com/overlay/media/en/south-kore...,,South Korea remains the country with the highe...
630,https://www.dw.com/overlay/media/en/myths-vs-f...,,Myths vs. facts: How true is coronavirus infor...
659,https://www.dw.com/overlay/media/en/usa-stagge...,,The US government is warning its citizens to e...
797,https://www.dw.com/overlay/media/en/myths-vs-f...,,Do bleach products protect you?\nBleach/chlori...


In [9]:
fake_train[fake_train.title==""]

Unnamed: 0,url,title,text
4,https://www.kachaf.com/details.php?n=5e8957fe1...,,Fatal error: Uncaught MongoDB\Driver\Exception...
232,https://yandex.ru/news/story/V_SSHA_arestovan_...,,Could not display plot\nYou can search for sim...
249,http://www.khabarmasr.com/home,,- Abdel-Al congratulates President Sisi on the...
479,https://www.kachaf.com/details.php?n=5e8f65461...,,Fatal error: Uncaught MongoDB\Driver\Exception...
558,http://www.khabaralyoum.com/news/get_news/3744...,,Severity: Notice\nMessage: Undefined offset: 0...
613,https://www.kachaf.com/details.php?n=5e772949e...,,Fatal error: Uncaught MongoDB\Driver\Exception...
644,https://www.kachaf.com/details.php?n=5e87960dc...,,Fatal error: Uncaught MongoDB\Driver\Exception...
650,http://www.khabaralyoum.com/news/get_news/7050...,,Severity: Notice\nMessage: Undefined offset: 0...


So, some titles are missing in both sets. However, several of the rows missing a title in `fake_train` contain an error message instead of actual news text, whereas the rows missing a title in `real_train` do contain news text and lack only the title.

In [12]:
print(fake_train.iloc[613].text)

Fatal error: Uncaught MongoDB\Driver\Exception\ConnectionTimeoutException: No suitable servers found (`serverSelectionTryOnce` set): [connection refused calling ismaster on '127.0.0.1:27017'] in /var/www/html/kachaf/php/vendor/mongodb/mongodb/src/Collection.php:645
Stack trace:
#0 /var/www/html/kachaf/php/vendor/mongodb/mongodb/src/Collection.php(645): MongoDB\Driver\Manager->selectServer(Object(MongoDB\Driver\ReadPreference))
#1 /var/www/html/kachaf/php/scripts/getter.php(105): MongoDB\Collection->findOne(Array)
#2 /var/www/html/kachaf/public/details.php(10): getter->findbyid('5e772949ea1e231...', 'news')
#3 {main}
thrown in/var/www/html/kachaf/php/vendor/mongodb/mongodb/src/Collection.phpon line645


In [13]:
print(real_train.iloc[630].text)

Myths vs. facts: How true is coronavirus information on the web?
Does rinsing your nose with saline protect you?
According to the World Health Organization, there is no evidence to support claims that a saline solution will "kill” the virus and protect you.
Will gargling mouthwash prevent an infection?
Certain brands of mouthwash may eliminate particular microbes from your saliva for a few minutes, but, according to the WHO, this does not protect you from the new coronavirus.
Can eating garlic help?
This dubious claim has been spreading like wildfire across social media. Though it is possible that garlic may have some antimicrobial properties, there is no evidence to suggest from the current coronavirus outbreak that eating this bulb will protect people from the virus.
Can pets spread COVID-19?
There is no evidence to suggest pets, such as cats and dogs, can be infected or transmit the coronavirus. Regularly washing your hands with soap and water after touching your beloved moggy or po
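The error pages seen among the fake rows could be flagged with a simple prefix check. The following is a minimal sketch, not a complete cleaning step: the marker strings are copied from the error texts visible in the outputs above, and both `ERROR_MARKERS` and `looks_like_error_page` are hypothetical names introduced here. Other scraping failures in the data may not match these markers.

```python
# Hypothetical markers, taken from the error texts shown in the outputs above.
# This list is an assumption and is probably not exhaustive.
ERROR_MARKERS = ("Fatal error:", "Severity: Notice", "Could not display")


def looks_like_error_page(text: str) -> bool:
    """Return True if the text starts with one of the known error markers."""
    return text.lstrip().startswith(ERROR_MARKERS)


# Toy samples modelled on the rows above (shortened by hand).
samples = [
    "Fatal error: Uncaught MongoDB\\Driver\\Exception",
    "Severity: Notice\nMessage: Undefined offset: 0",
    "The coronavirus has nearly paralyzed large areas",
]
flags = [looks_like_error_page(sample) for sample in samples]
print(flags)
```

In the notebook, the same function could be applied to a whole column, for example with `fake_train["text"].map(looks_like_error_page)`, to count or filter the affected rows.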

Now, let's see how many different domains there are for each class.

In [19]:
fake_domains_list = [urlparse(url).netloc for url in fake_train["url"]]
real_domains_list = [urlparse(url).netloc for url in real_train["url"]]

print(f"Different fake news sites: {len(set(fake_domains_list))}\nDifferent real news sites: {len(set(real_domains_list))}")

Different fake news sites: 404
Different real news sites: 16


So the 800 fake news items come from 404 sites, whereas the 800 real news items come from only 16 sites.

Let's see if there's an intersection.

In [20]:
print( set(fake_domains_list) & set(real_domains_list) )

set()


There is no intersection in this data set: the 16 real news sites and the 404 fake news sites do not overlap at all.

Let's look at the most common sites the fake news comes from, and at all 16 sites the real news comes from.

In [21]:
Counter(fake_domains_list).most_common(10)

[('www.albidda.net', 30),
 ('www.youtube.com', 26),
 ('arabic.rt.com', 19),
 ('www.saadaonline.net', 18),
 ('southfront.org', 13),
 ('www.geopolitica.ru', 12),
 ('sputnik.by', 12),
 ('www.rt.com', 11),
 ('lomazoma.com', 10),
 ('es.news-front.info', 10)]

In [22]:
Counter(real_domains_list).most_common(16)

[('www.axios.com', 63),
 ('www.thetimes.co.uk', 60),
 ('www.bbc.com', 60),
 ('news.err.ee', 59),
 ('www.themoscowtimes.com', 59),
 ('www.theguardian.com', 57),
 ('yle.fi', 57),
 ('apnews.com', 56),
 ('www.dw.com', 55),
 ('www.economist.com', 54),
 ('www.wsj.com', 54),
 ('www.theatlantic.com', 51),
 ('www.euronews.com', 49),
 ('www.nytimes.com', 33),
 ('www.reuters.com', 19),
 ('time.com', 14)]
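The difference in source concentration can be summarised with a couple of numbers per class. The sketch below is one possible way to do it; `domain_stats` is a hypothetical helper introduced here, and it is demonstrated on a toy list rather than on the notebook data.

```python
from collections import Counter


def domain_stats(domains):
    """Summarise how articles are spread over their source domains."""
    counts = Counter(domains)
    n_articles = sum(counts.values())
    n_domains = len(counts)
    _, top_count = counts.most_common(1)[0]
    return {
        "articles": n_articles,
        "domains": n_domains,
        # Average number of articles contributed by each domain.
        "mean_per_domain": n_articles / n_domains,
        # Fraction of all articles that come from the single largest domain.
        "top_share": top_count / n_articles,
    }


# Toy input: three articles from one site, one from another.
toy_domains = ["a.com", "a.com", "a.com", "b.org"]
stats = domain_stats(toy_domains)
print(stats)
```

Applied to `fake_domains_list` and `real_domains_list`, the mean would follow directly from the counts above: 800 / 404 ≈ 1.98 articles per fake site versus 800 / 16 = 50 articles per real site.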

## Conclusions

Although the corpus is balanced between the two classes (real news and disinformation), it is not at all balanced with respect to the sources of the news.

Fake news detection (whether the content is meant to entertain, as with The Onion, or to misinform or disinform) is a hard problem because of the nature of the classification task: the veracity of a text depends on facts that are external to the text. There have been attempts to incorporate external evidence (Popat et al., 2018) and to check whether fake news has inherent stylistic features (Rashkin et al., 2017), but in general it remains an open problem.

My first intuition for checking whether a piece of news text is real or "fake" is to check its source. We could attempt to build a sort of blacklist of websites known to be sources of disinformation. However, with only 16 credible sites in this corpus and no overlap with the fake sources, the task would be trivial and the approach would not scale well. Another idea would be to train a classifier on character *n*-grams extracted from the URLs of the news items, to try to classify a news outlet as credible or not (assuming there is a pattern for machine learning to find).
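To make the character *n*-gram idea concrete, here is a minimal sketch of the feature extraction step only. It is an assumption-laden illustration, not the method used in this project: `char_ngrams` and `url_features` are hypothetical helpers, the choice of `n = 3` is arbitrary, and the sketch uses only the domain part of the URL as a simplification. A classifier would still have to be trained on top of these counts.

```python
from collections import Counter
from urllib.parse import urlparse


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count every overlapping character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def url_features(url: str, n: int = 3) -> Counter:
    """Character n-gram counts of the URL's domain (simplifying assumption)."""
    domain = urlparse(url).netloc.lower()
    return char_ngrams(domain, n)


# URL taken from the real_train preview above.
features = url_features("https://www.bbc.com/news/world-africa-52103799")
print(sorted(features))
```

Given the complete lack of overlap between real and fake domains found above, such features would likely let a model memorise the sources in this corpus, so any result would need to be validated on outlets that were not seen during training.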

As another approach, and as a proof of concept, we can fine-tune a language model such as BERT on a text classification task. That is presented in [this notebook.](./Classification.ipynb)

## References

1. [Popat, K., Mukherjee, S., Yates, A., Weikum, G. (2018). DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*](https://www.aclweb.org/anthology/D18-1003/)

1. [Rashkin, H., Choi, E., Jang, J., Volkova, S., Choi, Y. (2017). Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing*](https://www.aclweb.org/anthology/D17-1317/)