# CoAID: COVID-19 Healthcare Misinformation Dataset

CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset. It covers fake news on websites and social platforms, along with users' social engagement with that news. It includes 5,216 news articles, 296,752 related user engagements, 958 social platform posts about COVID-19, and ground-truth labels.

This dataset is taken from here: https://github.com/cuilimeng/CoAID

### Necessary Imports

In [2]:
import pandas as pd
import numpy as np
# import plotly.express as px
# import plotly
pd.set_option('display.max_rows', 500, "display.max_colwidth", None)
# plotly.offline.init_notebook_mode(connected=True)

## 1. Loading the Datasets

In [3]:
CoAID_claim_fake = pd.read_csv('Initial_datasets/ClaimFakeCOVID.csv', 
                               usecols = ['title'], index_col = False, low_memory=False)
CoAID_claim_real = pd.read_csv('Initial_datasets/ClaimRealCOVID.csv', 
                               usecols = ['title'], index_col = False, low_memory=False)
CoAID_news_fake = pd.read_csv('Initial_datasets/NewsFakeCOVID.csv', 
                              usecols = ['title'], index_col = False, low_memory=False)
CoAID_news_real = pd.read_csv('Initial_datasets/NewsRealCOVID.csv', 
                              usecols = ['title'], index_col = False, low_memory=False)

In [4]:
# Adding claim_veracity column
CoAID_claim_real['claim_veracity'] = 1
CoAID_claim_fake['claim_veracity'] = 0
CoAID_news_real['claim_veracity'] = 1
CoAID_news_fake['claim_veracity'] = 0

# Merging df in pairs
CoAID_claim = pd.concat([CoAID_claim_real, CoAID_claim_fake], ignore_index=True)
CoAID_news = pd.concat([CoAID_news_real, CoAID_news_fake], ignore_index=True)

# Merging all dfs
CoAID = pd.concat([CoAID_claim, CoAID_news], ignore_index=True)

In [5]:
CoAID.shape[0]

5975
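
A quick sanity check after a merge like this is to count rows per label and confirm that no rows were lost. The sketch below uses made-up toy titles instead of the real CSV files, so the frame names and contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for a real/fake pair of CSV files (titles are invented).
real = pd.DataFrame({'title': ['Masks reduce transmission', 'Wash your hands']})
fake = pd.DataFrame({'title': ['Garlic cures COVID-19']})

real['claim_veracity'] = 1
fake['claim_veracity'] = 0

# ignore_index=True gives the merged frame a fresh 0..n-1 index.
merged = pd.concat([real, fake], ignore_index=True)

# Rows per label; the total should equal the sum of the input lengths.
label_counts = merged['claim_veracity'].value_counts().to_dict()
print(label_counts)
print(len(merged) == len(real) + len(fake))
```

The same two checks can be run on `CoAID` itself to see how the 5975 rows split between the two labels.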

## 2. Formatting the Dataset

### 2.1. Deleting quotes around claims
Many claims are wrapped in quotation marks, which are unnecessary and are stripped here.

In [6]:
def deleteQuotes(df):
    df['title'] = np.where((df['title'].str[0] == '"') & (df['title'].str[-1] == '"'), 
                           df['title'].str[1:-1], df['title'])
    df['title'] = np.where((df['title'].str[0] == '“') & (df['title'].str[-1] == '”'), 
                           df['title'].str[1:-1], df['title'])

for df in [CoAID_claim, CoAID_news]: deleteQuotes(df)
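
To see what the quote stripping does and does not touch, here is a small sketch on invented titles. The helper `delete_quotes_demo` is a hypothetical name; it repeats the same `np.where` logic so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def delete_quotes_demo(df):
    # Strip one pair of wrapping quotes, first straight ones, then curly ones.
    for open_q, close_q in [('"', '"'), ('\u201c', '\u201d')]:
        wrapped = (df['title'].str[0] == open_q) & (df['title'].str[-1] == close_q)
        df['title'] = np.where(wrapped, df['title'].str[1:-1], df['title'])

# Invented titles covering the interesting cases.
toy = pd.DataFrame({'title': [
    '"Bleach cures COVID-19"',                   # straight quotes on both ends
    '\u201cVitamin C prevents infection\u201d',  # curly quotes on both ends
    '"Unbalanced quote',                         # only an opening quote
    'No quotes here',
]})

delete_quotes_demo(toy)
print(toy['title'].tolist())
```

Only titles quoted on both ends are changed; a title with a single stray quote is left as it is.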

### 2.2. Deleting non-ASCII characters from the dataset

In [7]:
# Deleting non-ascii characters from the strings

def deleteNonAscii(df):
    df['title'] = df['title'].astype(str).apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
    
for df in [CoAID_claim, CoAID_news]: deleteNonAscii(df)
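
One thing to keep in mind with `encode('ascii', 'ignore')` is that it drops accented letters entirely instead of replacing them with their base letter. The sketch below compares it with an alternative that first applies `unicodedata.normalize('NFKD', ...)`. That alternative is not used in this notebook; the function names and titles are made up for illustration.

```python
import unicodedata

def strip_non_ascii(text):
    # The approach used in this notebook: silently drop every non-ASCII character.
    return text.encode('ascii', 'ignore').decode('ascii')

def fold_to_ascii(text):
    # Alternative (an assumption, not the notebook's method): decompose accented
    # letters first, so the base letter survives and only the accent is dropped.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

titles = ['Caf\u00e9 owners fined', '\u201cMiracle\u201d cure']

stripped = [strip_non_ascii(t) for t in titles]
folded = [fold_to_ascii(t) for t in titles]

print(stripped)
print(folded)
```

For English-language titles the difference is usually small, but names with accents come out truncated with the first approach.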

### 2.3. Deleting entries with Links

In [8]:
def deleteLinks(df):
    df.drop(df.loc[df['title'].str.lower().str.contains('http')].index, inplace=True)
    
for df in [CoAID_claim, CoAID_news]: deleteLinks(df)
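
The link filter removes whole rows, not just the URL inside a title. A non-mutating way to express the same idea is a boolean mask, sketched here on invented titles (the example URLs are placeholders).

```python
import pandas as pd

toy = pd.DataFrame({'title': [
    'See https://example.com for details',
    'Vaccines are tested in trials',
    'HTTP://EXAMPLE.ORG says otherwise',
]})

# Lower-casing first makes the match case-insensitive; regex=False treats
# 'http' as a plain substring.
has_link = toy['title'].str.lower().str.contains('http', regex=False)

# Keep only the rows without a link and renumber them.
kept = toy[~has_link].reset_index(drop=True)
print(kept['title'].tolist())
```

Because this is a plain substring match, any title that merely contains the letters "http" is dropped as well.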

### 2.4. Inspecting CoAID_claim dataset (518 entries)

- Claims as questions: Many claims are phrased as questions, so they cannot be classified as true or false. They are all labelled as true in the dataset, so they need to be dropped as faulty entries.

In [9]:
CoAID_claim_picked = CoAID_claim[CoAID_claim['title'].str[-1] != '?'] # We are left with only 75 entries

- The other entries are of good quality and can be used in the classifier.

### 2.5. Inspecting CoAID_News dataset (5457 entries)
- True entries (4532) are very noisy and of low quality. Most are article titles that contain no factual claim and merely introduce a longer article where the topic is expanded.
- Fake claims (925), on the other hand, are well structured and seem to be of good quality.

In [10]:
# I am dropping all true entries from here and taking all the fake claims.
CoAID_news_picked = CoAID_news[CoAID_news['claim_veracity'] == 0]
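
The two selection steps (dropping question-style claims and keeping only fake news titles) can be tried out on a toy example. The frames and titles below are invented; the filters are the same expressions used in the cells above.

```python
import pandas as pd

# Invented stand-ins for CoAID_claim and CoAID_news.
claims = pd.DataFrame({
    'title': ['Can pets spread COVID-19?', 'Garlic cures COVID-19'],
    'claim_veracity': [1, 0],
})
news = pd.DataFrame({
    'title': ['What we know so far', '5G towers spread the virus'],
    'claim_veracity': [1, 0],
})

# Drop claims that end with a question mark.
claims_picked = claims[claims['title'].str[-1] != '?']
# Keep only the fake news titles.
news_picked = news[news['claim_veracity'] == 0]

picked = pd.concat([claims_picked, news_picked], ignore_index=True)
label_counts = picked['claim_veracity'].value_counts().to_dict()
print(picked)
print(label_counts)
```

Since the news portion contributes only label 0, it is worth checking the label counts of the picked set before training a classifier on it.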

## 3. Saving the dataset

In [11]:
CoAID_Final = pd.concat([CoAID_claim, CoAID_news], ignore_index=True) # ignore_index avoids duplicate index values in the saved file
CoAID_Picked = pd.concat([CoAID_claim_picked, CoAID_news_picked], ignore_index=True) # 1000 entries

In [12]:
# Change the column name
CoAID_Final.rename(columns={'title': 'claim'}, inplace=True)
CoAID_Picked.rename(columns={'title': 'claim'}, inplace=True)

In [11]:
CoAID_Picked.to_csv('CoAID_Picked.csv', encoding='utf-8')
CoAID_Final.to_csv('CoAID_Final.csv', encoding='utf-8')
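
Note that `to_csv` writes the DataFrame index as an unnamed first column by default. A minimal sketch of reading such a file back, using an in-memory buffer and an invented one-row frame instead of the real files:

```python
import io
import pandas as pd

picked = pd.DataFrame({'claim': ['Garlic cures COVID-19'], 'claim_veracity': [0]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
picked.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the index shows up as an extra column named `Unnamed: 0` when the file is loaded.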