This notebook constructs the knowledgebase ```.csv``` files for the fact checker to use.

The constructed knowledgebases have 3 columns: statement, verdict and keywords.
In order to construct one on a new subject, replace the ```subject``` variable and run the notebook.

The stopwords list is from https://www.kaggle.com/datasets/rowhitswami/stopwords.
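
As a rough illustration of the three-column layout described above, the sketch below builds a tiny knowledgebase and round-trips it through CSV text. The rows are hypothetical placeholders, not data from any of the datasets used here.

```python
import io
import pandas as pd

# Hypothetical rows, only to illustrate the statement / verdict / keywords layout.
kb = pd.DataFrame(
    {
        "statement": ["Example statement one.", "Example statement two."],
        "verdict": [True, False],
        "keywords": ["example statement one", "example statement two"],
    }
)

# Round-trip through CSV text, written without the index as in the notebook.
buffer = io.StringIO()
kb.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(list(loaded.columns))
print(loaded)
```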

In [None]:
import pandas as pd
import wikipedia
import re

import os

In [None]:
# path components are passed separately so the path is not tied to Windows separators
dataPath = os.path.join(os.path.abspath(".."), "podcasts-transcripts", "training_data.csv")

In [None]:
data = pd.read_csv(dataPath)
print(data.columns)
print(data.shape)

In [None]:
data['verdict'] = data['verdict'].map({0:True, 1:False})
data = data[['statement', 'verdict']]

In [None]:
# list of stopwords, from https://github.com/Alir3z4/stop-words/blob/bd8cc1434faeb3449735ed570a4a392ab5d35291/english.txt
# has been modified from this version

with open("english.txt", "r") as file:
    # a set of whole words, so `in` matches full words rather than substrings of the file text
    stop = set(file.read().split())

In [None]:
subject = 'COVID'

# .copy() so the keyword column added later does not write to a view of data
facts = data[data['statement'].str.contains(subject, na=False)].copy()

In [None]:
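facts['keywords'] = facts['statement']
# removes all instances of "'s" (straight or curly apostrophe), all punctuation, all numbers, and sets to lowercase
# regex=True is required for the patterns to be treated as regular expressions in recent pandas
facts['keywords'] = (facts['keywords']
                     .str.replace(r"['’]s", '', regex=True)
                     .str.replace(r'[^\w\s]', '', regex=True)
                     .str.replace(r'\d', '', regex=True)
                     .str.lower())
# removes stopwords
facts['keywords'] = facts['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

# removes duplicates, keeping the first-seen order so the output is repeatable
facts['keywords'] = facts['keywords'].str.split().map(lambda words: ' '.join(dict.fromkeys(words)))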
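
The same keyword steps are repeated for each knowledgebase in this notebook, so they could be collected into one helper. The sketch below is one possible shape for it, using only the standard library; `make_keywords` and `example_stopwords` are hypothetical names, and the stopwords are passed as a set because `in` on a plain string would test for substrings rather than whole words.

```python
import re

def make_keywords(statement, stopwords):
    """Return de-duplicated, lowercased keywords from one statement."""
    text = re.sub(r"['’]s", "", statement)   # drop possessive 's (straight or curly)
    text = re.sub(r"[^\w\s]", "", text)      # drop punctuation
    text = re.sub(r"\d", "", text)           # drop digits
    words = [w for w in text.lower().split() if w not in stopwords]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return " ".join(dict.fromkeys(words))

# Hypothetical stopword set, not the contents of english.txt.
example_stopwords = {"the", "is", "a", "of", "in"}

result = make_keywords(
    "The COVID-19 vaccine's safety is the focus of a 2021 study in the UK",
    example_stopwords,
)
print(result)
```

With pandas, a helper like this could be applied per row with `facts['statement'].apply(make_keywords, stopwords=stop)`, assuming `stop` holds a set of words.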

In [None]:
facts.head()

In [None]:
filename = os.path.join('factbase', subject + '.csv')
facts.to_csv(filename, index = False)

---
The section above works for established datasets like the Politifact one, but the HCQ and IVM datasets require a different approach.

My current idea is to sample 10,000 or so rows from the dataset to use as the KB.
If that doesn't give the results I'm looking for, then my options are:
 - scrape the Wikipedia page and use every sentence in the KB
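
A minimal sketch of the sampling idea, assuming a small stand-in DataFrame rather than the real HCQ data: one random sample becomes the KB, and a second sample drawn from the remaining rows becomes the test set, so the two never overlap. The `random_state` argument is an addition for repeatability and is not part of the cells below.

```python
import pandas as pd

# Hypothetical stand-in for the processed dataset; rows with label 2 are discarded.
data = pd.DataFrame(
    {
        "statement": [f"claim {i}" for i in range(20)],
        "verdict": [i % 3 for i in range(20)],
    }
)
data = data[data["verdict"] != 2]

# random_state is an assumption added here so the split is repeatable.
facts = data.sample(n=5, random_state=0)          # knowledgebase rows
remaining = data.drop(facts.index)                # everything not in the KB
claims = remaining.sample(n=5, random_state=0)    # held-out test claims

print(len(facts), len(claims), len(remaining) - len(claims))
```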

In [None]:
# path components are passed separately so the path is not tied to Windows separators
dataPath = os.path.join(os.path.abspath(".."), "podcasts-transcripts", "hcq data", "hcq_processed.csv")

In [None]:
data = pd.read_csv(dataPath)
print(data.columns)
print(data.shape)

In [None]:
# kb only
data = data.rename(columns={'text': 'statement', 'pred': 'verdict'})

In [None]:
# kb only
data.drop(data[data.verdict == 2].index, inplace = True)

In [None]:
print(data.shape)

In [None]:
# for kb
facts = data.sample(n = 5000)
data = data.drop(facts.index)

In [None]:
# for test set
claims = data.sample(n = 5001)
data = data.drop(claims.index)

In [None]:
print(data.shape)
print(claims.shape)

In [None]:
# kb only
facts['verdict'] = facts['verdict'].map({0:True, 1:False})

---
Keyword list generation (KB only).

In [None]:
# list of stopwords, from https://github.com/Alir3z4/stop-words/blob/bd8cc1434faeb3449735ed570a4a392ab5d35291/english.txt
# has been modified from this version

with open("english.txt", "r") as file:
    # a set of whole words, so `in` matches full words rather than substrings of the file text
    stop = set(file.read().split())

In [None]:
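facts['keywords'] = facts['statement']
# removes all instances of "'s" (straight or curly apostrophe), all punctuation, all numbers, and sets to lowercase
# regex=True is required for the patterns to be treated as regular expressions in recent pandas
facts['keywords'] = (facts['keywords']
                     .str.replace(r"['’]s", '', regex=True)
                     .str.replace(r'[^\w\s]', '', regex=True)
                     .str.replace(r'\d', '', regex=True)
                     .str.lower())
# removes stopwords
facts['keywords'] = facts['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# removes duplicates, keeping the first-seen order so the output is repeatable
facts['keywords'] = facts['keywords'].str.split().map(lambda words: ' '.join(dict.fromkeys(words)))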

In [None]:
facts.head()

In [None]:
facts.shape

In [None]:
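filename = os.path.join('factbase', "Hydroxychloroquine_5k_3" + '.csv')
# the factbase file holds the KB rows (facts); the claims are written to test_sets separately
facts.to_csv(filename, index = False)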

---

In [None]:
claims.head()

In [None]:
filename = os.path.join('test_sets', "Hydroxychloroquine_5k_3" + '.csv')
claims.to_csv(filename, index = False)

---
After mixed (and mostly negative) results using the previous approach, I've decided to take extracts from Wikipedia pages to use as a fact-base.
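
The extraction can be tried offline on a short string before fetching real pages. The sketch below assumes the page content is plain text with `== Heading ==` lines; `extract_sentences` is a hypothetical helper, the per-line heading pattern is an assumption about that format, and the 30-character minimum mirrors the length filter applied further down.

```python
import re

def extract_sentences(page_text, min_length=30):
    """Split plain-text page content into sentences, skipping section headings."""
    # Remove heading lines such as "== History ==" (matched one line at a time).
    text = re.sub(r"^=+.*=+\s*$", "", page_text, flags=re.MULTILINE)
    # Collapse newlines to single spaces so words on adjacent lines stay separate.
    text = re.sub(r"\s+", " ", text)
    # Naive split on full stops; abbreviations and decimals will be split too.
    sentences = [s.strip() for s in text.split(".")]
    return [s for s in sentences if len(s) >= min_length]

# Hypothetical text in the shape of a page's plain-text content.
sample = (
    "Intro sentence that is long enough to be kept here.\n"
    "Too short.\n"
    "== Section heading ==\n"
    "Another sufficiently long sentence follows the heading.\n"
)
sentences = extract_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Splitting on `.` is crude, so a proper sentence tokenizer may be worth considering if fragments from abbreviations or decimals turn out to matter.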

In [None]:
in_page = wikipedia.page("COVID-19 misinformation", auto_suggest=False).content

In [None]:
#words = in_page.split()
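# raw string avoids the invalid "\=" escape; strips "== Heading ==" markers
in_page = re.sub(r"=.*=", "", in_page)
# replace newlines with spaces so words on adjacent lines are not fused together
in_page = re.sub("\n", " ", in_page)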
in_page = in_page.split(".")
in_page = [sentence.strip() for sentence in in_page]
print(in_page)

In [None]:
facts = pd.DataFrame({'statement':in_page})
print (facts)

In [None]:
# auto_suggest=False so the exact title is fetched, as for the first page
in_page = wikipedia.page("List of unproven methods against COVID-19", auto_suggest=False).content
in_page = re.sub(r"=.*=", "", in_page)
in_page = re.sub("\n", " ", in_page)
in_page = in_page.split(".")
in_page = [sentence.strip() for sentence in in_page]
print(in_page)

In [None]:
facts2 = pd.DataFrame({'statement':in_page})
print (facts2)

In [None]:
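# DataFrame.append was removed in pandas 2.0; ignore_index gives the combined rows unique labels
facts = pd.concat([facts, facts2], ignore_index=True)
# dropna returns a new frame, so the result has to be assigned
facts = facts.dropna()
print(facts.shape)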

In [None]:
# keep statements of at least 30 characters; a boolean mask is safe even if index labels repeat
facts = facts[facts['statement'].map(len) >= 30].copy()
print(facts.shape)
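
One thing to watch when stacking two frames and then filtering by length: if both frames keep their default index, labels repeat, and dropping by label removes every row that shares a label with a short row. The sketch below shows the difference on hypothetical data, comparing a label-based drop with a boolean mask.

```python
import pandas as pd

# Hypothetical frames; both get the default index labels 0 and 1.
a = pd.DataFrame({"statement": ["short", "a statement that is comfortably over thirty characters"]})
b = pd.DataFrame({"statement": ["another statement that is well over thirty characters", "tiny"]})

stacked = pd.concat([a, b])                      # index labels: 0, 1, 0, 1
too_short = stacked["statement"].map(len) < 30

# Dropping by label removes every row sharing a label with a short row.
by_label = stacked.drop(stacked[too_short].index)

# A boolean mask keeps exactly the rows that pass the length check.
by_mask = stacked[~too_short]

print(len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` is another way to avoid the repeated labels in the first place.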

In [None]:
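facts['verdict'] = True

facts['keywords'] = facts['statement']
# removes all instances of "'s" (straight or curly apostrophe), all punctuation, all numbers, and sets to lowercase
# regex=True is required for the patterns to be treated as regular expressions in recent pandas
facts['keywords'] = (facts['keywords']
                     .str.replace(r"['’]s", '', regex=True)
                     .str.replace(r'[^\w\s]', '', regex=True)
                     .str.replace(r'\d', '', regex=True)
                     .str.lower())
# removes stopwords
facts['keywords'] = facts['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# removes duplicates, keeping the first-seen order so the output is repeatable
facts['keywords'] = facts['keywords'].str.split().map(lambda words: ' '.join(dict.fromkeys(words)))

print(facts)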

In [None]:
filename = os.path.join('factbase', "fc_wiki" + '.csv')
facts.to_csv(filename, index = False)