Import all dependencies

These include:
 - pandas
 - numpy
 - url library
 - string io
 - re (regular expression)
 - sklearn
     - tfidf vectoriser
     - train test split
     - linear Support Vector Machine
     - classification report

In [59]:
import pandas as pd
import numpy as np
import re

import urllib.request
from io import StringIO

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

Load data from CSV file

In [69]:
URL = 'https://raw.githubusercontent.com/Gautamshahi/FakeCovid/master/data/FakeCovid_July2020.csv'

response = urllib.request.urlopen(URL)
data = response.read()
text = data.decode('utf-8')

# Create dataframe
df = pd.read_csv(StringIO(text), sep=',') # index_col=[0, 1, 2, 3

#uncomment this lines to read from local source for offline work
#df = pd.read_csv('FakeCovid_July2020.csv')

df.head()

Get a list of the column names

In [None]:
list(df.columns.values)

Clean up some of the abbreviations in the dataset

In [6]:
df["lang"]= df["lang"].replace('en', "English")
df["lang"]= df["lang"].replace('es', "Spanish")
df["lang"]= df["lang"].replace('fr', "French")
df["lang"]= df["lang"].replace('pt', "Portuguese")
df["lang"]= df["lang"].replace('tr', "Turkish")
df["lang"]= df["lang"].replace('hi', "Hindi")
df["lang"]= df["lang"].replace('zh-tw', "Chinese")
df["lang"]= df["lang"].replace('hr', "Croatian")
df["lang"]= df["lang"].replace('te', "Telugu")
df["lang"]= df["lang"].replace('it', "Italian")
df["lang"]= df["lang"].replace('mk', "Macedonian")
df["lang"]= df["lang"].replace('de', "German")
df["lang"]= df["lang"].replace('ar', "Arabic")
df["lang"]= df["lang"].replace('id', "Indonesian")
df["lang"]= df["lang"].replace('ml', "Malayalam")
df["lang"]= df["lang"].replace('ja', "Japanese")
df["lang"]= df["lang"].replace('ta', "Tamil")
df["lang"]= df["lang"].replace('ko', "Korean")
df["lang"]= df["lang"].replace('lt', "Lithuanian")
df["lang"]= df["lang"].replace('pl', "Polish")
df["lang"]= df["lang"].replace('da', "Danish")
df["lang"]= df["lang"].replace('mr', "Marathi")
df["lang"]= df["lang"].replace('tl', "Tagalog")
df["lang"]= df["lang"].replace('ru', "Russian")
df["lang"]= df["lang"].replace('nl', "Dutch")
df["lang"]= df["lang"].replace('fa', "Persian")
df["lang"]= df["lang"].replace('bn', "Bengali")
df["lang"]= df["lang"].replace('el', "Greek")
df["lang"]= df["lang"].replace('lv', "Latvian")
df["lang"]= df["lang"].replace('gu', "Gujarati")
df["lang"]= df["lang"].replace('et', "Estonian")
df["lang"]= df["lang"].replace('uk', "Ukrainian")
df["lang"]= df["lang"].replace('ur', "Urdu")
df["lang"]= df["lang"].replace('th', "Thai")
df["lang"]= df["lang"].replace('ca', "Catalan")
df["lang"]= df["lang"].replace('vi', "Vietnamese")
df["lang"]= df["lang"].replace('fi', "Finnish")
df.head()

Unnamed: 0,ID,ref_category_title,ref_url,pageid,verifiedby,country,class,title,published_date,country1,country2,country3,country4,article_source,ref_source,source_title,content_text,category,lang
0,FC1,FALSE: The coronavirus is an amplified bacteri...,https://www.poynter.org/?ifcn_misinformation=t...,https://www.poynter.org/ifcn-covid-19-misinfor...,La Silla Vacía,Colombia,False,The coronavirus is an amplified bacteria rela...,2020/06/17,Colombia,,,,https://lasillavacia.com/detector-video-falso-...,poynter,Detector a video falso que dice que el Covid e...,La Silla Vacía usa Cookies para mejorar la exp...,,Spanish
1,FC2,FALSE: A law allows people to go for a run dur...,https://www.poynter.org/?ifcn_misinformation=a...,https://www.poynter.org/ifcn-covid-19-misinfor...,Newtral.es,Spain,False,A law allows people to go for a run during th...,2020/04/09,Spain,,,,https://www.newtral.es/la-broma-de-que-a-los-r...,poynter,La broma de que a los “runners” se les permite...,En los últimos días nos ha llegado una consult...,,Spanish
2,FC3,False: Chinese converting to Islam after reali...,https://www.poynter.org/?ifcn_misinformation=c...,https://www.poynter.org/ifcn-covid-19-misinfor...,FactCrescendo,India,False,Chinese converting to Islam after realizing t...,2020/02/20,India,,,,https://english.factcrescendo.com/2020/02/20/c...,poynter,Are Chinese people converting to Islam in fear...,"The fact behind every news!, Ever since the Wo...",,English
3,FC4,False: Bat market and bat meat are being sold ...,https://www.poynter.org/?ifcn_misinformation=b...,https://www.poynter.org/ifcn-covid-19-misinfor...,France 24 Observers,France,False,Bat market and bat meat are being sold in Wuhan.,2020/01/27,France,,,,https://observers.france24.com/fr/20200130-int...,poynter,"La soupe à la chauve-souris, un plat prisé en ...","عربي, English, Français, Contribuer, فارسی, عر...",,French
4,FC5,False: You can self-diagnose COVID-19 by holdi...,https://www.poynter.org/?ifcn_misinformation=y...,https://www.poynter.org/ifcn-covid-19-misinfor...,Agência Lupa,Brazil,False,You can self-diagnose COVID-19 by holding you...,2020/03/16,Brazil,,,,https://piaui.folha.uol.com.br/lupa/2020/03/16...,poynter,#Verificamos: É falso que quem consegue prende...,", “O novo CORONA VÍRUS pode não mostrar sinais...",,Portuguese


Just focusing on English for now...

In [25]:
df2 = df.loc[df['lang'] == 'English'].copy()
df2.head()

Unnamed: 0,ID,ref_category_title,ref_url,pageid,verifiedby,country,class,title,published_date,country1,country2,country3,country4,article_source,ref_source,source_title,content_text,category,lang
2,FC3,False: Chinese converting to Islam after reali...,https://www.poynter.org/?ifcn_misinformation=c...,https://www.poynter.org/ifcn-covid-19-misinfor...,FactCrescendo,India,False,Chinese converting to Islam after realizing t...,2020/02/20,India,,,,https://english.factcrescendo.com/2020/02/20/c...,poynter,Are Chinese people converting to Islam in fear...,"The fact behind every news!, Ever since the Wo...",,English
6,FC7,MISLEADING: Captions on a reuploaded video abo...,https://www.poynter.org/?ifcn_misinformation=c...,https://www.poynter.org/ifcn-covid-19-misinfor...,VERA Files,Philippines,MISLEADING,Captions on a reuploaded video about the U.S....,2020/05/09,Philippines,,,,https://verafiles.org/articles/vera-files-fact...,poynter,VERA FILES FACT CHECK: Remdesivir to ‘end’ COV...,"AUTHOR, VERA Files, DATE, May 08, 2020, SHARE,...",,English
8,FC9,Mostly True: Ghana has 307 ambulances with mob...,https://www.poynter.org/?ifcn_misinformation=g...,https://www.poynter.org/ifcn-covid-19-misinfor...,GhanaFact,Ghana,Mostly True,Ghana has 307 ambulances with mobile ventilat...,2020/04/03,Ghana,,,,https://ghanafact.com/fact-check-does-ghanas-3...,poynter,Fact-check: Does Ghana have 307 ambulances wit...,"Source: Dr Anthony Nsiah Asare, Verdict: Mostl...",,English
9,FC10,FALSE: “Governor Andy Beshear has authorized K...,https://www.poynter.org/?ifcn_misinformation=g...,https://www.poynter.org/ifcn-covid-19-misinfor...,PolitiFact,United States,FALSE,“Governor Andy Beshear has authorized Kentuck...,2020/04/29,United States,,,,https://www.politifact.com/factchecks/2020/may...,poynter,"PolitiFact | No, Kentucky teachers won’t be co...","More Info, Trying to focus on school work at h...",,English
10,FC11,False: Photo shows food being distributed to R...,https://www.poynter.org/?ifcn_misinformation=p...,https://www.poynter.org/ifcn-covid-19-misinfor...,AfricaCheck,Kenya,False,Photo shows food being distributed to Rwandan...,2020/03/30,Kenya,,,,https://africacheck.org/fbcheck/food-distribut...,poynter,Food distribution during Rwanda’s coronavirus ...,A photo of hundreds of neat piles of bedding a...,,English


Clean the text in the content_text column
Make it all lower case, remove numbers and remove some special characters

In [26]:
def text_clean(x):
    #all lower case and remove slashes and underscores
    x = str(x).lower().replace('\ ', '').replace('_', ' ')
    #use a magic regular expression to do more cleaning
    x = re.sub("(.)\1{2,}", "\1", x)
    return x

df2['content_text'] = df2['content_text'].apply(lambda x: text_clean(x))

Unnamed: 0,ID,ref_category_title,ref_url,pageid,verifiedby,country,class,title,published_date,country1,country2,country3,country4,article_source,ref_source,source_title,content_text,category,lang
2,FC3,False: Chinese converting to Islam after reali...,https://www.poynter.org/?ifcn_misinformation=c...,https://www.poynter.org/ifcn-covid-19-misinfor...,FactCrescendo,India,False,Chinese converting to Islam after realizing t...,2020/02/20,India,,,,https://english.factcrescendo.com/2020/02/20/c...,poynter,Are Chinese people converting to Islam in fear...,"the fact behind every news!, ever since the wo...",,English
6,FC7,MISLEADING: Captions on a reuploaded video abo...,https://www.poynter.org/?ifcn_misinformation=c...,https://www.poynter.org/ifcn-covid-19-misinfor...,VERA Files,Philippines,MISLEADING,Captions on a reuploaded video about the U.S....,2020/05/09,Philippines,,,,https://verafiles.org/articles/vera-files-fact...,poynter,VERA FILES FACT CHECK: Remdesivir to ‘end’ COV...,"author, vera files, date, may 08, 2020, share,...",,English
8,FC9,Mostly True: Ghana has 307 ambulances with mob...,https://www.poynter.org/?ifcn_misinformation=g...,https://www.poynter.org/ifcn-covid-19-misinfor...,GhanaFact,Ghana,Mostly True,Ghana has 307 ambulances with mobile ventilat...,2020/04/03,Ghana,,,,https://ghanafact.com/fact-check-does-ghanas-3...,poynter,Fact-check: Does Ghana have 307 ambulances wit...,"source: dr anthony nsiah asare, verdict: mostl...",,English
9,FC10,FALSE: “Governor Andy Beshear has authorized K...,https://www.poynter.org/?ifcn_misinformation=g...,https://www.poynter.org/ifcn-covid-19-misinfor...,PolitiFact,United States,FALSE,“Governor Andy Beshear has authorized Kentuck...,2020/04/29,United States,,,,https://www.politifact.com/factchecks/2020/may...,poynter,"PolitiFact | No, Kentucky teachers won’t be co...","more info, trying to focus on school work at h...",,English
10,FC11,False: Photo shows food being distributed to R...,https://www.poynter.org/?ifcn_misinformation=p...,https://www.poynter.org/ifcn-covid-19-misinfor...,AfricaCheck,Kenya,False,Photo shows food being distributed to Rwandan...,2020/03/30,Kenya,,,,https://africacheck.org/fbcheck/food-distribut...,poynter,Food distribution during Rwanda’s coronavirus ...,a photo of hundreds of neat piles of bedding a...,,English


TO simplify the task, we are just interested in two classes "False" and "Other" for now
We do some cleaning to make the class consistent, then check we are left with only two classes

In [40]:
df2['class']= df2['class'].replace('FALSE', 'False')
df2['class']= df2['class'].replace('false', 'False')
condition = df2['class']!= 'False'
df2.loc[condition, 'class'] = 'Other'
df2['class'].describe()

count      2845
unique        2
top       False
freq       2266
Name: class, dtype: object

Vectorise the text in the dataset with tf-idf and then put it into a list
(should get back a sparse matrix)

In [53]:
tfidf = TfidfVectorizer(max_features=1000)

X = df2['content_text']
y = df2['class']

X = tfidf.fit_transform(X)
# X

<2845x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 546841 stored elements in Compressed Sparse Row format>

Create testing and training sets

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Create model, train and then test

In [56]:
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Print report of the model performance

In [57]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.86      0.94      0.90       452
       Other       0.65      0.43      0.52       117

    accuracy                           0.83       569
   macro avg       0.76      0.68      0.71       569
weighted avg       0.82      0.83      0.82       569

