# Fake Tweets Detection using Concurrent Neural Networks

First, download the PHEME veracity dataset used in this research.

In [75]:
! mkdir -p ./dataset
! wget "https://ndownloader.figshare.com/files/11767817" -O "./dataset/pheme_veracity.tar.bz2"

--2020-02-27 02:56:27--  https://ndownloader.figshare.com/files/11767817
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 18.203.76.79, 18.203.5.169, 34.255.246.69, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|18.203.76.79|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/11767817/PHEME_veracity.tar.bz2 [following]
--2020-02-27 02:56:28--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/11767817/PHEME_veracity.tar.bz2
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.97.75
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.97.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46529729 (44M) [binary/octet-stream]
Saving to: ‘./dataset/pheme_veracity.tar.bz2’


2020-02-27 02:56:31 (14.1 MB/s) - ‘./dataset/pheme_veracity.tar.bz2’ saved [46529729/46529729]
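
Before extracting, it can be worth confirming that the download is complete. The sketch below compares the file size on disk with the byte count wget reported in the log above (46529729). The function name `archive_looks_complete` is hypothetical, and a size match is only a cheap sanity check, not an integrity guarantee.

```python
from pathlib import Path

# Byte count reported by wget in the download log above.
EXPECTED_SIZE = 46529729

def archive_looks_complete(path, expected_size=EXPECTED_SIZE):
    """Return True if `path` is a file whose size equals `expected_size`."""
    p = Path(path)
    return p.is_file() and p.stat().st_size == expected_size
```

In the notebook this could be called as `archive_looks_complete("./dataset/pheme_veracity.tar.bz2")` before running `tar`.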



In [76]:
! tar xC ./dataset -f ./dataset/pheme_veracity.tar.bz2

Let's start cleaning up the dataset. Because we're not using the thread-based annotation system in this dataset, we can flatten the folder structure by copying every thread folder, from both the rumour and non-rumour directories, into a single directory.

In [78]:
! mkdir -p ./flatten1
! rsync -a ./dataset/**/**/non-rumours/* ./flatten1
! rsync -a ./dataset/**/**/rumours/* ./flatten1
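
For environments without rsync, the same flattening can be sketched in pure Python. This assumes the layout implied by the commands above (two directory levels, then `rumours` or `non-rumours`, then one folder per thread) and Python 3.8+ for `dirs_exist_ok`. The function name `flatten_threads` is hypothetical. Unlike rsync, it copies only directories and skips hidden entries.

```python
import shutil
from pathlib import Path

def flatten_threads(dataset_dir, out_dir):
    """Copy each thread folder under */*/{non-rumours,rumours}/ into out_dir.

    Returns the number of thread folders copied.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = 0
    for kind in ("non-rumours", "rumours"):
        for thread in sorted(Path(dataset_dir).glob(f"*/*/{kind}/*")):
            # Skip stray files and hidden entries; only thread folders matter.
            if thread.is_dir() and not thread.name.startswith("."):
                shutil.copytree(thread, out / thread.name, dirs_exist_ok=True)
                copied += 1
    return copied
```

A call such as `flatten_threads("./dataset", "./flatten1")` would mirror the two rsync commands.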

In [142]:
from pathlib import Path

rootdir = Path('./flatten1')
tweet_folders = [f for f in rootdir.glob('*') if f.is_dir()]

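The PHEME project helpfully provides a Python function that converts the annotations into "Verified True", "Verified False" and "Unverified" tags. The version below is adapted so that, in string mode, unverified tweets get `None` instead of a tag, which lets us drop them later.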

In [168]:
def convert_annotations(annotation, string=True):
    # Default: no usable label (missing, unexpected or contradictory fields).
    label = None

    if 'misinformation' in annotation and 'true' in annotation:
        misinformation = int(annotation['misinformation'])
        verified_true = int(annotation['true'])
        if misinformation == 0 and verified_true == 0:
            # Unverified: None in string mode (dropped later), 2 in numeric mode.
            label = None if string else 2
        elif misinformation == 0 and verified_true == 1:
            label = "true" if string else 1
        elif misinformation == 1 and verified_true == 0:
            label = "false" if string else 0
        # misinformation == 1 and true == 1 is contradictory: label stays None.

    elif 'misinformation' in annotation:
        # Some instances have a misinformation label but no true label.
        misinformation = int(annotation['misinformation'])
        if misinformation == 0:
            label = None if string else 2
        elif misinformation == 1:
            label = "false" if string else 0

    # Only 'true' present, or neither field present: label stays None.
    return label
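
The string-mode mapping can also be read as a small lookup table keyed on the two annotation fields. The sketch below is an illustrative restatement, not the PHEME project's code. The names `LABELS` and `label_from_annotation` are hypothetical, and it covers only the string labels used in this notebook.

```python
# (misinformation, true) -> string label. None means "no usable label".
# A missing 'true' field is represented by None in the key.
LABELS = {
    (0, 0): None,        # unverified
    (0, 1): "true",
    (1, 0): "false",
    (1, 1): None,        # contradictory annotation
    (0, None): None,     # unverified, no 'true' field
    (1, None): "false",
}

def label_from_annotation(annotation):
    """Return "true", "false" or None for a raw annotation dict."""
    if "misinformation" not in annotation:
        return None
    misinformation = int(annotation["misinformation"])
    verified_true = int(annotation["true"]) if "true" in annotation else None
    # Unexpected value combinations fall back to None.
    return LABELS.get((misinformation, verified_true))
```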

In [158]:
import json

def get_source_tweet_path(tweet_id):
    return Path('./flatten1') / tweet_id / 'source-tweets' / (tweet_id + '.json')

def get_annotation_path(tweet_id):
    return Path('./flatten1') / tweet_id / 'annotation.json'

def parse_tweet(tweet_id):
    # Read the thread-level annotation and convert it to a string label.
    with open(get_annotation_path(tweet_id)) as f:
        annotation = convert_annotations(json.load(f))

    # Read the source tweet and keep only its text.
    with open(get_source_tweet_path(tweet_id)) as f:
        raw_tweet = json.load(f)

    return {"text": raw_tweet["text"], "annotation": annotation}
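
`parse_tweet` raises an error if either JSON file is missing from a thread folder. A more defensive loader, assuming the same folder layout as the path helpers above, might look like the following sketch. The function name `load_thread_files` is hypothetical.

```python
import json
from pathlib import Path

def load_thread_files(thread_dir):
    """Return (raw_annotation, raw_tweet) dicts, or None if a file is missing."""
    thread_dir = Path(thread_dir)
    tweet_id = thread_dir.name
    annotation_path = thread_dir / "annotation.json"
    tweet_path = thread_dir / "source-tweets" / f"{tweet_id}.json"
    if not (annotation_path.is_file() and tweet_path.is_file()):
        return None
    with open(annotation_path, encoding="utf-8") as f:
        raw_annotation = json.load(f)
    with open(tweet_path, encoding="utf-8") as f:
        raw_tweet = json.load(f)
    return raw_annotation, raw_tweet
```

Threads for which this returns `None` can then be skipped instead of aborting the whole run.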

In [169]:
tweets = [parse_tweet(tweet_folder.name) for tweet_folder in tweet_folders if tweet_folder.exists()]

In [171]:
import pandas as pd

tweets_df = pd.DataFrame(tweets)
tweets_df.dropna(inplace=True)
tweets_df.describe()

| | text | annotation |
|---|---|---|
| count | 1705 | 1705 |
| unique | 1699 | 2 |
| top | Sydney cafe siege: Two gunmen and up to a doze... | true |
| freq | 2 | 1067 |
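
The summary shows two remaining labels, with "true" on 1067 of the 1705 rows, so the other 638 rows are "false". A quick way to see the class balance is to compute label fractions. The helper below is a minimal sketch using only the standard library, and the name `label_distribution` is hypothetical.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Calling `label_distribution(tweets_df["annotation"])` would give roughly 0.626 for "true", which is the accuracy a classifier would reach by always predicting "true".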


We now have the text and a "true" or "false" label for each of the 1705 remaining tweets. Let's write them to a CSV file.

In [173]:
tweets_df.to_csv("tweets.csv", index=False)
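
Tweet text can contain commas and line breaks, so it is worth reading the file back with a real CSV parser rather than splitting lines by hand. The sketch below assumes the file has the `text` and `annotation` header columns written above. The function name `read_labelled_tweets` is hypothetical.

```python
import csv

def read_labelled_tweets(path):
    """Read (text, annotation) pairs from a CSV file with those two columns."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["text"], row["annotation"]) for row in csv.DictReader(f)]
```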