# Accessing the CREDBANK Dataset

The CREDBANK dataset is a large collection of tweets relating to events that occurred between mid-October 2014 and the end of February 2015, which were assessed for veracity by crowdworkers on Mechanical Turk (Turkers). The data are publicly available [here](https://github.com/compsocial/CREDBANK-data#readme), along with a detailed description of the data.

#### The dataset was developed through 4 steps, each with a corresponding data file available for download:
1. ##### Streaming Tweet File - stream_tweets_byTimestamp.data:
    169 million tweets collected from within the aforementioned time interval. 
2. ##### Topic File - eventNonEvent_annotations.data: 
    62,000 tweet topics, which were generated from the above raw tweet stream using automated topic modeling (LDA). Each topic is characterized by a list of 3 topic terms. Each topic is also rated by Turkers as being an event (class 1) or non-event (class 0).
3. ##### Credibility Annotation File - cred_event_TurkRatings.data: 
    1,378 tweet topics that were categorized as events through the process described in #2, above. Each topic was evaluated by 10 Turkers and counted as an event if a majority (6/10) assigned it a value of 1. Each of these event topics was then evaluated for credibility by 30 Turkers, each scoring it on a scale ranging from -2 (least credible) to +2 (most credible).
4. ##### Searched Tweet File - cred_event_SearchTweets.data: 
    80 million tweets grouped by the 1,378 event topics from #3, above. Tweets corresponding to each event topic were extracted using the 3 topic terms to form an 'AND' query.
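
To make the 'AND' query in step 4 concrete, here is a minimal sketch of the matching logic. It assumes case-insensitive substring matching on tweet text; the actual CREDBANK collection queried Twitter's search service, whose matching rules may differ. The function name `matches_and_query` and the example tweets are hypothetical.

```python
def matches_and_query(tweet_text, topic_terms):
    """Return True only if every topic term appears in the tweet text.

    Assumption: case-insensitive substring matching, for illustration only.
    """
    text = tweet_text.lower()
    return all(term.lower() in text for term in topic_terms)


# Hypothetical tweets, invented for illustration only.
tweets = [
    "Ebola briefing at the White House this October",
    "October weather update",
    "House votes on Ebola funding in October",
]
topic_terms = ["october", "ebola", "house"]

# Keep only tweets that contain all three topic terms.
matched = [t for t in tweets if matches_and_query(t, topic_terms)]
print(matched)
```

A tweet that contains only one or two of the three terms is excluded, which is why each event topic maps to a narrower set of tweets than any single term would.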

For our analysis we use only files #3 and #4, producing a dataset of roughly 13 million individual tweets that are grouped into event topics, each rated for credibility by 30 Turkers. Every tweet inherits the credibility score of its event topic. We roughly follow the data preparation process described by Buntain & Golbeck (2017), in which the aggregate credibility score is the mean of the 30 individual Turker scores. We then use only the top and bottom deciles of scores, to help ensure that the event topics we include have greater Turker consensus and are therefore more likely to truly belong to the negative or positive class. Finally, because we faced time and computational constraints, we take further measures to select a sample of these data for modeling. This process is outlined in the following annotated Python code.

In [1]:
# import libraries

import pandas as pd, numpy as np
import re

from ast import literal_eval

In [2]:
# read the tab-separated file containing all tweets in the dataset, grouped into topics.
# note: this file is 6.12 GB, too large to store in the GitHub repo.
# to import, download the data as described above
# and save 'cred_event_SearchTweets.data' into your local datasets folder.

data = pd.read_csv('../datasets/cred_event_SearchTweets.data', sep='\t')

In [3]:
# first look at data
# each row indicates an event topic
# last column contains all tweets pertaining to respective event
# actual tweet text not provided and must be accessed via Twitter's API
# tweet_id, user_id, and create_time provided

print(data.shape)
data.head()

(1377, 4)


Unnamed: 0,topic_key,topic_terms,tweet_count,ListOf_tweetid_author_createdAt_tuple
0,host_patrick_neil-20141015_161647-20141015_172214,"host,patrick,neil",34694,"[('ID=522759240817971202', 'AUTHOR=i_Celeb_Gos..."
1,royals_game_series-20141015_203526-20141015_21...,"royals,game,series",22111,"[('ID=522782817538043906', 'AUTHOR=topOrioles'..."
2,giants_game_win-20141015_230140-20141016_000502,"giants,game,win",9990,"[('ID=522861714363015169', 'AUTHOR=GiterDoneSp..."
3,october_ebola_house-20141015_230140-20141016_0...,"october,ebola,house",147,"[('ID=522863310086373378', 'AUTHOR=lumworld', ..."
4,ebola_white_health-20141016_012559-20141016_03...,"ebola,white,health",2956,"[('ID=522797534054711298', 'AUTHOR=HJIrwin', '..."


In [92]:
# over 80 million total tweets
data['tweet_count'].sum()

80277783

In [5]:
# read the tab-separated file containing credibility scores for each topic
# each topic is given a credibility rating by 30 evaluators, with each score provided in Cred_Ratings
# each evaluator also gave a brief note justifying their rating, found in the Reasons column.

# code modified from 
# https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

df_topic_cred = pd.read_csv('../datasets/cred_event_TurkRatings.data',
                            sep='\t',
                            converters={'Cred_Ratings':literal_eval})

print(df_topic_cred.shape)
df_topic_cred.head()

(1378, 4)


Unnamed: 0,topic_key,topic_terms,Cred_Ratings,Reasons
0,everything_royals_rain-20141015_161647-2014101...,"[u'everything', u'royals', u'rain']","[2, 1, 0, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 1, ...","['Game suspended due to rain', 'It is true tha..."
1,host_patrick_neil-20141015_161647-20141015_172214,"[u'host', u'patrick', u'neil']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...",['Neil Patrick Harris will host the 2015 Oscar...
2,royals_game_series-20141015_203526-20141015_21...,"[u'royals', u'game', u'series']","[2, 1, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 1, 2, ...",['The Royals did win last night and are indeed...
3,giants_game_win-20141015_230140-20141016_000502,"[u'giants', u'game', u'win']","[2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, ...","['They did win the game.', 'Major news sources..."
4,october_ebola_house-20141015_230140-20141016_0...,"[u'october', u'ebola', u'house']","[1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 0, 2, 1, 2, 1, ...","['Accurate news story, some opinion.', 'USA To..."
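
The `converters` argument matters because the ratings are stored in the file as the text of a Python list. The sketch below shows the difference on a tiny in-memory, tab-separated table; the topic keys and ratings are made up for illustration and are not taken from CREDBANK.

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up, tab-separated stand-in for cred_event_TurkRatings.data.
tsv = (
    "topic_key\tCred_Ratings\n"
    "toy_topic_a\t[2, 1, 0]\n"
    "toy_topic_b\t[-2, 2, 1]\n"
)

# Without a converter, each cell is read as a plain string.
raw = pd.read_csv(io.StringIO(tsv), sep="\t")

# With literal_eval, each cell is parsed into a real Python list of ints.
parsed = pd.read_csv(io.StringIO(tsv), sep="\t",
                     converters={"Cred_Ratings": literal_eval})

print(type(raw.loc[0, "Cred_Ratings"]))
print(type(parsed.loc[0, "Cred_Ratings"]))
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing values read from a data file.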


In [14]:
# credibility scores are vectors of scores from 30 evaluators
# mean score calculated for each topic

# this is the aggregation method from Buntain & Golbeck, 2017
# https://arxiv.org/pdf/1705.01613.pdf

cred_mean = [np.mean([int(rating) for rating in ratings])
             for ratings in df_topic_cred['Cred_Ratings']]

In [15]:
# grand mean is ~1.7
np.mean(cred_mean)

1.7013604591028144
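
As a small worked example of this aggregation, the sketch below computes per-topic means on toy data. It uses three ratings per topic instead of 30 so the arithmetic is easy to check by hand; the values are invented and the column name `cred_score` simply mirrors the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings on the -2..+2 scale (invented; real topics have 30 ratings each).
toy = pd.DataFrame({
    "topic_key": ["toy_a", "toy_b", "toy_c"],
    "Cred_Ratings": [[2, 2, 1], [0, -1, 1], [2, 2, 2]],
})

# Mean rating per topic: (2+2+1)/3, (0-1+1)/3, (2+2+2)/3.
toy["cred_score"] = toy["Cred_Ratings"].map(lambda ratings: float(np.mean(ratings)))

# Grand mean across topics, analogous to np.mean(cred_mean) above.
grand_mean = toy["cred_score"].mean()
print(toy[["topic_key", "cred_score"]])
print(grand_mean)
```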

In [16]:
# new column for mean cred score for each topic
df_topic_cred['cred_score'] = cred_mean

df_topic_cred.sort_values('cred_score', ascending=False)

Unnamed: 0,topic_key,topic_terms,Cred_Ratings,Reasons,cred_score
1,host_patrick_neil-20141015_161647-20141015_172214,"[u'host', u'patrick', u'neil']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...",['Neil Patrick Harris will host the 2015 Oscar...,2.000000
133,news_senzo_meyiwa-20141026_162345-20141026_173013,"[u'news', u'senzo', u'meyiwa']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[""it's a Senzo Meyiwa South Africa's internat...",2.000000
356,patriots_east_afc-20141214_152337-20141214_162911,"[u'patriots', u'east', u'afc']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...",['I saw the game highlights and the division s...,2.000000
1342,bowl_pro_odell-20150125_184901-20150125_184943...,"[u'bowl', u'pro', u'odell']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...",['This event is certainly accurate.In Pro Bowl...,2.000000
717,paris_attack_news-20150107_100842-20150107_112852,"[u'paris', u'attack', u'news']","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","['Multiple sources.', 'Many reputable sources ...",2.000000
...,...,...,...,...,...
11,ebola_obama_#ebola-20141016_181953-20141016_19...,"[u'ebola', u'obama', u'#ebola']","[1, 0, 2, 2, 2, -2, -2, 2, 2, 2, 1, -2, 2, 2, ...",['obama wants to intensify action against ebol...,0.666667
999,killed_hostage_isis-20150206_144013-20150206_1...,"[u'killed', u'hostage', u'isis']","[2, 0, 2, 0, 2, -1, 2, 1, 1, -1, 1, 0, -1, 0, ...","['isis news', 'us hostage killed', 'All tweets...",0.633333
60,ebola_free_sometimes-20141020_101221-20141020_...,"[u'ebola', u'free', u'sometimes']","[2, 2, -1, 1, 0, -1, 0, 2, -2, 0, 2, -1, -2, 0...",['There are reputable news agencies online rep...,0.600000
1021,check_haha_#grammys-20150208_183650-20150208_1...,"[u'check', u'haha', u'#grammys']","[2, 0, -2, 1, 0, 2, -2, 2, 0, -1, 2, 0, -2, 2,...",['The grammys was an event that took place whi...,0.466667


In [17]:
# merging both dataframes so that cred score and raw tweets are combined
data_merged = pd.merge(left=data, right=df_topic_cred, on= 'topic_key').drop(columns=['topic_terms_y',
                                                                              'Cred_Ratings',
                                                                              'Reasons'])

In [18]:
# renaming columns for ease and clarity
data_merged.rename(columns={'topic_terms_x':'topic_terms',
                     'ListOf_tweetid_author_createdAt_tuple':'topic_tweets'},
            inplace=True)
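
Note that the tweet file has 1,377 rows while the ratings file has 1,378, and `pd.merge` performs an inner join by default, so any topic missing from one side is dropped silently. One way to inspect such mismatches, sketched here on toy frames with invented keys, is an outer merge with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for the tweet and ratings tables (invented values).
tweets = pd.DataFrame({
    "topic_key": ["a", "b"],
    "topic_terms": ["x,y,z", "p,q,r"],
    "tweet_count": [10, 20],
})
ratings = pd.DataFrame({
    "topic_key": ["a", "b", "c"],
    "topic_terms": ["[x, y, z]", "[p, q, r]", "[m, n, o]"],
    "cred_score": [2.0, 1.0, 0.5],
})

# The shared non-key column gets the default suffixes _x and _y.
merged = pd.merge(tweets, ratings, on="topic_key", how="outer", indicator=True)

# Rows flagged 'left_only' or 'right_only' have no partner in the other table.
unmatched = merged.loc[merged["_merge"] != "both", "topic_key"].tolist()
print(unmatched)
```

The same call also shows where the `topic_terms_x` and `topic_terms_y` column names come from: both tables carry a `topic_terms` column that is not part of the join key.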

In [20]:
# we need to convert cred_score to a binary class label.
# Buntain & Golbeck (2017) used credibility means in the upper and lower 15% to define positive and negative classes.
# due to limited computation power we opt for a smaller dataset, and so narrow the range.
# we set cred_score to 1 if in upper 10% of mean credibility values, and to 0 if in lower 10%.

lower = data_merged.cred_score.quantile(0.10)
upper = data_merged.cred_score.quantile(0.90)

# keep only topics strictly outside the [lower, upper] band (between() is inclusive)
data_classed = data_merged[~data_merged.cred_score.between(lower, upper)].reset_index(drop=True)

# label the two tails; this relies on the fixed 1.5 cutoff lying between lower and upper
data_classed['cred_score'] = [1 if i > 1.5 else 0 for i in data_classed.cred_score]

# this produces 122 topics in the negative class (rumour) and 102 topics in the positive class (valid).
data_classed.cred_score.value_counts()

0    122
1    102
Name: cred_score, dtype: int64
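
The tail selection can be illustrated on a handful of toy scores. This is a sketch rather than the exact procedure above: it labels topics by comparing against the computed upper quantile instead of a fixed 1.5 cutoff, and the scores are invented.

```python
import pandas as pd

# Ten invented mean credibility scores.
scores = pd.Series([0.5, 0.9, 1.2, 1.5, 1.6, 1.7, 1.8, 1.9, 1.95, 2.0],
                   name="cred_score")

low = scores.quantile(0.10)   # 10th percentile (linear interpolation)
high = scores.quantile(0.90)  # 90th percentile

# Keep only scores strictly below the lower or strictly above the upper threshold.
kept = scores[(scores < low) | (scores > high)]

# 1 for the upper tail, 0 for the lower tail.
labels = (kept > high).astype(int)
print(low, high)
print(labels.tolist())
```

With these ten values only the lowest and highest scores survive, one in each class.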

In [21]:
# looking at dataset
data_classed

Unnamed: 0,topic_key,topic_terms,tweet_count,topic_tweets,cred_score
0,host_patrick_neil-20141015_161647-20141015_172214,"host,patrick,neil",34694,"[('ID=522759240817971202', 'AUTHOR=i_Celeb_Gos...",1
1,october_ebola_house-20141015_230140-20141016_0...,"october,ebola,house",147,"[('ID=522863310086373378', 'AUTHOR=lumworld', ...",0
2,oscar_pistorius_because-20141016_044421-201410...,"oscar,pistorius,because",446,"[('ID=522827406432669696', 'AUTHOR=Honeyy_Khan...",0
3,artist_vote_year-20141016_111453-20141016_121002,"artist,vote,year",66473,"[('ID=522936371779239937', 'AUTHOR=60sDinerLou...",0
4,story_october_ebola-20141016_111453-20141016_1...,"story,october,ebola",328,"[('ID=522924216291569665', 'AUTHOR=October_14t...",0
...,...,...,...,...,...
219,costa_diego_charged-20150128_121053-20150128_1...,"costa,diego,charged",17310,"[('ID=560615461503926272', 'AUTHOR=jakieboyhah...",1
220,ebola_#ebola_travel-20141016_131147-20141016_1...,"ebola,#ebola,travel",27796,"[('ID=522978106181951488', 'AUTHOR=BINGBINGDEL...",0
221,ebola_news_over-20141016_131147-20141016_14162...,"ebola,news,over",25389,"[('ID=523047031120879616', 'AUTHOR=757LiveNG',...",0
222,god_golden_awards-20150115_101528-20150115_104...,"god,golden,awards",2372,"[('ID=556571788638179328', 'AUTHOR=MikeysMocha...",0


In [106]:
# we need to restructure the data so that individual tweets are rows and topics are attributes.
# define function to read each string in 'topic_tweets' and convert into nested list of individual tweets

def tweets_to_list(topic_tweets):
    col_list = re.findall(r"(?<=ID=)\d+|(?<=AUTHOR=)(?!ID=)\w*|(?<=CreatedAt=)(?!AUTHOR=)(?!ID=)[\d-]*\s[\d:]*",
                    topic_tweets,
                    re.MULTILINE)
    
    nested_list = []
    for i in range(0, len(col_list), 3):
        nested_list.append([col_list[i], col_list[i+1], col_list[i+2]])
    
    return nested_list
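
A quick check of the parsing step on a hand-made string is sketched below. The function is repeated so the snippet runs on its own. The sample string and its `CreatedAt=YYYY-MM-DD HH:MM:SS` layout are assumptions inferred from the regular expression and the truncated previews above, not copied from the data file.

```python
import re


def tweets_to_list(topic_tweets):
    # Same pattern as in the cell above: tweet ID, author name, creation time.
    col_list = re.findall(
        r"(?<=ID=)\d+|(?<=AUTHOR=)(?!ID=)\w*|(?<=CreatedAt=)(?!AUTHOR=)(?!ID=)[\d-]*\s[\d:]*",
        topic_tweets,
        re.MULTILINE)

    nested_list = []
    for i in range(0, len(col_list), 3):
        nested_list.append([col_list[i], col_list[i+1], col_list[i+2]])

    return nested_list


# Hypothetical input with two tweets (IDs, authors and times are invented).
sample = ("[('ID=1001', 'AUTHOR=user_one', 'CreatedAt=2014-10-15 16:20:01'), "
          "('ID=1002', 'AUTHOR=user_two', 'CreatedAt=2014-10-15 16:21:45')]")

parsed = tweets_to_list(sample)
print(parsed)
```

Because the function groups matches in threes by position, it depends on every tweet contributing exactly one ID, one author, and one timestamp; a missing field would shift all later rows.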

In [1]:
# iterate through each string in topic_tweets and export into CSVs containing tweets by topic

for i, topic in enumerate(data_classed.topic_tweets):
    
    nested_list = tweets_to_list(topic)
    
    df = pd.DataFrame(nested_list, columns=['tweet_id','user_id','create_time'])
    df['topic_key'] = data_classed['topic_key'][i]
    df['is_credible'] = data_classed['cred_score'][i]
    df.to_csv(f'../datasets/topic_tweets/topic_{i}.csv', index=False)
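
The export loop assumes that the output folder already exists. A minimal sketch of creating it first and reading one file back is shown below. A temporary directory stands in for `../datasets/topic_tweets`, and the toy rows are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumption: a temporary directory stands in for '../datasets/topic_tweets'.
out_dir = Path(tempfile.mkdtemp()) / "topic_tweets"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Toy per-topic table with the same columns as the export above.
toy = pd.DataFrame(
    [["1001", "user_one", "2014-10-15 16:20:01"],
     ["1002", "user_two", "2014-10-15 16:21:45"]],
    columns=["tweet_id", "user_id", "create_time"],
)
toy["topic_key"] = "toy_topic"
toy["is_credible"] = 1
toy.to_csv(out_dir / "topic_0.csv", index=False)

# Read the IDs back as strings so they match how they were written.
reloaded = pd.read_csv(out_dir / "topic_0.csv", dtype={"tweet_id": str})
print(reloaded.shape)
```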