
EPIC30M: an epidemic corpus of over 30 million relevant tweets

License: MIT
Download: Link1

Context

Since the start of COVID-19, several relevant corpora from various sources, each containing millions of data points, have been presented in the literature. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other efforts on COVID-19 related work, we found very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks.

EPIC

Here we present EPIC, a large-scale epidemic corpus that contains over 20 million tweets, spanning the years 2006 to 2020. The corpus has two subsets, namely general and outbreaks. The general set contains 3 epidemics, namely:

The outbreak set contains 6 epidemic outbreaks, as follows:

Each class of data is stored in a single CSV file named after the respective event. Each line of a file contains comma-separated fields, though not every field has a value on every line. The fields are as follows:

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| date | datetime | The date and time (in UTC) that the tweet was posted. | 4/2/09 17:06 |
| username | string | Unique username of the user account that posted the tweet. | douance_quebec |
| to | string | The username of the Twitter account that the tweet was addressed to. | CedricFontaine |
| replies | integer | The number of replies that the tweet has. A reply is a response to another person's Tweet. You can reply by clicking or tapping the reply icon from a Tweet. | 3 |
| retweets | integer | The number of retweets that the tweet has. A Tweet that a user shares publicly with his/her followers is known as a Retweet, which is a conventional way to pass along news and interesting discoveries on Twitter. | 3 |
| favorites | integer | The number of favorites the tweet has received. Favorites are represented by a small heart and are used to show appreciation for a Tweet. | 3 |
| text | string | The content of the tweet, which contains up to 280 characters. | H1N1 + H1N5 = Trouble... |
| mentions | string | The accounts mentioned in the tweet, each given as a Twitter username preceded by the "@" symbol. A mention is a Tweet that contains another person's username anywhere in the body of the Tweet. | @cyberlou33 |
| hashtags | string | The hashtags that the tweet includes. A hashtag is formed by the symbol # followed by a relevant keyword or phrase. People use hashtags in their Tweets to categorize them and help them show up more easily in Twitter search. | #panflu |
| id | string | A unique identifier of the tweet. | 1136281607 |
| permalink | string | The unique URL of the tweet. Whenever you view a Tweet's permanent link, you can see the exact time and date the Tweet was posted and the number of likes and Retweets the Tweet received. | https://twitter.com/douance_quebec/status/1096080744 |
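
As an illustration of how the fields listed above might be read, the following is a minimal Python sketch, not an official loader for this corpus. It rests on assumptions that are not stated in this README: that each CSV file starts with a header row using the field names above, and that dates follow a month/day/two-digit-year pattern as the example `4/2/09 17:06` suggests. The helper name `parse_rows` and the in-memory sample are hypothetical. Empty values are mapped to `None`, since not every field has a value on every line.

```python
import csv
import io
from datetime import datetime, timezone

# Count fields from the field list; assumed to match the CSV header row.
INT_FIELDS = ("replies", "retweets", "favorites")
# Assumption: month/day/2-digit year, inferred from the example "4/2/09 17:06".
DATE_FORMAT = "%m/%d/%y %H:%M"


def parse_rows(handle):
    """Yield one dict per tweet, with counts as int and dates as UTC datetimes."""
    for row in csv.DictReader(handle):
        for name in INT_FIELDS:
            value = (row.get(name) or "").strip()
            row[name] = int(value) if value else None  # empty field -> None
        raw_date = (row.get("date") or "").strip()
        row["date"] = (
            datetime.strptime(raw_date, DATE_FORMAT).replace(tzinfo=timezone.utc)
            if raw_date
            else None
        )
        yield row


# Hypothetical in-memory sample built from the example values in the field list.
SAMPLE = (
    "date,username,to,replies,retweets,favorites,text,mentions,hashtags,id,permalink\n"
    "4/2/09 17:06,douance_quebec,CedricFontaine,3,3,3,H1N1 + H1N5 = Trouble...,"
    "@cyberlou33,#panflu,1136281607,https://twitter.com/douance_quebec/status/1096080744\n"
    "4/2/09 17:10,douance_quebec,,,,,no counts here,,,1,\n"
)

rows = list(parse_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row["date"], row["username"], row["retweets"], row["hashtags"])
```

To read a downloaded file instead of the sample, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, where `path` points to one of the event CSV files. If the files turn out to have no header row, `csv.DictReader` accepts the field names through its `fieldnames` argument.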

Usage

This data is intended only to support academic research purposes and may not be used for any commercial purposes, by any commercial entity, or by any party, unless otherwise authorized by the authors.

If your publication uses the data, either in full or in part, you should cite the paper below:

@article{liu2020epic,
  title={EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets},
  author={Liu, Junhua and Singhal, Trisha and Blessing, Lucienne T.M. and Wood, Kristin L. and Lim, Kwan Hui},
  journal={arXiv preprint arXiv:2006.08369},
  year={2020}
}
