Dataset Summary

iDRAMA-Scored-2024 is a large-scale dataset containing approximately 57 million social media posts from web communities on social media platform, Scored. Scored serves as an alternative to Reddit, hosting banned fringe communities, for example, c/TheDonald, a prominent right-wing community, and c/GreatAwakening, a conspiratorial community. This dataset contains 57M posts from over 950 communities collected over four years, and includes sentence embeddings for all posts.

Scored platform: Scored
Link to paper: Here

License: CC BY-NC-SA 4.0

Repo-links	Purpose
Zenodo	From Zenodo, researchers can download `lite` version of this dataset, which includes only 57M posts from Scored (not the sentence embeddings).
Github	The main repository of this dataset, where we provide code-snippets to get started with this dataset.
Huggingface	On Huggingface, we provide complete dataset with senetence embeddings.

Getting Started

Loading dataset from Huggingface: huggingface_loader.ipynb
Loading dataset from Zenodo: zenodo_loader.ipynb
Explore dataset & attributes: explore_dataset.ipynb

Dataset Info

Dataset is organized by yealry-comments and submissions -- comments-2020, comments-2021, comments-2022, comments-2023, submissions-2020-t0-2023.

Config	Data-points
comments-2020	12,774,203
comments-2021	16,097,941
comments-2022	12,730,301
comments-2023	8,919,159
submissions-2020-to-2023	6,293,980

Top-15 communities in our dataset with total number of posts are shown as following:

Community	Number of posts
c/TheDonald	41,745,699
c/GreatAwakening	6,161,369
c/IP2Always	3,154,741
c/ConsumeProduct	2,263,060
c/KotakuInAction2	747,215
c/Conspiracies	539,164
c/Funny	371,081
c/NoNewNormal	322,300
c/OmegaCanada	249,316
c/Gaming	181,469
c/MGTOW	175,853
c/Christianity	124,866
c/Shithole	98,720
c/WSBets	66,358
c/AskWin	39,308

Submission data fields are as following:

- `uuid`: Unique identifier associated with each sub- mission (uuid).
- `created`: UTC timestamp of the submission posted to Scored platform.
- `date`: Date of the submission, converted from UTC timestamp while data curation.
- `author`: User of the submission. (Note -- We hash the userames for ethical considerations.)
- `community`: Name of the community in which the submission is posted to.
- `title`: Title of the submission.
- `raw_content`: Body of the submission.
- `embedding`: Generated embedding by combining "title" and "raw_content," with 768 dimensional vector with fp32-bit.

- `link`: URL if the submission is a link.
- `type`: Indicates whether the submission is text or a link.
- `domain`: Base domain if the submission is a link.
- `tweet_id`: Associated tweet id if the submission is a Twitter link.
- `video_link`: Associated video link if the submission is a video.

- `score`: Metric about the score of sample submission.
- `score_up`: Metric about the up-votes casted to sample submission.
- `score_down`: Metric about the down-votes casted to sample submission.

- `is_moderator`: Whether the submission is created by moderator or not.
- `is_nsfw`: True, if the submission is flagged not safe for work.
- `is_admin`: Boolean flag about whether the submission is posted by admin.
- `is_image`: Boolean flag if the submission is image type of media.
- `is_video`: Boolean flag if the submission is type of video.
- `is_twitter`: Boolean flag if the submission is a twitter (now, named as X) link.
- `is_deleted`: Whether the submission was deleted as a moderation measure or not. If yes, the "title" and "raw_content" could be empty string.

- `post_flair_text` & `post_flair_class`: Similar to Reddit submission flairs, which is a way to tag a submission with a certain keywords.

Comments data fields are as following:

- `uuid`
- `date`
- `author`
- `community`
- `raw_content`
- `created`
- `embedding`
- `score`
- `score_up`
- `score_down`
- `is_moderator`
- `is_deleted`

Read more about the fields and methodology from the paper.

Dataset fields Nullability:

If field (column) doesn't have a value, the fields are left with an empty value.
- For instance, in the case of post deletion as a moderation measure, title of submission can have no value.
- We do not explicitly mark value as "Null" for any of the column in our dataset except embedding column.
Only, embedding column contains explicit "Null" value.

For eliminating empty records using pandas, the code looks like below:

# Load dataset for `comments-2020` config
dataset = load_dataset("iDRAMALab/iDRAMA-scored-2024", name="comments-2020")
pd_df = dataset["train"].to_pandas()

# Remove all empty records based on empty `title` column
pd_df = pdf_df[pd_df.title != ""]

# Remove all records which do not have `author` information
pd_df = pdf_df[pd_df.author != ""]

# Remove all records which do not have generated embeddings
pd_df = pdf_df[~pd_df.embedding.isna()]

Version

Maintenance Status: Active
Version Details:
- Current Version: v1.0.0
- First Release: 05/16/2024
- Last Update: 05/16/2024

Authorship

This dataset is published at "AAAI ICWSM 2024 (INTERNATIONAL AAAI CONFERENCE ON WEB AND SOCIAL MEDIA)" hosted at Buffalo, NY, USA.

Academic Organization: iDRAMA Lab
Affiliation: Binghamton University, Boston University, University of California Riverside

Licensing

This dataset is available for free to use under terms of the non-commercial license CC BY-NC-SA 4.0.

Citation

@misc{patel2024idramascored2024,
      title={iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023}, 
      author={Jay Patel and Pujan Paudel and Emiliano De Cristofaro and Gianluca Stringhini and Jeremy Blackburn},
      year={2024},
      eprint={2405.10233},
      archivePrefix={arXiv},
      primaryClass={cs.SI}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

explore_dataset.ipynb

explore_dataset.ipynb

huggingface_loader.ipynb

huggingface_loader.ipynb

zenodo_loader.ipynb

zenodo_loader.ipynb

Repository files navigation

Dataset Summary

Getting Started

Dataset Info

Dataset fields Nullability:

Version

Authorship

Licensing

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
explore_dataset.ipynb		explore_dataset.ipynb
huggingface_loader.ipynb		huggingface_loader.ipynb
zenodo_loader.ipynb		zenodo_loader.ipynb

idramalab/iDRAMA-scored-2024

Folders and files

Latest commit

History

Repository files navigation

Dataset Summary

Getting Started

Dataset Info

Dataset fields Nullability:

Version

Authorship

Licensing

Citation

About

Resources

Stars

Watchers

Forks

Languages