This repository holds the code to process a subreddit dataset.
The repository contains many files and scripts, but you only need to look at a few of them to get to know the data:
- unpopularopinion_comments.10000.jsonl: the first 10,000 lines of the cleaned comments file of the unpopularopinion subreddit. The full cleaned comments file (47,519,950 lines) can be found on Google Drive.
- unpopularopinion_submissions.10000.jsonl: the first 10,000 lines of the cleaned submissions file of the unpopularopinion subreddit. The full cleaned submissions file (2,394,871 lines) can be found on Google Drive.
- unpopularopinion_user_summary.tsv: per-user #posts, #comments, and #comments_on_unique_posts.
The chinesefood folder has a similar but more complete file structure, because that subreddit is much smaller and its data files fit on GitHub.
Here are explanations of the keys in the comments file and in the submissions file, courtesy of ChatGPT :)
Here's the full file structure on my computer. The GitHub version contains all the Python scripts, but not all the data files, due to size limits.
.
├── chinesefood
│ ├── chinesefood_comments # raw comments data
│ ├── chinesefood_submissions # raw submission data
│ ├── chinesefood_comments.jsonl # processed comments (json line)
│ ├── chinesefood_submissions.jsonl # processed submissions (json line)
│ ├── chinesefood_comments.db # processed comments in database
│ ├── chinesefood_submissions.db # processed submissions in database
│ ├── chinesefood_user_comment_count.tsv # #comments and #comments_on_unique_posts per user
│ ├── chinesefood_user_post_count.tsv # #posts per user
│ └── chinesefood_user_summary.tsv # Summary of user activity (comments & posts)
├── unpopularopinion
│ ├── unpopularopinion_comments # raw comments data
│ ├── unpopularopinion_submissions # raw submission data
│ ├── unpopularopinion_comments.jsonl # processed comments (json line)
│ ├── unpopularopinion_submissions.jsonl # processed submissions (json line)
│ ├── unpopularopinion_comments.db # processed comments in database
│ ├── unpopularopinion_submissions.db # processed submissions in database
│ ├── unpopularopinion_user_comment_count.tsv # #comments and #comments_on_unique_posts per user
│ ├── unpopularopinion_user_post_count.tsv # #posts per user
│ ├── unpopularopinion_user_summary.tsv # Summary of user activity (comments & posts)
│ ├── unpopularopinion_comments.10000.jsonl # Sample of 10,000 comments
│ └── unpopularopinion_submissions.10000.jsonl # Sample of 10,000 submissions
├── comment-filter-fields-chunk.py # Script for filtering necessary fields in comments (chunked processing)
├── comment-filter-fields.py # Script for filtering necessary fields in comments (full processing)
├── comments-db.py # Script to store comments in a database for efficient querying
├── submission-filter-chunk.py # Script for filtering necessary fields in submissions (chunked processing)
├── submission-filter-fields.py # Script for filtering necessary fields in submissions (full processing)
├── submissions-db.py # Script to store submissions in a database for efficient querying
├── count-comment.py # Script to count comments for each user
├── count-submission.py # Script to count posts for each user
├── count-summary.py # Script to generate a user activity summary from counts
├── user-summary-db.py # Script to calculate user activity summary from the database (could be very slow)
├── user-summary.py # Script to generate user activity summary from json line files (could face memory capability problems)
├── reddit-1614740ac8c94505e4ecb9d88be8bed7b6afddd4.torrent # Torrent file for downloading Reddit dataset
└── readme.md
The subreddit dataset comes from https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/. The torrent is included in the repository as ./reddit-1614740ac8c94505e4ecb9d88be8bed7b6afddd4.torrent; a torrent client is needed to download the files it points to.
Downloading a selected <theme> from the torrent yields <theme>_comments.zst and <theme>_submissions.zst. Decompressing the .zst files gives <theme>_comments (the comments on the subreddit posts) and <theme>_submissions (the subreddit posts). Both files consist of JSON lines: each line in a file is one JSON object, i.e., one comment or one post, respectively.
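For reference, here is a minimal sketch of how such a .zst dump can be streamed line by line in Python without first decompressing it to disk, using the zstandard package. The `max_window_size` override is needed because the Pushshift dumps are compressed with a long window; the example file name just follows the <theme> convention above.

```python
import io
import json

import zstandard  # pip install zstandard


def iter_jsonl_zst(path):
    """Yield one parsed JSON object per line of a .zst-compressed JSON-lines file."""
    with open(path, "rb") as fh:
        # Pushshift dumps use a long compression window, so the default
        # window-size limit must be raised for decompression to succeed.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)


# Example: inspect the keys of the first comment.
# for obj in iter_jsonl_zst("chinesefood_comments.zst"):
#     print(sorted(obj.keys()))
#     break
```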
There are several challenges in processing the comments and the submissions files.
- Size. unpopularopinion_comments takes up 64G, while unpopularopinion_submissions takes up 5.4G.
- Non-uniform format. The JSON lines in a comments file generally share the same keys, as verified by asking ChatGPT to compare the keys of several lines. This is not the case for the submissions files, where the JSON lines do not share the same keys.
- Redundant keys. The raw comments and submissions files have far too many keys to handle, around 100 per JSON object, and not all of them are needed for our project. We used ChatGPT to choose the necessary fields from randomly selected lines of the comments and submissions files.
We used the Python scripts comment-filter-fields-chunk.py and submission-filter-chunk.py to select the necessary fields from the raw data, and wrote the results to <theme>/<theme>_comments.jsonl and <theme>/<theme>_submissions.jsonl.
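A minimal sketch of what this chunked field filtering looks like; the field list below is illustrative, not necessarily the exact set the scripts keep.

```python
import json

# Illustrative field list -- the actual scripts may keep a different set,
# chosen with ChatGPT's help from sampled lines.
KEEP = {"id", "author", "body", "score", "created_utc", "link_id"}
CHUNK = 100_000  # lines buffered between writes, to bound memory use


def filter_fields(src, dst, keep=KEEP, chunk=CHUNK):
    """Copy a JSON-lines file, keeping only the fields in `keep`."""
    buf = []
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            obj = json.loads(line)
            buf.append(json.dumps({k: obj[k] for k in keep if k in obj}))
            if len(buf) >= chunk:
                fout.write("\n".join(buf) + "\n")
                buf.clear()
        if buf:
            fout.write("\n".join(buf) + "\n")
```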
After this processing, unpopularopinion_comments.jsonl and unpopularopinion_submissions.jsonl shrank to 14G and 1.9G, respectively. These files are much smaller and have a concise, uniform set of keys.
From now on, we work on <theme>/<theme>_comments.jsonl and <theme>/<theme>_submissions.jsonl, rather than the raw data.
We can count #posts per user from <theme>/<theme>_submissions.jsonl and compute #comments and #comments_on_unique_posts from <theme>/<theme>_comments.jsonl, using count-submission.py and count-comment.py, respectively. This produces <theme>/<theme>_user_comment_count.tsv and <theme>/<theme>_user_post_count.tsv.
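The counting itself is streaming aggregation; a sketch of the comment side, assuming the filtered jsonl keeps author and link_id (the id of the post a comment belongs to):

```python
import json
from collections import defaultdict


def count_comments(path, out_tsv):
    """Per author: total comments and number of distinct posts commented on."""
    n_comments = defaultdict(int)
    posts_seen = defaultdict(set)  # author -> set of post ids
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            obj = json.loads(line)
            author = obj.get("author")
            n_comments[author] += 1
            # link_id identifies the post (assumed kept by the filtering step).
            posts_seen[author].add(obj.get("link_id"))
    with open(out_tsv, "w", encoding="utf-8") as fout:
        fout.write("author\tn_comments\tn_comments_on_unique_posts\n")
        for author, n in n_comments.items():
            fout.write(f"{author}\t{n}\t{len(posts_seen[author])}\n")
```

Note that the per-author sets still need substantial memory on the 14G comments file, which is one reason the database route below exists.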
Finally, we run count-summary.py on the two files generated above to get <theme>/<theme>_user_summary.tsv.
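The summary step is an outer join of the two count tables on author; a sketch, with the column names assumed to match the counting sketch above (n_posts, n_comments, n_comments_on_unique_posts):

```python
import csv


def merge_counts(post_tsv, comment_tsv, out_tsv):
    """Outer-join the per-user post and comment counts on author."""
    posts = {}
    with open(post_tsv, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            posts[row["author"]] = row["n_posts"]
    with open(comment_tsv, encoding="utf-8") as f, \
         open(out_tsv, "w", encoding="utf-8", newline="") as out:
        w = csv.writer(out, delimiter="\t")
        w.writerow(["author", "n_posts", "n_comments", "n_comments_on_unique_posts"])
        seen = set()
        for row in csv.DictReader(f, delimiter="\t"):
            a = row["author"]
            seen.add(a)
            w.writerow([a, posts.get(a, 0), row["n_comments"],
                        row["n_comments_on_unique_posts"]])
        # Authors who posted but never commented.
        for a, n in posts.items():
            if a not in seen:
                w.writerow([a, n, 0, 0])
```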
As unpopularopinion_comments.jsonl (14G) and unpopularopinion_submissions.jsonl (1.9G) are still too big to operate on directly in memory, it's a good idea to put them into a database for quick retrieval. Running comments-db.py and submissions-db.py produces the database versions of the comments and submissions (unpopularopinion_toplevel_comments.db and unpopularopinion_submissions.db), which is a much more workable setup when the top-level comments and submissions files are too big to hold in memory.
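A sketch of the SQLite loading step, with an index on author so per-user lookups don't scan the whole table; the table schema here is illustrative, not necessarily what comments-db.py creates.

```python
import json
import sqlite3


def load_comments_db(jsonl_path, db_path, batch=50_000):
    """Bulk-load a filtered comments jsonl file into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, author TEXT, link_id TEXT, body TEXT, "
        "score INTEGER, created_utc INTEGER)"
    )
    rows = []
    with open(jsonl_path, encoding="utf-8") as fin:
        for line in fin:
            o = json.loads(line)
            rows.append((o.get("id"), o.get("author"), o.get("link_id"),
                         o.get("body"), o.get("score"), o.get("created_utc")))
            if len(rows) >= batch:  # insert in batches to bound memory use
                con.executemany("INSERT OR IGNORE INTO comments VALUES (?,?,?,?,?,?)", rows)
                con.commit()
                rows.clear()
    if rows:
        con.executemany("INSERT OR IGNORE INTO comments VALUES (?,?,?,?,?,?)", rows)
    # Index author after the bulk load, so per-user queries are fast.
    con.execute("CREATE INDEX IF NOT EXISTS idx_comments_author ON comments(author)")
    con.commit()
    con.close()
```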
To build the Docker image:
docker build -t subreddits-app .
The data files are not included in the repository due to size. You can download them from Google Drive and place the unpopularopinion/ folder in the root of the project.
You will have to download the following files:
- already_processed.csv
- authors_with_at_least_100_distinct_comments.csv
- bots.csv
- posts_with_at_least_one_comment_from_selected_users.csv
- unpopularopinion_toplevel_comments.db
- unpopularopinion_submissions.db
To run the preprocessing:
docker run --rm \
-v $(pwd)/unpopularopinion:/unpopularopinion \
subreddits-app
The -v flag mounts your local unpopularopinion/ folder into the container at /unpopularopinion, so the data is available to the preprocessing.
The preprocessing performs the following steps (a sketch follows this list):
- Keep only comments from validated authors.
- For each comment:
  - find the post it belongs to;
  - find the commenter;
  - compute the commenter's base emotion by averaging the emotion scores of the comments they made on neutral posts;
  - comment emotion = (user base emotion, post title emotion, post body emotion).
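A sketch of this last step. Here `score_emotion` is a placeholder for whatever emotion model is actually used, and how neutral posts are identified is not specified above, so `neutral_ids` is assumed to be given.

```python
from collections import defaultdict


def score_emotion(text):
    """Placeholder for the actual emotion model (e.g. a classifier score)."""
    raise NotImplementedError


def base_emotions(comments, neutral_ids):
    """Average each commenter's emotion scores over their comments on neutral posts.

    comments:    iterable of dicts with author, link_id, body
    neutral_ids: set of post ids judged emotionally neutral (assumed given)
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for c in comments:
        if c["link_id"] in neutral_ids:
            totals[c["author"]] += score_emotion(c["body"])
            counts[c["author"]] += 1
    return {a: totals[a] / counts[a] for a in totals}


def comment_features(comment, posts, base):
    """Feature triple: (user base emotion, post title emotion, post body emotion).

    posts: dict post_id -> post dict; selftext is Reddit's field for the post body.
    """
    post = posts[comment["link_id"]]
    return (base.get(comment["author"]),
            score_emotion(post.get("title", "")),
            score_emotion(post.get("selftext", "")))
```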