*Ignore this first part; it's just housekeeping to get the rest of the example below to load properly.*

In [1]:
cd ../src

/Users/nicovandenhooff/Dropbox/GITHUB/reddit-data-collector/src


In [2]:
import json

with open("../tests/credentials.json") as f:
    login = json.load(f)
    
    client_id = login["client_id"]
    client_secret = login["client_secret"]
    user_agent = login["user_agent"]
    username = login["username"]
    password = login["password"]
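
For reference, the credentials file read above is assumed to be a flat JSON object containing the five keys used in the cell. The following is a minimal sketch of a loader that fails early when a key is missing. The helper name `load_credentials` is hypothetical and not part of the package.

```python
import json

# Keys read from credentials.json in the cell above.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def load_credentials(path):
    """Load Reddit API credentials and check that every expected key is present.

    Hypothetical helper for illustration; not part of reddit_data_collector.
    """
    with open(path) as f:
        login = json.load(f)

    missing = [key for key in REQUIRED_KEYS if key not in login]
    if missing:
        raise KeyError(f"credentials file is missing keys: {missing}")

    # Return only the expected keys, in a fixed order.
    return {key: login[key] for key in REQUIRED_KEYS}
```

The returned `dict` could then be unpacked directly into the constructor call shown in step 1, since its keys match the keyword arguments used there.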

# Example Use of the Reddit Data Collector Package

Steps:
1. Create `DataCollector` object
2. Obtain some post and comment data from Reddit
3. Convert post and comment data to `pandas` `DataFrame`
4. Save post and comment data as `.csv` files via `pandas`
5. Collect some more post and comment data from Reddit
6. Add new post and comment data to existing `.csv` files with a convenience function in the Reddit Data Collector package

### Step 1: Create `DataCollector` object

In [3]:
import reddit_data_collector as rdc

In [4]:
data_collector = rdc.DataCollector(
    client_id=client_id, 
    client_secret=client_secret,
    user_agent=user_agent,
    username=username,
    password=password
    )

### Step 2: Obtain some post and comment data from Reddit

In this section we:
- Obtain posts from the subreddits **r/pics** and **r/funny**
- Use the `hot` post category
- Limit the number of posts returned per subreddit to `10` (the max is 1,000 per the Reddit API)
- Set `comment_data` and `replies_data` to `True`, which means we also obtain all the comments on the posts above, and the replies to each comment (subject to the point below)
- Set `replace_more_limit` to `0`, which means that any comments returned as `MoreComments` instances by `PRAW` (the Reddit API Python wrapper package that this package is built on) are **removed**
    - A `MoreComments` instance represents a place where a thread in Reddit says “load more comments” or “continue this thread”.
    - Setting this to any integer greater than `0` means we replace `MoreComments` instances with PRAW `Comment` instances (i.e. more valid comments from the Reddit post) until either this value is met or we have replaced all of them.
    - To ensure all `MoreComments` instances are replaced with valid comments, set `replace_more_limit` to `None`.
    - Note that every `MoreComments` instance we replace costs 1 API call! Setting this to a high integer value, or especially to `None`, can significantly slow down the script. Consider whether reply data is even valuable for your purpose; often the replies in comments are trolls talking to one another.
    - See the [PRAW Documentation](https://praw.readthedocs.io/en/stable/tutorials/comments.html) for full details on `MoreComments`
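
The limit semantics described above can be illustrated with a small toy model. This is a sketch only: it does not call PRAW or reproduce its internals (for instance, the order in which PRAW expands placeholders may differ), and the names `MorePlaceholder` and `replace_more` below are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MorePlaceholder:
    """Toy stand-in for a "load more comments" stub that hides some comments."""
    hidden: list = field(default_factory=list)


def replace_more(items, limit=0):
    """Toy model of the limit semantics; returns (comments, api_calls).

    limit=0    -> drop every placeholder, no extra API calls
    limit=n>0  -> expand at most n placeholders, drop the rest
    limit=None -> expand every placeholder
    """
    comments, api_calls = [], 0
    queue = list(items)
    while queue:
        item = queue.pop(0)
        if not isinstance(item, MorePlaceholder):
            comments.append(item)
        elif limit is None or api_calls < limit:
            api_calls += 1  # one request per expanded placeholder
            # Expanded content may itself contain further placeholders.
            queue = list(item.hidden) + queue
        # Otherwise the limit has been reached and the placeholder is removed.
    return comments, api_calls


thread = [
    "c1",
    MorePlaceholder(["c2", MorePlaceholder(["c3"])]),
    "c4",
    MorePlaceholder(["c5"]),
]

print(replace_more(thread, limit=0))     # (['c1', 'c4'], 0)
print(replace_more(thread, limit=1))     # (['c1', 'c2', 'c4'], 1)
print(replace_more(thread, limit=None))  # (['c1', 'c2', 'c3', 'c4', 'c5'], 3)
```

In this toy thread, `limit=None` recovers every comment at the cost of three requests, while `limit=0` costs nothing but silently drops three comments.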

In [5]:
subreddits = ["pics", "funny"]
post_filter = "hot"
post_limit = 10
top_post_filter = None
comment_data = True
replies_data = True
replace_more_limit = 0

posts, comments = data_collector.get_data(
    subreddits,
    post_filter,
    post_limit,
    top_post_filter,
    comment_data,
    replies_data,
    replace_more_limit
)

Collecting hot r/pics posts: 100%|██████████████| 10/10 [00:00<00:00, 15.40it/s]
Collecting hot r/funny posts: 100%|█████████████| 10/10 [00:00<00:00, 17.92it/s]
Collecting comments for 10 r/pics posts: 100%|██| 10/10 [00:21<00:00,  2.15s/it]
Collecting comments for 10 r/funny posts: 100%|█| 10/10 [00:11<00:00,  1.17s/it]


### Step 3: Convert post and comment data to `pandas` `DataFrame`

In this section we:
- Create one `DataFrame` that contains all the subreddit posts for **r/pics** and **r/funny** and one `DataFrame` that contains all the comments for **r/pics** and **r/funny**

*Alternative method: Convert to separate `DataFrame`s for each subreddit*
- If desired, the `to_pandas` function takes an argument `seperate`
- If we set `seperate=True`, the function returns a `dict` of `DataFrame` objects, one per subreddit
- For example:
    - Had we run `dfs = rdc.to_pandas(posts, seperate=True)`, `dfs` would have been a `dict` containing 2 `DataFrame`s: one with the posts for the subreddit **r/pics** and one with the posts for the subreddit **r/funny**
    - The items in the dictionary would be `{"subreddit_name": DataFrame of posts}`
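
To give a feel for the shape of that per-subreddit result, here is a rough sketch that builds a similar `dict` with a plain `pandas` `groupby`. It illustrates the idea only and is not the package's implementation; the sample rows are made up.

```python
import pandas as pd

# Made-up rows standing in for collected posts.
sample_posts_df = pd.DataFrame(
    {
        "subreddit_name": ["pics", "funny", "pics"],
        "id": ["a1", "b1", "a2"],
        "score": [10, 25, 7],
    }
)

# One DataFrame per subreddit, keyed by subreddit name.
sample_dfs = {
    name: group.reset_index(drop=True)
    for name, group in sample_posts_df.groupby("subreddit_name")
}

print(sorted(sample_dfs))  # ['funny', 'pics']
print(sample_dfs["pics"])
```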

In [6]:
posts_df = rdc.to_pandas(posts)
comments_df = rdc.to_pandas(comments)

Here we can see that our `DataCollector` collected 10 posts for each subreddit for a total of 20 posts.

In [7]:
posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   subreddit_name       20 non-null     object 
 1   post_created_utc     20 non-null     float64
 2   id                   20 non-null     object 
 3   is_original_content  20 non-null     bool   
 4   is_self              20 non-null     bool   
 5   link_flair_text      3 non-null      object 
 6   locked               20 non-null     bool   
 7   num_comments         20 non-null     int64  
 8   over_18              20 non-null     bool   
 9   score                20 non-null     int64  
 10  spoiler              20 non-null     bool   
 11  stickied             20 non-null     bool   
 12  title                20 non-null     object 
 13  upvote_ratio         20 non-null     float64
 14  url                  20 non-null     object 
dtypes: bool(6), float64(2), int64(2), object(5)

Here we can see that our `DataCollector` collected 4,580 comments from both subreddits in total.

In [8]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4580 entries, 0 to 4579
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   subreddit_name       4580 non-null   object 
 1   id                   4580 non-null   object 
 2   post_id              4580 non-null   object 
 3   parent_id            4580 non-null   object 
 4   top_level_comment    4580 non-null   bool   
 5   body                 4580 non-null   object 
 6   comment_created_utc  4580 non-null   float64
 7   is_submitter         4580 non-null   bool   
 8   score                4580 non-null   int64  
 9   stickied             4580 non-null   bool   
dtypes: bool(3), float64(1), int64(1), object(5)
memory usage: 264.0+ KB
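
The `post_created_utc` and `comment_created_utc` columns are stored as `float64`. Assuming they hold Unix epoch seconds in UTC, which is how the Reddit API reports `created_utc`, a single value can be turned into a readable timestamp as sketched below. The helper name and the sample value are made up.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds (UTC) to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = epoch_to_utc(1640995200.0)  # made-up sample value
print(created.isoformat())  # 2022-01-01T00:00:00+00:00
```

For a whole column, `pd.to_datetime(posts_df["post_created_utc"], unit="s", utc=True)` performs the same conversion under the same assumption.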


### Step 4: Save post and comment data as `.csv` files via `pandas`

In this section we simply save the collected data as `.csv` files with `pandas`.

In [9]:
posts_df.to_csv("../examples/example_posts.csv", index=False)
comments_df.to_csv("../examples/example_comments.csv", index=False)

### Step 5: Collect some more post and comment data from Reddit
- Here we collect some more post and comment data from the same subreddits
- The post data we collect now is filtered by the `top` posts for the `day`
- We collect 100 posts per subreddit, for a total of 200
    - This value is determined by the Reddit API with a max of 100 and cannot be set manually, even within `praw`, so the `post_limit` of `10` left over from step 2 has no effect here
- Unlike step 2, we only obtain top-level comment data and not individual replies to each comment
- Trying to collect comment and reply data for top posts on popular subreddits can take arbitrarily long, depending on the depth of comments and replies on a post

In [10]:
subreddits = ["pics", "funny"]
post_filter = "top"
top_post_filter = "day"
comment_data = True
replies_data = False
replace_more_limit = 0

more_posts, more_comments = data_collector.get_data(
    subreddits,
    post_filter,
    post_limit,
    top_post_filter,
    comment_data,
    replies_data,
    replace_more_limit
)

Collecting top r/pics posts: 100it [00:01, 63.32it/s]
Collecting top r/funny posts: 100it [00:01, 65.34it/s]
Collecting comments for 100 r/pics posts: 100%|█| 100/100 [00:58<00:00,  1.72it/
Collecting comments for 100 r/funny posts: 100%|█| 100/100 [00:39<00:00,  2.52it


In [11]:
more_posts_df = rdc.to_pandas(more_posts)
more_comments_df = rdc.to_pandas(more_comments)

In [12]:
more_posts_df.shape

(200, 15)

In [13]:
more_comments_df.shape

(3621, 10)

### Step 6: Add new post and comment data to existing `.csv` files with a convenience function in the Reddit Data Collector package
- Now we can add our new post and comment data to the existing `.csv` files
- There is a convenience function called `update_data` in the `reddit_data_collector` package that allows us to do this easily
- This function takes care not to save duplicate data
- This function includes an argument `save` that if set to `True` will overwrite the old `.csv` file
- It also returns the new updated data as a `pandas` `DataFrame` in case the user desires to manipulate it in Python right away
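
To make the behaviour described above concrete, here is a minimal sketch of one way such an update could work with plain `pandas`. It is not the package's implementation: the function name `merge_into_csv` is hypothetical, and both the use of the `id` column as the duplicate key and the choice to keep the row already on disk are assumptions.

```python
import os
import tempfile

import pandas as pd


def merge_into_csv(existing_csv_path, new_df, save=False, key="id"):
    """Append rows from new_df to the CSV, skipping rows whose key already exists.

    Hypothetical helper for illustration; not part of reddit_data_collector.
    """
    # Read the key as text so that all-digit ids are not parsed as integers.
    existing_df = pd.read_csv(existing_csv_path, dtype={key: str})
    combined = pd.concat([existing_df, new_df], ignore_index=True)
    # Keep the first occurrence of each key, i.e. the row already on disk.
    combined = combined.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
    if save:
        combined.to_csv(existing_csv_path, index=False)  # overwrite the old file
    return combined


# Usage with made-up data in a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "posts.csv")
pd.DataFrame({"id": ["a", "b"], "score": [1, 2]}).to_csv(sample_path, index=False)

sample_new_rows = pd.DataFrame({"id": ["b", "c"], "score": [99, 3]})
sample_updated = merge_into_csv(sample_path, sample_new_rows, save=True)
print(sample_updated["id"].tolist())  # ['a', 'b', 'c']
```

Keeping the first occurrence means a post that is collected twice retains the values from its first collection; keeping the last occurrence instead would refresh fields such as `score`.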

In [14]:
# where we saved post and comment data in step 4
existing_posts_csv_path = "../examples/example_posts.csv"
existing_comments_csv_path = "../examples/example_comments.csv"

new_posts_df = rdc.update_data(
    existing_posts_csv_path,
    more_posts_df,
    save=True
)

new_comments_df = rdc.update_data(
    existing_comments_csv_path,
    more_comments_df,
    save=True
)

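We see that the combined data contains 203 posts rather than 220 (the 20 `hot` posts plus the 200 `top` posts). This implies that 17 posts appeared in both collections and were only kept once, which makes sense: only 3 of the `hot` posts were not also among the `top` posts.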

In [15]:
new_posts_df.shape

(203, 15)

In [16]:
new_comments_df.shape

(6918, 10)

## Concluding remarks

In a perfect world, we could collect all the data we need from Reddit in one go, without having to iterate as above. For example, it would be ideal if we could collect data as easily as "collect all hot posts from subreddit X from 2015 to 2020." Unfortunately, the Reddit API does not allow for this.

Therefore, I designed this package so that it can be used to collect samples of post and comment data from Reddit at multiple points in time (e.g. daily at 5pm), and then combine these samples seamlessly into one data set.

An example automated workflow to generate a data set would be:

Write a Python script that runs every day at 5pm and:

1. Collects post and comment data from Reddit with the Reddit Data Collector
2. Adds post and comment data to a `.csv` file, updating it each time

At the end of the month we have 30 days' worth of Reddit data that we can analyze further!
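
The daily trigger itself is outside the scope of the package. An external scheduler such as cron is the usual choice; as a minimal pure-Python sketch, the wait until the next 5pm run could be computed as below. The helper name `seconds_until_next_run` is hypothetical, the calculation uses naive local time (so it ignores daylight saving changes), and the collection step is left as a comment.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now, hour=17):
    """Seconds from `now` until the next occurrence of `hour`:00 (naive local time)."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return (target - now).total_seconds()


print(seconds_until_next_run(datetime(2022, 1, 1, 16, 0)))  # 3600.0
print(seconds_until_next_run(datetime(2022, 1, 1, 18, 0)))  # 82800.0

# In a long-running script, one would sleep for this many seconds
# (time.sleep), run the collection and update steps from this example,
# and then repeat.
```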