*Ignore this first part; it is just housekeeping so that the rest of the example below loads properly.*

In [1]:
cd ../src

/Users/nicovandenhooff/Dropbox/GITHUB/reddit-data-collector/src


In [2]:
import json

# Load the Reddit API credentials from a local JSON file
with open("../tests/credentials.json") as f:
    login = json.load(f)

client_id = login["client_id"]
client_secret = login["client_secret"]
user_agent = login["user_agent"]
username = login["username"]
password = login["password"]

# Example Use of the Reddit Data Collector Package

### Step 1: Create a `DataCollector` object

In [3]:
import reddit_data_collector as rdc

In [4]:
data_collector = rdc.DataCollector(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
    username=username,
    password=password
)

### Step 2: Obtain some post and comment data from Reddit

In this section we:
- Obtain 10 "hot" posts from each of the subreddits **r/pics** and **r/funny**, along with their comments and the replies to those comments.
- Set `replace_more_limit` to `0`, which means that any comments returned as `MoreComments` placeholders are **removed** rather than expanded.
    - See the [PRAW Documentation](https://praw.readthedocs.io/en/stable/tutorials/comments.html) for full details on `MoreComments`.
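
To make the effect of `replace_more_limit=0` concrete, here is a minimal sketch of the underlying PRAW calls. It assumes that the package forwards `replace_more_limit` to PRAW's `replace_more`, which this example does not confirm. The helper name `flatten_comments` is hypothetical and not part of `reddit_data_collector`.

```python
def flatten_comments(submission, replace_more_limit=0):
    """Return the loaded comments of a PRAW submission as a flat list.

    `submission` is assumed to behave like `praw.models.Submission`.
    This helper is illustrative only; it is not the package's own code.
    """
    # limit=0 drops every MoreComments placeholder without fetching it.
    # limit=None would instead fetch all of them, which is much slower.
    submission.comments.replace_more(limit=replace_more_limit)
    # .list() flattens the comment forest, so replies are included.
    return submission.comments.list()
```

With an authenticated `praw.Reddit` instance named `reddit`, this could be called as `flatten_comments(reddit.submission(id=post_id))`, where `post_id` is the ID of a post you want to inspect.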

In [5]:
posts, comments = data_collector.get_data(
    subreddits=["pics", "funny"],
    post_filter="hot",
    post_limit=10,
    comment_data=True,
    replies_data=True,
    replace_more_limit=0,
    dataframe=True
)

Collecting hot r/pics posts: 100%|██████████████| 10/10 [00:00<00:00, 23.92it/s]
Collecting hot r/funny posts: 100%|█████████████| 10/10 [00:00<00:00, 23.22it/s]
Collecting comments for 10 r/pics posts: 100%|██| 10/10 [00:10<00:00,  1.03s/it]
Collecting comments for 10 r/funny posts: 100%|█| 10/10 [00:14<00:00,  1.47s/it]


In [6]:
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   subreddit_name       20 non-null     object 
 1   post_created_utc     20 non-null     float64
 2   id                   20 non-null     object 
 3   is_original_content  20 non-null     bool   
 4   is_self              20 non-null     bool   
 5   link_flair_text      2 non-null      object 
 6   locked               20 non-null     bool   
 7   num_comments         20 non-null     int64  
 8   over_18              20 non-null     bool   
 9   score                20 non-null     int64  
 10  spoiler              20 non-null     bool   
 11  stickied             20 non-null     bool   
 12  title                20 non-null     object 
 13  upvote_ratio         20 non-null     float64
 14  url                  20 non-null     object 
dtypes: bool(6), float64(2), int64(2), object(5)

In [7]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3918 entries, 0 to 3917
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   subreddit_name       3918 non-null   object 
 1   id                   3918 non-null   object 
 2   post_id              3918 non-null   object 
 3   parent_id            3918 non-null   object 
 4   top_level_comment    3918 non-null   bool   
 5   body                 3918 non-null   object 
 6   comment_created_utc  3918 non-null   float64
 7   is_submitter         3918 non-null   bool   
 8   score                3918 non-null   int64  
 9   stickied             3918 non-null   bool   
dtypes: bool(3), float64(1), int64(1), object(5)
memory usage: 225.9+ KB
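
The column listings above suggest two common follow-up steps: converting the `*_created_utc` floats into timestamps, and linking comments to posts through `post_id`. The sketch below uses small made-up frames (`toy_posts`, `toy_comments`) rather than the real data. It assumes that the timestamps are epoch seconds and that `post_id` in the comment data matches `id` in the post data.

```python
import pandas as pd

# Made-up stand-ins for the `posts` and `comments` frames.
toy_posts = pd.DataFrame({
    "id": ["p1", "p2"],
    "title": ["First post", "Second post"],
    "post_created_utc": [1_640_995_200.0, 1_641_081_600.0],
})
toy_comments = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "post_id": ["p1", "p1", "p2"],
    "body": ["nice", "agreed", "lol"],
})

# Assumption: the floats are seconds since the Unix epoch (UTC).
toy_posts["post_created"] = pd.to_datetime(
    toy_posts["post_created_utc"], unit="s", utc=True
)

# Assumption: comments.post_id refers to posts.id.
merged = toy_comments.merge(
    toy_posts[["id", "title"]].rename(columns={"id": "post_id"}),
    on="post_id",
    how="left",
)

# Number of collected comments per post title.
comments_per_post = merged.groupby("title").size()
```

A left merge keeps every comment even if its post is missing from the post data, which makes gaps easy to spot as missing titles.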


### Step 3: Save post and comment data as `.csv`

In this section we simply save the collected data as `.csv` files.

In [8]:
posts.to_csv("../examples/example_posts.csv", index=False)
comments.to_csv("../examples/example_comments.csv", index=False)

### Step 4: Collect some more post and comment data from Reddit
- Now we collect some additional post and comment data from the same subreddits.
- This time the posts are the "top" posts of the past day.
- Unlike step 2, we only obtain top-level comments and not the replies to each comment, which helps speed things up.
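
If comments and replies have already been collected together, as in step 2, the `top_level_comment` column offers another way to look at top-level comments only. The sketch below uses a made-up frame (`toy_comments`) and assumes that the column is `True` exactly for comments posted directly on a post.

```python
import pandas as pd

# Made-up stand-in for a `comments` frame that includes replies.
toy_comments = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "body": ["a comment", "a reply to c1", "another comment"],
    "top_level_comment": [True, False, True],
})

# Boolean indexing keeps only the rows flagged as top-level.
top_level_only = toy_comments[toy_comments["top_level_comment"]]
```

Filtering afterwards does not save any collection time; setting `replies_data=False` avoids fetching the replies in the first place.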

In [9]:
more_posts, more_comments = data_collector.get_data(
    subreddits=["pics", "funny"],
    post_filter="top",
    top_post_filter="day",
    comment_data=True,
    replies_data=False,
    replace_more_limit=0
)

Collecting top r/pics posts: 74it [00:00, 74.14it/s]
Collecting top r/funny posts: 100it [00:01, 62.50it/s]
Collecting comments for 74 r/pics posts: 100%|██| 74/74 [00:25<00:00,  2.96it/s]
Collecting comments for 100 r/funny posts: 100%|█| 100/100 [01:21<00:00,  1.23it/s]


In [10]:
more_posts.shape

(174, 15)

In [11]:
more_comments.shape

(3978, 10)

### Step 5: Update existing `.csv` files with additional data collected
- Now we can add our new post and comment data to the existing `.csv` files.
- The `reddit_data_collector` package provides a convenience function, `update_data`, that makes this easy.
- The function avoids saving duplicate data.
- Its `save` argument, when set to `True`, overwrites the old `.csv` file with the merged data.
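
To illustrate the idea behind this kind of update, here is a minimal pandas sketch. It is not the implementation of `update_data`. The function name `merge_without_duplicates`, the `key` parameter, and the choice to keep the most recently collected copy of a duplicated row are all assumptions made for this example.

```python
import pandas as pd


def merge_without_duplicates(csv_path, new_data, key="id", save=False):
    """Append `new_data` to the rows in `csv_path`, keeping one row per `key`.

    Illustrative only; `update_data` may behave differently.
    """
    existing = pd.read_csv(csv_path)
    combined = pd.concat([existing, new_data], ignore_index=True)
    # Assumption: when an ID appears twice, keep the newer row.
    combined = combined.drop_duplicates(subset=key, keep="last")
    combined = combined.reset_index(drop=True)
    if save:
        # Overwrite the old file with the merged data.
        combined.to_csv(csv_path, index=False)
    return combined
```

Deduplication of this kind explains why a merged total can be smaller than the sum of two collections: any post or comment that appears in both is stored only once.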

In [12]:
# where we saved the post and comment data in step 3
posts_csv_path = "../examples/example_posts.csv"
comments_csv_path = "../examples/example_comments.csv"

updated_posts = rdc.update_data(
    posts_csv_path,
    more_posts,
    save=True
)

updated_comments = rdc.update_data(
    comments_csv_path,
    more_comments,
    save=True
)

In [13]:
print("Posts collected...")
print(f"First collection: {posts.shape[0]}")
print(f"Second collection: {more_posts.shape[0]}")
print(f"After merging: {updated_posts.shape[0]}")

Posts collected...
First collection: 20
Second collection: 174
After merging: 176


In [14]:
print("Comments collected...")
print(f"First collection: {comments.shape[0]}")
print(f"Second collection: {more_comments.shape[0]}")
print(f"After merging: {updated_comments.shape[0]}")

Comments collected...
First collection: 3918
Second collection: 3978
After merging: 6066
