In [153]:
# Import libraries
import pandas as pd
import praw # needs to be installed first: https://praw.readthedocs.io/en/stable/getting_started/installation.html

### Setting up PRAW API
I chose PRAW over Pushshift to collect the Reddit data because it worked better with the r/office subreddit. For some reason Pushshift wasn't able to return all r/office posts, although it worked well with r/DunderMifflin.
For the data collection, I ended up using PRAW to scrape posts from two subreddits:
* r/DunderMifflin, a fan community dedicated to the US TV series "The Office"
* r/office, a space to discuss workplace-related topics, such as relationships with coworkers or office supplies

I chose these two subreddits on the assumption that they would share part of their vocabulary.

In [120]:
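# Your credentials here: pass client_id, client_secret and user_agent to praw.Reddit(),
# or define them in a praw.ini file
# How to get PRAW credentials: https://www.geeksforgeeks.org/scraping-reddit-using-python/
reddit = praw.Reddit()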

To run this notebook on your own machine, you need to install PRAW and obtain your own Reddit API credentials. The comments in the import cell and in the code cell above link to instructions for both.
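
One way to keep credentials out of the notebook is to read them from environment variables and pass them to `praw.Reddit` as keyword arguments. The sketch below is only an illustration: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are assumptions and not part of the original notebook.

```python
import os


def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names below are illustrative assumptions; any names
    would work as long as they match what is set in the environment.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if anything is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (requires PRAW and real credentials):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early on a missing variable avoids a less obvious authentication error later, when the first request is sent.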

### Downloading Office posts
[r/office](https://www.reddit.com/r/office/)
> A subreddit to talk about all things office related, from what happened in your office, to stationary supplies, moronic managers, to even a quick question about Microsoft Office! Basically if it's connected to the office in some way, come here to discuss it!
* 5.6k Members
* Created Sep 9, 2008
* Active, 5-10 posts a week recently

In [146]:
# Reference: https://www.geeksforgeeks.org/scraping-reddit-using-python/
# Scraping up to 1000 of the newest posts from the r/office subreddit
posts = reddit.subreddit('office').new(limit=1000)
# Creating an empty list for each column 
posts_dict = {'title': [], 'selftext': [], 'author': [],
              'id': [], 'score': [],
              'comments': [], 'link': [],
              'subreddit' : [], 'created': []
              }

for post in posts:
    # Title of a post
    posts_dict['title'].append(post.title)
     
    # Body
    posts_dict['selftext'].append(post.selftext)
    
    # Author 
    posts_dict['author'].append(post.author)
     
    # Unique ID
    posts_dict['id'].append(post.id)
     
    # The score 
    posts_dict['score'].append(post.score)
     
    # Total number of comments 
    posts_dict['comments'].append(post.num_comments)
     
    # URL
    posts_dict['link'].append(post.url)
    
    # Subreddit name
    posts_dict['subreddit'].append(post.subreddit)
    
    # Time of post creation
    posts_dict['created'].append(post.created_utc)
 
# Storing the data in a pandas DataFrame
office_df = pd.DataFrame(posts_dict)

In [147]:
# Saving data
office_df.to_csv('./data/office_df_original.csv', index=False)

### Downloading Dunder Mifflin posts
[r/DunderMifflin](https://www.reddit.com/r/DunderMifflin/)
> Why waste time watch many show when one show do trick?
* 2.1m Members
* Created Jan 8, 2011
* Active, 5-10 posts per day

In [151]:
# Scraping up to 1000 of the newest posts from the r/DunderMifflin subreddit with PRAW
posts = reddit.subreddit('DunderMifflin').new(limit=1000)
# Creating an empty list for each column 
posts_dict = {'title': [], 'selftext': [], 'author': [],
              'id': [], 'score': [],
              'comments': [], 'link': [],
              'subreddit' : [], 'created': []
              }

for post in posts:
    # Title of a post
    posts_dict['title'].append(post.title)
     
    # Body
    posts_dict['selftext'].append(post.selftext)
    
    # Author 
    posts_dict['author'].append(post.author)
     
    # Unique ID
    posts_dict['id'].append(post.id)
     
    # The score 
    posts_dict['score'].append(post.score)
     
    # Total number of comments 
    posts_dict['comments'].append(post.num_comments)
     
    # URL
    posts_dict['link'].append(post.url)
    
    # Subreddit name
    posts_dict['subreddit'].append(post.subreddit)
    
    # Time of post creation
    posts_dict['created'].append(post.created_utc)
 
# Storing the data in a pandas DataFrame
DunderMifflin_df = pd.DataFrame(posts_dict)

In [152]:
# Saving data
DunderMifflin_df.to_csv('./data/DunderMifflin_df_original.csv', index=False)
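
The collection loop is identical for both subreddits, so it could be factored into one helper. The sketch below is an assumption about how that might look, not the code that was actually run: `COLUMNS` and `collect_posts` are hypothetical names, and the `SimpleNamespace` object is a stand-in for a PRAW submission so the example runs without network access.

```python
from types import SimpleNamespace

# Maps each output column to the submission attribute it is read from,
# mirroring the columns used in the cells above.
COLUMNS = {
    "title": "title",
    "selftext": "selftext",
    "author": "author",
    "id": "id",
    "score": "score",
    "comments": "num_comments",
    "link": "url",
    "subreddit": "subreddit",
    "created": "created_utc",
}


def collect_posts(posts):
    """Return a dict of column lists built from an iterable of posts."""
    posts_dict = {column: [] for column in COLUMNS}
    for post in posts:
        for column, attribute in COLUMNS.items():
            posts_dict[column].append(getattr(post, attribute))
    return posts_dict


# Stand-in post with made-up values, used only to demonstrate the helper.
fake_post = SimpleNamespace(
    title="Printer jam", selftext="Again.", author="user1", id="abc123",
    score=5, num_comments=2, url="https://example.com",
    subreddit="office", created_utc=1673803837.0,
)
demo = collect_posts([fake_post])
print(demo["title"], demo["comments"])
```

With a real client, the same helper could be called as `pd.DataFrame(collect_posts(reddit.subreddit('office').new(limit=1000)))` and again with `'DunderMifflin'`.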

### Summary
Office: 
* 793 posts
* Sunday, April 19, 2020 10:08:23 PM - Sunday, January 15, 2023 5:30:37 PM (GMT)

DunderMifflin:
* 993 posts
* Sunday, January 1, 2023 1:44:36 PM - Monday, January 16, 2023 5:22:40 AM (GMT)
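
The date ranges above can be derived from the `created` column, which holds Unix timestamps in seconds (UTC). A minimal sketch of that conversion follows; `posting_range` is a hypothetical helper and the sample timestamps are made up rather than taken from the scraped data.

```python
from datetime import datetime, timezone


def posting_range(created_utc_values):
    """Return (earliest, latest) as timezone-aware UTC datetimes."""
    moments = [
        datetime.fromtimestamp(ts, tz=timezone.utc) for ts in created_utc_values
    ]
    return min(moments), max(moments)


# Illustrative timestamps only; not taken from the scraped data.
sample = [86400, 0, 3600]
earliest, latest = posting_range(sample)
print(earliest.strftime("%A, %B %d, %Y %I:%M:%S %p"))
print(latest.strftime("%A, %B %d, %Y %I:%M:%S %p"))
```

On the saved data this could be applied as `posting_range(office_df['created'])`, and the number of posts is simply `len(office_df)`.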