# Unsupervised NLP on /r/datascience Comments

This notebook performs unsupervised NLP on the comments from the datascience subreddit's "Weekly Entering & Transitioning Thread".

The most recent thread at the time of writing is:
https://www.reddit.com/r/datascience/comments/m4u0uu/weekly_entering_transitioning_thread_14_mar_2021/

This notebook covers every step of the analysis:
1. Data Scraping
2. Data Cleaning
3. Feature Engineering
4. Modelling



In [1]:
import yaml
import praw

import pandas as pd

## Data Scraping

To collect the comments, we will make use of PRAW (the Python Reddit API Wrapper), which lets us perform simple (but adequate) queries and collect posts and comments in a useful object-oriented format!

As a note, I store my Reddit IDs and secrets in a separate YAML file; you will have to create your own if you plan to copy any code snippets.

In [2]:
# Load the Reddit API credentials from the separate YAML file
with open("/Users/harry/scrapeReddit/reddit_keys.yml") as yaml_file:
    parsed_yaml = yaml.safe_load(yaml_file)

# Create an authenticated Reddit client from the stored credentials
reddit = praw.Reddit(client_id=parsed_yaml["client_id"],
                     client_secret=parsed_yaml["client_secret"],
                     user_agent=parsed_yaml["user_agent"],
                     username=parsed_yaml["username"],
                     password=parsed_yaml["password"])
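
The cell above reads five keys from the YAML file. As a rough sketch (the helper name `missing_credentials` is hypothetical and not part of PRAW or this notebook), the parsed dictionary could be checked before it is handed to `praw.Reddit`, so that a missing entry is reported clearly rather than surfacing as a `KeyError`:

```python
# Keys read from the YAML file by the praw.Reddit(...) call in this notebook
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def missing_credentials(parsed_yaml):
    """Return the required keys that are absent or empty in the parsed YAML dict."""
    return [key for key in REQUIRED_KEYS if not parsed_yaml.get(key)]


# Placeholder values only; substitute the credentials from your own Reddit app
example = {"client_id": "my-client-id", "client_secret": "my-secret", "user_agent": "my-agent"}
print(missing_credentials(example))  # ['username', 'password']
```

An empty list means every key the notebook expects is present.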

Search the datascience subreddit for posts with both "weekly" and "thread" in the title, keeping only those written by "datascience-bot" or "AutoModerator".

In [3]:
ALLOWED_AUTHORS = ["datascience-bot", "AutoModerator"]
all_submissions = []
for submission in reddit.subreddit("datascience").search("weekly+thread", limit=20):
    if submission.author not in ALLOWED_AUTHORS:
        continue
    all_submissions.append(submission)
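
In case the search returns posts whose titles do not actually contain both words, the title can also be checked explicitly. A minimal sketch, assuming a hypothetical helper `is_weekly_thread` and using `SimpleNamespace` stand-ins instead of real PRAW submissions so that it runs offline:

```python
from types import SimpleNamespace

ALLOWED_AUTHORS = ["datascience-bot", "AutoModerator"]
REQUIRED_WORDS = ("weekly", "thread")


def is_weekly_thread(title, author_name):
    """True if the title contains every required word and the author is allowed."""
    title_lower = title.lower()
    has_words = all(word in title_lower for word in REQUIRED_WORDS)
    return has_words and author_name in ALLOWED_AUTHORS


# Made-up stand-ins for submissions; only .title and .author are modelled
posts = [
    SimpleNamespace(title="Weekly Entering & Transitioning Thread", author="datascience-bot"),
    SimpleNamespace(title="Weekly Entering & Transitioning Thread", author="some_user"),
    SimpleNamespace(title="Thread about pandas", author="AutoModerator"),
]
kept = [p for p in posts if is_weekly_thread(p.title, str(p.author))]
print(len(kept))  # 1
```

Only the first stand-in passes both checks: the second has a disallowed author and the third is missing "weekly" from its title.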

Next, parse every comment in each thread, saving for each comment: the id, url, thread name (a manually parsed string for keeping track), text body, username, comment depth, number of upvotes and number of replies.

The number of replies takes a little extra work: only replies that aren't from "datascience-bot" or "AutoModerator" are counted.

In [4]:
all_comments = []

for submission in all_submissions:

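    # Expand every "load more comments" placeholder so the full comment tree is available
    submission.comments.replace_more(limit=None)

    # Keep the part of the title after the last "|" and strip all whitespace
    thread_name = submission.title.split("|")[-1]
    thread_name = "".join(thread_name.split())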

    for comment in submission.comments.list():

        comment_dict = {}
        comment_dict["id"] = comment.id
        comment_dict["url"] = "reddit.com{}".format(comment.permalink)
        comment_dict["thread"] = thread_name
        comment_dict["textbody"] = comment.body
        comment_dict["username"] = comment.author
        comment_dict["depth"] = comment.depth
        comment_dict["upvotes"] = comment.ups

        # Only count replies that aren't from the reddit bots
        filtered_replies = [r for r in comment.replies if r.author not in ALLOWED_AUTHORS]
        comment_dict["replies"] = len(filtered_replies)
        
        all_comments.append(comment_dict)
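
To make the two small transformations in the loop concrete, here is a self-contained sketch. The example title is an assumption about how the thread titles are formatted, and `normalise_thread_name` and `count_human_replies` are hypothetical helper names; plain `SimpleNamespace` objects stand in for PRAW comments.

```python
from types import SimpleNamespace

ALLOWED_AUTHORS = ["datascience-bot", "AutoModerator"]


def normalise_thread_name(title):
    """Keep the text after the last "|" and remove all whitespace."""
    return "".join(title.split("|")[-1].split())


def count_human_replies(replies):
    """Count replies whose author is not one of the bot accounts."""
    return sum(1 for reply in replies if reply.author not in ALLOWED_AUTHORS)


# Assumed title format, for illustration only
title = "Weekly Entering & Transitioning Thread | 07 Mar 2021 - 14 Mar 2021"
print(normalise_thread_name(title))  # 07Mar2021-14Mar2021

# Made-up replies: one from a bot, two from ordinary users
replies = [
    SimpleNamespace(author="AutoModerator"),
    SimpleNamespace(author="user_a"),
    SimpleNamespace(author="user_b"),
]
print(count_human_replies(replies))  # 2
```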

Now simply convert the list of dicts into a dataframe, indexed by comment id. Everything looks as expected!

In [5]:
raw_df = pd.DataFrame(all_comments)
raw_df = raw_df.set_index('id')
raw_df.head()

| id | url | thread | textbody | username | depth | upvotes | replies |
|---|---|---|---|---|---|---|---|
| gq5iprx | reddit.com/r/datascience/comments/lzpbaf/weekl... | 07Mar2021-14Mar2021 | I’ve found a lot of posts/comments within this... | may4422 | 0 | 4 | 2 |
| gq4ztgk | reddit.com/r/datascience/comments/lzpbaf/weekl... | 07Mar2021-14Mar2021 | I'm graduating from University with a Computer... | praventz | 0 | 3 | 0 |
| gq68qh7 | reddit.com/r/datascience/comments/lzpbaf/weekl... | 07Mar2021-14Mar2021 | I’m new to Reddit but looking to expand my ski... | No-Half3399 | 0 | 2 | 1 |
| gq8joip | reddit.com/r/datascience/comments/lzpbaf/weekl... | 07Mar2021-14Mar2021 | **What to learn after Pandas and Matplotlib? (... | meerkat99 | 0 | 2 | 1 |
| gq8w0yu | reddit.com/r/datascience/comments/lzpbaf/weekl... | 07Mar2021-14Mar2021 | \n\nHello, guys! I'm working as a data person... | AggressivePrune7212 | 0 | 2 | 2 |
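
Since `id` becomes the index, it can be worth confirming that the scraped records are consistent before building the dataframe. A small sketch under that assumption, with a hypothetical helper `check_records` and made-up records rather than real comments:

```python
def check_records(records, key="id"):
    """Return (duplicate_ids, records_have_same_fields) for a list of dicts."""
    seen, duplicates = set(), []
    for record in records:
        if record[key] in seen:
            duplicates.append(record[key])
        seen.add(record[key])
    # Every dict should expose exactly the same set of field names
    same_fields = len({frozenset(record) for record in records}) <= 1
    return duplicates, same_fields


# Made-up records for illustration
records = [
    {"id": "a1", "textbody": "first", "depth": 0},
    {"id": "b2", "textbody": "second", "depth": 1},
    {"id": "a1", "textbody": "repeat", "depth": 0},
]
print(check_records(records))  # (['a1'], True)
```

An empty duplicate list and `True` for the field check would suggest the list of dicts is safe to index by `id`.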
