# Unsupervised NLP on /r/datascience Comments

This is a notebook for performing unsupervised NLP on the comments from the datascience subreddit "Weekly Entering & Transitioning Thread"

The most current one being:
https://www.reddit.com/r/datascience/comments/m4u0uu/weekly_entering_transitioning_thread_14_mar_2021/

This notebook covers all steps of analysis:
1. Data Scraping
1. Data Cleaning
1. Feature Engineering
1. Modelling



In [1]:
import yaml
import praw

import pandas as pd

## Data Scraping

To collect the comments will make use of the reddit praw API, which allows us to perform simple (but adequate) queries and collect posts and comments in a useful object oriented format!

As a note, I store reddit IDs and Sectets in a separate yaml file, you will have to make your own if you plan to copy any code snippets.

In [2]:
yaml_file = open("/Users/harry/scrapeReddit/reddit_keys.yml")
parsed_yaml = yaml.load(yaml_file, Loader=yaml.FullLoader)
reddit = praw.Reddit(client_id=parsed_yaml["client_id"],                                                                                                                                                                                          
                     client_secret=parsed_yaml["client_secret"],                                                                                                                                                                       
                     user_agent=parsed_yaml["user_agent"],                                                                                                                                                                                            
                     username=parsed_yaml["username"],                                                                                                                                                                                      
                     password=parsed_yaml["password"])

Search the datascience subreddit for posts with both "weekly" and "thread" in the title, and to only keep posts written by the authors "datascience-bot" and "AutoModerator".

In [3]:
ALLOWED_AUTHORS = ["datascience-bot", "AutoModerator"]
all_submissions = []
for submission in reddit.subreddit("datascience").search("weekly+thread", limit=1000):
    if submission.author not in ALLOWED_AUTHORS:
        continue
    all_submissions.append(submission)

Now to parse through all comments in the thread, saving for each comment: the id, url, thread name (a manually parsed string for keeping track), text body, username, comment depth, number of upvotes and number of replies.

Number of replies has a little extra work, ensuring to only count replies that aren't from "datascience-bot" or "AutoModerator".

In [4]:
all_comments = []

for submission in all_submissions:

    submission.comments.replace_more(limit=None)

    thread_name = submission.title.split("|")[-1]
    thread_name = "".join(thread_name.split())

    for comment in submission.comments.list():
        
        if comment.author in ALLOWED_AUTHORS:
            continue
        
        comment_dict = {}
        comment_dict["id"] = comment.id
        comment_dict["url"] = "reddit.com{}".format(comment.permalink)
        comment_dict["thread"] = thread_name
        comment_dict["textbody"] = comment.body
        comment_dict["username"] = comment.author
        comment_dict["depth"] = comment.depth
        comment_dict["upvotes"] = comment.ups

        # Only count replies that aren't from the reddit bots
        filtered_replies = [r for r in comment.replies if r.author not in ALLOWED_AUTHORS]
        comment_dict["replies"] = len(filtered_replies)
        
        all_comments.append(comment_dict)

Now to simply parse the list of dicts into a dataframe. Everything looks as expected!

In [5]:
raw_df = pd.DataFrame(all_comments)
raw_df = raw_df.set_index('id')

print("Collected {} Raw Comments".format(len(raw_df)))

Collected 13630 Raw Comments


## Data Cleaning and Filtering

Now to clean the raw dataframe, removing unuseful or unprofessional comments.
1. Remove comments with body "[Deleted]"
1. Remove comments with negative karma (mostly spam)

First a look at how many comments we scraped that were just empty and contain "[deleted]", also no meaningful username attatched. We drop these from the dataframe.

In [6]:
raw_df[raw_df.textbody.str.contains("\[deleted\]")].head()

Unnamed: 0_level_0,url,thread,textbody,username,depth,upvotes,replies
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
gq86mcu,reddit.com/r/datascience/comments/lzpbaf/weekl...,07Mar2021-14Mar2021,[deleted],,0,1,1
gq8d9a3,reddit.com/r/datascience/comments/lzpbaf/weekl...,07Mar2021-14Mar2021,[deleted],,0,1,1
gqdzydq,reddit.com/r/datascience/comments/lzpbaf/weekl...,07Mar2021-14Mar2021,[deleted],,0,1,2
gokgd4i,reddit.com/r/datascience/comments/lovorx/weekl...,21Feb2021-28Feb2021,[deleted],,0,1,0
gox8y5a,reddit.com/r/datascience/comments/lovorx/weekl...,21Feb2021-28Feb2021,[deleted],,0,1,1


In [7]:
print("Number of Comments Before: {}".format(len(raw_df)))
raw_df = raw_df[~raw_df.textbody.str.contains("\[deleted\]")]
print("Number of Comments After: {}".format(len(raw_df)))

Number of Comments Before: 13630
Number of Comments After: 12771


Next we drop any comment with negative karma, this is usually some kind of spam / self promotion. Though this in theory should protect against some unuseful and hostile comments. 

In [8]:
print("Number of Comments Before: {}".format(len(raw_df)))
raw_df = raw_df[raw_df["upvotes"] >= 0]
print("Number of Comments After: {}".format(len(raw_df)))

Number of Comments Before: 12771
Number of Comments After: 12672


For this study only comments that have gauged some amount of discussion are kept, and hence filter only those at depth=0 with replies>0. This does remove a large amount of comments, but those remaining should be more interesting.

In [9]:
print("Number of Comments Before: {}".format(len(raw_df)))
raw_df = raw_df[(raw_df["replies"] >= 0) & (raw_df["depth"] == 0)]
print("Number of Comments After: {}".format(len(raw_df)))

Number of Comments Before: 12672
Number of Comments After: 4848


While this is not a requirement, to avoid causing too much interference with the NLP all URLs / hyperlinks are removed from the comment textbody. This regex just matches anything beginning with http or www and replaces it with nothing.

In [10]:
raw_df["textbody"] = raw_df["textbody"].str.replace('http\S+|www.\S+', '', case=False)

## Feature Enginerring

The most common methods for encoding a text document for machine learning input are using
1. Count Vectorizer [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
1. tf–idf Vectorizer [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Generally it is preferable to use a tf-idf as this gives a lower weight to words or phrases that will be common to most comments, but give larger weights to what makes the comment unique, and hopefully what the subject is!

### Count Vectorizer

The sklearn TfidfVectorizer is fitted on the full set of comments. The removal of english stopwords is used and a vectorizer is constructed for ngrams of size (1, 2 and 3). 

For visualisation of what these vectorising are doing, will first just show the top 50 features in each vectorizer.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

N_FEATURES = 50

vectorizer_ngram1 = TfidfVectorizer(stop_words='english', max_features = N_FEATURES, ngram_range=(1,1))
vectorizer_ngram2 = TfidfVectorizer(stop_words='english', max_features = N_FEATURES, ngram_range=(2,2))
vectorizer_ngram3 = TfidfVectorizer(stop_words='english', max_features = N_FEATURES, ngram_range=(3,3))

vectorizer_ngram1.fit_transform(raw_df["textbody"])
vectorizer_ngram2.fit_transform(raw_df["textbody"])
vectorizer_ngram3.fit_transform(raw_df["textbody"])

print("\n == Vectorizer ngram1 Top {} == \n{}".format(N_FEATURES, vectorizer_ngram1.get_feature_names()))
print("\n == Vectorizer ngram1 Top {} == \n{}".format(N_FEATURES, vectorizer_ngram2.get_feature_names()))
print("\n == Vectorizer ngram1 Top {} == \n{}".format(N_FEATURES, vectorizer_ngram3.get_feature_names()))


 == Vectorizer ngram1 Top 50 == 
['advice', 'analysis', 'analyst', 'analytics', 'background', 'business', 'career', 'company', 'course', 'courses', 'currently', 'data', 'degree', 'doing', 'don', 'ds', 'engineering', 'experience', 'field', 'good', 'help', 'hi', 'job', 'just', 'know', 'learn', 'learning', 'like', 'looking', 'masters', 'need', 'program', 'projects', 'python', 'really', 'school', 'science', 'scientist', 'skills', 'sql', 'statistics', 'thanks', 'think', 'time', 've', 'want', 'work', 'working', 'year', 'years']

 == Vectorizer ngram1 Top 50 == 
['best way', 'big data', 'business analytics', 'career data', 'computer science', 'currently working', 'data analysis', 'data analyst', 'data analytics', 'data engineering', 'data science', 'data scientist', 'data scientists', 'deep learning', 'don know', 'don want', 'entry level', 'experience data', 'feel like', 'grad school', 'greatly appreciated', 'hey guys', 'hi guys', 'interested data', 'job data', 'job market', 'learning data',

What we see is that ngram=1 and ngram=2 vectorizers do have some noise with a lot of general english language still there. These are expected to recieve low tf-idf weights as they'll likely be common to many comments. They do contain some useful phrases such as individual skillsets like
- masters, ml, sql
- linear algebra, big data, science degree

The ngram=3 vectorizer might be too much noise, but it still appears to pick up some interesting phrases such as
- data science bootcamp, machine learning engineer, transition data science


To limit the machine learning only to those that will be useful a different number of features were selected for each ngram level. ngram1 = 500, ngram2 = 250, ngram3 = 100

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_ngram1 = TfidfVectorizer(stop_words='english', max_features = 500, ngram_range=(1,1))
vectorizer_ngram2 = TfidfVectorizer(stop_words='english', max_features = 250, ngram_range=(2,2))
vectorizer_ngram3 = TfidfVectorizer(stop_words='english', max_features = 100, ngram_range=(3,3))

X = vectorizer_ngram1.fit_transform(raw_df["textbody"])
X = vectorizer_ngram2.fit_transform(raw_df["textbody"])
X = vectorizer_ngram3.fit_transform(raw_df["textbody"])
