# Scraping Reddit

This notebook does the following:

1. Scrapes posts from the subreddits listed in `subreddit_list.py` and writes them to `data/posts`
2. Cleans the post CSVs by removing duplicate rows and copied header rows
3. Scrapes comments from the subreddits listed in `subreddit_list.py` and writes them to `data/comments`
4. Cleans the comment CSVs by removing duplicate rows and copied header rows

In [1]:
# import post scraping function
from reddit_scrape import reddit_scrape

reddit_scrape()

Successfully scraped r/BPD
Successfully appended 998 pulled posts to data/posts/BPD.csv
Delaying 6 minutes
Successfully scraped r/BorderlinePDisorder
Successfully appended 997 pulled posts to data/posts/BorderlinePDisorder.csv
Delaying 6 minutes
Successfully scraped r/BPDmemes
Successfully appended 990 pulled posts to data/posts/BPDmemes.csv
Delaying 4 minutes
Successfully scraped r/SuicideWatch
Successfully appended 990 pulled posts to data/posts/SuicideWatch.csv
Delaying 7 minutes
Successfully scraped r/QuietBPD
Successfully appended 10 pulled posts to data/posts/QuietBPD.csv
Delaying 5 minutes
Successfully scraped r/BPDrecovery
Successfully appended 967 pulled posts to data/posts/BPDrecovery.csv
Delaying 8 minutes
Successfully scraped r/mentalhealth
Successfully appended 994 pulled posts to data/posts/mentalhealth.csv
Delaying 8 minutes
Successfully scraped r/BPDsupport
Successfully appended 983 pulled posts to data/posts/BPDsupport.csv
Delaying 6 minutes
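
The log above shows a pause of several minutes after each subreddit. The internals of `reddit_scrape` are not shown in this notebook, so the following is only a sketch of that pattern, assuming the delay is drawn at random. The names `scrape_with_delays` and `fetch_posts`, and the 4 to 8 minute range, are hypothetical and chosen for illustration.

```python
import random
import time


def scrape_with_delays(subreddits, fetch_posts, min_minutes=4, max_minutes=8,
                       sleep=time.sleep):
    """Call fetch_posts for each subreddit, pausing a random whole number of
    minutes after each one.

    fetch_posts is a hypothetical callable that takes a subreddit name and
    returns a list of posts; it stands in for the real scraping logic.
    """
    results = {}
    for name in subreddits:
        results[name] = fetch_posts(name)
        print(f"Successfully scraped r/{name}")
        delay = random.randint(min_minutes, max_minutes)  # inclusive bounds
        print(f"Delaying {delay} minutes")
        sleep(delay * 60)  # time.sleep expects seconds
    return results
```

Passing `sleep` as a parameter makes it possible to try the loop without actually waiting, for example by handing in a function that only records the requested delay.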


In [2]:
# import post cleaning function
from clean_post_csv import clean_post_csv

# import subreddit list
from subreddit_list import subreddit_list

subreddits = subreddit_list()

# loop over the subreddits and clean each scraped post CSV
for subreddit in subreddits:
    filepath = 'data/posts/'+subreddit+'.csv'
    clean_post_csv(filepath)

904 duplicates dropped from data/posts/BPD.csv
903 duplicates dropped from data/posts/BorderlinePDisorder.csv
898 duplicates dropped from data/posts/BPDmemes.csv
11 duplicates dropped from data/posts/SuicideWatch.csv
9 duplicates dropped from data/posts/QuietBPD.csv
960 duplicates dropped from data/posts/BPDrecovery.csv
194 duplicates dropped from data/posts/mentalhealth.csv
956 duplicates dropped from data/posts/BPDsupport.csv
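
The cleaning step removes duplicate rows and header lines that were copied into a file on repeated appends. `clean_post_csv` itself is not shown in this notebook, so the block below is only a sketch of one way to do this with pandas. The function name `drop_duplicates_and_headers` is hypothetical, and it assumes a copied header row repeats the column names verbatim.

```python
import io

import pandas as pd


def drop_duplicates_and_headers(csv_source):
    """Return (cleaned DataFrame, number of rows removed)."""
    # Read everything as strings so a stray header row does not alter dtypes.
    df = pd.read_csv(csv_source, dtype=str)
    before = len(df)
    # A copied header row has every cell equal to its own column name.
    is_header = (df == df.columns.to_numpy()).all(axis=1)
    df = df[~is_header].drop_duplicates().reset_index(drop=True)
    return df, before - len(df)


# Toy input: one duplicated row and one copied header line.
raw = "id,title\n1,a\n1,a\nid,title\n2,b\n"
clean, dropped = drop_duplicates_and_headers(io.StringIO(raw))
print(f"{dropped} rows dropped")
```

Unlike the notebook's function, this sketch returns the cleaned frame instead of writing it back to disk, which keeps it easy to test.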


In [1]:
# import comment scraping function
from comment_scrape import comment_scrape

comment_scrape()

Completed 100 posts.
Completed 200 posts.
Completed 300 posts.
Completed 400 posts.
Completed 500 posts.
Completed 600 posts.
Completed 700 posts.
Completed 800 posts.
Completed 900 posts.
Successfully appended 7460 pulled comments to data/comments/BPD_comments.csv
Completed 100 posts.
Completed 200 posts.
Completed 300 posts.
Completed 400 posts.
Completed 500 posts.
Completed 600 posts.
Completed 700 posts.
Completed 800 posts.
Completed 900 posts.
Completed 1000 posts.
Successfully appended 9255 pulled comments to data/comments/BorderlinePDisorder_comments.csv
Completed 100 posts.
Completed 200 posts.
Completed 300 posts.
Completed 400 posts.
Completed 500 posts.
Completed 600 posts.
Completed 700 posts.
Completed 800 posts.
Completed 900 posts.
Completed 1000 posts.
Successfully appended 8472 pulled comments to data/comments/BPDmemes_comments.csv
Completed 100 posts.
Completed 200 posts.
Completed 300 posts.
Completed 400 posts.
Completed 500 posts.
Completed 600 posts.
Completed 7

In [2]:
# import comment cleaning function
from clean_comments_csv import clean_comments_csv

# import subreddit list
from subreddit_list import subreddit_list

subreddits = subreddit_list()

# loop over the subreddits and clean each scraped comment CSV
for subreddit in subreddits:
    filepath = 'data/comments/'+subreddit+'_comments.csv'
    clean_comments_csv(filepath)

7204 duplicates dropped from data/comments/BPD_comments.csv
8428 duplicates dropped from data/comments/BorderlinePDisorder_comments.csv
7890 duplicates dropped from data/comments/BPDmemes_comments.csv
2686 duplicates dropped from data/comments/SuicideWatch_comments.csv
44 duplicates dropped from data/comments/QuietBPD_comments.csv
4965 duplicates dropped from data/comments/BPDrecovery_comments.csv
2582 duplicates dropped from data/comments/mentalhealth_comments.csv
2690 duplicates dropped from data/comments/BPDsupport_comments.csv


In [2]:
from post_merge import post_merge

post_merge()

In [5]:
import pandas as pd

from final_merge_clean import final_merge_clean

final_merge_clean()

mitch_df = pd.read_csv('data/merged/mitch_master_data.csv')
mitch_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49496 entries, 0 to 49495
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         49496 non-null  object 
 1   Title      49496 non-null  object 
 2   Text       49496 non-null  object 
 3   Author     49496 non-null  object 
 4   OP         49496 non-null  int64  
 5   Is Post    49496 non-null  int64  
 6   Post Date  49496 non-null  float64
 7   Subreddit  49496 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 3.0+ MB


In [6]:
mitch_df['Is Post'].value_counts()

Is Post
0    40757
1     8739
Name: count, dtype: int64
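
To see how posts and comments are distributed per subreddit, a cross-tabulation is one option. The sketch below runs on a tiny made-up frame that only borrows the `Subreddit` and `Is Post` column names, not on the real data, and it assumes that `Is Post` is 1 for posts and 0 for comments.

```python
import pandas as pd

# Toy rows that only mimic the column names of the merged data.
toy_df = pd.DataFrame({
    "Subreddit": ["BPD", "BPD", "BPD", "mentalhealth", "mentalhealth"],
    "Is Post": [1, 0, 0, 1, 0],
})

# Rows are subreddits, columns are the Is Post flag
# (assumed: 0 = comment, 1 = post).
counts = pd.crosstab(toy_df["Subreddit"], toy_df["Is Post"])
print(counts)
```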

In [9]:
mitch_df.sort_values(by=['ID']).reset_index()

Unnamed: 0,index,ID,Title,Text,Author,OP,Is Post,Post Date,Subreddit
0,8668,1019evr,I just had the worst episode of my life,"I didn’t take it out on anybody but myself, i ...",Yeetmyass420,1,1,1.672655e+09,BPDsupport
1,49397,1019evr,I just had the worst episode of my life,I completely agree. The side effects are awful...,Consistent-Hippo-837,0,0,1.675272e+09,BPDsupport
2,49396,1019evr,I just had the worst episode of my life,I can relate. Honestly only know I’ve had one ...,Plantmorejerry,0,0,1.674757e+09,BPDsupport
3,49395,1019evr,I just had the worst episode of my life,"Honestly, I only know when I’m out of an episo...",Consistent-Hippo-837,0,0,1.674695e+09,BPDsupport
4,49394,1019evr,I just had the worst episode of my life,Don’t be sorry for your rant. Thankyou so much...,Plantmorejerry,0,0,1.674476e+09,BPDsupport
...,...,...,...,...,...,...,...,...,...
49491,49401,ykiesa,Any advice would be great! Sorry for the lengt...,You sound like a very compassionate person. I ...,Soylent_green_day1,0,0,1.667494e+09,BPDsupport
49492,8671,yvnog9,Am I(17F) in the wrong here or was I genuinely...,[no text],KAI_IS_FINE,1,1,1.668489e+09,BPDsupport
49493,8670,ywe7vy,Chronic Numbness/Emptiness *mention of suicide*,\n\nSo I am currently sitting here in front o...,ImpossibleBicycle249,1,1,1.668558e+09,BPDsupport
49494,49398,ywe7vy,Chronic Numbness/Emptiness *mention of suicide*,I see you and I can somewhat relate. It used t...,barely_parenting,0,0,1.668579e+09,BPDsupport
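
The `Post Date` values look like Unix timestamps in seconds (for example `1.672655e+09`), though the notebook does not state this. Assuming that reading is right, they can be turned into readable datetimes with `pd.to_datetime`. The frame below is a toy example, and `Post Datetime` is a hypothetical column name.

```python
import pandas as pd

# Two sample values copied from the table above.
toy_df = pd.DataFrame({"Post Date": [1.672655e+09, 1.667494e+09]})

# unit="s" interprets the numbers as seconds since 1970-01-01 UTC.
toy_df["Post Datetime"] = pd.to_datetime(toy_df["Post Date"], unit="s")
print(toy_df)
```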


### To-do:

* Implement a merge function that combines everyone's data
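
One possible shape for this merge is sketched below. It assumes each contributor produces a master CSV with the same columns as `mitch_master_data.csv`. The name `merge_master_frames` is hypothetical, and dropping only exact duplicate rows is an assumption, not a decided design.

```python
import pandas as pd


def merge_master_frames(frames):
    """Stack per-contributor DataFrames and drop exact duplicate rows."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates().reset_index(drop=True)


# Toy frames standing in for two contributors' data.
frame_a = pd.DataFrame({"ID": ["x", "y"], "Is Post": [1, 0]})
frame_b = pd.DataFrame({"ID": ["y", "z"], "Is Post": [0, 1]})
merged_df = merge_master_frames([frame_a, frame_b])
print(merged_df)
```

In practice the frames would come from `pd.read_csv` on each contributor's file.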