
### SCRAPING FROM REDDIT

This script scrapes Reddit, first to collect the **ids** of the posts belonging to one of the following subreddits:

- https://www.reddit.com/r/CrohnsDisease/
- https://www.reddit.com/r/UlcerativeColitis/
- https://www.reddit.com/r/IBD/
- https://www.reddit.com/r/ibs/

To collect the ids, time periods are defined as `(year, month)` tuples. For each subreddit, the ids of the submissions posted in a given period are then retrieved through the Pushshift API (via `psaw`). They are saved to CSV files named after the subreddit and the reference period.

For example, a scrape from 2022-03-01 to 2022-04-01 on the 'IBD' subreddit is saved as `IBD_2022-3_submissions.csv`.

Next, all ids are merged into a single .csv file, adding a 'subreddit' field to each entry.

Finally, the data of each submission (title, author, body, etc.) and its comments are retrieved.
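
The file naming scheme described above can be sketched as a pair of small helpers. The names `ids_filename` and `parse_ids_filename` are hypothetical and do not appear in the notebook. The sketch assumes the month is written without zero padding, as in the `SAVING...` lines printed by the scraping cell.

```python
def ids_filename(subreddit, year, month):
    """Build the CSV name used for one subreddit and one (year, month) period."""
    return f"{subreddit}_{year}-{month}_submissions.csv"


def parse_ids_filename(filename):
    """Recover (subreddit, year, month) from a name built by ids_filename."""
    stem = filename[: -len("_submissions.csv")]
    # Split on the last underscore so the period is always the final part
    subreddit, period = stem.rsplit("_", 1)
    year, month = period.split("-")
    return subreddit, int(year), int(month)


print(ids_filename("IBD", 2022, 3))
print(parse_ids_filename("CrohnsDisease_2021-9_submissions.csv"))
```

Being able to parse the name back is handy when the per-month files are merged later and the subreddit has to be recovered from the file name.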

***

- [x] Retrieve the ids
- [x] Retrieve the submissions
- [ ] Retrieve the first-level comments

In [1]:
import praw
from psaw import PushshiftAPI

import datetime as dt
from tqdm import tqdm

import pandas as pd
import numpy as np
from IPython.display import display, Markdown, Latex

from os import listdir
from os.path import isfile, join
import csv

from tqdm.notebook import tqdm,tnrange
import pickle

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS

In [2]:
import import_ipynb
from reddit_access import client_id, client_secret, password, user_agent, username

#PSAW TO RETRIEVE IDs OF SUBMISSIONS
api = PushshiftAPI()

#PRAW TO RETRIEVE ACTUAL CONVERSATIONS
reddit = praw.Reddit(client_id=client_id, 
                     client_secret=client_secret,
                     password=password, 
                     user_agent=user_agent,
                     username=username)

reddit.user.me()

importing Jupyter notebook from reddit_access.ipynb


Redditor(name='micheledinelli')

In [None]:
# DEFINE YEAR-MONTH RANGES TO LOOK UP 
today = dt.datetime.today()

today_month = today.month
today_year = today.year

start_year = 2020
start_month = 12
dates = []
while start_year <= today_year:
    
    if start_year == today_year and start_month > today_month:
        break
        
    dates.append((start_year, start_month))
    
    start_month += 1
    if start_month > 12:
        start_year += 1
        start_month = 1
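
The loop above can also be written as plain month arithmetic. This is only a sketch: `month_range` is a hypothetical helper, and it takes an explicit end month instead of reading today's date so that the result is reproducible.

```python
def month_range(start_year, start_month, end_year, end_month):
    """Return every (year, month) tuple from the start to the end, inclusive."""
    # Count months from year 0 so that a single integer identifies each month
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    # Integer division and remainder turn the index back into (year, month)
    return [(index // 12, index % 12 + 1) for index in range(first, last + 1)]


print(month_range(2020, 12, 2021, 2))
```

Passing `today.year` and `today.month` as the end arguments would give the same list that the cell above builds in `dates`.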

In [None]:
def retrieve_ids_in_range(year, month, SR, verbose=True):
    print("Scraping", "r/" + SR, "for:", year, "-", month)
    
    # Seconds in a day/hour
    DAY = 60 * 60 * 24
    HOUR = 60 * 60
    
    start = int(dt.datetime(year, month, 1).timestamp())
    if month < 12:
        end = int(dt.datetime(year, month + 1, 1).timestamp())
    else: 
        end = int(dt.datetime(year + 1, 1, 1).timestamp())
    
    start_epoch = start
    end_epoch = start + DAY 
    
    # Search IDs one day at a time (each window is DAY seconds long)
    ids = []
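    # Loop on the window start so the last day is still queried when a
    # daylight-saving change makes the local month shorter than N * DAY
    while start_epoch < end: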
        res = list(api.search_submissions(
                            after = start_epoch,
                            before= end_epoch + HOUR, 
                            subreddit = SR,
                            limit = 100))
        
        print(dt.datetime.fromtimestamp(start_epoch))
        print(dt.datetime.fromtimestamp(end_epoch), '\n')
       
        # Skip the summary when the window is empty: res[-1] would raise IndexError
        if verbose and res:
            print("FROM: ", np.intc(start_epoch).astype("datetime64[s]"), 
                  "TO:", np.intc(end_epoch).astype("datetime64[s]"))        
            print("FIRST: ", np.intc(res[-1].created_utc).astype("datetime64[s]"), 
                  "LAST: ", np.intc(res[0].created_utc).astype("datetime64[s]"))        
            print("number of posts: ", len(res))
        
        for r in res:
            ids.append(r.id)
        
        start_epoch = end_epoch
        end_epoch = start_epoch + DAY
    
    print("SAVING...", SR + "_" + str(year) + "-" + str(month) + "_submissions.csv")
    print(len(ids), " POSTS", "\n")
    pd.DataFrame(ids).drop_duplicates().to_csv("./submissions/" + SR + "_" + str(year) + "-" + str(month) + "_submissions.csv")
    
    return ids
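
The daily windows used by `retrieve_ids_in_range` can be isolated in a small generator. In the printed log the window boundaries shift to `01:00:00` at the end of March, which is consistent with local-time timestamps crossing a daylight-saving change, and the window for March 31 does not seem to appear before the `SAVING...` line. The sketch below assumes that UTC boundaries are acceptable for the analysis. `daily_windows` is a hypothetical name, not part of the notebook.

```python
import datetime as dt

DAY = 60 * 60 * 24  # seconds in a day


def daily_windows(year, month):
    """Yield (after, before) epoch pairs, one per day of the given month, in UTC."""
    start = dt.datetime(year, month, 1, tzinfo=dt.timezone.utc)
    if month < 12:
        end = dt.datetime(year, month + 1, 1, tzinfo=dt.timezone.utc)
    else:
        end = dt.datetime(year + 1, 1, 1, tzinfo=dt.timezone.utc)

    after = int(start.timestamp())
    last = int(end.timestamp())
    # UTC days always last DAY seconds, so the windows tile the month exactly
    while after < last:
        yield after, after + DAY
        after += DAY


windows = list(daily_windows(2021, 3))
print(len(windows))
print(dt.datetime.fromtimestamp(windows[-1][1], tz=dt.timezone.utc))
```

Each pair could be passed as `after` and `before` to the search call. The one-hour overlap added in the original cell is left out here because the windows already touch.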

In [3]:
subreddits = ['CrohnsDisease', 'UlcerativeColitis', 'IBD', 'ibs']

In [5]:
for subreddit in subreddits:
    for year, month in dates:
        retrieve_ids_in_range(year = year, month = month, SR = subreddit, verbose = False)

Scraping r/CrohnsDisease for: 2020 - 12
2020-12-01 00:00:00
2020-12-02 00:00:00 

2020-12-02 00:00:00
2020-12-03 00:00:00 

[...]

2021-03-28 00:00:00
2021-03-29 01:00:00 

2021-03-29 01:00:00
2021-03-30 01:00:00 

2021-03-30 01:00:00
2021-03-31 01:00:00 

SAVING... CrohnsDisease_2021-3_submissions.csv
600  POSTS 

Scraping r/CrohnsDisease for: 2021 - 4
[...]

SAVING... CrohnsDisease_2021-9_submissions.csv
989  POSTS 

Scraping r/CrohnsDisease for: 2021 - 10
[...]

SAVING... CrohnsDisease_2022-3_submissions.csv
1168  POSTS 

Scraping r/CrohnsDisease for: 2022 - 4
[...]

SAVING... CrohnsDisease_2022-9_submissions.csv
963  POSTS 

Scraping r/UlcerativeColitis for: 2020 - 12
[...]

SAVING... UlcerativeColitis_2021-5_submissions.csv
864  POSTS 

Scraping r/UlcerativeColitis for: 2021 - 6
[...]

SAVING... UlcerativeColitis_2021-11_submissions.csv
777  POSTS 

Scraping r/UlcerativeColitis for: 2021 - 12
[...]

SAVING... UlcerativeColitis_2022-5_submissions.csv
902  POSTS 

Scraping r/UlcerativeColitis for: 2022 - 6
[...]

SAVING... IBD_2021-1_submissions.csv
188  POSTS 

Scraping r/IBD for: 2021 - 2
[...]

SAVING... IBD_2021-7_submissions.csv
212  POSTS 

Scraping r/IBD for: 2021 - 8
[...]

SAVING... IBD_2022-1_submissions.csv
173  POSTS 

Scraping r/IBD for: 2022 - 2
[...]

SAVING... ibs_2022-4_submissions.c

In [4]:
mypath = "./submissions/"
files = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]

merged_file = []
for filename in files:
    with open(filename, 'r') as csv_file:
        file = csv.reader(csv_file)
        next(file, None)  # skip the header row written by to_csv
        for row in file:
            merged_file.append(row)
            
merged_df = pd.DataFrame(merged_file, columns=['index', 'id'])
merged_df.drop(columns=['index'], inplace=True)
merged_df.to_csv('all_ids.csv')

In [5]:
import os 

all_ids = pd.read_csv('all_ids.csv')
all_ids.drop(columns=['Unnamed: 0'], inplace=True)

y = []
for subreddit in subreddits:
    x = []
    for fname in os.listdir('./submissions/'):
        if subreddit in fname:
            x.append(pd.read_csv(join('./submissions/', fname),index_col=0))
    
    x = pd.concat(x)
    x['subreddit'] = subreddit
    y.append(x)

y = pd.concat(y)
y.to_csv('all_ids_withsr.csv')
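
The two merge cells above can be combined into a single pass that tags every id with its subreddit and drops duplicates. This is a sketch using only the standard library. `merge_id_files` is a hypothetical name, and the code assumes that subreddit names contain no underscore and that each file starts with the header row that `DataFrame.to_csv` writes.

```python
import csv
import tempfile
from pathlib import Path


def merge_id_files(folder, subreddits):
    """Collect unique (id, subreddit) pairs from every per-month CSV in folder."""
    seen = set()
    rows = []
    for path in sorted(Path(folder).glob("*_submissions.csv")):
        # The subreddit is the part of the file name before the first "_"
        subreddit = path.name.split("_", 1)[0]
        if subreddit not in subreddits:
            continue
        with open(path, newline="") as csv_file:
            reader = csv.reader(csv_file)
            next(reader, None)  # skip the header row
            for row in reader:
                if len(row) < 2:
                    continue  # ignore blank or malformed lines
                pair = (row[1], subreddit)
                if pair not in seen:
                    seen.add(pair)
                    rows.append(pair)
    return rows


# Small demonstration on two throwaway files
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "IBD_2021-1_submissions.csv").write_text(",0\n0,k4wsmz\n1,k4wjyp\n")
    Path(folder, "ibs_2021-1_submissions.csv").write_text(",0\n0,k4x6e3\n1,k4x6e3\n")
    merged = merge_id_files(folder, ["IBD", "ibs"])

print(merged)
```

Matching on the prefix of the file name rather than on a substring avoids accidental matches if a subreddit name is ever contained in another one.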

In [6]:
all_ids_withsr = pd.read_csv('all_ids_withsr.csv')
all_ids_withsr.drop(columns = ['Unnamed: 0'], inplace = True)
all_ids_withsr.drop_duplicates(inplace = True)
all_ids_withsr.groupby('subreddit').count()

subreddit,0
CrohnsDisease,21475
IBD,3908
UlcerativeColitis,16680
ibs,36583


In [22]:
ibd_ids = all_ids_withsr[all_ids_withsr['subreddit'] == 'IBD'].reset_index()
ibd_ids.rename(columns = {'0': 'id'}, inplace = True)

ibd_posts = []
for id_ in tqdm (ibd_ids['id'], desc="Loading from subreddit IBD"):
    post = reddit.submission(id_)
    ibd_posts.append([post.title, post.author, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
df_ibd = pd.DataFrame(ibd_posts, columns = ['title', 'author', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
df_ibd['created'] = df_ibd['created'].apply(lambda x: dt.datetime.fromtimestamp(x))

df_ibd.to_csv('./submissions_scraped/ibd_submissions.csv')
df_ibd.head(10)

Loading from subreddit IBD:   0%|          | 0/3908 [00:00<?, ?it/s]

Unnamed: 0,title,author,score,id,subreddit,url,num_comments,body,created
0,Anyone else get really intense cramps in the ....,,22,k4wsmz,IBD,,7,[deleted],2020-12-02 00:20:31
1,"Been off Accutane for 2 years, have had consti...",OptimalSyrup193,1,k4wjyp,IBD,https://www.reddit.com/r/IBD/comments/k4wjyp/b...,0,I finally had enough and decides to go to a ga...,2020-12-02 00:07:39
2,Anyone on a JAK inhibitor trial?,abundance-and-joy,7,k4vqri,IBD,https://www.reddit.com/r/IBD/comments/k4vqri/a...,7,Doc is recommending JAK inhibitor to treat Cro...,2020-12-01 23:26:45
3,Has anyone experienced this? Symptoms changing...,rachelcalabresi,10,k4vpft,IBD,https://www.reddit.com/r/IBD/comments/k4vpft/h...,5,I hope this message doesn’t upset anyone in th...,2020-12-01 23:24:59
4,[FYI] Crohns & Colitis Foundation - #GivingTue...,visualoptimism,17,k4njtg,IBD,https://online.crohnscolitisfoundation.org/sit...,3,,2020-12-01 16:56:21
5,Does IBD affect stomach fat?,Kurtinho,7,k4gu2a,IBD,https://www.reddit.com/r/IBD/comments/k4gu2a/d...,13,"I’m 22, been diagnosed since 17 and since I’ve...",2020-12-01 09:14:49
6,Survey Results and Videos on IBD,GISociety,4,k5k88f,IBD,https://www.reddit.com/r/IBD/comments/k5k88f/s...,0,"Last year, I posted in this community about a ...",2020-12-02 23:36:58
7,Head feels weird after Stelara infusion,MarshmallowCat14,7,k5ir5w,IBD,https://www.reddit.com/r/IBD/comments/k5ir5w/h...,13,"I posted a longer post in the UC sub, but thou...",2020-12-02 22:22:35
8,Just began taking 25mg of Methotrexate once a ...,,7,k5gey6,IBD,,3,[deleted],2020-12-02 20:31:08
9,[deleted by user],,40,k5fwow,IBD,,2,[removed],2020-12-02 20:07:36


In [25]:
ulc_col_ids = all_ids_withsr[all_ids_withsr['subreddit'] == 'UlcerativeColitis'].reset_index()
ulc_col_ids.rename(columns = {'0': 'id'}, inplace = True)

ulc_col_posts = []
for id_ in tqdm (ulc_col_ids['id'], desc="Loading from subreddit Ulcerative Colitis"):
    post = reddit.submission(id_)
    ulc_col_posts.append([post.title, post.author ,post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
df_ulc_col = pd.DataFrame(ulc_col_posts, columns = ['title', 'author', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
df_ulc_col['created'] = df_ulc_col['created'].apply(lambda x: dt.datetime.fromtimestamp(x))

df_ulc_col.to_csv('./submissions_scraped/ulc_col_submissions.csv')
df_ulc_col.head(10)

Loading from subreddit Ulcerative Colitis:   0%|          | 0/16680 [00:00<?, ?it/s]

Unnamed: 0,title,author,score,id,subreddit,url,num_comments,body,created
0,Berberine,iserd,2,k4vtmp,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,3,Hi all. I am in remission currently taking mes...,2020-12-01 23:30:39
1,Has anyone experienced this? Symptoms changing...,rachelcalabresi,2,k4vq6z,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,5,I hope this message doesn’t upset anyone in th...,2020-12-01 23:26:03
2,[deleted by user],,2,k4uvuj,UlcerativeColitis,,5,[removed],2020-12-01 22:43:52
3,running to the toilet be like,wheelieboii,142,k4uceq,UlcerativeColitis,https://v.redd.it/qefc6j7e6n261,7,,2020-12-01 22:17:28
4,Mesalamine Enemas,SS13908,3,k4rnfg,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,9,I just took these enemas for four weeks to cal...,2020-12-01 20:08:55
5,Elimination diet,,3,k4pz3y,UlcerativeColitis,,11,[deleted],2020-12-01 18:50:45
6,Mesalamine side effect? Head feels hot at time...,Throwawayhippo12,5,k4oz78,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,8,As the title says I’ve noticed that my forehea...,2020-12-01 18:04:27
7,Ulcerative colitis and suicidal thoughts,colitispatient,74,k4mysk,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,68,"Hey everyone, i was diagnosed with ulcerative ...",2020-12-01 16:27:51
8,Supplement question,BreakfastExpress,2,k4myh4,UlcerativeColitis,https://www.reddit.com/r/UlcerativeColitis/com...,9,Has anyone tried indigo naturalis/ qing dai? I...,2020-12-01 16:27:24
9,[FYI] Crohn's & Colitis Foundation #GivingTues...,visualoptimism,10,k4mxqs,UlcerativeColitis,https://online.crohnscolitisfoundation.org/sit...,0,,2020-12-01 16:26:23


In [26]:
crohns_disease_ids = all_ids_withsr[all_ids_withsr['subreddit'] == 'CrohnsDisease'].reset_index()
crohns_disease_ids.rename(columns = {'0': 'id'}, inplace = True)

crohns_disease_posts = []
for id_ in tqdm (crohns_disease_ids['id'], desc="Loading from subreddit CrohnsDisease"):
    post = reddit.submission(id_)
    crohns_disease_posts.append([post.title, post.author, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
df_crohns_disease = pd.DataFrame(crohns_disease_posts, columns = ['title', 'author', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
df_crohns_disease['created'] = df_crohns_disease['created'].apply(lambda x: dt.datetime.fromtimestamp(x))

df_crohns_disease.to_csv('./submissions_scraped/crohns_disease_submissions.csv')
df_crohns_disease.head(10)

Loading from subreddit CrohnsDisease:   0%|          | 0/21475 [00:00<?, ?it/s]

Unnamed: 0,title,author,score,id,subreddit,url,num_comments,body,created
0,Ibd awareness week! Crohns and colitis website...,,48,k4xgh0,CrohnsDisease,,6,[deleted],2020-12-02 00:57:04
1,Not diagnosed. Looking for advice while waiting,,2,k4x6p1,CrohnsDisease,,2,[deleted],2020-12-02 00:42:02
2,Painkillers....,ChuckFiasco,3,k4w6h7,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,13,I’ve been battling a flare for the better part...,2020-12-01 23:48:33
3,Anyone try the trial for JAK inhibitor to trea...,abundance-and-joy,3,k4vof1,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,5,Doc recommending JAK inhibitor and looking for...,2020-12-01 23:23:30
4,Why is it so complicated just to get treatment...,VurucaAssault,9,k4vim5,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,6,I was dx in 2009 and it has never been a simpl...,2020-12-01 23:15:18
5,Intermittent FMLA,Tmoore188,1,k4t4xh,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,4,I am initiating the process of applying for in...,2020-12-01 21:18:35
6,Colonoscopy Findings,cheetoPalmer,2,k4rodi,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,5,I was diagnosed with Crohn’s Disease in 2016. ...,2020-12-01 20:10:09
7,DAE Bolt their food?,AJClarkson,1,k4riy4,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,1,"For a long time, my gag reflex was on a hair t...",2020-12-01 20:03:18
8,Chalky stools?,Constant-gardener89,1,k4qapv,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,9,Has anyone else experienced this? Instead of b...,2020-12-01 19:05:36
9,Chron’s fever - is it avoidable?,MCHFA,1,k4px8u,CrohnsDisease,https://www.reddit.com/r/CrohnsDisease/comment...,2,"Dearest community,\n\nI hope you’re all doing ...",2020-12-01 18:48:16


In [7]:
ibs_ids = all_ids_withsr[all_ids_withsr['subreddit'] == 'ibs'].reset_index()
ibs_ids.rename(columns = {'0': 'id'}, inplace = True)

ibs_posts = []
for id_ in tqdm (ibs_ids['id'], desc="Loading from subreddit ibs"):
    post = reddit.submission(id_)
    ibs_posts.append([post.title, post.author, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
    
df_ibs = pd.DataFrame(ibs_posts, columns = ['title', 'author', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
df_ibs['created'] = df_ibs['created'].apply(lambda x: dt.datetime.fromtimestamp(x))

df_ibs.to_csv('./submissions_scraped/ibs_submissions.csv')
df_ibs.head(10)

Loading from subreddit ibs:   0%|          | 0/36583 [00:00<?, ?it/s]

Unnamed: 0,title,author,score,id,subreddit,url,num_comments,body,created
0,Align Probiotic Gas,Complete_Lack4089,6,k4x6e3,ibs,https://www.reddit.com/r/ibs/comments/k4x6e3/a...,9,Does Align Prebiotic + Probiotic give anyone e...,2020-12-02 00:41:35
1,Does chik-fil-a hurt anyone else’s stomach?,,2,k4wdaq,ibs,https://www.reddit.com/r/ibs/comments/k4wdaq/d...,6,Anytime I eat at this place I get heartburn/re...,2020-12-01 23:58:09
2,Has any one found the whole30 diet helps? I us...,Lordus_,1,k4w98g,ibs,https://www.reddit.com/r/ibs/comments/k4w98g/h...,0,,2020-12-01 23:52:25
3,Can anyone with IBS and constipation give advice?,,1,k4w1i5,ibs,,9,[deleted],2020-12-01 23:41:28
4,🍇,,1,k4voga,ibs,,4,[deleted],2020-12-01 23:23:32
5,Relatable,,8,k4ufr1,ibs,,0,[deleted],2020-12-01 22:22:07
6,How to gain weight with all these restrictions?,rlopes528,3,k4ubao,ibs,https://www.reddit.com/r/ibs/comments/k4ubao/h...,13,"So, I'm on a zero lactose, zero gluten, low su...",2020-12-01 22:15:51
7,"Help, Help, Help ):",Sensitive_Bottle,1,k4tuqo,ibs,https://www.reddit.com/r/ibs/comments/k4tuqo/h...,11,Hi guys new here and my IBS is actually insane...,2020-12-01 21:52:37
8,Gurgulations,intangible-tangerine,12,k4tgee,ibs,https://www.reddit.com/r/ibs/comments/k4tgee/g...,0,"""Cow's milk and ewe's milk..... is not good fo...",2020-12-01 21:33:47
9,Can’t win,georgiaokeeffer,4,k4t783,ibs,https://www.reddit.com/r/ibs/comments/k4t783/c...,5,Just feeling frustrated and defeated after a c...,2020-12-01 21:21:39
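
The four cells above repeat the same steps for each subreddit, so the row-building part can be factored into one function. The sketch below is an assumption-laden illustration rather than the notebook's code: `submission_to_row` and `COLUMNS` are hypothetical names, it reads `created_utc` instead of `created` and converts it to an explicit UTC datetime, and it is exercised on a stand-in object so that it runs without network access.

```python
import datetime as dt
from types import SimpleNamespace

COLUMNS = ["title", "author", "score", "id", "subreddit", "url",
           "num_comments", "body", "created"]


def submission_to_row(post):
    """Flatten one submission-like object into a list matching COLUMNS."""
    # Posts from deleted accounts have no author
    author = str(post.author) if post.author is not None else None
    created = dt.datetime.fromtimestamp(post.created_utc, tz=dt.timezone.utc)
    return [post.title, author, post.score, post.id, str(post.subreddit),
            post.url, post.num_comments, post.selftext, created]


# Stand-in object with the attributes the function reads (not real data)
fake_post = SimpleNamespace(
    title="Example title", author=None, score=1, id="abc123",
    subreddit="IBD", url="https://example.com", num_comments=0,
    selftext="[deleted]", created_utc=1606867200,
)
row = submission_to_row(fake_post)
print(dict(zip(COLUMNS, row)))
```

In the notebook, the same function could be applied to each object returned by `reddit.submission(id_)` inside a single loop over the four subreddits.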
