<p> <font face='Nunito'>

# Scraping
This notebook uses PMAW to fetch archived comments that Pushshift retrieved immediately after they were created, and PRAW to retrieve live data from Reddit. This way, we capture comments before they can be deleted, and then complement the set with up-to-date information on the comments, their submissions, and their users.


#### Requirements
* praw_functions.py
* hate_terms.csv
* reddit_auth.py (your reddit credentials in a python script)

#### Generates
* comments_df.csv
* submissions_df.csv
* users_df.csv
* ref_sample.csv
* log.csv (temporary)
* comments_raw.csv (temporary)
* new_comments_stats.csv (temporary)
<br/>

##### Links and documentation
Pushshift API [here](https://reddit-api.readthedocs.io/en/latest/) <br/>
PRAW API [here](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html)<br/>
PMAW documentation [here](https://github.com/mattpodolak/pmaw)
</font> </p>

In [None]:
!pip install pmaw
!pip install praw

from pmaw import PushshiftAPI
import pandas as pd
import joblib
from praw_functions import *

from collections import defaultdict
from tqdm import tqdm

RANDOM_SEED=697



## 1. Comments

### Comments from Pushshift (via pmaw)

In [None]:
# features to retrieve
our_filter = ['author','author_flair_type','author_fullname','author_premium',
              'body','body_sha1','controversiality','created_utc','distinguished',
              'gilded','id','is_submitter','link_id', 'locked','parent_id',
              'permalink','retrieved_utc','subreddit','subreddit_id',
              'subreddit_name_prefixed','subreddit_type'
             ]

In [None]:
# terms for our filter
hate_path = 'hate_terms.csv'
hate_terms = pd.read_csv(hate_path)

our_terms = '|'.join(hate_terms.term)

In [None]:
api = PushshiftAPI()

In [None]:
date_range = pd.date_range(start='2022-01-01 00:00:00', 
                           end='2022-02-01 00:00:00', 
                           freq='H').to_list()

In [None]:
#---------------------------------------------------------------------------
# Use the log below to check num_items retrieved for each epoch.
# If an epoch returned 1000 items, the scraper hit the limit:
# go back to that epoch, divide it into smaller epochs, and scrape all the comments.
#---------------------------------------------------------------------------

# log = pd.DataFrame({'time':[], 'epoch':[], 'num_items':[]})

In [None]:
limit_=1000
epoch_=3600

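for i, after_ in enumerate(date_range):
    # convert the pandas Timestamp to epoch seconds so epoch_ can be added to it
    after_ts = int(after_.timestamp())
    data = api.search_comments(q=our_terms,
                               limit=limit_,
                               after=after_ts,
                               before=after_ts + epoch_,
                               filter=our_filter)
    df = pd.DataFrame(data)
    if i==0:
        comments_raw = df.copy()
    else:
        # DataFrame.append was removed in pandas 2.0; concatenate instead
        comments_raw = pd.concat([comments_raw, df], ignore_index=True)
    # save after every epoch so progress survives an interruption
    comments_raw.to_csv('../data/raw/comments.csv', index=False)
    # an epoch that reached the limit needs to be re-scraped in smaller epochs
    if df.shape[0]>999:
        update_log(after_)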

In [None]:
comments_raw.shape

In [None]:
log[log.num_items>999]
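
For epochs that reached the 1000-item limit, the comment block above suggests re-scraping them in smaller epochs. A minimal sketch of that splitting step, assuming the bounds are Unix timestamps in seconds; `split_epoch` is a hypothetical helper for illustration, not part of pmaw or `praw_functions.py`:

```python
def split_epoch(after, before, parts):
    """Split the window [after, before) into `parts` contiguous sub-windows.

    Bounds are Unix timestamps in seconds (hypothetical helper).
    """
    if parts < 1 or before <= after:
        raise ValueError("need parts >= 1 and before > after")
    total = before - after
    # integer edges, so the sub-windows cover the window with no gaps or overlaps
    edges = [after + (total * k) // parts for k in range(parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


# one saturated hour starting 2022-01-01 00:00:00 UTC, split into four 15-minute windows
windows = split_epoch(1640995200, 1640995200 + 3600, 4)
for sub_after, sub_before in windows:
    print(sub_after, sub_before)
```

Each pair could then be passed as `after` and `before` to a new search. Dropping duplicate `id` values afterwards guards against a comment that falls exactly on a boundary being returned twice.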

In [None]:
max(comments_raw.retrieved_utc - comments_raw.created_utc)

7889.0
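
The value above is the largest gap, in seconds, between a comment being created and Pushshift retrieving it. A small sketch that converts it to hours, minutes, and seconds with the standard library (the number is copied from the output above):

```python
from datetime import timedelta

max_lag_seconds = 7889.0  # copied from the cell output above
max_lag = timedelta(seconds=max_lag_seconds)
print(max_lag)
```

So the slowest retrieval in this sample happened a little over two hours after the comment was posted.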

### Comments directly from Reddit (using praw)

In [None]:
c = pd.read_csv('../data/raw/comments.csv')

In [None]:
step = 100
new_comments_stats = update_comments(c['id'][0])

for i in range(1, c.shape[0], step):
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    new_comments_stats = pd.concat([new_comments_stats, update_comments(c['id'][i:i+step])], ignore_index=True)

In [None]:
print(new_comments_stats.shape)
print(new_comments_stats['id'].nunique())
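
The batching used above (one request per `step` ids) can be expressed as a small generator. This is only a sketch; `chunks` is a hypothetical helper and the ids are made up:

```python
def chunks(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = ['c%d' % n for n in range(7)]  # made-up comment ids
batches = list(chunks(ids, 3))
print(batches)
```

Starting the range at 0 means the first element no longer needs a separate call before the loop.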

#### Combine comments

In [None]:
comments_df = c.merge(new_comments_stats, on='id')
# to_csv returns None, so do not assign its result back to comments_df
comments_df.to_csv('../data/raw/comments.csv', index=False)
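
Note that `merge(..., on='id')` performs an inner join by default, so archived comments with no match in the PRAW data are dropped without warning. A hedged sketch of how one might check for such rows, using toy DataFrames whose columns and values are illustrative assumptions:

```python
import pandas as pd

# toy stand-ins for the Pushshift archive and the PRAW refresh (made-up values)
archived = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'body': ['x', 'y', 'z']})
refreshed = pd.DataFrame({'id': ['a1', 'c3'], 'score': [5, -2]})

# how='left' keeps every archived comment; indicator=True adds a `_merge` column
checked = archived.merge(refreshed, on='id', how='left', indicator=True)
missing_ids = checked.loc[checked['_merge'] == 'left_only', 'id'].tolist()
print(missing_ids)
```

Ids listed in `missing_ids` exist only in the archive and would be absent from the inner-join result.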

## 2. Submissions

In [None]:
c = pd.read_csv('../data/raw/comments.csv')
id_list = [x[3:] for x in c['link_id'].unique()]
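
The slice `x[3:]` above removes the `t3_` type prefix that Reddit puts in front of submission ids in `link_id`. A slightly more defensive sketch, assuming every value should carry that prefix; `strip_fullname_prefix` is a hypothetical helper and the ids are made up:

```python
def strip_fullname_prefix(fullname, expected='t3_'):
    """Return the bare id from a Reddit fullname such as 't3_abc123'."""
    if not fullname.startswith(expected):
        raise ValueError('unexpected fullname: %r' % fullname)
    return fullname[len(expected):]


link_ids = ['t3_rt1abc', 't3_rt1abc', 't3_rt2xyz']  # made-up link_id values
submission_ids = sorted({strip_fullname_prefix(x) for x in link_ids})
print(submission_ids)
```

Raising on an unexpected prefix makes it obvious if a comment id (`t1_`) ever ends up in the column by mistake.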

In [None]:
step = 25

sub_df = get_submissions_data(id_list[0])
for i in range(1, len(id_list), step):
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    sub_df = pd.concat([sub_df, get_submissions_data(id_list[i:i+step])], ignore_index=True)
    sub_df.to_csv('../data/raw/submissions.csv', index=False)

In [None]:
print(sub_df.shape)
print(sub_df['id'].nunique())

## 3. Users

In [None]:
c = pd.read_csv('../data/raw/comments.csv')
user_list = list(c.author.unique())

In [None]:
step = 50

users_df = get_users_data(user_list[0])

for i in range(1, len(user_list), step):
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    users_df = pd.concat([users_df, get_users_data(user_list[i:i+step])], ignore_index=True)
    users_df.to_csv('../data/raw/users.csv', index=False)

In [None]:
print(users_df.shape)
print(users_df.author.nunique())

## 4. Sample comments for reference - Pushshift

We'll use a reference dataset to compare the gender distribution of our primary dataset, which is filtered by our hate terms, with the gender distribution of comments scraped over the same period without that filter.
Pushshift doesn't support random sampling, so our strategy is to scrape the first 70 comments from each hour of each day of January 2022, for a total of approximately 50,000 comments, close to a quarter of the size of our primary dataset.
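
As a quick check of the sample size, the arithmetic can be worked out for the hourly range used below (2022-01-01 00:00 to 2022-01-31 23:00, both endpoints included):

```python
from datetime import datetime, timedelta

start = datetime(2022, 1, 1, 0)
end = datetime(2022, 1, 31, 23)

# number of hourly windows, counting both the first and the last hour
hours = int((end - start) / timedelta(hours=1)) + 1
per_hour = 70
expected_total = hours * per_hour
print(hours, expected_total)
```

That gives 744 hourly windows and an upper bound of 52,080 comments, which matches the "approximately 50,000" figure. Hours that return fewer than 70 comments lower the actual total.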

In [None]:
import pandas as pd
import requests
import json
from datetime import datetime 
from tqdm import tqdm
import time

In [None]:
date_range = pd.date_range(start='2022-01-01 00:00:00', 
                           end='2022-01-31 23:00:00', 
                           freq='H').to_list()

In [None]:
our_filter = 'body,created_utc,id,link_id,subreddit,author'

In [None]:
size_ = 70
i = 0

# first request: initialise ref_sample with the first hour of the range
date = date_range[0]
after_ = int(date.timestamp())
before_ = after_ + 3600
url = ('https://api.pushshift.io/reddit/comment/search?q=all&after='
       + str(after_) + '&before=' + str(before_)
       + '&size=' + str(size_) + '&filter=' + our_filter)
r = requests.get(url)
data = json.loads(r.text, strict=False)
i += 1
ref_sample = pd.DataFrame(data['data'])

In [None]:
for date in tqdm(date_range[1:]):
    after_ = int(date.timestamp())
    before_ = after_ + 3600

    # pause for 30 seconds after every 10 requests
    if i%10==0:
        time.sleep(30)

    url = ('https://api.pushshift.io/reddit/comment/search?q=all&after='
           + str(after_) + '&before=' + str(before_)
           + '&size=' + str(size_) + '&filter=' + our_filter)

    r = requests.get(url)
    data = json.loads(r.text, strict=False)
    i +=1
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    ref_sample = pd.concat([ref_sample, pd.DataFrame(data['data'])], ignore_index=True)
    # save after every request so progress survives an interruption
    ref_sample.to_csv('../data/raw/reference.csv', index=False)


100%|██████████| 743/743 [1:05:50<00:00,  5.32s/it]
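
The request URL assembled by string concatenation in the loops above could also be built from a dictionary of parameters. A minimal sketch using only the standard library; `build_url` is a hypothetical helper, and the parameter values mirror the ones used in this section:

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/comment/search'


def build_url(after, before, size, fields):
    """Return the search URL for one window (hypothetical helper)."""
    params = {'q': 'all', 'after': after, 'before': before,
              'size': size, 'filter': fields}
    # safe=',' keeps the comma-separated field list readable in the URL
    return BASE_URL + '?' + urlencode(params, safe=',')


url = build_url(1640995200, 1640995200 + 3600, 70, 'body,created_utc,id')
print(url)
```

Keeping the parameters in one place makes it harder for the two loops to drift apart.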

