# Download Pushshift Reddit Data
See this on [GitHub](https://github.com/yinleon/doppler_tutorials/blob/master/1-download-data.ipynb) or [NbViewer](https://nbviewer.jupyter.org/github/yinleon/doppler_tutorials/blob/master/1-download-data.ipynb)<br>
By Jansen Derr 2021-02-22<br>
This notebook collects subreddit metadata from Pushshift's REST API and downloads images from Reddit using `requests`.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import warnings

import pandas as pd
from tqdm import tqdm

import config
import context_manager
import data_sources.pushshift as ps
from image_utils import download_media_and_return_dhash, read_image

warnings.filterwarnings('ignore')

A trick in data engineering is creating a dictionary (called the _context_) that defines the destination of all output files. Here we illustrate an example of what it looks like:

In [None]:
def get_subreddit_context(subreddit):
    '''
    Where will data be saved?
    '''
    sub_dir = os.path.join(config.data_dir, subreddit)
    media_dir = os.path.join(config.data_dir, 'media')
    file_subreddit = os.path.join(sub_dir, 'posts.csv.gz')
    file_subreddit_media = os.path.join(sub_dir, 'media.csv.gz')
    
    for _dir in [config.data_dir, sub_dir, media_dir]:
        os.makedirs(_dir, exist_ok=True)
        
    context = {
        'data_dir' : config.data_dir,
        'sub_dir' : sub_dir,
        'media_dir' : media_dir,
        'file_subreddit' : file_subreddit,
        'file_subreddit_media' : file_subreddit_media
    }
    
    return context

We will use functions like `get_subreddit_context` throughout this article. Rather than defining them in each notebook, we keep them all in the `context_manager.py` file.
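
To make the shape of a context concrete, here is a minimal, self-contained sketch of the same pattern. It is not the code in `context_manager.py`: the function name `build_context`, the explicit `data_dir` argument (standing in for `config.data_dir`), and the subreddit name are assumptions made for illustration.

```python
import os
import tempfile


def build_context(data_dir, subreddit):
    """Hypothetical stand-in for get_subreddit_context with an explicit data_dir."""
    sub_dir = os.path.join(data_dir, subreddit)
    media_dir = os.path.join(data_dir, 'media')
    # Create every output directory up front so later steps can just write files.
    for _dir in [data_dir, sub_dir, media_dir]:
        os.makedirs(_dir, exist_ok=True)
    return {
        'data_dir': data_dir,
        'sub_dir': sub_dir,
        'media_dir': media_dir,
        'file_subreddit': os.path.join(sub_dir, 'posts.csv.gz'),
        'file_subreddit_media': os.path.join(sub_dir, 'media.csv.gz'),
    }


# Use a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    example_context = build_context(tmp, 'example_subreddit')
    dirs_created = os.path.isdir(example_context['sub_dir'])
    for key, path in example_context.items():
        print(f'{key:>22}: {os.path.relpath(path, tmp)}')
```

Because every downstream step reads its paths from this one dictionary, changing where data lives only requires changing the function that builds it.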

## Query Pushshift
Pushshift has an open RESTful API that returns JSON records.

For your convenience, we have written a simple Python wrapper, found in `doppler/data_sources/pushshift.py`.
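
For readers curious about what such a wrapper might do, here is a rough sketch of time-based pagination. It is not the code in `pushshift.py`. The endpoint URL, the query parameters (`subreddit`, `after`, `before`, `size`, `sort`, `sort_type`), and the `{'data': [...]}` response shape are assumptions based on how the public Pushshift API was commonly used, and access to the API may have changed since. The names `fetch_page`, `collect_posts`, and `fake_fetch` are hypothetical.

```python
PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission/'  # assumed endpoint


def fetch_page(params):
    """Fetch one page of records (assumed response shape: {'data': [...]})."""
    import requests  # imported lazily so the paging logic can be tried offline
    resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get('data', [])


def collect_posts(subreddit, min_date, max_date, seen_ids=None,
                  fetch=fetch_page, page_size=100):
    """Walk forward in time from min_date to max_date, skipping ids in seen_ids."""
    seen = set(seen_ids or [])
    records = []
    after = min_date
    while after < max_date:
        page = fetch({
            'subreddit': subreddit,
            'after': after,
            'before': max_date,
            'size': page_size,
            'sort': 'asc',
            'sort_type': 'created_utc',
        })
        if not page:
            break
        for post in page:
            if post['id'] not in seen:
                seen.add(post['id'])
                records.append(post)
        # Resume from the newest timestamp seen on this page.
        newest = max(post['created_utc'] for post in page)
        if newest <= after:
            break  # no progress, stop rather than loop forever
        after = newest
    return records


def fake_fetch(params):
    """Offline stand-in for the API: seven posts, returned three at a time."""
    posts = [{'id': f'p{t}', 'created_utc': t} for t in range(1, 8)]
    matching = [p for p in posts
                if params['after'] < p['created_utc'] < params['before']]
    return matching[:3]


new_posts = collect_posts('example', 0, 10, seen_ids={'p2'}, fetch=fake_fetch)
print([p['id'] for p in new_posts])
```

Passing the fetch function as an argument keeps the paging logic testable without network access, and the `seen_ids` set plays the same role as in the notebook: records that were already collected are skipped.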

<hr>
Note:<br>
The Jupyter Notebooks used throughout this article differ a bit from what you might expect. We define code in external Python scripts rather than in the notebook itself. We do this because it makes the code easier to find and share, and lets the notebooks focus on commentary and explanation of the moving parts.
<hr>

In [None]:
verbose = True
subreddit = config.subreddit # change this in config.py
context = get_subreddit_context(subreddit)
min_date = config.start_utc
max_date = config.end_utc

# check if the subreddit has already been collected
if os.path.exists(context['file_subreddit']):
    print('File Exists')
    df = pd.read_csv(context['file_subreddit'], 
                     compression='gzip')
    print(f'{ len(df) } Records exist')
    seen_ids = set(df.id.unique()) # these are records we've already collected
    
    # look for records that we haven't already collected
    records = ps.download_subreddit_posts(subreddit, min_date, max_date,
                                          verbose=verbose,
                                          seen_ids=seen_ids)
    _df = pd.DataFrame(records)
    if verbose:
        print(f"collected { len(_df) } records")
    df = pd.concat([df, _df], sort=False)  # DataFrame.append was removed in pandas 2.0
    df.drop_duplicates(subset=['id'], inplace=True)
    df.sort_values(by=['created_utc'], ascending=False, inplace=True)
    df.to_csv(context['file_subreddit'], index=False, compression='gzip')

# if we've never collected the subreddit, we start a fresh query to download records.
else:
    print("New Subreddit")
    records = ps.download_subreddit_posts(subreddit, min_date, max_date, verbose=verbose)
    if verbose:
        print(f"collected { len(records) } records")
    df = pd.DataFrame(records)
    df.to_csv(context['file_subreddit'], index=False, compression='gzip')

if verbose:
    # Summary stats
    print('\n****************')
    df['created_at'] = pd.to_datetime(df['created_utc'], unit='s')
    print(f"N = { len(df) }\n"
          f"Start Date = { df['created_at'].min() }\n"
          f"End Date = { df['created_at'].max() }")
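
The merge step (add new records, drop duplicate ids, sort newest first) can be tried in isolation. The sketch below uses `pd.concat`, which works across pandas versions; the two small frames and their values are made up for illustration.

```python
import pandas as pd

# Records already on disk (toy data).
existing = pd.DataFrame([
    {'id': 'a1', 'created_utc': 100},
    {'id': 'b2', 'created_utc': 200},
])
# Records returned by a new query (toy data).
fresh = pd.DataFrame([
    {'id': 'b2', 'created_utc': 200},  # already collected
    {'id': 'c3', 'created_utc': 300},  # new
])

merged = (
    pd.concat([existing, fresh], sort=False)
      .drop_duplicates(subset=['id'])
      .sort_values(by=['created_utc'], ascending=False)
      .reset_index(drop=True)
)
print(merged)
```

Deduplicating on `id` makes the collection step safe to re-run: overlapping queries never produce repeated rows.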

## Collect Images
After collecting the metadata for posts in our subreddit of interest, we can download the media shared within it.
In this section, we filter the metadata for media, download all media from the open web, and calculate a fingerprint, or _dhash_, of each image. More on dhashing [here](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html).

We created a high-level function called `download_media_and_return_dhash`, which uses the `requests` library to download an image locally, and `imagehash` to calculate the _dhash_.
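
To give a feel for what a difference hash captures, here is a minimal pure-Python sketch of the idea. It is not the `imagehash` implementation, and its output should not be expected to match `imagehash` bit for bit. It assumes the image has already been converted to grayscale and shrunk to a grid of 8 rows by 9 columns; the helper names and the toy grids are hypothetical.

```python
def dhash_bits(gray_rows):
    """Compare each pixel with its right-hand neighbour: 1 where the left one is brighter."""
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def bits_to_hex(bits):
    """Pack the bits into a fixed-width hexadecimal string."""
    value = int(''.join(str(b) for b in bits), 2)
    width = (len(bits) + 3) // 4
    return f'{value:0{width}x}'


# Toy 8x9 grayscale grids: a left-to-right gradient, a brighter copy, and a mirror image.
gradient = [[col * 10 for col in range(9)] for _ in range(8)]
brighter = [[pixel + 5 for pixel in row] for row in gradient]
mirrored = [list(reversed(row)) for row in gradient]

hash_a = bits_to_hex(dhash_bits(gradient))
hash_b = bits_to_hex(dhash_bits(brighter))
hash_c = bits_to_hex(dhash_bits(mirrored))
print(hash_a, hash_b, hash_c)
```

Only the relative brightness of neighbouring pixels matters, so the uniformly brighter copy gets the same hash as the original while the mirrored image gets a different one. This is why copies of an image that differ only slightly in brightness or encoding tend to share a fingerprint.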

In [None]:
len(df)

In [None]:
# we're only interested in records with a media preview...
df_media = df[~df.preview.isnull()]
len(df_media)

In [None]:
# check if the file exists, and which media records have been downloaded
if os.path.exists(context['file_subreddit_media']):
    df_img_meta = pd.read_csv(context['file_subreddit_media'], 
                              compression='gzip')     
    downloaded_ids = df_img_meta.id  # ids of media records we already have
    df_media = df_media[~df_media.id.isin(downloaded_ids)]

# download new media files
img_meta = []
try:
    for _, row in tqdm(df_media.iterrows(), position=0, leave=True):
        preview = row.get('preview')
        if isinstance(preview, dict):
            images = preview.get('images')
            if not images:
                continue
            for img in images:
                r = row.copy()
                img_url, f_img = context_manager.get_media_context(img, context)
                if not img_url:
                    continue
                d_hash, img_size = download_media_and_return_dhash(img_url, f_img)
                # a size of 0 is treated as deleted media
                r['deleted'] = img_size == 0
                r['d_hash'] = d_hash
                r['f_img'] = f_img
                r['img_size'] = img_size
                img_meta.append(r.to_dict())

except KeyboardInterrupt:
    if verbose:
        print("cancelled early!")


# append to existing records, if they exist, and write to a csv
if os.path.exists(context['file_subreddit_media']):
    _df_img_meta = pd.DataFrame(img_meta)
    df_img_meta = pd.concat([df_img_meta, _df_img_meta], sort=False)
else:
    df_img_meta = pd.DataFrame(img_meta)
    
df_img_meta.to_csv(context['file_subreddit_media'], 
                   index=False, compression='gzip')     

In [None]:
len(df_img_meta)

Now that the images are downloaded, we can read them from disk:

In [None]:
f_img = df_img_meta.f_img.iloc[0]

In [None]:
read_image(f_img)

Since we calculated the dhash of each image, we can count duplicates. Here are the most re-posted images:

In [None]:
most_shared_image_dhashed = df_img_meta[df_img_meta.d_hash != 'NOHASH'].d_hash.value_counts().head(20)
most_shared_image_dhashed

In [None]:
most_posted_dhash = most_shared_image_dhashed.index[0]
most_posted_image_file = df_img_meta[df_img_meta.d_hash == most_posted_dhash].f_img.iloc[0]

In [None]:
read_image(most_posted_image_file)