# Search Submissions
This notebook shows how to use the `search_submissions` method from `PMAW` to retrieve submissions from the Reddit Pushshift API. For more details about the Search Submissions endpoint, see the Pushshift [documentation](https://github.com/pushshift/api#searching-submissions).

In [1]:
import pandas as pd
from pmaw import PushshiftAPI

In [2]:
# instantiate
api = PushshiftAPI()

## Data Preparation

In [3]:
# import test data into a dataframe
posts_df = pd.read_csv('./test_data.csv', delimiter=';', header=0)
posts_df.head(5)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_cakeday,distinguished,suggested_sort,crosspost_parent,crosspost_parent_list,category,top_awarded_type,poll_data,steward_reports,comment_ids
0,[],False,nf_hades,,[],,text,t2_hriq1b,False,False,...,,,,,,,,,,"gjacwx5,gjad2l6,gjadatw,gjadc7w,gjadcwh,gjadgd..."
1,[],False,MyLittleDeku,,[],,text,t2_7dj62vj2,False,False,...,,,,,,,,,,gjacn1r
2,[],False,lilirucaarde12,,[],,text,t2_6i04uaxw,False,False,...,,,,,,,,,,"gjac5fb,gjacdy5,gjaco45,gjasj4f,gjbxfeg"
3,[],False,[deleted],,,,,,,,...,,,,,,,,,,gjac9d6
4,[],False,sirdimpleton,,[],,text,t2_bznmn4i,False,False,...,,,,,,,,,,"gjaocmg,gjb2jsj,gjbisrw,gjbjbk8"


In [4]:
len(posts_df)

2500

The data in `posts_df` contains 2500 submissions and their metadata, extracted from a subreddit submission search; the `comment_ids` column was added after the search with additional requests. For demonstration purposes, the submission IDs from this dataframe are used below, even though the data has already been retrieved.

In [5]:
# create submission ID list
post_ids = list(posts_df.loc[:, 'id'])
post_ids[:3]

['kxi2w8', 'kxi2g1', 'kxhzrl']

## Search Submissions

In [6]:
%%time
posts = api.search_submissions(subreddit="science", limit=1000)

Total:: Success Rate: 90.91% - Requests: 11 - Batches: 2 - Items Remaining: 0
Wall time: 12.4 s


### Using a query string

In [7]:
%%time
# example with passing a query string
posts = api.search_submissions(q="quantum", subreddit="science", limit=1000)

Total:: Success Rate: 100.00% - Requests: 10 - Batches: 1 - Items Remaining: 599
Total:: Success Rate: 100.00% - Requests: 20 - Batches: 2 - Items Remaining: 175
Total:: Success Rate: 100.00% - Requests: 25 - Batches: 3 - Items Remaining: 0
Wall time: 37.7 s


In [9]:
print(f'{len(posts)} posts retrieved')

1000 posts retrieved


Because the `search_submissions` method returns a `Response` object, which is a generator, we store the posts in a list:

In [10]:
post_list = [p for p in posts]
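
The reason for copying a generator into a list can be illustrated with a plain Python generator. This is a minimal sketch, not PMAW code: `fake_search` is a hypothetical stand-in that yields submission-like dictionaries, and whether PMAW's `Response` is exhausted in exactly the same way is an assumption not confirmed here.

```python
def fake_search(n):
    """Hypothetical stand-in for a search result: yields n submission-like dicts."""
    for i in range(n):
        yield {"id": f"post{i}", "title": f"Title {i}"}


results = fake_search(3)

# The first pass consumes the generator and keeps every item in memory.
post_list = [p for p in results]

# A standard generator has nothing left to yield on a second pass.
second_pass = [p for p in results]

print(len(post_list))    # 3
print(len(second_pass))  # 0
```

Once the items are in a list, they can be indexed, iterated repeatedly, or passed to `pd.DataFrame` without re-running the search.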

## Search Submissions by ID

### Using a Single Submission ID

In [11]:
post = api.search_submissions(ids=post_ids[0])

Total:: Success Rate: 100.00% - Requests: 1 - Batches: 1 - Items Remaining: 0


### Using Multiple Submission IDs

In [12]:
%%time
posts = api.search_submissions(ids=post_ids)

Total:: Success Rate: 100.00% - Requests: 3 - Batches: 1 - Items Remaining: 0
Wall time: 4.36 s


In [13]:
print(f'{len(posts)} submissions returned by Pushshift')

2500 submissions returned by Pushshift
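
When searching by ID, it can be useful to confirm that every requested submission actually came back. The sketch below is one way to do that, assuming each returned submission is a dictionary with an `id` key, as in the `id` column of `posts_df`. The helper `find_missing_ids` and the sample `returned_posts` data are hypothetical and are not part of PMAW.

```python
def find_missing_ids(requested_ids, submissions):
    """Return the requested IDs that do not appear among the returned submissions."""
    returned = {s["id"] for s in submissions}
    return [i for i in requested_ids if i not in returned]


# The three IDs shown earlier in this notebook.
requested = ["kxi2w8", "kxi2g1", "kxhzrl"]

# Made-up response in which one submission is absent.
returned_posts = [{"id": "kxi2w8"}, {"id": "kxhzrl"}]

missing = find_missing_ids(requested, returned_posts)
print(missing)  # ['kxi2g1']
```

An empty result means the number of returned submissions matches the number of unique IDs requested.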


### Convert to Dataframe

In [15]:
# convert submissions to dataframe
new_posts_df = pd.DataFrame(post_list)

In [16]:
new_posts_df.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,steward_reports,og_description,og_title,removed_by,rte_mode,author_id,view_count,brand_safe,crosspost_parent,crosspost_parent_list
0,[],False,HeathenLemming,,[],,text,t2_5on10d6u,False,False,...,,,,,,,,,,
1,[],False,clostridium_dead,,[],,text,t2_9uxh3,False,False,...,,,,,,,,,,
2,[],False,RomanTheOmen,,[],,text,t2_4r7za,False,False,...,,,,,,,,,,
