# Search Comments
In this notebook, I will show you how to use the method `search_comments` from `PMAW` to retrieve comments from the Reddit Pushshift API. To view more details about the Search Comments endpoint you can view the Pushshift [documentation](https://github.com/pushshift/api#searching-comments).

In [1]:
import pandas as pd
from pmaw import PushshiftAPI

In [2]:
# instantiate
api = PushshiftAPI()

## Data Preparation

In [9]:
# import test data into a dataframe
posts_df = pd.read_csv(f'./test_data.csv', delimiter=';', header=0)
posts_df.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_cakeday,distinguished,suggested_sort,crosspost_parent,crosspost_parent_list,category,top_awarded_type,poll_data,steward_reports,comment_ids
0,[],False,nf_hades,,[],,text,t2_hriq1b,False,False,...,,,,,,,,,,"gjacwx5,gjad2l6,gjadatw,gjadc7w,gjadcwh,gjadgd..."
1,[],False,MyLittleDeku,,[],,text,t2_7dj62vj2,False,False,...,,,,,,,,,,gjacn1r
2,[],False,lilirucaarde12,,[],,text,t2_6i04uaxw,False,False,...,,,,,,,,,,"gjac5fb,gjacdy5,gjaco45,gjasj4f,gjbxfeg"


In [10]:
len(posts_df)

2500

The data in `posts_df`, contains 2500 submissions and their respective metadata extracted from a subreddit submission search, the comment_ids were added post-search with additional requests.

In [11]:
posts_df.loc[:, 'comment_ids'].isna().sum()

271

In [12]:
# extract comment_ids
comment_ids_str = list(posts_df.loc[posts_df['comment_ids'].notna(), 'comment_ids'])

In [13]:
# convert strings to lists
comment_ids = []
for c_str in comment_ids_str:
    # exclude ending , since all entries include one
    comment_ids.extend(c_str[:-1].split(","))
num_comments = len(comment_ids)
print(f'Ready to retrieve {num_comments} comments')

Ready to retrieve 43377 comments


In [14]:
comment_ids[:3]

['gjacwx5', 'gjad2l6', 'gjadatw']

## Search Comments

In [15]:
%%time
comments = api.search_comments(subreddit="science", limit=6000)

Total:: Success Rate: 100.00% - Requests: 60 - Batches: 6 - Items Remaining: 0
Wall time: 1min 1s


In [16]:
len(comments)

6000

### Using a query string

In [18]:
%%time
# example with passing a query string
comments = api.search_comments(q="GME", subreddit="wallstreetbets", limit=1000)

Total:: Success Rate: 90.91% - Requests: 11 - Batches: 2 - Items Remaining: 0
Wall time: 12.4 s


In [19]:
len(comments)

1000

Since the `search_comments` method returns a `Response` object which is a generator we store the comments in a list using the following code: 

In [20]:
comment_list = [c for c in comments]

## Search Comments by ID

### Using a Single Comment ID

In [21]:
comment = api.search_comments(ids=comment_ids[0])

Total:: Success Rate: 100.00% - Requests: 1 - Batches: 1 - Items Remaining: 0


### Using Multiple Comment IDs

In [22]:
%%time
comments_arr = api.search_comments(ids=comment_ids)

Total:: Success Rate: 68.75% - Requests: 64 - Batches: 7 - Items Remaining: 2229
Wall time: 1min 9s


  f'{self.limit} items were not found in Pushshift')


We can see that when searching for comments by id, that some items are no longer stored in Pushshift and could not be returned.

In [23]:
print(f'{len(comments_arr)} comments returned by Pushshift')

41148 comments returned by Pushshift


### Save Comments to CSV

In [24]:
# convert comments to dataframe
comment_list = [c for c in comments_arr]
comments_df = pd.DataFrame(comment_list)

In [25]:
comments_df.head(3)

Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,author_cakeday
0,[],,,AutoModerator,,,[],,,,...,1610731054,1,False,False,anime,t5_2qh22,,0,[],
1,[],,,Nihhrt,,MAL,[],,http://myanimelist.net/animelist/Nihhrt,dark,...,1610731310,2,True,False,anime,t5_2qh22,,0,[],
2,[],,,[deleted],,,,,,dark,...,1610731314,1,True,False,anime,t5_2qh22,,0,[],


In [26]:
# store the extracted comments into a csv file for later use
comments_df.to_csv('./test_comments.csv', header=True, index=False, columns=list(comments_df.axes[1]))