# Search Comments
In this notebook, I will show you how to use the method `search_comments` from `PMAW` to retrieve comments from the Reddit Pushshift API. To view more details about the Search Comments endpoint you can view the Pushshift [documentation](https://github.com/pushshift/api#searching-comments).

In [1]:
import pandas as pd
from pmaw import PushshiftAPI

In [2]:
# instantiate
api = PushshiftAPI()

## Data Preparation

In [3]:
# import test data into a dataframe
posts_df = pd.read_csv(f'./test_data.csv', delimiter=';', header=0)
posts_df.head(5)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_cakeday,distinguished,suggested_sort,crosspost_parent,crosspost_parent_list,category,top_awarded_type,poll_data,steward_reports,comment_ids
0,[],False,nf_hades,,[],,text,t2_hriq1b,False,False,...,,,,,,,,,,"gjacwx5,gjad2l6,gjadatw,gjadc7w,gjadcwh,gjadgd..."
1,[],False,MyLittleDeku,,[],,text,t2_7dj62vj2,False,False,...,,,,,,,,,,gjacn1r
2,[],False,lilirucaarde12,,[],,text,t2_6i04uaxw,False,False,...,,,,,,,,,,"gjac5fb,gjacdy5,gjaco45,gjasj4f,gjbxfeg"
3,[],False,[deleted],,,,,,,,...,,,,,,,,,,gjac9d6
4,[],False,sirdimpleton,,[],,text,t2_bznmn4i,False,False,...,,,,,,,,,,"gjaocmg,gjb2jsj,gjbisrw,gjbjbk8"


In [4]:
len(posts_df)

2500

The data in `posts_df`, contains 2500 submissions and their respective metadata extracted from a subreddit submission search, the comment_ids were added post-search with additional requests.

In [5]:
posts_df.loc[:, 'comment_ids'].isna().sum()

271

In [6]:
# extract comment_ids
comment_ids_str = list(posts_df.loc[posts_df['comment_ids'].notna(), 'comment_ids'])

In [7]:
# convert strings to lists
comment_ids = []
for c_str in comment_ids_str:
    # exclude ending , since all entries include one
    comment_ids.extend(c_str[:-1].split(","))
num_comments = len(comment_ids)
print(f'Ready to retrieve {num_comments} comments')

Ready to retrieve 43219 comments


In [8]:
comment_ids[:10]

['gjacwx5',
 'gjad2l6',
 'gjadatw',
 'gjadc7w',
 'gjadcwh',
 'gjadgd7',
 'gjadlbc',
 'gjadnoc',
 'gjadog1',
 'gjadphb']

## Search Comments

In [15]:
%%time
comments = api.search_comments(subreddit="science", limit=1000)

13656825 total results available for the selected parameters
Total:: Success Rate: 100.00% - Requests: 10 - Batches: 1
Wall time: 12.4 s


In [16]:
%%time
comments = api.search_comments(q="GME", subreddit="wallstreetbets", limit=1000)

531057 total results available for the selected parameters
Total:: Success Rate: 76.92% - Requests: 13 - Batches: 2
Wall time: 29.1 s


## Search Comments by ID

### Using a Single Comment ID

In [9]:
comment = api.search_comments(ids=comment_ids[0])
comment

Total:: Success Rate: 100.00% - Requests: 1 - Batches: 1


[{'all_awardings': [],
  'approved_at_utc': None,
  'associated_award': None,
  'author': 'AVrandomusic',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_747ea0dh',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'banned_at_utc': None,
  'body': "Who's complaining? I'm a thigh stand!",
  'can_mod_post': False,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'comment_type': None,
  'created_utc': 1610668310,
  'distinguished': None,
  'edited': False,
  'gildings': {},
  'id': 'gjacwx5',
  'is_submitter': False,
  'link_id': 't3_kxi2w8',
  'locked': False,
  'no_follow': False,
  'parent_id': 't3_kxi2w8',
  'permalink': '/r/anime/comments/kxi2w8/stop_complaining_about_the_thighs_in/gjacwx5/',
  'ret

### Using Multiple Comment IDs

In [10]:
%%time
comments_arr = api.search_comments(ids=comment_ids)

Total:: Success Rate: 78.57% - Requests: 56 - Batches: 7
Wall time: 1min 41s


In [11]:
print(f'{len(comments_arr)} comments returned by Pushshift')

40990 comments returned by Pushshift


### Save Comments to CSV

In [12]:
# convert comments to dataframe
comments_df = pd.DataFrame(comments_arr)

In [13]:
comments_df.head(3)

Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,author_cakeday
0,[],,,AutoModerator,,,[],,,,...,1610615731,1,False,True,anime,t5_2qh22,,0,[],
1,[],,,AutoModerator,,,[],,,,...,1610615779,1,False,False,anime,t5_2qh22,,0,[],
2,[],,,[deleted],,,,,,dark,...,1610615794,0,True,False,anime,t5_2qh22,,0,[],


In [14]:
# replace usage of ; in comment bodies
import re
for index, row in comments_df.iterrows():
    row['body'] = re.sub(r';+', '.', row['body'])
    
comments_df.to_csv('./test_comments.csv', sep=';', header=True, index=False, columns=list(comments_df.axes[1]))