### Get comments from a post ID

To get all comments from the post with ID "ucbjgz", query: https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*

The limit is currently capped at 100 comments per request, so we need a loop to retrieve more (see https://www.reddit.com/r/pushshift/comments/qufgqa/get_all_comments_from_a_post_id/)
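
As a minimal sketch of how the query URL above is assembled, the snippet below builds it from a dictionary of parameters instead of string concatenation. The helper name `build_comment_url` and the constant `BASE_URL` are illustrative assumptions, not part of the notebook; no request is sent.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above
BASE_URL = "https://api.pushshift.io/reddit/comment/search/"

def build_comment_url(link_id, after=0, limit=100):
    """Build the comment-search URL for one page of a post's comments (hypothetical helper)."""
    params = {"link_id": link_id, "limit": limit, "q": "*", "after": after}
    # safe="*" keeps the wildcard readable instead of encoding it as %2A
    return BASE_URL + "?" + urlencode(params, safe="*")

print(build_comment_url("ucbjgz"))
```

Passing a later `after` timestamp yields the URL for the next page of up to 100 comments.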

In [1]:
import pandas as pd
import requests
import json
from datetime import datetime
import time
from tqdm import tqdm_notebook
# conda install -c conda-forge tqdm

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
subreddit = 'france'

# Functions

In [3]:
def get_pushshift_data(link_id, after = 0, limit = 100) -> list:
    """
        Query the API for the comments of a post id and return them as a list of dicts.
        The API returns at most 100 comments per call, so posts with more comments need a loop.
        The 'after' argument is the timestamp of the last comment already scraped.
    """
    # The same URL (including 'after') is used for the first call and for the retries
    URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'+'&after='+ str(after)
    r = ''
    try:
        print(URL)
        r = requests.get(URL)
        if r.status_code == 200:
            return json.loads(r.text, strict = False)['data']

        # If fetching the URL failed, retry up to 5 times (sleeping to be safe), then give up
        nb_try = 0
        while r.status_code != 200 and nb_try < 5:
            # A 429 status code may get us banned for 1 hour. See https://www.reddit.com/r/pushshift/comments/uhjzvd/temporary_banning_of_ips_with_high_rates_of_429s/
            if r.status_code == 429:
                print("I had a 429 status code. I'm waiting 3 seconds.")
                time.sleep(3)
            else:
                time.sleep(1)
            print(URL)
            r = requests.get(URL)
            nb_try += 1
        if r.status_code == 200:
            return json.loads(r.text, strict = False)['data']
        return []
    except Exception:
        print('Error while accessing API')
        print(r)
        return []
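
The retry logic in the function above can be isolated and tested without touching the network. The following is only a sketch under assumptions: `fetch_with_retry`, its parameters, and the fake fetcher are hypothetical names introduced for illustration, and the delays simply mirror the 3-second and 1-second waits used in the notebook.

```python
import time

def fetch_with_retry(fetch, max_tries=5, delay_429=3.0, delay_other=1.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or max_tries calls were made.

    fetch is any zero-argument callable returning (status_code, payload).
    Returns the payload on success, or an empty list if every try failed.
    """
    for attempt in range(max_tries):
        status, payload = fetch()
        if status == 200:
            return payload
        # Wait before the next try; wait longer after a rate-limit response
        if attempt < max_tries - 1:
            sleep(delay_429 if status == 429 else delay_other)
    return []

# Usage with a fake fetcher that fails twice before succeeding
responses = iter([(429, None), (500, None), (200, [{"id": "abc"}])])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` makes the waiting observable in a test; in real use the default `time.sleep` applies.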

Example of a post

In [4]:
get_pushshift_data('ucbjgz')

https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=0


[{'all_awardings': [],
  'archived': False,
  'associated_award': None,
  'author': 'AutoModerator',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_6l4z3',
  'author_patreon_flair': False,
  'author_premium': True,
  'body': 'Pas assez de musique sur r/france ? Rejoins-nous sur r/musique et aide nous à développer cette nouvelle communauté\n\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/france) if you have any questions or concerns.*',
  'body_sha1': 'f165f599f1a2ee35433403826535a438667d5ce0',
  'can_gild': True,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'collapsed_reason_code': None,
  'comment_type': None,
  'controversiality': 0,
  '

In [5]:
def collect_one_comment(comment, columns, link_id) -> tuple:
    """
        Return the information of a specific comment as a pandas Series, together with its raw creation timestamp
    """
    author = comment['author']
    text = comment['body']
    commentId = comment['id']
    parent_commentId = comment['parent_id']
    created = comment['created_utc']
    created_date = datetime.fromtimestamp(created)
    permalink = comment['permalink']
    return pd.Series([link_id, author,commentId,text,parent_commentId,created_date,permalink], index = columns), created
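
One detail worth noting: `datetime.fromtimestamp(created)` without a timezone converts the epoch value to the local time of the machine running the notebook, so the `created` column depends on where the code runs. If a timezone-independent value is preferred, a possible variant (an assumption, not what the notebook does) is:

```python
from datetime import datetime, timezone

def to_utc_datetime(created_utc):
    """Convert an epoch timestamp in seconds to a timezone-aware UTC datetime (hypothetical helper)."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

print(to_utc_datetime(0).isoformat())  # 1970-01-01T00:00:00+00:00
```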

In [6]:
def collect_all_comments(link_id):
    """
        Collect all comments of a specific post and return them as a pandas DataFrame.
    """
    columns = ['post_id','authors','commentId','text','parent_commentId','created','permalink']
    rows_list = []

    comments = get_pushshift_data(link_id)
    nb_comments = len(comments)
    for comment in comments:
        pd_comment, after = collect_one_comment(comment, columns, link_id)
        rows_list.append(pd_comment)

    if nb_comments == 100:
        # We subtract one from the timestamp: the last comment is then always fetched twice, but we cannot miss a comment that shares its timestamp
        comments = get_pushshift_data(link_id, after - 1)
        nb_comments = len(comments)
    else: nb_comments = 0

    # Each call returns at most 100 comments, so we loop, moving the 'after' timestamp forward, until no comments are left
    while nb_comments > 0:
        # Collect the comments of this page as rows of the future DataFrame
        for comment in comments:
            pd_comment, after = collect_one_comment(comment, columns, link_id)
            rows_list.append(pd_comment)
        # Same one-second overlap as above
        comments = get_pushshift_data(link_id, after - 1)
        # Do not count the overlapping comment that is fetched twice
        nb_comments = len(comments) - 1

    return pd.DataFrame(rows_list, columns=columns)
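
Because each page restarts one second before the last timestamp, the overlapping comment ends up in the result twice. The sketch below shows one way to page with the same overlap while keeping each comment only once, keyed by its `id`. It is an illustration under assumptions: `fetch_all` and the in-memory `fake_fetch` are hypothetical, and the fake data stands in for the API.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every comment by paging on 'created_utc', dropping duplicates by id.

    fetch_page(after) must return up to page_size comments created after 'after',
    sorted by ascending 'created_utc'.
    """
    seen = {}
    after = 0
    while True:
        page = fetch_page(after)
        new = [c for c in page if c["id"] not in seen]
        for c in new:
            seen[c["id"]] = c
        # Stop on a short page, or when a full page brought nothing new
        if len(page) < page_size or not new:
            break
        # Restart one second earlier so comments sharing a timestamp are not missed
        after = max(c["created_utc"] for c in page) - 1
    return list(seen.values())

# Fake data: comments "c" and "d" share the same timestamp across a page boundary
DATA = [{"id": i, "created_utc": t}
        for i, t in zip("abcdefg", [10, 11, 12, 12, 13, 14, 15])]

def fake_fetch(after, page_size=3):
    return [c for c in DATA if c["created_utc"] > after][:page_size]

comments = fetch_all(fake_fetch, page_size=3)
print([c["id"] for c in comments])
```

With a DataFrame, a similar effect could be obtained by calling `drop_duplicates(subset='commentId')` on the result.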

In [7]:
collect_all_comments('qq0oz8')

https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636476517
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636617851


Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,qq0oz8,Soviet75,hjx50jq,*Rage en américain*,t3_qq0oz8,2021-11-09 11:21:25,/r/france/comments/qq0oz8/guide_comment_mesure...
1,qq0oz8,mooothemadcow,hjx50uk,Je me suis pissé dessus tellement j'ai ri.\n\n...,t3_qq0oz8,2021-11-09 11:21:33,/r/france/comments/qq0oz8/guide_comment_mesure...
2,qq0oz8,nyme-me,hjx6twj,C'est vrai! Malheureusement pas tout à fait po...,t3_qq0oz8,2021-11-09 11:47:24,/r/france/comments/qq0oz8/guide_comment_mesure...
3,qq0oz8,plouky,hjx6zob,Aire &gt; système métrique et terrain de foot...,t3_qq0oz8,2021-11-09 11:49:38,/r/france/comments/qq0oz8/guide_comment_mesure...
4,qq0oz8,Pb_Flo,hjx744l,"Médecine, aéronautique, hydraulique etc. rentr...",t3_qq0oz8,2021-11-09 11:51:19,/r/france/comments/qq0oz8/guide_comment_mesure...
...,...,...,...,...,...,...,...
150,qq0oz8,rakoo,hk1uwiq,Il parait que c'est la méthode que les arabes ...,t1_hjxtyun,2021-11-10 11:06:28,/r/france/comments/qq0oz8/guide_comment_mesure...
151,qq0oz8,zdimension,hk1ww3b,Instinctivement j'appellerais ça un diagramme ...,t1_hk06ypt,2021-11-10 11:35:03,/r/france/comments/qq0oz8/guide_comment_mesure...
152,qq0oz8,Dum_spero_spiro_,hk25u7f,Un grand merci !,t1_hk09fec,2021-11-10 13:22:54,/r/france/comments/qq0oz8/guide_comment_mesure...
153,qq0oz8,Dum_spero_spiro_,hk25vew,Merci !,t1_hk1ww3b,2021-11-10 13:23:14,/r/france/comments/qq0oz8/guide_comment_mesure...


### We now want to get comments from all the posts we scraped

In [10]:
import glob

# folder containing the exported csv files
path = "exports/" + subreddit + "/posts"

print(path)
# list all the files with extension .csv ("/" works on every OS, unlike a hard-coded "\")
filenames = glob.glob(path + "/*.csv")
print('File names:', filenames)
# read every csv file, then concatenate once (DataFrame.append was removed in pandas 2.0)
frames = []
for file in filenames:
   print("\nReading file = ",file)
   frames.append(pd.read_csv(file))
all_titres = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

all_post_ids = set(all_titres['postId'].tolist())
# a set cannot be sliced, so display it whole
all_post_ids

csv_exports/france/posts
File names: ['csv_exports/france/posts\\france_20220901_20221030.csv']

Reading file =  csv_exports/france/posts\france_20220901_20221030.csv


{'x6ppsd',
 'xe6uxk',
 'x5yqb5',
 'x94qo6',
 'x7hotu',
 'yghdxe',
 'yc4c6h',
 'xy3n4h',
 'x3we47',
 'x5ro2c',
 'xew1yv',
 'xie1al',
 'y1yp04',
 'xf8ctj',
 'xog3pa',
 'xfny0n',
 'y918p2',
 'xf1bf6',
 'xd79o1',
 'xtv4sc',
 'y4tryg',
 'xhn0h3',
 'x4z1cu',
 'xl0gm7',
 'xn7t4u',
 'x6ms4t',
 'xgkmnr',
 'x5qjea',
 'xemnr1',
 'x8lp0d',
 'xh6gaz',
 'xfldoe',
 'y2hlzi',
 'y76d8h',
 'xj61h6',
 'xd4yrs',
 'xd3we8',
 'xn1826',
 'xbs8oy',
 'xkgdas',
 'xzr8xq',
 'xf25zg',
 'y9unu6',
 'xksv38',
 'xv6lrr',
 'x2ynuc',
 'xa25s7',
 'xedigc',
 'ygiblk',
 'x66d01',
 'x68lbc',
 'xazpn3',
 'yfl9gl',
 'x7ufvq',
 'xnwz3p',
 'y1gpqi',
 'xhgj0q',
 'x61m8d',
 'xooq60',
 'xmpeeu',
 'x6k72q',
 'xejn6u',
 'y8v06b',
 'y294vd',
 'x4bgog',
 'x96syn',
 'ydvwwx',
 'ye4bff',
 'xde8ei',
 'xfd2h8',
 'xfgsjf',
 'xfe4fi',
 'xvfywk',
 'yfhi7c',
 'x3k35b',
 'x7jrl1',
 'x6ispl',
 'xdy8ut',
 'yaja7l',
 'xjsinu',
 'xnsj2n',
 'x5he92',
 'xw4tz0',
 'y6mf4j',
 'y0uf55',
 'xfpm37',
 'y665rc',
 'xxsdu7',
 'xd1mv3',
 'x9sjlh',
 'xgrh8a',

### Keeping only the post IDs we have not scraped yet

In [11]:
# folder containing the comments already exported
path = "exports/" + subreddit + "/comments"

# list all the files with extension .csv
filenames = glob.glob(path + "/*.csv")
print('File names:', filenames)
# read every csv file, then concatenate once (DataFrame.append was removed in pandas 2.0)
frames = []
for file in filenames:
   print("\nReading file = ",file)
   frames.append(pd.read_csv(file))
all_comments = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

if len(all_comments) > 0:
   post_ids_scrapped = set(all_comments['post_id'].tolist())
   # set difference: posts whose comments are not on disk yet
   post_id_not_scrapped = all_post_ids - post_ids_scrapped
else:
   post_id_not_scrapped = all_post_ids
   print("No files found")
# a set cannot be sliced, so display it whole
post_id_not_scrapped


File names: []
No files found


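
The filtering step above boils down to a set difference. A minimal sketch, using a few IDs from the output above and a hypothetical helper name `remaining_post_ids`:

```python
def remaining_post_ids(all_ids, scraped_ids):
    """Return the post ids that still need scraping (hypothetical helper)."""
    return set(all_ids) - set(scraped_ids)

all_ids = {"x6ppsd", "xe6uxk", "x5yqb5"}
# Ids read back from the comments files; duplicates are harmless
scraped = ["xe6uxk", "xe6uxk"]

todo = remaining_post_ids(all_ids, scraped)
# Sets have no order and cannot be sliced, so sort before previewing a few items
print(sorted(todo)[:3])
```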

### Looping over every post we have

In [15]:
def save_dataframe(df, post_id: str):
    """Save df as a csv file named after the subreddit and the given post id."""
    csv_file_name = subreddit + '_comments_' + post_id + '.csv'
    df.to_csv('exports/' + subreddit + '/comments/' + csv_file_name, index = False, encoding="utf-8")

In [16]:
# We're scraping the comments of every post_id we have, saving to a file every 50k comments

count = 0
p_bar = tqdm_notebook(post_id_not_scrapped)
for post_id in p_bar:
    p_bar.set_description(f'Working on "{post_id}"')
    all_comments = pd.concat([all_comments, collect_all_comments(post_id)], ignore_index=True)
    count += 1
    if len(all_comments) > 50000:
        save_dataframe(all_comments, str(post_id))
        all_comments = pd.DataFrame()
save_dataframe(all_comments, str(post_id))
print('I scraped comments from ' + str(count) + ' posts :).')
all_comments

https://api.pushshift.io/reddit/comment/search/?link_id=x6ppsd&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=xe6uxk&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=x5yqb5&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=x5yqb5&limit=100&q=*&after=1662374040
https://api.pushshift.io/reddit/comment/search/?link_id=x5yqb5&limit=100&q=*&after=1662384690
https://api.pushshift.io/reddit/comment/search/?link_id=x5yqb5&limit=100&q=*&after=1662542625
https://api.pushshift.io/reddit/comment/search/?link_id=x94qo6&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=x7hotu&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yghdxe&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yc4c6h&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=xy3n4h&limit=100&q=*&after=0
I had a 429 status code. I'm waiting 3

Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,xe967t,Neveed,iofgpj1,"Le contenu est probablement plus juste, mais l...",t3_xe967t,2022-09-14 20:26:35,/r/france/comments/xe967t/ecologie_linfox_vien...
1,xe967t,TrueRignak,iofikjz,&gt; la plupart des publication scientifiques ...,t1_iofgpj1,2022-09-14 20:39:00,/r/france/comments/xe967t/ecologie_linfox_vien...
2,xe967t,Neveed,iofk61v,"Quand un article est sur SciHub, c'est pas qu'...",t1_iofikjz,2022-09-14 20:49:39,/r/france/comments/xe967t/ecologie_linfox_vien...
3,xe967t,TrueRignak,ioflhjm,"Oui, mais ce que je voulais dire, c'est que ça...",t1_iofk61v,2022-09-14 20:58:31,/r/france/comments/xe967t/ecologie_linfox_vien...
4,xe967t,Neveed,iofmtp5,C'est pas diffus pour l'état ou les FAI qui so...,t1_ioflhjm,2022-09-14 21:07:37,/r/france/comments/xe967t/ecologie_linfox_vien...
...,...,...,...,...,...,...,...
44706,x854s6,IMAGINE_RS,inghg39,Ok je comprends l’idée ! Merci !,t1_ingc2g9,2022-09-07 16:54:28,/r/france/comments/x854s6/me_déclarer_chômeur_...
44707,x854s6,IMAGINE_RS,inghj04,Merci !,t1_ingfwut,2022-09-07 16:55:00,/r/france/comments/x854s6/me_déclarer_chômeur_...
44708,x854s6,Sherender,ini17zg,Je suis alternant et je suis complété par pôle...,t3_x854s6,2022-09-07 22:44:14,/r/france/comments/x854s6/me_déclarer_chômeur_...
44709,x854s6,IMAGINE_RS,ini1vgr,Ok j’ai vraiment l’impression que c’est répandu,t1_ini17zg,2022-09-07 22:48:13,/r/france/comments/x854s6/me_déclarer_chômeur_...
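
The "save every 50k comments" pattern used in the loop above can be expressed as a small generator that groups per-post results into batches. This is a sketch under assumptions: `batch_by_size` is a hypothetical helper working on plain lists, and unlike the loop in the notebook it never emits an empty final batch.

```python
def batch_by_size(chunks, max_rows):
    """Yield lists of rows, flushing once a batch holds more than max_rows rows."""
    batch = []
    for chunk in chunks:
        batch.extend(chunk)
        if len(batch) > max_rows:
            yield batch
            batch = []
    # Flush whatever is left after the last chunk
    if batch:
        yield batch

# Each inner list stands for the comments of one post
posts = [["c1", "c2"], ["c3"], ["c4", "c5", "c6"], ["c7"]]
batches = list(batch_by_size(posts, max_rows=2))
print(batches)
```

Each yielded batch would correspond to one csv file written to disk.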
