### Get comments from ID post

If I want to get all comments from post ID "ucbjgz" : https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*

The limit is currently clamped down to 100 so need to make a loop if we want more (see https://www.reddit.com/r/pushshift/comments/qufgqa/get_all_comments_from_a_post_id/)

In [1]:
import pandas as pd
import requests
import json
from datetime import datetime
import time
from tqdm import tqdm_notebook
# conda install -c conda-forge tqdm

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
subreddit = 'france'

# Functions

In [3]:
def get_pushshift_data(link_id, after = 0, limit = 100) -> dict():
    """
        API results for a post id, returning comments & information about them. The API is giving a maximum of 100 comments per call.
        Which means that we have to do a loop if we have more in a post. 'After' argument is here to specify the time of the last comment scrapped.
    """
    try:
        r = ''
        URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'+'&after='+ str(after)
        print(URL)
        r = requests.get(URL)
        if r.status_code == 200:
            data = json.loads(r.text, strict = False)
            return data['data']
        
        #Si on a eu une erreur en récupérant l'URL on réessaye 5 fois(avec sleep pour être sûr), sinon on abandonne
        else:
            time.sleep(1)
            nb_try = 0
            #Si on a eu une erreur status code 429, on risque d'être ban pour 1h. See https://www.reddit.com/r/pushshift/comments/uhjzvd/temporary_banning_of_ips_with_high_rates_of_429s/
            if r.status_code == 429:
                print("I had a 429 status code. I'm waiting 3 seconds.")
                time.sleep(3)
            
            while r.status_code != 200 | nb_try < 5:
                URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'
                print(URL)
                r = requests.get(URL)
                data = json.loads(r.text, strict = False)
                nb_try += 1
            if r.status_code == 200:           
                return data['data']
            else: return ''
    except:
        print('Error while accessing API')
        print(r)
        return ''

Example of a post

In [4]:
get_pushshift_data('ucbjgz')

https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=0


[{'all_awardings': [],
  'archived': False,
  'associated_award': None,
  'author': 'AutoModerator',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_6l4z3',
  'author_patreon_flair': False,
  'author_premium': True,
  'body': 'Pas assez de musique sur r/france ? Rejoins-nous sur r/musique et aide nous à développer cette nouvelle communauté\n\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/france) if you have any questions or concerns.*',
  'body_sha1': 'f165f599f1a2ee35433403826535a438667d5ce0',
  'can_gild': True,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'collapsed_reason_code': None,
  'comment_type': None,
  'controversiality': 0,
  '

In [5]:
def collect_one_comment(comment, columns, link_id) -> pd.Series():
    """
        Return informations of a specific comment as a pandas series
    """
    author = comment['author']
    text = comment['body']    
    commentId = comment['id']
    parent_commentId = comment['parent_id']
    created = comment['created_utc']
    created_date = datetime.fromtimestamp(created)
    permalink = comment['permalink']    
    return pd.Series([link_id, author,commentId,text,parent_commentId,created_date,permalink], index = columns), created

In [6]:
def collect_all_comments(link_id):
    """
        Collect all comments of a specific post. Returning informations as a pandas dataframe.
    """
    columns = ['post_id','authors','commentId','text','parent_commentId','created','permalink']
    rows_list = []

    comments = get_pushshift_data(link_id)
    nb_comments = len(comments)
    for comment in comments:
        pd_comment, after = collect_one_comment(comment, columns, link_id)
        rows_list.append(pd_comment)
    
    if nb_comments == 100:
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because so we will always scrap the last comment twice but we're sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments)
    else: nb_comments = 0

    #We can get only 100 comments everytime, so we're looping changing the 'after' timestamp of comments until we have no more comments
    while nb_comments > 0:
        #We're scrapping comments and inserting in dataframe
        for comment in comments:
            pd_comment, after = collect_one_comment(comment, columns, link_id)
            rows_list.append(pd_comment)
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because we will always scrap the last comment twice, to make sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments) - 1
    
    return pd.DataFrame(rows_list, columns=columns)

In [7]:
collect_all_comments('qq0oz8')

https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636476517
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636617851


Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,qq0oz8,Soviet75,hjx50jq,*Rage en américain*,t3_qq0oz8,2021-11-09 11:21:25,/r/france/comments/qq0oz8/guide_comment_mesure...
1,qq0oz8,mooothemadcow,hjx50uk,Je me suis pissé dessus tellement j'ai ri.\n\n...,t3_qq0oz8,2021-11-09 11:21:33,/r/france/comments/qq0oz8/guide_comment_mesure...
2,qq0oz8,nyme-me,hjx6twj,C'est vrai! Malheureusement pas tout à fait po...,t3_qq0oz8,2021-11-09 11:47:24,/r/france/comments/qq0oz8/guide_comment_mesure...
3,qq0oz8,plouky,hjx6zob,Aire &gt; système métrique et terrain de foot...,t3_qq0oz8,2021-11-09 11:49:38,/r/france/comments/qq0oz8/guide_comment_mesure...
4,qq0oz8,Pb_Flo,hjx744l,"Médecine, aéronautique, hydraulique etc. rentr...",t3_qq0oz8,2021-11-09 11:51:19,/r/france/comments/qq0oz8/guide_comment_mesure...
...,...,...,...,...,...,...,...
150,qq0oz8,rakoo,hk1uwiq,Il parait que c'est la méthode que les arabes ...,t1_hjxtyun,2021-11-10 11:06:28,/r/france/comments/qq0oz8/guide_comment_mesure...
151,qq0oz8,zdimension,hk1ww3b,Instinctivement j'appellerais ça un diagramme ...,t1_hk06ypt,2021-11-10 11:35:03,/r/france/comments/qq0oz8/guide_comment_mesure...
152,qq0oz8,Dum_spero_spiro_,hk25u7f,Un grand merci !,t1_hk09fec,2021-11-10 13:22:54,/r/france/comments/qq0oz8/guide_comment_mesure...
153,qq0oz8,Dum_spero_spiro_,hk25vew,Merci !,t1_hk1ww3b,2021-11-10 13:23:14,/r/france/comments/qq0oz8/guide_comment_mesure...


### We now want to get coments from all the post we scrapped

In [8]:
import glob

# getting csv files from the folder
path = "exports/" + subreddit + "/posts"

print(path)
# read all the files with extension .csv
filenames = glob.glob(path + "\*.csv")
print('File names:', filenames)
all_titres = pd.DataFrame()
# for loop to iterate all csv files
for file in filenames:
   # reading csv files
   print("\nReading file = ",file)
   all_titres = all_titres.append(pd.read_csv(file))

all_titres = all_titres.reset_index(drop=True)
all_post_ids = set(all_titres['postId'].tolist())
all_post_ids

exports/france/posts
File names: ['exports/france/posts\\france_20220901_20221030.csv', 'exports/france/posts\\france_20221030_20221129.csv', 'exports/france/posts\\france_20221129_20221201.csv']

Reading file =  exports/france/posts\france_20220901_20221030.csv

Reading file =  exports/france/posts\france_20221030_20221129.csv

Reading file =  exports/france/posts\france_20221129_20221201.csv


{'xex9ql',
 'y14pfc',
 'yw0eyr',
 'xov1hs',
 'ywpeqc',
 'xc6vgm',
 'x9hdxi',
 'y8ngva',
 'y7ijuq',
 'yjjpbz',
 'xi3lcr',
 'yi8eg6',
 'yd0jum',
 'y1m0r9',
 'yr48zp',
 'yk17eq',
 'x95upl',
 'xs0164',
 'yj81l1',
 'xq5ala',
 'xwzs73',
 'yfz7af',
 'xi4qvr',
 'yvrtzl',
 'xuctez',
 'yc5053',
 'yrrqa9',
 'yb5o61',
 'youi3h',
 'y4h2rs',
 'ykapr7',
 'z7p7r8',
 'z3mls5',
 'ym0kxr',
 'x8zyta',
 'y5nx8i',
 'y4igmd',
 'yuwp2x',
 'x8pxsh',
 'y1xu1p',
 'z7zczb',
 'xd0ya6',
 'x6mszd',
 'xkf9m9',
 'yfmrdm',
 'xfsmiy',
 'z2uvxv',
 'xk2k10',
 'xcxem7',
 'y1vhe8',
 'y15s7a',
 'yvrcwc',
 'yky6fh',
 'x70hdc',
 'xcsb1x',
 'xe8poo',
 'xcpkg6',
 'xcawla',
 'xdw06s',
 'xfic8q',
 'y3wtz4',
 'xq6ddc',
 'xhp48g',
 'xzxx35',
 'xf24yw',
 'xajk03',
 'xcwf5m',
 'y7vlvr',
 'xptmcd',
 'xr2xj7',
 'xabuvx',
 'yahvzf',
 'yepbrv',
 'yqb5ry',
 'x5b9y8',
 'yaeibb',
 'ype086',
 'z198t6',
 'y6zgdd',
 'xpiirm',
 'xq6tau',
 'ynr0nc',
 'x81q7e',
 'ywssu6',
 'z78ovf',
 'yt9gdq',
 'x3bl6y',
 'ynxl30',
 'y0fklx',
 'yledyv',
 'yrcwx7',

### Reducing the post_ids only with the one we don't already have

In [9]:
# getting csv files from the folder
path = f"exports/{subreddit}/comments"

# read all the files with extension .csv
filenames = glob.glob(path + "\*.csv")
print('File names:', filenames)
old_comments = pd.DataFrame()
# for loop to iterate all csv files
for file in filenames:
   # reading csv files
   print("\nReading file = ",file)
   old_comments = old_comments.append(pd.read_csv(file))

old_comments = old_comments.reset_index(drop=True)
if len(old_comments) > 0:
   post_ids_scrapped = set(old_comments['post_id'].tolist())
   post_ids_not_scrapped = [i for i in all_post_ids if i not in post_ids_scrapped]
else: 
   post_ids_not_scrapped = all_post_ids
   print("No files found")
post_ids_not_scrapped[:3]


File names: ['exports/france/comments\\france_comments_x67dzf.csv', 'exports/france/comments\\france_comments_x6ppsd.csv', 'exports/france/comments\\france_comments_xci5m1.csv', 'exports/france/comments\\france_comments_xhtzdy.csv', 'exports/france/comments\\france_comments_xv9jol.csv', 'exports/france/comments\\france_comments_yg3dj6.csv']

Reading file =  exports/france/comments\france_comments_x67dzf.csv

Reading file =  exports/france/comments\france_comments_x6ppsd.csv

Reading file =  exports/france/comments\france_comments_xci5m1.csv

Reading file =  exports/france/comments\france_comments_xhtzdy.csv

Reading file =  exports/france/comments\france_comments_xv9jol.csv

Reading file =  exports/france/comments\france_comments_yg3dj6.csv


['yw0eyr', 'ywpeqc', 'xc6vgm']

In [10]:
len(post_ids_not_scrapped)

8708

### Looping on every post we have

In [11]:
def save_dataframe(df, post_id = str):
    csv_file_name = subreddit + '_comments_' + post_id + '.csv'
    df.to_csv('exports/' + subreddit + '/comments/' + csv_file_name, index = False, encoding="utf-8")

In [13]:
#We're scrapping all comments from all post_ids we have. Saving in a file every 50k comments

count = 0
# p_bar = tqdm_notebook(post_ids_not_scrapped)
new_comments=pd.DataFrame()
for post_id in post_ids_not_scrapped:
    new_comments = pd.concat([new_comments, collect_all_comments(post_id)], ignore_index=True)
    count += 1
    if len(new_comments) > 50000:
        save_dataframe(new_comments, str(post_id))
        new_comments = pd.DataFrame()
save_dataframe(new_comments, str(post_id))
print('I scrapped comments from ' + str(count) + ' posts :).')
new_comments

https://api.pushshift.io/reddit/comment/search/?link_id=yw0eyr&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=ywpeqc&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=xc6vgm&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=x9hdxi&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yjjpbz&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=xi3lcr&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yi8eg6&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yd0jum&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=y1m0r9&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=yr48zp&limit=100&q=*&after=0
I had a 429 status code. I'm waiting 3 seconds.
https://api.pushshift.io/reddit/comment/search/?link_id=yk17eq&limit=100&q=*&after=0
https://api.pushs

Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,z57lsw,Tritri89,ixuhj1f,Aaah ça faisait longtemps. J'en avais déjà qua...,t3_z57lsw,2022-11-26 14:33:50,/r/france/comments/z57lsw/dans_la_famille_des_...
1,z57lsw,AzuNetia,ixuif0x,J'ai toujours voulu savoir la tête que j'avais...,t3_z57lsw,2022-11-26 14:42:50,/r/france/comments/z57lsw/dans_la_famille_des_...
2,z57lsw,Rfogj,ixujdxc,Réponds lui au choix que :\n\n- Tu es un acteu...,t3_z57lsw,2022-11-26 14:52:18,/r/france/comments/z57lsw/dans_la_famille_des_...
3,z57lsw,NielaPureflamme,ixujke5,Porque no los dos ?,t1_ixujdxc,2022-11-26 14:54:01,/r/france/comments/z57lsw/dans_la_famille_des_...
4,z57lsw,Kutekin,ixujwe5,"Mais du coup, le mail vient vraiment de toi ou...",t3_z57lsw,2022-11-26 14:57:22,/r/france/comments/z57lsw/dans_la_famille_des_...
...,...,...,...,...,...,...,...
10732,yo1d52,RedditSuggestion1234,ivj2beg,"Si tu regardes les graphiques RTE, quand le so...",t1_iviyq63,2022-11-08 10:39:40,/r/france/comments/yo1d52/lobligation_de_pose_...
10733,z04t5u,Junoah,ix3jukn,"Yes, next question?",t3_z04t5u,2022-11-20 15:09:33,/r/france/comments/z04t5u/will_europe_survive_...
10734,z04t5u,Talen_92,ix3l3wf,"""I've come to Paris where people are on strike...",t3_z04t5u,2022-11-20 15:20:24,/r/france/comments/z04t5u/will_europe_survive_...
10735,z04t5u,LaQuequetteAuPoete,ix3q0uk,J'ai regardé quelques minutes jusqu'à atteindr...,t3_z04t5u,2022-11-20 15:59:17,/r/france/comments/z04t5u/will_europe_survive_...


In [None]:
# #We're scrapping all comments from all post_ids we have. Saving in a file every 50k comments

# count = 0
# p_bar = tqdm_notebook(post_ids_not_scrapped)
# for post_id in p_bar:
#     p_bar.set_description(f'Working on "{post_id}"')
#     all_comments = pd.concat([all_comments, collect_all_comments(post_id)], ignore_index=True)
#     count += 1
#     if len(all_comments) > 50000:
#         save_dataframe(all_comments, str(post_id))
#         all_comments = pd.DataFrame()
# save_dataframe(all_comments, str(post_id))
# print('I scrapped comments from ' + str(count) + ' posts :).')
# all_comments

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  p_bar = tqdm_notebook(post_ids_not_scrapped)


ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html