### Get comments from ID post

If I want to get all comments from post ID "ucbjgz" : https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*

The limit is currently clamped down to 100 so need to make a loop if we want more (see https://www.reddit.com/r/pushshift/comments/qufgqa/get_all_comments_from_a_post_id/)

In [5]:
import pandas as pd
import requests
import json
from datetime import datetime
import time

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [6]:
def get_pushshift_data(link_id, after = 0, limit = 100) -> dict():
    try:
        r = ''
        URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'+'&after='+ str(after)
        print(URL)
        r = requests.get(URL)
        if r.status_code == 200:
            data = json.loads(r.text, strict = False)
            return data['data']
        
        #Si on a eu une erreur en r√©cup√©rant l'URL on r√©essaye 5 fois(avec sleep pour √™tre s√ªr), sinon on abandonne
        else:
            time.sleep(1)
            nb_try = 0
            while r.status_code != 200 | nb_try < 5:
                URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'
                print(URL)
                r = requests.get(URL)
                data = json.loads(r.text, strict = False)
                nb_try += 1
            if r.status_code == 200:           
                return data['data']
            else: return ''
    except:
        print('Error while accessing API')
        print(r)
        return ''

Example of a post

In [7]:
get_pushshift_data('ucbjgz')

https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=0


[{'all_awardings': [],
  'archived': False,
  'associated_award': None,
  'author': 'AutoModerator',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_6l4z3',
  'author_patreon_flair': False,
  'author_premium': True,
  'body': 'Pas assez de musique sur r/france ? Rejoins-nous sur r/musique et aide nous √† d√©velopper cette nouvelle communaut√©\n\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/france) if you have any questions or concerns.*',
  'body_sha1': 'f165f599f1a2ee35433403826535a438667d5ce0',
  'can_gild': True,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'collapsed_reason_code': None,
  'comment_type': None,
  'controversiality': 0,


In [8]:
def collect_one_comment(comment, columns, link_id) -> pd.Series():
    author = comment['author']
    text = comment['body']    
    commentId = comment['id']
    parent_commentId = comment['parent_id']
    created = comment['created_utc']
    created_date = datetime.fromtimestamp(created)
    permalink = comment['permalink']    
    return pd.Series([link_id, author,commentId,text,parent_commentId,created_date,permalink], index = columns), created

In [31]:
def collect_all_comments(link_id):
    columns = ['post_id','authors','commentId','text','parent_commentId','created','permalink']
    rows_list = []

    comments = get_pushshift_data(link_id)
    nb_comments = len(comments)
    for comment in comments:
        pd_comment, after = collect_one_comment(comment, columns, link_id)
        rows_list.append(pd_comment)
    
    if nb_comments == 100:
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because we will always scrap the last comment twice, to make sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments)
    else: nb_comments = 0

    #We can get only 100 comments everytime, so we're looping changing the 'after' timestamp of comments until we have no more comments
    while nb_comments > 0:
        #We're scrapping comments and inserting in dataframe
        for comment in comments:
            pd_comment, after = collect_one_comment(comment, columns, link_id)
            rows_list.append(pd_comment)
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because we will always scrap the last comment twice, to make sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments) - 1
    
    return pd.DataFrame(rows_list, columns=columns)

In [29]:
collect_all_comments('pfo84s')

https://api.pushshift.io/reddit/comment/search/?link_id=pfo84s&limit=100&q=*&after=0
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
https://api.pushshift.io/reddit/comment/search/?link_id=pfo84s&limit=100&q=*&after=1630487153
https://api.pushshift.io/reddit/comment/search/?link_id=pfo84s&limit=100&q=*&after=1630500047
https://api.pushshift.io/reddit/comment/search/?link_id=pfo84s&limit=100&q=*&after=1632308783


Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,pfo84s,JeHaisLesEnfants,hb5pbvd,Heureusement qu'il reste des patrons dot√©s de ...,t3_pfo84s,2021-09-01 08:36:20,/r/france/comments/pfo84s/tu_fais_partie_des_b...
1,pfo84s,saintsulpice,hb5pnmy,Un branleur radical ne devrait pas √™tre en sit...,t3_pfo84s,2021-09-01 08:40:37,/r/france/comments/pfo84s/tu_fais_partie_des_b...
2,pfo84s,Snykeurs,hb5pupk,"Si tu te branle mais ne fais pas l'amour, pas ...",t1_hb5pnmy,2021-09-01 08:43:13,/r/france/comments/pfo84s/tu_fais_partie_des_b...
3,pfo84s,croissance_eternelle,hb5pz0h,L'agriculteur on dirait une caricature de film...,t3_pfo84s,2021-09-01 08:44:47,/r/france/comments/pfo84s/tu_fais_partie_des_b...
4,pfo84s,Beru73,hb5qhh0,"OK, mais tu te branles pas pendant les heures ...",t1_hb5pupk,2021-09-01 08:51:24,/r/france/comments/pfo84s/tu_fais_partie_des_b...
...,...,...,...,...,...,...,...
290,pfo84s,ElTroglodyte,hbahrk5,C ki l'patron,t3_pfo84s,2021-09-02 09:59:07,/r/france/comments/pfo84s/tu_fais_partie_des_b...
291,pfo84s,Streuphy,hbalxqx,"Ok, tu as une lecture morale de la situation (...",t1_hb941l2,2021-09-02 11:01:29,/r/france/comments/pfo84s/tu_fais_partie_des_b...
292,pfo84s,tiplinix,hbanji1,"Il n'y a pas de lecture morale, c'est litt√©ral...",t1_hbalxqx,2021-09-02 11:25:45,/r/france/comments/pfo84s/tu_fais_partie_des_b...
293,pfo84s,KafkaDatura,hbb1tqh,"Si le travail √©tait √©panouissant, on te paiera...",t1_hb5xua5,2021-09-02 14:14:37,/r/france/comments/pfo84s/tu_fais_partie_des_b...


### We know want to get coments from all the post we scrapped

In [16]:
import glob

# getting csv files from the folder MyProject
path = "csv_exports"

# read all the files with extension .csv
filenames = glob.glob(path + "\*.csv")
print('File names:', filenames)
all_titres = pd.DataFrame()
# for loop to iterate all csv files
for file in filenames:
   # reading csv files
   print("\nReading file = ",file)
   all_titres = all_titres.append(pd.read_csv(file))

all_titres = all_titres.reset_index(drop=True)
post_ids = all_titres['postId'].tolist()
post_ids

File names: ['csv_exports\\france_20210901_20211002.csv', 'csv_exports\\france_20211002_20211031.csv', 'csv_exports\\france_20211031_20211231.csv', 'csv_exports\\france_20211231_20220131.csv', 'csv_exports\\france_20220131_20220228.csv', 'csv_exports\\france_20220228_20220331.csv', 'csv_exports\\france_20220331_20220401.csv', 'csv_exports\\france_20220331_20220426.csv']

Reading file =  csv_exports\france_20210901_20211002.csv

Reading file =  csv_exports\france_20211002_20211031.csv

Reading file =  csv_exports\france_20211031_20211231.csv

Reading file =  csv_exports\france_20211231_20220131.csv

Reading file =  csv_exports\france_20220131_20220228.csv

Reading file =  csv_exports\france_20220228_20220331.csv

Reading file =  csv_exports\france_20220331_20220401.csv

Reading file =  csv_exports\france_20220331_20220426.csv


['pfg0r5',
 'pfgezw',
 'pfh52l',
 'pfh9pl',
 'pfhakb',
 'pfhn2p',
 'pfhomq',
 'pfkds1',
 'pfktmh',
 'pfkvw2',
 'pflw2r',
 'pfmj9z',
 'pfmwgm',
 'pfn0ez',
 'pfn4hz',
 'pfnj6d',
 'pfnjpj',
 'pfns5n',
 'pfnw7j',
 'pfnwhm',
 'pfo2ck',
 'pfo5he',
 'pfo6wm',
 'pfo84s',
 'pfogr6',
 'pfojbq',
 'pfokcg',
 'pfoojx',
 'pfooqv',
 'pfope1',
 'pfordk',
 'pfos3f',
 'pfov4r',
 'pfp58k',
 'pfp7mg',
 'pfp8ze',
 'pfpaef',
 'pfpmon',
 'pfpoe4',
 'pfprwo',
 'pfpsb0',
 'pfpvqy',
 'pfpxjo',
 'pfq10y',
 'pfq7lv',
 'pfqcth',
 'pfqepz',
 'pfqh7t',
 'pfqhs8',
 'pfql07',
 'pfqldw',
 'pfqm73',
 'pfqs0v',
 'pfqv8u',
 'pfqyj5',
 'pfqzaf',
 'pfr09l',
 'pfr1jp',
 'pfr2qk',
 'pfr3af',
 'pfr61j',
 'pfr6dr',
 'pfr72p',
 'pfr8tk',
 'pfrjtn',
 'pfrksj',
 'pfrlxl',
 'pfrrql',
 'pfrruf',
 'pfrrvk',
 'pfrsdq',
 'pfrvll',
 'pfrwfy',
 'pfs1np',
 'pfs5cp',
 'pfs5nf',
 'pfs5wc',
 'pfs8nm',
 'pfsdq1',
 'pfsejh',
 'pfsjqf',
 'pfslus',
 'pfsmie',
 'pfsqus',
 'pfst0u',
 'pfstwp',
 'pfsx57',
 'pfszsh',
 'pft132',
 'pft4j9',
 'pftfkb',

### Looping on every post we have

In [12]:
# def save_dataframe(df, after : datetime, before : datetime):
#     #On veut retransformer les timestamps en date pour le nom des fichiers
#     after = str(after.strftime("%Y")) +  str(after.strftime("%m")) + str(after.strftime("%d"))
#     before = str(before.strftime("%Y")) +  str(before.strftime("%m")) + str(before.strftime("%d"))
#     csv_file_name = 'france_' + str(after) + '_' + str(before) + '.csv'
#     df.to_csv('csv_exports' + '/' + csv_file_name, index = False, encoding="utf-8")

In [17]:
def save_dataframe(df, post_id = str):
    csv_file_name = 'france_comments_' + post_id
    df.to_csv('csv_exports/comments/' + csv_file_name, index = False, encoding="utf-8")

In [32]:
#We're scrapping all comments from all post_ids we have. Saving in a file every 50k comments

all_comments = pd.DataFrame()
count = 0
for post_id in post_ids:
    all_comments = pd.concat([all_comments, collect_all_comments(post_id)], ignore_index=True)
    count += 1
    if len(all_comments) > 1000:
        save_dataframe(all_comments, str(post_id))
        all_comments = pd.DataFrame()
save_dataframe(all_comments, str(post_id))
print('I scrapped comments from ' + count + ' posts :).')
all_comments

https://api.pushshift.io/reddit/comment/search/?link_id=pfg0r5&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfgezw&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh52l&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh9pl&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhakb&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhn2p&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhomq&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfkds1&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfktmh&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfkvw2&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pflw2r&limit=100&q=*&after=0
Error while accessing API


UnboundLocalError: local variable 'r' referenced before assignment

In [20]:
count

2070

In [33]:
all_comments

Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,pfg0r5,_Tripe_,hb42jo8,Sur ce coup l√† ü§°\n\nApr√®s faut voir sur quelle...,t3_pfg0r5,2021-09-01 00:08:06,/r/france/comments/pfg0r5/bravo_la_france_on_b...
1,pfg0r5,TB54,hb43vqt,J'avoue que j'ai vraiment du mal √† croire le r...,t3_pfg0r5,2021-09-01 00:17:45,/r/france/comments/pfg0r5/bravo_la_france_on_b...
2,pfg0r5,coelhophisis,hb44wb6,"En m√™me temps, si j'ai un membre de ma famille...",t3_pfg0r5,2021-09-01 00:25:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
3,pfg0r5,[deleted],hb4568g,[removed],t3_pfg0r5,2021-09-01 00:27:08,/r/france/comments/pfg0r5/bravo_la_france_on_b...
4,pfg0r5,CulteDeLaRaison,hb45ugd,"Je veux bien la taille de la population, la m√©...",t3_pfg0r5,2021-09-01 00:32:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
...,...,...,...,...,...,...,...
158,pfkvw2,kyp-d,hb6n7be,Si il n'y a que Pfizer !\n\nLes autres vaccins...,t1_hb6lgwb,2021-09-01 15:22:57,/r/france/comments/pfkvw2/pass_sanitaire_dans_...
159,pfkvw2,atohero,hb6n9ft,C'est quand m√™me pas compliqu√© d'avoir son pas...,t1_hb5f2kf,2021-09-01 15:23:25,/r/france/comments/pfkvw2/pass_sanitaire_dans_...
160,pfkvw2,Coutijou,hb6rfbj,"Ben, je comprends pas, parce qu'on a la m√™me s...",t1_hb6n7be,2021-09-01 15:54:58,/r/france/comments/pfkvw2/pass_sanitaire_dans_...
161,pfkvw2,kyp-d,hb6uay9,Moderna c'est dans les pharmacies et chez les ...,t1_hb6rfbj,2021-09-01 16:16:00,/r/france/comments/pfkvw2/pass_sanitaire_dans_...


In [None]:
all_comments.to_csv('csv_exports' + '/' + 'comments_first_all.csv', index = False, encoding="utf-8")

Unnamed: 0,authors,commentId,text,parent_commentId,created,permalink
0,_Tripe_,hb42jo8,Sur ce coup l√† ü§°\n\nApr√®s faut voir sur quelle...,t3_pfg0r5,2021-09-01 00:08:06,/r/france/comments/pfg0r5/bravo_la_france_on_b...
1,TB54,hb43vqt,J'avoue que j'ai vraiment du mal √† croire le r...,t3_pfg0r5,2021-09-01 00:17:45,/r/france/comments/pfg0r5/bravo_la_france_on_b...
2,coelhophisis,hb44wb6,"En m√™me temps, si j'ai un membre de ma famille...",t3_pfg0r5,2021-09-01 00:25:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
3,[deleted],hb4568g,[removed],t3_pfg0r5,2021-09-01 00:27:08,/r/france/comments/pfg0r5/bravo_la_france_on_b...
4,CulteDeLaRaison,hb45ugd,"Je veux bien la taille de la population, la m√©...",t3_pfg0r5,2021-09-01 00:32:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
...,...,...,...,...,...,...
126,Canaan-Aus,hehlenh,"I got mine today. 26 days. applied on the 1st,...",t1_hcrn6ff,2021-09-27 18:30:52,/r/france/comments/pfkds1/get_a_digital_sanita...
127,nixsequi,hb54rz2,I have never seen these products in France.,t3_pfktmh,2021-09-01 05:02:54,/r/france/comments/pfktmh/eiffel_brand_product...
128,Frenetic_Platypus,hb556as,"I've never seen them, I can't find any seller ...",t3_pfktmh,2021-09-01 05:06:16,/r/france/comments/pfktmh/eiffel_brand_product...
129,MycroFeline,hb55ecn,Neither have I. A quick search on the french-s...,t1_hb54rz2,2021-09-01 05:08:11,/r/france/comments/pfktmh/eiffel_brand_product...
