### Get comments from ID post

If I want to get all comments from post ID "ucbjgz" : https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*

The limit is currently clamped down to 100 so need to make a loop if we want more (see https://www.reddit.com/r/pushshift/comments/qufgqa/get_all_comments_from_a_post_id/)

In [2]:
import pandas as pd
import requests
import json
from datetime import datetime
import time
from pmaw import PushshiftAPI
api = PushshiftAPI()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [46]:
# def get_pushshift_data_pmaw(link_ids):
#     try:
#        # retrieve comment ids for submissions
#         comment_ids = api.search_submission_comment_ids(ids=link_ids) 
#         comment_ids = list(comment_ids)

#         # retrieve comments by id
#         comments = api.search_comments(ids=comment_ids)
#         return pd.DataFrame(comments)
#     except:
#         print('Error while accessing pmaw API')
#         return ''

In [1]:
def get_pushshift_data(link_id, after = 0, limit = 100) -> dict():
    try:
        r = ''
        URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'+'&after='+ str(after)
        print(URL)
        r = requests.get(URL)
        if r.status_code == 200:
            data = json.loads(r.text, strict = False)
            return data['data']
        
        #Si on a eu une erreur en récupérant l'URL on réessaye 5 fois(avec sleep pour être sûr), sinon on abandonne
        else:
            time.sleep(1)
            nb_try = 0
            #Si on a eu une erreur status code 429, on risque d'être ban pour 1h. See https://www.reddit.com/r/pushshift/comments/uhjzvd/temporary_banning_of_ips_with_high_rates_of_429s/
            if r.status_code == 429:
                print("I had a 429 status code. I'm waiting 3 seconds.")
                time.sleep(3)
            
            while r.status_code != 200 | nb_try < 5:
                URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'
                print(URL)
                r = requests.get(URL)
                data = json.loads(r.text, strict = False)
                nb_try += 1
            if r.status_code == 200:           
                return data['data']
            else: return ''
    except:
        print('Error while accessing API')
        print(r)
        return ''

Example of a post

In [3]:
get_pushshift_data('ucbjgz')

https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=0


[{'all_awardings': [],
  'archived': False,
  'associated_award': None,
  'author': 'AutoModerator',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_6l4z3',
  'author_patreon_flair': False,
  'author_premium': True,
  'body': 'Pas assez de musique sur r/france ? Rejoins-nous sur r/musique et aide nous à développer cette nouvelle communauté\n\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/france) if you have any questions or concerns.*',
  'body_sha1': 'f165f599f1a2ee35433403826535a438667d5ce0',
  'can_gild': True,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'collapsed_reason_code': None,
  'comment_type': None,
  'controversiality': 0,
  '

In [4]:
def collect_one_comment(comment, columns, link_id) -> pd.Series():
    author = comment['author']
    text = comment['body']    
    commentId = comment['id']
    parent_commentId = comment['parent_id']
    created = comment['created_utc']
    created_date = datetime.fromtimestamp(created)
    permalink = comment['permalink']    
    return pd.Series([link_id, author,commentId,text,parent_commentId,created_date,permalink], index = columns), created

In [5]:
def collect_all_comments(link_id):
    columns = ['post_id','authors','commentId','text','parent_commentId','created','permalink']
    rows_list = []

    comments = get_pushshift_data(link_id)
    nb_comments = len(comments)
    for comment in comments:
        pd_comment, after = collect_one_comment(comment, columns, link_id)
        rows_list.append(pd_comment)
    
    if nb_comments == 100:
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because we will always scrap the last comment twice, to make sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments)
    else: nb_comments = 0

    #We can get only 100 comments everytime, so we're looping changing the 'after' timestamp of comments until we have no more comments
    while nb_comments > 0:
        #We're scrapping comments and inserting in dataframe
        for comment in comments:
            pd_comment, after = collect_one_comment(comment, columns, link_id)
            rows_list.append(pd_comment)
        comments = get_pushshift_data(link_id, after - 1)
        #We're subtracting one because we will always scrap the last comment twice, to make sure not to miss one if two comments have the same timestamp
        nb_comments = len(comments) - 1
    
    return pd.DataFrame(rows_list, columns=columns)

In [6]:
collect_all_comments('qq0oz8')

https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636476517
https://api.pushshift.io/reddit/comment/search/?link_id=qq0oz8&limit=100&q=*&after=1636617851


Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,qq0oz8,Soviet75,hjx50jq,*Rage en américain*,t3_qq0oz8,2021-11-09 11:21:25,/r/france/comments/qq0oz8/guide_comment_mesure...
1,qq0oz8,mooothemadcow,hjx50uk,Je me suis pissé dessus tellement j'ai ri.\n\n...,t3_qq0oz8,2021-11-09 11:21:33,/r/france/comments/qq0oz8/guide_comment_mesure...
2,qq0oz8,nyme-me,hjx6twj,C'est vrai! Malheureusement pas tout à fait po...,t3_qq0oz8,2021-11-09 11:47:24,/r/france/comments/qq0oz8/guide_comment_mesure...
3,qq0oz8,plouky,hjx6zob,Aire &gt; système métrique et terrain de foot...,t3_qq0oz8,2021-11-09 11:49:38,/r/france/comments/qq0oz8/guide_comment_mesure...
4,qq0oz8,Pb_Flo,hjx744l,"Médecine, aéronautique, hydraulique etc. rentr...",t3_qq0oz8,2021-11-09 11:51:19,/r/france/comments/qq0oz8/guide_comment_mesure...
...,...,...,...,...,...,...,...
150,qq0oz8,rakoo,hk1uwiq,Il parait que c'est la méthode que les arabes ...,t1_hjxtyun,2021-11-10 11:06:28,/r/france/comments/qq0oz8/guide_comment_mesure...
151,qq0oz8,zdimension,hk1ww3b,Instinctivement j'appellerais ça un diagramme ...,t1_hk06ypt,2021-11-10 11:35:03,/r/france/comments/qq0oz8/guide_comment_mesure...
152,qq0oz8,Dum_spero_spiro_,hk25u7f,Un grand merci !,t1_hk09fec,2021-11-10 13:22:54,/r/france/comments/qq0oz8/guide_comment_mesure...
153,qq0oz8,Dum_spero_spiro_,hk25vew,Merci !,t1_hk1ww3b,2021-11-10 13:23:14,/r/france/comments/qq0oz8/guide_comment_mesure...


### We know want to get coments from all the post we scrapped

In [7]:
import glob

# getting csv files from the folder MyProject
path = "csv_exports"

# read all the files with extension .csv
filenames = glob.glob(path + "\*.csv")
print('File names:', filenames)
all_titres = pd.DataFrame()
# for loop to iterate all csv files
for file in filenames:
   # reading csv files
   print("\nReading file = ",file)
   all_titres = all_titres.append(pd.read_csv(file))

all_titres = all_titres.reset_index(drop=True)
post_ids = all_titres['postId'].tolist()
post_ids

File names: ['csv_exports\\france_20210901_20211002.csv', 'csv_exports\\france_20211002_20211031.csv', 'csv_exports\\france_20211031_20211231.csv', 'csv_exports\\france_20211231_20220131.csv', 'csv_exports\\france_20220131_20220228.csv', 'csv_exports\\france_20220228_20220331.csv', 'csv_exports\\france_20220331_20220401.csv', 'csv_exports\\france_20220331_20220426.csv']

Reading file =  csv_exports\france_20210901_20211002.csv

Reading file =  csv_exports\france_20211002_20211031.csv

Reading file =  csv_exports\france_20211031_20211231.csv

Reading file =  csv_exports\france_20211231_20220131.csv

Reading file =  csv_exports\france_20220131_20220228.csv

Reading file =  csv_exports\france_20220228_20220331.csv

Reading file =  csv_exports\france_20220331_20220401.csv

Reading file =  csv_exports\france_20220331_20220426.csv


['pfg0r5',
 'pfgezw',
 'pfh52l',
 'pfh9pl',
 'pfhakb',
 'pfhn2p',
 'pfhomq',
 'pfkds1',
 'pfktmh',
 'pfkvw2',
 'pflw2r',
 'pfmj9z',
 'pfmwgm',
 'pfn0ez',
 'pfn4hz',
 'pfnj6d',
 'pfnjpj',
 'pfns5n',
 'pfnw7j',
 'pfnwhm',
 'pfo2ck',
 'pfo5he',
 'pfo6wm',
 'pfo84s',
 'pfogr6',
 'pfojbq',
 'pfokcg',
 'pfoojx',
 'pfooqv',
 'pfope1',
 'pfordk',
 'pfos3f',
 'pfov4r',
 'pfp58k',
 'pfp7mg',
 'pfp8ze',
 'pfpaef',
 'pfpmon',
 'pfpoe4',
 'pfprwo',
 'pfpsb0',
 'pfpvqy',
 'pfpxjo',
 'pfq10y',
 'pfq7lv',
 'pfqcth',
 'pfqepz',
 'pfqh7t',
 'pfqhs8',
 'pfql07',
 'pfqldw',
 'pfqm73',
 'pfqs0v',
 'pfqv8u',
 'pfqyj5',
 'pfqzaf',
 'pfr09l',
 'pfr1jp',
 'pfr2qk',
 'pfr3af',
 'pfr61j',
 'pfr6dr',
 'pfr72p',
 'pfr8tk',
 'pfrjtn',
 'pfrksj',
 'pfrlxl',
 'pfrrql',
 'pfrruf',
 'pfrrvk',
 'pfrsdq',
 'pfrvll',
 'pfrwfy',
 'pfs1np',
 'pfs5cp',
 'pfs5nf',
 'pfs5wc',
 'pfs8nm',
 'pfsdq1',
 'pfsejh',
 'pfsjqf',
 'pfslus',
 'pfsmie',
 'pfsqus',
 'pfst0u',
 'pfstwp',
 'pfsx57',
 'pfszsh',
 'pft132',
 'pft4j9',
 'pftfkb',

### Looping on every post we have

In [12]:
# def save_dataframe(df, after : datetime, before : datetime):
#     #On veut retransformer les timestamps en date pour le nom des fichiers
#     after = str(after.strftime("%Y")) +  str(after.strftime("%m")) + str(after.strftime("%d"))
#     before = str(before.strftime("%Y")) +  str(before.strftime("%m")) + str(before.strftime("%d"))
#     csv_file_name = 'france_' + str(after) + '_' + str(before) + '.csv'
#     df.to_csv('csv_exports' + '/' + csv_file_name, index = False, encoding="utf-8")

In [8]:
def save_dataframe(df, post_id = str):
    csv_file_name = 'france_comments_' + post_id + '.csv'
    df.to_csv('csv_exports/comments/' + csv_file_name, index = False, encoding="utf-8")

In [48]:
# get_pushshift_data_pmaw(post_ids[:100])

In [None]:
# all_comments_pmaw = get_pushshift_data_pmaw(post_ids)

In [9]:
#We're scrapping all comments from all post_ids we have. Saving in a file every 50k comments

all_comments = pd.DataFrame()
count = 0
for post_id in post_ids:
    all_comments = pd.concat([all_comments, collect_all_comments(post_id)], ignore_index=True)
    count += 1
    if len(all_comments) > 50000:
        save_dataframe(all_comments, str(post_id))
        all_comments = pd.DataFrame()
save_dataframe(all_comments, str(post_id))
print('I scrapped comments from ' + count + ' posts :).')
all_comments

https://api.pushshift.io/reddit/comment/search/?link_id=pfg0r5&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfgezw&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh52l&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh9pl&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhakb&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhn2p&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhomq&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfkds1&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfktmh&limit=100&q=*&after=0
I had a 429 status code. I'm waiting 3 seconds.
https://api.pushshift.io/reddit/comment/search/?link_id=pfkvw2&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pflw2r&limit=100&q=*&after=0
https://api.pushs

In [None]:
count

10390

In [None]:
all_comments

Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,pgdrwq,twitterInfo_bot,hbak79z,"\nÀ Marseille, Emmanuel Macron hué à son arriv...",t3_pgdrwq,2021-09-02 10:35:00,/r/france/comments/pgdrwq/à_marseille_emmanuel...
1,pgdrwq,salutttlmonde,hbaklo0,Pourquoi la suppression,t3_pgdrwq,2021-09-02 10:41:05,/r/france/comments/pgdrwq/à_marseille_emmanuel...
2,pgdrwq,C_kloug,hbakpi4,"Hué par des dizaines de personnes, je crois qu...",t3_pgdrwq,2021-09-02 10:42:41,/r/france/comments/pgdrwq/à_marseille_emmanuel...
3,pgdulr,deyw75,hbanuvp,Un foyer de jeunes travailleurs peut être ?,t3_pgdulr,2021-09-02 11:30:32,/r/france/comments/pgdulr/recherche_logement_s...
4,pgdulr,Mundane_Cabinet33,hbaqgu4,"C'est une bonne idée, je les contacterai demai...",t1_hbanuvp,2021-09-02 12:08:05,/r/france/comments/pgdulr/recherche_logement_s...
...,...,...,...,...,...,...,...
202,pgeofd,bitflag,hbdpm26,&gt; quand les soignants signalent un manque d...,t1_hbbxg9g,2021-09-03 01:35:47,/r/france/comments/pgeofd/12_mesures_pour_2022...
203,pgeofd,Cpassorcier,hbfml81,C'est très simple. Les soignants sont dignes d...,t1_hbdpm26,2021-09-03 13:25:52,/r/france/comments/pgeofd/12_mesures_pour_2022...
204,pgeofd,bitflag,hbfo1u6,Ah okay donc les gens qu'on aime on les crois ...,t1_hbfml81,2021-09-03 13:41:14,/r/france/comments/pgeofd/12_mesures_pour_2022...
205,pgeofd,Caramel_mouais,hbg98wj,Vous n'êtes pas d'accord avec cette affirmatio...,t1_hbar6ll,2021-09-03 16:35:21,/r/france/comments/pgeofd/12_mesures_pour_2022...


In [None]:
# all_comments.to_csv('csv_exports' + '/' + 'comments_first_all.csv', index = False, encoding="utf-8")

Unnamed: 0,authors,commentId,text,parent_commentId,created,permalink
0,_Tripe_,hb42jo8,Sur ce coup là 🤡\n\nAprès faut voir sur quelle...,t3_pfg0r5,2021-09-01 00:08:06,/r/france/comments/pfg0r5/bravo_la_france_on_b...
1,TB54,hb43vqt,J'avoue que j'ai vraiment du mal à croire le r...,t3_pfg0r5,2021-09-01 00:17:45,/r/france/comments/pfg0r5/bravo_la_france_on_b...
2,coelhophisis,hb44wb6,"En même temps, si j'ai un membre de ma famille...",t3_pfg0r5,2021-09-01 00:25:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
3,[deleted],hb4568g,[removed],t3_pfg0r5,2021-09-01 00:27:08,/r/france/comments/pfg0r5/bravo_la_france_on_b...
4,CulteDeLaRaison,hb45ugd,"Je veux bien la taille de la population, la mé...",t3_pfg0r5,2021-09-01 00:32:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
...,...,...,...,...,...,...
126,Canaan-Aus,hehlenh,"I got mine today. 26 days. applied on the 1st,...",t1_hcrn6ff,2021-09-27 18:30:52,/r/france/comments/pfkds1/get_a_digital_sanita...
127,nixsequi,hb54rz2,I have never seen these products in France.,t3_pfktmh,2021-09-01 05:02:54,/r/france/comments/pfktmh/eiffel_brand_product...
128,Frenetic_Platypus,hb556as,"I've never seen them, I can't find any seller ...",t3_pfktmh,2021-09-01 05:06:16,/r/france/comments/pfktmh/eiffel_brand_product...
129,MycroFeline,hb55ecn,Neither have I. A quick search on the french-s...,t1_hb54rz2,2021-09-01 05:08:11,/r/france/comments/pfktmh/eiffel_brand_product...
