### Get comments from a post ID

To get all comments from the post with ID "ucbjgz": https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*

The limit is currently clamped to 100 comments per request, so we need a loop if we want more (see https://www.reddit.com/r/pushshift/comments/qufgqa/get_all_comments_from_a_post_id/)
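
As a small illustration of how that URL is put together, here is a sketch that builds it from its parameters with the standard library. The helper name `build_comment_url` and the constant `BASE_URL` are made up for this example; only the endpoint and the parameter names come from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above
BASE_URL = "https://api.pushshift.io/reddit/comment/search/"

def build_comment_url(link_id, after=0, limit=100):
    """Return the comment-search URL for one page of a post's comments."""
    params = {"link_id": link_id, "limit": limit, "q": "*", "after": after}
    # safe="*" keeps the wildcard as-is instead of encoding it as %2A
    return BASE_URL + "?" + urlencode(params, safe="*")

print(build_comment_url("ucbjgz"))
```

Building the query from a dict avoids forgetting a parameter when the same URL has to be rebuilt in several places.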
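
The looping idea can be tried offline before touching the API. The sketch below is only an illustration: `fetch_all`, `fake_fetch_page` and `FAKE_COMMENTS` are hypothetical names, and the fake fetcher assumes that a page contains the oldest comments strictly newer than `after`, which may not match the real API exactly.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every item by repeatedly asking for items newer than the last one seen."""
    items, seen_ids, after = [], set(), 0
    while True:
        page = fetch_page(after, page_size)
        new_items = [c for c in page if c["id"] not in seen_ids]
        if not new_items:
            # Nothing new on this page: we have reached the end
            break
        items.extend(new_items)
        seen_ids.update(c["id"] for c in new_items)
        # Restart one second before the newest timestamp so that ties are not skipped
        after = max(c["created_utc"] for c in page) - 1
    return items

# Stand-in for the API: 250 fake comments, two per timestamp
FAKE_COMMENTS = [{"id": "c%d" % i, "created_utc": 1000 + i // 2} for i in range(250)]

def fake_fetch_page(after, limit):
    """Return at most `limit` fake comments created strictly after `after`."""
    newer = [c for c in FAKE_COMMENTS if c["created_utc"] > after]
    return newer[:limit]

all_items = fetch_all(fake_fetch_page)
print(len(all_items))
```

Tracking the IDs already seen means the overlap created by stepping back one second never produces duplicate rows.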

In [2]:
import pandas as pd
import requests
import json
from datetime import datetime
import time

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [3]:
def get_pushshift_data(link_id, after = 0, limit = 100, max_retries = 5) -> list:
    # Build the URL once so that every retry asks for the same page (including 'after')
    URL = 'https://api.pushshift.io/reddit/comment/search/?link_id='+str(link_id)+'&limit='+str(limit)+'&q=*'+'&after='+ str(after)
    # If fetching the URL fails, we retry up to 5 times (with a sleep to be safe), otherwise we give up
    for nb_try in range(max_retries + 1):
        try:
            print(URL)
            r = requests.get(URL)
            if r.status_code == 200:
                data = json.loads(r.text, strict = False)
                return data['data']
            print('Unexpected status code:', r.status_code)
        except (requests.RequestException, ValueError, KeyError) as e:
            print('Error while accessing API:', e)
        time.sleep(1)
    # Every attempt failed: return an empty list so that callers can still use len()
    return []
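
The retry-then-give-up logic can also be isolated in a small helper. This is a sketch rather than what the notebook uses: `call_with_retries` is a hypothetical name, and the exponential delay (1 s, 2 s, 4 s, ...) is a swapped-in variation on the fixed one-second sleep. The `sleep` argument is injectable so the demo runs without actually waiting.

```python
import time

def call_with_retries(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, retrying up to max_retries times.

    Returns func()'s result, or None when every attempt raised an exception.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as error:
            print("attempt", attempt + 1, "failed:", error)
            if attempt < max_retries:
                # Wait 1 s, 2 s, 4 s, ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    return None

# Demo: a function that fails twice before succeeding
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return ["comment"]

delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```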

Example with a single post:

In [None]:
get_pushshift_data('ucbjgz')

In [22]:
def collect_one_comment(comment, columns, link_id) -> tuple:
    # Returns the comment as a pd.Series plus its raw 'created_utc' timestamp (used for pagination)
    author = comment['author']
    text = comment['body']
    commentId = comment['id']
    parent_commentId = comment['parent_id']
    created = comment['created_utc']
    # Note: fromtimestamp() without a tz argument converts to the local time zone of the machine
    created_date = datetime.fromtimestamp(created)
    permalink = comment['permalink']
    return pd.Series([link_id, author,commentId,text,parent_commentId,created_date,permalink], index = columns), created
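
Because `datetime.fromtimestamp(created)` uses the local time zone, the `created` column depends on where the notebook runs. If dates that are independent of the machine are preferred, one option is to convert explicitly to UTC. The helper name `to_utc_datetime` below is hypothetical; this is a sketch of the alternative, not what the function above does.

```python
from datetime import datetime, timezone

def to_utc_datetime(created_utc):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

print(to_utc_datetime(0).isoformat())  # 1970-01-01T00:00:00+00:00
```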

In [21]:
def collect_all_comments(link_id):
    columns = ['post_id','authors','commentId','text','parent_commentId','created','permalink']
    rows_list = []
    seen_ids = set()

    comments = get_pushshift_data(link_id)
    #We can only get 100 comments per request, so we loop, moving the 'after' timestamp forward, until a page brings nothing new
    while len(comments) > 0:
        nb_new = 0
        #We scrape the comments and collect them as rows for the dataframe
        for comment in comments:
            pd_comment, after = collect_one_comment(comment, columns, link_id)
            #Skip comments already collected on the previous page
            if comment['id'] not in seen_ids:
                seen_ids.add(comment['id'])
                rows_list.append(pd_comment)
                nb_new += 1
        if nb_new == 0:
            break
        #We subtract one from 'after' so as not to miss comments sharing the last timestamp; the last comment therefore comes back again and is skipped above
        comments = get_pushshift_data(link_id, after - 1)

    return pd.DataFrame(rows_list, columns=columns)
        


In [23]:
collect_all_comments('ucbjgz')

https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=ucbjgz&limit=100&q=*&after=1650978436


Unnamed: 0,post_id,authors,commentId,text,parent_commentId,created,permalink
0,ucbjgz,AutoModerator,i69audd,Pas assez de musique sur r/france ? Rejoins-no...,t3_ucbjgz,2022-04-26 14:34:44,/r/france/comments/ucbjgz/claude_francois_si_j...
1,ucbjgz,klapet,i69dejb,C'est quand même vraiment pâle en comparaison ...,t3_ucbjgz,2022-04-26 14:55:46,/r/france/comments/ucbjgz/claude_francois_si_j...
2,ucbjgz,robespierre__,i69e0ku,Ou de Peter Paul and Mary [YouTube](https://m....,t1_i69dejb,2022-04-26 15:00:37,/r/france/comments/ucbjgz/claude_francois_si_j...
3,ucbjgz,not_franck_the_cook,i69evfd,J'ai trouvé des versions métal bien surprenant...,t1_i69dejb,2022-04-26 15:07:17,/r/france/comments/ucbjgz/claude_francois_si_j...


### We now want to get comments from all the posts we scraped

In [8]:
import glob
import os

# getting csv files from the folder csv_exports
path = "csv_exports"

# read all the files with extension .csv
filenames = glob.glob(os.path.join(path, "*.csv"))
print('File names:', filenames)
frames = []
# for loop to iterate all csv files
for file in filenames:
   # reading csv files
   print("\nReading file = ",file)
   frames.append(pd.read_csv(file))

# DataFrame.append was removed in pandas 2.0, so the files are combined with pd.concat
all_titres = pd.concat(frames, ignore_index=True)
post_ids = all_titres['postId'].tolist()
post_ids

File names: ['csv_exports\\france_20210901_20211002.csv', 'csv_exports\\france_20211002_20211031.csv', 'csv_exports\\france_20211031_20211231.csv', 'csv_exports\\france_20211231_20220131.csv', 'csv_exports\\france_20220131_20220228.csv', 'csv_exports\\france_20220228_20220331.csv', 'csv_exports\\france_20220331_20220401.csv', 'csv_exports\\france_20220331_20220426.csv']

Reading file =  csv_exports\france_20210901_20211002.csv

Reading file =  csv_exports\france_20211002_20211031.csv

Reading file =  csv_exports\france_20211031_20211231.csv

Reading file =  csv_exports\france_20211231_20220131.csv

Reading file =  csv_exports\france_20220131_20220228.csv

Reading file =  csv_exports\france_20220228_20220331.csv

Reading file =  csv_exports\france_20220331_20220401.csv

Reading file =  csv_exports\france_20220331_20220426.csv


['pfg0r5',
 'pfgezw',
 'pfh52l',
 'pfh9pl',
 'pfhakb',
 'pfhn2p',
 'pfhomq',
 'pfkds1',
 'pfktmh',
 'pfkvw2',
 'pflw2r',
 'pfmj9z',
 'pfmwgm',
 'pfn0ez',
 'pfn4hz',
 'pfnj6d',
 'pfnjpj',
 'pfns5n',
 'pfnw7j',
 'pfnwhm',
 'pfo2ck',
 'pfo5he',
 'pfo6wm',
 'pfo84s',
 'pfogr6',
 'pfojbq',
 'pfokcg',
 'pfoojx',
 'pfooqv',
 'pfope1',
 'pfordk',
 'pfos3f',
 'pfov4r',
 'pfp58k',
 'pfp7mg',
 'pfp8ze',
 'pfpaef',
 'pfpmon',
 'pfpoe4',
 'pfprwo',
 'pfpsb0',
 'pfpvqy',
 'pfpxjo',
 'pfq10y',
 'pfq7lv',
 'pfqcth',
 'pfqepz',
 'pfqh7t',
 'pfqhs8',
 'pfql07',
 'pfqldw',
 'pfqm73',
 'pfqs0v',
 'pfqv8u',
 'pfqyj5',
 'pfqzaf',
 'pfr09l',
 'pfr1jp',
 'pfr2qk',
 'pfr3af',
 'pfr61j',
 'pfr6dr',
 'pfr72p',
 'pfr8tk',
 'pfrjtn',
 'pfrksj',
 'pfrlxl',
 'pfrrql',
 'pfrruf',
 'pfrrvk',
 'pfrsdq',
 'pfrvll',
 'pfrwfy',
 'pfs1np',
 'pfs5cp',
 'pfs5nf',
 'pfs5wc',
 'pfs8nm',
 'pfsdq1',
 'pfsejh',
 'pfsjqf',
 'pfslus',
 'pfsmie',
 'pfsqus',
 'pfst0u',
 'pfstwp',
 'pfsx57',
 'pfszsh',
 'pft132',
 'pft4j9',
 'pftfkb',
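
Two of the export files listed above start on the same date (`france_20220331_20220401.csv` and `france_20220331_20220426.csv`), so some post IDs may appear more than once in `post_ids`. The sketch below shows one way to read the IDs while dropping repeats, using only the standard library. The function name `read_post_ids` is hypothetical; the `postId` column name is the one used above.

```python
import csv
import glob
import os

def read_post_ids(folder, column="postId"):
    """Return the unique values of `column` across all CSV files in `folder`, in first-seen order."""
    post_ids = []
    seen = set()
    # sorted() makes the file order reproducible; glob alone gives no ordering guarantee
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(filename, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                post_id = row[column]
                if post_id not in seen:
                    seen.add(post_id)
                    post_ids.append(post_id)
    return post_ids
```

Calling `read_post_ids("csv_exports")` would then give a list with each post ID once, which avoids downloading the comments of the same post twice.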

### Looping over every post we have

In [None]:
# def save_dataframe(df, after : datetime, before : datetime):
#     #We want to convert the timestamps back to dates for the file names
#     after = str(after.strftime("%Y")) +  str(after.strftime("%m")) + str(after.strftime("%d"))
#     before = str(before.strftime("%Y")) +  str(before.strftime("%m")) + str(before.strftime("%d"))
#     csv_file_name = 'france_' + str(after) + '_' + str(before) + '.csv'
#     df.to_csv('csv_exports' + '/' + csv_file_name, index = False, encoding="utf-8")

In [25]:
all_comments = pd.DataFrame()
for post_id in post_ids:
    all_comments = pd.concat([all_comments, collect_all_comments(post_id)], ignore_index=True)

all_comments
    

https://api.pushshift.io/reddit/comment/search/?link_id=pfg0r5&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfg0r5&limit=100&q=*&after=1631016357
https://api.pushshift.io/reddit/comment/search/?link_id=pfgezw&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfgezw&limit=100&q=*&after=1630474783
https://api.pushshift.io/reddit/comment/search/?link_id=pfh52l&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh52l&limit=100&q=*&after=1630454580
https://api.pushshift.io/reddit/comment/search/?link_id=pfh9pl&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfh9pl&limit=100&q=*&after=1630488890
https://api.pushshift.io/reddit/comment/search/?link_id=pfhakb&limit=100&q=*&after=0
https://api.pushshift.io/reddit/comment/search/?link_id=pfhakb&limit=100&q=*&after=1630524404
https://api.pushshift.io/reddit/comment/search/?link_id=pfhn2p&limit=100&q=*&after=0
https://api.pushshif

In [1]:
all_comments

NameError: name 'all_comments' is not defined

In [10]:
all_comments.to_csv('csv_exports' + '/' + 'comments_first_all.csv', index = False, encoding="utf-8")

Unnamed: 0,authors,commentId,text,parent_commentId,created,permalink
0,_Tripe_,hb42jo8,Sur ce coup là 🤡\n\nAprès faut voir sur quelle...,t3_pfg0r5,2021-09-01 00:08:06,/r/france/comments/pfg0r5/bravo_la_france_on_b...
1,TB54,hb43vqt,J'avoue que j'ai vraiment du mal à croire le r...,t3_pfg0r5,2021-09-01 00:17:45,/r/france/comments/pfg0r5/bravo_la_france_on_b...
2,coelhophisis,hb44wb6,"En même temps, si j'ai un membre de ma famille...",t3_pfg0r5,2021-09-01 00:25:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
3,[deleted],hb4568g,[removed],t3_pfg0r5,2021-09-01 00:27:08,/r/france/comments/pfg0r5/bravo_la_france_on_b...
4,CulteDeLaRaison,hb45ugd,"Je veux bien la taille de la population, la mé...",t3_pfg0r5,2021-09-01 00:32:07,/r/france/comments/pfg0r5/bravo_la_france_on_b...
...,...,...,...,...,...,...
126,Canaan-Aus,hehlenh,"I got mine today. 26 days. applied on the 1st,...",t1_hcrn6ff,2021-09-27 18:30:52,/r/france/comments/pfkds1/get_a_digital_sanita...
127,nixsequi,hb54rz2,I have never seen these products in France.,t3_pfktmh,2021-09-01 05:02:54,/r/france/comments/pfktmh/eiffel_brand_product...
128,Frenetic_Platypus,hb556as,"I've never seen them, I can't find any seller ...",t3_pfktmh,2021-09-01 05:06:16,/r/france/comments/pfktmh/eiffel_brand_product...
129,MycroFeline,hb55ecn,Neither have I. A quick search on the french-s...,t1_hb54rz2,2021-09-01 05:08:11,/r/france/comments/pfktmh/eiffel_brand_product...
