## **In this notebook**

1. Loading and processing the master log live-stream data
2. Defining a timestamp
3. Collecting comments from Reddit, after a delay, for that timestamp
4. Finding the removed comments in the master log for that timestamp

Importing required libraries

In [71]:
import json
import hashlib
import pandas as pd
from google.colab import files
from datetime import datetime

# **Loading and processing master log csv file for one subreddit**

In [72]:
#load the csv into dataframe
final_df = pd.read_csv('/content/com_stream_france.csv')



In [73]:
len(final_df)

820939

### Cleaning the master log data

For checking and debugging dataframe data 

In [74]:
final_df[final_df.score.isna()] # for the comment ('com') dataframe
# If the output is empty, the data is fine.
# Rows show up here when the CSV file was not appended properly and the columns are misaligned.

Unnamed: 0,author,body,collapsed,controversiality,created_utc,id,link_id,parent_id,permalink,score
67397,fu589uw,Hondelatte raconte,,,,,,,,
67398,fu589uw,Hondelatte raconte,,,,,,,,
67399,fu589uw,Hondelatte raconte,,,,,,,,
67400,fu589uw,Hondelatte raconte,,,,,,,,
67401,fu589uw,Hondelatte raconte,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
524288,0:17 (i) <----- LA,0.0,0.0,1.657704e+09,ifyy1cx,t3_vxy3re,t1_ifyxwqf,/r/france/comments/vxy3re/les_pubs_youtube_on_...,1.0,
645478,2q0m6gq8,,,,,,,,,
645479,Voici un brillant exemple de ce sujet https://...,0.0,0.0,1.659201e+09,iia1nbx,t3_wbq180,t3_wbq180,/r/france/comments/wbq180/les_clichés_scénaris...,1.0,
715253,czbnb,,,,,,,,,


In [75]:
# Remove rows that are of no use and return the updated dataframe.
# Use the cell above to inspect the affected rows and decide whether they should be removed.
# score and upvote_ratio are empty when the CSV files were not appended properly.

def cleaning(df, df_type):
  # print('before: ', len(df))
  if df_type == 'com':
    # comment dataframes need both a body and a score
    df = df[df.body.notna()]
    df = df[df.score.notna()]
  else:
    # other dataframe types are checked through upvote_ratio
    df = df[df.upvote_ratio.notna()]

  df = df[df.author.notna()]
  df = df[df.id.notna()]
  # copy so that the assignment below does not write into a slice of the input
  df = df[df.created_utc.notna()].copy()
  df['created_utc'] = df['created_utc'].astype(float)

  # print('after: ', len(df))
  return df
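
As a small illustration of what the `'com'` branch of `cleaning` filters out, the sketch below applies the same null checks to a toy dataframe. The rows and values are made up for the example; only the column names come from the master log.

```python
import pandas as pd

# Toy comment log: the second row imitates a misaligned append
# (a comment id sitting in the author column, later fields missing).
toy = pd.DataFrame({
    "author": ["user_a", "fu589uw", "user_b"],
    "body": ["Bonjour", "Hondelatte raconte", "Salut"],
    "created_utc": ["1660522000.0", None, "1660522100.0"],
    "id": ["aaa1111", None, "bbb2222"],
    "score": [1.0, None, 1.0],
})

# Keep only rows where every key field is present, then make timestamps numeric.
required = ["body", "score", "author", "id", "created_utc"]
cleaned = toy.dropna(subset=required).copy()
cleaned["created_utc"] = cleaned["created_utc"].astype(float)

print(len(toy), "->", len(cleaned))
```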

Cleaning **comment** dataframes

In [76]:
final_df = cleaning(final_df, 'com')

In [79]:
datetime.fromtimestamp(final_df.created_utc.max()).date()

datetime.date(2022, 8, 27)

### Filtering the master log by **timestamp**
**Note:** This section defines the **timestamp** and the **output file name tag** used for every file downloaded during this process.

Do NOT forget to set the subreddit correctly.

In [80]:
# IMPORTANT: check that the subreddit is set correctly.
subreddit = '_france_'

# example for 15th August 2022
date1 = datetime(2022,8,15)
date2 = datetime.today() # because the com_stream CSV loaded above was downloaded from the server today

dates_str = date1.strftime('%d%b%Y')+'_to_'+date2.strftime('%d%b%Y')

out_file_name_tag = subreddit+dates_str
print(out_file_name_tag)

_france_15Aug2022_to_28Aug2022


In [81]:
# I used this link to get timestamp for a date: https://timestamp.online/
# req_df = final_df[(final_df['created_utc'] >= 1655449200) & (final_df['created_utc'] <= 1655881200)]

# Alternatively, filter with the dates defined above
req_df = final_df[(final_df['created_utc'] >= date1.timestamp())]
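
One detail worth keeping in mind: calling `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, while `created_utc` is in UTC. On a runtime whose clock is set to UTC the two coincide. The sketch below, which uses a made-up three-row dataframe, pins the bound to UTC explicitly so that the filter does not depend on where the notebook runs.

```python
from datetime import datetime, timezone
import pandas as pd

# Lower bound pinned to UTC midnight, independent of the local time zone.
start_utc = datetime(2022, 8, 15, tzinfo=timezone.utc)
start_ts = start_utc.timestamp()

# Hypothetical rows: one second before the bound, on the bound, and a day later.
demo = pd.DataFrame({
    "id": ["before", "on_bound", "after"],
    "created_utc": [start_ts - 1, start_ts, start_ts + 86400],
})

kept = demo[demo["created_utc"] >= start_ts]
print(kept["id"].tolist())

# Converting back with an explicit tz keeps the confirmation step in UTC as well.
print(datetime.fromtimestamp(kept["created_utc"].min(), tz=timezone.utc).date())
```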

In [84]:
# just confirming the above
datetime.fromtimestamp(req_df.created_utc.min()).date()

datetime.date(2022, 8, 15)

In [83]:
len(req_df)

77069

In [85]:
# assign the filtered dataframe back to the original name
# (copied, so that new columns can be added later without a SettingWithCopyWarning)
final_df = req_df.copy()

# **Collecting Reddit comments using comment IDs**
This collection runs after a significant delay from the time these comment IDs were live-streamed on the server.

### Generating **fullname** for the **comment IDs**

In [86]:
comment_ids = list(final_df.id)

In [87]:
# Prefix 't1_' for a comment, 't3_' for a post/link in Reddit terms
# We need fullnames to fetch comments by ID with the Reddit API info function.
comment_fullnames = ["t1_"+i for i in comment_ids]
print(len(comment_fullnames))

77069
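
The same prefix convention already appears in the master log's `parent_id` column, so it can also tell replies from top-level comments. The sketch below is only an illustration: `kind_of` is a hypothetical helper, and the IDs are sample values of the kind that appear in the master log.

```python
# Build fullnames from bare comment IDs, as in the cell above.
sample_ids = ["ikbhu8o", "ikbi4pp"]
sample_fullnames = ["t1_" + comment_id for comment_id in sample_ids]
print(sample_fullnames)

def kind_of(fullname):
    """Return 'comment' for t1_ fullnames, 'post' for t3_, else 'other'."""
    prefix = fullname.split("_", 1)[0]
    return {"t1": "comment", "t3": "post"}.get(prefix, "other")

# parent_id values: a parent that is a comment means the row is a reply,
# a parent that is a post means the row is a top-level comment.
parents = {"ikbhu8o": "t1_ikb5juo", "ikbi4pp": "t3_wo6hz2"}
levels = {
    comment_id: "reply" if kind_of(parent) == "comment" else "top-level"
    for comment_id, parent in parents.items()
}
print(levels)
```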


### Connect to Reddit API (PRAW)

In [88]:
pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [89]:
credentials = 'client_secret.json'

with open(credentials) as f:
    creds = json.load(f)

In [90]:
import praw
reddit = praw.Reddit(client_id = creds['client_id'],
                    client_secret = creds['client_secret'],
                    user_agent = creds['user_agent'],
                    redirect_uri = creds['redirect_uri'],
                    refresh_token = creds['refresh_token'],
                     check_for_async = False)

### Collecting using Reddit API

In [91]:
# Function to create batches of 500 fullnames

def create_batch(total, batch_size=500):
  # slice the list into consecutive chunks; the last chunk holds the remainder
  final_batches = [total[start:start+batch_size] for start in range(0, len(total), batch_size)]

  # as before, an empty input still yields one (empty) batch
  return final_batches or [[]]
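
A quick way to convince yourself that batching neither drops nor duplicates names is to run it on a made-up list. The sketch below uses a hypothetical slicing helper, `chunk`, on 1,201 placeholder fullnames rather than on real comment IDs.

```python
def chunk(items, size=500):
    """Split items into consecutive slices of at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]

# 1,201 placeholder fullnames, not real comment IDs.
fake_fullnames = ["t1_fake%04d" % n for n in range(1201)]
demo_batches = chunk(fake_fullnames)

print([len(batch) for batch in demo_batches])

# Flattening the batches gives back the original list in the original order.
flattened = [name for batch in demo_batches for name in batch]
print(flattened == fake_fullnames)
```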

In [None]:
# For every batch, reddit.info() fetches the comments; only the comment ID and body are kept and appended to the final list
# This cell runs fine for about 200 batches; with more, the notebook might time out if left unattended

final_list = []
batches = create_batch(comment_fullnames)
print("Number of batches: ", len(batches))

b_count = 1
for batch in batches:
  com_generator = reddit.info(fullnames = batch)

  for com in com_generator:
    com_dict={}
    com_dict['id'] = com.id
    com_dict['body'] = com.body

    final_list.append(com_dict)
  
  print("-----Batch: ", b_count, "----Num of comments collected: ", len(final_list))
  b_count+=1
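
Because this loop can run for a long time, a transient API error in one batch would stop it part-way. The sketch below shows one possible way to retry a failed batch. `collect_with_retry` and `fetch_batch` are hypothetical names, not part of PRAW; in this notebook `fetch_batch` could wrap `reddit.info(fullnames=batch)`, but here a stub stands in for the API so that the example runs offline.

```python
import time

def collect_with_retry(fetch_batch, batches, max_attempts=3, wait_seconds=0.0):
    """Call fetch_batch(batch) for every batch, retrying a batch that raises.

    fetch_batch is a hypothetical callable returning (id, body) pairs.
    """
    collected = []
    for batch in batches:
        for attempt in range(1, max_attempts + 1):
            try:
                rows = [{"id": cid, "body": body} for cid, body in fetch_batch(batch)]
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                time.sleep(wait_seconds)
        collected.extend(rows)
    return collected

# Stub standing in for the Reddit API: fails on its first call, then succeeds.
calls = {"count": 0}

def flaky_fetch(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    return [(name[3:], "body of " + name) for name in batch]  # strip the 't1_' prefix

result = collect_with_retry(flaky_fetch, [["t1_aaa", "t1_bbb"], ["t1_ccc"]])
print(len(result), calls["count"])
```

Retrying whole batches keeps the logic simple; the failed call contributes nothing to `collected`, so a retried batch is not counted twice.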

In [93]:
# Remove extra whitespace from the comment body
final_list = [{'id': i['id'], 'body': " ".join(i['body'].split())} for i in final_list]
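
The `" ".join(text.split())` idiom collapses every run of whitespace, including newlines and tabs, into a single space and trims both ends. A minimal sketch on a made-up comment body (`squeeze_whitespace` is a hypothetical name):

```python
def squeeze_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())

raw = "C'est  du sarcasme ?\n\n  Oui,\tbien sûr.  "
print(repr(squeeze_whitespace(raw)))

# The placeholder body is left untouched, so a later '[removed]' comparison still works.
print(squeeze_whitespace("[removed]") == "[removed]")
```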

In [94]:
len(final_list)

77069

In [95]:
# Converting the list to data frame
reddit_collected_com = pd.DataFrame(final_list)
reddit_collected_com.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77069 entries, 0 to 77068
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      77069 non-null  object
 1   body    77069 non-null  object
dtypes: object(2)
memory usage: 1.2+ MB


#### Saving **reddit collected** comments
**Note:** It is recommended to run at least up to this section. The collection cell above takes a long time, so running through the two cells below ensures that the collection is at least saved in case the notebook times out.

In [96]:
# Check the var 'out_file_name_tag' above to confirm the date and subreddit
collect_outfile_name = 'collected_com'+out_file_name_tag+'.csv'
print(collect_outfile_name)

collected_com_france_15Aug2022_to_28Aug2022.csv


In [97]:
# Download now in case the notebook times out,
# because the collection above takes a long time
reddit_collected_com.to_csv('./'+collect_outfile_name, header=True, index=False, columns=list(reddit_collected_com.axes[1]))
files.download('/content/'+collect_outfile_name)

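
The `Unnamed: 0` column visible in the master log is typically what pandas produces when a CSV that was saved together with its index is read back. Saving with `index=False`, as in the cell above, avoids that. The sketch below shows the difference on a made-up two-row dataframe written to a temporary folder.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"id": ["a1", "a2"], "body": ["Bonjour", "[removed]"]})

with tempfile.TemporaryDirectory() as folder:
    with_index = os.path.join(folder, "with_index.csv")
    without_index = os.path.join(folder, "without_index.csv")

    frame.to_csv(with_index)                  # index written as an unnamed first column
    frame.to_csv(without_index, index=False)  # only the data columns are written

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```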

# **Finding removed comments and saving all required files**


### Comment IDs for comments with body == [removed]

In [98]:
# Selecting the comments whose body is '[removed]' and collecting their comment IDs

removed_com = reddit_collected_com[reddit_collected_com['body']=='[removed]']
removed_id = removed_com['id']
print(len(removed_id))

655
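
Reddit generally shows `[removed]` when a moderator or admin took a comment down and `[deleted]` when the author deleted it; this notebook tracks only the former. As a hedged illustration on made-up data, the sketch below tallies both placeholders so that the two cases can be compared before filtering.

```python
import pandas as pd

# Made-up re-collected comments; only the placeholder strings mirror Reddit's.
recollected = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "body": ["Bonjour", "[removed]", "[deleted]", "[removed]"],
})

# Keep the placeholder strings as they are and label every other body 'visible'.
is_placeholder = recollected["body"].isin(["[removed]", "[deleted]"])
placeholder_counts = recollected["body"].where(is_placeholder, "visible").value_counts()
print(placeholder_counts.to_dict())

# Same selection as in the cell above, restricted to '[removed]'.
removed_ids = recollected.loc[recollected["body"] == "[removed]", "id"].tolist()
print(removed_ids)
```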


In [99]:
removed_com 

Unnamed: 0,id,body
164,ikc8yjv,[removed]
1205,ikd27we,[removed]
1272,ikd2x1c,[removed]
1298,ikd37qh,[removed]
1324,ikd3ia1,[removed]
...,...,...
75887,ilww6cl,[removed]
75898,ilwwg7w,[removed]
76719,ilxlctp,[removed]
76962,ily2h0y,[removed]


### Creating removed col in the **timestamp-filtered master log**

**Note**: The timestamp-filtered master log is not the complete master log downloaded from the server; it is the subset filtered by timestamp in the steps above.

In [100]:
# Creating the 'removed' column in the master log dataframe that was filtered by timestamp above.
# The column is True for every ID in removed_id and False otherwise.

final_df['removed'] = final_df['id'].isin(set(removed_id))
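
A toy version of this flagging step, on made-up IDs, is sketched below. It also checks that every removed ID is actually present in the log, which is the reason the count of flagged rows later matches the number of removed IDs.

```python
import pandas as pd

# Made-up stand-ins for the filtered master log and the removed IDs.
log = pd.DataFrame({"id": ["c1", "c2", "c3"], "body": ["un", "deux", "trois"]})
removed_ids = pd.Series(["c2"])

# Membership test against a set of IDs, one boolean per log row.
log["removed"] = log["id"].isin(set(removed_ids))

# Removed IDs that do not appear in the log at all (expected to be empty).
missing = set(removed_ids) - set(log["id"])
print(log["removed"].tolist(), "missing:", sorted(missing))
```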

### Creating darma_author col in the **timestamp-filtered master log**

In [101]:
def map_author(author):
  # hash the author name and map the first 8 digest bytes to letters
  digested = hashlib.sha256(author.encode('utf-8')).digest()
  output_string = ""

  for byte in digested[:8]:
      mod_result = byte % 52
      if mod_result < 26:
          output_string += chr(65 + mod_result)       # 'A'-'Z'
      else:
          output_string += chr(97 + mod_result - 26)  # 'a'-'z'
  return output_string
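
The mapping is deterministic, so the same author always receives the same 8-letter pseudonym, which keeps authors linkable across files. The sketch below restates the mapping compactly under a hypothetical name, `pseudonym`, and checks these properties on made-up usernames. Since only 8 letters are kept, collisions are possible in principle, although 52^8 possible outputs make them unlikely at this scale.

```python
import hashlib
import string

# Index 0-25 maps to 'A'-'Z' and 26-51 to 'a'-'z', as in map_author.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase

def pseudonym(author):
    """Map a username to 8 letters using the first 8 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(author.encode("utf-8")).digest()
    return "".join(ALPHABET[byte % 52] for byte in digest[:8])

names = ["example_user", "another_user", "example_user"]  # made-up usernames
tags = [pseudonym(name) for name in names]

# Same input gives the same tag; every tag is 8 ASCII letters.
print(tags[0] == tags[2], len(tags[0]), tags[0].isalpha())
```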

In [63]:
# input = "hello2017good"
# map_author(input)

In [102]:
# Creating a new column darma_author that maps to the Reddit author
final_df['darma_author'] = final_df['author'].apply(map_author)

In [103]:
# To display the dataframe
final_df

Unnamed: 0,author,body,collapsed,controversiality,created_utc,id,link_id,parent_id,permalink,score,removed,darma_author
743870,1fbyf3s,La critique de la coupe du monde au Quatar gar...,0.0,0.0,1.660522e+09,ikbhu8o,t3_wo2wki,t1_ikb5juo,/r/france/comments/wo2wki/la_ville_de_rouen_dé...,1.0,False,TaFVPVtD
743871,3yrqm4am,Pareille j'ai horreur de ça. Je sais jamais qu...,0.0,0.0,1.660522e+09,ikbi1rm,t3_wo1up8,t1_ik8kego,/r/france/comments/wo1up8/quelles_sont_des_règ...,1.0,False,WmjpyRyw
743872,8ntz3i73,Je sens que ce post va être très salé,0.0,0.0,1.660522e+09,ikbi4pp,t3_wo6hz2,t3_wo6hz2,/r/france/comments/wo6hz2/traversée_de_la_manc...,1.0,False,PqHgbkMP
743873,hkp0q,"[Here](https://www.google.fr/maps/@48.8642037,...",0.0,0.0,1.660522e+09,ikbi9xi,t3_wofipk,t3_wofipk,/r/france/comments/wofipk/does_anyone_know_whe...,1.0,False,YbdNBtXW
743874,166xaa,J'ai eu un cas différent mais où la SNCF était...,0.0,0.0,1.660522e+09,ikbicrx,t3_wo0sjk,t1_ik8lx3m,/r/france/comments/wo0sjk/meilleure_moyen_de_c...,1.0,False,kBlsUHtf
...,...,...,...,...,...,...,...,...,...,...,...,...
820934,aatr5,C'est quoi le rapport ? Sachant que le RU est ...,0.0,0.0,1.661570e+09,ilygrgq,t3_wyhk3n,t1_ilxh37m,/r/france/comments/wyhk3n/le_taux_annuel_dinfl...,1.0,False,vlDQdile
820935,mur00,Tout à fait. On peut ajouter qu’au sein d’une ...,0.0,0.0,1.661570e+09,ilygshd,t3_wyrsze,t1_ilygi3d,/r/france/comments/wyrsze/un_fasciste_peut_se_...,1.0,False,OjBlwVLl
820936,1g09eo7x,"Le mot que tu cherches est ""plaisanter"". Mais ...",0.0,0.0,1.661570e+09,ilyh5pg,t3_wymygr,t1_ilye9cf,/r/france/comments/wymygr/comment_je_me_fais_d...,1.0,False,GqiVXDrJ
820937,41qf9kzu,">Dans la réalité, tant que tu as un salaire, t...",0.0,0.0,1.661570e+09,ilyh74f,t3_wyea0q,t1_ilwpgte,/r/france/comments/wyea0q/remettons_les_pendul...,1.0,False,udDrDZOx


### Saving **timestamp-filtered master log** with darma_author and removed col

In [104]:
# Define file name
# Check the var 'out_file_name_tag' above to confirm the date and subreddit
new_master_outfile_name = 'new_com_stream'+out_file_name_tag+'.csv'
print(new_master_outfile_name)

new_com_stream_france_15Aug2022_to_28Aug2022.csv


In [105]:
# Saving the final data frame after creating removed col

final_df.to_csv('./'+new_master_outfile_name, header=True, index=False, columns=list(final_df.axes[1]))
files.download("/content/"+new_master_outfile_name)


### Saving **removed** comments

In [106]:
found_removed_com = final_df[final_df.removed]
print(len(found_removed_com))
found_removed_com.head()

655


Unnamed: 0,author,body,collapsed,controversiality,created_utc,id,link_id,parent_id,permalink,score,removed,darma_author
744034,fh36u,Women,0.0,0.0,1660535000.0,ikc8yjv,t3_wog2da,t3_wog2da,/r/france/comments/wog2da/what_things_are_expe...,1.0,True,nOQSkhIC
745075,w1e70,Un brillant avenir dans notre pays s'ouvre dev...,0.0,0.0,1660556000.0,ikd27we,t3_wov5ev,t3_wov5ev,/r/france/comments/wov5ev/paris_des_pompiers_a...,1.0,True,GSmuVoWF
745142,4ag4zo2i,Je trouve ça immoral. Dans le pays où j’habite...,0.0,0.0,1660556000.0,ikd2x1c,t3_wov5ev,t3_wov5ev,/r/france/comments/wov5ev/paris_des_pompiers_a...,1.0,True,bduBGbsc
745168,1a2etvkv,C’est du sarcasme ou t’es vraiment abruti ?\n\...,0.0,0.0,1660557000.0,ikd37qh,t3_wotdwy,t1_ikcwhoe,/r/france/comments/wotdwy/inflation_le_coût_de...,1.0,True,iJFUFmEg
745194,4e1ajea7,"""pas de problème de sécurité"" .... ""Peine de m...",0.0,0.0,1660557000.0,ikd3ia1,t3_wov5ev,t1_ikd2x1c,/r/france/comments/wov5ev/paris_des_pompiers_a...,1.0,True,YKfCkvdw


In [107]:
# Define file name
# Check the var 'out_file_name_tag' above to confirm the date and subreddit
removed_outfile_name = 'removed_com'+out_file_name_tag+'.csv'
print(removed_outfile_name)

removed_com_france_15Aug2022_to_28Aug2022.csv


In [108]:
# Saving and downloading only removed comments

found_removed_com.to_csv('./'+removed_outfile_name, header=True, index=False, columns=list(found_removed_com.axes[1]))
files.download("/content/"+removed_outfile_name)
