## **In this notebook**

1.   Generating threads with full details
2.   Sorting the generated threads from root to leaf
3.   Generating threads with author, text, and permalink (for verification)
4.   Filtering out threads in which the toxic author does not appear more than once
5.   Filtering out threads with gaps (body == "[removed] and NOT FOUND")
6.   Filtering out threads by LangID
7.   Cleaning and formatting the final JSON file according to the offline protocol

**Note:** I have broken the process down into separate sections so that we can take action at a specific step if very few threads remain.
Ex: removing threads with gaps discards a lot of threads, which is why I save the threads at every stage. The same applies to the langID filter.
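
A minimal sketch of the save-at-every-stage idea from the note above. The helper name `save_stage` and its `out_dir` parameter are hypothetical (they do not exist in this notebook); the file name follows the `<stage>_threads<tag>.json` pattern used by the saving cells.

```python
import json
import os
import tempfile

def save_stage(threads, stage, tag, out_dir="."):
    """Write one intermediate list of threads to <out_dir>/<stage>_threads<tag>.json."""
    path = os.path.join(out_dir, f"{stage}_threads{tag}.json")
    with open(path, "w") as outfile:
        json.dump({"result": threads}, outfile)
    return path

# Usage with a throwaway directory and a toy thread
demo_dir = tempfile.mkdtemp()
demo_path = save_stage([{"conversation": []}], "without_gap", "_france_demo", demo_dir)
with open(demo_path) as infile:
    reloaded = json.load(infile)
print(demo_path, len(reloaded["result"]))
```

Keeping one file per stage means a filter that discards too many threads can be rerun from the previous file without regenerating the threads.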

**Note**: Make sure that the master log and the removed comments file have the same timestamp.



In [2]:
import json
import hashlib
import pandas as pd
from datetime import datetime
from google.colab import files

# **Loading and processing master log for comments and posts and removed comments**

After running the [**Removed_com_collection_notebook**](https://colab.research.google.com/drive/1zialczWr7qdN9hfEatYIJpdk5kquCkWb?usp=sharing), you get ***new_com_stream_{subreddit}_{timestamp}.csv*** files, which need to be uploaded here.

For the post stream, you can upload the csv file downloaded directly from the server, since the post csv files are small.

In [3]:
# from the removed_com_collection_notebook: the dataframe saved and downloaded after updating the removed and darma_author columns
master_log = pd.read_csv('/content/new_com_stream_france_15Aug2022_to_28Aug2022.csv')

# downloaded from the server
post_stream = pd.read_csv('/content/sub_stream_france.csv')

# final_removed_df = pd.read_csv('/content/removed_com_science_20july_to_15aug.csv')

### To check the schema of the above dataframes



In [4]:
post_stream.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15862 entries, 0 to 15861
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   author        15862 non-null  object 
 1   created_utc   15862 non-null  float64
 2   id            15862 non-null  object 
 3   num_comments  15862 non-null  float64
 4   permalink     15862 non-null  object 
 5   score         15862 non-null  float64
 6   selftext      6515 non-null   object 
 7   title         15862 non-null  object 
 8   upvote_ratio  15862 non-null  float64
dtypes: float64(4), object(5)
memory usage: 1.1+ MB


In [5]:
master_log.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77069 entries, 0 to 77068
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   author            77069 non-null  object 
 1   body              77069 non-null  object 
 2   collapsed         77069 non-null  float64
 3   controversiality  77069 non-null  float64
 4   created_utc       77069 non-null  float64
 5   id                77069 non-null  object 
 6   link_id           77069 non-null  object 
 7   parent_id         77069 non-null  object 
 8   permalink         77069 non-null  object 
 9   score             77069 non-null  float64
 10  removed           77069 non-null  bool   
 11  darma_author      77069 non-null  object 
dtypes: bool(1), float64(4), object(7)
memory usage: 6.5+ MB


### **Check in the above step, using info()**

For the **comment** log,

> If the comment master log is loaded from **new_com_stream**... (i.e. from the **removed_com_collection_notebook**), it already has the **darma_author** and **removed** columns, so you do **not** need to run the **author mapping** and **removed column** steps below.

> Otherwise, if you have the **removed_com** file for a particular **timestamp** but not **new_com_stream**, you can upload the **comment master log csv file** (com_stream) downloaded directly from the server. In that case, the **author mapping** and **removed column** steps are **required**.
(The overhead is that this file is large, so generating threads involves many more comparisons.)

This case should not arise if you have run the **removed_com_collection_notebook**.
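
The check described above can be sketched as a small helper. `missing_columns` is a hypothetical name, and the column lists below are shortened toy examples; in the notebook the argument would be `master_log.columns`.

```python
REQUIRED_COLUMNS = {"darma_author", "removed"}

def missing_columns(columns):
    """Return the required columns that are absent, sorted for stable output."""
    return sorted(REQUIRED_COLUMNS - set(columns))

# Toy column lists: one as produced by the removed-comment notebook, one straight from the server
from_notebook = ["author", "body", "id", "removed", "darma_author"]
from_server = ["author", "body", "id"]
print(missing_columns(from_notebook))
print(missing_columns(from_server))
```

An empty result means the author mapping and removed column steps can be skipped for the comment log.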


---


For the **post** log, the author mapping step is **required**.

### **Author mapping Step** if **darma_author** and **removed** col are not present in the comment stream

In [6]:
def map_author(author_id):
  # Pseudonymize an author id: hash it with SHA-256, then map the first
  # 8 bytes of the digest to letters (A-Z for 0-25, a-z for 26-51)
  digested = hashlib.sha256(author_id.encode('utf-8')).digest()
  output_string = ""

  for byte in digested[:8]:
      mod_result = byte % 52
      if mod_result < 26:
          output_string += chr(65 + mod_result)
      else:
          output_string += chr(97 + mod_result - 26)
  return output_string
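
A quick sanity check of the mapping, as a sketch. The function is repeated in compact form so the block runs on its own, and the sample ids are made up. It illustrates that the alias is deterministic and always 8 ASCII letters; note that 8 letters cannot rule out collisions between different authors.

```python
import hashlib

def map_author(author_id):
    # Same logic as the notebook function: first 8 digest bytes -> letters
    digested = hashlib.sha256(author_id.encode('utf-8')).digest()
    out = ""
    for byte in digested[:8]:
        r = byte % 52
        out += chr(65 + r) if r < 26 else chr(97 + r - 26)
    return out

alias_a = map_author("example_author_1")
alias_b = map_author("example_author_1")  # same input, same alias
alias_c = map_author("example_author_2")
print(alias_a, alias_c)
```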

Author mapping for the post stream is **required**

In [7]:
post_stream['darma_author'] = post_stream['author'].apply(lambda x: map_author(x))

Creating darma_author column if the **comment** master log is directly downloaded from the server

In [None]:
# master_log['darma_author'] = master_log['author'].apply(lambda x: map_author(x))

### **Removed column step** ONLY if the **comment** master log is directly downloaded from the server and you have the removed_com csv file

In [None]:
# master_log['removed'] = master_log['id'].isin(set(list(final_removed_df['id'])))

### **Finalizing Removed com dataframe from master_log so that it has darma_author and removed col**

In [8]:
removed_com = master_log[master_log['removed']==True]
len(removed_com)

655

# **Generate threads**
This section takes a long time to run.

### Connecting to Reddit API

In [9]:
pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:
credentials = 'client_secret.json'

with open(credentials) as f:
    creds = json.load(f)

In [11]:
import praw
reddit = praw.Reddit(client_id = creds['client_id'],
                    client_secret = creds['client_secret'],
                    user_agent = creds['user_agent'],
                    redirect_uri = creds['redirect_uri'],
                    refresh_token = creds['refresh_token'],
                     check_for_async = False)

### Functions for finding parent by comment ID

In [12]:
# function which finds the parent post by id
# first looks in the post stream
# if not found, fetches it from reddit directly and appends it to the post stream

def find_parent_post(id):
  global post_stream

  if id in set(list(post_stream['id'])):
    par_post = post_stream[post_stream['id']==id]
    # print(">>>post found in master")
    return par_post.to_dict('records')[0]

  else:
    parent_post = reddit.submission(id = str(id))

    sub_dict = {}
    sub_dict['id'] = parent_post.id
    sub_dict['title'] = parent_post.title
    sub_dict['selftext'] = parent_post.selftext
    sub_dict['score'] = parent_post.score
    sub_dict['upvote_ratio'] = parent_post.upvote_ratio
    sub_dict['num_comments'] = parent_post.num_comments
    sub_dict['permalink'] = parent_post.permalink
    sub_dict['created_utc'] = parent_post.created_utc

    try:
      if parent_post.author != None:
        sub_dict['author'] = parent_post.author.id
        sub_dict['darma_author'] = map_author(parent_post.author.id)
      else:
        sub_dict['author'] = 'Not found'
        sub_dict['darma_author'] = 'Not found'
    except Exception as e:
      print("Post Author ERROR: author is ", parent_post.author, "Exception: ", e)
      sub_dict['author'] = 'Not found'
      sub_dict['darma_author'] = 'Not found'

    df_dictionary = pd.DataFrame([sub_dict])
    post_stream = pd.concat([post_stream, df_dictionary], ignore_index=True)

    print(">>>post Not found in master ", id)
    return sub_dict

In [13]:
# function which finds the parent comment by id
# first looks in the comment stream
# if not found, fetches it from reddit directly and appends it to the comment stream

def find_parent_com(par_id):
  global master_log

  if par_id in set(list(master_log['id'])):
    par_com = master_log[master_log['id']==par_id]
    c_dict = par_com.to_dict('records')[0]
    c_dict['removed'] = 0 if c_dict['removed'] == False else 1
    return c_dict
    
  else:
    p = reddit.comment(id = par_id)
    temp_dict = {}
    temp_dict['id'] = p.id
    temp_dict['parent_id'] = p.parent_id
    temp_dict['link_id'] = p.link_id
    temp_dict['body'] = p.body
    temp_dict['collapsed'] = 0 if p.collapsed == False else 1
    temp_dict['score'] = p.score
    temp_dict['controversiality'] = p.controversiality
    temp_dict['permalink'] = p.permalink
    temp_dict['created_utc'] = p.created_utc

    if p.body == '[removed]':
      temp_dict['body'] =  '[removed] and NOT FOUND'
      temp_dict['removed'] = True
    else:
      temp_dict['removed'] = False

    try: 
      if p.author != None:
        temp_dict['author'] = p.author.id
        temp_dict ['darma_author'] = map_author(p.author.id)
      else:
        temp_dict ['author'] = 'Not found'
        temp_dict['darma_author'] = 'Not found'
    except Exception as e:
      temp_dict ['author'] = 'Not found'
      temp_dict['darma_author'] = 'Not found'
      print("Comment Author ERROR: author is ", p.author, "Exception: ", e)

    df_com_dict = pd.DataFrame([temp_dict])
    master_log = pd.concat([master_log, df_com_dict], ignore_index=True)

    temp_dict['removed'] = 0 if temp_dict['removed'] == False else 1
    print(">>>comment Not found in master ", par_id)
    return temp_dict
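
Both lookup functions rebuild `set(list(...))` over the whole id column on every call. One possible optimisation, sketched here with plain dictionaries rather than DataFrames, is to build an index once and add fetched records to it. `make_lookup` and `fake_fetch` are hypothetical names; `fake_fetch` only stands in for the Reddit API call.

```python
def make_lookup(records, fetch_remote):
    """Return a find(id) function backed by a dict index built once from records."""
    index = {rec["id"]: rec for rec in records}

    def find(item_id):
        if item_id not in index:
            # Cache miss: fetch once and remember the result for later calls
            index[item_id] = fetch_remote(item_id)
        return index[item_id]

    return find

calls = []

def fake_fetch(item_id):
    # Hypothetical placeholder for the remote lookup; records which ids were fetched
    calls.append(item_id)
    return {"id": item_id, "body": "[removed] and NOT FOUND"}

find_com = make_lookup([{"id": "abc", "body": "hello"}], fake_fetch)
print(find_com("abc")["body"])
find_com("zzz")
find_com("zzz")  # served from the index, no second fetch
print(calls)
```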

In [14]:
# Generates and returns the complete thread, starting from the leaf (removed comment) up to the root (post),
# using the above two functions

def find_parent(com, level, parent_dict):
  result = parent_dict

  if com['parent_id'] == com['link_id']:
    try:
      sub_dict = find_parent_post(com['parent_id'][3:])
      result[level] = ["Level "+str(level), "Parent Post", sub_dict]
      return result
    except Exception as e:
      print("Post ERROR: ", e)
      print("Comment number:" + com['id'])
  else:
    try:
      com_dict = find_parent_com(com['parent_id'][3:])
      result[level] = ["Level "+str(level), "Parent comment", com_dict]
      level = level +1

      find_parent(com_dict, level, result)
    except Exception as e:
      print("Comment ERROR: ", e)
      print("Comment number:" + com['id'])
  return result
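
The recursion above walks from the removed comment up to the post. The same traversal can be sketched as a loop over in-memory records. `build_thread` is a hypothetical helper, and `comments` and `posts` are toy dictionaries with invented ids standing in for the master log and the post stream.

```python
def build_thread(leaf, comments, posts):
    """Collect levels 0..n from the leaf comment up to its parent post."""
    thread = {0: leaf}
    current, level = leaf, 1
    while current["parent_id"] != current["link_id"]:
        # parent_id looks like 't1_<comment id>'; strip the 3-character prefix
        current = comments[current["parent_id"][3:]]
        thread[level] = current
        level += 1
    # parent_id == link_id ('t3_<post id>'): the parent is the post itself
    thread[level] = posts[current["parent_id"][3:]]
    return thread

posts = {"p1": {"id": "p1", "title": "toy post"}}
comments = {
    "c1": {"id": "c1", "parent_id": "t3_p1", "link_id": "t3_p1"},
    "c2": {"id": "c2", "parent_id": "t1_c1", "link_id": "t3_p1"},
}
leaf = {"id": "c3", "parent_id": "t1_c2", "link_id": "t3_p1"}
thread = build_thread(leaf, comments, posts)
print([item["id"] for item in thread.values()])
```

Level 0 is the leaf and the highest level is the post, which is why a later step reverses the order.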

### Generating threads using the above function and Saving in JSON format
**Note:** This step takes a long time, so check the output file name in the cell before you run it.

Defining **timestamp** and **output file name tag** used for all files that would be downloaded during this process.

**DO NOT** forget to set the subreddit correctly

In [15]:
# IMP NOTE see if the subreddit is set correctly.
subreddit = '_france_'

# example for 15th August 2022
date1 = datetime(2022,8,15)
date2 = datetime.today() #Because com_stream csv loaded above was downloaded from server today

dates_str = str(date1.date().strftime('%d%b%Y'))+'_to_'+str(date2.date().strftime('%d%b%Y'))

out_file_name_tag = subreddit+dates_str
print(out_file_name_tag)

_france_15Aug2022_to_28Aug2022
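
Because `date2` is `datetime.today()`, the tag changes if the cell is rerun on a different day. A sketch that takes both dates explicitly instead; `make_tag` is a hypothetical helper, and the month abbreviation from `%b` assumes the default (English) locale.

```python
from datetime import datetime

def make_tag(subreddit, start, end):
    """Build the '_<subreddit>_<start>_to_<end>' tag used in output file names."""
    fmt = '%d%b%Y'
    return f"_{subreddit}_{start.strftime(fmt)}_to_{end.strftime(fmt)}"

tag = make_tag('france', datetime(2022, 8, 15), datetime(2022, 8, 28))
print(tag)
```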


In [16]:
# For each comment in the removed_com dataframe, call find_parent to generate the thread corresponding to that comment
# Check the file name and dates before running this cell

result_json = {}
temp_list = []
counter = 1

for idx, com in removed_com.iterrows():
  print("-----count: ", counter)
  temp = {}
  x = find_parent(com, 1, temp)
  
  com_dict = com.to_dict()
  com_dict['removed'] = 0 if com_dict['removed'] == False else 1
  final_thread = {0: ["Level 0", "Removed comment", com_dict], **x}
  
  temp_list.append(final_thread)
  # print("RESULT: ", final_thread)
  counter +=1

  # if counter>5:
  #   break;
  # else:
  #   pass

result_json['result'] = temp_list

f_name = "initial_detailed_threads"+out_file_name_tag+'.json'
print(f_name)

with open(f_name, "w") as outfile:
    json.dump(result_json, outfile)

files.download("/content/"+f_name)

-----count:  1
-----count:  2
-----count:  3
-----count:  4
>>>comment Not found in master  ikcwhoe
-----count:  5
-----count:  6
-----count:  7
-----count:  8
-----count:  9
-----count:  10
-----count:  11
-----count:  12
-----count:  13
-----count:  14
-----count:  15
-----count:  16
-----count:  17
-----count:  18
-----count:  19
-----count:  20
-----count:  21
-----count:  22
-----count:  23
-----count:  24
>>>comment Not found in master  ijuzl3f
>>>comment Not found in master  ijux1cx
-----count:  25
-----count:  26
-----count:  27
-----count:  28
-----count:  29
-----count:  30
-----count:  31
-----count:  32
-----count:  33
-----count:  34
-----count:  35
-----count:  36
-----count:  37
-----count:  38
-----count:  39
-----count:  40
-----count:  41
-----count:  42
-----count:  43
-----count:  44
-----count:  45
-----count:  46
-----count:  47
-----count:  48
-----count:  49
-----count:  50
-----count:  51
-----count:  52
-----count:  53
-----count:  54
-----count:  55
-----coun


In [17]:
len(temp_list)

655

# **Sorting the threads** root---->leaf

In [18]:
# in case the notebook times out, upload the json file from your downloads folder
# IMP: load the file from your downloads folder

with open("/content/initial_detailed_threads_france_15Aug2022_to_28Aug2022.json") as f:
  json_file = json.load(f)

In [19]:
final = []
for discussion_dict in json_file['result']:
  # print(discussion_dict)
  discussion_dict = {int(k): v for k,v in discussion_dict.items()}
  sorted_dict = dict(sorted(discussion_dict.items(), reverse=True))
  new_dict = {}
  i = 0
  for k,v in sorted_dict.items():
    new_dict[i] = v[2]
    i+=1
  # print(new_dict)
  final.append(new_dict)

final_json = {}
final_json['result'] = final
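
After a JSON round trip the level keys are strings, so they are converted to integers before sorting (as strings, '10' would sort before '2'). A small sketch of the reordering on a toy thread; `root_to_leaf` is a hypothetical name for the logic of the cell above.

```python
def root_to_leaf(discussion):
    """Reorder {level: [label, kind, record]} so that index 0 is the post (highest level)."""
    by_level = {int(k): v for k, v in discussion.items()}
    ordered = sorted(by_level.items(), reverse=True)
    # Keep only the record (third element) and renumber from the root
    return {i: entry[2] for i, (_, entry) in enumerate(ordered)}

toy = {
    "0": ["Level 0", "Removed comment", {"id": "c2"}],
    "1": ["Level 1", "Parent comment", {"id": "c1"}],
    "2": ["Level 2", "Parent Post", {"id": "p1"}],
}
print(root_to_leaf(toy))
```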

### Saving json file

In [21]:
# save the sorted threads if required
sorted_f_name = "sorted_detailed_threads"+out_file_name_tag+'.json'
print(sorted_f_name)

with open(sorted_f_name, "w") as outfile:
    json.dump(final_json, outfile)

files.download("/content/"+sorted_f_name)

sorted_detailed_threads_france_15Aug2022_to_28Aug2022.json



# **Threads for verification with only author, text, permalink, removed attr**

In [22]:
final[0]

{0: {'author': '7epqlahm',
  'created_utc': 1660509374.0,
  'id': 'wog2da',
  'num_comments': 0.0,
  'permalink': '/r/france/comments/wog2da/what_things_are_expensive_in_france_but_so_cheap/',
  'score': 1.0,
  'selftext': 'I only know about cigarettes but nothing more 👉👈.',
  'title': 'What things are expensive in France but so cheap in Algeria?',
  'upvote_ratio': 1.0,
  'darma_author': 'wWRMANMl'},
 1: {'author': 'fh36u',
  'body': 'Women',
  'collapsed': 0.0,
  'controversiality': 0.0,
  'created_utc': 1660534818.0,
  'id': 'ikc8yjv',
  'link_id': 't3_wog2da',
  'parent_id': 't3_wog2da',
  'permalink': '/r/france/comments/wog2da/what_things_are_expensive_in_france_but_so_cheap/ikc8yjv/',
  'score': 1.0,
  'removed': 1,
  'darma_author': 'nOQSkhIC'}}

In [23]:
# Generating author text permalink threads respectively for post and comments
author_text = []

for item in final:
  dis ={}
  for k,v in item.items():
    # print(k,v)
    if int(k) == 0:
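      # this is a post: use title + selftext, or the title alone when selftext is missing (NaN)
      has_selftext = str(v['selftext']) != 'nan'
      dis[k] = {"speaker_id": v['darma_author'],
                "text": v['title'] + ' ' + v['selftext'] if has_selftext else v['title'],
                "permalink": v['permalink']}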
    else:
      dis[k] = {"speaker_id": v['darma_author'] , "text": v['body'], "permalink": v['permalink'], "removed": v['removed']}
  # print("\n")
  author_text.append(dis)

In [24]:
author_text[0]

{0: {'speaker_id': 'wWRMANMl',
  'text': 'What things are expensive in France but so cheap in Algeria? I only know about cigarettes but nothing more 👉👈.',
  'permalink': '/r/france/comments/wog2da/what_things_are_expensive_in_france_but_so_cheap/'},
 1: {'speaker_id': 'nOQSkhIC',
  'text': 'Women',
  'permalink': '/r/france/comments/wog2da/what_things_are_expensive_in_france_but_so_cheap/ikc8yjv/',
  'removed': 1}}

In [25]:
len(author_text)

655

### Saving json file

In [26]:
#save and download the files
author_text_json = {}
author_text_json['result'] = author_text

verify_f_name = "verify_threads"+out_file_name_tag+'.json'
print(verify_f_name)

with open(verify_f_name, "w") as outfile:
    json.dump(author_text_json, outfile)

files.download("/content/"+verify_f_name)

verify_threads_france_15Aug2022_to_28Aug2022.json



**Display**

In [None]:
# for threads in author_text[:10]:
#   for k,v in threads.items():
#     print(k,v)
#   print('\n')

# **Filter threads by toxic author repetition**

In [27]:
len(author_text)

655

In [28]:
# function that returns true/false based on the frequency of toxic author in a thread

def repeated_tox_author(thread):
  # print("thread of len:", len(thread), thread)
  tox_author = list(thread.values())[-1]['speaker_id']
  all_authors = {}

  for k,v in thread.items():
    # print(k,v)
    temp_key = v['speaker_id']
    all_authors[temp_key] = all_authors.get(temp_key,0) + 1

  keep = all_authors[tox_author] > 1
  # print("authors freq: ", all_authors)
  # print("toxic author: ", tox_author, "freq: ", all_authors[tox_author])
  # print("keep thread: ", keep)
  # print("\n")

  return keep

In [29]:
# filtering the threads using the above function
filtered_by_author = []
for thread in author_text:
  if repeated_tox_author(thread):
    filtered_by_author.append(dict(conversation = list(thread.values()), target_user = list(thread.values())[-1]['speaker_id']))
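
The repetition test can also be written with `collections.Counter`. This is a sketch on toy threads, with `target_repeats` as a hypothetical name; it assumes, as above, that the last message in a root-to-leaf thread belongs to the toxic author.

```python
from collections import Counter

def target_repeats(thread):
    """True if the author of the last message appears more than once in the thread."""
    speakers = [msg['speaker_id'] for msg in thread.values()]
    return Counter(speakers)[speakers[-1]] > 1

kept = {0: {'speaker_id': 'A'}, 1: {'speaker_id': 'B'},
        2: {'speaker_id': 'A'}, 3: {'speaker_id': 'B'}}
dropped = {0: {'speaker_id': 'A'}, 1: {'speaker_id': 'B'}}
print(target_repeats(kept), target_repeats(dropped))
```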

In [30]:
len(filtered_by_author)

253

In [31]:
#display
for threads in filtered_by_author[:10]:
  conv = threads['conversation']
  for i in conv:
    print(i)
  print('\n')

{'speaker_id': 'TowlHeSV', 'text': 'Paris : des pompiers agressés dans un foyer de jeunes résidents étrangers', 'permalink': '/r/france/comments/wov5ev/paris_des_pompiers_agressés_dans_un_foyer_de/'}
{'speaker_id': 'bduBGbsc', 'text': 'Je trouve ça immoral. Dans le pays où j’habite un incident pareil est synonyme de condamnation au caning (châtiment corporel) et potentiellement de peine de mort si les autorités jugent que les individus ne sont pas apte à vivre sans troubler l’harmonie sociale. Et je trouve ça extrêmement sain, c’est d’ailleurs pour ça qu’il n’y a pas de problème de sécurité ici', 'permalink': '/r/france/comments/wov5ev/paris_des_pompiers_agressés_dans_un_foyer_de/ikd2x1c/', 'removed': 1}
{'speaker_id': 'LfQnFwoB', 'text': '> les autorités jugent que les individus ne sont pas apte à vivre sans troubler l’harmonie sociale.\n\nallez, je mords. et si les autorités sont pourries jusqu\'à la moelle et décident (au hasard le plus complet) que ton orientation sexuelle ou ta re

### Saving json file

In [32]:
filtered_by_author_json = {}
filtered_by_author_json['result'] = filtered_by_author

tox_author_f_name = "tox_author_threads"+out_file_name_tag+'.json'
print(tox_author_f_name)

with open(tox_author_f_name,'w') as outfile:
  json.dump(filtered_by_author_json, outfile) 

files.download("/content/"+tox_author_f_name)

tox_author_threads_france_15Aug2022_to_28Aug2022.json



# **Filter out the gaps-- "Removed and Not Found"**

In [33]:
without_gap = []
count = 0

for threads in filtered_by_author:
  conv = threads['conversation']
  flag = 0
  for i in conv:
    # print(i)
    if i['text'] == "[removed] and NOT FOUND":
      flag = 1
      break

  # print('flag: ', flag)  
  if flag == 1:
    pass
  else:
    without_gap.append(threads)
    count+=1
  # print(count, '\n')
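
The flag-and-break loop above is equivalent to an `any()` test per thread. A compact sketch on toy data; `has_gap` and `GAP_TEXT` are hypothetical names.

```python
GAP_TEXT = "[removed] and NOT FOUND"

def has_gap(thread):
    """True if any message in the thread is a placeholder for an unrecoverable comment."""
    return any(msg['text'] == GAP_TEXT for msg in thread['conversation'])

threads = [
    {'conversation': [{'text': 'post'}, {'text': 'reply'}]},
    {'conversation': [{'text': 'post'}, {'text': GAP_TEXT}, {'text': 'reply'}]},
]
no_gap = [t for t in threads if not has_gap(t)]
print(len(no_gap))
```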

In [34]:
len(without_gap)

252

In [35]:
for threads in without_gap[:5]:
  conv = threads['conversation']
  for i in conv:
    print(i)
  print('\n')

{'speaker_id': 'TowlHeSV', 'text': 'Paris : des pompiers agressés dans un foyer de jeunes résidents étrangers', 'permalink': '/r/france/comments/wov5ev/paris_des_pompiers_agressés_dans_un_foyer_de/'}
{'speaker_id': 'bduBGbsc', 'text': 'Je trouve ça immoral. Dans le pays où j’habite un incident pareil est synonyme de condamnation au caning (châtiment corporel) et potentiellement de peine de mort si les autorités jugent que les individus ne sont pas apte à vivre sans troubler l’harmonie sociale. Et je trouve ça extrêmement sain, c’est d’ailleurs pour ça qu’il n’y a pas de problème de sécurité ici', 'permalink': '/r/france/comments/wov5ev/paris_des_pompiers_agressés_dans_un_foyer_de/ikd2x1c/', 'removed': 1}
{'speaker_id': 'LfQnFwoB', 'text': '> les autorités jugent que les individus ne sont pas apte à vivre sans troubler l’harmonie sociale.\n\nallez, je mords. et si les autorités sont pourries jusqu\'à la moelle et décident (au hasard le plus complet) que ton orientation sexuelle ou ta re

In [36]:
for threads in without_gap:
  # Rebuild the conversation instead of calling remove() while iterating,
  # which would skip the message right after each removed one
  kept = []
  for i in threads['conversation']:
    if i['text']=="[deleted]":
      print(i)
    else:
      kept.append(i)
  threads['conversation'] = kept
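
Removing items from a list while iterating over it can skip elements, which matters when two `[deleted]` messages are adjacent. A small demonstration on toy data, together with a filter that builds a new list instead:

```python
messages = ['a', '[deleted]', '[deleted]', 'b']

# Pitfall: removing while iterating skips the element right after each removal
buggy = list(messages)
for text in buggy:
    if text == '[deleted]':
        buggy.remove(text)

# Safer: build a new list containing only the messages to keep
clean = [text for text in messages if text != '[deleted]']
print(buggy, clean)
```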

In [37]:
len(without_gap)

252

### Saving json file

In [38]:
without_gap_json = {}
without_gap_json['result'] = without_gap

without_gap_f_name = "without_gap_threads"+out_file_name_tag+'.json'
print(without_gap_f_name)

with open(without_gap_f_name,'w') as outfile:
  json.dump(without_gap_json, outfile)

files.download("/content/"+without_gap_f_name)

without_gap_threads_france_15Aug2022_to_28Aug2022.json



# **Filter by langID**
**Note**: for **French** use lang='fr', for **English** use lang='en'

In [39]:
pip install langid

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [40]:
import langid
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

# check whether more than half of the messages in a thread are classified as `lang`
def check_thread_language(thread, lang):
  thread_len = len(thread['conversation'])

  class_list = []
  for com in thread['conversation']:
    # classify() returns a (language, probability) pair; keep the language code
    class_list.append(identifier.classify(com['text'])[0])

  count_lang = class_list.count(lang)
  is_lang = count_lang > thread_len/2

  # print(class_list, count_lang, thread_len, is_lang)
  return is_lang

# print(identifier.rank(input3)[:3])
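
The filter is a strict majority vote over the messages of a thread, so a thread split exactly half and half is rejected. A sketch of that rule with a toy classifier; `majority_language` and `toy_classify` are hypothetical, and `toy_classify` only imitates the `(language, probability)` return shape of `identifier.classify`.

```python
def majority_language(texts, lang, classify):
    """True if strictly more than half of the texts are classified as lang."""
    labels = [classify(text)[0] for text in texts]
    return labels.count(lang) > len(texts) / 2

def toy_classify(text):
    # Hypothetical stand-in for a real language identifier
    return ('fr', 1.0) if 'bonjour' in text.lower() else ('en', 1.0)

two_of_three = majority_language(['Bonjour', 'hello', 'bonjour à tous'], 'fr', toy_classify)
one_of_two = majority_language(['Bonjour', 'hello'], 'fr', toy_classify)
print(two_of_three, one_of_two)
```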

**Note**: for **French** use lang='fr', for **English** use lang='en'

In [41]:
selected_count = 0
filtered_by_lang = []
not_selected = []

# IMP: set lang according to the note above
for thread in without_gap:
  keep = check_thread_language(thread,lang = 'fr')
  if keep:
    filtered_by_lang.append(thread)
    selected_count += 1
  else:
    not_selected.append(thread)

print("Count: ", selected_count)

Count:  242


In [45]:
print("selected threads: ", len(filtered_by_lang))
print("not-selected threads: ", len(not_selected))

selected threads:  242
not-selected threads:  10


In [None]:
#thread not selected by langID filter

# for item in not_selected:
#   print("\n Thread len: ", len(item['conversation']))
#   o = check_thread_language(item, lang='fr')
#   print(o)

### Saving json file

In [46]:
filtered_by_lang_json = {}
filtered_by_lang_json['result'] = filtered_by_lang

lang_f_name = "langID_filter_threads"+out_file_name_tag+'.json'
print(lang_f_name)

with open(lang_f_name, "w") as outfile:
    json.dump(filtered_by_lang_json, outfile)

files.download("/content/"+lang_f_name)

langID_filter_threads_france_15Aug2022_to_28Aug2022.json



# **Text preprocessing: Removing quoted text, new lines, etc.**

In [None]:
# new_list=['Bonjour,\r', '\r', 'Ce commentaire a été supprimé. Merci de t’exprimer de façon moins agressive.\r']
# # new_list = [x.replace('\r','') for x in new_list]
# print(new_list)
# # ['Bonjour,\r', '\r', 'Ce commentaire a été supprimé. Merci de t’exprimer de façon moins agressive.\r', '\r', ' ------------------------ \r', '\r', 'This comment has been removed. Please do not be agressive towards other users.\r', '\r', '\r', '\r', "Les règles de /r/france sont [disponibles ici](https://www.reddit.com/r/france/wiki/regles). Pour contester cette action, ou pour toute question, merci d'envoyer un [message aux modérateurs](https://www.reddit.com/message/compose?to=%2Fr%2Ffrance).\r", '\r', 'Merci de ta compréhension.']

# print(" ".join(new_list))

['Bonjour,\r', '\r', 'Ce commentaire a été supprimé. Merci de t’exprimer de façon moins agressive.\r']
Bonjour,  Ce commentaire a été supprimé. Merci de t’exprimer de façon moins agressive.


In [47]:
for thread in filtered_by_lang:
  for com in thread['conversation']:
    # print('OLD: ', com['text'])
    com_list = com['text'].split('\n')

    # drop quoted lines (those starting with '>') and strip carriage returns
    new_com_list = ['' if x.startswith('>') else x.replace('\r','') for x in com_list]

    # join the remaining lines and collapse runs of whitespace into single spaces
    temp_text = ' '.join(new_com_list)
    new_text = ' '.join(temp_text.split())

    com['text'] = new_text
    
    # print("NEW: ", com['text'])
  # print("-------------------------------------------")
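
The cleaning step can be sketched as a standalone function applied to one string. `clean_text` is a hypothetical name and the sample text is made up; like the cell above, it only treats lines that begin with `>` as quotes.

```python
def clean_text(text):
    """Drop quoted lines, remove carriage returns, and collapse whitespace."""
    lines = text.split('\n')
    kept = ['' if line.startswith('>') else line.replace('\r', '') for line in lines]
    return ' '.join(' '.join(kept).split())

sample = "> quoted line\n\nallez,  je mords.\r\nok"
cleaned = clean_text(sample)
print(cleaned)
```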

In [48]:
len(filtered_by_lang)

242

In [49]:
for i in filtered_by_lang[:2]:
  for com in i['conversation']:
    print(com['text'])
  print("--------------------------------")

Paris : des pompiers agressés dans un foyer de jeunes résidents étrangers
Je trouve ça immoral. Dans le pays où j’habite un incident pareil est synonyme de condamnation au caning (châtiment corporel) et potentiellement de peine de mort si les autorités jugent que les individus ne sont pas apte à vivre sans troubler l’harmonie sociale. Et je trouve ça extrêmement sain, c’est d’ailleurs pour ça qu’il n’y a pas de problème de sécurité ici
allez, je mords. et si les autorités sont pourries jusqu'à la moelle et décident (au hasard le plus complet) que ton orientation sexuelle ou ta religion sont de nature à "troubler l'harmonie sociale" ? un peu comme dans chaque pays ou ton système a cours, en fait.
Le gars en question a juste fait du copier coller donc il ne pouvait pas prétendre à recevoir le paiement. Techniquement c’est lui qui devrait être torturé pour tentative d’escroquerie
--------------------------------
Comment aider ma voisine avec des troubles psy Bonjour AirFrance, Depuis qu'u

### Saving json file

In [50]:
filtered_clean_f_name = "fltr_clean_threads"+out_file_name_tag+'.json'
print(filtered_clean_f_name)

with open(filtered_clean_f_name, 'w') as op_f:
  json.dump(filtered_by_lang, op_f)

files.download("/content/"+filtered_clean_f_name)

fltr_clean_threads_france_15Aug2022_to_28Aug2022.json



# **Final format (Text and speaker_id)**

In [51]:
for i in filtered_by_lang[:2]:
  for com in i['conversation']:
    print(json.dumps(com, indent=1))
  print("--------------------------------")

{
 "speaker_id": "TowlHeSV",
 "text": "Paris : des pompiers agress\u00e9s dans un foyer de jeunes r\u00e9sidents \u00e9trangers",
 "permalink": "/r/france/comments/wov5ev/paris_des_pompiers_agress\u00e9s_dans_un_foyer_de/"
}
{
 "speaker_id": "bduBGbsc",
 "text": "Je trouve \u00e7a immoral. Dans le pays o\u00f9 j\u2019habite un incident pareil est synonyme de condamnation au caning (ch\u00e2timent corporel) et potentiellement de peine de mort si les autorit\u00e9s jugent que les individus ne sont pas apte \u00e0 vivre sans troubler l\u2019harmonie sociale. Et je trouve \u00e7a extr\u00eamement sain, c\u2019est d\u2019ailleurs pour \u00e7a qu\u2019il n\u2019y a pas de probl\u00e8me de s\u00e9curit\u00e9 ici",
 "permalink": "/r/france/comments/wov5ev/paris_des_pompiers_agress\u00e9s_dans_un_foyer_de/ikd2x1c/",
 "removed": 1
}
{
 "speaker_id": "LfQnFwoB",
 "text": "allez, je mords. et si les autorit\u00e9s sont pourries jusqu'\u00e0 la moelle et d\u00e9cident (au hasard le plus complet) qu

In [52]:
for i in filtered_by_lang:
  t_temp=[]
  for com in i['conversation']:
    com = {k:v for k,v in com.items() if k in ['speaker_id', 'text']}
    t_temp.append(com)
  i['conversation'] = t_temp
    # print(json.dumps(com, indent=1))
  # print("--------------------------------")
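
A sketch of the same projection as a function that returns a new thread instead of modifying the list in place, which keeps the cleaned threads available for inspection. `to_protocol` and `KEEP` are hypothetical names, and the thread is toy data.

```python
KEEP = ('speaker_id', 'text')

def to_protocol(thread):
    """Return a copy of the thread whose messages keep only speaker_id and text."""
    conversation = [{k: msg[k] for k in KEEP if k in msg} for msg in thread['conversation']]
    return {**thread, 'conversation': conversation}

toy_thread = {
    'conversation': [
        {'speaker_id': 'AAAAbbbb', 'text': 'post', 'permalink': '/r/x/1/'},
        {'speaker_id': 'CCCCdddd', 'text': 'reply', 'permalink': '/r/x/1/2/', 'removed': 1},
    ],
    'target_user': 'CCCCdddd',
}
protocol_thread = to_protocol(toy_thread)
print(protocol_thread)
```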

In [53]:
for i in filtered_by_lang[:2]:
  for com in i['conversation']:
    print(json.dumps(com, indent=1))
  print("--------------------------------")

{
 "speaker_id": "TowlHeSV",
 "text": "Paris : des pompiers agress\u00e9s dans un foyer de jeunes r\u00e9sidents \u00e9trangers"
}
{
 "speaker_id": "bduBGbsc",
 "text": "Je trouve \u00e7a immoral. Dans le pays o\u00f9 j\u2019habite un incident pareil est synonyme de condamnation au caning (ch\u00e2timent corporel) et potentiellement de peine de mort si les autorit\u00e9s jugent que les individus ne sont pas apte \u00e0 vivre sans troubler l\u2019harmonie sociale. Et je trouve \u00e7a extr\u00eamement sain, c\u2019est d\u2019ailleurs pour \u00e7a qu\u2019il n\u2019y a pas de probl\u00e8me de s\u00e9curit\u00e9 ici"
}
{
 "speaker_id": "LfQnFwoB",
 "text": "allez, je mords. et si les autorit\u00e9s sont pourries jusqu'\u00e0 la moelle et d\u00e9cident (au hasard le plus complet) que ton orientation sexuelle ou ta religion sont de nature \u00e0 \"troubler l'harmonie sociale\" ? un peu comme dans chaque pays ou ton syst\u00e8me a cours, en fait."
}
{
 "speaker_id": "bduBGbsc",
 "text": "Le 

In [54]:
final_f_name = "protocol_threads"+out_file_name_tag+'.json'
print(final_f_name)

with open(final_f_name,'w') as opf:
  json.dump(filtered_by_lang, opf)

files.download('/content/'+final_f_name)

protocol_threads_france_15Aug2022_to_28Aug2022.json

