# Getting Details of Subreddits

I will investigate the subreddit metadata provided by PushShift, which is available here: https://files.pushshift.io/reddit/subreddits/

In [23]:
!pip install ijson

Collecting ijson

You should consider upgrading via the 'c:\users\user200803\appdata\local\programs\python\python38\python.exe -m pip install --upgrade pip' command.



  Downloading ijson-3.1.4-cp38-cp38-win_amd64.whl (48 kB)
Installing collected packages: ijson
Successfully installed ijson-3.1.4


In [26]:
import json
import csv
import pandas as pd

import zstandard as zstd
import io

from collections import Counter, defaultdict

import ijson # to stream large json files

import datetime

In [2]:
dctx = zstd.ZstdDecompressor(max_window_size=2147483648)  # 2**31 bytes (2 GiB), so frames compressed with a large window can be decoded

In [3]:
dict_list = []

In [4]:
counter = 0
with open("D://Wellesley/F21/thesis_zst_data/reddit_subreddits.ndjson.zst", 'rb') as ifh: #, open("stream_output.json", 'wb') as ofh:
    with dctx.stream_reader(ifh) as reader:  # default read_size; read_size=2 pulled only 2 bytes per read
        text_stream = io.TextIOWrapper(reader, encoding='utf-8')
        for line in text_stream:
            if counter >= 6:
                break
            d = json.loads(line)
            dict_list.append(d)
            counter += 1

In [15]:
ifh.close()  # not needed: the with block above already closed the file
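The "peek at the first few records" pattern above can be wrapped in a small helper. This is only a sketch: `iter_ndjson` and its `limit` parameter are names made up here, and the demo feeds it an in-memory byte stream instead of the real compressed file. In the notebook, the `binary_stream` argument would be the object returned by `dctx.stream_reader(ifh)`.

```python
import io
import itertools
import json


def iter_ndjson(binary_stream, limit=None):
    """Return an iterator of dicts, one per non-empty line of a UTF-8 NDJSON stream.

    `binary_stream` is any readable binary file-like object.
    `limit` (hypothetical parameter) caps the number of records; None means all.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    records = (json.loads(line) for line in text_stream if line.strip())
    return itertools.islice(records, limit)


# Toy stand-in for the decompressed stream (three made-up records).
sample = io.BytesIO(b'{"id": "vf2"}\n{"id": "21n6"}\n{"id": "21nj"}\n')
first_two = list(iter_ndjson(sample, limit=2))
print(first_two)  # [{'id': 'vf2'}, {'id': '21n6'}]
```

Because the helper is lazy, only as many lines as requested are decoded, which matters for a dump with hundreds of thousands of lines.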

In [5]:
dict_list[0].keys()

dict_keys(['accounts_active', 'accounts_active_is_fuzzed', 'active_user_count', 'advertiser_category', 'all_original_content', 'allow_discovery', 'allow_images', 'allow_polls', 'allow_videogifs', 'allow_videos', 'banner_background_color', 'banner_background_image', 'banner_img', 'banner_size', 'can_assign_link_flair', 'can_assign_user_flair', 'collapse_deleted_comments', 'comment_score_hide_mins', 'community_icon', 'created_utc', 'description', 'description_html', 'disable_contributor_requests', 'display_name', 'display_name_prefixed', 'emojis_custom_size', 'emojis_enabled', 'free_form_reports', 'has_menu_widget', 'header_img', 'header_size', 'header_title', 'hide_ads', 'icon_img', 'icon_size', 'id', 'is_crosspostable_subreddit', 'is_enrolled_in_new_modmail', 'key_color', 'lang', 'link_flair_enabled', 'link_flair_position', 'mobile_banner_image', 'name', 'notification_level', 'original_content_tag_enabled', 'over18', 'primary_color', 'public_description', 'public_description_html', 'pu

Interesting keys:
1. description
2. public_description
3. public_traffic

In [6]:
dict_list[1]['description']

'For US news there is /r/news, for world news there is /r/worldnews, but what about deep and interesting feature articles? In-depth interviews? Other non-urgent or perhaps less-deep but still-powerful pieces?\n\nWelcome to /r/features, an evolving SubReddit for high-quality and evocative long-form content that does not fit neatly into any one category!\n\nThe Texts Continuum:\n\n* **/r/Vignettes** (short)\n* **\\/r/Features** (medium)\n* /r/LongText (long)\n* /r/TrueReddit (deep)\n* /r/DepthHub (meta)\n* /r/News (timely)\n* /r/WorldNews (global)\n* /r/Modded (moderated)\n\nIf you are looking for a place to suggest a feature for reddit, then /r/IdeasForTheAdmins is your subreddit of choice.\n\nNote: this SubReddit is a work in progress and we would welcome feedback from new readers and submitters alike as it grows!'

In [7]:
dict_list[1]['public_description']

'For US news there is /r/news, for world news there is /r/worldnews, but what about deep and interesting feature articles? Welcome to /r/features, an evolving SubReddit for high-quality and evocative long-form content that does not fit neatly into any one category!\n\nIf you are looking for a place to suggest a feature for reddit, then /r/IdeasForTheAdmins is your subreddit of choice.'

Interestingly, `description` is not identical to `public_description`. Should I use both? Perhaps just `description`, since people are more likely to post in a subreddit when they are members of it than when they are part of the general public?
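One way to inform that choice would be to count how the two fields relate across many records. The sketch below is an assumption about how that could be done, not something run in this notebook: `compare_descriptions` and its category names are made up, and the records are toy data.

```python
from collections import Counter


def compare_descriptions(records):
    """Count how `description` and `public_description` relate in each record."""
    summary = Counter()
    for rec in records:
        # Treat missing or None fields as empty strings.
        desc = (rec.get("description") or "").strip()
        pub = (rec.get("public_description") or "").strip()
        if not desc and not pub:
            summary["both_empty"] += 1
        elif desc == pub:
            summary["identical"] += 1
        elif not pub:
            summary["only_description"] += 1
        elif not desc:
            summary["only_public"] += 1
        else:
            summary["different"] += 1
    return summary


# Toy records, invented for illustration only.
toy_records = [
    {"description": "Long sidebar text.", "public_description": "Short blurb."},
    {"description": "Same text.", "public_description": "Same text."},
    {"description": "Sidebar only.", "public_description": ""},
    {"description": None, "public_description": None},
]
summary = compare_descriptions(toy_records)
print(summary)
```

In the notebook this could be called as `compare_descriptions(dict_list)` on the sampled records.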

In [8]:
[d['title'] for d in dict_list]

['Not Safe for Work',
 'features',
 'Request',
 'Citius, Altius, Fortius',
 'r/de - Extraordinär gut!',
 'Reddit en español para España']

In [9]:
[d['id'] for d in dict_list]

['vf2', '21n6', '21nj', '21of', '22i0', '22i2']

It seems that I need subreddit IDs after all. Good thing I have this info. For each of the subreddits that we are interested in, let's get their IDs and then their descriptions.

In [10]:
with open("subreddit_ns.json", "r", encoding="utf-8") as infile1:
    subreddit_ns = json.load(infile1)
subreddits = subreddit_ns.keys()  # the with block closes the file, so no explicit close() is needed

In [11]:
len(subreddits)

20755

Let's get their IDs!

In [12]:
spec_sr_id = dict()
with open("subreddit_id.json", "r", encoding="utf-8") as infile2:
    subreddit_id = json.load(infile2)
    for s in subreddits:
        spec_sr_id[s] = subreddit_id[s]

In [14]:
len(spec_sr_id)

20755

In [18]:
spec_id_sr = {sr_id: name for name, sr_id in spec_sr_id.items()}

`spec_sr_id` contains the mapping {`subreddit`: `its id`}, and `spec_id_sr` is its inverse.
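Inverting a dictionary silently drops entries if two keys share a value. Both mappings here have 20755 entries, so that did not happen, but a defensive version could look like the sketch below. `invert_unique` is a hypothetical helper, and the second toy entry is made up.

```python
def invert_unique(mapping):
    """Invert a dict, raising ValueError if two keys share the same value."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            raise ValueError(
                f"{value!r} is shared by {inverse[value]!r} and {key!r}"
            )
        inverse[value] = key
    return inverse


# Toy mapping: the first pair appears in this notebook, the second is invented.
toy_sr_id = {"prisons": "t5_2sbiw", "example_sub": "t5_abc12"}
toy_id_sr = invert_unique(toy_sr_id)
print(toy_id_sr["t5_2sbiw"])  # prisons
```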

Now let's create the mappings {`subreddit`: `its description`} and {`subreddit_id`: `its description`}.

In [19]:
ids = list(spec_id_sr.keys())

In [30]:
len(ids)

20755

In [34]:
ids[1034:1041] # what's with the t5_?

['t5_2t6ov',
 't5_30y53',
 't5_2h9x5r',
 't5_2t57f',
 't5_3p7rg',
 't5_25iba2',
 't5_3nwtdm']

In [39]:
for i in ids:
    if not i.startswith('t5_'):
        print(i)

Nothing is printed by the cell above, so every `subreddit_id` in `ids` starts with `t5_`. Let's trim this prefix.

In [40]:
ids = [i[3:] for i in ids]
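The check and the trim above can be folded into one step, so that an ID without the expected prefix fails loudly instead of losing its first three characters. This is a sketch with a hypothetical helper name, `strip_prefix`. It avoids `str.removeprefix`, which requires Python 3.9, because this notebook runs on Python 3.8.

```python
def strip_prefix(value, prefix="t5_"):
    """Return `value` without `prefix`, raising ValueError if the prefix is absent."""
    if not value.startswith(prefix):
        raise ValueError(f"{value!r} does not start with {prefix!r}")
    return value[len(prefix):]


# Two IDs taken from the output above.
trimmed = [strip_prefix(i) for i in ["t5_2t6ov", "t5_30y53"]]
print(trimmed)  # ['2t6ov', '30y53']
```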

In [46]:
counter = 0
desc_found = 0
subreddit_description = dict()
id_description = dict()

print("start time: {}".format(datetime.datetime.now()))

with open("D://Wellesley/F21/thesis_zst_data/reddit_subreddits.ndjson.zst", 'rb') as ifh: #, open("stream_output.json", 'wb') as ofh:
    with dctx.stream_reader(ifh) as reader:  # default read_size; read_size=2 pulled only 2 bytes per read
        text_stream = io.TextIOWrapper(reader, encoding='utf-8')
        for line in text_stream:
            # if counter >= 10000:
            #     break
            if counter % 50000 == 0:
                print("counter: {}".format(counter))
            d = json.loads(line)
            srdtid = d["id"]
            # in spec_id_sr, id starts with t5_
            for_spec = "t5_" + srdtid
            # dict membership is O(1), unlike scanning the `ids` list on every line
            if for_spec in spec_id_sr:
                subreddit_description[spec_id_sr[for_spec]] = d["description"]
                # id_description[for_spec] = d["description"]
                desc_found += 1
            counter += 1
            # let's break if descriptions for all id's have been found
            if desc_found == len(ids):
                print("all found after {} look ups. Done!".format(counter))
                break
            
print("finish time: {}".format(datetime.datetime.now()))

start time: 2021-11-05 13:11:23.783371
counter: 0
counter: 50000
counter: 100000
counter: 150000
counter: 200000
counter: 250000
counter: 300000
counter: 350000
counter: 400000
counter: 450000
counter: 500000
counter: 550000
counter: 600000
counter: 650000
counter: 700000
counter: 750000
counter: 800000
counter: 850000
counter: 900000
finish time: 2021-11-05 13:24:38.256613


In [48]:
print(f"{desc_found} found out of {len(ids)} ({desc_found*100/len(ids)}). Counter: {counter}.")

14595 found out of 20755 (70.32040472175379). Counter: 914066.


Interestingly, PushShift's `subreddit` folder contains only about 70% of the subreddits mentioned in its `submission` folder (14,595 of the 20,755 looked up here), even though the `subreddit` dump has about 914k entries.

Perhaps the remaining 30% have no descriptions?
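To tell "absent from the dump" apart from "present but with an empty description", the two cases could be separated as in the sketch below. The helper `split_missing` is hypothetical and the data is a toy example. In the notebook the arguments would be `spec_sr_id` and `subreddit_description`.

```python
def split_missing(wanted_names, found_descriptions):
    """Return (names never found, names found with a blank description)."""
    missing = sorted(set(wanted_names) - set(found_descriptions))
    empty = sorted(
        name
        for name, text in found_descriptions.items()
        if not (text or "").strip()
    )
    return missing, empty


# Toy data, invented for illustration.
toy_wanted = ["prisons", "example_a", "example_b"]
toy_found = {"prisons": "Some sidebar text.", "example_a": ""}
missing, empty = split_missing(toy_wanted, toy_found)
print(missing)  # ['example_b']
print(empty)    # ['example_a']
```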

In [56]:
# example description
subreddit_description["prisons"]

'Nothing can be more abhorrent to democracy than to imprison a person or keep him in prison because he is unpopular. This is really the test of civilization\n\nThe true test of society is how well it treats its prisoners and old people.\n\nThe test of a civilization is in the way it cares for its helpless members\n\nJoin the discussion of abuse in the "troubled teen" bootcamps at /r/troubledteens\n\nYou might also like:\n\n* /r/bad_cop_no_donut\n* /r/justiceporn\n* /r/good_cop_free_donut\n* /r/forfeiture\n* /r/prisonreform\n* /r/AmIFreeToGo\n* /r/ExCons\n* /r/abolish '

In [58]:
spec_sr_id["prisons"]

't5_2sbiw'

In [63]:
id_description = {spec_sr_id[s]: subreddit_description[s] for s in subreddit_description}

In [64]:
id_description['t5_2sbiw']

'Nothing can be more abhorrent to democracy than to imprison a person or keep him in prison because he is unpopular. This is really the test of civilization\n\nThe true test of society is how well it treats its prisoners and old people.\n\nThe test of a civilization is in the way it cares for its helpless members\n\nJoin the discussion of abuse in the "troubled teen" bootcamps at /r/troubledteens\n\nYou might also like:\n\n* /r/bad_cop_no_donut\n* /r/justiceporn\n* /r/good_cop_free_donut\n* /r/forfeiture\n* /r/prisonreform\n* /r/AmIFreeToGo\n* /r/ExCons\n* /r/abolish '

Let's save these to files.

In [65]:
with open("subreddit_description.json", "w", encoding="utf-8") as outfile_srd:
    json.dump(subreddit_description, outfile_srd)

In [66]:
with open("id_description.json", "w", encoding="utf-8") as outfile_id:
    json.dump(id_description, outfile_id)
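A quick sanity check after saving is to load the file back and compare it with the in-memory dictionary. The sketch below does this in a temporary directory with toy data. `dump_and_verify` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile


def dump_and_verify(mapping, path):
    """Write `mapping` as JSON to `path`, reload it, and report whether it matches."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(mapping, out)
    with open(path, "r", encoding="utf-8") as inp:
        return json.load(inp) == mapping


# Toy mapping with a non-ASCII character to exercise the encoding.
toy = {"t5_2sbiw": "Sidebar text with an accent: español"}
with tempfile.TemporaryDirectory() as tmp:
    round_trip_ok = dump_and_verify(toy, os.path.join(tmp, "id_description.json"))
print(round_trip_ok)  # True
```

The comparison holds here because both keys and values are strings. JSON would turn non-string keys into strings, so a mapping with integer keys would not compare equal after reloading.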