# Ranking Subreddits

Author: Junita Sirait

Aim: to help with feature selection, we rank subreddits by subscriber count. We then label each one as `news_politics`, `nonnews_covid`, `entertainment`, `self_development`, `cultural`, `adult`, or `other(specify)` based on its description (and possibly its posts), where:
1. `news_politics` includes news-related and politics-related subreddits (including activism).
2. `nonnews_covid` includes covid-related subreddits that are not news related.
3. `entertainment` includes subreddits with references to entertainment (celebrities, fashion, music, movies, photos of cats, arts, books, games, travel, hobbies, photography, recipes, slang, non-political religions, folklore, memes, etc.).
4. `self_development` includes subreddits where people share knowledge for self-development, such as knowledge about technology, psychology, investing, mental health, history, jobs, environment, etc.
5. `cultural` includes culture specific subreddits, including race, religion, sexual orientation, gender, tribe, etc.
6. `adult` includes subreddits whose discussions are meant for 18+ audiences.
7. `other(specify)` includes other types of subreddits not listed above, with the type specified in the parentheses.
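
Because the labels are entered by hand, it can help to check them against the list above. The following is a minimal sketch, not part of the original workflow: the names `FIXED_LABELS` and `is_valid_label` are hypothetical, and treating `other(specify)` as the pattern `other(<type>)` is an assumption about how the free-form label is written.

```python
# The six fixed label names listed above.
FIXED_LABELS = {
    "news_politics",
    "nonnews_covid",
    "entertainment",
    "self_development",
    "cultural",
    "adult",
}


def is_valid_label(label):
    """Return True for a fixed label or a non-empty `other(<type>)` label."""
    if label in FIXED_LABELS:
        return True
    # Assumption: "other(specify)" is written as other(<type>), e.g. other(sports).
    is_other = label.startswith("other(") and label.endswith(")")
    return is_other and len(label) > len("other()")


print(is_valid_label("entertainment"))  # a fixed label
print(is_valid_label("other(sports)"))  # a free-form "other" label
print(is_valid_label("politics"))       # not in the scheme
```

A check like this could be run over the `label` column once the `csv` file has been filled in.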


Table of contents:
1. [Reading subreddit file](#sub1)
2. [Ranking](#sub2)
3. [Creating `csv` file](#sub3)

In [1]:
import json
import csv
import os

from collections import Counter

In [2]:
# Parent directory of the current working directory (here `pd` is a path, not pandas).
pd = os.path.split(os.getcwd())[0]

<a id="sub1"></a>
## Reading subreddit file

In [3]:
data_fp = os.path.join(pd,"data/subreddits")
with open(os.path.join(data_fp,"subreddit_subscribers.json"), "r", encoding="utf-8") as infile:
    ss = Counter(json.load(infile))

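Let's clean the data by dropping subreddits whose subscriber count is `None`. Note that the `if v` filter below also drops any subreddit with a count of `0`.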

In [4]:
ss = Counter({k:v for k,v in ss.items() if v})
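
To illustrate what the filter above keeps, here is a small sketch on toy data (the dictionary `toy` is made up and is not taken from `subreddit_subscribers.json`). It contrasts the truthiness test used above with an explicit `is not None` test.

```python
from collections import Counter

# Toy stand-in for the loaded JSON; the values are hypothetical.
toy = {"a": 10, "b": None, "c": 0}

# Same filter as above: keeps only truthy counts, so None and 0 both go.
truthy_only = Counter({k: v for k, v in toy.items() if v})

# Alternative: drop only missing values and keep zero counts.
not_none = Counter({k: v for k, v in toy.items() if v is not None})

print(sorted(truthy_only))  # ['a']
print(sorted(not_none))     # ['a', 'c']
```

For a top-500 ranking the difference is unlikely to matter, since subreddits with zero subscribers would not make the cut either way.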

Let's read in our relevant subreddits

In [11]:
with open(os.path.join(data_fp,"relevant_subreddits.json"), "r", encoding="utf-8") as infile1:
    rel = Counter(json.load(infile1))

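Now we intersect `ss` and `rel`, keeping the subscriber counts of only those subreddits that appear in `rel`. The top 500 of this intersection are extracted in the next section.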

In [12]:
itsn = Counter({k:v for k,v in ss.items() if k in rel})
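
The intersection step can be sketched on toy data as follows. The names `ss_toy`, `rel_toy`, and `not_covered` are hypothetical, and the counts are made up. Whether `relevant_subreddits.json` holds a list or a mapping is not shown here, so the sketch uses a list and converts it to a set for membership tests.

```python
from collections import Counter

# Toy stand-ins for `ss` and `rel`; names and counts are hypothetical.
ss_toy = Counter({"funny": 300, "science": 200, "obscure": 5})
rel_toy = ["science", "funny", "missing"]

rel_set = set(rel_toy)

# Keep subscriber counts only for subreddits that are also relevant.
itsn_toy = Counter({k: v for k, v in ss_toy.items() if k in rel_set})

# Relevant subreddits with no subscriber count are silently left out,
# so it can be worth listing them.
not_covered = sorted(rel_set - set(ss_toy))

print(sorted(itsn_toy.items()))  # [('funny', 300), ('science', 200)]
print(not_covered)               # ['missing']
```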

<a id="sub2"></a>
## Ranking

Let's extract the top 500 subreddits with the most subscribers.

In [14]:
s500 = itsn.most_common(500)
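
As a quick illustration of `Counter.most_common`, the sketch below uses a made-up counter `toy`. The method returns a list of `(key, count)` tuples sorted from highest to lowest count, and if fewer than `n` entries exist it returns all of them.

```python
from collections import Counter

# Toy counter with hypothetical counts.
toy = Counter({"a": 5, "b": 9, "c": 7})

top2 = toy.most_common(2)
print(top2)  # [('b', 9), ('c', 7)]

# Asking for more entries than exist returns every entry.
print(len(toy.most_common(500)))  # 3
```

This means `s500` may hold fewer than 500 entries if the intersection is smaller than that.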

In [15]:
s500[:5]

[('funny', 31060166),
 ('gaming', 26628444),
 ('aww', 25197817),
 ('pics', 24911449),
 ('science', 24268001)]

In [20]:
s500

[('funny', 31060166),
 ('gaming', 26628444),
 ('aww', 25197817),
 ('pics', 24911449),
 ('science', 24268001),
 ('worldnews', 24253881),
 ('Music', 24041486),
 ('videos', 23114010),
 ('movies', 23087255),
 ('todayilearned', 23030716),
 ('news', 20935015),
 ('Showerthoughts', 20244617),
 ('IAmA', 20169315),
 ('gifs', 20053188),
 ('EarthPorn', 19963315),
 ('askscience', 19222916),
 ('food', 18869885),
 ('explainlikeimfive', 18139518),
 ('books', 18078148),
 ('LifeProTips', 17957462),
 ('Art', 17826701),
 ('DIY', 17551698),
 ('sports', 17292578),
 ('nottheonion', 17063313),
 ('space', 16937271),
 ('gadgets', 16890685),
 ('television', 16595714),
 ('Documentaries', 16293009),
 ('photoshopbattles', 16220376),
 ('GetMotivated', 16163114),
 ('listentothis', 15940866),
 ('UpliftingNews', 15926382),
 ('tifu', 15777792),
 ('InternetIsBeautiful', 15019126),
 ('history', 14885003),
 ('philosophy', 14818794),
 ('Futurology', 14675526),
 ('OldSchoolCool', 14624312),
 ('dataisbeautiful', 14612638),
 ...]

<a id="sub3"></a>
## Creating `csv` file

In [21]:
fieldnames = ["subreddit", "subscribers", "label"]

In [22]:
rows = [{"subreddit": s, "subscribers": c, "label": ""} for s, c in s500]

In [23]:
rows[:5]

[{'subreddit': 'funny', 'subscribers': 31060166, 'label': ''},
 {'subreddit': 'gaming', 'subscribers': 26628444, 'label': ''},
 {'subreddit': 'aww', 'subscribers': 25197817, 'label': ''},
 {'subreddit': 'pics', 'subscribers': 24911449, 'label': ''},
 {'subreddit': 'science', 'subscribers': 24268001, 'label': ''}]

In [24]:
with open('subreddits_labels.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
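
To check the round trip, the same writing pattern can be run against an in-memory buffer and read back with `csv.DictReader`. This is a sketch only: `toy_rows`, `buffer`, `read_back`, and `unlabeled` are hypothetical names, and the two rows reuse the first two entries shown above.

```python
import csv
import io

fieldnames = ["subreddit", "subscribers", "label"]
toy_rows = [
    {"subreddit": "funny", "subscribers": 31060166, "label": ""},
    {"subreddit": "gaming", "subscribers": 26628444, "label": ""},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(toy_rows)

# Read the rows back; csv returns every value as a string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

# Rows whose label is still empty are the ones left to annotate.
unlabeled = [row["subreddit"] for row in read_back if not row["label"]]

print(read_back[0]["subscribers"])  # 31060166 (as a string)
print(unlabeled)                    # ['funny', 'gaming']
```

Since the subscriber counts come back as strings, they need converting with `int` before any numeric comparison on the reloaded file.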