# Ranking Subreddits

Author: Junita Sirait

Aim: to help with feature selection, we rank subreddits by subscriber count. We then label each one as `news_politics`, `nonnews_covid`, `entertainment`, `self_development`, `cultural`, `adult`, or `other(specify)` based on its description (and possibly its posts), where:
1. `news_politics` includes news-related and politics-related subreddits (including activism).
2. `nonnews_covid` includes covid-related subreddits that are not news related.
3. `entertainment` includes subreddits with references to entertainment (celebrities, fashion, music, movies, photos of cats, arts, books, games, travel, hobbies, photography, recipes, slang, non-political religions, folklore, memes, etc.).
4. `self_development` includes subreddits where people share knowledge for self development such as knowledge about technology, psychology, investing, mental health, history, jobs, environment, etc.
5. `cultural` includes culture-specific subreddits, covering race, religion, sexual orientation, gender, tribe, etc.
6. `adult` includes subreddits whose discussions are meant for 18+ audiences.
7. `other(specify)` includes other types of subreddits not listed above, with the type specified.
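
When the `theme` column is filled in by hand, a small check can catch misspelled labels. The following is a minimal sketch, not part of the original workflow: the helper `is_valid_theme` is hypothetical, and it assumes that `other(specify)` is written with the type in parentheses, for example `other(marketplace)`.

```python
import re

# The six fixed labels listed above.
FIXED_THEMES = {
    "news_politics",
    "nonnews_covid",
    "entertainment",
    "self_development",
    "cultural",
    "adult",
}

# Assumed convention (not confirmed by the source): "other(<type>)",
# e.g. "other(marketplace)", with a non-empty type in parentheses.
OTHER_PATTERN = re.compile(r"other\([^()]+\)")


def is_valid_theme(label):
    """Return True if label is a fixed theme or an other(<type>) label."""
    return label in FIXED_THEMES or OTHER_PATTERN.fullmatch(label) is not None
```

Running this over the labelled `theme` column would flag entries such as `news` or `other()` for review.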


Table of contents:
1. [Reading subreddit file](#sub1)
2. [Ranking](#sub2)
3. [Creating `csv` file](#sub3)

In [27]:
import json
import csv
import os

from collections import Counter

In [4]:
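# Parent directory of the current working directory.
# Note: `pd` here is a path string, not the usual pandas alias.
pd = os.path.split(os.getcwd())[0]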

<a id="sub1"></a>
## Reading subreddit file

In [10]:
data_fp = os.path.join(pd,"data/subreddits")
with open(os.path.join(data_fp,"subreddit_subscribers.json"), "r", encoding="utf-8") as infile:
    ss = Counter(json.load(infile))

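Let's clean the data by dropping subreddits whose subscriber count is missing (`None`). Because the filter tests truthiness, it also drops any subreddit with a count of 0.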

In [16]:
ss = Counter({k:v for k,v in ss.items() if v})
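
To make the effect of that filter concrete, here is a small sketch on made-up data (the subreddit names and counts are hypothetical, not taken from the real file). It contrasts the truthiness filter used above with an `is not None` filter, which would keep zero counts.

```python
from collections import Counter

# Hypothetical sample data, not taken from the real file.
sample = Counter({"a_sub": 120, "b_sub": None, "c_sub": 0, "d_sub": 45})

# Same filter as above: keeps only truthy counts (drops None and 0).
truthy_only = Counter({k: v for k, v in sample.items() if v})

# Alternative: drop only missing values, keeping zero counts.
not_none = Counter({k: v for k, v in sample.items() if v is not None})
```

For a top-500 ranking the difference is unlikely to matter, since subreddits with zero subscribers would not make the cut either way.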

<a id="sub2"></a>
## Ranking

Let's extract the top 500 subreddits with the most subscribers.

In [19]:
s500 = ss.most_common(500)

In [20]:
s500[:5]

[('funny', 31060166),
 ('gaming', 26628444),
 ('aww', 25197817),
 ('pics', 24911449),
 ('science', 24268001)]
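
`Counter.most_common(n)` returns `(key, count)` pairs sorted from largest to smallest count. If an explicit rank is useful later, it can be attached with `enumerate`. This is a sketch on hypothetical counts; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical counts for illustration only.
counts = Counter({"x_sub": 300, "y_sub": 900, "z_sub": 500})

# The two entries with the highest counts, in descending order.
top2 = counts.most_common(2)

# Attach a 1-based rank to each (subreddit, subscribers) pair.
ranked = [(rank, name, subs) for rank, (name, subs) in enumerate(top2, start=1)]
```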

<a id="sub3"></a>
## Creating `csv` file

In [21]:
fieldnames = ["subreddit", "subscribers", "theme"]

In [24]:
rows = [{"subreddit": s, "subscribers": c, "theme": ""} for s, c in s500]

In [25]:
rows[:5]

[{'subreddit': 'funny', 'subscribers': 31060166, 'theme': ''},
 {'subreddit': 'gaming', 'subscribers': 26628444, 'theme': ''},
 {'subreddit': 'aww', 'subscribers': 25197817, 'theme': ''},
 {'subreddit': 'pics', 'subscribers': 24911449, 'theme': ''},
 {'subreddit': 'science', 'subscribers': 24268001, 'theme': ''}]

In [28]:
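# newline='' lets the csv module control line endings itself.
with open('subreddits_labels.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # first row: subreddit,subscribers,theme
    writer.writerows(rows)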
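
A quick round trip can confirm that the file reads back as expected once labels are added. The sketch below uses hypothetical rows and a temporary file (`labels_check.csv` is an illustrative name) so that the real `subreddits_labels.csv` is left untouched.

```python
import csv
import os
import tempfile

fieldnames = ["subreddit", "subscribers", "theme"]

# Hypothetical rows standing in for the real `rows` list.
sample_rows = [
    {"subreddit": "x_sub", "subscribers": 900, "theme": ""},
    {"subreddit": "y_sub", "subscribers": 500, "theme": ""},
]

# Write to a temporary directory so the real output file is untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels_check.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read the file back as a list of dictionaries keyed by the header row.
    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded = list(csv.DictReader(f))

# csv returns every field as a string, so counts need converting back.
subscribers = [int(row["subscribers"]) for row in loaded]
```

Note that `csv.DictReader` yields strings for every column, so the subscriber counts have to be converted with `int` before any numeric comparison.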