[Reddit](https://reddit.com) is one of the discussion websites I visit frequently. It's a forum-based website where users can submit links or text posts, which can be upvoted and downvoted. One of the brilliant things about reddit is the comments, the actual "discussion" part of the site. These can also be upvoted and downvoted. Topics are divided among "subreddits": you submit a post to a particular subreddit, so a soccer-related post goes to /r/soccer, a news-related post to /r/news, and so on.

In this blog post, I analyze how the point counts of comments are distributed in posts from the "top" subreddits, and whether that distribution is related to the subreddit a post belongs to.

## Getting the data

To start with, I'll get the comment points distribution for each of the top 1000 posts of the past year. The attributes of each post I'm interested in are:

* Comment points distribution
* Subreddit of the post
* Submission ID (so that we can get back more information about the submission if needed)
* Score (might be useful)
* Number of comments

The code to get this data is [on GitHub](https://github.com/kvsingh/reddit-scripts/blob/master/reddit-comments-top-subreddits.py), for anyone who's interested.

## Subreddit breakdown

Let's start by looking at the subreddits these posts belong to, and how many of the top posts come from each one.

In [49]:
import pandas as pd
df = pd.read_csv('reddit-top-1000-post-comments-distr.csv')
print(df.groupby(['subreddit', 'num_subs']).size().sort_values(ascending=False).head(10))

subreddit            num_subs
r/pics               17341453    283
r/funny              17711708    190
r/gifs               14865217    154
r/aww                15740709     97
r/gaming             16688421     49
r/worldnews          17365378     39
r/todayilearned      17502144     29
r/videos             16606967     23
r/news               14838769     21
r/mildlyinteresting  12982562     19
dtype: int64
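
To make the `groupby(...).size()` step concrete, here is a minimal sketch on made-up data. The toy rows and subscriber numbers are hypothetical; only the column names `subreddit` and `num_subs` come from the real CSV. It assumes `num_subs` is the same for every row of a given subreddit, so grouping by both columns simply carries the subscriber count along with the post count.

```python
import pandas as pd

# Hypothetical toy data: one row per "top" post (values are made up).
toy = pd.DataFrame({
    "subreddit": ["r/pics", "r/pics", "r/news", "r/pics", "r/news", "r/aww"],
    "num_subs": [100, 100, 50, 100, 50, 70],
})

# size() counts the rows in each (subreddit, num_subs) group;
# sort_values puts the most frequent subreddit first.
counts = (
    toy.groupby(["subreddit", "num_subs"])
    .size()
    .sort_values(ascending=False)
)
print(counts)
```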


/r/pics, /r/funny, /r/gifs, /r/aww: anyone who uses reddit even occasionally will tell you it's no surprise that these subreddits occupy the top spots among the "top" posts. Let's look at the tail.

In [27]:
print(df.groupby(['subreddit', 'num_subs']).size().sort_values(ascending=False).tail(10))

subreddit              num_subs
r/SandersForPresident  213436      1
r/InternetIsBeautiful  12351784    1
r/KendrickLamar        56404       1
r/LateStageCapitalism  163334      1
r/LifeProTips          13112055    1
r/woahdude             1296893     1
r/UpliftingNews        12117045    1
r/OldSchoolCool        11980329    1
r/PrequelMemes         280343      1
r/nottheonion          12311419    1
dtype: int64


Redditors might be familiar with some of these, such as /r/nottheonion, /r/woahdude and /r/LifeProTips. Some have millions of subscribers and others far fewer, but each of them contributes only a single post to the top 1000.

## Data filtering

Let's filter out submissions from subreddits that have very few posts in the top 1000. The rationale: to figure out how comments generally behave in a subreddit, it makes sense to look only at subreddits with a considerable number of posts. A single post from a subreddit tells us very little.

I decided to keep only those subreddits that have at least 10 posts (>= 10) in the top 1000.

In [28]:
df = df.groupby("subreddit").filter(lambda x: len(x) >= 10)
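
As a small illustration of what `groupby(...).filter(...)` does here, the sketch below runs the same pattern on made-up data. The toy rows and the `min_posts` threshold of 2 are hypothetical (the post uses 10). The `transform("size")` variant is an alternative I am adding for comparison, not something from the original script; it builds a boolean mask instead of calling a Python lambda per group.

```python
import pandas as pd

# Hypothetical toy data: r/a has 3 posts, r/b has 1, r/c has 2.
toy = pd.DataFrame({
    "subreddit": ["r/a"] * 3 + ["r/b"] + ["r/c"] * 2,
    "score": range(6),
})

min_posts = 2  # hypothetical threshold for the toy data

# Keep every row whose subreddit has at least min_posts rows.
kept = toy.groupby("subreddit").filter(lambda g: len(g) >= min_posts)

# Equivalent: compute each row's group size, then mask.
group_sizes = toy.groupby("subreddit")["score"].transform("size")
kept_alt = toy[group_sizes >= min_posts]

print(kept)
```

Both versions keep the original row order and index, so rows from the dropped subreddit simply disappear.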

## Analyzing comment vote count distributions

To start with, the mean and standard deviation (or variance) are the most obvious things to look at when analyzing a distribution. So we create a couple more columns in our pandas dataframe, holding the mean and standard deviation of each post's comment vote count distribution.

In [39]:
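from ast import literal_eval
# comment_points is read from the CSV as a string literal, so parse it back into a Python object
df['comment_points'] = df['comment_points'].apply(literal_eval)
# per-post mean and sample standard deviation (pandas default, ddof=1) of the comment points
df['means'] = df['comment_points'].apply(lambda a: pd.Series(a).mean())
df['std'] = df['comment_points'].apply(lambda a: pd.Series(a).std())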
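
Here is a minimal sketch of the same three steps on made-up data, to show what the column looks like before and after parsing. The toy strings are hypothetical, and it assumes each `comment_points` cell holds the string form of a list of integers. It also shows that pandas' `std()` is the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`).

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical toy data: each cell is a list written out as a string.
toy = pd.DataFrame({"comment_points": ["[10, 2, 1, 1]", "[500, 3, 2]"]})

# Parse the strings back into real Python lists.
toy["comment_points"] = toy["comment_points"].apply(literal_eval)

# Per-post mean and sample standard deviation (ddof=1).
toy["means"] = toy["comment_points"].apply(lambda a: pd.Series(a).mean())
toy["std"] = toy["comment_points"].apply(lambda a: pd.Series(a).std())

# For comparison: NumPy's default is the population standard deviation (ddof=0).
pop_std_first = np.std(toy["comment_points"][0])

print(toy[["means", "std"]])
```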

Now let's look at the mean and standard deviation for two subreddits, /r/news and /r/pics, and compare them.

In [46]:
print(df[df['subreddit'] == 'r/news'][['means', 'std']].head())
print(df[df['subreddit'] == 'r/pics'][['means', 'std']].head())

           means          std
83    707.744000  1940.516254
87    571.812000  1553.848302
142  1020.303213  2800.076360
208   564.312625  1860.300583
229   737.817269  2064.548227
        means          std
2  382.449597  1586.172387
3  334.892929  1204.248983
4  327.652000  1442.347723
6  841.641283  4297.797784
9  172.792683  1003.191137


Notice anything peculiar? Relative to the mean, the standard deviation for r/pics posts seems to be much higher than for r/news posts. Could this ... mean anything? (Sorry, the pun was there for the taking).

I went back to googling for this, since my last encounter with any sort of theoretical statistics was ... back in college. Apparently, there is a thing called the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation), which is essentially just the standard deviation divided by the mean. It can be (and is) used to compare distributions whose ranges of values differ a lot. For such distributions, looking at the standard deviation alone won't help; it needs to be looked at in the context of the mean.

Now, technically speaking, the coefficient of variation (CV from now on) only makes sense for datasets measured on a "ratio scale".
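
As a quick check of that idea, the sketch below divides the standard deviation by the mean for the ten rows printed above (five from r/news, five from r/pics). The numbers are copied from that output; the `cv` column name is my own choice for illustration. It is only a sanity check on ten posts, not a result for the whole dataset.

```python
import pandas as pd

# Values copied from the printed head() output for each subreddit.
news = pd.DataFrame({
    "means": [707.744000, 571.812000, 1020.303213, 564.312625, 737.817269],
    "std": [1940.516254, 1553.848302, 2800.076360, 1860.300583, 2064.548227],
})
pics = pd.DataFrame({
    "means": [382.449597, 334.892929, 327.652000, 841.641283, 172.792683],
    "std": [1586.172387, 1204.248983, 1442.347723, 4297.797784, 1003.191137],
})

# Coefficient of variation = standard deviation / mean (a unitless ratio).
news["cv"] = news["std"] / news["means"]
pics["cv"] = pics["std"] / pics["means"]

print(news["cv"].round(2).tolist())
print(pics["cv"].round(2).tolist())
```

For these ten posts, every r/pics ratio comes out larger than every r/news ratio, which matches the impression from eyeballing the table.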