# Reddit Clustering

For this project, we set to explore the topography of reddit. We wanted to see what subreddits are commonly used between users, if there are communities of users that act as links between subreddits, and if there are subreddits isolated from the main users' network of Reddit.

## Research Question

We are interested in clustering subreddits, by using comment and submission data to discern connections between subreddits and the users who are active in them. We are hoping to see similar userbases in subreddits we did not expect, along with disconnects in userbases between subreddits that appear intuitively similar. We want to use a Clustering algorithm to connect subreddits in space, and then calculate the distance between subreddits.We are interested in clustering subreddits, by using comment and submission data obtained from PRAW to discern connections between subreddits and the users who are active in them. We are hoping to see similar userbases in subreddits we did not expect, along with disconnections in userbases between subreddits that appear intuitively similar. We want to use a Clustering algorithm to connect subreddits in space, and then calculate the distance between subreddits. 

For the former part of research, we mainly focused on obtaining data set including reddit user, their comments and submissions for subreddits, cleaning out unnecessary features and pre-processing data for further exploration. Simple visualization of reddit posts distribution, relations between user posts number and posted subreddit number was displayed, which none of them has suggested significant results so far. For over 22,000 subreddits, there is a large portion with few data included, and handling this majority is believed to affect the following data analysis significantly. Multi-dimensional vectors implemented with array structure are also the major sources to perform data exploration and clustering on.

For the latter part, the main purpose will be completing data pre-processing,  figuring out correct application of Clustering Algorithm(method to choose, cluster number etc.) and displaying final visualization. The statistical analysis should also suggest some insights into the correlation among different subreddits and their active user community.

Several questions are expected to be answered while we proceed:

- How should we filter out the large amount of subreddits with few comments by simply removing the data or considering imputation methods? How should we determine a proper threshold for unpopular subreddits?

- K-means Clustering Algorithm seems to work for current data, but instead of presuming the number of classification arbitrarily, is there any statistical domain knowledge or library function we can refer to for more reliable output?

- The visualization in 2D space apparently fails to display more original properties for our multidimensional vectors. We need to implement proper functions to display more comprehensive data visualization, perhaps by using dimensionality reduction approach.


## Data Sources

All of our data is coming from reddit. Reddit submission and comment data is publicly accessible, and reddit has a nice API structure. We are using the package PRAW (python reddit api wrapper), which makes the reddit api calls easier to use and python importable. We are attempting to build subreddit comment vectors for a large number of reddit users. We would like to create nested dictionaries, the first level key being a reddit user, the second level keys being a subreddit name, and the values being how many time that specific user posted to a specific subreddit. 

Another important note is there are two kinds of ways to post to reddit, and the api distinguishes them. There are submissions and comments. Submissions are posts including an image, a video, or a question, and comments are replies and follow-ups to posts. Both kinds of posts on reddit are useful for the clustering we want to do, so we must do both.

We split our data scripts into different pieces, below is the documentation for each script, followed by the code. 

Also to gather any data from Reddit we must log on to the system with a developer id and create a reddit instance in our code. This happens once below.

In [1]:
import praw
import pandas as pd
import sys
import numpy as np
from scipy.spatial import distance

reddit = praw.Reddit(client_id='tc_fFbWZrkDSRw',
                     client_secret='fTq7nFVzdkCHFZY7jWQvHmkLpwk',
                     user_agent='lhimelman')

# userNameScraper.py:
  *A Scraper that gets just usernames, does it very quickly. 6000 usernames can be collected in a few minutes*

  EX: 
      
      
      python3 userNameScraper.py *saveFilename* *ListofSubreddit* *numberofpoststolookat*
      
      python3 userNameScraper.py data.txt funny,pics,todayilearned 100

In [2]:
def scrapeUsers(reddit, subredditList, postNum):
    subnum = 0
    for subredditname in subredditList:
        users = []
        posts = reddit.subreddit(subredditname).hot(limit=postNum)
        pc = 0
        for submission in posts:
            all_comments = submission.comments.list()
            for c in all_comments:
                try:
                    name = c.author.name
                    if name not in users:
                        users.append(name)
                except:
                    pass
            pc = pc + 1
            print( pc, "post")
        subnum = subnum + 1
        print( subnum, "subreddit")
    return users

##An example call scraping one post from r/funny
users = scrapeUsers(reddit,['politics'],1)
users

1 post
1 subreddit


['Thoramel',
 'Hrekires',
 'Sip_py',
 'dallasmorningnews',
 'jimbozak',
 'TransQuantinentalAce',
 'darkseadrake',
 'GotOutOfCowtown',
 'becauseineedone3',
 'bivalve_attack',
 'TrumpImpeachedAugust',
 'DistillateMedia',
 'wjbc',
 'neverliveindoubt',
 'HorsecockBillionaire',
 'MaimedJester',
 'PrincessSandySparkle',
 'Burning_Lovers',
 'Choco316',
 'Communist99',
 'esteban1386',
 'Schkateboarda',
 'nuncio-tc',
 'IMAVINCEMCMAHONGUY',
 'all2neat',
 'not-working-at-work',
 'garybusey42069',
 'erratically_sporadic',
 '10iss',
 'Vernacularry',
 'wisdom_and_frivolity',
 'halebara01',
 'ThrowAway_Phone',
 'ericolinn',
 'KellyJoyCuntBunny',
 'JacenGraff',
 'Roidciraptor',
 'TempAcct20005']

# ScrapeFreqfromUser.py:
  *A scraper that gets frequencies of comments from a list of users*

  EX: 
    
        python3 ScrapeFreqfromUser.py *savefielName* *userlistfilename*
     
    
        python3 ScrapeFreqfromUser.py freq.txt users.txt

In [None]:
def scrapeSubreddit(reddit, users):
    commentFreq = {}
    headers = []
    usernum = 1
    for user in users:
        userCFreq = {}
        for comment in reddit.redditor(user).comments.new(limit=None):
            sub = comment.subreddit
            if sub not in userCFreq:
                userCFreq[sub] = 1
            else:
                userCFreq[sub] += 1
            if sub not in headers:
                headers.append(sub)
        commentFreq[user] = userCFreq
        usernum = usernum + 1
        print(usernum, "out of", len(users))
    return commentFreq,headers

##An example call scraping the users gotten above
cfreq,headers = scrapeSubreddit(reddit, users[0:3])
df = pd.DataFrame.from_dict(data=cfreq, orient='index').fillna(0)
df

2 out of 3
3 out of 3


# ScrapeSubFreqfromUser.py:
  *A scraper that gets frequencies of submissions from a list of users*

  EX: 
    
        python3 ScrapeSubFreqfromUser.py *savefielName* *userlistfilename*
     
    
        python3 ScrapeSubFreqfromUser.py freq.txt users.txt

In [None]:
def scrapeSubreddit(reddit, users):
    subFreq = {}
    headers = []
    usernum = 1
    for user in users:
        userCFreq = {}
        for submission in reddit.redditor(user).submissions.new(limit=None):
            sub = submission.subreddit
            if sub not in userCFreq:
                userCFreq[sub] = 1
            else:
                userCFreq[sub] += 1
            if sub not in headers:
                headers.append(sub)
        subFreq[user] = userCFreq
        usernum = usernum + 1
        print(usernum, "out of", len(users))
    return subFreq,headers

subfreq,headers = scrapeSubreddit(reddit, users[0:3])
df = pd.DataFrame.from_dict(data=subfreq, orient='index').fillna(0)
df

At this point, we have two sparse matrices in which each column is a vector for an individual subreddit that contains frequencies of different users posting to or commenting on that subreddit.

***
The above calls are examples of running our scripts, but of course our actual data sets, (which we only want to pull down once), are much larger.
***

## Data Cleaning

The data came to us pretty clean. Reddit's api allows us to filter deleted comments and such. Our data cleaning and preprocessing included three different tasks. Below is the first 100 rows of our large table, and then a description of each task. 

In [None]:
bigdf = pd.read_csv('../data.csv',nrows=100)
bigdf

# Task One:
    
    Create a method for removing very sparse vectors from our dataset. In looking through the data, we realized that there are some subreddits with really very few posts, that appears in our set of vectors without really doing anything. We decided to test clustering with the whole set and with smaller sets, so we made a way of thresholding how many posts a subreddit needs to be included.

In [None]:
Threshold = 10

def delSparse(df, threshold):
    for c in list(df)[1:]:
        if sum(list(df[c])) < Threshold:
            del df[c]
            
delSparse(bigdf,Threshold)
bigdf.shape

# Task Two:
    Format data so it is in the form expected by the clustering algorithm. The following code takes the dataframe and changes it to a numpy array. The code also saves a list of headers for referencing specific nodes in a cluster later.
    
    

In [None]:
from numpy import array

def changetoVec(df):
    vectors = []
    for c in list(df)[1:]:
        vectors.append(list(df[c]))

    return array(vectors)

bigdfVec = changetoVec(bigdf)
print(bigdfVec)

# Task Three:
    Remove Porn. What we discovered is that most porn subreddits fell under the category of incredibly sparse vectors, and so were removed, the ones that we decided not to include in our analysis for now. 

## Now we can try clustering:
    As it happens running unsupervised learning is not that complicated. We can run the clustering algorithm on our vectors fairly easy, but there are two problems. The first is we dont know what the optimal number of clusters our algorithm should produce is. We must find a way to choose a cluster number with the least error.

In [None]:
from scipy.cluster.vq import vq, kmeans, whiten
import matplotlib.pyplot as plt


##A function that clusters with a given K
def Cluster(vectors, Num_clusters):
    whitened = whiten(vectors)
    codebook, distortion = kmeans(whitened, Num_clusters)
    return codebook, distortion

##An example of the centroids returned by clustering
codebook, dist = Cluster(bigdfVec[1:1000],5)
whitened = whiten(bigdfVec[1:100])

## The list of centroids
print(codebook)

#The estimated error of the clustering given that k value
print(dist)


The Distortion value returned from the k-means algorithm is the value that represents the error of that many clusters. So our first thought is to simply minimize that error. Below we create a list of all possible distortions for our data.

In [None]:
#####WARNING, THIS TAKES A WHILE

##Get all distortions
distortions = []
for i in range(1,int(len(list(bigdfVec))/2)):
    cb, dist = Cluster(bigdfVec, i)
    distortions.append(dist)
plt.plot(distortions)

It is easy to see the problem with just choosing the minimum here. The minimum will always be the same number of clusters as datapoints! This number of clusters isn't useful however, as having the same number of clusters as datapoints, is 100% overfitting. A common technique to combat this used alongside k-means, is finding when the maximum jump appears of inverted distortiong values. Below is the code to find the number of clusters when that point occurs. 

In [None]:
#Look for the maximum distortion jump. This should be our best k.

maxdistjump = 0
maxk = 0
for i in range(1, len(distortions)):
    if distortions[i] - distortions[i-1] > maxdistjump:
        maxdistjump = distortions[i] - distortions[i-1]
        maxk = i
        
print(maxk)
print(maxdistjump)

finalcb, finaldist = Cluster(bigdfVec, maxk)

Now we have our theoretically best K value. Lets compute distances between each subreddit and cluster centers, then look at which subreddits are most in a specific cluster.

In [None]:
##compute distances between every subreddit and cluster centers for the best k we found

Clusterdistframe = pd.DataFrame(columns=list(bigdf)[1:])
for i in range(len(finalcb)):
    dists = []
    for v2 in bigdfVec:
        dists.append(distance.euclidean(finalcb[i], v2))
    Clusterdistframe.loc[i] = dists
Clusterdistframe.head()

Now we can get the ten minimum and the ten maximum subreddits for each cluster

In [None]:
mins = pd.DataFrame()
maxs = pd.DataFrame()
for index,row in Clusterdistframe.iterrows():
    mins[index] = list(Clusterdistframe.columns[row.argsort()][0:100])
    maxs[index] = list(Clusterdistframe.columns[row.argsort()][-99:])
    
mins=mins.transpose()
maxs=maxs.transpose()

In [None]:
mins

In [None]:
del maxs[2]
del maxs[3]
del maxs[4]
del maxs[5]
maxs

So theres a problem. By looking at this list we can see that there are unique clusters, but many of the clusters contain the same minimum subreddits, or they contain the same maximum subreddits. For example, there are many centroids very close to mildy_infuriating, which in and of itself is mildly infuriating. More importantly however, outliers appear to be super obvious. In that there are no cluster centers anywhere close to some subreddits, like Unity3D, and Denmark. Despite the fact that 297 clusters is mathematically the right number of clusters. It is harder to interpret on a human scale, so we also decided to guess a small number of clusters and look at the subreddits that appeared as minimums and maximums with less clusters.

In [None]:
##cluster
guesscb, guessdist = Cluster(bigdfVec, 10)

##Get distances
guessClusterdistframe = pd.DataFrame(columns=list(bigdf)[1:])
for i in range(len(guesscb)):
    dists = []
    for v2 in bigdfVec:
        dists.append(distance.euclidean(guesscb[i], v2))
    guessClusterdistframe.loc[i] = dists

##Get mins and maxs
gmins = pd.DataFrame()
gmaxs = pd.DataFrame()
for index,row in guessClusterdistframe.iterrows():
    gmins[index] = list(guessClusterdistframe.columns[row.argsort()][0:100])
    gmaxs[index] = list(guessClusterdistframe.columns[row.argsort()][-99:])
    
gmins=gmins.transpose()
gmaxs=gmaxs.transpose()

In [None]:
gmins

In [None]:
del gmaxs[2]
del gmaxs[3]
del gmaxs[5]
gmaxs

These results are a little more interesting to look at, in that there is a significant difference between the minimum subreddits that appear in each cluster. The maximums still arent that interesting, as those outlier subreddits are still really far away. This distance could indicate that those subreddits have pretty isolated userbases, which honestly makes sense. Politics is a subreddit that appears at a very high distance from most other subreddits. This implies that lots of users post on politics, and only politics. Which is certainly very possible, and in fact very likely.

The next thing we thought would be interesting is to not only plot subreddits to cluster centers, but subreddits to eachother, to see total minimum distances and maximum distances between subreddits. 

In [None]:
from scipy.spatial import distance
##compute distances between every subreddit and put them in a massive table

distframe = pd.DataFrame(columns=list(bigdf)[1:])
for i in range(len(bigdfVec)):
    dists = []
    for v2 in bigdfVec:
        dists.append(distance.euclidean(bigdfVec[i], v2))
    distframe.loc[i] = dists
distframe.index = list(bigdf)[1:]
distframe

In [None]:
sub_mins = pd.DataFrame()
sub_maxs = pd.DataFrame()
for index,row in distframe.iterrows():
    sub_mins[index] = list(distframe.columns[row.argsort()][0:100])
    sub_maxs[index] = list(distframe.columns[row.argsort()][-99:])
    
sub_mins=sub_mins.transpose()
sub_maxs=sub_maxs.transpose()

In [None]:
sub_mins

In [None]:
sub_maxs.head()

# Statistical Analysis

To perform statistical analysis on our data, we read in small data set. Both .csv files contain the same users. One file contains a user's comment frequency data, and the other contains that same user's submission frequency data.

In [None]:
donaldc = pd.read_csv('donaldCFreq.csv')
donalds = pd.read_csv('donaldSFreq.csv')

print("DonaldCFreq:")
print(donaldc)
print("donaldSFreq:")
print(donalds)

In [None]:
#dataframe for donaldc
donaldc["total_comments"] = donaldc.sum(axis = 1)
donaldc["total_comments"]

In [None]:
#dataframe for donalds
donalds["total_submissions"] = donalds.sum(axis = 1)
donalds["total_submissions"]

In [None]:
donaldCombined = pd.DataFrame()
donaldCombined["#Comments"] = donaldc["total_comments"]
donaldCombined["#Submissions"] = donalds["total_submissions"]
donaldCombined = donaldCombined.dropna()
donaldCombined

Delete submissions > 800:

In [None]:
donaldCombined = donaldCombined[donaldCombined["#Submissions"] < 800] 
donaldCombined

Now, calculate Z scores to normalize the data, so we can perform statistical analysis.

In [None]:
import scipy.stats

donaldCombined['#Comments_Z'] = scipy.stats.zscore(donaldCombined['#Comments'])
donaldCombined['#Submissions_Z'] = scipy.stats.zscore(donaldCombined['#Submissions'])

Scatter plot of Z scores:

In [None]:
#scatter plot of Z scores
plt.scatter(donaldCombined['#Comments_Z'], donaldCombined['#Submissions_Z'])
plt.show

Run Pearson Correlation Coefficient to find correlation between two columns:

In [None]:
donaldCombined['#Comments_Z'].corr(donaldCombined['#Submissions_Z'])

The correlation coefficient is about -0.25, making it somewhat anti-correlated. While this is not a very strong anti-correlation, it might still be assumed that users who comment a lot do not post submissions.