# Reddit Clustering

For this project, we set out to explore the topography of Reddit. We wanted to see which subreddits share users, whether there are communities of users that act as links between subreddits, and whether any subreddits are isolated from the main user network of Reddit.

## Research Question

We are interested in clustering subreddits, using comment and submission data to discern connections between subreddits and the users who are active in them. We hope to find similar userbases in subreddits where we did not expect them, along with disconnects in userbases between subreddits that appear intuitively similar. We want to use a clustering algorithm to place subreddits in a vector space and then calculate the distance between subreddits.
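As a rough illustration of what "distance between subreddits" could mean, the sketch below computes a cosine distance between two per-user count vectors. The choice of cosine distance, the helper name `cosine_distance`, and the sample vectors are all assumptions made for this example; the project does not commit to a specific metric here.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical per-user post counts for three subreddits (four users each).
sub_a = [5, 0, 2, 0]
sub_b = [10, 0, 4, 0]  # same users, same proportions as sub_a
sub_c = [0, 3, 0, 7]   # no users in common with sub_a

print(cosine_distance(sub_a, sub_b))  # close to 0: very similar userbases
print(cosine_distance(sub_a, sub_c))  # 1.0: disjoint userbases
```

Cosine distance ignores how large a subreddit is and compares only the mix of users, which is one reason it may suit count data like this; a Euclidean distance would instead be dominated by the busiest subreddits.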

## Data Sources

All of our data comes from Reddit. Reddit submission and comment data is publicly accessible, and Reddit has a well-structured API. We are using the package PRAW (Python Reddit API Wrapper), which makes the Reddit API calls easier to use and importable from Python. We are attempting to build subreddit comment vectors for a large number of Reddit users. We would like to create nested dictionaries in which the first-level key is a Reddit user, the second-level keys are subreddit names, and the values are how many times that user posted to that subreddit.
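As a minimal sketch of the nested-dictionary layout described above, the snippet below builds it from a list of (user, subreddit) pairs. The helper name `build_frequency_table` and the sample pairs are made up for illustration; in the real pipeline the pairs would come from PRAW.

```python
from collections import defaultdict

def build_frequency_table(posts):
    """Map user -> subreddit -> number of posts, from (user, subreddit) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for user, subreddit in posts:
        table[user][subreddit] += 1
    # Convert to plain dicts so the result prints and serializes cleanly.
    return {user: dict(counts) for user, counts in table.items()}

# Hypothetical sample data, not real Reddit activity.
sample_posts = [
    ("alice", "funny"),
    ("alice", "funny"),
    ("alice", "pics"),
    ("bob", "todayilearned"),
]
freq = build_frequency_table(sample_posts)
print(freq)
# {'alice': {'funny': 2, 'pics': 1}, 'bob': {'todayilearned': 1}}
```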

Another important note is that there are two ways to post to Reddit, and the API distinguishes them: submissions and comments. Submissions are posts including an image, a video, or a question, and comments are replies and follow-ups to posts. Both kinds of posts are useful for the clustering we want to do, so we collect both.

We split our data scripts into different pieces. Below is the documentation for each script, followed by the code.

Also, to gather any data from Reddit we must log on to the system with a developer ID and create a Reddit instance in our code. This happens once below.

In [1]:
import praw
import pandas as pd
import sys

reddit = praw.Reddit(client_id='tc_fFbWZrkDSRw',
                     client_secret='fTq7nFVzdkCHFZY7jWQvHmkLpwk',
                     user_agent='lhimelman')
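One possible way to avoid hard-coding the developer credentials is to read them from environment variables. The sketch below is only an illustration: the helper name `reddit_credentials_from_env` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical, not something the project defines.

```python
import os

def reddit_credentials_from_env(env=os.environ):
    """Collect praw.Reddit keyword arguments from environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in required.values() if var not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in required.items()}

# Usage (requires the variables to be set and praw to be installed):
#   import praw
#   reddit = praw.Reddit(**reddit_credentials_from_env())
```

Keeping the secret out of the notebook means the file can be shared without exposing the account it belongs to.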

# userNameScraper.py:
  *A scraper that collects just usernames, and does so very quickly: 6000 usernames can be collected in a few minutes*

  EX: 
      
      
      python3 userNameScraper.py *saveFilename* *ListofSubreddit* *numberofpoststolookat*
      
      python3 userNameScraper.py data.txt funny,pics,todayilearned 100
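The command-line wrapper itself is not shown here, so the following is only a sketch of how the three arguments in the example invocation might be parsed. The helper name `parse_scraper_args` is hypothetical.

```python
def parse_scraper_args(argv):
    """Split the command-line arguments into (save file, subreddits, post count)."""
    if len(argv) != 4:
        raise SystemExit("usage: userNameScraper.py saveFilename sub1,sub2,... numPosts")
    save_filename = argv[1]
    # The subreddit list is a single comma-separated argument.
    subreddits = [name for name in argv[2].split(",") if name]
    post_limit = int(argv[3])
    return save_filename, subreddits, post_limit

# Simulated sys.argv for the example invocation shown above.
example_argv = ["userNameScraper.py", "data.txt", "funny,pics,todayilearned", "100"]
print(parse_scraper_args(example_argv))
# ('data.txt', ['funny', 'pics', 'todayilearned'], 100)
```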

In [2]:
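def scrapeUsers(reddit, subredditList, postNum):
    # Collect usernames across all requested subreddits, not just the last one.
    users = []
    seen = set()  # fast membership test; `users` keeps first-seen order
    subnum = 0
    for subredditname in subredditList:
        posts = reddit.subreddit(subredditname).hot(limit=postNum)
        pc = 0
        for submission in posts:
            # Flattened comment tree; it may contain MoreComments placeholders.
            all_comments = submission.comments.list()
            for c in all_comments:
                try:
                    name = c.author.name
                except AttributeError:
                    # Deleted authors (author is None) and MoreComments
                    # placeholders have no usable author name.
                    continue
                if name not in seen:
                    seen.add(name)
                    users.append(name)
            pc = pc + 1
            print(pc, "post")
        subnum = subnum + 1
        print(subnum, "subreddit")
    return users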

##An example call scraping one post from r/politics
users = scrapeUsers(reddit,['politics'],1)
users

1 post
1 subreddit


['TrumpImpeachedAugust',
 'zaikanekochan',
 'mmaireenehc',
 'IMAVINCEMCMAHONGUY',
 'dallasmorningnews',
 'BrokenZen',
 'Thoramel',
 'jimbozak',
 'ivsciguy',
 'ArtysFartys',
 'metaldood19',
 'ExRays',
 'verifex',
 'Hoplophilia',
 'hammersklavier',
 'caravaggio2000',
 'DoingRandomCrap31',
 'Metallic144',
 'worldwarli',
 'sadist-trombone',
 'Amazing_Archigram',
 'HorrorSquirrel1',
 'ballmermurland',
 'galleyest',
 'hawkiron',
 'Firechess',
 'DanielTigerUppercut',
 'AsperonThorn',
 'Vernacularry',
 'highorderdetonation',
 'aoi_to_midori',
 'Markanaya',
 'Baltron9000',
 'Sporian',
 'bo_dingles',
 'ashycharasmatic',
 'Music_Tech',
 '0and18',
 'narwhilian',
 'fascist___hag',
 '10iss',
 'BeachJas',
 'iSpoonz',
 'd9_m_5',
 'subpargalois',
 'serothis',
 'not_even_once_okay',
 'StuStutterKing',
 'Prometheus_II']

# ScrapeFreqfromUser.py:
  *A scraper that gets frequencies of comments from a list of users*

  EX: 
    
        python3 ScrapeFreqfromUser.py *savefileName* *userlistfilename*
     
    
        python3 ScrapeFreqfromUser.py freq.txt users.txt

In [3]:
def scrapeSubreddit(reddit, users):
    commentFreq = {}   # user -> {subreddit name -> number of comments}
    headers = []       # every subreddit name seen, in first-seen order
    usernum = 0        # users processed so far, for progress output
    for user in users:
        userCFreq = {}
        for comment in reddit.redditor(user).comments.new(limit=None):
            # Key on the subreddit's name so the columns are plain strings.
            sub = comment.subreddit.display_name
            if sub not in userCFreq:
                userCFreq[sub] = 1
            else:
                userCFreq[sub] += 1
            if sub not in headers:
                headers.append(sub)
        commentFreq[user] = userCFreq
        usernum = usernum + 1
        print(usernum, "out of", len(users))
    return commentFreq,headers

##An example call scraping the users gotten above
cfreq,headers = scrapeSubreddit(reddit, users[0:3])
df = pd.DataFrame.from_dict(data=cfreq, orient='index').fillna(0)
df

1 out of 3
2 out of 3
3 out of 3


Unnamed: 0,technology,WTF,baseball,food,reactiongifs,Music,AskReddit,Cardinals,SmokersRights,movies,...,dontdeadopeninside,cscareerquestions,Pets,dadjokes,ramen,mildlyamusing,OCCaliPokemonGo,creepyPMs,GradSchool,LongDistance
TrumpImpeachedAugust,6.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mmaireenehc,0.0,0.0,0.0,0.0,0.0,0.0,6,0.0,0.0,0.0,...,1.0,1.0,58.0,1.0,1.0,1.0,1.0,2.0,196.0,4.0
zaikanekochan,5.0,4.0,160.0,1.0,1.0,4.0,11,25.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# ScrapeSubFreqfromUser.py:
  *A scraper that gets frequencies of submissions from a list of users*

  EX: 
    
        python3 ScrapeSubFreqfromUser.py *savefileName* *userlistfilename*
     
    
        python3 ScrapeSubFreqfromUser.py freq.txt users.txt

In [4]:
def scrapeSubreddit(reddit, users):
    subFreq = {}       # user -> {subreddit name -> number of submissions}
    headers = []       # every subreddit name seen, in first-seen order
    usernum = 0        # users processed so far, for progress output
    for user in users:
        userCFreq = {}
        for submission in reddit.redditor(user).submissions.new(limit=None):
            # Key on the subreddit's name so the columns are plain strings.
            sub = submission.subreddit.display_name
            if sub not in userCFreq:
                userCFreq[sub] = 1
            else:
                userCFreq[sub] += 1
            if sub not in headers:
                headers.append(sub)
        subFreq[user] = userCFreq
        usernum = usernum + 1
        print(usernum, "out of", len(users))
    return subFreq,headers

subfreq,headers = scrapeSubreddit(reddit, users[0:3])
df = pd.DataFrame.from_dict(data=subfreq, orient='index').fillna(0)
df

1 out of 3
2 out of 3
3 out of 3


Unnamed: 0,golf,MosinNagant,ilstu,WTF,baseball,CHICubs,HotWheels,AskReddit,aww,SmokersRights,...,asianamerican,Pets,pokemon,curledfeetsies,UnnecessaryQuotes,teefies,berkeley,AskWomen,mildlyinfuriating,AnimalsBeingDerps
TrumpImpeachedAugust,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mmaireenehc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,3.0
zaikanekochan,1.0,4.0,1.0,1.0,4.0,6.0,1.0,5,1.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


At this point, we have two sparse matrices in which each column is a vector for an individual subreddit that contains frequencies of different users posting to or commenting on that subreddit.
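If a single activity table is wanted, one possible way to combine the comment and submission tables is to add them element-wise, treating missing entries as zero. This is an assumption about how the two could be merged, not a step the scripts above perform, and the toy tables below are made up.

```python
import pandas as pd

# Hypothetical toy tables: rows are users, columns are subreddits.
comment_counts = pd.DataFrame(
    {"AskReddit": [6, 11], "baseball": [0, 160]},
    index=["user_a", "user_b"],
)
submission_counts = pd.DataFrame(
    {"AskReddit": [4, 5], "golf": [0, 1]},
    index=["user_a", "user_b"],
)

# Align on both users and subreddits; a missing entry counts as zero activity.
combined = comment_counts.add(submission_counts, fill_value=0)
print(combined)

# One column is the activity vector for a single subreddit.
print(combined["AskReddit"].tolist())
```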

***
The above calls are examples of running our scripts, but our actual data sets (which we only want to pull down once) are much larger.
***

## Data Cleaning

The data came to us fairly clean. Reddit's API allows us to filter out deleted comments and the like. Our data cleaning and preprocessing consisted of three tasks. Below are the first 100 rows of our large table, followed by a description of each task.

In [5]:
df = pd.read_csv('../data.csv',nrows=100)
df

Unnamed: 0.1,Unnamed: 0,cocktails,AskReddit,videos,hearthstone,PUBATTLEGROUNDS,dontdeadopeninside,CompetitiveHS,funny,TWWPRDT,...,firefall,KUFIIOnline,atlantar4r,HeadBangToThis,Kochen,aachen,KnitRequest,petplay,Balls,peachfuzz
0,--abadox--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,56.0,67.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,--arete--,0.0,0.0,0.0,0.0,0.0,2.0,0.0,7.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-COPBLOCK-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-Chakas-,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-DisobedientAvocado-,0.0,61.0,0.0,0.0,0.0,0.0,0.0,24.0,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,-Enrique_Shockwave-,0.0,3.0,0.0,2.0,0.0,40.0,0.0,50.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,-FuckYourGod,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,-KyloRen,0.0,2.0,0.0,0.0,0.0,0.0,0.0,5.0,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,-Mateo-,0.0,0.0,0.0,0.0,0.0,4.0,0.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,-Meik,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Task One:
    
    Create a method for removing very sparse vectors from our dataset. Looking through the data, we realized that some subreddits have very few posts and appear in our set of vectors without contributing anything. We decided to test clustering with the whole set and with smaller sets, so we added a threshold on how many posts a subreddit needs in order to be included.

In [None]:
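Threshold = 100  # minimum total number of posts a subreddit needs to be kept
subs = list(df)

# Skip the first column, then drop every subreddit column whose total
# falls below the threshold (and remove it from the header list too).
for c in list(df)[1:]:
    if df[c].sum() < Threshold:
        del df[c]
        subs.remove(c)
print(df)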
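The same thresholding idea can also be written without an explicit loop. The following is a sketch on a made-up toy table; the column name `user` and the counts are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy table: first column holds usernames, the rest are counts.
toy = pd.DataFrame({
    "user": ["user_a", "user_b", "user_c"],
    "AskReddit": [61, 3, 50],
    "funny": [24, 50, 40],
    "aachen": [0, 1, 0],
})

threshold = 100  # minimum total posts for a subreddit to be kept

# Total posts per subreddit, summed over all users.
totals = toy.drop(columns="user").sum()
# Names of the subreddits that meet the threshold.
kept = totals[totals >= threshold].index.tolist()
filtered = toy[["user"] + kept]

print(kept)
```

Here `aachen` totals only one post and is dropped, while the other two columns survive.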

# Task Two:
    Format data so it is in the form expected by the clustering algorithm. The following code takes the dataframe and changes it to a numpy array. The code also saves a list of headers for referencing specific nodes in a cluster later.
    
    

In [None]:
from numpy import array

vectors = []
for c in subs[1:]:
    vectors.append(list(df[c]))
print(subs[1:10])

vectors = array(vectors)
print(vectors[0:10,0:14])
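An equivalent conversion can be sketched with a transpose, which makes the orientation explicit: each row of the resulting array is one subreddit and each column is one user. The toy table and its `user` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: first column holds usernames, the rest are counts.
toy = pd.DataFrame({
    "user": ["user_a", "user_b", "user_c"],
    "AskReddit": [61, 3, 50],
    "funny": [24, 50, 40],
})

# Keep the header list so rows of the array can be mapped back to names.
subreddit_names = list(toy.columns[1:])
# Transpose so each ROW is one subreddit and each column is one user.
toy_vectors = toy[subreddit_names].to_numpy(dtype=float).T

print(subreddit_names)
print(toy_vectors.shape)
```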

# Task Three:
    Remove porn. We discovered that most porn subreddits had incredibly sparse vectors and so were already removed by the threshold. We decided to keep the ones that remained in our analysis for now.

## Try clustering:

In [None]:
from scipy.cluster.vq import vq, kmeans, whiten
import matplotlib.pyplot as plt

# Whiten data (rescale each feature to unit variance)
whitened = whiten(vectors[1:1000])
# Find 20 clusters in the data
codebook, distortion = kmeans(whitened, 20)
# Plot the first two dimensions of the whitened data, with cluster centers in red
plt.scatter(whitened[:, 0], whitened[:, 1])
plt.scatter(codebook[:, 0], codebook[:, 1], c='r')
plt.show()
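The scatter plot shows the centroids but not which subreddit belongs to which cluster. As a sketch of that missing step, `vq` from the same SciPy module can assign each vector to its nearest centroid. The toy vectors and the names `sub_a` to `sub_f` are invented for this example, and k-means starts from random centroids, so cluster numbering can differ between runs.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data: 6 subreddits described by counts from 3 users.
names = ["sub_a", "sub_b", "sub_c", "sub_d", "sub_e", "sub_f"]
toy_vectors = np.array([
    [50.0, 1.0, 0.0],
    [48.0, 0.0, 2.0],
    [52.0, 2.0, 1.0],
    [1.0, 40.0, 45.0],
    [0.0, 42.0, 47.0],
    [2.0, 38.0, 44.0],
])

toy_whitened = whiten(toy_vectors)
toy_codebook, toy_distortion = kmeans(toy_whitened, 2)

# For every row, vq returns the index of its nearest centroid and the distance to it.
labels, distances = vq(toy_whitened, toy_codebook)

# Group subreddit names by cluster label.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```

With the real data, `names` would be the saved header list, so each cluster can be read back as a list of subreddit names.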

# Data Visualization

## Bar Graph

What are the most commented-on subreddits from our sample?

In [None]:
#from collections import Counter
    
new_data = pd.DataFrame(df, columns = ['funny', 'worldnews', 'AskReddit', 'askscience', 'vegan'])
#new_data

sums = {}
for c in new_data:
    sums[c] = new_data[c].sum()
sums

plt.bar(range(len(sums)), list(sums.values()), align = 'center')
plt.xticks(range(len(sums)), list(sums.keys()))

new_data.plot.bar()
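The cell above plots a hand-picked set of subreddits. To rank subreddits by total activity instead, one option is to sum every column and keep the largest totals. This is a sketch on a made-up table; the `user` column name and the counts are assumptions.

```python
import pandas as pd

# Hypothetical toy table: first column holds usernames, the rest are counts.
toy = pd.DataFrame({
    "user": ["user_a", "user_b", "user_c"],
    "funny": [56, 7, 0],
    "AskReddit": [0, 1, 61],
    "vegan": [0, 0, 2],
    "worldnews": [3, 30, 4],
})

# Sum each subreddit column over all users, then keep the three largest totals.
totals = toy.drop(columns="user").sum()
top = totals.nlargest(3)
print(top)

# With matplotlib installed, top.plot.bar() would draw the corresponding bar chart.
```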

## Scatter Plot

Number of subreddits commented on vs total number of posts for a user

In [None]:
import scipy.stats  # explicit submodule import so scipy.stats.normaltest is available

# Subreddit count columns: everything after the first column.
counts = df.iloc[:, 1:]

# Total number of posts per user across all subreddits
df["total_posts"] = counts.sum(axis=1).astype(int)

# Number of subreddits in which each user has at least one post
df["total_subreddits"] = (counts != 0).sum(axis=1)

plt.scatter(df.total_posts, df.total_subreddits, color='k', s=20, marker="^")
plt.xlabel('number of posts')
plt.ylabel('number of subreddits')
plt.show()


## Line graph

Is the number of posts per Reddit user normally distributed?

In [None]:
df["total_posts"].plot(kind="density")
scipy.stats.normaltest(df["total_posts"])

The number of posts Reddit users make is far from normally distributed. There are many people with almost no posts and many users with a very large number of posts, with a large drop in between. No single consistent distribution describes how many posts a Reddit user makes.
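To make the reading of `normaltest` concrete, here is a small sketch on a deliberately lopsided, made-up sample. The sample values and the significance level of 0.05 are assumptions chosen for illustration; they are not the project's data.

```python
from scipy import stats

# Hypothetical sample: many near-zero users and a few very heavy posters.
sample_posts = [0] * 50 + [1] * 30 + [500] * 20

statistic, p_value = stats.normaltest(sample_posts)
print(statistic, p_value)

# A small p-value means the data is unlikely to come from a normal distribution.
alpha = 0.05  # conventional significance level, chosen here for illustration
verdict = "reject normality" if p_value < alpha else "cannot reject normality"
print(verdict)
```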