# Classifying Reddit Boards Based on Word Embeddings

Here we build a word2vec model from the entire Reddit corpus, then train and test classifiers that predict which of two Reddit boards a post came from (pairwise, by disorder). Concretely, we learn word relationships from over 120K entries from across Reddit, select two boards, and represent each post ("document") as the average of the word embeddings for the words that appear in it. We then try to use these vectors to distinguish between the two boards. The classifier is an SVM; I originally tried many different kernels and ultimately chose linear so that we could use the weights/support vectors later if desired.
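To make the document-vector step concrete, here is a minimal sketch of the averaging. It is not the code used to produce the results below: `toy_embeddings` and `average_vector` are hypothetical names, and the two-dimensional vectors are made up for illustration (in practice they would come from the trained word2vec model). Skipping out-of-vocabulary words and returning a zero vector for empty posts are assumptions.

```python
def average_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have one.

    Words missing from `embeddings` are skipped (an assumption);
    a post with no known words maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) walks the vectors dimension by dimension
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-d embeddings, for illustration only
toy_embeddings = {
    "sleep": [1.0, 0.0],
    "tired": [0.0, 1.0],
    "vote": [-1.0, 0.0],
}

post = ["sleep", "tired", "zzz"]  # "zzz" has no embedding and is skipped
doc_vector = average_vector(post, toy_embeddings, dim=2)
print(doc_vector)
```

Each post then becomes one fixed-length row of the feature matrix handed to the linear SVM, regardless of how long the post is.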

In [6]:
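import pandas
pandas.set_option('display.max_rows', 8000)
# One row per board pair, with the test-set confusion counts for the linear SVM
classifiers = pandas.read_csv("classifiers_reddit_linear.tsv",sep="\t",index_col=0)
# DataFrame.sort(columns=...) was removed from pandas; sort_values(by=...) replaces it
classifiers = classifiers.sort_values(by=["TP","TN","accuracy"],ascending=False)
classifiers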

Unnamed: 0,accuracy,N,board1,board2,board1N,board2N,N_train,N_test,TP,FP,TN,FN,kernel
837,0.962517,14538,politics,Showerthoughts,13940,598,14538,2908,2799,109,0,0,linear
850,0.976598,14313,politics,phobia,13940,373,14313,2863,2795,66,1,1,linear
826,0.97752,14677,politics,Alzheimers,13940,737,14677,2936,2794,51,76,15,linear
847,0.989727,14111,politics,narcissism,13940,171,14111,2823,2794,29,0,0,linear
842,0.979965,14222,politics,cringe,13940,282,14222,2845,2788,57,0,0,linear
852,0.998925,13949,politics,rage,13940,9,13949,2790,2787,3,0,0,linear
836,0.988632,14073,politics,SPD,13940,133,14073,2815,2783,32,0,0,linear
849,0.972028,14300,politics,niceguys,13940,360,14300,2860,2780,80,0,0,linear
858,0.971468,14369,politics,stress,13940,429,14369,2874,2778,80,14,2,linear
851,0.97088,14592,politics,psychoticreddit,13940,652,14592,2919,2775,74,59,11,linear


It's not really meaningful to have classifiers where either board1 or board2 contributes only a handful of posts (for example, "rage", with 9), so let's just nix those. We could of course go back and select better train/test sets based on the composition of the groups, but a rough split (20% of the entire N for test, 80% for train) is a reasonable start.

In [7]:
# Keep only pairs where each board has more than 200 posts
classifiers = classifiers[classifiers.board2N > 200]
classifiers = classifiers[classifiers.board1N > 200]
classifiers

Unnamed: 0,accuracy,N,board1,board2,board1N,board2N,N_train,N_test,TP,FP,TN,FN,kernel
837,0.962517,14538,politics,Showerthoughts,13940,598,14538,2908,2799,109,0,0,linear
850,0.976598,14313,politics,phobia,13940,373,14313,2863,2795,66,1,1,linear
826,0.97752,14677,politics,Alzheimers,13940,737,14677,2936,2794,51,76,15,linear
842,0.979965,14222,politics,cringe,13940,282,14222,2845,2788,57,0,0,linear
849,0.972028,14300,politics,niceguys,13940,360,14300,2860,2780,80,0,0,linear
858,0.971468,14369,politics,stress,13940,429,14369,2874,2778,80,14,2,linear
851,0.97088,14592,politics,psychoticreddit,13940,652,14592,2919,2775,74,59,11,linear
848,0.960665,18936,politics,narcolepsy,13940,4996,18936,3788,2774,77,865,72,linear
856,0.951133,15450,politics,science,13940,1510,15450,3090,2769,111,170,40,linear
843,0.960778,14656,politics,gaming,13940,716,14656,2932,2767,108,50,7,linear


Not all of these are meaningful. For example, politics is a huge board, and the accuracy looks falsely good because the classifier could mostly just call everything "politics". However, when we look at boards that are more equally matched in size, for example narcolepsy and OCD (row 735), the performance is pretty decent, which suggests that there is signal to distinguish between the two. I think that if we do a simple Venn diagram of the word embedding vectors closest to the support vectors vs. vectors in both classes, we could get insight into the words that were meaningful to the model. I also think this has interesting signal for better mapping/defining different phenotypes, and we should now discuss a specific strategy to go about this.
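The "call everything politics" problem can be checked directly from the confusion counts in the table. The sketch below assumes that board1 is the positive class (so FP counts board2 posts labeled as board1); the table does not state this explicitly. `confusion_metrics` is a hypothetical helper, and balanced accuracy is a metric swapped in here for illustration, not one reported in the original results.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, per-class recall and balanced accuracy from raw counts.

    Assumes board1 is the positive class, so sensitivity is recall on
    board1 and specificity is recall on board2.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Row 837 (politics vs. Showerthoughts): TP=2799, FP=109, TN=0, FN=0
metrics = confusion_metrics(tp=2799, fp=109, tn=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under that assumption, row 837 reproduces the reported accuracy of 0.962517, but its specificity is 0 and its balanced accuracy is 0.5, which is what a classifier that always answers "politics" would score. Sorting or filtering on a metric like this would surface the better-matched pairs more reliably than raw accuracy.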